T-Labs Series in Telecommunication Services
Series Editors:
Sebastian Möller, TU Berlin and Deutsche Telekom Laboratories, Berlin, Germany
Axel Küpper, TU Berlin and Deutsche Telekom Laboratories, Berlin, Germany
Alexander Raake, TU Berlin and Deutsche Telekom Laboratories, Berlin, Germany
For further volumes: http://www.springer.com/series/10013
Nicolas Côté
Integral and Diagnostic Intrusive Prediction of Speech Quality
Nicolas Côté
Centre Européen de Réalité Virtuelle
Université de Bretagne Occidentale
25 rue Claude Chappe
29280 Plouzané
France
e-mail: [email protected]
ISSN 2192-2810
e-ISSN 2192-2829
ISBN 978-3-642-18462-8
e-ISBN 978-3-642-18463-5
DOI 10.1007/978-3-642-18463-5
Springer Heidelberg Dordrecht London New York
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: eStudio Calamar, Berlin/Figueres
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To my partner, Natacha
Preface
The telephony network changed broadly during the last decades with the intensive introduction of Voice over Internet Protocol (VoIP) technology and third-generation mobile networks. These networks enable new transmission paradigms that affect the perceived quality of speech signals. The perceived characteristics of a speech signal transmitted by a VoIP network differ from the characteristics of a speech signal transmitted by a traditional analog telephony network: new impairments are introduced (e.g. packet losses), or traditional ones are exacerbated (e.g. echo). However, enhancements have also been introduced in VoIP and new mobile networks. For instance, WideBand (WB) and Super-WideBand (S-WB) transmissions both significantly improve speech quality. In this book, instrumental measurement methods for the perceived quality of transmitted speech are described. Such methods simulate the speech perception process employed by human subjects during auditory experiments. The measure standardized by the International Telecommunication Union (ITU), called "Wideband-Perceptual Evaluation of Speech Quality" or WB-PESQ, is not able to quantify all of these perceived characteristics on a unidimensional quality scale, the Mean Opinion Score (MOS) scale. Recent experimental studies have shown that subjects make use of several perceptual dimensions to judge the quality of speech signals. In order to represent the signal at a higher stage of perception, a new model, called "Diagnostic Instrumental Assessment of Listening quality" (DIAL), has been developed. It includes a perceptual and a judgment model, which together simulate the whole quality judgment process. Except for strong discontinuities, DIAL predicts very well the speech quality of different speech processing and transmission systems, and it outperforms the Wideband-Perceptual Evaluation of Speech Quality (WB-PESQ).
The work presented in this book was carried out at three different places: (i) the Quality and Usability Lab, Technical University of Berlin, Germany, (ii) the Research and Development department of France Télécom, Lannion, France, and (iii) the Ecole Nationale d'Ingénieurs de Brest, France. It would not have been possible without the help of many supporters. Especially, this work has been performed
within the framework of a collaboration between two companies, Orange (formerly France Télécom) and Deutsche Telekom AG. I would like to thank all persons who have given their support:
· the head of the Quality and Usability Lab, Prof. Dr.-Ing. Sebastian Möller, for his support at both professional and personal levels and his ongoing availability and guidance,
· Dr. Valérie Gautier-Turbin and Dr. Vincent Koehl for supporting my work and for their constructive review of the manuscript,
· Vincent Barriac, Prof. Dr.-Ing. Alexander Raake, Marcel Wältermann, and Dr.-Ing. Kirstin Scholz for many fruitful discussions,
· my friends and colleagues Rachid Guerchouche, Virginie Durin, Julien Faure, Laetitia Gros, Adrien Leman, Anne Battistello, Emilie Geissner-Koehl, Marie-Neige Garcia, Jens Ahrens, Matthias Geier, Lu Huo, Camille Dekeukelaere, Fabien Tencé, Matthieu Aubry, Nicolas Marion and Kristen Manac'h, for their assistance, advice, encouragement and the great research atmosphere,
· the managers Pierre Henry, Pierre Chevalier and Klaus-Jürgen Buss for enabling this collaboration,
· the secretaries, Irene Hube-Achter and Nicole Bouteiller, for their invaluable support in all administrative (or not) steps,
· the laboratory assistants, Ulrike Stiefelhagen and Martine Apperry, for their technical support in running numerous auditory tests,
· Dr. Christophe d'Alessandro for reviewing the manuscript and providing important corrections,
· all colleagues within the ITU-T Study Group 12, Catherine Quinquis, Akira Takahashi, Ludovic Malfait and especially the winners of the ITU-T POLQA project, Dr. John Beerends, Dr.-Ing. Christian Schmidmer and Dr.-Ing. Jens Berger,
· Jennifer Lefebvre and Marie-Paul Friocourt for proofreading draft material and providing comments and suggestions,
· my music bands, Dimopulos Quartet and Century,
· my mother, Anne-Marie Wascat, for her encouragement,
· my partner, Natacha Lefebvre, for her understanding and ongoing support throughout these years. This work would not have been possible without you.
To all students working toward a doctoral degree: don't drop out, you will pass!
Brest, France, December 15, 2010
Nicolas Côté
Contents
Acronyms ..... xv

Introduction ..... 1

1 Speech Quality in Telecommunications ..... 5
  1.1 Speech ..... 5
    1.1.1 Speech production ..... 5
    1.1.2 Auditory system ..... 7
    1.1.3 Speech perception ..... 9
  1.2 Speech quality ..... 12
    1.2.1 Definition of the integral speech quality ..... 12
    1.2.2 Quality elements and quality features ..... 15
    1.2.3 Quality of Service (QoS) ..... 16
  1.3 Transmission systems ..... 18
    1.3.1 Telephone scenario ..... 19
    1.3.2 Telephony networks ..... 19
    1.3.3 User interfaces ..... 21
    1.3.4 Speech codecs ..... 22
    1.3.5 Speech enhancement ..... 26
  1.4 Perceptual quality space ..... 27
    1.4.1 Definition ..... 27
    1.4.2 Example studies ..... 28
  1.5 Summary and conclusions ..... 34

2 Speech Quality Measurement Methods ..... 37
  2.1 Definitions ..... 37
  2.2 Auditory methods ..... 40
    2.2.1 Test subjects ..... 42
    2.2.2 Speech material ..... 44
    2.2.3 Utilitarian methods ..... 44
    2.2.4 Analytical methods ..... 49
    2.2.5 Relativity of subjects' judgments ..... 52
    2.2.6 Reference conditions and normalization procedures ..... 56
  2.3 Instrumental methods ..... 58
    2.3.1 Parameter-based models ..... 60
    2.3.2 Signal-based models ..... 64
    2.3.3 Packet-layer models ..... 84
  2.4 Summary and Conclusion ..... 84

3 Optimization and Application of Integral Quality Estimation Models ..... 87
  3.1 Optimization of an intrusive quality model, the WB-PESQ ..... 88
    3.1.1 WB-PESQ analysis ..... 88
    3.1.2 The Modified WB-PESQ ..... 99
  3.2 Application of WideBand intrusive quality models ..... 104
    3.2.1 WideBand version of the E-model ..... 105
    3.2.2 Methodology for the derivation of Ie,WB and Bpl,WB ..... 110
    3.2.3 Derivation from WideBand intrusive models ..... 114
  3.3 Summary and Conclusion ..... 130

4 Diagnostic Instrumental Speech Quality Model ..... 133
  4.1 Scope ..... 134
  4.2 Overview of the DIAL model ..... 135
  4.3 Pre-processing ..... 137
    4.3.1 Active Speech Level (ASL) ..... 137
    4.3.2 Time-alignment ..... 138
    4.3.3 Modeling of receiving terminals ..... 149
  4.4 Core model ..... 150
    4.4.1 Perceptual transformation ..... 151
    4.4.2 Partial compensation of the transmission system ..... 153
    4.4.3 Calculation of the loudness densities ..... 157
    4.4.4 Short-term degradation measure ..... 159
    4.4.5 Integration of the degradation over the time scale ..... 162
    4.4.6 Computation of the core model quality score ..... 163
  4.5 Dimension estimators ..... 164
    4.5.1 Coloration ..... 164
    4.5.2 Loudness ..... 168
    4.5.3 Discontinuity ..... 172
    4.5.4 Noisiness ..... 178
  4.6 Judgment model ..... 182
    4.6.1 Linear regression analysis ..... 182
    4.6.2 k-Nearest Neighbors ..... 183
  4.7 Summary and Conclusion ..... 186

5 Evaluation of DIAL ..... 187
  5.1 Experimental set-up ..... 187
    5.1.1 Reference instrumental measures ..... 187
    5.1.2 Databases ..... 190
    5.1.3 Statistical measures ..... 192
  5.2 Performance results ..... 197
    5.2.1 Integral quality ..... 197
    5.2.2 Perceptual dimensions ..... 204
  5.3 Summary and Conclusion ..... 209

6 Conclusions and Outlooks ..... 213
  6.1 Conclusions ..... 213
  6.2 Outlook ..... 216

A Modulated Noise Reference Unit (MNRU) ..... 219

B Databases test-plan ..... 221
  B.1 Narrow-Band databases ..... 221
  B.2 WideBand and Super-WideBand databases ..... 224
  B.3 Perceptual dimension databases ..... 228

References ..... 229

Index ..... 245
Acronyms
ACELP  Algebraic Code-Excited Linear Prediction
ACR  Absolute Category Rating
AEC  Acoustic Echo Cancellation
AGC  Automatic Gain Control
AI  Articulation Index
AMR  Adaptive Multi-Rate
ANIQUE  Auditory Non-Intrusive QUality Estimation
ASD  Auditory Spectrum Distance
ASL  Active Speech Level
ASLD  Asymmetric Specific Loudness Difference
ATM  Asynchronous Transfer Mode
BER  Bit Error Rate
BSD  Bark Spectral Distortion
CATNAP  Computer-Aided Telephone Network Assessment Program
CCI  Call Clarity Index
CCITT  Comité Consultatif International Téléphonique et Télégraphique (International Telegraph and Telephone Consultative Committee)
CCR  Comparison Category Rating
CD  Cepstral Distance
CDMA  Code Division Multiple Access
CELP  Code-Excited Linear Prediction
CHF  CoHerence Function
CNG  Comfort Noise Generation
CoS  Class of Service
DAM  Diagnostic Acceptability Measure
DCR  Degradation Category Rating
DCT  Discrete Cosine Transform
DECT  Digital Enhanced Cordless Telecommunications
DFT  Discrete Fourier Transform
DFC  Directness/Frequency Content
DIAL  Diagnostic Instrumental Assessment of Listening quality
DTSQE  Deutsche Telekom—Speech Quality Estimation
EMBSD  Enhanced Modified Bark Spectral Distortion
EPR  Expert Pattern Recognition
ERB  Equivalent Rectangular Bandwidth
ETSI  European Telecommunications Standards Institute
EVRC  Enhanced Variable Rate Codec
F0  fundamental frequency
FB  Full-Band
FER  Frame Error Rate
FIR  Finite Impulse Response
fS  sampling frequency
GSM  Global System for Mobile communication
HATS  Head-And-Torso Simulator
HFT  Hands-Free Terminal
II  Information Index
INMD  In-service Non-intrusive Measurement Device
IP  Internet Protocol
IRS  Intermediate Reference System
IS  Itakura–Saito
ISDN  Integrated Services Digital Network
ISO  International Organization for Standardization
ITU  International Telecommunication Union
ITU-R  Radiocommunication sector of the ITU
ITU-T  Telecommunication standardization sector of the ITU
k-NN  k-Nearest Neighbors
LLR  Log Likelihood Ratio
LOT  Listening-Only Test
LPC  Linear Predictive Coding
LR  Loudness Rating
LSP  Line Spectrum Pair
LTL  Long-Term Loudness
LTM  Long-Term Memory
MDS  Multi-Dimensional Scaling
MIPS  Million Instructions Per Second
MLT  Modulated Lapped Transform
MNB  Measuring Normalizing Blocks
MNRU  Modulated Noise Reference Unit
MOS  Mean Opinion Score
MPEG  Moving Picture Experts Group
MUSHRA  MUlti Stimulus test with Hidden Reference and Anchor
NB  Narrow-Band
NGN  Next Generation Network
NR  Noise Reduction
NEC  Network Echo Cancellation
NMR  Noise-to-Masking Ratio
OPINE  Overall Performance Index model for Network Evaluation
P.AAM  Project—Acoustic Assessment Model
PACE  Perceptual Ascom Class Enhanced
P.AMD  Perceptual Approaches for Multi-Dimensional Analysis
PAMS  Perceptual Analysis Measurement System
PAQM  Perceptual Audio Quality Measure
PCA  Principal Component Analysis
PCM  Pulse Code Modulation
PEMO-Q  PErception MOdel—Quality assessment
PESQ  Perceptual Evaluation of Speech Quality
PLC  Packet Loss Concealment
PLP  Perceptual Linear Prediction
POLQA  Perceptual Objective Listening Quality Analysis
POTS  Plain Old Telephone System
PSD  Power Spectral Density
PSQM  Perceptual Speech Quality Measure
PSTN  Public Switched Telephone Network
QoE  Quality of Experience
QoS  Quality of Service
RA  Relative Approach
RCELP  Relaxed Code-Excited Linear Prediction
RMS  Root Mean Square
RTP  Real-time Transport Protocol
SD  Semantic Differential
SIRP  Spherically Invariant Random Process
SNR  Signal-to-Noise Ratio
STM  Short-Term Memory
STL  Short-Term Loudness
S-WB  Super-WideBand
TDAC  Time-Domain Aliasing Cancellation
TDBWE  Time-Domain Bandwidth Extension
THD  Total Harmonic Distortion
TOSQA  Telecommunication Objective Speech-Quality Assessment
TR  Transmission Rating
UDP  User Datagram Protocol
UMTS  Universal Mobile Telecommunications System
VAD  Voice Activity Detector
VoIP  Voice over Internet Protocol
VQE  Voice Quality Enhancement
VQI  Voice Quality Index
WB  WideBand
WB-PESQ  Wideband-Perceptual Evaluation of Speech Quality
WSS  Weighted Spectral Slope
Introduction
Telecommunication systems using technologies related to the production and the perception of human speech are widely employed. Examples of such systems are speech transmission, speech recognition, speech synthesis and speaker identification. The quality of the processed speech signal is a key factor in these systems. The present book focuses on the quality of speech transmission systems. For this purpose, the perceived quality of transmitted speech is assessed by either human subjects or instruments.
Over the last three decades, speech transmission systems have changed broadly. In the literature, this transformation is referred to as the "digital revolution". The first main transition was the introduction of digital transmission. For instance, the Integrated Services Digital Network (ISDN) was introduced to replace analog transmission on the landline telephony network. Then, mobile telephony systems, such as the Global System for Mobile communication (GSM) network, were broadly deployed all over the world. Both mobile and landline telephony networks are based on "circuit-switched" networks: the two interlocutors are connected by a physical circuit, which is held for the whole call. In addition to landline and mobile networks, computer networks such as the Internet have been introduced. These computer networks, also known as "packet-switched" networks, are based on a discontinuous transmission of data packets. Telecommunication providers have adapted packet-switched networks to the transmission of speech data using Voice over Internet Protocol (VoIP) transmissions. In the near future, VoIP transmissions will represent the main speech service among all types of speech transmission networks. Even though important advances were made in telephony-related technology in the last three decades, the next decade will probably see a further growth of communication services including new possibilities. According to the International Telecommunication Union (ITU), by the end of 2006 there were nearly 4 billion mobile and landline subscribers worldwide. This included 1.27 billion landline subscribers and 2.68 billion mobile subscribers (61% of which were located in developing countries) as well as 1.13 billion Internet users (ITU–D TTR, 2007).
However, these three speech services (i.e. landline, mobile and VoIP transmissions) are highly different in nature. A circuit-switched network has a fixed framework, and the introduction of new processing systems is hardly possible. The users' perception of landline circuit-switched networks was studied in the past, see for instance Möller (2000). These studies rely on results of auditory experiments and were aimed at defining in detail the different degradations introduced by such networks. These degradations are continuous over the whole conversation call. Mobile and VoIP networks allow flexibility and the introduction of new speech processing systems. For instance, VoIP networks enable WideBand (WB) speech transmission, a significant improvement over circuit-switched networks. In addition, the GSM network allows a greater mobility for the user. From a user's point of view, these new networks seem to introduce mainly advantages. However, they also introduce new types of degradation. For instance, the talker's voice may sound discontinuous when packets are lost during the transmission. In addition, information related to the talker's surrounding environment is more prominent in a mobile network than in a landline network. An interconnection between the networks may even combine different degradations.
Speech quality assessment covers different branches of study such as psychoacoustics, signal processing and cognitive sciences. In the telecommunication industry, speech quality is mainly used as a method for the evaluation of products using speech technology. The perceived quality of speech transmitted by telephony networks was studied over the last decades. The factors that influence speech quality were defined, and these studies showed that speech quality is multidimensional in nature. A perceptual feature space in which all networks can be compared has been derived. However, for this purpose numerous auditory tests had to be carried out, and such auditory tests are unfortunately time-consuming and costly. Still, there is an important need for telecommunication providers to measure the quality of speech transmission systems. Such measurement methods can be helpful in two different cases: (i) during the planning and development phases of new speech processing systems (e.g. speech codecs), and (ii) for the monitoring of in-use telephony networks. Developers of voice processing systems and communication equipment as well as speech service providers are interested in such techniques.
The present book focuses on the instrumental measurement of the quality of transmitted speech. Information about the transmission system is provided by simple instrumentally measurable parameters such as the transmitted bandwidth, the Signal-to-Noise Ratio (SNR) or the percentage of lost packets. However, the interest is mainly in the quality as perceived by customers. The simulation of human behavior is a relatively difficult task. Complex measurement methods have been developed to simulate a human subject in an auditory test, e.g. PSQM by Beerends and Stemerdink (1994) or PESQ described in ITU–T Rec. P.862 (2001). They analyze the speech signal through use of a model of the human peripheral auditory system. Such methods have been developed for circuit-switched networks or packet-switched networks. However, they show some limits in the comparison of different networks. In-use
methods, such as PESQ, were developed several years ago and are not adapted to the new techniques used in speech processing systems. The focus here is on the development of a generic instrumental measurement method able to assess all in-use networks and speech processing systems, including their possible combinations.
In Chap. 1, the perceived quality of transmitted speech is defined in detail. The physical elements that appear in telephony networks are described in Sec. 1.3. The auditory and instrumental measurement methods are exhaustively described in Chap. 2. Then, the second part of this book consists of the following scientific contributions:
• In Chap. 3, the available existing instrumental measurement methods, including the current ITU-T standard called WB-PESQ (ITU–T Rec. P.862.2, 2005), are evaluated on WB transmissions. Based on detected discrepancies between WB-PESQ estimations and auditory quality values, the algorithm was improved, resulting in a Modified WB-PESQ. This modified version shows high performance in the prediction of the perceived quality of various WB low bit-rate speech codecs. A set of three WB instrumental methods, including the Modified WB-PESQ, was used to derive two parameters describing the quality of WB speech codecs. The first one, called the WB equipment impairment factor Ie,WB, quantifies the degradation introduced by a coding/decoding process. The second parameter, called the packet-loss robustness factor Bpl,WB, describes the robustness of the speech codec's PLC algorithm against transmission errors (a small numerical sketch of how these two parameters typically interact follows this list). By using a specific normalization procedure, the impact of the experimental context on the estimated quality scores is attenuated. The resulting parameters are thus considered as "absolute". It follows from the estimated Ie,WB and Bpl,WB values that current WB instrumental methods are unable to quantify the degradation of all WB speech codecs, especially when comparing codecs based on different coding techniques. These estimation errors increase in the case of transmission errors.
• A new instrumental model, called DIAL, is described in Chap. 4. It estimates an integral speech quality and decomposes the degradation introduced by the system under study into several quality dimensions. In order to represent the signal at a higher stage of perception, the new intrusive model includes a specific cognitive model that simulates the whole speech perception process. For this purpose, first the similarity between the transmitted and the corresponding clean speech signals is quantified by an instrumental measurement method. Then, the impairment is decomposed into four quality dimension values: Directness/Frequency Content, Continuity, Noisiness and Loudness.
• The new instrumental model, DIAL, is exhaustively evaluated in Chap. 5 through comparisons to several reference models including those standardized by the Telecommunication standardization sector of the ITU (ITU-T). Except for strong discontinuities, DIAL proved to predict very well the speech quality of different speech processing and transmission systems. Overall, it outperforms the current ITU-T standard WB-PESQ and its improved version, the Modified WB-PESQ.
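As a point of reference for the first contribution above, a minimal sketch of how an equipment impairment factor and a packet-loss robustness factor are typically combined is given below. It follows the form of the effective equipment impairment factor in the E-model (ITU-T Rec. G.107 and its WB extension); the function name and the numerical values are illustrative assumptions, not values derived in this book.

```python
def ie_eff_wb(ie_wb, bpl_wb, ppl, burst_r=1.0):
    """Effective wideband equipment impairment factor under packet loss.

    Follows the form used in the E-model (ITU-T Rec. G.107 / G.107.1):
    the higher the robustness factor bpl_wb of the codec's PLC, the more
    slowly the impairment grows with the packet-loss percentage ppl.
    burst_r = 1.0 corresponds to random (independent) packet losses.
    """
    return ie_wb + (95.0 - ie_wb) * ppl / (ppl / burst_r + bpl_wb)

# Hypothetical codec with Ie,WB = 10 at 5% random packet loss:
print(ie_eff_wb(10.0, bpl_wb=20.0, ppl=5.0))  # robust PLC  -> 27.0
print(ie_eff_wb(10.0, bpl_wb=5.0, ppl=5.0))   # fragile PLC -> 52.5
```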
Chapter 1
Speech Quality in Telecommunications
As this book deals with the measurement, analysis and prediction of speech transmission quality as perceived by a listener, this chapter describes the perceptual characteristics of speech production and perception. This exhaustive description leads to the definition of perceived quality.
1.1 Speech

Human speech production and "perception" (or perceptual analysis) are two complex phenomena studied in many scientific areas such as psychology, acoustics and linguistics. These phenomena are introduced in this section thanks to some basic concepts of psychoacoustics and cognition. Following the time axis, the first step corresponds to the production of speech sounds (see Sec. 1.1.1). Then, after air propagation or transmission via a communication system (see Sec. 1.3), the speech signal is perceived by the distant listener. The hearing process, which involves the peripheral auditory system, is defined in Sec. 1.1.2. The resulting auditory signal is analyzed by the human brain in order to extract the main information, the semantic message produced by the distant speaker in order to communicate. Several cognitive processes used in speech communication are briefly introduced in Sec. 1.1.3.
1.1.1 Speech production

The first step in speech communication corresponds to the production of intelligible sounds. The generation of speech sounds is composed of two successive steps.
• First, a pressure wave is provided by the lungs. In normal breathing, the vocal folds (glottis) are far enough apart to avoid speech sound generation. In order to produce speech sounds, the pressure wave is interrupted by the quasi-periodic opening and closing of the vocal folds. The rate of this vibration corresponds to the
fundamental frequency (F0) of the human voice, also known as pitch and expressed in Hertz (Hz). F0 depends on the size of the vocal folds. Therefore, a large difference in average F0 exists among males (100 Hz), females (200 Hz) and children (300 Hz), see Shaughnessy (2000).
• In a second step, the pressure wave composed of F0 and its many harmonics is transmitted through the vocal tract composed of the pharyngeal, oral and nasal cavities. Speech sounds are generated by the vocal tract resonances, which introduce several peaks and dips of energy in the frequency spectrum of the pressure wave. The peaks concentrate most of the sound energy in a few formants approximately equivalent across talkers (abbreviated Fi, e.g. F1 corresponds to the formant with the lowest frequency). The position of the tongue, lips, velum and larynx (called articulators) has a strong effect on the path of the sound wave through the vocal tract. The articulators modulate the periodicity (voiced or unvoiced), frequency spectrum and duration of the speech sounds.
The basic building blocks in human speech production are called phones. They correspond to the abstract linguistic units referred to as phonemes. Phones can be divided into two groups: consonant sounds and vowel sounds (IPA Handbook, 1999). Vowel sounds (e.g. /i/, /a/, . . . ) are characterized by a constant airflow through the vocal tract and they are defined by the frequency position of the formants. On the contrary, consonant sounds are created by a discontinuous burst of airflow (e.g. plosive and fricative consonants). Formants are also important for consonant sounds, but for such phones they often vary with time. Even though more than 100 basic phones have been identified over all languages, they represent a small part of all the sounds the human voice can produce. In this respect, the phones can be seen as generic building blocks with common characteristics, regardless of the physical attributes of the talker.
At the output of the mouth, the bandwidth produced by human speech is usually defined over the range 100–7 000 Hz (Deng and O'Shaughnessy, 2003). This bandwidth is highly dependent on the produced phones. For instance, vowels concentrate most of the energy in their first three formants (i.e. in the range 300–3 000 Hz) whereas fricative consonants such as /f/ and /v/ (called labiodental) have little energy below 8 000 Hz. However, in addition to the above defined attributes, speech is a composite of intelligible sounds and other indicators such as whistles, clicks, hisses, buzzes and hums (Pardo and Remez, 2006). These indicators are specific to each talker and are used as "recognizer tools" by the listeners. As a result, the produced bandwidth may increase to the range 50–14 000 Hz. For an overview of speech production, see Deng and O'Shaughnessy (2003), or Honda (2008).
1.1.2 Auditory system

After air propagation or transmission via a speech transmission system, the speech signal reaches the listener's ear. The acoustic event at the listener's ear then produces an "auditory event" in his sensory system (Blauert, 1997). The hearing system includes several psycho-physical attributes that influence the perception of the speech signal and emphasize the intelligible information of the acoustic signal. The auditory perception process is studied in a common branch of psychology and acoustics called psychoacoustics. Several basic laws of psychoacoustics will be introduced in this section. The resulting auditory signal, once processed by the hearing system, is transmitted to the human brain, which extracts the meaning thanks to cognitive processes. For an overview of the basic laws of psychoacoustics, see Zwicker and Fastl (1990).
1.1.2.1 Physiological description

The human peripheral auditory system is composed of three different parts where the acoustic sound is transformed into a signal that the brain is able to process (see Fig. 1.1).

External ear
The acoustic wave is collected by the pinna, which acts as a funnel in order to send the sound wave down to the ear canal. Then, the acoustic wave is transmitted through the ear canal to the receiving membrane of the eardrum, also known as the tympanic membrane. Due to its physiological characteristics, the resonance of the ear canal introduces an amplification of the frequencies around 2 000–3 000 Hz (i.e. third formant), which increases the intelligibility of the speech signal.
Fig. 1.1 Anatomy of the human outer, middle and inner ear and frequency mapping in the cochlea, redrawn from Purves et al. (2004)
Middle ear
The acoustic wave sets up vibrations at the eardrum, which are transmitted through the ossicles (malleus, incus, and stapes) to the oval window. The ossicles work as an impedance-matching system by creating a mechanical wave from the previous acoustic wave.

Inner ear
The cochlea, in the inner ear, is formed of three fluid-filled, parallel and snail-shaped spaces (scalae) separated by the basilar and the Reissner's membranes. The cochlea contains thousands of hair cells, which convert the mechanical wave into an electrical wave that is then transmitted to the auditory cortex via the cochlear nerve fibres.
1.1.2.2 Psychoacoustics

The cochlea applies a spectral analysis which determines the frequency components of the sound wave. The area of the basilar membrane excited by high frequencies (f ≈ 20 000 Hz) is located closer to the oval window (i.e. the front of the cochlea), and the area excited by low frequencies (f ≈ 20 Hz) is located closer to the apex (i.e. the end of the cochlea). This sensitivity of the basilar membrane to frequency components is referred to as a tonotopic organization. Observations have led to a description of this organization by an array of overlapping bandpass filters called critical-band rates (Fletcher, 1940). However, the relation between the location of excitation and the frequency cannot be fully described mathematically. This description introduced by Fletcher (1940) assumes that sounds are preprocessed by auditory filters with specific characteristics: (i) their spacing corresponds to 1.5 mm steps along the basilar membrane and (ii) their bandwidths increase with frequency. Several models of the peripheral human auditory system have been developed; a widely used example was proposed by Zwicker et al. (1957). This model is used to calculate the loudness of steady sounds. It gives a nonlinear relationship between the frequency and the critical-band rate, measured in Bark. In a normal human ear, the basilar membrane reacts to frequencies ranging from 20 to 20 000 Hz, corresponding to critical-band rates in the range 0–24 Bark:

    z_B = 13 \arctan(0.76 f) + 3.5 \arctan\left( \left( \frac{f}{7.5} \right)^2 \right) .   (1.1)

Here, f is the frequency in kHz and z_B is the critical-band rate in Bark. A second mathematical model developed more recently by Moore et al. (1997) introduced a slightly different concept where the critical-band rates are measured in Equivalent Rectangular Bandwidth (ERB) in the range 0–40 ERB:

    z_{ERB} = 21.4 \log_{10}(4.37 f + 1) ,   (1.2)
where f is the frequency in kHz and z_{ERB} is the critical-band rate in ERB. However, the hearing sensitivity is not constant over the whole Bark scale. This sensitivity declines for very high and very low frequencies, with maximum sensitivity between 1 000 and 5 000 Hz. The hearing system involves several psychoacoustic phenomena, such as temporal and spectral masking effects and loudness compression. These phenomena are described in detail in Sec. 4.4.1. Then, the excitation pattern associated with the speech stimuli is processed at higher levels in the human auditory cortex, see Young (2008).
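For readers who want to experiment with these two frequency warpings, a minimal sketch implementing Eqs. (1.1) and (1.2) directly is given below; the language and function names are choices made here, not part of the original text.

```python
import math

def bark_rate(f_khz):
    """Critical-band rate z_B in Bark (Zwicker et al.), Eq. (1.1)."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def erb_rate(f_khz):
    """Critical-band rate z_ERB in ERB (Moore et al., 1997), Eq. (1.2)."""
    return 21.4 * math.log10(4.37 * f_khz + 1.0)

# The audible range 20 Hz - 20 kHz maps to roughly 0-24 Bark and 0-40 ERB.
for f in (0.02, 1.0, 3.4, 8.0, 20.0):
    print(f"{f:6.2f} kHz -> {bark_rate(f):5.1f} Bark, {erb_rate(f):5.1f} ERB")
```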
1.1.3 Speech perception

Language corresponds to the primary method of communication for humans. For talkers and listeners, this communication system involves speech signs. A speech sign is a representation of the information the speaker wishes to convey to the listener. During a face-to-face conversation, the speech sign is carried by an acoustic wave. In such conversations, the articulatory movements of the talker's face provide additional information to the listener. During a telephone conversation, which is another example of a conversation system, the face-to-face situation is replaced by a human-machine interaction with limited capabilities (single modality) and inherent degradations compared to the initial situation.
A speech stimulus does not "mean" something by itself. However, since it carries information to the listener (Jekosch, 2005), the speech signal may become a sign when the listener makes use of it (i.e. by associating a "meaning" to it). Following the theory of Ogden and Richards (1923), the cognitive process is represented by a triangular diagram called the semiotic triangle in Fig. 1.2. This triangle is composed of three constituents representing the relationship between the sign and its related meaning. The listener extracts the meaning (i.e. the sense made of the sign by its interpreter) thanks to both a sign carrier and a referent. The sign carrier is the form in which the speech signal is represented (i.e. an acoustic signal at the listener's ear). The referent is the object the sign stands for (i.e. an abstract object).

Fig. 1.2 Semiotic triangle according to Ogden and Richards (1923), applied here to speech perception
1.1.3.1 Auditory memory

An important element in speech communication corresponds to the auditory memory (Baddeley, 2003), which is able to store information presented orally. For an overview of the auditory memory, see Demany and Semal (2008).

Echoic memory
Just after the process in the peripheral auditory system, the first element of the auditory memory is the echoic memory (which refers to the acoustic phenomena). A brief mental echo of the speech signal leaves an auditory trace in the echoic memory for 150–300 ms. The analysis of the linguistic properties of the speech signal (i.e. phonetic categorization) does not appear directly in the echoic memory. Only the important acoustic attributes of the speech signal (e.g. the talker's gender) are stored.

Short-term memory
The phonetic categorization appears when the auditory signal is transferred to the Short-Term Memory (STM). However, for sounds which are not directly analyzed in the STM, the acoustic attributes may be stored in the echoic memory for up to 3–4 seconds (e.g. nonspeech sounds). Among the different views on human auditory memory, one theory detailed by Baddeley and Hitch (1974) described the STM by a model called working memory. The working memory consists of the temporary storage of information, which is manipulated with complex cognitive processes. Regarding speech perception, the most important component of the working memory model corresponds to the phonological loop, which is composed of two parts:
• The phonological store (in the short-term memory), in which an auditory memory trace (i.e. after the phonetic process) fades over a few seconds.
• The articulatory rehearsal component, which refreshes the phonological store with the current auditory memory trace.
Long-term memory
A third part of the memory corresponds to the Long-Term Memory (LTM): elements are stored in this durable memory for periods from a few days up to several decades. Numerous perception processes make use of the auditory LTM, such as the recognition of musical instruments, the identification and localization of sound sources, and the identification of spoken words (Demany and Semal, 2008).
1.1.3.2 Comprehension

The acoustic form of the speech signal varies widely between talkers. However, this acoustic signal is processed by the listener's hearing and cognitive system. The latter enables the listener to recognize the statistical characteristics and redundancy in the auditory traces. These characteristics are hierarchical, and the smallest constituents, called phonemes, were described in Sec. 1.1.1. The phonetic categorization gives the linguistic information of the speech signal: a syllable is a sequence of consonant and vowel phonemes, a word is a composition of syllables, and a phrase consists of several words. The recognition process is associated with vocabulary learning. This learning process begins with a small number of phonemes used in the language of our parents (Jekosch, 2008) and continues until the acquisition of the basic knowledge used in speech communication.
Jekosch (2005) defines comprehension as the result of the perceptual analysis of the speech signal. However, speech comprehension corresponds to the last stage in the speech perception process (Möller, 2000; Raake, 2006b). This process includes four successive stages, briefly described hereafter:
1. Comprehensibility: The ability of the acoustic speech signal (i.e. sign carrier, see Fig. 1.2) to convey phonemic information. A high comprehensibility means a perfect recognition of each phoneme from the speech signal.
2. Intelligibility: The ability to extract the content of the speech signal (i.e. referent, see Fig. 1.2) from the recognized phonemes.
3. Communicability: The ability to understand the speech signal as it was intended by the talker (i.e. meaning, see Fig. 1.2).
4. Comprehension: The result of the speech perceptual process. Comprehension (i) achieves a high level of communication efficiency, and (ii) requires that the listener be both prepared and willing to understand the speech message.
As described by Lotto and Sullivan (2008), the phonetic process is "relative" to the temporal context. A local context, i.e. the surrounding phonemes (ranging from 10 to 100 ms), influences the comprehensibility of the phonemes. A global context influences the comprehension of the speech message; it includes the previous sentence(s), i.e. up to several seconds. Here, the context comprises pragmatic and prosodic aspects. In addition, the listener's knowledge about linguistics or about the talker has a strong influence on the comprehension of the message.
1.1.3.3 Talker characteristics

In addition to the comprehension process described in the previous section, the speech perception process enables one to extract several contextual characteristics from the acoustic speech signal. For instance, speaker localization is an important task for the isolation of a talker in a noisy environment or in a multi-talker situation. This aspect of speech perception is referred to as the cocktail party effect (see a review by Bronkhorst, 2000). Moreover, the listener can recognize many specific personal characteristics of the speaker from the acoustic signal, called indexical information (Nygaard and Pisoni, 1998). Two different groups of indexical information exist. The first group corresponds to characteristics identifying the speaker (including their gender). Following the speech production schema described in Sec. 1.1.1, several acoustic attributes corresponding to anatomical specificities of the talker, such as F0, hardly change over a long period of time. These acoustic attributes are used by listeners to identify talkers. In addition to these relatively time-constant properties for a given talker, the second group of indexical information reflects the situational context (e.g. goals, attitudes, emotions) (Pardo and Remez, 2006).
1.2 Speech quality

In the previous section, several characteristics of the speech perception process were defined, including linguistic and contextual information. Even though during a telephone conversation the intelligibility of a speech message produced by a distant talker is almost perfect, the user's perception of this particular situation is extremely different from a natural face-to-face interaction. In only a few extreme cases, such as strong discontinuities, the impairment may result in a relatively low intelligibility. Consequently, intelligibility is a necessary, but not sufficient, feature to quantify the listener's perception of a transmitted speech sign (Volberg et al., 2006). For this purpose, the generic attribute, quality, is used.
1.2.1 Definition of the integral speech quality
According to Jekosch (2005), quality is:
[the] result of [the] judgment of the perceived composition of an entity with respect to its desired composition.
Here, the perceived composition is:
[the] totality of features of an entity.
and the desired composition is:
[the] totality of features of individual expectations and/or relevant demands and/or social requirements.
and a feature is:
[a] recognizable and nameable characteristic of an entity.
Fig. 1.3 Speech quality judgment process as seen from the listener’s perspective, according to Raake (2006b) and based on Jekosch (2005). Ellipses represent processes, italicized names refer to intermediate events and rectangles to the inputs and outputs by the listener
In the specific case of speech quality, the entity corresponds to an acoustic speech signal. According to Raake (2006b) and from ideas developed by Jekosch (2005), the whole speech quality judgment process can be decomposed, on a time scale, into five successive steps (see Fig. 1.3):
Perception
The acoustic speech signal is perceived by the listener and results in a perceived auditory composition. The auditory composition includes all perceptual aspects such as the phonetic information and the characteristics of the talker's and listener's environments. Such heterogeneous information, which is not yet related to quality, implies a multidimensional organization of all the perceptual aspects. With that information, a listener can, e.g., distinguish two acoustic speech signals on the basis of their perceived aspects. Several characteristics of the listener such as his motivation, memory, knowledge, experience, and expectations influence the perception process (Pardo and Remez, 2006). In addition to these personal characteristics, the context (i.e. the listener's environment) in which the sound occurs also contributes to the perception process and, therefore, to the speech quality. Both types of characteristics form the modifying factors (Raake, 2006b), which adjust the desired auditory composition (or internal reference) to a particular listening situation. It follows that the same acoustic speech signal can produce different results when listened to by two different subjects (resp. two different desired compositions). The desired auditory composition does not correspond to the best quality achieved by a speech signal, but to an average of all of the previous interactions in which speech telecommunication systems were involved. These previous interactions are stored in the Long-Term Memory (LTM) as auditory memory traces. However, a new speech signal with a quality equivalent to this average will be considered as being of better quality (Duncanson, 1969). An attenuation factor seems to be applied to these old memory traces.
Reflection
The listener reflects on all of the signal characteristics that are relevant for quality, i.e. "names" each feature of the multidimensional space. These "nameable" features are related to quality. The perceived composition is, thus, defined by a set of values (one per feature), i.e. the perceived features: a precise position, q, in the multidimensional space. In parallel, the same "nameable" features are used to define a position, p, for the desired composition, i.e. the desired features.
Comparison
Any quantification of the quality of the acoustic speech signal requires a comparison of the desired and perceived features, q and p, i.e. their corresponding values for each “nameable” feature.
Judgment
The listener judges the quality through comparisons. The judgment process corresponds to an aggregation of all features into a single quality value, i.e. introducing a weighting coefficient for each feature in relation to its impact on the quality. The acoustic speech sample is, thus, of high quality only if the subject's perception and the desired composition are alike (i.e. similar values for the perceived and desired "nameable" features).

Description
The listener finds the best possible description of the perceived quality using the rating scale. If the scale is the 5-point listening-quality scale defined in ITU–T Rec. P.800 (1996) and described in Sec. 2.2.3.3, the listener chooses one of the five categories: bad, poor, fair, good or excellent.
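Since the MOS scale recurs throughout this book, the following minimal sketch shows how such category ratings are turned into a Mean Opinion Score, namely by mapping the five P.800 categories to the scores 1-5 and averaging over subjects; the ratings themselves are invented for illustration.

```python
# ACR listening-quality scale (ITU-T Rec. P.800): label -> score
ACR = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}

def mean_opinion_score(labels):
    """Average the category ratings of all subjects for one condition."""
    scores = [ACR[label] for label in labels]
    return sum(scores) / len(scores)

# Hypothetical ratings from eight subjects for one processed speech sample.
print(mean_opinion_score(
    ["good", "fair", "good", "excellent", "good", "fair", "good", "good"]
))  # 3.875
```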
According to the comprehension process introduced in Sec. 1.1.3.2, it becomes clear that a good comprehensibility and intelligibility of the transmitted speech is a prerequisite to high quality. Although all the features of speech quality are rarely taken into account in other definitions, a consensus exists between authors: speech quality is considered a "multidimensional" object. The user's comparison process can be represented in his brain by a few orthogonal "nameable" features, called perceptual dimensions (of speech quality). Following the taxonomy used by Möller (2000), the term integral quality is used when the user's judgment encloses all these perceptual dimensions. The term overall quality is frequently used as an equivalent to integral quality; however, they correspond to two slightly different concepts. The term overall quality frequently refers to the quality of an entire connection, from the talker's mouth to the listener's ear(s) (originally defined by the terms end-to-end quality or mouth-to-ear quality). The term overall quality can be linked to the physical characteristics of the system, whereas the integral quality is related to the perceptual characteristics of the speech signal.
1.2.2 Quality elements and quality features

The terms integral quality and overall quality represent two different points of view, which are associated with the concepts of quality features and quality elements, respectively. The perceived integral speech quality comprises quality features, i.e. perceived characteristics of the signal such as its loudness or intelligibility. The overall quality of a speech transmission system is defined by the effect of each building block (or quality element). A quality element is defined by Jekosch (2005) as:
a contribution to the quality
• of a material or immaterial product as the result of an action/activity or a process in one of the planning, execution or usage phases
• of an action or of a process as the result of an element in the course of this action or process.
Section 1.3 of this chapter gives an overview of the most important quality elements included in speech transmission systems, together with their impact on the integral quality and their quality features. Two different modeling approaches are described in the rest of this book. In Chap. 3, the impact of several quality features on the integral quality is quantified through use of instrumental measurement methods (see Sec. 2.3.2). The approach in Chap. 4 is the opposite one: the instrumental estimate of the integral speech quality is derived from a selection of the most important quality features.
1.2.3 Quality of Service (QoS)

The quality of a telecommunication service is formed by different factors, where speech quality, as defined in Sec. 1.2.1, corresponds to one specific factor. A general definition of quality provided by the International Organization for Standardization (ISO) is available in the ISO Standard 8402 (1994). Quality is:
the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs.
However, in the field of telecommunication, the term QoS is usually employed as an attribute that quantifies the acceptability of a user application. Following the International Telecommunication Union (ITU), the Quality of Service (QoS) is defined in ITU–T Rec. E.800 (2008) as:
the totality of characteristics of a telecommunication service that bear on its ability to satisfy stated and implied needs of the user of the service.
Consequently, the definition of Quality of Service is closely related to the evaluated service. In the present study, the considered services, referred to as Class of Service (CoS), correspond to voice services. As defined by Hardy (2003), a voice service corresponds to a near real-time voice interaction through a telecommunication system. Even if the integral speech quality is a factor determining the QoS, user satisfaction encompasses many different aspects. In addition to the previous definition of QoS, the Telecommunication standardization sector of the ITU (ITU-T) introduces in ITU–T Rec. E.802 (2007) a "universal" model where all aspects of QoS for telecommunication services are grouped into four categories, namely aesthetic, presentational, ethical and performance (see Table 1.1).
• Aesthetic is related to the ease of interactions between the user and the service (mainly the handset for a telephony service). Some examples of aesthetic criteria are ergonomic considerations, simplicity and functionality, as well as clarity of design.
• How the service is marketed or supplied is defined by the presentational aspects. Some examples of presentational aspects are user personalization and customization of bills.
• Ethical aspects refer to how the service is offered to the user. Examples of ethical aspects are conditions for cutting off services, services for the disabled and environmental impact.
• The performance category, which is the most suitable for telecommunication services, has its own model where seven quality criteria (e.g. accuracy, availability, reliability, ...) can be transposed to the well-understood eleven QoS performance parameters (resp. speech quality, availability of the call center, dropped-call ratio, ...). This performance model was first introduced by Richters and Dvorak (1988) as a quantitative evaluation of telecommunication service quality by customers. This evaluation is not restricted to the telephone service.
Table 1.1 Usage example of the ITU-T QoS "universal" model for a mobile telephony service (ITU–T Rec. E.802, 2007). Rows are functional elements; columns group the quality components and criteria (performance, aesthetic, presentational, ethical).

1) Hardware
   Aesthetic criteria: ergonomic design of handset, usability
   Ethical aspects: disposal and ecological aspects
2) Service usage
   Performance criteria: connection set-up and release, transmission quality, service availability
   Presentational aspects: customization of service features
   Ethical aspects: security features
3) Contract
   Performance criteria: supply time
   Presentational aspects: bill presentation quality
4) Customer relations
   Performance criteria: hotline availability, response time, complaint resolution
   Ethical aspects: disabling mobile set when reported stolen
The effects generated by the first three categories on the QoS are hardly quantifiable. However, they play an important role in user satisfaction. For instance, Vaalgamaa (2007) showed that, even if speech quality is an important factor for the success of a telecommunication system, this factor is only one among the QoS factors. For an example of an exhaustive QoS measurement on the Skype network, see Chen et al. (2006).
Möller (2000) developed a different framework, which includes a specific taxonomy in order to define the QoS of telephone services. The performance parameters
related to speech quality are used in both frameworks. One should note that Möller merged all speech communication quality features into three main speech communication aspects:
• Voice transmission quality (one way) comprises all of the factors that impact upon the quality of the speech signal in a listening situation.
• Conversation effectiveness comprises the voice transmission quality and the elements that affect the conversational capabilities of the system, called conversational quality (e.g. transmission delay, sidetone impairments, echo) (ITU–T Rec. P.800.1, 2006).
• Ease of communication refers to factors related to the conversation partners (e.g. background noise at the talker's side).
Examples of such factors for the three constituents are given in Sec. 1.3. In the literature, the term speech quality is frequently used for either voice transmission quality or speech communication quality. In the rest of this book, the main interest is the assessment of transmitted speech signals in a listening situation. Consequently, the term speech quality will refer to voice transmission quality. The QoS evaluates the degree of users' satisfaction. However, this definition does not take into account the variability in points of view. In addition, according to the framework defined by Möller (2000), the final stage of the QoS perception process corresponds to the acceptability of a service. This is not mentioned in the ITU-T QoS definition. As a result, the ITU-T has recently completed the QoS framework with an additional term, namely Quality of Experience (QoE) (ITU–T Rec. P.10/G.100 Amend. 2, 2008). Quality of Experience (QoE) is: the overall acceptability of an application or service, as perceived subjectively by the end-user.
The measurement methods introduced in Chap. 2 have been mainly developed to quantify the speech quality as perceived by the end-user. Consequently, they are not used to assess the QoS of voice services, but to estimate a given specific characteristic.
1.3 Transmission systems

The goal of any telecommunication system is the transmission of signals across a network. In the case of voice signals, Hardy (2003) referred to this transmission as voice services. Two types of voice services exist: (i) the communication services, which imply a conversation between a talker and a listener (or several listeners in teleconferencing systems) in a near "real-time" manner, and (ii) the streaming services (e.g. a recorded message stored on a device). In voice services, the air path between two persons having an oral interaction is replaced by a speech transmission system. This section defines the elements composing a speech transmission system and their impact on the communication quality.

Fig. 1.4 Elements composing a speech transmission system. A/D refers to analog-to-digital conversion, AEC and NEC to echo compensation, CNG to Comfort Noise Generation, PLC to Packet-Loss Concealment, VAD to Voice Activity Detection and NR to Noise Reduction
1.3.1 Telephone scenario

To give an overview of speech transmission systems, a typical example of a telephone scenario is described in this section and depicted in Fig. 1.4. Firstly, a telephone user talks and produces an acoustic signal, x(t). This signal is received by the microphone of the talker's handset. However, this handset also receives all the different acoustic signals, n(t), produced by the sound sources surrounding the telephone user. The microphone converts the acoustic signal into an electrical signal, which is digitized (i.e. sampled and quantized into x(k), where k is the sample index) and pre-processed in order to remove the undesired signals (i.e. background noise and echo). Then, this processed signal is compressed and sent to the network. During the transmission to the handset of the interlocutor, the signal passes through several gateways and nodes. At the listener's side, a continuous electrical signal is synthesized with the help of several digital "post-"processing algorithms. Then, the loudspeaker of the listener's handset converts the synthesized electrical signal into an acoustic signal, y(t). The listener's perception of the quality of this transmitted acoustic signal, y(t), (or electrical signal in the case where the acoustic signal is unavailable) is the subject of the present book.
1.3.2 Telephony networks

The traditional analog telephony network has been optimized for an almost perfect intelligibility of the conversation. This led to the creation of the world's public
fixed-line telephone network, referred to as the Public Switched Telephone Network (PSTN). For this purpose, values have been defined for a set of physical parameters, e.g. speech level, bandwidth and Signal-to-Noise Ratio (SNR). For instance, the bandwidth of the transmitted speech over the PSTN corresponds to the transmission of the frequencies between 300 and 3 400 Hz. This bandwidth is nowadays referred to as Narrow-Band (NB). The PSTN is based on a "circuit-switched" network: the two interlocutors are connected by a physical circuit, preserved over the whole conversation call. In such a network, all physical parameters are well controlled to ensure a stable speech quality. During the last two decades, the deregulation of the telecommunications market broadly changed the telephone networks. The first main transition was the introduction of digital transmissions1. The Integrated Services Digital Network (ISDN) was introduced to replace analog transmissions on the PSTN. Nowadays, analog networks are referred to as the Plain Old Telephone System (POTS). Then, mobile phone networks, such as the Global System for Mobile communication (GSM) network, have been broadly set up over the world. The users of mobile telephony services can move during a conversation. However, this network is highly dependent on the characteristics of the radio channel between the mobile phone and the GSM antenna. For instance, interference on the radio frequencies, such as co-channel interference (quantified by the Carrier-to-Interference ratio, C/I) and adjacent-channel interference (Carrier-to-Adjacent ratio, C/A), is introduced during the radio transmission. This interference produces bit errors and frame losses quantified by a Bit Error Rate (BER) and a Frame Error Rate (FER), respectively. In addition, when the user travels, his mobile phone may switch from one antenna to another, which results in a brief interruption of the speech. This is referred to as a handover. Consequently, in comparison to the usual fixed-line telephony network, the GSM network produces new degradations of the perceived quality. Wolf et al. (1991) listed the quality degradations introduced by mobile transmissions: gaps in speech, long echo delays, bursts of erroneous bits, speech clipping, phonemic distortions and cell loss. The GSM network is being replaced by a third generation network called the Universal Mobile Telecommunications System (UMTS). This new mobile network uses a higher transmission bit-rate compared to the GSM and thus enables new services. In addition to the PSTN, computer networks such as the Internet have been introduced. They are based on a discontinuous transmission of packets of data and are consequently known as "packet-switched" networks. The telecommunication providers have adapted the packet-switched networks to the transmission of speech data. In such networks, the speech signal is encapsulated into packets of equal size. The payload (i.e. the speech data) is routed from one point to another across the network. This transmission is handled by a combination of several protocols such as the User Datagram Protocol (UDP) or the Real-time Transport Protocol (RTP). The information related to these protocols is added to the payload. It corresponds to a set of bits referred to as the packet header. In packet-switched networks the voice
The transmitted information is digitally encoded. However, the signal remains analog in the sense that bits are transmitted by an analog electrical signal.
service is called VoX, where Vo stands for "Voice over" and X represents the network used to transmit the speech signal (e.g. Internet Protocol (IP), or Asynchronous Transfer Mode (ATM)). Nowadays, the packet-switched network is one of the most widely used transmission paths because of its enhanced flexibility compared to the circuit-switched network. A larger audio bandwidth than the narrow telephone bandwidth can be transmitted. Possible bandwidths are WideBand (WB) (i.e. 50–7 000 Hz), Super-WideBand (S-WB) (i.e. 50–14 000 Hz) and Full-Band (FB) (i.e. 20–20 000 Hz). However, NB telephony is still used in most of the calls. Even though important advances were made in telephony services in the last three decades, the next decade will probably see an increase of enhanced communication services. For instance, in the Next Generation Network (NGN), the users will be able to switch between multiple networks by using different technologies during the same conversation call. However, in such packet-switched networks, several quality impairments may be increased compared to the PSTN. For instance, the packetization process lengthens the overall transmission delay. In addition, a long transmission delay makes the talker echo audible and reduces interactivity, i.e. the dynamics of a telephone conversation. Both impairments impact the speech quality in a conversational situation. However, one important degradation introduced by packet-switched networks corresponds to discontinuities in the transmitted message. Since the bit-rate allocation is not guaranteed over the whole call, time-varying distortions such as packet losses may appear.
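To make the header overhead of such packet-switched transmission concrete, the following back-of-the-envelope sketch computes the gross bit-rate of a hypothetical VoIP stream carrying one 20-ms speech frame per packet; the 40-byte figure is the standard minimum RTP/UDP/IPv4 header size, ignoring link-layer overhead and header compression.

```python
# Gross bit-rate of a VoIP stream carrying one 20-ms G.711 frame per packet.
# 40 bytes is the minimum RTP/UDP/IPv4 header size (12 + 8 + 20 bytes).
FRAME_MS = 20
CODEC_KBITS = 64                  # ITU-T Rec. G.711 log. PCM
HEADER_BYTES = 12 + 8 + 20

payload_bytes = CODEC_KBITS * FRAME_MS / 8    # 160 bytes of speech data
packet_bytes = payload_bytes + HEADER_BYTES   # 200 bytes on the wire
gross_kbits = packet_bytes * 8 / FRAME_MS     # 80 kbit/s

print(f"payload {payload_bytes:.0f} B, packet {packet_bytes:.0f} B, "
      f"gross {gross_kbits:.0f} kbit/s "
      f"({100 * HEADER_BYTES / payload_bytes:.0f}% header overhead)")
```

The calculation illustrates why short packetization intervals, although good for delay, are costly in terms of network rate: the header size is fixed per packet while the payload shrinks with the frame length.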
1.3.3 User interfaces

The physical interface between the customers and the transmission system can be a handset, a headset or a Hands-Free Terminal (HFT). The quality of such acoustic terminals is determined by:
• The quality of the two transducers: the microphone and the small loudspeaker.
• The degradation introduced by digital processing systems such as low bit-rate codecs or Voice Quality Enhancement (VQE).
The transducers introduce a linear frequency degradation based on their intrinsic frequency responses, their integration in the housing and the coupling between the device and the head. The acoustic terminal has a strong effect on the frequency response of the overall transmission system. However, two other elements of the connection path may influence the bandwidth and spectral shape of the frequency response: (i) the transmitted audio bandwidth (e.g. NB, WB or S-WB) and (ii) the speech codec. In addition, the talking environment may introduce a reverberation that impacts the frequency response.
Fig. 1.5 Frequency response |H| of filters modeling telephone band limitations. a IRS Send filter (IRSsend). b IRS Receive filter (IRSreceive)
In the early 1970s, the ITU-T studied the frequency responses of several commercially available analog telephone handsets. ITU–T Rec. P.48 (1988) defines the Intermediate Reference System (IRS): two bandpass filters, IRSsend (microphone) and IRSreceive (loudspeaker). Both filters simulate the usual NB telephone handset transducers by an averaged frequency response, see Fig. 1.5. A modified version of these two filters, representing the quality improvement of telephone handset transducers and digital networks, was published as Annex D of ITU–T Rec. P.830 (1996). Because of the limited number of available WB handsets, no WB version of the IRS filters has been published by the ITU-T yet. However, Sydow (2004) reviewed the standards for WB telephony published by other organizations. In addition, Gierlich et al. (2008b) evaluated the perceived quality of several available wideband handsets. Nowadays, the handset manufacturers introduce new services to user terminals in order, for instance, to increase the mobility of the user with cordless devices such as Digital Enhanced Cordless Telecommunications (DECT) handsets or "Bluetooth" headsets. Though they enable a greater mobility, these terminals include several digital processing systems, such as low bit-rate codecs, liable to degrade the transmitted speech signal.
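The exact IRS frequency responses are tabulated in ITU–T Rec. P.48 and are not reproduced here. As a rough stand-in for experimentation, the kind of band limitation such filters impose can be approximated by a simple band-pass filter restricted to 300–3 400 Hz; the sketch below uses a Butterworth design, which is our own simplification and not part of the Recommendation.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def nb_bandlimit(x, fs):
    """Crude narrow-band (300-3400 Hz) band limitation. This is NOT the
    tabulated IRS response of ITU-T Rec. P.48, only a 4th-order
    Butterworth band-pass used to illustrate the effect."""
    sos = butter(4, [300.0, 3400.0], btype='bandpass', fs=fs, output='sos')
    return sosfilt(sos, x)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 1000 * t)
y = nb_bandlimit(x, fs)   # the 200-Hz component is strongly attenuated
```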
1.3.4 Speech codecs

In the digital domain, the continuous electrical waveform is represented by a succession of discrete values. The bit-rate of this digital signal is expressed in kilobits per second (kbit/s). For instance, an analog monaural audio signal sampled at a sampling frequency (f_S) of 8 000 Hz and quantized with 16 bits per sample uses a 128 kbit/s network rate. Consequently, all frequencies above 4 000 Hz are not encoded, and all signal levels outside the quantization range are either set to 0 (i.e. too low) or
clipped to −2^15 or 2^15 − 1 (i.e. too high). A speech coding algorithm is a system that reduces the network rate used to transmit the speech signal. The speech coder produces a compressed signal, referred to as the bit-stream, from the input speech signal. After transmission over the network, the aim is to get a synthesized speech signal as similar as possible to the original speech. The system performing the coding/decoding process is referred to as a speech CoDec. Physical characteristics of speech codecs are (i) the bit-rate, expressed in kbit/s, (ii) the frame length, expressed in milliseconds (typical frame lengths range from 5 to 30 ms), for vocoder-type codecs, and (iii) the complexity of the coding algorithm, expressed in Million Instructions Per Second (MIPS). Two additional characteristics are typically used to quantify the signal as perceived by the users: (i) the fidelity of the synthesized signal to the original signal (which impacts the transmission quality), and (ii) the delay introduced by the coding–decoding process (which impacts the conversation effectiveness). For instance, a low bit-rate speech codec with a good signal fidelity may have a high complexity and consequently a low conversational effectiveness. The signal fidelity of almost all speech codecs is good enough to ensure perfect intelligibility. However, the "naturalness" of the synthesized signal may be affected, resulting in poor speaker recognition.
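The figures given above can be checked with a few lines of code. The sketch below quantizes a signal whose amplitude exceeds the quantization range, illustrating both the 128 kbit/s network rate and the clipping to −2^15 and 2^15 − 1 (the 440-Hz tone is an arbitrary test signal):

```python
import numpy as np

fs = 8000                                  # sampling frequency (Hz)
t = np.arange(fs) / fs                     # one second of signal
x = 1.5 * np.sin(2 * np.pi * 440 * t)      # amplitude exceeds the range [-1, 1)

# Uniform 16-bit quantization: integer levels -2**15 ... 2**15 - 1
x_int = np.clip(np.round(x * 2 ** 15), -2 ** 15, 2 ** 15 - 1).astype(np.int16)

bit_rate = fs * 16 / 1000                  # = 128 kbit/s, as stated above
clipped = np.mean(np.abs(x) >= 1.0)        # fraction of clipped samples
print(f"bit-rate: {bit_rate:.0f} kbit/s, clipped samples: {clipped:.1%}")
```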
1.3.4.1 Coding techniques

Several types of coding techniques are available. They are aimed at minimizing the error between the input and the synthesized speech in a perceptual domain. Examples of speech codecs are listed in Table 1.2.
1. Waveform method: it is based on Pulse Code Modulation (PCM) techniques. These coders quantize the speech samples either in an absolute manner, e.g. Log. PCM (ITU–T Rec. G.711, 1988), or only from the difference with the previous sample, e.g. AD-PCM (ITU–T Rec. G.726, 1990).
2. Vocoder method: this method makes use of the quasi-periodic properties of the speech signal, x(k). The short-term speech spectrum is estimated from x(k) and then synthesized in y(k) by using bandpass filters. The filter parameters and possibly a residual signal (i.e. d(k) = x(k) − x̂(k)) are then transmitted. This method was introduced by Dudley (1939), who used analog bandpass filters. Modern vocoders usually employ a vocal tract model. This principle, introduced by Atal and Hanauer (1971), is used in modern low bit-rate speech codecs, e.g. ITU–T Rec. G.728 (1992). This model considers that any sample of a speech signal, x(k), can be estimated by a linear combination of the p preceding samples according to:

x̂(k) = ∑_{i=1}^{p} a_i x(k − i) ,    (1.3)
where a_i are the predictor coefficients. These coefficients, a_i, can be estimated to define an all-pole linear filter. Called in this case Linear Predictive Coding (LPC) coefficients, they define a "smoothed" spectral representation of the signal, x(k); a minimal estimation sketch is given after this list.
3. Hybrid method: it combines either a waveform method and a vocoder method, or two different vocoder techniques, e.g. ITU–T Rec. G.729.1 (2006).
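The sketch below shows one way the predictor coefficients of Eq. (1.3) can be estimated; it uses the classical autocorrelation (Yule–Walker) method with plain NumPy, which is one of several possible estimation procedures and not the specific analysis of any codec cited above.

```python
import numpy as np

def lpc(x, p):
    """Estimate the predictor coefficients a_i of Eq. (1.3) with the
    autocorrelation method, i.e. by solving the Yule-Walker equations."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

# Example on a 20-ms frame of a damped oscillation (a crude stand-in for
# one pitch period of voiced speech):
fs, f0, p = 8000, 500, 10
k = np.arange(160)
x = np.exp(-k / 80.0) * np.sin(2 * np.pi * f0 * k / fs)
a = lpc(x, p)

# Predicted samples and residual d(k) = x(k) - x^(k):
x_hat = np.array([a @ x[m - p:m][::-1] for m in range(p, len(x))])
d = x[p:] - x_hat     # small for signals that fit the all-pole model
```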
Table 1.2 Characteristics of speech codecs

Band.  Codec      First standardized in  Codec type         Frame length (ms)  Bit-rate (kbit/s)
NB     G.711      1972                   log. PCM           0.125              64
NB     G.726      1990                   AD-PCM             0.125              16–40
NB     G.728      1992                   CELP               2.5                16
NB     G.729      1996                   CS-ACELP           10                 8–11.8
NB     GSM-FR     1996                   RPE-LTP            20                 13
NB     GSM-EFR    1995                   ACELP              20                 12.2
NB     AMR        1998                   ACELP              20                 4.75–12.2
NB     EVRC       1999                   RCELP              20                 0.8–8.55
NB     iLBC       2004                   LPC                20, 30             13.33–15.2
WB     G.722      1988                   SB-ADPCM           0.125              48–64
WB     G.722.1    1999                   MLT                20                 24–32
WB     AMR-WB a   2001                   ACELP              20                 6.6–23.85
WB     G.729.1    2006                   CS-ACELP/TDAC      20                 14–32
WB     EVRC-WB    2007                   RCELP/NELP         20                 0.8–8.55
WB     G.711.1    2008                   log. PCM/MDCT      5                  64–96
WB     G.718      2008                   CELP/MDCT          20                 8–32
SWB    G.722.1 C  2005                   MLT                20                 24–48
SWB    AMR-WB+    2005                   ACELP/TCX          20                 13.6–24
SWB    Speex      -                      CELP               20                 2.15–44.2

a This wideband codec was first standardized by the ETSI under the name AMR-WB (ETSI TS 126 190, 2007) and then as ITU–T Rec. G.722.2 (2003).
Nowadays, speech codecs use a simple model of human auditory perception, e.g. the Algebraic Code-Excited Linear Prediction (ACELP) scheme of ITU–T Rec. G.729 (2007). This perceptual model is usually implemented as a weighting filter, which exploits the frequency masking properties of the human ear. This perceptual model works well on NB speech signals, but the wide dynamic range between high and low frequencies in the speech spectrum imposes changes in this weighting function for WB speech codecs (Bessette et al., 2002).
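In CELP-type coders this masking-based weighting is typically realized with a filter of the form W(z) = A(z/γ1)/A(z/γ2), where A(z) is the LPC analysis filter. The sketch below derives the filter coefficients by bandwidth expansion of the LPC coefficients; the γ values are common textbook choices, not those of a particular standard.

```python
import numpy as np

def weighting_filter(a, gamma1=0.9, gamma2=0.6):
    """Coefficients of a perceptual weighting filter
    W(z) = A(z/gamma1) / A(z/gamma2), where A(z) = 1 - sum_i a_i z^-i
    is the LPC analysis filter; scaling coefficient i by gamma**i
    corresponds to a bandwidth expansion of the poles/zeros."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    num = A * gamma1 ** np.arange(len(A))   # A(z/gamma1)
    den = A * gamma2 ** np.arange(len(A))   # A(z/gamma2)
    return num, den

# Usage (with scipy.signal.lfilter): the coding error x - y is filtered
# by W(z) before minimization, so that more error is tolerated under the
# formants, where it is masked:
#   num, den = weighting_filter(a)
#   weighted_error = scipy.signal.lfilter(num, den, x - y)
```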
1.3.4.2 Voice Activity Detector (VAD)

In order to reduce the network rate used in mobile and packet-switched networks, the speech codecs may include an optional algorithm to monitor the source speech signal. Only the active parts of the speech signal are actually transmitted over the network. The digitized speech signal is processed by a VAD, which detects the silent parts of the signal. However, for a low-level speech signal, the VAD may introduce time clipping: it may fail to detect the starts (front-clipping) and ends (end-clipping) of sentences and words. In addition to the VAD algorithm, the transmission system uses a discontinuous transmission (DTX) algorithm to interrupt the transmission of the signal when no voice activity is detected. Such a DTX algorithm is used to save power and network rate. However, the complete absence of signal may result in a misinterpretation by the far-end user: he may think that the connection has been interrupted. This expectation comes from the analog circuit-switched telephony network, which carries a constant low-level noise. Consequently, when the DTX algorithm interrupts the signal transmission, the decoder may insert a "comfort noise"2 through a CNG algorithm. For an example of such an optional algorithm see ITU–T Rec. G.729 Annex B (1996).
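A toy energy-based VAD is sketched below to illustrate the front-/end-clipping issue: with a fixed energy threshold, low-level onsets and trailing sounds fall below the threshold and are classified as silence. Real VADs, such as the one in ITU–T Rec. G.729 Annex B, combine several features and hangover logic; this sketch is only a didactic simplification with arbitrary parameter values.

```python
import numpy as np

def simple_vad(x, fs, frame_ms=20, threshold_db=-40.0):
    """Toy energy-based VAD: returns one boolean per frame (True = speech).
    A fixed threshold inevitably misses low-level starts and ends of
    words, which is the front-/end-clipping effect described above."""
    n = int(fs * frame_ms / 1000)
    frames = x[:len(x) // n * n].reshape(-1, n)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > threshold_db
```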
1.3.4.3 Jitter & Packet Loss Concealment (PLC)

In packet-switched and mobile networks, the packets can take different transmission paths, which leads to a time-varying transmission delay. The inter-sending time between consecutive packets at the coder side is usually fixed (e.g. every 20 ms). However, the inter-arrival time between consecutive packets at the listener's side may vary. This variation in transmission delay is referred to as jitter. Consequently, to generate a continuous signal, a buffer is placed in front of the decoder so as to queue several speech segments before the start of the decoding process. The size of the jitter buffer (e.g. 120 ms) defines the tolerated lengthening of the transmission delay between two consecutive packets. However, the size of the jitter buffer increases the overall transmission delay and may thus affect the conversation effectiveness. This size is therefore adapted to the network in use. In addition, the transmitted packets may arrive in the wrong order. A specific algorithm, referred to as jitter buffer management, handles the re-ordering of the incoming packets. A speech segment may be lost during the transmission or arrive too late to synthesize a continuous signal. The distribution of a packet-loss pattern is usually considered as random. However, in real networks, a single loss event frequently includes several packets (a so-called bursty distribution). In bursty losses, the loss of the current packet highly depends on the preceding packets. Another optional algorithm of speech decoders consists in a reconstruction of the missing segments. This
The “comfort noise” is either fully artificial, or simulates the background noise at the talker’s side.
algorithm, called Packet Loss Concealment (PLC), reduces the impact on the speech quality caused by a discontinuity in the speech signal. Several concealment techniques are used in PLC algorithms. The simplest technique consists of a repetition of the last received segment. Nowadays, more complex algorithms propose an interpolation of the lost segment consistent with the preceding and following segments. Another technique uses time-scale modifications of the speech signal (also known as "time-warping"): the smooth reconstruction of the waveform avoids any discontinuity in the speech signal and consequently reduces the perceived quality degradation. This technique is based on speech production techniques and network delay statistics (Liang et al., 2001).
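The simplest of these techniques, repetition of the last received segment with a fade-out over bursty losses, can be sketched in a few lines; the 0.5 attenuation factor per lost frame is an arbitrary illustrative choice.

```python
import numpy as np

def conceal(frames, received):
    """Toy PLC by repetition: each lost frame is replaced by the last
    correctly received frame, faded out over consecutive losses.
    `frames` is a list of equal-length arrays, `received` a boolean list."""
    out, last, fade = [], np.zeros_like(frames[0]), 1.0
    for frame, ok in zip(frames, received):
        if ok:
            last, fade = frame, 1.0
            out.append(frame)
        else:
            fade *= 0.5               # arbitrary fade factor per lost frame
            out.append(last * fade)
    return np.concatenate(out)
```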
1.3.5 Speech enhancement

Complex speech transmission systems such as mobile or Voice over Internet Protocol (VoIP) networks introduce new degradations in the perceived speech communication quality. Consequently, in addition to the speech processing systems described above (i.e. jitter buffer management, PLC, speech coding and VAD/DTX/CNG), VQE algorithms are integrated into the network or even directly into the terminal. These algorithms are echo cancellation, noise reduction, de-reverberation and automatic gain control. However, VQE algorithms may introduce non-linear and time-variant degradations of the speech communication quality. For instance, a noise reduction algorithm may lead to robotic-sounding speech and to the production of musical noise.
1.3.5.1 Echo cancellation

When a human is talking to someone, he has a natural acoustic feedback of his own voice. In telephony, this natural acoustic feedback is simulated in the handset by an electrical return of the microphone input signal to the loudspeaker. This effect, referred to as sidetone, is important in talking situations. However, the talking quality is affected by sidetone impairments (e.g. a high-level sidetone or coding distortions). In addition, echoes of the talker's own voice can appear during a phone call and impair the talking quality. These echoes can be introduced by two different mechanisms:
• Acoustic echo: The microphone input signal, x(t), of a terminal, A, is transmitted over the network and reproduced by the loudspeaker of the terminal, B (i.e. the far-end terminal). This transmitted and reproduced signal, y(t), is then picked up by the terminal-B microphone, e(t) (i.e. the echo), and transmitted back to the terminal-A loudspeaker. By comparison with handsets, an HFT causes an increase of the feedback gain.
• Electrical echo (or line echo): In hybrid networks, i.e. including an analog and a digital network or two different analog networks, an impedance mismatch at the interconnection between the two networks can generate an electrical echo of the transmitted signal.
The perception of this talker's echo mainly depends on two parameters: (i) the echo delay and (ii) the echo attenuation. The longer the delay is, the more the echo has to be attenuated. However, because of both the speech data packetization and the buffering process, the transmission delay and, consequently, the talker's echo are exacerbated in packet-switched networks compared to circuit-switched ones, where the delay is too short to make the talker's echo perceptible. AEC and NEC are needed if the delay exceeds 15 ms. Consequently, in VoIP and mobile networks the echo is canceled in the very close vicinity of the echo source (i.e. in the user interface).
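Echo cancellation is commonly performed by an adaptive filter that identifies the echo path and subtracts the estimated echo from the microphone signal. A minimal Normalized Least-Mean-Squares (NLMS) sketch is given below; the filter length and step size are illustrative values, and production cancelers add double-talk detection and non-linear post-processing.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, taps=128, mu=0.5, eps=1e-8):
    """Sketch of an adaptive echo canceller: an NLMS filter identifies
    the echo path from the far-end signal, and the estimated echo is
    subtracted from the microphone signal before transmission."""
    w = np.zeros(taps)                     # estimated echo-path response
    out = np.zeros(len(mic))
    for k in range(taps, len(mic)):
        x = far_end[k - taps:k][::-1]      # most recent far-end samples
        e = mic[k] - w @ x                 # echo-compensated sample
        w += mu * e * x / (x @ x + eps)    # normalized LMS update
        out[k] = e
    return out
```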
1.3.5.2 Noise Reduction (NR)

In addition to the desired speech signal, the input signal of the telephone microphone, x(t), contains all of the artifacts due to the talker's environment, n(t), such as the reverberation of the talking room and the background noise. These artifacts are amplified in the case where an HFT is employed, or in mobile telephony. Consequently, an NR system, complemented by a de-reverberation algorithm, can be employed in the terminal to separate the desired signal components from the undesired ones.
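A classical, deliberately simplistic NR technique is magnitude spectral subtraction, sketched below: the noise magnitude spectrum is estimated on an assumed speech-free leading segment and subtracted frame by frame. The spectral floor (5% here, an arbitrary choice) limits, but does not eliminate, the "musical noise" artifacts mentioned above.

```python
import numpy as np

def spectral_subtraction(x, fs, noise_seconds=0.25, n_fft=256):
    """Toy magnitude spectral subtraction. The noise spectrum is estimated
    on an assumed speech-free leading segment; each frame's magnitude is
    reduced by it, down to a small spectral floor."""
    hop = n_fft // 2
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft, hop)
    spectra = [np.fft.rfft(x[i:i + n_fft] * win) for i in starts]
    n_noise = max(1, int(noise_seconds * fs / hop))
    noise_mag = np.mean([np.abs(s) for s in spectra[:n_noise]], axis=0)
    y = np.zeros(len(x))
    for i, s in zip(starts, spectra):
        mag = np.maximum(np.abs(s) - noise_mag, 0.05 * np.abs(s))  # floor
        y[i:i + n_fft] += np.fft.irfft(mag * np.exp(1j * np.angle(s)))
    return y
```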
1.3.5.3 Automatic Gain Control (AGC)

In order to avoid an overload of the transmission channel, an AGC algorithm can be used to level-equalize the speech signal. This algorithm adjusts the input speech level of the handset microphone before its transmission over the network.
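A minimal AGC sketch is given below; the target level, frame length and smoothing constant are illustrative values rather than those of any deployed system (the -26 dB target relative to full scale is a common nominal speech level, used here only as an example).

```python
import numpy as np

def agc(x, fs, target_db=-26.0, frame_ms=10, max_gain=10.0):
    """Toy AGC: a smoothed per-frame gain drives the signal level towards
    a target level (expressed in dB relative to full scale)."""
    n = int(fs * frame_ms / 1000)
    y = np.array(x, dtype=float)
    gain = 1.0
    for i in range(0, len(y) - n + 1, n):
        rms = np.sqrt(np.mean(y[i:i + n] ** 2)) + 1e-9
        wanted = 10.0 ** (target_db / 20.0) / rms
        gain = 0.9 * gain + 0.1 * min(wanted, max_gain)  # avoid abrupt jumps
        y[i:i + n] *= gain
    return y
```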
1.4 Perceptual quality space

1.4.1 Definition

Several studies have been carried out in order to establish a list of the different quality features used by subjects to judge the quality of transmitted speech. It results from these studies that speech quality, like other perceptual magnitudes, is by nature a multidimensional object. However, the main problem met in the definition of a multidimensional feature space related to speech quality is the characterization of the space with a few orthogonal quality features, called perceptual dimensions. The quality space is extracted through a 4-step procedure:
1. Selection of conditions that span the whole perceptual space under study.
2. Performance of an auditory experiment.
3. Multidimensional analysis of the test results.
4. Identification of the derived perceptual dimensions.
The resulting space is Euclidean and composed of orthogonal dimensions, i.e. the values on two perceptual dimensions are not correlated. In this space, the subject's internal reference is defined by a vector, p, and the speech signal under study by a vector, q, such that:

p = (p_1, . . . , p_{N_dim}) ,   q = (q_1, . . . , q_{N_dim}) .    (1.4)

The integral quality, Q, of the speech signal can be defined by the Euclidean distance between p and q:
Q = d(p, q) = √( ∑_{i=1}^{N_dim} α_i (p_i − q_i)^2 ) ,    (1.5)
where q_i corresponds to the quality value on one of the N_dim perceptual dimensions and the weighting coefficient, α_i, determines the influence of the dimension, i, on the integral quality (with α_i > 0 for all i). However, the perceptual dimensions may have different behaviors. For instance, the highest quality on the dimension Noisiness is defined as p_noi = 0, i.e. the less, the better. In contrast, the highest quality on the dimension Loudness corresponds to an "ideal point" (p_loud = 0): any variation around this point will introduce a degradation. A new speech signal under study can be associated with a specific position in a previously defined multidimensional quality space. This technique also provides far more information about the quality degradation applied to the speech signal than the single number provided by a conventional listening test (see Sec. 2.2.3).
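For concreteness, Eq. (1.5), read as a weighted Euclidean distance, can be written out directly; the sketch below computes the distance for a hypothetical three-dimensional space with invented reference position, signal position and weights.

```python
import numpy as np

def integral_quality(p, q, alpha):
    """Weighted Euclidean distance of Eq. (1.5) between the internal
    reference p and the signal under study q (length-N_dim vectors)."""
    p, q, alpha = (np.asarray(v, dtype=float) for v in (p, q, alpha))
    return np.sqrt(np.sum(alpha * (p - q) ** 2))

# Hypothetical 3-dimensional example (all numbers purely illustrative):
p = [0.0, 0.0, 0.0]        # internal reference position
q = [0.2, 0.5, 0.1]        # position of the assessed speech signal
alpha = [1.0, 2.0, 0.5]    # per-dimension weights, all > 0
print(integral_quality(p, q, alpha))   # larger distance = stronger degradation
```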
1.4.2 Example studies

This section introduces several examples of speech quality studies which have led to the definition of a speech quality space. The studies, listed in Table 1.3, were selected for their reliability (number of test subjects, number of samples, type of degradations, coherence with previous studies). Several auditory test methods and multidimensional analyses, introduced in this section, will be described in detail in Sec. 2.2.
Table 1.3 Summary of studies related to perceptual quality dimensions of transmitted speech. The terms used for the quality features are those chosen by the respective authors. Abs. stands for speech quality test

Study                           Degradation (num.)  Aud. test         Analysis    Features ident.  Dimensions
McDermott (1969)                PSTN (22)           Sim. & Pref.      MDS         Experimenter     Clarity & Loudness & Speech vs. Background distortion
Gabrielsson and Sjögren (1979)  Loudspeakers (9)    Sim. & SD         MDS & PCA   SD               Nearness & Sharpness & Brightness & Noisiness
Hall (2001)                     NB Codecs (10)      Abs. & Sim.       MDS         Subjects         Naturalness & Noisiness & Spectral fullness
Sen (2001)                      NB Codecs (56)      DAM               MDS & PCA   DAM              Temporal distortions & Frequency distortions
Bernex and Barriac (2002)       VoIP (6)            Abs. & Cat.       MDS & Tree  Subjects         Clipping & Metallic & Whistling
Mattila (2002b)                 Mobile/Clean (41)   Abs. & Sim. & SD  MDS & PCA   SD               Natural & Bright & Interrupted & Bubbling & Noisy
Mattila (2002a)                 Mobile/Noisy (41)   Abs. & SD         PCA         SD               High & Natural & Interrupted & Noisy
Wältermann et al. (2006b)       NB (14)             Abs. & Sim. & SD  MDS & PCA   SD               Directness/Frequency content & Interruptedness & Noisiness
Wältermann et al. (2006a)       WB (14)             Abs. & Sim. & SD  MDS & PCA   SD               Continuity & Distance & Lisping & Noisiness
Etame et al. (2008)             WB Codecs (20)      Abs. & Sim.       MDS         Subjects         Clarity & Correlated noise & Noise on speech & Hiss
• McDermott (1969) employed two auditory test methods to derive the perceptual quality space of speech signals. The test corpus included a rather large set of processing conditions (22). Two paired-comparison tests were carried out: (i) one of them was based on preference and (ii) the other one relied on similarity. The set of conditions included speech coding algorithms, linear filtering, amplification/attenuation, addition of echo3 and noise. Two Multi-Dimensional Scalings (MDSs) were applied, respectively, to the similarity results and the preference results. Three common perceptual dimensions were derived from the preference space and the similarity space. The author then interpreted the three dimensions directly from the distribution of the 22 conditions on each dimension. The three derived dimensions were clarity, distinction between speech signal distortion and background interference, and loudness.
• Hall (2001) used two different auditory test methods: a triadic similarity test similar to the one used by McDermott (1969) and an integral quality test. A similarity space defined by three perceptual dimensions was derived from the subjects' results on the basis of a weighted MDS. In a second step, the subjects were invited to describe these dimensions with their own words. The resulting attributes of this free verbal description were naturalness, noisiness and spectral fullness (or amount of low-frequency content). However, the author derived two different spaces according to the sex of the talker. The two similarity spaces were found to be similar, but not identical. In addition, only 10 stimuli for the male talker and a subset of 7 among the 10 stimuli for the female talker were used in the auditory tests. One should note that this rather small test corpus did not fully represent the degradations introduced by real speech transmission systems employed in the year 2001. A regression analysis for the male talker between the integral quality values (MOS) and the values on the three perceptual dimensions led to the following linear relationship:

MOS = 2.99 + 0.61 × naturalness − 0.20 × noisiness + 0.05 × spectral fullness .    (1.6)
This combination of dimensions proved to provide a reliable estimation of the integral quality values: the standard deviation of the prediction error was σ = 0.11 MOS. In addition, naturalness had the highest impact upon the integral speech quality.
• Bernex and Barriac (2002) studied the impairments introduced by VoIP transmission networks. For this purpose, 12 auditory experiments were carried out with six conditions and two talkers per condition. In each experiment, the test subjects assessed only one of the six conditions and a single talker. However, in each experiment 30 stimuli, corresponding to 30 different packet-loss patterns,
An echo is a repetition of the signal with a short delay, here set to 66 ms.
were assessed by two different auditory test methods. First, the subjects were asked to judge the annoyance of the impairment along an absolute category scale. In the second test, the subjects were asked to create several groups of stimuli with similar impairments. The results were converted into dissimilarity data. Then, three perceptual dimensions were derived using a nonmetric MDS and a tree analysis. The subjects were asked to verbally describe the previously created categories. Since the conditions were not mixed between the tests, the perceptual dimensions mainly described the dependency of the perceived quality on the packet-loss location. The three perceptual dimensions were clipping/reducing, metallic/robot voice/beep and whistling/breath/crackling. With respect to the previous study conducted by Hall, a third dimension, called clipping/reducing, was derived from the auditory judgments. This dimension reflects time-varying distortion within the envelope of the speech signal. In addition, a fourth category, no perceived impairment, likely related to the specific set-up in use, was also proposed.
• Mattila (2002a,b,c) studied the speech quality space covered by mobile transmissions. He carried out several auditory experiments with a large set of processing conditions (85) including real recordings of mobile telephony and artificially generated impairments. Among the artificial impairments were: real background noise, speech coding algorithms, user interfaces, transmission errors, linear filtering and addition of echo. From the 85 conditions, 41 were processed in two different background noise situations (clean and car cabin noise at the send side with an SNR of 10 dB). In addition, the test corpus included high-level car noise conditions with an SNR of 5 dB. Three auditory test methodologies were used: (i) an acceptability test using a continuous scale (all of the 85 conditions), (ii) two similarity paired-comparison tests (with/without background noise, i.e. 2 × 41 conditions), and (iii) a Semantic Differential (SD) test using 21 antonym pairs of attributes. These attributes described either the characteristics of the speech signal or those of the background noise (all of the 85 conditions), and were selected from a pre-test where the subjects had been asked to describe the stimuli characteristics with their own words. From the acceptability results, a preference space was derived with a Principal Component Analysis (PCA) (Mattila, 2002c). This space was defined by two perceptual dimensions that explain about 90.0% of the total variance in the listening quality test. One should, however, note the existence of slight differences between male and female talkers. The linear relationship between the two dimensions and the 21 attributes used in the SD test was measured. None of these attributes showed a strong correlation with the two perceptual dimensions. Then, Mattila (2002b) applied a weighted MDS analysis to the similarity data. This led to five perceptual dimensions from the first set of 41 conditions (without background noise at the send side). These dimensions could be related to the respective attributes from the SD test: synthetic/natural, dark/bright, smooth/fluctuating/interrupted, bubbling and noisy. The addition of two additive
noises to the 41 conditions explains the fifth dimension, noisy, even if the speech signals were free of real background noise before any degradation by speech processing systems. In addition, the third dimension, smooth/fluctuating/interrupted, related to time-varying distortions, is consistent with the clipping/reducing dimension found by Bernex and Barriac (2002). Through application of a PCA to the SD test results, Mattila (2002a) derived four perceptual dimensions, low/high, synthetic/natural, smooth/fluctuating/interrupted and noisy, from the second set of 41 conditions (with car noise at the send side). For both clean and noisy speech, the smooth/fluctuating/interrupted dimension proved to have the highest contribution to the total variance of the similarity test.
• Wältermann et al. (2006b) derived a perceptual quality space which covers modern telephone connections, see also ITU–T Del. Contrib. COM 12–71 (2005). Very different processing conditions, e.g. speech coding algorithms, noise-reduction algorithms, echo cancellation, VAD, addition of real background noise and user interfaces (i.e. acoustic recordings), were included in the test corpus. Different transmission techniques were also considered (e.g. VoIP and PSTN transmissions). The study was aimed at taking into account all of the potentially relevant quality features. Three experiments were carried out, and their corresponding results were compared and combined to get a stable and reliable description of the perceptual quality space. At first, a similarity paired-comparison test was conducted on the 14 processing conditions. Two perceptual quality spaces composed of four dimensions were derived using a weighted MDS, one for a male talker and one for a female talker. The two spaces were found to be similar, but not identical. The authors interpreted the dimensions from the distribution of the impairments on each dimension. The attributes associated with the four dimensions were: directness/clearness, interruptedness, frequency content and noisiness. Then, in order to conduct an SD test, attributes were selected from two pre-tests, which resulted in 13 antonym pairs of adjectives. The quality space was reduced by a PCA to three perceptual dimensions. In contrast to the MDS analysis, the quality spaces for the two talkers showed no significant difference. Thanks to the correlation with the 13 antonym pairs, the attributes associated with the three features were directness/frequency content, interruptedness and noisiness. Two dimensions from the MDS analysis merged into the single dimension directness/frequency content. Consequently, the high correlation found between the two perceptual quality spaces extracted from the 2-experiment analysis is indicative of their consistency. In parallel, the integral quality of the considered stimuli was assessed through a listening quality test according to ITU–T Rec. P.800 (1996). A multivariate regression similar to the one used by Hall (2001) was applied and led to a linear relationship between the integral quality values and the three perceptual dimensions common to the two multidimensional analyses.
Q = α + 0.457 × directness/frequency content − 0.472 × noisiness − 0.698 × interruptedness ,    (1.7)
where α is an undetermined constant. Here, in contrast to the results obtained by Hall (2001), but in accordance with those in Mattila (2002a,b), interruptedness is the most important dimension for the integral speech quality. However, as time-varying degradations were not assessed in the investigations by Hall (2001), both studies are consistent. Wältermann et al. (2006a) extended the study to wideband transmissions and conducted the same three experiments (MDS/SD/integral quality) under 14 new processing conditions so as to reflect NB and WB transmissions. Four common perceptual dimensions were extracted from the 2-experiment analysis: continuity, distance, lisping and noisiness. Moreover, the corresponding relationship to the integral speech quality was established as:

Q = α + 0.78 × continuity − 0.13 × noisiness − 0.30 × distance − 0.14 × lisping ,    (1.8)
where α is again an undetermined constant. More recently, Wältermann et al. (2010a) combined both studies to derive a single set of three dimensions relevant for NB and WB speech. These three dimensions are discontinuity, noisiness and coloration.
• Etame et al. (2008) derived a similarity space in a purely WB context. One should note that their set of 20 processing conditions covers only speech-coding algorithms in the case of WB speech. The auditory test method in use was a similarity paired-comparison test. However, in order to avoid a judgment based on the quality level instead of the type of degradation, a preliminary test was carried out. This listening quality test according to ITU–T Rec. P.800 (1996) permitted the authors to select processing conditions with similar integral speech quality. Two similarity spaces (with a male talker and a female one, respectively), each composed of four dimensions, were derived through use of a weighted MDS analysis. Both spaces proved to be similar and highly correlated. Then, the subjects were given a free verbal description task to identify the perceptual attributes associated with the four dimensions. The identified attributes were clear/muffled, high-frequency noise, noise on speech and hiss. However, compared to the previous study conducted by Wältermann et al. (2010a), this quality space is restricted to the evaluation of speech codecs, covering a small fraction of the quality space defined by all speech transmission systems.
• In addition to the above studies, it is worth noting that several authors have tried to derive a perceptual quality space of the human voice (i.e. without any degradation by a transmission system). For instance, Bele (2007) carried out a quality attribute test and derived a 4-dimension voice quality space based on a PCA and a factor analysis. These four dimensions are related to variation/sonority, irregularity, degree of noise and phonatory effort.
1.5 Summary and conclusions

This chapter described the characteristics of speech production and perception. The quality judgment process leading to the definition of perceived integral speech quality was also detailed (Sec. 1.2). Moreover, the different building blocks of speech transmission systems that are liable to affect the speech quality were described. It results that speech quality, like other perceptual magnitudes, is by nature a multidimensional object. Several common perceptual dimensions emerged from the seven studies briefly described in Sec. 1.4.2. Regarding the set of conditions studied and the type of analysis applied to the subjects' judgments, the results by Wältermann et al. (2006b) are of particular interest. Indeed, the resulting three orthogonal perceptual dimensions, discontinuity, noisiness and coloration, cover a large set of possible impairments encountered in speech transmission systems. However, following several studies (Côté et al., 2007; McDermott, 1969; Rothauser et al., 1968), the listening level is considered as a feature of the integral speech quality. In Wältermann et al. (2006a,b), all speech stimuli were normalized to the preferred listening level, see Sec. 2.2.3.3. Consequently, the perceptual dimension loudness should be added to the perceptual space defined by the three dimensions found by Wältermann et al. (2006b). A loudness impairment is introduced in the case of a non-optimal listening level4. However, this perceptual dimension can be correlated with the other perceptual dimensions. In particular, the frequency content (i.e. bandwidth) of a stimulus has an impact on its perceived loudness (Moore et al., 1997). This is why these four dimensions are detailed in more depth here, and clues are given on their possible causes of degradation.
• Discontinuity This dimension has the strongest effect on the integral speech quality. It reacts to degradations in the time domain such as packet loss in VoIP networks or bit errors in radio transmissions. In addition, it may be affected by several speech processing systems like noise reduction algorithms or echo cancelers.
The optimum level is slightly different from the preferred listening level. The maximum integral quality for a given processing condition is obtained at the optimum level. The preferred listening level is assessed via a specific rating scale, the so-called loudness preference scale (see Sec. 2.2.4).
• Noisiness This dimension has less impact on speech quality than discontinuity. It is affected by degradations such as real background noise (i.e. the environment at the speaker's side and at the listener's side), circuit noise in analog networks and quantizing noise due to waveform codecs.
• Coloration The frequency distortion introduced by both the whole transmission path and specific quality elements such as user terminals is covered by the perceptual dimension coloration. This dimension can be affected by the two following elements: (i) a deviation from a reference timbre (e.g. dark or bright) introduced by the electro-acoustic properties of the user terminal (e.g. headsets and telephone handsets) and (ii) the bandwidth transmitted over the network (e.g. NB or WB). In addition, the acoustic properties of the talker's and listener's environments (i.e. the reverberation of the rooms) also have a strong influence on this dimension.
• Loudness This dimension covers the degradation due to a non-optimum listening level, i.e. an attenuation or an amplification introduced by the entire transmission system.
These four dimensions should reflect the whole perceptual quality space used by a subject to judge the quality of a transmitted speech signal. Instrumental speech quality models do not cover the four perceptual dimensions described above. Most of them, such as the current ITU standard, the Perceptual Evaluation of Speech Quality (PESQ) model, are specialized in the assessment of speech coding algorithms. Speech codecs mainly introduce non-linear degradations with a low impact on the four perceptual dimensions. For instance, the ITU–T Rec. G.729 (2007) codec introduces a slight coloration whose impact on the directness/frequency content dimension is lower than the one introduced by a restriction to the bandwidth 500–2 500 Hz. These considerations explain why it is worth designing and developing a generic instrumental model able to cover these four perceptual dimensions as well as speech processing systems. Such an instrumental model is introduced in Chap. 4 and evaluated in Chap. 5.
Chapter 2
Speech Quality Measurement Methods
This chapter deals with measurement methods particularly relevant to any assessment of the perceived quality of voice and speech. Such voice and speech quality measurement methods are employed in several scientific fields, such as medicine (e.g. the evaluation of voice-related problems), linguistics or speech technology (e.g. the evaluation of speech transmission systems or their components). Each field has its own assessment paradigm. This chapter makes use of two statistical parameters: (i) the Pearson correlation coefficient, ρ, and (ii) the standard deviation of the prediction error, σ. Both are defined in Sec. 5.1.3.2.
2.1 Definitions
In metrology, which is the science of measurement, measurement is generally defined as (BIPM Guides in metrology, 2008): a process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity.
and a measurement result is (BIPM Guides in metrology, 2008): a set of quantity values being attributed to a measurand together with any other available relevant information.
In a voice and speech quality measurement, a measurand is (Jekosch, 2005): a feature of the perceived speech event which can numerically be described on a measurement scale.
A generic description of an auditory experiment was proposed by Blauert (1997). A corresponding schematic representation of a listener involved in such an auditory experiment is shown in Fig. 2.1. In a first step, a sound event (i.e. an acoustic signal), s0, reaches the listener's ear. After the perception process, this acoustic signal results in an auditory event, h0 (i.e. a sensation), in the listener's sensory system (see Sec. 1.1.3). Except for the subject himself (by introspection), the auditory event is hardly accessible to the experimenter. Only a description of the sensation by the listener, b0, is accessible to other persons. In psychophysics, the subject is asked to define b0 such that it best relates the auditory impression, h0. The output of the assessment process, b0, corresponds either to a linguistic description of h0 or to a quantification on a psychophysical measurement scale (i.e. the amount of the sensation), see Stevens (1957). Such measurement scales used in auditory experiments are consequently a major part of the speech perception and assessment process. Given the relationship b0 = f{s0}, the loss of information related to the quality judgment should be small in order to benefit from the auditory experiment. Consequently, according to Jekosch (2005), the goal of a measurement scale is such that: [. . . ] that the numerical relational system forms an accurate copy of the structures and features of the speech that has been perceived and judged.

Fig. 2.1 Auditory test: schematic representation of a listener (Blauert, 1997). s0: sound event; h0: auditory event; b0: description of the sensation by the listener
Recently, Durin and Gros (2008) investigated the impact of speech quality on human behavior in communication tasks. The dual-task method they employed avoids the use of measurement scales in auditory test methods. However, such a method is taxing for the test subjects and is still under development. Quality measurement methods must be designed so that the parameter really quantified is the user's perception. Following Osgood (1952), a satisfactory measurement method meets the following six characteristics:

Objectivity: b0 is reproducible (verifiable) over different listeners (i.e. inter-subjectivity).
Reliability: The results provided by the method show no large scattering when s0 is repeated to the same listener (i.e. intra-subjectivity).
Validity: The parameter measured by the method is the one intended to be measured (i.e. one element of the semantic triangle, see Sec. 1.1.3).
Sensitivity: The distinctions enabled by the method are as fine as those made by the listener.
Comparability: The method is applicable to a wide range of perceived qualities and makes possible comparisons between groups of conditions.
Utility: The pieces of information provided by the method are useful.
These six characteristics are congruent with the definition of measurement in use in metrology. In the literature, the terms assessment and evaluation both refer to measurement methods. The methodologies under focus here deal with assessment. Indeed, an overall evaluation of a system would include the measurement of many characteristics that are not within the scope of this book (e.g. cost). On the other hand, the term assessment is related to the performance of a system for comparison purposes (Möller, 2005). In addition, the "tool" or measurement apparatus used to measure the speech signal can be either a test subject (auditory method) or physical equipment (instrumental method). In auditory methods, a test subject is asked to judge the quality of the speech signal. Since the perception process (described in Sec. 1.1.3) and the resulting perceived quality are internal to the user and not accessible from the outside, auditory experiments are the only method that meets the six characteristics introduced by Osgood (1952). But as they are costly and time-consuming, instrumental measurement methods have been developed. They provide a quality estimation from physically measured values. In this sense, the term model is used in computer science instead of instrumental measurement method. In the literature, these two approaches are improperly referred to as subjective and objective methods (Blauert and Guski, 2009), respectively. This terminology is even used by the ITU-T organization (ITU–T Rec. P.800.1, 2006). However, the objectivity or degree of objectivity of the auditory results refers to the amount of consistency between listeners in perception. For Jekosch (2005), objectivity is: the invariance with respect to the totality of what is individually perceived. [. . . ] Objectivity is the extent of inter-individual agreement.
In this book, several statistical metrics, which determine the degree of objectivity, will be introduced later (Sec. 5.1.3).
The present chapter gives an overview of different auditory methods (see Sec. 2.2) and instrumental methods (see Sec. 2.3). In speech quality, a specific unit, called the Mean Opinion Score (MOS), is employed to define the resulting quality scores. It corresponds to an average of the individual scores for each processing condition. Suffix notations are then added to the term MOS (ITU–T Rec. P.800.1, 2006) in order to indicate:
• The test modality: listening (MOS_LQ), talking (MOS_TQ) or conversation (MOS_CQ) quality.
• The measurement method: instrumental (referred to as objective, MOS_LQO, for signal-based models, or estimated, MOS_CQE, for parameter-based models) or auditory (referred to as subjective, MOS_LQS).
• The context of measurement: a Narrow-Band context (MOS_LQON), WideBand (MOS_LQOW) or a mix of both bandwidths (MOS_LQOM). No specific notation has been defined for a Super-WideBand context so far.
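As a toy illustration of how a MOS and a confidence interval are computed from individual scores, consider the sketch below; the ratings are invented for the example and the 95% interval uses the usual normal approximation.

```python
import numpy as np

# Toy MOS computation: each row holds one subject's ratings of three
# processing conditions on the 5-point ACR listening scale (the numbers
# are invented for the example).
ratings = np.array([[4, 3, 1],
                    [5, 3, 2],
                    [4, 4, 1],
                    [5, 2, 2]])

mos = ratings.mean(axis=0)                      # one MOS per condition
ci95 = 1.96 * ratings.std(axis=0, ddof=1) / np.sqrt(len(ratings))
for m, c in zip(mos, ci95):
    print(f"MOS = {m:.2f} +/- {c:.2f}")
```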
2.2 Auditory methods

The most accurate auditory measurement method would be an assessment by customers in natural environments. Theoretically, the customer should be able to assess the quality of an ongoing call through use of his phone keypad. In practice, such "in field" tests are hardly implemented, and speech quality is assessed thanks to artificial auditory quality tests carried out in laboratories (i.e. under designed and controlled conditions). According to Jekosch (2005), a speech quality test is: a routine procedure for examining one or more empirically restrictive quality features of perceived speech with the aim of making a quantitative statement on these features.
The ears of any individual are permanently submitted to a flow of acoustic signals. However, only the characteristics that are a source of information for the listener are analyzed. In "undirected" speech perception processes (e.g. an everyday conversation) the interlocutors exchange pieces of information (i.e. the meaning of the spoken sentences). However, during an auditory experiment the speech perception process is "directed" by the experimenter, i.e. the test subject is oriented throughout the experiment by means of directives. The directives are part of the modifying factors introduced in Sec. 1.21. In directed communications, the test subjects do not expect the same type of information as in undirected communications. In this case, subjects may focus on the sign carrier (i.e. the form of the speech signal, see Fig. 1.2)
For an example of such directives see ITU–T Rec. P.800 (1996).
Fig. 2.2 List of auditory and instrumental speech quality measurement methods
which may bias the quality perception. However, speech quality tests must reflect the quality perception of users during undirected communications. Consequently, the directives and the measurement scale have to be carefully designed by the experimenter. Following the test classification introduced by Letowski (1989), speech quality tests can be classified into four categories according to two dichotomies: (i) analytic / utilitarian, and (ii) subject-oriented / object-oriented (see Table 2.1). By using specific directives, the experimenter can, in directed communications, adjust the influence of each quality feature. The listener's attention can thus be focused on a group of speech quality features, or on a single one. The selected quality features and the corresponding perceived quality are all stored as auditory memory traces in the Short-Term Memory (see Sec. 1.1.3.1). In utilitarian test methods, subjects assess the integral quality of speech transmission systems with a single quality score on a one-dimensional rating scale. This permits a comparison between different processing conditions. On the other hand, in analytic test methods, the perceptual features of the integral speech quality are identified and then quantified. Two different approaches are available: either a single one-dimensional scale is used and the listeners are asked to focus on a given feature, or several scales, one per quality feature, are employed. In the latter case, the subjects' judgments may be decomposed into orthogonal quality features on the basis of a multidimensional analysis. In addition to this first dichotomy, the speech quality test may lead to two different analyses: (i) an object-oriented analysis about the perceived quality of processing
Table 2.1 Quality test classification following Letowski (1989)

                         Subject-oriented tests     Object-oriented tests
Utilitarian judgments    Psychoacoustic research    Speech quality assessment
Analytical judgments     Audiological evaluation    Diagnostic quality assessment
Implementing such speech quality tests is a complex task. According to Stevens (1957), Möller (2000) and Raake (2006b), five main characteristics define the exact test results. The experimenter selects the appropriate characteristics based on the number and the type of assessed processing conditions:

The presentation method: paired comparison or absolute assessment of stimuli
The scale level: a ratio-, interval-, ordinal- or nominal-scale2
The scaling method: a single- or multi-scale rating process
The test modality: listening-only, talking-only or conversation test
The analysis method: simple average or multidimensional analysis
Several examples of utilitarian and analytical methods, focusing on standard measurement methods widely used by telecommunication providers, are briefly presented hereafter. Moreover, Fig. 2.2 proposes an exhaustive list of auditory test methodologies.
2.2.1 Test subjects
The selection of the test subjects should be consistent with the test purpose. Indeed, subjects may be classified according to their knowledge about the selection of the processing conditions under test. The two corresponding groups are expert (or trained) subjects and naïve (or untrained) subjects. The learning and adaptation effects of trained subjects on the quality judgments were observed in IEEE Standards Publication 297 (1969). Utilitarian test methods are usually aimed at getting the speech quality as perceived by the "average" user population. Consequently, such tests are commonly carried out with naïve subjects, as recommended by the ITU-T organization.
2 The heard speech samples are ranked by the test subjects on an ordinal scale. On an interval (resp. ratio) scale, the difference (resp. ratio) between two categories is quantified by a numerical value.
According to ITU–T Rec. P.800 (1996), the definition of a naïve test subject is as follows: Subjects taking part in listening tests are chosen at random from the normal telephone using population, with the provisos that: 1. they have not been directly involved in work connected with assessment of the performance of telephone circuits, or related work such as speech coding; 2. they have not participated in any subjective test whatever for at least the previous six months, and not in any listening-opinion test for at least one year; and 3. they have never heard the same sentence lists before.
The work presented in Chaps. 3, 4 and 5 is based on auditory tests carried out with naïve subjects. In speech quality tests, both expert and naïve test subjects must be free of any hearing impairment. A subject's hearing ability is commonly evaluated from his hearing threshold, determined by an audiometric test. In addition, the mother tongue of the subjects must correspond to the language in use in the experiments. Both characteristics (lack of hearing impairment and native speaker) can be seen as inconsistent with the definition of an average user. Therefore, auditory tests used for market-planning purposes have different requirements. For instance, if a telecommunication service is designed for a specific segment of the population (e.g. age range, disability . . . ), the selected test subjects have to be representative of this specific category of users. Raake et al. (2008) carried out an exhaustive quality test with different groups of test subjects: a subject-oriented analysis revealed several group dependencies in the quality judgments by the subjects. For instance, "IP-expert" users gave a significantly lower quality rating than the other users under conditions with a high rate of packet loss, probably because of their past experience with VoIP systems. Contrary to utilitarian methods, analytical methods may use a complex measurement process which implies a specific training of the test subjects. The expert subjects may then have a common understanding of the auditory quality features involved in the experiment. This training process can significantly improve the objectivity and reliability of the test method. An example of a subject-selection process is available in Isherwood et al. (2003). In addition, the greater production of diagnostic information (e.g. through a linguistic description of the impairments) by expert subjects, compared to naïve ones, can lead to a reduction of the experiment cost. However, the outputs of an experiment conducted with a few trained subjects should be considered as an informal test, since they will not give a representative account of the quality perceived by the final users.
2.2.2 Speech material
According to the perception process introduced in Sec. 1.1.3, the linguistic information is extracted and stored in the Short-Term Memory for a few seconds. But, as the auditory memory traces used in the quality judgment fade very rapidly, the length of the speech samples used in quality measurements is limited to 8 seconds. The speech material used in speech quality tests must consist of simple, meaningful and phonetically balanced sentences, so as to reflect the phonemic frequencies of the subjects' language. For an example of phonetically balanced sentences in the French language, see Combescure (1981). Since the speech material has a strong influence on the perceived quality, the sentences must create an equivalent meaning in the mind of all of the test subjects. In listening quality tests, the speech samples should ideally be made of two sentences separated by a silent gap. Thus, the subjects can evaluate the background noise introduced by the transmission system without any masking effect from the speech signal. There should be no obvious connection of meaning between the two sentences. As described in Sec. 1.1.3.3, the perceived quality may be affected by the talker characteristics (e.g. gender, accent). Consequently, the ITU-T recommends that a test be conducted with at least two male and two female voices per processing condition.
2.2.3 Utilitarian methods
Utilitarian test methods are employed to assess the integral quality of speech stimuli as perceived by an end-user. In these tests, a speech sample is played to a group of subjects, who are asked to rate the quality of the sample on a one-dimensional rating scale. Then, a unique quality value, comprising the effect of all quality features, is calculated for each processing condition. Such test methods are widely used for the assessment of new speech processing applications or for the comparison of different versions of a single application (e.g. a speech coding algorithm).
2.2.3.1 Intelligibility tests
Intelligibility is only one specific attribute of the perceived integral speech quality (Voiers, 1977; Volberg et al., 2006). Even if modern transmission systems lead to an almost perfect intelligibility of the far-end speaker, specific speech processing systems such as hearing aids are still evaluated through intelligibility experiments. In such tests, the subject is asked to write down the understood parts of the speech sample heard. The output score then corresponds to a percentage of correct recognition. In the phonetic process introduced in Sec. 1.1.3.2, the first stage, comprehensibility, is specifically assessed by Vowel–Consonant–Vowel (VCV) tests or by modified versions of this test (CV, VC, CCVC, . . . ): the subject has to recognize a particular phoneme.
Examples of such tests are the Standard Segmental Test (SAM) or the CLuster IDentification test (CLID) (Jekosch, 1993). Then, intelligibility is assessed in tests such as the Diagnostic Rhyme Test (DRT) or the Modified Rhyme Test (MRT). Here, the subject has to recognize whole words. Other tests target the assessment of the final stage of the phonetic process, comprehension. Here, the whole (speech) message has to be recognized by the subject. For an example of such a comprehension test, see Raake and Katz (2006). In addition, the ITU-T standardized a test method dedicated to the assessment of the effort made by the subject to understand the meaning of the sentence. The rating scale used is called the listening-effort scale (see Table 2.2). An average over all listeners results in a mean listening-effort opinion score, MOSLE.

Table 2.2 Listening-effort scale

Effort required to understand the meaning of the sentence    Score
Complete relaxation possible; no effort required             5
Attention necessary; no appreciable effort required          4
Moderate effort required                                     3
Considerable effort required                                 2
No meaning understood with any feasible effort               1
2.2.3.2 Conversation tests
Conversation test methods are described in ITU–T Rec. P.800 (1996) and ITU–T Rec. P.805 (2007). Such experiments try to simulate a natural use of telephone services and are consequently the most relevant ones (Dimolitsas, 1993). The conversational quality reflects the interlocutors' ability to communicate throughout a call. This ability depends on the transmission quality and on conversation effectiveness factors such as echoes at the talking side, transmission delays and sidetone distortion (see Sec. 1.2.3). Two or more test subjects are asked to achieve a task specified by an interactive communication scenario. After a short conversation of about 5 minutes, the subjects assess different aspects of the connection using, e.g., a listening quality rating scale and a talking quality rating scale. Seven scales are presented in ITU–T Rec. P.805 (2007). In addition, it is usual for the experimenter to ask the subject to describe in his/her own words the nature of the degradation (e.g. echo, noise). In conversation tests, the arithmetic mean over all test subjects of the quality judgments is called the MOS–Conversational Quality Score, denoted by MOSCQS (ITU–T Rec. P.800.1, 2006). An overview of conversational quality tests is available in Möller (2000). However, the design and conduct of conversational tests are more complex than those of listening tests. In practice, only a few conditions are assessed in conversation tests. For an overview of the relationships between listening and conversational quality, see Guéguin et al. (2008).
2.2.3.3 Listening-Only tests
Listening-only experiments are carried out to gather the most important quality features. Their realism is lower than that of conversational tests, since only the speech transmission quality can be assessed. The P-Series of Recommendations published by the ITU-T, such as ITU–T Rec. P.800 (1996) and ITU–T Rec. P.830 (1996), describe a general framework of measurement methods used in assessments of speech quality. In a listening quality test (referred to as a Listening-Only Test (LOT) by the ITU-T), the listeners rate on a measurement scale a set of short speech samples, called stimuli, transmitted by different speech transmission systems. In such listening tests, the listening level is identical for all stimuli and set to 79 dBSPL (dB rel. 20 µPa), which corresponds to the preferred listening level in a NB context (ITU–T Handbook on Telephonometry, 1992).
Absolute Category Rating (ACR)
In telecommunications, the most widely used speech quality test is the Absolute Category Rating (ACR) test, which uses the 5-point integral quality scale presented in Table 2.3. The arithmetic mean over all listeners of the quality judgments is called a MOSLQS value.

Table 2.3 Absolute Category Rating (ACR) listening-quality scale

Quality of the speech    Score
Excellent                5
Good                     4
Fair                     3
Poor                     2
Bad                      1
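As a minimal illustration (not part of the Recommendation), the following sketch computes a MOSLQS value and a normal-approximation 95% confidence interval from a set of ACR ratings; the function name and the example ratings are ours:

```python
import numpy as np

def mos(ratings):
    """Arithmetic mean of ACR ratings (1-5) with an approximate 95% CI."""
    r = np.asarray(ratings, dtype=float)
    # 1.96 is the normal-distribution quantile; for small listener panels
    # a Student-t quantile would be more appropriate.
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(r.size)
    return r.mean(), half_width

# Example: eight listeners rated one processing condition.
m, ci = mos([4, 5, 3, 4, 4, 5, 3, 4])
print(f"MOS_LQS = {m:.2f} +/- {ci:.2f}")  # MOS_LQS = 4.00 +/- 0.52
```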
Degradation Category Rating (DCR)
The sensitivity of such methods, however, is insufficient for the comparison of speech processing systems of similar integral quality. In such cases, a Degradation Category Rating (DCR) method is more appropriate. For small impairments, a paired comparison (A–B) method is more sensitive than an ACR method (Combescure et al., 1982). In DCR tests, for each trial, the subjects listen to both a reference (i.e. non-degraded) and a degraded speech signal. The listener is asked to rate, on the 5-point rating scale presented in Table 2.4, the perceived degradation in quality of the processed (i.e. second) signal in comparison with the reference (i.e. first) signal. The resulting quality value is referred to as the Degradation Mean Opinion Score (DMOS).
The DCR method is part of the quality test framework defined by ITU–T Rec. P.800 (1996) and ITU–T Rec. P.830 (1996).

Table 2.4 Degradation Category Rating (DCR) scale

Score    The degradation is . . .
5        inaudible
4        audible but not annoying
3        slightly annoying
2        annoying
1        very annoying
Comparison Category Rating (CCR)
Another type of standard quality test uses a reference speech sample which may be of lower quality than the rated sample. The scale in use is, thus, the two-sided rating scale presented in Table 2.5. This method, called Comparison Category Rating (CCR), can be seen as a refinement of DCR tests in which the reference can be presented in either the first or the second position (A–B and B–A). The resulting quality value of a CCR test is a Comparison Mean Opinion Score (CMOS). The CCR method is also part of the quality test framework defined by ITU–T Rec. P.800 (1996) and ITU–T Rec. P.830 (1996).

Table 2.5 Comparison Category Rating (CCR) scale

Score    Quality of the second stimulus compared to the first one
3        Much better
2        Better
1        Slightly better
0        About the same
−1       Slightly worse
−2       Worse
−3       Much worse
2.2.3.4 High-quality listening tests
ITU–T Rec. P.800 (1996) has been defined for the assessment of Narrow-Band telephony. With the introduction of WideBand transmissions, the ITU-T published a specific recommendation for the evaluation of WB speech codecs, ITU–T Rec. P.830 (1996). It slightly differs from ITU–T Rec. P.800 (1996). For instance, in WB tests, the listening terminal should reproduce at least the WB bandwidth: the typical IRS-type user terminal is replaced by a high-quality headphone. In addition, the listening mode, which is usually "monotic" in NB tests, is replaced by a "diotic" presentation of the stimuli.3 Nowadays, signals can be transmitted with a wider bandwidth than in the past (e.g. S-WB telephony). Unfortunately, such methodologies are not suited to this range of quality. Consequently, high-quality speech processing systems are assessed by methodologies used in the audio world and published by the Radiocommunication sector of the ITU (ITU-R). Two of these methodologies are presented below.
Assessment of small impairments
In WB and S-WB telephony, an auditory method for the assessment of small impairments in audio systems is usually employed (ITU–R Rec. BS.1116–1, 1997; ITU–R Rec. BS.1284–1, 2003). In this method, three audio samples, A, B and X, are presented to the listener, who is asked to select which of the samples, A or B, is identical to the reference, X. Then, the listener also rates the degradation of the other sample through comparison with X. This method is appropriate for the assessment of small impairments such as those introduced by high-quality audio coding algorithms.
MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA)
Another ITU-R standardized method used in audio quality tests is MUSHRA (ITU–R Rec. BS.1534–1, 2003). In this test, several speech samples, including a known reference and hidden anchors, are presented together through a multi-scale interface. The subject is asked to rate the whole set of stimuli, except the known reference, on a continuous scale defined from 0 (i.e. lowest quality) to 100 (i.e. best quality). This scale is divided into five equal intervals labeled with the same adjectives as the listening-quality scale shown in Table 2.3: bad, poor, fair, good and excellent. The resulting scores quantify the degradation of the conditions under test in comparison with the known reference.
3 A monotic, or monaural, mode corresponds to a presentation of the speech stimuli to only one ear (left or right, depending on the subject). A diotic mode corresponds to a presentation of the same signal to both ears. A diotic mode differs from a dichotic presentation, where the signals sent to the right and left ears are different. Such a dichotic mode can be either stereo or binaural when recorded with an artificial head (ITU–T Rec. P.58, 1996).
2.2.4 Analytical methods
A utilitarian test method quantifies the integral speech quality as it is perceived by an end-user. However, the information provided by a single quality value is insufficient to allow comparisons between very different processing conditions. Indeed, two communication systems may have the same integral quality but a totally different behavior. Analytical test methods give diagnostic information about the assessed processing conditions. Such quality tests rely on either a multi-scale rating process (e.g. SD) or a multidimensional analysis of the auditory results (e.g. MDS).
2.2.4.1 Diagnostic Acceptability Measure (DAM)
Voiers (1977) developed a specific multidimensional scaling method, called the Diagnostic Acceptability Measure (DAM), which assesses several quality features of speech samples. The subjects evaluate the speech samples on 20 continuous rating scales. Each scale is dedicated to the assessment of a given quality feature, from negligible (0) to extreme (100). This auditory method has the advantage of accounting for individual differences in taste and preference. The scales are divided into three categories: (i) features related to the speech signal (e.g. interrupted, rasping), (ii) features related to the background noise (e.g. hissing, babbling), and (iii) features covering both speech and background noise (e.g. intelligibility, acceptability). However, such a test is expensive and time-consuming, since the listeners have to be trained beforehand (experienced listeners). Finally, on the basis of a linear relationship between the quality features (related to the speech signal and the background noise) and the acceptability, the auditory results can also be used for diagnostic purposes.
2.2.4.2 Semantic Differential (SD)
The Semantic Differential (SD) method developed by Osgood (1952) was first applied to the definition of a semantic space related to "words". It uses a set of opposite attributes, i.e. pairs of antonym terms (e.g. small/large and wet/dry). Each pair of antonyms defines the poles of a continuous rating scale. This method relies upon the following hypotheses (Osgood, 1952): The process of description or judgment can be conceived as the allocation of a concept to an experiential continuum, definable by a pair of polar terms. A limited number of such continua can be used to define a semantic space within which the meaning of any concept can be specified.
The subject is asked to judge the intensity and the polarity of the feature underlying the pair of antonyms (e.g. volume and wetness, respectively). Using such pairs for characteristics related to voice- and speech-quality features (e.g. low/high) makes the application of the SD method to diagnostic purposes possible, e.g. see McGee (1964). This measurement method is sometimes referred to as attribute scaling.
2.2.4.3 Evaluation of Noise Reduction (NR) algorithms
An important perceptual dimension of speech communication quality is the amount of noise in the transmitted signal. Communication systems in which background noise can be present, e.g. mobile phones or hands-free terminals, are more and more frequent. A real background noise brings information about the environment of the far-end talker, especially in speech-free periods. As the integral quality of a speech signal polluted by noise is highly degraded, Noise Reduction (NR) systems have been integrated into user terminals. Such NR systems are designed to increase the SNR, but they may degrade the speech signal itself. Recently, the ITU-T published an analytical measurement method for the evaluation of noise reduction algorithms (ITU–T Rec. P.835, 2003). This methodology uses three 5-point rating scales to assess the quality of the speech signal alone (speech signal distortion), of the background noise alone (background noise intrusiveness), and the integral quality. The subject is asked to listen to the same speech sample three times; a silent pause after each listening allows him to score the sample on one of the three rating scales.
2.2.4.4 Assessment of speech quality dimensions
In the SD method, the poles of the scales are labeled with a pair of antonym quality features (e.g. Continuous–Discontinuous). However, in SD and DAM tests, all scales are presented simultaneously. In a recent analytical method dedicated to diagnostic purposes, developed by Wältermann et al. (2010b), speech samples are assessed on three continuous rating scales dedicated, respectively, to the perceptual speech quality dimensions Discontinuity, Noisiness and Coloration, previously derived by Wältermann et al. (2008). The measurement takes place in two steps: (i) an LOT/ACR "overall quality" test according to ITU–T Rec. P.800 (1996), and (ii) a "dimension assessment". In the interval between them, the meaning and use of the three scales are described in detail to the subjects by means of directives. Moreover, for each scale, examples are proposed for training.
2.2.4.5 Single quality feature
In an auditory test, the focus may be on a single specific quality feature. For instance, the listening level has a strong influence on the integral speech quality. Consequently, the ITU-T recommends an ACR scale for the specific assessment of the preferred listening level, see Table 2.6 and ITU–T Rec. P.800 (1996). The output quality score (mean loudness-preference opinion score) is denoted by MOSLP. Following an extensive campaign of auditory experiments conducted in the 1980s, the preferred listening level for a monaural listening situation was found to be 79 dBSPL. This finding led the ITU–T Handbook on Telephonometry (1992) to recommend the use of this specific level in all monaural speech quality experiments. However, according to ITU–T Contrib. COM 12–11 (1993), a level higher than the preferred listening level leads to a higher MOSLQS value.
Table 2.6 Loudness-preference scale

Loudness preference            Score
Much louder than preferred     5
Louder than preferred          4
Preferred                      3
Quieter than preferred         2
Much quieter than preferred    1
The maximum speech quality is obtained at the optimum listening level; at levels higher than this optimum, the integral speech quality decreases. In addition, ITU–T Contrib. COM 12–11 (1993) showed that the difference between the preferred and the optimum listening levels is dependent upon other quality features such as the bandwidth. For instance, the difference is more marked under WB conditions than under NB ones.
2.2.4.6 Multi-Dimensional Scaling (MDS)
A general description of this statistical analysis method is available in Kruskal (1964). Contrary to the other analytical methods, Multi-Dimensional Scaling (MDS) focuses only on the perceptual "differences" between stimuli. This method requires dissimilarity data between several speech stimuli, acquired from a similarity test performed on a continuous scale labeled with the two attributes "very similar" and "not similar at all". In the ideal case, the similarity of all N(N−1)/2 possible pairs of stimuli (given N speech samples) is judged by the subjects. Then, a multidimensional similarity space can be derived from the auditory results. The number of dimensions that defines the derived space is a compromise between the covered variability of the original similarity results and the possibility to interpret each dimension. The different MDS techniques are distinguished through use of the following criteria:

Classical / Nonclassical: In classical MDS, a single dissimilarity space is derived for all subjects, whereas in nonclassical MDS several spaces are derived. For instance, in weighted MDS (also known as INDividual difference SCALing, INDSCAL; Carroll and Chang, 1970), a subject space is derived in addition to the similarity space. The subject space shows the weight given to the dimensions by each subject.

Metric / Nonmetric: In metric MDS, test subjects are required to quantify the dissimilarity. In nonmetric MDS, on the other hand, they judge the rank order of the dissimilarities. For instance, the results issued from a triadic comparison test can be analyzed by a nonmetric MDS.
The interpretation of the derived dimensions is relatively complex. A first possibility is the "arbitrary" selection of an attribute through an experts' exhaustive evaluation of the degradation differences along one specific dimension. A second possibility is the comparison of the derived space with other auditory test results. For instance, the dimensions can be described by their degree of correlation with the antonym pairs of attributes used in an SD test.
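For illustration, a small sketch of an MDS analysis of a dissimilarity matrix using scikit-learn (assumed available, version 1.2 or later); the matrix values are invented for the example:

```python
import numpy as np
from sklearn.manifold import MDS

# Dissimilarity matrix for N = 4 stimuli, i.e. the N(N-1)/2 = 6 pairwise
# judgments symmetrised around a zero diagonal (values invented).
D = np.array([[0.0, 0.2, 0.8, 0.7],
              [0.2, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.3],
              [0.7, 0.8, 0.3, 0.0]])

# metric=False selects a nonmetric MDS: only the rank order of the
# dissimilarities is preserved in the derived space.
mds = MDS(n_components=2, dissimilarity="precomputed", metric=False,
          normalized_stress="auto", random_state=0)
space = mds.fit_transform(D)
print(space)        # 2-D coordinates of the four stimuli
print(mds.stress_)  # goodness of fit of the derived space
```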
2.2.4.7 Preference mapping
Preference mapping is a multidimensional statistical analysis of preference judgments. In this case, a preference test, such as an ACR listening quality test or a paired-comparison preference test, is conducted first. It is followed by a factor analysis, made for example through application of a Principal Component Analysis (PCA) algorithm to the test results. Two types of preference mapping can be distinguished: the internal and the external preference mapping methods. Internal preference mapping provides a multidimensional representation of the speech stimuli in which test subjects, or groups of test subjects, are represented as vectors. Carroll (1972) developed an internal preference mapping algorithm called multidimensional preference scaling (MDPREF). An external preference mapping uses a pre-existing multidimensional representation of the speech stimuli. Then, the relationship between the preference for the speech stimuli and each dimension (e.g. a quality feature) is derived. External preference mapping is widely used in the food industry in order to adapt or create new products for each segment of the population.
2.2.5 Relativity of subjects’ judgments
Perception in real world is about perception in context.
(Lotto and Sullivan, 2008)

Many factors can influence the way the user perceives the speech sample under test. Quality scores are "relative" to the test characteristics and consequently not "absolute". The following section briefly reviews several aspects of an auditory test liable to affect the subject's judgment. For an exhaustive review of all influencing factors, see Möller (2000), Poulton (1979) and Zieliński et al. (2008).
According to the description made by Jekosch (2005), these aspects may be classified into three groups:

The scaling-effect: induced by the use of the measurement scale as an interface with the subject.
The subject-effect: generated by the use of human listeners as an instrument of measurement.
The context-effect: due to the relationship between the context and the use of speech as an object of measurement.
This means that getting an absolute quality value based on the subject’s judgment is not possible. Following some simple guidance rules, the experimenters try to get a quality score as absolute as possible in order to differentiate and compare the processing conditions under study in the test. These rules are aimed at reducing biases in subjects’ judgments.
2.2.5.1 Scaling-effect
Jekosch (2005) assumed that the scale has a strong influence on the results. It should enable each subject to encode the different features he has perceived. On the other hand, a subject's rating should not represent his own interpretation of the speech message (i.e. its meaning) but rather a common perception of the acoustic quality features (i.e. its form, see Sec. 1.1.3). A review of all of the measurement scale effects is available in Poulton (1979).

• Intervals between categories
In category scales (e.g. the ACR method), the intervals between two categories (i.e. the quality scale labels) may be unequal, and this inequality leads to non-linear measurement scales. Such scales are referred to as ordinal scales. However, simple statistical parameters such as the arithmetic mean have been developed for interval and ratio scales (Möller, 2000). These non-linearities are attenuated by introducing a numerical value in front of each category, see Table 2.3.

• Language
The translation of the category names into another language influences the MOS values. For instance, Zieliński et al. (2008) showed that the semantic difference between the English words "Fair" and "Poor", and the one between their French equivalents, "Assez bon" and "Médiocre", are not alike. Quality assessments are, thus, language-dependent.

• Sensitivity of category scales
Even though a discrete 5-point scale seems to be the preferred scale in terms of "ease of use", a 5-point MOS scale has a relatively low sensitivity (Preston and Colman, 2000). On the other hand, sensitivity is increased when a continuous scale is employed for rating speech quality (ITU–T Contrib. COM 12–39, 2009), since the standard deviation of the processing conditions is reduced.
However, ITU–T Contrib. COM 12–120 (2007) showed that subjects mostly use notches on continuous scales (e.g. numbers or category names) to judge the stimuli.

• Saturation-effect
Among the other scale-effects, let us cite the saturation-effect: the extreme categories of the scale are neglected by naïve subjects, which introduces non-linearities. In ITU–T Contrib. COM 12–39 (2009), the authors showed the saturation-effect of the discrete ACR scale, see Table 2.3.
2.2.5.2 Subject-effect
In the specific case of a listening-only test, the subject-effect and the context-effect can be described by Fig. 2.3. This diagram shows the temporal relationship between both effects and their impact on the subject's judgment about the current speech stimulus (i.e. at t0). Three specific parts of the time scale are defined. The context-effect corresponds to the last two parts, but it is also dependent on the test situation. The left part corresponds to the subject-effect; it is caused by the differences in the internal reference of each subject. The internal reference corresponds to his overall experience in telecommunications. Each subject has his own opinion about the importance of each quality feature involved in the integral speech quality. However, the test reliability is decreased by the variations of judgments with the personal internal reference: indeed, the subject expects the auditory event to have a perceptual quality similar to his own internal reference. According to Takahashi et al. (2005b), subjects show a preference for one or the other of the specific NB or WB bandwidths. To reduce this effect, the ITU-T proposed the two solutions described hereafter:

• The introduction text, read to the subjects at the beginning of the auditory test (i.e. the "directives", see the introduction of Sec. 2.2), should include a "question" related to the quality scale in use. This question defines how the subjects have to judge the speech samples. It has a strong influence on the features involved in the quality judgment, and finally on the utility of the test results. The ITU–T Handbook on Telephonometry (1992) recommended some specific questions/scales, such as: "Please give your opinion on whether you or your partner had any difficulty in talking or hearing over the connection" according to the following rating scale: Yes (Difficulty), No (No difficulty).
Fig. 2.3 Influences of the subject-effect and the context-effect (i.e. corpus-effect and order-effect) on the subject's judgment versus time (Côté et al., 2009). [Figure: time axis ending at t0, the instant of the subject's judgment; the subject-effect covers the period up to t0 − 1 h, the corpus-effect the period from t0 − 1 h to t0 − 8 s, and the order-effect the last 8 s.]
scale”: Yes (Difficulty), No (No difficulty). • The number of subjects should be large enough to get a certain degree of objectivity in the quality judgments since the resulting average over all subjects corresponds to the inter-individual agreement. This led the ITU–T Handbook on Telephonometry (1992) to recommend sets of 30 subjects. In addition, to reduce this subject-effect, some stimuli are usually presented over a training period prior to the conduct of the experiment. These stimuli include the highest and lowest qualities of the test corpus, which are then seen like an anchor by the subjects. At last, to avoid any fatigue effect, and a consequent decrease in the accuracy of subjects’ judgments, it is recommended to interrupt the test procedure by short breaks at regular intervals (e.g. 15–20 minutes).
2.2.5.3 Context-effect
The context-effects represent the influence of the assessment situation. One of the most influential biases in auditory results corresponds to the listening environment. Assessments by subjects are made through in-laboratory tests, which are quite different from a real-life situation. For Guski and Blauert (2009), the strongest bias of auditory judgments obtained in laboratory tests, compared to real-life situations, is their restriction to one signal modality. Perception in the real world is multimodal, and in the case of speech signs two modalities are perceived: vision and sound. However, in the specific case of telephony studies, the lack of visual cues reduces the gap between real-life and laboratory environments.
Corpus-effect
The central part of Fig. 2.3 corresponds to the corpus-effect; t0 − 1 hour refers to the test start. As subjects' judgments are affected by both the range of degradations within the test corpus and their distribution over the quality range, the interpretation of MOS values depends on the corpus-effect. Hardy (2003) stated that conditions included in the test corpus should be realistic and, thus, account for the quality range met in an "ecologically valid" situation, i.e. a real telephony situation. Several studies have dealt with the influence of the context-effect on MOS values, and more specifically with the influence of the corpus-effect. For instance, following their investigations on the impact of bandwidth restriction, Möller et al. (2006) found that an uncoded NB (i.e. 300–3 400 Hz) condition obtains a higher MOS value in a purely NB corpus than in a mixed-band one where high-quality WB (i.e. 50–7 000 Hz) conditions are introduced. Côté and Durin (2008) demonstrated that all perceptual dimensions are dependent on this corpus-effect, which can, however, be reduced by introducing several reference conditions equally distributed over the judgment scale (Barriac et al., 2004). For instance, the systematic introduction of a WB condition in a purely NB test corpus may reduce the corpus-effect for the perceptual dimension Coloration.
Ideally, all corpora should include NB, WB and S-WB conditions, but such a requirement is not feasible in practice. In addition, studies by Takahashi et al. (2005b) on the subject's sensitivity in a purely NB context and in a mixed-band one suggested that the introduction of WB conditions causes no decrease in sensitivity. In order to compare new auditory tests to those previously carried out in a NB context, Barriac et al. (2004) proposed to define a mapping function from the NB MOS scale to the WB MOS one. Therefore, the ITU-T introduced in ITU–T Rec. P.800.1 (2006) a specific label for the context-related MOS values: the NB context gives MOSLQSN values, whereas the WB one provides MOSLQSW values and the mixed-band one, NB/WB, leads to MOSLQSM values (see Sec. 2.1).
Order-effect
The right part of Fig. 2.3 corresponds to the influence of the preceding stimuli on how the current stimulus is judged, the most recent stimulus having the strongest influence. This is called the order-effect. This effect is attenuated by using different listening orders for each subject (or group of subjects).
2.2.6 Reference conditions and normalization procedures
Some of the biases introduced in the previous section are reduced by setting reference conditions with corresponding normalization procedures applied to the MOS values. Such reference conditions correspond to processing schemes defined by a known-in-advance parameter, which quantifies the introduced degradation. These reference units are added to the auditory test corpora and cover a defined quality range, so that the quality of the conditions under study falls within this range (i.e. one reference unit has the lowest quality and another the best quality of the test corpus). In addition, for a given auditory test, the MOS values of the conditions under study can be transformed to an "absolute" scale through a normalization procedure aimed at ruling out some of the test-specific effects. Consequently, the introduction of reference conditions enables one to compare results across tests (and even across laboratories) despite the differences in languages, methodologies, etc. The MOS values obtained for all processing conditions can be normalized to a specific range (e.g. 1–4.5) by using a simple linear mapping function:

MOSnorm,i = (MOSi − MOSmin) / (MOSmax − MOSmin) · (MOSlim − 1) + 1 ,   (2.1)
where i corresponds to the current condition, MOSmin and MOSmax are, respectively, the lowest and highest MOS values of the corpus, MOSlim is the highest desired MOS value (e.g. 4.5) and MOSnorm,i is the resulting normalized MOS value.
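As an illustration, a minimal Python sketch of the normalization in (2.1); the function name and the example values are ours, not part of any Recommendation:

```python
def normalize_mos(mos_values, mos_lim=4.5):
    """Linear mapping of per-condition MOS values onto [1, mos_lim], Eq. (2.1)."""
    mos_min, mos_max = min(mos_values), max(mos_values)
    return [(m - mos_min) / (mos_max - mos_min) * (mos_lim - 1.0) + 1.0
            for m in mos_values]

# Example: a corpus whose raw MOS values span 1.8-4.1, stretched to 1-4.5.
print(normalize_mos([1.8, 2.5, 3.3, 4.1]))  # -> [1.0, 2.07, 3.28, 4.5] (rounded)
```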
2.2.6.1 Modulated Noise Reference Unit (MNRU)
Over the first half of the twentieth century, transmitted speech was mainly degraded along the specific perceptual dimension Noisiness. This led Rothauser et al. (1968) to define a "reference unit" as a signal of the same nature as the samples included in the test corpus. Nowadays, a consensus seems to have been reached that the references and the conditions under test should be compared along the same perceptual dimensions. For Rothauser et al. (1968), the easiest way to produce reference conditions is, therefore, the introduction of noise into speech samples. This additive noise can be white or shaped (e.g. pink noise), stationary or modulated with the speech signal amplitude (Law and Seymour, 1962). An example of the latter type was standardized as the Modulated Noise Reference Unit (MNRU) in ITU–T Rec. P.810 (1996) and further used quite extensively in the assessment of speech codecs. In this specific case, the noise is correlated with the speech signal. The degradation introduced is similar to the quantizing noise produced by the logarithmic PCM technique used by waveform speech codecs. A detailed description of the MNRU normalization approach is available in App. A. Nowadays, however, MNRUs neither account for the current diversity of degradations, nor reduce the corpus-effect. For instance, an MDS analysis enabled Hall (2001) to demonstrate that signal-correlated noises are perceptually different from the non-linear degradations introduced by low bit-rate speech codecs. The perceptual dimension Noisiness is far more affected by MNRU conditions than by low bit-rate speech codecs.
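A simplified sketch of the speech-correlated noise idea behind the MNRU; the actual ITU–T Rec. P.810 processing includes further details (e.g. band-limiting) that are omitted here, and the function name and example signal are ours:

```python
import numpy as np

def mnru(x, q_db, rng=None):
    """Add speech-modulated noise so that Q [dB] is (approximately) the
    ratio of the speech power to the modulated-noise power."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(len(x))  # unit-variance white noise
    return x * (1.0 + 10.0 ** (-q_db / 20.0) * noise)

# Example: degrade a 1 s, 8 kHz tone standing in for speech at Q = 20 dB.
t = np.arange(8000) / 8000.0
x = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
y = mnru(x, q_db=20.0)
```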
2.2.6.2 Standard speech codecs
A single auditory test rarely covers the whole perceptual space defined by modern speech transmission systems. In the extreme case where the focus is only on the quality degradation introduced by a specific speech codec, a test corpus including strong degradations like MNRU conditions will likely introduce a corpus-effect. It will prevent both an exact quantification of the speech codec quality and inter-comparisons of speech codecs. In the last decade, these considerations led the ITU-T to adopt a different type of reference units. The quality of several common speech codecs was quantified by a parameter called the "equipment impairment factor", Ie, derived from previously carried out auditory tests. ITU–T Rec. G.113 (2007) defines Ie values for several speech codecs introduced in Sec. 1.3.4.1. Then, some of these common speech codecs were proposed as reference conditions in auditory tests (ITU–T Rec. P.833, 2001) or in a pool of stimuli whose quality is estimated by an instrumental
model (ITU–T Rec. P.834, 2002). Wideband versions of the normalization procedures were recently published as ITU–T Rec. P.833.1 (2008) and ITU–T Rec. P.834.1 (2009). The latter procedure is described in detail in Sec. 3.2.3.
2.3 Instrumental methods
As described in the introduction of this chapter, auditory methodologies rely on judgments by subjects, who are asked to give their opinion about the quality of a speech signal. Since auditory tests are costly and time-consuming, instrumental methods, referred to as quality models, have been developed. They consist in a computer program designed to automatically estimate the perceived quality of speech signals. Such a method is based on a mathematical model, which establishes a relationship between a sensation and a physical magnitude. A first "psychophysical" model, developed by Fechner (1860) and referred to in the literature as the "Weber–Fechner law", states that:

S = α · ln(φ / φ0) ,   (2.2)

where S is the sensation (i.e. perceived intensity), φ is a physical parameter and φ0 is the perception threshold. Such mathematical models provide an estimation of the sensation perceived by human subjects. Since auditory methods are the most reliable way to assess the perceived quality of a system under study, the development of an instrumental model must be based on auditory results.
Fig. 2.4 Development procedure of an instrumental model, from Wolf et al. (1991). [Figure: flow diagram with four blocks: design 1st auditory quality test → develop candidate instrumental quality measure → design 2nd auditory quality test → validate instrumental quality measure (ρ, σ).]
According to Wolf et al. (1991), this development should consist of four main steps: (i) the design of a first auditory test, (ii) the development of a candidate instrumental measure based on the auditory test results, (iii) the design of a second auditory test, and (iv) the validation of the instrumental method on the auditory results issued from the second auditory test, through use of statistical parameters, e.g. the Pearson correlation coefficient, ρ, and the prediction error, σ (see Sec. 5.1.3.2). In the case where the candidate instrumental measure fails the validation phase, additional developments, including the design of a new auditory test, may be necessary. Consequently, enhancements of instrumental models have always been related to the development of new speech processing systems. The accuracy of a candidate model is quantified by comparing the quality estimations with the auditory speech quality ratings. This accuracy is used as the main criterion for the validation of candidate models. A detailed validation procedure is described in Sec. 5, and the four steps are depicted in Fig. 2.4. The resulting instrumental quality measure is highly dependent upon the design of the first auditory test (e.g. the scale level, the presentation method and the test conditions). Contrary to auditory methods, instrumental measures are restricted to specific applications. In this sense, over-generalization of instrumental measurement and calculation methods is heavily criticized by scaling experts (Jekosch, 2005). Instrumental models can be characterized by six criteria adapted from Rabiner (1995):

Completeness: All of the speech processing systems already in use throughout the world fall within the scope of the model. This criterion shows that the development of speech quality models has been intimately related to the historical evolution of speech processing systems.
Accuracy: The most widely used criterion. The estimated scores are correlated with human perception.
Credibility: The estimation is easily interpretable.
Extensibility: The scope of the model can increase.
Manipulability: The model is easily employed. The model must be totally self-sufficient: there is no need for fine tuning by the users.
Consistency: The relationship between the estimations and the auditory results is monotonic (internal consistency). The absolute estimated values have approximately the same magnitude as the auditory results (external consistency).
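The validation step (iv) above can be illustrated by a small sketch computing the two statistical parameters mentioned, with σ taken here as the root-mean-square prediction error; the data are invented:

```python
import numpy as np

def validate(mos_auditory, mos_estimated):
    """Pearson correlation rho and prediction error sigma (here the RMSE)
    between auditory MOS values and the estimations of a candidate model."""
    a = np.asarray(mos_auditory, dtype=float)
    e = np.asarray(mos_estimated, dtype=float)
    rho = np.corrcoef(a, e)[0, 1]
    sigma = float(np.sqrt(np.mean((a - e) ** 2)))
    return rho, sigma

# Invented per-condition scores for the second auditory test.
print(validate([1.5, 2.8, 3.6, 4.2], [1.8, 2.6, 3.9, 4.1]))
```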
Instrumental methods have different applications, such as the daily monitoring of transmission networks or the optimization of speech processing systems. Instrumental quality models are classified into three different groups according to their assessment paradigm (Takahashi et al., 2004):

• Parameter-based models: The quality elements of the transmission path are characterized by parameters, which are used to plan future transmission networks.
• Signal-based models: They use signals either transmitted through a telephony network or degraded by a speech processing system, to evaluate under-development and in-use transmission networks and speech processing systems.
• Packet-layer models: They analyze the parameters provided by transmission networks, such as the pattern of packets transmitted by VoIP networks, to monitor in-use packet-switched networks.

The choice of an instrumental model depends on the current state of the speech processing system under study. The development of such systems is described by a quality loop (Jekosch, 2005) such as the one presented in Table 2.7. The quality elements are selected and evaluated over a network-planning phase (i.e. the network is not yet set up). During this planning phase, parameter-based models inform the telecommunication companies about the quality of the future transmission system. The parameter-based models can, thus, only predict the perceived quality of the future system. During the execution phase, signal-based models compare different versions of a speech processing system or different network configurations. Then, in the usage phase, telecommunication companies monitor, and maintain when needed, in-service networks, or optimize some algorithms. Such estimations are provided by signal-based and packet-layer models. After a short recall of the historical evolution, the next sections will give typical examples of applications of the three types of instrumental models and highlight their limitations (see Fig. 2.2 for an exhaustive list of instrumental quality models).

Table 2.7 Quality loop (Jekosch, 2005)

Phase    Name         Description
1st      Planning     Market research, design, testing and production planning
2nd      Execution    Production, final testing and distribution
3rd      Usage        Monitoring and maintenance
2.3.1 Parameter-based models
The integral quality of an entire transmission path can be assessed from the characteristics of each element of the network. A relationship between the physical characteristics of the elements and the corresponding perceived quality is then established through a set of parameters defining each element of the transmission
system from the talker’s mouth to the listener’s ear. From this set of parameters, parameter-based models are able to predict the speech communication quality of future networks, before the implementation of the system under study. In the next paragraphs, the historical evolution of network-planning models will be briefly recalled so as to further describe the corresponding relationships between the physical parameters of either the transmission network or user terminal and the expected speech communication quality.
2.3.1.1 Loudness Rating
Fletcher and Galt (1950) developed an assessment procedure for the loudness loss mainly induced by electro-acoustic devices in telephone networks. The resulting measure, expressed in dB, is referred to as the Loudness Rating (LR). This parameter is used in telephonometry to express the sum of the attenuations introduced by the transmission path in each frequency band. The transmission path is compared to a reference system. Different reference systems have been developed, such as the orthotelephonic reference position (ITU–T Handbook on Telephonometry, 1992) (i.e. 1 m air path) and the IRS (ITU–T Rec. P.48, 1988). The LR model was published as ITU–T Rec. P.76 (1988). The specific acoustic procedure in use in LR measurements is available in ITU–T Rec. P.79 (2007). The overall transmission loss generated by the entire transmission path is the Overall Loudness Rating (OLR), OLR = SLR + JLR + RLR, the sum of three parameters:

SLR: Send Loudness Rating, from the talker's mouth to the handset microphone output.
JLR: Junction Loudness Rating, linear and non-linear distortion in the transmission network.
RLR: Receive Loudness Rating, from the handset loudspeaker input to the listener's ear.
2.3.1.2 Opinion models
In the 1960s and 1970s, several national telecommunication companies carried out auditory tests so as to evaluate and facilitate the extension and maintenance of their telephony networks. From the auditory test results, they derived algorithms predicting the opinion expressed by users about the phone connection, i.e. their own auditory "reaction" to the telephone network. Such first-generation parameter-based models are often termed opinion models. They cover almost all of the degradations observed in analog telephony networks: attenuation of the transmission path, circuit noise, environmental noise, quantizing noise, talker echo and sidetone. However, the principles in use in each of these models are different. In 1993, four different parameter-based models, all designed around the Loudness Rating (LR) model, were proposed in ITU–T Suppl. 3 to P-Series Rec. (1993). Among them, the Bellcore Transmission Rating (TR) model developed by Cavanaugh et al. (1976)
relies on the use of the OLR value of the transmission system, whereas the three other ones employ an auditory perception model mainly based on the LR model.

TR: From mainly mapping functions between the opinions expressed by subjects and empirical data, the Bellcore TR model (Cavanaugh et al., 1976) predicts the quality of a telephone network on a "transmission rating scale" (R-scale). As this scale is anchored at two points, the produced scores are much less dependent on the context-effect. The combination of input scalar parameters leads to a transmission rating factor, R, which increases monotonically with the transmission quality.

CATNAP: The British Telecom Computer-Aided Telephone Network Assessment Program (CATNAP) was proposed in ITU–T Suppl. 3 to P-Series Rec. (1993) and based on a previous model called SUBjective MODel (SUBMOD), first described by Richards (1974). From a theoretical model of human auditory perception, CATNAP simulates this perception process through cause-and-effect relationships between the input parameters and output values. The input parameters are frequency-dependent quantities, e.g. transmission path sensitivity, listener's hearing and speaker's talking features, room noise spectra and sidetone characteristics. CATNAP provides two opinion scores: (i) the conversation opinion score (YC) and (ii) the listening-effort score (YLE).

II: The Information Index (II) method developed by Lalou (1990) predicts the transmission quality using both scalar parameters and frequency-dependent ones. Two scores are provided: (i) the listening information index (Ii), and (ii) the information index in a conversation context (Ic).

OPINE: The Overall Performance Index model for Network Evaluation (OPINE) developed by Osaka and Kakehi (1986) predicts the quality of a speech communication. It relies on the additivity theorem established by Allnatt (1975), which states that all psychological factors are additive on a psychological scale. The OPINE model uses scalar parameters and frequency-dependent parameters.
All four models were developed to predict the quality of fixed-line networks such as the PSTN. One should note that none of them considers the distortions introduced by digital systems such as low bit-rate speech codecs.
2.3.1.3 E-model
In 1997, the different opinion models were integrated into a new parameter-based model, the E-model, by Johannesson (1997): for instance, the transmission rating scale (R-scale) and the "additivity property" of impairment factors used in the OPINE model are taken into account by the new algorithm.
The author also included the effects related to modern digital networks. The E-model is thus adapted to both traditional impairments, such as echo and transmission delay, and degradations introduced by modern transmission scenarios (e.g. non-linear distortions introduced by low bit-rate codecs). This parametric model was first published as ETSI ETR 250 (1996) by the European Telecommunications Standards Institute (ETSI). Then, it was published as ITU–T Rec. G.107 (1998). The listening and talking terminals (e.g. LR), the transmission network (e.g. delay, circuit noise) and several environmental factors (e.g. environmental noise) are characterized by 21 input parameters. The transmission quality rating, the R-value, is computed as:

R = R0 − IS − Id − Ie + A ,   (2.3)

where R0 is the "highest" Signal-to-Noise Ratio (SNR) in the absence of other impairments. This SNR is based on the basic noise parameters (real background noises and circuit noises). Then, each impairment factor quantifies a specific degradation. For example, IS represents the impairments that occur simultaneously with the speech signal, Id encompasses the impairments related to the conversational effectiveness (impact of delay and echo), and Ie corresponds to the equipment impairment factor introduced by a low bit-rate codec. The advantage factor A allows a compensation of the impairment factors in terms of "advantage of access" (e.g. cordless handset). The E-model is mainly used in network planning, where it helps planners to make sure that users will be satisfied with the overall transmission system. It determines a conversational quality on the R-scale, ranging from R = 0 (the poorest possible quality) to R = 100 (the best quality).4 Then, the following mapping function (2.4) enables one to transfer the predicted R-value to an opinion scale, i.e. into a MOSCQEN value (in a NB context):

MOS = 1   for R < 0
MOS = 1 + 0.035 R + R (R − 60)(100 − R) · 7 · 10−6   for 0 ≤ R ≤ 100
MOS = 4.5   for R > 100 .   (2.4)
The current algorithm has been developed for NB connections and is currently being extended to WB scenarios (Raake et al., 2010). An exhaustive description of the achieved steps will be given in Sec. 3.2.1. A set of default values for the 21 input parameters has been published. These default values correspond to a standard ISDN connection defined by: (i) user terminals corresponding to the IRS (ITU–T Rec. P.48, 1988), (ii) a certain amount of environmental noise, and (iii) an ITU–T Rec. G.711 (1988) speech codec. This channel obtains a relatively high R-value (R = 93.2).
4 Several parameter values, e.g. Nfor > −64 dBmp, lead to R-values greater than 100, which are outside the permitted range defined in ITU–T Rec. G.107 (1998) and are thus normalized to 100.
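For illustration, a direct transcription of mapping (2.4) into Python; the function name is ours:

```python
def r_to_mos(r):
    """Mapping (2.4) from the E-model transmission rating R to a MOS value."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# The default NB connection of ITU-T Rec. G.107 (R = 93.2):
print(round(r_to_mos(93.2), 2))  # -> 4.41
```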
A recent update of the E-model (ITU–T Del. Contrib. COM 12–44, 2001) was aimed at predicting the communication quality when random transmission errors occur in packet-switched networks. The equipment impairment factor, Ie, was adjusted towards an Ie,eff value, which quantifies the impact of packet loss on the speech quality as follows:

Ie,eff = Ie + (95 − Ie) · Ppl / (Ppl + Bpl) ,   (2.5)

where Ie,eff is the "effective" equipment impairment factor impaired by packet losses, Ie is the equipment impairment factor in the error-free case, Ppl is the percentage of lost packets, and Bpl is a factor describing the robustness of the codec against packet loss. The higher Bpl is, the lower the artifacts associated with packet loss are. The constant of 95 in (2.5) represents approximately the R-value of an "optimal" NB condition. According to ETSI ETR 250 (1996), the standard deviation of E-model prediction errors on the MOS scale is about σ = 0.7. An exhaustive evaluation of the E-model by Möller (2000) showed that, in the case of individual degradations, for both traditional networks and low bit-rate speech codecs, the quality predictions are reliable. On the other hand, the assessment of combined degradations may lack reliability; in this specific case, signal-based models are more accurate (Möller and Raake, 2002).
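Equation (2.5) translates directly into code; the sketch below uses illustrative Ie and Bpl values rather than the normative ones from ITU–T Rec. G.113:

```python
def effective_ie(ie, ppl, bpl):
    """Effective equipment impairment factor Ie,eff under random packet
    loss, Eq. (2.5): ie is the error-free codec impairment, ppl the
    packet-loss percentage and bpl the codec robustness factor."""
    return ie + (95.0 - ie) * ppl / (ppl + bpl)

# Illustrative values only: a codec with Ie = 10 and Bpl = 4.3 at 2 % loss.
print(round(effective_ie(10.0, 2.0, 4.3), 1))  # -> 37.0
```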
2.3.2 Signal-based models

Whereas the parameter-based models introduced in the previous section are mainly used during the first stage of the quality loop, they do not provide useful information during the last stage, i.e. the usage phase of the transmission system (see Sec. 1.2.2), where telecommunication providers are interested in the quality assessment of in-use networks to detect problems liable to occur. In addition, parameter-based models are less reliable in the case of combined degradations; thus, a different type of instrumental model, the "signal-based" models, is employed. These models are split into two groups (see Fig. 2.5):

Intrusive models:
also known as double-ended measurement methods since they use a reference (clean or system input) speech signal, x(t), and a corresponding degraded (distorted or system output) speech signal, y(t)5 .
Non-intrusive models:
also known as single-ended or output-based measurement methods since they use only the degraded speech signal, y(t).
5 In practice, the model input signals correspond either to digital signals, x(k) and y(k), or to electrical ones, x(t) and y(t).
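The practical difference between the two groups is already visible in their input signatures alone; a minimal sketch (the function names are ours, not standardized):

import numpy as np

def intrusive_mos(x: np.ndarray, y: np.ndarray) -> float:
    """Double-ended model: compares the reference x(k) with the degraded y(k)."""
    raise NotImplementedError

def non_intrusive_mos(y: np.ndarray) -> float:
    """Single-ended model: only the degraded signal y(k) is available."""
    raise NotImplementedError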
Fig. 2.5 Intrusive and non-intrusive speech quality models (block diagram: the processing system converts the reference x(t) into the degraded y(t); the intrusive model uses both signals, the non-intrusive model uses y(t) only, and each outputs an estimated speech quality)
Early studies on speech quality estimation were mainly based on intrusive models; only since the 1990s have the estimations by non-intrusive models been reliable. Intrusive models consist of three main components:
1. a pre-processing step,
2. a component that transforms the speech signal(s),
3. an assessment unit.
For instance, in frame-by-frame comparisons, intrusive methods need a re-alignment of the two speech signals. In this case, the pre-processing step includes a specific algorithm that performs a precise time-alignment, which may be a difficult task, especially with packet-switched networks. In addition, to improve intrusive models, a Voice Activity Detector (VAD) algorithm may be used in the pre-processing step; the degradation is then estimated from the active frames only. Different types of signal transformation exist. The first measurement methods are defined in the time domain and represent a simple way to characterize the performance of a system under test using a single number. The other measurement methods are defined in either the frequency domain or the perceptual one and enable a more sophisticated approach to estimating the perceived speech quality. Comparison studies of several signal-based quality models are reported in Quackenbush and Barnwell (1985), Lam et al. (1996), Au and Lam (1998) and Côté et al. (2008).
2.3.2.1 Input signals

As stated above, instrumental methods have been introduced for the reliable prediction of quality scores assessed through auditory tests. As seen in Sec. 2.2.2, the source material used in auditory tests corresponds to speech recordings (i.e. natural speech). However, instrumental measurement methods may use different input signals. For instance, an intrusive model uses a reference signal and its corresponding degraded version. The former, x(t), is a "clean" signal with a linear PCM coding scheme (16-bit quantization) and frequency components within the bandwidth context: NB, WB or S-WB. The latter, y(t), is any signal processed by a speech processing system or transmitted through a network.
In some specific cases, the variability of natural speech leads to undesirable effects on the quality estimations. As, by definition, the perceived quality depends upon both the speech message (i.e. meaning) and the talker's characteristics (see Sec. 1.1.3), this leads to confusion when comparing instrumentally measured quality values. To avoid such a bias, the set of natural speech input signals used by the experimenter has to be large. Another solution is the use of a simpler signal such as pink noise or tones. For instance, in the audio domain, the usual analysis corresponds to the Total Harmonic Distortion (THD), and the stimulus is a white noise. However, the temporal and spectral properties of such signals and those of human speech are not alike. Therefore, artificial voice signals have been developed (Billi and Scagliola, 1982; Brehm and Stammler, 1987; Hollier et al., 1993). Artificial voices recommended for instrumental evaluations of speech processing systems such as transmission networks or telecommunication devices are defined in ITU–T Rec. P.50 (1999) and ITU–T Rec. P.59 (1993). They are of two kinds, so as to reproduce the temporal and spectral characteristics of human female and male voices. Kitawaki et al. (2004) used the ITU–T Rec. P.50 (1999) artificial voices with the intrusive quality model PESQ (ITU–T Rec. P.862.2, 2005). The authors employed two artificial voices and four real voices. Since the quality estimations made from the real voices proved to be highly correlated with those from the artificial voices (ρ > 0.95), they concluded that artificial voices may facilitate instrumental quality assessments of a whole transmission network.
2.3.2.2 Time analysis

These methods use time-domain (i.e. waveform) differences between the reference and the degraded signals, denoted by x(k) and y(k), respectively. A widely used instrumental parameter of speech quality, easy to compute and well understood, is the Signal-to-Noise Ratio (SNR). Using discrete signals of length N, it calculates the ratio between the energy of the input signal, x(k), and that of the noise introduced by the transmission system, n(k) = y(k) − x(k), as follows:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{k=1}^{N} x(k)^2}{\sum_{k=1}^{N} \left[ y(k) - x(k) \right]^2} \, , \qquad (2.6)$$
where k is the sample index. Since this parameter is computed over the entire speech signal, the result of Eq. (2.6) is called either long-term SNR or global SNR. The estimated value (in dB) decreases as the perceived quality decreases.
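A direct transcription of Eq. (2.6) in Python, assuming both discrete signals are already time-aligned and of equal length:

import numpy as np

def long_term_snr(x, y):
    """Global (long-term) SNR in dB according to Eq. (2.6)."""
    noise = y - x
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))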
The long-term SNR is a poor estimator of the perceived speech quality. Since speech sounds are considered as stationary within a time interval of approximately 20 ms, speech processing algorithms work on short segments, l, composed of M samples and called frames. A typical frame length lies in the range 10–40 ms, see Table 1.2. Following this principle, Zelinski and Noll (1977) defined a segmental SNR (SNRseg) as the arithmetic mean of the SNR values (in dB) calculated for the individual speech segments:

$$\mathrm{SNR}_{\mathrm{seg}} = \frac{1}{L} \sum_{l=0}^{L-1} 10 \log_{10} \frac{\sum_{k=1}^{M} x(lM + k)^2}{\sum_{k=1}^{M} \left[ y(lM + k) - x(lM + k) \right]^2} \, , \qquad (2.7)$$
where L is the number of frames. To reach a higher agreement with the auditory results, the SNR is usually restricted to the range 0–40 dB within each frame. Even if this parameter is more reliable than the long-term SNR, it fails to predict a correct ranking between different speech processing systems (Mermelstein, 1979). These first two SNR-based parameters employ the speech waveform; they are thus useful to estimate distortions introduced by additive noise (due to analog transmission or real background noise) or signal-correlated noise (generated by waveform coders). However, they are not reliable for other degradations such as filtering or phase distortions (Tribolet et al., 1978).
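The segmental SNR of Eq. (2.7), including the usual 0–40 dB restriction per frame, can be sketched as follows; the 20 ms frame length at 8 kHz sampling is an illustrative choice.

import numpy as np

def segmental_snr(x, y, m=160, snr_min=0.0, snr_max=40.0):
    """Segmental SNR according to Eq. (2.7), with per-frame clamping.

    m -- frame length in samples (160 samples = 20 ms at 8 kHz)
    """
    n_frames = len(x) // m
    values = []
    for l in range(n_frames):
        xf = x[l * m:(l + 1) * m]
        yf = y[l * m:(l + 1) * m]
        den = np.sum((yf - xf) ** 2)
        snr = 10.0 * np.log10(np.sum(xf ** 2) / den) if den > 0 else snr_max
        values.append(min(max(snr, snr_min), snr_max))
    return float(np.mean(values))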
2.3.2.3 Frequency analysis

As seen in Sec. 1.1.2.2, the human auditory system performs a spectral analysis. The perceived quality of a speech processing system depends, among other aspects, on the frequency distribution of the degradation over the short-term speech spectrum. SNR-based measurements need further enhancements in order to account for such a degradation distribution. An estimate Φxx(l, e^jΩm) of the Power Spectral Density (PSD), with Ωm = 2πm/M, is computed from the l-th segment waveform through application of an M-point Discrete Fourier Transform (DFT) analysis. A time window, such as a Hann window, is usually employed. The resulting spectrum is defined on m = 1 . . . M/2 points, i.e. up to half the sampling frequency. The third parameter of the SNR family is a frequency-weighted SNR (SNRFW) that uses a frequency-band filtering as follows:

$$\mathrm{SNR}_{\mathrm{FW}} = \frac{2}{L \times M} \sum_{l=0}^{L-1} \frac{\sum_{m=1}^{M/2} W(e^{j\Omega_m}) \, 10 \log_{10} \left( \hat{\Phi}_{xx}(l, e^{j\Omega_m}) \big/ \hat{\Phi}_{nn}(l, e^{j\Omega_m}) \right)}{\sum_{m=1}^{M/2} W(e^{j\Omega_m})} \, , \qquad (2.8)$$
where W(e^jΩm) is a long-term frequency-weighting function. Several examples of global and segmental frequency-weighted SNRs are given in Tribolet et al. (1978). One specific global SNRFW parameter is the so-called Articulation Index (AI), which is still used as an intelligibility parameter (French and Steinberg, 1947). According to Hansen and Pellom (1998), the SNR-based measurements are poor predictors of speech quality; moreover, they are highly dependent on the time-alignment and phase shift between the reference and degraded speech signals.

A second type of instrumental measurement method uses the vocoder technique employed by low bit-rate speech codecs, see Sec. 1.3.4.1. The Linear Predictive Coding (LPC) coefficients are estimated from both the reference and the degraded signals. Then, the estimated LPC coefficients are compared by a specific assessment unit. The speech signal spectrum can be transformed to a slightly different frequency scale to better correlate with the perceived distortion. For instance, the Cepstral Distance (CD), developed by Kitawaki et al. (1984), compares the logarithmic spectra of x(k) and y(k). It is calculated as follows:

$$\mathrm{CD} = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{p} \left[ c_x(i) - c_y(i) \right]^2} \, , \qquad (2.9)$$
where cx(i) and cy(i) are the "cepstrum" coefficients of the reference and the degraded signals, respectively, and p is the prediction order, usually in the range 10–15. Other typical LPC-based parameters are:

• the Itakura–Saito (IS) distance, developed by Itakura and Saito (1968),
• the Log Likelihood Ratio (LLR), developed by Itakura (1975),
• the Line Spectrum Pair (LSP) distance, developed by Coetzee and Barnwell (1989).

In a study by Kitawaki et al. (1982) on the performance of seven time-domain and frequency-domain instrumental measurement methods for quality measurements of NB speech codecs, the CD proved to be the best (σCD = 0.458). Even though frequency-domain measurement methods are more reliable than time-domain ones, they are not good enough for predicting the auditory quality of a wide range of distortions; in particular, they are unsuited to speech quality prediction in the case of simultaneous distortions (Tribolet et al., 1978). Some new LPC-based parameters have been significantly enhanced by using masking properties of the human auditory system (Chen et al., 2003).
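A hedged sketch of Eq. (2.9): for brevity, the cepstrum coefficients are computed here directly from the frame spectrum (real cepstrum) rather than from LPC coefficients as in Kitawaki et al. (1984), and the order p = 12 is an illustrative choice.

import numpy as np

def cepstrum(frame, p=12):
    """First p real-cepstrum coefficients of a frame (c(0) excluded)."""
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
    return np.fft.irfft(log_spec)[1:p + 1]

def cepstral_distance(x_frame, y_frame, p=12):
    """Cepstral distance between two time-aligned frames, cf. Eq. (2.9)."""
    diff = cepstrum(x_frame, p) - cepstrum(y_frame, p)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))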
2.3.2.4 Perceptual analysis

With the increasing introduction of non-linear speech processing systems in telephony networks, studies have mainly focused on accurate representations of speech signals in the time-frequency domain. The input speech signal is transformed into an auditory nerve excitation through several psychoacoustic processes (see Sec. 1.1.2.2), so as to simulate the peripheral hearing system. Usually, this
transformation follows the psychoacoustic model for loudness calculation developed by Zwicker and Fastl (1990) and published as the ISO Standard 532–B (1975). Therefore, the term model is used for perception-domain measurement methods, whereas the term parameter is used for time- and frequency-domain ones. Beerends and Stemerdink (1994) assumed that, in speech quality estimation, the simulation by the perceptual model does not have to replicate the activity of the human hearing system exactly. However, they considered that a simulation of simple cognitive processes is necessary. In that case, the assessment unit mimics the "reflection" phase described in Sec. 1.2.1. The integral quality is calculated from a combination of several quality features. However, the cognitive processes are less developed in current quality models than the perceptual signal transformation, mainly because of the small number of available studies devoted to the cognitive processes performed in the human auditory cortex. Following the description made by Grancharov and Kleijn (2007), quality models can mimic the auditory peripheral system through two approaches:

Masking threshold concept:
The reference is used to compute the degradation-masking threshold, which is in turn used by the assessment unit to calculate the perceptual difference between the reference and the degraded speech signals.
Perceptual representation:
Here, both signals are transformed into signals that contain only the pieces of information essential for the auditory cortex. Then, the assessment unit compares the transformed signals.
From the late 1970s, perception-domain models have been developed to optimize the quality of waveform coders. Such a model was introduced first by Schroeder et al. (1979), and then extended by Brandenburg (1987), who developed the Noise-to-Masking Ratio (NMR) model. This model assessed the audible noise in the degraded speech signal through use of a frequency-masking model. The new requirements imposed by networks have led to the extensive development of perceptual models and to their application to a wider range of distortions. Psychoacoustic features are employed to transform the reference and degraded speech signals according to the peripheral auditory system. The first models based on this concept were introduced by Klatt (1982) in the Weighted Spectral Slope (WSS) algorithm, to measure a phonetic distance, and by Karjalainen (1985) in the Auditory Spectrum Distance (ASD), to compare the audible time (ms), pitch (Bark), amplitude (dB) representations of the reference and degraded signals. Intrusive models similar to the ASD method have become predominant in speech quality assessment. Figure 2.2 (p. 41) gives an overview of all intrusive models.
Bark Spectral Distortion (BSD)

The BSD was developed by Wang et al. (1992). The perceptual transformation emulates several auditory phenomena such as (i) the critical band integration in the cochlea (restricted to the first 15 Bark units, in the range 50–3 400 Hz) and (ii) the loudness compression. The algorithm begins with a Power Spectral Density (PSD) estimation of each 10 ms frame; the frames are overlapped by 50%. The disturbance is computed as the square of the Euclidean distance between the two transformed speech signals:

$$\mathrm{BSD} = \frac{1}{L} \sum_{l=1}^{L} \sum_{z=1}^{Z} \left[ N_x(l, z) - N_y(l, z) \right]^2 \, , \qquad (2.10)$$

where l and z are the frame and Bark indices, respectively, L is the number of frames, Z is the number of critical bands, and Nx and Ny are the loudness densities of the reference and the degraded speech signals, respectively. In the study by Wang et al. (1992), a relatively high Pearson correlation coefficient between the predicted quality scores, MOSLQON, and the auditory quality scores, MOSLQSN, was achieved for the BSD algorithm (ρ = 0.85). Then, the Modified Bark Spectral Distortion (MBSD) model was developed by Yang et al. (1998); it incorporates a noise-masking threshold in order to differentiate the audible distortions from the inaudible ones. When the absolute loudness difference between the reference and the degraded speech signals is less than the loudness of the noise-masking threshold, this difference is considered imperceptible and, thus, set to 0. A comparison of the correlation coefficients produced by the MBSD and the conventional BSD models (ρMBSD = 0.956 against ρBSD = 0.898) showed that the former worked better (Yang et al., 1998). Then, Yang (1999) developed an Enhanced Modified Bark Spectral Distortion (EMBSD) from the MBSD model by including a new cognitive model to simulate a time-masking effect on the perceptual distortion. The resulting improvement was confirmed by a higher correlation coefficient: ρEMBSD = 0.98 against ρMBSD = 0.95.
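Assuming the perceptual transformation has already produced the loudness densities as arrays of shape (L, Z), Eq. (2.10) reduces to a few lines of Python; the MBSD-style variant below is a loose reading of the masking-threshold idea, not the exact published algorithm.

import numpy as np

def bsd(nx, ny):
    """Bark Spectral Distortion, Eq. (2.10); nx and ny have shape (L, Z)."""
    return float(np.mean(np.sum((nx - ny) ** 2, axis=1)))

def mbsd_style(nx, ny, mask):
    """MBSD-style variant: loudness differences smaller than the
    noise-masking threshold are treated as inaudible and set to zero."""
    d = nx - ny
    d = np.where(np.abs(d) < mask, 0.0, d)
    return float(np.mean(np.sum(d ** 2, axis=1)))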
Perceptual Speech Quality Measure (PSQM)

The Perceptual Audio Quality Measure (PAQM) and the Perceptual Speech Quality Measure (PSQM) were developed by Beerends and Stemerdink in 1992 and 1994, respectively. The former is devoted to audio assessments, whereas the latter is dedicated to speech assessments. Both of them employ a high-level psychoacoustic model. Moreover, in the PSQM model, the perceptual transformation is optimized for speech signals. In addition, Beerends (1994) assumed that a perceptual model alone was unable to model speech quality and, thus, that a cognitive model was needed. For instance, signal components added by the system under study are much more annoying than components which are attenuated. This effect is quantified by an "asymmetry" factor which has been included in the PSQM model (Beerends and
Stemerdink, 1994). The following four intrusive quality models were studied by the Comité Consultatif International Téléphonique et Télégraphique (CCITT), the predecessor of the International Telecommunication Union (ITU), for possible recommendations:

• Information Index (II) (Lalou, 1990),
• Cepstral Distance (CD) (Kitawaki et al., 1984),
• CoHerence Function (CHF) (Benignus, 1969), and
• Expert Pattern Recognition (EPR) (Kubichek et al., 1989).
But, as none of them achieved a minimum level of accuracy on auditory results, no recommendation was published by this organization. However, after a further evaluation, the PSQM model was approved as an ITU-T standard and published as the ITU–T Rec. P.861 (1996)6. It was developed and tested on numerous auditory tests carried out during the development of the ITU–T Rec. G.729 (2007) speech coding algorithm (published as ITU–T Suppl. 23 to P-Series Rec. (1998)). The PSQM model gives reliable estimations for low bit-rate speech codecs. An improved version of the PSQM model, called PSQM+ (ITU–T Contrib. COM 12–20, 1997), was developed later. Its scope is wider than that of the PSQM: this updated version predicts distortions due to transmission channel errors, such as time clipping and packet loss.
Measuring Normalizing Blocks (MNB)

Following the idea introduced by Beerends and Stemerdink (1994), Voran (1999a) developed a model called Measuring Normalizing Blocks (MNB) with a rather simple perceptual transformation and a more sophisticated assessment unit. Assuming that listeners express different opinions about short-term and long-term spectral deviations, the author developed a family of perceptual analyses that covers multiple frequency- and time-scales. The resulting parameters are linearly combined to form the perceptual distance between the original and the degraded signal. A comparison against ITU–T Rec. P.861 (1996) showed that the correlation coefficients yielded by the MNB model were higher for transmission errors and low bit-rate speech codecs than those yielded by the PSQM (Voran, 1999b). Consequently, in 1998, the ITU-T published this alternative intrusive quality measure as an annex to the ITU–T Rec. P.861 (1996).
Telecommunication Objective Speech-Quality Assessment (TOSQA)

To assess speech coding algorithms, Berger (1996) developed a quality model called Deutsche Telekom—Speech Quality Estimation (DT-SQE). But, contrary to
6 In 2001, the ITU–T Rec. P.861 (1996) was replaced by the ITU-T standard ITU–T Rec. P.862 (2001).
the other models, DT-SQE does not compute a distance, but a similarity between the reference speech signal, x(k), and the degraded speech signal, y(k), according to:

$$\mathrm{BSA} = \frac{1}{L} \sum_{l=1}^{L} r_l^2 \left( N_x(l, z), N_y(l, z) \right) \, , \qquad (2.11)$$

where rl is the correlation parameter between the loudness densities, and BSA stands for "Barkspektrale Ähnlichkeit" (Bark-Spectral Similarity). Its basic structure matches the common structure and components of perceptual-based models, and its pre-processing stages are similar to those of the PSQM. However, the DT-SQE model considers additional characteristics not taken into account in previous intrusive speech quality models. Firstly, x(k) and y(k) are both filtered by a standard 300–3 400 Hz bandpass filter to simulate the listener's terminal. Because the model may also be applied to the acoustic signals available at the talker's and listener's terminals, the input signal, x(k), can additionally be filtered with a modified IRS sending characteristic. Then, the delay κ between the reference and the degraded signals is compensated; the variable delay is estimated from the highest possible similarity between reference and degraded frames. Then, the reference and the degraded signals are split and Hann-windowed into 16 ms-long segments. The spectrum of each segment is calculated by a DFT. The spectra are transformed to the perceptual domain by using the loudness calculation model developed by Zwicker and Fastl (1990). The resulting loudness pattern of the reference speech file is modified to reduce inaudible effects, known to only very slightly affect the perceived integral quality in a Listening-Only Test (LOT). For instance, the linear spectral distortions due to the frequency response of the system under study have a small effect on the perceived quality7 and are consequently eliminated. The computed similarity is the main speech quality result of the DT-SQE model: the simple arithmetic mean of all short-time similarities yields the final quality score. An updated version of the DT-SQE model, called Telecommunication Objective Speech-Quality Assessment (TOSQA), is available in the ITU–T Contrib. COM 12–34 (1997). However, the standard deviations of the prediction errors by TOSQA are slightly higher than those by PSQM: σTOSQA = 0.31 against σPSQM = 0.28. In ITU–T Contrib. COM 12–19 (2000), the TOSQA model was extended into the so-called "2001 version" of TOSQA. Furthermore, a variable gain compensation, an adaptive threshold for the internal VAD, as well as a modified background noise calculation taking CNG algorithms into account were included in this improved version. A comparison of the two models on the database "Sup23 XP1", described in ITU–T Suppl. 23 to P-Series Rec. (1998) and App. B, showed that the TOSQA-2001 model performed slightly better than TOSQA with, for example, ρTOSQA−2001 = 0.961 against ρTOSQA = 0.953 for the auditory test "Sup23 XP1-D". In addition, the replacement of the IRS Receive filter by a 200–7 000 Hz passband filter, and that of the modified IRS Send filter applied to x(k) by a flat filter, both permitted its adaptation to
7 Only when it falls within a small dynamic range (e.g. ±10 dB).
WB transmissions. Since this improved version can work with either electrically or acoustically recorded input signals, it can assess the impact of terminals (i.e. electro-acoustic interfaces), such as degradations due to non-ideal frequency responses or non-linear transducer distortions. In ITU–T Contrib. COM 12–20 (2000), an evaluation of TOSQA-2001 on an auditory database including WB conditions and acoustic recordings led to a correlation coefficient of ρ = 0.91 between the TOSQA-2001 MOSLQOM estimations and the auditory MOSLQSM values.
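Reading rl in Eq. (2.11) as the Pearson correlation between the frame-l loudness densities over the Bark bands (the text does not spell out its exact definition), the Bark-spectral similarity can be sketched as:

import numpy as np

def bark_spectral_similarity(nx, ny):
    """BSA of Eq. (2.11): mean squared frame-wise correlation between
    loudness densities of shape (L, Z)."""
    r = [np.corrcoef(lx, ly)[0, 1] for lx, ly in zip(nx, ny)]
    return float(np.mean(np.square(r)))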
Asymmetric Specific Loudness Difference (ASLD)

Hauenstein (1997) developed a new quality model, called ASLD, which includes a psychoacoustic model developed especially for speech signals. The structure of the ASLD model includes several pre-processing stages: (i) both the overall delay and the overall gain of the system are compensated, (ii) the pauses are eliminated by a VAD, and (iii) two filters simulate the NB telephone bandwidth and the frequency response of the average telephone handset, respectively. Then, the perceptual transformation includes a finely tuned temporal and frequency masking effect; the computing effort required by this model is, therefore, high. The quality score is computed as a weighted sum of positive and negative disturbances. This algorithm follows the asymmetry factor principle developed by Beerends (1994). A comparison of ASLD with several intrusive models in ITU–T Contrib. COM 12–34 (1997) showed that it yielded a higher correlation coefficient than PSQM and TOSQA.
PErception MOdel—Quality assessment (PEMO-Q)

Hansen and Kollmeier (1997) applied an advanced model of auditory perception developed by Dau et al. (1996) to speech quality estimation. This "effective" auditory signal model simulates the transformation of acoustic signals into neural activity patterns by the human ear. It contains a Gammatone filter bank, a model of the human hair cells and an adaptation loop, to model critical band integration (on 19 Bark units) and temporal masking effects. The resulting speech quality score, called qC, corresponds to a similarity measure between the weighted perceptual representations of the reference and the degraded speech signals. However, this model is not fully adapted to all types of degradations because of a lack of pre-processing stages (e.g. a speech activity detection function is missing). Therefore, the evaluation of qC in ITU–T Contrib. COM 12–34 (1997) is worse than those of PSQM, TOSQA and ASLD: ρqC = 0.90 against ρPSQM = 0.93, ρTOSQA = 0.92 and ρASLD = 0.96. Huber and Kollmeier (2006) expanded Hansen's model to Full-Band (FB) audio signal assessments. New developments on the model of auditory perception by Dau et al. (1997) were included in this expanded model, called PEMO-Q8.
8 The acronym comes from the perception model developed by Dau et al. (1996).
Conversely to the qC model, this extended version estimates the perceived quality of any kind of distortion and any kind of audio signal (including speech).
Perceptual Ascom Class Enhanced (PACE)

The perceptual quality model developed by Juric (1998) is called PACE and was designed, at first, for the quality estimation of overall transmissions, especially in the field of mobile communications. In addition to the components included in the other intrusive models (e.g. time-alignment, critical band filtering, etc.), the PACE model contains an algorithm dedicated to "importance-weighted" comparisons of the perceptually transformed original and degraded signals: this algorithm assumes that signal parts with a high energy are more important for the perceived speech quality. This model is integrated into the Qvoice equipment evaluation framework developed by the telecommunication company Ascom. The evaluation of PACE in ITU–T Contrib. COM 12–62 (1998), on the "Sup23 XP1" and "Sup23 XP3" databases described in ITU–T Suppl. 23 to P-Series Rec. (1998) and App. B, highlighted its capability through high correlation coefficients on both databases: ρXP1 = 0.96 and ρXP3 = 0.94.
Perceptual Analysis Measurement System (PAMS)

Hollier et al. (1994) developed a specific description of the audible errors introduced by the system under study. The differences between the original and the degraded speech signals are represented by an error surface. Briefly, the error entropy (i.e. the distribution of errors) enables one to extract several error descriptors and to quantify the total amount of errors. This description led to a first model used for quality estimation of speech coding algorithms (Hollier et al., 1995). To extend this version to quality estimation of overall transmissions (referred to as "end-to-end"), including electro-acoustic interfaces, further developments were made by Rix et al. (1999). This quality model, called PAMS, was developed for the assessment of recent voice technologies, including packet-switched networks where the time-delay between the reference and the degraded speech signals is liable to vary. The estimation of this delay was a significant challenge in the late 1990s; this is why the PAMS model contains a robust time-alignment algorithm to precisely estimate the time-delay introduced by the transmission system. It is worth underlining that, in the case of a delay varying over the speech file, especially during a pause, the other quality models do not re-align the original and the degraded speech signals. In addition, following Berger's concept, the PAMS model compensates the linear degradations with a low impact on the perceived quality.
Rix and Hollier (1999) extended it to the assessment of WB transmissions by calibrating the perceptual layer that extracts the error descriptors, so as to produce quality scores on the MOSLQOM quality scale, called WLQ by the authors.
Perceptual Evaluation of Speech Quality (PESQ)

Since the publication of the PSQM model as the ITU–T Rec. P.861 (1996) for instrumental quality measurements of transmitted speech, telephony networks have broadly changed: with the introduction of highly non-linear degradations, quality is no longer kept at a constant level over an entire call. Unfortunately, ITU–T Rec. P.861 (1996) is unsuited to these networks and poorly correlated to the perceived integral quality (Thorpe and Yang, 1999). Consequently, the ITU-T worked on the development of an overall speech quality model as a successor to the ITU–T Rec. P.861 (1996). Five measurement algorithms were thus proposed, namely PACE (Ascom), PAMS (British Telecom), TOSQA (Deutsche Telekom), VQI9 (Ericsson) and PSQM9910 (KPN). Across 22 auditory experiments, the PAMS and PSQM99 models gave the highest average correlation coefficients between the model estimations and the auditory quality scores (ρPAMS = 0.92 and ρPSQM99 = 0.93). But, as the five models failed to fulfill all of the requirements for minimum performance, none of them was declared overall winner. The whole statistical evaluation was published in ITU–T Contrib. COM 12–117 (2000). Consequently, the strongest components of both the PAMS and PSQM99 models were integrated into a new algorithm denoted PESQ. They consist of (i) the perceptual transformation of the PSQM99 model (Beerends et al., 2002) and (ii) the time-alignment algorithm of the PAMS model (Rix et al., 2002). The average correlation coefficient found, over the same 22 databases, between the PESQ estimations and the auditory quality scores was ρ = 0.935 (Beerends et al., 2002). This new model was standardized as the ITU–T Rec. P.862 (2001). The different parts of this quality model are thus the result of evolutions over more than ten years. Since 2001, PESQ has been the most widely used instrumental model. As the original input file (and its related degraded version) of the PESQ model needs to follow some simple rules so as to avoid inconsistently estimated MOS values, the ITU-T published an application guide, the ITU–T Rec. P.862.3 (2005). These guidelines provide the relevant pieces of information required to obtain, in practice, stable, reliable and meaningful instrumental measurement results. In addition, to check whether the PESQ model has been correctly implemented, a procedure that uses the ITU–T Suppl. 23 to P-Series Rec. (1998) databases is described in detail in the ITU–T Rec. P.862.3 (2005).
9 VQI stands for Voice Quality Index (VQI).
10 During the selection phase, the name PSQM99 (PSQM, 1999 version) was used instead of PSQM+. Several improvements were included in the PSQM+ before its submission.
The PESQ model estimates the perceived quality of transmitted speech for the classical NB telephone bandwidth. In 2005, the PESQ model was extended to the evaluation of WB transmissions, and this WB mode of PESQ, called Wideband-Perceptual Evaluation of Speech Quality (WB-PESQ), was standardized as the ITU–T Rec. P.862.2 (2005). The algorithm of WB-PESQ is very similar to the one used by PESQ for NB signals. However, the WB-PESQ exclusively uses speech signals with a sampling frequency of fS = 16 000 Hz.
Project—Acoustic Assessment Model (P.AAM)

An acoustic model, like TOSQA, can work on acoustic recordings of speech signals transmitted over an electro-acoustic interface (e.g. handset, headset or loudspeaking telephones). Such a model is able to assess the influence of the transducers (i.e. microphone and loudspeaker) on the speech quality. Acoustic recordings are made by an artificial head (ITU–T Rec. P.58, 1996), which simulates the speech production and perception processes. Impairments introduced by acoustic components are outside the scope of the ITU–T Rec. P.862 (2001): this model uses electrical signals, which means that the two input speech signals are electrically recorded in the transmission system. However, the user terminals may include complex signal processing systems (especially in mobile and DECT telephones, see Sec. 1.3.3). Studies have been devoted to updates of the PESQ model in order to estimate the quality of the acoustic path as well. For instance, in ITU–T Del. Contrib. COM 12–6 (2001), the PESQ model was extended to quality measurements of monaural acoustic interfaces in listening environments with background noise. In addition, in ITU–T Del. Contrib. COM 12–41 (2001), a post-processing tool for intrusive models such as PESQ was proposed for the quality measurement of binaural recordings. The ITU-T has worked on the selection of a new standard dedicated to acoustic quality measurements. The requirements of this ITU-T project, called P.AAM, were described in ITU–T Contrib. COM 12–42 (2002). Three measurement algorithms were proposed, by Psytechnics, TNO and Deutsche Telekom. A report on the statistical evaluation of the three models is available in the ITU–T Del. Contrib. COM 12–109 (2003). The three proponents then announced their co-operation to develop one single model integrating the best components of each individual model. The three models are all based on changes to the PESQ model (ITU–T Rec. P.862, 2001). The resulting integrated P.AAM model was described in Rix et al. (2003) and Goldstein and Rix (2004). The P.AAM model differs from the PESQ model by the following improvements:

• Quality estimation in noisy listening environments.
• Extension to acoustic quality measurements.
• Extension from monaural to binaural acoustic interfaces.
Unfortunately, the submitted integrated model failed to meet the minimum requirements (i.e. correlation coefficients) in all cases. Consequently, the development was stopped in September 2003 (ITU–T Temp. Doc. TD.10 Rev.1, 2003).
Perceptual Objective Listening Quality Analysis (POLQA)

In 2007, the ITU-T launched a new standardization program, Perceptual Objective Listening Quality Analysis (POLQA) (ITU–T Temp. Doc. TD.52 Rev.1, 2007), aimed at selecting an intrusive speech quality model suitable for NB to S-WB connections and electro-acoustic interfaces, and able to compensate for the defects observed in the PESQ model. The future ITU-T standard POLQA is expected to reliably predict the integral "speech transmission quality" for fixed, mobile and IP-based networks. Speech quality models were proposed by six proponents: Opticom, SwissQual, TNO, Psytechnics, Ericsson and a consortium formed by France Télécom and Deutsche Telekom. The model developed by Ericsson was published by Ekman et al. (2011); the DIAL model developed by the France Télécom/Deutsche Telekom consortium is presented in Chap. 4. According to the selection criteria, detailed in Sec. 5.1.3, three proponents, Opticom, SwissQual and TNO, met the requirements. They agreed to combine their proposed algorithms into a joint model called POLQA. This joint model led to an improved version selected as a new ITU-T recommendation. A draft version of this new recommendation is available in ITU–T Contrib. COM 12–141 (2010).
2.3.2.5 Diagnostic measures

The intrusive quality measurement methods introduced in the previous Sects. 2.3.2.2, 2.3.2.3 and 2.3.2.4 give a single estimated quality score, MOSLQO, which represents the integral perceived quality of the assessed speech signal. However, two different degradations may occur in such a way that the integral quality remains unchanged; this makes necessary the development of diagnostic measurements based on a decomposition of the integral quality into several attributes. For instance, the system under study can be characterized by several physical attributes such as its overall gain, its frequency response and its SNR. However, this information is of little use to the end user and gives no insight into the influence of the system parameters involved in the user's perception. According to Jekosch (2005), a diagnosis is:

the production of a system performance profile with respect to some taxonomization of the space of possible inputs.
Such diagnostic measurements should rely on quality attributes or features as defined in Sec. 1.2.1. These quality features can be derived from a multidimensional analysis of the auditory results, see Sec. 2.2.4.6. The following paragraphs briefly recall several instrumental measurement methods in use.

• Quackenbush et al. (1988) investigated the correlation between auditory results from a DAM experiment and several quality estimators based on time analysis (e.g. segmental SNR and frequency-dependent SNR) and frequency analysis (e.g. LPC-based spectral distances). From the ten original quality features assessed in the DAM experiment, four quality features were selected, among them:

SL (Speech Lowpass): muffled, smothered.
BN (Background Noisy): hissing, rushing.
SI (Speech Interrupted): irregular, interrupted.

Here, the acronyms refer to the DAM scales, see Voiers (1977). For each feature, a corresponding estimator was developed (OSL, OBN, OSI and OSH). Their linear combination into an estimation of the integral speech quality led to a correlation coefficient of ρ = 0.75 with the auditory results. More recently, Sen (2004) derived a set of three orthogonal dimensions from a PCA analysis applied to DAM auditory results. The author developed a new set of three estimators for the corresponding perceptual dimensions: SH (Speech Highpass), SL and BNH (Background Noise Hiss).

• Halka and Heute (1992) decomposed the degradation into two quality features: (i) the linear distortion, described by an average spectrum of the degraded signal, Φyy(e^jΩ), and (ii) the nonlinear distortion, described by an average "noise" spectrum, Φnn(e^jΩ). Then, an integral quality score, dnlsd, was calculated from the spectral distance issued from the two signal spectra. One should note that these authors used a specific reference signal: in order to reduce the impact of the talker's characteristics, they recommend employing an artificial signal instead of natural speech signals. This test signal corresponds to a Spherically Invariant Random Process (SIRP).

• The model of loudness calculation developed by Zwicker and Fastl (1990) is based on the Bark scale concept. It is used by several speech quality models, e.g. MNB and TOSQA. Glasberg and Moore (2002) developed a model based on a similar concept, called the Equivalent Rectangular Bandwidth scale, ERBN (N for normal hearing). Both scales reflect the concept of critical band filters introduced in Sec. 1.1.2.2. Contrary to the former, which is used for steady sounds, the model of Glasberg and Moore (2002) can be used for time-varying stimuli, e.g. speech or music. From this loudness model, Moore et al. (2004) developed a perception-oriented approach to model how the perceived speech and music quality is affected by mixtures of linear and nonlinear distortions. For this purpose, a linear estimator
and a nonlinear one, respectively denoted by Slin and Snonlin, were developed to assess the impact of linear distortions (Moore and Tan, 2004) and that of nonlinear distortions (Tan et al., 2004). The two predicted scores are then combined as follows:

$$S_{overall} = \alpha\, S_{lin} + (1 - \alpha)\, S_{nonlin} \, , \qquad (2.12)$$

where α = 0.3. Slin measures the coloration or naturalness as a function of the changes induced by the system under study in the "excitation pattern", i.e. the image of the sound spectrum on the ERBN scale. Snonlin measures the harshness, roughness, noisiness or crackling in the audio signals. These degradations correspond to the introduction of frequency components missing in the original signal. This estimator uses an array of 40 Gammatone filters (1 ERBN bandwidth each) so as to cover the whole audible frequency range (50–19 739 Hz). Then, a similarity value is calculated between the perceptually transformed reference and degraded signals. The final model was evaluated on both music and speech stimuli (Moore et al., 2004) through auditory quality tests taking into account artificial conditions (i.e. digital filters) and acoustic recordings of transducers. Correlation coefficients of ρ = 0.85 and ρ = 0.90 were obtained for speech stimuli and music stimuli, respectively, between the estimations of the overall quality, Soverall, and the auditory scores.

• The quality features instrumentally measured by Quackenbush et al. (1988), Halka and Heute (1992) or Moore et al. (2004) differ from the perceptual quality dimensions described in Sec. 1.5, which are by definition orthogonal. In addition, Heute et al. (2005) showed that the composite scores issued from a combination of quality estimators are outperformed by almost all of the intrusive quality models that give a single integral quality score. This finding led them to assume that a reliable diagnostic model must rely on quality features that:

– are small in number,
– correspond to perceptual dimensions,
– are estimated by a reliable instrumental measure,
– can be combined to give an integral quality score.

Then, from the perceptual quality space derived by Wältermann et al. (2006b) and described in Sec. 1.5, a set of three quality estimators was developed by Scholz and Heute (2008). Briefly, each estimator quantifies the perceived quality on one of the following three perceptual dimensions:

– Directness/Frequency Content (DFC)
– Discontinuity
– Noisiness

The estimator of the quality dimension DFC was defined in Scholz et al. (2006) to measure the linear frequency degradation introduced by a transmission system. From a perceptual representation, the bandwidth and the slope, β, of the system frequency response are expressed in terms of ERB and dB per Bark, respectively.
The estimator of the dimension Noisiness, defined in Scholz et al. (2008), quantifies two types of noise: (i) the additive noise, estimated by its level, NL(add), and its center of gravity, fG(add), and (ii) the amount of signal-correlated noise, estimated and quantified by the parameter N(cor). A third estimator, for the perceptual dimension Discontinuity, was defined in Scholz (2008) and uses three parameters: (i) the Interruption Rate (IR), which quantifies the percentage of silence insertions within speech segments, (ii) the Clipping Rate (CR), which defines the percentage of silence insertions at the start or at the end of speech segments (also known as Front/End Clipping), and (iii) the Additive Instationary (AI) distortions, which quantify the influence of the musical noises contained in the speech signal. One should note that packet losses concealed by a PLC algorithm are not considered by this last estimator. Finally, an integral quality estimator was developed by Scholz and Heute (2008) from these three estimators. It relies on the linear combination of the perceptual dimensions introduced by Wältermann et al. (2006b). Its evaluation led to a correlation coefficient of ρ = 0.862 between the estimated MOSLQON values and the auditory MOSLQSN values.

• A second set of three estimators for the same perceptual dimensions was developed by Huo et al. (2008a,b, 2007) to cover a wider scope of estimated transmission systems. Indeed, the former set is restricted to NB transmissions only, whereas the latter can estimate the quality of NB and WB transmissions. The estimator for the perceptual dimension DFC, defined in Huo et al. (2007), includes two new parameters: sharpness (S) and reverberation time (T30). Sharpness is used in place of the center of gravity, zG, and the reverberation time, T30, estimates the impact of the room in the case where a Hands-Free Terminal (HFT) is used. The estimator for Noisiness, defined in Huo et al. (2008a), uses Cepstral Distances (dcep) and differentiates the contributions of high-frequency, nhf, and low-frequency, nlf, additive noises. From Weighted Spectral Slope (WSS) distances and signal temporal loss, the estimator for Discontinuity, defined in Huo et al. (2008b), derives three parameters: (i) the interruption rate, rI, (ii) the artifact rate, rA, and (iii) the clipping rate, rC. A comparison against the estimator developed by Scholz (2008) showed that this second set handles packet-loss impairments more accurately.

• From a model of human perception developed by Sottek (1993), Genuit (1996) developed an instrumental measurement method, termed Relative Approach (RA), for the specific assessment of acoustic quality. This method is applicable to acoustic recordings of environmental noise (e.g. within an office or a car). According to Genuit (1996), a characteristic of human hearing is that humans are more affected by fast level variations than by slow changes. The RA compares the instantaneous signal with a "smoothed" estimated version of the signal. Since temporal and spectral structures both have an influence on the perceived noise-induced annoyance, the comparison is made within each frequency band over time and within each time window over the whole frequency range. The model gives a three-dimensional representation of the spectrum, displaying the amount of annoyance for each time-frequency cell. In addition, contrary to other
diagnostic models, the RA does not require a reference signal. Gierlich et al. (2008a) updated the RA for the diagnostic assessment of NB and WB communications in the presence of background noise. This model is based on the ITU–T Rec. P.835 (2003) auditory method, which uses three rating scales. The resulting diagnostic model estimates three quality values: the Speech MOS (S-MOS), the Noise MOS (N-MOS) and the Global MOS (G-MOS, i.e. the integral quality including speech and background noise). In addition to the reference and the degraded signals used by intrusive quality models, this diagnostic model needs a third signal, referred to as "unprocessed" and corresponding to the talker's terminal input before any processing or transmission. This third signal is a mixture of both the speech and the background noise. The method is applicable to the assessment of user terminals (at the talker's side only), environmental noises, Noise Reduction (NR) algorithms, WB speech codecs and VoIP networks. In both WB and NB contexts, and for all three estimated scores, it proved to be accurate (ρGMOS,WB = 0.935 and ρGMOS,NB = 0.932). This diagnostic model was published by the ETSI as the ETSI EG 202 396-3 (2007).

• Beerends et al. (2007) proposed a diagnostic model mainly based on the intrusive quality model PESQ. It estimates three quality features close to the perceptual dimensions derived by Wältermann et al. (2006b). The first degradation indicator quantifies the impact of additive noise in silent frames; mapping the estimated parameter, Noise Distortion (ND), to the MOS scale gives an objective MOS noise value, OMOSNOI. The second indicator quantifies the effect generated by a deviation in the linear frequency response, computed from the frequency response of the system under study for speech segments with a high loudness only. After a noise and an overall gain compensation, an OMOSFRQ value is obtained by mapping the global Frequency Distortion (FD) onto a MOS scale. The last indicator quantifies the time-varying impairments, such as packet losses and pulses, by using two mapped MOS values: an OMOSTIM−CLIP value (i.e. time clipping) and an OMOSTIM−PULSE value, which account for the local decrease and the local increase of the signal, respectively. Both values are computed after noise and frequency response compensation. The overall OMOSTIM value is defined as the minimum of the OMOSTIM−CLIP and OMOSTIM−PULSE values. Beerends et al. (2007) evaluated the three degradation indicators using a specific auditory test procedure close to the one described in ITU–T Contrib. COM 12–82 (2009). In this auditory test, the speech stimuli were degraded by a mixture of noise, frequency response and time-varying distortions. Moreover, the authors assumed that the integral quality is dominated by the worst degradation. Therefore, they derived the integral quality from the three degradation indicators as the minimum of the three MOS values:

$$\mathrm{OMOS}_1 = \min \left\{ \mathrm{OMOS}_{NOI},\, \mathrm{OMOS}_{FRQ},\, \mathrm{OMOS}_{TIM} \right\} \, , \qquad (2.13)$$
The correlation coefficient between the resulting OMOS1 estimations and the integral quality scores was ρ = 0.82. Use of the aggregation expressed in Eq. (2.14),

$$\mathrm{OMOS}_2 = \left( \frac{\sqrt[10]{\mathrm{OMOS}_{NOI}} + \sqrt[10]{\mathrm{OMOS}_{FRQ}} + \sqrt[10]{\mathrm{OMOS}_{TIM}}}{3} \right)^{10} \, , \qquad (2.14)$$
led to a slight improvement, with a correlation coefficient of ρ = 0.85 obtained between the OMOS2 values and the integral quality scores.

• Current work in the ITU-T focuses on the development of a new diagnostic model, called Perceptual Approaches for Multi-Dimensional Analysis (P.AMD), which is able to estimate the four quality features introduced in Sec. 1.5, namely discontinuity, coloration, noisiness, and non-optimum loudness (ITU–T Contrib. COM 12–143, 2010). Additional dimensions might be included, such as low-frequency coloration, high-frequency coloration, fast-varying time-localized distortions and slowly-varying time-localized distortions.
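For comparison, the two aggregation rules of Eqs. (2.13) and (2.14) can be written directly in Python. The power mean of Eq. (2.14) lies between the minimum and the arithmetic mean of the three indicators, so it still emphasizes the worst degradation, but less sharply than Eq. (2.13).

def omos_min(nois, frq, tim):
    """Worst-degradation aggregation, Eq. (2.13)."""
    return min(nois, frq, tim)

def omos_power_mean(nois, frq, tim):
    """Tenth-order power-mean aggregation, Eq. (2.14)."""
    return ((nois ** 0.1 + frq ** 0.1 + tim ** 0.1) / 3.0) ** 10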
2.3.2.6 Non-intrusive models

A crucial step in intrusive models is the alignment of the reference and the degraded speech signals. Indeed, a perfect alignment is difficult to achieve with signals transmitted by packet-switched networks, which introduce a variable delay. A wrong synchronization results in a dramatic decrease of the model accuracy (Rix et al., 2002). In addition, intrusive models require both the signal under study and the reference signal. But, in some important applications (e.g. network monitoring), the reference signal is unavailable. This has made necessary the development of "non-intrusive" models. Non-intrusive measurement methods rely on two different approaches: (i) the a priori-based approach, and (ii) the source-based approach. In both cases, several parameters are derived from the degraded speech signal and may describe either perceptual features (e.g. LPC coefficients) or physical characteristics (e.g. speech level in dB).
A priori-based approach

In the a priori-based approach, a set of known distortions is first characterized by several parameters. Then, a relationship between this finite set of distortions and the perceived speech quality is derived. This approach is usually based on machine learning techniques such as Gaussian mixture models or artificial neural networks. In this case, the parameters characterizing the set of known distortions are stored in
an optimally clustered codebook. For instance, Au and Lam (1998) inspected visual characteristics of the speech spectrogram to detect noise or frequency distortions. Another example corresponds to the ITU–T Rec. P.561 (2002) and ITU–T Rec. P.562 (2004). The former recommendation defines an In-service Non-intrusive Measurement Device (INMD) that quantifies physical characteristics in live call traffic, such as speech level, noise level, echo loss and echo path delay. ITU–T Rec. P.562 (2004) shows how one can use INMDs to predict the perceived speech quality through use of two parametric models: the Call Clarity Index (CCI) and the E-model previously introduced in Sec. 2.3.1.3. It is worth noting that neither of them is based on machine learning techniques. Briefly, the CCI provides an estimated MOSCQE value on the conversational quality scale defined in ITU–T Rec. P.800 (1996), whereas the parametric E-model gives an estimated R-value on the Transmission Rating scale defined in ITU–T Rec. G.107 (1998). Another type of a priori-based non-intrusive model uses the likelihood that the degraded speech signal has been produced by the human vocal system. The speech signal is reduced to a few speech features related to physiological rules of voice production. The derived parameters are then combined and mapped to a quality scale. This approach was followed by Gray et al. (2000), who used a vocal tract model to detect distortions in the transmission system. A non-intrusive model including the vocal tract model developed by Gray et al. (2000) was published by the ITU-T as the ITU–T Rec. P.563 (2004). It relies on the association of three principles: (i) the derivation, from the degraded signal, of several parameters related to the voice production mechanism, (ii) the reconstruction of a reference signal from the degraded signal, both signals being then assessed by an intrusive model, and (iii) the detection of specific distortions in the degraded signal. Then, the derived parameters are linearly combined to predict a speech transmission quality. In this aggregation step, the perceptual impact of each parameter is quantified through a distortion-dependent weighting operation. In an evaluation of ITU–T Rec. P.563 (2004) on the "Sup23 XP1" and "Sup23 XP3" databases (Malfait et al., 2006), described in ITU–T Suppl. 23 to P-Series Rec. (1998) and App. B of this book, the correlation coefficients obtained by the PESQ model for all seven auditory tests were higher, for example, ρPESQ = 0.957 against ρP.563 = 0.842 for the auditory test "Sup23 XP1-D". Kim et al. (2004) developed a non-intrusive model called Auditory Non-Intrusive QUality Estimation (ANIQUE), where the naturalness of degraded speech signals is assessed by a machine learning technique. The performance of this model proved to be lower than that of ITU–T Rec. P.563 (2004).
Source-based approach

In the source-based approach, an artificial reference signal is selected from parameters characterizing the degraded speech signal. Then, the selected artificial reference
is compared to the degraded signal. Like the a priori-based approach, this kind of non-intrusive quality model usually relies on machine learning techniques. Moreover, here, the codebook stores parameters derived from a large set of reference speech materials, and the range of estimated distortion types is wider than that of a priori-based models. An example of the source-based approach is given by Liang and Kubichek (1994). After deriving Perceptual Linear Prediction (PLP) coefficients from the degraded speech signal, the authors selected, from these PLP coefficients, an artificial reference signal and then calculated the Euclidean distance between the artificial reference and the degraded speech signals. Falk and Chan (2006) developed a source-based non-intrusive model that uses a neural network and showed improvements with respect to ITU–T Rec. P.563 (2004).
2.3.3 Packet-layer models

A quality model can be designed to assess certain processing conditions or networks, e.g. for network monitoring purposes. However, the high number of assessment requests needed to monitor a whole transmission system calls for simplifications in terms of algorithm complexity. Packet-layer models enable one to reconcile this constraint with the need for a reliable instrumental model. These methods measure, in gateways or at the listener's side, several network-related and IP packet pattern-based parameters such as transmission delay, packet-loss percentage and burst ratio. Compared to intrusive methods, packet-layer models are less complex and require less memory. In addition, packet-layer quality measurement methods combine the advantages of parametric models with those of signal-based models: they use several parameters provided by the packet-switched network together with estimations from simple non-intrusive models. The current ITU-T standard is the ITU–T Rec. P.564 (2007).
2.4 Summary and Conclusion

This chapter reviewed in detail the auditory and instrumental measurement methods dedicated to the assessment of perceived speech quality. In particular, it highlighted the standards published by organizations such as the ITU-T, ETSI or ISO. One should be aware that, as the described auditory methods are usually carried out in laboratories, the opinions expressed by subjects are affected by this specific listening environment. This means that obtaining an absolute quality value is by nature impossible; however, thanks to the procedures made available by these organizations, biases on the quality judgments can be reduced. In addition, this chapter presented the historical evolution of instrumental quality models, from the first one, defined by
Fletcher and Arnold (1929), to the future POLQA standard (ITU–T Contrib. COM 12–141, 2010). This description also included the target applications and the corresponding performances of the instrumental quality models. Amongst the three instrumental quality measures recommended by the ITU-T, which are (i) the parameter-based E-model (ITU–T Rec. G.107, 1998), (ii) the non-intrusive ITU–T Rec. P.563 (2004), and (iii) the intrusive PESQ (ITU–T Rec. P.862, 2001), the third one, PESQ, proved to be the most accurate with respect to auditory quality scores (Falk and Chan, 2009). Many intrusive quality models, described in Sec. 2.3.2.4, simulate the human peripheral auditory system (i.e. they represent the signal at the output of the inner ear). But this paradigm shows limitations. A model of cognitive processes is thus necessary to increase the accuracy of instrumental measurement methods. Ideally, as done by human subjects, the instrumental measurement methods should interpret the perceptual dimensions involved in the assessment process. However, some cognitive effects are already simulated by instrumental models:

• Linear distortion is generally less objectionable than nonlinear distortion (Thorpe and Rabipour, 2000).
• Speech-correlated distortion has a greater impact on the perceived quality than uncorrelated noise (Leman et al., 2008).
• Distortions on time-spectrum components that carry information (e.g. formants) have a high impact on the perceived quality (Beerends and Stemerdink, 1994).

Cognitive processes are generally modeled by machine learning techniques, in both intrusive and non-intrusive models. For instance, Pourmand et al. (2009) used Bayesian modeling to estimate the quality of Noise Reduction algorithms. A speech quality model developed by Chen and Parsa (2007) includes a high-level psychoacoustic model and a cognitive model; this method combines the model of loudness calculation developed by Moore et al. (1997) with Bayesian modeling and a Markov Chain Monte Carlo (MCMC) technique. Since defects have been observed in the PESQ model recommended by the ITU-T (ITU–T Rec. P.862, 2001), the next chapter describes these shortcomings in the PESQ estimations in detail and then proposes ways to enhance the reliability of PESQ.
Chapter 3
Optimization and Application of Integral Quality Estimation Models
For several decades, the telephony network has used the same transmitted audio bandwidth, termed Narrow-Band (NB) and corresponding by definition to the 300–3 400 Hz range. From the available data, instrumental models have thus been developed mainly for NB speech. The instrumental quality models introduced in Sec. 2.3.2.4 obtain a high correlation with users’ judgments for a group of NB transmission systems such as mobile and VoIP networks. WideBand (WB) telecommunication systems have been developed since the 1990s and are able to transmit frequencies between 50 and 7 000 Hz. They have led to a significant enhancement of the perceived quality of the signal by, for example, increasing the feeling of “nearness” of the far-end talker. However, the introduction of a new transmission paradigm requires a flexibility from the network that is hardly possible with the fixed-line telephony system. For instance, a WB transmission requires a specific speech codec. Moreover, in the near future, both NB and WB transmissions will be available, and this mixed-band context constrains instrumental models to assess both bandwidths on a single quality scale. However, the performances displayed by instrumental models are degraded when they are used under WB conditions. These considerations led us to focus, in this chapter, on the evaluation of conventional WB intrusive models to check whether this requirement is fulfilled. After an analysis, in Sec. 3.1.1, of the WB intrusive quality model currently recommended by the ITU-T, denoted WB-PESQ, some cases of discrepancies by this model on specific WB speech codecs will be briefly reported. Their identification drove us to make changes in the WB-PESQ algorithm (Sec. 3.1.2) to improve its reliability on WB speech codecs. The proposed new version of the model was thus called “Modified WB-PESQ”. Getting an absolute score of the speech transmission quality is essential for comparisons of telephony services by end-users. An absolute quality value of a single component (e.g. a speech codec) is also important for telecommunication providers when new transmission networks are designed. Getting such absolute
scores requires the development and application of normalization procedures aimed at attenuating the biases introduced by the experimental context. Such procedures were developed, at first, on results issued from auditory experiments (ITU–T Rec. P.833, 2001). Our investigations dealt with the application of a pool of WB intrusive models (the “Mod. WB-PESQ”, the WB-PESQ and the TOSQA-2001) to a new normalization procedure, using several databases including both NB and WB conditions. The normalized scores issued from these model estimations are then compared to those from auditory tests. This new procedure led to an ITU-T standard, presented in Sec. 3.2.3, which has been recently published as ITU–T Rec. P.834.1 (2009). Finally, the limits of the current WB intrusive speech quality models are discussed (Sec. 3.3).
3.1 Optimization of an intrusive quality model, the WB-PESQ

3.1.1 WB-PESQ analysis

The current intrusive quality model recommended by the ITU-T is the Perceptual Evaluation of Speech Quality (PESQ) model. During its development phase, this model obtained a correlation of ρ = 0.935 with the auditory quality scores over 22 speech databases (ITU–T Rec. P.862, 2001). The PESQ model and its wideband extension WB-PESQ (ITU–T Rec. P.862.2, 2005) are widely used by telecommunication providers to assess their telephony networks.
3.1.1.1 Description of the PESQ algorithm

The PESQ model is based on the perceptual transformation and the variable delay estimation used, respectively, in the PSQM (Beerends et al., 2002) and PAMS (Rix et al., 2002) models. It is currently recommended by the ITU-T for the assessment of speech codecs, including frame- or packet-loss effects (ITU–T Rec. P.862, 2001). This model proceeds through several typical stages of intrusive measurement methods:
Pre-processing steps

At first, several pre-processing steps are computed.

Level alignment: The two input speech signals (the reference and the degraded signal) are aligned to a fixed level, corresponding to the preferred listening level in use in auditory experiments (79 dB SPL) as recommended by the ITU–T Handbook on Telephonometry (1992).
IRS filtering: The PESQ model is usually applied to the electrical part of the network, after the sending terminal and before the receiving terminal. At the talker’s side, the model expects a standard handset, usually simulated by a modified IRS Send filter according to ITU–T Rec. P.830 (1996) (see Sec. 1.3.3). To simulate the user’s telephone handset used to listen to the speech in an auditory test, the reference and degraded signals are both filtered in the model with an IRS Receive filter according to ITU–T Rec. P.48 (1988).

Voice Activity Detector (VAD): Voice activity is then detected in the speech signals as follows: frequencies outside the 500–2 500 Hz bandwidth are ruled out by a digital filter. A crude noise level is then estimated as the mean energy over the signal, and raised when the noise varies over time. An “utterance” (i.e. an active part) is detected when the signal level exceeds the estimated noise level for at least 200 ms.

Time alignment: The VAD output is used to align the two signals so as to rule out the delay and jitter (time-varying delay) due to modern voice transport techniques such as VoIP (Rix et al., 2002).
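For illustration, a minimal sketch of such an energy-based utterance detector is given below (Python; the frame length, filter order and function name are assumptions, and this is not the ITU-T reference code):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def detect_utterances(x, fs=16000, frame_ms=4, min_dur_ms=200):
    """Crude energy-based utterance detection, loosely following the VAD
    description above (hypothetical re-implementation)."""
    # Rule out frequencies outside the 500-2500 Hz band.
    sos = butter(4, [500, 2500], btype="bandpass", fs=fs, output="sos")
    xf = sosfilt(sos, np.asarray(x, dtype=float))
    # Short-term energy per frame.
    flen = fs * frame_ms // 1000
    n = len(xf) // flen
    energy = np.array([np.mean(xf[i*flen:(i+1)*flen] ** 2) for i in range(n)])
    noise = np.mean(energy)                    # crude noise-level estimate
    active = energy > noise
    # An utterance must exceed the noise level for at least min_dur_ms.
    min_frames = max(1, min_dur_ms // frame_ms)
    utterances, start = [], None
    for i, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_frames:
                utterances.append((start * flen, i * flen))  # sample indices
            start = None
    return utterances
```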
Perceptual transformation

The two speech signals are compared once a perceptual transformation has been applied to obtain a psychophysical representation of the audio signals in the human auditory system (see Sec. 1.1.2.2).

Spectrum: A power-time-frequency representation (i.e. a spectrogram) of both signals is calculated by a Discrete Fourier Transform (DFT) on each 32 ms long frame, l, with 50% overlap.

Band integration: The power-time-frequency representations are transformed to power-time-pitch representations through a critical band integration (in 49 bands, z, where each filter band corresponds to 0.5 Bark).

Frequency compensation: The linear frequency distortion introduced by the system under test is partly compensated (in the range −20 to +20 dB).

Gain compensation: The time-varying gain introduced by the system in successive frames is also
partly compensated (in the range −35 to +7 dB).

Loudness compression: The power-time-pitch representations are transformed to loudness-time-pitch representations by using a modified version of the loudness calculation model developed by Zwicker and Fastl (1990).
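The following sketch illustrates the first two of these steps, assuming a 16 kHz sampling rate; the Hz-to-Bark approximation and the uniform band edges are illustrative assumptions rather than the tabulated PESQ band definitions:

```python
import numpy as np

def power_spectrogram(x, fs=16000, frame_len=512):
    """Power-time-frequency representation: 32 ms frames (512 samples at
    16 kHz), 50% overlap, Hann-windowed DFT."""
    hop = frame_len // 2
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, bins)

def bark_band_integration(P, fs=16000, n_bands=49):
    """Sum DFT power into 49 roughly 0.5 Bark wide bands (assumed
    Zwicker-style Hz->Bark mapping, not the PESQ tables)."""
    n_bins = P.shape[1]
    f = np.linspace(0.0, fs / 2.0, n_bins)
    bark = 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
    edges = np.linspace(0.0, bark[-1] + 1e-9, n_bands + 1)
    bands = [P[:, (bark >= lo) & (bark < hi)].sum(axis=1)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.stack(bands, axis=1)         # power-time-pitch, (frames, 49)
```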
Distance measure

A distance is calculated between the two perceptually transformed signals.

Disturbance: At first, a disturbance density matrix is computed as the difference between the loudness-time-pitch representations of the two speech signals:

$$D(l,z) = L_y(l,z) - L_x(l,z)\,, \qquad (3.1)$$
where L_y(l,z) and L_x(l,z) are the loudness-time-pitch representations of the degraded and the reference signals, respectively, and z is the filter band index.

Masking: A simple model of the disturbance masking effect in each time-pitch component is computed [1] as follows:

$$M(l,z) = 0.25\,\min\left(L_y(l,z),\, L_x(l,z)\right). \qquad (3.2)$$

Asymmetry: A correction factor h(l,z) is calculated to quantify the new time-pitch components introduced in the degraded signal by the system under test. This step simulates a simple cognitive effect, i.e. the asymmetry in the perception of positive and negative distortions in periods of speech and silence (Beerends, 1994). This factor corresponds to the ratio between the degraded and the reference loudness-time-pitch representations:

$$h(l,z) = \left(\frac{L_y(l,z) + \beta}{L_x(l,z) + \beta}\right)^{1.2}, \qquad (3.3)$$
where β is a constant set to 50. The multiplication of the disturbance density matrix by the correction factor gives an asymmetrical disturbance density matrix, DA(l, z):
$$DA(l,z) = \begin{cases} 0, & \text{if } h(l,z) < 3\,; \\ D(l,z) \times 12, & \text{if } h(l,z) > 12\,; \\ D(l,z) \times h(l,z), & \text{otherwise}. \end{cases} \qquad (3.4)$$

[1] Beerends et al. (2002) assumed that a more complex masking model would not enhance the reliability of the PESQ model.
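A compact sketch of Eqs. (3.1)–(3.4) on NumPy arrays could look as follows; note that the way the masking threshold M(l,z) is applied is not spelled out above, so the center-clipping used here is an assumption, and this is not the ITU-T reference implementation:

```python
import numpy as np

def asymmetric_disturbance(Lx, Ly, beta=50.0):
    """Sketch of Eqs. (3.1)-(3.4) on loudness-time-pitch arrays of the
    reference (Lx) and degraded (Ly) signals."""
    D = Ly - Lx                                        # (3.1) disturbance density
    M = 0.25 * np.minimum(Ly, Lx)                      # (3.2) masking threshold
    D = np.sign(D) * np.maximum(np.abs(D) - M, 0.0)    # assumed center-clip by M
    h = ((Ly + beta) / (Lx + beta)) ** 1.2             # (3.3) asymmetry factor
    DA = np.where(h < 3.0, 0.0, D * np.minimum(h, 12.0))   # (3.4)
    return D, DA
```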
Frequency averaging: The frame disturbance vector and the asymmetrical frame disturbance vector, denoted respectively D(l) and DA(l), are calculated by a non-linear average (over pitch) of the disturbance density matrix and of the asymmetrical disturbance density matrix:
$$D(l) = M(l)\,\sqrt[3]{\sum_{z=1}^{49} \left[\,W(z)\,D(l,z)\,\right]^{3}}\,, \qquad (3.5)$$

$$DA(l) = M(l)\,\sum_{z=1}^{49} W(z)\,DA(l,z)\,,$$
where M(l) is a multiplication factor that emphasizes disturbances occurring in periods of silence, and W(z) is a vector related to the width of the filter bands. The resulting vectors lie in the range 0–45.

Time averaging: The values of D and DA are calculated by a non-linear average (over time) of the frame disturbance vector and of the asymmetrical frame disturbance vector. An L6 norm is used over 320 ms intervals; then an L2 norm is used over the whole period of speech.

Quality score: The final PESQ score is a linear combination of both values:

$$\mathrm{PESQ} = 4.5 - \alpha_1 D - \alpha_2 DA\,, \qquad (3.6)$$

where the coefficients are set to α1 = 0.1 and α2 = 3.09 × 10⁻².

Mapping: At last, the final PESQ score is transferred to the MOS scale [2] by a mapping function to obtain the instrumentally estimated MOSLQON value, i.e. in a NB context (see Sec. 2.1):

$$\mathrm{MOS}_{LQON} = 0.999 + \frac{4.999 - 0.999}{1 + e^{-1.4945\,\mathrm{PESQ} + 4.6607}}\,. \qquad (3.7)$$
[2] During the development of the PESQ model, for each database a monotonic third-order mapping function was applied to the estimated MOS values. Then, a mean mapping function was estimated from all databases and normalized as ITU–T Rec. P.862.1 (2003).
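Equations (3.6)–(3.7) amount to a few lines of code (a sketch, not the reference implementation):

```python
import numpy as np

def pesq_to_mos_lqon(D, DA, a1=0.1, a2=3.09e-2):
    """Sketch of Eqs. (3.6)-(3.7): combine the time-averaged disturbances
    into the raw PESQ score and map it to the NB MOS scale."""
    raw = 4.5 - a1 * D - a2 * DA                                   # (3.6)
    return 0.999 + (4.999 - 0.999) / (1.0 + np.exp(-1.4945 * raw + 4.6607))  # (3.7)

# A perfect signal (D = DA = 0) yields raw = 4.5 and MOS_LQON of about 4.5.
print(pesq_to_mos_lqon(0.0, 0.0))
```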
3.1.1.2 WideBand extension of PESQ, the WB-PESQ

An extension of the PESQ model to WideBand connections, referred to as WB-PESQ, was proposed in ITU–T Del. Contrib. COM 12–7 (2001) and later published as ITU–T Rec. P.862.2 (2005). Its algorithm is very similar to its NB counterpart: the time alignment step is, indeed, independent of the transmitted bandwidth. However, the two models differ in two points:

1. In a wide- or mixed-band context, the listening terminal in use in auditory tests is a high-quality headphone instead of a telephone handset. This means that the IRS Receive filter used in PESQ is replaced by a bandpass filter with a flat frequency response in the range 200–8 000 Hz (see Fig. 3.1).
2. The estimated WB-PESQ raw values are transformed into MOSLQOM values by a slightly different output mapping function (see Fig. 3.2), i.e. in a mixed-band context (see Sec. 2.1):

$$\mathrm{MOS}_{LQOM} = 0.999 + \frac{4.999 - 0.999}{1 + e^{-1.3669\,\mathrm{WBPESQ} + 3.8224}}\,. \qquad (3.8)$$

[Fig. 3.1 Frequency responses |H| (dB) of the WB filter (ITU–T Rec. P.341, 2005), the IRS Receive filter (ITU–T Rec. P.48, 1988) and the input filter of WB-PESQ.]
The application contexts of PESQ and WB-PESQ are not alike. As the PESQ model was developed on NB databases, it estimates MOSLQON values. The WB-PESQ model, on the other hand, was developed on databases including both NB and WB conditions and thus estimates MOSLQOM values (ITU–T Rec. P.800.1, 2006). This application context is related to the “corpus effect” introduced in Sec. 2.2.5.3: when the auditory tests are carried out under only NB transmission conditions, the quality score given to an NB condition is higher than when WB transmission channels are also tested. In ITU–T Del. Contrib. COM 12–187 (2004), the author described this effect based on PESQ and WB-PESQ estimations. The relationship between both models, published in ITU–T Contrib. COM 12–59 (2007), leads to the following non-linear mapping function:
$$\mathrm{MOS}_{LQOM} = 1 + 1.6 \times \left[\exp\left(\frac{\mathrm{MOS}_{LQON} - 1}{3.3}\right) - 1\right]. \qquad (3.9)$$

[Fig. 3.2 Output mapping functions of PESQ and WB-PESQ (mapped MOSLQO versus raw PESQ score).]
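Both output mappings, Eq. (3.8) and the corpus-effect relationship of Eq. (3.9), are straightforward to express in code (a sketch):

```python
import numpy as np

def wbpesq_to_mos_lqom(raw):
    """Eq. (3.8): map the raw WB-PESQ score to MOS_LQOM (sketch)."""
    return 0.999 + (4.999 - 0.999) / (1.0 + np.exp(-1.3669 * raw + 3.8224))

def mos_lqon_to_mos_lqom(mos_lqon):
    """Eq. (3.9): approximate corpus-effect relationship between a purely
    NB context and a mixed-band context (sketch)."""
    return 1.0 + 1.6 * (np.exp((mos_lqon - 1.0) / 3.3) - 1.0)
```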
3.1.1.3 Evaluation of WB-PESQ

Signal-based models have limitations since they are developed on auditory test data collected under a limited set of conditions. However, assessments of the WB-PESQ model proposed in ITU–T Del. Contrib. COM 12–7 (2001) on two different databases showed, in each case, a high correlation with the auditory judgments (ρ ≥ 0.94). These databases were described in detail in EURESCOM Project Report P905 (2000): the former included the ITU–T Rec. G.722 speech codec, the Moving Picture Experts Group (MPEG)-2 Layer III audio codec, several Discrete Cosine Transform (DCT)-based coding algorithms and MNRU conditions, whereas the latter was composed of three auditory tests including the MPEG-4 AAC audio codec, the MPEG-4 Code-Excited Linear Prediction (CELP) speech codec and MNRU conditions. However, several studies highlighted discrepancies of the WB-PESQ with other wideband speech codecs. For instance, a “codec dependence” of the WB-PESQ was shown in ITU–T Del. Contrib. COM 12–187 (2004): though the MNRU and ITU–T Rec. G.722 estimations are consistent with auditory scores, ITU–T Rec. G.722.1 seems to be underestimated by WB-PESQ, and the WB conditions are underestimated in comparison with the NB conditions. Similar results were found in ITU–T Del. Contrib. COM 12–70 (2005) and in Kitawaki et al. (2005): the WB-PESQ obtains a lower correlation with auditory quality scores for the ITU–T Rec. G.722.2 speech codec. Evidence of the underestimation of the perceived quality of the Enhanced Variable Rate Codec (EVRC) by WB-PESQ was provided in ITU–T Contrib. COM 16–83 (2006). A possible explanation is that, though the modification applied by EVRC to the original speech signal has no significant auditory impact, it may be considered as a degradation by WB-PESQ: this codec uses the Relaxed Code-Excited Linear Prediction (RCELP) technology, which synthesizes a time-warped version of the original speech signal. In addition, WB-PESQ seems to underestimate NB conditions compared to WB conditions in mixed-band databases. This is supported by
the finding by Takahashi et al. (2005a) of underestimations by WB-PESQ in conditions where the low cut-off frequency is less than 200 Hz. In order to diagnose discrepancies in the WB-PESQ estimations, three tests have been selected whose speech signals and auditory results are both available. The main features of these databases are given in Table 3.1 and described in more detail in App. B.

Table 3.1 Summary of the databases used in the WB-PESQ evaluation

No.  Name          Reference                               Conditions
1    FT-04 WB      Barriac et al. (2004)                   NB and WB speech codecs
2    IKA/LIMSI WB  Côté (2005)                             Narrow-, Middle- (a) and Wide-band bandwidths
3    FT-06 WB      ITU–T Del. Contrib. COM 12–149 (2006)   NB and WB speech codecs impaired by transmission errors

(a) A Middle-band bandwidth lies between a NB and a WB bandwidth, e.g. 100–5 000 Hz.
The reference signals used in this evaluation correspond to the original high-quality speech signals, i.e. a WB bandwidth produced by the P.341 filter of the Software Tool Library provided by the ITU-T, see ITU–T Rec. G.191 (2005). The signals were then normalized to −26 dB ASL [3]. Since the noise floor of the reference signals has always been above the minimum level of −75 dBov specified by the application guide for PESQ (ITU–T Rec. P.862.3, 2005), no noise was digitally added to the reference speech signals. In order to compare the WB-PESQ estimations to the auditory quality judgments, the Pearson correlation coefficient, ρ, and the standard deviation of the prediction errors, σ, were calculated; both parameters are defined in Sec. 5.1.3.2 of this book. A close examination of Table 3.2 shows clearly that WB-PESQ is bandwidth-dependent. Indeed, in Tests 1 and 2, the ρ-values are lower for WB conditions than for NB ones, and the σ-values are higher for WB conditions than for NB ones. On the other hand, in Test 3, both the ρ- and σ-values are higher for WB conditions than for the NB ones. These differences are explained by the differences in the degradations introduced by the speech codecs:

• ITU–T Rec. G.722 (1988): This sub-band waveform codec uses an ADPCM technique which mainly introduces quantizing noise.
• ITU–T Rec. G.722.2 (2003): This speech codec, called AMR-WB, was first published by the ETSI as ETSI TS 126 190 (2007). It corresponds to the WB version of the NB Adaptive Multi-Rate (AMR) speech codec. Both use an ACELP technique.
[3] The Active Speech Level (ASL) was calculated with the normalization tool available in ITU–T Rec. P.56 (1993).
• ITU–T Rec. G.722.1 (2005): With this codec, the short-term spectrum of the speech signal is analyzed with a Modulated Lapped Transform (MLT) technique.
• G.729EV: The G.729EV [4] speech codec corresponds to a pre-published version (Version 1.14.1, Jan. 31, 2006) of the ITU–T Rec. G.729.1 (2006) standard; this version has been used during the optimization/characterization phase. The coding algorithm proceeds through a sequence of three operations. At first, a CELP algorithm operates on the lower band (below 4 000 Hz); this stage corresponds to the NB codec ITU–T Rec. G.729 (2007). Then, the higher band (above 4 000 Hz) is treated by a Time-Domain Bandwidth Extension (TD-BWE) technique at the 14 kbit/s bit-rate. Finally, a Time-Domain Aliasing Cancellation (TDAC) algorithm is applied, at higher bit-rates, to the whole bandwidth. The coding technique in use thus depends on the bit-rate.

[4] EV stands for Embedded Variable (bit-rate).
Table 3.2 Pearson correlation coefficients, ρ, and standard deviation of the prediction errors, σ, for NB and WB conditions between auditory MOSLQSM values and WB-PESQ MOSLQOM estimations

             Database no.
    Cond.    1        2        3
ρ   All      0.923    0.818    0.913
    NB       0.967    0.921    0.914
    WB       0.833    0.648    0.927
σ   All      0.457    0.523    0.457
    NB       0.459    0.466    0.328
    WB       0.470    0.692    0.564
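For reference, the two performance figures used throughout these tables can be computed as follows (Python; a sketch which assumes σ is the RMS of the per-condition errors, whereas Sec. 5.1.3.2 gives the exact definitions used in this book):

```python
import numpy as np

def pearson_rho(mos_aud, mos_est):
    """Pearson correlation coefficient between auditory and estimated MOS."""
    return float(np.corrcoef(mos_aud, mos_est)[0, 1])

def prediction_error_sigma(mos_aud, mos_est):
    """Standard deviation of the prediction errors, taken here as the RMS
    of the per-condition errors (assumption for illustration)."""
    e = np.asarray(mos_aud, dtype=float) - np.asarray(mos_est, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))
```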
[Fig. 3.3 WB-PESQ MOSLQOM estimations and auditory MOSLQSM values for each WB speech codec and the reference conditions (Ref., i.e. bandpass and MNRU conditions) in Test 3.]

Figure 3.3 depicts the WB-PESQ evaluation in Test 3: it shows the auditory MOSLQSM values and the WB-PESQ MOSLQOM estimations for each WB speech codec and for the reference conditions (i.e. MNRU and “clean”, i.e. non-degraded, conditions). It highlights that the evaluation by WB-PESQ of the quality of the ITU–T Rec. G.722 speech codec and of the reference conditions is more accurate than those of the ITU–T Rec. G.722.1, ITU–T Rec. G.722.2 and G.729EV speech codecs. From these results, the WB conditions are classified into two groups as follows:

Group 1: The bandpass (including the “clean” condition) and MNRU conditions, and the ITU–T Rec. G.722 speech codec.

Group 2: The other speech codecs, i.e. ITU–T Rec. G.722.1, ITU–T Rec. G.722.2 and G.729EV.
Figure 3.3 suggests a systematic underestimation of the auditory MOSLQSM values for the conditions of Group 2, especially the ITU–T Rec. G.722.2 and the
G.729EV speech codecs. Figure 3.4 deals with the relationship between the auditory MOSLQSM values and the WB-PESQ MOSLQOM estimations of Test 3 under Group 2 conditions. It clearly shows that the relationship between the estimated and the auditory MOS values is not alike for male and female talkers: the WB-PESQ estimations for female talkers are significantly lower than the auditory judgments, whereas the scores for the male talkers fall near the diagonal line.

[Fig. 3.4 Relationship between WB-PESQ MOSLQOM estimations and auditory MOSLQSM values for male and female talkers in Test 3 (coders from the second group only).]

The comparison of the estimated and the auditory MOS values thus showed that the accuracy of the WB speech quality estimations by WB-PESQ depends on the coding scheme: its estimations are accurate for the ITU–T Rec. G.722 codec and the reference conditions, whereas it underestimates the quality delivered by the other codecs under test, i.e. the codecs of Group 2. In addition, this problem seems to be talker-dependent. The speech codecs included in Group 2 (i) are mainly based on LPC analysis and (ii) use a perceptual weighting filter, which simulates the frequency masking properties of the human auditory system, see Sec. 1.1.2. In conclusion, the more a codec makes use of the properties of the
human speech production and auditory perception systems, the poorer the estimation of the corresponding effects by the intrusive model WB-PESQ (ITU–T Rec. P.862.2, 2005). Based on these observations, the origin of these discrepancies in the WB-PESQ estimations will be identified in the next section.
3.1.1.4 Detailed analysis

To evaluate the WB-PESQ accuracy, a single condition, the ITU–T Rec. G.722.2 speech codec at 8.85 kbit/s, was selected from the three databases. The two stimuli in use were produced by a male and a female talker. Under these conditions, the auditory score was MOSLQSM = 4.50 for the female talker against MOSLQSM = 4.37 for the male one, whereas the estimated WB-PESQ score was MOSLQOM = 2.34 for the female talker against MOSLQOM = 3.90 for the male one. Figures 3.5 and 3.6 show the frame disturbance vector, the asymmetrical frame disturbance vector, and the audible power of these speech signals [5], over all 32 ms long frames, for the female and the male talker, respectively.

[5] For a detailed description of the disturbance vector, see Sec. 3.1.1.1 or ITU–T Rec. P.862 (2001) Sec. 10.2.11.

[Fig. 3.5 Disturbance over frame index for the ITU–T Rec. G.722.2 (2003) codec at 8.85 kbit/s, female talker (32 ms long frames). (a) Frame disturbance. (b) Asymmetrical frame disturbance. (c) Corresponding audible power.]

[Fig. 3.6 Disturbance over frame index for the ITU–T Rec. G.722.2 (2003) codec at 8.85 kbit/s, male talker (32 ms long frames). (a) Frame disturbance. (b) Asymmetrical frame disturbance. (c) Corresponding audible power.]

The asymmetrical disturbance used to evaluate the introduction of new time-frequency components in the degraded signal is usually more prominent during silent intervals (Beerends, 1994). In addition, Fig. 3.5(c) shows that, in silent intervals, the audible power of the female reference speech signal is higher than that of the male talker (50 dB and 40 dB, respectively). Since the noise floor levels in the male and female degraded signals are still alike (−72 dBov), this indicates a change made by the WB-PESQ auditory model in the noise floor of the female degraded signal in the silent intervals. Moreover, the asymmetrical frame disturbance for the female talker is significantly higher than the one for the male talker, which may lower the MOSLQOM score. This effect was noticed under several conditions for the speech codecs of Group 2 and led to an amplified difference in the WB-PESQ estimations between male and female talkers. Figure 3.7 shows the partial compensation of the linear frequency response of the ITU–T Rec. G.722.2 codec at 8.85 kbit/s for male and female talkers. The spectrum vectors were calculated by an aggregation of the power-time-pitch representations over the periods of active speech (see the explanation in Sec. 3.1.1.1). Among the cells of these power-time-pitch representations, only those whose level was at least 20 dB above the absolute hearing threshold were taken into account. This compensation was calculated from the ratio of the degraded signal spectrum to the reference signal spectrum. However, the time-pitch components used for the aggregation corresponded to linear values (i.e. not dB values); these calculations therefore emphasize high-level components. Figure 3.7 shows an amplification of about 20 dB for the pitch component z = 3 (i.e. 60–100 Hz) for the female talker and z = 49 (i.e. 7 000–8 000 Hz) for both talkers. Further to the filtering of the WB-PESQ reference signals with a P.341 filter, only a
small part of their energy falls outside the WB bandwidth, i.e. 50–7 000 Hz. A spectrum analysis of the degraded signals shows that, during the speech coding process, the frequency components above 7 000 Hz are amplified by the ITU–T Rec. G.722.2 coder for both voices, as are the components below 120 Hz for the female voice (see Fig. 3.8). The same effect is observed with the ITU–T Rec. G.722 codec above 7 000 Hz. These new components mainly correspond to coding artifacts, which seem to have a reduced perceptual effect (i.e. noise masked by speech). Due to the presence of lower frequency components in male voices (the fundamental frequency F0 of a male voice is usually near 100 Hz, see Sec. 1.1.1), the amplification seen in Fig. 3.7(a) for the pitch component z = 3 only appears with female voices. In the auditory model, the partial compensation of the linear frequency response of the system under study (see above) leads to an amplification of the reference signal over the whole speech sample, including silent intervals. In addition, the compensation of the gain variation in successive frames (see Sec. 3.1.1.1 and ITU–T
Rec. P.862 (2001) Sec. 10.2.7) is calculated over the entire pitch scale. Thus, the amplified frequencies corresponding to z = 3 in the reference input signal introduce an amplification of the disturbance densities over silent intervals. This overestimated disturbance is introduced by an amplification of the noise floor in silent intervals. It follows that, as seen in Fig. 3.5(b), the “asymmetrical” disturbance, which quantifies the introduction of new time-frequency components, is quite high in silent intervals.
3.1.2 The Modified WB-PESQ

The considerations presented in the last section led us to propose modifications of the WB-PESQ algorithm prior to the evaluation of the new version of the model, the “Modified WB-PESQ”.
[Fig. 3.7 Compensation of the linear frequency response of the ITU–T Rec. G.722.2 (2003) codec at 8.85 kbit/s: frequency response (dB) over pitch (z). (a) Female talker. (b) Male talker.]
[Fig. 3.8 Spectra of the reference and degraded speech files (level in dB over pitch). (a) Female talker. (b) Male talker.]
3.1.2.1 Proposed modifications

To preserve the NB estimations provided by the PESQ quality model, modifications in the WB-PESQ perceptual model were restricted to the frequencies outside the
NB bandwidth 300–3 400 Hz, i.e. z ∉ 8–41. First, the reference and the degraded signals should both be filtered to remove frequencies outside the WB bandwidth, i.e. 50–7 000 Hz. This solves the previously identified problem of the new components within the 7 000–8 000 Hz frequency band; the flat input filter of the WB-PESQ then simulates a high-quality listening device (e.g. a headphone). However, even though a listener could use a high-quality terminal with a flat frequency response over the whole audible frequency range, a filtering step with a P.341 filter was included in the pre-processing part of the WB-PESQ. Such a filter simulates the frequency response of an ideal hands-free WB terminal (ITU–T Rec. P.341, 2005) and ensures that the introduction of frequency components above 7 000 Hz (i.e. z ≥ 48) by speech processing systems (e.g. a WB speech codec) does not result in an over-amplification of periods of speech pause by the WB-PESQ perceptual model. Overestimated degradations such as those depicted in Fig. 3.5(b) are thus prevented. Figure 3.9 shows the partial compensation (with P.341 filtering) of the linear frequency response of the condition used in the previous section, i.e. ITU–T Rec. G.722.2 at 8.85 kbit/s. A comparison with Fig. 3.7(a) clearly evidences the changes in the component z = 49. However, the component z = 3 is still quite high, since the P.341 filter only cuts off frequencies lower than 50 Hz. The characteristics of the input filter are directly related to the reference speech signal used in intrusive quality models. In an electrical recording, the reference speech file is collected after the microphone of the talker’s terminal (see Sec. 2.3.2). In the WB case, the reference speech signal is defined by a bandwidth of 50–7 000 Hz. Usually, the P.341 filter is applied to the source speech signal in order to simulate the impact of the microphone. However, the problem identified in Sec. 3.1.1.4 is introduced by some WB speech codecs that do not respect the WB bandwidth. In conclusion, the reference speech signal should be a source speech signal with the full bandwidth up to the Nyquist frequency (i.e. 8 000 Hz for a sampling frequency of fS = 16 000 Hz).
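As an illustration of the proposed pre-filtering step, a rough stand-in could be implemented as follows; the actual ITU–T Rec. P.341 mask is specified by tabulated tolerance limits, so the Butterworth design below is only an approximation:

```python
from scipy.signal import butter, sosfiltfilt

def wb_prefilter(x, fs=16000, lo=50.0, hi=7000.0, order=8):
    """Hypothetical stand-in for the pre-filtering step proposed above:
    restrict a WB-PESQ input signal to the 50-7000 Hz WB band."""
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)   # zero-phase: no extra delay is introduced
```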
[Fig. 3.9 Compensation of the linear frequency response of the ITU–T Rec. G.722.2 (2003) codec at 8.85 kbit/s for the female talker when both input signals of WB-PESQ are filtered with the P.341 filter.]

[Fig. 3.10 Asymmetrical disturbance over frame index for the G.729EV codec at 32 kbit/s with 3% packet loss, female talker (32 ms long frames). (a) Original WB-PESQ. (b) Modified WB-PESQ.]

Auditory MOSLQSM values show that a distortion at low frequencies has a slight impact on the integral speech quality. In some auditory tests, conditions with band-
width restrictions to lower frequency components (i.e. frequencies below 200 Hz) are preferred to the WB bandwidth. This preference is likely context-dependent (i.e. monotic or diotic listening situation). For instance, a preference for a restricted bandwidth appears in monaural situations, as reported by Raake (2006b). There are two possible explanations:

1. a perception of low frequencies (e.g. lower than 100 Hz) as unnatural at one single ear,
2. an amplification of frequency masking in a monaural listening situation compared to a diotic one.

These observations led us to apply a frequency degradation weighting function during the aggregation over pitch of both the disturbance density matrix and the asymmetrical disturbance density matrix (see Sec. 3.1.1.1 or ITU–T Rec. P.862 (2001) Sec. 10.2.11), so as to minimize the contribution of frequencies lower than 150 Hz to the frame disturbance. In addition, the partial compensation of the system frequency response was calculated only for the components with a power-time-pitch representation at least 30 dB above the absolute hearing threshold, so as to reduce the impact of low-level noise components in the linear frequency compensation. Thus, the amplification by WB codecs seen in Fig. 3.8 was not considered in the frequency compensation calculation. Moreover, as the linear frequency response is calculated over periods of active speech, it was not compensated during silent intervals. Figure 3.10 compares the asymmetrical frame disturbance vector produced by the
original WB-PESQ model with the one issued from the modified model under the same condition (the G.729EV codec at 32 kbit/s, with 3% of lost packets).
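The modified aggregation can be sketched as follows; the linear taper below 150 Hz is a hypothetical choice, since the text only requires that low-frequency contributions be minimized, and the silence-emphasis factor M(l) of Eq. (3.5) is omitted for brevity:

```python
import numpy as np

def weighted_frame_disturbance(D, W, band_center_hz, cutoff_hz=150.0):
    """Sketch of the modified pitch aggregation of Eq. (3.5) with a
    frequency degradation weighting g(z) fading out bands centred below
    ~150 Hz. D has shape (frames, bands); W and band_center_hz, (bands,)."""
    g = np.clip(np.asarray(band_center_hz) / cutoff_hz, 0.0, 1.0)
    return np.cbrt(np.sum((g * W * D) ** 3, axis=1))   # per-frame disturbance
```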
3.1.2.2 Evaluation of the Modified WB-PESQ

The Modified WB-PESQ was tested on the three speech databases introduced in Sec. 3.1.1.3. Figure 3.11 compares the auditory MOS values with their estimations by the Modified WB-PESQ in Test 3 and shows that, for the codecs of Group 2, the MOSLQOM estimations are consistent with the auditory MOSLQSM values. Figure 3.12 highlights the reduction of the differences between male and female talkers, evidenced by the decrease of σ for both talkers, especially for female talkers in the case of the codecs of the second group. In addition, Table 3.3 gives the values of the Pearson correlation coefficient, ρ, and of the prediction error, σ, obtained between the auditory MOSLQSM values and the MOSLQOM estimations by the Modified WB-PESQ. For NB conditions, it is worth noting the increase of the σ values, despite the lack of change in the ρ values between the original and the Modified WB-PESQ. Under these conditions, the WB-PESQ model performances are thus slightly degraded by the changes in the algorithm: the non-compensation of the transmission system frequency response during silent intervals leads to an under-estimation of the MOSLQSM values. A possible improvement may be to estimate one frequency response over the periods of active speech and another one over the silent intervals. The analysis shows that, for WB conditions, the accuracy is enhanced by the modifications made in the intrusive WB-PESQ model; on the other hand, they have led to degraded performances on NB conditions.
[Fig. 3.11 Modified WB-PESQ MOSLQOM estimations and auditory MOSLQSM values in Test 3.]
Table 3.3 Pearson correlation coefficients (ρ) and prediction errors (σ) for NB and WB conditions between auditory results and Mod. WB-PESQ estimations

             Database no.
    Cond.    1        2        3
ρ   All      0.906    0.803    0.949
    NB       0.933    0.667    0.914
    WB       0.946    0.596    0.948
σ   All      0.485    0.533    0.375
    NB       0.562    0.466    0.348
    WB       0.257    0.621    0.407
The instrumental speech quality model WB-PESQ (ITU–T Rec. P.862.2, 2005) proved to under-estimate the perceived quality delivered by several WB codecs, especially those with a complex coding process such as the CELP technique. In addition, under specific WB conditions, the estimations for speech signals with a female voice were worse than those relative to male voices. The detailed study of the transformation in the perceptual model of WB-PESQ highlighted discrepancies in the partial compensation of the linear frequency response of WB codecs. The changes made in the WB-PESQ algorithm, proposed in Côté et al. (2006), led to an enhancement of the quality estimations under WB conditions. Finally, the differences observed here between the MOSLQOM estimations and the auditory MOSLQSM values allowed us to highlight different sources of concern and provided evidence of the difficulty of estimating, with a single model, the integral speech quality of different types of impairments such as bandwidth restrictions and coding artifacts.

[Fig. 3.12 Relationship between WB-PESQ MOSLQOM estimations and auditory MOSLQSM values for male and female talkers in Test 3 (coders from Group 2).]
3.2 Application of WideBand intrusive quality models

The previous section described some improvements to apply to a WB intrusive model in order to estimate the integral speech quality more precisely. Speech quality models can indeed be used to quantify the quality of speech processing systems such as coding algorithms. However, such instrumental scores depend on the exper-
imental context; a direct comparison of PESQ and WB-PESQ estimations is, for instance, not possible. This section describes the application of a normalization procedure to the quality scores estimated by several WB intrusive models. This procedure is used to derive “absolute” quality parameters on a one-dimensional scale, which enables a direct quantification and comparison amongst highly different conditions. The normalization procedure is based on ITU–T Rec. P.834 (2002), introduced in Sec. 2.2.6.2, which has been extended in the present case to WB connections. It produces parameters to be used with the E-model network planning model (ITU–T Rec. G.107, 1998). Section 3.2.1 introduces the E-model extension to WB transmissions. Section 3.2.2 presents the normalization procedure recommended in ITU–T Rec. P.833.1 (2008), which uses auditory quality scores. The contribution of this section is presented in Sec. 3.2.3: three WB intrusive models are evaluated using a normalization procedure adjusted to WB conditions. These adjustments led to a new ITU-T standard, recently published as ITU–T Rec. P.834.1 (2009).
3.2.1 WideBand version of the E-model

The network planning model called E-model, described in Sec. 2.3.1.3 and published as ITU–T Rec. G.107 (1998), is widely used by network planners to predict the quality of future networks. However, its transmission rating scale, the R-scale, ranges from R = 0 to R = 100, where the highest quality value (R = 100) corresponds to a NB transmission. A WB E-model is currently under development in the ITU-T. ITU–T Del. Contrib. COM 12–28 (2005) includes an exhaustive list of the input parameters needed for the WB version of the E-model. For instance, Annex G of ITU–T Rec. P.79 (2007) describes the reference frequency characteristics of a WB transmission path; this annex can be used to measure WB Loudness Ratings (LRs), see Sec. 2.3.1.1. The rest of this section presents WB extensions of several E-model input parameters.
3.2.1.1 Quality improvement for WideBand transmission

The first underlying question for a WB extension of the network planning E-model is the quality improvement between a “clean” NB and a “clean” WB transmission. In ITU–T Contrib. COM 12–11 (1993), this quality improvement was estimated on the MOS scale: a maximum quality difference of 1.5 MOS was reported when the stimuli are played at the optimum listening speech level. The improvement on the R-scale, however, was first reported in ITU–T Del. Contrib. COM 12–28 (2005) using auditory tests carried out by Barriac et al. (2004). In order to derive the quality of a WB clean condition on the extended R-scale, a combination of a specific auditory test set-up and a quality score analysis is applied. This procedure makes use of the “corpus effect” introduced in Sec. 2.2.5.3: pairs of auditory tests in which the same NB conditions, e.g. 25 in Barriac et al. (2004), are judged in a purely NB
context and in a mixed-band context (i.e. including NB and WB conditions). The NB context provides MOSLQSN values whereas the mixed-band context provides MOSLQSM values. The MOS values of both tests are mapped to the R-scale using the relationship given in ITU–T Rec. G.107 (1998) [6]. Then, a relationship may be estimated using the R-values for the common NB conditions. In ITU–T Del. Contrib. COM 12–28 (2005), this relationship is described by the following curvilinear function:

$$R_{NB} = a \times \left[\exp\left(\frac{R_{NB/WB}}{b}\right) - 1\right], \qquad (3.10)$$

with a = 169.38 and b = 176.32. This relationship shows by which amount the R-scale has to be extended to be coherent with the usual RNB values. A maximum value of RNB = 129 is obtained using an input value of RNB/WB = 100 (i.e. the best quality in a mixed-band context). This value represents a quality improvement of 29% when migrating from NB to WB. Raake (2006b) used an equivalent procedure with the auditory test results provided by Côté (2005). In this study, 9 common NB conditions were used, and the relationship between both contexts is described by two first-order polynomial functions:

$$R_{NB} = a \times R_{NB/WB} + b\,, \qquad (3.11)$$

$$R_{NB} = a \times R_{NB/WB}\,, \qquad (3.12)$$

with a = 1.371 and b = −5.981 in Eq. (3.11), and a = 1.245 in Eq. (3.12), where the line is forced to go through the origin. These results give a quality improvement of 31.1% (3.11) and 24.4% (3.12). A third study, using 30 NB conditions, has been reported in ITU–T Del. Contrib. COM 12–149 (2006): a quality improvement of 29.5% has been derived with (3.10) and of 34.1% with (3.12). Over the three studies reported in this section, 29% seems to be a coherent value for the quality improvement of WideBand transmissions compared to Narrow-Band transmissions. In addition, in the rest of this book, the relationship between a NB and a mixed-band context is assumed to be linear. Consequently, the ITU-T defines a value of Rmax = 129 on the extended R-scale for a “clean” WB transmission. This value has been published in App. II of ITU–T Rec. G.107 (1998). The recommended expansion method is the following linear expansion:

$$R_{NB} = 1.29 \times R_{NB/WB}\,. \qquad (3.13)$$

[6] RNB refers to R-values in a NB context. RNB/WB refers to a mixed-band context and thus covers both NB and WB bandwidths.
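Equations (3.10) and (3.13) are easily reproduced (a sketch):

```python
import math

def extend_r_curvilinear(r_nbwb, a=169.38, b=176.32):
    """Eq. (3.10): curvilinear R-scale extension; R_NB/WB = 100 maps to
    roughly R = 129 on the extended scale."""
    return a * (math.exp(r_nbwb / b) - 1.0)

def extend_r_linear(r_nbwb):
    """Eq. (3.13): linear expansion finally recommended by the ITU-T."""
    return 1.29 * r_nbwb

print(extend_r_curvilinear(100.0))   # ~129.3
print(extend_r_linear(100.0))        # 129.0
```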
3.2.1.2 Quality degradation from WideBand speech codecs

WB connections have been introduced in modern speech transmission networks such as packet-switched networks. In such networks, the speech signal is encoded
by WB low bit-rate speech codecs (see Sec. 1.3.4). However, these codecs introduce audible degradations which impact the quality of the overall transmission chain. These audible degradations clearly depend on both the bit-rate and the coding principle. During the network planning process (i.e. the first phase of the quality loop, see Table 2.7), a value quantifying the “integral” quality of this particular element is of interest. The quality degradations introduced by speech coding algorithms are quantified using auditory test methods, resulting in MOSLQSN values. These MOSLQSN values are then transformed to the R-scale, and a one-dimensional parameter describing the audible degradation introduced by the speech codec can be derived from the R-values. The paradigm assumed by the E-model corresponds to the “additivity property” of audible degradations on the transmission rating scale (R-scale): the quality R of the overall transmission can be calculated by subtracting the impairment factors I from an “optimum” quality value R0, see Eq. (2.3). On the R-scale, the degradation introduced by speech processing algorithms such as a speech codec is quantified by an equipment impairment factor Ie; an increasing Ie value reflects a decreasing transmission quality. The Ie value is defined as the difference between the R-value corresponding to the “optimum” NB condition and the R-value corresponding to the codec under study. The optimum NB condition corresponds to a standard ISDN connection. An ISDN channel uses an ITU–T Rec. G.711 (1988) speech codec; however, the degradation introduced by this logarithmic PCM codec is considered by the E-model as negligible compared to the other impairments (e.g. circuit noise), and the equipment impairment factor of ITU–T Rec. G.711 (1988) has thus been set to Ie = 0. Appendix I of ITU–T Rec. G.113 (2007) lists the Ie values for a set of standard NB speech codecs. According to the additivity property of the R-scale, the audible degradation introduced by two speech codecs connected in series (i.e. a tandem of codecs A and B) should be equal to the sum of the individual degradations. Such tandem conditions appear when a signal is transmitted through two networks using different speech codecs (e.g. from PSTN to GSM). The overall impairment is thus defined as:

$$I_{e,A*B} = I_{e,A} + I_{e,B}\,. \qquad (3.14)$$
Assuming the highest value of Rmax = 129 on the extended R-scale, WB equipment impairment factors Ie,WB can be derived for different WB speech codecs. These are relative to a “clean” WB transmission. The corresponding optimum WB condition R0 has not been defined yet. In the rest of this book, an optimum WB channel is defined as a clean condition, i.e. a 50–7 000 Hz frequency bandwidth and a linear PCM coding scheme (16 bit quantization). Thus, the optimum NB channel obtains an Ie,WB value corresponding to 129 − 93.2 = 35.8. In order to compare NB and WB speech codecs on a common quality scale, the resulting Ie,WB value for NB speech codecs is defined as:

$$I_{e,WB} = 35.8 + I_e\,, \qquad (3.15)$$
where Ie corresponds to the impairment factor expressed in a NB context. However, using this simple relationship, the “additivity property” is no longer valid: using (3.14) and (3.15), a tandem condition with two NB speech codecs A and B results in Ie,WB,A*B = 2 × 35.8 + Ie,A + Ie,B, which leads to an overestimated overall impairment factor. A more accurate Ie,WB estimation for such tandem conditions has been proposed by Wältermann and Raake (2008). The authors assume that any equipment impairment factor can be decomposed into a linear and a residual part:

$$I_{e,WB} = I_{bw} + I_{res}\,. \qquad (3.16)$$
The linear part, expressed by Ibw, reflects a bandwidth impairment. The residual part, expressed by Ires, corresponds to the non-linear distortion introduced by the speech processing system. In case of a tandem condition of two speech codecs A and B, the resulting Ie,WB,A*B is thus defined as:

$$I_{e,WB,A*B} \approx \max\left(I_{bw,A},\, I_{bw,B}\right) + I_{res,A} + I_{res,B}\,, \qquad (3.17)$$

where max(Ibw,A, Ibw,B) corresponds approximately to the Ibw of the codec having the smallest bandwidth. However, this relationship assumes that no interaction exists between both components. Using (3.16), the constant value in (3.15) can be seen as a linear bandwidth impairment of Ibw = 35.8, introduced by a NB transmission in a WB context: indeed, the NB and the WB clean conditions used to calculate this constant value do not introduce non-linear distortions. Provisional Ie,WB values for a set of WB codecs are listed in ITU–T Rec. G.113 Amend. 1 (2009). These have been derived in both “monotic” and “diotic” listening modes; the listening mode has shown an impact on the auditory results and the ranking order (Nagle et al., 2007).
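Equation (3.17) can be sketched as follows; the function name is illustrative:

```python
def ie_wb_tandem(ibw_a, ires_a, ibw_b, ires_b):
    """Eq. (3.17): approximate WB impairment of two codecs in tandem.
    The bandwidth part is governed by the narrower codec (the larger Ibw)
    and the residual, non-linear parts add up (sketch)."""
    return max(ibw_a, ibw_b) + ires_a + ires_b

# Two NB codecs (Ibw = 35.8 each): the 35.8 is counted only once,
# avoiding the overestimation of the naive combination of (3.14)-(3.15).
print(ie_wb_tandem(35.8, 10.0, 35.8, 5.0))   # 50.8
```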
3.2.1.3 Packet-loss degradation

The equipment impairment factor Ie, as defined in the previous section, quantifies the degradation of coding and decoding. In addition to the coding/decoding process, which reduces the network rate necessary to transmit the data, other speech processing algorithms which may be included in the coder and/or decoder have to be considered. These algorithms (e.g. VAD or PLC) have been defined in Sec.s 1.3.4.2 and 1.3.4.3. In packet-based networks, packet losses may occur, either because packets do not arrive at the listener’s side or because packets have to be discarded by the receiving buffer management algorithm. The audible degradations introduced by these errors on the synthesized signal (i.e. at the listener’s side) depend on both the network and the strategy used by the speech codec. The error pattern corresponds to the list of received and lost packets over the entire signal. It can be defined by the percentage of lost packets and their
distribution over the whole pattern (i.e. random or bursty loss). However, the packet length has an influence on the perceptual degradation; this packet length depends on the number of coding frames included in one packet and on the frame length. A packet-loss concealment method can then be used in order to reduce the audible degradation (see Sec. 1.3.4.3). Network planners are interested in quality predictions for a known percentage of packet loss and a known error distribution. Raake (2006a) proposes to adjust the Ie value towards an Ie,eff which takes the transmission errors into account. This is done in the E-model (ITU–T Rec. G.107, 1998) in the following way:

$$I_{e,\mathit{eff}} = I_e + (95 - I_e) \cdot \frac{P_{pl}}{\frac{P_{pl}}{\mathit{BurstR}} + B_{pl}}\,, \qquad (3.18)$$
where Ie,eff is the “effective” equipment impairment factor under packet loss, Ie the equipment impairment factor in the error-free case, and Ppl the percentage of lost packets; the constant factor 95 approximates the R-value of the “optimum” NB channel. The parameter Bpl, called the packet-loss robustness factor, quantifies the robustness of the speech codec against packet loss: when Bpl increases, and especially when a PLC algorithm is employed, the degradation introduced by packet loss is less audible. Bpl values for several speech codecs are defined in ITU–T Rec. G.113 (2007), e.g. Bpl = 4.3 for ITU–T Rec. G.711 (1988). Equation (3.18) is an extension of the random loss model introduced in Sec. 2.3.1.3 (ITU–T Del. Contrib. COM 12–44, 2001). In (3.18), the Burst Ratio (BurstR) defines the packet-loss distribution. The burst ratio is defined as:

$$\mathit{BurstR} = \frac{\text{Average number of consecutively lost packets}}{\text{Average number of consecutively lost packets for ``random'' loss}}\,. \qquad (3.19)$$
This is modeled in the E-model by a 2-state Markov model (loss and no-loss states). In case of BurstR > 1, the packet-loss distribution is considered as bursty; however, in case of a random distribution (BurstR = 1), Eq. (3.18) converges to the random loss model, i.e. Eq. (2.5). Different methods have been developed to extend (3.18) to WB transmissions. The first model has been proposed by Möller et al. (2006). The authors assume that Ie,WB,eff may be approximated by:

$$I_{e,WB,\mathit{eff}} = I_{e,WB} + (129 - I_{e,WB}) \cdot \frac{P_{pl}}{P_{pl} + B_{pl}}\,, \qquad (3.20)$$
which is an equation similar to (2.5), introduced in Sec. 2.3.1.3, and thus corresponds to a random loss distribution. In (3.20), the constant factor has been set to approximate the R-value of the “optimum” WB channel, i.e. 129 instead of 95 in (2.5). Raja et al. (2008) propose three other relationships to model packet-loss degradations in a WB context. These relationships model both random and bursty loss distribu-
tions. However, they have been developed using WB-PESQ estimations instead of auditory quality scores. In ITU–T Contrib. COM 12–23 (2009), three different relationships have been evaluated, including (3.20). In this study, it has been found that the WB effective equipment impairment factor is better approximated by:

$$I_{e,WB,\mathit{eff}} = I_{e,WB} + (95 - I_{e,x}) \cdot \frac{P_{pl}}{P_{pl} + B_{pl}}\,, \qquad (3.21)$$

where Ie,x depends on whether a NB or a WB codec is being used:

$$I_{e,x} = \begin{cases} I_e\,, & \text{for a NB codec;} \\ I_{e,WB}\,, & \text{for a WB codec.} \end{cases} \qquad (3.22)$$
Equation (3.21) has been published in ITU–T Rec. G.107 Appendix II (2009). Bpl,WB values for several standard WB speech codecs are listed in ITU–T Rec. G.113 Amend. 1 (2009), for a diotic listening mode only.
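The packet-loss relationships of Eqs. (3.18)–(3.22) can be sketched as follows (Python; function names are illustrative, and the burst-ratio estimator assumes the "mean observed burst length over mean random burst length" reading of Eq. (3.19)):

```python
def burst_ratio(loss_pattern):
    """Eq. (3.19): mean observed loss-burst length divided by the mean
    burst length expected under random loss at the same rate, which is
    1 / (1 - p) for a loss probability p."""
    losses = sum(loss_pattern)
    if losses == 0:
        return 1.0
    bursts = sum(1 for i, l in enumerate(loss_pattern)
                 if l and (i == 0 or not loss_pattern[i - 1]))
    p = losses / len(loss_pattern)
    return (losses / bursts) * (1.0 - p)

def ie_eff_nb(ie, ppl, bpl, burst_r=1.0):
    """Eq. (3.18): NB effective equipment impairment under packet loss."""
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)

def ie_wb_eff(ie_wb, ppl, bpl_wb, ie_x):
    """Eqs. (3.21)-(3.22): WB effective impairment; ie_x is the NB Ie for
    a NB codec, or ie_wb itself for a WB codec."""
    return ie_wb + (95.0 - ie_x) * ppl / (ppl + bpl_wb)

# G.711-like example: Ie = 0, Bpl = 4.3, 2% random loss.
print(ie_eff_nb(0.0, 2.0, 4.3))   # ~30.2
```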
3.2.2 Methodology for the derivation of Ie,WB and Bpl,WB

The equipment impairment factor (Ie,WB) values defined in the previous section reflect the quality degradation introduced by speech codecs. Since these Ie,WB values are derived from auditory tests, they also reflect the test context. A normalization procedure used to reduce these test-specific effects is described in the following section.
3.2.2.1 Methodology description

A methodology to derive Ie,WB and Bpl,WB values has been developed by Möller et al. (2006) and standardized by the ITU-T as ITU–T Rec. P.833.1 (2008). The methodology applied in ITU–T Rec. P.833.1 (2008) is an extension to the WB context of the previously normalized ITU–T Rec. P.833 (2001) procedure developed for the NB context. This methodology is based on subjects’ quality judgments coming from listening-only tests. In such tests, the different quality dimensions impaired by the speech codecs are captured in an integral quality score, referred to as the MOSLQO value. The derivation procedure then uses a combination of reference conditions included in the test corpus and a normalization procedure, briefly introduced in Sec. 2.2.6.2, which anchors the newly derived Ie values into the framework of the already published Ie values for other speech codecs (ITU–T Rec. G.113, 2007). A detailed presentation is given in this section.
3.2.2.2 Auditory experiment setup

First of all, an auditory experiment has to be carried out. The test setup has to respect the requirements defined in ITU–T Rec. P.800 (1996) and ITU–T Rec. P.830 (1996) regarding the source material, the test material (i.e. the degraded speech stimuli), the choice of the subjects, the test environment (i.e. the testing room), the instructions given to the subjects, etc. The auditory test method widely used to assess the integral quality of speech processing systems is the Listening-Only Test introduced in Sec. 2.2.3.3. This method uses a 5-point Absolute Category Rating (ACR) scale labeled with different quality attributes (see Table 2.3), usually referred to as the MOS scale. A mean over all the subjects is then calculated for each processing condition and gives the Mean Opinion Scores (MOSs).

Table 3.4 List of WideBand reference conditions without transmission errors, see Table 1 in ITU–T Rec. P.833.1 (2008)

No.  Name     Codec type           Reference                    Bit-rate (kbit/s)  Ie,WB value
1    Clean    linear PCM, 16 bits  –                            256                0
2    G.722.2  ACELP                ITU–T Rec. G.722.2 (2003)    23.05              1
3    G.722.2  ACELP                ITU–T Rec. G.722.2 (2003)    19.85              3
4    G.722.2  ACELP                ITU–T Rec. G.722.2 (2003)    15.85              7
5    G.722.2  ACELP                ITU–T Rec. G.722.2 (2003)    14.25              10
6    G.722    SB-ADPCM             ITU–T Rec. G.722 (1988)      64                 13
7    G.722.1  MLT                  ITU–T Rec. G.722.1 (2005)    32                 13
8    G.722.1  MLT                  ITU–T Rec. G.722.1 (2005)    24                 19
9    G.722    SB-ADPCM             ITU–T Rec. G.722 (1988)      56                 20
10   G.722.2  ACELP                ITU–T Rec. G.722.2 (2003)    8.85               26
11   G.722    SB-ADPCM             ITU–T Rec. G.722 (1988)      48                 31
12   G.722.2  ACELP                ITU–T Rec. G.722.2 (2003)    6.6                41
A single auditory test rarely covers the whole perceptual space defined by modern speech transmission systems. In the present case, the quality degradation introduced by a new speech codec is of interest. The methodology described in ITU–T Rec. P.833.1 (2008) recommends carrying out an auditory test in which several reference conditions are included in the test corpus in addition to the new codec. Examples of reference conditions such as MNRU or common speech codecs have been introduced in Sec. 2.2.6. Table 3.5 defines three groups of reference conditions to be included in the test corpus when evaluating a new WB speech codec. These are described hereafter.

Part A: Reference speech codecs have been selected on the basis of their coding scheme (i.e. PCM, ACELP, . . . ) and their associated quality degradation. These degradations can be very diverse among speech codecs: for instance, one codec may introduce quantization noise in the speech signal whereas the same speech signal would sound metallic with a different codec. Table 3.4 defines 12 WideBand codecs which have to be in-
cluded as reference conditions in the same test corpus as the new WB speech codec under test. They include several widely used waveform codecs like ITU–T Rec. G.722, as well as low bit-rate codecs used in VoIP and mobile networks, i.e. ITU–T Rec. G.722.2. It is also recommended to test the new codec at two alternative input speech levels such as −36 and −16 dBov [7].

Part B: An exhaustive evaluation of the new speech codec requires that 11 further reference tandem conditions be included in the test corpus, see Table 2 in ITU–T Rec. P.833.1 (2008). These tandem conditions are symmetrical: Newcodec * Refcodec and Refcodec * Newcodec.

Part C: In case the new speech codec is evaluated under transmission errors, the type of errors should represent the degradations encountered in a “live” network [8]. In addition, m reference conditions impaired by transmission errors have to be included in the test corpus. The number of such conditions is not defined in ITU–T Rec. P.833.1 (2008); however, a minimum of m = 10 conditions should be included to avoid subjects rating these conditions too severely. In order to reduce the overall number of conditions in the test corpus, the additivity check can be omitted if the new speech codec is evaluated under transmission errors.
3.2.2.3 Normalization procedure The derivation procedure uses the MOSLQSM values obtained from auditory experiments. It is performed in three or four steps, depending on whether or not the additivity of impairment factors is considered: 1. Transformation MOS to R : The MOS scale does not possess ratio characteristics that would justify the derivation procedure described in this section. Two other disadvantages of this scale are (i) the saturation effect which appears at the extremities of the scale (see Sec. 2.2.5.1), and (ii) the context-dependency of the MOSLQSM values. This step defines the transformation of the MOSLQSM values into the transmission rating scale R-scale values. Since no mapping function between the MOSLQSM scale and the extended R-scale has been standardized yet, the S-shaped mapping function defined in ITU–T Rec. G.107 (1998) initially defined for a NB context should be used. This mapping function is defined for MOS values in the range 1–4.5 MOS. However, a condition rating above 4.5 MOS is The nominal input speech level is set to −26 dBov (relatively to the digital overload of the quantizer). It is assumed that this input level maximizes the SNR of the speech codec and avoids overload at the same time. 8 Examples of transmission errors are random bit errors, random packet-loss, bursty packet-loss, and radio transmission errors in mobile networks. 7
Table 3.5 Overview of test conditions for the evaluation of a new WB speech codec, see ITU–T Rec. P.833.1 (2008)

Part A (13 test conditions): determination of Ie,WB for the new codec in error-free conditions
  References 1–12 (mandatory)
  New codec in single operation (mandatory)
  New codec in single operation, at 2 speech input levels (optional)
  Additional WB codec references (optional)
Part B (12 test conditions): additivity check
  References 13–24 (mandatory)
  New codec alone in double and triple tandem operation (mandatory)
  New codec in double and triple tandem operation with other codecs (optional)
Part C (m test conditions): determination of Ie,WB for the new codec in transmission error conditions
  New codec in single operation in different transmission error conditions (m conditions, mandatory)
  Additional references in different transmission error conditions (optional)
2. R-scale extension: Assuming a maximum value of Rmax = 129 on the extended R-scale, a linear expansion of the R-values using (3.13) should be used.
3. Derivation of raw Ie,WB values: A raw equipment impairment factor Ie,WB,aud is calculated for each processing condition (“aud” refers to the auditory test) from the resulting R-values. The Ie,WB,aud value corresponds to the difference between the “optimum” WB condition and the condition under test:

Ie,WB,aud = Roptimum − Rcondition .   (3.23)
4. Normalization: The raw Ie,WB,aud values are normalized using the reference conditions included in the same test corpus, see Table 3.4. These reference conditions should obtain the Ie,WB values defined in ITU–T Rec. G.113 (2007). The following relationship is estimated in a least-squares sense, using the pairs of auditorily derived Ie,WB,aud and expected (i.e. known) Ie,WB,exp values for the reference conditions:

Ie,WB,aud = a × Ie,WB,exp + b .   (3.24)
Using the estimated a and b values, the auditorily derived Ie,WB,aud values are normalized according to:

Ie,WB,norm = (Ie,WB,aud − b) / a .   (3.25)
The resulting normalized equipment impairment factor Ie,WB,norm for the codec under test is now coherent with the framework of already published Ie,WB values for the other speech codecs (ITU–T Rec. G.113, 2007).
5. Additivity: The additivity of the derived Ie,WB,norm value for the new speech codec can be checked by comparing the Ie,WB,norm values obtained for the tandem conditions with the sum of the Ie,WB,norm values for the individual codecs. The resulting Ie,WB,norm values for the 11 reference tandem conditions listed in Table 2 in ITU–T Rec. P.833.1 (2008) should be equal to the sum of the Ie,WB,norm value for the new speech codec and the Ie,WB,norm value for the respective reference codec.
6. Transmission errors: In this case, the normalization procedure is applied to both the Ie,WB and Ie,WB,eff values. The derivation of Bpl,WB is not included in ITU–T Rec. P.833.1 (2008); the procedure only requires a minimum consistency of the Ie,WB,eff values: an increase of the transmission error rate has to lead to an increase of the effective equipment impairment factor. However, Bpl,WB values can be estimated in a least-squares sense, using the Ppl and BurstR values to best approximate the normalized Ie,WB,eff values. The Ppl and BurstR parameter values are derived from each error pattern. Equation (3.18) can be used for the NB context, or Eq. (3.21) for the WB context.
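To make the least-squares step concrete, the following minimal Python/NumPy sketch illustrates Eqs. (3.23)–(3.25) only; the numerical values of r_optimum, r_condition and ie_exp are hypothetical placeholders, not data taken from ITU–T Rec. P.833.1 (2008):

import numpy as np

# Hypothetical R-values of reference conditions on the extended R-scale,
# together with their known (expected) impairment factors from ITU-T Rec. G.113.
r_optimum = 129.0                                   # "optimum" WB condition
r_condition = np.array([116.0, 98.0, 74.0, 47.0])   # auditorily derived R-values
ie_exp = np.array([13.0, 31.0, 55.0, 82.0])         # known Ie,WB of the references

# Step 3: raw equipment impairment factors, Eq. (3.23)
ie_aud = r_optimum - r_condition

# Step 4: least-squares fit Ie,aud = a * Ie,exp + b, Eq. (3.24),
# then normalization of the auditorily derived values, Eq. (3.25)
a, b = np.polyfit(ie_exp, ie_aud, deg=1)
ie_norm = (ie_aud - b) / a

print(ie_norm)   # close to ie_exp if the test corpus behaves consistently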
3.2.3 Derivation from WideBand intrusive models

The methodology described in Sec. 3.2.2 shows how the WB equipment impairment factor Ie,WB and the WB packet-loss robustness factor Bpl,WB can be derived from the subjects' quality judgments obtained in listening-only tests. Another approach is to replace the subjects by one or several signal-based quality measures, see Fig. 3.13. Widely used signal-based measures are the intrusive quality models. For NB speech codecs, the ITU-T published a methodology using quality estimations instead of auditory scores as ITU–T Rec. P.834 (2002). In this section, the methodology to derive Ie,WB and Bpl,WB values using WB instrumental quality models is presented. This methodology significantly reduces the cost of evaluating WB speech codecs. An analysis of these instrumentally-derived Ie,WB and Bpl,WB values will indicate whether WB intrusive models can quantify (i) the quality difference between NB and WB transmissions, (ii) the non-linear degradations introduced by WB speech codecs, and (iii) the audible degradations due to discontinuities.
Fig. 3.13 Overview of the methodology adopted by the ITU-T and described in ITU–T Rec. P.834.1 (2009): the reference x(k) and degraded y(k) signals are processed by a WB integral model; the resulting MOSLQOM values are converted into Ie,WB and Bpl,WB according to ITU–T Rec. P.834.1, which serve as input to the E-model (ITU–T Rec. G.107) to yield R
For this purpose, an experimental set-up is described in Sec. 3.2.3.1. The databases have been collected in different languages and represent degradations encountered in real NB and WB networks. The framework of this section is then similar to that of the auditory analysis in Sec. 3.2.2: (i) the R-scale is extended towards WideBand connections, (ii) equipment impairment factors Ie,WB are derived and normalized following a procedure similar to ITU–T Rec. P.833.1 (2008), and (iii) packet-loss robustness factors Bpl,WB are estimated for a range of WideBand speech codecs. The results of each step of this analysis are compared to the auditorily derived parameters. This procedure resulted in a methodology which has recently been standardized by the ITU-T as ITU–T Rec. P.834.1 (2009).
3.2.3.1 Experimental set-up

Following the auditory test set-up defined in Sec. 3.2.1.1, databases including both NB and WB conditions are used. The databases used in this study are listed in the next section and summarized in Table 3.6.
Intrusive quality models

Intrusive speech quality models are employed. Such models need pairs of speech signals (i.e. reference and corresponding degraded files). These have been processed with the WB-PESQ, Modified WB-PESQ and TOSQA-2001 models in order to
estimate either the MOSLQON or the MOSLQOM score for each speech signal pair. The characteristics of these three models have been described in Sec. 2.3.2.4.
Databases

In order to compare the instrumental approach to the auditory approach, six databases obtained from 25 auditory tests have been selected. For each database, both speech signals and auditory results are available. However, the derived Ie,WB values have to be consistent with the Ie framework for NB speech coding algorithms. For instance, Möller et al. (2006) used ITU–T Rec. G.726 (1990), ITU–T Rec. G.729 (2007) and the “optimum” NB channel including the ITU–T Rec. G.711 (1988) speech codec as NB references. Therefore, the selected databases include a selection of NB speech codecs for which Ie values are already defined in ITU–T Rec. G.113 (2007). The databases, summarized in Table 3.6, are described in App. B. Databases 1–4 have already been used for deriving the R-scale extension and Ie,WB values from auditory tests in Möller et al. (2006).

Table 3.6 Summary of the databases used for the derivation of Ie,WB and Bpl,WB from instrumental signal-based models. XP1 and XP3 stand for the two ACR listening tests included in ITU–T Suppl. 23 to P-Series Rec. (1998). These databases are described in more detail in App. B

No.  Name        Reference                                Auditory tests
1    FT-04       Barriac et al. (2004)                    NB and NB/WB
2    IKA/LIMSI   Côté (2005)                              NB and NB/WB
3    Tsukuba     ITU–T Del. Contrib. COM 12–33 (2005)     NB/WB
4    NTT         Takahashi et al. (2005b)                 NB/WB
5    FT-06       ITU–T Del. Contrib. COM 12–149 (2006)    NB and NB/WB
6    Suppl. 23   ITU–T Suppl. 23 to P-Series Rec. (1998)  XP1: 3 NB; XP3: 4 NB
Processing scenarios

In order to derive Ie,WB and Bpl,WB values, each quality model provides both MOSLQON and MOSLQOM values, according to the processing scenario. The WB-PESQ model input signals are electrically recorded in the transmission system, whereas TOSQA-2001 can also be applied at the acoustic interface of the talker's and listener's terminals. Figure 3.14 (solid lines) depicts this situation. The databases used in this study stem from ACR listening-only experiments. In order to give a realistic impression to the test participants, the sending and receiving terminals are either acoustically recorded or simulated.
Fig. 3.14 Scenarios for using instrumental models estimating the quality degradation due to codec and frame/packet loss in a network: the signal path runs from the sending terminal (IRS send or P.341 filtering) through the transmission system to the receiving terminal (IRS receive or P.341 filtering); x(k) and y(k) are fed to (Mod.) PESQ / (Mod.) WB-PESQ or TOSQA / TOSQA-2001, whose MOSLQOM estimations are converted via ITU–T Rec. P.834.1 into Ie,WB and Bpl,WB for the E-model (ITU–T Rec. G.107), together with parameters such as LR and delay
In several auditory laboratories, the listening device is a high-quality headphone. Thus, the degraded signal is usually filtered, either with an IRS Receive characteristic (NB case) or with a WB filter according to ITU–T Rec. P.341 (2005). The filtered degraded signals are used as input y(k) to both the WB-PESQ and TOSQA-2001 models. This situation is depicted by the dashed line in Fig. 3.14. Depending on the pre-processing steps applied to the databases, the signals x(k) and y(k) have either been filtered separately, or they have been sent directly to the models.
Correlations with auditory judgments

Since auditory MOSLQS values are available for the databases, the estimation accuracy of each model can be verified. This is a first requirement for the normalization procedure described in this section. WB-PESQ, Modified WB-PESQ and TOSQA-2001 have been applied to the WB and the mixed NB/WB databases. In parallel, PESQ, Mod. PESQ and TOSQA have been applied to the NB databases. In order to quantify the accuracy of each model, a third-order polynomial mapping function has been applied to the estimated MOSLQO values. This mapping function attenuates the impact of the test corpus on the subjective judgments (the so-called “corpus effect”, see Sec. 2.2.5.3), and it is commonly applied to compare results of auditory tests to the estimations of integral models. The third-order mapping function is used only for analyzing the accuracy of the models themselves; in the remainder of this chapter, the raw estimated MOSLQO values have been used for the instrumental derivation of equipment impairment factors, as auditory scores will usually not be available, and thus no corpus effects will occur. The Pearson correlation
coefficients ρ and the root mean square prediction errors σ are listed in Tables 3.7 and 3.8.

Table 3.7 Pearson correlation coefficients ρ and standard deviation of the prediction errors σ between the auditory MOSLQSM and the estimated MOSLQOM values for Databases 1–5, using different WB intrusive models

No.  Test              Stimuli  WB-PESQ      TOSQA-2001   Mod. WB-PESQ
                                ρ     σ      ρ     σ      ρ     σ
1    FT-04 NB/WB       100      0.92  0.46   0.90  0.48   0.93  0.54
     NB only           44       0.97  0.46   0.89  0.57   0.97  0.63
2    IKA/LIMSI NB/WB   36       0.82  0.52   0.73  0.87   0.82  0.51
     NB only           36       0.73  0.57   0.73  0.74   0.64  0.50
3    Tsukuba           392      0.97  0.36   0.91  0.59   0.97  0.47
     NB only           112      0.95  0.42   0.69  0.56   0.95  0.52
4    NTT               1 288    0.90  0.42   0.86  0.74   0.92  0.34
     NB only           96       0.87  0.70   0.82  0.38   0.90  0.30
5    FT-06 NB/WB       360      0.91  0.46   0.87  0.50   0.95  0.40
     NB only           360      0.91  0.33   0.80  0.54   0.92  0.34
Table 3.8 Pearson correlation coefficients ρ and standard deviation of the prediction errors σ between auditory MOSLQSN and estimated MOSLQON values for Databases 1–2 and 5–6, using different NB intrusive models

No.  Test             Stimuli  PESQ         TOSQA
                               ρ     σ      ρ     σ
1    FT-04 NB         100      0.95  0.77   0.95  0.49
2    IKA/LIMSI NB     36       0.43  1.29   0.91  0.63
5    FT-06 NB         360      0.93  0.38   0.82  0.47
6a   Suppl. 23 XP1    528      0.96  0.26   0.97  0.18
     Suppl. 23 XP3    800      0.94  0.26   0.70  0.65

a Database 6: XP1 includes three tests and XP3 includes four tests, see App. B.
The results in Table 3.7 show that both WB-PESQ and the Modified WB-PESQ reach correlations of ρ = 0.90 or higher on all databases except Database 2. The modified version usually reaches higher correlation values, although sometimes at the expense of a larger prediction error. For Database 2, the low correlation seems to be linked to the NB stimuli; the correlation for the NB stimuli of this specific database is significantly lower for WB-PESQ and Modified WB-PESQ. The correlation of the corresponding NB PESQ model is also very low on these NB stimuli, see Table 3.8. TOSQA-2001 usually shows a lower correlation than the two other models, and it has the same problems with Database 2. However, its NB
version, TOSQA, reaches a high correlation on the NB part of this database. Overall, the prediction accuracy seems to be within the range of what is expected from the statistical measures given in Beerends et al. (2002), which state an average correlation of ρ = 0.935 between auditory MOSLQSN scores and PESQ MOSLQON estimations in a NB context.
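Accuracy figures of this kind can be reproduced with a short NumPy sketch; the two input vectors below are hypothetical per-condition scores, and the root mean square error of the mapped estimates serves as the prediction error σ:

import numpy as np

def model_accuracy(mos_aud, mos_obj):
    # Third-order polynomial mapping MOS_LQO -> MOS_LQS to attenuate the corpus effect
    mapped = np.polyval(np.polyfit(mos_obj, mos_aud, deg=3), mos_obj)
    rho = np.corrcoef(mos_aud, mapped)[0, 1]           # Pearson correlation
    sigma = np.sqrt(np.mean((mos_aud - mapped) ** 2))  # prediction error
    return rho, sigma

# Hypothetical auditory and instrumental scores for six conditions
mos_aud = np.array([4.3, 3.9, 3.2, 2.6, 1.8, 1.4])
mos_obj = np.array([4.1, 4.0, 3.0, 2.2, 2.0, 1.6])
print(model_accuracy(mos_aud, mos_obj))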
3.2.3.2 Quality improvement for WB transmission

The MOS values can be transformed to the R-scale using the S-shaped relationship given in ITU–T Rec. G.107 (1998). However, this relationship transforms MOS values defined in the range 1–4.5 to the usual Narrow-Band R-scale defined in the range 0–100. All MOS values higher than 4.5 are limited to MOS = 4.5. For a WB or a mixed NB/WB context, the R-scale has to be extended in a way that leaves the NB range of the scale unaffected. In Möller et al. (2006) and in Sec. 3.2.1.1, the R-scale has been extended using pairs of auditory tests in which the same NB test stimuli were judged once in a purely NB context and once in a mixed NB/WB context. The judgments on these common stimuli define a relationship between the use of the MOS scale in a NB and in a mixed NB/WB context. On the basis of the auditory MOSLQS values of Databases 1 and 2, an average extension of the R-scale of around 29% was derived. In this section, the value of such an extension is estimated on the basis of the instrumental models described in the previous section. The applied procedure follows as far as possible the one used for the auditory test results, see Möller et al. (2006) and Sec. 3.2.1.1. However, the auditory test results are replaced by signal-based model estimations in both NB and WB operational modes:
• The auditory MOSLQSM values obtained in the test corpus with mixed NB/WB conditions are replaced by WB-PESQ, Modified WB-PESQ or TOSQA-2001 MOSLQOM estimations.
• The auditory MOSLQSN values obtained in the test corpus with NB conditions are replaced by the corresponding PESQ or TOSQA MOSLQON values.
First of all, the MOSLQO estimations of the models have to be transformed to the R-scale. As no relationship between MOSLQOM values and R-values is defined for a NB/WB context, the fixed relationship given in ITU–T Rec. G.107 (1998) has been used. As an example, the resulting RNB (PESQ estimates) and RNB/WB (original/Modified WB-PESQ estimates) values for the NB conditions of Database 6 are displayed in Fig. 3.15. The results may be fitted in different ways, see Eqs. (3.10), (3.11) and (3.12). Möller et al. (2006) used simple linear and exponential functions with one or two parameters and reached satisfying fits for the auditory results.
Fig. 3.15 Relationship between R-values derived from NB and WB instrumental models for Database 6 “Sup23 XP3”: RNB over RNB/WB with fits according to Eq. (3.12) and Eq. (3.10). a PESQ and WB-PESQ. b PESQ and Modified WB-PESQ
As a consequence, the curvilinear (3.10) and polynomial (3.11), (3.12) functions introduced in Sec. 3.2.1.1 are tried out here with the model estimations. Due to the use of the NB relationship between the MOS scale and the R-scale to derive the RNB/WB values, the maximum RNB/WB value corresponding to MOS = 4.5 (the maximum value assumed by the E-model) is still 100. In order to derive a universal R-scale which is valid in both NB and NB/WB contexts, the RNB/WB values obtained from the WB models have to be expanded. This can be achieved by applying Eqs. (3.10), (3.11) and (3.12) with the same parameter values for a and b, as shown below:

Rmax = a × 100 ,                 (3.26)
Rmax = a × 100 + b ,             (3.27)
Rmax = a × (exp(100/b) − 1) .    (3.28)
The Rmax value corresponding to RNB/WB = 100 indicates the amount by which the R-scale has to be extended in a NB/WB context in order to remain valid in the NB case. It corresponds to an R-value for a “clean” WB transmission. As an example, the extrapolations for Database 6 “Sup23 XP3” are presented in Fig. 3.15, and the averaged Rmax values for Databases 1–6 are listed in Table 3.9. Depending on the model and the database used for its derivation, the maximum value on the extended R-scale (Rmax) and the prediction error (σ) vary. The best fitting function for all three models (in the sense of a minimum σ) is Eq. (3.27). However, both the WB-PESQ and TOSQA-2001 models estimate a relatively low Rmax value. These two models are assumed to under-estimate the quality of NB conditions in a mixed-band context. As a matter of fact, WB-PESQ is not recommended for estimating the quality of NB conditions in the latest version of the application guide of the model given in ITU–T Rec. P.862.3 (2005).
Table 3.9 Average maximum values Rmax derived with different instrumental models for Databases 1–6, using different fitting functions, and corresponding prediction error indicators σ

         WB-PESQ        TOSQA-2001      Mod. WB-PESQ
Eq.      Rmax    σ      Rmax    σ       Rmax    σ
(3.26)   130.0   4.21   127.5   11.63   129.1   3.00
(3.27)   123.9   3.82   107.1   5.54    131.5   2.86
(3.28)   130.3   4.21   127.5   11.63   138.4   2.87
Thus, the low Rmax values are mainly due to under-estimations of the model, and the real Rmax value is closer to the estimations by Eqs. (3.26) and (3.28). The modified version of WB-PESQ is apparently better at predicting the respective extension, also when using Eq. (3.27). In addition, Fig. 3.15 shows that the exponential functions sometimes have a very small curvature and are quasi-linear as well. On average over all models, fitting functions and databases, the procedure leads to Rmax = 130.5, i.e. a roughly 30% extension of the R-scale when migrating from NB to WB. Interestingly, this result is very similar to the 29% extension found by Möller et al. (2006) for auditory tests, in particular when considering that four new databases have been used in the Rmax derivation. The spread of the values found for the individual models and databases (Rmax ∈ 105.8–170.7) is higher than for the values reported by Möller et al. (2006) (12–42%), which is due to the inclusion of the new databases. Still, it can be concluded that the extension of the R-scale based on the three instrumental models leads to approximately the same results as using the auditory data. Comparing the 30% extension found here to the literature, Raja et al. (2008) found an extension of only 7% with the WB-PESQ model. The authors used a linear equation following Eq. (3.11); the obtained parameters were a = 0.82 and b = 25.46. Using a linear relationship forced to go through the origin, as in Eq. (3.12), a significantly higher Rmax would probably have been found. To transfer MOS values to the expanded R-scale in the rest of this book, the existing relationship between MOS and R defined with the E-model in ITU–T Rec. G.107 (1998) is used. The resulting RNB/WB values are then multiplied by 1.29, see (3.13). This procedure is identical to the one used by Möller et al. (2006) (linear expansion) and in Sec. 3.2.1.1.
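Since the S-shaped MOS-to-R relationship of ITU–T Rec. G.107 (1998) has no closed-form inverse, the transformation is carried out numerically. A minimal sketch, assuming the standard G.107 R-to-MOS polynomial and the linear 1.29 expansion of (3.13):

def r_to_mos(r):
    # S-shaped relationship of ITU-T Rec. G.107 (NB E-model)
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def mos_to_extended_r(mos, expansion=1.29):
    # Invert the monotonic curve on [0, 100] by bisection, then expand linearly
    mos = min(max(mos, 1.0), 4.5)
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if r_to_mos(mid) < mos:
            lo = mid
        else:
            hi = mid
    return expansion * 0.5 * (lo + hi)

print(mos_to_extended_r(4.5))   # "clean" WB condition, approximately 129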
3.2.3.3 WB equipment impairment factor Ie,WB

Using the linear extension of the R-scale derived in the previous section, WideBand equipment impairment factors Ie,WB can now be estimated for both the NB and WB
codecs included in Databases 1–5. Such Ie,WB values have been defined as the difference between the “optimum” WB channel, involving no other degradations than the ones caused by linear PCM coding (16 bit quantization), and the channel involving the codec under study. For the NB codecs, this should result in an Ie,WB value which corresponds to the sum of the Ie value defined for the NB case in ITU–T Rec. G.113 (2007) and the difference between the WB and the NB “optimum” channels, the latter having a position of R = 93.2 on the R-scale (default E-model parameters), see Eq. (3.15). The procedure employed in this section follows the one used in Sec. 3.2.2.1, replacing the auditory test results by estimations of WB-PESQ, Modified WB-PESQ and TOSQA-2001. The calculation involves three steps:
1. Transformation MOS to R: The MOSLQO values estimated with WB-PESQ, Modified WB-PESQ and TOSQA-2001 are transformed to the extended R-scale using the linear extension (3.13). In case values higher than 4.5 MOS are estimated by an instrumental model, all values of the corresponding database are linearly compressed to the range 1–4.5 assumed by the E-model, using Eq. (2.1), prior to the transformation.
2. Derivation of raw Ie,WB values: Raw equipment impairment factors Ie,WB are calculated from the R-values as the difference between the “optimum” WB condition and the codec condition under test:

Ie,WB,ins = Roptimum − Rcondition .   (3.29)
3. Normalization: The raw Ie,WB,ins values still reflect the database they have been derived from, in terms of speakers and sentence material. In order to reduce this influence, a normalization procedure has been described in Sec. 3.2.2.1 for the auditory derivation of impairment factors, which is now recommended as ITU–T Rec. P.833.1 (2008). A similar approach is to use pairs of instrumentally-derived Ie,WB,ins values and known Ie,WB values for several reference conditions. A linear relationship between the known and the unknown Ie,WB values according to Eq. (3.24) is then estimated in a least-squares sense. Using the estimated a and b values, the instrumentally derived Ie,WB,ins values are normalized according to:

Ie,WB,norm = (Ie,WB,ins − b) / a .   (3.30)
This procedure was applied to Databases 1–5, since Database 6 contains only NB conditions. Figure 3.16 shows an example of the normalization procedure for Database 1. This figure presents the known Ie,WB values and the corresponding instrumentally-derived Ie,WB,ins values obtained from the Modified WB-PESQ.
Overall results

Table 3.10 gives an example of the individual Ie,WB values which are obtained from WB-PESQ estimations. Table 3.11 lists the average Ie,WB values for the WB codecs using the three instrumental models, as well as the average values obtained using the methodology of ITU–T Rec. P.833.1 (2008) and the auditory MOS values for the same databases. An inspection of Table 3.10 shows that there is a spread in the Ie,WB values derived from the different databases by WB-PESQ. The same phenomenon is observed in the estimations from the Modified WB-PESQ and TOSQA-2001 models, which are not reproduced here in order to save space. The spread is due to the different voices and sentences used in each database, and to slight differences in the basic quality (slight noise floor and frequency bandwidth differences). It is slightly smaller than the spread observed in the auditorily derived Ie,WB values in Möller et al. (2006). The differences in the judgments of different test panels are ruled out by the instrumental model. However, this does not imply any superiority of the instrumental approach: as the instrumental models only aim at estimating what would have been observed in an auditory test, the latter can still be regarded as the reference for our approach.
Fig. 3.16 Exemplary normalization procedure for Database 1, with known Ie,WB and instrumentally-derived Ie,WB,ins values obtained from the Modified WB-PESQ (stars) as well as auditorily-derived Ie,WB,aud values (circles)
In some cases, the normalization step (3) results in negative Ie,WB values. The corresponding test conditions apparently obtain a very high rating compared to the other conditions of that database. Still, the third step (normalization) is used in order to bring the Ie,WB values in line with the Ie values known for the NB codecs, so that the principle is equally applicable to both NB and WB conditions. The last column of Table 3.10 shows that this objective is generally reached by the methodology. Table 3.11 shows that the third step (normalization) leads to averaged Ie,WB values for the NB conditions, estimated with all three instrumental models, which are close to those obtained from the auditory tests.
Table 3.10 Impairment factor values Ie,WB for WB and NB speech codecs, derived on the basis of WB-PESQ estimations and averaged over Databases 1–5. Values in the last column have been calculated according to Eq. (3.15) using the defined values of ITU–T Rec. G.113 (2007)

Band  Codec      Bit-rate (kbit/s)   Average   Expected from G.113
WB    Clean      256                 −13.1     0
      G.722      64                  20.8      13
      G.722      56                  25.5      20
      G.722      48                  32.5      31
      G.722.1    32                  28.0      13
      G.722.1    24                  30.9      19
      G.722.2    6.6                 63.6      41
      G.722.2    8.85                49.5      26
      G.722.2    12.65               33.0      13
      G.722.2    14.25               30.3      10
      G.722.2    15.85               29.2      7
      G.722.2    18.25               26.1      5
      G.722.2    19.85               25.7      3
      G.722.2    23.05               16.7      1
      G.722.2    23.85               27.8      8
      G.729EVa   32                  15.3      –
      G.729EVa   24                  19.0      –
NB    G.711      64                  25.6      36
      G.726      32                  43.3      43
      G.726      24                  59.7      61
      G.726      16                  82.2      86
      G.728      16                  45.2      43
      G.729      8                   48.4      46
      GSM-EFR    12.2                42.0      41
      GSM-FR     13                  60.8      56

a G.729EV refers to the pre-published version of the ITU–T Rec. G.729.1 (2006) standard.
For the WB conditions, the averaged Ie,WB values of Table 3.11 vary significantly between the models. The correlations with the auditorily derived values are:
• ρ = 0.850 for WB-PESQ (σ = 14.25),
• ρ = 0.890 for Modified WB-PESQ (σ = 12.35), and
• ρ = 0.908 for TOSQA-2001 (σ = 11.33).
The correlation for Ie,WB is thus higher with TOSQA-2001 and the modified version of WB-PESQ. It can still be increased by averaging the estimations of the three instrumental models, leading to a correlation of ρ = 0.928 (σ = 10.06) between instrumentally and auditorily derived Ie,WB values. Comparing the results to the values recently defined in ITU–T Rec. G.113 (2007), the correlations are:
• ρ = 0.865 for WB-PESQ (σ = 11.48),
• ρ = 0.934 for Modified WB-PESQ (σ = 8.17), and
• ρ = 0.956 for TOSQA-2001 (σ = 6.72).
Table 3.11 Average impairment factor values Ie,WB for WB and NB speech codecs, derived using different instrumental models and auditory tests. Values in the column “Auditory test” have been calculated from the auditory results of the same databases, and values in the last column have been calculated according to Eq. (3.15) using the defined values of ITU–T Rec. G.113 (2007)

                              Average Ie,WB value
Band  Codec     Bit-rate   WB-    TOSQA-  Mod.      Average  Auditory  Expected
                (kbit/s)   PESQ   2001    WB-PESQ            test      from G.113
WB    Clean     256        −13    −18     −36       −22      −15       0
      G.722     64         21     −6      0         5        10        13
      G.722     56         31     −5      7         11       25        20
      G.722     48         38     7       18        21       33        31
      G.722.1   32         28     −2      13        13       12        13
      G.722.1   24         31     14      21        22       15        19
      G.722.2   6.6        64     34      44        47       40        41
      G.722.2   8.85       51     19      25        32       29        26
      G.722.2   12.65      33     10      3         16       5         13
      G.722.2   14.25      30     −8      12        11       −1        10
      G.722.2   15.85      29     −10     17        12       2         7
      G.722.2   18.25      26     −13     6         6        −12       5
      G.722.2   19.85      26     −15     6         6        −8        3
      G.722.2   23.05      17     −5      −15       −1       −16       1
      G.722.2   23.85      28     4       12        14       7         8
      G.729EVa  32         15     9       −18       2        13        –
      G.729EVa  24         19     21      −12       9        13        –
NB    G.711     64         26     28      29        28       33        36
      G.726     32         44     38      45        42       49        43
      G.726     24         60     48      60        56       76        61
      G.726     16         82     77      81        80       81        86
      G.728     16         43     41      44        43       45        43
      G.729     8          48     56      46        50       41        46
      GSM-EFR   12.2       42     42      43        42       35        41
      GSM-FR    13         61     53      60        58       57        56

a G.729EV refers to the pre-published version of the ITU–T Rec. G.729.1 (2006) standard.
This results in a global average correlation of ρ = 0.957 (σ = 6.63) for the three models. The overall range of these correlations shows that, on average, Ie,WB values can be estimated quite reliably with the instrumental approach. However, there are significant differences between the estimations of the individual models. First, a shift of about 21 units can be observed for the Ie,WB of WB codecs between the WB-PESQ and its modified version. The modified frequency compensation of Modified WB-PESQ (see Sec. 3.1.2.1) probably leads to an overestimation of the noise floor degradation for the NB conditions. The Ie,WB values obtained from the Modified WB-PESQ are normalized to the defined values in Step 3 (Normalization),
which leads to the shift observed for the WB conditions, and to a particularly low value (−36) for the clean WB condition. Still, the correlation between auditorily and instrumentally derived Ie,WB is higher than with the original WB-PESQ model. Overall, the Ie,WB values derived from WB-PESQ are consistently higher than the ones found in Möller et al. (2006) for all WB codecs (except for the clean condition, because of the normalization).
Predictions for different codecs of one family

When decreasing the bit-rate, the ranking of the Ie,WB values for each codec follows the ranking of the expected values. With very few exceptions, the order of the degradations associated with codec variants of the same family is correctly predicted by the models. The exceptions are due to the fact that not all bit-rates were included in all tests; as a consequence, some of the values are based on one or two databases only. These exceptions are a strong limitation of the presented method: the derived Ie,WB values should be stable over the different databases, and thus values based on one database only should have the same accuracy as values based on many databases. This is not the case in Table 3.10, due to the different number and type of codecs included in each test corpus. Therefore, ITU–T Rec. P.834.1 (2009) recommends including a minimum of 12 reference codecs in such calculations in order to derive stable Ie,WB values.
Predictions for codecs from different families

Unfortunately, the relationship between the Ie,WB values for codecs from different families is not necessarily reflected in the estimations. Thus, the procedure may be used to establish a quality relationship between different versions of the same codec family, but not necessarily to compare the quality of different codecs belonging to different families. This is an important finding which limits the applicability of the derivation procedure.
Summary

The overall magnitude of the Ie,WB values derived with the help of TOSQA-2001 seems to be best in line with the auditory method. By averaging the estimations of the three models, reliable estimations of Ie,WB for WB codecs can be obtained. The corresponding values for the NB codecs show that the methodology produces values which are well in line with the impairment factors Ie of the current E-model.
3.2.3.4 WB packet-loss robustness factor (Bpl,WB)

The E-model is able to estimate the degradation introduced by transmission errors in packet-switched networks, see Sec. 3.2.1.3. Databases 1, 3, 4 and 5 contain stimuli in which random packet losses have been simulated using the model given in ITU–T Rec. G.191 (2005). These samples have been analyzed with the three instrumental models, and the packet-loss robustness factor Bpl,WB has been calculated by minimizing the prediction error between the relationship described by Eq. (3.21) and the data points, i.e. the normalized Ie,WB,eff values and packet-loss percentages Ppl. Since only random packet losses have been applied, BurstR was set to 1. Normalized Ie,WB,eff values are limited to a minimum of 0 to avoid strongly negative values. Figure 3.17 presents the Ie,WB,eff values derived from the three instrumental models and the auditory experiment for an example codec included in Database 5, together with the estimated relationship according to Eq. (3.21). Table 3.12 lists the obtained Bpl,WB values and compares them to the auditorily derived values, following the same procedure. Each Bpl,WB value is estimated using four or five packet-loss conditions (Ie,WB,eff/Ppl) and then averaged over 1 to 3 databases.

Table 3.12 Packet-loss robustness factors Bpl,WB derived from Databases 1, 3, 4 and 5 (random packet losses). “Conditions” refers to the number of packet-loss conditions used to derive the Bpl,WB parameters

                                           Average Bpl,WB value
Codec     Bit-rate   Databases          WB-    TOSQA-  Mod.      Average  Auditory
          (kbit/s)   (conditions)       PESQ   2001    WB-PESQ            test
G.722     64         3(5), 4(5)         2.8    8.5     3.1       4.8      1.6
G.722.1   32         3(5), 4(5), 5(4)   6.3    9.7     5.5       7.1      6.3
G.722.1   24         1(3), 3(5), 4(5)   8.4    13.2    8.0       9.9      6.2
G.722.2   6.6        4(5)               7.8    15.4    7.8       10.3     6.6
G.722.2   8.85       4(5), 5(4)         5.8    11.3    5.7       7.6      5.6
G.722.2   12.65      4(5), 5(4)         5.5    11.2    5.7       7.5      5.2
G.722.2   14.25      4(5)               8.1    14.0    7.9       10.0     7.5
G.722.2   15.85      3(5), 4(5)         7.9    11.2    7.3       8.8      7.8
G.722.2   18.25      4(5)               7.9    14.9    7.9       10.3     7.6
G.722.2   19.85      4(5)               7.9    15.8    8.0       10.6     8.1
G.722.2   23.05      4(5)               7.9    15.7    7.9       10.5     9.5
G.722.2   23.85      3(5), 4(5), 5(4)   5.6    10.0    5.6       7.1      5.2
G.729EVa  32         5(4)               5.2    13.9    9.0       9.4      9.4
G.729EVa  24         5(4)               5.6    16.0    8.4       10.0     10.0

a G.729EV refers to the pre-published version of the ITU–T Rec. G.729.1 (2006) standard.
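Assuming that Eq. (3.21) has the same form as the packet-loss term of the E-model (ITU–T Rec. G.107), Ie,WB,eff = Ie,WB + (95 − Ie,WB) · Ppl / (Ppl/BurstR + Bpl,WB), the robustness factor can be fitted in a least-squares sense as sketched below; the loss rates and Ie,WB,eff values are hypothetical:

import numpy as np
from scipy.optimize import curve_fit

IE_CODEC = 11.0   # hypothetical error-free Ie,WB of the codec under study

def ie_eff(ppl, bpl, burst_r=1.0):
    # Assumed E-model form: Ie,eff = Ie + (95 - Ie) * Ppl / (Ppl / BurstR + Bpl)
    return IE_CODEC + (95.0 - IE_CODEC) * ppl / (ppl / burst_r + bpl)

# Hypothetical normalized Ie,WB,eff values at several random-loss rates (BurstR = 1)
ppl = np.array([0.0, 1.0, 3.0, 5.0, 10.0])
ie_obs = np.array([11.0, 23.0, 41.0, 53.0, 66.0])

(bpl,), _ = curve_fit(ie_eff, ppl, ie_obs, p0=[5.0])   # fits only bpl
print(f"Bpl,WB = {bpl:.1f}")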
A comparison with the values obtained from the auditory scores shows that the Bpl,WB values derived with TOSQA-2001 are usually too high. This model seems to underestimate the packet-loss degradation and predicts a lower impact of the corresponding audible effect (and consequently higher Bpl,WB values).
Fig. 3.17 Ie,WB,eff values derived from the three instrumental models (PESQ, TOSQA, Mod. PESQ) and the auditory experiment (Aud.) as a function of the packet-loss percentage Ppl, for G.722.2 at 8.85 kbit/s
In Fig. 3.17, the Ie,WB,eff values derived from TOSQA-2001 are lower than the auditorily derived ones. This difference increases at high packet-loss percentages. By contrast, Fig. 3.17 and Table 3.12 show that WB-PESQ, and in particular its modified version, provide reasonable estimations of the values obtained in the auditory tests (with the exception of the G.729EV codec for WB-PESQ). The correlation coefficients between the Bpl,WB values derived from auditory scores and the model estimations are:
• ρ = 0.54 for WB-PESQ,
• ρ = 0.79 for TOSQA-2001, and
• ρ = 0.90 for Modified WB-PESQ.
Especially the Modified WB-PESQ seems to be adequate for the instrumental procedure of deriving packet-loss robustness factors for later use with the E-model. Moreover, the correlation coefficient for the WB-PESQ model increases to ρ = 0.89 without the two G.729EV codec conditions.
3.2.3.5 Discussion

A method has been proposed to derive equipment impairment factors Ie,WB and packet-loss robustness factors Bpl,WB for WB speech codecs. In order to avoid costly and time-consuming auditory tests, and to be in line with the corresponding methods available for NB codecs (ITU–T Rec. P.834, 2002), the method described in this section relies on the estimations of instrumental signal-based models. The derived Ie,WB values can be used in conjunction with a future WB extension of the E-model, in order to plan future mixed NB/WB networks. The proposed methodology has recently been approved by Study Group 12 as the new ITU–T Rec. P.834.1 (2009), emphasizing the need for such a standardized method to determine wideband speech codec degradations. Using instrumental signal-based models, the following results have been found:
1. When migrating from NB to WB, the extension to be made to the NB R-scale is around 30%, which is very similar to the value found with auditory tests. Thus, both auditory and instrumental methods make use of the same scale range and show an equivalent context dependency. This is an important pre-requisite for deriving input parameters to the E-model with the help of both auditory and instrumental methods.
2. The Ie,WB values derived with the three models (i.e. WB-PESQ, Mod. WB-PESQ and TOSQA-2001) are generally consistent with their auditorily determined counterparts, with an overall correlation ρ within the range 0.85–0.91. The best estimations are not obtained from the current ITU-T standard WB-PESQ, but from a simple modification of it, or from TOSQA-2001. When averaging the three models' estimates, the correlation increases slightly, showing that a combination of models is able to rule out some of the insufficiencies associated with each of them. With few exceptions, the models are able to predict the degradations associated with different bit-rates of the same codec family in the right rank order. However, the models are not always able to predict the relationship of the degradations associated with different codecs (i.e. coding techniques) in the right way.
3. Applying the method further to derive packet-loss robustness factors Bpl,WB leads in many cases to meaningful predictions, but not for all models. While the Modified WB-PESQ and, to a smaller extent, the unmodified version of this model provide a reasonable estimation of the codec robustness, TOSQA-2001 mainly overestimates Bpl,WB.
The observed correlations lead to the conclusion that instrumental models are quite useful for estimating meaningful input values for the E-model. A rough estimation of Ie,WB can be obtained with Modified WB-PESQ or TOSQA-2001, or by averaging different models' estimations. Still, the predictions should be used with some care when ranking different codec families with respect to their impact on overall “mouth-to-ear” quality. However, using the right model (in terms of accuracy with respect to the auditory scores), a prediction of the impact of the codec bit-rate within one codec family can be made in most cases. Further work is necessary to better quantify the quality impact of codec tandems. Estimating the robustness of a particular codec towards packet loss is possible for most codec families, using e.g. the modified version of WB-PESQ. For Ie,WB values, averaging the different models slightly increases the estimation accuracy. The method proposed here is expected to provide better results when more accurate instrumental signal-based models are available. Such models should especially focus on the relationship between the degradations introduced by different types of codecs. In addition, they need to reliably estimate the impact of packet losses, which seems to be underestimated by some of the currently available models. A new signal-based model which may be used for this purpose is described in Chap. 4.
As soon as better instrumental models for estimating the codec and packet-loss impact become available, the method described in this section can be re-assessed in the light of the new results. The normalization procedure may benefit from more Ie,WB values being available. Recently, values for WB speech codecs have been defined in ITU–T Rec. G.113 (2007) and added to the procedure described here, leading to the new ITU–T Rec. P.834.1 (2009). Still, more reference values for further WB speech codecs are necessary.
3.3 Summary and Conclusion

WideBand instrumental models have been developed to quantify the integral quality of speech transmitted over WB networks. However, the model currently standardized by the ITU-T, called WB-PESQ, shows several limitations. In the first part of this chapter, some modifications have been introduced to improve the reliability of WB-PESQ. They result in a modified version of the WB-PESQ model. The modifications introduced in Sec. 3.1.2.1 improve the perceptual model of WB-PESQ. However, PESQ has difficulties predicting specific packet-switched networks introducing time-varying delay. Other studies have focused on the time-alignment algorithm of PESQ. For instance, Malfait et al. (2008) evaluated the PESQ model for the quality estimation of speech transmission systems introducing continuously variable delay. This specific effect, called “time-warping”, has a slight perceptual impact since it limits audible discontinuities. However, such effects are hardly covered by PESQ. Malfait et al. (2008) showed that accurate instrumental assessments of systems introducing time-warping effects require that the error in the estimated time-delay does not exceed 5 ms. A modification of the PESQ time-alignment algorithm has been proposed by Shiran and Shallom (2009). The resulting model, called “Enhanced PESQ”, uses a completely different time-alignment algorithm based on a sequence matching technique called Dynamic Time Warping (DTW). Both modifications lead to better estimations than the standard model ITU–T Rec. P.862 (2001). Since the present chapter focused on instrumental models estimating the integral quality of speech samples, three intrusive quality models have been used to quantify the audible degradation introduced by WB speech codecs and transmission errors such as packet loss. For this purpose, the framework of the parameter-based E-model has been used in parallel to a normalization procedure described in Sec. 3.2.2, see ITU–T Rec. P.833.1 (2008). This procedure rules out some test-specific effects. The methodology was initially developed for auditory test results; here, a similar approach using quality estimations has been used. Firstly, the same quality value of the “clean” WB transmission has been found using both approaches. It results in a 30% expansion of the transmission rating scale, the R-scale. However, the relationship between the NB and NB/WB contexts was found to be linear with intrusive models, whereas several studies based on auditory tests show a non-linear
relationship (Raake, 2006b). In addition, the audible degradations introduced by WB and NB speech codecs have been quantified by WB intrusive models. This degradation is reported in terms of the WB equipment impairment factor Ie,WB. Six databases including several WB speech codecs have been used. A comparison between the auditorily and instrumentally derived Ie,WB values shows that a quality comparison between codecs using different coding techniques is not recommended. The E-model is able to predict the effects of lost or discarded packets in a WB context using a packet-loss robustness factor Bpl,WB. An instrumental estimation of Bpl,WB values seems possible, but not with all WB intrusive signal-based models. The derivation methodology described in Sec. 3.2.3 was the basis of the newly published recommendation ITU–T Rec. P.834.1 (2009). However, the current WB intrusive quality models are not yet fully valid for all possible WB speech transmission scenarios. Additional developments are necessary, especially because in a WB context strong interactions appear between effects such as bandwidth and background noise. Apart from speech codecs and transmission errors, other degradations need to be taken into account by a future optimum WB intrusive model. For example, preliminary studies have shown that the impact of non-optimum listening levels (Côté et al., 2007) and of background noise (Raake et al., 2010) on NB and on WB transmissions is different and is hardly estimated by quality models.
Chapter 4
Diagnostic Instrumental Speech Quality Model
An instrumental measure should provide a correct ranking of various speech processing systems. Even though new instrumental methods have been developed for the perceptual assessment of VoIP transmissions, none of the current models correctly estimates the degradations introduced by all in-use telecommunication systems. The reliable assessment of very different connections (i.e. either a clean WB or a noisy NB stimulus) by a single universal instrumental measure is not yet possible with existing models. The current ITU-T standards show some limits in their quality estimations. These are mainly caused by the growing complexity of network topologies and their speech processing components. These limits have been detailed in the previous chapter. The ITU-T launched a new standardization program called Perceptual Objective Listening Quality Analysis (POLQA) to select a new intrusive speech quality model (ITU–T Temp. Doc. TD.52 Rev.1, 2007). This selection process is detailed in Sec. 5.1.3. A new model, called Diagnostic Instrumental Assessment of Listening quality (DIAL), developed as part of this standardization program, is presented in this chapter. This model relies on a specific framework: it combines a Core model, based on TOSQA, and four dimension estimators. The Core model assesses the non-linear degradations introduced by a speech transmission system, whereas the dimension estimators quantify the linear degradations on the four perceptual dimensions defined in Sec. 1.4: Coloration, Loudness, Discontinuity and Noisiness. Then, an aggregation of all the degradations simulates the cognitive process employed by human subjects during the quality judgment process. In accordance with the development procedure of an instrumental model introduced in Sec. 2.3, this chapter presents the second step, i.e. the results of auditory tests are already available and a candidate measure has been developed using those results. The validation process of this candidate model is presented in Chap. 5.
4.1 Scope

The future ITU-T standard POLQA is expected to reliably predict the integral “speech transmission quality”. The output value of this model reflects the subject's judgment obtained in listening quality experiments such as those described in ITU–T Rec. P.800 (1996) and ITU–T Rec. P.830 (1996), see Sec. 2.2.3.3. However, the model does not assess the conversation effectiveness of a transmission system, i.e. degradations such as long transmission delay and echo at the talker's side are not covered by this new model. In addition, this new model estimates the integral speech quality and thus has to take into account all features of the perceived quality. The ITU-T recommends an ACR test using the 5-point scale presented in Table 2.3 to assess this specific quality value. In such tests, short speech signals ranging from 6 to 12 s are listened to by subjects. Since DIAL simulates an “average” subject involved in such a test, the output quality score of this new model is given in terms of the “Mean Opinion Score–Listening Quality Objective” (MOSLQO).
Table 4.1 Evolution of the scope of the intrusive speech quality models standardized by the ITU-T and the POLQA model

Recommendation              Model     Context   Scope
ITU–T Rec. P.861 (1996)     PSQM      NB        Circuit-switched and GSM networks
ITU–T Rec. P.862 (2001)     PESQ      NB        P.861 + packet-switched networks (e.g. VoIP)
ITU–T Rec. P.862.2 (2005)   WB-PESQ   WB        Same as P.862
–                           POLQA     S-WB      (expected) P.862.2 + 3rd generation networks, advanced speech processing technologies, acoustic interfaces, hands-free applications
Table 4.1 shows the scope of the PSQM, PESQ and WB-PESQ models. The expected scope of POLQA, and therefore of the DIAL model, is presented in the same table and detailed in this section. It includes the following telecommunication networks:
• circuit-switched networks such as PSTN and ISDN,
• mobile networks such as GSM, Code Division Multiple Access (CDMA) and third generation networks (i.e. UMTS),
• packet-switched networks such as VoIP, and their possible interconnections,
and the following speech processing systems:
• commonly used ITU-T and ETSI speech codecs (e.g. AMR-WB) and other coding technologies such as the EVRC family codecs,
• speech enhancement devices such as Noise Reduction (NR), Automatic Gain Control (AGC), Comfort Noise Generation (CNG), and their possible combinations.
The scope of the PESQ model is restricted to electrical capture of the speech signal. The scope of the new standard includes the
acoustic paths (at both the talker's and listener's sides) in the entire speech transmission system. In this case, the speech signals are captured acoustically using either an artificial ear or a complete Head-And-Torso Simulator (HATS) (in case of hands-free scenarios). The following distortions (possibly not audible), which appear in the transmission scenarios described above, should be covered by the POLQA model (distortions not covered by the current ITU-T standards are pointed out by the term New):
• Single and tandemed speech codecs as used in telecommunication scenarios today,
• Packet losses and concealment systems used in packet-switched connections,
• Frame- and bit-errors which appear in wireless connections,
• Interruptions (such as un-concealed packet loss or handover in GSM networks),
• Front- and end-clipping (i.e. temporal clipping) due to VAD systems,
• Overload, saturation (i.e. amplitude clipping),
• Effects of NR systems (e.g. musical noise) and echo cancellers on clean speech (New),
• Effects of NR systems (adaptation phase and converged state) and echo cancellers on noisy speech (New),
• Effects of speech coding algorithms on noisy speech,
• Variable delay,
• Time-warping (New),
• Gain variations,
• Influence of linear distortions (also time-variant),
• Non-linear distortions produced by the transducer at acoustic interfaces (New),
• Reverberation caused by HFT in defined acoustic environments (New),
• Non-optimum listening levels (New).
4.2 Overview of the DIAL model

The framework of the DIAL model is presented in Fig. 4.1. Intrusive models usually estimate the integral quality of the degraded speech files. DIAL follows the assumption that the combination of several specialized measures is more efficient than a single complex measure. DIAL includes four building blocks:

Pre-processing: This block aligns x(k) and y(k) in time, and simulates the listening device used in the corresponding auditory test.

Core model: It estimates the non-linear degradations introduced mainly by speech processing systems such as low bit-rate codecs.

Dimension estimators: They estimate the impact of linear degradations, such as the amount of noise in the degraded speech signal.

Judgment model: An aggregation of all the quality feature estimations into an integral quality score simulates the processes taking place in the human listener during the quality judgment process.
As shown in Fig. 4.1, the DIAL model is intended to quantify the integral quality MOSLQO of a transmission system and to diagnose the corresponding impairment on four perceptual dimensions: MOScol, MOSdis, MOSnoi and MOSloud. This “degradation decomposition” relies on the four perceptual dimensions coloration, loudness, discontinuity and noisiness. Several known diagnostic measures have been introduced in Sec. 2.3.2.5. However, these measures use a restricted speech quality space. Using the selected four dimensions (see Sec. 1.4), this new model has the advantage of assessing the conditions under study within the speech quality space used by listeners. In Sec. 4.1, the scope of the DIAL model is described and compared to the previous ITU-T standard quality model (ITU–T Rec. P.862, 2001). This model is able to assess the speech transmission quality of a wide range of transmission systems including mobile networks, packet-switched networks and several specific VoIP applications. In Sec. 4.3, the different pre-processing steps are described. These steps include a time-alignment algorithm, based on the one used in PESQ and updated for time-warping conditions. Some examples of transmission networks introducing variable delay are outlined in this third section. Then, the DIAL model combines a core model and several feature estimators. Section 4.4 outlines the core model used in the DIAL model.
Fig. 4.1 Overview of the DIAL model: the reference x(k) and degraded y(k) signals pass through the pre-processing block; the dimension estimators (Col, Dis, Noi, Lou) and the Core model operate on the pre-processed signals, and the judgment model combines their outputs into MOScol, MOSdis, MOSnoi, MOSloud and the integral MOSLQO
This core model carries out the perceptual transformation of the two speech signals. An estimation of the non-linear degradation is computed, based on a comparison of the perceptually transformed reference and degraded signals. Such non-linear degradations are introduced by speech processing systems such as coding algorithms and voice quality enhancements. The core model is mainly based on the intrusive model TOSQA described in Sec. 2.3.2.4. Then, Sec. 4.5 describes the four dimension estimators used to diagnose the degraded speech signal. Section 4.6 defines how the four perceptual estimators are combined with the core model to result in a robust intrusive speech quality model. The DIAL model is thus able to reliably predict very different types of conditions. The DIAL model allows the following operational modes:
• Narrow-Band mode: Since telephony communications are still mainly NB, this mode uses speech signals sampled at fS = 8 kHz. In addition, acoustic recordings are not considered in this mode. Only acoustic insertions (i.e. at the talker's side only) are allowed.
• Super-WideBand mode: In order to cover the newly-introduced S-WB transmissions, this mode uses speech signals sampled at fS = 48 kHz and covers three bandwidths: Narrow-Band, WideBand and Super-WideBand.
4.3 Pre-processing

The following sections present the different pre-processing steps of DIAL. As depicted in Fig. 4.2, these steps include a time-alignment algorithm, the computation of speech signal levels and the simulation of the receiving terminal.
4.3.1 Active Speech Level (ASL)

The first stage of the DIAL model is the calculation of the Active Speech Level (ASL). This value corresponds to the digital level in dB relative to the overload of the input speech signal, where only the speech periods are taken into account. For this purpose, a relatively simple VAD algorithm is used. The ASL values are then used in the Coloration estimator, see Sec. 4.5.1. They are also used to normalize the input signals to a pre-defined value; in general, the digital value of −26 dBov is used, which avoids overloaded samples and maximizes the SNR.
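A crude ASL computation and the −26 dBov normalization can be sketched as follows; this energy-threshold approximation is only an illustration standing in for the model's own VAD (it is not the ITU–T P.56 method), and the 40 dB activity criterion is an assumption:

import numpy as np

def active_speech_level(x, frame=256):
    # Frame powers in dB relative to digital overload, assuming samples in [-1, 1)
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    power = np.mean(frames ** 2, axis=1) + 1e-12
    power_db = 10.0 * np.log10(power)
    active = power_db > power_db.max() - 40.0       # simple activity criterion
    return 10.0 * np.log10(np.mean(power[active]))  # ASL in dBov

def normalize_level(x, target_dbov=-26.0):
    # Scale the signal so that its ASL reaches the pre-defined -26 dBov
    gain_db = target_dbov - active_speech_level(x)
    return x * 10.0 ** (gain_db / 20.0)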
Fig. 4.2 Diagram of the DIAL pre-processing steps: ASL computation and VAD on both x(k) and y(k), crude delay estimation κ (based on PESQ), fine delay estimation κl, time-warping detection, and IRS receive / flat filtering of both signals
4.3.2 Time-alignment

In the 1990s, one limitation of speech quality models was the wrong alignment of the reference signal with the degraded signal. Such a wrong alignment results in an underestimated speech quality compared to the quality perceived by a user. Time-delay estimation is challenging for packet-based transmission systems, where the system introduces a delay which may be variable over the length of the speech signal. The framework of the time-alignment algorithm included in DIAL is based on the delay estimation algorithm developed by Rix et al. (2002) and described in the following section. This algorithm, used in the PESQ model, has been selected among several time-delay estimators because of its good delay estimation for a wide range of applications: it is robust against coding distortions and linear filtering. Rix et al. (2002) combine an envelope-based crude delay estimation and a fine-scale delay estimation using a weighted histogram of the frame-by-frame delay. However, this time-delay estimator shows some discrepancies for several speech processing systems including a specific distortion called “time-warping”, as described by Malfait et al. (2008). In usual networks, the delay changes in discrete steps, whereas time-warping effects introduce a continuously variable delay. The DIAL time-alignment algorithm has been updated to cover these specific distortions: it includes a time-warping detection and compensation algorithm which is presented in Sec. 4.3.2.5.
4.3.2.1 Initialization

For instrumental models, the most relevant delay corresponds to the delay between the envelope of the reference signal and the envelope of the degraded signal. This delay is usually estimated using a cross-correlation method (either using windowed frames or the whole signal). Such a method gives unbiased delay estimates in case of linear-phase systems (i.e. all frequencies have the same delay). However, the delay is generally not constant over the whole frequency range. This effect appears when the speech signal is transmitted by a system having a non-linear phase response. Such time-alignment algorithms may estimate the delay of the frequency range in which the signals have the highest energy instead of the effective delay. This highest energy mainly lies in the fundamental frequency (F0) and the formants. Consequently, both input signals are processed through a filter which attenuates the frequencies outside the range 500–3 000 Hz. The same process is used in both the NB and the S-WB mode of the DIAL model. Unfiltered signals are saved to be processed by the core model and by the dimension estimators.
4.3.2.2 Voice Activity Detector

In order to align the reference speech signal x(k) with the degraded speech signal y(k), the intrusive quality model has to detect whether a frame includes speech, silence or noise. The following algorithm is applied to both input speech signals and is detailed here for the degraded speech signal y(k). The output of the VAD algorithm corresponds to a vector of SNR values, xvad(g) and yvad(g), for each 4 ms non-overlapping frame g.
PESQ VAD algorithm

Firstly, the envelope Ey(g) of the degraded speech signal is calculated (see Fig. 4.3). It corresponds to the power of the filtered signal. Then, a Voice Activity Detector (VAD) algorithm is applied on the envelope in order to detect the speech parts and the silent periods. For this purpose, a "threshold" value Ey,thresh is used to detect non-speech frames. It corresponds to the mean energy over all frames:

E_{y,thresh} = \frac{1}{G} \sum_{g=1}^{G} E_y(g) ,   (4.1)

where g = 1, ..., G is the frame index and G the number of frames. Then, the VAD used in the PESQ model assumes a maximum dynamic range ΔE of the envelope Ey(g), such that:

E_{y,max} = \max_g E_y(g) , \qquad E_{y,min} = E_{y,max} - \Delta E ,   (4.2)
Fig. 4.3 Example envelope of a speech signal E(g) with its corresponding threshold level Ethresh and estimated noise level Enoi
where the maximum dynamic range ΔE is set to 40 dB. All frames having an energy lower than Ey,min are limited to the minimum energy Ey,min:

E_y(g) = \begin{cases} E_{y,min}, & \text{if } E_y(g) < E_{y,min} ; \\ E_y(g), & \text{otherwise} . \end{cases}   (4.3)

On the basis of Ey,thresh the frames are detected as speech or noise/silence. A noise level Ey,noi is computed as the mean energy over all noisy/silence frames. In case of time-varying background noise, noisy frames can be detected as active speech frames. In order to avoid such wrong detections, Ey,noi is amplified. This assumption produces a VAD which is robust against background noise. The threshold level Ey,thresh and noise level Ey,noi are depicted in Fig. 4.3. Then, the output of the PESQ VAD algorithm yvad(g) is calculated using the energy in each frame Ey(g) and the threshold value:

y_{vad}(g) = \begin{cases} 0, & \text{if } E_y(g) < E_{y,noi} ; \\ \ln\left( \dfrac{E_y(g)}{E_{y,noi}} \right), & \text{otherwise} . \end{cases}   (4.4)
A value of 0 corresponds to either a silent or a noisy frame. Fig. 4.4(b) presents the output vector yvad (g) using the PESQ VAD algorithm for a noisy speech file. The corresponding envelope Ey (g) of this noisy speech signal is shown in Fig. 4.4(a).
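The following Python sketch illustrates the envelope-threshold logic of Eqs. (4.1)–(4.4). The amplification factor applied to Ey,noi is not specified in the text, so the noise_boost value here is an assumption.

```python
import numpy as np

def pesq_style_vad(E, delta_E_db=40.0, noise_boost=1.5):
    """Sketch of the envelope-threshold VAD of Eqs. (4.1)-(4.4).

    E: per-frame envelope energies (linear). noise_boost stands in for
    the unspecified amplification of the noise level E_noi.
    """
    E = np.asarray(E, dtype=float)
    E_thresh = E.mean()                            # Eq. (4.1)
    E_min = E.max() / 10 ** (delta_E_db / 10)      # Eq. (4.2), linear domain
    E = np.maximum(E, E_min)                       # Eq. (4.3)
    noise = E < E_thresh
    E_noi = noise_boost * E[noise].mean() if noise.any() else E_min
    return np.where(E < E_noi, 0.0, np.log(E / E_noi))  # Eq. (4.4)
```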
Modifications applied in DIAL

Phonemes with low power values (especially after the filtering process) will rarely be detected as "speech" by the VAD. This wrong detection has a small impact on the time-alignment algorithm and no impact on the PESQ perceptual model, since the latter computes the disturbance over the whole signal. However, the DIAL model computes a coherence between the reference and the degraded speech signal during speech periods. The PESQ VAD algorithm has therefore been slightly modified in order to accurately detect the active frames:

• The maximum dynamic range of the speech signal is set to ΔE = 70 dB.
• The threshold value Ethresh is calculated as the 50th percentile of the envelope energy.

Both changes enable the detection of low-level phonemes, especially in case of a variable noise level. This modified VAD has been applied on the same noisy speech sample; the resulting output vector yvad(g) is depicted in Fig. 4.4(c).

Fig. 4.4 a Envelope Ey(g) of the degraded speech signal and the corresponding output speech activity vector yvad(g) determined by b the PESQ VAD algorithm, and c the DIAL VAD algorithm

Two additional parameters are computed for both speech signals:
• The percentage of vocal activity: %VAD,ref and %VAD,deg.
• A rough estimation of the SNR: SNRref and SNRdeg.

Using these two parameters, the correct detection of the speech activity can be verified. In case of a wrong detection, a second VAD estimation with a lower ΔE can be applied. Three different use cases are expected:

1. The reference SNR is 3 dB lower than the degraded SNR,
2. The degraded SNR is lower than 35 dB,
3. The degraded vocal activity %VAD,deg is 30% higher than %VAD,ref.
In Case 1, noise frames in the reference speech signal may be detected as speech frames. A 2nd VAD estimation is applied to both signals. The maximum dynamic range Δ E is set to the reference SNR plus 10 dB. In Cases 2 and 3, noise frames in the degraded speech signal may be detected as speech frames. A 2nd VAD estimation is applied to the degraded signal. Here, the maximum dynamic range is set to the degraded SNR plus 10 dB (with a maximum of Δ E = 45 dB). In case of very low SNR, the maximum dynamic range is limited to Δ E = 20 dB.
Sentence vectors

Using the reference output vector xvad(g), the PESQ VAD detects the beginning and the end of each "utterance". An utterance is defined as a continuous section of speech of at least 300 ms duration, containing no silent period of more than 200 ms (Rix et al., 2002). In addition to this utterance framework, a sentence detection algorithm has been included in the DIAL model. This algorithm accurately detects the speech sentences included in the reference speech signal. The sentence detection algorithm uses the whole speech signal, whereas the utterance framework does not use the first nor the last 200 ms. It produces a sentence vector xsentence(l) of either 0 (if noise or silence is detected) or 1 (if a speech sentence is detected) for each 16 ms frame l (whereas the PESQ VAD outputs are defined for a 4 ms frame length, labeled g, the sentence detection algorithm uses the DIAL core model 16 ms frame length, labeled l), see Fig. 4.5. The short silent periods within a sentence and the first and last three frames of each sentence are set to 0.5. This supplementary status avoids a biased background noise estimation by the Noisiness estimator (see Sec. 4.5.4).

Fig. 4.5 Detected sentence vector for the reference signal xsentence(l)
4.3.2.3 PESQ delay estimation

The DIAL model uses the delay estimation algorithm developed by Rix et al. (2002) and used in PESQ. This algorithm is summarized in the following paragraphs. In a first step, a crude delay κ is estimated and compensated for the whole signal. This crude delay corresponds to the maximum of the cross-correlation between the xvad(g) and the yvad(g) vectors. This algorithm usually provides accurate measures
of the delay with a ±4 ms precision. The same procedure is applied to each utterance i: a crude delay κi is calculated using the cross-correlation of the xvad(g) and the yvad(g) vectors and is eliminated. Then, a fine-scale delay estimation of each utterance i is calculated using a histogram of the frame-by-frame estimated delay. The fine-scale delay estimation comprises the following steps (sketched in code below):

1. The utterance is divided into frames h of 64 ms length with 75% overlap. A Hanning window is used.
2. The maximum absolute cross-correlation between the signal waveforms x(k, h) and y(k, h) is computed for each frame h and saved as the estimated delay κh.
3. The maximum value (corresponding to the estimated delay) is raised to the power of 0.125. It acts as a weighting function which attenuates the impact of louder parts of the speech signal.
4. Each delay κh is saved in a histogram.
5. The histogram is smoothed by convolution with a 1 ms long triangular window.
6. The robust estimated delay κi for the current utterance i is given by the position of the maximum over the smoothed histogram. The maximum value gives a confidence measure ci, ranging from 0 (each frame has a different delay) to 1 (all frames have the same delay).

The algorithm detects a delay variation during a silent period between two utterances but is not able to take into account a variation within an utterance. In order to cover such variations, Rix et al. (2002) extended the utterance delay estimation method. Each utterance is divided into two sub-utterances (ia and ib) and processed through the delay estimation stages defined above. A delay variation is detected in case (i) the confidence measures of the two sub-utterances are higher than the confidence measure of the whole utterance (cia > ci and cib > ci), and (ii) a change in delay of at least 4 ms is found. In this case, the utterance is split in two at the point producing the highest confidence measure. This process is applied recursively to each new utterance of at least 800 ms.
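The six steps above can be sketched as follows in Python: per-frame cross-correlation peaks are accumulated into a weighted, smoothed histogram, and the utterance delay and a simple confidence value are read off. The normalization of the confidence measure is an assumption; the exact definition in Rix et al. (2002) may differ.

```python
import numpy as np

def fine_delay(x, y, fs=8000, frame=512, hop=128, max_lag=256):
    """Histogram-based fine-delay estimate for one utterance (sketch).

    frame/hop of 512/128 samples correspond to 64 ms windows with
    75% overlap at 8 kHz. Returns (delay in samples, confidence).
    """
    hist = np.zeros(2 * max_lag + 1)
    win = np.hanning(frame)
    for start in range(0, min(len(x), len(y)) - frame, hop):
        xf = x[start:start + frame] * win
        yf = y[start:start + frame] * win
        cc = np.correlate(yf, xf, mode="full")        # lag 0 at index frame-1
        cc = cc[frame - 1 - max_lag:frame + max_lag]  # restrict to +/- max_lag
        peak = np.abs(cc)
        lag = int(np.argmax(peak))
        hist[lag] += peak[lag] ** 0.125               # compressed weighting (step 3)
    tri = np.bartlett(int(0.001 * fs) + 1)            # ~1 ms triangular window
    smoothed = np.convolve(hist, tri / tri.sum(), mode="same")
    delay = int(np.argmax(smoothed)) - max_lag        # step 6
    confidence = smoothed.max() / (smoothed.sum() + 1e-12)
    return delay, confidence
```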
4.3.2.4 Use of delay information

The Core model in DIAL uses a frame-by-frame analysis between the aligned reference and degraded speech files. The delay determined for each utterance is used to create the corresponding detected sentence vector for the degraded speech file, ysentence(l), see Fig. 4.6(b). In case the delay increases between two successive utterances, the gap between the two utterances (equal to this delay increase) is considered as being deleted in the reference signal. The detected sentence vector for the reference signal depicted in Fig. 4.5 is thus updated, see Fig. 4.6(a): the vector xsentence(l) is set to −1 for the corresponding deleted frames. In PESQ, each entirely deleted frame is discarded from the perceptual disturbance calculation. However, in DIAL the amount of active samples deleted due to delay variation is stored and used by the Discontinuity estimator, see Sec. 4.5.3.1. In case the degraded signal is well aligned with the reference signal, the two vectors yvad(g) and ysentence(l) should correspond: the frames detected as "speech" by the VAD (i.e. yvad(g) > 0) should correspond to the position of the detected sentences (i.e. ysentence(l) > 0). A simple measure of the time-alignment accuracy is determined using these two vectors. It corresponds to the number of active frames in the degraded signal which were not detected by the sentence detection algorithm. The time-alignment is considered as inaccurate (NWA = 1, where WA stands for Wrong Alignment) if this number exceeds 80 frames, i.e. 320 ms of speech activity.
4.3.2.5 Time-warping detection and compensation
The time-alignment algorithm detailed in the previous section has been developed for delays which are constant over a certain period of time (at least 200 ms), corresponding to the time difference between two separate packet losses.

Fig. 4.6 Detected sentence vectors for a the reference signal (update of Fig. 4.5), and b the corresponding degraded signal
Nowadays, continuously variable delay appears in transmission systems. This effect, called time-warping, can have different origins. For instance, a Packet Loss Concealment (PLC) algorithm using time-warping attenuates the discontinuities due to transmission errors by re-scaling (i.e. stretching or compressing) the speech signal waveform over the time scale. This approach is especially used in packet-switched networks where the network rate is not constant over the call. In addition, this effect may be introduced by several low bit-rate speech codecs (e.g. the EVRC codec family): after transmission over the network, the speech signal is synthesized by the speech codec, which may continuously change the fine time-scale structure of the original speech waveform. In these two cases, time-warping introduces a relatively small perceived degradation since such algorithms produce a smooth reconstruction of the speech signal. A third source of continuous delay variation arises in case the Analog/Digital converters at the talker's side and at the listener's side use slightly different sampling frequencies (fS), i.e. their internal clocks are wrongly synchronized. Contrary to the former two cases, this third case introduces a linear deviation of the delay over the whole speech signal. During the development phase of the PESQ model, time-warping was not widely used in telephone networks. Nowadays, this type of coder or PLC algorithm has been widely integrated in transmission systems. The PESQ time-alignment algorithm has therefore been updated to cover this specific distortion. Firstly, wrong time-alignments are detected by a specific algorithm. Then, depending on the source of the wrong alignment, a different algorithm is applied. After a new synchronization, the sentence vectors xsentence(l) and ysentence(l) are updated with the new delay values. However, the time-warping compensation algorithm is applied only on degraded signals having an SNR higher than 20 dB, since this algorithm produces a wrong alignment on noisy degraded signals.
Time-warping detection

A first algorithm is used to detect a linear deviation of the delay over the whole speech signal. In this case, the delay has the following form:

\kappa = a \times l + b ,   (4.5)

where l is the frame number. The coefficients a and b are estimated in a least-squares sense using the delays κi estimated by the PESQ time-alignment for the different utterances i. The coefficient a describes the amount of the delay deviation. In addition, three other parameters are used to detect a wrong alignment (a sketch of the least-squares fit follows the list):

• NWA: the measure of the PESQ time-alignment accuracy (see Sec. 4.3.2.4),
• NSplit: the number of utterances which have been split in two,
• NUtt: the number of utterances with a small delay confidence measure ci < 0.3.
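A minimal sketch of the least-squares fit of Eq. (4.5), using per-utterance delays as input; the utterance centre indices and delays below are synthetic example values.

```python
import numpy as np

def delay_deviation(frame_idx, delays):
    """Least-squares fit of kappa = a*l + b (Eq. 4.5) to per-utterance delays."""
    a, b = np.polyfit(np.asarray(frame_idx, float),
                      np.asarray(delays, float), deg=1)
    return a, b   # a = delay drift per frame, b = offset

# Synthetic example: a clock-skew-like drift of 0.5 samples per frame
l = np.array([10.0, 60.0, 120.0, 200.0, 310.0])
kappa = 0.5 * l + 40.0
a, b = delay_deviation(l, kappa)   # a ~ 0.5, b ~ 40
```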
Depending on the values of the four parameters a, NWA, NSplit and NUtt, a new delay is estimated either for the whole signal or for specific utterances.

1. A new delay is estimated for the NUtt utterances which are considered as wrongly aligned (NUtt ≤ 2). As described in Sec. 4.3.2.3, the PESQ time-alignment compensates the crude delay κ for the whole speech signal and then estimates the crude/fine delay for each utterance. Here, the PESQ time-alignment has estimated a delay κi for these utterances. The crude delay κ is thus replaced by the delay κi for the current utterance.
2. The time-warping compensation is applied to the whole signal if at least one of the following cases is verified:
   • There is a wrong alignment of the whole signal (NWA = 1).
   • The deviation exceeds one sample per frame (a > 1) and at least one utterance splitting appears (NSplit > 1).
   • More than two utterances are considered as wrongly aligned (NUtt > 2).
   In this case, the procedure described in the following paragraphs is used.
Time-warping compensation

1. Initialization:
   • The time-warping compensation uses the PESQ time-alignment algorithm. However, new utterances are defined to avoid any wrong alignment. An utterance is defined as 128 ms of active speech (32 frames of 4 ms, as used by the PESQ time-alignment) extended by 400 ms at each side. This introduces an overlap of 75% between two successive utterances; the length of an utterance is thus 528 ms. On the basis of the previous utterance framework and the delay estimated by the PESQ time-alignment, the delays κPESQ corresponding to the new utterances are stored. In addition, the corresponding delay confidence measures cPESQ are calculated. However, to limit the influence of surrounding utterances, the side extensions are limited to 200 ms for this specific measure.
2. First round trip:
   • The previous delay κPESQ estimated by the PESQ time-alignment is eliminated for the current utterance.
   • A new crude delay κi for the current utterance i is calculated using the cross-correlation of the xvad(g) and yvad(g) vectors.
   • The utterances are reduced by 200 ms at each side (i.e. a total utterance length of 328 ms).
   • A fine-scale delay estimation as described in Sec. 4.3.2.3 is applied. It updates the new estimated delay κi and its corresponding confidence measure ci.
3. Verification:
   • If the confidence measure of the current utterance is relatively small (i.e. ci < 0.5) or smaller than that of the previous utterance i − 1, the algorithm may be influenced by the surrounding utterances. This effect appears if the delay changes between two successive utterances. In order to detect possible delay changes, the algorithm described in First round trip is applied on two sub-utterances: one without the 200 ms front extension and one without the 200 ms back extension. In case one sub-utterance obtains a higher confidence measure, the corresponding estimated delay and confidence measure are used for the whole utterance.
   • In case the confidence measure of the current utterance ci increases when using the delay of the previous utterance κi−1, this delay κi−1 is used for the current utterance i (the confidence measure is updated using this delay κi−1).
4. Concatenation:
   • Successive utterances with a relatively small alignment confidence measure ci < (c̄ − 0.3) are concatenated (c̄ corresponds to the mean confidence measure over all utterances). This step assumes that the concatenated frames have the same delay but have previously been wrongly aligned.
   • A crude/fine-scale delay κi is estimated for the concatenated frames.
   • The concatenation is divided to restore the previous utterances. For each utterance included in the concatenation, the estimated delay κi is used and the confidence measure is updated using κi.
5. Back to PESQ values:
   • The original delay estimated by the PESQ time-alignment is used in case the new confidence measure is smaller than the original one (cPESQ > ci).
6. Fine delay estimation:
   • The utterances are reduced by 200 ms at each side (i.e. a total utterance length of 128 ms).
   • A new fine-scale delay estimation is applied.
   • In case the confidence measure of the current utterance is still relatively small (ci < 0.7) and increases when using the delay of the next utterance κi+1 or of the previous utterance κi−1, the current utterance is updated: the delay of the next utterance (i + 1) or of the previous utterance (i − 1) is used for the current utterance i. The confidence measure is updated accordingly.
   • In case the confidence measure does not increase using the surrounding utterances, the utterance is divided into two sub-utterances of 64 ms (ia and ib). If the confidence measure of the first (resp. second) sub-utterance using the delay of the previous (resp. next) utterance is higher than the confidence measure of the utterance, the delay and confidence measures are updated according to κia = κi−1 and κib = κi+1.
Fig. 4.7 Speech activity for the reference and degraded speech signals and corresponding estimated delay and confidence measure using the PESQ time-alignment algorithm and the time-warping compensation. a Speech activity xvad(g) and yvad(g). b Estimated delay κi. c Delay confidence measure ci
This complex algorithm is time-consuming compared to the PESQ time-alignment. However, in the specific case of time-warping effects, it produces a more accurate alignment than the PESQ time-alignment, as depicted in Fig. 4.7. The stimulus comes from the database "FT-UMTS" (see App. B.2) and corresponds to a WB transmission through the "Skype" network. Differences in the estimated delay appear at around 5 s on the time scale, see Fig. 4.7(b). These differences correspond to a discontinuity in the degraded speech signal, see Fig. 4.7(a). Using the time-warping compensation, DIAL obtains a higher delay confidence measure than PESQ, see Fig. 4.7(c).
4.3.2.6 Frame index initialization

In the DIAL core model, the waveforms of the reference signal x(k) and of the degraded signal y(k) are analyzed in successive time-windows, referred to as "frame-by-frame". Each frame is mapped to the perceptual domain using a frame length of 16 ms. Contrary to the TOSQA model, the frames are overlapped by 50%. The NB mode uses speech signals sampled at fS = 8 kHz (128 samples per frame), whereas the S-WB mode uses speech signals sampled at fS = 48 kHz (768 samples per frame). Speech periods (i.e. xsentence(l) = 1) are divided into 16 ms frames (labeled l). Using the delay estimated for each utterance, a delay is assigned to each frame: the delay κl for a given frame l is defined by the delay of the utterance in which this frame begins. Then, the silent (or noisy) periods (xsentence(l) = 0) are divided into 16 ms frames and a delay is assigned to each silent/noisy frame. However, the degraded speech signal may be longer or shorter than the reference speech signal; in this case, the framework of silent/noisy frames is updated accordingly.
4.3.3 Modeling of receiving terminals

The input speech signals of the DIAL model can be either an electrical or an acoustic recording. A signal is considered as "electrical" when obtained at a network termination point and as "acoustic" if an acoustic interface is under study. In the latter case, the signal is produced by an artificial mouth and/or recorded by an artificial head. The two operational modes (NB and S-WB) considered by the DIAL model correspond to different auditory contexts. The model expects input signals which are specific to each mode in terms of sampling frequency and processing scenario. Each operational mode takes these expectations into account and thus simulates differently the listening terminal used by a "hypothetical" listener during an auditory test.

• Narrow-Band mode: In NB auditory tests, the degraded speech signals are scored against an IRS-type handset following ITU–T Rec. P.48 (1988). (The listening device can be either a real telephone handset or a monaural headphone; in the latter case, the degraded speech signals are filtered with an IRS Receive filter beforehand.) The DIAL model simulates this listening situation by filtering both input speech signals using a Finite Impulse Response (FIR) digital filter. This filter has a frequency response which is equivalent to an IRS Receive type handset. This NB mode enables a direct comparison of DIAL estimations to estimations given by PESQ.
• Super-WideBand mode: In S-WB auditory tests, a high-quality diotic headphone is used to score the degraded speech files. This headphone is considered to have a flat frequency response in the range 50–14 000 Hz. However, DIAL assesses both electrical and acoustic recordings in the same operational mode, and a single S-WB database may include different listening terminals (related to the acoustic recordings). As a result, in this mode no input filter is applied to the two input speech signals.

The model considers either a monotic (NB mode) or a diotic (S-WB mode) listening situation. Binaural effects are beyond the scope of this new standardization program. In the DIAL model, the receiving terminal simulation is applied in three consecutive steps (see the corresponding blocks in Fig. 4.2). Firstly, the receiving filter is applied to the degraded speech file only (resulting in y′(k)). After this first step, the difference between both input signals corresponds to the overall transmission path (i.e. from mouth to ear, including the user terminal in case of acoustic recordings). The frequency response of the entire system is computed by the Coloration estimator, see Sec. 4.5.1. Then, in order to reduce the impact of this frequency response in the Core model, the receiving filter is applied to the reference speech signal (resulting in x′(k)).
4.4 Core model

In order to improve the reliability of the DIAL speech quality model, the proposed measure uses a combination of four perceptual dimension estimators and an intrusive signal-based model. The intrusive signal-based model used here is mainly based on the TOSQA model described in Sec. 2.3.2.4. Contrary to the PESQ model (ITU–T Rec. P.862, 2001), the TOSQA model has been developed for the assessment of an entire transmission including the acoustic path. TOSQA was modified mainly to cover non-linear degradations. Such degradations are estimated by a comparison of the perceptually transformed reference and degraded signals. The following section describes the perceptual transformation of the two speech signals. Then, the distortions which are perceptually irrelevant (i.e. not audible) are compensated before the perceptual comparison of the reference speech signal with the degraded speech signal by the core model. These compensations are computed by an algorithm
described in Sec. 4.4.2. The perceptual comparison corresponds to a similarity measure which quantifies the non-linear degradations introduced by the processing system.

Fig. 4.8 Diagram of the DIAL core model
4.4.1 Perceptual transformation

In order to calculate the non-linear degradations introduced by speech transmission systems, auditory signals are transformed into perceptual signals. The two input speech signals x′(k) and y′(k) are transformed to represent the signals as perceived by a human listener. The perceptual transformations are mainly based on the model of loudness calculation developed by Zwicker et al. (1957). In the absence of background noise, Huber and Kollmeier (2006) assume that silent intervals do not contribute to speech transmission quality. Following this assumption, only active speech frames, i.e. those corresponding to xsentence(l) = 1, are transformed using the perceptual model in the Core model. The background noise in non-active frames, i.e. those corresponding to xsentence(l) = 0, is calculated by the Noisiness estimator, see Sec. 4.5.4.
4.4.1.1 Calibration

The perceptual model has been calibrated to a maximum acoustic value of Pmax = 105 dBSPL. A pink-noise signal has been used with a WB bandwidth and a digital level of −26 dBov. According to Zwicker and Fastl (1990), diotic listening is perceived about twice as loud as monaural listening. Therefore, a speech signal with an ASL of −26 dBov will correspond to an acoustic listening level of 79 dBSPL in a monaural listening situation (NB mode) and 73 dBSPL at each ear in a diotic listening situation (S-WB mode). No specific perceptual model has been used for diotic listening.
4.4.1.2 Spectrum estimation

Both speech signals are divided into L successive aligned segments (l = 0, 1, ..., L − 1) of M samples each (k = 1, 2, ..., M). Then, a Hann window hw(k) is applied on the signal waveforms:

x_w(k, l) = x'(k + l \times (M - \Delta M)) \times h_w(k) ,
y_w(k, l) = y'(k + l \times (M - \Delta M) + \kappa_l) \times h_w(k) ,   (4.6)

where κl is the estimated delay for segment l and ΔM = M/2 the 50% overlap. The resulting signals are expressed in the frequency domain using a short-term Discrete Fourier Transform (DFT):

X'(l, e^{j\Omega}) = \sum_{k=0}^{M-1} x_w(k, l) \times e^{-j\Omega k} ,   (4.7)

where Ω = 2πk/M. A Power Spectral Density (PSD) is then computed as:

P_{x'x'}(l, e^{j\Omega}) = \frac{1}{M^2} X' \times X'^{*} ,   (4.8)

where X′* is the conjugate of X′. The resulting spectrum is defined on m = 1, ..., M/2 points, i.e. up to half of the sampling frequency (fS).
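The following sketch implements the frame-wise PSD of Eqs. (4.6)–(4.8) for one signal; for the degraded signal, the per-frame delay κl would additionally shift the analysis window (omitted here for brevity).

```python
import numpy as np

def framed_psd(x, M=128):
    """Hann-windowed, 50%-overlap PSD per frame, as in Eqs. (4.6)-(4.8).

    M = 128 samples corresponds to 16 ms at fS = 8 kHz (NB mode).
    Returns an (L, M/2) array: one PSD per frame, bins up to fS/2.
    """
    hw = np.hanning(M)
    hop = M // 2                                # 50% overlap, Delta M = M/2
    L = (len(x) - M) // hop + 1
    psd = np.empty((L, M // 2))
    for l in range(L):
        xw = x[l * hop:l * hop + M] * hw        # Eq. (4.6)
        X = np.fft.fft(xw)                      # Eq. (4.7)
        psd[l] = (np.abs(X[:M // 2]) ** 2) / M ** 2   # Eq. (4.8)
    return psd
```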
4.4.1.3 Bark integration

The signal spectra P_{x'x'}(l, e^{jΩ}) and P_{y'y'}(l, e^{jΩ}) are transformed in terms of critical bands z (see Sec. 1.1.2.2). For this purpose, the mapping function between frequency and critical band defined by Zwicker et al. (1957) can be used, see Eq. (1.1). Here, critical band rates are simulated by an aggregation of energy into critical bands of 1 Bark width each. The low cut frequencies flow(z) and bandwidths Δf(z) for each band are defined using Eq. (1.1) and are given in Table 4.2. Using the corresponding frequency indices klow(z) and kup(z) for each band, the Bark power density is calculated as:

P_{x'x'}(l, z) = \sum_{k = k_{low}(z)}^{k_{up}(z)} P_{x'x'}(l, e^{j\Omega_k}) .   (4.9)

This integration function does not exactly follow the literature: the frequency bands have been defined in order to follow those used by the Coloration estimator, see Sec. 4.5.1. From the Bark power density, the energy per speech frame is calculated for the reference and degraded speech signals by:

P_x(l) = \frac{1}{24} \sum_{z=1}^{24} P_{x'x'}(l, z) , \qquad P_y(l) = \frac{1}{24} \sum_{z=1}^{24} P_{y'y'}(l, z) .   (4.10)
Table 4.2 Definition of the frequency bands used in the core model and Coloration estimator. Low cut frequencies flow are calculated according to Eq. (1.1)

z (Bark)   flow (Hz)   Δf (Hz)     z (Bark)   flow (Hz)   Δf (Hz)
1          1           101         13         1 691       277
2          101         102         14         1 968       335
3          204         105         15         2 303       408
4          308         108         16         2 711       501
5          417         114         17         3 211       611
6          531         121         18         3 822       731
7          652         129         19         4 553       858
8          780         142         20         5 412       1 002
9          922         156         21         6 413       1 204
10         1 079       176         22         7 618       1 549
11         1 255       202         23         9 166       2 249
12         1 457       234         24         11 416      4 013
4.4.2 Partial compensation of the transmission system

In order to improve the reliability of intrusive speech quality models, linear degradations having a small impact on the integral quality are compensated. This compensation is used in several intrusive models such as TOSQA (ITU–T Contrib. COM 12–34, 1997) and PESQ (Beerends et al., 2002). Two linear degradations are partially compensated in DIAL: (i) the frequency response of the speech transmission system, and (ii) the variable gain.
4.4.2.1 Transfer function

As stated in Sec. 1.3.3, different quality elements, such as the acoustic terminal, the speech codec or the talking environment, may impact the frequency response of the overall transmission system. The DIAL model provides two different modes: (i) a NB mode, where bandwidth restrictions rarely appear, and (ii) a S-WB mode, where different bandwidth restrictions may affect the degraded speech file. These restrictions are introduced by the network (i.e. a WB or a NB transmission) and by acoustic recordings of user terminals; such terminals have a high influence on the frequency response. A reduced transmission bandwidth significantly decreases the perceived speech quality, whereas slight modifications of the speech spectrum have a small impact. Inaudible deviations in the speech spectrum are thus compensated, whereas strong deviations such as bandwidth limitations are not fully compensated. The partial compensation of the system frequency response is computed in two steps. Firstly, the global frequency response of the transmission system is calculated. Then, this global frequency response is limited to avoid the compensation of strong deviations.

• Global frequency response: The mean spectra of both signals xw(k, l) and yw(k, l) are estimated by averaging the PSD over all L active frames:

\hat{\Phi}_{x'x'}(z) = \frac{1}{L} \sum_{l=0}^{L-1} P_{x'x'}(l, z) , \qquad \hat{\Phi}_{y'y'}(z) = \frac{1}{L} \sum_{l=0}^{L-1} P_{y'y'}(l, z) .   (4.11)

These are used to calculate the overall gain of the system:

g = 10 \log_{10} \left( \frac{\sum_{z=1}^{24} \hat{\Phi}_{y'y'}(z)}{\sum_{z=1}^{24} \hat{\Phi}_{x'x'}(z)} \right) .   (4.12)

Then, the system frequency response is calculated as the ratio of the reference spectrum to the degraded spectrum:

H(z) = \frac{\hat{\Phi}_{x'x'}(z) + \alpha}{\hat{\Phi}_{y'y'}(z) + \alpha} ,   (4.13)
where α is a constant set to (2 × 10⁻⁵)², i.e. the squared threshold of hearing (20 µPa, or 0 dBSPL), to avoid division by zero.

• Limitation: The frequency response H(z) is then limited so that only the inaudible deviations are compensated. Contrary to the TOSQA model, which includes NB and WB modes, the DIAL model includes a S-WB mode. The original algorithm used in TOSQA has been slightly modified to reliably assess the perceived quality of NB, WB and S-WB bandwidths in the same mode. The Coloration estimator defined in Sec. 4.5.1 is used to accurately calculate the bandwidth of the transmission system. This estimator provides the limits of the transmitted audio bandwidth, zlow and zhigh, which are both defined on a continuous Bark scale. However, H(z) is defined on a discrete scale (z = 1, 2, ..., 24). The zlow value is thus rounded up, zhigh is rounded down, and zG is rounded to the nearest integer (zG is the center of gravity, see Sec. 4.5.1.3). The algorithm compensates the spectral deviation within the transmitted audio bandwidth z ∈ [zlow, zhigh], whereas the frequency deviation outside the audio bandwidth z ∉ [zlow, zhigh] is not compensated. For this purpose, H(z) is divided in two parts: [zlow, zG] and [zG + 1, zhigh]. Different limitations have been defined for the two parts and the two modes:

– Narrow-Band mode:

H(z)' \in \begin{cases} [g - 10, g + 15], & \text{if } z_{low} \le z \le z_G ; \\ [g - 5, g + 15], & \text{if } z_G < z \le z_{high} . \end{cases}   (4.14)

– Super-WideBand mode:

H(z)' \in \begin{cases} [g - 5, g + 15], & \text{if } z_{low} \le z \le z_G ; \\ [g - 5, g + 18], & \text{if } z_G < z \le z_{high} , \end{cases}   (4.15)

where g is the overall gain of the system. Within the audio bandwidth, the frequency response is limited to these defined dynamic ranges. Outside the audio bandwidth, only attenuations are limited; at too high or too low frequencies, high amplifications do not have an audible impact. This results in limited compensation values H(z)'. A second assumption of the DIAL model is that the attenuation of high frequencies has a higher perceptual impact than the attenuation of low frequencies; a different paradigm is therefore used for low and high frequencies. At high frequencies (i.e. z > zhigh), attenuations of up to 5 dB are compensated. At low frequencies, the limitation is updated at each band:

H(z)' = \alpha \times H(z+1)' ,   (4.16)

where α corresponds to 0.3162 (i.e. −5 dB), and z = 2, ..., zlow − 1. In order to keep the degraded signal unchanged, the original Bark power density P_{x'x'} is equalized according to:
P_{x'x'}(l, z)' = P_{x'x'}(l, z) \cdot \mathrm{diag}\{H(z)'\} ,   (4.17)
where diag{H(z)'} corresponds to the diagonalized limited compensation values. The energy per speech frame of the reference speech signal is updated in Px(l), see Eq. (4.10). An example of the global and limited frequency compensations is depicted in Fig. 4.9. The stimulus comes from the database "P.OLQA 1" (see App. B.2) and corresponds to an acoustic recording of a headset.

Fig. 4.9 Example of partial compensation of the reference speech signal for frequency response equalization. The real transfer function H(e^{jΩ}) has been estimated by the Coloration estimator, see Sec. 4.5.1.2
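The limitation step (Eqs. (4.14)–(4.16)) can be sketched as below. The arrays use a 1-based band index to match the text (index 0 unused), and the treatment of bands above and below the audio bandwidth is an interpretation of the description above, not a verified implementation.

```python
import numpy as np

def limit_frequency_response(H_db, g, z_low, z_G, z_high, mode="SWB"):
    """Limitation of the frequency response H(z), Eqs. (4.14)-(4.16).

    H_db: 25-element array, H_db[z] in dB for Bark band z = 1..24
    (index 0 unused); g: overall gain of the system in dB.
    """
    H = H_db.copy()
    low_rng = (g - 10, g + 15) if mode == "NB" else (g - 5, g + 15)
    high_rng = (g - 5, g + 15) if mode == "NB" else (g - 5, g + 18)
    # In-band limitation, split at the rounded centre of gravity z_G
    H[z_low:z_G + 1] = np.clip(H[z_low:z_G + 1], *low_rng)
    H[z_G + 1:z_high + 1] = np.clip(H[z_G + 1:z_high + 1], *high_rng)
    # Above the audio bandwidth: only attenuations (up to 5 dB) compensated
    H[z_high + 1:] = np.maximum(H[z_high + 1:], -5.0)
    # Below the audio bandwidth: Eq. (4.16), -5 dB per band going downwards
    for z in range(z_low - 1, 1, -1):
        H[z] = H[z + 1] - 5.0
    return H
```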
4.4.2.2 Variable gain

Digital speech processing systems such as speech coding introduce gain variations in the degraded speech signal which are seen as distortions by signal-based models, although they have only a slight impact on the speech transmission quality. Gain variation is also introduced by Voice Quality Enhancement (VQE) systems such as Noise Reduction (NR), Acoustic Echo Cancellation (AEC) or Automatic Gain Control (AGC); these, however, increase the integral quality. For instance, in noisy environments, a NR algorithm may increase the SNR, resulting in a better perceived quality. Such enhancements are not taken into account by the DIAL model. In order to reduce the impact of the gain variation in the Core model, a two-step approach similar to the frequency compensation is applied: a smooth time-varying gain is compensated, whereas abrupt variations introducing a discontinuity into the degraded speech signal are not fully compensated.

• Global gain variation: The variable gain is calculated as the ratio of the reference level to the degraded level:

G(l) = 10 \log_{10} \left( \frac{P_x(l) + \alpha}{P_y(l) + \alpha} \right) ,   (4.18)
where α is a constant set to (2 × 10⁻⁵)², i.e. 0 dBSPL.

• Limitation: In a second step, the variable gain is limited. The TOSQA model limits the gain variation G(l) to the range −6 ... 3 dB. A smooth variation does not have an audible impact, whereas an abrupt variation may cause a discontinuity in the degraded speech file. Here, only strong variations of the time-varying gain are limited. The global variable gain G(l) is limited to the range −20 ... 20 dB; such strong variations correspond to discontinuities introduced by packet loss or time-clipping of the current frame. A smoothed version Gs(l) of the time-varying gain is estimated from G(l) using a low-pass FIR filter. This smoothed version Gs(l) is then limited to the range −10 ... 10 dB, since test subjects react to a slow variation introducing a high attenuation or amplification. Then, the time-varying gain is limited to the range:

G(l) = \begin{cases} G_s(l) - 6, & \text{if } G(l) < G_s(l) - 6 ; \\ G_s(l) + 3, & \text{if } G(l) > G_s(l) + 3 ; \\ G(l), & \text{otherwise} . \end{cases}   (4.19)

The constants −6 and 3 are not equal since amplifications introduce a higher degradation than attenuations. This effect corresponds to the "asymmetry" factor introduced by Beerends (1994). Contrary to the frequency compensation, the time-varying gain compensation is applied to both speech signals according to:

P_{y'y'}(l, z)' = G(l) \times P_{y'y'}(l, z), \quad \text{if } G(l) > 1 ,
P_{x'x'}(l, z)' = \frac{P_{x'x'}(l, z)}{G(l)}, \quad \text{if } G(l) < 1 .   (4.20)

Whereas an amplification of the transmission system is compensated on the degraded Bark power density P_{y'y'}, an attenuation is compensated on the reference power density P_{x'x'}. Two examples of the global and limited time-varying gain compensation are depicted in Fig. 4.10.
4.4.3 Calculation of the loudness densities

The reference and the degraded Bark power densities P_{x'x'}(l, z) and P_{y'y'}(l, z) are transformed to the Sone scale using the loudness transformation developed by Zwicker and Fastl (1990).
Fig. 4.10 Examples of partial compensation for the time-varying gain. The dashed line corresponds to the smoothed version Gs(l). a Packet-loss condition. b AGC condition
L_{x'x'}(l, z) = S_l \times \left( \frac{P_0(z)}{0.5} \right)^{\gamma} \times \left[ \left( 0.5 + 0.5 \times \frac{P_{x'x'}(l, z)}{P_0(z)} \right)^{\gamma} - 1 \right] ,
L_{y'y'}(l, z) = S_l \times \left( \frac{P_0(z)}{0.5} \right)^{\gamma} \times \left[ \left( 0.5 + 0.5 \times \frac{P_{y'y'}(l, z)}{P_0(z)} \right)^{\gamma} - 1 \right] ,   (4.21)

where P0(z) is the absolute hearing threshold, γ is the Zwicker power, set to γ = 0.23, and Sl is the loudness calibration factor, set to Sl = 1.3733. This transformation is used in TOSQA and has been included in DIAL. However, the loudness factor Sl has been re-calibrated to obtain a value of 1 Sone for a 1 kHz tone at 40 dBSPL. Beerends et al. (2002) developed a slightly modified version of this function for the PESQ model: below 4 Bark, the Zwicker power is increased slightly to simulate the loudness "recruitment effect" in the cochlea. This effect corresponds to a rapid growth in loudness for levels slightly exceeding the absolute threshold (Allen et al., 1990). The opposite approach is used in DIAL: the Zwicker power is set to γ = 0.1 below 3 Bark to account for the lower ear sensitivity at low frequencies. This modification slightly increases the accuracy of DIAL on bandwidth restrictions. The Short-Term Loudness (STL) value for a given frame l is calculated as the sum of loudness densities over the whole Bark scale:

L_x(l) = \sum_{z=1}^{24} L_{x'x'}(l, z) .   (4.22)
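Eq. (4.21) can be written compactly as below. The handling of the reduced Zwicker power "below 3 Bark" is interpreted here as the first two bands, which is an assumption, as is the clamping of negative densities to zero.

```python
import numpy as np

def loudness_density(P, P0, Sl=1.3733, gamma=0.23):
    """Loudness density per Bark band (Eq. 4.21), after Zwicker and Fastl.

    P: Bark power density of one frame (24 bands); P0: absolute hearing
    threshold per band, same shape.
    """
    g = np.full(24, gamma)
    g[:2] = 0.1   # reduced Zwicker power below 3 Bark (assumed bands 1-2)
    L = Sl * (P0 / 0.5) ** g * ((0.5 + 0.5 * P / P0) ** g - 1.0)
    return np.maximum(L, 0.0)   # negative densities treated as inaudible

# Short-Term Loudness of the frame (Eq. 4.22):
# STL = loudness_density(P_frame, P0).sum()
```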
4.4.4 Short-term degradation measure

In TOSQA, the similarity between the reference and the degraded loudness density is calculated; this measure is also used in the audio quality model PEMO-Q. The TOSQA algorithm has been included in DIAL. The similarity between the reference and the degraded loudness densities is computed in two regions on the Bark scale: the "Low Bark" region (zLB) and the "High Bark" region (zHB). The TOSQA model uses three different regions; in DIAL, these regions have been updated to cover S-WB conditions. The Bark indices of these two regions are:

• Narrow-Band mode:

z_{LB,NB} = (1, \ldots, 9) , \qquad z_{HB,NB} = (10, \ldots, 18) .   (4.23)

• Super-WideBand mode:

z_{LB,SWB} = (1, \ldots, 10) , \qquad z_{HB,SWB} = (11, \ldots, 23) .   (4.24)

The two regions zLB,NB and zHB,NB have the same width in the NB mode, whereas zHB,SWB is wider than zLB,SWB in the S-WB mode. This difference does not impact the estimated similarity, since most of the energy of a voice signal is included in the NB bandwidth.
4.4.4.1 Reference optimization

Prior to this similarity measure, the reference loudness density is optimized in order to improve the reliability of the DIAL core model. The loudness densities L_{x'x'}(l, z) and L_{y'y'}(l, z) represent the speech signal at the output of the cochlea, see Secs. 1.1.2.1 and 1.1.2.2. These "patterns" are processed at higher levels in the human brain. An instrumental speech quality measure requires a model of the peripheral auditory system; the speech quality judgment, however, involves several cognitive processes. An algorithm simulating simple cognitive processes is described in this section.

• A simple frequency masking model, which is not used in TOSQA, is employed here to reduce the influence of inaudible differences at high frequencies when masked by low-frequency components. Firstly, the difference dL(l, z) between the two loudness densities is calculated as:

dL(l, z) = L_{y'y'}(l, z) - L_{x'x'}(l, z) .   (4.25)

A masking vector M(z) is calculated for each frame l and each band z according to:

M(z) = \max_{z' \le z} \{ L_{x'x'}(l, z') \} - L_{x'x'}(l, z) ,   (4.26)

where max_{z' ≤ z} {L_{x'x'}(l, z')} corresponds to the highest component over all the bands z' ≤ z. Then, M(z) is normalized to the range [0, 1] and updated according to M(z)' = 1 − M(z). The resulting masking vector M(z)' depends on the position and the value of the highest loudness component in the reference loudness density L_{x'x'}(l, z): M(z)' = 1 means no masking and M(z)' = 0 means no perceived degradation. Then, M(z)' is applied only on loudness components in the "High Bark" region:

dL(l, z)' = M(z)' \times dL(l, z), \quad z \in z_{HB} .   (4.27)

• New signal components added by the system under study are much more annoying than components which are attenuated (Beerends and Stemerdink, 1994). The intrusive models PESQ and TOSQA simulate this effect using an asymmetry factor. In DIAL, dL(l, z)' is either amplified or attenuated following the sign of dL(l, z):

dL(l, z)' = \begin{cases} \alpha \times dL(l, z)', & \text{if } dL(l, z) < 0 ; \\ (1 - \alpha) \times dL(l, z)', & \text{if } dL(l, z) > 0 , \end{cases}   (4.28)

where α is the asymmetry factor. Contrary to TOSQA, this factor is not constant over the Bark scale. One assumption used by the DIAL model is that the introduction of new components has a higher influence on the perceived quality when it appears at low frequencies. The asymmetry factor is thus set to α = 0.3 for the low-frequency components zLB, and to α = 0.4 for the high-frequency components zHB. The resulting difference is then limited to the range dL(l, z)' ∈ −1.8 ... 1 Sone: larger deviations (i.e. outside this range) do not have a higher influence on the perceived quality.

• In order to calculate a similarity between the loudness densities, an optimized reference loudness density is computed as:

L''_{x'x'}(l, z) = L_{y'y'}(l, z) - dL(l, z)' .   (4.29)
4.4.4.2 Similarity measure

Using the loudness densities L_{y'y'}(l, z) and L''_{x'x'}(l, z), the similarity ρ is computed for both the "High Bark" and the "Low Bark" regions defined in Eqs. (4.23) and (4.24). First, the reference (resp. degraded) STL value for each region bandwidth is computed as:

L_x(l, LB) = \sum_{z \in z_{LB}} L''_{x'x'}(l, z) , \qquad L_x(l, HB) = \sum_{z \in z_{HB}} L''_{x'x'}(l, z) .   (4.30)

Then, for the example of the "Low Bark" region, the similarity ρ(l, LB) is computed as:

\rho(l, LB) = \frac{\sum_{z \in z_{LB}} L''_{x'x'}(l, z) \, L_{y'y'}(l, z) - \dfrac{L_x(l, LB) \, L_y(l, LB)}{\Delta Z}}{\sqrt{\left( \sum_{z \in z_{LB}} L''_{x'x'}(l, z)^2 - \dfrac{L_x(l, LB)^2}{\Delta Z} \right) \left( \sum_{z \in z_{LB}} L_{y'y'}(l, z)^2 - \dfrac{L_y(l, LB)^2}{\Delta Z} \right)}} ,   (4.31)

where ΔZ corresponds to the bandwidth defined by zLB (i.e. 9 or 10 Bark). The overall similarity for a given frame l corresponds to a linear combination of the two frequency regions:

\rho(l) = \beta \, \rho(l, LB) + \gamma \, \rho(l, HB) .   (4.32)

Since a degradation in the high-frequency bandwidth has a higher impact on the overall speech quality than one in the low-frequency bandwidth, the two constants have been set to β = 0.4 and γ = 0.6. The similarity measures, which take values within the range 0–1, do not have a linear relationship with the auditory quality values MOSLQS. They are linearized using the following law:

P_{lin}(l) = 1.906 - 45.308 \times \arctan\{21.7 \times (\rho(l) - 0.998)\} .   (4.33)
To avoid incoherent values, Plin (l) is limited to 0.
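Reading Eq. (4.31) as the computational form of a correlation coefficient, the per-frame, per-region similarity and its linearization (Eqs. (4.32)–(4.33)) can be sketched as follows.

```python
import numpy as np

def band_similarity(Lx, Ly):
    """Similarity of Eq. (4.31) for one frame and one Bark region.

    Lx: optimized reference loudness density over the region,
    Ly: degraded loudness density over the same region.
    """
    dz = len(Lx)
    num = np.sum(Lx * Ly) - np.sum(Lx) * np.sum(Ly) / dz
    den = np.sqrt((np.sum(Lx**2) - np.sum(Lx)**2 / dz)
                  * (np.sum(Ly**2) - np.sum(Ly)**2 / dz))
    return num / (den + 1e-12)

def linearized_disturbance(rho_lb, rho_hb):
    rho = 0.4 * rho_lb + 0.6 * rho_hb                      # Eq. (4.32)
    p = 1.906 - 45.308 * np.arctan(21.7 * (rho - 0.998))   # Eq. (4.33)
    return max(p, 0.0)                                     # limited to 0
```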
4.4.4.3 Similarity pattern attenuation

Non-linear degradations have a higher influence on the similarity values ρ(l) of Eq. (4.31) than linear degradations such as bandwidth restrictions. The Core model should mainly react to non-linear degradations. However, it also has to react slightly to linear degradations such as background noise and a non-optimum listening level. A specific pattern, influenced by the amount of linear degradation, is thus computed. Then, the linearized similarity values Plin(l) are attenuated or amplified following the amount of linear degradations such as background noise or a non-optimum listening level.

• The Core model assumes that users mainly perceive non-linear degradations when played at the optimum listening level (i.e. 79 dBSPL). Non-linear
degradations are considered as less perceptible and masked by relatively high speech levels. For this purpose, a Long-Term Loudness indicator LTLy(L) is computed by the Loudness estimator (see Eq. (4.52)) and used to detect frames having high speech levels. Consequently, in a first step, the attenuation pattern AP is computed using the reference STL value Lx(l), see Eq. (4.22):

AP(l) = \begin{cases} L_x(l) - LTL_y(L), & \text{if } L_x(l) > LTL_y(L) ; \\ L_x(l), & \text{otherwise} . \end{cases}   (4.34)

• The Noisiness estimator computes a mean loudness value of the additive noise, Ln, see Eq. (4.75), which is used to amplify AP(l) at low speech loudness:

AP(l) = \begin{cases} L_n, & \text{if } AP(l) < L_n ; \\ AP(l), & \text{otherwise} . \end{cases}   (4.35)

• The attenuation pattern is normalized to the range AP(l) ∈ [0, 1] according to:

AP(l)' = \frac{AP(l)}{LTL_y(L)} .   (4.36)

The linearized similarity values are then multiplied with the attenuation pattern:

P_{lin}(l)' = P_{lin}(l) \times AP(l)' .   (4.37)
4.4.5 Integration of the degradation over the time scale

The PESQ model uses a specific integration procedure over the time scale. This procedure uses an Lp weighting which emphasizes loud degradations in comparison to a normal L1 norm. An Lp weighting is computed as:

\| \text{degradation} \|_p = \left( \sum_{l=1}^{L} | \text{degradation}(l) |^p \right)^{1/p} ,   (4.38)

where L is the number of frames and p > 1. In PESQ, successive intervals of 320 ms with 50% overlap are integrated using an L6 normalization procedure; this interval length approximately corresponds to the length of a syllable. Then, the intervals are integrated over the speech file length using an L2 norm. A similar procedure has been used in the Core model of DIAL. This normalization procedure is employed to emphasize discontinuous degradations in the speech signal. First, using the reference STL values Lx(l), see Eq. (4.22), syllables are detected using a local-peak localization algorithm. A detected syllable has a length ranging from 16 to 60 frames (i.e. 136–488 ms). An L2 norm is applied on each syllable to compute a per-syllable degradation P'lin(syllable). Then, the per-syllable degradations are integrated over the speech file length using a second L2 norm, resulting in an overall degradation P'lin. An example of this integration over the time scale is depicted in Fig. 4.11. The stimulus comes from the database "P.OLQA 2" (see App. B.2) and corresponds to a S-WB speech codec.
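A compact sketch of the two-stage L2 integration; syllable detection itself (the local-peak localization) is not shown, so syllable_bounds is assumed to come from that step.

```python
import numpy as np

def lp_norm(values, p):
    """L_p weighting of Eq. (4.38)."""
    return float(np.sum(np.abs(values) ** p) ** (1.0 / p))

def integrate_disturbance(P_lin, syllable_bounds):
    """Two-stage L2 integration over syllables, as in DIAL's Core model.

    P_lin: per-frame disturbance values P'_lin(l);
    syllable_bounds: list of (start, end) frame indices per syllable.
    """
    per_syllable = [lp_norm(P_lin[s:e], 2) for s, e in syllable_bounds]
    return lp_norm(np.array(per_syllable), 2)   # overall degradation P'_lin
```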
4.4.6 Computation of the core model quality score

The final Core model quality score reflects a speech transmission quality score instead of a degradation value. The integrated degradation P'lin is thus mapped to the range of MOS values (i.e. MOS ∈ [1, 5]) using a third-order polynomial function:

MOS_{Core} = \alpha_0 + \alpha_1 \times SQ_{Core} + \alpha_2 \times SQ_{Core}^2 + \alpha_3 \times SQ_{Core}^3 ,   (4.39)

where SQ_{Core} = 100 − P'lin. The coefficients α0, α1, α2 and α3 defined in Table 4.3 have been optimized on a large set of auditory tests. These coefficients are different in the NB and in the S-WB mode.
Table 4.3 Third order mapping function coefficients determined by curve fitting of the auditory MOSLQS values Mode
α0
α1
α2
NB S-WB
−20.6 3.6
8.7 × 10−1 −8.1 × 10−2
−1.2 × 10−2 4.3 × 10−4
α3 5.8 × 10−5 5.1 × 10−6
Fig. 4.11 Example of estimated degradation P'lin(l) and corresponding reference STL values Lx(l) for each frame l. a Reference STL values Lx(l); the vertical lines correspond to syllable boundaries. b Estimated degradation P'lin(l); the horizontal line corresponds to the overall degradation P'lin
4.5 Dimension estimators

The following section describes the four estimators used to diagnose the quality degradations introduced by the speech transmission system. These estimators rely on perceptual dimensions introduced in Sec. 1.4:

• Coloration,
• Loudness,
• Discontinuity,
• Noisiness.
The resulting quality-score framework is referred to as "degradation decomposition". For each perceptual dimension, several signal parameters (corresponding to either physical or perceptual attributes) are quantified and combined to reliably estimate the dimension value:

\widehat{Dim}_i = f\left( SP_1^{(i)}, SP_2^{(i)}, \ldots, SP_{N_p^{(i)}}^{(i)} \right) ,   (4.40)

where \widehat{Dim}_i is the estimated score, and SP_n^{(i)} one of the n = 1, 2, ..., N_p^{(i)} signal parameters of the perceptual dimension Dim_i.
4.5.1 Coloration

The perceptual dimension Coloration is related to the characteristics of the frequency response of the overall transmission system (i.e. mouth-to-ear), see Wältermann et al. (2010a). In a previous study, Wältermann et al. (2006a) referred to this dimension as Directness/Frequency Content (DFC). This dimension covers the impact of bandwidth restrictions, i.e. the bandwidth of the degraded speech file y(k), and specific impairments such as the influence of talking-room reflections. Such effects mainly appear in acoustic recordings of signals transmitted by an HFT. In addition, this dimension covers the effect of a deviation from a reference timbre (e.g. dark or bright) introduced by the transducers in a user terminal (e.g. headsets and telephone handsets). A first estimator for modeling the influence of frequency bandwidths on speech quality has been developed by Raake (2006b). Then, Scholz et al. (2006) extended this estimator to the instrumental assessment of the dimension DFC. An improvement to this DFC estimator has been proposed by Huo et al. (2007); this extended version has been introduced in Sec. 2.3.2.5. The algorithm used in DIAL is mainly based on the extended version. However, the latter has been developed only for the analysis of NB conditions. The algorithm has therefore been slightly modified, and renamed Coloration estimator, for the analysis of wider bandwidths, up to S-WB conditions.
4.5.1.1 Pre-processing

The input signals of the Coloration estimator are (i) the reference signal x(k), free of any filtering process, and (ii) the degraded signal y′(k), after the simulation of the receiving terminal, see Sec. 4.3.3. Contrary to the estimator developed by Scholz et al. (2006), here only the "active" periods of x(k) and y(k) are used. These active periods correspond to successive frames without delay variation which are detected as speech periods, i.e. xsentence(l) = 1 and ysentence(l) = 1. Then, these active periods are aligned in time and level:

• The delay estimated by the time-alignment algorithm described in Sec. 4.3.2 is used.
• Both signals are level-aligned to −26 dBASL using their corresponding ASL values, see Sec. 4.3.1.

The resulting signals are divided into frames of 2 048 samples which are overlapped by 75%. A weighting Hann window is used. In order to reduce the complexity of the algorithm, the same frame length is used in both the NB and the S-WB mode.

4.5.1.2 Estimation of the Gain function G(Ω)

Berger (1998) showed that the linear frequency response of the overall transmission system H(e^{jΩ}) can be calculated by:

H(e^{j\Omega}) = \frac{\hat{\Phi}_{xy'}(e^{j\Omega})}{\hat{\Phi}_{xx}(e^{j\Omega})} ,   (4.41)

where \hat{\Phi}_{xx}(e^{jΩ}) is the Power Spectral Density (PSD) of the reference signal, and \hat{\Phi}_{xy'}(e^{jΩ}) the cross-PSD of the reference and degraded signals. Both PSDs have been averaged over all L active frames. The PSD of a signal has been defined in Sec. 4.4.1.2, see Eq. (4.8). A cross-PSD is defined as:

P_{xy'}(l, e^{j\Omega}) = \frac{1}{M^2} X \times Y'^{*} ,   (4.42)

where Y′* is the conjugate of Y′. The Coloration estimator relies on a specific description of the transmission system's frequency response H(e^{jΩ}) in terms of a gain function:

G(\Omega) = 20 \times \log_{10} \left| H(e^{j\Omega}) \right| ,   (4.43)

where G(Ω) is defined on the dB scale. Then, G(Ω) is transformed to reflect the perceptual transformation of the speech signals as it appears in the cochlea: G(Ω) is transformed into G(z), which is defined in terms of critical band rates. The model developed by Zwicker et al. (1957), see Eq. (1.1), is used here. Whereas H(e^{jΩ}) corresponds to a physical representation of the system frequency response,
G(z) is the corresponding perceptual representation. The resulting G(z) is then pre-processed. Only Bark bands within the interval Δz = [zmin, zmax] are considered during this analysis. Scholz et al. (2006) initialized these limits to zmin = 1.5 (i.e. 150 Hz) and zmax = 17.0 (i.e. 3 800 Hz). In DIAL, these limits are initialized to the reference signal bandwidth. Noisy degraded signals may have a slightly wider bandwidth than the reference signal (due to either coding artifacts or background noise); this initialization using the reference signal avoids an exacerbated influence of added noise. Then, the interval Δz is determined in two steps:

1. G(z) represents a transfer function and thus has a mean value close to 0 dB. In order to detect the analysis bandwidth, G(z) is amplified by a constant value called "stopband" (ST), set to ST = 45 dB here:

\hat{G}(z) = \begin{cases} \max\{G(z) + ST, 0\}, & \text{for } z \in [z_{min}, z_{max}] ; \\ 0, & \text{otherwise} . \end{cases}   (4.44)

2. Using \hat{G}(z) and z_{lim} = \frac{z_{min} + z_{max}}{2}, the lower and upper limits of the interval Δz = [zmin, zmax] are updated according to:

\hat{G}(z_{min}) = 0.5 \max_{z \in \{0, \ldots, z_{lim}\}} \{\hat{G}(z)\} , \qquad \hat{G}(z_{max}) = 0.5 \max_{z \in \{z_{lim}, \ldots, 24\}} \{\hat{G}(z)\} .   (4.45)

In the case that \hat{G}(z) has a significant notch of at least 1 Bark within the interval Δz, the interval limits are updated to reduce the analysis bandwidth.
4.5.1.3 Coloration parameters

First, \hat{G}(z) is decomposed according to:

\hat{G}(z) = \tilde{G}(z) + G_R(z) ,   (4.46)

where \tilde{G}(z) represents a smoothed version of \hat{G}(z), and G_R(z) the residual peaks and notches of \hat{G}(z). Two parameters are determined from \tilde{G}(z):

• ERB: the bandwidth of G(Ω) is quantified by an Equivalent Rectangular Bandwidth (ERB) of \tilde{G}(z):

ERB = \frac{\mathrm{area}\{\tilde{G}(z)\}}{\max\{\tilde{G}(z)\}} .   (4.47)
Fig. 4.12 Gain function G(z) (level in dB over the critical-band rate z in Bark) and estimated Coloration parameters (f_c = 1063 Hz, ERB = 17.9 Bark) for an example condition
This parameter is expressed on the Bark scale. Following Raake (2006b), the ERB value represents an ideal rectangular filter which has the same perceptual characteristics as the bandwidth of G(Ω).

• z_G: the center of gravity of G(z):

z_G = \frac{\int_{z_{min}}^{z_{max}} \tilde{G}'(z) \times z \, \partial z}{\int_{z_{min}}^{z_{max}} \tilde{G}'(z) \, \partial z} .   (4.48)

This parameter is expressed on the Bark scale.

• f_c: the center frequency of G(z). This parameter is computed using the last two parameters. First, the limits of the estimated bandwidth are calculated:

z_{low} = z_G − \frac{ERB}{2} , \quad z_{high} = z_G + \frac{ERB}{2} .   (4.49)

These limits are mapped to the frequency scale as f_low and f_high, defined in Hz. Then, f_c is calculated according to:

f_c = \sqrt{f_{low} \times f_{high}} .   (4.50)

The Coloration estimator developed by Scholz et al. (2006) uses a parameter called β which represents the "slope" of the system frequency response. This parameter characterizes the ratio between the frequency components transmitted by the system. It is expressed in dB per Bark. However, this parameter is not used in DIAL.

Figure 4.12 shows the gain function G(z) and the estimated Coloration parameters for an example condition. The stimulus comes from the database "P.OLQA 1" (see App. B.2) and corresponds to an acoustic recording of a headset.
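A sketch of the three Coloration parameters (Eqs. 4.47–4.50) is given below, assuming the smoothed, stopband-shifted gain curve is sampled on a Bark axis z. The Bark-to-Hz inverse used here is the analytic Traunmüller (1990) approximation, substituted for the Zwicker et al. (1957) mapping of Eq. (1.1); it and the function names are assumptions for illustration.

import numpy as np

def bark_to_hz(z):
    # Inverse critical-band-rate mapping (Traunmueller approximation,
    # used here as a stand-in for Eq. (1.1)).
    return 1960.0 * (z + 0.53) / (26.28 - z)

def coloration_parameters(g_smooth, z):
    # ERB, centre of gravity z_G and centre frequency f_c from a
    # non-negative smoothed gain curve g_smooth(z).
    area = np.trapz(g_smooth, z)
    erb = area / g_smooth.max()                     # Eq. (4.47), in Bark
    z_g = np.trapz(g_smooth * z, z) / area          # Eq. (4.48)
    z_low, z_high = z_g - erb / 2, z_g + erb / 2    # Eq. (4.49)
    f_low, f_high = bark_to_hz(z_low), bark_to_hz(z_high)
    f_c = np.sqrt(f_low * f_high)                   # Eq. (4.50), geometric mean
    return erb, z_g, f_c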
4.5.1.4 Computation of the MOS_col score

The two parameters ERB and f_c have been introduced by Raake (2006b) to compute a bandwidth impairment factor I_bw. This factor estimates the influence of frequency bandwidths on speech quality:

I_{bw} = \alpha_1 \times |f_c − \alpha_6 \times (ERB + \alpha_5)| − \alpha_2 \times (f_c − \alpha_6 \times (ERB + \alpha_5)) − \alpha_3 \times ERB + \alpha_4 ,   (4.51)
where the coefficients α_i, i = 1 ... 6, are defined in Table 4.4. This model has been developed using auditory results published by Barriac et al. (2004). The corresponding database, referred to as "FT-04 NB/WB" in App. B.2, includes both NB and WB conditions. The offset value α_4 was originally set to 129.2, i.e. the R-value of an "optimum" WB condition. This coefficient has been updated to cover S-WB conditions.

Table 4.4 Model coefficients for the bandwidth impairment factor, based on Raake (2006b)

Mode    α1           α2           α3    α4    α5      α6
NB      3.5 × 10−2   6.7 × 10−3   7.4   100   101.8   9.9
S-WB    3.5 × 10−2   6.7 × 10−3   7.4   160   101.8   9.9
Then, the MOS_col is derived from the estimated bandwidth impairment. First, I_bw is limited to positive values (i.e. set to 0 in case a negative value is estimated). Then, the impairment value is mapped to the MOS scale using the transformation described in ITU–T Rec. G.107 (1998), see Eq. (2.4). However, the input of Eq. (2.4) depends on the operational mode:

• NB mode: R = 100 − I_bw
• S-WB mode: R = (160 − I_bw) / 1.6
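The same final step recurs for all dimension estimators: clip the impairment at zero, derive a mode-dependent R value, and convert R to a MOS. The sketch below assumes that Eq. (2.4) is the standard R-to-MOS conversion of the E-model (ITU-T Rec. G.107); the function names are illustrative.

def bandwidth_impairment(f_c, erb, a):
    # Bandwidth impairment factor Ibw, Eq. (4.51); a = (a1, ..., a6)
    # taken from Table 4.4 for the chosen operational mode.
    a1, a2, a3, a4, a5, a6 = a
    u = f_c - a6 * (erb + a5)
    return a1 * abs(u) - a2 * u - a3 * erb + a4

def r_to_mos(r):
    # R-scale to MOS mapping of the E-model (assumed form of Eq. (2.4)).
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def mos_col(ibw, swb=False):
    ibw = max(ibw, 0.0)                              # clip negative impairments
    r = (160.0 - ibw) / 1.6 if swb else 100.0 - ibw  # mode-dependent R input
    return r_to_mos(r)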
4.5.2 Loudness

The perceptual dimension "loudness" is left out of several instrumental models such as PESQ, which normalizes the input speech signals to a constant power level. However, loudness plays an important role for integral speech quality and intelligibility. The prerequisite for an efficient communication is a sufficiently high speech level, especially in noisy acoustic environments. Excessively low or high loudness (the latter possibly leading to digital overload) results in low intelligibility.
The Loudness estimator quantifies the degradation for speech heard at a non-optimum listening level. An optimum level corresponds to the speech level which leads to the highest auditory quality score, see Sec. 2.2.4.5. The speech level used in auditory experiments usually corresponds to the preferred speech level, which is a few dB lower than the optimum speech level. This difference between optimum and preferred level depends on other features such as the audio bandwidth (e.g. NB or WB) or the SNR.

According to Eq. (2.2), the sensation "loudness" is related to the power of a sound. However, an accurate relationship between the loudness and the physical characteristics of a speech signal has not been established yet, especially because of its time-varying behavior. For instance, characteristics such as frequency content and duration contribute to the overall loudness of a stimulus. Several models were developed in order to estimate the loudness of a specific stimulus such as speech. Nowadays, several models reliably estimate the loudness of pure tones or bursts of noise. However, loudness estimation of real-life signals such as speech or music is still a difficult task. Loudness models can be divided into two groups: "single-band" models and "multi-band" models. The models of the first group are based on time-domain information that is integrated into a mono-dimensional parameter describing the energy of the signal. The models of the second group are mostly based on the theory of critical band rates and level compression described in Sec. 1.1.2.2.

The following section presents the algorithm used in DIAL to estimate a loudness value for speech stimuli. This algorithm is divided into two parts. First, a Short-Term Loudness (STL) is estimated for each frame using the loudness model for stationary sounds developed by Zwicker and Fastl (1990). Then, the model for time-varying sounds developed by Glasberg and Moore (2002) has been optimized for the calculation of a Long-Term Loudness (LTL) over the whole speech stimulus. The same algorithm is used in both listening situations: monaural (NB mode) and diotic (S-WB mode).
4.5.2.1 Short-Term Loudness

The Short-Term Loudness (STL) is calculated in the core model of DIAL for each active frame, i.e. x_sentence(l) = 1. The reference L_x(l) and the degraded L_y(l) total loudness values are computed by the model developed by Zwicker and Fastl (1990). This model is a standardized multi-band measurement method (ISO Standard 532–B, 1975). It was mainly developed for steady sounds, such as tones or noise bursts. However, it is widely used in speech quality models as a starting point for the perceptual representation of speech stimuli.
Fig. 4.13 Short-Term Loudness values and corresponding temporal integration for an example condition (loudness in sone over time in s; the integration converges to LTL = 36.1 sone)
4.5.2.2 Long-Term Loudness

A Long-Term Loudness (LTL) value corresponds to the perceived loudness of the whole speech signal. The LTL is calculated from the STL. According to the algorithm developed by Glasberg and Moore (2002), the LTL is computed using a temporal integration of the STL over the entire signal duration. In DIAL, the LTL estimates the perceived loudness at the end of the degraded speech signal (i.e. at frame L, just before the subject's judgment). However, the STL is estimated for active speech frames only. Since a pause between two sentences will influence the LTL, the proposed algorithm needs all frames included in the degraded speech signal (either pause or speech periods). The total degraded loudness vector L_y(l) is therefore composed of the computed L_y(l) values for active frames and 0 for either pauses (x_sentence(l) = 0) or transition periods (x_sentence(l) = 0.5). LTL_y(1) is initialized to L_y(1) and then updated at each frame l according to:

LTL_y(l+1) = \begin{cases} \alpha_a \left( L_y(l+1) − LTL_y(l) \right) + LTL_y(l), & \text{if } L_y(l+1) > LTL_y(l) ; \\ \alpha_r \left( LTL_y(l) − L_y(l+1) \right) + LTL_y(l), & \text{otherwise} , \end{cases}   (4.52)

where α_a and α_r are two coefficients related to the attack and release time of the temporal integration. These constants have been set to α_a = 0.2 and α_r = −10^{−3} to obtain a rapid increase of LTL_y(l) during attack phases and a smooth decrease of LTL_y(l) during release phases. The algorithm developed by Glasberg and Moore (2002) has been re-trained to better match the quality scores of databases carried out according to ITU–T Rec. P.800 (1996). Figure 4.13 shows the STL values and the temporal integration for an example condition.
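The temporal integration of Eq. (4.52) reduces to a one-pass recursion; a minimal sketch, assuming the STL values (with zeros for pauses and transitions) are stored in an array:

ALPHA_A, ALPHA_R = 0.2, -1e-3   # attack / release coefficients from Sec. 4.5.2.2

def long_term_loudness(stl):
    # Recursive attack/release integration of the short-term loudness,
    # Eq. (4.52); `stl` holds one total-loudness value per frame.
    ltl = stl[0]
    for s in stl[1:]:
        if s > ltl:   # attack phase: fast rise towards a louder frame
            ltl += ALPHA_A * (s - ltl)
        else:         # release phase: slow decay (ALPHA_R is negative)
            ltl += ALPHA_R * (ltl - s)
    return ltl

The returned value corresponds to LTL_y(L), the perceived loudness just before the subject's judgment.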
4.5.2.3 Equivalent Continuous Sound Level

A second estimator of the perceived listening level is the Equivalent Continuous Sound Level (L_eq). This measure, defined in dB_SPL, corresponds to the mean energy of the degraded signal over all active speech frames:

L_{y,eq} = 10 \log_{10} \left( \frac{\frac{1}{L} \sum_{l=1}^{L} P_y(l)}{p_{ref}^2} \right) ,   (4.53)
where pref = 2 × 10−5 Pa is the auditory threshold for a pure tone at 1 kHz and Py (l) is the energy per speech frame of the degraded signal y(k) before the variable gain compensation, see Eq. (4.10). This simple measure is widely used for the level measurement of background noise. In addition, the ITU-R standardized a similar measure for the estimation of the loudness of audio samples (ITU–R Rec. BS.1770–1, 2007).
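As a minimal sketch (with an illustrative function name), the L_eq computation of Eq. (4.53) reduces to a few lines, assuming the per-frame energies P_y(l) of the active frames are available in Pa²:

import numpy as np

P_REF = 2e-5  # 20 uPa, auditory threshold for a pure tone at 1 kHz

def equivalent_continuous_level(p_frames):
    # Mean energy over the active speech frames, mapped to dB_SPL (Eq. 4.53).
    return 10.0 * np.log10(np.mean(p_frames) / P_REF ** 2)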
4.5.2.4 Computation of the MOS_loud score

Following the impairment factor principle used by the parametric E-model (ITU–T Rec. G.107, 1998), a loudness impairment factor I_L is derived from the estimated LTL value. The estimation from the multi-band loudness model developed by Zwicker and Fastl (1990) takes into account the bandwidth of the degraded speech signal. However, the effects due to bandwidth restrictions are also covered by the Coloration estimator, see Sec. 4.5.1. A comparison between the two loudness measures showed that a combination of the LTL parameter and the bandwidth impairment factor I_bw reduces the reliability of the estimated integral quality.

Two scores are provided by the Loudness estimator. First, a MOS_lev score is derived from the L_eq and then used by the judgment model. This model, described in Sec. 4.6, provides the integral speech transmission quality MOS_LQO. Then, a degradation over the loudness MOS scale is calculated using the LTL. For this purpose, a loudness impairment factor is calculated using a third order polynomial function:

I_L = \alpha_0 + \alpha_1 \times LTL_y(L) + \alpha_2 \times LTL_y(L)^2 + \alpha_3 \times LTL_y(L)^3 ,   (4.54)

where L is the last frame of the degraded signal and the coefficients are α_0 = 138.48, α_1 = −3.45, α_2 = 3.52 × 10−4 and α_3 = 3.10 × 10−4. These coefficients have been determined using the database "P.OLQA 1" described in App. B.2, which includes 8 S-WB conditions impaired by a non-optimum listening level.

Following the approach used with I_bw, MOS_loud is derived from the estimated loudness impairment. First, I_L is limited to positive values (i.e. set to 0 if a negative value is estimated). Then, the impairment value is mapped to the MOS
scale using the transformation described in ITU–T Rec. G.107 (1998), see Eq. (2.4). Again, the input of Eq. (2.4) depends on the operational mode:

• NB mode: R = 100 − I_L
• S-WB mode: R = (160 − I_L) / 1.6
4.5.3 Discontinuity

Discontinuity degradations correspond to either an isolated distortion or a non-stationary distortion. An isolated distortion is introduced by the loss of packets during VoIP transmissions or by erroneous bits during radio transmissions. Even though isolated, such a distortion may be repeated during the transmission. In the worst case, the lost packets are replaced by silence frames of the same length, called "zero insertion". However, a Packet Loss Concealment (PLC) algorithm can reduce the impairment caused by such isolated distortions, see Sec. 1.3.4.3.

The instrumental assessment of "discontinuities" is a very difficult task. The estimator described in this section combines different algorithms. A first algorithm calculates the amount of samples lost because of variable delay, expressed as the time-clipping rate. Then, a second algorithm, developed by Huo et al. (2008b), is used to detect discontinuities in the speech sample.
4.5.3.1 Time clipping

Time-clipping parameters have been originally introduced by Freeman et al. (1989) for the performance measurement of VAD algorithms. They measure wrong detections of voice activity in speech signals with low SNR. Such wrong detections may introduce time-clipping into the speech signal at the listener's side. Ding et al. (2006) applied a similar approach to measure the effects of time-clipping on the integral quality. Regarding the location of the degradation on the time scale, three parameters are used to categorize time-clipping:

• Front-End Clipping (FEC): loss of the first speech samples.
• Middle Speech Clipping (MSC): samples lost between two successive frames in the degraded signal.
• Back-End Clipping (BEC): loss of the last speech samples.
A schematic description of each type of time-clipping is shown in Fig. 4.14. This figure represents the sentence vectors xsentence (l) and ysentence (l) described in Sec. 4.3.2.4. Ding et al. (2006) derived an integral speech quality using these three
parameters according to:

MOS_{dis} = 4.55 − \alpha_1 FEC − \alpha_2 MSC − \alpha_3 BEC ,   (4.55)

where the coefficients α_1, α_2 and α_3 depend on the frame length. Ding et al. (2006) found that the parameter with the highest impact on the integral quality is FEC. However, no time-clipping detection algorithm is described by Ding et al. (2006). In addition, this relationship was derived from PESQ estimations instead of auditory results.

Speech transmission systems introduce a delay in the speech signal. This delay can be either fixed over the call in circuit-switched networks (e.g. PSTN), or variable in packet-switched networks (e.g. VoIP). Delay variation may introduce discontinuities in the speech sample. The three parameters FEC, MSC and BEC are used by the Discontinuity estimator to quantify the impact of lost frames in case of highly variable delay. The time-alignment algorithm estimates a precise delay for each active frame, see Sec. 4.3.2.6. For instance, if the delay between two successive packets is reduced, one or several frames from the reference signal will not appear in the degraded speech signal. In this case, these frames are discarded from the model analysis (i.e. x_sentence(l) = −1). In order to take this discontinuity into account, the sample-loss rates corresponding to each category are calculated. Then, an overall time-clipping rate is computed as:

TC = FEC + MSC + BEC .   (4.56)
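The classification below is a sketch of this bookkeeping. It assumes boolean per-frame vectors (active for reference speech activity, lost for frames discarded after a delay reduction) instead of the ternary sentence vectors used by DIAL, and all names are illustrative.

import numpy as np

def time_clipping_rates(active, lost):
    # Classify lost speech frames as front-end (FEC), middle (MSC) or
    # back-end clipping (BEC) and return the rates plus TC, Eq. (4.56).
    fec = msc = bec = 0
    idx = np.flatnonzero(active)
    # contiguous runs of active frames = individual speech bursts
    sentences = np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1)
    for s in sentences:
        if s.size == 0:
            continue
        runs = np.split(s, np.flatnonzero(np.diff(lost[s].astype(int)) != 0) + 1)
        for run in runs:
            if not lost[run[0]]:
                continue
            if run[0] == s[0]:       # loss starts the burst -> front-end
                fec += len(run)
            elif run[-1] == s[-1]:   # loss ends the burst -> back-end
                bec += len(run)
            else:                    # loss surrounded by speech -> middle
                msc += len(run)
    n = max(int(active.sum()), 1)
    fec, msc, bec = fec / n, msc / n, bec / n
    return fec, msc, bec, fec + msc + bec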
4.5.3.2 Artifact detector

An artifact detector developed by Huo et al. (2008b) complements the time-clipping parameters. This estimator quantifies the impact of audible distortions due to packet losses. Three parameters are estimated by this artifact detector: (i) the interruption rate r_I (i.e. long level attenuations), (ii) the artifact rate r_A (i.e. spectrum deviations), and (iii) the short level variation rate r_L.
Pre-processing

Since strong linear distortions may influence the artifact detector, the signals used by the Discontinuity estimator have been pre-processed by the Core model:

• The delay κ_l between the reference and degraded signals is estimated for each corresponding frame l, see Sec. 4.3.2.
• The receiving terminal is simulated on both input signals, resulting in x′(k) and y′(k), see Sec. 4.3.3.
Fig. 4.14 Schematic description of each type of time-clipping using the reference (x_sentence(l)) and degraded (y_sentence(l)) sentence vectors. a Front-End Clipping (FEC). b Middle Speech Clipping (MSC). c Back-End Clipping (BEC)
• The frequency response and the variable gain introduced by the speech transmission system are partially compensated, resulting in P_{x′x′}(l, z) and P_{y′y′}(l, z), see Sec. 4.4.2.
In addition, the envelopes of the reference and of the degraded speech signals, in terms of dB_SPL, are calculated from the energy per frame by:

e_x(l) = 10 \log_{10} \frac{P_{x′}(l)}{p_{ref}^2} ,
e_y(l) = 10 \log_{10} \frac{P_{y′}(l)}{p_{ref}^2} ,   (4.57)

where p_ref = 2 × 10−5 Pa is the auditory threshold for a pure tone at 1 kHz. Using the envelope of the reference speech signal, a segmental SNR, SNR_Seg(l), is calculated according to:

SNR_{Seg}(l) = e_x(l) − L_{neq} ,   (4.58)

where L_neq is the noise equivalent level estimated by the Noisiness estimator, see Eq. (4.73). The envelopes and the segmental SNR are used by the artifact detector later on.
Spectrum deviation

In case a packet is lost during a transmission or discarded by the buffer management algorithm at the listener's side, the corresponding frames are synthesized into the degraded signal. PLC algorithms create the lost frames using an interpolation from the previous and/or next frame. The estimator developed by Huo et al. (2008b) uses a Weighted Spectral Slope (WSS) algorithm to detect the spectral distortions in the interpolated frames. The WSS algorithm has been introduced by Klatt (1982) to measure a phonetic distance. The algorithm described in Huo et al. (2008b) has been updated to follow the analysis framework used by the Core model (i.e. frame length, overlap). The spectral slope is calculated from the reference and the degraded Bark power densities P_{x′x′}(l, z) and P_{y′y′}(l, z) as:

S_{x′x′}(l, z) = 10 \log_{10} \frac{P_{x′x′}(l, z+1)}{P_{x′x′}(l, z)} ,
S_{y′y′}(l, z) = 10 \log_{10} \frac{P_{y′y′}(l, z+1)}{P_{y′y′}(l, z)} .   (4.59)

Since the Bark power densities use 24 critical band rates (Bark), the spectral slopes are defined on 23 intervals. Then, the WSS measure for frame l is computed as:

d_{WSS}(l) = \sum_{z=1}^{23} W(l, z) \times \left( S_{x′x′}(l, z) − S_{y′y′}(l, z) \right)^2 ,   (4.60)
where W(l, z) is a weighting function which depends on the reference speech spectrum P_{x′x′}(l, z). Since the WSS measure is used as a phonetic distance, its algorithm is mainly influenced by the formants in the speech spectrum. In order to attenuate the impact of high background noise levels, the analysis is restricted to the bandwidth [z_low, z_high] defined in Sec. 4.5.1.3. Artifacts, i.e. audible spectrum deviations, are detected in case the d_WSS(l) value is higher than the following threshold:

\gamma_{WSS}(l) = \max\left\{ \gamma_{SNR,WSS}(l) + 1.5\,\gamma_{floor} + 5,\ 30 \right\} ,   (4.61)

where γ_SNR,WSS(l) depends on the segmental SNR:

\gamma_{SNR,WSS}(l) = \exp\left( −0.12 \times (SNR_{Seg}(l) − 60) \right) ,   (4.62)

and γ_floor depends on the distribution of the WSS values, see Huo et al. (2008b) for more details. Then, an "artifact rate" r_A is computed from the detected artifacts (audible spectrum deviations).
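A sketch of the WSS-based detection (Eqs. 4.59–4.62) is shown below, assuming the per-frame Bark densities and the weighting W(l, z) are given (the weighting itself follows Klatt (1982) and is not reproduced). The parenthesization of the exp() term is an assumption recovered from the layout of the original equation.

import numpy as np

def wss_distance(pxx_bark, pyy_bark, w):
    # Weighted Spectral Slope distance per frame, Eqs. (4.59)-(4.60).
    # pxx_bark, pyy_bark: (L, 24) Bark power densities; w: (L, 23) weights.
    eps = 1e-12
    sx = 10 * np.diff(np.log10(pxx_bark + eps), axis=1)  # 23 slope intervals
    sy = 10 * np.diff(np.log10(pyy_bark + eps), axis=1)
    return np.sum(w * (sx - sy) ** 2, axis=1)            # d_WSS(l)

def artifact_frames(d_wss, snr_seg, gamma_floor):
    # Frames whose WSS distance exceeds the SNR-dependent threshold,
    # Eqs. (4.61)-(4.62); r_A is the fraction of detected frames.
    gamma_snr = np.exp(-0.12 * (snr_seg - 60.0))
    gamma = np.maximum(gamma_snr + 1.5 * gamma_floor + 5.0, 30.0)
    return d_wss > gamma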
Strong level variations

Since the variable gain introduced by speech transmission systems may influence the detection of either long or short level variations, the signals used by this algorithm are not compensated by the variable gain G(l) estimated in Sec. 4.4.2.2, but only by its smoothed version G_s(l).

Two sources of strong level variations are considered by this algorithm. First, in case no PLC algorithm is used by a transmission system, the lost frames are replaced by silence frames. Such interruptions produce a long level attenuation in the degraded speech signal. These interruptions are detected using the difference between the reference and the degraded envelopes:

d_e(l) = e_x(l) − e_y(l) .   (4.63)

Interruptions γ_e1 are detected in case d_e(l) is higher than the threshold γ_SNR,Int(l). This threshold depends on the estimated segmental SNR according to:

\gamma_{SNR,Int}(l) = \exp\left( −0.2 \times (SNR_{Seg}(l) − 40) \right) − 10 .   (4.64)

Then, an "interruption rate" r_I is computed from the detected interruptions.

A second source of strong level variations is the electromagnetic interference produced by mobile phones. These interferences create short noise bursts in acoustic devices. Such short level variations, referred to as γ_e2, are detected by an algorithm originally used in TOSQA and updated by Huo et al. (2008b). First, the level variation
Fig. 4.15 Difference d_e(l) between the reference and degraded envelopes (Δ gain in dB over time in s; the detected discontinuities yield r_L = 7.4% in this example). The horizontal lines correspond to the detected discontinuities
in both signals is calculated according to:

d_{e_x}(l) = e_x(l + 1) − e_x(l) ,
d_{e_y}(l) = e_y(l + 1) − e_y(l) .   (4.65)

Then, the differences between the reference and the degraded short level variations are computed:

\Delta d_e(l) = d_{e_y}(l) − d_{e_x}(l) .   (4.66)

In case Δd_e(l) is higher than a threshold γ_SNR,Var(l), a short level variation is considered as detected. This threshold depends on the segmental SNR according to:

\gamma_{SNR,Var}(l) = \max\left\{ \exp\left( −0.55 \times (SNR_{Seg}(l) − 50) \right),\ 4 \right\} .   (4.67)

Then, a short level variation rate r_L is computed from the detected variations. Figure 4.15 shows d_e(l) and the detected discontinuities for an example condition. The stimulus comes from the database "P.OLQA 1" (see App. B.2) and corresponds to a S-WB condition with 10% of interruptions.
4.5.3.3 Computation of the MOS_dis score

For a few conditions, the time-clipping rate TC is highly overestimated because of a wrong time-alignment, which reduces the reliability of the estimated integral quality. Therefore, the estimated discontinuity MOS value does not take this parameter into account. An approach similar to the Loudness estimator can be used: a first MOS_dis value is used by the judgment model to derive the integral speech transmission quality MOS_LQO, and a second MOS_dis value is used for diagnosis purposes (i.e. detection of the discontinuities). The TC parameter may be used in the latter
case. However, this approach depends on the accuracy of the time-alignment algorithm; since the TC parameter has been over-estimated in some specific cases, it was left out of the DIAL model and may be included in a future update.

The MOS_dis score is thus estimated following the model developed by Huo et al. (2008b). First, two sub-dimensions are calculated: \widehat{sd}_1, related to the short level variation rate r_L, and \widehat{sd}_2, related to the artifact rate r_A (spectrum deviations):

\widehat{sd}_1 = −1.6153 + 0.98 \ln(r_L + 1) ,
\widehat{sd}_2 = −1.0592 + 0.7530\,r_A − 0.0543\,r_A^2 .   (4.68)

Then, these two sub-dimensions are combined with the interruption rate r_I to calculate the discontinuity MOS score:

MOS_{dis} = 2.39 − 1.048\,\widehat{sd}_1 − 0.214\,\widehat{sd}_2 − 0.978\,r_I^2 .   (4.69)
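A direct transcription of Eqs. (4.68)–(4.69); the units of the rates r_L, r_A and r_I follow Huo et al. (2008b) and are assumed, not restated here:

import math

def mos_dis(r_l, r_a, r_i):
    # Discontinuity MOS from the three detector rates, Eqs. (4.68)-(4.69).
    sd1 = -1.6153 + 0.98 * math.log(r_l + 1.0)        # short level variations
    sd2 = -1.0592 + 0.7530 * r_a - 0.0543 * r_a ** 2  # spectrum deviations
    return 2.39 - 1.048 * sd1 - 0.214 * sd2 - 0.978 * r_i ** 2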
4.5.4 Noisiness

Among the different features of speech transmission quality, the influence of background noise on the speech transmission quality has been studied for several decades, see for instance Fletcher and Arnold (1929). Two different types of noise may have a different impact on the perceptual dimension Noisiness. For instance, Wältermann et al. (2008) include in this dimension: (i) environmental background noises such as the noise surrounding the talker, and (ii) the circuit noise introduced by analog transmission. In addition, waveform coders introduce a coding noise, especially ADPCM-type coders at low bit-rate, e.g. the ITU–T Rec. G.726 (1990) at 16 kbit/s. Nowadays, the instrumental assessment of background noise is still under study, especially with the introduction of mobile networks. The Noisiness estimator introduced in this section combines different algorithms which have been developed especially for the DIAL model. Several other estimators have been developed for the perceptual dimension Noisiness; for instance, see Scholz et al. (2008), Kühnel et al. (2008) and Leman et al. (2009).
4.5.4.1 Pre-processing

The DIAL pre-processing described in Sec. 4.3 comprises different steps which serve both the Core model and the Noisiness estimator. First, active frames in the reference signal are detected by the VAD (x_sentence(l) = 1). Then, a time-alignment
algorithm estimates the delay κ_l of the corresponding active and silence frames between the reference and the degraded signal. The silence (or noisy) frames l^n = 1, ..., L^n of the degraded signal (y_sentence(l) = 0) are thus detected and used by the Noisiness estimator. However, in order to represent the degraded signal as listened to by the user, the user terminal at the listener's side has been simulated by a FIR filter, see Sec. 4.3.3. The resulting input signals of the Noisiness estimator correspond to x′(k) and y′(k).
4.5.4.2 Background noise estimation

Since silence segments (i.e. without energy) do not influence the subject's judgment, these are ruled out of the analysis. This algorithm estimates the additive noise n_add(k) in the degraded signal using the "noisy" segments only, i.e. the "non-active" periods of y(k) with y_sentence(l) = 0. Using an approach similar to the Core model, a Hann window is applied to the signal waveform. The PSD of every silence/noisy frame y_w(k, l^n) is defined as:

P_{nn}(l^n, e^{j\Omega}) = \frac{1}{M^2} \left( Y \times Y^* \right) ,   (4.70)

where l^n is the index of the silence/noisy frames, and Y the short-term DFT of y_w(k, l^n). Then, the PSD is transformed to the Bark scale using the following summation:

P_{nn}(l^n, z) = \sum_{k=k_{low}(z)}^{k_{up}(z)} P_{nn}(l^n, e^{j\Omega_k}) ,   (4.71)

where the limits of each critical-band rate, k_low(z) and k_up(z), are defined in Sec. 4.4.1.3. The energy in each noisy frame is calculated by averaging P_nn(l^n, z) over the whole Bark scale:

P_n(l^n) = \frac{1}{24} \sum_{z=1}^{24} P_{nn}(l^n, z) .   (4.72)

Then, a noise equivalent level L_neq is calculated as the average over all silence/noisy frames:

L_{neq} = 10 \log_{10} \left( \frac{\frac{1}{L^n} \sum_{l^n=1}^{L^n} P_n(l^n)}{p_{ref}^2} \right) ,   (4.73)

where p_ref = 2 × 10−5 Pa is the auditory threshold for a pure tone at 1 kHz.
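Given per-frame Bark-band PSDs for the noisy frames (Eqs. 4.70–4.71 assumed computed upstream), the noise equivalent level reduces to two averages; a minimal sketch with illustrative names:

import numpy as np

P_REF = 2e-5  # 20 uPa

def noise_equivalent_level(pnn_bark):
    # pnn_bark: (Ln, 24) Bark-band PSDs of the silence/noisy frames.
    p_n = pnn_bark.mean(axis=1)                    # per-frame energy, Eq. (4.72)
    return 10.0 * np.log10(p_n.mean() / P_REF ** 2)  # L_neq, Eq. (4.73)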
Fig. 4.16 Estimated additive noise P_n(l^n) for an example condition including strong level variations (level in dB over time in s). The solid line corresponds to the equivalent continuous noise level L_neq. The dotted line corresponds to the equivalent continuous noise level L_{n′,eq} without discontinuous frames. The vertical lines correspond to the detected discontinuities
4.5.4.3 Discontinuities in the background noise

Additive noise may vary over the whole speech signal. Usually, environmental noises such as "street" or "cafeteria" noise vary within a short dynamic range (e.g. 15 dB). However, abrupt level variations, such as mobile phone interferences within speech pauses, are considered here as discontinuities. Such discontinuities may bias the estimated L_neq value. An algorithm to detect the discontinuities in the noisy frames has been developed and included in the Noisiness estimator: any noisy frame having a level P_n(l^n) higher than 40 dB_SPL and 15 dB higher than the noise equivalent level L_neq is considered as an abrupt level variation. Then, the noise equivalent level is updated with the remaining frames. This process is repeated (up to a maximum of 10 iterations) as long as such strong level variations are detected. Figure 4.16 shows the noise level for each frame for an example condition. Frames with detected abrupt level variations are marked in this figure (vertical lines).

The mean noise spectrum is calculated by averaging the PSD over all L^{n′} frames without strong level variation:

\tilde{\Phi}_{nn}(z) = \frac{1}{L^{n′}} \sum_{l^n=0}^{L^{n′}−1} P_{nn}(l^n, z) .   (4.74)

This mean noise spectrum is transformed to a noise loudness density L_nn(z) according to Eq. (4.21). Frequency masking effects are calculated according to the model for loudness calculation developed by Zwicker and Fastl (1990). This results in a noise loudness density L_nn(z) as perceived by the user. Then, the total noise loudness L_n is calculated as a simple sum of L_nn(z) over the Bark scale:

L_n = \sum_{z=1}^{24} L_{nn}(z) .   (4.75)
4.5.4.4 Additive noise in active frames

A discontinuous transmission (DTX) algorithm avoids the transmission of the signal during speech pauses. In this case, the environmental noise present at the talker's side is transmitted during speech periods only, resulting in an under-estimated additive noise loudness value L_n (calculated during speech pauses). Additive noise components during speech, dP(l, z), are calculated according to:

dP(l, z) = 10 \log_{10} \frac{P_{y′y′}(l, z)}{P_{x′x′}(l, z)} ,   (4.76)

where P_{y′y′}(l, z) and P_{x′x′}(l, z) are the Bark spectrum densities after frequency and variable gain compensation, see Eq. (4.20). By nature, additive noise produces positive values only. Consequently, the derived parameter is limited to the range dP(l, z) ∈ [0, +∞). Above the high bandwidth limitation, i.e. z > z_high (see Sec. 4.5.1.3), inaudible components in the degraded signal may amplify dP(l, z). In order to avoid this bias, the additive noise components dP(l, z) are aggregated (i) over the Bark range z = 2, ..., z_high, and (ii) over the L active speech frames. This aggregation process results in the "Noise on Speech" (NoS) parameter:

NoS = \frac{1}{L} \frac{1}{\Delta Z} \sum_{l=1}^{L} \sum_{z=2}^{z_{high}} dP(l, z) ,   (4.77)

where ΔZ corresponds to the bandwidth [2, z_high].
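A sketch of the NoS aggregation (Eqs. 4.76–4.77), assuming the compensated Bark densities are stored with band z in column z − 1:

import numpy as np

def noise_on_speech(pxx_bark, pyy_bark, z_high):
    # pxx_bark, pyy_bark: (L, 24) compensated Bark densities of the
    # active frames; columns 1 .. z_high-1 hold bands z = 2 .. z_high.
    eps = 1e-12
    dp = 10 * np.log10((pyy_bark + eps) / (pxx_bark + eps))  # Eq. (4.76)
    dp = np.maximum(dp, 0.0)       # additive noise can only add energy
    return dp[:, 1:z_high].mean()  # average over frames and bands, Eq. (4.77)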
4.5.4.5 Computation of the MOS_noi score

A noisiness impairment factor is calculated from the two parameters L_n and NoS. Using a third order polynomial function with the coefficients defined in Table 4.5, the parameters are mapped into an I_noi^{L_n} and an I_noi^{NoS} value. The final noisiness impairment factor is simply the maximum over the two impairment factors:

I_{noi} = \max\left\{ I_{noi}^{L_n},\ I_{noi}^{NoS} \right\} .   (4.78)

Then, following the principle used with the bandwidth and loudness impairment factors, MOS_noi is derived from the noisiness impairment factor I_noi. For this purpose, I_noi is mapped to the MOS scale using Eq. (2.4). Again, the R input value of Eq. (2.4) depends on the operational mode:

• NB mode: R = 100 − I_noi
• S-WB mode: R = (160 − I_noi) / 1.6
Table 4.5 Model coefficients for the noisiness impairment factor I_noi

Parameter   α1     α2     α3             α4
NoS         20.7   13.1   −1.56          8.70 × 10−2
L_n         18.7   3.19   −7.22 × 10−2   6.46 × 10−4
4.6 Judgment model

The DIAL model provides an integral speech transmission quality estimation MOS_LQO and four additional quality estimates, MOS_col, MOS_loud, MOS_dis and MOS_noi, according to the four perceptual dimensions Coloration, Loudness, Discontinuity and Noisiness, respectively. These perceptual estimators, described in the previous sections, are used to diagnose the quality degradations. The DIAL instrumental measure has been obtained by combining the dimension estimators with the Core model to form a reliable intrusive speech quality model. Following the quality judgment process introduced in Sec. 1.2.1, this aggregation simulates the cognitive process employed by the user to judge the quality of the stimuli listened to:

\widehat{MOS} = f(q_i) + \varepsilon ,   (4.79)

where the q_i are the quality estimations and ε the error distribution (i.e. a consistency measure). The function f is unknown and needs to be approximated in such a way that ε is minimized. The ideal function has been introduced in Sec. 1.4, see Eq. (1.5). This function can be derived by applying linear regression analysis or by using machine learning techniques. The following section presents several examples. However, a selection process has been employed to choose the aggregation included in DIAL. This selection process used the evaluation procedure described in Sec. 5.1.3. The aggregation finally included in DIAL is described in Sec. 4.6.2. It relies on a machine learning technique called k-Nearest Neighbors (k-NN).
4.6.1 Linear regression analysis

A simple simulation of Eq. (1.5) is a linear combination of the perceptual estimators and the core model:

\widehat{MOS} = \alpha_0 + \sum_{i=1}^{M} \alpha_i \times q_i ,   (4.80)

where i is the parameter index and α_i the weighting coefficient associated with the quality estimate q_i. An example of such a combination corresponds to the "additivity theorem" followed by the network planning model E-model (ITU–T Rec. G.107, 1998): all impairments are additive on a psychological scale (Allnatt, 1975). In this case, the integral quality score on the R-scale is defined as:
R = R_{CORE} − I_{bw} − I_L − I_C − I_{noi} ,   (4.81)

where the discontinuity impairment factor I_C is derived from the MOS_dis score. However, such a linear combination supposes that the estimated quality features are not inter-dependent.

Using all 10 full-scale databases (i.e. their test corpus covered the whole speech quality space), 9 different linear functions have been derived. Five functions follow Eq. (4.80). Four additional functions used crossed parameters:

\widehat{MOS} = \alpha_0 + \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_{ij} \times q_i q_j ,   (4.82)

where α_ij is the weighting coefficient associated with the quality estimates q_i and q_j. These linear functions employed either impairment factors or MOS values as input parameters. For each database, the corpus effect has been attenuated using a specific normalization procedure. Each database includes the same 11 anchor conditions. The MOS_LQS,d values for each anchor condition have been averaged over all 10 databases, resulting in a mean auditory MOS_LQS value. A third order polynomial function f_d has been estimated between the MOS_LQS,d values for the 11 anchor conditions in database d and the 11 mean MOS_LQS values. Then, the mapping function f_d has been applied to all the auditory MOS_LQS,d values of database d.

Over these 9 functions, a maximum Pearson correlation of ρ = 0.810 has been achieved. In order to improve the judgment model, a second type of aggregation using a machine learning technique has been employed. This technique is described in the next section.
4.6.2 k-Nearest Neighbors Following Fu et al. (2000), machine learning techniques provide more robust estimates than linear methods. In DIAL, the combination of the estimated quality features with the Core model estimation relies on a machine learning technique, called k-Nearest Neighbors (k-NN). This is a non-parametric method used for different purposes such as density estimation, identification or instrumental assessment. Here, this statistical method provides an effective estimation of the integral speech transmission quality using a large data set of known stimuli.
4.6.2.1 Algorithm description

The algorithm uses two data sets {x_1 ... x_N} and {y_1 ... y_N} comprising n = 1, ..., N observations (i.e. speech stimuli). Each observation n includes a vector x_n = (x_{n,1}, ..., x_{n,5}) of five estimated quality values and a corresponding auditory integral quality score y_n. The parameters i correspond to the MOS estimations from the four perceptual estimators and the Core model. Considering a test vector μ = (μ_1, ..., μ_5) (an unknown stimulus), the goal of the k-NN algorithm is to assign an integral quality score to μ. This algorithm comprises four stages:

1. The parameter values x and μ are normalized to the range [−1; +1].
2. The Euclidean distance between the test value μ and each of the N observations is calculated as:

D_n = \sum_{i=1}^{5} (\mu_i − x_{n,i})^2 ,   (4.83)

where n is the observation index and i the parameter index.
3. The K observations having the lowest distance are selected as the "neighbors" of the test value μ.
4. The estimated integral quality score corresponds to the integral quality scores averaged over the selected K neighbors:

\widehat{MOS}_{LQO} = \frac{1}{K} \sum_{k=1}^{K} y_k .   (4.84)
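A compact sketch of the four stages is given below (names are illustrative; the per-parameter normalization bounds are assumed to be stored with the training data). It also returns the consistency measure introduced just after in Eq. (4.85).

import numpy as np

def knn_quality(mu, x_train, y_train, k, lo, hi):
    # x_train: (N, 5) estimated quality vectors; y_train: (N,) auditory MOS;
    # lo/hi: per-parameter minima/maxima for the [-1, +1] scaling.
    scale = lambda v: 2.0 * (v - lo) / (hi - lo) - 1.0      # stage 1
    d = np.sum((scale(mu) - scale(x_train)) ** 2, axis=1)   # stage 2, Eq. (4.83)
    nearest = np.argsort(d)[:k]                             # stage 3
    mos = y_train[nearest].mean()                           # stage 4, Eq. (4.84)
    cons = np.mean((y_train[nearest] - mos) ** 2)           # Eq. (4.85)
    return mos, cons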
Figure 4.17 presents an example of the k-NN algorithm with a combination of two estimated quality values: MOS_Core (x axis) and MOS_col (y axis).

One drawback of the k-NN method is the important amount of data x and y needed to provide meaningful estimates. Since these data sets are limited, a \widehat{MOS} consistency measure is calculated. This corresponds to the variance of y_k over the K neighbors:

Cons_{MOS} = \frac{1}{K} \sum_{k=1}^{K} \left( y_k − \widehat{MOS}_{LQO} \right)^2 .   (4.85)

4.6.2.2 Training phase

The training phase is used to (i) save the two data sets {x_1 ... x_N} and {y_1 ... y_N} comprising N observations, and (ii) select the best number of neighbors K. Even though the computational load is limited during this training phase, a specific algorithm is applied to accurately select the best number of neighbors. The following procedure is applied for each number of neighbors within the range K = 4 ... 15:
Fig. 4.17 Example of the k-NN algorithm with a combination of two estimated quality values: MOS_Core (x axis) and MOS_col (y axis)
• The whole data set of N observations is partitioned into 20 randomly selected sub-sets. Within these 20 sub-sets, 10 are used for training purposes (n_train(i), with i = 1, ..., 10) and the other 10 for testing purposes (n_test(i)).
• Using the training sub-set, integral quality estimations MOS_LQO are calculated for the corresponding testing sub-set, i.e. the same sub-set index i.
• A prediction error σ(K, i) is calculated between the auditory MOS_LQS values and the MOS_LQO values for each testing sub-set.
• The σ(K, i) are averaged over all 10 sub-sets, resulting in σ(K).

The best number of neighbors K corresponds to the lowest σ(K) over the range K = 4 ... 15. This training phase is applied in both NB and S-WB modes.

Since each database is slightly biased by a "corpus effect", see Sec. 2.2.5.3, "noise" appears in the data set {y_1 ... y_N}. In order to reduce the bias introduced by the corpus effect, the training phase comprises three stages:

1. The procedure described above is applied to the selected databases.
2. The MOS_LQO values, based on physical parameters, are not biased by the corpus effect. Hence, for each database the auditory MOS_LQS values are mapped to the estimated MOS_LQO values using a monotonic 3rd order polynomial function. This transformation attenuates the corpus effect in the auditory MOS_LQS values.
3. The same procedure is applied a second time, to the transformed auditory scores.
Table 4.6 shows the resulting parameters for each mode of the DIAL model.

Table 4.6 Parameters of the k-Nearest Neighbors (k-NN) statistical model for each operational mode of the DIAL model

Mode   Databases   Observations N   Neighbors K
NB     44          18 276           13
S-WB   24          12 161           12
4.7 Summary and Conclusion

Quality measures such as the PESQ model show limitations in the assessment of the integral speech quality when different degradations are combined. Consequently, in Chap. 4, a new intrusive speech quality model has been defined. This model relies on a specific combination of a Core model, based on TOSQA, and four dimension estimators. The Core model assesses the non-linear degradations introduced by a speech transmission system, whereas the dimension estimators quantify the linear degradations on the four perceptual dimensions defined in Sec. 1.4: Coloration, Loudness, Discontinuity and Noisiness. Then, an aggregation of all the degradations simulates the cognitive process employed by human subjects during the quality judgment process. In Chap. 5, an evaluation of DIAL is presented.

The Core model uses the similarity measure employed by the TOSQA model. This measure has been developed by Berger (1998) on NB stimuli impaired by low bit-rate speech codecs. Nowadays, such impairments represent a small part of all possible degradations introduced by speech transmission systems. Both the core model and the perceptual estimators have been updated to cover current degradations such as time-warping effects and S-WB transmissions. However, in order to obtain a fully optimized DIAL model in both NB and S-WB contexts, many building blocks should be improved to cope with all new degradations. Further possible developments are introduced in Chap. 6.
Chapter 5
Evaluation of DIAL
In this chapter, the experimental evaluation of the DIAL model is presented. This evaluation corresponds to a comparison of the DIAL estimations to subjects' quality judgments. Following the instrumental model development procedure, the evaluation presented here corresponds to a "validation" process of the candidate model, see Sec. 2.3. Section 5.1 describes the experimental set-up. It includes (i) the "testing" databases, which are described in more detail in App. B, (ii) the statistical measures, and (iii) the reference measures against which the new model is compared. Section 5.2 presents the experimental results for the NB mode and for the S-WB mode. Then, in Sec. 5.2.2, the dimension estimations are evaluated. Overall, DIAL has been compared to 94 auditory experiments, including 55 databases for the NB mode and 39 databases for the S-WB mode.
5.1 Experimental set-up

5.1.1 Reference instrumental measures

The DIAL model estimates the integral speech transmission quality by providing MOS_LQO values. The degradation decomposition provides four additional MOS_dim values related to the four perceptual dimensions described in Sec. 1.4. To reliably evaluate the newly developed model, DIAL is compared to the reference intrusive speech quality measures listed in Table 5.1. Scholz's and Huo's measures are the two diagnostic models described in Sec. 2.3.2.5. The results from these two measures are compared to the scores provided by the DIAL dimension estimators. Table 5.1 and the following section describe the input parameters used for the reference models and DIAL. Several reference models accept only signals sampled at f_S = 16 kHz. For those models, the 11 S-WB databases used in this evaluation procedure have been down-sampled from f_S = 48 kHz to f_S = 16 kHz. This produces a slight bias in the estimations.
Table 5.1 Instrumental methods command-line parameters. FB stands for a Full-Band audio bandwidth ranging from 20 to 20 000 Hz

Context  Measure        Command-line
NB       DIAL           DIAL_main ref.pcm deg.pcm 0 8000
         TOSQA          tosqa ref.pcm deg.pcm 8000 v500/f1000 FLAT/HS_Rec_XY NORM
         EMBSD          EMBSD ref.pcm deg.pcm 8000
         qC             speech_eval ref.wav deg.wav 1000
         P.AAM          paam ref.pcm deg.pcm 0 8000 e 79
         PESQ           pesq +8000 ref.pcm deg.pcm
         Enh. PESQ      enhpesq +8000 ref.pcm deg.pcm
         Scholz         Scholz_main ref.pcm deg.pcm
         Huo            Huo_main ref.pcm deg.pcm 0 8000
WB       TOSQA-2001     tosqa ref.pcm deg.pcm 16000 v500/f1000 FLAT/WIDEBAND NORM
         P.AAM^a        paam ref.pcm deg.pcm 16000 e 73
         WB-PESQ        pesq +wb +16000 ref.pcm deg.pcm
         Enh. WB-PESQ   enhpesq +wb +16000 ref.pcm deg.pcm
         Mod. WB-PESQ   modpesq +wb +16000 ref.pcm deg.pcm
         Huo            Huo_main ref.pcm deg.pcm 1 16000
S-WB     DIAL           DIAL_main ref.pcm deg.pcm 1 48000
FB       PEMO-Q         audio_eval ref.wav deg.wav 10 fb 1000

^a Since this NB instrumental model analyzes the whole bandwidth of the input speech signals, it has been evaluated on WB and S-WB databases also.
DIAL: The DIAL model is the new intrusive model described in Chap. 4.
• With NB databases: Raw-PCM signals sampled at f_S = 8 kHz were used. The NB algorithm was employed.
• With WB and S-WB databases: Raw-PCM signals sampled at f_S = 48 kHz were used. The S-WB algorithm was employed.

TOSQA: The TOSQA model, see Sec. 2.3.2.4, provides reliable estimates on acoustic recordings. The TOSQA-2001 model described in ITU–T Contrib. COM 12–19 (2000) corresponds to the WB version of TOSQA.
• With NB databases: TOSQA used raw-PCM signals sampled at f_S = 8 kHz. An IRS Receive filter is applied on both input speech signals. The time-alignment algorithm searches for a fixed delay κ within the range ±1 000 ms, and a variable delay within the range ±500 ms. Speech signals are level aligned.
• With WB and S-WB databases: TOSQA-2001 uses similar input parameters except that (i) the IRS-type filter is replaced by a WB filter 200–7 000 Hz, and (ii) raw-PCM signals sampled at f_S = 16 kHz are used.

EMBSD: The EMBSD model, see Sec. 2.3.2.4, is an improved version of the BSD measure including a simple judgment model. However, it does not include a time-alignment algorithm. A fixed delay estimation using a cross-correlation of the input signals has been included in EMBSD. The model uses raw-PCM speech signals sampled at f_S = 8 kHz.

qC: The qC measure is described in Sec. 2.3.2.4. The input parameter was set to compensate a delay κ within the range ±1 000 ms. In addition, this model uses "WAV" input signals sampled at f_S = 8 kHz.

PEMO-Q: The PErception MOdel—Quality assessment model, see Sec. 2.3.2.4, measures the perceived quality of audio signals including speech and music. PEMO-Q provides estimations in a high-quality context only. An analysis frame length of 10 ms and a modulation filter bank are used (default values). Delay, κ, ranging from −1 000 to 1 000 ms is compensated. Similarly to qC, PEMO-Q uses "WAV" input signals sampled at f_S = 48 kHz for S-WB databases, and sampled at f_S = 16 kHz for WB databases.

P.AAM: The P.AAM model used in this evaluation procedure has been developed by Deutsche Telekom AG, Germany, see Sec. 2.3.2.4. The P.AAM model has been developed for a NB context and is employed here in both NB and S-WB contexts.
• With NB databases: The measure simulates a monaural listening situation. Raw-PCM signals sampled at f_S = 8 kHz and electrically-recorded signals are expected. Speech signals are level aligned to 79 dB_SPL.
• With WB and S-WB databases: Similar to the NB context except that raw-PCM signals sampled at f_S = 16 kHz are expected. However, in this context a diotic listening situation is expected. The speech signals are thus level aligned to 73 dB_SPL.

PESQ: The Perceptual Evaluation of Speech Quality (PESQ) and the WB-PESQ are described in Sec. 3.1.1.1.
• With NB databases: The PESQ uses raw-PCM signals sampled at f_S = 8 kHz.
• With WB and S-WB databases: The WB-PESQ uses a specific WB algorithm and raw-PCM signals sampled at f_S = 16 kHz.

Enh. PESQ: The "Enhanced PESQ" (Enh. WB-PESQ resp.) has been developed by Shiran and Shallom (2009). This model uses the same perceptual model as PESQ. However, the time-alignment algorithm has been improved to cope with time-warping effects. The same input parameters as for PESQ and WB-PESQ are used.

Mod. WB-PESQ: The "Modified WB-PESQ" has been developed as part of the present study. This model is used in a WB context only. The differences with the usual WB-PESQ algorithm are introduced in Sec. 3.1.2.1. The same input parameters as for WB-PESQ are used.

Scholz: This dimension-based model, developed by Scholz (2008), is described in Sec. 2.3.2.5. This model provides quality estimates in a NB context only. However, it uses raw-PCM signals sampled at f_S = 32 kHz only. No additional input parameters are required. Compared to other reference measures, Scholz's model is not adapted to all conditions included in the testing databases. For instance, this model is not able to estimate the perceived quality of packet-loss conditions.

Huo: This second dimension-based model, developed by Huo et al. (2008a,b, 2007), is described in Sec. 2.3.2.5. Huo's model has a NB and a WB operational mode.
• With NB databases: Huo's model uses raw-PCM signals sampled at f_S = 8 kHz.
• With WB and S-WB databases: In this context a WB algorithm is applied. Huo's model uses raw-PCM signals sampled at f_S = 16 kHz.
5.1.2 Databases

Once instrumental models have been developed and trained over a large set of conditions, they must be validated to see how accurate they are on "unknown" data. Following the validation criteria introduced in Sec. 2.3, the choice and the amount of unknown speech stimuli quantify the "completeness" of the DIAL model. However, quantifying this "completeness" in a meaningful way is not easy. Instrumental models may be sensitive to (i) the test material (i.e. talkers and sentences) and (ii) the conditions themselves.
Table 5.2 Processing conditions included in the NB databases. They have been used for the experimental evaluation of the DIAL model in its Narrow-Band operational mode. PL refers to packet-loss conditions. BGN stands for background noise. MT means musical tones. IR refers to interruptions. BP stands for bandpass filter. Acous. refers to acoustic recordings. TD stands for tandeming of speech codecs.

No.  Name               Conditions
1    Nagle Diotic NB^a  Clean, G.711, G.729, AMR, PL 3 and 6%
2    IP-Tiphon^a        G.711, G.723.1, G.726, GSM-FR, TD, IR and PL at 5, 10, 15, 20%
3    FT-IP^a            MNRU, G.711, G.729, VoIP, BGN
4    Dimension-Cont     Clean, G.711, G.728, G.729A, MT, IR and PL 3, 5, 10, 20%
5    Dimension-DFC      Clean, G.711, BP, Acous.
6    Dimension-Noi      Clean, MNRU, G.726, BGN
7    Dimension-NB       Clean, G.729A, IR and PL, BGN, BP, Acous.
8    Nagle Mono NB      idem "Nagle Diotic NB"
9    G.729EV NB         Clean, MNRU, G.729A, G.729.1^b, PL at 3%
10   IKA/LIMSI NB       Clean, G.726, G.729A, BP
11   FT-04 NB           Clean, G.726, G.729, TD, PL at 1, 5%
12   FT-06 NB           Clean, G.711, G.729A, G.723.1, AMR, TD, PL at 3, 5, 10%
13   P.862 Prop         MNRU, G.726, G.729, G.728, Mobile, ISDN, BGN, IR
14   P.862 BGN          MNRU, G.723.1, EVRC, Mobile, BGN
15   P.862 Shared       Clean, MNRU, G.726, G.729, GSM-EFR, TD, PL at 5, 10, 15, 20%
16   P.AAM 1            Clean, MNRU, G.711, G.726, G.729, GSM-FR, TD, Acous.
17   P.AAM 2            Clean, MNRU, BP, Acous.
18   P.AAM UAQ          Clean, MNRU, G.711, G.723.1, VoIP, Acous., PL at 2, 5, 10, 20, 30%
19   P.SEAM             Clean, MNRU, G.711, G.729, G.723.1, Acous., BGN, PL at 3, 5, 10%
20   Sup23 XP1-A        MNRU, G.711, G.726, G.729, GSM-FR, TD
21   Sup23 XP1-D        idem "Sup23 XP1-A"
22   Sup23 XP1-O        idem "Sup23 XP1-A"
23   Sup23 XP3-A        Clean, MNRU, G.726, G.729, BGN, PL at 3, 5%
24   Sup23 XP3-C        idem "Sup23 XP3-A"
25   Sup23 XP3-D        idem "Sup23 XP3-A"
26   Sup23 XP3-O        idem "Sup23 XP3-A"

^a Unknown databases, i.e. not used during the k-NN training phase.
^b France Télécom candidate.
In order to characterize the scope of DIAL, the types of degradation included in the testing databases, the languages, and the number of stimuli per condition should be large enough to represent the range of potential applications (defined in Sec. 4.1). Here, a database corresponds to the speech signals and their associated subjects' quality judgments. Overall, the DIAL model has been evaluated on 55 databases (20 819 speech stimuli, 2 797 conditions) for its NB mode and on 39 databases (20 023 speech stimuli, 1 874 conditions) for its S-WB mode. A significant part of these databases has been used for the training phase of the k-NN statistical model. Over the 94 databases,
some have been carried out within either France Télécom or Deutsche Telekom Laboratories internal projects. Therefore, these specific databases cannot be disclosed here. In this chapter, a subset of the 94 databases is briefly presented; they are described in more detail in App. B. However, this set of databases represents all types of conditions included in the 94 databases. Tables 5.2 and 5.3 summarize the included processing conditions.

Table 5.3 Processing conditions included in the WB and S-WB databases. They have been used for the experimental evaluation of the DIAL model in its Super-WideBand operational mode. MB refers to middle-band 100–5 000 Hz. LL stands for non-optimum listening level. NR means Noise Reduction

No.  Name             Conditions
1    AMR-WB^a         Clean, MNRU, Clean-NB, MNRU-NB, AMR-WB, UMTS
2    Nagle Mono WB^a  Clean, G.722, G.729.1, AMR-WB, PL 3, 6%
3    Dim-Scaling 1^a  Clean, G.722, G.722.2, G.711, G.726, BP, TD, IR, BGN, PL at 2, 8%
4    Dim-Scaling 2^a  MNRU, G.722, G.722.2, G.711, G.729A, TD, BGN, PL at 1, 2, 4, 8%
5    Tsukuba^a        Clean, G.722, G.722.1, G.722.2, G.711, G.726, G.729, GSM-EFR
6    Nagle Diotic WB  idem "Nagle Mono WB"
7    Dimension-WB     Clean, G.722.2, G.722.1, G.711, BGN, BP, Acous., PL at 20%
8    FT-UMTS          P.OLQA Anchors, AMR, VoIP
9    G.729EV MB       Clean, MNRU (WB, MB and NB), G.722.2, G.729A, G.729.1^b
10   IKA/LIMSI MB     Clean, G.722.2, G.711, G.726, BP
11   Loudness         G.722.2, G.726, TD, LL, PL at 3%
12   FT-04 MB         Clean, G.722, G.722.1, G.722.2, G.711, G.726, G.729, PL at 1, 5%
13   FT-06 MB         Clean, G.722, G.722.2, G.729.1, G.711, AMR, TD, PL 3, 5, 10%
14   Skype            P.OLQA Anchors, VoIP
15   P.OLQA 1         P.OLQA Anchors, BP, Acous., BGN, LL, PL at 2%
16   P.OLQA 2         P.OLQA Anchors, AMR-WB+, G.722.1C, G.722.2, NR, PL at 2, 5%

^a Unknown databases, i.e. not used during the k-NN training phase.
^b France Télécom candidate.
5.1.3 Statistical measures

Several figures of merit (i.e. statistical measures) have been developed to evaluate the performance of instrumental models. Since auditory experiments are the most valid and reliable method for measuring perceived speech quality, the estimated scores from instrumental models are compared to the auditory scores from the test databases described in the previous section. Almost all test databases were carried out using the ACR 5-point listening quality scale defined in ITU–T Rec. P.800 (1996). ACR scales have several drawbacks, which are described in Sec. 2.2.5.1. For instance, such scales do not strictly possess the interval property. This property is required to obtain information about distances between the conditions under study.
However, to apply the statistical measures introduced in the current section, the ACR 5-point listening quality scale is assumed to have the properties of an interval scale.
5.1.3.1 Mapping function

As described in Sec. 2.2.5, auditory test results may be biased by uncontrolled factors and particularly by a "test-corpus" effect. This results in a slightly non-linear relationship between model estimations and auditory scores. In order to compensate for this slight offset, a monotonic¹ third order polynomial mapping function is applied to the raw estimated values, resulting in MOS′_LQO values. The coefficients of the mapping function are estimated in a least-squares sense by curve-fitting to the auditory MOS_LQS values.
5.1.3.2 Common statistical measures

Two statistical measures are commonly used to evaluate the performance of instrumental models. These are "per-condition" measures: both are calculated after averaging all individual scores into a single MOS value for each processing condition. The accuracy of the model is measured by the Pearson correlation coefficient ρ, which quantifies the linear relationship between the auditory and the estimated scores. This measure is calculated by:

\rho = \frac{\sum_{j=1}^{J} \left( MOS_{LQS}(j) − \overline{MOS}_{LQS} \right) \cdot \left( MOS_{LQO}(j) − \overline{MOS}_{LQO} \right)}{\sqrt{\sum_{j=1}^{J} \left( MOS_{LQS}(j) − \overline{MOS}_{LQS} \right)^2 \cdot \sum_{j=1}^{J} \left( MOS_{LQO}(j) − \overline{MOS}_{LQO} \right)^2}} ,   (5.1)

where MOS_LQS(j) (resp. MOS_LQO(j)) is an auditory (resp. estimated) MOS value for condition j = 1, ..., J, and \overline{MOS}_LQS (resp. \overline{MOS}_LQO) is the mean auditory (resp. estimated) MOS value over all conditions.
However, the correlation coefficient does not provide a meaningful measure of the model's "consistency" with auditory results. For instance, an identical large prediction error on all speech stimuli will still result in a relatively high correlation coefficient. The correlation coefficient has a second disadvantage: it depends on the distribution of the MOS_LQS values across the MOS scale. Therefore, the consistency of instrumental measures is evaluated using the Root Mean Square Error (RMSE) σ, i.e. the standard deviation of the prediction error.

¹ The estimated third order polynomial function is monotonic within the range defined by the MOS′_LQO values.
The prediction error is the absolute difference between the auditory and the estimated scores:

d_{MOS}(j) = MOS_{LQS}(j) − MOS_{LQO}(j) .   (5.2)

An unbiased estimation of the standard deviation of the prediction error is computed by:

\sigma = \sqrt{\frac{1}{J − d} \sum_{j=1}^{J} d_{MOS}(j)^2} ,   (5.3)

where the coefficient d reflects a "degree of freedom" correction factor. This coefficient is set to d = 4 in case a 3rd order mapping function is applied to the estimated values, and to d = 2 in case of a 1st order mapping function.
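The per-condition evaluation (mapping of Sec. 5.1.3.1 followed by Eqs. 5.1–5.3) can be sketched as follows; a plain least-squares polyfit is used, so the monotonicity of the fitted polynomial is assumed rather than enforced, and the function name is illustrative.

import numpy as np

def condition_statistics(mos_lqs, mos_lqo, order=3):
    # Map the raw estimations to the auditory scale, then compute the
    # Pearson correlation (Eq. 5.1) and the RMSE (Eqs. 5.2-5.3).
    coeffs = np.polyfit(mos_lqo, mos_lqs, order)   # least-squares mapping
    mapped = np.polyval(coeffs, mos_lqo)           # MOS'_LQO values
    rho = np.corrcoef(mos_lqs, mapped)[0, 1]
    d = order + 1                                  # d = 4 for a 3rd order fit
    rmse = np.sqrt(np.sum((mos_lqs - mapped) ** 2) / (len(mos_lqs) - d))
    return rho, rmse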
5.1.3.3 POLQA statistical evaluation

For the POLQA standardization program, a specific statistical evaluation of the candidate models has been used. This procedure was first developed for the evaluation of instrumental video quality measures and then updated by the ITU-T organization. Compared to the two common statistical measures introduced in the previous section, the POLQA procedure takes into account the uncertainty of the auditory scores. In other terms, a model will not be able to precisely estimate the speech quality of a stimulus if a large variation is observed within subjects' judgments. The different steps comprising the statistical evaluation are detailed below. These measures are calculated "per-stimulus", which provides a more precise comparison between the intrusive models.

Degree of uncertainty

The degree of uncertainty of the auditory scores is usually calculated by the standard deviation and the confidence interval. First, the standard deviation σ_l of stimulus l is calculated by:

\sigma_l = \sqrt{\frac{1}{K − 1} \sum_{k=1}^{K} \left( \nu_{k,l} − MOS_{LQS}(l) \right)^2} ,   (5.4)

where ν_{k,l} is the individual rating of subject k (k = 1, ..., K) and MOS_LQS(l) the mean score for stimulus l (l = 1, ..., L). Then, the confidence interval is calculated from the standard deviation σ_l as:

ci_{(1−\alpha)}(l) = t(\alpha, K) \frac{\sigma_l}{\sqrt{K}} ,   (5.5)

where the function t is the Student's cumulative distribution function. The coefficient α is usually set to 0.05, which gives the 95th percentile of the auditory MOS_LQS(l) values.
Fig. 5.1 Examples of the common dMOS and modified d∗MOS prediction errors. [Illustration on the MOS scale: dMOS = |MOSLQS − MOSLQO|, whereas d∗MOS = |MOSLQS − MOSLQO| − ci95, i.e. d∗MOS = 0 when the estimation falls within the confidence interval ci95 of MOSLQS.]
ε-insensitive prediction error

In order to take into account the degree of uncertainty of the subjects' judgments, the definition of the common prediction error σ has been modified to be insensitive to prediction errors lower than ε. Figure 5.1 shows the difference between the common and the modified prediction error. This limit is set to ε = ci95(l), i.e. α = 0.05. The modified prediction error is defined as:

d^{*}_{MOS}(l) = \begin{cases} 0, & \text{if } |d_{MOS}(l)| \le ci_{95}(l) \\ |d_{MOS}(l)| - ci_{95}(l), & \text{otherwise} \end{cases}    (5.6)

Then, the standard deviation of the modified prediction error is calculated according to:

σ^{*} = \sqrt{\frac{1}{L - d} \sum_{l=1}^{L} d^{*}_{MOS}(l)^{2}}    (5.7)

where L is the total number of stimuli.
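In code, the ε-insensitive error may be sketched as follows (per-stimulus vectors as inputs; ci95_vals holds the ci95(l) values of Eq. (5.5)):

    import numpy as np

    def sigma_star(mos_lqs, mos_lqo, ci95_vals, d=4):
        # Eq. (5.6): errors within the confidence interval become zero.
        d_star = np.maximum(np.abs(mos_lqs - mos_lqo) - ci95_vals, 0.0)
        # Eq. (5.7): standard deviation of the modified prediction error.
        return np.sqrt(np.sum(d_star**2) / (len(d_star) - d))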
Distance measure

Using σ∗, a distance measure between the reference models is calculated by:

d_{μ} = \begin{cases} 0, & \text{if } σ^{*2}_{μ} \le σ^{*2}_{min} \times F(0.05, L, L) \\ σ^{*2}_{μ} - σ^{*2}_{min} \times F(0.05, L, L), & \text{otherwise} \end{cases}    (5.8)

where μ is the model index, σ∗min the lowest σ∗ over all instrumental models, and F the Fisher–Snedecor cumulative distribution function. Using the F function, dμ represents a statistically significant distance between the instrumental models. Then, these distance measures are averaged over all available databases according to:
\bar{d}_{μ} = \frac{1}{M} \sum_{m=1}^{M} d_{μ,m}    (5.9)

where M corresponds to the total number of databases. Then, a second significance test is applied on each averaged value \bar{d}_{μ}:

t_{μ} = \begin{cases} 0, & \text{if } \frac{\bar{d}_{μ}}{\bar{d}_{min} + 4 \times 10^{-4}} \le F(0.05, M, M) \\ \frac{\bar{d}_{μ}}{\bar{d}_{min} + 4 \times 10^{-4}} - F(0.05, M, M), & \text{otherwise} \end{cases}    (5.10)
where \bar{d}_{min} is the lowest \bar{d}_{μ} value over all instrumental models. The resulting tμ values correspond to the main statistical criterion used to compare the instrumental models. A positive measure tμ > 0 means that model μ is significantly worse than the best model, i.e. the model which obtains the \bar{d}_{min} value.
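The two-stage significance test can be sketched as follows, assuming that F(0.05, L, L) denotes the upper 5% critical value of the Fisher–Snedecor distribution with (L, L) degrees of freedom:

    import numpy as np
    from scipy.stats import f as fisher_f

    def distance(sigma_stars, L):
        # Eq. (5.8): per-database distance of each model to the best one.
        f_crit = fisher_f.ppf(1 - 0.05, L, L)
        s2 = np.asarray(sigma_stars) ** 2
        thresh = s2.min() * f_crit
        return np.where(s2 <= thresh, 0.0, s2 - thresh)

    def t_measure(d_bar, M):
        # Eq. (5.10): test on the averaged distances of Eq. (5.9);
        # t_mu > 0 flags a model significantly worse than the best one.
        f_crit = fisher_f.ppf(1 - 0.05, M, M)
        ratio = np.asarray(d_bar) / (np.min(d_bar) + 4e-4)
        return np.where(ratio <= f_crit, 0.0, ratio - f_crit)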
Absolute prediction error

In order to obtain the aggregated distance tμ, several significance tests are applied: one for each database and one on the overall distance \bar{d}_{μ}. In the extreme case, two models may be considered equivalent for all databases even though one is always better (but not significantly better) than the other. A second statistical measure, called the absolute prediction error, has been defined to avoid this situation. This measure is considered an unbiased value of the models' consistency. For this purpose, the “per-stimuli” total modified prediction error is calculated for each database m:

σ^{**} = \sqrt{\frac{1}{L - d} \sum_{l=1}^{L} \left( \frac{d_{MOS}(l)}{\max\{ci_{95}(l), 0.1\}} \right)^{2}}    (5.11)

These total errors are averaged over all databases according to:

Abs.\ σ^{*}_{μ} = \frac{1}{M} \sum_{m=1}^{M} σ^{**}_{μ,m}    (5.12)

Then, a significance test is applied on each absolute prediction error value:

r_{μ} = \begin{cases} 0, & \text{if } \frac{Abs.\ σ^{*}_{μ}}{Abs.\ σ^{*}_{min} + 0.1} \le F(0.05, T, T) \\ \frac{Abs.\ σ^{*}_{μ}}{Abs.\ σ^{*}_{min} + 0.1} - F(0.05, T, T), & \text{otherwise} \end{cases}    (5.13)

where Abs. σ∗min is the lowest Abs. σ∗μ value over all instrumental models and T corresponds to the total number of stimuli in all M databases. The resulting rμ values constitute the second statistical criterion used to compare the instrumental models.
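A sketch of Eqs. (5.11)-(5.13), under the same assumption on the critical value F(0.05, T, T):

    import numpy as np
    from scipy.stats import f as fisher_f

    def sigma_star_star(mos_lqs, mos_lqo, ci95_vals, d=4):
        # Eq. (5.11): errors normalized by the confidence interval,
        # floored at 0.1 MOS, before aggregation over one database.
        norm = (mos_lqs - mos_lqo) / np.maximum(ci95_vals, 0.1)
        return np.sqrt(np.sum(norm**2) / (len(norm) - d))

    def r_measure(abs_sigma, T):
        # Eq. (5.13): abs_sigma holds the per-model values of Eq. (5.12),
        # i.e. sigma** averaged over the M databases.
        f_crit = fisher_f.ppf(1 - 0.05, T, T)
        ratio = np.asarray(abs_sigma) / (np.min(abs_sigma) + 0.1)
        return np.where(ratio <= f_crit, 0.0, ratio - f_crit)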
5.2 Performance results

The statistical evaluation presented in the previous section has been applied to the integral speech quality values. The same procedure has been applied to each testing database and each instrumental model. Here, results for both operational modes are presented. The last section describes the experimental results of the dimension estimators. The statistical measures are denoted in the following way (e.g. here for the correlation): ρModel,DB, where DB corresponds to the database number.
5.2.1 Integral quality

5.2.1.1 Super-WideBand mode

In this section, estimates provided by 7 instrumental models with WB, S-WB and FB operational modes are compared to auditory results coming from experiments carried out in a WB or a S-WB context. The averaged results over 39 databases are presented in Table 5.4. A subset of 24 of the 39 databases (61% of the speech stimuli) has been used during the k-NN training phase. This high percentage of known stimuli may have an impact on the overall results. Then, Table 5.5 presents for each auditory test the results of the DIAL S-WB operational mode. It includes the “per-condition” Pearson correlation coefficients ρ and the “per-stimuli” modified prediction errors σ∗. Table 5.5 also gives the distance measure dDIAL (after the significance test) and the ranking order of the DIAL model among 7 reference measures. Results are detailed for each type of quality degradation.
Overall results
Table 5.4 Global experimental results over 39 WB and S-WB auditory tests of 7 intrusive speech quality models including DIAL. These measures are computed after the third order mapping function

Ranking order   Model          Correlation ρ   Eq. (5.13) r   Eq. (5.10) t
1               DIAL           0.913           0              0
2               Mod. WB-PESQ   0.879           0.14           0
3               Enh. WB-PESQ   0.858           0.29           1.33
4               WB-PESQ        0.856           0.32           1.46
5               P.AAM          0.841           0.38           1.65
6               TOSQA-2001     0.826           0.61           3.34
7               PEMO-Q         0.749           1.32           8.63
Table 5.4 presents the “per-stimuli” absolute prediction error r and the distance measure t (after the significance test). Using these two measures, DIAL and the reference measures are presented in ranking order. The table gives the “per-condition” overall correlation ρ as an additional measure. Over the 39 databases, the DIAL model obtains the best correlation, ρDIAL = 0.913. When the specific POLQA statistical evaluation is applied, DIAL is also the best model of the 7 intrusive measures. The ITU-T standard WB-PESQ obtains the fourth place. Both modified versions, the Mod. WB-PESQ and the Enh. WB-PESQ, are considered significantly better than the initial version. The Mod. WB-PESQ obtains the second place and is considered equivalent to DIAL in the distance measure t. However, DIAL obtains a lower averaged distance value than the Mod. WB-PESQ (dDIAL = 0.0183, dMod. WB-PESQ = 0.0315). For models used in both modes, the global correlation is slightly lower in the WB mode than in the NB mode, see Table 5.6. The three reference measures WB-PESQ, TOSQA-2001 and Enh. WB-PESQ have been developed on NB databases and then extended to the WB context. The P.AAM model has a single NB mode and is consequently not adapted to such wider bandwidths; it thus has difficulties estimating the speech transmission quality in a WB or a S-WB context. These results show the problem of extensibility of instrumental measures, i.e. applying these measures to unknown data. The PEMO-Q model has been developed for the quality assessment of full-band audio systems. However, this model obtains a relatively low correlation on WB and S-WB databases.
Unknown databases

The DIAL S-WB operational mode has been evaluated on five unknown databases (1–5). Over these five databases, DIAL is the best intrusive model for Database 2 and belongs to the best models for Databases 1 and 5. These three databases include mainly non-linear degradations introduced by WB low bit-rate speech codecs and PLC algorithms. However, DIAL obtains a relatively low correlation on Database 1 (ρDIAL,1 = 0.90) and Database 4 (ρDIAL,4 = 0.83). Database 1 includes mainly “live” recordings of UMTS networks. For this database, DIAL is equivalent to P.AAM and Mod. WB-PESQ. Databases 3 and 4 are “degradation decomposition” databases (see App. B.2). They include degradations on the three dimensions Coloration, Noisiness and Discontinuity. For these particular databases, the judgment model should provide better estimates than the reference measures. However, DIAL is the seventh and fifth model for Databases 3 and 4, respectively. A detailed analysis of the Core model and dimension estimations shows that:
• For Database 3, the Discontinuity estimator gives worse MOSdis values for codec tandeming than for packet-loss conditions.
• For Database 4, the Core model underestimates noisy conditions.
Table 5.5 Experimental results of the DIAL model in its S-WB operational mode. Ranking order: the first value corresponds to the resulting ranking order of DIAL; the second value (in brackets) refers to the number of models considered as statistically equivalent

No.   Name              Correlation ρ   Mod. Pred. error σ∗   Distance dDIAL   Ranking order
1     AMR-WB a          0.90            0.31                  0                1 (3)
2     Nagle Mono WB a   0.98            0.16                  0                1 (1)
3     Dim-Scaling 1 a   0.91            0.13                  1.44 × 10⁻²      7 (1)
4     Dim-Scaling 2 a   0.83            0.09                  4.57 × 10⁻³      5 (1)
5     Tsukuba a         0.94            0.21                  0                1 (3)
6     Nagle Diotic WB   0.98            0.24                  0                1 (2)
7     Dimension-WB      0.90            0.23                  4.18 × 10⁻²      4 (2)
8     FT-UMTS           0.91            0.27                  0                1 (4)
9     G.729EV MB        0.98            0.11                  0                1 (1)
10    IKA/LIMSI MB      0.96            0.18                  0                1 (1)
11    Loudness          0.98            0.30                  0                1 (1)
12    FT-04 MB          0.97            0.21                  0                1 (1)
13    FT-06 MB          0.93            0.23                  1.08 × 10⁻²      3 (2)
14    Skype             0.94            0.32                  0                1 (1)
15    P.OLQA 1          0.80            0.49                  0                1 (5)
16    P.OLQA 2          0.82            0.48                  0                1 (1)

a Unknown databases, i.e. not used during the k-NN training phase.
Speech codecs and transmission errors

Databases 6, 9, 12 and 13 include mainly conditions degraded by low bit-rate codecs and transmission errors. DIAL obtains a high correlation for these four databases (ρ > 0.93). It is the best model for Databases 9 and 12, belongs to the best models for Database 6, and is the third model behind P.AAM and Mod. WB-PESQ for Database 13. These results show the high accuracy of DIAL's quality estimations for non-linear degradations.
Acoustic recordings and bandpass filtering

Databases 7 and 10 include several bandpass filtering conditions and acoustic recordings of user terminals such as Hands-Free Terminals (HFTs). DIAL is the best model for Database 10, which includes mainly bandpass filtering. However, it obtains a relatively low correlation for Database 7 (ρDIAL,7 = 0.90). The Core model underestimates the perceived quality of HFT conditions, which introduce time-varying frequency deviations. Such conditions introduce a bias in the partial compensation of the system transfer function applied by the Core model.
Time-warping effects

Databases 8 and 14 include several conditions impaired by time-warping effects. DIAL is the best model for Database 14 and belongs to the best models, together with WB-PESQ, Enh. WB-PESQ and Mod. WB-PESQ, for Database 8. For this latter database, over the 69 stimuli corresponding to “live” recordings over the VoIP network, a similar correlation is obtained for the Core model and DIAL (ρCore = 0.755, ρDIAL = 0.760). A detailed analysis shows that the Discontinuity estimator reacts more to time-warping effects than to usual transmission errors. Over the same 69 stimuli, WB-PESQ obtains a correlation of ρWB-PESQ = 0.803 and the Enh. WB-PESQ a correlation of ρEnh. WB-PESQ = 0.822. Only the time-alignment algorithm differs between these two reference measures. These results show that the correct realignment of the two input speech files has a strong effect on the reliability of the intrusive measures.
Full-scale databases

The last three databases are full-scale databases, designed especially for the POLQA project. They include 12 specific anchor conditions and the three bandwidths NB, WB and S-WB. Over the three databases, DIAL is the best model for Databases 14 and 16 and belongs to the best models for Database 15. However, the correlation is relatively low for the two official POLQA databases (ρDIAL,15 = 0.80 and ρDIAL,16 = 0.82). Both databases include specific distortions such as time-rescaling and pitch deviation, which are strongly underestimated by both the Core model and the dimension estimators. Figure 5.2 shows the “per-condition” DIAL estimations for Database 15. Conditions 12 and 13 are strongly underestimated. These conditions were resampled at a new sampling frequency of fS,y = 1.1 × fS,x and fS,y = 0.9 × fS,x respectively, where fS,x is the sampling frequency of the reference speech signal x(k). Conditions 4 and 7 are slightly underestimated; they correspond to acoustic recordings of HFTs. Conditions 10 and 11, corresponding to digital overload played back at a normal listening level, are underestimated by the Core model. These conditions introduce strong non-linear distortions. Conditions 19 and 35 are impaired by real background noise (street and car noise respectively); however, neither the Noisiness estimator nor the Core model underestimates these conditions. The large difference in the estimated quality between e.g. Conditions 33 and 35 may originate from the judgment model.
Fig. 5.2 DIAL estimations for Database 15 “POLQA 1”. The gray curve represents the estimated third order mapping function. [Scatter plot of per-condition MOSLQOM estimations over auditory MOSLQSM scores, with points labeled by condition number.]

5.2.1.2 Narrow-Band mode

Even though DIAL has been mainly developed for the assessment in a S-WB context, it has also been evaluated in the NB context. In this section, estimates provided by 7 instrumental models with a NB operational mode are compared to auditory results coming from experiments carried out in a NB context. Firstly, the results over 55 databases are presented in Table 5.6. A subset of 44 of the 55 databases (88% of the speech stimuli) has been used during the k-NN training phase. Again, this high percentage of known stimuli may influence the overall results. Then, Table 5.7 presents the results of DIAL for each auditory test, including the databases not used during the k-NN training phase. It includes the “per-condition” Pearson correlation coefficient ρ and the “per-stimuli” modified prediction error σ∗. Table 5.7 also gives the distance measure dDIAL and the ranking order of the DIAL model among 7 instrumental measures. Results are detailed for each type of quality degradation.
Overall results
Table 5.6 Overall experimental results over 55 NB auditory tests of 7 NB intrusive speech quality models including DIAL. These measures are computed after the third order mapping function

Ranking order   Model       Correlation ρ   Eq. (5.13) r   Eq. (5.10) t
1               Enh. PESQ   0.931           0              0
1               PESQ        0.929           0              0
3               P.AAM       0.924           0              0.28
4               DIAL        0.935           0.17           0.68
5               TOSQA       0.877           0.53           3.46
6               qC          0.821           1.09           6.93
7               EMBSD       0.706           2.30           15.43
Table 5.6 gives the results of the two significance tests, see Eq. (5.13) (r) and Eq. (5.10) (t). Using these two measures, the resulting ranking order of DIAL and the reference measures is presented. The table gives the “per-condition” overall correlation ρ as an additional measure. Over the 55 databases, the DIAL model obtains the best correlation, ρ = 0.935. However, when the specific POLQA statistical evaluation is applied, PESQ and Enhanced PESQ appear to be the best models among the 7 intrusive measures. These two models are considered statistically equivalent (i.e. r = 0 and t = 0). DIAL is the fourth model for each measure. Strangely, the P.AAM is outperformed by PESQ and by Enh. PESQ only in the distance measure t. The absolute prediction error is similar for all three models (1.4667 < Abs. σ∗μ < 1.4682). However, the P.AAM is highly outperformed by PESQ and by Enhanced PESQ for Database 5, which leads to a higher \bar{d}_{μ} value for P.AAM (\bar{d}_{P.AAM} = 0.0215, \bar{d}_{PESQ} = 0.0113 and \bar{d}_{Enh. PESQ} = 0.0126).
Unknown databases

Databases 1–3 were unknown to the k-NN statistical model. Over these three databases, DIAL outperforms the reference models for Database 3 only. Even though DIAL is the fourth model for Database 1, it obtains a high correlation, ρDIAL,1 = 0.96: DIAL shows accurate quality estimations of speech codecs impaired by packet losses. However, for Database 2, DIAL is outperformed by four reference measures, including TOSQA, and obtains a relatively low correlation, ρDIAL,2 = 0.79. This database includes mainly live recordings of VoIP transmissions. A detailed analysis of Database 2 shows that the Core model overestimates the quality of four conditions impaired by short interruptions (i.e. 10 ms).
Speech codecs

The highest correlation is obtained for Database 9, ρDIAL,9 = 0.99. Such a high correlation is obtained because the MOSLQSN values are not equally distributed over the whole MOS scale. The DIAL model also obtains high correlation coefficients for Databases 20 to 22. Databases 9, 20, 21 and 22 include mainly conditions degraded by low bit-rate speech codecs; DIAL belongs to the best models for Databases 20 and 21. As in the S-WB operational mode, this new model shows accurate estimations of non-linear degradations. The Core model has been developed to assess such non-linear degradations. However, the correlation between the auditory scores and the Core model estimations is relatively low, e.g. ρCore,20 = 0.78. A detailed analysis shows that the Core model overestimates the quality of MNRU conditions.
Table 5.7 Experimental results of the DIAL model in its NB operational mode. Ranking order: the first value corresponds to the resulting ranking order of DIAL; the second value (in brackets) refers to the number of models considered as statistically equivalent

No.   Name               Correlation ρ   Mod. Pred. error σ∗   Distance dDIAL   Ranking order
1     Nagle Diotic NB a  0.96            0.29                  2.60 × 10⁻²      4 (3)
2     IP-Tiphon a        0.79            0.31                  7.60 × 10⁻²      5 (3)
3     FT-IP a            0.94            0.28                  0                1 (1)
4     Dimension-Cont     0.75            0.32                  5.31 × 10⁻²      4 (2)
5     Dimension-DFC      0.94            0.16                  0                1 (1)
6     Dimension-Noi      0.95            0.08                  0                1 (1)
7     Dimension-NB       0.89            0.19                  0                1 (1)
8     Nagle Mono NB      0.92            0.21                  1.17 × 10⁻³      4 (3)
9     G.729EV NB         0.99            0.08                  0                1 (4)
10    IKA/LIMSI NB       0.81            0.38                  2.14 × 10⁻²      3 (1)
11    FT-04 NB           0.95            0.35                  8.15 × 10⁻²      6 (4)
12    FT-06 NB           0.96            0.18                  0                1 (4)
13    P.862 Prop         0.82            0.33                  7.73 × 10⁻²      5 (1)
14    P.862 BGN          0.97            0.20                  8.27 × 10⁻³      4 (2)
15    P.862 Shared       0.94            0.20                  1.30 × 10⁻²      4 (3)
16    P.AAM 1            0.92            0.23                  3.67 × 10⁻²      4 (3)
17    P.AAM 2            0.94            0.20                  3.49 × 10⁻²      4 (3)
18    P.AAM UAQ          0.92            0.24                  2.38 × 10⁻²      4 (2)
19    P.SEAM             0.94            0.17                  1.46 × 10⁻²      5 (3)
20    Sup23 XP1-A        0.95            0.22                  0                1 (2)
21    Sup23 XP1-D        0.96            0.17                  0                1 (4)
22    Sup23 XP1-O        0.97            0.21                  1.54 × 10⁻²      4 (3)
23    Sup23 XP3-A        0.91            0.27                  0                1 (4)
24    Sup23 XP3-C        0.93            0.27                  1.64 × 10⁻²      4 (3)
25    Sup23 XP3-D        0.94            0.15                  2.24 × 10⁻³      4 (3)
26    Sup23 XP3-O        0.96            0.13                  0                1 (4)

a Unknown databases, i.e. not used during the k-NN training phase.
Transmission errors

Databases 1, 2, 4, 8, 12, 13 and 15 mainly include conditions impaired by transmission errors. These errors are attenuated by Packet Loss Concealment (PLC) algorithms in Databases 1, 8, 12 and 15; such algorithms avoid an interruption of the speech waveform. Over these 7 databases, DIAL is the best measure for Database 12 only. However, relatively high correlation coefficients between the auditory scores and DIAL estimations are obtained for the databases including PLC algorithms, i.e. ρ > 0.92. Databases 2, 4 and 13 include interruptions, which are overestimated by the Core model. The DIAL model thus accurately estimates the non-linear degradations introduced by PLC algorithms but fails to estimate strong discontinuities.
Acoustic recordings and bandpass filtering

Conditions such as acoustic recordings of user terminals and bandpass filtering are included in the test corpus of Databases 5, 10 and 16 to 19. Except for Database 10, which includes mainly bandpass filtering conditions, DIAL obtains a high correlation for these databases, i.e. ρ > 0.92. For Database 10, DIAL is significantly better than Enh. PESQ (dEnh. PESQ,10 = 0.135), PESQ (dPESQ,10 = 0.135) and P.AAM (dP.AAM,10 = 0.193). For Database 5, DIAL is the best model. This database includes conditions impaired exclusively on the perceptual dimension Coloration. These results show that DIAL provides reliable quality estimations for acoustically recorded speech signals and outperforms the current ITU-T standard on bandpass filtering.
Background noises

Databases 3, 6, 14 and 23 to 26 include conditions impaired by environmental background noises. Over these 7 databases, DIAL is the best model for two databases (3 and 6) and belongs to the best models for two others (23 and 26). Database 6 includes solely noisy conditions. DIAL is thus able to estimate the quality of noisy stimuli, even when they are mixed with non-noisy stimuli within the same test corpus.
Combined degradations

Database 7 has been carried out to define the whole perceptual space used by test subjects to assess the quality of speech stimuli. Its test corpus includes conditions impaired on the three dimensions Coloration, Noisiness and Discontinuity. For this specific database, the DIAL model is the best intrusive quality model. In addition, the DIAL estimations obtain a better correlation with the auditory scores (ρDIAL,7 = 0.89) than the Core model (ρCore,7 = 0.79): the judgment model using the k-NN algorithm significantly improves this new intrusive model. Figure 5.3 shows the “per-condition” relationship between DIAL estimations and auditory scores for Database 7. Following the results found for the S-WB case, HFT conditions 1, 25, 26 and 30 are slightly underestimated. In addition, conditions impaired by strong discontinuities, such as conditions 9, 16, 19 and 22, are slightly overestimated. This figure supports the results observed in the different NB databases.
Fig. 5.3 DIAL estimations for Database 7 “Dimension-NB”. The gray curve represents the estimated third order mapping function. [Scatter plot of per-condition MOSLQON estimations over auditory MOSLQSN scores, with points labeled by condition number.]

5.2.2 Perceptual dimensions

This section is composed of three parts. The first paragraph presents the diagnostic estimations on an example database; for this specific database, the auditory scores for each perceptual dimension are available. The second paragraph compares the performance of DIAL to two other diagnostic models. In the third paragraph, diagnostic estimations are applied to a NB database for which only the auditory MOSLQSN values are available; the estimated dimension scores are used to characterize each condition.
Example database

Figure 5.4 shows a degradation decomposition provided by the DIAL dimension estimators for the WB Database 3 “Dim-Scaling 1”. DIAL is used in the S-WB operational mode. The estimations are shown after the third order mapping function. Except for Loudness, the auditory scores for each perceptual dimension are available. The auditory MOSLQSM scores and integral MOSLQOM estimations are depicted in the bottom right-hand part of the figure. The corresponding correlation coefficients between the auditory and estimated MOSdim scores are presented in Table 5.8. For the Coloration dimension, the DIAL estimator provides relatively good quality estimations. However, the auditory experiment has been carried out in a WB context, whereas the Coloration estimator provides S-WB estimations: a saturation effect appears on the MOScol estimations before the mapping function. For the Noisiness dimension, a correlation coefficient of ρNoi,3 = 0.78 is obtained, see Table 5.8. The database “Dim-Scaling 1” does not follow the usual input-signal framework: the averaged percentage of vocal activity of the input speech signals is close to 80% (instead of 50% in other databases). This prevents a reliable estimation of the background noise parameter Ln. Here, the added noise is estimated by the parameter “Noise on Speech” (NoS), which seems to provide reliable estimations. For the Discontinuity dimension, the DIAL estimator seems to be less reliable than for the Coloration dimension. As described in the previous section, the Discontinuity estimator gives worse MOSdis values for codec tandeming (e.g. condition 56: tandeming of the G.729A codec) than for packet-loss conditions (e.g. condition 61: 3% of interruptions). However, the bottom right-hand part of the figure and the high correlation coefficient between the MOSLQSM and MOSLQOM values (ρDIAL,3 = 0.91) show that DIAL provides reliable integral quality values.

Fig. 5.4 Example degradation decomposition for WB Database 3 “Dim-Scaling 1”. DIAL is used in its S-WB operational mode. This figure shows the estimated and auditory quality values for the three dimensions Coloration, Noisiness and Discontinuity. The integral quality scores (i.e. auditory MOSLQSM values and MOSLQOM estimations) are depicted in the bottom right-hand part of the figure. [Four scatter plots: estimated over auditory MOScol, MOSnoi and MOSdis, and MOSLQOM over MOSLQSM.]
Comparison with reference estimators

Table 5.8 shows the experimental results of the DIAL dimension estimators on five databases. In the NB Databases 4, 5 and 6, the test corpus includes degradations on one dimension only; in this case, the auditory MOS results are considered dimension quality scores instead of integral quality scores. In the WB Databases 3 and 4, a specific auditory measurement method proposed by Wältermann et al. (2010b) has been used: the subjects rated the speech stimuli on an ACR 5-point listening quality scale and on the three dimension scales Coloration, Noisiness and Discontinuity. Here, a third order mapping function has been applied to the dimension quality estimations. Table 5.8 presents the “per-condition” Pearson correlation coefficient ρ and the “per-stimuli” modified prediction errors σ∗.
Table 5.8 “Per-condition” Pearson correlation coefficients ρ and “per-stimuli” modified prediction error σ∗ for each dimension estimator. These measures are computed after the third order mapping function

Context   No.   Database         Scale   Scholz ρ   Scholz σ∗   Huo ρ   Huo σ∗   DIAL ρ   DIAL σ∗
NB        4     Dimension-Cont   ACR     0.68       0.43        0.68    0.37     0.60     0.41
NB        5     Dimension-DFC    ACR     0.98       0.10        0.94    0.17     0.96     0.14
NB        6     Dimension-Noi    ACR     0.97       0.09        0.84    0.42     0.89     0.20
S-WB      3     Dim-Scaling 1    Dis     n/a        n/a         0.87    0.06     0.84     0.14
S-WB      3     Dim-Scaling 1    Col     n/a        n/a         0.95    0.04     0.97     0.03
S-WB      3     Dim-Scaling 1    Noi     n/a        n/a         0.51    0.30     0.78     0.17
S-WB      4     Dim-Scaling 2    Dis     n/a        n/a         0.87    0.12     0.80     0.14
S-WB      4     Dim-Scaling 2    Col     n/a        n/a         0.99    0.03     0.98     0.03
S-WB      4     Dim-Scaling 2    Noi     n/a        n/a         0.44    0.19     0.67     0.09
These statistical measures are calculated using the auditory dimension scores and the scores provided by the corresponding estimators, e.g. the Discontinuity estimator for the “Dimension-Cont” database. The table also presents the results for Scholz's and Huo's diagnostic models. Here, the term Coloration is used for Scholz's and Huo's DFC estimators. Firstly, the estimators developed by Scholz (2008) obtain the highest correlation coefficients for the three NB databases. However, Scholz's estimators have been developed on these databases, whereas DIAL has been extended to other types of degradations. The three Coloration estimators obtain low σ∗ values; they all use the parameter ERB, which is highly correlated with the auditory scores (ρERB,5 = 0.917). The DIAL Noisiness estimator obtains a higher ρ coefficient and a lower σ∗ value than Huo's Noisiness estimator. The three Discontinuity estimators obtain relatively low ρ coefficients and high σ∗ values. Over the three dimensions, Discontinuity seems to be the most difficult quality dimension to estimate, whereas Coloration seems to be the easiest one. Databases 3 and 4 have been carried out in a WB context. Since the estimators developed by Scholz (2008) have only a NB mode, Table 5.8 presents results only for DIAL and Huo's dimension estimators for these databases. Over the 2 WB databases, Huo's Continuity estimator obtains higher ρ coefficients and lower σ∗ values than the DIAL Discontinuity estimator. These statistical measures show that the Discontinuity dimension is better estimated in a WB context. Both Coloration estimators obtain high ρ coefficients and low σ∗ values. However, a large gap exists between the 2 Noisiness estimators: many conditions are highly underestimated by Huo's Noisiness estimator.
Fig. 5.5 Degradation decomposition for the NB Database 19 “P.SEAM”. a Scholz's estimators. b Huo's estimators. c DIAL estimators. [Three rows of scatter plots, one per set of estimators: estimated MOScol, MOSdis and MOSnoi over the auditory MOSLQSN values, with points labeled by condition number.]
Applications

Figure 5.5 shows the results for the NB Database 19 “P.SEAM”. This database includes noisy and packet-loss conditions. In addition, the test corpus includes acoustic recordings of user terminals. For this database, only the integral auditory quality scores MOSLQSN are available. Here, the estimators are applied to diagnose each condition. For instance, conditions 45 to 50 obtain higher Coloration and Discontinuity scores; these are reference conditions: four MNRU and one “Clean” condition.
However, using the MOSLQSN values, the three sets of dimension estimators can be compared. For instance, Fig. 5.5(a) shows a large deviation of the scores estimated by the Noisiness and the Discontinuity estimators developed by Scholz (2008). The Discontinuity estimator developed by Huo et al. (2008a) seems to provide better estimations, see Fig. 5.5(b). However, only DIAL provides reliable estimations on both dimensions Noisiness and Discontinuity, see Fig. 5.5(c). For the Coloration dimension, Scholz's and Huo's Coloration estimations are more consistent with the auditory MOSLQSN values than the DIAL Coloration estimations. However, DIAL obtains the highest correlation over the 80 stimuli degraded only on the Coloration dimension (i.e. conditions 7, 10, 33, 38 and 50). The correlation coefficients for the three Coloration estimators are ρDIAL = 0.79, ρScholz = 0.73 and ρHuo = 0.67. Scholz's estimator uses a parameter called β which represents the ratio between low and high frequencies in the degraded speech signal, whereas the Coloration estimator described in Sec. 4.5.1 estimates a bandwidth impairment Ibw. The deviation from a reference timbre introduced by user terminals is reliably covered by Scholz's estimator. A combination of both approaches may improve the DIAL Coloration estimator.
5.3 Summary and Conclusion

This chapter evaluates the new intrusive model DIAL. For this purpose, a large set of databases has been used. In the NB mode, the ITU-T standard PESQ obtains better results. However, DIAL outperforms PESQ for packet-loss conditions and bandwidth restrictions. In the S-WB mode, DIAL outperforms all reference instrumental models, including the ITU-T standard WB-PESQ and its improved versions, the Enh. WB-PESQ (Shiran and Shallom, 2009) and the Mod. WB-PESQ (Côté et al., 2006). The results of this exhaustive evaluation can be summarized following the six criteria introduced in Sec. 2.3:

• Completeness: The scope of DIAL is slightly reduced compared to the scope of the future POLQA model, see Sec. 4.1. For instance, DIAL shows problems with conditions impaired by strong discontinuities such as interruptions. Systems introducing time-varying linear deviations of the transfer function are also underestimated by DIAL. However, DIAL has a wider scope than the current ITU-T standards, PESQ and WB-PESQ. These standards fail to estimate the perceived quality of acoustic recordings or conditions impaired by time-warping effects.
• Accuracy: In both operational modes, the DIAL estimations obtain relatively high correlations with the auditory scores. Over the 1 874 conditions included in the WB and S-WB databases, DIAL obtains the highest correlation among the 7 intrusive models (including the current ITU-T standard).
• Credibility: The DIAL estimations are easily interpretable. First, the estimated integral quality takes into account all perceptual dimensions. Then, the degradation introduced in the speech signal is decomposed on four “perceptual” dimensions.
• Extensibility: Contrary to PESQ, DIAL relies on a statistical model, the k-NN. In this sense, the scope of DIAL can be extended, whereas an extension of PESQ would require a new calibration of its internal parameters. In order to extend the scope of DIAL to new degradations, a new training phase of the k-NN model on databases including these new degradations is required.
• Manipulability: DIAL is a command-line tool which requires the reference and the degraded speech signals and two additional input parameters: the operational mode and the sampling frequency of the input signals.
• Consistency: This criterion is the main statistical measure used in Chapter 5. This exhaustive evaluation shows that DIAL's consistency is slightly lower than PESQ's in the NB operational mode, while DIAL is the most consistent model in its S-WB operational mode.

However, several inaccuracies of the DIAL estimations have been described in this chapter. They may come from either the quality estimates (i.e. the Core model or dimension estimator estimations) or the judgment model (i.e. the k-NN statistical model). For instance, the Discontinuity estimator underestimates the quality of codec tandeming conditions. Even though the k-NN data sets include about 15 000 points, discontinuities exist within these data sets. Figure 5.6 illustrates this latter problem in the NB mode. Here, the four input values MOSCore, MOSnoi, MOSlev and MOSdis are set to 4.5 MOS, while the Coloration score varies within the range 1–5 MOS. For each estimated MOSLQON value, the consistency measure ConsMOS of the k-NN model has been calculated according to Eq. (4.85). This measure is presented in Fig. 5.6 as a confidence interval of the MOSLQON values. The consistency measure increases dramatically for MOScol values below 2.5 MOS. For several conditions with strong delay variations, DIAL does not properly align the two input speech signals. For such conditions, the time-alignment algorithm used by the Enh. PESQ may be employed. However, this model is outperformed by PESQ on the NB databases: it seems that the PESQ model estimates a more precise fine delay, whereas the Enh. PESQ better estimates strong delay variations. The Enh. WB-PESQ time-alignment algorithm is also slower than the algorithm used in PESQ. As a result, the Enh. PESQ time-alignment algorithm may replace the one used in DIAL only in cases where a time-warping effect is detected. Otherwise, the usual PESQ time-alignment algorithm should be used.
Fig. 5.6 Relationship between DIAL integral quality estimations MOSLQON and Coloration estimations MOScol. The consistency measures ConsMOS are calculated according to Eq. (4.85). [Plot of MOSLQON over MOScol, both on the 1–5 MOS scale, with confidence intervals derived from ConsMOS.]
Chapter 6
Conclusions and Outlooks
6.1 Conclusions

Auditory tests are the most reliable assessment method for the evaluation of speech processing systems. However, auditory tests are costly and time-consuming. This book aims at developing a reliable intrusive speech quality model which estimates auditory test results. For this purpose, Chap. 1 provides an overview of the human speech production and perception processes. Chapter 1 also presents the definition of perceived quality and the degradations introduced by speech transmission systems. The resulting degraded speech message can be assessed using the methods described in Chap. 2. Several intrusive speech quality models have been developed during the last decades. Their scope is highly linked to the degradations introduced by the transmission systems available at the time. The last intrusive model standardized by the ITU-T organization was published as the ITU–T Rec. P.862 (2001). This model, called PESQ, was validated on two databases including low bit-rate speech codecs, VoIP transmissions and background noises. However, new techniques have since been introduced in speech transmission systems, such as the time-warping techniques used by PLC algorithms or WideBand (WB) coding in VoIP networks. For the latter case, an updated version of the PESQ model, called WB-PESQ, was published as ITU–T Rec. P.862.2 (2005). An evaluation of this WB ITU-T standard on several databases is presented in Chapter 3. This evaluation shows that WB-PESQ under-estimates the perceived quality of hybrid speech codecs such as the ITU–T Rec. G.722.2 (2003). These under-estimations are exacerbated for female talkers. The first contribution of this book has been to propose modifications of the WB-PESQ algorithm, see Sec. 3.1.2. In the modified version, the partial compensation of the system's linear frequency response is applied to active frames only, whereas WB-PESQ uses the whole speech signal. In addition, this partial frequency compensation is calculated using frames having a higher level (10 dB) compared to the normal WB-PESQ algorithm. These first two modifications avoid an exacerbated
influence of background noise in the frequency response calculation. Then, a weighting function is applied during the aggregation over the frequency scale. This function attenuates the impact of distortions at low frequencies on the resulting MOSLQOM estimations. These modifications result in what has been called the “Modified WB-PESQ”. Evaluated in Chaps. 3 and 5, the Mod. WB-PESQ provides overall more reliable estimations than WB-PESQ.

Modified WB-PESQ, which has been developed especially to improve WB-PESQ, provides reliable quality estimations of the standardized WB speech coding algorithms.

The E-model quantifies the combined effect of speech coding algorithms and transmission errors on a pseudo-absolute quality scale, called the transmission rating scale, see ITU–T Rec. G.107 (1998). For this purpose, the E-model uses two parameters: the equipment impairment factor Ie and the packet-loss robustness factor Bpl. In Sec. 3.2.3, a method is proposed to derive these two parameters from quality estimations provided by WB intrusive models. This method includes a normalization procedure which attenuates context effects. Therefore, the derived Ie,WB and Bpl,WB values are considered as being “absolute”. A detailed analysis of the instrumentally derived Ie,WB values shows that the comparison of WB speech codecs using the same coding technique is feasible. However, the comparison of speech codecs using different coding techniques is not recommended. This analysis also shows that the instrumental assessment of packet-loss conditions and the derivation of Bpl,WB values is feasible with only a few WB intrusive models. This newly developed procedure has been accepted by the ITU-T organization as the new ITU–T Rec. P.834.1 (2009) standard.

A normalization procedure based on quality estimations from WB intrusive models has been proposed in this book and published as an ITU-T recommendation, the ITU–T Rec. P.834.1 (2009).

Current WB intrusive quality models are not fully valid for all possible WB speech transmission scenarios. Additional developments are thus necessary. A new concept for the instrumental assessment of speech quality is proposed in Chapter 4. The resulting model combines a signal-based speech quality model and four dimension estimators. This concept has been developed using existing instrumental measures. The signal-based model uses the TOSQA perceptual model and the PESQ time-alignment algorithm. However, to cope with specific time-warping effects, the time-alignment algorithm has been improved. Four dimension estimators are combined with the signal-based model. Instrumental measures for the dimensions Coloration, Noisiness and Discontinuity were already available. Since the Coloration and Discontinuity estimators had been developed for a NB context and evaluated on few databases, they have been extended to the S-WB context. For the dimensions
Noisiness and Loudness, new estimators have been developed. The resulting model, called Diagnostic Instrumental Assessment of Listening quality (DIAL), has two different operational modes: (i) a Narrow-Band mode and (ii) a Super-WideBand mode. In each mode, it provides an integral speech transmission quality MOSLQO and a decomposition of the identified degradations over four perceptual dimensions: MOScol, MOSnoi, MOSdis and MOSloud.

A new intrusive model based on a new concept has been developed. This model, called DIAL, combines a signal-based model and four dimension estimators.

In Chapter 5, this new intrusive model is exhaustively evaluated on a large set of databases. A specific statistical evaluation procedure, which takes into account the uncertainty of the auditory scores, is applied. Several reference instrumental models, including the current ITU-T standards PESQ and WB-PESQ, are compared to DIAL. Overall, in a NB context, DIAL obtains the highest correlation with auditory scores. However, when the specific statistical evaluation is applied, PESQ outperforms DIAL. Even though DIAL provides reliable estimations for low bit-rate speech codecs and acoustic recordings, it fails to predict the quality of interrupted speech. In WB and S-WB contexts, DIAL outperforms all instrumental reference models, including WB-PESQ and the Modified WB-PESQ introduced in Sec. 3.1.2. However, the S-WB operational mode of DIAL fails to predict the quality of some specific conditions; for instance, codec tandeming conditions are under-estimated by DIAL. Chapter 5 describes the origins of such inaccuracies in the DIAL estimations. These come from either the quality estimates or the judgment model. The scope of DIAL is thus slightly reduced compared to the scope of the future POLQA standard, see Sec. 4.1. The reliable estimation of the integral quality over the whole speech quality space is thus a difficult task which needs further developments.
DIAL obtains Pearson correlation coefficients of ρ = 0.936 over 55 Narrow-Band databases and ρ = 0.910 over 39 WideBand and Super-WideBand databases. Even though DIAL outperforms the current WB ITU-T standard WB-PESQ, DIAL is outperformed by the NB standard PESQ.

In this book, speech transmission quality has been assessed by new intrusive signal-based models. This was necessary because of the inaccuracy of available models such as the current ITU-T standards, PESQ and WB-PESQ. Here, two different experimental contexts have been studied: a NB context and a S-WB context. How speech quality is perceived in a S-WB context is still under study. Few S-WB databases are available compared to the number of tests which have been carried out in a NB context. The development of the new intrusive quality model Diagnostic Instrumental Assessment of Listening quality (DIAL) has been limited by this
restricted knowledge of subjects' perception in a S-WB context. In order to represent the whole speech quality space, the same four perceptual quality dimensions have been selected in both NB and S-WB contexts. However, until now, no speech quality space has been derived for a S-WB context. Some aspects of speech quality have not been studied in this book. For instance, the recency effect observed in the quality assessment of an overall conversation has not been evaluated here. Because of its much higher inherent complexity, the instrumental quality assessment in a conversational context has not been addressed.
6.2 Outlook

Chapter 5 shows some limitations of the current DIAL model. In order to obtain a more reliable intrusive model over the whole scope of applications described in Sec. 4.1, the following improvements might be considered:

• Core model
– The Core model relies on a psychoacoustic model for steady sounds developed by Zwicker et al. (1957). However, the speech signal is a time-varying sound. As a result, the current perceptual model may be replaced by the one developed by Moore et al. (1997).
– DIAL uses a Voice Activity Detector (VAD) which has been developed for the specific time-alignment algorithm used in PESQ. This VAD is thus not optimized for WB and S-WB signals. A new VAD algorithm, with two specific operational modes depending on the input signal bandwidth, may be considered.
– Monaural and diotic listening situations lead to different auditory results. This effect, described by Nagle et al. (2007), is not fully integrated in DIAL. The perceptual model would require a fine-tuning optimization of the two operational modes: the NB operational mode simulates a monaural listening situation, whereas the S-WB mode simulates a diotic situation.

• Loudness estimator
– The first three perceptual dimensions Coloration, Noisiness and Discontinuity have been derived by Wältermann et al. (2006b) within the same experimental study. A fourth estimator, for the quality feature Loudness, is used by DIAL. However, this last feature is not fully orthogonal to the other perceptual dimensions. In DIAL, two loudness MOS scores are calculated. The first one, derived from a Long-Term Loudness (LTL) value, is used to estimate the quality values on the Loudness dimension (i.e. a MOSloud value). This first MOS score, when used as an input value of the judgment model, reduces the accuracy of the DIAL integral quality estimations. Therefore, a second MOS score, derived from an “Equivalent Continuous Sound Level” Leq, is used by the judgment model to calculate the integral speech quality MOSLQO. New
experimental studies are needed to derive the whole speech quality space including the feature Loudness.
– The LTL calculation algorithm replaces the non-active frames by silence. However, a strong background noise may influence the overall loudness of the degraded signal. This algorithm may be improved by taking into account both the speech loudness values Ly(l) and the noise loudness values Ln(ln).

• Discontinuity estimator
– The Discontinuity estimator has a relatively low accuracy with respect to the auditory scores. This estimator was first developed by Huo et al. (2008b) and slightly modified in DIAL. A new estimator appears to be necessary to cover all artifacts producing a discontinuity in the speech signal.

• Noisiness estimator
– The Noisiness estimator quantifies the perceived impact of background noises. However, the quantization noise introduced by waveform speech codecs such as the ITU–T Rec. G.726 (1990) is not covered by the current Noisiness estimator. In a prior version of the Noisiness estimator, a Signal-Correlated Noise (SCN) estimation algorithm was included. Since this algorithm was too sensitive to wrong re-alignments of the two input signals, it was ruled out from the Noisiness estimator. A robust time-alignment algorithm may allow this SCN estimation algorithm to be included again and thus improve the Noisiness estimator.

• Judgment model
– The discontinuities in the k-NN model presented in Sec. 5.3 prevent the estimation of consistent integral quality values for relatively high impairments (e.g. below 2.5 MOS). Even though the Core model is able to estimate the correct rank order between speech coding algorithms, a judgment model based on a different machine learning technique may provide a higher accuracy on small impairments.
Appendix A
Modulated Noise Reference Unit (MNRU)
The Modulated Noise Reference Unit (MNRU) is a reference condition described in ITU–T Rec. P.810 (1996) which simulates the quantizing noise produced by logarithmic PCM techniques, e.g. ITU–T Rec. G.726 (1990). In this specific case, the noise is correlated with the speech signal. MNRUs are used quite extensively in the assessment of speech codecs. Two versions are available, one for NB and one for WB speech. To produce an MNRU condition, a parameter called Q, defined in dB, is used to quantify the degradation as follows:

y(k) = x(k) \left( 1 + 10^{-Q/20} \cdot N(k) \right)    (A.1)

where k corresponds to the sample index, x(k) is the input speech sample, N(k) a white noise, Q the SNR in dB, and y(k) the degraded output speech sample. Usually, a test corpus includes four to seven MNRU conditions with Q between 5 dB and 45 dB. Figure A.1 shows the average spectrum of a speech sample before and after the addition of the signal-correlated noise, here at a low Q value (8 dB). The spectrum is nearly flat above 2 kHz, which shows that the added noise corresponds to a white noise. Then, a filter rules out all frequency components above 7 kHz.
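A minimal Python sketch of Eq. (A.1) is given below; the unit-variance white noise and the omission of the final 7 kHz low-pass filter are simplifications of the exact processing prescribed by ITU–T Rec. P.810 (1996):

    import numpy as np

    def mnru(x, q_db, seed=0):
        # Eq. (A.1): add noise that is correlated with (multiplied by)
        # the speech signal, at a signal-to-noise ratio of Q dB.
        n = np.random.default_rng(seed).standard_normal(len(x))
        return x * (1.0 + 10.0 ** (-q_db / 20.0) * n)

    # Typical usage: four to seven conditions with Q between 5 and 45 dB,
    # e.g. y = mnru(x, 8.0) for the low-Q example of Fig. A.1.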
Fig. A.1 Signal spectrum before and after the addition of signal-correlated noise. The Q is set here to 8 dB. [Plot of level (dBSPL) over frequency (100 Hz to 8 kHz) for the clean speech and the MNRU (Q = 8 dB) signals.]
Appendix B
Databases test-plan
During this study, intrusive models were applied to several databases. Overall, 94 auditory tests have been analyzed by the DIAL model in Chap. 5. Part of these tests come from a pool of databases available for the POLQA project. This pool includes databases carried out during previous standardization projects. Due to legal restrictions, not all databases can be disclosed in this Appendix. However, example databases of interest are described in more detail in the following paragraphs. All auditory tests of Apps. B.1 and B.2 have been carried out in accordance with ITU–T Rec. P.800 (1996) and ITU–T Rec. P.830 (1996). A 5-point ACR listening quality scale has been used. The listening level was set to 79 dBSPL (dB rel. 20 µPa), which corresponds to the preferred listening level in a NB context (ITU–T Handbook on Telephonometry, 1992). Almost all databases have been provided by either Deutsche Telekom Laboratories (TU Berlin, Germany) or France Télécom R&D (Lannion, France), with German and French source material respectively. The following sections describe the databases in more detail.
B.1 Narrow-Band databases

All Narrow-Band databases used for the evaluation procedure were collected using an ACR 5-point listening quality scale according to ITU–T Rec. P.800 (1996). Table B.1 summarizes the experimental factors.

Nagle NB
The same database has been collected in two different listening situations: (i) with a monaural headset as “Nagle Mono NB”, and (ii) with a diotic headphone as “Nagle Diotic NB”. The auditory results have been published by Nagle et al. (2007). The test corpus includes a “clean” (i.e. non degraded) condition and four NB speech codecs (G.711, G.729.1 at 8 and 12 kbit/s and AMR at 4.75 and 12.2 kbit/s) at three different packet-loss ratios (0, 3 and 6%).
Table B.1 Experimental factors of the testing databases for the Narrow-Band mode. Cond. means the total number of conditions included in the test. Subj. means the total number of test subjects who rate a single stimulus. Stim. means the total number of stimuli per condition

Name              Language   Cond.   Subj.   Listening device   Stim.
Nagle Diotic NB   French     16      24      Diotic headphone   4
Nagle Mono NB     French     16      24      Monaural headset   4
Dimension-Cont    German     42      20      Monaural headset   2
Dimension-DFC     German     42      20      Monaural headset   2
Dimension-Noi     German     42      20      Monaural headset   2
Dimension-NB      German     68      20      Monaural headset   4
G.729EV NB        French     24      8       Monaural headset   12
IKA/LIMSI NB      German     9       22      Handset            4
IP-Tiphon         German     43      24      Handset            4
FT-04 NB          French     25      24      Monaural headset   4
FT-06 NB          French     30      8       Monaural headset   12
FT-IP             French     30      16      Monaural headset   4
P.862 Prop        German     50      24      Handset            4
P.862 BGN         German     49      28      Handset            4
P.862 Shared      German     50      24      Handset            4
P.AAM 1           German     36      27      Monaural headset   4
P.AAM 2           German     27      27      Monaural headset   4
P.AAM UAQ         German     76      24      Monaural headset   4
P.SEAM            French     50      6       Monaural headset   16
Sup23 XP1-A       French     44      24      Handset            4
Sup23 XP1-D       Japanese   44      24      Handset            4
Sup23 XP1-O       English    44      24      Handset            4
Sup23 XP3-A       French     50      24      Handset            4
Sup23 XP3-C       Italian    50      24      Handset            4
Sup23 XP3-D       Japanese   50      24      Handset            4
Sup23 XP3-O       English    50      24      Handset            4
Dimension
This database comprises four auditory experiments. Three tests include conditions impaired on a single perceptual dimension: “Dimension-DFC” (bandpass filtering, acoustic recordings), “Dimension-Noisiness” (G.726, MNRU and background noises) and “Dimension-Continuity” (musical tones, interruptions and speech codecs at four different packet-loss ratios: 3, 5, 10 and 20%). The fourth auditory test, “Dimension-NB”, is described in Wältermann et al. (2006b). It includes a sub-set of each Dimension-X experiment and conditions with combined degradations (e.g. HFT with background noise).

G.729EV NB
This database stems from the ITU–T Rec. G.729.1 (2006) codec qualification (or selection) phase. The experiment was carried out at France Télécom R&D.
This database includes a NB speech codec (G.729A), several MNRU conditions and the candidate codec of France Télécom at two different packet-loss ratios (0 and 3%) and two different bit-rates (8 and 12 kbit/s). Details on the test set-up are available in ITU–T Temp. Doc. TD.65 (2005) and on the test results in ITU–T Temp. Doc. TD.71 (2005).

IKA/LIMSI NB
This listening-only test was carried out at the Institut für Kommunikationsakustik (Bochum, Germany) by Côté (2005). Nine NB conditions are included in the test corpus: four bandpass filtering conditions and five NB codecs (G.711, IS-54, G.726 at 24 and 32 kbit/s and G.729A).

IP-Tiphon
This database was collected by Deutsche Telekom Berkom (Berlin, Germany) for an ETSI project. The auditory test is composed of 50 conditions including a set of reference conditions (MNRU and clean conditions) and several NB speech codecs (G.711, G.723.1, G.726 and GSM-FR) impaired by transmission errors (packet-loss ratios of 0, 5, 10, 15 and 20%). For this study, the MNRU and clean conditions were not available, resulting in 43 conditions. The auditory results are described in ETSI Temp. Doc. WG 5-64 (1999).

P.862
This database is composed of three listening-only tests carried out at Deutsche Telekom Berkom (Berlin, Germany) during the ITU-T standardization project ITU–T Rec. P.862 (2001). Each test includes a set of NB reference conditions (MNRU and clean conditions) and several NB speech codecs (G.726, G.729, G.728, G.723.1 and EVRC). Both experiments “P.862 Prop” and “P.862 BGN” include live electric recordings from mobile networks and background noises. Experiment “P.862 Shared” includes several NB codecs (G.726, G.729 and GSM-EFR) impaired by transmission errors (packet-loss ratios of 0, 5, 10, 15 and 20%).

P.AAM
Three auditory tests were carried out by Deutsche Telekom Berkom during the ITU-T standardization Project Acoustic Assessment Model (P.AAM). The two experiments “P.AAM 1” and “P.AAM 2” correspond to acoustic recordings only. However, “P.AAM 1” includes a set of NB speech codecs (G.711, G.726, G.729 and GSM-FR) and reference conditions, and “P.AAM 2” includes mainly bandpass and reference conditions. The last experiment, “P.AAM UAQ”, includes live acoustic recordings over the VoIP network with different packet-loss ratios (0, 2, 5, 10, 15, 20 and 30%).

P.SEAM
This database has been carried out at France Télécom R&D (Lannion, France) during the ITU-T standardization project P.SEAM.
This experiment includes 6 reference conditions (MNRU and clean conditions). The other conditions have been acoustically recorded: NB speech codecs (G.711, G.729 and G.723.1) impaired by transmission errors (packet-loss ratios of 0, 3, 5 and 10%) and background noise.

FT-04 NB
This experiment was carried out at France Télécom R&D (Lannion, France) by Barriac et al. (2004). Details on the test set-up are described in ITU–T Del. Contrib. COM 12–46 (2005). The test corpus includes 18 NB conditions and 7 WB conditions down-sampled to a sampling rate of 8 kHz. Several standard speech codecs, such as the G.726 and the G.729, have been used to process the stimuli. The clean condition corresponds to a flat bandpass ranging from 0 to 4 kHz.

FT-06 NB
This database has been collected by France Télécom R&D (Lannion, France). Details on the test set-up are described in ITU–T Del. Contrib. COM 12–149 (2006). The test corpus includes several speech codecs (G.711, G.729A, G.723.1 and AMR) in single and tandem conditions at different packet-loss ratios (0, 3, 5 and 10%).

FT-IP
This database has been carried out at France Télécom R&D (Lannion, France). The test corpus includes reference conditions (MNRU and clean conditions) and several live recordings of VoIP transmissions. Eight VoIP conditions are impaired by environmental noises.

Sup23
This database is provided in ITU–T Suppl. 23 to P-Series Rec. (1998) and stems from the ITU–T Rec. G.729 (2007) codec selection phase. It includes two listening tests, XP1 and XP3, carried out in three languages for XP1 and four languages for XP3. This database is considered the reference database for PESQ to verify the correctness of its implementation, cf. ITU–T Rec. P.862.3 (2005). Test 1 includes several speech codecs (G.711, G.726, G.729 and GSM-FR) in single and tandem conditions. Test 3 includes a set of NB speech codecs impaired by transmission errors (packet-loss ratios of 0, 3 and 5%) and background noise.
B.2 WideBand and Super-WideBand databases

All WideBand and Super-WideBand listening tests used for the evaluation procedure were carried out on an ACR 5-point listening quality scale according to ITU–T Rec. P.830 (1996). Table B.2 summarizes the experimental factors.
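On the ACR scale, each condition's Mean Opinion Score (MOS) is the arithmetic mean of the subjects' 5-point ratings. A minimal sketch; the 1.96 factor is a normal-approximation 95% confidence interval added here purely for illustration.

    import statistics

    def mos(ratings):
        # Mean Opinion Score: mean of the ACR ratings
        # (1 = bad, ..., 5 = excellent) gathered for one condition.
        return statistics.mean(ratings)

    def ci95(ratings):
        # Half-width of a 95% confidence interval on the MOS
        # (normal approximation).
        return 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5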
Table B.2 Experimental factors of the testing databases for the Super-WideBand mode. Cond. means the total number of conditions included in the test. Subj. means the total number of test subjects who rate a single stimulus. Stim. means the total number of stimuli per condition. A full-scale (FS) experiment includes degradations on four perceptual dimensions. A NB/WB experiment (mixed-band) includes both NB and WB conditions.

Context  Name              Language  Cond.  Subj.  Listening device  Stim.
WB       Nagle Diotic WB   French    16     24     Diotic headphone  12
         Nagle Mono WB     French    16     24     Monaural headset  12
NB/WB    AMR-WB            French    56     32     Monaural headset  4
         Dimension-NB/WB   German    14     19     Diotic headphone  4
         Dim-Scaling 1     German    66     4      Diotic headphone  12
         Dim-Scaling 2     German    76     4      Diotic headphone  12
         FT-UMTS           French    29     6      Diotic headphone  4
         G.729EV NB/WB     French    40     8      Monaural headset  12
         IKA/LIMSI NB/WB   German    18     23     WB-Handset        4
         Loudness          French    38     24     Monaural headset  8
         FT-04 NB/WB       French    36     24     Monaural headset  4
         FT-06 NB/WB       French    60     8      Monaural headset  12
         NTT               Japanese  164    24     Monaural headset  4
         Tsukuba           Japanese  21     32     Monaural headset  8
FS       Skype             German    18     24     Diotic headphone  4
         P.OLQA 1          French    48     24     Diotic headphone  4
         P.OLQA 2          French    49     24     Diotic headphone  4
Nagle WB
Similarly to “Nagle NB”, the same database has been collected in two different listening situations, resulting in two auditory experiments: “Nagle Mono WB” and “Nagle Diotic WB”. The auditory results have been published by Nagle et al. (2007). Each experiment includes a “clean” (i.e. non-degraded) condition and five speech codecs at three different packet-loss ratios (0, 3 and 6%). The WB codecs are the G.722, the G.729.1 at 16 and 32 kbit/s and the AMR-WB at 12.65 and 23.85 kbit/s.

AMR-WB
This database was collected by France Télécom R&D (Lannion, France). The experiment includes 12 NB and WB reference conditions (MNRU and clean conditions). The other conditions correspond to live electric recordings of signals transmitted through the UMTS network. These conditions include the AMR-WB speech codec at 6.60, 8.85, 12.65, 15.85 and 23.85 kbit/s and different transmission paths: “uplink” (from the phone to the base station) and “downlink” (from the base station to the phone).

Dimension-NB/WB
This mixed-band database is described in Wältermann et al. (2006a). The test corpus includes conditions impaired on the three perceptual dimensions, i.e. acoustic recordings, background noise and packet losses (at 20%), and several speech codecs such as the G.729A, the G.722.1 and the AMR-WB.

Dim-Scaling
This database, collected at Deutsche Telekom Laboratories (Berlin, Germany), is composed of two auditory tests. The test plan and test results are described in Wältermann et al. (2010b). A specific test procedure called “degradation decomposition” has been used. The subjects judged the integral quality of the stimuli on an ACR listening quality scale. In addition, the stimuli have been judged on three quality dimension scales: Discontinuity, Noisiness, and Coloration. For this purpose, continuous scales have been used. The poles of the scales are labeled with the following pairs of antonym attributes: “continuous–discontinuous”, “not noisy–noisy”, and “uncolored–colored”.

FT-UMTS
This database was collected at Deutsche Telekom Laboratories (Berlin, Germany). The test corpus comprises 12 reference conditions similar to those defined for the POLQA training databases and 17 conditions including time-warping effects. Two different VoIP applications (Skype and MSN-Messenger) were evaluated.

G.729EV NB/WB
Similarly to “G.729EV NB”, this NB/WB test stems from the ITU–T Rec. G.729.1 (2006) codec qualification (or selection) phase. The experiments were carried out at France Télécom R&D. The test corpus includes WB or NB speech codecs (G.729A, G.722 and AMR-WB), several Narrow-, Middle- (i.e. 100–5 000 Hz) and WB MNRU conditions, and the candidate codec of France Télécom at different bit-rates. Details on the test set-up are available in ITU–T Temp. Doc. TD.65 (2005) and on the test results in ITU–T Temp. Doc. TD.71 (2005).

IKA/LIMSI NB/WB
This listening-only test was carried out at the Institut für Kommunikationsakustik (Bochum, Germany) by Côté (2005). The test corpus comprises the 9 NB conditions included in the NB test “IKA/LIMSI NB” and also 9 WB conditions (bandpass filtering and two WB codecs). The bandpass filtering conditions have different bandwidths (lower cut-off frequency fl ∈ 50–600 Hz, upper cut-off frequency fu ∈ 2 000–7 000 Hz).

Loudness
This database includes NB and WB speech codecs in error-free and packet-loss conditions. The conditions were played at five listening levels (78, 73, 68, 63 and 53 dB SPL). The test set-up and test results are described in Côté et al. (2007).
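As a hedged illustration of how such listening-level conditions can be produced digitally, the sketch below scales a signal to a target level relative to a full-scale sine. The level actually presented in dB SPL additionally depends on the calibrated playback chain, and auditory tests typically measure the active speech level per ITU–T Rec. P.56, which gates out speech pauses, unlike the plain RMS used here.

    import numpy as np

    def scale_to_level(x, target_db):
        # Scale x so that its RMS level equals target_db relative to a
        # full-scale sine (RMS 1/sqrt(2) for signals in [-1, 1]).
        full_scale_rms = 1.0 / np.sqrt(2.0)
        current_db = 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) / full_scale_rms)
        return x * 10.0 ** ((target_db - current_db) / 20.0)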
FT-04 NB/WB
Similarly to “FT-04 NB”, this mixed-band listening-only test was carried out at France Télécom R&D (Lannion, France) by Barriac et al. (2004). The test corpus comprises the 25 NB conditions included in “FT-04 NB” and 11 WB conditions (the G.722, the G.722.1 and the AMR-WB speech codecs). The clean WB condition corresponds to a flat bandpass ranging from 0 to 8 kHz.

FT-06 NB/WB
This mixed-band test was carried out at France Télécom R&D (Lannion, France). Details on the test set-up are described in ITU–T Del. Contrib. COM 12–149 (2006). The test corpus is composed of the 30 NB conditions included in “FT-06 NB” and 30 WB conditions. These WB conditions include several speech codecs (G.722, G.722.1, AMR-WB and G.729EV, see Footnote 1) in single and tandem conditions at different packet-loss ratios (0, 3, 5 and 10%).

NTT
This database was collected by NTT (Tokyo, Japan), see ITU–T Del. Contrib. COM 12–5 (2005), ITU–T Del. Contrib. COM 12–64 (2005) and Takahashi et al. (2005b). It includes both NB- and WB-coded samples in error-free and packet-loss conditions.

Tsukuba
This database stems from a test carried out at the Institute of Information Sciences and Electronics (Tsukuba, Japan) and includes both NB- and WB-coded samples. Details on the test set-up are described in ITU–T Del. Contrib. COM 12–33 (2005) and ITU–T Del. Contrib. COM 12–77 (2005).

Skype
This “full-scale” database was collected at Deutsche Telekom Laboratories (Berlin, Germany). Its test corpus includes the 12 reference conditions defined for the POLQA training databases. This database includes 6 conditions (3 in NB, 3 in WB) transmitted through the VoIP application Skype.

POLQA
This database is composed of two full-scale listening-only tests carried out at France Télécom R&D (Lannion, France). Both tests include 12 reference conditions defined in ITU–T Temp. Doc. TD.52 Rev.1 (2007). These conditions reflect the whole speech quality space. The first auditory test, “P.OLQA 1”, includes conditions impaired by linear degradations, such as non-optimum listening levels or background noises, and combinations of them. The second auditory test, “P.OLQA 2”, includes NB, WB and S-WB speech codecs.
Footnote 1: The G.729EV corresponds to a pre-published version (Version 1.14.1, Jan. 31, 2006) of the ITU–T Rec. G.729.1 (2006) standard. This version has been used during the optimisation/characterization phase.
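Many of the corpora above degrade coded speech at fixed packet-loss ratios. As an illustration only, the sketch below drops frames independently at a given ratio; the cited tests may instead have used prescribed error patterns or bursty loss models, so the i.i.d. assumption is a simplification.

    import random

    def drop_frames(frames, loss_ratio, seed=0):
        # Replace each coded frame by None with probability loss_ratio,
        # leaving gaps for the decoder's loss concealment to fill.
        rnd = random.Random(seed)
        return [None if rnd.random() < loss_ratio else f for f in frames]

    # e.g. a nominal 10% loss condition:
    # degraded_frames = drop_frames(coded_frames, loss_ratio=0.10)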
B.3 Perceptual dimension databases

From the dimension databases, MOS values on specific perceptual quality dimensions have been obtained. For this purpose, two different approaches have been used. These approaches are briefly introduced in Sec. B.1 (“Dimension”) and Sec. B.2 (“Dim-Scaling”). In the three databases, the speech stimuli were in German. Table B.3 summarizes the experimental factors.

Table B.3 Experimental factors of the testing databases used for the statistical evaluation of the degradation decomposition. Cond. means the total number of conditions included in the test. Subj. means the total number of test subjects who rate a single stimulus. Stim. means the total number of stimuli per condition.

Context  Name            Cond.  Subj.  NB  WB  Listening device  Stim.
         Dimension-Cont  42     20             Monaural headset  2
         Dimension-DFC   42     20             Monaural headset  2
         Dimension-Noi   42     20             Monaural headset  2
         Dim-Scaling 1   66     4              Diotic headphone  12
         Dim-Scaling 2   76     4              Diotic headphone  12
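The decomposition idea behind these databases is that integral quality can be modeled from the dimension ratings, for instance by a distance in the dimension space. The sketch below is purely illustrative: the weights and the linear mapping onto the 5-point MOS range are placeholders, not the fitted parameters of the cited studies.

    import math

    def integral_quality(coloration, discontinuity, noisiness,
                         weights=(1.0, 1.0, 1.0)):
        # Quality decreases with the weighted Euclidean distance of a
        # stimulus from the unimpaired origin of the dimension space;
        # the result is clipped to the 5-point MOS range.
        d = math.sqrt(sum(w * v * v for w, v in
                          zip(weights, (coloration, discontinuity, noisiness))))
        return max(1.0, min(5.0, 5.0 - d))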
References
Allen, J., Hall, J., and Jeng, P. (1990). Loudness Growth in 1/2-Octave Bands (LGOB)—A Procedure for the Assessment of Loudness. Journal of the Acoustical Society of America, 88(2):745– 753. 158 Allnatt, J. (1975). Subjective Rating and Apparent Magnitude. International Journal of ManMachine Studies, 7(6):801–816. 62, 182 Atal, B. S. and Hanauer, S. L. (1971). Speech Analysis and Synthesis by Linear Prediction of the Speech Wave. Journal of the Acoustical Society of America, 50(2B):637–655. 23 Au, O. and Lam, K. (1998). A Novel Output-Based Objective Speech Quality Measure for Wireless Communication. In Proc. 4th Int. Conf. on Signal Processing (ICSP’98), 666–669. 65, 83 Baddeley, A. D. and Hitch, G. J. (1974). Recent Advances in Learning and Motivation, volume 8, chapter Working Memory, 47–89. Academic Press, USA–New York, NY. 10 Baddeley, A. (2003). Working Memory and Language: an Overview. Journal of Communication Disorders, 36(3):189–208. 10 Barriac, V., Le Saout, J.-Y., and Lockwood, C. (2004). Discussion on Unified Objective Methodologies for the Comparison of Voice Quality of Narrowband and Wideband Scenarios. In ETSI Workshop on Wideband Speech Quality in Terminals and Networks: Assessment and Prediction, DE–Mainz. 55, 56, 94, 105, 116, 168, 224, 227 Beerends, J. G., Busz, B., Oudshoorn, P., Van Vugt, J., Ahmed, K., and Niamut, O. (2007). Degradation Decomposition of the Perceived Quality of Speech Signals on the Basis of a Perceptual Modeling Approach. Journal of the Audio Engineering Society, 55(12):1059–1074. 81 Beerends, J. (1994). Modelling Cognitive Effects that Play a Role in the Perception of Speech Quality. In Proc. Workshop on Speech Quality Assessment, 2–9, DE–Bochum. 70, 73, 90, 97, 157 Beerends, J., Hekstra, A., Rix, A., and Hollier, M. (2002). Perceptual Evaluation of Speech Quality (PESQ): The New ITU Standard for End-to-End Speech Quality Assessment Part II— Psychoacoustic Model. Journal of the Audio Engineering Society, 50(10):765–778. 75, 88, 90, 119, 153, 158 Beerends, J. and Stemerdink, J. (1992). A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation. Journal of the Audio Engineering Society, 40(12):963–963. 70 Beerends, J. and Stemerdink, J. (1994). A Perceptual Speech-Quality Measure Based on a Psychoacoustic Sound Representation. Journal of the Audio Engineering Society, 42(3):115–123. 2, 69, 70, 71, 85, 160 Bele, I. V. (2007). Dimensionality in Voice Quality. Journal of Voice, 21(3):257–272. 34 Benignus, V. (1969). Estimation of the Coherence Spectrum and its Confidence Interval using the Fast Fourier Transform. IEEE Transactions on Audio and Electroacoustics, 17(2):145–150. 71 Berger, J. (1996). Ein Ansatz zur Instrumentellen Sprachqualitätsabschätzung im Festnetz der Deutschen Telekom (An approach to instrumental speech quality estimation in Deutsche
Telekom’s fixed network). In Proc. Workshop on Quality Assessment in Speech, Audio and Image Communication, DE–Darmstadt. in German. 71 Berger, J. (1998). Instrumentelle Verfahren zur Sprachqualitätsschätzung – Modelle auditiver Tests. Shaker, DE–Aachen. 165, 186 Bernex, E. and Barriac, V. (2002). Architecture of Non-Intrusive Perceived Voice Quality Assessment. In Proc. Int. Conf. Measurement of Speech and Audio Quality in Networks (MESAQIN), 13–16, CZ–Prague. 29, 30, 32 Bessette, B., Salami, R., Lefebvre, R., Jelinek, M., Rotola-Pukkila, J., Vainio, J., Mikkola, H., and Jarvinen, K. (2002). The Adaptive Multirate Wideband Speech Codec (AMR-WB). IEEE Transactions on Speech and Audio Processing, 10(8):620–636. 24 Billi, R. and Scagliola, C. (1982). Artificial Signals and Identification Methods to Evaluate the Quality of Speech Coders. IEEE Transactions on Communications, 30(2):325–335. 66 BIPM Guides in metrology (2008). The International Vocabulary of Metrology—Basic and General Concepts and Associated Terms (VIM). International Bureau of Weights and Measures, FR–Sèvres. 37 Blauert, J. and Guski, R. (2009). Critique of Pure Psychoacoustics. In Proc. NAG/DAGA 2009 Int. Conf. on Acoustics, volume 3, 1518–1519, NL–Rotterdam. 39 Blauert, J. (1997). Spatial Hearing: the Psychophysics of Human Sound Localization. MIT Press, USA–Cambridge, Mass., revised edition. 7, 37, 38 Brandenburg, K. (1987). Evaluation of Quality for Audio Encoding at Low Bit-Rates. In Proc. 82nd AES Convention, number 2433, DE–Erlangen. 69 Brehm, H. and Stammler, W. (1987). Description and Generation of Spherically Invariant SpeechModel Signals. Signal Processing, 12(2):119–141. 66 Bronkhorst, A. (2000). The Cocktail Party Phenomenon: A Review of Research on Speech Intelligibility in Multiple-Talker Conditions. Acustica, 86(1):117–128. 12 Carroll, J. D. (1972). Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, volume 1, chapter Individual Differences and Multidimensional Scaling, 105–155. Seminar Press, USA–New York, NY. 52 Carroll, J. and Chang, J. (1970). Analysis of Individual Differences in Multidimensional Scaling via an N-way Generalization of “Eckart-Young” Decomposition. Psychometrika, 35(3):283– 319. 51 Cavanaugh, J. R., Hatch, R. W., and Sullivan, J. L. (1976). Models for the Subjective Effects of Loss, Noise, and Talker Echo on Telephone Connections. Bell System Technical Journal, 55(9):1319–1371. 61, 62 Chen, G., Koh, S., and Soon, I. (2003). Enhanced Itakura Measure Incorporating Masking Properties of Human Auditory System. Signal Processing, 83(7):1445–1456. 68 Chen, G. and Parsa, V. (2007). Loudness Pattern-Based Speech Quality Evaluation Using Bayesian Modeling and Markov Chain Monte Carlo Methods. Journal of the Acoustical Society of America, 121(2):77–83. 85 Chen, K., Huang, C., Huang, P., and Lei, C. (2006). Quantifying Skype User Satisfaction. In Proc. Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), 399–410. IT–Pisa. 17 Coetzee, H. J. and Barnwell, T. (1989). An LSP Based Speech Quality Measure. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’89), volume 1, 596–599, UK–Glasgow. 68 Combescure, P. (1981). 20 listes de 10 Phrases Phonétiquement Équilibrées. Revue d’Acoustique, 56:34–38. 44 Combescure, P., Le Guyader, A., and Gilloire, A. (1982). Quality Evaluation of 32 kbit/s Coded Speech by Means of Degradation Category Ratings. In IEEE Int. Conf. 
on Acoustics, Speech, and Signal Processing (ICASSP’82), volume 7, 988–991. 46 Côté, N. (2005). Qualité Perçue de Parole Transmise par Voie Téléphonique Large-bande. Master thesis, Université Pierre et Marie Curie (Paris VI), FR–Paris. 94, 106, 116, 223, 226 Côté, N. and Durin, V. (2008). Effect of Degradations’ Distribution in a Corpus Test on Auditory Ratings. In Proc. 155th Meeting of the Acoust. Soc. of America / 5th Forum Acusticum / 9e
Congrès Français d’Acoustique / 2nd ASA-EAA Joint Conference (Acoustics’08), 465–470, FR–Paris. 55 Côté, N., Gautier-Turbin, V., and Möller, S. (2008). Evaluation of Instrumental Quality Measures for Wideband-Transmitted Speech. In Proc. 8th ITG-Fachtagung Sprachkommunikation, DE– Aachen. 65 Côté, N., Gautier-Turbin, V., Raake, A., and Möller, S. (2006). Analysis of a Quality Prediction Model for Wideband Speech Quality, the WB-PESQ. In Proc. 2nd ISCA/DEGA Tutorial and Research Workshop on Perceptual Quality of Systems, 115–122, DE–Berlin. 104, 209 Côté, N., Koehl, V., Gautier-Turbin, V., Raake, A., and Möller, S. (2009). Reference Units for the Comparison of Speech Quality Test Results. In Proc. 126th AES Convention, number 7784, DE–Munich. 54 Côté, N., Gautier-Turbin, V., and Möller, S. (2007). Influence of Loudness Level on the Overall Quality of Transmitted Speech. In Proc. 123rd AES Convention, number 7175, USA–New York, NY. 34, 131, 226 Dau, T., Kollmeier, B., and Kohlrausch, A. (1997). Modeling Auditory Processing of Amplitude Modulation. I. Detection and Masking with Narrow-band Carriers. Journal of the Acoustical Society of America, 102(5):2892–2905. 73 Dau, T., Püschel, D., and Kohlrausch, A. (1996). A Quantitative Model of the Effective Signal Processing in the Auditory System. I. Model Structure. Journal of the Acoustical Society of America, 99(6):3615–3622. 73 Demany, L. and Semal, C. (2008). Auditory Perception of Sound Sources, volume 29, chapter The Role of Memory in Auditory Perception, 77–113. Springer, DE–Berlin. 10 Deng, L. and O’Shaughnessy, D. (2003). Speech Processing: a Dynamic and OptimizationOriented Approach. Marcel Dekker, Inc., USA–New-York, NY. 6 Dimolitsas, S. (1993). Speech and Audio Coding for Wireless and Network Applications, chapter Subjective Assessment Methods for the Measurement of Digital Speech Coder Quality, 43–54. Kluwer Academic Publ. 45 Ding, L., Radwan, A., El-Hennawey, M. S., and Goubran, R. A. (2006). Measurement of the Effects of Temporal Clipping on Speech Quality. IEEE Transactions on Instrumentation and Measurement, 55(4):1197–1203. 172, 173 Dudley, H. (1939). Remaking speech. Journal of the Acoustical Society of America, 11(2):169– 177. 23 Duncanson, J. (1969). The Average Telephone Call is Better than the Average Telephone Call. Public Opinion Quarterly, 33(1):112–116. 14 Durin, V. and Gros, L. (2008). Measuring Speech Quality Impact on Tasks Performance. In Proc. 11th Int. Conf. on Spoken Language Processing (ICSLP), 2074–2077, AU–Brisbane. 38 Ekman, L., Grancharov, V., and Kleijn, W. (2011). Double-Ended Quality Assessment System for Super-Wideband Speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(3):558–569. 77 Etame, T., Faucon, G., Gros, L., Le Bouquin Jeannes, R., and Quinquis, C. (2008). Characterization of the Multidimensional Perceptive Space for Current Speech and Sound Codecs. In Proc. 124th AES Convention, number 7410, NL–Amsterdam. 29, 33 ETSI EG 202 396-3 (2007). Speech Quality Performance in the Presence of Background Noise Part 3: Background Noise Transmission - Objective Test Methods. European Telecommunication Standards Institute, FR–Sophia Antipolis. 81 ETSI ETR 250 (1996). Speech Communication Quality from Mouth to Ear for 3,1 kHz Handset Telephony across Networks. European Telecommunications Standards Institute, FR–Sophia Antipolis. 63, 64 ETSI Temp. Doc. WG 5-64 (1999). Results of VoIP Simulation. European Telecommunication Standards Institute, FR–Sophia Antipolis. 223 ETSI TS 126 190 (2007). 
Speech Codec Speech Processing Functions; Adaptive Multi-Rate Wideband (AMR-WB) Speech Codec; Transcoding Functions. European Telecommunication Standards Institute, FR–Sophia Antipolis. 24, 94
EURESCOM Project Report P905 (2000). Assessment of Quality for Audio-Visual signals over Internet and UMTS. European Institute for Research and Strategic Studies in Telecommunications, DE–Heidelberg. 93 Falk, T. H. and Chan, W. Y. (2006). Single-Ended Speech Quality Measurement Using Machine Learning Methods. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):1935–1947. 84 Falk, T. H. and Chan, W.-Y. (2009). Performance Study of Objective Speech Quality Measurement for Modern Wireless-VoIP Communications. EURASIP Journal on Audio, Speech, and Music Processing, 2009. Article ID 104382. 85 Fechner, G. T. (1860). Elemente der Psychophysik. Breitkopf & Härtel, DE–Leipzig. 58 Fletcher, H. (1940). Auditory Patterns. Reviews of Modern Physics, 12(1):47–65. 8 Fletcher, H. and Arnold, H. (1929). Speech and Hearing. Van Nostrand, USA–New York, NY. 85, 178 Fletcher, H. and Galt, R. H. (1950). The Perception of Speech and Its Relation to Telephony. Journal of the Acoustical Society of America, 22(2):89–151. 61 Freeman, D., Cosier, G., Southcott, C., and Boyd, I. (1989). The Voice Activity Detector for the Pan-European Digital Cellular Mobile Telephone Service. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’89), volume 1, 369–372, UK–Glasgow. 172 French, N. R. and Steinberg, J. C. (1947). Factors Governing the Intelligibility of Speech Sounds. Journal of the Acoustical Society of America, 19(1):90–119. 68 Fu, Q., Yi, K., and Sun, M. (2000). Speech Quality Objective Assessment Using Neural Network. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’00), volume 3, 1511– 1514, TK–Istanbul. 183 Gabrielsson, A. and Sjögren, H. (1979). Perceived Sound Quality of Sound-Reproducing Systems. Journal of the Acoustical Society of America, 65(4):1019–1033. 29 Genuit, K. (1996). Objective Evaluation of Acoustic Quality Based on a Relative Approach. In Proc. Int. Congress on Noise Control Engineering (Internoise’96), volume 18, 3233–3238, UK–Liverpool. 80 Gierlich, H. W., Kettler, F., Poschen, S., and Reimes, J. (2008a). A New Objective Model for Wide-and Narrowband Speech Quality Prediction in Communications Including Background Noise. In Proc. 16th European Signal Processing Conference (EUSIPCO), CH–Lausanne. 81 Gierlich, H., Poschen, S., Kettler, F., Raake, A., and Spors, S. (2008b). Optimum Frequency Response Characteristics for Wideband Terminals. In Proc. ITU–T Workshop “From Speech to Audio: Bandwidth Extension, Binaural Perception”, FR–Lannion. 22 Glasberg, B. R. and Moore, B. C. J. (2002). A Model of Loudness Applicable to Time-Varying Sounds. Journal of the Audio Engineering Society, 50(5):331–342. 78, 169, 170 Goldstein, T. and Rix, A. W. (2004). Perceptual Speech Quality Assessment in Acoustic and Binaural Applications. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’04), volume 3, 1064–1067. 76 Grancharov, V. and Kleijn, W. B. (2007). Springer Handbook of Speech Processing, chapter Speech Quality Assessment, 83–99. Springer, DE–Berlin. 69 Gray, P., Hollier, M. P., and Massara, R. E. (2000). Non-Intrusive Speech-Quality Assessment Using Vocal-Tract Models. IEE Proceedings-Vision, Image, and Signal Processing, 147(6):493– 501. 83 Guéguin, M., Le Bouquin-Jeannes, R., Gautier-Turbin, V., Faucon, G., and Barriac, V. (2008). On the Evaluation of the Conversational Speech Quality in Telecommunications. EURASIP Journal on Advances in Signal Processing, 2008. Article ID 185248. 45 Guski, R. and Blauert, J. (2009). 
Psychoacoustics Without Psychology. In Proc. NAG/DAGA 2009 Int. Conf. on Acoustics, volume 3, 1550–1551. 55 Halka, U. and Heute, U. (1992). A New Approach to Objective Quality-Measures Based on Attribute-Matching. Speech Communication, 11(1):15–30. 78, 79 Hall, J. L. (2001). Application of Multidimensional Scaling to Subjective Evaluation of Coded Speech. Journal of the Acoustical Society of America, 110(4):2167–2182. 29, 30, 32, 33, 57
Hansen, J. H. and Pellom, B. L. (1998). An Effective Quality Evaluation Protocol for Speech Enhancement Algorithms. In Proc. 5th Int. Conf. on Spoken Language Processing (ICSLP), volume 7, 2819–2822, AU–Sydney. 68 Hansen, M. and Kollmeier, B. (1997). Using a Quantitative Psychoacoustical Signal Representation for Objective Speech Quality Measurement. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’97), volume 2, 1387–1390, DE–Munich. 73 Hardy, W. (2003). VoIP Service Quality: Measuring and Evaluating Packet-Switched Voice. McGraw-Hill, USA–New York. 16, 18, 55 Hauenstein, M. (1997). A Computationally Efficient Algorithm for Calculating Loudness Patterns of Narrowband Speech. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’97), volume 2, 1311–1314, DE–Munich. 73 Heute, U., Möller, S., Raake, A., Scholz, K., and Wältermann, M. (2005). Integral and Diagnostic Speech-Quality Measurement: State of the Art, Problems, and New Approaches. In Proc. Forum Acusticum, 1695–1700, HU–Budapest. 79 Hollier, M. P., Hawksford, M. O., and Guard, D. R. (1993). Characterisation of Communications Systems Using a Speech-Like Test Stimulus. Journal of the Audio Engineering Society, 41(12):1008–1021. 66 Hollier, M. P., Hawksford, M. O., and Guard, D. R. (1994). Error Activity and Error Entropy as a Measure of Psychoacoustic Significance in the Perceptual Domain. IEE Proceedings-Vision, Image and Signal Processing, 141(3):203–208. 74 Hollier, M. P., Hawksford, M. O., and Guard, D. R. (1995). Algorithms for Assessing the Subjectivity of Perceptually Weighted Audible Errors. Journal of the Audio Engineering Society, 43(12):1041–1045. 74 Honda, K. (2008). Springer Handbook of Speech Processing, chapter Physiological Processes of Speech Production, 7–26. Springer, DE–Berlin. 6 Huber, R. and Kollmeier, B. (2006). PEMO-Q—A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):1902–1911. 73, 151 Huo, L., Wältermann, M., Heute, U., and Möller, S. (2008a). Estimation Model for the Speech-Quality Dimension “Noisiness”. In Proc. 155th Meeting of the Acoust. Soc. of America / 5th Forum Acusticum / 9e Congrès Français d’Acoustique / 2nd ASA-EAA Joint Conference (Acoustics’08), 5921–5926. 80, 190, 209 Huo, L., Wältermann, M., Heute, U., and Möller, S. (2008b). Estimation of the Speech Quality Dimension “Discontinuity”. In Proc. 8th ITG-Fachbericht-Sprachkommunikation, DE–Aachen. 80, 172, 173, 175, 176, 178, 190, 217 Huo, L., Wältermann, M., Scholz, K., Raake, A., Heute, U., and Möller, S. (2007). Estimation Model for the Speech-Quality Dimension “Directness/Frequency Content”. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 78–81, USA–New Paltz, NY. 80, 164, 190 IEEE Standards Publication 297 (1969). Recommended Practice for Speech Quality Measurements. Institute of Electrical and Electronics Engineers, USA–New York, NY. 42 IPA Handbook (1999). A Guide to the Use of the International Phonetic Alphabet. International Phonetic Association. 6 Isherwood, D., Lorho, G., Zacharov, N., and Mattila, V.-V. (2003). Augmentation, Application and Verification of the Generalized Listener Selection Procedure. In Proc. 115th AES Convention, number 5894, USA–New York, NY. 43 ISO Standard 532–B (1975). Method for Calculating Loudness Level. International Organisation for Standardisation, CH–Geneva. 69, 169 ISO Standard 8402 (1994).
Quality Management and Quality Assurance – Vocabulary. International Organisation for Standardisation, CH–Geneva. 16 Itakura, F. (1975). Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72. 68 Itakura, F. and Saito, S. (1968). Analysis Synthesis Telephony Based on the Maximum Likelihood Method. In Kohasi, Y., editor, Proc. 6th Int. Congress on Acoustics, 17–20. 68
ITU–D TTR (2007). Trends in Telecommunication Reform, The Road to Next-Generation Networks (NGN). International Telecommunication Union, CH–Geneva, 8th edition. 1 ITU–R Rec. BS.1116–1 (1997). Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems. International Telecommunication Union, CH–Geneva. 48 ITU–R Rec. BS.1284–1 (2003). General Methods for the Subjective Assessment of Sound Quality. International Telecommunication Union, CH–Geneva. 48 ITU–R Rec. BS.1534–1 (2003). Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems. International Telecommunication Union, CH–Geneva. 48 ITU–R Rec. BS.1770–1 (2007). Algorithms to Measure Audio Programme Loudness and Truepeak Audio Level. International Telecommunication Union, CH–Geneva. 171 ITU–T Contrib. COM 12–11 (1993). Difference in Loudness and Speech Quality for 3.1 kHz and 7 kHz Handset Telephony. International Telecommunication Union, CH–Geneva. 50, 51, 105 ITU–T Contrib. COM 12–117 (2000). Report of the Question 13/12 Rapporteur’s meeting (Solothurn, Germany, 6–10 March 2000). International Telecommunication Union, CH– Geneva. 75 ITU–T Contrib. COM 12–120 (2007). Investigating the Proposed P.OLQA Subjective Test Method. International Telecommunication Union, CH–Geneva. 54 ITU–T Contrib. COM 12–141 (2010). Stable Draft of New Recommendation P.863. International Telecommunication Union, CH–Geneva. 77, 85 ITU–T Contrib. COM 12–143 (2010). Draft Requirement Specification for P.AMD (Perceptual Approaches for Multi-Dimensional Analysis). International Telecommunication Union, CH– Geneva. 82 ITU–T Contrib. COM 12–19 (2000). Results of Objective Speech Quality Assessment of Wideband Speech Using the Advanced TOSQA–2001. International Telecommunication Union, CH–Geneva. 72, 188 ITU–T Contrib. COM 12–20 (1997). Improvement of the P.861 Perceptual Speech Quality Measure. International Telecommunication Union, CH–Geneva. 71 ITU–T Contrib. COM 12–20 (2000). Results of Objective Speech Quality Assessment Including Receiving Terminals Using the Advanced TOSQA–2001. International Telecommunication Union, CH–Geneva. 73 ITU–T Contrib. COM 12–23 (2009). Wideband Equipment Impairment Factors Ie,wb for the G.729.1 and Packet Loss Robustness Factors Bpl for the G.729.1, the G.722.2 and the G.722 WB-Codecs for Usage in the WB-E-Model. International Telecommunication Union, CH– Geneva. 110 ITU–T Contrib. COM 12–34 (1997). TOSQA Telecommunication Objective Speech Quality Assessment. International Telecommunication Union, CH–Geneva. 72, 73, 153 ITU–T Contrib. COM 12–39 (2009). Comparison Between the Discrete ACR Scale and an Extended Continuous Scale for the Quality Assessment of Transmitted Speech. International Telecommunication Union, CH–Geneva. 53, 54 ITU–T Contrib. COM 12–62 (1998). Results of Processing ITU Speech Database Supplement 23 with the End-to-End Quality Assessment Algorithm “PACE”. International Telecommunication Union, CH–Geneva. 74 ITU–T Contrib. COM 12–82 (2009). Assessment of Speech Quality Dimensions: Methodology, Experiments, Analysis. International Telecommunication Union, CH–Geneva. 81 ITU–T Contrib. COM 16–83 (2006). Applicability of G.729.1 Enhancement Layers to 3GPP2 EVRC-family of Codecs. International Telecommunication Union, CH–Geneva. 93 ITU–T Contrib. COM12–42 (2002). Call for Model Submission for a New ITU–T Recommendation for Objective Mouth-to-Ear Speech Quality Assessment Algorithms Including Terminals. 
International Telecommunication Union, CH–Geneva. 76 ITU–T Contrib. COM12–59 (2007). Study on a Transfer Function Between MOS-LQON and MOS-LQOM Scales for Narrow-band Conditions. International Telecommunication Union, CH–Geneva. 92
ITU–T Del. Contrib. COM 12–109 (2003). Preliminary Results for the P.AAM Benchmark Models. International Telecommunication Union, CH–Geneva. 76 ITU–T Del. Contrib. COM 12–149 (2006). Equipment Impairment Factor Ie and Packet-loss Robustness Factor Bpl for Wideband Speech Codecs. International Telecommunication Union, CH–Geneva. 94, 106, 116, 224, 227 ITU–T Del. Contrib. COM 12–187 (2004). Performance Evaluation of the Wideband PESQ Algorithm. International Telecommunication Union, CH–Geneva. 92, 93 ITU–T Del. Contrib. COM 12–28 (2005). How Much Better Can Wideband Telephony Be? — Estimating the Necessary R-scale Extension. International Telecommunication Union, CH– Geneva. 105, 106 ITU–T Del. Contrib. COM 12–33 (2005). Subjective Quality Assessment Result for Wideband Speech Coding. International Telecommunication Union, CH–Geneva. 116, 227 ITU–T Del. Contrib. COM 12–41 (2001). Enhancement of P.862 Results by Post-processing with “TOCQ”. International Telecommunication Union, CH–Geneva. 76 ITU–T Del. Contrib. COM 12–44 (2001). Modelling Impairment due to Packet Loss for Application in the E-model. International Telecommunication Union, CH–Geneva. 64, 109 ITU–T Del. Contrib. COM 12–46 (2005). Discussion on Unified Objective Methodologies for the Comparison of Voice Quality of Narrowband and Wideband Scenarios. International Telecommunication Union, CH–Geneva. 224 ITU–T Del. Contrib. COM 12–5 (2005). Listening Quality Assessment of Wideband and Narrowband Speech. International Telecommunication Union, CH–Geneva. 227 ITU–T Del. Contrib. COM 12–6 (2001). Proposal for the Use of Draft Recommendation P.862, the Perceptual Evaluation of Speech Quality (PESQ), for Measurements in the Acoustic Domain With Background Masking Noise. International Telecommunication Union, CH–Geneva. 76 ITU–T Del. Contrib. COM 12–64 (2005). Examples of Ie and Bpl Values for Wideband Codecs. International Telecommunication Union, CH–Geneva. 227 ITU–T Del. Contrib. COM 12–7 (2001). Proposed Modification to Draft P.862 to Allow PESQ to Be Used for Quality Assessment of Wideband Speech. International Telecommunication Union, Ch–Geneva. 92, 93 ITU–T Del. Contrib. COM 12–70 (2005). Performance Evaluation of Wideband Extension of P.862 for WB and NB Codecs. International Telecommunication Union, CH–Geneva. 93 ITU–T Del. Contrib. COM 12–71 (2005). Perceptual Correlates of the E-model’s Impairment Factors. International Telecommunication Union, CH–Geneva. 32 ITU–T Del. Contrib. COM 12–77 (2005). Ie and R Values of Wideband Speech Coding. International Telecommunication Union, CH–Geneva. 227 ITU–T Handbook on Telephonometry (1992). International Telecommunication Union, CH– Geneva. 46, 50, 54, 55, 61, 88, 221 ITU–T Rec. E.800 (2008). Definitions of Terms Related to Quality of Service. International Telecommunication Union, CH–Geneva. 16 ITU–T Rec. E.802 (2007). Framework and Methodologies for the Determination and Application of QoS Parameters. International Telecommunication Union, CH–Geneva. 16, 17 ITU–T Rec. G.107 (1998). The E-Model, a Computational Model for Use in Transmission Planning. International Telecommunication Union, CH–Geneva. Superseded. 63, 83, 85, 105, 106, 109, 112, 115, 117, 119, 121, 168, 171, 172, 182, 214 ITU–T Rec. G.107 Appendix II (2009). Provisional Impairment Factor Framework for Wideband Speech Transmission. International Telecommunication Union, CH–Geneva. 110 ITU–T Rec. G.113 (2007). Transmission Impairments due to Speech Processing. International Telecommunication Union, CH–Geneva. 
57, 107, 109, 110, 113, 114, 116, 122, 124, 125, 130 ITU–T Rec. G.113 Amend. 1 (2009). Revised Appendix IV—Provisional Planning Values for the Wideband Equipment Impairment Factor and the Wideband Packet Loss Robustness Factor. International Telecommunication Union, CH–Geneva. 108, 110 ITU–T Rec. G.191 (2005). Software Tools for Speech and Audio Coding Standardization. International Telecommunication Union, CH–Geneva. 94, 127
ITU–T Rec. G.711 (1988). Pulse Code Modulation (PCM) of Voice Frequencies. International Telecommunication Union, CH–Geneva. 23, 63, 107, 109, 116 ITU–T Rec. G.722 (1988). 7 kHz Audio-coding Within 64 kbit/s. International Telecommunication Union, CH–Geneva. 93, 94, 95, 96, 98, 111, 112 ITU–T Rec. G.722.1 (2005). Low-complexity Coding at 24 and 32 kbit/s for Hands-free Operation in Systems With Low Frame Loss. International Telecommunication Union, CH–Geneva. 93, 95, 111 ITU–T Rec. G.722.2 (2003). Wideband Coding of Speech at Around 16 kbit/s Using Adaptive Multi-Rate Wideband (AMR-WB). International Telecommunication Union, CH–Geneva. 24, 93, 94, 95, 97, 98, 99, 100, 101, 111, 112, 213 ITU–T Rec. G.726 (1990). 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM). International Telecommunication Union, CH–Geneva. 23, 116, 178, 217, 219 ITU–T Rec. G.728 (1992). Coding of Speech at 16 kbit/s Using Low-Delay Code Excited Linear Prediction. International Telecommunication Union, CH–Geneva. 23 ITU–T Rec. G.729 (2007). Coding of Speech at 8 kbit/s Using Conjugate-Structure AlgebraicCode-Excited Linear Prediction (CS-ACELP). International Telecommunication Union, CH– Geneva. 24, 35, 71, 95, 116, 224 ITU–T Rec. G.729 Annex B (1996). A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70. International Telecommunication Union, CH– Geneva. 25 ITU–T Rec. G.729.1 (2006). Based Embedded Variable Bit-rate Coder: An 8-32 kbit/s Scalable Wideband Coder Bitstream Interoperable with G.729. International Telecommunication Union, CH–Geneva. 24, 95, 124, 125, 127, 222, 226, 227 ITU–T Rec. P.10/G.100 Amend. 2 (2008). New Definitions for Inclusion in Recommendation ITU– T P.10/G.100. International Telecommunication Union, CH–Geneva. 18 ITU–T Rec. P.341 (2005). Transmission Characteristics for Wideband (150-7000 Hz) Digital Hands-free Telephony Terminals. International Telecommunication Union, CH–Geneva. 92, 101, 117 ITU–T Rec. P.48 (1988). Specification for an Intermediate Reference System. International Telecommunication Union, CH–Geneva. 22, 61, 63, 89, 92, 150 ITU–T Rec. P.50 (1999). Artificial Voices. International Telecommunication Union, CH–Geneva. 66 ITU–T Rec. P.56 (1993). Objective Measurement of Active Speech Level. International Telecommunication Union, CH–Geneva. 94 ITU–T Rec. P.561 (2002). In-Service Non-Intrusive Measurement Device—Voice Service Measurements. International Telecommunication Union, CH–Geneva. 83 ITU–T Rec. P.562 (2004). Analysis and Interpretation of INMD Voice-Service Measurements. International Telecommunication Union, CH–Geneva. 83 ITU–T Rec. P.563 (2004). Single-Ended Method for Objective Speech Quality Assessment in Narrow-band Telephony Applications. International Telecommunication Union, CH–Geneva. 83, 84, 85 ITU–T Rec. P.564 (2007). Conformance Testing for Voice over IP Transmission Quality Assessment Models. International Telecommunication Union, CH–Geneva. 84 ITU–T Rec. P.58 (1996). Head and Torso Simulator for Telephonometry. International Telecommunication Union, CH–Geneva. 48, 76 ITU–T Rec. P.59 (1993). Artificial Conversation Speech. International Telecommunication Union, CH–Geneva. 66 ITU–T Rec. P.76 (1988). Determination of Loudness Ratings; Fundamental Principles. International Telecommunication Union, CH–Geneva. 61 ITU–T Rec. P.79 (2007). Calculation of Loudness Ratings for Telephone Sets. International Telecommunication Union, CH–Geneva. 61, 105 ITU–T Rec. P.800 (1996). 
Methods for Subjective Determination of Transmission Quality. International Telecommunication Union, CH–Geneva. 15, 32, 33, 40, 43, 45, 46, 47, 48, 50, 83, 111, 134, 170, 192, 221
ITU–T Rec. P.800.1 (2006). Mean Opinion Score (MOS) Terminology. International Telecommunication Union, CH–Geneva. 18, 39, 40, 45, 56, 92 ITU–T Rec. P.805 (2007). Subjective Evaluation of Conversational Quality. International Telecommunication Union, CH–Geneva. 45 ITU–T Rec. P.810 (1996). Modulated Noise Reference Unit (MNRU). International Telecommunication Union, CH–Geneva. 57, 219 ITU–T Rec. P.830 (1996). Subjective Performance Assessment of Telephone-band and Wideband Digital Codecs. International Telecommunication Union, CH–Geneva. 22, 46, 47, 89, 111, 134, 221, 224 ITU–T Rec. P.833 (2001). Methodology for Derivation of Equipment Impairment Factors From Subjective Listening-only Tests. International Telecommunication Union, CH–Geneva. 57, 88, 110 ITU–T Rec. P.833.1 (2008). Methodology for the Derivation of Equipment Impairment Factors from Subjective Listening-only Tests for Wideband Speech Codecs. International Telecommunication Union, CH–Geneva. 58, 105, 110, 111, 112, 113, 114, 115, 122, 123, 130 ITU–T Rec. P.834 (2002). Methodology for the Derivation of Equipment Impairment Factors from Instrumental Models. International Telecommunication Union, CH–Geneva. 58, 105, 114, 128 ITU–T Rec. P.834.1 (2009). Extension of the Methodology for the Derivation of Equipment Impairment Factors from Instrumental Models for Wideband Speech Codecs. International Telecommunication Union, CH–Geneva. 58, 88, 105, 115, 117, 126, 128, 130, 131, 214 ITU–T Rec. P.835 (2003). Subjective Test Methodology for Evaluating Speech Communication Systems That Include Noise Suppression Algorithm. International Telecommunication Union, CH–Geneva. 50, 81 ITU–T Rec. P.861 (1996). Objective Quality Measurement of Telephone-band (300-3400 Hz) Speech Codecs. International Telecommunication Union, CH–Geneva. 71, 75, 134 ITU–T Rec. P.862 (2001). Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-end Speech Quality Assessment of Narrow-band Telephone Networks and Speech Codecs. International Telecommunication Union, CH–Geneva. 2, 71, 75, 76, 85, 88, 97, 98, 102, 130, 134, 136, 150, 213, 223 ITU–T Rec. P.862.1 (2003). Mapping Function for Transforming P.862 Raw Result Scores to MOS-LQO. International Telecommunication Union, CH–Geneva. 91 ITU–T Rec. P.862.2 (2005). Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs. International Telecommunication Union, CH–Geneva. 3, 66, 76, 88, 92, 97, 103, 134, 213 ITU–T Rec. P.862.3 (2005). Application Guide for Objective Quality Measurement Based on Recommendations P.862, P.862.1 and P.862.2. International Telecommunication Union, CH– Geneva. 75, 94, 121, 224 ITU–T Suppl. 23 to P-Series Rec. (1998). ITU–T Coded-speech Database. International Telecommunication Union, CH–Geneva. 71, 72, 74, 75, 83, 116, 224 ITU–T Suppl. 3 to P-Series Rec. (1993). Models for Predicting Transmission Quality from Objective Measurements. International Telecommunication Union, CH–Geneva. 61, 62 ITU–T Temp. Doc. TD.10 Rev.1 (2003). Status Report of Question 9/12. International Telecommunication Union, CH–Geneva. 77 ITU–T Temp. Doc. TD.52 Rev.1 (2007). Requirement Specification for P.OLQA. International Telecommunication Union, CH–Geneva. 77, 133, 227 ITU–T Temp. Doc. TD.65 (2005). Quality Assessment Qualification Test Plan for the ITU–T G.729 based embedded variable bit-rate (G.729EV) extension to the ITU–T G.729 Speech Codec. International Telecommunication Union, CH–Geneva. 223, 226 ITU–T Temp. Doc. TD.71 (2005). 
Qualification phase of G729EV: Test Results (Exp 1-4). International Telecommunication Union, CH–Geneva. 223, 226 Jekosch, U. (2008). Semio-acoustics: a Domain of Communication Acoustics. Journal of the Acoustical Society of America, 123(5):3417–3417. 11
Jekosch, U. (1993). Speech Quality Assessment and Evaluation. In Proc. of the 3rd European Conference on Speech Communication and Technology (EuroSpeech ’93), 1387–1394, DE–Berlin. 45 Jekosch, U. (2005). Voice and Speech Quality Perception: Assessment and Evaluation. Springer, DE–Berlin. 9, 11, 12, 13, 15, 37, 38, 39, 40, 53, 59, 60, 77 Johannesson, N. O. (1997). The ETSI Computation Model: a Tool for Transmission Planning of Telephone Networks. IEEE Communications Magazine, 35(1):70–79. 62 Juric, P. (1998). An Objective Speech Quality Measurement in the QVoice. In Proc. of the 5th Int. Workshop on Systems, Signals and Image Processing (IWSSIP’98), 156–163, HR–Zagreb. 74 Karjalainen, M. (1985). A New Auditory Model for the Evaluation of Sound Quality of Audio Systems. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’85), volume 10, 608–611. 69 Kim, D.-S. and Tarraf, A. (2004). Perceptual Model for Non-Intrusive Speech Quality Assessment. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’04), volume 3, 1060–1063. 83 Kitawaki, N., Honda, M., and Itoh, K. (1984). Speech-Quality Assessment Methods for Speech-Coding Systems. IEEE Communications Magazine, 22(10):26–33. 68, 71 Kitawaki, N., Itoh, K., Honda, M., and Kakehi, K. (1982). Comparison of Objective Speech Quality Measures for Voiceband Codecs. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’82), volume 7, 1000–1003. 68 Kitawaki, N., Nagai, K., and Yamada, T. (2004). Objective Quality Assessment of Wideband Speech Coding using W-PESQ Measure and Artificial Voice. In Proc. Int. Conf. Measurement of Speech and Audio Quality in Networks (MESAQIN), 1–6, CZ–Prague. 66 Kitawaki, N., Nagai, K., and Yamada, T. (2005). Objective Quality Assessment of Wideband Speech Coding. IEEE Transactions on Communications, E88-B(3):1111–1118. 93 Klatt, D. (1982). Prediction of Perceived Phonetic Distance from Critical-band Spectra: A First Step. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’82), volume 7, 1278–1281. 69, 175 Kruskal, J. B. (1964). Multidimensional Scaling by Optimizing Goodness-of-Fit to a Non-Metric Hypothesis. Psychometrika, 29(1):1–27. 51 Kubichek, R. F., Quincy, E. A., and Kiser, K. L. (1989). Speech Quality Assessment Using Expert Pattern Recognition. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 208–211, CA–Victoria, BC. 71 Kühnel, C., Scholz, K., and Heute, U. (2008). Dimension-Based Speech-Quality Assessment: Development of an Estimator of the Dimension “Noisiness”. In Proc. 8th ITG-Fachbericht-Sprachkommunikation, DE–Aachen. 178 Lalou, J. (1990). The Information Index: An Objective Measure of Speech Transmission Performance. Annals of Telecommunications, 45(1–2):47–65. 62, 71 Lam, K. H., Au, O. C., Chan, C. C., Hui, K. F., and Lau, S. F. (1996). Objective Speech Quality Measure for Cellular Phone. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’96), volume 1, 487–490, USA–Atlanta, GA. 65 Law, B. H. and Seymour, R. A. (1962). A Reference Distortion System Using Modulated Noise. In Proc. IEE-Part B: Electronic and Communication Engineering, number 3992-E, 484–485. 57 Leman, A., Faure, J., and Parizet, E. (2008). Influence of Informational Content of Background Noise on Speech Quality Evaluation for VoIP Application. In Proc. 155th Meeting of the Acoust. Soc.
of America / 5th Forum Acusticum / 9e Congrès Français d’Acoustique / 2nd ASA-EAA Joint Conference (Acoustics’08), 471–476. 85 Leman, A., Faure, J., and Parizet, E. (2009). A Non-Intrusive Signal-Based Model for Speech Quality Evaluation Using Automatic Classification of Background Noises. In Proc. 12th Int. Conf. on Spoken Language Processing (ICSLP), 1139–1142, UK-Brighton. 178 Letowski, T. (1989). Sound Quality Assessment: Concepts and Criteria. In Proc. 87th AES Convention, number 2825, USA–New York, NY. 41, 42
Liang, J. and Kubichek, R. (1994). Output-Based Objective Speech Quality. In IEEE 44th Vehicular Technology Conference, volume 3, 1719–1723. 84 Liang, Y. J., Farber, N., and Girod, B. (2001). Adaptive Playout Scheduling Using Time-Scale Modification in Packet Voice Communications. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’01), volume 3, 1445–1448, USA–Salt Lake City, Utah. 26 Lotto, A. J. and Sullivan, S. C. (2008). Auditory Perception of Sound Sources, volume 29, chapter Speech as a Sound Source, 281–305. Springer, DE–Berlin. 11, 52 Malfait, L., Berger, J., and Kastner, M. (2006). P. 563—The ITU–T Standard for Single-Ended Speech Quality Assessment. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):1924–1934. 83 Malfait, L., Gray, P., and Reed, M. J. (2008). Objective Listening Quality Assessment of Speech Communication Systems Introducing Continuously Varying Delay (Time-Warping): A Time Alignment Issue. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’08), 4213–4216. 130, 138 Mattila, V. V. (2002a). Descriptive Analysis and Ideal Point Modelling of Speech Quality in Mobile Communication. In Proc. 113th AES Convention, number 5704, USA–Los Angeles, CA. Audio Engineering Society. 29, 31, 32, 33 Mattila, V. V. (2002b). Ideal Point Modelling of Speech Quality in Mobile Communications Based on Multidimensional Scaling (MDS). In Proc. 112th AES Convention, number 5546, DE– Munich. Audio Engineering Society. 29, 31, 33 Mattila, V.-V. (2002c). Analysing Individual Differences in Speech Quality with Internal Preference Mapping. In Proc. Int. Conf. Measurement of Speech and Audio Quality in Networks (MESAQIN), CZ–Prague. 31 McDermott, B. J. (1969). Multidimensional Analyses of Circuit Quality Judgments. Journal of the Acoustical Society of America, 45(3):774–781. 29, 30, 34 McGee, V. E. (1964). Semantic Components of the Quality of Processed Speech. Journal of Speech, Language and Hearing Research, 7(4):310–323. 49 Mermelstein, P. (1979). Evaluation of a Segmental SNR Measure as an Indicator of the Quality of ADPCM Coded Speech. Journal of the Acoustical Society of America, 66(6):1664–1667. 67 Möller, S. (2000). Assessment and Prediction of Speech Quality in Telecommunications. Kluwer Academic Publ., USA–Boston, MA. 2, 11, 15, 17, 18, 42, 45, 52, 53, 64 Möller, S. and Raake, A. (2002). Telephone Speech Quality Prediction: Towards Network Planning and Monitoring Models for Modern Network Scenarios. Speech Communication, 38(1-2):47– 75. 64 Möller, S. (2005). Quality of Telephone-based Spoken Dialogue Systems. Springer, DE–Berlin. 39 Möller, S., Raake, A., Kitawaki, N., Takahashi, A., and Wältermann, M. (2006). Impairment Factor Framework for Wideband Speech Codecs. IEEE Transactions on Audio, Speech and Language Processing, 14(6):1969–1976. 55, 109, 110, 116, 119, 121, 123, 126 Moore, B. C. J., Glasberg, B. R., and Baer, T. (1997). A Model for the Prediction of Thresholds, Loudness, and Partial Loudness. Journal of the Audio Engineering Society, 45(4):224–240. 8, 34, 85, 216 Moore, B. C. J. and Tan, C. T. (2004). Development and Validation of a Method for Predicting the Perceived Naturalness of Sounds Subjected to Spectral Distortion. Journal of the Audio Engineering Society, 52(9):900–914. 79 Moore, B. C. J., Tan, C. T., Zacharov, N., and Mattila, V. V. (2004). Measuring and Predicting the Perceived Quality of Music and Speech Subjected to Combined Linear and Nonlinear Distortion. 
Journal of the Audio Engineering Society, 52(12):1228–1244. 78, 79 Nagle, A., Quinquis, C., Sollaud, A., and Slock, D. (2007). Quality Impact of Diotic versus Monaural Hearing on Processed Speech. In Proc. 123rd AES Convention, number 7720. 108, 216, 221, 225 Nygaard, L. and Pisoni, D. (1998). Talker-Specific Learning in Speech Perception. Perception and Psychophysics, 60(3):355–376. 12
Ogden, C. and Richards, I. (1923). The Meaning of Meaning: A Study of the Influence of Language Upon Thought and of the Science of Symbolism. Kegan Paul, Trench, Trubner & Co., UK– London. 9 Osaka, N. and Kakehi, K. (1986). Objective Model for Evaluating Telephone Transmission Performance. Review of the Electrical Communication Laboratories, 34(4):437–444. 62 Osgood, C. E. (1952). The Nature and Measurement of Meaning. Psychological Bulletin, 49(3):197–237. 38, 39, 49 Pardo, J. S. and Remez, R. E. (2006). Handbook of Psycholinguistics, chapter The Perception of Speech, 201–248. Academic Press., USA–New York, NY, 2nd edition. 6, 12, 14 Poulton, E. C. (1979). Models for Biases in Judging Sensory Magnitude. Psychological Bulletin, 86(4):777–803. 52, 53 Pourmand, N., Suelzle, D., Parsa, V., Hu, Y., and Loizou, P. (2009). On the Use of Bayesian Modeling for Predicting Noise Reduction Performance. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’09), 3873–3876, Taipei, TW. 85 Preston, C. C. and Colman, A. M. (2000). Optimal Number of Response Categories in Rating Scales: Reliability, Validity, Discriminating Power, and Respondent Preferences. Acta Psychologica, 104(1):1–15. 53 Purves, D., Augustine, G., Fitzpatrick, D., Katz, L., LaMantia, A., McNamara, J., and Williams, S. (2004). Neurosciences. Sinauer Associates, USA–Sunderland, MA, 3rd edition. 7 Quackenbush, S. and Barnwell, T. (1985). Objective Estimation of Perceptually Specific Subjective Qualities. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’85), volume 10, 419–422. 65 Quackenbush, S. R., Barnwell, T., and Clements, M. A. (1988). Objective Measures of Speech Quality. Prentice Hall, USA–Englewood Cliffs, NJ. 78, 79 Raake, A. (2006a). Short-and Long-Term Packet Loss Behavior: Towards Speech Quality Prediction for Arbitrary Loss Distributions. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):1957–1968. 109 Raake, A. and Katz, B. F. G. (2006). SUS-based Method for Speech Reception Threshold Measurement in French. In Proc. 5th Int. Conf. on Language Resources and Evaluation (LREC), 2028–2033, IT-Genoa. 45 Raake, A., Spors, S., Maempel, H., Marszalek, T., Ciba, S., and Cote, N. (2008). Speech Quality of Wide-and Narrowband Speech Codecs: Object-and Subject-Oriented View. In Fortschritte der Akustik (DAGA ’08). 43 Raake, A., Möller, S., Wältermann, M., Côté, N., and Ramirez, J.-P. (2010). Parameter-Based Prediction of Speech Quality in Listening Context—Towards a WB E-Model. In Proc. 2nd Int. Workshop on Quality of Multimedia Experience (QoMEX’10), Trondheim, Norway. 63, 131 Raake, A. (2006b). Speech Quality of VoIP - Assessment and Prediction. Wiley, UK–Chichester. 11, 13, 14, 42, 102, 106, 131, 164, 167, 168 Rabiner, L. (1995). The Impact of Voice Processing on Modern Telecommunications. Speech Communication, 17(3–4):217–226. 59 Raja, A., Azad, R., Flanagan, C., and Ryan, C. (2008). A Methodology for Deriving VoIP Equipment Impairment Factors for a Mixed NB/WB Context. IEEE Transactions on Multimedia, 10(6):1046–1058. 109, 121 Richards, D. L. (1974). Calculation of Opinion Scores for Telephone Connections. Proc. Institution of Electrical Engineers (IEE), 121(5):313–323. 62 Richters, J. S. and Dvorak, C. A. (1988). A Framework for Defining the Quality of Communications Services. IEEE Communications Magazine, 26(10):17–23. 17 Rix, A. and Hollier, M. (1999). Perceptual Speech Quality Assessment from Narrowband Telephony to Wideband Audio. In Proc. 
107th AES Convention, number 5018. 74 Rix, A., Reynolds, R., and Hollier, M. (1999). Robust Perceptual Assessment of End-to-End Audio Quality. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 39–42, USA–New Paltz, NY. 74
Rix, A., Berger, J., and Beerends, J. (2003). Perceptual Quality Assessment of Telecommunications Systems Including Terminals. In Proc. 114th AES Convention, number 5724, NL–Amsterdam. 76 Rix, A., Hollier, M., Hekstra, A., and Beerends, J. (2002). Perceptual Evaluation of Speech Quality (PESQ), the New ITU Standard for End-to-End Speech Quality Assessment, Part I-Time Alignment. Journal of the Audio Engineering Society, 50(10):755. 75, 82, 88, 89, 138, 142, 143 Rothauser, E. H., Urbanek, G. E., and Pachl, W. P. (1968). Isopreference Method for Speech Evaluation. Journal of the Acoustical Society of America, 44(2):408–418. 34, 57 Scholz, K. and Heute, U. (2008). Dimension-based Speech-Quality Assessment: Instrumental Measure for the Overall Quality of Telephone-band Speech. In Proc. 8th ITG-Fachtagung Sprachkommunikation, DE–Aachen. 79, 80 Scholz, K., Kühnel, C., Wältermann, M., Möller, S., and Heute, U. (2008). Assessment of the Speech-Quality Dimension “Noisiness” for the Instrumental Estimation and Analysis of Telephone-band Speech Quality. In Proc. 11th Int. Conf. on Spoken Language Processing (ICSLP), 703–706, AU–Brisbane. ISCA. 80, 178 Scholz, K., Wältermann, M., Huo, L., Raake, A., Möller, S., and Heute, U. (2006). Estimation of the Quality Dimension “Directness/Frequency Content” for the Instrumental Assessment of Speech Quality. In Proc. 9th Int. Conf. on Spoken Language Processing (ICSLP), 1523–1526, USA–Pittsburgh, PA. 79, 164, 165, 166, 167 Scholz, K. (2008). Instrumentelle Qualitätsbeurteilung von Telefonbandsprache beruhend auf Qualitätsattributen (Instrumental quality assessment of telephone-band speech based on quality attributes). PhD thesis, Christian-Albrechts-Universität, DE–Kiel. 80, 190, 207, 209 Schroeder, M. R., Atal, B. S., and Hall, J. L. (1979). Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear. Journal of the Acoustical Society of America, 66(6):1647–1652. 69 Sen, D. (2001). Determining the Dimensions of Speech Quality from PCA and MDS Analysis of the Diagnostic Acceptability Measure. In Proc. Int. Conf. Measurement of Speech and Audio Quality in Networks (MESAQIN), volume 3, CZ–Prague. 29 Sen, D. (2004). Predicting Foreground SH, SL and BNH DAM Scores for Multidimensional Objective Measure of Speech Quality. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’04), volume 1, 493–496. 78 O’Shaughnessy, D. (2000). Speech Communication: Human and Machine. Addison-Wesley, USA–New York, NY. 6 Shiran, N. and Shallom, I. (2009). Enhanced PESQ Algorithm for Objective Assessment of Speech Quality at a Continuous Varying Delay. In Quality of Multimedia Experience (QoMEx 2009), 157–162. 130, 190, 209 Skype (Retrieved 28 January 2010). http://www.skype.com. 17 Sottek, R. (1993). Modelle zur Signalverarbeitung im menschlichen Gehör (Models of signal processing in the human auditory system). PhD thesis, Rhein.-Westf. Techn. Hochschule, DE–Aachen. 80 Stevens, S. (1957). On the Psychophysical Law. Psychological Review, 64(3):153–181. 38, 42 Sydow, C. (2004). Practical Limitations of Wideband Terminals. In ETSI Workshop on Wideband Speech Quality in Terminals and Networks: Assessment and Prediction, DE–Mainz. 22 Takahashi, A., Yoshino, H., and Kitawaki, N. (2004). Perceptual QoS assessment technologies for VoIP. IEEE Communications Magazine, 42(7):28–34. 59 Takahashi, A., Kurashima, A., Morioka, C., and Yoshino, H. (2005a). Objective Quality Assessment of Wideband Speech by an Extension of ITU–T Recommendation P.862. In Proc. 8th Int. Conf. on Spoken Language Processing (ICSLP), 3153–3156, PT–Lisbon. 94 Takahashi, A., Kurashima, A., and Yoshino, H. (2005b).
Subjective Quality Index for Compatibly Evaluating Narrowband and Wideband Speech. In Proc. Int. Conf. Measurement of Speech and Audio Quality in Networks (MESAQIN), CZ–Prague. 54, 56, 116, 227 Tan, C. T., Moore, B. C. J., Zacharov, N., and Mattila, V. V. (2004). Predicting the Perceived Quality of Nonlinearly Distorted Music and Speech Signals. Journal of the Audio Engineering Society, 52(7-8):699–711. 79
242
References
Thorpe, L. and Yang, W. (1999). Performance of Current Perceptual Objective Speech Quality Measures. In IEEE Workshop on Speech Coding, 144–146. 75 Thorpe, L. and Rabipour, R. (2000). Changes in Voice Quality Judgments as a Function of Background Noise Level in the Listening Environment. In IEEE Workshop on Speech Coding, 26–28, USA-Delavan, WI. 85 Tribolet, J., Noll, P., McDermott, B., and Crochiere, R. (1978). A Study of Complexity and Quality of Speech Waveform Coders. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’78), volume 3, 586–590. 67, 68 Vaalgamaa, M. (2007). Intelligent Audio in VoIP - Benefits, Challenges and Solutions. In Proc. 30th AES Int. Conf.: Intelligent Audio Environments, number 35, FI–Saariselkä. 17 Voiers, W. D. (1977). Diagnostic Acceptability Measure for Speech Communication Systems. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’77), 204–207, Hartford. 44, 49, 78 Volberg, L., Kulka, M., Sust, C., and Lazarus, H. (2006). Speech Intelligibility and the Subjective Assessment of Speech Quality in Near Real Communication Conditions. Acta Acustica united with Acustica, 92(3):406–416. 12, 44 Voran, S. (1999a). Objective Estimation of Perceived Speech Quality—Part I. Development of the Measuring Normalizing Block Technique. IEEE Transactions on Speech and Audio Processing, 7(4):371–382. 71 Voran, S. (1999b). Objective Estimation of Perceived Speech Quality—Part II: Evaluation of the Measuring Normalizing Block Technique. IEEE Transactions on Speech and Audio Processing, 7(4):383–390. 71 Wältermann, M., Raake, A., and Möller, S. (2006a). Perceptual Dimensions of WidebandTransmitted Speech. In Proc. 2nd ISCA/DEGA Tutorial and Research Workshop on Perceptual Quality of Systems, 103–108. 29, 33, 34, 164, 225 Wältermann, M., Raake, A., and Möller, S. (2010a). Quality Dimensions of Narrowband and Wideband Speech Transmission. Acta Acustica united with Acustica, 96(6):1090–1103. 33, 164 Wältermann, M., Scholz, K., Möller, S., Huo, L., Raake, A., and Heute, U. (2008). An Instrumental Measure for End-to-end Speech Transmission Quality Based on Perceptual Dimensions: Framework and Realization. In Proc. 11th Int. Conf. on Spoken Language Processing (ICSLP), 22–26. 50, 178 Wältermann, M., Scholz, K., Raake, A., Heute, U., and Möller, S. (2006b). Underlying Quality Dimensions of Modern Telephone Connections. In Proc. 9th Int. Conf. on Spoken Language Processing (ICSLP), 2170–2173, USA–Pittsburgh, PA. 29, 32, 34, 79, 80, 81, 216, 222 Wältermann, M. and Raake, A. (2008). Towards a New E-Model Impairment Factor for Linear Distortion of Narrowband and Wideband Speech Transmission. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’08), 4817–4820, USA–Las Vegas NV. 108 Wältermann, M., Raake, A., and Möller, S. (2010b). Analytical Assessment and Distance Modeling of Speech Transmission Quality. In Proc. 11th Annual Conference of the Int. Speech Communication Association (Interspeech 2010), 1313–1316, Makuhari, Japan. 50, 206, 226 Wang, S., Sekey, A., and Gersho, A. (1992). An Objective Measure for Predicting Subjective Quality of Speech Coders. IEEE Journal on Selected Areas in Communications, 10(5):819– 829. 70 Wolf, S., Dvorak, C., Kubichek, R., South, C., Schaphorst, R., and Voran, S. (1991). Future Work Relating Objective and Subjective Telecommunications System Performance. In Proc. Global Telecommunications Conference (GLOBECOM’91), 2129–2134. 20, 58, 59 Yang, W. (1999). 
Enhanced Modified Bark Spectral Distortion (EMBSD): An Objective Speech Quality Measure Based on Audible Distortion and Cognition Model. PhD thesis, Temple University, USA–Philadelphia, PA. 70 Yang, W., Benbouchta, M., and Yantorno, R. (1998). Performance of the Modified Bark Spectral Distortion as an Objective Speech Quality Measure. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’98), volume 1, 541–544. 70
References
243
Young, E. D. (2008). Neural Representation of Spectral and Temporal Information in Speech. Philosophical Transactions B, 363(1493):923–945. 9 Zelinski, R. and Noll, P. (1977). Adaptive Transform Coding of Speech Signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(4):299–309. 67 Zieli´nski, S., Rumsey, F., and Bech, S. (2008). On Some Biases Encountered in Modern Audio Quality Listening Tests–A Review. Journal of the Audio Engineering Society, 56(6):427–451. 52, 53 Zwicker, E. and Fastl, H. (1990). Psychoacoustics: Facts and models. Springer, DE–Berlin, 1st edition. 7, 69, 72, 78, 90, 152, 157, 169, 171, 180 Zwicker, E., Flottorp, G., and Stevens, S. (1957). Critical Bandwidth in Loudness Summation. Journal of the Acoustical Society of America, 29(5):548–557. 8, 151, 152, 165, 216
Index
For terms that are frequently repeated in the book, only selected places are given. Places printed in italic refer to the definition or explanation of the terms.

A
acceptability 16, 18, 49
advantage factor 63
    of access 63
anchor 48, 55–58, 110–112, 183, 219
anchoring 62
articulation 5–6
artificial
    head 48, 76, 149
    mouth 149
    voice 66
assessment 39, 37–85, 219
Asymmetric Specific Loudness Difference (ASLD) 73
asymmetry 70, 73, 90–91, 157, 160
asymmetrical disturbance 97, 99, 102
auditory
    event 7, 38
    memory 10
    system 7–9, 85, 89
    transformation see perceptual, transformation
Automatic Gain Control (AGC) 27, 158

B
bandpass 21–23, 72, 92, 95, 191, 199, 204, 222, 223, 226
bandwidth 6, 35, 40, 54, 55, 94, 164–168, 171
    critical see critical-band rate
    Equivalent Rectangular Bandwidth (ERB) 9, 78, 79, 166, 168
    impairment 108, 209
    restriction 104, 154, 158, 199
Bark 8, 8–9, 69, 72, 78, 89, 155, 157, 158, 160, 167, 175, 179–181
    Spectral Distortion (BSD) 70, 189
bark 152–153
bit-rate 23
buffer see jitter, buffer
burst(y) 20, 25, 84, 108–110

C
Call Clarity Index (CCI) 83
Cepstral Distance (CD) 68
Class of Service (CoS) 16
cochlea 8, 158
codec 23, 22–26, 57–58, 126, 199, 202, 219
    impairment see impairment factor, equipment impairment factor (Ie)
    Linear Predictive Coding (LPC) 24, 68
    tandeming 107, 112, 191
    wideband 24, 33, 48, 94, 97–99, 106–108, 111–112
coloration 35, 50
    estimator 164–168, 204–209
comprehension 11
context 11–12, 14, 40, 52, 102, 134, 149, 188, 207, 225, 228
    effect 55–56, 62, 105–106
continuity see discontinuity
conversation effectiveness 18, 45
correlation (ρ) 59, 117, 193
critical-band rate 8–9

D
delay 18, 21, 23, 25–27, 45, 74, 84, 89, 138–149, 172–173, 210
    variation see jitter
diagnostic 204–209, 228
Diagnostic Acceptability Measure (DAM) 49, 78
Diagnostic Instrumental Assessment of Listening quality (DIAL) 135–187, 197–209
    measure 42, 77–82
diotic 47, 108, 150
Directness/Frequency Content (DFC) 35, 79
discontinuity 34, 50, 79
    estimator 172–178, 217

E
E-model 62–64, 83, 105–110, 120, 171
ease of communication 18
echo 10, 19, 26–27, 30, 45
error 23
    bit error 20, 34, 112, 135
    Bit Error Rate (BER) 20
    entropy 74
    Frame Error Rate (FER) 20
    modified prediction error (σ*) 195, 197, 201, 206
    pattern 108
    prediction error (σ) 59, 193, 195
    transmission error 31, 64, 112, 114, 127, 199, 203
estimation 58, 59, 60, 84, 88–104
evaluation 17, 39, 50, 52, 93–99, 103–104, 187–210

F
formant 6, 139
frequency
    center frequency (fc) 167, 168
    compensation 89, 97, 154–156
    fundamental frequency (F0) 5, 98, 139
    Nyquist frequency 67
    sampling frequency (fS) 22, 137, 149, 187–190, 210
Full-Band (FB) 20, 188

G
gain
    compensation 89, 156–157
    function 165, 167
Global System for Mobile communication (GSM) 20, 24, 134

H
Hands-Free Terminal (HFT) 21, 26, 135, 164, 199, 204
handset 17, 19, 21, 26, 61, 76, 89, 149, 222
headphone 48, 117, 150
headset 21, 76, 222

I
impairment factor 62–64, 183
    bandwidth impairment factor (Ibw) 168
    effective equipment impairment factor (Ie,eff) 63, 108–110
    equipment impairment factor (Ie) 57, 63, 106–108
    loudness impairment factor (IL) 171
    noisiness impairment factor (Inoi) 181
    WB effective equipment impairment factor (Ie,WB,eff) 110, 114
    WB equipment impairment factor (Ie,WB) 107, 113, 121–126
In-service Non-intrusive Measurement Device (INMD) 83
instrumental measurement method 16, 39, 58–84
intelligibility 11, 12, 15, 44–45, 68, 168
Intermediate Reference System (IRS) 21, 61, 89, 92, 117, 149

J
jitter 25, 89
    buffer 25, 108, 175

K
k-Nearest Neighbors (k-NN) 183–186, 210

L
level
    Active Speech Level (ASL) 94, 137
    alignment 88, 188, 189
    equivalent continuous sound level (Leq) 171, 171
    listening level 34, 35, 46, 50, 135, 152, 221
    optimum listening level 34, 50, 105, 161, 169
    preferred listening level 34, 50, 88
loudness 8, 35, 90, 157–158, 161
    estimator 168–172, 216
    Long-Term Loudness (LTL) 170
    Loudness Rating (LR) 61
    Short-Term Loudness (STL) 158, 160–162, 169

M
mapping 91, 92, 163, 193
masking 90, 96, 102, 159
Mean Opinion Score (MOS) 30, 39, 53, 55–56, 63, 81, 92, 112, 122, 136, 163, 168, 171, 177, 181, 195, 210
Measuring Normalizing Blocks (MNB) 71
model 58–84
    core model 150–164, 216
    intrusive model 64, 115–116, 187–190
    judgment model 182–186, 217
    non-intrusive model 64, 82–84
    opinion model 61–62
    packet-layer model 60, 84
    parameter-based model 59, 60–64
    signal-based model 60, 64–84
Modulated Noise Reference Unit (MNRU) 57, 219
monotic 47, 108, 150
MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) 48
multidimensional 14, 15, 27–35, 49, 52, 78
    scaling see scaling, multidimensional scaling

N
Narrow-Band (NB) 19, 55, 118, 128, 149, 191, 200–204, 221–224
network 19–22, 134
    circuit-switched network 19, 25, 173
    monitoring 60, 82, 84
    packet-switched network 20, 25, 60, 84, 173
    planning 60, 61, 105
noise
    additive noise 67, 80
    background noise 18, 19, 31, 44, 50, 81, 140, 142, 151, 179–180, 191, 204
    Comfort Noise Generation (CNG) 25
    on speech 33, 181
    reduction algorithm 27, 50
    signal-correlated noise 57, 67, 80, 219
noisiness 34, 50, 79, 162
    estimator 178–181, 217
normalization 56–58, 112–114, 122

O
objectivity 38, 39
order-effect 56

P
P341 94, 101
packet loss 63
    Packet Loss Concealment (PLC) 25, 80, 108, 172, 175, 203
    percentage (Ppl) 109
    robustness factor (Bpl) 64, 108–110, 127–128
PEMO-Q 73, 189, 198
perceptual
    transformation 70, 73, 75, 89–90, 151–153, 165
Perceptual Analysis Measurement System (PAMS) 74–75, 88
Perceptual Evaluation of Speech Quality (PESQ) 75–76, 88–92, 134, 139–140, 142–143, 189
    Enh. PESQ 130, 190
    Mod. WB-PESQ 99–104, 190
    WB-PESQ 75, 88–99, 134
Perceptual Speech Quality Measure (PSQM) 70–71, 75, 88, 134
phoneme 6, 11, 45, 140
POLQA 133–135
prediction 60, 62, 64, 109
preference mapping 52
Project—Acoustic Assessment Model (P.AAM) 76–77, 189
Public Switched Telephone Network (PSTN) 19

Q
quality 12, 16
    dimension 15, 27, 50, 164–181, 204–209, 228
    element 15
    feature 13, 15–16, 41
    integral quality 15, 41, 197–204
    loop 60
    model see model
    of experience (QoE) 18
    of service (QoS) 16, 16–18
    overall quality 15
    space 27–35
    speech communication quality 17
    speech quality 12, 12–18
    test see test, auditory test
    voice quality 34
    voice transmission quality 18

R
rating
    Absolute Category Rating (ACR) 46
    Comparison Category Rating (CCR) 47
    Degradation Category Rating (DCR) 46

S
scale 38
    level 42
    loudness-preference scale 50
    sone scale 157
    transmission rating scale 62, 83, 105–106, 119–121
scaling
    attribute scaling see semantic differential
    effect 53–54
    multidimensional scaling (MDS) 51–52
Semantic Differential (SD) 32, 49
sign 9
signal 65–66
    degraded signal 65
    reference signal 65, 101–102
    Signal-to-Noise Ratio (SNR) 20, 63, 66, 66–67
Skype 17
spectrum deviation 175–176
speech
    codec see codec
    level 20
    perception 9–12
    production 5–6
    spectrum 6, 67
    speech quality see quality, speech quality
subject 14, 27, 38, 39, 41–43, 111, 134, 222, 225
    effect 54–55
    expert subject 42
    naïve subject 42
Super-WideBand (S-WB) 20, 150, 159, 197–200, 224–227
syllable 11, 162, 163

T
talker 12, 44, 65, 95
Telecommunication Objective Speech-Quality Assessment (TOSQA) 71–73, 150, 188
test
    audio test 47–48
    auditory test 28, 38, 40, 40–58, 111–112, 116, 193, 221–228
    conversation test 45
    corpus 55–56
    listening-only test 46–47
    speech material 44
time
    alignment 89, 138–149
    clipping 25, 172–173
    warping 130, 138, 145, 144–149, 200
transmission system 18–27

V
voice see speech, production
    quality see quality, voice quality
    service 18
Voice Activity Detector (VAD) 25, 89, 139–142
Voice over Internet Protocol (VoIP) 20
Voice Quality Enhancement (VQE) 26, 156

W
Weighted Spectral Slope (WSS) 69, 175
WideBand (WB) 20, 55, 92, 224–227