Audio Coding
Yuli You
Audio Coding: Theory and Applications
Yuli You, Ph.D.
University of Minnesota, Twin Cities
ISBN 978-1-4419-1753-9
e-ISBN 978-1-4419-1754-6
DOI 10.1007/978-1-4419-1754-6
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010931862
© Springer Science+Business Media, LLC 2010
To my parents and Wenjie, Amy and Alan
Preface
Since branching out of speech coding in the early 1970s, audio coding has slipped into our daily lives in a variety of applications, such as mobile music/video players, digital television/audio broadcasting, optical discs, online media streaming, and electronic games. It has become one of the essential technologies in today's consumer electronics and broadcasting equipment.

In its more than 30 years of evolution, many audio coding technologies have come into the spotlight and then become obsolete; only a minority have survived and are deployed in major modern audio coding algorithms. While covering all the major turns and branches of this evolution is valuable for technology historians or for readers with an intense interest in it, it is distracting, and even overwhelming, for most readers. Those historical events are therefore omitted, and this book focuses instead on the current state of the evolution. This focus also allows full coverage of the selected topics.

This state of the art is presented from the perspective of a practicing engineer and adjunct associate professor who single-handedly developed the whole DRA audio coding standard, from algorithm architecture to assembly-code implementation to subjective listening tests. This perspective has a clear focus on "why" and "how to." In particular, many purely theoretical details, such as proofs of the perfect reconstruction property of various filter banks, are omitted. Instead, the emphasis is on the motivation for a particular technology: why it is useful, what it is, and how it is integrated into a complete algorithm and implemented in practical products. Consequently, many practical aspects of audio coding normally excluded from audio coding books, such as transient detection and the implementation of decoders on low-cost microprocessors, are covered here.

This book should help readers grasp the state-of-the-art audio coding technologies and build a solid foundation for either understanding and implementing various audio coding standards or developing their own should the need arise. It is, therefore, a valuable reference for engineers in the consumer electronics and broadcasting industries and for graduate students of electrical engineering.

Audio coding seeks to achieve data compression by removing perceptual irrelevance and statistical redundancy from a source audio signal, and the removal efficiency is powerfully augmented by data modeling, which compacts and/or decorrelates the
source signal. The presentation of this book is therefore centered around these three basic elements and organized into the following five parts.

Part I gives an overview of audio coding, describing the basic ideas, the key challenges, important issues, fundamental approaches, and the basic codec architecture.

Part II is devoted to quantization, the tool for removing perceptual irrelevance. Chapter 2 delineates scalar quantization, which quantizes a source signal one sample at a time. Both uniform and nonuniform quantization, including the Lloyd–Max algorithm, are discussed. Companding is posed as a structured and simple method for implementing nonuniform quantization. Chapter 3 describes vector quantization, which quantizes two or more samples of a source signal as one block at a time. Also included is the Linde–Buzo–Gray (LBG) or k-means algorithm, which builds an optimal VQ codebook from a set of training data.

Part III is devoted to data modeling, which transforms a source signal into a representation that is energy-compact and/or decorrelated. Chapter 4 describes linear prediction, which uses a linear combination of the historic samples of the source signal as a prediction for the current sample, so as to arrive at a prediction error signal that has lower energy and is decorrelated. It first explains why quantizing the prediction error signal, instead of the source signal, can dramatically improve coding efficiency. It then presents open-loop DPCM and DPCM, the two most common forms of linear prediction; derives the normal equation for optimal prediction; presents the Levinson–Durbin algorithm, which iteratively solves the normal equation; shows that the prediction error signal has a white spectrum and is thus decorrelated; and illustrates that the prediction decoder filter provides an estimate of the spectrum of the source signal. Finally, a general framework for linear prediction is presented that can shape the spectrum of quantization noise to desirable shapes, such as that of the absolute threshold of hearing.

Chapter 5 deals with transforms, which linearly transform a block of source signal samples into another block of coefficients whose energy is compacted into a minority of them. It first explains, through the AM–GM inequality and the associated optimal bit allocation strategy, why this compaction of energy leads to dramatically improved coding efficiency. It then derives the Karhunen–Loeve transform from the search for the optimal transform. Finally, it presents suboptimal and practical transforms, such as the discrete Fourier transform (DFT) and the discrete cosine transform (DCT).

Chapter 6 presents subband filter banks as extended transforms in which historic blocks of source samples overlap with the current block. It describes various aspects of subband coding, including reconstruction error and polyphase representation, and illustrates that the dramatically improved coding efficiency is likewise achieved through energy compaction.

Chapter 7 is devoted to cosine modulated filter banks (CMFB), whose structure is amenable to fast implementation. It first builds this filter bank from the DFT and explains that it has the structure of a prototype filter plus cosine modulation. It then presents nonperfect reconstruction and perfect reconstruction CMFBs and their efficient implementation structures. Finally, it illustrates that the modified discrete cosine transform (MDCT), the most widely used filter bank in audio coding, is a special and simple case of CMFB.
Part IV is devoted to entropy coding, the tool for removing statistical redundancy. Chapter 8 establishes that entropy is determined by the probability distribution of the source signal and is the fundamental lower limit of bit rate reduction. It then shows that any meaningful entropy code has to be uniquely decodable and, to be practically implementable, should be instantaneously decodable. Finally, it illustrates that prefix-free codes are just such codes and further proves Shannon's noiseless coding theorem, which essentially states that the entropy can be asymptotically approached by a prefix-free code if source symbols are coded as blocks and the block size goes to infinity.

Chapter 9 presents the Huffman code, an optimal prefix-free code widely used in audio coding. It first presents Huffman's algorithm, an iterative procedure for building a prefix-free code from the probability distribution of the source signal, and then proves its optimality. It also addresses some practical issues related to the application of Huffman coding, emphasizing the importance of coding source symbols as longer blocks.

While the previous parts apply to signal coding in general, Part V is devoted to audio. Chapter 10 covers perceptual models, which determine which part of the source signal is inaudible (perceptually irrelevant) and thus can be removed. It starts with the absolute threshold of hearing, the absolute sensitivity level of the human ear. It then illustrates that the human ear processes audio signals in the frequency domain using nonlinear and analog subband filters, and presents the Bark scale and critical bands as tools for describing the nonuniform bandwidths of these subband filters. Next, it covers masking effects, which describe the phenomenon that a weak sound becomes less audible due to the presence of a strong sound nearby. Both simultaneous and temporal masking are covered, but emphasis is given to the former because it is more thoroughly studied and more extensively used in audio coding. The rest of the chapter addresses a few practical issues, such as perceptual bit allocation, converting the masked threshold to the subband domain, perceptual entropy, and an example perceptual model.

Chapter 11 addresses the resolution challenge posed by transients. It first illustrates that audio signals are mostly quasistationary, hence needing fine frequency resolution to maximize energy compaction, but are frequently interrupted by transients, which require fine time resolution to avoid "pre-echo" artifacts. The challenge, therefore, arises: a filter bank cannot have fine frequency and time resolution simultaneously, according to the Fourier uncertainty principle. The chapter then states that one approach to this challenge is to adapt frequency resolution in time to the presence and absence of transients, and presents switched-window MDCT as an embodiment: switching the window length of the MDCT in such a way that short windows are applied to transients and long ones to quasistationary episodes. Two such examples are given, which can switch between two and three window lengths, respectively. For the double window length example, two more techniques, temporal noise shaping and transient localization, are given, which can further improve the temporal resolution of the short windows. Practical methods for transient detection are finally presented.
Chapter 12 deals with joint channel coding. Only two widely used methods are covered: joint intensity coding and sum/difference (M/S stereo) coding. Methods for dealing with low-frequency effect (LFE) channels are also included.

Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms, such as how to organize various data, how to assign entropy codebooks, how to optimally allocate bit resources, how to organize the bits representing various compressed data and control commands into a bit stream suitable for transmission over various channels, and how to make the algorithm amenable to implementation on low-cost microprocessors.

Chapter 14 is devoted to performance assessment, which, for a given bit rate, becomes an issue of how to evaluate coding impairments. It first points out that objective methods are highly desirable but generally inadequate, so subjective listening tests are necessary. The double-blind principle of subjective listening tests is then presented, along with two methods that implement it, namely the ABX test and ITU-R BS.1116.

Finally, Chap. 15 presents the Dynamic Resolution Adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described in this book into a practical audio coding algorithm. The DRA algorithm has been approved by the Blu-ray Disc Association as part of its BD-ROM 2.3 specification and by the Chinese government as its national standard.

Yuli You
Adjunct Associate Professor
Department of Electrical and Computer Engineering
[email protected]
Contents
Part I Prelude

1 Introduction
   1.1 Audio Coding
   1.2 Basic Idea
   1.3 Perceptual Irrelevance
   1.4 Statistical Redundancy
   1.5 Data Modeling
   1.6 Resolution Challenge
   1.7 Perceptual Models
   1.8 Global Bit Allocation
   1.9 Joint Channel Coding
   1.10 Basic Architecture
   1.11 Performance Assessment

Part II Quantization

2 Scalar Quantization
   2.1 Scalar Quantization
   2.2 Re-Quantization
   2.3 Uniform Quantization
      2.3.1 Formulation
      2.3.2 Midtread and Midrise Quantizers
      2.3.3 Uniformly Distributed Signals
      2.3.4 Nonuniformly Distributed Signals
   2.4 Nonuniform Quantization
      2.4.1 Optimal Quantization and Lloyd-Max Algorithm
      2.4.2 Companding

3 Vector Quantization
   3.1 The VQ Advantage
   3.2 Formulation
   3.3 Optimality Conditions
   3.4 LBG Algorithm
   3.5 Implementation

Part III Data Model

4 Linear Prediction
   4.1 Linear Prediction Coding
   4.2 Open-Loop DPCM
      4.2.1 Encoder and Decoder
      4.2.2 Quantization Noise Accumulation
   4.3 DPCM
      4.3.1 Quantization Error
      4.3.2 Coding Gain
   4.4 Optimal Prediction
      4.4.1 Optimal Predictor
      4.4.2 Levinson–Durbin Algorithm
      4.4.3 Whitening Filter
      4.4.4 Spectrum Estimator
   4.5 Noise Shaping
      4.5.1 DPCM
      4.5.2 Open-Loop DPCM
      4.5.3 Noise-Feedback Coding

5 Transform Coding
   5.1 Transform Coder
   5.2 Optimal Bit Allocation and Coding Gain
      5.2.1 Quantization Noise
      5.2.2 AM–GM Inequality
      5.2.3 Optimal Conditions
      5.2.4 Coding Gain
      5.2.5 Optimal Bit Allocation
      5.2.6 Practical Bit Allocation
      5.2.7 Energy Compaction
   5.3 Optimal Transform
      5.3.1 Karhunen–Loeve Transform
      5.3.2 Maximal Coding Gain
      5.3.3 Spectrum Flatness
   5.4 Suboptimal Transforms
      5.4.1 Discrete Fourier Transform
      5.4.2 DCT

6 Subband Coding
   6.1 Subband Filtering
      6.1.1 Transform Viewed as Filter Bank
      6.1.2 DFT Filter Bank
      6.1.3 General Filter Banks
   6.2 Subband Coder
   6.3 Reconstruction Error
      6.3.1 Decimation Effects
      6.3.2 Expansion Effects
      6.3.3 Reconstruction Error
   6.4 Polyphase Implementation
      6.4.1 Polyphase Representation
      6.4.2 Noble Identities
      6.4.3 Efficient Subband Coder
      6.4.4 Transform Coder
   6.5 Optimal Bit Allocation and Coding Gain
      6.5.1 Ideal Subband Coder
      6.5.2 Optimal Bit Allocation and Coding Gain
      6.5.3 Asymptotic Coding Gain

7 Cosine-Modulated Filter Banks
   7.1 Cosine Modulation
      7.1.1 Extended DFT Bank
      7.1.2 2M-DFT Bank
      7.1.3 Frequency-Shifted DFT Bank
      7.1.4 CMFB
   7.2 Design of NPR Filter Banks
   7.3 Perfect Reconstruction
   7.4 Design of PR Filter Banks
      7.4.1 Lattice Structure
      7.4.2 Linear Phase
      7.4.3 Free Optimization Parameters
   7.5 Efficient Implementation
      7.5.1 Even m
      7.5.2 Odd m
   7.6 Modified Discrete Cosine Transform
      7.6.1 Window Function
      7.6.2 MDCT
      7.6.3 Efficient Implementation

Part IV Entropy Coding

8 Entropy and Coding
   8.1 Entropy Coding
   8.2 Entropy
      8.2.1 Entropy
      8.2.2 Model Dependency
   8.3 Uniquely and Instantaneously Decodable Codes
      8.3.1 Uniquely Decodable Code
      8.3.2 Instantaneous and Prefix-Free Code
      8.3.3 Prefix-Free Code and Binary Tree
      8.3.4 Optimal Prefix-Free Code
   8.4 Shannon's Noiseless Coding Theorem
      8.4.1 Entropy as the Lower Bound
      8.4.2 Upper Bound
      8.4.3 Shannon's Noiseless Coding Theorem

9 Huffman Coding
   9.1 Huffman's Algorithm
   9.2 Optimality
      9.2.1 Codeword Siblings
      9.2.2 Proof of Optimality
   9.3 Block Huffman Code
      9.3.1 Efficiency Improvement
      9.3.2 Block Encoding and Decoding
   9.4 Recursive Coding
   9.5 A Fast Decoding Algorithm

Part V Audio Coding

10 Perceptual Model
   10.1 Sound Pressure Level
   10.2 Absolute Threshold of Hearing
   10.3 Auditory Subband Filtering
      10.3.1 Subband Filtering
      10.3.2 Auditory Filters
      10.3.3 Bark Scale
      10.3.4 Critical Bands
      10.3.5 Critical Band Level
      10.3.6 Equivalent Rectangular Bandwidth
   10.4 Simultaneous Masking
      10.4.1 Types of Masking
      10.4.2 Spread of Masking
      10.4.3 Global Masking Threshold
   10.5 Temporal Masking
   10.6 Perceptual Bit Allocation
   10.7 Masked Threshold in Subband Domain
   10.8 Perceptual Entropy
   10.9 A Simple Perceptual Model

11 Transients
   11.1 Resolution Challenge
      11.1.1 Pre-Echo Artifacts
      11.1.2 Fourier Uncertainty Principle
      11.1.3 Adaptation of Resolution with Time
   11.2 Switched-Window MDCT
      11.2.1 Relaxed PR Conditions and Window Switching
      11.2.2 Window Sequencing
   11.3 Double-Resolution Switched MDCT
      11.3.1 Primary and Transitional Windows
      11.3.2 Look-Ahead and Window Sequencing
      11.3.3 Implementation
      11.3.4 Window Size Compromise
   11.4 Temporal Noise Shaping
   11.5 Transient-Localized MDCT
      11.5.1 Brief Window and Pre-Echo Artifacts
      11.5.2 Window Sequencing
      11.5.3 Indication of Window Sequence to Decoder
      11.5.4 Inverse TLM Implementation
   11.6 Triple-Resolution Switched MDCT
   11.7 Transient Detection
      11.7.1 General Procedure
      11.7.2 A Practical Example

12 Joint Channel Coding
   12.1 M/S Stereo Coding
   12.2 Joint Intensity Coding
   12.3 Low-Frequency Effect Channel

13 Implementation Issues
   13.1 Data Structure
      13.1.1 Frame-Based Processing
      13.1.2 Time–Frequency Tiling
   13.2 Entropy Codebook Assignment
      13.2.1 Fixed Assignment
      13.2.2 Statistics-Adaptive Assignment
   13.3 Bit Allocation
      13.3.1 Inter-Frame Allocation
      13.3.2 Intra-Frame Allocation
   13.4 Bit Stream Format
      13.4.1 Frame Header
      13.4.2 Audio Channels
      13.4.3 Error Protection Codes
      13.4.4 Auxiliary Data
   13.5 Implementation on Microprocessors
      13.5.1 Fitting to Low-Cost Microprocessors
      13.5.2 Fixed-Point Arithmetic

14 Quality Evaluation
   14.1 Objective Metrics
   14.2 Subjective Tests
      14.2.1 Double-Blind Principle
      14.2.2 ABX Test
      14.2.3 ITU-R BS.1116

15 DRA Audio Coding Standard
   15.1 Design Considerations
   15.2 Architecture
   15.3 Bit Stream Format
      15.3.1 Frame Synchronization
      15.3.2 Frame Header
      15.3.3 Audio Channels
      15.3.4 Window Sequencing for LFE Channels
      15.3.5 End of Frame Signature
      15.3.6 Auxiliary Data
      15.3.7 Unpacking the Whole Frame
   15.4 Decoding
      15.4.1 Inverse Quantization
      15.4.2 Joint Intensity Decoding
      15.4.3 Sum/Difference Decoding
      15.4.4 De-Interleaving
      15.4.5 Window Sequencing
      15.4.6 Inverse TLM
      15.4.7 Decoding the Whole Frame
   15.5 Formal Listening Tests

Large Tables
   A.1 Quantization Step Size
   A.2 Critical Bands for Short and Long MDCT
   A.3 Huffman Codebooks for Codebook Assignment
   A.4 Huffman Codebooks for Quotient Width of Quantization Indexes
   A.5 Huffman Codebooks for Quantization Indexes in Quasi-Stationary Frames
   A.6 Huffman Codebooks for Quantization Indexes in Frames with Transients
   A.7 Huffman Codebooks for Indexes of Quantization Step Sizes

References

Index
Part I
Prelude
Chapter 1
Introduction
Sounds are physical waves that propagate in the air or other media. Such waves, which may be expressed as changes in air pressure, may be transformed by an analog audio system using a transducer, such as a microphone, into continuous electrical waves in the form of current and/or voltage changes. This transformation of sounds into an electrical representation, which we call an audio signal, facilitates the storage, transmission, duplication, amplification, and other processing of sounds. To reproduce the sounds, the electrical representation, or audio signal, is converted back into physical waves via loudspeakers.

Since electronic circuits and storage/transmission media are inherently noisy and nonlinear, audio signals are susceptible to noise and distortion, resulting in loss of sound quality. Consequently, modern audio systems are mostly digital: the audio signals obtained above are sampled into discrete-time signals and then digitized into numerical representations, which we call digital audio signals. Once in the digital domain, many techniques can be deployed to ensure that no inadvertent loss of audio quality occurs.

Pulse-code modulation (PCM) is usually the standard representation format for digital audio signals. To obtain a PCM representation, the waveform of an analog audio signal is sampled regularly at uniform intervals (the sampling period) to generate a sequence of samples (a discrete-time signal), which are then quantized to generate a sequence of symbols, each represented as a numerical (usually binary) code.

The Nyquist–Shannon sampling theorem states that an analog signal can be perfectly reconstructed from its samples if the sample rate exceeds twice the highest frequency in the original analog signal [68]. To ensure this condition is satisfied, the input analog signal is usually filtered with a low-pass filter whose stopband corner frequency is less than half of the sample rate. Since the human ear's perceptual range for pure tones is widely believed to be between 20 Hz and 20 kHz (see Sect. 10.2) [102], such low-pass filters may be designed so that the cutoff frequency starts at 20 kHz and a few kilohertz are allowed as the transition band before the stopband. For example, the sample rate is 44.1 kHz for compact discs (CD) and 48 kHz for sound tracks in DVD-Video. Some people, however, believe that the human ear can perceive frequencies much higher than 20 kHz, especially when transients are present, so sampling rates as high as 192 kHz are used in some audio systems, such as DVD-Audio.
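As a concrete illustration of this digitization chain, the following minimal sketch samples a 1 kHz tone at 48 kHz and rounds each sample to a 16-bit integer code. It is a hypothetical example added for illustration (the tone, sample rate, and word length are chosen to match the figures used later in this chapter), not a description of any particular converter.

```python
import numpy as np

fs = 48_000                    # sample rate in Hz; fs/2 = 24 kHz is the Nyquist frequency
f0 = 1_000                     # tone frequency in Hz, well below fs/2
n = np.arange(fs // 100)       # sample indices covering 10 ms

analog = np.sin(2 * np.pi * f0 * n / fs)   # stand-in for the band-limited analog waveform

# 16-bit PCM: map the range [-1, 1] onto the signed integers [-32768, 32767].
pcm = np.clip(np.round(analog * 32767), -32768, 32767).astype(np.int16)

print(pcm[:6])                 # the first few 16-bit PCM codes
```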
Note that there is still power in the stopband of any practical low-pass filter, so perfect reconstruction is only approximately satisfied. The subsequent quantization process also introduces noise: the more bits are used to represent each audio sample, the smaller the quantization noise becomes (see Sect. 2.3). Compact discs (CD), for example, use 16 bits to represent each sample. Due to the limited resolution and dynamic range of the human ear, 16 bits per sample are argued by some to be sufficient to deliver the full dynamics of almost all music, but higher-resolution audio formats are called for in applications such as soundtracks in feature films, where there is often a very wide dynamic range between whispered conversations and explosions. A higher-resolution format also leaves more headroom for audio processing, which may inadvertently or intentionally introduce noise. Twenty-four bits per sample, used by DVD, are widely believed to be sufficient for most applications, but 32 bits are not uncommon.

Digital audio signals rarely come as a single channel, or monaural sound. The CD delivers two channels (stereo) and the DVD up to 7.1 channels (surround sound), consisting of seven normal channels (front left, front right, center, surround left, surround right, back left, and back right, for example) and one low-frequency effect (LFE) channel. The terminology of 0.1 is used to indicate that an LFE channel has a very low bandwidth, usually no more than 120 Hz. The number of channels is clearly on an increasing trend. For example, Japan's NHK demonstrated 22.2 channel surround sound in 2005, and China's DRA audio coding standard (see Chap. 15) allows for 64.3 channel surround sound.
1.1 Audio Coding

Higher audio quality demands a higher sample rate, more bits per sample, and more channels. But all of these come with a significant cost: a large number of bits to represent and transfer the digital audio signals. Let $b$ denote the number of bits used to represent each PCM sample and $F_s$ the sample rate in samples per second; the bit rate needed to represent and transfer an $N_{ch}$-channel audio signal is then

$$B_0 = b \cdot F_s \cdot N_{ch} \qquad (1.1)$$

in bits per second. As an example, let us consider a moderate case typically deployed by DVD-Video: a 48 kHz sample rate and 24 bits per sample. This amounts to a bit rate of 48 × 24 = 1,152 kbps (kilobits per second) for each channel. The total bit rate becomes 2,304 kbps for stereo, 6,912 kbps for 5.1, 9,216 kbps for 7.1, and 27,648 kbps for 22.2 surround sound, respectively. And this is not the end of the story: if the 192 kHz sample rate is used, for example, these bit rates quadruple. For mass consumption, audio signals need to be delivered to consumers through some sort of communication/broadcasting or storage channel, whose capacity is usually very limited.
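The figures above follow directly from (1.1); the short sketch below (an illustrative helper of my own, not code from the book) reproduces them, along with the two-hour DVD storage figure discussed in the next paragraphs.

```python
def pcm_bit_rate(bits_per_sample, sample_rate_hz, num_channels):
    """B0 = b * Fs * Nch, in bits per second, as in (1.1)."""
    return bits_per_sample * sample_rate_hz * num_channels

# DVD-Video-grade PCM: 24 bits per sample at 48 kHz.
for name, nch in [("stereo", 2), ("5.1", 6), ("7.1", 8), ("22.2", 24)]:
    print(name, pcm_bit_rate(24, 48_000, nch) / 1000, "kbps")
# stereo 2304.0, 5.1 6912.0, 7.1 9216.0, 22.2 27648.0 kbps

# Two hours of 5.1 PCM, as discussed for DVD below: about 6.22 GB.
print(pcm_bit_rate(24, 48_000, 6) * 2 * 3600 / 8 / 1e9, "GB")
```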
Storage channels usually offer the best capacity. DVD-Video, for example, is designed to hold at least two hours of film with standard-definition video and 5.1 surround sound. It was given a capacity of 4.7 GB (gigabytes), the state of the art when DVD was standardized. If two hours of 5.1 surround sound were delivered in the standard PCM format (24 bits per sample and a 48 kHz sample rate), it would need about 6.22 GB of storage space. This is more than the capacity of the whole DVD disc, leaving no capacity for the video, whose demand for bit rate is usually more than ten times that of the audio. Apparently, there is a problem of insufficient channel capacity for the delivery of audio signals.

This problem is much more acute with wireless channels. For example, over-the-air audio and/or television broadcasting usually allocates no more than 64 kbps of channel capacity to deliver stereo audio. If delivered as PCM at 24 bits per sample and a 48 kHz sample rate, a stereo audio signal needs a bit rate of 2,304 kbps, which is 36 times the allocated channel capacity.

This problem of insufficient channel capacity for delivering audio signals may be addressed either by allocating more channel capacity or by reducing the demand for it. Allocating more channel capacity is usually very expensive, and even physically impossible in situations such as wireless communication or broadcasting. It is often more plausible and effective to pursue the demand-reduction route: reducing the bit rate necessary for delivering audio signals. This is the task of digital audio (compression) coding.

Audio coding achieves this goal of bit rate reduction through an encoder and a decoder, as shown in Fig. 1.1. The encoder obtains a compact representation of the input audio signal, often referred to as the source signal, that demands fewer bits. The bits for this compact representation are delivered through a communication/broadcasting or storage channel to the decoder, which then reconstructs the original audio signal from the received compact representation.

Fig. 1.1 Audio coding involves an encoder to transform a source audio signal into a compact representation for transmission through a channel and a decoder to decode the compact representation received from the channel to reconstruct the source audio signal

Note that the term "channel" used here is an abstraction or aggregation of the channel coder, modulator, physical channel, channel demodulator, and channel decoder in the communication literature. Since the channel is well known for introducing bit errors, the compact representation received by the decoder may differ from that at the output of the encoder. From the viewpoint of audio coding, however, the channel may be assumed to be error-free, but an audio coding system must be designed in such a way that it can tolerate a certain degree of channel errors. At the very least, the decoder must be able to detect and recover from most channel errors.
Let $B$ be the bit rate needed to deliver the compact representation; the performance of an audio coding algorithm may then be assessed by the compression ratio

$$r = \frac{B_0}{B}. \qquad (1.2)$$

For the previous example of over-the-air audio and/or television broadcasting, the required compression ratio is 36:1.

The compact representation obtained by the encoder may allow the decoder to perfectly reconstruct the original audio signal, i.e., the reconstructed audio signal at the output of the decoder is an exact or identical copy of the source audio signal input to the encoder, bit by bit. Such an audio coding process is called lossless. Otherwise, it is called lossy, meaning that the reconstructed audio signal is just an approximate copy of the source audio signal: some information is lost in the coding process and the audio signal is irreversibly distorted (hopefully not perceivably).
1.2 Basic Idea According to information theory [85, 86], the minimal average bit rate that is necessary to transmit a source signal is its entropy, which is determined by the probability distribution of the source signal (see Sect. 8.2). Let H denote the entropy of the source signal, the following difference: R D B0 H
(1.3)
is the component of the source signal that is statistically redundant for the purpose of transmitting the source signal to the decoder; it is thus called statistical redundancy. The goal of lossless audio coding is to remove statistical redundancy from the source signal as much as possible, so that the signal is delivered to the decoder with a bit rate $B$ as close as possible to the entropy. This is illustrated in Fig. 1.2. Note that, while entropy coding is the general terminology for coding techniques that remove statistical redundancy, a lossless audio coding algorithm usually also involves sophisticated data modeling (to be discussed in Sect. 1.3), so the "entropy coding" label in Fig. 1.2 is an oversimplification in the context of lossless coding algorithms and should be understood to include data modeling.

The compression achievable by lossless audio coding is usually very limited; an overall compression ratio of 2:1 may be considered high. This level of compression cannot satisfy many practical needs. As stated before, over-the-air audio and/or television broadcasting, for example, may require a compression ratio of 36:1. To achieve this level of compression, some information in the source signal has to be irreversibly discarded by the encoder.
Fig. 1.2 A lossless audio coder removes, through entropy coding, statistical redundancy from the source audio signal to arrive at a compact representation
Fig. 1.3 A lossy audio coder removes both perceptual irrelevancy and statistical redundancy from the source audio signal to achieve a much higher compression ratio
This irreversible loss of information causes distortion in the reconstructed audio signal at the decoder output. The distortion may be significant if assessed using objective measures such as mean square error, but it is perceived differently by the human ear, which audio coding ultimately serves. Proper coder design can ensure that no distortion is perceived by the human ear, even if the distortion is prominent when assessed by objective measures. Furthermore, even if some distortion can be perceived, it may still be tolerated if it is not "annoying." The portion of information in the source signal whose loss leads to either imperceptible or unannoying distortion is, therefore, perceptually irrelevant and may be removed from the source signal to significantly reduce the bit rate. After the removal of perceptual irrelevance, there is still statistical redundancy in the remaining signal components, which can be further removed through entropy coding. Therefore, a lossy coder usually consists of two modules, as shown in Fig. 1.3.
Note that, while quantization is the general terminology for coding techniques that remove perceptual irrelevance, a lossy audio coding algorithm usually also involves sophisticated data modeling, so the "quantization" label in Fig. 1.3 is an oversimplification in the context of lossy audio coding algorithms and should be understood to include data modeling.
1.3 Perceptual Irrelevance

The basic approach to removing perceptual irrelevance is quantization, which involves representing the samples of the source signal with lower resolution (see Chaps. 2 and 3). For example, the integer value 1,000, which needs 10 bits for binary representation, may be quantized by a scalar quantizer (SQ) with a quantization step size of 9 to the index

$$\lfloor 1000/9 \rceil = 111,$$

which needs only 7 bits. At the decoder side, the original value may be reconstructed as 111 × 9 = 999, for a quantization error of 1000 − 999 = 1.

Consider the value 1,000 above as a sample of a 10-bit PCM signal (no sign); the above quantization process may be applied to all samples of the PCM signal to generate another PCM signal of only 7 bits, for a compression ratio of 10:7. Of course, the original 10-bit PCM signal cannot be perfectly reconstructed from the 7-bit one due to quantization error. The quantization error obviously depends on the quantization step size: the larger the step size, the larger the quantization error. If the level of quantization error above is considered perceptually irrelevant, we have effectively compressed the 10-bit PCM signal into a 7-bit one. Otherwise, the quantization step size needs to be reduced until the quantization error is perceptually irrelevant. To optimize compression performance, the step size can be adjusted to an optimal value that gives a quantization error that is just not perceivable.

The quantization scheme illustrated above is the simplest uniform scalar quantization (see Sect. 2.3), characterized by a constant quantization step size applied to the whole dynamic range of the input signal. The quantization step size may instead be made variable, depending on the values of the input signal, so as to better adapt to the perceived quality of the quantized and reconstructed signal. This amounts to nonuniform quantization (see Sect. 2.4). To exploit the inter-sample structure and correlation among the samples of the input signal, a block of samples may be grouped together and quantized as a vector, amounting to vector quantization (VQ) (see Chap. 3).
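The round trip just described fits in a few lines. The following is a minimal uniform scalar quantizer sketch, written for illustration only (rounding to the nearest index, matching the worked example); practical quantizers differ in rounding rules and overload handling.

```python
def quantize(x, step):
    """Uniform scalar quantization: map a sample to the nearest integer index."""
    return round(x / step)

def reconstruct(index, step):
    """Inverse quantization: map the index back to an approximate value."""
    return index * step

step = 9
idx = quantize(1000, step)      # -> 111, representable in 7 bits
rec = reconstruct(idx, step)    # -> 999
print(idx, rec, 1000 - rec)     # 111 999 1  (a quantization error of 1)
```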
1.4 Statistical Redundancy

The basic approach to removing statistical redundancy is entropy coding, whose basic idea is to use long codewords to represent less frequent sample values and short codewords for more frequent ones. As an example, consider the four PCM sample values listed in the first column of Table 1.1, with the probability distribution listed in the second column.

Table 1.1 The basic idea of entropy coding is to use long codewords to represent less frequent sample values and short codewords for more frequent ones

PCM sample value   Probability   Entropy code
0                  1/2           1
1                  1/4           01
2                  1/8           001
3                  1/8           0001

Since there are four PCM sample values, we need at least 2 bits to represent a PCM signal that draws sample values from this sample set. However, if we use the codewords listed in the third column of Table 1.1 to represent the PCM sample values, we end up with the following average bit rate:

$$1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times \frac{1}{8} + 4 \times \frac{1}{8} = 1.875 \text{ bits},$$

which amounts to a compression ratio of 2:1.875. The code in Table 1.1 is a variant of the unary code, which is not optimal for the probability distribution in the table. For an arbitrary probability distribution, among codes that use the least average number of bits to code the samples of the source signal, the Huffman code is one such optimal code [29] (see Chap. 9).
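The 1.875-bit average, and the entropy it should be compared against, are easy to verify. The sketch below uses the distribution of Table 1.1; the comparison with entropy anticipates Sect. 8.2 and is added here purely for illustration.

```python
from math import log2

probs = [1/2, 1/4, 1/8, 1/8]     # probabilities from Table 1.1
lengths = [1, 2, 3, 4]           # lengths of the codewords 1, 01, 001, 0001

avg_bits = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)

print(avg_bits)   # 1.875 bits per sample, matching the text
print(entropy)    # 1.75 bits; for this dyadic distribution a Huffman code
                  # (e.g., 0, 10, 110, 111) reaches the entropy exactly
```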
1.5 Data Modeling

If audio coding involved only techniques for removing perceptual irrelevance and statistical redundancy, it would be a much simpler field of study, and coding performance would also be significantly limited. Fortunately, there is another class of techniques that makes audio coding much more effective, and also more complex: data modeling.

Audio signals, like many other signals, are usually strongly correlated and have internal structures that can be expressed via data models. As an example, consider the 1,000 Hz sinusoidal signal shown at the top of Fig. 1.4, represented using 16-bit PCM with a sample rate of 48 kHz. Its periodicity shows that it is strongly correlated. One simple approach to modeling the periodicity so as to
remove correlation is through linear prediction (see Chap. 4).

Fig. 1.4 A 1,000 Hz sinusoidal signal represented as 16-bit PCM with a sample rate of 48 kHz (top), its linear prediction error signal (middle), and its DFT spectrum (bottom)

Let $x(n)$ denote the nth sample of the sinusoidal signal and $\hat{x}(n)$ its predicted value. An extremely simple prediction scheme is to use the immediately preceding sample value as the prediction for the current sample value:

$$\hat{x}(n) = x(n-1).$$

This prediction is, of course, not perfect, so there is a prediction error, or residue,

$$e(n) = x(n) - \hat{x}(n),$$

shown in the middle of Fig. 1.4. If we elect to send this residue signal, instead of the original signal, to the decoder, we end up with a much smaller number of bits due to its significantly reduced dynamic range. In fact, its dynamic range is [−4278, 4278], which can be represented using 14-bit PCM, resulting in a compression ratio of 16:14.
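The first-order predictor above is easy to reproduce numerically. The sketch below generates the same 1 kHz, 16-bit, 48 kHz sinusoid and measures the dynamic range of the residue; it is an illustration only, and the exact extreme values depend on rounding (they come out near the ±4278 quoted above).

```python
import numpy as np

fs, f0 = 48_000, 1_000
n = np.arange(fs)                                          # one second of samples
x = np.round(32767 * np.sin(2 * np.pi * f0 * n / fs)).astype(np.int32)

e = x[1:] - x[:-1]            # e(n) = x(n) - x_hat(n) with x_hat(n) = x(n - 1)

print(x.min(), x.max())       # about -32767 .. 32767: needs 16 bits
print(e.min(), e.max())       # about  -4278 .. 4278:  fits in 14 bits
```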
Another approach to decorrelation is orthogonal transforms (see Chap. 5), which, when properly designed, can transform the input signal into coefficients that are decorrelated and whose energy is compacted into a small number of coefficients. This compaction of energy is illustrated at the bottom of Fig. 1.4, which plots the logarithmic magnitude of the discrete Fourier transform (DFT) (see Sect. 5.4.1) of the 1,000 Hz sinusoidal signal at the top of Fig. 1.4. Instead of dealing with the periodically occurring large sample values of the original sinusoidal signal in the time domain, there are only a small number of large DFT coefficients in the frequency domain, and the rest are extremely close to zero. A bit allocation strategy can be deployed that allocates bits to the representation of the DFT coefficients based on their respective magnitudes. Due to the energy compaction, only the small number of large DFT coefficients demand a significant number of bits; the vast majority of the rest demand few, if any, so a tremendous degree of compression can be achieved.

The DFT is rarely used in practical audio coding algorithms, partly because it is a real-to-complex transform: for a block of N real input samples, it generates a block of N complex coefficients, which actually consist of N real and N imaginary coefficients. The discrete cosine transform (DCT), which is a real-to-real transform, is more widely used in place of the DFT. Note that N is hereafter referred to as the block size or block length.

When blocks of transform coefficients are coded independently of each other, discontinuity occurs at the block boundaries. Referred to as the blocky effect, this discontinuity causes a periodic "clicking" sound in the reconstructed audio and is usually very annoying. To overcome the blocky effect, lapped transforms that overlap between blocks were developed [49]; they may be considered as special cases of subband filter banks [93] (see Chap. 6). Another benefit of overlapping between blocks is that the resultant transforms have sharper frequency responses and thus better energy-compaction performance. To mitigate codec implementation cost, structured filter banks that are amenable to fast algorithms are mostly deployed in practical audio coding algorithms. Prominent among them are cosine-modulated filter banks (CMFB), whose implementation cost is essentially that of a prototype FIR filter plus a DCT (see Chap. 7). A special case, the modified discrete cosine transform (MDCT) (see Sect. 7.6), whose prototype filter is only twice as long as the block size, has essentially dominated the various audio coding standards.
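A short numerical sketch of the energy-compaction claim, assuming a block size of 480 so that 1,000 Hz falls exactly on a DFT bin:

```python
# The DFT of a pure tone concentrates nearly all energy in two coefficients.
import numpy as np

fs, N = 48000, 480
n = np.arange(N)
x = np.sin(2 * np.pi * 1000 * n / fs)   # 1000 Hz lands on bin 10 exactly

X = np.fft.fft(x)
energy = np.abs(X) ** 2
top2 = np.sort(energy)[-2:].sum()
print(top2 / energy.sum())              # ~1.0: two bins carry all the energy
```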
1.6 Resolution Challenge

The compression achieved through energy compaction is based on two assumptions. The first is that the input signal is quasistationary and full of fine frequency structures, such as the one shown at the top of Fig. 1.5. This assumption is mostly correct because audio signals are quasistationary most of the time. The second assumption is that the transform or subband filter bank has a frequency resolution good enough to resolve these fine frequency structures.

Fig. 1.5 Audio signals are quasistationary (such as the one shown at the top) most of the time, but are frequently interrupted by dramatic transients (such as the one shown at the bottom)

Since the frequency resolution of a transform or filter bank is largely proportional to the block size, this second assumption calls for the deployment of transforms or filter banks with large block sizes. To achieve a high degree of energy compaction, the block size should be as large as possible, limited only by the physical memory of the encoder/decoder and the delay associated with buffering a long block of samples.

Unfortunately, the first assumption above is not correct all the time: quasistationary episodes of audio signals are frequently interrupted by dramatic transients, which may rise from absolute quiet to extreme loudness within a few samples. Examples of such transients include sudden gunshots and explosions. A less dramatic transient produced by a musical instrument is shown at the bottom of Fig. 1.5. For such a transient, it is well known that a long transform or filter bank produces a flat spectrum, which corresponds to little, if any, energy compaction, resulting in poor compression performance. To mitigate this problem, a short transform or filter bank should be used, which has the fine time resolution needed to localize the transient in the time domain.

To be able to code all audio signals with high coding performance all the time, a transform/filter bank that has good resolution in both the time and frequency domains is desired. Unfortunately, the Fourier uncertainty principle [75], which is related to the Heisenberg uncertainty principle [90], states that this is impossible: a transform/filter bank can have good resolution either in the time or in the frequency domain, but not both (see Sect. 11.1). This poses one of the most difficult challenges in audio coding.

This challenge is usually addressed by adapting the resolution of the transform/filter bank over time to the changing resolution demands of the input
signal: applying long block sizes to quasistationary episodes and short ones to transients. There are a variety of ways to implement this scheme; the most dominant among them appears to be the switched block-size MDCT (see Sect. 11.2). In order for the resolution adaptation to occur on the fly, a transient detection mechanism that detects the occurrence of transients and identifies their locations (see Sect. 11.7) is needed.
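As a hedged illustration of one simple transient-detection idea (not the book's specific method), the sketch below compares short-term energy between consecutive sub-blocks and flags a sharp rise; the sub-block size and threshold ratio are arbitrary choices for this example.

```python
import numpy as np

def detect_transient(frame, sub=128, ratio=8.0):
    """Return the sample index of the first sub-block whose energy jumps
    by more than `ratio` over its predecessor, or None."""
    e = np.array([np.sum(frame[i:i + sub] ** 2)
                  for i in range(0, len(frame) - sub + 1, sub)])
    for k in range(1, len(e)):
        if e[k] > ratio * max(e[k - 1], 1e-12):
            return k * sub        # approximate transient location
    return None

# Example: near-silence followed by a loud burst
x = np.concatenate([0.001 * np.random.randn(512), np.random.randn(512)])
print(detect_transient(x))        # flags the onset around sample 512
```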
1.7 Perceptual Models

While quantization is the tool for removing perceptual irrelevance, there remains the question of the optimal degree of perceptual irrelevance that can be safely removed without audible distortion. This question is addressed by perceptual models, which mimic the psychoacoustic behaviors of the human ear. When a source signal is fed to a perceptual model, it provides as output some description of which parts of the source audio signal are perceptually irrelevant. This description usually comes in the form of a power threshold, called the masking threshold, below which sound cannot be perceived by the human ear and thus can be removed.

Since the human ear does most of its signal processing in the frequency domain, a perceptual model is best built in the frequency domain, with the masking threshold given as a function of frequency. Consequently, the data model should desirably be a frequency transform/filter bank so that the results from the perceptual model, such as the masking threshold, can be readily and effectively utilized. It is, therefore, no surprise that most modern audio coders operate in the frequency domain. It is still possible for an audio coder to operate in other domains, but there should then be a mechanism to bridge that domain and the frequency domain in which the human ear mostly processes sound signals.
1.8 Global Bit Allocation

The adjustment of the quantization step size affects proportionally the level of quantization noise and inversely the number of bits needed to represent transform coefficients or subband samples. A small quantization step size can ensure that the quantization noise is not perceivable, but at the expense of consuming a large number of bits. A large quantization step size, on the other hand, demands a small number of bits, but at the expense of a high level of quantization noise. Since a lossy audio coder usually operates under a tight bit rate budget with a limited number of bits that can be used, a global bit allocation mechanism is needed to optimally allocate the limited bit resource so as to minimize the total perceived power of quantization noise.
The basic bit allocation strategy is to allocate bits iteratively (by decreasing the quantization step size) to the group of transform coefficients or subband samples whose quantization noise is most audible, until either the bit pool is exhausted or the quantization noise for all transform coefficients/subband samples is below the masking threshold.
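A minimal sketch of this greedy loop follows; the per-band noise and masking values are made up for illustration, and the assumption that one extra bit reduces noise power by a factor of four follows the roughly 6 dB-per-bit rule discussed in Chap. 2.

```python
import numpy as np

noise = np.array([10.0, 40.0, 5.0, 80.0])   # noise power per band
mask  = np.array([12.0, 20.0, 6.0, 30.0])   # masking threshold per band
bits  = np.zeros(4, dtype=int)
pool  = 8                                    # bit budget

while pool > 0 and np.any(noise > mask):
    worst = int(np.argmax(noise - mask))     # most audible band
    bits[worst] += 1
    noise[worst] /= 4.0                      # one more bit: ~6 dB less noise
    pool -= 1

print(bits, noise)
```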
1.9 Joint Channel Coding

The discrete channels of a multichannel audio signal are coordinated, synchronized in particular, to produce dynamic sound imaging, so the inter-channel correlation in a multichannel audio signal is very strong. This statistical redundancy can be exploited through some form of joint channel coding, either in the temporal or in the transform/subband domain. The human ear relies on many cues in the audio signal to achieve sound localization, and the processing involved is very complex. However, many psychoacoustic experiments have consistently indicated that some components of the audio signal are either insignificant or even irrelevant for sound localization, and thus can be removed for bit rate reduction.

Joint channel coding is the general term for techniques that exploit inter-channel statistical redundancy and perceptual irrelevance. Unfortunately, this is a less studied area and the existing techniques are rather primitive. The ones primarily used by most audio coding algorithms are sum/difference coding (M/S stereo coding) and joint intensity coding, both of which are discussed in Chap. 12.
1.10 Basic Architecture

The various techniques discussed in the earlier sections can now be put together to arrive at the basic audio encoder architecture shown in Fig. 1.6. The multiplexer in the figure is a necessary module that packs all elements of the compressed audio data into a coherent bit stream adhering to a specific format suitable for transmission over various communication channels. The corresponding basic decoder architecture is shown in Fig. 1.7. Each module in this figure simply performs the inverse, and usually simpler, operation of the corresponding module in the encoder.

The perceptual model, transient detection, and global bit allocation are usually complex and computationally expensive, so they are not suitable for inclusion in the decoder. In addition, the decoder usually does not have the relevant information to perform those operations. Therefore, these modules usually do not have counterparts in the decoder. All that the decoder needs are the results from these modules, and these can be packed into the bit stream as part of the side information.
Fig. 1.6 The basic audio encoder architecture. The solid lines represent movement of audio data and the dashed line indicates control information
Fig. 1.7 The basic audio decoder architecture. The solid lines represent movement of audio data and the dashed line indicates control information
In addition to the modules shown in Figs. 1.6 and 1.7, the development of an audio coding algorithm also involves many practical and important issues, including:

Data Structure. How transform coefficients or subband samples from all audio channels are organized and accessed by the audio coding algorithm.

Bit Stream Format. How entropy codes and bits representing other control information are packed into a coherent bit stream.
Implementation. How to structure the arithmetic of the algorithm to make the encoder and, especially, the decoder amenable to easy implementation on cheap hardware such as fixed-point microprocessors.

An important but often overlooked issue is the necessity of frame-based processing. An audio signal may last as long as a few hours, so encoding/decoding it as one monolithic piece takes a long time and demands tremendous hardware resources. The resulting monolithic encoded bit stream also makes real-time delivery and decoding impossible. A practical approach is to segment a source audio signal into consecutive frames, usually ranging from 2 to 50 ms in duration, and to code each of them in sequence. Most transforms/filter banks, such as the MDCT, are block-based, so the frame size can be conveniently set to either the block size or a multiple of it.
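A small sketch of such frame segmentation, with an assumed block size of 1,024 samples and two blocks per frame:

```python
import numpy as np

def frames(signal, block=1024, blocks_per_frame=2):
    """Yield consecutive frames whose length is a multiple of the block size."""
    size = block * blocks_per_frame
    for start in range(0, len(signal) - size + 1, size):
        yield signal[start:start + size]

x = np.random.randn(48000)            # one second of audio at 48 kHz
print(sum(1 for _ in frames(x)))      # number of whole frames that fit
```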
1.11 Performance Assessment

Performance evaluation is an essential and necessary part of any algorithm development. The intuitive performance measure for audio coding is the compression ratio defined in (1.2). Although simple, its effectiveness is very limited, mostly because it changes with time for a given audio signal, and even more dramatically between different audio signals. A rational approach is to use the worst compression ratio over a set of difficult audio signals as the compression ratio for the audio coding algorithm. This is usually enough for lossless audio coding.

For lossy audio coding, however, there is another factor that affects the usefulness of the compression ratio: the perception, or assessment, of coding distortion. The compression ratio defined in (1.2) assumes that there is no audible distortion in the decoded audio signal. This is a critical assumption that renders the compression ratio meaningful. If this assumption were removed, any algorithm could achieve the maximal possible compression ratio, which is infinity, by not sending any bits to the decoder; of course, this results in the maximal distortion, with no audio signal output by the decoder. At the other extreme, we could throw an excessive number of bits at the encoder to push the distortion far below the threshold of perception, in the process wasting the precious bit resource. It is, therefore, necessary to establish the level of just inaudible distortion before the compression ratio can be calculated. It is the compression ratio calculated at this point that authentically reflects the coding performance of the underlying audio coding algorithm.

A more widely used approach to performance assessment, especially when different audio coding algorithms are compared, is to perceptually evaluate the level of distortion for a given bit rate and a selected set of critical test signals. The perceptual evaluation of distortion must ultimately be performed by the human ear through listening tests. For the same piece of decoded audio, different people are likely to hear differently: some may hear distortion and some may not. Playback equipment and the listening conditions also significantly impact the audibility of distortion. Therefore, a set of procedures and methods for conducting casual and formal listening tests is needed; they are discussed in Chap. 14.
Part II
Quantization
Quantization literally is a process of converting the samples of a discrete-time source signal into a digital representation with reduced resolution. It is a necessary step for converting analog signals in the real world into digital signals, which enables digital signal processing. During this process of conversion, quantization also achieves a tremendous amount of compression, because an analog sample is considered as having infinite resolution, thus requiring an infinite number of bits to represent, while a digital sample is of limited resolution and is represented using a limited number of bits. This conversion also means that a tremendous amount of information is lost forever. This loss of information might be a serious concern, but it can be made imperceptible or tolerable by properly designing the quantization process. The human ear, for example, is widely believed to be unable to perceive resolution higher than 24 bits per sample. Any information or resolution beyond this may be considered irrelevant, hence can be discarded through quantization.

When a digital signal is acquired through quantizing an analog one, the primary concern is to make sure that the digital signal is obtained at the desired resolution, i.e., that no relevant information is lost. There is little, if any, attempt to seek a compact representation for the acquired digital signal. A compact representation is pursued afterwards, usually when the need for storage or transmission arises. Once the digital signal is inspected under the spotlight of compact representation, one may be surprised by the amount of unnecessary or irrelevant information that it still contains. This irrelevance can be removed by re-quantizing the already-quantized digital signal. There is essentially no difference in methodology between quantizing an analog signal and re-quantizing a digital signal, so they will not be distinguished in the treatment of this book.

Scalar quantization (SQ) quantizes a source signal one sample at a time. It is simple, but its performance is not as good as that of the more sophisticated vector quantization (VQ), which quantizes a block of input samples each time.
Chapter 2
Scalar Quantization
An audio signal is a representation of sound waves, usually in the form of a sound pressure level that varies with time. Such a signal is continuous both in value and in time, hence carries an infinite amount of information. The first step of significant compression is accomplished when a continuous-time audio signal is converted into a discrete-time signal using sampling. In what constitutes uniform sampling, the simplest sampling method, the continuous-time signal is sampled at a regular interval T, called the sampling period. According to the Nyquist–Shannon sampling theorem [65, 68], the original continuous-time signal can be perfectly reconstructed from the sampled discrete-time signal if the continuous-time signal is band-limited and its bandwidth is no more than half of the sample rate (1/T). Therefore, sampling accomplishes a tremendous amount of lossless compression if the source signal is ideally band-limited.

After sampling, each sample of the discrete-time signal has a value that is continuous, so the number of possible distinct output values is infinite. Consequently, the number of bits needed to represent and/or convey such a value exactly to a recipient is unlimited. For the human ear, however, an exact continuous sample value is unnecessary because the resolution that the ear can perceive is very limited; many believe that it is less than 24 bits. So a simple scheme of replacing an analog sample value with the integer value closest to it not only satisfies the perceptual capability of the ear, but also removes a tremendous amount of imperceptible information from a continuously valued signal. For example, the hypothetical "analog" samples in the left column of Table 2.1 may be represented by the respective integer values in the right column. This process is called quantization.

The underlying mechanism for quantizing the sample values in Table 2.1 is to divide the real number line into intervals and then map each such interval to an integer value. This is shown in Table 2.2, which is called a quantization table. The quantization process actually involves three steps, as shown in Fig. 2.1 and explained below:

Forward Quantization. A source sample value is used to look up the left column to find the interval, referred to as the decision interval, that it falls into; the corresponding index, referred to as the quantization index, in the center column is then identified. This mapping is referred to as encoder mapping.
Table 2.1 An example of mapping "analog" sample values to integer values that would take place in a process called quantization

"Analog" sound pressure level   Integer sound pressure level
−3.4164589759                   −3
−3.124341                       −3
−2.14235                        −2
−1.409086743                    −1
−0.61341984378562890423         −1
0.37892458                      0
0.61308                         1
1.831401348156                  2
2.8903219654710                 3
3.208913064                     3
Table 2.2 Quantization table that maps source sample intervals in the left column to integer values in the right column

Sample value interval   Index   Integer value
(−∞, −2.5)              0       −3
[−2.5, −1.5)            1       −2
[−1.5, −0.5)            2       −1
[−0.5, 0.5)             3       0
[0.5, 1.5)              4       1
[1.5, 2.5)              5       2
[2.5, ∞)                6       3
Fig. 2.1 Quantization involves an encoding or forward quantization stage, represented by "Q", which maps an input value to the quantization index, and a decoding or inverse quantization stage, represented by "Q⁻¹", which maps the quantization index to the quantized value
Index Transmission. The quantization index is transmitted to the receiver.

Inverse Quantization. Upon receiving the quantization index, the receiver uses it to read out the integer value, referred to as the quantized value, in the right column. This mapping is referred to as decoder mapping.

The quantization table above maps sound pressure levels with infinite range and resolution into seven integers, which need only 3 bits to represent, thus achieving a great deal of data compression. However, this comes with a price: much of the original resolution is lost forever. This loss of information may be significant, but it is done on purpose: since the lost pieces of information are irrelevant to our needs or perception, we can afford to discard them.
2.1 Scalar Quantization

To pose the quantization process outlined above mathematically, let us consider a source random variable X with a probability density function (PDF) p(X). Suppose that we wish to quantize this source with M decision intervals defined by the following M + 1 endpoints

$$b_q, \quad q = 0, 1, \ldots, M, \quad (2.1)$$

referred to as decision boundaries, and with the following M quantized values

$$\hat{x}_q, \quad q = 1, 2, \ldots, M, \quad (2.2)$$

which are also called output values or representative values. A source sample value x is quantized to the quantization index q if and only if x falls into the qth decision interval

$$\delta_q = [b_{q-1}, b_q), \quad (2.3)$$

so the operation of forward quantization is

$$q = Q(x) \quad \text{if and only if} \quad b_{q-1} \le x < b_q. \quad (2.4)$$

The quantized value can be reconstructed from the quantization index by the following inverse quantization

$$\hat{x}_q = Q^{-1}(q), \quad (2.5)$$

which is also referred to as backward quantization. Since q is a function of x as shown in (2.4), $\hat{x}_q$ is also a function of x and can be written as

$$\hat{x}(x) = \hat{x}_q = Q^{-1}[Q(x)]. \quad (2.6)$$

This quantization scheme is called scalar quantization (SQ) because the source signal is quantized one sample at a time. The function in (2.6) is another way of describing the input–output map of a quantizer, in addition to the quantization table. Figure 2.2 shows such a function, describing the quantization map of Table 2.2. The quantization operation in (2.4) obviously causes much loss of information: the reconstructed quantized value obtained from (2.5) or (2.6) is different from the input to the quantizer. The difference between them is called the quantization error

$$q(x) = \hat{x}(x) - x. \quad (2.7)$$

It is also referred to as quantization distortion or quantization noise. Equation (2.7) may be rewritten as

$$\hat{x}(x) = x + q(x), \quad (2.8)$$
so the quantization process is often modeled as an additive noise process as shown in Fig. 2.3.
Fig. 2.2 Input–output map for the quantizer shown in Table 2.2
Fig. 2.3 Additive noise model for quantization
The average loss of information introduced by quantization may be characterized by the average quantization error. Among the many norms that may be used to measure this error, the L-2 norm or Euclidean distance is usually used, leading to the mean squared quantization error (MSQE):

$$\sigma_q^2 = \int_{-\infty}^{\infty} q^2(x)\,p(x)\,dx = \int_{-\infty}^{\infty} \left(\hat{x}(x) - x\right)^2 p(x)\,dx = \sum_{q=1}^{M} \int_{b_{q-1}}^{b_q} \left(\hat{x}(x) - x\right)^2 p(x)\,dx. \quad (2.9)$$
Since $\hat{x}(x) = \hat{x}_q$ is a constant within the decision interval $[b_{q-1}, b_q)$, we have

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{b_{q-1}}^{b_q} \left(\hat{x}_q - x\right)^2 p(x)\,dx. \quad (2.10)$$
The MSQE may be better appreciated when compared with the power of the source signal. This may be achieved using the signal-to-noise ratio (SNR) defined below:

$$\text{SNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right), \quad (2.11)$$

where $\sigma_x^2$ is the variance of the source signal.

It is obvious that the smaller the decision intervals, the smaller the error term $(\hat{x}_q - x)^2$ in (2.10), and thus the smaller the mean squared quantization error $\sigma_q^2$. This indicates that $\sigma_q^2$ decreases as the number of decision intervals M increases. The placement of each individual decision boundary and quantized value also plays a major role in the final $\sigma_q^2$. The problem of quantizer design may be posed in a variety of ways, including:

Given a fixed M:
$$M = \text{constant}, \quad (2.12)$$
find the optimal placement of decision boundaries and quantized values so that $\sigma_q^2$ is minimized. This is the most widely used approach.

Given a distortion constraint:
$$\sigma_q^2 < \text{threshold}, \quad (2.13)$$
find the optimal placement of decision boundaries and quantized values so that M is minimized. A minimal M means a minimal number of bits needed to represent the quantized values, hence a minimal bit rate.
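As a hedged numerical complement to (2.9)-(2.11), the sketch below estimates the MSQE and SNR of the unit-step midtread quantizer of Table 2.2 on Gaussian samples by simulation; the sample count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100000)

x_hat = np.clip(np.round(x), -3, 3)        # step size 1, seven levels
msqe = np.mean((x_hat - x) ** 2)
snr_db = 10 * np.log10(np.var(x) / msqe)
print(msqe, snr_db)
```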
2.2 Re-Quantization

The quantization process was presented above under the assumption that the source random variable or sample values are continuous, or analog. The name "quantization" usually gives the impression that it is only for quantizing analog sample values. When dealing with such analog source sample values, the associated forward quantization is referred to as ADC (analog-to-digital conversion) and the inverse quantization as DAC (digital-to-analog conversion).
Table 2.3 A quantization table for "re-quantizing" a discrete source

Decision interval   Quantization index   Re-quantized value
[0, 10)             0                    5
[10, 20)            1                    15
[20, 30)            2                    25
[30, 40)            3                    35
[40, 50)            4                    45
[50, 60)            5                    55
[60, 70)            6                    65
[70, 80)            7                    75
[80, 90)            8                    85
[90, 100)           9                    95
Discrete source sample values can also be further quantized. For example, consider a source that takes integer sample values between 0 and 100. If it is decided, for some reason, that this resolution is excessive or irrelevant for a particular application, and sample values spaced at an interval of 10 are really what is needed, the quantization table shown in Table 2.3 can be established to re-quantize the integer sample values. With discrete source sample values, the formulation of the quantization process in Sect. 2.1 is still valid, with the probability density function replaced by a probability mass function and integration replaced by summation.
2.3 Uniform Quantization

Both Tables 2.2 and 2.3 embody uniform quantization, which is the simplest of all quantization schemes. The decision boundaries of a uniform quantizer are equally spaced, so its decision intervals are all of the same length and can be represented by a constant called the quantization step size. For example, the quantization step size is 1 for Table 2.2 and 10 for Table 2.3. When an analog signal is uniformly sampled and subsequently quantized using a uniform quantizer, the resulting digital representation is called pulse-code modulation (PCM). It is the default representation for many digital signals, such as speech, audio, and video.
2.3.1 Formulation

Let us consider a uniform quantizer that covers an interval $[X_{\min}, X_{\max}]$ of a random variable X with M decision intervals. Its quantization step size is

$$\Delta = \frac{X_{\max} - X_{\min}}{M}, \quad (2.14)$$

so its decision boundaries can be represented as

$$b_q = X_{\min} + q\Delta, \quad q = 0, 1, \ldots, M. \quad (2.15)$$
The mean of a decision interval is often selected as the quantized value for that interval:

$$\hat{x}_q = X_{\min} + (q - 0.5)\Delta, \quad q = 1, 2, \ldots, M. \quad (2.16)$$

For such a quantization scheme, the MSQE in (2.10) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{X_{\min}+(q-1)\Delta}^{X_{\min}+q\Delta} \left[ X_{\min} + (q - 0.5)\Delta - x \right]^2 p(x)\,dx. \quad (2.17)$$

Let $y = X_{\min} + (q - 0.5)\Delta - x$; then (2.17) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} y^2\, p\!\left[ X_{\min} + q\Delta - (y + 0.5\Delta) \right] dy. \quad (2.18)$$

Plugging in (2.15), (2.18) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, p\!\left[ b_q - (x + 0.5\Delta) \right] dx. \quad (2.19)$$

Plugging in (2.16), (2.18) becomes

$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, p(\hat{x}_q - x)\,dx. \quad (2.20)$$
2.3.2 Midtread and Midrise Quantizers

There are two major types of uniform quantizers. The one shown in Fig. 2.2 is called midtread because it has zero as one of its quantized values. It is useful in situations where the zero value must be represented; one such example is control systems, where a zero value needs to be accurately represented. This is also
important for audio signals because the zero value is needed to represent absolute quiet. Due to the midtreading of zero, the number of decision intervals M is odd if a symmetric sample value range ($X_{\min} = -X_{\max}$) is to be covered. Since both the decision boundaries and the quantized values can be derived from a single step size, the implementation of the midtread uniform quantizer is simple and straightforward. The forward quantizer may be implemented as

$$q = \text{round}\left(\frac{x}{\Delta}\right), \quad (2.21)$$

where round(·) is the rounding function, which returns the integer closest to its input. The corresponding inverse quantizer may be implemented as

$$\hat{x}_q = q\Delta. \quad (2.22)$$
The other type of uniform quantizer does not have zero as one of its quantized values, so it is called midrise. This is shown in Fig. 2.4. Its number of decision intervals is even if a symmetric sample value range is to be covered. The forward quantizer may be implemented as

$$q = \begin{cases} \text{truncate}\left(\dfrac{x}{\Delta}\right) + 1, & \text{if } x > 0; \\[4pt] \text{truncate}\left(\dfrac{x}{\Delta}\right) - 1, & \text{otherwise}; \end{cases} \quad (2.23)$$
Fig. 2.4 An example of midrise quantizer
where truncate(·) is the truncation function, which returns the integer part of its input, dropping the fractional digits. Note that q = 0 is forbidden for a midrise quantizer. The corresponding inverse quantizer is

$$\hat{x}_q = \begin{cases} (q - 0.5)\Delta, & \text{if } q > 0; \\ (q + 0.5)\Delta, & \text{otherwise}. \end{cases} \quad (2.24)$$
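The sketch below is a direct, hedged transcription of (2.21)-(2.24); note that Python's built-in round uses round-half-to-even, while a production implementation would pick an explicit tie-breaking convention.

```python
import math

def midtread_q(x, step):          # (2.21)
    return round(x / step)

def midtread_iq(q, step):         # (2.22)
    return q * step

def midrise_q(x, step):           # (2.23); q = 0 never occurs
    t = math.trunc(x / step)
    return t + 1 if x > 0 else t - 1

def midrise_iq(q, step):          # (2.24)
    return (q - 0.5) * step if q > 0 else (q + 0.5) * step

print(midtread_iq(midtread_q(0.7, 1.0), 1.0))   # 1.0
print(midrise_iq(midrise_q(0.7, 1.0), 1.0))     # 0.5
```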
2.3.3 Uniformly Distributed Signals

As seen in (2.20), the MSQE of a uniform quantizer depends on the probability density function. When this density function is uniform over $[X_{\min}, X_{\max}]$:

$$p(x) = \frac{1}{X_{\max} - X_{\min}}, \quad x \in [X_{\min}, X_{\max}], \quad (2.25)$$

(2.20) becomes

$$\sigma_q^2 = \frac{1}{X_{\max} - X_{\min}} \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} y^2\,dy = \frac{1}{X_{\max} - X_{\min}} \sum_{q=1}^{M} \frac{\Delta^3}{12} = \frac{M\Delta^3}{12\,(X_{\max} - X_{\min})}.$$

Due to the step size given in (2.14), the above equation becomes

$$\sigma_q^2 = \frac{\Delta^2}{12}. \quad (2.26)$$

For the uniform distribution in (2.25), the variance (signal power) is

$$\sigma_x^2 = \frac{1}{X_{\max} - X_{\min}} \int_{X_{\min}}^{X_{\max}} x^2\,dx = \frac{(X_{\max} - X_{\min})^2}{12}, \quad (2.27)$$
so the signal-to-noise ratio (SNR) of the uniform quantizer is

$$\text{SNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right) = 10 \log_{10}\left[\frac{(X_{\max} - X_{\min})^2}{12} \cdot \frac{12}{\Delta^2}\right] = 20 \log_{10}\left(\frac{X_{\max} - X_{\min}}{\Delta}\right). \quad (2.28)$$

Due to the step size given in (2.14), the above SNR expression becomes

$$\text{SNR (dB)} = 20 \log_{10}(M) = \frac{20}{\log_2(10)} \log_2(M) \approx 6.02 \log_2(M). \quad (2.29)$$

If the quantization indexes are represented using fixed-length codes, each codeword can be represented using

$$R = \lceil \log_2(M) \rceil \text{ bits}, \quad (2.30)$$

which is referred to as bits per sample or bit rate. Consequently, (2.29) becomes

$$\text{SNR (dB)} = \frac{20}{\log_2(10)}\,R \approx 6.02R \text{ dB}, \quad (2.31)$$
which indicates that, for each additional bit allocated to the quantizer, the SNR is increased by about 6.02 dB.
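The roughly 6.02 dB-per-bit rule in (2.31) is easy to verify by simulation; the following sketch assumes a uniform source on [-1, 1] and mid-interval reconstruction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200000)

for R in range(2, 9):
    M = 2 ** R
    step = 2.0 / M
    q = np.minimum(np.floor((x + 1.0) / step), M - 1)   # index 0..M-1
    x_hat = -1.0 + (q + 0.5) * step                     # interval midpoints
    snr = 10 * np.log10(np.var(x) / np.mean((x_hat - x) ** 2))
    print(R, round(snr, 2))                             # ~6.02 * R dB
```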
2.3.4 Nonuniformly Distributed Signals

Most signals, and audio signals in particular, are rarely uniformly distributed. As indicated by (2.20), the contribution of each quantization error to the MSQE is weighted by the probability density function. A nonuniform distribution means that this weighting differs, so a different MSQE is expected; it is discussed in this section.
2.3.4.1 Granular and Overload Error

A nonuniformly distributed signal, such as a Gaussian one, is usually not bounded, so the dynamic range $[X_{\min}, X_{\max}]$ of a uniform quantizer cannot cover the whole range of the source signal. This is illustrated in Fig. 2.5. The areas beyond $[X_{\min}, X_{\max}]$ are called overload areas. When a source sample falls into an overload area, the quantizer can only assign either the minimum or the maximum quantized value to it:
Fig. 2.5 Overload and granular quantization errors
$$\hat{x}(x) = \begin{cases} X_{\max} - 0.5\Delta, & \text{if } x > X_{\max}; \\ X_{\min} + 0.5\Delta, & \text{if } x < X_{\min}. \end{cases} \quad (2.32)$$
This introduces additional quantization error, called overload error or overload noise. The mean squared overload error is obviously

$$\sigma_{q(\text{overload})}^2 = \int_{X_{\max}}^{\infty} \left[ x - (X_{\max} - 0.5\Delta) \right]^2 p(x)\,dx + \int_{-\infty}^{X_{\min}} \left[ x - (X_{\min} + 0.5\Delta) \right]^2 p(x)\,dx. \quad (2.33)$$
The MSQE given in (2.17) accounts only for the quantization error within $[X_{\min}, X_{\max}]$, which is referred to as granular error or granular noise. The total quantization error is

$$\sigma_{q(\text{total})}^2 = \sigma_q^2 + \sigma_{q(\text{overload})}^2. \quad (2.34)$$

For a given PDF p(x) and number of decision intervals M, (2.17) indicates that the smaller the quantization step size $\Delta$, the smaller the granular quantization noise $\sigma_q^2$. According to (2.14), however, a smaller quantization step size also translates into smaller $X_{\min}$ and $X_{\max}$ for a fixed M. Smaller $X_{\min}$ and $X_{\max}$ obviously lead to larger overload areas, hence a larger overload quantization error $\sigma_{q(\text{overload})}^2$. Therefore, the choice of $\Delta$, or equivalently of the range $[X_{\min}, X_{\max}]$ of the uniform quantizer, represents a trade-off between granular and overload quantization errors. This trade-off is, of course, relative to the effective width of the given PDF, which may be characterized by its standard deviation $\sigma$. The ratio of the quantization range to the signal deviation,

$$F_l = \frac{X_{\max} - X_{\min}}{\sigma}, \quad (2.35)$$

called the loading factor, is apparently a good description of this trade-off. For a Gaussian distribution, a loading factor of 4 means that the probability of input
samples going beyond the range is 0.045. For a loading factor of 6, the probability reduces to 0.0027. For most applications, a loading factor of 4 is sufficient.
2.3.4.2 Optimal SNR and Step Size

To find the optimal quantization step size that gives the minimum total MSQE $\sigma_{q(\text{total})}^2$, let us drop (2.17) and (2.33) into (2.34) to obtain

$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{X_{\min}+(q-1)\Delta}^{X_{\min}+q\Delta} \left[ x - \left( X_{\min} + (q-0.5)\Delta \right) \right]^2 p(x)\,dx + \int_{X_{\max}}^{\infty} \left[ x - (X_{\max} - 0.5\Delta) \right]^2 p(x)\,dx + \int_{-\infty}^{X_{\min}} \left[ x - (X_{\min} + 0.5\Delta) \right]^2 p(x)\,dx. \quad (2.36)$$

Usually, a uniform quantizer is symmetrically designed such that

$$X_{\min} = -X_{\max}. \quad (2.37)$$

Then (2.14) becomes

$$\Delta = \frac{2X_{\max}}{M}. \quad (2.38)$$

Replacing all $X_{\min}$ and $X_{\max}$ with $\Delta$ using the above equations, we have

$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{(q-1-0.5M)\Delta}^{(q-0.5M)\Delta} \left[ (q - 0.5 - 0.5M)\Delta - x \right]^2 p(x)\,dx + \int_{0.5M\Delta}^{\infty} \left[ x - 0.5(M-1)\Delta \right]^2 p(x)\,dx + \int_{-\infty}^{-0.5M\Delta} \left[ x + 0.5(M-1)\Delta \right]^2 p(x)\,dx. \quad (2.39)$$

Assuming a symmetric PDF,

$$p(x) = p(-x), \quad (2.40)$$

and doing a variable change of $y = -x$ in the last term of (2.39), it turns out that this last term becomes the same as the second term, so (2.39) becomes

$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{(q-1-0.5M)\Delta}^{(q-0.5M)\Delta} \left[ (q - 0.5 - 0.5M)\Delta - x \right]^2 p(x)\,dx + 2 \int_{0.5M\Delta}^{\infty} \left[ x - 0.5(M-1)\Delta \right]^2 p(x)\,dx. \quad (2.41)$$

Now that both (2.39) and (2.41) are functions of $\Delta$ only, their minimum can be found by setting their first-order derivative with respect to $\Delta$ to zero:

$$\frac{\partial}{\partial \Delta}\,\sigma_{q(\text{total})}^2 = 0. \quad (2.42)$$

This equation can be solved using a variety of numerical methods; see [76], for example. Figure 2.6 shows the optimal SNR achieved by a uniform quantizer at various bits per sample (see (2.30)) for Gaussian, Laplacian, and Gamma distributions [33]. The SNR given in (2.31) for the uniform distribution, which is the best SNR that a uniform quantizer can achieve, is plotted as the benchmark. It is a straight line of the form

$$\text{SNR}(R) = a + bR \text{ (dB)}, \quad (2.43)$$

with a slope of

$$b = \frac{20}{\log_2(10)} \approx 6.02 \quad (2.44)$$

and an intercept of

$$a = 0. \quad (2.45)$$
Fig. 2.6 Optimal SNR achieved by a uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions
Apparently, the curves for the other PDFs also fit straight lines well, with different slopes and intercepts. Notice that both the slope b and the intercept a decrease as the peakedness, or kurtosis, of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a uniform quantizer is inversely related to PDF kurtosis. This degradation in performance is mostly reflected in the intercept a; the slope b is only moderately affected. There is, nevertheless, a reduction in slope when compared with the uniform distribution. This reduction indicates that the quantization performance for the other distributions, relative to the uniform distribution, becomes worse at higher bit rates.

Figure 2.7 shows the optimal step size normalized by the signal deviation, $\Delta_{\text{opt}}/\sigma_x$, for Gaussian, Laplacian, and Gamma distributions as a function of the number of bits per sample [33]. The data for the uniform distribution is used as the benchmark. Due to (2.14), (2.27), and (2.30), the normalized quantization step size for the uniform distribution is

$$\log_{10}\left(\frac{\Delta}{\sigma_x}\right) = \log_{10}\left(\frac{2\sqrt{3}}{M}\right) = \log_{10}\!\left(2\sqrt{3}\right) - \frac{\log_2(M)}{\log_2(10)} = \log_{10}\!\left(2\sqrt{3}\right) - \frac{R}{\log_2(10)}, \quad (2.46)$$

so it is a straight line. Apparently, as the peakedness or kurtosis increases in the order of uniform, Gaussian, Laplacian, and Gamma distributions, the step size also increases. This is necessary for an optimal balance between granular and overload quantization errors: an increased kurtosis means that the probability density is spread more toward the tails, resulting in more overload error, so the step size has to be increased to counteract this increased overload error.
Fig. 2.7 Optimal step size used by a uniform quantizer to achieve optimal SNR for uniform, Gaussian, Laplacian, and Gamma distributions
The empirical formula (2.43) is very useful for estimating the minimal total MSQE for a particular quantizer, given the signal power and bit rate. In particular, dropping the SNR definition (2.11) into (2.43), we can represent the total MSQE as

$$10 \log_{10} \sigma_q^2 = 10 \log_{10} \sigma_x^2 - a - bR \quad (2.47)$$

or

$$\sigma_q^2 = 10^{-0.1(a + bR)}\,\sigma_x^2. \quad (2.48)$$
2.4 Nonuniform Quantization

Since the MSQE formula (2.10) indicates that the quantization error incurred by a source sample x is weighted by the PDF p(x), one approach to reducing the MSQE is to reduce the quantization error in densely distributed areas, where the weight is heavy. Formula (2.10) also indicates that the quantization error incurred by a source sample value x is actually the distance between it and the quantized value $\hat{x}$, so large quantization errors are caused by input samples far away from the quantized value, i.e., those near the decision boundaries. Therefore, reducing quantization errors in densely distributed areas necessitates using smaller decision intervals there. For a given number of decision intervals M, this also means that larger decision intervals need to be placed over the rest of the PDF support so that the whole input range is covered.

From the perspective of resource allocation, each quantization index is a piece of bit resource that is allocated in the course of quantizer design, and there are only M pieces of resources. A quantization index is associated one-to-one with a quantized value and a decision interval, so a piece of resource may be considered as consisting of a quantization index, a quantized value, and a decision interval. The problem of quantizer design may then be posed as the optimal allocation of these resources to minimize the total MSQE. To achieve this, each piece of resource should be allocated so as to carry the same share of quantization error contribution to the total MSQE. In other words, the MSQE contribution carried by individual pieces of resource should be "equalized".

For a uniform quantizer, resources are allocated uniformly, except for the first and last quantized values, which cover the overload areas. As shown at the top of Fig. 2.8, its resources in the tail areas of the PDF are not fully utilized because the low probability density, or weight, causes them to carry too little MSQE contribution. Similarly, its resources in the head area are over-utilized because the high probability density, or weight, causes them to carry too much MSQE contribution. To reduce the overall MSQE, those mis-allocated resources need to be redistributed in such a way that the MSQE produced by individual pieces of resource is equalized. This is shown at the bottom of Fig. 2.8.
Fig. 2.8 Quantization resources are under-utilized by the uniform quantizer (top) in the tail areas and over-utilized in the head area of the PDF. These resources are re-distributed in the nonuniform quantizer (bottom) so that individual pieces of resources carry the same amount of MSQE contribution, leading to smaller MSQE
The above two considerations indicate that the MSQE can be reduced by making the size of the decision intervals inversely proportional to the probability density. The consequence of this strategy is that the more densely distributed the PDF, the more densely placed the decision intervals, and thus the smaller the MSQE.

One approach to nonuniform quantizer design is to pose it as an optimization problem: finding the decision intervals and quantized values that minimize the MSQE. This leads to the Lloyd-Max algorithm. Another approach is to transform the source signal through a nonlinear function in such a way that the transformed signal has a PDF that is almost uniform; then a uniform quantizer may be used to deliver improved performance. This leads to companding.
2.4.1 Optimal Quantization and Lloyd-Max Algorithm

Given a PDF p(x) and a number of decision intervals M, one approach to the design of a nonuniform quantizer is to find the set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ such that the MSQE in (2.10) is minimized. Towards the solution of this optimization problem, let us first consider the following partial derivative:

$$\frac{\partial \sigma_q^2}{\partial \hat{x}_q} = 2 \int_{b_{q-1}}^{b_q} (\hat{x}_q - x)\,p(x)\,dx = 2\hat{x}_q \int_{b_{q-1}}^{b_q} p(x)\,dx - 2 \int_{b_{q-1}}^{b_q} x\,p(x)\,dx. \quad (2.49)$$

Setting it to zero, we have

$$\hat{x}_q = \frac{\int_{b_{q-1}}^{b_q} x\,p(x)\,dx}{\int_{b_{q-1}}^{b_q} p(x)\,dx}, \quad (2.50)$$

which indicates that the quantized value for each decision interval is the centroid of the probability mass in the interval. Let us now consider another partial derivative:

$$\frac{\partial \sigma_q^2}{\partial b_q} = (\hat{x}_q - b_q)^2\,p(b_q) - (\hat{x}_{q+1} - b_q)^2\,p(b_q). \quad (2.51)$$

Setting it to zero, we have

$$b_q = \frac{1}{2}\left(\hat{x}_q + \hat{x}_{q+1}\right), \quad (2.52)$$

which indicates that each decision boundary is simply the midpoint of the neighboring quantized values. Solving (2.50) and (2.52) would give us the optimal set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ that minimizes $\sigma_q^2$. Unfortunately, to solve (2.50) for $\hat{x}_q$ we need $b_{q-1}$ and $b_q$, but to solve (2.52) for $b_q$ we need $\hat{x}_q$ and $\hat{x}_{q+1}$, so the problem is a little difficult.
2.4.1.1 Uniform Quantizer as a Special Case

Let us consider the simple case where the probability distribution is uniform, as given in (2.25). For such a distribution, (2.50) becomes

$$\hat{x}_q = \frac{b_{q-1} + b_q}{2}. \quad (2.53)$$

Incrementing q in this equation, we have

$$\hat{x}_{q+1} = \frac{b_q + b_{q+1}}{2}. \quad (2.54)$$

Dropping (2.53) and (2.54) into (2.52), we have

$$4b_q = b_{q-1} + b_q + b_q + b_{q+1}, \quad (2.55)$$

which leads us to

$$b_{q+1} - b_q = b_q - b_{q-1}. \quad (2.56)$$

Let us denote

$$b_q - b_{q-1} = \Delta; \quad (2.57)$$

plugging it into (2.56), we have

$$b_{q+1} - b_q = \Delta. \quad (2.58)$$
Therefore, we can conclude by induction on q that all decision boundaries are uniformly spaced. For the quantized values, let us subtract (2.53) from (2.54) to give

$$\hat{x}_{q+1} - \hat{x}_q = \frac{b_{q+1} - b_q + b_q - b_{q-1}}{2}. \quad (2.59)$$

Plugging in (2.57) and (2.58), we have

$$\hat{x}_{q+1} - \hat{x}_q = \Delta, \quad (2.60)$$

which indicates that the quantized values are also uniformly spaced. Therefore, the uniform quantizer is optimal for the uniform distribution.
2.4.1.2 Lloyd-Max Algorithm

The Lloyd-Max algorithm is an iterative procedure for solving (2.50) and (2.52) for an arbitrary distribution, so an optimal quantizer is also referred to as a Lloyd-Max quantizer. Note that its convergence is not proven, only experimentally observed. Before presenting the algorithm, let us first note that we already know the first and last decision boundaries:

$$b_0 = X_{\min} \quad \text{and} \quad b_M = X_{\max}. \quad (2.61)$$

For unbounded inputs, we may set $X_{\min} = -\infty$ and/or $X_{\max} = \infty$. Also, we rearrange (2.52) into

$$\hat{x}_{q+1} = 2b_q - \hat{x}_q. \quad (2.62)$$

The algorithm involves the following iterative steps:

1. Make a guess for $\hat{x}_1$.
2. Let q = 1.
3. Plug $\hat{x}_q$ and $b_{q-1}$ into (2.50) to solve for $b_q$. This may be done by integrating the two integrals in (2.50) forward from $b_{q-1}$ until the equation holds.
4. Plug $\hat{x}_q$ and $b_q$ into (2.62) to get a new $\hat{x}_{q+1}$.
5. Let q = q + 1.
6. Go back to step 3 unless q = M.
7. When q = M, calculate

$$\epsilon = \hat{x}_M - \frac{\int_{b_{M-1}}^{b_M} x\,p(x)\,dx}{\int_{b_{M-1}}^{b_M} p(x)\,dx}. \quad (2.63)$$

8. Stop if

$$|\epsilon| < \text{predetermined threshold}. \quad (2.64)$$

9. Decrease $\hat{x}_1$ if $\epsilon > 0$ and increase $\hat{x}_1$ otherwise.
10. Go back to step 2.

A little explanation is in order for (2.63). The iterative procedure provides us with an $\hat{x}_M$ upon entering step 7, which is used as the first term on the right of (2.63). On the other hand, since we know $b_M$ from (2.61), we can use it with the $b_{M-1}$ provided by the procedure to obtain another estimate of $\hat{x}_M$ using (2.50); this is the second term on the right side of (2.63). The two estimates of the same $\hat{x}_M$ should be equal if equations (2.50) and (2.52) are solved. Therefore, we stop the iteration at step 8 when the absolute value of their difference is smaller than some predetermined threshold.

The adjustment procedure for $\hat{x}_1$ at step 9 is also easily explained. The iterative procedure is started with a guess for $\hat{x}_1$ at step 1. Based on this guess, a whole set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ is obtained through steps 2-8. If the guess is off, the whole set derived from it is off. In particular, if the guess is too large, the resulting $\hat{x}_M$ will be too large. This will cause $\epsilon > 0$, so $\hat{x}_1$ needs to be reduced, and vice versa.
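The sketch below is a hedged illustration of optimal quantizer design. Instead of the forward sweep over $\hat{x}_1$ described above, it uses the common alternating form of the Lloyd iteration, applying (2.50) (centroids) and (2.52) (midpoints) in turn, with the PDF represented empirically by a large sample set.

```python
import numpy as np

def lloyd_max(samples, M, iters=100):
    x = np.sort(samples)
    # initial quantized values: M evenly spaced sample quantiles
    xh = np.quantile(x, (np.arange(M) + 0.5) / M)
    for _ in range(iters):
        b = 0.5 * (xh[:-1] + xh[1:])              # (2.52): midpoints
        idx = np.searchsorted(b, x)               # assign samples to cells
        xh = np.array([x[idx == q].mean() if np.any(idx == q) else xh[q]
                       for q in range(M)])        # (2.50): centroids
    return b, xh

rng = np.random.default_rng(2)
b, xh = lloyd_max(rng.normal(size=100000), M=4)
print(np.round(b, 3), np.round(xh, 3))
# Converges near the known 4-level Gaussian design:
# boundaries ~ (-0.98, 0, 0.98), outputs ~ (-1.51, -0.45, 0.45, 1.51)
```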
2.4.1.3 Performance Gain

Figure 2.9 shows the optimal SNR achieved by the Lloyd-Max algorithm for uniform, Gaussian, Laplacian, and Gamma distributions against the number of bits per sample [33]. Since the uniform quantizer is optimal for the uniform distribution, its optimal SNR curve in Fig. 2.9 is the same as in Fig. 2.6, and thus can serve as the reference. Notice that the optimal SNR curves for the other distributions are closer to this curve in Fig. 2.9 than in Fig. 2.6. This indicates that, for a given number of bits per sample, optimal nonuniform quantization achieves better SNR than optimal uniform quantization.
Fig. 2.9 Optimal SNR versus bits per sample achieved by Lloyd-Max algorithm for uniform, Gaussian, Laplacian, and Gamma distributions
Apparently, the optimal SNR curves in Fig. 2.9 also fit straight lines well, so they can be approximated by the same equation given in (2.43), with improved slope b and intercept a. The improved performance of nonuniform quantization also results in better fits to a straight line than those in Fig. 2.6. Similar to uniform quantization in Fig. 2.6, both the slope b and the intercept a decrease as the peakedness or kurtosis of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a Lloyd-Max quantizer is inversely related to PDF kurtosis. Compared with the uniform distribution, all other distributions have reduced slopes b, indicating that their performance relative to the uniform distribution becomes worse as the bit rate increases. However, the degradations of both a and b are less conspicuous than those in Fig. 2.6.

To compare the performance of the Lloyd-Max quantizer and the uniform quantizer, Fig. 2.10 shows the optimal SNR gain of the Lloyd-Max quantizer over the uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions:

$$\text{Optimal SNR Gain} = \text{SNR}_{\text{Nonuniform}} - \text{SNR}_{\text{Uniform}},$$

where $\text{SNR}_{\text{Nonuniform}}$ is taken from Fig. 2.9 and $\text{SNR}_{\text{Uniform}}$ from Fig. 2.6. Since the Lloyd-Max quantizer for the uniform distribution is a uniform quantizer, the optimal SNR gain is zero for the uniform distribution. It is obvious that the optimal SNR gain is more pronounced when the distribution is more peaked, i.e., of larger kurtosis.
Fig. 2.10 Optimal SNR gain of Lloyd-Max quantizer over uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions
2.4.2 Companding

Finding the whole set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ for an optimal nonuniform quantizer using the Lloyd-Max algorithm usually involves a large number of iterations, hence may be computationally intensive, especially for a large M. The storage requirement for these decision boundaries and quantized values may also become excessive, especially for the decoder. Companding is an alternative.

Companding is motivated by the observation that a uniform quantizer is simple and effective for a matching uniformly distributed source signal. For a nonuniformly distributed source signal, one could use a nonlinear function f(x) to convert it into another one with a PDF similar to a uniform distribution; then the simple and effective uniform quantizer could be used. After the quantization indexes are transmitted to and subsequently received by the decoder, they are first inversely quantized to reconstruct the uniformly quantized values, and then the inverse function $f^{-1}(x)$ is applied to produce the final quantized values. This process is illustrated in Fig. 2.11.

The nonlinear function in Fig. 2.11 is called a compressor because it usually has a shape similar to that shown in Fig. 2.12, which stretches the source signal where its sample values are small and compresses it otherwise. This shape is chosen to match the typical shape of PDFs such as the Gaussian and Laplacian, which have a large probability density at small absolute sample values and tail off towards large absolute sample values, in order to make the converted signal have a PDF similar to a uniform distribution.
Fig. 2.11 The source sample value is first converted by the compressor into another one with a PDF similar to a uniform distribution. It is then quantized by a uniform quantizer and the quantization index is transmitted to the decoder. After inverse quantization at the decoder, the uniformly quantized value is converted by the expander to produce the final quantized value
Fig. 2.12 μ-law companding deployed in North American and Japanese telecommunication systems
The inverse function is called an expander because the inverse of compression is expansion. After the compression-expansion, hence "companding", the effective decision boundaries, when viewed from the expander output, are nonuniform, so the overall effect is nonuniform quantization.

When companding is actually used in speech and audio applications, additional consideration is given to the perceptual properties of the human ear. Since the perception of loudness by the human ear may be considered logarithmic, logarithmic companding is widely used.

2.4.2.1 Speech Processing

In speech processing, μ-law companding, deployed in North American and Japanese telecommunication systems, has a compression function given by [33]
$$y = f(x) = \text{sign}(x)\,\frac{\ln(1 + \mu|x|)}{\ln(1 + \mu)}, \quad -1 \le x \le 1, \quad (2.65)$$

where μ = 256 and x is the normalized sample value to be companded, limited to 13 magnitude bits. Its corresponding expanding function is

$$x = f^{-1}(y) = \text{sign}(y)\,\frac{(1 + \mu)^{|y|} - 1}{\mu}, \quad -1 \le y \le 1. \quad (2.66)$$
Both functions are plotted in Fig. 2.12. A similar companding scheme, called A-law companding, is deployed in Europe; its compression function is

$$y = f(x) = \frac{\text{sign}(x)}{1 + \ln(A)} \begin{cases} A|x|, & 0 \le |x| \le \frac{1}{A}; \\ 1 + \ln(A|x|), & \frac{1}{A} < |x| \le 1; \end{cases} \quad (2.67)$$

where A = 87.7 and the normalized sample value x is limited to 12 magnitude bits. Its corresponding expanding function is

$$x = f^{-1}(y) = \text{sign}(y) \begin{cases} \dfrac{1 + \ln(A)}{A}\,|y|, & 0 \le |y| < \dfrac{1}{1 + \ln(A)}; \\[6pt] \dfrac{e^{\,|y|(1 + \ln(A)) - 1}}{A}, & \dfrac{1}{1 + \ln(A)} \le |y| \le 1. \end{cases} \quad (2.68)$$
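A minimal sketch of the μ-law pair (2.65)-(2.66) follows. The text uses μ = 256 while the telephony standards specify μ = 255, so μ is kept as a parameter here; x and y are assumed normalized to [-1, 1].

```python
import numpy as np

def mu_compress(x, mu=255.0):             # (2.65)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu=255.0):               # (2.66)
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1, 1, 5)
y = mu_compress(x)
print(np.max(np.abs(mu_expand(y) - x)))   # ~0: expander inverts compressor
```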
It is usually very difficult to implement both the logarithmic and the exponential functions used in the companding schemes above, especially on embedded microprocessors with limited resources; many such processors do not even have a floating-point unit. Therefore, the companding functions are usually implemented using piece-wise linear approximation. This is adequate due to the fairly low requirement for speech quality in telephone systems.
2.4.2.2 Audio Coding

Companding is not as widely used in audio coding as in speech processing, partly due to the higher quality requirement and the wider dynamic range, which render implementation more difficult. However, MPEG 1&2 Layer III [55, 56] and MPEG 2&4 AAC [59, 60] use the following exponential compression function to quantize MDCT coefficients:

$$y = f(x) = \text{sign}(x)\,|x|^{3/4}, \quad (2.69)$$

which may be considered an approximation to the logarithmic function. The allowed compressed dynamic range is $-8191 \le y \le 8191$. The corresponding expanding function is obviously

$$x = f^{-1}(y) = \text{sign}(y)\,|y|^{4/3}. \quad (2.70)$$
The implementation cost of the above exponential function is a notable issue in decoder development. Piece-wise linear approximation may lead to degradation in audio quality, hence may be unacceptable for high-fidelity applications. Another alternative is to store the exponential function as a lookup table; with each of the $2^{13}$ entries stored using 24 bits, this amounts to $2^{13} \times 3 = 24$ KB.

The most widely used companding in audio coding is the companding of the quantization step sizes of uniform quantizers. Since quantization step sizes are needed in the inverse quantization process in the decoder, they need to be packed into the bit stream and transmitted to the decoder. Transmitting these step sizes with arbitrary resolution is out of the question, so it is necessary that they themselves be quantized. The perceived loudness of quantization noise is usually considered logarithmically proportional to the quantization noise power, or linearly proportional to the quantization noise power in decibels. Due to (2.28), this means the perceived loudness is linearly proportional to the quantization step size in decibels. Therefore, almost all audio coding algorithms use logarithmic companding to quantize quantization step sizes:

$$\delta = f(\Delta) = \log_2(\Delta), \quad (2.71)$$

where $\Delta$ is the step size of a uniform quantizer. The corresponding expander is obviously

$$\Delta = f^{-1}(\delta) = 2^{\delta}. \quad (2.72)$$

Another motivation for logarithmic companding is to cope with the wide dynamic range of audio signals, which may amount to more than 24 bits per sample.
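A hedged sketch of the two compandings mentioned for audio coding, the $x^{3/4}$ power law of (2.69)-(2.70) and the base-2 logarithmic step-size companding of (2.71)-(2.72):

```python
import numpy as np

def power_compress(x):            # (2.69)
    return np.sign(x) * np.abs(x) ** 0.75

def power_expand(y):              # (2.70)
    return np.sign(y) * np.abs(y) ** (4.0 / 3.0)

def step_compress(delta):         # (2.71)
    return np.log2(delta)

def step_expand(delta_log):       # (2.72)
    return 2.0 ** delta_log

x = -12345.0
print(power_expand(power_compress(x)))           # ~x
print(step_expand(round(step_compress(0.37))))   # step size coded in the log2 domain
```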
Chapter 3
Vector Quantization
The scalar quantization discussed in Chap. 2 quantizes the samples of a source signal one by one, in sequence. It is simple because it deals with only one sample at a time, but this limits the quantization efficiency it can achieve. We now consider quantizing two or more samples as one block each time and call this approach vector quantization (VQ).
3.1 The VQ Advantage

Let us suppose that we need to quantize the following source sample sequence:

$$\{1.2,\ 1.4,\ 1.7,\ 1.9,\ 2.1,\ 2.4,\ 2.6,\ 2.9\}. \quad (3.1)$$

If we use the scalar midtread quantizer given in Table 2.2 and Fig. 2.2, we get the following SQ indexes:

$$\{1,\ 1,\ 2,\ 2,\ 2,\ 2,\ 3,\ 3\}, \quad (3.2)$$

which is also the sequence of quantized values, since the quantization step size is one. Since the range of the quantization indexes is [1, 3], we need 2 bits to convey each index. This amounts to $8 \times 2 = 16$ bits for encoding the whole sequence.

If two samples are quantized as a block, or vector, each time, using the VQ codebook given in Table 3.1, we end up with the following sequence of indexes:

$$\{0,\ 1,\ 1,\ 2\}.$$

When this sequence is used by the decoder to look up Table 3.1, we obtain exactly the same reconstructed sequence as in (3.2), so the total quantization error is the same. Two bits are still needed to convey each index, but there are now only four indexes, so we need $4 \times 2 = 8$ bits to convey the whole sequence. This is only half the number of bits needed by SQ, while the total quantization error is the same.
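A minimal sketch of this VQ encoding, assuming the codebook of Table 3.1 and nearest-neighbor assignment:

```python
import numpy as np

codebook = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)
source = np.array([1.2, 1.4, 1.7, 1.9, 2.1, 2.4, 2.6, 2.9]).reshape(-1, 2)

# squared Euclidean distance from each source vector to each codeword
dist = ((source[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
indexes = dist.argmin(axis=1)
print(indexes)                 # [0 1 1 2], as in the text
print(codebook[indexes])       # reconstruction, same values SQ would produce
```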
Table 3.1 An example VQ codebook

Index   Representative vector
0       [1, 1]
1       [2, 2]
2       [3, 3]

Fig. 3.1 A source sequence, viewed as a sequence of two-dimensional vectors, is plotted as dots using the first and second elements of each vector as x and y coordinates, respectively. The solid straight lines represent the decision boundaries for SQ and the solid curved lines for VQ. The dashed lines represent the quantized values for SQ and the diamonds represent the quantized or representative vectors for VQ
To explain why VQ can achieve much better performance than SQ, let us view the sequence in (3.1) as a sequence of two-dimensional vectors:

$$\{[1.2, 1.4],\ [1.7, 1.9],\ [2.1, 2.4],\ [2.6, 2.9]\}, \quad (3.3)$$
and plot them in Fig. 3.1 as dots using the first and second elements of each vector as x and y coordinates, respectively. For the first element of each vector, we use solid vertical lines to represent its decision boundaries and dashed vertical lines its quantized value. Consequently, its decision intervals are represented by vertical strips defined by two adjacent vertical lines. For the second element of each vector, we do the same with horizontal solid and dashed lines, respectively, so that its decision intervals are represented by horizontal strips defined by two adjacent horizontal lines. When the first element is quantized using SQ, a vertical decision strip is activated to produce the quantized value. Similarly, a horizontal decision strip is activated when the second element is quantized using SQ. Since the first and second elements of each vector are quantized separately, the quantization of the whole sequence can be viewed as a process of alternating activation of vertical and horizontal decision strips. However, if the results of the SQ described above for the two elements of each vector are viewed jointly, the quantization decision is actually represented by the squares where the vertical and horizontal decision strips cross. Each decision square represents the decision intervals for both elements of a vector. The crossing point of dashed lines in the middle of the decision square represents the quantized values for both elements of the vector. Therefore, the decision square and the associated crossing point inside it are the real decision boundaries and quantized values for each source vector.
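As a concrete check of the bit-counting example above, the following sketch groups the sequence (3.1) into two-dimensional vectors as in (3.3) and maps each vector to the nearest entry of the Table 3.1 codebook. The nearest-neighbor rule anticipates the optimality condition (3.16) derived later in this chapter.

```python
import numpy as np

codebook = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])    # Table 3.1
source = np.array([1.2, 1.4, 1.7, 1.9, 2.1, 2.4, 2.6, 2.9])  # sequence (3.1)
vectors = source.reshape(-1, 2)                              # sequence (3.3)

# Pick the codebook entry with the smallest Euclidean distance to each vector.
indexes = [int(np.argmin(((codebook - v) ** 2).sum(axis=1))) for v in vectors]
print(indexes)                    # [0, 1, 1, 2]: 2 bits per vector, 1 bit/sample
print(codebook[indexes].ravel())  # reconstruction [1 1 2 2 2 2 3 3], same as (3.2)
```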
It is now easy to realize that many of the decision squares are never used by SQ. Due to their existence, however, we still need a two-dimensional vector to identify and hence represent each of the decision squares. For the current example, each such vector needs $2 \times 2 = 4$ bits to be represented, corresponding to 2 bits per sample. So no bit is saved. What a waste those unused decision squares cause!

To avoid this waste, we need to consider the quantization of each source vector as a joint and simultaneous action and forgo the SQ decision squares. Along this line of thinking, we can arbitrarily place decision boundaries in the two-dimensional space. For example, noting that the data points are scattered almost along a straight line, we can redesign the decision boundaries as those depicted by the curved solid lines in Fig. 3.1. With this design, we designate three crossing points, represented by the diamonds in the figure, as the quantized or representative vectors. They obviously represent the same quantized values as those obtained by SQ, thus leaving the quantization error unchanged. However, the two-dimensional decision boundaries carve out only three decision regions, or decision cells, with only three representative vectors that need to be indexed and transmitted to the decoder. Therefore, we need to transmit 2 bits per vector, or 1 bit per sample, to the decoder, amounting to a 2:1 reduction in bits.

Of course, source sequences in the real world are not as simple as those in (3.1) and Fig. 3.1. To look at realistic sequences, Fig. 3.2 plots a correlated Gaussian sequence (mean = 0, variance = 1) in the same way as Fig. 3.1. Also plotted are the decision boundaries (solid lines) and quantized values (dashed lines) of the same uniform scalar quantizer used above. Apparently, the samples are highly concentrated along a straight line and a lot of SQ decision squares are wasted.

One may argue that a nonuniform SQ would do much better. Recall that it was concluded in Sect. 2.4 that the size of a decision interval of a nonuniform SQ should be inversely proportional to the probability density. The translation of this rule into two dimensions is that the size of the SQ decision squares should be inversely proportional to the probability density. So a nonuniform SQ can improve the coding performance by placing small SQ decision squares in densely distributed areas and large ones in loosely distributed areas. However, there are still a lot of SQ decision regions placed in areas that are extremely sparsely populated by the source samples, causing a waste of bit resources.
Fig. 3.2 A correlated Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)
Fig. 3.3 An independent Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)
To avoid this waste, the restriction that decision regions must be squares has to be removed. Once this restriction is removed, we arrive at the world of VQ, where decision regions can be of arbitrary shapes and can be arbitrarily allocated to match the probability distribution density of the source sequence, achieving better quantization performance. Even with uncorrelated sequences, VQ can still achieve better performance than SQ. Figure 3.3 plots an independent Gaussian sequence (mean = 0, variance = 1) over the decision boundaries (solid lines) and quantized values (dashed lines) of the same midtread uniform SQ. Apparently the samples are still concentrated, even though not as much as the correlated sequence in Fig. 3.2, so a VQ can still achieve better performance than an SQ. At the very least, an SQ has to allocate decision squares to cover the four corners of the figure, but a VQ can use arbitrarily shaped decision regions to cover those areas without wasting bits.
3.2 Formulation

Let us consider an $N$-dimensional random vector
$\mathbf{x} = [x_0, x_1, \ldots, x_{N-1}]^T$   (3.4)
with a joint PDF $p(\mathbf{x})$ over a vector space or support $\Omega$. Suppose this vector space is divided by a set of $M$ regions $\{\delta_q\}_{q=0}^{M-1}$ in a mutually exclusive and collectively exhaustive way:
$\Omega = \bigcup_{q=0}^{M-1} \delta_q$   (3.5)
and
$\delta_p \cap \delta_q = \Phi \quad \text{for all } p \neq q,$   (3.6)
where $\Phi$ is the null set. These regions, referred to as decision regions, play a role similar to decision intervals in SQ. To each decision region, a representative vector is assigned:
$\mathbf{r}_q \mapsto \delta_q, \quad \text{for } q = 0, 1, \ldots, M-1.$   (3.7)
The source vector $\mathbf{x}$ is vector-quantized to VQ index $q$ as follows:
$q = Q(\mathbf{x}) \quad \text{if and only if} \quad \mathbf{x} \in \delta_q.$   (3.8)
The corresponding reconstructed vector is the representative vector:
$\hat{\mathbf{x}} = \mathbf{r}_q = Q^{-1}(q).$   (3.9)
Plugging (3.8) into (3.9), we have
$\hat{\mathbf{x}}(\mathbf{x}) = \mathbf{r}_{q(\mathbf{x})} = Q^{-1}[Q(\mathbf{x})].$   (3.10)
The VQ quantization error is now a vector given below:
$\mathbf{q}(\mathbf{x}) = \hat{\mathbf{x}} - \mathbf{x}.$   (3.11)
Since this may be rewritten as
$\hat{\mathbf{x}} = \mathbf{x} + \mathbf{q}(\mathbf{x}),$   (3.12)
the additive quantization noise model in Fig. 2.3 is still valid. The quantization noise is best measured by a distance, such as the L-2 norm or Euclidean distance defined below:
$d(\mathbf{x}, \hat{\mathbf{x}}) = (\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}}) = \sum_{k=0}^{N-1} (x_k - \hat{x}_k)^2.$   (3.13)
The average quantization error is then
$\mathrm{Err} = \int_{\Omega} d(\mathbf{x}, \hat{\mathbf{x}})\, p(\mathbf{x})\, d\mathbf{x}$   (3.14)
$\phantom{\mathrm{Err}} = \sum_{q=0}^{M-1} \int_{\delta_q} d(\mathbf{x}, \mathbf{r}_q)\, p(\mathbf{x})\, d\mathbf{x}.$   (3.15)
If the Euclidean distance is used, the average quantization noise may again be called MSQE. The goal of VQ design is to find a set of decision regions $\{\delta_q\}_{q=0}^{M-1}$ and representative vectors $\{\mathbf{r}_q\}_{q=0}^{M-1}$, referred to as a VQ codebook, that minimizes this average quantization error.
3.3 Optimality Conditions

A necessary condition for an optimal solution to the VQ design problem stated above is that, for a given set of representative vectors $\{\mathbf{r}_q\}_{q=0}^{M-1}$, the corresponding decision regions $\{\delta_q\}_{q=0}^{M-1}$ should decompose the input space in such a way that each source vector $\mathbf{x}$ is always clustered to its nearest representative vector [19]:
$Q(\mathbf{x}) = \mathbf{r}_q \quad \text{if and only if} \quad d(\mathbf{x}, \mathbf{r}_q) \le d(\mathbf{x}, \mathbf{r}_k) \text{ for all } k \neq q.$   (3.16)
Due to this, the decision boundaries can be defined as
$\delta_q = \{\mathbf{x} \mid d(\mathbf{x}, \mathbf{r}_q) \le d(\mathbf{x}, \mathbf{r}_k) \text{ for all } k \neq q\}.$   (3.17)
Such disjoint sets are referred to as Voronoi regions, which rids us of the trouble of literally defining and representing the boundary of each decision region. This also indicates that a whole VQ scheme can be fully described by the VQ codebook, or the set of representative vectors $\{\mathbf{r}_q\}_{q=0}^{M-1}$.

Another condition for an optimal solution is that, given a decision region $\delta_q$, the best choice for the representative vector is the conditional mean of all vectors within the decision region [19]:
$\mathbf{x}_q = \int_{\mathbf{x} \in \delta_q} \mathbf{x}\, p(\mathbf{x} \mid \mathbf{x} \in \delta_q)\, d\mathbf{x}.$   (3.18)
3.4 LBG Algorithm

Let us now consider the problem of VQ design, or finding the set of representative vectors, and thus decomposing the input space into a set of Voronoi regions, so that the average quantization error in (3.15) is minimized. An immediate obstacle that needs to be addressed is that, other than the multidimensional Gaussian, there is essentially no suitable theoretical joint PDF $p(\mathbf{x})$ to work with. Instead, we can usually draw a large set of input vectors, $\{\mathbf{x}_k\}_{k=0}^{L-1}$, from the source governed by an unknown PDF. Therefore, a feasible approach is to use such a set of vectors, referred to as the training set, to come up with an optimal VQ codebook.

In the absence of the joint PDF $p(\mathbf{x})$, the average quantization error defined in (3.15) is no longer available. It may be replaced by the following total quantization error:
$\mathrm{Err} = \sum_{k=0}^{L-1} d(\mathbf{x}_k, \hat{\mathbf{x}}(\mathbf{x}_k)).$   (3.19)
For the same reason, the best choice of the representative vector in (3.18) needs to be replaced by
$\mathbf{x}_q = \frac{1}{L_q} \sum_{\mathbf{x}_k \in \delta_q} \mathbf{x}_k,$   (3.20)
where $L_q$ is the number of training vectors in decision region $\delta_q$.
The following Linde–Buzo–Gray algorithm (LBG algorithm) [42], also referred to as the k-means algorithm, has been found to converge to a local minimum of (3.19):

1. Set $n = 0$.
2. Make a guess for the representative vectors $\{\mathbf{r}_q\}_{q=0}^{M-1}$. By (3.17), this implicitly builds an initial set of Voronoi regions $\{\delta_q\}_{q=0}^{M-1}$.
3. Set $n = n + 1$.
4. Quantize each training vector using (3.16). Upon completion, the training set has been partitioned into Voronoi regions $\{\delta_q\}_{q=0}^{M-1}$.
5. Build a new set of representative vectors using (3.20). This implicitly builds a new set of Voronoi regions $\{\delta_q\}_{q=0}^{M-1}$.
6. Calculate the total quantization error $\mathrm{Err}(n)$ using (3.19).
7. Go back to step 3 if
$\frac{\mathrm{Err}(n-1) - \mathrm{Err}(n)}{\mathrm{Err}(n)} > \epsilon,$   (3.21)
where $\epsilon$ is a predetermined positive threshold.
8. Stop.

Steps 4 and 5 in the iterative procedure above can only cause the total quantization error to decrease, so the LBG algorithm converges to at least a local minimum of the total quantization error (3.19). This also explains the stopping condition in (3.21).

There is a chance that, upon the completion of step 4, a Voronoi region may be empty in the sense that it contains not a single training vector. Suppose this happens to region $q$; then step 5 is problematic because $L_q = 0$ in (3.20). This indicates that representative vector $\mathbf{r}_q$ is an outlier that is far away from the training set. A simple approach to fixing this problem is to replace it with a training vector in the most popular Voronoi region.

After the completion of the LBG algorithm, the resulting VQ codebook can be tested against a separate set of source data, referred to as the test set, drawn from the same source.

The LBG algorithm offers an approach to exploiting the real multidimensional PDF directly from the data without a theoretical multidimensional distribution model. This is an advantage over SQ, which usually relies on a probability model. This also enables VQ to remove nonlinear dependencies in the data, a clear advantage over other technologies, such as transforms and linear prediction, which can only deal with linear dependencies.
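The following is a minimal sketch of the LBG iteration above. The correlated Gaussian training set, codebook size, and stopping threshold are illustrative assumptions; a production implementation would add codebook splitting and more careful initialization.

```python
import numpy as np

def lbg(training, M, eps=1e-4, seed=0):
    """LBG/k-means iteration; returns an M-entry codebook for the training set."""
    rng = np.random.default_rng(seed)
    # Step 2: initial guess -- M distinct vectors drawn from the training set.
    codebook = training[rng.choice(len(training), M, replace=False)].copy()
    err_prev = np.inf
    while True:
        # Step 4: nearest-neighbor partition of the training set, per (3.16).
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 6: total quantization error (3.19).
        err = d[np.arange(len(training)), labels].sum()
        # Step 7: stop when the relative improvement falls below eps, per (3.21).
        if (err_prev - err) / err <= eps:
            return codebook
        err_prev = err
        # Step 5: new representative vectors as cell means (3.20); an empty
        # cell is reseeded with a vector from the most populated cell.
        counts = np.bincount(labels, minlength=M)
        for q in range(M):
            if counts[q] > 0:
                codebook[q] = training[labels == q].mean(axis=0)
            else:
                codebook[q] = training[labels == counts.argmax()][0]

# Correlated two-dimensional Gaussian training set, as in Fig. 3.2 (illustrative).
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
training = rng.standard_normal((4000, 2)) @ np.linalg.cholesky(cov).T
print(lbg(training, M=8))
```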
3.5 Implementation

Once the VQ codebook is obtained, it can be used to quantize the source signal that generated the training set. Similar to SQ, this also involves two stages, as shown in Fig. 3.4.
Fig. 3.4 VQ involves an encoding or forward VQ stage represented by "VQ" in the figure, which maps a source vector to a VQ index, and a decoding or inverse VQ stage represented by "VQ⁻¹", which maps a VQ index to its representative vector

Fig. 3.5 The vector quantization of a source vector entails searching through the VQ codebook to find the representative vector that is closest to the source vector
According to the optimality condition of (3.16), the vector quantization of a source vector $\mathbf{x}$ entails searching through all representative vectors in the VQ codebook to find the one that is closest to the source vector. This is shown in Fig. 3.5. The decoding is very simple because it only entails using the received VQ index to look up the VQ codebook to retrieve the representative vector. Since the size of a VQ codebook grows exponentially with the vector dimension, the amount of storage for the codebook may easily become a significant cost, especially for the decoder, which is usually more cost-sensitive. The amount of computation involved in searching through the VQ codebook for each source vector is also a major concern for the encoder. Therefore, VQ with a dimension of more than 20 is rarely deployed in practical applications.
Part III
Data Model
Companding discussed in Sect. 2.4.2 illustrates a basic framework for improving quantization performance. If the compressor is replaced by a general signal transformation built upon a data model, the expander by the corresponding inverse transformation, and the uniform quantizer by a general quantizer, companding can be expanded into a general scheme for quantization performance enhancement: data model plus quantization. The steps involved in such a scheme may be summarized as follows:

1. Transform the source signal into another one using a data model.
2. Quantize the transformed signal.
3. Transmit the quantization indexes to the decoder.
4. Inverse-quantize the received quantization indexes to reconstruct the transformed signal.
5. Inverse-transform the reconstructed signal to reconstruct the original signal.

A necessary condition for this scheme to work is that the transformation must be invertible, either exactly or approximately. Otherwise, the original signal cannot be reconstructed even if no quantization is applied. The key to the success of such a scheme is that the transformed signal must be compact. Companding achieves this using a nonlinear function to arrive at a PDF that is similar to a uniform distribution. There are other methods that are much more powerful; most prominent among them are linear prediction, linear transforms, and subband filter banks.

Linear prediction uses a linear combination of historic samples as a prediction for the current sample. As long as the samples are fairly correlated, the predicted value will be a good estimate of the current sample value, resulting in a small prediction error signal, which may be characterized by a smaller variance. Since the MSQE of an optimal quantizer is proportional to the variance of the source signal (see (2.48)), the reduced variance will result in a reduced MSQE.

A linear transform takes a block of input samples to generate another block of transform coefficients whose energy is compacted into a minority of them. Bit resources can then be concentrated on those high-energy coefficients to arrive at a significantly reduced MSQE.
A filter bank may be considered as an extension of a transform that uses samples from multiple blocks to achieve a higher level of energy compaction without changing the block size, thus delivering an even smaller MSQE.

The primary role of data modeling is to exploit the inner structure or correlation of the source signal. As discussed in Chap. 3, VQ can also achieve this. But a major difficulty with VQ is that its complexity grows exponentially with vector dimension, so a vector dimension of more than 20 is usually considered too complex to be deployed. However, correlation in most signals usually extends well beyond 20 samples. Audio signals, in particular, are well known for strong correlations spanning up to thousands of samples. Therefore, VQ is usually not deployed directly, but rather jointly with a data model.
Chapter 4
Linear Prediction
Let us consider the source signal $x(n)$ shown at the top of Fig. 4.1. A simple approach to linear prediction is to just use the previous sample $x(n-1)$ as the prediction for the current sample:
$p(n) = x(n-1).$   (4.1)
This prediction is, of course, not perfect, so there is a prediction error or residue
$r(n) = x(n) - p(n) = x(n) - x(n-1),$   (4.2)
which is shown at the bottom of Fig. 4.1. The dynamic range of the residue is obviously much smaller than that of the source signal. The variance of the residue is 2.0282, which is much smaller than 101.6028, the variance of the source signal. The histograms of the source signal and the residue, both shown in Fig. 4.2, clearly indicate that, if the residue, instead of the source signal itself, is quantized, the quantization error will be much smaller.
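The effect can be reproduced with a few lines of code. The AR(1) source below is an assumed stand-in for the signal of Fig. 4.1, so the exact variances differ from the ones quoted in the text, but the residue variance is similarly much smaller than that of the source.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8000)
x = np.zeros_like(w)
for n in range(1, len(x)):
    x[n] = 0.95 * x[n - 1] + w[n]       # strongly correlated source (assumed)

r = x[1:] - x[:-1]                      # residue r(n) = x(n) - x(n-1), per (4.2)
pg = 10 * np.log10(x.var() / r.var())   # prediction gain (4.6), in dB
print("source variance :", x.var())
print("residue variance:", r.var())
print("prediction gain :", pg, "dB")
```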
4.1 Linear Prediction Coding

More generally, for a source signal $x(n)$, a linear predictor makes an estimate of its sample value at time instance $n$ using a linear combination of its $K$ previous samples:
$p(n) = \sum_{k=1}^{K} a_k x(n-k),$   (4.3)
where $\{a_k\}_{k=1}^{K}$ are the prediction coefficients and $K$ is the predictor order. The transfer function for the prediction filter is
$A(z) = \sum_{k=1}^{K} a_k z^{-k}.$   (4.4)
Fig. 4.1 A source signal and its prediction residue
Fig. 4.2 Histograms for the source signal and its prediction residue
The prediction cannot be perfect, so there is always a prediction error
$r(n) = x(n) - p(n),$   (4.5)
which is also referred to as the prediction residue. With proper design of the prediction coefficients, this prediction error can be significantly reduced. To assess the performance of this prediction error reduction, the prediction gain is defined as
$PG = \frac{\sigma_x^2}{\sigma_r^2},$   (4.6)
where $\sigma_x^2$ and $\sigma_r^2$ denote the variances of the source and residue signals, respectively. For example, the prediction gain for the simple predictor in the last section is
$PG = 10 \log_{10} \frac{101.6028}{2.0282} \approx 17 \text{ dB}.$
Since the prediction residue is likely to have a much smaller variance than the source signal, the MSQE could be significantly reduced if the residue signal is quantized in place of the source signal. This amounts to linear prediction coding (LPC).
4.2 Open-Loop DPCM

There are many methods for implementing LPC, which differ mostly in how quantization noise is handled in the encoder. Open-loop DPCM, discussed in this section, is one of the simplest, but it suffers from quantization noise accumulation.
4.2.1 Encoder and Decoder

An encoder for implementing LPC is shown in Fig. 4.3 [66], where the quantizer is modeled by an additive noise source (see (2.8) and Fig. 2.3). Since the quantizer is placed outside the prediction loop, this scheme is often called open-loop DPCM [17]. The name DPCM will be explained in Sect. 4.3. The overall transfer function for the LPC encoder before the quantizer is
$H(z) = 1 - \sum_{k=1}^{K} a_k z^{-k} = 1 - A(z),$   (4.7)
where $A(z)$ is defined in (4.4).

After quantization, the quantization indexes for the residue signal are transmitted to the decoder and are used to reconstruct the quantized residue $\hat{r}(n)$ through inverse quantization.
Fig. 4.3 Encoder for open-loop DPCM. The quantizer is attached to the prediction residue and is modeled by an additive noise source
Fig. 4.4 Decoder for open-loop DPCM
This process may be viewed as if the quantized residue $\hat{r}(n)$ were received directly by the decoder. This is the convention adopted in Figs. 4.3 and 4.4.

Since the original source signal $x(n)$ is not available at the decoder, the prediction scheme in (4.3) cannot be used by the decoder. Instead, the prediction at the decoder has to use the past reconstructed sample values:
$\hat{p}(n) = \sum_{k=1}^{K} a_k \hat{x}(n-k).$   (4.8)
According to (4.5), the reconstructed sample itself is obtained by
$\hat{x}(n) = \hat{p}(n) + \hat{r}(n) = \sum_{k=1}^{K} a_k \hat{x}(n-k) + \hat{r}(n).$   (4.9)
This leads us to the decoder shown in Fig. 4.4, where the overall transfer function of the LPC decoder, or the LPC reconstruction filter, is
$D(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \frac{1}{1 - A(z)}.$   (4.10)
If there were no quantizer, the overall transfer function of the encoder (4.7) and decoder (4.10) is obviously one, so the reconstructed signal at the decoder output is the same as the encoder input. When the quantizer is deployed, the prediction
residue used by the decoder is different from that used by the encoder, and the difference is the additive quantization noise. This causes a reconstruction error at the decoder output,
$e(n) = \hat{x}(n) - x(n),$   (4.11)
which is the quantization error of the overall open-loop DPCM.
4.2.2 Quantization Noise Accumulation

As illustrated above, the difference between the prediction residues used by the decoder and the encoder is the additive quantization noise, so this quantization noise is reflected in each reconstructed sample value. Since these sample values are convolved by the prediction filter, it is expected that the quantization noise is accumulated at the decoder output.

To illustrate this quantization noise accumulation, let us first use the additive noise model of (2.8) to write the quantized residue as
$\hat{r}(n) = r(n) + q(n),$   (4.12)
where $q(n)$ is again the quantization error or noise. The reconstructed sample at the decoder output (4.9) can then be rewritten as
$\hat{x}(n) = \hat{p}(n) + r(n) + q(n).$   (4.13)
Dropping in the definition of the residue (4.5), we have
$\hat{x}(n) = \hat{p}(n) + x(n) - p(n) + q(n).$   (4.14)
Therefore, the reconstruction error at the decoder output (4.11) is
$e(n) = \hat{x}(n) - x(n) = \hat{p}(n) - p(n) + q(n).$   (4.15)
Plugging in (4.3) and (4.8), this may be further expressed as
$e(n) = \sum_{k=1}^{K} a_k [\hat{x}(n-k) - x(n-k)] + q(n) = \sum_{k=1}^{K} a_k e(n-k) + q(n),$   (4.16)
which indicates that the reconstruction error is equal to the current quantization error plus a weighted sum of reconstruction errors in the past.
Moving this relationship backward one step to $n-1$, the reconstruction error at $n-1$ is equal to the quantization error at $n-1$ plus the weighted sum of reconstruction errors before $n-1$. Repeating this procedure all the way back to the start of the prediction, we conclude that quantization errors from each prediction step in the past are all accumulated to arrive at the reconstruction error at the current step.

To illustrate this accumulation of quantization noise more clearly, let us consider the simple difference predictor used in Sect. 4.1:
$a_1 = 1 \quad \text{and} \quad K = 1.$   (4.17)
Plugging this into (4.16), we have
$e(n) = e(n-1) + q(n).$   (4.18)
Let us suppose that $x(0) = \hat{x}(0)$; then the above equation iteratively produces the following reconstruction errors:
$e(1) = e(0) + q(1) = q(1)$
$e(2) = e(1) + q(2) = q(1) + q(2)$
$e(3) = e(2) + q(3) = q(1) + q(2) + q(3)$
$\vdots$   (4.19)
For the $n$th step, we have
$e(n) = \sum_{k=1}^{n} q(k),$   (4.20)
which is the summation of the quantization noise in all previous steps, starting from the beginning of prediction.

To obtain a closed-form representation of quantization accumulation, let $E(z)$ denote the Z-transform of the reconstruction error $e(n)$ and $Q(z)$ the Z-transform of the quantization noise $q(n)$; then the reconstruction error in (4.16) may be expressed as
$E(z) = \frac{Q(z)}{1 - A(z)},$   (4.21)
which indicates that the reconstruction error at the decoder output is the quantization error filtered by an all-pole LPC reconstruction filter (see (4.10)). An all-pole IIR filter may be unstable and may produce large instances of reconstruction error that are perceptually annoying, so open-loop DPCM is avoided in many applications. But it is sometimes purposely deployed in other applications to exploit the shaping of the quantization noise spectrum by the all-pole reconstruction filter; see Sects. 4.5 and 11.4 for details.
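A short simulation makes the accumulation visible. The random-walk source and the quantizer step size below are illustrative assumptions; the decoder-side error variance grows from the beginning to the end of the signal, as (4.20) predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(5000))   # slowly varying source (assumed)
step = 0.5                                 # uniform quantizer step size (assumed)

r = np.diff(x, prepend=0.0)                # residue x(n) - x(n-1), per (4.2)
r_hat = step * np.round(r / step)          # quantization outside the loop

x_hat = np.cumsum(r_hat)                   # decoder (4.9): running sum for a1 = 1
e = x_hat - x                              # equals the cumsum of q(n), see (4.20)
print("error variance, first 10%:", e[:500].var())
print("error variance, last 10% :", e[-500:].var())   # grows with time
```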
4.3 DPCM

This problem of quantization error accumulation can be avoided by forcing the encoder to use the same predictor as the decoder, i.e., the predictor given in (4.8). This entails moving the quantizer in Fig. 4.3 inside the encoding loop, leading to the encoder scheme shown in Fig. 4.5. This LPC scheme is often referred to as differential pulse code modulation (DPCM) [9]. Note that a uniform quantizer is usually deployed.
4.3.1 Quantization Error

To illustrate that quantization noise accumulation is no longer an issue with DPCM, let us first note that the prediction residue is now given by (see Fig. 4.5):
$r(n) = x(n) - \hat{p}(n).$   (4.22)
Plugging this into the additive noise model (4.12) for quantization, we have
$\hat{r}(n) = x(n) - \hat{p}(n) + q(n),$   (4.23)
which is the quantized residue at the input to the decoder. Dropping the above equation into (4.9), we obtain the reconstructed value at the decoder:
$\hat{x}(n) = \hat{p}(n) + x(n) - \hat{p}(n) + q(n) = x(n) + q(n).$   (4.24)
Rearranging this equation, we have
$e(n) = \hat{x}(n) - x(n) = q(n),$   (4.25)
or equivalently
$E(z) = Q(z).$   (4.26)
Fig. 4.5 Differential pulse code modulation (DPCM). A full decoder is embedded as part of the encoder. The quantizer is modeled by an additive noise source
Therefore, the reconstruction error at the decoder output is exactly the same as the quantization error of the residue in the encoder; there is no quantization error accumulation.
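For comparison with the open-loop sketch above, the following places the quantizer inside the loop, as in Fig. 4.5, using the same assumed source and step size; the decoder-side error now equals the per-sample quantization error (4.25) and stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(5000))     # same assumed source as before
step = 0.5

r_hat = np.empty_like(x)
x_hat_prev = 0.0
for n in range(len(x)):
    p_hat = x_hat_prev                       # prediction from x-hat, per (4.8)
    r = x[n] - p_hat                         # residue (4.22)
    r_hat[n] = step * np.round(r / step)     # quantizer inside the loop
    x_hat_prev = p_hat + r_hat[n]            # embedded decoder (4.9)

e = np.cumsum(r_hat) - x                     # decoder output error, e(n) = q(n)
print("error variance, first 10%:", e[:500].var())
print("error variance, last 10% :", e[-500:].var())   # stays bounded
```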
4.3.2 Coding Gain

As stated for (4.11), the reconstruction error at the decoder output is considered the quantization noise of LPC, so its variance, denoted as $\sigma_{q(\mathrm{DPCM})}^2$, is the MSQE for DPCM. For a given bit rate, this MSQE should be smaller than $\sigma_{q(\mathrm{PCM})}^2$, the MSQE for directly quantizing the source signal, to justify the increased complexity of linear prediction. This improvement may be assessed by the coding gain
$G_{\mathrm{DPCM}} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\mathrm{DPCM})}^2},$   (4.27)
which is usually evaluated in the context of scalar quantization.

Due to (4.25), $\sigma_{q(\mathrm{DPCM})}^2$ is the same as the MSQE of the prediction residue, $\sigma_{q(\mathrm{Residue})}^2$:
$\sigma_{q(\mathrm{DPCM})}^2 = \sigma_{q(\mathrm{Residue})}^2.$   (4.28)
Since $\sigma_{q(\mathrm{Residue})}^2$ is related to the variance of the residue $\sigma_r^2$ by (2.48) for uniform and nonuniform scalar quantization, we have
$\sigma_{q(\mathrm{DPCM})}^2 = 10^{-0.1(a + bR)} \sigma_r^2$   (4.29)
for a given bit rate of $R$. Note that the slope $b$ and intercept $a$ in the above equation are determined by the particular quantizer and the PDF of the residue.

On the other hand, if the source signal is quantized directly with the same bit rate $R$, in what constitutes PCM, (2.48) gives the following MSQE:
$\sigma_{q(\mathrm{PCM})}^2 = 10^{-0.1(a + bR)} \sigma_x^2,$   (4.30)
where $\sigma_x^2$ is the variance of the source signal. Here it is assumed that both the source signal and the DPCM residue share the same set of parameters $a$ and $b$. This assumption may not be valid in many practical applications.

Consequently, the coding gain for DPCM is
$G_{\mathrm{DPCM}} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\mathrm{DPCM})}^2} = \frac{\sigma_x^2}{\sigma_r^2} = PG,$   (4.31)
which is the same as the prediction gain defined in (4.6). This indicates that the quantization performance of the DPCM system is dependent on the prediction gain, or how well the predictor predicts the source signal. If the predictor in the DPCM is properly designed so that
$\sigma_r^2 \ll \sigma_x^2,$   (4.32)
then $G_{\mathrm{DPCM}} \gg 1$, and a significant reduction in quantization error can be achieved.
4.4 Optimal Prediction

Now that it is established that the quantization performance of a DPCM system depends on the performance of linear prediction, the next task is to find an optimal predictor that maximizes the prediction gain.
4.4.1 Optimal Predictor

For a given source signal $x(n)$, the maximization of prediction gain is equivalent to minimizing the variance of the prediction residue (4.22), which can be expressed as
$r(n) = x(n) - \hat{p}(n) = x(n) - \sum_{k=1}^{K} a_k \hat{x}(n-k)$   (4.33)
using (4.8). Therefore, the design problem is to find the set of prediction coefficients $\{a_k\}_{k=1}^{K}$ that minimizes $\sigma_r^2$:
$\min_{\{a_k\}_{k=1}^{K}} \sigma_r^2 = E\left[ \left( x(n) - \sum_{k=1}^{K} a_k \hat{x}(n-k) \right)^2 \right],$   (4.34)
where $E(\cdot)$ is the expectation operator defined below:
$E(y) = \int y(\xi)\, p(\xi)\, d\xi.$   (4.35)
Since $\hat{x}(n)$ in (4.34) is the reconstructed signal and is related to the source signal by the additive quantization noise (see (4.25)), the minimization problem in (4.34) involves the optimal selection of the prediction coefficients as well as the minimization of quantization error. As discussed in Chaps. 2 and 3, independent minimization of quantization error, or quantizer design itself, is frequently very difficult, so it is highly desirable that the problem be simplified by taking quantization out of the picture. This essentially implies that the DPCM scheme is given up and the open-loop DPCM encoder in Fig. 4.3 is used when it comes to predictor design.
One way to arrive at this simplification is the assumption of fine quantization: the quantization step size is so small that the resulting quantization error is negligible:
$\hat{x}(n) \approx x(n).$   (4.36)
This enables the replacement of $\hat{x}(n)$ by $x(n)$ in (4.34). Due to the arguments above, the prediction residue considered for optimization purposes becomes
$r(n) = x(n) - p(n) = x(n) - \sum_{k=1}^{K} a_k x(n-k),$   (4.37)
and the predictor design problem (4.34) becomes
$\min_{\{a_k\}_{k=1}^{K}} \sigma_r^2 = E\left[ \left( x(n) - \sum_{k=1}^{K} a_k x(n-k) \right)^2 \right].$   (4.38)
To minimize this error function, we set the derivative of $\sigma_r^2$ with respect to each prediction coefficient $a_j$ to zero:
$\frac{\partial \sigma_r^2}{\partial a_j} = -2 E\left[ \left( x(n) - \sum_{k=1}^{K} a_k x(n-k) \right) x(n-j) \right] = 0, \quad j = 1, 2, \ldots, K.$   (4.39)
Due to (4.3), the above equation may be written as
$E[(x(n) - p(n))\, x(n-j)] = 0, \quad j = 1, 2, \ldots, K,$   (4.40)
and, using (4.5), further as
$E[r(n)\, x(n-j)] = 0, \quad j = 1, 2, \ldots, K.$   (4.41)
This indicates that the minimal prediction error, or residue, must be orthogonal to all data used in the prediction. This is called the orthogonality principle. By moving the expectation inside the summation, (4.39) may be rewritten as
$\sum_{k=1}^{K} a_k E[x(n-k)\, x(n-j)] = E[x(n)\, x(n-j)], \quad j = 1, 2, \ldots, K.$   (4.42)
Now we are ready to make the second assumption: the source signal $x(n)$ is a wide-sense stationary process, so that its autocorrelation function can be defined as
$R(k) = E[x(n)\, x(n-k)]$   (4.43)
and has the following property:
$R(-k) = R(k).$   (4.44)
Consequently, (4.42) can be written as
$\sum_{k=1}^{K} a_k R(k-j) = R(j), \quad j = 1, 2, \ldots, K.$   (4.45)
It can be further written in the following matrix form:
$\mathbf{R} \mathbf{a} = \mathbf{r},$   (4.46)
where
$\mathbf{a} = [a_1, a_2, a_3, \ldots, a_K]^T,$   (4.47)
$\mathbf{r} = [R(1), R(2), R(3), \ldots, R(K)]^T,$   (4.48)
and
$\mathbf{R} = \begin{bmatrix} R(0) & R(1) & R(2) & \cdots & R(K-1) \\ R(1) & R(0) & R(1) & \cdots & R(K-2) \\ R(2) & R(1) & R(0) & \cdots & R(K-3) \\ \vdots & \vdots & \vdots & & \vdots \\ R(K-1) & R(K-2) & R(K-3) & \cdots & R(0) \end{bmatrix}.$   (4.49)

The equations above are known as the normal equations, Yule–Walker prediction equations, or Wiener–Hopf equations [63]. The matrix $\mathbf{R}$ and vector $\mathbf{r}$ are all built from the autocorrelation values $\{R(k)\}_{k=0}^{K}$. The matrix $\mathbf{R}$ is a Toeplitz matrix in that it is symmetric and all elements along a diagonal are equal. Such matrices are known to be positive definite and therefore nonsingular, yielding a unique solution for the linear prediction coefficients:
$\mathbf{a} = \mathbf{R}^{-1} \mathbf{r}.$   (4.50)
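A direct way to obtain the coefficients is to estimate the autocorrelation values from data and solve (4.46) numerically, as sketched below. The biased autocorrelation estimator and the synthetic test signal are illustrative choices.

```python
import numpy as np

def autocorr(x, K):
    """Biased autocorrelation estimates R(0)..R(K) from a data record."""
    x = x - x.mean()
    return np.array([np.dot(x[:len(x) - k], x[k:]) / len(x)
                     for k in range(K + 1)])

def lpc_coefficients(x, K):
    R = autocorr(x, K)
    # Toeplitz system of (4.46)-(4.49), built explicitly for clarity.
    Rmat = np.array([[R[abs(i - j)] for j in range(K)] for i in range(K)])
    rvec = R[1:K + 1]
    return np.linalg.solve(Rmat, rvec)      # a = R^{-1} r, per (4.50)

rng = np.random.default_rng(0)
x = np.convolve(rng.standard_normal(20000), [1.0, 0.8, 0.6])  # assumed test signal
print(lpc_coefficients(x, K=4))
```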
4.4.2 Levinson–Durbin Algorithm

The Levinson–Durbin recursion is a procedure in linear algebra that recursively calculates the solution to an equation involving a Toeplitz matrix [12, 41], thus avoiding an explicit inversion of the matrix $\mathbf{R}$. The algorithm iterates on the prediction order, so the order of the prediction filter is denoted for each filter coefficient using superscripts:
$a_k^n,$   (4.51)
where $k$ indexes the $k$th coefficient in the $n$th iteration. To get the prediction filter of order $K$, we need to iterate through the following sets of filter coefficients:
$n = 1: \{a_1^1\}$
$n = 2: \{a_1^2, a_2^2\}$
$\vdots$
$n = K: \{a_1^K, a_2^K, \ldots, a_K^K\}$
The iteration above is possible because both the matrix $\mathbf{R}$ and vector $\mathbf{r}$ are built from the autocorrelation values $\{R(k)\}_{k=0}^{K}$. The algorithm proceeds as follows:

1. Set $n = 0$.
2. Set $E^0 = R(0)$.
3. Set $n = n + 1$.
4. Calculate
$\kappa_n = \frac{1}{E^{n-1}} \left( R(n) - \sum_{k=1}^{n-1} a_k^{n-1} R(n-k) \right).$   (4.52)
5. Calculate
$a_n^n = \kappa_n.$   (4.53)
6. Calculate
$a_k^n = a_k^{n-1} - \kappa_n a_{n-k}^{n-1}, \quad \text{for } k = 1, 2, \ldots, n-1.$   (4.54)
7. Calculate
$E^n = (1 - \kappa_n^2) E^{n-1}.$   (4.55)
8. Go back to step 3 if $n < K$.

For example, to get the prediction filter of order $K = 2$, two iterations are needed as follows. For $n = 1$, we have
$E^0 = R(0)$   (4.56)
$\kappa_1 = \frac{R(1)}{E^0} = \frac{R(1)}{R(0)}$   (4.57)
$a_1^1 = \kappa_1 = \frac{R(1)}{R(0)}$   (4.58)
$E^1 = (1 - \kappa_1^2) E^0 = \left[ 1 - \left( \frac{R(1)}{R(0)} \right)^2 \right] R(0) = \frac{R^2(0) - R^2(1)}{R(0)}.$   (4.59)
For $n = 2$, we have
$\kappa_2 = \frac{R(2) - a_1^1 R(1)}{E^1} = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)}$   (4.60)
$a_2^2 = \kappa_2 = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)}$   (4.61)
$a_1^2 = a_1^1 (1 - \kappa_2) = R(1) \frac{R(0) - R(2)}{R^2(0) - R^2(1)}.$   (4.62)
Therefore, the final prediction coefficients for $K = 2$ are
$a_1 = a_1^2 \quad \text{and} \quad a_2 = a_2^2.$   (4.63)
4.4.3 Whitening Filter

From the definition of the prediction residue (4.5), we can establish the following relationship:
$x(n-j) = r(n-j) + p(n-j).$   (4.64)
Dropping it into the orthogonality principle (4.41), we have
$E[r(n)\, r(n-j)] + E[r(n)\, p(n-j)] = 0, \quad j = 1, 2, \ldots, K,$   (4.65)
so the autocorrelation function of the prediction residue is
$R_r(j) = -E[r(n)\, p(n-j)], \quad j = 1, 2, \ldots, K.$   (4.66)
Using (4.3), the right-hand side of the above equation may be further expanded into
$R_r(j) = -\sum_{k=1}^{K} a_k E[r(n)\, x(n-j-k)], \quad j = 1, 2, \ldots, K.$   (4.67)
4.4.3.1 Infinite Prediction Order

If the predictor order is infinite ($K = \infty$), the orthogonality principle (4.41) ensures that the right-hand side of (4.67) is zero, so the autocorrelation function of the prediction residue becomes
$R_r(j) = \begin{cases} \sigma_r^2, & j = 0; \\ 0, & \text{otherwise}. \end{cases}$   (4.68)
It indicates that the prediction residue sequence is a white noise process. Note that this conclusion is valid only when the predictor has an infinite number of prediction coefficients.

4.4.3.2 Markov Process

For predictors with a finite number of coefficients, the above condition is generally not true, unless the source signal is a Markov process with an order $N \le K$. Also called an autoregressive process and denoted as AR($N$), such a process $x(n)$ is generated by passing a white-noise process $w(n)$ through an $N$th-order all-pole filter:
$X(z) = \frac{W(z)}{1 - \sum_{k=1}^{N} b_k z^{-k}} = \frac{W(z)}{1 - B(z)},$   (4.69)
where
$B(z) = \sum_{k=1}^{N} b_k z^{-k},$   (4.70)
and $X(z)$ and $W(z)$ are the Z-transforms of $x(n)$ and $w(n)$, respectively. The corresponding difference equation is
$x(n) = w(n) + \sum_{k=1}^{N} b_k x(n-k).$   (4.71)
The autocorrelation function of the AR process is
$R_x(j) = E[x(n)\, x(n-j)]$
$\phantom{R_x(j)} = E[w(n)\, x(n-j)] + \sum_{k=1}^{N} b_k E[x(n-k)\, x(n-j)]$
$\phantom{R_x(j)} = E[w(n)\, x(n-j)] + \sum_{k=1}^{N} b_k R(j-k).$   (4.72)
Since $w(n)$ is white,
$E[w(n)\, x(n-j)] = \begin{cases} \sigma_w^2, & j = 0; \\ 0, & j > 0. \end{cases}$   (4.73)
Consequently,
$R_x(j) = \begin{cases} \sigma_w^2 + \sum_{k=1}^{N} b_k R(k), & j = 0; \\ \sum_{k=1}^{N} b_k R(j-k), & j > 0. \end{cases}$   (4.74)
A comparison with the Wiener–Hopf equations (4.45) leads to the following set of optimal prediction coefficients:
$a_k = \begin{cases} b_k, & 0 < k \le N; \\ 0, & N < k \le K; \end{cases}$   (4.75)
which essentially sets
$A(z) = B(z).$   (4.76)
The above result makes intuitive sense because it sets the LPC encoder filter (4.7) to be the inverse of the filter (4.69) that generates the AR process. In particular, the Z-transform of the prediction residue is given by
$R(z) = [1 - A(z)] X(z)$   (4.77)
according to (4.7). Dropping in (4.69) and using (4.76), we obtain
$R(z) = [1 - A(z)] \frac{W(z)}{1 - B(z)} = W(z),$   (4.78)
which is the unpredictable white noise that drives the AR process. An important implication of the above equation is that the prediction residue process is, once again, white for an AR process whose order is not larger than the predictor order.
4.4.3.3 Other Cases

When a predictor with a finite order is applied to a general stochastic process, the prediction residue process is generally not white, but may be considered as approximately white in practical applications. As an example, let us consider the signal at the top of Fig. 4.1, which is not an AR process. Applying the Levinson–Durbin procedure in Sect. 4.4.2, we obtain the optimal first-order filter as
$a_1 = \frac{R(1)}{R(0)} \approx 0.99.$   (4.79)
The spectrum of the prediction residue using this optimal predictor is shown at the top of Fig. 4.6. It is obviously flat, so it may be considered white. Therefore, the LPC encoder filter (4.7) that produces the prediction residue signal is sometimes called a whitening filter. Note that it is the inverse of the decoder or the LPC reconstruction filter (4.10):
$H(z) = \frac{1}{D(z)}.$   (4.80)
Fig. 4.6 Power spectra of the prediction residue (top) and the source signal (bottom), together with the estimate by the reconstruction filter (the envelope in the bottom panel)
4.4.4 Spectrum Estimator

From (4.7), the Z-transform of the source signal may be expressed as
$X(z) = \frac{R(z)}{H(z)} = \frac{R(z)}{1 - \sum_{k=1}^{K} a_k z^{-k}},$   (4.81)
which implies that the power spectrum of the source signal is
$S_{xx}(e^{j\omega}) = \frac{S_{rr}(e^{j\omega})}{\left| 1 - \sum_{k=1}^{K} a_k e^{-jk\omega} \right|^2},$   (4.82)
where $S_{xx}(e^{j\omega})$ and $S_{rr}(e^{j\omega})$ are the power spectra of $x(n)$ and $r(n)$, respectively. Since $r(n)$ is white or nearly white, the equation above becomes
$S_{xx}(e^{j\omega}) \approx \frac{\sigma_r^2}{\left| 1 - \sum_{k=1}^{K} a_k e^{-jk\omega} \right|^2}.$   (4.83)
Therefore, the decoder, or the LPC reconstruction filter (4.10), provides an estimate of the spectrum of the source signal, and linear prediction is sometimes considered a temporal-frequency analysis tool. Note that this is an all-pole model for the source signal, so it can model peaks well, but it is incapable of modeling zeros (deep valleys) that may exist in the source signal. For this reason, the linear prediction spectrum is sometimes referred to as a spectral envelope. Furthermore, if the source signal cannot be modeled by poles, linear prediction may fail completely. The spectrum of the source signal and the estimate by the LPC reconstruction filter (4.83) are shown at the bottom of Fig. 4.6. It can be observed that the spectrum estimated by the LPC reconstruction filter matches the signal spectrum envelope very well.
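Evaluating (4.83) on a frequency grid yields the envelope directly, as in the sketch below. The first-order coefficient and residue variance are placeholders; in practice they would come from the Levinson–Durbin recursion of Sect. 4.4.2.

```python
import numpy as np

a = np.array([0.9])          # assumed first-order predictor, cf. (4.79)
sigma_r2 = 1.0               # assumed residue variance estimate
omega = np.linspace(0.0, np.pi, 512)

# Evaluate A(e^{jw}) = sum_k a_k e^{-jkw} on the grid.
k = np.arange(1, len(a) + 1)
A = (a[None, :] * np.exp(-1j * omega[:, None] * k[None, :])).sum(axis=1)
S = sigma_r2 / np.abs(1.0 - A) ** 2     # spectral envelope estimate (4.83)
print(10 * np.log10(S[:5]))             # envelope in dB near DC
```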
4.5 Noise Shaping

As discussed in Sect. 4.4, the optimal LPC encoder produces a prediction residue sequence that is white or nearly white. When this white residue is quantized, the quantization noise is often white as well, especially when fine quantization is used. This is a mismatch to the sensitivity curve of the human ear, which is well known for its substantial variation with frequency. It is, therefore, desirable to shape the spectrum of quantization noise at the decoder output to match the sensitivity curve of the human ear so that more perceptual irrelevancy can be removed. Linear prediction offers a flexible and simple mechanism for achieving this.
4.5.1 DPCM

Let us revisit the DPCM system in Fig. 4.5. Its output given by (4.23) may be expanded using (4.8) to become
$\hat{r}(n) = x(n) - \sum_{k=1}^{K} a_k \hat{x}(n-k) + q(n).$   (4.84)
Due to (4.25), the above equation may be written as
$\hat{r}(n) = x(n) - \sum_{k=1}^{K} a_k [x(n-k) + q(n-k)] + q(n)$
$\phantom{\hat{r}(n)} = x(n) - \sum_{k=1}^{K} a_k x(n-k) + q(n) - \sum_{k=1}^{K} a_k q(n-k).$   (4.85)
Fig. 4.7 Different schemes of linear prediction for shaping quantization noise at the decoder output: DPCM (top), open-loop DPCM (middle) and general noise feedback coding (bottom)
The Z-transform of the above equation is
$\hat{R}(z) = [1 - A(z)] X(z) + [1 - A(z)] Q(z),$   (4.86)
where $\hat{R}(z)$ is the Z-transform of $\hat{r}(n)$.

The above equation indicates that the DPCM system in Figs. 4.5 and 4.4 may be implemented using the structure at the top of Fig. 4.7 [5]. Both the source signal and the quantization noise are shaped by the same LPC encoder filter, and the shaping is subsequently reversed by the reconstruction filter in the decoder, so the overall transfer functions for the source signal and the quantization noise are the same and equal
to one. Therefore, the spectrum of the quantization noise at the decoder output is the same as that produced by the quantizer (see (4.26)). In other words, the spectrum of the quantization noise as produced by the quantizer is faithfully duplicated at the decoder output; there is no shaping of quantization noise. Many quantizers, including the uniform quantizer, produce a white quantization noise spectrum when the quantization step size is small (fine quantization). This white spectrum is faithfully duplicated at the decoder output by DPCM.
4.5.2 Open-Loop DPCM

Let us now consider the open-loop DPCM in Figs. 4.3 and 4.4, which is redrawn in the middle of Fig. 4.7. Compared with the DPCM scheme at the top, the processing for the source signal is unchanged; the overall transfer function is still one. However, there is no processing for the quantization noise in the encoder; it is shaped only by the LPC reconstruction filter, so the quantization noise at the decoder output is given by (4.21). As shown in Sect. 4.4.4, the optimal LPC reconstruction filter traces the spectral envelope of the source signal, so the quantization noise is shaped toward that envelope. If the quantizer produces white quantization noise, the quantization noise at the decoder output is shaped to the spectral envelope of the source signal.
4.5.3 Noise-Feedback Coding

While the LPC filter coefficients are determined by the necessity to maximize prediction gain, the DPCM and open-loop DPCM schemes imply that the filter processing the quantization noise in the encoder can be altered to shape the quantization noise without any impact on the perfect reconstruction of the source signal. Apparently, the transfer function for such a filter, called the error-feedback function or noise-feedback function, does not have to be either $1 - A(z)$ or 1; a different noise-feedback function can be used to shape the quantization noise spectrum into other desirable shapes. This gives rise to the noise-feedback coding shown at the bottom of Fig. 4.7, where the noise-feedback function is denoted as
$1 - F(z).$   (4.87)
Figure 4.7 implies an important advantage of noise-feedback coding: the shaping of quantization noise is accomplished completely in the encoder, so there is zero implementation impact or cost at the decoder. When designing the noise-feedback function, it is important to realize that the noise-feedback function is only half of the overall noise-shaping filter
$S(z) = \frac{1 - F(z)}{1 - A(z)};$   (4.88)
the other half is the LPC reconstruction filter, which is determined by solving the normal equations (4.46). The Z-transform of the quantization error at the decoder output is given by
$E(z) = S(z) Q(z),$   (4.89)
so the noise spectrum is
$S_{ee}(e^{j\omega}) = |S(e^{j\omega})|^2 S_{qq}(e^{j\omega}).$   (4.90)
In audio coding, this spectrum is supposed to be shaped to match the masked threshold for a given source signal. In principle, this matching can be achieved for all masked threshold shapes provided by a perceptual model [38].
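The overall shaping (4.88)-(4.90) can be inspected numerically, as sketched below. Both coefficient sets are illustrative assumptions: $A(z)$ would come from the normal equations and $F(z)$ from a perceptual model.

```python
import numpy as np

def one_minus_poly(c, omega):
    # Evaluate 1 - sum_k c_k e^{-jk omega} on the grid omega.
    k = np.arange(1, len(c) + 1)
    return 1.0 - (c[None, :] * np.exp(-1j * omega[:, None] * k)).sum(axis=1)

a = np.array([1.2, -0.5])    # assumed LPC coefficients
f = np.array([0.6, -0.1])    # assumed noise-feedback coefficients
omega = np.linspace(1e-3, np.pi, 512)

S = one_minus_poly(f, omega) / one_minus_poly(a, omega)   # S(z) of (4.88)
print(20 * np.log10(np.abs(S[:5])))   # shaping applied to white noise, per (4.90)
```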
Chapter 5
Transform Coding
Transform coding (TC) is a method that transforms a source signal into another one with a more compact representation. The goal is to quantize the transformed signal in such a way that the quantization error in the reconstructed signal is smaller than that obtained by directly quantizing the source signal.
5.1 Transform Coder

Transform coding is block-based, so the source signal $x(n)$ is first grouped into blocks, each of which consists of $M$ samples, and is represented by a vector:
$\mathbf{x}(n) = [x_0(n), x_1(n), \ldots, x_{M-1}(n)]^T = [x(nM), x(nM-1), \ldots, x(nM-M+1)]^T.$   (5.1)
The dimension $M$ is called the block size or block length. For a linear transform, the transformed block is obtained as
$\mathbf{y}(n) = \mathbf{T} \mathbf{x}(n),$   (5.2)
where
$\mathbf{y}(n) = [y_0(n), y_1(n), \ldots, y_{M-1}(n)]^T$   (5.3)
is called the transform of $\mathbf{x}(n)$, or the transform coefficients, and the $M \times M$ matrix
$\mathbf{T} = \begin{bmatrix} t_{0,0} & t_{0,1} & \cdots & t_{0,M-1} \\ t_{1,0} & t_{1,1} & \cdots & t_{1,M-1} \\ \vdots & \vdots & & \vdots \\ t_{M-1,0} & t_{M-1,1} & \cdots & t_{M-1,M-1} \end{bmatrix}$   (5.4)
is called the transformation matrix or simply the transform. This transform operation is shown in the left of Fig. 5.1.
Fig. 5.1 Flow chart of a transform coder. The coding delay for transforming and inverse transforming the signals is ignored in this graph
Transform coding is shown in Fig. 5.1. The transform coefficients $\mathbf{y}(n)$ are quantized into quantized coefficients $\hat{\mathbf{y}}(n)$, and the resulting quantization indexes are transmitted to the decoder. The decoder reconstructs from the received indexes the quantized coefficients $\hat{\mathbf{y}}(n)$ through inverse quantization. This process may be viewed as if the quantized coefficients $\hat{\mathbf{y}}(n)$ were received directly by the decoder. The decoder then reconstructs an estimate $\hat{\mathbf{x}}(n)$ of the source signal vector $\mathbf{x}(n)$ from the quantized coefficients $\hat{\mathbf{y}}(n)$ through an inverse transform
$\hat{\mathbf{x}}(n) = \mathbf{T}^{-1} \hat{\mathbf{y}}(n),$   (5.5)
where $\mathbf{T}^{-1}$ represents the inverse transform. The reconstructed vector $\hat{\mathbf{x}}(n)$ can then be unblocked to rebuild an estimate $\hat{x}(n)$ of the original source signal $x(n)$.

A basic requirement for a transform in the context of transform coding is that it must be invertible,
$\mathbf{T}^{-1} \mathbf{T} = \mathbf{I},$   (5.6)
so that a source block can be recovered from its transform coefficients in the absence of quantization:
$\mathbf{x}(n) = \mathbf{T}^{-1} \mathbf{y}(n).$   (5.7)
Orthogonal transforms are most frequently used in practical applications. To be orthogonal, a transform must satisfy
$\mathbf{T}^{-1} = \mathbf{T}^T,$   (5.8)
or
$\mathbf{T}^T \mathbf{T} = \mathbf{I}.$   (5.9)
Consequently, the inverse transform becomes
$\mathbf{x}(n) = \mathbf{T}^T \mathbf{y}(n).$   (5.10)
A transform matrix $\mathbf{T}$ can be considered as consisting of $M$ row vectors:
$\mathbf{T} = \begin{bmatrix} \mathbf{t}_0^T \\ \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_{M-1}^T \end{bmatrix},$   (5.11)
where $\mathbf{t}_k^T$ represents the $k$th row of $\mathbf{T}$:
$\mathbf{t}_k^T = [t_{k,0}, t_{k,1}, \ldots, t_{k,M-1}], \quad k = 0, 1, \ldots, M-1.$   (5.12)
Equation (5.10) becomes
$\mathbf{x}(n) = [\mathbf{t}_0, \mathbf{t}_1, \ldots, \mathbf{t}_{M-1}]\, \mathbf{y}(n) = \sum_{k=0}^{M-1} y_k \mathbf{t}_k,$   (5.13)
which means that the source can be represented as a linear combination of the vectors $\{\mathbf{t}_k\}_{k=0}^{M-1}$. Therefore, the rows of $\mathbf{T}$ are often referred to as basis vectors or basis functions.

One of the advantages of an orthogonal transform is that its inverse transform $\mathbf{T}^T$ is immediately available without considering matrix inversion. The fact that the inverse matrix is just the transposition of the transform matrix means that the inverse transform can be implemented by just transposing the transform flow graph or running it backwards.

Another advantage of an orthogonal transform is energy conservation, which means that the transform coefficients have the same total energy as the source signal:
$\sum_{k=0}^{M-1} y^2(k) = \sum_{k=0}^{M-1} x^2(k).$   (5.14)
This can be easily proved:
$\sum_{k=0}^{M-1} y^2(k) = \mathbf{y}^T \mathbf{y} = \mathbf{x}^T \mathbf{T}^T \mathbf{T} \mathbf{x} = \mathbf{x}^T \mathbf{x} = \sum_{k=0}^{M-1} x^2(k),$   (5.15)
where the orthogonality condition in (5.9) is used. In the derivation above, the block index $n$ is dropped for convenience. When the context is appropriate, this practice will be followed in the remainder of this book.
5.2 Optimal Bit Allocation and Coding Gain

The operations involved in transform and inverse transform are obviously sophisticated and entail a significant amount of computation. This extra burden is carried due to the anticipation that, for a given bit rate, the quantization error in the reconstructed signal will be smaller than that of directly quantizing the source signal. This is explained below.
5.2.1 Quantization Noise

The quantization error for the transform coder is the reconstruction error at the decoder output:
$\hat{q}_k = \hat{x}(nM-k) - x(nM-k), \quad k = 0, 1, \ldots, M-1,$   (5.16)
so the MSQE is
$\sigma_{q(\hat{x})}^2 = E\left[ \frac{1}{M} \sum_{k=0}^{M-1} \hat{q}_k^2 \right] = \frac{1}{M} \sum_{k=0}^{M-1} E[\hat{q}_k^2],$   (5.17)
where zero mean is assumed without loss of generality. Let
$\hat{\mathbf{q}} = [\hat{q}_0, \hat{q}_1, \ldots, \hat{q}_{M-1}]^T;$   (5.18)
the above equation becomes
$\sigma_{q(\hat{x})}^2 = \frac{1}{M} E[\hat{\mathbf{q}}^T \hat{\mathbf{q}}].$   (5.19)
From Fig. 5.1, we obtain
$\hat{\mathbf{q}} = \mathbf{T}^T \mathbf{q},$   (5.20)
so
$\sigma_{q(\hat{x})}^2 = \frac{1}{M} E[\mathbf{q}^T \mathbf{T} \mathbf{T}^T \mathbf{q}].$   (5.21)
Due to the orthogonality condition in (5.9), $\mathbf{T} \mathbf{T}^T = \mathbf{I}$, so
$\sigma_{q(\hat{x})}^2 = \frac{1}{M} E[\mathbf{q}^T \mathbf{q}] = \frac{1}{M} \sum_{k=0}^{M-1} E[q_k^2] = \frac{1}{M} \sum_{k=0}^{M-1} \sigma_{q_k}^2.$   (5.22)
Suppose that the bit rate for the transform coder is $R$ bits per source sample. Since there are $M$ samples in a block, the total number of bits available for coding one block of source samples is $MR$. These bits are allocated to the quantizers in Fig. 5.1. If the $k$th quantizer is allocated $r_k$ bits, the total must be $MR$:
$R = \frac{1}{M} \sum_{k=0}^{M-1} r_k.$   (5.23)
Due to (2.48), the MSQE for the $k$th quantizer is
$\sigma_{q_k}^2 = 10^{-0.1(a + b r_k)} \sigma_{y_k}^2, \quad k = 0, 1, \ldots, M-1,$   (5.24)
where the parameters $a$ and $b$ are dependent on the quantization scheme and the probability distribution of $y_k(n)$. Dropping (5.24) into (5.22), we obtain the following total MSQE:
$\sigma_{q(\hat{x})}^2 = \frac{1}{M} \sum_{k=0}^{M-1} 10^{-0.1(a + b r_k)} \sigma_{y_k}^2.$   (5.25)
Apparently, the MSQE is a function of both the signal variance $\sigma_{y_k}^2$ of each transform coefficient and the number of bits allocated to quantize it. While the former is determined by the source signal and the transform $\mathbf{T}$, the latter is determined by how the bits are allocated to the quantizers, i.e., the bit allocation strategy. For a given bit rate $R$, a different bit allocation strategy assigns a different set of $\{r_k\}_{k=0}^{M-1}$, called a bit allocation, which results in a different MSQE. It is, therefore, imperative to find the optimal one that gives the minimum MSQE.
5.2.2 AM–GM Inequality

Before addressing this question, let us first define the arithmetic mean
$AM = \frac{1}{M} \sum_{k=0}^{M-1} p_k$   (5.26)
and geometric mean
$GM = \left( \prod_{k=0}^{M-1} p_k \right)^{1/M}$   (5.27)
for a given set of nonnegative numbers $p_k$, $k = 0, 1, \ldots, M-1$. The AM–GM inequality [89] states that
$AM \ge GM,$   (5.28)
with equality if and only if
$p_0 = p_1 = \cdots = p_{M-1}.$   (5.29)
For an interpretation of this inequality, let us consider a $p_0 \times p_1$ rectangle. Its perimeter is $2(p_0 + p_1)$, its area is $p_0 p_1$, and its average side length is $\frac{p_0 + p_1}{2}$. A square with the same area obviously has sides with a length of $\sqrt{p_0 p_1}$. The AM–GM inequality states that $\frac{p_0 + p_1}{2} \ge \sqrt{p_0 p_1}$, with equality if and only if $p_0 = p_1$. In other words, among all rectangles with the same area, the square has the shortest average side length.
5.2.3 Optimal Conditions

Applying the AM–GM inequality to (5.22), we have
$\sigma_{q(\hat{x})}^2 = \frac{1}{M} \sum_{k=0}^{M-1} \sigma_{q_k}^2 \ge \left( \prod_{k=0}^{M-1} \sigma_{q_k}^2 \right)^{1/M},$   (5.30)
with equality if and only if
$\sigma_{q_0}^2 = \sigma_{q_1}^2 = \cdots = \sigma_{q_{M-1}}^2.$   (5.31)
Plugging in (5.24), we have
$\left( \prod_{k=0}^{M-1} \sigma_{q_k}^2 \right)^{1/M} = \left( \prod_{k=0}^{M-1} 10^{-0.1(a + b r_k)} \sigma_{y_k}^2 \right)^{1/M} = 10^{-0.1a}\, 10^{-0.1b \frac{1}{M} \sum_{k=0}^{M-1} r_k} \left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M} = 10^{-0.1(a + bR)} \left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M},$   (5.32)
where (5.23) is used to arrive at the last equality. Plugging this back into (5.30), we have
$\sigma_{q(\hat{x})}^2 \ge 10^{-0.1(a + bR)} \left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M},$   (5.33)
with equality if and only if (5.31) holds. For a given transform matrix $\mathbf{T}$, the variances of the transform coefficients $\sigma_{y_k}^2$ are determined by the source signal, so the right-hand side of (5.33) is a fixed value for a given bit rate $R$. Therefore, (5.33) and (5.31) state that
Optimal MSQE. The MSQE from any bit allocation $\{r_k\}_{k=0}^{M-1}$ is either equal to or larger than the fixed value on the right-hand side of (5.33), which is thus the minimal MSQE that could be achieved.

Optimal Bit Allocation. This minimal MSQE is achieved, or the equality holds, if and only if (5.31) is satisfied, or the bits are allocated in such a way that the MSQEs from all quantizers are equalized.

Note that this result is obtained without dependence on the actual transform matrix used. As long as the transform matrix is orthogonal, the results in this section hold.
5.2.4 Coding Gain

Now that it is established that the minimal MSQE of a transform coder is given by the right-hand side of (5.33), we denote it as
$\sigma_{q(\mathrm{TC})}^2 = 10^{-0.1(a + bR)} \left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M}.$   (5.34)
If the source signal is quantized directly as PCM with the same bit rate of $R$, the MSQE is given by (2.48) and denoted as
$\sigma_{q(\mathrm{PCM})}^2 = 10^{-0.1(a + bR)} \sigma_x^2.$   (5.35)
Then the coding gain of a transform coder over PCM is
$G_{\mathrm{TC}} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\mathrm{TC})}^2} = \frac{\sigma_x^2}{\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M}},$   (5.36)
which is the ratio of the source variance $\sigma_x^2$ to the geometric mean of the transform coefficient variances $\sigma_{y_k}^2$. Due to the energy conservation property of $\mathbf{T}$ (5.14), we have
$\sigma_x^2 = \frac{1}{M} \sum_{k=0}^{M-1} \sigma_{x_k}^2 = \frac{1}{M} \sum_{k=0}^{M-1} \sigma_{y_k}^2,$   (5.37)
where $\sigma_{x_k}^2$ is the variance of the $k$th element of the source vector $\mathbf{x}$. Applying this back into (5.36), we have
$G_{\mathrm{TC}} = \frac{\frac{1}{M} \sum_{k=0}^{M-1} \sigma_{y_k}^2}{\left( \prod_{k=0}^{M-1} \sigma_{y_k}^2 \right)^{1/M}},$   (5.38)
which states that the optimal coding gain of a transform coder is the ratio of the arithmetic to the geometric mean of the transform coefficient variances. Due to the AM–GM inequality (5.28), we always have
$G_{\mathrm{TC}} \ge 1.$   (5.39)
Note, however, that this is valid if and only if the optimal bit allocation strategy is deployed. Otherwise, the equality in (5.30) will not hold, and the transform coder will not be able to achieve the minimal MSQE given by (5.34).
5.2.5 Optimal Bit Allocation

The optimal bit allocation strategy requires that bits be allocated in such a way that the MSQEs from all quantizers are equalized to a constant (see (5.31)). Denoting such a constant as $\sigma_0^2$, we obtain from (5.24) that the number of bits allocated to the $k$th transform coefficient is
$r_k = \frac{10}{b} \log_{10} \sigma_{y_k}^2 - \frac{1}{b} \left( 10 \log_{10} \sigma_0^2 + a \right).$   (5.40)
Dropping it into (5.23), we obtain
$\frac{1}{b} \left( 10 \log_{10} \sigma_0^2 + a \right) = \frac{10}{b} \frac{1}{M} \sum_{k=0}^{M-1} \log_{10} \sigma_{y_k}^2 - R.$   (5.41)
Dropping this back into (5.40), we obtain the following optimal bit allocation:
$r_k = R + \frac{10}{b} \left[ \log_{10} \sigma_{y_k}^2 - \frac{1}{M} \sum_{j=0}^{M-1} \log_{10} \sigma_{y_j}^2 \right] = R + \frac{10}{b \log_2 10} \log_2 \frac{\sigma_{y_k}^2}{\left( \prod_{j=0}^{M-1} \sigma_{y_j}^2 \right)^{1/M}}, \quad \text{for } k = 0, 1, \ldots, M-1.$   (5.42)
If each transform coefficient is considered as satisfying a uniform distribution and is quantized using a uniform quantizer, the parameter $b$ is given by (2.44). Then the above equation becomes
$r_k = R + 0.5 \log_2 \frac{\sigma_{y_k}^2}{\left( \prod_{j=0}^{M-1} \sigma_{y_j}^2 \right)^{1/M}}, \quad \text{for } k = 0, 1, \ldots, M-1.$   (5.43)
Both (5.42) and (5.43) state that, aside from a global bias determined by the bit rate R and the geometric mean of the variances of the transform coefficients, bits should be assigned to a transform coefficient in proportion to the logarithm of its variance.
5.2.6 Practical Bit Allocation

It is unlikely that the optimal bit allocation strategy would allocate an integer value $r_k$ to any transform coefficient. When the quantization indexes are directly packed into a bit stream for delivery to the decoder, only an integer number of bits can be packed each time. If entropy coding is subsequently used to code the quantization indexes, an integer number of bits is not necessary, but the number of quantization intervals has to be an integer. In this case, $r_k$ can be a rational number. But this still cannot be guaranteed by the bit allocation strategy. A simple approach to addressing this problem is to round $r_k$ to its nearest integer or to a value that corresponds to an integer number of quantization intervals. There are, however, more elaborate methods, such as a water-filling procedure that iteratively allocates bits to the transform coefficients with the largest quantization error [4, 51].

In addition, the optimal bit allocation strategy assumes an ample supply of bits. Otherwise, some of the transform coefficients with a small variance will be assigned a negative number of bits, as can be seen from both (5.42) and (5.43). If the bit rate $R$ is not large enough to ensure a positive number of bits allocated to all transform coefficients, the strategy can be modified to include the following clause:
$r_k = 0 \quad \text{if } r_k < 0.$   (5.44)
Furthermore, if the variance of any transform coefficient becomes zero, zero bits are allocated to it and the zero variance is subsequently taken out of the geometric mean calculation. With these modifications, however, the equalization condition (5.31) of the AM–GM inequality no longer holds, so the quantizer cannot achieve the minimal MSQE in (5.34) and the coding gain in (5.36).

To illustrate the above method, let us consider an extreme example where
$\sigma_{y_0}^2 = M \sigma_x^2; \quad \sigma_{y_k}^2 = 0, \quad k = 1, 2, \ldots, M-1.$   (5.45)
The zero variances cause the geometric mean to completely break down, so both (5.34) and (5.36) are meaningless. The modified strategy above, however, dictates the following bit allocation:
$r_0 = MR; \quad r_k = 0, \quad k = 1, 2, \ldots, M-1.$   (5.46)
5.2.7 Energy Compaction

Dropping the above two equations for the extreme example into (5.25), we obtain the total MSQE as
$\sigma_{q(\hat{x})}^2 = 10^{-0.1(a + bMR)} \sigma_x^2.$   (5.47)
Since the MSQE for direct quantization (PCM) is still given by (5.35), the effective coding gain is
$G_{\mathrm{TC}} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\hat{x})}^2} = 10^{0.1(M-1)bR}.$   (5.48)
To appreciate this improvement, consider a uniform distribution whose parameter $b$ is given by (2.44); the above coding gain becomes
$G_{\mathrm{TC}}(\mathrm{dB}) \approx 6.02 (M-1) R.$   (5.49)
For a scenario of $M = 1024$ and $R = 1$ bit per source sample, which is typical in audio coding, this coding gain is about 6158 dB! An intuitive explanation for this dramatic improvement is energy compaction and the exponential reduction of quantization error with respect to bit rate. With direct quantization, each sample in the source vector has a variance of $\sigma_x^2$ and is allocated $R$ bits, resulting in a total signal variance of $M \sigma_x^2$ and a total of $MR$ bits for the whole block. With transform coding, however, this total variance of $M \sigma_x^2$ for the whole block is compacted into the first coefficient and all $MR$ bits for the whole block are allocated to it. Since MSQE is linearly proportional to the variance (see (5.25)), the MSQE for the first coefficient would increase $M$ times due to the $M$-fold increase in variance, but this increase is spread out over the $M$ samples in the block, resulting in no net change. However, MSQE decreases exponentially with bit rate, so the $M$-fold increase in bit rate multiplies the MSQE reduction in decibels by $M$!

In fact, energy compaction is the key to coding gain in transform coding. Since the transform matrix $\mathbf{T}$ is orthogonal, the arithmetic mean is constant for a given signal, no matter how its energy is distributed by the transform to the individual coefficients. This means that the numerator of (5.38) remains the same regardless of the transform matrix $\mathbf{T}$. However, if the transform distributes most of the energy (variance) to a minority of transform coefficients and leaves the balance to the rest, the geometric mean in the denominator of (5.38) becomes extremely small. Consequently, the coding gain becomes extremely large.
5.3 Optimal Transform

Section 5.2 has established that the coding gain depends on the degree of energy compaction that the transform matrix $\mathbf{T}$ delivers. Is there a transform that is optimal in the sense of having the best energy compaction capability, or of delivering the best coding gain?
5.3.1 Karhunen–Loeve Transform

To answer this question, let us go back to (5.2) to establish the following equation:

$$\mathbf{y}(n)\mathbf{y}^T(n) = \mathbf{T}\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{T}^T. \qquad (5.50)$$

Taking expected values on both sides, we obtain the covariance of the transform coefficients

$$\mathbf{R}_{yy} = \mathbf{T}\mathbf{R}_{xx}\mathbf{T}^T, \qquad (5.51)$$

where

$$\mathbf{R}_{yy} = E\left[\mathbf{y}(n)\mathbf{y}^T(n)\right] \qquad (5.52)$$

and $\mathbf{R}_{xx}$ is the covariance matrix of the source signal defined in (4.49). As noted there, it is symmetric and Toeplitz.

By definition, the $k$th diagonal element of $\mathbf{R}_{yy}$ is the variance of the $k$th transform coefficient:

$$[\mathbf{R}_{yy}]_{kk} = E\left[y_k(n)y_k(n)\right] = \sigma_{y_k}^2, \qquad (5.53)$$

so the geometric mean of $\{\sigma_{y_k}^2\}_{k=0}^{M-1}$ is built from the product of the diagonal elements of $\mathbf{R}_{yy}$:

$$\prod_{k=0}^{M-1} \sigma_{y_k}^2 = \prod_{k=0}^{M-1} [\mathbf{R}_{yy}]_{kk}. \qquad (5.54)$$
It is well known that a covariance matrix is positive semidefinite, i.e., its eigenvalues are all real and nonnegative [33]. For practical signals, $\mathbf{R}_{xx}$ may be considered positive definite (no zero eigenvalues). Due to (5.51), $\mathbf{R}_{yy}$ may be considered positive definite as well, so the following inequality holds [93]:

$$\prod_{k=0}^{M-1} [\mathbf{R}_{yy}]_{kk} \ge \det \mathbf{R}_{yy}, \qquad (5.55)$$

with equality if and only if $\mathbf{R}_{yy}$ is diagonal. Due to (5.51), we have

$$\det \mathbf{R}_{yy} = \det \mathbf{T}\,\det \mathbf{R}_{xx}\,\det \mathbf{T}^T. \qquad (5.56)$$

Taking the determinant of (5.9) gives

$$\det \mathbf{T}^T \det \mathbf{T} = \det \mathbf{I} = 1, \qquad (5.57)$$

which leads to

$$|\det \mathbf{T}| = 1. \qquad (5.58)$$
Dropping this back into (5.56), we obtain

$$\det \mathbf{R}_{yy} = \det \mathbf{R}_{xx}. \qquad (5.59)$$

Consequently, the inequality in (5.55) becomes

$$\prod_{k=0}^{M-1} [\mathbf{R}_{yy}]_{kk} \ge \det \mathbf{R}_{xx}, \qquad (5.60)$$

again, with equality if and only if $\mathbf{R}_{yy}$ is diagonal.

Since $\mathbf{R}_{xx}$ is completely determined by the statistical properties of the source signal, the right-hand side of (5.60) is a fixed value. Due to (5.51), however, we can adjust the transform matrix $\mathbf{T}$ to alter the value on the left-hand side of (5.60). The best we can achieve by doing this is to find a $\mathbf{T}$ that makes $\mathbf{R}_{yy}$ a diagonal matrix:

$$\mathbf{R}_{yy} = \mathbf{T}\mathbf{R}_{xx}\mathbf{T}^T = \mathrm{diag}\{\sigma_{y_0}^2, \sigma_{y_1}^2, \ldots, \sigma_{y_{M-1}}^2\}, \qquad (5.61)$$

so that the equality in (5.60) holds. It is well known in matrix theory [28] that the matrix $\mathbf{T}$ which makes (5.61) hold is an orthonormal matrix whose rows are the orthonormal eigenvectors of the matrix $\mathbf{R}_{xx}$, and the diagonal elements $\{\sigma_{y_k}^2\}_{k=0}^{M-1}$ are the eigenvalues of $\mathbf{R}_{xx}$. Such a transform matrix is called the Karhunen–Loeve Transform (KLT) of the source signal $\mathbf{x}(n)$, and the eigenvalues are the variances of its transform coefficients.
5.3.2 Maximal Coding Gain

With a Karhunen–Loeve transform matrix, the equality in (5.60) holds, which gives the minimum value for the geometric mean:

$$\text{Minimum: } \left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M} = \left(\det \mathbf{R}_{xx}\right)^{1/M}. \qquad (5.62)$$

Dropping this back into (5.36), we establish that the maximum coding gain of the optimal transform coder, for a given source signal $\mathbf{x}(n)$, is

$$\text{Maximum: } G_{TC} = \frac{\sigma_x^2}{\left(\det \mathbf{R}_{xx}\right)^{1/M}}. \qquad (5.63)$$

Note that this is made possible by
- Deploying the Karhunen–Loeve transform
- Providing ample bits
- Following the optimal bit allocation strategy
The maximal coding gain is achieved by the Karhunen–Loeve transform through diagonalizing the covariance matrix $\mathbf{R}_{xx}$ into $\mathbf{R}_{yy}$. The diagonal covariance matrix $\mathbf{R}_{yy}$ means that the transform coefficients constituting the vector $\mathbf{y}$ are uncorrelated, so maximum coding gain, or energy compaction, is directly linked to decorrelation.
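The construction of the KLT can be illustrated numerically. The sketch below, assuming an AR(1) covariance model for the source (an illustrative choice), builds $\mathbf{T}$ from the eigenvectors of $\mathbf{R}_{xx}$ and verifies that $\mathbf{T}\mathbf{R}_{xx}\mathbf{T}^T$ is diagonal.

```python
import numpy as np

# Toeplitz covariance of an AR(1) source with correlation rho: R[i,j] = rho**|i-j|.
M, rho = 8, 0.95
Rxx = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))

# KLT rows are the orthonormal eigenvectors of Rxx (eigh handles symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(Rxx)
T = eigvecs.T[::-1]            # order rows by decreasing eigenvalue

Ryy = T @ Rxx @ T.T            # should be diagonal: coefficients are decorrelated
off = Ryy - np.diag(np.diag(Ryy))
print("max off-diagonal:", np.abs(off).max())        # ~1e-15
print("coefficient variances:", np.sort(np.diag(Ryy))[::-1])
```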
5.3.3 Spectrum Flatness

The maximal coding gain discussed above is optimal for a given block size $M$. Typically, the maximal coding gain increases with the block size, approaching an upper limit as the block size tends to infinity. It can be shown [33] that transform coding is asymptotically optimal because this upper limit is equal to the theoretic upper limit predicted by rate-distortion theory [33]:

$$\lim_{M \to \infty} G_{TC} = \frac{1}{\gamma_x^2}, \qquad (5.64)$$

where $\gamma_x^2$ is the spectral flatness measure

$$\gamma_x^2 = \frac{\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega})\,d\omega\right)}{\frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\,d\omega} \qquad (5.65)$$

of the source signal $x(n)$ whose power spectrum is $S_{xx}(e^{j\omega})$.
5.4 Suboptimal Transforms

Despite its optimality, the KLT is rarely used in practical applications. The first reason is that the KLT is signal-dependent: it is built from the covariance matrix of the source signal. This has a few ramifications, including:

- The KLT is only as good as the statistical signal model, but a good model is not always available.
- Signal statistics change with time in most applications. This calls for real-time calculation of the covariance matrix, eigenvectors, and eigenvalues, which is seldom plausible in practical applications, especially in the decoder. Even if the encoder is assigned to do the calculation, transmission of the eigenvectors to the decoder consumes a large number of bits, so it is not feasible for compression coding.

Even if the signal-dependent issues above are resolved and the associated eigenvectors and eigenvalues are available at our convenience, the calculation of the Karhunen–Loeve transform itself is still computationally expensive, especially for a large $M$, because both the transform in (5.2) and the inverse transform in (5.10) require
on the order of $M^2$ operations (multiplications and additions). This compares unfavorably with structured transforms, such as the DCT, whose structures are amenable to fast algorithms requiring on the order of $M\log_2 M$ operations.

There are many structured and signal-independent transforms which can be considered suboptimal in the sense that their performance approaches that of the KLT when the block size is large. In fact, all sinusoidal orthogonal transforms are found to approach the performance of the KLT as the block size tends to infinity [74], including the discrete Fourier transform, DCTs, and discrete sine transforms.

With such sinusoidal orthogonal transforms, frequency is a key characteristic of the basis functions or vectors, so the transform coefficients are usually indexed by frequency. Consequently, they are often referred to as frequency coefficients and are considered to be in the frequency domain. The transforms may accordingly be referred to as frequency transforms or time–frequency analysis.
5.4.1 Discrete Fourier Transform

The DFT (Discrete Fourier Transform) is the most prominent member of this category of sinusoidal transforms. Its transform matrix is given by

$$\mathbf{T} = \mathbf{W}_M = \left[W_M^{kn}\right], \qquad (5.66)$$

where

$$W_M = e^{-j2\pi/M}, \qquad (5.67)$$

so that

$$[\mathbf{T}]_{k,n} = W_M^{kn} = e^{-j2\pi kn/M}. \qquad (5.68)$$

The matrix is unitary, meaning

$$\left(\mathbf{W}_M^*\right)^T \mathbf{W}_M = M\mathbf{I}, \qquad (5.69)$$

so its inverse is

$$\mathbf{W}_M^{-1} = \frac{1}{M}\left(\mathbf{W}_M^*\right)^T. \qquad (5.70)$$

Note that the scale factor $M$ above is usually disregarded when discussing orthogonal transforms, because it can be adjusted on either the forward or the backward transform side within a particular context. The matrix is also symmetric, that is,

$$\mathbf{W}_M^T = \mathbf{W}_M, \qquad (5.71)$$

so (5.70) becomes

$$\mathbf{W}_M^{-1} = \frac{1}{M}\mathbf{W}_M^*. \qquad (5.72)$$
The DFT is more commonly written as

$$y_k = \sum_{n=0}^{M-1} x(n)\,W_M^{kn} \qquad (5.73)$$

and its inverse (IDFT) as

$$x(n) = \frac{1}{M}\sum_{k=0}^{M-1} y_k\,W_M^{-kn}. \qquad (5.74)$$
When the DFT is applied to a block of $M$ source samples, this block is virtually extended periodically on both sides of the block boundary to infinity, as shown at the top of Fig. 5.2. This introduces sharp discontinuities at both boundaries. To accommodate these discontinuities, the DFT needs to incur many large coefficients, especially at high frequencies. This spreads energy, the opposite of energy compaction, so the DFT is not ideal for signal coding.
Fig. 5.2 Periodic boundaries of DFT (top), DCT-II (middle), and DCT-IV (bottom)
In addition, the DFT is a complex transform that produces complex transform coefficients even for real signals. This implies that the number of transform coefficients that need to be quantized and conveyed to the decoder is doubled. Due to these two drawbacks, the DFT is seldom directly deployed in practical transform coders. It is, however, frequently deployed in many transform coders as a conduit for fast calculation of other transforms, because of its structure, which is well suited to fast algorithms, and the abundance of such algorithms.
5.4.2 DCT

The DCT is a family of Fourier-related transforms that involve only real transform coefficients [2, 80]. Since the Fourier transform of a real and even signal is real and even, a DCT operates on real data and is equivalent to a DFT of roughly twice the block length. Depending on how the beginning and ending block boundaries are handled, there are eight types of DCTs, and correspondingly eight types of DSTs, but only two of them are widely used.
5.4.2.1 Type-II DCT

The most prominent member of the DCT family is the type-II DCT given below:

$$\text{DCT-II: } [\mathbf{C}]_{k,n} = c(k)\sqrt{\frac{2}{M}}\cos\left[(n+0.5)\frac{k\pi}{M}\right], \qquad (5.75)$$

where

$$c(k) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } k = 0; \\ 1, & \text{otherwise.} \end{cases} \qquad (5.76)$$

Its inverse transform is

$$\text{IDCT-II: } [\mathbf{C}^T]_{n,k} = [\mathbf{C}]_{k,n}. \qquad (5.77)$$

DCT-II tends to deliver the best energy compaction performance in the DCT and DST family. It achieves this mostly because it uses symmetric boundary conditions on both sides of its period, as shown in the middle of Fig. 5.2. In particular, DCT-II extends its boundaries symmetrically on both sides of a period, so the samples can be considered periodic with a period of $2M$ and there is essentially no discontinuity at either boundary.

DCT-II for $M = 2$ was shown to be identical to the KLT for a first-order autoregressive (AR) source [33]. Furthermore, the coding gain of DCT-II for other $M$ is shown to be very close to that of the KLT for such a source with a high correlation coefficient:

$$\rho = \frac{R(1)}{R(0)} \to 1. \qquad (5.78)$$

Similar results with real speech were also observed in [33].
Since many real-world signals can be modeled as such a source, DCT-II is deployed in many signal coding and processing applications and is sometimes simply called "the DCT" (its inverse is, of course, called "the inverse DCT" or "the IDCT"). The two-dimensional DCT-II, which shares these characteristics, has been deployed by many international image and video coding standards, such as JPEG [37], MPEG-1/2/4 [54, 57, 58], and MPEG-4(AVC)/H.264 [61].
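The energy compaction of DCT-II on a highly correlated source is easy to verify empirically. The following sketch, assuming an AR(1) source with $\rho = 0.95$ and using SciPy's orthonormal DCT-II (the parameters are illustrative), measures how much of the total variance is packed into a few coefficients.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
M, rho, n_blocks = 64, 0.95, 2000

# Generate an AR(1) source and cut it into blocks of M samples.
x = np.zeros(M * n_blocks)
for n in range(1, len(x)):
    x[n] = rho * x[n - 1] + rng.standard_normal()
blocks = x.reshape(n_blocks, M)

# Orthonormal DCT-II of each block; empirical variance per coefficient index.
y = dct(blocks, type=2, norm="ortho", axis=1)
var = y.var(axis=0)

# Fraction of total energy captured by the 8 largest-variance coefficients.
top8 = np.sort(var)[::-1][:8].sum() / var.sum()
print(f"energy in top 8 of {M} DCT-II coefficients: {top8:.1%}")  # well above 90%
```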
5.4.2.2 Type-IV DCT

The type-IV DCT is obtained by shifting the frequencies of the type-II DCT in (5.75) by $\pi/2M$, so it has the following form:

$$\text{DCT-IV: } [\mathbf{C}]_{k,n} = \sqrt{\frac{2}{M}}\cos\left[(n+0.5)(k+0.5)\frac{\pi}{M}\right]. \qquad (5.79)$$

It is also its own inverse transform. Due to this frequency shifting, its right boundary is no longer smooth, as shown at the bottom of Fig. 5.2. Such a sharp discontinuity requires a lot of large transform coefficients to compensate, significantly degrading its energy compaction ability, so it is not as useful as DCT-II. However, it serves as a valuable building block for fast algorithms of the DCT, the MDCT, and other cosine-modulated filter banks.
Chapter 6
Subband Coding
Transform coders artificially divide a source signal into blocks, then process and code each block independently of the others. This leads to significant variations between blocks, which may become visible or audible as discontinuities at the block boundaries. Referred to as blocking artifacts or blocking effects, these artifacts may appear as "tiles" in decoded images or videos that were coded at low bit rates. In audio, blocking artifacts sound like periodic "clicking", which many people find annoying. While the human eye can tolerate a large degree of blocking artifacts, the human ear is extremely intolerant of such periodic "clicking". Therefore, transform coding is rarely deployed in audio coding.

One approach to avoiding blocking artifacts is to introduce overlapping between blocks. For example, a transform coder may be structured in such a way that it advances only 90% of a block, so that there is a 10% overlap between blocks. Reconstructed samples in the overlapped region can be averaged to smooth out the discontinuities between blocks, thus avoiding the blocking artifacts. The cost of this overlapping is reduced coding performance: since the number of transform coefficients is equal to the number of source samples in a block, the number of samples overlapping between blocks equals the number of extra transform coefficients that need to be quantized and conveyed to the decoder. They consume valuable extra bits, thus degrading coding performance.

What is really needed is overlapping in the temporal domain, but no overlapping in the transform or frequency domain. This means the transform matrix $\mathbf{T}$ in (5.2) is no longer $M \times M$, but $M \times N$, where $N > M$. This leads to subband coding (SBC). For example, application of this idea to the DCT leads to the modified discrete cosine transform (MDCT), which has 50% overlapping in the temporal domain, i.e., its transform matrix $\mathbf{T}$ is $M \times 2M$.
6.1 Subband Filtering

Subband coding is based on the decomposition of a source signal into subband samples through a filter bank. This decomposition can be considered an extension of a transform, where the filter bank corresponds to the transform matrix and
the subband samples to the transform coefficients. This extension is based on a new perspective on transforms, which views the multiplication of a transform matrix with the source vector as filtering of the source vector by a bank of subband filters whose impulse responses are the row vectors of the transform matrix. Allowing these subband filters to be longer than the block size of the transform enables dramatic improvement in energy compaction capability, in addition to the elimination of blocking artifacts.
6.1.1 Transform Viewed as Filter Bank

To extend from transform coding to subband coding, let us consider the transform expressed in (5.2). The operations involved in such a transform to obtain the $k$th component of the transform coefficient vector $\mathbf{y}(n)$ from the source vector $\mathbf{x}(n)$ may be written as

$$y_k(n) = \mathbf{t}_k^T \mathbf{x}(n) = \sum_{m=0}^{M-1} t_{k,m}\,x_m(n) = \sum_{m=0}^{M-1} t_{k,m}\,x(nM - m), \qquad (6.1)$$
where the last step is obtained via (5.1). The above operation can obviously be considered as filtering the source signal $x(n)$ by a filter whose impulse response is given by the $k$th basis vector or basis function $\mathbf{t}_k$. Consequently, the whole transform may be considered as filtering by a bank of filters, called the analysis filter bank, with impulse responses given by the row vectors of the transform matrix $\mathbf{T}$. This is shown in Fig. 6.1. Similarly, the inverse transform in (5.10) may be considered as the output from a bank of filters, called the synthesis filter bank, with impulse responses given by the row vectors of the inverse matrix $\mathbf{T}^T$. This is also shown in Fig. 6.1.

For the suboptimal sinusoidal transforms discussed in Chap. 5, each of the filters in either the analysis or synthesis bank is associated with a basis vector or basis
Fig. 6.1 Analysis and synthesis filter banks
function which corresponds to a specific frequency, so it deals with components of the source signal associated with that frequency. Such filters are usually band-pass and decompose the frequency axis into small bands, called subbands, so they are called subband filters and the decomposed signal components $\{y_k\}_{k=0}^{M-1}$ are called subband samples.
6.1.2 DFT Filter Bank

To illustrate the power of this new perspective on transforms, let us consider the simple analysis filter bank shown in Fig. 6.2, which is built using the inverse DFT matrix given in (5.70). The delay chain in the analysis bank consists of $M-1$ delay units $z^{-1}$ connected in series. As the source signal $x(n)$ passes through it, a bank of signals is extracted:

$$u_k(n) = x(n-k), \quad k = 0, 1, \ldots, M-1. \qquad (6.2)$$

This presents $M$ samples of the source signal simultaneously to the transform matrix. Due to (5.67), except for the scale factor of $1/M$, the subband samples for the $k$th subband are the output of the $k$th subband filter and are given by

$$y_k(n) = \sum_{m=0}^{M-1} u_m(n)\,W_M^{-km}, \quad k = 0, 1, \ldots, M-1, \qquad (6.3)$$

which is essentially the inverse DFT in (5.74). Due to (6.2), it becomes

$$y_k(n) = \sum_{m=0}^{M-1} x(n-m)\,W_M^{-km}. \qquad (6.4)$$

Fig. 6.2 DFT analysis filter bank
Its Z-transform is

$$Y_k(z) = \sum_{m=0}^{M-1} X(z)\,z^{-m} W_M^{-km} = X(z) \sum_{m=0}^{M-1} \left(zW_M^{k}\right)^{-m}, \qquad (6.5)$$

so the transfer function of the $k$th subband filter is

$$H_k(z) = \frac{Y_k(z)}{X(z)} = \sum_{m=0}^{M-1} \left(zW_M^{k}\right)^{-m}. \qquad (6.6)$$

Since the transfer function of the zeroth subband is

$$H(z) = \sum_{m=0}^{M-1} z^{-m}, \qquad (6.7)$$

the transfer functions of all other subbands may be represented in terms of it:

$$H_k(z) = H\!\left(zW_M^{k}\right). \qquad (6.8)$$

Its frequency response is

$$H_k(e^{j\omega}) = H\!\left(e^{j\left(\omega - \frac{2\pi k}{M}\right)}\right), \qquad (6.9)$$

which is $H(e^{j\omega})$ uniformly shifted by $2\pi k/M$ in the frequency domain. Therefore, $H(z)$ is called the prototype filter, and all other subband filters in the DFT bank are built by uniformly shifting or modulating the prototype filter. A filter bank with such a structure is called a modulated filter bank. It is a prominent category of filter banks, most notably amenable to fast implementation.

The magnitude response of the prototype filter (6.7) is

$$\left|H(e^{j\omega})\right| = \left|\frac{\sin\frac{M\omega}{2}}{\sin\frac{\omega}{2}}\right|, \qquad (6.10)$$

which is shown at the top of Fig. 6.3 for $M = 8$. According to (6.9), all other subband filters in the bank are shifted or modulated versions of it and are shown at the bottom of Fig. 6.3.
6.1.3 General Filter Banks

Viewed from the new perspective of subband filtering, the DFT apparently has rather inferior energy compaction capability: the subband filters have wide transition bands and their stopband attenuation is only about 13 dB. A significant amount of energy in one subband spills into other subbands, appearing as ripples in Fig. 6.3.
Fig. 6.3 Magnitude responses of a DFT analysis filter bank with M = 8. The top shows the prototype filter. All subband filters in the bank are uniformly shifted or modulated versions of it and are shown at the bottom
Fig. 6.4 Ideal bandpass subband filters for M = 8
While other transforms, such as the KLT and DCT, may have better energy compaction capability than the DFT, they are ultimately limited by $M$, the number of samples in the block. It is well known in filter design that a sharp magnitude response with little energy leakage requires long filters [67], so the fundamental limiting factor is the block size $M$ of these transforms or subband filters. To obtain even better performance, subband filters longer than $M$ need to be used. To maximize energy compaction, subband filters should have the magnitude response of the ideal bandpass filter shown in Fig. 6.4, which has no energy leakage at all and achieves the maximum coding gain (to be proved later). Unfortunately, such a bandpass filter requires an infinite filter order [68], which is very difficult to implement in a practical system, so the challenge is to design subband filters that optimize the coding gain for a given limited order.
Once the order of the subband filters is extended beyond $M$, overlapping between blocks occurs, providing the additional benefit of mitigating the blocking artifacts discussed at the beginning of this chapter.

It is clear that transforms are a special type of filter bank whose subband filters have order less than $M$; their main characteristic is that there is no overlapping between transform blocks. Therefore, transforms are frequently referred to as filter banks and transform coefficients as subband samples. On the other hand, filter banks with subband filters longer than $M$ are sometimes referred to as transforms or lapped transforms in the literature [49]. One such example is the modified discrete cosine transform (MDCT), whose subband filters are twice as long as the block size.
6.2 Subband Coder

When the filter bank in Figs. 6.1 and 6.2 is directly used for subband coding, there is an immediate obstacle: an $M$-fold increase in the number of samples to be coded, because the analysis bank generates $M$ subband samples for each source sample. This problem may be resolved, as shown in Fig. 6.5, by $M$-fold decimation in the analysis bank to make the total number of subband samples equal to that of the source block, followed by $M$-fold expansion in the synthesis bank to recover the sample rate of the subband samples back to the original sample rate of the source signal.

An $M$-fold decimator, also referred to as a downsampler or sample rate compressor, discards $M-1$ samples from each block of $M$ input samples and retains only one sample for output:

$$x_D(n) = x(Mn), \qquad (6.11)$$

Fig. 6.5 Maximally decimated filter bank and subband coder. The ↓M denotes M-fold decimation and ↑M M-fold expansion. The additive noise model is used to represent quantization in each subband
where $x(n)$ is the source sequence and $x_D(n)$ is the decimated sequence. Due to the loss of $M-1$ samples incurred in decimation, it may not be possible to recover $x(n)$ from the decimated $x_D(n)$ due to aliasing [65, 68]. When applied to the analysis bank in Fig. 6.5, the decimator reduces the sample rate of each subband to $1/M$ of the original. Since there are $M$ subbands, the total sample rate of all subbands is still the same as that of the source signal.

The $M$-fold expander, also referred to as an upsampler or interpolator, passes each input sample through to the output and, after each, inserts $M-1$ zeros:

$$x_E(n) = \begin{cases} x(n/M), & \text{if } n/M \text{ is an integer;} \\ 0, & \text{otherwise.} \end{cases} \qquad (6.12)$$

Since all samples from the input are passed through to the output, there is obviously no loss of information. For example, the source can be recovered from the expanded output by an $M$-fold decimator. As explained in Sect. 6.3, however, expansion creates images in the spectrum, which need to be handled accordingly. When applied to the synthesis bank in Fig. 6.5, the expander for each subband recovers the sample rate of that subband back to the original sample rate of the source signal. It is then possible to output a reconstructed sequence at the same sample rate as the source signal.

Coding in the subband domain, or subband coding, is accomplished by attaching a quantizer to the output of each subband filter in the analysis filter bank and a corresponding inverse quantizer to the input of each subband filter in the synthesis filter bank. The abstraction of this quantization and inverse quantization is the additive noise model of Fig. 2.3, which is deployed in Fig. 6.5.

The filter bank above has a special characteristic: the sample rate of each subband is $1/M$ of that of the source. This happens because the decimation factor is equal to the number of subbands. Such a subband system is called a maximally decimated or critically sampled filter bank.
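Decimation (6.11) and expansion (6.12) are one-liners in array terms. The sketch below, with illustrative function names, also shows that an expander followed by a decimator returns the original sequence, while the reverse order loses samples.

```python
import numpy as np

def decimate(x, M):
    """M-fold decimator (6.11): keep every M-th sample."""
    return x[::M]

def expand(x, M):
    """M-fold expander (6.12): insert M-1 zeros after each sample."""
    y = np.zeros(len(x) * M, dtype=x.dtype)
    y[::M] = x
    return y

x = np.arange(12)
xd = decimate(x, 4)                      # [0, 4, 8] -- information is lost
print(xd, decimate(expand(xd, 4), 4))    # expansion is undone by decimation
```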
6.3 Reconstruction Error

When a signal moves through a maximally decimated subband system, it is altered by analysis filters, decimators, quantizers, inverse quantizers, expanders, and synthesis filters. The combined effect of these alterations may lead to reconstruction error at the output of the synthesis bank:

$$e(n) = \hat{x}(n) - k\,x(n-d), \qquad (6.13)$$

where $k$ is a scale factor and $d$ is a delay. For subband coding, this reconstruction error needs to be either exactly or approximately zero in the absence of quantization, so that the reconstructed signal is a delayed and scaled version of the source signal. This section analyzes decimation and expansion effects to arrive at conditions on analysis and synthesis filters that guarantee zero reconstruction error.
6.3.1 Decimation Effects

Before considering decimation effects, let us first consider the following geometric series:

$$p_M(n) = \frac{1}{M}\sum_{m=0}^{M-1} e^{j\frac{2\pi mn}{M}}. \qquad (6.14)$$

When $n$ is a multiple of $M$, the above equation becomes

$$p_M(n) = 1 \qquad (6.15)$$

due to $e^{j2\pi m} = 1$. For other values of $n$, the equation becomes

$$p_M(n) = \frac{1}{M}\cdot\frac{\left(e^{j\frac{2\pi n}{M}}\right)^M - 1}{e^{j\frac{2\pi n}{M}} - 1} = 0 \qquad (6.16)$$

due to the formula for geometric series and $e^{j2\pi n} = 1$. Therefore, the geometric series (6.14) reduces to

$$p_M(n) = \begin{cases} 1, & \text{if } n \text{ is a multiple of } M; \\ 0, & \text{otherwise.} \end{cases} \qquad (6.17)$$
Now let us consider the Z-transform of the decimated sequence in (6.11):

$$X_D(z) = \sum_{n=-\infty}^{\infty} x_D(n)\,z^{-n} = \sum_{n=-\infty}^{\infty} x(Mn)\,z^{-n}. \qquad (6.18)$$

Due to the upper half of (6.17), we can multiply the right-hand side of the above equation by $p_M(nM)$ to get

$$X_D(z) = \sum_{n=-\infty}^{\infty} p_M(nM)\,x(Mn)\,z^{-n}. \qquad (6.19)$$

Due to the lower half of (6.17), we can make the variable change $m = nM$ to get

$$X_D(z) = \sum_{m=-\infty}^{\infty} p_M(m)\,x(m)\,z^{-m/M}, \qquad (6.20)$$

where $m$ takes on integer values in increments of one. Dropping in (6.14), we have

$$X_D(z) = \frac{1}{M}\sum_{m=-\infty}^{\infty} x(m) \sum_{k=0}^{M-1}\left(z^{1/M}\,e^{-j\frac{2\pi k}{M}}\right)^{-m}. \qquad (6.21)$$
Due to (5.67), the equation above becomes

$$X_D(z) = \frac{1}{M}\sum_{m=-\infty}^{\infty} x(m) \sum_{k=0}^{M-1}\left(z^{1/M}\,W_M^{k}\right)^{-m}. \qquad (6.22)$$

Letting $X(z)$ denote the Z-transform of $x(m)$, the equation above can be written as

$$X_D(z) = \frac{1}{M}\sum_{k=0}^{M-1} X\!\left(z^{1/M}\,W_M^{k}\right). \qquad (6.23)$$

The Fourier transform of (6.23) is

$$X_D(e^{j\omega}) = \frac{1}{M}\sum_{k=0}^{M-1} X\!\left(e^{j\frac{\omega - 2\pi k}{M}}\right), \qquad (6.24)$$

which can be interpreted as follows:
1. Stretch $X(e^{j\omega})$ by a factor of $M$.
2. Create $M-1$ aliasing copies and shift them by $2\pi k$, respectively.
3. Add all shifted aliasing copies obtained in step 2 to the stretched copy in step 1.
4. Divide the sum above by $M$.
As an example, let us consider the prototype filter $H(z)$ in (6.7) as the Z-transform of a regular signal. Its time-domain representation is obviously

$$x(n) = \begin{cases} 1, & 0 \le n < M; \\ 0, & \text{otherwise.} \end{cases} \qquad (6.25)$$

Letting $M = 8$, we now examine the effect of eightfold decimation on this signal. Its Fourier transform is given in (6.10) and shown at the top of Fig. 6.3. The stretched $H(e^{j\omega/M})$ and all its shifted aliasing copies are shown at the top of Fig. 6.6. Due to the stretching factor of $M = 8$, their period is no longer $2\pi$, but is stretched to $8 \times 2\pi$, which is the frequency range covered by Fig. 6.6. The Fourier transform of the decimated signal is shown at the bottom of Fig. 6.6, whose period is $2\pi$, as required of the Fourier transform of a sequence. Due to the overlapping of the stretched spectrum with its shifted aliasing copies and the subsequent mutual cancellation, the spectrum of the decimated signal is totally different from that of the original source signal shown in Fig. 6.3, so we cannot recover the original signal from its decimated version.

One approach to avoiding aliasing is to band-limit the source signal to $|\omega| < \pi/M$, following Nyquist's sampling theorem [65]. Due to the stretching factor of $M$, the stretched spectrum is then band-limited to $|\omega| < \pi$. Since its shifted copies are placed at $2\pi$ intervals, there is no overlapping between the original and the aliasing copies. The aliasing copies can be removed by an ideal low-pass filter, leaving only the original copy.
Fig. 6.6 Stretched spectrum of the source signal and all its shifted aliasing copies (top). Due to the stretching factor of M = 8, their period is also stretched from 2π to 8 × 2π. The spectrum of the decimated signal (bottom) has a period of 2π, but is totally different from that of the original signal
The approach above is not the only one for aliasing-free decimation; see [82] for details. However, aliasing-free decimation is not the goal of filter bank design. A certain amount of aliasing is usually allowed in some special ways. As long as the aliasing from all decimators in the filter bank cancels completely at the output of the synthesis bank, the reconstructed signal is still aliasing-free. Even if aliasing cannot be canceled completely, proper filter bank design can still keep it small enough to obtain a reconstruction with tolerable error.
6.3.2 Expansion Effects

To see the consequence of expansion, let us consider the Z-transform of the expanded signal $x_E(n)$:

$$X_E(z) = \sum_{n=-\infty}^{\infty} x_E(n)\,z^{-n}. \qquad (6.26)$$

Due to (6.12), $x_E(n)$ is nonzero only when $n$ is a multiple of $M$: $n = kM$, where $k$ is an integer. Replacing $n$ with $kM$ in the above equation, we have

$$X_E(z) = \sum_{k=-\infty}^{\infty} x_E(kM)\,z^{-kM}. \qquad (6.27)$$

Due to the upper half of (6.12), $x_E(kM) = x(k)$, so we have

$$X_E(z) = \sum_{k=-\infty}^{\infty} x(k)\,z^{-kM} = X\!\left(z^M\right). \qquad (6.28)$$

Its Fourier transform is

$$X_E(e^{j\omega}) = X\!\left(e^{jM\omega}\right), \qquad (6.29)$$

which is an $M$-fold compressed version of $X(e^{j\omega})$. In other words, the effect of sample rate expansion is frequency compression. As an example, the signal in (6.25), whose spectrum is shown at the top of Fig. 6.7, is eightfold expanded to give an output signal whose spectrum is shown at the bottom of Fig. 6.7. Due to the compression of frequency by a factor of 8, seven images are shifted into the $[0, 2\pi]$ region from outside.
Fig. 6.7 The effect of sample rate expansion is frequency compression. The spectrum of the source signal on the top is compressed by a factor of 8 to produce the spectrum of the expanded signal at the bottom. Seven images are shifted into the [0, 2π] region from outside due to this frequency compression
6.3.3 Reconstruction Error

Let us now consider the reconstruction error of the subband system in Fig. 6.5 in the absence of quantization. Due to (6.23), each subband signal after decimation is

$$Y_k(z) = \frac{1}{M}\sum_{m=0}^{M-1} H_k\!\left(z^{1/M}W_M^m\right) X\!\left(z^{1/M}W_M^m\right). \qquad (6.30)$$

Due to (6.28), the reconstructed signal is

$$\begin{aligned}
\hat{X}(z) &= \sum_{k=0}^{M-1} F_k(z)\,Y_k\!\left(z^M\right) \\
&= \frac{1}{M}\sum_{m=0}^{M-1} X\!\left(zW_M^m\right) \sum_{k=0}^{M-1} F_k(z)\,H_k\!\left(zW_M^m\right) \\
&= \frac{1}{M}\,X(z) \sum_{k=0}^{M-1} F_k(z)\,H_k(z) \\
&\quad + \frac{1}{M}\sum_{m=1}^{M-1} X\!\left(zW_M^m\right) \sum_{k=0}^{M-1} F_k(z)\,H_k\!\left(zW_M^m\right). \qquad (6.31)
\end{aligned}$$
Defining the overall transfer function as

$$T(z) = \frac{1}{M}\sum_{k=0}^{M-1} F_k(z)\,H_k(z) \qquad (6.32)$$

and the aliasing transfer functions as

$$A_m(z) = \frac{1}{M}\sum_{k=0}^{M-1} F_k(z)\,H_k\!\left(zW_M^m\right), \quad m = 1, 2, \ldots, M-1, \qquad (6.33)$$

the reconstructed signal is

$$\hat{X}(z) = T(z)X(z) + \sum_{m=1}^{M-1} X\!\left(zW_M^m\right) A_m(z). \qquad (6.34)$$
Note that $T(z)$ is also the overall transfer function of the filter bank in the absence of both the decimators and expanders. Since $X(zW_M^m)$ is a shifted version of the source signal, the reconstructed signal may be considered a linear combination of the source signal $X(z)$ and its shifted aliasing versions.
To set the reconstruction error (6.13) to zero, the overall transfer function should be set to a delay and scale factor:

$$T(z) = kz^{-d}, \qquad (6.35)$$

and the total aliasing effect to zero:

$$\sum_{m=1}^{M-1} X\!\left(zW_M^m\right) A_m(z) = 0. \qquad (6.36)$$
If a subband system produces no reconstruction error, it is called a perfect reconstruction (PR) system. If there is reconstruction error, but it is limited and approximately zero, it is called a near-perfect reconstruction or nonperfect reconstruction (NPR) system. For subband coding, PR is desirable and NPR is the minimal requirement.
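The two PR conditions can be checked numerically for a concrete filter bank. The sketch below uses the two-band Haar bank (an illustrative choice, not an example from the text) and evaluates $T(z)$ and $A_1(z)$ of (6.32)–(6.33) on the unit circle.

```python
import numpy as np

# Two-band Haar filter bank: analysis filters h0, h1 and synthesis f0, f1.
h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
f = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)
M = 2

def fir_eval(coeffs, z):
    """Evaluate H(z) = sum_n c[n] z^{-n} at the points z."""
    return sum(c * z ** (-n) for n, c in enumerate(coeffs))

w = np.linspace(0, 2 * np.pi, 256, endpoint=False)
z = np.exp(1j * w)
W = np.exp(-2j * np.pi / M)

# Overall transfer function (6.32) and aliasing transfer function (6.33).
T = sum(fir_eval(f[k], z) * fir_eval(h[k], z) for k in range(M)) / M
A1 = sum(fir_eval(f[k], z) * fir_eval(h[k], z * W) for k in range(M)) / M

print(np.allclose(T, z ** -1))     # True: T(z) = z^{-1}, a pure delay (6.35)
print(np.abs(A1).max() < 1e-12)    # True: total aliasing vanishes (6.36)
```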
6.4 Polyphase Implementation

While the total sample rate of all subbands within a maximally decimated filter bank is made equal to that of the source signal through decimation and expansion, wasted computation is still an issue. To illustrate this issue, let us look at the output of the decimators in Fig. 6.5. Each decimator keeps only one subband sample and discards the other $M-1$, so the subband filtering that generates the discarded $M-1$ subband samples is a total waste of computational resources and should be eliminated. This is achieved using the polyphase representation of subband filters and the noble identities.
6.4.1 Polyphase Representation

Polyphase representation is an important advancement in the theory of filter banks that greatly simplifies the implementation structures of both analysis and synthesis banks [3, 94].
6.4.1.1 Type-I Polyphase Representation

For any given integer $M$, an FIR or IIR filter

$$H(z) = \sum_{n=-\infty}^{\infty} h(n)\,z^{-n} \qquad (6.37)$$
can always be written as

$$\begin{aligned}
H(z) = &\sum_{n=-\infty}^{\infty} h(nM)\,z^{-nM} \\
&+ z^{-1}\sum_{n=-\infty}^{\infty} h(nM+1)\,z^{-nM} \\
&\;\;\vdots \\
&+ z^{-(M-1)}\sum_{n=-\infty}^{\infty} h(nM+M-1)\,z^{-nM}. \qquad (6.38)
\end{aligned}$$

Denoting

$$p_k(n) = h(nM+k), \quad 0 \le k < M, \qquad (6.39)$$

which is called a type-I polyphase component of $h(n)$, and its Z-transform

$$P_k(z) = \sum_{n=-\infty}^{\infty} p_k(n)\,z^{-n}, \quad 0 \le k < M, \qquad (6.40)$$

the (6.38) may be written as

$$H(z) = \sum_{k=0}^{M-1} z^{-k}\,P_k\!\left(z^M\right). \qquad (6.41)$$
The equation above is called the type-I polyphase representation of $H(z)$ with respect to $M$, and its implementation is shown in Fig. 6.8.

Fig. 6.8 Type-I polyphase implementation of an arbitrary filter

The type-I polyphase representation in (6.41) may further be written as

$$H(z) = \mathbf{p}^T\!\left(z^M\right)\mathbf{d}(z), \qquad (6.42)$$

where

$$\mathbf{d}(z) = \left[1, z^{-1}, \ldots, z^{-(M-1)}\right]^T \qquad (6.43)$$

is the delay chain and

$$\mathbf{p}(z) = \left[P_0(z), P_1(z), \ldots, P_{M-1}(z)\right]^T \qquad (6.44)$$
is the type-I polyphase (component) vector.

The type-I polyphase representation of an arbitrary filter may be used to implement the analysis filter bank in Fig. 6.5. Using (6.42), the $k$th subband filter $H_k(z)$ may be written as

$$H_k(z) = \mathbf{h}_k^T\!\left(z^M\right)\mathbf{d}(z), \quad 0 \le k < M, \qquad (6.45)$$

where

$$\mathbf{h}_k(z) = \left[H_{k,0}(z), H_{k,1}(z), \ldots, H_{k,M-1}(z)\right]^T \qquad (6.46)$$

holds the type-I polyphase components of $H_k(z)$. The analysis bank may then be represented by

$$\mathbf{h}(z) = \begin{bmatrix} H_0(z) \\ H_1(z) \\ \vdots \\ H_{M-1}(z) \end{bmatrix} = \begin{bmatrix} \mathbf{h}_0^T\!\left(z^M\right)\mathbf{d}(z) \\ \mathbf{h}_1^T\!\left(z^M\right)\mathbf{d}(z) \\ \vdots \\ \mathbf{h}_{M-1}^T\!\left(z^M\right)\mathbf{d}(z) \end{bmatrix} = \mathbf{H}\!\left(z^M\right)\mathbf{d}(z), \qquad (6.47)$$

where

$$\mathbf{H}(z) = \begin{bmatrix} \mathbf{h}_0^T(z) \\ \mathbf{h}_1^T(z) \\ \vdots \\ \mathbf{h}_{M-1}^T(z) \end{bmatrix} \qquad (6.48)$$

is called a polyphase (component) matrix. This leads to the type-I polyphase implementation in Fig. 6.9 for the analysis filter bank in Fig. 6.5.
Fig. 6.9 Type-I polyphase implementation of a maximally decimated analysis filter bank
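The polyphase machinery is easy to state in array terms. Below is a minimal sketch of the type-I decomposition (6.39) for a causal FIR filter, under the assumption that the filter is zero-padded to a multiple of M; all names are illustrative.

```python
import numpy as np

def polyphase_type1(h, M):
    """Type-I polyphase components (6.39): p_k(n) = h(nM + k), 0 <= k < M."""
    h = np.asarray(h, dtype=float)
    pad = (-len(h)) % M                  # zero-pad to a multiple of M
    h = np.concatenate([h, np.zeros(pad)])
    return h.reshape(-1, M).T            # row k holds p_k(0), p_k(1), ...

h = np.arange(1, 13)                     # a 12-tap FIR filter h(0..11)
P = polyphase_type1(h, M=4)
print(P)
# Row k is [h(k), h(M+k), h(2M+k)] -- e.g. row 0 is [1, 5, 9]
```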
6.4.1.2 Type-II Polyphase Representation

The type-II polyphase representation of a general filter $H(z)$ with respect to $M$ may be obtained from (6.41) through the variable change $k = M-1-n$:

$$H(z) = \sum_{n=0}^{M-1} z^{-(M-1-n)}\,P_{M-1-n}\!\left(z^M\right) \qquad (6.49)$$

$$\phantom{H(z)} = \sum_{n=0}^{M-1} z^{-(M-1-n)}\,Q_n\!\left(z^M\right), \qquad (6.50)$$

where

$$Q_n(z) = P_{M-1-n}(z) \qquad (6.51)$$

is a permutation of $P_n(z)$. Figure 6.10 shows the type-II polyphase implementation of an arbitrary filter. The type-II polyphase representation in (6.50) may be rewritten as

$$H(z) = z^{-(M-1)}\sum_{n=0}^{M-1} z^{n}\,Q_n\!\left(z^M\right), \qquad (6.52)$$

so it can be expressed in vector form as

$$H(z) = z^{-(M-1)}\,\mathbf{d}^T\!\left(z^{-1}\right)\mathbf{q}\!\left(z^M\right), \qquad (6.53)$$

where

$$\mathbf{q}(z) = \left[Q_0(z), Q_1(z), \ldots, Q_{M-1}(z)\right]^T \qquad (6.54)$$

represents the type-II polyphase components.

Similar to the type-I polyphase representation of the analysis filter bank, a synthesis filter bank may be implemented using the type-II polyphase representation. The type-II polyphase components of the $k$th synthesis subband filter may be denoted as

$$\mathbf{f}_k(z) = \left[f_{k,0}(z), f_{k,1}(z), \ldots, f_{k,M-1}(z)\right]^T; \qquad (6.55)$$

Fig. 6.10 Type-II polyphase implementation of an arbitrary filter
then the $k$th synthesis subband filter may be written as

$$F_k(z) = z^{-(M-1)}\,\mathbf{d}^T\!\left(z^{-1}\right)\mathbf{f}_k\!\left(z^M\right), \qquad (6.56)$$

so the synthesis filter bank may be written as

$$\begin{aligned}
\mathbf{f}^T(z) &= \left[F_0(z), F_1(z), \ldots, F_{M-1}(z)\right] \\
&= z^{-(M-1)}\,\mathbf{d}^T\!\left(z^{-1}\right)\left[\mathbf{f}_0\!\left(z^M\right), \mathbf{f}_1\!\left(z^M\right), \ldots, \mathbf{f}_{M-1}\!\left(z^M\right)\right] \\
&= z^{-(M-1)}\,\mathbf{d}^T\!\left(z^{-1}\right)\mathbf{F}\!\left(z^M\right), \qquad (6.57)
\end{aligned}$$

where

$$\mathbf{F}(z) = \left[\mathbf{f}_0(z), \mathbf{f}_1(z), \ldots, \mathbf{f}_{M-1}(z)\right]. \qquad (6.58)$$

This leads to the type-II polyphase implementation in Fig. 6.11.
6.4.2 Noble Identities

Now that we have polyphase implementations of both analysis and synthesis filter banks, we can move on to get rid of the $M-1$ wasteful filtering operations in both filter banks. We achieve this using the noble identities.
6.4.2.1 Decimation

The noble identity for decimation is shown in Fig. 6.12 and is proved below using (6.23):
Fig. 6.11 Type-II polyphase implementation of a maximally decimated synthesis filter bank
Fig. 6.12 Noble identity for decimation
$$\begin{aligned}
Y_1(z) &= \frac{1}{M}\sum_{k=0}^{M-1} U\!\left(z^{1/M}W_M^k\right) \\
&= \frac{1}{M}\sum_{k=0}^{M-1} H\!\left(\left(z^{1/M}W_M^k\right)^M\right) X\!\left(z^{1/M}W_M^k\right) \\
&= \left[\frac{1}{M}\sum_{k=0}^{M-1} X\!\left(z^{1/M}W_M^k\right)\right] H(z) \\
&= Y_2(z). \qquad (6.59)
\end{aligned}$$
Applying the noble identity above to the analysis bank in Fig. 6.9, we can move the decimators from the right side of the analysis polyphase matrix to its left side to arrive at the analysis filter bank in Fig. 6.13. With this new structure, the delay chain presents $M$ source samples in correct succession simultaneously to the decimators. The decimators ensure that the subband filters operate only once for each block of $M$ input samples, generating only one block of $M$ subband samples. The sample rate is thus reduced by a factor of $M$, but the data now move in parallel. The combination of the delay chain and the decimators essentially accomplishes a series-to-parallel conversion.
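The decimation noble identity (6.59) is easy to confirm numerically. The sketch below compares filtering with $H(z^M)$ followed by decimation against decimation followed by $H(z)$, for an arbitrary FIR filter (all names illustrative).

```python
import numpy as np

def decimate(x, M):
    return x[::M]

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
h = rng.standard_normal(5)      # arbitrary FIR filter H(z)
M = 4

# Left side of the identity: filter with H(z^M), then decimate by M.
hM = np.zeros(len(h) * M - (M - 1))
hM[::M] = h                     # impulse response of H(z^M)
left = decimate(np.convolve(x, hM), M)

# Right side: decimate by M, then filter with H(z).
right = np.convolve(decimate(x, M), h)

n = min(len(left), len(right))
print(np.allclose(left[:n], right[:n]))   # True
```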
6.4.2.2 Interpolation

The noble identity for interpolation is shown in Fig. 6.14 and is easily proved below using (6.28):

$$Y_1(z) = U\!\left(z^M\right) = X\!\left(z^M\right) H\!\left(z^M\right) = Y_2(z). \qquad (6.60)$$

Fig. 6.13 Efficient implementation of a maximally decimated analysis filter bank

Fig. 6.14 Noble identity for interpolation
Fig. 6.15 Efficient implementation of a maximally decimated synthesis filter bank
For the synthesis bank in Fig. 6.11, the expanders on the left side of the synthesis polyphase matrix may be moved to its right side to arrive at the filter bank in Fig. 6.15, due to the noble identity just proved. With this structure, the synthesis subband filters operate once for each block of subband samples, whose sample rate was reduced to $1/M$ of the source sample rate by the analysis bank. The expanders increase the sample rate by inserting $M-1$ zeros after each sample, making the sample rate the same as that of the source signal. The delay chain then delays the outputs in succession to align and interlace the $M$ nonzero subband samples in the time domain so as to form an output stream with the same sample rate as the source. The combination of the expanders and the delay chain essentially accomplishes a parallel-to-series conversion.
6.4.3 Efficient Subband Coder

Replacing the analysis and synthesis filter banks in the subband coder in Fig. 6.5 with the efficient architectures in Figs. 6.13 and 6.15, respectively, we arrive at the efficient architecture for subband coding shown in Fig. 6.16.
6.4.4 Transform Coder

Compared with the subband coder structure shown in Fig. 6.16, the transform coder in Fig. 5.1 is obviously a special case with

$$\mathbf{H}(z) = \mathbf{T} \quad \text{and} \quad \mathbf{F}(z) = \mathbf{T}^T. \qquad (6.61)$$

In other words, the transform matrix is a polyphase matrix of order zero. The delay chain and the decimators in the analysis bank simply serve as a series-to-parallel converter that feeds the source samples to the transform matrix in blocks of $M$ samples. The expanders and the delay chain in the synthesis bank serve as a
parallel-to-series converter that interlaces the subband samples output from the transform matrix to form an output sequence. The orthogonality condition (5.9) ensures that the filter bank satisfies the PR condition. For this reason, the transform coefficients can be referred to as subband samples.

Fig. 6.16 An efficient polyphase structure for subband coding
6.5 Optimal Bit Allocation and Coding Gain

Section 6.4.4 has shown that a transform coder is a special subband coder whose polyphase matrix is of order zero. A subband coder, on the other hand, can have a polyphase matrix of higher order. Other than this, there is no difference between the two, so it can be expected that the optimal bit allocation strategy and the method for calculating the optimal coding gain for subband coding are similar to those for transform coding. This is shown in this section with the ideal subband coder.
6.5.1 Ideal Subband Coder

An ideal subband coder uses the ideal bandpass filters of Fig. 6.4 as both the analysis and synthesis subband filters. Since the bandwidth of each filter is limited to $2\pi/M$, there is no overlapping between these subband filters. This offers optimal band separation in the sense that no energy in one subband leaks into another, achieving optimal energy compaction.
Since shifting the frequency of any of these ideal bandpass filters by $W_M^m$ creates a copy that does not overlap with the original one:

$$H_k(z)\,H_k\!\left(zW_M^m\right) = F_k(z)\,H_k\!\left(zW_M^m\right) = 0, \quad m = 1, 2, \ldots, M-1, \qquad (6.62)$$

the aliasing transfer functions (6.33) are zero:

$$A_m(z) = \frac{1}{M}\sum_{k=0}^{M-1} F_k(z)\,H_k\!\left(zW_M^m\right) = 0, \quad m = 1, 2, \ldots, M-1. \qquad (6.63)$$

Therefore, condition (6.36) for zero total aliasing is satisfied. On the other side, each of the bandpass filters has a uniform transfer function

$$H_k(z) = F_k(z) = \begin{cases} \sqrt{M}, & \text{passband;} \\ 0, & \text{stopband,} \end{cases} \qquad (6.64)$$

so the overall transfer function defined in (6.32) is

$$T(z) = 1, \qquad (6.65)$$

which obviously satisfies (6.35). Therefore, the ideal subband system satisfies both conditions for perfect reconstruction and is thus a PR system.
6.5.2 Optimal Bit Allocation and Coding Gain

Let us consider the synthesis bank in Fig. 6.5 and assume that the quantization noise $q_k(n)$ from the $k$th quantizer is zero-mean wide-sense stationary with a variance of $\sigma_{q_k}^2$ (fine quantization). After it passes through the expander, it is no longer wide-sense stationary because each $q_k(n)$ is periodically interlaced with $M-1$ zeros. However, after it passes through the ideal passband filter $F_k(z)$, it becomes wide-sense stationary again with a variance of $\sigma_{q_k}^2/M$ [82]. Therefore, the total MSQE of $\hat{x}(n)$ at the output of the synthesis filter bank is

$$\sigma_q^2(\hat{x}) = \frac{1}{M}\sum_{k=0}^{M-1} \sigma_{q_k}^2. \qquad (6.66)$$

Let us now consider the analysis bank in Fig. 6.5. Since the decimator retains one sample out of every $M$, its output has the same variance as its input, which, for the $k$th subband, is given by

$$\sigma_{y_k}^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|H_k(e^{j\omega})\right|^2 S_{xx}(e^{j\omega})\,d\omega. \qquad (6.67)$$
Using (6.64) and denoting the passband as $\Omega_k$, we can write the equation above as

$$\sigma_{y_k}^2 = \frac{M}{2\pi}\int_{\Omega_k} S_{xx}(e^{j\omega})\,d\omega. \qquad (6.68)$$

Adding both sides of the equation above over all subbands, we obtain

$$\sum_{k=0}^{M-1}\sigma_{y_k}^2 = \frac{M}{2\pi}\sum_{k=0}^{M-1}\int_{\Omega_k} S_{xx}(e^{j\omega})\,d\omega = \frac{M}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\,d\omega = M\sigma_x^2, \qquad (6.69)$$

which leads to

$$\sigma_x^2 = \frac{1}{M}\sum_{k=0}^{M-1}\sigma_{y_k}^2. \qquad (6.70)$$

Since (6.66) and (6.70) are the same as (5.22) and (5.37), respectively, all the derivations in Sect. 5.2 related to coding gain and optimal bit allocation apply to the ideal subband coder as well. In particular, we have the following coding gain:

$$G_{SBC} = \frac{\sigma_x^2}{\left(\prod_{k=0}^{M-1}\sigma_{y_k}^2\right)^{1/M}} = \frac{\frac{1}{M}\sum_{k=0}^{M-1}\sigma_{y_k}^2}{\left(\prod_{k=0}^{M-1}\sigma_{y_k}^2\right)^{1/M}} \qquad (6.71)$$

and the optimal bit allocation strategy

$$r_k = R + 0.5\log_2\frac{\sigma_{y_k}^2}{\left(\prod_{j=0}^{M-1}\sigma_{y_j}^2\right)^{1/M}}, \quad k = 0, 1, \ldots, M-1. \qquad (6.72)$$
6.5.3 Asymptotic Coding Gain

When the number of subbands $M$ is sufficiently large, each subband becomes sufficiently narrow, so the variance in (6.68) may be approximated by

$$\sigma_{y_k}^2 \approx \frac{M}{2\pi}\,|\Omega_k|\,S_{xx}(e^{j\omega_k}), \qquad (6.73)$$

where $|\Omega_k|$ denotes the width of $\Omega_k$ and $\omega_k$ is a frequency within it. Since

$$|\Omega_k| = \frac{2\pi}{M}, \qquad (6.74)$$

the (6.73) becomes

$$\sigma_{y_k}^2 \approx S_{xx}(e^{j\omega_k}). \qquad (6.75)$$
The geometric mean used by the coding gain formula (6.71) may be rewritten as

$$\left(\prod_{k=0}^{M-1}\sigma_{y_k}^2\right)^{1/M} = \exp\left[\ln\left(\prod_{k=0}^{M-1}\sigma_{y_k}^2\right)^{1/M}\right] = \exp\left(\frac{1}{M}\sum_{k=0}^{M-1}\ln\sigma_{y_k}^2\right). \qquad (6.76)$$

Dropping (6.75) into the equation above, we have

$$\begin{aligned}
\left(\prod_{k=0}^{M-1}\sigma_{y_k}^2\right)^{1/M} &\approx \exp\left[\frac{1}{M}\sum_{k=0}^{M-1}\ln S_{xx}(e^{j\omega_k})\right] \\
&= \exp\left[\frac{1}{2\pi}\sum_{k=0}^{M-1}\ln S_{xx}(e^{j\omega_k})\,\frac{2\pi}{M}\right] \\
&= \exp\left[\frac{1}{2\pi}\sum_{k=0}^{M-1}\ln S_{xx}(e^{j\omega_k})\,|\Omega_k|\right], \qquad (6.77)
\end{aligned}$$

where (6.74) is used to obtain the last equality. As $M \to \infty$, the equation above becomes

$$\left(\prod_{k=0}^{M-1}\sigma_{y_k}^2\right)^{1/M} \to \exp\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln S_{xx}(e^{j\omega})\,d\omega\right]. \qquad (6.78)$$

Dropping it back into the coding gain (6.71), we obtain

$$\lim_{M\to\infty} G_{SBC} = \frac{\sigma_x^2}{\exp\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln S_{xx}(e^{j\omega})\,d\omega\right]} \qquad (6.79)$$

$$= \frac{\frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\,d\omega}{\exp\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln S_{xx}(e^{j\omega})\,d\omega\right]} \qquad (6.80)$$

$$= \frac{1}{\gamma_x^2}, \qquad (6.81)$$

where $\gamma_x^2$ is the spectral flatness measure defined in (5.65). Therefore, the ideal subband coder approaches the same asymptotic optimal coding gain as the KLT (see (5.64)).
Chapter 7
Cosine-Modulated Filter Banks
Between the KLT transform coder and the ideal subband coder, there are many subband coders which offer great energy compaction capability at a reasonable implementation cost. Prominent among them are cosine-modulated filter banks (CMFB), whose subband filters are derived from a prototype filter through cosine modulation.

The first advantage of CMFBs is that the implementation cost of both the analysis and synthesis banks is that of the prototype filter plus the overhead associated with cosine modulation. For a CMFB with $M$ bands and $N$ taps per subband filter, the number of operations for the prototype filter is on the order of $N$ and that for the cosine modulation, when implemented using a fast algorithm, is on the order of $M\log_2 M$, so the total is merely on the order of $N + M\log_2 M$. For comparison, the number of operations for a general filter bank is on the order of $MN$.

The second advantage is associated with the design of the subband filters. Instead of designing all subband filters in a filter bank independently, which entails optimizing a total of $MN$ coefficients, with a CMFB we only need to optimize the prototype filter, which has no more than $N$ coefficients.

Early CMFBs were near-perfect reconstruction systems [6, 8, 50, 64, 81] in which only "adjacent-subband aliasing" is canceled, so the reconstructed signal at the output of the synthesis filter bank is only approximately equal to a delayed and scaled version of the signal input to the analysis filter bank. The same filter bank structure was later found to be capable of delivering perfect reconstruction if two additional constraints are imposed on the prototype filter [39, 40, 46–48, 77–79].
7.1 Cosine Modulation

The idea of a modulated filter bank was exemplified by the DFT bank discussed in Sect. 6.1.2, whose analysis filters are all derived from the prototype filter in (6.7) using DFT modulation (6.8). This leads to the implementation structure shown in Fig. 6.2, whose implementation cost is the delay line plus the DFT, which can be implemented using an FFT.
While the subband filters of the DFT bank are limited to just $M$ taps (see (6.7)), it illustrates the basic idea of modulated filter banks and can easily be extended to accommodate subband filters with more than $M$ taps through polyphase representation.

Even if the prototype filter of an extended DFT filter bank is real-valued, the modulated subband filters are generally complex-valued because the DFT modulation is complex-valued. Consequently, the subband samples of an extended DFT filter bank are complex-valued and are thus not amenable to subband coding. To obtain a real-valued modulated filter bank, the idea that extends the DFT to the DCT is followed: a 2M-point DFT is used to modulate a real-valued prototype filter to produce $2M$ complex subband filters, and then subband filters symmetric with respect to the zero frequency are combined to obtain real-valued ones. This leads to the CMFB.

There is a small practical issue, as shown in Fig. 6.3, where the magnitude response of the prototype filter in the DFT bank is split at the zero frequency, leaving half of its bandwidth in the positive frequency range and the other half in the negative frequency range. Due to the periodicity of the DFT spectrum, it appears to have two subbands in the positive frequency range, one starting at the zero frequency and the other ending at $2\pi$. This is very different from the other modulated subbands, whose bands are not split. This issue may easily be addressed by shifting the subband filters by $\pi/2M$. Afterwards, the subband filters whose center frequencies are symmetric with respect to the zero frequency are combined to construct real subband filters.
7.1.1 Extended DFT Bank

The prototype filter $H(z)$ given in (6.7) for the DFT bank in Sect. 6.1.2 is of length $M$. This can be extended by obtaining the type-I polyphase representation of the modulated subband filter (6.8). In particular, using the type-I polyphase representation (6.41), the subband filters of the DFT-modulated filter bank (6.8) may be written as

$$H_k(z) = H\!\left(zW_M^k\right) = \sum_{m=0}^{M-1}\left(zW_M^k\right)^{-m} P_m\!\left(\left(zW_M^k\right)^M\right) = \sum_{m=0}^{M-1} W_M^{-km}\,z^{-m}\,P_m\!\left(z^M\right), \quad k = 0, 1, \ldots, M-1, \qquad (7.1)$$

where $P_m(z^M)$ is the $m$th type-I polyphase component of $H(z)$ with respect to $M$. Note that a subscript $M$ is attached to $W$ to emphasize that it is for an $M$-fold DFT:

$$W_M = e^{-j2\pi/M}. \qquad (7.2)$$

Equation (7.1) leads to the implementation structure shown in Fig. 7.1.
Fig. 7.1 Extension of the DFT analysis filter bank to accommodate subband filters with more than M taps. {P_m(z^M)}_{m=0}^{M−1} are the type-I polyphase components of the prototype filter H(z) with respect to M. The subscript of the DFT matrix W_M is included to emphasize that each of its elements is a power of W_M and that it is an M × M matrix
If the prototype filter used in (7.1) is reduced to (6.7):

$$P_m\!\left(z^M\right) = 1, \quad m = 0, 1, \ldots, M-1, \qquad (7.3)$$

then the filter bank in Fig. 7.1 degenerates to the DFT bank in Fig. 6.2. There is obviously no explicit restriction on the length of the prototype filter in (7.1), so a generic $N$-tap FIR filter can be assumed:

$$H(z) = \sum_{n=0}^{N-1} h(n)\,z^{-n}. \qquad (7.4)$$
This extension enables longer prototype filters, which can offer much better energy compaction than the roughly 13 dB of stopband attenuation achieved by the DFT bank in Fig. 6.3.
7.1.2 2M-DFT Bank

Even if the coefficients of the prototype filter $H(z)$ in Fig. 7.1 are real-valued, the modulated filters $H_k(z)$ generally do not have real-valued coefficients because the DFT modulation is complex. Consequently, the subband samples output from these filters are complex. Since a complex sample consists of a real and an imaginary part, there are now $2M$ subband values to be quantized and coded for each block of $M$ real-valued input samples, a doubling of the data. To avoid this problem, a modulation scheme that leads to real-valued subband samples is called for. This can be achieved using an approach similar to the derivation of the DCT from the DFT: a real-valued prototype filter is modulated by a 2M-point DFT to produce $2M$ complex subband filters, and then each pair of subband filters symmetric with respect to the zero frequency is combined to form a real-valued subband filter.
Fig. 7.2 2M-DFT analysis filter bank. {P_m(z^{2M})}_{m=0}^{2M−1} are the type-I polyphase components of the prototype filter H(z) with respect to 2M and W_{2M} is the 2M × 2M DFT matrix
To arrive at the 2M-point DFT modulation, the polyphase representation (7.1) becomes

$$H_k(z) = H\!\left(zW_{2M}^k\right) = \sum_{m=0}^{2M-1} W_{2M}^{-km}\,z^{-m}\,P_m\!\left(z^{2M}\right), \quad k = 0, 1, \ldots, 2M-1, \qquad (7.5)$$

where

$$W_{2M} = e^{-j2\pi/2M} = e^{-j\pi/M} \qquad (7.6)$$

and $P_m(z^{2M})$ is the $m$th type-I polyphase component of $H(z)$ with respect to $2M$. The implementation structure for such a filter bank is shown in Fig. 7.2.

The magnitude responses of the above 2M-DFT bank are shown in Fig. 7.3 for $M = 8$. The prototype filter is again given by (6.7), now with $2M = 16$ taps, and its magnitude response is shown at the top of the figure. There are $2M = 16$ subband filters, whose magnitude responses are shown at the bottom of the figure.

Since a filter with only real-valued coefficients has a frequency response that is conjugate-symmetric with respect to the zero frequency [67], each pair of subband filters in the above 2M-DFT bank satisfying this condition can be combined to form a subband filter with real-valued coefficients. Since the frequency responses of both $H_0(z)$, which is the prototype filter, and $H_8(z)$ are themselves conjugate-symmetric with respect to the zero frequency, their coefficients are already real-valued and they cannot be combined with any other subband filters. The remaining subband filters, $H_1(z)$ through $H_7(z)$ and $H_9(z)$ through $H_{15}(z)$, can be combined to form a total of $(2M-2)/2 = 7$ real-valued subband filters. These combined subband filters, plus $H_0(z)$ and $H_8(z)$, give us a total of $7 + 2 = 9$ subband filters. Since the frequency response of the prototype filter $H_0(z)$ is split at the zero frequency and that of $H_8(z)$ is split at $\pi$ and $-\pi$, their bandwidth is only half that of the other combined subband filters, as shown at the bottom of Fig. 7.3. This results in a situation where two subbands have half bandwidth and the remaining subbands have full bandwidth. While this type of filter bank with unequal subband bandwidths can be made to work (see [77], for example), it is rather awkward for practical subband coding.
Fig. 7.3 Magnitude responses of the prototype filter of a 2M-DFT filter bank (top) and of all its subband filters (bottom)
7.1.3 Frequency-Shifted DFT Bank

The problem of unequal subband bandwidths can be addressed by shifting the subband filters to the right by the additional amount of $\pi/2M$, so that (7.5) becomes

$$H_k(z) = H\!\left(zW_{2M}^{k+0.5}\right) = \sum_{m=0}^{2M-1}\left(zW_{2M}^{k+0.5}\right)^{-m} P_m\!\left(\left(zW_{2M}^{k+0.5}\right)^{2M}\right) = \sum_{m=0}^{2M-1} W_{2M}^{-0.5m}\,W_{2M}^{-km}\,z^{-m}\,P_m\!\left(-z^{2M}\right), \quad k = 0, 1, \ldots, 2M-1, \qquad (7.7)$$

where $W_{2M}^{2Mk} = 1$ and $W_{2M}^{M} = -1$ were used. Equation (7.7) can be implemented using the structure shown in Fig. 7.4.

Figure 7.5 shows the magnitude responses of the filter bank above using the prototype filter given in (6.7) with $2M = 16$ taps. They are the same magnitude responses given in Fig. 7.3, except for a frequency shift of $\pi/2M$. Now all subbands have the same bandwidth.
Fig. 7.4 2M-DFT analysis filter bank with an additional frequency shift of π/2M to the right. {P_m(z^{2M})}_{m=0}^{2M−1} are the type-I polyphase components of H(z) with respect to 2M and W_{2M} is the 2M × 2M DFT matrix
Fig. 7.5 Magnitude response of a prototype filter shifted by π/2M to the right (top). Magnitude responses of all subband filters modulated from such a filter using a 2M-point DFT (bottom)
7.1.4 CMFB

From Fig. 7.5, it is obvious that the frequency response of the $k$th subband filter is conjugate-symmetric to that of the $(2M-1-k)$th subband filter with respect to the zero frequency (they are images of each other), so they are candidate pairs for combination into a real filter. Let us drop (7.4) into the first equation of (7.7) to obtain
$$H_k(z) = \sum_{n=0}^{N-1} h(n)\left(zW_{2M}^{k+0.5}\right)^{-n} = \sum_{n=0}^{N-1} h(n)\,W_{2M}^{-(k+0.5)n}\,z^{-n} \qquad (7.8)$$

and

$$H_{2M-1-k}(z) = \sum_{n=0}^{N-1} h(n)\left(zW_{2M}^{2M-1-k+0.5}\right)^{-n} = \sum_{n=0}^{N-1} h(n)\left(zW_{2M}^{-k-0.5}\right)^{-n} = \sum_{n=0}^{N-1} h(n)\,W_{2M}^{(k+0.5)n}\,z^{-n}. \qquad (7.9)$$

Comparing the two equations above, we can see that the coefficients of $H_k(z)$ and $H_{2M-1-k}(z)$ are conjugates of each other, so the combined filter will have real coefficients.
Comparing the two equations above we can see that the coefficients of Hk .z/ and H2M 1k .z/ are obviously conjugates of each other, so the combined filter will have real coefficients. When the pair above are actually combined, they are weighted by a unitmagnitude constant c: Ak .z/ D ck Hk .z/ C ck H2M 1k .z/
(7.10)
to aid alias cancellation and elimination of phase distortion. In addition, linear phase condition is imposed on the prototype filter: h.n/ D h.N 1 n/
(7.11)
and mirror image condition on the synthesis filters: sk .n/ D ak .N 1 n/;
(7.12)
where ak .n/ and sk .n/ are the impulse responses of the kth analysis and synthesis subband filters, respectively. After much derivation [93], we arrive at the following cosine modulated analysis filter:
N 1 ak .n/ D 2h.n/ cos .k C 0:5/ n C k ; M 2 for k D 0; 1; : : : ; M 1; where the phase k D .1/k
: 4
(7.13)
(7.14)
The cosine-modulated synthesis filter can be obtained from the analysis filter (7.13) using the mirror image relation (7.12):

$$s_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n - \frac{N-1}{2}\right) - \theta_k\right], \quad k = 0, 1, \ldots, M-1. \qquad (7.15)$$
The system of analysis and synthesis banks above completely eliminates phase distortion, so the overall transfer function $T(z)$ defined in (6.32) has linear phase, but amplitude distortion remains. Therefore, this CMFB is a nonperfect reconstruction system and is sometimes called a pseudo-QMF (quadrature mirror filter) bank. Part of the amplitude distortion comes from incomplete aliasing cancellation: aliasing from only adjacent subbands is canceled, not from all subbands. Note that, even though the linear phase condition (7.11) is imposed on the prototype filter, the analysis and synthesis filters generally do not have linear phase.
7.2 Design of NPR Filter Banks

The CMFB given in (7.13) and (7.15) cannot deliver perfect reconstruction due to incomplete aliasing cancellation and the existence of amplitude distortion. But the overall reconstruction error can be reduced to an acceptable level if the amplitude distortion is properly controlled. Amplitude distortion arises if the magnitude of the overall transfer function $|T(z)|$ is not exactly flat, so the problem may be posed as designing the prototype filter in such a way that $|T(z)|$ is flat or close to flat. It turns out that this can be ensured if the following function is sufficiently flat [93]:

$$\left|H(e^{j\omega})\right|^2 + \left|H\!\left(e^{j(\omega-\pi/M)}\right)\right|^2 \approx 1, \quad \omega \in [0, \pi/M]. \qquad (7.16)$$

This condition can be enforced by minimizing the following cost function:

$$\beta(H(z)) = \int_0^{\pi/M}\left(\left|H(e^{j\omega})\right|^2 + \left|H\!\left(e^{j(\omega-\pi/M)}\right)\right|^2 - 1\right)^2 d\omega. \qquad (7.17)$$
In addition to the concern above over amplitude distortion, energy compaction is also of prominent importance for signal coding and other applications. To ensure this, all subband filters should have good stopband attenuation. Since all subband filters are shifted copies of the prototype filter, they all have the same amplitude shape as the prototype filter. Therefore, the optimization of stopband attenuation for all subband filters reduces to that of the prototype filter. The nominal bandwidth of the prototype filter on the positive frequency axis is $\pi/2M$, so stopband attenuation can be optimized by minimizing the following cost function:
7.3 Perfect Reconstruction
123
Z .H.z// D
2M
C
jH.ej! /j2 d!;
(7.18)
where controls the transition bandwidth and should be adjusted for a particular application. Now both amplitude distortion and stopband attenuation can be optimized by minh.n/
ı.H.z// D ˛ˇ.H.z// C .1 ˛/.H.z//;
Subject to (7.11);
(7.19)
where ˛ controls the trade-off between amplitude distortion and stopband attenuation. See [76] for standard optimization procedures that can be applied.
7.3 Perfect Reconstruction The CMFB in (7.13) and (7.15) becomes a perfect reconstruction system when aliasing is completely canceled and amplitude distortion eliminated. Toward this end, we first impose the following length constraint on the prototype filter N D 2mM;
(7.20)
where m is a positive integer. Then the CMFB is a perfect reconstruction system if and only if the polyphase components of the prototype filter satisfy the following pairwise power complementary conditions [40, 93]: PQk .z/Pk .z/ C PQM Ck .z/PM Ck .z/ D ˛; k D 0; 1; : : : ; M 1;
(7.21)
where ˛ is a positive number. The notation “tilde” applied to a rational Z-transform function H.z/ means taking complex conjugate of all its coefficients and replacing z with z1 . For example, if
then
H.z/ D a C bz1 C cz2 ;
(7.22)
HQ .z/ D a C b z C cz2 :
(7.23)
It is intended to effect complex conjugation applicable to a frequency response function: HQ .ej! / D H .ej! /: (7.24) When applied to a matrix of Z-transform functions H.z/ D ŒHi;j .z/, a transpose operation is also implied: Q H.z/ D ŒHQ i;j .z/T : (7.25)
124
7 Cosine-Modulated Filter Banks
7.4 Design of PR Filter Banks The method for designing a PR prototype filter is similar to that for the NPR prototype filter discussed in Sect. 7.2, the difference is that the amplitude distortion is now eliminated by the power complementary conditions (7.21), so the design problem is focused on energy compaction: Z .H.z// D jH.ej! /j2 d!; minh.n/ C (7.26) 2M Subject to (7.11) and (7.21): While the above minimization step may be straight-forward by itself, the difficulty lies in the imposition of the power-complementary constraints (7.21) and the linear phase condition (7.11).
7.4.1 Lattice Structure One approach to impose the power-complementary constraints (7.21) during the minimization process is to implement the power-complementary pairs of polyphase components using a cascade of lattice structures. 7.4.1.1 Paraunitary Systems Toward this end, let us write each power-complementary pair as the following 2 1 transfer matrix or system: Pk .z/ D
Pk .z/ ; k D 0; 1; : : : ; M 1; PM Ck .z/
(7.27)
then the power complementary condition (7.21) may be rewritten as PQ k .z/Pk .z/ D ˛; k D 0; 1; : : : ; M 1;
(7.28)
which means that the 2 1 system Pk .z/ is paraunitary. Therefore, the power complementary condition (7.21) is equivalent to the condition that Pk .z/ is paraunitary. In general terms, an m n rational transfer matrix or system H.z/ is called paraunitary if Q (7.29) H.z/H.z/ D ˛In ; where In is the n n unit matrix. It is obviously necessary that m n. Otherwise, the rank of H.z/ is less than n. If m D n or the transfer matrix is square, the transfer system is further referred to as unitary.
7.4 Design of PR Filter Banks
125
7.4.1.2 Givens Rotation As an example, consider Givens rotation described by the following transfer matrix [20, 22]: cos sin G./ D ; (7.30) sin cos where is a real angel. A flowgraph for this transfer matrix is shown in Fig. 7.6. It can be easily verified that GT ./G./ D
cos sin sin cos
cos sin sin cos
D I2 ;
(7.31)
so it is unitary. A geometric interpretation of Givens rotation is that it rotates an input vector clockwise by . In particular, if an input A D Œr cos ˛; r sin ˛T with an angle of ˛ is rotated clockwise by an angle of , the output vector has an angle of ˛ and is given by
r cos.˛ / r cos ˛ cos C r sin ˛ sin r cos.˛/ D D G./ : (7.32) r sin.˛ / r sin ˛ cos r cos ˛ sin r sin.˛/
7.4.1.3 Delay Matrix Let us consider another 2 2 transfer matrix: 1 0 ; DD 0 z1
(7.33)
which is a simple 2 2 delay system and is shown in Fig. 7.7. It is unitary because
Fig. 7.6 The Givens rotation
Fig. 7.7 A simple 2 2 delay system
Z -1
126
7 Cosine-Modulated Filter Banks
Q D 1 0 DD 0 z
1 0
0 z1
D I2 :
(7.34)
7.4.1.4 Rotation Vector The following simple 2 1 transfer matrix:
cos R./ D sin
(7.35)
is a simple rotation vector. It is paraunitary because
R ./R./ D Œcos T
cos sin sin
D I1 :
(7.36)
Its flowgraph is shown in Fig. 7.8.
7.4.1.5 Cascade of Paraunitary Matrices An important property of paraunitary matrices is that a cascade of paraunitary matrices are also paraunitary. In particular, if H1 .z/ and H2 .z/ are paraunitary, then H.z/ D H1 .z/H2 .z/ is also paraunitary. This is because Q H.z/H.z/ D HQ2 .z/HQ1 .z/H1 .z/H2 .z/ D ˛ 2 I:
(7.37)
The result above obviously can be extended to include a cascading of any number of paraunitary systems. Using this property we can build more complex 2 1 paraunitary systems of arbitrary order by cascading the elementary paraunitary transfer matrices discussed above. One such example is the lattice structure shown in Fig. 7.9. It has N 1 delay subsystem and N 1 Givens rotation. Its transfer function may be written as
Fig. 7.8 A simple 2 1 rotation system
7.4 Design of PR Filter Banks
127
Fig. 7.9 A cascaded 2 1 paraunitary systems
P.z/ D
!
1 Y
G.n /D.z/ R.0 /:
(7.38)
nDN 1 N 1 It has a parameter set of fn gnD0 and N 1 delay units, so represents a 2 1 real-coefficient FIR system of order N 1. It was shown that any 2 1 realcoefficient FIR paraunitary systems of order N 1 may be factorized by such a lattice structure [93].
7.4.1.6 Power-Complementary Condition Let us apply the result above to the 2 1 transfer matrices Pk .z/ defined in (7.27) to enforce the power-complementary condition (7.21). Due to (7.20), both polyphase components Pk .z/ and PM Ck .z/ with respect to 2M have an order of m 1, so Pk .z/ has an order of m 1. Due to (7.38), it can be factorized as follows: Pk .z/ D
1 Y
! k G n D.z/ R 0k ; k D 0; 1; : : : ; M 1:
(7.39)
nDm1
Since this lattice structure is guaranteed to be paraunitary, the parameter set n
nk
om1 nD0
(7.40)
can be arbitrarily adjusted to minimize (7.26) without violating the power complementary condition. Since there are M such systems, the total number of free parameters nk that can be varied for optimization is reduced to mM from 2mM.
7.4.2 Linear Phase When the linear phase condition (7.11) is imposed, the number of free parameters above that are allowed to be varied for optimization will be further reduced.
128
7 Cosine-Modulated Filter Banks
To see this, let us first represent the linear phase condition (7.11) in the Z-transform domain: H.z/ D
N 1 X
h.n/zn
nD0
D z.N 1/
N 1 X
h.N 1 m/zm
mD0
D z.N 1/
N 1 X
h.m/zm
mD0 .N 1/
Dz
HQ .z/:
(7.41)
where a variable change of n D N 1 m is used to arrive at the second equation, the linear phase condition (7.11) is used for the third equation, and the assumption that h.n/ are real-valued is used for the fourth equation. The type-I polyphase representation of the prototype filter with respect to 2M is H.z/ D
2M 1 X
zk Pk z2M :
(7.42)
kD0
Dropping it into (7.41) we obtain HQ .z/
2M 1 X
D
zN 1k Pk z2M
kD0 nD2M 1k
D
2M 1 X
zN Ck2M P2M 1k z2M
kD0 N D2mM
D
2M 1 X
zk z2M.m1/ P2M 1k z2M :
(7.43)
kD0
From (7.42) we also have HQ .z/ D
2M 1 X
zk PQk z2M :
(7.44)
kD0
Comparing the last two equations, we have PQk z2M D z2M.m1/ P2M 1k z2M
(7.45)
7.4 Design of PR Filter Banks
129
or PQk .z/ D zm1 P2M 1k .z/:
(7.46)
Therefore, half of the 2M polyphase components are completely determined by the other half due to the linear phase condition.
7.4.3 Free Optimization Parameters Now there are two sets of constraints on the prototype filter coefficients: the power complementary condition ties polyphase components Pk .z/ and PM Ck .z/ using the lattice structure, while the linear phase condition ties Pk .z/ and P2M 1k .z/ using (7.46). Intuitively, this should leave us with roughly one quarter of polyphase components that can be freely optimized. 7.4.3.1 Even M Let us first consider the case that M is even or M=2 is an integer. Each pair of the following two sets of polyphase components: P0 .z/; P1 .z/; : : : ; PM=21 .z/
(7.47)
PM .z/; PM C1 .z/; : : : ; P3M=21 .z/
(7.48)
and can be used to form the 2 1 system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has m free parameters, the total number of free parameters is mM=2. The remaining polyphase components can be derived from the two sets above using the linear phase condition (7.46). In particular, the set of polyphase components in (7.47) determines the following set: P2M 1 .z/; P2M 2 .z/; : : : ; P3M=2 .z/
(7.49)
PM 1 .z/; PM 2 .z/; : : : ; PM=2 .z/;
(7.50)
and (7.48) determines
respectively.
130
7 Cosine-Modulated Filter Banks
7.4.3.2 Odd M When M is odd or .M 1/=2 is an integer, the scheme above is still valid, except for P.M 1/=2 and PM C.M 1/=2 . In particular, each pair of the following two sets of polyphase components: P0 .z/; P1 .z/; : : : ; P.M 1/=21 .z/
(7.51)
PM .z/; PM C1 .z/; : : : ; P.3M 1/=21 .z/:
(7.52)
and can be used to form the 2 1 system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has m free parameters, the total number of free parameters is m.M 1/=2. The linear phase condition (7.46) in turn causes (7.51) to determine P2M 1 .z/; P2M 2 .z/; : : : ; P.3M C1/=2 .z/
(7.53)
and (7.52) to determine PM 1 .z/; PM 2 .z/; : : : ; P.M C1/=2 .z/;
(7.54)
respectively. Apparently, both P.M 1/=2 and PM C.M 1/=2 are missing from the lists above. The underlying reason is that both the power-complementary and linear-phase conditions apply to them simultaneously. In particular, the linear phase condition (7.46) requires (7.55) PQ.M 1/=2 .z/ D zm1 P.3M 1/=2 .z/: This causes the power complementary condition (7.21) to become 2PQ.M 1/=2 .z/P.M 1/=2 .z/ D ˛; which leads to P.M 1/=2 .z/ D
p 0:5˛zı ;
(7.56)
(7.57)
where ı is a delay. Since H.z/ is low-pass with a cutoff frequency of =2M , it can be shown that the only acceptable choice of ı is [93] ( ıD
m1 2 ; m ; 2
if m is oddI if m is even:
(7.58)
Since ˛ is a scale factor for the whole the system, P.M 1/=2 is now completely determined. In addition, P.3M 1/=2 .z/ can be derived from it using (7.55): P.3M 1/=2 .z/ D z.m1/ PQ.M 1/=2 .z/ D
p 0:5˛z.m1ı/ :
(7.59)
7.5 Efficient Implementation
131
7.5 Efficient Implementation Direct implementation of either the analysis filter bank (7.13) or the synthesis filter bank (7.15) requires operations on the order M N . This can be significantly reduced to the order of N C M log2 M by utilizing polyphase representation and cosine modulation. There is a little variation in the actual implementation structures, depending on whether m is even or odd, and both cases are presented in Sects. 7.5.1 and 7.5.2 [93].
7.5.1 Even m When m is even, the analysis filter bank, represented by the following vector: 2
A0 .z/ A1 .z/ :: :
6 6 a.z/ D 6 4
3 7 7 7; 5
(7.60)
AM 1 .z/ may be written as [93] a.z/ D
p
M Dc CŒI J
P .z2M / 0 I J 0 0 P1 .z2M /
d.z/ zM d.z/
(7.61)
where ŒDc kk D cosŒ.k C 0:5/m; k D 0; 1; : : : ; M 1;
(7.62)
is an M M diagonal matrix, C is the DCT-IV matrix given in (5.79), 2
3 1 07 7 :: 7 :7 7 1 0 05 1 0 0 0
0 60 6 6 J D 6 ::: 6 40
0 0 0 1 :: :: : :
(7.63)
is the reversal or anti-diagonal matrix, h
i P0 z2M D Pk z2M ; k D 0; 1; : : : ; M 1; kk
(7.64)
is the diagonal matrix consisting of the initial M polyphase components of the prototype filter P .z/, and
132
7 Cosine-Modulated Filter Banks
h
P1 z2M
i kk
D PM Ck z2M ; k D 0; 1; : : : ; M 1;
(7.65)
is an diagonal matrix consisting of the last M polyphase components of the prototype filter P .z/. For an even m, (7.62) becomes ŒDc kk D cosŒ0:5m D .1/0:5m ; k D 0; 1; : : : ; M 1;
(7.66)
Dc D .1/0:5m I:
(7.67)
so
Therefore, the analysis bank may be written as #" # " 2M p d.z/ .z / 0 P 0 : a.z/ D .1/0:5m M CŒIJ IJ zM d.z/ 0 P1 .z2M / (7.68) The filter bank above may be implemented using the structure shown in Fig. 7.10. It is obvious that the major burdens of calculations are the polyphase filtering which entails operations on the order of N and the M M DCT-IV which needs operations
…
…
…
…
…
Fig. 7.10 Cosine modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of N D 2mM with an even m. The Pk .z2M / is the p kth polyphase component of the prototype filter with respect to 2M . A scale factor of .1/0:5m M is omitted
7.5 Efficient Implementation
133
on the order of M log2 M when implemented using a fast algorithm, so the total number of operations is on the order of N C M log2 M . The synthesis filter bank is obtained from the analysis bank using (7.12), which becomes Sk .z/ D z.N 1/ AQk .z/ (7.69) due to (7.41). Denoting sT .z/ D ŒS0 .z/; S0 .z/; : : : ; SM 1 .z/;
(7.70)
the equation above becomes sT .z/ D z.N 1/ aQ .z/:
(7.71)
Dropping in (7.68) and using (7.20), the equation above becomes " #" # p P Q 0 z2M 0
IJ MQ Q s .z/ D z d.z/ z d.z/ C M .1/0:5m Q 1 z2M 0 P I J " #
Q 0 z2M P 0 2M D z2M C1 dQ .z/ zM dQ .z/ z2M.m1/ Q 1 z 0 P .2mM 1/
T
"
# p IJ C M .1/0:5m : I J (7.72)
Due to (7.46), we have PQ 0 z2M 0 z 0 PQ 1 z2M 2 P2M 1 z2M 0 0 6 :: :: :: :: 6 : : : : 6 6 2M 6 0 0 PM z 6 D6 6 0 0 PM 1 z2M 6 6 :: :: :: :: 6 : : : : 4 2M.m1/
0 " D
JP1 z2M J 0 0
0
JP0 z2M J
0 #
:: :
0 :: :
0
:: :
0 :: :
P0 z2M
3 7 7 7 7 7 7 7 7 7 7 7 5
(7.73)
134
7 Cosine-Modulated Filter Banks
Fig. 7.11 Cosine modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of N D 2mM with an even m. The Pk .z2M / is the p kth polyphase component of the prototype filter with respect to 2M . A scale factor of .1/0:5m M is omitted
where the last step uses the following property of the reversal matrix: Jdiagfx1 ; x2 ; : : : ; xM 1 gJ D diagfxM 1 ; : : : ; x2 ; x1 g:
(7.74)
Therefore, the synthesis bank becomes i JP z2M J 0 h 1 Q Q 2M zM C1 d.z/ sT .z/ D zM zM C1 d.z/ J 0 JP0 z p IJ C M .1/0:5m ; (7.75) I J which may be implemented by the structure in Fig. 7.11.
7.5.2 Odd m When m is odd, both the analysis and synthesis banks are essentially the same as when m is even, with only minor differences. In particular, the analysis bank is given by [93]
7.5 Efficient Implementation
135
a.z/ D
p P .z2M / 0 M Ds CŒI C J I J 0 0 P1 .z2M /
d.z/ zM d.z/
(7.76)
where ŒDs kk D sinŒ.k C 0:5/m D .1/
m1 2
.1/k ; k D 0; 1; : : : ; M 1;
(7.77)
is a diagonal matrix with alternating 1 and 1 on the diagonal. Since this matrix only changes the signs of alternating subband samples and these sign changes are reversed upon input to the synthesis bank, the implementation of this matrix can be omitted in both analysis and synthesis bank. Similar to the case with even m, the corresponding synthesis filter bank may be obtained as: i JP .z2M /J 0 h 1 T M M C1 Q M C1 Q d.z/ z d.z/ s .z/ D z z 0 JP0 .z2M /J p ICJ (7.78) CDs M : IJ The analysis and synthesis banks can be implemented by structures shown in Figs. 7.12 and 7.13, respectively.
Fig. 7.12 Cosine modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of N D 2mM with an odd m. The Pk .z2M / is the kth polyphase component of the subbands, prototype filter with respect to 2M . The diagonal matrix Ds only negates alternating p hence can be omitted together with that in the synthesis bank. A scale factor of M is omitted
136
7 Cosine-Modulated Filter Banks
…
…
…
…
…
Fig. 7.13 Cosine modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of N D 2mM with an odd m. The Pk .z2M / is the kth polyphase component of the subbands, prototype filter with respect to 2M . The diagonal matrix Ds only negates alternating p hence can be omitted together with that in the analysis bank. A scale factor of M is omitted
7.6 Modified Discrete Cosine Transform Modified discrete cosine transform (MDCT) is a special case of CMFB when m D 1. It deserves special discussion here because of its wide application in audio coding. The first PR CMFB is the time-domain aliasing cancellation (TDAC) which was obtained from the 2M -DFT discussed in Sect. 7.1.2 without the =2M frequency shifting (even channel stacking) [77]. Referred to as evenly-stacked TDAC, it has M 1 full-bandwidth channels and 2 half-bandwidth channels, for a total of M C 1 channels. This issue was latter addressed using the =2M frequency shifting, which is called odd channel stacking, and the resultant filter bank is called oddly stacked TDAC [78].
7.6.1 Window Function With m D 1, the polyphase components of the prototype filter with respect to 2M becomes the coefficients of the filter itself: Pk .z/ D h.k/;
k D 0; 1; : : : ; 2M 1;
(7.79)
7.6 Modified Discrete Cosine Transform
137
which makes it intuitive to understand the filter bank. For example, the powercomplementary condition (7.21) becomes h2 .n/ C h2 .M C n/ D ˛; n D 0; 1; : : : ; M 1:
(7.80)
Also, the polyphase filtering stage in both Figs. 7.12 and 7.13 becomes simply applying the prototype filter coefficients, so the prototype filter is often referred to as the window function or simply the window. Since the block size is M and the window size is 2M , there is half window or one block overlapping between blocks, as shown in Fig. 7.15. The window function can be designed using the procedure discussed in Sect. 7.4. Due to m D 1, the lattice structure (7.39) degenerates into the rotation vector (7.35), so the design problem is much simpler. A window widely used in audio coding is the following half-sine window or simply sine window: h i : h.n/ D ˙ sin .n C 0:5/ 2M
(7.81)
It satisfies the power-complementary condition (7.80) because h h i i C sin2 C .n C 0:5/ h2 .n/ C h2 .M C n/ D sin2 .n C 0:5/ 2M i 2M h h2 i 2 2 C cos .n C 0:5/ D sin .n C 0:5/ 2M 2M D 1: (7.82) It is a unique window that allows perfect DC reconstruction using only the low-pass subband, i.e., subband zero [45,49]. This was shown to be a necessary condition for maximum asymptotic coding gain for an AR(1) signal with the correlation coefficient approaching the value of one.
7.6.2 MDCT The widely used MDCT is actually given by the following synthesis filter: M C1 .k C 0:5/ n C M 2 for k D 0; 1; : : : ; M 1:
sk .n/ D 2h.n/ cos
(7.83)
It is obtained from (7.15) using the following phase: k D .k C 0:5/.2m C 1/ : 2
(7.84)
138
7 Cosine-Modulated Filter Banks
Table 7.1 Phase difference between CMFB and MDCT. Its impact to the subband filters is a sign change when the phase difference is
k 4n C 0 4n C 1 4n C 2 4n C 3
k =4 =4 =4 =4
k =4 C =4 =4 =4 C
k k 0 0
…
…
…
…
…
…
…
…
…
Fig. 7.14 Implementation of forward MDCT as application of window function and then calculation of DCT-IV. The block C represents DCT-IV matrix
It differs from the k in (7.14) for some k by , as shown in Table 7.1. A phase difference of causes the cosine function to switch its sign to negative, so some of the analysis and synthesis filters will have negative values when compared with those given by (7.14). This is equivalent to that a different Ds is used in the analysis and synthesis banks in Sect. 7.5.2. As stated there, this kind of sign change is insignificant as long as it is complemented in both analysis and synthesis banks.
7.6.3 Efficient Implementation An efficient implementation structure for the analysis bank which utilizes the linear phase condition (7.11) to represent the second half of the window function is given in [45]. To prepare for window switching which is critical for coping with transients in audio signals (to be discussed in Chap. 11), we forgo this use of the linear phase condition to present the structure shown in Fig. 7.14 which uses the second half
7.6 Modified Discrete Cosine Transform
139
Fig. 7.15 Implementation of MDCT as application of an overlapping window function and then calculation of DCT-IV
of the window function directly. In particular, the input to the DCT block may be expressed as un D x.M=2 C n/h.3M=2 C n/ x.M=2 1 n/h.3M=2 1 n/ (7.85) and unCM=2 D x.n M /h.n/ x.1 n/h.M 1 n/
(7.86)
for n D 0; 1; : : : ; M=2 1, respectively. For both equations above, the current block is considered as consisting of samples x.0/; x.1/; : : : ; x.M 1/ and the past block as x.M /; x.M 1/; : : : ; x.1/ which essentially amounts to a delay line. The first half of the inputs to the DCT-IV block, namely un in (7.85), are obviously obtained by applying the second half of the window function to the current block of input data. The second half, namely unCM=2 in (7.86), are calculated by applying the first half of the window function to the previous block of data. This constitutes an overlap with the previous block of data. Therefore, the implementation of MDCT may be considered as application of an overlapping window function and then calculation of DCT-IV, as shown in Fig. 7.15. The following is the Matlab code for implementing MDCT: function [y] = mdct(x, n0, M, h) % % [y] = mdct(x, n0, M, h) % % x: Input array. M samples before n0 are considered as the % delay line % n0: Start of a block of new data % M: Block size % h: Window function % y: MDCT coefficients % % Here is an example for generating the sine win % n = 0:(2*M-1); % h = sin((n+0.5)*0.5*pi/M); %
140
7 Cosine-Modulated Filter Banks
% Convert to DCT4 for n=0:(M/2-1) u(n+1) = - x(n0+M/2+n+1)*h(3*M/2+n+1) - x(n0+M/2-1-n+1)*h(3*M/2-1-n+1); u(n+M/2+1) = x(n0+n-M+1)*h(n+1) - x(n0+-1-n+1)*h(M-1-n+1); end % % DCT4, you can use any DCT4 subroutines y=dct4(u);
The inverse of Fig. 7.14 is shown in Fig. 7.16 which does not utilize the symmetric property of the window function imposed by the linear phase condition. The output of the synthesis bank may be expressed as x.n/ D xd.n/ C uM=2Cn h.n/; xd.n/ D uM=21n h.M C n/; for n D 0; 1; : : : ; M=2 1I
(7.87)
and x.n/ D xd.n/ u3M=21n h.n/; xd.n/ D uM=2Cn h.M C n/; for n D M=2; M=2 C 1; : : : ; M 1I
(7.88)
where xd.n/ is the delay line with a length of M samples.
Fig. 7.16 Implementation of backward MDCT as calculation of DCT-IV and then application of window function. The block C represents DCT-IV matrix
7.6 Modified Discrete Cosine Transform
141
The following is the Matlab code for implementing inverse MDCT: function [x, xd] = imdct(y, xd, M, h) % % [x, xd] = imdct(y, xd, M, h) % % y: MDCT coefficients % xd: Delay line % M: Block size % h: Window function % x: Reconstruced samples % % Here is an example for generating the sine window % n = 0:(2*M-1); % h = sin((n+0.5)*0.5*pi/M); % % DCT4 u=dct4(y); % % for n=0:(M/2-1) x(n+1) = xd(n+1) + u(M/2+n+1)*h(n+1); xd(n+1) = -u(M/2-1-n+1)*h(M+n+1); end % for n=(M/2):(M-1) x(n+1) = xd(n+1) - u(3*M/2-1-n+1)*h(n+1); xd(n+1) = -u(-M/2+n+1)*h(M+n+1); end
If Figs. 7.14 and 7.16 were followed strictly, we could end up with half length for the delay lines at the cost of an increased complexity for swapping variables.
Part IV
Entropy Coding
After the removal of perceptual irrelevancy using quantization aided by data modeling, we are left with a set of quantization indexes. They can be directly packed into a bit stream for transmission to the decoder. However, they can be further compressed through the removal of statistical redundancy via entropy coding. The basically idea of entropy coding is to represent more probable quantization indexes with shorter codewords and less probable ones with longer codewords so as to achieve a shorter average codeword length. In this way, the set of quantization indexes can be represented with less number of bits. The theoretical minimum of the average codeword length for a particular quantizer is its entropy which is a function of the probability distribution of the quantization indexes. The practical minimum of average codeword length achievable by all practical codebooks is usually higher than the entropy. The codebook that delivers such practical minimum is called the optimal codebook. However, this practical minimum can be made to approach the entropy if quantization indexes are grouped into blocks, each block is coded as one block symbol, and the block size is allowed to be arbitrarily large. Huffman’s algorithm is an iterative procedure that always produces an optimal entropy codebook.
Chapter 8
Entropy and Coding
Let us consider a 2-bit quantizer that represents quantized values using the following set of quantization indexes: f0; 1; 2; 3g: (8.1) Each quantization index given above is called a source symbol, or simply a symbol, and the set is called a symbol set. When applied to quantize a sequence of input samples, the quantizer produces a sequence of quantization indexes, such as the following: 1; 2; 1; 0; 1; 2; 1; 2; 1; 0; 1; 2; 2; 1; 2; 1; 2; 3; 2; 1; 2; 1; 1; 2; 1; 0; 1; 2; 1; 2:
(8.2)
Called a source sequence, it needs to be represented by or converted to a sequence of codewords or codes that are suitable for transmission over a variety of channels. The primary concern is that the average codeword length is minimized so that the transmission of the source sequence demands a lower bit rate. An instinct approach to this coding problem is to use a binary numeral system to represent the symbol set. This may lead to the codebook in Table 8.1. Each codeword in this codebook is of fixed length, namely 2 bits, so the codebook is referred to as a fixed-length codebook or fixed-length code. Coding each symbol in the source sequence (8.2) using the fixed-length code in Table 8.1 takes two bits, amounting to a total of 2 30 D 60 bits to code the entire 30 symbols in source sequence (8.2). The average codeword length is obviously LD
60 D 2 bits/symbol; 30
(8.3)
which is the codeword length of the fixed-length codebook and is independent of the frequency that each symbol occurs in the source sequence (8.2). Since the symbols appear in source sequence (8.2) with obviously different frequencies or probabilities, the average codeword length would be reduced if a short codeword is assigned to a symbol with high probability and a long one to a symbol with low probability. This strategy leads to variable-length codebooks or simply variable-length codes.
Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6 8, c Springer Science+Business Media, LLC 2010
145
146
8 Entropy and Coding
Table 8.1 Fixed-length codebook for the source sequence in (8.2)
Symbol 0 1 2 3
Codeword 00 01 10 11
Codeword length (bits) 2 2 2 2
Table 8.2 Variable-length unary code for the example sequence in (8.5). Shorter codewords are assigned to more frequently occurring symbols and longer codewords to less frequently occurring ones Symbol Codeword Frequency Codeword length (bits) 0 001 3 3 1 1 14 1 2 01 12 2 3 0001 1 4
As an example, the first two columns of Table 8.2 are such a variable-length codebook that is built using the unary numeral system. Assuming that the sequence is iid and that the number that a symbol appears in the sequence accurately reflects its frequency of occurrence, the probability distribution may be estimated as p.0/ D
14 12 1 3 ; p.1/ D ; p.2/ D ; p.3/ D : 30 30 30 30
(8.4)
With this unary codebook, the most frequently occurring symbol ‘1’ is coded with one bit and the least frequently occurring symbol ‘3’ is coded with four bits. The average codeword length is LD
14 12 1 3 3C 1C 2C 4 D 1:7 bits; 30 30 30 30
which is 0.3 bits better than the the fixed-length code in Table 8.1.
8.1 Entropy Coding To cast the problem above of coding a source sequence to reduce average codeword length in mathematical terms, let us consider an information source that emits a sequence of messages or source symbols X.1/; X.2/; : : :
(8.5)
by drawing from a symbol set or alphabet of S D fs0 ; s1 ; : : : ; sM 1 g
(8.6)
8.1 Entropy Coding
147
consisting of M source symbols. The symbols in the source sequence (8.5) is often assumed to be an independently and identically distributed (iid) random variable with a probability distribution of p.sm / D pm ; m D 0; 1; : : : ; M 1:
(8.7)
The task is to convert this sequence into a compact sequence of codewords Y .1/; Y .2/; : : :
(8.8)
C D fc0 ; c1 ; : : : ; cM 1 g;
(8.9)
drawn from a set of codewords
called a codebook or simply code. The goal is to find a codebook that minimizes the average codeword length without loss of information. The codewords are often represented using binary numerals, in which case the resultant sequence of codewords, called a codeword sequence or simply a code sequence, is referred to as a bit stream. While other radix of representation, such as hexadecimal, can also be used, binary radix is assumed in this book without loss of generality. This coding process has to be “lossless” in the sense that the complete source sequence can be recovered or decoded from the received codeword sequence without any error or loss of information, so it is called lossless compression coding or simply lossless coding. In what amounts to symbol code approach to lossless compression coding, a one-to-one mapping: sm
! cm ;
for m D 0; 1; : : : ; M 1;
(8.10)
is established between each source symbol sm in the symbol set and a codeword cm in the codebook and then deployed to encode the source sequence or decode the codeword sequence symbol by symbol. The codebooks in Tables 8.1 and 8.2 are symbol codes. Let l.cm / denotes the codeword length of codeword cm in codebook (8.9), then the average codeword length per source symbol of the code sequence (8.8) is LD
M 1 X
p.sm /l.cm /:
(8.11)
mD0
Due to the symbol code mapping (8.10), the equation above becomes: LD
M 1 X
p.cm /l.cm /;
mD0
which is the average codeword length of the codebook (8.9).
(8.12)
148
8 Entropy and Coding
Apparently any symbol set and consequently source sequence can be represented or coded using a binary numeral system with L D ceil Œlog2 .M / bits;
(8.13)
where the function ceil.x/ returns the smallest integer no less than x. This results in a fixed-length codebook or fixed-length code in which each codeword is coded with an L-bits binary numeral or is said to have a codeword length of L bits. This fixed length binary codebook is considered as the baseline code for an information source and is used by PCM in (2.30). The performance of a variable-length code may be assessed by compression ratio: RD
L0 ; L
(8.14)
where L0 and L are the average codeword lengths of the fixed-length code and the variable-length code, respectively. For the example unary code in Table 8.2, the compression ratio is 2 1:176: RD 1:7
8.2 Entropy In pursuit of a codebook that delivers an average codeword length as low as possible, it is critical to know if there exists a minimal average codeword length and, if it exists, what it is. Due to (8.12), the average codeword length is weighted by the probability distribution of the given information source, so it can be expected that the answer is dependent on this probability model. In fact, it was discovered by Claude E. Shannon, an electrical engineer at Bell Labs, in 1948 that this minimum is the entropy of the information source which is solely determined by the probability distribution [85, 86].
8.2.1 Entropy When a message X from an information source is received by the receiver which turns out to be symbol sm , the associated self-information is I.X D sm / D log p.sm /:
(8.15)
The average information per symbol for all messages emitted by the information source is obviously dependent on the probability that each symbol occurs and is thus given by: M 1 X H.X / D p.sm / log p.sm /: (8.16) mD0
8.2 Entropy
149
This is called entropy and is the minimal average codeword length for the given information source (to be proved later). The unit of entropy is determined by the logarithmic base. The bit, based on the binary logarithm (log2 ), is the most commonly used unit. Other units include the nat, based on the natural logarithm (loge ), and the hartley, based on the common logarithm (log10 ). Due to log2 x ; loga x D log2 a conversion between these units are simple and straigthforward, so binary logarithm (log2 ) is always assumed in this book unless stated otherwise. The use of logarithm as a measure of information makes sense intuitively. Let us first note that the function in (8.15) is a decreasing function of the probability p.sm / and equals zero when p.sm / D 1. This means that A less likely event carries more information because the amount of surprise
is larger. When the receiver knows that an event is sure to happen, i.e., p.X D sm / D 1, before receiving the message X , the event of receiving X to discover that X D sm carries no information at all. So the self-information (8.15) is zero. More bits need to be allocated to encode less likely events because they carry more information. This is consistent with our strategy for codeword assignment: assigning longer codewords to less frequently occurring symbols. Ideally, the length of the codeword assigned to encode a symbol should be its selfinformation. If this were done, the average codeword length would be the same as the entropy. Partly because of this, variable-length coding is often referred to as entropy coding. To view another intuitive perspective about entropy, let us suppose that we received two messages (symbols) from the source: Xi and Xj . Since the source is iid, we have p.Xi ; Xj / D p.Xi /p.Xj /. Consequently, the self-information carried by the two messages is I.Xi ; Xj / D log p.Xi ; Xj /
D log .p.Xi // log p.Xj / D I.Xi / C I.Xj /:
(8.17)
This is exact what we expect: The information for two messages should be the sum of the information that each
message carries. The number of bits to code two messages should be the sum of coding each
message individually. In addition to the intuitive perspectives outlined above, there are other considerations which ensure that the choice of using logarithm for entropy is not arbitrary. See [85, 86] or [83] for details.
150
8 Entropy and Coding
As an example, let us calculate the entropy for the source sequence (8.2): H.X / D
3 log2 30
12 log2 30
3 30 12 30
14 log2 30
1 log2 30
14 30 1 30
1:5376 bits: Compared with this value of entropy, the average codeword length of 1.7 bits achieved by the unary code in Table 8.2 is quite impressive.
8.2.2 Model Dependency The definition of entropy in (8.16) assumes that messages emitted from an information source be iid, so the source can be completely characterized by the one-dimensional probability distribution. Since entropy is completely determined by this distribution, its value as the minimal average codeword length is apparently as good as the probability model, especially the iid assumption. Although simple, the iid assumption usually do not reflect the real probability structure of the source sequence. In fact, most information source, and audio signals in particular, are strongly correlated. The violation of this iid assumption significantly skew the calculated entropy toward a value larger than the “real entropy”. This is shown by two examples below. In the first example, we notice that each symbol in the example sequence (8.2) can be predicted from its predecessor using (4.1) to give the following residual sequence: 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 0; 1; 1; 1; 1; 1; 1; 1; 1; 1; 0; 1; 1; 1; 1; 1; 1; 1: Now the alphabet is reduced to f1, 0, 1g with the following probabilities: p.1/ D
2 15 13 ; p.0/ D ; p.1/ D : 30 30 30
Its entropy is H.X / D
13 log2 30
13 30
2 log2 30
2 30
15 log2 30
15 30
1:2833 bits,
8.2 Entropy
151
which is 1:5376 1:2833 D 0:2543 bits less than the entropy achieved with the false iid assumption. Another approach to exploiting the correlation in example sequence (8.2) is to borrow the idea from vector quantization and consider the symbol sequence as a sequence of vectors or block symbols: .1; 2/; .1; 0/; .1; 2/; .1; 2/; .1; 0/; .1; 2/; .2; 1/; .2; 1/; .2; 3/; .2; 1/; .2; 1/; .1; 2/; .1; 0/; .1; 2/; .1; 2/ with an alphabet of f(2, 3), (1, 0), (2, 1), (1, 2)g: The probabilities of occurrence for all block symbols or vectors are p..2; 3// D
3 4 7 1 ; p..1; 0// D ; p..2; 1// D ; p..1; 2// D : 15 15 15 15
The entropy for the vector symbols is 1 H.X / D log2 15 4 log2 15
1 15 4 15
3 log2 15 7 log2 15
3 15 7 15
1:7465 bits per block symbol. Since there are two symbols per block symbol, the entropy is actually 1:7465=2 0:8733 bits per symbol. This is 1:5376 0:8733 D 0:6643 bits less than the entropy achieved with the false iid assumption. Table 8.3 summarizes the entropies achieved using three data models to code the same source sequence (8.2). It is obvious that the entropy for a source is a moving entity, depending on how good the data model is. Since it is generally not possible to completely know a physical source and hence build the perfect model for it, it is generally impossible to know the “real entropy” of the source. The entropy is as good as the data model used. This is similar to quantization where data models play a critical role. Table 8.3 Entropies obtained using three data models for the same source sequence (8.2)
Data model iid Prediction Block coding
Entropy (bits per symbol) 1.5376 1.2833 0.87325
152
8 Entropy and Coding
8.3 Uniquely and Instantaneously Decodable Codes A necessary requirement for entropy coding is that the source sequence can be reconstructed without any loss of information from the the codeword sequence received by the decoder. While the one-to-one mapping (8.10) is the first step toward this end, it is not sufficient because the codeword sequence generated by concatenating the codewords from such a codebook can become undecodable. There is, therefore, an implicit requirement for a codebook that all of its codewords must be uniquely decodable when concatenated together in any order. Among all the uniquely decodable codebooks for a given information source, the one or ones with the least average codeword length is called the optimal codebook. Uniquely decodable codes vary significantly in terms of computational complexity, especially when decoding is involved. There is a subset, called prefix-free codes, which are instantaneous decodable in the sense that each of its codewords can be decoded as soon as the last bit of it is received. It will be shown that, if there is an optimal codebook, at least one of them is a prefix-free code.
8.3.1 Uniquely Decodable Code Looking at the unary code in Table 8.2, one notices that the ‘1’ does not appear anywhere other than the end of the codewords. One would wonder why this is needed? To answer this question, let us consider the old unary numeral system shown in Table 8.4 which uses the number of 0’s to represent the corresponding number and thus establishes a one-to-one mapping. It is obvious that it takes the same number of bits to code the sequence in (8.2) as the unary code given in Table 8.2. Therefore, the codebook in Table 8.4 seems to be equally adequate. The problem lies in that the codeword sequence generated from the codebook in Table 8.4 cannot be uniquely decoded. For example, the first three symbols in (8.2) are f1, 2, 1g, so will be coded into f0 00 0g. Once the receiver received this sequence of f0000g, the decoder cannot uniquely decode the sequence: it cannot determine whether the received codeword sequence is either f3g, f2,2g, or f1,1,1,1g. To ensure unique decodability of the unary code, the ‘1’ is used to signal the end of a codeword in Table 8.2. Now, the three symbols f1, 2, 1g will be coded via the codebook in Table 8.2 into f1 01 1g. Once the receiver received the sequence f1011g, it can uniquely determines that the symbols are f1,2,1g. Table 8.4 A not uniquely decodable codebook
Symbol 0 1 2 3
Codeword 000 0 00 0000
Frequency 3 14 12 1
Codeword length 3 1 2 4
8.3 Uniquely and Instantaneously Decodable Codes
153
Fixed-length codes such as the one in Table 8.1 is also uniquely decodable because all codewords are of the same length and unique in the codebook. To decode a sequence of symbols coded with a fixed-length code of n bits, one can cut the sequence into blocks of n bits each, extract the bits in each block, and look up the codebook to find the symbol it represents. The unique decodability imposes limit on codeword lengths. In particular, McMillan’s inequality states that a binary codebook (8.9) is uniquely decodable if and only if M 1 X 2l.cm / 1; (8.18) mD0
where l.cm / is again the length of codeword cm . See [43] for proof. Given the source sequence (8.5) and the probability distribution (8.7), there are many uniquely decodable codes satisfying the requirement for decodability (8.18). But these codes are not equal because their average codeword lengths may be different. Among all of them, the one that produces the minimum average codeword length M 1 X p.cm /l.cm / (8.19) Lopt D min fl.cm /g
mD0
is referred as the optimal codebook Copt . It is the target of codebook design.
8.3.2 Instantaneous and Prefix-Free Code The uniquely decodable fixed-length code in Table 8.1 and unary code in Table 8.2 are also instantaneously decodable in the sense that each codeword can be decoded as soon as its last bit is received. For the fixed-length code, decoding is possible as soon as the fixed number of bits are received, or the last bit is received. For the unary code, decoding is possible as soon as the ‘1’ is received, which signals the end of a codeword. There are codes that are uniquely decodable, but cannot be instantaneously decoded. Two such examples are shown in Table 8.5. For Codebook A, the codeword f0g is a prefix of codewords f01g and f011g. When f0g is received, the receiver cannot decide whether it is codeword f0g or the first bit of codewords f01g or f011g, so it has to wait. If f1g is subsequently received, the receiver cannot decide whether the received f01g is codeword f01g or the first two bits of codeword f011g because f01g is a prefix of codeword 011. So the receiver has to wait again. Except for the Table 8.5 Two examples of not instantaneously decodable codebooks
Symbol 0 1 2
Codebook A 0 01 011
Codebook B 0 01 11
154
8 Entropy and Coding
reception of f011g which the receiver can decide immediately that it is codeword f011g, the decoder has to wait until the reception of f0g, the start of the next source symbol. Therefore, the decoding delay may be more than one codeword. For Codebook B, the decoding delay may be as long as the whole sequence. For example, the source sequence f01111g may decode as f0,2,2g. However, when another f1g is subsequently received to give a codeword sequence of f011111g, the interpretation of the initial 5’s bits (f01111g) becomes totally different because the codeword sequence f011111g decodes as f1,2,2g now. The decoder cannot make the decision until it sees the end of the sequence. The primary reason for the delayed decision is that the codeword f0g is a prefix of codeword f01g. When the receiver sees f0g, it cannot decide whether it is codeword f0g or just the first bit of codeword f01g. To resolve this, it has to wait until the end of the sequence and work backward. Apparently, whether or not a codeword is a prefix of another codeword is critical to whether it is instantaneously decodable. A codebook in which no codeword is a prefix of any other codewords is referred to as a prefix code, or more precisely prefix-free code. The unary code in Table 8.2 and fixed-length code in Table 8.1 are both prefix codes and are instantaneously decodable. The two uniquely decodable codes in Table 8.5 are not prefix codes and are not instantaneous decodable. In fact, this association is not a coincidence: a codebook is instantaneously decodable if and only if it is a prefix-free code. To see this, let us assume that there is a codeword in an instantaneously decodable code that is a prefix of at least another codeword. Because of this, this codeword is obviously not instantaneously decodable, as shown by Codebook A and Codebook B in the above example. Therefore, an instantaneously decodable code has to be prefix-free code. On the other hand, all codewords of a prefix-free code can be decoded instantaneously upon reception because there is no ambiguity with any other codewords in the codebook.
8.3.3 Prefix-Free Code and Binary Tree A codebook can be viewed as a binary tree or code tree. This is illustrated in Fig. 8.1 for the unary codebook in Table 8.2. The tree starts from a root node of NULL and can have no more than two possible branches at each node. Each branch represents either ‘0’ or ‘1’. Each node contains the codeword that represents all the branches connecting from the root node all the way through to the current node. If a node does not grow any more branches, it is called an external node or leaf; otherwise, it is called an internal node. Since an internal node grows at least one branch, the codeword it represents is a prefix of whatever codeword or node that grows from it. On the other hand, a leaf or external node does not grow any more branches, the codeword it represents is not a prefix of any other nodes or codewords. Therefore, the codewords of a prefix-free code are taken only from the leaves. For example, the code in Fig. 8.1 is a prefix-free code since only its leaves are taken as codewords.
8.3 Uniquely and Instantaneously Decodable Codes
155
Fig. 8.1 Binary tree for an unary codebook. It is a prefix-free code because only its leaves are taken as codewords
Fig. 8.2 Binary trees for two noninstantaneous codes. Since left tree (for Codebook A) takes codewords from internal nodes f0g and f01g, and the right tree (for Codebook B) from f0g, respectively, both are not prefix-free codes. The branches are not labeled due to the convention that left branches represent ‘0’ and right branches represents ‘1’
Figure 8.2 shows the trees for the two noninstantaneous codes given in Table 8.5. Since Codebook A takes codewords from internal nodes f0g and f01g, and Codebook B from f0g, respectively, both are not prefix-free codes. Note that the branches are not labeled because both trees follow the convention that left branches represent ‘0’ and right branches represent ‘1’. This convention is followed throughout this book unless stated otherwise.
8.3.4 Optimal Prefix-Free Code Instantaneous/prefix-free codes are obviously desirable for easy and efficient decoding. However, prefix-free codes are only a subset of uniquely decodable codes, so there is a legitimate concern that the optimal code may not be a prefix-free code for
156
8 Entropy and Coding
a given information source. Fortunately, this concern turns out to be unwarranted due to Kraft’s inequality which states that there is a prefix-free codebook (8.9) if and only if M 1 X
2l.cm / 1:
(8.20)
mD0
See [43] for proof. Since Kraft’s inequality (8.20) is the same as McMillan’s inequality (8.18), we conclude that there is a uniquely decodable code if and only if there is an instantaneous/prefix-free code with the same set of codeword lengths. In terms of the optimal codebooks, there is an optimal codebook if and only if there is a prefix-free code with the same set of codeword lengths. In other words, there is always a prefix-free codebook that is optimal.
8.4 Shannon’s Noiseless Coding Theorem Although prefix-free codes are instantaneously decodable and there is always a prefix-free codebook that is optimal, there is still a question as to how close the average codeword length of such an optimal prefix-free code can approach the entropy of the information source. Shannon’s noiseless coding theorem states that the entropy is the absolute minimal average codeword length of any uniquely decodable codes and that the entropy can be asymptotically approached by a prefix-free code if source symbols are coded as blocks and the block size goes to infinity.
8.4.1 Entropy as the Lower Bound To reduce the average codeword length, it is desired that only short codewords be used. But McMillan’s inequality (8.18) states that, to ensure unique decodability, the use of some short codewords requires the other codewords to be long. Consequently, the overall or average codeword length cannot be arbitrarily low, there is an absolute lower bound. It turns out that the entropy is this lower bound. To prove this, let KD
M 1 X
2l.cm / :
mD0
Due to McMillan’s inequality (8.18), we have log.K/ 0:
(8.21)
8.4 Shannon’s Noiseless Coding Theorem
157
Consequently, we have LD
M 1 X
p.cm /l.cm /
mD0
M 1 X
p.cm / Œl.cm / C log2 .K/
mD0
D
M 1 X
h i p.cm / log2 2l.cm / K
mD0
D
M 1 X
" p.cm / log2
mD0
D
M 1 X
p.cm / log2
mD0
DH
M 1 X
p.cm /2l.cm / K p.cm /
#
M 1 h i X 1 C p.cm / log2 p.cm /2l.cm / K p.cm / mD0
p.cm / log2
mD0
1 : p.cm /2l.cm / K
Due to log2 .x/
1 .x 1/; 8x > 0; ln 2
the second term on the right-hand side is always negative because: M 1 X
p.cm / log2
mD0
1 p.cm /2l.cm / K
M 1 1 X 1 p.cm / 1 ln 2 mD0 p.cm /2l.cm / K
D
M 1 1 1 X / p.c m ln 2 mD0 2l.cm / K
1 D ln 2
"
M 1 M 1 1 X l.cm / X 2 p.cm / K mD0 mD0
#
1 Œ1 1 ln 2 D 0: D
Therefore, we have L H:
(8.22)
158
8 Entropy and Coding
8.4.2 Upper Bound Since the entropy is the absolute lower bound on average codeword length, an intuitive approach to the construction of an optimal codebook is to set the length of the codeword assigned to a source symbol to its self-information (8.15), then the average codeword length would be equal to the entropy. This is, unfortunately, unworkable because the self-information is most likely not an integer. But we can get close to this by setting the codeword length to the next smallest integer: l.cm / D ceilŒ log2 p.cm /:
(8.23)
Such a codebook is called a Shannon–Fano code. A Shannon–Fano code is uniquely decodable because it satisfies the McMillan inequality (8.18): M 1 X
2l.cm / D
mD0
M 1 X
2ceilŒ log2 p.cm /
mD0
M 1 X
2log2 p.cm /
mD0
D
M 1 X
p.cm /
mD0
D 1:
(8.24)
The average codeword length of the Shannon–Fano code is LD
M 1 X
p.cm /ceilŒ log2 p.cm /
mD0
M 1 X
p.cm / Œ1 log2 p.cm /
mD0
D 1
M 1 X
p.cm / log2 p.cm /
mD0
D 1 C H.X /;
(8.25)
where the second inequality is obtained due to ceil.x/ 1 C x: A Shannon–Fano code may or may not be optimal, although it sometimes is. But the above inequality constitutes an upper bound on the optimal codeword length.
8.4 Shannon’s Noiseless Coding Theorem
159
Combining this and the lower entropy bound (8.22), we obtain the following bound for the optimal codeword length: H.X / Lopt < 1 C H.X /:
(8.26)
8.4.3 Shannon’s Noiseless Coding Theorem Let us group n source symbols in source sequence (8.5) as a block, called a block symbol, (8.27) X.k/ D ŒXkn ; XknC1 ; : : : ; XknCn1 ; thus converting source sequence (8.5) into a sequence of block symbols: X.0/; X.1/; : : : :
(8.28)
Since source sequence (8.5) is assumed to be iid, the probability distribution for a block symbol is p.sm0 ; sm1 smn1 / D p.sm0 /p.sm1 / p.smn1 /:
(8.29)
Using this equation, the entropy for the block symbols is H n .X/ D
1 M 1 M X X m0 D0 m1 D0
M 1 X mn1 D0
p.sm0 ; sm1 smn1 / log p.sm0 ; sm1 smn1 /: D n
M 1 X
p.sm0 / log p.sm0 /
m0 D0
D nH.X /:
(8.30)
Applying the bounds in (8.26) to the block symbols, we have nH.X / Lnopt < 1 C nH.X /
(8.31)
where Lnopt is the optimal codeword length for the block codebook that is used to code the sequence of block symbols. The average codeword length per source symbol is obviously Lnopt : (8.32) Lopt D n
160
8 Entropy and Coding
Therefore, the optimal codeword length per source symbol is bounded by H.X / Lopt <
1 C H.X /: n
(8.33)
As n ! 1, the optimal codeword length Lopt approaches the entropy. In other words, by choosing a large enough block size n, Lopt can be made as close to the entropy as desired. And this can always be delivered by a prefix-free code because the inequality in (8.33) was derived within the context that both the McMillan inequality (8.18) and Kraft inequality (8.20) are satisfied.
Chapter 9
Huffman Coding
Now that we are assured that, given a probability distribution, if there is an optimal uniquely decodable code, there is a prefix-free code with the same average codeword length, the next step is the construction of such optimal prefix-free codes. Huffman’s algorithm [29], developed by David A. Huffman in 1952 when he was a Ph.D. student at MIT, is just such a simple algorithm.
9.1 Huffman’s Algorithm Huffman’s algorithm is a recursive procedure that merges the two least probable symbols into a new “meta-symbol” until only one meta-symbol is left. This procedure can be best illustrated through an example. Let us consider the probability distribution in (8.4). Its Huffman code is constructed using the following steps (see Fig. 9.1): 1. The two least probable symbols 1=30 and 3=30 are merged into a new “metasymbol” with a probability of 4=30, which becomes the new least probable symbol. 2. The next two least probable meta-symbol 4=30 and symbol 12=30 are merged into a new “meta-symbol” with a probability of 16=30. 3. The next two least probable meta-symbol 16=30 and symbol 14=30 are merged into a new “meta-symbol” with a probability of 30=30, which signals the end of recursion. The Huffman codeword for each symbol is listed on the left side of the symbol in Fig. 9.1. These codewords are generated by assigning ‘1’ to top branches and ‘0’ to lower branches, starting from the right-most “meta-symbol” which has a probability of 30=30. The average length for this Huffman codebook is LD3
14 12 1 50 3 C1 C2 C3 D 1:6667 bits. 30 30 30 30 30
Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6 9, c Springer Science+Business Media, LLC 2010
161
162
9 Huffman Coding
Fig. 9.1 Steps involved in constructing a Huffman codebook for the probability distribution in (8.4)
This is better than 1:7 bits achieved using the unary code given in Table 8.2, but still larger than the entropy of 1.5376 bits. To present Huffman’s algorithm for an arbitrary information source, let us consider the symbol set (8.6) with a probability distribution (8.7), Huffman’s algorithm generates the codebook (8.9) with the following recursive procedure: 1. Let m0 , m1 , : : :, mM 1 be a permutation of 0, 1, : : :, M 1 for which pm0 pm1 pmM 1 ;
(9.1)
then the symbol set (8.6) is permulated as fsm0 ; sm1 ; : : : ; smM 3 ; smM 2 ; smM 1 g:
(9.2)
2. Merge the two least probable symbols smM 2 and smM 1 into a new metasymbol s 0 with the associated probability of p 0 D pmM 2 C pmM 1 :
(9.3)
This gives rise to the following new symbol set: fsm0 ; sm1 ; : : : ; smM 3 ; s 0 g:
(9.4)
with the associated probability distribution of pm0 ; pm1 ; : : : ; pmM 3 ; p 0 :
(9.5)
3. Repeat step 1 until the number of symbols is two. Then return with the following codebook f0; 1g: (9.6)
9.2 Optimality
163
4. If the codebook for symbol set (9.4) is fc0 ; c1 ; : : : ; cM 3 ; c 0 gI
(9.7)
return with the following codebook: fc0 ; c1 ; : : : ; cM 3 ; c 0 0; c 0 1g
(9.8)
for symbol set (9.2). Apparently, for the special case of M D 2, the iteration above ends at step 3, so Huffman codebook f0; 1g is returned.
9.2 Optimality Huffman’s algorithm produces optimal codes. This can be proved by induction on the number of source symbols M . In particular, for M D 2, the codebook produced by Huffman’s algorithm is obviously optimal: we cannot do any better than using one bit to code each symbol. For M > 2, let us assume that Huffman’s algorithm produces an optimal code for a symbol set of size M 1, we prove that Huffman’s algorithm produces an optimal code for a symbol set of size M .
9.2.1 Codeword Siblings From (9.8) we notice that the Huffman codeword for the two least probable symbols have the form of fc 0 0g and fc 0 1g, i.e., they have the same length and differ only in the last bit (see Fig. 9.1). Such a pair of codewords are called “siblings”. In fact, any instantaneous codebook can always be re-arranged in such a way that the codewords for the two least probably symbols are siblings while keeping the average codeword length the same or less. To show this, let us first note that, for two symbols with probabilities p1 < p2 , if a longer codeword was assigned to the more probable symbol, i.e., l1 < l2 , the codewords can always be swapped without any topological change to the tree, but with reduced average codeword length. One such example is shown in Fig. 9.2 where codewords for symbols with probabilities 0:3 and 0:6 are swapped. Repetitive application of this topologically constant procedure to a codebook can always end up with a new one which has the same topology as the original one, but whose two least probable symbols are assigned the two longest codewords and whose average codeword length is at least the same as, if not shorter than, the original one. Second, if an internal node in a codebook tree does not grow two branches, it can always be removed to generate shorter codewords. This is shown in Fig. 9.3 where node ‘1’ in the tree shown on the top is removed to give the tree at the bottom
164
9 Huffman Coding
Fig. 9.2 Codewords for symbols with probabilities 0:3 and 0:6 in the codebook tree on the top are swapped to give the codebook tree in the bottom. There is no change to the tree topology, average codeword length becomes shorter because the shorter codeword is weighted by the higher probability
Fig. 9.3 Removal of the internal node '1' in the tree on the top produces the tree at the bottom, which has a shorter average codeword length
with a shorter average codeword length. Application of this procedure to the two least probable symbols in a codebook ensures that they always have the same codeword length; otherwise, the longest codeword must grow from at least one internal node with only one branch.

Third, if the codewords for the two least probable symbols do not grow from the same last internal node, the last internal node for the least probable symbol must grow another codeword whose length is the same as that of the second least probable
symbol. Due to the equal codeword lengths, these two codewords can be swapped with no impact on the average codeword length. Consequently, the codewords for the two least probable symbols can always grow from the same last internal node. In other words, the two codewords can always be of the forms $c'0$ and $c'1$.
9.2.2 Proof of Optimality

Let $L_M$ be the average codeword length for the codebook produced by Huffman's algorithm for the symbol set (8.6) with the probability distribution (8.7). Without loss of generality, the symbol set in (8.6) can always be permuted to give the following symbol set:
$$\{s_0, s_1, \ldots, s_{M-3}, s_{M-2}, s_{M-1}\}$$ (9.9)
with a probability distribution satisfying
$$p_m \ge p_{M-2} \ge p_{M-1}, \quad \text{for all } 0 \le m \le M-3.$$ (9.10)
Then the last two symbols can be merged to give the following symbol set:
$$\{s_0, s_1, \ldots, s_{M-3}, s'\},$$ (9.11)
which has $M-1$ symbols and a probability distribution of
$$p_0, p_1, \ldots, p_{M-3}, p',$$ (9.12)
where
$$p' = p_{M-2} + p_{M-1}.$$ (9.13)
Applying Huffman’s recursive procedure in Sect. 9.1 to symbol set (9.11) produces a Huffman codebook with an average codeword length of LM 1 . By the induction hypothesis, this Huffman codebook is optimal. The last step of Huffman procedure grows the last codeword in symbol set (9.11) into two codewords by attaching one bit (‘0’ and ‘1’) to its end to produce a codebook of size M for symbol set (9.9). This additional bit is added with a probability of pM 2 CpM 1 , so the average codeword length for the new Huffman codebook is LM D LM 1 C pM 2 C pM 1 :
(9.14)
Suppose that there were another instantaneous codebook for the symbol set in (8.6) with an average codeword length $\hat{L}_M$ that is less than $L_M$:
$$\hat{L}_M < L_M.$$ (9.15)
As shown in Sect. 9.2.1, this codebook can be modified so that the codewords for the two least probable symbols $s_{M-2}$ and $s_{M-1}$ have the forms $c'0$ and $c'1$, while keeping the average codeword length the same or less. This means the symbol set is permuted to have the form given in (9.9) with the corresponding probability distribution given by (9.10). This codebook can be used to produce another codebook of size $M-1$ for symbol set (9.11) by keeping the codewords for $\{s_0, s_1, \ldots, s_{M-3}\}$ the same and encoding the last symbol $s'$ using $c'$. Let us denote its average codeword length as $\hat{L}_{M-1}$. Following the same argument as that which leads to (9.14), we can establish
$$\hat{L}_M = \hat{L}_{M-1} + p_{M-2} + p_{M-1}.$$ (9.16)
Subtracting (9.16) from (9.14), we have
$$L_M - \hat{L}_M = L_{M-1} - \hat{L}_{M-1}.$$ (9.17)
By the induction hypothesis, $L_{M-1}$ is the average codeword length of an optimal codebook for the symbol set in (9.11), so we have
$$L_{M-1} \le \hat{L}_{M-1}.$$ (9.18)
Plugging this into (9.17), we have
$$L_M - \hat{L}_M \le 0$$ (9.19)
or
$$\hat{L}_M \ge L_M,$$ (9.20)
which contradicts the supposition in (9.15). Therefore, Huffman's algorithm produces an optimal codebook for symbol sets of size $M$ as well.

To summarize, it was proven above that, if Huffman's algorithm produces an optimal codebook for a symbol set of size $M-1$, it produces an optimal codebook for a symbol set of size $M$. Since it produces the optimal codebook for $M = 2$, it produces optimal codebooks for any $M$.
9.3 Block Huffman Code

Although Huffman code is optimal for symbol sets of any size, the optimal average codeword length that it achieves is often much larger than the entropy when the symbol set is small. To see this, let us consider the extreme case of $M = 2$. The only possible codebook, which is also the Huffman code, is
$$\{0, 1\}.$$
It obviously has an average codeword length of one bit, regardless of the underlying probability distribution and entropy. The more skewed the probability distribution is, the smaller the entropy is, hence the less efficient the Huffman code is. As an example, let us consider the probability distribution of $p_1 = 0.1$ and $p_2 = 0.9$. It results in an entropy of
$$H = -0.1 \log_2(0.1) - 0.9 \log_2(0.9) \approx 0.4690 \text{ bits},$$
which is obviously much smaller than the one bit delivered by the Huffman code.
9.3.1 Efficiency Improvement

By Shannon's noiseless coding theorem (8.33), however, the entropy can be approached by grouping more symbols together into block symbols. To illustrate this, let us first block-code two symbols together as one block symbol. This gives the probability distribution in Table 9.1 and the corresponding Huffman code in Fig. 9.4. Its average codeword length is
$$L = \frac{1}{2}(0.01 \times 3 + 0.09 \times 3 + 0.09 \times 2 + 0.81 \times 1) = 0.645 \text{ bits},$$
which is significantly smaller than the one bit achieved by the $M = 2$ Huffman code.

Table 9.1 Probability distribution when two symbols are coded together as one block symbol

Symbol   Probability
00       0.01
01       0.09
10       0.09
11       0.81
Fig. 9.4 Huffman code when two symbols are coded together as one block symbol
Table 9.2 Probability distribution when three symbols are coded together as one block symbol

Symbol   Probability
000      0.001
001      0.009
010      0.009
011      0.081
100      0.009
101      0.081
110      0.081
111      0.729
Fig. 9.5 Huffman code when three symbols are coded together as one block symbol
This can be further improved by coding three symbols as one block symbol. This gives us the probability distribution in Table 9.2 and the Huffman code in Fig. 9.5. The average codeword length is
$$L = \frac{1}{3}(0.001 \times 7 + 0.009 \times 7 + 0.009 \times 6 + 0.081 \times 4 + 0.009 \times 5 + 0.081 \times 3 + 0.081 \times 2 + 0.729 \times 1) = 0.5423 \text{ bits},$$
which is more than 0.1 bit better than coding two symbols as a block symbol and closer to the entropy.
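Assuming the `huffman_codebook` sketch from Sect. 9.1, the figures above can be checked numerically for an i.i.d. source:

```python
import math
from itertools import product

def entropy(probs):
    """Entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def block_probs(probs, n):
    """Probabilities of all n-symbol blocks of an i.i.d. source."""
    return [math.prod(combo) for combo in product(probs, repeat=n)]

p = [0.1, 0.9]                        # the skewed binary source above
for n in (1, 2):
    bp = block_probs(p, n)
    code = huffman_codebook(bp)
    L = sum(q * len(c) for q, c in zip(bp, code)) / n
    print(n, round(L, 4))             # 1.0 bit for n = 1, 0.645 bits for n = 2
print(round(entropy(p), 4))           # 0.469 bits
```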
9.3.2 Block Encoding and Decoding

In the examples above, the block symbols are constructed from the original symbols by stacking the bits of the primary symbols together. For example, a block symbol in Table 9.2 is constructed as $B = \{s_2 s_1 s_0\}$, where $s_0$, $s_1$, and $s_2$ are the bits representing the original symbols. This is possible because the number of symbols in the original symbol set is $M = 2$. In general, for any finite $M$, a block symbol $B$ consisting of $n$ original symbols may be expressed as
$$B = s_{n-1} M^{n-1} + s_{n-2} M^{n-2} + \cdots + s_1 M + s_0,$$ (9.21)
where $s_i$, $i = 0, 1, \ldots, n-1$, represents the value of each original symbol. It is, therefore, obvious that block encoding consists of just a series of multiplication and accumulation operations. Equation (9.21) also indicates that the original symbols may be decoded from the block symbol through the following iterative procedure:
$$\begin{aligned} B_1 &= B/M, &\quad s_0 &= B - B_1 M;\\ B_2 &= B_1/M, &\quad s_1 &= B_1 - B_2 M;\\ &\ \vdots & &\ \vdots\\ B_{n-1} &= B_{n-2}/M, &\quad s_{n-2} &= B_{n-2} - B_{n-1} M;\\ B_n &= B_{n-1}/M, &\quad s_{n-1} &= B_{n-1} - B_n M; \end{aligned}$$ (9.22)
where the $/$ operation represents integer division. When $M$ is a power of two, it may be implemented as a right shift. The step in each iteration that obtains $s_i$ is actually the operation to get the remainder.
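A minimal sketch of this packing and unpacking in Python (function names are illustrative):

```python
def block_encode(symbols, M):
    """Pack symbol values (each in 0..M-1) into one block symbol per (9.21)."""
    B = 0
    for s in reversed(symbols):  # Horner evaluation: one multiply-add per symbol
        B = B * M + s
    return B

def block_decode(B, M, n):
    """Unpack a block symbol into n symbol values per (9.22)."""
    symbols = []
    for _ in range(n):
        symbols.append(B % M)    # remainder recovers s_i
        B //= M                  # integer division yields the next B_i
    return symbols

assert block_decode(block_encode([1, 0, 1], 2), 2, 3) == [1, 0, 1]
```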
9.4 Recursive Coding

Huffman encoding is straightforward and simple because it only involves looking up the codebook. Huffman decoding, however, is rather complex because it entails searching through the tree until a matching leaf is found. If the codebook is too large, consisting of more than 300 codewords, for example, the decoding complexity can be excessive. Recursive indexing is a simple method for representing an excessively large symbol set by a moderate one, so that a moderate Huffman codebook can be used to encode the excessively large symbol set.
Without loss of generality, let us represent a symbol set by its indexes starting from zero, so that each symbol in the symbol set corresponds to a nonnegative integer $x$. This $x$ can be represented as
$$x = qM + r,$$ (9.23)
where $M$ is the maximum value of the reduced symbol set $\{0, 1, \ldots, M\}$, $q$ is the quotient, and $r$ is the remainder. Once $M$ is agreed upon by the encoder and decoder, only $q$ and $r$ need to be conveyed to the decoder. Usually $r$ is encoded using a Huffman codebook and $q$ by other means. One simple approach to encoding $q$ is to represent it by repeating the symbol $M$ $q$ times. In this way a single Huffman codebook can be used to encode a large symbol set, no matter how large it is.
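A sketch of this escape-based scheme (the symbol-list representation and the choice of $M$ in the example are illustrative):

```python
def recursive_index_encode(x, M):
    """Split a nonnegative index x per (9.23): x = q*M + r.
    The escape symbol M is emitted q times, followed by the remainder r;
    all emitted symbols belong to the reduced set {0, 1, ..., M}."""
    q, r = divmod(x, M)
    return [M] * q + [r]

def recursive_index_decode(symbols, M):
    """Count leading escape symbols to recover q, then read r."""
    q = 0
    for s in symbols:
        if s != M:
            return q * M + s
        q += 1
    raise ValueError("missing remainder symbol")

assert recursive_index_decode(recursive_index_encode(737, 256), 256) == 737
```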
9.5 A Fast Decoding Algorithm

Due to the need to search the tree to find the leaf that matches bits from the input stream, Huffman decoding is computationally expensive. While fast decoding algorithms are usually tied to specific hardware (computer) architectures, a generic algorithm is provided here to illustrate the steps involved in Huffman decoding (a sketch of this procedure in code is given after the list):

1. $n = 1$;
2. Unpack one bit from the bit stream;
3. Concatenate the bit to the previously unpacked bits to form a word with $n$ bits;
4. Search the codewords of $n$ bits in the Huffman codebook;
5. Stop if a codeword is found to be equal to the unpacked word;
6. $n = n + 1$ and go back to step 2.
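In code, the generic procedure might look like the following sketch, which grows the unpacked word one bit at a time and tests it against a dictionary keyed by codeword strings (a real decoder on a low-cost microprocessor would typically use lookup tables instead):

```python
def huffman_decode_symbol(bits, codebook):
    """Decode one symbol from an iterator of '0'/'1' characters.
    `codebook` maps codeword strings to symbols (steps 1-6 above)."""
    word = ''
    for bit in bits:             # steps 2-3: unpack a bit and concatenate
        word += bit
        if word in codebook:     # steps 4-5: search, stop on a match
            return codebook[word]
    raise ValueError("truncated bit stream")

stream = iter('010111')          # three codewords: '0', '101', '11'
cb = {'0': 'a', '101': 'b', '100': 'c', '11': 'd'}
print([huffman_decode_symbol(stream, cb) for _ in range(3)])  # ['a', 'b', 'd']
```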
Part V
Audio Coding
While presented from the perspective of audio coding, the chapters in the previous parts cover theoretical aspects of coding technology that can be applied to the coding of signals in general. The chapters in this part are devoted to the coding of audio signals. In particular, Chap. 10 covers perceptual models, which determine which parts of the source signal are inaudible (perceptually irrelevant) and thus can be removed. Chapter 11 addresses the resolution challenge posed by the frequent interruption of otherwise quasistationary audio signals by transients. Chapter 12 deals with widely used methods for joint channel coding as well as the coding of low-frequency effect (LFE) channels. Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms. Chapter 14 is devoted to performance assessment of audio coding algorithms, and Chap. 15 presents the dynamic resolution adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described in this book to create a practical audio coding algorithm.
Chapter 10
Perceptual Model
Although data model and quantization have been discussed in detail in the earlier chapters as the tools for effectively removing perceptual irrelevance, a question still remains as to which part of the source signal is perceptually irrelevant. Feasible answers to this question obviously depend on the underlying application. For audio coding, perceptual irrelevance is ultimately determined by the human ear, so perceptual models need to be built that mimic the human auditory system so as to indicate to an audio coder which parts of the source audio signal are perceptually irrelevant, hence can be removed without audible artifacts.

When a quantizer removes perceptual irrelevance, it essentially substitutes quantization noise for perceptually irrelevant parts of the source signal, so the quantization process should be properly controlled to ensure that quantization noise is not audible. Quantization noise is not audible if its power is below the sensitivity threshold of the human ear. This threshold is very low in an absolutely quiet environment (threshold in quiet), but becomes significantly elevated in the presence of other sounds due to masking. Masking is a phenomenon whereby a strong sound makes a weak sound less audible or even completely inaudible when the power of the weak sound is below a certain threshold jointly determined by the characteristics of both sounds. Quantization noise may be masked by signal components that occur simultaneously with the signal component being quantized. This is called simultaneous masking and is exploited most extensively in audio coding. Quantization noise may also be masked by signal components that are ahead of and/or behind it. This is called temporal masking.

The task of the perceptual model is to exploit the threshold in quiet and simultaneous/temporal masking to come up with an estimate of the global masking threshold, which is a function of frequency and time. The audio coder can then adjust its quantization process in such a way that all quantization noise is below this threshold to ensure that it is not audible.
10.1 Sound Pressure Level

Sound waves traveling in air or other transmission media can be described by a time-varying atmospheric pressure change $p(t)$, called sound pressure. Sound pressure is measured as the sound force per unit area and its unit is Newton per square meter ($\mathrm{N/m^2}$), also known as a Pascal (Pa). The human ear can perceive sound pressure as low as $10^{-5}$ Pa, while a sound pressure of 100 Pa is considered the threshold of pain. These two values establish a dramatic dynamic range of roughly $10^7$. When compared with the atmospheric pressure, which is 101,325 Pa, the absolute values of sound pressure perceivable by the human ear are obviously very small. To cope with this situation, the sound pressure level (SPL) is introduced:
$$l = 20 \log_{10} \frac{p}{p_0} \text{ dB},$$ (10.1)
where $p_0$ is a reference level of $2 \times 10^{-5}$ Pa. This reference level corresponds to the best hearing sensitivity of an average listener, which occurs at around 1,000 Hz.

Another description of sound waves is the sound intensity $I$, which is defined as the sound power per unit area. For a spherical or plane progressive wave, sound intensity $I$ is proportional to the square of sound pressure, so the sound intensity level is related to the sound pressure level by
$$l = 20 \log_{10} \frac{p}{p_0} = 10 \log_{10} \frac{I}{I_0} \text{ dB},$$ (10.2)
where $I_0 = 10^{-12}\ \mathrm{W/m^2}$ is the reference level. Due to this relationship, sound pressure level and sound intensity level are identical on the logarithmic scale.

When a sound signal is considered as a wide-sense stationary random process, its spectrum or intensity density level is defined as
$$L(f) = \frac{P(f)}{I_0},$$ (10.3)
where $P(f)$ is the power spectral density of the sound wave.
10.2 Absolute Threshold of Hearing

The absolute threshold of hearing (ATH) or threshold in quiet (THQ) is the minimum sound pressure level of a pure tone that an average listener with normal hearing can hear in an absolutely quiet environment. This SPL threshold varies with the frequency of the test tone; an empirical equation that describes this relationship is [91]
Fig. 10.1 Absolute threshold of hearing
$$T_q(f) = 3.64\left(\frac{f}{1{,}000}\right)^{-0.8} - 6.5\,e^{-0.6\,(f/1{,}000 - 3.3)^2} + 0.001\left(\frac{f}{1{,}000}\right)^4 \text{ dB}$$ (10.4)
and is plotted in Fig. 10.1. As can be seen in Fig. 10.1, the human ear is very sensitive at frequencies from 1,000 to 5,000 Hz and is most sensitive around 3,300 Hz. Beyond this region, the sensitivity of hearing degrades rapidly, especially below 100 Hz and above 10,000 Hz. Below 20 Hz and above 18,000 Hz, the human ear can hardly perceive sounds. The formula in (10.4), and hence Fig. 10.1, does not fully reflect the rapid degradation of hearing sensitivity below 20 Hz. As people age, hearing sensitivity degrades mostly at high frequencies, with little change at low frequencies.

It is rather difficult to apply the threshold in quiet to audio coding, mostly because there is no way to know the playback SPL at which an audio signal is presented to a listener. A safe bet is to equate the minimum in Fig. 10.1 around 3,300 Hz to the lowest bit in the audio coder. This ensures that quantization noise is not audible even if the audio signal is played back at the maximum volume, but is usually too pessimistic because listeners rarely play back sound at the maximum volume. Another difficulty with applying the threshold in quiet to audio coding is that quantization noise is complex and is unlikely to be sinusoidal. The actual threshold in quiet for complex quantization noise is certainly different from that for pure tones, but there is not much research reported in this regard.
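Equation (10.4) transcribes directly into code; a small sketch using NumPy:

```python
import numpy as np

def threshold_in_quiet(f):
    """Absolute threshold of hearing per (10.4); f in Hz, result in dB SPL."""
    khz = np.asarray(f, dtype=float) / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * np.exp(-0.6 * (khz - 3.3) ** 2)
            + 0.001 * khz ** 4)

# The ear is most sensitive near 3.3 kHz, where the threshold dips below 0 dB:
f = np.linspace(20, 18000, 2000)
print(f[np.argmin(threshold_in_quiet(f))])  # about 3,300 Hz
```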
10.3 Auditory Subband Filtering

As shown in Fig. 10.2, when sound is perceived by the human ear, it is first preprocessed by the human body, including the head and shoulders, and then by the outer ear canal before it reaches the ear drum. The vibration of the ear drum is transferred by the ossicular bones in the middle ear to the oval window, which is the entrance to, or the start of, the cochlea in the inner ear. The cochlea is a spiral structure filled with almost incompressible fluids, whose start at the oval window is known as the base and whose end as the apex. The vibrations at the oval window induce traveling waves in the fluids, which in turn transfer the waves to the basilar membrane that lies along the length of the cochlea. These traveling waves are converted into electrical signals by neural receptors that are connected along the length of the basilar membrane [53].
10.3.1 Subband Filtering

Different frequency components of an input sound wave are sorted out while traveling along the basilar membrane from the start (base) towards the end (apex). This is schematically illustrated in Fig. 10.3 for an example signal consisting of three tones (400, 1,600, and 6,400 Hz) presented to the base of the basilar membrane [102]. For each sinusoidal component in the input sound wave, the amplitude of basilar membrane displacement increases at first, reaches a maximum, and then decreases rather abruptly. The position where the amplitude peak occurs depends on the frequency of the sinusoidal component. In other words, a sinusoidal signal resonates strongly at a position on the basilar membrane that corresponds to its frequency. Equivalently, different frequency components of an input sound wave resonate at different locations on the basilar membrane. This allows different groups of neural receptors connected along the length of the basilar membrane to process different frequency components of the input signal. From a signal processing perspective, this frequency-selective processing of sound signals may be viewed as subband filtering and the basilar membrane may be considered as a bank of bandpass auditory filters.
Fig. 10.2 Major steps involved in the conversion of sound waves into neural signals in the human ear
Fig. 10.3 Instantaneous displacement of the basilar membrane for an input sound wave consisting of three tones (400, 1,600, and 6,400 Hz) that is presented at the oval window. Note that the three excitations do not appear simultaneously because the wave needs time to travel along the basilar membrane
An observation from Fig. 10.3 is that the auditory filters are continuously placed along the length of the basilar membrane and are activated in response to the frequency components of the input sound wave. If the frequency components of the sound wave are close to each other, these auditory filters overlap significantly. There is, of course, no decimation in the continuous-time world of neurons. As will be discussed later in this chapter, the frequency responses of these auditory filters are asymmetric, nonlinear, level-dependent, and of nonuniform bandwidth that increases with frequency. Therefore, auditory filters are very different from the discretely placed, almost nonoverlapping, and often maximally decimated subband filters that we are familiar with.
10.3.2 Auditory Filters

A simple model for auditory filters is the gammatone filter, whose impulse response is given by [1, 71]
$$h(t) = A\,t^{n-1} e^{-2\pi B t} \cos(2\pi f_c t + \phi),$$ (10.5)
where $f_c$ is the center frequency, $\phi$ the phase, $A$ the amplitude, $n$ the filter's order, $t$ the time, and $B$ the filter's bandwidth. Figure 10.4 shows the magnitude response of the gammatone filter with $f_c = 1{,}000$ Hz, $n = 4$, and $B$ determined by (10.7).

A widely used model that captures the asymmetric nature of auditory filters is the rounded exponential filter, denoted as $\mathrm{roex}(p_l, p_u)$, whose power spectrum is given by [73]
$$W(f) = \begin{cases} \left(1 + p_l\,\dfrac{f_c - f}{f_c}\right) e^{-p_l \frac{f_c - f}{f_c}}, & f \le f_c,\\[6pt] \left(1 + p_u\,\dfrac{f - f_c}{f_c}\right) e^{-p_u \frac{f - f_c}{f_c}}, & f > f_c, \end{cases}$$ (10.6)
Fig. 10.4 Magnitude response of a gammatone filter that models the auditory filters
Fig. 10.5 Power spectrum of the $\mathrm{roex}(p_l, p_u)$ filter ($f_c = 1{,}000$ Hz, $p_l = 40$, and $p_u = 10$) that models the auditory filters
where $f_c$ represents the center frequency, $p_l$ determines the slope of the filter below $f_c$, and $p_u$ determines the slope of the filter above $f_c$. Figure 10.5 shows the power spectrum of this filter for $f_c = 1{,}000$ Hz, $p_l = 40$, and $p_u = 10$.
The $\mathrm{roex}(p_l, p_u)$ filter with $p_l = p_u$ was used to estimate the equivalent rectangular bandwidth (ERB) of the auditory filters [53, 70]. When the order of the gammatone filter is in the range 3–5, the shape of its magnitude characteristic is very similar to that of the $\mathrm{roex}(p_l, p_u)$ filter with $p_l = p_u$ [72]. In particular, when the order is $n = 4$, the suggested bandwidth for the gammatone filter is
$$B = 1.019\,\mathrm{ERB},$$ (10.7)
where ERB is given in (10.15).
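A sketch of the two filter models in Python (the sample rate, duration, and frequency grid below are arbitrary choices; the ERB value anticipates (10.15)):

```python
import numpy as np

def gammatone_ir(fc, B, n=4, fs=48000, dur=0.05):
    """Gammatone impulse response per (10.5), unit amplitude, zero phase."""
    t = np.arange(int(dur * fs)) / fs
    return t ** (n - 1) * np.exp(-2 * np.pi * B * t) * np.cos(2 * np.pi * fc * t)

def roex_power(f, fc, pl, pu):
    """Rounded exponential power spectrum per (10.6)."""
    f = np.asarray(f, dtype=float)
    g = np.abs(f - fc) / fc              # normalized deviation from fc
    p = np.where(f <= fc, pl, pu)        # lower/upper slope parameter
    return (1.0 + p * g) * np.exp(-p * g)

# Reproduce the setup of Fig. 10.4: fc = 1,000 Hz, n = 4, B = 1.019 ERB.
erb_1k = 24.7 * (0.00437 * 1000.0 + 1.0)  # ERB at 1 kHz, see (10.15)
h = gammatone_ir(1000.0, 1.019 * erb_1k)  # bandwidth per (10.7)
H = np.abs(np.fft.rfft(h))
print(H.argmax() * 48000 / len(h))        # magnitude peaks near 1,000 Hz
```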
10.3.3 Bark Scale

From a subband filtering perspective, the location on the basilar membrane where the amplitude maximum occurs for a sinusoidal component may be considered as the point that represents the frequency of the sinusoidal component and as the center frequency of the auditory filter that processes this component. Consequently, the distance from the base of the cochlea, or from the oval window, along the basilar membrane represents a new frequency scale that is different from the linear frequency scale (in Hz) that we are familiar with.

A frequency scale that seeks to linearly approximate the frequency scale represented by the basilar membrane length is the critical band rate or Bark scale. Its relationship with the linear frequency scale $f$ (Hz) is empirically determined and may be analytically expressed as follows [102]:
$$z = 13 \arctan(0.00076 f) + 3.5 \arctan\left[(f/7{,}500)^2\right] \text{ Bark}$$ (10.8)
and is shown in Fig. 10.6. In fact, one Bark corresponds to a distance of about 1.3 mm along the basilar membrane [102]. The Bark scale is apparently neither linear nor logarithmic with respect to the linear frequency scale. It was proposed by Eberhard Zwicker in 1961 [101] and named in memory of Heinrich Barkhausen, who introduced the "phon", a scale for loudness as perceived by the human ear.
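A direct transcription of (10.8):

```python
import numpy as np

def hz_to_bark(f):
    """Critical band rate per (10.8); f in Hz, result in Bark."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

print(hz_to_bark([100, 1000, 4000, 16000]))  # about 1, 8.5, 17.3, and 24 Bark
```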
10.3.4 Critical Bands

While complex, the auditory filters exhibit strong frequency selectivity: the loudness of a signal remains constant as long as its energy is within the passband, referred to as the critical band (CB), of an auditory filter, and decreases dramatically as the energy moves out of the critical band. The critical bandwidth is a parameter that quantifies the bandwidth of the auditory filter passband.

There are a variety of methods for estimating the critical bandwidth. A simple approach is to present a set of uniformly spaced tones with equal power to the
Fig. 10.6 Relationships of the Bark scale with respect to the linear frequency scale (top) and the logarithmic frequency scale (bottom)
listeners and measure the threshold in quiet [102]. For example, to estimate the critical bandwidth near 1,000 Hz, the threshold in quiet is measured while placing more and more equally powered tones starting from 920 Hz with a 20 Hz frequency increment. Figure 10.7 shows that the measured threshold in quiet, which is the total power of all tones, remains at about 3 dB when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones wide, or 160 Hz, starting at 920 Hz and ending at $920 + 160 = 1{,}080$ Hz.

To see this, let us denote the power of each tone as $\sigma^2$ and the total number of tones as $n$. Before $n$ reaches eight, the power of each tone is
$$\sigma^2 = \frac{10^{0.3}}{n}$$ (10.9)
to maintain a total power of 3 dB in the passband. When $n$ reaches eight, the power of each tone is
$$\sigma^2 = \frac{10^{0.3}}{8}.$$ (10.10)
When $n$ is more than eight, only the first eight tones fall into the passband and the others are filtered out by the auditory subband filter. Therefore, the power for each tone given in (10.10) needs to be maintained in order to keep a total power of 3 dB in the passband. Consequently, the total power of all tones is
Fig. 10.7 The measurement of critical bandwidth near 1,000 Hz by placing more and more uniformly spaced and equally powered tones starting from 920 Hz with a 20 Hz increment. The total power of all tones, indicated by the crosses, remains constant when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz, starting at 920 Hz and ending at 920 + 160 = 1,080 Hz. This is so because tones after the eighth are filtered out by the auditory filter, so they do not contribute to the power in the passband. The addition of more tones causes an increase in total power, but the power perceived in the passband remains constant
$$P = \frac{10^{0.3}}{8}\,n.$$ (10.11)
To summarize, the total power of all tones as a function of the number of tones is
$$P = \begin{cases} 10^{0.3}, & n \le 8;\\[4pt] \dfrac{10^{0.3}}{8}\,n, & \text{otherwise}. \end{cases}$$ (10.12)
This is the curve shown in Fig. 10.7. The method above is valid only in the frequency range between 500 and 2,000 Hz where the threshold in quiet is approximately independent of frequency. For other frequency ranges, more sophisticated methods are called for. One such method, called masking in frequency gap, is shown in Fig. 10.8 [102]. It places a test signal, called a maskee, at the center frequency where the critical bandwidth is to be estimated and then places two masking signals, called maskers, of equal power at equal distance in the linear frequency scale from the test signal. If the power of the test signal is weak relative to the total power of the maskers, the test signal is not audible. When this happens, the test signal is said to be masked
Fig. 10.8 Measurement of critical bandwidth with masking in frequency gap. A test signal is placed at the center frequency $f_0$ where the critical bandwidth is to be measured and two masking signals of equal power are placed at equal distance from the test signal. In (a), the test signal is a narrow-band noise and the two maskers are tones. In (b), two narrow-band noises are used as maskers to mask the tone in the center. The masked threshold versus the frequency separation between the maskers is shown in (c). The frequency separation where the masked threshold begins to drop off may be considered as the critical bandwidth
by the maskers. In order for the test signal to become audible, its power has to be raised above a certain level, called the masked threshold or sometimes the masking threshold.

When the frequency separation between the maskers is within the critical bandwidth, all of their power falls into the critical band where the test signal resides and the total masking power is the sum of that of the two maskers. In this case, no matter how wide the frequency separation is, the total power in the critical band is constant, so a constant masked threshold is expected. As the separation becomes wider than the critical bandwidth, parts of the maskers' powers begin to fall out of the critical band and are hence filtered out by the auditory filter. This causes the total masking power in the critical band to become smaller, so the maskers' ability to mask the test signal becomes weaker, causing the masked threshold to decrease accordingly. This is summarized in the curve of the masked threshold versus the frequency separation shown in Fig. 10.8c. The curve is flat for small frequency separations and begins to fall off when the separation becomes larger than a certain value, which may be considered as the critical bandwidth.

Data from many subjective tests were collected to produce the critical bandwidths listed in Table 10.1 and shown in Fig. 10.9 [101, 102]. Here the lowest critical band is considered to have a frequency range between 0 and 100 Hz, which includes the inaudible frequency range between 0 and 20 Hz. Some authors may choose to assume the lowest critical band to have a frequency range between 20 and 100 Hz.
Table 10.1 Critical bandwidth as proposed by Zwicker

Band number   Upper frequency boundary (Hz)   Critical bandwidth (Hz)
1             100                             100
2             200                             100
3             300                             100
4             400                             100
5             510                             110
6             630                             120
7             770                             140
8             920                             150
9             1,080                           160
10            1,270                           190
11            1,480                           210
12            1,720                           240
13            2,000                           280
14            2,320                           320
15            2,700                           380
16            3,150                           450
17            3,700                           550
18            4,400                           700
19            5,300                           900
20            6,400                           1,100
21            7,700                           1,300
22            9,500                           1,800
23            12,000                          2,500
24            15,500                          3,500

Fig. 10.9 Critical bandwidth as proposed by Zwicker
Many audio applications, such as CD and DVD, deploy sample rates that allow for a frequency range higher than the maximum 15,500 Hz given in Table 10.1. For these applications, one more critical band may be added, which starts from 15,500 Hz and ends at half of the sample rate.

The critical bandwidths listed in Table 10.1 might give the wrong impression that critical bands are discrete and nonoverlapping. To emphasize that critical bands are continuously placed on the frequency scale, an analytic expression is usually more useful. One such approximation is given below:
$$f_G = 25 + 75\left[1 + 1.4\left(\frac{f}{1{,}000}\right)^2\right]^{0.69} \text{ Hz},$$ (10.13)
where $f$ is the center frequency in hertz [102]. In terms of the number of critical bands, Table 10.1 may be approximated by (10.8), where the $n$th Bark represents the $n$th critical band. Therefore, the Bark scale is also called the critical band rate scale in the sense that one Bark is equal to one critical bandwidth.
10.3.5 Critical Band Level

Similar to sound intensity level, the critical band level of a sound in a critical band $z$ is defined as the total sound intensity level within the critical band:
$$L(z) = \int_{f \in z} L(f)\,\mathrm{d}f.$$ (10.14)
For tones, the critical band level is obviously the same as the tone’s intensity level. For noise, the critical band level is the total of the noise intensity density levels within the critical band.
10.3.6 Equivalent Rectangular Bandwidth

An alternative measure of critical bandwidth is the equivalent rectangular bandwidth (ERB). For a given auditory filter, its ERB is the bandwidth of an ideal rectangular filter which has a passband magnitude equal to the maximum passband gain of the auditory filter and passes the same amount of energy as the auditory filter [53]. A notched-noise method is used to estimate the shape of the roex filter in (10.6), from which the ERB of the auditory filter is obtained [53, 70]. A formula that fits many experimental data well is given below [21, 53]:
Fig. 10.10 Comparison of ERB with traditional critical bandwidth (CB)
$$\mathrm{ERB} = 24.7\,(0.00437 f + 1) \text{ Hz},$$ (10.15)
where $f$ is the center frequency in hertz. This formula indicates that the ERB is linear with respect to the center frequency. This is significantly different from the critical bandwidth in (10.13), as shown in Fig. 10.10, especially for frequencies below 500 Hz. It has been argued that the ERB given in (10.15) is a better approximation than the traditional critical bands discussed in Sect. 10.3.4 because it is based on new data that were obtained by a few different laboratories using direct measurement of critical bands with the notched-noise method [53].

One ERB obviously represents one frequency unit in the auditory system, so the number of ERBs corresponds to a frequency scale and is conceptually similar to the Bark scale. A formula for calculating this ERB scale, or the number of ERBs for a center frequency $f$ in hertz, is given below [21, 53]:
$$\text{Number of ERBs} = 21.4 \log_{10}(0.00437 f + 1).$$ (10.16)
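Equations (10.13), (10.15), and (10.16) transcribe directly; a sketch:

```python
import numpy as np

def critical_bandwidth(f):
    """Traditional critical bandwidth per (10.13); f in Hz."""
    f = np.asarray(f, dtype=float)
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69

def erb(f):
    """Equivalent rectangular bandwidth per (10.15); f in Hz."""
    return 24.7 * (0.00437 * np.asarray(f, dtype=float) + 1.0)

def erb_scale(f):
    """Number of ERBs per (10.16); f in Hz."""
    return 21.4 * np.log10(0.00437 * np.asarray(f, dtype=float) + 1.0)

# Below about 500 Hz the two bandwidth measures diverge markedly (cf. Fig. 10.10):
print(critical_bandwidth(100.0), erb(100.0))  # about 100 Hz versus 35.5 Hz
```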
10.4 Simultaneous Masking

It is very easy to hear quiet conversation when the background is quiet. When there is sound in the background, which may be another conversation, music, or noise, the speaker has to raise his/her volume to be heard. This is a simple example of simultaneous masking.
Fig. 10.11 Masking of a weak sound (dashed lines) by a strong one (solid lines) when they are presented simultaneously and their frequencies are close to each other
While the mechanism behind simultaneous masking is very complex, involving at least the nonlinear basilar membrane and the complex auditory neural system, a simple explanation is offered in Fig. 10.11, where two sound waves are presented to a listener simultaneously. Since their frequencies are close to each other, the excitation pattern of the weaker sound may be completely shadowed by that of the stronger one. If the basilar membrane is assumed to be linear, the weaker sound cannot be perceived by the auditory neurons, hence is completely masked by the stronger one.

Apparently, the masking effect is, to a large extent, dependent on the power of the masker relative to that of the maskee. For audio coding, the masker is considered as the signal and the maskee is the quantization noise that is to be masked, so this relative value is expressed by the signal-to-mask ratio (SMR)
$$\mathrm{SMR} = 10 \log_{10} \frac{\sigma^2_{\mathrm{Masker}}}{\sigma^2_{\mathrm{Maskee}}}.$$ (10.17)
In order for a masker to completely mask a maskee, the SMR has to pass a certain threshold, called the SMR threshold:
$$T_{\mathrm{SMR}} = \min\{\mathrm{SMR} \mid \text{the maskee is inaudible}\}.$$ (10.18)
The negative of this threshold in decibels is called the masking index [102]:
$$I = -T_{\mathrm{SMR}} \text{ dB}.$$ (10.19)
10.4.1 Types of Masking

The frequency components of an audio signal may be considered as either noise-like or tone-like. Consequently, both the masker and the maskee may be either tones or noise, so there are four types of masking, as shown in Table 10.2. Note that the noise here is usually considered as narrow-band, with a bandwidth of no more than the critical bandwidth. Since the removal of perceptual irrelevance is all about the masking of quantization noise, which is complex and rarely tone-like, the cases of TMN and NMN are the most relevant for audio coding.
Table 10.2 Four types of masking

Maskee   Masker: Tone             Masker: Noise
Tone     Tone masks tone (TMT)    Noise masks tone (NMT)
Noise    Tone masks noise (TMN)   Noise masks noise (NMN)
10.4.1.1 Tone Masking Tone

For the case of a pure tone masking another pure tone (TMT), both the masker and the maskee are simple, but it has turned out to be very difficult to conduct masking experiments and to build good models, mostly due to the beating phenomenon that occurs when two pure tones with frequencies close to each other are presented to a listener. For example, two pure tones of 990 and 1,000 Hz, respectively, produce a beating of 10 Hz, which causes the listeners to hear something different from the steady-state tone (masker); they then believe that the maskee has been heard. But this is, in fact, different from having actually heard another tone (the maskee). Fortunately, quantization noise is rarely tone-like, so the lack of good models in this regard is less of a problem for audio coding. A large number of experiments have, nevertheless, indicated an SMR threshold of about 15 dB [102].
10.4.1.2 Tone Masking Noise

In this case, a pure tone masks a narrow-band noise whose spectrum falls within the critical band in which the tone stands. Since quantization noise is more noise-like, this case is very relevant to audio coding. Unfortunately, there exist relatively few studies to provide a good model for this useful case. A couple of studies, however, do indicate an SMR threshold between 21 and 28 dB [24, 84].

10.4.1.3 Noise Masking Noise

This case of a narrow-band noise masking another one is very relevant to audio coding, but is very difficult to study because of the phase correlations between the masker and the maskee. So it is not surprising that there are few experimental results addressing this important issue. The limited data, however, do suggest an SMR threshold of about 26 dB [23, 52].

10.4.1.4 Noise Masking Tone

A narrow-band noise masking a tone (NMT) is the most widely studied case in psychoacoustics. This type of experiment was deployed to estimate the critical bandwidth and the excitation patterns of the auditory filters. The masking spread function to be discussed later in this section is largely based on this kind of study. There are a lot of experimental data and models for NMT. The SMR threshold is generally considered to vary from about 2 dB at low frequencies to 6 dB at high frequencies [102].
Table 10.3 Empirical SMR thresholds for the four masking types

Masking type    SMR threshold (dB)
TMT             15
TMN             21–28
NMN             26
NMT             2–6
10.4.1.5 Practical Masking Index

Table 10.3 summarizes the SMR thresholds for the four types of masking discussed above. From this table, we observe that tones have much weaker masking capability than noise, and noise is much more difficult to mask than tones. For audio coding, only TMN and NMT are usually considered, with the following masking index formulas:
$$I_{\mathrm{TMN}} = -14.5 - z \text{ dB}$$ (10.20)
and
$$I_{\mathrm{NMT}} = -K \text{ dB},$$ (10.21)
where $K$ is a parameter between 3 and 5 dB [32].
10.4.2 Spread of Masking

The discussion above addresses the situation where a masker masks maskee(s) within the masker's own critical band. The masking effect is no doubt the strongest in this case. However, a masking effect also exists when the maskee is not in the same critical band as the masker. An essential contributing factor to this effect is that the auditory filters are not ideal bandpass filters, so they do not completely attenuate frequency components outside the passband. In fact, the roll-off beyond the passband is rather gradual, as shown in Figs. 10.4 and 10.5. This means that a significant chunk of the masker's power is picked up by the auditory filter of the critical band where the maskee resides, making the maskee less audible. This effect is called the spread of masking. This explanation is apparently very simplistic in light of the nonlinear basilar membrane and the complex auditory neural system.

As discussed in Sect. 10.4.1, it is very difficult to study masking behavior when both the masker and the maskee are within the same critical band, especially for the important cases of TMN and NMN, so it is no surprise that it is even more difficult to deal with the spread of masking effects. For simplification, a masking spread function $\mathrm{SF}(z_r, z_e)$ is introduced to express the masking effect of a masker at critical band $z_r$ on maskees at critical band $z_e$. If the masker at critical band $z_r$ has a critical band level of $L(z_r)$, the power leaked to critical band $z_e$, or the critical band level that the maskee's auditory filter picks up from the masker, is
$$L(z_r)\,\mathrm{SF}(z_r, z_e).$$ (10.22)
Fig. 10.12 Spreading of masking into neighboring critical bands. The left maskee is audible while the right one is completely masked because it is completely below the masked threshold
If the masking index at critical band $z_e$ due to the masker is $I(z_e)$, the masked threshold at critical band $z_e$ is
$$L_T(z_r, z_e) = I(z_e)\,L(z_r)\,\mathrm{SF}(z_r, z_e).$$ (10.23)
This relationship is shown in Fig. 10.12. The basic masking spread function, also shown in Fig. 10.12, is mostly extracted from data obtained from NMT experiments [102]. It is a triangular function with a slope of about 25 dB per Bark below the masker and 10 dB per Bark above the masker. The slope of 25 dB for the lower half remains almost constant for all frequencies. The slope of 10 dB for the upper half is also almost constant for all frequencies higher than 200 Hz. Therefore, the spread function may be considered as shift-invariant across the frequency scale.

As shown in Fig. 10.13, the simple spread function in Fig. 10.12 is captured by Schroeder in the following analytic form [84]:
$$\mathrm{SF}(\Delta z) = 15.81 + 7.5(\Delta z + 0.474) - 17.5\sqrt{1 + (\Delta z + 0.474)^2} \text{ dB},$$ (10.24)
where
$$\Delta z = z_e - z_r \text{ Bark},$$ (10.25)
which signifies that the spread function is frequency shift-invariant. A modified version of the spreading function above is given below:
$$\mathrm{SF}(\Delta z) = 15.8111389 + 7.5(K\Delta z + 0.474) - 17.5\sqrt{1 + (K\Delta z + 0.474)^2} + 8\min\left\{0,\ (K\Delta z - 0.5)^2 - 2(K\Delta z - 0.5)\right\} \text{ dB},$$ (10.26)
where $K$ is a tunable parameter. This spreading function is essentially the same as the Schroeder spreading function when $K = 1$, except for the last term, which
Fig. 10.13 Comparison of Schroeder’s, modified Schroeder’s, and MPEG Psychoacoustic Model 2 spreading functions. The dips in modified Schroeder’s and MPEG Model near the top are intended to model additional nonlinear effects in auditory system
introduces a dip near the top (see Fig. 10.13) that is intended to model additional nonlinear effects in the auditory system as reported in [102]. This function is used in MPEG Psychoacoustic Model 2 [60] with the following parameter:
$$K = \begin{cases} 3, & \Delta z < 0,\\ 1.5, & \text{otherwise}. \end{cases}$$ (10.27)
The three models above are independent of the SPL or critical band level of the masker. While simple, this is not a realistic reflection of the auditory system. A model that accounts for level dependency is the spreading function used in MPEG Psychoacoustic Model 1, given below:
$$\mathrm{SF}(\Delta z, L_r) = \begin{cases} 17(\Delta z + 1) - (0.4 L_r + 6), & -3 \le \Delta z < -1,\\ (0.4 L_r + 6)\,\Delta z, & -1 \le \Delta z < 0,\\ -17\,\Delta z, & 0 \le \Delta z < 1,\\ -(\Delta z - 1)(17 - 0.15 L_r) - 17, & 1 \le \Delta z < 8,\\ 0, & \text{otherwise}, \end{cases} \text{ dB}$$ (10.28)
where $L_r$ is the critical band level of the masker [55]. This spreading function is shown in Fig. 10.14 for masker critical band levels of 20, 40, 60, 80, and 100 dB, respectively. It apparently delivers increased masking for higher masker SPL on both sides of the masking curve to match the nonlinear masking properties of the auditory system [102].
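Sketches of (10.24) and (10.28) in code; the handling of $\Delta z$ outside $[-3, 8)$ in the second function is an assumption that interprets the "otherwise" branch as contributing no masking:

```python
import numpy as np

def sf_schroeder(dz):
    """Schroeder spreading function per (10.24); dz in Bark, result in dB."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)

def sf_mpeg1(dz, Lr):
    """Level-dependent spreading function of MPEG Psychoacoustic Model 1
    per (10.28); Lr is the masker critical band level in dB."""
    dz = np.asarray(dz, dtype=float)
    out = np.full(dz.shape, -np.inf)       # assumption: no masking elsewhere
    out = np.where((dz >= -3) & (dz < -1),
                   17.0 * (dz + 1.0) - (0.4 * Lr + 6.0), out)
    out = np.where((dz >= -1) & (dz < 0), (0.4 * Lr + 6.0) * dz, out)
    out = np.where((dz >= 0) & (dz < 1), -17.0 * dz, out)
    out = np.where((dz >= 1) & (dz < 8),
                   -(dz - 1.0) * (17.0 - 0.15 * Lr) - 17.0, out)
    return out

dz = np.linspace(-3.0, 7.9, 12)
print(sf_schroeder(dz))
print(sf_mpeg1(dz, 60.0))        # higher Lr widens the masking skirts
```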
Fig. 10.14 The spreading function of MPEG Psychoacoustic Model 1 for masker critical band levels at 20, 40, 60, 80, and 100 dB, respectively. Increased masking is provided for higher masker critical band levels on both sides of the masking curve
The masking characteristics of the auditory system are also frequency dependent: the masking slope decreases as the masker frequency increases. This dependency is captured by Terhardt in the following model [91]:
$$\mathrm{SF}(\Delta z, L_r, f) = \begin{cases} (0.2 L_r + 230/f - 24)\,\Delta z, & \Delta z \ge 0,\\ 24\,\Delta z, & \text{otherwise}, \end{cases} \text{ dB},$$ (10.29)
where $f$ is the masker frequency in hertz. Figure 10.15 shows this model at $L_r = 60$ dB for $f = 100$, 200, and 1,000 Hz, respectively.
10.4.3 Global Masking Threshold

The masking spread function helps to estimate the masked threshold of a masker in one critical band over maskees in the same or a different critical band. From a maskee's perspective, it is masked by all maskers in all critical bands, including the critical band in which it resides. A question arises as to how those masked thresholds add up for maskees in a particular critical band.

To answer this question, let us consider two maskers, one at critical band $z_{r1}$ and the other at critical band $z_{r2}$, and denote their respective masking spreads at critical band $z_e$ as $L_T(z_{r1}, z_e)$ and $L_T(z_{r2}, z_e)$, respectively. If one of the masking effects is much stronger than the other, the total masking effect would obviously
Fig. 10.15 Terhardt’s spreading function at masker critical band level of Lr D 60 dB and masker frequencies of f D 100; 200; and 1;000 Hz, respectively. It shows reduced masking slope as the masker frequency increases
be dominated by the stronger one. When they are equal, how do the two masking effects "add up"? If it were intensity addition, a 3 dB gain would be expected. If it were sound pressure addition, a gain of 6 dB would be expected. An experiment using a tone masker placed at a low critical band and a critical-band-wide noise masker placed at a high critical band to mask a tone maskee at a critical band in between (they are not in the same critical band) indicates a masking effect gain of 12 dB when the two maskers are of equal power. Even when one is much weaker than the other, a gain between 6 and 8 dB is still observed. Therefore, the "addition" of masking effects is stronger than sound pressure addition and much stronger than intensity addition [102]. When the experiment is performed within the same critical band, however, the gain of the masking effect is only 3 dB, which correlates well with intensity addition.

In practical audio coding systems, intensity addition is often performed for simplicity, so the total masked threshold is calculated using the following formula:
$$L_T(z_e) = I(z_e) \sum_{\text{all } z_r} L(z_r)\,\mathrm{SF}(z_r, z_e), \quad \text{for all } z_e.$$ (10.30)
Since the threshold in quiet establishes the absolute minimum masked threshold, the global masked threshold curve is
$$L_G(z_e) = \max\{L_T(z_e),\ L_Q(z_e)\},$$ (10.31)
where $L_Q(z_e)$ is the critical band level that represents the threshold in quiet. A conservative approach to establishing this critical band level is to use the minimum threshold in the whole critical band:
$$L_Q(z_e) = \Delta f(z_e) \min_{f \in z_e} T_q(f),$$ (10.32)
where $\Delta f(z_e)$ is the critical bandwidth in hertz of critical band $z_e$.
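In linear power terms, (10.30) and (10.31) amount to a vector–matrix product followed by an elementwise maximum; a sketch (all quantities assumed to be already converted out of decibels):

```python
import numpy as np

def global_masked_threshold(L, I, SF, LQ):
    """Combine spread masking (10.30) with the threshold in quiet (10.31).

    L  : masker critical band levels, linear power, shape (Z,)
    I  : masking index per band, linear (10**(I_dB/10)), shape (Z,)
    SF : spreading matrix, linear, SF[zr, ze], shape (Z, Z)
    LQ : threshold in quiet as critical band levels, linear, shape (Z,)
    """
    LT = I * (L @ SF)           # intensity addition over all maskers, (10.30)
    return np.maximum(LT, LQ)   # never below the threshold in quiet, (10.31)
```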
10.5 Temporal Masking

The simultaneous masking discussed in Sect. 10.4 is under the condition of steady state, i.e., both the masker and the maskee are long lasting and in steady state. This steady-state assumption is true most of the time because audio signals may be characterized as consisting of quasistationary episodes which are frequently interrupted by strong transients. Transients bring on masking effects that vary with time. This type of masking is called temporal masking.

Temporal masking may be exemplified by postmasking, which occurs after a loud transient, such as a gun shot. Immediately after such an event, there are a few moments when most people cannot hear much. In addition to postmasking, there are premasking and simultaneous masking, as illustrated in Fig. 10.16.

The time period over which premasking can be measured is about 20 ms. During this period, the masked threshold gradually increases with time and reaches the level of simultaneous masking when the masker switches on. The period of strong premasking may be considered to be as long as 5 ms. Although premasking occurs before the masker is switched on, it does not mean that the auditory system can listen to the future. Instead, it is believed to be caused by the build-up time of the auditory system, which is shorter for strong signals and longer for weak signals. The shorter build-up time of the strong masker enables parts of the masker to build up quickly, which then mask parts of the weak maskee that build up slowly.
Fig. 10.16 Schematic drawing illustrating temporal masking. The masked threshold is indicated by the solid line
Postmasking is much stronger than premasking. It kicks in immediately after the masker is switched off and shows almost no decay for the first 5 ms. Afterwards, it decreases gradually with time for about 200 ms, and this decay cannot be considered exponential.

The auditory system integrates sound intensity over a period of 200 ms [102], so the simultaneous masking in Fig. 10.16 may be described by the steady-state models described in the last section. However, if the maskee is switched on shortly after the masker is switched on, there is an overshoot effect which boosts the masked threshold about 10 dB above the threshold for steady-state simultaneous masking. This effect may last as long as 10 ms.
10.6 Perceptual Bit Allocation

The optimal bit allocation strategy discussed in Chaps. 5 and 6 stipulates that the minimal overall MSQE is achieved when the MSQE for all subbands is equalized. This is based on the assumption that quantization noises in all frequency bands are "equal" in terms of their contribution to the overall MSQE, as seen in (5.22) and (6.66).

From the perspective of perceptual irrelevancy, quantization noise in each critical band is not "equal" in terms of perceived distortion, and thus its contribution to the total perceived distortion is not equal. Only quantization noise in those critical bands whose power is above the masked threshold is of perceptual importance. Therefore, the MSQE for each critical band should be normalized by the masked threshold of that critical band in order to assess its contribution to perceptual distortion.

Toward this end, let us define the critical band level of quantization noise in the subband context by rewriting the critical band level defined in (10.14) as
$$\sigma_q^2(z) = \sum_{k \in z} \sigma_{e_k}^2,$$ (10.33)
where $\sigma_{e_k}^2$ is again the MSQE for subband $k$. This critical band level of quantization noise may be normalized by the masked threshold of the same critical band using the following noise-to-mask ratio (NMR):
$$\mathrm{NMR}(z) = \frac{\sigma_q^2(z)}{L_G(z)}.$$ (10.34)
Quantization noise for each critical band normalized in this way may be considered as "equal" in terms of its contribution to the perceptually meaningful total distortion. For this reason, NMR can be viewed as the variance of perceptual quantization error. Consequently, the total average perceptual quantization error becomes
$$\sigma_p^2 = \frac{1}{Z} \sum_{\text{all } z} \mathrm{NMR}(z),$$ (10.35)
where $Z$ is the number of critical bands. Note that only critical bands with $\mathrm{NMR}(z) \ge 1$ need to be considered, because the other ones are completely inaudible and thus make no contribution to $\sigma_p^2$.

Comparing the formula above with (5.22) and (6.66), we see that, if subbands are replaced by critical bands and quantization noise by the perceptual quantization noise $\mathrm{NMR}(z)$, the derivation of optimal bit allocation and coding gain in Sect. 5.2 applies. So the optimal bit allocation strategy becomes: allocate bits to individual critical bands so that the NMR for all critical bands is equalized. Since $\mathrm{NMR}(z) \le 1$ for a critical band $z$ means that quantization noise in that critical band is completely masked, the bit allocation strategy should ensure that $\mathrm{NMR}(z) \le 1$ for all critical bands, should the bit resources be abundant enough.
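A greedy sketch of this strategy; the 6 dB-per-bit step and the stopping rule are simplifying assumptions, not a production allocator:

```python
import numpy as np

def allocate_bits(noise, LG, budget, step_db=6.0):
    """Give one bit at a time to the critical band with the worst NMR (10.34),
    assuming each bit lowers that band's noise power by about step_db dB.
    `noise` and `LG` are linear powers per critical band."""
    noise = np.asarray(noise, dtype=float).copy()
    bits = np.zeros(len(noise), dtype=int)
    gain = 10.0 ** (-step_db / 10.0)
    for _ in range(budget):
        nmr = noise / LG
        z = int(np.argmax(nmr))
        if nmr[z] <= 1.0:        # every band already masked: NMR(z) <= 1
            break
        noise[z] *= gain
        bits[z] += 1
    return bits
```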
10.7 Masked Threshold in Subband Domain

The psychoacoustic experiments and theory regarding masking in the frequency domain are mostly based on the Fourier transform, so the DFT is most likely the frequency transform in perceptual models built for audio coding. On the other hand, subband filters are the preferred method for data modeling, so there is a need to translate the masked threshold in the DFT domain into the subband domain. While the frequency scale correspondence between the DFT and subbands is straightforward, the magnitude scales are not obvious.

One general approach to addressing this issue is to use the relative value between signal power and masked threshold, which is essentially the SMR threshold defined in (10.17). Since the masked threshold is given by (10.31), the SMR threshold may be written as
$$T_{\mathrm{SMR}}(z) = \frac{\sigma_{\mathrm{Masker}}^2(z)}{L_G(z)},$$ (10.36)
where $\sigma_{\mathrm{Masker}}^2(z)$ is the power of the masker in critical band $z$. This ratio should be the same in either the DFT or the subband domain, so it can be used to obtain the masked threshold in the subband domain:
$$L'_G(z) = \frac{\sigma'^2_{\mathrm{Masker}}(z)}{T_{\mathrm{SMR}}(z)},$$ (10.37)
where $\sigma'^2_{\mathrm{Masker}}(z)$ is the signal power in the subband domain within critical band $z$.
10.8 Perceptual Entropy

Suppose that the $\mathrm{NMR}(z)$ for all critical bands is equalized to $\mathrm{NMR}_0$; then the total MSQE for all subbands within critical band $z$ is
$$\sigma_q^2(z) = L_G(z)\,\mathrm{NMR}_0,$$ (10.38)
according to (10.34). Since the normalization of quantization error for all subbands within a critical band by the masked threshold also means that all subbands are quantized together with the same quantization step size, the MSQE for each subband is the same:
$$\sigma_{e_k}^2 = \frac{\sigma_q^2(z)}{k(z)} = \frac{L_G(z)\,\mathrm{NMR}_0}{k(z)}, \quad \text{for all } k \in z,$$ (10.39)
where $k(z)$ represents the number of subbands within critical band $z$. Therefore, the SNR for a subband in critical band $z$ is
$$\mathrm{SNR}(k) = \frac{\sigma_{y_k}^2}{\sigma_{e_k}^2} = \frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\mathrm{NMR}_0}, \quad \text{for all } k \in z.$$ (10.40)
Due to (2.43), the number of bits assigned to subband $k$ is
$$r_k = \frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\mathrm{NMR}_0}\right) - \frac{a}{b}, \quad \text{for all } k \in z,$$ (10.41)
so the total number of bits assigned to all subbands within critical band $z$ is
$$r(z) = \sum_{k \in z}\left[\frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\mathrm{NMR}_0}\right) - \frac{a}{b}\right].$$ (10.42)
The average bit rate is then
$$R = \frac{1}{M} \sum_{\text{all } z}\sum_{k \in z}\left[\frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\mathrm{NMR}_0}\right) - \frac{a}{b}\right].$$ (10.43)
For perceptual transparency, the quantization noise in each critical band must be below the masked threshold, i.e.,
$$\mathrm{NMR}_0 \le 1.$$ (10.44)
A smaller $\mathrm{NMR}_0$ means a lower quantization noise level relative to the masked threshold, so it requires more bits to be allocated to the critical bands. Therefore, setting $\mathrm{NMR}_0 = 1$ requires the least number of bits and at the same time ensures that quantization noise is just at the masked threshold. This leads to the following minimum average bit rate:
$$R = \frac{1}{M} \sum_{\text{all } z}\sum_{k \in z}\left[\frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)}\right) - \frac{a}{b}\right].$$ (10.45)
The derivation above does not assume any particular quantization scheme because a general set of parameters $a$ and $b$ has been used. If uniform quantization is used and subband samples are assumed to have the matching uniform distribution, $a$ and $b$ are determined by (2.45) and (2.44), respectively, so the minimum average bit rate above becomes
$$R = \frac{1}{M} \sum_{\text{all } z}\sum_{k \in z} 0.5 \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)}\right).$$ (10.46)
To avoid negative values that the logarithmic function may produce, we add one to its argument to arrive at
$$R = \frac{1}{M} \sum_{\text{all } z}\sum_{k \in z} 0.5 \log_2\!\left(1 + \frac{\sigma_{y_k}^2\,k(z)}{L_G(z)}\right),$$ (10.47)
which is the perceptual entropy proposed by Johnston [34].
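A direct transcription of (10.47), assuming the global masked thresholds per critical band are already available:

```python
import numpy as np

def perceptual_entropy(P, LG, bands):
    """Perceptual entropy per (10.47), in bits per sample.

    P     : DFT power spectrum (signal variances), shape (M,)
    LG    : global masked threshold per critical band, linear
    bands : list of index arrays; bands[z] holds the subbands k in band z
    """
    M = len(P)
    pe = 0.0
    for z, idx in enumerate(bands):
        kz = len(idx)           # k(z): number of subbands in critical band z
        pe += np.sum(0.5 * np.log2(1.0 + P[idx] * kz / LG[z]))
    return pe / M
```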
10.9 A Simple Perceptual Model

To calculate perceptual entropy, Johnston proposed a simple perceptual model which has influenced many perceptual models deployed in practical audio coding systems [34]. This model is described here.

A Hann window [67] is first applied to a chunk of 2,048 input audio samples and the windowed samples are transformed into frequency coefficients using a 2,048-point DFT. Since a real input signal produces a DFT spectrum that is symmetric with respect to the zero frequency, only the first half of the DFT coefficients needs to be considered. Therefore, the number of subbands $M$ is 1,024.

The magnitude squares of the DFT coefficients $P(k),\ k = 0, 1, \ldots, M-1$, may be considered as the power spectrum $S_{xx}(e^{j\omega})$ of the input signal, so they are used to calculate the spectral flatness measure defined in (5.65) for each critical band, which is denoted as $\gamma_x^2(z)$ for critical band $z$. If the input signal is noise-like, its spectrum is flat, so the spectral flatness measure should be close to one. If the input signal is tone-like, its spectrum is full of peaks, so the spectral flatness measure should be close to zero. Therefore, the spectral flatness measure is a good inverse measure of the tonal quality of the input signal, and can thus be used to derive a tonality index such as the following:
$$T(z) = \min\left\{\frac{\gamma_x^2(z)\ \mathrm{(dB)}}{-60},\ 1\right\}.$$ (10.48)
Note that the $\gamma_x^2(z)$ above is in decibels. Since the spectral flatness measure is always positive and less than one, its decibel value is always negative. Therefore, the
tonality index is always positive. It is also limited to at most one in the equation above.

The tonality index indicates that the spectral components in critical band $z$ are tone-like to degree $T(z)$ and noise-like to degree $1 - T(z)$, so it is used to weight the masking indexes given in (10.20) and (10.21), respectively, to give a total masking index of
$$I(z) = T(z) I_{\mathrm{TMN}}(z) + (1 - T(z)) I_{\mathrm{NMT}}(z) = -(14.5 + z)T(z) - K(1 - T(z)) \text{ dB},$$ (10.49)
where $K$ is set to a value of 5.5 dB.

The magnitude squares of the DFT coefficients $P(k),\ k = 0, 1, \ldots, M-1$, may be considered as the signal variances $\sigma_{y_k}^2,\ k = 0, 1, \ldots, M-1$, so the critical band level defined in (10.14) may be written in the current context as
$$\sigma_y^2(z) = \sum_{k \in z} P(k).$$ (10.50)
Using (10.30) with a masking spread function, the cumulative masking effect for each critical band can be obtained as
$$L_T(z) = I(z) \sum_{\text{all } z_r} \sigma_y^2(z_r)\,\mathrm{SF}(z_r, z), \quad \text{for all } z.$$ (10.51)
Denoting
$$E(z) = \sum_{\text{all } z_r} \sigma_y^2(z_r)\,\mathrm{SF}(z_r, z),$$ (10.52)
all zr we obtain the total masked threshold LT .z/ D 10 log10 E.z/ C I.z/ D 10 log10 E.z/ .14:5 C z/T .z/ K.1 T .z// dB
(10.53)
for all critical bands z. As usual, this threshold is combined with the threshold in quiet using (10.31) to produce a global threshold, which is then substituted into (10.47) to give the perceptual entropy. A minimal code sketch of these steps for one critical band is given below.
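The sketch below is an illustration, not a reference implementation: the function name, the argument names, and the assumption that the spread band power E(z) of (10.52) has already been computed are all choices made here for clarity.

#include <math.h>

/* Masked threshold in dB of critical band z per (10.48)-(10.53).
 * sfm_db       : spectral flatness measure of band z in dB (always <= 0)
 * spread_power : E(z) of (10.52), the spread critical-band power
 * z            : critical band index in Bark                           */
double masked_threshold_db(double sfm_db, double spread_power, int z)
{
    const double K = 5.5;             /* noise-masking-tone index, dB */
    double T = sfm_db / -60.0;        /* tonality index (10.48) */
    if (T > 1.0) T = 1.0;
    /* total masking index (10.49) */
    double I = -(14.5 + z) * T - K * (1.0 - T);
    /* total masked threshold (10.53) */
    return 10.0 * log10(spread_power) + I;
}

Combining this threshold with the threshold in quiet and evaluating (10.47) over all subbands then yields the perceptual entropy.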
Chapter 11
Transients
An audio signal often consists of quasistationary episodes, each including a number of tonal frequency components, which are frequently interrupted by dramatic transients. To achieve optimal energy compaction and thus coding gain, a filter bank with fine frequency resolution is necessary to resolve the tonal components or fine frequency structures in quasistationary episodes. But such a filter bank is an ill fit for transients, which often last for no more than a few samples and hence require fine time resolution for optimal energy compaction. Therefore, filter banks with both good time and good frequency resolution are needed to effectively code audio signals. According to the Fourier uncertainty principle, however, a filter bank cannot have fine frequency resolution and fine time resolution simultaneously. One approach to mitigating this problem is to adapt the resolution of a filter bank in time to match the conflicting resolution requirements posed by transients and quasistationary episodes. This chapter presents a variety of contemporary schemes for switching the time–frequency resolution of the MDCT, the preferred filter bank for audio coding. Also presented are practical methods for mitigating the pre-echo artifacts that sometimes occur when the time resolution is not good enough to effectively deal with transients. Finally, switching the time–frequency resolution of a filter bank requires knowledge of the occurrence and location of transients. Practical methods for detecting and locating transients are presented at the end of this chapter.
11.1 Resolution Challenge

Audio signals mostly consist of quasistationary episodes, such as the one shown at the top of Fig. 11.1, which often include a number of tonal frequency components. For effective energy compaction to maximize coding gain, these tonal frequency components may be resolved or separated using filter banks, as shown by the 1,024-subband (middle) and 128-subband (bottom) MDCT magnitude spectra in Fig. 11.1. Apparently, the 1,024-subband MDCT is able to resolve the frequency components much better than the 128-subband MDCT, thus having a clear advantage in energy compaction.
Fig. 11.1 An episode of quasistationary audio signal (top) and its MDCT spectra of 1,024 (middle) and 128 (bottom) subbands
It can be expected that the closer together the frequency components are, the more subbands are needed to resolve them. A filter bank with a large number of subbands, conveniently referred to as a long filter bank in this book, is able to deliver this advantage because using a large number of subbands to represent the full frequency range, as determined by the sample rate, means that each subband is allocated a small frequency range, hence the frequency resolution is finer. For example, at a 44.1 kHz sample rate a 1,024-subband filter bank allocates only 22050/1024 ≈ 21.5 Hz to each subband. Also, such a filter bank has a long prototype filter that covers a large number of time samples, hence is able to resolve minute variations of the signal with frequency.
Fig. 11.2 A transient that interrupts quasistationary episodes of an audio signal (top) and its MDCT spectra of 1,024 (middle) and 128 (bottom) subbands, respectively
Unfortunately, quasistationary episodes of an audio signal are intermittently interrupted by dramatic transients, as shown at the top of Fig. 11.2. Applying the same 1,024-subband and 128-subband MDCT produces the spectra shown in the middle and bottom of Fig. 11.2, respectively. Now the short MDCT resolves the spectral valleys better than the long MDCT, so it is more effective in energy compaction.

There is another reason for the improved overall energy compaction performance of a short filter bank. Transients are well known for causing flat spectra, hence a large
spectral flatness measure and bad coding gain, according to (5.65) and (6.81). As long as the prototype filter covers a transient attack, this spectral flattening effect is reflected in the whole block of subband samples. For overlapping filter banks, it affects multiple blocks of subband samples whose prototype filters cover the transient. For a long filter bank with a long prototype filter, this flattening effect thus degrades the coding gain of a large number of subband samples. Using a short filter bank with a short prototype filter, however, helps to confine the spectral flattening effect to a small number of subband samples; before and after those affected samples, the coding gain may go back to normal. Therefore, applying a short filter bank to cover the transients improves the overall coding gain.
11.1.1 Pre-Echo Artifacts

Another reason that favors short filter banks when dealing with transients is pre-echo artifacts. Quantization is the step in an audio coder that compresses the signal most effectively, but it also introduces quantization noise. Under a filter bank scheme, the quantization noise introduced in the subband or frequency domain becomes almost uniformly distributed in the time domain after the audio signal is reconstructed from the quantized subband samples. This quantization noise is shown at the top of Fig. 11.3, which is the difference between the reconstructed signal and the original signal.
Fig. 11.3 Pre-echo artifacts produced by a long 1024-subband MDCT. The top figure shows the error between the reconstructed signal and the original, which is, therefore, the quantization noise. The bottom shows the reconstructed signal alone; the quantization noise before the transient attack is clearly visible and audible
When looking at the reconstructed signal alone (see bottom of Fig. 11.3), however, the quantization noise is not visible after the transient attack because it is visually masked by the signal. For the ear, it is also not audible due to simultaneous masking and postmasking. Before the transient attack, however, it is clearly visible. For the ear, it is also very audible and frequently very annoying because it is supposed to be quiet before the transient attack (see the original signal at the top of Fig. 11.2). This frequently annoying quantization noise that occurs before the transient attack is called pre-echo artifacts.

One approach to mitigating pre-echo artifacts is to use a short filter bank whose fine time localization or resolution helps at least limit the extent to which the artifacts appear. For example, when a short 128-subband MDCT is used to process the same transient signal, it gives the reconstructed signal and the quantization noise shown in Fig. 11.4. The quantization noise that occurs before the transient attack is still visible, but is much shorter, less than 5 ms in fact. As discussed in Sect. 10.5, the period of strong premasking may be considered as long as 5 ms, so the short pretransient quantization noise is unlikely to be audible.

For a 128-subband MDCT, the window size is 256 samples, which covers a period of 256/44.1 ≈ 5.8 ms for an audio signal of 44.1 kHz sample rate.
Fig. 11.4 Pre-echo artifacts produced by a short 128-subband MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but may be inaudible because it could be shorter than the premasking period, which may last as long as 5 ms
In order for significant quantization noise to build up, the MDCT window must cover a significant amount of signal energy, so the number of input samples after the transient attack that are still covered by the MDCT window must be significant. This means that the number of input samples before the transient attack is much smaller than 256, so the period of pre-echo artifacts is much shorter than 5.8 ms and thus very likely to be masked by premasking. Therefore, a 128-subband MDCT is likely to suppress most pre-echo artifacts.
11.1.2 Fourier Uncertainty Principle

Now it is clear that a filter bank needs to have both fine frequency resolution and fine time resolution to effectively encode both transients and quasistationary episodes of audio signals. The time–frequency resolution of a filter bank is largely determined by its number of subbands. For modulated filter banks, this is often reflected in the length of the prototype filter: a long prototype filter has good frequency resolution but poor time resolution, while a short prototype filter (said to be compactly supported) has good time resolution but poor frequency resolution. No filter can provide both good time and good frequency resolution at the same time due to the Fourier uncertainty principle, which is related to the Heisenberg uncertainty principle [90].

Without loss of generality, let h(t) denote a signal that is normalized as follows:

\int_{-\infty}^{\infty} |h(t)|^2 \, dt = 1   (11.1)

and H(f) its Fourier transform. The dispersions about zero in the time and frequency domains may be defined by

D_t = \int_{-\infty}^{\infty} t^2 |h(t)|^2 \, dt   (11.2)

and

D_f = \int_{-\infty}^{\infty} f^2 |H(f)|^2 \, df,   (11.3)

respectively. They obviously represent the energy concentration of h(t) and H(f) toward zero in the time and frequency domains, respectively, hence their respective time and frequency resolutions. The Fourier uncertainty principle states that [75]

D_t D_f \ge \frac{1}{16\pi^2}.   (11.4)

The equality is attained only in the case that h(t) is a Gaussian function.
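As a check of how tight the bound (11.4) is, consider the unit-energy Gaussian family below; the normalization chosen here is one convenient option, not the only one:

h(t) = (2\lambda)^{1/4} e^{-\pi\lambda t^2}, \qquad H(f) = (2/\lambda)^{1/4} e^{-\pi f^2/\lambda}.

Evaluating the standard Gaussian integrals gives

D_t = \int_{-\infty}^{\infty} t^2 \sqrt{2\lambda}\, e^{-2\pi\lambda t^2}\, dt = \frac{1}{4\pi\lambda}, \qquad D_f = \int_{-\infty}^{\infty} f^2 \sqrt{2/\lambda}\, e^{-2\pi f^2/\lambda}\, df = \frac{\lambda}{4\pi},

so D_t D_f = 1/(16\pi^2) for every \lambda > 0: the parameter \lambda merely trades time concentration against frequency concentration and cannot improve both at once.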
Although the Gaussian function provides the optimal simultaneous time–frequency resolution under the uncertainty principle, that resolution is still within the limit stipulated by the principle, hence not the level of time–frequency resolution desired for audio coding. In addition, the Gaussian has infinite support and clearly does not satisfy the power-complementary condition for use in a CMFB, so it is not suitable for practical audio coding systems.

Providing simultaneous time–frequency resolution is one of the motivations behind the creation of the wavelet transform, which may be viewed as a nonuniform filter bank. Its high-frequency basis functions are short to give good time resolution for high-frequency components, and its low-frequency basis functions are long to provide good frequency resolution for low-frequency components, so it does not violate the Fourier uncertainty principle. This approach can address the time–frequency resolution problem in many areas, such as images and video, but it is not very suitable for audio. Audio signals often contain tones at high frequencies, which require fine frequency resolution at high frequencies and thus cannot be effectively handled by a wavelet transform.
11.1.3 Adaptation of Resolution with Time

A general approach for mitigating this limitation on time–frequency resolution is to adapt the time–frequency resolution of a filter bank with time: deploy high frequency resolution to code quasistationary episodes and high time resolution to localize transients. This may be implemented using a hybrid filter bank in which each subband of the first-stage filter bank is cascaded with a transform as the second stage, as shown in Fig. 11.5. Around a transient attack, only the first stage is deployed, whose good time resolution helps to isolate the transient attack and limit pre-echo artifacts.
Fig. 11.5 A hybrid filter bank for adaptation of time–frequency resolution with time. For transients, only the first stage is deployed to provide limited frequency resolution but better time localization. For quasistationary episodes, the second stage is cascaded to each subband of the first stage to boost frequency resolution
For quasistationary episodes, the second stage is deployed to further decompose the subband samples from the first stage so that a much better frequency resolution is delivered. It is desirable that the first-stage filter bank have good time resolution while the second-stage transforms have good frequency resolution.

A variation of this scheme is to replace the transform in the second stage with linear prediction, as shown in Fig. 11.6. The linear prediction in each subband is switched on whenever the prediction gain is large enough and off otherwise. This approach is deployed by DTS Coherent Acoustics, where the first stage is a 32-band CMFB with a 512-tap prototype filter [88]. The DTS scheme suffers from the poor time resolution of the first filter bank stage because 512 taps translate into 512/44.1 ≈ 11.6 ms for a sample rate of 44.1 kHz, far longer than the effective premasking period of no more than 5 ms.

A more involved but computationally inexpensive scheme is to switch the number of subbands of an MDCT in such a way that a small number of subbands is deployed to code transients and a large number of subbands to code quasistationary episodes. It seems to have become the dominant scheme in audio coding for the adaptation of time–frequency resolution with time, as can be seen in Table 11.1.

This switched MDCT is often cascaded to the output of a CMFB in some audio coding algorithms. For example, it is deployed by MPEG-1&2 Layer III [55, 56], whose first stage is a 32-band CMFB with a prototype filter of 512 taps and whose second stage is an MDCT that switches between 6 and 18 subbands.
Fig. 11.6 Cascading linear predictors with a filter bank to adapt time–frequency resolution with time. Linear prediction is optionally applied to the subband samples in each subband if the resultant prediction gain is sufficiently large
Table 11.1 Switched-window MDCT used by various audio coding algorithms

Audio coder                 Number of subbands
Dolby AC-2A [10]            128/512
Dolby AC-3 [11]             128/256
Sony ATRAC [92]             32/128 and 32/256
Lucent PAC [36]             128/1,024
MPEG 1&2 Layer 3 [55, 56]   6/18
MPEG 2&4 AAC [59, 60]       128/1,024
Xiph.Org Vorbis [96]        64, 128, 256, 512, 1,024, 2,048, 4,096 or 8,192
Microsoft WMA [95]          64, 128, 256, 512, 1,024 or 2,048
Digirise DRA [98]           128/1,024
In this configuration, the resolution adaptation is actually achieved through the switched MDCT. This scheme suffers from poor time resolution because the combined prototype filter length is 512 + 2 × 6 × 32 = 896 taps even when the MDCT is in short mode. This amounts to 896/44.1 ≈ 20.3 ms for a sample rate of 44.1 kHz, which is far longer than the effective premasking period of no more than 5 ms.

A similar scheme is used by MPEG-2&4 AAC in its gain control tool box [59, 60]. It deploys as the first stage a 4-subband CMFB with a 96-tap prototype filter and as the second stage an MDCT that switches between 32 and 256 subbands. This scheme seems to be able to barely avoid pre-echo artifacts due to its short combined prototype filter of 96 + 2 × 32 × 4 = 352 taps, which amounts to 352/44.1 ≈ 8.0 ms for a sample rate of 44.1 kHz.

A more sophisticated scheme is used by Sony ATRAC, deployed in its MiniDisc and Sony Dynamic Digital Sound (SDDS) cinematic sound systems, which involves the cascading of three stages of filter banks [92]. The first stage is a quadrature mirror filter bank (QMF) with two subbands. Its low-frequency subband is connected to another two-subband QMF, the outputs of which are connected to MDCTs that switch between 32 and 128 subbands. The high-frequency subband from the first-stage QMF is connected to an MDCT that switches between 32 and 256 subbands. The combined short prototype filter lengths are 2.9 ms for the low-frequency subbands and 1.45 ms for the high-frequency subband, so they are within the safe zone of premasking.

In place of the short MDCT, Lucent EPAC deploys a wavelet transform to handle transients. It still uses a 1024-subband MDCT to process quasistationary episodes [87].
11.2 Switched-Window MDCT

A switched-window MDCT, or switched MDCT, operates in long filter bank mode (with a long window function) to handle quasistationary episodes of an audio signal, switches to short filter bank mode (with a short window function) around a transient attack, and reverts to the long mode afterwards. A widely used scheme is 1,024 subbands for the long mode and 128 subbands for the short mode.
11.2.1 Relaxed PR Conditions and Window Switching

To ensure perfect reconstruction, the linear phase condition (7.11) and the power-complementary condition (7.80) must be satisfied. Since these conditions impose symmetry and power-complementary constraints on each half of the window function, it seems impossible to change either the window shape or the number of subbands.
Fig. 11.7 The PR conditions only apply to two window halves that operate on the current block of input samples, so the second half of the right window can have other shapes that satisfy the PR conditions on the next block
An inspection of the MDCT operation in Fig. 7.15, however, indicates that each block of input samples is operated on by only two window functions. This is illustrated in Fig. 11.7, where the middle block marked by the dotted lines is considered the current block. The first half of the left window, denoted as h_L(n), operates on the previous block and is thus not related to the current block. Similarly, the second half of the right window, denoted as h_R(n), operates on the next block and is thus not related to the current block, either. Only the second half of the left window h_L(n) and the first half of the right window h_R(n) operate on the current block. To ensure perfect reconstruction for the current block of samples, only the window halves covering it need to be constrained by the PR conditions (7.11) and (7.80). Therefore, they can be rewritten as

h_R(n) = h_L(2M − 1 − n)   (11.5)

and

h_R^2(n) + h_L^2(M + n) = \alpha,   (11.6)

for n = 0, 1, ..., M − 1, respectively.
Let us apply a variable change of n = M + m to the index of the left window so that m = 0 corresponds to the middle of the left window, or the start of its second half:

\bar{h}_L(m) = h_L(M + m), −M \le m \le M − 1.   (11.7)

This enables us to rewrite the PR conditions as

h_R(n) = \bar{h}_L(M − 1 − n)   (11.8)

and

h_R^2(n) + \bar{h}_L^2(n) = \alpha,   (11.9)

for n = 0, 1, ..., M − 1, respectively. Therefore, the linear phase condition (11.8) states that the second half of the left window and the first half of the right window are symmetric to each other with respect to the center of the current block, and the power-complementary condition (11.9) states that these two window halves are power complementary at each point within the current block.

Since these PR conditions are imposed on the window halves that apply only to the current block of samples, the window halves in the next block, or in any other block, are totally independent of those of the current block. Therefore, different blocks can have totally different sets of window halves, and these can change in either shape or length from block to block. In other words, windows can be switched from block to block. The relaxed conditions are also easy to verify numerically; see the sketch below.
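The following sketch checks (11.8) and (11.9) for the sine window; the array layout (each window half stored as M consecutive samples) and the choice \alpha = 1 are assumptions made for illustration.

#include <math.h>
#include <stdio.h>

#define M 1024
static const double PI = 3.14159265358979323846;

/* Check the relaxed PR conditions for one block: hbarL[] is the second
 * half of the left window (n = 0 at the block start) and hR[] the first
 * half of the right window, both of length M.                          */
static int check_pr(const double *hbarL, const double *hR, double alpha)
{
    for (int n = 0; n < M; n++) {
        /* linear phase (11.8): mirror symmetry about the block center */
        if (fabs(hR[n] - hbarL[M - 1 - n]) > 1e-12) return 0;
        /* power complementarity (11.9) */
        if (fabs(hR[n] * hR[n] + hbarL[n] * hbarL[n] - alpha) > 1e-12) return 0;
    }
    return 1;
}

int main(void)
{
    static double hbarL[M], hR[M];
    for (int n = 0; n < M; n++) {
        hbarL[n] = sin(PI * (M + n + 0.5) / (2.0 * M)); /* left window, 2nd half */
        hR[n]    = sin(PI * (n + 0.5) / (2.0 * M));     /* right window, 1st half */
    }
    printf("PR conditions %s\n", check_pr(hbarL, hR, 1.0) ? "hold" : "fail");
    return 0;
}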
11.2.2 Window Sequencing

While the selection of window halves is independent from block to block, both halves of a window are applied together to two blocks of samples to generate one block of MDCT coefficients. This imposes a constraint on how window halves transition from one block to another. To see this, let us consider the current block of input samples marked by the dotted lines in Fig. 11.7. It is used together with the previous block by the left window to produce the current block of MDCT coefficients. Once this block of MDCT coefficients is generated and transferred to the decoder, the second half of the left window is determined and cannot be changed. When it is time to produce the next block of MDCT coefficients, the second half of the left window was therefore already determined when the current block of MDCT coefficients was generated. The PR conditions then dictate that its symmetric and power-complementary counterpart must be used as the first half of the right window. Therefore, the first half of the right window is also completely determined; there is no flexibility for change.
This, however, does not restrict the selection of the second half of the right window, which can be totally different from the first half, thus enabling switching to a different window.
11.3 Double-Resolution Switched MDCT

The simplest form of window switching is obviously the case of using a long window to process quasistationary episodes and a short window to deal with transients. This set of two window lengths represents two modes of time–frequency resolution.
11.3.1 Primary and Transitional Windows

The first step toward this simple case of window switching is to build a pair of long and short primary windows using a window design procedure. Conveniently denoted as h^M(n) and h^m(n), respectively, where M designates the block length or the number of subbands enabled by the long window and m that enabled by the short window, they must satisfy the original PR conditions (7.11) and (7.80), i.e., they are symmetric and power-complementary with respect to their respective middle points. Without loss of generality, the sine window can always be used for this purpose; the resulting windows are shown as the first and second windows in Fig. 11.8, where m = 128 and M = 1,024.

Since the long and short windows are totally different in length, corresponding to different block lengths, transitional windows are needed to bridge the transition between block sizes. To maintain a fairly good frequency response, a smooth transition in window shape is highly desirable and abrupt changes should be avoided. An example set of such transitional windows is given in Fig. 11.8 as the last three windows. In particular, window WL L2S is a long window for transition from a long window to a short window, WL S2L for transition from a short window to a long window, and WL S2S for transition from a short window to a short window, respectively.

A moniker of the form WX Y2Z has been used to identify the different windows above, where "X" designates the total window length, "Y" the left half, and "Z" the right half of the window. For example, WL L2S designates a long window for transition from a long window to a short window, and WS S2S designates the short primary window, which may be considered as a short window for "transition" from a short window to a short window.

Mathematically, let us denote the left and right halves of the long window as h^M_L(n) and h^M_R(n) and those of the short window as h^m_L(n) and h^m_R(n), respectively. Then the two primary windows can be rewritten as
Fig. 11.8 Window functions produced from a set of 2,048-tap and 256-tap sine windows (top to bottom: WS S2S, WL L2L, WL L2S, WL S2L, WL S2S). Note that the length of WS S2S is only 256 taps
WS S2S:  h^S_{S2S}(n) = \begin{cases} h^m_L(n), & 0 \le n < m; \\ h^m_R(n), & m \le n < 2m \end{cases}   (11.10)

and

WL L2L:  h^L_{L2L}(n) = \begin{cases} h^M_L(n), & 0 \le n < M; \\ h^M_R(n), & M \le n < 2M. \end{cases}   (11.11)
The transitional windows can then be expressed in terms of the long and short window halves as follows:

WL L2S:  h^L_{L2S}(n) = \begin{cases} h^M_L(n), & 0 \le n < M; \\ 1, & M \le n < 3M/2 − m/2; \\ h^m_R(n), & 3M/2 − m/2 \le n < 3M/2 + m/2; \\ 0, & 3M/2 + m/2 \le n < 2M. \end{cases}   (11.12)

WL S2L:  h^L_{S2L}(n) = \begin{cases} 0, & 0 \le n < M/2 − m/2; \\ h^m_L(n), & M/2 − m/2 \le n < M/2 + m/2; \\ 1, & M/2 + m/2 \le n < M; \\ h^M_R(n), & M \le n < 2M. \end{cases}   (11.13)

WL S2S:  h^L_{S2S}(n) = \begin{cases} 0, & 0 \le n < M/2 − m/2; \\ h^m_L(n), & M/2 − m/2 \le n < M/2 + m/2; \\ 1, & M/2 + m/2 \le n < 3M/2 − m/2; \\ h^m_R(n), & 3M/2 − m/2 \le n < 3M/2 + m/2; \\ 0, & 3M/2 + m/2 \le n < 2M. \end{cases}   (11.14)
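These piecewise definitions translate directly into code. The sketch below builds the long-to-short transitional window of (11.12) from sine-window halves; the function name and the choice of the sine window are assumptions, and the remaining windows, including the brief window of Sect. 11.5, follow the same cut-and-paste pattern.

#include <math.h>

#define PI 3.14159265358979323846

/* Build the 2M-tap long-to-short transitional window of (11.12).
 * M = long block length, m = short block length (m < M), both even. */
void build_WL_L2S(double *w, int M, int m)
{
    for (int n = 0; n < 2 * M; n++) {
        if (n < M)                          /* long sine-window left half */
            w[n] = sin(PI * (n + 0.5) / (2.0 * M));
        else if (n < 3 * M / 2 - m / 2)     /* flat top */
            w[n] = 1.0;
        else if (n < 3 * M / 2 + m / 2)     /* short sine-window right half */
            w[n] = sin(PI * (m + (n - (3 * M / 2 - m / 2)) + 0.5) / (2.0 * m));
        else                                /* trailing zeros */
            w[n] = 0.0;
    }
}

With M = 1,024 and m = 128, this reproduces the third window of Fig. 11.8.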
Figure 11.9 shows some window switching examples using the primary and transitional windows of Fig. 11.8. Transitional windows are placed back to back in the second and third rows, and eight short windows are placed in the last row.

As shown in Fig. 11.8, the long-to-short transitional windows WL L2S and WL S2S provide for a short window half to be placed in the middle of their second half. In the coordinates of the long transitional window, the short window must be placed within [3M/2 − m/2, 3M/2 + m/2) (see (11.12) and (11.14), respectively).
Fig. 11.9 Some possible window sequence examples
After 3M/2 + m/2, the long transitional window does not impose any constraint because its window values have become zero, so other methods can be used to represent the samples beyond 3M/2 + m/2. The same argument applies to the first half of the short-to-long transitional windows WL S2L and WL S2S as well (see Fig. 11.8), leading to no constraint placed on samples before M/2 − m/2, so other signal representation methods can be accommodated for the samples before M/2 − m/2.

MDCT with the short window function WS S2S is a simple method for representing those samples. Due to the overlapping of m samples between the short windows and the transitional windows, M/m short windows need to be used to represent the samples between the long-to-short (WL L2S and WL S2S) and short-to-long (WL S2L and WL S2S) transitional windows; see the last row of Fig. 11.9, for example. These M/m short windows amount to a total of M samples, the same as the block length of the long windows, so this window switching scheme is amenable to maintaining a constant frame size, which is highly desirable for convenient real-time processing of signals. For the sake of convenience, a long block may sometimes be referred to as the frame in the remainder of this book. Under such a scheme, the short window represents the fine time resolution mode and the long windows, including the transitional ones, represent the fine frequency resolution mode, thus amounting to a double-resolution switched MDCT.

If a constant frame size is forgone, other numbers of short windows can be used. This typically involves more sophisticated window sequencing and buffer management.

If the window size of the short window is set to zero, i.e., m = 0, there is absolutely no need for any form of short window. The unconstrained regions before M/2 and after 3M/2 are then open for any kind of representation method. For example, a DCT-IV or a wavelet transform may be deployed to represent the samples in those regions.
11.3.2 Look-Ahead and Window Sequencing

It is clear now that the interval for possible short window placement is [M/2 + m/2, 3M/2 + m/2) (see the last row of Fig. 11.9). Referred to as the transient detection interval, it obviously straddles two frames: about half of the short windows are placed in the current frame and the other half in the next frame. If there is a transient in the interval, the short windows should be placed to cover it; otherwise, one of the long windows should be used.

In preparation for placing the short windows in this interval, a long-to-short transitional window WL X2S needs to be used in the current frame, where the "X" can be either "L" or "S" and is determined in the previous frame. Therefore, to determine the window for the current frame, we need to look ahead into the transient detection interval for the presence or absence of transients.
Table 11.2 Possible window switching scenarios

Current window half   Next window half
WL X2S                WS S2S
WL X2L                WL L2X
WS S2S                WS S2S
WS S2S                WL S2X
Since this interval ends in the middle of the next frame, we need a look-ahead interval of up to half a frame, which causes additional coding delay. This look-ahead interval of half a frame is also necessary for other window switching situations, such as the transition from short windows to long windows (see the last row of Fig. 11.9 again).

If there is a transient in the transient detection interval, the short windows have to be placed to cover it. Since the short windows only cover the second half of the current frame, the first half of the current frame needs to be covered by a WL X2S long window. The short windows also cover the first half of the next frame, whose second half is covered by a WL S2X long window. This completes the transition from a long window to short windows and back to a long window. Table 11.2 summarizes the possible window transition scenarios between two frames.

The decoder needs to know exactly what window the encoder used to generate the current block of MDCT coefficients in order to use the same window to perform the inverse MDCT, so information about window sequencing needs to be included in the bit stream. This can be done using three bits to convey a window index that identifies the windows in (11.10) to (11.14) and shown in Fig. 11.8. If window WL S2S is forbidden, only two bits are needed to convey the window index.
11.3.3 Implementation

The double-resolution switched IMDCT is widely used in a variety of audio coding standards; its implementation is straightforward and fully illustrated by the third row of Fig. 11.9. In particular, starting from the second window in that row (the WL L2S window), the IMDCT may be implemented in the following steps (a code sketch follows this list):

1. Copy the (M − m)/2 samples where WL L2S is one from the delay line directly to the output.
2. Do M/(2m) short IMDCTs and put all results to the output, because all of them belong to the current frame.
3. Do another short IMDCT, put the first m/2 samples of its result to the output and store the remaining m/2 samples to the delay line, because the first half belongs to the current frame and the rest to the next frame.
4. Do M/(2m) − 1 short IMDCTs and put all results to the delay line, because all of them belong to the next frame.
5. Clear the remaining (M + m)/2 samples of the delay line to zero, because the last WS S2S has ended.
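These five steps can be paraphrased in code as below. This is only an illustration of the output and delay-line traffic, not a full decoder: short_imdct_ola() is an assumed helper that inverse-transforms one short block, applies its window, performs the m-sample overlap-add with the preceding block, and returns the m samples it thereby completes.

#define M 1024   /* long block length (assumed) */
#define m 128    /* short block length (assumed) */

/* Assumed helper: IMDCT + windowing + m-sample overlap-add of one
 * short block; writes the m completed samples into out[].          */
void short_imdct_ola(const double *coef, double *out);

/* Reconstruct one frame starting at the WL_L2S window of Fig. 11.9. */
void switched_imdct_frame(const double coef[M / m][m],
                          double out[M], double delay[M])
{
    double tmp[m];
    int pos = 0, dpos = 0, blk = 0;

    /* Step 1: region where WL_L2S is one, already final in the delay line */
    for (int i = 0; i < (M - m) / 2; i++)
        out[pos++] = delay[i];

    /* Step 2: M/(2m) short IMDCTs belong wholly to the current frame */
    for (int b = 0; b < M / (2 * m); b++) {
        short_imdct_ola(coef[blk++], tmp);
        for (int i = 0; i < m; i++) out[pos++] = tmp[i];
    }

    /* Step 3: straddling block; the first m/2 samples finish the current
     * frame, the last m/2 seed the delay line for the next frame        */
    short_imdct_ola(coef[blk++], tmp);
    for (int i = 0; i < m / 2; i++) out[pos++] = tmp[i];
    for (int i = m / 2; i < m; i++) delay[dpos++] = tmp[i];

    /* Step 4: M/(2m) - 1 short IMDCTs belong to the next frame */
    for (int b = 0; b < M / (2 * m) - 1; b++) {
        short_imdct_ola(coef[blk++], tmp);
        for (int i = 0; i < m; i++) delay[dpos++] = tmp[i];
    }

    /* Step 5: the last WS_S2S has ended; zero the rest of the delay line */
    while (dpos < M) delay[dpos++] = 0.0;
}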
11.3.4 Window Size Compromise

To localize transient attacks, the shorter the window, the better the localization and thus the better the coding gain. This is true for the group of samples around the transient attack, but a short window causes poor coding gain for the other samples in the frame, which contain no transient attacks and are quasistationary. Therefore, there is a trade-off or compromise when choosing the size of the short window: too short a window means good coding gain for the transient attack but poor coding gain for the quasistationary remainder, and vice versa for a window that is too long. The compromise eventually reached is, of course, optimal neither for the transients nor for the quasistationary remainder.

This problem is further compounded by the need for longer long windows to better encode tonal components in quasistationary episodes. If the short window size is fixed, longer long windows mean more short windows in a frame, thus more short blocks of audio samples coded with poor coding gain. Therefore, the long windows cannot be too long.

Apparently, there is a consensus of using 256 taps for the short window, as shown in Table 11.1. This seems to be a good compromise because a window size of 256 taps is equivalent to 256/44.1 ≈ 5.8 ms, which is barely longer than the 5 ms of premasking. In other words, it is the longest acceptable size that is barely short enough, but not unnecessarily short. With this longest acceptable size for the short window, 2,048 taps for the long window is also a widely accepted option. However, pre-echo artifacts are still frequently audible with such a window size arrangement, especially for audio pieces with significant transient attacks. This calls for techniques to improve the time resolution of the short window for enhanced control of pre-echo artifacts.
11.4 Temporal Noise Shaping

Temporal noise shaping (TNS) [26], used by AAC [60], is one of the pre-echo control methods [69]. It deploys an open-loop DPCM on the block of short MDCT coefficients that covers the transient attack and leaves the other blocks of short MDCT coefficients in the frame untouched. In particular, it applies the open-loop DPCM encoder shown in Fig. 4.3 to the block of short MDCT coefficients that covers the transient in the following steps (a code sketch of steps 2 and 3 follows this list):

1. Estimate the autocorrelation matrix of the MDCT coefficients.
2. Solve the normal equations (4.46) using a method such as the Levinson–Durbin algorithm to produce the prediction coefficients.
3. Produce the prediction residue of the MDCT coefficients using the prediction coefficients obtained in the last step.
4. Quantize the prediction residue. Note that the quantizer is placed outside the prediction loop, as shown in Fig. 4.3.
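The following is a minimal sketch of steps 2 and 3. The routine names and the order cap are illustrative assumptions; estimating the autocorrelation (step 1) and quantizing the residue (step 4) are left out.

#include <string.h>

#define MAX_ORDER 8   /* illustrative cap on the predictor order */

/* Step 2: solve the normal equations by the Levinson-Durbin recursion.
 * r[0..p] is the autocorrelation of the MDCT block (p <= MAX_ORDER);
 * a[1..p] receives the prediction coefficients (a[0] is unused).      */
static void levinson_durbin(const double *r, int p, double *a)
{
    double tmp[MAX_ORDER + 1], err = r[0];
    memset(a, 0, (p + 1) * sizeof(double));
    for (int i = 1; i <= p; i++) {
        double k = r[i];
        for (int j = 1; j < i; j++) k -= a[j] * r[i - j];
        k /= err;
        memcpy(tmp, a, (p + 1) * sizeof(double));
        for (int j = 1; j < i; j++) a[j] = tmp[j] - k * tmp[i - j];
        a[i] = k;
        err *= 1.0 - k * k;
    }
}

/* Step 3: open-loop prediction residue of the MDCT coefficients X[k];
 * the quantizer stays outside this loop, as in Fig. 4.3.              */
static void tns_residue(const double *X, double *e, int N,
                        const double *a, int p)
{
    for (int k = 0; k < N; k++) {
        double pred = 0.0;
        for (int i = 1; i <= p && i <= k; i++) pred += a[i] * X[k - i];
        e[k] = X[k] - pred;
    }
}

The decoder mirrors tns_residue() by computing X[k] as the quantized residue plus the same prediction, which is why the prediction coefficients must be conveyed in the bit stream.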
On the decoder side, the regular decoder shown in Fig. 4.4 is used to reconstruct the MDCT coefficients for the block with a transient.

The first theoretical justification for TNS is the spectral flattening effect of transients. For the short window that covers a transient attack, the resultant MDCT coefficients are close to or essentially flat, thus amenable to linear prediction. As discussed in Chap. 4, the resultant prediction gain may be considered the same as the coding gain.

The second theoretical justification is that an open-loop DPCM shapes the spectrum of quantization noise toward the spectral envelope of the input signal (see Sect. 4.5.2). For TNS, the input signal is the MDCT coefficients and their "spectrum" is the time-domain samples covered by the short window, so the quantization noise of the MDCT coefficients is shaped in the time domain toward the envelope of the time-domain samples. This means that more quantization noise is placed after the transient attack and less before it. Therefore, there is less likelihood of pre-echo artifacts.

As an example, let us apply the TNS method, with a predictor order of 2, to the MDCT block that covers the transient attack in the audio signal of Fig. 11.2. The prediction coefficients are obtained from the autocorrelation matrix using (4.63). The quantization noise and the reconstructed signal are shown in Fig. 11.10. While the quantization noise before the transient attack is still visible, it is significantly shorter than that of the regular short window shown in Fig. 11.4.
Fig. 11.10 Pre-echo artifacts for TNS. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but is significantly shorter than that of the regular short window. However, the concentration of quantization noise in a short period of time (top) elevates the noise intensity significantly and hence may become audible
However, the concentration of quantization noise in a short period of time, as shown at the top of the figure, elevates the noise intensity significantly, so the noise may become audible. More sophisticated noise shaping methods may be deployed to shape the noise in such a way that it is more uniformly distributed behind the transient attack.

Since linear prediction needs to be performed for each MDCT coefficient, TNS is computationally intensive, even on the decoder side. The overhead for transferring the description of the predictor, including the prediction filter coefficients, is also considerable.
11.5 Transient-Localized MDCT

Another approach to improving the window size compromise is to leave the short window size unchanged but use a narrower window shape to cope better with transients. This allows better transient localization with minimal impact on the coding gain for the quasistationary remainder of the frame and on the complexity of both encoder and decoder.
11.5.1 Brief Window and Pre-Echo Artifacts

Let us look at window function WL S2S, which is the last one in Fig. 11.8. It is a long window, but its window shape is much narrower than that of the regular long window WL L2L. This is achieved by shifting the short window halves outward and properly padding zeros to make it as long as a long window.

The same idea may be applied to the short window using a model window whose length, denoted as 2B, is even shorter than the short window, i.e., B < m. Denoting its left and right halves as h^B_L(n) and h^B_R(n), respectively, this model window is not directly used in the switched MDCT, other than for building other windows, so it may be referred to as a virtual window. As an example, a 64-tap sine window, shown at the top of Fig. 11.11 as WB B2B, may be used for such a purpose. It is plotted using a dashed line to emphasize that it is a virtual window. Based on this virtual window, the following narrow short window, called a brief window, may be built:
WS B2B:  h^S_{B2B}(n) = \begin{cases} 0, & 0 \le n < m/2 − B/2; \\ h^B_L(n), & m/2 − B/2 \le n < m/2 + B/2; \\ 1, & m/2 + B/2 \le n < 3m/2 − B/2; \\ h^B_R(n), & 3m/2 − B/2 \le n < 3m/2 + B/2; \\ 0, & 3m/2 + B/2 \le n < 2m. \end{cases}   (11.15)
Fig. 11.11 Window functions for a transient-localized 1024/128-subband MDCT (top to bottom: WB B2B, WS S2S, WS S2B, WS B2S, WS B2B, WL L2L, WL L2S, WL S2L, WL S2S, WL L2B, WL B2L, WL B2B, WL S2B, WL B2S). The first window is plotted using a dashed line to emphasize that it is a virtual window (not directly used). Its length is 64 taps. All windows are built using the sine window as the model window
Its nominal length is the same as that of the regular short window WS S2S, but its effective length is only

(3m/2 + B/2) − (m/2 − B/2) = m + B   (11.16)
because its leading and trailing (m − B)/2 taps are zero, respectively. Its overlap with each of its neighboring windows is only B samples.

As an example, the WB B2B virtual window at the top of Fig. 11.11 may be used in combination with the short window WS S2S in the same figure to build the brief window shown as the fifth window in Fig. 11.11. Its effective length of 128 + 32 = 160 taps is significantly shorter than the 256 taps of the short window, so it should provide better transient localization. For a sample rate of 44.1 kHz, this corresponds to 160/44.1 ≈ 3.6 ms. Compared with the 5.8-ms length of the regular short window, this amounts to an improvement of 2.2 ms, or about 38%. This improvement is critical for pre-echo control because the 3.6-ms length of the brief window is well within the 5-ms range of premasking, while the 5.8-ms length of the regular short window is not.

Figure 11.12 shows the quantization noise achieved by this brief window for a piece of real audio signal. Its pre-echo artifacts are obviously shorter and weaker than those of the regular short window (see Fig. 11.4) but longer albeit weaker than those delivered by TNS (see Fig. 11.10). One may argue that TNS might deliver better pre-echo control due to its significantly shorter but more powerful pre-echo artifacts. Due to the premasking effect that often lasts up to 5 ms, however, pre-echo artifacts significantly shorter than the premasking period are most likely inaudible and thus irrelevant. Therefore, the simple brief window approach to pre-echo control serves the purpose well.
Fig. 11.12 Pre-echo artifacts for transient localized MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but is remarkably shorter than that of the regular short window
11.5.2 Window Sequencing

To switch between this brief window (WS B2B), the long window (WL L2L), and the short window (WS S2S), the PR conditions (11.8) and (11.9) call for the addition of various transitional windows, which are illustrated in Fig. 11.11 along with the primary windows. Since the brief window provides much better transient localization, the new switched-window MDCT scheme may be referred to as transient-localized MDCT (TLM).

Due to the increased number of windows as compared with the conventional approach, the determination of the appropriate window sequence is more involved, but still fairly simple. The addition to the usual placement of long and short windows discussed in Sect. 11.3 is the placement of the brief window within a frame with transients. Within such a frame, the brief window is applied only to the block of samples containing a transient, while the short and/or the appropriate short transitional windows are applied to the quasistationary samples in the remainder of the frame. Some window sequence examples are shown in Fig. 11.13.
Fig. 11.13 Window sequence examples. The top sequence is for the conventional method which does not use the brief window. The brief window is used to cover blocks with transients for the sequences in the rest of the figure. The second and the third sequences are for a transient occurring in the first and the third blocks, respectively. Two transients occur in the first and sixth blocks in the last sequence
11.5.2.1 Long Windows

If there is no transient within the current frame, a long window should be selected, the specific shape of which depends on the shapes of the immediately previous and subsequent window halves, respectively. This is summarized in Table 11.3.

Table 11.3 Determination of long window shape for a frame without detected transient

Previous window half   Current window   Subsequent window half
WL X2L                 WL L2L           WL L2X
                       WL L2S           WS S2X
                       WL L2B           WS B2X
WS X2S                 WL S2L           WL L2X
                       WL S2S           WS S2X
                       WL S2B           WS B2X
WS X2B                 WL B2L           WL L2X
                       WL B2S           WS S2X
                       WL B2B           WS B2X

11.5.2.2 Short Windows

If there is a transient in the current frame, a sequence of eight short windows should be used, the specific shape of each depending on the transient locations. This is summarized as follows:

- WS B2B is placed at each short block within which there is a transient, to improve the time resolution of the short MDCT.
- The window for the block immediately before this transient block has a designation of the form "WS X2B".
- The window for the block immediately after this transient block has a designation of the form "WS B2X".

The moniker "X" in the designations above can be either "S" or "B". The allowed placement of short windows may then be summarized as in Table 11.4.

Table 11.4 Allowed placement of short windows around a block with a detected transient

Pretransient   Transient   Posttransient
WL L2B         WS B2B      WL B2L
WL S2B                     WL B2S
WL B2B                     WL B2B
WS S2B                     WS B2S
WS B2B                     WS B2B

For the remainder of the frame (away from the transients), the short window WS S2S should be deployed, except for the first and last blocks of the frame, whose window assignments depend on the immediately adjacent window halves in the previous and subsequent frames, respectively. They are listed in Tables 11.5 and 11.6, respectively.
Table 11.5 Determination of the first half of the first window in a frame with detected transients

Last window in previous frame   First window in current frame
WL X2S                          WS S2X
WS X2S                          WS S2X
WS B2B                          WS B2X

Table 11.6 Determination of the second half of the last window in a frame with detected transients

Last window in current frame   First window in subsequent frame
WS X2S                         WL S2X
WS X2S                         WS S2X
WS B2B                         WS B2X
11.5.3 Indication of Window Sequence to Decoder

The encoder needs to indicate to the decoder the window(s) that it used to encode the current frame so that the decoder can use the same window(s) to decode the frame. This can again be accomplished using a window index.

For a frame without a detected transient, one label from the middle column of Table 11.3 is all that is needed.

For a frame with transients, the window sequencing procedure outlined in Sect. 11.5.2 can be used to determine the sequence of short window shapes based on the knowledge of the transient locations in the current frame. This procedure also needs to know whether there is a transient in the first block of the subsequent frame, due to the need for look-ahead. A value of 0 or 1 may be used to indicate the absence or presence of a transient in a block. For example, '00100010' indicates that there is a transient in the third and the seventh block, respectively. This sequence may be reduced to block counts, each counting from the start of the frame or from the last transient block; for example, the previous sequence may be coded as '23'.

Note that the particular method above cannot indicate whether there is a transient in the first block of the current frame. This particular information, together with the absence or presence of a transient in the first block of the subsequent frame, may be conveyed by the nomenclature WS Curr2Subs, where:

1. Curr (S = no, B = yes) identifies whether there is a transient in the first block of the current frame, and
2. Subs (S = no, B = yes) identifies whether there is a transient in the first block of the subsequent frame.

This is summarized in Table 11.7. The first column of Table 11.7 is obviously the same set of labels used for the short windows. Combining the labels in Tables 11.7 and 11.3, we arrive at the complete set of window labels shown in Table 11.8.
Table 11.7 Encoding of the existence or absence of a transient in the first block of the current and subsequent frames

Label    Current frame   Subsequent frame
WS B2B   Yes             Yes
WS B2S   Yes             No
WS S2B   No              Yes
WS S2S   No              No

Table 11.8 Window indexes and their corresponding labels

Window index   Window label
0              WS S2S
1              WL L2L
2              WL L2S
3              WL S2L
4              WL S2S
5              WS B2B
6              WS S2B
7              WS B2S
8              WL L2B
9              WL B2L
10             WL B2B
11             WL S2B
12             WL B2S
The total number of window labels is now 13, requiring 4 bits to transmit the index to the decoder. A pseudo C++ function is given in Sect. 15.4.5 that illustrates how to obtain the short window sequencing from this window index and the transient location information.
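Table 11.8 maps directly to a lookup table. The sketch below is a hypothetical decoder-side helper; how the 4-bit field is read from the bit stream is abstracted away.

/* Window labels indexed according to Table 11.8 (4-bit field). */
static const char *window_label[13] = {
    "WS_S2S", "WL_L2L", "WL_L2S", "WL_S2L", "WL_S2S",
    "WS_B2B", "WS_S2B", "WS_B2S", "WL_L2B", "WL_B2L",
    "WL_B2B", "WL_S2B", "WL_B2S"
};

/* Map a received window index to its label; indexes 13-15 are unused
 * and treated here as a bit-stream error (an assumption).            */
static const char *decode_window_index(unsigned idx)
{
    return idx < 13 ? window_label[idx] : 0;
}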
11.5.4 Inverse TLM Implementation

Both TLM and the regular double-resolution switched MDCT discussed in Sect. 11.3 involve switching between long and short MDCTs. The difference is that the regular scheme uses a small set of short and long windows, while TLM deals with a larger set of short and long windows that involves the brief window WS B2B. Note that the brief window and all of its related transitional windows are either short or long windows, just like the windows used by the regular switched MDCT; they differ simply in the values of the window functions. These differences in values do not change the procedure for calculating the switched IMDCT, so the same procedure given in Sect. 11.3.3 can be applied to calculate the inverse TLM.
11.6 Triple-Resolution Switched MDCT

Another approach to improving the window size compromise is to introduce a third window size, called the medium window size, between the short and long window sizes. The primary purpose is to provide better frequency resolution to the stationary segments within a frame with detected transients, thus allowing a much shorter window size to be used to deal with the transient attacks. There are, therefore, two window sizes within such a frame: short and medium. In addition, the medium window size can also be used to handle a frame with smooth transients; in this case, there are only medium windows and no short windows within such a frame. The three kinds of frames are summarized in Table 11.9. There are obviously three resolution modes, represented by the three window sizes, under such a switched MDCT architecture.

To maintain a constant frame size, the long window size must be a multiple of the medium window size, which in turn must be a multiple of the short window size. As an example, let us reconsider the 1024/128 switched MDCT discussed in Sect. 11.3. To mitigate the pre-echo artifacts encountered with its short window size of 256 taps, 128 taps may be selected as the new short window size, corresponding to 64 MDCT subbands. To achieve better coding gain for the remainder of a transient frame, a medium window size of 512 taps may be selected, corresponding to 256 MDCT subbands. Keeping the long window size of 2,048 taps, we end up with a 1024/256/64 switched MDCT, or triple-resolution switched MDCT, whose successive window sizes differ by factors of 4.

Since any two of the three window sizes can be switched between each other, three sets of transitional windows are needed. Each set of these windows can be built using the formulas (11.10) to (11.14). Figure 11.14 shows all these windows built from the sine window.

The advantage of this new architecture is illustrated in Fig. 11.15, where a few example window sequences are shown. Now the much shorter short window can be used to better localize the transients and the medium window to achieve more coding gain for the remainder of the frame. The all-medium-window frame is suitable for handling slow transient frames.

Comparing Fig. 11.14 with Fig. 11.11, we notice that they are essentially the same, except for the first window. Window WB B2B in Fig. 11.11 is virtual and not actually used; instead, its dilated version, WS B2B, is used to deal with transients. The window equivalent to WB B2B is labeled WS S2S in Fig. 11.14 and is used to cope with transients.
Table 11.9 Three types of frames in a triple-resolution switched MDCT

Frame type         Windows
Quasistationary    A long window
Smooth transient   A multiple of medium windows
Transient          A multiple of short and medium windows
Fig. 11.14 Window functions produced for a 1024/256/64 triple-resolution switched MDCT using the sine window as the model window (top to bottom: WS S2S, WM M2M, WM M2S, WM S2M, WM S2S, WL L2L, WL L2M, WL M2L, WL M2M, WL L2S, WL S2L, WL S2S, WL M2S, WL S2M)
Since this WS S2S can be much shorter than WS B2B, it can deliver much better time localization. It is also smoother than WS B2B, so it should have better frequency resolution properties.
Fig. 11.15 Window sequence examples for a 1024/256/64 triple-resolution switched MDCT. The all-medium-window frame at the top is suitable for handling slow transient frames. For the rest of the figure, the short window can be used to better localize transients and the medium window to achieve better coding gains for the remainder of a frame with detected transients
Comparing Figs. 11.15 and 11.13, we notice that they are also very similar; the only difference is essentially that the WS B2B window in Fig. 11.13 is replaced by four WS S2S windows in Fig. 11.15. Therefore, the window sequencing procedures are similar and thus are not discussed here. However, the addition of another resolution mode means that the resultant audio coding algorithm becomes much more complex, because each resolution mode usually requires its own sets of critical bands, quantizers, and entropy codes. See Chap. 13 for more explanation.
11.7 Transient Detection

The adaptation of the time–frequency resolution of a filter bank hinges on whether there are transients in a frame as well as on their locations, so the proper detection and localization of transients is critical to the success of audio coding.
11.7.1 General Procedure

Since transients mostly consist of high-frequency components, an input audio signal x(n) is usually preprocessed by a high-pass filter h(n) to extract its high-frequency components:

y(n) = \sum_k h(k)\, x(n − k).   (11.17)

The Laplacian, whose impulse response is

h(n) = \{1, −2, 1\},   (11.18)
is an example of such a high-pass filter.

The high-pass filtered samples y(n) within a transient detection interval (see Sect. 11.3.2) are then divided into blocks of equal size, referred to as transient-detection blocks. Note that a transient-detection block is different from the block used by the filter banks or MDCT. Let L denote the number of samples in such a transient-detection block; then there are K = N/L transient-detection blocks in each transient detection interval, assuming that the interval has a size of N samples. The short block size of the filter bank should be a multiple of this transient-detection block size.

Next, some kind of metric or energy is calculated for each transient-detection block. The most straightforward is the following L2 metric:

E(k) = \sum_{i=0}^{L−1} |y(kL + i)|^2, for k = 0, 1, ..., K − 1.   (11.19)
For a reduced computational load, the following L1 metric is also a good choice:

E(k) = \sum_{i=0}^{L−1} |y(kL + i)|, for k = 0, 1, ..., K − 1.   (11.20)
Other sophisticated "norms", such as perceptual entropy [34], can also be used.

At this point, the transient detection decision can be made based on the variation of the metric or energy among the blocks. As a simple example, let us first calculate

E_{max} = \max_{0 \le k < K} E(k)   (11.21)

and

E_{min} = \min_{0 \le k < K} E(k).   (11.22)
Then the existence of a transient may be declared if

\frac{E_{max} − E_{min}}{E_{max} + E_{min}} > T,   (11.23)

where T is a threshold, which may be set to 0.5.

After a frame is declared as containing transients, the next step is to identify the locations of the transients in the frame. Let us use the following transient function to designate the existence or absence of a transient in transient-detection block k:
T(k) = \begin{cases} 1, & \text{if there is a transient}; \\ 0, & \text{otherwise}. \end{cases}   (11.24)

A simple method for transient localization is

T(k) = \begin{cases} 1, & \text{if } E(k) − E(k−1) > T; \\ 0, & \text{otherwise}; \end{cases}   (11.25)

where T is a threshold. It may be set as

T = k\, \frac{E_{max} + E_{min}}{2},   (11.26)

where k is an adjustable constant. Since a short MDCT or subband block may contain a multiple of transient-detection blocks, the transient locations obtained above need to be converted into MDCT blocks. This is easily done by declaring that an MDCT block contains a transient if any of its transient-detection blocks contains a transient. The sketch below puts these pieces together.
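The following sketch detects and localizes transients within one detection interval per (11.17) through (11.26). The Laplacian high-pass, the L1 metric, and the threshold values follow the text; the history handling at the interval edges and the block-count cap are simplifying assumptions.

#include <math.h>

#define MAX_BLOCKS 64   /* assumed cap on K */

/* Detect transients in one detection interval of K*L samples.
 * x points at the first sample; x[-1] and x[-2] are assumed to be
 * valid history. T[k] receives the per-block flags of (11.24).
 * Returns 1 if the interval contains a transient per (11.23).     */
int detect_transients(const double *x, int K, int L, int *T)
{
    double E[MAX_BLOCKS], Emax = 0.0, Emin = 1e300;

    for (int k = 0; k < K; k++) {
        double e = 0.0;
        for (int i = 0; i < L; i++) {
            int n = k * L + i;
            /* Laplacian high-pass (11.17)-(11.18) */
            double y = x[n] - 2.0 * x[n - 1] + x[n - 2];
            e += fabs(y);             /* L1 metric (11.20) */
        }
        E[k] = e;
        if (e > Emax) Emax = e;
        if (e < Emin) Emin = e;
    }

    /* interval-level decision (11.23) with threshold 0.5 */
    if (Emax - Emin <= 0.5 * (Emax + Emin)) {
        for (int k = 0; k < K; k++) T[k] = 0;
        return 0;
    }

    /* localization (11.25) with threshold (11.26); the adjustable
     * constant is set to 1.0 here as an assumption                 */
    double thr = 1.0 * 0.5 * (Emax + Emin);
    T[0] = 0;
    for (int k = 1; k < K; k++) T[k] = E[k] - E[k - 1] > thr;
    return 1;
}

Marking an MDCT block as transient is then a matter of OR-ing the flags of the transient-detection blocks it contains.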
11.7.2 A Practical Example

A practical example is provided here to illustrate how transient detection is done in practical audio coding systems. It entails two stages of decision. In the preliminary stage, the current frame is declared to contain no transient if any of the following conditions is true (a C sketch of these tests follows (11.31) below):

1. E_max < k1 E_min, where k1 is a tunable parameter.
2. k2 D_max < E_max - E_min, where k2 is a tunable parameter.
3. E_max < T1, where T1 is a tunable threshold.
4. E_min > T2, where T2 is a tunable threshold.
The D_max used above is the maximum absolute metric difference, defined as

    D_max = \max_{0 < k < K} |E(k) - E(k-1)|.    (11.27)
If none of the conditions above is satisfied, a decision cannot be made for now. Instead, we need to move on to the secondary stage of decision, which focuses on eliminating large preattack and postattack peaks. In particular, let us denote

    k_max = {k | E(k) = E_max, 0 \le k < K},    (11.28)
which is the index of the block that has the maximum metric or energy. It may be considered the block where the transient attack occurs. To find the preattack peak, search backward from k_max for the block where the metric begins to increase:

for (k = kmax - 1; k > 0; k--) {
    if ( E[k-1] > E[k] ) {
        break;
    }
}
PreK = k - 1;

The preattack peak is

    PreE_max = \max_{0 \le k \le PreK} E(k).    (11.29)
A similar procedure is deployed to find the postattack peak. In particular, search forward from kmax for the last block where the metric begins to increase and whose metric is also no more than EX D 0:5Emax : k = kmax ; do { k++; for (; k
E[k] ) break; } if ( k+1>=K ) break; } while ( E[k] > EX ); PostK = k+1; The postattack peak is PostEmax D
max E.k/: PostKk
(11.30)
For the final decision, declare that the current frame contains transients if

    E_max > k3 max{PreE_max, PostE_max},    (11.31)

where k3 is a tunable parameter.
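The following C sketch combines the preliminary-stage tests and the final comparison of (11.31) into one decision function. It assumes that E_max, E_min, D_max of (11.27), and the peaks PreE_max and PostE_max of (11.29) and (11.30) have already been computed as above; k1, k2, k3, T1, and T2 are the tunable parameters from the text.

/* Two-stage transient decision; a sketch under the assumptions above. */
int FrameHasTransient(float Emax, float Emin, float Dmax,
                      float PreEmax, float PostEmax,
                      float k1, float k2, float k3, float T1, float T2)
{
    /* Preliminary stage: declare no transient if any condition holds. */
    if (Emax < k1 * Emin)        return 0;  /* energy range too narrow     */
    if (k2 * Dmax < Emax - Emin) return 0;  /* change too gradual          */
    if (Emax < T1)               return 0;  /* whole interval too weak     */
    if (Emin > T2)               return 0;  /* no quiet block for contrast */

    /* Secondary stage: the attack must dominate the pre/postattack peaks. */
    float peak = (PreEmax > PostEmax) ? PreEmax : PostEmax;
    return Emax > k3 * peak;                /* final decision, (11.31)     */
}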
Chapter 12
Joint Channel Coding
Multichannel audio signals or programs, including the most widely used stereo and 5.1 surround sounds, are considered as consisting of discrete channels. Since a multichannel signal is intended for the reproduction of a coherent sound field, there is strong correlation between its discrete channels. This inter-channel correlation can obviously be exploited to reduce bit rate. On the receiving end, the human auditory system relies on many cues in the audio signal to achieve sound localization, and the processing involved is very complex. However, many psychoacoustic experiments have consistently indicated that some components of the audio signal are insignificant or even irrelevant for sound localization, and thus can be removed for bit rate reduction. Surround sounds usually include one or more special channels, called low-frequency effect or LFE channels, which are specifically intended for deep and low-pitched sounds with a frequency range from 3 to 120 Hz. The significantly reduced bandwidth presents a great opportunity for reducing bit rate.

It is obvious that a great deal of bit rate reduction can be achieved by jointly coding all channels of an audio signal through exploitation of inter-channel redundancy and irrelevancy. Unfortunately, joint channel coding has not reached the same level of sophistication and effectiveness as intra-channel coding. Therefore, only a few widely used and simple methods are covered in this chapter. See [25] to explore further.
12.1 M/S Stereo Coding

M/S stereo coding, or sum/difference coding, is an old technology which was deployed in stereo FM radio [15] and stereo TV broadcasting [27] to extend from monaural or monophonic sound reproduction (often shortened to mono) to stereophonic sound reproduction (shortened to stereo) while maintaining backward compatibility with old mono receivers. Toward this end, the left (L) and right (R) channels of a stereo program are encoded into sum

    S = 0.5 (L + R)    (12.1)
and difference

    D = 0.5 (L - R)    (12.2)
channels. A mono receiver can process the sum signal only, so the listener can hear both left and right channels in a single loudspeaker. A stereo receiver, however, can decode the left channel by

    L = S + D    (12.3)

and the right channel by

    R = S - D,    (12.4)
respectively, so the listener can enjoy stereo sound reproduction. The sum signal is also referred to as the main signal and the difference signal as the side signal, so this technology is often called main/side stereo coding.

From the perspective of audio coding, this old technology obviously provides a means for exploiting the strong correlation between stereo channels. In particular, if the correlation between the left and right channels is strongly positive, the difference channel becomes very weak, thus needing fewer bits to encode. If the correlation is strongly negative, the sum channel becomes weak, thus needing fewer bits to encode. However, if the correlation between the left and right channels is weak, both sum and difference signals are strong, so there is not much coding gain. Also, if the left or right channel is much stronger than the other, sum/difference coding is unlikely to provide any coding gain either. Therefore, the encoder needs to dynamically decide whether or not sum/difference coding is deployed and indicate the decision to the decoder.

Instead of the time domain approach used by stereo FM radio and TV broadcasting, sum/difference coding in audio coding is mostly performed in the subband domain. In addition, the decision as to whether sum/difference coding is deployed for a frame is tailored to each critical band. In other words, there is a sum/difference coding decision for each critical band [35].
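As an illustration, the following C sketch makes a per-critical-band sum/difference decision by comparing the energies of the sum and difference channels against those of the original channels. This particular energy heuristic is one plausible encoder policy assumed here for illustration, not a rule taken from any standard.

/* Decide, for one critical band of n subband samples, whether to code
 * the sum/difference pair instead of the left/right pair. */
int UseSumDifference(const float *L, const float *R, int n)
{
    float eL = 0, eR = 0, eS = 0, eD = 0;
    for (int i = 0; i < n; i++) {
        float s = 0.5f * (L[i] + R[i]);   /* sum channel, (12.1)        */
        float d = 0.5f * (L[i] - R[i]);   /* difference channel, (12.2) */
        eL += L[i] * L[i];  eR += R[i] * R[i];
        eS += s * s;        eD += d * d;
    }
    /* Code S/D only if the weaker of the S/D pair is weaker than the
       weaker of the original channels, i.e., there is a coding gain. */
    float minLR = (eL < eR) ? eL : eR;
    float minSD = (eS < eD) ? eS : eD;
    return minSD < minLR;
}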
12.2 Joint Intensity Coding

The perception of sound localization by the human auditory system is frequency-dependent [16, 18, 102]. At low frequencies, the human ear localizes sound dominantly through inter-aural time differences (ITD). At high frequencies (higher than 4–5 kHz, for example), however, sound localization is dominated by inter-aural amplitude differences (IAD). This latter property presents a great opportunity for significant bit rate reduction. The basic idea of joint intensity coding (JIC) is to merge subbands at high frequencies into just one channel (thus significantly reducing the number of samples to be coded) and to transmit instead a smaller number of bits that describe the amplitude differences between channels.
Joint intensity coding is, of course, performed on a critical-band basis, so only critical band z is considered in the following discussion. Since sound localization is dominated by inter-aural amplitude differences only at high frequencies, higher than 4–5 kHz, the critical band z should be selected in such a way that its lower frequency bound is higher than 4–5 kHz. All critical bands higher than this can be considered for joint intensity coding.

To illustrate the basic idea of, and the steps typically involved in, joint intensity coding, let us suppose that there are K channels that can be jointly coded and denote the nth subband sample from the kth channel as X(k, n). The first step of joint intensity coding is to calculate the power or intensity of all subband samples in critical band z for each channel:

    \sigma_k^2 = \sum_{n \in z} X^2(k, n), 0 \le k < K.    (12.5)
At the second step, all subband samples in critical band z are joined together to form a joint channel:

    J(n) = \sum_{0 \le k < K} X(k, n), n \in z.    (12.6)
Without loss of generality, let us suppose that these joint subband samples are embedded in the zth critical band of the zeroth channel. These joint subband samples need to be adjusted so that the power of the zeroth channel in critical band z is unchanged:

    X(0, n) = \sqrt{ \sigma_0^2 / \sum_{n \in z} J^2(n) } \, J(n), n \in z.    (12.7)
At this moment, these joint subband samples can be coded as a normal critical band of the zeroth channel. All subband samples in critical band z in the other channels are discarded, thus achieving significant bit savings.

At the third step, the steering vector for the joined channels is calculated:

    \alpha_k = \sqrt{ \sigma_k^2 / \sigma_0^2 }, 0 < k < K.    (12.8)
Its elements are subsequently quantized and embedded into the bit stream as side information. This completes the encoder side of joint intensity coding.

At the decoder side, the joint subband samples for critical band z embedded as part of the zeroth channel are first decoded and reconstructed as \hat{X}(0, n). Then the steering vectors are unpacked from the bit stream and reconstructed as \hat{\alpha}_k, k = 1, 2, ..., K - 1. Finally, the subband samples of all joined channels can be reconstructed from the joint channel using the steering vector as follows:

    \hat{X}(k, n) = \hat{\alpha}_k \hat{X}(0, n), n \in z and 0 < k < K.    (12.9)
Since a large chunk of subband samples is discarded, joint intensity coding introduces significant distortion. When properly designed and not aggressively deployed, the distortion may appear as a mild collapse of the stereo image; a listener may complain that the sound field is narrower. When aggressively deployed, significant collapse of the stereo image is possible.

After subband samples are joined together and then reconstructed using the steering vector at the decoder side, they can no longer cancel out aliasing from other subbands. In other words, the aliasing-canceling property necessary for perfect reconstruction is destroyed forever. If the stopband attenuation of the prototype filter is high, this may not be a significant problem. For subband coders whose stopband attenuation is not high, however, aliasing may easily become audible. This is the case for the MDCT, whose stopband attenuation is usually less than 20 dB.
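The following C sketch performs the encoder side of joint intensity coding for one critical band, following (12.5) through (12.8). The array layout, the function signature, and the limit of 64 channels are illustrative assumptions.

#include <math.h>

/* X is a K-by-n array of subband samples for the band (X[k][i]); on return,
 * X[0] holds the power-adjusted joint channel and alpha[1..K-1] hold the
 * steering factors. Assumes K <= 64. */
void JointIntensityEncode(float **X, int K, int n, float *alpha)
{
    float sigma2[64];                     /* per-channel band power, (12.5) */
    float powJ = 0.0f;

    for (int k = 0; k < K; k++) {
        sigma2[k] = 0.0f;
        for (int i = 0; i < n; i++)
            sigma2[k] += X[k][i] * X[k][i];
    }
    for (int i = 0; i < n; i++) {         /* joint channel, (12.6) */
        float J = 0.0f;
        for (int k = 0; k < K; k++)
            J += X[k][i];
        X[0][i] = J;
        powJ += J * J;
    }
    float g = (powJ > 0.0f) ? sqrtf(sigma2[0] / powJ) : 0.0f;
    for (int i = 0; i < n; i++)           /* power adjustment, (12.7) */
        X[0][i] *= g;
    for (int k = 1; k < K; k++)           /* steering factors, (12.8) */
        alpha[k] = (sigma2[0] > 0.0f) ? sqrtf(sigma2[k] / sigma2[0]) : 0.0f;
}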
12.3 Low-Frequency Effect Channel

Low-frequency effect (LFE) channels are sound tracks that are specifically intended for deep and low-pitched sounds, with a frequency bandwidth limited to 3–120 Hz. Such a channel is normally sent to a loudspeaker that is specially designed for low-pitched sounds, called a subwoofer or low-frequency emitter.

Due to the extremely low bandwidth of an LFE channel, it can be coded very easily. A simple approach is to down-sample it and then quantize it directly. Another simple approach is to code it using the same method as the other channels, but with a specific restriction that disallows transient mode for the filter bank. For MDCT-based coders, for example, this entails that only the long window is allowed.
Chapter 13
Implementation Issues
The various audio coding technologies presented in earlier chapters need to be stitched together to form a complete audio coding algorithm, which in turn needs to be implemented on a physical platform such as a DSP or another microprocessor. Toward this end, many practical issues need to be effectively addressed, including but not limited to the following:

Data Structure. The audio samples need to be properly organized and managed in a way that is suitable for both encoding and decoding.

Entropy Codebooks. To maximize the efficiency of entropy coding, a variety of entropy codebooks need to be adaptively deployed to match the statistical properties of the various sources being coded.

Bit Allocation. Bits are the precious resource that needs to be optimally allocated to deliver the best overall coding efficiency.

Bit Stream Format. All bits that represent the coded audio data need to be properly formatted into a structure that facilitates easy synchronization, unpacking, decoding, and error handling.

Implementation on Microprocessors. Audio encoders and decoders are mostly implemented on DSPs or other low-cost microprocessors, so an algorithm needs to be designed in such a way that it can be conveniently implemented on these microprocessors.
13.1 Data Structure

To facilitate real-time audio coding, audio data need to be organized as frames so that they can be encoded, delivered, and decoded piece by piece in real time. In addition, the frequent interruption of mostly quasistationary audio signals by transients necessitates the segmentation of audio data into quasistationary segments based on transient locations so that adaptation in time of a filter bank's resolution can be properly deployed. In the frequency domain, the human ear processes audio signals in critical bands, so subband samples need to be organized in a similar fashion to maximally exploit perceptual irrelevance. The three requirements above constitute the basic constraints in designing data structures for audio coding.
13.1.1 Frame-Based Processing

Real-time delivery of media content, such as live broadcasting, is often a necessary requirement for applications that involve audio. In these applications, one cannot wait until all samples are received and then begin to process and deliver them. In addition, such waiting for completion would make the encoding and decoding equipment extremely expensive due to the need to store the entire signal, which may last hours, even days. A basic solution to this problem is frame-based processing, in which the input signal is processed and delivered frame by frame.

To avoid data corruption, frame-based processing is usually implemented using double buffering or ping-pong buffering. Under such a mechanism, two sets of input and output buffer pairs are deployed. While one set of buffers is handling input and output operations (the input buffer is receiving and the output buffer is transmitting data), the data in the other set are being processed (the input buffer is being read and the output buffer is being filled). Once the input/output set is full/empty and the data in the processing set are processed, the buffer sets are switched, so that the just-processed set handles new input/output data and the data in the just-filled set are processed. The processing has to be fast enough to ensure that the whole frame of data is processed before the input/output buffer set is full/empty. Otherwise, the processing has to be abandoned to ensure real-time operation. A C sketch of this mechanism is given below.

For easy frame buffer management, especially the synchronized switching of buffer sets, a constant frame size is highly desirable. The block-based nature of filter banks, and the MDCT in particular, is obviously very amenable to frame-based processing and to maintaining a constant frame size. One can simply set the frame size to a multiple of the block size of the filter bank. With a switched filter bank, whose block size varies over time, a constant frame size can still be easily achieved by setting the frame size to a multiple of the longest block size and ensuring that the longer block sizes are multiples of the shorter block sizes. For example, the block sizes for the double-switched MDCT in Sect. 11.3 are 1,024 and 128, respectively, and those for the triple-switched MDCT in Sect. 11.6 are 1,024, 256, and 64, respectively, so a constant frame size may be maintained by setting it to a multiple of 1,024, such as 1,024, 2,048, .... While a frame containing multiple longest blocks may save a little side information in the final bit stream, this structure is not essential as far as coding is concerned. Without loss of generality, therefore, the frame size is assumed to be the same as the longest block size of a switched filter bank in the remainder of this book.
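Below is a skeletal C sketch of double (ping-pong) buffering. The I/O routines declared here stand in for platform-specific (often interrupt- or DMA-driven) transfers; their names and the frame size are illustrative assumptions.

#define FRAME_SIZE 1024

extern void StartFillInput(float *buf);          /* begin receiving a frame  */
extern void StartDrainOutput(const float *buf);  /* begin transmitting one   */
extern void ProcessFrame(const float *in, float *out);
extern void WaitIoComplete(void);                /* input full, output empty */

static float inBuf[2][FRAME_SIZE], outBuf[2][FRAME_SIZE];

void FrameLoop(void)
{
    int io = 0;                      /* buffer set handling input/output */
    for (;;) {
        int proc = 1 - io;           /* the other set is being processed */

        StartFillInput(inBuf[io]);
        StartDrainOutput(outBuf[io]);

        /* Must finish before the I/O set is full/empty to stay real time. */
        ProcessFrame(inBuf[proc], outBuf[proc]);

        WaitIoComplete();
        io = proc;                   /* switch the buffer sets */
    }
}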
13.1.2 Time–Frequency Tiling

A frame with transients contains multiple short blocks, and transients may occur in one or more of those blocks. The statistical properties of the blocks after a
transient are usually similar to those of the transient block, so a transient may be considered as the start of a quasistationary segment, called a transient segment, each consisting of one or more short blocks. For this reason, transient localization may also be called transient segmentation. Under such a scheme, the first transient segment of a frame starts from the first block of the frame and ends before the first transient block. The second transient segment starts from the first transient block and ends either before the next transient block or with the end of the frame. This procedure continues until the end of the frame. This segmentation of subband samples by blocks based on transient locations and frame boundaries is illustrated in Fig. 13.1.

Within each block, either short or long, the subband samples are segmented along the frequency axis in such a way that critical bands are approximated to fully utilize
Fig. 13.1 Segmentation of blocks of subband samples within a frame based on transient locations and frame boundaries
Fig. 13.2 Time–frequency tiling of subband samples in a frame by transient segments in the time domain and critical bands in the frequency (subband) domain. Subband samples in each tile share similar statistical properties along the time axis and are within one critical band along the frequency axis, and so may be considered as one unit for quantization and bit allocation purposes
results from perceptual models. Subband segments thus obtained may be conveniently referred to as critical band segments or simply critical bands.

The subband samples of a frame with detected transients are, therefore, segmented along the time axis into transient segments and along the frequency axis into critical band segments. Combined, they segment all subband samples in a frame into a time–frequency tiling as shown in Fig. 13.2. All subband samples within each tile share similar statistical properties along the time axis and are within one critical band along the frequency axis, so they should be considered as one group or unit for quantization and bit allocation purposes. Such a tile is called a quantization unit. For a frame with a long block, the time–frequency tiling is obviously reduced to a tiling of critical bands along the frequency axis only, so a quantization unit consists of the subband samples in one critical band.
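One possible C rendering of such a tile, assuming each quantization unit is identified by its transient segment and critical band, is sketched below; all type and field names are illustrative, not part of any bit stream definition.

/* A quantization unit: one tile of the time-frequency tiling. */
typedef struct {
    int segment;        /* transient segment index within the frame        */
    int band;           /* critical band index within the segment          */
    int firstSample;    /* offset of the first subband sample in the unit  */
    int numSamples;     /* number of subband samples in the unit           */
    int stepSizeIndex;  /* shared step size index (set by bit allocation)  */
} QuantUnit;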
13.2 Entropy Codebook Assignment

After quantization, entropy coding may be applied to the quantization indexes to remove statistical redundancy. As discussed in Chaps. 8 and 9, an entropy codebook is designed based on a specific estimate of the probability model of the source sequence, which in this case consists of the quantization indexes. When quantization indexes are coded by an entropy codebook, it is imperative that they follow the probability model for which the entropy codebook was designed. Otherwise, suboptimal and even very poor coding performance is likely to occur.
Due to the dynamic nature of audio signals, quantization indexes usually do not follow a single probability model. Accordingly, a library of entropy codebooks should be designed and assigned to different segments of quantization indexes to match their respective probability characteristics.
13.2.1 Fixed Assignment

For the sake of simplicity, the quantization indexes for the subband samples in one quantization unit may be considered statistically similar, and thus as sharing a common probability model and the same entropy codebook. In such a case, the quantization unit is also the unit for codebook assignment: all quantization indexes within a quantization unit are coded using the same entropy codebook. This is shown in Fig. 13.3. In other words, the application scopes of the entropy codebooks coincide with those of the quantization units. This is the scheme adopted by almost all audio coding algorithms.

Under such a scheme, the codebook assigned to a quantization unit is usually the smallest one that can accommodate the largest quantization index (in absolute value) within the quantization unit. Since this largest value is completely determined once the quantization step size is selected by bit allocation, the entropy codebook is also completely determined, so there is no room for optimization.
Fig. 13.3 All quantization indexes within a quantization unit are coded using the same entropy codebook. The application scopes of the entropy codebooks coincide with those of the quantization units
13.2.2 Statistics-Adaptive Assignment

Although the boundaries of a quantization unit along the time axis are based on the statistical properties of the subband samples, its boundaries along the frequency axis are based on critical bands and are thus unrelated to the statistical properties of the quantization indexes. Therefore, the fixed codebook assignment approach discussed above does not provide a good match between the statistical properties of the entropy codebooks and those of the quantization indexes. To arrive at a better match, a statistics-adaptive approach may be adopted which ignores the boundaries of the quantization units and, instead, adaptively matches codebooks to the local statistical properties of the quantization indexes. This is illustrated in Fig. 13.4, outlined below, and sketched in code after this list:

1. Assign to each quantization index the smallest codebook that can accommodate it, thereby converting quantization indexes into codebook indexes.
2. Segment these codebook indexes into larger segments based on their local statistical properties.
3. Select the largest codebook index in each segment as the codebook index for the segment.
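The following C sketch follows these three steps. The codebook table maxAbs[] (the largest absolute index each codebook accommodates) and the simple change-based segmentation rule are illustrative assumptions standing in for a real statistics-based segmentation.

/* Step 1: smallest codebook that accommodates quantization index q. */
int SmallestCodebook(int q, const int *maxAbs, int numBooks)
{
    int a = (q < 0) ? -q : q;
    for (int b = 0; b < numBooks; b++)
        if (a <= maxAbs[b]) return b;
    return numBooks - 1;
}

/* Steps 2 and 3: segment the codebook indexes and keep the largest one
 * per segment. A new segment is opened when the local codebook index
 * jumps by more than one - a crude stand-in for a statistics-based rule.
 * Returns the number of segments; segStart[] and segBook[] are outputs. */
int AssignCodebooks(const int *q, int n, const int *maxAbs, int numBooks,
                    int *segStart, int *segBook)
{
    int numSeg = 0;
    for (int i = 0; i < n; i++) {
        int b = SmallestCodebook(q[i], maxAbs, numBooks);
        if (numSeg == 0 || b > segBook[numSeg - 1] + 1
                        || b < segBook[numSeg - 1] - 1) {
            segStart[numSeg] = i;          /* open a new application scope */
            segBook[numSeg] = b;
            numSeg++;
        } else if (b > segBook[numSeg - 1]) {
            segBook[numSeg - 1] = b;       /* largest index in the segment */
        }
    }
    return numSeg;
}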
Fig. 13.4 Statistics-adaptive approach to entropy codebook assignment, which ignores the boundaries of the quantization units and, instead, adaptively matches codebooks to the local statistical properties of the quantization indexes
because most of the indexes in the unit are much smaller and thus are not suitable for coding with a large codebook. Using the statistics-adaptive approach, however, the largest quantization index falls into codebook segment C as shown in Fig. 13.4, so it shares a codebook with the other large quantization indexes in that segment. Also, all quantization indexes in codebook segment D are small, so a small codebook can be selected for them, which results in fewer bits for coding these quantization indexes.

With the fixed assignment approach, only the codebook indexes need to be transferred to the decoder as side information, because the codebook application scopes are the same as the quantization units, which are usually determined by transient segments and critical bands. The statistics-adaptive approach, however, needs to transfer the codebook application scopes to the decoder as side information, in addition to the codebook indexes, since they are dynamic and independent of the quantization units. This overhead must be contained in order for the statistics-adaptive approach to offer any advantage. This can be accomplished by imposing a limit on the number of segments, thus restricting the amount of side information.
13.3 Bit Allocation

The goal of audio coding is to reduce the bit rate required to carry audio data to the decoder with inaudible distortion. Under a normal configuration, a particular application specifies a target bit rate, which may be a maximal rate that the audio stream can never exceed, an average rate that the audio stream should maintain over a certain period of time, or a variable rate that varies with channel conditions (such as on the Internet). Depending on the type of bit rate, an inter-frame bit allocation algorithm is needed to optimally allocate a certain number of bits to encode each audio frame. Within each frame, an intra-frame bit allocation algorithm is needed to optimally allocate these bits to encode the quantization indexes and other side information so as to ensure minimal distortion.
13.3.1 Inter-Frame Allocation

There are many algorithms for allocating bits to frames. The easiest approach is to allocate the same number of bits to each frame, since an equal frame size is usually deployed for easy buffer handling:

    Bits Per Frame = (Bit Rate / Sample Rate) × Samples Per Frame.    (13.1)
As an example, let us consider a bit rate of 128 kbps and a frame size of 1,024 samples. For a sample rate of 48 kHz, the number of bits assigned to code each frame is
    (128/48) × 1,024 ≈ 2,730.

While simple, this approach is not a good match to the dynamic nature of audio signals. For example, frames whose audio samples are silent or of small magnitude demand a small number of bits; allocating the same number of bits to such frames means that a significant portion of those bits are not used and are thus wasted. On the other hand, frames with transients tend to demand a large number of bits; allocating the same number of bits to such frames results in an inadequate number of bits, and distortion is likely to become audible. Therefore, a better approach is to adaptively allocate bits to frames in proportion to their individual demands while maintaining full conformance to the target bit rate.

Adaptive inter-frame bit allocation is a very complex problem that is affected by a variety of factors, such as the type of bit rate constraint (maximal or average bit rate), the size of the decoder buffers, and the bit demands of individual audio frames. Audio decoders usually have very limited buffer size, which leads to the possibility of buffer underflow and overflow. Buffer underflow, also called buffer underrun, is a condition that occurs when the input buffer is fed with data at a lower speed than the data is being read from it for a prolonged period of time, so that there is not enough data for the decoder to decode. When this happens, real-time audio playback has to be paused until enough data is fed to the input buffer. Buffer overflow is the opposite condition, where the input buffer is fed with data at a higher speed than the data is being read from it for a prolonged period of time, so that the input buffer is full and can no longer take more data. When this occurs, some of the input data have to be discarded, causing decoding errors.

The bit demands of individual audio frames vary with time. Since it is impossible to predict the bit demand of future frames, optimal bit allocation is very problematic for real-time encoding. For applications where real-time encoding is not required, one can usually make a pilot run of encoding to estimate the bit demand of each frame, thus making inter-frame bit allocation easier.
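As a minimal sketch, (13.1) translates directly into a one-line C function; the name and 64-bit types are illustrative.

/* Bits per frame as in (13.1), using 64-bit arithmetic to avoid overflow.
 * For 128 kbps, 48 kHz, and 1,024 samples per frame this yields
 * 128000 * 1024 / 48000 = 2730 bits. */
long long BitsPerFrame(long long bitRate, long long sampleRate,
                       long long samplesPerFrame)
{
    return bitRate * samplesPerFrame / sampleRate;
}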
13.3.2 Intra-Frame Allocation

Once the number of bits allocated to a frame is known, there is still a need to optimally allocate these bits, referred to as a bit pool, to each individual quantization unit in such a way that quantization noise is either completely inaudible or least audible. According to the optimal bit allocation strategy discussed in Sects. 5.2, 6.5, and 10.6, the basic principle is to equalize MSQE or NMR (if a perceptual model is used) across all quantization units. While there are many methods to achieve this, the following water-filling algorithm, sketched in code after this list, illustrates the basic idea:

1. Find the quantization unit whose quantization noise is most audible, i.e., whose NMR is the highest.
2. Decrease the quantization step size for this unit.
3. Recalculate the number of bits consumed.
4. Repeat steps 1 through 3 until either
   - the bit pool is exhausted, or
   - the quantization noise for all quantization units is below the respective masking thresholds, i.e., the NMR for all quantization units is less than one.
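The following C sketch implements the loop above. NMR() and BitsConsumed() are hypothetical helpers assumed to exist, and QuantUnit is the illustrative structure sketched in Sect. 13.1.2.

extern float NMR(const QuantUnit *u);   /* hypothetical perceptual helper  */
extern long BitsConsumed(const QuantUnit *units, int numUnits);

void WaterFill(QuantUnit *units, int numUnits, long bitPool)
{
    for (;;) {
        /* Step 1: find the unit whose quantization noise is most audible. */
        int worst = 0;
        for (int u = 1; u < numUnits; u++)
            if (NMR(&units[u]) > NMR(&units[worst]))
                worst = u;

        /* Stop once noise in every unit is below its masking threshold. */
        if (NMR(&units[worst]) < 1.0f)
            break;

        /* Step 2: decrease the quantization step size for this unit. */
        if (units[worst].stepSizeIndex == 0)
            break;                           /* cannot refine any further */
        units[worst].stepSizeIndex--;

        /* Step 3: recalculate bits consumed; stop when the pool is empty. */
        if (BitsConsumed(units, numUnits) > bitPool) {
            units[worst].stepSizeIndex++;    /* undo the last refinement */
            break;
        }
    }
}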
13.4 Bit Stream Format

All elements of compressed audio data that represent a frame of audio samples need to be packed into a structured frame for transmission to the decoder and subsequent unpacking by the decoder so that the original audio signal can be reconstructed and played back. A sequence of such frames forms an audio bit stream.

The payloads of an audio frame obviously consist of bits that represent audio samples, but supplementary data that describe the audio signal, such as the sample rate and speaker configuration, are needed to assist the synchronization, decoding, and playback of audio signals. Error protection codes may be inserted to help deal with difficult channels. Auxiliary data, such as time codes and information about the band, may also be attached to enhance the audio playback experience. Therefore, a compressed audio frame may have a typical structure as shown in Table 13.1, in which the payloads are supplemented by a frame header, error protection codes, auxiliary data, and an end-of-frame signature.
Table 13.1 A basic frame structure for compressed audio data. The payloads are the bits representing the audio data for all channels.

  Frame Header
  Audio Data for Normal Channel 0
  Audio Data for Normal Channel 1
  ...
  Audio Data for LFE Channel 0
  Audio Data for LFE Channel 1
  ...
  Error Protection Codes
  Auxiliary Data
  End of Frame

13.4.1 Frame Header

A basic frame header structure is shown in Table 13.2. Its first role is to assist the decoder in synchronizing with the audio frames so that it knows where a frame starts and ends. This is usually achieved through the synchronization codeword, the compressed data size codeword, and the end-of-frame signature, which is the last codeword of a frame.
Table 13.2 A basic frame header that assists the synchronization of audio frames and provides a description of the audio signal carried by the audio bit stream

  Synchronization word
  Compressed data size
  Sample rate
  Number of normal channels
  Number of LFE channels
  Speaker configuration
The synchronization word signals the start of the frame. Some audio coding algorithms impose the constraint that the synchronization word cannot appear anywhere inside a frame, for easy synchronization: once the synchronization word is found, it is the start of the frame and there is no ambiguity. In such a case, it is unnecessary to use the compressed data size codeword and the end-of-frame signature. Forbidding the synchronization word from appearing inside a frame imposes a certain degree of inflexibility on the encoding process. To lessen this inflexibility, it is imperative to use a long synchronization word. A good example is 0xffffffff (hexadecimal), which requires only that 32 consecutive binary '1's do not appear anywhere inside the frame.

Without the restriction above, a synchronization word might appear in the middle of a frame. If the search for the synchronization word begins in the middle of such a frame, an identification of the synchronization word inside the frame triggers false synchronization. However, the compressed data size codeword, which comes right after the synchronization word, would point to a position in the bit stream where the end-of-frame signature is absent, so synchronization cannot be confirmed. Consequently, a new round of the synchronization loop is launched.

The second role of the frame header is to provide a description of the audio signal carried by the bit stream. This facilitates the decoding of the compressed audio frames as well as the playback of the decoded audio. A minimal description must include the sample rate, the numbers of normal and LFE channels, and the speaker configuration.
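The following C sketch illustrates the synchronization loop just described for a byte-aligned stream. The 16-bit sync and end-of-frame word values and the assumed header layout (the frame size, in bytes counted from the sync word, immediately following the sync word) are illustrative assumptions, not a real format.

#include <stddef.h>
#include <stdint.h>

#define SYNC_WORD  0x7FFFu   /* example synchronization codeword (hypothetical) */
#define END_WORD   0x0000u   /* hypothetical end-of-frame signature             */

static uint16_t Read16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

/* Returns the offset of a confirmed frame start, or -1 if none is found. */
long FindFrame(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (Read16(buf + i) != SYNC_WORD)
            continue;                         /* keep scanning              */
        size_t size = Read16(buf + i + 2);    /* compressed data size       */
        if (size < 6 || i + size > len)
            continue;                         /* frame not fully buffered   */
        if (Read16(buf + i + size - 2) == END_WORD)
            return (long)i;                   /* trio consistent: confirmed */
        /* Otherwise this was a false synchronization; resume the search. */
    }
    return -1;
}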
13.4.2 Audio Channels

The payloads of an audio frame are the bits representing all audio samples, which are usually packed one channel after another as shown in Table 13.1. Within each channel are a large chunk of ancillary data, called side information, and the bits representing the quantization indexes, as shown in Table 13.3. The side information aids the unpacking of the quantization indexes and the reconstruction of audio samples from these indexes. It may include bits that represent the MDCT window index, transient locations, control bits for joint channel coding, and indexes for quantization step sizes and bit allocation. Note that the side information does not necessarily have to be packed as a whole chunk. Instead, it can be interleaved with the bits representing the quantization indexes.
Table 13.3 Basic data format for packing one audio channel consisting of M entropy codebook application scopes

  Side Information:
    Window index
    Transient locations
    Joint channel coding
    Quantization step sizes
    Bit allocation
  Quantization Indexes:
    Entropy codebook application scope 0
    Entropy codebook application scope 1
    ...
    Entropy codebook application scope M - 1
The bits representing the quantization indexes for all coded subband samples may be packed sequentially from low frequency to high frequency. For frames with short windows, they are usually laid out one transient segment after another. When entropy coding is used for the quantization indexes, these bits are obviously placed according to the application scopes of the respective entropy codebooks.
13.4.3 Error Protection Codes

Error protection codes, such as Reed–Solomon codes [44], may be inserted between data elements in a frame to detect bit stream errors and even correct minor ones. To reduce the associated overhead, they may be selectively placed to protect critical segments of data in a frame, such as the entropy codebook indexes, within which any error is likely to cause the whole decoding process to crash. This amounts to unequal error protection. In addition, bits in a frame may be arranged in such a way that most burst bit errors would cause a graceful degradation of audio quality, instead of a total crash. The end result is a certain degree of error resilience, which is valuable when audio streams are transmitted or broadcast over wireless channels.

However, error protection codes are extensively deployed in modern digital communication protocols and systems to ensure that the bit error rate is below a certain threshold. In addition, the trio of synchronization word, compressed data size word, and end-of-frame signature provides a basic mechanism for error detection: they must be consistent with each other. Therefore, it is unnecessary to insert additional error protection capability into audio frames for normal applications. Error protection should be considered as part of channel coding, unless the channel condition is so bad that joint source-channel coding becomes necessary.
13.4.4 Auxiliary Data

Many data elements, such as the name of the band that played the piece of music being encoded or a number that identifies the encoder, which are not part of the audio
data and are user-defined, may be attached to the audio frame as auxiliary data. The decoder can choose to ignore all of them without interrupting the decoding and playback of the audio signal.

The auxiliary data may even provide a mechanism for future extension of the audio coding algorithm. Data for a coder extension may be attached to the original frame as auxiliary data for the corresponding new decoders to decode. This extension in the auxiliary data is simply ignored by older decoders, which do not recognize the extension.
13.5 Implementation on Microprocessors

An important consideration when designing an audio coding algorithm is implementation cost, especially on the decoder side, which is directly linked to encoder/decoder cost and energy consumption. Decoder cost is usually a major factor affecting a consumer's purchase decision. Energy consumption has become an important issue amid the global warming trend and is a critical factor affecting the viability of mobile devices.

Algorithm complexity has a significant impact on implementation cost. In the pursuit of coding performance, there is a tendency to add more and more compression components to an algorithm, usually for mediocre performance gains. An alternative is to maximize the performance of the basic components so as to rein in a bloated algorithm.

The synthesis filter bank and entropy decoding are the two most computationally expensive components in an audio decoder. The cosine modulated filter bank, and the MDCT in particular, is widely used in audio coding largely because it has a structure amenable to fast algorithms. Huffman coding is considered a good balance between coding gain and decoder complexity. If companding is deployed, the expanding on the decoder side may become a computationally expensive operation.
13.5.1 Fitting to Low-Cost Microprocessors

Another implementation aspect that might be overlooked is the suitability of the decoding algorithm for low-cost microprocessors that have no floating-point units and only small on-chip memory. For improved accuracy and performance on such systems, the audio coding algorithm must be designed in such a way that the decoder can be implemented using simple fixed-point arithmetic, without resorting to sophisticated techniques to manage the magnitudes of the variables involved in the decoding process.

The first step toward this end is to limit the dynamic range of all signal components that may be used by the decoding algorithm to within the range that can be represented by the integers of a typical microprocessor, while maintaining the precision required by audio signals. While 16 bits per sample have been
widely used for delivering audio signals in various digital audio systems, most notably the compact disc (CD), the representation of high-fidelity sounds may need more than 20 bits per sample, so 24 bits per sample has been considered more appropriate.

The 24 bits are also amenable to microprocessors that do not implement floating-point arithmetic in hardware and usually have a data width of at least 32 bits. For such processors, the 24 bits for audio signals may be placed in the lower 24 bits of a 32-bit word, leaving the upper 8 bits as headroom for coping with overflow, which may occur during a series of accumulation operations. Any noninteger parameters that the decoder may use, such as the quantization step sizes, must be converted to a 24-bit integer representation. Division, which is extremely expensive on such systems, should be strictly disallowed.

Modern microprocessors usually come with a small amount of fast on-chip internal memory and a large amount of slow off-chip or external memory. If the footprint of the decoding algorithm fits within the fast internal memory, decoding is significantly accelerated. Software development time and reliability may also be improved if managing various accesses to the external memory is avoided. The memory footprint may be reduced by using small entropy and quantization codebooks, as well as a small audio frame size.
13.5.2 Fixed-Point Arithmetic

A real number on a computer may be considered as consisting of an integer and a scale factor. For example, the real number 0.123 can be considered as 123/1,000, where 123 is the integer and 1,000 is the scale factor. If both the integer value and the scale factor are stored, the number is said to be represented as a floating-point number, because the scale factor actually determines the radix point or decimal point. The stored scale factors allow floating-point numbers to have a wide range of values. If the scale factor is not stored but assumed during the whole course of computation, the radix point is fixed, so the real number becomes a fixed-point number.

The computer uses 2's complement to represent integers, so let us consider a binary fixed-point type with f fractional bits and a total of b bits. The corresponding number of integer bits is m = b - f - 1 (the leftmost bit is the sign bit). If the integer value represented by the b bits is x, its fixed-point value is

    x / 2^f.    (13.2)

Since the leftmost bit is the sign bit, the dynamic range of this fixed-point number is

    [ -2^{b-1} / 2^f, (2^{b-1} - 1) / 2^f ].    (13.3)
There are various notations for representing the word length and radix point of a binary fixed-point type. The one most widely used in the signal processing community is the "Q" notation Qm.f and its abbreviation Qf, which assumes a default value of m = 0. For example, Q0.31 describes a type with 0 integer bits and 31 fractional bits stored as a 32-bit 2's complement integer, which may be further abbreviated to Q31. The 24 bits per sample suggested above for representing variables and parameters in an audio coder may be designated as Q8.23 if implemented using a 32-bit integer (b = 32). Using a 24-bit integer, it is Q0.23, or simply Q23.

To add or subtract two fixed-point numbers, it is sufficient to add or subtract the underlying integers:

    x/2^f ± y/2^f = (x ± y)/2^f,    (13.4)

which is exactly the same as integer addition or subtraction. Both operations may overflow, but do not underflow.

For multiplication, however, the result needs to be rescaled,

    (x/2^f) · (y/2^f) = (x y 2^{-f}) / 2^f,    (13.5)

to maintain the same fixed-point representation. The scaling can be conveniently done through binary shifting to the right. While the multiplication of two Qm.f numbers with m > 0 may overflow, this is not true when m = 0, because the product of two fractional numbers remains a fractional number. Multiplication of such numbers may underflow.

Similar to multiplication, the result of dividing two fixed-point numbers needs to be rescaled,

    (x/2^f) ÷ (y/2^f) = (x 2^f / y) / 2^f,    (13.6)

to maintain the same fixed-point representation. The scaling can be conveniently done through binary shifting to the left. Underflow may occur in division.

When implementing fixed-point multiplication and division using integer arithmetic, the intermediate multiplication and division results must be stored in double precision. To maintain accuracy, the intermediate results must be properly rounded before they are converted back to the desired fixed-point format. The C code for implementing fixed-point multiplication is given below (f is the number of fractional bits, assumed to be defined elsewhere; long int must be wide enough to hold the double-precision intermediate results):

#define R  ( 1 << (f-1) )   // Rounding factor

int FixedPointMultiply(int x, int y)
{
    long int temp;   // Double precision for intermediate results
    //
    // Integer multiplication
    temp = (long int)x * (long int)y;
    //
    // Rounding
    temp = (temp > 0) ? temp + R : temp - R;
    //
    // Scaling
    return (int)(temp >> f);
}

The C code for implementing fixed-point division is given below:

int FixedPointDivide(int x, int y)   // return x/y
{
    long int temp;   // Double precision for intermediate results
    //
    // Scaling
    temp = (long int)x << f;
    //
    // Rounding
    temp = (temp > 0) ? temp + (y >> 1) : temp - (y >> 1);
    //
    // Integer division
    return (int)(temp / y);
}
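A brief usage sketch of the two routines is given below, under the assumption that f was defined as 23 (Q23) before they were compiled and that long int is at least 64 bits wide.

#include <stdio.h>

/* 0.5 in Q23 is 1 << 22; 0.5 * 0.5 should give 0.25, i.e., 1 << 21. */
int main(void)
{
    int half = 1 << 22;                              /* 0.5 in Q23     */
    int quarter = FixedPointMultiply(half, half);    /* expect 1 << 21 */
    int back = FixedPointDivide(quarter, half);      /* expect 1 << 22 */
    printf("0.5*0.5  = %d (expect %d)\n", quarter, 1 << 21);
    printf("0.25/0.5 = %d (expect %d)\n", back, 1 << 22);
    return 0;
}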
Chapter 14
Quality Evaluation
Since the removal of perceptual irrelevance through quantization contributes the largest part of compression for a lossy audio coder, significant distortion or impairment is inherent in the reconstructed audio signal that the decoder outputs. The whole promise of perceptual audio coding is based on the assumption that quantization noise can be hidden behind, or masked by, the prominent signal components in such a way that it is inaudible. This promise may be easily delivered when the compression ratio is low. As the compression ratio increases to a certain level, distortion or impairment begins to become audible. The nature of the distortion and the conditions under which it becomes audible depend on a variety of factors, including the audio coding algorithm, the input audio signal, the rendering equipment, and the listening environment. The threshold of inaudible impairment for a million-dollar listening room is dramatically different from that for a portable music player.

One of the goals of audio coding algorithm development is to maximize the compression ratio subject to the constraint that distortion is inaudible or tolerable for the most difficult pieces of audio, using the best rendering equipment, and under the optimal listening environment. This is usually approached by setting the algorithm to work at various bit rates and then minimizing the corresponding coding impairment, which necessitates frequent evaluation of coding impairment. The development process of an audio coder inevitably involves many iterations of trial-and-error to test the performance of the algorithm structure and its subsystems as well as to optimize various parameters of the algorithm. Each of these steps entails the evaluation of the distortion level to identify the right direction for algorithm improvement.

Even after algorithm development is completed at the engineering department, a formal evaluation or characterization of the algorithm is in order before it is launched to the market. During the course of market promotion, suspicious customers may request a formal evaluation. Competitors may make various self-serving claims about audio coding algorithms on the market, which may arguably be based on their internal evaluations.

Audio coding algorithms are often adopted as industry, national, or international standards to facilitate their mass adoption. During the standardization process, many competing algorithms are likely to be proposed, so the standardization
committee needs to order a set of formal evaluations to help determine which algorithm is to be adopted or how to combine good features from various algorithms into a joint one.

All of the various situations above call for an effective and simple measure of coding distortion or impairment. Since the ultimate judge is the human ear, listening tests, or subjective evaluation, are the necessary and final tool for such impairment assessment. Listening tests are, however, usually time-consuming and expensive, so simple objective metrics may serve as intermediate tools for fast and cheap impairment assessment so that algorithm development can evolve on a fast track and the performance of various algorithms can be conveniently compared.
14.1 Objective Metrics

The first requirement for an objective metric is that it correlate well with how the human ear perceives. In addition, it must be simple enough to enable cheap and rapid evaluation.

The first objective metric that comes to mind is the venerable signal-to-noise ratio (SNR). When the distortion is at the same level as the masked threshold, SNR becomes the SMR discussed in Sect. 10.4. Depending on the types of masker and maskee, the SMR threshold may vary from 2 to 28 dB, as shown in Table 10.3. This means that, for some masker–maskee pairs, an SNR as low as 2 dB may be inaudible, but for some others, an SNR as high as 28 dB is necessary to ensure inaudibility. This large variation of up to 26 dB means that SNR does not correlate well with how the human ear perceives, so it is unsuitable for measuring coding distortion.

Apparently, a psychoacoustic model must be deployed to account for how quantization noise is masked by signal components adjacent in both time and frequency. This has led to various algorithms that try to objectively measure perceived audio quality or distortion, including PEAQ (Perceptual Evaluation of Audio Quality), released as ITU-R Recommendation BS.1387 [31] in 1998. Unfortunately, such objective measures still provide impairment values that rarely correlate well with how the human ear perceives.
14.2 Subjective Tests

Given that neither SNR nor the current perceptual objective measures are considered satisfactory in terms of effectiveness, the tedious and often expensive subjective listening test has remained the only effective and final tool. Informal listening tests are conducted by developers at various stages of algorithm development, and formal listening tests are usually ordered as the authoritative assessment of an audio coder's performance.
14.2.1 Double-Blind Principle

A subjective listening test usually involves comparing a coded piece of audio with the original one. A major challenge with such a test is the bias that either the listener or the test administrator may have toward different types of distortion, music, and coders. This problem is often exacerbated by the subjective nature of human perception: a listener may perceive the same piece of music differently at different times, even under the same environment. A common approach to mitigating these problems is the double-blind principle given below:

Neither the listener nor the test administrator knows whether the piece of audio being listened to is the coded signal or the original one during the whole test process.
14.2.2 ABX Test

A simple subjective listening test that is favored by many algorithm developers is the ABX test [7]. In such a test, the original audio signal is usually designated as A and the coded audio signal as B. Then either A or B is randomly presented to the listener as X, thus enforcing the double-blind principle. The listener is asked to identify X as either A or B. The listener has the liberty of listening to A, B, and X at any time before he or she makes the decision. Once the decision is made, the next round of testing begins. After this procedure is repeated many times, the rate of correct identification is calculated. A rate around 50% clearly indicates that the listener cannot reliably perceive the distortion, while a significantly biased rate indicates perceived distortion.
14.2.3 ITU-R BS.1116

For formal evaluation of small impairments in coded audio signals, ITU-R Recommendation BS.1116 is usually the choice [30]. It uses a continuous five-grade impairment scale, the quantized version of which is shown in Table 14.1.
Table 14.1 Five-grade impairment scale used by ITU-R BS.1116

  Impairment description           Grade
  Imperceptible                    5.0
  Perceptible, but not annoying    4.0
  Slightly annoying                3.0
  Annoying                         2.0
  Very annoying                    1.0
The test method may be characterized as "double-blind, triple-stimulus with hidden reference". The listener is presented with three stimuli: the reference signal first, then the reference and the coded signal in random order, thus enforcing the double-blind principle. The listener is asked to assess the small impairment in the latter two and assign a grade to each of them. Since one of the latter two is the reference signal, one of the grades has to be five. A higher grade assigned to the coded signal indicates a false identification. A rate of false identification around 50% clearly indicates that the listener cannot perceive the impairment, hence his or her grading results should be excluded from the final average grade. This procedure may exclude the majority of the listeners, leaving a group of expert listeners. The number of expert listeners in a formal test may vary between 20 and 60. If even the expert listeners cannot perceive any impairment, the audio coding algorithm is called "transparent".

A training session should precede the formal testing to familiarize the listeners with the test procedure, the grading method, the environment, and the coder impairment. To give the listeners an opportunity to learn to appreciate the impairment and the grading scale, coded audio signals with easily perceived impairment may be presented to them.

The listening conditions and equipment used are critical to the final grade value of the test. ITU-R BS.1116 stipulates strict requirements on the test environment, including the speaker configuration, background noise, reverberation time, etc. See [30] for details.

Since audio coders perform differently with audio signals of different characteristics, the selection of test signals or materials is also critical to the final grade value. When a comparison test is performed with competing audio coders, this issue may become highly contentious. The EBU SQAM CD [13] contains a set of audio signals recommended by the EBU for subjective listening tests. It is a selection of signals specifically chosen to reveal the impairments of various audio systems. The castanets and glockenspiel tracks in SQAM are particularly famous for breaking many audio coders.
Chapter 15
DRA Audio Coding Standard
This chapter presents the DRA (Dynamic Resolution Adaptation) audio coding standard [97] as an example to illustrate how to integrate the technologies described in this book into a practical audio coding algorithm. The DRA algorithm has been approved by the Blu-ray Disc Association as part of its BD-ROM 2.3 specification [99] and by the Chinese government as a national standard [100].
15.1 Design Considerations

The core competence of an audio coding algorithm is its coding efficiency, or the compression ratio that it can deliver without audible impairment. While the compression ratio given in (1.2) is a hard metric, audible impairment is a subjective metric which is affected by a variety of factors, including the sound material, the playback device, the listening environment, and the listeners. In addition, for many consumer applications, a certain degree of audible impairment can be tolerated and thus allowed. Therefore, coding efficiency is eventually a subjective term that varies with the desired or tolerable level of audible impairment, which is usually determined by the application.

To deliver a high level of fidelity (a low level of impairment), an audio coding algorithm needs to maintain a signal path of high resolution in both its encoder and decoder. This usually requires fine quantization step sizes and consequently a large number of quantized values, thus demanding more bits to convey both to the decoder. Therefore, it is more difficult to deliver high coding efficiency when the fidelity requirement is high.

To achieve high coding efficiency, an audio coding algorithm needs to consider more complex situations so that it can adapt better to changing signal characteristics. It also needs to deploy more, sometimes complex and even redundant, technologies to squeeze out a little performance gain in some special situations. Therefore, high coding efficiency usually means complex algorithms. However, high algorithmic complexity, especially on the decoder side, results in expensive devices and high power consumption and is, therefore, an impediment
Fig. 15.1 DRA encoder. The solid lines indicate the flow of audio data and the dashed lines the flow of control parameters. The dashed boxes indicate optional modules that do not have to be invoked
to the adoption of the algorithm in the marketplace. Decoder cost is a key factor affecting consumer purchase decisions, and power consumption is becoming a prominent issue in an era of mobile devices and global warming.

When the DRA audio coding algorithm was conceived, the design goal was to deliver high coding efficiency and high audio quality (if the bit rate allows) with minimal decoder complexity. To accomplish this, a principle was set that only necessary modules would be used, that each module would be implemented with minimal impact on decoder complexity, and that a signal path of 24 bits would be maintained when the bit rate allows. This results in the encoder and decoder shown in Figs. 15.1 and 15.2, respectively.
15.2 Architecture

The transient localized MDCT (TLM) discussed in Sect. 11.5 is deployed as a simple approach to optimizing the coding gain for both quasistationary and transient episodes of audio signals. The numbers of taps for the long, short, and virtual windows are 2,048, 256, and 64, respectively. According to (11.16), the effective length of
Fig. 15.2 DRA decoder. The solid lines indicate the flow of audio data and the dashed lines the flow of control parameters. The dashed boxes indicate optional modules that do not have to be invoked
the resultant brief window is 160 taps, corresponding to a period of a mere 3.6 ms for a sample rate of 44.1 kHz. This is well within the 5 ms range that strong premasking typically lasts.

The implementation cost of the inverse TLM is essentially the same as that of the regular switched IMDCT, because both can be accomplished using the same software or hardware implementation. A pseudo C++ implementation example is given in Sect. 11.5.4. The only additional cost of the inverse TLM over the regular switched IMDCT is more complex window sequencing when in short MDCT mode. Fortunately, this window sequencing involves a very limited number of assignment and logic operations, as can be seen in the pseudo C++ code in Sect. 15.4.5.

Only the midtread uniform quantization discussed in Sect. 2.3.2 is deployed to quantize all MDCT coefficients, so the implementation code for inverse quantization is minimal, involving only an integer multiplication of the quantization index with the quantization step size (see (2.22)). The logarithmic companding discussed in Sect. 2.4.2.2 is used to quantize the quantization step sizes with a step size of 0.2, resulting in a total of 116 quantized values that cover a dynamic range of 24 bits for the quantization indexes of the MDCT coefficients. The quantized step size values are further converted into Q.23 fractional format for convenient and exact implementation on fixed-point microprocessors. The resultant quantization table is given in Table A.1. All MDCT coefficients in one quantization unit (see Sect. 13.1.2) share one quantization step size. The critical bands that define the quantization units along the frequency axis are given in various tables in Sect. A.2 for long and short windows and for a variety of sample rates.

The statistics-adaptive codebook assignment discussed in Sect. 13.2.2 is used to improve the efficiency of Huffman coding with small impact on decoding complexity, but with
a potentially significant overhead due to the bits needed to transfer the codebook application scopes. This overhead, however, can be properly managed by the encoder so that the efficiency gain significantly exceeds it. The module in the decoder that reconstructs the number of quantization units involves a simple search to find the lowest critical band that accommodates the highest (in frequency) nonzero quantization index (see Sect. 15.3.3.4 for details), so its implementation cost is very low.

The optional sum/difference coding and joint intensity coding modules implement the simplest methods discussed in Chap. 12 so as to incur the least decoding cost while reaping the gains of joint channel coding. The decision for sum/difference coding and the steering vectors for joint intensity coding are individually based on quantization units. The quantization of the steering vectors for joint intensity coding reuses the quantization step size table given in Table A.1 with a bias, so that the elements of these steering vectors may be larger as well as smaller than one, corresponding to the possibility that the intensities of the joint channels may be stronger or weaker than that of the source channel.

As usual, the transient detection, perceptual model and global bit allocation modules are not specified by the DRA standard and can be implemented by any methods that provide data in formats suitable for DRA encoding.
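Because the quantized step sizes follow a fixed logarithmic rule (a step of 0.2 in the log2 domain, clamped to the 24-bit maximum), the entries of Table A.1 can be regenerated rather than stored. The following stand-alone C++ sketch is illustrative only, not normative DRA code; it reproduces the 116 quantized step size values:

#include <cmath>
#include <cstdio>

int main()
{
    // Regenerate the quantization step sizes of Table A.1: 2^(0.2*i),
    // rounded to the nearest integer and clamped to 2^23 - 1 = 8388607
    // so that the 24-bit signal path is not exceeded.
    for (int i = 0; i < 116; i++) {
        double rStep = std::pow(2.0, 0.2 * i);
        long nStep = (long)(rStep + 0.5);  // round to nearest integer
        if (nStep > 8388607) {
            nStep = 8388607;               // clamp the last entry
        }
        std::printf("%3d: %8ld\n", i, nStep);
    }
    return 0;
}

For example, index 57 yields 2^11.4, which rounds to 2702, the value that also appears in the joint intensity coding bias of Sect. 15.4.2.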
15.3 Bit Stream Format

The DRA frame structure is shown in Table 15.1. It is designed to carry up to 64 normal channels and up to 3 LFE channels, or up to 64.3 surround sound. This capacity far exceeds the 7.1 surround sound enabled by Blu-ray Disc and even the Hamasaki 22.2 surround sound demonstrated by NHK Science and Technical Research Laboratories [62]. The bit stream structure for normal channels is shown in Table 15.2; it basically consists of five sections: window sequencing, codebook assignment, quantization indexes, quantization step sizes, and joint channel coding.

Since no transient should be declared for an LFE channel due to its limited bandwidth (less than 120 Hz), only the long MDCT window WL_L2L or the short MDCT window WS_S2S is allowed to be used, so there is no need to convey window sequencing information. The limited bandwidth of LFE channels also disqualifies
Table 15.1 Bit stream structure for a DRA frame

    Synchronization codeword (0x7FFF)
    Frame header
    Audio data
        Normal channel(s): from 1 up to 64
        LFE channel(s): from 0 up to 3
    End of frame signature
    Auxiliary data
Table 15.2 Bit stream structure for a normal channel

Window sequencing
    MDCT window index
    Number of transient segments
    Lengths of each transient segment
Codebook assignment
    Number of Huffman codebooks
    Application scopes for all Huffman codebooks
    Selection indexes identifying the corresponding Huffman codebooks
Quantization indexes
    Huffman codes for the quantization indexes of all MDCT coefficients,
    arranged based on the application scopes of the Huffman codebooks.
    Quantization indexes in one application scope are coded using the
    corresponding Huffman codebook for that scope.
Quantization step sizes
    Huffman codes for all quantization step sizes, one for each
    quantization unit
Joint channel coding
    One sum/difference coding decision for each quantization unit for
    each channel pair
    One joint intensity coding scale factor for each joint high-frequency
    quantization unit in the joint channel
Table 15.3 Bit stream structure for an LFE channel

Codebook assignment
    Number of Huffman codebooks
    Application scopes for all Huffman codebooks
    Selection indexes identifying the corresponding Huffman codebooks
Quantization indexes
    Huffman codes for the quantization indexes of all MDCT coefficients,
    arranged based on the application scopes of the Huffman codebooks.
    Quantization indexes in one application scope are coded using the
    corresponding Huffman codebook for that scope.
Quantization step sizes
    Huffman codes for all quantization step sizes, one for each
    quantization unit
them from carrying sound imaging information, so joint channel coding does not need to be deployed. Therefore, an LFE channel has the simpler bit stream structure shown in Table 15.3.

The end of frame signature is placed right after the audio data and before the auxiliary data so that the decoder can immediately check whether the end of frame signature is confirmed once all audio data are decoded. This enables error checking without the burden of handling the auxiliary data.
15.3.1 Frame Synchronization

A DRA frame starts with the following 16-bit synchronization codeword:

    nSyncWord = 0x7FFF;
so the decoding of a DRA bit stream should start with a search for this synchronization codeword, as illustrated by the following pseudo C++ code:

SeekSyncWord()
{
    while ( Unpack(16) != 0x7FFF ) {
        continue;
    }
}
where Unpack(nNumBits) is a function that unpacks nNumBits bits from the bit stream. For the code above, nNumBits=16.

Note that the seeking loop above may become an infinite loop if the input stream does not contain nSyncWord and may thus cause the decoder to hang. This must be properly taken care of by aborting the loop and relaunching the SeekSyncWord() routine.

Since nSyncWord=0x7FFF is not prohibited from appearing in the middle of a DRA frame, the SeekSyncWord() function listed above may end up with a false nSyncWord. A function that properly handles this situation should skip nNumWord (see Table 15.4) 32-bit codewords after this nSyncWord and then seek and confirm the next nSyncWord before declaring synchronization. The following pseudo C++ code illustrates this procedure:

Sync()
{
    // Seek the first nSyncWord
    while ( Unpack(16) != 0x7FFF ) {
        continue;
    }
    // Unpack frame header type (0=short, 1=long)
    nFrameHeaderType = Unpack(1);
    // Number of 32-bit codewords in the frame
    nBits4NumWord = (nFrameHeaderType==0) ? 10 : 13;
    nNumWord = Unpack(nBits4NumWord);
    // Skip the rest of the frame; the 16-bit sync word, the 1-bit
    // header type and the nBits4NumWord bits have already been consumed
    Unpack(nNumWord*32 - 16 - 1 - nBits4NumWord);
    // Seek and confirm the second nSyncWord
    while ( Unpack(16) != 0x7FFF ) {
        continue;
    }
}
If the input stream is not a valid DRA stream, nSyncWord may never appear, causing the two search loops to hang forever. This situation must be taken care of by aborting the loops and relaunching the Sync() routine.
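As a sketch of such a safeguard, the search can be bounded by the amount of data available. BitsLeft() below is a hypothetical helper, not part of the DRA pseudo code, that returns the number of unread bits in the input buffer:

SeekSyncWordBounded()
{
    // A bounded variant of SeekSyncWord(). BitsLeft() is an assumed
    // helper; when the buffer is exhausted the function gives up
    // instead of looping forever, and the caller may refill the
    // buffer and relaunch.
    while ( BitsLeft() >= 16 ) {
        if ( Unpack(16) == 0x7FFF ) {
            return true;    // synchronization codeword found
        }
    }
    return false;           // no sync word in the available data
}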
Table 15.4 DRA frame header. If there is a slash in the column under "Number of bits", such as L/R, L is the number of bits for the short frame header and R is for the long frame header. A value of zero means that the codeword is absent in the frame

nSyncWord (16 bits)
    Synchronization codeword = 0x7FFF
nFrameHeaderType (1 bit)
    Indicates the type of DRA frame header
nNumWord (10/13 bits)
    Number of 32-bit codewords starting from the start of the frame
    (nSyncWord) until the end of the frame (end of frame signature)
nNumBlocksPerFrm (2 bits)
    Number of short MDCT blocks of PCM audio samples encoded in the
    frame for each audio channel. The value in the bit stream represents
    the power of two (2^nNumBlocksPerFrm), so the valid values 0, 1, 2
    and 3 correspond to 1, 2, 4 and 8 short MDCT blocks. Since the block
    length of the short MDCT is 128 samples, these values are equivalent
    to 128, 256, 512 and 1,024 PCM samples for each channel, respectively
nSampleRateIndex (4 bits)
    Sample rate index, which is used to look up Table 15.5 to get the
    sample rate of the audio signal carried in the frame
nNumNormalCh (3/6 bits)
    nNumNormalCh+1 is the number of normal channels, so the supported
    number of normal channels is between 1 and 8 for short frames and
    between 1 and 64 for long frames. After its unpacking, it is
    immediately updated to the value of nNumNormalCh+1 so that
    nNumNormalCh represents the actual number of normal channels. This
    rule applies to all other DRA codewords as well
nNumLfeCh (1/2 bits)
    Number of LFE channels. The supported number of LFE channels is
    either 0 or 1 for short frames and between 0 and 3 for long frames
bAuxChCfg (1 bit)
    Indicates whether additional speaker configuration information is
    attached to the end of the frame as the first part of the auxiliary
    data
bUseSumDiff (1/0 bits)
    Indicates whether sum/difference coding is deployed: bUseSumDiff = 0,
    not deployed; bUseSumDiff = 1, deployed. Transferred only for the
    short frame header and when nNumNormalCh > 1
bUseJIC (1/0 bits)
    Indicates whether joint intensity coding is deployed: bUseJIC = 0,
    not deployed; bUseJIC = 1, deployed. Transferred only for the short
    frame header and when nNumNormalCh > 1
nJicCb (5/0 bits)
    Joint intensity coding is deployed for all critical bands starting
    from nJicCb+1 and ending with the last active critical band.
    Transferred only for the short frame header and if bUseJIC == 1
Table 15.5 Supported sample rates

Index    Sample rate (Hz)
0        8,000
1        11,025
2        12,000
3        16,000
4        22,050
5        24,000
6        32,000
7        44,100
8        48,000
9        88,200
10       96,000
11       174,600
12       192,000
13       Reserved
14       Reserved
15       Reserved
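In an implementation, this table is naturally a constant lookup array. The sketch below mirrors Table 15.5; the array name is an assumption, not from the standard, and the three reserved indexes are mapped to 0:

// A minimal lookup sketch of Table 15.5 (anSampleRates is an assumed name)
static const int anSampleRates[16] = {
     8000, 11025, 12000, 16000, 22050, 24000, 32000, 44100,
    48000, 88200, 96000, 174600, 192000, 0, 0, 0
};
int nSampleRate = anSampleRates[nSampleRateIndex];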
15.3.2 Frame Header

There are two types of DRA frame header: the short one is for short frames and the long one for long frames. The type of frame header is indicated by the 1-bit nFrameHeaderType, which comes right after nSyncWord:

    nFrameHeaderType = 0: short frame header;
    nFrameHeaderType = 1: long frame header.
The short frames are intended for lower bit rate DRA streams whose number of channels is limited to no more than 8.1 surround sound and for which joint channel coding may be optionally enabled. The long frames are intended for higher bit rate DRA streams, which may support up to 64.3 surround sound and for which joint channel coding is disallowed. The DRA frame header is described in Table 15.4 and its unpacking from the bit stream is delineated by the following pseudo C++ code:

UnpackFrameHeader()
{
    // Verify sync codeword (16 bits)
    nSyncWord = Unpack(16);
    if ( nSyncWord != 0x7FFF ) {
        throw("Sync codeword not verified.");
    }
    // Type of frame header (0=short, 1=long)
    nFrameHeaderType = Unpack(1);
    // Number of 32-bit codewords in the frame
    if ( nFrameHeaderType == 0 ) {
        // 10 bits for short frame header
        nNumWord = Unpack(10);
    }
    else {
        // 13 bits for long frame header
        nNumWord = Unpack(13);
    }
    // Number of short MDCT blocks per frame (transferred as a power
    // of two, see Table 15.4)
    nNumBlocksPerFrm = 1 << Unpack(2);
    // Sample rate index (see Table 15.5)
    nSampleRateIndex = Unpack(4);
    // Number of normal and LFE channels (bit widths per Table 15.4)
    if ( nFrameHeaderType == 0 ) {
        nNumNormalCh = Unpack(3) + 1;
        nNumLfeCh = Unpack(1);
    }
    else {
        nNumNormalCh = Unpack(6) + 1;
        nNumLfeCh = Unpack(2);
    }
    // Auxiliary speaker configuration flag
    bAuxChCfg = Unpack(1);
    // Joint channel coding flags
    if ( nFrameHeaderType == 0 ) {
        if ( nNumNormalCh>1 ) {
            // Present only for multichannel audio
            // Decision for sum/difference coding
            bUseSumDiff = Unpack(1);
            // Decision for joint intensity coding
            bUseJIC = Unpack(1);
        }
        else {
            // Not present if there is only one channel
            bUseSumDiff = 0;
            bUseJIC = 0;
        }
        // The start critical band that JIC is deployed
        if ( bUseJIC == 1 ) {
            // Present only when JIC is enabled
            nJicCb = Unpack(5)+1;
        }
        else {
            nJicCb = 0;
        }
    }
    else {
        // Joint channel coding is not deployed for long frame
        // headers
        bUseSumDiff = 0;
        bUseJIC = 0;
        nJicCb = 0;
    }
}
Table 15.6 Default speaker configuration (F=front, L=left, R=right, S=surround, C=center, B=back)

nNumNormalCh    Speaker configuration
1               F
2               FL, FR
3               FL, FR, FC
4               FL, FR, SL, SR
5               FL, FR, SL, SR, FC
6               FL, FR, SL, SR, BC, FC
7               FL, FR, SL, SR, BL, BR, FC
8               FL, FR, SL, SR, BL, BR, FC, overhead

Table 15.7 Representation of a few widely used speaker configurations using the default DRA speaker configuration scheme

Speaker configuration    nNumNormalCh    nNumLfeCh
Mono                     1               0
Stereo                   2               0
2.1                      2               1
3.1                      3               1
5.1                      5               1
6.1                      6               1
7.1                      7               1

Table 15.8 Default placement of normal channels in a DRA frame

Ordinal number    Normal channel
0                 Front left
1                 Front right
2                 Surround left
3                 Surround right
4                 Back left
5                 Back right
6                 Front center
7                 Overhead
There is no specific information for speaker configuration in the DRA frame header. Instead, a default set of speaker configurations is used, as laid out in Table 15.6. Table 15.7 shows how a number of widely used speaker configurations are represented under this scheme. If the default scheme is not sufficient or suitable for representing a specific speaker configuration, the bAuxChCfg flag in the frame header should be set to one. This indicates to the decoder that information for a user-defined speaker configuration is placed as the first part of the auxiliary data.
15.3.3 Audio Channels

Right after the frame header come the payloads of an audio frame, which are the bits representing first the normal and then the LFE audio channels. The default placement of normal channels is shown in Table 15.8 for the currently widely used speaker
configurations. The center channel is placed toward the tail to facilitate joint channel coding by letting left and right channel pairs start from the first channel. If any of these channels are not present, the channels below are moved upward to take their places; for a 5.1 configuration (FL, FR, SL, SR, FC), for example, the back channels are absent, so the front center channel moves up to ordinal 4. Any speaker configuration beyond Table 15.8 needs to be specified in the auxiliary data by setting bAuxChCfg to one.

Inside each audio channel, data components are placed as shown in Table 15.2 for normal channels and in Table 15.3 for LFE channels. All these data components are delineated below.
15.3.3.1 Window Sequencing

Window sequencing is indicated by a trio of codewords: the window index, the number of transient segments, and the number of short MDCT blocks in each transient segment, as shown in Table 15.9. Each normal channel has its own independent set of window sequencing codewords. When joint channel coding is deployed, however, all normal channels share the same set of window sequencing codewords, which is transferred in the first channel. Consequently, there are no window sequencing codewords for the other normal channels; they are copied from the first channel.

The codeword nWinIndex is the window index for the current frame, which is used to look up Table 11.8 for the corresponding window label. The interpretation of these labels is discussed in Sect. 11.5.3. Since there are only 13 entries in Table 11.8, nWinIndex > 12 indicates an error.

If nWinIndex indicates a long window (0 <= nWinIndex <= 8), there is no transient in the current frame, so there is only one transient segment, nNumSegments = 1,
Table 15.9 DRA codewords for window sequencing. The number of bits is represented as L/S, where L is the number of bits for long MDCT windows and S is for short windows. A value of zero means that the codeword is absent in the frame

nWinIndex (4/4 bits)
    Window index
nNumSegments (0/2 bits)
    Number of transient segments. Not transferred if a long window is
    used
nNumBlocksInSegment (0/varied bits)
    Number of short MDCT blocks for each transient segment. There are
    (nNumSegments-1) such codewords in the bit stream. The length of the
    last transient segment is not transferred but inferred from
    nNumBlocksPerFrm, the number of short MDCT blocks in a frame. Not
    transferred if a long window is used or nNumSegments=1
Table 15.10 Huffman codebook for encoding the number of short MDCT blocks in a transient segment

Index    Length    Codeword
0        2         0
1        2         3
2        3         3
3        3         2
4        4         9
5        3         5
6        4         8
and it is as long as the whole frame, nNumBlocksInSegment = nNumBlocksPerFrm. Therefore, neither codeword is transferred.

If nWinIndex indicates a short window (9 <= nWinIndex <= 12), the codeword nNumSegments is always transferred. If nNumSegments=1, there is only one transient segment in the whole frame, whose length is again nNumBlocksPerFrm, so there is no need to waste bit resources transferring it. If nNumSegments>1, only the first (nNumSegments-1) nNumBlocksInSegment codeword(s) is (are) transferred, because the length of the last transient segment can be inferred from the fact that the total number of short MDCT blocks of all transient segments in a frame is equal to nNumBlocksPerFrm, the number of short MDCT blocks in the whole frame. The nNumBlocksInSegment codewords are encoded using the Huffman codebook listed in Table 15.10.

The decoding of the codewords above is delineated by the following pseudo C++ code:

UnpackWinSequencing(pSegmentBook, Ch0)
// pSegmentBook: pointer to the Huffman codebook for encoding
//               the number of blocks per transient segment.
// Ch0:          A reference to audio channel zero, which is the
//               source channel if joint channel coding is deployed.
{
    // The left edge of the first transient segment is
    anEdgeSegment[0] = 0;
    //
    if ( nCh==0  // Always appears for the first channel
         || (bUseJIC==false && bUseSumDiff==false)
         // For other normal channels, appears only if no joint
         // channel coding is deployed
       )
    {
        // Unpack window index
        nWinIndex = Unpack(4);
        // Transient segments
        if ( nWinIndex != ANY_LONG_WIN ) {
            // Transferred only for short windows
            // First, unpack the number of transient segments
            nNumSegment = Unpack(2)+1;
            // Number of short MDCT blocks for each transient segment
            if ( nNumSegment >= 2 ) {
                // Transferred only if there are at least two
                // transient segments
                // Loop through the transient segments, except for
                // the last one
                for (nSegment=0; nSegment<nNumSegment-1; nSegment++) {
                    // Huffman-decode the segment length and accumulate
                    // it into the segment edges (the codebook index is
                    // taken here as the length minus one, so that a
                    // segment always has at least one block)
                    anEdgeSegment[nSegment+1] = anEdgeSegment[nSegment]
                                              + HuffDec(pSegmentBook) + 1;
                }
            }
            // The length of the last transient segment is inferred
            // from the total number of short MDCT blocks in the frame
            anEdgeSegment[nNumSegment] = nNumBlocksPerFrm;
        }
        else {
            // A long window: the whole frame is one quasistationary
            // transient segment of one long MDCT block
            nNumSegment = 1;
            anEdgeSegment[1] = 1;
        }
    }
    else {
        // Joint channel coding is deployed: copy the window
        // sequencing of channel zero
        nWinIndex = Ch0.nWinIndex;
        nNumSegment = Ch0.nNumSegment;
        for (nSegment=1; nSegment<=nNumSegment; nSegment++) {
            anEdgeSegment[nSegment] = Ch0.anEdgeSegment[nSegment];
        }
    }
}
When the function above is called, it should be passed a pointer to the Huffman codebook given in Table 15.10, which is further passed to the Huffman decoding function HuffDec().
15.3.3.2 Codebook Assignment

Since DRA deploys statistic-adaptive codebook assignment, the bit stream needs to transfer both the application scope and the selection index for each Huffman codebook that is used to code quantization indexes (see Sect. 13.2.2). In other words, for each Huffman codebook that is deployed to encode a group of quantization indexes, the encoder needs to convey to the decoder the following pair of parameters:

    (application scope, selection index)

where the application scope defines the boundary of the group of quantization indexes that are coded by the Huffman codebook and the selection index identifies the Huffman codebook itself.

The quantization indexes are coded based on transient segments. For a transient segment identified by nSegment, the number of Huffman codebooks needed to code all of its quantization indexes is anHSNumBands[nSegment]. Each of these Huffman codebooks is identified by the following pair of application scope and codebook selection index:

    (mnHSBandEdge[nSegment][nBand], mnHS[nSegment][nBand]),

where nBand goes from 0 to anHSNumBands[nSegment]-1 (see Table 15.11). This is duplicated for all transient segments.

Since the statistic properties of quantization indexes are obviously different for frames with declared transients than for those without, both the application scopes and the selection indexes are coded with different sets of Huffman codebooks, as specified in Table 15.12. Since an application scope could theoretically be as large as the whole frame, which may comprise as many as 1,024 quantization indexes, recursive indexing is deployed to avoid using large Huffman codebooks (see Sect. 9.4). Therefore, the small
Table 15.11 DRA codewords used to represent Huffman codebook assignment for a transient segment identified by nSegment

anHSNumBands[nSegment] (4 bits)
    Number of Huffman codebooks used to encode the quantization indexes
    in the transient segment identified by nSegment
mnHSBandEdge[nSegment][nBand] (variable bits)
    Application scope for the nBand-th Huffman codebook. nBand goes from
    0 to anHSNumBands[nSegment]-1
mnHS[nSegment][nBand] (variable bits)
    Selection index for the corresponding nBand-th Huffman codebook.
    nBand goes from 0 to anHSNumBands[nSegment]-1
Table 15.12 Huffman codebooks for encoding codebook assignments

pCodebook4Scope: Table A.28 (quasistationary), Table A.29 (transient)
    Huffman codebook for encoding application scopes. Recursive indexing
    is deployed to handle large scope values with a small codebook
pCodebook4Selection: Table A.30 (quasistationary), Table A.31 (transient)
    Huffman codebook for encoding the corresponding selection indexes.
    Difference coding is used to improve coding efficiency
Huffman codebooks in Tables A.28 and A.29 are enough to cover any possible application scope values used by the DRA standard. Each unit of application scope encoded in the bit stream corresponds to four quantization indexes, so the decoded value should be multiplied by four to get the application scope in terms of quantization indexes.

The difference between the codebook selection index for the nBand-th application scope and its predecessor,

    nDiff = mnHS[nSegment][nBand] - mnHS[nSegment][nBand-1],

is encoded using one of the Huffman codebooks given in the last row of Table 15.12, depending on the presence or absence of transients in the frame, except for the first one in a transient segment, which is packed directly into the bit stream using four bits. Therefore, on the decoder side, the selection index is obtained as

    mnHS[nSegment][nBand] = nDiff + mnHS[nSegment][nBand-1].

The unpacking of codebook assignment is illustrated by the pseudo C++ code given below. The proper Huffman codebooks specified in Table 15.12 should be passed to this function depending on the presence or absence of transients in the frame. The HuffDecRecursive() function used in the code implements the recursive procedure given in the last paragraph of Sect. 9.4.

UnpackCodebookAssignment(
    pCodebook4Scope,     // pointer to a Huffman codebook for
                         // decoding application scopes
    pCodebook4Selection  // pointer to a Huffman codebook for
                         // decoding selection indexes
)
{
    // The left boundary of the first application scope is
    // always zero
    nLeftBoundary = 0;
    // Loop through transient segments to unpack application
    // scopes
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Number of Huffman codebooks used in this transient segment
        anHSNumBands[nSegment] = Unpack(4);
        // Unpack the application scopes; each decoded unit
        // corresponds to four quantization indexes
        for (nBand=0; nBand<anHSNumBands[nSegment]; nBand++) {
            nLeftBoundary += HuffDecRecursive(pCodebook4Scope)*4;
            mnHSBandEdge[nSegment][nBand] = nLeftBoundary;
        }
        // Unpack the selection indexes
        if ( anHSNumBands[nSegment]>0 ) {
            // The first selection index is transferred directly using
            // four bits
            nLast = Unpack(4);
            mnHS[nSegment][0] = nLast;
            // The rest are difference coded
            for (nBand=1; nBand<anHSNumBands[nSegment]; nBand++) {
                nDiff = HuffDec(pCodebook4Selection);
                nLast = nDiff + nLast;
                mnHS[nSegment][nBand] = nLast;
            }
        }
    }
}
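The recursive procedure itself is specified in Sect. 9.4. As a non-normative sketch of the idea, assuming (as with the quantization indexes in Sect. 15.3.3.3) that the largest value in the codebook serves as an escape that accumulates another multiple of the base:

HuffDecRecursive(pBook)
{
    // A sketch only: nM, the largest decodable value, is assumed to
    // act as an "escape"; each escape adds one more multiple of nM
    nM = GetNumCodewords(pBook) - 1;
    nQuotient = 0;
    nValue = HuffDec(pBook);
    while ( nValue == nM ) {
        nQuotient++;
        nValue = HuffDec(pBook);
    }
    return nQuotient*nM + nValue;
}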
15.3.3.3 Quantization Indexes

Two libraries of Huffman codebooks, shown in Table 15.13, are deployed to code the quantization indexes, one for frames with declared transients and the other for quasistationary frames.

Table 15.13 Huffman codebooks for encoding quantization indexes

Selection index    Dimension    Covered range    Signed    Quasistationary    Transient
0                  0            N/A              N/A       N/A                N/A
1                  4            [-1,1]           Yes       Table A.34         Table A.43
2                  2            [-2,2]           Yes       Table A.35         Table A.44
3                  2            [-4,4]           Yes       Table A.36         Table A.45
4                  2            [-8,8]           Yes       Table A.37         Table A.46
5                  1            [-15,15]         Yes       Table A.38         Table A.47
6                  1            [-31,31]         Yes       Table A.39         Table A.48
7                  1            [-63,63]         Yes       Table A.40         Table A.49
8                  1            [-127,127]       Yes       Table A.41         Table A.50
9                  1            [0,255)          No        Table A.42         Table A.51

Depending on the absence or presence of transients in the current frame, either the fifth or the sixth column of this table is copied to the anQIndexBooks[9] array in the pseudo C++ code given at the end of this section, so that the pQIndexBook pointer can be pointed to the proper Huffman codebook.

When the range of quantization indexes covered by a Huffman codebook is small, block coding is deployed to enhance coding efficiency (see Sects. 8.4.3 and 9.3). The number of quantization indexes that are coded together as a block is indicated as "Dimension" in the table. The encoding and decoding procedures for block codes are delineated in Sect. 9.3.

The quantization index range covered by each Huffman codebook is indicated by "Covered range" in Table 15.13. Most of the Huffman codebooks handle signed quantization indexes, except the last two, namely Tables A.42 and A.51, which are unsigned and cover a range of [0,255). This range is extended to (-255,255) by the transmission of a sign bit for each nonzero quantization index. The exclusion of 255 from the covered range is explained next.

To accommodate the target signal path of 24 bits, recursive indexing is extensively used to handle quantization indexes whose absolute values are beyond [0,255). In particular, the absolute value of a quantization index is decomposed using (9.23) with M = 255, where the remainder r is coded with either Table A.42 or Table A.51, as indicated in Table 15.13, and the quotients q, if any of them in the whole codebook application scope are nonzero, are packed directly into the bit stream. For example, an absolute value of 700 is decomposed into the quotient q = 2 and the remainder r = 190 because 700 = 2*255 + 190.

For a quantization index whose quotient is not zero, the residue first encoded by either Table A.42 or Table A.51 must have the value 255, which actually represents an "escape" indicating that its quotient is not zero. If no escape appears in the whole application scope, no quotients are packed into the bit stream. Otherwise, each escape is followed by a nonzero quotient and a real residue. The real residue is again coded by either Table A.42 or Table A.51.

Denote the number of bits used to pack the nonzero quotients as nBits4Quotient. The values of nBits4Quotient-1 are difference-coded using the Huffman codebook indicated in Table 15.14. The valid range for nBits4Quotient-1 is between 0 and 15. Reconstructing the quotients from the values packed in the bit stream may generate values
Table 15.14 Determination of the Huffman codebook for encoding the number of bits used to pack the quotients generated in recursive indexing of large quantization indexes

Codebook pointer      Quasistationary frame    Transient frame
pQuotientWidthBook    Table A.32               Table A.33
out of this range; any such value must be moved back by taking the remainder against 16.

Sign bits for all nonzero quantization indexes coded as above follow immediately. One bit is packed for each nonzero quantization index, and a negative sign is indicated by the value zero.

The unpacking of all quantization indexes is delineated by the following pseudo C++ code:

UnpackQIndex(
    int anQIndexBooks[9],    // An array that holds pointers to Huffman
                             // codebooks for coding quantization indexes
    int *pQuotientWidthBook  // Pointer to the Huffman codebook for
                             // encoding the number of bits for packing
                             // quotients
)
{
    // Reset the history value for the number of bits for packing
    // quotients
    nLastQuotientWidth = 0;
    // Application scopes are cumulative over the whole frame
    nStart = 0;
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Loop through the codebook application scopes in the segment
        for (nBand=0; nBand<anHSNumBands[nSegment]; nBand++) {
            // The right boundary of the current application scope
            nEnd = mnHSBandEdge[nSegment][nBand];
            // The selection index of the corresponding Huffman codebook
            nHSelect = mnHS[nSegment][nBand];
            if ( nHSelect == 0 ) {
                // Codebook zero: all quantization indexes in this
                // scope are zero
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    anQIndex[nBin] = 0;
                }
                nStart = nEnd;
                continue;
            }
            // Load the codebook indexed by nHSelect
            pQIndexBook = anQIndexBooks[nHSelect];
            // This function returns the number of codewords per
            // dimension in the codebook. The total number of codewords
            // in the codebook is nNumCodes^nDim
            nNumCodes = GetNumCodewords(pQIndexBook);
            // The dimension of the codebook (see Table 15.13);
            // GetDimension is an assumed helper
            nDim = GetDimension(pQIndexBook);
            // Unpack the quantization indexes
            if ( nHSelect == 9 ) {
                // The last Huffman codebook
                // Recursive indexing is used with the last Huffman
                // codebook, so the largest quantization index in this
                // codebook is 255 and represents the "Escape"
                nMaxIndex = nNumCodes-1;    // nMaxIndex=255
                // Decode the residues and count the times that
                // nMaxIndex ("Escape") appears
                nCtr = 0;    // The counter for nMaxIndex appearances
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    anQIndex[nBin] = HuffDec(pQIndexBook);
                    if ( anQIndex[nBin] == nMaxIndex ) {
                        nCtr++;
                    }
                }
                if ( nCtr>0 ) {
                    // This step is necessary only if nMaxIndex has
                    // appeared
                    // Decode the number of bits used to pack the
                    // quotients
                    nBits4Quotient = HuffDec(pQuotientWidthBook);
                    // Add it to the history
                    nLastQuotientWidth += nBits4Quotient;
                    // If out of [0,15], take the remainder to move
                    // it back
                    nLastQuotientWidth = nLastQuotientWidth % 16;
                    // The number of bits used to pack the quotients is
                    nBits4Quotient = nLastQuotientWidth + 1;
                    // Loop through all quantization indexes to unpack
                    // the nonzero quotients and then recover the
                    // absolute values of the quantization indexes
                    for (nBin=nStart; nBin<nEnd; nBin++) {
                        if ( anQIndex[nBin] == nMaxIndex ) {
                            // An "Escape": unpack the nonzero quotient
                            // and decode the real residue
                            nQuotient = Unpack(nBits4Quotient);
                            nResidue = HuffDec(pQIndexBook);
                            anQIndex[nBin] = nQuotient*nMaxIndex
                                           + nResidue;
                        }
                    }
                }
                // Add signs
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    if ( anQIndex[nBin] != 0 ) {
                        // A negative sign is indicated by a zero bit
                        if ( Unpack(1) == 0 ) {
                            anQIndex[nBin] = -anQIndex[nBin];
                        }
                    }
                }
            }
            else if ( nDim>1 ) {
                // Block coding is deployed
                for (nBin=nStart; nBin<nEnd; nBin+=nDim) {
                    // Decode one block code and unfold it into nDim
                    // quantization indexes following the procedure of
                    // Sect. 9.3 (UnfoldBlockCode stands in for that
                    // procedure here)
                    UnfoldBlockCode(HuffDec(pQIndexBook), nDim,
                                    nNumCodes, &anQIndex[nBin]);
                }
            }
            else {
                // One signed quantization index per codeword
                for (nBin=nStart; nBin<nEnd; nBin++) {
                    anQIndex[nBin] = HuffDec(pQIndexBook);
                }
            }
            // Move on to the next application scope
            nStart = nEnd;
        }
    }
}
15.3.3.4 Quantization Step Sizes

There is one quantization step size for each quantization unit, which is defined along the time axis by the boundaries of transient segments and along the frequency axis by the boundaries of critical bands. The critical bands for all supported sample rates are listed in Sect. A.2.

Since audio signals tend to have lower energy at high frequencies, aggressive quantization may cause the quantization indexes for critical bands at high frequencies to become all zero. When this happens, there is no need to transfer quantization step size indexes for those critical bands. For transient segment nSegment, the highest critical band with nonzero quantization indexes is referred to as the maximal active critical band and is denoted as anMaxActCb[nSegment]. This number may be inferred from the cumulative application scopes of the Huffman codebooks that are used to encode the quantization indexes. One such method is illustrated by the following pseudo C++ code:

ReconstructMaxActCb()
{
    // Loop through transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // The highest (in frequency) nonzero quantization index is
        // bounded by the upper edge of the last application scope
        nMaxBin = mnHSBandEdge[nSegment][anHSNumBands[nSegment]-1];
        // The first bin of the segment and the number of short MDCT
        // blocks in it
        nBin0 = anEdgeSegment[nSegment]*128;
        nNumBlocks = anEdgeSegment[nSegment+1]
                   - anEdgeSegment[nSegment];
        // Search for the lowest critical band whose upper edge
        // accommodates nMaxBin
        nBand = 0;
        while ( nBin0 + pnCBEdge[nBand]*nNumBlocks < nMaxBin ) {
            nBand++;
        }
        anMaxActCb[nSegment] = nBand + 1;
    }
}
It is obvious that there are anMaxActCb[nSegment] quantization units in the transient segment indexed by nSegment, each of which has a quantization step size that needs to be conveyed to the decoder, so the number of corresponding quantization step size indexes is also anMaxActCb[nSegment]. The differences between the indexes of successive quantization units are encoded using one of the Huffman codebooks indicated in Table 15.15.
Table 15.15 Determination of the Huffman codebook for encoding the indexes of quantization step sizes

Codebook pointer    Quasistationary frame    Transient frame
pQStepBook          Table A.52               Table A.53
Reconstructing the indexes from these difference values encoded in the bit stream may generate values out of the valid range of [0,115]. Any value out of this range must be moved back into it by taking the remainder. The unpacking operation is delineated by the following pseudo C++ code:

UnpackQStepIndex(pQStepBook)
// pQStepBook: Huffman codebook used to encode the differences
//             between quantization step size indexes
{
    // Reset the quantization step size index history
    nLastQStepIndex = 0;
    // Loop through the transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // One step size index for each active critical band
        for (nBand=0; nBand<anMaxActCb[nSegment]; nBand++) {
            // Decode the difference and add it to the history
            nLastQStepIndex += HuffDec(pQStepBook);
            // Move back into [0,115] by taking the remainder
            nLastQStepIndex = nLastQStepIndex % 116;
            mnQStepIndex[nSegment][nBand] = nLastQStepIndex;
        }
    }
}
15.3.3.5 Sum/Difference Coding Decisions

Whether sum/difference coding is deployed for the current frame is indicated by the codeword bUseSumDiff in the frame header (see Table 15.4). If it is deployed, the only information that sum/difference decoding needs is the decision of whether sum/difference coding is deployed for each paired quantization unit. This decision applies to the corresponding quantization units of both the sum and difference channels, so only one decision is packed into the bit stream for both channels. All these bits are packed as part of the bits for the difference channel, so the unpacking needs to be run only for the difference channel.

After a pair of left and right channels is sum/difference-encoded, the left channel is replaced with the sum channel and the right channel with the difference channel (see Sect. 12.1), so the bits representing the sum/difference coding decisions are actually carried in the right channel. Due to the default placement scheme of normal channels shown in Table 15.8, the right channels are placed as channels 1, 3, 5, ... in the bit stream.
If joint intensity coding is not deployed (nJicCb=0, see Table 15.4), sum/difference coding may be deployed up to the larger of the maximal active critical bands of the sum and difference channels. If joint intensity coding is deployed (nJicCb != 0), this number is limited by nJicCb. Sum/difference coding may also be turned off for a whole transient segment, and this decision is conveyed to the decoder by one bit per transient segment. If it is turned on, there is a one-bit decision for each critical band, or quantization unit, in the transient segment. The unpacking of these decisions is delineated by the following pseudo C++ code:

UnpackSumDff(anMaxActCb4Sum[])
// anMaxActCb4Sum: Maximal active critical bands for the sum
//                 (left) channel
// Note: This function is run only for the difference
//       (right) channel
{
    // Loop through the transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Sum/difference coding may be deployed up to the larger of
        // the maximal active critical bands of both channels
        nMaxCb = __max(anMaxActCb4Sum[nSegment], anMaxActCb[nSegment]);
        if ( nJicCb>0 ) {
            // Otherwise, sum/difference coding is deployed to
            // critical bands before joint intensity coding, so we
            // need to get the smaller of nMaxCb and nJicCb
            nMaxCb = __min(nJicCb, nMaxCb);
        }
        // If it is deployed, unpack the relevant decisions
        if ( nMaxCb>0 ) {
            // Sum/difference coding may be turned off for the whole
            // transient segment, so need to unpack this decision
            anSumDffAllOff[nSegment] = Unpack(1);
            //
            if ( anSumDffAllOff[nSegment] == 0 ) {
                // Not turned off completely, unpack the decision for
                // each quantization unit
                for (nBand=0; nBand<nMaxCb; nBand++) {
                    mnSumDffOn[nSegment][nBand] = Unpack(1);
                }
            }
        }
    }
}
15.3.3.6 Steering Vector for Joint Intensity Coding

If the frame header indicates that joint intensity coding is deployed by setting bUseJIC=1 (see Table 15.4), there is one scale factor for each quantization unit in each joint channel, starting from critical band nJicCb up to the maximal active critical band of the source channel for each transient segment (anMaxActCb[nSegment]). This applies to all normal channels except the first one, which is the source channel. The indexes of these scale factors are packed into the bit stream in exactly the same way as those for the quantization step sizes (see Sect. 15.3.3.4), as delineated by the following pseudo C++ code:

UnpackJicScaleIndex(
    pQStepBook,   // The same Huffman codebook used to code the
                  // quantization step sizes
    anMaxCbCh0[]  // An array that stores the maximal active
                  // critical bands of the source channel
                  // (channel 0)
)
{
    // Reset the scale factor index history
    nLastScaleIndex = 0;
    // Loop through transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // One scale factor index for each jointly coded critical band
        for (nBand=nJicCb; nBand<anMaxCbCh0[nSegment]; nBand++) {
            // Decode the difference and add it to the history
            nLastScaleIndex += HuffDec(pQStepBook);
            // Move back into [0,115] by taking the remainder
            nLastScaleIndex = nLastScaleIndex % 116;
            mnJicScaleIndex[nSegment][nBand] = nLastScaleIndex;
        }
    }
}
15.3.4 Window Sequencing for LFE Channels

Since LFE channels are bandlimited to 120 Hz, it is impossible for transients to occur in such channels. Therefore, an LFE channel always consists of one transient segment whose length is nNumBlocksPerFrm, the total number of short MDCT blocks in the whole frame. If the frame is as long as the long MDCT block, the long MDCT window (WL_L2L) is always deployed. Otherwise, nNumBlocksPerFrm short windows (WS_S2S) are deployed to cover the samples in the frame. The setup of
window sequencing for LFE channels is, therefore, very easy and can be delineated by the following pseudo C++ code:

SetupLfeWinSequencing()
{
    // The left edge of the first transient segment is
    anEdgeSegment[0] = 0;
    //
    if ( nNumBlocksPerFrm == 8 ) {
        // The frame size is equal to the long MDCT block size, so
        // the long window is always used
        nWinIndex = WL_L2L;
        // It is always a quasi-stationary frame, so there is only
        // one transient segment
        nNumSegment = 1;
        // whose length is equal to one long MDCT block
        anEdgeSegment[1] = 1;
    }
    else {
        // The frame is shorter than the long MDCT block, so the
        // short window is always used
        nWinIndex = WS_S2S;
        // It is always a quasi-stationary frame, so there is only
        // one transient segment
        nNumSegment = 1;
        // whose length is equal to the frame size
        anEdgeSegment[1] = nNumBlocksPerFrm;
    }
}
15.3.5 End of Frame Signature

To assist in checking for possible errors that may occur during the decoding process, the encoder packs "1" into all unused bits of the last 32-bit codeword. Upon completing decoding, the decoder should check whether the unused bits in the last 32-bit codeword are all "1". If any of them is not, there is at least one error in the decoding process, and an error handling procedure should be deployed. This is illustrated below:

CheckError()
{
    if ( Unused bits in the last codeword != all 1's ) {
        ErrorHandling();
    }
}
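How many bits are "unused" follows from the current read position. One possible fleshing-out of CheckError(), sketched here with a hypothetical helper GetBitPosition() that returns the number of bits consumed so far (neither it nor the variable names come from the standard), is:

CheckError()
{
    // Number of bits consumed so far (GetBitPosition() is an assumed
    // helper, not part of the DRA pseudo code)
    nBitsRead = GetBitPosition();
    // Bits remaining in the current (last) 32-bit codeword
    nUnusedBits = (32 - nBitsRead % 32) % 32;
    // All of them must have been packed as "1" by the encoder
    while ( nUnusedBits > 0 ) {
        if ( Unpack(1) != 1 ) {
            ErrorHandling();
            return;
        }
        nUnusedBits--;
    }
}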
15.3.6 Auxiliary Data

Auxiliary data are optional and user-defined. They are attached after the 32-bit codeword that contains the end of frame signature, so a decoder can ignore these bits without any impact on normal decoding. However, if bAuxChCfg=1 (see Table 15.4), the frame header mandates that additional information for speaker configuration be attached as the very first entry of the auxiliary data. The specific format for speaker configuration is user-defined.
15.3.7 Unpacking the Whole Frame

Now that all components of a DRA frame listed in Tables 15.1, 15.2, and 15.3 have been delineated, it is fairly easy to put together the unpacking of a whole DRA frame as follows:

UnpackDraFrame()
{
    // Frame synchronization
    Sync();
    // Unpack the frame header
    UnpackFrameHeader();
    // Unpack the normal channels
    for (nCh=0; nCh<nNumNormalCh; nCh++) {
        // Unpack window sequencing
        UnpackWinSequencing();
        // Unpack codebook assignment
        UnpackCodebookAssignment();
        // Unpack quantization indexes
        UnpackQIndex();
        // Reconstruct the maximal active critical bands
        ReconstructMaxActCb();
        // Unpack indexes for quantization step sizes
        UnpackQStepIndex();
        // Sum/difference coding decisions are carried only in the
        // difference (right) channels, i.e., channels 1, 3, 5, ...
        if ( bUseSumDiff==1 && nCh%2==1 ) {
            UnpackSumDff();
        }
        // Scale factor indexes for joint intensity coding
        if ( bUseJIC==1 && nCh>0 ) {
            // Only appear in the joint channels, nCh=0 is the
            // source channel
            UnpackJicScaleIndex();
        }
    }
    // Unpack the LFE channels
    for (nCh=nNumNormalCh; nCh<nNumNormalCh+nNumLfeCh; nCh++) {
        // Set up window sequencing
        SetupLfeWinSequencing();
        // Unpack codebook assignment
        UnpackCodebookAssignment();
        // Unpack quantization indexes
        UnpackQIndex();
        // Reconstruct the maximal active critical bands
        ReconstructMaxActCb();
        // Unpack indexes for quantization step sizes
        UnpackQStepIndex();
    }
    // Check for error (end of frame signature)
    CheckError();
    // Unpack auxiliary data
    UnpackAuxiliaryData();
}
15.4 Decoding

After a DRA frame is unpacked, all components are ready for reconstructing the PCM samples of all channels in the frame. The reconstruction involves the following steps, each of which is described in this section:

- Inverse quantization
- Sum/difference decoding
- Joint intensity decoding
- De-interleaving
- Inverse MDCT
15.4.1 Inverse Quantization

Since DRA deploys only midtread uniform quantization (see Sect. 2.3.2), inverse quantization simply involves the multiplication of the quantization indexes with the corresponding quantization step size (see (2.22)). Since MDCT coefficients are quantized based on quantization units and all coefficients in one unit share one quantization step size, inverse quantization is centered around quantization units and may be delineated as follows:

InverseQuantization(
    aunStepSize[116],  // Quantization step size table
    *pnCBEdge          // Pointer to a critical band table
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // The first bin (MDCT coefficient) of the transient segment
        nBin0 = anEdgeSegment[nSegment]*128;
        // The number of short MDCT blocks in the transient segment is
        nNumBlocks = anEdgeSegment[nSegment+1]
                   - anEdgeSegment[nSegment];
        // Loop through all active critical bands
        nStart = nBin0;
        for (nBand=0; nBand<anMaxActCb[nSegment]; nBand++) {
            // The quantization step size shared by all coefficients
            // in this quantization unit
            nStepSize = aunStepSize[mnQStepIndex[nSegment][nBand]];
            // Due to interleaving, the quantization unit occupies
            // nNumBlocks consecutive bins per subband
            nEnd = nBin0 + pnCBEdge[nBand]*nNumBlocks;
            // Multiply each quantization index with the step size
            for (nBin=nStart; nBin<nEnd; nBin++) {
                afMDCT[nBin] = anQIndex[nBin]*nStepSize;
            }
            nStart = nEnd;
        }
    }
}
When the function above is called, aunStepSize should be passed a pointer to Table A.1 and pnCBEdge a pointer to a critical band table in Sect. A.2 based on the sample rate and window size.
15.4.2 Joint Intensity Decoding

When joint intensity coding is deployed, the source channel is always the first channel and all other normal channels are the joint ones. There is one scale factor for each quantization unit of each such joint channel, so the decoding simply involves multiplying the scale factor with the reconstructed MDCT coefficients of the source channel (see (12.9)). The DRA standard reuses the quantization step size table (Table A.1) for JIC scale factors with a bias of

    rJicScaleBias = 1/aunStepSize[57] = 1/2702

to make up for the need that JIC scale factors can be either larger or smaller than one.
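In other words, a scale factor index decodes to its table entry scaled by the bias, so an index of 57 yields exactly one; indexes above 57 give factors larger than one and indexes below 57 smaller than one. A minimal sketch (variable names assumed):

// Reconstruct one JIC scale factor from its index
rScale = aunStepSize[mnJicScaleIndex[nSegment][nBand]] / 2702.0;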
Joint intensity decoding may be delineated by the following pseudo C++ code:

DecJIC(
    srcCh,             // A reference to the source channel object
                       // (channel 0)
    aunStepSize[116],  // Quantization step size table
    *pnCBEdge          // Pointer to a critical band table
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // The first bin of the transient segment and the number of
        // short MDCT blocks in it
        nBin0 = anEdgeSegment[nSegment]*128;
        nNumBlocks = anEdgeSegment[nSegment+1]
                   - anEdgeSegment[nSegment];
        // Loop through the jointly coded critical bands
        for (nBand=nJicCb; nBand<srcCh.anMaxActCb[nSegment]; nBand++) {
            // Reconstruct the scale factor with the bias applied
            rScale = aunStepSize[mnJicScaleIndex[nSegment][nBand]]
                     *rJicScaleBias;
            // Scale the reconstructed MDCT coefficients of the
            // source channel into this joint channel
            for (nBin = nBin0 + pnCBEdge[nBand-1]*nNumBlocks;
                 nBin < nBin0 + pnCBEdge[nBand]*nNumBlocks; nBin++) {
                afMDCT[nBin] = srcCh.afMDCT[nBin]*rScale;
            }
        }
    }
}
When the function above is called, aunStepSize should be passed a pointer to Table A.1 and pnCBEdge a pointer to a critical band table in Sect. A.2 based on the sample rate and window size.
15.4.3 Sum/Difference Decoding

Sum/difference decoding simply involves the proper application of (12.3) and (12.4) to the corresponding pair of reconstructed MDCT coefficients. The data structure for
sum/difference coding is described in Sect. 15.3.3.5. The following pseudo C++ code describes the complete decoding steps:

DecSumDff(
    LeftCh,    // A reference to the sum (left) channel object.
               // Note: This function is run only for the difference
               // (right) channels.
    *pnCBEdge  // Pointer to a critical band table.
)
{
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // Sum/difference coding may be deployed up to the larger of
        // the maximal active critical bands of both channels
        nMaxCb = __max(LeftCh.anMaxActCb[nSegment],
                       anMaxActCb[nSegment]);
        if ( nJicCb>0 ) {
            // Otherwise, sum/difference coding is deployed to
            // critical bands before joint intensity coding, so we
            // need to get the smaller of nMaxCb and nJicCb
            nMaxCb = __min(nJicCb, nMaxCb);
        }
        // The first bin of the transient segment and the number of
        // short MDCT blocks in it
        nBin0 = anEdgeSegment[nSegment]*128;
        nNumBlocks = anEdgeSegment[nSegment+1]
                   - anEdgeSegment[nSegment];
        // Loop through active critical bands that are
        // sum/difference coded
        if ( anSumDffAllOff[nSegment] == 0 ) {
            for (nBand=0; nBand<nMaxCb; nBand++) {
                if ( mnSumDffOn[nSegment][nBand] == 1 ) {
                    // Apply (12.3) and (12.4) to each pair of
                    // reconstructed MDCT coefficients
                    for (nBin = (nBand==0) ? nBin0
                              : nBin0 + pnCBEdge[nBand-1]*nNumBlocks;
                         nBin < nBin0 + pnCBEdge[nBand]*nNumBlocks;
                         nBin++) {
                        fSum = LeftCh.afMDCT[nBin];
                        fDff = afMDCT[nBin];
                        LeftCh.afMDCT[nBin] = fSum + fDff;  // left
                        afMDCT[nBin]        = fSum - fDff;  // right
                    }
                }
            }
        }
    }
}
15.4.4 De-Interleaving

For a frame in which the long MDCT is deployed, the 1,024 MDCT coefficients are naturally placed from low frequency to high frequency. For a frame in which the short MDCT is deployed, the MDCT coefficients are said to be naturally placed if the 128 coefficients from the first short MDCT block are placed from low frequency to high frequency, then the 128 coefficients from the second block, the third block, and so on until the last block of the frame. Under this placement scheme, MDCT coefficients corresponding to the same frequency in succeeding MDCT blocks are 128 coefficients apart.

To facilitate quantization and entropy coding, however, the DRA encoder interleaves these coefficients within each transient segment based on frequency: the MDCT coefficients of all MDCT blocks within the transient segment corresponding to the same frequency are placed right next to each other in their respective time order, and these groups are then laid out from low frequency to high frequency. This interleaving operation is applied to each transient segment in the frame. Under this placement scheme, MDCT coefficients from the same MDCT block with succeeding frequencies are nNumBlocks coefficients apart within each transient segment, where nNumBlocks is the number of short MDCT blocks in the transient segment.

When it is time to do the inverse short MDCT, this interleaving must be reverted. The following pseudo C++ code illustrates how this may be accomplished:

DeInterleave()
{
    // Index to the naturally ordered coefficient
    p = 0;
    // Loop through all transient segments
    for (nSegment=0; nSegment<nNumSegment; nSegment++) {
        // The first bin of the transient segment and the number of
        // short MDCT blocks in it
        nStart = anEdgeSegment[nSegment]*128;
        nNumBlocks = anEdgeSegment[nSegment+1]
                   - anEdgeSegment[nSegment];
        // The first interleaved coefficient of the current block
        nBin0 = nStart;
        // Loop through the short MDCT blocks in the segment
        for (nBlock=0; nBlock<nNumBlocks; nBlock++) {
            // Within one block, coefficients with succeeding
            // frequencies are nNumBlocks bins apart
            for (nBin=nBin0; nBin<nStart+128*nNumBlocks;
                 nBin+=nNumBlocks) {
                afBuffer[p++] = afMDCT[nBin];
            }
            // Move on to the first interleaved coefficient in the
            // next short MDCT block
            nBin0++;
        }
    }
    // Copy the naturally ordered coefficients back for the
    // inverse MDCT
    for (nBin=0; nBin<p; nBin++) {
        afMDCT[nBin] = afBuffer[nBin];
    }
}
15.4.5 Window Sequencing

As discussed in Sect. 11.5.3, the window index unpacked from the bit stream points to the actual window for the long MDCT, so it can be directly used to perform the inverse long MDCT. For frames processed with the short MDCT, the window index unpacked from the bit stream encodes the existence or absence of a transient in the first block of the current and subsequent frames, as shown in Table 11.7. The particular window that should be applied to each short MDCT block must be determined using the method outlined in Sect. 11.5.2.2, which is further delineated by the following pseudo C++ code:

SetupShortWinType(
    int nWinLast,     // Window index for the last MDCT block of the
                      // preceding frame
    int nWinIndex,    // Window index unpacked from the bit stream
    int anWinShort[]  // An array that holds the window index for
                      // each MDCT block of the current frame. It
                      // holds the result of this function
)
{
    int nSegment, nBlock;
    // First MDCT block
    if ( nWinIndex==WS_S2S || nWinIndex==WS_S2B ) {
        // The first half of nWinIndex unpacked from the bit stream
        // indicates that there is no transient in the first MDCT
        // block, so set its window index to WS_S2S
        anWinShort[0] = WS_S2S;
        // Make sure that the first half of this window matches the
        // second half of the last window in the preceding frame
        switch ( nWinLast ) {
        case WS_B2B:
            // The second half of the last window is "brief", so
            // the first half must be changed to "brief" also
            anWinShort[0] = WS_B2S;
            break;
        case WL_L2S:
        case WL_S2S:
        case WL_B2S:
        case WS_S2S:
        case WS_B2S:
            // The second half of the last window is "short",
            // which matches the first half of the current one, so
            // nothing needs to be done
            break;
        default:
            // Any other window shape for the second half of the last
            // window is disallowed, so flag error
            throw("The second half of the last window in the preceding frame must be of either the short or the brief window shape");
        }
    }
    else {
        // The first half of nWinIndex indicates that there is a
        // transient in the first MDCT block, so set its window
        // index to WS_B2B
        anWinShort[0] = WS_B2B;
        // The first half of this window must match the second half
        // of the last window in the preceding frame
        switch ( nWinLast ) {
        case WS_B2B:
        case WS_S2B:
        case WL_L2B:
        case WL_B2B:
        case WL_S2B:
            // The second half of the last window is "brief",
            // which matches the first half of the current one, so
            // nothing needs to be done
            break;
        default:
            // Any other window shape for the second half of the last
            // window is disallowed, so flag error
            throw("The second half of the last window in the preceding frame must be of the brief window shape");
        }
    }
    // Initialize windows for the other MDCT blocks to WS_S2S
    for (nBlock=1; nBlock<nNumBlocksPerFrm; nBlock++) {
        anWinShort[nBlock] = WS_S2S;
    }
    // If the first block holds a transient, the first half of the
    // second window must be "brief" to match its second half
    if ( anWinShort[0] == WS_B2B ) {
        if ( nNumBlocksPerFrm > 1 ) {
            anWinShort[1] = WS_B2S;
        }
    }
    // Set window halves next to a WS_B2B window to "brief"
    for (nSegment=1; nSegment<nNumSegment; nSegment++) {
        // The block at the start of a transient segment contains a
        // transient, so it is windowed with WS_B2B
        nBlock = anEdgeSegment[nSegment];
        anWinShort[nBlock] = WS_B2B;
        // The second half of the preceding window must be "brief"
        switch ( anWinShort[nBlock-1] ) {
        case WS_S2S:
            anWinShort[nBlock-1] = WS_S2B;
            break;
        case WS_B2S:
            anWinShort[nBlock-1] = WS_B2B;
            break;
        case WS_B2B:
            // Already "brief", so nothing needs to be done
            break;
        default:
            // The only other case, which is WS_S2B, should not happen
            // by itself, so flag error
            throw("WS_S2B should not happen by itself.");
        }
        // The first half of the succeeding window, if any, must be
        // "brief"
        if ( nBlock+1 < nNumBlocksPerFrm ) {
            if ( anWinShort[nBlock+1] == WS_S2S ) {
                anWinShort[nBlock+1] = WS_B2S;
            }
        }
    }
}
15.4.6 Inverse TLM

The implementation of the inverse TLM was discussed in Sect. 11.5.4; it essentially uses the same procedure given in Sect. 11.3.3.
15.4.7 Decoding the Whole Frame

Now that all components for decoding a DRA frame have been described, the following pseudo C++ code illustrates how to put them together to decode a complete DRA frame:

DecodeDraFrame()
{
    // Unpack the whole DRA frame
    UnpackDraFrame();
    // Inverse quantization for both normal and LFE channels
    for (nCh=0; nCh<nNumNormalCh+nNumLfeCh; nCh++) {
        m_pCh[nCh].InverseQuantization();
    }
    // Sum/difference decoding, run only for the difference (right)
    // channels
    if ( bUseSumDiff==1 ) {
        for (nCh=1; nCh<nNumNormalCh; nCh+=2) {
            m_pCh[nCh].DecSumDff(m_pCh[nCh-1]);
        }
    }
    // Joint intensity decoding; channel 0 is the source channel
    if ( bUseJIC==1 ) {
        for (nCh=1; nCh<nNumNormalCh; nCh++) {
            m_pCh[nCh].DecJIC(m_pCh[0]);
        }
    }
    // De-interleaving and inverse TLM for all channels
    for (nCh=0; nCh<nNumNormalCh+nNumLfeCh; nCh++) {
        // De-interleave the short MDCT coefficients
        m_pCh[nCh].DeInterleave();
        // Inverse TLM
        m_pCh[nCh].SwitchMDCTI();
    }
}
15.5 Formal Listening Tests

During its standardization process, the DRA audio coding algorithm went through five rounds of ITU-R BS.1116 [30] compliant subjective listening tests (see Sect. 14.2.3). The test grades are shown in Table 15.16. To stress the algorithm, only the constant inter-frame bit allocation scheme in (13.1) was allowed in all tests: no frame could use more than the constant number of bits determined by the formula, and unused bits in all frames were discarded, with no bit reservoir allowed.

The first test was conducted in August 2004 by the National Testing and Inspection Center for Radio and TV Products (NTICRT) under the Ministry of Information Industry, China. Ten stereo sound tracks, selected mostly from the SQAM CD [14], and five 5.1 surround sound tracks were used in the test. The test subjects were all expert listeners, consisting of conductors, musicians, recording engineers, and audio engineers.

The other four tests were performed by the State Lab for DTV System Testing (SLDST) under the State Administration for Radio, Film and TV, China. Other than a few Chinese sound tracks, most of the test materials were selected from the SQAM CD and a group of surround sound tracks used by EBU and MPEG, including Pitch pipe, Harpsichord, and Elloit1. Some of the most famous Chinese conductors, musicians, and recording engineers were among the test subjects, but senior and graduate students from the School of Recording Arts, Communication University of China, and the Department of Sound Recording, Beijing Film Academy, were found to possess better acuity due to their age and training, so they became the majority in later tests.

The last test, while conducted by SLDST, was actually ordered and supervised by China Central TV (CCTV) as part of its DTV standard evaluation program. CCTV
Table 15.16 Grades for ITU-R BS.1116 compliant subjective listening tests

Lab       Date       Stereo @ 128 kbps    5.1 @ 320 kbps    5.1 @ 384 kbps
NTICRT    08/2004    4.6                  4.7
SLDST     10/2004    4.2                  4.0
SLDST     01/2005    4.1                  4.2
SLDST     07/2005    4.7                  4.5
SLDST     08/2006                         4.9               4.6
was only interested in surround sound, so DRA was tested at 384 kbps as well as 320 kbps. This test was conducted in comparison with two major international codecs, and DRA came out as the clear winner.

The test results from SLDST are more consistent because the same group of engineers performed the tests in the same lab and the listener pool was largely unchanged. The later the test, the more difficult it became, because the test administrators and many listeners had learned the characteristics of DRA-processed audio and knew where to look for distortions. The gradually increasing test grades reflect continuous improvement to the encoder, especially to its perceptual model and transient detection unit.
Appendix A
Large Tables
A.1 Quantization Step Size
Table A.1 Quantization step sizes. Each row lists the step sizes for indexes Index+0 through Index+4

Index    +0         +1         +2         +3         +4
0        1          1          1          2          2
5        2          2          3          3          3
10       4          5          5          6          7
15       8          9          11         12         14
20       16         18         21         24         28
25       32         37         42         49         56
30       64         74         84         97         111
35       128        147        169        194        223
40       256        294        338        388        446
45       512        588        676        776        891
50       1024       1176       1351       1552       1783
55       2048       2353       2702       3104       3566
60       4096       4705       5405       6208       7132
65       8192       9410       10809      12417      14263
70       16384      18820      21619      24834      28526
75       32768      37641      43238      49667      57052
80       65536      75281      86475      99334      114105
85       131072     150562     172951     198668     228210
90       262144     301124     345901     397336     456419
95       524288     602249     691802     794672     912838
100      1048576    1204498    1383604    1589344    1825677
105      2097152    2408995    2767209    3178688    3651354
110      4194304    4817990    5534417    6357376    7302707
115      8388607
A.2 Critical Bands for Short and Long MDCT
Table A.2 Critical bands for long MDCT at 8,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                26                       101.563
1                53                       207.031
2                80                       312.5
3                108                      421.875
4                138                      539.063
5                170                      664.063
6                204                      796.875
7                241                      941.406
8                282                      1101.56
9                328                      1281.25
10               381                      1488.28
11               443                      1730.47
12               517                      2019.53
13               606                      2367.19
14               715                      2792.97
15               848                      3312.5
16               1010                     3945.31
17               1024                     4000
Table A.3 Critical bands for short MDCT at 8,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        125
1                8                        250
2                12                       375
3                16                       500
4                20                       625
5                25                       781.25
6                30                       937.5
7                36                       1125
8                42                       1312.5
9                49                       1531.25
10               57                       1781.25
11               67                       2093.75
12               79                       2468.75
13               94                       2937.5
14               112                      3500
15               128                      4000
Table A.4 Critical bands for long MDCT at 11,025 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                19                       102.283
1                39                       209.949
2                59                       317.615
3                80                       430.664
4                102                      549.097
5                125                      672.913
6                150                      807.495
7                177                      952.844
8                207                      1114.34
9                241                      1297.38
10               280                      1507.32
11               326                      1754.96
12               380                      2045.65
13               446                      2400.95
14               526                      2831.62
15               625                      3364.56
16               745                      4010.56
17               888                      4780.37
18               1024                     5512.5
Table A.5 Critical bands for short MDCT at 11,025 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        172.266
1                8                        344.531
2                12                       516.797
3                16                       689.063
4                20                       861.328
5                24                       1033.59
6                28                       1205.86
7                33                       1421.19
8                39                       1679.59
9                46                       1981.05
10               54                       2325.59
11               64                       2756.25
12               76                       3273.05
13               91                       3919.04
14               109                      4694.24
15               128                      5512.5
Table A.6 Critical bands for long MDCT at 12,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                18                       105.469
1                36                       210.938
2                54                       316.406
3                73                       427.734
4                93                       544.922
5                114                      667.969
6                137                      802.734
7                162                      949.219
8                190                      1113.28
9                221                      1294.92
10               257                      1505.86
11               299                      1751.95
12               349                      2044.92
13               409                      2396.48
14               483                      2830.08
15               574                      3363.28
16               684                      4007.81
17               815                      4775.39
18               968                      5671.88
19               1024                     6000
Table A.8 Critical bands for long MDCT at 16,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                13                       101.563
1                27                       210.938
2                41                       320.313
3                55                       429.688
4                70                       546.875
5                86                       671.875
6                103                      804.688
7                122                      953.125
8                143                      1117.19
9                167                      1304.69
10               194                      1515.63
11               226                      1765.63
12               264                      2062.5
13               310                      2421.88
14               366                      2859.38
15               435                      3398.44
16               519                      4054.69
17               618                      4828.13
18               734                      5734.38
19               870                      6796.88
20               1024                     8000
Table A.7 Critical bands for short MDCT at 12,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        187.5
1                8                        375
2                12                       562.5
3                16                       750
4                20                       937.5
5                24                       1125
6                28                       1312.5
7                33                       1546.88
8                39                       1828.13
9                46                       2156.25
10               55                       2578.13
11               66                       3093.75
12               79                       3703.13
13               95                       4453.13
14               113                      5296.88
15               128                      6000
Table A.9 Critical bands for short MDCT at 16,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        250
1                8                        500
2                12                       750
3                16                       1000
4                20                       1250
5                24                       1500
6                28                       1750
7                33                       2062.5
8                39                       2437.5
9                47                       2937.5
10               56                       3500
11               67                       4187.5
12               80                       5000
13               95                       5937.5
14               113                      7062.5
15               128                      8000
Table A.10 Critical bands for long MDCT at 22,050 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                10                       107.666
1                20                       215.332
2                30                       322.998
3                41                       441.431
4                52                       559.863
5                64                       689.063
6                77                       829.028
7                91                       979.761
8                107                      1152.03
9                125                      1345.83
10               146                      1571.92
11               170                      1830.32
12               199                      2142.55
13               234                      2519.38
14               277                      2982.35
15               330                      3552.98
16               394                      4242.04
17               469                      5049.54
18               557                      5997
19               661                      7116.72
20               790                      8505.62
21               968                      10422.1
22               1024                     11025
Table A.12 Critical bands for long MDCT at 24,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                9                        105.469
1                18                       210.938
2                27                       316.406
3                37                       433.594
4                47                       550.781
5                58                       679.688
6                70                       820.313
7                83                       972.656
8                97                       1136.72
9                113                      1324.22
10               132                      1546.88
11               154                      1804.69
12               180                      2109.38
13               212                      2484.38
14               251                      2941.41
15               299                      3503.91
16               357                      4183.59
17               425                      4980.47
18               505                      5917.97
19               599                      7019.53
20               716                      8390.63
21               875                      10253.9
22               1024                     12000
Table A.11 Critical bands for short MDCT at 22,050 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        344.531
1                8                        689.063
2                12                       1033.59
3                16                       1378.13
4                20                       1722.66
5                24                       2067.19
6                29                       2497.85
7                35                       3014.65
8                42                       3617.58
9                51                       4392.77
10               61                       5254.1
11               73                       6287.7
12               87                       7493.55
13               105                      9043.95
14               128                      11025

Table A.13 Critical bands for short MDCT at 24,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        375
1                8                        750
2                12                       1125
3                16                       1500
4                20                       1875
5                24                       2250
6                29                       2718.75
7                35                       3281.25
8                42                       3937.5
9                51                       4781.25
10               61                       5718.75
11               73                       6843.75
12               87                       8156.25
13               106                      9937.5
14               128                      12000
Table A.14 Critical bands for long MDCT at 32,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                7                        109.375
1                14                       218.75
2                21                       328.125
3                28                       437.5
4                36                       562.5
5                44                       687.5
6                53                       828.125
7                63                       984.375
8                74                       1156.25
9                86                       1343.75
10               100                      1562.5
11               117                      1828.13
12               137                      2140.63
13               161                      2515.63
14               191                      2984.38
15               227                      3546.88
16               271                      4234.38
17               323                      5046.88
18               384                      6000
19               456                      7125
20               545                      8515.63
21               668                      10437.5
22               868                      13562.5
23               1024                     16000
Table A.16 Critical bands for long MDCT at 44,100 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                5                        107.666
1                10                       215.332
2                15                       322.998
3                21                       452.197
4                27                       581.396
5                33                       710.596
6                40                       861.328
7                47                       1012.06
8                55                       1184.33
9                64                       1378.13
10               75                       1614.99
11               88                       1894.92
12               103                      2217.92
13               122                      2627.05
14               145                      3122.31
15               173                      3725.24
16               207                      4457.37
17               247                      5318.7
18               293                      6309.23
19               348                      7493.55
20               418                      9000.88
21               519                      11175.7
22               695                      14965.6
23               1024                     22050
Table A.15 Critical bands for short MDCT at 32,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        500
1                8                        1000
2                12                       1500
3                16                       2000
4                20                       2500
5                24                       3000
6                29                       3625
7                35                       4375
8                42                       5250
9                50                       6250
10               60                       7500
11               73                       9125
12               91                       11375
13               123                      15375
14               128                      16000
Table A.17 Critical bands for short MDCT at 44,100 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        689.063
1                8                        1378.13
2                12                       2067.19
3                16                       2756.25
4                20                       3445.31
5                24                       4134.38
6                29                       4995.7
7                35                       6029.3
8                42                       7235.16
9                51                       8785.55
10               63                       10852.7
11               84                       14470.3
12               128                      22050
Table A.18 Critical bands for long MDCT at 48,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                5                        117.188
1                10                       234.375
2                15                       351.563
3                20                       468.75
4                25                       585.938
5                31                       726.563
6                37                       867.188
7                44                       1031.25
8                52                       1218.75
9                61                       1429.69
10               71                       1664.06
11               83                       1945.31
12               98                       2296.88
13               116                      2718.75
14               138                      3234.38
15               165                      3867.19
16               197                      4617.19
17               235                      5507.81
18               279                      6539.06
19               332                      7781.25
20               401                      9398.44
21               503                      11789.1
22               692                      16218.8
23               1024                     24000
Table A.20 Critical bands for long MDCT at 88,200 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        172.266
1                8                        344.531
2                12                       516.797
3                16                       689.063
4                20                       861.328
5                24                       1033.59
6                28                       1205.86
7                33                       1421.19
8                39                       1679.59
9                46                       1981.05
10               54                       2325.59
11               64                       2756.25
12               76                       3273.05
13               91                       3919.04
14               109                      4694.24
15               130                      5598.63
16               155                      6675.29
17               185                      7967.29
18               224                      9646.88
19               283                      12187.8
20               397                      17097.4
21               799                      34410.1
22               1024                     44100
Table A.19 Critical bands for short MDCT at 48,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        750
1                8                        1500
2                12                       2250
3                16                       3000
4                20                       3750
5                24                       4500
6                29                       5437.5
7                35                       6562.5
8                42                       7875
9                51                       9562.5
10               65                       12187.5
11               92                       17250
12               128                      24000

Table A.21 Critical bands for short MDCT at 88,200 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        1378.13
1                8                        2756.25
2                12                       4134.38
3                16                       5512.5
4                20                       6890.63
5                24                       8268.75
6                30                       10335.9
7                39                       13436.7
8                59                       20327.3
9                128                      44100
Table A.22 Critical bands for long MDCT at 96,000 Hz sample rate

Critical band    Upper edge (subbands)    Upper edge (Hz)
0                4                        187.5
1                8                        375
2                12                       562.5
3                16                       750
4                20                       937.5
5                24                       1125
6                28                       1312.5
7                33                       1546.88
8                39                       1828.13
9                46                       2156.25
10               55                       2578.13
11               66                       3093.75
12               79                       3703.13
13               95                       4453.13
14               113                      5296.88
15               134                      6281.25
16               160                      7500
17               193                      9046.88
18               240                      11250
19               323                      15140.6
20               546                      25593.8
21               1024                     48000
Table A.23 Critical bands for short MDCT at 96000 Hz sample rate
Critical Band  Upper Edge (Subbands)  Upper Edge (Hz)
0  4  1500
1  8  3000
2  12  4500
3  16  6000
4  20  7500
5  25  9375
6  32  12000
7  45  16875
8  89  33375
9  128  48000

Table A.24 Critical bands for long MDCT at 174600 Hz sample rate
Critical Band  Upper Edge (Subbands)  Upper Edge (Hz)
0  4  341.016
1  8  682.031
2  12  1023.05
3  16  1364.06
4  20  1705.08
5  24  2046.09
6  29  2472.36
7  35  2983.89
8  42  3580.66
9  51  4347.95
10  61  5200.49
11  73  6223.54
12  87  7417.09
13  105  8951.66
14  130  11083
15  174  14834.2
16  288  24553.1
17  1024  87300

Table A.25 Critical bands for short MDCT at 174600 Hz sample rate
Critical Band  Upper Edge (Subbands)  Upper Edge (Hz)
0  4  2728.13
1  8  5456.25
2  12  8184.38
3  16  10912.5
4  22  15004.7
5  37  25235.2
6  128  87300
Table A.26 Critical bands for long MDCT at 192000 Hz sample rate
Critical Band  Upper Edge (Subbands)  Upper Edge (Hz)
0  4  375
1  8  750
2  12  1125
3  16  1500
4  20  1875
5  24  2250
6  29  2718.75
7  35  3281.25
8  42  3937.5
9  51  4781.25
10  61  5718.75
11  73  6843.75
12  87  8156.25
13  106  9937.5
14  136  12750
15  197  18468.8
16  465  43593.8
17  1024  96000
Table A.27 Critical bands for short MDCT at 192000 Hz sample rate
Critical Band  Upper Edge (Subbands)  Upper Edge (Hz)
0  4  3000
1  8  6000
2  12  9000
3  16  12000
4  23  17250
5  47  35250
6  128  96000
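In every table above, the Hz column follows directly from the subband column: the long MDCT splits the band from 0 to half the sample rate into 1024 uniform subbands and the short MDCT into 128, so a critical band whose upper edge is subband k ends at k(fs/2)/N Hz. The small C sketch below is a reader's aid for checking the tables, not part of any standard text, and the function name is ours:

#include <stdio.h>

/* Upper band edge in Hz for a critical-band table entry.  Each of the
 * num_subbands uniform MDCT subbands spans (sample_rate / 2) /
 * num_subbands Hz, so the edge frequency is the upper-edge subband
 * index times that width (1024 subbands for long MDCT, 128 for short). */
static double band_edge_hz(int subband_edge, double sample_rate, int num_subbands)
{
    return subband_edge * (sample_rate / 2.0) / num_subbands;
}

int main(void)
{
    /* Table A.14 (long MDCT, 32000 Hz): band 0 ends at subband 7,
     * i.e. 7 * 16000 / 1024 = 109.375 Hz. */
    printf("%.3f\n", band_edge_hz(7, 32000.0, 1024));
    /* Table A.27 (short MDCT, 192000 Hz): band 4 ends at subband 23,
     * i.e. 23 * 96000 / 128 = 17250 Hz. */
    printf("%.1f\n", band_edge_hz(23, 192000.0, 128));
    return 0;
}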
A.3 Huffman Codebooks for Codebook Assignment
Table A.28 Huffman codebook for application scopes for quasi-stationary frames Index Length Codeword 0 3 7 1 3 0 2 3 4 3 3 5 4 4 6 5 4 4 6 4 3 7 4 D 8 5 F 9 5 A 10 5 4 11 5 18 12 6 17 13 6 B 14 6 33 15 7 3A 16 7 38 17 7 15 18 8 77 19 8 73 20 8 5B 21 8 5A 22 8 29 23 8 C9 24 8 CA 25 9 ED 26 9 B1 27 9 B0 28 9 50 29 9 191 30 9 196 31 10 1C9 32 10 165 33 10 164 34 10 1CB 35 10 166 36 10 167 37 10 321 38 11 391 39 11 394 40 11 3B2 41 11 3B3 42 10 320 43 11 145 44 10 A3 45 11 144 46 11 65F 47 12 760 48 13 EC6 49 12 761 50 13 EC7 51 13 EC4 52 12 72A 53 12 72B 54 12 720 55 12 CBC 56 13 197B 57 14 1C85 58 13 EC5 59 15 3909 60 15 3908 61 13 197A 62 13 E43 63 10 32E
Table A.29 Huffman codebook for application scopes for frames with detected transients Index Length Codeword 0 4 5 1 3 1 2 3 0 3 3 6 4 4 7 5 4 E 6 4 F 7 4 A 8 4 9 9 4 8 10 5 9 11 5 17 12 6 1B 13 6 19 14 6 18 15 6 10 16 6 2C 17 7 5A 18 7 34 19 7 5B 20 8 6A 21 8 6B 22 9 8D 23 8 45 24 10 11D 25 9 89 26 9 8C 27 11 223 28 10 110 29 10 11C 30 11 222 31 9 8F
Table A.30 Huffman codebook for selection indexes for quasi-stationary frames Index Length Codeword 0 11 B0 1 10 59 2 8 17 3 6 A 4 5 3 5 5 6 6 5 7 7 3 5 8 2 1 9 2 3 10 3 4 11 4 0 12 5 4 13 6 B 14 6 4 15 7 A 16 9 2D 17 11 B1
Table A.31 Huffman codebook for selection indexes for frames with detected transients Index Length Codeword 0 12 CA 1 12 CB 2 12 C8 3 9 18 4 7 7 5 5 6 6 4 1 7 3 3 8 1 1 9 3 2 10 4 2 11 5 7 12 5 0 13 6 2 14 8 D 15 11 67 16 11 66 17 12 C9
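All the Huffman codebooks in this appendix are tabulated as (index, length, codeword) triples, with lengths in bits and codewords in hexadecimal; a codeword is read from the bit stream most significant bit first. Because each codebook is prefix-free, the bits accumulated so far can match at most one entry. The sketch below illustrates such table-driven decoding; it is a minimal illustration under these assumptions, not the DRA reference decoder, and the type and function names are ours:

#include <stdint.h>

/* One row of a codebook table: codeword length in bits and the
 * codeword value (printed in hexadecimal in the tables). */
typedef struct {
    uint8_t  length;
    uint32_t codeword;
} HuffEntry;

/* Decode one symbol: accumulate bits MSB-first and, after each new
 * bit, scan the table for an entry whose length and value match the
 * bits read so far.  A prefix-free code guarantees at most one match.
 * The longest codeword in this appendix is 17 bits (Table A.52). */
int huff_decode(const HuffEntry *table, int table_size,
                int (*next_bit)(void *stream), void *stream)
{
    uint32_t value = 0;
    for (int length = 1; length <= 17; length++) {
        value = (value << 1) | (uint32_t)next_bit(stream);
        for (int i = 0; i < table_size; i++)
            if (table[i].length == length && table[i].codeword == value)
                return i;   /* decoded index */
    }
    return -1;              /* no codeword matched: corrupt bit stream */
}

A production decoder would normally preprocess such a table into a canonical lookup structure at start-up instead of scanning it bit by bit, but the linear scan keeps the correspondence with the printed tables explicit.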
A.4 Huffman Codebooks for Quotient Width of Quantization Indexes
Table A.32 For quasi-stationary frames Index Length Codeword 0 2 2 1 3 6 2 3 0 3 3 1 4 3 2 5 4 7 6 5 1F 7 9 C9 8 10 190 9 10 191 10 8 65 11 7 33 12 6 18 13 5 1E 14 5 D 15 4 E
Table A.33 For frames with transients Index Length Codeword 0 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 6 1 6 7 1 7 8 1 8 9 1 9 10 1 10 11 1 11 12 1 12 13 1 13 14 1 14 15 1 15 15 0
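Note that Table A.33 is simply a truncated unary code: index n, for 0 ≤ n ≤ 14, is coded as n zeros followed by a one, while the final index 15 is coded as fifteen zeros, the terminating one being unnecessary for the longest codeword.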
A.5 Huffman Codebooks for Quantization Indexes in Quasi-Stationary Frames
Table A.34 Codebook 14 Index Length Codeword 0 10 17F 1 9 D5 2 10 3E0 3 9 D6 4 7 29 5 8 DC 6 10 3E1 7 9 D7 8 10 3E6 9 9 96 10 7 7E 11 9 BD 12 7 2B 13 5 F 14 7 23 15 8 F7 16 7 2D 17 8 DE 18 10 3E5 19 8 F5 20 9 1B6 21 8 F0 22 7 34 23 8 D0 24 10 12B 25 8 DF 26 10 129 27 8 D9 28 7 26 29 8 DA 30 7 20 31 4 9 32 7 2C 33 8 F6 34 7 30 35 8 D3 36 7 31 37 4 A 38 7 37 39 4 E 40 2 0 41 4 C 42 7 36 43 4 B 44 7 27 45 8 D4 46 7 32 47 8 F2 48 7 2A 49 5 E 50 7 22 51 9 BE 52 7 28 53 8 D7 54 10 3E3 55 8 D1 56 10 17E 57 9 D4 58 7 33 59 8 F1 60 9 1B7 61 8 F4 62 10 3E7 63 8 D2 64 7 2E 65 8 F3 66 7 7F 67 4 8 68 7 24 69 8 D8 70 7 7D 71 9 BC 72 10 128 73 8 D6 74 10 3E2 75 8 DD 76 7 21 77 9 97 78 10 3E4 79 8 D5 80 10 12A
Table A.35 Codebook 22 Index Length Codeword 0 7 59 1 6 3 2 5 17 3 6 B 4 7 5A 5 6 9 6 5 8 7 4 7 8 5 9 9 6 A 10 5 2 11 3 4 12 2 3 13 4 5 14 5 3 15 6 D 16 4 A 17 4 6 18 5 7 19 6 8 20 7 5B 21 6 C 22 5 0 23 6 2 24 7 58
Table A.36 Codebook 42 Index Length Codeword 0 11 31E 1 10 17D 2 9 15E 3 9 EF 4 8 41 5 9 FA 6 9 81 7 10 1D9 8 11 3CB 9 10 17C 10 9 ED 11 8 74 12 7 21 13 6 28 14 7 34 15 8 7C 16 9 F0 17 10 1E4 18 9 15F 19 8 6C 20 7 30 21 6 3C 22 5 13 23 6 A 24 7 38 25 8 75 26 9 BF 27 9 F3 28 7 17 29 6 3D 30 5 4 31 4 E 32 5 9 33 6 11 34 7 32 35 8 AD 36 8 62 37 6 2A 38 5 11 39 4 B 40 3 0 41 4 C 42 5 10 43 6 29 44 8 2D 45 8 AE 46 7 35 47 6 14 48 5 7 49 4 D 50 5 6 51 6 3E 52 7 2A 53 9 EE 54 9 C6 55 8 7B 56 7 39 57 6 16 58 5 12 59 6 3F 60 7 2E 61 8 6E 62 9 59 63 10 1D8 64 9 F1 65 8 7A 66 7 33 67 7 3F 68 7 2B 69 8 6D 70 9 DF 71 10 1BC 72 11 31F 73 10 1BD 74 9 80 75 8 AC 76 8 5E 77 9 FB 78 9 58 79 10 18E 80 11 3CA
Table A.37 Codebook 82 Index Length Codeword 0 13 62D 1 13 A4 2 13 BCA 3 12 53 4 12 50 5 11 6E0 6 11 753 7 10 307 8 10 3A8 9 11 328 10 11 2F5 11 11 7B1 12 12 7B7 13 12 51 14 12 CEE 15 13 19DB 16 14 33B4 17 13 E0B 18 12 CEC 19 12 F61 20 12 552 21 11 77D 22 10 210 23 10 368 24 10 1F 25 9 13C 26 10 195 27 10 15 28 10 211 29 11 756 30 11 604 31 12 5E8 32 12 FAC 33 13 F0B 34 12 9D9 35 11 46C 36 11 549 37 10 269 38 11 3D8 39 10 11 40 10 1CC 41 9 1E0 42 9 E7 43 9 5 44 9 19A 45 10 1B5 46 10 3BC 47 11 3D9 48 11 6B4 49 12 5E9 50 12 9D8 51 12 6D0 52 11 13 53 11 386 54 10 369 55 9 109 56 9 3A 57 8 AA 58 8 F7 59 7 44 60 8 6 61 8 9F 62 9 79 63 9 149 64 10 37E 65 11 383 66 11 757 67 12 317 68 12 782 69 11 3DA 70 10 37F 71 9 148
Table A.37 (continued) 72 9 180 73 9 BE 74 9 F7 75 8 1 76 7 4C 77 8 FE 78 8 D8 79 9 C8 80 9 138 81 9 135 82 10 35B 83 11 EF 84 11 6FA 85 11 3C 86 11 7D7 87 9 10B 88 9 61 89 9 63 90 8 E0 91 7 43 92 7 61 93 7 31 94 7 73 95 8 7A 96 8 F1 97 9 CE 98 9 DB 99 9 14B 100 11 3C3 101 11 67A 102 11 20 103 10 3BF 104 9 153 105 8 A7 106 8 DE 107 7 47 108 7 D 109 6 32 110 6 E 111 6 2E 112 7 F 113 7 41 114 8 D7 115 8 D4 116 9 1AC 117 10 17
Table A.37 (continued) 118 11 182 119 10 2A5 120 10 C0 121 9 E 122 8 3 123 8 1C 124 7 74 125 6 2C 126 6 1D 127 5 2 128 6 1F 129 6 34 130 7 76 131 8 32 132 8 54 133 9 32 134 10 67 135 10 237 136 10 8 137 9 118 138 8 8B 139 8 71 140 8 72 141 7 2E 142 6 16 143 5 4 144 4 4 145 5 5 146 6 14 147 7 30 148 8 6C 149 8 6F 150 9 C9 151 9 19F 152 10 37C 153 10 303 154 9 13D 155 9 78 156 8 FB 157 8 33 158 7 7A 159 6 31 160 5 12 161 5 1 162 6 1A 163 6 2B
Table A.37 (continued) 164 7 72 165 8 EB 166 8 FF 167 9 1F4 168 10 66 169 11 3C4 170 11 2A8 171 10 12 172 9 10A 173 8 D9 174 8 CC 175 7 51 176 7 1F 177 6 2F 178 6 D 179 6 2D 180 7 7E 181 8 6E 182 8 A6 183 8 9B 184 9 182 185 10 306 186 11 12 187 11 EE 188 11 329 189 9 139 190 9 7A 191 9 BF 192 8 EE 193 7 50 194 7 7C 195 7 2B 196 7 71 197 8 79 198 8 DB 199 9 CF 200 9 1E4 201 10 178 202 10 234 203 11 67B 204 11 4ED 205 11 183 206 10 13 207 9 19B 208 9 1B5 209 9 CB 210 8 AB
Table A.37 (continued) 211 8 0 212 7 40 213 8 F3 214 8 A8 215 9 7B 216 9 19C 217 9 14A 218 10 3D9 219 11 369 220 11 678 221 12 7B6 222 11 679 223 11 380 224 10 76 225 9 13A 226 9 AB 227 8 E1 228 8 18 229 8 66 230 8 DD 231 8 D5 232 9 1E5 233 10 1CD 234 10 3BD 235 11 387 236 12 783 237 12 5E6 238 12 F60 239 12 5E7 240 11 6B5 241 11 381 242 10 33A 243 10 155 244 9 1B9 245 9 1ED 246 8 8A 247 9 1E1 248 10 1C2 249 10 1E3 250 10 371 251 11 21 252 11 6E1 253 12 EA4 254 14 1C15 255 13 F0A 256 12 CEF 257 12 553
Table A.37 (continued) 258 11 46D 259 11 18A 260 10 235 261 10 16 262 10 17B 263 9 119 264 10 C4 265 10 3EA 266 11 77C 267 11 6FB 268 11 4D0 269 12 6D1 270 13 A5 271 14 1C14 272 13 62C 273 13 BCB 274 12 FAD 275 12 5E4 276 12 784 277 11 605 278 11 3C5 279 11 3C0 280 10 3AA 281 10 277 282 11 3D 283 11 548 284 12 704 285 11 4D1 286 13 1D4A 287 13 1D4B 288 14 33B5
Table A.38 Codebook 151 Index Length Codeword 0 8 21 1 8 22 2 7 49 3 7 4F 4 7 2F 5 6 1 6 6 9 7 6 26 8 6 6 9 5 10
Table A.38 (continued) 10 5 2 11 5 5 12 4 B 13 4 4 14 3 6 15 3 3 16 3 7 17 4 3 18 4 A 19 5 A 20 5 1 21 5 11 22 6 7 23 6 25 24 6 16 25 6 0 26 7 2E 27 7 4E 28 7 48 29 8 23 30 8 20
Table A.39 Codebook 311 Index Length Codeword 0 10 BA 1 10 1EF 2 9 1B8 3 9 20 4 9 23 5 9 DA 6 9 79 7 9 F6 8 9 FE 9 8 DD 10 8 26 11 8 3E 12 7 72 13 7 6F 14 7 12 15 7 1B 16 8 6C 17 8 7A 18 7 73 19 7 16
Table A.39 (continued) 20 7 18 21 7 3E 22 7 3C 23 6 36 24 6 A 25 6 E 26 5 1A 27 5 3 28 5 C 29 4 0 30 3 5 31 3 2 32 3 4 33 4 F 34 5 E 35 5 1D 36 5 18 37 6 1A 38 6 8 39 6 38 40 6 32 41 7 37 42 7 19 43 7 9 44 7 A 45 8 7E 46 8 3D 47 7 1A 48 7 B 49 7 66 50 7 67 51 8 3F 52 8 27 53 8 2F 54 9 FF 55 9 DB 56 9 5C 57 9 78 58 9 22 59 9 1B9 60 10 1EE 61 9 21 62 10 BB
Table A.40 Codebook 631 Index Length Codeword 0 10 1 1 11 F8 2 10 1E2 3 10 59 4 10 1CF 5 10 62 6 10 7D 7 10 24 8 10 2AA 9 10 1C0 10 9 1B5 11 10 1E3 12 9 130 13 10 1C1 14 10 1D6 15 9 15A 16 9 15B 17 9 152 18 9 1B6 19 9 1C6 20 9 1A 21 9 1B4 22 9 2D 23 9 1 24 9 32 25 8 9E 26 8 9A 27 8 E1 28 8 F7 29 8 17 30 8 3 31 7 40 32 9 1B7 33 9 1C9 34 9 33 35 8 9F 36 9 EA 37 8 DE 38 9 38 39 9 F0 40 8 FC 41 8 A8
Table A.40 (continued) 42 8 AB 43 8 C 44 8 74 45 8 A 46 8 1D 47 7 45 48 8 71 49 7 6E 50 7 6C 51 7 7F 52 7 7 53 6 21 54 7 3B 55 7 3F 56 6 3E 57 6 4 58 5 12 59 5 1A 60 5 C 61 4 C 62 4 5 63 3 1 64 4 4 65 4 B 66 5 D 67 5 1D 68 5 14 69 6 1 70 6 3C 71 7 3D 72 6 23 73 7 3E 74 7 D 75 7 73 76 7 7A 77 7 57 78 8 79 79 7 44 80 8 72 81 7 41 82 8 14 83 8 FD 84 8 E2
Table A.40 (continued) 85 8 1 86 8 9C 87 8 AC 88 9 4 89 9 39 90 9 E1 91 9 1B 92 9 153 93 9 2A 94 9 1C0 95 8 E5 96 8 1E 97 8 B 98 8 8 99 8 DF 100 8 9B 101 9 3F 102 8 9D 103 9 2B 104 9 5 105 9 1C7 106 9 E6 107 9 1C8 108 9 30 109 9 1EC 110 9 131 111 10 1CE 112 9 132 113 9 133 114 10 0 115 9 1ED 116 10 25 117 10 27 118 10 1D7 119 10 58 120 10 2AB 121 10 63 122 9 154 123 10 382 124 10 383 125 11 F9 126 10 26
Table A.41 Codebook 1271 Index Length Codeword 0 11 5DA 1 13 B8E 2 10 2E2 3 11 5DB 4 11 5D8 5 10 38 6 11 5D9 7 10 39 8 10 3E 9 11 74D 10 11 2B6 11 10 3A7 12 10 3F 13 10 158 14 11 76 15 10 159 16 10 2E3 17 10 2E0 18 10 2E1 19 11 3D4 20 11 2B7 21 10 3C 22 10 3D 23 10 32 24 10 3A4 25 9 177 26 11 2B4 27 10 33 28 10 15E 29 10 1EB 30 10 3A5 31 10 174 32 10 3BA 33 10 43 34 11 77 35 10 15F 36 10 15C 37 9 121 38 11 3D5 39 10 10A 40 9 1EA 41 11 3EA 42 10 15D 43 10 10B 44 10 3BB 45 9 142
Table A.41 (continued) 46 9 143 47 9 1D0 48 9 13 49 10 1E8 50 9 174 51 10 1E9 52 10 152 53 9 175 54 10 3B8 55 9 140 56 9 1D1 57 9 1D6 58 9 16A 59 9 94 60 9 26 61 9 27 62 9 1EB 63 9 BB 64 11 74 65 11 2B5 66 10 3B9 67 10 30 68 10 31 69 10 108 70 10 36 71 10 175 72 11 75 73 10 2E6 74 9 24 75 9 1D7 76 9 141 77 10 1EE 78 10 1A2 79 9 146 80 10 3BE 81 10 37 82 10 109 83 10 1A3 84 9 95 85 9 1E8 86 10 153 87 9 1D4 88 9 16B 89 9 168 90 9 1D5 91 10 1A0 92 9 B2
Table A.41 (continued) 93 9 E8 94 9 1E2 95 9 B9 96 9 10 97 9 25 98 9 11 99 9 F8 100 9 16 101 9 9A 102 8 92 103 8 E0 104 9 FC 105 8 E4 106 8 EC 107 8 F2 108 8 48 109 8 1E 110 8 44 111 8 5F 112 8 11 113 8 7F 114 8 75 115 7 5F 116 7 20 117 7 E 118 7 37 119 7 35 120 6 25 121 6 6 122 6 14 123 5 15 124 5 1F 125 4 8 126 4 2 127 3 6 128 4 3 129 5 C 130 5 0 131 5 13 132 6 1C 133 6 5 134 6 2C 135 7 3B 136 7 36 137 7 23 138 7 2D 139 7 52
Table A.41 (continued) 140 7 71 141 7 5E 142 8 79 143 8 4B 144 8 4F 145 8 E7 146 8 F3 147 8 F7 148 8 E5 149 8 E1 150 8 ED 151 9 9B 152 8 93 153 8 A7 154 9 FD 155 9 F9 156 9 1E9 157 9 17 158 9 14 159 9 98 160 9 99 161 9 3E 162 9 92 163 9 15 164 9 93 165 10 1A1 166 9 147 167 9 169 168 10 150 169 9 1E3 170 9 16E 171 10 1EF 172 10 2E7 173 9 16F 174 10 1EC 175 10 151 176 10 1ED 177 10 1E2 178 10 1A6 179 10 156 180 9 144 181 10 2E4 182 13 B8F 183 10 1E3 184 9 16C 185 11 4A 186 10 2E5
Table A.41 (continued) 187 10 157 188 11 4B 189 10 3BF 190 10 10E 191 9 B3 192 9 9C 193 9 20 194 9 3F 195 9 1E0 196 9 B0 197 10 154 198 9 1CC 199 10 1A7 200 9 145 201 9 122 202 10 1A4 203 9 16D 204 9 B1 205 9 8A 206 9 1E1 207 9 1EC 208 10 1A5 209 10 17A 210 9 123 211 11 3EB 212 10 1E0 213 10 155 214 10 10F 215 10 10C 216 10 13A 217 10 34 218 11 3E8 219 10 1E1 220 10 13B
Table A.41 (continued) 221 10 35 222 11 3E9 223 10 29A 224 10 3DA 225 10 10D 226 10 170 227 10 29B 228 10 298 229 11 3EE 230 11 3EF 231 10 3BC 232 11 48 233 11 3EC 234 10 3DB 235 10 17B 236 9 120 237 10 299 238 10 178 239 10 179 240 10 3BD 241 11 3ED 242 10 39A 243 11 22E 244 11 22F 245 11 49 246 11 74C 247 11 2E2 248 10 116 249 10 1D2 250 11 3A6 251 12 5C6 252 10 42 253 10 39B 254 11 3A7
Table A.42 Codebook 2551 Index Length Codeword 0 5 2 1 4 A 2 5 D 3 5 7 4 5 4 5 5 0 6 5 16 7 5 11 8 6 17 9 6 18 10 6 13 11 6 15 12 6 B 13 6 2 14 6 32 15 6 25 16 6 34 17 6 2E 18 6 20 19 7 3E 20 7 3B 21 7 22 22 7 15 23 7 14 24 7 1B 25 7 E 26 7 6E 27 7 5F 28 7 66 29 7 61 30 7 4E 31 8 7E 32 8 71 33 7 49 34 8 66 35 8 64 36 8 5A 37 8 43 38 8 49 39 8 4B 40 8 1E 41 8 19 42 8 35 43 8 D8 44 8 CF 45 8 DB
Table A.42 (continued) 46 8 F 47 8 D6 48 8 9F 49 8 D 50 8 BD 51 9 F7 52 8 86 53 8 91 54 8 DF 55 9 EB 56 9 F4 57 9 E8 58 9 CB 59 9 85 60 9 E9 61 9 60 62 9 B7 63 9 A5 64 9 A0 65 9 3E 66 9 E0 67 9 E4 68 9 E7 69 9 B0 70 9 68 71 9 69 72 9 8D 73 9 61 74 9 8C 75 9 62 76 9 65 77 9 36 78 9 91 79 9 94 80 9 19D 81 9 180 82 9 8E 83 9 34 84 9 121 85 9 18C 86 10 1E5 87 9 134 88 9 135 89 10 1E7 90 9 30 91 9 1AA 92 10 1EC
Table A.42 (continued) 93 9 1BC 94 9 189 95 9 18D 96 10 166 97 10 194 98 9 182 99 9 1AB 100 9 183 101 10 167 102 10 164 103 9 18A 104 10 1D5 105 9 1AE 106 10 1ED 107 10 1FF 108 9 13D 109 10 162 110 10 120 111 10 19E 112 9 132 113 10 148 114 10 149 115 10 1E0 116 10 121 117 10 165 118 10 1EA 119 10 12A 120 10 105 121 10 2F2 122 10 19F 123 10 C8 124 10 19C 125 10 145 126 10 1E2 127 10 108 128 10 19D 129 9 133 130 10 1CA 131 9 108 132 10 14E 133 9 13C 134 10 1CB 135 9 10E 136 10 C9 137 9 109 138 10 1C2 139 10 1D4
Table A.42 (continued) 140 10 14F 141 10 102 142 10 14C 143 10 1E1 144 9 10F 145 10 14D 146 10 142 147 10 103 148 9 10A 149 10 16C 150 10 1EB 151 10 CE 152 10 1E3 153 10 1E6 154 10 109 155 10 33 156 10 62 157 10 146 158 10 100 159 10 6F 160 10 317 161 10 143 162 10 101 163 10 7E 164 10 106 165 10 147 166 10 338 167 10 107 168 10 CF 169 10 3A 170 10 339 171 10 7F 172 10 366 173 10 CC 174 10 C6 175 10 367 176 11 3FA 177 10 37B 178 10 3B 179 10 1FE 180 11 3FB 181 10 240 182 10 63 183 10 CD 184 10 364 185 10 6A 186 10 352
Table A.42 (continued) 187 10 261 188 10 26E 189 10 26F 190 10 104 191 10 30 192 10 31 193 11 32B 194 10 365 195 10 353 196 10 350 197 10 31E 198 11 39A 199 10 2F3 200 10 11E 201 10 310 202 10 241 203 11 3C8 204 11 3C9 205 10 6B 206 11 2C6 207 10 6E 208 10 36A 209 10 32 210 10 36B 211 10 26C 212 10 311 213 10 26D 214 11 3F8 215 10 216 216 10 368 217 11 39B 218 10 217 219 10 369 220 11 2C7 221 11 23F
Table A.42 (continued) 222 11 398 223 11 288 224 10 35E 225 10 2F0 226 10 31F 227 10 262 228 10 351 229 10 263 230 10 31C 231 11 2DA 232 11 6F5 233 10 2F1 234 11 3F9 235 11 399 236 11 289 237 11 256 238 10 38 239 10 31D 240 10 302 241 10 39 242 11 2DB 243 11 18F 244 10 260 245 11 32A 246 10 35F 247 10 303 248 11 23E 249 11 257 250 10 316 251 11 18E 252 11 6F4 253 11 386 254 11 387 255 3 7
A.6 Huffman Codebooks for Quantization Indexes in Frames with Transients
Table A.43 Codebook 14 Index Length Codeword 0 7 6D 1 8 1D 2 10 363 3 9 B 4 7 F 5 9 1A5 6 11 74 7 9 1E 8 11 75 9 9 184 10 7 67 11 9 185 12 7 6F 13 5 7 14 7 8 15 8 DD 16 7 6A 17 9 1B8 18 10 10 19 9 9 20 11 6C4 21 9 19A 22 7 0 23 9 12 24 11 171 25 9 39 26 9 19B 27 8 2B 28 7 A 29 8 D3 30 7 11 31 4 A 32 7 12 33 9 5D 34 7 5 35 9 13 36 6 31 37 4 8 38 7 6 39 4 F
Table A.43 (continued) 40 2 1 41 4 E 42 7 B 43 4 B 44 6 32 45 9 1C 46 7 D 47 9 5E 48 7 3 49 4 9 50 7 C 51 9 54 52 7 1 53 7 60 54 9 1A4 55 8 C3 56 10 11 57 9 1B9 58 7 13 59 9 55 60 10 3B 61 9 1B0 62 11 170 63 9 1F 64 7 10 65 9 1B2 66 7 16 67 5 6 68 7 14 69 9 1B3 70 7 68 71 8 CC 72 11 172 73 9 A 74 11 6C5 75 9 5F 76 7 9 77 9 38 78 11 173 79 8 8 80 7 6B
Table A.44 Codebook 22 Index Length Codeword 0 8 13 1 6 22 2 5 12 3 6 20 4 8 11 5 6 39 6 5 1 7 4 3 8 5 1F 9 6 21 10 5 1E 11 3 5 12 2 1 13 3 6 14 5 1D 15 6 27 16 5 0 17 4 2 18 5 3 19 6 38 20 8 10 21 6 23 22 6 5 23 6 26 24 8 12
Table A.45 Codebook 42 Index Length Codeword 0 10 71 1 10 E8 2 9 16D 3 9 166 4 8 B7 5 9 39 6 11 340 7 11 341 8 12 686 9 11 342 10 9 FF 11 9 FA 12 7 5A 13 7 35 14 7 49 15 8 38
Table A.45 (continued) 16 10 1A4 17 11 3FA 18 9 140 19 9 E4 20 7 51 21 7 32 22 6 18 23 7 36 24 8 71 25 8 A1 26 10 2D8 27 8 A4 28 7 53 29 6 20 30 5 6 31 4 2 32 5 15 33 7 33 34 7 A 35 9 E5 36 8 7C 37 6 6 38 6 1E 39 4 5 40 2 3 41 4 4 42 5 11 43 6 4 44 8 70 45 9 75 46 7 F 47 7 3A 48 5 17 49 4 0 50 5 13 51 6 21 52 7 B 53 8 3B 54 10 1CF 55 9 FB 56 8 39 57 7 3B 58 6 F 59 7 37 60 7 48 61 8 B2 62 10 E9
Table A.45 (continued) 63 10 2D9 64 9 141 65 9 E6 66 7 58 67 6 25 68 8 7E 69 8 A5 70 9 D1 71 10 70 72 12 687 73 11 34A 74 10 1FC 75 9 D3 76 8 1D 77 9 167 78 11 3FB 79 10 1CE 80 11 34B
Table A.46 Codebook 82 Index Length Codeword 0 12 413 1 13 824 2 13 825 3 13 83A 4 11 203 5 11 7BE 6 12 410 7 10 3DC 8 9 112 9 10 2A0 10 10 3DD 11 10 2A1 12 13 83B 13 12 411 14 13 838 15 12 416 16 13 839 17 12 417 18 11 7BF 19 12 414 20 12 415 21 11 7BC 22 12 46A 23 10 3D2 24 10 3D3 25 8 A0
Table A.46 (continued) 26 9 151 27 11 200 28 11 201 29 10 3D0 30 10 2A6 31 12 46B 32 13 83E 33 13 83F 34 11 7BD 35 13 83C 36 10 3D1 37 12 468 38 11 206 39 10 3D6 40 10 11F 41 9 113 42 8 16 43 9 1E1 44 10 55 45 10 7E 46 11 7B2 47 10 2A7 48 12 469 49 11 7B3 50 11 7B0 51 13 83D 52 11 7B1 53 10 2A4 54 9 110 55 10 7F 56 9 C 57 9 1E6 58 9 85 59 8 23 60 8 AD 61 9 172 62 10 11C 63 10 11D 64 11 7B6 65 13 832 66 12 46E 67 12 46F 68 12 46C 69 11 207 70 11 204 71 10 112 72 10 3D7
Table A.46 (continued) 73 10 2A5 74 10 2BA 75 8 A1 76 8 17 77 9 3C 78 9 1E7 79 9 D 80 10 3D4 81 10 113 82 10 3D5 83 10 3EA 84 10 3EB 85 11 7B7 86 10 2BB 87 10 110 88 10 7C 89 9 111 90 10 111 91 8 D0 92 8 24 93 7 6E 94 8 29 95 8 10 96 9 116 97 10 3E8 98 9 117 99 11 205 100 10 3E9 101 11 7B4 102 11 21A 103 10 116 104 10 7D 105 8 D8 106 9 173 107 8 B5 108 8 2A 109 7 26 110 6 0 111 7 7 112 7 5B 113 9 1A 114 10 72 115 9 3D 116 10 2B8 117 9 114 118 10 2B9 119 9 156 120 10 73
Table A.46 (continued) 121 9 7A 122 8 FD 123 8 F8 124 8 25 125 6 2F 126 6 B 127 5 B 128 5 13 129 6 2C 130 7 52 131 8 FE 132 8 FF 133 9 115 134 9 1B 135 10 13E 136 9 1AA 137 9 1E4 138 8 11 139 8 4E 140 7 42 141 7 1A 142 6 E 143 4 E 144 3 3 145 4 C 146 6 C 147 7 9 148 8 37 149 7 6B 150 8 4 151 9 95 152 10 3EE 153 10 117 154 11 21B 155 9 42 156 9 96 157 8 7 158 7 5D 159 7 13 160 5 12 161 5 A 162 6 6 163 6 20 164 7 6F 165 9 97 166 8 2B 167 9 9E 168 9 157
Table A.46 (continued) 169 10 70 170 10 3EF 171 10 13F 172 9 18 173 8 86 174 8 8E 175 8 D9 176 7 51 177 7 1F 178 6 2 179 7 24 180 8 28 181 8 D1 182 9 7B 183 8 F9 184 10 114 185 10 3EC 186 11 218 187 12 46D 188 11 219 189 10 115 190 10 DA 191 9 154 192 10 DB 193 8 14 194 8 B8 195 7 69 196 8 D4 197 8 DA 198 9 78 199 9 43 200 9 19 201 9 155 202 10 71 203 11 21E 204 11 7B5 205 11 21F 206 11 78A 207 10 2BE 208 10 76 209 10 3ED 210 9 A 211 8 A6 212 8 A7 213 9 40 214 9 1E5 215 10 D8
Table A.46 (continued) 216 9 1B6 217 9 168 218 11 78B 219 11 788 220 11 789 221 13 833 222 13 830 223 12 462 224 10 77 225 11 21C 226 10 74 227 9 79 228 9 41 229 7 46 230 8 87 231 8 8F 232 10 75 233 10 D9 234 10 2BF 235 10 2BC 236 11 21D 237 12 463 238 12 460 239 11 78E 240 13 831 241 11 212 242 10 2BD 243 11 78F 244 11 78C 245 9 1B7 246 8 22 247 9 94 248 11 213 249 10 356 250 10 2B2 251 10 357 252 12 461 253 10 2B3 254 13 836 255 12 466 256 11 78D 257 12 467 258 13 837 259 12 464 260 11 782 261 10 2B0 262 10 2B1
Table A.46 (continued) 263 8 FC 264 11 783 265 10 2D2 266 11 780 267 10 56 268 13 834 269 12 465 270 13 835 271 13 80A 272 13 80B 273 12 47A 274 13 808 275 12 47B 276 11 210 277 11 211 278 12 478 279 12 479 280 9 B 281 10 57 282 11 781 283 11 5A6 284 13 809 285 12 152 286 11 5A7 287 11 A8 288 12 153
Table A.47 Codebook 151 Index Length Codeword 0 9 7E 1 9 7F 2 8 3E 3 7 65 4 7 12 5 7 1E 6 6 27 7 7 19 8 6 24 9 6 8 10 6 D 11 5 1E 12 5 5 13 4 D 14 3 5 15 2 1 16 3 0 17 4 E 18 4 8 19 5 18 20 6 E 21 6 3F 22 6 26 23 7 18 24 6 3E 25 6 25 26 7 13 27 7 66 28 7 64 29 8 CE 30 8 CF
Table A.48 Codebook 311 Index Length Codeword 0 10 3BC 1 10 170 2 9 4F 3 10 3BD 4 9 42 5 9 64 6 8 20 7 8 CE 8 8 CF 9 9 77 10 9 65 11 9 B9 12 7 64 13 8 5D 14 8 3A 15 7 24 16 8 C 17 7 62 18 7 65 19 7 72 20 7 18 21 7 2F 22 7 2C 23 7 2D 24 6 2 25 6 13 26 5 1B 27 5 0 28 5 A 29 4 1 30 3 5 31 3 3 32 3 4 33 4 F 34 5 8 35 5 5 36 5 1A 37 6 F 38 6 D 39 6 38 40 7 25 41 6 30 42 7 1C 43 7 73 44 7 7
Table A.48 (continued) 45 7 75 46 8 26 47 7 11 48 7 66 49 7 12 50 7 76 51 7 63 52 8 33 53 8 E8 54 8 D 55 9 43 56 8 E9 57 10 EC 58 9 4E 59 9 1DF 60 8 EE 61 10 171 62 10 ED Table A.49 Codebook 631 Index Length Codeword 0 10 1A8 1 10 1A9 2 10 341 3 10 346 4 11 680 5 9 1A1 6 10 347 7 10 1AE 8 10 1AF 9 11 681 10 10 344 11 9 1A6 12 12 48E 13 10 1AC 14 10 1AD 15 9 1A7 16 10 1A2 17 10 345 18 9 3E 19 9 3F 20 10 1A3 21 8 1C 22 9 D5 23 10 1A0 24 8 D4
Table A.49 (continued) 25 9 CA 26 8 BA 27 8 BB 28 8 49 29 8 D5 30 8 DF 31 9 CB 32 9 C8 33 9 C9 34 8 CE 35 10 1A1 36 8 1D 37 8 8 38 8 42 39 8 B8 40 9 3C 41 8 4C 42 8 B9 43 7 56 44 8 5A 45 7 D 46 8 5B 47 7 6C 48 7 66 49 7 6D 50 7 6 51 7 30 52 7 27 53 6 2C 54 6 21 55 6 32 56 6 11 57 6 17 58 6 1B 59 5 18 60 5 2 61 5 E 62 4 2 63 3 7 64 4 3 65 4 9 66 5 A 67 5 0 68 5 11 69 6 1F 70 5 14
Table A.49 (continued) 71 6 20 72 7 31 73 6 2D 74 6 2A 75 7 3D 76 7 57 77 7 6E 78 7 5 79 8 43 80 7 7 81 8 4D 82 8 4A 83 8 BE 84 8 40 85 9 CE 86 8 4B 87 8 CF 88 8 BF 89 9 CF 90 8 9 91 9 CC 92 9 3D 93 10 1A6 94 9 32 95 8 58 96 8 78 97 8 DE 98 8 BC 99 8 18 100 10 35A 101 9 CD 102 9 33 103 9 B2 104 10 1A7 105 8 41 106 9 B3 107 8 BD 108 10 35B 109 9 1A4 110 10 1A4 111 10 358 112 10 359 113 9 1A5 114 9 F2 115 9 90 116 10 35E
Table A.49 (continued) 117 10 35F 118 10 35C 119 10 35D 120 11 3CE 121 12 48F 122 11 3CF 123 10 1A5 124 11 246 125 10 122 126 10 1E6
Table A.50 Codebook 1271 Index Length Codeword 0 10 52 1 9 2E 2 10 53 3 10 50 4 10 51 5 10 56 6 10 57 7 10 54 8 10 55 9 10 3AA 10 10 3AB 11 10 3A8 12 9 2F 13 10 3A9 14 9 2C 15 9 2D 16 10 3AE 17 10 3AF 18 9 22 19 10 3AC 20 10 3AD 21 10 3A2 22 10 3A3 23 9 23 24 10 3A0 25 10 3A1 26 9 20 27 10 3A6 28 10 3A7 29 10 3A4 30 9 F4
Table A.50 (continued) 31 9 21 32 10 3A5 33 10 3BA 34 10 3BB 35 9 F5 36 8 8A 37 10 3B8 38 8 F9 39 10 3B9 40 9 26 41 9 27 42 10 3BE 43 9 24 44 9 25 45 9 3A 46 8 8B 47 10 3BF 48 9 3B 49 8 88 50 10 3BC 51 8 89 52 8 FE 53 9 38 54 8 8E 55 9 39 56 9 3E 57 8 8F 58 8 8C 59 9 3F 60 10 3BD 61 8 7B 62 8 4D 63 8 8D 64 9 3C 65 9 3D 66 7 5 67 9 32 68 10 3B2 69 10 3B3 70 9 33 71 8 82 72 8 FF 73 9 30 74 10 3B0
Table A.50 (continued) 75 8 FC 76 10 3B1 77 9 31 78 9 36 79 8 83 80 8 78 81 8 80 82 9 37 83 9 34 84 8 81 85 9 35 86 10 3B6 87 9 A 88 8 5E 89 8 86 90 8 5F 91 8 FD 92 10 3B7 93 8 87 94 8 F2 95 8 84 96 8 85 97 8 5C 98 8 79 99 8 9A 100 8 7E 101 9 B 102 8 7F 103 7 27 104 8 7C 105 8 F3 106 7 24 107 7 25 108 7 36 109 7 32 110 8 F0 111 8 9B 112 8 F1 113 8 7D 114 7 12 115 7 66 116 7 13 117 7 37 118 7 67 119 6 30 120 6 10 121 7 2A 122 7 2B
Table A.50 (continued) 123 5 15 124 6 1C 125 6 A 126 5 1A 127 4 3 128 4 B 129 6 B 130 6 11 131 7 28 132 6 31 133 7 34 134 6 16 135 7 52 136 7 10 137 7 64 138 7 35 139 8 F6 140 7 3A 141 7 29 142 8 5D 143 7 53 144 8 62 145 7 50 146 7 11 147 8 F7 148 8 63 149 7 65 150 7 51 151 8 F4 152 8 60 153 8 F5 154 8 76 155 8 DA 156 8 61 157 8 98 158 8 DB 159 9 8 160 9 9 161 8 99 162 8 9E 163 10 3B4 164 8 9F 165 10 3B5 166 9 E 167 10 38A 168 8 9C 169 9 F 170 8 D8
Table A.50 (continued) 171 9 C 172 9 D 173 10 38B 174 8 77 175 10 388 176 9 2 177 8 D9 178 10 389 179 8 DE 180 8 9D 181 9 3 182 8 92 183 8 66 184 10 38E 185 8 93 186 10 38F 187 10 38C 188 10 38D 189 9 0 190 10 382 191 8 DF 192 8 90 193 10 383 194 10 380 195 10 381 196 8 DC 197 8 DD 198 8 91 199 9 1 200 9 6 201 8 96 202 9 7 203 8 4C 204 8 97 205 9 4 206 8 94 207 10 386 208 9 5 209 10 387 210 10 384 211 9 1A 212 9 1B
Table A.50 (continued) 213 10 385 214 10 39A 215 9 18 216 10 39B 217 10 398 218 10 399 219 8 95 220 9 19 221 9 1E 222 9 1F 223 9 1C 224 10 39E 225 9 CE 226 10 39F 227 10 39C 228 10 39D 229 10 392 230 10 393 231 10 390 232 10 391 233 9 1D 234 10 396 235 10 397 236 10 394 237 10 395 238 10 3EA 239 10 3EB 240 9 12 241 10 3E8 242 10 3E9 243 10 3EE 244 10 3EF 245 9 13 246 10 3EC 247 10 3ED 248 10 3E2 249 10 3E3 250 10 3E0 251 9 CF 252 9 10 253 10 3E1 254 9 11
Table A.51 Codebook 2551 Index Length Codeword 0 5 3 1 5 F 2 6 1A 3 6 8 4 6 B 5 7 23 6 7 14 7 6 9 8 7 33 9 7 15 10 7 36 11 7 20 12 7 21 13 7 26 14 7 1A 15 8 D4 16 7 37 17 7 4A 18 7 4B 19 7 27 20 8 D5 21 7 48 22 6 20 23 7 24 24 8 44 25 7 25 26 8 45 27 8 AA 28 8 5A 29 7 1B 30 7 49 31 9 1AE 32 7 4E 33 7 18 34 8 5B 35 7 32 36 8 58 37 7 4F 38 8 59 39 8 5E 40 8 AB 41 9 1AF 42 8 A8 43 8 A9 44 8 5F 45 8 AE
Table A.51 (continued) 46 8 5C 47 8 5D 48 7 4C 49 8 AF 50 9 1AC 51 8 AC 52 9 1AD 53 9 1A2 54 9 1A3 55 8 AD 56 9 1A0 57 9 1A1 58 7 19 59 8 52 60 9 1A6 61 8 A2 62 8 53 63 9 1A7 64 8 A3 65 8 A0 66 8 50 67 8 A1 68 9 1A4 69 8 A6 70 8 A7 71 8 A4 72 8 51 73 9 1A5 74 9 1BA 75 9 1BB 76 9 1B8 77 8 56 78 9 1B9 79 9 1BE 80 8 57 81 8 A5 82 8 BA 83 8 BB 84 9 1BF 85 8 54 86 9 1BC 87 9 1BD 88 9 1B2 89 8 B8 90 9 1B3 91 8 55 92 7 4D
Table A.51 (continued) 93 8 B9 94 8 3A 95 9 1B0 96 9 1B1 97 8 BE 98 8 BF 99 8 3B 100 8 BC 101 9 1B6 102 9 1B7 103 8 BD 104 9 1B4 105 9 1B5 106 8 B2 107 9 18A 108 9 18B 109 9 188 110 9 189 111 8 38 112 9 18E 113 8 B3 114 9 18F 115 9 18C 116 7 42 117 9 18D 118 8 B0 119 9 182 120 9 183 121 9 180 122 9 181 123 8 B1 124 9 186 125 8 B6 126 8 39 127 9 187 128 9 184 129 9 185 130 8 B7 131 8 3E 132 7 43 133 8 3F 134 9 19A 135 8 B4 136 8 3C 137 9 19B 138 9 198 139 9 199
Table A.51 (continued) 140 9 19E 141 8 B5 142 9 19F 143 9 19C 144 8 A 145 8 B 146 8 3D 147 9 19D 148 9 192 149 9 193 150 9 190 151 8 8 152 9 191 153 9 196 154 9 197 155 8 62 156 9 194 157 9 195 158 8 9 159 9 1EA 160 9 1EB 161 8 63 162 9 1E8 163 9 1E9 164 8 E 165 9 1EE 166 9 1EF 167 9 1EC 168 9 1ED 169 9 1E2 170 9 1E3 171 8 F 172 9 1E0 173 9 1E1 174 7 A 175 9 1E6 176 8 C 177 8 D 178 9 1E7 179 9 1E4 180 8 2 181 8 3 182 8 0 183 9 1E5 184 9 1FA 185 8 1 186 9 1FB
Table A.51 (continued) 187 9 1F8 188 8 6 189 8 7 190 9 1F9 191 9 1FE 192 9 1FF 193 9 1FC 194 9 1FD 195 9 1F2 196 9 1F3 197 9 1F0 198 8 4 199 9 1F1 200 9 1F6 201 9 1F7 202 9 1F4 203 8 5 204 9 1F5 205 8 8A 206 9 1CA 207 9 1CB 208 9 1C8 209 9 1C9 210 9 1CE 211 9 1CF 212 8 8B 213 9 1CC 214 8 88 215 9 1CD 216 9 1C2 217 9 1C3 218 8 89 219 8 8E 220 9 1C0 221 9 1C1
Table A.51 (continued) 222 8 60 223 9 1C6 224 9 1C7 225 9 1C4 226 9 1C5 227 8 8F 228 9 1DA 229 9 1DB 230 9 1D8 231 9 1D9 232 9 1DE 233 8 8C 234 9 1DF 235 9 1DC 236 9 1DD 237 8 8D 238 9 1D2 239 9 1D3 240 9 1D0 241 9 1D1 242 9 1D6 243 9 1D7 244 9 1D4 245 9 1D5 246 8 12 247 8 13 248 8 61 249 8 10 250 8 11 251 8 16 252 10 5E 253 10 5F 254 9 2E 255 5 E
A.7 Huffman Codebooks for Indexes of Quantization Step Sizes
Table A.52 For quasi-stationary frames Index Length Codeword 0 3 1 1 3 5 2 4 6 3 4 F 4 5 9 5 5 1C 6 6 1F 7 6 2 8 6 32 9 7 3A 10 7 76 11 7 66 12 8 47 13 8 CF 14 9 E1 15 9 19C 16 9 138 17 10 1CF 18 10 1CB 19 10 3AD 20 11 394 21 11 39D 22 11 7D 23 11 4B2 24 11 4B3 25 12 733 26 12 F8 27 11 395 28 9 12E 29 9 18 30 9 EE 31 9 1C 32 9 E4 33 9 1D4 34 9 129 35 10 3F 36 10 3C 37 10 251 38 11 4F3 39 11 759
Table A.52 (continued) 40 11 75D 41 12 F9 42 11 7A 43 11 398 44 11 39C 45 10 25E 46 10 34 47 10 1C1 48 10 1C0 49 9 12B 50 9 13D 51 9 139 52 9 12D 53 10 3A 54 10 32 55 10 33B 56 10 3B 57 10 33 58 10 25F 59 10 33A 60 10 258 61 10 35 62 10 36 63 10 278 64 10 250 65 10 3AF 66 11 6F 67 11 75C 68 11 4F2 69 11 4A8 70 12 DC 71 12 952 72 13 1D60 73 14 1CC9 74 13 1D61 75 15 7588 76 14 3AC6 77 16 EB16 78 16 EB17 79 16 EB14 80 16 EB15 81 14 3AC7
Table A.52 (continued) 82 17 E646 83 17 E647 84 16 7322 85 15 7589 86 15 3990 87 15 3996 88 14 1CCA 89 15 3997 90 13 12A6 91 13 12A7 92 12 DD 93 11 7B 94 10 255 95 10 1CD 96 9 1D5 97 9 EF 98 8 9D 99 8 9F 100 8 46 101 8 71 102 8 76 103 7 74 104 7 77 105 7 22 106 6 24 107 6 26 108 6 10 109 6 1E 110 5 18 111 5 0 112 4 8 113 4 D 114 4 1 115 4 5
Table A.53 For frames with transients Index Length Codeword 0 3 0 1 4 D 2 4 A 3 5 E 4 5 9 5 5 8 6 5 6 7 5 19 8 6 F 9 6 17 10 6 B 11 6 31 12 6 38 13 7 14 14 7 10 15 7 12 16 7 5D 17 7 5F 18 8 63 19 8 26 20 8 27 21 8 B9 22 8 BD 23 10 39D 24 9 47 25 11 3E4 26 11 2DB 27 10 8D 28 10 16B 29 12 7C3 30 12 60E 31 12 60F
Table A.53 (continued) 32 12 60C 33 12 60D 34 12 602 35 11 738 36 13 C12 37 12 603 38 13 C13 39 12 600 40 11 739 41 13 C10 42 13 C11 43 12 601 44 11 73E 45 12 606 46 10 8A 47 11 73F 48 11 2D8 49 11 2D9 50 10 8B 51 10 1F3 52 10 88 53 9 B1 54 9 B2 55 9 B7 56 9 B0 57 9 B3 58 9 FA 59 9 C2 60 9 1CC 61 9 1CA 62 10 1F1 63 10 168 64 11 73C 65 11 73D 66 12 607 67 11 2D2 68 11 3E5 69 12 604 70 12 605 71 13 C16 72 12 5AA 73 12 5AB
Table A.53 (continued) 74 13 C17 75 13 C14 76 13 C15 77 13 B6A 78 13 B6B 79 13 B68 80 12 5A8 81 12 5A9 82 11 72E 83 11 72F 84 12 7C2 85 13 B69 86 10 2E2 87 11 2D3 88 10 2E3 89 10 2E0 90 10 89 91 10 2E1 92 10 8C 93 10 396 94 11 3E0 95 9 1CD 96 9 C3 97 9 FB 98 8 BC 99 8 E4 100 8 7A 101 8 62 102 8 7B 103 7 15 104 7 3C 105 7 3F 106 6 30 107 6 E 108 6 19 109 5 16 110 5 1D 111 5 A 112 5 D 113 4 9 114 4 8 115 4 F
Index
2 × 2 delay system (defined), 125
absolute threshold of hearing (defined), 174 ADC (defined), 23 Additive noise model for quantization (defined), 22 additive quantization noise model, 47 aliasing (defined), 99 aliasing transfer function (defined), 102 alphabet (defined), 146 AM–GM inequality (defined), 77 analog-to-digital conversion (defined), 23 analysis filter banks (defined), 92 apex (defined), 176 AR, 137 AR (defined), 66 arithmetic mean (defined), 77 ATH (defined), 174 audible impairment, 255 auditory filters (defined), 176 autoregression, 88 autoregressive process (defined), 66 auxiliary data (defined), 246 average codeword length, 145 average codeword length (defined), 147 average perceptual quantization error (defined), 194 average quantization error (defined), 47
backward quantization (defined), 21 Bark scale, 185 Bark scale (defined), 179 base (defined), 176 baseline code (defined), 148 basilar membrane (defined), 176 basis functions (defined), 75 basis vectors (defined), 75
beating phenomenon (defined), 187 binary tree (defined), 154 bit allocation strategy (defined), 77 bit allocation (defined), 77 bit allocation, 11 bit pool (defined), 242 bit rate (defined), 4, 28 bit stream (defined), 147 bits per sample (defined), 28 block codebook (defined), 159 block length, 11 block length (defined), 73 block size, 11 block size (defined), 73 block symbol (defined), 159 blocking artifacts (defined), 91 blocking effects (defined), 91 blocky effect, 11 brief window, 257 brief window (defined), 217 Buffer overflow (defined), 242 Buffer underflow (defined), 242 buffer underrun (defined), 242
cascade of paraunitary matrices (defined), 126 CB (defined), 179 CMFB, 11 code (defined), 147 code sequence (defined), 147 code tree (defined), 154 codebook, 145 codebook (defined), 147 codes, 145 codeword length (defined), 147 codeword sequence (defined), 147 codewords, 145 codewords (defined), 147 coding gain (defined), 60
compact representation, 5 Companding (defined), 39 companding (defined), 40 compressed data size codeword (defined), 243 compression ratio, 148, 255 compression ratio (defined), 6 compressor (defined), 39 cosine modulated analysis filter (defined), 121 cosine modulated filter banks, 11 cosine modulated synthesis filter (defined), 122 critical band (defined), 179 critical band level (defined), 184 critical band rate (defined), 179 critical band rate scale (defined), 184 critical band segments (defined), 238 critical bands (defined), 238 critical bandwidth, 179, 182 critical bandwidth (defined), 179 critically sampled (defined), 97
DAC (defined), 23 data modeling, 9 decimator (defined), 96 decision boundaries (defined), 21 decision cells, 45 decision interval, 19 decision intervals, 47 decision intervals (defined), 21 decision regions, 45 decision regions (defined), 47 decoding delay (defined), 154 decorrelation (defined), 85 delay chain (defined), 93 DFT, 11 DFT (defined), 86 differential pulse code modulation (defined), 59 digital audio (compression) coding (defined), 5 digital-to-analog conversion (defined), 23 Discrete Fourier Transform, 11 Discrete Fourier Transform (defined), 86 Double blind principle (defined), 253 double buffering (defined), 236 double-blind, triple-stimulus with hidden reference (defined), 254 double-resolution switched MDCT, 223 double-resolution switched MDCT (defined), 213 downsampler (defined), 96 DPCM, 69 DPCM (defined), 59 DRA (defined), 255 Dynamic Resolution Adaptation (defined), 255
end of frame signature (defined), 243 energy compaction, 11 energy conservation (defined), 75 entropy, 6 entropy (defined), 149 entropy coding, 9 entropy coding (defined), 149 equivalent rectangular bandwidth (defined), 184 ERB, 179 ERB (defined), 184 ERB scale (defined), 185 Error protection codes (defined), 245 error resilience (defined), 245 error-feedback function (defined), 71 even channel stacking (defined), 136 evenly-stacked TDAC (defined), 136 expander (defined), 40, 97 external node (defined), 154
filter banks, 11 fine quantization, 69, 71, 111 fine quantization (defined), 62 fixed length, 145 fixed-length code (defined), 145, 148 fixed-length codebook (defined), 145, 148 fixed-length codes, 28 fixed-point (defined), 247 fixed-point arithmetic (defined), 246 flattening effect, 202 forward quantization (defined), 21 Forward Quantization, 19 Fourier uncertainty principle, 12 Fourier uncertainty principle (defined), 204 frame, 213 frame-based processing, 236 frequency coefficients (defined), 86 frequency domain (defined), 86 frequency resolution (defined), 200 frequency transforms (defined), 86 gammatone filter (defined), 177 geometric mean (defined), 77 Givens rotation (defined), 125 global bit allocation, 13 granular error (defined), 29 granular noise (defined), 29 half-sine window (defined), 137 Heisenberg uncertainty principle, 12 Heisenberg uncertainty principle (defined), 204
Huffman code, 9 Huffman code (defined), 161 hybrid filter bank (defined), 205
IDCT, 89 ideal bandpass filter (defined), 95 ideal subband coder (defined), 110 iid, 146 iid (defined), 147 images (defined), 101 independently and identically distributed (defined), 147 input–output map of a quantizer (defined), 21 instantaneously decodable (defined), 153 intensity density level, 184 intensity density level (defined), 174 inter-aural amplitude differences (defined), 232 inter-aural time differences (defined), 232 inter-frame bit allocation, 290 inter-frame bit allocation (defined), 241 internal node (defined), 154 interpolator (defined), 97 intra-frame bit allocation (defined), 241 Inverse Quantization, 20 inverse quantization, 281 inverse quantization (defined), 21 inverse transform (defined), 74
JIC, 282 JIC (defined), 232 joint channel coding (defined), 231 joint intensity coding, 14, 258, 259, 278, 282 joint intensity coding (defined), 232
k-means algorithm (defined), 49 Karhunen–Loeve Transform (defined), 84 KLT (defined), 84 kurtosis, 38 kurtosis (defined), 32
lapped transforms, 11 lattice structure (defined), 126 LBG algorithm (defined), 49 leaf (defined), 154 Levinson–Durbin recursion (defined), 63 LFE, 231 LFE (defined), 4, 234 linear phase condition, 207 linear phase condition (defined), 121 Linear prediction, 51
linear prediction, 10 linear prediction coding (defined), 55 linear predictor (defined), 53 linear transform (defined), 73 Lloyd-Max quantizer (defined), 36 loading factor (defined), 29 Logarithmic companding, 257 long filter bank (defined), 200 look-ahead, 213 lossless (defined), 6, 147 lossless coding (defined), 147 lossless compression coding (defined), 147 lossy (defined), 6 low frequency effect, 231 low-frequency effect (defined), 4 Low-frequency effects (defined), 234 LPC (defined), 55 LPC decoder (defined), 56 LPC encoder (defined), 55 LPC reconstruction filter, 58, 69, 71 LPC reconstruction filter (defined), 56
M/S stereo coding, 14 M/S stereo coding (defined), 231 Markov process (defined), 66 masked (defined), 181 masked threshold (defined), 182, 189 maskee (defined), 181 maskers (defined), 181 masking in frequency gap (defined), 181 masking index (defined), 186 masking signals (defined), 181 masking spread function (defined), 188 masking threshold, 13 masking threshold (defined), 182 maximal active critical band (defined), 275 maximally decimated (defined), 97 MDCT, 11, 91, 223 MDCT (defined), 136 mean squared quantization error (defined), 22 medium window (defined), 224 messages (defined), 146 meta-symbol (defined), 161 midrise (defined), 26 midtread (defined), 25 midtread uniform quantization, 257, 281 minimum average codeword length (defined), 153 mirror image condition (defined), 121 modified discrete cosine transform, 11, 91 Modified discrete cosine transform (defined), 136 modulated filter bank (defined), 94
modulating (defined), 94 monaural sound (defined), 4 MSQE, 47 MSQE (defined), 22 multiplexer, 14
near-perfect reconstruction (defined), 103 NMR, 242 NMR (defined), 194 noble identity for decimation (defined), 107 noble identity for interpolation (defined), 108 noise to mask ratio (defined), 194 noise-feedback coding (defined), 71 noise-feedback function (defined), 71 noise-shaping filter (defined), 71 nonperfect reconstruction, 122 nonperfect reconstruction (defined), 103 nonuniform quantization, 8 normal channels (defined), 4 normal equations, 215 normal equations (defined), 63 NPR (defined), 103 number of ERBs (defined), 185
odd channel stacking (defined), 136 oddly stacked TDAC (defined), 136 open-loop DPCM, 71, 215 open-loop DPCM (defined), 55 optimal codebook (defined), 152, 153 optimal coding gain, 113 Orthogonal transforms (defined), 74 orthogonality principle (defined), 62 output values (defined), 21 oval window (defined), 176 overall transfer function (defined), 102 overload areas (defined), 28 overload error (defined), 29 overload noise (defined), 29 overshoot effect (defined), 194
paraunitary (defined), 124 PCM (defined), 3, 24 PDF (defined), 21 peakedness, 38 peakedness (defined), 32 perceptual entropy (defined), 197 perceptual irrelevance, 8 perceptual quantization error (defined), 194 perceptually irrelevant (defined), 7 perfect reconstruction (defined), 103 perfect reconstruction, 123, 207
ping-pong buffering (defined), 236 polyphase (component) matrix (defined), 105 polyphase components, 106 polyphase implementation, 107 polyphase representation, 106 postmasking (defined), 193 power-complementary conditions, 207 PR (defined), 103 pre-echo artifacts (defined), 203 prediction coefficients (defined), 53 prediction error, 10, 51, 53 prediction error (defined), 55 prediction filter (defined), 53 prediction gain (defined), 55 prediction residue (defined), 55 predictor order (defined), 53 prefix (defined), 153 prefix code (defined), 154 prefix-free code (defined), 154 premasking, 257 premasking (defined), 193 primary windows (defined), 210 probability density function (defined), 21 prototype filter (defined), 94 pseudo QMF (quadrature mirror filter) (defined), 122 Pulse-code modulation, 3 pulse-code modulation (defined), 24
quantization, 8, 19 quantization distortion (defined), 21 quantization error, 8 quantization error (defined), 21 quantization error accumulation, 59 quantization index, 19 quantization index (defined), 21 quantization noise (defined), 21 quantization noise accumulation (defined), 57 quantization step size, 8, 24, 275 quantization step size (defined), 24 quantization table, 21 quantization table (defined), 19 quantization unit, 257, 275, 281 quantization unit (defined), 238 quantized value, 20 quantized values (defined), 21 quasistationary (defined), 199
re-quantize (defined), 24 re-quantizing (defined), 17 reconstructed vector (defined), 47 reconstruction error (defined), 57
representative values (defined), 21 representative vector (defined), 47 representative vectors, 45 residue, 10, 53 reversal or anti-diagonal matrix (defined), 131 rotation vector (defined), 126 rounded exponential filter (defined), 177 rounding function (defined), 26
sample rate, 3 sample rate (defined), 19 sample rate compressor (defined), 96 sampling period (defined), 19 sampling theorem, 3 sampling theorem (defined), 19 SBC (defined), 91 scalar quantization (defined), 21 scalar quantizer, 8 self-information (defined), 148 Shannon's noiseless coding theorem, 167 siblings (defined), 163 side information (defined), 244 signal-to-mask ratio (defined), 186 signal-to-noise ratio (defined), 23, 252 simultaneous masking (defined), 193 sine window (defined), 137 SMR, 252 SMR (defined), 186 SMR threshold, 195 SMR threshold (defined), 186 SNR (defined), 23, 252 sound intensity level, 184 sound intensity level (defined), 174 sound pressure level (defined), 174 source sequence, 145 source sequence (defined), 147 source signal, 5 source symbol, 145 source symbols (defined), 146 spectral envelope (defined), 69 spectral flatness measure, 113 spectrum flatness measure (defined), 85 SPL (defined), 174 spread of masking (defined), 188 SQ, 8 SQ (defined), 17, 21 statistic-adaptive codebook assignment, 268 statistical redundancy (defined), 6 statistical redundancy, 9 steering vector (defined), 233 stereo (defined), 4 stopband attenuation, 94
subband (defined), 93 subband coding, 91 subband coding (defined), 97 subband filter (defined), 93 subband filters (defined), 93 subband samples (defined), 93 subbands (defined), 93 subjective listening test, 290 sum/difference coding, 14, 258, 259, 276 sum/difference coding (defined), 231 Sum/difference decoding, 283 surround sound (defined), 4 Switched MDCT (defined), 207 switched-window MDCT (defined), 207 symbol, 145 symbol code (defined), 147 symbol set, 145 symbol set (defined), 146 synchronization codeword (defined), 243 synthesis filter banks (defined), 92
TC (defined), 73 TDAC (defined), 136 temporal masking (defined), 193 Temporal noise shaping (defined), 215 temporal-frequency analysis, 69 test set (defined), 49 test signal (defined), 181 the DCT, 89 the inverse DCT, 89 THQ (defined), 174 threshold in quiet (defined), 174 time–frequency analysis (defined), 86 time–frequency resolution (defined), 204 time–frequency tiling (defined), 238 time-domain aliasing cancellation (defined), 136 TLM, 256 TLM (defined), 220 TNS (defined), 215 tonal frequency components, 199 training set (defined), 48 transform (defined), 73 Transform coding (defined), 73 transform coefficients (defined), 73 transformation matrix (defined), 73 transformed block (defined), 73 transient attack (defined), 202 transient detection interval, 227 transient detection interval (defined), 213 transient function (defined), 228 Transient localized MDCT, 256 transient segment (defined), 237
transient segmentation (defined), 237 transient-detection blocks (defined), 227 transient-localized MDCT (defined), 220 transitional windows (defined), 210 transparent (defined), 254 triple-resolution switched MDCT (defined), 224 truncate function (defined), 27 type-I polyphase (component) vector (defined), 105 type-I polyphase component (defined), 104 type-I polyphase representation (defined), 104 type-II DCT (defined), 88 type-II polyphase components (defined), 106 Type-II polyphase representation (defined), 106 Type-IV DCT (defined), 89 unary code, 9, 152, 162 unary codebook, 146 unary numeral system, 146, 152 unequal error protection (defined), 245 uniform quantization, 24 uniform quantizer (defined), 24 uniform scalar quantization, 8 unique decodability (defined), 152 unitary (defined), 86, 124 upsampler (defined), 97
variable-length codebooks (defined), 145 variable-length codes (defined), 145 vector quantization, 8 vector quantization (defined), 43 vector-quantized, 47 virtual window (defined), 217 virtual windows, 256 Voronoi regions (defined), 48 VQ, 8 VQ (defined), 17, 43 VQ codebook, 43 VQ codebook (defined), 47 VQ quantization error (defined), 47
water-filling, 81 water-filling algorithm (defined), 242 whitening filter (defined), 67 Wiener–Hopf equations, 67 Wiener–Hopf equations (defined), 63 window (defined), 137 window function (defined), 137
Yule–Walker prediction equations (defined), 63