Visual Media Coding and Transmission
Visual Media Coding and Transmission Ahmet Kondoz © 2009 John Wiley & Sons, Ltd.
Ahmet Kondoz
Centre for Communication Systems Research, University of Surrey, UK
This edition first published 2009
© 2009 John Wiley & Sons Ltd.

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

© 1998, 2001, 2002, 2003, 2004. 3GPP™ TSs and TRs are the property of ARIB, ATIS, CCSA, ETSI, TTA and TTC, who jointly own the copyright in them. They are subject to further modifications and are therefore provided to you 'as is' for information purposes only. Further use is strictly prohibited.
Library of Congress Cataloging-in-Publication Data
Kondoz, A. M. (Ahmet M.)
Visual media coding and transmission / Ahmet Kondoz.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-74057-6 (cloth)
1. Multimedia communications. 2. Video compression. 3. Coding theory. 4. Data transmission systems. I. Title.
TK5105.15.K65 2009
621.382'1–dc22
2008047067

A catalogue record for this book is available from the British Library.

ISBN 9780470740576 (H/B)

Set in 10/12pt Times New Roman by Thomson Digital, Noida, India.
Printed in Great Britain by CPI Antony Rowe, Chippenham, England
Contents

VISNET II Researchers
Preface
Glossary of Abbreviations

1 Introduction

2 Video Coding Principles
2.1 Introduction
2.2 Redundancy in Video Signals
2.3 Fundamentals of Video Compression
2.3.1 Video Signal Representation and Picture Structure
2.3.2 Removing Spatial Redundancy
2.3.3 Removing Temporal Redundancy
2.3.4 Basic Video Codec Structure
2.4 Advanced Video Compression Techniques
2.4.1 Frame Types
2.4.2 MC Accuracy
2.4.3 MB Mode Selection
2.4.4 Integer Transform
2.4.5 Intra Prediction
2.4.6 Deblocking Filters
2.4.7 Multiple Reference Frames and Hierarchical Coding
2.4.8 Error-Robust Video Coding
2.5 Video Codec Standards
2.5.1 Standardization Bodies
2.5.2 ITU Standards
2.5.3 MPEG Standards
2.5.4 H.264/MPEG-4 AVC
2.6 Assessment of Video Quality
2.6.1 Subjective Performance Evaluation
2.6.2 Objective Performance Evaluation
2.7 Conclusions
References

3 Scalable Video Coding
3.1 Introduction
3.1.1 Applications and Scenarios
3.2 Overview of the State of the Art
3.2.1 Scalable Coding Techniques
3.2.2 Multiple Description Coding
3.2.3 Stereoscopic 3D Video Coding
3.3 Scalable Video Coding Techniques
3.3.1 Scalable Coding for Shape, Texture, and Depth for 3D Video
3.3.2 3D Wavelet Coding
3.4 Error Robustness for Scalable Video and Image Coding
3.4.1 Correlated Frames for Error Robustness
3.4.2 Odd–Even Frame Multiple Description Coding for Scalable H.264/AVC
3.4.3 Wireless JPEG 2000: JPWL
3.4.4 JPWL Simulation Results
3.4.5 Towards a Theoretical Approach for Optimal Unequal Error Protection
3.5 Conclusions
References

4 Distributed Video Coding
4.1 Introduction
4.1.1 The Video Codec Complexity Balance
4.2 Distributed Source Coding
4.2.1 The Slepian–Wolf Theorem
4.2.2 The Wyner–Ziv Theorem
4.2.3 DVC Codec Architecture
4.2.4 Input Bitstream Preparation – Quantization and Bit Plane Extraction
4.2.5 Turbo Encoder
4.2.6 Parity Bit Puncturer
4.2.7 Side Information
4.2.8 Turbo Decoder
4.2.9 Reconstruction: Inverse Quantization
4.2.10 Key Frame Coding
4.3 Stopping Criteria for a Feedback Channel-based Transform Domain Wyner–Ziv Video Codec
4.3.1 Proposed Technical Solution
4.3.2 Performance Evaluation
4.4 Rate-distortion Analysis of Motion-compensated Interpolation at the Decoder in Distributed Video Coding
4.4.1 Proposed Technical Solution
4.4.2 Performance Evaluation
4.5 Nonlinear Quantization Technique for Distributed Video Coding
4.5.1 Proposed Technical Solution
4.5.2 Performance Evaluation
4.6 Symmetric Distributed Coding of Stereo Video Sequences
4.6.1 Proposed Technical Solution
4.6.2 Performance Evaluation
4.7 Studying Error-resilience Performance for a Feedback Channel-based Transform Domain Wyner–Ziv Video Codec
4.7.1 Proposed Technical Solution
4.7.2 Performance Evaluation
4.8 Modeling the DVC Decoder for Error-prone Wireless Channels
4.8.1 Proposed Technical Solution
4.8.2 Performance Evaluation
4.9 Error Concealment Using a DVC Approach for Video Streaming Applications
4.9.1 Proposed Technical Solution
4.9.2 Performance Evaluation
4.10 Conclusions
References

5 Non-normative Video Coding Tools
5.1 Introduction
5.2 Overview of the State of the Art
5.2.1 Rate Control
5.2.2 Error Resilience
5.3 Rate Control Architecture for Joint MVS Encoding and Transcoding
5.3.1 Problem Definition and Objectives
5.3.2 Proposed Technical Solution
5.3.3 Performance Evaluation
5.3.4 Conclusions
5.4 Bit Allocation and Buffer Control for MVS Encoding Rate Control
5.4.1 Problem Definition and Objectives
5.4.2 Proposed Technical Approach
5.4.3 Performance Evaluation
5.4.4 Conclusions
5.5 Optimal Rate Allocation for H.264/AVC Joint MVS Transcoding
5.5.1 Problem Definition and Objectives
5.5.2 Proposed Technical Solution
5.5.3 Performance Evaluation
5.5.4 Conclusions
5.6 Spatio-temporal Scene-level Error Concealment for Segmented Video
5.6.1 Problem Definition and Objectives
5.6.2 Proposed Technical Solution
5.6.3 Performance Evaluation
5.6.4 Conclusions
5.7 An Integrated Error-resilient Object-based Video Coding Architecture
5.7.1 Problem Definition and Objectives
5.7.2 Proposed Technical Solution
5.7.3 Performance Evaluation
5.7.4 Conclusions
5.8 A Robust FMO Scheme for H.264/AVC Video Transcoding
5.8.1 Problem Definition and Objectives
5.8.2 Proposed Technical Solution
5.8.3 Performance Evaluation
5.8.4 Conclusions
5.9 Conclusions
References

6 Transform-based Multi-view Video Coding
6.1 Introduction
6.2 MVC Encoder Complexity Reduction using a Multi-grid Pyramidal Approach
6.2.1 Problem Definition and Objectives
6.2.2 Proposed Technical Solution
6.2.3 Conclusions and Further Work
6.3 Inter-view Prediction using Reconstructed Disparity Information
6.3.1 Problem Definition and Objectives
6.3.2 Proposed Technical Solution
6.3.3 Performance Evaluation
6.3.4 Conclusions and Further Work
6.4 Multi-view Coding via Virtual View Generation
6.4.1 Problem Definition and Objectives
6.4.2 Proposed Technical Solution
6.4.3 Performance Evaluation
6.4.4 Conclusions and Further Work
6.5 Low-delay Random View Access in Multi-view Coding Using a Bit Rate-adaptive Downsampling Approach
6.5.1 Problem Definition and Objectives
6.5.2 Proposed Technical Solution
6.5.3 Performance Evaluation
6.5.4 Conclusions and Further Work
References

7 Introduction to Multimedia Communications
7.1 Introduction
7.2 State of the Art: Wireless Multimedia Communications
7.2.1 QoS in Wireless Networks
7.2.2 Constraints on Wireless Multimedia Communications
7.2.3 Multimedia Compression Technologies
7.2.4 Multimedia Transmission Issues in Wireless Networks
7.2.5 Resource Management Strategy in Wireless Multimedia Communications
7.3 Conclusions
References

8 Wireless Channel Models
8.1 Introduction
8.2 GPRS/EGPRS Channel Simulator
8.2.1 GSM/EDGE Radio Access Network (GERAN)
8.2.2 GPRS Physical Link Layer Model Description
8.2.3 EGPRS Physical Link Layer Model Description
8.2.4 GPRS Physical Link Layer Simulator
8.2.5 EGPRS Physical Link Layer Simulator
8.2.6 E/GPRS Radio Interface Data Flow Model
8.2.7 Real-time GERAN Emulator
8.2.8 Conclusion
8.3 UMTS Channel Simulator
8.3.1 UMTS Terrestrial Radio Access Network (UTRAN)
8.3.2 UMTS Physical Link Layer Model Description
8.3.3 Model Verification for Forward Link
8.3.4 UMTS Physical Link Layer Simulator
8.3.5 Performance Enhancement Techniques
8.3.6 UMTS Radio Interface Data Flow Model
8.3.7 Real-time UTRAN Emulator
8.3.8 Conclusion
8.4 WiMAX IEEE 802.16e Modeling
8.4.1 Introduction
8.4.2 WiMAX System Description
8.4.3 Physical Layer Simulation Results and Analysis
8.4.4 Error Pattern Files Generation
8.5 Conclusions
8.6 Appendix: Eb/No and DPCH_Ec/Io Calculation
References

9 Enhancement Schemes for Multimedia Transmission over Wireless Networks
9.1 Introduction
9.1.1 3G Real-time Audiovisual Requirements
9.1.2 Video Transmission over Mobile Communication Systems
9.1.3 Circuit-switched Bearers
9.1.4 Packet-switched Bearers
9.1.5 Video Communications over GPRS
9.1.6 GPRS Traffic Capacity
9.1.7 Error Performance
9.1.8 Video Communications over EGPRS
9.1.9 Traffic Characteristics
9.1.10 Error Performance
9.1.11 Voice Communication over Mobile Channels
9.1.12 Support of Voice over UMTS Networks
9.1.13 Error-free Performance
9.1.14 Error-prone Performance
9.1.15 Support of Voice over GPRS Networks
9.1.16 Conclusion
9.2 Link-level Quality Adaptation Techniques
9.2.1 Performance Modeling
9.2.2 Probability Calculation
9.2.3 Distortion Modeling
9.2.4 Propagation Loss Modeling
9.2.5 Energy-optimized UEP Scheme
9.2.6 Simulation Setup
9.2.7 Performance Analysis
9.2.8 Conclusion
9.3 Link Adaptation for Video Services
9.3.1 Time-varying Channel Model Design
9.3.2 Link Adaptation for Real-time Video Communications
9.3.3 Link Adaptation for Streaming Video Communications
9.3.4 Link Adaptation for UMTS
9.3.5 Conclusion
9.4 User-centric Radio Resource Management in UTRAN
9.4.1 Enhanced Call-admission Control Scheme
9.4.2 Implementation of UTRAN System-level Simulator
9.4.3 Performance Evaluation of Enhanced CAC Scheme
9.5 Conclusions
References

10 Quality Optimization for Cross-network Media Communications
10.1 Introduction
10.2 Generic Inter-networked QoS-optimization Infrastructure
10.2.1 State of the Art
10.2.2 Generic of QoS for Heterogeneous Networks
10.3 Implementation of a QoS-optimized Inter-networked Emulator
10.3.1 Emulation System Physical Link Layer Simulation
10.3.2 Emulation System Transmitter/Receiver Unit
10.3.3 QoS Mapping Architecture
10.3.4 General User Interface
10.4 Performances of Video Transmission in Inter-networked Systems
10.4.1 Experimental Setup
10.4.2 Test for the EDGE System
10.4.3 Test for the UMTS System
10.4.4 Tests for the EDGE-to-UMTS System
10.5 Conclusions
References

11 Context-based Visual Media Content Adaptation
11.1 Introduction
11.2 Overview of the State of the Art in Context-aware Content Adaptation
11.2.1 Recent Developments in Context-aware Systems
11.2.2 Standardization Efforts on Contextual Information for Content Adaptation
11.3 Other Standardization Efforts by the IETF and W3C
11.4 Summary of Standardization Activities
11.4.1 Integrating Digital Rights Management (DRM) with Adaptation
11.4.2 Existing DRM Initiatives
11.4.3 The New "Adaptation Authorization" Concept
11.4.4 Adaptation Decision
11.4.5 Context-based Content Adaptation
11.5 Generation of Contextual Information and Profiling
11.5.1 Types and Representations of Contextual Information
11.5.2 Context Providers and Profiling
11.5.3 User Privacy
11.5.4 Generation of Contextual Information
11.6 The Application Scenario for Context-based Adaptation of Governed Media Contents
11.6.1 Virtual Classroom Application Scenario
11.6.2 Mechanisms using Contextual Information in a Virtual Collaboration Application
11.6.3 Ontologies in Context-aware Content Adaptation
11.6.4 System Architecture of a Scalable Platform for Context-aware and DRM-enabled Content Adaptation
11.6.5 Context Providers
11.6.6 Adaptation Decision Engine
11.6.7 Adaptation Authorization
11.6.8 Adaptation Engines Stack
11.6.9 Interfaces between Modules of the Content Adaptation Platform
11.7 Conclusions
References

Index
VISNET II Researchers

UniS
Omar Abdul-Hameed, Zaheer Ahmad, Hemantha Kodikara Arachchi, Murat Badem, Janko Calic, Safak Dogan, Erhan Ekmekcioglu, Anil Fernando, Christine Glaser, Banu Gunel, Huseyin Hacihabiboglu, Hezerul Abdul Karim, Ahmet Kondoz, Yingdong Ma, Marta Mrak, Sabih Nasir, Gokce Nur, Surachai Ongkittikul, Kan Ren, Daniel Rodriguez, Amy Tan, Eeriwarawe Thushara, Halil Uzuner, Stephane Villette, Rajitha Weerakkody, Stewart Worrall, Lasith Yasakethu

HHI
Peter Eisert, Jürgen Rurainsky, Anna Hilsmann, Benjamin Prestele, David Schneider, Philipp Fechteler, Ingo Feldmann, Jens Güther, Karsten Grüneberg, Oliver Schreer, Ralf Tanger

EPFL
Touradj Ebrahimi, Frederic Dufaux, Thien Ha-Minh, Michael Ansorge, Shuiming Ye, Yannick Maret, David Marimon, Ulrich Hoffmann, Mourad Ouaret, Francesca De Simone, Carlos Bandeirinha, Peter Vajda, Ashkan Yazdani, Gelareh Mohammadi, Alessandro Tortelli, Luca Bonardi, Davide Forzati

IST
Fernando Pereira, João Ascenso, Catarina Brites, Luis Ducla Soares, Paulo Nunes, Paulo Correia, Jose Diogo Areia, Jose Quintas Pedro, Ricardo Martins

UPC-TSC
Pere Joaquim Mindan, Jose Luis Valenzuela, Toni Rama, Luis Torres, Francesc Tarres

UPC-AC
Jaime Delgado, Eva Rodríguez, Anna Carreras, Ruben Tous

TRT-UK
Chris Firth, Tim Masterton, Adrian Waller, Darren Price, Rachel Craddock, Marcello Goccia, Ian Mockford, Hamid Asgari, Charlie Attwood, Peter de Waard, Jonathan Dennis, Doug Watson, Val Millington, Andy Vooght

TUB
Thomas Sikora, Zouhair Belkoura, Juan Jose Burred, Michael Droese, Ronald Glasberg, Lutz Goldmann, Shan Jin, Mustafa Karaman, Andreas Krutz, Amjad Samour

TiLab
Giovanni Cordara, Gianluca Francini, Skjalg Lepsoy, Diego Gibellino

UPF
Enric Peig, Víctor Torres, Xavier Perramon

PoliMi
Fabio Antonacci, Alberto Calatroni, Marco Marcon, Matteo Naccari, Davide Onofrio, Giorgio Prandi, Davide Riva, Francesco Santagata, Marco Tagliasacchi, Stefano Tubaro, Giuseppe Valenzise

IPW
Stanisław Badura, Lilla Bagińska, Jarosław Baszun, Filip Borowski, Andrzej Buchowicz, Emil Dmoch, Edyta Dąbrowska, Grzegorz Galiński, Piotr Garbat, Krystian Ignasiak, Mariusz Jakubowski, Mariusz Leszczyński, Marcin Morgoś, Jacek Naruniec, Artur Nowakowski, Adam Ołdak, Grzegorz Pastuszak, Andrzej Pietrasiewicz, Adam Pietrowcew, Sławomir Rymaszewski, Radosław Sikora, Władysław Skarbek, Marek Sutkowski, Michał Tomaszewski, Karol Wnukowicz

INESC Porto
Giorgiana Ciobanu, Filipe Sousa, Jaime Cardoso, Jaime Dias, Jorge Mamede, Jose Ruela, Luís Corte-Real, Luís Gustavo Martins, Luís Filipe Teixeira, Maria Teresa Andrade, Pedro Carvalho, Ricardo Duarte, Vítor Barbosa
Preface

VISNET II is a European Union Network of Excellence (NoE) in the 6th Framework Programme, bringing together 12 leading European organizations in the field of Networked Audiovisual Media Technologies. The consortium consists of organizations with proven track records and strong national and international reputations in audiovisual information technologies. VISNET II integrates over 100 researchers who have made significant contributions to this field through standardization activities, international publications, conference and workshop activities, patents, and many other prestigious achievements. The 12 integrated organizations represent 7 European states spanning a major part of Europe, thereby promising efficient dissemination and exploitation of the resulting technological developments to larger communities.

This book contains some of the research output of VISNET II in the area of advanced video coding and networking. It details video coding principles, which lead on to advanced video coding developments in the form of scalable coding, distributed video coding, non-normative video coding tools, and transform-based multi-view coding. Having detailed the latest work in visual media coding, the second part of the book presents the networking aspects of video communication. Various wireless channel models are presented, to form the basis for the following chapters. Both link-level quality of service (QoS) and cross-network transmission of compressed visual data are considered. Finally, context-based visual media content adaptation is discussed with some examples.

It is hoped that this book will serve as a reference not only for some of the advanced video coding techniques, but also for the transmission of video across various wireless systems with well-defined channel models.

Ahmet Kondoz
University of Surrey
VISNET II Coordinator
Glossary of Abbreviations

3GPP  3rd Generation Partnership Project
AA  Adaptation Authorizer
ADE  Adaptation Decision Engine
ADMITS  Adaptation in Distributed Multimedia IT Systems
ADTE  Adaptation Decision Taking Engine
AE  Adaptation Engine
AES  Adaptation Engine Stack
AIR  Adaptive Intra Refresh
API  Application Programming Interface
AQoS  Adaptation Quality of Service
ASC  Aspect-Scale-Context
AV  Audiovisual
AVC  Advanced Video Coding
BLER  Block Error Rate
BSD  Bitstream Syntax Description
BSDL  Bitstream Syntax Description Language
CC  Convolutional Coding
CC  Creative Commons
CC/PP  Composite Capabilities/Preferences Profile
CD  Coefficient Dropping
CDN  Content Distribution Networks
CIF  Common Intermediate Format
CoBrA  Context Broker Architecture
CoDAMoS  Context-Driven Adaptation of Mobile Services
CoOL  Context Ontology Language
CoGITO  Context Gatherer, Interpreter and Transformer using Ontologies
CPU  Central Processing Unit
CROSLOCIS  Creation of Smart Local City Services
CS/H.264/AVC  Cropping and Scaling of H.264/AVC Encoded Video
CxP  Context Provider
DAML  DARPA Agent Markup Language
DANAE  Dynamic and distributed Adaptation of scalable multimedia content in a context-Aware Environment
dB  Decibel
DB  Database
DCT  Discrete Cosine Transform
DI  Digital Item
DIA  Digital Item Adaptation
DID  Digital Item Declaration
DIDL  Digital Item Declaration Language
DIP  Digital Item Processing
DistriNet  Distributed Systems and Computer Networks
DPRL  Digital Property Rights Language
DRM  Digital Rights Management
DS  Description Schemes
EC  European Community
EIMS  ENTHRONE Integrated Management Supervisor
FA  Frame Adaptor
FD  Frame Dropping
FMO  Flexible Macroblock Ordering
FP  Framework Program
gBS  Generic Bitstream Syntax
HCI  Human–Computer Interface
HDTV  High-Definition Television
HP  Hewlett Packard
HTML  HyperText Markup Language
IEC  International Electrotechnical Commission
IETF  Internet Engineering Task Force
IBM  International Business Machines Corporation
iCAP  Internet Content Adaptation Protocol
IPR  Intellectual Property Rights
IROI  Interactive Region of Interest
ISO  International Organization for Standardization
IST  Information Society Technologies
ITEC  Department of Information Technology, Klagenfurt University
JPEG  Joint Photographic Experts Group
JSVM  Joint Scalable Video Model
MB  Macroblock
MDS  Multimedia Description Schemes
MIT  Massachusetts Institute of Technology
MOS  Mean Opinion Score
MP3  Moving Picture Experts Group Layer-3 Audio (audio file format/extension)
MPEG  Moving Picture Experts Group
MVP  Motion Vector Predictor
NAL  Network Abstraction Layer
NALU  Network Abstraction Layer Unit
NoE  Network of Excellence
ODRL  Open Digital Rights Language
OIL  Ontology Interchange Language
OMA  Open Mobile Alliance
OSCRA  Optimized Source and Channel Rate Allocation
OWL  Web Ontology Language
P2P  Peer-to-Peer
PDA  Personal Digital Assistant
PSNR  Peak Signal-to-Noise Ratio
QCIF  Quarter Common Intermediate Format
QoS  Quality of Service
QP  Quantization Parameter
RD  Rate Distortion
RDF  Resource Description Framework
RDB  Reference Data Base
RDD  Rights Data Dictionary
RDOPT  Rate Distortion Optimization
REL  Rights Expression Language
ROI  Region of Interest
SECAS  Simple Environment for Context-Aware Systems
SNR  Signal-to-Noise Ratio
SOAP  Simple Object Access Protocol
SOCAM  Service-Oriented Context-Aware Middleware
SVC  Scalable Video Coding
TM5  Test Model 5
UaProf  User Agent Profile
UCD  Universal Constraints Descriptor
UED  Usage Environment Descriptions
UEP  Unequal Error Protection
UF  Utility Function
UI  User Item
UMA  Universal Multimedia Access
UMTS  Universal Mobile Telecommunications System
URI  Uniform Resource Identifier
UTRAN  UMTS Terrestrial Radio Access Network
VCS  Virtual Collaboration System
VoD  Video on Demand
VOP  Video Object Plane
VQM  Video Quality Metric
W3C  World Wide Web Consortium
WAP  Wireless Application Protocol
WCDMA  Wideband Code Division Multiple Access
WDP  Wireless Datagram Protocol
WLAN  Wireless Local Area Network
WML  Wireless Markup Language
WiFi  Wireless Fidelity (IEEE 802.11b Wireless Networking)
XML  eXtensible Markup Language
XrML  eXtensible rights Markup Language
XSLT  eXtensible Stylesheet Language Transformations
1 Introduction

Networked Audio-Visual Technologies form the basis for the multimedia communication systems that we currently use. The communication systems that must be supported are diverse, ranging from fixed wired to mobile wireless systems. In order to enable an efficient and cost-effective Networked Audio-Visual System, two major technological areas need to be investigated: first, how to process the content for transmission purposes, which involves various media compression processes; and second, how to transport it over the diverse network technologies that are currently in use or will be deployed in the near future. In this book, therefore, visual data compression schemes are presented first, followed by a description of various media transmission aspects, including various channel models, and content and link adaptation techniques.

Raw digital video signals are very large in size, making them very difficult to transmit or store. Video compression techniques are therefore essential enabling technologies for digital multimedia applications. Since 1984, a wide range of digital video codecs have been standardized, each of which represents a step forward either in terms of compression efficiency or in functionality. The MPEG-x and H.26x video coding standards adopt a hybrid coding approach, employing block-matching motion estimation/compensation in addition to the discrete cosine transform (DCT) and quantization. The reasons are: first, a significant proportion of the motion trajectories found in natural video can be approximately described with a rigid translational motion model; second, fewer bits are required to describe simple translational motion; and finally, the implementation is relatively straightforward and amenable to hardware solutions. These hybrid video systems have provided interoperability in heterogeneous network systems.
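The block-matching component of this hybrid approach can be illustrated with a toy full-search routine. This is an illustrative sketch only, not code from any of the standards; the block size, search range, and function name are chosen for the example:

```python
import numpy as np

def full_search(ref, cur, bx, by, B=8, R=4):
    """Find the motion vector for the B x B block at (by, bx) in `cur`
    by exhaustively matching it against `ref` within a +/-R pixel search
    window, using the sum of absolute differences (SAD) criterion."""
    block = cur[by:by + B, bx:bx + B].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + B, x:x + B].astype(np.int32)
            sad = int(np.abs(block - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# A block whose content moved by (dy=1, dx=2) between frames is recovered:
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (32, 32), dtype=np.uint8)
cur = np.zeros_like(ref)
cur[8:16, 8:16] = ref[9:17, 10:18]   # copy the block, displaced by (1, 2)
mv, sad = full_search(ref, cur, bx=8, by=8)
```

Real encoders replace the exhaustive search with fast strategies (and sub-pixel refinement), but the principle is the same: the best-matching displacement is transmitted as a motion vector, and only the prediction residual is transform coded.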
Considering that transmission bandwidth is still a valuable commodity, ongoing developments in video coding seek scalability solutions to achieve a one-coding–multiple-decoding feature. To this end, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) has standardized a scalability extension to the existing H.264/AVC codec. H.264-based Scalable Video Coding (SVC) allows partial transmission and decoding of the bitstream, resulting in various options in terms of picture quality and spatio-temporal resolution. In this book, several advanced features and techniques relating to scalable video coding are further described, mostly to do with 3D scalable video coding applications. Applications and scenarios for scalable coding systems, advances in scalable video coding for 3D video applications, a non-standardized scalable 2D model-based video coding scheme applied to the
texture, and depth coding of 3D video are all discussed. A scalable multiple description coding (MDC) application for stereoscopic 3D video is detailed. Multi-view coding and distributed video coding concepts, representing the latest advancements in video coding, are also covered in significant depth.

The definition of video coding standards is of the utmost importance because it guarantees that video coding equipment from different manufacturers will be able to interoperate. However, the definition of a standard also represents a significant constraint for manufacturers because it limits what they can do. Therefore, in order to minimize the restrictions imposed on manufacturers, only those tools that are essential for interoperability are typically specified in the standard: the normative tools. The remaining tools, which are not standardized but are also important in video coding systems, are referred to as non-normative tools, and this is where competition and evolution of the technology have been taking place. In fact, this strategy of specifying only the bare minimum that can guarantee interoperability ensures that the latest developments in the area of non-normative tools can be easily incorporated into video codecs without compromising their standard compatibility, even after the standard has been finalized. In addition, this strategy makes it possible for manufacturers to compete against each other and to distinguish their products in the market. A significant amount of research effort is being devoted to the development of non-normative video coding tools, with the target of improving the performance of standard video codecs. In particular, due to their importance, non-normative rate control and error resilience tools are being researched. In this book, therefore, the development of efficient tools for the modules that are non-normative in video coding standards, such as rate control and error concealment, is discussed.
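Non-normative rate control tools all revolve around one feedback loop: measure the bits a coded frame produced, compare with the budget, and adjust the quantization parameter (QP). The following deliberately simplified sketch illustrates only this loop; the step size, the inverse-QP bit model, and all numbers are invented for the example, and real controllers (such as MPEG-2's TM5) are far more elaborate:

```python
def update_qp(qp, bits_produced, bits_target, step=2, qp_min=1, qp_max=51):
    """One step of a trivial rate-control loop: raise QP (coarser
    quantization, fewer bits) when the frame overshot its bit budget,
    lower it when the frame undershot."""
    if bits_produced > bits_target:
        qp = min(qp + step, qp_max)
    elif bits_produced < bits_target:
        qp = max(qp - step, qp_min)
    return qp

def simulate(frame_complexities, bits_target, qp0=30):
    """Toy encoder model: a frame's bit cost is proportional to its
    complexity and inversely proportional to the current QP."""
    qp, produced = qp0, []
    for c in frame_complexities:
        bits = int(c / qp)          # stand-in for actually encoding a frame
        produced.append(bits)
        qp = update_qp(qp, bits, bits_target)
    return produced, qp

# With constant complexity the loop settles where 90_000 / qp = 2500:
bits, final_qp = simulate([90_000] * 20, bits_target=2500)
```

Because QP is a non-normative encoder choice, any such strategy, however crude or sophisticated, produces a bitstream every compliant decoder can read; this is exactly the room the standards leave for competition.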
For example, multiple video sequence (MVS) joint rate control addresses the development of rate control solutions for encoding video scenes formed from a composition of video objects (VOs), such as in the MPEG-4 standard, and can also be applied to the joint encoding and transcoding of multiple video sequences (VSs) to be transmitted over bandwidth-limited channels using the H.264/AVC standard.

The goal of wireless communication is to allow a user to access required services at any time, regardless of location or mobility. Recent developments in wireless communications, multimedia technologies, and microelectronics have created a new paradigm in mobile communications. Third/fourth-generation (3G/4G) wireless communication technologies provide significantly higher transmission rates and service flexibility over a wide coverage area compared with second-generation (2G) wireless communication systems. High-compression, error-robust multimedia codecs have been designed to enable the support of multimedia applications over error-prone, bandwidth-limited channels. Advances in VLSI and DSP technologies are enabling lightweight, low-cost, portable devices capable of transmitting and viewing multimedia streams. These technological developments have shifted the service requirements of mobile communication from conventional voice telephony to business- and entertainment-oriented multimedia services. In order to successfully meet the challenges set by current and future audiovisual communication requirements, the International Telecommunication Union Radiocommunication Sector (ITU-R) has elaborated a framework for global 3G standards by recognizing a limited number of radio access technologies. These are: Universal Mobile Telecommunications System (UMTS), Enhanced Data rates for GSM Evolution (EDGE), and CDMA2000.
UMTS is based on Wideband CDMA technology and is employed in Europe and Asia using the frequency band around 2 GHz. EDGE is based on TDMA technology and uses
the same air interface as the successful 2G mobile system GSM. General Packet Radio Service (GPRS) and High-Speed Circuit Switched Data (HSCSD) were introduced by Phase 2+ of the GSM standardization process. They support enhanced services with data rates up to 144 kbps in the packet-switched and circuit-switched domains, respectively. EDGE, which is the evolution of GPRS and HSCSD, provides 3G services up to 500 kbps within the GSM carrier spacing of 200 kHz. CDMA2000 is based on multi-carrier CDMA technology and provides the upgrade solution for existing IS-95 operators, mainly in North America. EDGE and UMTS are the most widely accepted 3G radio access technologies. They are standardized by the 3rd Generation Partnership Project (3GPP). Even though EDGE and UMTS are based on two different multiple-access technologies, both systems share the same core network: the evolved GSM core network serves as a common GSM/UMTS core network supporting GSM/GPRS/EDGE and UMTS access.

In addition, Wireless Local Area Networks (WLANs) are becoming more and more popular for communication in homes, offices, and indoor public areas such as campus environments, airports, hotels, and shopping centres. IEEE 802.11 has a number of physical layer specifications with a common MAC operation. IEEE 802.11 includes two physical layers – a frequency-hopping spread-spectrum (FHSS) physical layer and a direct-sequence spread-spectrum (DSSS) physical layer – and operates at 2 Mbps. The widely deployed IEEE 802.11b standard provides an additional physical layer based on high-rate direct-sequence spread spectrum (HR/DSSS); it operates in the 2.4 GHz unlicensed band and provides bit rates up to 11 Mbps. The IEEE 802.11a standard, for the 5 GHz band, provides bit rates up to 54 Mbps using a physical layer based on orthogonal frequency division multiplexing (OFDM). More recently, the IEEE 802.11g standard has been issued to achieve similarly high bit rates in the 2.4 GHz band.
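To put the bearer rates above (144 kbps for GPRS/HSCSD, around 500 kbps for EDGE, 11 Mbps for IEEE 802.11b) against the size of raw video, a back-of-the-envelope calculation helps. The 12 bits/pixel figure assumes 4:2:0 chroma subsampling (8 bits of luma plus 4 bits of subsampled chroma per pixel); the function name and frame rate are chosen for the example:

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=12):
    """Raw bit rate of uncompressed video; 12 bits/pixel corresponds
    to 4:2:0 sampling (8 bits luma + 4 bits chroma per pixel)."""
    return width * height * bits_per_pixel * fps

# QCIF (176 x 144), a typical mobile-video format, at 15 frames/s:
qcif = raw_bitrate_bps(176, 144, 15)   # ~4.56 Mbps of raw video
ratio_gprs = qcif / 144_000            # vs. the 144 kbps GPRS bearer
```

Even this small format needs a compression ratio of over 30:1 just to fill the whole GPRS bearer, and far more in practice once protocol overheads and shared capacity are accounted for, which is why the high-compression codecs discussed above are a precondition for mobile video services.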
The Worldwide Interoperability for Microwave Access (WiMAX) is a telecommunications technology aimed at providing wireless data over long distances in different ways, from point-to-point links to full mobile cellular access. It is based on the IEEE 802.16 standard, which is also called WirelessMAN. The name WiMAX was created by the WiMAX Forum, which was formed in June 2001 to promote conformance and interoperability of the standard. The forum describes WiMAX as “a standards-based technology enabling the delivery of last mile wireless broadband access as an alternative to cable and DSL”. Mobile WiMAX (IEEE 802.16e) provides fixed, nomadic, and mobile broadband wireless access systems with superior throughput performance. It enables non-line-of-sight reception, and can also cope with high mobility of the receiving station. IEEE 802.16e enables nomadic capabilities for laptops and other mobile devices, allowing users to benefit from metro-area portability of an xDSL-like service.

Multimedia services by definition require the transmission of multiple media streams, such as video, still pictures, music, voice, and text data. A combination of these media types provides a number of value-added services, including video telephony, e-commerce services, multiparty video conferencing, virtual office, and 3D video. 3D video, for example, provides more natural and immersive visual information to end users than standard 2D video. In the near future, certain 2D video application scenarios are likely to be replaced by 3D video in order to achieve a more involving and immersive representation of visual information and to provide more natural methods of communication. 3D video transmission, however, requires more resources than conventional video communication applications. Different media types have different quality-of-service (QoS) requirements and enforce conflicting constraints on the communication networks.
Still picture and text data are categorized as background services and require high data rates but have no constraints on
the transmission delay. Voice services, on the other hand, are characterized by low-delay requirements. However, they can be coded using fixed low-rate algorithms operating in the 5–24 kbps range. In contrast to voice and data services, low-bit-rate video coding involves rates of tens to hundreds of kbps. Moreover, video applications are delay sensitive and impose tight constraints on system resources. Mobile multimedia applications, consisting of multiple signal types, play an important role in the rapid penetration of future communication services and the success of these communication systems. Even though high transmission rates and service flexibility have made wireless multimedia communication possible over 3G/4G wireless communication systems, many challenges remain to be addressed in order to support efficient communications in multi-user, multi-service environments. In addition to the high initial cost associated with the deployment of 3G systems, the move from telephony and low-bit-rate data services to bandwidth-consuming 3G services implies high system costs, as these services consume a large portion of the available resources. However, for rapid market evolution, these wideband services should not be substantially more expensive than the services offered today. Therefore, efficient system resource (mainly the bandwidth-limited radio resource) utilization and QoS management are critical in 3G/4G systems. Efficient resource management and the provision of QoS for multimedia applications are in sharp conflict with one another. Of course, it is possible to provide high-quality multimedia services by using a large amount of radio resources and very strong channel protection. However, this is clearly inefficient in terms of system resource allocation.
Moreover, the perceptual multimedia quality received by end users depends on many factors, such as source rate, channel protection, channel quality, error resilience techniques, transmission/processing power, system load, and user interference. Therefore, it is difficult to obtain an optimal source and network parameter combination for a given set of source and channel characteristics. The time-varying error characteristics of the radio access channel aggravate the problem. In this book, therefore, various QoS-based resource management systems are detailed. For comparison and validation purposes, a number of wireless channel models are described. The key QoS improvement techniques, including content- and link-adaptation techniques, are covered.

The future media Internet will allow new applications with support for ubiquitous media-rich content service technologies to be realized. Virtual collaboration, extended home platforms, augmented, mixed, and virtual realities, gaming, telemedicine, e-learning, and so on, in which users with possibly diverse geographical locations, terminal types, connectivity, usage environments, and preferences access and exchange pervasive yet protected and trusted content, are just a few examples. These multiple forms of diversity require content to be transported and rendered in different forms, which necessitates the use of context-aware content adaptation. This avoids the alternative of predicting, generating, and storing all the different forms required for every item of content. Therefore, there is a growing need for devising adequate concepts and functionalities for a context-aware content adaptation platform that suits the requirements of such multimedia application scenarios. This platform needs to be able to consume low-level contextual information to infer higher-level contexts, and thus decide on the need for, and type of, adaptation operations to be performed upon the content.
In this way, usage constraints can be met while restrictions imposed by the Digital Rights Management (DRM) governing the use of protected content are satisfied. In this book, comprehensive discussions are presented on the use of contextual information in adaptation decision operations, with a view to managing the DRM and the authorization for adaptation, consequently outlining the appropriate adaptation decision techniques and
adaptation mechanisms. The main challenges are found by identifying integrated tools and systems that support adaptive, context-aware, and distributed applications which react to the characteristics and conditions of the usage environment and provide transparent access to, and delivery of, content, where digital rights are adequately managed. The discussions focus on describing a scalable platform for context-aware and DRM-enabled adaptation of multimedia content. The platform has a modular architecture to ensure scalability, and well-defined interfaces based on open standards for interoperability as well as portability. The modules are classified into four categories, namely:

1. Adaptation Decision Engine (ADE);
2. Adaptation Authoriser (AA);
3. Context Providers (CxPs); and
4. Adaptation Engine Stacks (AESs), which comprise Adaptation Engines (AEs).

During the adaptation decision-taking stage the platform uses ontologies to enable semantic description of real-world situations. The decision-taking process is triggered by low-level contextual information and driven by rules provided by the ontologies. It supports a variety of adaptations, which can be dynamically configured. The overall objective of this platform is to enable the efficient gathering and use of context information, ultimately in order to build content adaptation applications that maximize user satisfaction.
2 Video Coding Principles

2.1 Introduction

Raw digital video signals are very large in size, making it very difficult to transmit or store them. Video compression techniques are therefore essential enabling technologies for digital multimedia applications. Since 1984, a wide range of digital video codecs have been standardized, each of which represents a step forward either in terms of compression efficiency or in functionality. This chapter describes the basic principles behind most standard block-based video codecs currently in use. It begins with a discussion of the types of redundancy present in most video signals (Section 2.2) and proceeds to describe some basic techniques for removing such redundancies (Section 2.3). Section 2.4 investigates enhancements to the basic techniques which have been used in recent video coding standards to provide improvements in video quality. This section also discusses the effects of communication channel errors on decoded video quality. Section 2.5 provides a summary of the available video coding standards and describes some of the key differences between them. Section 2.6 gives an overview of how video quality can be assessed. It includes a description of objective and subjective assessment techniques.
2.2 Redundancy in Video Signals

Compression techniques are generally based upon removal of redundancy in the original signal. In video signals, the redundancy can be classified as spatial, temporal, or source-coding redundancy. Most standard video codecs attempt to remove these types of redundancy, taking into account certain properties of the human visual system. Spatial redundancy is present in areas of images or video frames where pixel values vary by small amounts. In the image shown in Figure 2.1, spatial redundancy is present in parts of the background, and in skin areas such as the shoulder. Temporal redundancy is present in video signals when there is significant similarity between successive video frames. Figure 2.2 shows two successive frames from a video sequence. It is clear that the difference between the two frames is small, indicating that it would be inefficient to simply compress a video signal as a series of images.
Figure 2.1 Spatial redundancy is present in areas of an image or video frame where the pixel values are very similar
Source-coding redundancy is present if the symbols produced by the video codec are inefficiently mapped to a binary bitstream. Typically, entropy coding techniques are used to exploit the statistics of the output video data, where some symbols occur with greater probability than others.
2.3 Fundamentals of Video Compression

This section describes how spatial redundancy and temporal redundancy can be removed from a video signal. It also describes how a typical video codec combines the two techniques to achieve compression.
2.3.1 Video Signal Representation and Picture Structure

Video coding is usually performed with YUV 4:2:0 format video as an input. This format represents video using one luminance plane (Y) and two chrominance planes (Cb and Cr). The luminance plane represents black and white information, while the chrominance planes contain all of the color data. Because luminance data is perceptually more important than the
Figure 2.2 Temporal redundancy occurs when there is a large amount of similarity between video frames
Figure 2.3 Most video codecs break up a video frame into a number of smaller units for coding
chrominance data, the resolution of the chrominance planes is half that of the luminance in both dimensions. Thus, each chrominance plane contains a quarter of the pixels contained in the luminance plane. Downsampling the color information means that less information needs to be compressed, but it does not result in a significant degradation in quality. Most video coding standards split each video frame into macroblocks (MBs), which are 16 × 16 pixels in size. For the YUV 4:2:0 format, each MB contains four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks, as shown in Figure 2.3. The two chrominance blocks contain information from the Cr and Cb planes respectively. Video codecs code each video frame, starting with the MB in the top left-hand corner. The codec then proceeds horizontally along each row, from left to right. MBs can be grouped. Groups of MBs are known by different names in different standards. For example:

- Group of Blocks (GOB): H.263 [1–3]
- Video packet: MPEG-4 Versions 1 and 2 [4–6]
- Slice: MPEG-2 [7] and H.264 [8–10]
The grouping is usually performed to make the video bitstream more robust to packet losses in communications channels. Section 2.4.8 includes a description of how video slices can be used in error-resilient video coding.
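The 4:2:0 macroblock structure described above can be sketched in code. The following Python fragment (function names are illustrative, not taken from any standard codec API) gathers the four 8 × 8 luminance blocks and the two 8 × 8 chrominance blocks of one macroblock:

```python
def mb_blocks(y, cb, cr, mbx, mby):
    """Collect the six 8x8 blocks of the macroblock at MB column mbx and
    MB row mby: four luminance blocks plus one block from each of the
    half-resolution chrominance planes of a YUV 4:2:0 frame."""
    def block8(plane, x0, y0):
        return [row[x0:x0 + 8] for row in plane[y0:y0 + 8]]

    bx, by = 16 * mbx, 16 * mby
    luma = [block8(y, bx + dx, by + dy)
            for dy in (0, 8) for dx in (0, 8)]    # four 8x8 Y blocks
    chroma = [block8(cb, 8 * mbx, 8 * mby),       # one Cb block
              block8(cr, 8 * mbx, 8 * mby)]       # one Cr block
    return luma + chroma

# A QCIF frame: 176x144 luma, 88x72 chroma, i.e. a grid of 11x9 macroblocks
Y = [[0] * 176 for _ in range(144)]
Cb = [[0] * 88 for _ in range(72)]
Cr = [[0] * 88 for _ in range(72)]
blocks = mb_blocks(Y, Cb, Cr, mbx=0, mby=0)
```

Iterating `mbx` and `mby` row by row reproduces the left-to-right, top-to-bottom coding order described above.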
2.3.2 Removing Spatial Redundancy

Removal of spatial redundancy can be achieved by taking into account:

- The characteristics of the human vision system: human vision is more sensitive to low-frequency image data than high-frequency data. In addition, luminance information is more important than chrominance information.
- Common features of image/video signals: Figure 2.4 shows an image that has been high-pass and low-pass filtered. It is clear from the images that the low-pass-filtered version contains more energy and more useful information than the high-pass-filtered one.
Figure 2.4 ‘Lena’ image subjected to a high-pass and a low-pass filter
These factors suggest that it is advantageous to consider image/video compression in the frequency domain. Therefore, a transform is needed to convert the original image/video signal into frequency coefficients. The DCT is the most widely used transform in lossy image and video compression. It permits the removal of spatial redundancy by compacting most of the energy of the block into a few coefficients. Each 8 × 8 pixel block is put through the discrete cosine transform (DCT):

$$ S(k_1,k_2) = \frac{2}{N}\, C(k_1)\, C(k_2) \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} s(n_1,n_2) \cos\frac{\pi(2n_1+1)k_1}{2N} \cos\frac{\pi(2n_2+1)k_2}{2N} \qquad (2.1) $$

where $k_1, k_2, n_1, n_2 = 0, 1, \ldots, N-1$; $N$ is the block size ($N = 8$);

$$ C(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases} $$

s(n_1, n_2) is an original 8 × 8 image block, and S(k_1, k_2) is the 8 × 8 transformed block.
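As a concrete illustration of Equation (2.1), the following Python sketch applies the 8 × 8 DCT to the example block used in the text. Note that the coefficient values quoted in the text are only reproduced if the pixels are first level-shifted by 128, the usual JPEG-style practice; that shift is assumed here, and the function name `dct2` is illustrative:

```python
import math

N = 8

def dct2(block):
    """Direct 2D DCT of an 8x8 block, following Equation (2.1)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    return [[(2.0 / N) * c(k1) * c(k2) * sum(
                 block[n1][n2]
                 * math.cos(math.pi * (2 * n1 + 1) * k1 / (2 * N))
                 * math.cos(math.pi * (2 * n2 + 1) * k2 / (2 * N))
                 for n1 in range(N) for n2 in range(N))
             for k2 in range(N)]
            for k1 in range(N)]

# The example block of Equation (2.2)
s = [[183, 160,  94, 153, 194, 163, 132, 165],
     [183, 153, 116, 176, 187, 166, 130, 169],
     [179, 168, 171, 182, 179, 170, 131, 167],
     [177, 177, 179, 177, 179, 165, 131, 167],
     [178, 178, 179, 176, 182, 164, 130, 171],
     [179, 180, 180, 179, 183, 169, 132, 173],
     [179, 179, 180, 182, 183, 170, 129, 173],
     [180, 179, 181, 179, 181, 170, 130, 169]]

# Level-shift by 128 before transforming (assumed; the text's coefficient
# values imply this step), then transform
S = dct2([[p - 128 for p in row] for row in s])
# S[0][0] is the DC coefficient, 313.5 before rounding
```

Most of the energy lands in the top-left (low-frequency) corner of `S`, which is exactly the compaction property the text describes.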
Figure 2.5 Transform-based compression: (a) original image block; (b) DCT-transformed block
An example of the DCT in action is shown below, and illustrated in Figure 2.5. An input block, s(n_1, n_2), is first taken from an image:

$$ s(n_1,n_2) = \begin{bmatrix} 183 & 160 & 94 & 153 & 194 & 163 & 132 & 165 \\ 183 & 153 & 116 & 176 & 187 & 166 & 130 & 169 \\ 179 & 168 & 171 & 182 & 179 & 170 & 131 & 167 \\ 177 & 177 & 179 & 177 & 179 & 165 & 131 & 167 \\ 178 & 178 & 179 & 176 & 182 & 164 & 130 & 171 \\ 179 & 180 & 180 & 179 & 183 & 169 & 132 & 173 \\ 179 & 179 & 180 & 182 & 183 & 170 & 129 & 173 \\ 180 & 179 & 181 & 179 & 181 & 170 & 130 & 169 \end{bmatrix} \qquad (2.2) $$
It is then put through the DCT, and rounded to the nearest integer:

$$ |S(k_1,k_2)| = \begin{bmatrix} 313 & 56 & 27 & 18 & 78 & 60 & 27 & 27 \\ 38 & 27 & 13 & 44 & 32 & 1 & 24 & 10 \\ 20 & 17 & 10 & 33 & 21 & 6 & 16 & 9 \\ 10 & 8 & 9 & 17 & 9 & 10 & 13 & 1 \\ 6 & 1 & 6 & 4 & 3 & 7 & 5 & 5 \\ 2 & 3 & 0 & 3 & 7 & 4 & 0 & 3 \\ 4 & 4 & 1 & 2 & 9 & 0 & 2 & 4 \\ 3 & 1 & 0 & 4 & 2 & 1 & 3 & 1 \end{bmatrix} \qquad (2.3) $$
The coefficients in the transformed block represent the energy contained in the block at different frequencies. The lowest frequencies, starting with the DC coefficient, are contained in the top-left corner, while the highest frequencies are contained in the bottom-right, as shown in Figure 2.5. Note that many of the high-frequency coefficients are much smaller than the low-frequency coefficients. Most of the energy in the block is now contained in a few low-frequency coefficients. This is important, as the human eye is most sensitive to low-frequency data.
It should be noted that in most video codec implementations the 2D DCT calculation is replaced by 1D DCT calculations, which are performed on each row and column of the 8 × 8 block:

$$ F(u) = a(u) \sum_{x=0}^{N-1} f(x) \cos\frac{\pi(2x+1)u}{2N} \qquad (2.4) $$

for $u = 0, 1, 2, \ldots, N-1$. The value of a(u) is defined as:

$$ a(u) = \begin{cases} \sqrt{1/N} & \text{for } u = 0 \\ \sqrt{2/N} & \text{for } u \neq 0 \end{cases} \qquad (2.5) $$
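The row-column separation can be sketched as follows: applying the 1D transform of Equations (2.4) and (2.5) to every row and then to every column of the intermediate result reproduces the 2D DCT of Equation (2.1). Function names are illustrative:

```python
import math

N = 8

def dct1(f):
    """1D DCT of a length-8 sequence, Equations (2.4) and (2.5)."""
    def a(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [a(u) * sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                       for x in range(N))
            for u in range(N)]

def dct2_separable(block):
    """2D DCT computed as a 1D DCT over every row, then over every column."""
    rows = [dct1(row) for row in block]              # horizontal pass
    cols = [dct1([rows[x][u] for x in range(N)])     # vertical pass
            for u in range(N)]
    # cols[u][v] holds the coefficient at vertical frequency v and
    # horizontal frequency u, so transpose back into S(k1, k2) order
    return [[cols[k2][k1] for k2 in range(N)] for k1 in range(N)]

# A constant block puts all of its energy into the DC coefficient:
# S_flat[0][0] is 8 * 128 = 1024 (up to floating-point error)
flat = [[128] * N for _ in range(N)]
S_flat = dct2_separable(flat)
```

The separable form needs $2N$ one-dimensional transforms per block instead of one $N^2$-term double sum per coefficient, which is why it is preferred in practice.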
The 1D DCT is used as it is easier to optimize in terms of computational complexity. Next, quantization is performed by dividing the transformed DCT block by a quantization matrix. The standard quantization matrices used in the JPEG codec are shown below:

$$ Q_Y = \begin{bmatrix} 16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\ 12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\ 14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\ 14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\ 18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\ 24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\ 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\ 72 & 92 & 95 & 98 & 112 & 100 & 103 & 99 \end{bmatrix} \qquad (2.6) $$

$$ Q_{UV} = \begin{bmatrix} 17 & 18 & 24 & 47 & 99 & 99 & 99 & 99 \\ 18 & 21 & 26 & 66 & 99 & 99 & 99 & 99 \\ 24 & 26 & 56 & 99 & 99 & 99 & 99 & 99 \\ 47 & 66 & 99 & 99 & 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99 \\ 99 & 99 & 99 & 99 & 99 & 99 & 99 & 99 \end{bmatrix} \qquad (2.7) $$
where QY is the matrix for luminance (Y plane) and QUV is the matrix for chrominance (U and V planes). The matrix values are set using psycho-visual measurements. Different matrices are used for luminance and chrominance because of the differing perceptual importance of the planes. The quantization matrices determine the output picture quality and output file size. Scaling the matrices by a value greater than 1 increases the coarseness of the quantization, reducing quality. However, such scaling also reduces the number of nonzero coefficients and the size of the nonzero coefficients, which reduces the number of bits needed to code the video.
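A minimal sketch of this quantization step, using the JPEG luminance matrix of Equation (2.6), might look as follows. The `scale` parameter illustrates the coarseness trade-off discussed above; function and variable names are illustrative:

```python
# JPEG luminance quantization matrix of Equation (2.6)
Q_Y = [[16, 11, 10, 16, 24, 40, 51, 61],
       [12, 12, 14, 19, 26, 58, 60, 55],
       [14, 13, 16, 24, 40, 57, 69, 56],
       [14, 17, 22, 29, 51, 87, 80, 62],
       [18, 22, 37, 56, 68, 109, 103, 77],
       [24, 35, 55, 64, 81, 104, 113, 92],
       [49, 64, 78, 87, 103, 121, 120, 101],
       [72, 92, 95, 98, 112, 100, 103, 99]]

def quantize(S, Q, scale=1.0):
    """Divide each coefficient by the (optionally scaled) matrix entry and
    round to the nearest integer."""
    return [[round(S[i][j] / (Q[i][j] * scale)) for j in range(8)]
            for i in range(8)]

# Coefficient magnitudes from Equation (2.3)
S_abs = [[313, 56, 27, 18, 78, 60, 27, 27],
         [ 38, 27, 13, 44, 32,  1, 24, 10],
         [ 20, 17, 10, 33, 21,  6, 16,  9],
         [ 10,  8,  9, 17,  9, 10, 13,  1],
         [  6,  1,  6,  4,  3,  7,  5,  5],
         [  2,  3,  0,  3,  7,  4,  0,  3],
         [  4,  4,  1,  2,  9,  0,  2,  4],
         [  3,  1,  0,  4,  2,  1,  3,  1]]

quantized = quantize(S_abs, Q_Y)          # reproduces Equation (2.9)
coarse = quantize(S_abs, Q_Y, scale=2.0)  # coarser: fewer nonzero coefficients
```

Doubling the matrix halves each quotient before rounding, so `coarse` contains fewer nonzero coefficients than `quantized`, at the cost of larger reconstruction error.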
2.3.2.1 H.263 Quantization Example

Take the DCT matrix from above:

$$ |S(k_1,k_2)| = \begin{bmatrix} 313 & 56 & 27 & 18 & 78 & 60 & 27 & 27 \\ 38 & 27 & 13 & 44 & 32 & 1 & 24 & 10 \\ 20 & 17 & 10 & 33 & 21 & 6 & 16 & 9 \\ 10 & 8 & 9 & 17 & 9 & 10 & 13 & 1 \\ 6 & 1 & 6 & 4 & 3 & 7 & 5 & 5 \\ 2 & 3 & 0 & 3 & 7 & 4 & 0 & 3 \\ 4 & 4 & 1 & 2 & 9 & 0 & 2 & 4 \\ 3 & 1 & 0 & 4 & 2 & 1 & 3 & 1 \end{bmatrix} \qquad (2.8) $$
and divide it by the luminance quantization matrix, rounding each result to the nearest integer:

$$ \left|\frac{S(k_1,k_2)}{Q_Y}\right| = \begin{bmatrix} 313/16 & 56/11 & \cdots & 27/61 \\ 38/12 & \ddots & & \vdots \\ \vdots & & & \\ 3/72 & \cdots & & 1/99 \end{bmatrix} = \begin{bmatrix} 20 & 5 & 3 & 1 & 3 & 2 & 1 & 0 \\ 3 & 2 & 1 & 2 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad (2.9) $$
The combination of the DCT and quantization has clearly reduced the number of nonzero coefficients. The next stage in the encoding process is to zigzag scan the DCT matrix coefficients into a new 1D coefficient matrix, as shown in Figure 2.6. Using the above example, the result is:

$$ [\,20\ 5\ 3\ 1\ 2\ 3\ 1\ 1\ 1\ 1\ 0\ 0\ 1\ 2\ 3\ 2\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 1\ 0\ 1\ \mathrm{EOB}\,] \qquad (2.10) $$
Figure 2.6 Zigzag scanning of DCT coefficients
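The zigzag traversal of Figure 2.6 can be sketched as below. Trailing zeros are trimmed and replaced by the EOB symbol; applied to the quantized block of Equation (2.9), this reproduces the sequence of Equation (2.10). Names are illustrative:

```python
def zigzag_order(n=8):
    """Coordinates of an n x n block in the zigzag order of Figure 2.6."""
    order = []
    for d in range(2 * n - 1):                       # walk the anti-diagonals
        coords = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        # alternate the traversal direction on each diagonal
        order.extend(coords if d % 2 else reversed(coords))
    return order

def zigzag_scan(block):
    """Scan a 2D block into a 1D sequence, trimming trailing zeros to EOB."""
    seq = [block[i][j] for i, j in zigzag_order(len(block))]
    while seq and seq[-1] == 0:
        seq.pop()
    return seq + ['EOB']

# Quantized block of Equation (2.9)
quantized = [[20, 5, 3, 1, 3, 2, 1, 0],
             [ 3, 2, 1, 2, 1, 0, 0, 0],
             [ 1, 1, 1, 1, 1, 0, 0, 0],
             [ 1, 0, 0, 1, 0, 0, 0, 0],
             [ 0, 0, 0, 0, 0, 0, 0, 0],
             [ 0, 0, 0, 0, 0, 0, 0, 0],
             [ 0, 0, 0, 0, 0, 0, 0, 0],
             [ 0, 0, 0, 0, 0, 0, 0, 0]]

seq = zigzag_scan(quantized)   # reproduces the sequence of Equation (2.10)
```

Because the significant coefficients cluster in the top-left corner, the zigzag order tends to push the zeros to the end of the sequence, where a single EOB symbol replaces them all.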
Table 2.1 Run-level coding

Run:    0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  6  0  1
Level:  5  3  1  2  3  1  1  1  1  1  2  3  2  1  1  1  1  1
where the EOB symbol indicates the end of the block (i.e. all following coefficients are zero). Note that the number of coefficients to be encoded has been reduced from 64 to 28 (29 including the EOB). The data is further reorganized, with the DC component (the top-left coefficient in the DCT matrix) being treated differently from the AC coefficients. DPCM (Differential Pulse Code Modulation) is used on DC coefficients in the H.263 standard [1]. This method of coding generally creates a prediction for the current block's value first, and then transmits the error between the predicted value and the actual value. Thus, the reconstructed intensity for the DC at the decoder, s(n_1, n_2), is:

$$ s(n_1,n_2) = \hat{s}(n_1,n_2) + e(n_1,n_2) \qquad (2.11) $$

where $\hat{s}(n_1,n_2)$ and $e(n_1,n_2)$ are respectively the predicted intensity and the error. For JPEG, the predicted DC coefficient is the DC coefficient in the previous block. Thus, if the previous DC coefficient was 15, the coded value for the example given above will be:

$$ 20 - 15 = 5 \qquad (2.12) $$
AC coefficients are coded using run-level coding, where each nonzero coefficient is coded using a value for the intensity and a value giving the number of zero coefficients preceding the coefficient. With the above example, the coefficients are represented as shown in Table 2.1. Variable-length coding techniques are used to encode the DC and AC coefficients. The coding scheme is arranged such that the most common values have the shortest codewords.
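The run-level pairing of Table 2.1, together with the DPCM coding of the DC coefficient, can be sketched as follows (the input is the AC part of the sequence in Equation (2.10); names are illustrative):

```python
def run_level(ac):
    """Convert a zigzag-scanned AC sequence into (run, level) pairs, where
    run counts the zeros preceding each nonzero level (cf. Table 2.1)."""
    pairs, run = [], 0
    for coeff in ac:
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs

# AC part of the zigzag sequence in Equation (2.10) (DC = 20 excluded)
ac = [5, 3, 1, 2, 3, 1, 1, 1, 1, 0, 0, 1, 2, 3, 2, 1, 1,
      0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
pairs = run_level(ac)

# DPCM for the DC coefficient: transmit the difference from the previous
# block's DC value, as in Equation (2.12)
previous_dc, current_dc = 15, 20
coded_dc = current_dc - previous_dc
```

The resulting `(run, level)` pairs and the DC difference are then mapped to variable-length codewords, with the most common values receiving the shortest codes.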
2.3.3 Removing Temporal Redundancy

Image coding attempts to remove spatial redundancy. Video coding features an additional redundancy type: temporal redundancy. This occurs because of strong similarities between successive frames. It would be inefficient to transmit a series of JPEG images. Therefore, video coding aims to transmit the differences between two successive frames, thus achieving even higher compression ratios than for image coding. The simplest method of sending the difference between two frames would be to take the difference in pixel intensities. However, this is inefficient when the changes are simply a matter of objects moving around a scene (e.g. a car moving along a road). Here it would be better to describe the translational motion of the object. This is what most video codec standards attempt to do. A number of different frame types are used in video coding. The two most important types are:

- Intra frames (called I frames in MPEG standards): these frames use similar compression methods to JPEG, and do not attempt to remove any temporal redundancy.
- Inter frames (called P frames in MPEG): these frames use the previous frame as a reference.
Intra frames are usually much larger than inter frames, because they do not remove temporal redundancy. However, inter frames rely on previous frames being successfully received to ensure correct reconstruction of the current frame. If a frame is dropped somewhere in the network then all subsequent inter frames will be incorrectly decoded. Intra frames can be sent periodically to correct this. Descriptions of other types of frame are given in Section 2.4.1. Motion compensation is the technique used to remove much of the temporal redundancy in video coding. It is preceded by motion estimation.

2.3.3.1 Motion Estimation

Motion estimation (ME) attempts to estimate translational motion within a video scene. The output is a series of motion vectors (MVs). The aim is to form a prediction for the current frame based on the previous frame and the MVs. The most straightforward and accurate method of determining MVs is to use block matching. This involves comparing pixels in a certain search window with those in the current frame, as shown in Figure 2.7. Typically, the Mean Square Error (MSE) is employed, such that the MV can be found from:

$$ (\hat{d}_1, \hat{d}_2) = \min_{(d_1,d_2)} \left( \frac{1}{256} \sum_{(n_1,n_2) \in b} \left[ s(n_1,n_2,k) - s(n_1+d_1,\, n_2+d_2,\, k-1) \right]^2 \right) \qquad (2.13) $$

where $(\hat{d}_1, \hat{d}_2)$ is the optimum MV, s(n_1, n_2, k) is the pixel intensity at the coordinate (n_1, n_2) in the kth frame, and b is a 16 × 16 block. Thus, for each MB in the current frame, the algorithm finds the MV that gives the minimum MSE compared to an MB in the previous frame.

Figure 2.7 ME is carried out by comparing an MB in the current frame with pixels in the reference frame within a preset search window
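A full-search implementation of Equation (2.13) might look as follows. The synthetic frames simply translate a fixed pattern, so the search recovers the corresponding MV exactly; all names are illustrative, and a real encoder would typically use one of the fast-search strategies discussed below rather than this exhaustive scan:

```python
def mse(cur, ref, bx, by, dx, dy, n=16):
    """Mean squared error of Equation (2.13) for one candidate MV (dx, dy)."""
    total = 0
    for i in range(n):
        for j in range(n):
            diff = cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j]
            total += diff * diff
    return total / (n * n)

def full_search(cur, ref, bx, by, search=7, n=16):
    """Exhaustive block matching over a +/- search window around (bx, by)."""
    h, w = len(ref), len(ref[0])
    best, best_mv = float('inf'), (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the reference frame
            if not (0 <= by + dy and by + dy + n <= h and
                    0 <= bx + dx and bx + dx + n <= w):
                continue
            err = mse(cur, ref, bx, by, dx, dy, n)
            if err < best:
                best, best_mv = err, (dx, dy)
    return best_mv

# Synthetic test: the current frame is the reference pattern shifted right
# by 2 pixels and down by 1, so the best MV points back by (-2, -1)
ref = [[(7 * x + 13 * y) % 256 for x in range(48)] for y in range(48)]
cur = [[(7 * (x - 2) + 13 * (y - 1)) % 256 for x in range(48)] for y in range(48)]
mv = full_search(cur, ref, bx=16, by=16)
```

The cost of the exhaustive scan is evident: (2 × 7 + 1)² = 225 candidate positions, each requiring a 256-pixel error computation, for every single MB.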
Although this technique identifies MVs with reasonable accuracy, the procedure requires many calculations for a whole frame. ME is often the most computationally intensive part of a codec implementation, and has prevented digital video encoders from being incorporated into low-cost devices. Researchers have examined a variety of methods for reducing the computational complexity of ME. However, these usually result in a tradeoff between complexity and accuracy of MV determination. Suboptimal MV selection means that the coding efficiency is reduced, and therefore leads to quality degradation where a fixed bandwidth is specified.

2.3.3.2 Intra/Inter Mode Decision

Not all MBs should be coded as inter MBs, with motion vectors. For example, new objects may be introduced into a scene. In this situation the difference is so large that an intra MB should be encoded. Within an inter frame, MBs are coded as inter or intra MBs, often depending on the MSE value. If the MSE exceeds a certain threshold, the MB is coded as intra, otherwise inter coding is performed. The MSE-based threshold algorithm is simple, but is suboptimal, and can only be used when a limited number of MB modes are available. More sophisticated MB mode-selection algorithms are discussed in Section 2.4.3.

2.3.3.3 Motion Compensation

The basic intention of motion compensation (MC) is to form as accurate a prediction as possible of the current frame from the previous frame. This is achieved using the MVs produced in the ME stage. Each inter MB is coded by sending the MV value plus the prediction error. The prediction error is the difference between the motion-compensated prediction for that MB and the actual MB in the current frame. Thus, the transmitted error MB is:

$$ e(n_1,n_2,k) = s(n_1,n_2,k) - s\left(n_1+\hat{d}_1,\, n_2+\hat{d}_2,\, k-1\right) \qquad (2.14) $$

The prediction error is generally smaller in magnitude than the original, meaning that fewer bits are required to code the error. Therefore, the compression efficiency is increased by using MC.
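The residual computation of Equation (2.14) can be sketched as follows. For a purely translational scene the prediction error is zero everywhere, which is exactly why MC saves bits; names are illustrative:

```python
def prediction_error(cur, ref, bx, by, mv, n=16):
    """Residual MB of Equation (2.14): the current block minus its
    motion-compensated prediction from the reference frame."""
    dx, dy = mv
    return [[cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j]
             for j in range(n)]
            for i in range(n)]

# Synthetic frames: the current frame is the reference pattern shifted
# right by 2 pixels and down by 1, so the MV (-2, -1) predicts exactly
# and the residual block is all zeros
ref = [[(7 * x + 13 * y) % 256 for x in range(48)] for y in range(48)]
cur = [[(7 * (x - 2) + 13 * (y - 1)) % 256 for x in range(48)] for y in range(48)]
err = prediction_error(cur, ref, bx=16, by=16, mv=(-2, -1))
```

In practice the residual is rarely exactly zero, but its small magnitude means it quantizes to few nonzero DCT coefficients, so only the MV plus a near-empty error block needs to be transmitted.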
2.3.4 Basic Video Codec Structure

The video codec shown in Figure 2.8 demonstrates the basic operation of many video codecs. The major components are:

- Transform and Quantizer: perform operations similar to the transform and quantization process described in Section 2.3.2.
- Entropy Coder: takes the data for each frame and maps it to binary codewords. It outputs the final bitstream.
- Encoder Control: can change the MB mode and picture type. It can also vary the coarseness of the quantization and perform rate control. Its precise operation is not standardized.
- Feedback Loop: removes temporal redundancy by using ME and MC.
Figure 2.8 Basic video encoder block diagram
2.4 Advanced Video Compression Techniques

Section 2.3 discussed some of the basic video coding techniques that are common to most of the available video coding standards. This section examines some more advanced video coding techniques, which provide improved compression efficiency, additional functionality, and robustness to communication channel errors. Particular attention is paid to the H.264 video coding standard [8, 9], which is one of the most recently standardized codecs. Subsequent codecs, such as scalable H.264 and Multi-view Video Coding (MVC) [11], use the H.264 codec as a starting point. Note that scalability is discussed in Chapter 3.
2.4.1 Frame Types

Most modern video coding standards are able to code at least three different frame types:

- I frames (intra frames): these do not include any motion-compensated prediction from other frames. They are therefore coded completely independently of other frames. As they do not remove temporal redundancy they are usually much larger in size than other frame types. However, they are required to allow random access functionality, to prevent drift between the encoder and decoder picture buffers, and to limit the propagation of errors caused by packet loss (see Section 2.4.8).
- P frames (inter frames): these include motion-compensated prediction, and therefore remove much of the temporal redundancy in the video signal. As shown in Figure 2.9, P frames generally use a motion-compensated version of the previous frame to predict the current frame. Note that P frames can include intra-coded MBs.
- B frames: the ‘B’ is used to indicate that bi-directional prediction can be used, as shown in Figure 2.10. A motion-compensated prediction for the current frame is formed using information from a previous frame, a future frame, or both. B frames can provide better compression efficiency than P frames. However, because future frames are referenced during encoding and decoding, they inherently incur some delay. Figure 2.11 shows that the frames must be encoded and transmitted in an order that is different from playback. This means that they are not useful in low-delay applications such as videoconferencing. They also require additional memory usage, as more reference frames must be stored.

Figure 2.9 P frames use a motion-compensated version of previous frames to form a prediction of the current frame
H.264 supports a wider range of frame and MB slice types. In fact, H.264 supports five types of slice, which include I-type, P-type, and B-type slices. I-type (intra) slices are the simplest, in which all MBs are coded without referring to other pictures within the video sequence. If previously-coded images are used to predict the current MB it is called a P-type (predictive) slice, and if both previous- and future-coded images are used then it is called a B-type (bi-predictive) slice. Other slices supported by H.264 are the SP-type (Switching P) and the SI-type (Switching I), which are specially-coded slices that enable efficient switching between video streams and random access for video decoders [12].

Figure 2.10 B frames use bi-directional prediction to obtain predictions of the current frame from past and future frames

Figure 2.11 Use of B frames means that frames must be coded and decoded in an order that is different from that of playback

A video decoder may use them to switch between one of
several available encoded streams. For example, the same video material may be encoded at multiple bit rates for transmission across the Internet. A receiving terminal will attempt to decode the highest-bit-rate stream it can receive, but it may need to switch automatically to a lower-bit-rate stream if the data throughput drops.
2.4.2 MC Accuracy

Providing more accurate MC can significantly reduce the magnitude of the prediction error, and therefore fewer bits need to be used to code the transform coefficients. More accuracy can be provided either by allowing finer motion vectors to be used, or by permitting more motion vectors to be used in an MB. The former allows the magnitude of the motion to be described more accurately, while the latter allows for complex motion or for situations where there are objects smaller than an MB. H.264 in particular supports a wider range of spatial accuracy than any of the existing coding standards, as shown in Table 2.2. Amongst earlier standards, only the latest version of MPEG-4 Part 2 (Version 2) [5] can provide quarter-pixel accuracy, while others provide only half-pixel accuracy. H.264 also supports quarter-pixel accuracy. To achieve quarter-pixel accuracy, the luminance prediction values at half-sample positions are obtained by applying a 6-tap filter to the nearest integer-position samples [9]. The luminance prediction values at quarter-sample positions are then obtained by averaging samples at the integer and half-sample positions. An important point to note is that more accurate MC requires more bits to be used to specify motion vectors. However, more accurate MC should reduce the number of bits required to code the quantized transform coefficients. There is clearly a tradeoff between the number of bits added by the motion vectors and the number of bits saved by better MC. This tradeoff depends upon the source sequence characteristics and on the amount of quantization that is used. Methods of finding the best tradeoff are dealt with in Section 2.4.3.
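The half-sample interpolation can be sketched as below, using the 6-tap filter with taps (1, −5, 20, 20, −5, 1) specified for H.264 luma interpolation [9]. The edge handling (simple clamping) and the rounding details are simplified here for illustration, and the function names are not from any standard API:

```python
def half_pel(row, x):
    """Interpolated luma value halfway between samples x and x + 1 of a
    row, using the H.264 6-tap filter (1, -5, 20, 20, -5, 1) [9].
    Frame edges are handled by clamping, a simplification."""
    taps = (1, -5, 20, 20, -5, 1)
    idx = [min(max(x + k, 0), len(row) - 1) for k in range(-2, 4)]
    acc = sum(t * row[i] for t, i in zip(taps, idx))
    return min(max((acc + 16) >> 5, 0), 255)   # normalize by 32, round, clip

def quarter_pel(row, x):
    """Quarter-sample value between x and x + 1/2, obtained by averaging
    the integer-position sample and the adjacent half-sample value."""
    return (row[x] + half_pel(row, x) + 1) >> 1

# On a linear ramp the filter reproduces the ideal midpoint: the
# half-sample between the values 30 and 40 interpolates to 35
ramp = list(range(0, 160, 10))
h = half_pel(ramp, 3)
```

The filter taps sum to 32, so the `>> 5` restores the original scale; the long 6-tap support is what lets the filter approximate ideal band-limited interpolation better than simple bilinear averaging.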
Table 2.2 Comparison of the ME accuracies provided by different video codecs

Standard               MVs per MB   Accuracy of luma Motion Compensation
H.261                  1            Integer pixel
MPEG-1                 1            Integer pixel
MPEG-2/H.262           2            1/2 pixel
H.263                  4            1/2 pixel
MPEG-4                 4            1/4 pixel
H.264/MPEG-4 pt. 10    16           1/4 pixel
2.4.3 MB Mode Selection

Most of the widely-used video coding standards allow MBs to be coded with a variety of modes. For example:

- MPEG-2: INTRA, SKIP, INTER-16×16, INTER-16×8.
- H.263/MPEG-4: INTRA, SKIP, INTER-16×16, INTER-8×8.
- H.264/AVC: INTRA-4×4, INTRA-16×16, SKIP, INTER-16×16, INTER-16×8, INTER-8×16, INTER-8×8; the 8×8 INTER blocks may then be partitioned into 4×4, 8×4, 4×8.
Selection of the best mode is an important part of optimizing the compression efficiency of an encoder implementation. Mode selection has been the subject of a significant amount of research. It is a problem that may be solved using optimization techniques such as Lagrangian Optimization and dynamic programming [13]. The approach currently taken in the H.264 reference software uses Lagrangian Optimization [14]. For mode selection, Lagrangian Optimization may be carried out by minimizing the following Lagrangian cost for each coding unit:

    J_MODE(M, Q, λ_MODE) = D_REC(M, Q) + λ_MODE · R_REC(M, Q)    (2.15)
where R_REC(M, Q) is the rate from compressing the current coding unit with mode M and quantizer value Q, and D_REC(M, Q) is the distortion obtained from compressing the current coding unit using mode M and quantizer Q. The distortion can be found by taking the sum of squared differences:

    SSD = Σ_{(x,y)∈A} |s(x, y, t) − s′(x, y, t)|²    (2.16)

where A is the MB, s contains the original MB pixels, and s′ the reconstructed MB pixels. The remaining parameter, λ_MODE, can be determined experimentally, and has been found to be consistent for a variety of test sequences [15]. For H.263, the following curve fits the experimental results well:

    λ_MODE = 0.85 · Q_H.263²    (2.17)

where Q_H.263 is the quantization parameter.
For H.264, the following curve has been obtained:

    λ_MODE = 0.85 · 2^((Q_H.264 − 12)/3)    (2.18)

where Q_H.264 is the quantization parameter for H.264. For both H.263 and H.264, rate control can be performed by varying the quantization. Once rate control has been performed, the quantization parameter can be used to calculate λ, so that Lagrangian Optimization can be used to find the optimum mode. Results have shown that this kind of scheme can bring considerable benefits in terms of compression efficiency [14]. However, a major disadvantage of this type of optimization is that it involves considerable computational complexity. Many researchers have attempted to find lower-complexity solutions to this problem. Choi et al. describe one such scheme [16].
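The mode decision of (2.15), combined with the λ model of (2.18), can be sketched as follows. The candidate modes and their distortion/rate figures are invented for illustration:

```python
def lagrangian_mode_select(candidates, qp):
    """Pick the mode minimizing J = D + lambda * R (Equation 2.15), using the
    H.264 lambda model of Equation (2.18)."""
    lam = 0.85 * 2 ** ((qp - 12) / 3)
    return min(candidates, key=lambda c: c["D"] + lam * c["R"])

# Hypothetical distortion (SSD) and rate (bits) figures for one MB:
candidates = [
    {"mode": "SKIP",        "D": 900.0, "R": 1},
    {"mode": "INTER-16x16", "D": 400.0, "R": 60},
    {"mode": "INTRA-4x4",   "D": 250.0, "R": 180},
]
best_lo = lagrangian_mode_select(candidates, qp=12)  # fine quantization
best_hi = lagrangian_mode_select(candidates, qp=40)  # coarse quantization
```

The quantizer dependence is visible directly: at a low QP, λ is small and the distortion term dominates, favoring the expensive but accurate mode; at a high QP, λ is large and the cheap SKIP mode wins.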
2.4.4 Integer Transform

Like earlier standards, H.264 applies a transform to the prediction residual. However, it does not apply the conventional floating-point 8×8 DCT. Instead, a separable integer transform is applied to 4×4 blocks of the picture [17]. Because it is specified exactly in integer arithmetic, this transform eliminates any mismatch between encoder and decoder in the inverse transform. In addition, its small size helps in reducing blocking and ringing artifacts. For an MB coded in intra-16×16 mode, a similar 4×4 transform is applied to the 4×4 DC coefficients of the luminance signal. This cascading of block transforms is equivalent to an extension of the length of the transform functions. If X is the original image block, then the 4×4 DCT of X can be found as:

             | a  a  a  a |       | a  b  a  c |
    A X Aᵀ = | b  c -c -b | · X · | a  c -a -b |    (2.19)
             | a -a -a  a |       | a -c -a  b |
             | c -b  b -c |       | a -b  a -c |

where:

    a = 1/2
    b = √(1/2) · cos(π/8)
    c = √(1/2) · cos(3π/8)

The integer transform of H.264 is similar to the DCT, except that some of the coefficients are rounded and scaled, providing a new transform:

             | 1  1  1  1 |       | 1  2  1  1 |
    C X Cᵀ = | 2  1 -1 -2 | · X · | 1  1 -1 -2 | ⊗ E    (2.20)
             | 1 -1 -1  1 |       | 1 -1 -1  2 |
             | 1 -2  2 -1 |       | 1 -2  1 -1 |
where E is a scaling factor. Note that the transform can be implemented without any multiplications, as it only requires additions and bit shifting.
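The core of (2.20) can be sketched as follows; plain integer matrix multiplication is used for clarity, although (as noted above) the same result can be obtained with additions and bit shifts alone, and the scaling E is folded into quantization and omitted here:

```python
def forward_core_transform(X):
    """4x4 integer core transform of H.264: Y = Cf . X . Cf^T,
    using integer arithmetic only (scaling matrix E omitted)."""
    Cf = [[1,  1,  1,  1],
          [2,  1, -1, -2],
          [1, -1, -1,  1],
          [1, -2,  2, -1]]

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]

    CfT = [list(col) for col in zip(*Cf)]   # transpose of Cf
    return matmul(matmul(Cf, X), CfT)

# A flat (constant) residual block concentrates all energy in the DC term:
Y = forward_core_transform([[1] * 4 for _ in range(4)])
```

For the all-ones block, only the top-left (DC) coefficient is non-zero, mirroring the energy-compaction behavior of the DCT it approximates.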
Figure 2.12 Use of surrounding pixel values with intra prediction
2.4.5 Intra Prediction

In many video signals, adjacent MBs tend to have similar properties. These spatial redundancies can be reduced using intra prediction of MBs. For a given MB, its predicted representation is calculated from already-encoded neighboring MBs. The difference between the predicted MB and the actual MB is transformed and quantized. The difference will usually be of smaller magnitude than the original block, and therefore will need fewer bits to code. In some previous standards, prediction was performed in the transform domain, while in H.264 it is done in the spatial domain, by referring to neighboring samples [9]. H.264 provides two classes of intra coding, INTRA-4×4 and INTRA-16×16, based on the size of the sub-blocks over which the prediction is estimated. When using the INTRA-4×4 mode, each 4×4 block of the luminance component utilizes one of nine available prediction options. Besides DC prediction (mode 2), eight directional prediction modes are specified, labeled 0, 1, 3, 4, 5, 6, 7, and 8 in Figure 2.12. As shown in Figure 2.13, samples a to p are predicted from samples A to M, and this prediction depends upon the selected mode. INTRA-16×16 mode is more suitable for smooth image areas, as it performs a uniform prediction for all of the luminance components of an MB. The chrominance samples of an MB are also predicted, using a similar prediction technique to that for the luminance component.
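Two of the nine INTRA-4×4 options can be sketched as follows: mode 0 (vertical) and mode 2 (DC). The function names and neighbor values are illustrative; the remaining directional modes follow the same pattern with different sample geometry:

```python
def intra4x4_vertical(top):
    """Mode 0 (vertical): each column is predicted from the reconstructed
    sample directly above the block."""
    return [list(top) for _ in range(4)]

def intra4x4_dc(top, left):
    """Mode 2 (DC): every sample is predicted as the rounded mean of the
    reconstructed neighbors above and to the left."""
    dc = (sum(top) + sum(left) + 4) >> 3   # rounded mean of 8 samples
    return [[dc] * 4 for _ in range(4)]

top = [100, 102, 104, 106]    # reconstructed samples above the 4x4 block
left = [98, 99, 101, 103]     # reconstructed samples to its left
pred_v = intra4x4_vertical(top)
pred_dc = intra4x4_dc(top, left)
```

The encoder would evaluate all applicable modes, keep the one whose prediction leaves the smallest residual, and code only that residual plus the mode index.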
Figure 2.13 Intra prediction uses a number of different 'directions' to determine how to calculate the predicted intra block

2.4.6 Deblocking Filters

One particular characteristic of block-based coding is the appearance of block structures in decoded images. This feature is more prominent at low bit rates. Deblocking filters can be deployed to improve the perceived video quality. These filters can either be implemented inside the coding loop, or as a post-processing step at the decoder. Table 2.3 summarizes the advantages and disadvantages of each approach. An adaptive filter is applied to the horizontal and vertical edges of the MB within the prediction loop of H.264 [18]. The filter automatically varies its own strength, depending on the quantization parameter and the MB modes. This is to prevent filtering on high-quality video frames, where the filtering would degrade rather than improve quality. The filtered MB is also used for motion-compensated prediction of further frames in the encoder, resulting in a smaller residual after prediction, and hence better compression efficiency.
Table 2.3 Comparison of the use of post-filter and in-loop-filter deblocking algorithms

    Post filter
        Advantages:    Freedom for decoder implementers; allows for future
                       innovation; improvement in visual quality
        Disadvantages: Requires an extra frame buffer (additional memory);
                       not always ideal for the encoded video characteristics;
                       higher complexity in the decoder

    In-loop filter
        Advantages:    Guarantees a level of quality (standardized);
                       improvement in visual quality, optimized for the
                       encoded content; improved compression efficiency
        Disadvantages: Higher complexity in both encoder and decoder
2.4.7 Multiple Reference Frames and Hierarchical Coding

Long-term-memory motion-compensated prediction involves the possible use of multiple reference frames to provide a better prediction of the current frame [19]. H.264 allows the use of up to 16 reference frames for prediction of the current frame (see Figure 2.14). This often results in better video quality and more efficient coding of video frames. The decoder replicates the multi-picture buffer of the encoder, according to the reference picture buffering type and any memory-management control operations specified in the bitstream. However, the technique also introduces additional processing delay and higher memory requirements at both the encoder and the decoder end, as both have to store the pictures that will be used for inter-picture prediction in a multi-picture buffer. Using multiple pictures as references also helps error resilience, since the decoder can use an alternative reference frame when errors are detected in the current one [20]. In scalable H.264, the multiple reference frame capability is exploited to provide temporal scalability (see Chapter 3).
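The reference selection idea can be sketched as follows. This toy example compares only co-located blocks (no motion search) and uses SAD as the matching cost; all names and values are illustrative:

```python
def best_reference(block, refs, pos):
    """Choose, among the stored reference frames, the one whose co-located
    block gives the smallest SAD for the current block."""
    def sad(a, b):
        return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

    x, y, n = pos
    costs = []
    for idx, ref in enumerate(refs):
        cand = [row[x:x + n] for row in ref[y:y + n]]
        costs.append((sad(block, cand), idx))
    return min(costs)[1]   # index of the best reference frame

block = [[50, 50], [50, 50]]     # current 2x2 block
refs = [
    [[0, 0], [0, 0]],            # frame t-1: poor match (e.g. occlusion)
    [[50, 50], [50, 49]],        # frame t-2: near-perfect match
]
ref_idx = best_reference(block, refs, pos=(0, 0, 2))
```

This is why multiple references help: an object briefly occluded in the most recent frame may still be visible, and cheap to predict, in an older one.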
2.4.8 Error-Robust Video Coding

Video is highly susceptible to channel errors. This is due to a number of factors:

- High compression ratios.
- Variable-Length Coding (VLC).
- Propagation of errors in inter frames.
(i) High Compression Ratios
Video tends to feature relatively high compression ratios. This means that each bit in the encoded bitstream represents a significant amount of picture information. Therefore, no matter how the bitstream is encoded, the loss of even a few bits may result in significant visual impairment.

(ii) Variable-Length Coding
When using VLC it is essential that all codewords are decoded correctly. Incorrect decoding will result in one of a number of possible outcomes:
Figure 2.14 Standards such as H.264 allow the use of multiple reference frames
Figure 2.15 Synchronization loss with VLC: when bit 2 of symbol X (10110) is corrupted, the decoder sees symbol Y (11) rather than symbol X, and loses synchronization with the bitstream
- An incorrect symbol is read, of the same length as the correct symbol.
- An incorrect symbol is read, of a different length to the correct symbol (see Figure 2.15).
- An invalid symbol is read that does not match any in the coding tables.
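These outcomes can be demonstrated with a toy prefix (variable-length) code; the codebook is invented for illustration:

```python
def vlc_decode(bits, codebook):
    """Decode a prefix-free bitstring symbol by symbol; raises ValueError
    when the bitstream ends inside a codeword (an invalid/truncated symbol)."""
    inverse = {code: sym for sym, code in codebook.items()}
    symbols, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            symbols.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("invalid or truncated codeword: " + buf)
    return symbols

codebook = {"a": "0", "b": "10", "c": "110", "d": "111"}
clean = vlc_decode("110010", codebook)      # 110|0|10  -> c, a, b
corrupt = vlc_decode("100010", codebook)    # one flipped bit: 10|0|0|10
```

A single flipped bit turns three symbols into four entirely different ones without triggering any error, which is exactly the silent loss of synchronization described above.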
The first scenario will result in the least bit loss. Only one codeword is corrupted; all subsequent codewords are correctly decoded. However, the error is not detected, which means that concealment will not take place. The second scenario causes serious problems. All subsequent codewords will be incorrectly decoded, as the decoder will have lost synchronization with the bitstream. Again, the error is not detected, so the corrupted information will be displayed, which is often visually very obvious. The third scenario indicates the presence of an error to the decoder. However, it is impossible to determine whether synchronization has been lost or not, so all subsequent codewords must be discarded. This is the only scenario in which errors are detected. In short, incorrect decoding may not be detected at all, or will be detected some time after the corrupted codeword. This is a significant problem with video, as the display of corrupted data severely degrades perceptual quality.

(iii) Propagation of Errors
Inter frames are predicted from the previous frame. Thus, when a frame is corrupted, the decoder uses a corrupted frame for the prediction of subsequent frames. Errors in one frame propagate to later frames, and through the use of motion vectors in subsequent frames the size of the error region will grow (Figure 2.16).

To reduce the effect of channel errors on decoded video quality, most of the commonly-used codecs have error-resilience features. The following discussion centers on those features included in the H.264 standard. H.264 defines three profiles: baseline, main, and extended. The baseline profile is the simplest; it targets mobile applications with limited processing resources. It supports three error-resilience tools, namely flexible macroblock ordering (FMO), arbitrary slice order, and redundant slices (RSs). The use of parameter sets and the formation of slices are also considered error-resilience tools [21].
Figure 2.16 Propagation of errors in inter frames: an error (plus residual) in frame n is carried into frame n+1, where motion vectors cause it to propagate to other MBs
The main profile is intended for digital television broadcasting and next-generation DVD applications, and does not stress error-resilience techniques. However, the extended profile targets streaming video, and includes features to improve error resilience by using data partitioning, and to facilitate switching between different bitstreams by using SP/SI slices. The extended profile also supports all the error-resilience tools of the baseline profile.

2.4.8.1 Slices
H.264 supports picture segmentation in the form of slices. A slice consists of an integer number of MBs of one picture, ranging from a single MB per slice to all MBs of a picture per slice. The segmentation of a picture into slices facilitates the adaptation of the coded slice size to different MTU (maximum transmission unit) sizes and helps to implement schemes such as interleaved packetization [22]. Each slice carries all the information required to decode the MBs it contains.

2.4.8.2 Parameter Sets
The parameter set is a mandatory encoding tool, whose intelligent use greatly enhances error resilience. A parameter set consists of all information related to all the slices of a picture. A number of predefined parameter sets are available at the encoder and the decoder. The encoder chooses a suitable parameter set by referencing its storage location, and the same reference number is included in the slice header of each coded slice to inform the decoder. In this way, in an error-prone environment it is ensured that the most important parameter information arrives with minimum errors [21].

2.4.8.3 Flexible Macroblock Ordering
FMO is a powerful error-resilience tool that allows MBs to be assigned to slices in an order other than the scan order [23, 24]. A Macroblock Allocation map (MBAmap) is used to statically assign each MB to a slice group. All MBs within a slice group are coded using the normal scan order. A maximum of eight slice groups can be used.
FMO consists of seven different types, named type 0 to type 6. Each type except type 6 contains a certain pattern, as
Figure 2.17 Different types of FMO in H.264. (a) type 0, (b) type 1, (c) type 2, (d) type 3, (e) type 4, (f) type 5
shown in Figure 2.17. Slice group type 0 divides the frame into a number of slice groups of different sizes. As the number of slice groups is increased, the number of differently-grouped MBs surrounding each MB increases. For example, type 1 spreads the MBs across slice groups in such a way that each MB is surrounded by MBs from different slice groups. Slice type 2 is used to
highlight the regions of interest inside the frame. Slice types 3–5 allow the formation of dynamic slice groups in a cyclic order. If a slice group is lost during transmission, reconstruction of the missing blocks is easier with the help of the information in the surrounding MBs. However, FMO reduces the coding efficiency and carries a high overhead in the form of the MBAmap, which has to be transmitted.

2.4.8.4 Redundant Slices
RSs have been introduced to support highly error-prone wireless environments [25]. One or more redundant copies of the MBs, in addition to the coded MBs, are included in a single bitstream. The redundant representations can be coded using different coding parameters: the primary representation is coded at high quality, while the RS representations can be coded at a lower quality, using fewer bits. If no error occurs during transmission, the decoder reconstructs only the primary slice and discards all redundant slices. However, if the primary slice is lost during transmission, the decoder uses the information from the redundant slices, if they are available.

2.4.8.5 H.264 Data Partitioning
Data partitioning separates the coded slice data into more than one partition. In contrast to MPEG-4, separate bit strings are created without the need for boundary markers. Each partition has its own header and is transmitted independently according to its assigned priority. H.264 supports three different types of partition:

- Type A partition: the most important information in a video slice, including header data and motion vectors. Without this information, symbols of the other partitions cannot be decoded.
- Type B partition: comprises the INTRA coefficients of a slice. It carries less importance than the type A partition.
- Type C partition: only INTER coefficients and related data are included in the type C partition. This is the least important and the largest partition.
Data partitioning can be used in Unequal Error Protection (UEP) schemes. For example, the type A partition can be protected more strongly than the other two types. Moreover, even if the INTER or INTRA partition is not available, the decoder can still use the header and motion information to improve the efficiency of error concealment [26, 27].
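A sketch of how slice syntax elements might be routed into the three partitions before UEP is applied; the element names and payloads are hypothetical:

```python
def partition_slice(elements):
    """Split coded slice data into H.264-style partitions: headers and motion
    vectors go to A, intra coefficients to B, inter coefficients to C."""
    kinds = {"header": "A", "mv": "A", "intra_coef": "B", "inter_coef": "C"}
    parts = {"A": [], "B": [], "C": []}
    for kind, payload in elements:
        parts[kinds[kind]].append(payload)
    return parts

# Hypothetical syntax elements of one slice; a UEP scheme would then apply
# the strongest channel protection to partition A and the weakest to C:
slice_data = [("header", "sh"), ("mv", "mv0"), ("intra_coef", "i0"),
              ("inter_coef", "p0"), ("mv", "mv1"), ("inter_coef", "p1")]
parts = partition_slice(slice_data)
```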
2.5 Video Codec Standards

2.5.1 Standardization Bodies
Standardization of video codecs is critical in ensuring their widespread adoption. Currently, video codec standards are published by two separate bodies:

- International Telecommunication Union (ITU): as its name suggests, the ITU produces standards relevant for use in communications. It has been responsible for publishing the H.26x series of video coding standards. Note that audio, speech, and multiplexing standards are published separately by the ITU.
- International Organization for Standardization (ISO): the ISO publishes the MPEG (Motion Picture Experts Group) series of standards. These standards generally contain a video coding part, an audio coding part, and a systems part. The systems part provides a specification of how to multiplex audio-visual streams.
Experts from the two standardization bodies have cooperated under the umbrella of the Joint Video Team (JVT). This cooperation has led to joint standardization of video codecs, where the same standardized codecs are published by both the ISO and the ITU, but under different names (e.g. MPEG-4 AVC is the same as H.264).
2.5.2 ITU Standards
The ITU standardized the first digital video codec, H.120, in 1984 [28]. The first version of the codec featured some very basic compression techniques, such as conditional replenishment, differential pulse code modulation (DPCM), scalar quantization, and variable-length coding. An updated version, produced in 1988, added MC. The codec operates at 2048 kbps for PAL (phase-alternating line). Very few implementations of the codec were ever produced, and it is now considered effectively obsolete.

H.261 [29] was standardized in 1990, and is widely considered to be the codec on which most modern video coding standards are based. It was the first video codec to include the DCT, and also includes features such as 16×16-pixel macroblock integer-pixel motion compensation, zigzag scan, run-length coding, and variable-length coding. The codec operated at rates between 64 kbps and 2048 kbps. It was highly successful, and saw use in applications such as videoconferencing over ISDN lines.

H.262 [30] was a joint standardization effort with MPEG, and is the same as MPEG-2 [7], which is described in Section 2.5.3.

H.263 [1] was designed as a successor to H.261. Its core features were capable of providing better compression efficiency than H.261. Some of the key enhancements were half-pel MC, motion vector prediction, improved variable-length coding of DCT coefficients, and a more efficient syntax for signaling coding modes. The standard also featured a number of annexes providing optional enhancements to the core standard, including B-frame coding, advanced motion prediction (use of four motion vectors per MB), and arithmetic coding.

Two further versions of H.263 were produced, known as H.263+ and H.263++. H.263+ included a number of valuable coding tools, such as intra prediction, an in-loop deblocking filter, improved B-frame coding, and reference picture selection. It also introduced temporal, spatial, and SNR scalability to the codec. H.263++ introduced some further compression-efficiency enhancements, and a number of error-resilience tools.
2.5.3 MPEG Standards
The first two MPEG standards were designed for specific applications. MPEG-1 [31] was intended for storing VHS-quality video on media such as CD-ROM. It used the H.261 standard as its basis, and added a number of coding efficiency enhancements, such as half-pel MC, bidirectional motion prediction, and weighted DCT quantization matrices.
MPEG-2 [7] was aimed at providing a solution for digital broadcast, and was also employed to store movies on DVD. Notably, it included the MPEG-2 audio layer 3 specification, which is currently widely used to distribute audio over the Internet (MP3). The visual coding part is similar to MPEG-1, but contains a number of additional features, such as support for interlaced coding, increased DC quantization precision, scalability, and up to two motion vectors per MB instead of one.

A third standard, MPEG-3, was planned, which would have been optimized for high-definition television (HDTV) purposes. However, it was found that MPEG-2 was capable of catering for HDTV, and plans for the proposed standard were scrapped.

MPEG-4 was somewhat different from previous MPEG standards, in that it was not aimed at any particular application [5]. Instead, it was intended to serve as a 'complete multimedia standard'. The MPEG-4 specification encompassed a wide range of features and technologies, and its designers kept in mind some of the requirements for delivery over mobile networks. Version 1 of MPEG-4 officially became a standard in October 1998, while version 2 was standardized in 2000. Version 1 contained some of the most significant features, while version 2 was a backwards-compatible extension of the first version. Further extensions to the standard were carried out following standardization of the second version of MPEG-4. Most notably, Part 10, called Advanced Video Coding (AVC), was added. This codec is discussed in Section 2.5.4.

One of the most interesting features of MPEG-4 is its ability to describe a video scene as a number of audio-visual objects. Shape information can be coded alongside motion and texture information, which allows the decoder to reconstruct arbitrarily-shaped objects (Figure 2.18). Shape coding raises the prospect of interactive video.
It is possible for the decoder to be implemented so that it allows manipulation of objects at the user's command. Objects can theoretically be moved, resized, or even switched off. How widely the object-based features come into use will probably depend upon the ability of the encoder to provide automatic segmentation of images into separate objects. Although progress is being made in this field, there are still problems in performing reliable and accurate segmentation. One issue with MPEG-4 Part 2 is that it offers very little gain in compression efficiency compared to H.263+. This is one of the reasons why it has not been as widely adopted as Part 10 of the standard, which is described in Section 2.5.4.
Figure 2.18 MPEG-4 allows a video scene to be encoded as a number of different objects
2.5.4 H.264/MPEG-4 AVC
The main goal of H.264's design was to achieve low bit rates for a variety of applications. It can achieve up to 50% reduction in bit rate compared to H.263+ or MPEG-4 Part 2, version 1. It has been designed to operate in a low-delay mode to suit real-time communication applications such as videoconferencing, at the cost of reduced gain in motion prediction/estimation; it introduces higher processing delay in applications with no delay constraints (e.g. video storage, server-based video streaming applications). It provides tools to deal with packet loss in packet networks and bit errors in error-prone wireless networks. Many of its key features, such as intra prediction, multiple reference frames, the integer transform, and the in-loop deblocking filter, are described in Section 2.4.
2.6 Assessment of Video Quality

Video quality can be assessed using two types of technique: subjective assessment and objective assessment. Subjective tests are performed by asking viewers to give their opinion on the quality of videos that they are shown. Objective tests use a mathematical algorithm to produce a numerical quality rating. Clearly, subjective test results are much more difficult to obtain, as they require a significant amount of resources in terms of people and time. However, they are capable of providing a better indication of quality than any of the objective measures currently in use. This is particularly true for 3D video quality assessment. This section describes methods for subjective and objective quality assessment.
2.6.1 Subjective Performance Evaluation
Any subjective test for video should ideally respect the ITU-R BT.500-11 ('Methodology for the Subjective Assessment for the Quality of Television Pictures') standard [32]. Several subjective quality-assessment methods are described in the standard. It is essential to select the methods that best suit the objectives and circumstances of the assessment problem at hand. There are many criteria to be fulfilled in order to ensure that the evaluation tests are free from user- and environment-oriented inaccuracies. These include viewing conditions, monitor-related issues, source signal quality and characteristics, and test session issues. Two of the main assessment techniques described within the standard are based either on impairment rating (with respect to original sources) or on continuous quality rating (to compare the performances of two different systems). Depending on the requirements of the particular evaluation environment, the reference (undistorted) signal and the processed signal (processed in the system under evaluation) might be displayed simultaneously or one after the other to measure the reaction of the viewer. The ratings provided by the viewers are processed, and some which are found to be too biased may be neglected and not added into the calculations. Prior to finding the subjective assessment results, the confidence interval is calculated, which represents the extent to which the calculated results would remain valid for a very large number of subjects. Figure 2.19 shows an example of some subjective test results, where two sequences were shown to viewers side by side. The viewers were asked to give a Differential Mean Opinion Score (DMOS), indicating which of the two videos they preferred and by how much.
Figure 2.19 Example subjective test results
2.6.2 Objective Performance Evaluation
Since subjective video quality assessment is a time-consuming process, it is important to establish fast and reliable objective measures. In image and video coding research communities the most widespread measure used for the evaluation of end results is peak signal-to-noise ratio (PSNR). Since PSNR is the most popular objective measure, it has become the most common tool for comparison of different algorithms across different publications. However, it is not fully reliable, because of its overall weak correlation with perceptual quality; therefore any comparisons and evaluations that use PSNR have to be carefully performed. In this section guidelines on the use of PSNR are provided.

PSNR measures the distortion of a given signal, usually with respect to the original, uncompressed content. It therefore evaluates the closeness between two signals by exploiting the differences of sample values. The main component of PSNR calculation is measurement of the mean square error (MSE) for each frame, f, of the video sequence:

    MSE_f = (1/N) · Σ_{n=0}^{N−1} (x_f(n) − x̂_f(n))²    (2.21)
where x_f and x̂_f are pixels of the original frame and of the distorted frame, respectively, and each frame consists of N pixels. PSNR of a frame, f, is then defined as:

    PSNR_f = 10 log₁₀( (2ᵏ − 1)² / MSE_f ) = 10 log₁₀( 255² / MSE_f )    (2.22)
where k is the number of bits per pixel for the given frame (k = 8 for commonly-used test material). PSNR for the whole sequence can be evaluated in two ways: as the mean value of the PSNRs of the individual frames:

    PSNR = (1/F) · Σ_{f=0}^{F−1} PSNR_f    (2.23)
or from the mean MSE:

    PSNR = 10 log₁₀( (2ᵏ − 1)² / ( (1/F) · Σ_{f=0}^{F−1} MSE_f ) )    (2.24)
Although both ways of measuring average PSNR are common in the literature, within this project the average PSNR of a video sequence will be measured as in (2.23); that is, as the mean value of the PSNRs of the individual frames. As common video sequences have more than one component, for example in YUV color space, the PSNR for each component can be evaluated separately. Moreover, average PSNR can also be measured over the different color components, as:

    PSNR = (PSNR_Y + 0.25 · PSNR_U + 0.25 · PSNR_V) / 1.5    (2.25)
where weighting factors of 0.25 for the chrominance components are used because of the 4:2:0 sampling in this example. This way of computing average PSNR cannot always provide reliable comparison of conceptually different methods. For example, for video coders with different color quantizations, or where different rate-distortion optimization techniques are applied between color components, variations in the PSNR values of the different components can affect the overall measure. Therefore this measure should be used only for very specific purposes, and only for evaluation of small modifications within a closed system. Within the project, separate evaluation of color components will be used. Since luminance (Y) is the most important component for subjective perception, it is acceptable to provide evaluation results for this component only.

The example in Figure 2.20 shows a common presentation of PSNR results for various methods evaluated at different bit-rate points. Instead of this type of evaluation of overall performance, a frame-based evaluation could be used; an example of how the PSNR values of individual frames can be presented is given in Figure 2.21.

For different videos and different test conditions, the same PSNR values can be associated with very different subjective quality ratings. For example, a PSNR of 32 dB can provide satisfactory subjective video quality for some videos, whilst for others it can be unacceptably distorted. However, from practice some guidelines can be suggested on how to interpret PSNR values. In general, PSNR values below 20 dB are associated with unrecognizable videos (very high distortion), and below 25 dB it is not advisable to perform comparisons. The range of PSNRs in which tests are usually set is 30–40 dB, where visual quality is acceptable (more acceptable towards the higher values). Above 45 dB the distortion is unnoticeable.
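The per-frame and sequence PSNR of Equations (2.21)–(2.23) can be computed as follows; frames are represented as flattened sample lists for simplicity:

```python
import math

def psnr_frame(orig, dist, k=8):
    """Per-frame PSNR following Equations (2.21)-(2.22)."""
    n = len(orig)
    mse = sum((a - b) ** 2 for a, b in zip(orig, dist)) / n   # Equation (2.21)
    peak = (2 ** k - 1) ** 2
    return 10 * math.log10(peak / mse) if mse else float("inf")

def psnr_sequence(orig_frames, dist_frames):
    """Sequence PSNR as the mean of per-frame PSNRs, i.e. Equation (2.23)."""
    vals = [psnr_frame(o, d) for o, d in zip(orig_frames, dist_frames)]
    return sum(vals) / len(vals)

orig = [[100, 110, 120, 130]] * 2
dist = [[101, 109, 121, 129]] * 2   # every sample off by one -> MSE = 1
avg = psnr_sequence(orig, dist)      # 20*log10(255), about 48.13 dB
```

With MSE = 1 the result is the 8-bit ceiling-minus-nothing case, 20·log10(255) ≈ 48.13 dB, which sits in the "distortion unnoticeable" band described above.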
In a lossless channel, rate-distortion (RD) curves are used to measure performance. An RD curve represents the PSNR results, calculated as described above, for various methods
Figure 2.20 Example of presentation of averaged PSNR results for various methods over different adaptation points
evaluated in different bit-rate points. The RD curve of the developed scalable MDC is compared with the original scalable video coding (SVC). In a lossy channel, the results are obtained as average PSNR values versus the channel loss in terms of decibels (dBs). The average PSNR is calculated between the corrupted-decoded and the original video. This simulation is repeated several times, and the starting point of the error pattern is changed for each one to ensure overall coverage of the long error pattern file. Next, the average PSNR over the number of simulations is calculated for that particular performance evaluation test using this error pattern. The simulation is then repeated for the next loss pattern. As a result, a low average PSNR is expected at high error rates, and a high average PSNR is expected at low error rates, as shown in Figure 2.22.
Figure 2.21 Example of presentation of PSNR results for individual frames and for various methods
Video Coding Principles
Figure 2.22 Average PSNR versus error rate
For object-based video coding, the performance of each part of the codec can be evaluated separately. For motion estimation (ME) algorithms and object segmentation, ground-truth data has to be generated. This means that foreground masks for each frame of the test sequence are used to compute the frame-by-frame PSNR, in order to evaluate the accuracy of the background motion estimation. Next, given the higher-order motion parameters for each frame pair, an object segmentation algorithm is applied, and the foreground mask is used to measure the performance of the automatic object segmentation method. Finally, all the separated parts of the test scene are coded and merged at the decoder, and the PSNR is measured as described above.
The Structural SIMilarity (SSIM) index is an alternative objective video-quality metric. It assumes that the human vision system is better at extracting structural information from the video signal than at extracting errors. The SSIM is given by:
SSIM = \frac{(2\bar{x}\bar{y} + C_1)(2 s_{xy} + C_2)}{(\bar{x}^2 + \bar{y}^2 + C_1)(s_x^2 + s_y^2 + C_2)}    (2.26)

where \bar{x} is the mean of x, \bar{y} is the mean of y, s_x^2 is the variance of x, s_y^2 is the variance of y, s_{xy} is the covariance of x and y, and C_1 and C_2 are small constants that stabilize the division. SSIM is a decimal value between 0 and 1, where 1 represents an exact match with the original video signal.
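A direct, single-window implementation of Equation (2.26) can be sketched as follows. The values of the stabilizing constants are an assumption here (the common choice C1 = (0.01·255)^2, C2 = (0.03·255)^2 for 8-bit video, which the text does not specify), and practical SSIM implementations additionally average the statistic over local windows rather than computing it globally:

```python
import numpy as np

# Assumed stabilizing constants for 8-bit video (not given in the text).
C1 = (0.01 * 255.0) ** 2
C2 = (0.03 * 255.0) ** 2

def ssim(x, y):
    """Single-window SSIM between images x and y, per Equation (2.26)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()  # covariance of x and y
    num = (2.0 * mx * my + C1) * (2.0 * cov + C2)
    den = (mx * mx + my * my + C1) * (vx + vy + C2)
    return num / den
```

For identical inputs the numerator and denominator coincide, giving SSIM = 1; any luminance, contrast, or structure mismatch pulls the value below 1.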
2.7 Conclusions This chapter has introduced block-based video compression techniques currently used in many video coding standards. It described how the codecs use knowledge about the human vision system to achieve good compression efficiency. The human vision system is more sensitive to luminance than chrominance information, and is also more sensitive to low-frequency information than high-frequency information. The compression techniques therefore operate in the frequency domain, to ensure that psycho-visual effects can be taken into account. The discrete cosine transform (DCT) is commonly used in coding standards. DCT coefficients are quantized and run-level coded to achieve compression. In addition, similarities between successive video frames are exploited by using a technique called motion compensation (MC) to form a prediction of the current frame from previous frames in the video sequence. The chapter also discussed some more advanced video coding techniques that are used in standards such as H.264 to achieve compression and functionality superior to earlier standard
codecs. These include an alternative transform to the DCT, more MB modes to enable a more accurate representation of the motion within the video sequence, and an adaptive deblocking filter. A brief comparison was made between the different codecs standardized by the ITU and the ISO (MPEG). Finally, different methods of assessing video quality were discussed. Subjective evaluation is capable of providing the most accurate assessment of video quality, provided it is carefully handled. However, it is expensive in terms of time and requires a significant number of people as viewers. This is why objective measurements are more widely used in research.
3 Scalable Video Coding

3.1 Introduction

The MPEG-x and H.26x video coding standards adopt a hybrid coding approach which employs block-matching algorithm (BMA) motion compensation and the discrete cosine transform (DCT). The reasons are that (a) a significant proportion of the motion trajectories found in natural video can be approximately described with a rigid translational motion model; (b) fewer bits are required to describe simple translational motion; and (c) implementation is relatively straightforward and amenable to hardware solutions. Hybrid video systems have provided interoperability in heterogeneous network systems. Considering that transmission bandwidth is still a valuable commodity, ongoing developments in video coding seek scalability solutions to achieve a one-coding, multiple-decoding feature. To this end, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) is standardizing a scalability extension to the existing H.264/AVC codec. The H.264-based scalable video coding (SVC) allows partial transmission and decoding of the bitstream, resulting in various options for picture quality and spatio-temporal resolution.
In this chapter, several advanced features and techniques related to scalable video coding are described, mostly in the context of 3D video applications. In Section 3.1.1, applications and scenarios for scalable coding systems are described. The advances of scalable video coding in 3D video applications are discussed in Section 3.3. Subsection 3.3.1 discusses a nonstandardized scalable 2D-model-based video coding scheme applied to the texture and depth coding of 3D video. The adaptation of scalable video coding for stereoscopic 3D video applications is elaborated on in Subsection 3.3.2.
Although the scalable extension of H.264/AVC was selected as the starting point in scalable video coding standardization, there are many contributions from the wavelet research community to scalable video coding. Some of these are described in Subsection 3.3.2. Section 3.4 elaborates on the proposed error-robustness techniques for scalable video and image coding. Subsection 3.4.1 advances the state-of-the-art SVC, using correlated frames to increase error robustness in error-prone networks. A scalable multiple description coding (MDC) application for stereoscopic 3D video is investigated in Subsection 3.4.2. Subsection 3.4.3 elaborates on advances in wireless JPEG 2000.
3.1.1 Applications and Scenarios

This section describes the applications and scenarios that are targeted by the Visnet research on scalable video coding. The algorithm developments and test scenarios are guided by the limitations of these applications and scenarios. The main targeted application for this activity is TRT-UK's Virtual Desk (see Figure 3.1), which is one of the integration activities within Visnet II. The Virtual Desk is a collaborative working environment (CWE) which can connect multiple users in a single session. A key aspect of the collaboration is audiovisual conferencing. 3D scalable MDC video can provide significant benefits for videoconferencing in terms of:

- 3D: a more immersive communication experience.
- Scalability: adaptability to different terminal types (examples of terminals are shown in Figure 3.1).
- MDC: improved robustness to packet losses.
This application places some constraints on algorithm developments and the scenarios used to test them. These constraints are considered throughout the research and are elaborated below.

(i) Source Sequences
The most appropriate source sequences for audiovisual conferencing applications are "head and shoulders" test sequences, such as Paris, Suzie, and so on. However, there are no standard "head and shoulders" sequences available with depth maps. During the investigations, a variety of different video sequences is used in the experiments, in order to demonstrate the applicability of the proposed techniques to a variety of sources. The sources used are described in Sections 3.2–3.6, when discussing the performance of the proposed algorithms.

(ii) Available Bandwidth
The terminals may be connected by DSL (1–8 Mbps), a university intranet (100 Mbps), or a wireless network (WLAN: 54 Mbps; UMTS: 384 kbps). Therefore, video tests are run for bit rates greater than 1 Mbps and for bit rates less than 200 kbps. CIF resolution sequences (352 × 288) are used during the initial experiments. However, VGA (640 × 480) and 4CIF (704 × 576) will be tested as the algorithms are finalized.
Figure 3.1 TRT-UK virtual desk
(iii) Channel Losses
Wired network losses are represented in simulations by the JVT's packet-loss simulator. Wireless network simulations are also carried out using the WiMAX simulator developed under the IST SUIT project.

(iv) Low Delay
The video coding algorithms must feature low delay. The proposed schemes do not inherently incur significant delay. However, the use of hierarchical B frames introduces a large amount of delay. Low-delay temporal scalability algorithms with I and P frames can be employed to avoid this.

The algorithms described in this chapter respect the constraints described above.
3.2 Overview of the State of the Art

Modern video transmission and storage systems are typically characterized by a wide range of access network technologies and end-user terminals. Varying numbers of users, each with their own time-varying data throughput requirements, adaptively share network resources, resulting in dynamic connection qualities. Users possess a variety of devices with different capabilities, ranging from cell phones with small screens and restricted processing power to high-end PCs with high-definition displays. Examples of applications include virtual collaboration system scenarios, as shown in Figure 3.2, where a large, high-powered terminal acts as the main control/command point and serves a group of co-located users. This may be the headquarters of an organization and consist of communication terminals, shared desk spaces, displays, and various user-interaction devices for collaborating with remotely-located partners. The remotely-located users with small fixed terminals act as local contacts and provide local information. Mobile units (distribution, surveying, marketing, patrolling, and so on) of the organization may use mobile terminals, such as mobile phones and PDAs, to collaborate with the headquarters.
Figure 3.2 Virtual collaboration system diagram
In order to cope with the heterogeneity of networks/terminals and diverse user preferences, current video applications need to consider not only compression efficiency and quality but also the available bandwidth, memory, computational power, and display resolutions of different terminals. Transcoding methods, or the use of several encoders to generate video streams at different resolutions (spatial, temporal, or quality), can be used to address the heterogeneity problem. However, these methods impose additional constraints, such as unacceptable delays, and increase bandwidth requirements due to redundant data streams.
Scalable video coding is an attractive solution to the issues posed by the heterogeneity of today's video communications. Scalable coding produces a number of hierarchical descriptions that provide flexibility in terms of adaptation to user requirements and network/device characteristics. However, in error-prone environments, the loss of a lower layer in the hierarchical descriptions prevents all higher layers from being decoded, even if the higher layers are correctly received; this means that a significant amount of correctly-received data must be discarded in certain channel conditions. On the other hand, an error-resilient technique, namely multiple description coding (MDC), divides a source into two or more correlated layers. This means that a high-quality reconstruction is available when the received layers are combined and decoded, while a lower-, but still acceptable, quality reconstruction is achievable if only one of the layers is received.
A brief review of scalable video coding techniques is presented in Section 3.2.1, followed by a review of MDC algorithms in Section 3.2.2. The combination of scalable coding and MDC techniques is also reviewed in this section. Combining SVC and MDC improves error robustness, and at the same time provides adaptability to user preferences, bandwidth variations, and receiving-device characteristics.
For more immersive communication experiences, stereoscopic 3D video content is considered as a source for the scalable multiple description coder. Hence, in Section 3.2.3, stereoscopic 3D video coding based on a scalable video coding architecture is reviewed.
3.2.1 Scalable Coding Techniques

Nowadays video production and streaming are ubiquitous, as more and more devices are able to produce and distribute video sequences. This brings an increasingly compelling desire to send an encoded representation of a sequence that is adapted to user, device, and network characteristics, in such a way that coding is performed only once, while decoding may take place several times at different resolutions, frame rates, and qualities. Scalable video coding allows decoding of appropriate subsets of the bitstream to generate complete pictures whose size and quality depend on the proportion of the total bitstream decoded. A number of existing video compression standards support scalable coding, such as MPEG-2 Video and MPEG-4 Visual. Due to reduced compression efficiency, increased decoder complexity, and the characteristics of traditional transmission systems, these scalable profiles are rarely used in practical implementations. Recent approaches to scalable video coding are based on the motion-compensated 3D wavelet transform and on motion-compensated temporal differential pulse code modulation (DPCM), together with spatial decorrelating transformations. These techniques are elaborated in the following subsections.
3.2.1.1 3D Wavelet Approaches

The wavelet transform has proved to be a successful tool in the area of scalable video coding, since it enables decomposition of a video sequence into several spatio-temporal sub-bands. Usually the wavelet analysis is applied in both the temporal and the spatial dimensions, hence the term '3D wavelet'. The decoder might receive a subset of these sub-bands and reconstruct the sequence at a reduced spatio-temporal resolution, at any quality. The open-loop structure of this scheme avoids the drift problems which typically occur in DPCM-based schemes when there is a mismatch between the encoder and the decoder. In the research literature, two main coding schemes coexist. They differ in the order of the spatial and temporal analysis steps:

- t + 2D: temporal filtering is performed first, followed by spatial analysis. Motion estimation/compensation takes place in the spatial domain (see Figure 3.3).
- 2D + t: spatial analysis is performed first, followed by temporal filtering. Motion estimation/compensation takes place in the wavelet domain [2] (see Figure 3.4).
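The temporal filtering step can be illustrated with the simplest case, a Haar transform applied to a frame pair (a sketch only: a real t + 2D coder filters along motion trajectories, i.e. motion-compensated temporal filtering, which is omitted here):

```python
import numpy as np

def temporal_haar(frame_a, frame_b):
    """One level of temporal Haar analysis on a frame pair:
    t-L carries the pairwise average, t-H the half-difference."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    return (a + b) / 2.0, (b - a) / 2.0  # (t-L, t-H)

def temporal_haar_inverse(t_low, t_high):
    """Perfect reconstruction of the original frame pair."""
    return t_low - t_high, t_low + t_high
```

Applying the analysis recursively to the t-L frames yields the t-LL, t-LLL, and t-LLH sub-bands of Figure 3.3; decoding only the low-pass sub-bands gives the sequence at a reduced temporal resolution.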
Scalable video coding based on the 3D wavelet transform has been addressed in recent research activities [3]–[6]. In 2003, prompted by this increased research activity, MPEG called for proposals for efficient scalable video coding technologies. Most of the proposals were based on 3D wavelet transforms [7]. However, the proposed scalable extension of H.264/AVC was selected as the starting point for the scalable video coding standardization, as it outperformed the other proposals under different test conditions [8].
Figure 3.3 t + 2D wavelet analysis. The input signal is first filtered along the time axis, and then in the spatial dimension. Solid arrows represent temporal low-pass filtering, while dashed arrows represent temporal high-pass filtering
Figure 3.4 2D + t wavelet analysis. Spatial decomposition is carried out first, followed by temporal filtering. Motion estimation and compensation take place in the wavelet domain
3.2.1.2 DCT-based Approaches

The scalable video coding profiles of existing video coding standards are based on DCT methods. Unfortunately, due to the closed loop, these coding schemes have to address the problem of drift, which arises whenever the encoder and decoder work on different versions of the reconstructed sequence. This typically leads to a loss of coding efficiency when compared with nonscalable single-layer encoding. In 2007, the Joint Video Team (JVT) of the ITU-T VCEG and the ISO/IEC MPEG standardized a scalable video coding extension of the H.264/AVC standard [9]–[11]. The new SVC standard is capable of providing temporal, spatial, and quality scalability, with base-layer compatibility with H.264/AVC. Furthermore, it contains an improved DPCM prediction structure which allows greater control over the drift effect associated with closed-loop video coding approaches [9].
Bitstreams with temporal scalability can be provided by using hierarchical prediction structures. In these structures, key pictures are coded at regular intervals, using only previous key pictures as references. The pictures between the key pictures are hierarchical B pictures, which are bi-directionally predicted from the key pictures. The base layer contains the sequence of key pictures at the coarsest supported temporal resolution, while the enhancement layers consist of the hierarchically-coded B pictures (see Figure 3.5). A low-delay coding structure is also possible, if the prediction of the enhancement layer pictures is restricted to the previous frame only.
Spatial scalability was achieved by using a multi-layer coding approach in previous coding standards, including MPEG-2 and H.263. Figure 3.6 shows a block diagram of a spatially-scalable encoder. In the scalable extension of H.264/AVC, the spatial scalability is achieved
Figure 3.5 Prediction structure for temporal scalability
using an over-sampled pyramid approach. Each spatial layer of a picture is independently coded using motion-compensated prediction. Inter-layer motion, residual, or intra prediction mechanisms can be used to improve the coding efficiency of the enhancement layers. In inter-layer motion prediction, for example, the up-scaled base-layer motion data is employed for the spatial enhancement layer coding.
Quality scalability can be considered a subset of spatial scalability in which both layers have the same spatial resolution but different qualities. The scalable extension of H.264/AVC supports quality scalability using coarse-grain scalability (CGS) and medium-grain scalability (MGS). CGS is achieved using the spatial scalability concepts, with the exclusion of the corresponding up-sampling operations in the inter-layer prediction mechanisms. MGS is introduced to improve the flexibility of bitstream adaptation and error robustness; it also improves the coding efficiency of bitstreams that aim at providing different bit rates.
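The dyadic hierarchy behind temporal scalability can be illustrated by deriving each picture's temporal layer from its position in the GOP (a sketch; layer 0 holds the key pictures, and removing the highest remaining layer halves the frame rate each time):

```python
def temporal_layer(poc, gop_size=8):
    """Temporal layer of a picture in a dyadic hierarchical GOP.
    Key pictures (multiples of gop_size) form layer 0; each finer
    layer holds the B pictures midway between coarser-layer pictures."""
    level = 0
    step = gop_size
    while poc % step != 0:
        step //= 2  # halve the picture spacing at each finer layer
        level += 1
    return level
```

For a GOP of 8 this assigns layers 0, 3, 2, 3, 1, 3, 2, 3 to pictures 0–7, matching the usual hierarchical-B prediction structure.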
3.2.2 Multiple Description Coding

In order to mitigate the visual artifacts caused by data losses over unreliable and dynamic transmission links, error-resilient mechanisms are employed in present video communication systems. MDC is one of the prominent error-resilient methods, and can be used effectively in
Figure 3.6 Scalable encoder using a multi-scale pyramid with three levels of spatial scalability [12]
combating burst errors in delay-constrained video applications where retransmission is not feasible [13], [14]. MDC tackles the problem of encoding a source for transmission over a communication system with multiple channels. Its objective is to encode a source into two or more bitstreams (descriptions) which are correlated and equally important. This means that a high-quality reconstruction is decodable from all the received bitstreams together, while a lower-, but still acceptable, quality reconstruction is achievable if only one stream is received. A classification of MDC schemes is given below.

- Odd–Even Frames: the odd–even-frame-based MDC schemes have been subjected to detailed analysis by the research community, due to their simplicity in producing multiple streams. They can also easily avoid mismatch, at the expense of reduced coding efficiency [15]–[17]. Basically, the odd–even-frame-based MDC places the odd and even video frames into descriptions one and two, respectively [15]. The redundancy in this MDC technique comes from the longer temporal prediction distance compared to standard video coders; hence its coding efficiency is reduced.
- Redundant Slices: an error-robustness feature which allows an encoder to send an extra representation of a picture region at lower fidelity, which can be used if the primary representation is corrupted (e.g. redundant slices in H.264/AVC) [18].
- Spatial MDC: the original signal is decomposed into subsets in the spatial domain, where each description corresponds to a different subset. Examples of algorithms include polyphase down-sampling of image samples [19] and of motion vectors using quincunx sub-sampling [20].
- Scalar Quantizer MDC: in MDC quantization algorithms, the multiple descriptions are produced by splitting the quantized coefficients into two or more streams. The output of the quantizer is assigned two or more indexes, one for each description. Based on the received indexes, the MDC decoder estimates the reconstructed signals. The redundancy, and the corresponding side distortion, introduced by this MDC algorithm are controlled by the assignment of indexes to each quantization bin. Some of the proposed MDC quantization algorithms are analyzed in [21]–[24].
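The simplest of these schemes, odd–even frame splitting, can be sketched as follows (frame-repeat concealment when one description is lost is an illustrative choice here, not mandated by the text):

```python
def mdc_split(frames):
    """Split a frame sequence into two descriptions:
    even-indexed frames and odd-indexed frames."""
    return frames[0::2], frames[1::2]

def mdc_merge(even, odd):
    """Central decoder: both descriptions received, interleave them."""
    out = []
    for i in range(len(even) + len(odd)):
        out.append(even[i // 2] if i % 2 == 0 else odd[i // 2])
    return out

def conceal_from_even(even, total):
    """Side decoder: description two lost; repeat each even frame
    to fill the missing odd positions (simple temporal concealment)."""
    return [even[min(i // 2, len(even) - 1)] for i in range(total)]
```

The side decoder delivers the full frame count at roughly half the temporal fidelity, which is exactly the "lower, but still acceptable, quality" behaviour described above.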
Scalable MDC methods are introduced to improve the error resilience of video transmitted over unreliable networks and, at the same time, provide adaptation to bandwidth variations and receiving-device characteristics [25]. These methods can be categorized into two main types:

- The first category starts with an MDC coder, and the MDC descriptions are then made scalable (e.g. a single MDC description is split into base and enhancement layers [26], [27]).
- The second category starts with an SVC coder, and the SVC layers are then mapped into MDC descriptions (e.g. a scalable wavelet encoder can be used to provide MDC streams that can vary the number of descriptions, the rate of each individual description, and the redundancy level of each description using post-encoding [28]–[30]).
Furthermore, embedded MDC techniques are introduced which use embedded multiple description scalar quantizers and a combination of motion-compensated temporal filtering and wavelet transform [31]. This can also be used to achieve a fine-grain scalable bitstream in addition to error resilience.
3.2.3 Stereoscopic 3D Video Coding

Stereoscopic 3D video can be used as a video source for the scalable MDC to provide a robust, immersive communication experience. Therefore, some background on stereoscopic 3D video is provided in this section. There are several forms of 3D video in the research literature, including multi-view video and panoramic video [32]. Stereoscopic video is the simplest form of multi-view 3D video and can easily be integrated into broadcasting, storage, and communication applications using existing transmission and audiovisual technologies. Stereoscopic video renders two slightly different views, one for each eye, which facilitates depth perception of the 3D scene.

3.2.3.1 Types of Stereoscopic Video

There are several techniques to generate stereoscopic content, including dual-camera configurations, 3D/depth-range cameras, and 2D-to-3D conversion algorithms [33]. The use of a stereo
Figure 3.7 “Interview” sequence: (a) color image; (b) per-pixel depth image. The depth-images are normalized to a near clipping plane, Znear, and a far clipping plane, Zfar
camera pair is the simplest and most cost-effective way to obtain stereo video, compared to the other technologies available in the literature. The latest depth-range cameras generate a color image and an associated per-pixel depth image of a scene (see Figure 3.7). The per-pixel depth value determines the position of the associated color texture in 3D space. A specialized image-warping technique known as depth-image-based rendering (DIBR) is used to synthesize the desired binocular viewpoint image [34]. There are advantages and disadvantages associated with stereoscopic video generated using depth-range cameras instead of stereo camera pairs [35].
The 'color-plus-depth-map' formats are widely used in standardization and research activities. In order to provide interoperability of the content, flexibility regarding transport and compression techniques, display independence, and ease of integration, video-plus-depth-image solutions have been standardized by MPEG as MPEG-C Part 3 [36]. The ATTEST (Advanced Three-Dimensional Television System Technologies) consortium worked on a 3D-TV broadcast technology using color-depth sequences as the main source of 3D video [37]. The VISNET work on scalable stereoscopic video will focus on coding of color-depth sequences using the scalable video coding architecture, aiming to scale existing video applications into stereoscopic video with low overheads.
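With the convention commonly used for such normalized depth maps (an assumption here, since the text does not give the formula), an 8-bit depth value v maps to a metric depth Z between the clipping planes via Z = 1 / ((v/255)(1/Z_near - 1/Z_far) + 1/Z_far):

```python
def depth_value_to_z(v, z_near=0.5, z_far=5.0):
    """Map an 8-bit depth-map value to metric depth:
    v = 255 lies on the near clipping plane, v = 0 on the far plane.
    The clipping-plane distances used here are illustrative values."""
    return 1.0 / ((v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
```

A DIBR renderer uses the recovered Z, together with the camera parameters, to warp each color pixel to its position in the synthesized binocular view.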
3.3 Scalable Video Coding Techniques

3.3.1 Scalable Coding for Shape, Texture, and Depth for 3D Video

In this section, a scalable 2D-model-based video coding scheme is investigated in order to achieve the texture and depth-map coding of 3D video. The 3D video is represented using shape, texture (color), and depth-map, as shown in Figure 3.8.
Scalable coding is a popular technique for achieving robust video delivery. For example, the base layer might be heavily protected, whereas the enhancement layers would not be; error resilience can then be maintained during bursts of higher error rate by protecting the base layer at the expense of the enhancement layers. Shape-adaptive coding can facilitate video manipulation and preserve motion and depth discontinuities, so as to improve the rendering quality of the synthesized "virtual" views.
A scalable 2D-model-based video coding scheme is presented in Figure 3.9 [1]. The coding scheme includes four parts: video segmentation, object modeling, scalable model/shape coding, and scalable texture coding. The scalable shape coding and scalable texture coding are elaborated in Subsections 3.3.1.1 and 3.3.1.2 respectively.
Figure 3.8 3D video representation using (a) shape, (b) texture, and (c) depth-map
3.3.1.1 Scalable Shape Coding

The proposed vertex-based shape coding schemes (both intra- and inter-) are presented here. In the proposed schemes, curvature scale space images are employed during contour approximation and contour motion estimation to improve the performance of shape coding. During scalable coding, the coarsely-coded layers are used for prediction and coding. In predictive (inter-) shape coding, the finest layer (for lossless coding) is encoded by using an intra-shape coding scheme, as predictive coding is inefficient in this circumstance.

Scalable Intra-shape Coding [39]
Scalable shape encoding is one of the important steps in achieving highly-scalable object-based video coding. A new scalable vertex-based shape intra-coding scheme is proposed, as shown in Figure 3.10. In order to improve the encoding performance, a new vertex selection scheme is introduced which can reduce the number of approximation vertices. A new vertex-encoding method is then proposed, in which the information of the coarser layers and statistical entropy coding are exploited for high encoding efficiency.
During approximation, the vertices are classified into several layers according to the selected error bands. In this analysis, four layers are used, with the corresponding error bands dmax = 4, dmax = 2, dmax = 1, and dmax = 0, respectively. These selections are based on the
Figure 3.9 Diagram of a scalable 2D-model-based video coding scheme
Figure 3.10 Illustration of the scalable intra-shape coding scheme: the object contour is progressively represented and passed to the layer 0–3 encoders, whose outputs a bit assembler combines into a layered bitstream
results of previous research; it does not seem useful to encode more than four layers, and lossy shape coding should be limited to small distortions during video coding [40]. For the vertex selection of layer 0, the curvature scale space (CSS) image is exploited [41]. The vertices of layer 0 should include the salient points of the object contour, which characterize the contour efficiently, so that they can be decoded first. For the other refinement layers, the iterative refinement method (IRM) [42], together with a novel merging scheme, is exploited [39]. Each approximating polygon edge of the coarser layers is recursively split by introducing a new vertex at the contour point with the largest distance, until the desired accuracy is reached. For the finest layer, the intrinsic image grid quantization is taken into account. Table 3.1 lists the average number of vertices using different vertex-selection methods. Compared with the methods in [40] and [42], the proposed method saves up to 30–80% and 20–30% respectively of the total number of vertices for the lossless reconstruction of the test video objects. Simulation results show that the proposed vertex selection scheme requires fewer vertices for progressive contour approximation. The success of the proposed vertex selection algorithm is due to the selection of salient features using the CSS image, and the consideration of the intrinsic image grid quantization during approximation [40], [42]. For scalable vertex-based shape coding, the proposed method consists of three steps: encoding the vertices in layer 0, encoding the vertex connectivity, and encoding the positions of vertices of the refinement layers [1], [39]. In order to improve the encoding performance, the information of the coarser encoded layers is exploited.
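The recursive edge-splitting step of the layered refinement can be sketched as follows, assuming the contour is an ordered list of points and vertices are stored as indices into it; the sine-wave contour is purely illustrative, and CSS-based selection of the layer-0 vertices is omitted.

```python
import math

def point_segment_dist(p, a, b):
    """Distance from contour point p to the chord a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def refine(contour, v_idx, dmax):
    """Split every polygon edge at the contour point of largest chord distance,
    recursively, until the approximation error is at most dmax (IRM-style)."""
    out = [v_idx[0]]
    for a, b in zip(v_idx, v_idx[1:]):
        stack, ends = [(a, b)], []
        while stack:
            i, j = stack.pop()
            cand = [(point_segment_dist(contour[m], contour[i], contour[j]), m)
                    for m in range(i + 1, j)]
            err, k = max(cand) if cand else (0.0, None)
            if k is not None and err > dmax:
                stack.append((k, j))   # right half is processed after the left half
                stack.append((i, k))
            else:
                ends.append(j)
        out.extend(ends)
    return out

# Layered approximation with error bands dmax = 4, 2, 1, 0, as in the text.
contour = [(x, round(10 * math.sin(x / 5.0))) for x in range(60)]
layers = [[0, len(contour) - 1]]
for d in (4, 2, 1, 0):
    layers.append(refine(contour, layers[-1], d))
```

Each layer's vertex set is a superset of the previous one, which is what makes the representation progressive: a decoder that stops after layer 1 still has a valid, coarser polygon.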
Table 3.1 Average number of vertices using different vertex-selection methods

Sequence  Method           Layer 0  Layer 1  Layer 2  Layer 3
News      Method [40]      22       7        25       147
News      Method [42]      22       6        22       96
News      Proposed method  21       6        21       72
Weather   Method [40]      13       3        12       66
Weather   Method [42]      13       3        12       56
Weather   Proposed method  11       3        12       41
Kids      Method [40]      29       13       38       186
Kids      Method [42]      29       11       29       123
Kids      Proposed method  27       11       28       96
Table 3.2 Average bit usage per frame for the proposed scalable intra-shape coding scheme

Sequence  Layer 0  Layer 1  Layer 2  Layer 3 (lossless)
Weather   197      208      257      403
Kids      415      532      733      1471
News      290      385      496      927
Foreman   261      344      471      676
The encoding of layer 0 consists of two parts: initial vertex encoding and the encoding of the other vertices. A 2D symbol (A, B) is defined to indicate the vertex connectivity, i.e. where new vertices will be inserted, and is coded with a Huffman code. The statistical properties of A and B are studied and used for the Huffman codebook design. For the vertex-position encoding of the refinement layers, an improved OAVE encoding scheme (described in [43]) is used, in which the information from already-transmitted coarser layers and the approximation error band is exploited for high encoding efficiency. The encoding process includes determining and encoding two indicators, encoding the octant number, and encoding the major and minor components. Intensive experiments are conducted to evaluate the performance of the proposed algorithms, covering scalable shape representation and scalable intra-shape coding. The proposed scalable intra-shape coding scheme is tested by coding several widely-used MPEG-4 shape sequences: Kids (SIF), Weather (QCIF), News (QCIF), and Foreman (QCIF). Table 3.2 lists the average bit usage of the four layers for the proposed intra-shape coding scheme. All of the shape sequences have a frame rate of 10 frames per second (fps). The performance of the proposed method is compared with the published CAE method (intra mode) [44] and vertex-based shape coding algorithms [45]. Figure 3.11 illustrates the rate-distortion (R-D) performance of these methods. According to Figure 3.11, the proposed method shows better R-D performance than existing shape coding methods [44]–[47]. The R-D performance of the proposed method is also superior to the recently-developed nonscalable shape coding method described in [46], and to the scalable shape coding algorithm described in [47] (see Figs 17 and 20 in [46] for the Kids and Weather sequences, respectively, and Figure 3 in [47] for the Kids sequence).
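The connectivity-symbol step relies on a standard Huffman construction; a minimal sketch, in which the (A, B) symbol stream and its statistics are invented for illustration:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Build a prefix code from symbol frequencies (classic Huffman construction).
    Returns {symbol: bitstring}."""
    tie = count()  # unique tie-breaker so heapq never compares the code dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in freqs}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical stream of 2D connectivity symbols (A, B); frequencies are illustrative.
symbols = [(0, 1), (0, 1), (1, 0), (0, 1), (1, 1), (0, 1), (1, 0), (2, 1)]
code = huffman_code(Counter(symbols))
bits = "".join(code[s] for s in symbols)
```

As in the text, the codebook would in practice be designed offline from the measured statistics of A and B, not from a single frame's symbols.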
One of the most important characteristics of the proposed method is that it achieves scalable shape coding and provides a scalable bitstream. For example, the R-D performance of the proposed algorithm shown in Figure 3.11 is achieved by decoding the same compressed shape bitstream. This facilitates other applications, such as layered transmission and progressive image reconstruction.

Scalable Predictive Shape Coding [48]

The coding efficiency achieved by intra-shape coding may not satisfy the requirements of low-bit-rate video coding, even though it already reduces the shape bit count substantially. Since a contour sequence has very high correlation in the temporal domain, a straightforward way to exploit this temporal redundancy is motion estimation and compensation. The contour in the current frame can be predicted from the contour obtained in the previous frame, and only the contour segments which cannot be predicted are encoded by the intra-shape coding technique. This can reduce the bit rate of shape coding drastically. Several motion-compensated contour coding schemes have been proposed [49]. In these contour
Figure 3.11 R-D performance comparison of different intra-shape coding schemes for (a) Weather and (b) Kids sequences
motion estimation schemes, the object contour is assumed to undergo a purely translational motion, which does not hold for most video objects. A novel scalable predictive coding scheme which achieves higher compression efficiency is therefore proposed (see Figure 3.12). First, the contour motions in level i are estimated; they are predicted from the MVs of the previously-transmitted levels and/or the encoded MVs of the current level. Contour matching in CSS images is applied to find the correspondence between two contours during contour motion estimation, which makes the estimation more accurate. The scheme consists of two steps: contour motion estimation and scalable predictive shape coding. The novelties of this method are twofold. First, an efficient contour motion estimation scheme is proposed, which is based on the curvature information of the object contour and is used to predict the motion vectors of vertices in the coarser level. Second, a scalable encoding scheme is presented, in which the motion of the contour is estimated hierarchically, and a multi-model encoding scheme is included to improve the compression efficiency.
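The per-segment translational search with an intra-coding fallback for MC-failed segments can be sketched as below; the search range, error measure, and tolerance are illustrative assumptions, and the CSS-based contour matching of the actual scheme is omitted.

```python
def estimate_segment_mv(prev_pts, cur_pts, search=4, tol=1.0):
    """Exhaustive translational search: find (dx, dy) minimising the mean squared
    distance between motion-compensated previous-contour points and the current
    contour segment. Returns (mv, ok); ok is False when compensation fails
    (error > tol), in which case the segment falls back to intra-shape coding."""
    best_mv, best_err = (0, 0), float("inf")
    for dx in range(-search, search + 1):
        for dy in range(-search, search + 1):
            err = sum(min((px + dx - cx) ** 2 + (py + dy - cy) ** 2
                          for cx, cy in cur_pts)
                      for px, py in prev_pts) / len(prev_pts)
            if err < best_err:
                best_err, best_mv = err, (dx, dy)
    return best_mv, best_err <= tol

# Previous-frame segment shifted by (2, 1) in the current frame.
prev = [(10, 10), (12, 11), (14, 13)]
cur = [(12, 11), (14, 12), (16, 14)]
mv, ok = estimate_segment_mv(prev, cur)   # mv == (2, 1), ok == True
```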
Figure 3.12 Diagram of scalable predictive shape coding: per-layer contour motion estimation (ME); MV coding with intra-coding of MC-failed segments; layered intra-coding; an approximating vertex buffer with vertex status adaptation; and a bit assembler feeding the layer decoders
Table 3.3 Average bit usage per frame for the proposed scalable predictive shape coding scheme

Sequence  Layer 0  Layer 1  Layer 2  Layer 3 (lossless)
Weather   24       52       153      325
Kids      241      359      548      1281
News      51       86       277      569
Foreman   83       157      201      406
In order to demonstrate the performance of the proposed scalable predictive shape coding technique, Table 3.3 lists the average bits per frame for encoding several contour sequences at a frame rate of 10 fps. As there is no reference object contour, the contour of the first frame is compressed using the scalable intra-shape coding method discussed in the previous subsection. The performance of the proposed scalable predictive shape coding algorithm is compared with the generalized predictive shape coding scheme (GPSC) in [49], and with inter-model CAE in MPEG-4. Figure 3.13 presents the rate-distortion curves of the proposed algorithm for (a) the Weather and (b) the Kids sequences. It shows that the proposed algorithm achieves better R-D performance than both the CAE and GPSC techniques. The success of the proposed predictive shape coding is due to the adaptive coding of the different layers and the accurate motion estimation of contours. Furthermore, the coding models selected for layer 1 and layer 2 are the motion-compensated ones rather than the intra-coding ones for most shape sequences, except for some frames of the Kids sequence. In those frames, the Kids move quickly and undergo large nonrigid object motion, so the intra-shape coding method is more efficient for these layers. Table 3.4 shows the average bit usage for the Kids sequence using the intra-model, inter-model, and adaptive-model coding schemes. It can be seen that the adaptive coding model improves the coding efficiency greatly.
Figure 3.13 R-D performance comparison of the proposed scalable predictive shape coding method with other coding schemes for the (a) Weather and (b) Kids sequences
Table 3.4 Average bit usage per frame for the Kids sequence using different coding models in layer 1 and layer 2

Model           Layer 0  Layer 1  Layer 2  Layer 3 (lossless)
Intra-model     241      387      660      1364
Inter-model     241      363      651      1357
Adaptive-model  241      359      584      1281
3.3.1.2 Scalable Texture Coding

In this section, the proposed model-based scalable texture coding scheme is presented (see Figure 3.9). It consists of four main blocks: mesh-based temporal filtering, scalable MV coding, shape-adaptive bit-plane coding, and rate-distortion-optimized bit truncation. In this scheme, motion-compensated temporal filtering (MCTF) is employed, similar to the temporal filtering scheme described in [50]. The main difference is that warping MC and a scalable object mesh model are used, rather than block-based MC. As the video frame is segmented into objects with different motion patterns and depths, warping MC is expected to be more efficient than block-based MC. In the proposed scheme, different mesh models are used for different objects. For a foreground object with high priority (for example, an object near the camera), a content-adaptive scalable mesh model is designed and employed [49]. For a background object with low priority (an object far from the camera), an adaptive quadrangular mesh model is constructed and employed. After mesh-based temporal filtering, the estimated MVs are progressively encoded. Scalable MV coding is particularly important at very low bit rates: only partial motion information (a coarse version) is encoded, using fewer bits, so that more bits are available for texture coding, which improves the coding performance greatly. The MV layers of foreground objects are generated according to the designed three-layer scalable object model. For the background, three MV layers are generated through adjustable Lagrange parameters during motion estimation, similar to the method in [51]. During motion vector coding of the first layer, each motion vector is predicted from its encoded neighboring MVs. For the motion vector coding of the other layers, each motion vector is predicted from both its encoded neighboring MVs and the coarser layers of the current MV.
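The layered MV prediction can be sketched as follows; the component-wise median predictor is an assumption for illustration, not necessarily the predictor used in the scheme.

```python
def predict_mv(neighbors, coarse_mv=None):
    """Predict a motion vector from already-encoded neighbour MVs and, for
    refinement layers, the co-located MV of the coarser layer
    (component-wise median of the candidates)."""
    cands = list(neighbors) + ([coarse_mv] if coarse_mv is not None else [])
    xs = sorted(v[0] for v in cands)
    ys = sorted(v[1] for v in cands)
    return xs[len(xs) // 2], ys[len(ys) // 2]

def mv_residual(mv, neighbors, coarse_mv=None):
    """Residual that is actually entropy-coded; a good predictor makes it small."""
    px, py = predict_mv(neighbors, coarse_mv)
    return mv[0] - px, mv[1] - py

# First layer: neighbours only.  Refinement layer: neighbours plus coarser-layer MV.
r0 = mv_residual((3, -1), [(2, 0), (3, -1), (4, -2)])                 # (0, 0)
r1 = mv_residual((6, -2), [(5, -1), (7, -3)], coarse_mv=(6, -2))      # (0, 0)
```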
The difference is encoded using the context-adaptive binary arithmetic coding (CABAC) scheme [1], [52]. After temporal filtering and scalable MV coding, the shape-adaptive bit-plane coding technique is used to encode the "I-frame" and the residual frames. In this research, the shape-adaptive SPECK algorithm [53] is employed, due to its simplicity and good compression performance. In order to improve the coding efficiency, the original shape-adaptive SPECK algorithm is modified using two tactics. The first is the aggressive discarding of transparent regions from sets after partitioning; the second is the employment of the context-adaptive binary arithmetic codec (CABAC) [54] to code the sign and significance map. In the original SA-SPECK algorithm, the significance information, the sign, and the bits of the refinement pass are encoded using an adaptive arithmetic codec (AAC). However, the context information around the pixel to be encoded can be exploited to improve the coding performance, as in the EBCOT algorithm [55].
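A toy illustration of shape-adaptive bit-plane scanning, assuming a flat list of quantized coefficients and a binary object mask; SPECK's set partitioning, sign coding, and arithmetic coding are omitted for brevity.

```python
def bitplane_passes(coeffs, mask):
    """Shape-adaptive bit-plane coding sketch: only pixels inside the object
    mask are visited, and each pass emits one bit per in-shape coefficient
    magnitude, most significant plane first."""
    inside = [abs(c) for c, m in zip(coeffs, mask) if m]
    if not inside or max(inside) == 0:
        return []
    nplanes = max(inside).bit_length()
    return [[(v >> p) & 1 for v in inside] for p in range(nplanes - 1, -1, -1)]

coeffs = [9, -5, 0, 12, 3, -7]
mask   = [1,  1, 0,  1, 0,  1]   # transparent pixels (mask 0) are never coded
planes = bitplane_passes(coeffs, mask)
# in-shape magnitudes [9, 5, 12, 7] -> 4 planes; the MSB plane is [1, 0, 1, 0]
```

Truncating the plane list after any pass yields a coarser but still decodable approximation, which is exactly the embedded property the truncation algorithm below exploits.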
Figure 3.14 Comparison of R-D performance of different shape-adaptive bit-plane coding algorithms for the (a) News and (b) Coastguard objects
Figure 3.14 illustrates the peak signal-to-noise ratio (PSNR) values for two different video objects. The proposed method improves the performance (PSNR) by about 0.1–0.5 dB over the original SPECK algorithm for most video objects; for video objects with complex shape and small size, an even larger PSNR gain can be achieved. After scalable coding of the MV and texture information of all video objects, an R-D-optimized bit allocation and truncation scheme is required to decide the truncation points for the given target bits of one GOP (group of pictures). Here, the bit allocation and optimal truncation scheme of the EBCOT algorithm is extended and employed, in order to easily achieve bit allocation among video objects and among video frames within the GOP. The main feature of the proposed scheme is that the scalable MV information is included in the bit allocation. The algorithm is as follows, where O is the number of video objects, K the number of layers, and N_i^l the number of candidate truncation points of layer l of object i:

1. Construct the rate vector R = {R_i^{l,j} | 0 ≤ i < O, 0 ≤ l < K, 0 ≤ j < N_i^l} and the slope vector S = {S_i^{l,j} | 0 ≤ i < O, 0 ≤ l < K, 0 ≤ j < N_i^l} for all candidate truncation points. The rate and slope vectors therefore each contain N = Σ_{i=0}^{O−1} Σ_{l=0}^{K−1} N_i^l elements.

2. Rank the elements of S in decreasing order, and reorder the rate vector correspondingly, so that S = (S_0, …, S_{j−1}, S_j, S_{j+1}, …, S_{N−1}) with S_{j−1} ≥ S_j ≥ S_{j+1}, and R = (R_0, …, R_{j−1}, R_j, R_{j+1}, …, R_{N−1}).

3. Decide the truncation point n from the rate vector R such that Σ_{i=0}^{n−1} R_i ≤ R_max < Σ_{i=0}^{n} R_i.

4. Retrieve the optimal truncation points N = {(n_i^0, n_i^1, …, n_i^{K−1}) | 0 ≤ i < O} for all video objects and all frames within the current GOP from the n elements of the rate and slope vectors selected in Step 3.
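A compact sketch of Steps 1–3, assuming each embedded stream's slopes are already decreasing (i.e. a convex R-D curve), so a global slope sort preserves per-stream order; the object/layer keys and all numbers in the example are invented.

```python
def optimal_truncation(candidates, r_max):
    """Pool every candidate truncation point (rate increment, distortion-rate
    slope) across objects and layers, sort by slope in decreasing order, and
    accept increments until the GOP budget r_max would be exceeded.
    candidates: dict (object, layer) -> list of (rate, slope) increments.
    Returns {(object, layer): number of accepted increments}."""
    pool = [(s, r, key, j)
            for key, pts in candidates.items()
            for j, (r, s) in enumerate(pts)]
    pool.sort(key=lambda t: -t[0])           # steepest slopes first
    spent, accepted = 0, {k: 0 for k in candidates}
    for s, r, key, j in pool:
        if spent + r > r_max:
            break
        spent += r
        accepted[key] = j + 1
    return accepted

cands = {
    ("obj0", 0): [(100, 9.0), (150, 4.0)],
    ("obj0", 1): [(200, 6.0), (250, 2.0)],
}
pts = optimal_truncation(cands, r_max=450)
# slopes 9.0, 6.0, 4.0 accepted (rates 100 + 200 + 150 = 450); slope 2.0 is cut
# pts == {("obj0", 0): 2, ("obj0", 1): 1}
```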
Figure 3.15 PSNR performance comparisons of different coding methods
After deciding the optimal truncation points, the bit number R_i^{l,j} and the slope S_i^{l,j} for each selected truncation point are kept in the header, along with the embedded bitstream. Extensive experiments are conducted to evaluate the performance of the proposed scalable 2D-model-based texture coding scheme and compare it with the MPEG-4 (Version 2, simple profile) and H.264 standards. Figure 3.15 illustrates the PSNR performance of the Y component at different encoding bit rates under the different coding schemes. The results show that the proposed method is 1–4 dB superior to the MPEG-4 codec over a wide range of bit rates. Compared with H.264, the proposed method achieves better compression performance at low bit rates due to the use of scalable MV coding. Even though the objective performance (PSNR) of the proposed scheme is inferior to that of H.264 at medium and high bit rates, the difference is very small and the two schemes are comparable. This success is due to the use of scalable MV coding and the scalable mesh model during temporal filtering. When the target bit rate is very low, and not sufficient to encode the full MV information, only the first MV layers are encoded and the saved bits are used to encode the first frame of the GOP. In the proposed scheme, the number of bits used for encoding the MV information is decided automatically by the rate-distortion-optimized bit truncation algorithm discussed above. Most importantly, the proposed scheme achieves a highly-scalable bitstream, which is important for new applications such as universal multimedia access (UMA): it achieves spatial, temporal, object, and quality scalability simultaneously. It is worth noting that the experimental results presented here for each sequence are decoded from a single embedded bitstream for the proposed encoding method, but from different bitstreams, one per target coding rate, for MPEG-4 and H.264.
Regarding computational complexity, the proposed scheme is more complex than MPEG-4 and H.264, because of the video segmentation, object modeling, and MC-based temporal filtering. During texture coding, MC-based temporal filtering is more expensive than the ME/MC of MPEG-4 and H.264, as mesh-based motion estimation and compensation is employed. Figure 3.16 shows the complexity and bit usage of the proposed scalable 2D-model-based video coding system for the "Motr_dhtr" sequence. The video segmentation is a time-consuming part of the proposed system, occupying almost 40% of the computation. Figure 3.16(a) shows the average proportion of computation time for each component; it can be seen that speeding up the MCTF is necessary to achieve real-time video compression. Figure 3.16(b) shows the bit usage of each component for the Motr_dhtr sequence at a bit rate of 32 kbit/s; only a small percentage (7.2%) of the total bits is used for scalable model and shape compression.

Figure 3.16 Complexity and bit usage of the proposed scalable 2D-model-based video coding scheme: (a) complexity of each component (MCTF 54.5%, texture/MV coding 25.9%, object modeling 12.7%, model compression 6.9%); (b) bit usage of each component (texture 55.1%, motion vectors 36.7%, object model/shape 7.2%)

3.3.1.3 Scalable Depth Coding

The proposed scalable 2D-model-based video coding scheme can be extended to code 3D video content, comprising the YUV components and the depth component. Scalable depth-map encoding using the 2D-model-based video coding scheme is therefore discussed in this section. A large variety of techniques for the coding of still images or video can be used to compress a depth-map image or sequence, such as JPEG2000, Motion JPEG, MPEG-4, and H.264 [56]–[60]. However, most of these standards were developed for "real-world" imagery and may not be optimal for encoding depth information. The main difference is that these existing coding methods use an error criterion similar to mean-square error (MSE), which may be close to optimal from a visual standpoint. MSE in the depth map, however, is not very meaningful, because depth maps are never viewed directly; rather, they are used for rendering new images. The sensitivity of the final rendered image to errors in the depth map therefore cannot be described by an MSE-like criterion. For example, errors in the depth map close to an intensity edge can result in severe rendering artifacts, while errors on a smooth surface will be unnoticeable. Furthermore, the sensitivity of the disparity with respect to the depth coding error varies with the depth itself for any given view.
Based on these observations, three major improvements are proposed over other methods in the research literature:

1. Object-based depth-map coding.
2. Optimal bit allocation to objects with different depth maps.
3. Reshaping the dynamic range of the depth in order to reflect the differing importance of different depths within the same object.

These improvements can easily be achieved using the proposed scalable 2D-model-based video coding scheme. In this section, extensive investigations of the depth-image compression performance of the different coding methods are conducted, and the effects of the different coding methods on the final rendering performance are also investigated. Before using the above scheme for scalable depth-map encoding, a preprocessing step that reshapes the dynamic range of the depth map is necessary, because it has been observed that errors are much more significant in areas of smaller depth (for objects nearer to the camera). An idea similar to companding is proposed to achieve this: in this study, a simple quadratic function k′ = k²/255 is selected to reshape the dynamic range of the depth map. In order to measure the relative performance, the PSNR values of the views rendered using compressed depth maps are measured with reference to the view rendered with the uncompressed depth map, using the following criterion:

E(dk) = || I(p′(k + dk)) − I(p′(k)) ||²    (3.1)

where I is the intensity of the rendered pixel, k is the original high-quality depth map, dk is the coding error in the depth map, and p′ is the rendered view. Figure 3.17 shows the average PSNR values for the Orbi and Interview sequences, and the comparison with the rendering results after depth compression using H.264. In this experiment, only the depth map is compressed; the YUV components are not. It is shown that the proposed method achieves better final rendering performance than H.264 for low-bit-rate depth coding.
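The companding step can be sketched directly from the quadratic function in the text; the inverse mapping is an assumption, since the exact decoder-side mapping is not specified.

```python
def reshape_depth(k):
    """Quadratic companding from the text: k' = k^2 / 255, applied before
    encoding so that quantization error is distributed non-uniformly in depth."""
    return (k * k) / 255.0

def inverse_reshape(kp):
    """Assumed inverse mapping, applied after decoding."""
    return (255.0 * kp) ** 0.5

# Round trip on 8-bit depth values (lossless here; coding error would enter
# between the two mappings in a real pipeline).
vals = [0, 32, 128, 255]
restored = [inverse_reshape(reshape_depth(v)) for v in vals]
```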
Figure 3.17 R-D rendering performance of different coding methods: (a) Interview sequence; (b) Orbi sequence
Figure 3.18 Frame 52 of the Orbi depth image sequence: (a) original; (b) compressed at 64 kbps using H.264; (c) down-sampled/up-sampled by a factor of 2 (DSUS2) at 64 kbps; (d) down-sampled/up-sampled by a factor of 4 (DSUS4) at 64 kbps
The effects of scalable coding, especially spatial scalability, are also investigated. In order to investigate the effect of spatial scalability of the depth map on the final rendering performance, depth compression with reduced spatial resolution is examined. The small-size depth image is obtained by spatial down-sampling in the pixel domain prior to H.264 encoding, and is up-sampled in the pixel domain after H.264 decoding; this can easily be achieved using the proposed 2D-model-based video coding scheme. The experimental results show that if the resolution of the depth image sequence is reduced prior to encoding, and up-sampled back to its original resolution after decoding, far better objective and subjective quality can be achieved than by compressing at the original resolution of the depth image for low-bit-rate 3D video applications, as shown in Figure 3.17. Figure 3.18(b) shows the depth image without the down-sampling/up-sampling process, while Figure 3.18(c) and (d) show the depth image down-sampled/up-sampled by a factor of 2 and of 4 (in each dimension), respectively.

3.3.1.4 Proposed Technical Solution

Stereoscopic video is captured by two cameras a small distance apart, which produce two slightly different views of the same object. Stereoscopic video coding aims to exploit the
Figure 3.19 Simulcast coding for 3D-depth-map-based stereoscopic video
redundancies present in these two views, while also removing the temporal and spatial redundancies within each view. The simplest way to encode stereoscopic video is to encode both streams (the left and right sequences, or the color/left and depth/disparity sequences) simultaneously using two encoders (see Figure 3.19). This results in high bit rates, requires multiplexing and demultiplexing stages for transmission over a single communication channel, and requires resynchronization before display. Stereoscopic video can also be coded using existing video coding standards after source processing, i.e. arranging the left and right images into a single image sequence in which the left part of each combined frame contains the left image and the right part contains the right image. This approach does not eliminate the redundancies between the two views of the stereoscopic material, and therefore offers relatively low compression efficiency. Figure 3.20(a) and (b) show examples of side-by-side and interlaced image sequences, respectively.
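The side-by-side source processing can be sketched with plain lists standing in for pixel rows (a toy 2×2 example, not real video data):

```python
def side_by_side(left, right):
    """Pack two equally-sized frames (row-major lists of rows) into one
    side-by-side frame, as in the source-processing approach described above."""
    assert len(left) == len(right)
    return [l_row + r_row for l_row, r_row in zip(left, right)]

def split_side_by_side(frame):
    """Undo the packing at the decoder before display."""
    w = len(frame[0]) // 2
    return [row[:w] for row in frame], [row[w:] for row in frame]

left  = [[1, 2], [3, 4]]
right = [[5, 6], [7, 8]]
packed = side_by_side(left, right)      # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

The packed frame is then fed to an ordinary 2D encoder, which is precisely why this approach cannot exploit the inter-view redundancy: the codec sees one wide frame, not two correlated views.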
Figure 3.20 Source-processed images: (a) side-by-side images; (b) interlaced images
Figure 3.21 Layered stereoscopic video coding
Stereoscopic video can also be coded using a layered architecture, in which each layer contains a single view; for example, the color and depth image sequences can be coded as the base and enhancement layers, respectively. Figure 3.21 demonstrates the use of a layered architecture for stereoscopic video coding. The layered architecture provides scalable coding support for stereoscopic video, while still featuring a backward-compatible base layer for 2D video. In addition, the layered architecture can be used to encode depth planes at different resolutions (layered depth images, LDIs), which users can choose between based on their network and terminal capabilities. It has been shown that stereoscopic video can be coded asymmetrically without affecting the perception of depth in 3D video. Hence, the color and depth (or right-view) sequences can be coded asymmetrically, using different quantization parameter (QP) values and spatial resolutions at each layer. This facilitates high-quality stereoscopic video over low-bit-rate communication channels. Furthermore, the depth image can be coded as a temporally/spatially down-sampled image sequence without affecting the perceptual quality of the stereoscopic video [61]. The layered architecture can be further extended to unequal error protection (UEP) and unequal resource allocation schemes for stereoscopic video, based on the measured error sensitivity over multimedia networks.
The R-D performance comparison of the color and depth video obtained using the depth–range cameras versus left and right video is analyzed at low bit rates using the proposed configuration. The performances are compared, in terms of coding efficiency and implementation factors, with the MPEG-4 MAC and H.264/AVC configurations. The results show that the configuration proposed based on the scalable video coding architecture performs similarly to H.264/ AVC and outperforms the MPEG-4-MAC (Multiple Auxiliary Component)-based configuration in terms of coding efficiency. In terms of implementation factors, the proposed configuration provides wider flexibility than all other configurations. MPEG-4 MAC MPEG has identified MPEG-4 MAC as a possible solution for encoding color and depth stereoscopic video content based on available MPEG tools. MPEG-4 MAC allows the
Figure 3.22 MPEG-4 MAC architecture
encoding of auxiliary components in addition to the Y, U, and V components present in 2D video, and the depth/disparity map can be used as one of these auxiliary components. The MPEG-4 MAC configuration for stereoscopic video coding is shown in Figure 3.22 [63]. MAC produces a single-stream output, which facilitates end-to-end video communication applications without system-level modification. Furthermore, it can be made compatible with the MPEG-C part 3 standard, which specifies the guidelines for color- and depth-map-based video communications. In the work presented here, the rate distortion of MPEG-4-MAC-coded color and depth stereoscopic video is compared against results obtained with the H.264/AVC standard and the scalable extension of H.264/AVC.

H.264/AVC

H.264/AVC is a video coding standard that provides high compression efficiency, network friendliness, and a wide range of error-resilience features [64]. Much of the research into using H.264/AVC for stereoscopic video uses left-right view disparity estimation, rather than color and depth map coding [65]. There are two main approaches to coding color-plus-depth video using H.264/AVC. The first is to encode the respective color and depth video sequences using two parallel H.264/AVC codec implementations. However, a single encoder output is advantageous compared to two bitstreams, as it does not affect the end-to-end communication chain (i.e. no extra signaling is needed to accommodate an additional bitstream). Hence, one possible approach is to encode the stereoscopic video after some source processing: at the source, the color and depth images are combined into a single input for the H.264/AVC codec, for example side-by-side color-depth images (see Figure 3.20(a)) or interlaced color-depth images (see Figure 3.20(b)). However, this approach lacks backward-compatibility and flexibility for stereoscopic video communication applications.
This work analyzes the coding efficiency of color and depth image sequences using H.264/AVC, where the color and depth sequences are combined to form a side-by-side image sequence. The SVC single-layer configuration is used to obtain the H.264/AVC results, as the SVC base layer is compatible with H.264/AVC. Using the same software avoids differences that might arise from different implementations of H.264/AVC and its scalable extension (e.g. different rate-distortion coding tools). The proposed configuration based on H.264/AVC is shown in Figure 3.23.
Figure 3.23 H.264/AVC 3D coding architecture
The Scalable Extension of H.264/AVC

This work proposes a configuration for coding stereoscopic video sequences based on the layered architecture of the scalable extension of H.264/AVC. The scalable extension of H.264/AVC, under development by the JVT, is a new video coding standard which supports spatial, temporal, and quality scalability [66]. The color and depth/disparity image sequences are coded as the base and enhancement layers, respectively, as shown in Figure 3.24. As the base layer is compatible with H.264/AVC decoding, users with an H.264/AVC decoder will be able to decode the color image, whereas users with a scalable decoder, and appropriate rendering technology, will be able to decode the depth/disparity image as well and so experience stereoscopic video. This backward-compatible feature of the scalable extension can be used to enhance or scale existing video applications into stereoscopic video. Furthermore, it can be used to exploit asymmetric coding of the left and right images, as SVC supports a range of temporal, spatial, and quality-scalable layers. The inter-layer prediction modes can be selected based on the correlation between the color and depth images; the results presented here make use of the adaptive inter-layer prediction option in the JSVM (Joint Scalable Video Model) software. This configuration can be made compliant with the MPEG-C part 3 (ISO/IEC 23002-3) standard [67]. The next section discusses the coding efficiency of the proposed scalable configuration compared to MPEG-4 MAC and H.264/AVC.
Figure 3.24 Stereoscopic coding based on the SVC architecture
3.3.1.5 Performance Evaluation

Two 2D (color)-plus-depth image sequences were used to obtain the rate-distortion results: Orbi and Interview. Orbi is a very complex sequence, with camera movements and multiple objects, whereas Interview is captured with a static camera and features a stationary background. The tests were carried out using CIF format (352 × 288) video. The image sequences were encoded using the three proposed configurations based on existing video coding standards: MPEG-4 MAC, H.264/AVC (using SVC single-layer coding), and SVC. It should be noted that the MPEG-4 MAC used is based on the ISO/IEC 14496-2 standard (MPEG-4 Visual), rather than AVC technology (ISO/IEC 14496-10). The basic encoding parameters are: 300 frames; IPPP… GOP structure (only one I-frame, at the beginning of the sequence); 30 fps original frame rate; a single reference frame; variable length coding (VLC); and a 16-pixel search range. The QP in the configuration file is varied to obtain the bit rate range shown in the R-D curves; for the scalable configuration, the same QP is used at both layers. The R-D curves show the image quality, measured in PSNR, against the resulting bit rate. Figure 3.25 shows the R-D curves for the Orbi color and depth sequences for the MPEG-4 MAC, H.264/AVC, and proposed scalable configurations; the R-D curves for the Interview color and depth sequences are shown in Figure 3.26. All of the results are plotted against the overall output bit rate of the codec, which includes all overhead bits, texture, color, motion vector, and depth bits. In order to highlight R-D performance across a range of bit rates, the final bit rate is shown from 0 kbps to 2 Mbps. The H.264/AVC-coded stereoscopic video sequences are separated into color and depth sequences in order to calculate the PSNR with respect to the original color and depth sequences.
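The quality measure used for all the R-D curves can be sketched as follows (flat pixel lists for brevity; the sample values are invented):

```python
import math

def psnr(orig, recon, peak=255.0):
    """PSNR in dB between two equally-sized frames: 10*log10(peak^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)

orig  = [100, 120, 130, 140]
recon = [101, 119, 131, 139]
# MSE = 1 for 8-bit data -> PSNR = 10*log10(255^2) ≈ 48.13 dB
```

An R-D point is then simply (overall output bit rate, PSNR); sweeping the QP traces out one curve per configuration.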
The R-D curves for both the Orbi and the Interview sequence show that the proposed scalable configuration based on the scalable extension of H.264/AVC performs similarly to the
Figure 3.25 R-D curves of coded Orbi sequence using three different configurations: (a) color image sequence; (b) depth image sequence
Scalable Video Coding
Figure 3.26 R-D curves of coded Interview sequence using three different configurations: (a) color image sequence; (b) depth image sequence
H.264/AVC configuration, and outperforms the MPEG-4-MAC-based configuration at all bit rates. The scalable configuration is unable to achieve higher compression efficiency than the H.264/AVC configuration, due to the negligible use of inter-layer prediction between the base and enhancement layers: the correlation between the color and depth sequences is small, so the scalable configuration cannot gain any advantage from inter-layer prediction and performs similarly to the H.264/AVC configuration. However, the flexibility the scalable configuration offers for stereoscopic video coding, such as asymmetric coding support (temporal, spatial, and quality scalability for depth image sequences) and backward compatibility, provides significant benefits for 3D video applications compared to H.264/AVC.

The performance advantage of the configurations based on H.264/AVC is particularly significant in depth map coding (see Figures 3.25(b) and 3.26(b) [68]). This is more visible in the Interview sequence, which has less motion and a stationary background, as shown in Table 3.5. Furthermore, it can be observed that the configurations based on scalable H.264/AVC and H.264/AVC provide reasonable image quality at very low overall bit rates compared to the MPEG-4 MAC configuration. This performance advantage over MPEG-4 MAC is due to the advanced encoding algorithms used in the H.264/AVC standard, including flexible MB sizes and MB skip modes.

The subjective image qualities of the Orbi color and depth image sequences are shown in Figures 3.27 and 3.28, respectively. The image sequences are obtained at an overall bit rate of 150 kbps using the scalable configuration and MPEG-4 MAC. According to Figure 3.27, the subjective quality of the scalable H.264/AVC-coded color image is better than that of the MPEG-4-MAC-coded color image. The difference is more visible in the depth image sequences, as shown in Figure 3.28.
The scalable H.264/AVC-coded depth image sequence demonstrates a sharp and superior image quality compared to the MPEG-4-MAC-coded depth image sequence at the given low bit rate of 150 kbps. Figure 3.28 shows the image quality of both sequences at an overall bit rate of 200 kbps. This shows that the proposed stereoscopic video coding configuration based on the scalable
Figure 3.27 Subjective image quality of the Orbi color sequence at an overall bit rate of 150 kbps: (a) SVC configuration; (b) MPEG-4 MAC configuration
extension of H.264/AVC provides reasonable quality compared to the MPEG-4 MAC configuration at overall bit rates as low as 200 kbps. The high performance and flexible features (backward-compatibility and temporal, spatial, and quality scalability) associated with SVC can be used to convert low-bit-rate video applications into stereoscopic video applications. Furthermore, at an overall bit rate of 200 kbps the Orbi depth image sequence can be coded at 49% of the Orbi color image bit rate using the proposed scalable configuration. The depth image bit rate can be further reduced by using a high QP value or reduced temporal or spatial scalability for the enhancement layer (depth image), without affecting the perceptual quality of the stereoscopic video. In order to avoid occlusion problems associated with the
Figure 3.28 Subjective image quality of the Orbi depth sequence at an overall bit rate of 150 kbps: (a) SVC configuration; (b) MPEG-4 MAC configuration
Table 3.5 Image quality at an overall bit rate of 200 kbps

                        Orbi PSNR (dB)         Interview PSNR (dB)
                        Color      Depth       Color      Depth
Scalable H.264/AVC      34.74      38.31       34.22      39.29
MPEG-4 MAC              33.05      34.68       32.25      35.52
H.264/AVC               35.01      38.33       34.41      39.41
DIBR method, several layers of depth images (LDI) can be coded using this proposed scalable configuration.

Tests were carried out to compare the R-D performance of color and depth image sequences against left and right image sequences using the proposed scalable coding configuration. In order to produce left and right image sequences, the Orbi and Interview sequences are projected into virtual left and right views using the following equation:

    Ppix = xB (Npix / D) [ (m / 255)(knear + kfar) − kfar ]    (3.2)

where Npix and xB are the number of horizontal pixels of the display and the eye separation, respectively; the depth value of the image is represented by the N-bit value m; knear and kfar specify the range of the depth information, in front of and behind the picture respectively, relative to the screen width; and D is the viewing distance. The projected views are then coded as the base and enhancement layers using the scalable extension of H.264/AVC. The left and right sequences represent video generated by a stereo camera pair. The coded color and depth image sequences at the base and enhancement layers are converted to virtual left and right video, to be compared with the coded left and right video. Figure 3.29 shows the R-D performance of the Orbi and Interview sequences at low bit rates up to an overall
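As an illustration, Equation (3.2) can be evaluated per depth sample. The numeric parameter values below (display width in pixels, eye separation, viewing distance, depth range) are hypothetical and chosen only for this sketch:

```python
def pixel_parallax(m, n_pix, x_b, d, k_near, k_far, n_bits=8):
    """Equation (3.2): screen parallax (in pixels) for an n_bits depth value m."""
    depth_max = 2 ** n_bits - 1                      # 255 for 8-bit depth maps
    return x_b * (n_pix / d) * (m / depth_max * (k_near + k_far) - k_far)

# Illustrative (assumed) parameters: CIF display width, 5 cm eye separation,
# 2 m viewing distance, depth range [k_near, k_far] = [0.1, 0.2].
p_far = pixel_parallax(m=0, n_pix=352, x_b=0.05, d=2.0, k_near=0.1, k_far=0.2)
p_near = pixel_parallax(m=255, n_pix=352, x_b=0.05, d=2.0, k_near=0.1, k_far=0.2)
# m = 0 (farthest point) yields negative parallax; m = 255 yields positive.
```

Shifting each pixel by its parallax produces the virtual left and right views used in the comparison above.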
Figure 3.29 R-D curves of Orbi (left) and Interview (right) sequences (color and depth sequences versus projected left and right sequences)
bit rate of 500 kbps. According to Figure 3.29, at low bit rates both the Orbi and the Interview sequence demonstrate better performance for color and depth image coding than for projected left and right view coding when using the scalable video coding configuration. With high QP values, H.264/AVC-coded depth images retain high image quality compared with the color images, which results in high-quality left and right image sequences. Due to the amount of disparity between the projected left and right images, the use of adaptive inter-layer prediction between the base and enhancement layers is not effective.

3.3.1.6 Conclusions

This work analyzes the R-D performance of stereoscopic video using three configurations based on MPEG-4 MAC, H.264/AVC, and the scalable extension of H.264/AVC. The proposed scalable configuration based on the layered architecture performs similarly to the configuration based on H.264/AVC, and outperforms the configuration based on MPEG-4 MAC at all bit rates in terms of both objective and subjective quality. Furthermore, at low bit rates the configuration based on scalable video coding produces higher-quality stereoscopic video from color and depth sequences (obtained from the depth-range camera) than from the virtual left and right sequences produced from the same color and depth sequences. This configuration can be used to explore the possibilities of scaling existing video communication applications into stereoscopic video. This will be further facilitated by the backwardly-compatible nature of the scalable coding architecture, which supports base-layer decoding using H.264/AVC decoders. The scalable layers present in this architecture can be exploited to encode multi-resolution video, where one view is coded differently to the other [69].
For example, the depth image can be coded as several quality layers, in order to eliminate the occlusion problem present in DIBR and gain optimum image quality based on user preference and terminal/network capabilities. Furthermore, it can be stated that the depth image sequences can be efficiently compressed using the H.264/AVC standard at all bit rates. In the future, scalable stereoscopic video coding based on this proposed configuration can be applied in a virtual collaboration scenario, where users can receive stereoscopic video content based on their terminal and network capabilities. Compared to conventional video, stereoscopic video will provide a more natural method of communication in a virtual collaboration environment. The backwardly-compatible nature of the scalable video coding will allow users to receive conventional video with their existing infrastructure.
3.3.2 3D Wavelet Coding

Following VISNET activities, a proposal was submitted [70], [71] in response to the MPEG Call for Proposals on Scalable Video Coding at the Munich MPEG meeting in March 2003. This proposal consists of an in-band (2D + t) coding scheme (see Figure 3.30), in which wavelet spatial filtering is followed by motion-compensated temporal filtering (IB-MCTF: In-Band MCTF). The proposed codec is based upon the MC-EZBC codec [72], donated by Prof. J. Woods (Rensselaer Polytechnic Institute) to MPEG. The codec retains a simple Haar transform based on lifting in the temporal domain, with the ability to switch between forward, backward, and bi-directional prediction on a block-by-block basis. Wavelet coefficients are entropy-coded with EZBC (Embedded Zero Block Coding) [73]. In order to overcome the shift-variance property of the wavelet transform, we make use of an overcomplete DWT
Figure 3.30 Block diagram showing the high-level architecture of the proposed codec
representation based on the algorithme à trous [74], which is depicted in Figure 3.31 for the 1D case. Previous solutions made use of the low-band-shift algorithm [75] to obtain an overcomplete DWT. The variant proposed in [76] allows subpixel motion estimation by simply rearranging the output coefficients produced by the low-band-shift. It was shown that the algorithme à trous gives exactly the same output as the algorithm proposed in [76], but it lends itself to a simpler implementation. Although the filters at spatial decomposition level L are 2^L(M − 1) + 1 samples long (M being the size of the analysis wavelet filter), such filters have nonzero coefficients in only M known locations; therefore the algorithm's computational cost is constant across the decomposition levels and is equivalent to that of the low-band-shift, without requiring a rearrangement of the wavelet coefficients.

Figure 3.32 shows one level of temporal analysis/synthesis using the lifted Haar transform in the overcomplete domain. This illustrates the case of a single sub-band. The reference is ODWT-transformed in order to compute the motion-compensated prediction step. An overcomplete representation of t-H (the temporal high-pass sub-band) is also needed to compute the update step. This is obtained by an IDWT–ODWT transform pair, so that a critically-sampled sub-band is turned into its overcomplete representation.

The use of an in-band approach comes with several advantages with respect to traditional t + 2D schemes. The following reasons justified its adoption in this proposal:

• An in-band (2D + t) approach is more suitable in the case of spatial scalability, since each spatial resolution level can be encoded and decoded independently of higher-resolution
Figure 3.31 Overcomplete DWT computed through filter dilation (algorithme à trous). h^(k)(n) is the dilated version of h(n), obtained by inserting k − 1 zeros between two consecutive samples. h0(n) is a low-pass filter, while h1(n) is a high-pass filter
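A minimal sketch of the filter-dilation idea, assuming Haar analysis filters purely for illustration (the proposal's actual filters are not specified here):

```python
import numpy as np

def dilate(h, level):
    """Filter for decomposition level `level` (level 0 = original filter):
    insert 2**level - 1 zeros between taps, giving 2**level*(M-1) + 1 samples."""
    step = 2 ** level
    out = np.zeros(step * (len(h) - 1) + 1)
    out[::step] = h
    return out

def a_trous_level(x, h0, h1, level):
    """One undecimated (overcomplete) DWT level via dilated filters."""
    low = np.convolve(x, dilate(h0, level), mode="same")
    high = np.convolve(x, dilate(h1, level), mode="same")
    return low, high

h0 = np.array([1.0, 1.0]) / np.sqrt(2)   # Haar low-pass (analysis)
h1 = np.array([1.0, -1.0]) / np.sqrt(2)  # Haar high-pass

low, high = a_trous_level(np.arange(8.0), h0, h1, level=1)
# No downsampling is performed: both sub-bands keep the full input length,
# which is what makes the representation overcomplete (shift-invariant).
```

Because only the M original taps are nonzero, an implementation that skips the zero taps keeps the cost per level constant, as noted in the text.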
Figure 3.32 Left: temporal filtering occurring at the encoder side. One IDWT–ODWT module is needed to produce the optimally-interpolated version of the current frame. Right: temporal filtering at the decoder side. This time two IDWT–ODWT modules are needed: one for reconstructing the interpolated t-H frame, the other for the reference frame. In order to avoid cluttering the figure, the weighting factors are not explicitly indicated
sub-bands. This allows better performance to be achieved when a sequence is decoded at lower resolution, as was recently pointed out in [77]. Moreover, a (2D + t) scheme automatically allows complexity scalability.
• Blocking artifacts are reduced at low bit rates. Even when a block-based motion estimation/compensation algorithm is used (the proposed method employs a variable-block-size motion model), blocking artifacts are cancelled out when the sequence is reconstructed at low bit rates. This is due to the fact that block boundaries appear in the wavelet domain, where they are smoothed by the inverse DWT. Visual inspection of Figure 3.33 demonstrates that the proposal outperforms a similar t + 2D approach (MC-EZBC [72]) in terms of subjective quality.
• The (2D + t) coding approach allows the implementation of fast motion estimation techniques designed to work in the wavelet domain. In particular, the FIBME (Fast In-Band Motion Estimation) algorithm has been developed.
Figure 3.33 The Foreman sequence, 256 kbps @ 30 Hz, frame 186, for subjective inspection. Left: MC-EZBC; right: IB-MCTF
• Different motion models can be used for each sub-band. The algorithm allows different motion vectors to be assigned to the individual sub-bands whenever the motion model is complex and cannot be described by a single motion vector per wavelet block.
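Setting aside motion compensation and the overcomplete-domain machinery, the lifted Haar temporal transform underlying the scheme can be sketched as follows (an unnormalized toy variant, shown only to illustrate the predict/update lifting steps):

```python
import numpy as np

def haar_lift_forward(f0, f1):
    """Lifting Haar between two frames: predict, then update.
    Returns the temporal low (t-L) and high (t-H) sub-bands."""
    t_h = f1 - f0            # predict step: residual of f1 w.r.t. reference f0
    t_l = f0 + 0.5 * t_h     # update step: t-L becomes the frame average
    return t_l, t_h

def haar_lift_inverse(t_l, t_h):
    f0 = t_l - 0.5 * t_h     # undo the update step
    f1 = f0 + t_h            # undo the predict step
    return f0, f1

f0 = np.array([10.0, 20.0, 30.0])
f1 = np.array([12.0, 18.0, 33.0])
t_l, t_h = haar_lift_forward(f0, f1)
r0, r1 = haar_lift_inverse(t_l, t_h)
# The lifting structure guarantees perfect reconstruction regardless of the
# predict operator, which is why motion-compensated prediction can be dropped in.
```

In the actual codec the subtraction in the predict step follows motion trajectories in the ODWT domain, but the invertibility argument is identical.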
Figure 3.34 shows the PSNR as a function of bit rate for the Mobile & Calendar sequence at CIF resolution and 30 Hz. The performance of the proposal is compared with MC-EZBC [72], which was the reference t + 2D codec at the time of the Call for Proposals. The results show that the proposed in-band solution outperforms MC-EZBC at low bit rates, while achieving nearly the same PSNR at medium to high bit rates. Despite this, the subjective quality of the reconstructed sequences is improved, as demonstrated by Figure 3.33.
Figure 3.34 The Mobile&Calendar sequence, CIF @ 30 Hz
The good results obtained in the low-bit-rate range are partly due to the scalable motion vector coding algorithm [78] integrated into the proposal, which allows the motion-information bit budget to be matched to the target bit rate. Since the motion model produces blocks of variable sizes, the generated motion field consists of motion vectors distributed on a quad-tree. The irregular structure of the quad-tree does not allow direct application of a wavelet transform to the motion vector data, as is done in [79], where JPEG 2000 is used to achieve a quality-scalable representation of motion. In our proposal, the motion field is instead encoded using a pyramidal representation and entropy coding, with an approach similar to SPIHT [80]. The optimal tradeoff between motion vector bits and sub-band bits is determined by repeated decoding at the encoder side. A model-based approach similar to the one presented in [79] could easily be integrated. The integration of this coding tool was necessary in order to reach the lowest-bit-rate test points required by the MPEG Call for Proposals: if motion information is losslessly encoded, the bit budget is not sufficient to represent the motion field for some of the test sequences.

This proposal responded to the Call for Proposals in both test scenarios. Scenario 1 consists of 4CIF-resolution sequences (up to three spatial scalability layers), and Scenario 2 of CIF-resolution sequences (up to two spatial scalability layers). The proposed solution was able to address all the requested test points (spatio-temporal resolution and bit rate) and ranked approximately the same as the other (2D + t)-based solutions, but it was outperformed by t + 2D and hybrid (DCT + MCTF) approaches. This can partly be attributed to the lack of maturity of the newer (2D + t) scheme. Specifically, the proposed codec was still based on a Haar transform in the temporal domain, while all the other proposals were already based on the more efficient 5/3 filters.
Other coding tools were still missing, such as better encoding of intra blocks. In our opinion, the lower ranking obtained by (2D + t)-based solutions is not due to an inherent sub-optimality of the approach. On the contrary, Taubman has recently shown in [77] that a (2D + t) scheme is indeed necessary in order to obtain better results in the case of spatial scalability.

The activity in the area of 3D wavelet coding continued after the submission of the response to the Call for Proposals. Specifically, the coding efficiency of the proposal was increased by integrating multi-hypothesis motion estimation/compensation into the lifting implementation of the temporal transform, inspired by [81]. Here the multiple hypotheses are the different phases of the overcomplete DWT representation. Figure 3.35 gives a pictorial representation of the lifting prediction step. The inverse ODWT (IODWT) that produces the predictor frame has to
Figure 3.35 Lifting prediction step
Table 3.6 Comparison of Foreman average Y-PSNR (dB) at various spatio-temporal/SNR points using In-Band MCTF (IB-MCTF) and In-Band Multi-Hypothesis MCTF (IBMH-MCTF) with layered motion

             QCIF 7.5 Hz   QCIF 15 Hz   CIF 15 Hz   CIF 15 Hz   CIF 30 Hz
             48 kbps       64 kbps      128 kbps    256 kbps    512 kbps
IB-MCTF      26.68         25.73        28.86       31.86       34.54
IBMH-MCTF    28.91         28.67        29.79       32.38       34.89
Gain         +2.23         +2.95        +0.93       +0.54       +0.35
be interpreted as a weighted inverse of the different phases that compose the ODWT. An equivalent structure is used to implement the update step. In addition to this, the scalable motion vector encoding is replaced with a more efficient layered motion representation. With this approach, each spatio-temporal resolution has a corresponding optimized motion field. Only the relevant motion layers have to be sent when the sequence is decoded at reduced spatio-temporal resolution. Table 3.6 shows the improvements in terms of PSNR for the test points used in MPEG when comparing the two solutions: In-Band MCTF (IB-MCTF) and In-Band Multi-Hypothesis MCTF (IBMH-MCTF). These results take into account both the multi-hypothesis ODWT motion compensation and layered motion encoding. Figures 3.36 and 3.37 show the improvements in terms of perceptual quality when the sequence is decoded at reduced spatio-temporal resolution.
Figure 3.36 Frame 51 of the Foreman sequence decoded at QCIF resolution at 15 Hz and 64 kbps. Left: IB-MCTF. Right: IBMH-MCTF
Figure 3.37 Frame 1 of the Foreman decoded at CIF resolution at 15 Hz and 128 kbps. Left: IB-MCTF. Right: IBMH-MCTF
Figure 3.38 Detail of the Foreman sequence showing the effect of applying the update step in-band. Left: in-band update step. Right: spatial domain update step
However, the coding efficiency of the 3D-wavelet-based scalable video codec [77] can be further improved. In order to reduce the residual blocking artifacts caused by performing the update step in the spatial domain, the adaptive update step is extended to the in-band domain. Figure 3.38 shows the slightly better subjective quality achieved when an in-band approach is used in both the prediction and the update steps.
3.4 Error Robustness for Scalable Video and Image Coding

3.4.1 Correlated Frames for Error Robustness

3.4.1.1 Problem Definition and Objectives

Scalable video coding supports efficient coding of both intra-frames and inter-frames [82], [83]. A group of pictures (GOP) is described with the hierarchical structure depicted in Figure 3.39. Note that the GOP size is restricted to a dyadic choice [84], i.e. a power of 2, in order to maintain the temporal scalability feature. Each GOP contains a key frame, defined as its last picture. The coding option for key frames is determined by the frequency of intra-frames, I_Freq. If the value of I_Freq is equal to the GOP size, all key frames are coded as I-frames; otherwise, P-frame coding is employed for non-intra key frames. Note that the SVC hierarchical structure does not allow more than one I-frame within a GOP, so non-key pictures are encoded as P-frames or B-frames.

The value of a key frame (its presence and quality) for decoding the entire GOP structure is significant. A key frame lost due to poor channel conditions renders decoding of the whole GOP dysfunctional, causing quality degradation, noticeable frame-freezing, and disturbing motion-jerking phenomena [85]. Hence, error robustness for the key frames is essential in error-prone transmission to ensure a good quality of service (QoS). The error-concealment strategy supported by the SVC utilizes an unsophisticated frame-copy approach [86], which does not produce a desirable outcome for key-frame reconstruction. Thus, an efficient error-robustness technique is proposed here. The algorithm employs redundant frame data [87], constructed from successive key frames. By transmitting additional bit overhead, error robustness for key-frame coding is improved. The algorithm formulation is described in Subsection 3.4.1.2, and Subsection 3.4.1.3 presents the simulation results.
Figure 3.39 Dyadic hierarchical coding structure (GOP = 8) featuring four scalable levels in temporal fashion
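The dyadic layer assignment illustrated in Figure 3.39 can be sketched as follows (a toy helper written for this text, not part of the JSVM software):

```python
import math

def temporal_layer(i, gop=8):
    """Temporal layer of frame i in a dyadic hierarchy (layer 0 = key frames)."""
    if i % gop == 0:
        return 0                              # key frames: 0, 8, 16, ...
    trailing = (i & -i).bit_length() - 1      # largest power of 2 dividing i
    return int(math.log2(gop)) - trailing

layers = [temporal_layer(i) for i in range(9)]
# GOP = 8: frames 0 and 8 -> layer 0 (key frames), frame 4 -> layer 1,
# frames 2 and 6 -> layer 2, and the odd frames -> layer 3.
```

Dropping the highest layers halves the frame rate each time, which is exactly the temporal scalability the dyadic restriction preserves.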
3.4.1.2 Proposed Technical Solution

The proposed error-robustness technique is achieved by the construction of correlated frames. Let us denote the key frames corresponding to any two consecutive GOP structures as K1 and K2, respectively. A correlation operator, ⊕, possesses the following properties:

    K1 ⊕ K2 = C    (3.3)
    K1 ⊕ C = K2    (3.4)
    C ⊕ K2 = K1    (3.5)
where C is a correlated frame: an outcome generated from the correlation between two key frames. Equations (3.3)–(3.5) further indicate that the reconstruction for a lost frame is plausible, provided that one of the two key frames, as well as their corresponding correlated frame, is received by the decoder. This provides additional protection to losses in error-prone environments. Since both intra- and inter-frames have disparate structures in terms of
prediction, the construction of the correlated frame requires structure compatibility between the two corresponding key frames. To address this issue, I_Freq (the frequency of I-frames) is set equal to the GOP size, such that all key frames are coded with intra-prediction.

SVC employs the H.264/AVC coding technique to achieve intra-frame coding. Unlike other standards, H.264/AVC intra-prediction supports two extrapolation options, I4MB and I16MB, to reconstruct the spatial composition of a macroblock [88]. I4MB offers nine ways to reconstruct the spatial composition of a 4 × 4 block. Figure 3.40 depicts the elements (a–p) of a 4 × 4 block, with the neighboring encoded pixels (A–M) in eight directions. For instance, VERT, the vertical mode (indicated as 0 in Figure 3.40(b)), extrapolates a 4 × 4 block vertically from the four neighboring pixels A, B, C, D, whereas the horizontal mode, HORT, utilizes the horizontally adjacent pixels I, J, K, L. The other modes operate in the same way along their corresponding orientations, except for DC, the directionless mode, which extrapolates all pixels as (A + B + C + D + I + J + K + L)/8. Owing to its better accuracy, I4MB prediction is particularly beneficial for content comprising highly-detailed information, such as object boundaries and texture composition.

I16MB resembles I4MB but is less time-consuming. Four extrapolation orientations, VERT, HORT, DC, and PLANE, are specified in the Joint Video Specification. The first three schemes are similar to those of I4MB, except that the prediction area is 16 × 16 pixels. PLANE prediction is obtained by linear spatial interpolation from the upper and left-hand samples; its direction is thus close to DIAG_DownLeft, as specified in the I4MB option. In terms of the amount of data, I4MB requires the transmission of 16 times as much prediction information as I16MB.
Thus, I16MB prediction is more suitable for use in areas of homogeneous content not needing a detailed description. The extrapolation option that results in the least Lagrangian cost is selected. Considering the strong spatial correlation within successive frames, the intra-prediction data for two adjacent key frames is set identically. For the sake of simplicity, the prediction data of the second key frame for each set of correlated-frame structures is forced to be identical to that of the first key frame. This results in quality degradation and/or a bit penalty due to the employment of the suboptimal intra-prediction. However, the need for predictor description for the correlated
Figure 3.40 (a) 4 × 4 block with elements (a–p) and neighboring pixels (A–M). (b) Eight direction-biased intra-modes in a 4 × 4 block (0: VERT; 1: HORT; 2: DC; 3: DIAG_DownLeft; 4: DIAG_DownRight; 5: VERT_Right; 6: HORT_Down; 7: VERT_Left; 8: HORT_Up)
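Three of the modes described above (VERT, HORT, and DC) can be sketched directly. This is a toy version using integer division for DC as described in the text, not the normative H.264/AVC procedure:

```python
import numpy as np

def predict_4x4(mode, top, left):
    """Toy 4x4 intra prediction. top = pixels A..D above the block,
    left = pixels I..L to its left."""
    top, left = np.asarray(top), np.asarray(left)
    if mode == "VERT":                 # mode 0: extend the top row downwards
        return np.tile(top, (4, 1))
    if mode == "HORT":                 # mode 1: extend the left column rightwards
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "DC":                   # mode 2: (A+B+C+D+I+J+K+L)/8 everywhere
        return np.full((4, 4), int(top.sum() + left.sum()) // 8)
    raise ValueError(f"mode {mode} not sketched here")

block = predict_4x4("DC", top=[100, 102, 98, 100], left=[96, 100, 104, 100])
# All 16 predicted samples equal the mean of the eight neighbours (here 100).
```

An encoder would evaluate each mode, subtract the prediction from the actual block, and keep the mode with the least Lagrangian cost, as stated above.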
Figure 3.41 The proposed scheme for the construction of the correlated frame between two key frames, K1 and K2
frame is entirely exempted. Thus, a reduction in bit overhead is achieved when the correlated-frame bits are taken into account. Figure 3.41 depicts the proposed scheme for the construction of a correlated frame from two key frames. Note that an identical set of intra-prediction data is employed for both K1 and K2. This provides error robustness for the prediction data, since restoration by direct copy is easily achieved if one of the two key frames has been lost. The correlated frame, C, the outcome of combining the residue data of K1 and K2, is then produced using the correlation operator, exclusive OR (XOR). In order to avoid quantization and transformation errors, the correlated frame is constructed directly from the quantized frequency spectra resulting from the residue data of each key frame. The construction of the proposed correlated frame is summarized as follows:

1. Identify the first key picture of each set of correlated-frame structures. Store the corresponding predictor data, as well as the quantized frequency spectra resulting from its residue data.
2. For the second key frame, set the prediction data identically to that of the first key frame.
3. Acquire the correlated frame, C, as the XOR of the quantized frequency spectra of the two key frames.
4. Encode the correlated frame and repeat Step 1 for the next set of correlated-frame structures.
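The XOR-based protection and recovery of Equations (3.3)–(3.5) can be demonstrated on toy data standing in for the quantized frequency spectra (the arrays below are illustrative values, not actual JSVM data):

```python
import numpy as np

# Quantized residue spectra of two consecutive key frames (toy values).
k1 = np.array([3, -1, 0, 7, 12], dtype=np.int16)
k2 = np.array([2,  5, 0, 7, 10], dtype=np.int16)

c = k1 ^ k2              # Equation (3.3): correlated frame C = K1 XOR K2

# If K2 is lost in transmission, K1 and C recover it exactly (Equation (3.4)):
k2_recovered = k1 ^ c
# Symmetrically, C XOR K2 restores K1 (Equation (3.5)).
k1_recovered = c ^ k2
```

Because XOR is its own inverse, either key frame is recovered bit-exactly as long as the other key frame and the correlated frame both arrive, which is the property the concealment results below rely on.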
3.4.1.3 Performance Evaluation

This subsection examines the performance of the proposed error-robustness algorithm. Results are presented as improvements over the standard SVC benchmark, JSVM 8.0 [89]. The selected sequences, in QCIF resolution (176 × 144 pixels), are classified into three classes: Class A comprises sequences with only low spatial complexity and/or little motion; Class B
contains medium spatial complexity and/or motion; and Class C features high spatial complexity and/or complex motion activity. The other empirical settings, listed below, have been uniformly applied for each of the algorithms:

1. The settings defined in the H.264/AVC main profile are considered.
2. The frequency of the intra-frame is set to the GOP size of 32.
3. In total, 180 frames of each sequence (120 frames for Football) are processed.
4. Only one reference frame is used for prediction.
5. Four compression levels, QP = 28, 32, 36, and 40, are used.
6. Motion refinement in the fine-granularity scalable layer is inactive.
7. Context-adaptive variable length coding (CAVLC) is employed.
The comparisons are given in terms of the PSNR of the luminance component, Y-PSNR (measured in dB), the bit rate (in kbps), the rate-distortion relationship, and the performance over an error-prone channel. The Y-PSNR is defined as:

    Y-PSNR = 10 log10 ( 255² / MSE(Y, Ŷ) )    (3.7)

where Y and Ŷ represent the intensity of the original image luminance component and its corresponding reconstruction, respectively, and MSE is the mean-square-error measurement. The bit penalty is computed as:

    Bit Penalty (%) = ( Bits_robustness / Bits_JSVM8.0 − 1 ) × 100%    (3.8)

where Bits_robustness is the number of bits spent by the error-robustness scheme and Bits_JSVM8.0 the number of bits spent by JSVM 8.0.

Performance Evaluation in Error-free Transmission Channels

Table 3.7 The ΔY-PSNR and bit-rate penalty of the proposed error-robustness algorithms in comparison with the JSVM 8.0 benchmark at three temporal scalable levels, QP = 32

Class  Sequence    ΔY-PSNR (dB)                       Bit Penalty (%)
                   7.5 fps    15.0 fps   30.0 fps     7.5 fps   15.0 fps   30.0 fps
A      Akiyo       -0.0572    -0.0536    -0.0505      28.18     25.16      22.48
A      Hall        -0.0757    -0.0722    -0.0661      29.42     24.46      21.31
B      Foreman     -0.0187    +0.0035    +0.0053      27.99     22.11      18.36
B      Silent      -0.1080    -0.1102    -0.1014      28.79     23.95      20.65
C      Fun Fair    -0.0538    -0.0335    -0.0231      14.38      9.95       7.65
C      Mobile      +0.0109    +0.0074    +0.0102      10.61      6.53       4.37

Table 3.7 indicates the quality degradation (in dB) and the bit penalty (as a percentage) incurred by the proposed error-robustness algorithm over an error-free transmission channel, compared to the JSVM 8.0 benchmark at various frame rates; the entries are arranged by sequence class. It is observed that, for every sequence, the proposed algorithm produces insignificant PSNR degradation under all coding conditions. A maximum distortion cost of 0.11 dB is found in
the Silent Voice sequence. This is within an acceptable level as the visual distortion is still not perceptually significant to human viewers. As for the bit penalty, the general tendency depends on the class of sequence. In Classes A and B, 20–30% additional bits are needed for error protection, whereas Class C sequences result in much lower overhead requirements. The reason is attributed to the larger number of bits spent for coding of non-key frames in fast-motion sequences. It also explains the greater penalty in terms of bits spent for lower frame rates, as the proportion of non-key frames is reduced. In fact, the nature of slow-moving objects in both Class A and B sequences reflects an optional choice for the key frame protection, since frame replacement from the nearest key frames is enough to cope with the situation. In contrast, high-motion activities in the Class C sequences imply frequent appearances of new visual information that cannot be compensated with a simple frame-copy operation. Hence, the proposed error-robustness strategy is of particular benefit to the recovery of the lost frame and its corresponding GOP reconstruction in Class C sequences. Figure 3.42 shows the rate-distortion diagram (Y-PSNRs versus bit rates) of the Fun Fair and Football sequences employing the proposed error-robustness strategy. The results are compared with the benchmark, JSVM8.0 coder, over a wide range of compression levels, i.e. QP ¼ 28, 32, 36, and 40. It is observed that the Y-PSNR accuracy of the proposed strategy is similar to that of the JSVM8.0 at any coding condition. The performance differences shown by the rate-distortion curves are caused by the additional bits incurred by including correlated frames. The bit overhead of each is slightly increased at high bit rates. However, in both cases the level of overhead is less than 15%. 
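The two metrics used throughout this evaluation, Equations (3.7) and (3.8), translate directly into code; the frame contents and bit counts below are placeholder values:

```python
import numpy as np

def y_psnr(y_ref, y_rec):
    """Equation (3.7): luminance PSNR (dB) between 8-bit frames."""
    mse = np.mean((y_ref.astype(np.float64) - y_rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def bit_penalty(bits_robust, bits_jsvm):
    """Equation (3.8): extra bits of the error-robust stream, as a percentage."""
    return (bits_robust / bits_jsvm - 1.0) * 100.0

# A stream needing 128 kbit where JSVM 8.0 needs 100 kbit carries a 28% penalty.
penalty = bit_penalty(128.0, 100.0)
```

Note that a 28% penalty at 7.5 fps is in line with the Class A and B figures of Table 3.7.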
Performance Evaluation in Error-prone Channels

Figure 3.43 illustrates the performance of the proposed correlated-frame scheme and the JSVM frame-copy strategy over error-prone transmission for the Fun Fair and Football sequences, respectively. Both decoders suffer from the loss of key frames 32 and 96. The quality measurement is indicated by the luminance component of the individual sequence, of which 120 frames are decoded. It is obvious that the frame-copy strategy is of limited benefit for the concealment of errors: the error from the incorrect reconstruction of the key frame propagates significantly into the surrounding non-key frames.
Figure 3.42 RD performance for Fun Fair (left) and Football (right) sequences for JSVM 8.0 coder and the proposed error-robustness strategy
Figure 3.43 The performance of two error-concealment strategies (curves for JSVM 8.0, Correlated Frame, and FrameCopy; Y-PSNR in dB versus frame number) for Fun Fair (QCIF, QP=36, GOP=32, top) and Football (QCIF, QP=36, bottom) sequences in error-prone transmission, where key frames 32 and 96 are missing
The frame-copy strategy applied to these fast-moving sequences results in a disastrous outcome, with a minimum PSNR of less than 15 dB at frames 32 and 96. In contrast, the proposed scheme performs key frame recovery by employing correlated frames. It provides error robustness for key frames transmitted over an error-prone channel, together with error concealment at the decoder side. The quality shown does, however, require that either the key frame or its corresponding correlated frame is received by the decoder. As shown, the Y-PSNR curves of the proposed decoder are virtually identical to those of the JSVM software. The reason is straightforward: the correlated frame can be used to perfectly recover the lost key frame, which prevents error propagation to the surrounding GOPs. Figure 3.44 shows four non-key pictures (25, 37, 93, and 104) obtained from the JSVM8.0 decoder, as well as the output resulting from the two error-concealment strategies. Clearly, the frame-copy strategy fails to recover the key frames; the ghosting effect present in the pictures is a consequence of the resulting inappropriate motion-compensation process. There is no evidence of quality distortion using the proposed technique.

3.4.1.4 Conclusions and Further Work
In this section, an error-robustness method for key frames in scalable video coding has been proposed. The technique employs correlated frames, i.e. the redundant data resulting from the key frames of two consecutive GOP structures. The proposed error-robustness technique is of particular benefit for fast-motion sequences, as categorized in Class C. The reasons are
Figure 3.44 Subjective test of error-free JSVM8.0 (top), frame-copy strategy (middle), and the proposed correlated frame (bottom) for both Fun Fair and Football (Class C) sequences
twofold: (1) the additional bit overhead for the correlated frame is relatively insignificant; (2) a missing key frame cannot be compensated by a simple frame copy. The simulations place emphasis on examining the performance of correlated frames in Class C sequences. The results have shown advantages over the JSVM frame-copy strategy in terms of error robustness for error-prone channel transmission, as well as error concealment at the decoder side. However, more extensive tests should be performed to judge the effectiveness of the proposed algorithm, including the use of more realistic channel simulations and the potential corruption of all frames, not just I-frames. Tests should also be carried out using low-delay temporal scalability, to evaluate performance with a configuration that is appropriate for the virtual desk demonstrator. Integration of the correlated frame with the SVC standard requires an efficient transmission scheme to ensure that at least one of the key frame and its redundant data is received by the decoder. Further research is required to integrate this error-robustness feature with the scalable MDC technique.
3.4.2 Odd–Even Frame Multiple Description Coding for Scalable H.264/AVC

3.4.2.1 Problem Definition and Objectives
This section involves the combination of two video coding aspects: multiple description coding and scalable video coding. MDC schemes provide a number of independent descriptions of the same content. Their advantages have been described in a large number of published papers, including [90]. Scalable coding produces a number of hierarchical descriptions that provide flexibility in terms of adaptation to user and network requirements. Although a hierarchical scalable approach is more compression-efficient than MDC, the loss of a lower layer prevents all higher layers from being decoded even if they are correctly received, which means that a significant amount of correctly-received data may be useless in certain channel conditions. Scalable MDC aims to combine the flexibility of scalable coding with the robustness of MDC. Using MDC it is possible to mitigate the effects of losing lower-layer data and to reach an optimal trade-off between compression efficiency and error robustness for a particular channel error rate. This is usually achieved by varying the amount of redundancy introduced by the MDC scheme. Thus, in good channels, compression efficiency equivalent to hierarchical scalability can be achieved, while in poor channels MDC may be used to mitigate the effects of errors. The MDC scheme presented in this section has the following properties:

- The error resilience of the base layer of SVC is enhanced using temporal MDC.
- A "standard compliant" bitstream is produced, in the sense that after re-multiplexing the data received from the different channels a standard decoder will be able to output video. In other words, a standard compliant bitstream can be produced using a very simple MDC combiner, which simply rewrites the received NALUs (network abstraction layer units) into a single stream in a certain order.
The base layer is the most important part of the SVC bitstream, as all other layers depend on it. Therefore, it is important to make it error robust. Here, a standard compliant approach is used to produce multiple descriptions of the base layer. The existing scalable H.264/AVC scheme is effectively capable of providing different descriptions via the hierarchical B frames feature,
Figure 3.45 Contents of Streams 1 and 2 at frame level
which is used to provide temporal scalability. However, the temporal scalability levels are not independent of one another, and therefore modifications must be made to the prediction structure.

3.4.2.2 Proposed Technical Solution
The proposed scalable MDC in this section begins with a standard H.264/SVC coder, and the error resilience of the base layer of H.264/SVC is enhanced using temporal MDC. The temporal MDC of the base layer is produced using the multi-state video coding approach of [91], which separates the even and odd frames of a sequence into two MDC streams, as shown in Figure 3.45. Figure 3.46 shows the general block diagram of the proposed scalable MDC for stereoscopic 3D video. The depth data, which will be combined with the texture data using a depth-image-based rendering (DIBR) technique [92] to produce left and right views, is placed in the enhancement layer. This type of stereoscopic video coding configuration has better coding efficiency than other configurations of stereoscopic video, such as left-and-right coding and interlaced coding [62].
Figure 3.46 Proposed scalable MDC
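As a rough illustration of the DIBR idea referred to above, the toy function below shifts each texture pixel horizontally in proportion to its depth value. The disparity mapping and function names are invented for illustration; real DIBR renderers such as that of [92] also perform hole filling and depth-ordered warping:

```python
def dibr_view(texture_row, depth_row, max_disparity=4, background=0):
    """Render one row of a virtual view by shifting each texture pixel
    horizontally in proportion to its (0..255) depth value."""
    out = [background] * len(texture_row)   # unfilled positions stay background
    for x, (t, d) in enumerate(zip(texture_row, depth_row)):
        shift = round(max_disparity * d / 255)
        nx = x + shift
        if 0 <= nx < len(out):
            out[nx] = t
    return out

row   = [10, 20, 30, 40, 50]
depth = [0, 0, 255, 255, 0]              # the middle pixels are "near"
print(dibr_view(row, depth))             # [10, 20, 0, 0, 50]: shifted-out pixels leave holes
```

The holes left where "near" pixels moved away are the disocclusions that practical DIBR systems must fill.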
Table 3.8 Description of layers for the scalable MDC encoder

Layer  Resolution  Description
0      352×288     Base layer, even – Texture
1      352×288     Base layer, odd – Texture
2      352×288     Enhancement layer, even – Depth
3      352×288     Enhancement layer, odd – Depth
H.264/SVC can produce scalable layers that can be exploited for MDC. It is possible to switch off the inter-layer prediction in SVC [93], which allows the creation of multiple description scenarios among the SVC layers. Hence, in the scalable MDC simulation, the even and odd frames are separated before the encoding process for both texture and depth. The even frames of the texture are then coded in the base layer (Layer 0) and the odd frames of the texture are coded in the enhancement layer (Layer 1). With the inter-layer prediction switched off, Layer 1 can also be regarded as a base layer for the scalable MDC. The even frames of the depth are coded in enhancement Layer 2 and the odd frames of the depth in enhancement Layer 3. Table 3.8 describes the layers that can be produced by the scalable MDC for a CIF-size image sequence at a single spatial resolution. An example with spatial scalability is given in Subsection 3.4.2.3. A single standard compliant bitstream is produced from the above configuration. At the decoder, a bitstream extractor is used to extract the even and odd bitstreams of the texture and the depth. Each bitstream can then be decoded by the standard H.264/SVC decoder. Finally, the decoded even and odd frames are merged together to produce a full-resolution decoded sequence. For both texture and depth, if both the even and the odd streams are received, the decoder can reconstruct the coded sequence at full temporal resolution. If only one stream is received, the decoder can still decode the received stream at half the original temporal resolution. Since even frames are predicted from previous even frames (independent of odd frames), there is no mismatch if one of the streams is lost at the decoder.
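The split, merge, and single-stream concealment logic described above can be sketched as follows (function names are illustrative, not from the JSVM software; integers stand in for frames):

```python
def split_descriptions(frames):
    """Separate a sequence into even-frame and odd-frame descriptions."""
    return frames[0::2], frames[1::2]

def merge_descriptions(even, odd):
    """Interleave the two descriptions back to full temporal resolution."""
    merged = []
    for e, o in zip(even, odd):
        merged += [e, o]
    merged += even[len(odd):] + odd[len(even):]   # leftover frame, if any
    return merged

def conceal_from_one(desc):
    """If only one description arrives, repeat each frame to restore the
    full frame rate (interpolation between frames is the alternative)."""
    return [f for frame in desc for f in (frame, frame)]

frames = list(range(8))                  # frame indices 0..7
even, odd = split_descriptions(frames)   # [0, 2, 4, 6] and [1, 3, 5, 7]
assert merge_descriptions(even, odd) == frames
print(conceal_from_one(even))            # [0, 0, 2, 2, 4, 4, 6, 6]
```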
Additionally, if only one stream is received, the decoder can decode at full temporal resolution by interpolating between the received frames, as in [91], or by repeating frames. If an error occurs in one of the frames of an MDC stream (Frame 4), as shown in Figure 3.47, the error will propagate to the following even frames. The MDC decoder can then switch to the odd

Figure 3.47 Example of an error in an MDC stream
Table 3.9 Description of layers for the SDC

Layer  Resolution  Description
0      352×288     Base layer – Texture
1      352×288     Enhancement layer – Depth
stream and display the frames at a reduced frame rate. Alternatively, the MDC decoder can interpolate between the frames of the reduced-rate stream, or repeat frames, to achieve full temporal resolution.

3.4.2.3 Performance Evaluation
The performance of the proposed scalable MDC is compared with single-description SVC (SDC) in an error-free channel and in an error-prone wireless WiMax channel. Single-description coding for SVC means that the base and enhancement layers of SVC are used to encode, respectively, the 2D and depth information of the sequence, as shown in Table 3.9. The proposed scalable MDC uses the configuration shown in Table 3.8 for the simulation. The performance of the proposed scalable MDC in an ideal MDC channel is also presented. The JSVM software [94] has been adapted to produce the even-and-odd-frame MDC scheme of the base layer. 3D stereoscopic test sequences (2D video plus depth), namely Interview and Orbi, at CIF resolution and 300 frames long, are used in the simulation.

Error-free Performance
The rate distortions of MDC versus SDC for the Interview and Orbi sequences, at 30 fps for the base layer (2D video), are shown in Figures 3.48 and 3.49, respectively. The SDC is the original JSVM software. An I-frame is coded every two seconds for both MDC and SDC: for SDC the I-frame is inserted every 60 frames, and for MDC every 30 frames in each MDC stream. The same fixed QP values are used for SDC and MDC to obtain the rate-distortion curves. It can be seen from Figures 3.48 and 3.49 that MDC is less efficient than SDC in terms of compression capability: MDC PSNR is about 1–2 dB lower than SDC at the same bit rate. This is due to prediction from two frames back, which results in a larger residual, and to the larger motion vectors that need to be coded.
MDC also has more I-frames than SDC, because two I-frames (one for the even stream and one for the odd stream) are inserted every 30 frames in the MDC streams, compared to one every 60 frames in the SDC stream.

Ideal MDC Channel
In an ideal MDC channel, it is assumed that only one of the MDC descriptions is received, and that the other is lost or heavily corrupted. In this situation, the proposed scalable MDC decoder interpolates between the received frames to achieve full temporal resolution. Table 3.10 shows the average Y-PSNRs obtained from decoding only even frames and from decoding both descriptions of the proposed scalable MDC. It is observed that the proposed scalable MDC has a high average PSNR, of more than 35 dB for both sequences, even if only one description is decoded. The PSNR difference between one
Figure 3.48 Rate distortion of MDC versus SDC for Interview
Figure 3.49 Rate distortion of MDC versus SDC for Orbi
Table 3.10 Average Y-PSNRs (dB) for decoding one and both descriptions

Sequence   One description  Both descriptions
Interview  35.17            35.63
Orbi       35.46            37.70
and both descriptions is 2.24 dB and 0.46 dB for Orbi and Interview respectively. No disturbing artifacts such as those in [96] are observed when playing the decoded sequences, as the interpolation produces a perceptually insignificant averaging effect between the frames. Subjective results for error-prone transmission are shown in Figure 3.50, for a 20% packet loss Internet error pattern [97] applied to the even stream only. The Interview sequence at CIF resolution and 30 fps is coded using the original JSVM software (SDC) and the modified JSVM software (MDC) to achieve the same bit rate of about 300 kbps by varying the quantization parameter. Table 3.11 shows the quantization parameter and the I-frame rate used to achieve the bit rate. In
Figure 3.50 Subjective results for Frame 80 of the Interview sequence when subjected to 20% packet loss: (a) SDC; (b) MDC-repeat; (c) MDC-interpolate
Table 3.11 Parameters used for the SDC and MDC encoders

Encoder  QP  I-frame rate                    Bit rate (kbps)  Average PSNR (dB)
SDC      30  Every 15 frames                 290.89           35.64
MDC      30  Every 30 frames in each stream  292.85           35.63
order to achieve the same bit rate as MDC at the same quantization parameter, SDC used more I-frames than MDC. The 20% packet loss is applied only to the even stream for MDC, to simulate an ideal MDC channel (only one stream is lost at a time), while in SDC all frames are subjected to the 20% packet loss. It is assumed that one lost packet means that one frame is lost. Figure 3.50 shows the subjective results for Frame 80, and a comparison between the interpolation and frame-repeat methods for concealment of the reduced frame rate. In SDC, frame-copy error concealment is used. In the MDC simulation, if an error occurs in an even frame, the decoder switches to the odd stream and ignores the rest of the even frames. It then interpolates or repeats the odd frames to achieve full temporal resolution. When playing the sequence, slight motion jerkiness is observed for MDC-repeat, due to the frame repetition. It can be concluded that under ideal MDC channel conditions, MDC is very effective in combating transmission errors compared to SDC.

Error-prone Performance
The compressed 3D video is transmitted over a simulated WiMax (IEEE 802.16e) channel developed in the SUIT (Scalable, Ultra-fast and Interoperable Interactive Television) project [95]. The network parameter settings used here for the IEEE 802.16e simulator are: 16QAM 3/4 modulation and coding scheme (144 bits per slot), slots in the DL, and an IP packet size of 256 bytes. The mobile speed is up to 60 km/h, and partially-used subchannelization (PUSC) permutation is used. Convolutional turbo coding (CTC) is employed for the channel coding, and the ITU Vehicular-A environment is chosen. Figures 3.51 and 3.52 show the performance when transmitting the Orbi and Interview sequences, respectively, using scalable MDC compared to SDC in the WiMax channel. The results show the average PSNR for the luminance component.
The depth data is assumed to be transmitted error-free in the enhancement layer, as the research focus is to provide error resilience for the base layer, which carries the texture information. Table 3.12 shows the quantization parameter, the resulting error-free PSNR for luminance (Y-PSNR) and depth (D-PSNR), and the corresponding texture and depth bit rates in kbps. The overall bit rates for scalable MDC and SDC are 450 kbps and 460 kbps respectively for the Orbi sequence; for the Interview sequence they are 403 kbps and 405 kbps respectively. An I-frame is inserted every 45 frames in SDC, and every 45 frames in each MDC stream, resulting in six I-frames in total for each codec. It can be seen that scalable MDC shows an improved performance of about 1 dB at high error rates (SNR 10–16 dB). One reason is that MDC has the advantage of switching to the other stream when an error occurs in one; provided the other stream is not in error, MDC can achieve reasonable results, as shown in Figure 3.51. Sometimes errors corrupt both streams, hence an I-frame is needed to stop error propagation. At low error rates (SNR 18–24 dB), SDC performs better by about 1 dB, due to its superior performance in error-free conditions, as shown in
Figure 3.51 Performance of scalable MDC compared to SDC in WiMax for Orbi
Figure 3.52 Performance of scalable MDC compared to SDC in WiMax for Interview
Table 3.12 Quantization parameters used, and the resulting PSNRs and bit rates

Sequence   Codec  QP    Y-PSNR (dB)  D-PSNR (dB)  Texture bit rate (kbps)  Depth bit rate (kbps)
Orbi       SDC    28.4  38.97        41.43        318                      142
Orbi       MDC    30    37.70        40.45        313                      137
Interview  SDC    28    37.06        42.05        270                      135
Interview  MDC    30    35.63        40.64        278                      125
Table 3.12. Similar performance is observed for the Interview sequence in Figure 3.52. Subjective results for the 2D video and the 3D stereoscopic video of the Orbi sequence at an SNR of 10 dB (Figure 3.53) also show the improvement achieved by the proposed scalable MDC over SDC. The DIBR technique of [92] is used to produce the stereoscopic 3D video. In Figure 3.53(c) and (d), the 3D video is rendered as red and blue images.

Scalable Performance
Table 3.13 shows the spatial scalability results of the proposed scalable MDC for the Interview sequence. All the layers are contained in one single bitstream. Layer 0 and Layer 1 are the MDC layers (base layer). The user can select and decode the required bitstream according to their
Figure 3.53 Subjective results for Frame 23 of the Orbi sequence for 2D video: (a) SDC; (b) MDC; and for stereoscopic 3D video: (c) SDC; (d) MDC; when subjected to 10 dB SNR WiMax channel
Table 3.13 Description of layers

Layer  Resolution  Description
0      176×144     Base layer, even – Texture
1      176×144     Base layer, odd – Texture
2      176×144     Enhancement layer, even – Depth
3      176×144     Enhancement layer, odd – Depth
4      352×288     Enhancement layer, even – Texture
5      352×288     Enhancement layer, odd – Texture
6      352×288     Enhancement layer, even – Depth
7      352×288     Enhancement layer, odd – Depth
terminal requirements in the virtual collaboration system. For example, two video resolutions, QCIF and CIF, can be decoded from the bitstream. If a stereoscopic 3D terminal is available, the user can also decode the depth layer and use a DIBR technique to achieve stereoscopic 3D video.

3.4.2.4 Conclusions and Further Work
A standard-compliant scalable MDC scheme based on even and odd frames has been proposed for H.264/SVC video coding, targeting 3D videoconferencing applications. The proposed algorithm generates two descriptions for the base layer of SVC based on even and odd frame separation, which reduces its coding efficiency compared to SDC. However, in an ideal MDC channel, the proposed scheme achieves good quality even if only one of the descriptions is received. Objective and 2D/3D subjective evaluation in an error-prone WiMax channel shows an improved performance of the proposed algorithm at high error rates when compared to SDC. The proposed scheme can provide scalability and 3D video communication for the videoconferencing component of the virtual collaboration system.
3.4.3 Wireless JPEG 2000: JPWL
Given the importance of wireless imaging applications, JPEG kicked off a new activity referred to as Wireless JPEG 2000 or JPWL, formally known as Part 11 of the JPEG 2000 specifications. As of October 2004, JPWL was at the Committee Draft (CD) level [98]. The goal of JPWL is to extend the baseline specification in order to allow for the efficient transmission of JPEG 2000 image data over an error-prone wireless transmission environment. More specifically, JPWL defines a set of tools and methods to protect the codestream against transmission errors. It also defines means to describe the sensitivity of the codestream to transmission errors, and to describe the locations of residual transmission errors in the codestream. JPWL notably addresses the protection of the image header, joint source-channel coding, unequal error protection, and data interleaving. Hereafter, a brief review of the current status of JPWL is presented; a more detailed overview is given in [99].

3.4.3.1 Scope and General Description
The transmission of image and video content over wireless networks is becoming ubiquitous. Wireless networks are characterized by the frequent occurrence of transmission errors, which put strong constraints on the transmission of digital imagery. Given its high compression
efficiency, JPEG 2000 is a very strong contender in wireless multimedia applications. Moreover, the highly-scalable JPEG 2000 codestream enables a wide range of QoS strategies for network operators. However, JPEG 2000 has to be robust to transmission errors in order to be suitable for wireless imaging applications. While the baseline specification defines a number of tools for error resilience, these tools only detect the occurrence of errors, conceal the erroneous data, and resynchronize the decoder. In particular, they do not correct transmission errors and do not address the occurrence of errors in the image header, even though it is the most important part of the codestream. For these reasons, they are not sufficient in wireless imaging. To overcome these limitations, JPWL extends the baseline specification and defines additional tools for error protection and correction. Examples of such tools are presented in [100]–[103]. JPWL addresses not only the transmission of JPEG 2000 still images, but also the transmission of Motion JPEG 2000 video. JPWL is not linked to a specific network or transport protocol, but provides a generic solution for the robust transmission of JPEG 2000 codestreams over error-prone networks. While the main target of JPWL is wireless applications, the same tools can also be employed in other error-prone applications. The main functionalities of the JPWL system are:

- To protect the codestream against transmission errors.
- To describe the degree of sensitivity of different parts of the codestream to transmission errors.
- To describe the locations of residual errors in the codestream.
The JPWL system can either be applied to an input source image or to a JPEG 2000 codestream in a number of configurations, one example being illustrated in Figure 3.54. At the transmission side, a JPWL encoder consists of three modules running concurrently: a JPEG 2000 baseline encoder compressing the input image, a generator of the error-sensitivity description, and a processor applying the error-protection tool. The result is a JPWL codestream robust to transmission errors. At the receiving side, a JPWL decoder is also composed of three modules: a processor to correct errors, a generator of the residual errors description, and a JPEG 2000 baseline decoder.
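JPWL's actual error protection relies on Reed-Solomon codes (see the EPB below). Purely to illustrate the encoder-side "add redundancy" / decoder-side "correct errors" split of Figure 3.54, the following toy uses a triple-repetition code with bitwise majority voting; this is not the JPWL mechanism, just the simplest possible stand-in:

```python
def protect(header: bytes) -> bytes:
    """Encoder side: emit three copies of the header."""
    return header * 3

def correct(received: bytes) -> bytes:
    """Decoder side: bitwise majority vote over the three copies."""
    n = len(received) // 3
    a, b, c = received[:n], received[n:2 * n], received[2 * n:]
    return bytes((x & y) | (y & z) | (x & z) for x, y, z in zip(a, b, c))

header = b"\xffO\xffQ"                 # e.g. JPEG 2000 SOC and SIZ marker bytes
tx = bytearray(protect(header))
tx[1] ^= 0x55                          # corrupt one copy in transit
assert correct(bytes(tx)) == header    # the error is corrected
```

A real RS code achieves the same effect with far less overhead, which is why JPWL registers RS-based tools rather than repetition codes.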
Figure 3.54 JPWL system description: encoder and decoder
The error-protection process modifies the codestream to make it more resilient to errors, for example by adding redundancy or by partitioning and interleaving the data. The error-correction process detects the occurrence of errors and corrects them whenever possible. Techniques to protect the codestream include forward error correction (FEC) codes, data partitioning and interleaving, and unequal error protection (UEP). The specific tools for error protection have to be registered with the JPWL Registration Authority (RA).

3.4.3.2 Error-Protection Capability (EPC)
EPC indicates which JPWL normative and informative tools are used in the codestream. More specifically, EPC signals whether the three other normative marker segments defined by JPWL, namely the error-sensitivity descriptor (ESD), the residual error descriptor (RED), and the error-protection block (EPB), are present in the codestream. Furthermore, EPC signals the use of informative tools which have been previously registered with the JPWL RA. Upon registration, each tool is assigned an ID, which uniquely identifies it. These informative tools allow for error resilience and/or error correction, and include techniques such as error-resilient entropy coding, FEC codes, UEP, data partitioning, and interleaving. EPC may also contain parameters relative to these informative tools. This syntax therefore allows for flexible use of existing tools and the rollout of new ones in the future.

3.4.3.3 Error-Protection Block (EPB)
The primary function of the EPB is to protect the main and tile-part headers; however, it can also be used to protect the remaining part of the bitstream. The EPB marker segment contains information about the error-protection parameters and redundancy data used to protect the codestream against errors [101].

3.4.3.4 Error-Sensitivity Descriptor (ESD)
The ESD contains information about the sensitivity of the codestream to transmission errors.
This information is typically generated when the image is encoded using a JPEG 2000 baseline encoder (see for example Figure 3.54), but it can also be derived directly from a JPEG 2000 codestream. This information can subsequently be used when protecting the image. For instance, when applying a UEP technique, more powerful codes are used to protect the most sensitive portions of the codestream. This information can also be used for selective retransmissions, as proposed in [103]; more specifically, a larger number of retransmissions are attempted for the most critical parts of the codestream. Finally, the information carried in the ESD could also be used for other, non-JPWL applications, such as efficient rate transcoding or smart prefetching.

3.4.3.5 Residual Error Descriptor (RED)
The RED signals the presence of residual errors in the codestream. Indeed, a JPWL decoder may fail to correct all the errors in a codestream. The RED allows signaling of the locations of such residual errors. This information can then be exploited by a JPEG 2000 decoder in order to better cope with errors. For instance, the decoder could request retransmission, conceal the errors, or discard the corrupted information.
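To illustrate how ESD-style sensitivity information could drive a UEP decision, the sketch below maps a normalized sensitivity value to an RS code. The sensitivity scale, the thresholds, and the code table are invented for illustration and are not part of the JPWL syntax:

```python
# Stronger RS codes (lower rate k/n) for more sensitive codestream parts.
CODE_TABLE = [                   # (min_sensitivity, (n, k))
    (0.8, (40, 13)),             # most sensitive: strongest protection
    (0.5, (30, 20)),
    (0.2, (26, 20)),
    (0.0, None),                 # least sensitive: left unprotected
]

def assign_protection(sensitivity):
    """Pick the first code whose threshold the sensitivity reaches."""
    for threshold, code in CODE_TABLE:
        if sensitivity >= threshold:
            return code
    return None

parts = {"main header": 1.0, "layer 1": 0.7, "layer 2": 0.3, "layer 3": 0.1}
for name, s in parts.items():
    print(name, "->", assign_protection(s))
```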
3.4.4 JPWL Simulation Results
In this section, simulation results are presented for the transmission of Motion JPEG 2000 video over an error-prone channel [99]. In particular, the efficiency of the error-resilience tools in the baseline JPEG 2000 coding scheme, and the performance of the JPWL EPB tool previously introduced, are evaluated. Two configurations of the EPB tool are considered: in the first case it is used to protect the main and tile-part headers, whereas in the second case it is applied to protect the whole codestream using UEP. The system depicted in Figure 3.55 is used to simulate video transmission over a WCDMA wireless channel. The video source is first encoded with JPEG 2000, using the Kakadu software [104]. The JPWL EPB tool is optionally applied in order to protect the codestream. A WCDMA error pattern is then used to simulate the random occurrence of transmission errors. As the injection of transmission errors is a random process, 50 trials are run for each simulation case and the final results are the average over all trials. For each trial, a different random circular shift is applied to the same error pattern file. The output video is obtained by applying the inverse operations, namely the optional JPWL EPB decoding followed by JPEG 2000 decoding. The test sequences City, Crew, Foreman, Harbor, Mobile and Soccer have been used in the simulation; these sequences exhibit very different characteristics, and the first frame of each is depicted in Figure 3.56. The sequences are in CIF format with a frame rate of 15 fps, and are encoded at 384 kbps. The transmission errors have a bit error rate (BER) of 1e-3. The following four cases are compared:

1. Baseline JPEG 2000 encoding without error-resilience tools; three quality layers are used, a code-block size of 64×64, and four levels of wavelet decomposition.
2. Baseline JPEG 2000 encoding with the following baseline error-resilience tools:
   - RESTART: the MQ coder is restarted at the beginning of each coding pass.
   - ERTERM: the encoder enforces a predictable termination policy for the MQ coder.
   - SEGMARK: a special symbol is encoded at the end of each bitplane.
   - SOP: Start of Packet markers are inserted in front of every packet.
Figure 3.55 Simulation environment for the transmission of video over WCDMA
Figure 3.56 Test sequences: City, Crew, Foreman, Harbor, Mobile and Soccer
3. Same as 2, with the addition of JPWL EPB in order to protect the main and tile-part headers with the following codes:
   - RS(160,64) code is used to protect the first EPB marker segment of the main header.
   - RS(80,25) code is used to protect the first EPB marker segment of a tile-part header.
   - RS(40,13) code is used to protect the remaining EPB marker segments in main and tile-part headers.
4. Same as 2, with the addition of JPWL EPB in order to protect the whole codestream using UEP as follows:
   - RS(160,64) code is used to protect the first EPB marker segment of the main header.
   - RS(80,25) code is used to protect the first EPB marker segment of a tile-part header.
   - RS(40,13) code is used to protect the remaining EPB marker segments in main and tile-part headers.
   - RS(30,20) code is used to protect the first layer.
   - RS(26,20) code is used to protect the second layer.
   - The third layer is not protected.
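The strength of each RS(n,k) code above follows directly from its parameters: a codeword carries n − k parity symbols and can correct up to t = ⌊(n − k)/2⌋ erroneous symbols. A quick sketch:

```python
def rs_stats(n, k):
    """Correction capability and parity overhead of an RS(n,k) code."""
    t = (n - k) // 2                # correctable symbols per codeword
    overhead = (n - k) / k          # parity relative to source symbols
    return t, overhead

for n, k in [(160, 64), (80, 25), (40, 13), (30, 20), (26, 20)]:
    t, ov = rs_stats(n, k)
    print(f"RS({n},{k}): corrects up to {t} symbols, overhead {ov:.0%}")
```

The header codes are deliberately very strong (over 100% overhead on a few dozen bytes), while the layer codes RS(30,20) and RS(26,20) trade correction power for modest overhead on the much larger bitstream payload.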
As can be observed from Table 3.14, the error-resilience tools result in an average gain of 1.72 dB compared to the baseline without error resilience. The additional use of EPB leads to even more significant quality improvements. When EPB is applied to the main and tile-part headers, the average gain is 3.07 dB compared to the baseline without error resilience, and 1.35 dB compared to the baseline with error resilience. When EPB is applied to the whole codestream using UEP, the performance is on average 3.57 dB higher than the baseline without error resilience, and 1.85 dB higher than the baseline with error resilience.
Table 3.14 PSNR results (coding rate of 384 kbps and BER of 1e-3)

Sequence  JPEG 2000  + error     + error resilience  + error resilience
          baseline   resilience  + JPWL EPB-HEADER   + JPWL EPB-UEP
City      23.61      24.98       26.36               26.65
Crew      26.12      28.09       29.66               30.43
Foreman   23.56      25.73       27.22               28.07
Harbor    19.74      21.19       22.31               22.60
Mobile    16.68      17.71       18.67               18.63
Soccer    25.06      27.35       28.96               29.80
Average   22.46      24.18       25.53               26.03
Furthermore, when EPB is not used, errors often occur in the main or tile-part headers, leading to frequent decoder crashes. When EPB is used, those errors can usually be successfully corrected, hence greatly reducing the number of decoder crashes. Table 3.14 summarizes the results in terms of PSNR.
3.4.5 Towards a Theoretical Approach for Optimal Unequal Error Protection

In the previous subsection, the benefit of unequal error protection was shown. In this subsection, a theoretical approach to achieving optimal unequal error protection is presented. More specifically, the approach aims at finding the optimal protection (e.g. using Reed-Solomon (RS) codes) for each layer of the codestream. The JPEG 2000 codestream is composed of a header (more specifically a main header and a tile-part header) followed by the bitstream. The latter is composed of several layers corresponding to different optimal rate-distortion points. The layers themselves consist of several packets, each packet having a header and a body. Each packet corresponds to a quality layer, a resolution, a color component, and a precinct. This structure is illustrated in Figure 3.57.
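To make the hierarchy concrete, it can be sketched as a set of plain data types (an illustrative model only; the type names below are not part of the JPEG 2000 or JPWL specifications):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Packet:
    """One packet: a packet header plus a body, associated with a quality
    layer, a resolution, a color component, and a precinct."""
    header: bytes
    body: bytes

@dataclass
class Layer:
    """One quality layer: a group of packets at one rate-distortion point."""
    packets: List[Packet] = field(default_factory=list)

@dataclass
class Codestream:
    """JPEG 2000 codestream: main header and tile-part header, followed by
    the bitstream organized as layers 1..n."""
    main_header: bytes
    tile_part_header: bytes
    layers: List[Layer] = field(default_factory=list)
```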
Figure 3.57 Structure of the JPEG 2000 codestream: a main header and a tile-part header, followed by layers 1 to n, each layer consisting of packets, and each packet of a header and a body
Each layer $l$ corresponds to a rate:

$$R_{src,l} = \sum_{i=0}^{l} \Delta R_i \qquad (3.9)$$

and a distortion:

$$D_{src,l} = \sum_{i=0}^{l} \Delta D_i \qquad (3.10)$$
Furthermore, it is assumed that transmission errors will result in bit errors with a given BER. This assumption is valid when interleaving is used in a wireless network with burst errors. Finally, an RS code for FEC is considered. More specifically, an RS code of the type RS($n$,$k$) is used, where $n$ is the codeword length and $k$ is the number of source symbols. For an unshortened RS code with $m$-bit symbols, $n = 2^m - 1$. This code has $2t = n - k$ parity symbols and can correct up to $t$ erroneous symbols. Hereafter, the aim is to solve the following problem: given a JPEG 2000 codestream, and assuming that each layer $l$ is protected using an RS code RS($n_l$,$k_l$), find the optimal $n_l$ and $k_l$ for each layer. With the above assumptions and notations, the overhead resulting from RS is given by:

$$R_{RS,l} = \sum_{i=0}^{l} \Delta R_i \, \frac{2 t_i}{k_i} \qquad (3.11)$$
and the total rate is therefore given by:

$$R_{total,l} = R_{src,l} + R_{RS,l} = \sum_{i=0}^{l} \Delta R_i \, \frac{n_i}{k_i} \qquad (3.12)$$
The probability of a symbol error is:

$$P_{se} = 1 - (1 - BER)^m \qquad (3.13)$$
From this it results that the probability of being unable to correctly decode a codeword from layer $l$ is:

$$P_{ce,l} = \sum_{j=t_l+1}^{n_l} \binom{n_l}{j} P_{se}^{\,j} \, (1 - P_{se})^{n_l - j} \qquad (3.14)$$
and therefore the probability of being unable to correctly decode at least one codeword from layer $l$ is:

$$P_{le,l} = 1 - (1 - P_{ce,l})^{N_l} \qquad (3.15)$$
Finally, the number of codewords in layer $l$ is:

$$N_l = \frac{\Delta R_{src,l}}{m_l \, k_l} \qquad (3.16)$$
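Equations (3.13)-(3.16) translate directly into code. The sketch below (plain Python; the 8-bit symbol size and the example figures are assumptions for illustration) evaluates the probability of losing a layer protected with a given RS($n$,$k$) code at a given BER:

```python
from math import comb

def p_symbol_error(ber, m=8):
    """(3.13): probability that an m-bit symbol contains at least one bit error."""
    return 1 - (1 - ber) ** m

def p_codeword_error(n, k, ber, m=8):
    """(3.14): probability that an RS(n,k) codeword cannot be decoded,
    i.e. more than t = (n - k) // 2 symbol errors occur."""
    t = (n - k) // 2
    pse = p_symbol_error(ber, m)
    return sum(comb(n, j) * pse**j * (1 - pse)**(n - j)
               for j in range(t + 1, n + 1))

def p_layer_error(n, k, layer_bits, ber, m=8):
    """(3.15) with (3.16): probability that at least one of the N_l
    codewords covering the layer fails to decode."""
    n_codewords = max(1, layer_bits // (m * k))   # N_l, (3.16)
    return 1 - (1 - p_codeword_error(n, k, ber, m)) ** n_codewords

# Example: an RS(30,20)-protected layer of 48 000 bits at BER = 1e-3
print(p_layer_error(30, 20, 48_000, 1e-3))
```

Note how the layer-loss probability grows with layer size: even a small per-codeword failure probability is amplified by the exponent $N_l$ in (3.15).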
Now the distortion introduced by transmission errors is considered. An additional hypothesis is made that if one codeword is not correctly decoded then the whole layer and all above layers are lost. The distortion resulting from errors is given by:

$$D_{RS,l} = -\sum_{i=0}^{l} \Delta D_i \left( 1 - \prod_{j=0}^{i} (1 - P_{le,j}) \right) \qquad (3.17)$$

and the total distortion for layer $l$ is therefore:

$$D_{total,l} = D_{src,l} + D_{RS,l} = \sum_{i=0}^{l} \Delta D_i \prod_{j=0}^{i} (1 - P_{le,j}) \qquad (3.18)$$
Using (3.12) and (3.18), it is then possible to optimize $n_l$ and $k_l$ for each layer. One way to use this approach is to save the rate-distortion metadata during source encoding using the JPWL syntax. For each layer, $n_l$ and $k_l$ are optimized, and finally RS encoding is applied at the application layer. The same approach could also be used to optimize automatic repeat request (ARQ) or power control.
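As an illustration of how (3.12) and (3.18) could drive the optimization, the sketch below exhaustively searches candidate RS codes per layer for the allocation that minimizes the expected distortion within a total-rate budget. This is not the book's prescribed algorithm, just one straightforward way to use the model; all numbers are hypothetical, 8-bit symbols are assumed, and each $\Delta D_i$ is treated here as the positive distortion reduction contributed by layer $i$ (so that losing the base layer leaves the distortion at its maximum):

```python
from itertools import product
from math import comb

M = 8  # bits per RS symbol (assumption)

def p_layer_error(n, k, layer_bits, ber):
    """Probability of losing a layer of layer_bits bits protected with
    RS(n,k), following (3.13)-(3.16)."""
    t = (n - k) // 2
    pse = 1 - (1 - ber) ** M                                   # (3.13)
    pce = sum(comb(n, j) * pse**j * (1 - pse)**(n - j)
              for j in range(t + 1, n + 1))                    # (3.14)
    return 1 - (1 - pce) ** max(1, layer_bits // (M * k))      # (3.15), (3.16)

def best_allocation(dR, dRed, d_max, candidates, budget, ber):
    """Exhaustive search over one RS(n,k) choice per layer, minimizing the
    expected distortion subject to the rate constraint (3.12)."""
    best_dist, best_codes = float("inf"), None
    for codes in product(candidates, repeat=len(dR)):
        rate = sum(r * n / k for r, (n, k) in zip(dR, codes))  # (3.12)
        if rate > budget:
            continue
        surv, dist = 1.0, d_max
        for r, red, (n, k) in zip(dR, dRed, codes):
            surv *= 1 - p_layer_error(n, k, r, ber)  # all layers up to i intact
            dist -= red * surv                       # expected reduction, cf. (3.18)
        if dist < best_dist:
            best_dist, best_codes = dist, codes
    return best_dist, best_codes

# Purely illustrative: three layers of 40 kbit each, candidate codes,
# a 170 kbit total budget, and a BER of 1e-3.
dR = [40_000, 40_000, 40_000]
dRed = [30.0, 12.0, 6.0]      # distortion reduction contributed by each layer
cands = [(20, 20), (26, 20), (30, 20), (40, 20)]   # (20,20) = unprotected
dist, codes = best_allocation(dR, dRed, 60.0, cands, 170_000, 1e-3)
print(codes, round(dist, 2))
```

With few layers and a handful of candidate codes the search space is tiny; for many layers a Lagrangian or dynamic-programming formulation would be the natural refinement.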
3.5 Conclusions

In this chapter, contributions relating to scalable video coding have been described in the framework of VISNET. Virtual collaboration videoconferencing is one of the applications and scenarios presented. Advances in scalable video coding for 3D video applications were discussed in the three categories given below.

First, a highly-scalable 2D-model-based video coding technique was investigated, and its application to depth map sequence coding was also studied. The objective of scalable coding of shape, texture, and depth for 3D video is to improve the coding efficiency of 3D video content. Algorithms have been proposed for scalable shape coding, scalable texture coding, and scalable depth coding, and promising results have been obtained. Extensive experiments have been conducted to verify the efficiency of the proposed method for both texture and depth coding, and to compare it with standard coding tools such as MPEG-4 and H.264.

Second, the adaptation of scalable video coding for stereoscopic 3D video applications was also presented. The work analyzed the rate-distortion performance of stereoscopic video using three configurations based on MPEG-4 MAC, H.264/AVC, and SVC. The proposed SVC configuration, based on the layered architecture, performs similarly to the configuration based on H.264/AVC and outperforms the configuration based on MPEG-4 MAC at all bit rates in terms of objective and subjective quality. Furthermore, at low bit rates the configuration based on SVC produces higher-quality stereoscopic video using color and depth sequences (obtained from the depth-range camera) than the virtual left and right sequences produced from the same color and depth sequences.

Third, 3D wavelet-based approaches towards scalable video coding were reviewed. Several research activities have focused on the design of efficient 3D wavelet coders in the last few years. This has led to coding architectures that have coding efficiency similar to the
state-of-the-art non-scalable video coder H.264/AVC, while adding scalability as an additional feature. Although other coding schemes are able to offer scalability, 3D wavelet coders are characterized by an unmatched degree of flexibility. The MPEG Ad Hoc Group on Scalable Video Coding investigated the use of wavelet-based approaches as the basis for the future MPEG-21 standard. Although it is not yet clear whether a full-3D wavelet-based scheme will be adopted, motion-compensated temporal filtering will definitely be the core coding tool of that standard. Recent studies on longer temporal kernels and adaptive t+2D/2D+t schemes suggest that further improvements are on their way.

Error-robustness techniques for scalable video and image coding developed within VISNET have also been described in this chapter. First, an error-robustness method for key frames in scalable video coding was proposed. This technique employs correlated frames, i.e. the redundant data resulting from the key frames of two consecutive GOP structures. The proposed error-robustness technique is of particular benefit for fast-motion sequences, as categorized in Class C, for two reasons: (1) the additional bit overhead for the correlated frame is relatively insignificant; and (2) a missing key frame can never be adequately compensated by simple frame copy. The simulations place emphasis on examining the performance of correlated frames in Class C sequences. The results show the advantages over the JSVM frame-copy strategy in terms of error robustness for error-prone channel transmission, as well as error concealment at the decoder side.

Second, a standard-compliant scalable MDC scheme based on even and odd frames was proposed for H.264/SVC, targeting 3D videoconferencing applications. The proposed algorithm generates two descriptions for the base layer of SVC, based on even and odd frame separation, which reduces its coding efficiency compared to SDC.
However, in an ideal MDC channel the proposed scheme achieves good quality even if only one of the descriptions is received. Objective and 2D/3D subjective evaluation in an error-prone WiMAX channel shows improved performance of the proposed algorithm at high error rates when compared to SDC. The proposed scheme can provide scalability and 3D video communication for the videoconferencing component of the virtual collaboration system.

Third, and finally in this chapter, the general issue of error resilience in scalable video coding based on JPEG 2000 was analyzed and the ongoing JPWL standardization effort was reviewed. The goal of JPWL is to define an extension of the JPEG 2000 baseline specification in order to enable the efficient transmission of JPEG 2000 codestreams over error-prone networks. The performance of the JPWL error-protection block (EPB) tool was also evaluated. Two configurations of EPB were considered: one protecting the main and tile-part headers, and one protecting the whole codestream using unequal error protection (UEP). Experimental results showed a significant quality improvement when using UEP. Gains of up to 3.57 dB have been obtained when compared to the JPEG 2000 baseline without error resilience, and up to 1.85 dB against the baseline with error-resilience tools.
4 Distributed Video Coding

4.1 Introduction

Multimedia services have penetrated deeply into human lifestyles over the past decade. They come in diverse forms: business support, public security, entertainment, welfare, and many more. Video streaming, videoconferencing, content delivery, and other video-related services form an indispensable, integral part of these increasingly popular service domains. The demand for higher video quality delivered at low cost has been the driving force behind the vast amount of research activity carried out worldwide in innovating and improving the underlying technologies behind video services.

In an era when the conventional video coding technologies are comfortably established, with H.264/MPEG-4 AVC, MPEG-2, MPEG-4 Visual, and so on having marked their places as recent developments, distributed video coding (DVC) has managed to introduce a radical shift in approach in the video coding arena, with a more flexible architecture utilizing distributed source coding techniques. DVC proposes a significantly lower complexity for the video encoder, while the major computationally-expensive tasks, including motion estimation for exploiting source redundancies, are shifted to the decoder. This low-complexity feature could be very effectively utilized in the design of very-low-cost video cameras for a range of applications. Prospective applications of DVC include wireless sensor networks used for security surveillance systems, and mobile video communications, both of which have increasing potential for commercial use.

The effectiveness of security surveillance systems depends on the deployment of a large number of video sensors scattered in the area of interest. These devices generally capture the data and upload it to a centralized server for processing and either displaying or storing. It is understood that the centralized hardware could be highly shared, since not all incoming streams need be decoded simultaneously.
Therefore the cost of encoders is the deciding factor for the wide deployment of security surveillance systems and other similar applications, including the monitoring of the elderly, the disabled, and children by their guardians, and also disaster zone and traffic monitoring. Mobile communications is considered to be another very demanding application where the cost of the encoder is a matter of concern since each mobile handset used for video calling carries an encoder. The complexity of the video encoder and the decoder hardware in the
mobile unit is important in deciding the price of the equipment, and the mobile handset price plays a major role in the consumer market. A possible economical solution would be to use a combination of DVC and conventional coding techniques: a DVC encoder and a conventional decoder. For example, a very simple low-cost DVC encoder and an equally non-complex conventional decoder could be placed in the handset, while the complex segments of both approaches could be pushed to the base transceiver station (BTS) side of the link, where the processing power is high.

In video signal processing, much research effort has been devoted to improving the rate-distortion (RD) performance of video codecs. However, the complexity aspects of video coding have attracted relatively little attention. The two main contributors to equipment cost are the signal processing hardware and the computational complexity of the coding algorithms; both also affect the power consumption of the device. The conventional video coding algorithms listed above are characterized by a very complex encoder and a simple low-cost decoder, motivated by one-to-many-type applications such as video broadcasting, video on demand, and video streaming, where the cost of the receiver is the prime concern. Recently, some coding algorithms such as H.263+ and H.264 have been studied for mobile video communications and surveillance systems. However, they all achieve compression efficiency at the expense of computational complexity at the encoder. Therefore, achieving a low-cost encoding process is still a significant challenge. Their decoders, on the other hand, are much simpler and are suitable for the one-to-many-type applications mentioned above. DVC is suggested as a solution to this encoding problem. Matching or exceeding the RD performance of conventional codecs is considered to be the challenge facing present-day researchers involved with DVC.
4.1.1 The Video Codec Complexity Balance
The computational complexity of a video coding algorithm is built around the major task of exploiting the redundant information in the source in order to achieve compression of the raw video stream. Historically this function was the heart of the video encoding process, inevitably making the encoder a highly complex structure. The corresponding decoding algorithms involve much less rigorous computation; the complexity of the encoder is typically estimated to be 5–10 times higher than that of the decoder. This conventional complexity balance is illustrated in Figure 4.1. This architecture was motivated by a number of existing one-to-many-type video applications. An example video communication scenario with one video encoder and many receivers is shown in Figure 4.2. With multiple decoders, minimizing the decoder complexity to facilitate the fabrication of low-cost decoders is the dominant consideration, and hence the conventional video coding complexity balance is justified for these scenarios.
Figure 4.1 Complexity balance of conventional video coding
Distributed Video Coding
Figure 4.2 Example one-to-many-type video communication scenario
However, in sharp contrast to the above scenario, video surveillance-type applications rely on the massive deployment of video capturing and encoding devices. A typical security surveillance scenario is shown in Figure 4.3. These scenarios necessarily demand low-cost video encoder architectures for economically-viable operation. Wireless low-power surveillance, multimedia sensor networks and wireless PC cameras are some of the application classes with similar architectural demands.

Figure 4.3 Video surveillance application scenario

DVC proposes a flexible architecture which enables a shift in the complexity balance between the encoder and the decoder to cater for the new demand described above, as illustrated in Figure 4.4. Preliminary observations have shown that the encoder complexity could be brought down to a negligible amount by using a simple state machine with two or three shift registers coupled with the bitstream preparation and interleaving circuitry. However, it is noted that these contrasting features have been somewhat compromised in some of the DVC proposals for improving the RD performance, particularly when transform domain coding is considered in contrast to pixel domain DVC coding. Nonetheless, the competitive advantage is unaffected.

Figure 4.4 Complexity balance of distributed video coding

New fronts of video communication requirements have emerged with the rapid growth of the mobile communication industry. In mobile communications, the complexity of the video encoder (in video capturing) and decoder (in video display) hardware in the mobile unit is crucial in designing devices of smaller size, lower power consumption and lower price. The footprint of the signal-processing hardware, the power consumption and the lower manufacturing cost resulting from low-complexity coding algorithms are some of the critical considerations in mobile video communication research. A possible solution would be to use a combination of DVC and conventional coding techniques: a DVC encoder and a conventional decoder. In such a hybrid solution, a very simple low-cost DVC encoder and a low-complexity conventional decoder could be placed in the handset, while the complex segments of both approaches could be pushed to the base transceiver station (BTS) side of the link. Transcoding capabilities from DVC to conventional coding formats and vice versa need to be integrated into the BTS. This would obviously incur an additional cost at the central installation, but it would be well justified by the fact that the central hardware would be heavily shared, considering the very low activity ratios apparent with mobile users. Figure 4.5 illustrates this hybrid system.
Figure 4.5 Hybrid solution for mobile video conferencing
4.2 Distributed Source Coding
DVC is identified as the adaptation for video coding of the theoretical framework of distributed source coding (DSC) set by the Slepian–Wolf theorem [1] and the Wyner–Ziv theorem [2]. The DSC concept departs from the conventional source coding paradigm in how the encoding of statistically-correlated sources is coupled. In the conventional approach, statistically-correlated sources are jointly encoded and jointly decoded for the perfect reconstruction of the information stream at the decoder. DSC, in contrast, proposes to carry out independent encoding of statistically-dependent sources, and yet to jointly decode them. The information-theoretic limits of such independent encoding were reported by Slepian and Wolf, as discussed in the next section.
4.2.1 The Slepian–Wolf Theorem
(Portions reprinted, with permission, from M.B. Badem, W.A.R.J. Weerakkody, W.A.C. Fernando, A.M. Kondoz, ‘‘Design of a non-linear quantizer for transform domain DVC’’, IEICE Transactions on Fundamentals. Paper Number 2008EAP1119. 2008 IEICE.)
Assume that X and Y are two statistically-dependent discrete random sequences which are independently and identically distributed (i.i.d.). Consider the case where these sequences are separately encoded with rates RX and RY, respectively, but are jointly decoded, exploiting the correlation between them as illustrated in Figure 4.6. Slepian and Wolf presented an analysis of the possible rate combinations of RX and RY for the reconstruction of X and Y with an arbitrarily small error probability [1], as shown below. This is widely known as the Slepian–Wolf theorem:

RX ≥ H(X|Y)   (4.1)
RY ≥ H(Y|X)   (4.2)
RX + RY ≥ H(X,Y)   (4.3)
where H(X|Y) and H(Y|X) are the conditional entropies and H(X,Y) is the joint entropy of X and Y. (Equations 1, 2 and 3 reproduced by permission of 2008 IEICE.) According to the Slepian–Wolf theorem, if each individual rate is at least the corresponding conditional entropy and the total rate RX + RY is at least the joint entropy H(X,Y), then the independently-encoded streams can be jointly decoded with arbitrarily
Figure 4.6 Distributed coding of two statistically-dependent discrete random sequences
Figure 4.7 Achievable rate region following the Slepian–Wolf theorem
small bit error probability. Thus the joint entropy H(X,Y) sets the lower bound on the total bit rate. It is therefore concluded that independent encoding of statistically-dependent sequences does not impose any theoretical loss of compression efficiency compared to the more established approach of joint encoding, as practiced in conventional video coding techniques. Figure 4.7 illustrates the achievable rate region for the distributed coding of two statistically-dependent i.i.d. sources, X and Y, for the recovery of their information with an arbitrarily small error probability according to the Slepian–Wolf theorem [1]. In Figure 4.7, the vertical, horizontal, and diagonal lines correspond to Equations (4.1), (4.2) and (4.3), respectively; they represent the lower bounds for the achievable rate combinations of RX and RY [1].
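As a toy illustration of these bounds (a sketch not from the text; the function names are illustrative), the following computes the entropies for a pair of correlated binary sources and checks whether a rate pair (RX, RY) lies in the admissible region:

```python
import math

def entropies(joint):
    """Return (H(X,Y), H(X|Y), H(Y|X)) for a joint pmf given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    h = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)
    h_xy = h(joint.values())
    # H(X|Y) = H(X,Y) - H(Y),  H(Y|X) = H(X,Y) - H(X)
    return h_xy, h_xy - h(py.values()), h_xy - h(px.values())

def slepian_wolf_admissible(rx, ry, joint):
    """Check the rate pair (rx, ry) against Equations (4.1)-(4.3)."""
    h_xy, h_x_given_y, h_y_given_x = entropies(joint)
    return rx >= h_x_given_y and ry >= h_y_given_x and rx + ry >= h_xy

# Two binary sources that agree with probability 0.9.
joint = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
```

For this source pair, H(X,Y) is roughly 1.47 bits, so a total rate well below the 2 bits needed for fully independent decoding (e.g. the pair (1.0, 0.5)) is still admissible under joint decoding.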
4.2.2 The Wyner–Ziv Theorem
The Wyner–Ziv theorem [2] refers to a particular case of the Slepian–Wolf theorem [1] commonly known as the ‘‘lossy compression scenario with side information at the decoder’’. This scenario is shown in Figure 4.8. This concept assumes a finite acceptable distortion level d between the source information X and the corresponding decoded output X′; hence the term ‘‘lossy compression’’. Assuming the information content Y, which is statistically dependent on X, is available at the decoder, Wyner and Ziv quantify the minimum bit rate to be passed from the encoder, termed RWZ(d), in order to achieve the finite distortion d between the input and the output. The Wyner–Ziv theorem states that when the statistical dependency between X and Y is exploited only at the decoder, the transmission rate increases compared to the case where the correlation is exploited at both the encoder and the decoder, for the same average distortion d.

Figure 4.8 Lossy compression with decoder side information

The Wyner–Ziv theorem reads:

RWZ(d) ≥ RX|Y(d),  d ≥ 0   (4.4)

where RWZ(d) is called the Wyner–Ziv minimum encoding rate (for X), and RX|Y(d) represents the minimum rate necessary to encode X when Y is simultaneously available at the encoder and the decoder for the same average distortion d. When d approaches zero, i.e. when no distortion exists, the Wyner–Ziv theorem falls back to the Slepian–Wolf result. This means that it is possible to reconstruct the sequence X with an arbitrarily small error probability even when the correlation between X and the side information is only exploited at the decoder.
4.2.3 DVC Codec Architecture
(Portion reprinted, with permission, from M.B. Badem, W.A.C. Fernando, W.A.R.J. Weerakkody, H.K. Arachchi, A.M. Kondoz, ‘‘Transform domain unidirectional distributed video coding using dynamic parity allocation’’, IEICE Transactions on Fundamentals. Paper Number 2008EAP1134. 2008 IEICE.)
Building distributed source coding concepts into a video codec has been discussed in the literature using numerous distinct architectures. The codec design criteria include: flexibility and extremely low complexity of the encoder, optimum RD performance, and operational efficiency suitable for diverse applications including real-time video communications and video storage. Several channel coding schemes, including turbo coding [3], turbo trellis-coded modulation (TTCM) [4,5], and low-density parity-check (LDPC) codes [6,7], have been proposed for use in DVC implementations. Other distinct architectures have also been proposed for the same purpose, including the PRISM codec [8]. Another key design option is to operate in either the pixel domain or the transform domain. Pixel domain DVC yields the lowest complexity for the encoder structure. Transform domain coding trades the extremely-low-complexity feature for improved RD performance. The discrete cosine transform (DCT) has been widely proposed for this purpose [9]. The wavelet transform and the integer transform are further variants proposed in the literature for transform domain DVC codecs. In this chapter, the widely-appreciated turbo-coding-based DVC implementation is discussed in detail. The discussion is predominantly based on the DVC architecture for pixel domain implementation, with added sections discussing the variations for the transform domain implementation. Figure 4.9 shows a generic block diagram of the turbo-coding-based pixel domain DVC codec.
The dominant constituent blocks are identified as: quantizer, bit plane extractor, turbo encoder, parity puncturer, parity buffer, and the key frame encoding mechanism on the encoder side; and side information generator, turbo decoder, reconstruction, noise distribution and error estimation functions, and the key frame decoder on the decoder side of the codec. The transmission medium between the encoder and the decoder includes a reverse path for communicating the parity request messages dynamically. The construction and functionality of each of these components is discussed in the following sections. Some of these components are either removed or modified, and additional blocks have been introduced, in recent modifications to the DVC architecture proposed in recently-published research literature. Some of these modifications are discussed later in this chapter.
Figure 4.9 Generic pixel domain DVC codec architecture using turbo coding. Reproduced by permission of 2008 IEICE
4.2.4 Input Bitstream Preparation – Quantization and Bit Plane Extraction
In DVC, it is convenient to understand the two functions of quantization and bit plane extraction together, since they are closely related. In the pixel domain DVC codec, the input video stream is subjected to a pixel-level quantization for video compression purposes. In the input video stream (YUV video format is assumed for all the simulations discussed in this chapter) each pixel is represented by 8 bits, yielding 256 quantization levels. The effect of the quantization process is to control the number of bits used for each pixel by limiting the number of quantization levels to 2^M ∈ {2, 4, 8, 16, 32, 64, 128, 256}, where M is the quantization parameter. The quantization algorithm can be explained using a series of quantization bins, as illustrated in Figure 4.10. The number of bins is determined by the quantization parameter. The obvious ill-effect of quantization, the loss of information content in the video stream, is partially compensated by the reconstruction function performed at the decoder, which is in effect an inverse quantization process. The bit plane extraction process is illustrated in Figure 4.11. The most significant bits of every pixel in the frame are placed in sequence to form the first bit plane. The second-level significant bits of all pixels form the second bit plane, and the pattern continues until the least significant bit plane. The input bitstream for the turbo encoder is then formed by placing the bit planes in order: the most significant bit plane first and the least significant bit plane last. As discussed above, the quantizer determines the number of bits per pixel sent to the encoder; in effect it determines the number of bit planes processed.
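These two steps can be sketched as follows (Python; the function names are illustrative, and 8-bit pixels with uniform bins are assumed):

```python
def quantize(pixels, m):
    """Uniformly quantize 8-bit pixel values into 2^m bins (Figure 4.10):
    each pixel maps to the index of the bin that contains it."""
    step = 256 // (1 << m)
    return [p // step for p in pixels]

def extract_bitplanes(symbols, m):
    """Split m-bit quantized symbols into bit planes: plane 0 holds the
    most significant bit of every symbol, plane m-1 the least significant."""
    return [[(s >> (m - 1 - b)) & 1 for s in symbols] for b in range(m)]

pixels = [0, 64, 130, 255]
q = quantize(pixels, 2)           # 4 bins of width 64 -> [0, 1, 2, 3]
planes = extract_bitplanes(q, 2)  # MSB plane [0, 0, 1, 1], LSB plane [0, 1, 0, 1]
```

Concatenating the planes in MSB-first order yields the input bitstream for the turbo encoder.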
4.2.5 Turbo Encoder
Turbo coding is a channel coding technique widely appreciated for use in DVC. A turbo encoder is formed by parallel concatenation of two recursive systematic convolutional (RSC) encoders separated by an interleaver, as illustrated in Figure 4.12(a).
Figure 4.10 Quantization bin formation for 2^M = 4
Figure 4.11 Bit plane extraction process
The construction of an RSC encoder is determined by the generator polynomial, which takes the form:

G(D) = [1, g2(D)/g1(D)]   (4.5)

where g2(D) and g1(D) are the feedforward and feedback polynomials, respectively. Figure 4.12(b) illustrates an example RSC encoder with the generator polynomial (1, 13/15)oct.
Figure 4.12 (a) Turbo encoder; (b) example RSC encoder
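A bit-level sketch of one constituent RSC encoder with this generator is given below (Python; the function name and the mapping of the octal polynomials to shift-register tap masks are illustrative assumptions, not taken from the text):

```python
def rsc_encode(bits, g1=0o15, g2=0o13, memory=3):
    """Rate-1/2 RSC encoder for generator (1, g2/g1): returns the
    systematic stream and one parity stream. g1 is the feedback
    polynomial and g2 the feedforward polynomial, given in octal with
    the most significant bit as the input tap."""
    g1_taps = [(g1 >> (memory - i)) & 1 for i in range(memory + 1)]
    g2_taps = [(g2 >> (memory - i)) & 1 for i in range(memory + 1)]
    state = [0] * memory
    systematic, parity = [], []
    for u in bits:
        fb = u
        for i in range(memory):            # feedback combination
            fb ^= state[i] & g1_taps[i + 1]
        p = fb & g2_taps[0]
        for i in range(memory):            # feedforward combination
            p ^= state[i] & g2_taps[i + 1]
        systematic.append(u)
        parity.append(p)
        state = [fb] + state[:-1]          # shift register update
    return systematic, parity
```

A turbo encoder would run two such encoders, the second on an interleaved copy of the input, and in DVC only (punctured) parity bits are forwarded.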
4.2.6 Parity Bit Puncturer
For an input data block of length n, the turbo encoder generates 2n parity bits in the two constituent RSC encoders, as discussed above. This parity bitstream constitutes the compressed output of the video encoder; the full parity bit sequence cannot be forwarded because of its size. Therefore a periodic puncturing function is involved, whereby bits are selected from the parity stream in a periodic pattern to compose the final video encoder output. The corresponding bits are generally selected symmetrically from the parity streams of the two constituent RSC encoders in the turbo encoder. Since the remaining bits are dropped, there can be a significant information loss in this process, depending on the implementation algorithm. Thus puncturing can be seen as a tradeoff between video quality and compression ratio. However, when the closed-loop feedback mechanism between the encoder and the decoder for dynamic parity bit requests is considered, the video quality is externally controlled at the decoder; the puncturing then acts as an automated supporting mechanism for minimizing the bit rate, i.e. maximizing the compression.
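A minimal sketch of periodic symmetric puncturing (Python; the function name and the exact selection pattern are illustrative, since the pattern is implementation-dependent):

```python
def puncture(parity1, parity2, period):
    """Keep one parity bit from each constituent RSC stream every
    `period` positions; all other parity bits are dropped."""
    kept = []
    for i in range(0, len(parity1), period):
        kept.append(parity1[i])
        kept.append(parity2[i])
    return kept

# With period 4, only 2 of every 8 parity bits survive (a 4:1 reduction).
```

In a feedback-channel codec, successive parity requests would release further (previously punctured) positions from the parity buffer.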
4.2.7 Side Information
Side information generation, performed at the DVC decoder, exploits the temporal and spatial correlations in the frames of the sequence in the vicinity of the target frame to produce an estimation of the target frame. The parity stream generated in the DVC encoder travels through the channel to the decoder and is then fed to the turbo decoder together with the estimated side information. The estimation process makes use of the intra-decoded key frames. Assuming that Xm(i,j) is the current Wyner–Ziv (WZ) frame and Ym(i,j) is the correlated side information for the current frame:

Ym(i,j) = g(Xcoded,m−N2(i,j), Xcoded,m−(N2−1)(i,j), ..., Xcoded,m−1(i,j), Xcoded,m+1(i,j), ..., Xcoded,m+(N1−1)(i,j), Xcoded,m+N1(i,j))   (4.6)

where g(·) is a function describing the motion-compensated prediction carried out using N2 past reference frames and N1 future frames. The relationship between Ym(i,j) and Xm(i,j) can then be modeled with a noise term, nm(i,j), as:

Ym(i,j) = Xm(i,j) + nm(i,j)   (4.7)
(Reproduced by permission of 2007 EURASIP.) It can be shown that the noise term nm(i,j) can be approximated to an additive stationary white noise signal, if the motion estimation is accurate. For most cases, this noise process can be modeled using either a Gaussian or a Laplacian probability distribution [10]. In DVC codecs, the side information generation process generally makes use of the key frames to extract the temporal and spatial correlations used to estimate a correlated information stream corresponding to the Wyner–Ziv frame used in the DVC encoder, in order to practically implement the distributed source coding concept. A number of algorithms have been proposed in the literature for this purpose. Basic pixel interpolation and extrapolation techniques exploit the temporal correlations among the adjacent frames to estimate side information at pixel level [11,17]. These techniques have a very low computational cost. Interpolation techniques tend to yield closer estimation for the side information due to the bidirectional prediction, yet
tend to introduce additional delay due to the use of future frames. Motion estimation and compensation and motion field smoothening techniques have been proposed to significantly improve the quality of side information by estimating the motion vectors for the frame using the reference information in the adjacent key frames [12–14].
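As an illustration of the simplest of these techniques, the sketch below forms side information by pixel-wise bidirectional averaging of the two adjacent key frames, and fits the scale of a zero-mean Laplacian model for the residual noise of Equation (4.7). The function names are illustrative, and the averaging and moment-matching choices are assumptions for illustration, not the specific algorithms of [12–14]:

```python
def interpolate_side_info(prev_frame, next_frame):
    """Pixel-wise bidirectional interpolation: each side-information
    pixel is the average of the co-located pixels in the adjacent
    (decoded) key frames."""
    return [(a + b) // 2 for a, b in zip(prev_frame, next_frame)]

def laplacian_alpha(residuals):
    """Fit the scale alpha of a zero-mean Laplacian model for the noise
    term n in Equation (4.7) by moment matching: E|n| = 1/alpha."""
    mad = sum(abs(r) for r in residuals) / len(residuals)
    return float('inf') if mad == 0 else 1.0 / mad
```

The fitted scale would then parameterize the soft channel inputs of the turbo decoder described in the next section.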
4.2.8 Turbo Decoder
In the context of DVC, the turbo decoder plays the key role of correcting the errors in the side information stream, modeled using a hypothetical additive white noise channel as discussed above. The parity bitstream received from the encoder is used in the turbo decoder for this purpose. Turbo coding was proposed by Berrou et al. in 1993 for channel coding in communications [3]. This concept has been successfully adopted for DVC. The structure of the turbo decoder is illustrated in Figure 4.13. One iteration step can be divided into two stages. In the first stage, only soft-in–soft-out (SISO) decoder 1 is active. Soft channel inputs containing the received parity bits (Lcykl) from the first encoder and the systematic bits (side information, Lcyks) are fed to SISO decoder 1. In the case of rate-compatible punctured turbo (RCPT) codes, parity bits are punctured; thus at the receiver side, zeros are inserted at the punctured positions. In the first iteration there is no a priori information about the sent bits, so the a priori probabilities are equal and the log likelihood ratio (LLR) L(uk) is set to 0. The a priori LLR L(uk) and the systematic term Lcyks are subtracted from the soft output of SISO decoder 1, computed according to the maximum a posteriori (MAP) algorithm [15], to produce the extrinsic LLR Le(uk). The Le(uk) values are then interleaved and become the input to SISO decoder 2. In the second stage of the first iteration, SISO decoder 2 comes into operation. The input to SISO decoder 2 consists of the interleaved version of the soft channel output of systematic bits, the output of encoder 2, and the a priori information L(uk) that is available from the first stage of
Figure 4.13 Turbo decoder
each iteration. L(uk) is completely independent of the other information used by the second decoder. SISO decoder 2 then produces the a posteriori information L(uk|y). Lcyks and Lcykl are subtracted from this, yielding the extrinsic LLR Le(uk). Le(uk) is then de-interleaved and becomes the a priori information for the next iteration step. After each iteration, the number of error bits tends to reduce on average. However, the improvement in bit error rate (BER) falls as the number of iterations increases. Hence after a fixed number of iterations (normally four to eight), the iterative turbo decoder completes its decoding of one data block. Finally, hard decision decoding is performed to extract the decoder output by comparing the LLR L(uk) sequence against a zero threshold.

4.2.8.1 Statistical Distribution Estimation
As discussed above, the input to the turbo decoder in the DVC decoder is the composite of the locally-estimated side information and the parity bitstream received from the encoder. An estimation of the probability distribution of the residual error in the side information and the parity bitstreams is necessary in order to perform the turbo decoding operation. The wireless channel estimation techniques proposed in communications theory work well for the parity stream, but a practical solution to the problem of estimating the distribution of the noise in the side information remains an open challenge for researchers. The codecs discussed in the literature assume perfect estimation techniques given the availability of reference noise-free information.

4.2.8.2 Error Estimation
Dynamic estimation of the BER of the Wyner–Ziv decoder output is another very important function performed in the DVC decoder. The dynamic decisions to send parity request messages to the encoder are made on the basis of the output BER exceeding a predetermined threshold. Finding a practical solution to the problem of estimating the BER of the output remains an open problem.
A perfect error estimation algorithm, as proposed in Stanford's DVC codec [16], has been widely utilized in the related literature. The optimum setting of the BER threshold trades off bit rate against picture quality in the common closed-loop feedback DVC architecture.
4.2.9 Reconstruction: Inverse Quantization
The reconstruction of the decoded frame sequence is carried out to compensate for the quantization noise introduced at the DVC encoder. The output from the turbo decoder is passed through an inverse bit plane extraction stage to produce the decoded quantized pixels. The reconstruction module presented by Aaron et al. in [17] is widely used in the DVC codecs discussed in the literature. The decision criteria in this algorithm are illustrated in Figure 4.14, which shows the reconstruction function for four quantization levels. It is noted that this algorithm reconstructs each pixel individually. The reconstructed pixel value, X′i, is derived as X′i = E(Xi | qi, Yi), where Xi is the original unquantized pixel value, qi is the decoded quantized pixel value, and Yi is the corresponding unquantized pixel value from the side information stream. If the side information Yi is within the reconstruction bin corresponding to qi then X′i
Figure 4.14 Reconstruction function (four quantization levels)
takes the value of Yi. If Yi is outside the bin, the function clips the reconstruction to the boundary of the bin closest to Yi. When the accuracy of the side information is high, there will be a high hit rate into the appropriate reconstruction bin, so that this reconstruction algorithm yields very good results, with few pixels being clipped. However, with video sequences of higher motion, which generally yield less accurate side information, values will fall outside the bin more frequently. This can create an unpleasant viewing experience, particularly when the quantization is coarse.
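A per-pixel sketch of this clipping rule (Python; the function name is illustrative, and uniform bins over 0–255 are assumed, as in Figure 4.10):

```python
def reconstruct(q, y, m):
    """Reconstruct one pixel from its decoded quantization symbol q and
    side-information value y: keep y if it lies inside the bin of q,
    otherwise clip to the nearest bin boundary (Figure 4.14).
    Bins are uniform over 0-255 with 2^m levels."""
    step = 256 // (1 << m)
    low, high = q * step, (q + 1) * step - 1
    return min(max(y, low), high)
```

For example, with four levels (m = 2) and decoded symbol q = 1, the bin is [64, 127]: side information inside the bin passes through unchanged, while values outside it are clipped to 64 or 127.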
4.2.10 Key Frame Coding
DVC adopts distributed source coding principles. Thus the exploitation of source redundancies in the input video sequence, historically the major function of video encoders, has been moved to the decoder. Unlike when this function is performed at the encoder, the decoder does not have access to the full input frame sequence for this purpose. To fulfill the absolute necessity of reference information for this process, a key frame coding mechanism has been included in the DVC codec architecture. The key frames are selected from the input video sequence in a periodic pattern. The periodicity of the key frame selection is determined by the group of pictures (GOP) length of the codec. It is common to use a GOP length of 2 in DVC codec design, which designates alternate frames as key frames. A higher GOP length will reduce the key frame frequency. Key frame coding uses intra-frame coding techniques; hence it consumes a higher bit rate than Wyner–Ziv frame coding. On the other hand, a higher key frame frequency provides superior reference information for exploiting the source correlations at the DVC decoder. Therefore, determining the GOP length in DVC is, in general, a tradeoff between bit rate and video quality. Very little has been said in the DVC literature about key frame coding algorithms. Many authors have assumed the use of conventional inter-frame coding techniques to serve the purpose of making the key frames available at the decoder, simulating the respective codecs for
Wyner–Ziv frames. Naturally, this has a tendency to violate the underlying motives of DVC. A unique key frame coding algorithm using a Wyner–Ziv codec is proposed in [18]. In this algorithm, key frames are encoded as independent Wyner–Ziv (WZ-I) frames by exploiting the spatial correlations at the decoder, since temporal correlations are not available for obvious reasons. In the rest of this chapter, we propose some solutions to the existing problems in the DVC codec, and analyze the performance of the DVC codec in error-prone channels. Finally, we modify the DVC codec to handle these errors and propose some error-concealment techniques.
4.3 Stopping Criteria for a Feedback Channel-based Transform Domain Wyner–Ziv Video Codec
All DVC codecs assume that they have access to the original Wyner–Ziv frame in deciding to stop the turbo decoding process, i.e. the original binary sequence x is available at the decoder to compute BERr exactly. In this case, the optimal number of requests r is obtained by monitoring the value of BERr. That is:

r* = arg min_r [BERr < t]   (4.8)

where t is a threshold that indicates the acceptable residual BER. In the literature, t is typically set to 10^-3 for QCIF video sequences. (Reproduced by permission of 2007 EURASIP.) However, this is a hypothetical assumption which is not possible in any practical DVC codec. In the turbo coding literature (channel coding scenario), methods for early stopping in turbo decoding have been studied [19,20]; the main target of these methods is a reduction in the number of turbo decoder iterations, decreasing the computational complexity of the decoding process. In this section, efficient request-stopping criteria are proposed to control the number of requests for additional parity bits made by the decoder in the context of Wyner–Ziv video coding. Although the proposed criteria are based on the available turbo coding literature, they go a step further, since they have to be adapted to a video coding context (source coding scenario), which was not done in the earlier implementations.
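Under this hypothetical ideal error detection, the request loop reduces to the following sketch (Python; the function name and the list-of-residual-BERs interface are illustrative, not from the text):

```python
def ideal_request_count(ber_after_request, t=1e-3):
    """Equation (4.8) with ideal error detection: return the index of the
    first parity request after which the residual BER drops below t."""
    for r, ber in enumerate(ber_after_request):
        if ber < t:
            return r
    return len(ber_after_request)  # threshold never reached
```

The criteria proposed below replace the exact BERr, which a practical decoder cannot compute, with confidence measures derived from the decoder's own soft outputs.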
4.3.1 Proposed Technical Solution
(Portions reprinted, with permission, from M. Tagliasacchi, J. Pedro, F. Pereira, S. Tubaro, ‘‘An efficient request stopping method at the turbo decoder in distributed video coding’’, EURASIP European Signal Processing Conference, Poznan, Poland, September 2007. 2007 EURASIP.)
In this section, an algorithm that allows estimation of the optimal number of parity bit requests, without access to the original sequence x, is proposed. To this end, it is instructive to provide further details about the turbo decoding algorithm. Each SISO decoder uses the log-MAP algorithm to determine the logarithm of the a posteriori probability (LAPP) ratio. Let LAPPr[u(i)] denote the LAPP ratio computed after the rth parity bit request:

LAPPr[u(i)] = log ( Pr{u(i) = 1 | yp^r, p(x,y)} / Pr{u(i) = 0 | yp^r, p(x,y)} )   (4.9)

where u(i) is the ith bit to be decoded. (Reproduced by permission of 2007 EURASIP.)
The SISO decoders iterate a predefined number of times, and at each iteration the SISO decoders exchange extrinsic soft information between themselves. After the last iteration, the final LAPP ratio is determined, and a hard decision is made based on the sign of the LAPP ratio. That is:

ur(i) = 1 if LAPPr[u(i)] > 0;  ur(i) = 0 if LAPPr[u(i)] ≤ 0   (4.10)

One of the methods proposed in the turbo coding literature for early stopping and error detection consists of monitoring the mean of the LAPP values [19], i.e. the mean absolute log-likelihood ratio (MALR) for the decoded bits over a bitplane, defined by:

MALRr = (1/N) Σ_{i=0}^{N−1} |LAPPr[u(i)]|   (4.11)
which represents the sample mean of the absolute values of the LAPP ratios, computed across the N sequence samples. Intuitively, the higher the absolute value |LAPPr[u(i)]| for a given sample i, the more confident the turbo decoder is in making the hard decision in (4.10). The following request-stopping criterion is proposed: if MALRr > TM then stop requesting parity bits. While in [19] the MALR criterion is exploited for early stopping (i.e. to determine when additional decoder inner iterations result in little or no improvement in the bit error probability at the output of the turbo decoder), thus saving processing time, here it is adopted to stop the parity bit requests, thus saving bit rate. This means that in Wyner–Ziv coding the MALR value is to be monitored at the end of the decoding process following each request by the decoder. The second request-stopping criterion proposed here is based on comparing the signs of the extrinsic information of the two SISOs, after the turbo decoder completely processes a parity bit request. In fact, the LAPP ratio can be factored as:

LAPPr[u(i)] = Lc[u(i)] + L12^r[u(i)] + L21^r[u(i)]   (4.12)

where Lc[u(i)] is the a priori term which depends on the channel statistics p(x,y), specifically:

Lc[u(i)] = log ( Pr{x(i) = 1 | y(i)} / Pr{x(i) = 0 | y(i)} )   (4.13)

and Lij^r[u(i)] is the extrinsic information passed from decoder i to decoder j after the rth parity bit request. The percentage of sign changes (PSCS) in the extrinsic information between SISOs in a bitplane is defined as:

SCr[u(i)] = 1 if sign(L12^r[u(i)]) ≠ sign(L21^r[u(i)]);  0 otherwise   (4.14)

PSCSr = (100/N) Σ_{i=0}^{N−1} SCr[u(i)]   (4.15)
(Equation (4.15) reproduced by permission of 2007 IEEE.) Equation (4.14) states that a sign change exists if the two pieces of extrinsic information have different signs. Equation (4.15) simply accumulates the sign change (SC) values, where N is the sequence length. The intuitive foundation of this request-stopping criterion is that the more the two SISO decoders agree in their hard decisions, the higher the confidence in the final decision must be – see Equation (4.10) – and hence the smaller the residual BER. It can be observed that the residual BERr drops to zero, together with PSCSr, after the same number of requests. This stopping criterion is simpler and computationally more efficient than the criterion based on the cross-entropy between the distributions of the estimates at the output of the SISO decoders at each iteration. Therefore, a second request-stopping criterion is proposed: if PSCSr < TP then stop requesting parity bits. Since it is not obvious whether this criterion performs better than the previously proposed one, they will be compared later. In the next section, it will be shown that under-requesting may have a strong negative subjective impact, since more errors will be left in the decoded video. Therefore, it is relevant to consider a more conservative request-stopping criterion which combines the two previously presented ones. Thus, the following combined MALR–PSCS (CMP) request-stopping criterion is also proposed: if MALRr > TM and PSCSr < TP then stop requesting parity bits. The selection of the thresholds used in the proposed criteria is investigated in the next section.
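The three criteria can be sketched as follows (Python; the function names are illustrative, and treating a zero LLR as non-positive in the sign comparison is an implementation assumption):

```python
def malr(lapp):
    """Mean absolute LLR over a bitplane, Equation (4.11)."""
    return sum(abs(v) for v in lapp) / len(lapp)

def pscs(l12, l21):
    """Percentage of sign changes between the extrinsic information of
    the two SISO decoders, Equations (4.14)-(4.15)."""
    changes = sum(1 for a, b in zip(l12, l21) if (a > 0) != (b > 0))
    return 100.0 * changes / len(l12)

def stop_requests(lapp, l12, l21, t_m, t_p):
    """Combined MALR-PSCS (CMP) criterion: stop requesting parity bits
    only when the decoder is confident and the two SISOs agree."""
    return malr(lapp) > t_m and pscs(l12, l21) < t_p
```

The individual MALR and PSCS criteria are obtained by applying only the first or only the second condition, respectively.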
4.3.2 Performance Evaluation

The goal is to investigate and compare the effect, in terms of RD performance, of the stopping criteria proposed in the previous section, for both the PDWZ and the TDWZ codec architectures. For the PDWZ scenario, the test conditions adopted were: Foreman and Coastguard sequences, at QCIF resolution and 15 fps; GOP size set to 2 frames; at QCIF resolution, the block length is equal to N = 176 × 144 = 25344 and the puncturing period P has been set equal to 48; the key frames are encoded with H.264/AVC intra (Main Profile), with the quantization steps chosen through an iterative process to achieve an average intra-frame quality (in peak signal-to-noise ratio (PSNR)) similar to the average WZ frame quality. Figure 4.15 shows the RD results for the Foreman and Coastguard sequences. It is easy to conclude that all the proposed request-stopping criteria, MALR, PSCS, and CMP, achieve essentially the same coding efficiency as ideal error detection. The gap from ideal error detection is less than 0.1 dB at high bit rates. The values of the thresholds TM and TP have been empirically tuned to 105 and 1, respectively, in order to achieve a virtually zero residual BER at the cost of a negligible rate overhead.
Figure 4.15 RD results for the PDWZ codec: (a) Foreman; (b) Coastguard. Reproduced by Permission of © 2007 IEEE
In the TDWZ coding scenario, most of the test conditions are the same as for the PDWZ scenario, including the puncturing period P = 48; in this case, since the bitplanes come from DCT coefficients, quantization matrices are defined as in [21]. The RD curves for the three stopping criteria – MALR, PSCS and CMP – are shown in Figure 4.16. In terms of RD performance, the three proposed stopping criteria are very close to the curve obtained with the ideal stopping criterion. This means that the basic purpose of the proposed stopping criteria has been achieved: they provide a practical solution without sacrificing coding efficiency. Although very small, the most significant RD loss happens for the Hall Monitor sequence, where the PSCS criterion performs best. The thresholds used have been obtained empirically by maximizing the overall RD performance for the test sequences; for the results presented, TM = 75 and TP = 1.

Figure 4.16 RD results for the TDWZ codec: (a) Foreman; (b) Hall Monitor; (c) Coastguard; (d) Soccer. Reproduced by Permission of © 2007 IEEE
4.4 Rate-distortion Analysis of Motion-compensated Interpolation at the Decoder in Distributed Video Coding

(Portions reprinted, with permission, from "Rate-distortion analysis of motion-compensated interpolation at the decoder in distributed video coding", M. Tagliasacchi, L. Frigerio, S. Tubaro, IEEE Signal Processing Letters, Vol. 14, No. 9, September 2007. © 2007 IEEE.)

The goal of this activity is to introduce a model that allows the coding efficiency of DVC-based coding schemes to be studied. The analysis is restricted to schemes that compute the side information at the decoder by performing motion-compensated interpolation, starting from two intra-coded key frames. Specifically, the focus is only on the generation of the side information, neglecting other factors related to the channel coding tools that are typically used to replace conventional entropy coding. The proposed model is designed in two steps. First, for each Wyner–Ziv-coded frame, the displacement error variance introduced by motion-compensated interpolation is estimated. In fact, the true motion field is not directly available at the decoder, and estimating it introduces displacement estimation errors. Then the power spectral density of the motion-compensated prediction error is estimated to obtain the RD curves by inverse water-filling [22]. Armed with the proposed model, the tradeoff between motion-compensated interpolation accuracy and GOP size is investigated, in order to find the optimal GOP size for a target distortion.
4.4.1 Proposed Technical Solution

Consider a GOP of size N frames, encoded using either a conventional motion-compensated predictive codec or a DVC-based architecture. These schemes differ in the way the motion-compensated prediction (side information) ŝ(t) of the current frame s(t) is generated:

- Motion estimation at the encoder: ŝ(t) = ŝ_P(t) is obtained by performing motion estimation using s(t) as the current frame and the previously encoded frame s'(t − 1) as the reference frame (s' is the quantized version of s). An I-P-P-...-I GOP structure is assumed.
- Motion-compensated interpolation at the decoder: The decoder performs motion-compensated interpolation using the lossy coded key frames s'(t1) and s'(t2) only (t1 < t < t2) to generate ŝ(t) = ŝ_WZ(t). An I-WZ-WZ-...-I GOP is adopted. The decoding of any Wyner–Ziv frame requires both the previous and the next I frame to be decoded first.
If the distortion D is constrained to be constant along the GOP, the average rate R per frame can be computed as:

\[ R(D) = \frac{1}{N}\left[ R^I(D) + \sum_{i=1}^{N-1} R_i^{\{P,WZ\}}(D) \right] \qquad (4.16) \]
where R^I(D) is the contribution of the intra-coded frame and R_i^{{P,WZ}}(D) that of the ith inter-coded frame (for the case of motion-compensated prediction at the encoder or motion-compensated interpolation at the decoder). (Reproduced by Permission of © 2007 IEEE.) The RD curve R^I(D) is given by the following set of parametric equations [23]:

\[ D^I(u) = E\big[(s'-s)^2\big] = \frac{1}{4\pi^2}\iint_{\Lambda} \min\big[u,\,\Phi_{ss}(\Lambda)\big]\,d\Lambda \qquad (4.17) \]

\[ R^I(u) = \frac{1}{8\pi^2}\iint_{\Lambda} \max\left[0,\,\log_2\frac{\Phi_{ss}(\Lambda)}{u}\right] d\Lambda \qquad (4.18) \]
where Φ_ss(Λ), Λ = (ω_x, ω_y), is the spatial power spectral density (PSD) of the source and u > 0 is a real-valued parameter that allows movement along the RD curve. (Equations (4.17) and (4.18) reproduced by Permission of © 2007 IEEE.) Notice that when u < Φ_ss(Λ) ∀Λ, then D^I(u) = u. Therefore, u is proportional to the amount of distortion introduced by quantization.

In the following, the RD curves R^{P,WZ}(D) are derived adopting the framework introduced in [22]. To this end, let us denote the residual frame after motion-compensated prediction as e(t) = s(t) − ŝ(t), and define the spatial power spectral density of e(t) as Φ_ee(Λ). Let us consider a video signal that is described by a constant translatory displacement (d_x, d_y), and neglect any other effect, such as rotation, zoom, occlusion, illumination change, and so on. The approximate expression of Φ_ee(Λ) is given by [22]:

\[ \Phi_{ee}(\Lambda) \approx \begin{cases} \Phi_{ss}(\Lambda), & \text{if } \Phi_{ss}(\Lambda) < u \\ \max\big\{\tilde{\Phi}_{ee}(\Lambda),\,u\big\}, & \text{otherwise} \end{cases} \qquad (4.19) \]

\[ \tilde{\Phi}_{ee}(\Lambda) = 2\Phi_{ss}(\Lambda)\left(1 - e^{-\frac{1}{2}\left(\omega_x^2\sigma^2_{\Delta d_x} + \omega_y^2\sigma^2_{\Delta d_y}\right)}\right) + u \qquad (4.20) \]

where σ²_{Δd_c} denotes the variance of the displacement error Δd_c = d_c − d̂_c (c = x, y), which is assumed to be zero mean and Gaussian distributed. (Equations (4.19) and (4.20) reproduced by Permission of © 2007 IEEE.) The error is strictly connected to the way motion is estimated and represented, as will be detailed shortly. In [22], an approximation of the RD function is given by:

\[ D^{\{P,WZ\}}(u) = E\big[(e'-e)^2\big] = \frac{1}{4\pi^2}\iint_{\Lambda} \min\big[u,\,\Phi_{ss}(\Lambda)\big]\,d\Lambda \qquad (4.21) \]

\[ R^{\{P,WZ\}}(u) = \frac{1}{8\pi^2}\iint_{\Lambda:\,\Phi_{ss}(\Lambda)>u \,\wedge\, \tilde{\Phi}_{ee}(\Lambda)>u} \max\left[0,\,\log_2\frac{\tilde{\Phi}_{ee}(\Lambda)}{u}\right] d\Lambda \qquad (4.22) \]
(Equations (4.21) and (4.22) reproduced by Permission of © 2007 IEEE.) Observe that, in order to compute Equation (4.16), the values of the displacement error variances σ²_{Δd_x} and σ²_{Δd_y} need to be characterized for each frame in the GOP.
Assuming isotropic displacement errors, it is possible to state that, on average, σ²_{Δd_x} = σ²_{Δd_y} = σ²_{Δd}. Therefore, the coordinate index x,y will be dropped in the rest of this text. Two cases deserve to be analyzed:

- P frames: Motion estimation is performed at the encoder; it is assumed that the displacement error is solely due to the finite accuracy used to represent motion vectors (M = 1, 1/2, 1/4, ... pixels). Therefore, σ²_{Δd} = M²/12 for any frame in the GOP, as indicated in [22].
- WZ frames: Motion estimation is performed at the decoder between successive intra-coded key frames, and is then used to infer the motion for the intermediate WZ frames. In order to evaluate σ²_{Δd_i} for the ith frame, a model based on Kalman filtering, detailed in the following section, is proposed.
In the following, a state-space model according to the Kalman filtering framework is introduced. The time evolution of the true displacements is described by the state equation, and the noisy observation of the motion between two intra-coded key frames by the output equation. Specifically, the following state equation is introduced:

\[ d(t) = \rho\,d(t-1) + z(t) \qquad (4.23) \]

where d(t) is the true displacement that frame s(t) is subject to, ρ is the temporal correlation coefficient, and z(t) is zero-mean white noise of variance σ²_z (z(t) = WN(0, σ²_z)). (Reproduced by Permission of © 2007 IEEE.)

The variance of d(t) can be computed as σ²_d = σ²_z / (1 − ρ²). To gain some insight, σ²_d can be interpreted as an indication of the motion complexity: a high value of σ²_d suggests that large displacements are expected. On the other hand, ρ measures the temporal coherence of the motion field, for a given value of σ²_d; a value of ρ close to 1 indicates that motion has approximately uniform velocity along time.

In the proposed model, the motion-compensated interpolation process is viewed as an estimation of the displacements at times t, t − 1, ..., t − N + 1 (i.e. d̂(t), d̂(t − 1), ..., d̂(t − N + 1)), when only the motion o(t) between the key frames is observed:

\[ o(t) = d(t) + d(t-1) + d(t-2) + \ldots + d(t-N+1) + w(t) \qquad (4.24) \]

where w(t) is white noise WN(0, σ²_w) that takes into account the finite accuracy of the displacements (σ²_w = M²/12), as already explained for P frames. (Reproduced by Permission of © 2007 IEEE.)

Equation (4.24) expresses the fact that the true motion between two successive key frames is the sum of the displacements of the frames in between. The state-space model described by Equations (4.23) and (4.24) implies that a new observation o(t) is available at every time instant t. In practice, only one observation is available every N time instants, where N is the GOP size. A more accurate model for the problem at hand is obtained by relating the increment of the time variable to intra-frames only: with the change of variables τ = t/N, the state-space model is rewritten in the new time unit τ. For the sake of simplicity, consider a GOP of N = 3 frames (see Figure 4.17). At time τ, the intra-coded frames s(τ) and s(τ − 1) are used to compute the displacement o(τ). WZ frames are defined at intermediate fractional times τ − 1 + k1 and τ − 1 + k2 (ki = i/N).
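The state equation (4.23) and the per-GOP observation (4.24) can be simulated numerically. The sketch below (an illustration, not the chapter's experiment) checks the stationary variance σ²_d = σ²_z/(1 − ρ²) empirically and forms one noisy observation per GOP of N = 3 frames:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma_z, N = 0.9, 1.0, 3   # temporal correlation, noise std, GOP size
T = 200_000

# State equation (4.23): d(t) = rho*d(t-1) + z(t)
z = rng.normal(0.0, sigma_z, T)
d = np.empty(T)
d[0] = rng.normal(0.0, sigma_z / np.sqrt(1 - rho**2))  # start in steady state
for t in range(1, T):
    d[t] = rho * d[t - 1] + z[t]

# Stationary variance should match sigma_d^2 = sigma_z^2 / (1 - rho^2)
print(d.var(), sigma_z**2 / (1 - rho**2))

# Output equation (4.24): one noisy observation per GOP, summing the
# displacements of the N frames in between; w models 1/4-pel accuracy.
M = 0.25
w = rng.normal(0.0, np.sqrt(M**2 / 12), T // N)
o = d[: (T // N) * N].reshape(-1, N).sum(axis=1) + w
print(o.shape)
```

The two printed variances agree to within sampling error, confirming the AR(1) interpretation of σ²_d as motion complexity.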
Figure 4.17 Motion-compensated interpolation, with time step τ referred to the evolution of the intra-coded key frames. Reproduced by Permission of © 2007 IEEE
Exploiting the autoregressive model (4.23), and denoting d_i(τ) = d(τ − 1 + k_i) and z_i(τ) = z(τ − 1 + k_i), the following model is obtained:

\[ \begin{aligned} d_1(\tau) &= \rho\,d(\tau-1) + z_1(\tau) \\ d_2(\tau) &= \rho^2 d(\tau-1) + \rho\,z_1(\tau) + z_2(\tau) \\ d(\tau) &= \rho^3 d(\tau-1) + \rho^2 z_1(\tau) + \rho\,z_2(\tau) + z(\tau) \\ o(\tau) &= d_1(\tau) + d_2(\tau) + d(\tau) + w(\tau) \end{aligned} \qquad (4.25) \]
(Reproduced by Permission of © 2007 IEEE.) Collecting the displacements into the state vector 𝐝(τ) = [d_1(τ), d_2(τ), d(τ)]ᵀ, this can be written in the canonical form prescribed by Kalman filtering:

\[ \mathbf{d}(\tau) = F\,\mathbf{d}(\tau-1) + \mathbf{v}_1(\tau) \qquad (4.26) \]

\[ o(\tau) = H\,\mathbf{d}(\tau-1) + v_2(\tau) \qquad (4.27) \]
(Equations (4.26) and (4.27) reproduced by Permission of © 2007 IEEE.) Going back to the original problem, the variances of the displacement errors σ²_{Δd_i} of the ith WZ frame in the GOP have to be obtained. Consider d̂(τ|τ − 1), i.e. the estimate of the state vector 𝐝(τ) computed at time τ with the data available up to time τ − 1. Kalman theory relates the variance of the error on the state of the Kalman predictor (Δ𝐝(τ|τ − 1) = 𝐝(τ) − d̂(τ|τ − 1)) at time τ to that at time τ − 1 via the RDE (Riccati difference equation):

\[ P(\tau+1) = F P(\tau) F^T + V_1 - K(\tau)\big(H P(\tau) H^T + V_2\big) K^T(\tau) \qquad (4.28) \]

where P(τ) = E[Δ𝐝(τ|τ − 1) Δ𝐝ᵀ(τ|τ − 1)] and the Kalman gain K(τ) is defined as K(τ) = (F P(τ) Hᵀ + V₁₂)(H P(τ) Hᵀ + V₂)⁻¹. (Reproduced by Permission of © 2007 IEEE.) When the observation at time τ is available, in addition to that at time τ − 1, the variance of the error on the state (Δ𝐝(τ|τ) = 𝐝(τ) − d̂(τ|τ)) of the Kalman filter must be considered, instead of that of the Kalman predictor:

\[ E\big[\Delta \mathbf{d}(\tau|\tau)\,\Delta \mathbf{d}^T(\tau|\tau)\big] = P_{filt}(\tau) = P(\tau) - P(\tau) H^T \big[H P(\tau) H^T + V_2\big]^{-1} H P(\tau) \qquad (4.29) \]

(Reprinted by Permission of © 2008 IEICE.)
In Equation (4.28), upon convergence, P(τ + 1) = P(τ) = P̄. Substituting P̄ into Equation (4.28), the ARE (algebraic Riccati equation) is obtained, which can be solved for P̄. The values of the matrix P_filt upon convergence are obtained by substituting P̄ into Equation (4.29). The diagonal values of the matrix P_filt correspond to the variances of the displacement errors σ²_{Δd_i} of the WZ frames in the GOP. Intuitively, each σ²_{Δd_i} value represents the displacement error between the true motion and the estimated motion for the ith frame, which is needed to compute Equation (4.19). The average rate can then be computed according to Equation (4.16).
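As a numerical sketch of Equations (4.25)–(4.29), the steady-state error covariance can be obtained by simply iterating the Riccati recursion instead of solving the ARE in closed form. For simplicity, the sketch below writes the observation directly as o(τ) = [1 1 1] 𝐝(τ) + w(τ), a slight rearrangement that makes the noise cross-covariance V₁₂ vanish; the matrices F and V₁ follow from Equation (4.25). This is an illustrative assumption, not the chapter's exact derivation.

```python
import numpy as np

def displacement_error_variances(rho, sigma_z2, sigma_w2, iters=500):
    """Steady-state error variances sigma^2_{Delta d_i} for an N = 3 GOP.

    Sketch of Equations (4.26)-(4.29): the state is d = [d1, d2, d]^T from
    Equation (4.25); the observation is simplified to o = [1 1 1] d + w,
    so the cross-covariance V12 is zero and the recursion is standard."""
    rho2, rho3 = rho**2, rho**3
    # State transition of (4.25): every component is driven by d(tau-1)
    F = np.array([[0.0, 0.0, rho],
                  [0.0, 0.0, rho2],
                  [0.0, 0.0, rho3]])
    # Process noise v1 = L [z1, z2, z]^T, with iid zi of variance sigma_z2
    L = np.array([[1.0,  0.0, 0.0],
                  [rho,  1.0, 0.0],
                  [rho2, rho, 1.0]])
    V1 = sigma_z2 * L @ L.T
    H = np.ones((1, 3))          # o = d1 + d2 + d (+ w)
    V2 = np.array([[sigma_w2]])  # observation noise, M^2/12 accuracy term

    P = V1.copy()
    for _ in range(iters):       # iterate the Riccati recursion (4.28)
        S = H @ P @ H.T + V2
        P = F @ P @ F.T + V1 - F @ P @ H.T @ np.linalg.inv(S) @ H @ P @ F.T
    # Filtered covariance, Equation (4.29)
    S = H @ P @ H.T + V2
    P_filt = P - P @ H.T @ np.linalg.inv(S) @ H @ P
    return np.diag(P_filt)       # sigma^2_{Delta d_i}, one per frame

print(displacement_error_variances(rho=0.9, sigma_z2=1.0,
                                   sigma_w2=0.25**2 / 12))
```

Since the eigenvalues of F are {0, 0, ρ³} with |ρ| < 1, the recursion converges and fixed-point iteration is a perfectly adequate ARE solver here.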
4.4.2 Performance Evaluation

The first experiment shows how the average displacement error variance varies as a function of the state-space model parameters ρ and σ²_d. Figure 4.18 illustrates the average value of σ²_{Δd} across the GOP for two different GOP sizes. Notice that, for a given motion complexity σ²_d, the displacement error variance decreases as ρ → 1. This is expected, because higher temporal correlation of the motion fields makes the motion-compensated interpolation task easier. On the other hand, once the correlation coefficient ρ is kept constant, the displacement error variance increases with σ²_d. This means that it is usually harder to interpolate motion fields characterized by complex motion. By comparing Figure 4.18(a) with Figure 4.18(b), notice that, for any value of ρ and σ²_d, increasing the GOP size also increases the displacement error variance. Thus, using longer GOP sizes reduces the number of intra-coded frames, but at the same time motion-compensated interpolation generates worse side information. This implies that there is a tradeoff between motion accuracy and intra-frame period that must be sought in order to optimize the overall RD function. This is further investigated in the following experiments.

In order to obtain realistic values of ρ and σ²_d to be used in the model simulations, they are computed for some test sequences. Motion estimation is performed with 1/4 pixel accuracy, and the parameters of the AR(1) model (4.23) that best fit the estimated motion vectors along the motion trajectories are obtained. Figure 4.19(a)–(c) depicts the RD curves obtained according to Equation (4.16), indicating the estimated parameters ρ and σ²_d of the AR(1) model (4.23) for the test sequences. The curves
Figure 4.18 Average displacement error variance σ²_{Δd}: (a) GOP size = 2; (b) GOP size = 8
Figure 4.19 (a)–(c) RD curves obtained with the proposed model (each plot indicates the values of ρ and σ²_d used to obtain the curves); (d)–(f) RD curves obtained for the test sequences Salesman, Coastguard and Foreman, respectively
are calculated according to the following steps:

- Set the GOP size N, the motion estimation accuracy σ²_w, the state-space parameters (σ²_d, ρ), and the spatial power spectral density function (the isotropic PSD Φ_ss suggested in [22] is used).
- Obtain the displacement error variances σ²_{Δd_i} from the diagonal of the matrix P_filt in Equation (4.29).
- For each value of u, compute R^I(u), D^I(u) for the first frame of the GOP (intra-coded key frame) using Equations (4.17) and (4.18).
- For each Wyner–Ziv frame i = 2, ..., N: obtain the power spectral density of the prediction error Φ̃_ee(Λ), given σ²_{Δd_i} and Φ_ss(Λ); then compute the RD point corresponding to u using Equations (4.21) and (4.22).
- Compute the average RD point according to Equation (4.16).
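The steps above can be sketched numerically as follows. The isotropic PSD used here is a simple stand-in assumption (not necessarily the exact form suggested in [22]), and a discrete frequency grid over [−π, π]² replaces the double integrals:

```python
import numpy as np

def rd_point(theta, sigma_dd2_list, omega0=1.0, sigma_s2=1.0, K=256):
    """One RD point per Equation (4.16): intra frame via (4.17)-(4.18),
    WZ frames via (4.19)-(4.22). sigma_dd2_list holds one displacement
    error variance per WZ frame; the PSD shape below is an assumption."""
    w = np.linspace(-np.pi, np.pi, K, endpoint=False)
    wx, wy = np.meshgrid(w, w)
    dA = (2 * np.pi / K) ** 2
    phi_ss = sigma_s2 / (1 + (wx**2 + wy**2) / omega0**2) ** 1.5
    # normalize so that (1/4pi^2) * integral(phi_ss) = sigma_s2
    phi_ss *= sigma_s2 / (np.sum(phi_ss) * dA / (4 * np.pi**2))

    # Intra frame, Equations (4.17)-(4.18)
    D = np.sum(np.minimum(theta, phi_ss)) * dA / (4 * np.pi**2)
    R_I = np.sum(np.maximum(0.0, np.log2(phi_ss / theta))) * dA / (8 * np.pi**2)

    rates = [R_I]
    for s2 in sigma_dd2_list:
        # Equation (4.20): PSD of the motion-compensated prediction error
        phi_ee = 2 * phi_ss * (1 - np.exp(-0.5 * (wx**2 + wy**2) * s2)) + theta
        mask = (phi_ss > theta) & (phi_ee > theta)  # region of (4.22)
        R_wz = (np.sum(np.maximum(0.0, np.log2(phi_ee / theta))[mask])
                * dA / (8 * np.pi**2))
        rates.append(R_wz)
    return np.mean(rates), D      # average rate (4.16), distortion

R, D = rd_point(theta=0.01, sigma_dd2_list=[0.05, 0.1])
print(R, D)
```

Sweeping the parameter `theta` (the u of Equations (4.17)–(4.22)) traces out the full RD curve; a larger displacement error variance inflates Φ̃_ee and therefore the WZ rate, reproducing the tradeoff discussed in the text.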
The INTER curve is obtained with Equations (4.21) and (4.22), where σ²_{Δd} = M²/12 (M = 1/4) regardless of the frame index. Figure 4.19(a)–(c) shows that motion-compensated interpolation is unable to achieve the coding efficiency of conventional motion-compensated prediction at the encoder for the studied sequences. Therefore, the lack of the original frame when generating the side information introduces a coding efficiency loss. In addition, the optimal GOP size might depend on the target distortion. At high bit rates, shorter GOP sizes are usually preferred. In fact, high frequencies are preserved, and accurate displacement estimation is needed to reduce the energy of the prediction error. As observed in the first experiment, as the GOP size increases the displacement error variance also increases, thus impairing the accuracy of displacement estimation. At low bit rates, on the other hand, quantization filters out high frequencies, and therefore a higher displacement error variance can be tolerated. This implies that the GOP size can be increased to reduce the number of intra-coded key frames. The optimal GOP size depends on the underlying motion statistics. For sequences characterized by simple and temporally-coherent motion, like Salesman, the proposed model suggests that the optimal GOP size can be as large as 16–32 frames. As the motion complexity increases (σ²_d increases), and the motion temporal coherence vanishes (ρ decreases), the optimal GOP size can be as small as 1–2 frames (see Figure 4.19(c)). Therefore, for sequences characterized by complex motion, like Foreman, it can even happen that pure intra-frame coding (i.e. GOP size equal to 1) outperforms Wyner–Ziv coding. In order to validate the proposed model, the RD functions for some test sequences (Salesman: 192 frames; Coastguard: 128 frames; Foreman: 192 frames) at QCIF resolution and 15 fps are obtained. Results are provided for H.264/AVC, using either I-slices (I-I-I) or P-slices (I-P-P, GOP size 32).
For the other curves, the motion-compensated interpolation algorithm described in [24] is adopted, where the minimum block size is set equal to 4 × 4. In order to isolate the impact of the generation of the side information alone, turbo coding is replaced with conventional DCT-based intra-frame entropy coding of the prediction residuals, as in H.264/AVC. Therefore, results for a pseudo DVC-based coding architecture are provided, where other design parameters that might affect the coding efficiency (i.e. correlation channel estimation, stopping criteria for turbo decoding, encoder-side rate control) are explicitly singled out. In other words, the results provided can be interpreted as upper bounds that can be achieved if channel coding tools match the same performance as conventional entropy coding, when the former are used for source coding.
By comparing Figure 4.19(a)–(c) with Figure 4.19(d)–(f), it is possible to notice that the coding efficiency of motion-compensated interpolation at the decoder falls in between intra- and inter-frame coding. Sometimes, it even falls below the intra-frame coding curve, for long GOP sizes and sequences characterized by complex motion. Nevertheless, the coding efficiency of inter-frame coding is never achieved, suggesting that the lack of the current frame when generating the side information introduces a significant coding efficiency loss with respect to conventional motion-compensated predictive coding. In addition, the proposed model provides a quite accurate indication of the optimal GOP size for each of the tested sequences (16 for Salesman, 4 for Coastguard, and 1 (i.e. pure intra-frame coding) for Foreman). The difference between different GOP sizes can be better appreciated at high bit rates, as suggested by the proposed model.
4.5 Nonlinear Quantization Technique for Distributed Video Coding

(Portions reprinted, with permission, from M.B. Badem, W.A.R.J. Weerakkody, W.A.C. Fernando, A.M. Kondoz, "Design of a non-linear quantizer for transform domain DVC", IEICE.)

So far, a linear quantizer has been used in all DVC codecs. The linear quantizer has some limitations in fully exploiting the correlations of the Wyner–Ziv frames. In this section, a nonlinear quantization algorithm is proposed for DVC in order to improve the RD performance. The proposed solution is expected to exploit the dominant contribution to the picture quality of the relatively small coefficients, given the high concentration of coefficients near zero that is evident when the residual input video signal for the Wyner–Ziv frames is considered in the transform domain. The performance of the proposed solution incorporating the nonlinear quantizer is compared with that of an existing transform domain DVC solution that uses a linear quantizer. The simulation results show a consistently improved RD performance at all bit rates when different test video sequences with varying motion levels are considered. The objective of this work is to propose a novel nonlinear quantization algorithm for DVC, in contrast to the linear quantizer that has traditionally been used in common transform domain DVC implementations. It is noted that when the residual image is considered, obtained by taking the incremental frame difference with respect to the preceding reference frame, all transform coefficients demonstrate a similar probability density function, with a contrastingly high density near the zero mean. The vast majority of the elements are concentrated in the close vicinity of this peak.
The motivation for the nonlinear quantization approach discussed in this work is to exploit this probability distribution, so that the small values, which make up a major portion of the input residual signal, are represented with higher precision, yielding better picture quality and, in effect, improved compression efficiency.
4.5.1 Proposed Technical Solution

The proposed quantization algorithm builds upon the base hypothesis that each sample element in the input sequence has an unequal contribution to the output video quality, depending on the relative magnitude of the element. More specifically, in the case of transform domain DVC, it is assumed that the final decoded picture quality is more sensitive to the small incremental variations of the current frame with respect to the reference frame than to the gross variations, primarily considering the very high probability density of the minor variations. The probability density function of the DC coefficients of the DCT transformed residual video signal,
calculated considering the incremental variations of the current Wyner–Ziv frame compared to the reference frame (key frame), is illustrated in Figure 4.20. In Figure 4.20, the dotted line depicts the pdf of the DC coefficients of the original Wyner–Ziv frames, while the solid line represents the same for the residual signal, considering the incremental difference with respect to the predecessor key frame, with a GOP size of 2. The contrasting peak in the latter curve shows the concentration of the information content in the very small signal magnitudes in the residual signal. In this work, a nonlinear quantization algorithm is proposed to exploit this concentrated distribution of the information. In the proposed algorithm, the nonlinear quantization is implemented by passing each DCT coefficient band through a nonlinear transformation module incorporated into the quantizer. The reverse process is performed at the reconstruction module in the decoder. The nonlinear transformation function used to define variable step sizes for the quantizer in the proposed solution is given by: y ¼ b signðxÞð1 e c:absðxÞ=b Þ
ð4:30Þ
where x and y represent the input and output of the nonlinear transformation, respectively, and b and c are constants that set the level of nonlinearity. (Reprinted by Permission of © 2008 IEICE.) Figure 4.21 gives a graphical representation of this transform function. The variable quantization bin sizes derived by the nonlinear quantization are illustrated in Figure 4.22, where W represents the quantization step size of the linear quantizer. The reconstruction function is also modified to incorporate the inverse nonlinear transformation,
Figure 4.21 Nonlinear transform function. Reproduced by Permission of © 2008 IEICE

Figure 4.22 Comparison of linear and nonlinear quantization bins: (a) fixed step sizes in linear quantization; (b) variable step sizes in nonlinear quantization. Reproduced by Permission of © 2008 IEICE
as defined by:

\[ y = -\frac{b}{c}\,\operatorname{sign}(x)\,\ln\left(1 - \frac{|x|}{b}\right) \qquad (4.31) \]
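Equations (4.30) and (4.31) form an exact forward/inverse pair, which can be checked directly. The small sketch below uses the b = 500, c = 1 values reported in Section 4.5.2:

```python
import numpy as np

b, c = 500.0, 1.0   # nonlinearity parameters (values used in Section 4.5.2)

def forward(x):
    """Nonlinear transformation of Equation (4.30)."""
    x = np.asarray(x, dtype=float)
    return b * np.sign(x) * (1 - np.exp(-c * np.abs(x) / b))

def inverse(y):
    """Inverse transformation of Equation (4.31)."""
    y = np.asarray(y, dtype=float)
    return -(b / c) * np.sign(y) * np.log(1 - np.abs(y) / b)

coeffs = np.array([-300.0, -20.0, -1.0, 0.0, 1.0, 20.0, 300.0])
y = forward(coeffs)
print(np.round(y, 2))   # small inputs pass almost unchanged; large ones
                        # are compressed towards +/- b
print(np.allclose(inverse(y), coeffs))   # True: exact round trip
```

Applying a uniform quantizer to the transformed values y is equivalent to quantizing x with variable bins: narrow near zero, where the residual pdf is concentrated, and wide in the tails, which is precisely the behavior illustrated in Figure 4.22.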
The proposed quantization algorithm is incorporated into the existing transform domain DVC framework [25]. The modified codec architecture is depicted in Figure 4.23. First, a frame difference operation is performed to derive the residual frames at both the encoder and the decoder. At the encoder, the difference between the current Wyner–Ziv frame (X2i) and the previous key frame (X2i−1) is taken. The resulting residual frame is 4 × 4 blockwise DCT transformed. Subframes are formed from the 16 DCT coefficient bands, and the nonlinear transform and quantization are then performed for each. At the decoder, the side information is obtained by taking the difference between the motion-interpolated frame (Y2i) and the previous key frame (X2i−1). Similarly to the process performed at the encoder, the DCT band coefficients are grouped together, nonlinear transformed, and quantized. This bitstream is fed into the turbo decoder and the reconstruction block. After decoding and reconstruction, the inverse nonlinear transform and the IDCT are performed. Finally, the previous frame (X2i−1) is added back and the reconstructed frame X2i is obtained.
Figure 4.23 Proposed codec architecture. Reproduced by Permission of © 2008 IEICE
4.5.2 Performance Evaluation

The RD performance of the proposed nonlinear quantization in transform domain DVC is tested for a number of standard test video sequences (QCIF: 176 × 144); experimental results for the Foreman, Mobile and Salesman test video sequences are elaborated in this section. These sequences are selected to represent different motion levels: Foreman and Mobile with medium-to-high, and Salesman with low-to-medium motion levels. The first 100 frames of each sequence are used for the simulation, 50 of which are Wyner–Ziv frames, since the GOP length is selected as 2. The bit rate is varied by independently controlling the granularity of the quantizer, using the quantization matrices shown in Figure 4.24.
Figure 4.24 Quantization matrices Q1–Q8. Reproduced by Permission of © 2008 IEICE
Figure 4.25 Performance comparison for the Foreman test sequence. Reprinted by Permission of © 2008 IEICE
A Wyner–Ziv frame rate of 15 fps is used and the results are shown for the Wyner–Ziv frames only. The PSNR is calculated for the luminance component of the frame. PSNR and bit rate are averaged over the sequence. The performance of the proposed algorithm is compared with that of the existing DVC codec, which uses a linear quantizer [25]. Figures 4.25, 4.26 and 4.27 illustrate the performance comparison of the proposed codec for the Foreman, Mobile and Salesman test
Figure 4.26 Performance comparison for the Mobile test sequence. Reprinted by Permission of © 2008 IEICE
Figure 4.27 Performance comparison for the Salesman test sequence. Reprinted by Permission of © 2008 IEICE
sequences, respectively. The b and c parameters of the nonlinear transformation were set to 500 and 1, respectively, based on experimental optimization. It can be observed that the proposed algorithm consistently improves the performance of the DVC codec by a notable margin over different bit rates and different motion levels in the input video sequence.
4.6 Symmetric Distributed Coding of Stereo Video Sequences

In this section, a new coding scheme based on distributed source coding principles for stereo sequences is proposed. The core of the architecture is based on the pixel domain Wyner–Ziv codec described before, and uses rate-compatible punctured turbo (RCPT) codes to correct the side information into the decoded frame. Similarly to the aforementioned works, the proposed algorithm exploits the inter-view redundancy at the decoder only. Unlike the previous works, however, it is fully symmetric, since neither of the two views needs to be chosen as the reference. Also, the quality of the generated side information for different GOP sizes and target distortions is investigated. This work exploits some of the ideas originally developed within VISNET I [26]. Nevertheless, the exploitation of inter-view dependency is novel, as is the study of the behavior of the fusion algorithm as a function of the GOP size and target distortion.
4.6.1 Proposed Technical Solution

Let us consider a scenario with two video sequences, X1 and X2, representing the same scene taken from different viewpoints. The coding architecture described in Section 4.2 could be applied independently to each of the two views, achieving very low encoding complexity. In this case, however, the inter-view redundancy would not be exploited.
Figure 4.28 Block diagram of the proposed algorithm
Figure 4.28 illustrates the block diagram of the proposed algorithm. Since the algorithm is fully symmetric with respect to the two cameras, only the processing of the first view is shown. The key frames X1(0) and X1(N) are intra-frame coded using H.264/AVC and transmitted to the decoder. At the encoder, the Wyner–Ziv frames X1(t) are split into two subsets A1(t) and B1(t), based on a checkerboard pattern. The same operation is symmetrically performed on the corresponding frame of the second view, X2(t), but with the roles of subsets A and B inverted. At the decoder, subset A1(t) is decoded first, using the side information Y1T(t) obtained by motion interpolation, thus exploiting temporal correlation. The decoded subset Â1(t) is used in two ways: first to improve the temporal side information and obtain Y1t(t), and second to perform view prediction and obtain Y1V(t). A fusion algorithm is then used to generate the mixed temporal/inter-view side information Y1tV(t), used to decode subset B1(t). The proposed algorithm can be detailed as follows:

1. Frame splitting: Let (x,y) be the coordinate values of a pixel. For the first view, X1(t):
2. 3. 4. 5.
6.
½xmod2xor½ðy þ 1Þmod2; thenðx; yÞ 2 A; elseðx; yÞ 2 B:
ð4:32Þ
Let us denote with A1(t) (B1(t)) the pixel values assumed by X1(t) in the pixel locations belonging to the set A (B). The dual assignment, with sets A and B swapped, is performed for the second view, X2(t). WZ encoding: The encoder processes the two subsets A1(t) and B1(t) independently, generating parity bits for each of them. ^ 1 ð0Þ Motion interpolation: The side information Y1T ðtÞ is obtained by motion-interpolating X ^ and X 1 ðNÞ according to the algorithm described in [27]. WZ decoding: Subset A1(t) is decoded as described in Section 4.2, using Y1T ðtÞ as side information. ^ 1 ðtÞ is spatially interpolated in the pixel locations Spatial interpolation: The frame X ^ 1 ðtÞ. The output is the interpolated frame corresponding to set B, using the decoded set A S Y1 ðtÞ. The simple nonlinear, adaptive algorithm proposed in [26] is used for this purpose. Motion estimation: The temporal side information is refined to obtain Y1t ðtÞ. Bi-directional ^ 1 ð0Þ block-based motion estimation is performed, setting Y1S ðtÞ as the current frame and X ^ and X 1 ðNÞ as the reference frames.
Visual Media Coding and Transmission
7. View prediction: Frame X1(t) is predicted based on frame X2(t) to obtain Y1V(t). To this end, block-based motion estimation is performed, setting Y1S(t) as the current frame and Y2S(t) as the reference frame. The geometric constraints of the camera setting are exploited by limiting the motion search to the epipolar lines, thus reducing the computational complexity of the motion search and increasing the accuracy of the prediction.
8. Fusion: The improved temporal side information Y1t(t) is combined with the view side information Y1V(t) on a pixel-by-pixel basis, in order to find the best side information. In an ideal (but unrealistic) setting, if the original frame were available, the optimal side information would select either Y1t(t) or Y1V(t), depending on which was closest to X1(t). That is:

   Y1tV(x,y,t) = Y1j*(x,y,t),   j* = arg min_{j = t,V} |X1(x,y,t) − Y1j(x,y,t)|²   (4.33)
In practice, the decoder does not have access to X1(t). Therefore, there is a need to infer j = t or j = V for each (x,y) in the set B from the available information, i.e. the already decoded set Â1(t). For each pixel (x,y) in B, let Nxy denote its four adjacent neighbors. The fusion algorithm proceeds as follows:

   i. Compute the difference between Y1t(t) and Â1(t) to estimate the temporal correlation noise:

      eT(x,y,t) = Σ_{(m,n)∈Nxy} |Y1t(m,n,t) − Â1(m,n,t)|²   (4.34)
   ii. Compute the difference between Y1V(t) and Â1(t) to estimate the inter-view correlation noise:

      eV(x,y,t) = Σ_{(m,n)∈Nxy} |Y1V(m,n,t) − Â1(m,n,t)|²   (4.35)
   iii. Generate the side information Y1tV(x,y,t):

      If eT(x,y,t) < a·eV(x,y,t) + b, then Y1tV(x,y,t) = Y1t(x,y,t); else Y1tV(x,y,t) = Y1V(x,y,t)   (4.36)
where a and b are properly-defined constants, as discussed below. The idea is that neither the temporal correlation nor the inter-view correlation is spatially stationary, but their statistics vary slowly across space. Therefore, it is possible to use the observed temporal (inter-view) correlation for neighboring, already decoded, pixels as an estimate of the actual temporal (inter-view) correlation for the current pixel. When eT(x,y,t) < a·eV(x,y,t) + b, the temporal side information is selected. This happens when no temporal occlusions and/or illumination changes occur. It also happens when the view prediction is of poor quality because of occlusions due to the scene geometry. When eT(x,y,t) ≥ a·eV(x,y,t) + b, the inter-view side information is selected. This occurs when the motion is complex and the temporal prediction is poor.
In order to obtain the optimal values of a and b, a linear classifier in the [eT, eV] feature space of each frame of the sequences Breakdancers and Exit is trained. The optimal value of b is always nearly equal to 0. Conversely, a assumes values in the range [0,6], with a strong peak at a = 1. Therefore, this value is used for all the simulations reported here. The proposed fusion algorithm differs from those presented in [28], [29]. In [28], [29], the decision is taken based on the decoded key frames X̂1(0) and X̂1(N), or on the intra-coded side views at the same time instant t. There is thus the problem of propagating the fusion decision to the actual frame to be decoded. In the proposed algorithm, part of the frame, Â1(t), is already available. Therefore, the decision is more robust, since it relies on data of the same frame without requiring any propagation.
9. WZ decoding: Subset B1(t) is decoded as described in Section 4.2, using Y1tV(t) as side information.
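The split-and-fuse machinery of steps 1 and 8 can be sketched compactly. The code below is a minimal illustration, assuming grayscale frames stored as NumPy arrays and the constants a = 1, b = 0 found by the classifier; the function names are ours, and the already-decoded subset A is taken as error free for simplicity.

```python
import numpy as np

# Fusion constants; a = 1 and b = 0 are the values found by the trained
# linear classifier discussed in the text.
a, b = 1.0, 0.0

def checkerboard_masks(h, w):
    """Equation (4.32): (x, y) is in A when [x mod 2] xor [(y+1) mod 2]."""
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    in_a = ((x % 2) ^ ((y + 1) % 2)) == 1
    return in_a, ~in_a

def fuse(y_t, y_v, decoded_a, in_a):
    """Pick temporal or inter-view side information per pixel of set B."""
    h, w = y_t.shape
    fused = y_t.copy()
    for yy in range(h):
        for xx in range(w):
            if in_a[yy, xx]:
                continue                 # set A is already decoded
            # neighbourhood errors against decoded set A, Eqs (4.34)-(4.35)
            e_t = e_v = 0.0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = yy + dy, xx + dx
                if 0 <= ny < h and 0 <= nx < w and in_a[ny, nx]:
                    e_t += (y_t[ny, nx] - decoded_a[ny, nx]) ** 2
                    e_v += (y_v[ny, nx] - decoded_a[ny, nx]) ** 2
            # Equation (4.36): temporal SI wins when its local error is lower
            fused[yy, xx] = y_t[yy, xx] if e_t < a * e_v + b else y_v[yy, xx]
    return fused
```

Where the temporal prediction matches the decoded A pixels in a neighborhood, Y1t is kept; where the inter-view prediction tracks them better, the rule switches to Y1V.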
4.6.2 Performance Evaluation

Several experiments were carried out in order to test the validity of the proposed algorithm. First, the quality of the side information was tested. The performance of the turbo decoding process heavily depends on the quality of the side information: intuitively, a higher number of parity bits will be requested by the decoder when the correlation is weak, as more errors need to be corrected. Results are provided for the Breakdancers (BR, 100 frames) and Exit (EX, 250 frames) sequences, considering only the two central views. The target distortion is determined by the quantization parameter QB ∈ [1,4], which indicates the number of bitplanes to be decoded. The temporal side information YT is obtained by motion-compensated interpolation of the lossy key frames. The key frames are intra-coded using H.264/AVC, setting a QPISlice parameter that depends on the target QB (the values 36, 35, 34, 33 are used in our experiments).

Table 4.1 Distortion (in dB) of the side information generated at the decoder: temporal YT, refined temporal Yt, inter-view YV, and fused YtV

Table 4.1 indicates the quality of the side information, measured in terms of PSNR (dB), averaged over the two views. The PSNR is computed only for pixels belonging to subsets B1,2,
since for subsets A1,2 the temporal side information YT is all that is available. In order to study the behavior with different GOP sizes, results are provided for GOPs of lengths 2 and 8.

A number of interesting conclusions can be drawn. The refined temporal side information Yt is always better than the original side information YT (ΔtT = Yt − YT > 0). This is especially true for low distortion levels (i.e. QB = 4): in this case, the high-frequency content of the image is retained, and the additional accuracy of the refined motion field is fully exploited. Keeping the QB value fixed, ΔtT increases for longer GOP sizes. This is due to the fact that motion-compensated interpolation is not accurate when the key frames are spaced far apart in time. For both sequences, the gain obtained by motion refinement can be as large as +4 dB.

The quality of the inter-view side information YV is strongly dependent on the camera baseline. In the Breakdancers sequence the two views are highly correlated, and YV improves over YT (but, by itself, it is worse than Yt). In the Exit sequence the baseline is wider, and the quality of YV obtained with our simple block-matching algorithm is rather poor.

In the previous section, an algorithm that fuses the temporal (Yt) and inter-view (YV) side information to obtain YtV was proposed. For the Breakdancers sequence, the gain ΔtVt = YtV − Yt increases with QB, lying in the range [+0.9 dB, +3.8 dB]. By increasing the GOP size, ΔtVt becomes slightly larger (by between 0 and +0.3 dB), since the temporal side information quality decreases. For the Exit sequence, it has been observed that the inter-view side information is of poor quality. Therefore, the proposed fusion algorithm is unable to outperform the refined temporal side information Yt, except for long GOP sizes and low distortion levels.
Even though the inter-view redundancy is not exploited in this case, the side information generated by the proposed algorithm is still largely better than YT, thanks to the temporal side information refinement. Figure 4.29 shows the RD curves obtained with the proposed coding scheme, using the side information YT, YV, and YtV respectively. By combining the enhanced temporal and inter-view side information, a coding gain between 0.5 and 4 dB can be obtained, with larger gains at high bit rates.
Figure 4.29 RD performance for the Breakdancers sequence, GOP = 2
4.7 Studying Error-resilience Performance for a Feedback Channel-based Transform Domain Wyner–Ziv Video Codec

(Portions reprinted, with permission, from J. Quintas Pedro, Luís Ducla Soares, Catarina Brites, João Ascenso, Fernando Pereira, Carlos Bandeirinha, Shuiming Ye, Frederic Dufaux, Touradj Ebrahimi, "Studying error resilience performance for a feedback channel based transform domain Wyner-Ziv video codec", PCS 2007. © 2007 EURASIP.)

The objective of this research activity is to study in detail the error-resilience performance of the transform domain Wyner–Ziv (TDWZ) codec developed in VISNET I for packet-based networks. The RD performance in the presence of channel errors will first be studied for different error conditions. More specifically, the way the TDWZ codec deals with errors in the two main components of the bitstream – key frames and WZ frames – will be studied in detail. In all scenarios, the feedback channel used by the decoder to ask the encoder for more parity information will be considered error free. This is a realistic assumption, because the total bit rate of the feedback channel is so low compared to the downstream bit rate that very strong channel coding could easily be used to guarantee that it carries no errors. In the following section the effect of channel errors on different parts of the bitstream will be considered, leaving the comparison with the state-of-the-art H.264/AVC codec for future investigations.
4.7.1 Proposed Technical Solution

The objective is to evaluate the error resilience of the various parts of the TDWZ codec stream in the presence of channel errors. This includes evaluating the error resilience of the Wyner–Ziv parts of the stream, as well as their dependence on reliable side information generated from the key frames. For this evaluation, four separate experiments will be performed:

- Corrupting key frames only: In the first experiment, errors will only be applied to the H.264/AVC part of the bitstream (i.e. the key frames), while the WZ part remains error free. This experiment allows evaluation of the importance of having good-quality side information at the decoder, which is derived from corrupted key frames. To create a more error-resilient stream, the H.264/AVC encoder exploits the available tools, notably flexible macroblock ordering (FMO), in this case using a checkerboard pattern. Since the H.264/AVC frames received at the decoder will be corrupted, error concealment will be necessary in order to improve the quality of the side information used at the decoder. In this experiment, the error-concealment algorithm included in the JM (version 11) software is used. Because only intra-coding is performed, the H.264/AVC intra error concealment is used; it is based on bi-dimensional interpolation of the missing samples from the spatially adjacent blocks. For this case, the RD performance will be evaluated both for the key frames alone and for the overall set of frames, in order to study error propagation from key frames to WZ frames.
- Corrupting WZ frames only: In the second experiment, errors will only be applied to the WZ part of the bitstream. This leaves the H.264/AVC part of the transmitted bitstream error free, thus making it possible to see how much the decoded video quality drops when the WZ frames are corrupted but the side information at the decoder is still intact. This experiment may not correspond to a realistic situation, because a significant amount of channel coding
would be needed to guarantee that the H.264/AVC part of the bitstream is delivered uncorrupted. Nevertheless, it is still useful for studying the robustness of the WZ frames. Two cases are considered:
  - Decoder side rate allocation: In this case, rate control is performed at the decoder. Therefore, a feedback channel is available and it is assumed that, whatever the packet size, the network protocol may ask for the retransmission of lost packets until they are correctly received. This implies that the WZ frame quality is always the same for each quantization matrix, with an increase in rate that depends on the packet losses. More precisely, the rate for each quantization matrix is computed as:

      Rate(PLR) = Rate_error-free × 1/(1 − PLR)   (4.37)

    since 1/(1 − PLR) is the average number of transmissions for a PLR in the 0–1 range. It is also assumed that the feedback channel is error free.
  - Encoder side rate allocation: In this case, it is supposed that the encoder performs ideal rate control, i.e. the number of requests for each bitplane in the error-free case is determined a priori and used for decoding the corresponding corrupted bitstream. Furthermore, if the bit error probability of the decoded bitplane is higher than 10^-3, the decoder uses the corresponding bitplane of the side information. The header of the WZ bitstream contains critical information: image size, quantization parameters, intra period, and so on. Therefore, the header is assumed to be received correctly. In addition, the feedback channel is assumed to be error free. The size of each packet is 512 bytes. The packet size may affect the error performance of the codec, so different packet sizes may be tested in a future evaluation.
- Corrupting key frames and WZ frames: Finally, a fourth experiment will be performed, where errors are applied to both the H.264/AVC and WZ frames. Decoder side rate control is assumed here. This situation will make it possible to study the behavior of the whole system when errors are applied in a more realistic way. In this case, the same error concealment as above will be applied for the key frames, and the RD performance will be evaluated for the overall set of frames.
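The rate inflation in Equation (4.37) is easy to evaluate. The sketch below assumes an arbitrary error-free rate of 200 kbps for one quantization matrix and prints the expected rate at the packet loss ratios used in the experiments.

```python
# Equation (4.37): with retransmission until success, the error-free rate
# is inflated by the mean number of transmissions, 1 / (1 - PLR).
rate_error_free = 200.0                    # kbps, illustrative value
for plr in (0.03, 0.05, 0.10, 0.20):
    rate = rate_error_free / (1.0 - plr)   # Rate(PLR)
    print(f"PLR = {plr:.0%}: {rate:.1f} kbps")
# at 20% loss: 200 / 0.8 = 250.0 kbps
```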
4.7.2 Performance Evaluation

This section evaluates the error-resilience performance of the adopted DVC codec, notably with video sequences that represent different types of video content related to several applications. The tests use four sequences: Foreman (with Siemens logo), Hall Monitor, Coastguard, and Soccer (see Figure 4.30). All frames were tested for all sequences (149 frames for Foreman, Coastguard, and Soccer, and 165 frames for Hall Monitor); all sequences were tested at a temporal resolution of 15 Hz and QCIF spatial resolution. For these simulations, the TDWZ video codec was configured to use a GOP length of 2. The eight quantization matrices defined in [30] have been used, which means there are eight RD points for each evaluation case; key frames were encoded using the H.264/AVC baseline profile (the main profile does not support FMO) with (constant) quantization steps chosen through an iterative process that achieves an average intra-frame quality (PSNR) similar to the average WZ quality. As usual for WZ coding, only luminance data has been coded. For this error-resilience study, a communication channel characterized by the error pattern files provided in [31], with different packet loss ratios (PLRs), is considered. These patterns are the same as those used in JVT. For the various test conditions, in terms of PLR and
Figure 4.30 Sample frames for test sequences: (a) Foreman (frame 80); (b) Hall Monitor (frame 75); (c) Coastguard (frame 60); (d) Soccer (frame 8)
quantization matrices, the average (in time) PSNR will be measured, averaged over 25 error pattern runs.

4.7.2.1 Corrupting Key Frames Only

The RD performance after key-frame corruption is presented in Figures 4.31 and 4.32 for the Foreman and Hall Monitor sequences, for the key frames only and for the overall set of frames (WZ frames are error free), respectively. It is important to notice that, for the various RD points, depending on the quantization matrix, not all DCT coefficients are sent; for all non-sent DCT coefficients the (concealed) side information is used, which justifies the fact that the final quality for each quantization matrix is not always the same (see Figure 4.32), even though in all cases WZ bits are requested until the same low bitplane error probability is reached for each transmitted DCT coefficient bitplane.

Figure 4.31 Key-frame RD performance for packet-loss key-frame corruption for the Foreman and Hall Monitor sequences (GOP = 2)

Figure 4.32 Overall RD performance for packet-loss key-frame corruption for the Foreman and Hall Monitor sequences (GOP = 2)

4.7.2.2 Corrupting WZ Frames Only – Decoder Side Rate Control

The RD performance when only the WZ frames are corrupted is presented in Figure 4.33 for the Foreman and Hall Monitor sequences. In this case, the side information is always the same, since the key frames are error free. For the TDWZ codec, the final quality for each quantization matrix is always the same, since it is assumed that the network protocol asks for the retransmission of the lost packets.

Figure 4.33 Overall RD performance for packet-loss WZ frame corruption for the Foreman and Hall Monitor sequences (GOP = 2)

4.7.2.3 Corrupting WZ Frames Only – Encoder Side Rate Control

The plots in Figure 4.34 show the RD results (PSNR) at various packet loss rates for the Foreman and Soccer sequences. These results show that the VISNET I DVC codec can compensate quite well for channel errors on WZ frames. In particular, the performance loss at a 3% loss rate is very low at each of the RD points considered, for all the test sequences. At higher loss rates (5 and 20%), the quality loss is still smaller than 0.5 dB at low bit rates, and it only increases slightly at higher bit rates.
Figure 4.34 Overall RD performance for packet-loss WZ frame corruption for the Foreman and Soccer sequences – encoder side rate control (GOP ¼ 2)
Figure 4.35 PSNR along time: (a) Foreman: 205 kbps; (b) Soccer: 202 kbps
The quality of the corrupted sequences can also be evaluated subjectively, through visual inspection of the corrupted frames. For example, the PSNR versus frame number for various error rates at a given bit rate is shown in Figure 4.35 for the Foreman and Soccer sequences. The corresponding frames with the lowest PSNR at each error rate are shown in Figures 4.36 and 4.37, respectively. From these figures, it can be seen that the overall quality of the worst frames is still tolerable, especially at low packet loss rates. Some blocking artifacts around the edges, caused by transmission errors, are visible at the higher error rates.
Figure 4.36 Visual results: the frame with smallest PSNR (Foreman sequence, 205 kbps): (a) no error (frame 118, 31.5 dB); (b) no error (frame 44, 31.7 dB); (c) no error (frame 130, 31.6 dB); (d) 3% (frame 118, 24.7 dB); (e) 5% (frame 44, 24.6 dB); (f) 20% (frame 130, 23.0 dB)
Figure 4.37 Visual results: the frame with smallest PSNR (Soccer sequence, 202 kbps): (a) no error (frame 128, 30.1 dB); (b) no error (frame 12, 30.5 dB); (c) no error (frame 40, 31.0 dB); (d) 3% (frame 128, 28.6 dB); (e) 5% (frame 12, 24.1 dB); (f) 20% (frame 40, 21.2 dB)
4.7.2.4 Corrupting Key Frames and WZ Frames – Decoder Side Rate Control

The RD performance after key-frame and WZ-frame corruption is presented in Figure 4.38 for the Foreman and Hall Monitor sequences. The higher the PLR, the poorer the side information and the poorer the final overall quality, since the side information for the higher-frequency DCT coefficients is not corrected with WZ bits.
Figure 4.38 Overall RD performance for packet loss in key-frame and WZ-frame corruption for the Foreman and Hall Monitor sequences

4.8 Modeling the DVC Decoder for Error-prone Wireless Channels

A number of independent studies have been carried out in relation to the performance of DVC over noisy and error-prone transmission media. The effects of data corruption in the
Wyner–Ziv-encoded bitstream and the key-frame stream were discussed in detail in Section 4.7, using a packet-based network for simulations. The objective of the research activity discussed in this section is to design an improved model for the Wyner–Ziv decoder, to enhance the RD performance when operated over error-prone channels. The designed model will then be implemented on the VISNET I TDWZ codec for a comparative analysis. This research concentrates on a W-CDMA wireless channel scenario transmitting the encoded Wyner–Ziv bitstream. For a wireless channel, time-varying multipath fading is considered the main challenge in designing communication systems. The modifications to the decoding algorithm necessary to compensate for the adverse channel effects will be discussed in this section. The effects of data corruption on transmission of the key frames will also be outlined, and another modification to the Wyner–Ziv decoder will be presented, by developing a new error model for the key frames. A packet drop simulator will be used for the key frames coded using H.264/AVC intra-coding. Similar to many other related studies, the feedback channel is considered error free, since sufficient provisions are available for strong error correction. The design of a final solution using the error-correction models presented in this section, and considering a uniform data corruption model for the wireless channels, will be addressed in further work in this research area.
4.8.1 Proposed Technical Solution

4.8.1.1 Designing the Dual Channel Mathematical Model

This section covers the design of the novel mathematical model proposed for use in iterative turbo decoding of the Wyner–Ziv data using the MAP algorithm. In an iterative SISO system with maximum likelihood (ML) decoding, the LLR of a data bit uk, L(uk), and the decoded bit ûk are given by:

   L(uk) = ln[ P(uk = +1) / P(uk = −1) ]   and   ûk = sign[L(uk)]   (4.38)

Thus, given the LLR of uk it is possible to calculate the probability that uk = +1 or uk = −1, as follows. Noting that P(uk = −1) = 1 − P(uk = +1) and taking the exponent of both sides in Equation (4.38), we can write:

   e^(L(uk)) = P(uk = +1) / [1 − P(uk = +1)]

   P(uk = +1) = e^(L(uk)) / [1 + e^(L(uk))] = [1 + e^(−L(uk))]^(−1)   (4.39)

Similarly:

   P(uk = −1) = 1 / [1 + e^(L(uk))] = e^(−L(uk)) / [1 + e^(−L(uk))]   (4.40)
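As a quick numerical sanity check of Equations (4.39) and (4.40) (the LLR value below is an arbitrary example):

```python
import math

L = 1.7                                   # example LLR of a bit u_k
p_plus = 1.0 / (1.0 + math.exp(-L))       # P(u_k = +1), Equation (4.39)
p_minus = 1.0 / (1.0 + math.exp(+L))      # P(u_k = -1), Equation (4.40)

assert abs(p_plus + p_minus - 1.0) < 1e-12          # probabilities sum to 1
assert abs(math.log(p_plus / p_minus) - L) < 1e-12  # ratio recovers the LLR
u_hat = 1 if L >= 0 else -1               # hard decision: sign of L(u_k)
```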
And hence we can write:

   P(uk = ±1) = [ e^(−L(uk)/2) / (1 + e^(−L(uk))) ] · e^(±L(uk)/2)   (4.41)

where the bracketed term is the same for uk = +1 and uk = −1. The conditional LLR L(uk|y) is defined as:

   L(uk|y) = ln[ P(uk = +1|y) / P(uk = −1|y) ]   (4.42)

Incorporating the code's trellis, this may be written as [32]:

   L(uk) = ln[ ( Σ_(S+) p(sk−1 = s′, sk = s, y)/p(y) ) / ( Σ_(S−) p(sk−1 = s′, sk = s, y)/p(y) ) ]   (4.43)

where sk ∈ S is the state of the encoder at time k, S+ is the set of ordered pairs (s′, s) corresponding to all state transitions (sk−1 = s′) → (sk = s) caused by the data input uk = +1, and S− is similarly defined for uk = −1. The BCJR algorithm [33] computes p(s′, s, y) = p(sk−1 = s′, sk = s, y) as:

   p(s′, s, y) = αk−1(s′) · γk(s′, s) · βk(s)   (4.44)

where αk(s), γk(s′, s), and βk(s) are the forward, branch, and backward metrics respectively, defined as follows:

   αk(s) = αk−1(v1) γk(v1, s) + αk−1(v2) γk(v2, s)   (4.45)

   βk(s) = βk+1(w1) γk+1(s, w1) + βk+1(w2) γk+1(s, w2)   (4.46)

where v1 and v2 denote the valid previous states and w1 and w2 the valid next states for the current state s. Furthermore, γk(s′, s) can be written as:

   γk(s′, s) = P(s|s′) p(yk|s′, s) = P(uk) p(yk|uk)   (4.47)
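To make the recursions concrete, the following toy example runs Equations (4.44)–(4.46) on a two-state, fully-connected trellis with arbitrary positive branch metrics standing in for P(uk)p(yk|uk), and checks the BCJR consistency property that p(s′, s, y) summed over all transitions gives the same total p(y) at every trellis step. The trellis size and metric values are illustrative only.

```python
K = 4                        # number of trellis steps
S = 2                        # number of states
# fixed pseudo-random branch metrics gamma_k(s', s) for k = 1..K
gamma = [[[0.1 + ((7 * k + 3 * sp + s) % 5) / 10.0 for s in range(S)]
          for sp in range(S)] for k in range(K)]

# forward metrics, Eq. (4.45): alpha_k(s) = sum_{s'} alpha_{k-1}(s') gamma_k(s', s)
alpha = [[1.0, 0.0]]         # encoder assumed to start in state 0
for k in range(K):
    alpha.append([sum(alpha[k][sp] * gamma[k][sp][s] for sp in range(S))
                  for s in range(S)])

# backward metrics, Eq. (4.46): beta_k(s) = sum_{s''} gamma_{k+1}(s, s'') beta_{k+1}(s'')
beta = [None] * (K + 1)
beta[K] = [1.0, 1.0]         # unterminated trellis
for k in range(K - 1, -1, -1):
    beta[k] = [sum(gamma[k][s][ss] * beta[k + 1][ss] for ss in range(S))
               for s in range(S)]

# Eq. (4.44): p(s', s, y) = alpha_{k-1}(s') gamma_k(s', s) beta_k(s);
# summed over all (s', s) it must equal p(y) at every step k.
p_y = sum(alpha[K])
for k in range(1, K + 1):
    total = sum(alpha[k - 1][sp] * gamma[k - 1][sp][s] * beta[k][s]
                for sp in range(S) for s in range(S))
    assert abs(total - p_y) < 1e-12
```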
It is noted that, in the case of DVC, the parity stream is transmitted over a noisy wireless channel, while the systematic stream is taken from the locally-generated side information frame, which can be modeled using a hypothetical Laplacian or Gaussian model. In real-world deployments, however, both the transmitted parity and the intra-coded key frame information suffer the noisy wireless channel conditions (i.e. AWGN and multipath fading). Thus, in practical situations, the effects of corrupted parity information in MAP decoding and of corrupted intra-coded key frame information, in addition to the hypothetical channel characteristics, have been considered in our discussion. The noise distributions in the parity and systematic components of the input to the turbo decoder can be considered independent. Therefore, p(yk|uk) in Equation (4.47) can be written as:

   p(yk|uk) = p(yk^p|uk^p) · p(yk^s|uk^s)   (4.48)

where yk = {yk^p, yk^s} and uk = {uk^p, uk^s} are the received and transmitted parity and systematic information, respectively. The mathematical modeling of the received parity and systematic information is given below.

Modeling the Parity Information (Wyner–Ziv Bitstream)

For the parity bit sequence received through a multipath fading wireless channel, considering two paths, the probability of the received bit yk^p conditioned on the transmitted bit uk^p can be written as:

   p(yk^p|uk^p = +1) = (1/√(2πσ²)) exp( −[yk^p − (|a1|² + |a2|²)√Eb]² / (2σ²) )   (4.49)

and:

   p(yk^p|uk^p = −1) = (1/√(2πσ²)) exp( −[yk^p + (|a1|² + |a2|²)√Eb]² / (2σ²) )   (4.50)

where a1 and a2 are the complex fading coefficients for the two paths, σ is the standard deviation of the Gaussian probability distribution, and Eb is the energy per bit. (Reproduced by permission of © 2007 EURASIP.)

The conditional LLR for the parity information, L(yk^p|xk^p), can be written as:

   L(yk^p|xk^p) = ln[ P(yk^p|xk^p = +1) / P(yk^p|xk^p = −1) ]   (4.51)

(Reproduced by permission of © 2007 EURASIP.) Substituting Equations (4.49) and (4.50) in (4.51), we get:

   L(yk^p|xk^p) = ln{ exp( −[yk^p − (|a1|² + |a2|²)√Eb]² / (2σ²) ) / exp( −[yk^p + (|a1|² + |a2|²)√Eb]² / (2σ²) ) }   (4.52)

   = (1/(2σ²)) { [yk^p + (|a1|² + |a2|²)√Eb]² − [yk^p − (|a1|² + |a2|²)√Eb]² }
   = (√Eb/(2σ²)) · 4(|a1|² + |a2|²) · yk^p
   = Lc · yk^p   (4.53)

where Lc is defined as the channel reliability value, and depends only on the signal-to-noise ratio (SNR) and the fading amplitude of the channel. (Equations (4.52) and (4.53) reproduced by permission of © 2007 EURASIP.)

Modeling the Systematic Information (Side Information Bitstream)

Under error-free transmission of the intra-coded key frames, the correlation between the original frames to be encoded and the estimated side information is assumed to be Laplacian distributed:

   f(x) = (α/2) e^(−α|yk^s − uk^s|)   (4.54)

(Reproduced by permission of © 2007 EURASIP.)
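The closed form of Equation (4.53) can be cross-checked against the raw log-likelihood ratio of Equations (4.49) and (4.50); all numeric values below (fading coefficients, noise level, received sample) are illustrative.

```python
import math

Eb = 1.0                                # energy per bit
sigma = 0.8                             # noise standard deviation
a1, a2 = 0.9 + 0.3j, 0.4 - 0.2j         # complex fading coefficients, two paths
A = (abs(a1) ** 2 + abs(a2) ** 2) * math.sqrt(Eb)   # combined signal amplitude

def likelihood(y, u):
    """p(y_k^p | u_k^p = u), u in {+1, -1}: Equations (4.49)-(4.50)."""
    return math.exp(-(y - u * A) ** 2 / (2 * sigma ** 2)) \
        / math.sqrt(2 * math.pi * sigma ** 2)

# channel reliability value, Equation (4.53)
Lc = math.sqrt(Eb) / (2 * sigma ** 2) * 4 * (abs(a1) ** 2 + abs(a2) ** 2)

y = 0.37                                # an arbitrary received soft value
llr_direct = math.log(likelihood(y, +1) / likelihood(y, -1))
assert abs(llr_direct - Lc * y) < 1e-12  # L(y^p | x^p) = Lc * y^p
```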
Figure 4.39 The noise model for systematic information: (a) original hypothetical Laplacian noise model for side information (DC coefficients); (b) original pdf of DC coefficients; (c) resultant combined noise pdf
However, we have found experimentally that in the presence of a noisy wireless channel the correlation between the estimated side information and the original information no longer exhibits a true Laplacian behavior, particularly for the DC band. The distribution of the AC bands still resembles a Laplacian distribution. With the loss of information due to channel conditions, the new DC band distribution can be identified as the additive combination of the probability distribution of the DC coefficients and the original Laplacian distribution f(x), as illustrated in Figure 4.39.

4.8.1.2 Modifying the DVC Codec to Incorporate the Above Mathematical Model

The DVC codec structure incorporating the noisy channel models is illustrated in Figure 4.40. The transmission channel is modeled as a discrete-time Rayleigh fading channel consisting of multiple narrow-band channel paths separated by delay elements, as illustrated in Figure 4.41. The variance of the Gaussian noise source in each tap is selected according to the exponential power delay profile. A generic W-CDMA wireless channel with the parameters shown in Table 4.2 is used for the simulations. In this simulation, we consider a slow fading scenario, such that the fading coefficients are assumed to be constant over the chip duration. The turbo decoder construction is illustrated in Figure 4.42. It consists of two MAP-based SISO decoders separated by an interleaver identical to the one used in the turbo encoder. The input for each component decoder is the systematic and parity information, together with the a priori information carried forward from the previous iteration of the other
Figure 4.40 Block diagram of the DVC codec with the noisy channel models
Figure 4.41 Rayleigh fading channel model
Table 4.2 Configuration parameters of the W-CDMA channel

  Channel model:       W-CDMA uplink and downlink channels (slowly-varying frequency-selective fading)
  Modulation:          QPSK
  Chip rate:           3.84 Mchip/s
  Spreading factor:    32
  Spreading sequence:  OVSF sequences
  Carrier frequency:   2 GHz
  Doppler speed:       100 km/h
SISO decoder, either interleaved or de-interleaved as necessary. The two parity bitstreams (yk^p) for the two component decoders are taken from the soft output of the channel. The systematic component (yk^s) is taken from the side information stream. The channel reliability factor (Lc) for the parity stream is to be derived from channel estimation in a practical implementation.
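For reference, the Table 4.2 parameters can be collected into a configuration structure; the dictionary keys are our own naming, and the derived maximum Doppler shift f_d = v·f_c/c is the value the Doppler filters of Figure 4.41 would be designed around.

```python
# Table 4.2 parameters as a configuration dictionary (key names are ours).
wcdma_channel = {
    "modulation": "QPSK",
    "chip_rate_hz": 3.84e6,
    "spreading_factor": 32,
    "spreading_sequence": "OVSF",
    "carrier_frequency_hz": 2e9,
    "mobile_speed_kmh": 100.0,
}

# maximum Doppler shift f_d = v * f_c / c for the tabulated speed and carrier
v = wcdma_channel["mobile_speed_kmh"] / 3.6              # m/s
f_d = v * wcdma_channel["carrier_frequency_hz"] / 3e8    # Hz, about 185 Hz
```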
4.8.2 Performance Evaluation

The effect of the noisy channel models discussed above on the transform-domain DVC codec performance is investigated in this section. The test conditions adopted were: Foreman and Hall
Visual Media Coding and Transmission
Figure 4.42 Block diagram of the turbo decoder
Monitor sequences (100 frames); QCIF resolution at 15 fps; GOP size = 2. The key frames were encoded with H.264/AVC intra (main profile), with the quantization steps chosen through an iterative process to achieve an average intra-frame quality (PSNR) similar to the average WZ frame quality. The configuration parameters for the Rayleigh fading channel are shown in Table 4.2. Figure 4.43 shows the RD results for the Foreman and Hall Monitor sequences considering two paths in the fading model. From the depicted results, it can be concluded that the decoding algorithm presented here, which incorporates the discussed noise models for the Wyner–Ziv bitstream (parity bits) and the side information stream (systematic bits), significantly improves the RD performance. This solution is further verified for the single-path and four-path cases for the Foreman sequence, and the related simulation results are illustrated in Figure 4.44.
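The tapped-delay-line fading channel of Figure 4.41, with tap powers drawn from an exponential power delay profile as described in Section 4.8.1.2, could be simulated along the following lines. This is a simplified sketch under stated assumptions: a single static fading realization (consistent with the slow-fading assumption), and an illustrative decay constant for the delay profile.

```python
import numpy as np

def rayleigh_taps(n_taps, decay=1.0, rng=np.random):
    # Exponential power delay profile, normalized to unit total power
    pdp = np.exp(-decay * np.arange(n_taps))
    pdp /= pdp.sum()
    # Complex Gaussian taps: tap magnitudes are Rayleigh distributed
    h = (rng.randn(n_taps) + 1j * rng.randn(n_taps)) / np.sqrt(2.0)
    return np.sqrt(pdp) * h

def transmit(chips, h, snr_db):
    # Tapped delay line: convolve the chip stream with the channel taps,
    # then add white Gaussian noise at the requested SNR
    y = np.convolve(chips, h)[: len(chips)]
    noise_var = 10.0 ** (-snr_db / 10.0)
    n = np.sqrt(noise_var / 2.0) * (np.random.randn(len(y)) + 1j * np.random.randn(len(y)))
    return y + n
```

For a multi-path experiment such as the two-path case above, `rayleigh_taps(2)` would generate the channel, and `transmit` would be applied to the QPSK chip stream.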
Figure 4.43 RD results for the two-path fading model (SNR = 3 dB)
Figure 4.44 RD results for the single-path and four-path scenarios for the Foreman sequence (SNR = 3 dB)
4.9 Error Concealment Using a DVC Approach for Video Streaming Applications

(Portions reprinted, with permission, from R. Bernardini, M. Fumagalli, M. Naccari, R. Rinaldo, M. Tagliasacchi, S. Tubaro, P. Zontone, "Error concealment using a DVC approach for video streaming applications", EURASIP European Signal Processing Conference, Poznan, Poland, September 2007. © 2007 EURASIP.)

The general framework proposed in [34] considers an MPEG-coded video bitstream which is sent over an error-prone channel with little or no protection; an auxiliary bitstream, generated using Wyner–Ziv coding, is sent for error resilience. At the decoder side, the error-concealed decoded MPEG frame becomes the side information for the Wyner–Ziv decoder, which further enhances the quality of the concealed frame. The auxiliary Wyner–Ziv stream is generated by computing parity bits of a Reed–Solomon code, where the systematic data consists of a downsampled, coarsely-quantized version of the original sequence, together with mode decisions and motion vectors. This scheme works with a fixed rate allocated at the encoder, which does not depend on the actual distortion induced by the channel loss. If the packet loss rate (PLR) is known, together with the error concealment technique used at the decoder, the ROPE algorithm [35] allows the distortion of a decoded frame to be estimated without a direct comparison between the original and the decoded signal. In other words, it can estimate, at the encoder side, the expected distortion at the decoder. In its original version [35], the ROPE algorithm performed this estimation by considering integer-pixel precision for motion compensation in the video coder. Since half-pixel precision is widely adopted by standard coders, we use the half-pixel precision version of ROPE presented in [36]. A forward error-correcting coding scheme that employs an auxiliary redundant stream encoded according to a Wyner–Ziv approach is proposed.
The block diagram of the proposed architecture is depicted in Figure 4.45. Unlike [34], turbo codes are used to compute the parity bits of the auxiliary stream. The proposed solution works in the transform domain, protecting the most significant bitplanes of the DCT coefficients. In order to allocate the appropriate number of parity bits for each DCT frequency band, a modified version of ROPE that works in the DCT domain [25] is used. The information provided by ROPE is also used to determine which frames are more likely to suffer from drift. Therefore, the auxiliary
Figure 4.45 Overall architecture of the proposed scheme. Reproduced by permission of © 2007 EURASIP
redundant data is sent only for a subset of the frames. We also show how prior information that can be obtained at the decoder, based on the actual error pattern, can be used to efficiently help turbo decoding.
4.9.1 Proposed Technical Solution

To evaluate the number of Wyner–Ziv parity bits to be transmitted in the auxiliary bitstream, the expected frame distortion in the DCT domain is estimated. This estimate is computed by means of the ROPE algorithm [35], which requires knowledge of the transmission channel packet loss rate and of the error concealment strategy used at the decoder. In its first and basic formulation, the ROPE algorithm estimates the expected distortion in the pixel domain and considers video coding with integer-precision motion vectors (MVs). As mentioned before, integer-precision MVs may lead to unacceptable coding performance. In this work, we extend the ROPE algorithm to half-pixel precision, as proposed in [36]. To calculate the expected frame distortion at the decoder in the DCT domain, the EDDD (expected distortion of decoded DCT coefficients) algorithm [37] is used. This method takes as input the expected frame distortion in the pixel domain calculated by the ROPE algorithm, and outputs the expected distortion of the coefficients in the transform domain. Figure 4.46 shows the rate estimator module and highlights its inputs and its output, i.e. the expected channel-induced distortion D_{i,j}(t) for each DCT coefficient, defined as:

    D_{i,j}(t) = E[(X̂_{i,j}(t) − X̃_{i,j}(t))²]    (4.55)

Figure 4.46 Rate estimator module for the expected distortion in the DCT domain
where i = 0, ..., K−1 denotes the block index within a frame and j = 0, ..., J is the DCT coefficient index within a block. X̂_{i,j}(t) represents the reconstructed DCT coefficient at the encoder at time t, while X̃_{i,j}(t) is the co-located coefficient reconstructed at the decoder, after the error concealment. (Reproduced by permission of © 2007 EURASIP.) The estimated values of D_{i,j}(t) represent the channel-induced distortion only, and they are obtained as follows: let c, q denote, respectively, transmission and quantization errors, assumed uncorrelated. For each (i, j) DCT coefficient, it is possible to write at the encoder:

    X̂_{i,j}(t) = X_{i,j}(t) + q_{i,j}(t)    (4.56)
where X_{i,j}(t) is the original DCT coefficient. (Reproduced by permission of © 2007 EURASIP.) At the decoder:

    X̃_{i,j}(t) = X̂_{i,j}(t) + c_{i,j}(t)    (4.57)

The total distortion at the decoder is given by:

    E[(X_{i,j}(t) − X̃_{i,j}(t))²] = E[(q_{i,j}(t))²] + E[(X̂_{i,j}(t) − X̃_{i,j}(t))²]
                                  = E[(q_{i,j}(t))²] + D_{i,j}(t)    (4.58)
where the drift variance term D_{i,j}(t) in Equation (4.58) is unknown and is provided by EDDD (for details, see [27]). The first term on the right-hand side can be approximated, assuming uniformly distributed quantization noise, as E[(q_{i,j}(t))²] = Δ²/12, where Δ is the quantizer step size. The distortion D_{i,j}(t) is then used to compute the number of transmitted parity bits needed to correct the t-th error-concealed decoded frame X̃(t) into a "cleaner" version X′(t), as will be seen later.

The proposed scheme is depicted in Figure 4.45. The input video signal is independently coded with a standard motion-compensated predictive (MCP) encoder and a Wyner–Ziv encoder. The generated bitstreams are transmitted over the error-prone channel, characterized by a packet loss probability p_l. At the receiver side, the decoder decodes the primary bitstream and performs error concealment. A motion-compensated temporal concealment is used, which can briefly be summarized as follows:

. If a macroblock (MB) is lost (this happens with probability p_l), it is replaced with the block in the previous frame pointed to by the motion vectors of the MB above the one under consideration.
. If the MB above the one being concealed is lost too (this happens with probability p_l²), the current MB is replaced with the co-located MB in the reference frame.
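The two concealment rules above can be sketched as follows. This is an illustrative Python fragment, not the book's implementation: frames are represented as 2D grids of macroblock payloads, and motion vectors are simplified to block-level offsets for clarity.

```python
def conceal_mb(cur, ref, lost, mvs, r, c):
    """Temporal concealment for the macroblock at row r, column c."""
    if not lost[r][c]:
        return cur[r][c]                 # received correctly, nothing to do
    if r > 0 and not lost[r - 1][c]:
        # Rule 1: reuse the motion vector of the MB above to fetch the
        # predictor from the previous (reference) frame
        dr, dc = mvs[r - 1][c]
        return ref[r + dr][c + dc]
    # Rule 2: MB above is also lost -> copy the co-located MB
    return ref[r][c]
```

Rule 1 applies with probability p_l and rule 2 with probability p_l²; bounds checking on the motion-compensated address is omitted for brevity.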
The Wyner–Ziv encoder is similar to the one described in [25], with the difference that in [25] the side information is generated at the decoder by motion-compensated interpolation, while here the side information is the concealed reconstructed frame at the decoder, X̃(t). To prevent mismatch between the encoder and decoder, a locally decoded version of the compressed video is used as input to the Wyner–Ziv encoder, rather than the original video sequence X(t).
At the encoder, Wyner–Ziv redundancy bits for frame t are only generated if the expected distortion at the frame level is above a predetermined threshold τ, i.e. only if:

    D(t) = (1/(KJ)) Σ_{i=0}^{K−1} Σ_{j=0}^{J−1} D_{i,j}(t) > τ    (4.59)
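The frame-level test of Equation (4.59) amounts to comparing the mean of the per-coefficient distortion map against the threshold τ, e.g. (an illustrative sketch; the function name is an assumption):

```python
def send_wz_bits(D, tau):
    # D: K x J grid of expected per-coefficient distortions D_{i,j}(t)
    K, J = len(D), len(D[0])
    d_frame = sum(sum(row) for row in D) / (K * J)
    return d_frame > tau   # True -> generate Wyner-Ziv redundancy bits
```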
Otherwise, no redundancy bits are transmitted. This allows us to concentrate the bit budget on those frames that are more likely to be affected by drift.

To generate the Wyner–Ziv bitstream, the DCT is applied to the locally decoded frame; the transform coefficients are grouped together to form coefficient subbands; each subband is then quantized using a dead-zone uniform quantizer, and for each subband the corresponding bitplanes are independently coded using a turbo encoder. The parity bits relative to all the DCT coefficients of each 8 × 8 block are transmitted. In the proposed scheme, the expected distortion computed by EDDD is used to adaptively estimate the number of parity bits required by the Wyner–Ziv decoder. As a matter of fact, it is expected that if the distortion calculated by ROPE increases, the rate required by the decoder for exact reconstruction will increase too. At the encoder it is possible to estimate the expected distortion for each DCT coefficient, denoted by D_{i,j}(t). Since encoding is performed by grouping together DCT coefficients belonging to the same subband j, first the average distortion for each subband is computed as:

    σ_j²(t) = (1/K) Σ_{i=0}^{K−1} D_{i,j}(t)    (4.60)
then the expected crossover probability p_j^b(t) for each bitplane b of each subband j is obtained. To this end, a Laplacian distribution is assumed for the error n_j(t) = X̂_j(t) − X̃_j(t):

    f_{n_j(t)}(n) = (α_j(t)/2) e^{−α_j(t)|n|}    (4.61)
The distribution parameter α_j(t) is derived from the expected distortion σ_j²(t) as α_j²(t) = 2/σ_j²(t). The joint knowledge of the Laplacian distribution and of the quantizer used allows us to estimate the crossover probability p_j^b(t) for each bitplane. Therefore, it is possible to calculate:

    H(p_j^b(t)) = −p_j^b(t) log₂(p_j^b(t)) − (1 − p_j^b(t)) log₂(1 − p_j^b(t))

and, using this value, the rate R_j^b(t) = H(p_j^b(t)) that an ideal receiver needs to recover the original bits is estimated. Note that H(p_j^b(t)) is the amount of redundancy required by a capacity-achieving code for a binary symmetric channel. The encoder transmits about R_j^b(t) parity bits¹ for each bitplane and subband, until the auxiliary stream bit budget R_WZ(t) is exhausted. First, the parity bits of the most significant bitplane of all J DCT subbands are sent. Then encoding proceeds with the remaining bitplanes. The encoder also transmits the average distortion computed by the rate estimator module, i.e. σ_j²(t), since it is exploited by the turbo decoder as explained below. Note that, as analyzed in [38], fixing the rate at the encoder yields a significant improvement: the transmission rate for a single bitplane never exceeds 1. In fact, if the estimated rate is greater than 1, it is

¹ A higher rate is typically required to take into account both model inaccuracy and the suboptimality of the channel code.
convenient to transmit the uncoded bits. At the receiver side, the Wyner–Ziv decoder takes the concealed frame X̃(t) and transforms it using a block-based DCT. The transform coefficients are then grouped together to form the side information subbands X̃_j(t). The received parity bits are used to "correct" each subband X̃_j(t) into X̂′_j(t). Turbo decoding is applied at the bitplane level, starting from the most significant bitplane of each subband. In the proposed system, the turbo decoder exploits the received distortion estimates σ_j²(t), computed at the encoder, in order to tune the correlation statistics between the source X̂_j(t) and the side information X̃_j(t). Therefore, adaptivity is guaranteed at both subband and frame level. In addition, it is proposed to exploit the knowledge of the actual error pattern, which is available only at the decoder. In fact, the decoder knows exactly which slices are lost, and can perform an error-tracking algorithm in order to determine which blocks might be affected by errors. Apart from the blocks belonging to lost slices, those blocks that depend upon previously corrupted blocks are flagged as "potentially noisy". The error tracker produces, for each frame, a binary map that indicates whether the reconstructed block at the decoder might differ from the corresponding one at the encoder. The algorithm is similar to the one presented in [39], with the important difference that, in this case, error tracking is performed at the decoder only. The turbo decoder can take advantage of this information by adaptively setting the conditional probability to 1 for those coefficients that are certainly correct. This means that the turbo decoder totally trusts the side information in these cases. After turbo decoding, some residual errors might occur if the number of received parity bits is not sufficient for the exact reconstruction of every bitplane.
In this case, the error-prone bitplane is not considered in the reconstruction process; only the bits of the correctly decoded bitplanes are regrouped, and ML reconstruction is finally applied (see [25]). Error detection at the turbo decoder is performed by monitoring the average of the modulus of the LAPP, as described in [19]. Finally, once Wyner–Ziv decoding is successfully completed, the reconstructed frame X′(t) is copied into the buffer of the MCP decoder, to serve as the reference frame at time t + 1. This way, the amount of drift propagated to successive frames is reduced.
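The decoder-side error-tracking step described above can be sketched as follows. This is a simplified fragment under stated assumptions: blocks are indexed linearly per frame, `mv_refs[t][b]` lists the reference blocks in frame t−1 that block b predicts from, and `lost` is the set of (frame, block) pairs belonging to lost slices; all names are illustrative.

```python
def track_errors(lost, mv_refs, n_frames, n_blocks):
    # Binary map: True means the decoded block may differ from the
    # encoder reconstruction ("potentially noisy")
    noisy = [[False] * n_blocks for _ in range(n_frames)]
    for t in range(n_frames):
        for b in range(n_blocks):
            if (t, b) in lost:
                noisy[t][b] = True           # block in a lost slice
            elif t > 0:
                # error propagation through motion-compensated prediction
                noisy[t][b] = any(noisy[t - 1][r] for r in mv_refs[t][b])
    return noisy
```

The turbo decoder can then fully trust the side information (conditional probability set to 1) for coefficients of blocks whose flag is False.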
4.9.2 Performance Evaluation

In these simulations, the input video signal is compressed using an H.263+ video coder with half-pixel precision motion estimation. The compressed bitstream is transmitted over the primary channel. The auxiliary bitstream carries Wyner–Ziv redundancy bits, which are used at the decoder to correct the bitplanes of the DCT coefficients of the side information, i.e. of the reconstructed frames after concealment. For the first set of simulations, 30 frames of the QCIF Foreman sequence are considered. The Foreman sequence is coded at 30 fps, QP = 4, and GOP size = 16. Each slice comprises one row of 16 × 16 macroblocks, with nine slices per frame. For transmission, one slice per packet is sent, and a packet loss probability p_l = 10% is assumed, with independent losses. As mentioned above, Wyner–Ziv redundancy bits are not sent when the expected distortion estimated by ROPE is below a certain threshold, which represents the minimum acceptable quality at the decoder. Note that such a threshold could be computed by the encoder on the basis of the average expected distortion for past frames. In these experiments, the encoder does not send Wyner–Ziv bits when D(t) < τ, where τ is empirically determined for each sequence. When Wyner–Ziv bits are transmitted, the number of bits is set to β(p_j^b)·H(p_j^b(t)), where β(p_j^b)
Figure 4.47 Expected distortion given by ROPE for 30 frames of the Foreman and Carphone sequences
ranges from 2 to 4 depending on p_j^b(t). The values β(p_j^b) are set on the basis of operational curves of the turbo coder performance for a binary symmetric channel with crossover probability p_j^b(t). In the following, the proposed scheme is compared with one using the H.263+ intra-MB refresh coding option, where error resilience is obtained by coding, in each frame, a certain number of MBs in intra mode. The proportion is chosen to spend roughly the same additional bit rate as required by the Wyner–Ziv redundancy stream. Figure 4.47(a) shows the ROPE expected distortion for frames 55–85 of the Foreman sequence. For this sequence, τ is equal to 80 (equivalent to an expected PSNR of 30 dB). Figures 4.48 and 4.49 show the PSNR performance of the proposed scheme and the required rate. The results are obtained as the average of 30 different channel simulations. In this experiment, the main H.263+ stream has an average bit rate R = 568 kbps. An additional
Figure 4.48 PSNR evolution for 30 frames of the Foreman sequence
Figure 4.49 Frame-based bit-rate allocation for 30 frames of the Foreman sequence
Figure 4.50 PSNR for 30 frames of the Carphone sequence. Reproduced by permission of © 2007 EURASIP
63 kbps (equivalent to approximately 11% of the original rate) are spent on the auxiliary Wyner–Ziv-coded stream, for a total rate R = 631 kbps. Figure 4.48 also shows the PSNR for H.263+ when the intra-MB refresh rate has been set equal to 6 to achieve approximately the same rate (R = 624 kbps) as the proposed scheme. It is apparent from the figures that the proposed scheme performs comparably to or better than the scheme using the intra-MB refresh procedure. In this respect, note that the intra-MB refresh procedure has to be applied at coding time. In the proposed scheme it is possible
Figure 4.51 Frame-based bit-rate allocation for 30 frames of the Carphone sequence. Reproduced by permission of © 2007 EURASIP
to start from a precoded sequence (the most common situation for video transmission) instead, and add Wyner–Ziv bits for protection. Similar results can be seen in Figures 4.50 and 4.51, where 30 frames of the Carphone sequence are considered. In this case, the main H.263+ stream is encoded at a rate R = 294 kbps. An additional 89 kbps are added as Wyner–Ziv parity bits, for an overall rate of 386 kbps. When intra-MB refresh is used, the refresh rate is set to 8. The ROPE expected distortion for the Carphone sequence is depicted in Figure 4.47(b). In this case, no parity bits are sent when D(t) is below 35.
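The parity-bit budget rule β(p_j^b)·H(p_j^b(t)) used in these experiments can be sketched as follows. The β thresholds below are hypothetical placeholders, since the actual values are read from operational curves of the turbo coder; the ideal-rate part is the binary entropy derived in Section 4.9.1.

```python
import math

def parity_budget(p, plane_len):
    # Ideal rate: binary entropy H(p) of the crossover probability p
    if p <= 0.0 or p >= 1.0:
        return 0
    h = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    # Safety factor beta in [2, 4]; thresholds are illustrative only
    beta = 2.0 if p < 0.01 else 3.0 if p < 0.05 else 4.0
    return math.ceil(beta * h * plane_len)
```

When the resulting per-bit rate exceeds 1, transmitting the bitplane uncoded is cheaper, as noted in Section 4.9.1.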
4.10 Conclusions

Research on video coding technologies has been carried out with significant success for several decades, since the early days of analog video signal processing. The evolving technologies have been standardized by ITU-T and ISO/IEC for the benefit of economical commercial development. Against this background, DVC has successfully marked its place as a promising emerging approach to video coding. With its flexible architecture, which enables the design of very low-complexity video encoders, DVC is expected to pave the way for a new era of significantly low-cost and miniature video cameras for a number of demanding applications. The potential beneficiaries of DVC include surveillance systems and wireless sensor networks. DVC is based on the distributed source coding concept, which concerns the independent encoding and joint decoding of statistically dependent discrete random sequences, as described by the Slepian–Wolf and Wyner–Ziv theorems discussed in Section 4.1. Accordingly, decoding with side information is a feature of DVC, in contrast to conventional video coding techniques. A number of possible DVC implementations have been proposed in the literature. The architecture of the generic pixel-domain DVC codec based on turbo coding,
which is widely discussed in the related literature, was described earlier. Notable components of the DVC codec architecture considered include quantization, turbo encoding, parity puncturing, side information estimation, turbo decoding, and reconstruction. In this chapter we proposed several improvements to the VISNET DVC codec, including stopping criteria for the feedback channel, a nonlinear quantization process, rate distortion analysis, a model enabling the DVC decoder to handle transmission errors, and an error-concealment technique. This work has moved DVC research several steps forward. It should be noted, however, that significant further research effort will be needed before DVC is ready for standardization or commercial deployment. As of now, the codec still contains a few limitations for which viable practical solutions are due, particularly regarding the GOP size. When these limitations have been addressed, DVC will find many practical applications and may affect quality of life.
References

[1] D. Slepian and J.K. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. on Inform. Theory, Vol. IT-19, pp. 471–480, Jul. 1973.
[2] A. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. on Inform. Theory, Vol. IT-22, pp. 1–10, Jan. 1976.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: turbo-codes," Proc. 1993 IEEE Int. Conf. on Communications, Geneva, Switzerland, pp. 1064–1070.
[4] P. Robertson and T. Wörz, "Bandwidth-efficient turbo trellis-coded modulation using punctured component codes," IEEE Journal on Selected Areas in Communications, Vol. 16, pp. 206–218, Feb. 1998.
[5] W.A.R.J. Weerakkody, W.A.C. Fernando, A.B.B. Adikari, and R.M.A.P. Rajatheva, "Distributed video coding of Wyner–Ziv frames using Turbo Trellis Coded Modulation," Proc. IEEE ICIP 2006, Atlanta, GA, Oct. 2006.
[6] D. Schonberg, S.S. Pradhan, and K. Ramchandran, "LDPC codes can approach the Slepian–Wolf bound for general binary sources," Proc. Allerton Conference on Communication, Control, and Computing, Champaign, IL, Oct. 2002.
[7] Y. Tonomura, T. Nakachi, and T. Fujii, "Distributed video coding using JPEG 2000 coding scheme," IEICE Trans. Fundamentals, Vol. E90-A, No. 3, pp. 581–589, March 2007.
[8] R. Puri and K. Ramchandran, "PRISM: a new robust video coding architecture based on distributed compression principles," Proc. 40th Allerton Conf. on Comm., Control, and Computing, Allerton, IL, Oct. 2002.
[9] C. Brites, J. Ascenso, and F. Pereira, "Improving transform domain Wyner–Ziv video coding performance," Proc. IEEE ICASSP, Vol. 2, pp. II-525–II-528, May 2006.
[10] Meyer, Westerlaken, Gunnewiek, and Lagendijk, "Distributed source coding of video with non-stationary side information," Proc. SPIE VCIP, Beijing, China, Jun. 2005.
[11] A. Aaron, E. Setton, and B. Girod, "Towards practical Wyner–Ziv coding of video," Proc. IEEE International Conference on Image Processing, ICIP-2003, Barcelona, Spain, Sep. 2003.
[12] M. Tagliasacchi and S. Tubaro, "A MCTF video coding scheme based on distributed source coding principles," Visual Communication and Image Processing, Beijing, Jul. 2005.
[13] L. Natario, C. Brites, J. Ascenso, and F. Pereira, "Extrapolating side information for low-delay pixel-domain distributed video coding," VISNET Report, http://www.visnet-noe.org.
[14] J. Ascenso, C. Brites, and F. Pereira, "Improving frame interpolation with spatial motion smoothing for pixel domain distributed video codec," VISNET Report, http://www.visnet-noe.org.
[15] L.R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. on Inform. Theory, Vol. 20, pp. 284–287, Mar. 1974.
[16] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proc. IEEE, Special Issue on Advances in Video Coding and Delivery, Vol. 93, No. 1, pp. 71–83, Jan. 2005.
[17] A. Aaron, R. Zhang, and B. Girod, "Wyner–Ziv coding of motion video," Proc. Asilomar Conference on Signals and Systems, Pacific Grove, CA, Nov. 2002.
[18] A.B.B. Adikari, W.A.C. Fernando, and W.A.R.J. Weerakkody, "Independent key frame coding using correlated pixels in distributed video coding," IEE Electronics Letters, Vol. 43, No. 7, pp. 387–388, March 2007.
[19] F. Zhai and I.J. Fair, "Techniques for early stopping and error detection in turbo decoding," IEEE Transactions on Communications, Vol. 51, No. 10, pp. 1617–1623, Oct. 2003.
[20] R.Y. Shao, S. Lin, and M. Fossorier, "Two simple stopping criteria for turbo decoding," IEEE Transactions on Communications, Vol. 47, No. 8, pp. 1117–1120, Aug. 1999.
[21] C. Brites, J. Ascenso, and F. Pereira, "Improving transform domain Wyner–Ziv video coding performance," IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2006.
[22] B. Girod, "The efficiency of motion-compensated prediction for hybrid coding of video sequences," IEEE Journal on Selected Areas in Communications, Vol. SAC-5, No. 7, pp. 1140–1154, Aug. 1987.
[23] T. Berger, Rate Distortion Theory, Prentice Hall, Englewood Cliffs, NJ, 1971.
[24] L. Piccarreta, A. Sarti, and S. Tubaro, "An efficient video rendering system for real-time adaptive playout based on physical motion field estimation," EURASIP European Signal Processing Conference, Antalya, Turkey, Sep. 2005.
[25] A. Aaron, S. Rane, E. Setton, and B. Girod, "Transform-domain Wyner–Ziv codec for video," Visual Communications and Image Processing, San Jose, CA, Jan. 2004.
[26] M. Tagliasacchi, A. Trapanese, S. Tubaro, J. Ascenso, C. Brites, and F. Pereira, "Exploiting spatial redundancy in pixel domain Wyner–Ziv video coding," IEEE International Conference on Image Processing, Atlanta, GA, Oct. 2006.
[27] J. Ascenso, C. Brites, and F. Pereira, "Interpolation with spatial motion smoothing for pixel domain distributed video coding," EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Slovak Republic, Jul. 2005.
[28] M. Ouaret, F. Dufaux, and T. Ebrahimi, "Fusion-based multiview distributed video coding," ACM International Workshop on Video Surveillance and Sensor Networks, Santa Barbara, CA, Oct. 2006.
[29] X. Artigas, E. Angeli, and L. Torres, "Side information generation for multiview distributed video coding using a fusion approach," Nordic Signal Processing Symposium, Reykjavik, Iceland, Jun. 2006.
[30] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proc. IEEE, Vol. 93, No. 1, pp. 71–83, Jan. 2005.
[31] S. Wenger, "Proposed error patterns for Internet experiments," Doc. VCEG Q15-I-16r1, VCEG Meeting, Red Bank, NJ, October 1999.
[32] W.E. Ryan, A Turbo Code Tutorial, http://www.ece.arizona.edu/~ryan/.
[33] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, pp. 284–287, Mar. 1974.
[34] A. Aaron, S. Rane, D. Rebollo-Monedero, and B. Girod, "Systematic lossy forward error protection for error resilient digital video broadcasting – a Wyner–Ziv coding approach," International Conference on Image Processing, Singapore, Oct. 2004.
[35] R. Zhang, S.L. Regunathan, and K. Rose, "Video coding with optimal inter/intra mode switching for packet loss resilience," IEEE Journal on Selected Areas in Communications, Vol. 18, No. 6, pp. 966–976, 2000.
[36] V. Bocca, M. Fumagalli, R. Lancini, and S. Tubaro, "Accurate estimate of the decoded video quality: extension of ROPE algorithm to half-pixel precision," Picture Coding Symposium, San Francisco, CA, Dec. 2004.
[37] M. Fumagalli, M. Tagliasacchi, and S. Tubaro, "Expected distortion of video decoded DCT-coefficients in video streaming over unreliable channel," International Workshop on Very Low Bitrate Video Coding, Costa del Rei, Italy, Sep. 2005.
[38] D.P. Varodayan, "Wyner–Ziv coding of still images with rate estimation at the encoder," Technical Report, Stanford University.
[39] N. Färber and B. Girod, "Feedback-based error control for mobile video transmission," Proc. IEEE, Vol. 87, pp. 1707–1723, Oct. 1999.
5 Non-normative Video Coding Tools

5.1 Introduction

The definition of video coding standards is of the utmost importance because it guarantees that video coding equipment from different manufacturers will be able to interoperate. However, the definition of a standard also represents a significant constraint for manufacturers, because it in some way limits what they can do. Therefore, in order to minimize the restrictions imposed on manufacturers, only the tools that are essential for interoperability are typically specified in the standard: the normative tools. The remaining tools, which are not standardized but are also important in video coding systems, are referred to as non-normative tools, and this is where competition and evolution of the technology take place. In fact, this strategy of specifying only the bare minimum that guarantees interoperability ensures that the latest developments in the area of non-normative tools can easily be incorporated in video codecs without compromising their standard compatibility, even after the standard has been finalized. In addition, this strategy makes it possible for manufacturers to compete with each other and differentiate their products in the market. A significant amount of research effort is being devoted to the development of non-normative video coding tools, with the target of improving the performance of standard video codecs. In particular, due to their importance, non-normative rate control and error-resilience tools are being researched. This chapter addresses the development of efficient tools for the modules that are non-normative in video coding standards, such as rate control and error concealment.
For example, multiple video sequence (MVS) joint rate control will address the development of rate control solutions for encoding video scenes represented as a composition of video objects (VOs), such as in the MPEG-4 standard, and the joint encoding and transcoding of multiple video sequences (VSs) to be transmitted over bandwidth-limited channels using the H.264/AVC standard. This chapter describes the design, specification, and evaluation of bit-rate control algorithms for object-based and frame-based video coding architectures. Rate control algorithms will also be used in conjunction with statistical multiplexing in order to enable simultaneous transmissions of RD-optimized video streams over bandwidth-limited channels. The chapter will also look into object-based error resilience, where the problem of maximizing the video quality when
transmitting object-based video data through error-prone channels will be investigated. Improved motion estimation tools will also be studied, with the aim of reducing the encoding complexity. The topics to be addressed in this chapter are:

. Rate Control
. Rate Control Architecture for Joint MVS Encoding and Transcoding
. Bit Allocation and Buffer Control for MVS Encoding Rate Control
. Rate Control for H.264/AVC Joint MVS Encoding
. Optimal Rate Allocation for H.264/AVC Joint MVS Transcoding
. Error Resilience
. Spatio-temporal Scene-level Error Concealment for Segmented Video
. An Integrated Error-resilient Object-based Video Coding Architecture
. A Robust FMO Scheme for H.264/AVC Video Transcoding.
In order to better understand the novelty of the techniques proposed later in this chapter, the state of the art of these topics will be briefly described in Section 5.2.
5.2 Overview of the State of the Art

5.2.1 Rate Control

The rate control mechanism is one of the video coding tools not normatively specified in any of the currently available and emerging video coding standards, since this is not necessary for interoperability. However, rate control is related to one of the main degrees of freedom for improving the performance of standard-based systems. Generically, the major objectives of video coding rate control, whatever the coding architecture, can be summarized as:

. Regulation of the video encoder output data rate according to specific constraints.
. Maximization of the subjective impact of the decoded video.
In frame-based coding architectures, the rate control mechanism usually has the necessary degrees of freedom to choose the best tradeoff in terms of spatial and temporal resolution, as well as to introduce (controlled) distortion in the texture data, in order to maximize the global subjective quality of the decoded video given the resources and conditions at hand. In object-based coding architectures, the degrees of freedom of frame-based video coding rate control appear for each object in the video scene. Additionally, there is the shape data, which defines the shape of each object, and the scene description data, which specifies which objects are in the scene and how the scene is organized. The major novelty here is the semantic dimension of the data model, and consequently of the rate control mechanism. This mechanism can decide to perform actions such as not transmitting a less relevant object in order to save bits for the most semantically relevant objects, or dynamically allocating the available bits to the various objects depending on their subjective and semantic relevance. Therefore, for object-based coding, the relevant criteria to be used for rate control are related not only to the texture and shape characteristics of each object but also to their semantic dimension. The latter is in turn related to the priority and relevance of each object in the context of the scene and the application in question.
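To make the semantic dimension concrete, consider the following Python sketch. It is entirely illustrative (the object names, weights, and threshold are invented, not part of any standardized MPEG-4 mechanism): a frame bit budget is distributed among video objects in proportion to a single scalar relevance weight, and objects below a minimum relevance are dropped so their bits go to the more important objects.

```python
def allocate_bits(objects, budget, min_weight=0.05):
    """Distribute `budget` bits among video objects by relevance weight.

    `objects` maps object name -> relevance weight (texture, shape, and
    semantic importance collapsed into one scalar, a simplification).
    Objects below `min_weight` are not transmitted at all, saving their
    bits for the semantically more relevant objects.
    """
    kept = {name: w for name, w in objects.items() if w >= min_weight}
    total = sum(kept.values())
    return {name: int(budget * w / total) for name, w in kept.items()}

# A hypothetical news scene: the speaker matters most, the logo least.
scene = {"speaker": 0.6, "background": 0.3, "logo": 0.02}
print(allocate_bits(scene, budget=24000))
# -> {'speaker': 16000, 'background': 8000}   (the logo is dropped)
```

A real object-based controller would of course derive the weights from shape, motion, and texture statistics rather than fix them by hand; the sketch only shows how the semantic weighting changes the allocation.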
Although rate control is not normatively specified in any available video coding standard, an informative annex on this issue is typically issued for each standard, aiming to provide adequate rate control solutions in terms of RD-complexity performance. The ITU-T H.261 [1] Reference Model 8 (RM8) [2] suggests a simple rate control algorithm which aims at controlling the encoder output bit rate by providing basically only encoder rate buffer fullness control, notably to avoid encoder rate buffer overflow. This algorithm essentially relies on a compensation mechanism that relates the encoder rate buffer fullness to the macroblock (MB) quantization. The MPEG-1 video coding standard [3] suggests a similar approach to RM8. However, in this case a different target number of bits should be used for each picture coding type (I, P, and B): typically B-pictures should be coded with fewer bits, P-pictures with two to five times more bits than B-pictures, and I-pictures with up to three times more bits than P-pictures. The solution proposed in [4] was typically used as a benchmark for MPEG-1 Video rate control. This solution is based on MB classification and adaptive quantization according to an MB complexity measure and the buffer occupancy. The MPEG-2 Video [5] Test Model 5 (TM5) [6] rate control algorithm aims at controlling the encoder output bit rate through an adaptive picture bit allocation step, followed by an MB quantization parameter adjustment step. This algorithm avoids the main drawback of uniform bit allocation by allocating to each picture a number of bits proportional to its estimated complexity (depending, essentially, on the picture coding type), which is estimated based on past encoding results.
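The TM5 picture-level bit allocation step can be sketched as follows. This is a simplified rendering of the TM5 target-bit formulas only (the virtual-buffer feedback and adaptive-quantization steps of TM5 are omitted); K_P and K_B are the constants TM5 suggests, and the figures used in any example are arbitrary.

```python
# Simplified sketch of the TM5 picture-level bit allocation step.
K_P, K_B = 1.0, 1.4  # TM5 "universal" constants

def tm5_target_bits(R, Np, Nb, Xi, Xp, Xb, bit_rate, picture_rate, ptype):
    """Target bits for the next picture of type `ptype` ('I', 'P' or 'B').

    R          : bits remaining for the current GOP
    Np, Nb     : P- and B-pictures still to be encoded in the GOP
    Xi, Xp, Xb : complexity estimates (bits * quantizer of the last
                 encoded picture of each type)
    """
    if ptype == 'I':
        t = R / (1 + Np * Xp / (Xi * K_P) + Nb * Xb / (Xi * K_B))
    elif ptype == 'P':
        t = R / (Np + Nb * K_P * Xb / (K_B * Xp))
    else:  # 'B'
        t = R / (Nb + Np * K_B * Xp / (K_P * Xb))
    # TM5 never allocates less than one picture interval's worth of
    # channel bits divided by 8:
    return max(t, bit_rate / (8 * picture_rate))
```

Allocating proportionally to the complexity ratios Xi : Xp : Xb is exactly what lets TM5 avoid the uniform-allocation drawback described above.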
ITU-T H.263 [7] TMN8 [8] suggests a rate control method based on rate-quantization models and model adaptation, defining two levels of rate control: (a) frame-level rate control, where the temporal resolution and the frame bit allocation are adjusted based on the buffer occupancy; and (b) MB-level rate control, where the MB quantization parameter is computed based on the rate-quantization model and an MB complexity measure.

For object-based video coding, notably MPEG-4 video coding [19], several rate control solutions were suggested and adopted during the standard development in the various MPEG-4 video verification models, for example [9] and [11]. For frame-level rate control, the technique based on the rate-distortion (RD) model proposed in [10] is suggested, while for multiple-video-object rate control the technique proposed in [12] was adopted. Besides these suggested techniques, other proposals have been made, for example [13-15].

For H.264/AVC [16], the rate control solution proposed in [17] has been included as an informative part of the standard. This solution specifies three levels of rate control:

1. GOP-level rate control, which allocates the number of bits available for encoding the remaining pictures in each GOP.
2. Picture-level rate control, which uses a quadratic rate-distortion model to compute an estimate of the picture QP to be used for rate-distortion optimization (RDO).
3. Basic-unit-level rate control, which estimates the mean absolute difference (MAD) of a given set of MBs (the basic unit) in order to compute, with finer granularity, the quantization parameter to be used in RDO.

Although the H.264/AVC standard is defined for single video sequence encoding, there are several video coding scenarios that can benefit from jointly encoding several video sequences (see Section 5.3). In this context, statistical multiplexing allows better utilization
of available bandwidth for the transmission of several video sequences over the common channel. This feature is useful in applications such as broadcasting and video streaming over networks. Sophisticated algorithms are used to ensure accurate control of the size of the output stream in reasonable time. The bit allocation process in the H.264/AVC encoder [16] is performed only through the selection of the quantization parameter (QP). One of the most important elements in controlling this process is the RD model. The model built in the MAD domain is nonlinear and its accuracy is limited [10]. The RD model built in the domain of the parameter ρ, denoting the percentage of zero-quantized transform coefficients [18], provides much better results in terms of estimation accuracy, robustness, and complexity.
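A minimal sketch of how a ρ-domain controller might select a quantizer is given below. The linear model R(ρ) = θ(1 − ρ) follows [18]; the mapping from a QP index to a dead-zone threshold used here is a made-up stand-in for the codec's real quantization rule, and θ would in practice be estimated from previously encoded frames.

```python
def rho_for_qp(coeffs, dead_zone):
    """Fraction of transform coefficients that quantize to zero when every
    coefficient below `dead_zone` is discarded (a stand-in for the real
    QP-dependent quantization rule)."""
    return sum(1 for c in coeffs if abs(c) < dead_zone) / len(coeffs)

def select_qp(coeffs, target_bits, theta, qp_dead_zones):
    """Pick the smallest QP index whose predicted rate fits `target_bits`.

    The rho-domain model predicts R(rho) = theta * (1 - rho): rate is
    linear in the fraction of non-zero quantized coefficients.
    """
    for qp, dz in enumerate(qp_dead_zones):
        rho = rho_for_qp(coeffs, dz)
        if theta * (1 - rho) <= target_bits:
            return qp
    return len(qp_dead_zones) - 1  # coarsest quantizer as a fallback

coeffs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(select_qp(coeffs, target_bits=300, theta=1000,
                qp_dead_zones=[1, 2, 4, 8, 16]))   # -> 3
```

Because ρ is computed directly from the coefficient statistics, no iterative trial encoding is needed, which is where the accuracy and complexity advantages over the MAD-domain model come from.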
5.2.2 Error Resilience

Since the work related to non-normative error-resilience tools is clearly divided into two parts, one devoted to object-based systems and the other to frame-based systems, the same division will be considered here for the state-of-the-art description.

5.2.2.1 Error Resilience for Object-based Systems

The emergence of the MPEG-4 object-based audiovisual coding standard [19] opened up the way for new video services, where scenes consist of a composition of objects. In order to make these object-based services available in error-prone environments, such as mobile networks or the Internet, with an acceptable quality, appropriate error-resilience techniques dealing with shape and texture data are necessary. Since both encoder and decoder can influence the behavior of the video coding system in terms of error resilience, techniques have been proposed in the literature for both sides of the communication chain. At the encoder side, the available error-resilience techniques make it possible to better protect the bitstream against channel errors (i.e. protective error-resilience techniques); by doing so, the video decoding performance in the presence of errors is much improved. Two such object-based techniques are known in the literature [20, 21]. At the decoder side, the objective is instead to take defensive actions in order to minimize the negative impact of the channel errors on the decoded video quality (i.e. defensive error-resilience techniques). To this end, several object-based error-concealment techniques have been proposed in the literature, for example [22-27] for shape and [28] for texture.

5.2.2.2 Error Resilience for Frame-based Systems

The H.264/AVC standard [16] represents the state of the art in frame-based video coding and includes several tools that can be enabled to enhance the robustness of the bitstream to channel errors.
Techniques such as intra-coding refreshment (at the MB/slice/frame level), flexible macroblock ordering (FMO), SP/SI frames, and multiple reference frames can all be used (independently or in combination) to successfully combat channel errors. However, since the standard only defines the bitstream syntax of these tools, their proper use and optimization is left open to the codec designer. This creates an opportunity for researchers to propose more advanced error-resilience schemes.
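As an illustration of FMO, the sketch below builds a checkerboard-style ("dispersed") macroblock-to-slice-group map in the spirit of H.264's dispersed map type. This is an illustrative pattern, not the exact map formula defined in the standard.

```python
def dispersed_fmo_map(mb_width, mb_height, num_groups=2):
    """Checkerboard-style macroblock-to-slice-group map.

    With two groups this is the classic checkerboard: horizontally and
    vertically adjacent MBs land in different slice groups, so if one
    slice is lost, every damaged MB is surrounded by intact neighbours,
    which makes spatial error concealment far more effective.
    """
    return [[(x + y) % num_groups for x in range(mb_width)]
            for y in range(mb_height)]

for row in dispersed_fmo_map(4, 3):
    print(row)
# -> [0, 1, 0, 1]
#    [1, 0, 1, 0]
#    [0, 1, 0, 1]
```

The price of this robustness is coding efficiency: spatial prediction across slice-group boundaries is not allowed, so dispersed maps trade compression for concealability.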
5.3 Rate Control Architecture for Joint MVS Encoding and Transcoding

5.3.1 Problem Definition and Objectives

Compliant video encoding, whatever the video coding standard, requires adequate control of the video encoder, meeting the relevant constraints of the encoding framework, notably bit rate, delay, complexity, and quality. Several relevant applications may require the transmission of multiple video sequences, or of video scenes composed of multiple VOs, through a common constant bit rate (CBR) channel, for example broadcasting, video monitoring and surveillance, telemedicine, and video conferencing. In this context, two different rate control scenarios have been identified:

1. Joint MVS encoding scenario: In this scenario multiple video sequences (or a video scene composed of multiple VOs) are jointly encoded, dynamically sharing along time the available channel bit rate and buffer constraints. Figure 5.1(a) illustrates the joint encoding of multiple frame-based video sequences, compliant with the H.264/AVC video coding standard. In this case, the rate controller is responsible for controlling multiple H.264/AVC encoders, deciding the optimal (in an RD-complexity sense) spatio-temporal resolutions, bit allocations, and corresponding sets of coding parameters for each input raw video sequence.
Figure 5.1 Rate control scenarios: (a) joint multiple video sequence encoding; (b) joint multiple video stream transcoding
2. Joint MVS transcoding scenario: In this scenario multiple video streams (n = 1, ..., N_{VS}), encoded at individual constant bit rates R_n, are to be transmitted over a CBR channel such that R_T < \sum_{n=1}^{N_{VS}} R_n, where R_T is the available channel bit rate. Since the available channel bit rate, R_T, is lower than the aggregated bit rate, \sum_{n=1}^{N_{VS}} R_n, a transcoding operation is required, aiming at dynamically reallocating the available channel bit rate among the various streams. Figure 5.1(b) illustrates the joint transcoding of multiple video streams encoded compliantly with H.264/AVC. In this case, the rate controller is responsible for controlling multiple H.264/AVC transcoders, deciding the new optimal (in an RD-complexity sense) bit allocations and the corresponding set of coding parameters for each input coded VS.

Pleasant visual consumption requires that the video data be coded with approximately constant quality. In an MVS encoding scenario this will require:
- Smoothly changing quality along time for each VS.
- Smoothly changing quality between the various VSs.
However, independent coding of the various VSs transmitted through the common CBR channel does not lead to the optimal global RD tradeoff. Therefore, joint encoding of multiple VSs, adaptively sharing the available resources between the several VSs, may lead to a better global RD tradeoff. For low-delay video encoding under reasonable encoding complexity, feedback rate control assumes an important role, since it is not usually viable to encode the same content multiple times, as in pure feedforward rate control: while feedback methods react a posteriori to deviations from the ideal encoding behavior, feedforward methods infer in advance the results of a given set of encoding decisions, prior to encoding. Therefore, adequate modeling of the encoding process is required for efficient rate control. In this context, this work proposes an improved rate control architecture for low-delay video encoding that can efficiently handle the encoding of single and multiple VSs (rectangular or arbitrarily shaped), combining the reaction capabilities of feedback rate control methods with the prediction capabilities of feedforward rate control methods.
5.3.2 Proposed Technical Solution

Figure 5.2 presents the proposed rate control architecture for jointly controlling the encoding of multiple VSs, in order to meet the relevant constraints of the encoding scenario. This architecture is composed of three major building blocks:

- Video encoder: Responsible for encoding the original video content (i.e. the multiple VSs) into a set of bitstreams (one for each VS). This block is composed of: (a) a frame buffer, where the original frames are stored; (b) a symbol generator, which reduces the redundancy and irrelevancy of the original video data, generating adequate coded symbols; (c) an entropy coder, which reduces the statistical redundancy of the coded symbols, converting them into bit codes; and finally (d) a video multiplexer, responsible for organizing the coded symbols according to the adopted video coding syntax.
- Video buffering verifier: A set of normative models, each one defining rules and limits to verify whether the amount required of a specific type of decoding resource is within the bounds allowed by the corresponding profile and level specification. The rate control algorithm must use these rules and limits to define the control actions that drive the video encoder without violating this mechanism.
- Rate controller: The mechanism responsible for controlling the video encoder, aiming at efficiently encoding the original video data while producing a set of bitstreams that does not violate the video buffering verifier¹ mechanism. Essentially, this mechanism is composed of six modules (described in the next subsections), which, based on statistics computed from the input data stored in the frame buffer (feedforward information) and statistics computed through the different video buffering verifier models (feedback information), compute a multidimensional control signal (e.g. encoding time instants, MB coding modes, and quantization parameters) that will command the encoding process.
The adequate balancing between feedback and feedforward rate control, supported by the proposed architecture through different hierarchical levels of action and modeling, constitutes the major innovation of this architecture compared to available rate control methods.

Notice that, according to the two scenarios presented in Section 5.3.1, the input video data can be either raw VSs (frame-based or object-based) or coded bitstreams representing rectangular VSs with a particular spatio-temporal resolution, encoded at a given constant bit rate. In the case of the joint MVS encoding scenario, the input data is raw VSs, which are stored in a frame buffer; for a given encoding time instant, this buffer stores the frames of all VSs to be encoded at that particular time instant, or the video object planes (VOPs) of all the VOs to be encoded for the given video scene. In the case of the joint MVS transcoding scenario, the input data is coded bitstreams, which are first decoded; the resulting symbols (e.g. motion vectors and transform coefficients) are then forwarded to the encoder symbol generator for reprocessing.

In the case of object-based video, the proposed rate control architecture is used to control object-based video encoders that produce MPEG-4 Visual standard bitstreams, while for frame-based video the proposed architecture is used to control H.264/AVC encoders (see Figure 5.2).

5.3.2.1 Scene Analysis for Resource Allocation

This module extracts relevant information for rate control from the input video data, for each encoding time instant, before actually encoding it; notably, the size of the video data units (frames or VOPs), the object activity, derived from the amplitude of the motion vectors, and the texture coding complexity, derived from the prediction error variance. The relevant data (MB organized) is then fed forward to the other rate controller modules in order to guide their actions, notably the spatio-temporal resolution control, the bit allocation, and the coding mode control modules.
¹ In the case of H.264/AVC, this mechanism is called the hypothetical reference decoder (HRD).
[Figure 5.2 is a block diagram: the rate controller (scene analysis for resource allocation, bit allocation, rate-distortion modeling, video buffering verifier control, spatio-temporal resolution control, and coding mode control modules) drives the video encoder (frame buffer, symbol generator, entropy coder, and video multiplexer) under the video buffering verifier; input coded bitstreams pass through a video decoder (stream buffer, video demultiplexer, entropy decoder, and symbol decoder) before re-encoding.]

Figure 5.2 Architecture of the proposed MVS rate control mechanism
5.3.2.2 Spatio-temporal Resolution Control

This module is conceptually responsible for deciding the appropriate spatial and temporal resolutions of the input video data, trying to achieve the best tradeoff between spatial quality and motion smoothness, taking into account, for example, the status of the various video buffering verifier buffers, the available bits for the current encoding time instant, and the motion activity of the different VSs.

5.3.2.3 Rate-distortion Modeling

This module provides mathematical descriptions of the encoding process, relating the video data rate and distortion characteristics with the coding parameters. In order to be able to predict in advance the behavior of the scene encoder, notably the joint behavior of the symbol generator and the entropy coder, it is necessary to have some mathematical description of this behavior that can be adapted to the actual encoding results during the encoding process, typically through parameter estimation. RD models are used to find optimal bit allocations and map them into coding parameters, notably the quantization parameter that will be used to encode each coding unit, i.e. a given frame, VOP, or MB inside a frame or VOP. Therefore, for rate control purposes, the encoder should be modeled by RD functions at the different levels of rate control operation, i.e. frame-level, MB-level, and, eventually, DCT-level. With these functions, the rate controller is able to predict the behavior of the video encoder with relative accuracy, aiming at reducing the amount of compensation needed to bring the video encoder to the ideal behavior, thus reducing the computational cost of controlling the encoder.
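As an example of such an RD model, the quadratic rate model used in the informative H.264/AVC rate control [17] predicts the texture rate as R = c1·MAD/Q + c2·MAD/Q², which can be solved in closed form for the quantization step Q given a bit target. The sketch below does exactly that; the model parameters c1 and c2 would be re-estimated after each encoded unit, and the values used in any example are illustrative only.

```python
import math

def qstep_from_quadratic_model(target_bits, mad, c1, c2):
    """Solve the quadratic rate model R = c1*MAD/Q + c2*MAD/Q^2 for the
    quantization step Q, given a target texture bit budget.

    Rearranged: target*Q^2 - c1*MAD*Q - c2*MAD = 0, so Q is the positive
    root of a quadratic in Q.
    """
    a, b, c = target_bits, -c1 * mad, -c2 * mad
    disc = b * b - 4 * a * c          # always >= b*b for c2 >= 0
    return (-b + math.sqrt(disc)) / (2 * a)
```

Plugging the returned Q back into the model reproduces the target rate, which is a convenient self-check when re-estimating c1 and c2 from actual encoding results.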
5.3.2.4 Bit Allocation

This module is responsible for properly allocating the available bit-rate resources. However, in an MVS encoding scenario, the bit allocation task can be rather complex, since the different VSs may have different characteristics in terms of size, motion, and texture complexity, and may be encoded with different temporal resolutions. Consequently, following a divide-and-conquer strategy, the bit allocation can be partitioned into several hierarchical levels, similarly to the syntactic organization of the encoded video data:

- Group of scene planes (GOS): The set of frames (or VOPs) of all VSs between two random access points (I-frames), assuming that all sequences are encoded with synchronized I-frames, typically encoded with a constant number of bits.
- Scene plane (SP): The set of frames of all VSs to be encoded at a particular encoding time instant.
- Frame or video object plane (VOP): The sample of each VS at a given encoding time instant.
- Macroblock (MB): The smallest coding unit for which the quantization parameter can be changed.
For each of these levels, the bit allocation will provide an adequate bit allocation strategy and a corresponding compensation method, responsible for dealing with deviations between the idealized behavior and the actual encoding results.

5.3.2.5 Video Buffering Verifier Control

This rate controller module is responsible for controlling the MVS encoder in such a way that the underlying video buffering verifier mechanism is not violated and, therefore, the set of bitstreams produced by the MVS encoder can be considered compliant with the profile@level selected to encode each VS, if any. The main purpose of this module is to provide guidance to the spatio-temporal resolution control, the bit allocation, and the coding mode control modules on the status of the various video buffering verifier models, and consequently to assist these modules in their respective tasks regarding the constraints of the video buffering verifier mechanism. This module should aim at preventing situations where strong measures have to be taken in order to avoid violations of the video buffering verifier mechanism; that is to say, soft control should be favored over hard control.

5.3.2.6 Coding Mode Control

This module is conceptually responsible for deciding the appropriate coding parameters of each coding unit, i.e. the MB texture coding modes, the MB motion vectors (if applicable), and the MB quantization parameter. This module typically has to deal with conflicting goals: one goal is to provide smooth quality variations inside each frame, which usually requires keeping the frame quantization parameter approximately constant, while another is to accurately control the bit production so as not to violate the video buffering verifier constraints, which usually requires fine control of the quantization parameter at MB level.
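The soft-control idea can be illustrated with a one-frame step of a simple VBV/HRD-style buffer model. This is an illustrative sketch, not the normative verifier: it advances the encoder rate buffer by one frame interval and reports whether the proposed allocation stays within bounds, so the controller can back off before a hard violation occurs.

```python
def vbv_ok(buffer_bits, frame_bits, drain_per_frame, buffer_size):
    """Advance a simple VBV-style buffer model by one frame interval.

    buffer_bits     : current encoder rate-buffer fullness
    frame_bits      : bits the encoder wants to spend on the next frame
    drain_per_frame : channel bits removed per frame interval (R / frame_rate)

    Returns the clamped new fullness and a flag telling whether the
    allocation keeps the buffer within [0, buffer_size].
    """
    new_fullness = buffer_bits + frame_bits - drain_per_frame
    overflow = new_fullness > buffer_size
    underflow = new_fullness < 0
    return max(0, min(new_fullness, buffer_size)), not (overflow or underflow)
```

A soft controller would call this with the candidate bit allocation before encoding, and shrink or grow the allocation while the flag is false, instead of taking drastic measures (e.g. frame skipping) after the fact.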
5.3.3 Performance Evaluation

The performance of the rate control architecture proposed above is evaluated here for single video object (SVO) encoding. The multiple video object (MVO) performance is analyzed in Section 5.4.
The video rate buffer verifier (VBV) buffer size, BS, is set to 0.5 s, i.e. BS = R/2 bits (R: target bit rate), and two random access conditions are tested: (a) one random access point (I-VOP) every second (label IP = 1 s); (b) a single random access point at the beginning of the sequence (IPP...) (label IP = 10 s). Additionally, two spatio-temporal encoding conditions are considered: QCIF at 15 Hz (64-256 kbps) and CIF at 15 Hz (128-512 kbps).

For the SVO tests, six representative test sequences at 30 Hz have been selected: the Football (260 frames), Kayak (220 frames), and Stefan (300 frames) sequences represent the high-motion video sequences, typically more difficult to encode, while the Foreman (300 frames), Mother & Daughter (300 frames), and News (300 frames) sequences represent the low-motion video sequences, typically easier to encode. The SVO rate control algorithm (IST) integrated in the proposed architecture is compared with the VM8 rate control algorithm implemented in the MPEG-4 Visual MoMuSys reference software [31].

To illustrate the advantages of the proposed solution, the two rate control algorithms are compared here in terms of average quality achieved, measured as the average over time of the peak signal-to-noise ratio (PSNR) of the luminance component between the original and the reconstructed VOPs at the decoder. In order to compactly compare the PSNR curves obtained with the two algorithms over the range of target bit rates considered, the tool developed by the ITU-T Video Coding Experts Group (VCEG) in [40] is used. This justifies presenting the PSNR and bit-rate gains instead of exhaustively presenting the base values. Tables 5.1 and 5.2 summarize the PSNR and bit-rate gains of the proposed SVO solution. It is worthwhile to mention that both algorithms achieve the target bit rates within a deviation of less than 1% on average for all sequences and spatio-temporal conditions tested.
From Tables 5.1 and 5.2, the following conclusions may be derived:

- The proposed algorithm shows its biggest gains for CIF at 15 Hz with a random access point every second: approximately 1 dB more in terms of PSNR and 17% less in terms of bit rate.
- The smallest gain of the proposed algorithm is achieved for QCIF at 15 Hz with a single random access point at the beginning of the sequence: approximately 0.4 dB more in terms of PSNR and 7% less in terms of bit rate.
- Typically, the proposed algorithm exhibits higher gains for the high-motion sequences and for the shorter random access conditions, i.e. for the more difficult encoding conditions.
Table 5.1 SVO average PSNR and bit-rate gains for QCIF at 15 Hz

               PSNR gain (dB)            Bit-rate gain (%)
Sequence    IP = 1 s   IP = 10 s      IP = 1 s   IP = 10 s
Football      0.55       0.62           12.07      13.47
Kayak         0.43       0.42            8.47       8.45
Stefan        0.14       0.17            2.72       3.40
Foreman       0.18       0.09            3.42       1.83
M&D           0.76       0.57           13.67      12.14
News          1.12       0.20           13.66       3.06
Average       0.53       0.35            9.00       7.06
Table 5.2 SVO average PSNR and bit-rate gains for CIF at 15 Hz

               PSNR gain (dB)            Bit-rate gain (%)
Sequence    IP = 1 s   IP = 10 s      IP = 1 s   IP = 10 s
Football      1.17       1.18           19.30      21.62
Kayak         1.58       1.53           25.96      25.44
Stefan        0.89       0.81           13.49      12.26
Foreman       0.54       0.53           10.72      11.17
M&D           0.66       0.40           15.27      10.97
News          1.39       0.77           21.05      14.18
Average       1.04       0.87           17.63      15.94
These results are partly justified by the efficient interaction between the different modules of the proposed rate control architecture, notably between the bit allocation, the video buffering verifier control, and the coding mode control.
5.3.4 Conclusions

This work proposes a novel rate control architecture for single and multiple video sequence encoding (frame-based or object-based), supporting the important requirements of a generic rate control algorithm and, in particular, the constraints of MPEG-4 object-based compliant video encoding. A rate control algorithm integrated in the proposed architecture was compared with the usual reference algorithm, namely VM8 for SVO encoding, and its gains were assessed. In summary, for SVO encoding, the proposed rate control solution outperforms the VM8 solution in terms of the average quality achieved.
5.4 Bit Allocation and Buffer Control for MVS Encoding Rate Control

(Portions reprinted, with permission, from ICIP 2007, "Improved feedback compensation mechanisms for multiple video object encoding rate control", and IEEE CSVT, P. Nunes, F. Pereira, "Joint rate control algorithm for low-delay MPEG-4 object-based video encoding", © 2007 IEEE.)
5.4.1 Problem Definition and Objectives

In delay-constrained constant bit rate (CBR) video encoding, the bit-rate variability is handled through a smoothing bitstream buffer, in order to achieve a constant average bit rate measured over short periods of time. In this context, the rate controller is faced with two conflicting goals: (a) keep the bitstream buffer occupancy within permitted bounds, which typically requires finely adjusting the encoding parameters, for example the MB quantization parameter (QP), to produce more or fewer encoding bits according to the buffer occupancy tendency; (b) adjust the encoding parameters aiming to maximize the subjective quality of the decoded video, which
typically requires slowly changing the QP (both spatially, between adjacent MBs, and temporally, between successive encoding time instants). To accomplish these goals, the rate controller needs: (a) to properly allocate the available bit rate, taking into account the video data complexity and the relevant video buffering verifier mechanism constraints; and (b) to compute the coding parameters that will lead to the estimated bit allocations.

As referred to in Section 5.3.1, for low-delay video encoding under reasonable encoding complexity, feedback rate control assumes an important role, since it is not usually viable to encode the same content multiple times. Therefore, for feedback rate control methods, adequate compensation mechanisms are needed to deal with deviations between the idealized and the actual encoder behavior. In order to control the output of an MVO encoder so that the VBV mechanism is not violated and the quality of the decoded VOs is kept approximately constant (along time, and among the several VOs making up the video scene at each encoding time instant), the bit-rate allocation should simultaneously take into account the following aspects: (a) VO coding complexity, based on the VOP content and the VO temporal resolution; (b) VO distortion feedback compensation; and (c) a dynamic VBV target buffer occupancy according to the scene coding complexity. To the authors' knowledge, these three aspects are not simultaneously handled in published MVO rate control (RC) solutions [32-38], leading to inefficient use of the coding resources, i.e. buffer size and bit rate.
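The distortion feedback in aspect (b) relies on measuring the encoding quality of each VO. As a minimal illustration, the MSE between original and reconstructed luminance samples, and the corresponding PSNR, can be computed as follows (flat sample lists are used here for simplicity):

```python
import math

def mse(orig, recon):
    """Mean square error between original and reconstructed luminance
    samples (flat lists of pixel values)."""
    return sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)

def psnr(mse_value, peak=255.0):
    """PSNR in dB from an MSE value (8-bit video: peak = 255)."""
    if mse_value == 0:
        return float('inf')
    return 10 * math.log10(peak ** 2 / mse_value)
```

A feedback controller would track these per-VO distortions over time and steer the bit allocation towards the objects whose quality is drifting below the scene average.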
5.4.2 Proposed Technical Approach

This work proposes an improved bit allocation strategy for low-delay MVO encoding, using the VBV mechanism feedback information and the VO encoding quality, measured as the mean square error (MSE) between the original and the reconstructed video data. Relative to the typical benchmark for object-based rate control algorithms – the MPEG-4 video verification model (VM) [32] – the proposed solution can provide more efficient bit allocation, leading to higher average peak signal-to-noise ratios (PSNRs) and smoother quality variations, which improve the user experience.

5.4.2.1 Bit Allocation for MVO Encoding

Group of Scene Planes (GOS) Bit Allocation

The GOS covers the set of all encoding time instants between two random access points, typically encoded with a constant number of bits (in CBR scenarios). GOSs may be composed of VOs with different VOP rates. In the case of a single VO, a GOS becomes a group of video object planes (GOV). The rate control algorithm aims at allocating to each GOS a nominal number of bits (luminance, chrominance, and relevant auxiliary data), \bar{T}_{GOS}, that is proportional to the GOS duration, i.e.:

\bar{T}_{GOS}[m] = R[m] \, (t_{GOS}[m+1] - t_{GOS}[m]), \quad m = 1, \ldots, N_{GOS}    (5.1)

where R[m] and t_{GOS}[m] are, respectively, the average target bit rate and the starting time instant of GOS m, and N_{GOS} is the number of GOSs in the sequence. (Reproduced by permission of © 2007 IEEE.)
Deviations from the expected results are compensated for through the following feedback compensation equation: Xm 1 TGOS ½m ¼ T GOS ½m þ KGOS ðT GOS ½k SGOS ½kÞ ð5:2Þ k¼1 where SGOS[k] is the number of bits used to encode the GOS k, and KGOS is given by: KGOS ¼ 1=minðaN ; NGOS m þ 1Þ
ð5:3Þ
with aN ¼ max½1; d3BS =T GOS ½me, BS the VBV buffer size. The rationale for setting KGOS 1 is to avoid large quality fluctuations between adjacent GOSs, notably when scene changes occur inside a given GOS, and the bit allocation error compensation would penalize the upcoming GOS, if KGOS ¼ 1. With this approach, GOS bit allocation deviations are smoothed through aN GOSs, if the buffer size is sufficiently large to accommodate these bit-rate variations. Scene Plane (SP) Bit Allocation The SP represents the set of VOPs of all VOs to be encoded at a particular encoding time instant (not all VOs of a given scene need to have VOPs to be encoded in every SP). At the SP-level, in order to obtain approximately constant quality along a GOS, each SP should get a nominal target number of bits that is a fraction of the GOS target (5.2), proportional to the amount and complexity of the VOs to be encoded in that particular time instant. The coding complexity of a given VOP n in SP p of GOS m is given by: XVOP ½n ¼ aT ½n v½n; n ¼ 1; . . . ; NVO
ð5:4Þ
where NVO is the number of VOs in the scene, and aT[n] and v[n] are, respectively, the coding type weight (T 2 {I,P,B}) and coding complexity weight – reflecting the texture, shape, and motion (if applicable) coding complexity (see [39] for details) – of VO n in SP p of GOS m. Note that aT[n] ¼ 0, if VO n does not have a VOP in SP p of GOS m. The coding complexity of a given SP p of GOS m is defined as the sum of its VOP complexities defined according to (5.4), i.e.: XSP ½ p ¼
XNVO n¼1
XVOP ½n½ p; p ¼ 1; . . . ; NSP ½m
ð5:5Þ
Therefore, the GOS $m$ coding complexity is the sum of its SP complexities defined according to (5.5), i.e.:

$$X_{GOS}[m] = \sum_{p=1}^{N_{SP}[m]} X_{SP}[p], \qquad m = 1, \ldots, N_{GOS} \qquad (5.6)$$
where $N_{SP}[m]$ is the number of scene planes in the GOS $m$, with $N_{SP}[m] = \lceil t_{GOS}[m]\, SR \rceil$, where $t_{GOS}$ is the GOS duration and $SR$ is the scene rate. The nominal target number of bits allocated for each SP in a given GOS is set by the following equation (the GOS index has been dropped for simplicity):

$$\bar{T}_{SP}[p] = T_{GOS}\, X_{SP}[p] / X_{GOS}, \qquad p = 1, \ldots, N_{SP} \qquad (5.7)$$
The actual SP target is given by the feedback equation:

$$T_{SP}[p] = \bar{T}_{SP}[p] + \frac{X_{SP}[p]}{\sum_{k=p}^{N_{SP}} X_{SP}[k]} \sum_{k=1}^{p-1} \left(\bar{T}_{SP}[k] - S_{SP}[k]\right) \qquad (5.8)$$
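As an illustration, the proportional allocation of (5.7) together with the feedback compensation of (5.8) can be sketched as follows (a minimal sketch; the function and variable names are illustrative and not taken from the original implementation):

```python
def sp_targets(T_GOS, X_SP, S_SP_so_far):
    """Compute actual SP bit targets for one GOS.

    T_GOS       : bit budget for the whole GOS, Eq. (5.2)
    X_SP        : list of SP complexities, Eq. (5.5)
    S_SP_so_far : bits actually spent on already-encoded SPs
    """
    X_GOS = sum(X_SP)                      # Eq. (5.6)
    targets = []
    for p in range(len(X_SP)):
        # Nominal target: fraction of the GOS budget proportional
        # to this SP's complexity, Eq. (5.7)
        T_bar = T_GOS * X_SP[p] / X_GOS
        # Feedback term of Eq. (5.8): accumulated allocation error of the
        # already-encoded SPs, weighted by this SP's share of the
        # complexity still to be encoded
        err = sum(T_GOS * X_SP[k] / X_GOS - S_SP_so_far[k]
                  for k in range(min(p, len(S_SP_so_far))))
        weight = X_SP[p] / sum(X_SP[p:])
        targets.append(T_bar + weight * err)
    return targets
```

With no past deviation the targets reduce to the nominal proportional shares; overspending on an earlier SP lowers the targets of the following ones.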
Notice that, since the proposed solution concerns low-delay encoding scenarios, both $X_{SP}$ and $X_{GOS}$ can only be computed with data from the current or past encoding time instants. In order to obtain approximately constant distortion among consecutive encoding time instants for each VO, the different VOP coding weights are also adapted at the beginning of each GOV of a given VO through the following equation:

$$a_I = \left(\bar{b}_I / \bar{b}_P\right) \left(\bar{D}_I / \bar{D}_P\right)^{\gamma_T} \qquad (5.9)$$
where $\bar{b}_I$, $\bar{b}_P$, $\bar{D}_I$, and $\bar{D}_P$ are, respectively, the average numbers of bits per pixel and the average pixel distortions for I- and P-VOPs, computed over window sizes $W_I$ and $W_P$ of past I- and P-VOP encoding results, and $\gamma_T$ is a parameter that controls the impact of the average distortion ratios on the estimation of the VO coding weight ($W_I = 3$, $W_P = N_{SP} - 1$, and $\gamma_T = 0.5$).

Video Object Plane Bit Allocation

At the VOP level, i.e. inside each SP, in order to obtain approximately constant quality among the several VOPs composing the SP, each VOP should be allocated a nominal target number of bits that is a fraction of the SP target (5.8), proportional to the relative complexity of the VOP to be encoded at that particular time instant. Therefore, the nominal target number of bits for the VO $n$ VOP in a given SP $p$ of GOS $m$ is given by:

$$\bar{T}_{VOP}[n] = T_{SP}\, X_{VOP}[n] / X_{SP}, \qquad n = 1, \ldots, N_{VO} \qquad (5.10)$$
For MVO encoding, it is important to guarantee that the spatial quality among the different VOs in the scene is kept approximately constant, i.e. an important goal is to encode all the objects in the scene with approximately constant quality. This goal can hardly be achieved when only a pure feedforward approach is used to compute the VO weights used to distribute the SP target among the several VOPs in the given SP. This is the approach followed in [32], where there is no compensation for deviations in the bit-rate distribution among the several VOPs for a given encoding time instant. Therefore, it is important to update the VO coding complexity weights along time and to compensate the bit allocation deviations through the feedback adjustment of these parameters in order to meet the requirement of spatial quality smoothness. In this section, the following compensation mechanism is proposed, aiming at reducing the deviations in the average distortion among the several VOs composing the scene for a given SP. For this purpose, the SP average luminance pixel distortion is defined as the weighted sum of the various VO distortions, i.e.:

$$D_{SP}[p] = \sum_{k=1}^{N_{VO}} \left(N_{PIX}[k]\, D_{VO}[k]\right) \Big/ \sum_{k=1}^{N_{VO}} N_{PIX}[k] \qquad (5.11)$$

where $N_{PIX}[k]$ is the number of pixels in the VO $k$ VOP in SP $p$.
Using Equation (5.11) as the reference target SP distortion, a complexity weight adjustment is computed for each VO:

$$f_D[p][n] = \frac{S_{VOP}[p-1][n] / a_T[p-1][n]}{\sum_{k=1}^{N_{VO}} S_{VOP}[p-1][k]} \left(\frac{D_{VOP}[p-1][n]}{D_{SP}[p-1]}\right)^{\gamma_D} \qquad (5.12)$$

where $\gamma_D$ is a parameter to control the impact of $f_D$ in the VOP bit allocation feedback compensation (typically $0.1 \le \gamma_D \le 0.5$; in this work $\gamma_D = 0.2$ has been used). From Equations (5.4) and (5.12), the VO complexity is feedback-adjusted as:

$$h_D[n] = f_D[n]\, X_{VOP}[n] \qquad (5.13)$$

and:

$$T_{VOP}[n] = T_{SP} \frac{h_D[n]}{\sum_{k=1}^{N_{VO}} h_D[k]} + \frac{h_D[n]}{\sum_{k=n}^{N_{VO}} h_D[k]} \sum_{k=1}^{n-1} \left(T_{SP} \frac{h_D[k]}{\sum_{j=1}^{N_{VO}} h_D[j]} - S_{VOP}[k]\right) \qquad (5.14)$$
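The distortion-driven weight adjustment of (5.11)–(5.13) can be sketched as follows (an illustrative reading of the equations; the function and variable names are hypothetical):

```python
def adjust_vo_weights(S_prev, a_T_prev, D_prev, N_pix, X_vop, gamma_D=0.2):
    """Feedback-adjust VO complexities from the previous SP's results.

    S_prev   : bits spent per VO in the previous SP
    a_T_prev : coding type weights used in the previous SP
    D_prev   : per-VO luminance distortions in the previous SP
    N_pix    : per-VO pixel counts (for the weighted scene distortion)
    X_vop    : current VOP complexities, Eq. (5.4)
    """
    # Scene distortion as a pixel-weighted average, Eq. (5.11)
    D_SP = sum(n * d for n, d in zip(N_pix, D_prev)) / sum(N_pix)
    total_bits = sum(S_prev)
    h = []
    for s, a, d, x in zip(S_prev, a_T_prev, D_prev, X_vop):
        # Complexity weight adjustment, Eq. (5.12): objects that came out
        # more distorted than the scene average get their weight boosted
        f = (s / a) / total_bits * (d / D_SP) ** gamma_D
        h.append(f * x)                    # Eq. (5.13)
    return h
```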
Macroblock Bit Allocation

The MB is the smallest coding unit for which the QP can be changed. At the MB level, i.e. inside each VOP, in order to obtain approximately uniform quality among the several non-transparent MBs, each MB should get a nominal target number of bits that is a fraction of the VOP target (5.14), proportional to the relative complexity $X_{MB}$ of the MB to be encoded in that particular VOP; in this work, the MB prediction error MAD is used. Therefore, the nominal and the actual target number of bits for each MB in a given VOP, respectively $\bar{T}_{MB}[i]$ and $T_{MB}[i]$, are given by:

$$\bar{T}_{MB}[i] = T_{VOP}\, X_{MB}[i] \Big/ \sum_{k=1}^{N_{MB}} X_{MB}[k], \qquad i = 1, \ldots, N_{MB} \qquad (5.15)$$

$$T_{MB}[i] = \bar{T}_{MB}[i] + K_{MB} \sum_{k=1}^{i-1} \left(\bar{T}_{MB}[k] - S_{MB}[k]\right) \qquad (5.16)$$

where $N_{MB}$ is the number of MBs in the VOP being encoded, $S_{MB}[k]$ is the number of bits spent on MB $k$, and:

$$K_{MB} = \max\left[X_{MB}[i] \Big/ \sum_{k=i}^{N_{MB}} X_{MB}[k],\; 1/\min\left(N_{MB} - i + 1,\, 16\right)\right] \qquad (5.17)$$

The rationale for Equation (5.17) is the following: at the beginning of the VOP encoding, the bit allocation errors are compensated for along the subsequent 16 MBs (this value has been set empirically) at most, avoiding slow reactions for MBs with low complexities, i.e. low MADs; as the encoding proceeds to the last MBs, the $K_{MB}$ factor distributes the accumulated MB bit allocation error through the remaining MBs to be encoded. For a fine allocation of bits inside each VOP, the rate control algorithm computes a QP for each MB, taking into account the complexity of the several MBs to encode, using the following MB-level rate-quantization function:

$$R_{MB}(Q) = \left(a / Q_{MB}\right) X_{MB} \qquad (5.18)$$

where $a$ is the model parameter estimated after each MB encoding.
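The behaviour of the compensation gain in (5.17) can be illustrated with a short sketch (0-based MB indices here, so $N_{MB} - i + 1$ becomes the count of remaining MBs; names are illustrative):

```python
def k_mb(i, X_MB):
    """Error-compensation gain for macroblock i (0-based), Eq. (5.17).

    Early in the VOP the accumulated allocation error is spread over at
    most the next 16 MBs; near the end it is spread over the remaining
    MBs, weighted by the MB's share of the remaining complexity.
    """
    N = len(X_MB)
    remaining = N - i                      # MBs still to encode, incl. i
    share = X_MB[i] / sum(X_MB[i:])        # relative complexity term
    return max(share, 1.0 / min(remaining, 16))
```

For a 100-MB VOP of uniform complexity the gain starts at 1/16 (error spread over 16 MBs) and reaches 1 at the last MB (all residual error absorbed).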
5.4.2.2 Adding a Novel Video Buffering Verifier Control Compensation

In order to efficiently use the available buffer space, for each SP of each GOS the target VBV buffer occupancy (immediately before removing all SP VOPs from the VBV buffer) is computed by:

$$B_T[p] = B_S - \left(T_{GOS} \frac{\sum_{k=1}^{p-1} X_{SP}[k]}{X_{GOS}} - \left(t_{SP}[p] - t_{SP}[1]\right) R\right) - B_L \qquad (5.19)$$
where $B_S$ is the VBV buffer size, $T_{GOS}$ is the target number of bits for encoding the whole GOS $m$ given by Equation (5.2), $X_{SP}[k]$ is the SP $k$ complexity given by Equation (5.5), $X_{GOS}$ is the GOS $m$ complexity given by Equation (5.6), $t_{SP}[p]$ is the time instant of SP $p$, $R$ is the average output target bit rate for GOS $m$, and $B_L$ is the VBV underflow margin, as explained in the following paragraphs.

Since at the beginning of each GOS all VOPs are intra-coded, this will typically lead to the highest level of encoder rate buffer occupancy, as intra-coded VOPs usually require more bits to achieve the same spatial quality as inter-coded VOPs. Consequently, in terms of VBV occupancy, this will correspond to the highest occupancy immediately before removing the first VOPs of a GOS, and the lowest VBV occupancy immediately after removing these VOPs from the VBV buffer. Therefore, in nominal terms, the available VBV margin is defined by the available encoder rate buffer space immediately after adding the encoded bits of the first SP in the GOS, or, in terms of VBV occupancy, by the occupancy of the VBV buffer immediately after removing the bits of the first SP VOPs. Since VBV buffer underflow (encoder rate buffer overflow) is more critical than VBV buffer overflow (encoder rate buffer underflow), it is convenient to distribute the VBV margin unequally over these two critical zones. Therefore, at the beginning of each GOS, the VBV margin is computed as follows:

$$B_M = B_S - T_{GOS}\, X_{SP}[1] / X_{GOS} \qquad (5.20)$$

The nominal free space in the buffer (5.20) is unequally divided as $B_L = \beta_{VBV}\, B_M$ and $B_U = (1 - \beta_{VBV})\, B_M$, with $\beta_{VBV} = 0.9$. Based on Equation (5.19), the target number of bits used to encode the corresponding SP (5.8) is further adjusted by a multiplicative factor:

$$K_{VBV} = \begin{cases} 1 - \alpha_{VBV}\, (B_T - B)/B_T, & B \le B_T \\ 1 + \alpha_{VBV}\, (B - B_T)/(B_S - B_T), & B > B_T \end{cases} \qquad (5.21)$$

where $\alpha_{VBV} = 0.25$ is a controller parameter set empirically. The rationale for Equation (5.21) is to decrease the SP bit allocation if the VBV buffer is approaching underflow (i.e. too many bits have been generated by the encoder in the past) and to increase the SP bit allocation if the VBV buffer is approaching overflow (i.e. too few bits have been generated by the encoder in the past). Therefore, the bit allocation given by Equation (5.8) is adjusted as follows:

$$T_{SP}^{SBC}[p] = T_{SP}[p]\, K_{VBV} \qquad (5.22)$$
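The soft VBV compensation factor of (5.21) amounts to a simple piecewise-linear controller, sketched here (illustrative names):

```python
def k_vbv(B, B_T, B_S, alpha=0.25):
    """VBV-control multiplicative factor, Eq. (5.21).

    B   : current VBV occupancy
    B_T : target occupancy for this SP, Eq. (5.19)
    B_S : VBV buffer size
    Shrinks the SP target when occupancy is below target (buffer moving
    towards underflow) and grows it when above target (towards overflow).
    """
    if B <= B_T:
        return 1.0 - alpha * (B_T - B) / B_T
    return 1.0 + alpha * (B - B_T) / (B_S - B_T)
```

At exactly the target occupancy the factor is 1, so the SP allocation (5.8) passes through unchanged.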
In some extreme cases, notably for small buffer sizes, this soft SP-level VBV control may lead to SP bit allocations near imminent violations of the VBV mechanism; therefore, whenever this situation occurs, a further adjustment is performed in order to guarantee that the SP bit allocation will keep the VBV occupancy within the nominal VBV operation area defined by:

$$\beta_L\, B_S \le B \le \beta_U\, B_S, \qquad \text{with } \beta_L = 0.05 \text{ and } \beta_U = 1.0 \qquad (5.23)$$
(Reproduced by permission of © 2007 IEEE.)
5.4.3 Performance Evaluation

The performance for MVO encoding of the proposed rate control solution (the so-called IST solution) is compared with that of the MPEG-4 VM5 rate control algorithm initially described in [32]. Two random access conditions are tested: (a) one random access point (I-VOP) every second (label IP = 1 s); and (b) one single random access point at the beginning of the sequence (IPP...) (label IP = 10 s). The VBV buffer size is set numerically to R/2 bits (R: target bit rate). Four representative test sequences at 30 Hz and with 300 frames have been selected: Stefan and Bream, with two VOs; and Coastguard and News, with four VOs. These sequences can be grouped according to their motion activity into: (a) high-motion video sequences (Stefan and Coastguard); and (b) low-motion video sequences (Bream and News). The two rate control solutions are compared in terms of the so-called average scene quality, measured as the average luminance scene PSNR between the original and the reconstructed video frames at the decoder, using the tool for compactly comparing two PSNR curves developed by the ITU-T Video Coding Experts Group [40] (see Table 5.3). The so-called scene PSNR variation is also used to assess the quality smoothness between the various VOs in the scene; it is computed as the ratio between the average scene PSNR difference and the average scene PSNR, where the former is the weighted sum of the absolute differences between each VO PSNR and the scene PSNR for each SP (weighted by the relative size of each VO). Table 5.3 illustrates the PSNR gains and bit-rate reductions for the proposed solution in two different conditions: (a) the proposed MVO RC solution with VM5 VBV control, i.e. without Section 5.4.2.2; (b) the proposed MVO RC solution with the VBV control proposed in Section 5.4.2.2 (IST label). These results support the following conclusions:
Table 5.3 Average PSNR and bit rate gains (CIF at 15 Hz)

                    PSNR (dB)                      Bit rate (%)
               IP = 1 s     IP = 10 s         IP = 1 s     IP = 10 s
Sequence       (a)    (b)    (a)    (b)       (a)    (b)    (a)    (b)
Stefan         1.98   2.57   1.91   2.15      30.4   38.8   34.5   37.7
Coastguard     0.61   1.13   0.58   0.55      12.6   21.8   11.8   10.9
Bream         -0.33   0.80  -0.13  -0.06      -6.2   15.5   -2.5   -1.1
News           1.07   2.50  -0.78  -0.49      54.7   33.9  -16.8   -8.9
Average        0.83   1.75   0.40   0.54      22.9   27.5    6.8    9.7
Figure 5.3 Stefan (IP = 1 s): (a) average scene PSNR versus bit rate; (b) scene PSNR evolution, QCIF at 15 Hz (256 kbps). Reproduced by permission of © 2007 IEEE.
• Both cases (a) and (b) show higher PSNR gains (and thus also bit-rate reductions) for the IP = 1 s tests, due to the more efficient bit allocation and the finer QP (MB-level) control, as also illustrated in Figure 5.3(a) for Stefan under various encoding conditions. PSNR gains for case (b) can be as high as 2.6 dB.
• For the less demanding scenarios, i.e. IP = 10 s and low-motion sequences, VM5 performs slightly better than case (a) and close to case (b) in terms of average scene PSNR, due to the high coding quality of the easy-to-code background VOs. However, the scene PSNR variation is lower (smoother quality) for cases (a) and (b), as illustrated in Figure 5.4 for Bream (case (b)).
Figure 5.4 Bream (IP = 10 s) scene PSNR variation versus bit rate. Reproduced by permission of © 2007 IEEE.
• Case (b) provides an additional PSNR gain, relative to case (a), of approximately 0.9 dB on average (5% less bit rate), while also reducing the number of skipped frames, as illustrated in Figure 5.3(b) for Stefan (visible as severe PSNR drops).
5.4.4 Conclusions This work proposes two improved feedback mechanisms (VO distortion feedback and video rate buffer feedback) for low-delay MVO MPEG-4 encoding. The proposed solution was compared with the MPEG-4 VM5 solution [32], with the main conclusion that the proposed MVO RC solution clearly outperforms the benchmarking solution in terms of the average quality and quality smoothness, resulting in a more efficient use of the available resources, i.e. the buffer space and the target bit rate.
5.5 Optimal Rate Allocation for H.264/AVC Joint MVS Transcoding (Portions reprinted, with permission, from "Joint bit-allocation for multi-sequence H.264/AVC video coding rate control", Picture Coding Symposium, Lisbon, November 2007. © 2007 EURASIP.)
5.5.1 Problem Definition and Objectives

Let us consider the scenario depicted in Figure 5.5, where the input is represented by a set of pre-encoded H.264/AVC sequences, each encoded in either VBR or CBR mode. The sequences are multiplexed into a single channel, characterized by a global rate constraint $R_{tot}$. Therefore, the problem is twofold:
Figure 5.5 Proposed transcoding architecture for multi-sequence rate control. Reproduced by permission of © 2007 EURASIP
• The rate controller module is responsible for optimally allocating the bit budget to the output sequences.
• Transcoding needs to be performed in order to meet a more stringent rate constraint and adjust the output bit rate of the sequences.
In this work, the focus is on defining novel solutions for the rate control module, while a very simple transcoding algorithm is envisaged in order to serve as a “proof of concept”. Specifically, in order to avoid issues related to drift propagation, an explicit transcoder that decodes the sequence in the pixel domain and re-encodes it subject to the new, more stringent, rate constraint is adopted. In order to speed up the transcoding process, the encoders in Figure 5.5 borrow the mode decisions and the motion vectors obtained from the decoded bitstream. Although this approach is not optimal in general, a broadcast scenario where the video content is encoded at high rates is addressed. Under this hypothesis, mode decision and motion information typically represent a small fraction of the overall bit rate. Therefore, efficient rate control can be simply obtained by re-quantizing the DCT coefficients with a coarser quantization step.
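The coarser re-quantization step mentioned above can be illustrated with a toy uniform quantizer (a deliberate simplification: H.264/AVC's actual quantization uses QP-dependent scaling and is more involved, so this only conveys the idea of re-quantizing decoded levels with a larger step):

```python
def requantize(levels, q_in, q_out):
    """Re-quantize decoded transform levels with a coarser step.

    levels : quantized coefficient levels from the decoded bitstream
    q_in   : original (finer) quantization step
    q_out  : new, coarser step used by the transcoder
    """
    assert q_out >= q_in, "transcoding targets a lower rate"
    out = []
    for level in levels:
        value = level * q_in                    # reconstruct coefficient
        out.append(int(round(value / q_out)))   # coarser quantization
    return out
```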
5.5.2 Proposed Technical Solution

(Portions reprinted, with permission, from Mariusz Jakubowski, Grzegorz Pastuszak, "Multipath adaptive computation-aware search strategy for block-based motion estimation," The International Conference on Computer as a Tool (EUROCON), 9–12 September 2007, pages 175–181. © 2007 IEEE.)

The rate control problem addressed in this work can be described as follows. Consider the problem of simultaneously transmitting $S$ different video sources. Let $\mathbf{R} = [R_1, R_2, \ldots, R_S]^T$ denote the rate allocation strategy, where $R_s$ is expressed in bits/sample. Let $\mathbf{D}(\mathbf{R}) = [D_1(R_1), D_2(R_2), \ldots, D_S(R_S)]^T$ denote the output distortion corresponding to the rate allocation strategy $\mathbf{R}$. The rate control module tries to minimize the average output distortion:

$$\mathbf{R}^* = \arg\min_{\mathbf{R}} \frac{1}{S} \sum_{s=1}^{S} D_s(R_s) \qquad (5.24)$$

subject to the overall rate constraint:

$$\sum_{s=1}^{S} R_s \le R_{tot} \qquad (5.25)$$
In order to find the optimal solution, $\mathbf{R}^*$, the problem is formulated in the $\rho$ domain. In [43], it is shown that in any typical transform domain system there is always a linear relationship between the coding bit rate, $R$, and the percentage of zeros among the quantized transform coefficients, denoted by $\rho$, i.e.:

$$R(\rho) = \theta\,(1 - \rho)\ \text{[bps]} \qquad (5.26)$$

where $\theta$ is a constant parameter that depends on the source. Given a parametric model for the probability density function of the transform coefficients, it is possible to express the distortion $D$
in terms of $\rho$, i.e. $D(\rho, \alpha)$, where $\alpha$ is the parameter defining the probability density function. Therefore, the optimization problem can be expressed directly in the $\rho$ domain:

$$\min_{\boldsymbol{\rho}} \frac{1}{S} \sum_{s=1}^{S} D_s(\rho_s, \alpha_s) \qquad (5.27)$$

subject to the overall rate constraint:

$$\sum_{s=1}^{S} \theta_s\,(1 - \rho_s) \le R_{tot} \qquad (5.28)$$
and solved by means of the Lagrange multipliers method. The solution can be found either in closed form or numerically, depending on the functional form of $D(\rho, \alpha)$. It is worth pointing out that, in the transcoding scenario addressed here, the parameters needed to specify the optimization problem, i.e. $[\theta_s, \alpha_s]$, $s = 1, \ldots, S$, can be readily obtained from the histograms of the decoded DCT coefficients. Therefore, in the transcoding process, the input bitstreams are decoded to obtain the DCT coefficients relative to the current frame of each of the considered sequences (and the relative motion information that will be used in the re-encoding process). Then, for each frame, the histogram of the DCT coefficients is evaluated, and the parameters of a generalized Gaussian model that best fits the histogram are estimated. This allows the evaluation of the parameters that will be used in the optimization procedure of Equation (5.27) for optimal bit allocation in the re-encoding of the current frames of all the considered sequences. An assumption of a Laplacian distribution of the DCT coefficients could be made, avoiding a frame-by-frame estimation, but simulation tests have shown that this solution leads to a significant performance loss without reducing the computational load in a significant way. In any case, the proposed algorithm appears very fast and suitable for a real-time implementation. In addition, the rate allocation algorithm is carried out independently at each time instant. Therefore, each video sequence is encoded at a variable bit rate (VBR), while the overall bit rate is kept constant. At each time instant, any frame-based rate control algorithm can be applied to adaptively adjust the quantization parameter at the MB level, in such a way as to meet the target rate, $R_s^*$.
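Under these definitions, the allocation (5.27)–(5.28) can also be approximated numerically; the following sketch replaces the Lagrangian solution with a brute-force grid search over two sequences (purely illustrative; the distortion models and names are assumptions, not the paper's implementation):

```python
def allocate_rho_domain(theta, dist_funcs, R_tot, grid=1000):
    """Numerical sketch of the rho-domain allocation (5.27)-(5.28) for
    S = 2 sequences: exhaustive search over R1 on a uniform grid.

    theta      : (theta1, theta2), slopes of R = theta * (1 - rho), Eq. (5.26)
    dist_funcs : (D1(rho), D2(rho)), distortion models in the rho domain
    R_tot      : total rate budget in bits/sample
    """
    best_D, best_R = float("inf"), None
    for k in range(grid + 1):
        R1 = R_tot * k / grid
        R2 = R_tot - R1                      # budget met with equality
        rho1 = 1.0 - R1 / theta[0]           # invert Eq. (5.26)
        rho2 = 1.0 - R2 / theta[1]
        if not (0.0 <= rho1 <= 1.0 and 0.0 <= rho2 <= 1.0):
            continue                         # outside the model's range
        D_avg = 0.5 * (dist_funcs[0](rho1) + dist_funcs[1](rho2))
        if D_avg < best_D:
            best_D, best_R = D_avg, (R1, R2)
    return best_R
```

For two identical sources with a symmetric convex distortion model the search returns, as expected, an even split of the budget.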
5.5.3 Performance Evaluation

(Portions reprinted, with permission, from Mariusz Jakubowski, Grzegorz Pastuszak, "Multipath adaptive computation-aware search strategy for block-based motion estimation," The International Conference on Computer as a Tool (EUROCON), 9–12 September 2007, pages 175–181. © 2007 IEEE.)

The proposed rate control algorithm has so far been tested on H.263+ intra-encoded sequences, although its extension to H.264/AVC is readily obtained and is the subject of current and future work. Figure 5.6 shows the results obtained with $S = 2$ sequences and a target bit rate $R_{tot} = 2$ bps. The x axis shows $R_1$, i.e. the rate allocated to the first sequence. The rate allocated to the second sequence can be obtained as $R_2 = R_{tot} - R_1$. Figure 5.6 shows both the RD curve of each individual sequence and the average distortion. The vertical dashed line represents the estimated optimal bit allocation obtained by the proposed algorithm.
Figure 5.6 Results of rate allocation for two sequences. Reproduced by permission of © 2007 IEEE
5.5.4 Conclusions

(Portions reprinted, with permission, from Mariusz Jakubowski, Grzegorz Pastuszak, "Multipath adaptive computation-aware search strategy for block-based motion estimation," The International Conference on Computer as a Tool (EUROCON), 9–12 September 2007, pages 175–181. © 2007 IEEE.)

A rate controller module for the global rate control of multiple pre-encoded AVC sequences is used to allocate the bit budget to the output sequences. The input sequences may be VBR or CBR, but the output sequences are multiplexed together for transmission at a fixed rate into a single channel. The novel solution for the rate control module proposed in this work decodes each sequence in the pixel domain and then re-encodes it at the new target bit rate; reusing the decoded mode decisions and motion vectors speeds up the transcoding process, while the full pixel-domain re-encoding avoids drift propagation-related issues. The parameters needed for the optimization are obtained from the histograms of the decoded DCT coefficients, improving the efficiency of the rate control. The results show that the proposed algorithm achieves optimal bit allocation for all the sequences.
5.6 Spatio-temporal Scene-level Error Concealment for Segmented Video (Portions reprinted, with permission, from G. Valenzise, M. Tagliasacchi, S. Tubaro, L. Piccarreta, "A rho-domain rate controller for multiplexed video sequences", 26th Picture Coding Symposium 2007, Lisbon, November 2007. © 2007 EURASIP.)
5.6.1 Problem Definition and Objectives As referred to in Section 5.5, several object-based error-concealment techniques, dealing both with shape and texture data, have been proposed in the literature. These techniques, however, have a serious limitation in common, which is that each video object is independently considered, without ever taking into account how it fits in the video scene. After all, the fact that a concealed video object has a pleasing subjective impact on the user when it is considered on its own does not necessarily mean that the subjective impact of the whole scene, when the
Figure 5.7 Illustration of a typical scene-concealment problem: (a) original video scene; (b) composition of two independently error-concealed video objects. Reproduced by permission of © 2008 IEEE
objects are put together, will be acceptable; this represents the difference between object-level and scene-level concealment. An example of this situation is given in Figure 5.7, where a hole has appeared as a result of blindly composing the scene using two independently-concealed video objects. When concealing a complete video scene, the way the scene was created has to be considered, since this will imply different problems and solutions in terms of error concealment. As shown in Figure 5.8, the video objects in a scene can be defined either by segmentation of an existing video sequence (segmented scene), in which case all shapes have to fit perfectly together, or by composition of pre-existing video objects (composed scene), whose shapes do not necessarily have to fit perfectly together. Additionally, it is also possible to use a combination of both approaches. For the work presented here, segmented video scenes (or the segmented parts of hybrid scenes) are considered, since the concealment of composed scenes can typically be limited to object-level concealment. In addition, the proposed technique, which targets the concealment of both shape and texture data, relies not only on available information from the current time instant but also on information from the past – it is a spatio-temporal technique. In fact, it is the only known spatio-temporal technique that targets both shape and texture data and works at the scene level.
5.6.2 Proposed Technical Solution

In order to better understand the proposed scene-level error-concealment solution, the types of problem that may appear in segmented video scenes when channel errors occur should be briefly considered, as well as what can be done to solve them.
Figure 5.8 Two different scene types: (a) segmented scene; (b) composed scene
5.6.2.1 Scene-level Error Concealment in Segmented Video

Segmented video scenes are obtained from rectangular video scenes by segmentation. This means that, at every time instant, the arbitrarily-shaped video object planes (VOPs) of the various video objects in the scene will fit perfectly together, like the pieces of a jigsaw puzzle. These arbitrarily-shaped VOPs are transmitted in the form of rectangular bounding boxes, using shape and texture data. The shape data corresponds to a binary alpha plane, which is used to indicate the parts of the bounding box that belong to the object and, therefore, need to have texture associated with them. For the rest of the work described here, it will be considered that some kind of block-based coding, such as (but not necessarily) that defined in the MPEG-4 Visual standard, was used, and that channel errors manifest themselves as bursts of consecutive corrupted blocks for which both shape and texture data will have to be concealed, at both the object and scene levels.

Shape Error Concealment in Segmented Video

Since, in segmented scenes, the various VOPs in a time instant have to fit together like the pieces of a jigsaw puzzle, any distortion in their shape data will make holes or object overlaps appear, leading to a negative subjective impact. However, the fact that the existing VOPs have to fit perfectly together can also be exploited when concealing shape data errors. In many cases, it will be possible to conceal at least some parts of the corrupted shape in a given corrupted VOP by considering uncorrupted complementary shape data from surrounding VOPs. For those parts of the corrupted shape for which complementary data is not available because it is also corrupted, concealment will be much harder. Thus, depending on the part of the corrupted shape that is being concealed in a VOP, two distinct cases are possible:
• Correctly decoded complementary shape data: the shape data from the surrounding VOPs can be used to conceal the part of the corrupted shape under consideration, since it is uncorrupted.
• Corrupted complementary shape data: the shape data from the surrounding VOPs cannot be used to conceal the part of the corrupted shape under consideration, since it is also corrupted.
These two cases, which are illustrated in Figure 5.9, correspond to different concealment situations and, therefore, will have to be treated separately in the proposed technique.
Figure 5.9 Illustration of the two possible concealment situations for the Stefan video objects (Background and Player): (a) correctly decoded complementary shape data exists; (b) complementary shape data is corrupted in both objects. Reproduced by permission of © 2008 IEEE
Texture Error Concealment in Segmented Video

When concealing the corrupted texture of a given VOP in a video scene, the available texture from surrounding VOPs appears to be of little or no use, since different objects typically have uncorrelated textures. However, in segmented scenes, the correctly decoded shape data from surrounding VOPs can be indirectly used to conceal the corrupted texture data. This is possible because the shape data can be used to determine the motion associated with a given video object, which can then be used to conceal its corrupted texture, as was done in [46]. Therefore, by concealing parts of the corrupted shape data of a given VOP with the correctly decoded complementary shape data, it will be possible to estimate the object motion and conceal the corrupted texture.

5.6.2.2 Proposed Scene-level Error-concealment Algorithm

By considering what was said above for the concealment of shape and texture data in segmented video scenes, a complete and novel scene-level shape and texture error-concealment solution is proposed here. The proposed concealment algorithm includes two main consecutive phases, which are described in detail in the following two subsections.

Shape and Texture Concealment Based on Available Complementary Shape Data

In this phase, all the parts of the corrupted shape for which correctly decoded complementary shape data is available are concealed first. To do this for a given corrupted VOP, two steps are needed:

1. Creation of the complementary alpha plane: to begin with, a complementary alpha plane, which corresponds to the union of all the video objects in the scene except the one currently being concealed, is created.
2. Determination of shapel transparency values: afterwards, each corrupted shapel (i.e. shape element) of the VOP being concealed is set to the opposite transparency value of the corresponding shapel in the complementary alpha plane.
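The two steps above can be sketched as follows (a simplification that ignores the corrupted-complementary-data case handled in the second phase; function and variable names are illustrative):

```python
def conceal_shape(alpha, corrupted, other_alphas):
    """Scene-level shape concealment sketch.

    alpha        : 2D list of 0/1, the shape of the VOP being concealed
    corrupted    : set of (row, col) shapel positions lost to errors
    other_alphas : 0/1 masks of all other objects in the scene

    Builds the complementary alpha plane as the union of the other
    objects' shapes, then sets each corrupted shapel to the OPPOSITE
    transparency of the complementary plane (segmented scenes tile
    the frame, so 'not covered by any other object' means 'mine').
    """
    H, W = len(alpha), len(alpha[0])
    for r in range(H):
        for c in range(W):
            if (r, c) in corrupted:
                covered = any(a[r][c] for a in other_alphas)
                alpha[r][c] = 0 if covered else 1
    return alpha
```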
Since the complementary alpha plane can also have corrupted parts, this is only done if the required data is uncorrupted. This whole procedure is repeated for all video objects with corrupted shape. It should be noted that, for those parts of the corrupted shape for which complementary data is available, this type of concealment recovers the corrupted shape without any distortion with respect to the original shape, which does not happen in the second phase, described in the next subsection. In order to recover the texture data associated with the opaque parts of the shape data that has just been concealed, a combination of global and local motion (first proposed in [46]) is used. To do this for a given VOP, four steps are needed:

1. Global motion parameter computation: to begin with, the correctly decoded shape and texture data, as well as the shape data that was just concealed, are considered in order to locally compute global motion parameters for the VOP being concealed.
2. Global motion compensation: then, the computed global motion parameters can be used to motion-compensate the VOP of the previous time instant.
3. Concealment of corrupted data: that way, the texture data associated with the opaque parts of the shape data that has just been concealed is obtained by copying the co-located texture in the motion-compensated previous VOP.
4. Local motion refinement: since the global motion model cannot always accurately describe the object motion, due to the existence of local motion in some areas of the object, a local motion refinement scheme is applied. In this scheme, the available data surrounding the corrupted data being concealed is used to determine whether any local motion exists and, if so, to refine the concealment.

Shape and Texture Concealment for Which No Complementary Shape Data is Available

In this phase, the remaining corrupted shape data, which could not be concealed in the previous phase because no complementary shape data was available in surrounding objects, will be concealed. The texture data associated with the opaque parts of the concealed shape will also be recovered. This phase is divided into two steps:

1. Individual concealment of video objects: since the remaining corrupted shape of the various video objects in the scene has no complementary data available to be used for concealment, the remaining corrupted shape and texture data will be concealed independently of the surrounding objects. This can be done using any of the available techniques in the literature. Here, however, to take advantage of the high temporal redundancy of the video data, individual concealment of video objects will be carried out using a combination of global and local motion-compensation concealment, as proposed in [46]. This technique is applied to conceal both the shape and the texture data of the corrupted video object under consideration.
2. Elimination of scene artifacts by refinement of the object concealment results: as a result of the previous step, holes or object overlaps may appear in the scene, since objects have been processed independently. The regions that correspond to holes are considered undefined, in the sense that they do not yet belong to any object (i.e. shape and texture are undefined).
As for the regions where objects overlap, they will also be considered undefined and treated the same way as holes, because a better method to deal with them (i.e. one that would work consistently for most situations) has not been found. In this last step, these undefined regions are divided among the video objects around them. To do this, a morphological filter based on the dilation operation [45] is cyclically applied to the N objects in the scene, A1, A2, ..., AN, until all undefined regions disappear. The morphological operation to be applied to object Aj is as follows:

Aj ← (Aj ⊕ B) \ ⋃_{i=1, i≠j}^{N} Ai        (5.29)
The 3 × 3 structuring element B that is used for the dilation operation is shown in Figure 5.10:

0 1 0
1 1 1
0 1 0

Figure 5.10 Structuring element used for the dilation operation in the refinement of individual concealment results. Reproduced by Permission of © 2008 IEEE

Figure 5.11 Elimination of an undefined region by morphological filtering: (a) initial undefined region; (b) undefined region is shrinking; (c) undefined region has been eliminated. Reproduced by Permission of © 2008 IEEE

By cyclically applying this filter, the undefined regions are progressively
absorbed by the objects around them until finally they disappear, as illustrated in Figure 5.11. The final result, however, depends on the ordering of objects in this cycle, but since the region to which this operation is applied is typically very small, the differences will hardly be visible. To estimate the texture values of the pixels in these new regions, an averaging procedure is used. This way, in each iteration of the above-mentioned morphological operation, the texture of the pixels which correspond to the shapels that have been absorbed is estimated by computing the mean of the adjacent 4-connected neighbors that were already included in the object. Since the regions over which texture concealment is necessary are typically very small, this procedure is adequate.
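The refinement step just described can be sketched in code: the objects are cyclically dilated with the cross-shaped element of Figure 5.10, with the restriction that an object may only grow into pixels that no object currently owns. The binary-mask representation and all function names below are illustrative assumptions, not the exact implementation of [45].

```python
# Cross-shaped ("plus") 3x3 structuring element of Figure 5.10,
# expressed as neighbour offsets; (0, 0) keeps the mask itself.
CROSS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]

def dilate(mask, allowed):
    """One dilation of `mask` by CROSS, restricted to `allowed` pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in CROSS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and allowed[ny][nx]:
                    out[ny][nx] = True
    return out

def absorb_undefined(objects):
    """Cyclically dilate the objects until no undefined pixel is absorbed
    any more; `objects` is a list of binary alpha planes (lists of lists)."""
    h, w = len(objects[0]), len(objects[0][0])
    changed = True
    while changed:
        changed = False
        for j, obj in enumerate(objects):
            # Object j may claim a pixel only if no object owns it yet.
            allowed = [[(not any(o[y][x] for o in objects)) or obj[y][x]
                        for x in range(w)] for y in range(h)]
            new = dilate(obj, allowed)
            if new != obj:
                objects[j] = new
                changed = True
    return objects
```

The loop terminates because each pass either absorbs at least one undefined pixel or leaves every mask unchanged; as noted above, the final partition depends on the object ordering, but over small regions the differences are hardly visible.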
5.6.3 Performance Evaluation
In order to illustrate the performance of the proposed shape-concealment process, Figure 5.12 should be considered. In this example, the three video objects in Figure 5.12(a) have been corrupted, as shown in Figure 5.12(b). In the remainder of Figure 5.12, the various steps of the concealment process are shown, leading to the final concealed video objects in Figure 5.12(f). To compare these video objects with the original ones in Figure 5.12(a), the Dn and PSNR metrics used by MPEG [46] may be used for shape and texture, respectively. The Dn metric is defined as:

Dn = (number of shapels differing between the concealed and original shapes) / (number of opaque shapels in the original shape)        (5.30)

which can also be expressed as a percentage, Dn[%] = 100 × Dn. As for the PSNR metric, since arbitrarily-shaped video objects are used, it is only computed over the pixels that belong to both the decoded VOP being evaluated and the original VOP. The obtained Dn values are 0.01%, 0.15%, and 0.12%, respectively, for the Background, Dancers, and Speakers video objects shown in Figure 5.12(f). The corresponding PSNR values are 37.58 dB, 26.20 dB, and 30.27 dB; the uncorrupted PSNR values are 38.25 dB, 33.51 dB, and 34.18 dB, respectively. As can be seen, although the shapes and textures of these video
Figure 5.12 The concealment process for the Dancers sequence: (a) original uncorrupted video objects (Background, Dancers, Speakers); (b) corrupted video objects; (c) video objects after the corrupted data for which complementary data exists has been concealed; (d) video objects after individual concealment; (e) undefined regions that appear after individual concealment (shown in grey); (f) final concealed video objects. Reproduced by Permission of © 2008 IEEE
objects have been severely corrupted, the results are quite impressive, especially when compared to what is typically achieved by independent concealment alone. The main reason for such a great improvement is the use of the complementary shape data from surrounding objects during the concealment process, which does not happen when only independent concealment is performed.
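For concreteness, the Dn metric of Equation (5.30) and the object-based PSNR used above could be computed as in the following sketch; the function names and the 8-bit texture assumption are ours, not part of the MPEG definition:

```python
import math

def dn(concealed, original):
    """Dn: fraction of differing shapels over the opaque shapels of the
    original shape; both arguments are binary masks as lists of lists."""
    differing = sum(c != o for cr, orr in zip(concealed, original)
                    for c, o in zip(cr, orr))
    opaque = sum(o for row in original for o in row)
    return differing / opaque

def object_psnr(decoded, original, mask_dec, mask_org):
    """PSNR computed only over pixels belonging to both VOPs
    (assumed 8-bit texture, peak value 255)."""
    se, n = 0, 0
    for y in range(len(original)):
        for x in range(len(original[0])):
            if mask_dec[y][x] and mask_org[y][x]:
                se += (decoded[y][x] - original[y][x]) ** 2
                n += 1
    mse = se / n
    return float('inf') if mse == 0 else 10 * math.log10(255 ** 2 / mse)
```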
5.6.4 Conclusions
In this section, a shape and texture concealment technique for segmented object-based video scenes, such as those based on the MPEG-4 standard, was proposed. Results were presented, showing the ability of this technique to recover lost data in segmented video scenes with rather
small distortion. Therefore, with this technique, it should be possible for object-based video applications (with more than one object) to be deployed in error-prone environments with an acceptable visual quality.
5.7 An Integrated Error-resilient Object-based Video Coding Architecture
(Portions reprinted, with permission, from M. Tagliasacchi, G. Valenzise, S. Tubaro, “Minimum variance optimal rate allocation for multiplexed H.264/AVC bitstreams,” IEEE Transactions on Image Processing, Vol. 17, No. 7, pp. 1129–1143, Jul. 2008. © 2008 IEEE.)
5.7.1 Problem Definition and Objectives
As explained in Sections 5.5 and 5.6, in order to make possible new object-based video services – such as those based on the MPEG-4 object-based audiovisual coding standard – in error-prone environments, appropriate error-resilience techniques are needed. By combining several complementary error-resilience techniques at both sides of the communication chain, it is possible to further improve the error resilience of the whole system. Therefore, the purpose of the work described here is to propose an object-based video coding architecture in which complementary error-resilience tools, most of which were previously proposed in VISNET I, are integrated for all the relevant modules.
5.7.2 Proposed Technical Solution
The proposed object-based architecture, in which the various proposed error-resilience techniques will be integrated, is shown in Figure 5.13. At the encoder side, after the video scene has been defined, the coding of video objects is supervised by a resilience configuration module, which is responsible for choosing the most adequate coding parameters in terms of resilience. This is important because the decoding performance will very much depend on the kinds of protective action the encoder has taken. The output of the various video object encoders will then be multiplexed and sent through the channel in question.

At the decoder side, the procedure is basically the opposite, but, instead of a scene-definition module, a scene-composition module is used. In order to minimize the negative subjective impact of channel errors in the composed presentation, defensive actions have to be taken by the decoder. This includes error detection and error localization in each video object decoder, followed by (object-level) error concealment [47]. At this point, error concealment is applied independently to each video object, and, therefore, it is called object-level error concealment, which can be of several types depending on the data that is used. Afterwards, a more advanced type of concealment may also be performed in the scene-concealment module, which has access to all the video objects present in the scene: this is the so-called scene-level error concealment. The final concealed video scene is presented to the user by the composition module.

In this context, several error-resilience techniques are suggested below for the most important modules of the integrated object-based video coding architecture. Since the
[Figure 5.13 shows the architecture as a block diagram: a Scene Definition module feeds the resilient video object encoders, which are supervised by the Resilience Configuration module; the encoded streams pass through Multiplexing and Synchronization, Transmission or Storage, and De-multiplexing to the resilient video object decoders, whose outputs undergo object-level concealment (OSC: Object Spatial Concealment; OTC: Object Temporal Concealment; AOSTC), followed by Scene Concealment, Scene Composition, and Video Presentation.]
Figure 5.13 Error-resilient object-based video coding architecture. Reproduced by Permission of © 2008 IEEE
suggested techniques have already been individually proposed in the literature by the involved partners, only a brief description of each is provided, in order to explain their role in the context of the architecture; for the full details of these tools, the reader should consult the relevant references.

5.7.2.1 Encoder-side Object-based Error Resilience
It is largely recognized that intra-coding refreshment can be used at the encoder side to improve error resilience in video coding systems that rely on predictive (inter-) coding. In these systems, the decoded quality can decay very rapidly due to long-lasting channel error propagation, which can be avoided by using an intra-coding refreshment scheme at the encoder to refresh the decoding process and stop (spatial and temporal) error propagation. This will decrease the coding efficiency, but it will significantly improve error resilience at the decoder side, increasing the overall video subjective impact [47].

Object-based Refreshment Need Metrics
In order to design an efficient intra-coding refreshment scheme for an object-based video coding system, it would be helpful to have at the encoder side a method to determine which
components of the video data (shape and texture) of which objects should be refreshed, and when. With this in mind, shape-refreshment-need and texture-refreshment-need metrics have been proposed in [48]. These refreshment-need metrics have been shown to correctly express the necessity of refreshing the corresponding video data (shape or texture) according to some error-resilience criteria; therefore, they can be used by the resilience configuration module at the encoder side to efficiently decide whether or not some parts of the video data (shape or texture) of some video objects should be refreshed at a certain time instant. By doing so, significant improvements should be possible in the video subjective impact at the decoder side, since the decoder gets a selective amount of refreshment “help” depending on the content and its (decoder) concealment difficulty.

Adaptive Object-based Video Coding Refreshment Scheme
Based on the refreshment-need metrics described in the previous section, the resilience configuration module should decide which parts of the shape and texture data of the various objects should be refreshed for each time instant. To do so, the adaptive shape and texture intra-coding refreshment scheme proposed in [50] can be used. This scheme considers a multi-object video scene, and its target is to efficiently control the shape and texture refreshment rate for the various video objects, depending on their refreshment needs, related to the concealment difficulty at the decoder. As shown in [50], when this technique is used the overall video quality is improved for a certain total bit rate when compared to cases where less sophisticated refreshment schemes are used (e.g. a fixed intra-refreshment period for all the objects). This happens because objects with low refreshment needs (i.e. easy to conceal at the decoder when errors occur) can be refreshed less often without a significant reduction in their quality, thus saving refreshment resources.
The saved resources are then used to improve the quality of objects with high refreshment needs (i.e. hard to conceal when errors occur) by refreshing them more often.

5.7.2.2 Decoder-side Object-based Error Resilience
With the techniques described in Section 5.7.2.1, considerable improvements can be obtained in terms of the decoded video quality. However, further improvements can still be achieved by also using sophisticated shape and texture error-concealment techniques at the decoder side. Since different approaches exist in terms of error concealment, several types of error-concealment technique may have to be used. These include object-level techniques to be used by the individual video object decoders, as well as scene-level techniques to be used by the scene-concealment module.

Spatial Error Concealment at the Object Level
The error-concealment techniques described in this section are object-level spatial (or intra-) techniques, in the sense that they do not rely on information from other time instants and only use the available information for the video object being concealed at the time instant in question. Two different techniques of this type are needed for a video object: one for shape and one for texture. In terms of spatial shape error concealment, the technique proposed in [51] may be used. This technique is based on contour interpolation and has been shown to achieve very good results. This technique considers that, in object-based video, the loss of shape data corresponds to broken contours, which have to be interpolated. By interpolating these broken contours with
Figure 5.14 The spatial shape concealment process: (a) lost data surrounded by the available shape data; (b) lost data surrounded by the available contours; (c) interpolated contours inside the lost area; (d) recovered shape. Reproduced by Permission of © 2008 IEEE
Bezier curves, the authors have shown that it is possible to recover the complete shape with good accuracy, since most contours in natural video objects typically have a rather slow direction variation. After the broken contours have been recovered, it is fairly easy to recover the values of the missing shapels (pixels in the shape masks) from the neighboring ones by using an adequate continuity criterion and then filling in the shape. This shape error-concealment technique is illustrated in Figure 5.14, where the lost shape data is shown in gray throughout the example.

In terms of spatial texture concealment, the technique proposed in [54] may be used. This technique, which is the only one in the literature for object-based systems, consists of two steps. In the first step, padding is applied to the available texture in order to extend it beyond the object boundaries. This is done in order to facilitate the second step, where the lost pixel values are estimated based on the available surrounding texture data (including the extended texture). For this second step, two different approaches have been proposed, for which results have been presented showing their ability to recover lost texture data in a quite acceptable way. One approach is based on the linear interpolation of the available pixel values, and the other is based on the weighted median of the available pixel values. The results that can be achieved with both approaches are illustrated in Figure 5.15.

Temporal Error Concealment at the Object Level
This section is devoted to object-level temporal (or inter-) error-concealment techniques, in the sense that they rely on information, for the object at hand, from temporal instants other than the current one. Since in most video objects the data does not change that much in consecutive time instants, these techniques are typically able to achieve better concealment results than spatial techniques.
Figure 5.15 A spatial texture concealment example: (a) corrupted object; (b) original object; (c) concealment with linear interpolation; (d) concealment with weighted median
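The two estimation approaches compared in Figure 5.15 can be illustrated with the following simplified one-dimensional sketch; the neighborhood handling and the weights are assumptions made for this sketch, not the exact formulation of [54]:

```python
def linear_interp(left, right, pos, width):
    """Estimate a lost pixel from its nearest available left/right
    neighbours on the same row; `pos` is the pixel's position (0-based)
    inside a lost run of `width` pixels."""
    a = (pos + 1) / (width + 1)
    return (1 - a) * left + a * right

def weighted_median(values, weights):
    """Weighted median of the available (possibly padded) neighbour
    values: the value at which half of the total weight is reached."""
    pairs = sorted(zip(values, weights))
    half, acc = sum(weights) / 2, 0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
```

For example, a single lost pixel between available values 10 and 20 would be estimated as 15 by linear interpolation, while the weighted median favours the neighbour values carrying the most weight, which tends to preserve edges better than averaging.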
In terms of object-level temporal shape error concealment, the technique proposed in [49] may be integrated in the proposed architecture. This technique is based on a combination of global and local motion compensation. It starts by assuming that the shape changes occurring in consecutive time instants can be described by a global motion model, and simply tries to conceal the corrupted shape data by using the corresponding shape data in the global motion-compensated previous shape. Assuming that the global motion model can accurately describe the shape motion, this alone should be able to produce very good results, as illustrated in Figure 5.16. However, in many cases, such as the one illustrated in Figure 5.17, the shape
Figure 5.16 The temporal shape concealment process with low local motion: (a) original uncorrupted shape; (b) corrupted shape; (c) motion-compensated previous shape; (d) concealed shape without local motion refinement
Figure 5.17 The temporal shape concealment process with high local motion: (a) original uncorrupted shape; (b) corrupted shape; (c) motion-compensated previous shape; (d) concealed shape without local motion refinement; (e) concealed shape with local motion refinement
motion cannot be accurately described by global motion alone, due to the existence of strong local motion in some areas of the (non-rigid) shape. Therefore, to avoid significant differences when concealing erroneous areas with local motion, an additional local motion refinement scheme has been introduced. Since the technique presented in [49] also works for texture data, it may also be integrated in the proposed architecture to conceal corrupted texture data.

Adaptive Spatio-temporal Error Concealment at the Object Level
The main problem with the two previous error-concealment techniques is that neither can be used with acceptable results for all possible situations. On one hand, spatial error-concealment techniques are especially useful when the video data changes greatly in consecutive time instants, such as when new objects appear. On the other hand, temporal error-concealment techniques are typically able to achieve better concealment results when the video data does not change much in consecutive time instants. Therefore, by designing a scheme that adaptively selects one of the two concealment techniques, it should be possible to obtain the advantages of both solutions while compensating for their disadvantages. An adaptive spatio-temporal technique has been proposed in [55] by the involved partners, and may be integrated in the proposed architecture. By using this spatio-temporal concealment technique, the concealment results can be significantly improved, as can be seen in Figure 5.18 for shape data.
Figure 5.18 The adaptive spatio-temporal error-concealment process: (a) uncorrupted original shape; (b) corrupted shape; (c) shape concealed with the spatial-concealment technique; (d) shape concealed with the temporal-concealment technique; (e) shape concealed with the adaptive spatio-temporal concealment technique
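The adaptive selection idea can be sketched as follows; the decision rule below (compare the correctly received shapels with the co-located shapels of the previous VOP) is an illustrative assumption, not the exact criterion of [55]:

```python
def choose_concealment(curr_ok, prev, threshold=0.1):
    """Pick a concealment mode for a corrupted VOP.

    curr_ok: dict mapping (y, x) positions of correctly decoded shapels
             to their values; prev: previous VOP's full shape (2D list).
    Temporal concealment is chosen when the received data changes little
    with respect to the previous shape; spatial concealment otherwise."""
    if not curr_ok:
        return 'temporal'  # nothing received: fall back to the past
    differing = sum(v != prev[y][x] for (y, x), v in curr_ok.items())
    change = differing / len(curr_ok)
    return 'temporal' if change <= threshold else 'spatial'
```

This captures the trade-off described above: a newly appearing or fast-changing object yields a high change ratio and is concealed spatially, while a quasi-static object is concealed temporally.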
Scene-level Error Concealment
The error-concealment techniques described until now all have a serious limitation in common: each video object is considered independently, without ever taking into account the scene context in which the objects are inserted. After all, just because a concealed video object has a pleasing subjective impact on the user when it is considered on its own, it does not necessarily mean that the subjective impact of the whole scene will be acceptable, particularly when (segmented) objects should fit together as in a jigsaw puzzle. An example of this situation was illustrated in Section 5.6. Two techniques have been proposed, in [52] and [53] (that in [53] corresponds to the technique described in Section 5.6), to deal with this kind of problem, and they may be used in the proposed architecture by the scene-concealment module. In these techniques, which target segmented video scenes, the corrupted parts of the data for which correctly decoded complementary data is available in the neighboring objects are first concealed; this is especially effective for shape data. Then, for the remaining parts of the data that cannot be concealed in this way, object-level concealment techniques, such as those described before, are applied to each object. While the technique in [52] considers a spatial approach for this step, that in [53] considers a temporal approach. As a result of this step, holes or object overlaps may appear in the scene, since objects have been independently processed. Therefore, to eliminate these artifacts in the scene, a final refinement step is applied, based on a morphological filter for the shape and on a pixel-averaging procedure for the texture.
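The first, complementary-data step of these scene-level techniques can be sketched in code under the assumption that the segmented objects tile the frame, so that a lost shapel of one object is the complement of the other objects' alpha planes wherever those are correctly decoded; all names below are illustrative:

```python
def recover_from_complement(shapes, valid, j):
    """Scene-level shape recovery from complementary data.

    shapes: list of binary alpha planes (lists of lists); valid: per-object
    masks marking correctly decoded shapels. Corrupted shapels of object j
    are recovered wherever every other object's co-located data is valid;
    positions that cannot be recovered this way are returned for the
    object-level concealment phase."""
    h, w = len(shapes[j]), len(shapes[j][0])
    unresolved = []
    for y in range(h):
        for x in range(w):
            if valid[j][y][x]:
                continue
            others = [i for i in range(len(shapes)) if i != j]
            if all(valid[i][y][x] for i in others):
                # Objects tile the frame: j is opaque iff nobody else is.
                shapes[j][y][x] = int(not any(shapes[i][y][x] for i in others))
                valid[j][y][x] = True
            else:
                unresolved.append((y, x))
    return unresolved
```

As noted in Section 5.6, shapels recovered this way are exact with respect to the original shape; only the unresolved positions need the lossy object-level concealment that follows.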
5.7.3 Performance Evaluation
Since illustrative results of what is possible with the different techniques proposed for each module of the integrated error-resilient architecture have already been shown in Section 5.7.2, further results will not be given here.
5.7.4 Conclusions
An integrated error-resilient object-based video coding architecture has been proposed. This is of the utmost importance because it can make the difference between having acceptable-quality video communications in error-prone environments and not having them at all. The work presented here corresponds to a complete error-resilient object-based video coding architecture, and the various parts of the system have been thoroughly investigated.
5.8 A Robust FMO Scheme for H.264/AVC Video Transcoding

5.8.1 Problem Definition and Objectives
As explained in Section 5.2, the H.264/AVC standard [56] includes several error-resilience tools, whose proper use and optimization is left open to the codec designer. The work described here focuses on studying adaptive FMO schemes to enhance the robustness of pre-encoded video material.
5.8.2 Proposed Technical Solution
According to the H.264/AVC syntax, each frame is partitioned into one or more slices, and each slice contains a variable number of MBs. FMO is a coding tool supported by the standard
that enables arbitrary assignment of each MB to the desired slice. FMO can be efficiently combined with FEC-based channel coding to provide unequal error protection (UEP). The basic idea is that the most important slice(s) can be assigned a stronger error-correcting code. The goal is to design efficient algorithms that can be used to provide a ranking of the MBs within a frame. The ranking order is determined by the error induced by the loss of the MB at the decoder side. In other words, those MBs that, if lost, cause a large increase of distortion should be given higher protection. The total increase of distortion, measured in terms of MSE, can be factored out as follows:

Dtot(t, i) = DMV(t, i) + Dres(t, i) + Ddrift(t, i)        (5.31)

where t is the frame index and i the MB index, and:

- DMV(t, i) is the additional distortion due to the fact that the correct motion vector is not available, and needs to be replaced by the concealed one at the decoder.
- Dres(t, i) is the additional distortion due to the fact that the residuals of the current MB are lost.
- Ddrift(t, i) is the drift introduced by the fact that the reference frame the MB refers to might be affected by errors.
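To illustrate how such a distortion ranking could drive UEP, the following sketch partitions the MBs of a frame into slice groups that receive different FEC code rates; the equal-size grouping policy and the example code rates are illustrative assumptions, not the allocation used in this work:

```python
def uep_slice_groups(distortions, code_rates=(0.5, 0.75, 1.0)):
    """Rank MBs by estimated loss-induced distortion Dtot and split them
    into len(code_rates) slice groups of (nearly) equal size; group 0,
    holding the highest-distortion MBs, gets the strongest code.

    Returns (group_of_mb, rate_of_group)."""
    n = len(distortions)
    order = sorted(range(n), key=lambda i: -distortions[i])
    groups = [0] * n
    per_group = -(-n // len(code_rates))  # ceiling division
    for rank, mb in enumerate(order):
        groups[mb] = rank // per_group
    return groups, {g: r for g, r in enumerate(code_rates)}
```

A rate of 1.0 here means no redundancy at all for the least important group, which is the spirit of UEP: concentrate the channel-coding budget where a loss hurts most.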
In the first part of the work, the focus is on the DMV(t, i) term only. Given a fraction m ∈ [0, 1] of the MBs that can be protected, the goal consists of identifying those MBs for which motion-compensated concealment at the decoder cannot reliably estimate the lost motion vector. Three approaches are considered and compared:

- Random selection of MBs: A fraction of m MBs is selected at random within the frame.
- Selection based on simulated concealment at the encoder: For each MB, the encoder simulates motion-compensated concealment, as if only the current MB is lost. The encoder computes the sum of absolute differences (SAD) between the original MB and the concealed one. The fraction of m MBs with the highest value of the SAD is selected. It is worth pointing out that this solution is computationally demanding, because it requires the execution of the concealment algorithm at the encoder for each MB in the frame.
- Selection based on motion activity: For each MB, a motion-activity metric is computed, based on the values of the motion vectors in neighboring MBs. If neighboring MBs are split into 4 × 4 blocks, up to 16 neighboring motion vectors can be collected. If larger blocks (8 × 8, 16 × 16) exist, their motion vectors are assigned to each of the constituent 4 × 4 blocks, in such a way that a list of 16 motion vectors, i.e. mvx = [mvx,1, mvx,2, ..., mvx,16], mvy = [mvy,1, mvy,2, ..., mvy,16], is always obtained. The activity index is computed as follows:
  - Sort the motion vector components x (y).
  - Discard the first four and the last four motion vector components in each sorted list.
  - Compute the standard deviation of the remaining eight motion vector components.
  - Average the standard deviations of the x and y components.
  Once the activity index is computed for each MB, the algorithm selects the fraction of m MBs with the highest index. With respect to the previous case, this algorithm is fast and requires the evaluation of data related to the motion field only.
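The motion-activity computation described above can be sketched as follows; the use of the plain population standard deviation, and the function names, are assumptions of this sketch:

```python
def activity_index(mvx, mvy):
    """Motion-activity index of one MB from the 16 neighbouring motion
    vector components per axis (already replicated from larger blocks)."""
    def trimmed_std(components):
        s = sorted(components)[4:-4]          # discard first/last four of 16
        mean = sum(s) / len(s)
        var = sum((v - mean) ** 2 for v in s) / len(s)
        return var ** 0.5
    # Average the standard deviations of the x and y components.
    return 0.5 * (trimmed_std(mvx) + trimmed_std(mvy))

def select_mbs(activities, m):
    """Return the indices of the fraction m of MBs with the highest index."""
    k = round(m * len(activities))
    order = sorted(range(len(activities)), key=lambda i: -activities[i])
    return set(order[:k])
```

The trimming makes the index robust to a few outlier neighbour vectors, so a high value genuinely signals local motion that the concealment algorithm is unlikely to estimate reliably.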
5.8.3 Performance Evaluation
In order to test the performance of the proposed algorithms, a simulation is performed as follows. Each sequence is encoded with a constant quantization parameter (QP = 24) and the dispersed slice grouping is selected. The number of slices per frame is set equal to 9, and 50 channel simulations are carried out for the target packet loss rate (PLR) of 10%. At the encoder, for each value of m, one of the three aforementioned approaches is used to select the MBs to be protected. For these MBs, at the decoder, the correct motion vector is used to replace the concealed one.

Figure 5.19 shows the results obtained when random selection is performed. The x-axis indicates the percentage of protected MBs, i.e. 100m%, while the y-axis shows the average PSNR across the sequence and over the channel simulations. The dashed line represents the average PSNR when no channel errors are introduced, while the solid line represents the average PSNR when errors occur and motion-compensated concealment is performed at the decoder (the algorithm included in the JM reference software is used for this purpose). In this test, the case in which a fraction of the bitplanes of the motion vectors is correctly recovered is simulated. Motion vector components are represented as 16-bit integers, where the last two bits are used for the fractional part to accommodate 1/4-pixel accuracy. The “16 Btpln” label indicates that the motion vector is completely recovered, the “14 Btpln” line indicates that only the integer part of the motion vector is recovered, and so on. The rationale behind this test is that a DVC scheme based on turbo codes could be accommodated to protect the most significant bitplanes of the motion vectors. Figure 5.19 illustrates a ramp-like behavior, which is to be expected since the MBs are selected at random.
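The bitplane truncation simulated in this test can be sketched as follows; the sign handling (truncation of the magnitude toward zero) is an assumption of this sketch:

```python
def keep_bitplanes(mv_quarter_pel, k):
    """Keep the k most significant of the 16 bitplanes of a motion vector
    component expressed in quarter-pel units (two fractional bits); the
    remaining, least significant, bitplanes are zeroed out."""
    drop = 16 - k
    sign = -1 if mv_quarter_pel < 0 else 1
    return sign * ((abs(mv_quarter_pel) >> drop) << drop)
```

With k = 16 the component is recovered exactly; with k = 14 the two fractional bits are dropped, which matches the “14 Btpln” case above where only the integer part of the motion vector survives.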
Conversely, both Figure 5.20 and Figure 5.21 show that, for a given value of m, a larger average PSNR can be attained by a careful selection of the MBs. Also, the selection based on simulated concealment at the encoder performs best, but it has a higher computational
Figure 5.19 Random selection of MBs (Foreman sequence): average PSNR (dB) versus percentage of corrected MBs, for 6 to 16 recovered motion vector bitplanes
Figure 5.20 Selection based on simulated concealment at the encoder (Foreman sequence): average PSNR (dB) versus percentage of corrected MBs
complexity. These results suggest that it is more efficient to protect a fraction of the MBs at full precision than the whole frame at a coarser accuracy, which makes a DVC-based protection of a fraction of the bitplanes impractical.
Figure 5.21 Selection based on motion activity (Foreman sequence): average PSNR (dB) versus percentage of corrected MBs

5.8.4 Conclusions
An adaptive FMO technique for H.264/AVC was proposed and analyzed in this section. FMO is an important error-resilience tool that allows arbitrary assignment of MBs to slices. In the proposed
scheme, an efficient algorithm has been designed to provide a ranking of the MBs within a frame. Three selection methods are considered: random selection of MBs; selection based on simulated concealment at the encoder; and selection based on motion activity. The proposed algorithms have been simulated and their results compared, showing that the decoded objective video quality improves with careful selection of the MBs, and that the selection based on simulated concealment at the encoder performs best.
5.9 Conclusions
(Portions reprinted, with permission, from M. Naccari, G. Bressan, M. Tagliasacchi, F. Pereira, S. Tubaro, “Unequal error protection based on flexible macroblock ordering for robust H.264/AVC video transcoding,” Picture Coding Symposium, Lisbon, November 2007. © 2007 EURASIP.)

The techniques described in this chapter correspond to the non-normative areas of video coding standards. They deal with error resilience and rate control, and can readily be applied to existing codecs without compromising standard compatibility, since they work within the realm of the standard. While the objective of error-resilience techniques is to improve the performance of the video coding system in the presence of channel errors, rate control techniques improve performance by adequately distributing the available bit-rate resources in space and time. Research in the field of non-normative video coding tools up till now has shown how important these tools are in getting the best performance from normative video coding standards. The non-normative video coding tools are the “decision makers” that any video coding standard needs in order to find the “right path” through a flexible coding syntax, for example in the most powerful coding standards like MPEG-4 Visual and H.264/AVC.
References

[1] ITU-T Recommendation H.261 (1993), “Video codec for audiovisual services at p × 64 kbps,” 1993.
[2] CCITT SGXV, “Description of reference model 8 (RM8),” Doc. 525, Jun. 1989.
[3] ISO/IEC 11172-2:1993, “Information technology: coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbps: part 2: video,” 1993.
[4] E. Viscito and C. Gonzales, “A video compression algorithm with adaptive bit allocation and quantization,” Proc. Visual Communications and Image Processing (VCIP91), Boston, MA, Vol. 1605, pp. 58–72, 1991.
[5] ISO/IEC 13818-2:1996, “Information technology: generic coding of moving pictures and associated audio information: part 2: video,” 1996.
[6] MPEG Test Model Editing Committee, “MPEG-2 Test Model 5,” Doc. ISO/IEC JTC1/SC29/WG11 N400, Sydney MPEG meeting, Apr. 1993.
[7] ITU-T Recommendation H.263 (1996), “Video coding for low bitrate communication,” 1996.
[8] ITU-T/SG15, “Video codec test model, TMN8,” Doc. Q15-A-59, Portland, OR, Jun. 1997.
[9] MPEG Video, “MPEG-4 video verification model 5.0,” Doc. ISO/IEC JTC1/SC29/WG11 N1469, Maceió MPEG meeting, Nov. 1996.
[10] T. Chiang and Y.Q. Zhang, “A new rate control scheme using quadratic rate distortion model,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, pp. 246–250, Feb. 1997.
[11] MPEG Video, “MPEG-4 video verification model 8.0,” Doc. ISO/IEC JTC1/SC29/WG11 N1796, Stockholm MPEG meeting, Jul. 1997.
[12] A. Vetro, H. Sun, and Y. Wang, “MPEG-4 rate control for multiple video objects,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, pp. 186–199, Feb. 1999.
[13] J. Ronda, M. Eckert, F. Jaureguizar, and N. García, “Rate control and bit allocation for MPEG-4,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, pp. 1243–1258, Dec. 1999.
[14] Y. Sun and I. Ahmad, "A robust and adaptive rate control algorithm for object-based video coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 10, pp. 1167–1182, Oct. 2004.
[15] Y. Sun and I. Ahmad, "Asynchronous rate control for multi-object videos," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 8, pp. 1007–1018, Aug. 2005.
[16] ISO/IEC 14496-10:2003/ITU-T Recommendation H.264, "Advanced video coding (AVC) for generic audiovisual services," 2003.
[17] Z.G. Li, F. Pan, K.P. Lim, G. Feng, X. Lin, and S. Rahardaj, "Adaptive basic unit layer rate control for JVT," Doc. JVT-G012, 7th meeting, Pattaya, Thailand, Mar. 2003.
[18] Z. He and S.K. Mitra, "A unified rate-distortion analysis framework for transform coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1221–1236, Dec. 2001.
[19] ISO/IEC 14496-2:2001, "Information technology: coding of audio-visual objects: part 2: visual," 2001.
[20] L.D. Soares and F. Pereira, "Refreshment need metrics for improved shape and texture object-based resilient video coding," IEEE Transactions on Image Processing, Vol. 12, No. 3, pp. 328–340, Mar. 2003.
[21] L.D. Soares and F. Pereira, "Adaptive shape and texture intra refreshment schemes for improved error resilience in object-based video," IEEE Transactions on Image Processing, Vol. 13, No. 5, pp. 662–676, May 2004.
[22] S. Shirani, B. Erol, and F. Kossentini, "A concealment method for shape information in MPEG-4 coded video sequences," IEEE Transactions on Multimedia, Vol. 2, No. 3, pp. 185–190, Sep. 2000.
[23] L.D. Soares and F. Pereira, "Spatial shape error concealment for object-based image and video coding," IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 586–599, Apr. 2004.
[24] G.M. Schuster, X. Li, and A.K. Katsaggelos, "Shape error concealment using Hermite splines," IEEE Transactions on Image Processing, Vol. 13, No. 6, pp. 808–820, Jun. 2004.
[25] P. Salama and C. Huang, "Error concealment for shape coding," Proc. IEEE International Conference on Image Processing, Rochester, NY, Vol. 2, pp. 701–704, Sep. 2002.
[26] L.D. Soares and F. Pereira, "Motion-based shape error concealment for object-based video," Proc. IEEE International Conference on Image Processing, Singapore, Oct. 2004.
[27] L.D. Soares and F. Pereira, "Combining space and time processing for shape error concealment," Proc. Picture Coding Symposium, San Francisco, CA, Dec. 2004.
[28] L.D. Soares and F. Pereira, "Spatial texture error concealment for object-based image and video coding," Proc. EURASIP Conference on Signal and Image Processing, Multimedia Communications and Services, Smolenice, Slovakia, Jun. 2005.
[29] ISO/IEC 14496-2:2001, "Information technology: coding of audio-visual objects: part 2: visual," 2001.
[30] P. Nunes and F. Pereira, "Joint rate control algorithm for low-delay MPEG-4 object-based video encoding," IEEE Transactions on Circuits and Systems for Video Technology (submitted).
[31] ISO/IEC 14496-5:2001, "Information technology: coding of audio-visual objects: part 5: reference software," 2001.
[32] MPEG Video, "MPEG-4 video verification model 5.0," Doc. N1469, Maceió MPEG meeting, Nov. 1996.
[33] A. Vetro, H. Sun, and Y. Wang, "MPEG-4 rate control for multiple video objects," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, pp. 186–199, Feb. 1999.
[34] J. Ronda, M. Eckert, F. Jaureguizar, and N. García, "Rate control and bit allocation for MPEG-4," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, pp. 1243–1258, Dec. 1999.
[35] H.-J. Lee, T. Chiang, and Y.-Q. Zhang, "Scalable rate control for MPEG-4 video," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 6, pp. 878–894, Sep. 2000.
[36] P. Nunes and F. Pereira, "Scene level rate control algorithm for MPEG-4 video encoding," Proc. VCIP01, San Jose, CA, Vol. 4310, pp. 194–205, Jan. 2001.
[37] Y. Sun and I. Ahmad, "A robust and adaptive rate control algorithm for object-based video coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 10, pp. 1167–1182, Oct. 2004.
[38] Y. Sun and I. Ahmad, "Asynchronous rate control for multi-object videos," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 8, pp. 1007–1018, Aug. 2005.
[39] P. Nunes and F. Pereira, "Rate control for scenes with multiple arbitrarily shaped video objects," Proc. PCS97, Berlin, Germany, pp. 303–308, Sep. 1997.
[40] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," Doc. VCEG-M33, Austin, TX, Apr. 2001.
[41] ISO/IEC 14496-10:2003/ITU-T Recommendation H.264, "Advanced video coding (AVC) for generic audiovisual services," 2003.
[42] T. Chiang and Y.Q. Zhang, "A new rate control scheme using quadratic rate distortion model," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, pp. 246–250, Feb. 1997.
[43] Z. He and S.K. Mitra, "A unified rate-distortion analysis framework for transform coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1221–1236, Dec. 2001.
[44] ISO/IEC 14496-2, "Information technology: coding of audio-visual objects: part 2: visual," Dec. 1999.
[45] R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd Ed., Prentice Hall, 2002.
[46] L.D. Soares and F. Pereira, "Motion-based shape error concealment for object-based video," Proc. IEEE International Conference on Image Processing, Singapore, Oct. 2004.
[47] L.D. Soares and F. Pereira, "Error resilience and concealment performance for MPEG-4 frame-based video coding," Signal Processing: Image Communication, Vol. 14, Nos. 6–8, pp. 447–472, May 1999.
[48] L.D. Soares and F. Pereira, "Refreshment need metrics for improved shape and texture object-based resilient video coding," IEEE Transactions on Image Processing, Vol. 12, No. 3, pp. 328–340, Mar. 2003.
[49] L.D. Soares and F. Pereira, "Temporal shape error concealment by global motion compensation with local refinement," IEEE Transactions on Image Processing, Vol. 15, No. 6, pp. 1331–1348, Jun. 2006.
[50] L.D. Soares and F. Pereira, "Adaptive shape and texture intra refreshment schemes for improved error resilience in object-based video," IEEE Transactions on Image Processing, Vol. 13, No. 5, pp. 662–676, May 2004.
[51] L.D. Soares and F. Pereira, "Spatial shape error concealment for object-based image and video coding," IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 586–599, Apr. 2004.
[52] L.D. Soares and F. Pereira, "Spatial scene level shape error concealment for segmented video," Proc. Picture Coding Symposium, Beijing, China, Apr. 2006.
[53] L.D. Soares and F. Pereira, "Spatio-temporal scene level error concealment for shape and texture data in segmented video content," Proc. IEEE International Conference on Image Processing, Atlanta, GA, Oct. 2006.
[54] L.D. Soares and F. Pereira, "Spatial texture error concealment for object-based image and video coding," Proc. EURASIP Conference on Signal and Image Processing, Multimedia Communications and Services, Smolenice, Slovakia, Jun. 2005.
[55] L.D. Soares and F. Pereira, "Combining space and time processing for shape error concealment," Proc. Picture Coding Symposium, San Francisco, CA, Dec. 2004.
[56] ISO/IEC 14496-10:2003/ITU-T Recommendation H.264, "Advanced video coding (AVC) for generic audiovisual services," 2003.
[57] "Joint model reference encoding methods and decoding concealment methods," Doc. JVT-I049d0, San Diego, CA, Sep. 2003.
6 Transform-based Multi-view Video Coding

6.1 Introduction

Existing video coding standards are suitable for coding 2D videos in a rate-distortion-optimized sense, where rate-distortion optimization refers to the process of jointly optimizing the resulting image quality and the required bit rate. The basic principle of block-based video coders is to remove temporal and spatial redundancies among successive frames of video sequences. However, as the viewing experience moves from 2D viewing to more realistic 3D viewing, it becomes impossible to reconstruct a realistic 3D scene from a single 2D video scene, or to allow a user (or a group of users) to freely and interactively navigate in a visual scene [1]. More viewpoints of a given scene are needed in such cases, which conventional video coders are not optimized to code jointly without modification of some of the existing compression tools. If every single representation (or viewpoint) of a scene is encoded as a separate 2D video, the size of the total payload to be transmitted grows in proportion to the number of viewpoints. This approach is highly impractical due to storage and bandwidth constraints. Therefore, other ways to jointly encode viewpoints have been developed, all of which exploit the strong correlations that exist among different viewpoints of a certain scene. Exploitation of inter-view correlation reduces the number of bits required for coding. The correspondences existing among different viewpoints are called inter-view correspondences. A generic multi-view encoder should reduce such inter-view redundancies, as well as the temporal and intra-frame redundancies that are already well catered for by existing 2D video compression techniques.
The ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group (MPEG) has recognized the importance of multi-view video coding, and established an ad hoc group (AHG) on 3D
audio and visual (3DAV) in December 2001 [2]. Four main exploration experiments were conducted in the 3DAV group between 2002 and 2004:

1. An exploration experiment on omni-directional video.
2. An exploration experiment on free viewpoint television (FTV).
3. An exploration experiment on coding of stereoscopic video using the multiple auxiliary component (MAC) of MPEG-4.
4. An exploration experiment on depth and disparity coding for 3D TV and intermediate view generation (view synthesis) [3].

After a Call for Comments was issued in October 2003, a number of companies asserted the need for a standard enabling FTV and 3D TV systems. In October 2004 MPEG called on interested parties to bring evidence on multi-view video coding (MVC) technologies [4,5], and a Call for Proposals on MVC was issued in July 2005 [6], following acceptance of that evidence. Responses to the Call were evaluated in January 2006 [7], and an approach based on a modified AVC scheme was selected as the basis for the standardization efforts. An important property of the codec under development is that it can decode AVC bitstreams as well as multi-view coded sequences. MVC remains part of the standardization activities of the Joint Video Team (JVT) of ISO/MPEG and ITU-T/VCEG. Much of the work carried out so far is intended to compensate for the illumination-level differences among multi-view cameras [8]. Such differences reduce the correlation between the different views, and therefore reduce compression efficiency. Further research has been carried out with the aim of improving inter-view prediction quality. Reference-frame-based techniques adapt existing motion estimation and compensation algorithms to remove inter-view redundancies: the disparity among different views is treated like motion in the temporal direction, and the same methods used to model motion fields are applied to model disparity fields.
Applying the highly-efficient prediction structure based on a hierarchical decomposition of frames (in the time domain) to the spatial (viewpoint) domain brings a considerable overall gain in prediction efficiency [9]. The Joint Multi-view Video Model (JMVM) takes this type of prediction structure as its base prediction. Disparity-based approaches take the geometrical constraints into account. These constraints can be measured and used to improve inter-view prediction efficiency by fully or partially exploiting the scene geometry. In these approaches, intermediate representations of frames are generated or synthesized and used as potential prediction sources for coding. In [10] a disparity-based approach is set out, which predicts frames from novel images synthesized from other viewpoints, using their corresponding per-pixel depth information and known camera parameters. In [11] it is proposed to refer to the positional correspondence of the multi-view camera rig when interpolating intermediate frames; accordingly, the scene geometry is exploited partially. The advantage of such a method is that neither depth/disparity fields nor the extrinsic and intrinsic parameters of the multi-camera arrays need to be calculated and sent. However, especially for free-viewpoint video applications [12,13], high-quality views need to be synthesized through depth-image-based rendering techniques at arbitrary virtual camera locations, which necessitates the communication of scene depth information anyway. Another line of work concentrates on efficient coding of multi-view depth information, while trying to preserve fine edge details in the depth image. The reconstruction quality of the depth information is shown to have a significant effect
on the quality of the synthesized free-viewpoint images [14]. The authors of [15] have proposed a multi-view-plus-depth-map coding system to handle both efficient compression and high-quality view generation at the same time. Similarly, in [16], the importance of considering multi-viewpoint videos jointly with their depth maps for efficient overall compression is stressed. A completely different data representation format for depth information is defined in [17], where the depth maps of a scene captured from a multiple-camera array are fused into the layers of a special depth representation. This representation is called the Layered Depth Image (LDI), and coding of such representations is investigated in [18,19]. Owing to its different picture format, the LDI is not easily or efficiently compressible with conventional video coders like AVC. Many possible 3D video and free-viewpoint video applications involving multiple viewpoint videos are subject to constraints. A large part of the work performed on multi-view video coding aims to improve compression efficiency. However, beyond the bandwidth/storage limitations that demand high compression efficiency, there are other application-specific constraints, including coding delay (crucial for real-time applications), coding complexity, random access capability, and additional scalability in the view dimension. These should also be taken into account when designing application-specific multi-view encoders. In the context of scalable 3D video compression, many key achievements were made within VISNET I. The purpose of the research activities carried out within VISNET II is to develop new video coding tools suitable for multi-view video data, and to integrate these tools into the state-of-the-art multi-view video coder JSVM/JMVM.
In VISNET II a number of different tools have been developed to improve the rate-distortion performance, reduce the multi-view encoder complexity, and at the same time cope with the other application-specific constraints stated above. The research work performed within VISNET II on transform-based multi-view coding is detailed in the rest of the chapter. Section 6.2 discusses the reduction of multi-view encoder complexity through the use of a multi-grid pyramidal approach. Section 6.3 discusses inter-view prediction using reconstructed disparity information. Section 6.4 presents work on multi-view coding via virtual view generation, and Section 6.5 discusses low-delay random view access issues and proposes a solution.
6.2 MVC Encoder Complexity Reduction using a Multi-grid Pyramidal Approach

6.2.1 Problem Definition and Objectives

Multi-view video coding uses several reference frames to perform predictive coding at the encoder. Furthermore, motion estimation is performed with respect to each reference frame using different block search sizes. This entails a very complex encoder. The goal of this contribution is to reduce the complexity of the motion estimation while preserving the rate-distortion performance.
6.2.2 Proposed Technical Solution

The complexity reduction is achieved by using so-called locally adaptive multi-grid block matching motion estimation [20]. The technique is also expected to generate more reliable motion fields. A coarse but sufficiently robust estimate of the motion field is computed at the lowest
206
Visual Media Coding and Transmission
resolution level and is iteratively refined at the higher resolution levels. Such a process leads to a robust estimation of large-scale structures, while short-range displacements are accurately estimated on small-scale structures. The method takes into consideration the simple fact that coarser structures are sufficient in uniform regions, whereas finer structures are required in detailed areas. It produces a sufficiently precise estimate of the motion vectors, so that the prediction efficiency is maintained with a less complex structure than a classic full search. On the other hand, the coding cost increases, since the amount of side information becomes larger. The multi-grid approach consists of three main steps, namely motion estimation at each level, the segmentation decision, and the down-projection operator. These are described in more detail in the following subsections.

6.2.2.1 Motion Estimation at Each Level

At each level, motion estimation is performed using block matching. In order to reduce the complexity of the algorithm, an n-step search technique is used to find a matching block, as illustrated in Figure 6.1. Initially, the nine locations defined by the offsets (0, ±2^(n−1)) in each direction are evaluated. The best estimate becomes the starting point for the next search step. At the ith step, the eight locations defined by the offsets (0, ±2^(n−i)) around the current starting point are evaluated. The resulting maximum displacement of the n-step search is therefore 2^n − 1.

6.2.2.2 The Segmentation Decision Rule

The segmentation rule decides whether or not to split a block. This rule has a great impact on the overall performance of the algorithm. The algorithm can produce either more accurate motion vectors, which lead to a large amount of side information (i.e. many blocks are split), or poorer motion vectors, which require less overhead information (i.e. few blocks are split).
At its simplest, the segmentation rule can be defined in terms of the mean absolute error (MAE) and a threshold T: a block is left unsplit as long as its MAE does not exceed T, and split otherwise:

MAE ≤ T → no split,  MAE > T → split   (6.1)

The mean square error (MSE) is another possible measure. The difficulty with this rule is defining an appropriate threshold T.
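The n-step search and the MAE split rule of Equation (6.1) can be sketched as follows. This is a toy illustration, not the chapter's implementation: frames are plain 2D lists of luma samples, and the block size, step count and threshold are arbitrary example values.

```python
# Toy sketch of the n-step block search (Section 6.2.2.1) and the MAE
# split rule of Equation (6.1). Out-of-frame displacements get an
# infinite cost so they are never selected.

def mae(cur, ref, bx, by, dx, dy, bs):
    """Mean absolute error between the bs x bs block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    h, w = len(ref), len(ref[0])
    if not (0 <= by + dy and by + dy + bs <= h and
            0 <= bx + dx and bx + dx + bs <= w):
        return float('inf')
    err = 0
    for r in range(bs):
        for c in range(bs):
            err += abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
    return err / (bs * bs)

def n_step_search(cur, ref, bx, by, bs, n):
    """At step i the candidate offsets are (0, +/-2^(n-i)) around the
    current best point, so the maximum reach is 2^n - 1."""
    best = (0, 0)
    for i in range(1, n + 1):
        s = 2 ** (n - i)
        cands = [(best[0] + sx, best[1] + sy)
                 for sx in (-s, 0, s) for sy in (-s, 0, s)]
        best = min(cands, key=lambda d: mae(cur, ref, bx, by, d[0], d[1], bs))
    return best, mae(cur, ref, bx, by, best[0], best[1], bs)

def should_split(block_mae, T):
    """Equation (6.1): split the block when its MAE exceeds T."""
    return block_mae > T
```

With n = 2 the search evaluates at most 9 + 8 positions per block instead of the 49 of a full search over the same ±3 range, which is the complexity saving the section refers to.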
Figure 6.1 n-step search
Figure 6.2 Down-projection initializes the motion vectors of blocks a, b, c and d. Block a chooses from the motion vectors of its direct parent block, block A, block B and block D. Block b chooses from the motion vectors of its direct parent block, block B, block C and block E. Block c chooses from the motion vectors of its direct parent block, block D, block F and block G. Block d chooses from the motion vectors of its direct parent block, block E, block G and block H
6.2.2.3 Down-projection Operator

This operator maps the motion vectors between two grid levels, from the coarser level towards the finer ones (i.e. it passes the motion vectors that parent blocks transmit to their child blocks). It should prevent block artifacts from propagating into the fine levels. At the same time, it should prevent incorrect motion vector estimates resulting from the matching criterion selecting a local minimum. Furthermore, it should guarantee a smooth and robust motion field. Each block can be segmented into four child blocks, each of which requires an initial motion vector, obtained by the down-projection operation. The simplest way is to pass the motion vector of the parent block to the four children. A more efficient operator selects, for each child block, the best motion vector from among four parent blocks: those closest to the child (i.e. its direct parent block and three of its neighbors, depending on the position of the child). A down-projection example is given in Figure 6.2. An example of a multi-grid iteration is shown in Figure 6.3; the coarsest and finest block resolutions are 64 × 64 and 4 × 4, respectively. It confirms that coarser structures are sufficient in uniform regions, whereas finer structures are required in detailed areas.
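The "best of four parent vectors" rule can be sketched as below. This is a hedged illustration: the `cost` callback stands in for the matching error of Section 6.2.2.1 and is an assumption, and the neighbour layout simply follows the pattern of Figure 6.2 (each child considers its direct parent plus the three adjacent parents on its side).

```python
# Sketch of the down-projection operator: each child block inherits the
# best of four candidate motion vectors, chosen by evaluating a matching
# cost on the child block itself.

def down_project(parent_mvs, px, py, cost):
    """Return initial MVs for the four children of parent (px, py).

    parent_mvs -- dict {(px, py): (dx, dy)} of coarse-level vectors
    cost       -- cost(child_index, mv) -> matching error of trying
                  `mv` on that child block (user-supplied)
    """
    def mv(x, y):
        # Fall back to the direct parent at the grid border.
        return parent_mvs.get((x, y), parent_mvs[(px, py)])

    # Children 0..3 = a, b, c, d of Figure 6.2: direct parent plus the
    # three spatially closest neighbouring parents.
    neighbours = {
        0: [(px, py), (px - 1, py), (px, py - 1), (px - 1, py - 1)],
        1: [(px, py), (px + 1, py), (px, py - 1), (px + 1, py - 1)],
        2: [(px, py), (px - 1, py), (px, py + 1), (px - 1, py + 1)],
        3: [(px, py), (px + 1, py), (px, py + 1), (px + 1, py + 1)],
    }
    return {child: min((mv(x, y) for (x, y) in cands),
                       key=lambda v: cost(child, v))
            for child, cands in neighbours.items()}
```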
Figure 6.3 An example of the multi-grid search, where the block size varies from 4 to 64 pixels
The novelty comes from the adaptation of the multi-grid block motion estimation approach to the framework of multi-view video coding.
6.2.3 Conclusions and Further Work

The intended next step is to replace the simple splitting rule with an entropy-based splitting criterion. This kind of splitting operation is expected to control the segmentation so that an optimal bit allocation between the motion parameters and the displaced frame difference (DFD) is reached. The criterion compares the extra cost of sending additional motion parameters with the gain obtained on the DFD side, to decide whether or not to split a given block. In addition, it is expected to avoid the threshold-setting problem encountered with the MAE segmentation rule. As an alternative to block-based motion search, a global motion model [21] (a homography) for prediction from the side views will be investigated for use in MVC. This would generate less side information to be encoded, since the model applies to the whole image, whereas block-based motion search assigns a motion vector to each block of the image.
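The rate-distortion comparison behind the entropy splitting criterion can be sketched in a few lines. The Lagrangian weighting `lam` used to convert bits into a distortion-comparable cost is an assumption for illustration; the chapter does not fix a particular trade-off formula.

```python
# Hedged sketch of the splitting idea: split a block only when the
# distortion saved on the displaced frame difference (DFD) outweighs
# the cost of the extra motion parameters.

def split_decision(dfd_parent, dfd_children, extra_mv_bits, lam):
    """Return True when coding four child vectors is worth it.

    dfd_parent    -- DFD cost when the block is kept whole
    dfd_children  -- total DFD cost after splitting into four blocks
    extra_mv_bits -- additional motion side information, in bits
    lam           -- Lagrange multiplier converting bits to distortion
    """
    j_keep = dfd_parent
    j_split = dfd_children + lam * extra_mv_bits
    return j_split < j_keep
```

Unlike the MAE rule of Equation (6.1), no absolute threshold is needed: the decision depends only on the relative costs of the two alternatives.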
6.3 Inter-view Prediction using Reconstructed Disparity Information

6.3.1 Problem Definition and Objectives

In future multi-view video coding standards, the use of disparity/depth information will be essential, and the coding of such data needs to be considered. Having such information enables new applications containing interactivity, i.e. the user can freely navigate in a visual scene through the existence of virtual camera views. In this research, the effect of lossy-encoded disparity maps on inter-view prediction is investigated. If this information is available, it can be used to assist inter-view prediction. Hence, the overall bit rate is reduced, and furthermore the complexity of the multi-view encoder is reduced as well.
6.3.2 Proposed Technical Solution

The developed framework consists of four main building blocks, which are explained in the following subsections.

6.3.2.1 Estimation of the Fundamental Matrix

To reduce the correspondence search to one dimension, the fundamental matrix F is required. This step has to be carried out once for a fixed camera setup. The following steps are needed to estimate F [22]:

- Feature detection using the Harris corner detector.
- Feature matching using normalized cross-correlation (NCC).
- Estimation of F using the 7-point algorithm and RANSAC.
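The matching step above relies on the normalized cross-correlation score, which can be sketched on its own. Patch extraction around the Harris corners and the RANSAC loop for F are omitted; this toy sketch only illustrates the similarity measure itself.

```python
# NCC between two equal-size patches, given as flat lists of samples.
# A score of 1.0 means the patches are perfectly (positively)
# correlated; -1.0 means perfectly anti-correlated.

from math import sqrt

def ncc(p, q):
    """Normalized cross-correlation of two flat patch lists."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    num = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    den = sqrt(sum((a - mp) ** 2 for a in p) *
               sum((b - mq) ** 2 for b in q))
    return num / den if den else 0.0
```

Because the score is invariant to affine changes in patch intensity, it tolerates the illumination differences between cameras mentioned in Section 6.1.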
Figure 6.4 Parameterization of disparity
6.3.2.2 Estimation of the Dense Disparity Map

The notation for the disparity is briefly explained below. The fundamental matrix F relates point correspondences x and x′ of two images (i.e. views in this case):

x′ᵀ F x = 0   (6.2)
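Constraint (6.2) can be checked numerically for the special case of a rectified stereo pair with a purely horizontal baseline, where F takes the well-known closed form used below (an illustrative special case, not the F estimated in this section). The disparity is then a pure shift along the image row, so x′ = (x − d, y, 1) satisfies the constraint for any d.

```python
# Toy check of the epipolar constraint x'^T F x = 0 for a rectified
# horizontal stereo pair, where F has the closed form below.

F = [[0, 0, 0],
     [0, 0, -1],
     [0, 1, 0]]

def epipolar_residual(xp, x):
    """Compute x'^T F x for homogeneous points x, x' (3-element lists)."""
    fx = [sum(F[r][c] * x[c] for c in range(3)) for r in range(3)]
    return sum(xp[r] * fx[r] for r in range(3))
```

For this F the residual reduces to y − y′, i.e. corresponding points must lie on the same row, which is exactly what reduces the correspondence search to one dimension.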
Figure 6.4 depicts the parameterization of the disparity. The vectors T and N denote the tangential and normal vectors of the epipolar line, respectively. D denotes the unitary disparity vector, which relates x to x′ by a translation λD; λ is the variable to be estimated below. More detail about this notation can be found in [23]. The problem of finding the correspondences in the two images is formulated as a partial differential equation (PDE), which has been integrated into an expectation-maximization (EM) framework [24]. The following functional is minimized using finite differences:
E[λ(x)] = Σₓ V(x) (Q(x)ᵀ S⁻¹ Q(x)) + C ∇λ(x)ᵀ T(∇I₁*) ∇λ(x),  Q(x) = I₁* − I₂(x + F(λ(x), x))   (6.3)
Here F maps λ to the disparity λD, V denotes the visibility of the pixel in the other image, I gives the true image, and S is the covariance matrix modeling normally-distributed noise with zero mean. T(·) is a tensor, explained in detail in [23] and [24]. Figure 6.5 shows the disparity image estimated with the above algorithm for a sample image from the sequence Flamenco2.

6.3.2.3 Lossy Encoding of the Disparity Map

The estimated disparity map is quantized to quarter-pixel precision. After that, the disparity map is encoded using JSVM 3.5. During the test, the luminance (Y) component of the disparity image is set to the estimated disparity values, whereas the chrominance (U and V) components are set to a constant value and are not used. The result after encoding is shown in Figure 6.6.
Figure 6.5 Original image (left) and dense disparity map (right)
6.3.2.4 Multi-view Video Coding Using the Reconstructed Disparity Map for View Prediction

JSVM 3.5 is used to encode the pictures. The encoder is set to support only unidirectional prediction (P prediction). To allow the insertion of external frames as prediction references, the GOP-string structure, which tells the encoder how to encode each frame, has been adapted. As seen in Figure 6.7, the reference candidate is warped according to the reconstructed disparity map before being used as a reference; the frame currently being predicted uses the warped prediction instead.
6.3.3 Performance Evaluation

The current implementation supports P prediction only. Figure 6.8 shows the coding results for a single test frame. The disparity map is encoded using 2624 bits at a mean square error equal
Figure 6.6 Lossy encoded disparity map
Figure 6.7 Multi-view video coder that uses reconstructed disparity maps for view prediction
to 0.35. Figure 6.8 shows some gain in rate-distortion performance. If the displacement search range is set to zero, the drop in rate-distortion performance is only minor.
6.3.4 Conclusions and Further Work

The next steps include the use of B prediction to achieve a fully-featured multi-view coding approach. B prediction can be achieved by using depth maps and camera information. Furthermore, the coding of the disparity/depth maps needs further investigation.
Figure 6.8 Result of the proposed coding scheme. The blue curve shows the result if the displacement search range is set to zero
6.4 Multi-view Coding via Virtual View Generation

(Portions reprinted, with permission, from E. Ekmekcioglu, S.T. Worrall, A.M. Kondoz, "Multi-view video coding via virtual view generation," 26th Picture Coding Symposium, Portugal, November 2007. © 2007 EURASIP.)
6.4.1 Problem Definition and Objectives

In this research, a multi-view video coding method based on the generation of virtual image sequences is proposed. It is intended to improve the inter-view prediction process by exploiting the known scene geometry. Moreover, most future applications requiring MVC already necessitate the use of scene geometry information [15,16], so it is beneficial to use this information during the compression stage. Pictures are synthesized through a 3D warping method to estimate certain views in a multi-view set, which are then used as inter-view references. Depth maps and the associated color video sequences are used for virtual view generation. JMVM is used for coding the color videos and depth maps. Results are compared against the reference H.264/AVC simulcast method, where every view is coded without using any kind of inter-view prediction, under low-delay coding scenarios. The rate-distortion performance of the proposed method outperforms that of the reference method at all bit rates.
6.4.2 Proposed Technical Solution

The proposed framework is composed of two main steps, namely virtual view generation through a 3D depth-based warping technique, and multi-view coding using the generated virtual sequences as inter-view references.

6.4.2.1 Generation of Virtual Views through Depth-image-based 3D Warping

In order to remove the spatial redundancy among neighboring views in a multi-view set, virtual sequences are rendered from already-encoded frames of certain views, called "base views". The rendered frames are then used as alternative predictions for the corresponding frames of the views to be predicted, called "intermediate views" or, equivalently, "b views". The virtual views are rendered through the unstructured lumigraph rendering technique explained in [25]. In this research work, the method uses an already-encoded picture of the base view, which is projected first into the 3D world with the pinhole camera model, and then back into the image coordinates of the intermediate view, taking into account the camera parameters of both the base view and the intermediate view. The pixel at base-view image coordinates (x, y) is projected to 3D world coordinates using:

[u, v, w] = R(c) A⁻¹(c) [x, y, 1] D[c, t, x, y] + T(c)   (6.4)
where [u, v, w] is the world coordinate. (Reproduced by permission of © 2007 EURASIP.) Here, c identifies the base-view camera. R, T, and A denote the 3 × 3 rotation matrix, the 3 × 1 translation vector, and the 3 × 3 intrinsic matrix of the base-view camera, respectively, and D[c, t, x, y] is the distance of the corresponding pixel (x, y) from the base-view camera at time t [25]. The world
Figure 6.9 Final rendered image (in the middle), constructed from the left and right rendered images. Reproduced by permission of © 2007 EURASIP
coordinates are mapped back into the intermediate-view image coordinate system using:

[x′, y′, z′] = A(c′) R⁻¹(c′) {[u, v, w] − T(c′)}   (6.5)
where [(x′/z′), (y′/z′)] is the corresponding point in the intermediate-view image coordinate system [25]. The matrices in Equations (6.4) and (6.5), i.e. R, T, and A, and the corresponding depth images of the base views, are all provided by Microsoft Research for the multi-view Breakdancer sequence [15]. The camera parameters must be supplied to the image renderer; in the experiments, the depth maps supplied by Microsoft Research are used. In the proposed method, the 3D warping procedure is carried out pixel by pixel. However, care must be taken to avoid several visual artifacts. First, several pixels in the reference picture may map to the same pixel location in the target picture. In that case, the pixels falling on the same point in the target picture are depth-sorted, and the pixel closest to the camera is displayed. Second, not every pixel falls on an integer pixel location; the exact locations are rounded to the nearby integer pixel locations in the target image. This causes many small visual holes to appear in the rendered image. The values for empty pixels are estimated by extrapolating the nearby filled pixels, which is a valid estimate for holes with a radius smaller than 10 pixels. For every intermediate view, the two neighboring base views are warped separately into the intermediate-view image coordinate system, and the warped view yielding the best objective quality is chosen for the prediction. For better prediction quality and better use of the scene geometry, the formerly-occluded regions in the final prediction view are compared with the corresponding pixels in the other warped image. Figure 6.9 shows a sample final rendered image segment, formed from two side camera images.

6.4.2.2 MVC Prediction Method

One motivation for using virtual references as prediction sources is that for future video applications, particularly FTV, the transport of depth information will be essential [26].
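The projection chain of Equations (6.4) and (6.5) that produces these virtual references can be sketched with NumPy. The camera matrices here are illustrative placeholders, not the Breakdancer calibration data, and the depth sorting and hole filling described above are left out of this sketch.

```python
# Sketch of the two projection steps used for depth-image-based
# 3D warping: Equation (6.4) lifts a base-view pixel into world
# coordinates, Equation (6.5) projects it into the target view.

import numpy as np

def pixel_to_world(x, y, depth, R, A, T):
    """Equation (6.4): [u, v, w] = R A^-1 [x, y, 1] * depth + T."""
    return R @ np.linalg.inv(A) @ np.array([x, y, 1.0]) * depth + T

def world_to_pixel(world, Rp, Ap, Tp):
    """Equation (6.5): [x', y', z'] = A' R'^-1 ([u, v, w] - T'),
    followed by the perspective divide by z'."""
    v = Ap @ np.linalg.inv(Rp) @ (world - Tp)
    return v[0] / v[2], v[1] / v[2]
```

A quick sanity check is the round trip with identical cameras: warping a pixel out with Equation (6.4) and back with Equation (6.5) using the same R, A and T must return the original pixel, whatever the depth.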
Exploiting that information to improve the compression of certain views is therefore quite reasonable. Besides the rendered references, it is important not to remove temporal references from the prediction list, since temporal references account for the highest percentage of the references used in hierarchical B-frame prediction [9]. In
Figure 6.10 Camera arrangement and view assignment. Reproduced by permission of © 2007 EURASIP
the tests, all other means of inter-view referencing are removed, in order to see the extent to which the proposed method outperforms conventional temporal predictive coding techniques.

Figure 6.10 shows the camera arrangement and view assignment for an eight-camera multi-view sequence. The view assignment is flexible. There are two reasons for this particular assignment, although it may not be optimal in terms of minimizing the overhead caused by depth map coding. One reason is that, for any intermediate view, the two closest base views are used for 3D warping, so that the most accurate prediction frame possible can be rendered. The other reason is to show the effect of using prediction frames rendered from just one base view: in our case, virtual prediction frames for the coding of intermediate view 7 are rendered using only base view 6.

JMVM 2.1 is used for the proposed multi-view video coding scenario. Both the color videos and the depth maps of the base views are encoded in H.264 simulcast mode (no inter-view prediction). However, the original depth maps are downscaled to half resolution prior to encoding. The fact that depth maps don't need to include full depth information to be useful for stereoscopic video applications [27] motivates the use of downscaled depth maps containing sparser depth information. In the experiments, the use of reduced-resolution depth maps affected the reconstruction quality at the decoder only negligibly, even at very low bit rates: the PSNR of the decoded and upsampled depth maps varied between roughly 33 dB and 34.5 dB. Table 6.1 shows the coding conditions for base views and depth maps. Following the coding of the base views with their depth maps, the intermediate views are coded using the rendered virtual sequences as inter-view references. The original frames at I-frame and P-frame positions are coded using the corresponding virtual frame references; at P-frame locations, temporal referencing is still enabled.
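The half-resolution depth-map handling can be sketched as below. This is an illustrative reconstruction, not the JMVM code: decimation by 2 and nearest-neighbor upsampling are assumed as the simplest possible filters.

```python
import numpy as np

def downscale_half(depth):
    """Keep every second sample per dimension: a sparser depth map."""
    return depth[::2, ::2]

def upsample_nearest(depth_half, shape):
    """Nearest-neighbor upsampling back to the original resolution."""
    rows = np.repeat(depth_half, 2, axis=0)[:shape[0], :]
    return np.repeat(rows, 2, axis=1)[:, :shape[1]]

def psnr(ref, test, peak=255.0):
    """PSNR in dB between two same-sized images."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)
```

On smooth, piecewise-constant depth maps this round trip loses little; the 33–34.5 dB figures quoted above come from the real coded sequences, not from this sketch.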
A lower quantization parameter is used for coding the intermediate views. The prediction structure for intermediate view coding is illustrated schematically in Figure 6.11. One reason for such a prediction structure is that it is intended to explore the coding performance of the proposed scheme in low-delay coding scenarios. Besides, as the GOP size increases, and the coding performance of temporal prediction is maximized, the effect of the proposed method on the overall coding efficiency becomes less visible: experiments showed that for a GOP size of 12 the proposed technique had no gain over the reference technique (the H.264-based simulcast method).

Table 6.1 Codec configuration. Reproduced by permission of © 2007 EURASIP

Software: JMVM 2.1
Symbol mode: CABAC
Loop filter: On (color video), Off (depth maps)
Search range: 96
Prediction structure: I P I P ... (low delay, open GOP)
Random access: 0.08 second (25 fps video)
Transform-based Multi-view Video Coding
Figure 6.11 Prediction structure of intermediate views. Reproduced by permission of © 2007 EURASIP
6.4.3 Performance Evaluation
Figure 6.12(a) and (b) shows the performance comparison of the proposed MVC method with H.264-based simulcast coding. The coding bit rate of the depth map doesn't exceed 20% of the coding bit rate of the associated color video. Figure 6.12(c) and (d) shows the performance comparisons between the proposed method and the reference method, where all frames in base views are intra-coded and intermediate views are predicted only from rendered virtual sequences. Figure 6.12(e) and (f) shows the results for the Ballet test sequence.

[Figure 6.12: six rate-distortion curves (average PSNR in dB versus average bit rate in kbps, Reference vs. Proposed): (a) Breakdancers intermediate views 1, 3, and 5 (average), I P I P coding; (b) Breakdancers intermediate view 7, I P I P coding; (c) Breakdancers intermediate views 1, 3, and 5 (average), I I I I coding; (d) Breakdancers intermediate view 7, I I I I coding; (e) Ballet intermediate views 1, 3, and 5 (average), I I I I coding; (f) Ballet intermediate view 7, I I I I coding.]
Figure 6.12 Rate distortion performance of proposed and reference schemes. Reproduced by permission of © 2007 EURASIP
6.4.4 Conclusions and Further Work

According to Figure 6.12, the coding performance is improved in comparison to combined I and P prediction. The difference in gain between Figure 6.12(a) and (c) shows that the proposed method has a considerable gain over intra-coded pictures, but also that temporal references should be kept as prediction candidates to achieve optimum coding performance. Similar results are observed in Figure 6.12(b) and (d), where the performance of the proposed method is analyzed for intermediate view 7. The proposed method still outperforms the reference coding method, and the gain over intra-coding is significant. The overall decrease in average coding gain, compared to that for intermediate views 1, 3, and 5, shows that virtual sequences rendered using two base views can predict the original view better than virtual sequences rendered using only one base view. Similar results are obtained for the Ballet sequence, as can be seen in Figure 6.12(e) and (f). The subjective evaluation of the proposed method was also satisfactory. Accordingly, the proposed method is suitable for use in multi-view applications under low-delay constraints.
6.5 Low-delay Random View Access in Multi-view Coding Using a Bit Rate-adaptive Downsampling Approach

6.5.1 Problem Definition and Objectives

(Portions reprinted, with permission, from E. Ekmekcioglu, S.T. Worrall, A.M. Kondoz, "Low delay random view access in multi-view coding using a bit-rate adaptive downsampling approach," IEEE International Conference on Multimedia & Expo (ICME), 23–26 June 2008, Hannover, Germany. © 2008 IEEE; and E. Ekmekcioglu, S.T. Worrall, A.M. Kondoz, "Utilisation of downsampling for arbitrary views in multi-view video coding," IET Electronics Letters, 28 February 2008, Vol. 44, Issue 5, pp. 339–340. © 2008 IET.)

In this research, a new multi-view coding scheme is proposed and evaluated. The scheme offers improved low-delay random view access capability and, at the same time, compression performance comparable to that of the reference multi-view coding scheme currently in use. The proposed scheme uses the concept of multiple-resolution view coding, exploiting the tradeoff between quantization distortion and downsampling distortion at changing bit rates, which in turn provides improved coding efficiency. Bi-predictive (B) coded views, used in the conventional MVC method, are replaced with predictive-coded downscaled views, reducing the view dependency in a multi-view set and hence the random view access delay, while preserving the compression performance. Results show that the proposed method reduces the random view access delay in an MVC system significantly, with objective and subjective performance similar to the conventional MVC method.
6.5.2 Proposed Technical Solution

A different inter-view prediction structure is proposed, which aims to replace B-coded views with downsampled (using bit rate-adaptive downscaling ratios) and P-coded views. The goal is to omit B-type inter-view predictions, which inherently introduce view hierarchy into the system and increase the random view access delay. The disadvantage is that B coding, which improves coding efficiency significantly, is avoided. However, the proposed scheme preserves the coding performance by using downsampled and P-coded views, reducing the
random view access delay remarkably at the same time. A mathematical model is constructed to relate the coding performances of the different coding types used within the proposed scheme to one another, which makes it possible to estimate the relative coding efficiencies of different inter-view prediction structures. In the following subsections, the bit rate-adaptive downsampling approach is explained and the proposed inter-view prediction structure is presented.

6.5.2.1 Bit Rate-adaptive Downsampling

The idea behind downsampling a view prior to coding and upsampling the reconstructed samples is based on the tradeoff between two types of distortion: distortion due to quantization and distortion due to downsampling. Given a fixed bit rate budget, downsampling more aggressively means that less coarse quantization needs to be used: more information is lost through downsampling, but less is lost through quantization. Finding the optimum tradeoff between the two distortion sources should lead to improved compression efficiency. To observe this, views are downsampled with different downscaling ratios prior to encoding, ranging from 0.3 to 0.9 (the same ratio for each dimension of the video, leaving the aspect ratio unchanged). These ratios are tested over a broad range of bit rates. The results indicate that the optimum tradeoff between the two distortion types varies with the target bit rate. Figure 6.13 shows the performance curves of downscaled coding, for several downscaling ratios and for full-resolution coding, for a particular view of the Breakdancer test sequence. The best performance at medium and low bit rates, where quantization distortion dominates, is achieved with the mid-range 0.6 scaling ratio, whereas at much higher bit rates, where downsampling distortion becomes dominant, larger scaling ratios (0.8–0.9) are more suitable, as they introduce less downsampling distortion.
Very low ratios, such as 0.3, are only useful at very low bit rates, where the reconstruction quality is already too low to be usable (less than 32 dB). These findings are consistent across different data sets.
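The bit rate-adaptive choice therefore reduces to a small rule. The sketch below hardcodes the two operating points the experiments settled on (0.6 at low-to-medium rates and 0.8 at high rates, with the 300 kbit/s switching point the text adopts for VGA sequences at 25 fps); the function names are illustrative.

```python
def downscaling_ratio(bitrate_kbps: float) -> float:
    """Pick the predefined downscaling ratio for a view from its
    target bit rate: 0.6 below 300 kbit/s, 0.8 at or above."""
    return 0.6 if bitrate_kbps < 300 else 0.8

def downscaled_size(width: int, height: int, ratio: float):
    """Apply the same ratio to each dimension (aspect ratio unchanged)."""
    return round(width * ratio), round(height * ratio)
```

For example, a 200 kbit/s VGA view would be coded at 384 × 288, while a 600 kbit/s view would be coded at 512 × 384.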
Figure 6.13 Coding performance of a multi-view coder that uses several downscaling ratios for the second view of the Breakdancer test sequence. Reproduced by permission of © 2008 IET
In the rest of this work, for simplicity, two predefined downscaling ratios are used: 0.6 for bit rates below 300 kbit/s and 0.8 for bit rates above 300 kbit/s, targeting VGA sequences (640 × 480) at 25 fps. Accordingly, up to 20% saving in bit rate is achieved for individual views at certain reconstruction qualities.

6.5.2.2 Inter-view Prediction Structure

Random view access corresponds to accessing any frame in a GOP of any view with minimal decoding of other views [28]. In Figure 6.14(a), the inter-view prediction structure of the current MVC reference is shown (for the 8-view and 16-view cases) at anchor frame positions. The random view access cost, defined as the maximum number of frames that must be decoded to reach the desired view, is 8 and 16 for the 8-view and 16-view cases, respectively. The disadvantage is that the cost grows at the same rate as the number of cameras. Furthermore, in some streaming applications only the relevant views may be sent to the user, to save bandwidth; with such a dependency structure, more views would have to be streamed, and hence the bit rate would increase. In Figure 6.14(b) the proposed low-delay view random access model with downsampled P coding (LDVRA + DP) is shown. The group of views (GOV) concept, which is suitable for free-viewpoint video, is used (GOVs are separated by dashed lines). In each GOV one view, called the base view, is coded at full spatial resolution, while
Figure 6.14 Anchor frame positions: (a) reference MVC inter-view prediction structure; (b) low-delay view random access with downsampled P coding. Reproduced by permission of © 2008 IEEE
other views, called enhancement views, are downsampled using the approach described in Section 6.5.2.1 and are P-coded. None of the views is B-coded, so no extra layers of view dependency are present. Every enhancement view depends on its associated base view, and every base view depends on the same base view, whose anchor frames are intra-coded.
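The effect of this dependency structure on random view access can be illustrated with a small cost model. The assumptions here are mine, for illustration: view 0 is the intra-coded base view, the first view of each GOV is its base view, and the cost counts all anchor frames decoded, including the target's own.

```python
def cost_reference_chain(view: int) -> int:
    """Reference MVC anchor structure: views form one P chain, so
    reaching view k means decoding the anchors of views 0..k."""
    return view + 1

def cost_ldvra_dp(view: int, gov_size: int) -> int:
    """LDVRA + DP: decode the I base view, then (if different) the
    target's GOV base view, then (if an enhancement view) the target."""
    base = (view // gov_size) * gov_size
    cost = 1                      # the intra-coded base view (view 0)
    if base != 0:
        cost += 1                 # this GOV's own base view
    if view != base and view != 0:
        cost += 1                 # the enhancement view itself
    return cost
```

Under these assumptions the access cost of the proposed structure is bounded by 3 regardless of the number of cameras, whereas the reference chain grows linearly with it.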
6.5.3 Performance Evaluation

In this work it is assumed that the coding performances at anchor frame positions of both techniques reflect their overall multi-view coding performances. The reason is that most of the coding gain in MVC, compared to simulcast, is achieved at anchor frame positions, where there is no temporal prediction. Therefore, only the coding efficiency at anchor frame positions is evaluated. In both prediction methods (reference MVC and LDVRA + DP) there is one intra-coded (I) view (all GOPs begin with an intra-coded frame). Other than the I view, both prediction structures contain a certain number of P views, B views (only in the reference technique), and DP views (only in LDVRA + DP coding). Another assumption is that, for each view, I coding at anchor frame positions at the same time instant would generate a similar bit rate for the same output quality; the same holds for P coding. The efficiency metrics of P, B, and DP coding are defined as aP, aB, and aDP, respectively. aP is set to 1 initially; accordingly, aB and aDP lie between 0 and 1. A lower efficiency index means higher coding efficiency. The values of aB and aDP are determined experimentally, and their values for different views are found to be consistent. Therefore, at changing bit rates, aB and aDP values are calculated by averaging their values over the results of all views. Let the total number of cameras in a multi-view system be 2n. The per-view coding efficiency index can then be calculated as:

Reference MVC:  [n + (n − 1)aB] / 2n        (6.6)

LDVRA + DP (GOV = 3):  [⌊2n/3 − 1⌋ + (⌊4n/3⌋ + 1)aDP] / 2n        (6.7)

LDVRA + DP (GOV = 5):  [⌊2n/5 − 1⌋ + (⌊8n/5⌋ + 1)aDP] / 2n        (6.8)
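Equations (6.6)–(6.8) can be evaluated directly; the sketch below is a plain transcription of the formulas (the example efficiency values for aB and aDP in the test are arbitrary placeholders, not measured ones).

```python
from math import floor

def index_reference(n: int, a_b: float) -> float:
    """Per-view efficiency index of reference MVC, Equation (6.6)."""
    return (n + (n - 1) * a_b) / (2 * n)

def index_ldvra_dp(n: int, a_dp: float, gov: int) -> float:
    """Per-view index of LDVRA + DP, Equations (6.7)/(6.8), gov in {3, 5}."""
    full_res = floor(2 * n / gov - 1)                  # P-coded base views
    downsampled = floor((gov - 1) * 2 * n / gov) + 1   # DP-coded views
    return (full_res + downsampled * a_dp) / (2 * n)
```

Since a lower index means higher efficiency, whenever aDP is well below 1 the GOV = 5 structure scores better than GOV = 3, which in turn scores better than the reference structure with aB close to 1.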
Figure 6.15 shows the per-view efficiency versus PSNR graphs for the 16-view Rena and 8-view Breakdancer sequences, for experimentally-determined values of aP, aB, and aDP. These are determined at different bit rates by taking the ratios of the output bit rates of the B and DP views to the output bit rate of the P-coded views. Actual coding results with JMVM are given in Figure 6.16, and the common coding configuration for each view is shown in Table 6.2. The LDRA curves represent the variant in which no downsampling is applied to the P-coded views; LDRA performs worse than the reference MVC method, since it benefits from neither B-view coding nor downsampled P-view coding. The proposed LDRA coding technique with downsampled P-view coding tends to perform better than the reference coding technique, especially at low bit rates. This is observed in both the estimation graphs and the real coding results. At the same time, the real relative efficiencies of the proposed techniques with
Figure 6.15 Estimated relative performances for Rena (16 views) and Breakdancer (8 views) test sets. Reproduced by permission of © 2008 IEEE
respect to the reference coding technique are reflected correctly in the estimated relative per-view efficiency graphs.

In order to compare the relative efficiencies of the two techniques, F1 is defined as the difference between the per-view efficiency indices of the reference and the proposed techniques. Then:

F1 = [(3n − 3)aB − (4n + 3)aDP + (n + 3)] / 6n        (6.9)

In order to make sure that the proposed low-delay random access coding scenario performs at least as well as the reference MVC method, the following condition must be satisfied:

F1 ≥ 0        (6.10)

Table 6.2 Codec configuration. Reproduced by permission of © 2008 IEEE

Basis QP: 22, 32, 37
Entropy coding: CABAC
Motion search range: 32
Temporal prediction structure: Hierarchical B prediction
Temporal GOP size: 12
RD optimization: Yes
Figure 6.16 Real experiment results with JMVM for Rena (16 views) and Breakdancer (8 views) test sets. Reproduced by permission of © 2008 IEEE
Experimental values for aP, aB, and aDP guarantee that the condition in (6.10) is satisfied for the test videos used, at most bit rates. Similarly, in order to verify that the proposed scheme performs better with larger GOV sizes, F2 is defined as the difference between the per-view efficiency indices of the two proposed techniques (one with a GOV size of 3 and the other with a GOV size of 5). Then the following is obtained:
F2 = [⌊2n/3 − 1⌋ + (⌊4n/3⌋ + 1)aDP] / 2n − [⌊2n/5 − 1⌋ + (⌊8n/5⌋ + 1)aDP] / 2n        (6.11)

⇒ F2 = (2/15)(1 − aDP) ≥ 0
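The closed forms (6.9) and (6.11) follow from subtracting the per-view indices of (6.6)–(6.8); the small check below verifies both numerically for an n where the floor operations are exact (n divisible by 15). The index functions are re-stated from the equations to keep the snippet self-contained, and the aB/aDP values are arbitrary placeholders.

```python
from math import floor

def index_reference(n, a_b):                 # Equation (6.6)
    return (n + (n - 1) * a_b) / (2 * n)

def index_ldvra_dp(n, a_dp, gov):            # Equations (6.7) and (6.8)
    return (floor(2 * n / gov - 1)
            + (floor((gov - 1) * 2 * n / gov) + 1) * a_dp) / (2 * n)

def f1(n, a_b, a_dp):                        # Equation (6.9)
    return ((3 * n - 3) * a_b - (4 * n + 3) * a_dp + (n + 3)) / (6 * n)

def f2(a_dp):                                # simplified form of (6.11)
    return (2 / 15) * (1 - a_dp)

# Check both closed forms against the index definitions (n = 15).
n, a_b, a_dp = 15, 0.55, 0.7
assert abs(f1(n, a_b, a_dp)
           - (index_reference(n, a_b) - index_ldvra_dp(n, a_dp, 3))) < 1e-9
assert abs(f2(a_dp)
           - (index_ldvra_dp(n, a_dp, 3) - index_ldvra_dp(n, a_dp, 5))) < 1e-9
```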
Since aDP is strictly below 1, the condition in (6.11) is certain to be satisfied; this is observed in both the estimated and the real coding results.

The perceptual quality of the proposed low-delay random access scheme with downsampled P coding is compared with that of the reference MVC method, including B-coded views, using the stimulus comparison adjectival categorical judgment method described in Recommendation ITU-R BT.500-11 [29]. The Rena, Breakdancer, and Ballet test sequences are used for the evaluations. A differential mean opinion score is calculated at two different bit rates and plotted on a differential scale, where 0 corresponds to no perceptual difference between the two methods and negative values indicate that the proposed method performs better. Sixteen subjects took part in the evaluations; Figure 6.17 shows the results. Since the Rena sequence is originally blurry, the downsampling distortion is not visually perceptible, and there is no visual difference between the conventional MVC and the proposed coding method for this sequence. For the other sequences tested, at high bit rates the perceptual qualities of the two methods do not differ, indicating that the downsampling distortion (blurriness) is not a significant issue. At lower bit rates, quantization distortion (blockiness) is more visible than downsampling distortion, and hence the proposed method generates visually more satisfactory results.
Figure 6.17 Subjective test results comparing the proposed method and the reference MVC method. Reproduced by permission of © 2008 IEEE
6.5.4 Conclusions and Further Work

It is observed that the random view access performance of multi-view coding systems can be improved significantly with respect to the conventional MVC method, without any loss of coding performance or perceptual quality. The reason is that the performance of the efficient B coding present in the conventional MVC method can be matched by downsampled P coding. The proposed inter-view dependency structure is more suitable for fast-switching free-view systems, due to its use of the group of views concept. Furthermore, assigning larger GOV sizes can further increase the compression performance without affecting the overall random view access delay. The proposed approach brings a slight increase in complexity, due to the added up-conversion and down-conversion blocks, but this is balanced by the reduced processing load for the downsampled videos. One limitation of the technique concerns highly-textured video sequences, where the inherent low-pass filtering effect of downsampling might significantly degrade the subjective quality. This could be overcome by transmitting extra residuals for the blocks in the vicinity of object edges to improve the visual quality, which is the next step in this research.
References

[1] A. Smolic et al., "3D video and free viewpoint video: technologies, applications and MPEG standards," IEEE International Conference on Multimedia and Expo, Jul. 2006.
[2] ISO/IEC JTC1/SC29/WG11 N371, "List of ad-hoc groups established at the 58th meeting in Pattaya," 2001.
[3] A. Smolic and D. McCutchen, "3DAV exploration of video-based rendering technology in MPEG," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 3, pp. 348–356, Mar. 2004.
[4] ISO/IEC JTC1/SC29/WG11 N6720, "Call for evidence on multi-view video coding," 2004.
[5] ISO/IEC JTC1/SC29/WG11 N6999, "Report of the subjective quality evaluation for multi-view coding CfE," 2005.
[6] ISO/IEC JTC1/SC29/WG11 N7327, "Call for proposals on multi-view video coding," 2005.
[7] ISO/IEC JTC1/SC29/WG11 N7779, "Subjective test results for the CfP on multi-view video coding," 2006.
[8] Y.-L. Lee, J.-H. Hur, D.-Y. Kim, Y.-K. Lee, S.-H. Cho, N.-H. Hur, and J.W. Kim, "H.264/MPEG-4 AVC-based multi-view video coding (MVC)," ISO/IEC JTC1/SC29/WG11 M12871, Jan. 2006.
[9] H. Schwarz, T. Hinz, A. Smolic, T. Oelbaum, T. Wiegand, K. Mueller, and P. Merkle, "Multi-view video coding based on H.264/MPEG-4 AVC using hierarchical B pictures," Picture Coding Symposium, 2006.
[10] J. Xin, A. Vetro, E. Martinian, and A. Behrens, "View synthesis for multi-view video compression," Picture Coding Symposium, 2006.
[11] K. Yamamoto, M. Kitahara, H. Kimata, T. Yendo, T. Fujii, M. Tanimoto et al., "SIMVC: multi-view video coding using view interpolation and color correction," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 17, No. 11, Nov. 2007.
[12] M. Tanimoto, "Overview of free viewpoint television," Signal Processing: Image Communication, Vol. 21, pp. 454–461, 2006.
[13] H. Kimata et al., "Multi-view video coding using reference picture selection for free viewpoint video communication," Picture Coding Symposium, San Francisco, CA, 2004.
[14] A. Smolic et al., "Multi-view video plus depth representation and coding," IEEE International Conference on Image Processing, San Antonio, TX, Sep. 2007.
[15] C.L. Zitnick et al., "High-quality video view interpolation using a layered representation," ACM SIGGRAPH and ACM Transactions on Graphics, Aug. 2004.
[16] P. Kauff et al., "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability," Signal Processing: Image Communication, Vol. 22, pp. 217–234, Feb. 2007.
[17] J. Shade, S. Gortler, L. He, and R. Szeliski, "Layered depth images," Computer Graphics Proceedings, Annual Conference Series, SIGGRAPH, Orlando, FL, Jul. 1998.
[18] S.-U. Yoon and Y.-S. Ho, "Multiple color and depth video coding using a hierarchical representation," IEEE Transactions on Circuits and Systems for Video Technology, Oct. 2007.
[19] J. Duan and J. Li, "Compression of the layered depth image," IEEE Transactions on Image Processing, Vol. 12, No. 3, Mar. 2003.
[20] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: a review and a new contribution," Proceedings of the IEEE, Vol. 83, No. 6, pp. 858–876, Jun. 1995.
[21] F. Dufaux and J. Konrad, "Robust, efficient and fast global motion estimation for video coding," IEEE Transactions on Image Processing, Vol. 9, No. 3, pp. 497–501, Mar. 2000.
[22] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[23] L. Alvarez, R. Deriche, J. Sanchez, and J. Weickert, "Dense disparity map estimation respecting image discontinuities: a PDE and scale-space based approach," Tech. Rep. RR-3874, INRIA, Jan. 2000.
[24] C. Strecha, R. Fransens, and L.J. Van Gool, "A probabilistic approach to large displacement optical flow and occlusion detection," European Conference on Computer Vision, SMVP Workshop, Prague, Czech Republic, pp. 71–82, May 2004.
[25] S. Yea et al., "Report on core experiment CE3 of multiview coding," JVT-T123, Klagenfurt, Austria, Jul. 2006.
[26] M. Tanimoto et al., "Proposal on requirements for FTV," JVT-W127, San Jose, CA, Apr. 2007.
[27] W.J. Tam and L. Zhang, "Depth map preprocessing and minimal content for 3D-TV based on DIBR," JVT-W095, San Jose, CA, Apr. 2007.
[28] Y. Liu et al., "Low-delay view random access for multi-view video coding," IEEE International Symposium on Circuits and Systems, pp. 997–1000, May 2007.
[29] ITU-R, "Methodology for the subjective assessment of the quality of television pictures," Recommendation BT.500-11, 2002.
7
Introduction to Multimedia Communications

7.1 Introduction

The goal of wireless communication is to allow a user to access required services at any time, regardless of location or mobility. Recent developments in wireless communication, multimedia technology, and microelectronics have created a new paradigm in mobile communications. Third/fourth-generation wireless communication technologies provide significantly higher transmission rates and service flexibility, over a wider coverage area, than is possible with second-generation systems. High-compression, error-robust multimedia codecs have been designed to enable the support of multimedia applications over error-prone, bandwidth-limited channels. Advances in VLSI and DSP technologies are producing lightweight, low-cost, portable devices capable of transmitting and viewing multimedia streams. These technological developments have shifted the service requirements of future wireless communication systems from conventional voice telephony to business-oriented multimedia services. To meet the challenges set by future audiovisual communication requirements, the International Telecommunication Union Radiocommunication Sector (ITU-R) has elaborated a framework for global third-generation standards by recognizing a limited number of radio access technologies: Universal Mobile Telecommunications System (UMTS), Enhanced Data rates for GSM Evolution (EDGE), and CDMA2000. UMTS is based on Wideband CDMA technology and is employed in Europe and Asia in the frequency band around 2 GHz. EDGE is based on TDMA technology and uses the same air interface as the successful second-generation mobile system GSM.
General Packet Radio Service (GPRS) and High-Speed Circuit-Switched Data (HSCSD) were introduced in phase 2+ of the GSM standardization process, and support enhanced services with data rates of up to 144 kbps in the packet-switched and circuit-switched domains, respectively. GPRS has also been accepted by the Telecommunication Industry Association (TIA) as the packet data standard for TDMA/136 systems. EDGE, which is the evolution of GPRS and HSCSD, provides third-generation services at up to 500 kbps within the GSM carrier spacing of
[Figure: evolution of the second-generation systems GSM, TDMA/136, and IS-95 through HSCSD, GPRS, EDGE, and CDMA2000 1× toward the third-generation 3GPP WCDMA (FDD/TDD) UMTS and CDMA2000.]
Mobile terminal velocity: 250 kmph (for 900 MHz)
Carrier frequency: User-definable to 900 MHz or 1800 MHz
Antenna characteristics: 0 dB gain for both transmitter and receiver; no antenna diversity
Signal-to-noise characteristics: AWGN source at receiver; user-definable Eb/No ratio
Burst recovery: Synchronization based on the cross-correlation properties of the training sequence
Equalizer: 16-state soft-output MLSE equalizer for GMSK; 16-state decision-feedback MLSE equalizer for 8-PSK
Channel decoding: Soft-decision Viterbi convolutional decoder; Fire correction and detection for CS-1 and CRC detection for CS-2–4 and MCS-1–9
Performance measures: Bit error patterns and block error patterns
Simulation length: User-definable; most experiments run for 15 000 blocks per timeslot
[Table rows for interleaving, training sequence codes, modulation, and the interference, fading, multipath, and transmission characteristics are not recoverable from this extract.]
for these experiments, which will be referred to as the CCSR (Centre for Communication Systems Research) model, and those quoted in [13]. From the results in Table 8.6 it can be seen that for coding schemes CS-2 and CS-3 the performance of the CCSR model is about 1.5 dB better than the reference value suggested in

Table 8.5 Reference interference performance values; implementation margin of 2 dB included. C/I at BLER = 10%

Coding scheme    GSM 900 TU50 – ideal FH    GSM 900 TU3 – no FH
CS-1             9 dB                       13 dB
CS-2             13.8 dB                    15 dB
CS-3             16 dB                      16 dB
CS-4             23 dB                      19.3 dB
Table 8.6 Comparison of reference performance at TU50 IFH 900 MHz. Conditions: ideal frequency hopping, receiver noise floor Eb/No = 28 dB (2 dB implementation margin assumed). C/I at BLER = 10%

Coding scheme    [14]       [13]      CCSR model
CS-1             9 dB       9 dB      8.5 dB
CS-2             13.8 dB    13 dB     12.3 dB
CS-3             16 dB      15 dB     14.6 dB
CS-4             23 dB      23 dB     23.5 dB
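The gap between the CCSR model and the reference values in Table 8.6 can be quantified directly; the dictionaries below simply transcribe the table's C/I columns.

```python
# C/I (dB) required for BLER = 10%, transcribed from Table 8.6
annex_l = {"CS-1": 9.0, "CS-2": 13.8, "CS-3": 16.0, "CS-4": 23.0}   # [14]
ccsr    = {"CS-1": 8.5, "CS-2": 12.3, "CS-3": 14.6, "CS-4": 23.5}

# positive = the CCSR model needs less C/I than the reference values
gain_db = {cs: round(annex_l[cs] - ccsr[cs], 1) for cs in annex_l}
```

CS-2 and CS-3 come out roughly 1.5 dB better, while CS-1 and CS-4 differ by only about 0.5 dB (CS-4 in the other direction).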
Annex L of GSM 05.50. The difference in performance for schemes CS-1 and CS-4 is less pronounced, at around 0.5 dB. However, comparison of BLER interference performance traces obtained using the CCSR model with those presented in Annex P of the same document [13], which gives the results obtained by Ericsson, shows that the performances of the two simulated systems are virtually identical. This is shown in Figure 8.7.

8.2.4.2 GPRS Model Verification at TU1.5 NFH 1800 MHz

[13] Annex P also presents proposals for GPRS reference performance results for the TU3 multipath model at 1800 MHz with no frequency hopping implemented. The results of the CCSR model at a BLER of 10% under these conditions are shown in Table 8.7. They clearly show that under the propagation conditions described in TU1.5, the performance of the CCSR model at a resulting BLER of 10% is very close to the reference interference performance levels specified in [14] Annex L. In fact, the obtained C/I ratio differs by no more than 0.5 dB for CS-1 and CS-3. No suitable results for comparison were
Figure 8.7 GPRS interference performance TU50 IFH 900 MHz
Wireless Channel Models

Table 8.7 Comparison of reference performance at TU1.5 NFH 1800 MHz. Conditions: no frequency hopping, receiver noise floor Eb/No = 28 dB (2 dB implementation margin assumed). C/I at BLER = 10%

Coding scheme    [14]       [13]      CCSR model
CS-1             13 dB      13 dB     13.5 dB
CS-2             15 dB      15 dB     15.2 dB
CS-3             16 dB      16 dB     16.5 dB
CS-4             19.3 dB    19 dB     –
obtained from CS-4 as simulations were carried up to C/I ¼ 18 dB, at which value the resulting BLER was still in excess of 10%. The performance traces under these conditions can be seen in Figure 8.7, where the results obtained with the CCSR model closely match those obtained by Ericsson in [14] Annex P. 8.2.4.3 GPRS Model Verification at TU1.5 IFH 1800 MHz Results obtained using the CCSR simulator were compared with those specified in GSM 05.05 and are shown in Table 8.8 and Figures 8.8–8.10. The simulations were run for lengths equivalent to 2 106 information bits, which is equivalent to roughly 10 800 RLC/MAC blocks for CS-1 and 4600 RLC/MAC blocks for CS-4 coding schemes. This length of information bits can be used to provide a continuous channel error pattern for a 64 kbps video stream for over 30 seconds before looping over to the beginning of the error sequence. This is because the bursty nature of the GSM fading channel, coupled with the differing visual susceptibility to errors of different parameters that constitute the video bitstream, requires that sufficiently long error patterns are used to obtain meaningful results. The performance of the CCSR model was validated for a variety of conditions. In particular, the model was seen to give accurate results with respect to variations in interference levels, transmission modes (frequency hopping enabled/disabled), and carrier frequency. Although the results for GPRS were presented as block error ratio values, the output from the simulators characterizes the physical link layer in terms of both bit and block error patterns. Although the results obtained with the designed model closely match the quoted reference performance Table 8.8 Comparison of reference performance at TU1.5 IFH 1800 MHz; Conditions: no frequency hopping, receiver noise floor Eb/No ¼ 28 dB (2 dB implementation margin assumed) C/I at BLER ¼ 10%
Coding scheme   CS-1     CS-2    CS-3    CS-4
[13]            9 dB     13 dB   15 dB   23 dB
CCSR model      9.7 dB   13 dB   15 dB   –
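The quoted block counts and stream duration can be checked with simple arithmetic. In the sketch below, the per-block sizes of 184 bits (CS-1) and 428 bits (CS-4) are assumptions chosen for illustration; they reproduce the approximate counts quoted in the text:

```python
# Checking the simulation lengths quoted in the text. The per-block
# sizes of 184 bits (CS-1) and 428 bits (CS-4) are assumed values;
# they reproduce the quoted approximate block counts.
TOTAL_BITS = 2_000_000

blocks_cs1 = TOTAL_BITS // 184           # roughly 10 800 RLC/MAC blocks
blocks_cs4 = TOTAL_BITS // 428           # roughly 4600 RLC/MAC blocks
seconds_at_64kbps = TOTAL_BITS / 64_000  # duration of a 64 kbps stream

assert 10_000 < blocks_cs1 < 11_000
assert 4_500 < blocks_cs4 < 4_800
assert seconds_at_64kbps > 30            # over 30 s before looping
```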
Figure 8.8 GPRS interference performance TU1.5 NFH 1800 MHz
levels, it must be appreciated that much of the system performance relies on implementation-dependent factors. This is particularly true for the receiver's correlator and equalizer and, to a lesser extent, the channel decoding mechanisms. These factors lead to variations of up to 2 dB in the GPRS physical layer performance figures released by different manufacturers [14].

Figure 8.9 GPRS interference performance TU1.5 NFH 1800 MHz

In
Figure 8.10 GPRS interference performance at TU1.5 IFH 1800 MHz
addition, it must be noted that the assumption made in [13] that the TU1.5 1800 MHz and TU3 900 MHz propagation models are identical is retained here.
8.2.5 EGPRS Physical Link Layer Simulator

As for the GPRS PDTCHs, the reference performance of EGPRS links is specified in terms of the carrier-to-interference ratio or energy per modulated bit required to obtain a 10% radio block error ratio. A block is considered erroneous in the simulation when any of the following occurs:

. Uncorrectable bit errors in the data field after decoding (including CRC bits).
. Uncorrectable bit errors in the header field after decoding (including CRC bits).
. An erroneously decoded stealing flag code word.
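The block-error criterion above can be sketched as a small predicate (the field names are illustrative, not a real decoder API):

```python
# Sketch of the block-error criterion used in the simulations
# (argument names are illustrative, not from any real decoder API).
def block_is_erroneous(data_crc_ok, header_crc_ok, stealing_flags_ok):
    """A radio block counts as erroneous if the data field, the header
    field, or the stealing-flag code word fails decoding."""
    return not (data_crc_ok and header_crc_ok and stealing_flags_ok)

def block_error_ratio(blocks):
    # blocks: iterable of (data_ok, header_ok, flags_ok) tuples
    blocks = list(blocks)
    errors = sum(block_is_erroneous(*b) for b in blocks)
    return errors / len(blocks)

assert block_is_erroneous(True, True, False)
assert not block_is_erroneous(True, True, True)
assert block_error_ratio([(True, True, True), (False, True, True)]) == 0.5
```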
References [15] and [16] also specify that erroneous modulation detection, also referred to as blind detection error, should be simulated. This type of error was not included in the simulation models used, as different models were used for the GMSK and 8-PSK modulation schemes, with the receivers knowing a priori the type of modulation to expect. This assumption is not expected to cause excessive deviation from the required results. When examining the results obtained by different manufacturers and specified in [17], it can be seen that the specified values for the 8-PSK schemes vary widely. For example, in the co-channel interference case at TU1.5 NFH at 1800 MHz given in [17], there exists a 4.9 dB difference between the worst and best quoted values for MCS-5, and a 5.1 dB difference for MCS-6. In [18], fewer results from fewer manufacturers are given, but there still remains a spread of around 3 dB in the given values. As a result, average-based reference performance values are given in [17] and [18], which are seen to be quite similar to each other. The performance of the CCSR simulation model was therefore compared with these average figures. The variance in
values of the GMSK coding schemes (MCS-1–4) is shown to be considerably lower than for the 8-PSK schemes. This is probably due to the greater maturity of GMSK-based technology for fading channels as compared to 8-PSK.

8.2.5.1 EGPRS Model Validation at TU50 NFH 900 MHz

The EGPRS models were simulated for co-channel interference, with a single interferer being used. When examining the results obtained in [16], it was seen that MCS-8 and MCS-9 could not reach 10% BLER even at C/I values in excess of 30 dB. For this reason, reference performance figures at 30% BLER were used in these cases. As will be shown in this chapter, channel error ratios that result in BLER values of around 10% are considerably too high to produce acceptable video quality. For this reason MCS-8 and MCS-9 were not used in the tests carried out. The results obtained are shown in Tables 8.9–8.12. The reference figures given in [17] and [19], which are both published by the EDGE drafting group, are seen to be extremely similar to each other. Under these propagation conditions, the CCSR model is seen to produce results to within 0.5 dB for all coding schemes, with the exception of MCS-2, where the performance of the CCSR model is inferior by 1.5 dB, and MCS-3, where a discrepancy of 2 dB is noted. At MCS-7 and MCS-5, the CCSR model's performance is superior to the reference figures.

8.2.5.2 EGPRS Model Validation at TU50 NFH 1800 MHz

Reference performance figures for MCS-1–4 for these propagation conditions were not available in the given references. However, the performance is fairly similar to the equivalent results obtained at TU50 NFH 900 MHz, indicating that the performance is close to the expected values. Indeed, comparing the values for CS-1–4 for GSM 900 MHz and DCS 1800 MHz at TU50 NFH in [13], it can be seen that the reference figures at 10% BLER do not vary by more than 1 dB. There does, however, exist a considerable discrepancy between the results for MCS-5–7 as given in [20] and [18], where differences of up to 6 dB can be seen. Although the CCSR model performs well at MCS-5, giving results within 1 dB of the value quoted in [20], codes MCS-6 and MCS-7 perform considerably worse. In fact, at MCS-7 a figure of 10% BLER was not achieved at all.

8.2.5.3 EGPRS Model Validation at TU1.5 NFH 1800 MHz

The CCSR EGPRS model performs to within 1 dB of the reference figures at all C/I ratios, except for MCS-3, where the discrepancy is 1.5 dB. Indeed, at MCS-5 the CCSR model
Table 8.9 Performance comparison at TU50 NFH 900 MHz

         [17]      [19]      CCSR
MCS-1    8.5 dB    –         9 dB
MCS-2    10.5 dB   –         12 dB
MCS-3    15.0 dB   –         17 dB
MCS-4    20.0 dB   –         20.5 dB
MCS-5    15.5 dB   15.5 dB   15.3 dB
MCS-6    18.0 dB   18.0 dB   18.5 dB
MCS-7    23.0 dB   25.0 dB   24.0 dB
Table 8.10 Performance comparison at TU50 NFH 1800 MHz

         [18]      [20]      CCSR
MCS-1    –         –         8 dB
MCS-2    –         –         12 dB
MCS-3    –         –         17 dB
MCS-4    –         –         21 dB
MCS-5    15 dB     13.0 dB   16 dB
MCS-6    18 dB     15.5 dB   24 dB
MCS-7    27.5 dB   21.5 dB   –
exceeds the values given in [17] and [19] by about 4 dB. It is evident that under these conditions the CCSR model performs very similarly to the reference models. See Table 8.11.

8.2.5.4 EGPRS Model Validation at TU1.5 IFH 1800 MHz

When using ideal frequency hopping, the CCSR model on average performs around 1.5 dB worse than the reference figures. One reason for this is that, in order to simplify implementation and reduce the complexity of the simulation models, the de-correlation between consecutive bursts is not perfect, even though the degree of correlation is greatly reduced when compared to the non-frequency-hopping case. This slightly reduces the efficacy of the interleaving mechanisms. However, the difference between the obtained values and the reference figures exceeds 1.5 dB only at MCS-3, and is less than 1 dB for MCS-4, MCS-5 and MCS-7. The performance of the model under these conditions can consequently be considered adequate.

Table 8.11 Performance comparison at TU1.5 NFH 1800 MHz

         [19]      [17]      CCSR
MCS-1    –         11.0 dB   11.3 dB
MCS-2    –         13.0 dB   13 dB
MCS-3    –         14.5 dB   16 dB
MCS-4    –         17.0 dB   18 dB
MCS-5    19.5 dB   19.0 dB   15.2 dB
MCS-6    21.5 dB   21.0 dB   21 dB
MCS-7    26.5 dB   24.0 dB   24 dB

Table 8.12 Performance comparison at TU1.5 IFH 1800 MHz

         [19]      [17]      CCSR
MCS-1    –         7.5 dB    9.0 dB
MCS-2    –         10.0 dB   11.5 dB
MCS-3    –         14.5 dB   16.5 dB
MCS-4    –         19.5 dB   20.0 dB
MCS-5    14.5 dB   14.0 dB   15.3 dB
MCS-6    17.0 dB   17.0 dB   18.5 dB
MCS-7    23.5 dB   22.5 dB   22.0 dB
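Throughout these tables, the reported figure is the C/I at which the simulated BLER curve crosses 10%. Given sampled (C/I, BLER) points from a run, that crossing can be estimated by interpolating linearly in the log-BLER domain, a common approach sketched below (the sample curve is made up):

```python
import math

def c_over_i_at_bler(points, target=0.10):
    """Interpolate the C/I (dB) where BLER falls to 'target'.
    points: (c_over_i_dB, bler) pairs sorted by increasing C/I;
    BLER is interpolated linearly in the log10 domain."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= target >= y1:
            t = (math.log10(target) - math.log10(y0)) \
                / (math.log10(y1) - math.log10(y0))
            return x0 + t * (x1 - x0)
    raise ValueError("target BLER not bracketed by the given points")

# Made-up BLER curve sampled at 1 dB steps:
curve = [(12, 0.30), (13, 0.15), (14, 0.07), (15, 0.03)]
ci = c_over_i_at_bler(curve)
assert 13.0 < ci < 14.0   # the 10% crossing lies between 13 and 14 dB
```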
The EGPRS physical link layer model was validated for four propagation conditions, all of which assumed the typical urban multipath model. Both the GMSK and 8-PSK modulator/demodulator structures were seen to give the expected results, and the relative differences between the performances of the two receiver structures as compared to the reference values were rather small. There does, however, exist a considerable difference between the performance of the designed model and those of the better-performing models given in [17], particularly for the 8-PSK modulation-coding schemes. As already mentioned, there exists a spread of around 4–5 dB in the figures given for different receiver implementations. There may be various reasons for this, including possible differences in the ways the propagation models are simulated. However, the most likely reason lies in the different implementation strategies of the receivers. There exist several techniques for carrying out channel and noise estimation, and for implementing equalization at the receiver. Some methods can adapt dynamically to differing channel conditions, whereas others perform optimally under certain conditions and not so well under others. The differences in the 8-PSK values are greater than for GMSK, where the technology is fairly stable and consolidated. As a result, although the CCSR model matches the reference performance figures, or comes very close to them, under practically all conditions tested, the results should to a certain extent be regarded as worst-case performance figures. In fact, as already described, the CCSR model used a custom-made EDGE receiver mechanism which, although employing a very effective nonlinear decision-feedback equalizer, cannot be considered an optimal solution. In particular, no automatic frequency-correction mechanisms were implemented in the receiver. The comparison with the reference performance figures is, however, only part of the story.
The 10% BLER figure chosen for measuring performance against reference values was selected on the basis of being around the point where optimal throughput is achieved when operating with block retransmissions [14]. However, real-time services require information integrity without the use of retransmissions, and consequently the error performance requirements are much more stringent. Typically, error ratios in the order of 10^-3 to

Figure 8.11 GMSK EGPRS interference performance TU50 NFH 1800 MHz
Figure 8.12
8-PSK EGPRS interference performance TU50 NFH 1800 MHz
10^-4 are required for video communications. For this reason, the performance of the various modulation-coding schemes at lower BLER values and relatively high C/I values is more critical than it would otherwise be for typical data transfer applications. The equalizer used for the GMSK modulation schemes is a 16-state Viterbi equalizer. Examination of the results above C/I values of around 15–20 dB shows a considerable deviation from the quoted figures. Figures 8.17 and 8.18 show the raw BER sensitivity performance of the GMSK and 8-PSK receivers for
Figure 8.13
GMSK EGPRS interference performance TU1.5 IFH 1800 MHz
Figure 8.14
8-PSK EGPRS interference performance TU1.5 IFH 1800 MHz
the TU50 multipath model at 900 MHz and 1800 MHz. As these figures display raw error rates with no forward error correction, the results are not affected by frequency hopping. For Eb/No values below 18 dB, the equalizer used in the CCSR model outperforms the reference figures, but then levels off to a higher asymptotic BER value. The probable reason for such a deviation is that the equalizer was optimized for operation of the speech channels and low-bit
Figure 8.15
GMSK EGPRS interference performance TU1.5 NFH 1800 MHz
Figure 8.16
8-PSK EGPRS interference performance TU1.5 NFH 1800 MHz
rate data. These typically operate at C/I values below 12 dB. Indeed, differences in bit error ratios below 10^-3 have a negligible effect on speech quality or data throughput using the TCH/9.6 or TCH/4.8 channels. This difference is consequently more conspicuous when using schemes MCS-3 and MCS-4, which for real-time services are the coding schemes that would operate at such interference levels when using GMSK. The major deviations are visible for MCS-3, where for example the BLER values at C/I = 20 dB for TU50 NFH
Figure 8.17 Raw bit error ratio GMSK TU50

Figure 8.18 Raw bit error rate 8-PSK
1800 MHz are 0.05 and 0.015, respectively. A similar trend is visible when examining the sensitivity performance of the 8-PSK receiver in the TU50 multipath propagation conditions. At low Eb/No values, the performance of the equalizer used in these experiments, while inferior to the reference performance figures, is quite close to them. At low noise values, however, this discrepancy increases considerably. There is also a large difference between the performance at 900 MHz and at 1800 MHz, where the asymptotic bit error ratios differ by nearly an order of magnitude. This is caused by the greater susceptibility of 8-PSK to intersymbol interference, and the greater symbol spreading that occurs at 1800 MHz, particularly at high mobile speeds. The residual bit error patterns obtained for both GPRS and EGPRS are nevertheless suitable for use in the audiovisual transmission experiments, as they exhibit relative performance figures between coding schemes that are consistent with the relative strengths of the schemes. Moreover, the obtained results were shown to display a high degree of correlation with the performance results given by several manufacturers.
8.2.6 E/GPRS Radio Interface Data Flow Model

The design of the EGPRS physical link layer model was restricted to examining the effects of varying channel conditions upon bits exiting the channel decoders. In order to carry out more extensive and detailed examinations of the effect of channel errors upon end applications, such as video coding implementations, a GPRS data flow simulator was implemented. The model was implemented in Matlab, as this language provides a rapid development environment and comprehensive data analysis tools. The layers implemented included an MPEG-4 video codec with rate control functionality, RTP/UDP/IP transport layers, and the GPRS SNDC, LLC, and RLC/MAC layer protocols. This layout is shown in Figure 8.19. It must be emphasized that only the data flow properties of the protocols have been implemented in this model. This means that none of the protocol signaling mechanisms have actually been included in the model, but only
Figure 8.19 GPRS data flow model
the resulting effect on header sizes, packet and stream segmentation procedures, and flow control effects. For example, when describing the RTP layer, sequence numbering is not actually implemented; only its effect on the resulting RTP-PDU header size is modeled in the simulator. The application layer consists of a traffic source emulating an MPEG-4 video codec that employs error-resilience functionality as described in [21] and rate control mechanisms that place an upper limit on the output throughput from the encoder, calculated on a frame-by-frame basis. The maximum allowable throughput is set according to the resources allocated across the radio interface. The output from the MPEG-4 codec is forwarded to the transport layers in units of discrete video frames, which will be referred to as video packets. These packets are then split up into transport layer PDUs according to the maximum IP packet size defined by the user. Each packet is then encapsulated into an independent RTP/UDP/IP [22] packet for forwarding down to the GPRS network. Header compression [23], [24], which is a user-definable feature of this simulator model, is implemented at the transport layer, although it is not actually used in the experiments. This means that the compression protocol that is employed in the end terminals must be supported in all the intermediate nodes in the core network, which potentially includes the Internet. This is hardly a realistic scenario, and a more appropriate implementation would be to include such functionality in the SNDC layer [25]. This would allow for headers to be compressed for transmission across the GPRS radio interface, only to be restored to their initial size at the SGSN for forwarding across the remainder of the network. However, as this model is solely concerned with the performance across the Um interface, the location of the compression algorithm has no effect upon any results obtained. 
At the SNDC layer, the 8-bit SNDC header [25] is added to the transport packet before forwarding on to the LLC layer. Here a 24-bit frame header and a 24-bit frame check sequence are added to each LLC-PDU. A check is also carried out to ensure that no LLC frames exceed the maximum size of 1520 octets specified in [26]. The error-control and flow-control functions of the LLC layer are not implemented in this model. The LLC frames are then passed on to the RLC/MAC layer, where they are encapsulated into radio blocks according to the forward error correction scheme selected. In practice, the choice of coding scheme depends upon the carrier-to-interference ratio at the terminal and the resulting throughput that can be sustained at that C/I level using the different channel coding schemes. The major side effect of varying the protection afforded to the user data is the modification of the size of the GPRS radio blocks in terms of the number of information bits per block. The model therefore segments the incoming PDU into data payloads for the output radio blocks according to the selected channel coding scheme, and forwards these blocks to a FIFO buffer. Once in the output buffer, the blocks wait for one of the timeslots allocated to their associated terminal to become available and are then transmitted over the given timeslots. The model allows any number of timeslots from one to eight to be allocated to the source terminal. This layered model design provides for error occurrences at the physical layer to be mapped onto the actual application-layer payload. Each channel error can therefore be mapped onto an individual video information bit, or onto the header or checksum section of any protocol in the GPRS stack.
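The encapsulation chain just described can be sketched as follows. The header sizes follow the text (40 octets of uncompressed RTP/UDP/IP headers, an 8-bit SNDC header, a 24-bit LLC header and a 24-bit FCS); the per-block RLC payload sizes are approximate illustrative values, not exact GPRS figures:

```python
import math

# Per-layer overheads from the text; RLC payloads per radio block are
# approximate octet counts assumed for each coding scheme (CS-1..CS-4).
RTP_UDP_IP_HDR = 12 + 8 + 20   # octets, uncompressed headers
SNDC_HDR = 1                   # 8-bit SNDC header
LLC_HDR, LLC_FCS = 3, 3        # 24-bit frame header + 24-bit FCS
RLC_PAYLOAD = {"CS-1": 20, "CS-2": 30, "CS-3": 36, "CS-4": 50}

def radio_blocks_for_packet(payload_octets, coding_scheme):
    """Number of RLC/MAC radio blocks needed to carry one transport
    packet after RTP/UDP/IP, SNDC, and LLC encapsulation."""
    llc_frame = payload_octets + RTP_UDP_IP_HDR + SNDC_HDR + LLC_HDR + LLC_FCS
    assert llc_frame <= 1520   # maximum LLC frame size [26]
    return math.ceil(llc_frame / RLC_PAYLOAD[coding_scheme])

# A 500-octet video packet: stronger coding (CS-1) needs far more blocks.
assert radio_blocks_for_packet(500, "CS-1") > radio_blocks_for_packet(500, "CS-4")
```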
8.2.7 Real-time GERAN Emulator

The GPRS and EGPRS physical link layer simulation model is extremely computationally intensive, largely due to the modeling of the multipath propagation model, where a Rayleigh fading filter is used to represent each pathway. For example, a simulation representing data
encoded using the CS-1 scheme at TU1.5 IFH 1800 MHz at a carrier-to-interference ratio of 15 dB, run on a 296 MHz UltraSPARC processor, runs at an average rate of 138 information bits per second. Although the exact processing speed depends upon several factors, including the propagation conditions modeled, the C/I ratio present, and the modulation-coding scheme used, the obtained rates are far below those necessary to support a real-time simulation environment. In order to create a real-time testing environment for video communications applications, a real-time emulator was built using Visual C++ for Microsoft Windows. The emulator implemented the data flow model and allowed for up to eight-slot allocation. The emulator program made use of a table look-up method to allow for real-time emulation. Data sets of bit error patterns at the physical link layer were created with the E/GPRS simulator for a wide range of interference and propagation conditions for each coding scheme. These were then used by the real-time emulator and fed into the GPRS radio interface data flow model described in Figure 8.19. Multi-threaded programming techniques were used to build the model, and a graphical user interface (Figure 8.20) was designed to allow for interactive manipulation of the coding scheme, interference level, carrier frequency, timeslot allocation, and frequency-hopping capability. The emulator was used in conjunction with a real-time MPEG-4 video encoder/decoder application using RTP, as shown in Figure 8.21.
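The table look-up mechanism can be sketched as follows: instead of running the fading and receiver chain, a stored bit-error pattern (keyed here by a hypothetical (coding scheme, C/I) pair) is simply XORed onto the transmitted payload in real time:

```python
# Sketch of table look-up real-time emulation: stored error patterns
# replace the costly physical-layer simulation. The pattern store and
# its (scheme, C/I) key are illustrative, not from the actual emulator.
ERROR_PATTERNS = {
    ("CS-1", 15): bytes([0x00, 0x02, 0x00, 0x10]),  # sample pattern
}

def apply_error_pattern(payload: bytes, scheme: str, c_over_i: int) -> bytes:
    pattern = ERROR_PATTERNS[(scheme, c_over_i)]
    # XOR each payload byte with the (cyclically repeated) error
    # pattern: a 1-bit in the pattern flips the corresponding bit.
    return bytes(b ^ pattern[i % len(pattern)]
                 for i, b in enumerate(payload))

sent = bytes([0xFF, 0xFF, 0xFF, 0xFF])
received = apply_error_pattern(sent, "CS-1", 15)
assert received == bytes([0xFF, 0xFD, 0xFF, 0xEF])  # two bit errors
```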
8.2.8 Conclusion

The design and validation of the EGPRS and GPRS physical link layer simulation model has been described. The GPRS models were seen to give a performance that closely matches the GSM reference performance figures. Although the performance of the EGPRS model was seen to match the figures given by different terminal manufacturers when operating at low terminal
Figure 8.20 GPRS emulator graphical user interface
Figure 8.21 GPRS radio access emulator structure
velocities, a significant divergence from the reference figures was obtained using the TU50 propagation models at 1800 MHz. This may be attributable to the increased Doppler spread at high terminal velocities and carrier frequencies, and corresponding limitations in the equalizer and receiver architecture in dealing with the resulting intersymbol interference. However, at lower terminal velocity figures and carrier frequencies, the simulator model closely matched the reference performance figures, and may therefore be considered suitable for use in the media transmission experiments.
8.3 UMTS Channel Simulator

This section presents the design and implementation procedure for a multimedia evaluation testbed of the UMTS forward link. A WCDMA physical link layer simulator has been implemented using the Signal Processing WorkSystem (SPW) software simulation tools developed by Cadence Design Systems Inc. [27]. The model has been developed in a generic manner that includes all the forward link radio configurations, channel structures, channel coding/decoding, spreading/de-spreading, modulation parameters, transmission modeling, and their corresponding data rates according to the UMTS specifications. The performance of the simulator model is validated by comparison with figures presented in the relevant 3GPP documents [28], [29] for the measurement channels, propagation environments, and interference conditions specified in [30]. Using the developed simulator, a set of UMTS error pattern files is generated for different radio bearer configurations in different operating environments, and the results are presented. Furthermore, UMTS link-level performance is enhanced by implementing a closed-loop power-control algorithm. A UMTS radio interface protocol model, which represents the data flow across the UMTS protocol layers, is implemented in Visual C++. It is integrated with the physical link layer model to emulate the actual radio interface experienced by users. This allows for interactive testing of the effects of different parameter settings of the UMTS Terrestrial Radio Access Network (UTRAN) upon the received multimedia quality.
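The closed-loop power control mentioned above can be sketched in its usual inner-loop form: per slot, the measured SIR is compared with a target and the transmit power is stepped up or down accordingly (the target and step values here are illustrative, not taken from the text):

```python
def inner_loop_power_control(sir_estimates_dB, sir_target_dB=6.0,
                             step_dB=1.0, initial_power_dB=0.0):
    """Per-slot closed-loop power control sketch: command the
    transmitter up when measured SIR is below target, down otherwise.
    All numeric values are illustrative assumptions."""
    power = initial_power_dB
    trace = []
    for sir in sir_estimates_dB:
        power += step_dB if sir < sir_target_dB else -step_dB
        trace.append(power)
    return trace

# SIR below target -> power ramps up; above target -> power backs off.
trace = inner_loop_power_control([3.0, 4.5, 7.0, 6.5])
assert trace == [1.0, 2.0, 1.0, 0.0]
```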
8.3.1 UMTS Terrestrial Radio Access Network (UTRAN)

The system components of UTRAN are shown in Figure 8.22. Functionally, the network elements are grouped into the radio network subsystem (RNS), the core network (CN), and the
Figure 8.22 System components in a UMTS
user equipment (UE). UTRAN consists of a set of RNSs connected to the core network through the Iu interface. The interface between the UE and the RNS is named Uu. An RNS contains a radio network controller (RNC) and one or more node Bs. The RNS handles all radio-related functionality in its allocated region. A node B is connected to an RNC through the Iub interface, and communication between RNSs is conducted through the Iur interface. One or more cells are allocated to each node B. The protocol within the CN is adopted from the evolution of the GPRS protocol design. However, both the UE and UTRAN feature completely new protocol designs, which are based on the new WCDMA radio technology. WCDMA air interfaces have two versions, defined for operation in frequency division duplexing (FDD) and time division duplexing (TDD) modes. Only the FDD operation is investigated in this chapter. The modulation chip rate for WCDMA is 3.84 mega chips per second (Mcps). The specified pulse-shaping roll-off factor is 0.22. This leads to a carrier bandwidth of approximately 5 MHz. The nominal channel spacing is 5 MHz. However, this can be adjusted approximately between 4.4 and 5 MHz, to optimize performance depending on interference between carriers in a particular operating environment. As described in Table 8.13, the FDD version is designed to operate in either of the following frequency bands [30]:
. 1920–1980 MHz for uplink and 2110–2170 MHz for downlink.
. 1850–1910 MHz for uplink and 1930–1990 MHz for downlink.
Table 8.13 WCDMA air interface parameters for FDD-mode operation

Operating frequency band        2110–2170 MHz / 1930–1990 MHz downlink;
                                1920–1980 MHz / 1850–1910 MHz uplink
Duplexing mode                  Frequency division duplexing (FDD)
Chip rate                       3.84 mega chips per second
Pulse-shaping roll-off factor   0.22
Carrier bandwidth               5 MHz
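A small sketch of the chip budget implied by these parameters, assuming the standard relationship symbol rate = chip rate / spreading factor and the fixed 10 ms, 38 400-chip radio frame described in the text:

```python
CHIP_RATE = 3.84e6        # chips per second (Table 8.13)
FRAME_DURATION = 0.010    # seconds; fixed 10 ms radio frame
CHIPS_PER_FRAME = int(CHIP_RATE * FRAME_DURATION)  # fixed at 38 400

def symbols_per_frame(spreading_factor):
    # Each channel symbol occupies 'spreading_factor' chips, so a
    # lower SF yields more symbols (a higher data rate) per frame.
    return CHIPS_PER_FRAME // spreading_factor

assert CHIPS_PER_FRAME == 38_400
# Halving the spreading factor doubles the symbol count per frame.
assert symbols_per_frame(128) == 300
assert symbols_per_frame(64) == 600
```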
All radio channels are code-division multiplexed and are transmitted over the same (entire) frequency band. WCDMA supports highly variable user data rates through the use of variable spreading factors, thus facilitating the bandwidth-on-demand concept. Transmission data rates of up to 384 kbps are supported in wide-area coverage, and 2 Mbps in local-area coverage. The radio frame length is fixed at 10 ms. The number of information bits or symbols transmitted in a radio frame may vary, corresponding to the spreading factor used for the transmission, while the number of chips in a radio frame is fixed at 38 400 [30].

8.3.1.1 Radio Interface Protocol

The radio interface protocol architecture, which is visible in the UTRAN and the user equipment (UE), is shown in Figure 8.23. Layer 1 (L1) comprises the WCDMA physical
Figure 8.23 Radio-interface protocol architecture
layer. Layer 2 (L2), which is the data-link layer, is further split into medium access control (MAC), radio link control (RLC), packet data convergence protocol (PDCP), and broadcast multicast control (BMC). The PDCP exists mainly to adapt packet-switched connections to the radio environment by compressing headers with negotiable algorithms. Adaptation of broadcast and multicast services to the radio interface is handled by BMC. For circuit-switched connections, user-plane radio bearers are directly connected to the RLC. Every radio bearer should be connected to one unique instance of the RLC. The radio resource control (RRC) is the principal component of the network layer – layer 3 (L3). This comprises functions such as broadcasting of system information, radio resource handling, handover management, admission control, and provision of requested QoS for a given application. Unlike the traditional layered protocol architecture, where protocol layer interaction is only allowed between adjacent layers, RRC interfaces with all other protocols, providing fast local interlayer controls. These interfaces allow the RRC to configure characteristics of the lower-layer protocol entities, including parameters for the physical, transport, and logical channels [31]. Furthermore, the same control interfaces are used by the RRC layer to control the measurements performed by the lower layers, and by the lower layers to report measurement results and errors to the RRC. See Figure 8.24. UTRAN supports both circuit-switched and packet-switched connections. In order to transmit an application’s data between UE and the end system, QoS-enabled bearers have to be established between the UE and the media gateway (MGW). Figure 8.25 shows the user plane protocol stack used for data transmission over packet-switched connection in Release 4.
Figure 8.24 Interactions between RRC and lower layers [32]. Reproduced, with permission, from 3GPP TS 25.301, "Radio interface protocol architecture", Release 4, V4.4.0 (2002-09). © 2002 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/
Figure 8.25 User plane UMTS protocol stack for packet-switched connection [6]. Reproduced, with permission, from "Technical specification, 3rd Generation Partnership Project; technical specification group services and systems aspects; general packet radio service (GPRS); service description stage 2 (release 4)", 3GPP TS 23.060 V4.0.0, March 2001. © 2001 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/
8.3.1.2 Channel Structure

Channels are used as a means of interfacing the L2 and L1 sub-layers. Between the RLC/MAC layer and the network layer, logical channels are used. Between the RLC/MAC and PHY layers, transport channels are used, and below the PHY layer is the physical channel (see Figure 8.23). Generally, logical channels can be divided into control and traffic channels. The paging control channel and the broadcast control channel are for the downlink only. The common control channel is a bi-directional channel shared by all UEs, while the common transport channel is a downlink-only shared channel. Dedicated control channels and dedicated transport channels are unique to each UE. Transport channels are used to transfer the data generated at a higher layer to the physical layer, where it is transmitted over the air interface. The transport channels are described by a set of transport channel parameters, which are designed to characterize the data transfer over the radio interface. Each transport channel is accompanied by a transport format indicator (TFI), which describes the format of the data to be expected from the higher layer in each time interval. The physical layer combines the TFIs from multiple transport channels to form a transport format combination indicator (TFCI). This facilitates the combination of several transport channels into a composite transport channel at the physical layer, as shown in Figure 8.26, and their correct recovery at the receiver [31]. In UTRA, two types of transport channel exist, namely the dedicated channel and the common channel. As the names suggest, the main difference between them is that a common channel has its resources divided between all or a group of users in a cell, while a dedicated channel reserves resources for a single user.
Wireless Channel Models

Figure 8.26 Transport channel mapping
The transmission time interval (TTI) defines the arrival period of data from higher layers to the physical layer. TTI size has been defined to be 10, 20, 40, and 80 ms. Selection of TTI size depends on the traffic characteristics. The amount of data that arrives in each TTI can vary in size, and is indicated in the transport format indicator (TFI). In the case of transport channel multiplexing, TTIs for different transport channels are time-aligned, as shown in Figure 8.27. The physical channels are defined by a specific set of radio interface parameters, such as scrambling code, spreading code, carrier frequency, and transmission power step. The channels are used to convey the actual data through the wireless link. The most important control information in a cell is carried by the primary common control physical channel (PCCPCH) and secondary common control physical channel (SCCPCH). The difference between these two is that the PCCPCH is always broadcast over the whole cell in a well-defined format, while
Figure 8.27 Transmission time intervals (TTIs) in transport channel multiplexing
the SCCPCH can be more flexible in terms of transmission diversity and format. In the uplink, the physical random access channel (PRACH) and physical common packet channel (PCPCH) are data channels shared by many users. The slotted ALOHA approach is used to grant user access in the PRACH [33]. A number of small preambles precede the actual data, serving for power control and collision detection. The physical downlink shared channel (PDSCH) is shared by many users in downlink transmission. One PDSCH is allocated to a single UE within a radio frame, but if multiple PDSCHs exist they can be allocated to different UEs arbitrarily: one to many or many to one. The dedicated physical data channel (DPDCH) and dedicated physical control channel (DPCCH) together realize the dedicated channel (DCH), which is dedicated to a single user [31].

8.3.1.3 Modes of Connection

Figure 8.28 shows the possible modes of realizing the connections of the radio bearers at each layer. PDCP, RLC, and MAC modes must be combined with physical-layer parameters in a way that satisfies the different QoS demands on the radio bearers. However, the exact parameter setting is a choice of the implementer of the UMTS system and of the network operator. The radio bearer can be viewed as either packet switched (PS) or circuit switched (CS). A PS connection passes the PDCP, where header compression may or may not be applied. The RLC offers three modes of data transfer. The transparent mode transmits higher-layer payload data units (PDUs) without adding any protocol information and is recommended for real-time conversational applications. The unacknowledged mode does not guarantee delivery to the peer entity, but offers other services such as detection of erroneous data. The acknowledged mode guarantees delivery through the use of automatic repeat request (ARQ) [34].
Figure 8.28 Interlayer modes of operation
The MAC layer can be operated in dedicated, shared, or broadcast mode. Dedicated mode is responsible for handling dedicated channels allocated to a UE in connected mode, while shared mode takes the responsibility of handling shared channels. The broadcast channels are transmitted using broadcast mode. The physical layer follows the MAC in choosing a dedicated or shared physical channel [35]. In UTRA, spreading is based on the orthogonal variable spreading factor (OVSF) technique. Quadrature phase shift keying (QPSK) modulation is used for downlink transmission. Both convolutional and turbo coding are supported for channel protection. The maximum possible transmission rate in downlink is 5760 kbps. It is provided by three parallel codes with a spreading factor of 4. With 1/2 rate channel coding, this could accommodate up to 2.3 Mbps user data. However, the practical maximum user data rate is subject to the amount of interference present in the system and the quality requirement of the application.
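The downlink rate arithmetic quoted above can be reproduced with a short sketch. This is illustrative only: the helper name is ours, the model is simply chip rate divided by spreading factor times two bits per QPSK symbol, and physical-layer overhead (pilot, TPC, TFCI) is ignored, which is why the practical user rate falls from 2880 kbps toward the 2.3 Mbps mentioned in the text.

```python
# Back-of-the-envelope check of the UTRA downlink peak rate quoted above.
CHIP_RATE = 3_840_000          # chips per second
BITS_PER_SYMBOL = 2            # QPSK

def channel_bit_rate(spreading_factor: int, codes: int = 1) -> int:
    """Physical channel bit rate in bps for a given spreading factor."""
    symbol_rate = CHIP_RATE // spreading_factor
    return symbol_rate * BITS_PER_SYMBOL * codes

peak = channel_bit_rate(4, codes=3)      # three parallel SF-4 codes
print(peak // 1000, "kbps")              # 5760 kbps, as stated in the text
print(peak // 2 // 1000, "kbps")         # 2880 kbps with 1/2 rate coding,
                                         # before physical-layer overhead
```

The same function reproduces the per-code channel bit rates of Table 8.16 (e.g. SF 4 gives 1920 kbps, SF 128 gives 60 kbps).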
8.3.2 UMTS Physical Link Layer Model Description

The physical link layer parameters and the functionality of the downlink for the FDD mode of the UMTS radio access scheme are described in this subsection. The main issues addressed are transport/physical channel structures, channel coding, spreading, modulation, transmission modeling, and channel decoding. Only the dedicated channels are considered, as the end application is real-time multimedia transmission for dedicated users. The implementation closely follows the relevant 3GPP specifications. A closed-loop fast power control method is also implemented. The developed model simulates the UMTS air interface. Figure 8.29 is a block diagram of the simulated physical link layer. It can be seen that the transmitted signal is subjected to a multipath fast-fading environment, for which the power-delay profiles are specified in [36]. In addition, an AWGN source is introduced after the multipath propagation model. Co-channel interferers are not explicitly modeled, because the loss of orthogonality of co-channels due to multipath propagation can be quantified using a parameter called the “orthogonality factor” [31], which indicates the fraction of intra-cell interfering power that is perceived by the receiver as Gaussian noise. The multipath-induced intersymbol interference is implicit in the developed chip-level simulator. By changing the variance of the AWGN source, the bit error and block error characteristics can be determined for a range of carrier-to-interference (C/I) and signal-to-noise (S/N) ratios for different physical layer configurations. The simulator only considers a static C/I and S/N profile, and no slow fading effects are implemented. However, slow fading can easily be implemented by concatenating the data sets describing the channel bit error characteristics of different static C/I levels. Each radio access bearer (RAB) is normally accompanied by a signaling radio bearer (SRB) [37].
Therefore, in the simulator, two dedicated transport channels are multiplexed and mapped on to a physical channel.

8.3.2.1 Channel Coding

UTRA employs four channel-coding schemes, offering flexibility in the degree of protection, coding complexity, and traffic capacity available to the user. The available channel-coding
Figure 8.29 UMTS physical link layer model
methods and code rates for dedicated channels are 1/2 rate convolutional code, 1/3 rate convolutional code, 1/3 rate turbo code, and no coding. 1/2 rate and 1/3 rate convolutional coding is intended for use with low data rates, equivalent to the data rates provided by second-generation cellular networks [31]. For high data rates, 1/3 rate turbo coding is recommended; it typically brings performance benefits when sufficiently large input block sizes are achieved. The channel-coding schemes are defined in [38], and are outlined here.

Convolutional Coding

Convolutional codes with constraint length 9 and coding rates 1/3 and 1/2 are defined. The channel code block size is varied according to the data bit rate. The specified maximum code block size for convolutional coding is 504. If the number of bits in a transmission time interval (TTI) exceeds the maximum code block size, then code block segmentation is performed (Figure 8.30). In order to achieve similar-size code blocks after segmentation, filler bits are added to the beginning of the first block.
Figure 8.30 Example of block segmentation at the channel encoder
Eight tail bits with binary value 0 are added to the end of the code block before encoding, and the initial values of the shift register are set to 0s when the encoding is started. The generator polynomials used in the encoding, as given in [38], are:

Rate 1/2 convolutional coder:

G0 = 1 + D^2 + D^3 + D^4 + D^8
G1 = 1 + D + D^2 + D^3 + D^5 + D^7 + D^8          (8.4)

Rate 1/3 convolutional coder:

G0 = 1 + D^2 + D^3 + D^5 + D^6 + D^7 + D^8
G1 = 1 + D + D^3 + D^4 + D^7 + D^8
G2 = 1 + D + D^2 + D^5 + D^8                      (8.5)
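As an illustrative sketch (not taken from the book's simulator), the rate-1/2 code of Equation (8.4) can be implemented with a simple shift register; the function name is hypothetical, and eight zero tail bits flush the register so that Y = 2K + 16 output bits result for K input bits.

```python
# Sketch of the UTRA rate-1/2 convolutional encoder (constraint length 9)
# using the generator polynomials of Equation (8.4).
G0 = [0, 2, 3, 4, 8]           # exponents of D in G0
G1 = [0, 1, 2, 3, 5, 7, 8]     # exponents of D in G1

def conv_encode_half_rate(bits):
    reg = [0] * 8              # shift register, reg[i] holds the D^(i+1) term
    out = []
    for b in list(bits) + [0] * 8:          # append 8 zero tail bits
        window = [b] + reg                  # window[d] corresponds to D^d
        out.append(sum(window[d] for d in G0) % 2)
        out.append(sum(window[d] for d in G1) % 2)
        reg = [b] + reg[:-1]                # shift in the new bit
    return out

coded = conv_encode_half_rate([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
print(len(coded))              # 2*10 + 16 = 36, matching Y = 2K + 16
```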
Note that UTRA uses two different sets of generator polynomials to achieve the two convolutional code rates. If Ki denotes the number of bits in the ith code block before encoding, then the number of bits after encoding, Yi, is:

Yi = 2Ki + 16 with 1/2 rate coding
Yi = 3Ki + 24 with 1/3 rate coding

Turbo Coding

Turbo codes employ two or more error control codes, arranged in such a way as to enhance the coding gain. They have been demonstrated to closely approach the Shannon capacity limit on both AWGN and Rayleigh fading channels. Traditionally, two parallel or serial concatenated recursive convolutional codes are used in the encoder implementation, with a bit interleaver between the encoders. The parity bitstreams generated by the two encoders are finally multiplexed to produce the output turbo-coded bitstream. Turbo decoding is carried out iteratively. The whole process results in a code with powerful error-correction properties. The turbo coder defined for use in UMTS is a parallel-concatenated convolutional code with two eight-state constituent encoders and one turbo-code internal interleaver. The coding rate of the turbo coder is 1/3. Figure 8.31 shows the configuration of the turbo coder.
Figure 8.31 Structure of rate 1/3 turbo coder [38]. Reproduced, with permission, from “3rd Generation Partnership Project; technical specification group radio access network; multiplexing and channel coding (FDD) (release 4)”, 3GPP TS 25.212 V4.6.0 (2002–09). 2002 3GPP. 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/
The transfer function of the eight-state constituent code is defined as:

G(D) = [1, g1(D)/g0(D)]          (8.6)

where:

g0(D) = 1 + D^2 + D^3
g1(D) = 1 + D + D^3              (8.7)
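A minimal sketch of one constituent recursive systematic encoder implementing Equations (8.6)–(8.7) is shown below (illustrative only; the function name is ours, and the trellis termination via switches A/B in Figure 8.31 is omitted for brevity). The feedback is taken per g0 and the parity per g1.

```python
# One eight-state RSC constituent encoder: feedback g0 = 1 + D^2 + D^3,
# parity g1 = 1 + D + D^3. Registers start at zero.
def rsc_encode(bits):
    s1 = s2 = s3 = 0                   # D^1, D^2, D^3 register contents
    systematic, parity = [], []
    for b in bits:
        a = b ^ s2 ^ s3                # feedback per g0 = 1 + D^2 + D^3
        z = a ^ s1 ^ s3                # parity   per g1 = 1 + D + D^3
        systematic.append(b)
        parity.append(z)
        s1, s2, s3 = a, s1, s2         # shift the register
    return systematic, parity

x, z = rsc_encode([1, 0, 0, 0])
print(x, z)                            # systematic [1,0,0,0], parity [1,1,1,1]
```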
The initial values of the shift registers are set to 0s at the start of the encoding. Output from the turbo coder is read as x1, z1, z′1, x2, z2, z′2, and so on. Termination of the turbo coder defined in UMTS is performed in a different way to conventional turbo code termination, which feeds zero input bits to generate the trellis termination bits. Here, the shift register feedback values after all information bits have been encoded are used to generate the termination bits. To terminate the first constituent encoder, switch A in Figure 8.31 is set to the lower position while the second constituent encoder is disabled. Likewise, the second constituent encoder is terminated by setting switch B in Figure 8.31 to the lower position while the first constituent encoder is disabled. The turbo code internal interleaver arranges the incoming bits into a matrix. If the number of incoming bits is less than the number of bits that the matrix can contain, padding bits are used. Then intra-row and inter-row permutations are performed according to the algorithm given in [38]. Pruning is performed at the output, so the output block size is guaranteed to be equal to
the input block size. If Ki denotes the number of input bits in a code block, the number of turbo code output bits, Yi, is Yi = 3Ki + 12 for the 1/3 code rate. The minimum and maximum block sizes for turbo coding are defined as 40 bits and 5114 bits respectively. Data sizes below 40 bits can be coded with turbo codes; however, in such a case, dummy bits are used to fill the 40 bit minimum-size interleaver. If the incoming block size exceeds the maximum size, then segmentation is performed before channel coding.

8.3.2.2 Rate Matching

Rate matching is used to match the number of incoming data bits to the number of bits available on the radio frame. Rate matching is achieved either by bit puncturing or by repetition. If the amount of incoming data is larger than the number of bits that can be accommodated in a single frame, then bit puncturing is performed; otherwise, bit repetition is performed. In the case of transport channel multiplexing, rate matching must take into account the number of bits arriving on the other transport channels. The rate matching algorithm depends on the channel coding applied; the corresponding rate matching algorithms for convolutional and turbo coding are defined in [38]. In the simulation under discussion, rate matching is only performed for the signaling bearer. As signaling data is protected using convolutional codes, the rate matching algorithm is implemented only for convolutional coding.

8.3.2.3 Interleaving

In UTRA, data interleaving is performed in two steps: first and second interleaving. These are also known as inter-frame interleaving and intra-frame interleaving, respectively. The first interleaving is a block interleaver with inter-column permutations (inter-frame permutation) and is used when the delay budget allows more than 10 ms of interleaving; in other words, when the specified transmission time interval (TTI), which indicates how often data comes from the higher layers to the physical layer, is larger than 10 ms.
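The puncture-or-repeat principle of rate matching described above can be sketched as follows. This is a deliberately simplified stand-in for the exact bit-selection algorithm of TS 25.212, with an illustrative function name: it just selects evenly spaced source positions, which punctures when the frame is smaller than the coded block and repeats bits when it is larger.

```python
# Rate matching by even puncturing or repetition (simplified illustration).
def rate_match(bits, n_out):
    n_in = len(bits)
    if n_out == n_in:
        return list(bits)
    # n_out evenly spaced source positions; indices repeat when n_out > n_in
    return [bits[i * n_in // n_out] for i in range(n_out)]

coded = list(range(12))
print(rate_match(coded, 8))    # puncturing: 8 of 12 bits survive
print(rate_match(coded, 16))   # repetition: some bits are sent twice
```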
The TTI is directly related to the interleaving period and can take values of 10, 20, 40 or 80 ms. Table 8.14 shows the inter-column permutation patterns for first interleaving. Each column contains data bits for a 10 ms duration.

Table 8.14 Inter-column permutation patterns for first interleaving [38]. Reproduced, with permission, from “3rd Generation Partnership Project; technical specification group radio access network; multiplexing and channel coding (FDD) (release 4)”, 3GPP TS 25.212 V4.3.0 (2002–06). 2002 3GPP. 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

TTI      Number of columns   Inter-column permutation pattern
10 ms    1                   <0>
20 ms    2                   <0, 1>
40 ms    4                   <0, 2, 1, 3>
80 ms    8                   <0, 4, 2, 6, 1, 5, 3, 7>

The second or intra-frame interleaving performs data interleaving within a 10 ms radio frame. This is also a block interleaver with inter-column permutations applied. Incoming data
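The first interleaver described above can be sketched as a simple write-rows/permute-columns/read-columns operation; the permutation patterns are those defined in TS 25.212 (one column per 10 ms radio frame), and the function name is illustrative.

```python
# First (inter-frame) block interleaving with the TS 25.212 column patterns.
PATTERNS = {1: [0], 2: [0, 1], 4: [0, 2, 1, 3], 8: [0, 4, 2, 6, 1, 5, 3, 7]}

def first_interleave(bits, tti_ms):
    cols = tti_ms // 10
    rows = len(bits) // cols           # TTI block size is a multiple of cols
    perm = PATTERNS[cols]
    matrix = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
    # write row by row, read column by column in permuted column order
    return [matrix[r][perm[c]] for c in range(cols) for r in range(rows)]

print(first_interleave(list(range(8)), 40))   # -> [0, 4, 2, 6, 1, 5, 3, 7]
```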
Table 8.15 Inter-column permutation patterns for second interleaving [38]. Reproduced, with permission, from “3rd Generation Partnership Project; technical specification group radio access network; multiplexing and channel coding (FDD) (release 4)”, 3GPP TS 25.212 V4.3.0 (2002–06). 2002 3GPP. 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

Number of columns   Inter-column permutation pattern
30                  <0, 20, 10, 5, 15, 25, 3, 13, 23, 8, 18, 28, 1, 11, 21, 6, 16, 26, 4, 14, 24, 19, 9, 29, 12, 2, 7, 22, 27, 17>
bits are input into a matrix with n rows and 30 columns, row by row, starting at column 0 and row 0. The number of rows is the minimum integer n that satisfies the condition:

total number of bits in radio block ≤ n × 30

If the total number of bits in the radio block is less than n × 30, bit padding is performed to fill the whole matrix. The inter-column permutation for the matrix is performed based on the pattern shown in Table 8.15. Output is read out from the matrix column by column, and finally pruning is performed to remove the padding bits that were added to the input of the matrix before the inter-column permutation.

8.3.2.4 Spreading and Scrambling

Spreading in the downlink is based on channelization codes, and is used to preserve the orthogonality among different downlink physical channels within one cell (or sector of a cell), and to spread the data to the chip rate, which is 3.84 Mcps. In UTRA, spreading is based on the orthogonal variable spreading factor (OVSF) technique. The OVSF code tree is illustrated in Figure 8.32. Typically only one OVSF code tree is used per cell sector in the base station (or Node B). The common channels and dedicated channels share the same code tree resources. The codes are normally picked from the code tree; however, there are certain restrictions as to which codes can be used for a downlink transmission. A physical channel can only use a code from the tree if no other physical channel is using a code on an underlying branch; nor can a smaller-spreading-factor code on the path to the root of the tree be used. This is because, even though all codes at the same level are orthogonal to each other, two codes from different levels are orthogonal only if one of them is not the mother code of the other. The radio network controller manages the downlink orthogonal codes within each base station.
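The (c, c)/(c, −c) construction of the OVSF tree in Figure 8.32 can be sketched in a few lines (illustrative code, not from the simulator). Each tree level doubles the spreading factor, and all codes at one level are mutually orthogonal, which is the property the allocation restrictions above protect.

```python
# OVSF channelization codes via the (c, c) / (c, -c) tree construction.
def ovsf_codes(sf):
    codes = [[1]]
    while len(codes[0]) < sf:
        # grouping (c,c) codes before (c,-c) codes differs from the tree's
        # index order but does not affect orthogonality
        codes = [c + c for c in codes] + [c + [-x for x in c] for c in codes]
    return codes

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

sf8 = ovsf_codes(8)
print(all(dot(sf8[i], sf8[j]) == 0
          for i in range(8) for j in range(8) if i != j))   # -> True
```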
The spreading factor on the downlink may vary from 4 to 512 (an integer power of 2), depending on the data rate of the channel. Table 8.16 summarizes the channel bit rates, data rates, and spreading factors for downlink dedicated physical channels. In addition to spreading, a scrambling operation is performed in the transmitter. This is used to separate base stations (cell sectors) from one another. As the chip rate is already achieved
Figure 8.32 Example of OVSF code tree used for downlink
with spreading, the symbol rate is not affected by scrambling. The downlink scrambling uses Gold codes [39]. The number of primary scrambling codes is limited to 512, simplifying the cell search procedure. The secondary scrambling codes are used in the case of beam-steering and adaptive antenna techniques [40].

8.3.2.5 Modulation

Quadrature phase shift keying (QPSK) modulation is applied to the time-multiplexed control and data streams on the downlink. Each pair of consecutive symbols is serial-to-parallel converted and mapped on to the I and Q branches. The symbols on the I and Q branches are then spread to the chip rate by the same real-valued channelization code. The spread signal is then scrambled by a cell-specific complex-valued scrambling code.

Table 8.16 Downlink dedicated channel bit rates

Spreading factor   Channel bit rate (kbps)   User data rate with 1/2 rate coding (approx.)
4                  1920                      936 kbps
8                  960                       456 kbps
16                 480                       215 kbps
32                 240                       105 kbps
64                 120                       45 kbps
128                60                        20–24 kbps
256                30                        6–12 kbps
512                15                        1–3 kbps
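The spread-then-scramble chain of the modulation section can be sketched as follows. This is illustrative only: the mapping convention and function names are ours, and a toy unit-magnitude sequence stands in for the real Gold scrambling code.

```python
# QPSK mapping, OVSF spreading, and complex scrambling (toy sequence).
def qpsk_map(bits):
    # serial-to-parallel: even bits -> I, odd bits -> Q; 0 -> +1, 1 -> -1
    to_lvl = lambda b: 1 - 2 * b
    return [complex(to_lvl(bits[i]), to_lvl(bits[i + 1]))
            for i in range(0, len(bits), 2)]

def spread(symbols, code):
    # each symbol becomes len(code) chips on both I and Q branches
    return [s * c for s in symbols for c in code]

ovsf = [1, 1, -1, -1]                                  # SF-4 channelization code
scramble = [complex(1, 1) / abs(complex(1, 1))] * 8    # placeholder Gold code
chips = [ch * sc for ch, sc in zip(spread(qpsk_map([0, 1, 1, 0]), ovsf),
                                   scramble)]
print(len(chips))                                      # 2 symbols x SF 4 = 8 chips
```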
Figure 8.33 Downlink modulation [39]. Reproduced, with permission, from “3rd Generation Partnership Project; technical specification group radio access network; spreading and modulation (FDD) (release 4)”, 3GPP TS 25.213 V4.3.0. (2002–06). 2002 3GPP. 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/
Figure 8.33 shows the spreading and modulation procedure for a downlink physical channel. A square-root raised cosine filter with a roll-off factor of 0.22 is employed for pulse shaping, and the pulse-shaped signal is subsequently up-converted and transmitted.

8.3.2.6 Physical Channel Mapping

The frame structure for a downlink dedicated physical channel is shown in Figure 8.34. Each radio frame has 15 equal-length slots; the slot length is 2560 chips. As shown in Figure 8.34, the DPCCH and DPDCH are time-multiplexed within the same slot [41]. Each slot consists of pilot symbols, transmit power control (TPC) bits, transport format combination indicator (TFCI) bits, and bearer data. The number of information bits transmitted in a single slot depends on the source data rate, the channel coding used, the spreading factor, and the channel symbol rate. The exact number of bits in the downlink DPCH fields is given in [41] and is summarized in Table 8.17.

8.3.2.7 Propagation Model

The channel model used in the simulator is the multipath propagation model specified by IMT-2000 in [36]. This model takes into account that the mobile radio environment is
Figure 8.34 Frame structure for downlink DPCH
Table 8.17 DPDCH and DPCCH fields (3GPP TS 25.211). Reproduced, with permission, from “3rd Generation Partnership Project; technical specification group radio access network; physical channels and mapping of transport channel on to physical channel (FDD) (release 4)”, 3GPP TS 25.211 V4.6.0 (2002–09). 2002 3GPP. 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

Spreading factor   DPDCH (bits/slot)        DPCCH (bits/slot)
                   Ndata1      Ndata2       NTPC    NTFCI    NPilot
512                0           2            2       2        4
256                2           6            2       2        8
128                6           22           2       2        8
64                 12          48           4       8        8
32                 28          112          4       8        8
16                 56          232          8       8        8
8                  120         488          8       8        16
4                  248         1000         8       8        16
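The field sizes in Table 8.17 can be cross-checked against the slot capacity: a slot is 2560 chips and QPSK carries 2 bits per symbol, so each slot holds 2 × 2560 / SF bits split between the DPDCH and DPCCH fields. A small sketch (illustrative names, a few sample rows) verifies this.

```python
# Consistency check of Table 8.17 slot budgets.
SLOT_CHIPS = 2560
FIELDS = {  # SF: (Ndata1, Ndata2, NTPC, NTFCI, NPilot)
    512: (0, 2, 2, 2, 4),
    128: (6, 22, 2, 2, 8),
    8:   (120, 488, 8, 8, 16),
}

for sf, fields in FIELDS.items():
    budget = 2 * SLOT_CHIPS // sf         # bits per slot for this SF
    assert sum(fields) == budget, (sf, sum(fields), budget)
print("all slot budgets consistent")
```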
dispersive, with several reflectors and scatterers. For this reason, the transmitted signal may reach the receiver via a number of distinct paths, each having a different delay and amplitude. The multipath fast fading is modeled by the superposition of multiple single-faded paths with different arrival times and different average powers, for the power-delay profiles specified in [36]. Each path is characterized by a Rayleigh distribution (first-order statistic) and a classic Doppler spectrum (second-order statistic). Figure 8.35 shows a block diagram of a four-path frequency-selective fading channel. UTRAN defines three different multipath power-delay profiles for use in different propagation environments: indoor office environments, outdoor-to-indoor and pedestrian environments, and vehicular environments. All of these models are implemented in the simulator, and the tapped-delay-line parameters for the vehicular environment are shown in Table 8.18. The mobile channel impulse response is updated 100 times for every coherence time interval.
Figure 8.35 Four-path frequency selective fading channel
Table 8.18 Vehicular A test environment [36]. Reproduced, with permission, from “Universal Mobile Telecommunications System (UMTS); selection procedures for the choice of radio transmission technologies of the UMTS (UMTS 30.03 version 3.2.0)”, TR 101 112 V3.2.0 (1998–04). 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

Tap   Delay (ns)   Power (dB)   Doppler spectrum
1     0            0            Classic
2     310          −1.0         Classic
3     710          −9.0         Classic
4     1090         −10.0        Classic
5     1730         −15.0        Classic
6     2510         −20.0        Classic
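The tapped-delay-line model of Table 8.18 can be sketched as below. This is illustrative only: each tap is an independent complex Gaussian (hence Rayleigh-magnitude) gain with the Vehicular A average powers, drawn once per call (block fading) rather than tracked with a classic Doppler spectrum, which is enough to show the frequency-selective structure.

```python
# Tapped-delay-line sketch of the ITU Vehicular A profile.
import random

DELAYS_NS = [0, 310, 710, 1090, 1730, 2510]
POWERS_DB = [0.0, -1.0, -9.0, -10.0, -15.0, -20.0]

def draw_tap_gains():
    gains = []
    for p_db in POWERS_DB:
        # per-component std dev so that E|g|^2 equals the linear tap power
        sigma = (10 ** (p_db / 10) / 2) ** 0.5
        gains.append(complex(random.gauss(0, sigma), random.gauss(0, sigma)))
    return gains                           # Rayleigh-distributed magnitudes

random.seed(0)
print([round(abs(g), 3) for g in draw_tap_gains()])
```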
After the multipath channel shown in Figure 8.29, white Gaussian noise is added to simulate the effect of the overall interference in the system, including thermal noise and inter-cell interference.

8.3.2.8 Rake Receiver

The rake receiver is a coherent receiver that attempts to collect the signal energy from all received signal paths that carry the same information. The rake receiver can therefore significantly reduce the fading caused by these multiple paths. The operation of the rake receiver follows three main principles (see Figure 8.36). The first is the identification of the time-delay positions at which significant energy arrives, and the time alignment of the rake fingers for combining. The second is the tracking of the fast-changing phase and amplitude values originating from the fast-fading process within each correlation receiver, and the removal of these from the incoming data. Finally, the demodulated and phase-adjusted symbols are combined across all active fingers and passed to the decoder for further processing. The combination can be performed using three different methods:
Figure 8.36 Block diagram of a rake receiver
• Equal gain combining (EGC), where the output from each finger is simply summed with equal gain.
• Maximal ratio combining (MRC), where the finger outputs are scaled by a gain proportional to the square root of the SNR of each finger before combining [42]. In this case, the finger output with the highest power dominates in the combination.
• Selective combining (SC), where not all finger outputs are considered in the combination; some fingers are selected for combining according to the received power on each.
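The MRC option above can be sketched in a few lines (illustrative only; the function name is ours). Weighting each finger by the conjugate of its channel estimate aligns the phases and lets the stronger fingers dominate; EGC would use unit-magnitude weights instead.

```python
# Maximal ratio combining of rake finger outputs.
def mrc_combine(finger_outputs, channel_estimates):
    return sum(y * h.conjugate()
               for y, h in zip(finger_outputs, channel_estimates))

# one QPSK symbol s seen through three fingers with different gains/phases
s = complex(1, 1)
h = [complex(0.9, 0.1), complex(0.2, -0.5), complex(-0.1, 0.3)]
received = [hi * s for hi in h]            # noiseless, for illustration
z = mrc_combine(received, h)
print(z)                 # s scaled by sum(|h_i|^2) = 1.21, phase preserved
```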
Two types of rake receiver have been developed for the downlink:

1. Ideal rake receiver.
2. Quasi-ideal rake receiver.

The following subsections provide details of the designs of these rake receivers.

Ideal Rake Receiver

The block diagram of an ideal rake receiver is shown in Figure 8.37. Here, perfect channel estimation and perfect finger time alignment are assumed. This is implemented by storing all the fast-fading channel coefficients as a complex vector, whose length equals the number of frequency-selective fading paths. This vector is then fed from the channel directly to the ideal receiver. At the receiver, the coefficients for each path are first separated and then applied to each rake finger, after being time-aligned in accordance with the delay (from the channel delay profile) of each reflected path. Of the three rake finger combination methods, EGC is selected for the ideal receiver, as it gives the optimal performance in the presence of ideal channel estimation and perfect time alignment.

Quasi-ideal Rake Receiver

The quasi-ideal rake receiver resembles a practical rake receiver in terms of implementation. However, as depicted in Figure 8.38, an ideal finger search is assumed: each finger in the receiver is assumed to have perfect synchronization with the corresponding path in the channel. First the received data is time-aligned according to the channel
Figure 8.37 Ideal rake receiver
Figure 8.38 Quasi-ideal rake receiver
delay profile. Then the data on each finger undergoes a complex correlation process to remove the scrambling and spreading codes. As in an actual receiver implementation, pilot bits are used to estimate the momentary channel state for a particular finger. Channel estimation is achieved through the use of a matched filter, which is employed only during the period in which the pilot bits are being received. The pilot bits are sent in every transmit time slot; therefore, the maximum effective channel updating interval is equivalent to half a slot. The output from the matched filter is further refined using a complex FIR filter. Here, the weighted multi-slot averaging (WMSA) technique proposed in [43] is employed to reduce the noise variance, and also to track fast channel variation between consecutive channel estimates. Two different sets of weightings for the WMSA filter are used for low and high vehicular speeds respectively. This is because the limiting factor in channel estimation errors at low vehicular speed is the channel noise rather than the channel variation, so the noise averaging effect is more desirable at low vehicle speed. By comparison, at high vehicle speed channel variation becomes the limiting factor, hence weighting based on interpolation should be considered. The WMSA technique requires a delay of an integer number of time slots for the channel processing. The time-varying channel effect is removed from the de-scrambled and de-spread signal before it is sent to the signal combiner. MRC is used for the rake finger combination, as it gives better performance. Intersymbol interference due to multipath is implicit in the resulting output.

8.3.2.9 Channel Decoding

In the implementation, a soft-decision Viterbi algorithm is used for the decoding of the convolutional codes.
Turbo decoding is based on the standard Log-MAP algorithm (as provided in SPW), which tries to minimize the bit error ratio rather than the block error ratio [42]. Eight iterations are performed.
8.3.3 Model Verification for Forward Link

The theoretical formula for the BER probability with an order-L MRC diversity combiner is given in [44], and is stated as:
P_b = (1/2) Σ_{k=1}^{L} p_k [1 − √(γ̄_k / (1 + γ̄_k))]          (8.8)

where P_b is the bit error probability, L denotes the number of diversity paths, and γ̄_k is the mean Eb/η for the kth diversity path. p_k is given by:

p_k = ∏_{i=1, i≠k}^{L} γ̄_k / (γ̄_k − γ̄_i)          (8.9)
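Equations (8.8) and (8.9) are straightforward to evaluate numerically; the sketch below (illustrative function name) reproduces the single-path Rayleigh BER as a sanity check and shows the bound dropping as diversity order grows.

```python
# Numerical evaluation of Equations (8.8) and (8.9): BER lower bound of an
# order-L MRC combiner over Rayleigh paths with distinct mean SNRs.
from math import sqrt

def mrc_ber(mean_snrs):
    L = len(mean_snrs)
    pb = 0.0
    for k in range(L):
        gk = mean_snrs[k]
        pk = 1.0
        for i in range(L):
            if i != k:
                pk *= gk / (gk - mean_snrs[i])      # Equation (8.9)
        pb += 0.5 * pk * (1 - sqrt(gk / (1 + gk)))  # Equation (8.8)
    return pb

print(mrc_ber([10.0]))            # single path: classic Rayleigh BER, ~2.3e-2
print(mrc_ber([10.0, 5.0, 2.0]))  # three-path diversity lowers the bound
```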
If the rake receiver is assumed to behave as an order-L MRC diversity combiner, Equation (8.8) gives the lower bound of the BER performance. Other test conditions assumed in Equation (8.8) are:

• Perfect channel estimation.
• No intersymbol interference present.
• Each propagation path has a Rayleigh envelope.
Using Equation (8.8), the theoretical lower bound of the performance for the power-delay profile specified in the case 3 outdoor performance measurement test environment in Annex B of [30] is calculated. It is depicted in Figure 8.39. Here, a mean SNR value for each
Figure 8.39 Performance of uncoded channel
individual path is calculated from the global Eb/No by simply multiplying it by the fraction of power carried by that path (given in the power-delay profile). The number of rake fingers equals the number of propagation paths, which is four in this case. Figure 8.39 shows the performance in terms of raw (uncoded) BER for varying spreading factors in the above-described test environment. A single active connection is considered. The dashed lines show the performance obtained by Olmos and Ruiz in [44] for similar test conditions. Figure 8.39 clearly illustrates the close match between the results obtained from the described model and those given in [44]. As the spreading factor reduces, the performance deviates from the theoretical bound, due to the presence of intersymbol interference. For non-ideal channel estimation, the raw BER performance (see Figure 8.40) deviates considerably from the ideal channel estimation performance. The performance degradation is about 3–4 dB for operation at lower Eb/No, and increases gradually as Eb/No increases. It should be emphasized that the channel-coding algorithm can correct almost all channel error occurrences if the raw BER is less than 10^-2. Therefore, the region of interest for multimedia applications is limited to the top left-hand corner of Figure 8.40.

8.3.3.1 Model Performance Validation

Reference performance figures for the downlink dedicated physical channels are given in [30]. These allow for the setting of reference transmitter and receiver performance figures for nominal error ratios, sensitivity levels, interference levels, and different propagation conditions. Reference measurement channel configurations are specified in Annex A of [30],
Figure 8.40 Comparison of ideal and non-ideal channel estimate performance
Wireless Channel Models

Table 8.19 BLER performance requirement [30]. Reproduced, with permission, from "3rd Generation Partnership Project; technical specification group radio access network; user equipment (UE) radio transmission and reception (FDD) (release 4)", 3GPP TS 25.101 V4.10.0 (2002-03). © 2004 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

Test number   DPCH_Ec/Ior   BLER
1             -11.8 dB      10^-2
2             -8.1 dB       10^-1
              -7.4 dB       10^-2
              -6.8 dB       10^-3
3             -9.0 dB       10^-1
              -8.5 dB       10^-2
              -8.0 dB       10^-3
4             -5.9 dB       10^-1
              -5.1 dB       10^-2
              -4.4 dB       10^-3
while the reference propagation conditions are specified in Annex B [30]. A mechanism to simulate, on the dedicated channel, the interference from other users and control channels in the downlink (the orthogonal channel noise simulator, OCNS) is described in Annex C [30]. The performance requirements are given in terms of block error ratio (BLER) for different multipath propagation conditions, data rates (and hence spreading factors), and channel coding schemes. For example, Table 8.19 lists the required upper bound of the BLER for the reference parameter settings shown in Table 8.20. The power-delay profile of the multipath fading propagation condition used in the reference test is given in Table 8.21. Physical channel parameters, transport channel parameters, channel coding, and channel mapping for the 64 kbps reference test channel are depicted in Figure 8.41. As in a typical operating scenario, two transport channels, the data channel and the signaling channel, are multiplexed and mapped on to the same physical channel.
Table 8.20 Reference parameter setting [30]. Reproduced, with permission, from "3rd Generation Partnership Project; technical specification group radio access network; user equipment (UE) radio transmission and reception (FDD) (release 4)", 3GPP TS 25.101 V4.10.0 (2002-03). © 2004 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

Parameter               Unit            Test 1   Test 2   Test 3   Test 4
Ior/Ioc                 dB              3        3        3        6
Ioc                     dBm/3.84 MHz    -60      -60      -60      -60
Information data rate   kbps            12.2     64       144      384
Table 8.21 Power-delay profile for case 3 test environment [30]. Reproduced, with permission, from "3rd Generation Partnership Project; technical specification group radio access network; user equipment (UE) radio transmission and reception (FDD) (release 4)", 3GPP TS 25.101 V4.10.0 (2002-03). © 2004 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

Relative delay (ns)   Average power (dB)   Fading
0                     0                    Classical Doppler
260                   -3                   Classical Doppler
521                   -6                   Classical Doppler
781                   -9                   Classical Doppler
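The per-path SNR split used for the rake-receiver bound above can be sketched as follows. This is a minimal illustration, not the simulator code; the function name is ours, and the case 3 profile of Table 8.21 is used as the example input:

```python
import math

def per_path_ebno(global_ebno_db, path_powers_db):
    """Split a global Eb/No across multipath rays according to a
    power delay profile: each ray's mean Eb/No is the global Eb/No
    multiplied by the fraction of power that ray carries."""
    lin = [10 ** (p / 10.0) for p in path_powers_db]
    total = sum(lin)
    global_lin = 10 ** (global_ebno_db / 10.0)
    return [10 * math.log10((p / total) * global_lin) for p in lin]

# Case 3 power delay profile: relative path powers 0, -3, -6, -9 dB.
snrs = per_path_ebno(10.0, [0.0, -3.0, -6.0, -9.0])
```

The per-path values in linear units sum back to the global Eb/No, so the split conserves the total received energy across the four rake fingers.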
8.3.3.2 Calculation of Eb/No and DPCH_Ec/Ior

Reference test settings are given in terms of the ratio of energy per chip to the total transmit power spectral density at the node B antenna connector. The relationship between Eb/No and the variance σ² of the AWGN source, and the conversion of DPCH_Ec/Ior to Eb/No, are given in Equations (8.10) and (8.11) respectively. The derivations of these equations are given in Appendix A.
[Figure: coding chain for the 64 kbps DL reference measurement channel. The DTCH carries 1280-bit information blocks with a 16-bit CRC and 1/3-rate turbo coding; the DCCH carries 100-bit blocks with a 12-bit CRC and 1/3-rate convolutional coding. After rate matching, first and second interleaving, and radio frame segmentation, both are multiplexed onto a 120 ksps DPCH (including TFCI bits) across four radio frames FN = 4N to 4N+3.]

Figure 8.41 Channel coding of DL reference measurement channel (64 kbps) [30]. Reproduced, with permission, from "3rd Generation Partnership Project; technical specification group radio access network; user equipment (UE) radio transmission and reception (FDD) (release 4)", 3GPP TS 25.101 V4.10.0 (2002-03). © 2004 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/
Table 8.22 Performance validation for convolutional code use

                          DPCH_Ec/Ior at BLER = 1%
Propagation environment   3GPP upper bound   CCSR model
AWGN                      -16.6 dB           -18.5 dB
Case 1                    -15.0 dB           -16.0 dB
Case 2                    -7.7 dB            -12.5 dB
Case 3                    -11.8 dB           -14.6 dB
Eb/No = (RC · ch_os) / (2 · Rb · σ²)    (8.10)

Eb/No = (DPCH_Ec/Ior) · (RC · ch_os / Rb) · (Îor / Ioc)    (8.11)
where RC and Rb are the chip rate and the channel bit rate respectively, and ch_os denotes the channel oversampling factor. Equation (8.11) is equivalent to the equation proposed by Ericsson in [28].

8.3.3.3 UMTS DL Model Verification for Convolutional Code Use

Reference interference performance figures [30] for the convolutional code with 12.2 kbps data were compared to the results obtained using the designed simulation model, which is referred to as the CCSR model. A comparison is given in Table 8.22. The reference results are given in terms of the upper bound of the average downlink power needed to achieve the specified block error ratio. The results listed in Table 8.22 show that the CCSR model performance is within the performance requirement limits in all propagation conditions. The performance requirements specified in 3GPP are limited to a single value, which is insufficient to test the model performance over a range of propagation conditions. Therefore, performance tests were carried out over a range of DPCH_Ec/Ior values for different reference propagation environments, and the results are compared to those obtained by Ericsson [28] and NTT DoCoMo [29]. These results are shown in Figure 8.42, which clearly illustrates how closely the CCSR model follows the results given in the above references. BLER curves for case 1 and case 3 are virtually identical to those given in the above references. In the case 2 propagation environment, the CCSR model outperforms the other two reference figures. The reason for this may be the use of an EGC rake receiver in the CCSR model. Case 2 represents an artificial radio environment consisting of three paths with large relative delays and equal average power. Use of EGC in this environment combines energy from all three paths with equal gain, resulting in maximum power output and optimal performance. The path combiner structures used in the reference models are unknown.
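The conversion of Equation (8.11) can be sketched numerically in dB, where the products become sums. The test-1 values from Tables 8.19 and 8.20 are used purely as an illustration, and ch_os = 1 is assumed:

```python
import math

CHIP_RATE = 3.84e6  # UTRA-FDD chip rate, chips/s

def dpch_to_ebno_db(dpch_ec_ior_db, ior_hat_over_ioc_db, bit_rate, ch_os=1):
    # Equation (8.11): Eb/No = (DPCH_Ec/Ior) * (Rc*ch_os/Rb) * (Ior_hat/Ioc).
    # In dB the processing gain Rc*ch_os/Rb and the geometry factor add.
    processing_gain_db = 10 * math.log10(CHIP_RATE * ch_os / bit_rate)
    return dpch_ec_ior_db + processing_gain_db + ior_hat_over_ioc_db

# Illustrative test-1 numbers: DPCH_Ec/Ior = -11.8 dB, Ior/Ioc = 3 dB,
# 12.2 kbps channel bit rate.
ebno = dpch_to_ebno_db(-11.8, 3.0, 12.2e3)
```

For the 12.2 kbps channel the processing gain is roughly 25 dB, so even a strongly negative DPCH_Ec/Ior yields a comfortably positive Eb/No at the decoder input.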
Performance over the AWGN environment shows slight variation at low channel qualities; however, it approaches the reference figure as the channel quality improves.
Figure 8.42 Comparison of reference performance for convolutional code use: (a) 12.2 kbps measurement channel over AWGN channel; (b) 12.2 kbps measurement channel over case 1 environment; (c) 12.2 kbps measurement channel over case 2 environment; (d) 12.2 kbps measurement channel over case 3 environment
8.3.3.4 UMTS DL Model Verification for Turbo Code Use

Reference [30] also presents the upper bounds for the performance of turbo codes under different propagation conditions. The results generated with the CCSR model at BLERs of 10% and 1% were compared to these reference figures, as listed in Table 8.23 for the 64 kbps test channel and in Table 8.24 for the 144 kbps test channel. The results listed in the tables clearly show that under the propagation conditions described in [30], the performance of the CCSR model at resulting BLERs of 10% and 1% is within the required maximum limits. As for convolutional codes, the performance of turbo codes is evaluated and compared to the performance figures obtained by Ericsson [28] and NTT DoCoMo [29]. The performance traces under different conditions are shown in Figure 8.43 for 64 kbps and in Figure 8.44 for 144 kbps. Dashed-dotted lines denote the results of Ericsson, while dashed lines with star marks show the results of NTT DoCoMo.
Table 8.23 Performance validation for 64 kbps channel

                          DPCH_Ec/Ior at BLER = 10%
Propagation environment   3GPP upper bound   CCSR model
AWGN                      -13.1 dB           -15.2 dB
Case 1                    -13.9 dB           -15.0 dB
Case 2                    -6.4 dB            -10.5 dB
Case 3                    -8.1 dB            -10.9 dB

                          DPCH_Ec/Ior at BLER = 1%
AWGN                      -12.8 dB           -14.95 dB
Case 1                    -10.0 dB           -10.7 dB
Case 2                    -2.7 dB            -7.5 dB
Case 3                    -7.4 dB            -10.1 dB
The result for case 1 is very close to the results given in the above references, and the result over the AWGN channel is closer to the Ericsson figures. As with the convolutional code, the CCSR model outperforms the other two models in operation over the case 2 propagation environment. However, for the 144 kbps channel over case 3, the CCSR model results do not closely follow the reference traces. Moreover, even the reference traces do not show close agreement in this environment. There are several reasons for this behavior. First, case 3 resembles a typical outdoor vehicular environment, with the mobile speed set to 120 kmph. The effects of time-varying multipath channel conditions and inter-cell interference are therefore more evident, resulting in a variation in performance. Second, the implementation of the inter-cell interference could be different in each model. In the CCSR model, inter-cell interference is evaluated mathematically and mapped on to the variance of the Gaussian noise source. Third, the decoding algorithm used for iterative turbo decoding in the CCSR model is the LogMap algorithm, while the reference models use the MaxLogMap algorithm. Even though these two algorithms show similar performance over AWGN channels, when applied over multipath propagation conditions their relative performance depends on other factors, such as the length of the turbo-internal interleaver, the input block length, and the input signal amplitude [42].

Table 8.24 Performance validation for 144 kbps channel

                          DPCH_Ec/Ior at BLER = 10%
Propagation environment   3GPP upper bound   CCSR model
AWGN                      -9.9 dB            -11.86 dB
Case 1                    -10.6 dB           -12.5 dB
Case 2                    -8.1 dB            -13.0 dB
Case 3                    -9.0 dB            -12.3 dB

                          DPCH_Ec/Ior at BLER = 1%
AWGN                      -9.8 dB            -11.72 dB
Case 1                    -6.8 dB            -7.6 dB
Case 2                    -5.1 dB            -9.5 dB
Case 3                    -8.5 dB            -10.75 dB
Figure 8.43 Comparison of reference performances for 64 kbps channel: (a) 64 kbps measurement channel over AWGN environment; (b) 64 kbps measurement channel over case 1 environment; (c) 64 kbps measurement channel over case 2 environment; (d) 64 kbps measurement channel over case 3 environment
8.3.4 UMTS Physical Link Layer Simulator

From a user point of view, services are considered end-to-end; that is, from one terminal equipment to another. An end-to-end service may have a certain quality of service (QoS), which is provided for the user by the different networks. In UMTS, it is the UMTS bearer service that provides the requested QoS, through the use of different QoS classes, as defined in [45]. At the physical layer, these QoS attributes are assigned a radio access bearer (RAB) with specific physical-layer parameters in order to guarantee quality of service over the air interface. RABs are normally accompanied by signaling radio bearers (SRBs). Typical parameter sets for reference RABs, signaling RBs, and important combinations of the two
Figure 8.44 Comparison of reference performances for 144 kbps channel: (a) 144 kbps measurement channel over AWGN environment; (b) 144 kbps measurement channel over case 1 environment; (c) 144 kbps measurement channel over case 2 environment; (d) 144 kbps measurement channel over case 3 environment
(downlink, FDD) are presented in [37]. In the simulation, the 3.4 kbps SRB specified in [37] is used for the dedicated control channel (DCCH). Transport channel parameters for the 3.4 kbps SRB are summarized in Table 8.25. Careful examination of the parameter sets for RABs and SRBs specified in [37] shows that the minimum possible rate-matching ratios for RABs and SRBs vary with the physical-layer spreading factor being used. This is because a higher spreading factor adds transmission channel protection on top of that provided by the channel coding. The channel bit error rate therefore falls as the spreading factor increases, and heavier puncturing can be allowed in these scenarios without loss of performance. Table 8.26 shows the variation of the calculated minimum rate-matching ratios (maximum puncturing ratios) with spreading factor for FDD downlink control channels.
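The extra protection gained from a higher spreading factor can be seen from the gross channel bit rate of a downlink physical channel: halving the bit rate doubles the energy available per channel bit. A rough sketch (QPSK with two channel bits per symbol is assumed, and the pilot/TPC/TFCI field overheads are ignored, so these are gross figures only):

```python
CHIP_RATE = 3.84e6  # UTRA-FDD chip rate, chips per second

def gross_channel_bit_rate(spreading_factor):
    # Downlink symbol rate is chip rate / SF; QPSK carries 2 bits/symbol.
    # Higher SF -> fewer channel bits/s -> more chip energy per bit.
    return 2 * CHIP_RATE / spreading_factor

rates = {sf: gross_channel_bit_rate(sf) for sf in (128, 64, 32, 8)}
```

Each halving of the spreading factor doubles the gross channel bit rate, which is why the minimum tolerable rate-matching ratio in Table 8.26 rises (less puncturing is allowed) as the spreading factor drops.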
Table 8.25 Transport channel parameters for the 3.4 kbps signaling radio bearer [37]. Reproduced, with permission, from "3rd Generation Partnership Project; technical specification group terminals; common test environments for user equipment (UE) conformance testing (release 4)", 3GPP TS 34.108 V4.7.0 (2003-06). © 2003 3GPP. © 1998 3GPP. Reproduced by permission of European Telecommunications Standards Institute 2008. Further use, modification, redistribution is strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/

RLC:
  Logical channel type:        DCCH / DCCH / DCCH / DCCH
  RLC mode:                    UM / AM / AM / AM
  Payload size (bits):         136 / 128 / 128 / 128
  Max data rate (bps):         3400 / 3200 / 3200 / 3200
  AMD/UMD PDU header (bits):   8 / 16 / 16 / 16
MAC:
  MAC header (bits):           4
  MAC multiplexing:            4 logical channel multiplexing
Layer 1:
  TrCH type:                   DCH
  TB size (bits):              148
  TTI (ms):                    40
  Coding type:                 CC 1/3
  CRC (bits):                  16
  Max number of bits/TTI before rate matching:  516
In the simulation, rate-matching attributes for an SRB are set according to the minimum rate-matching ratios shown in Table 8.26 for the different physical channels. The actual information data rate is a function of the spreading factor, rate-matching ratio, type of channel coding, channel coding rate, number of CRC bits, and transport block size. Table 8.27 lists all the parameters that are user-definable, either by modifying the parameters of hierarchical models, by changing the building blocks that constitute the model, or by using different schematics.

8.3.4.1 BLER/BER Performance of Simulator

Simulations were carried out for different radio bearer configuration settings. For higher spreading factor realizations, the simulation period was set to 60 s. However, for
Table 8.26 Minimum rate-matching ratios for SRB

SF      Minimum rate-matching ratio
128     0.690
64      0.73
32      0.99
16-4    1.0
Table 8.27 UTRAN simulator parameters

Parameter: Setting
CRC attachment: 24, 16, 12, 8, or 0 bits
Channel coding schemes supported: no coding; 1/2 rate convolutional coding; 1/3 rate convolutional coding; 1/3 rate turbo coding
Interleaving: 1st interleaving: block interleaver with inter-frame permutation; 2nd interleaving: block interleaver with inter-column permutation (permutation patterns are specified in [38])
Rate matching: the algorithm (for convolutional rate matching) as specified in 3GPP TS 25.212; the rate matching ratio (repetition or puncturing ratio) is user-definable
TrCH multiplexing: experiments were conducted for two transport channels
Transport format detection: TFCI-based detection
Spreading factor: 512, 256, 128, 64, 32, 16, 8, 4
Transmission time interval: 10 ms, 20 ms, 40 ms, 80 ms
Pilot bit patterns: as specified in 3GPP TS 25.211
Interference/noise characteristics: user-defined values are converted to the variance of the AWGN source at the receiver
Fading characteristics: Rayleigh fading; the mobile channel impulse response is updated 100 times per coherence time interval
Multipath characteristics: vehicular, pedestrian
Mobile terminal velocity: user-definable; constant for the simulation run
Chip rate: 3.84 Mcps
Carrier frequency: 2000 MHz
Antenna characteristics: 0 dB gain for both transmitter and receiver antennas
Receiver characteristics: rake receiver with maximum ratio combining, equal gain combining, or selective combining; the number of rake fingers is user-definable
Transmission diversity characteristic: closed-loop fast power control [46]
Channel decoding: soft-decision Viterbi convolutional decoder; standard LogMap turbo decoder; the number of turbo iterations is user-definable
Performance measures: bit error patterns and block error patterns
Simulation length: 3000-6000 radio frames, equivalent to 30-60 s duration
spreading factor 8, the simulation period was limited to 30 s, to compensate for the higher processing time required at high data rates. 3000-6000 radio frames (of 10 ms each) accommodate about 15 000-20 000 RLC blocks in a generated bit error sequence, which is long enough to obtain a meaningful BLER average. Furthermore, experimental results show that the selected simulation duration is sufficient to capture the bursty nature of the wireless channel and its effect on the perceptual quality of received video.

Effect of Spreading Factor
To compare the effects of spreading-factor variation, experiments were conducted for various spreading-factor allocations. The other physical channel parameters are set to their nominal values, which are shown in Table 8.28. The
Table 8.28 Nominal parameter settings

Spreading factor             32
Transmission time interval   20 ms
CRC attachment               16 bits
Channel coding               1/2 CC, 1/3 CC, 1/3 TC
Mobile speed                 3 kmph, 50 kmph
Rate matching ratio          1.0
Operating environment        Vehicular A, pedestrian B

Table 8.29 Channel throughput characteristics

            Convolutional coding    Convolutional coding    Turbo coding
            (1/2 rate)              (1/3 rate)              (1/3 rate)
Spreading   RLC payload   Rate      RLC payload   Rate      RLC payload   Rate
factor      (bits)        (kbps)    (bits)        (kbps)    (bits)        (kbps)
128         45            15.75     49            9.8       45            9.0
32          320           96.0      320           64.0      320           64.0
16          320           192.0     320           128.0     320           128.0
8           640           416.0     640           288.0     640           288.0
4           640           896.0     640           608.0     640           608.0
calculated possible information data rates are based on the specified SRBs for a given composite transport channel (consisting of one signaling channel and one dedicated data channel) for FDD downlink channels, and are presented in Table 8.29. Table 8.29 also lists the RLC payload setting used in the different bearer configurations. Figure 8.45 shows the BER performance for the transmission of uncoded data (raw/channel BER) over the vehicular A propagation environment. It clearly illustrates the error-floor characteristics caused by intersymbol interference in multipath channels. The effect of intersymbol interference increases as the spreading factor is reduced. However, the error floor is much less pronounced in the coded BER performance, except for very low spreading factor allocations (see Figure 8.46). This is due to the channel coding algorithm, which tends to correct most of the errors if the channel bit error ratio is lower than 10^-2. Figure 8.46(a) shows the performance of the convolutional code, while the performance of the turbo code is shown in Figure 8.46(b). The effect of spreading factor variation on the performance of turbo codes is similar to that on convolutional codes. However, the performance for spreading factor 8 lies closer to that of the other spreading factors than in the convolutional coding case. This is mainly due to the behavior of turbo codes: the larger the input block size of the turbo code, the better the performance. A high-bit-rate service (with a low spreading factor) can accumulate more bits in a TTI than a low-bit-rate service, and the better performance of the turbo code with large input block sizes compensates for the reduced robustness against interference at low spreading factors. In Figure 8.46(a) and (b), the performance for spreading factor 128 is worse than that for spreading factors 16 and 32. A possible reason is the poor performance of the interleavers (the first interleaver in convolutional
Figure 8.45 Performance of uncoded channel over vehicular A environment (raw BER vs Eb/No for spreading factors 128, 32, 16, and 8)
coding and the first and turbo-internal interleavers in turbo coding) in the presence of smaller input block sizes.

Effect of Channel Coding
Figure 8.47 illustrates the effect of the channel coding scheme on the block error ratio and bit error ratio performance. The vehicular A channel is used as the test environment, with the spreading factor set to 32. As expected, turbo coding shows better performance than the other channel coding schemes, while the 1/3 rate convolutional code outperforms the 1/2 rate convolutional code. It must be emphasized that the plots show BLER/BER performance vs Eb/No. If the BLER/BER performance were viewed vs transmitted power,
Figure 8.46 Spreading factor effect for vehicular A environment: (a) 1/3 rate convolutional code; (b) 1/3 rate turbo code
Figure 8.47 Effect of channel coding scheme: (a) BER performance; (b) BLER performance
significant improvements would be visible for the 1/3 rate coding scheme compared to the 1/2 rate coding scheme. This is because the transmit power is directly proportional to the source bit rate; as 1/2 rate coding supports higher source rates, the corresponding curve would be shifted further to the right than the others. Furthermore, the convolutional code and the turbo code show closer BLER performance (Figure 8.47(b)) than BER performance (Figure 8.47(a)). This is due to the properties of the implemented LogMap algorithm at the turbo decoder, which is optimized to minimize the number of bit errors rather than the block error ratio [42].

Effect of Channel Environment
Experiments were conducted to investigate the BLER performance for the pedestrian B channel environment. The mobile speed is set to 3 kmph. Results for the 1/3 rate convolutional code with different spreading factors are shown in Figure 8.48. As is evident from the figure, the
Figure 8.48 1/3 rate convolutional coding performance for the pedestrian B environment
resulting performance over the pedestrian B channel is much worse than that over the vehicular A channel environment when operating without fast power control. This is due to the slow channel variation associated with low mobile speeds: a large number of consecutive information blocks can experience a long spell of weak channel conditions during transmission. This reduces the effectiveness of block-based de-interleaving and channel decoding algorithms, resulting in a high block error ratio. On the other hand, a faster Doppler effect at high vehicular speeds produces alternating weak and strong channel conditions of short duration. This behaves as a time-domain transmit diversity technique and enhances the performance of block-based interleaving and channel coding algorithms.

8.3.4.2 Eb/No to Eb/Io and C/I Conversion

The BER performance of UMTS-FDD systems depends on many factors, such as the mean bit energy of the useful signal, thermal noise, and interference. Interference can be divided into three main parts: intersymbol interference, intra-cell interference, and inter-cell interference. In a multipath propagation environment the signal is received with significant delay spread, which causes intersymbol interference. Orthogonal codes are used to separate users in the downlink. Without multipath propagation, these codes can be considered perfectly orthogonal to each other. In a multipath propagation environment, however, the orthogonality among spreading codes deviates from perfection, due to the delay spread of the received signal. The mobile terminal therefore sees part of the signal transmitted to other users as interference power, labeled intra-cell interference. The interference power seen from users in neighboring cells is quantified as the inter-cell interference.
The BER performance is commonly written as a function of the global Eb/η, defined as:

Eb/η = Eb / (No + χ + (1 - ρ)·Io + η_ISI)    (8.12)

Eb/η = [ (Eb/No)^-1 + (Eb/χ)^-1 + ((1 - ρ)^-1 · Eb/Io)^-1 + (Eb/η_ISI)^-1 ]^-1    (8.13)
where Eb is the received energy per bit of the useful signal, No is the power spectral density of the system-generated thermal noise, η is the global noise power spectral density, χ is the inter-cell interference power spectral density, ρ is the orthogonality factor (OF), Io is the intra-cell interference power spectral density, and η_ISI is the power spectral density of the intersymbol interference of the received signal. These factors depend on:

- The operating environment.
- The number of active users per cell.
- The spreading factors used in the code tree.
- The cell site configurations.
- The presence of diversity techniques.
- The mobile user locations.
- The type of radio bearer.
- The voice activity factor.
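The harmonic combination of Equation (8.13) can be sketched directly; this is a transcription of the formula with the symbol names spelled out, all ratios in linear (not dB) units:

```python
def global_eb_eta(eb_no, eb_chi, eb_io, eb_eta_isi, rho):
    """Equation (8.13): the global Eb/eta combines the thermal-noise,
    inter-cell, intra-cell (weighted by 1 - rho) and intersymbol
    interference terms harmonically."""
    inv = (1.0 / eb_no) + (1.0 / eb_chi) \
          + (1.0 - rho) / eb_io + (1.0 / eb_eta_isi)
    return 1.0 / inv
```

With ρ = 1 (perfect orthogonality) the intra-cell term vanishes, and the global ratio is then dominated by whichever of the remaining terms is smallest, as expected from a harmonic mean.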
The loss of orthogonality between simultaneously-transmitted signals on a WCDMA downlink is quantified by the OF. The lower the value of the OF, the smaller the interference; an OF of 1 corresponds to the perfectly orthogonal case, while an OF near 0 indicates considerable downlink interference. The introduction of the orthogonality factor in modeling intra-cell interference allows the employment of the Gaussian hypothesis: an equivalent Gaussian noise, with power spectral density equal to (1 - ρ) times the received intra-cell interference power, is simply added at the receiver input. The statistics of the OF are normally derived from measurement data gathered through extensive field trial campaigns. In the designed UTRA downlink simulator, the OFs derived from the gathered channel data presented in [47] are used to simulate the intra-cell interference power. The inter-cell interference can also be modeled with the Gaussian hypothesis. The inter-cell interference power spectral density and the intra-cell interference power spectral density can be explicitly obtained through system-level simulations, or through analytical calculations based on simplifying assumptions and the cell configuration [31]. However, intersymbol interference can only be obtained by chip-level simulation, and does not depend on factors other than the spreading factor used, the propagation condition, and the mobile speed. Therefore, it is sufficient to obtain the Eb/η performance for a single connection (χ = 0, Io = 0) for each of the possible bit rates or spreading factors by chip-level simulation. The Eb/Io performance can then easily be derived from Eb/η using Equation (8.14), where No << (1 - ρ)·Io is assumed and the intersymbol interference is implicit in the simulation.

Eb/Io = (1 - ρ) · Eb/η    (8.14)
Equation (8.15) gives the relationship between the average energy per bit and the average received signal power, S:

S = Eb · R    (8.15)
where R denotes the data bit rate. Therefore:

SIR_χ = R · Eb/χ    (8.16)

SIR_I = R · (1 - ρ) · Eb/Io    (8.17)

SIR_Total = [ SIR_χ^-1 + SIR_I^-1 ]^-1    (8.18)
Table 8.30 Orthogonality factor variation for different cellular environments [47]

                Code orthogonality factor
Environment     Mean      Std
Urban small     0.571     0.159
Urban large     0.514     0.212
Rural large     0.626     0.190
where SIR_χ, SIR_I, and SIR_Total denote the signal-to-inter-cell interference ratio, the signal-to-intra-cell interference ratio, and the total signal-to-interference ratio respectively. Assume No = 0.0002, χ = 0.005, and the vehicular A propagation environment. From Table 8.30, the orthogonality factor is 0.514. The Eb/Io values corresponding to the Eb/η values shown in Figure 8.46(a) are calculated from Equation (8.14) and are shown in Figure 8.49.
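Using the orthogonality factor quoted above (ρ = 0.514 for the urban-large environment of Table 8.30), the conversion of Equation (8.14) amounts to a fixed dB shift of the Eb/η curves; a small sketch, under the same No << (1 - ρ)·Io assumption as the text:

```python
import math

RHO = 0.514  # orthogonality factor, urban large (Table 8.30)

def eb_io_db(eb_eta_db, rho=RHO):
    # Equation (8.14) in dB: Eb/Io = (1 - rho) * Eb/eta becomes an
    # additive shift of 10*log10(1 - rho), valid when No << (1-rho)*Io
    # and the ISI is implicit in the chip-level simulation.
    return eb_eta_db + 10 * math.log10(1.0 - rho)

shift_db = 10 * math.log10(1.0 - RHO)  # the constant offset applied
```

For ρ = 0.514 the shift is about -3.1 dB, i.e. every Eb/η point in Figure 8.46(a) maps to an Eb/Io point 3.1 dB to the left.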
8.3.5 Performance Enhancement Techniques

8.3.5.1 Fast Power Control for Downlink

As can be seen in Figure 8.48, data transmission over the slow-speed pedestrian B propagation environment shows worse performance than over the high-speed propagation environment. This is mainly because the error-correcting methods rely on interleaving and block-based decoding, which do not work effectively in the presence of the long spells of weak channel conditions caused by Rayleigh fading at low mobile speeds. This condition (a long weak radio link) can be improved by the application of a fast power control algorithm.

Figure 8.49 BLER performance: solid line shows BLER vs Eb/No; dashed line shows BLER vs Eb/η

A closed-loop fast
Figure 8.50 Fast power control for UTRA-FDD downlink (block diagram: transmit power adjustment at the transmitter, wireless channel, rake receiver, transmit power decision-making unit, and feedback delay)
power control algorithm has been designed and incorporated in the simulator. The implementation and the resulting performance improvement are described below.

8.3.5.2 Algorithm Implementation

A block diagram of the implemented power control algorithm is depicted in Figure 8.50. According to the measured received pilot power, the UE generates appropriate transmit power control (TPC) commands (to adjust the transmit power up or down) to control the network transmit power, and sends them in the TPC field of the uplink dedicated physical control channel (DPCCH). The TPC command decision is made by comparing the average received pilot power (averaged over an integer number of slots to mitigate the effects of varying interference and noise) with the pilot power threshold, which is predefined by the UTRAN based on the outer-loop power control [48]. Upon receiving the TPC commands, the UTRAN adjusts its downlink DPCCH/DPDCH power according to Equation (8.19):

P(k) = P(k - 1) ± Δ_TPC    (8.19)
where P(k) denotes the downlink transmit power in the kth slot, and Δ_TPC is the power control step size. The sign of the adjustment is decided from the uplink TPC command. This algorithm is executed at a rate of 1500 times per second for each mobile connection. The settings used in the implementation are listed in Table 8.31. Note: the aggregated power control step is defined as the required total change in a code channel in response to ten consecutive power control commands.

Table 8.31 Power control parameter settings [48]

Power control step size                0.5 ± 0.25 dB
Aggregated power control step change   4-6 dB
Power averaging window size, n         4
Feedback delay                         3 slots
Algorithm execution frequency          1.5 kHz
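The per-slot update of Equation (8.19) can be sketched as below. The threshold comparison and the power limits follow the description in the text; the specific dBm values in the example call are illustrative, not the simulator's settings:

```python
def power_control_step(p_prev_dbm, pilot_dbm, threshold_dbm,
                       step_db=0.5, p_min_dbm=-10.0, p_max_dbm=30.0):
    """Equation (8.19): P(k) = P(k-1) +/- delta_TPC. The UE requests
    more power when the averaged received pilot falls below the
    threshold set by the outer-loop power control."""
    if pilot_dbm < threshold_dbm:
        p = p_prev_dbm + step_db   # TPC command "up"
    else:
        p = p_prev_dbm - step_db   # TPC command "down"
    # Transmit power cut-off at the node B maximum/minimum limits.
    return min(max(p, p_min_dbm), p_max_dbm)

# One slot: pilot 5 dB below threshold -> power raised by one step.
p_next = power_control_step(20.0, pilot_dbm=-100.0, threshold_dbm=-95.0)
```

Run at 1500 decisions per second, this loop tracks the slow Rayleigh fades at pedestrian speeds, which is exactly the regime where Figure 8.52 shows the largest gain.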
Figure 8.51 Characteristic of the fast power control algorithm
The control algorithm adjusts the power of the DPCCH and DPDCH; however, the relative power difference between the two is not changed. Figure 8.51 shows how the downlink closed-loop power control algorithm works on a fading channel at low vehicular speed. The node B transmit power varies in inverse proportion to the received pilot power, closely tracking the time-varying channel at low mobile speed. Transmit power cut-off values are defined by the maximum and minimum power limits set by the node B. The received power at the receiver shows very little residual fading. Figure 8.52 illustrates the performance of the power control algorithm for data transmission over the vehicular A propagation environment with spreading factor 16 and 1/3 rate convolutional coding. The experiment was carried out at three different mobile speed settings, namely 3, 50, and 120 kmph. The performance improvement from power control is evident at low speed, while at high mobile speed the improvement is largely insignificant, because a transmission diversity gain is already provided by the highly time-varying channel.
8.3.6 UMTS Radio Interface Data Flow Model
The designed physical link layer simulator alone provides a necessary experimental platform to examine the effects of the radio link upon the data transmitted through the physical channel. However, in order to investigate the effect of channel bit errors upon the end application, the application performance must be validated in an environment as close as possible to that of the real world. Therefore, not only the effect of the physical link layer but also the effect of UMTS protocol layer operation on multimedia performance should be investigated. A UMTS data flow model was designed in Microsoft Visual C++ to emulate the protocol layer behavior.
Visual Media Coding and Transmission

Figure 8.52 Fast power control algorithm performance (BLER versus Eb/No at 3, 50, and 120 km/h, with and without fast power control)
The design criteria follow a modular design strategy. Each of the protocol layers was implemented separately, and protocol interaction is performed through the specified interfaces. This allows individual protocol-layer optimization, or improvement and testing of novel performance-enhancement algorithms in the presence of a complete system. Here, only the protocol-layer effect on multimedia performance is considered. The protocol layers implemented include the application layer, transport layer, PDCP layer, RLC/MAC layer, and layer 1. The block diagram of the data flow model is shown in Figure 8.53. The effect of protocol headers on application performance was emulated by allocating dummy headers. The application consists of a full error-resilience-enabled MPEG-4 video source. In addition to the employed error-resilience techniques, the TM5 rate-control algorithm is used in order to achieve a smoother output bit rate. An adaptive intra refresh algorithm is also implemented to stop temporal error propagation and to achieve a smoother output bit rate. The output source bit rate is set according to the guaranteed bit rate, which is a user-defined QoS parameter. For packet-switched connections, encoded video frames are forwarded to the transport layer at regular intervals, as defined by the video frame rate. Each video frame is encapsulated into an independent RTP/UDP/IP packet [22] for forwarding down to the PDCP layer. The PDCP exists mainly to adapt transport-layer packets to the radio environment by compressing headers with negotiable algorithms [49]. The current version of the data flow model implements the resulting compressed header sizes, but not the actual header-compression algorithms. A more extensive and detailed examination of different header-compression algorithms and their effects on multimedia performance can be found in [50]. For circuit-switched connections, the output from the video encoder is directly forwarded to the RLC/MAC layer.
At the RLC layer, the forwarded information data is further segmented into RLC blocks. The size of an RLC block is defined by the transport block (TB) size, which is an implementation-dependent parameter. Optimal setting of the TB size should take into account many factors, such as application type, source traffic statistics, allowable frame delay jitter, and RLC buffer size.
Figure 8.53 UTRAN data flow model
The RLC block header size depends on the selected RLC mode for the transmission. Transparent mode adds no header, as it transmits higher-layer payload units transparently. Unacknowledged mode and acknowledged mode add 8-bit and 16-bit headers to each RLC block, respectively [34]. Apart from segmentation and the addition of a header field, other RLC layer functions such as error detection and retransmission of erroneous data are not implemented in the current version of the model, as the main use of the model is to investigate the performance of conversational-type multimedia applications. The MAC layer can be either dedicated or shared. The dedicated mode is responsible for handling dedicated channels allocated to a UE in connected mode, while the shared mode takes responsibility for handling shared channels. If channel multiplexing is performed at the MAC layer then a 4-bit MAC header is added to each RLC block [35]. Layer 1 attaches a CRC to the forwarded RLC/MAC blocks. According to the specified TTI length, higher-layer blocks are combined to form TTI blocks and stored in a buffer for further processing before transmission over the air interface. The number of higher-layer PDUs to be encapsulated within a TTI block depends on the selected channel coding scheme, the spreading factor, and the rate-matching ratio. In a practical system, the selection of channel coding scheme, TTI length, and CRC size is normally performed by the radio resource management algorithm according to the end-user quality requirement, application type, operating environment, system load, and so on. For experimental flexibility, these parameters are user-definable in the designed data flow model. An error-prone radio channel environment is emulated by applying bit errors generated by the physical link layer simulator to the information data at layer 1.
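The segmentation and mode-dependent header rule described above can be sketched as follows. The helper names and the example TB size in the usage are illustrative, not taken from the model itself; the header sizes follow the text (TM: 0 bits, UM: 8 bits, AM: 16 bits).

```cpp
#include <cassert>
#include <cstddef>

// Sketch of RLC segmentation: a higher-layer packet is split into
// transport-block-sized RLC blocks, each carrying a mode-dependent header.
enum class RlcMode { Transparent, Unacknowledged, Acknowledged };

inline std::size_t rlcHeaderBits(RlcMode m) {
    switch (m) {
        case RlcMode::Transparent:    return 0;
        case RlcMode::Unacknowledged: return 8;
        default:                      return 16;  // Acknowledged mode
    }
}

// Number of RLC blocks needed for a packet of `packetBits`, given the
// transport block size in bits (which must exceed the header size).
std::size_t rlcBlockCount(std::size_t packetBits, std::size_t tbSizeBits,
                          RlcMode m) {
    std::size_t payloadPerBlock = tbSizeBits - rlcHeaderBits(m);
    return (packetBits + payloadPerBlock - 1) / payloadPerBlock;  // ceiling
}
```

For example, a 3200-bit video packet carried in unacknowledged mode with a hypothetical 336-bit TB leaves 328 payload bits per block and therefore requires ten RLC blocks.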
The receiver side is emulated by reversing the described processing. Layer 1 segments the TTI blocks received over the simulated air interface into RLC/MAC blocks. After detaching CRC bits, RLC/MAC blocks are passed on to the RLC/MAC layer. At the RLC/MAC layer the received data is reassembled into PDCP data units for packet-switched connections. If IP/UDP/RTP headers are found to be corrupted, data encapsulated within the packet is dropped at the network layer. Finally, the received source data is displayed using an MPEG-4 decoder. This layered implementation of the UMTS protocol architecture allows the investigation of the effects of physical layer-generated bit errors upon different fields of the payload data units at each protocol layer. In other words, the data flow model can be used to map channel errors on to different PDU fields and to optimize protocol performance for the given application.
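The error-mapping behavior described above, where layer 1 bit errors are applied to the serialized data and a packet with a corrupted IP/UDP/RTP header is dropped at the network layer, can be sketched as follows. The structure and field sizes are illustrative assumptions, not details of the data flow model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: XOR a layer-1 bit-error pattern onto a transmitted packet and drop
// the packet if any bit of its (dummy) header field is corrupted.
struct RxPacket {
    std::vector<int> bits;
    bool dropped = false;
};

RxPacket receive(const std::vector<int>& txBits,
                 const std::vector<int>& errorPattern,  // 1 = bit error
                 std::size_t headerBits) {
    RxPacket rx;
    rx.bits.resize(txBits.size());
    for (std::size_t n = 0; n < txBits.size(); ++n)
        rx.bits[n] = txBits[n] ^ errorPattern[n];
    // Network-layer check: discard the payload if the header was hit.
    for (std::size_t n = 0; n < headerBits && n < txBits.size(); ++n)
        if (errorPattern[n]) { rx.dropped = true; break; }
    return rx;
}
```

This is the mechanism that lets the model map channel errors onto individual PDU fields: by recording which field each corrupted bit falls in, protocol behavior can be optimized per field.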
8.3.7 Real-time UTRAN Emulator
The above-described UMTS data flow model is integrated with the physical link layer model to form the UTRAN emulator. The emulator software suite provides a graphical user interface for connection setup, radio bearer configuration, and performance monitoring. The emulator treats the emulated system as a black box, whose input–output behavior is intended to reproduce the real system without requiring knowledge of the internal structure and processes. It has also been designed for accurate operation in real time with moderate implementation complexity. The emulator was implemented in Visual C++, as it provides a comprehensive graphical user interface design environment. Figure 8.54 depicts the block diagram of the designed emulator architecture. It consists of three main parts, namely the content server, the UMTS emulator, and the mobile client. An "MPEG-4 file transmitter" is used as the content server. It selects the corresponding video sequence, which is encoded according to the requested source bit rate, frame rate, and other error-resilience parameters, and transmits the video to the UMTS radio link emulator. At the emulator the received source data passes through the UMTS data flow model and the simulated physical link layer, and is finally transmitted to the mobile client for display. Here, a PC-based MPEG-4 decoder is used to emulate the display capabilities of the mobile terminal. The UMTS configuration options dialog box (Figure 8.55) is designed for interactive radio bearer configuration for a particular connection. The QoS parameter page shows the user-requested quality of service parameters, such as type of service, traffic class, data rates, residual bit error ratio, and transfer delay. In addition, operator control parameters, connection type, PDCP connection type, and the number of multiplexed transport channels are shown.
The transport channel parameter page for the data channel shows the transport channel-related network parameter settings (Figure 8.56). Logical channel type, RLC mode, MAC channel type, MAC multiplexing, layer 1 parameters, TTI, channel coding scheme, and CRC are user-definable emulator parameters, while the TB size and rate-matching ratio are calculated from the other input parameter values and displayed. If the number of multiplexed transport channels is set to 1 then the transport channel parameter page for the control channel is disabled. Otherwise it shows the transport channel parameters that are related to the control channel. The appropriate spreading factor for transmission is calculated based on the requested QoS parameters and is displayed on the physical/radio channel parameter page (Figure 8.57). Radio channel-related settings (carrier frequency, channel environment, mobile speed) and receiver
Figure 8.54 UMTS emulator architecture
characteristics (number of rake fingers, rake combining, diversity techniques, power control) are selected on the physical/radio parameter page. Figure 8.58 illustrates the user interfaces of the designed emulator. In addition to the radio bearer configuration parameter pages described so far, the emulator also displays the instantaneous performance in terms of Eb/No, Eb/Io, C/I, and BER. Furthermore, it allows interactive manipulation of the number of users in the cell (hence co-channel interference), and monitoring of the video performance in a more realistic operating environment.
8.3.8 Conclusion
The design and evaluation of a UMTS-FDD simulator for the forward link have been described. The CCSR model gives performance that satisfies the requirements shown in the 3GPP performance figures. Furthermore, the performance of the CCSR model closely follows the performance traces published by different terminal manufacturers on most test configurations. However, some performance variation was visible for operation over the case 2 propagation environment, and for the 144 kbps reference channel over the case 3 test environment. As mentioned earlier, there are several factors that could contribute to this. The most likely is the different implementation strategies followed in the receiver design and in channel decoding. Another
Figure 8.55 QoS parameter option page
Figure 8.56 Transport channel parameter option page
Figure 8.57 Physical/radio channel parameter option page
possible contributor is the different simulation techniques used for propagation modeling and interference modeling. The differences seen in the performance of turbo codes are greater than in the performance of convolutional codes, where the coding/decoding technology is fairly stable and consolidated. In addition, the performance of the LogMap algorithm implementation (provided in the SPW package) is highly sensitive to the amplitude of the decoder input signal. The input amplitude setting was based on the conducted experimental results and may cause slight performance degradation. Even though the CCSR model matches the reference performance traces, or comes very close to them under the particular bearer configurations tested, the bit error sequences generated by the CCSR model should be considered, to a certain extent, as worst-case performance figures. The designed quasi-ideal rake receiver employs non-ideal channel estimation based on weighted multislot averaging (WMSA) techniques. The settings of the WMSA filter parameters might not be optimal in varying simulation environments. Also, the employment of advanced power control techniques could result in improved performance over the less complex fast power control algorithm implemented. In fact, the CCSR model shows about a 3 dB performance loss relative to the results published by Olmos [51] using a non-ideal rake receiver and fast power control. The intention of the UMTS physical layer simulation model was to facilitate investigation of multimedia performance over the UMTS air interface. Although the performance test results were presented as block error ratio values, the physical link simulator produces bit error sequences to characterize the actual physical link layer. After integrating with the UMTS
Figure 8.58 Real-time UTRAN emulator
protocol data flow model, the physical link layer-generated bit error patterns can be used in audiovisual transmission experiments. In addition to the high degree of correlation shown between the performance of the CCSR model and the quoted reference figures, the generated bit error patterns are suitable for use in radio bearer optimization for multimedia communication, as they exhibit the relative differences between various network parameter and interference settings, regardless of the type of receiver architecture implemented.
8.4 WiMAX IEEE 802.16e Modeling

8.4.1 Introduction
This section features a discussion of the WiMAX IEEE 802.16e system. First, the function of the basic components in the WiMAX model is discussed. Second, simulation results are presented to study the performance of WiMAX for different mobile speeds, channel models, and channel coding schemes. Furthermore, this section outlines the developed WIMAX software for trace generation, the error pattern collection method, and the format of the error pattern.
In Subsection 8.4.2, the WiMAX physical layer system is outlined. Subsection 8.4.3 presents the results of WIMAX physical layer performance with different settings and different channel models. Finally, Subsection 8.4.4 discusses the error pattern generation format and parameters. Basic assumptions for the error pattern generation are also presented.
8.4.2 WIMAX System Description
The IEEE 802.16e-2005 standard [52] provides the specification of an air interface for fixed, nomadic, and mobile broadband wireless access systems with superior throughput performance. It enables non-line-of-sight reception, and can also cope with high mobility of the receiving station. The IEEE 802.16e extension enables nomadic capabilities for laptops and other mobile devices, allowing users to benefit from metro area portability of an xDSL-like service. The standard allows the physical layer to be scalable in bandwidth, ranging from 1.25 to 20 MHz with fixed subcarrier spacing, while at the same time providing advanced features such as adaptive modulation and coding (AMC), advanced antenna systems (AAS), a coverage-enhancing safety channel, convolutional turbo coding (CTC), and low-density parity check (LDPC) codes. This rich set of IEEE 802.16e features allows the equipment manufacturer to pick and mix the features to provide superior throughput performance, while at the same time allowing the system to be tailored to cater for restrictions in certain countries. The WIMAX forum is currently standardizing system profiles, which encompass subsets of IEEE 802.16e features considering country restrictions, while allowing interoperability between equipment from different companies. This subsection describes the implemented WIMAX baseband simulation model. The implemented features consider a broadcasting deployment scenario. The system specifications are outlined, and an overview of every component in the baseband system model is given.

8.4.2.1 System Specifications and Baseband System Model
Figure 8.59 shows the block diagram of the physical layer of the IEEE 802.16e standard. The block diagram specifies the processing of data streams. The components of the system are:
Figure 8.59 Physical layer of the IEEE 802.16e standard
- randomizer;
- FEC encoder: convolutional turbo code (CTC), repetition code, etc.;
- bit interleaver;
- data and pilot modulation;
- subchannel allocation: FUSC, PUSC, etc.;
- MIMO processing: space time coding (STC), spatial multiplexing (SM), etc.;
- FFT/IFFT: 2048, 1024, 512, 256 points.
Channel coding procedures include randomization, FEC encoding, bit interleaving, and repetition coding. When repetition codes are used, the allocation for the transmission will always include an even number of adjacent subchannels. The basic block passes through the regular coding chain, where the first subchannel sets the randomization seed. The data follows the coding chain up to the QAM (quadrature amplitude modulation) mapping. The data output from the QAM mapper is loaded onto the block of pre-allocated subchannels for transmission. The subchannel allocation follows one of the subcarrier permutation schemes, for example FUSC or PUSC. After that, multiple-antenna signal processing is applied if available in the system, and finally the data is passed to the OFDM transceiver (IFFT block) for transmission. A subset of the features of IEEE 802.16e is implemented according to the broadcasting scenario. The functionality and requirements of each block are specified below. The C++ software implementation of the baseband model uses the IT++ communication signal processing library [53].

Data Randomizer
Data randomization is performed on data transmitted on the downlink and uplink. The randomization is initialized on each FEC block, using the first subchannel offset and the OFDMA symbol offset on which that block is mapped. The symbol offset, for both UL and DL, is counted from the start of the frame, where the DL preamble is count 0. If the amount of data to transmit does not exactly fit the amount of data allocated, padding of 0xFF ("1" only) is added to the end of the transmission block, up to the amount of data allocated. Each data byte to be transmitted enters sequentially into the randomizer, MSB first. Preambles are not randomized. The seed value is used to calculate the randomization bits, which are combined in an XOR operation with the serialized bitstream of each FEC block, as shown in Figure 8.60. The randomizer sequence is applied only to information bits.
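The shift-register randomizer of Figure 8.60 can be sketched as a 15-stage LFSR. We assume here the 1 + x^14 + x^15 generator used by 802.16 randomizers; the bit ordering within the register and the example seed are illustrative, and in the standard the seed is derived from the subchannel and OFDMA symbol offsets of each FEC block.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the data randomizer: a 15-stage LFSR produces a pseudo-random
// sequence that is XORed with the information bits.
class Randomizer {
public:
    explicit Randomizer(std::uint16_t seed) : state_(seed & 0x7FFF) {}

    // One PRBS bit: XOR of the last two stages, shifted back into the register.
    int nextBit() {
        int b14 = (state_ >> 1) & 1;
        int b15 = state_ & 1;
        int fb = b14 ^ b15;
        state_ = static_cast<std::uint16_t>(((state_ >> 1) | (fb << 14)) & 0x7FFF);
        return fb;
    }

    // Randomize a bit sequence; the same operation de-randomizes it, since
    // XORing the identical PRBS twice restores the original bits.
    std::vector<int> apply(const std::vector<int>& bits) {
        std::vector<int> out(bits.size());
        for (std::size_t n = 0; n < bits.size(); ++n)
            out[n] = bits[n] ^ nextBit();
        return out;
    }

private:
    std::uint16_t state_;
};
```

Because randomization is its own inverse, the receiver simply runs the same LFSR from the same seed to recover the information bits.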
Convolutional Turbo Coding (CTC)
Figure 8.61 shows the convolutional turbo coding (CTC) encoder. Incoming bits are fed alternately into A and B. During the first encoding operation, A and B are connected to position 1 of the constituent encoder to generate the parity C1. In the second step, the interleaved A and B bits are connected to position 2 of the constituent encoder to generate the parity C2. A, B, C1, and C2 together form the mother codeword, which can be punctured to obtain different code rates for transmission. The constituent encoder uses a duo-binary circular recursive systematic convolutional (CRSC) code, where the encoding operation is performed on a pair of bits, hence "duo-binary". The code is circular in the sense that the ending state matches the starting state. The polynomials defining the constituent encoder connections are described in hexadecimal and binary symbol notations as follows:
Figure 8.60 Data randomizer
- For the feedback branch: 0xB; equivalently 1 + D + D^3.
- For the Y parity bit: 0xD; equivalently 1 + D^2 + D^3.
- For the W parity bit: 0x9; equivalently 1 + D^3.
The number of input bits depends on the number of slots allocated by the MAC layer to the user for transmission. Table 8.32 shows the number of bits per data slot. Concatenation of a number of data slots is performed in order to make larger blocks for coding whenever possible, with the limitation of not exceeding the largest block defined in Table 8.32 for the applied modulation and coding rate. A larger coding block improves the coding performance, but introduces higher decoding and computational complexity. To decode the above duo-binary CTC codes, the Max-Log-MAP algorithm [54] has been adopted.
Figure 8.61 Convolutional turbo coding (CTC)
Table 8.32 Encoding slots concatenation for different rates in CTC

Modulation and rate    Number of bits per data slot    Maximum number of concatenated slots
QPSK 1/2               48                              10
QPSK 3/4               72                              6
16-QAM 1/2             96                              5
16-QAM 3/4             144                             3
64-QAM 1/2             144                             3
64-QAM 2/3             192                             2
64-QAM 3/4             216                             2
64-QAM 5/6             240                             2
Interleaver
The interleaving operation ensures that adjacent coded bits are mapped onto nonadjacent OFDM subcarriers, and at the same time maps coded bits alternately onto less and more significant bits of the modulation constellation. The intention is to increase robustness and avoid long runs of low-reliability bits. The operation is divided into two permutation steps. The first permutation equation is:

i = (N_CBPS/16)(k mod 16) + ⌊k/16⌋,   k = 0, …, N_CBPS − 1   (8.20)

where i is the index after the first permutation, k is the index of the coded bit before the first permutation, and N_CBPS is the encoded block size. The second permutation, mapping index i to j, is:

j = s⌊i/s⌋ + (i + N_CBPS − ⌊16i/N_CBPS⌋) mod s,   i = 0, …, N_CBPS − 1   (8.21)

where s = max(N_BPSC/2, 1), with N_BPSC being the number of bits per modulation symbol, for example two bits for QPSK. Note that the interleaver is not used for CTC.

Modulation
Permutation definition
Figure 8.62 shows the PRBS used to produce a random sequence, w_k, that is used for subcarrier randomization (after QAM mapping) and for pilot modulation in the following. The polynomial for the PRBS generator is x^11 + x^9 + 1. The initialization vector for the memory follows the steps in [52].
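Returning to the two-step interleaver of Equations (8.20) and (8.21), the permutation can be sketched directly, with ⌊·⌋ realized as integer division. The function name is ours; the index arithmetic follows the equations as written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the two-step block interleaver of Equations (8.20) and (8.21).
// n_cbps: encoded block size in bits; n_bpsc: bits per modulation symbol.
// Returns map[k] = final transmit position j of coded bit k.
std::vector<std::size_t> interleaverMap(std::size_t n_cbps,
                                        std::size_t n_bpsc) {
    std::size_t s = std::max<std::size_t>(n_bpsc / 2, 1);
    std::vector<std::size_t> map(n_cbps);
    for (std::size_t k = 0; k < n_cbps; ++k) {
        // First permutation: spread adjacent coded bits across subcarriers.
        std::size_t i = (n_cbps / 16) * (k % 16) + k / 16;
        // Second permutation: alternate between more and less significant
        // constellation bit positions.
        std::size_t j = s * (i / s) + (i + n_cbps - (16 * i) / n_cbps) % s;
        map[k] = j;
    }
    return map;
}
```

For QPSK, s = 1 and the second permutation is the identity, so only the subcarrier-spreading step has an effect; both steps are bijections, so the map is a valid permutation of the block.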
Figure 8.62 PRBS for pilot modulation
Figure 8.63 QPSK, 16-QAM, and 64-QAM constellation points
Data modulation
OFDM subcarriers are modulated using QPSK, 16-QAM, and 64-QAM constellations. The encoded and interleaved serial input data is divided into groups of N_BPSC (i.e. 2, 4, or 6) bits, which are then converted into a complex number, I + jQ, representing either a QPSK, 16-QAM, or 64-QAM constellation point. The mappings for QPSK, 16-QAM, and 64-QAM are shown in Figure 8.63. Finally, the resulting complex value is normalized by multiplying it by the normalization factor, K_MOD, specified in Table 8.33.

Table 8.33 Modulation-dependent normalization factor

Modulation    K_MOD
QPSK          1/√2
16-QAM        1/√10
64-QAM        1/√42

The constellation-mapped data is subsequently modulated onto the allocated data subcarriers, and each subcarrier is multiplied by the factor 2(1/2 − w_k) according to the subcarrier index k. w_k is derived using the method described above for the permutation definition.

Pilot modulation
In the downlink, the pilot is transmitted with a boosting of 2.5 dB over the average non-boosted power of each data tone. The pilot subcarriers are modulated with the sequence w_k, defined earlier, using the following equation:

Re{c_k} = (8/3)(1/2 − w_k),   Im{c_k} = 0   (8.22)

In the downlink, for PUSC, FUSC, AMC, and optional FUSC permutations, all pilots (of the segment, in the case of PUSC) will be modulated, whether or not all the subchannels are allocated in the DL-MAP. For AMC permutation in the AAS zone, the BS is not required to modulate the pilots that belong to bins not allocated in the DL-MAP, or allocated as gaps (UIUC = 13).

Subchannel Allocation
There are many types of subchannel allocation mechanism, grouped according to whether the transmission is uplink or downlink. For the downlink, the two main subchannel allocation methods are:

- Partial usage of subchannels (PUSC), where some of the subchannels are allocated to the transmitter.
- Full usage of subchannels (FUSC), where all of the subchannels are allocated to the transmitter.
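Returning to the data and pilot modulation steps above, they can be sketched as follows. This is a minimal illustration with function names of our choosing; only the K_MOD scaling of Table 8.33, the 2(1/2 − w_k) subcarrier randomization, and the pilot value of Equation (8.22) are modeled.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Modulation-dependent normalization factor K_MOD (Table 8.33), giving all
// constellations unit average power.
inline double kMod(int bitsPerSymbol) {
    switch (bitsPerSymbol) {
        case 2:  return 1.0 / std::sqrt(2.0);   // QPSK
        case 4:  return 1.0 / std::sqrt(10.0);  // 16-QAM
        default: return 1.0 / std::sqrt(42.0);  // 64-QAM
    }
}

// Data subcarrier: normalize the constellation point I + jQ, then apply the
// subcarrier randomization factor 2*(1/2 - w_k), i.e. +1 or -1 for w_k in {0,1}.
std::complex<double> modulateSubcarrier(std::complex<double> iq,
                                        int bitsPerSymbol, int wk) {
    double randomize = 2.0 * (0.5 - wk);
    return iq * kMod(bitsPerSymbol) * randomize;
}

// Pilot subcarrier, Equation (8.22): Re{c_k} = (8/3)*(1/2 - w_k), Im{c_k} = 0.
std::complex<double> pilotValue(int wk) {
    return {(8.0 / 3.0) * (0.5 - wk), 0.0};
}
```

The 8/3 amplitude factor in the pilot value corresponds to the 2.5 dB power boost of the pilots over the average non-boosted data-tone power.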
FUSC employs full channel diversity by distributing data subcarriers to subchannels using a permutation mechanism designed to minimize interference between cells. It is somewhat similar to the idea of the classical frequency hopping technique. For PUSC, subchannels are divided and assigned to three segments, which can be allocated to sectors of the same cell. As with FUSC, a permutation mechanism is applied to allocate subcarriers to subchannels to harvest the interference averaging and fast-fading averaging effects. See [52] for details of the permutation mechanism. Figure 8.64 shows the PUSC data frame, consisting of L subchannels across the time interval. Data can be transmitted over the subchannels depicted in the figure. In PUSC mode, a data slot is composed of one subchannel and two OFDMA symbols. The data region is the allocated area for user data transmission. The mapping of encoded data blocks onto the subchannels is depicted in Figure 8.64. The mapping follows the order in Figure 8.64 (vertical direction first, then proceed to the next two OFDMA time symbols).
Figure 8.64 Example of mapping encoded data blocks to subchannels in downlink PUSC mode [52]
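The vertical-first mapping order of Figure 8.64 can be sketched as a simple index calculation; the function name is ours, and the data region is assumed to start at subchannel 0 and slot column 0 of the allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Sketch of the PUSC mapping order of Figure 8.64: encoded data slots fill the
// data region subchannel-first (vertically), then advance to the next slot
// column, i.e. the next pair of OFDMA symbols. Returns (subchannel index,
// slot column) for the n-th data slot of the region.
std::pair<std::size_t, std::size_t> slotPosition(std::size_t n,
                                                 std::size_t numSubchannels) {
    return {n % numSubchannels, n / numSubchannels};
}
```

With the configuration of Table 8.36, the DL subframe forms a 30-subchannel by 13-column slot matrix, so slot 30 wraps back to subchannel 0 of the second column.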
Table 8.34 FUSC FFT parameters

System parameter            Description
FFT size                    128     512     1024    2048
No. of guard subcarriers    22      86      173     345
No. of used subcarriers     106     426     851     1703
No. of data subcarriers     96      384     768     1536
No. of pilot subcarriers    9       42      83      166
IFFT/FFT
The IFFT block transforms the data from the frequency to the time domain, using an inverse fast Fourier transform at the transmitter, while the FFT performs the reverse operation at the receiver. A highlight of the IEEE 802.16e standard is the idea of scalable OFDMA (SOFDMA), where the FFT size can be adjusted while the subcarrier frequency spacing is fixed at a particular value. This is advantageous for supporting a wide range of bandwidths in order to flexibly address the need for various spectrum allocations, ranging from 1.25 MHz to 20 MHz. The relevant FFT parameters for the FUSC and PUSC schemes are shown in Tables 8.34 and 8.35, respectively.
8.4.3 Physical Layer Simulation Results and Analysis
This subsection describes the physical layer performance of the WIMAX system. The simulation parameters used are the WIMAX system parameters, which can be found in Table 8.36. Other simulation parameters and assumptions are:

- ITU vehicular A and vehicular B channel models [55].
- A spectral mask of 5 dBc/Hz flat up to 10 kHz, and then reducing 20 dB/dec up to 120.
- Perfect channel knowledge.
Unless otherwise specified, the parameters above are assumed for the simulation results presented below.

Table 8.35 PUSC FFT parameters

System parameter            Description
FFT size                    128     512     1024    2048
No. of guard subcarriers    43      91      183     367
No. of used subcarriers     85      421     841     1681
No. of data subcarriers     72      360     720     1440
No. of pilot subcarriers    12      60      120     240

Table 8.36 System parameters of the WIMAX platform

System parameter           Description
Duplexing                  TDD
Multiple access            OFDMA
Subcarrier permutation     PUSC
Carrier frequency          2.3 GHz
Channel bandwidth          8.75 MHz
FFT                        1024
Subcarrier spacing         9.765625 kHz
Symbol duration, TS        102.4 μs
Cyclic prefix, TG          12.8 μs
OFDM duration              115.2 μs
TDD frame length           5 ms
No. of symbols in a frame  42
DL/UL ratio                27/15
TTG/RTG                    121.2/40.4 μs

8.4.3.1 Performance of WIMAX at Different Mobile Speeds
Figure 8.65 shows the performance of PUSC schemes for mobile speeds of 60 and 100 kmph in the ITU vehicular A channel. The FEC block sizes are indicated in the graph legends, and lie in the range of 384–480 bits, depending on the modulation and coding setting. It can be seen that WIMAX is fairly robust to Doppler spread and the performance loss is minimal. Figure 8.66 further shows the BER performance at a mobile speed of 150 kmph. Comparing Figure 8.65 and Figure 8.66, the performance loss is still small, even when 64-QAM 1/2 is used. Figure 8.67 further shows the results for ITU vehicular A BER and PER at 100 kmph using FEC blocks of 144 and 192 bits. When compared to Figure 8.65, there is a small performance loss. This is due to the small block size being used; for the class of CTC codes, larger block sizes give a better BER/PER performance. Nevertheless, the performance is still robust to Doppler spread, even at 100 kmph.
8.4.4 Error Pattern Files Generation 8.4.4.1 Data Flow over the Physical Layer Figure 8.69 shows the physical layer time division duplex (TDD) frame of the WIMAX system. It contains downlink (DL) and uplink (UL) subframes with various regions for performing protocol functions and data transmissions. Basically, the DL data burst #x region is the
Figure 8.65 BER and PER performance of the PUSC scheme at mobile speeds of 60 kmph and 100 kmph
Figure 8.66 BER performance of PUSC scheme at mobile speed 150 km/h
Figure 8.67 BER and PER performance of the PUSC scheme at 100 kmph

Figure 8.68 BER and PER performance with ITU vehicular A and vehicular B channels
Figure 8.69 WIMAX physical layer TDD frame. Reproduced by permission of © 2004, 2007 IEEE
allocated data region where a user can transmit data. As shown in Figure 8.69, a data burst generated by users is placed in the correct DL data burst, depending on the MAC scheduler allocation decision. The allocation decision is transmitted in the DL-MAP section of the DL subframe.

8.4.4.2 Pregenerated Trace Format
(Portions reprinted, with permission, from C.H. Lewis, S.T. Worrall, A.M. Kondoz, "Hybrid WIMAX and DVB-H emulator for scalable multiple descriptions video coding testing", International Symposium on Consumer Electronics (ISCE 2007), June 20–23, 2007, Dallas, Texas, USA. © 2007 IEEE.)
The developed WIMAX simulator has been compared to the literature for validation. It has been used to generate error pattern files. The format of the trace matches the TDD frame format. The system parameters are shown in Table 8.36. As PUSC has been used, one data slot is equal to one subchannel by two time symbols. Thus, the DL subframe is a matrix of 30 × 13 data slots, excluding the preamble symbol. In order to reduce data storage requirements, the error pattern is saved in the form of a data-slot error pattern instead of a bit error pattern. The data-slot error pattern is obtained by comparing all the data bits within an original data slot to the transmitted data slot. If there is any bit error within the data slot, it is declared as an error. Note that we have not specifically assumed any IP packet size or data burst size within a physical layer frame. Also, no MAC layer packet encapsulation procedures have been performed. This decision was made to allow flexibility in the choice of packet size and data throughput during video transmission simulation. Data slots can be aggregated to different MAC frame sizes for different packet sizes, and different burst sizes for different data throughputs. This also allows the study of efficient packetization schemes for video transmission. The maximum FEC code block size has been assumed for the trace generation.
The maximum code block size is MCS mode-dependent, as defined in the IEEE 802.16e-2005 standard.
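The data-slot error-pattern reduction described above can be sketched as follows. This is an illustrative assumption of how the reduction might be coded, not the simulator's actual implementation; the slot size in the example and the function name are made up for illustration.

```python
# Hedged sketch of the data-slot error-pattern reduction: collapse a
# bit-level comparison into one flag per data slot (1 = any bit error
# in the slot, 0 = slot received intact). Constants reflect the PUSC
# DL subframe geometry described in the text.
SUBCHANNELS = 30   # DL subframe height in data slots
SYMBOL_PAIRS = 13  # DL subframe width in data slots (2 symbols per slot)

def slot_error_pattern(tx_bits, rx_bits, slot_bits):
    """Compare transmitted and received bits slot by slot and return
    one 0/1 flag per data slot."""
    assert len(tx_bits) == len(rx_bits)
    pattern = []
    for start in range(0, len(tx_bits), slot_bits):
        tx_slot = tx_bits[start:start + slot_bits]
        rx_slot = rx_bits[start:start + slot_bits]
        pattern.append(1 if tx_slot != rx_slot else 0)
    return pattern

# Example: 4 slots of 6 bits each, with a single bit error in the third slot.
tx = [0, 1] * 12
rx = list(tx)
rx[13] = 1 - rx[13]
print(slot_error_pattern(tx, rx, 6))   # -> [0, 0, 1, 0]
print(SUBCHANNELS * SYMBOL_PAIRS)      # -> 390 flags for a full DL subframe
```

Storing one flag per slot instead of per bit is what keeps the trace files small, at the cost of declaring a whole slot lost on a single bit error.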
Table 8.37 Parameters used for trace generation

Parameter          Values
Length of trace    15 s
Permutation        PUSC
Channel coding     CTC
Terminal speed     60, 120 kmph
Test environment   ITU vehicular A
MCS mode           QPSK 1/2, QPSK 3/4, 16-QAM 1/2, 16-QAM 3/4, 64-QAM 1/2
SNR range          0–30 dB, which will have 5–7 data points for each MCS mode
For error pattern generation, the system parameters in Table 8.36 and the simulation parameters in Table 8.37 are used. Nevertheless, error patterns for other scenarios may be generated easily with the WIMAX simulator. For each MCS mode, 5–7 data points, each representing different SNR/BER levels, are generated. In the generated error pattern files, symbol 1 refers to a data slot error while symbol 0 means that there is no error.
8.5 Conclusions

Wireless communication technologies have experienced rapid growth and commercial success during the last decade. Driven by the powerful vision of being able to communicate from anywhere, at any time, with any type of data, the integration of multimedia and mobile technologies is currently under way. Third- and beyond-third-generation communication systems will support a wide range of communication services, in a variety of formats, for mobile users in any geographical location, including audiovisual services and applications.

This chapter has described the design of channel simulators for GPRS/EGPRS and UMTS radio access networks. The simulators can be used to investigate resource-allocation and quality-enhancement methods for the transmission of audiovisual services over heterogeneous radio access networks. The design and implementation of a physical link layer simulation model of the GPRS and EGPRS packet data channels was presented. The model was integrated with a GPRS radio interface data flow model to simulate GERAN radio access networks. The design of the UMTS radio access network simulator was also presented. A UMTS-FDD forward-link physical layer simulation model was designed using the Signal Processing WorkSystem (SPW) software simulation tools. The model includes all the radio configurations, channel coding/decoding, modulation parameters, transmission modeling, and the corresponding data rates for a dedicated physical channel according to the UMTS specifications. The physical link simulator generated bit error patterns corresponding to various radio bearer configurations. These error patterns were integrated with the UMTS protocol data flow model, designed in Visual C++. This integration was implemented in a real-time emulator so as to allow interactive monitoring of the effects of network parameter settings upon the received multimedia quality.
The developed simulators match the reference performance traces, or their performance is within the operational margins specified by the relevant standardization bodies. However, the bit error sequences generated by the simulators should be considered, to a certain extent, as worst-case performance figures. The receivers implemented in the simulators are custom-designed and are based on less complex, non-ideal channel estimation. The parameter settings for these channel estimators and receivers may not be optimal compared with commercially available advanced receiver architectures. The performance of the developed simulators may be further improved through the use of advanced receiver architectures and receiver/transmitter diversity techniques.

The intention of the physical-layer simulation models was to facilitate performance investigation of audiovisual applications over the heterogeneous wireless interface. The physical link simulators produce bit error sequences that reflect the characteristics of the actual physical link layer. These bit error sequences are integrated with the higher-protocol-layer data flow models to emulate the effect of end-to-end communication systems. The emulated systems allow the end-to-end quality of audiovisual applications to be investigated over a number of wireless communication systems. Although the performance of the developed simulators correlates strongly with the quoted reference figures, the generated bit error patterns remain suitable for use in radio bearer optimization for audiovisual communication, as they exhibit relative differences between various network parameter and interference settings, regardless of the receiver architecture implemented.

This chapter also described the WIMAX simulation models developed. An overview of the WIMAX baseband model and its specification was provided. The implemented functions are a subset of the IEEE 802.16e standard, as not all the features are relevant or necessary for the purposes of the study. The physical link-level performance of WIMAX was presented.
The simulations carried out at up to 150 kmph show that WIMAX is fairly robust to high vehicular speeds (Doppler spread). Some performance degradation occurs with the existing system parameters when the popular GSM/3G, large-cell-size, ITU vehicular B channel is considered for 16-QAM 3/4 mode and above. This is because the defined parameters target smaller cell sizes (smaller delay spread) than the ITU vehicular B channel. Nevertheless, the system is still robust for 16-QAM 1/2 mode and below. A comparison of different coding schemes, convolutional coding (CC), convolutional turbo coding (CTC), and low-density parity check (LDPC) coding, was made. It was shown that the gain of LDPC over CTC is marginal (less than 1 dB), and only when the block size is large. In an OFDMA system, where the allocated resource unit can be small, LDPC therefore offers no significant advantage. The implemented trace-generation software and the format of the error pattern were also described. The generated trace considers QPSK 1/2, QPSK 3/4, 16-QAM 1/2, 16-QAM 3/4, and 64-QAM 1/2 for the ITU vehicular A channel. Nevertheless, other parameters can easily be considered with the simulator.
8.6 Appendix: Eb/No and DPCH_Ec/Io Calculation

\[ E_b = \frac{\mathit{total\_signal\_power}}{R_b}, \qquad N_o = \frac{2\sigma^2}{R_C \cdot \mathit{ch\_os}} \]

where \(R_C\) and \(R_b\) are the chip rate and channel bit rate, respectively, and \(\mathit{ch\_os}\) denotes the channel oversampling factor. Setting \(\mathit{total\_signal\_power}\) to 1, \(E_b/N_o\) becomes:

\[ \frac{E_b}{N_o} = \frac{R_C \cdot \mathit{ch\_os}}{2\, R_b\, \sigma^2} \]

For the DPCH_Ec/Io calculation, \(\hat{I}_{or}\) is the received power spectral density of the downlink as measured at the UE antenna connector, \(I_{oc}\) is the power spectral density of a band-limited white noise source as measured at the UE antenna connector, \(I_{or}\) is the total transmit power spectral density of the downlink at the Node B antenna connector, and DPCH_Ec is the average energy per chip for the DPCH, \(\mathit{DPCH\_E_c} = \mathit{transmit\_signal\_power}/\mathit{chip\_rate}\). Assume no path loss. Then:

\[ E_b = \frac{\mathit{transmit\_signal\_power}}{\mathit{bit\_rate}} = \frac{\mathit{DPCH\_factor} \cdot \hat{I}_{or} \cdot \mathit{chip\_rate}}{\mathit{bit\_rate}} \]

The total noise power is the power on the OCNS plus the AWGN:

\[ \mathit{total\_noise\_power} = (\mathit{OCNS\_factor} \cdot \hat{I}_{or} + I_{oc}) \cdot \mathit{chip\_rate} \]

Code orthogonality is assumed, so:

\[ N_o = \frac{\mathit{total\_noise\_power}}{\mathit{chip\_rate} \cdot \mathit{ch\_os}} \]

Hence:

\[ \frac{E_b}{N_o} = \frac{\mathit{DPCH\_factor} \cdot \hat{I}_{or} \cdot \mathit{chip\_rate}^2 \cdot \mathit{ch\_os}}{\mathit{bit\_rate} \cdot (\mathit{OCNS\_factor} \cdot \hat{I}_{or} + I_{oc}) \cdot \mathit{chip\_rate}} = \frac{\mathit{DPCH\_factor} \cdot \hat{I}_{or} \cdot \mathit{chip\_rate} \cdot \mathit{ch\_os}}{\mathit{bit\_rate} \cdot (\mathit{OCNS\_factor} \cdot \hat{I}_{or} + I_{oc})} \]

\[ \frac{E_b}{N_o} = \frac{\dfrac{\mathit{DPCH\_E_c}}{I_{or}} \cdot \mathit{chip\_rate} \cdot \mathit{ch\_os}}{\mathit{bit\_rate} \cdot (\mathit{OCNS\_factor} + I_{oc}/\hat{I}_{or})} \]

and, as \(I_{oc}/\hat{I}_{or} \gg \mathit{OCNS\_factor}\):

\[ \frac{E_b}{N_o} \approx \frac{\dfrac{\mathit{DPCH\_E_c}}{I_{or}} \cdot \mathit{chip\_rate} \cdot \mathit{ch\_os}}{\mathit{bit\_rate} \cdot \dfrac{I_{oc}}{\hat{I}_{or}}} \]
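As a rough numeric sanity check of the final approximation, the sketch below evaluates Eb/No for example values. The UMTS chip rate of 3.84 Mcps is standard; the DPCH_Ec/Ior, bit rate, Ioc/Îor, and ch_os values are illustrative assumptions, not figures from the text.

```python
import math

# Hedged numeric check of Eb/No ~ (DPCH_Ec/Ior) * chip_rate * ch_os
#                                 / (bit_rate * Ioc/Ior_hat),
# valid under the assumption Ioc/Ior_hat >> OCNS_factor.
def eb_no_db(dpch_ec_over_ior, chip_rate, ch_os, bit_rate, ioc_over_ior_hat):
    """Return Eb/No in dB for the given (linear) power ratios."""
    ratio = dpch_ec_over_ior * chip_rate * ch_os / (bit_rate * ioc_over_ior_hat)
    return 10.0 * math.log10(ratio)

# Example: DPCH_Ec/Ior = -16.6 dB, 12.2 kbps bearer, Ioc/Ior_hat = 0 dB.
print(eb_no_db(dpch_ec_over_ior=10 ** (-16.6 / 10),
               chip_rate=3.84e6, ch_os=1,
               bit_rate=12.2e3, ioc_over_ior_hat=1.0))
```

The chip-rate-to-bit-rate ratio is the spreading processing gain, which is why a DPCH can operate at a chip-level energy far below the noise floor.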
References

[1] M. Nilsson, "Third generation radio access standards," Ericsson Review, No. 3, 1999.
[2] T. Nilsson, "Toward third generation mobile multimedia communication," Ericsson Review, No. 3, 1999.
[3] H. Granbohm and J. Wikulund, "GPRS: general packet radio service," Ericsson Review, No. 2, 1999.
[4] P. Stuckmann, The GSM Evolution: Mobile Packet Data Services, John Wiley & Sons, Ltd., 2003.
[5] B. Sarikaya, "Packet mode in wireless networks: overview of transition to third generation," IEEE Communications Magazine, Vol. 38, No. 9, pp. 164–172, Sep. 2000.
[6] 3rd Generation Partnership Project, "Technical specification: group services and system aspects: general packet radio service (GPRS): service description; stage 2 (release 4)," 3GPP TS 23.060, v. 4.0.0, March 2001.
[7] 3rd Generation Partnership Project, "GSM/EDGE: overall description of the GPRS radio interface: stage 2," TS 03.64, v. 8.10.0, 2002.
[8] http://www.chronologic.com/products/dsp/cossapcs.html.
[9] J.G. Proakis, Digital Communications, McGraw-Hill, London, 1995.
[10] GSM, "ETSI EN 300 909 V8.5.0 (2000-07) European standard (telecommunications series), digital cellular telecommunications system (phase 2+): channel coding," GSM 05.03, v. 8.5.0, Jul. 2000.
[11] 3rd Generation Partnership Project, "Technical specification group GERAN: digital cellular telecommunications system (phase 2+): modulation," 3GPP TS 05.04, v. 8.2.0, Jan. 2001.
[12] ETSI/SMG, "Overall description of the GPRS radio interface: stage 2," GSM 03.64, v. 5.2.0, 1998.
[13] 3rd Generation Partnership Project, "Technical specification group GERAN: digital cellular telecommunications system (phase 2+): radio transmission and reception," TS 05.05, v. 8.8.0, Jan. 2001.
[14] GSM, "European standard (telecommunications series), digital cellular telecommunications system (phase 2+): background for RF requirements," GSM 05.50, v. 8.2.0, Mar. 2000.
[15] Lucent Technologies, "Proposal for EDGE EGPRS receiver performance values in GSM 05.05," Tdoc SMG2 1566/99, Nov. 1999.
[16] Ericsson, "EGPRS receiver performance for BTS," Tdoc SMG2 EDGE 561/99, Dec. 1999.
[17] EDGE Drafting Group, "Working assumption for receiver performance requirements," Tdoc SMG2 EDGE 401/99, Aug. 1999.
[18] Ericsson, Motorola, Nokia, "Change request GSM 05.05: EGPRS receiver performance for MS DCS 1800 and PCS 1900," Tdoc SMG2 060/00, Jan. 2000.
[19] EDGE Drafting Group, "Outcome of drafting group on MS EGPRS Rx performance," Tdoc SMG2 086/00, Jan. 2000.
[20] Ericsson, "Proposed values for 05.05 receiver performance for BTS," Tdoc SMG2 EDGE 564/99, Dec. 1999.
[21] R. Talluri, "Error-resilient video coding in the ISO MPEG-4 standard," IEEE Communications Magazine, pp. 112–119, Jun. 1998.
[22] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: a transport protocol for real-time applications," Network Working Group, RFC 1889, 1996.
[23] C. Burmeister, M. Degermark, H. Hannu, L. Jonsson, R. Hakenberg, et al., "RObust Header Compression (ROHC)," IETF Internet draft, Feb. 2001, expires Aug. 2001.
[24] S. Casner, V. Jacobson, T. Koren, B. Thompson, D. Wing, P. Ruddy, et al., "Compressing IP/UDP/RTP headers for low-speed serial links," IETF Internet draft, Nov. 2000, expires Jun. 2001.
[25] 3rd Generation Partnership Project, "Technical specification group core network: digital cellular telecommunications system (phase 2+): general packet radio service (GPRS): mobile station (MS) – serving GPRS support node (SGSN): subnetwork dependent convergence protocol (SNDCP) (release 1999)," TS 04.65, v. 8.1.0, Sep. 2000.
[26] 3rd Generation Partnership Project, "Technical specification group core network: digital cellular telecommunications system (phase 2+): general packet radio service (GPRS): mobile station (MS) – serving GPRS support node (SGSN): logical link control (LLC) layer specification (release 1999)," TS 04.64, v. 8.6.0, Dec. 2000.
[27] http://www.cadence.com/datasheets/fpga-design.html.
[28] TSGR4#7(99)578, TSG-RAN working group 4 (radio) meeting #7 AH01, Noordwijkerhout, 30 Sep.–1 Oct. 1999.
[29] TSGR4#7(99)581, TSG-RAN working group 4 (radio) meeting #7 AH01, Noordwijkerhout, 30 Sep.–1 Oct. 1999.
[30] 3rd Generation Partnership Project, "Technical specification group radio access network: user equipment (UE) radio transmission and reception (FDD) (release 4)," TS 25.101, v. 4.10.0, Mar. 2002.
[31] H. Holma and A. Toskala, WCDMA for UMTS: Radio Access for Third Generation Mobile Communications, John Wiley & Sons, Ltd., revised edition, 2001.
[32] 3rd Generation Partnership Project, "Radio interface protocol architecture," TS 25.301, v. 4.4.0, Sep. 2002.
[33] L. Qiu, Y. Huang, and J. Zhu, "Fast acquisition scheme and implementation of PRACH in WCDMA system," Vehicular Technology Conference, Vol. 3, pp. 1701–1705, Oct. 2001.
[34] 3rd Generation Partnership Project, "Technical specification group terminals: radio link control (RLC) protocol specification (release 4)," TS 25.322, v. 4.7.0, Jan. 2003.
[35] 3rd Generation Partnership Project, "Technical specification group terminals: medium access control (MAC) protocol specification (release 4)," TS 25.321, v. 4.7.0, Jan. 2003.
[36] "Universal mobile telecommunications system (UMTS); selection procedures for the choice of radio transmission technologies of the UMTS (UMTS 30.03 version 3.2.0)," TR 101 112, v. 3.2.0, Apr. 1998.
[37] 3rd Generation Partnership Project, "Technical specification group terminals: common test environments for user equipment (UE) conformance testing (release 4)," TS 34.108, v. 4.7.0, Jun. 2003.
[38] 3rd Generation Partnership Project, "Technical specification group radio access network: multiplexing and channel coding (FDD) (release 4)," TS 25.212, v. 4.6.0, Sep. 2002.
[39] 3rd Generation Partnership Project, "Technical specification group radio access network: spreading and modulation (FDD) (release 4)," TS 25.213, v. 4.3.0, Jun. 2002.
[40] S. Saunders, Antennas and Propagation for Wireless Communication Systems, John Wiley & Sons, Ltd., 1999.
[41] 3rd Generation Partnership Project, "Technical specification group radio access network: physical channels and mapping of transport channel on to physical channel (FDD) (release 4)," TS 25.211, v. 4.6.0, Sep. 2002.
[42] B. Vucetic and J. Yuan, Turbo Codes: Principles and Applications, The Springer International Series in Engineering and Computer Science, Springer, 1st edition, Jan. 2000.
[43] K. Higuchi, H. Andoh, K. Okawa, M. Sawahashi, and F. Adachi, "Experimental evaluation of combined effect of coherent RAKE combining and SIR-based fast transmit power control for reverse link of DS-CDMA mobile radio," IEEE Journal on Selected Areas in Communications, Vol. 18, No. 8, pp. 1526–1535, Aug. 2000.
[44] J.J. Olmos and S. Ruiz, "Chip level simulation of the downlink in UTRA-FDD," 11th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Vol. 2, pp. 1469–1473, Sep. 2000.
[45] 3rd Generation Partnership Project, "Technical specification group services and system aspects: quality of service (QoS) concept and architecture (release 4)," TS 23.107, v. 4.6.0, Jan. 2003.
[46] 3rd Generation Partnership Project, "Physical layer procedures (FDD)," TS 25.214, v. 4.6.0, Apr. 2003.
[47] M. Hunukumbure, M. Beach, and B. Allen, "Downlink orthogonality factor in UTRA FDD systems," Electronics Letters, Vol. 38, No. 4, pp. 196–197, Feb. 2002.
[48] 3rd Generation Partnership Project, "Technical specification group terminals: base station (BS) radio transmission and reception (FDD) (release 4)," TS 25.104, v. 4.4.0, Mar. 2002.
[49] 3rd Generation Partnership Project, "Technical specification group terminals: packet data convergence protocol (PDCP) specification (release 4)," TS 25.323, v. 4.6.0, Sep. 2002.
[50] A. Cellatoglu, "Adaptive header compression techniques for mobile multimedia networks," PhD thesis, University of Surrey, UK, Feb. 2003.
[51] J.J. Olmos and S. Ruiz, "Transport block error rates for UTRA-FDD downlink with transmission diversity and turbo coding," 13th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Portugal, Sep. 2002.
[52] "IEEE standard for local and metropolitan area networks, part 16: air interface for fixed and mobile broadband wireless access systems, amendment 2: physical and medium access control layers for combined fixed and mobile operation in licensed bands and corrigendum 1," IEEE 802.16e-2005.
[53] "IT++ signal processing library, version 3.10.2," http://itpp.sourceforge.net, May 2006.
[54] M. Reza Soleymani, Y. Gao, and U. Vilaipornsawai, Turbo Coding for Satellite and Wireless Communications, Kluwer Academic Publishers, 2002.
[55] "Fixed and mobile channel models identifications," WP2.1 SUIT Project Deliverable, Jul. 2006.
[56] B. Baumgartner, M. Reinhardt, G. Richter, and M. Bossert, "Performance of forward error correction for IEEE 802.16e," 10th International OFDM Workshop, Hamburg, Germany, Aug. 2005.
[57] C. Eklund, R.B. Marks, S. Ponnuswamy, K.L. Stanwood, and N.J.M.V. Waes, "WirelessMAN: inside the IEEE 802.16 standard for wireless metropolitan networks," IEEE Standards Wireless Networks Series, May 2006.
[58] E. Westman, "Calibration and evaluation of the exponential effective SINR mapping (EESM) in 802.16," Master's degree project report, Royal Institute of Technology (Kungliga Tekniska Högskolan), Stockholm, Sweden, Sep. 2006.
9 Enhancement Schemes for Multimedia Transmission over Wireless Networks

9.1 Introduction

Third-generation (3G) access networks were designed from the outset to provide a wide range of bearer services with different levels of quality of service (QoS), suitable for multimedia applications with bit rates of up to 2 Mbps. The bearer services are characterized by a set of transport channel parameters, which include: transport block size, CRC code length, channel coding scheme, RLC mode, MAC type, transmission time interval, rate matching, and spreading factor. The perceived quality of the application seen by the end user is greatly affected by the settings of these transport channel parameters. The optimal parameter settings depend highly on the characteristics of the application, the propagation conditions, and the end-user QoS requirements. This chapter will examine the effect of these transport channel (network) parameter settings upon the performance of MPEG-4-coded video telephony and AMR-WB speech applications, and will investigate the optimal radio bearer design for real-time speech and video transmission over UTRAN, GPRS, and EGPRS. The influence of the network parameter settings and of different channel and interference conditions upon the received video/speech quality and network performance will be assessed experimentally using the real-time UMTS, GPRS, and EGPRS emulators described in Chapter 8. Furthermore, differences between packet-switched and circuit-switched radio bearer configurations for conversational video applications over UMTS will be investigated.
9.1.1 3G Real-time Audiovisual Requirements

The most challenging communication class in terms of application requirements is the conversational service class. The real-time conversational scheme is characterized by two main requirements: very low end-to-end delay and the preservation of time relations between information entities in the stream. The maximum end-to-end delay is decided by human
Table 9.1 Characteristics of conversational (real-time) applications [1]

Application example        Videophone
Degree of symmetry         Two-way
Data rates                 32–384 kbps
One-way end-to-end delay
Context-based Visual Media Content Adaptation
(permittedDiaChanges – ConversionDescription) and the conditions under which those changes can be performed (changeConstraint). Furthermore, Table 11.4 illustrates an example of how to express, in an MPEG-21-compliant license, that a source ("video") can be played as long as the "resolution" is under some specific limits and the "bit rate" stays between two other specific values. When dealing with protected/governed content, a content provision system with capabilities to adapt the content to a user's context characteristics would need to check this license and use the conditions referred to therein as additional constraints during the adaptation decision-taking process. This kind of license can be very useful in assisting content creators, owners, and distributors to keep some control over the quality of their products. It provides them with the means to set up conditions under which their content is consumed. This can also contribute to augmenting user satisfaction, as the content presented to them will satisfy the quality conditions intended by its creator.

In line with the above licensing examples for generic scenarios, a more specific license for use in the selected application scenario can also be derived. In this particular scenario, a teacher may want students to download the lectures at a good resolution exceeding a given minimum, as otherwise they will miss some important details of the presentation, video feed, and so on. Table 11.5 illustrates the resulting license that the teacher should issue, associated with the lectures.
11.6.8 Adaptation Engines Stack

The AESs considered in a context-aware content adaptation platform are capable of performing multiple adaptations, as illustrated in Figure 11.35. An AES encloses a number of AEs in a single entity. All the AEs in an AES reside on a single hardware platform, sharing all the resources. The advantage of such an approach is that multiple AEs can be cascaded optimally to minimize computational complexity. For example, if both cropping and scaling operations need to be performed on a given non-scalable video stream, those operations can be performed together. The service initialization agent is responsible for initializing each component in the AES. After initializing the AES, the registering agent communicates with the ADE to register its services, capabilities, and required parameters. It is also responsible for renewing the registered information in case of any change in its service parameters. The adaptation decision interpreter processes the adaptation decision message from the ADE requesting the adaptation service. Based on this information, it also decides the appropriate AE to be invoked, and its configuration. The progress of the adaptation operation is monitored by the AE monitoring service, which, if necessary, reports progress back to the ADE.

This subsection presents two AEs for use in the virtual classroom application, with a focus on resource adaptation. The two AEs under consideration are:

1. Optimized Source and Channel Rate Allocation (OSCRA)
2. Cropping and Scaling of H.264/AVC Encoded Video (CS/H.264/AVC).

As well as the above adaptation techniques, the possibility of using the scalability extension of H.264/AVC as a means of achieving quality, temporal, and spatial adaptation at reduced computational complexity is also highlighted in this part.
Table 11.4 License expressing conditions upon the values of video encoding parameters

[XML license listing not fully preserved in extraction. The surviving fragment contains the license's namespace declarations (xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance", xmlns:dia="urn:mpeg:mpeg21:2003:01-DIA-NS", xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS", xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS", xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS", xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-DIA-NS ConversionDescription.xsd") and the constraint values 352, 240, 5000, and 1000, corresponding to the resolution limit and bit rate bounds discussed in the text.]
As described above, the selected application scenario, namely the virtual classroom, involves numerous users with various preferences, as well as terminal- and network-specific constraints. In such an application scenario, one of the demanding user-centric adaptation cases is the delivery of user-centric services to different collaborators, such as the cropped view of a particular attention area (i.e. ROI) selected by a user. On the other hand, an optimal source and channel rate allocation technique is required in such a scenario, in order to allocate a higher level of protection to the segments of the video content that are estimated to be more prone to corruption during transmission, by accurately modeling the distortion for the given channel conditions. These adaptation methods are designed to respond to variations in a number of context parameters, as explained in Table 11.6.

The function of the OSCRA-based AE is to adapt the level of error resilience of the coded video sequence based on the prevailing error rates or packet drop rates experienced during transmission of media resources. In the virtual classroom application, this can be utilized to optimize user satisfaction even under harsh network conditions. The AE improves the rate-distortion characteristics of the decoded video using differentially-prioritized segments of a video frame, based on a metric that quantifies the relative importance of those segments. In a generic adaptation scenario, the importance can be automatically weighted towards areas such as moving objects. In an IROI adaptation scenario, this relative importance measure is calculated based on a user's feedback on their preference for a particular attention area of the video frame.

The request for adaptation can be generated for a number of reasons in the virtual classroom application scenario. For instance, a user may wish to focus on the lecturer alone, due to the
Table 11.5 License expressing conditions upon the video resolution for a virtual classroom session. Reproduced by Permission of © 2007 IEEE

[XML license listing not preserved in extraction; only a min_value constraint element on the video resolution survives.]
Figure 11.35 Organization of an AES. Reproduced by Permission of © 2008 IIMC Ltd and ICT Mobile Summit
small display size of their PDA and/or mobile device (e.g. a phone). The AE can then reserve the maximum amount of resources for the lecturer's region, as this can be considered as the attention area for the user. In the worst-case scenario, the AE may also consider totally ignoring the background or other speakers/regions in the scene, based on the ADE's decision. The AE can thus reallocate source channel rates and the error-protection resources for the visually salient
Table 11.6 Contextual information handled by the AEs and the expected reaction of each AE to the variations in these contexts

Context: IROI
  OSCRA: Prioritizing ROI by allocating more resources
  CS/H.264/AVC: ROI cropping

Context: Network capacity and condition
  OSCRA: One or more of the following actions: reducing resources for less important regions and reallocating them to the ROI; changing the priority of more important syntax elements
  CS/H.264/AVC: One or more of the following actions: ROI cropping; resolution scaling; SNR scaling

Context: Display window size and resolution in pixels
  OSCRA: Prioritizing ROI by allocating more resources
  CS/H.264/AVC: One or more of the following actions: ROI cropping; resolution scaling
regions accordingly. The reallocation of resources is based on a method that separates a video sequence into substreams, each of which carries an individual region, encodes them at varying rates according to their priority degrees, and then transmits them based on their relative importance over multiple radio bearers with different error-protection levels. Similarly, this method could also be used to apply adaptation to secure the important syntax elements of a video stream, so as to transmit sensitive data on secure channels.

The CS/H.264/AVC-based AE serves user requests to view an ROI of their choice on a small display, such as a mobile phone or PDA. Once the ROI is defined by the user, the ADE has to determine whether this operation can be handled on the terminal itself. Network capacity and network condition must also be considered in this decision. Even when the network does not pose a bottleneck, the terminal and its decoder may not be capable of decoding the original high-resolution video stream. For example, a decoder in a PDA may not have the necessary computational resources to decode high-definition television (HDTV) quality video, due to processor and memory capacity restrictions, even when it is connected to a network without bandwidth limitations, such as a wireless local area network (WLAN). Apart from cropping the ROI, the AE often needs to resize the cropped video to match the terminal display's resolution and fit it into the viewing window. Frequently, a user selects an arbitrary ROI aspect ratio that may not be identical to the display window aspect ratio. Thus, the AE also needs to be capable of making up the gap caused by the aspect ratio mismatch, under the guidance of the ADE. Each of the above AEs is discussed in detail in the following subsections.
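The resize step just described, fitting a cropped ROI into a display window while preserving aspect ratio and making up the aspect-ratio gap, can be sketched as below. The function name and the symmetric-padding policy are illustrative assumptions, not part of the described AE.

```python
# Hedged sketch: scale an ROI to the largest size that fits the display
# window without distorting it, then report the padding needed to fill
# the gap left by any aspect-ratio mismatch.
def fit_roi(roi_w, roi_h, win_w, win_h):
    """Return (scaled_w, scaled_h, pad_x, pad_y) for fitting the ROI of
    size roi_w x roi_h into a window of size win_w x win_h."""
    scale = min(win_w / roi_w, win_h / roi_h)  # uniform scale, no distortion
    out_w = int(roi_w * scale)
    out_h = int(roi_h * scale)
    pad_x = (win_w - out_w) // 2               # horizontal padding per side
    pad_y = (win_h - out_h) // 2               # vertical padding per side
    return out_w, out_h, pad_x, pad_y

# A 4:3 ROI (640x480) on a 320x240 PDA window scales exactly;
# a square ROI (480x480) on the same window leaves horizontal padding.
print(fit_roi(640, 480, 320, 240))  # -> (320, 240, 0, 0)
print(fit_roi(480, 480, 320, 240))  # -> (240, 240, 40, 0)
```

An AE could equally crop a little more of the scene instead of padding; which policy to use is exactly the kind of decision the text assigns to the ADE.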
11.6.8.1 Optimized Source and Channel Rate Allocation-based Content Adaptation

The provision of multimedia over mobile networks is challenging due to the hostile and highly variable nature of the radio channels in a scenario where remote students may want to join various virtual classroom sessions. Carefully designed error-resilience techniques and channel protection mechanisms enable the adaptation of the video data so that it is more resilient to channel degradations. However, as these (i.e. the error-resilience techniques and channel protection mechanisms) have been optimized separately, they are still insufficiently effective on their own for application over mobile channels. Joint source–channel coding approaches [122] have been proven to provide optimal performance for video applications over practical systems. These techniques can be divided into two areas: channel-optimized source coding and source-optimized channel coding. In the first approach, the channel coding scheme is fixed, and the source coding is designed to optimize the codec performance. In the case of source-optimized channel coding, the optimal channel coding is derived for a fixed source coding method. The application of source-optimized channel coding for video transmission in an error-prone propagation environment is considered in [123,124]. Adaptive source–channel code optimization has been proven to provide better perceptual video quality [125]. Source coding is dependent on the specific codec characteristics. Channel coding is performed at the physical link layer to overcome the effects of propagation errors. Both source and channel coding contribute to the overall channel coding rate, which is a characteristic of the network. Keeping the overall channel bit rate constant, the source and channel rates can be modified.
This can be done by prioritizing different parts of the video bitstream, by sending them using multiple bearers with different characteristics, such as different channel coding, modulation, and so on. In order to implement this method, it is necessary to separate
the encoded bitstream optimally into a number of substreams during the media adaptation process. Rate Allocation-based Adaptation This subsection addresses issues in designing an optimal joint source–channel bit rate allocation for adaptation of video transmission over mobile networks in the virtual classroom application scenario under consideration. The aforementioned scheme combines bit-level unequal error protection (UEP) and joint source–channel coding to obtain optimal video quality over a wide range of channel conditions. The encoded data is separated into a number of substreams based on the relative importance of data in different video packets, which is calculated by the estimated perceived importance of bits at the AE. These streams are then mapped on different radio bearers, which use different channel coding schemes depending on their relative importance. Realization of the scheme is presented in Figure 11.36. Rate Allocation Scheme The transmission channel is characterized by the probability of channel bit errors and the channel bandwidth, Rch, expressed in terms of bits per second. Assume that the underlying communication system allows a maximum of N communication subchannels or radio bearers, as in the case in a Universal Mobile Telecommunications System (UMTS) network, for a given video service, such as in a virtual classroom session. The encoded bitstream is separated into N substreams. Rn denotes the channel bit rate on the nth subchannel, and xn is the channel coding rate on the nth subchannel. The optimum possible source rate, R, is a function of channel bit rate, Rn, and channel coding rate, xn: R ¼ R1 þ . . . þ R n þ . . . þ R N Rch R1 =c1 þ . . . þ Rn =cn þ . . . þ RN =cN
Figure 11.36 Realization of the rate allocation-based adaptation scheme. Reproduced by Permission of 2008 IEEE
Context-based Visual Media Content Adaptation
The expected distortion due to the corruption of the nth substream is E(D_n). Thus, the total sequence distortion E(D) becomes:

E(D) = Σ_{n=1}^{N} E(D_n)    (11.2)
(Reproduced by Permission of 2008 IEEE.) The goal set here is to find the optimal substream separation and mapping of substream data onto multiple radio bearers, in order to maximize the received video quality for video transmission over a bandwidth-limited, error-prone channel. The optimization problem can formally be written as:

Minimize  E(D) = Σ_{n=1}^{N} E(D_n)    (11.3)

(Reproduced by Permission of 2008 IEEE.) Subject to:

R_1/c_1 + ... + R_n/c_n + ... + R_N/c_N ≤ R_ch    (11.4)
Let the input video sequence have L video frames. Each video frame is separated into M video packets, where M is a variable. Each video packet is divided into K partitions; in the case of MPEG-4 video, which supports two partitions, K is equal to two. The expected distortion due to the corruption of the first partition of the mth video packet of the lth frame is α_{m,l}; the distortion resulting from the corruption of the second partition is β_{m,l}; and λ_{m,l} is the expected distortion due to the corruption of the Kth partition. Thus, the total expected distortion E(D) is:

E(D) = Σ_{n=1}^{N} E(D_n) = Σ_{l=1}^{L} Σ_{m=1}^{M} (α_{m,l} + β_{m,l} + ... + λ_{m,l})    (11.5)

(Reproduced by Permission of 2008 IEEE.) Now the optimization problem set in Equation (11.3) can be visualized as the optimal allocation of bits within each partition into N substreams, subject to the constraint set in Equation (11.4). Data separation into N substreams is conducted based on the importance level of the information in each partition for the overall quality of the received video. For N substreams, there are N importance levels, labeled I_n (n ∈ {1, ..., N}):

I_n ∝ W_{m,l} A_{m,l},  where A ∈ {α, β, ..., λ}, m ∈ {1, ..., M}, l ∈ {1, ..., L}    (11.6)
W_{m,l} specifies a weighting factor, which can be used to prioritize the ROIs over other regions. If all parts of the information in a video frame are equally interesting, then W_{m,l} becomes one. The source–channel rate allocation algorithm is shown in Figure 11.37. The algorithm operates at the video frame level, starting with an estimate of the source rate, R, for a given channel bandwidth with maximum available channel protection.
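The importance-driven substream separation of Equation (11.6) can be sketched as follows. The partition weights and distortions below are hypothetical, and the even quantization of the importance ranking into N levels is one simple policy, not the book's exact rule:

```python
# Sketch of Equation (11.6): each partition's importance is proportional to its
# weighted expected distortion W_{m,l} * A_{m,l}; partitions are then assigned
# to N substreams by importance rank (substream 0 = most important).

def separate_into_substreams(partitions, n_streams):
    """partitions: list of (weight, expected_distortion) per partition.
    Returns N lists of partition indices, most important substream first."""
    importance = [w * d for w, d in partitions]
    order = sorted(range(len(partitions)), key=lambda i: -importance[i])
    streams = [[] for _ in range(n_streams)]
    per_stream = -(-len(partitions) // n_streams)  # ceiling division
    for rank, idx in enumerate(order):
        # Quantize the importance ranking evenly into N levels.
        streams[rank // per_stream].append(idx)
    return streams

# Four partitions as (W, expected distortion); ROI partitions get weight > 1.
parts = [(2.0, 10.0), (1.0, 40.0), (1.0, 5.0), (2.0, 30.0)]
print(separate_into_substreams(parts, 2))  # [[3, 1], [0, 2]]
```

The highest-importance substream would then be mapped onto the most strongly protected radio bearer.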
Figure 11.37 The source–channel rate allocation algorithm
After encoding the video frame with the estimated source rate, the expected distortion is calculated at the video packet level. Based on the calculated distortion values, the data partition of each video packet is assigned an importance level, calculated according to Equation (11.6), and the data is separated into N substreams accordingly. The instantaneous channel quality for each selected subchannel is predicted from the channel quality measurements conducted at the network or at the terminal. The source rates for the subchannels are calculated, and the channel bandwidth requirement set out in Equation (11.4) is checked. If the bandwidth requirement is not satisfied, the data on highly-protected subchannels is reduced, and the importance levels are recalculated. Once the channel bandwidth requirement is satisfied, the total expected frame distortion is calculated for the particular subchannel configuration. This total expected frame distortion is compared to the value obtained from the previous iteration to find the local minimum distortion value. If the minimum distortion is obtained, the process is terminated. Otherwise, the source rate is incremented by one step, and the process is repeated. Note that the source-rate allocation algorithm starts with an estimated source rate for the maximum channel protection level. This provides the minimum source rate for a given channel bandwidth. For the second and following iterations, the effective channel bit-error ratio, μ, is used in estimating the expected video packet distortion. μ is computed as:

μ = (η_1 R_1 + ... + η_n R_n + ... + η_N R_N) / R    (11.7)

where η_n denotes the channel bit-error ratio on the nth subchannel. After finding the optimal source–channel rate allocations, the data on each substream is reformatted to achieve stream synchronization at the receiver, and transmitted using the selected radio channels.

Modeling of Distortions
The source–channel rate allocation algorithm for adapting the video streams described in the previous subsection relies on an accurate distortion model at the video packet level. This subsection describes the distortion modeling algorithm for estimating the distortions due to the corruption of data in each partition of a video packet. The video packet format is assumed to be MPEG-4 Visual simple profile with data partitioning enabled. Video performance can be shown to depend on a combination of quantization distortion, E(D_{Q,pv}), and channel distortion. Channel distortion can be further subdivided into two parts: concealment distortion and distortion caused by error propagation over predictive frames. Concealment distortion depends on the concealment techniques applied at the decoder. The scheme under consideration applies a temporal concealment technique: if an error is detected in a decoded video packet, the packet is discarded and the discarded data is replaced with concealed data from the corresponding macroblocks (MBs) of the previous frame. The distortion caused by such a concealment process is called the temporal concealment distortion, E(D_{t_con,pv}). Frame-to-frame error propagation through motion prediction and temporal concealment is called temporal domain error propagation, f_tp. The distortion model adopted in this section has similarities with the method proposed in [126]. However, the distortion induced by error propagation is calculated differently, even though the same assumption is made, namely the uniform distribution of the video reconstruction error.
The model also uses adaptive intra refresh (AIR) techniques instead of the intra-frame refreshment used in the model presented in [126]. The modifications made enhance the accuracy of the distortion calculation. Taking the video packet as the base unit, the expected frame quality can be written as:

E(Q_f^j) = 10 log(γ / Σ_{i=0}^{I^j} E(D_pv^{i,j}))    (11.8)

where E(Q_f^j) is the expected quality, E(D_pv^{i,j}) is the expected distortion of the video packet, and I^j is the total number of video packets. (Reproduced by Permission of 2008 IEEE.) Superscripts i and j represent the ith video packet of the jth video frame. γ is a constant defined by the dimensions of the frame; for instance, for a common intermediate format (CIF)-resolution video, γ = 255^2 × 352 × 288. E(D_pv^{i,j}) can be written as:
E(D_pv^{i,j}) = E(D_{Q,pv}^{i,j}) + ρ_{d,pv}^{i,j} E(D_{t_con,pv}^{i,j}) + f_tp^{i,j}    (11.9)
(Reproduced by Permission of 2008 IEEE.) In this equation, ρ_{d,pv}^{i,j} denotes the probability of receiving an erroneous video packet. The calculation of each term in Equation (11.9) depends on the formatting of the video coding, the error-resilience and concealment techniques, and the coding standard. The probability calculation for MPEG-4 (simple profile with data partitioning)-encoded video is described below to exemplify the process. Let the probability of receiving a video object plane (VOP) header with errors be ρ_VOP^{i,j}, and the probability of receiving the video packet header and the motion information with errors be ρ_M^{i,j}. In addition, let the probability of finding an error in the discrete cosine transform (DCT) part be χ^{i,j}. Then:

ρ_{d,pv}^{i,j} = (1 − ρ_VOP^{i,j})(1 − ρ_M^{i,j}) χ^{i,j}    (11.10)
(Reproduced by Permission of 2008 IEEE.) For a given channel bit-error probability, ρ_b, it can be shown that:

ρ_VOP^{i,j} = Σ_{v=1}^{V} (1 − ρ_b)^{v−1} ρ_b = 1 − (1 − ρ_b)^V    (11.11)

where V represents the VOP header size. (Reproduced by Permission of 2008 IEEE.) Similarly:

ρ_M^{i,j} = 1 − (1 − ρ_b)^{Z_M}    (11.12)

χ^{i,j} = Σ_{z=1}^{Z_DCT} (Z_DCT choose z) (1 − ρ_b)^{Z_DCT − z} ρ_b^z = 1 − (1 − ρ_b)^{Z_DCT}    (11.13)

where Z_DCT and Z_M denote the lengths of the DCT and motion vector data, respectively. (Equations (11.12) and (11.13) Reproduced by Permission of 2008 IEEE.)
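The probability model of Equations (11.10)–(11.13) reduces to the same closed form, 1 − (1 − ρ_b)^length, for each coded field. A numerical sketch, with hypothetical header and partition lengths, combined into the expected packet distortion of Equation (11.9):

```python
# Sketch of the video-packet error model of Equations (11.10)-(11.13), plugged
# into the expected packet distortion of Equation (11.9). The bit-error rate
# and the header/partition lengths below are hypothetical.

def p_error(rho_b, length_bits):
    """Probability that at least one of `length_bits` bits is in error:
    1 - (1 - rho_b)^length, the closed form of Equations (11.11)-(11.13)."""
    return 1.0 - (1.0 - rho_b) ** length_bits

def packet_error_prob(rho_b, v_vop, z_m, z_dct):
    """Equation (11.10): VOP header and motion part intact, DCT part hit."""
    rho_vop = p_error(rho_b, v_vop)
    rho_m = p_error(rho_b, z_m)
    chi = p_error(rho_b, z_dct)
    return (1.0 - rho_vop) * (1.0 - rho_m) * chi

def expected_packet_distortion(d_q, d_con, f_tp, rho_d):
    """Equation (11.9): quantization + weighted concealment + propagation."""
    return d_q + rho_d * d_con + f_tp

rho_d = packet_error_prob(rho_b=1e-4, v_vop=40, z_m=300, z_dct=2000)
print(round(rho_d, 4))
```

As expected, the long DCT partition dominates the packet error probability, which is the motivation for protecting the header and motion partition more strongly.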
The expected distortions of MBs for MPEG-4-encoded video are calculated in the same way as specified in [126]. The quantization distortion is computed by comparing the reconstructed MBs and the original MBs at the encoder. Concealment distortions are also computed in a similar manner: the transmitted video data belonging to each MB is corrupted using a noise generator located at the encoder [124], the corrupted data is replaced by the concealed data, and the data belonging to the original and concealed MBs are compared. It is assumed during the calculation that the neighboring video packets and reference frames are correctly received. The temporal error propagation due to MB mismatch between adjacent video frames is quantified by the term f_tp^{i,j} in Equation (11.9), which is computed as:

f_tp^{i,j} = (1 − ρ_{u,pv}^{i,j}) Σ_{k∈W} [ρ_{d,pv}^{i,j−1} E(D_{t_con,pv}^{k,i,j−1}) P_TP^{j−1} + (1 − ρ_{u,pv}^{i,j−1}) f_tp^{k,i,j−1} P_TP^{j−2}]    (11.14)
where W denotes the set of coded blocks in a frame. (Reproduced by Permission of 2008 IEEE.) The summation in Equation (11.14) represents the error propagation through MBs. P_TP^{j−1} quantifies the fraction of the distortion of the reference video packet (in the (j−1)th frame) that should be considered in the propagation loss calculation. P_TP^{j−1} is computed by:

P_TP^{j−1} = 1 − (1 − ρ_b)^{F_{j−1}}    (11.15)

where F_{j−1} is the size of the (j−1)th frame. (Reproduced by Permission of 2008 IEEE.)

Application of the Adaptation Scheme
The adaptation scheme uses the predicted channel quality information to calculate the expected distortion. Different channel coding schemes, combined with the spreading gain, provide a number of different radio bearer configurations that offer flexibility in the degree of protection. For example, UMTS employs four channel coding schemes: the available channel coding methods and code rates for dedicated channels are rate-1/2 convolutional coding, rate-1/3 convolutional coding, rate-1/3 turbo coding, and no coding. A video frame is encoded at the selected source rate and separated into different substreams. The higher-priority data is sent over the highly-protected channels, while low protection is used to transmit the low-priority streams. This arrangement adapts the available network resources according to the perceived importance of the selected objects from the video data. The algorithm performs a number of iterations to obtain the optimal operating point, as shown in Figure 11.37. Re-encoding the frame with an adjusted source rate can, however, delay the transmission process. Therefore, for a real-time video application, it is suggested that the encoding rate adjustments be performed only for the first two frames of the sequence. The following video frames should be encoded at the source rate obtained from the first two frames. As the encoding is performed only once for the following frames, the expected distortion calculation, and therefore the whole algorithm process, is simplified. ROI coding and UEP schemes, if used together, can significantly improve the perceived video quality at the user end [127].
In the application scenario under consideration, the importance levels associated with the segmented objects can be applied to a region within a video frame that is of more interest to a user, so that it is transmitted over one of the most secure radio bearers available, with maximum error protection.
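The iterative allocation loop of Figure 11.37 and the effective bit-error ratio of Equation (11.7) can be sketched as follows. The distortion curve used here is a hypothetical stand-in for the packet-level model of Equations (11.8)–(11.15):

```python
# Sketch of the source-rate search of Figure 11.37: start from the minimum
# (maximum-protection) source rate and increase it in steps until the expected
# frame distortion stops decreasing (local minimum), as described in the text.

def effective_ber(eta, R):
    """mu = (eta_1*R_1 + ... + eta_N*R_N) / R, Equation (11.7)."""
    return sum(e * r for e, r in zip(eta, R)) / sum(R)

def allocate_source_rate(r_min, step, expected_distortion, max_iters=100):
    best_rate, best_d = r_min, expected_distortion(r_min)
    rate = r_min
    for _ in range(max_iters):
        rate += step
        d = expected_distortion(rate)
        if d >= best_d:  # local minimum found: terminate
            break
        best_rate, best_d = rate, d
    return best_rate, best_d

def model(r):
    # Hypothetical convex distortion curve with a minimum near 200 kbps.
    return (r - 200_000) ** 2 / 1e9 + 30.0

rate, dist = allocate_source_rate(64_000, 8_000, model)
print(rate)                                        # 200000
print(effective_ber([1e-3, 1e-2], [100_000, 100_000]))  # 0.0055
```

In the real scheme the distortion evaluated at each step depends on μ through the packet error probabilities, which couples the rate search to the bearer configuration.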
Experimentation Setup
Video sequences are encoded according to the MPEG-4 Visual simple profile [128] format. This includes the error-resilience tools, such as video packetization, data partitioning, and reversible variable-length coding. The first video frame is intra coded, while the others use inter (i.e. predictive) coding. A Test Model 5 (TM5) rate control algorithm is used to achieve a smoother output bit rate, while an AIR algorithm [128,129] is used to stop temporal error propagation. The two test sequences shown in Figure 11.38, namely Singer and Kettle, are used as the source signals in the experiments. The Singer sequence is used as the background, and the Kettle sequence is segmented and used in the foreground of the output sequence. These CIF (352 × 288 pixels) sequences are coded at 30 fps. The encoded sequences are transmitted over a simulated UMTS channel [124]. The simulator consists of the UMTS Terrestrial Radio Access Network (UTRAN) data flow model and the Wideband Code Division Multiple Access (WCDMA) physical layer for the forward link. The WCDMA physical layer model is generic and enables easy configuration of the UTRAN link-level parameters, such as channel structures, channel coding/decoding, spreading/despreading, modulation, transmission modeling, propagation environments, and their corresponding data rates according to the 3rd Generation Partnership Project (3GPP) specifications. The transmitted signal is subjected to a multipath fast-fading environment. The multipath-induced inter-symbol interference is implicit in the chip-level simulator. By adjusting the variance of the noise source, the bit error and block error characteristics can be determined for a range of SNRs and for different physical layer configurations. A detailed explanation of the link-level simulator used can be found in [124]. The simulation setup for transmission considers a vehicular A propagation condition and downlink transmission. The mobile speed is set to 50 km/h.
The experimentally-evaluated average channel block error rates (BLER) over the vehicular A environment are listed in Table 11.7 for different channel protection schemes using convolutional coding (CC) and bit energy-to-noise ratios (Eb/No). The experiment carried out in this section assumes perfect channel state prediction (i.e. knowledge of the actual instantaneous channel condition) in the expected distortion calculation.

Figure 11.38 Input sequences used in experiments: (a) Singer sequence; (b) Kettle sequence. Reproduced by Permission of 2007 IEEE

Table 11.7 Channel BLER for vehicular A environment. Reproduced by Permission of 2008 IEEE

Eb/No    1/2 CC    1/3 CC
3 dB     0.92      0.78
4 dB     0.78      0.53
6 dB     0.31      0.13
8 dB     0.047     0.013
10 dB    0.0020    0.0010
12 dB    0.0010    0.000

Results and Discussions
The accuracy of the distortion model is evaluated by comparing the estimated performance and the actual video performance over a simulated UMTS vehicular A environment. Rate-1/3 convolutional coding with a spreading factor of 32 is considered. The Singer and Kettle sequences are encoded at a frame rate of 30 fps. Experiments are carried out for a range of channel conditions, and the performance of the composite of the two input sequences is shown in Figure 11.39. Each experiment is repeated 25 times to simulate the average effect of bursty channel errors on the performance. Initial test results demonstrate that the estimated peak signal-to-noise ratio (PSNR) values closely match the actual PSNR values. This accurate modeling of the expected distortion during transmission can be used to allocate maximum protection to segments with higher distortion estimates. It also incorporates an importance level for certain regions of the video content, along with the distortion estimates, for optimal rate allocation. Figure 11.40 shows the subjective quality of the output composite sequence.

Figure 11.39 Estimated performance comparisons (actual versus estimated frame PSNR): Eb/No = 10 dB, BLER = 0.0003

Figure 11.40 Subjective performance: (a) Eb/No = 10 dB, BLER = 0.0003; (b) Eb/No = 8 dB, BLER = 0.0130. Reproduced by Permission of 2007 IEEE

11.6.8.2 Cropping of H.264 Encoded Video for User-centric Content Adaptation
As shown in Table 11.2, one of the adaptation requirements for the virtual classroom application is the sequence-level cropping of the visually salient part(s) of the video scenes exchanged between collaborators (for example, between lecturers/presenters and remote students). Thus, the scope of this subsection is to describe an AE that carries out the sequence-level cropping-type recommendations specified in the adaptation decision to provide ROI-based content adaptation. An example of the resulting video sequence after performing the adaptation operation is illustrated in Figure 11.41. Sequence-level cropping-type content adaptation can be performed at two locations along the content delivery chain: (1) at an external
Figure 11.41 (a) Original sequence; (b) ROI-adapted sequence
adaptation engine located at an edge of the network; and (2) at the user terminal. If the second option is utilized, the entire content has to be delivered to the user terminal. However, it should be noted that some of the information traversing the network may be discarded, due to bandwidth limitations or other transmission-related reasons, without ever being presented to the user. Such an adaptation option can therefore be considered a waste of precious bandwidth resources. Figure 11.42 illustrates that up to four times more bandwidth is at times necessary when the adaptation is performed within the user terminal device. It is therefore more advantageous if the cropping-type adaptations are performed at a network node and/or gateway. The AE is designed to accept both scalable and non-scalable H.264/AVC-coded digital video streams. If the input video is IROI scalable [111] and has restricted motion compensation, it is simply a matter of selecting the substream that describes the ROI. However, in this subsection we concentrate only on non-scalable video adaptation. The following subsections describe the AE needed to realize such adaptation, and present relevant simulation results on the reduction of the computational complexity incurred by the adaptation operation, which is performed through transcoding.
Figure 11.42 Rate-distortion comparison (PSNR versus bit rate) of the original-resolution (CIF) video and the adapted, cropped (QCIF) video for the Paris, News, Coastguard, and Flower sequences
The Content Adaptation Architecture
The AE is an integral part of the content adaptation scenario shown in Figure 11.21. The service provider is informed of the user terminal capabilities, such as display and/or decoder capabilities and buffer size, during the initial negotiation phase. Unique to the adaptation scenario under consideration is the provision of a user-interaction service, through which it is assumed that the user can specify a selected ROI. This information can be sent in a feedback message to the service provider during playback. Upon receiving the ROI feedback message, the service provider consults an ADE, which determines the nature of the adaptation needed after processing a number of context descriptors. In addition to the user-driven ROI feedback signaling, the ROI information is used when it is necessary to adapt the content to meet usage environment constraints. For example, if there is a need to reduce the aspect ratio to fit the video onto a smaller terminal display, then a similar adaptation operation can be performed, taking into consideration the ROI information, which is extracted automatically. Similarly, if it is necessary to reduce the bit rate to address channel bandwidth constraints, the ADE may decide to utilize cropping rather than spatial scaling or quality scaling, so that the ROI information can be presented at a better visual quality. These decisions are made after processing a number of context descriptors, which in turn describe the user-defined ROI and other constraints, such as QoS, terminal capabilities, access network capabilities, usage environment, DRM, and so on. In this subsection, an AE that carries out the sequence-level cropping-type recommendations specified in the adaptation decision is described.
This AE achieves interactive ROI-based user-centric video adaptation for H.264/AVC-coded video streams through transcoding, and maximally utilizes the values of the syntax elements from the input bitstreams to minimize the computational complexity.

Adaptation Engine (AE)
The heart of the AE is essentially a transcoder, which performs sequence-level cropping of the video frames of an H.264/AVC-encoded high-resolution video sequence, and produces an H.264/AVC-compatible output bitstream. The architecture of this transcoder is illustrated in Figure 11.43. The decoder decodes the incoming bitstream, and subsequently the reconstructed frame and parsed syntax elements are passed to the encoder through the Frame Adaptor (FA). The functions of the FA are:

- To crop the decoded frames as specified in the adaptation decision received from the ADE.
- To select and organize the relevant syntax elements of the MBs that fall within the cropped region.
In order to simplify the AE, it is assumed that the cropping is performed along the MB boundaries, and it is the ADE's responsibility to consider this constraint when user requests are processed. Under this assumption, there is a one-to-one correspondence in terms of MB
Figure 11.43 The AE. Reproduced by Permission of 2007 IEEE
Figure 11.44 One-to-one relationship between MBs in the original and the adapted frames. Reproduced by Permission of 2008 IEEE
content. That is to say, for the ith MB (in raster order) in the adapted frame (MB_{i,adapted}), there is a corresponding MB j in the decoded frame that has exactly the same luminance and chrominance composition, as illustrated (MB_{j,input}) in Figure 11.44. Before encoding MB_{i,adapted}, the encoder checks whether the coding mode of MB_{i,input} and the associated syntax elements are valid for MB_{i,adapted}. Possible tests evaluate the validity of motion vectors, intra-prediction modes, SKIP-mode motion vector predictors, and so on. In this chapter, the discussion is limited to the MB SKIP mode [110] in predictive-coded pictures. Other coding modes, including block-level SKIP mode, are determined using rate-distortion optimization for performance evaluation purposes. SKIP-mode MBs are reconstructed using motion compensation at the decoder. The necessary motion vector is estimated at the decoder without any supporting information from the encoder. The estimated motion vector is known as the SKIP-mode motion vector predictor (MVP). Motion vectors of the surrounding MBs are involved in the estimation process [110]. The MVP of a given MB may remain identical when the MB is decoded and re-encoded during transcoding, depending on the availability of the surrounding MBs and their motion vectors. If the encoder and decoder come up with the same MVP, MB SKIP mode is assumed to be the best coding mode with which to re-encode the MB. Based on this assumption, the algorithm described in Figure 11.45 is executed. Inputs to this process are the adapted frame and the syntax elements of the corresponding MBs, which are gathered while the original sequence is being parsed. For each MB, a SKIP-mode MVP is estimated at the encoder functional units (MVP_{SKIP,evaluated}) if the corresponding MB in the original bitstream (MB_input) is coded in MB SKIP mode. If MVP_{SKIP,evaluated} is identical to that calculated by the decoder when the original bitstream is decoded (MVP_{SKIP,input}), MB SKIP mode is considered to be the optimum coding mode for encoding the MB in the adapted frame. Otherwise, the RD optimization algorithm is invoked to find a suitable coding mode.
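Two pieces of the above can be sketched compactly: the one-to-one MB mapping implied by MB-aligned cropping (Figure 11.44), and the SKIP-mode reuse decision (Figure 11.45). The frame and crop geometry below is illustrative, and `rd_optimal_mode` is a hypothetical stand-in for the full JSVM RD-optimized mode decision:

```python
# (1) One-to-one MB mapping for MB-aligned cropping; MBs are 16x16 in luma and
# indexed in raster order. (2) Mode decision: reuse MB SKIP only when the input
# MB was SKIP-coded and the re-encoder derives the same SKIP-mode MVP.

MB = 16  # H.264/AVC macroblock size in luma samples

def mapped_mb_index(i_adapted, crop_x, crop_y, crop_w, src_w):
    """Index j (raster order) of the source MB with the same content as
    adapted MB i, for a crop whose origin lies on an MB boundary."""
    assert crop_x % MB == 0 and crop_y % MB == 0, "crop must be MB-aligned"
    row, col = divmod(i_adapted, crop_w // MB)
    return (crop_y // MB + row) * (src_w // MB) + (crop_x // MB + col)

def choose_mode(input_mode, mvp_skip_input, mvp_skip_evaluated, rd_optimal_mode):
    """Reuse SKIP only if the encoder-side MVP equals the decoder-side MVP."""
    if input_mode == "SKIP" and mvp_skip_evaluated == mvp_skip_input:
        return "SKIP"
    return rd_optimal_mode()  # fall back to full RD optimization

# CIF frame (352x288) cropped to QCIF (176x144) at origin (176, 64), as for
# the News (B) ROI in Table 11.8: adapted MB 0 maps to source MB 99.
print(mapped_mb_index(0, 176, 64, 176, 352))                          # 99
print(choose_mode("SKIP", (1, -2), (1, -2), lambda: "INTER_16x16"))   # SKIP
```

Near the crop boundary the surrounding MBs differ from those in the original frame, which is exactly why the MVP comparison, rather than the SKIP flag alone, gates the mode reuse.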
Figure 11.45 The algorithm to determine the coding mode of an MB in the cropped frame. Reproduced by Permission of 2009 IEEE
Experimentation Setup
In order to evaluate the credibility of the above algorithm, a transcoder is used which is based on the Joint Scalable Video Model (JSVM) encoder and decoder, version 7.13. The input bitstream is H.264/AVC extended profile compatible and the output bitstream is H.264/AVC baseline profile compatible. Experimental results are presented for CIF test video sequences available in the public domain; the experimental conditions are described in detail in Table 11.8. Once the ROI window is selected, it is assumed to stay unchanged throughout the length of the sequence, for simplicity. PSNR is used for objective quality evaluation, and is computed only over the selected ROI.

Table 11.8 Experimental conditions for the selected video test sequences. Reproduced by Permission of 2011 IEEE

Test sequence   Format   Number of frames   Origin of the ROI (in pixels)   Format of the ROI   MB count within the ROI
Paris (A)       CIF      1060               (0, 0)                          QCIF                104 940
Paris (B)       CIF      1060               (176, 16)                       QCIF                104 940
News (A)        CIF      290                (0, 64)                         QCIF                28 710
News (B)        CIF      290                (176, 64)                       QCIF                28 710
Coastguard      CIF      290                (48, 64)                        QCIF                28 710
Flower          CIF      240                (0, 64)                         QCIF                23 760

Results and Discussions
Figure 11.46 compares the adapted quarter common intermediate format (QCIF) versions based on the cropping method, as described in Table 11.8, with the original CIF-formatted and scaled QCIF versions. When the frame is scaled to fit the display size, visually important details such as facial expressions become less visible, and thus the quality of the visual experience suffers. With the adaptation mechanism under consideration, however, such details are preserved, as demonstrated in Figure 11.46, because the resolution scaling operation is not performed over the ROIs at all. This technique is therefore envisaged to produce a better visual experience for users of this AE technology.

The first set of analytical experiments is carried out to investigate the utilization of MB SKIP mode for coding the MBs within the selected ROIs. SKIP-mode usage statistics within the cropped ROIs are presented in Table 11.9. In general, at lower bit rates (i.e. when the quantization parameter (QP) is higher), SKIP mode is used more often. When the QP is low, the distortion cost becomes more significant than the bit-rate cost; therefore, the RD cost of SKIP mode (the MB SKIP cost), which is independent of the QP, becomes greater than that of the case in which the residues are quantized with lower QPs. However, when the QP becomes higher, the distortion cost of the other modes increases. As a result, the bit rate becomes the dominant factor in determining the coding mode. Since the bit budget of a SKIP-mode MB is very low, RD optimization favors this mode. Therefore, when the QP is higher, a higher number of MBs are coded in MB SKIP mode. The statistics presented in Table 11.9 confirm the validity of this argument. Table 11.9 also shows that a significant percentage of MBs are coded in SKIP mode.
Therefore, there is considerable potential for reducing the coding complexity significantly just by utilizing the MB SKIP-mode information derived from the input bitstream wherever possible.

Figure 11.46 ROI selections described in Table 11.8

Table 11.9 MB SKIP-mode usage statistics within the cropped areas in the original H.264/AVC-coded bitstreams. Reproduced by Permission of 2012 IEEE

Usage of MB SKIP mode for coding the ROI in the original bitstreams for each video test sequence (%):

QP    Paris (A)   Paris (B)   News (A)   News (B)   Coastguard   Flower
10    10.92       11.13       38.28      35.48      0.00         17.65
15    42.16       50.72       45.61      46.10      0.06         29.77
20    61.86       65.55       59.43      53.08      5.28         36.39
30    76.04       77.97       71.69      68.61      23.55        50.76
40    87.88       87.89       82.68      80.65      57.82        64.85

When a given MB within the ROI is considered before and after cropping, its surrounding areas may not be identical. There is therefore a substantial possibility that the decoder and encoder functional blocks of the transcoder arrive at different SKIP-mode MVPs for a given MB. If this is the case, it is impossible to rely completely on the information available in the input bitstream to decide the transcoding mode without considering the RD cost. Table 11.10 summarizes the probability that MVP_{SKIP,input} and MVP_{SKIP,evaluated} are different. The results clearly show that the probability is higher when the motion and texture in the scene are complicated. However, the probability stays relatively small for stationary sequences and sequences with regular motion (e.g. the Flower sequence). The results also indicate that this probability becomes smaller for low-bit-rate encoding. In summary, there is a considerable probability that MB SKIP-mode MVPs remain identical when decoding and re-encoding during content adaptation through transcoding.

Figure 11.47 compares the rate-distortion characteristics when the JSVM RD optimization is used to estimate the coding modes of all MBs and when the technique described in this subsection is used. Input sequences are generated by encoding the raw video sequences at five different quantization parameter settings. Each of the resulting bitstreams is then transcoded using different quantization parameter settings to obtain the RD performance for transcoding each input bitstream. The objective quality is measured by comparing the original video sequence with the one generated by decoding the transcoded bitstreams. The objective quality results illustrated in Figure 11.47 clearly show that there is no noticeable RD penalty associated with the described technique. Table 11.11 also compares the average per-frame coding time. These coding times are indicative, since the JSVM software is not optimized, and are obtained by measuring the time taken when transcoding longer versions of the video sequences. Each of these longer versions is at least 3000 frames long, obtained by cascading the original video. Considering the above results, it can be concluded that up to 34% coding complexity reduction can be achieved with the transcoding technique discussed here. In summary, IROI adaptation of H.264/AVC-coded video applicable to a user-centric video adaptation scheme crops a user-specified attention area (i.e.
ROI), enabling the user to actively Table 11.10 Probability of estimating different SKIP-mode MVPs for a given MB with the decoder and encoder functional units of the transcoder. Reproduced by Permission of 2013 IEEE QP
10 15 20 30 40
The probability that MVPSKIP,input and MVPSKIP,evaluated are different Paris (A)
Paris (B)
News (A)
News (B)
Coastguard
Flower
0.014 0.007 0.006 0.011 0.027
0.018 0.006 0.007 0.015 0.023
0.006 0.026 0.033 0.048 0.036
0.002 0.025 0.045 0.098 0.052
– 0.722 0.641 0.446 0.548
0.461 0.387 0.328 0.115 0.105
[Figure 11.47 comprises four rate distortion plots, PSNR (dB) versus bit rate (kbps), for the Paris (A), News (A), Coastguard, and Flower sequences. Each plot compares the JSVM RD optimization (RDOPT) and the proposed transcoding method for input QPs of 10, 15, 20, 30, and 40.]

Figure 11.47 Rate distortion comparison for input bitstreams coded with various QPs when JSVM RD optimization (RDOPT) and the adaptation technique (Adapted) are used, respectively. Reproduced by Permission of © 2010 IEEE
select an attention area. The adapted video sequence is also an H.264/AVC-compatible bitstream. Subjective quality test results demonstrate that this scheme is capable of producing the targeted ROIs, which in turn are envisaged as providing a better user experience by preserving the details of the attention area.

11.6.8.3 Adapting Scalability Extension of H.264/AVC Compatible Video

Even though transcoding cannot be ruled out as an option for some of the operations, such as cropping, summarization, and error resilience, the SVC approach is still
Table 11.11 Reduction in coding complexity with the adaptation technique. Reproduced by Permission of © 2014 IEEE

Average coding time per frame (s) for RD optimization and for the adaptation technique (Adapted), and the resulting improvement of coding time (%), for each quantizer (QP) setting:

Test sequence  Measure               QP = 10   QP = 15   QP = 20   QP = 30   QP = 40
Paris (A)      RD optimization (s)   0.469     0.450     0.441     0.435     0.427
               Adapted (s)           0.447     0.374     0.335     0.297     0.279
               Improvement (%)       4.63      16.98     23.98     31.67     34.66
News (A)       RD optimization (s)   0.503     0.476     0.459     0.459     0.438
               Adapted (s)           0.428     0.400     0.359     0.331     0.300
               Improvement (%)       15.07     15.94     21.81     27.82     31.50
Coastguard     RD optimization (s)   0.53      0.51      0.50      0.47      0.48
               Adapted (s)           0.53      0.51      0.50      0.46      0.43
               Improvement (%)       -0.65     0.00      0.00      2.22      10.07
Flower         RD optimization (s)   0.517     0.483     0.483     0.454     0.429
               Adapted (s)           0.500     0.454     0.442     0.371     0.333
               Improvement (%)       3.23      6.03      8.62      18.35     22.33
[Figure 11.48 is a block diagram of the SVC-based AE: a demultiplexer splits the input DI into the BSD, which goes to the description adaptation module, and the H.264/SVC video, which goes to the packet filter. Driven by the adaptation decision, the description adaptation module and the packet filter produce the adapted BSD and the adapted H.264/SVC video, which a multiplexer combines into the adapted DI.]

Figure 11.48 The scalability extension of H.264/AVC-compatible (H.264/SVC) video AE
feasible for most of the adaptation operations required in the virtual classroom application. The advantage of SVC-based adaptation is that it is less computationally intensive than transcoding-based adaptation. The architecture of the SVC-based AE for the virtual classroom application is illustrated in Figure 11.48. The demultiplexer separates the BSD and the SVC extension of H.264/AVC-encoded video from the input DI. The description adapter reformats the BSD according to the description provided in the adaptation decision. While performing the BSD adaptation, the description adapter drives the packet filter to discard or truncate the corresponding H.264/AVC SVC video data packets in order to perform the video adaptation. This adapted bitstream is then combined with the adapted BSD to form the output DI at the multiplexer.

In this subsection, the feasibility of using scalability options in the SVC extension of H.264/AVC to achieve some of the required adaptation operations in the virtual classroom scenario is evaluated and discussed using publicly-available video sequences [130]. In these evaluations, JSVM 7.13 and an H.264/AVC transcoder based on the same JSVM version are used as the software platform. Figure 11.49 compares the objective quality of bit-rate adaptation using transcoding and fine-grain fidelity scalability for the Crowd Run video test sequence. In the tests, four scalable and one non-scalable source bitstreams are used as the original video sequences. Scalable bitstreams are obtained by varying the number of temporal and spatial scalability levels. The highest spatial and temporal resolutions are 1280 × 704 pixels (cropped from the original 1280 × 720 resolution to meet the frame size constraints in the JSVM software [131]) and 50 Hz (progressive), respectively. Lower resolutions are dyadic subdivisions of the maximum resolution. Adapted bitstreams also have the same temporal and spatial resolutions.
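The packet-filter stage of this AE can be pictured as a simple selection over the scalability identifiers carried in each SVC NAL unit header: dependency_id (spatial layer), temporal_id (temporal layer), and quality_id (fidelity layer). The sketch below is a simplification under that assumption; the NalUnit container is not a real bitstream parser, and a full extractor would also truncate, rather than merely drop, fine-grain quality NAL units.

```python
from dataclasses import dataclass
from typing import List

# Illustrative model of the packet filter: each SVC NAL unit carries the
# scalability identifiers (D, T, Q) of its layer, and the filter keeps
# only the units needed for the target operating point. The NalUnit
# class is a stand-in for real NAL unit header parsing.

@dataclass
class NalUnit:
    dependency_id: int   # spatial / coarse-grain scalability layer (D)
    temporal_id: int     # temporal scalability layer (T)
    quality_id: int      # fine-grain fidelity layer (Q)
    payload: bytes

def filter_svc_stream(nal_units: List[NalUnit],
                      max_d: int, max_t: int, max_q: int) -> List[NalUnit]:
    """Discard every NAL unit above the target (D, T, Q) operating point."""
    return [n for n in nal_units
            if n.dependency_id <= max_d
            and n.temporal_id <= max_t
            and n.quality_id <= max_q]
```

Because the operation is a pass over packet headers, its cost is negligible next to decoding and re-encoding, which is the argument made for SVC-based adaptation here.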
For comparison purposes, the rate distortion performance that can be achieved by directly encoding the raw video sequence is also shown in the figure. According to Figure 11.49, it is clear that fidelity scalability offers the flexibility of adjusting the bit rate over a large range. Since fine-granular scalability layers can be truncated, a large number of bit-rate adaptation points can be achieved with this technology, allowing the adaptation platform to react more precisely to the dynamics of the available network resources. These experimental results also suggest that the objective quality can be further improved by selecting the most appropriate number of spatial and temporal scalability layers for a given situation. For example, when there is no demand for resolutions below 640 × 352 pixels, the
Context-based Visual Media Content Adaptation

[Figure 11.49 is a plot of PSNR (dB) against bit rate (kbps), comparing a bitstream transcoded from AVC, four scaled bitstreams (T = 3 or 4, S = 2 or 3, Q = 4), and a directly AVC-encoded bitstream.]

Figure 11.49 Objective quality comparison of bit-rate adaptation using fine-grain fidelity scalability: T, number of temporal scalability levels; S, number of spatial scalability levels; Q, number of fine-grain fidelity scalability levels. Reproduced by Permission of © 2007 IEEE
encoder can omit the 320 × 176 pixel resolution. Limiting the number of resolution layers to two (i.e. S = 2) can achieve an additional objective quality gain of over 0.5 dB. In order to make such a decision dynamically, though, it is necessary to have information regarding the required number of adaptation levels at a given time. Since the ADE tracks the dynamics of the prevailing context, feedback from the ADE can be used to decide the level of adaptation.

The number of temporal scalability levels can be increased only at the expense of delay. However, unlike increasing the number of spatial scalability levels, increasing the number of temporal levels also increases the compression efficiency, as illustrated in Figure 11.49. The reason behind the improved compression efficiency is the use of more hierarchically-predicted bidirectional pictures (B-pictures) to achieve more scalability layers [108]. Consequently, the allowed maximum delay becomes the major factor in selecting the number of temporal scalability layers.

Figure 11.50 shows the rate distortion characteristics of adapting the abovementioned source bitstreams to achieve 640 × 352 pixel spatial and 25 Hz temporal resolution. The source bitstreams used for this adaptation are the same as those used in the adaptations described in Figure 11.49. In this case, when S = 2, the SVC-based adaptation performs better than transcoding. This is because when S = 2, the 640 × 352 pixel resolution is coded as the base layer, which is H.264/AVC compatible. Both Figure 11.49 and Figure 11.50 demonstrate that the scalability options are only available at the cost of coding efficiency. However, the computational complexity associated with the adaptation operation, which can be achieved by a simple packet selection, is negligible compared to that of the transcoding operations that achieve the same adaptation from a non-scalable bitstream.
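The delay cost of extra temporal levels can be estimated with a back-of-the-envelope sketch: with dyadic hierarchical B-pictures, T temporal levels imply a GOP of 2^(T-1) pictures, and roughly one GOP must be buffered before the hierarchy can be coded. The formula is an illustrative simplification, not a JSVM measurement.

```python
# Back-of-the-envelope sketch (an assumption for illustration, not a JSVM
# formula): with dyadic hierarchical B-pictures, T temporal scalability
# levels imply a group-of-pictures of 2**(T - 1) frames, and the encoder
# must buffer roughly one GOP before the hierarchy can be coded, so the
# structural delay grows with the number of temporal levels.

def gop_size(temporal_levels: int) -> int:
    """GOP length implied by a dyadic B-picture hierarchy."""
    return 2 ** (temporal_levels - 1)

def structural_delay_ms(temporal_levels: int, frame_rate_hz: float) -> float:
    """Approximate buffering delay introduced by the B-picture hierarchy."""
    return 1000.0 * gop_size(temporal_levels) / frame_rate_hz
```

At 50 Hz this gives about 80 ms of structural delay for T = 3 and 160 ms for T = 4, which is why the allowed maximum delay caps the number of temporal layers.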
Therefore, it can be concluded that scalability is a more effective way of addressing the spatial and temporal adaptation requirements in the virtual classroom application. Further
[Figure 11.50 is a plot of PSNR (dB) against bit rate (kbps), comparing a bitstream transcoded from AVC with four scaled bitstreams (T = 3 or 4, S = 2 or 3, Q = 4).]

Figure 11.50 Objective quality comparison of adaptation to 640 × 352 pixel spatial and 25 Hz temporal resolution using temporal, spatial, and fine-grain fidelity scalability
investigations are needed in order to enable IROI scalability in the SVC extension of H.264/AVC, to achieve attention-area-based content adaptation for user-centric media processing.
11.6.9 Interfaces between Modules of the Content Adaptation Platform

This section provides insights into the interfaces required between the modules of the adaptation platform and the sequence of events that take place while performing DRM-based adaptation for a particular user or group of users. Figure 11.51 represents the functional
Figure 11.51 Functional architecture of the platform for context-aware and DRM-enabled multimedia content adaptation. Reproduced by Permission of © 2008 IIMC Ltd and ICT Mobile Summit
[Figure 11.52 shows the messages exchanged between the ADE and the CxPs during service negotiation:
- REQUEST STATUS (ADE to CxP): the ADE invokes the contextual information to be extracted by the CxPs.
- METADATA (CxP to ADE): the contextual information is sent by the respective CxPs.
- UPDATED (ADE to CxP): the ADE sends an acknowledgement message saying that the contextual information has been received correctly.]

Figure 11.52 Message exchange between CxP and ADE during service negotiation
architecture of this platform, in which the interfaces between modules are illustrated. For this distributed environment, the exchange of messages is addressed using SOAP, a simple and extensible Web Service protocol.

11.6.9.1 ADE–CxP Interface

To obtain low-level context information, the ADE can either query the CxPs or listen for events sent by the CxPs, depending on the service status. During service negotiation, the ADE queries the CxPs as shown in Figure 11.52. The received contextual information is formatted in standard MPEG-21 DIA UED descriptors, and registered in the ontology model. After the service is launched, however, the CxPs work in a "push" model, notifying the ADE when new context is available via the context update message illustrated in Figure 11.53. In this way, the ADE is able to react to any significant change in context and adjust the adaptation parameters accordingly, in order to maximize user satisfaction under the new usage environment conditions.

[Figure 11.53 shows the messages exchanged after the service is launched: the CxP sends a CONTEXT EVENT UPDATE carrying the new contextual information to the ADE, and the ADE replies with an UPDATED acknowledgement saying that the contextual information has been received correctly.]

Figure 11.53 Message exchange between CxP and ADE after the service is launched

11.6.9.2 AE–ADE Interface

While designing the interface between the AES and the ADE, factors such as the ability to have multiple AESs operating within the system, and their ability to join, leave, and migrate seamlessly, are considered. In order to provide this flexibility, a dedicated system initialization phase, initiated by the AES, is introduced. The sequence of messages exchanged between the ADE and AE during system initialization is illustrated in Figure 11.54. The parameters required at this stage are mostly related to the AES's capabilities. Therefore, the AES reports its adaptation capabilities and the necessary metadata related to those capabilities, for example maximum and minimum bit rates, maximum spatial resolution, and so on, along with the registration request message. In order to conclude the registration in the ADE database, the AES should also inform the ADE of its IP address and service identifier. Once the registration phase is completed, the AES is ready to perform adaptation operations when the ADE requests them.

Figure 11.54 Message exchange between AE and ADE during service negotiation

After making an adaptation decision, the ADE invokes the service initialization operation shown in Figure 11.55. During service initialization, the ADE instructs the AESs to invoke the selected adaptation operations on the original DI and forward the adapted DI to the user. This request also contains the related adaptation parameters, including the source digital item (DI), the desired adaptation operations, and the associated metadata.
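The registration request described above might be mocked as follows. The dictionary stands in for the SOAP payload actually exchanged, and every field name here is an illustrative assumption rather than the platform's real schema (the values echo Table 11.12; the service identifier "AES-01" is invented).

```python
# Mock of the AES registration request sent to the ADE during system
# initialization (cf. Figure 11.54 and Table 11.12). A dict stands in
# for the SOAP message; all keys are hypothetical names.

def build_registration_request(ip_address, service_id, capabilities):
    return {
        "message": "REGISTRATION_REQUEST",
        "ip_address": ip_address,      # lets the ADE conclude registration
        "service_id": service_id,      # identifies this AES at the ADE
        "capabilities": capabilities,  # scaling limits, transmoding, etc.
    }

request = build_registration_request(
    "202.145.2.98", "AES-01",
    {
        "spatial_scaling": {"max": (720, 560), "min": (16, 16)},
        "temporal_scaling": {"max_fps": 50, "min_fps": 0},
        "bitrate_scaling": True,
        "roi_cropping": {"max": (720, 560), "min": (16, 16)},
    },
)
```

Once such a request is stored in the ADE database, the ADE can match adaptation decisions against the advertised capability limits before dispatching work to an AES.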
Figure 11.55 Message exchange between AE and ADE during service initialization
[Figure 11.56 shows the messages exchanged between the ADE and the AA:
- AUTHORIZATION REQUEST (ADE to AA): sent by the ADE to request information about the adaptation authorization associated with a content and User.
- AUTHORIZATION RESPONSE (AA to ADE): the AA provides the authorization information, a list of permitted adaptation operations and associated constraints.
- RECEIVED (ADE to AA): an acknowledgement message sent by the ADE.]

Figure 11.56 Message exchange between AA and ADE during service negotiation
11.6.9.3 ADE–AA Interface

The ADE's content adaptation decision is preceded by an authorization request, which identifies the User that consumes the adapted content and, by its DI identifier, the multimedia resource that is going to be adapted. The sequence of messages exchanged between the AA and the ADE to obtain the permitted adaptation operations is illustrated in Figure 11.56. Once the AA has received the authorization request from the ADE, it responds with all the adaptation-related information contained in the license associated with the referred multimedia resource and User. This information includes the permitted adaptation operations, as well as the adaptation constraints associated with those operations. Both the permitted adaptation operations and the related constraints are expressed in a format compatible with MPEG-21 DIA.

11.6.9.4 An Example Use Case

This subsection details the sequence of messages exchanged between the modules of the content adaptation platform, based on an example use case. In the selected use case, a student wishes to attend a virtual classroom session using their PDA over a 3G mobile network. The minimum bandwidth required to receive the best-quality multimedia virtual classroom material is 1 Mbps, and the associated video is of VGA (640 × 480 pixels) resolution. The sequence of messages transferred between modules to address the content adaptation needs of this use case is summarized in the sequence chart shown in Figure 11.57. During the system initialization phase, newly-commissioned AESs inform the ADE of their capabilities and required parameters. An example of the registration request message is shown in Table 11.12.
Figure 11.57 Sequence chart of messages exchanged between each module. Reproduced by Permission of © 2008 IIMC Ltd and ICT Mobile Summit
In the selected use case, the virtual classroom administrator notifies the ADE when the student joins the virtual classroom session. Before taking the adaptation decision, the ADE needs to detect and extract the context information about the terminal capabilities, network conditions, and the surrounding environment. Therefore, the ADE queries the CxPs for context information during the service negotiation phase. Assuming that the CxP responsible for providing terminal capabilities and environment conditions is the user terminal, and the one

Table 11.12 Contents of the AES registration request message

Parameters                      Metadata
Time                            Tue, 29 Jan 2008 15:31:55
Multimedia content identifier   MPEG-21 DI
IP address                      202.145.2.98
Service identifier              Service ID of the AES

Capabilities
Spatial resolution scaling      Scalable; maximum spatial resolution = 720 × 560 pixels; minimum spatial resolution = 16 × 16 pixels
Temporal resolution scaling     Scalable; maximum temporal resolution = 50 fps; minimum temporal resolution = 0
Bit-rate scaling                Scalable
ROI cropping                    Maximum cropping window size = 720 × 560 pixels; minimum cropping window size = 16 × 16 pixels
Present subtitles (audio-to-text transmoding)   Transmoding languages = English, Spanish, Portuguese; display font sizes = small, medium, large; display position = adaptive, coordinates on the display
Table 11.13 Context information received from the terminal

Parameters                      Metadata
Time                            Tue, 29 Jan 2008 15:32:58
Context provider identifier     Terminal ID
Terminal capabilities           Display size (height × width) = QCIF (176 × 144 pixels)
Terminal capabilities           Maximum frame rate = 25 fps
Terminal capabilities           BatteryTimeRemaining = 15 minutes
Terminal capabilities           Codecs supported = MPEG-4, H.264/AVC
responsible for providing the network conditions is the network service provider, the contents of the context information messages received are listed in Tables 11.13 and 11.14. Moreover, the ADE requires information on the permitted adaptation operations on the virtual classroom materials for the particular user, and hence it also queries the AA. The contents of the adaptation authorization message received from the AA in response to the ADE's query are listed in Table 11.15, and the MPEG-21 DIA-formatted message is shown in Table 11.16. Based on the context information, it is clear that the particular user under consideration is using a terminal device with a small display (i.e. the PDA). As a result, the ADE decides to downscale the video resolution to 160 × 120 pixels, which is well within the authorized minimum resolution specified by the AA. Meanwhile, the ADE realizes that the remaining battery power level of the terminal (i.e. the PDA) is not adequate for presenting the entire lecture session at the highest possible fidelity. Moreover, the available network bandwidth is much less than the data rate required to deliver the audiovisual material at its best quality. Responding to these constraints, the ADE decides to decrease the temporal resolution and the

Table 11.14 Context information received from the network service provider

Parameters                      Metadata
Time                            Tue, 29 Jan 2008 15:32:58
Context provider identifier     Network service provider's ID
Network conditions              Available bandwidth = 128 kbps

Table 11.15 Adaptation authorization response

Parameters                      Metadata
Time                            Tue, 29 Jan 2008 15:32:58
Multimedia content identifier   MPEG-21 DI
Type of user identifier         MPEG-21 KeyHolder

Possible adaptation operations
Spatial resolution scaling      Minimum spatial resolution = 150 × 100 pixels
Temporal resolution scaling     Minimum temporal resolution = 10 fps
Bit-rate scaling                Minimum nominal bit rate = 30 kbps
Table 11.16 An extract from an MPEG-21 DIA authorization response. Reproduced by Permission of © 2007 IEEE

[The XML extract is not reproduced here; it conveys the authorized minima listed in Table 11.15, including the minimum spatial resolution (150 × 100 pixels) and the minimum nominal bit rate (30 000 bps).]
Table 11.17 Adaptation parameters

Parameters                      Metadata
Time                            Tue, 29 Jan 2008 15:32:58
Multimedia content identifier   MPEG-21 DI
Display size                    QCIF (176 × 144 pixels)

Required adaptation operations
Spatial resolution scaling      Before adaptation = VGA (640 × 480); after adaptation = Quarter QVGA (160 × 120)
Temporal resolution scaling     Before adaptation = 25 fps; after adaptation = 12.5 fps
Bit-rate scaling                Before adaptation = 1 Mbps; after adaptation = 128 kbps
fidelity of the visual content in order to minimize the processor utilization and the required bandwidth. Once the ADE has taken the adaptation decision, it contacts the appropriate AESs to initiate the adaptation operations during the service initialization phase. A list of the adaptation parameters conveyed to the AESs to address the adaptation requirements of the selected use case is shown in Table 11.17. During the in-service phase, the ADE keeps monitoring the dynamics of the context through the context update information received from the CxPs. If it detects any significant change that affects the user's satisfaction, it reviews the adaptation options. If there is a better set of adaptation operations, the ADE reconfigures the AESs accordingly through another service initiation operation.
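The decision taken in this use case can be sketched as a few clipping rules against the gathered context and the minima authorized by the AA. This is illustrative logic only, not the ontology-based reasoning of the actual ADE; the divide-by-four downscale and halved frame rate are assumptions chosen to mirror Tables 11.13 to 11.17.

```python
# Hedged sketch of the adaptation decision for the example use case: the
# target parameters are clipped against the terminal capabilities, the
# available bandwidth, and the minima authorized by the AA. Illustrative
# logic only, not the actual ontology-based ADE implementation.

def decide_adaptation(display, max_fps, bandwidth_kbps, authorized_min, source):
    """Return (width, height, fps, bitrate_kbps) to be sent to the AES."""
    # Downscale to suit the small display, but never below the AA minimum.
    width = max(min(source["width"] // 4, display[0]), authorized_min["width"])
    height = max(min(source["height"] // 4, display[1]), authorized_min["height"])
    # Halve the frame rate to save battery, respecting the authorized floor.
    fps = max(min(source["fps"], max_fps) / 2, authorized_min["fps"])
    # Fit the bit rate to the access link, respecting the authorized floor.
    bitrate = max(min(source["bitrate_kbps"], bandwidth_kbps),
                  authorized_min["bitrate_kbps"])
    return width, height, fps, bitrate

# Use-case values: VGA/25 fps/1 Mbps source, QCIF display, 128 kbps link.
params = decide_adaptation(
    display=(176, 144), max_fps=25, bandwidth_kbps=128,
    authorized_min={"width": 150, "height": 100, "fps": 10,
                    "bitrate_kbps": 30},
    source={"width": 640, "height": 480, "fps": 25, "bitrate_kbps": 1000},
)
# params == (160, 120, 12.5, 128), matching Table 11.17.
```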
11.7 Conclusions

This chapter has presented comprehensive state-of-the-art discussions on various technologies and trends that are being studied and used to develop context-aware content adaptation systems. Apart from a few references included and the standardization-related efforts described, it has been noted that there is not much work published on adaptation decision implementations. Nevertheless, a thorough analysis of the different elements and factors that drive the adaptation decision process has been made, as well as an investigation of the current trends in how to use those elements, notably the low-level contextual information. There is a clear trend towards merging the high-level semantic world with the adaptation decision process. Major research efforts point to the use of ontologies and Semantic Web technologies to accomplish that goal. The aim is to be able to use automatically-generated low-level contextual information to infer higher-level concepts that better describe real-world situations. Subsequently, several aspects related to gathering requirements in terms of contextual information, forms of representing and exchanging it, and identification of the entities and mechanisms involved in the processing of contextual information, together with the management of digital rights and the provision for authorization of the adaptation of media contents, have been presented with a view to a defined application scenario. A number of resource adaptation algorithms have
been described, which would operate in harmony with the decision and authorization operations so as to maximize users' experiences. It has been shown that adaptation decision-taking operations can greatly benefit from the use of ontologies for deciding the best available resource adaptation method for the context of the usage environment. In line with the aforementioned discussions, the concepts and architecture of a scalable platform for context-aware content adaptation have also been described in this chapter; these are especially suited to virtual collaboration applications. The discussions have particularly focused on providing a consolidated view of the current research panorama in the area of context-aware adaptation and on identifying directions to follow. In particular, a number of relevant topics have been presented: identifying real-world situations that would benefit from the use of context and of adaptation operations; identifying relevant contextual information; formulating approaches to specify context profiles; addressing DRM during adaptation; and identifying the functionalities needed to build generic context-aware content adaptation systems. Furthermore, the interfaces between the modules of the content adaptation platform and the sequence of events that take place while performing DRM-based adaptation for a particular user or group of users in specific situations have also been presented. The innovative character of the adaptation platform under consideration has been brought out by simultaneously addressing different aspects concerning the delivery of networked multimedia content. This has been highlighted in the platform by combining the use of ontologies and low-level context to drive the adaptation decision process, verifying and enforcing usage rights within the adaptation operations, incorporating multi-faceted AEs, and being able to deliver, on the fly and on demand, the different adaptation operations that suit different dynamic requirements. It is envisaged that the definition of such a modular architecture with well-defined and standards-based interfaces will greatly contribute to the interoperability and scalability of future content delivery systems.
References

[1] B. Schilit, N. Adams, and R. Want, "Context-aware computing applications," Proc. IEEE Workshop on Mobile Computing Systems and Applications, Santa Cruz, USA, pp. 85–90, Dec. 1994.
[2] S. Pokraev, P.D. Costa, G.P. Filho, and M. Zuidweg, "Context-aware services: state-of-the-art," Telematica Instituut, Tech. Rep. TI/RS/2003/137, 2003.
[3] A.K. Dey, "Providing architectural support for building context-aware applications," PhD Thesis, College of Computing, Georgia Institute of Technology, Atlanta, GA, 2000.
[4] D. Rios, P.D. Costa, G. Guizzardi, L.F. Pires, J. Gonçalves, P. Filho, and M. van Sinderen, "Using ontologies for modeling context-aware services platforms," Proc. Workshop on Ontologies to Complement Software Architectures, Anaheim, CA, Oct. 2003.
[5] "Web Ontology Language (OWL): overview," W3C Recommendation, Feb. 2004, http://www.w3.org/TR/owl-features/.
[6] P.T.E. Yeow, "Context mediation: ontology modeling using Web Ontology Language (OWL)," Composite Information Systems Laboratory (CISL), Massachusetts Institute of Technology, Cambridge, MA, Working Paper CISL# 2004-11, Nov. 2004.
[7] D. Preuveneers, J. van der Bergh, D. Wagelaar, A. Georges, P. Rigole, T. Clerckx, et al., "Towards an extensible context ontology for ambient intelligence," Proc. Second European Symposium on Ambient Intelligence, Eindhoven, the Netherlands, Vol. 3295 of LNCS, pp. 148–159, Nov. 2004.
[8] J.-Z. Sun and J. Sauvola, "Towards a conceptual model for context-aware adaptive services," Proc. Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies, Sichuan, China, pp. 90–94, Aug. 2003.
[9] G.D. Abowd, M.R. Ebling, H.W. Gellersen, G. Hunt, and H. Lei, "Context-aware pervasive computing," IEEE Wireless Commun. J., Vol. 9, No. 5, pp. 8–9, Oct. 2002.
[10] T. Strang, C. Linnhoff-Popien, and K. Frank, "CoOL: a context ontology language to enable contextual interoperability," Proc. 4th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems, Vol. 2893 of LNCS, pp. 236–247, 2003.
[11] M. Keidl and A. Kemper, "Towards context-aware adaptable Web services," Proc. 13th World Wide Web Conference, NY, pp. 55–65, 2004.
[12] T. Chaari, F. Laforest, and A. Celentano, "Adaptation in context-aware pervasive information systems: the SECAS project," Int. J. of Pervasive Computing and Commun., Vol. 2, No. 2, Jun. 2006.
[13] D. Preuveneers, Y. Vandewoude, P. Rigole, D. Ayed, and Y. Berbers, "Research challenges in mobile and context-aware service development," Adjunct Proceedings of the 4th International Conference on Pervasive Computing, Dublin, Ireland, pp. 125–128, May 2006.
[14] D. Preuveneers, Y. Vandewoude, P. Rigole, D. Ayed, and Y. Berbers, "Context-aware adaptation for component-based pervasive computing system," Proc. 4th International Conference on Pervasive Computing, Dublin, Ireland, pp. 125–128, May 2006.
[15] http://www.cs.kuleuven.be/davy/cogito.php.
[16] H.L. Chen, "An intelligent broker architecture for pervasive context-aware systems," PhD Thesis, Department of Computer Science and Electrical Engineering, University of Maryland, College Park, MD, 2004.
[17] A.K. Dey, "The context toolkit: a toolkit for context-aware applications," http://www.cs.berkeley.edu/dey/context.html.
[18] A.K. Dey, "Supporting the construction of context-aware applications," http://www.vs.inf.ethz.ch/events/dag2001/slides/anind.pdf.
[19] T. Gu, H.K. Pung, and D.Q. Zhang, "A service-oriented middleware for building context aware services," Elsevier Journal of Network and Computer Applications, Vol. 28, No. 1, pp. 1–18, Jan. 2005.
[20] T. Gu, X.H. Wang, H.K. Pung, and D.Q. Zhang, "An ontology-based context model in intelligent environments," Proc. Communication Networks and Distributed Systems Modeling and Simulation Conference, San Diego, CA, pp. 270–275, Jan. 2004.
[21] "MIT Project Oxygen: overview," http://oxygen.csail.mit.edu/Overview.html.
[22] P. Debaty, P. Goddi, and A. Vorbau, "Integrating the physical world with the Web to enable context-enhanced services," HP, Tech. Rep. HPL-2003-192, 2003.
[23] S. Yoshihama, P. Chou, and D. Wong, "Managing behaviour of intelligent environments," Proc. IEEE International Conference on Pervasive Computing and Communications, Dallas/Fort Worth, TX, Mar. 2003.
[24] S.-F. Chang and A. Vetro, "Video adaptation: concepts, technologies and open issues," Proc. IEEE, Vol. 93, No. 1, pp. 148–158, Jan. 2005.
[25] S.-F. Chang, T. Sikora, and A. Puri, "Overview of the MPEG-7 standard," IEEE Trans. Circuits Syst. Video Technol., special issue on MPEG-7, Vol. 11, No. 6, pp. 688–695, Jun. 2001.
[26] "Multimedia Content Description Interface – Part 5: Multimedia Description Scheme," ISO/IEC Standard, MPEG-7 Part 5: ISO/IEC 15938-5, Oct. 2001.
[27] A. Vetro, C. Christopoulos, and H. Sun, "Video transcoding architectures and techniques: an overview," IEEE Signal Processing Mag., Vol. 20, No. 2, pp. 18–29, Mar. 2003.
[28] J. Xin, C.-W. Lin, and M.-T. Sun, "Digital video transcoding," Proc. IEEE, Vol. 93, No. 1, pp. 84–97, Jan. 2005.
[29] I. Ahmad, X. Wei, Y. Sun, and Y.-Q. Zhang, "Video transcoding: an overview of various techniques and research issues," IEEE Trans. Multimedia, Vol. 7, No. 5, pp. 793–804, Oct. 2005.
[30] "Information Technology: Multimedia Framework (MPEG-21): Part 1: Vision, Technologies and Strategy," ISO/IEC Standard, ISO/IEC TR 21000-1:2004, Nov. 2004.
[31] I. Burnett, F. Pereira, R. Walle, and R. Koenen, The MPEG-21 Book, John Wiley and Sons, Ltd., 2006.
[32] "Information Technology – Multimedia Framework (MPEG-21) – Part 7: Digital Item Adaptation," ISO/IEC Standard, ISO/IEC 21000-7:2004, Oct. 2004.
[33] F. Pereira et al., IEEE Signal Processing Mag., special issue on universal multimedia access, Mar. 2003.
[34] K. Karpouzis, I. Maglogiannis, E. Papaioannou, D. Vergados, and A. Rouskas, "Integrating heterogeneous multimedia content for personalized interactive entertainment using MPEG-21," Computer Communications: Special Issue on Emerging Middleware for Next Generation Networks, 2006.
[35] A. Vetro, "MPEG-21 digital item adaptation: enabling universal multimedia access," IEEE Multimedia J., Vol. 11, No. 1, pp. 84–87, Jan.–Mar. 2004.
[36] P. Carvalho, "Multimedia content adaptation for universal access," MSc Thesis, Faculdade de Engenharia da Universidade do Porto, 2004.
[37] R. Mohan, J. Smith, and C.S. Li, "Adapting multimedia Internet content for universal access," IEEE Trans. Multimedia, Vol. 1, No. 1, pp. 104–114, Mar. 1999.
[38] S.-F. Chang, "Optimal video adaptation and skimming using a utility-based framework," Proc. Tyrrhenian International Workshop on Digital Communications, Capri Island, Italy, Sep. 2002.
[39] H. Hellwagner, "DANAE," QoS Concertation Meeting (CG-2), Dec. 2004.
[40] "DANAE executive summary," Nov. 2005, http://danae.rd.francetelecom.com.
[41] "ENTHRONE," http://www.enthrone.org/.
[42] M.T. Andrade, P.F. Souto, P.M. Carvalho, B. Feiten, and I. Wolf, "Enhancing the access to networked multimedia content," Proc. 7th International Workshop on Image Analysis for Multimedia Interactive Services, Seoul, South Korea, Apr. 2006.
[43] "Information technology: MPEG-21 multimedia framework," Aug. 2006, http://mpeg-21.itec.uni-klu.ac.at/cocoon/mpeg21/_mpeg21Demo.HTML.
[44] F. Pereira et al., Signal Process.: Image Commun. J., special issue on multimedia adaptation, Mar. 2003.
[45] "UAProf, user agent profile," Open Mobile Alliance (OMA), OMA-TS-UAProf-V2_0-20060206-A, 2006.
[46] "Resource description framework (RDF): concepts and abstract syntax," W3C Recommendation, Feb. 2004.
[47] "Composite capability/preference profiles (CC/PP): structure and vocabularies 1.0," W3C Recommendation, Jan. 2003.
[48] M.H. Butler, "Current technologies for device independence," Hewlett Packard Laboratories, Bristol, UK, Tech. Rep. HPL-2001-83, Apr. 2001, http://www.hpl.hp.com/techreports/2001/HPL-2001-83.pdf.
[49] M.H. Butler, "Implementing content negotiation using CC/PP and WAP UAProf," Hewlett Packard Laboratories, Bristol, UK, Tech. Rep. HPL-2001-190, Aug. 2001, http://www.hpl.hp.com/techreports/2001/HPL-2001-190.pdf.
[50] J. Gilbert and M. Butler, "Using multiple namespaces in CC/PP and UAProf," Hewlett Packard Laboratories, Bristol, UK, Tech. Rep. HPL-2003-31, Jan. 2003, http://www.hpl.hp.com/techreports/2003/HPL-2003-31.pdf.
[51] X. Wang, T. DeMartini, B. Wragg, M. Paramasivam, and C. Barlas, "The MPEG-21 rights expression language and rights data dictionary," IEEE Trans. Multimedia, Vol. 7, No. 3, pp. 408–416, Jun. 2005.
[52] E. Rodríguez, "Standardization of the protection and governance of multimedia content," PhD Thesis, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, 2006.
[53] V. Torres, E. Rodríguez, S. Llorente, and J. Delgado, "Use of standards for implementing a multimedia information protection and management system," Proc. 1st International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, Florence, Italy, pp. 197–204, Nov.–Dec. 2005.
[54] A. Vetro and C. Timmerer, "Digital item adaptation: overview of standardization and research activities," IEEE Trans. Multimedia, Vol. 7, No. 3, Jun. 2005.
[55] C. Timmerer and H. Hellwagner, "Interoperable adaptive multimedia communication," IEEE Multimedia Mag., Vol. 12, No. 1, pp. 74–79, Jan.–Mar. 2005.
[56] L. Rong and I. Burnett, "Dynamic multimedia adaptation and updating of media streams with MPEG-21," Proc. 1st IEEE Conference on Consumer Communications and Networking, pp. 436–441, Jan. 2004.
[57] J. Bormans and K. Hill, "MPEG-21 Overview v.5," ISO/IEC JTC1/SC29/WG11, Tech. Rep. JTC1/SC29/WG11/N5231, Oct. 2002, http://www.chiariglione.org/mpeg/standards/mpeg-21/mpeg-21.htm.
[58] "DRM specification (Candidate Version 2.0)," OMA, OMA-TS-DRM-DRM-V2_0-20050614-C, Sep. 2005, http://www.openmobilealliance.org/release_program/docs/DRM/V2_0-20050915-C/OMA-TS-DRM-DRM-V2_0-20050915-C.pdf.
[59] "TV-Anytime," http://www.tv-anytime.org/.
[60] "Windows Media digital rights management," http://www.microsoft.com/windows/windowsmedia/forpros/drm/default.mspx.
[61] "XrML," http://www.xrml.org/.
[62] "ContentGuard," http://www.contentguard.com/.
[63] "Open Digital Rights Language Initiative," Aug. 2007, http://odrl.net.
[64] "International property & real estate systems," http://www.iprsystems.com/.
[65] "The Digital Property Rights Language," Xerox Corporation, Manual and Tutorial, Nov. 1998, http://xml.coverpages.org/DPRLmanual-XML2.html.
[66] "Information technology: multimedia framework (MPEG-21): part 5: rights expression language," ISO/IEC standard, ISO/IEC 21000-5:2004, Mar. 2004.
[67] "Information technology: multimedia framework (MPEG-21): part 6: rights data dictionary," ISO/IEC standard, ISO/IEC 21000-6:2004, May 2004.
[68] "Information technology: multimedia framework (MPEG-21): part 7: digital item adaptation, AMENDMENT 1: DIA conversions and permissions," ISO/IEC standard, ISO/IEC 21000-7:2004/FPDAM 1:2005(E), Jan. 2005.
[69] "ADMITS: adaptation in distributed multimedia IT systems," May 2004, http://admits-itec.uni-klu.ac.at.
[70] "DAIDALOS," Dec. 2005, http://www.ist-daidalos.org.
[71] "aceMedia," Dec. 2006, http://www.acemedia.org.
[72] "AXMEDIS," 2007, http://www.axmedis.org.
[73] "Projecte Integrat," Dec. 2005, http://www.i2cat.cat/i2cat/servlet/I2CAT.MainServlet?seccio=21_33.
[74] C. Timmerer, T. DeMartini, and H. Hellwagner, "The MPEG-21 multimedia framework: conversions and permissions," Department of Information Technology (ITEC), Klagenfurt University, Klagenfurt, Tech. Rep. TR/ITEC/06/1.03, Mar. 2006.
[75] "SOAP version 1.2 part 1: messaging framework (second edition)," W3C Recommendation, Apr. 2007, http://www.w3.org/TR/soap12-part1/.
[76] S. Dogan, A.H. Sadka, and M.T. Andrade, "Development and performance evaluation of a shared library of visual format conversion tools," VISNET I FP6 NoE deliverable D17, Nov. 2005.
[77] Y. Wang, J.-G. Kim, S.-F. Chang, and H.-M. Kim, "Utility-based video adaptation for universal multimedia access (UMA) and content-based utility function prediction for real-time video transcoding," IEEE Trans. Multimedia, Vol. 9, No. 2, pp. 213–220, Feb. 2007.
[78] M. Kropfberger, K. Leopold, and H. Hellwagner, "Quality variations of different priority-based temporal video adaptation algorithms," Proc. IEEE 6th Workshop on Multimedia Signal Processing, pp. 183–186, Oct. 2004.
[79] G.M. Muntean, P. Perry, and L. Murphy, "Quality-oriented adaptation scheme for video-on-demand," Electronics Letters, Vol. 39, No. 23, pp. 1689–1690, 2003.
[80] H. Shu and L.P. Chau, "Frame-skipping transcoding with motion change consideration," Proc. IEEE International Symposium on Circuits and Systems, Vancouver, Canada, Vol. 3, pp. 773–776, May 2004.
[81] J.N. Hwang, T.D. Wu, and C.W. Lin, "Dynamic frame skipping in video transcoding," Proc. IEEE Workshop on Multimedia Signal Processing, Redondo Beach, CA, pp. 616–621, Dec. 1998.
[82] A. Doulamis and G. Tziritas, "Content-based video adaptation in low/variable bandwidth communication networks using adaptable neural network structures," Proc. International Joint Conference on Neural Networks, Vancouver, Canada, Jul. 2006.
[83] Y. Wang, S.F.C. Alexander, and C. Loui, "Subjective preference of spatio-temporal rate in video adaptation using multi-dimensional scalable coding," Proc. IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, 2004.
[84] W.H. Cheng, C.W. Wang, and J.L. Wu, "Video adaptation for small display based on content recomposition," IEEE Trans. Circuits Syst. Video Technol., Vol. 17, No. 1, pp. 43–58, 2007.
[85] M. Bettini, R. Cucchiara, A.D. Bimbo, and A. Prati, "Content-based video adaptation with users preferences," Proc. IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, Vol. 3, pp. 1695–1698, Jun. 2004.
[86] M. Xu, J. Li, Y. Hu, L.T. Chia, B.S. Lee, D. Rajan, and J. Cai, "An event-driven sports video adaptation for the MPEG-21 DIA framework," Proc. 2006 IEEE International Conference on Multimedia and Expo, Ontario, Canada, pp. 1245–1248, Jul. 2006.
[87] T. Warabino, S. Ota, D. Morikawa, M. Ohashi, H. Nakamura, H. Iwashita, and F. Watanabe, "Video transcoding proxy for 3G wireless mobile Internet access," IEEE Commun. Mag., Vol. 38, No. 10, pp. 66–71, Oct. 2000.
[88] M.A. Bonuccelli, F. Lonetti, and F. Martelli, "Temporal transcoding for mobile video communication," Proc. 2nd Annual International Conference on Mobile and Ubiquitous Systems: Networking and Services, San Diego, CA, pp. 502–506, Jul. 2005.
[89] S. Dogan, S. Eminsoy, A.H. Sadka, and A.M. Kondoz, "Personalized multimedia services for real-time video over 3G mobile networks," Proc. IEE International Conference on 3G Mobile Communication Technologies, London, UK, pp. 366–370, May 2002.
[90] Z. Lei and N.D. Georganas, "Video transcoding gateway for wireless video access," Proc. IEEE Canadian Conference on Electrical and Computer Engineering, Montreal, Canada, Vol. 3, pp. 1775–1778, May 2003.
[91] E. Amir, S. McCanne, and R.H. Katz, "An active service framework and its application to real-time multimedia transcoding," Proc. ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, Vancouver, Canada, pp. 178–189, Aug.–Sep. 1998.
[92] M. Ott, G. Welling, S. Mathur, D. Reininger, and R. Izmailov, "The JOURNEY active network model," IEEE J. Sel. Areas Commun., Vol. 19, No. 3, pp. 527–536, Mar. 2001.
[93] K.B. Shimoga, "Region-of-interest based video image transcoding for heterogeneous client displays," Proc. Packet Video Workshop, Pittsburgh, PA, Apr. 2002.
[94] A. Sinha, G. Agarwal, and A. Anbu, "Region-of-interest based compressed domain video transcoding scheme," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, Vol. 3, pp. 161–164, May 2004.
[95] A. Vetro, H. Sun, and Y. Wang, "Object-based transcoding for adaptable video content delivery," IEEE Trans. Circuits Syst. Video Technol., Vol. 11, No. 3, pp. 387–401, Mar. 2001.
[96] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Machine Intell., Vol. 20, No. 11, pp. 1254–1259, Nov. 1998.
[97] W. Osberger and A.J. Maeder, "Automatic identification of perceptually important regions in an image," Proc. IEEE International Conference on Pattern Recognition, Brisbane, Australia, Aug. 1998.
[98] L.Q. Chen, X. Xie, X. Fan, W.Y. Ma, H.J. Zhang, and H.Q. Zhou, "A visual attention model for adapting images on small displays," ACM Multimedia Syst. J., Vol. 9, No. 4, pp. 353–364, Oct. 2003.
[99] D.S. Cruz, T. Ebrahimi, M. Larsson, J. Askelof, and C. Christopoulos, "Region of interest coding in JPEG 2000 for interactive client/server applications," Proc. IEEE International Workshop on Multimedia Signal Processing, Copenhagen, Denmark, pp. 389–394, Sep. 1999.
[100] A.P. Bradley and F.W.M. Stentiford, "Visual attention for region of interest coding in JPEG 2000," Journal of Visual Communication and Image Representation, Vol. 14, No. 3, pp. 232–250, Sep. 2003.
[101] L. Favalli, A. Mecocci, and F. Moschetti, "Object tracking for retrieval applications in MPEG-2," IEEE Trans. Circuits Syst. Video Technol., Vol. 10, No. 3, pp. 468–476, Jun. 1997.
[102] K. Yoon, D. DeMenthon, and D. Doermann, "Event detection from MPEG video in the compressed domain," Proc. IEEE International Conference on Pattern Recognition, Barcelona, Spain, Sep. 2000.
[103] B.S. Manjunath, J.R. Ohm, V.V. Vasudevan, and A. Yamada, "Color and texture descriptors," IEEE Trans. Circuits Syst. Video Technol., Vol. 11, No. 6, pp. 703–715, Jun. 2001.
[104] G. Agarwal, A. Anbu, and A. Sinha, "A fast algorithm to find the region-of-interest in the compressed MPEG domain," Proc. IEEE International Conference on Multimedia and Expo, Baltimore, MD, Vol. 2, pp. 133–136, Jul. 2003.
[105] V. Mezaris, I. Kompatsiaris, and M.G. Strintzis, "Compressed-domain object detection for video understanding," Proc. 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisbon, Portugal, Apr. 2004.
[106] M. Soleimanipour, W. Zhuang, and G.H. Freeman, "Modeling and resource allocation in wireless multimedia CDMA systems," Proc. 48th IEEE Vehicular Technology Conference, Vol. 2, pp. 1279–1283, May 1998.
[107] S. Zhao, Z. Xiong, and X. Wang, "Optimal resource allocation for wireless video over CDMA networks," Proc. International Conference on Multimedia and Expo, Vol. 1, pp. 277–280, Jul. 2003.
[108] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), 23rd Meeting, San Jose, CA, Input Doc., Apr. 2007.
[109] D.D. Schrijver, W.D. Neve, D.V. Deursen, S.D. Bruyne, and R. van der Walle, "Exploitation of interactive region of interest scalability in scalable video coding by using an XML-driven adaptation framework," Proc. 2nd International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, Leeds, UK, pp. 223–231, Dec. 2006.
[110] "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264, Mar. 2005.
[111] M.H. Lee, H.W. Sun, D. Ichimura, Y. Honda, and S.M. Shen, "ROI slice SEI message," Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), Geneva, Switzerland, Input Doc. JVT-S054, Apr. 2006.
[112] "Applications and requirements for scalable video coding," Tech. Rep. ISO/IEC JTC1/SC29/WG11/N6880, Jan. 2005.
[113] P. Lambert, D.D. Schrijver, D.V. Deursen, W.D. Neve, Y. Dhondt, and R. van der Walle, "A real-time content adaptation framework for exploiting ROI scalability in H.264/AVC," Lecture Notes in Computer Science, Springer, Germany, pp. 442–453, 2006.
[114] M.T. Andrade, P. Bretillon, H. Castro, P. Carvalho, and B. Feiten, "Context-aware content adaptation: a systems approach," Proc. European Symposium on Mobile Media Delivery, Sardinia, Italy, Sep. 2006.
[115] F. Pereira and I. Burnett, "Universal multimedia experiences for tomorrow," IEEE Signal Processing Mag., Vol. 20, pp. 63–73, Mar. 2003.
[116] M.T. Andrade, H.K. Arachchi, S. Nasir, S. Dogan, H. Uzuner, A.M. Kondoz, et al., "Using context to assist the adaptation of protected multimedia content in virtual collaboration applications," Proc. 3rd IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing, New York, NY, Nov. 2007.
[117] A. Farquhar, R. Fikes, W. Pratt, and J. Rice, "Collaborative ontology construction for information integration," Stanford University, Tech. Rep. KSL-95-63, Aug. 1995.
[118] D. Reynolds, "Jena 2 inference support," Jul. 2007, http://jena.sourceforge.net/inference/.
[119] "Creative Commons," http://www.creativecommons.org/.
[120] J. Delgado and E. Rodríguez, "Towards the interoperability between MPEG-21 REL and Creative Commons licenses," ISO/IEC JTC 1/SC 29/WG 11/M13118, Montreux, Switzerland, Apr. 2006.
[121] T. Kim, J. Delgado, F. Schreiner, C. Barlas, and X. Wang, "Information technology: multimedia framework (MPEG-21): part 5: rights expression language, AMENDMENT 3: ORC (Open Release Content) profile: a proposal for ISO/IEC standardization," ISO/IEC 21000-5:2004/FPDAM 3, Apr. 2007.
[122] M. Bystrom and T. Stockhammer, "Dependent source and channel rate allocation for video transmission," IEEE Trans. Wireless Commun., Vol. 3, No. 1, pp. 258–268, Jan. 2004.
[123] P. Cherriman, E.L. Kuan, and L. Hanzo, "Burst-by-burst adaptive joint detection CDMA/H.263 based video telephony," IEEE Trans. Circuits Syst. Video Technol., Vol. 12, No. 5, pp. 342–348, May 2002.
[124] C. Kodikara, S. Worrall, S. Fabri, and A.M. Kondoz, "Performance evaluation of MPEG-4 video telephony over UMTS," Proc. 4th IEE International Conference on 3G Mobile Communication Technologies, London, UK, pp. 73–77, Jun. 2003.
[125] R.E. Dyck and D.J. Miller, "Transport of wireless video using separate, concatenated, and joint source-channel coding," Proc. IEEE, Vol. 87, No. 10, pp. 1734–1750, Oct. 1999.
[126] I.M. Kim and H.M. Kim, "An optimum power management scheme for wireless video service in CDMA systems," IEEE Trans. Wireless Commun., Vol. 2, No. 1, pp. 81–91, Jan. 2003.
[127] M.M. Hannuksela, Y.K. Wang, and M. Gabbouj, "Sub-picture: ROI coding and unequal error protection," Proc. IEEE International Conference on Image Processing, Rochester, NY, Vol. 3, pp. 537–540, Sep. 2002.
[128] "Coding of audio-visual objects: part 2: visual: second edition," ISO/IEC standard, ISO/IEC FDIS 14496-2 (MPEG-4 Visual), Dec. 2001.
[129] S. Worrall, S. Fabri, A. Sadka, and A. Kondoz, "Prioritization of data partitioned MPEG-4 over mobile networks," Europ. Trans. Telecom., Vol. 12, No. 3, May–Jun. 2001.
[130] "The SVT high definition multi format test set," ftp://vqeg.its.bldrdoc.gov/HDTV/SVT_MultiFormat.
[131] J. Vieron, M. Wien, and H. Schwarz, "JSVM 7 software," Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), 20th Meeting, Klagenfurt, Austria, Output Doc. JVT-T203, Jul. 2006.
Index
2 2D + t, 40, 41, 65, 66, 67, 69, 96 2D-to-3D conversion algorithms, 44 2G MSC, 231 2G SGSN, 231 3 3D, 1, 2, 3, 30, 36, 37, 39, 40, 44, 45, 46, 54, 56, 57, 58, 60, 62, 65, 69, 71, 80, 82, 85, 87, 88, 95, 96, 191, 192, 193, 200, 201, 202, 213, 230, 398, 471 3D video, 1, 2, 3, 30, 36, 39, 44, 45, 46, 56, 58, 62, 80, 85, 87, 88, 95, 96, 193, 213, 230, 398 3D wavelet, 39, 40, 69, 95, 96 3G, 2, 3, 4, 214, 221, 231, 260, 308, 313, 315, 341, 393, 394, 396, 525 3GPP, 3, 212, 220, 221, 233, 234, 237, 239, 256, 259, 260, 263, 266, 267, 268, 270, 271, 272, 277, 278, 279, 281, 284, 285, 297, 323, 341, 342, 343, 403, 413, 508 3GPP AMR, 221, 341 3rd Generation Partnership Project, 212, 239, 260, 267, 268, 277, 278, 284, 395, 508 6 64-QAM, 304, 305, 308, 312, 313 8 8-PSK schemes, 245, 246 A AA, 5, 434, 482, 493, 525, 527 AC coefficients, 13
Visual Media Coding and Transmission Ahmet Kondoz © 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-74057-6
Access links, 215, 345 ACELP, 221, 341 AceMEDIA, 460 Adaptation authorization, 459, 460, 481, 525, 527 Adaptation Decision Engine, 5, 488 Adaptation Engines Stack, 5, 495 Adaptation-resource-utility, 463, 464 Adaptive arithmetic codec, 51 Adaptive spreading gain control techniques, 379 Adaptive-QoS, 397, 398 ADE, 5, 434, 435, 453, 461, 464, 465, 466, 469, 481, 482, 487, 488, 492, 493, 495, 501, 512, 521, 523, 524, 525, 526, 527, 530 ADE framework on MPEG-21, 465 ADE Interface, 525 ADMITS, 459 ADTE, 465 Advanced forward error correction, 221, 222 Advanced video compression, 16 AE, 435, 448, 452, 466, 467, 468, 469, 470, 481, 495, 498, 500, 501, 502, 510, 511, 512, 515, 520, 523, 524 AES, 495, 500, 523, 524, 526 AIR, 318, 325, 347, 348, 350, 367, 372, 373, 374, 384, 395, 505, 508 AMC permutation, 306 AMR-WB, 221, 315, 341, 342, 343, 344 API, 442, 488 Application QoS, 399, 401, 402 AQoS, 446, 448, 450, 451, 453, 458, 464, 465, 488
AR(1) model, 118 ARE, 118 ASC, 439 ASC model, 439 Audio-Visual, 1 Autoregressive model, 117 AV, 450, 453, 468, 472, 477 AV content analysis, 468 AVC, 1, 2, 19, 28, 29, 30, 36, 40, 41, 42, 44, 58, 59, 60, 61, 62, 63, 64, 65, 73, 75, 79, 95, 96, 97, 112, 120, 127, 129, 131, 132, 137, 142, 152, 153, 154, 155, 156, 157, 158, 170, 173, 180, 186, 190, 192, 193, 200, 392, 468, 469, 495, 511, 512, 514, 517, 518, 520, 521, 522, 527 AWGN source, 241, 263, 278, 405, 406 B B frames, 17, 18, 38, 79 Base transceiver station, 98, 231 Bit Allocation, 153, 160, 162, 163, 164, 165 Bit Plane Extraction, 104 Bitstream Syntax Description Language, 449 BLER, 240, 241, 242, 243, 244, 245, 246, 248, 249, 250, 251, 277, 279, 280, 281, 284, 285, 287, 288, 291, 353, 354, 355, 366, 367, 374, 375, 376, 378, 384, 394, 423, 508, 509, 510 Block matching, 14, 36, 194 Block transform, 20 BlueSpace, 444 BSD, 451, 452, 520 BSD tools, 452 BSDL, 449, 452 BSS, 231, 233 BTS, 98, 231 Buffer Control, 153, 162 C CAE method, 48 Call admission control, 214, 397 Call request, 385, 386, 387, 389, 390 Carrier Frequency, 241, 392, 405, 406 Carrier-to-interference, 234, 245, 255, 263, 361, 376, 403 Catalan Integrated Project, 453 CC, 284, 286, 313, 322, 326, 327, 330, 332, 342, 353, 355, 383, 424, 425, 426, 427, 428, 429, 430, 431, 445, 455, 456, 493, 508, 509 CC/PP, 445, 455, 456 CC/PP example, 457
CCSR, 241, 242, 243, 245, 246, 247, 248, 279, 280, 281, 297, 299, 300, 336 CD, 28, 88, 466 CD-ROM, 28 CGS, 42 CIF, 37, 61, 68, 69, 70, 81, 82, 84, 88, 91, 161, 162, 168, 169, 506, 508, 511, 514, 515 CIR, 318, 325, 358, 359, 360, 361, 363, 364, 365, 366, 367, 368, 369, 370, 373, 374, 375, 376, 377, 378, 384 CMP, 112, 113 Coarse-grain scalability, 42 CoBrA, 442, 444 CoDAMoS, 441 Coding mode control, 160 CoGITO, 441 Committee draft, 88 Complementary alpha plane, 176 Complementary shape data, 175, 176, 177, 179 Complex FIR filter, 274 Composite capability profiles, 445 Computational power, 39 Conditional entropies, 101 Content Adaptation, 433, 466, 481, 482, 501, 510, 512, 522 Context Broker Architecture, 442 Context brokers, 440 Context classes, 486 Context definition, 436 Context interpreter, 440, 443, 444 Context mediation, 439, 441 Context of usage, 435, 440, 446, 464, 473 Context ontology Language, 439 Context providers, 5, 472, 485 Context space, 439 Context toolkit, 439, 442, 443 Context-adaptive binary arithmetic coding, 51 Context-aware content adaptation, 4, 433, 434, 435, 445, 470, 479, 480, 481, 482, 486, 495, 530, 531 Contextual information, 436, 437, 470, 476, 485, 500 Contour matching, 49 CoOL, 439 Core network, 371 Correlated-frame structure, 73, 74 COSSAP, 234, 240, 403 COSSAP hierarchical models, 240 COST 231-Walfish-Ikegami model, 356, 361 Creation of Smart Local City Services, 441
CROSLOCIS, 441 CS/H.264/AVC, 495, 500, 501 Curvature scale space image, 46, 47 CxP…. 5, 434, 444, 473, 482, 486, 487, 488, 493, 523, 526, 530 D DANAE, 453, 459 dB, 32, 34, 52, 53, 64, 70, 75, 77, 82, 84, 85, 87, 92, 96, 112, 113, 125, 126, 129, 130, 133, 134, 135, 136, 142, 143, 148, 161, 162, 168, 169, 170, 178, 188, 189, 202, 205, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 255, 272, 276, 277, 278, 279, 281, 285, 287, 288, 292, 294, 299, 305, 307, 312, 319, 320, 326, 327, 328, 329, 332, 333, 335, 337, 338, 339, 340, 341, 344, 345, 346, 351, 353, 354, 355, 357, 359, 360, 361, 364, 365, 366, 367, 368, 369, 370, 374, 375, 376, 377, 378, 379, 380, 381, 383, 387, 388, 390, 392, 393, 394, 395, 405, 406, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 488, 509, 510, 511, 518, 521, 522 DCH, 262, 284, 317 DCT, 1, 9, 10, 11, 12, 13, 20, 28, 34, 35, 36, 41, 69, 103, 113, 120, 121, 122, 123, 133, 134, 143, 144, 145, 146, 147, 159, 171, 172, 173, 220, 346, 348, 349, 466, 506 Deblocking filters, 21 Decoder complexity, 39, 98 Depth-map, 45, 46, 57 DI, 64, 115, 120, 165, 446, 447, 448, 449, 461, 472, 520, 524, 525, 526, 527, 530 DIA, 445, 446, 448, 449, 453, 458, 472, 473, 482, 486, 496, 497, 498, 499, 528, 529 DIA tools, 445, 458 DIBR, 45, 64, 65, 80, 87, 88 DID, 446, 472 DIDL, 446 Differential pulse code modulation, 13, 28, 39, 40, 41 Differentiated services, 397 Digital Rights Management, 4, 458 DIP, 453 Disparity information, 196 Distributed adaptation, 453 Distributed Systems and Computer Networks, 441 DistriNet, 441 DMIF, 220
Doppler spectrum, 239 Downsampling, 8, 204, 205 DPCH fields, 270 DPDCH, 262, 270, 271, 292, 293, 317 DPRL, 459 DRM, 4, 5, 433, 434, 435, 436, 458, 459, 469, 470, 473, 481, 482, 485, 512, 522, 531 DRM-based adaptation, 435, 531 DRM-enabled adaptation, 5, 434, 470, 473 DS, 171, 449 DSL, 3, 37, 213 E EBCOT algorithm, 51, 52 EC, 277, 278, 314, 453 EDDD, 145 EDGE, 2, 3, 211, 212, 231, 234, 246, 248, 344, 401, 403, 404, 405, 421, 422, 424, 425, 431, 432 EDGE Emulator, 404 EDGE-to-UMTS System, 424 EDGE-to-UMTS transmission, 421, 424, 431 Editing-like content adaptation, 452 EGPRS data flow simulator, 403 EGPRS physical link layer model, 236, 248, 252 EGPRS protocol layer, 362 EIMS, 453 Encoder complexity, 99, 193 Encoder control, 15, 16 ENTHRONE project, 453 Entropy coder, 15, 159 EPB, 90, 91, 92, 93, 96 Equalization, 219 Error concealment, 143, 153, 173, 175, 176, 182, 183, 185, 186 Error patterns, 308 Error propagation, 24 Error resilience, 153, 155, 181, 182 Error robust, 71 Error tracking, 223 Error-prone environments, 155, 180, 186, 218, 220, 221, 326, 355 Error-robustness algorithm, 74, 75 ERTERM, 91 ETSI EFR, 221, 341 Event detection, 468 Excerpt of a network UED, 451 Exclusive OR, 74 Extensible context ontology, 439 Extensible markup language, 439
Extensible stylesheet language transformation, 452 Extrapolation orientations, 73 F FA, 512 Fast fading, 271, 356, 358, 388 FD, 466 FEC/ARQ method, 223 FIBME algorithm, 67 FIFO buffer, 254 FIRE, 235, 236 FIRE code, 236 Flexible Macroblock Ordering, 25 FlexMux, 220 FMO, 24, 25, 26, 27, 131, 132, 155, 186, 187, 189, 468, 469 FP, 453 Frame dropping, 463, 466 FUSC, 302, 305, 306, 307 G Gaussian, 106, 115, 138, 139, 140, 172, 233, 234, 240, 263, 272, 281, 290, 387, 388 Gaussian probability distribution, 139 gBS, 449, 452 gBS tools, 449 gBSDtoBin, 453 GERAN, 221, 231, 232, 233, 239, 254, 312, 341, 403 GERAN data flow model, 403 Global motion model, 177, 184, 196 Global motion parameters, 176 GMSK modulation scheme, 236, 249 Gold codes, 269 GOP, 52, 53, 71, 72, 73, 75, 76, 77, 96, 109, 112, 114, 115, 116, 117, 118, 120, 121, 122, 124, 126, 129, 130, 133, 134, 142, 151, 154, 198, 202, 203, 206, 208 GOS, 160, 163, 164, 165, 167 GPRS, 3, 211, 212, 215, 231, 233, 234, 235, 236, 237, 240, 241, 242, 243, 244, 245, 252, 253, 254, 255, 256, 257, 260, 312, 315, 332, 333, 334, 335, 336, 337, 339, 344, 345, 363, 371, 403, 405, 411, 412, 417, 425, 427, 428, 430 GPRS PDTCHs, 233, 236 GPRS SNDC, 252, 253, 403 GSM 05.05, 240, 241, 243, 405 GSM SACCH, 235
H H.264, 1, 2, 8, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 34, 36, 40, 41, 42, 44, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 73, 75, 79, 80, 81, 88, 95, 96, 97, 98, 112, 120, 127, 129, 131, 132, 137, 142, 152, 153, 154, 155, 156, 157, 158, 170, 172, 180, 186, 189, 190, 200, 202, 203, 392, 468, 469, 495, 510, 511, 512, 514, 517, 518, 520, 521, 522, 527 HardwarePlatform, 454 HCI, 436 HDTV, 29, 501 Heterogeneous networks, 215, 217, 330, 396, 397, 427, 466 Hierarchical coding, 23 HP, 444 HR/DSSS, 3, 212 HSCSD, 3, 211, 241, 405 HT, 238, 241, 405 HTML, 454 Huffman codec, 48 I IBM, 444 IBMH-MCTF, 70 ICAP, 454 Ideal rake receiver, 273, 274, 299 IEC, 1, 36, 41, 60, 61, 150, 445, 446 IEEE802.16e, 85, 301 IETF, 321, 330, 454 IFFT block, 302, 307 IMT2000, 270 In-Band MCTF, 70 Independent Replication, 391 Info pyramid, 462, 463 In-loop filter, 22 Integer transform, 20 Interleaving, 88, 90, 94, 99, 219, 236, 238, 240, 264, 267, 268, 278, 285, 289, 291, 301, 302, 304, 325, 406 Internet Content Adaptation Protocol, 454 Inter-networked QoS, 400, 397, 399, 401 Inter-view Prediction, 196, 206 Interworked heterogeneous networks, 215 Intra frames, 14 Intra prediction, 21 Inverse Quantization, 108 IP video conferencing, 221, 341 IPR, 459 IROI, 468, 498, 500, 511, 517, 522
IROI adaptation, 468, 498, 517 IS-95, 3, 212 ISO, 1, 28, 35, 36, 41, 60, 61, 150, 191, 192, 212, 445, 446 IST, 38, 161, 168, 169, 385, 453 ITEC, 453 ITU Standards, 28 ITU-T G.722.2, 221, 341 ITU-T G.729, 221 J JAS-UEP scheme, 381, 383 Joint MVS Encoding, 153, 156 Joint MVS Transcoding, 153, 170 JPEG, 11, 13, 14, 36, 54, 69, 88, 89, 90, 91, 93, 94, 96, 446 JPEG 2000, 36, 69, 88, 89, 90, 91, 93, 94, 96 JPWL EPB, 91, 92 JPWL system description, 89 JSVM, 60, 75, 76, 77, 79, 82, 84, 96, 193, 197, 198, 514, 517, 518, 520 K Kalman filtering, 116, 117 Key frames, 71, 72, 73, 74, 76, 77, 96, 106, 107, 109, 110, 112, 114, 116, 117, 120, 127, 129, 130, 131, 132, 133, 134, 137, 139, 142, 466 L Lagrangian cost, 19, 73 Lagrangian optimization, 19, 20 Laplacian, 106, 138, 139, 140, 146, 172 Laplacian distribution, 140, 146, 172 LAPP, 110, 111, 147 Layered transmission, 48 Layers of depth images, 64 Link adaptation, 221, 225, 356, 362, 394 LLC PDUs, 233 LLC-PDU, 254, 333, 334, 367 LLR, 107, 108, 137, 138, 139 Log likelihood ratio, 107 LogMap algorithm, 274, 281, 288, 299 M MAC, 3, 58, 59, 61, 62, 65, 192, 212, 232, 233, 238, 239, 240, 243, 252, 253, 256, 258, 259, 260, 262, 284, 295, 296, 301, 311, 315, 316, 317, 333, 334, 362, 403, 411, 412, 413, 417 MAC multiplexing, 296, 317 Macroblock Bit Allocation, 166
MAD, 154, 166 MALR, 111, 112, 113 Maximum likelihood, 137 MB, 8, 14, 15, 17, 18, 19, 20, 21, 22, 25, 26, 28, 29, 35, 62, 145, 148, 149, 150, 154, 155, 158, 159, 160, 162, 166, 169, 187, 188, 190, 347, 350, 507, 512, 513, 515, 517 MB motion vectors, 160 MC-EZBC codec, 65 MCS-1, 233, 234, 236, 237, 238, 241, 246, 247, 248, 250, 339, 340, 341, 363, 364, 365, 367, 370, 373, 405, 406, 412, 413, 414, 415, 422, 423, 424, 425, 426, 427, 428, 429 MCS-4, 234, 237, 241, 246, 247, 248, 250, 251, 339, 340, 363, 364, 413, 414, 415 MCS-9, 234, 237, 246, 339, 340, 363, 364, 413, 414, 415, 422 MDC, 2, 33, 36, 37, 39, 42, 43, 44, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 96, 223 MDC decoder, 44, 81, 82 MDC stream, 44, 80, 81, 82, 85 MDS, 445, 464 MDS media characteristics, 464 Mean absolute log-likelihood ratio, 111 Medium-grain scalability, 42 Mesh-based temporal filtering, 51 MGS, 42 MIT, 444 Mobile Alliance Forum, 454 Mobile Station mode, 410 Mobile Terminal, 241, 405, 406 Mobile terminal power, 227 Mobile terminal velocity, 285 Mode decision, 15 MOS, 343, 344, 345 Motion compensation, 14 Motion estimation, 14, 40, 41, 107, 114, 118, 127 Motion-compensated interpolation, 114, 117 Motion-compensated temporal filtering, 44, 51, 96 MP3, 29, 446 MPEG, 465 MPEG Ad Hoc Group, 96 MPEG-21, 96, 445, 446, 447, 448, 451, 452, 453, 454, 457, 458, 459, 460, 461, 464, 468, 470, 471, 472, 473, 474, 475, 476, 482, 486, 488, 493, 495, 523, 525, 526, 527, 528, 530 MPEG-21 DIA, 445, 446, 447, 451, 452, 453, 459, 460, 464, 468, 470, 471, 472, 473, 474, 475, 476, 488, 493, 523, 525, 527, 528
MPEG-21 REL, 459, 460, 493 MPEG-4, 2, 8, 18, 19, 27, 28, 29, 30, 39, 48, 50, 53, 54, 58, 59, 60, 61, 62, 63, 64, 65, 95, 97, 154, 155, 161, 162, 163, 168, 170, 175, 179, 180, 190, 192, 219, 220, 222, 225, 252, 253, 254, 255, 256, 294, 296, 315, 318, 319, 320, 321, 325, 326, 327, 328, 329, 331, 332, 333, 334, 335, 336, 340, 346, 347, 348, 352, 353, 356, 363, 367, 373, 380, 381, 384, 395, 396, 401, 407, 408, 409, 410, 416, 421, 423, 424, 426, 431, 447, 466, 468, 469, 503, 505, 506, 507, 527 MPEG-4 MAC, 58, 59, 60, 61, 62, 63, 64, 95 MPEG-7 descriptors, 468 MPEG-7 DSs, 450, 472 MQ coder, 91 Multimedia Communications, 214, 217 Multimedia Description Schemes, 445 Multiple Auxiliary Component, 58 multiple video sequence, 152, 156, 162 Multiplexing, 181, 221, 224, 260, 284, 345, 406 Multi-sequence rate control, 170 Multi-View Coding, 16, 192, 193, 196, 201, 203, 204, 206, 207, 208, 209, 210 MVO performance, 160 MVO rate control, 163 MVP, 513 MVS, 2, 152, 153, 156, 157, 158, 159, 160, 162 MVS Encoding, 153, 162 MVS Encoding Rate Control, 153, 162 N NA-CDMA-IS-127, 221, 341 NACK, 224, 412 NAL, 468 NA-TDMA IS-641, 221, 341 NoE, 465 Non-normative, 2, 152, 154, 155, 190 Non-normative tools, 2, 152 Normative tools, 2, 152 O OAVE encoding scheme, 48 OCNS, 277, 314 Odd stream, 81, 82 ODRL, 459 ODWT, 66, 67, 69, 70 OMA, 459 Ontologies, 438, 481, 531 Open Digital Rights Language, 459
Orthogonality factor, 263, 289, 290, 291, 388, 392 OSCRA, 495, 498, 500 Overhead, 61, 71, 74, 76, 79, 94, 96, 112, 194, 202, 220, 322, 331, 332, 346, 382, 417 OWL, 439, 442, 455, 481, 492, 531 P P frames, 14, 16, 17, 38, 116 P2P, 454 PACCH, 232 Packet Data Block Type 2, 235 Packet Data Block Type 3, 235 Packet Data Block Type 4, 236 PAGCH, 232 PAL, 28 Parameter set, 24, 25, 85, 167, 256, 262, 277, 282, 283, 286, 292, 296, 312, 313, 315, 321, 322, 325, 344, 346, 405, 410, 421, 424, 431, 517 Parity Bit, 106 Parity bit puncturer, 106 Parity check, 237, 301, 313 PCCCH, 232 PCCPCH, 261 PDA, 456, 500, 501, 525, 527 PDSCH, 262 PDTCH, 232, 240, 241, 336, 405 Percentage of sign changes, 111 Performance evaluation objective, 5, 6, 30, 31, 34, 35, 43, 53, 56, 65, 95, 121, 131, 137, 155, 190, 201, 204, 214, 231, 338, 343, 354, 398, 434, 436, 453, 462, 466, 467, 480, 481, 515, 517, 520, 521 Performance evaluation Subjective, 6, 30, 31, 32, 53, 56, 62, 65, 67, 68, 71, 84, 85, 88, 95, 96, 112, 153, 162, 173, 175, 180, 181, 182, 186, 204, 210, 329, 338, 343, 345, 451, 462, 510 Physical Context, 437, 438, 476 PLR, 132, 133, 134, 136, 143, 188 Post filter, 22 Power control algorithm, 291, 292, 293, 294, 328, 376 Power limitations, 217 PPCH, 232 PRACH, 232, 262 Prediction step, 66, 69 Preferences Profiles, 445 PRISM codec, 103
Profiling, 434, 467, 480 Project Integration, 460 Propagation model, 238, 239, 245, 248, 254, 256, 263, 270, 299, 337, 356, 387, 388 PSCS, 111, 112, 113, 114 PSD, 115, 120 PSNR, 31, 32, 33, 34, 52, 53, 55, 61, 64, 68, 70, 75, 76, 77, 82, 85, 87, 93, 112, 113, 125, 126, 129, 132, 133, 134, 135, 136, 142, 148, 149, 161, 162, 168, 169, 170, 178, 188, 189, 202, 207, 318, 319, 320, 321, 322, 325, 326, 327, 328, 329, 332, 333, 335, 337, 338, 340, 341, 346, 351, 354, 355, 364, 365, 366, 367, 368, 369, 370, 374, 379, 380, 381, 383, 395, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 509, 511, 515, 518, 521, 522 PUSC, 85, 302, 305, 306, 307, 308, 309, 310, 312 Q QCIF, 48, 70, 74, 77, 88, 110, 112, 120, 122, 124, 132, 142, 147, 161, 169, 321, 353, 367, 421, 511, 515, 527, 530 QoS, 3, 4, 71, 89, 213, 214, 215, 216, 217, 219, 220, 223, 224, 226, 230, 259, 262, 282, 294, 296, 298, 315, 316, 345, 347, 371, 372, 378, 385, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 404, 405, 407, 408, 409, 410, 416, 418, 419, 421, 423, 425, 426, 427, 428, 429, 430, 431, 448, 512 QoS discrepancies, 216 QoS Mapping Emulator, 409 QP, 58, 61, 63, 65, 75, 76, 77, 82, 85, 87, 147, 154, 155, 162, 163, 166, 169, 188, 208, 515, 517, 518 Quantization, 12, 16, 87, 104, 121, 124, 226 Quantization bins, 104, 123 R RA, 90, 238, 241, 405 Random access, 16, 17, 161, 163, 168, 193, 204, 206, 208, 209, 232 Random View Access, 204 Randomizer, 302 Rate Control, 134, 136, 153, 156, 159, 253, 295 Rate distortion, 32, 33, 50, 76, 98, 99, 103, 112, 113, 114, 115, 118, 119, 120, 124, 130, 131, 132, 133, 134, 136, 137, 142, 143, 152, 154, 155, 156, 157, 159, 172, 208, 513, 515, 517, 518, 519
Rate matching, 267, 278, 285, 406
Rate matching ratio, 285, 406
Rayleigh distribution, 239, 271
Rayleigh fading channel, 141, 265
RD, 32, 33, 50, 76, 98, 99, 103, 112, 113, 114, 115, 118, 119, 120, 124, 130, 131, 132, 133, 134, 136, 137, 142, 143, 152, 154, 155, 156, 157, 159, 172, 208, 513, 515, 517, 518, 519
R-D performance, 48, 49, 50, 52, 58, 61, 64, 65
RDD, 459, 493, 496
RDF, 455, 456
RDO, 154
RDOPT, 518
Real-time communication, 30
Redundant Slices, 27, 44
REL, 459, 460, 493, 496
Residual Error Descriptor, 90
Resource Management, 225, 385
Resources context, 437, 476
Resources context category, 476
Retransmission techniques, 223
RLC/MAC blocks, 233, 236, 243, 295, 296, 334, 413, 417
RLC/MAC layer, 254, 294, 296, 323, 362, 409
RM8, 154
Robust FMO Scheme, 153, 186
ROI, 467, 468, 469, 470, 471, 498, 500, 501, 507, 510, 511, 512, 514, 515, 516, 517, 526
ROI selection algorithm, 467, 516
ROPE algorithm, 143, 144
RSC encoder, 105, 106
RTP/UDP/IP, 220, 233, 252, 254, 294, 403, 410
RTP-PDU header, 254
Run-level coding, 13

S
SA-SPECK algorithm, 51
Scalable intra-shape coding scheme, 47, 48
Scalable predictive coding scheme, 49
Scalable shape encoding, 46
SCCPCH, 261, 262
Scene Plane, 163, 164
SDC stream, 82
SECAS, 440
SEGMARK, 91
Selective combining, 273
Sensing higher-level context, 461
Sensitivity to Feedback Delay, 375
Service-Oriented Context-Aware Middleware, 442
Shape data, 153, 175, 176, 177, 182, 183, 184, 185, 186
Shape-adaptive bit-plane coding, 51, 52
Side Information, 106, 109, 139
Signal Processing WorkSystem, 256, 312, 405
Signal-to-noise ratios, 163, 403
SISO, 107, 108, 110, 111, 112, 137, 140, 141
Skip, 19, 513, 515, 517
Slice, 17, 24, 25, 26, 27, 147, 155, 186, 187, 188
SN-DATA PDU, 410
SNDC header, 254
SNDCP, 232, 233, 410, 411
SNR, 28, 43, 70, 85, 87, 139, 142, 143, 275, 312, 379, 380, 381, 382, 383, 490, 500
SN-UNITDATA PDU, 410
SOAP, 465, 523
SOCAM, 442, 443, 444
Soft-QoS control, 399
SoftwarePlatform, 454
SOP, 91
Source sequences, 37
Spaces for adaptation decision, 462
Spatial power spectral density, 115
Spatial redundancy, 6, 7, 8, 9, 13, 200
Spatio Temporal, 159
SPECK algorithm, 51, 52
Speech compression, 221
SPIHT, 69
Spreading Factor, 285, 286, 326, 380, 383, 406
SQBAS, 363, 365, 367, 368, 369, 370, 371, 384
SR-ARQ, 223
SSD, 19
SSIM, 34
Standardization Bodies, 27
State-of-the-art, 96, 131, 155, 193, 530, 531
State-space model, 116, 118
STC, 302
Stereo Video Sequences, 126
Stereoscopic, 44, 56, 57, 58, 60
Stereoscopic 3D video, 44
Stereoscopic content, 44
Stereoscopic Video Conversion, 471
Streaming, 143, 371
SVC, 1, 33, 36, 39, 41, 44, 59, 60, 61, 63, 71, 73, 74, 79, 80, 81, 82, 88, 95, 96, 468, 469, 520, 521, 522
SVO encoding, 160, 162
SVO rate control algorithm, 161
Symmetric Distributed Coding, 126

T
Tailing bits, 238
TB size, 294, 296, 317, 420
TBAS, 363, 364, 369, 370, 371, 384
TCH/FS speech channel, 235
TDMA, 2, 211, 219, 232, 236, 240
TDMA frames, 236
TDWZ, 112, 113, 131, 137
Temporal filtering, 40, 51, 53, 67, 96
Temporal redundancy, 6, 7
TFCI, 260, 261, 270, 278, 285, 317, 406
TFI, 260, 261
Thermal noise power, 388
Time Context, 437, 476
TM5, 154, 294, 318, 325, 353, 373, 421, 508
TMN8, 154
Traffic model, 219, 385
Training Sequence Codes, 241, 405
Transcoding, 100, 153, 156, 171, 186
Transmission Time Interval, 265, 286
Transmit bit energy, 354, 379, 381, 383
TransMux layer, 220
TTI, 261, 264, 265, 267, 284, 286, 295, 296, 317, 323, 347, 379, 395, 417, 420
TU, 238, 241, 246, 336, 337, 341, 405, 422
TU3, 241, 242, 245, 361, 363, 365, 366
Turbo coding, 104, 107, 286
Turbo Encoder, 104

U
UAProf specifications, 456
UCD, 446, 448, 450, 451, 453, 464
UCDs, 453, 465
UED, 446, 448, 449, 450, 451, 453, 464, 470, 472, 473, 482, 486, 487, 489, 523
UED natural environment characteristics, 450, 472
UEP, 27, 58, 90, 91, 92, 93, 96, 187, 222, 347, 351, 352, 353, 354, 355, 356, 381, 382, 383, 384, 502, 507
UF, 466
UMA, 53, 453, 457, 460, 470
UMA scenario, 460
UMTS, 2, 3, 37, 211, 212, 215, 216, 219, 221, 226, 231, 256, 257, 260, 262, 263, 264, 272, 279, 280, 282, 293, 296, 297, 299, 312, 315, 316, 321, 329, 330, 331, 332, 341, 342, 344, 345, 346, 347, 352, 353, 354, 355, 356, 357, 378, 379, 380, 381, 382, 383, 384, 385, 395, 401, 403, 404, 405, 406, 407, 408, 409, 410, 413, 416, 417, 421, 424, 425, 426, 427, 428, 429, 430, 431, 432, 502, 507, 508, 509
UMTS DL Model Verification, 279, 280
UMTS emulator, 296, 297, 401, 405, 407, 410, 416, 421, 431
UMTS-FDD simulator, 297
Unequal Error Protection, 27, 93
User Agent Profile, 454
User Context, 437, 438, 476
User Interface, 417
USF bits, 235, 236
Utility space vehicle, 461, 464
UTRAN, 221, 256, 257, 258, 259, 260, 271, 285, 292, 295, 296, 300, 315, 321, 322, 325, 329, 330, 341, 342, 344, 385, 395, 403, 508
UTRAN emulator, 300, 325, 344

V
VBV, 161, 163, 164, 167, 168
VCEG, 1, 36, 41, 161, 192
VCS, 477, 478, 479, 480
VERT, 73
Vertex-based shape coding, 46, 47, 48
Video Buffering Verifier Control, 160, 167
Video Buffering Verifier Control Compensation, 167
Video Coding Expert Group, 36
Video object planes, 158, 175
Video objects, 2, 47, 49, 52, 152, 174, 175, 176, 177, 178, 179, 180, 182, 183, 225, 466
Video surveillance, 99
Video traffic model, 219
Virtual Classroom, 478
Virtual Collaboration, 477, 480
Virtual Collaboration System, 38, 477
Virtual Desk, 37
Virtual View Generation, 200
VISNET I NoE project, 465
ViTooKi operating system, 453
VLSI, 2, 211
VO, 2, 47, 49, 52, 152, 158, 160, 162, 163, 164, 165, 166, 168, 170, 173, 174, 175, 176, 177, 178, 179, 180, 182, 183, 186, 225, 348, 466, 506
VOP, 159, 161, 163, 164, 165, 166, 168, 169, 175, 176, 178, 324, 331, 332, 348, 349, 506
VOP header, 324, 332, 348, 349, 506
VS, 152, 157, 160

W
W3C, 438, 443, 445, 454, 455, 456, 458
WAP, 454, 455
Wavelet, 36, 40, 41, 44, 65, 66, 67, 68, 69, 71, 91, 96
Wavelet analysis, 40, 41
Wavelet transform, 40, 44, 65, 69
WCDMA, 91, 221, 231, 257, 258, 290, 355, 378, 392, 395, 409, 508
WDP, 454
Web Ontology Language, 531
Web Services technologies, 439
WiMAX, 3, 38, 212, 213, 215, 216, 231, 300
Windows Media DRM 10, 459
Wireless Access Protocol, 454
Wireless JPEG 2000, 88
Wireless PC cameras, 99
WLAN, 3, 37, 212, 215, 216, 501
WML, 454
WMSA, 274, 299
WZ frames, 116, 118, 131, 132, 133, 134

X
XDSL, 3, 213, 301
XML, 439, 440, 446, 449, 452, 455, 456, 459, 472, 473, 486
XrML, 459
XSLT, 452, 465