Video compression techniques over low-bandwidth lines Roalt Aalmoes August 27, 1996
c
Copyright 1996, Twente Universi...
27 downloads
676 Views
622KB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
Video compression techniques over low-bandwidth lines Roalt Aalmoes August 27, 1996
c
Copyright 1996, Twente University
Abstract In this Master Thesis, current video compression techniques are examined and an implementation of a video compression algorithm based on the H.263 standard is presented. All video compression standards that are available now have lossy characteristics: the algorithm reduces the size of video stream signi cantly at the cost of (some) quality loss. Motion-JPEG is currently mostly used, but to make the video stream even smaller, Motion Estimation algorithms are applied, which are found in the H.261, MPEG and H.263 standards. Application of these motion estimation algorithms must be done correctly to make them ecient with regard to compression ratio and computing speed. For Desktop Video Conferencing over lowbandwidth lines, the new H.263 standard is most appropriate due to its high compression ratio. To make an implementation of this standard applicable for Real-Time purposes, the Motion Estimation, DCT, and Quantization algorithm are examined carefully and fast implementations that do not sacri ce quality much are integrated in the encoder. Other advanced techniques that reduce the size of the stream that must compressed are also examined and implementations based on skin detection, background substraction, and change detection are tested and evaluated. The result is a video compression library that compresses QCIF-sized (176 144) video streams at near Real-Time speed on fast desktop platforms. An application that sets the correct video compression parameters based on system load is also implemented.
ii
Preface Welcome to video compression! In the next couple of chapters, I will describe what video compression is, why it is required, and how it can be done eectively. This thesis is the last part of my study Computer Science at Twente University, and I hope my work will be the rst part of further development in the area of Real-Time video compression research. I have looked at video compression for Desktop Video Conferencing (DVC) in particular. Large-size video streams of high quality cannot be compressed that much to transmit it through, for instance, a telephone line. As desktop PCs or small workstations also lack enough computing power for compression of these large video streams, DVC is the best choice for these systems. Emphasis of my work has been put onto software video compression as opposed to hardware compression for a number of reasons. In the rst place, it is much more exible than hardware implementation. To change an algorithm, the code only has to be changed and be compiled. For hardware implementation, a whole new design must be made and implementation of a chip is also quite expensive. Another reason for using software compression is that it can be used to test algorithms eectively, before the actual (intended) implementation in hardware is done. The nal reason is the compatibility of a software compression compared to hardware compression. Software can be run on a variety of systems, while a hardware compression device often requires dierent drivers or is only available for one or not more than a few dierent systems. The result is a software library that can be compiled on many system, independent of their internal architectures. This is also the reason I prevented to use a mixture of a software and hardware solution as it would make the software implementation dependent of (at some point in the future) outdated hardware. I am personally not against a mixture of hard- and software, but only if the hardware provides a clear functional limited part of the encoding that replaces some time-consuming software routines. Some object-oriented approach, where a software object is replaced by a hardware object would in this case be a solution. An example of some software/hardware combination that I did use was the ability of the AVA to transmit YUV color spaced frames, instead of more common RGB color spaced frames1. 1 As will be explained later, video compression is applied on YUV color spaced frames only. If the encoder receives RGB color spaced frames, these must rst be converted to YUV.
iii
Preface
Due to the nature of video streams, large amounts of data must be processed by the processor for any video algorithm, including compression. In the rst time of my graduation, I was a bit sceptical about Real-Time software compression, and thought that the implementation would not be faster than 5{10 fps. However, developments in performance of PCs (Pentiums that are as fast as DEC Alpha workstations), right motion analysis, and usage of fast implementations made it clear Real-Time video conferencing becomes feasible very soon now . Another surprise is the size to which a video stream can be compressed. It might turn out that audio and not video will be the bottleneck of video conferencing2 .
Roalt Aalmoes, August, 1996
2
Although I have not done much research in the area of audio compression
iv
Contents Preface 1 Introduction 1.1 1.2 1.3 1.4 1.5
A brief history of computing power The Pegasus project . . . . . . . . Goals . . . . . . . . . . . . . . . . Problems and Constraints . . . . . Areas of research . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
2 Overview of Still-picture Compression Standards 2.1 Introduction . . . . . . . . . . . . . . 2.2 Still-picture compression techniques 2.2.1 Lossy and Lossless . . . . . . 2.2.2 Color spaces . . . . . . . . . 2.2.3 DCT transformation . . . . . 2.2.4 Scalar Quantization . . . . . 2.2.5 Vector Quantization . . . . . 2.2.6 Entropy Encoding . . . . . . 2.2.7 Fractal compression . . . . . 2.2.8 Wavelet compression . . . . . 2.3 Still-picture compression standards . 2.3.1 JPEG . . . . . . . . . . . . . 2.3.2 GIF . . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
3 Overview of Video Compression Standards 3.1 Video compression techniques 3.1.1 Prediction . . . . . . . 3.2 Video compression standards 3.2.1 Motion-JPEG . . . . . 3.2.2 MPEG-1 . . . . . . . 3.2.3 MPEG-2 . . . . . . . 3.2.4 MPEG-4 . . . . . . . 3.2.5 H.261 . . . . . . . . . 3.2.6 H.263 . . . . . . . . . 3.3 Summary . . . . . . . . . . .
. . . . . . . . . .
v
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
iii 1 1 2 2 2 3
5
5 6 6 6 8 8 9 9 11 12 13 13 18
19
19 19 20 20 20 22 23 23 25 26
Contents
4 Choosing an encoding standard { the case for H.263 4.1 4.2 4.3 4.4 4.5
Introduction . . . . . . . . . . . . . . . . . . The choice for H.263 . . . . . . . . . . . . . Advanced options in H.263 . . . . . . . . . DVC scenario and H.263 advanced options . Summary . . . . . . . . . . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
5.1 Introduction . . . . . . . . . . . . . . . . . . . 5.2 Basics of Motion Estimation . . . . . . . . . . 5.3 Block-comparison alternatives . . . . . . . . . 5.3.1 SAD . . . . . . . . . . . . . . . . . . . 5.3.2 Summation of columns/rows (SOCR) 5.4 Estimation Alternatives . . . . . . . . . . . . 5.4.1 Exhaustive search with search window 5.4.2 Logarithmic search . . . . . . . . . . . 5.4.3 Half-pixel prediction . . . . . . . . . . 5.4.4 Other techniques . . . . . . . . . . . . 5.5 Motion in video sequences . . . . . . . . . . . 5.5.1 Motion vectors . . . . . . . . . . . . . 5.5.2 Frame-rate and motion vectors . . . . 5.5.3 Size of the search window . . . . . . . 5.5.4 Real-Time and frame rate . . . . . . . 5.5.5 Results on SOCR block comparison . 5.6 Quality of a Real-Time video stream . . . . . 5.7 Summary . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
5 Choosing Motion Estimation algorithms
6 Choosing DCT components and Quantization levels 6.1 6.2 6.3 6.4
Introduction . . . . . . . . . . . Implementations of DCTs . . . Comparison of DCT algorithms Quantization . . . . . . . . . . 6.4.1 Scaled DCT . . . . . . . 6.5 Summary . . . . . . . . . . . .
. . . . . .
. . . . . .
. . . . . .
7 Choosing advanced video techniques
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Skin selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.1 Implementation of skin selection under H.263 . . . . . . . 7.2.2 Very fast skin selection implementation . . . . . . . . . . 7.3 Background substraction . . . . . . . . . . . . . . . . . . . . . . . 7.3.1 Background replacement (Chroma-key with virtual blue room) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Change detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
29
29 29 30 31 31
33
33 33 34 34 35 36 36 36 37 38 38 38 40 40 42 43 44 45
47
47 47 48 49 50 50
51
51 52 52 52 53 54 54 55
Contents
8 Compression techniques vs. network technology 8.1 Introduction . . . . . . . . . . 8.2 Classi cation of networks . . 8.3 Computer networks . . . . . . 8.3.1 POTS . . . . . . . . . 8.3.2 ISDN . . . . . . . . . 8.3.3 Internet . . . . . . . . 8.3.4 ATM . . . . . . . . . . 8.4 Compression and Network . . 8.5 Choice of compression . . . . 8.5.1 Inter vs. Intra frames 8.6 Latency . . . . . . . . . . . . 8.7 Summary . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
9 Implementation and performance of the R263 encoder 9.1 Introduction . . . . . . . . . . . . . . . . . 9.2 Code optimization . . . . . . . . . . . . . 9.2.1 Compiler ags . . . . . . . . . . . 9.2.2 Fine-tuning . . . . . . . . . . . . . 9.3 Implementation of change detection . . . 9.4 Results of the R263 encoder . . . . . . . . 9.5 Adaptive encoding based on system load . 9.5.1 Implementation . . . . . . . . . . . 9.5.2 Results . . . . . . . . . . . . . . . 9.6 Summary . . . . . . . . . . . . . . . . . .
10 Conclusion 11 Acknowledgements
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
57
57 57 58 58 58 58 59 59 60 61 61 62
63
63 63 63 64 65 66 68 69 69 70
71 73
vii
Contents
viii
List of Figures 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
The 4:2:0-format . . . . . . . . . . . . . . The 4:2:2-format . . . . . . . . . . . . . . Quantization and de-quantization . . . . . Example of a Discrete Wavelet Transform Overview of the JPEG codec . . . . . . . Zig-zag coecients sequence . . . . . . . . Component interleaving using MCUs . . . JPEG Lossless prediction for pixel X . . . Hierarchical encoding in JPEG . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
7 8 9 12 14 15 15 17 18
3.1 3.2 3.3 3.4
Forward prediction . . . . . . Average prediction . . . . . . Composition of an H.261 CIF B-frame of H.263 . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
21 22 24 26
SOCR block comparison using 8 8 blocks . . . . . . . . . . . . Logarithmic search algorithm for 9 9 blocks . . . . . . . . . . . Example of interpolation of 2 2 pixels . . . . . . . . . . . . . . Average length of X and Y motion vectors for a video sequence. Maximum length found per frame of X and Y motion vectors for a video sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6 Size of MV against fraction of MVs with this size . . . . . . . . . 5.7 Size of MV against cumulative percentage of MVs within this size
35 36 37 39
. . . .
5.1 5.2 5.3 5.4 5.5
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
40 41 42
6.1 Plot of compression ratio versus SNR (dB) . . . . . . . . . . . . 50 7.1 Overview of background substraction . . . . . . . . . . . . . . . . 53 7.2 Steps for encoding of one Macro-Block . . . . . . . . . . . . . . . 54 7.3 Macro-block encoding with block change detection . . . . . . . . 55 9.1 Model for a real-time video compression application . . . . . . . 68
ix
List of Figures
x
List of Tables 2.1 LZW encoding of word \bananas" . . . . . . . . . . . . . . . . . 11 2.2 LZW dictionary after encoding/decoding of word \bananas" . . . 11 2.3 Lossless prediction formula table . . . . . . . . . . . . . . . . . . 17 3.1 MPEG-1 Constraint Parameter Bit stream . . . . . . . . . . . . . 23 4.1 Eect of H.263 advanced options on compression ratio (Miss America sequence, 150 frames) . . . . . . . . . . . . . . . . . . . 30 5.1 5.2 5.3 5.4
Search window size for Miss America video stream at 15 fps . . . Search window size for Miss America video stream at 7.5 fps . . SOCR block comparison on Miss America sequence at 15 fps . . Quality of video stream compared with real frame rate of 30.0 fps
41 42 43 45
6.1 6.2 6.3 6.4
SNR comparison of DCT algorithms. . . . . . . . . . . . Performance comparison of DCT algorithms. . . . . . . Compression comparison of DCT algorithms. . . . . . . Quantization versus Quality for Miss America sequence
48 48 48 49
. . . .
. . . .
. . . .
. . . .
. . . .
8.1 Comparison between dierent compression methods for a QCIF frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 9.1 Comparison of Miss America encoding (75 frames) with and without motion detection (MD) . . . . . . . . . . . . . . . . . . . 9.2 The streams that are used for compression . . . . . . . . . . . . . 9.3 The streams that are used for compression . . . . . . . . . . . . . 9.4 Pro le of encoding . . . . . . . . . . . . . . . . . . . . . . . . . .
xi
66 67 67 68
List of Tables
xii
Chapter 1
Introduction 1.1 A brief history of computing power Computer technology is improved in a revolutionary way since the rst systems were developed in the 50s and 60s: Size of computers decreased signi cantly while at the same time computing power increased by factors of two each year. The developments lead to new applications for each new generation of computers. After the introduction of computer systems in businesses in the 70s, small computer systems entered households in the 80s. First, text-based applications were replaced by graphics-based What You See Is What You Get (WYSIWYG) applications. At the end of the 80s, more and more applications made use of the available graphical user interface (GUI) for cartoon animations and to display pictures. This development encouraged manufacturers of video cards and monitor to display up to true-color1 images on standard computer monitors. The next logical step was the development of technology to display motion-fragments on a computer monitor. In summary, video technology and computing power drove each other to new and higher standards in the computer industry. Video processing requires a large amount of resources. Data storage and network capacity are currently the main bottlenecks for more widespread use of video applications. Waiting for new technology in these elds is not an option and other approaches must be taken to make more eective use of the available resources. Compression might be the solution in this situation. Compression algorithms make it possible to store the same data in less storage space. Compression is not only useful for storage on media such as hard disks, CDs and tape streamers, it is also useful for transmission of data over a network: less bandwidth is used for the same data. However, compression does not come free: the initial data must be transformed to its compressed equivalent. To use the initial data again, the compressed data must be decompressed. This transformation forwards and backwards takes time, may reduce quality of the data, uses system memory and above all, it costs CPU power. 1 True color is 24-bit color, which means that for every red, green and blue component 8 bits per pixel are reserved
1
1.2 The Pegasus project
Introduction
1.2 The Pegasus project The Pegasus Project [Mullender92] is a three-year European ESPRIT project with the University of Twente and the University of Cambridge as main participants. In 1995, this rst project was successfully nished with a workshop in Cambridge. The outcomes of the project is a multimedia system, including a real-time kernel, a lesystem for multimedia-data, and a video conferencing application (the digital TV director). In succession of this project, the Pegasus-II project continues the work in this area. The Pegasus Lab consists of a number of DEC Alpha workstations with DEC JV300 video cards. Cameras and microphones are connected through an Audio/Video Adapter (AVA) to an Asynchronous Transfer Mode (ATM) network. The DEC Alphas are also connected to this ATM network through an OTTO interface card. A Video stream is set up by sending control information to the AVA about the required video stream. This control information includes the video format (RGB, YUV, Motion-JPEG), the size of the image, the frame rate, and scaling information to automatically down-sample the resolution of the camera. Video is displayed on a DEC Alpha through the JV300 video card. The video can also be redirected to an ATM Television (ATV) that displays multimedia streams directly from the ATM network.
1.3 Goals The current algorithms used for video transport and storage are not as ecient as newer, advanced algorithms. In the Pegasus multimedia system, video streams are transmitted as Motion-JPEG streams. These streams consist of video frames that are compressed individually by a factor of 10 with negligible quality loss (invisible to the human eye). This compression mechanism is done Real-Time (30 frames per second). However, more advanced compression algorithms make use of the temporal redundancy found in subsequent video frames. These algorithms compress video up to a factor of 400. Unfortunately, this cannot (at this time) be done real-time without advanced, multiprocessor, dedicated computer systems. To make video conferencing possible between two persons at two dierent locations, these advanced compression algorithms, discussed in the previous paragraph, must be used to reduce the required network bandwidth. Unfortunately, computing power of a desktop personal computer is signi cant lower than for an high-end workstation. Purpose of my work is to examine the dierent video compression algorithms that are available, nd the conditions and parameters that apply to DVC and create a working implementation of a video compression algorithm.
1.4 Problems and Constraints Three mayor constraints determine the eectiveness of a video compression algorithm: 2
Introduction
1.5 Areas of research
Computing power: The computing power that is available in a system
determines the number of instructions that can be executed in a speci ed time. The more instructions are executed, the more redundancies in the data can be found. Processor (CPU) speed, the type of instruction set (e.g. multimedia instructions), main, virtual and cache memory in the the system and the system bus all determine the computing power of a system. Network bandwidth: The bandwidth of the network determines the amount of data that can be transported in a certain time. Not only the average bandwidth of the network is important, the minimum bandwidth and the network latency is also important. The minimum bandwidth is the bandwidth that is guaranteed by the network. In some network, e.g. Internet, this number is equal to zero. In other network, this number is equal to the average bandwidth (ISDN). In ATM-networks, services classes determine this behaviour (See also Section 8.3.4). Latency: The nal constraint for a real-time compression algorithm is the time between capturing a video frame from a camera and displaying this frame after sending it over a network. The latency depends on both the computing power and the network latency. For some applications other than DVC, latency is less important. For instance, when a video stream is stored on hard disk solely for later use. The total latency is the sum of all partial latency that occurs: The time between the frame that is grabbed and the time the data becomes available, the time it takes to compress this frame, the time it takes to transmit and receive a frame, and the time it takes to decompress and display the frame. If all resources are used on a system for compression, the compress latency is equal to the time it takes to compress a frame2. Latency is especially important for situation where interaction is required: the response time may not be too long, because it would disrupt the interaction.
See also Chapter 8 for a closer look at the characteristics of dierent network and the requirements for video communications.
1.5 Areas of research Video compression requires a lot of resources. Computing power is the main obstacle for real-time compression. Although real-time decoding of a compressed video stream is possible for the Motion Picture Experts Group (MPEG) video compression standard [Patel93] [Bhaskaran95b], real-time encoding is impossible without support of enhanced video hardware chips. Especially the motionestimation search algorithms require a lot of CPU time: an optimized encoder requires 5 times more time than its related decoder [Bhaskaran95a]. 2 We only consider full frame compression. Compression latency might be reduced further when frames are compressed in separate parts
3
1.5 Areas of research
Introduction
Current developments in the area of computing indicate that CPUs become faster and real-time encoding is feasible in a couple of years. Other developments in the area of data networks and communication indicate that network capacity will grow and the number of connected people with a faster-thanPOTS3 -connection also increases signi cantly over the next years. These developments give hope for teleconferencing in the future, but to make real-time compression possible now, new approaches must be taken. To make real-time encoding possible, a number of sacri ces must be made. Frame-rate could be decreased to a still-acceptable rate, screen size may be reduced, and the size of the compressed video stream might be increased to win valuable computing power. Another research topic is concentration of computing power to areas of the video screen that are important. These areas can be detected by nding parts of the screen that contain skin color [Zwaal95] or foreground. All these issues in uence each other and make it dicult to nd a standard solution in the form of one piece of hardware and one o-the-shelf software package. Finding an appropriate video compression standard and creating a software codec that satis es this standard is the rst part of my work. The second part is the analysis of a DVC-speci c video stream in order to optimize the codec for this kind of data. The nal, and most ambitious part of this research is the development and implementation of a codec that proves that near Real-Time encoding is possible for modest-sized video sequences. In the next two chapters, respectively image and video compression are discussed. These two chapters were published earlier as a Pegasus Paper [Aalmoes95]. In Chapter 4, the arguments are given to use the H.263 standard for video compression of low-bandwidth streams. In Chapter 5, a video stream is analyzed on its movements and alternative motion estimation algorithms are discussed here. In Chapter 6, a variety of Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) algorithms are examined. Quantization, that is sometimes combined with the DCT operation is also discussed here. In Chapter 7, a number of methods are discussed to reduce the video stream or encoding time in an non-confessional way by only compressing the parts of the screen that are most important. In order to connect the technology with the reality, a variety of network connections are discussed and their implications on the video compression engine. In Chapter 8, a number of dierent computer and telecommunications networks are discussed and their implication on video encoding. The implementation of a fast H.263 encoder is discussed in Chapter 9, and its corresponding results are also given here. Finally, in Chapter 10, the conclusions are given on Real-Time video compression for low-bandwidth networks.
3 POTS stands for Plain Old Telephone System, and represents the interconnected telephone network of the national PTTs
4
Chapter 2
Overview of Still-picture Compression Standards This chapter was published earlier together with chapter 3 as the article \Overview of Still-picture and Video Compression Standards", Pegasus paper 95-3
2.1 Introduction In the past years a number of compression standards have emerged and a number is now being developed. Although it would be useful to use only one general video compression standard, a growing number of standards is developed because of enhanced processing power, dedicated hardware, new compression techniques, and networks with dierent bandwidths. Each compression standard supports a speci c video application. It is dicult to choose the correct compression standard for a speci c application. As is true of compression in general that there does not exist one best compression algorithm, the same is true of video compression: there is no best standard. Some applications require fast real-time encoding, at the cost of the compression factor (video-conferencing), while other applications want maximum compression at encoding that need not be done real-time, as long as decoding is real-time (e.g. compressing a video stream on CD-ROM). This paper describes the dierent compression techniques used in the available standards on video compression. This simpli es the choice of which standard is most suitable for a certain application. Furthermore, an estimation can be made on the computational costs and size of video stream of a video compression standard. A division is made between still-picture and video compression techniques. In Section 2.2, some commonly used terms and compression techniques for still-picture compression are explained. These techniques include ways to remove redundant information and information that is not visible to the human eye. In Section 2.3, the most important still-picture compression standards are discussed. Video compression techniques rely strongly on the techniques used in still-picture compression, but also incorporate prediction and motioncompensation algorithms. These algorithms that remove redundant information 5
2.2 Still-picture compression techniques
Overview of Still-picture Compression Standards
between the dierent frames of a video stream are discussed in Section 3.1. In Section 3.2, some video compression standards or standards to be are discussed.
2.2 Still-picture compression techniques 2.2.1 Lossy and Lossless Compression algorithms can be categorized in two groups: lossless and lossy compression. Lossless algorithms generate exactly the same bit pattern of an object after decompression as before the object was compressed. These compression algorithms are used for text and computer binary les. Lossy compression algorithms, however, may loose some information during compression and decompression. In a good lossy compression algorithm, the lost information is not visible in case of a picture. Most lossy compression algorithms have the ability to specify a quality-setting that determines how much quality (information) may be lost for a higher compression factor. Lossy compression algorithms are useful for compression of sampled data. This data is analog data from a microphone or a camera that is converted to a digital approximation. Therefore, lossy compression algorithms that change the data slightly are not catastrophic. Lossy compression followed by decompression, however, causes quality loss that can better be avoided by reducing the number of compression-decompression operations for a picture. If a picture must be manipulated (in the image space), it can best be stored as raw data between the image operations. Video compression methods are generally lossy. Video streams need to be compressed aggressively to reduce the required bandwidth and storage capacity for the video stream. Since lossy algorithms reduce the stream the most, such algorithms are used for video stream compression. It does not really matter much if parts of a video stream are lost: since the video stream is continuous, the next frame may repair the lost frame part.
2.2.2 Color spaces To de ne a video image, not only the resolution of the image must be speci ed but also the way the color information is stored. A gray-scale picture only has one color-component: luminance [Poynton95]. For an 8-bit gray-scale image, the higher the value, the lighter the color gray. The value 0 represents the color black and the value 255 represents the color white. To represent color pictures, three components are required. The most popular color space used is RGB . In this color space, the R (red) component represents the amount of red in the pixel, the G (green) component represents the amount of green in the pixel and the B (blue) component represents the amount of blue. True-color pictures use 8-bit for each component and thus 24-bit per pixel. Another known color space is Y C C . The Y component represents the luminance, while the two chrominance components C and C determine the actual color. B
R
B
6
R
Overview of Still-picture Compression Standards
2.2 Still-picture compression techniques
To convert a RGB color space to an Y C C color space, the three color component intensities of RGB determine the luminance Y . The Y value is a weighted sum of the three color intensities: the green component is brighter than the red component and the red component is brighter than the blue component for the same value. The C component is the blue component without the total luminance (C ? Y ). The C component is the red component without the total luminance (C ? Y ). The Hue-Saturation-Brightness (HSB ), the Hue-Lightness-Saturation (HLS ), or also called the Hue-Saturation-Value (HSV ) color space is based on specifying the colors numerically. The Hue component describes the pure color, the Saturation component describes the degree to which the pure color is diluted by while light, and the Brightness describes the brightness or luminance. The problem with this color space is that no reference is made to the linearity or non-linearity of the colors. To determine the lightness from RGB , the three color component values are averaged ((R + G + B )=3), while the visual luminance of green is much higher than the visual luminance of blue. For (lossy) image compression, it is advised to convert RGB color space to Y C C [Wallace91]. The human eye is more sensitive to luminance components than to chrominance components and by separating them, the luminance component can be encoded in a higher resolution than the chrominance components. In other words, less bits for the chrominance components need to be encoded. The relation between the resolution of the luminance components and the chrominance components determines the picture format. A luminance component accompanied by two chrominance components that are down-sampled in both horizontal and vertical dimensions by two is called the 4:2:0-format (see Figure 2.1) [Kleine95] [Filippini95]. If the chrominance components are only down-sampled in horizontal direction by 2, the format is called the 4:2:2-format (see Figure 2.2). Finally, the 4:1:1-format has its chrominance components horizontally down-sampled by 4 and has no down-sampling in the vertical dimension. B
R
B
B
R
R
B
R
+
Y (8 x 8)
+
C
B
(4 x 4)
Figure 2.1: The 4:2:0-format 7
C
R
(4 x 4)
2.2 Still-picture compression techniques
Overview of Still-picture Compression Standards
+
Y
+
C
(8 x 8)
B
(4 x 8)
C
R
(4 x 8)
Figure 2.2: The 4:2:2-format
2.2.3 DCT transformation A transformation that is useful in image compression is the DCT [Wallace91]. This transformation converts an n n block of elements into another block of n n coecients. These n n coecients represent two-dimensional unique spatial frequencies. The DCT function is reversible by using an IDCT function. The rst coecient, which has a zero horizontal and vertical frequency, is called the DC-coecient and is equal to the average value of the original elements. The other coecients are called AC-coecients and represent the dimensional spatial frequencies. The DCT and IDCT are lossless if the DCT encoded data are stored with perfect accuracy. In practice, however, the coecients are stored as integers which can introduce small dierences with the original data after the IDCT decoding. If the DCT transformation is applied to blocks of pixels, higher spatial frequency coecients become (near) zero because most pixels next to each other dier little in value. If relative more bits are used to encode the lower frequency coecients than the higher frequency coecients, a (lossy) compression method is created.
2.2.4 Scalar Quantization When people refer to "quantization", they usually mean Scalar Quantization and not other forms of Vector Quantization (VQ) (see Section 2.2.5). Scalar quantization is used to reduce the number of bits that are needed to store an integer. This can be done by dividing the integer by a quantization factor and rounding it to the nearest integer before it is stored. To retrieve the integer again, the stored (quantized) integer is multiplied by the quantization factor again. This step is not lossless, as the density of the domain of the integer is reduced by the quantization factor. An example of quantization is given in Figure 2.3. A 3-bit domain is quantized by factor 2, which reduces it by one bit to a 2-bit domain. After de8
Overview of Still-picture Compression Standards
2.2 Still-picture compression techniques
Quantized value
Reproduced value
0
0
0
1
0
0
Original value
2
Quantization by factor 2
1
Dequantization by factor 2
2
3
1
2
4
2
4
5
2
4
6
3
6
7
3
6
Figure 2.3: Quantization and de-quantization quantization, only the values 0, 2, 4 and 6 have the same value as before the quantization, but the other values are approximated by the nearest value to the original integer.
2.2.5 Vector Quantization VQ techniques make use of codebooks in combination with a matrix of vectors to represent an image [Gray92]. Instead of referring to elements directly, elements are referenced via the codebook. To transmit an image, only the references to the codebook (the vectors) have to be sent. A lot of dierent VQ methods exist. For example, each vector points to an RGB-triplet that represents one pixel, or each vector points to an n n image block that represents the vector. The way the contents of the codebook is determined also varies. A common way to generate a codebook is by using a training set to nd the "best" codes, the codes that occurs most frequently. A codebook can also be calculated based on the data that are quantized. In this case, the codebook itself is transmitted together with the vectors. Known video compression implementations based on VQ are Cinepak from Radius and Indeo 3.23 from Intel, both free for developers [Bryan95].
2.2.6 Entropy Encoding An entropy encoding algorithm is a lossless compression method, which encodes the data based on their statistical characteristics. The term "entropy encoding" is used in the JPEG compression method (see Section 2.3.1), but it can apply to all compression algorithms that increases the "energy" or the information 9
2.2 Still-picture compression techniques
Overview of Still-picture Compression Standards
density in a message.
Human and Arithmetic compression One of the rst used general-purpose lossless compression algorithms is Human-coding [Nelson91]. This method assigns shorter bit-patterns to characters in the message that occur more frequently and longer bit-patterns to characters that occur less often. The table which is used to nd the frequency of occurrence of a character is called the Human-table. This table is determined before encoding is done by analyzing the statistics of the data to be encoded. If decoding is done on data with the same statistical characteristics, the Human-table is incorporated in the decoder. If the decoder is used to decode data with dierent statistical characteristics, the Human-table itself is sent prior to the encoded data. Another well-known method for entropy encoding is adaptive Humancoding. This method outputs a bit-pattern for each character of the message, based on the occurrence of this character of the previously encoded characters; a character that occured more frequent in the past has a smaller bit-pattern. The Human-table of adaptive Human-coding is built up on the y at the encoder, and in the same way it is rebuilt at the decoder. An advantage of this method is that the Human-table does not have to be transmitted. A variant of Human coding that compresses data more than (adaptive) Human coding, is arithmetic coding. This method assigns a fractional number of bits per code, instead of a xed number of bits in Human-coding. The result of an arithmetic coded message is a number between 0 and 1. This number is multiplied by the number of characters in the message. The integer part is used for decoding the next character and the fraction for decoding the rest of the message. Because the coding table has for each character a range between two fractional numbers to choose from, it can `choose' the best number. A drawback of the arithmetic coding algorithm is that it is patented, and therefore it is mostly replaced by the less ecient Human-coding.
LZW compression Lempel-Ziv-Welch (LZW) is an entropy encoding technique, developed by Terry Welch [Nelson91]. The best known implementations of LZW are the UNIX "compress" utility and CompuServe's Graphics Interchange Format (GIF). LZW is based on the LZ77 and LZ78, which are developed by Lempel and Ziv. LZ77 and LZ78 are dictionary-based algorithms: they build up a dictionary of previously used strings of characters. The output stream of these encoders consists of characters or references to the dictionary. A combination of a reference with a character generates a new reference in the dictionary. For example, a reference to "Hi" in the dictionary followed by the character "s" results in a new reference "His". LZW is an improvement over LZ78. LZW uses a table of entries with an index eld and a substitution-string eld. This dictionary is pre-loaded with every possible symbol in the alphabet. Thus, every symbol can be found in 10
Overview of Still-picture Compression Standards
2.2 Still-picture compression techniques
the dictionary by using a reference. The encoder searches in the dictionary for the largest possible reference to the string at the input. This reference plus the rst symbol of the input stream after the reference is stored in the output stream. An example of the encoding of the word "bananas" is given in Figure 2.2 and Table 2.1. The decoder reads the encoded stream and replaces the reference by the substitution string that is stored in the associated entry of the dictionary. The symbol that follows the reference is directly stored in the decoded stream. The reference and the symbol are also used to create a new entry in the dictionary. Input stream 'b' 'a' 'n' 'a' 'n' 'a' 's'
Generated entry Output stream 256 = \ba" 257 = \an" 258 = \na" 257 259 = \ana" 260 = \as"
'b' 'a' 'n' 257 'a' 's'
Table 2.1: LZW encoding of word \bananas" Index 0 1 ... 255 256 257 258 259 260
substitution string (char) 0 (char) 1 .. (char) 255 \ba" \an" \na" \ana" \as"
Table 2.2: LZW dictionary after encoding/decoding of word \bananas"
2.2.7 Fractal compression Fractal compression is one of the latest techniques in lossy image compression. Fractals are images that recursively contain themselves. They are de ned by a number of translations that include rescales, rotations and dimensional ips. If you zoom into a fractal image, it appears that the image has an in nite resolution, but it is actually a part of the same image that reappears in itself. The idea behind fractal compression is to automatically nd a fractal that resembles the image that must be compressed. A mayor advantage of fractal compression is the ability to decompress the image to any given resolution. The rst implementation of such an algorithm was implemented by Arnaud Jacquin [Gailly95] 11
2.2 Still-picture compression techniques
Overview of Still-picture Compression Standards
and was capable of compression from 8:1 to 50:1 while remaining reasonable quality. This implementation searches a combination of transformations that represent the image the best. Unfortunately, the search to nd these transformation is very computationally intensive, which makes it unattractive for real-time image compression. Iterated Systems developed and sells a fractal-based compressor/decompressor, mainly used for CD-ROM encyclopedia applications.
2.2.8 Wavelet compression A relative new and promising development in the area of lossy compression is the use of wavelet transformation [cody92] [press91]. An important characteristic of this transformation is that if it is applied on a time-domain signal, it results in a representation that is localized in time domain as well as in frequency domain. Compared to the Fast Fourier Transform (FFT) that is of an order of N 2 log(N ) for N elements, a fast wavelet transform has an order of N for the same number of elements. The wavelet transformation converts a sample of 2 values into 2 ?1 approximation wavelet transform coecients and 2 ?1 detail wavelet transform coecients. This transformation can be repeated over the generated approximation wavelet transform coecients a number of times, until the minimum number of 2 approximation transform coecients and 2 ? 2 detail transform coecients remain. The number of transformations is called the number of levels of the wavelet transformation. The wavelet transformation is inversive, so applying this inverse wavelet transform a number of times (equal to the number of levels) on the generated wavelet coecients, the original sample is recomposed. J
J
J
J
X0
S
X1
S
X2
S
0
1
2
Approximation transform coordinates
DWT X3
S
X4
D0
X5
D1
X6
D2
X7
D3
3
Detail transform coordinates
Figure 2.4: Example of a Discrete Wavelet Transform An example of wavelet transform is given in Figure 2.4. In this example a Discrete Wavelet Transform (DWT) is applied to an array of 8 coordinates. The result are 4 approximation transform coordinates S0::S3 (also called the smooth vector) and 4 detail transform coecients D0::D3 (also called the detail vector). Now, the DWT is applied again on the approximation transform coecients. 12
Overview of Still-picture Compression Standards
2.3 Still-picture compression standards
All of these resulting coecients together with the detail transform coecients from Figure 2.4 form the nal wavelet coecients. Wavelet compression is obtained by only storing those coecients of the wavelet transformation that have an amplitude above a certain threshold together with the place of those coecients in the transformed domain. Because the coecients are also time-domain, high contrast edges are maintained at the cost of low contrast areas. By using quantization and entropy encoding in combination with wavelet transform the number of bits needed to store the wavelet coecients is further reduced.
2.3 Still-picture compression standards Still-picture compression techniques take advantage of spatial redundancy found in images: in most cases, pixels close to each other have the same color, or almost the same color.
2.3.1 JPEG The JPEG standard [Wallace91] is developed by the Joint Photographic Experts Group. It is a collaboration between the former International Telegraph and Telephone Consultative Committee (CCITT)1, and the International Standardization Organization (ISO). The JPEG standard is now widely adopted in the world. There are four modes of operation:
Sequential encoding: This is the general mode, in which a picture is encoded from top to bottom.
Progressive encoding: In this mode, the picture builds up in multiple scans. After each scan, the picture gets sharper.
Lossless encoding: In this mode, the picture is compressed in a way that no data is lost after decompression. The algorithm used for lossless encoding is rather dierent from one used in the sequential and progressive modes of operation.
Hierarchical encoding: In this mode, the image is encoded in dierent
resolutions. Accessing a low-resolution version does not require decompression of the full resolution version.
The JPEG encoder works on one color component at a time. For gray-scale pictures, which only consist of one component, the encoding is straight-forward. For color pictures, every component is encoded separately just like a gray-scale picture. The color components can be interleaved with each other or can be sent after one another, see Section 2.3.1. 1
The CCITT is now called the International Telecommunication Union (ITU)
13
2.3 Still-picture compression standards
Overview of Still-picture Compression Standards
Sequential encoding The most common way to encode a JPEG picture is by using sequential encoding. An overview of the codec (Compressor/de-compressor) is given in Figure 2.5. JPEG-encoder
Data analysis Quantization table
Original picture
Color space conversion
DCT
Quantization
Huffman table
Entropy encoding
Compressed picture
Decompressed picture
Color space conversion
IDCT
De-quantization
Quantization table
Entropy decoding
Huffman table
JPEG-decoder
Figure 2.5: Overview of the JPEG codec For every component, the picture is divided in blocks of 8 8 pixels. Each block is transformed in another 8 8 block using a DCT function. The resulting transformed block consists of 64 unique two-dimensional spatial frequencies coecients, of which the higher frequency coecients are very small or zero. After the DCT transformation, the transformed block is quantized by using an 8 8 quantization table. This means that every coecient is divided by its corresponding quantization value and rounding the result to the nearest integer. Note that this step is lossy and removes data that may not be visible to the human eye. The resulting block of coecients contains even more small or zero values. This block of coecients is stored in a sequence according to a zig-zag route de ned in the block, see Figure 2.6. This zig-zag sequence is chosen in a way that low-frequency coecients are stored rst and the highfrequency coecients last. Putting the high-frequency coecient next to each other results in a series of low or zero value data at the end of the sequence. This sequence is encoded eciently by using the entropy encoder. The nal step is entropy encoding of the created sequence. The quantized DC coecients are treated a little dierent from the other AC coecients: because the value of DC coecient of adjacent blocks correlate strongly, not the value itself but the dierence with the previous DC coecient is used for entropy encoding. The entropy encoder is a mixture of a variable-length encoder and the Human or arithmetic encoder. 14
Overview of Still-picture Compression Standards
2.3 Still-picture compression standards
Figure 2.6: Zig-zag coecients sequence
Color Component Interleaving It is possible to interleave the dierent color components per frame. In interleaving mode, every color component is divided in a number of Minimum Coding Units (MCUs). Every MCU consists of i by j (where 1 i; j 2) data units for each color component. A data unit is a block of 8 8 pixels that is converted to its DCT equivalent block of 8 8 coecients. Data units in an MCU, and MCUs in a frame are ordered in a left-to-right and top-to-bottom order. The number of data units in a MCU may not exceed 10. Because of this restriction, not all non-interleaved JPEG compressed images can use interleaving. For a 4:2:0-format, an MCU contains 4 data units of the Y component, 1 data unit of the C component, and 1 data unit of the C component, see Figure 2.7. The interleaved video stream of this example looks like this: MCU1 ; MCU2; ::: = B
R
Y0 0 ; Y1 0 ; Y0 1 ; Y1 1 ; C ;
0
;
;
2
;
4
B0;0
;C
R0;0
6
; Y2 0; Y2 1; Y3 0; Y3 1; C ;
;
;
0
0
;
B1;0
;C
R1;0
2
; :::
0
0
2
0
2
+
4
+ 2
2 6
Y
C
(8 x 8)
MCU 1
B
(4 x 4)
MCU 2
Figure 2.7: Component interleaving using MCUs 15
C
R
(4 x 4)
2.3 Still-picture compression standards
Overview of Still-picture Compression Standards
Progressive encoding Progressive encoding allows a user to transmit a picture in a number of scans. The rst scan is a rough approximation of the picture, but every next scan improves the picture. Progressive encoding uses the same compression techniques found in sequential encoding. The progressive encoding mode, however, introduces a buer between the quantization and entropy encoding step large enough to store the whole DCT-encoded and quantized picture. The buer is then entropy encoded in a number of scans. Two methods can be chosen to select the information per scan:
Spectral selection: a selection of the quantized coecients is made that
are transmitted. For example, in the rst scan, only the DC coecient and the three rst AC coecients (according to the zig-zag ordering) are transmitted, in the second scan the next 16 AC coecient are transmitted, and in the nal scan the last 44 AC coecients are transmitted. Successive approximation: a bit selection of every quantized coecient is sent per scan instead of the whole quantized coecient. For instance, in the rst scan, the three most signi cant bits of all the quantized coef cients are transmitted and in the second and nal scan the rest of the quantized coecients are transmitted.
The two methods can be mixed, which enables the user to choose the "progression" in a very exible way. The drawback of progressive encoding compared to sequential encoding is the extra buer that is introduced in the encoder and the decoder, and a more computational-intensive decoder as for each scan the quantization and IDCTprocesses must be executed again.
Lossless encoding The JPEG lossless mode does not make use of the DCT transformation. Instead of DCT and quantization, it uses a prediction process that determines the value based on the values of the pixels on the left and above the current pixel, see Figure 2.8. The selection value for prediction (see Table 2.3) and the dierence with the actual pixel value is then sent to the entropy encoder. The entropy encoder can be either an Human encoder or an arithmetic encoder. For the Human encoder, the entropy encoding stage is almost identical the the DCcoecient encoder in the sequential mode.
Hierarchical encoding Hierarchical encoding allows a user to decode a low-resolution version of a picture, without decoding and down-sampling the whole encoded picture. The hierarchical mode is used in combination with lossless, sequential, or progressive encoding mode. A number of steps are done by the hierarchical encoder, see Figure 2.9: rst, the picture is down-sampled a desired number of times by a factor of two in 16
Overview of Still-picture Compression Standards
2.3 Still-picture compression standards
C
B
A
X
Figure 2.8: JPEG Lossless prediction for pixel X Selection value Prediction for X 0 no prediction 1 A 2 B 3 C 4 A+B ?C 5 A + (B ? C )=2 6 B + (A ? C )=2 7 (A + B )=2 Table 2.3: Lossless prediction formula table horizontal dimension, in vertical dimension, or in both dimensions. The result is the minimal resolution of the picture that can be retrieved by the decoder. Then, the encoder compresses this down-sampled picture using one of the sequential, progressive or lossless compression modes. This compressed image is sent to the outgoing video stream. After that, the encoder decompresses the compressed image, so it has the same image as the decoder2 . This image is up-sampled by a factor of 2 in either horizontal, vertical or both dimensions, by using an interpolation lter that is also used in the decoder. The result is then compared with the original image which is down-sampled to the same resolution, or with the original image itself (without down-sampling) if it is already the same resolution. The dierence of this comparison is encoded using the same compression method as mentioned before. In this encoding, a dierent quanti cation table can be used, because the dierence of two images has other (statistical) characteristics than an image itself. If the up-sampled image still has a lower resolution than the original image, the encoder can up-sample, interpolate, compare with the (down-sampled) original image, and calculate a compressed dierence-image again, until the whole resolution of the original image is sent over. The drawback of hierarchical encoding is the need for picture buers for each resolution that is sent over at the encoder and one extra buer at the decoder. Furthermore, if the decoder wants the picture with the highest resolution, a lot more calculations must be made at the encoder and at the decoder than with sequential coding. 2 This decoding step can be optimized when lossless mode is used or an intermediate result is stored before the lossless entropy encoding during encoding.
17
2.3 Still-picture compression standards
Overview of Still-picture Compression Standards Down-sampled image
Original image
First scan Down-sampling
Upsampled reconstructed image
Compression
Decompressed image
Up-sampling &
Decompression
Interpolation
Difference image
Comparison
∆
Second scan Compression
Figure 2.9: Hierarchical encoding in JPEG The Independent JPEG Group (IJG) developed a public domain source code that supports lossless, sequential and progressive operations modes. Hierarchical mode is not (yet) supported.
2.3.2 GIF
GIF is a lossless, 256 color, still-picture compression standard [Gailly95]. It a variation of the LZW compression technique called Variable-Length LZW. GIF is most suitable for images that have a small number of colors, such as computer generated graphics and cartoons. It is also useful for small images. The dierence between the LZW compression method and Variable-Length LZW used in GIF is that in the latter the size of the code to represent an entry in the table is increased by a bit when the table is full. If this code is 12-bit and the table is full, a special character symbol (a clear code) is encoded to indicate that the table must be emptied and the table must be rebuilt from scratch. There are two widely used versions of GIF: 87a and 89a [Compuserv90]. The 89a version has some extensions to insert text into the picture, and comments and application speci c codes in a GIF le, but the LZW algorithm used is not dierent.
18
Chapter 3
Overview of Video Compression Standards This chapter was published earlier together with chapter 2 as the article \Overview of Still-picture and Video Compression Standards", Pegasus paper 95-3
3.1 Video compression techniques The advantage of video compression standards over still-picture compression standards is that they not only make use of spatial redundancy, but also make use of temporal redundancy, which can reduce the size of the video-stream signi cantly. Temporal redundancy is the property of a video stream to show the same information (objects) over a certain period of time (a number of frames). In video compression algorithms, motion-compensation prediction techniques are used to scan previously sent frames for parts that are (about) the same as the current frame that is encoded.
3.1.1 Prediction The most basic form of prediction checks if a block of n m data in the current frame is the same as the block on the same place in the previous frame. If there is no change, the data of this block is not encoded. Although this is an easy example, the implementation still requires quite some thought: What size is chosen for the block that is compared and must the blocks be exactly the same, or is there a threshold value before the block is marked as changed? In most implementations, prediction is combined with motion-compensated interpolation. If a block of data is not identical to a block of data in a previous frame, the best matching block is found and the dierence is used for further compression. The resulting block is compressed better than the original block of data. The area that is used for comparison of the block determines the quality of the nal prediction. The larger the area that is searched to nd a matching block, the larger the change it is actually found. But most matching blocks are 19
3.2 Video compression standards
Overview of Video Compression Standards
found around the place of the original block and increasing the search area also increases the computation time to nd a matching block. Bi-directional prediction not only searches a previous frame for a closematching block, it also searches a next frame in the video stream. Another advantage of bi-directional prediction is that it can combine the prediction of a previous frame with the prediction of a next frame into an average prediction image block.
3.2 Video compression standards 3.2.1 Motion-JPEG The (lossy) JPEG compression standard is so successful that it is also used for motion-picture compression [Lane95]. Although Motion-JPEG is not declared as a standard by any standardization committee, it is used by many vendors and may therefore be called a de facto standard. Motion-JPEG has no standard set of picture formats, nor is there agreement over the le format. Motion-JPEG encodes every frame as a sequential mode JPEG-picture, without making use of temporal redundancies in the video stream. MotionJPEG has a number of advantages. First, Motion-JPEG requires less computation time than other compression standards, because of the lack of motioncompensation algorithms. Second, random access of frames is possible as every frame can be encoded and decoded independent of the other frames. Third, Motion-JPEG introduces less latency than methods with Motion Estimation like MPEG (See the next Chapter). The main disadvantage of Motion-JPEG is the poor compression factor, due to the lack of temporal redundancy reduction techniques.
3.2.2 MPEG-1 MPEG stands for Motion Picture Experts Group, and is concerned with the development of video compression standard [Gall91]. Although the MPEG is also developing audio and synchronization standards as part of the MPEG standard, we only look at the video compression techniques used. MPEG-1 is the rst standard of the MPEG-group. It describes the way a video stream must be stored, but it does not give speci cations on how the coding (and decoding) must be done. The MPEG-1 standard is designed for (encoded) video streams of 1.5 Mbps, which is sucient for CD-ROM applications. MPEG-1 supports encoding of the Standard Interchange Format (SIF), which has a resolution of 352 240 pixels at a rate of 30 frames per second for NTSC and a resolution of 352 288 at a rate of 25 frames per second for PAL and SECAM. A MPEG video stream uses three dierent types of frames:
I-frames: The I or intra-picture frames are compressed independent of the other frames in the video stream.
20
Overview of Video Compression Standards
3.2 Video compression standards
P-frames: P or predicted frames are frames that store the dierence be-
tween the current frame and the previous P or I frame that is encoded. B-frames: B or Bidirectional prediction frames use both the previous I or P frame and the next I or P frame to predict the current frame.
I, B and P frames are compressed dierently. In I frames, compression is achieved by reducing spatial redundancy in the frame. P and B frames also use temporal redundancy reduction to improve the compression factor. Because B-frames make also use of the next I or P frame as reference, B frames have the highest compression factor. An MPEG video stream consisting of only I frames has, except for some quantization and Human-encoding details, the same compression factor as a motion-JPEG video stream (using the same video format). However, I frames are important for random access, the ability to decode independent frames without decoding the whole video stream. A frame is divided in a number of 16 16 blocks of pixels called Macroblocks. A macro-block can be encoded in four dierent ways:
Intra-block encoding: no prediction is used. Forward predicted encoding: A 16 16 block of pixels is searched in the
next I or P-frame that most closely resembles the current macro-block, see Figure 3.1. The dierence between these blocks are used for further compression. Backward predicted encoding: This encoding is the same as forward predicted encoding, but with the dierence that blocks are searched in the previous I or P frame instead of the next frame. Average encoding: Backward and forward predicted encoding is used to nd two blocks of pixels that resemble the current macro-block best, see Figure 3.2. These two blocks are averaged and the dierence with the current macro-block is used for further compression.
T0
T1
Figure 3.1: Forward prediction 21
3.2 Video compression standards
T-1
Overview of Video Compression Standards
T0
T1
Figure 3.2: Average prediction I frames only use intra-block coding; P frames use either intra-block coding or backward predicted coding. B frames use any of the encoding modes. After motion prediction, a macro-block must be compressed to reduce spatial redundancy. An 8 8 DCT is used similar to one found in JPEG. After DCT, coecients are stored in zig-zag order. These coecient are quantized depending of the original encoding mode. For intra-block encoding, low spatial frequency coecients are quantized with a lower quantization factor than high spatial frequency coecients. For the other encoding modes, the coecients are DCT-transformed dierences of pixel blocks. Low frequencies of these blocks will be close to zero, because of the applied prediction. Therefore, another quantization matrix must be used than for intra-block-encoded, DCT blocks. MPEG also allows dierent quantization step sizes for dierent blocks. This is independent of the encoding mode (intra or predictive encoding). Dierent quantization step sizes allow the encoder to code certain blocks more accurate than others. In general, an MPEG video stream consists of many B-frames, some Pframes and a few I-frames. The I-frames guarantee random-access in the video stream. P frames are also important because B-frames can only refer to I or P-frames, not to B-frames. After motion prediction, DCT and quantization, the output stream is entropy encoded by a variant of the variable-length compression technique found in JPEG. The MPEG-1 is tuned for compression of video streams that comply to a Constraint Parameter Bit stream (CPB), see Table 3.1. Video streams that use more bandwidth compared to this CPB may be encoded through MPEG-1, but the encoding is not necessarily ecient and support is not guaranteed by MPEG-1 hardware.
3.2.3 MPEG-2 The MPEG-2 standard is developed for high-end video applications that need a compressed video stream from 4 Mbps up to 100 Mbps [Kleine95] [Okubo95]. MPEG-1 may not be ecient for these video streams, but is to video streams 22
Overview of Video Compression Standards
3.2 Video compression standards
Horizontal size in pixels Vertical size in pixels Total macro-blocks per picture Total macro-block per second Frame rate Bit rate Decoder buer
720 576 396 396*25 (=330*30) 30 1.86 Mbps 376832 bits
Table 3.1: MPEG-1 Constraint Parameter Bit stream that conform to the CPB of MPEG-1. Furthermore, interlaced video streams, which are common in the television industry, are not easily converted to MPEG1; MPEG-2 is more suited for these interlaced video streams. Video streams of MPEG-2 are nevertheless compatible with MPEG-1. The MPEG-2 standard deals with dierent resolution video streams that are divided in pro les and levels. The lowest level format is 352 288 pixels (PAL format) and the highest is 1920 1152 (PAL format) pixels. The simplest pro le does not use B-frames, is not scalable and uses a 4:2:0 luminance/chrominance format, while the high pro le uses B-frames, is scalable and uses either a 4:2:0 or a 4:2:2 luminance/chrominance format.
3.2.4 MPEG-4 At this moment, MPEG-4 is still in development and no concrete algorithms or methods are yet determined [Filippini95]. However, an outline of the goals of MPEG-4 is available. MPEG-4 is not just a compression standard, it will incorporate a description language that determines the contents of a video stream. It also distinguishes dierent objects that enables the user to set priorities to dierent objects so that the foreground of a picture has a higher priority than the background. MPEG4 intends to support a wide variety of video streams from low-bandwidth to 3-dimensional video streams. An MPEG-4 stream combines tools, algorithm and pro les. These will determine how data is stored. For example, subtitles will be coded dierently than other video objects. MPEG-4 is scheduled to become a standard in the end of 1998.
3.2.5 H.261 The CCITT developed the H.261 video compression standard that is designed for video communications over ISDN networks [Liou91][Turletti93]. H.261 can handle p 64 Kbps (where p = 1; 2; :::; 30) video streams and this is equal to the possible bandwidths in ISDN. The H.261 standard supports the following two video formats:
Common Intermediate Format (CIF). This format has a resolution of 360 288 pixels for the luminance (Y ) part of the video stream and a 23
3.2 Video compression standards
Overview of Video Compression Standards
resolution of 180 144 pixels for the two chrominance parts (C and C ) of the video stream; R
B
Quarter-CIF (QCIF). This format contains a quarter of the information of a CIF video stream. This means that the luminance resolution is 180 144 pixels and the two chrominance resolutions are 90 72 pixels; The maximum frame rate for a H.261 video stream is 30 frames per second. The CIF and QCIF consist of pictures for each frame, and within each picture of Group Of Blocks or GOBs, see Figure 3.3. A QCIF has 3 GOBs, while a CIF has 12 GOBs. Each GOB consist of 3 11 Macro Blocks (MB). A Macro Block is composed of 4 8 8 luminance blocks and two 8 8 chrominance blocks (C and C ). A macro block can be compared to an MCU in JPEG. B
R
GOB
QCIF C Y
CIF
B
C
R
MB
Figure 3.3: Composition of an H.261 CIF The H.261 encoder can operate in two modes. In the intra-frame mode, every 8 8 block is DCT-transformed, linearly quantized, and sent to the video multiplexer. In the inter-frame mode, every 8 8 block is also DCT-transformed and linearly quantized, but the result is rst sent to a motion-compensator before it is sent to the video multiplexer. The motion-compensator is used for comparing the macro-block of the current frame with blocks of data from the previously sent frame. If the dierence is below a pre-determined threshold, no data is sent for this block. Otherwise, the dierence is DCT transformed, and linearly quantized. The nal encoding step is the video multiplexer that uses a variable wordlength coder to reduce the bit stream even more. After the video multiplexer, the result is inserted in a transmission buer, which controls the linear quantizer in order to regulate the outgoing bit stream. H.261 is similar to MPEG with respect to the DCT encoding and quantization. During the standardization of MPEG-1, this is done on purpose to simplify implementations that encode or decode both H.261 and MPEG. The main dierence between H.261 and MPEG-1 is that motion vectors in H.261 are restricted to 15 pixels away from the original place in the picture. Furthermore, no future-prediction is used in H.261 which means that H.261 has no equivalent of B-frames in MPEG. 24
Overview of Video Compression Standards
3.2 Video compression standards
INRIA Video conferencing System (IVS) is an implementation of a video conferencing tool that uses H.261 for video compression. Vic is also a video conferencing tool that supports H.261.
3.2.6 H.263 The ITU H.263 draft 1 [ITULBC95] is an improvement over H.261. H.263 is developed for low-bandwidth communications over Plain-Old Telephone Systems (POTS), in particular 28.8 Kbps modems. Compared to H.261, the number of available picture formats is increased, the motion-compensation algorithm has improved, better entropy encoding is used and a new frame is introduced that allows a simple form of forward-prediction. Three new video formats are added in H.263: a sub-QCIF format is added (128 96), a 4CIF format is added (704 786), and a 16CIF format is added (1408 1152). As in H.261, the number of chrominance pixels is always half of the number of total (luminance) pixels. This means that for 2 2 luminance pixels, one C and one C pixel is used. H.263 also supports unrestricted motion vector mode. In the default (restricted) motion vector mode, the block that is referenced should be fully inside the picture. In unregistered motion vector mode, an arbitrary number of pixels may be outside the pixel. For every reference to these pixels, the closest edge pixel is used instead. The Advanced Prediction mode is also new in H.263. Instead of one motion vector to a 16 16 macro-block, four motion vectors to 8 8 blocks are used for prediction. Although this encoding uses more bits than with one motion vector, the quality of prediction improves signi cantly. Another improvement in H.263 is the use of motion vectors that refer to half-pixel displacements instead of displacements with a integer number of pixels. To calculate the referenced sub-pixels, the value of surrounding pixels are interpolated. Besides Intra and Inter encoded frames, H.263 introduces PB-frames. The name is derived from the MPEG P and B frames. A PB-frame (see Figure 3.4) consists of two mixed frames: one "normal" P-frame frame and one bi-directional prediction frame. When a PB-frame is decoded, rst the B-frame and after that the P-frame is shown. The P-frame can use Inter- or Intra encoding modes, but the B-frame can only use the new forward or older backward prediction mode and not the intra encoding mode. The B-frame can refer to the associated P-frame in the PB-frame, and to an average of the associated P-frame and the previously encoded P-frame. Test Model Near (TMN) is an implementation of the upcoming H.263 standard and is used as test model for this standard. It claims that it has a factor of two better compression than H.261 [Telenor95]. Source code of TMN is also available. B
1
R
At the moment of writing, H.263 is not yet a standard.
25
3.3 Summary
Overview of Video Compression Standards
T-1
T0
T1
B frame
P frame
PB-frame
Figure 3.4: B-frame of H.263
3.3 Summary A number of picture and video compression techniques are discussed. For picture compression, distinction is made between lossless and lossy methods. Lossless compression generate an exact copy of compressed data after decompression. Lossy compression methods give up this requirement to obtain a much higher compression factor. Thus, the quality of lossy compression depends not only on the compression factor as with lossless compression , but also on the way the decompressed image resembles the original picture. Lossless methods for compression are LZW, (adaptive) Human and DCT, although the latter loses this property during storage of the rounded coecients. DCT transforms a block of n n pixels into a matrix of n n coecient which represent spatial frequencies. Because most high-frequency coecients are (near) zero, compression is attained. Quantization reduces the number of bits for data by reducing the density of the domain. Scalar quantization does this by dividing data by a quantization factor; decompression is done by multiplying with the same quantization factor. Vector quantization methods use a codebook to translate data to an index in the codebook; the collection of indexes and the codebook together are used to retrieve the original picture. Wavelet transform is another technique that transforms a time domain into a time-frequency domain. Compression is done by storing only some of the generated coecients. JPEG has four dierent operating modes. The lossy modes use a combination of the lossy DCT and quantization methods together with lossless entropy encoding methods to compress pictures. The progressive mode allows a user to send the picture in a number of scans so that the picture improves after each scan. The JPEG lossless mode uses a prediction-based method to compress a picture. Hierarchical mode enables a user to send dierent resolutions of a picture at the same time; low-end decoders will only decode the rst scan or the rst couple of scans, while high-end decoders decode all scans. 26
Overview of Video Compression Standards
3.3 Summary
GIF uses the LZW algorithm to compress lossless 256-color pictures. Video compression techniques not only make use of spatial redundancy to reduce the bitstream, but also use temporal redundancy found in consecutive video frames. In this way, the current frame is predicted from the previous and sometimes future frames and only the dierence of these frame with the current frame is encoded. All video compression standards discussed here use DCT-transformation followed by quantization and entropy encoding. A number of video compression standards are available. Motion-JPEG is a series of JPEG compressed images stored after each other. Only spatial redundancy and not temporal redundancy is reduced. The advantage of this method is the easy implementation and random access of individual frames. The disadvantage is the poor compression factor of video streams. MPEG-1 is a standard for (compressed) bitstreams of 1.5 Mbps. It uses three dierent types of frames: I-frames that store a frame independent of the others in a stream, P-frames that store the dierence between the current and the previous I- or P-frame, and B-frame that use both the previous I- or P-frame and the next I- or P-frame in the video stream for prediction of the current frame. Pand B-frames use motion compensation techniques for their prediction. MPEG2 is an enhancement of MPEG-1 that is optimized for higher bitstreams and better video resolutions than MPEG-1. MPEG-4 is still in development but will make it possible to determine individual objects in video streams. Standards designed for ISDN and POTS telecommunication are H.261 and H.263, respectively. H.261 has two picture formats: CIF has a resolution of 352 288 and QCIF is a quarter of the CIF resolution. H.261 frames operate in either of the following modes: In intra-frame mode, the frame is individually compressed. In inter-frame mode, the dierence with the previous frame is calculated (using motion-compensation) and is stored in the outgoing video stream. H.263 is an improvement over H.261 which supports two more resolutions: 4CIF (704 786) and 16CIF (1408 1152). Furthermore, it introduces a PB-frame that enables a simple form of forward-prediction. The motion prediction algorithm is also improved in H.263.
27
3.3 Summary
Overview of Video Compression Standards
28
Chapter 4
Choosing an encoding standard { the case for H.263 4.1 Introduction In Chapters 2 and 3, a number of image and video compression methods are discussed. The conclusion is that there are a number of compression algorithms that are useful video encoders in their speci c environment. Computing speed and network bandwidth are the most important constraints that determine the encoder to use. In this short chapter, the ideas for using H.263 in Desktop Video Conferencing are discussed.
4.2 The choice for H.263 If we take a look at the dierent standards that are used for video compression, it might be clear that MPEG is a leap ahead of the other standards. First, MPEG is the rst real video compression standard. Other formats existed before, but where merely variations of Motion-JPEG. Second, MPEG makes it possible to encode a video in a number of ways, including a Motion-JPEG variant (by using only I or Intra frames). Third, most applications that use MPEG do not use it with MPEG as real-time encoding source. Although this is changing currently with digital satellite TV, MPEG still requires a lot of processing power for real-time encoding. Standards that are aimed at one-to-one or few-to-few telecommunication are H.261 and now also H.263. These standards make it possible to encode video stream and transmit them over low-bandwidth networks, such as Integrated Services Digital Network (ISDN) and Plain-Old-Telephone-System (POTS). H.261 is the standard for ISDNp 64 Kbps connections (where 1 <= p <= 30) [Liou91]. H.263 is intended to make video conferencing possible over even lower-bandwidth networks such as POTS, with 28.8 Kbps. However, as H.263 is stream-compatible with H.261 (a decoder can see the dierence between these kind of streams) and H.263 has more \potential" to compress further, it is to be expected that H.263 will replace H.261, see Section 4.3. One of the main dierences between H.261 and H.263 encoding is the use 29
4.3 Advanced options in H.263
Choosing an encoding standard { the case for H.263
of half-pixel prediction for motion estimation. As movements are small, a good estimation is made by comparing the macro-blocks with an interpolation of the luminance data. Consequence of this feature is the required extra time to calculate an interpolation. For example, a DEC alpha 150Mhz takes approximately 8 ms to make an interpolation of a Quarter-CIF (QCIF) sized luminance component.
4.3 Advanced options in H.263 To measure the compression performance of H.263, a number of test are done to show the behaviour of the H.263 advanced options towards to compression ratio. These results are given in Table 4.1. Table 4.1: Eect of H.263 advanced options on compression ratio (Miss America sequence, 150 frames) TMN 1.6 encoder for H.263 Encode Size Compress options (bytes) ratio no options 16199 352.0 D 16084 354.5 E 15724 362.7 DF 16084 354.5 G 11639 489.9 DEFG 10704 532.7 Summary of H.263 advanced options: Option D | Unrestricted Motion Vector Mode.Without this option, motion compensation is done for block within [?16 15 5] pixels to the right or below the current block. This mode removes this boundary. D mode also includes edge expansion, which improves compression near the border. Option E | Syntax-based Arithmetic Coding Mode instead of Human encoding. Option F | Advanced Prediction mode. Used only in combination with option D. It uses 4 8 8 motion vectors instead of 1 16 16 motion vector Option G | PB frames mode ;
:
The advanced options in uence compression signi cantly in a number of aspects. In the rst place, compression ratio is increased by using more of the advanced options. Other in uences that are not given here is the extra compression time for more advanced options. Especially the slowness of the advanced prediction mode makes practical use impossible. Option D and G also require more motion estimation time as both uses a larger search area. The time penalty cause by the arithmetic coding instead of Human coding is relatively less than with the other options and also depends on the size of the output bitstream. 30
Choosing an encoding standard { the case for H.263
4.4 DVC scenario and H.263 advanced options
4.4 DVC scenario and H.263 advanced options To tune the video compressor to DVC, some assumptions must be made on the contents of the video. In DVC, the video consists of a person sitting behind his or her desk, most of the time facing the camera. The camera is also pointed to the same place and there is no need to move the camera often, as the person is expected to sit at the same place for most of the time. Some conclusions may be drawn out of this scenario, that aect the negotiable options (See Section 3.2.6) in H.263: 1. The camera does not move. Therefore, there is not much movements at the border or that of the person itself. 2. There are no complicated movements: Not much objects move past each other, also because the background is steady. 3. Parts of the view are hardly used, such as the corners upper-left and upper-right. On average, only 50% of the view contains the image of the person. The rst conclusion gives a good argument for not using the unrestricted motion vector option, as there is not much movement over the border. The second conclusion reduces the need to implement the time-consuming advanced prediction mode . The nal negotiable option is the use of Syntax-based Arithmetic Coding instead o Human coding. This reduces intra frame bit rate about 10%, but inter frame bit rate only 3-4%, according to Telenor. Considering that inter frames are the main frame type, together with the argument that extra computing power is required, this option was also not used. One of the advanced options that is not further discussed, but might be a good candidate to implement is the PB-frames mode. In a Miss America encoding test, it shows some good results regarding the compression ratio. However, to nd matching blocks for B-frames, a search must be made in both the previous frame as the P-frame. This requires a double number of motion-block comparisons, which is the most time-critical part of the encoder process. Another reason for not using the PB-frames mode is that it also introduces an extra latency of one frame before the B-frame can be encoded.
4.5 Summary After examination of the dierent encoding standards and techniques, H.263 is currently the best choice for DVC. This standard not only allows a number of dierent ways of encoding (intra, inter) in a exible way, it is developed for low-bandwidth lines and it also uses advanced techniques not found in other standards (such as half-pel estimation and the advanced options). This reserves some capabilities for video encoding in the future, when faster computers are available and the advanced options are more than now usable. In the next three chapters, alternatives are discussed for the various components found in an H.263 encoder. After this discussion, in chapter 8 implication 31
4.5 Summary
Choosing an encoding standard { the case for H.263
of network characteristics on compression are given and in Chapter 9 the implementation and the results of the developed H.263 encoder are discussed.
32
Chapter 5
Choosing Motion Estimation algorithms 5.1 Introduction In this chapter, video streams are examined as objects with movements in time and not as a sequence of independent images. To determine movement in a sequence of consecutive frames, motion estimation algorithms are applied. Motion Estimation reduces the bit-rate signi cantly, at the cost of a large amount of processing power. The principles of motion estimation are discussed in Section 5.2. In order to apply the best motion estimation algorithm for a particular video conferencing situation, we have to examine the behaviour of the movement within a video sequence. This is discussed in Section 5.5.
5.2 Basics of Motion Estimation To detect motion in a video sequence, the current frame must be compared with the previous frame. Comparing each pixel individually is not ecient, because the compressed video stream would increase signi cantly with motion vector 1 information that is transmitted for each pixel. Calculation of the average movement of the whole frame, on the other hand, will neither improve compression as opposite movements cancel each other out. Therefore, the most basic form of motion estimation is done in the following way: First, a video frame is divided into a number of neighbouring macro-blocks of N N pixels. Then, every macro-block is compared with every N N block in the previous frame, called the reference frame . These blocks from the previous frame can have any oset for its x and y position (not necessarily a multiple of N ). When all possible N N blocks from the previous frame are compared with the macro-block, the block with the best match is used for motion compensation. The best-block match is determined by comparing each 1 A motion vector represents the relative movement of a (block of) pixel(s) compared with the matching (block of) pixel(s) in the previous frame, expressed in a 2-dimensional vector of pixels
33
5.3 Block-comparison alternatives
Choosing Motion Estimation algorithms
two pixels by a Mean Square Error (MSE)and adding them up. The block with the lowest summation of MSEs is considered as the match. Now, only the coordinates of this reference block and the dierence between this reference block and the macro-block are stored in the video stream. After all the macro-blocks have been processed, the frame is transmitted and the current frame is used as reference frame for the next frame to be encoded. If applied not eectively, motion estimation requires an enormous amount of CPU power at a relative small reduction of video stream size. The following calculations display the amount of processing power that is required: Comparison: To compare two blocks of pixels, the following calculation is used:
MSE (x; y; k; l) = N12
N X
i;j
=0
(F?1 (i + x; j + y ) ? F0 (i + k; j + l))2
Where:
{ F0 and F?1 are respectively the current and previous frame. { k; l are position of macro-block in current frame (multitude of N ) { x; y are position of reference block in previous frame
Estimation: The number of comparisons to compute depends on the frame size. For a frame size of U V pixels, where U and V are a multitude of N , (U ? N ) (V ? N ) comparisons must be made for a single macro-block
and thus: No of comparisons = (U ? N ) (V ? N ) . . . for a whole frame. If we apply this algorithm on a small Quarter-CIF format (176x144), we will need more than 5 billion pixel comparison operations (including the square operation) per frame without even including control operations (such as increasing the index to the image data) that are also required. U
U
N
N
5.3 Block-comparison alternatives 5.3.1 SAD
By reducing the computation time of the macro-block comparison, performance will increase signi cantly. A good alternative to the MSE (which is considered as the optimal block comparison method) is the Sum of Absolute Dierences (SAD). The SAD replaces the square operation by an absolute-value operation, which requires less computation power:
SAD(x; y; k; l) =
N X
i;j
=0
jF?1 (i + x; j + y) ? F0(i + k; j + l)j
All popular compression implementations found for MPEG, H.261 and H.263 use SAD to compare blocks. 34
Choosing Motion Estimation algorithms
5.3 Block-comparison alternatives
5.3.2 Summation of columns/rows (SOCR) A new comparison technique that I developed is called summation of columns/rows (SOCR). It compares two N N blocks by rst adding the sum of the columns of each block and adding the sum of the rows of each block. The arrays of columns-sums (of size N ) of the two blocks are compared similar to the SAD method, and this is also done for the two arrays of row-sums (of size N ). The resulting value is the sum of both SADs and this is used as an indication how well the two blocks match. By using this block-match method, the number of comparisons per block is reduced from N N to 2 N if we ignore the calculation of the sums of columns and rows themselves. Calculation of these sums need be done only once per frame when the sums are stored as given in Figure 5.1. + +
Horizontal summation Table
+
Image Vertical summation Table
Figure 5.1: SOCR block comparison using 8 8 blocks Because consecutive pixel values can cancel each other out, this method is less precise than the SAD method. However, since it is done both horizontally and vertically, the chance of this occuring is somewhat reduced. The original idea for this implementation was the reduced time to compare blocks. For more comparisons per macro-block, this method becomes more suitable as after a initial complete row and sum addition of the frame, it requires N=2 times less comparisons as with full block comparison. Results of this methods are discussed in Section 5.5.5. 35
5.4 Estimation Alternatives
Choosing Motion Estimation algorithms
5.4 Estimation Alternatives Searching the whole previous frame for a matching block is both cumbersome and inecient. The chance a block moved from the lower-right corner to the upper-left corner of the screen is slim, especially for high frame rates. Besides, luminance information is used only2 for comparison and a matched block that is on the other end of the screen does not imply that the chrominance information matches just as good.
5.4.1 Exhaustive search with search window
To make the block-search more realistic, a search window is introduced that limits the block-match algorithm. Reducing the search to 16 16 pixels in the neighbourhood reduces the complexity of the example given in Section 5.2 to approximately 6 million pixel comparison operations. For high frame rates and low movement, the search window can even be reduced down to 4 4, which are less than half million pixel comparisons per frame. This way of motion estimation is called exhaustive search, because all blocks (limited within the search window) are examined. Most video compression systems, both software and hardware, are using this strategy as it guarantees the best matching block to be found and thus the highest compression factor.
5.4.2 Logarithmic search Original position block in previous frame Search locations for step 1 Search locations for step 2 Best matched block
Figure 5.2: Logarithmic search algorithm for 9 9 blocks An alternative to the exhaustive search algorithm is the logarithmic search algorithm, see Figure 5.2. Instead of comparing every block in the search window, it only compares a number of blocks that are evenly distributed in the search area (e.g. one block in the center and eight blocks between the border of the search window and the block in the center). For the best-matching block, this 2 Luminance data is used only for two reasons: rst, the chrominance data is down-sampled which means that is has a lower resolution than luminance data; second, it requires less time to use only the luminance part.
36
Choosing Motion Estimation algorithms
5.4 Estimation Alternatives
step is repeated with the search window set around all pixels that are more close to this block than the other blocks. This step continues again until the best-matching block is found for only neighbouring blocks. This block found is used for the motion compensation. The advantage of a logarithmic-based search algorithm is the reduction of complexity. Instead of a complexity of O(N 2), with N the size of the search window, the complexity is reduced to O(log (N )). Even for a small search window of 9 9 as given in Figure 5.2, the number of required comparisons is reduced from 9 9 = 81 to 9 + 8 + 8 = 25. A drawback of the logarithmic search is that not all blocks are used for comparison. It is therefore possible that the algorithm gets \stuck" at a local minimum. Compression tests con rm this behaviour with a slight reduction of compression ratio compared with the exhaustive search, see further Section 5.5.
5.4.3 Half-pixel prediction A technique that has become popular in the last couple of years is the use of half-pixel prediction. Instead of comparing blocks with each other that are a multitude of one pixel apart, half-pixel prediction calculates pixel values between two pixels by using interpolation, see Figure 5.3. Original pixels
Interpolated pixels
Used for interpolation
Figure 5.3: Example of interpolation of 2 2 pixels Although half-pixel search increases the size of the search window, it improves prediction signi cantly. This technique in particular makes H.263 compression superior over H.261 (See Section 3.1). Half-pixel prediction can be combined with both exhaustive and logarithmic search. In the Telenor implementation of H.263, even a combination of full- and half-pixel prediction is used: First, an exhaustive full-pixel search is performed for blocks in the search window. After that, for the best matching block, a search is done among the eight neighbouring half-pixel interpolated blocks. 37
5.5 Motion in video sequences
Choosing Motion Estimation algorithms
5.4.4 Other techniques Motion estimation or block-matching algorithms have been one of the focus points in research on video compression. Some well-known alternatives to exhaustive and logarithmic search algorithms are the Three Step Search (TSS) [Koga81], the New Three Step Search (NTSS) [Renxiang94], Sub-block Motion-Field Estimation with Alternating Pixel-Decimation Patterns [Liu93] and the New Adaptive Pixel Decimation [Chan96]. The rst two techniques use the property of motion vectors that they are usually small (see also the next section). The third is only partly usable because it requires changes in the current video standard protocols, although the other technique discussed in this article may reduce exhaustive search by a factor of 4 by only comparing a quarter of the macro-block with the reference frame. The last technique rst selects the most representative pixels in a block and uses these pixels only to nd the best-matching block. Techniques that abandon the idea of using blocks in favor of using regions for the comparison become more and more popular[Yokoyama95][Girod96][Moura96]. In these algorithms, the size and shape of the regions are also determined and transmitted together with the motion and image (dierences) information. Especially for very low bit rate applications (down-to 9600 bit/s) these techniques win popularity, although it posses two main disadvantages: they require often even more processing power than in the traditional approach, because regions must rst be detected; and they need to transmit size and shape information as well. Another disadvantage is the incompatibility with current standards, which makes introduction of these ideas more dicult.
5.5 Motion in video sequences 5.5.1 Motion vectors Finding a matching block in a frame is computational expensive. To reduce this valuable processing time, the motion estimation algorithm must limit it search to some maximum movement. The frame-rate of the source video also determines the movement in the video: the average movement between two frames for a video sequence with a high frame-rate is less than for a video sequence with a low frame-rate. To examine the movement in a video sequence, a full (exhaustive) halfpixel search algorithm with a search window of [?15:5::15:5] is used. The video source is a video recording of a person with a steady background. This is a sequence of 150 frames with a frame rate of 20 frames per second ( 7.5 seconds). Compared to the miss America sequence, this video has more movement and is considered more realistic than the Miss America sequence3 . Based on these results, estimations are made on the movements in similar sequences. 3 The Miss America sequence, on the other hand, is more used in the science community and makes it easier to compare scienti c data
38
Choosing Motion Estimation algorithms
5.5 Motion in video sequences
Figure 5.4 shows the average X and Y motion vectors per frame. A positive value means the block in the previous frame with the best prediction was to the right of or under the current block, while a negative value means that the best-matching block was found in the previous frame to the left or above the position of the current block. 2
average X movement average Y movement
1.5 1 0.5 0 -0.5 -1 -1.5 -2
0
20
40
60
80
100
120
140
Figure 5.4: Average length of X and Y motion vectors for a video sequence. If these gures are compared with the real video sequence, movements of the head to the left and right are detected by the peaks in the graph of the average X value. The absolute value of the motion vectors that have the highest value for a frame are given in Figure 5.5. For both the X and Y lines, these data give a distorted view on the behaviour in the video sequence. This is caused by a number of predictions in each frame that are uncorrelated with the other motion vectors found in the frame. In this source video sequence, it is unlikely the maximum motion vector is near 15 pixels between each frame. To explain this phenomenon, a matrix of motion vectors is taken from an arbitrary frame in the encoding process: (0.0, 0.0) (0.0, 0.0) (0.0, 0.0) (0.0, 0.0) (0.0,14.5) (0.0, 0.0) (0.0, 0.0) (0.0, 0.0) (0.0, 0.0)
(0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0)
( 0.0,0.0) ( 0.0,0.0) (-1.0,1.0) (-1.0,1.0) (-0.5,0.5) ( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0)
( 0.0,0.0) (-1.5,1.5) (-1.5,1.5) (-1.0,1.0) (-1.0,0.5) (-1.0,1.0) ( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0)
(-1.5,0.0) (-1.5,0.5) (-1.5,1.0) (-1.5,1.0) (-1.5,1.0) (-1.0,1.0) (-1.0,1.5) ( 0.0,0.0) ( 0.0,0.0)
(-1.5,0.0) (-1.5,0.5) (-1.0,0.5) (-1.0,0.5) (-1.5,1.0) (-1.0,0.5) (-1.0,1.5) (-0.5,0.5) ( 0.0,0.0)
( 0.0,0.0) (-1.0,0.5) (-1.0,0.5) (-1.0,0.0) (-0.5,0.0) (-1.0,0.5) (-0.5,0.5) ( 0.0,0.0) ( 0.0,0.0)
( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0) (-0.5,0.5) ( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0) ( 0.0,0.0)
(0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0)
(0.0, (0.0, (0.0, (0.0, (0.0, (0.0, (0.0, (0.0, (0.0,
0.0) 0.0) 0.0) 0.0) 0.0) 0.0) 0.0) 0.0) 0.0)
(0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0) (0.0,0.0)
In this rst column, fth row, a motion vector occurs that is out of line of most other vectors. Examining the rest of the motion vectors show a maximum motion of 1.5 pixels, while this distorted value makes this frame have a maximum motion of 14.5 . These distortions appear in about every frame 39
5.5 Motion in video sequences
Choosing Motion Estimation algorithms
absolute maximum X movement absolute maximum Y movement
16 14 12 10 8 6 4 2 0
0
20
40
60
80
100
120
140
Figure 5.5: Maximum length found per frame of X and Y motion vectors for a video sequence. on arbitrary places (not only the border motion vectors!). Although thorough research is not done on these vectors, it is probably caused by tiny distortions in the camera which are enhanced by the nature of motion estimation: a block that matches the previous block only a bit more is selected instead of the \real" reference block. The selection of a SAD instead of a MSE might also in uence the wrong choice of the motion estimation algorithm.
5.5.2 Frame-rate and motion vectors In Figure 5.6 and Figure 5.7, the length of the motion vectors is displayed against the fraction of the total number of motion vectors that have this length. For a frame rate of 30, almost all motion vectors are within 1 pixel, and searching for motion vectors that are longer is inecient.
5.5.3 Size of the search window To determine the best search window size for a video stream, the quality, time to encode, compression size, and the applied search algorithm must be compared. For the Miss America sequence at 15 fps, Table 5.1 displays the result of dierent encodings. This is also done for the Miss America sequence at 7.5 fps, and those results are found in Table 5.2. These results show that quality is not in uenced signi cantly with dierent search windows or search algorithms. As expected, the exhaustive search method performs a lot worse with larger search windows, while the video stream output is only reduced marginally. 40
Choosing Motion Estimation algorithms
0.9 0.8 0.7 0.6 Fraction of MV 0.5 with this size 0.4 (full pixels) 0.3 0.2 0.1 0
5.5 Motion in video sequences
Miss Am. 7.5 fps Miss Am. 15 fps Miss Am. 30 fps
0
1
2 3 4 Length of Motion Vector (full pixels)
5
6
Figure 5.6: Size of MV against fraction of MVs with this size Table 5.1: Search window size for Miss America video stream at 15 fps
Algorithm & Output Quality Total time time per Search window size Size (Y / Cb / Cr) to encode frame Exhaustive [0 0] 32075 37.3 / 38.1 / 37.2 8.2s 109ms Exhaustive [?1 1] 17023 37.9 / 38.5 / 37.7 9.8s 131ms Exhaustive [?3 3] 14871 38.1 / 38.4 / 37.7 16.4s 219ms Exhaustive [?6 6] 14719 38.2 / 38.4 / 37.7 33.4s 445ms Logarithmic [0 0] 32075 37.3 / 38.1 / 37.2 8.2s 109ms Logarithmic [?1 1] 19336 37.3 / 38.5 / 37.7 8.8s 117ms Logarithmic [?3 3] 18312 37.6 / 38.5 / 37.7 9.0s 120ms Logarithmic [?6 6] 17484 37.9 / 38.4 / 37.5 9.5s 127ms Logarithmic [?12 12] 19310 37.5 / 38.4 / 37.6 9.3 124ms ::
:: :: ::
::
:: :: ::
::
The logarithmic algorithm is more exible. Increasing the search window costs only a fraction of the computational costs. It is also shown that if we compare the result without motion estimation (a search window of zero) and with motion estimation, the output video stream is approximately halved with only 5 ? 15% extra processing time required. As with exhaustive search, it is advisable to make the search window not too large: although the calculation costs even decrease, the size of the output stream increases. This phenomenon is explained by the nature of the logarithmic search: the larger the search area, the larger the chance that the algorithm is trapped in a local minimum. The search time is also reduced by comparison with blocks that do not match very good: during the search, a block comparison is aborted if the preliminary SAD value is already higher than a previously calculated SAD (such as the SAD at position 0,0 which is always calculated rst). Comparison of the 7.5 fps and 15 fps video streams show the decreased encoding time for sequences with higher frame rates. This is caused by the 41
5.5 Motion in video sequences
Choosing Motion Estimation algorithms
100
Miss Am. 7.5 fps Miss Am. 15 fps Miss Am. 30 fps
80 Motion Vectors 60 within this size (percentage)40 20 0
0
1
2 3 4 5 Length of Motion Vector (full pixels)
6
Figure 5.7: Size of MV against cumulative percentage of MVs within this size Table 5.2: Search window size for Miss America video stream at 7.5 fps
Algorithm & Output Quality Total time time per Search window size Size (Y / Cb / Cr) to encode frame Exhaustive [0 0] 23113 37.4 / 38.2 / 37.2 4.4s 119ms Exhaustive [?1 1] 14976 37.6 / 38.3 / 37.6 5.0s 135ms Exhaustive [?2 2] 11843 37.9 / 38.4 / 37.6 6.5s 176ms Exhaustive [?3 3] 10482 38.0 / 38.5 / 37.7 8.4s 227ms Exhaustive [?6 6] 9907 38.1 / 38.5 / 37.7 17.3s 468ms Logarithmic [0 0] 23113 37.4 / 38.2 / 37.2 4.4s 119ms Logarithmic [?1 1] 15873 37.1 / 38.4 / 37.6 4.5s 122ms Logarithmic [?3 3] 12556 37.6 / 38.4 / 37.7 4.7s 127ms Logarithmic [?6 6] 12170 37.8 / 38.4 / 37.5 4.9s 132ms Logarithmic [?12 12] 13881 37.5 / 38.3 / 37.5 4.8 130ms ::
:: :: :: ::
::
:: :: ::
::
(0,0) motion vector that occurs more frequent in a higher frame rate sequence than in a lower frame rate sequence. For the same reason as is mentioned in the previous paragraph, encoding for these higher frame rates takes less time.
5.5.4 Real-Time and frame rate In a Real-Time situation, a scheduler determines the maximum time that is available to encode a frame. If other processes use more processing time or the the previous frame took more time than average to encode, it may be impossible to encode a frame at the same high frame rate, and one or more frames must be skipped. This requires a adapted encoding strategy for the new frame rate (that is, if one of every two frames is skipped constantly, half the frame rate as before). The most practical solution for dealing with a lower frame rates is to increase 42
Choosing Motion Estimation algorithms
5.5 Motion in video sequences
the size of the search window to obtain an ecient compression ratio for the new circumstances. For an exhaustive algorithm, it requires doubling the width and height of the search window, which is computational expensive. Where for 15 fps, a search window of [?1::1] seems most ecient4 , for 7.5 fps, a search window of at least [?2::2] is required. With these search windows, the time it takes to encode a single frame at 7.5 fps takes 35% more time than encoding a single 15 fps frame. If we compare the CPU usage it takes to compress a stream at 15 fps compared with a stream at 7.5 fps, compression of the 15 fps stream takes only 50% more CPU usage compared to compression of the 7.5 fps stream. If we compare the logarithmic algorithm with the exhaustive algorithm, then the logarithmic algorithm performs better for real-time conditions. Compression of one frame at 7.5 fps, only takes 10% more CPU time than compression of one frame at 15 fps. So, the CPU time required to compress a 15 fps video is here 80% more CPU usage compared to compression of a 7.5 fps video. In summary, compression of low-frame rate videos requires less CPU usage than high-frame rate videos (as expected), but this reduction is less than the ratio between the number of frames that are compressed.
5.5.5 Results on SOCR block comparison In Table 5.3, the results are given for encoding of the Miss America sequence with the SOCR block comparison algorithm, see Section 5.3.2. Table 5.3: SOCR block comparison on Miss America sequence at 15 fps
Algorithm & Output Quality Search window size Size (Y / Cb / Cr) Exhaustive [0 0] 41997 37.2 / 38.2 / 37.3 Exhaustive [?1 1] 19686 37.6 / 38.4 / 37.6 Exhaustive [?3 3] 17233 37.8 / 38.1 / 37.7 Exhaustive [?6 6] 17291 37.9 / 38.4 / 37.7 Logarithmic [0 0] 31997 37.3 / 38.2 / 37.2 Logarithmic [?1 1] 22693 36.8 / 38.4 / 37.6 Logarithmic [?3 3] 20921 37.3 / 38.3 / 37.6 Logarithmic [?6 6] 20199 37.7 / 38.3 / 37.5 Logarithmic [?12 12] 22998 37.1 / 38.3 / 37.5 ::
:: :: ::
::
:: :: ::
::
For exhaustive and logarithmic search, the SOCR method delivers a video stream that is between 15%{20% larger than with a full block comparison. The quality on the other hand is only 0.3 dB lower. The performance of the SOCR method depends heavily on the search parameters. As the initial build-up of the SOCR table causes most time, this method is inecient for small search windows. The method is also not very suited for half-pixel prediction, as it increases the processing time by eight to build up the initial summation table. For instance, in case of a QCIF encoding, it takes nearly 3.5 times as much time 4 This captures 95% of the real motion vectors, and thus gives a good compression, see Figure 5.7
43
5.6 Quality of a Real-Time video stream
Choosing Motion Estimation algorithms
to calculate the summation table than to calculate a luminance interpolation 5. Situation where the SOCR comparison might work better are circumstances where half-pixel is not used or where multiple frames refer to a single frame. It the latter case, a summation table need to be built only once. As these circumstances apply to MPEG encoding, this algorithm might improve the speed of this encoding technique signi cantly. As we concentrate on real-time encoding of DVCvideos, this is left for further study.
5.6 Quality of a Real-Time video stream In the area of video compression, the quality of the processed video stream is used as an objective indication for the accomplishments of a video codec. This quality is generally measured by a Signal To Noise (SNR) ratio, that calculates the MSE of pixels from the original frame and the reconstructed frame after compression and subsequent decompression. The direction of research in this area is to use more and more complex and sophisticated techniques to increase this quality at the expense of demanding computing power. The idea that is generally accepted in video compression research is that the next generation of computers will easily run their algorithms. Although this is true in some extent, other developments such as increased demands for larger video windows and higher resolutions may put a constraint the availability of more computing power. Current (software) video encoders have diculty compressing data (near) Real-Time. The result is a video stream that achieves not more than 4 frames per second, without increasing the size of the video stream considerable. Although the quality of each frame in this sequence, measured by a SNR algorithm, is high, the combination of the frame together result in a shocky, non-smooth video stream that is not expressed in the individual SNR number per frame. To express the achieved frames per second of the encoder into the quality measurement, the SNR of the non-encoded frames should also be included into the quality measurement. To do this, the non-encoded (original) frames should be compared to previously encoded and decoded frame, as this frame is displayed instead of the original frame. For instance, if a video stream is recorded with 20 fps and only 5 frames per second are encoded, the quality of the sequence is given by: SNRtotal = ref1:fps ref fps X ref :fps SNR(Frameorig(i); Framerecon(i ? 1)) encodefps =1 20 1 X SNR(Frameorig(i); Framerecon(i ? 1)) 20 = 20 5 =1 :
i
i
In Table 5.4, the signal-to-noise ratio is calculated using this method for 5
This has been measured on a DEC Alpha workstation
44
Choosing Motion Estimation algorithms
5.7 Summary
two Miss America sequences at 15 and 7.5 fps. The compression used are a full window search area of [?1::1] with exhaustive search, which is the same as given in row two of Table 5.2 and Table 5.1. Table 5.4: Quality of video stream compared with real frame rate of 30.0 fps Encoding frame rate (fps) 7.5 15 30
Quality of Quality of encoded frames all frames Y / Cb /Cr (dB) Y / Cb / Cr (dB) 37.6 / 38.3 / 37.6 34.8 / 38.1 / 36.9 37.9 / 38.5 / 37.7 36.9 / 38.4 / 37.6 38.0 / 38.5 / 37.8 38.0 /38.5 / 37.8
As expected, the SNR reduces signi cantly for lower frame-rates, especially for the 7.5 fps video. To keep quality of video-compression as high as possible, video streams must be compressed with high frame-rates. Increasing the speed of the video compression will therefore also increases video quality as higher frame-rates are attained.
5.7 Summary Motion Estimation, the technique to reduce temporal redundancy between successive video frames, requires large amount of computing power. To reduce this demand for computing resources, algorithms and parameters must be selected to make motion estimation usable in Real-Time video encoding. Analyses of video data give a good indication of ecient parameter settings. Choosing a search algorithm partly depends on the parameter settings. The exhaustive algorithm is only practically useful for small search windows, while the logarithmic algorithm achieves a little less compression than its exhaustive counterpart, but is better scalable for dierent search window sizes. In Chapter 7, a number of techniques are discussed to reduce the number of macro-blocks that are encoded. As Motion Estimation takes a signi cant amount of time, techniques used here are more applicable by combining them with the techniques found in Chapter 7.
45
5.7 Summary
Choosing Motion Estimation algorithms
46
Chapter 6
Choosing DCT components and Quantization levels 6.1 Introduction The main algorithm found in image and video compression is the Discrete Cosine Transformation (DCT), see chapter 2.2.3. It consists of a transformation algorithm, called the Forward Discrete Cosine Transform (FDCT) and its inverse counterpart, the Inverse Discrete Cosine Transform (IDCT). In some articles, the former is also called \DCT", which is generally accepted. The FDCT operation is a basis function and it is applied on raw image matrix of N N pixels for every color component. The result is a transformed matrix of the same dimension. Application of the IDCT on this matrix results in the original matrix before transformation. If the FDCT is applied on image data, high energy values are collected in the upper-left coordinates of the matrix. The DCT is the most appropriate transformation for image data. Other transformations are the Discrete Fourier Transform (DFT), the Discrete Sinus Transform (DST) and the Karhunen-Loeve Transform (KLT) (See [Bhaskaran95a]). Both DFT and DST do not compact high energy data in a small number of coordinates as much as the DCT. The KLT does a better job than the DCT, but is, unfortunately, image-dependent. This makes the DCT the best candidate for transformation. Current standards (MPEG, H.263) and development of fast DCT algorithms con rm the leading role of DCT in image compression.
6.2 Implementations of DCTs Telenor's implementation of H.263 contains one FDCT and two IDCT transformations. The standard FDCT rst does a horizontal pass of the coordinates and after that a vertical pass and is faster than a straight-forward implementation. The standard IDCT is a straight-forward, accurate but slow implementation. The Telenor implementation also includes a fast IDCT, here later discussed as \FAST IDCT". In additions to these implementations, one FDCT and one IDCT are added from a dierent source for comparison. The added FDCT 47
6.3 Comparison of DCT algorithms
Choosing DCT components and Quantization levels
implementation was taken from the MPEG-1 encoder from Berkeley and will be addressed as \Fast DCT". The added IDCT operation was taken from the H.263 decoder and is also known as the Chen-Wang algorithm. This IDCT algorithm will be addressed as \Very fast IDCT" in this chapter.
6.3 Comparison of DCT algorithms Table 6.1: SNR comparison of DCT algorithms.
Y / Cb / Cr SNR (dB) Standard IDCT Fast IDCT Very fast IDCT Standard DCT 38.0 / 38.4 / 37.7 38.0 / 38.4 / 37.7 38.1 / 38.4 / 37.7 Fast DCT 38.2 / 38.5 / 37.8 38.2 / 38.5 / 37.8 38.2 / 38.5 / 37.8
Table 6.2: Performance comparison of DCT algorithms.
time for DCTs Standard IDCT Fast IDCT Very fast IDCT Standard DCT 8.6s 6.5s 5.35s Fast DCT 5.61s 3.68s 2.47s
Table 6.3: Compression comparison of DCT algorithms.
Output size Standard IDCT Fast IDCT Very fast IDCT Standard DCT 21708 bytes 21708 bytes 21718 bytes Fast DCT 22331 bytes 22331 bytes 22255 bytes
To compare the dierent DCT algorithms, a number of tests are performed that measure the quality, the time and the in uence on the compression ratio for a combination of dierent FDCT and IDCT implementations. The results are given in Table 6.1 through Table 6.3. The source video used is the Miss America sequence of 150 frames at 30 fps. Compression parameters used are half-pixel exhaustive search with a search window of +/- 3 full pixels with a quantization factor of 8. Table 6.1 shows that quality is hardly aected by dierent implementations. In fact, the Fast DCT operations even performs slightly better for quality; this is probably caused by some rounding and quantization, as the dierence in quality is insigni cant. The fast IDCT operation gives the same results as the standard IDCT operation. Due to the realistic quantization factor of 8, quality is not sacri ced for speed. Table 6.2 gives a good indication of the dierent (real) transform speeds of the algorithms. The time given is the time that the compression algorithm spending for FDCT plus IDCT transformation. Using the combination of Fast DCT and Very fast IDCT instead of standard DCT/IDCT gives an improvement of almost 3.5 times for (I)DCT transformations. Finally, Table 6.3 shows the size of the compressed video stream for the dierent implementations. It shows that standard IDCT and fast IDCT give the same output size and examination of both streams show that they are exactly the same. Using the fast DCT and very fast IDCT combination only gives a 2.5 percent video stream size penalty over Standard DCT/IDCT. These tests show that using the Fast DCT/Very fast IDCT combination, 48
Choosing DCT components and Quantization levels
6.4 Quantization
a large performance improvement can be attained, while sacri cing no quality loss and hardly compromising on the compression ratio.
6.4 Quantization Quantization is the step in image compression that causes both high compression ratios and some loss of data. In short, quantization reduces the resolution of data determined by the quantization factor. Although some data may be lost during DCT (due to fast DCT algorithms and limited storage for transformed coordinates), the main consideration between compression and loss is done here. Too much quantization causes an heavily distorted video sequence, while too little quantization causes a relative large amount of data for the kind of video sequence. Blocking artifacts in a decoded video stream are probably an indication of too much quantization. A very large video stream for a sequence that is not signi cant better than a video stream with a higher quantization factor indicates too small quantization. Table 6.4: Quantization versus Quality for Miss America sequence Quantization factor 1 2 4 6 8 12 16
Compression ratio 4.2:1 21.3:1 66.8:1 124.8:1 191.7:1 345.3:1 497.8:1
SNR for Y / Cb / Cr (dB) 48.5 / 48.5 / 48.5 44.8 / 42.5 / 44.3 41.1 / 40.6 / 41.2 38.8 / 39.4 / 39.2 37.4 / 38.4 / 37.7 35.7 / 37.4 / 35.9 34.7 / 37.1 / 35.1
Comparison of dierent quantization factors are given in Table 6.4 and Figure 6.4. The tests are performed on 150 frames encoding of the Miss Americasequence. Higher compression factors yield smaller video streams but also reduce quality. The encoding time for each of these sequences are equal, except for the entropy encoding that takes longer for larger output video streams. Comparison of the pro le outputs of the encoding with quantization factor of 1 and 16 shows that in the latter case encoding takes 57 percent longer due to extra entropy encoding. In practical video conferencing situations, however, quantization will vary between 4 and 12. Except for the amount of data that must be encoded (and decoded), the quantization level does not in uence the speed of the encoder (and decoder). If network bandwidth is not an issue (e.g. bandwidth is otherwise not used) , the lowest possible quantization factor that will ll up the bandwidth may be chosen to optimize bandwidth usage and the resulting video quality. 49
6.5 Summary
Choosing DCT components and Quantization levels
50
SNR Y SNR Cb SNR Cr
48 46 Signal 44 To Noise 42 Ratio (dB) 40 38 36 34
0
50
100 150 200 250 300 350 400 450 500 Compression ratio
Figure 6.1: Plot of compression ratio versus SNR (dB)
6.4.1 Scaled DCT
One of the optimizations that is used in video compression is the combination of DCT and quantization (or IDCT and de-quantization). In these so-called scaled-(FDCT) and scaled-(IDCT) operations, the quantization is done during the transformation. This is possible because, for example for a FDCT, delivering a scaled version of the actual frequency coordinates requires less operations. After this scaled operations, the quantization factor must be adjusted for the scaled-DCT. Although a version of a scaled-DCT was not implemented for testing, an implementation of a decoder using a scaled-IDCT is given in [Bhaskaran95b]. The IDCT used here is the Feig-Winograd algorithm, discussed in [Feig92].
6.5 Summary The DCT is the most suitable transformation available for video compression. A number of dierent implementations are available and tests show that the fast alternatives do not sacri ce quality for speed. Variation of quantization directly in uences quality and size of the video stream. It is therefore the instrument to ne-tune and adapt compression to user or network constraints. Nevertheless, quantization must kept between practical limits to prevent an explosion of network bandwidth requirements or strong degradation of quality.
50
Chapter 7
Choosing advanced video techniques 7.1 Introduction The current video compression methods put enormous constraints on the CPU. To make (Real-Time) video conferencing possible now, other techniques must be used to reduce the time to encode frames. A way to do this is to give higher priority to more important parts of the screen and lower priority to unimportant parts like background. One of these techniques for selecting \important parts" is called skin detection. This technique is also used in the pegasus Face Tracker with the purpose to locate persons in a video. Another technique, which seems more promising for DVC, is background substraction. A nal technique that is discussed here is a fast method to detect blocks that have changed since the previous frame, prior to the actual encoding phase. The arguments for using a priority method to determine the important parts of the screen are appealing: More eort and thus CPU power can be allocated to encode high important areas, while low eort is used to encode the other areas. Unfortunately, the H.263 codec (and many others like MPEG) is only partially adaptable to give areas higher and lower priority. DCT and quantization cannot be scaled for using more or less CPU time to encode lower or higher priority areas. For instance, the quantization is not more than a division and this operation take the same time independent of the denominator. For reducing network bandwidth, on the other hand, quantization is highly suitable for reducing bandwidth of low-priority areas. The only really scalable part of the coder is the motion estimation algorithm. Reducing the motion estimation search for low priority areas will mainly increase network bandwidth usage, and a little reduction of CPU usage. Combination of this technique with increased quantization might reduce the usage of extra bandwidth, but CPU usage is still too high. A better approach that both reduces CPU time and does not increase network bandwidth is to transmit only the parts of the screen that have changed. Although the H.263 encoder does in essence the same, supplying the encoder with only the changed blocks reduces all motion estimation, DCT and quanti51
7.2 Skin selection
Choosing advanced video techniques
zation steps that are otherwise performed on these blocks before detecting that they really are unchanged.
7.2 Skin selection Zwaal [Zwaal95] discusses the use of the red color component to track skin color. He uses both the RGB color space as the HSV color space to determine if a pixel has skin color. This skin detection method works on people of all color and races and thus makes it even politically applicable. The drawback of this method is that all objects that are reddish are marked \skin-color", including background objects. Another disadvantage of the method is the usage of two color spaces, which make color space conversion functions a (CPU) expensive part of the implementation.
7.2.1 Implementation of skin selection under H.263 The use of two dierent color spaces is inecient for already mentioned reasons. Using this skin selection in H.263 makes the algorithm even more unsuited as H.263 works with frames using the YUV color space. To reduce color conversion, Zwaal's formulas are converted to the YUV format. \skin color" is a boolean. skin color = (R > G) ^ (R > B ) ^ (hue < 0:234) $ skin color = (2 :0781 V > ?0:3438 U ) ^ (1:375 V > 1:734 U ) ^ ( 0:7031 V +6 1:3906 U < 0:234) As most multiplication constants are the same or a sum of another pair of constants, using multiplication tables makes the implementation of these formulas ecient.
7.2.2 Very fast skin selection implementation Although these formulas show the ratio between V and U as a requirement for skin selection, experiments have shown that examination of the V component itself also selects skin very well. This formula is given below: skin color = (V > threshold )
(7.1)
The threshold is a constant that depends on the lightning in the video sequence and settings of the camera and must be determined in advance. This implementation requires only one simple assembler instruction per pixel. For applications where YUV frames are used as input, this is the most ecient skin detection routine available. 52
Choosing advanced video techniques
7.3 Background substraction
7.3 Background substraction The goal of background substraction is to eliminate processing redundant background information at the encoder. The advantage of this approach is the reduction of data that is fed into the compressor, and thus the reduction compression time. Camera
Background Substraction
Average Background
Comparison
H.263 Encoder
Network Display
H.263 Decoder
Figure 7.1: Overview of background substraction The background substraction method (See Figure 7.1) rst takes a snapshot of the view of the current camera position, without the presence of moving objects. It averages a number of successive frames so this reference background frame is free of camera noise. This step must be repeated for every new camera position1 or for large background changes. After this initialization phase, background substraction may begin. Every captured frame is divided in macro-blocks that are compressed independently by the encoder. For every macro-block, the number of pixels that dier from the background more than a certain threshold are counted. Only the macro-blocks with more than a predetermined number of foreground pixels are encoded and transmitted to the receiver, to prevent camera noise to be transmitted. The resulting frames are processed by the compression unit and transmitted to the receiver. At the receiver, the encoded macro-blocks of a frame are decompressed and combined with macro-blocks that have not changed. Advantages of this approach compared with skin selection is that any difference with the background is detected, including shadows, clothes of persons, and objects that persons are holding. The background substraction method works well in an environment which has suitable lightning. But in areas where the person is close to the camera and blocks the light from the background, the background substraction method behaves considerable worse. Logic in the camera that changes the output brightness for darker circumstances is also unsuitable for this method. Further analysis on these issues might improve background substraction for these kind of circumstances, but this is left for further study. 1
In a DVC setting, this will occur not regularly
53
7.4 Change detection
Choosing advanced video techniques
7.3.1 Background replacement (Chroma-key with virtual blue room) A nice feature of background substraction is the ability to replace the original background by another background that is more interesting. In the video production world, this is known as chroma-keying. This technique only works when the foreground object diers from the background. If, for instance, the object has some black pixels in it and the background also has these black pixel on the same place, the substraction assumes these pixels are \background" and not incorporated in the foreground-only video.
7.4 Change detection The nal technique discussed here is a variant of background substraction. Instead of using a background frame to select the blocks that have changed, the previous frame is used. In this way, only blocks that have changed to a certain degree are transmitted. An implementation of this idea is also used in the H.261 implementation of INRIA [Turletti93]. The advantage of this technique is that a background frame is no longer needed, and the algorithm becomes more suitable for small movements of the camera itself. The algorithm divides every macro-block into sixteen 4 4 blocks of luminance pixels. A circular algorithm picks one position in this 4 4 block and for all sixteen blocks checks whether the pixel has changed compared to the reconstructed macro-block. If the number of pixels that changed is beyond a certain threshold, the macro-block must be encoded. Macro-Block
DCT
Quantization
De-quantization
IDCT
Variable Length Coder
Zero?
Transmit/Skip
Figure 7.2: Steps for encoding of one Macro-Block Although detection of change is already done in the encoder, this is quite cumbersome, see Figure 7.2: First, a motion estimation is made. Then, the difference image is DCT transformed and quantized, followed by a de-quantization and an IDCT. Finally, when the coordinates are all zero, the block is considered as \not changed", and no information is transmitted. When change detection is applied, a short-cut is taken for blocks that do not change, see Figure 7.3: First, the block is immediately checked with the block at the same position in the previous frame. If the comparison shows that 54
Choosing advanced video techniques
Macro-Block
7.5 Summary
Compare with previous MB in frame
Equal?
Transmit/Skip
Macro-Block encoder
Figure 7.3: Macro-block encoding with block change detection change is below a threshold, the block will not be encoded, otherwise the block will pass the usual encoding steps. As most blocks do not change, the extra time required to calculate the changed blocks is more than compensated by the extra time it would cost to process these blocks by the encoder.
7.5 Summary A number of techniques are discussed that reduce the input for the encoder. This makes it possible to do faster frame rate encoding, while maintaining the same quality for important areas of the screen. Skin detection is based on the detection of human skin color and its implementation is fast as the YUV color space is used. Background substraction requires a snapshot of the background, but performs better on foreground objects than skin detection. Finally, change detection does a brief comparison of the current frame with the previous frame to detection blocks that have changed.
55
7.5 Summary
Choosing advanced video techniques
56
Chapter 8
Compression techniques vs. network technology 8.1 Introduction Video compression is the key element in a long-distance video conferencing system. But compression cannot work properly without interaction with its environment other then the user: the network. The network determines what compression is required. It introduces network latency next to the latency caused by the time to compress a frame and it puts a constraint on both average and peak bandwidth. To cope with these circumstances, an interaction must exist between the encoder and the network protocol. This interaction depends on the type of network that is used. The network determines what bandwidth is available and this information is used for compression of the next frame. The encoder determines the time it takes to encode a frame and this aects both the network bandwidth that is used and the compression of the next frame.
8.2 Classi cation of networks The traditional classi cation for computer networks is based on locality: a network connecting computers in an oce or building is called a Local Area Network (LAN), while a network connecting computers between cities, countries or continents is called a Wide Area Network (WAN). LANs are mainly fast connections, separated physically from the outside world, while WANs are slower networks that are accessible world-wide. But as more and more computer become connected globally, this classi cation diminishes, and with wide area fast gigabit connections showing up in the next couple of years, this classi cation will no longer be appropriate. To classify a network on its performance, the network throughput can be used. The throughput determines the maximum capacity of the network. However, this property does not fully determine the actual performance, as network latency speci es the time it takes to transport information from one end to the other. For instance, multimedia applications require a network that has both 57
8.3 Computer networks
Compression techniques vs. network technology
a high throughput, but also a low latency to make interactive communication possible. Therefore, the combination of network throughput and latency gives a good indication of the performance of a network. Another classi cation that is often used is the separation between connection-oriented and connection-less networks. In the rst case, a network session is rst set-up, after which data is transmitted. When the network connection is no longer needed, the session is closed. In connection-less networks, no connections are set up, but information is directly transmitted to the destination. In these networks, every piece of information that is transmitted (called a packet ) contains the destination address, and a route to this destination must be found per packet by the network.
8.3 Computer networks 8.3.1 POTS
A POTS is the traditional analog telephone network. A POTS is a connectionoriented network as the connection must rst be set up by dialing a telephone number. The original application for this network is voice-communication only, and this obsolete network in not intended for digital communication. Nevertheless, by using modems speeds can be obtained around 28.8 Kbps. For worse quality connections, this speed is reduced to 14.4 Kbps or even less. The latency of the network varies from low for local connection up to approximately one second for long-distance satellite connections. Although the performance of the network is not good, its wide, global application makes it available for all over the world.
8.3.2 ISDN The ISDN is designed for telecommunication of computers, faxes and telephones. ISDN Comes in two connection types: The Basic Rate Interface (BRI) delivers two bidirectional 64 kilobit/s channels and one 16 kilobit/s control channel. The Primary Rate Interface (PRI) delivers 23 64 kilobit/s channels and one 64 kilobit/s control channel. Like POTS, ISDN is connection-oriented. Although ISDN is not as widespread as POTS, it is already available in large areas in Europe. Based on its network capacity, ISDN is a suitable infrastructure for video conferencing applications.
8.3.3 Internet The largest and most widely used network for computer communication is Internet. Internet is actually a collection of interconnected networks, that developed in the 90s from a research network for computer scientists to a general public network. On top of its connection less IP protocol, it supports both unreliable connection-oriented UDP/IP connections as reliable connection-oriented TCP/IP connections. The throughput and latency of the network varies depending on the locations and the route between two sites, as on the time of 58
Compression techniques vs. network technology
8.4 Compression and Network
day (load on this part of network). The reliable TCP/IP does not give guarantees when a packet will arrive, only that a packet will arrive and in order of transmission. Unless average network performance is not improved signi cantly and protocols that support reservation of resources are introduced, high-quality video conferencing is not feasible on Internet. But because of its wide-spread usage, Internet is a ne test-bed for video conferencing applications, such as vic [McCanne95] and CU-See-Me.
8.3.4 ATM
In the science community, ATM receives a lot of attention. ATM is based on transmission of 53 bytes sized cells and is connection-oriented. Large ATM networks are being set up, mainly for testing purposes. ATM makes it possible to guarantee certain bandwidth and makes it ideal for a number of applications that require these guarantees. Currently, there is not \one ATM Network", but this new technology makes it possible to do high quality communications over long distances. In ATM networks, dierent services classes are available for dierent types of data: Constant Bit Rate (CBR), Real-Time Variable Bit Rate (Real-Time VBR), non-Real Time Variable Bit Rate (non-real time VBR), Unspeci ed Bit Rate (UBR) and Available Bit Rate (ABR). These service classes in uence both network bandwidth and latency characteristics of the communication link. For video compression, the Real-Time VBR is most appropriate. In the pegasus LAB, a number of ATM devices are available, such as workstation interface cards, an ATM network switch, video grabbers (AVAs) which send video over ATM, and video displayers (ATVs) which show video streams from an ATM network.
8.4 Compression and Network As discussed in Section 1.4, transmission of video depends on computing power, network bandwidth and latency all together. If network bandwidth is super uous, video compression might not at all be needed. On the other hand, high latency might make interactive video conferencing practically impossible. A number of dierent compression strategies are available. Depending on the mix of network performance and computing power, the right compression strategy is available: No compression : If network bandwidth is very high and super uous and/or computing power is sparse, video might be transported as raw data. Intra compression : In intra encoding, frames are compressed individually. A good example of intra encoding is motion-JPEG. Some video hardware support this encoding and decoding, which reduces the video stream signi cantly without much processing power used by the CPU. Intra coding is lossy and the quantization factor determining the quality and compression ratio can be varied. 59
8.5 Choice of compression
Compression techniques vs. network technology
Inter compression without motion estimation : This encoding strategy compares the current frame with the previous frame. The dierence between these frame are compressed. This encoder may require some Intra compression frame encodings, for instance, for the rst frame or for blocks that have changed a lot. Inter compression with motion estimation : The most drastic compression strategy is the use of motion estimation within inter compression. Results show (See Section 5.5) better compression for sequences with high motion or low frame rates.
8.5 Choice of compression Examination of the (I)DCT, quantization and motion estimation techniques give a good understanding of the dierent possibilities for compression. If a QCIF format is used as video input, the following results are given for an H.263 encoder 1 : Table 8.1: Comparison between dierent compression methods for a QCIF frame Method (Average) size of time to frame (bytes) encode No compression 38016 0 Hardware intra 1434 << 72 ms compression Software intra 1434 72 ms compression Inter 15 fps 260 98 ms Inter 7.5 fps 340 103 ms The quality of decoded stream of the dierent methods are comparable for DVC circumstances2 . The use of uncompressed video over networks requires large bandwidth, that is often unavailable over long distance. The use of hardware-assisted intra encoding (such as a hardware implementation of motion-JPEG) makes encoding feasible and preferable over software intra encoding. The software intra compression includes reconstruction of the frame, which is not necessary, because there are no inter frames that refer to the intra frames. Without this reconstruction, software encoding of intra frames is actually even faster. If we want to compress the video stream even more, inter encoding must be used and as hardware support is currently not (yet) available, software solutions are the only option. times of the encoder given in this table are based on preliminary results on a DEC Alpha 150 MHz. 2 Because there is no quality loss when using no compression, the quality of the lossy methods is chosen in a way that dierences with the original stream cannot or hardly be seen by the human eye. 1
60
Compression techniques vs. network technology
8.6 Latency
Current software encoders available for internet, such as VIC [McCanne95] and the INRIA H.261 encoder [Turletti93] are primary based on intra-only encoding, sometimes combined with motion detection to skip macro-blocks that have not changed. The main reasons for using only intra encoding are that packet loss is high which makes inter encoding less appropriate and that intra encoding takes less time to compress. For network connections that have low bandwidth such as POTS and ISDN, inter encoding becomes more important and therefore a fast H.263 encoder is essential for video conferencing over these networks.
8.5.1 Inter vs. Intra frames When compressing frames, a choice must be made whether (parts of) frames are compressed as either independent intra frames or dependent inter frames. Within inter frame compression, macro-blocks are sometimes transmitted as intra blocks, because a good matching block out of the previous frame cannot be found. Another reason for using intra frames instead of inter frame is loss of data during transmission: if there are errors in the compressed stream, it prevents not only reconstruction of the current frame, but also of the next inter frames (that rely on this frame) until an intra frame is transmitted again. Insertion of more intra frames, at the cost of a lower compression ratio, reduces the time a series of frames become unusable. The number of intra frames that must be inserted depends on the quality of the connection (how much data is lost in a certain period), and the time that is considered as least painful for loosing the connection. Another approach which could be considered is negative acknowledgement of the receiving party that encounters errors. However, this requires more protocol overhead and puts more constraints on network latency than a protocol that transmit without acknowledgements. Further research in this area of telecommunications is useful, but is out of the scope of this thesis.
8.6 Latency Before the implementation of the encoder is discussed, rst one of the constraints named in Chapter 1 will be considered: Latency. The time of the latency in a video conference determines the usefulness for interactive communication. If latency is high (larger than, for instance, 1 second), visual reactions might not be understood or misinterpreted. On the other hand, a unnecessary low latency constraint might make a video conferencing system too expensive to implement, as better networks and faster computers are required. To give an overview where latency occurs, the total latency is divided in a number of smaller latency components:
Video Grab Latency: This is the latency between the actual time something happens before the camera, and the time the data of this frame is received in the computer. Because most intermediate steps are done here 61
8.7 Summary
Compression techniques vs. network technology
in hardware (of the camera or the video grabber device), the maximum time is the time between two frame grabs. Compress Latency: This is the time it takes to compress a frame. This latency depends heavily on the encoder that is used and the performance of the system it runs on. If the encoder uses all computing time for encoding (and compresses at the maximum frame rate it can handle), the latency is the time it takes to compress one frame. If the encoder works at a frame rate of 10 fps, the compress latency is 10 1fps = 100ms. Network Latency: This is the time required before data transmitted from one end of the network arrives at the other end of the network. This latency depends heavily on the kind of network used. For instance, for some implemented ATM this latency is negligible compared to the compress latency. Other networks, like internet (TCP/IP) may have latencies that depends heavily on the network trac. Measures have shown that network latency between Enschede (NL) and Michigan (USA) could be up to 10 seconds. Decompress and display latency: Although both decompress latency and display latency could be named independently, for the easy of use are they combined. Decompress latency requires less time than compress latency (because Motion Estimation searches need not be done at decoding) and display latency is also small. The two main contributors to the total latency are the compress and network latency. If frames are encoded at a high frame rate (> 10 fps), compress latency will not be too high to disturb the video interaction. Network latency is dicult to in uence, but on some networks (ATM) this may be speci ed in the Quality of Service call during communication-setup. In case of internet, other protocols must be used that handle loss and corruption of packages for video data better than TCP/IP. A nal solution to reduce latency is to combine the compression and transmission of frame at the same time. In this case, it is not required to wait until a frame is fully compressed. In fact, and what will not be discussed further, this is already done by the H.263 encoder that outputs bytes immediately after its contents is known.
8.7 Summary To transmit video over a network, the compression strategy is determined by the capabilities of the network and the computing power that is available. The more bandwidth is available or the more loss occurs, intra frames as opposed to inter frames must be transmitted. The protocol involved also determines the choice for inter frames or intra frames. Finally, the size of the latency depends whether video conferencing is a useful way for interaction. For fast networks, latency is mainly determined by the latency of the compression, which is at most the time to encode one frame. 62
Chapter 9
Implementation and performance of the R263 encoder 9.1 Introduction This section discusses the implementation of an H.263 encoder, called R263. Although it is based on an H.263 encoder developed at Telenor, the Telenor implementation is a full implementation of all H.263 advanced options and not intended as real-time encoder. R263 is aimed at real-time encoding of desktoporiented video conferencing and does not make use of advanced options that require processing power not available now or in the near future, see Section 4.4. In Section 9.2, a number of optimizations are discussed to make the encoder applicable to real-time situations. The result of these optimizations on complete QCIF frame encoding are given in Section 9.4. Finally, in Section 9.5, a UNIX application is discussed that uses the R263 encoder library to generate an encoded stream based on the load of the system and lost frames.
9.2 Code optimization
9.2.1 Compiler ags
Optimization ags are useful for making the code faster without sacri cing much time rewriting inecient implementation. During testing of the functionality itself of the encoder, it may not be wise to choose \heavy" speed optimization. Some optimization options make it harder to nd bugs in the code and compilation takes more time. The clearest example between the choice of \debug-ability" and speed is the use of the debug compiler ag (in gcc, this is ag -g), that compiles symbol information into the code. Using this option makes it easier to nd bugs with the help of the debugger, but it has a dramatic eect of the performance of the code. Another ag that has some impact on the speed is the pro ling capability (in gcc, this is ag -p), but without this option, it is dicult to monitor the real performance of parts of the code. 63
9.2 Code optimization
Implementation and performance of the R263 encoder
The most useful compiler options for increasing performance are the optimization ags (in gcc, the -O ag, that may be followed by a number indicating how much optimization must be done). The native compiler also has similar optimization options, that improve speed without much eort (For the DEC alpha, the options -O5 -migrate make ecient code). The use of two dierent compilers makes it possible to compare not only the compilers themselves, but also parts of the codes that might need optimizations. For instance, although the gcc compiler performed generally better than the native cc compiler for the DEC Alpha, the two pro les showed that cc created a code for a procedure with a lot of if instructions that was almost ve times faster than the code gcc generated. Rewriting this code results is a more balanced performance for both compilers.1
9.2.2 Fine-tuning Just using the optimization ags of the c-compiler will optimize the code, but much more eciency is achieved by hand-tuning the code to a speci c machineplatform. This may make the code more ecient for some architecture than for others, but most techniques applied here, like copying data in the largest possible memory register size, will optimize the code for any architecture. The best technique to improve the speed of the code is to optimize the part of the code that the encoder spends the most time in. This is done by generating pro le data on a representative run of the encoder. For video encoding, using a representative run is especially important because frame rate determines strongly what part of the code is mostly used. For instance, high frame rate encodings rely heavily on the DCT operation while lower frame rate encodings spent more time in motion estimation. The pro le information of the H.263 encoder showed that the encoder used a lot of time for the motion estimation, the DCT and IDCT and copy-operations. Therefore, these functions were the primary target for optimization.
Example: Macro-block comparison One of the most time-critical operation in encoding is the macro-block comparison operation. This operation compares two blocks of 16 16 pixels and returns a value that represents the way in which the blocks dier. The higher the number, the more the blocks dier. The critical part of this procedure that is called 16 times per call compares a line of 16 pixels and is given below: sad += abs(*ii-*curr) abs(*(ii+4)-*(curr+2)) abs(*(ii+8)-*(curr+4))
+ abs(*(ii+2)-*(curr+1)) + + abs(*(ii+6)-*(curr+3)) + + abs(*(ii+10)-*(curr+5)) +
1 Another option is to use a mix of two compilers to generate the fastest code. However, this approach makes the code very speci c for a system and requires continuous monitoring of the speed of each function.
64
Implementation and performance of the R263 encoder abs(*(ii+12)-*(curr+6)) abs(*(ii+16)-*(curr+8)) abs(*(ii+20)-*(curr+10)) abs(*(ii+24)-*(curr+12)) abs(*(ii+28)-*(curr+14))
+ + + + +
9.3 Implementation of change detection
abs(*(ii+14)-*(curr+7)) + abs(*(ii+18)-*(curr+9)) + abs(*(ii+22)-*(curr+11)) + abs(*(ii+26)-*(curr+13)) + abs(*(ii+30)-*(curr+15));
The abs function calculates the absolute value and after compilation, it is inserted in-line, so no function-call overhead is introduced. The new, optimized, code is given underneath: sad sad sad sad sad sad sad sad sad sad sad sad sad sad sad sad
+= += += += += += += += += += += += += += += +=
abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-(*curr++)); abs(*ii-*curr);
ii ii ii ii ii ii ii ii ii ii ii ii ii ii ii
+= += += += += += += += += += += += += += +=
2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2; 2;
As both pointer must be updated for the next line anyway, resetting the pointers does not cost extra time. Splitting up the calculation of the sad value into multiple instructions and replacing the indexed pointer by a moving pointer improved the speed of the procedure by 40%. For a procedure that took more than 20% of the CPU time for encoding (for some full-search options even more than 60%), this reduction is signi cant.
9.3 Implementation of change detection In [Turletti93] (See also Section 7.4), a change detection scheme is discussed to nd movement in the picture based on a comparison between the previous frame and the current frame that is to be encoded. The method compares 16 pixels of a macro-block with the same pixels in the previous frame. If these pixels dier beyond a threshold (called the pixel threshold ), they are marked as changed. When more than a certain number of pixels are marked changed per macro-block (called the macro-block threshold ), this macro-block is encoded, or otherwise skipped. 65
9.4 Results of the R263 encoder
Implementation and performance of the R263 encoder
Table 9.1: Comparison of Miss America encoding (75 frames) with and without motion detection (MD) Stream Pixel MB Size Quality (SNR) Encode threshold threshold (bytes) Y / Cb / Cr time Original 15786 36.1 / 38.3 / 37.3 6.6s Encoding + MD 2 2 15564 36.1 / 38.3 / 37.3 4.7s Encoding + MD 2 4 15157 35.8 / 38.2 / 37.1 4.1s Encoding + MD 4 2 14640 35.7 / 38.2 / 36.9 3.8s Encoding + MD 4 4 13658 34.5 / 38.1 / 36.1 3.1s An implementation was made to examine this method and the results are given in Table 9.1. The Miss America sequence is encoded for 75 frames / 15 fps with logarithmic search window of [?2::2], and a quantization factor of 8. Initially, as pixel threshold, an absolute dierence for the luminance of more than two was chosen. For the macro-block threshold, a number of more than two changed pixels per macro-block was chosen. With this setting, approximately 60% of the macro-blocks on the screen were used for encoding and this seems to be correct regarding this DVC-type video. This test shows that quality is not sacri ced for speed, probably due to a good threshold-setting. The size of the stream produced by using change detection is also not aected in a negative way, and speed is improved by almost 30% (and this includes the change detection procedure time itself). Tests with higher threshold values give higher speed but at the cost of some quality. Two tests with respectively a macro block and pixel threshold set to four encode approximately 45% and 40% of all macro-blocks. In the reconstructed video, some artifacts of these settings are visible by some background pixels that are not restored when the person in the video moves over them. A nal test with both thresholds set to four gives even worse results. Only 25% of the macro-blocks are encoded and results are especially worse on the border of the face with the background. The change detection method used in IVS is useful for encoding when the right thresholds are chosen. It is also possible to manipulate the threshold value to reduce compression time. Finally, the time to compress a frame can be measured by looking at the number of macro-blocks that are marked as changed.
9.4 Results of the R263 encoder In this section, the results of the implemented H.263 encoder are discussed. The compression parameters are chosen in a way that compression is still relative good, while not much speed and quality is sacri ced for the result. This results in bitstreams that are not more than 20% larger than with full compression, but cost less than half the time to compress than with full compression. The video streams that are used here are given in Table 9.2. The three streams are the same, but dier in frame rate. For instance, the B stream is constructed by 66
Implementation and performance of the R263 encoder
9.4 Results of the R263 encoder
taking every second frame from the A stream. Table 9.2: The streams that are used for compression Stream Video Capture frame rate No. of frames A Miss America 30 fps 150 frames B Miss America 15 fps 75 frames C Miss America 7.5 fps 38 frames The results are given in Table 9.3. In column 3, the average time is given for compression of a frame on a DEC Alpha 150 MHz. The time to encode a frame is slightly longer for streams with lower frame rates. This is logical if we look at the movement between two consecutive frames: for the C stream, the movement is up to 4 times more as for the A stream. Therefore, more blocks must be compared to nd the best-matching block. For the A stream, at most 9 block comparisons are done, for the B stream at most 9+4 = 13 comparisons are done and for the C stream, at most 9 + 9 = 18 comparisons are done per macro-block. The time to encode a frame at 30 fps is actually larger than the time that is available, as 30 fps indicate that at most 1s=30:0 fps = 33:3 ms may be used. Therefore, real-time encoding at 30.0 fps is not feasible on this system. If frame rate is dropped to 15.0 fps, real-time encoding is just possible. The size per frame, given in column 5, also shows that the information transmitted between two frames is much higher with the lower frame rate sequences than with a higher frame rate sequence. Nevertheless, as 30 frames must be transmitted for the A stream against 7.5 for the C stream, the double bandwidth is required for a video stream that has a frame rate of four times as much. Table 9.3: The streams that are used for compression Stream Motion Estimation frame enQuality frame size parameters code time Y / Cb / Cr (SNR) (bytes) A Exhaustive [?0:5; 0:5] 59 ms 38.0 / 38.5 / 37.7 162 B Logarithmic [?1; 1] 60 ms 37.3 / 38.5 / 37.6 252 C Logarithmic [?2; 2] 66 ms 37.7 / 38.3 / 37.4 325 In Table 9.4, the encoding time is split in a number of components. The Motion Estimation component includes the search algorithm with its blockmatching algorithm, a luminance interpolation function required for half-pixel prediction and the change detection algorithm discussed in Section 7.4. The DCT & Quantization contains the fast DCT algorithm discussed in Chapter 6 and a quantization by 8 function. In fact, the quantization function also delivers some information for the variable-length encoder as this was ecient to combine, but the time this takes extra is negligible. The reconstruction component include all operations that are required to generate a copy of the reconstructed frame that was just encoded. This is required for making a good motion estimation for the next frame. It is also easy to use this copy at the encoder to verify the quality of the video stream. In some video compression algorithm 67
9.5 Adaptive encoding based on system load
Implementation and performance of the R263 encoder
where good quality is transmitted and Intra frames are transmitted regularly (such as MPEG), this step is omitted and the source video stream is used as reference frame instead of the reconstructed frame. For low bit-rate encoding, however, these circumstances do not apply. The variable-length encoding component consists of the Human encoding and delivering an H.263 stream. The I/O component is the time required to read a source frame from disk and writing the output stream. Finally, the control component is the time to manage the encoding process itself. By selecting a good parameter setting for compression, a nice balanced encoding is achieved. Even the time to do more motion estimation at lower frame rate is spreaded across the dierent components, because more macroblocks are encoded at lower frame rates, and so components other than motion estimation will also take more time. Stream A B C
Table 9.4: Pro le of encoding Motion DCT & Recon- VLE I/O Control Estimation Quantization struction 39.6% 21.3% 18.4% 8.4% 3.7% 8.6% 37.2% 21.7% 21.1% 8.2 % 3.1% 8.7 % 41.7% 17.8% 19.5% 9.9% 3.2% 7.9%
9.5 Adaptive encoding based on system load Timer Network Frame Grabber Compression Control Unit
H.263 Compression
Figure 9.1: Model for a real-time video compression application In this section, a model is discussed that allows real-time video compression under varying system and network loads. This model is given in Figure 9.1. The implementation of this model is also used as a sample application to show that the R263 compression library can be applied and controlled in an easy way. The Compression Control Unit (CCU) is the central unit that directs the compression. It receives QCIF frames from the camera and determines the time between the arrival of this frame and the previous frame that was received. Based on this information and control information from the network, it 68
Implementation and performance of the R263 encoder
9.5 Adaptive encoding based on system load
determines the compression parameters used for compression of this frame. The CCU may also decide not to compress this frame, because network bandwidth or processing time is not sucient. Parameters that may be set by the control unit are the type of frame encoding (inter or intra coding), search algorithm (Exhaustive or Logarithmic ) and size of the search window (larger for lower frame rates and smaller for high frame rates). It is also possible to deliver an array of the macro-blocks that must be encoded, which makes it possible to use a skin detection or background substraction technique in advance to the compression.
9.5.1 Implementation
The frame grabber consists of a camera, an AVA connected to an ATM network, an ATM interface card and a thread in the application that grabs frames and converts them to QCIF frames. Although the ATM delivers YUV-type data, it must be converted from an 8 8-based tile format to a scan line-order format and the chrominance data must also be down-sampled from 4:2:2 to 4:2:0 (See Section 2.2.2). When the CCU is nished processing a frame, it receives a new frame together with a number indicating how many frames have been skipped and the time between the last two frame grabs. This information is used to calculate the appropriate compression parameters, using the ndings in Chapter 5. The compression engine converts the QCIF frame using the compression parameters speci ed by the CCU and delivers a bitstream of compressed data. This bitstream is transmitted over the network using TCP/IP. The decoder at the receiving end is the standard tmn-1.7 decoder, that is available under GNU License. It is adapted to receive H.263 streams over TCP/IP sockets and uses the same IDCT operation that is found in the encoder. The decoder uses much less time than the encoder (almost 3 times less), and most of it (30%) is actually used for dithering and not for the decoding itself.
9.5.2 Results
To show the performance of the CCU on a non-dedicated system, a video was taken with low, but representative movement. In this video, there are some seconds where there is hardly movements and some seconds where there is a little bit more movement than in the rest of the sequence. These characteristics are directly related to the average number of frames that can be compressed. The output of CCU is given here: 10.00 8.00 8.00 8.50 10.00 10.50 8.50 8.50
fps fps fps fps fps fps fps fps
|1828.25 |2077.06 |2325.75 |2217.88 | 838.85 | 671.71 |1307.94 | 672.53
bpf bpf bpf bpf bpf bpf bpf bpf
| | | | | | | |
18.28 16.62 18.61 18.85 8.39 7.05 11.12 5.72
69
Kbps Kbps Kbps Kbps Kbps Kbps Kbps Kbps
9.6 Summary 12.00 10.50 8.00 8.00 9.00 6.00 8.00 8.50 7.00
Implementation and performance of the R263 encoder fps fps fps fps fps fps fps fps fps
| 250.42 |1136.33 |1540.50 |2565.44 |1592.00 |4087.25 |1413.75 | 757.59 |2166.14
bpf bpf bpf bpf bpf bpf bpf bpf bpf
| | | | | | | | |
3.01 11.93 12.32 20.52 14.33 24.52 11.31 6.44 15.16
Kbps Kbps Kbps Kbps Kbps Kbps Kbps Kbps Kbps
Every row gives a summation of the compression results that are achieved in a predetermined interval, which is set to 2 seconds in this example. The rst column gives the number of frames that is compressed per second (excluding skipped frames), and the second and third column give the average number of bits per frame and the number of kilobits per second that is transmitted. This result shows that on a non-dedicated machine, video compression is possible at a medium frame rate. For a computer running a Real-Time operating system, the actual encoding frame rate will be higher as the scheduler can give the encoder more time to compress. Then, results given in Section 9.4 are feasible. The test also shows that the size of the video stream is low: With a 28k8 modem, the stream could be transmitted easily. For ISDN lines of 64 Kbps, this test gives good perspective on transmission of video and audio combined over one line. It might turn out that audio, and not video, is the bottleneck in video conferencing transmissions over low-bandwidth networks.
9.6 Summary In this chapter, the results are presented of the R263 encoder. Selection of suitable compression parameters, optimization of the code both by automatic tools as by hand, and the implementation of advanced video techniques show that Real-Time encoding of QCIF frames is possible in the near future on desktop PCs. To make the encoder useful for a variety of application, the encoder may be used as library function. A sample application that uses this library to compress video based on system load is also implemented.
70
Chapter 10
Conclusion Technical developments in the computer industry has created a current situation where under some constraints, video communication becomes feasible. Unfortunately, video communication required many resources, such as processing speed, camera and screen resolution, but beyond all network bandwidth. As some of these developments stay behind other for ideal video communication circumstances, sophisticated algorithms must be applied to overcome technical constraints. One of these techniques that reduces network bandwidth at the cost of computing power is video compression. Compression of a video stream is done as either frame per frame or as a whole, where compression of the current frame depends on the compression of the previous frames. Although frame per frame compression, such as motionJPEG, are popular, much more compression can be attained by compressing the video as a whole. Motion Estimation is the proven technique to reduce the temporal redundancy between the frames. Unfortunately, when it is applied with the wrong set of parameters, the video compression ratio is too small, or the computing power required is too much for certain compression. Another factor that must be taken into account is the quality of the video stream that is transmitted as lossy compression algorithms are used here. This research has determined the appropriate settings (or parameters) for dierent video streams to be compressed most eectively. As this research is aimed at low-bandwidth communication, the choice for the H.263 standard is obvious: The compression ratio is higher than MPEG for low-bandwidth communication at the same quality. H.263 is also preferable over H.261, because more advanced techniques are integrated in the standard. Examination of the advanced options of H.263 showed that usage of some of these options has low eect on the compression ratio or that it does have good eect on the compression ratio, but that it requires too much computing power. Half-pixel prediction, not found in the H.261 standard, is used however. To make the H.263 implementation usable, drastic steps must be taken to improve performance. Compared the the Telenor functional test implementation of H.263, some parts are rewritten, replaced or added: discrete cosine transformations, motion estimation algorithms, and most of the copying and control code. 71
Conclusion
By using high frame rates, the compress latency that occurs is kept as small as possible: the maximum latency here is equal to the time it takes to compress a frame. During encoding, compressed data that is generated is immediately ready for usage, and the application is not required to wait until the full frame is compressed. In this way, the total latency that occurs for between video grabbing and display is kept as low as possible to improve interaction. Three new techniques were tested to focus compression on most important part of the picture, the person in the video: skin selection, background substraction, and change detection. A fast implementation of skin selection is developed that only requires one test per pixel for determining skin color. The background substraction uses a previously stored image of the background to determine foreground objects that are more important than background. The nal method is change detection. Research has shown that most macro-blocks in the video frame do not have motion, but this is detected after quite a number of complex operations. Change detection determines these blocks so only they are fed into the encoder. Application of change detection show signi cant improvement of the speed of the encoder. The encoder is implemented as a library function, which is easy to use and to integrate into other applications. A sample application under UNIX, called Compression Control Unit, is also implemented that uses this library to compress images from the AVA and to transmit them over TCP/IP. This application determines the time between the last two received frames and applies the best compression parameters. Results of this application show that QCIF video (176 144 pixels) can be transmitted over POTS networks (28.8 Kbps) at 10 fps, using a (non-dedicated) desktop PC (Pentium 150 MHz). Pro ling information shows that dedicated machines, or systems that run real-time operating systems will compress at up-to 15{20 fps1 . When faster computers are available, this encoder is suitable for real-time (25{30 fps) transmission over ISDN. This research has shown that networks with large bandwidth (> 1 Mbps), are not a requirement for video communication: New, advanced standards such as H.263 are able to compress video streams drastically if enough computing power is available. Although dedicated hardware implementation are the fastest solution, this work shows that the more exible way of software encoding is a serious option. With the introduction of new processors capable of handling more data in parallel (multimedia instructions), and faster chip technology, prospects of real-time software video compression are more than hopeful. I therefore recommend further research on the usefulness of this new generation of processors for video encoding.
1 Numbers taken for sequences with moderate movement: When there is no movement at all, the encoder compresses already at the maximum of 25 fps
72
Chapter 11
Acknowledgements A large number of people have helped me during my study in one way or another. I will try to give everyone who helped me some credits, but the order of the credits given here is irrelevant, as classi cation of the dierent kinds of \help" might be a master thesis on its own. I would like to thank my graduation committee: Prof. Sape Mullender, for giving me such an interesting assignment, Peter Bosch for being my rst rst \begeleider" (I bought Elements of Style !) and of course Lars Vognild for being my second rst \begeleider". I also like to thank the other group members: Arne, Martijn, Paul, Pierre and Tatjana and everyone I forget. I also like to thank everyone who had given me work as student-assistant: Martin Beusekamp, Mr. Bonnema, Hans Scholten, Albert Schoute, and everyone else. Also would I like to thank all students who made working at SPA inspiring and pleasant. Thanks to Erwin: the word \Hacker" is named after you. Thanks BJ for your excellent taste for music. Thanks Mars for your AVA code. Thanks Robert (at least somebody else also working on H.263). And not to forget: Eelco, Berend, Rob, Tjipke (how about all your schedules?), Jing-Xu (success on your promotion), Nick, Michiel, and all others. There are a lot of people who helped me enormously, who I have never met or seen. With the result of my work, we might have a video conference in the future. Of course would I like to thank Karl Lillevold for making the H.263 code freely available. Without him, it would have taken several more months before my work was nished. I also like to thank him for his clear implementation of the code. I also want to thank Leonid Kasperovich for an interesting discussion on motion estimation in H.261, your encoder looks promising! Also thanks to Burkhard Neidecker-Lutz, I will try to give you a copy. In particular I would like to thank Edwin, whom I worked with for quite some courses. However, Coca Cola (TM) is it! 73
Acknowledgements
The other part of my life at Twente could be pronounced in a six-letter word: KRONOS. Thank you all! Training has been the best distraction there is, and I will miss you. In the nal place, and last but de nitely not least I would would like to thank my family for supporting me during my study, in every way possible. Even in a small country like the Netherlands, Twente is a long way from Schagen, and the Dutch public transportation does not always help to bridge the gap (and neither does the Secretary of Education).
Roalt Aalmoes
74
Bibliography [Aalmoes95] Roalt Aalmoes and Peter Bosch. Overview of Still-picture and Video Compression Standards. Pegasus 95-03. University of Twente, December 1995. in preparation. [Bhaskaran95a] V. Bhaskaran and K. Konstantinides. Image and video compression standards. Kluwer academic publishers, 1995. [Bhaskaran95b] Vasudev Bhaskaran, Konstantinos Konstantinides, Ruby B. Lee, and John P. Beck. Algorithmic and Architectural Enhancements for real-time MPEG-1 decoding on a General Purpose RISC workstation. IEEE Transactions on circuits and systems for video technology, 5(5):380{ 6, October 1995. [Bryan95] John Bryan. Compression scorecard. Byte, 20(5):107{12, May 1995. [Chan96] Yui-Lam Chan and Wan-Chi Siu. New adaptive pixel decimation for block motion vector estimation. IEEE Transactions on circuits and systems for video technology, 6(1):113{18, February 1996. [cody92] Mac A. Cody. The fast wavelet transform. Dr. Dobbs journal, 17(4):16{28, April 1992. [Compuserv90] CompuServe Inc. Graphics interchange format (sm) version 89a. Technical report. CompuServe, Incorporated Columbus, Ohio, 1990. [Feig92] Ephraim Feig and Shmuel Winograd. Fast algorithms for the discrete cosine transform. IEEE Transactions on signal processing, 40(9):2174{93, September 1992. [Filippini95] Luigi Filippini. MPEG informations, questions and answers, 31 July 1995. http://www.crs4.it/~luigi/MPEG/mpegfaq.html. [Gailly95] Jean loup Gailly. comp.compression frequently asked questions, 28 September 1995. ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq/part[1-3]. [Gall91] Didier Le Gall. Mpeg: A video compression standard for multimedia applications. Communications of ACM, 34(4):46{58, april 1991. [Girod96] Bernd Girod. Recent advances in video compression. ISCAS-96 (Atlanta , May 1996), May 1996. 75
Bibliography
[Gray92] Robert M. Gray, Pamela C. Cosman, and Eve A. Riskin. Image compression and tree-structured vector quantization. In James A. Storer, editor, Image and text compression, Communications and information theory, pages 3{34. Kluwer academic publishers, 1992. [ITULBC95] R. Schaphorst. Draft recommendation H.263. Technical report LBC-95-251. ITU-T, 3 October 1995. [Kleine95] G. Kleine. Digitale televisie met behulp van mpeg-2-kompressie. Elektuur, 35(9):68{75, September 1995. [Koga81] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, Motioncompensated interframe coding for video conferencing, Proceedings of NTC 81 (New Orleans), November/December 1981. [Lane95] Tom Lane. JPEG-faq, part 1, 28 May 1995. ftp://rtfm.mit.edu/pub/usenet/news.answers/jpeg-faq/part1. [Liou91] Ming L. Liou. Overview of the p*64 kbit/s video coding standard. CACM, 34(4):60{3, apr. 1991. [Liu93] Bede Liu and Andre Zaccarin. New fast algorithms for the estimation of block motion vectors. IEEE transactions on circuits and systems for video technology, 3(2):148{57, April 1993. [McCanne95] Steven McCanne and Van Jacobson. vic: a exible framework for packet video. ACM Multimedia (San Franscisco, November 1995), November 1995. [Moura96] Jose M. F. Moura, Radu S. Jasinschi, Hirohisa Shiojiri, and JyhCherng Lin. Video over wireless. IEEE personal communications, 2:44{54, February 1996. [Mullender92] Sape J. Mullender, Ian M. Leslie, and Derek McAuley. Pegasus Project Description. Memoranda Informatica 92{75. University of Twente, Faculty of Computer Science, September 1992. [Nelson91] Mark Nelson. The data compression book. M & T Publishing, Incorporated, 501 Galveston Drive, Redwood City, CA 94063-4728, U.S.A., 1991. [Okubo95] Sakae Okubo, Ken McCann, and Andrew Lippmann. MPEG-2 requirements, pro les and performance veri cation | framework for developing a generic video coding standard. Signal processing image communication, pages 201{9, July 1995. [Patel93] Ketan Patel, Brian C. Smith, and Lawrench A. Rowe. Performance of a Software MPEG Video Decoder. ACM Multimedia Conference (Anaheim 1993), 1993. [Poynton95] Charles A. Poynton. Frequently asked questions about colour, 1995. ftp://ftp.inforamp.net/pub/users/poynton/doc/colour/. 76
Bibliography
[press91] William H. Press. Wavelet Transforms. Harvard-Smithsosian Center for Astrophysics, 1991. Preprint No. 3184. [Renxiang94] Renxiang Li, Bing Zeng, and Ming L. Liou. New three-step search algorithm for block motion estimation. IEEE Transactions on circuits and systems for video technology, 4(4):438{42, August 1994. [Telenor95] Karl O. Lillevold. Digital video coding at Telenor R&D, 11 November 1995. Homepage on internet. [Turletti93] Thierry Turletti. H.261 software codec for videoconferencing over the internet. Technical report 1834. INRIA, January 1993. [Wallace91] Gregory K. Wallace. The jpeg still picture compression standard. Communications of ACM, 34(4):30{44, april 1991. [Yokoyama95] Yutaka Yokoyama, Yoshihiro Miyamoto, and Mutsumi Ohta. Very low bit rate video coding using arbitrarily shaped region-based motion compensation. IEEE Transactions on circuits and systems for video technology, 5(6):500{7, December 1995. [Zwaal95] Hugo Zwaal. The Design and Implementation of a Camera Independent Face Tracker. Technical report 95-09. University of Twente, January 1995.
77