Multimedia Networking: Technology, Management and Applications

Syed Mahbubur Rahman
Minnesota State University, Mankato, USA
Idea Group Publishing
Information Science Publishing
Hershey • London • Melbourne • Singapore • Beijing
Acquisition Editor: Mehdi Khosrowpour
Managing Editor: Jan Travers
Development Editor: Michele Rossi
Copy Editor: Maria Boyer
Typesetter: LeAnn Whitcomb
Cover Design: Deb Andre
Printed at: Integrated Book Technology
Published in the United States of America by
Idea Group Publishing
1331 E. Chocolate Avenue
Hershey PA 17033-1117
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.idea-group.com

and in the United Kingdom by
Idea Group Publishing
3 Henrietta Street, Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright © 2002 by Idea Group Publishing. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Library of Congress Cataloging-in-Publication Data

Multimedia networking : technology, management, and applications / [edited by] Syed Mahbubur Rahman
p. cm.
Includes bibliographical references and index.
ISBN 1-930708-14-9
1. Multimedia systems. 2. Computer networks. I. Rahman, Syed Mahbubur, 1952-
QA76.575 .M8345 2001
006.7--dc21
2001039268
British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library.
Multimedia Networking: Technology, Management and Applications

Table of Contents
Preface ............................................................................................................ vii

Chapter 1 ......................................................................................................... 1
Managing Real-Time Distributed Multimedia Applications
Vana Kalogeraki, Hewlett-Packard Laboratories, USA
Peter Michael Melliar-Smith, UC Santa Barbara, USA
Louise E. Moser, UC Santa Barbara, USA

Chapter 2 ....................................................................................................... 17
Building Internet Multimedia Applications: The Integrated Service Architecture and Media Frameworks
Zhonghua Yang, Nanyang Technological University, Singapore
Robert Gay, Nanyang Technological University, Singapore
Chengzheng Sun, Griffith University, Queensland, Australia
Chee Kheong Siew, Nanyang Technological University, Singapore

Chapter 3 ....................................................................................................... 54
The Design and Performance of a CORBA Audio/Video Streaming Service
Naga Surendran, Washington University-St. Louis, USA
Yamuna Krishamurthy, Washington University-St. Louis, USA
Douglas C. Schmidt, University of California, Irvine, USA

Chapter 4 .................................................................................................... 102
MPEG-4 Facial Animation and its Application to a Videophone System for the Deaf
Nikolaos Sarris, Aristotle University of Thessaloniki, Greece
Michael G. Strintzis, Aristotle University of Thessaloniki, Greece

Chapter 5 .................................................................................................... 126
News On Demand
Mark T. Maybury, The MITRE Corporation, USA
Chapter 6 .................................................................................................... 134
A CSCW with Reduced Bandwidth Requirements Based on a Distributed Processing Discipline Enhanced for Medical Purposes
Iraklis Kamilatos, Informatics and Telematics Institute, Greece
Michael G. Strintzis, Informatics and Telematics Institute, Greece

Chapter 7 .................................................................................................... 151
A Layered Multimedia Presentation Database for Distance Learning
Timothy K Shih, Tamkang University, Taiwan

Chapter 8 .................................................................................................... 172
QoS-Aware Digital Video Retrieval Application
Tadeusz Czachorski, Polish Academy of Sciences, Poland
Stanislaw Jedrus, Polish Academy of Sciences, Poland
Maciej Zakrzewicz, Poznan University of Technology, Poland
Janusz Gozdecki, AGH University of Technology, Poland
Piotr Pacyna, AGH University of Technology, Poland
Zdzislaw Papir, AGH University of Technology, Poland

Chapter 9 .................................................................................................... 186
Network Dimensioning for MPEG-2 Video Communications Using ATM
Bhumip Khasnabish, Verizon Labs, Inc., USA

Chapter 10 .................................................................................................. 222
VBR Traffic Shaping for Streaming of Multimedia Transmission
Ray-I Chang, Academia Sinica, Taiwan
Meng-Chang Chen, Academia Sinica, Taiwan
Ming-Tat Ko, Academia Sinica, Taiwan
Jan-Ming Ho, Academia Sinica, Taiwan

Chapter 11 .................................................................................................. 237
RAC: A Soft-QoS Framework for Supporting Continuous Media Applications
Wonjun Lee, Ewha Womans University, Seoul, Korea
Jaideep Srivastava, University of Minnesota, USA

Chapter 12 .................................................................................................. 255
A Model for Dynamic QoS Negotiation Applied to an MPEG4 Application
Silvia Giordano, ICA Institute, Switzerland
Piergiorgio Cremonese, Wireless Architect, Italy
Jean-Yves Le Boudec, Laboratoire de Reseaux de Communications, Switzerland
M. Podesta, Whitehead Laboratory, Italy

Chapter 13 .................................................................................................. 269
Playout Control Mechanisms for Speech Transmission over the Internet: Algorithms and Performance Results
Marco Roccetti, Universita di Bologna, Italy

Chapter 14 .................................................................................................. 290
Collaboration and Virtual Early Prototyping Using The Distributed Building Site Metaphor
Fabien Costantini, CEDRIC, France
Christian Toinard, CEDRIC, France

Chapter 15 .................................................................................................. 333
Methods for Dealing with Dynamic Visual Data in Collaborative Applications–A Survey
Binh Pham, Queensland University of Technology, Australia

Chapter 16 .................................................................................................. 351
An Isochronous Approach to Multimedia Synchronization in Distributed Environments
Zhonghua Yang, Nanyang Technological University, Singapore
Robert Gay, Nanyang Technological University, Singapore
Chengzheng Sun, Griffith University, Queensland, Australia
Chee Kheong Siew, Nanyang Technological University, Singapore
Abdul Sattar, Griffith University, Queensland, Australia

Chapter 17 .................................................................................................. 369
Introduction To Multicast Technology
Gábor Hosszú, Budapest University of Technology & Economics, Hungary

Chapter 18 .................................................................................................. 412
IP Multicast: Inter Domain, Routing, Security and Address Allocation
Antonio F. Gómez-Skarmeta, University of Murcia, Spain
Pedro M. Ruiz, University Carlos III of Madrid, Spain
Angel L. Mateo-Martinez, University of Murcia, Spain

Chapter 19 .................................................................................................. 441
Mobile Multimedia over Wireless Network
Jürgen Stauder, Thomson Multimedia, France
Fazli Erbas, University of Hannover, Germany

About the Authors ..................................................................................... 473

Index ............................................................................................................ 482
Preface

We are witnessing explosive growth in the use of multiple media forms (voice, data, images, video, etc.) in varied application areas, including entertainment, communication, collaborative work, electronic commerce and university courses. Increasing computing power, integrated with multimedia and telecommunication technologies, is bringing into reality our dream of real-time, virtually face-to-face interaction with collaborators far away from us. In the process of realizing our technological ambitions, we need to address a number of technology, management and design issues, and we need to be familiar with exciting current applications. It is impossible to track the magnitude and breadth of the changes that multimedia and communication technology brings to us daily in many different ways throughout the world. Consequently, this book presents an overview of the expanding technology, beginning with application techniques that lead to management and design issues. Our goal is to highlight major multimedia networking issues, understanding and solution approaches, and networked multimedia applications design. Because we wanted to include diverse ideas from various locations, we included chapters from professionals and researchers from about thirteen countries working at the forefront of this technology. This book has nineteen chapters, which cover the following major multimedia networking areas:

• Development and management of real-time distributed multimedia applications
• Audio/video applications and streaming issues
• Protocols and technologies for building Internet multimedia applications
• QoS frameworks and implementation
• Collaborative applications
• Multimedia synchronization in distributed environments
• Multicasting technology and applications
• Use of mobile multimedia over wireless networks
The chapters in this book address the dynamic and efficient usage of resources, which is a fundamental aspect of multimedia networks and applications. The book also details current research, applications and future research directions. The following paragraphs put together the abstracts from the chapters to provide an overview of the topics covered.
Development and management of real-time distributed multimedia applications

The first three chapters focus on the management of distributed multimedia networking, along with streaming issues. Real-time distributed multimedia environments, characterized by timing constraints and end-to-end quality of service (QoS) requirements, have set forth new challenges for efficient management mechanisms that respond to transient changes in the load or the availability of resources. Chapter one presents a real-time distributed multimedia framework, based on the Common Object Request Broker Architecture (CORBA), that provides resource management and Quality of Service (QoS) for CORBA
applications. Chapter two presents state-of-the-art coverage of the Internet integrated service architecture and two multimedia frameworks that support the development of real-time multimedia applications. The Internet integrated service architecture supports a variety of service models beyond the current best-effort model. A set of new real-time protocols that constitute the integrated service architecture is described in some detail: protocols for real-time media transport, for media session setup and control, and for resource reservation in order to offer guaranteed service. Two emerging media frameworks that provide a high-level abstraction for developing real-time media applications over the Internet are then presented: the CORBA Media Streaming Framework (MSF) and the Java Media Framework (JMF), both of which provide object-oriented multimedia middleware. Future trends are also discussed. Chapter three focuses on another important topic in ORB end-system research: the design and performance of the CORBA audio/video streaming service specification.
Protocols and technologies for building Internet multimedia applications

The next several chapters focus on the protocol and technology aspects of building networked multimedia applications. Chapter four aims to introduce the potential contribution of the emerging MPEG-4 audio-visual representation standard to future multimedia systems. This is attempted through the 'case study' of a particular example of such a system, 'LipTelephone', a special videoconferencing system. The objective of 'LipTelephone' is to serve as a videophone that will enable lip readers to communicate over a standard telephone connection, or even over the Internet. The main objective of the chapter is to introduce students to these methods for the processing of multimedia material, to provide researchers with a reference to the state of the art in this area, and to urge engineers to use the present research methodologies in future consumer applications. Recently, scientists have been focusing on a new class of application that promises on-demand access to multimedia information such as radio and broadcast news. Chapter five describes how the synergy of speech, language and image processing has enabled a new class of information-on-demand news systems. The chapter also discusses the ability to automatically process broadcast video 7x24 and serve it to the general public in individually tailored personal casts, and identifies some remaining challenging research areas. The next chapter presents another application: an open telecooperation architecture for medical teleconsultation, implemented on modern high-power workstations using a distributed computing system. The resulting medical Computer Supported Cooperative Work (CSCW) tool is evaluated experimentally. This tool also has the potential to be used in distance education environments. Chapter seven describes a five-layer multimedia database management system (MDBMS) with storage sharing and object reuse support, with application to an instruction-on-demand system that is used in the realization of several computer science related courses at Tamkang University.
More focus on QoS frameworks and implementation

In multimedia applications, media data such as audio and video are transmitted from server to clients via the network according to transmission schedules. Unlike conventional data streams, media transmission requires end-to-end quality of service (QoS) to provide jitter-free playback. The subsequent chapters, while dealing with protocol and technology aspects, also focus on QoS frameworks and implementation issues. Because of its wide range of application areas, the delivery of high-quality video content to customers is now a driving force for the evolution of the Internet. Chapter eight presents an originally developed video retrieval application whose unique features include a flexible user interface based on an HTTP browser for content querying and browsing, support for both unicast and multicast addressing, and user-oriented control of the QoS of video streaming in Integrated Services IP networks. Part of the chapter is devoted to selected methods of modelling information systems, the prediction of system performance, and the influence of different control mechanisms on the quality of service perceived by end users. Chapter nine discusses various issues related to the shaping of Moving Picture Experts Group (MPEG) video for generating constrained or controlled variable bit rate (VBR) data streams. The results presented in this chapter can be utilized not only for network and nodal (buffer) capacity engineering but also for delivering user-defined quality of service (QoS) to customers. The next chapter presents a novel traffic shaping approach to optimize both resource allocation and utilization for VBR media transmission. This idea is then extended to online transmission problems. The emergence of high-speed networked multimedia systems provides opportunities to handle collections of real-time continuous media (CM) applications. Admission control in CM servers or video-on-demand systems restricts the number of applications supported on the resources. It is necessary to develop more intelligent mechanisms for efficient admission control, negotiation, resource allocation and resource scheduling, with the aim of optimizing total system utilization. In particular, there has been increased interest in I/O issues for multimedia or continuous media. Chapter eleven presents a dynamic and adaptive admission control strategy for providing fair disk bandwidth scheduling and better performance for video streaming. It also compares simulation results on the behavior of conventional greedy admission control mechanisms with that of the proposed admission control and scheduling algorithm. The traffic generated by multimedia applications presents a great amount of burstiness, which can hardly be described by a static set of traffic parameters. For dynamic and efficient usage of the resources, the traffic specification should reflect the real traffic demand and at the same time optimize the resources requested. To achieve this goal, chapter twelve presents a model for dynamically renegotiating the traffic specification (RVBR) and shows how it can be integrated with the RSVP resource reservation mechanism, demonstrating through an example application that it is able to adapt its traffic to manage QoS dynamically. The remainder of the chapter focuses on the technique used to implement RVBR, taking into account the problems arising from delay during the renegotiation phase, and on the performance of the application with MPEG4 traffic. Audio is frequently perceived as one of the most important components of multimedia communications. The very high transmission delay and transmission delay variance (known as jitter) experienced in the current architecture of the Internet impair real-time human conversations. One way to cope with this problem is to use adaptive control mechanisms. These mechanisms are based on the idea of using a voice reconstruction buffer at the receiver to add artificial delay to the audio stream and smooth out the jitter. Chapter thirteen describes three different control mechanisms that are able to dynamically adapt the audio application to the network conditions so as to minimize the impact of delay jitter (and packet loss). A set of performance results is reported from extensive experimentation with an Internet audio tool designed by the authors.
Collaborative applications

The next two chapters focus on collaborative applications and their design issues. Rapid prototyping within a virtual environment offers new possibilities for working, but tools to reduce the time to design a product and to examine different design alternatives are missing. The state of the art shows that current solutions offer only limited collaboration. Within the context of an extended team, the solutions do not address how to move easily from one style of working to another, nor do they define how to manage the rapid design of a complex product. Moreover, the different propositions suffer mainly from the client-server approach, which is inefficient in many ways and limits the openness of the system. Chapter fourteen presents a global methodology enabling different styles of work. It proposes new collaboration services that can be used to distribute a virtual scene between the designers. The solution, called the Distributed Building Site Metaphor, enables project management, meeting management, parallel working, disconnected work and meeting work, real-time validation, real-time modification, real-time conciliation, real-time awareness, easy motion between these styles of work, consistency, security and persistency. Much work has been devoted to the development of distributed multimedia systems in various aspects: storage, retrieval, transmission, integration and synchronization of different types of data (text, images, video and audio). However, such efforts have concentrated mostly on passive multimedia material, which has been generated or captured in advance. Yet many applications require active data, especially 3D graphics, images and animation that are generated by interactively executing programs during an ongoing session, especially in a collaborative multimedia application. These applications impose extensive computational and communication costs that cannot be supported by current bandwidth. Thus, suitable techniques have to be devised to allow flexible sharing of dynamic visual data and activities in real time, especially for collaborative applications. Chapter fifteen discusses different types of collaborative modes and addresses major issues for collaborative applications that involve dynamic visual data from four perspectives: functionality, data, communication and scalability. Current approaches for dealing with these problems are also discussed, and pertinent issues for future research are identified.
Multimedia synchronization in distributed environments

Synchronization between various kinds of media data is a key issue for multimedia presentation. Chapter sixteen discusses temporal relationships and multimedia synchronization mechanisms that ensure a temporal ordering of events in a multimedia system.

Multicasting technology and applications

Chapters seventeen and eighteen focus on multicasting technologies. Multicasting increases the user's ability to communicate and collaborate, leveraging more value from the network investment. Typical multicasting applications are video and audio conferencing for remote meetings, updates on the latest election results, replication of databases and web site information, collaborative computing activities, transmission over networks of live TV news, live transmission of multimedia training, etc. Multimedia multicasting would demand huge resources if not properly optimized. Although IP Multicast is considered a good solution for internetworking multimedia in many-to-many communications, there are issues that have not
been completely solved. Protocols are still evolving, and new protocols constantly emerge to address these issues, because that is the only way for multicast to become a true Internet service. Chapter seventeen describes multimedia transport on the Internet and IP multicasting technology, including the routing and transport protocols. It also includes discussions of the popular Multicast Backbone (MBone) and presents different aspects of multicast application policy, detailing the main multicast application design principles, including lightweight sessions, tightly coupled sessions and virtual communication architectures on the Internet. Chapter eighteen continues by describing the evolution of IP multicast from the obsolete MBone (Multicast Backbone) and intra-domain multicast routing to the actual inter-domain multicast routing scheme. Special attention is given to the challenges and problems that need to be solved, the problems that have been solved, and the way they were solved. The reader gets a complete picture of the state of the art, with the idea behind each protocol and how all those protocols work together explained. Some of the topics discussed relate broadly to address allocation, security and authentication, scope control and so on. Results and recommendations are also included in this chapter.
Mobile multimedia over wireless networks

In recent years, increasing use of multimedia over the Internet has been experienced in most application areas. The next step in the information age is mobile access to multimedia applications: everything, everywhere, any time! The last chapter of this book is a tutorial chapter that addresses a key point of this development: data transmission for mobile multimedia applications in wireless cellular networks. The main concern of this chapter is the cooperation between multimedia services and wireless cellular global networks. For network developers, the question is: what constraints does multimedia transmission impose on wireless networks? For multimedia experts, the question is rather: which constraints do existing or foreseen wireless network standards impose on multimedia applications? This chapter follows the multimedia expert's view of the problem. Having studied this chapter, the reader should be able to answer several questions, such as: Which network will be capable of transmitting real-time video? Will rainfall interrupt my mobile satellite Internet connection? When will high-bandwidth wireless networks be operational? How can existing multimedia applications be tuned to be efficient in wireless networks?
Audiences

As is evident from the above discussion, many different audiences can make use of this book. Students and teachers can use the book in courses related to multimedia networking. Professionals involved in the management and design of multimedia networks and applications will find many solutions to their questions and technological conundrums. Provocative ideas from the applications, case questions and research solutions included in this book will be useful to professionals, teachers and students in their search for design and development projects and ideas. The book will also benefit casual readers by providing them with a broader understanding of this technology.
Acknowledgments

Many people deserve credit for the successful publication of this book. I express sincere gratitude to each of the chapter authors, who contributed their ideas and expertise to bringing this book to fruition. Thanks to the many colleagues and authors who contributed invaluable suggestions in their thorough reviews of each chapter. Support from colleagues and staff in the Department of Computer and Information Sciences at Minnesota State University, Mankato helped sustain my continued interest. Many also helped with reviews of the chapters. A further special note of thanks goes to all the staff at Idea Group Publishing, whose contributions throughout the whole process, from inception of the initial idea to final publication, have been invaluable. In particular, thanks to Mehdi Khosrowpour for his encouragement to continue with this project proposal, and to Jan Travers and Michele Rossi, who continuously prodded via e-mail to keep the project on schedule. I am grateful to my parents, my wife Sharifun and my son Tahin, who by their unconditional love have steered me to this point and given me constant support. They have sacrificed my company for extended periods of time while I edited this book.

Mahbubur Rahman Syed
Editor
Chapter I
Managing Real-Time Distributed Multimedia Applications Vana Kalogeraki Hewlett-Packard Laboratories, USA Peter Michael Melliar-Smith and Louise E. Moser UC Santa Barbara, USA
Distributed multimedia applications are characterized by timing constraints and end-to-end quality of service (QoS) requirements, and therefore need efficient management mechanisms to respond to transient changes in the load or the availability of the resources. This chapter presents a real-time distributed multimedia framework, based on the Common Object Request Broker Architecture (CORBA), that provides resource management and Quality of Service for CORBA applications. The framework consists of multimedia components and resource management components. The multimedia components produce multimedia streams, and combine multimedia streams generated by individual sources into a single stream to be received by the users. The resource management components provide QoS guarantees during multimedia transmissions based on information obtained from monitoring the usage of the system's resources.
INTRODUCTION

Real-time distributed multimedia environments have set forth new challenges in the management of processor and network resources. High-speed networks and powerful end-systems have enabled the integration of new types of multimedia applications, such as video-on-demand, teleconferencing, distance learning and collaborative services, into today's computer environments. Multimedia applications are variable in nature, as they handle a combination of continuous data (such as audio and video) and discrete data (such as text, images and control information), and they impose strong requirements on data transmission, including fast transfer and substantial throughput.
The technical requirements necessary to achieve timeliness are obviously more difficult to satisfy in distributed systems, mostly because of the uncertain delays in the underlying communication subsystem. This difficulty is further exacerbated by the heterogeneity of today's systems with respect to computing, storage and communication resources, and by the high levels of resource sharing that exist in distributed systems. Multimedia tasks may involve components located on several processors with limited processing and memory resources and with shared communication resources. Different transport mechanisms, such as TCP or UDP, can be used for data transfer within local- or wide-area networks.

Distributed object computing (DOC) middleware is software built as an independent layer between the applications and the underlying operating system to enable the applications to communicate across heterogeneous platforms. At the heart of the middleware resides an object broker, such as the OMG's Common Object Request Broker Architecture (CORBA), Microsoft's Distributed Component Object Model (DCOM) or Sun's Java Remote Method Invocation (RMI). Multimedia technologies can take advantage of the portability, location transparency and interoperability that middleware provides to enable efficient, flexible and scalable distributed multimedia applications.

Developing a system that can provide end-to-end real-time and QoS support for multimedia applications in a distributed environment is a difficult task. Distributed multimedia applications are characterized by potentially variable data rates, sensitivity to losses due to the transmission of data between different locations in local- or wide-area networks, and the concurrent scheduling of multiple activities with different timing constraints and Quality of Service (QoS) requirements. Several QoS architectures (Aurrecoechea, Campbell & Hauw, 1998) that incorporate QoS parameters (such as response time, jitter and bandwidth) and QoS-driven management mechanisms across architectural layers have emerged in the literature. Examples include the QoS Broker, COMET's Extended Integrated Reference Model (XRM), the Heidelberg QoS model and the MAESTRO QoS management framework.

Providing end-to-end QoS guarantees to distributed multimedia applications requires careful orchestration of the processor resources, as multimedia interactions may lead to excessive utilization and poor quality of service, and multimedia applications can easily suffer quality degradation during a multimedia session caused by network saturation or host congestion. Efficient management of the underlying system resources is therefore essential to allow the system to maximize the utilization of the processors' resources and to adapt to transient changes in the load or in the availability of the resources.

The goal of this chapter is to present a distributed framework for coordinating and managing the delivery of real-time multimedia data. The framework manages the transmission of real-time multimedia data and uses current resource measurements to make efficient management decisions.
CORBA

The Common Object Request Broker Architecture (CORBA) (Object Management Group, 1999) developed by the Object Management Group (OMG) has become a widely accepted commercial standard for distributed object applications. CORBA provides an architecture and platform-independent programming interfaces for portable distributed object computing applications. The CORBA core includes an Object Request Broker (ORB), which acts as the message bus that provides seamless interaction between client and server objects. CORBA
Interface Definition Language (IDL) describes the functional interface to the objects and the type signatures of the methods that the object embodies. IDL interfaces are mapped onto specific programming languages (e.g., Java, C/C++, etc.). From the IDL specifications, an IDL compiler generates stubs and skeletons that are used for the communication between the client and server objects. Both the implementation details and the location of the server object are kept hidden from the client objects. Interoperability is achieved using the General Inter-ORB Protocol (GIOP) and the TCP/IP-specific Internet Inter-ORB Protocol (IIOP). CORBA’s independence from programming languages, computing platforms and networking protocols makes it highly suitable for the development of distributed multimedia applications and their integration into existing distributed systems.
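To illustrate this workflow, the following minimal C++ sketch shows a client invoking an operation through a compiler-generated stub. The VideoSource interface, its start() operation and the stub header name are hypothetical examples; the ORB calls (ORB_init, string_to_object, _narrow) are the standard CORBA C++ mapping.

// Hypothetical IDL, compiled by the vendor's IDL compiler into C++ stubs:
//   interface VideoSource {
//     void start(in string format);
//     void stop();
//   };
#include <iostream>
#include "VideoSourceC.h"   // assumed name of the generated stub header

int main(int argc, char* argv[]) {
  try {
    // Initialize the ORB, which acts as the message bus between objects.
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);

    // Obtain an object reference (here from a stringified IOR on the
    // command line) and narrow it to the typed interface.
    CORBA::Object_var obj = orb->string_to_object(argv[1]);
    VideoSource_var src = VideoSource::_narrow(obj.in());

    // Invoke a method through the stub; the ORB locates the server
    // object and marshals the request via GIOP/IIOP transparently.
    src->start("MPEG-1");
  } catch (const CORBA::Exception& e) {
    std::cerr << "CORBA exception: " << e._name() << std::endl;
    return 1;
  }
  return 0;
}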
RELATED WORK

Because of its general applicability, the OMG streaming standard (Object Management Group, 1997) has provoked interest in areas such as telecommunications, biomedicine, entertainment and security. McGrath and Chapman (1997) have successfully demonstrated that the A/V streaming specification can be used for telecommunication applications. Mungee, Surendran and Schmidt (1999) have developed an audio/video streaming service based on the OMG's A/V Streams model.

Several researchers have focused on providing QoS support for distributed multimedia applications. Hong, Kim and Park (1999) have defined a generic QoS Management Information Base (MIB) which consists of information objects that represent a set of layered QoS parameters, organized into four logical groups: service, application, system and network. Le Tien, Villin and Bac (1999) have used m-QoS and resource managers responsible for the QoS mapping and monitoring of multimedia applications and for managing the QoS of the resources. Waddington and Coulson (1997) have developed a Distributed Multimedia Component Architecture (MCA) that extends the CORBA and DCOM models to provide additional mechanisms and abstractions for continuous networked multimedia services. MCA exercises the use of those foundation object models by using object abstractions in the form of interfaces to encapsulate and abstract the functionality of multimedia devices and processing entities. MCA presents a solution which incorporates support for real-time continuous media interactions; configuration, control and QoS management of distributed multimedia services; and dual control/stream interfaces and support for basic multimedia object services, including event handling, timer and callback services. Szentivanyi and Kourzanov (1999) have provided foundation objects that define different aspects of multimedia information management. Their approach is built on two notions: (1) a model that covers all related aspects of media management, such as storage, streaming, query, manipulation and presentation of information of several media types, and (2) a distributed architecture that provides distribution, migration and access for the object model and its realization in a seamless, configurable and scalable manner.

Several research efforts have concentrated on enhancing non-CORBA distributed multimedia environments with resource management capabilities. Alfano and Sigle (1996) discuss the problems they experienced at the host and network levels when executing multimedia applications with variable resource requirements. Nahrstedt and Steinmetz (1995) have employed resource management mechanisms, emphasizing host and network resources, to guarantee end-to-end delivery for multimedia data transmission and to adapt when system resource overloading occurs. Rajkumar, Juvva, Molano and Oikawa (1998)
have introduced resource kernels to manage real-time multimedia applications with different timing constraints over multiple resources.
OVERVIEW OF THE MULTIMEDIA FRAMEWORK

Desired Features

Our CORBA framework for managing real-time distributed multimedia applications is responsible for dynamic QoS monitoring and adaptation over changing processor and network conditions. End users receive multimedia streams from different sources without the need to know the exact location of the sources or to have specialized processors to capture the multimedia data. The framework satisfies QoS requirements expressed by the users through a combination of system design choices (e.g., assigning priority/importance metrics to the multimedia objects) and physical configuration choices (e.g., allocating memory and bandwidth). The framework has the following design objectives:

1. To reduce the cost and difficulty of developing multimedia applications. End users should be provided with a convenient way of expressing their QoS requirements without having to address low-level implementation details. Information such as the port number or the host address of the endpoints should be kept transparent to the users.

2. To satisfy the QoS requirements and to meet the timing constraints specified by the users. User QoS requirements are translated into application-level parameters and are mapped into requirements for system-level resources. Providing QoS-mapping mechanisms is beneficial because the mapping is systematic and therefore can largely reduce potential user errors. Monitoring functions determine QoS violations that can cause quality degradation and lead to poor system performance.

3. To coordinate the transmission and synchronization of multimedia streams. Live synchronization requires both intra-stream and inter-stream synchronization (Biersack & Geyer, 1999). Intra-stream synchronization refers to maintaining continuity within a single multimedia stream, while inter-stream synchronization refers to preserving the temporal relationships between media units of related streams. Live synchronization requires the capture and playback/display of multimedia streams at run-time and, therefore, can tolerate end-to-end delay only on the order of a few hundred milliseconds.

4. To balance the load on the resources and to minimize system overheads. Dynamic configuration management is essential to deal with complex, scalable and evolving multimedia environments. Multimedia objects must be distributed evenly across the processors with respect to their resource requirements and dependency constraints.
Structure of the Framework

The framework manages multimedia applications and the underlying system resources in an integrated manner. The framework consists of multimedia components for managing the transmission and delivery of multimedia data and resource management components for managing the multimedia components and monitoring the underlying system resources, as shown in Figure 1.
Figure 1: The architectural components of the framework
The multimedia components consist of Suppliers that produce streams of multimedia data, a Coordinator that receives multimedia streams from different sources and combines them into a single stream, and Consumers that receive a single stream and separate the different flows in the stream for individual playback or display. The resource management components consist of Profilers that measure the usage of the resources, Schedulers that schedule the tasks of the multimedia objects and a Resource Manager that allocates the multimedia objects to the processors and takes appropriate actions to satisfy resource requirements. The Resource Manager is implemented as a set of CORBA objects that are allocated to various processors across the distributed system and replicated to increase reliability; logically, however, there is only a single copy of the Resource Manager. The Resource Manager maintains a global view of the system and is responsible for allocating the multimedia objects to the processors. The Resource Manager works in concert with the Profilers and the Schedulers. The Profiler on each processor monitors the behavior of the multimedia objects and measures the current load on the processors’ resources. It supplies information to the Resource Manager, which adapts the allocations over changing processing and networking conditions. The Scheduler on each processor exploits the information collected from the Resource Manager to schedule the multimedia objects on the processor. The multimedia components of the framework are implemented as a set of CORBA objects. The Resource Manager decides the location of those objects across the system based on the utilizations of the processors’ resources, the users’ QoS requirements and the communication among the multimedia objects. The Coordinator uses a reliable, totally ordered group communication system to multicast the multimedia data to the Consumers to achieve synchronization of the streams.
QUALITY OF SERVICE FOR DISTRIBUTED MULTIMEDIA APPLICATIONS

Quality of Service (QoS) represents the set of quantitative and qualitative characteristics that are needed to realize the level of service expected by the users (Vogel, Kerherve, Bochmann & Gecsei, 1995). There are typically many layers that determine the actual end-to-end level of service experienced by an application. User QoS parameters are translated into application-level parameters and are mapped into system-level (processor, network) parameters to control the system resources. QoS mapping is still an open research issue, largely because there are numerous ways (Alfano & Sigle, 1996) to describe QoS for each layer. The QoS mapping is performed by the resource management components of the framework, which enables the user to specify QoS requirements without having to map the QoS parameters into parameters for the underlying layers. The QoS parameters are expressed as (name, value) pairs.
User QoS Parameters

To enable users to express their QoS requirements in a simple and convenient manner, a graphical user interface is provided. User QoS parameters are specified in terms of a level of service (such as best effort or best quality) or properties that the user requires. The user must be prepared to pay a higher price when higher Quality of Service is desired. For example, a high-resolution video stream incurs a higher price in terms of increased delivery delay. User QoS requirements are expressed in terms of the media type (i.e., audio or video) and a set of media format parameters, such as the color space or the data size (i.e., width and height) of an image, or the compression technique for the frames of a video stream. Users can also specify timing constraints, such as start and end delivery times, the desired rate of transmission, the worst-case end-to-end delay and the maximum jitter. The QoS specified by the user includes media-specific parameters, if additional hardware or software constraints are imposed.
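To make the (name, value) representation concrete, the sketch below models a user-level QoS request as a small property map. The parameter names and values are illustrative assumptions, not a list defined by the framework.

#include <map>
#include <string>
#include <variant>

// One (name, value) pair per QoS parameter; a value may be numeric
// (e.g., frame rate) or symbolic (e.g., compression scheme).
using QoSValue  = std::variant<long, double, std::string>;
using QoSParams = std::map<std::string, QoSValue>;

QoSParams user_qos() {
  return {
    {"media_type",       std::string("video")},
    {"compression",      std::string("MPEG-1")},
    {"frame_rate",       25L},     // frames per second
    {"max_jitter",       0.020},   // seconds
    {"end_to_end_delay", 0.200},   // worst case, seconds
  };
}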
Application Layer

Application QoS parameters describe the characteristics of the media requested for transfer. Some of the user's parameters (e.g., end-to-end delay, rate of transmission) can be used directly as application QoS parameters, while others are translated into QoS parameters for the application. For example, for a video stream, the frame size is determined by the image height, width and color of an uncompressed image as specified by the user, and is computed as Frame_size = Width * Height * Color_resolution. A multimedia application has an associated level-of-service metric, which is explicitly defined by the user or is determined by the resource management components of the framework based on the user's QoS parameters and the other multimedia applications running in the system. In addition, priority metrics can be associated with the multimedia application as a whole or with individual frames. For example, in MPEG compression, video I-frames contain the most important information and, therefore, should have a higher priority than P-frames or B-frames. Application QoS parameters may also include media-specific information, such as the format of the video source (i.e., PAL or NTSC), the pixel data type, the compression pattern (i.e., IBP pattern for MPEG-1 compression), the bit rate and the number of images to be skipped between captures for a video transmission. The rate of
transmission can be derived from the IMAGE_SKIP parameter and the format of the video source. The maximum number of buffers determines the maximum number of images that a video card can store.
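As a worked instance of the Frame_size formula above, the short program below computes the size of one uncompressed frame and the resulting raw data rate. The 640x480 resolution, 24-bit color and 25 frames/s figures are assumed for illustration only.

#include <cstdio>

// Frame_size = Width * Height * Color_resolution, as defined above.
// Example: a 640x480 image with 24 bits (3 bytes) per pixel.
int main() {
  const long width = 640, height = 480;
  const long bytes_per_pixel = 3;                  // 24-bit color
  const long frame_size = width * height * bytes_per_pixel;
  const long frame_rate = 25;                      // frames/s, assumed

  // Uncompressed bandwidth the stream would need before MPEG coding.
  std::printf("frame size: %ld bytes\n", frame_size);           // 921600
  std::printf("raw rate:   %ld bytes/s\n", frame_size * frame_rate);
  return 0;
}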
System Layer

While the perception of QoS can vary from user to user and from application to application, user and application QoS requirements must be translated into system parameters in order to monitor and control the system resources. The processor layer determines whether there are sufficient resources to accommodate the user and application requirements. Typical parameters of this layer are the utilization of the CPU, the size of the memory, the available disk space and the load imposed by special devices used for multimedia processing. This layer also encompasses the scheduling strategy used to schedule the multimedia objects on the processors' resources. The network layer determines the transport requirements for the multimedia application, including the transport protocol to be used for the delivery of packets and packet-related parameters such as packet size, priority, ordering, transfer rate, round-trip delay, and packet error and loss rate. Different multimedia streams experience random delays in the delivery of multimedia data due to the heterogeneity of the underlying communication infrastructure. Ideally, the network would deliver the multimedia data as they are generated, with minimal or bounded delay.
DEVELOPING A DISTRIBUTED MULTIMEDIA FRAMEWORK FOR CORBA

CORBA A/V Streaming Specification

The CORBA A/V streaming specification (Object Management Group, 1997) defines a basic set of interfaces for implementing a multimedia framework that leverages the portability and flexibility provided by the middleware. The principal components of the framework are:

1. Multimedia Device (MMDevice): A multimedia device abstracts one or more items of hardware. Typical multimedia devices can be physical devices, such as a microphone, a video camera or a speaker, or logical devices, such as a video clip. An MMDevice can potentially support a number of streams simultaneously. For each individual stream, a virtual device and a stream endpoint connection are created.

2. Virtual Device (VDev): A virtual multimedia device represents the device-specific aspects of a stream. Virtual devices have associated configuration parameters, such as the media format and the coding scheme of the transmitted stream. For instance, a video camera might be capable of transmitting both JPEG and MPEG formats in the data streams. A multimedia device can contain different virtual multimedia devices with different characteristics, and different virtual devices can refer to the same physical device.

3. Stream Endpoint (StreamEndPoint): A stream endpoint terminates a stream within a multimedia device and represents the transport-specific parameters of the stream. A stream endpoint specifies the transport protocol and the host name and port number of the transmitting endpoints.
4. Stream: A stream represents continuous media transfer between two or more virtual multimedia devices. Each stream is supported by the creation of a virtual multimedia device and a stream endpoint connection representing the device-specific and network-specific aspects of a stream endpoint. A stream may contain multiple flows, each flow carrying data in one or both directions.

5. Stream Controller (StreamCtrl): A stream controller abstracts continuous media transfer between virtual devices. It supports operations to control (start, stop) the stream as a whole or the individual flows within the stream. The StreamCtrl interface is used by the application programmer to set up and manage point-to-point or multipoint streams.

Our framework uses the components of the A/V streaming specification as building blocks for the multimedia components. The advantage is that the A/V streaming specification allows the framework to hide the underlying implementation details.
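For orientation, here is a minimal sketch of how these interfaces fit together when binding two devices. bind_devs() and start() are operations of the specification's StreamCtrl interface, but the stub header name and the exact C++ mapping details are assumptions that vary by ORB product.

#include "AVStreamsC.h"  // assumed name of the IDL-generated stub header

// Bind a producing and a consuming multimedia device into one stream.
void bind_and_start(AVStreams::StreamCtrl_ptr ctrl,
                    AVStreams::MMDevice_ptr camera,
                    AVStreams::MMDevice_ptr display) {
  AVStreams::streamQoS qos;   // (name, value) QoS list; left empty here
  AVStreams::flowSpec flows;  // an empty flow spec selects all flows

  // bind_devs() asks each MMDevice to create its VDev and
  // StreamEndPoint, negotiates QoS, and connects the endpoints.
  ctrl->bind_devs(camera, display, qos, flows);

  // start() begins media transfer on the selected flows.
  ctrl->start(flows);
}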
Multimedia Components of the Framework

Figure 2 shows the UML representation of the multimedia components of the framework. The multimedia components are based on a three-layered object structure. Multimedia suppliers and consumers are represented by the Supplier and Consumer objects, respectively. Multimedia streams that originate from different Suppliers are transmitted to the Coordinator object for multiplexing into a single stream before being forwarded to the Consumers. The Coordinator object is a key component of our framework. It is responsible for the synchronization of the streams that the user wishes to receive, so that individual buffers at the endpoints are not required.
The Supplier

The Supplier (Figure 3) represents the stream endpoint from which the multimedia data are derived. The Supplier defines the media to be transferred using the MMDevice interface. Typical configuration parameters of the MMDevice object are the type (e.g., video camera, microphone) or the name (e.g., "Star Wars") of the media. The Supplier is implemented as a CORBA object and, therefore, can be located on any of the processors, but typically is associated with the physical location of the multimedia device. For example, to obtain live images from a camera or to listen to a live conversation, specific physical devices must be selected. On the other hand, to play back a video clip from a file, any of the processors can be chosen. The Supplier uses the virtual multimedia device (VDev) object to configure the specific flow transfer (i.e., by setting the video format for a video transfer) and the StreamEndPoint object to define the host address where the Supplier is located.
The Coordinator

The Coordinator (Figure 4) multiplexes different streams originating from different sources into a single stream to be transmitted to the Consumers. Specific transport parameters are associated with the Coordinator through the StreamEndPoint interface. These parameters define the host address where the Coordinator is located and the port number to which it listens. To accommodate a large number of consumers, different Coordinator objects can be configured to receive different multimedia streams. The Coordinator is an essential component of the framework and, therefore, is replicated for fault tolerance (Kalogeraki & Gunopulos, 2000).
Figure 2: UML representation of the multimedia components of the framework
Figure 3: The Supplier
Figure 4: The Coordinator
Figure 5: The Consumer
The Consumer

The Consumer (Figure 5) receives a single stream of multimedia data from the Coordinator and separates the flows that originate from different sources. These flows are subsequently supplied to video and audio buffers to be viewed or played, respectively. Compressed video images must be decompressed before they are displayed by the Consumer. The Consumer is associated with the MMDevice interface, where multiple VDev objects can be created to represent the various flows that the object is expected to receive. Typical parameters of the VDev objects are image displays and speakers. The host address where the Consumer is located is defined using the StreamEndPoint interface.
RESOURCE MANAGEMENT

Multimedia applications have high resource requirements, and lack of resource management mechanisms can lead to transmission problems, with multimedia objects competing for limited unmanaged resources. Pre-planned allocations are usually not
efficient because they can result in overloaded resources as the multimedia environment evolves over time. To provide efficient delivery of multimedia data, the framework employs resource management components that consist of a Profiler and a Scheduler for each processor and a global Resource Manager for the system (Kalogeraki, Moser & Melliar-Smith, 1999).
The Profilers

The Profiler for each processor measures the current load on the processor's resources (i.e., CPU, memory, disk) and the bandwidth being used on the communication links. The Profiler also monitors the messages exchanged between the objects and uses the information extracted from the messages to measure the execution and communication times of the objects and to compute the percentage of the resources used by the objects during execution. The Profilers supply their measurements as feedback to the Resource Manager. During operation, the Profilers may detect overloaded or insufficient resources to provide the Quality of Service required by the user. The QoS can change either because of an explicit request by the user (for example, when the user desires a higher level of service) or implicitly while the application executes. In both cases, the Profiler reports the monitored change of the QoS parameters to the Resource Manager, which can initiate negotiation with the user so that alternative QoS parameters can be selected.
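The Profiler's feedback loop can be pictured schematically as below. The chapter does not define a concrete Profiler API, so the structures, the reporting function and the one-second sampling period are all invented for illustration; a real probe would read platform-specific counters.

#include <chrono>
#include <cstdio>
#include <thread>

// Illustrative snapshot of local resource usage.
struct Measurement {
  double cpu_utilization;      // fraction of CPU busy over the interval
  double memory_used_mb;
  double link_bandwidth_mbps;
};

// Platform-specific probe; stubbed out with fixed values here.
Measurement sample_resources() { return {0.42, 96.0, 3.5}; }

// Feedback to the (logically single) Resource Manager; printed here.
void report(const Measurement& m) {
  std::printf("cpu=%.2f mem=%.1fMB bw=%.1fMbps\n",
              m.cpu_utilization, m.memory_used_mb, m.link_bandwidth_mbps);
}

// Periodically measure local resource usage and feed it back.
int main() {
  for (int i = 0; i < 3; ++i) {  // bounded for the example
    report(sample_resources());
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}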
The Schedulers

The Scheduler on each processor specifies an ordered list (schedule) for the method invocations on the objects on that processor, which defines how access to the CPU resources is granted. The schedule is based on the least-laxity scheduling algorithm. In least-laxity scheduling, the laxity of a task is defined as: Laxity = Deadline - Remaining_Computation_Time, where Deadline is the interval within which the task must be completed and Remaining_Computation_Time is the estimated remaining time to complete the multimedia task. The Scheduler calculates the Deadline and the Remaining_Computation_Time for each task, thus deriving the task laxity. The task with the least laxity is assigned the highest real-time priority, and tasks are then scheduled according to the real-time priorities assigned by the Scheduler.
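A minimal sketch of the least-laxity rule as an ordering over tasks follows. The Task fields mirror the Deadline and Remaining_Computation_Time quantities defined above, while the struct itself and its units (seconds relative to now) are assumptions.

#include <algorithm>
#include <vector>

// Laxity = Deadline - Remaining_Computation_Time, as defined above.
struct Task {
  const char* name;
  double deadline;   // seconds until the task must complete
  double remaining;  // estimated remaining computation time, seconds
  double laxity() const { return deadline - remaining; }
};

// Assign the highest real-time priority to the least-laxity task by
// sorting: after the call, tasks[0] should run first.
void least_laxity_order(std::vector<Task>& tasks) {
  std::sort(tasks.begin(), tasks.end(),
            [](const Task& a, const Task& b) {
              return a.laxity() < b.laxity();
            });
}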
The Resource Manager
The Resource Manager is implemented as a set of CORBA objects that are allocated to various processors across the distributed system. The Resource Manager objects are replicated for fault tolerance; logically, however, there is only a single copy of the Resource Manager in the system. All the replicas of the Resource Manager perform the same operations in the same order, and therefore have the same state. The Resource Manager maintains a system profile that includes the physical configuration of the system (various resources along with their specific characteristics) and the multimedia objects running on the processors. As new requests are introduced, the Resource Manager determines whether it can satisfy those requests by considering the available resources and the other multimedia applications running in the system. To make accurate decisions, the Resource Manager uses current system information, obtained from the Profilers, which increases the likelihood of meeting the QoS requirements specified by the user. For example, if a service requested by the user is available on more than one processor, the
Resource Manager selects the service from the most appropriate processor, e.g., the least-loaded processor or the processor located closest to the sender. The Resource Manager translates the QoS properties specified by the user into application QoS parameters and then allocates the necessary resources at the nodes along the path between the sender and the receiver. When insufficient resources remain to provide the required Quality of Service, the Resource Manager gradually degrades the Quality of Service for certain multimedia applications. The applications chosen are the ones with the least importance to the system, as defined by the user or determined by the Resource Manager. Alternatively, the Resource Manager attempts to reallocate the multimedia objects dynamically by migrating them to other processors. This reallocation may free some computing resources and enable the remaining objects to operate. Dynamic reallocation may also be required if a processor or other resource is lost because of a fault. If the quality of the multimedia applications continues to deteriorate, the Resource Manager can drop the least important multimedia applications so that the remaining applications can be accommodated at their desired level of service.
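One possible reading of this degradation policy can be sketched in Python; the fields, values and thresholds below are invented for illustration, and the chapter does not specify the actual algorithm.

def relieve_overload(apps, capacity):
    """Degrade the least important applications first until the total
    load fits within capacity; return None if applications would have
    to be dropped even after degrading everything to its minimum."""
    total = sum(a["load"] for a in apps)
    for app in sorted(apps, key=lambda a: a["importance"]):
        if total <= capacity:
            break
        reducible = app["load"] - app["min_load"]  # QoS headroom
        cut = min(reducible, total - capacity)
        app["load"] -= cut
        total -= cut
    return apps if total <= capacity else None

apps = [{"name": "monitor", "importance": 1, "load": 30, "min_load": 10},
        {"name": "conference", "importance": 5, "load": 50, "min_load": 40}]
print(relieve_overload(apps, capacity=70))  # the monitor's QoS is reduced first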
IMPLEMENTATION AND EXPERIMENTAL RESULTS
Prototype Implementation
The platform for our implementation and experiments consisted of six 167 MHz Sun UltraSPARCs running Solaris 2.5.1 with the VisiBroker 3.3 ORB over 100 Mbit/s Ethernet. For the implementation, we used three Supplier objects, two of which transmit live images captured from different cameras, while the third reads a video clip stored in a file. The first two Supplier objects are located on the processors equipped with the cameras, while the third Supplier object is located on a different processor. Each Supplier object transmits its data stream to a Coordinator object located on a different processor. The Coordinator waits until it receives all of the data streams sent by the Suppliers and merges them into a single stream, which it then transmits to the Consumer objects. The three Consumers receive data streams from the Coordinator and display each of the individual flows within the data stream on a separate display. Figure 6 shows the multimedia application at run-time.
The implementation uses the XIL Imaging Library for image processing and compression. XIL provides an object-oriented architecture and supplies a rich set of basic image processing functions that serve as building blocks for developing complex image processing applications. The implementation also uses the SunVideo subsystem, a real-time video capture and compression subsystem for Sun SPARCstations. The SunVideo subsystem consists of a SunVideo card, which is a digital video card, and supporting software, which captures, digitizes and compresses unmodulated NTSC and PAL video signals from video sources. MPEG-1 video coding is used for image data compression. The compressed video streams are transmitted over the network, stored on disk, and decompressed and displayed in a window on a workstation.
Performance Measurements
In our experiments we measured the end-to-end delay experienced by the video frames, i.e., the delay between the time a video frame is captured from the camera at a Supplier until the time it is transferred to a Consumer and displayed to the user.
Figure 6: Multimedia application at run-time
Of particular interest was the jitter associated with the random delays of the individual frames. Ideally, frames should be displayed continuously at a fixed frame rate. Typically, the jitter is eliminated by employing a buffer at the receiver (Stone and Jeffay, 1995). For example, if the receiver knows a priori the maximum possible end-to-end delay experienced by a video frame, it can buffer the first frame for this maximum time before displaying it. In our framework, the end-to-end delay experienced by the video frames depends on the following factors: (1) the time required for the Suppliers to capture the frames from the camera devices and compress them into a compressed frame sequence, (2) the time to transmit the compressed frame sequence from the Suppliers to the Coordinator, (3) the time required for the Coordinator to collect the compressed frame sequences from the individual Suppliers and combine them into a single stream, (4) the time to transmit the single stream from the Coordinator to the Consumers, and (5) the time required for the Consumers to separate the compressed frame sequences from the different Suppliers and decompress them for display. To determine the end-to-end delay, we assumed that the compression/decompression time for frames of the same size is approximately the same; thus, the end-to-end delay is mainly a function of the delay in the transmission of compressed frame sequences from the Suppliers to the Consumers and the delays in the collection of compressed frame sequences from different sources. Our measurements indicate that the frames are captured and displayed at the same rate by both the Suppliers and the Consumers: the introduction of the Coordinator did not result in any irregularity in the compressed frame sequence transmissions and did not introduce any additional delay in the transmissions. Figure 7 shows the delay in the transmission of frames as a function of time, with the load on the processor increasing over time. As the load on the processor increases, the delay becomes larger and more variable. System jitter refers to the variable delays arising within the end system, and is generally caused by the varying system load and the compression/decompression load. Network jitter refers to the varying delays the stream packets experience on their way from the sender to
Figure 7: Delay (in ms) for successive frames as the load on the processor is increased
Figure 8: Jitter as a function of processor load
the receiver, and is introduced by buffering at the intermediate nodes. For our application, we measured both the system jitter and the network jitter. The jitter mainly depends on the load on the processors at which the Suppliers and the Consumers are located. To demonstrate the effect of the processor load on the jitter, we introduced a random increase in the load of the processor at which the Supplier was located and the frames were captured. We measured the jitter when both a single Supplier and two Suppliers were used to transmit live images to the Consumer. Figure 8 shows that the jitter is larger for a single Supplier than for two Suppliers and that it increases unacceptably when the load on the processor is increased above 0.5.
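For reference, jitter of this kind can be estimated online with a running filter in the style of the RTP specification's interarrival-jitter estimator. The Python sketch below is illustrative only, with invented input times; it is not the instrumentation the authors used.

def jitter_series(send_times, recv_times):
    """Running interarrival-jitter estimate: J += (|D| - J) / 16, where
    D is the change in transit time between consecutive packets."""
    jitter, prev_transit, estimates = 0.0, None, []
    for sent, received in zip(send_times, recv_times):
        transit = received - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
        estimates.append(jitter)
    return estimates

# e.g., frames sent every 40 ms, received with variable delay
print(jitter_series([0, 40, 80, 120], [10, 55, 92, 140])[-1])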
CONCLUSION
Distributed object computing is becoming a popular paradigm for next-generation multimedia technologies. Several efforts have demonstrated that CORBA provides a powerful platform for developing distributed multimedia applications. By leveraging the flexibility, portability and interoperability that CORBA provides, we can build real-time distributed multimedia applications more easily. We have designed a framework for managing real-time distributed multimedia applications based on CORBA. The multimedia components of the framework consist of Suppliers that produce streams of multimedia data, a Coordinator that receives the multimedia streams generated by the individual Suppliers and combines them into a single stream, and Consumers that receive the single stream of multimedia data and separate the flows within the stream to be viewed or played individually. The resource management components of the framework consist of Profilers that monitor the usage of the resources and the behavior of the application objects, Schedulers that schedule the tasks of the multimedia objects, and a Resource Manager that allocates the multimedia objects to the processors, sharing the resources efficiently and adapting the allocations to changing processing and network conditions.
REFERENCES
Alfano, M. and Sigle, R. (1996). Controlling QoS in a collaborative multimedia environment. Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, 340-347, Syracuse, NY: IEEE Computer Society.
Aurrecoechea, C., Campbell, A. T. and Hauw, L. (1998). A survey of QoS architectures. Multimedia Systems, 6(3), 138-151.
Biersack, E. and Geyer, W. (1999). Synchronized delivery and playout of distributed stored multimedia streams. Journal of Multimedia Systems, 7, 70-90.
Hong, J. W. K., Kim, J. S. and Park, J. T. (1999). A CORBA-based quality of service management framework for distributed multimedia services and applications. IEEE Network, 13(2), 70-79.
Kalogeraki, V., Moser, L. E. and Melliar-Smith, P. M. (1999). Using multiple feedback loops for object profiling, scheduling and migration in soft real-time distributed object systems. Proceedings of the 2nd IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, 291-300, Saint Malo, France: IEEE Computer Society.
Kalogeraki, V. and Gunopulos, D. (2000). Managing multimedia streams in distributed environments using CORBA. Sixth International Workshop on Multimedia Information Systems, Chicago, IL, 114-123.
McGrath, D. and Chapman, M. (1997). A CORBA framework for multimedia streams. Proceedings of TINA'97-Global Convergence of Telecommunications and Distributed Object Computing, 239-243, Santiago, Chile: IEEE Computer Society.
Mungee, S., Surendran, N. and Schmidt, D. (1999). The design and performance of a CORBA audio/video streaming service. Proceedings of the 32nd Annual IEEE Hawaii International Conference on System Sciences, 14, Maui, HI: IEEE Computer Society.
Nahrstedt, K. and Steinmetz, R. (1995). Resource management in networked multimedia systems. Computer, 28(5), 52-63.
Object Management Group, Inc. (1997). Control and Management of Audio/Video Streams Specification, 1.0.
Object Management Group, Inc. (1999). The Common Object Request Broker: Architecture and Specification, 2.3.1.
Rajkumar, R., Juvva, K., Molano, A. and Oikawa, S. (1998). Resource kernels: A resource-centric approach to real-time and multimedia systems. Multimedia Computing and Networking, 150-164, San Jose, CA: SPIE-International Society for Optical Engineering.
Stone, D. L. and Jeffay, K. (1995). An empirical study of delay jitter management policies. Journal of Multimedia Systems, 2(6), 267-279.
Szentivanyi, G. and Kourzanov, P. (1999). A generic, distributed and scalable multimedia information management framework using CORBA. Proceedings of the 32nd Annual IEEE Hawaii International Conference on System Sciences, 15, Maui, HI: IEEE Computer Society.
Le Tien, D., Villin, O. and Bac, C. (1999). Resource managers for QoS in CORBA. Proceedings of the 2nd IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, 213-222, Saint Malo, France: IEEE Computer Society.
Vogel, A., Kerherve, B., Bochmann, G. V. and Gecsei, J. (1995). Distributed multimedia applications and quality of service: A survey. IEEE Multimedia, 2(2), 10-19.
Waddington, D. G. and Coulson, G. (1997). A distributed multimedia component architecture. Proceedings of the First International Enterprise Distributed Object Computing Workshop, 334-345, Gold Coast, Queensland, Australia: IEEE Computer Society.
Chapter II
Building Internet Multimedia Applications: The Integrated Service Architecture and Media Frameworks

Zhonghua Yang, Robert Gay and Chee Kheong Siew
Nanyang Technological University, Singapore

Chengzheng Sun
Griffith University, Australia
The Internet has become a ubiquitous service environment. This development provides tremendous opportunities for building real-time multimedia applications over the Internet. In this chapter, we present state-of-the-art coverage of the Internet integrated service architecture and two multimedia frameworks that support the development of real-time multimedia applications. The Internet integrated service architecture supports a variety of service models beyond the current best-effort model. A set of new real-time protocols that constitute the integrated service architecture is described in some detail. The new protocols covered are those for real-time media transport, media session setup and control, and resource reservation in order to offer guaranteed service. We then describe two emerging media frameworks that provide a high-level abstraction for developing real-time media applications over the Internet: the CORBA Media Streaming Framework (MSF) and the Java Media Framework (JMF), both of which provide object-oriented multimedia middleware. Future trends are also discussed.
INTRODUCTION
The Internet has gone from near-invisibility to near-ubiquity and has penetrated every aspect of society in the past few years (Department of Commerce, 1998). The application scenarios have also changed dramatically and now demand a more sophisticated service model from the network. A service model consists of a set of service commitments; in other words, in response to a service request the network commits to deliver some service. Despite its tremendous growth, the Internet is still largely based on a very simple service model, best effort, providing no guarantee on the correct and timely delivery of data packets. Each request to send is honored by the network as best it can. This is the weakest possible service: packets are forwarded by routers solely on the basis that there is a known route, irrespective of traffic conditions along that route. Routers that are overloaded are allowed to discard packets. This simplicity has probably been one of the main reasons for the success of IP technology.
The best-effort service model, combined with an efficient transport-layer protocol (TCP), is perfectly suited to a large class of applications that tolerate variable delivery rates and delays. This class of applications is called elastic applications. Interactive burst communication (telnet), interactive bulk transfers (FTP) and asynchronous bulk transfers (electronic mail, fax) are all examples of such elastic applications. Elastic applications are insensitive to delay since the receiver can always wait for data that is late, and the sender can usually retransmit any data that is lost or corrupted. However, for a real-time application, there are two problems with using this service model: if the sender and/or receiver are humans, they simply cannot tolerate arbitrary delays; and if the rate at which video and audio arrive is too low, the signal becomes incomprehensible. To support real-time Internet applications, the service model must address those services that relate most directly to the time of delivery of data. Real-time applications like video and audio conferencing typically require stricter guarantees on throughput and delay. The essence of real-time service is the requirement for some service guarantees in terms of timing. Since 1990 there has been a great deal of effort by the Internet Engineering Task Force (IETF) to add a broad range of services to the Internet service model, resulting in the Internet Integrated Service model (Braden, Clark and Shenker, 1994; Crowcroft, Handley and Wakeman, 1999). The Internet Integrated Services Model defines five classes of service, which should satisfy the requirements of the vast majority of future applications:
1. Best Effort: As described above, this is the traditional service model of the Internet.
2. Fair: This is an enhancement of the traditional model, where there are no extra requests from the users, but the routers attempt to partition network resources in some fair manner. This is typically implemented by adopting a random-drop policy when encountering overload, possibly combined with some simple round-robin serving of different sources.
3. Controlled load: This is an attempt to provide a degree of service guarantee such that the network appears to the user as if there is little other traffic; it makes no other guarantees. Admission control is usually imposed so that the performance perceived is as if the network were over-engineered for those that are admitted.
4. Predictive service: This service gives a delay bound which is as low as possible and, at the same time, stable enough that the receiver can estimate it.
5. Guaranteed service: This is where the delay perceived by a particular source or group is bounded within some absolute limit. This service model implies that resource reservation and admission control are key building blocks of the service.
The Internet that provides these integrated services is called the Integrated Service Internet. In order to use integrated services such as multimedia applications, the user must have a workstation that is equipped with built-in multimedia hardware (audio codec and video frame grabbers). However, the realization of the Integrated Service Internet fundamentally depends upon the following enabling technologies:
High-Bandwidth Connection: Fast, high-bandwidth Internet access and connection are important. For example, an Internet user probably will not spend 46.3 minutes transferring a 3.5-minute video clip (approximately the amount of video represented by a 10-megabyte file) over a 28.8 Kbps modem. But users would wait if it took only a few seconds to download the same file (8 seconds using a 10 Mbps cable modem); a worked calculation appears at the end of this section. Obviously, the bandwidth of an Internet connection is a prime determinant of Internet multimedia service. Telephone companies, satellite companies, cable service providers and Internet service providers are working to create faster Internet connections and expand the means by which users can access the Internet. New Internet access technologies such as ADSL (Asymmetric Digital Subscriber Line) enable copper telephone lines to send data at speeds up to 8 megabits per second (Mbps). For wireless access, 28.8 Kbps is widely available, and a bandwidth of 1.5 Mbps is offered by the Local Multipoint Distribution Service (LMDS) and the Multi-Channel Multipoint Distribution Service (MMDS). Internet access using cable modems can reach speeds of 1.2-27 Mbps.
Real-Time Protocols: Complementing the TCP/UDP/IP protocol stack, a new set of real-time protocols is required to provide end-to-end delivery services for data with real-time characteristics (e.g., interactive audio and video).
IP-Multicasting: Most of the widely used traditional Internet applications, such as WWW browsers and email, operate between one sender and one receiver. In many emerging multimedia applications, such as Internet video conferencing, one sender will transmit to a group simultaneously. For these applications, IP-multicast is a requirement, not an option, if the Internet is going to scale.
Digital Multimedia Applications: Using IP-multicast and real-time protocols as the underlying support, sophisticated multimedia applications can be developed. There are two approaches to multimedia application development: directly using network APIs (IP sockets) or using media frameworks as middleware. The MBone (Internet Multicast Backbone) applications are developed using Internet socket APIs, while the Java Media Framework and the CORBA Media Stream Framework provide middleware-like environments for multimedia application development.
In this chapter, we describe two aspects of the latest development of the Internet: the Internet multimedia protocol architecture and the media frameworks. The Internet multimedia protocol architecture supports the integrated service models beyond the current TCP/IP best-effort service model. These service models are presented, and a set of real-time protocols is described in some detail. The development of multimedia applications over the Internet is a complex task. The emerging media frameworks, the CORBA Media Stream Framework and the Java Media Framework, provide an environment that hides the details of media capturing, rendering and processing. They also abstract away the details of underlying network protocols. These two frameworks are described. Future trends are also discussed.
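To make the access-bandwidth comparison above concrete, the transfer-time arithmetic behind the quoted figures can be written out as a trivial Python sketch:

def transfer_seconds(size_bytes: float, bits_per_second: float) -> float:
    # size in bytes times 8 bits per byte, divided by the link rate
    return size_bytes * 8 / bits_per_second

print(transfer_seconds(10e6, 28.8e3) / 60)  # ~46.3 minutes over a 28.8 Kbps modem
print(transfer_seconds(10e6, 10e6))         # 8.0 seconds over a 10 Mbps cable modem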
THE INTERNET MULTIMEDIA PROTOCOL ARCHITECTURE
Since a large portion of Internet applications use TCP/IP protocol stacks, it is tempting to think of using TCP and other reliable transport protocols (for example, XTP) to deliver real-time multimedia data (audio and video). However, as Schulzrinne argued, TCP and other reliable transport protocols are not appropriate for real-time delivery of audio and video. The reasons include the following (Schulzrinne, 1995):
• TCP cannot support multicast, which is a fundamental requirement for large-scale multimedia applications (e.g., conferencing).
• Real-time data is delay-sensitive. Real-time multimedia applications can tolerate some data loss but will not accept the delay. The TCP mechanism ensures reliability by retransmitting lost packets and forcing the receiver to wait. In other words, reliable transmission as provided by TCP is neither appropriate nor desirable.
• Audio and video are stream data with a natural rate. When network congestion is experienced, the congestion control for media data is to have the transmitter change the media encoding, video frame rate or video image size based on feedback from receivers. The TCP congestion-control mechanism (slow start), in contrast, decreases the congestion window when packet losses are detected. This sudden decrease in data rate would starve the receiver.
• The TCP (or XTP) protocol headers do not contain the timestamp and encoding information needed by the receiving applications. In addition, the TCP (or XTP) headers are larger than a UDP header.
• Even in a LAN with no losses, TCP would suffer from the initial slow-start delay.
As described previously, the integrated services Internet offers a class of service models beyond TCP/IP's best-effort service, and thus it imposes strict new requirements for a new generation of Internet protocols (Clark and Tennenhouse, 1990). In the Integrated Service Internet, a single end system will be expected to support applications that orchestrate a wide range of media (audio, video, graphics and text) and access patterns (interactive, bulk transfer and real-time rendering). The integrated services will generate new traffic patterns with performance considerations (such as delay and jitter tolerance) that are not addressed by the present Internet TCP/IP protocol architecture. The integrated service Internet is expected to have very high bandwidth and even operate at gigabit rates; at such speeds, the current Internet protocols would present a bottleneck, since fast networks demand very low protocol overhead. Furthermore, the new generation of protocols for integrated services will operate over the range of coming network technology, including Broadband ISDN, which is based on a small fixed-size cell switching (ATM) mode different from classic packet switching.
The set of Internet real-time protocols, which constitutes the Internet Multimedia Protocol Architecture, represents a new style of protocols. The new style follows the principles of application level framing and integrated layer processing proposed by Clark and Tennenhouse (1990). According to this principle, the real-time application is to have the option of dealing with a lost data unit, either by reconstituting it or by ignoring the error. The current Internet transport protocols such as TCP do not have this feature. To achieve this, losses are expressed in terms meaningful to the real-time application.
In other words, the application should break the data into suitable aggregates (frames), and lower levels should preserve these frame boundaries as they process the data. These aggregates are called Application Data Units (ADUs), which take the place of the packet as the unit of manipulation. This principle is called Application Level Framing. From
an implementation perspective, the current implementation of layered protocol suites restricts processing to a small subset of the suite's layers. For example, network-layer processing is largely independent of upper-layer protocols. A study shows that presentation can cost more than all other manipulations combined; therefore, it is necessary to keep the processing pipeline going. In other words, the protocols are structured so as to permit all the manipulation steps in one or two integrated processing loops, instead of performing them serially as is most often done traditionally. This engineering principle is called Integrated Layer Processing. In this approach to protocol architecture, the different functions are "next to" each other, not "on top of" each other as in a layered architecture. The integrated service Internet protocol architecture is shown in Figure 1. As shown, the overall multimedia data and control architecture currently incorporates a set of real-time protocols, which include the real-time transport protocol (RTP) for transporting real-time data and providing QoS feedback (Schulzrinne, Casner, Frederick and Jacobson, 2000), the real-time streaming protocol (RTSP) for controlling delivery of streaming media (Schulzrinne, Rao and Lanphier, 1998), the session announcement protocol (SAP) for advertising multimedia sessions via multicast (Handley, Perkins and Whelan, 1999) and the session description protocol (SDP) for describing multimedia sessions (Handley and Jacobson, 1998). In addition, it includes the session initiation protocol (SIP), which is used to invite the interested parties to join the session (Handley, Schulzrinne, Schooler and Rosenberg, 2000); the functionality and operation of SIP does not depend on any of these other protocols. Furthermore, the Resource ReSerVation Protocol (RSVP) is designed for reserving network resources (Braden, Zhang, Berson and Herzog, 1997; Zhang, Deering, Estrin, Shenker and Zappala, 1993; Braden and Zhang, 1997). These protocols, together with IP-Multicast, are the underlying support for Internet multimedia applications. We will describe these protocols in some detail. IP-Multicast is described in great detail in Deering and Cheriton (1990); Deering (1991); Obraczka (1998); Floyd, Jacobson, Liu, McCanne and Zhang (1997); Paul, Sabnani, Lin and Bhattacharyya (1997); and Hofmann (1996). Note that in the context of multimedia communication, a session is defined as "a set of multimedia senders and receivers and the data streams flowing from senders to receivers. A multimedia conference is an example of a multimedia session" (Handley and Jacobson, 1998). As defined, a callee can be invited several times, by different calls, to the same session. In the following, we describe the real-time Internet protocols that constitute the Internet multimedia protocol architecture.
Figure 1: Internet protocol architecture for multimedia. Multimedia applications (e.g., the MBone tools) sit over RTP/RTCP and the multimedia session setup and control protocols (RTSP, SDP, SAP, SIP), alongside reliable multicast, RSVP, HTTP and SMTP; these run over UDP and TCP, which in turn run over IP with IP multicast and integrated service forwarding (best-effort, guaranteed).
REAL-TIME TRANSPORT PROTOCOLS: RTP AND RTCP
We now describe the Internet protocols designed within the Internet Engineering Task Force (IETF) to support real-time multimedia conferencing. We first present the transport protocols, then describe the protocols for session control (session description, announcement and invitation). These protocols normally work in collaboration with IP-multicast, although they can also work with unicast protocols.
A common real-time transport protocol (RTP) is primarily designed to satisfy the needs of multi-participant multimedia conferences (Schulzrinne, Casner, Frederick and Jacobson, 2000). Note that the name "transport protocol" is somewhat misleading, as RTP is currently mostly used together with UDP, which is a designated Internet transport protocol. Naming RTP a transport protocol emphasizes that RTP is an end-to-end protocol, and it provides end-to-end delivery services for data with real-time characteristics, such as interactive audio and video. Those services include payload type identification, sequence numbering, timestamping and delivery monitoring. As discussed earlier, the design of RTP adopts the principle of Application Level Framing (ALF). That is, RTP is intended to be malleable, providing the information required by a particular application, and will often be integrated into the application processing rather than being implemented as a separate layer. RTP is a protocol framework that is deliberately not complete; it specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols, in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed. Therefore, a complete specification of RTP for a particular application will require one or more companion documents, typically a profile specification and a payload format specification. A profile defines a set of payload-type codes and their mapping to payload formats (e.g., media encodings). A profile may also define extensions or modifications to RTP that are specific to a particular class of applications. Typically an application will operate under only one profile. A profile for audio and video data is specified in Schulzrinne and Casner (1999). A payload format specification defines how a particular payload, such as an audio or video encoding, is to be carried in RTP.
RTP is carried on top of IP and UDP (see Figure 1). RTP consists of two closely linked parts, a data part and a control part. Although named real-time, RTP does not provide any mechanism to ensure timely delivery or provide other quality of service guarantees; it is augmented by the control protocol, RTCP, which is used to monitor the quality of data distribution. RTCP also provides control and identification mechanisms for RTP transmissions. Continuous media data is carried in RTP data packets, while flow control functionality uses the control packets. If quality of service is essential for a particular application, RTP can be used over a resource reservation protocol, RSVP, which will be described below. In summary:
• the real-time transport protocol (RTP) carries data that has real-time properties;
• the RTP control protocol (RTCP) monitors the quality of service and conveys information about the participants in an on-going session.
In RTP, the source of a media stream is called the synchronization source (SSRC). All packets from a synchronization source form part of the same timing and sequence number space, so a receiver can group packets by synchronization source for playback. Examples of
synchronization sources include the sender of a stream of packets derived from a single source such as a microphone or a camera. A synchronization source may change its data format, e.g., audio encoding, over time. The receiver has to be able to tell each source apart, so that those packets can be placed in the proper context and played at the appropriate time for each source. In the following, we describe the RTP data transfer protocol and the RTP control protocol in some detail. But first we have to discuss the network components that constitute an RTP media network.
RTP Network Configuration
The RTP media network typically consists of end systems and intermediate systems. The end system is the one that generates the continuous media stream and delivers it to the user. Every original RTP source is identified by a source identifier, and this identifier is carried in every packet as described below. In addition, RTP allows flows from several sources to be mixed or translated in intermediate systems, resulting in a single flow or a new flow with different characteristics. When several flows are mixed, each mixed packet contains the source IDs of all the contributing sources. Generally, the RTP intermediate system is an RTP relay within the network. The purpose of having RTP relays is that the media data can be transmitted on links of different bandwidth in different formats. There are two types of RTP relay: the mixer and the translator.
To illustrate why RTP relays (mixers and translators) are desirable network entities, consider the case where participants in one area are connected through a low-speed link to the majority of the conference participants, who enjoy high-speed network access. Instead of forcing everyone to use a lower bandwidth, reduced-quality audio encoding, a mixer may be placed near the low-bandwidth area. This mixer resynchronizes incoming audio packets to reconstruct the constant 20 ms spacing generated by the sender, mixes these reconstructed audio streams into a single stream, translates the audio encoding to a lower bandwidth one, and forwards the lower bandwidth packet stream across the low-speed link. To achieve this, the RTP header includes a means, the Contributing Source Identifier (CSRC) field, for mixers to identify the sources that contributed to a mixed packet so that a correct indication can be provided at the receivers. In general, a mixer receives RTP packets from one or more sources, possibly changes their data format, combines them in some manner and then forwards a new RTP packet. All data packets originating from a mixer are identified as having the mixer as their synchronization source.
In the other case, some of the intended participants in the audio conference may be connected with high-bandwidth links but might not be directly reachable via IP multicast. For example, they might be behind an application-level firewall that will not let any IP packets pass. For these sites, mixing may not be necessary; instead another type of RTP-level relay called a translator may be used. Two translators are installed, one on either side of the firewall, with the outside one funneling all multicast packets received through a secure connection to the translator inside the firewall. The translator inside the firewall sends them again as multicast packets to a multicast group restricted to the site's internal network. Using translators, a group of hosts speaking only IP/UDP can be connected to a group of hosts that understand only ST-II, or the packet-by-packet encoding of video streams can be translated from individual sources without resynchronization or mixing. A translator forwards RTP packets with their synchronization source intact. Mixers and translators may be designed for a variety of purposes. A collection of mixers and translators is shown in Figure 2 to illustrate their effect on SSRC and CSRC identifiers.
Figure 2: The RTP network configuration with end systems, mixers and translators (legend: Source:SSRC (CSRC, ...))
Figure 3: RTP fixed header (fields: V=2, P, X, CC, M, PT, sequence number, timestamp, synchronization source (SSRC) identifier, contributing source (CSRC) identifiers)
In the figure, end systems are shown as rectangles (named E), translators as triangles (named T) and mixers as ovals (named M). The notation "M1:48(1,17)" designates a packet originating from mixer M1, identified by M1's (random) SSRC value of 48 and two CSRC identifiers, 1 and 17, copied from the SSRC identifiers of packets from E1 and E2.
RTP Data Transfer Protocol
Every RTP data packet consists of a header followed by the payload (e.g., a video frame or a sequence of audio samples). The RTP header is formatted as in Figure 3. The first 12 octets are present in every RTP packet, while the list of CSRC identifiers is present only when inserted by a mixer. As shown in Figure 3, the RTP header contains the following information:
1. version (V): 2 bits. This field identifies the version of RTP. The current version is two (2).
2. padding (P): 1 bit. If the padding bit is set, the packet contains one or more additional padding octets at the end which are not part of the payload.
3. extension (X): 1 bit. If the extension bit is set, the fixed header must be followed by exactly one header extension.
4. CSRC count (CC): 4 bits. The CSRC count contains the number of CSRC identifiers that follow the fixed header.
5. marker (M): 1 bit. The interpretation of the marker is defined by a profile. For video, it marks the end of a frame.
6. payload type (PT): 7 bits. This field identifies the format of the RTP payload and determines its interpretation by the application, for example JPEG video or GSM audio. A receiver must ignore packets with payload types that it does not understand.
7. sequence number: 16 bits. The sequence number increments by one for each RTP data packet sent, and may be used by the receiver to detect packet loss and to restore packet sequence within a series of packets with the same timestamp.
8. timestamp: 32 bits. The timestamp reflects the sampling instant of the first octet in the RTP data packet. The sampling instant must be derived from a clock that increments monotonically and linearly in time to allow synchronization and jitter calculations. The resolution of the clock must be sufficient for the desired synchronization accuracy (Yang, Sun, Sattar and Yang, 1999) and for measuring packet arrival jitter.
9. SSRC: 32 bits. The Synchronization Source (SSRC) field identifies the synchronization source. This identifier is chosen randomly, with the intent that no two synchronization sources within the same RTP session will have the same SSRC identifier.
10. CSRC list: 0 to 15 items, 32 bits each. The CSRC list identifies the contributing sources for the payload contained in this packet. The number of identifiers is given by the CC field.
Readers may note that RTP packets do not contain a length indication; the lower layer, therefore, has to take care of framing.
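As a companion to the field list above, the following Python sketch parses the fixed header and CSRC list from a raw packet. It is illustrative only and omits padding and header-extension processing.

import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-octet fixed RTP header plus any CSRC list (sketch)."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-octet fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    header = {
        "version": b0 >> 6,            # V: 2 bits, currently 2
        "padding": (b0 >> 5) & 1,      # P: 1 bit
        "extension": (b0 >> 4) & 1,    # X: 1 bit
        "csrc_count": b0 & 0x0F,       # CC: 4 bits
        "marker": b1 >> 7,             # M: 1 bit
        "payload_type": b1 & 0x7F,     # PT: 7 bits
        "sequence_number": seq,        # 16 bits
        "timestamp": ts,               # 32 bits
        "ssrc": ssrc,                  # 32 bits
    }
    # CSRC list: 0 to 15 identifiers of 32 bits each, inserted by mixers
    n = header["csrc_count"]
    header["csrc_list"] = list(struct.unpack("!%dI" % n, packet[12:12 + 4 * n]))
    return header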
Real-Time Control Protocol (RTCP)
RTCP occupies the place of a transport protocol in the protocol stack. However, RTCP does not transport application data but is rather an Internet control protocol, like the Internet Control Message Protocol (ICMP), the Internet Group Management Protocol (IGMP), or routing protocols. Thus, RTCP works as the control protocol in conjunction with RTP (Schulzrinne, Casner, Frederick and Jacobson, 2000). Real-Time Control Protocol (RTCP) packets supplement each RTP flow. RTCP control packets are periodically transmitted by each participant in an RTP session to all other participants. Feedback of information to the application can be used to control performance and for diagnostic purposes. RTCP performs the following four functions:
1. Provide feedback information to the application: The primary function is to provide information to an application regarding the quality of data distribution. Each RTCP packet contains sender and/or receiver reports that report statistics useful to the application. These statistics include the number of packets sent, the number of packets lost, inter-arrival jitter, etc. This reception quality feedback is useful for the sender, receivers and third-party monitors. For example, the sender may modify its transmissions based on the feedback; receivers can determine whether problems are local, regional or global; network managers may use information in the RTCP packets to evaluate the performance of their networks for multicast distribution. Note that each RTCP packet contains the NTP and RTP timestamps of data senders, which help intermedia synchronization.
2. Identify RTP source: RTCP carries a transport-level identifier for an RTP source, called the canonical name (CNAME). This CNAME is used to keep track of the participants in an RTP session. Receivers use the CNAME to associate multiple data streams from a given participant in a set of related RTP sessions, e.g., to synchronize audio and video.
3. Control RTCP transmission interval: To prevent control traffic from overwhelming network resources and to allow RTP to scale up to a large number of session participants, control traffic is limited to at most 5% of the overall session traffic. This limit is enforced by adjusting the rate at which RTCP packets are periodically transmitted as a function of the number of participants. Since each participant multicasts control packets to everyone else, each can keep track of the total number of participants and use this number to calculate the rate at which to send RTCP packets.
4. Convey minimal session control information: As an optional function, RTCP can be used as a convenient method for conveying a minimal amount of information to all session participants. For example, RTCP might carry a personal name to identify a participant on the user's display. This function might be useful in loosely controlled sessions where participants informally enter and leave the session.
The RTP Control Protocol defines five RTCP packet types to carry a variety of control information: sender report (SR), receiver report (RR), source description (SDES), packet BYE to indicate end of participation and packet APP for application-specific functions. The control packets are distributed to all participants in the session using the same distribution mechanism as the data packets. The underlying protocol must provide multiplexing of the data and control packets, for example using separate port numbers with UDP. Since RTCP packets are sent periodically by each session member, a balance must be struck between the desire for up-to-date control information and the desire to limit control traffic to a small percentage (5%) of data traffic. The RTCP protocol specification presents an algorithm to compute the RTCP transmission interval; a simplified version is sketched below. RTP and RTCP packets are usually transmitted using the UDP/IP service. Figure 4 shows an RTP packet encapsulated in a UDP/IP packet. However, RTP is designed to be transport-independent and thus can be run on top of other transport protocols, even directly over AAL5/ATM.
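The interval computation can be sketched as follows. This is a simplified illustration of the 5% rule described above, not the full algorithm in the RTCP specification (which adds randomization and other refinements); the example numbers are invented.

def rtcp_interval(members, avg_rtcp_size_bytes, session_bw_bps,
                  rtcp_fraction=0.05, minimum=5.0):
    """More participants means each one reports less often, so that the
    aggregate RTCP traffic stays within 5% of the session bandwidth."""
    rtcp_bw_bytes = session_bw_bps * rtcp_fraction / 8  # bytes/sec for RTCP
    interval = members * avg_rtcp_size_bytes / rtcp_bw_bytes
    return max(minimum, interval)

print(rtcp_interval(members=100, avg_rtcp_size_bytes=120,
                    session_bw_bps=128000))  # 15.0 seconds between reports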
PROTOCOLS FOR MULTIMEDIA SESSION SETUP AND CONTROL
There are two basic forms of multimedia session setup mechanism: session advertisement and session invitation. Session advertisements are provided using a session directory, and session invitation (inviting a user to join a session) is provided using a session invitation protocol such as SIP (described below) (Handley, Schulzrinne, Schooler and Rosenberg, 2000) or the Packet-Based Multimedia Communication Systems standard H.323 (ITU, 1998). Before a session can be advertised, it must be described using the session description protocol (SDP). SDP describes the content and format of a multimedia session, and the session announcement protocol (SAP) is used to distribute it to all potential session recipients. SDP and SAP are described below.
Figure 4: An RTP packet encapsulated in a UDP/IP packet (IP header | UDP header | RTP header | RTP payload)
The Session Description Protocol (SDP)
The session description protocol is used for general real-time multimedia session description purposes. It assists the advertisement of conference sessions and communicates the relevant conference setup information to prospective participants. SDP is designed to convey such information to recipients. This protocol is purely a format for session description. In other words, it does not incorporate a transport protocol, and is intended to be carried by different transport protocols as appropriate, including the Session Announcement Protocol (SAP), the Session Initiation Protocol (SIP), the Real-Time Streaming Protocol (RTSP), electronic mail using the MIME extensions and the Hypertext Transport Protocol (HTTP).
SDP serves two primary purposes. It is a means to communicate the existence of a session, and is a means to convey sufficient information to enable joining and participating in the session. In a unicast environment, only the latter purpose is likely to be relevant. A session description contains the following:
1. Session name and purpose.
2. The media comprising the session: the type of media (video, audio); the transport protocol (RTP/UDP/IP, H.320); the format of the media (H.261 video, MPEG video).
3. Time(s) the session is active: an arbitrary list of start and stop times bounding the session; for each bound, repeat times such as "every Wednesday at 10am for one hour."
4. Information needed to receive those media (addresses, ports, formats and so on).
As resources necessary to participate in a session may be limited, some additional information may also be desirable:
1. Information about the bandwidth to be used by the conference.
2. Contact information for the person responsible for the session.
In general, SDP must convey sufficient information to be able to join a session (with the possible exception of encryption keys) and to announce the resources to be used to non-participants that may need to know. An SDP description is formatted using the following keys:
Session description
v= (protocol version)
o= (owner/creator, session identifier)
s= (session name)
i=* (session information)
u=* (URL of description)
e=* (email address)
p=* (phone number)
c=* (connection, IN: Internet, IP address/ttl)
b=* (bandwidth information)
z=* (time zone adjustments)
k=* (encryption key)
a=* (session attribute)
Time description
t= (time the session is active)
r=* (zero or more repeat times)
Media description
m= (media name and transport address)
i=* (media title)
c=* (connection information)
b=* (bandwidth information)
k=* (encryption key)
a=* (media attribute lines)
An example SDP description is:
v=0
o=yang 2890844526 2890842807 IN IP4 132.234.86.1
s=Internet Integrated Service Architecture
i=An overview of Internet service models and protocol architecture
u=http://www.cit.gu.edu.au/~yang/services.pdf
e=[email protected] (Zhonghua Yang)
p=+61 7 3875 3855
c=IN IP4 132.234.86.1/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 31
The SDP description is announced using the Session Announcement Protocol (SAP) described next.
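Because SDP is a simple line-oriented text format, a minimal reader is easy to sketch. The following illustrative Python splits a description like the one above into its session-level fields and media sections; it is not a validating parser.

def parse_sdp(text: str):
    """Each line is a one-letter key, '=', and a value; an 'm=' line
    opens a new media section (sketch only, no validation)."""
    session, media, current = {}, [], None
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        if key == "m":                 # new media description
            current = {"m": value}
            media.append(current)
        elif current is not None:
            current.setdefault(key, []).append(value)
        else:
            session.setdefault(key, []).append(value)
    return session, media

example = ("v=0\no=yang 2890844526 2890842807 IN IP4 132.234.86.1\n"
           "s=Internet Integrated Service Architecture\n"
           "m=audio 49170 RTP/AVP 0\nm=video 51372 RTP/AVP 31\n")
print(parse_sdp(example)[1])  # the two media sections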
Session Announcement Protocol (SAP)
In order to assist the advertisement of multicast multimedia conferences and other multicast sessions, and to communicate the relevant session setup information to prospective participants, a distributed session directory may be used. An instance of such a session directory periodically multicasts packets, which contain a description of the session, and these advertisements are received by potential participants who can use the session description to start the tools required to participate in the session. SAP defines an announcement protocol to be used by session directory clients (Handley, Perkins and Whelan, 1999). Sessions are described using the session description protocol (SDP) as described in the previous section. The session description is the payload of the SAP packet (Figure 5).
Session Announcement and Deletion
SAP defines no rendezvous mechanism. A SAP announcer periodically sends an announcement packet to a well-known multicast address and port. In other words, the SAP announcer is not aware of the presence or absence of any SAP listeners, and no additional reliability is provided over the standard best-effort UDP/IP semantics.
Figure 5: SAP packet format (fields: V=1, A, R, T, E, C, auth len, msg id hash, originating source (32 bits for IPv4 or 128 bits for IPv6), optional authentication data, optional payload type, payload)
A SAP announcement is multicast with the same scope as the session it is announcing, ensuring that the recipients of the announcement can also be potential recipients of the session being advertised. A SAP listener learns of the multicast scopes it is within and listens on the well-known SAP address and port for those scopes. Multicast addresses in the range 224.2.128.0-224.2.255.255 are used for IPv4 global scope sessions, with SAP announcements being sent to 224.2.127.254. For IPv4 administrative scope sessions, the administratively scoped IP multicast is used (Mayer, 1998). The announcement interval, i.e., the time period between periodic multicasts of an announcement to the group, is chosen such that the total bandwidth used by all announcements on a single SAP group remains below a pre-configured limit. The base interval between announcements is derived from the number of announcements being made in that group, the size of the announcement and the configured bandwidth limit. Sessions that have previously been announced may be deleted by implicit timeout or explicit deletion using the session deletion packet. The session description payload contains timestamp information that specifies a start and end time for the session. If the current time is later than the end time for the session, then the session is deleted. The deletion packet specifies the version of the session to be deleted. The announcement and deletion packets are indicated by the message type field as described in the SAP packet format (Figure 5).
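The base-interval relationship described above can be sketched as follows; this illustration omits the randomization and minimum interval that a real SAP implementation would apply, and the example numbers are invented.

def sap_interval(num_announcements, announcement_bytes, limit_bps):
    """If every announcer sends once per interval, the total rate is
    num * size * 8 / interval bits/s; solve for the interval that keeps
    that rate at the configured bandwidth limit."""
    return num_announcements * announcement_bytes * 8 / limit_bps

# 50 announcements of 500 bytes under a 4 kb/s limit -> 50-second interval
print(sap_interval(50, 500, 4000))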
The SAP Packet Format
The SAP packet contains the following information (Figure 5):
1. V: Version number. The version number field must be set to 1.
2. A: Address type. If the A bit is 0, the originating source field contains a 32-bit IPv4 address. If the A bit is 1, the originating source contains a 128-bit IPv6 address.
3. R: Reserved.
4. T: Message type. The T field is 0 for a session announcement packet and 1 for a session deletion packet.
5. E: Encryption bit. 1 if the payload of the SAP packet is encrypted; 0 if the packet is not encrypted.
6. C: Compressed bit. 1 if the payload is compressed using the zlib compression algorithm (Jacobson and Casner, 1998). If the payload is to be compressed and encrypted, the compression must be performed first.
7. auth len: Authentication Length. An 8-bit unsigned quantity giving the number of 32-bit words following the main SAP header that contain authentication data. If it is zero, no authentication header is present. The optional authentication data contains a digital signature of the packet, with length as specified by this authentication length header field.
8. msg id hash: Message Identifier Hash. A 16-bit quantity that, used in combination with the originating source, provides a globally unique identifier indicating the precise version of this announcement.
9. Originating source: This gives the IP address of the original source of the message. Whether it is an IPv4 address or an IPv6 address is indicated by the A field.
10. optional payload type: The payload type field is a MIME content type specifier, describing the format of the payload. This is a variable-length ASCII text string, followed by a single zero byte (ASCII NUL). An example of the payload type is "application/sdp."
11. payload: If the packet is an announcement packet, the payload contains a session description. If the packet is a session deletion packet, the payload contains a session deletion message. If the payload type is "application/sdp," the deletion message is a single SDP line consisting of the origin field of the announcement to be deleted. If the E or C bits are set in the header, both the payload type and payload are encrypted and/or compressed.
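A sketch of parsing these fixed fields from a raw SAP packet follows. It is illustrative only: authentication data is skipped, and the payload type string is left attached to the payload.

import struct

def parse_sap(packet: bytes) -> dict:
    """Decode the SAP header fields listed above (sketch only)."""
    b0, auth_len = packet[0], packet[1]
    (msg_id_hash,) = struct.unpack("!H", packet[2:4])
    ipv6 = (b0 >> 4) & 1                  # A bit: 0 = IPv4, 1 = IPv6
    addr_len = 16 if ipv6 else 4
    origin = packet[4:4 + addr_len]
    offset = 4 + addr_len + 4 * auth_len  # skip the authentication words
    return {
        "version": b0 >> 5,               # V: 3 bits, must be 1
        "ipv6_origin": bool(ipv6),
        "deletion": bool((b0 >> 2) & 1),  # T: 0 announcement, 1 deletion
        "encrypted": bool((b0 >> 1) & 1), # E bit
        "compressed": bool(b0 & 1),       # C bit
        "msg_id_hash": msg_id_hash,
        "origin": origin,
        "payload": packet[offset:],       # payload type string + payload
    }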
Encrypted and Authenticated Announcements
As indicated in the SAP packet format, an announcement can be encrypted (set E=1). However, if many of the receivers do not have the encryption key, there is a considerable waste of bandwidth, since those receivers cannot use the announcement they have received. For this reason, the use of encrypted SAP announcements is not generally recommended. The authentication header can be used for two purposes:
1. Verification that changes to a session description or deletion of a session are permitted.
2. Authentication of the identity of the session creator.
SAP is not tied to any single authentication mechanism. The precise format of the authentication data in the packet depends on the authentication mechanism in use. The SAP protocol describes the use of Pretty Good Privacy (PGP) (Callas, Donnerhacke, Finney and Thayer, 1998) and the Cryptographic Message Syntax (CMS) (Housley, 1999) for SAP authentication.
Session Initiation Protocol (SIP)
Not all sessions are advertised, and even those that are advertised may require a mechanism to explicitly invite a user to join a session. The Session Initiation Protocol (SIP) is an application-layer control protocol that can establish, modify and terminate multimedia sessions or calls (Handley, Schulzrinne, Schooler and Rosenberg, 2000). These multimedia sessions include multimedia conferences, distance learning, Internet telephony and similar applications. SIP can invite both persons and "robots," such as a media storage service. SIP can invite parties to both unicast and multicast sessions; the initiator does not necessarily have to be a member of the session to which it is inviting. Media and participants can be added to an existing session. SIP can be used to initiate sessions as well as invite members to sessions that have been advertised and established by other means. Sessions can be advertised using multicast protocols such as SAP, electronic mail, news groups, Web pages or directories (LDAP), among others. SIP does not care whether the session is already ongoing or is just being created, nor whether the conference is a small, tightly coupled session or a huge broadcast. It merely conveys an invitation to a user in a timely manner, inviting them to participate, and provides enough information for them to be able to know what sort of session to expect. Thus, although SIP can be used to make telephone-style calls, it is by no means restricted to that style of conference. SIP supports five facets of establishing and terminating multimedia communications:
1. User location: determination of the end system to be used for communication.
2. User capabilities: determination of the media and media parameters to be used.
3. User availability: determination of the willingness of the called party to engage in communications.
4. Call setup: "ringing," establishment of call parameters at both called and calling party.
5. Call handling: including transfer and termination of calls.
Building Internet Multimedia Applications 31
Inviting a callee to participate in a single conference session or call involves one or more SIP request-response transactions. In the remainder of this section, we illustrate the SIP protocol operation.
SIP Protocol Operation
Callers and callees are identified by SIP addresses. When making a SIP call, a caller first locates the appropriate server and then sends a SIP request. The most common SIP operation is the invitation. Instead of directly reaching the intended callee, a SIP request may be redirected or may trigger a chain of new SIP requests by proxies. Users can register their location(s) with SIP servers. A SIP address is used to locate users at hosts and is represented by a SIP URL. The SIP URL takes a form similar to a mailto or telnet URL, i.e., user@host. The user part is a user name or a telephone number. The host part is either a domain name or a numeric network address. It can also contain information about the transport (UDP, TCP, SCTP) and user (e.g., user=phone), and even the method of the SIP request (e.g., INVITE, ACK, BYE, CANCEL and REGISTER). Examples of SIP URLs are:
sip:[email protected]
sip:j.doe:[email protected];transport=tcp
sip:[email protected]?subject=project
sip:[email protected]
sip:[email protected]
sip:[email protected]
sip:[email protected];method=REGISTER
sip:alice
sip:+1-212-555-1212:[email protected];user=phone
When a client wishes to send a request, the client either sends it to a locally configured SIP proxy server (as in HTTP) or sends it to the IP address and port of the server. Once the host part has been resolved to a SIP server, the client sends one or more SIP requests to that server and receives one or more responses from the server. A request (and its retransmissions) together with the responses triggered by that request make up a SIP transaction. The methods of the SIP request include INVITE, ACK, BYE, CANCEL and REGISTER. The protocol steps involved in a successful SIP invitation are simple and straightforward. The SIP invitation consists of two requests, INVITE followed by ACK. The INVITE request asks the callee to join a particular conference or establish a two-party conversation. After the callee has agreed to participate in the call, the caller confirms that it has received that response by sending an ACK request. If the caller no longer wants to participate in the call, it sends a BYE request instead of an ACK. The SIP protocol specification describes more complicated initiations (Handley, Schulzrinne, Schooler and Rosenberg, 2000). The INVITE request typically contains a session description, for example in SDP format, as described in the previous section, that provides the called party with enough information to join the session. For multicast sessions, the session description enumerates the media types and formats that are allowed to be distributed to that session. For a unicast session, the session description enumerates the media types and formats that the caller is willing to use and where it wishes the media data to be sent. In either case, if the callee wishes to accept the call, it responds to the invitation by returning a similar description listing the media it wishes to use. For a multicast session, the callee only returns a session description if it is unable to receive the media indicated in the caller's description or wants to receive data via unicast. The protocol operations for the INVITE method are shown in Figure 6, using a proxy server as an example.
Figure 6: Example SIP protocol operations: (1) INVITE from the caller to the proxy server; (2) the proxy contacts the location server with the address; (3) the location server returns a more precise location; (4) INVITE forwarded to the callee; (5)-(6) 200 OK returned via the proxy; (7)-(8) ACK forwarded to the callee.
In this SIP transaction, the proxy server accepts the INVITE request (step 1), contacts the location service with all or part of the address (step 2) and obtains a more precise location (step 3). The proxy server then issues a SIP INVITE request to the address(es) returned by the location service (step 4). The user agent server alerts the user and returns a success indication to the proxy server (step 5). The proxy server then returns the success result to the original caller (step 6). The receipt of this message is confirmed by the caller using an ACK request, which is forwarded to the callee (steps 7 and 8). Note that an ACK can also be sent directly to the callee, bypassing the proxy. All requests and responses have the same Call-ID. Note that SIP does not offer conference control services such as floor control or voting and does not prescribe how a conference is to be managed, but SIP can be used to introduce conference control protocols. SIP does not allocate multicast addresses. SIP can invite users to sessions with and without resource reservation; it does not reserve resources itself, but can convey to the invited system the information necessary to do so. As an application protocol, SIP makes minimal assumptions about the underlying transport and network-layer protocols. The lower layer can provide either a packet or a byte stream service, with reliable or unreliable delivery. In an Internet context, SIP is able to utilize both UDP and TCP as transport protocols. SIP can also be used directly with protocols such as ATM AAL5, IPX, frame relay or X.25. In addition, SIP is a text-based protocol, and much of the SIP message syntax and many of the header fields are identical to those of HTTP. SIP uses URLs for user and service addressing. However, SIP is not an extension of HTTP. As SIP is used for initiating multimedia conferences rather than delivering media data, the additional overhead of using a text-based protocol is believed to be insignificant.
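To give a flavor of this text-based syntax, a minimal SIP invitation might look as follows (all names, hosts and identifiers are hypothetical; the header fields follow the SIP specification cited above, and the SDP bodies are omitted):

Caller->Callee: INVITE sip:bob@example.com SIP/2.0
                Via: SIP/2.0/UDP alice-pc.example.com
                From: Alice <sip:alice@example.com>
                To: Bob <sip:bob@example.com>
                Call-ID: 3298420296@alice-pc.example.com
                CSeq: 1 INVITE
                Content-Type: application/sdp

                (SDP session description omitted)

Callee->Caller: SIP/2.0 200 OK
                Via: SIP/2.0/UDP alice-pc.example.com
                From: Alice <sip:alice@example.com>
                To: Bob <sip:bob@example.com>
                Call-ID: 3298420296@alice-pc.example.com
                CSeq: 1 INVITE

The caller would then complete the invitation with an ACK request carrying the same Call-ID.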
Controlling Multimedia Servers: RTSP
A standard way to remotely control multimedia streams delivered, for example, via RTP, is the Real-Time Streaming Protocol (RTSP) (Schulzrinne, Rao and Lanphier, 1998). Control includes absolute positioning within the media stream, recording and possibly
device control. RTSP is primarily aimed at Web-based media-on-demand services, but it is also well suited to providing VCR-like controls for audio and video streams, and to providing playback and record functionality for RTP data streams. A client can specify that an RTSP server play a recorded multimedia session into an existing multicast-based conference, or that the server join the conference and record it. RTSP acts as a network remote control for multimedia servers; it does not typically deliver the continuous streams itself. The protocol supports the following operations:
1. Retrieval of media from a media server: The client can request a presentation description via HTTP or some other method. If the presentation is being multicast, the presentation description contains the multicast addresses and ports to be used for the continuous media. If the presentation is to be sent only to the client via unicast, the client provides the destination, for security reasons.
2. Invitation of a media server to a conference: A media server can be "invited" to join an existing conference, either to play back media into the presentation or to record all or a subset of the media in a presentation. This mode is useful for distributed teaching applications; several parties in the conference may take turns "pushing the remote control buttons."
3. Addition of media to an existing presentation: The media server can tell the client about additional media becoming available. This feature is useful, particularly for live presentations.
In order for RTSP to control the media presentation, each presentation and media stream is identified by an RTSP URL. The overall presentation and the properties of the media that make it up are defined by a presentation description file. This file contains a description of the media streams making up the presentation, including their encodings, language and other parameters that enable the client to choose the most appropriate combination of media. In the presentation description, each media stream that is individually controllable by RTSP is identified by an RTSP URL, which points to the media server handling that particular media stream and names the stream stored on that server. Several media streams can be located on different servers; for example, audio and video streams can be split across servers for load sharing.
RTSP Protocol Operation
The syntax and operation of RTSP are intentionally similar to HTTP/1.1, so that extension mechanisms to HTTP can in most cases also be added to RTSP. As such, RTSP has some overlap in functionality with HTTP, and it may also interact with HTTP in that the initial contact with streaming content is often made through a Web page. Like HTTP, RTSP is an Internet application protocol using the request/response paradigm. A client sends the server a request that includes, within the first line of that message, the method to be applied to the resource, the identifier of the resource (an RTSP URL) and the protocol version in use. The main request methods defined in RTSP include:
1. SETUP: Causes the server to allocate resources for a stream and start an RTSP session; a client can also issue a SETUP request for a stream that is already playing to change transport parameters.
2. PLAY: The client tells the server to start sending data via the mechanism specified in SETUP.
3. PAUSE: The PAUSE request causes the stream delivery to be interrupted (halted) temporarily.
4. DESCRIBE: The client issues the DESCRIBE method to retrieve the description of a
presentation or media object identified by the request URL from a server.
5. ANNOUNCE: When sent from client to server, ANNOUNCE posts the description of a presentation or media object identified by the request URL to a server. When sent from server to client, ANNOUNCE updates the session description in real-time.
The RTSP URL uses "rtsp:" or "rtspu:" as schemes to refer to network resources via the RTSP protocol. For example, the RTSP URL rtsp://media.example.com:554/twister/audiotrack identifies the audio stream within the presentation "twister," which can be controlled via RTSP requests issued over a TCP connection to port 554 of host media.example.com. Each request-response pair has a CSeq field specifying the sequence number. After receiving and interpreting a request message (containing one of the methods above), a server responds with an RTSP response message that basically consists of the protocol version and a numeric status code followed by a reason-phrase. RTSP adopts most of the HTTP/1.1 status codes, for example, "200 OK," "400 Bad Request," "403 Forbidden" and "404 Not Found." The following example client-server interaction uses the SETUP method:

C->S: SETUP rtsp://example.com/foo/bar/baz.rm RTSP/1.0
      CSeq: 302
      Transport: RTP/AVP;unicast;client_port=4588-4589

S->C: RTSP/1.0 200 OK
      CSeq: 302
      Date: 23 Jan 2001 15:35:06 GMT
      Session: 47112344
      Transport: RTP/AVP;unicast;client_port=4588-4589;server_port=6256-6257

In the following example, the server, as requested by the client, will first play seconds 10 through 15, then, immediately following, seconds 20 to 25, and finally seconds 30 through the end:

C->S: PLAY rtsp://audio.example.com/audio RTSP/1.0
      CSeq: 835
      Session: 12345678
      Range: npt=10-15

C->S: PLAY rtsp://audio.example.com/audio RTSP/1.0
      CSeq: 836
      Session: 12345678
      Range: npt=20-25

C->S: PLAY rtsp://audio.example.com/audio RTSP/1.0
      CSeq: 837
      Session: 12345678
      Range: npt=30-

It should be noted that while RTSP is very similar to HTTP/1.1, it differs in a number of important aspects from HTTP:
1. RTSP introduces a number of new methods and has a different protocol identifier.
2. An RTSP server needs to maintain state by default in almost all cases, as opposed to the stateless nature of HTTP.
3. Both an RTSP server and an RTSP client can issue requests.
4. Data is carried out-of-band by a different protocol (with one exception).
RTSP is now an Internet proposed standard (RFC 2326).
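As a further illustration of the request/response syntax, a DESCRIBE interaction might look as follows (the host and presentation names are hypothetical, in the style of the RFC 2326 examples):

C->S: DESCRIBE rtsp://server.example.com/seminar RTSP/1.0
      CSeq: 312
      Accept: application/sdp

S->C: RTSP/1.0 200 OK
      CSeq: 312
      Date: 23 Jan 2001 15:35:06 GMT
      Content-Type: application/sdp
      Content-Length: 376

      (SDP presentation description omitted)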
RESOURCE RESERVATION PROTOCOL (RSVP)
For flows that may take a significant fraction of the network resources, we need a more dynamic way of establishing multimedia sessions. In the short term, this applies to many multimedia conferences, since at present the Internet is largely under-provisioned. The idea is that the resources necessary for a multimedia session are reserved, and if sufficient resources are not available, admission is rejected. The Resource ReSerVation Protocol (RSVP) has been standardized for just this purpose (Braden, Zhang, Berson and Herzog, 1997; Braden and Zhang, 1997; Zhang, Deering, Estrin, Shenker and Zappala, 1993). The RSVP protocol is part of a larger effort to enhance the current Internet architecture with support for Quality of Service (QoS) flows; it provides flow identification and classification. The RSVP protocol is used by a host to request specific qualities of service from the network for particular application data streams or flows. RSVP is also used by routers to deliver quality-of-service requests to all nodes along the path(s) of the flows and to establish and maintain state to provide the requested service. RSVP requests will generally result in resources being reserved in each node along the data path. RSVP is a simplex protocol in that it makes reservations for unidirectional data flows. RSVP is receiver-oriented, i.e., the receiver of a data flow is responsible for the initiation and maintenance of the resource reservation used for that flow. The designers of the RSVP protocol argue that this design decision enables RSVP to accommodate heterogeneous receivers in a multicast group. In that environment, each receiver may reserve a different amount of resources, may receive different data streams sent to the same multicast group and may switch channels from time to time without changing its reservation (Zhang, Deering, Estrin, Shenker and Zappala, 1993). Normally, a host sends IGMP messages to join a host group and then sends RSVP messages to reserve resources along the delivery path(s) of that group. RSVP operates on top of IPv4 or IPv6, occupying the place of a transport protocol in the protocol stack. However, RSVP does not transport application data; rather, it is an Internet control protocol, like ICMP, IGMP or the routing protocols. It uses the underlying routing protocols to determine where it should carry reservation requests, and as routing paths change, RSVP adapts its reservations to the new paths if reservations are in place.
RSVP Reservation Model and Styles
An RSVP reservation is described by a flow descriptor, which is a pair consisting of a flowspec and a filter spec. The flowspec specifies the desired QoS. The filter spec, together with a session specification, defines the set of data packets—the "flow"—to receive the QoS defined by the flowspec. Stream filtering allows the receiver to reserve network resources only for the subset of the data it is interested in receiving, and thus not to waste network resources. Stream data packets that do not match any of the filter specs are handled as best-effort traffic with no further QoS guarantee. If no filter is used, on
the other hand, any data packets destined for the multicast group may use the reserved resources. The flowspec is used to set parameters in the node's packet scheduler or other link-layer mechanism, while the filter spec is used to set parameters in the packet classifier (see below). To support the different needs of various applications and to make the most efficient use of network resources, RSVP defines different reservation styles. RSVP reservation styles are actually a set of options included in a reservation request. Three styles (options) are defined, illustrated with an example after this list:
1. The Wildcard-Filter (WF) style creates a single reservation shared by flows from all upstream senders.
2. The Fixed-Filter (FF) style creates a distinct reservation for data packets from a particular sender, not sharing it with other senders' packets for the same session.
3. The Shared Explicit (SE) style creates a single reservation shared by selected upstream senders. Unlike the WF style, the SE style allows a receiver to explicitly specify the set of senders to be included.
These styles make it possible for intermediate routers to aggregate different reservations for the same multicast group, resulting in more efficient utilization of network resources.
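For illustration, RFC 2205 describes reservation requests using an informal notation in which S1 and S2 denote upstream senders and B a unit of bandwidth. In that notation, WF(*{4B}) requests a single reservation of size 4B shared by all senders; FF(S1{4B}, S2{5B}) requests distinct reservations of 4B for sender S1 and 5B for sender S2; and SE((S1,S2){3B}) requests a single 3B reservation shared by the explicitly listed senders S1 and S2.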
RSVP Protocol Operations
We now focus on the RSVP protocol mechanisms and message types for making resource reservations. The RSVP protocol uses traffic control mechanisms to implement quality of service for a particular data flow. These mechanisms include:
1. A packet classifier that determines the QoS class (and perhaps the route) for each packet.
2. Admission control, which determines whether the node has sufficient available resources to supply the requested QoS, and policy control, which determines whether the user has administrative permission to make the reservation.
3. A packet scheduler or some other link-layer-dependent mechanism that determines when particular packets are forwarded, thus achieving the promised QoS.
During reservation setup, a QoS request from a receiver host application is passed to the local RSVP process. The RSVP protocol then carries the request to all the nodes (routers and hosts) along the reverse data path(s) to the data source(s), but only as far as the router where the receiver's data path joins the multicast distribution tree. In this procedure, if the RSVP QoS request passes "admission control" (sufficient resources are available) and "policy control" (the reservation is permitted), the reservation succeeds and parameters are set in the packet classifier and packet scheduler to obtain the desired QoS. If either check fails, the RSVP protocol returns an error notification to the application process that originated the request. The RSVP protocol defines two fundamental RSVP message types: Path and Resv. Each RSVP sender host transmits RSVP "Path" messages downstream along the uni-/multicast routes provided by the routing protocol(s); that is, Path messages follow the paths of the data. These Path messages store "path state" in each node along the way. The path state is used to route the Resv messages hop-by-hop in the reverse direction. "Resv"
messages are RSVP reservation requests that each receiver host sends upstream towards the senders. These messages must follow exactly the reverse of the path(s) the data packets will use upstream to all the sender hosts.
ALL WORKING TOGETHER
In this section, we illustrate how the real-time protocols described above work together to support Internet multimedia conferencing. As shown in Figure 7, User A at Site 1 first created a conference session, described it using the Session Description Protocol (SDP), and announced it to the world using the Session Announcement Protocol (SAP). At the start time given in the SDP description, User A started the session, sent out the media data using RTP, and periodically sent out sender reports using RTCP. User B at Site 2 received the conference announcement and noted the start time. He joined the conference using the Internet Group Management Protocol (IGMP). During the session, User B sent out receiver reports using RTCP. To obtain some guarantee on the quality of service, User B made a resource reservation along the path (from A) using the RSVP protocol. Once the reservation succeeded, the quality of service improved. The RTP/RTCP session continues until User B leaves the group or User A stops the conference. As shown, the audio and video are each carried in a separate RTP session, with RTCP packets controlling the quality of the session. Routers communicate via RSVP to set up and manage reserved-bandwidth sessions.
Figure 7: An illustration of all the protocols working together (message exchange between Site 1 and Site 2: session announcement via SDP/SAP, joining via IGMP, media via RTP, sender and receiver reports via RTCP, and reservation via RSVP Path/Resv messages)
THE MEDIA FRAMEWORKS FOR DEVELOPING INTERNET MEDIA APPLICATIONS
With the real-time protocols described above as foundations, we now present two emerging media frameworks that provide a high-level abstraction for developing real-time media applications over the Internet: the CORBA Media Streaming Framework (MSF) (OMG, 1998) and the Java Media Framework (JMF) (Sun Microsystems, 1999), both of which provide object-oriented multimedia middleware. Internet media applications can be developed using low-level network APIs (as the MBone tools were, for example); they can also be implemented within middleware environments that abstract away the low-level details of the Internet. The CORBA Media Streaming Framework and the Java Media Framework are examples of such middleware for developing media applications. We first describe the CORBA Media Streaming Framework, followed by the Java Media Framework; both are based on distributed object technology.
CORBA MEDIA STREAMING FRAMEWORK
The Common Object Request Broker Architecture, also known as CORBA, is a distributed object architecture that allows objects to interoperate across networks (such as the Internet) regardless of the language in which they were written. CORBA is derived from the high-level OMA architecture (Figure 8).
Figure 8: The OMG's Object Management Architecture (OMA). Application Objects (e.g., business objects, compound documents, task management; not standardized by OMG, their scope being a single application or vendor) and the CORBA Domains (vertical facilities such as healthcare, finance, manufacturing and telecommunications) sit alongside the CORBAfacilities (horizontal facilities such as help facilities and desktop management) on top of the Object Request Broker (ORB), which is underpinned by the CORBA Services (lifecycle, externalization, events, security, naming, persistence, time, properties, transactions, query, concurrency and licensing).
The OMA divides the whole distributed object space into four areas: CORBAservices, CORBAfacilities, CORBAdomains and Application Objects. Object implementations that provide capabilities required by all objects in the environment are code-named CORBAservices. The growing set of object services adopted by the OMG handles lifecycle, naming, events and notification, persistence, transactions, relationships, externalization, security, trading and messaging. CORBAfacilities are object implementations that many applications will use; the specification of common facilities is the most recent aspect of the architecture, and facilities include support for compound documents and workflow, among others. For vertical markets and applications, such as health care, insurance, financial services, manufacturing and e-commerce, the architecture defines CORBAdomains interfaces, which provide the mechanism for software vendors to build components or to support the integration of new or existing (legacy) applications into a CORBA environment. The final component refers to Application Objects, those object implementations specific to end-user applications. Typically, these purpose-built objects comprise the systems that support a business process, and thus are not standardized by the OMG. Within the OMA, CORBA specifies the infrastructure for inter-object communication, called the Object Request Broker, or ORB. In essence, the ORB is an object bus, allowing distributed objects to communicate in heterogeneous environments. CORBA is often called distributed object middleware, because it mediates between components to allow them to work together, integrating them into a single, functional whole. The problem that CORBA addresses is how to define, construct and deploy the software elements that comprise complex systems. The CORBA architecture is based on the distribution of data and process as objects. This general approach is the result of industry experience with mainframe (host-centric) systems and data-server-centric technology in use over the last four decades. The goal of such distributed systems is to enable applications and data to reside on different platforms, yet work together seamlessly to support various business processes. CORBA is a complete architecture for a vast variety of distributed applications in heterogeneous environments. In this section, we focus on its support for real-time multimedia applications; more detailed information about CORBA can be found in OMG (1999), Schmidt (2000), Vinoski (1997), and Yang and Duddy (1996).
CORBA MSF Components
The CORBA Media Streaming Framework (CORBA MSF) takes an object-oriented approach to the control and management of multimedia, especially audio/video streams. The CORBA MSF specification (OMG, 1998) defines a set of object interfaces that implement a distributed media streaming framework. The principal components of the CORBA Media Streaming Framework are streams and stream endpoints, flows and flow endpoints, multimedia devices and virtual multimedia devices, and flow devices. These interfaces provide a high-level abstraction for developing multimedia applications in a CORBA environment and are described below.
1. Streams. A stream represents a continuous media transfer, usually between two or more virtual multimedia devices. A stream interface is an aggregation of one or more source and sink flow endpoints associated with an object. Although any type of data could flow between objects, this CORBA framework focuses on applications dealing with
audio and video exchange with Quality of Service (QoS) constraints. The stream is represented by the IDL interface StreamCtrl, which abstracts a continuous media transfer between virtual devices. It supports operations to bind multimedia devices using a stream, as well as operations to start and stop a stream. An application can establish a stream between two devices by calling the operation bind_devs() on the StreamCtrl interface: boolean bind_devs(in MMDevice a_party, in MMDevice b_party, inout streamQoS the_qos, in flowSpec the_spec). When the application requests a stream between two or more multimedia devices, it can explicitly specify the quality of service (QoS) of the stream.
2. Stream endpoints. A stream endpoint terminates a stream and is represented by the StreamEndPoint interface. A stream endpoint and a virtual multimedia device exist for each stream connection. There are two flavors of stream endpoint: an A-party, represented by the interface StreamEndPoint_A, and a B-party, represented by the interface StreamEndPoint_B. An A-party can contain producer flow endpoints as well as consumer flow endpoints, and similarly for a B-party. Thus, when an instance of a typed StreamEndPoint is created, the flows will be plumbed in the right direction (i.e., an audio consumer FlowEndPoint in a StreamEndPoint_A will correspond to an audio producer FlowEndPoint in a StreamEndPoint_B).
3. Flows and flow endpoints. A flow is a continuous sequence of frames in a clearly identified direction, and a flow endpoint terminates a flow. A flow endpoint may be either a source (producer) or a sink (consumer). A stream may contain multiple flows; for example, a videophone stream may contain four flows labeled video1, video2, audio1 and audio2. An operation on a stream (for example, stop or start) may be applied to all flows within the stream simultaneously or to just a subset of them. A stream endpoint may contain multiple flow endpoints. Flows and flow endpoints are represented by the FlowConnection and FlowEndPoint interfaces, respectively. The framework supports two basic profiles for the streaming service: the full profile, in which flow endpoints and flow connections have accessible IDL interfaces, and the light profile, a subset of the full profile in which flow endpoints and flow connections do not expose IDL interfaces. In the light profile, the FlowEndPoint objects are co-located with the StreamEndPoint objects and so do not expose IDL interfaces.
4. Multimedia devices. A multimedia device (as opposed to a virtual multimedia device) is the abstraction of one or more items of multimedia hardware and acts as a factory for virtual multimedia devices. A multimedia device can support more than one stream simultaneously; for example, a microphone device can stream audio to two speaker devices. For each stream connection requested, the multimedia device creates a stream endpoint and a virtual multimedia device. Multimedia devices are represented by the IDL interface MMDevice; a virtual device is represented by the VDev interface.
5. Flow devices. Flow devices are exactly analogous to multimedia devices for streams. A flow connection binds two flow devices in exactly the same manner as a stream connection (StreamCtrl) binds multimedia devices (MMDevice). Flow devices are represented by the FDev interface. An FDev creates a FlowEndPoint, whereas an MMDevice creates a StreamEndPoint and a VDev.
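As a rough sketch of how an application might drive these interfaces, the following fragment assumes Java stubs generated from the MSF IDL under the standard IDL-to-Java mapping (the Helper and Holder class names follow that mapping and are not spelled out in the specification text above; the stringified object references supplied in args are hypothetical):

import org.omg.CORBA.ORB;

public class BindDevicesExample {
    public static void main(String[] args) {
        ORB orb = ORB.init(args, null);
        // Object references would normally come from a naming or trading
        // service; here they are assumed to be stringified IORs in args.
        StreamCtrl ctrl = StreamCtrlHelper.narrow(orb.string_to_object(args[0]));
        MMDevice micDev = MMDeviceHelper.narrow(orb.string_to_object(args[1]));
        MMDevice spkDev = MMDeviceHelper.narrow(orb.string_to_object(args[2]));
        // streamQoS is an inout parameter, so the mapping wraps it in a
        // Holder; an empty QoS list asks for the default quality of service.
        streamQoSHolder qos = new streamQoSHolder(new QoS[0]);
        String[] flowSpec = { "audio1" };  // the flow(s) to bind
        boolean ok = ctrl.bind_devs(micDev, spkDev, qos, flowSpec);
        System.out.println("Stream established: " + ok);
    }
}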
A simple stream between a microphone device (audio source or producer) and a speaker device (audio sink or consumer) is shown in Figure 9. To illustrate the interaction between the components of the framework, we show the process of establishing a stream by calling the bind_devs() operation on a StreamCtrl object (Figure 10).
Figure 9: A simple audio stream represented using the CORBA Media Streaming Framework: a StreamCtrl is associated with two stream endpoints; each endpoint contains a VDev and a StreamEndpoint, and the two StreamEndpoints are joined by a stream connection.
Figure 10: Establishing a stream: (1) bind_devs(aMMDev, bMMDev, someQoS) on aStreamCtrl; (2.1) create_A(...) on aMMDev and (2.2) return of the A_EndPoint reference; (2.3) create_B(...) on bMMDev and (2.4) return of the B_EndPoint reference; (3) configure() between aVDev and bVDev; (4) connect(B_EndPoint, someQoS); (5) request_connection().
1. A StreamCtrl object may be created by the application to initiate a stream between two multimedia devices (aMMDev and bMMDev). For this purpose, a call is made to the bind_devs() operation on the StreamCtrl interface.
2. The StreamCtrl asks aMMDev and bMMDev for a StreamEndPoint and a VDev to support the stream by calling create_A(...,someQoS,...) on one and create_B(...,someQoS,...) on the other. The former call, for example, will return a StreamEndPoint_A object that is associated with a VDev. This is the point at which an MMDevice could decide that it can support no more connections and refuse to create the StreamEndPoint and VDev. A particular MMDevice might be specialized to only create A-endpoints for a type of stream (for example, a microphone endpoint and VDev) or only B-endpoints (for example, a speaker endpoint and VDev).
3. This step is the device configuration phase and involves the aVDev calling configuration operations on the bVDev and vice versa. When two virtual devices are connected using a stream, they must ensure that they are both appropriately configured. For example, if the microphone device is using A-law encoding and a sampling frequency of 8 kHz, then it must ensure that the speaker device it is connected to is configured similarly. It does this by calling configuration operations such as set_format("audio1","MIME:audio/basic") and set_dev_params("audio1",myaudioProperties) on the VDev interface of the speaker device.
4. The actual stream is set up by calling connect() on the A_EndPoint with the B_EndPoint as a parameter. Stream endpoints contain a number of flow endpoint objects; a flow endpoint object can be used either to pull information from a multimedia device and send it over the network, or vice versa. The A_EndPoint may choose to listen on transport addresses for a subset of the flows to be connected.
5. The A_EndPoint then calls request_connection(). Among the information passed across will be the transport addresses of the listening flows on the A-side. The B_EndPoint will connect to the listening flows on the A-side and will listen on transport addresses for any remaining flows for which the A-side is not listening. Among the information passed back by the request_connection() operation will be the transport addresses of the flows listening on the B-side. The final stage is for the A_EndPoint to connect to the listening flows on the B_EndPoint.
The CORBA Media Streaming Framework has the following features:
1. Topologies for streams: allows one-to-one (unicast) and/or one-to-many (multicast) flow sources and sinks to be configured into the same stream binding.
2. Multiple flows: allows flows to be aggregated within a stream, such that joint operations can be performed.
3. Stream description and typing: allows a stream interface, in terms of its constituent flow endpoints, to be described and typed. Operations defined in flow endpoint control object IDL interfaces may be used to determine the characteristics of a flow endpoint, and the subtyping of the flow endpoint control interfaces may be used to type the flow endpoints themselves.
4. Stream interface identification and references: allows a stream interface and its flow endpoints to be unambiguously named and identified by other objects and the ORB, using normal CORBA object references corresponding to IDL stream and flow endpoint control interfaces.
5. Stream setup and release: allows for the establishment and tear-down of bindings between stream interfaces and flow endpoints. Establishment may include the negotiation of QoS parameters (using references to QoS specification interfaces) and error-handling capabilities if QoS is not successfully negotiated.
6. Stream modification and termination: the IDL interfaces include operations for reconfiguring a stream connection during its lifetime, such as adding and removing endpoints from an existing stream connection. The framework allows flow data endpoints to be terminated either in hardware or in software.
7. Multiple protocols: the framework allows for multiple flow protocols and flow protocol endpoint addresses.
8. Quality of Service (QoS): allows QoS specification interfaces to be defined and passed via reference so that the desired QoS can be expressed. Interfaces can be specialized to allow the monitoring of QoS characteristics.
9. Flow synchronization: limited support for synchronization is provided by the interfaces.
10. Interoperability: because all operations are expressed in unmodified OMG CORBA IDL, normal CORBA interoperability may be used for all operations defined in this media framework. Any protocol conversions required for interworking of various flow protocols are attainable using normal application objects.
Media Streaming Framework in CORBA Environments
The CORBA Media Streaming Framework operates in a general CORBA environment. We now discuss the major components of the CORBA stream architecture and their interaction in the CORBA environment, as shown in Figure 11. In the example CORBA stream architecture (Figure 11), there is a stream with a single flow between two stream endpoints, one acting as the source of the media data and the other as the sink. Each stream endpoint consists of three logical entities (see Figure 11):
1. A stream interface control object that provides IDL-defined interfaces (as server, 2b) for controlling and managing the stream (as well as, potentially, outside the scope of this specification, invoking operations as client, 2a, on other server objects). The stream interface control object uses a basic object adapter (BOA) or portable object adapter (POA) (OMG, 1999, Chapter 11) that transmits and receives control messages in a CORBA-compliant way.
2. A flow data source or sink object (at least one per stream endpoint; there can be many in the multi-point case) that is the final destination of the data flow (3).
3. A stream adapter that transmits and receives a flow (a sequence of frames) over a network. The framework supports multiple transport protocols.
In Figure 11, the CORBA MSF provides definitions of the components that make up a stream and of the interface definitions onto stream control and management objects (denoted as interface 1a). The standardized interface onto stream interface control objects associated with individual stream endpoints is denoted as interface 2b. The CORBA MSF does not standardize the interfaces shown as dashed lines. That is, it does not provide a standard way for the stream interface control object to communicate with the source/sink object, and perhaps indirectly with the stream adapter (interface 4), or for the source/sink object to communicate with the stream adapter (interface 3). When a stream is terminated in hardware, the source/sink object and the stream adapter may not be visible as distinct entities. As indicated earlier, the CORBA MSF allows multiple transport protocols for transmitting and receiving the media data; thus the media data is not necessarily transported over TCP. In many cases, the media data is transported by RTP/UDP, ATM/AAL5 or the Simple Flow Protocol defined in the CORBA MSF framework, which is described next.
CORBA’s Simple Flow Protocol (SFP) The CORBA Media Streaming Framework is designed to support three fundamental types of transport: 1. Connection-Oriented Transport like TCP: This is provided by transports like TCP and required where completeness and reliability of data are essential. 2. Datagram-Oriented Transport like UDP: This is used by many popular Internet streaming applications. This type of transport is frequently more efficient and
Figure 11: The CORBA Media Streaming Framework in a CORBA environment: each stream endpoint comprises a flow data endpoint (source or sink), a stream adapter carrying the data flow, and a stream interface control object with its BOA/POA; control and management objects interact with the endpoints through the ORB core and IIOP (interfaces 1a, 1b, 2a, 2b, 3 and 4, as referenced in the text), while the flows themselves travel over SFP, RTP/IP-multicast or AAL5/ATM.
lightweight if the application doesn't mind losing the occasional datagram and handling a degree of mis-sequencing. The framework must insert sequence numbers into the packets to ensure that mis-sequencing and packet loss are detected and handled by applications.
3. Unreliable connection-oriented transport: This is the type of service supplied by ATM AAL5. Messages are delivered in sequence to the endpoint, but they can be dropped or contain errors. If there are errors, then this fact is reported to the application. There is no flow control unless provided by the underlying network.
In addition to supporting various transport types, the CORBA MSF allows structuring information to be transported with the stream 'in band.' Such information can include sequence numbers, source indicators, timestamps and synchronization sources. The CORBA MSF argues that there is no single transport protocol that provides all the capabilities needed for streamed media. ATM AAL5 is good, but it lacks flow control, so a sender can overrun a receiver. Only RTP provides facilities for transporting the in-band information above, but RTP is Internet-centric, and it should not be assumed that a platform must support RTP in order to take advantage of streamed media. Furthermore, none of the transports provides a standard way of transporting IDL-typed information. In order to accommodate the various needs of multimedia transport over a multitude of transports, the CORBA MSF defines a simple specialized protocol that works on top of various transport protocols and provides architecture-independent flow content transfer. This protocol is referred to as the Simple Flow Protocol (SFP). There are two important points to note about SFP:
1. It is not a transport protocol; it is a message-level protocol, like RTP, layered on top of the underlying transport. It is simple to implement.
2. It is not mandatory for a Media Streaming Framework implementation to support SFP. A flow endpoint that supports SFP can switch it off in order to communicate with a flow endpoint that does not support it. The use of SFP is negotiated at stream establishment. If the stream data is not IDL-typed (i.e., it is in an agreed byte layout, for
example, MPEG), then, by default, SFP will not be used. This allows octet stream flows to be transferred straight to a hardware device on the network.
It should be emphasized, however, that the role of the SFP in the CORBA Media Streaming Framework is a very important one. In the CORBA context, it is of little use to standardize a set of IDL interfaces to manipulate audio/video streams if the flows themselves are not interoperable or do not contain framing and in-band information. This information is necessary to provide timely, accurate delivery of multimedia data over a variety of common protocols, such as ATM AAL5 and UDP, which cannot do this by themselves. These goals are simply unattainable without the specification of SFP. The SFP is not "just another protocol." It is as fundamental to the media streaming framework as the General Inter-ORB Protocol (GIOP) is to the CORBA architecture (OMG, 1999, Chapter 15). In other words, just as CORBA was not complete without GIOP, so this standard would be of little value without SFP.
The SFP message format, including message types and message headers, is formally specified in OMG IDL (OMG, 1999, Chapter 3):

module flowProtocol{
  enum MsgType{
    // Messages in the forward direction
    Start, EndofStream, SimpleFrame, SequencedFrame, Frame, SpecialFrame,
    // Messages in the reverse direction
    StartReply, Credit
  };
  struct frameHeader{
    char magic_number[4];        // '=', 'S', 'F', 'P'
    octet flags;                 // bit 0 = byte order,
                                 // 1 = fragments, 2-7 always 0
    octet message_type;
    unsigned long message_size;  // Size following this header
  };
  struct fragment{
    char magic_number[4];        // 'F', 'R', 'A', 'G'
    octet flags;                 // bit 1 = more fragments
    unsigned long frag_number;   // 1,..,n
    unsigned long sequence_num;
    unsigned long frag_sz;
    unsigned long source_id;     // Required for UDP multicast
  };                             // with multiple sources
  struct Start{
    char magic_number[4];        // '=', 'S', 'T', 'A'
    octet major_version;
    octet minor_version;
    octet flags;                 // bit 0 = byte order
  };
  // Acknowledge successful processing of Start
  struct StartReply{
    char magic_number[4];        // '=','S','T','R'
    octet flags;                 // bit 0 = byte order, 1 = exception
  };
  // If the message_type in frameHeader is SequencedFrame,
  // then the frameHeader will be followed by this
  struct sequencedFrame{
    unsigned long sequence_num;
  };
  // If the message_type is Frame, then
  // the frameHeader is followed by this
  struct frame{
    unsigned long timestamp;
    unsigned long synchSource;
    sequence<unsigned long> source_ids;
    unsigned long sequence_num;
  };
  struct specialFrame{
    frameID context_id;
    sequence<octet> context_data;
  };
  struct credit{
    char magic_number[4];        // '=','C','R','E'
    unsigned long cred_num;
  };
}; // module

For a message type of simple frames (SimpleFrame), there is no subsequent header; the message is simply a stream of octets. The other message headers are frameHeader, frame and specialFrame, each followed by a stream of octets. When SFP is running over RTP, the sequencedFrame and frame structures are not used, since the values for timestamp, synchSource, source ids and sequence number are embedded directly into the corresponding RTP protocol fields. The typical structure of an SFP dialog for a point-to-point flow is shown in Figure 12. The dialogue begins with a Start message (for its format, see the IDL definition above) being sent from source to sink. The source waits for a StartReply message. The message_size field in the frame header denotes the size of the message (including the headers) if no fragmentation is taking place.
Figure 12: A typical SFP dialogue: the source sends a Start message and waits for a StartReply; frames follow (in the illustrated exchange, a FrameHeader with the fragmentation flag set, message_type = frame and message_size = 65,536, followed by data fragments, the last with the more-fragments flag cleared); the sink may send Credit messages to request more data, and the dialogue ends with EndofStream messages.
If fragmentation is being used, then message_size indicates the size of the first fragment, including headers. When fragmentation occurs, the first fragment, which begins with the FrameHeader, is implicitly fragment number 0. Subsequent fragments associated with this header are labeled fragment number 1 through N. As shown in Figure 12, the sink of a flow may send a credit message to the source of the flow to tell it to send more data. The cred_num field is incremented with each credit message sent. This facility may be used with protocols, such as ATM/AAL5 or UDP, that have no flow control. Note that the SFP dialogue is much simpler on a multicast flow, since no messages are sent in the reverse direction.
JAVA MEDIA FRAMEWORK (JMF)
Java has been widely regarded as the Internet programming language. Exploiting the advantages of the Java platform, Sun Microsystems, in collaboration with other companies, has designed the Java Media Framework (JMF) to provide a common cross-platform Java API for accessing the underlying media framework and for incorporating time-based media into Java applications and applets (Sun Microsystems, 1999). The current version of JMF is 2.0. The JMF 2.0 API has been jointly designed by Sun Microsystems and IBM Corporation; the earlier version, the JMF 1.0 API, was jointly developed by Sun Microsystems, Intel and Silicon Graphics. JMF 2.0 supports media capture and addresses the needs of application developers who want additional control over media processing and rendering; it also provides a plug-in architecture that gives direct access to media data and enables JMF to be more easily customized and extended. Furthermore, JMF provides support for RTP, which enables the
transmission and reception of real-time media streams across the Internet. The RTP APIs in JMF 2.0 support the reception and transmission of RTP streams and allow application developers to implement media streaming and conferencing applications. JMF is designed to support most standard media content types, such as AU, AVI, MIDI, GSM, MPEG and QuickTime. The JMF API consists mainly of interfaces that define the behavior and interaction of objects used to capture, process and present time-based media; implementations of these interfaces operate within the structure of the JMF framework.
1. Media Capture: A multimedia capture device can act as a source for multimedia data delivery. For example, a microphone can capture raw audio input, or a digital video capture board might deliver digital video from a camera. DataSources in JMF are abstractions of such capture devices. Some devices deliver multiple data streams; the corresponding DataSource can contain multiple SourceStream interfaces that model the data streams provided by the device. JMF data sources can be categorized as pull data-sources or push data-sources, according to how data transfer is initiated:
a. Pull Data-Source: The client initiates the data transfer and controls the flow of data from pull data-sources. Established protocols for this type of data transfer include HTTP and FILE.
b. Push Data-Source: The server initiates the data transfer and controls the flow of data from a push data-source. Examples of this model include broadcast media and media-on-demand (VOD). The Internet protocols for this type of data include RTP, as discussed earlier.
2. Media Presentation: In JMF, the presentation process is modeled by the Controller interface. Controller defines the basic state and control mechanism for an object that controls, presents or captures time-based media. A Controller posts a variety of controller-specific MediaEvents to provide notification of changes in its status; JMF employs the Java event model to handle these events. The JMF API defines two types of Controllers: Players and Processors. A Player or Processor is constructed for a particular data source. A Player performs the basic playback functions: it processes an input stream of media data and renders it at a precise time. A DataSource is used to deliver the input media stream to the Player. The rendering destination depends on the type of media being presented. A Processor is a specialized type of Player (inheriting from it) that provides control over what processing is performed on the input media stream. A Processor takes a DataSource as input, performs some user-defined processing on the media data and then outputs the processed media data. A Processor can send the output data to a presentation device or to a DataSource, which can in turn be used as the input to another Player or Processor, or as the input to a DataSink. When the presentation destination is not a presentation device, a DataSink is used to read data from a DataSource and render the media to the specific destination. A DataSink may write media to a file (a media file writer) or transmit it over a network (for example, an RTP network streamer transmitting RTP data).
Figure 13: (a) The JMF player model (DataSource to Player); (b) the JMF processor model (DataSource to Processor to DataSource)
Figure 14: The JMF class (interface) hierarchy. Notes: time-based media requires a clock as a base (Clock has a TimeBase); presentation is modeled by Controller, which extends Clock; there are two types of Controller, Player and Processor (Processor extends Player); presentation needs a DataSource, either a PullDataSource or a PushDataSource; media capture is also modeled by DataSource, which creates and manages SourceStreams.
Figure 15: The RTP-based JMF architecture: Java applications, applets and beans sit on top of the Java presentation and processing API and the RTP APIs, which in turn rest on the JMF plug-in API with demultiplexers, multiplexers, codecs, effects and renderers (native or pure Java).
The JMF Player model and Processor model are shown in Figure 13 (a) and (b), respectively.
3. Media Processing: In JMF, media processing is performed by a Processor. A Processor can be used as a programmable Player that enables you to control the decoding and rendering process; it can also be used as a capture processor that enables you to control the encoding and multiplexing of the captured media data. Figure 14 shows the major JMF interfaces and their relationships. Because JMF is designed to manipulate time-based media, the Clock interface is the base of the hierarchy.
We now discuss how JMF 2.0 enables the playback and transmission of RTP streams through a set of APIs defined in the following Java media packages: javax.media.rtp, javax.media.rtp.rtcp and javax.media.rtp.event. The JMF RTP APIs are designed to work seamlessly with the capture, presentation and processing capabilities of JMF. The JMF RTP architecture is shown in Figure 15.
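To make the Player model concrete, the following minimal sketch (the file name is hypothetical) creates and starts a Player; the JMF Manager class locates a suitable DataSource and Player implementation for the content type:

import javax.media.Manager;
import javax.media.MediaLocator;
import javax.media.Player;

public class SimplePlayer {
    public static void main(String[] args) throws Exception {
        // Manager selects an appropriate DataSource and Player
        // implementation for the media identified by the locator.
        Player player = Manager.createPlayer(new MediaLocator("file:demo.mpg"));
        // start() implicitly realizes and prefetches the Player,
        // then begins rendering the media stream.
        player.start();
    }
}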
JMF RTP applications are often divided into RTP clients and RTP servers. RTP client applications are those that receive stream data from the network. Examples are multimedia conferencing applications, which need to receive a media stream from an RTP session and render it on the console; a telephone answering machine application is another example, which needs to receive a media stream from an RTP session and store it in a file. RTP server applications are those that transmit captured or stored media streams across the network. For example, in a conferencing application, a media stream might be captured from a video camera and sent out on one or more RTP sessions. The media streams might be encoded in multiple media formats and sent out on several RTP sessions for conferencing with heterogeneous receivers. Some applications are both RTP clients and servers. A session manager in JMF manages the RTP session. More specifically, the JMF RTP APIs:
1. Enable the development of media streaming and conferencing applications in Java.
2. Support media data reception and transmission using RTP and RTCP.
3. Support custom packetizer and depacketizer plug-ins through the JMF 2.0 plug-in architecture.
RTP media transmission and reception using JMF object components that implement the corresponding interfaces are shown in Figure 16 (a) and (b), respectively. As shown in Figure 16, a session manager in JMF is used to coordinate the reception and transmission of RTP streams from/to the network. In RTP, the association among a set of participants communicating with RTP constitutes an RTP media session. A session is defined by a network address plus a port pair for RTP and RTCP. In effect, a JMF session manager is a local representation of a distributed entity, the RTP session (see the earlier discussion of RTP). Thus, the session manager also handles the RTCP control channel, and supports RTCP for both senders and receivers. The JMF SessionManager interface defines methods that enable an application to initialize and start participating in a session, remove individual streams created by the application and close the entire session.
Figure 16: RTP data flow in JMF: (a) transmission—media from a file or capture device flows through a DataSource and Processor to the session manager (out to the network) or to a DataSink (to a file); (b) reception—media from the network flows through the session manager to a DataSource, and then via a Processor to a Player (console), or via a DataSink to a file.
Just like any other media content, Players and Processors are used to present and manipulate RTP media streams (RTPStream) that have been captured using a capture DataSource or stored to a file using a DataSink. The JMF RTP APIs also define several RTP-specific events to report on the state of the RTP session and its streams; RTP events are handled in the standard Java way.
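As an illustration of RTP reception, the following minimal sketch (the session address, port and content type are hypothetical) creates a Player directly from an rtp:// media locator; the underlying session manager joins the RTP session and handles the RTCP control channel on the application's behalf:

import javax.media.Manager;
import javax.media.MediaLocator;
import javax.media.Player;

public class RTPReceiver {
    public static void main(String[] args) throws Exception {
        // An rtp:// locator of the form rtp://address:port/content-type
        // identifies the RTP session to join (the values are hypothetical).
        MediaLocator session = new MediaLocator("rtp://224.2.2.2:22224/audio");
        // createPlayer() waits until RTP data arrives on the session;
        // receiver reports are then sent automatically via RTCP.
        Player player = Manager.createPlayer(session);
        player.start();
    }
}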
FUTURE TRENDS
The Internet will continue to be very dynamic, and it is somewhat dangerous to predict its future technological trends. In this section, we restrict ourselves to the enabling technologies that will affect the development and deployment of real-time multimedia applications, discussing high-bandwidth connections, Internet appliances, the new generation of protocols and high-level development environments.
Fast Internet access and high-bandwidth connectivity will expand from business to residential communities. In the past few years, we have seen a rapid increase in the options for high-speed Internet access, and this trend will continue. These options include the penetration of ISDN technologies and various Digital Subscriber Line (DSL) technologies. Cable TV infrastructure will be upgraded to enable 1.2 Mbps to 27 Mbps shared-capacity Internet access speeds. Wireless Internet access from terrestrial and satellite service providers is a new option for business and residential users.
Internet appliances will evolve beyond the currently dominant PC to new types of devices, such as Internet TVs and smart phones. These new Internet appliances are based on familiar consumer electronics that non-technical consumers will likely find less intimidating and easier to use than the PC.
The new generation of Internet protocols for real-time multimedia, some of which were described in this chapter, will mostly become standardized or widely deployed. The Internet will evolve to become an integrated service Internet deployed on a large scale.
The development environments for real-time multimedia applications are in their infancy. Multimedia applications are often large and complex, and too little attention has been paid to systematic approaches for taming this complexity (McCanne, Brewer, Katz, Rowe et al., 1997). The notable efforts to provide standard multimedia-networking middleware are Java JMF and CORBA MSF. These two frameworks provide higher level programming interfaces (APIs) that hide the details and complexity of the underlying media network. They are immature, incomplete and still under development. It is expected that these frameworks will become promising, attractive environments in which new emerging Internet multimedia applications will be built in a standard and portable way.
CONCLUSION
The Internet has evolved from a provider of the simple TCP/IP best-effort service to an emerging integrated service Internet. This development provides tremendous opportunities for building real-time multimedia applications over the Internet. In this chapter we have introduced the emerging Internet service models and presented the Internet Integrated Service architecture that supports these service models. The constituent real-time protocols of this architecture are the foundations and the critical support elements for building Internet real-time multimedia applications, and we have described them in some detail. Multimedia applications are often large and complex. The CORBA Media Streaming
Framework and the Java Media Framework are two emerging environments for implementing Internet multimedia. They provide applications with a set of APIs that hide the underlying details of the Internet real-time protocols. In this chapter, we provided a high-level overview of these two frameworks. Internet multimedia applications developed using them are expected to appear in the near future.
REFERENCES
Braden, R., Zhang, L., Berson, S. and Herzog, S. (1997). Resource ReSerVation Protocol (RSVP)—Version 1 Functional Specification. IETF, RFC 2205.
Braden, R., Clark, D. and Shenker, S. (1994). Integrated Services in the Internet Architecture: An Overview. IETF, RFC 1633, June.
Braden, R. and Zhang, L. (1997). Resource ReSerVation Protocol (RSVP)—Version 1 Message Processing Rules. IETF, RFC 2209, September.
Callas, J., Donnerhacke, L., Finney, H. and Thayer, R. (1998). OpenPGP Message Format. Internet Engineering Task Force, RFC 2440, November.
Clark, D. D. and Tennenhouse, D. L. (1990). Architectural considerations for a new generation of protocols. In SIGCOMM'90 Symposium, Computer Communications Review, ACM Press, 20(4), 200-208.
Crowcroft, J., Handley, M. and Wakeman, I. (1999). Internetworking Multimedia. Morgan Kaufmann Publishers.
Deering, S. E. (1991). Multicast Routing in a Datagram Internetwork. PhD thesis, Stanford University, December.
Deering, S. E. and Cheriton, D. R. (1990). Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(5), 85-110.
Dept. of Commerce. (1998). The Emerging Digital Economy. United States, April.
Floyd, S., Jacobson, V., Liu, C., McCanne, S. and Zhang, L. (1997). A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Transactions on Networking, 5(6), 784-803.
Handley, M. and Jacobson, V. (1998). SDP: Session Description Protocol. Internet Engineering Task Force, RFC 2327, April.
Handley, M., Perkins, C. and Whelan, E. (1999). Session Announcement Protocol. Internet Engineering Task Force, Internet-Draft.
Handley, M., Schulzrinne, H., Schooler, E. and Rosenberg, J. (2000). SIP: Session Initiation Protocol. Internet Engineering Task Force, Internet-Draft.
Hofmann, M. (1996). A generic concept for large-scale multicast. In Proceedings of the International Zurich Seminar on Digital Communications (IZS'96), Springer-Verlag, February.
Housley, R. (1999). Cryptographic Message Syntax. Internet Engineering Task Force, RFC 2630, June.
ITU. (1998). Packet-Based Multimedia Communication Systems, Recommendation H.323. Telecommunication Standardization Sector of ITU, Geneva, Switzerland, February.
Jacobson, V. and Casner, S. (1998). Compressing IP/UDP/RTP Headers for Low-Speed Serial Links. Internet Engineering Task Force, Internet-Draft, December.
Meyer, D. (1998). Administratively Scoped IP Multicast. Internet Engineering Task Force, RFC 2365, July.
McCanne, S., Brewer, E., Katz, R. and Rowe, L. (1997). Toward a common infrastructure
Building Internet Multimedia Applications 53
for multimedia-networking middleware. In Proceedings of the 7th Int’l Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSAV’97), At. Louis, Missouri, May. OMG. (1998). Control and management of audio/video streams. In CORBAtelecoms: Telecommunications Domain Specifications, Chapter 2, Object Management Group, June. OMG. (1999), The Common Object Request Broker: Architecture and Specification (Revision 2.3), Object Management Group (OMG), Framingham, MA., June. Obraczka, K. (1998). Multicast transport protocols: A survey and taxonomy. IEEE Communications Magazine, 36(1), January. Paul, S., Sabnani, K.K, Lin, J. and Bhattacharyya, S. (1997). Reliable multicast transport protocol (RMTP). IEEE JSAC, Special issue on Network Support for Multipoint Communication. Schmidt, D.(2000). Distributed Object Computing with CORBA Middleware. Retrieved on the World Wide Web: http://www.cs.wustl.edu/~schmidt/corba.html. Schulzrinne, H. (1995). Internet services: From electronic mail to real-time multimedia. In Proceedings of KIVS’95, 21-34, Chemnitz, Germany, February, Springer Verlag. Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, J. (2000). RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force, Internet Draft, 14 July. Schulzrinne, H. and Casner, S. L. (1999). RTP Profile for Audio and Video Conferences with Minimum Control. Internet Engineering Task Force, Internet Draft, October 21. Schulzrinne, H., Rao, A. and Lanphier, R. (1998). Real-time Streaming Protocol (RTSP). Internet Engineering Task Force, Network Working Group, RFC 2326, April. Sun Microsystems. (1999). Java Media Framework API Guide, JMF 2.0 FCS, November 19. Vinoski, S. (1997). CORBA: Integrating diverse applications within distributed heterogeneous environments. IEEE Communications Magazine, 14(2), February. Yang, Z. and Duddy K. (1996). CORBA: A platform for distributed object computing. ACM Operating Systems Review, 30(2), 4-31, April. Yang, Z., Sun, Z., Sattar, A. and Yang, Y. (1999). On clock-based distributed multimedia synchronization. In Proceedings of the Sixth International Conference on Distributed Multimedia Systems (DMS’99), Aizu, Japan, 26-30, IEEE CS Press. Zhang, L., Deering, S., Estrin, D., Shenker, S. and Zappala, D. (1993). RSVP: A new Resource ReSerVation Protocol. IEEE Network, 7, 8-18, September.
Chapter III
The Design and Performance of a CORBA Audio/Video Streaming Service

Naga Surendran and Yamuna Krishamurthy
Washington University-St. Louis, USA

Douglas C. Schmidt
University of California, Irvine, USA
INTRODUCTION Advances in network bandwidth and CPU processing power have enabled the emergence of multimedia applications, such as teleconferencing or streaming video, that exhibit significantly more diverse and stringent quality-of-service (QoS) requirements than traditional data-oriented applications, such as file transfer or email. For instance, popular Internet-based streaming mechanisms, such as Realvideo (RealNetworks, 1998) and Vxtreme (Vxtreme, 1998), allow suppliers to transmit continuous streams of audio and video packets to consumers. Likewise, non-continuous media applications, such as medical imaging servers (Hu et al., 1998) and network management agents (Schmidt and Suda, 1994), employ streaming to transfer bulk data efficiently from suppliers to consumers. However, many distributed multimedia applications rely on custom and/or proprietary low-level stream establishment and signaling mechanisms to manage and control the presentation of multimedia content. These types of applications run the risk of becoming obsolete as new protocols and services are developed (Huard and Lazar, 1998). Fortunately, there is a general trend to move from programming custom applications manually to integrating applications using reusable components based on open distributed object computing (DOC) middleware, such as CORBA (Object Management Group, 1999), DCOM (Box, 1997), and Java RMI (Wollrath et al., 1996).
Although DOC middleware is well-suited to handle request/response interactions among client/server applications, the stringent QoS requirements of multimedia applications have historically precluded DOC middleware from being used as their data transfer mechanism (Pyarali et al., 1996). For instance, inefficient CORBA Internet Inter-ORB Protocol (IIOP) (Gokhale and Schmidt, 1999) implementations perform excessive data copying and memory allocation per request, which increases packet latency (Gokhale and Schmidt, 1998). Likewise, inefficient marshaling/demarshaling in DOC middleware decreases streaming data throughput (Gokhale and Schmidt, 1996). As the performance of DOC middleware steadily improves, however, the stream establishment and control components of distributed multimedia applications can benefit greatly from the portability and flexibility provided by DOC middleware. Therefore, to facilitate the development of standards-based distributed multimedia applications, the Object Management Group (OMG) has defined the CORBA Audio/Video (A/V) Streaming Service specification (OMG, 1997a), which defines common interfaces and semantics necessary to control and manage A/V streams. The CORBA A/V Streaming Service specification defines an architecture for implementing open distributed multimedia streaming applications. This architecture integrates (1) well-defined modules, interfaces and semantics for stream establishment and control with (2) efficient data transfer protocols for multimedia data transmission. In addition to defining standard stream establishment and control mechanisms, the CORBA A/V Streaming Service specification allows distributed multimedia applications to leverage the inherent portability and flexibility benefits provided by standards-based DOC middleware. Our prior research on CORBA middleware has explored the efficiency, predictability and scalability aspects of ORB endsystem design, including static (Schmidt et al., 1998a) and dynamic (Gill et al., 2001) scheduling, I/O subsystem (Kuhns et al., 1999) and pluggable ORB transport protocol (O'Ryan et al., 2000) integration, synchronous (Schmidt et al., 2001) and asynchronous (Arulanthu et al., 2000) ORB Core architectures, event processing (Harrison et al., 1997), optimization principle patterns for ORB performance (Pyarali et al., 1999), and the performance of various commercial and research ORBs (Gokhale and Schmidt, 1996; Schmidt et al., 1998b) over high-speed ATM networks. This chapter focuses on another important topic in ORB endsystem research: the design and performance of the CORBA A/V Streaming Service specification. The vehicle for our research on the CORBA A/V Streaming Service is TAO (Schmidt et al., 1998a). TAO is a high-performance, real-time Object Request Broker (ORB) endsystem targeted for applications with deterministic and statistical QoS requirements, as well as best-effort requirements. The TAO ORB endsystem contains the network interface, OS I/O subsystem, communication protocol and CORBA-compliant middleware components and services shown in Figure 1. Figure 1 also illustrates how TAO's A/V Streaming Service is built over the TAO ORB subsystem. TAO's real-time I/O (RIO) (Kuhns et al., 2001) subsystem runs in the OS kernel and sends/receives requests to/from clients across high-speed, QoS-enabled networks, such as ATM or IP Integrated (IETF, 2000b) and Differentiated (IETF, 2000a) Services.
TAO’s ORB components, such as its ORB Core, Object Adapter, stubs/skeletons and servants, run in user-space and handle connection management, data transfer, endpoint and request demultiplexing, concurrency, (de)marshaling and application operation processing. TAO’s A/V Streaming Service is implemented atop its user-space ORB components. At the heart of TAO’s A/V Streaming Service is its pluggable A/V protocol framework. This framework
Figure 1: Layering of TAO’s A/V Streaming Service atop the TAO ORB endsystem
provides the "glue" that integrates TAO's A/V Streaming Service with the underlying I/O subsystem protocols and network interfaces. The remainder of this chapter is organized as follows: first, we illustrate how we applied patterns to develop and optimize the CORBA A/V Streaming Service to support the standard OMG interfaces; second, we describe two case studies that illustrate how to develop distributed multimedia applications using TAO's A/V Streaming Service and its pluggable A/V protocol framework; third, we present the results of empirical benchmarks we conducted to illustrate the performance of TAO's A/V Streaming Service; fourth, we outline our plans for future work and finally present concluding remarks. For completeness, we include three appendices that outline the intents of all the patterns applied in TAO's A/V Streaming Service, summarize the CORBA reference model, and illustrate the various point-to-point and point-to-multipoint stream and flow endpoint bindings implemented in TAO's A/V Streaming Service.
THE DESIGN OF TAO’S AUDIO/VIDEO STREAMING SERVICE This section first presents an overview of the key architectural components in the CORBA A/V Streaming Service. We then summarize the key design challenges faced when developing TAO’s CORBA A/V Streaming Service and outline how we applied patterns (Gamma et al., 1995; Buschmann et al., 1996; Schmidt et al., 2000) to resolve these challenges. Finally, we describe the design and performance of the pluggable A/V protocol framework integrated into TAO’s A/V Streaming Service.
Overview of the CORBA Audio/Video Streaming Service Specification The CORBA Audio/Video (A/V) Streaming Service specification (OMG, 1997a) defines an architectural model and standard OMG IDL interfaces that can be used to build interoperable distributed multimedia streaming applications. Below, we outline the architectural components and goals of the CORBA A/V Streaming Service specification.
Synopsis of Components in the CORBA A/V Streaming Service The CORBA A/V Streaming Service specification defines a flow as a continuous transfer of media between two multimedia devices. Each of these flows is terminated by a flow endpoint. A set of flows, such as an audio flow, a video flow and a data flow, constitutes a stream, which is terminated by a stream endpoint. A stream endpoint can have multiple flow endpoints. Figure 2 shows a multimedia stream, which is represented as a flow between two flow endpoints. One flow endpoint acts as a source of the data and the other flow endpoint acts as a sink.

Figure 2: CORBA A/V Streaming Service architecture

Note that the control and signaling operations pass through the GIOP/IIOP path of the ORB, demarcated by the dashed box. In contrast, the data stream uses out-of-band stream(s),
which can be implemented using communication protocols that are more suitable for multimedia streaming than IIOP. Maintaining this separation of concerns is crucial to meeting end-to-end QoS requirements. Each stream endpoint consists of three logical entities: (1) a stream interface control object that exports an IDL interface, (2) a data source or sink and (3) a stream adaptor that is responsible for sending and receiving frames over a network. Control and Management objects are responsible for the establishment and control of streams. The CORBA A/V Streaming Service specification defines the interfaces and interactions of the Stream Interface Control Objects and the Control and Management objects. The section CORBA A/V Streaming Service Components describes the various components in Figure 2 in detail.
Synopsis of Goals for the CORBA A/V Streaming Service
The goals of the CORBA A/V Streaming Service include the following:
• Standardized stream establishment and control protocols. Using these protocols, consumers and suppliers can be developed independently, while still being able to establish streams with one another.
• Support for multiple data transfer protocols. The CORBA A/V Streaming Service architecture separates its stream establishment and control protocols from its data transfer protocols, such as TCP, UDP, RTP or ATM, thereby allowing applications to select the data transfer protocols most suitable for a particular network environment or set of application requirements.
• Provide interoperability of flows. A flow specification is passed between two stream endpoints to convey per-flow information, such as format, network host name and address, and flow protocol, required to bind or communicate between two multimedia devices.
• Support many types of sources and sinks. Common stream sources include video-on-demand servers, video cameras attached to a network, and stock quote servers. Common sinks include video-on-demand clients, display devices attached to a network, and stock quote clients.
Overview of Design Challenges and Resolutions Below, we present an overview of the key challenges faced when we developed TAO's CORBA A/V Streaming Service and outline how we applied patterns (Gamma et al., 1995; Schmidt et al., 2000) to resolve these challenges. Later sections then examine these design and optimization pattern techniques in more depth. Appendix 1 outlines the intents of all the patterns applied in TAO's A/V Streaming Service. Flexibility in stream endpoint creation strategies. The CORBA A/V Streaming Service specification defines the interfaces and roles of stream components. Many performance-sensitive multimedia applications require fine-grained control over the strategies governing the creation of their stream components. For instance, our past studies of Web server performance (Hu et al., 1997; Hu et al., 1998) motivate the need to support adaptive concurrency strategies to develop efficient and scalable streaming applications. In the context of our A/V Streaming Service, we determined that the supplier-side of our MPEG case-study application (described in the section Case Study 1: An MPEG A/V Streaming Application) required a process-based concurrency strategy to maximize stream throughput by allowing parallel processing of separate streams. Other types of applications required different implementations, however. For example, the consumer-side of our
MPEG application benefited from the creation of reactive (Schmidt, 1995) consumers that contain all related endpoints within a single process. To achieve a high degree of flexibility, therefore, TAO's A/V Streaming Service design decouples the behavior of stream components from the strategies governing their creation. We achieved this decoupling via the Factory Method and Abstract Factory patterns (Gamma et al., 1995).
Flexibility in data transfer protocol. A CORBA A/V Streaming Service implementation may need to select from a variety of transfer protocols. For instance, an Internet-based streaming application, such as Realvideo (RealNetworks, 1998), may use the UDP protocol, whereas a local intranet video-conferencing tool might prefer the QoS features offered by native high-speed ATM protocols. Likewise, RTP (Schulzrinne et al., 1994) is gaining acceptance as a transfer protocol for streaming audio and video data over the Internet. Thus, it is essential that an A/V Streaming Service support a range of data transfer protocols dynamically. The CORBA A/V Streaming Service defines a simple specialized protocol, called the Simple Flow Protocol (SFP), which makes no assumptions about the communication protocols used for data streaming and provides architecture-independent transfer of flow content. Consequently, the stream establishment components in TAO's A/V Streaming Service provide flexible mechanisms that allow applications to define and use multiple network programming APIs, such as sockets and TLI, and multiple communication protocols, such as TCP, UDP, RTP or ATM. Therefore, another design challenge we faced was to define stream establishment components that can work with a variety of data transfer protocols. To resolve this challenge, we applied the Strategy pattern (Gamma et al., 1995).
Providing a uniform API for different flow protocols. The CORBA A/V Streaming Service specification defines the flow specification syntax that can be used for connection establishment. It defines the protocol names and syntax for specifying the flow and data transfer protocol information, but it does not define any interfaces for protocol implementations. We resolved this omission with our pluggable A/V protocol framework (described in the section Overview of TAO's Pluggable A/V Protocol Framework) using design patterns, described in Appendix 1, such as Layers (Buschmann et al., 1996), Acceptor-Connector (Schmidt et al., 2000), Facade and Abstract Factory (Gamma et al., 1995). Moreover, TAO's A/V Streaming Service defines a uniform API for the different flow protocols, such as RTP and SFP, that can handle variations using standard CORBA policies.
Flexibility in stream control interfaces. A/V streaming middleware should provide flexible mechanisms that allow developers to define and use different operations for different streams. For instance, a video application typically supports a variety of operations, such as play, stop and rewind. Conversely, a stream in a stock quote application may support other operations, such as start and stop. Since the operations provided by the stream are application-defined, it is useful for the control logic component in streaming middleware to be flexible and adaptive. Therefore, another design challenge facing designers of CORBA A/V Streaming Services is to allow applications the flexibility to define their own stream control interfaces and access these interfaces in an extensible, type-safe manner.
In TAO's A/V Streaming Service implementation, we used the Extension Interface pattern (Schmidt et al., 2000) to resolve this challenge.
Flexibility in managing the states of stream suppliers and consumers. The data transfer component of a streaming application often must change behavior depending on the current
state of the system. For instance, invoking the play operation on the stream control interface of a video supplier may cause it to enter the PLAYING state. Likewise, sending it the stop operation may cause it to transition to the STOPPED state. More complex state machines can result from additional operations, such as rewind and fast-forward. Thus, an important design challenge for developers is designing flexible applications whose states can be extended. Moreover, in each state, the behavior of supplier/consumer applications, and of the A/V Streaming Service itself, must be well-defined. To address this issue we applied the State pattern (Gamma et al., 1995).
Providing a uniform interface for full and light profiles. To allow developers and applications to control and manage flows and streams, the CORBA A/V Streaming Service specification exposes certain IDL interfaces. There are two levels of exposure defined by the CORBA A/V Service: (1) the light profile, where only the stream and stream endpoint interfaces are exposed and the flow interfaces are not, and (2) the full profile, where flow interfaces are also exposed. This two-level design provides more flexibility and granularity of control to applications and developers since flow interfaces are CORBA interfaces and are not locality constrained. Therefore, the design challenge was to define a uniform interface for both the light and full profiles to make use of TAO's pluggable A/V protocol framework. We resolved this challenge by deriving the full and light profile endpoints from a base interface and by generating the flow specification using the Forward_FlowSpec_Entry and Reverse_FlowSpec_Entry classes.
Providing multipoint-to-multipoint bindings. Different multimedia applications require different stream endpoint bindings. For example, video-on-demand applications require point-to-point bindings between consumer and supplier endpoints, whereas video-conferencing applications require multipoint-to-multipoint bindings. The CORBA A/V specification defines a point-to-multipoint binding, but not a multipoint-to-multipoint binding, which is left as a responsibility of implementers. Thus, we faced the design challenge of providing multipoint-to-multipoint bindings for applications that use multicast protocols provided by the underlying network. We provided a solution based on IP multicast and used the Adapter pattern (Gamma et al., 1995) to adapt it to ATM's multicast model. The Adapter pattern allows multiple components to work together, even if they were not originally designed to work together. This adaptation was done by having TAO's A/V Streaming Service set source IDs for the flow producers so that the flow consumers can distinguish the sources. We added support in both SFP and RTP to allow them to be adapted for such bindings. Our implementation of Vic, described in the section Case Study 2: The Vic Video-Conferencing Application, uses TAO's A/V Streaming Service multipoint-to-multipoint binding and its RTP adapter.
CORBA A/V Streaming Service Components The CORBA A/V Streaming Service specification defines a set of standard IDL interfaces that can be implemented to provide a reusable framework for distributed multimedia streaming applications. Figure 3 illustrates the key components of the CORBA A/V Streaming Service. This subsection describes the design of TAO’s A/V Streaming Service components shown in Figure 3. The corresponding IDL interface name for each role is provided in brackets. In addition, we illustrate how TAO provides solutions to the design challenges outlined in the previous section.
Figure 3: A/V Streaming Service components
Multimedia Device Factory (MMDevice) An MMDevice abstracts the behavior of a multimedia device. The actual device can be physical, such as a video microphone or speaker, or logical, such as a program that reads video clips from a file or a database that contains information about stock prices. There is typically one MMDevice per physical or logical device. For instance, a particular device might support MPEG-1 (ISO, 1993) compression or ULAW audio (SUN Microsystems, 1992). Such parameters are termed "properties" of the MMDevice. Properties can be associated with the MMDevice using the CORBA Property Service (OMG, 1996), as shown in Figure 4. An MMDevice is also an endpoint factory that creates new endpoints for new stream connections. Each endpoint consists of a pair of objects: (1) a virtual device (VDev), which encapsulates the device-specific parameters of the connection and (2) the StreamEndpoint, which encapsulates the data transfer-specific parameters of the connection. The MMDevice component also encapsulates the implementation of strategies that govern the creation of the VDev and StreamEndpoint objects. For instance, the implementation of MMDevice in TAO's A/V Streaming Service provides the following two concurrency strategies:
• Process-based strategy: The process-based concurrency strategy creates new virtual devices and stream endpoints in a new process, as shown in Figure 5. This strategy is useful for applications that create a separate process to handle each new endpoint. For instance, the supplier in our MPEG player application described in the section Case Study 1: An MPEG A/V Streaming Application creates separate processes to stream the audio and video data to the consumer concurrently.
• Reactive strategy: In this strategy, endpoint objects for each new stream are created in the same process as the factory, as shown in Figure 6. Thus, a single process handles all the simultaneous connections reactively (Schmidt, 1995). This strategy is useful for applications that dedicate one process to control multiple streams. For instance, to minimize synchronization overhead, the consumer of the MPEG A/V player application uses this strategy to create the audio and video endpoints in the same process.
In TAO's A/V Streaming Service, the MMDevice uses the Abstract Factory pattern (Gamma et al., 1995) to decouple (1) the creation strategy of the stream endpoint and virtual device from (2) the concrete classes that define it.
Figure 4: Multimedia device factory
Figure 5: MMDevice process-based concurrency strategy
Figure 6: MMDevice reactive concurrency strategy
Thus, applications that use the MMDevice can subclass both the strategies described above, as well as the StreamEndpoint and the VDev that are created. The Abstract Factory pattern allows applications to customize the concurrency strategies to suit their needs. For instance, by default, the reactive strategy creates new stream endpoints using dynamic allocation, e.g., via the new operator in C++. Applications can override this behavior via subclassing so they can allocate stream endpoints using other allocation techniques, such as thread-specific storage (Schmidt et al., 2000) or special frame buffers.
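To make this decoupling concrete, the following minimal C++ sketch shows the shape of such an endpoint factory hierarchy. The class names and signatures here are illustrative assumptions for exposition, not TAO's exact API:

// Placeholder types standing in for the A/V Streaming Service classes.
class StreamEndpoint { /* data transfer-specific parameters */ };
class VDev { /* device-specific parameters */ };

// Abstract Factory: separates *what* is created (endpoints) from
// *how* and *where* they are created (the concurrency strategy).
class Endpoint_Strategy
{
public:
  virtual ~Endpoint_Strategy () {}
  virtual StreamEndpoint *make_stream_endpoint () = 0;
  virtual VDev *make_vdev () = 0;
};

// Reactive strategy: endpoints live in the factory's own process, so
// a single event loop can service all simultaneous connections.
class Reactive_Endpoint_Strategy : public Endpoint_Strategy
{
public:
  virtual StreamEndpoint *make_stream_endpoint ()
  {
    return new StreamEndpoint; // could use thread-specific storage, etc.
  }

  virtual VDev *make_vdev ()
  {
    return new VDev;
  }
};

A process-based strategy would instead create the endpoint objects in a newly spawned process, and applications can subclass either strategy to customize how the endpoints are allocated.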
Virtual Device (VDev) The virtual device (VDev) component is created by the MMDevice factory in response to a request for a new stream connection. There is one VDev per stream. The VDev is used by an application to define its response to configure requests. For instance, if a consumer of a stream wants to use the MPEG video format, it can invoke the configure operation on the supplier's VDev, as shown in Figure 7. Stream establishment is a mechanism defined by the CORBA A/V Streaming Service specification to permit the negotiation of QoS parameters via properties. Properties are name-value pairs, i.e., they have a string name and a corresponding value. The properties used by the A/V Streaming Service are implemented using the CORBA Property Service (OMG, 1996). The CORBA A/V Streaming Service specification specifies the names of the common properties used by the VDev objects. For instance, the property currformat is a string that contains the current encoding format, e.g., "MPEG." During stream establishment, each VDev can use the get_property_value operation on its peer VDev to ensure that the peer uses the same encoding format. When a new pair of VDev objects is created, each VDev uses the configure operation on its peer to set the stream configuration parameters. If the negotiation fails, the stream can be torn down and its resources released immediately. The section Interaction Between Components in the CORBA Audio/Video Streaming Service Model describes the CORBA A/V Streaming Service stream establishment protocol in detail.
Media Controller (MediaCtrl) The Media Controller (MediaCtrl) is an IDL interface that defines operations for controlling a stream. A MediaCtrl interface is not defined by the CORBA A/V Streaming Service specification. Instead, it is defined by multimedia application developers to support operations for a specific stream, such as the following IDL interface for a video service:

interface video_media_control
{
  void select_video (string name_of_movie);
  void play ();
  void rewind (short num_frames);
  void pause ();
  void stop ();
};
Figure 7: Virtual device
The CORBA A/V Streaming Service provides developers with the flexibility to associate an application-defined MediaCtrl interface with a stream. Thus, the A/V Streaming Service can be used with an extensible variety of streams, such as audio and video, as well as non-multimedia streams, such as a stream of stock quotes. The VDev object represents device-specific parameters, such as compression format or frame rate. Likewise, the MediaCtrl interface is device-specific since different devices support different control interfaces. Therefore, the MediaCtrl is associated with the VDev object using the Property Service (OMG, 1996). There is typically one MediaCtrl per stream. In some cases, however, application developers may choose to control multiple streams using the same MediaCtrl. For instance, the video and audio streams for a movie might have a common MediaCtrl to enable a single CORBA operation, such as play, to start both audio and video playback simultaneously.
Stream Controller (StreamCtrl) The Stream Controller (StreamCtrl) interface abstracts a continuous media transfer between virtual devices (VDevs). It supports operations to bind two MMDevice objects together using a stream. Thus, the StreamCtrl component binds the supplier and consumer of a stream, e.g., a video-camera and a display. It is the key participant in the Stream Establishment protocol described in the section Interaction Between Components in the CORBA Audio/Video Streaming Service Model. In general, a StreamCtrl object is instantiated by an application developer. There is one StreamCtrl per stream, i.e., per consumer/supplier pair.
Stream Endpoint (StreamEndpoint) The StreamEndpoint object is created by an MMDevice in response to a request for a new stream. There is one StreamEndpoint per stream. A StreamEndpoint encapsulates the data transfer-specific parameters of a stream. For instance, a stream that uses UDP as its data transfer protocol will identify its StreamEndpoint via a host name and port number. In TAO's A/V Streaming Service, the StreamEndpoint implementation uses patterns, such as Double Dispatch and Template Method (Gamma et al., 1995), described in Appendix 1, to allow applications to define and exchange data transfer-level parameters flexibly. This interaction is shown in Figure 8 and occurs in the following steps:
1. An A/V streaming application can inherit from the StreamEndpoint class and override the operation handle_connection_requested in the new subclass TCP_StreamEndpoint.
2. When binding two MMDevices, the StreamCtrl invokes connect on one StreamEndpoint with the peer TCP_StreamEndpoint as a parameter.
3. The StreamEndpoint then requests the TCP_StreamEndpoint to establish the connection for this stream using the network addresses it is listening on.
4. The virtual handle_connection_requested operation of the TCP_StreamEndpoint is invoked and connects with the listening network address on the peer side.
Thus, by applying patterns, the StreamEndpoint design allows each application to configure its own data transfer protocol, while reusing the generic stream establishment control logic in TAO's A/V Streaming Service.
Figure 8: Interaction between StreamEndpoint and a multimedia application
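The following hedged sketch shows the double-dispatch hook at the core of steps 1-4. The StreamEndpoint, TCP_StreamEndpoint and handle_connection_requested names appear in the discussion above; the simplified signatures and bodies are assumptions of this example:

class StreamEndpoint
{
public:
  virtual ~StreamEndpoint () {}

  // Generic establishment logic (Template Method): exchange endpoint
  // addresses, then double-dispatch to the peer's protocol-specific hook.
  int connect (StreamEndpoint &peer)
  {
    // ... advertise the addresses this endpoint listens on (elided) ...
    return peer.handle_connection_requested (*this);
  }

  // Protocol-specific hook overridden by application subclasses.
  virtual int handle_connection_requested (StreamEndpoint &initiator) = 0;
};

class TCP_StreamEndpoint : public StreamEndpoint
{
public:
  // React to a connection request by actively connecting a TCP socket
  // to the initiator's listening address (socket calls elided).
  virtual int handle_connection_requested (StreamEndpoint & /* initiator */)
  {
    return 0; // 0 = success in this sketch
  }
};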
Interaction Between Components in the CORBA Audio/Video Streaming Service Model The preceding discussion described the structure of components that constitute the CORBA A/V Streaming Service. Below, we describe how these components interact to provide two key A/V Streaming Service features: stream establishment and flexible stream control.
Stream Establishment Stream establishment is the process of binding two peers who need to communicate via a stream. The CORBA A/V Streaming Service specification defines a standard protocol to establish a binding between streams. Several A/V Streaming Service components are involved in stream establishment. A key motivation for providing an elaborate stream establishment protocol is to allow components to be configured independently. This design allows the stream establishment protocol to remain standard, while still providing sufficient hooks for multimedia application developers to customize the process for a specific set of requirements. For instance, an MMDevice can be configured to use one of several concurrency strategies to create stream endpoints. Thus, at each stage of the stream establishment process, individual components can be configured to implement desired policies. The CORBA A/V Streaming Service specification identifies two peers in stream establishment, which are known as the "A" party and the "B" party. These terms define complementary relationships, i.e., a stream always has an A party at one end and a B party at the other. The A party may be the sink, i.e., the consumer, of a video stream, whereas the B party may be the source, i.e., the supplier, of a video stream, and vice versa. Note that the CORBA A/V Streaming Service specification defines two distinct IDL interfaces for the A and B party endpoints. Hence, for a given stream, there will be two distinct types for the supplier and the consumer. Thus, the CORBA A/V Streaming Service specification ensures that the complementary relationship between suppliers and consumers is type-safe. An exception will be raised if a supplier tries to establish a stream with another supplier accidentally. Stream establishment in TAO's A/V Streaming Service occurs in several steps, as illustrated in Figure 9. This figure shows a stream controller (aStreamCtrl) binding the A party together with the B party of a stream. The stream controller need not be collocated with either end of a
Figure 9: Stream establishment protocol in the A/V Streaming Service
stream. To simplify the example, however, we assume that the controller is collocated with the A party and is called aStreamCtrl. Each step shown in Figure 9 is explained below:
1. The aStreamCtrl binds two Multimedia Device (MMDevice) objects together: Application developers invoke the bind_devs operation on aStreamCtrl. They provide the controller with the object references of two MMDevice objects. These objects are factories that create the two StreamEndpoints of the new stream.
2. Stream endpoint creation: In this step, aStreamCtrl requests the MMDevice objects, i.e., aMMDevice and bMMDevice, to create the StreamEndpoints and VDev objects. The aStreamCtrl invokes the create_A and create_B operations on the two MMDevice objects. These operations request them to create the A_Endpoint and B_Endpoint endpoints, respectively.
3. VDev configuration: After the two peer VDev objects have been created, they can use the configure operation to exchange device-level configuration parameters. For instance, these parameters can be used to designate the video format and compression technique used for subsequent stream transfers.
4. Stream setup: In this step, aStreamCtrl invokes the connect operation on the A_Endpoint. This operation instructs the A_Endpoint to initiate a connection with its peer. The A_Endpoint initializes its data transfer endpoints in response to this operation. In TAO's A/V Streaming Service, applications can customize this behavior using the Double Dispatch pattern (Gamma et al., 1995).
5. Stream establishment: In this step, the A_Endpoint invokes the request_connection operation on its peer endpoint. The A_Endpoint passes its network endpoint parameters, e.g., hostname and port number, as parameters to this operation. When the B_Endpoint receives the request_connection operation, it initializes its end of the data transfer connection. It subsequently connects to the data transfer endpoint passed to it by the A_Endpoint.
After completing these five stream establishment protocol steps, a data transfer-level stream is established between the two endpoints of the stream. Later, we will describe how the Media Controller (MediaCtrl) can control an established stream, e.g., by starting or stopping the stream.
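From an application's perspective, one operation triggers this entire protocol. The hedged sketch below uses the bind_devs signature from the OMG A/V specification; the helper name, the stub header path and the empty QoS/flow sequences are assumptions of this example:

#include "orbsvcs/AVStreamsC.h" // TAO's generated stubs (path may vary)

CORBA::Boolean
establish_stream (AVStreams::StreamCtrl_ptr stream_ctrl, // aStreamCtrl
                  AVStreams::MMDevice_ptr a_mmdevice,
                  AVStreams::MMDevice_ptr b_mmdevice)
{
  AVStreams::streamQoS the_qos;  // empty: request default QoS
  AVStreams::flowSpec the_spec;  // empty: bind all flows
  // This one call drives all five establishment steps shown in Figure 9.
  return stream_ctrl->bind_devs (a_mmdevice, b_mmdevice,
                                 the_qos, the_spec);
}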
Stream Control Each MMDevice endpoint factory can be configured with an application-defined MediaCtrl interface, as described above. Each stream has one MediaCtrl and every MediaCtrl controls one stream. Thus, if a particular movie has two streams, one for audio and the other for video, it will have two MediaCtrls. The MediaCtrl applies the Extension Interface pattern (Schmidt et al., 2000) outlined in Appendix 1. After a stream has been established by the stream controller, applications can obtain object references to their MediaCtrls from their VDev. These object references control the flow of data through the stream. For instance, a video stream might support certain operations, such as play, rewind and stop, and be used as shown below:

// The Audio/Video Streaming Service invokes this application-defined
// operation to give the application a reference to the media
// controller for the stream.
void
Video_Client_VDev::set_media_ctrl (CORBA::Object_ptr media_ctrl,
                                   CORBA::Environment &env)
{
  // "Narrow" the CORBA::Object pointer into a media controller for the
  // video stream.
  this->video_control_ = Video_Control::_narrow (media_ctrl);
}

The video control interface can be used to control the stream, as follows:

// Select the video to watch.
this->video_control_->select_video ("gandhi");

// Start playing the video stream.
this->video_control_->play ();

// Pause the video.
this->video_control_->stop ();

// Rewind the video 100 frames.
this->video_control_->rewind (100);

When binding two multimedia devices, a flow specification is passed between the two StreamEndpoints to convey per-flow information. A flow specification represents key aspects of a flow, such as its name, format, the flow protocol being used, and the network name and address. A flow specification string is analogous to an interoperable object reference (IOR) in the CORBA object model. The syntax for interoperable flow specifications is shown in Figure 10. Standardizing the flow specifications ensures that two different StreamEndpoints from two different implementations can interoperate. There are two different flow specifications, depending on the direction in which the flowspec is traveling. If it is from the A party's StreamEndpoint to the B party's StreamEndpoint, it is a "forward flowspec"; the opposite direction is the "reverse flowspec." TAO's CORBA A/V Streaming Service implementation defines two classes, Forward_FlowSpec_Entry and Reverse_FlowSpec_Entry, that allow multimedia applications to construct the flow specification string from their components without worrying about the syntactic details.
Figure 10: Flow specification
For example, the entry takes the address as both an INET_Addr and a string and provides convenient parsing utilities for strings.
The Design of a Pluggable A/V Protocol Framework for TAO's A/V Streaming Service At the heart of TAO's A/V Streaming Service is its pluggable A/V protocol framework, which defines a common interface for various flow and data transfer protocols, such as SFP, RTP, TCP, UDP or ATM. This framework provides the "glue" that integrates its ORB components with the underlying I/O subsystem protocols and network interfaces. In this section, we describe the design of the pluggable A/V protocol framework provided in TAO's A/V Streaming Service and describe how we resolved key design challenges that arose when developing this framework.
Overview of TAO's Pluggable A/V Protocol Framework The pluggable A/V protocol framework in TAO's A/V Streaming Service consists of the components shown in Figure 11. Each of these components is described below.

Figure 11: Pluggable A/V protocol components in TAO's A/V Streaming Service

AV_Core. This singleton (Gamma et al., 1995) component is a container for flow and data transfer protocol factories. An application using TAO's A/V implementation must
initialize this singleton before using any of its A/V classes, such as StreamCtrl and MMDevice. During initialization, the AV_Core class loads all the flow protocol factories, control protocol factories and data transfer factories dynamically using the Component Configurator pattern (Schmidt et al., 2000) and creates default instances for each known protocol.
Data Transfer components. The components illustrated in Figure 12 and described below are required for each data transfer protocol:
• Acceptor and Connector: These classes are implementations of the Acceptor-Connector pattern (Schmidt et al., 2000), which are used to accept connections passively and establish connections actively, respectively.
• Transport_Factory: This class is an abstract factory (Gamma et al., 1995) that provides interfaces to create Acceptors and Connectors in accordance with the appropriate type of data transfer protocol.
• Flow_Handler: All data transfer handlers derive from the Flow_Handler class, whose methods can start, stop and provide flow-specific functionality for timeout upcalls to the Callback objects, which are described in the following paragraph.
Callback interface. TAO's A/V Streaming Service uses this callback interface to deliver frames and to notify FlowEndPoints of start and stop events. Multimedia application developers subclass the Callback interface for each flow endpoint, i.e., there are producer and consumer callbacks. TAO's A/V Streaming Service dispatches timeout events automatically so that applications need not write event handling mechanisms. For example, all flow producers are automatically registered for timer events with a Reactor. The value for the timeout is obtained through the get_timeout hook method on the Callback interface. This hook method is called whenever a timeout occurs since multimedia applications typically have adaptive timeout values.
Figure 12: TAO's A/V Streaming Service pluggable data transfer components
Flow protocol components. Flow protocols carry in-band information for each flow that a receiver can use to reproduce the source stream. The following components are required for each flow protocol supported by TAO's A/V Streaming Service:
• Flow_Protocol_Factory: This class is an abstract factory that creates flow protocol objects.
• Protocol_Object: This class defines flow protocol functionality. Applications use this class to send frames and the Protocol_Object uses application-specified Callback objects to deliver frames.
Figure 13 illustrates the relationships among the flow protocol components in TAO's pluggable A/V protocol framework.
AV_Connector and AV_Acceptor Registry. As mentioned above, different data transfer protocols require the creation of corresponding data transfer factories, acceptors and connectors. The AV_Core class creates the AV_Connector and AV_Acceptor registry classes to provide a facade that maintains and accesses the abstract flow and data transfer factories for both light and full profile objects. This design gives users a single interface that hides the complexity of creating and manipulating different data transfer factories.
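For instance, a consumer-side callback for a video flow might be structured as in the sketch below. The receive_frame and get_timeout method names follow the discussion above, but the exact signatures are assumptions and may differ from TAO's headers:

// (TAO/ACE headers omitted; Video_Callback is a hypothetical subclass.)
class Video_Callback : public TAO_AV_Callback
{
public:
  // Invoked by the Protocol_Object whenever a frame arrives.
  virtual int receive_frame (ACE_Message_Block *frame,
                             TAO_AV_frame_info * /* frame_info */,
                             const ACE_Addr & /* peer_address */)
  {
    // Decode and render the frame here (elided); buffer ownership
    // conventions are framework-specific and omitted.
    return 0;
  }

  // Supplies the (possibly adaptive) timeout for the next timer upcall.
  virtual void get_timeout (ACE_Time_Value *&tv, void *& /* arg */)
  {
    static ACE_Time_Value frame_interval (0, 33333); // ~30 frames/second
    tv = &frame_interval;
  }
};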
Applying Patterns to Resolve Design Challenges for Pluggable A/V Protocol Frameworks Below, we outline the key design challenges faced when developing TAO's pluggable A/V protocol framework and discuss how we resolved these challenges by applying various patterns (Gamma et al., 1995; Buschmann et al., 1996; Schmidt et al., 2000).
Adding New Data Transfer Protocols Transparently
• Context: Different multimedia applications often have different QoS requirements. For example, a video application over an intranet may want to take advantage of native ATM protocols to reserve bandwidth. An audio application in a video-conferencing application may want to use a reliable data transfer protocol, such as TCP, since audio loss is more noticeable to users than video loss and the bit-rate of audio flows is low (approximately 8 kbps using GSM compression). In contrast, a video application might not want the overhead of retransmission and slow-start congestion control incurred by TCP (Stevens, 1993); it may instead use an unreliable data transfer protocol, such as UDP, since losing a small number of frames may not affect perceived QoS.
Figure 13: TAO’s A/V Streaming Service pluggable A/V protocol components
Figure 14: Connector registry
• Problem: It should be possible to add new data transfer protocols to TAO's pluggable A/V protocol framework without modifying the rest of TAO's A/V Streaming Service. Thus, the framework must be open for extensions but closed to modifications, i.e., the Open/Closed principle (Meyer, 1989). Ideally, creating a new protocol and configuring it into TAO's pluggable A/V protocol framework should be all that is required.
• Solution: Use a registry to maintain a collection of abstract factories, based on the Abstract Factory pattern (Gamma et al., 1995). In this pattern, a single class defines an interface for creating families of related objects, without specifying their concrete types. Subclasses of abstract factories are responsible for creating concrete classes that collaborate amongst themselves. In the context of pluggable A/V protocols, each abstract factory can create concrete Connector and Acceptor classes for a particular protocol.
• Applying this solution in TAO's A/V Streaming Service: In TAO's A/V Streaming Service, the Connector_Registry plays the role of the protocol registry. This registry is created by the AV_Core class. Figure 14 depicts the Connector_Registry and its relation to the abstract factories. These factories are accessed via a facade defined according to the Facade pattern (Gamma et al., 1995). This design hides the complexity of manipulating multiple factories behind a simpler interface; the Connector_Registry described above plays the facade role.
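The following self-contained sketch illustrates the registry-of-factories structure. The names mirror those used above (Transport_Factory, Acceptor, Connector, Connector_Registry), but the code is a simplified illustration rather than TAO's implementation:

#include <map>
#include <string>

class Acceptor  { /* passive connection establishment */ };
class Connector { /* active connection establishment */ };

// Abstract factory for one data transfer protocol.
class Transport_Factory
{
public:
  virtual ~Transport_Factory () {}
  virtual Acceptor  *make_acceptor () = 0;
  virtual Connector *make_connector () = 0;
};

class Connector_Registry
{
public:
  // Plugging in a new protocol is one registration call; no other
  // A/V Streaming Service code changes (the Open/Closed principle).
  void register_factory (const std::string &protocol, Transport_Factory *f)
  {
    factories_[protocol] = f;
  }

  Transport_Factory *lookup (const std::string &protocol) const
  {
    std::map<std::string, Transport_Factory *>::const_iterator i =
      factories_.find (protocol);
    return i == factories_.end () ? 0 : i->second;
  }

private:
  std::map<std::string, Transport_Factory *> factories_;
};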
Adding New A/V Protocols Transparently
• Context: Multimedia flows often require a flow protocol since most multimedia flows need to carry in-band information for the receiver to reproduce the source stream. For example, every frame may need a timestamp so that the receiver can play the frame at the right time. Moreover, sequence numbers will be needed if a connectionless
protocol, such as UDP, is used so that applications can resequence frames. In addition, multicast flows may require information, such as a source identification number, to demultiplex flows from different sources. SFP is a simple flow protocol defined by the CORBA A/V Streaming Service specification to transport in-band data. Likewise, the Real-time Transport Protocol (RTP) (Schulzrinne et al., 1994) defines facilities to transport in-band data. RTP is Internet-centric, however, and cannot carry CORBA IDL-typed flows directly. For example, RTP specifies that all header fields should be in network-byte order, whereas SFP uses CORBA's CDR encoding and carries the byte order in each header.
• Problem: Flow protocols should be able to run over different data transfer protocols. Configuring a flow protocol over a different data transfer protocol should be easy and transparent to application developers and users.
• Solution: To solve the problem of a flow protocol running over different data transfer protocols, we applied the Layers pattern (Buschmann et al., 1996) described in Appendix 1. We structured the flow protocols and data transfer protocols as two different layers. The flow protocol layer creates the frames with the in-band flow information. The data transfer layer performs the connection establishment and sends the frames handed down from the flow protocol layer onto the network. The layered approach makes the flow and data transfer protocols independent of each other, so it is easy to tie different flow protocols to different data transfer protocols transparently.
• Applying this solution in TAO's A/V Streaming Service: TAO's A/V Streaming Service provides a uniform data transfer layer for a variety of flow protocols, including UDP unicast, UDP multicast and TCP. TAO's A/V Streaming Service provides a flow protocol layer using a Protocol_Object interface. Likewise, its AV_Core class maintains a registry of A/V protocol factories.
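A minimal sketch of this layering is shown below. The two classes are illustrative stand-ins for the flow protocol layer (e.g., SFP or RTP) and the data transfer layer (e.g., TCP or UDP); the actual framing logic is elided:

#include <cstddef>

// Data transfer layer: TCP, UDP, ATM, ... (connection management elided).
class Transport
{
public:
  virtual ~Transport () {}
  virtual int send (const char *buf, std::size_t len) = 0;
};

// Flow protocol layer: SFP, RTP, ... Builds frames carrying in-band
// information (timestamps, sequence numbers), then hands the wire
// buffer to whichever transport was plugged in underneath.
class Flow_Protocol
{
public:
  explicit Flow_Protocol (Transport *t) : transport_ (t) {}
  virtual ~Flow_Protocol () {}

  int send_frame (const char *payload, std::size_t len)
  {
    // ... prepend the protocol-specific header here (elided) ...
    return transport_->send (payload, len);
  }

private:
  Transport *transport_; // any transport: the layers stay independent
};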
Adding New Protocols Dynamically
• Context: When developing new pluggable A/V protocols, it is inconvenient to recompile TAO's A/V Streaming Service and applications just to validate a new protocol implementation. Moreover, it is often useful to experiment with different protocols to compare their performance, footprint size and QoS guarantees systematically. Likewise, in telecom systems with 24x7 availability requirements, it is important to configure protocols dynamically, even while the system is running. This level of flexibility helps simplify upgrades and protocol enhancements.
• Problem: Users would like to populate the registry dynamically with a set of factories at run-time and avoid the inconvenience of recompiling the A/V Streaming Service and the applications when different protocols are plugged in.
• Solution: We can solve this problem using the Component Configurator pattern (Schmidt et al., 2000), which decouples the implementation of a component from the point in time when it is configured into the application. By using this pattern, a pluggable A/V protocol framework can dynamically load the set of entries in a registry. For instance, a registry can simply parse a configuration script and dynamically link the services listed in it.
• Applying this solution in TAO's A/V Streaming Service: The AV_Core class maintains all parameters specified in a configuration script. Adding a new parameter to represent the list of protocols is straightforward, i.e., the default registry simply examines this list and links the services into the address space of the application, using the ACE Service Configurator implementation (Schmidt and Suda, 1994).
Figure 15: Acceptor-connector registry and service configurator
ACE provides a rich set of reusable and efficient components for high-performance, real-time communication, and forms the portability layer of TAO's A/V Streaming Service. Figure 15 depicts the connector registry and its relation to the ACE Service Configurator framework, which is a C++ implementation of the Component Configurator pattern (Schmidt et al., 2000).
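For example, a new protocol factory could be linked in at run-time with a Service Configurator directive, as in the hedged sketch below. ACE_Service_Config::process_directive is part of ACE, but the service name, library and factory function shown here are hypothetical placeholders:

#include "ace/Service_Config.h"

int
load_flow_protocol_factory (void)
{
  // Dynamically load a (hypothetical) "UDP_Flow_Factory" service from
  // the TAO_AV library and invoke its factory function.
  return ACE_Service_Config::process_directive (
    ACE_TEXT ("dynamic UDP_Flow_Factory Service_Object * "
              "TAO_AV:_make_UDP_Flow_Factory()"));
}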
Designing an Extensible Interface to Control Protocols
• Context: RTP has a control protocol, RTCP, associated with it. Every RTP participant must transmit RTCP frames that provide control information, such as the name of the participant and the tool being used. Moreover, RTCP sends reception reports for each of its sources.
• Problem: Certain flow protocols, such as SFP, use A/V interfaces to exchange control information. Using RTP for a flow requires that RTCP information also be transmitted, and RTCP extracts its control information from RTP packets. Therefore, TAO's A/V Streaming Service must provide an extensible interface for these control protocols, as well as a means for the data and control protocols to interact.
• Solution: The solution is to make the control protocol information part of the flow protocol. For example, RTP knows that RTCP is its control protocol. Therefore, to reuse pluggability features, it may be necessary to make the control protocol use the same interfaces as its data components.
• Applying this solution in TAO's A/V Streaming Service: During stream establishment, Registry objects will first check the flow factory for the configured flow protocol. After the listen or connect operation has been performed for a particular data flow, the
Registry will check if the flow factory has a control factory. If so, it will perform the same processing for the control factory, except that the network endpoint port will be one value higher than the data endpoint's. Since the CORBA A/V Streaming Service specification does not define a portable way to specify control flow endpoint information, we followed this approach as a temporary solution until the OMG defines a portable one. The RTCP implementation in TAO's A/V Streaming Service uses the same interfaces that RTP does, including the Flow_Protocol_Factory and Protocol_Object classes. Thus, RTP will call the handle_control_input method on the RTCP Protocol_Object when an RTP frame is received. This method enables the RTCP object to extract the necessary control information, such as the sequence number of the last frame.
Uniform Interfaces that Hide Variations in Flow Protocols
• Context: Above, we explained how TAO's pluggable A/V protocol framework factors out different flow protocols and provides a uniform flow protocol interface. In certain cases, however, there are inherent variations in such protocols. For example, RTP must transmit the payload type, i.e., the format of the flow, in each frame, whereas SFP uses the control and management interface in TAO's A/V Streaming Service to set and get the format values for a flow. Similarly, the RTP control protocol, RTCP, periodically transmits participant information, such as the sender's name and email address, whereas SFP does not transmit such information. Such information does not change with every frame, however. For example, the name and email address of a participant in a conference will not change for a session. In addition, the properties of the transfer may need to be controlled by applications. For instance, a conferencing application may not want multicast loopback.
• Problem: An A/V Streaming Service should allow end users to set protocol-specific variations, while still providing a single interface for different flow protocols. Moreover, this interface should be open to changes as new flow and data transfer protocols are added.
• Solution: The solution to the above problem is to apply the CORBA Policy framework defined in the CORBA specification (Object Management Group, 1999). The CORBA Policy framework allows the protocol component developer to define policy objects that control the behavior of the protocol component. The policy object is derived from the CORBA Policy interface (Object Management Group, 1999), which stores the PolicyType (Object Management Group, 1999) and the associated values.
• Applying this solution in TAO's A/V Streaming Service: By defining a policy framework that is extensible and follows the CORBA Policy model, users face a shorter learning curve for the API and can add new flow protocols flexibly. We have defined different policy types, used by different flow protocols, that can be accessed by the specific transport and flow protocol components during frame creation and dispatching. For example, we have defined the TAO_AV_PAYLOAD_TYPE_POLICY, which allows the RTP protocol to specify the payload type.
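As an illustration of this policy model, an application could package an RTP payload type as a CORBA policy along the following lines. This sketch assumes TAO_AV_PAYLOAD_TYPE_POLICY is a registered PolicyType and uses the standard CORBA::ORB::create_policy operation; how TAO consumes the value internally is not shown:

// 'orb' is an initialized CORBA::ORB_var (assumed).
CORBA::Any value;
value <<= static_cast<CORBA::ULong> (32); // 32 = MPEG video, RTP/AVP profile

CORBA::Policy_var payload_type_policy =
  orb->create_policy (TAO_AV_PAYLOAD_TYPE_POLICY, value);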
CASE STUDIES OF MULTIMEDIA APPLICATIONS DEVELOPED USING TAO'S A/V STREAMING SERVICE To evaluate the capabilities of the CORBA-based A/V Streaming Service, we have developed several multimedia applications that use the components and interfaces described in the previous section. In this section, we describe the design of two distributed multimedia applications that use TAO's A/V Streaming Service and pluggable A/V protocol framework to establish and control MPEG and interactive audio/video streams.
Case Study 1: An MPEG A/V Streaming Application This application is an enhanced version of a non-CORBA MPEG player developed at the Oregon Graduate Institute (Chen et al., 1995). Our application plays movies using the MPEG-1 video format (ISO, 1993) and the Sun ULAW audio format (SUN Microsystems, 1992). Figure 16 shows the architecture of our A/V streaming application. The MPEG player application uses a supplier/consumer design implemented using TAO. The consumer locates the supplier using the CORBA Naming Service (OMG, 1997b) or the Trading Service (OMG, 1997b) to find suppliers that match the consumer's requirements. For instance, a consumer might want to locate a supplier that has a particular movie or a supplier with the fewest consumers currently connected to it. Once a consumer obtains the supplier's MMDevice object reference, it requests the supplier to establish two streams, i.e., a video stream and an audio stream, for a particular movie. These streams are established using the CORBA A/V stream establishment protocol. The consumer then uses the MediaCtrl to control the stream. The supplier is responsible for sending A/V packets via UDP to the consumer. For each consumer, the supplier transmits two streams, one for the MPEG video packets and one for the Sun ULAW audio packets. The consumer decodes these streams and plays these packets in a viewer, as shown in Figure 17. This section describes the various components of the consumer and supplier. The following table illustrates the number of lines of C++ source code required to develop this system and application:

Component                                    Lines of Code
TAO CORBA ORB                                61,524
TAO Audio/Video (A/V) Streaming Service      3,208
TAO MPEG Video Application                   47,782

Using the ORB and the A/V Streaming Service greatly reduced the amount of software that otherwise would have been written manually.
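For example, the consumer can resolve the supplier's MMDevice with the standard CORBA Naming Service API, as in the hedged sketch below; the binding name "MPEG_Supplier" is illustrative and error handling is elided:

// 'naming_context' is a resolved CosNaming::NamingContext_var (assumed).
CosNaming::Name name;
name.length (1);
name[0].id = CORBA::string_dup ("MPEG_Supplier");

CORBA::Object_var obj = naming_context->resolve (name);
AVStreams::MMDevice_var supplier_mmdev =
  AVStreams::MMDevice::_narrow (obj.in ());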
Supplier Architecture
The supplier in the A/V streaming application is responsible for streaming MPEG-1 video frames and ULAW audio samples to the consumer. The files can be stored in a filesystem accessible to the supplier process. Alternatively, the video frames and the audio packets can be sent by a live source, such as a video camera. Our experience with the supplier indicates that it can support approximately 10 concurrent consumers on a dual-CPU 187 MHz Sun UltraSPARC-II with 256 MB of RAM over a 155 Mbps ATM network.
Figure 16: Architecture of the MPEG A/V streaming application
Figure 17: A TAO-enabled audio/video player
Figure 18: TAO audio/video supplier architecture
The role of the supplier is to read audio and video frames from a file, encode them, and transmit them to the consumer across the network. Figure 18 depicts the key components in the supplier architecture. The main supplier process contains an MMDevice endpoint factory. This MMDevice creates connection handlers in response to consumer connections, using a process-based concurrency strategy. Each connection triggers the creation of one audio process and one video process. These processes respond to multiple events. For instance, the video supplier process responds to CORBA operations, such as play and rewind, and sends video frames periodically in response to timer events. Each component in the supplier architecture is described below:
• The Media controller component. This component in the supplier process is a servant that implements the Media Controller interface (MediaCtrl). A MediaCtrl responds to CORBA operations from the consumer. The interface exported by the MediaCtrl component represents the various operations supported by the supplier, such as play, rewind and stop. At any point in time, the supplier can be in one of several states, such as PLAYING, REWINDING or STOPPED. Depending on the supplier’s state, its behavior may change in response to consumer operations. For instance, the supplier ignores a consumer’s play operation when the supplier is already in the PLAYING state. Conversely, when the supplier is in the STOPPED state, a consumer rewind operation transitions the supplier to the REWINDING state.
The key design forces that must be resolved while implementing MediaCtrls for A/V streaming are (1) allowing the same object to respond differently based on its current state, (2) providing hooks to add new states and (3) providing extensible operations to change the current state. To provide a flexible design that meets these requirements, the control component of our MPEG player application is implemented using the State pattern (Gamma et al., 1995). This implementation is shown in Figure 19. The MediaCtrl has a state object pointer. The object pointed to by the MediaCtrl’s state pointer represents the current state. For simplicity, the figure shows the Playing_State and the Stopped_State, which are subclasses of the Media_State abstract base class. Additional states, such as the Rewinding_State, can be added by subclassing from Media_State. The diagram lists three operations: play, rewind and stop.

Figure 19: State pattern implementation of the media controller
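In outline, the structure shown in Figure 19 can be sketched as follows. This is a hedged sketch rather than the application’s actual code: the class names mirror the figure, while the method bodies and the two singleton state instances are illustrative.

```cpp
// Hedged sketch of the State pattern in the media controller; class
// names mirror Figure 19, bodies are illustrative.
class MediaCtrl_Impl;

// Abstract base class: each state supplies its own response to the
// consumer operations play, rewind and stop.
class Media_State
{
public:
  virtual ~Media_State () {}
  virtual void play (MediaCtrl_Impl &ctrl) = 0;
  virtual void rewind (MediaCtrl_Impl &ctrl) = 0;
  virtual void stop (MediaCtrl_Impl &ctrl) = 0;
};

// The controller holds a pointer to the current state object and
// delegates every consumer operation to it.
class MediaCtrl_Impl
{
public:
  explicit MediaCtrl_Impl (Media_State *initial) : state_ (initial) {}
  void play ()   { this->state_->play (*this); }
  void rewind () { this->state_->rewind (*this); }
  void stop ()   { this->state_->stop (*this); }

  // State transitions change the object the state pointer refers to.
  void change_state (Media_State *s) { this->state_ = s; }

private:
  Media_State *state_;
};

class Playing_State : public Media_State
{
public:
  void play (MediaCtrl_Impl &) {}   // Already PLAYING: ignore.
  void rewind (MediaCtrl_Impl &ctrl);
  void stop (MediaCtrl_Impl &ctrl);
};

class Stopped_State : public Media_State
{
public:
  void play (MediaCtrl_Impl &ctrl);
  void rewind (MediaCtrl_Impl &) {} // E.g., play frames in reverse.
  void stop (MediaCtrl_Impl &) {}   // Already STOPPED: ignore.
};

// Singleton-style instances keep the sketch short.
static Playing_State playing_state;
static Stopped_State stopped_state;

void Playing_State::rewind (MediaCtrl_Impl &c) { c.change_state (&stopped_state); }
void Playing_State::stop   (MediaCtrl_Impl &c) { c.change_state (&stopped_state); }
void Stopped_State::play   (MediaCtrl_Impl &c) { c.change_state (&playing_state); }
```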
When the consumer invokes an operation on the MediaCtrl, this class delegates the operation to the state object. A state object implements the response to each operation in a particular state. For instance, the rewind operation in the Playing_State contains the response of the MediaCtrl to the rewind operation when it is in the PLAYING state. State transitions can be made by changing the object pointed to by the state pointer of the MediaCtrl. In response to consumer operations, the current state object instructs the data transfer component to modify the stream flow. For instance, when the consumer invokes the rewind operation on the MediaCtrl while in the STOPPED state, the rewind operation in the Stopped_State object instructs the data component to play frames in reverse chronological order.
• The Data transfer component. This component is responsible for transferring data to the consumer. Our MPEG supplier application reads video frames from an MPEG-1 file and audio frames from a Sun ULAW audio file. It sends these frames to the consumer, fragmenting long frames if necessary. The current implementation of the data component uses the UDP protocol to send A/V frames.
A key design challenge related to data transfer is to have the application respond to CORBA operations for the stream control objects, e.g., the MediaCtrl, as well as the data transfer events, e.g., video frame timer events. An effective way to do this is to use the Reactor pattern (Schmidt et al., 2000), as shown in Figure 20 and described in Appendix 1.

Figure 20: Reactive architecture of the video supplier

The video supplier registers two event handlers with TAO’s ORB Reactor. One is a signal handler for the video frame timer events. The other is a UDP socket event handler for feedback events coming from the consumer. The frames sent by the data component correspond to the current state of the MediaCtrl object, as outlined above. Thus, in the PLAYING state, the data component plays the audio and video frames in chronological order.
Future implementations of the data transfer component in our MPEG player application will support multiple encoding protocols via the simple flow protocol (SFP) (OMG, 1997a). SFP encoding encapsulates frames of various protocols within
an SFP frame. It provides standard framing and sequence numbering mechanisms. SFP uses the CORBA CDR encoding mechanism to encode frame headers and uses a simple credit-based flow control mechanism described in (OMG, 1997a).
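To illustrate the Reactor-based design, the supplier’s two event handlers could be registered along the following lines. The handler class names and the 33 ms frame period are illustrative; the ACE_Reactor calls are the standard ACE API, and the sketch uses a reactor-scheduled timer in place of the signal handler mentioned above.

```cpp
// Hedged sketch: registering the video supplier's event handlers
// with an ACE Reactor.  Handler names are illustrative; the
// ACE_Reactor interface is standard ACE.
#include "ace/Reactor.h"
#include "ace/Event_Handler.h"
#include "ace/Time_Value.h"

class Frame_Timer_Handler : public ACE_Event_Handler
{
public:
  // Called by the reactor on each frame-timer expiry.
  virtual int handle_timeout (const ACE_Time_Value &, const void *)
  {
    // ... send the next video frame to the consumer ...
    return 0;
  }
};

class Feedback_Handler : public ACE_Event_Handler
{
public:
  Feedback_Handler (ACE_HANDLE h) : handle_ (h) {}
  virtual ACE_HANDLE get_handle () const { return this->handle_; }

  // Called when consumer feedback arrives on the UDP socket.
  virtual int handle_input (ACE_HANDLE)
  {
    // ... read and process the feedback packet ...
    return 0;
  }

private:
  ACE_HANDLE handle_;
};

void
register_supplier_handlers (ACE_Reactor *reactor,
                            ACE_HANDLE feedback_socket)
{
  // Fire the frame timer every 33 ms, i.e., roughly 30 frames/sec.
  static Frame_Timer_Handler timer_handler;
  reactor->schedule_timer (&timer_handler,
                           0,                        // No ACT.
                           ACE_Time_Value (0, 33000),
                           ACE_Time_Value (0, 33000));

  // Dispatch handle_input when the feedback socket becomes readable.
  static Feedback_Handler feedback_handler (feedback_socket);
  reactor->register_handler (&feedback_handler,
                             ACE_Event_Handler::READ_MASK);
}
```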
Consumer Architecture
The role of the consumer is to read audio and video frames off the network, decode them, and play them synchronously. The audio and video servers stream the frames separately; A/V frame synchronization is performed on the consumer.

Figure 21: TAO audio/video consumer architecture

Figure 21 depicts the key components in the consumer architecture. The original non-CORBA MPEG consumer (Chen et al., 1995) used a process-based concurrency architecture. Our CORBA-based consumer maintains this architecture to minimize changes to the code. Separate processes are used to do the buffering, decoding and playback, as explained below:
1. Video buffer. The video buffering process is responsible for reading UDP packets from the network and enqueueing them in shared memory. The video decoder process dequeues these packets and performs MPEG decoding operations on them.
2. Audio buffer. Similarly, the audio buffering process is responsible for reading UDP packets off the network and enqueueing them in shared memory. The control/audio playback process dequeues these packets and sends them to /dev/audio.
3. Video decoder. The video decoding process reads the raw packets sent to it by the video buffer process and decodes them according to the MPEG-1 video specification. These decoded packets are sent to the GUI/video process, which displays them.
4. GUI/video process. The GUI/video process is responsible for the following two tasks:
a. GUI: It provides a GUI to the user, where the user can select operations like play, stop and rewind. These operations are sent to the control/audio process via a UNIX domain socket (Stevens, 1998).
b. Video: This component is responsible for displaying video frames to the user. The decoded video frames are stored in a shared memory queue.
5. Control/audio playback process. The control/audio process is responsible for the following tasks:
a. Control: This component receives control messages from the GUI process and sends the appropriate CORBA operation to the MediaCtrl servant in the supplier process.
b. Audio playback: The audio playback component is responsible for dequeueing audio packets from the audio buffer process and playing them back using the multimedia sound hardware. Decoding is unnecessary because the supplier uses the ULAW format. Therefore, the received data can be written directly to the sound port, which is /dev/audio on Solaris.
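Since ULAW samples need no decoding, the playback step reduces to writing the received packets to the audio device, roughly as follows. This is a minimal sketch assuming the Solaris default of 8-bit ULAW at 8 kHz on /dev/audio; error handling is elided.

```cpp
// Hedged sketch: ULAW playback on Solaris.  /dev/audio accepts 8-bit
// ULAW samples at 8 kHz by default, so packets can be written as-is.
#include <fcntl.h>
#include <unistd.h>

int
open_audio_device ()
{
  // Returns -1 if the device is unavailable or busy.
  return open ("/dev/audio", O_WRONLY);
}

void
play_ulaw_packet (int audio_fd, const unsigned char *samples,
                  size_t len)
{
  // No decoding step: the supplier already sends ULAW samples.
  ssize_t n = write (audio_fd, samples, len);
  (void) n;  // Error handling elided in this sketch.
}
```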
Case Study 2: The Vic Video-Conferencing Application
Vic (McCanne and Jacobson, 1995) is a video-conferencing application developed at the University of California, Berkeley. We have adapted Vic to use TAO’s A/V Streaming Service components and its pluggable A/V protocol framework. The Vic implementation in TAO uses RTP/RTCP as its flow and data transfer protocols.
Overview of Vic
Vic provides video conferencing; audio conferencing is handled by a companion tool, Vat (LBNL, 1995). The Vic family of tools synchronizes media streams using a conference bus mechanism, which is a “localhost’’ synchronization mechanism implemented via loopback sockets. The architecture of Vic is driven largely by the TclObject interface (McCanne and Jacobson, 1995), which exports operations on an object so they can be invoked from a Tcl script. By using Tcl, Vic allows rapid prototyping and reconfiguration of its encode/decode paths.
One design challenge we faced while adapting Vic to use TAO’s A/V Streaming Service was to integrate the GUI and ORB event loops. This was solved using the Reactor pattern (Schmidt et al., 2000). In particular, we developed a Reactor that unified the GUI and ORB into a single event loop.
Implementing Vic Using TAO’s A/V Streaming Service
Below, we discuss the steps we followed to adapt Vic to use TAO’s A/V Streaming Service.
1. Structuring the conferencing protocols. In this step, we decomposed the flow, control and data transfer protocols using TAO’s pluggable A/V protocol framework. The original Vic application was tightly coupled with RTP. For instance, its encoders and decoders were aware of the RTP headers. We decoupled the encoders/decoders from RTP-specific details by using the frame_info structure and TAO’s A/V Streaming Service Protocol_Object interface. The modified Vic still preserves the application-level framing (ALF) (Clark and Tennenhouse, 1990) model embodied in RTP. Moreover, Vic’s RTCP functionality was abstracted into TAO’s pluggable A/V protocol framework, so the framework automatically defines an RTCP flow for each RTP flow. The modified Vic is also independent of the network-specific details of opening connections and I/O handling, since it uses the pluggable A/V protocol framework provided by TAO’s A/V Streaming Service.
Vic uses the multipoint-to-multipoint binding provided by TAO’s A/V Streaming Service, which is described in Appendix 3. Thus, our first step when integrating into TAO was to determine the proper abstraction for the conference device. A video-conferencing application like Vic serves as both a source and a sink; thus, we needed both a source and a sink MMDevice. Moreover, to be extensible for future integration with Vat
and other multimedia tools, Vic uses flow interfaces, i.e., video is considered a flow within the conference stream. Since Vat runs in a separate address space, its flow interfaces must be exposed using TAO’s full profile flow interfaces, i.e., FDev, FlowProducer and FlowConsumer.
2. Define callback objects. In this step, we defined Callback objects for all the source and sink FlowEndPoints. The Source_Callback uses the timer functionality to schedule timer events to send the frames. Figure 22 illustrates the sequence of events that trigger the sending of frames. When input becomes ready on the video card, the grabber reads it and gives it to the transmitter. The transmitter then uses the Source_Callback object to schedule a timer to send the frames at the requested bit rate using a bit-rate buffer. On the sink side, when a packet arrives from the network, the receive_frame upcall is made on the Sink_Callback object which, using the frame_info structure, hands it to the right Source object, which then passes it to the right decoder. To implement RTCP functionality, Vic implements an RTCP_Callback to provide Vic-specific source objects.
Figure 22: Architecture of Vic using TAO’s A/V Streaming Service
3. Select a centralized or distributed conference configuration. In this step, we ensured that Vic can function both as a participant in a centralized conference and in a loosely coupled distributed conference. This flexibility is achieved by checking for a StreamCtrl object in the Naming Service and creating a new StreamCtrl if one is not found, as sketched below. Thus, by running a StreamCtrl control process that registers itself with the Naming Service, all Vic participants become part of a centralized conference, which can be controlled from the control process. Conversely, when no such process is run, Vic reverts to the loosely controlled model by creating its own StreamCtrl and transmitting on the multicast address.
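The check described in step 3 might be sketched as follows. It is written with native C++ exceptions for brevity, the registered name “StreamCtrl’’ is an assumption, and TAO_StreamCtrl is the servant class TAO provides for the StreamCtrl interface.

```cpp
// Hedged sketch: probe the Naming Service for a shared StreamCtrl
// (centralized conference); fall back to a local one (distributed
// conference).  The name "StreamCtrl" is an illustrative assumption.
#include "orbsvcs/CosNamingC.h"
#include "orbsvcs/AV/AVStreams_i.h"

AVStreams::StreamCtrl_ptr
find_or_create_streamctrl (CosNaming::NamingContext_ptr naming)
{
  CosNaming::Name name;
  name.length (1);
  name[0].id = CORBA::string_dup ("StreamCtrl");

  try
    {
      // Centralized conference: a control process has registered a
      // StreamCtrl that all participants share.
      CORBA::Object_var obj = naming->resolve (name);
      return AVStreams::StreamCtrl::_narrow (obj.in ());
    }
  catch (const CosNaming::NamingContext::NotFound &)
    {
      // Distributed conference: create our own StreamCtrl servant.
      TAO_StreamCtrl *ctrl = new TAO_StreamCtrl;
      return ctrl->_this ();
    }
}
```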
PERFORMANCE RESULTS
This section describes the design and results of three performance experiments we conducted using TAO’s A/V Streaming Service.
CORBA/ATM Testbed
The experiments in this section were conducted using a FORE Systems ASX-1000 ATM switch connected to two dual-processor UltraSPARC-2s running Solaris 2.5.1. The ASX-1000 is a 96-port, OC12 622 Mbps-per-port switch. Each UltraSPARC-2 contains a 300 MHz SuperSPARC CPU with a 1 Megabyte cache per CPU. The Solaris 2.5.1 TCP/IP protocol stack is implemented using the STREAMS communication framework (Ritchie, 1984). Each UltraSPARC-2 has 256 Mbytes of RAM and an ENI-155s-MF ATM adaptor card, which supports 155 Megabits per second (Mbps) SONET multimode fiber. The Maximum Transmission Unit (MTU) on the ENI ATM adaptor is 9,180 bytes. Each ENI card has 512 Kbytes of on-board memory. A maximum of 32 Kbytes is allotted per ATM virtual circuit connection for receiving and transmitting frames (for a total of 64 Kbytes). This allows up to eight switched virtual connections per card. The CORBA/ATM hardware platform is shown in Figure 23.
CPU Usage of the MPEG Decoder
The aim of this experiment is to determine the CPU overhead associated with decoding and playing MPEG-1 frames in software. To measure this, we used the MPEG/ULAW A/V player application described in the preceding section. We used the application to view two movies, one of size 128x96 pixels and the other of size 352x240 pixels, and measured the percentage of CPU usage at different frame rates. The frame rate is the number of video frames displayed by the viewer per second. The results are shown in Figure 24. These results indicate that for large frame sizes (352x240), MPEG decoding in software becomes expensive, and CPU usage reaches 100% while playing 12 frames per second or higher. For smaller frame sizes (128x96), however, MPEG decoding in software does not cause heavy CPU utilization. At 30 frames per second, CPU utilization is approximately 38%.
A/V Stream Throughput
The aim of this experiment is to illustrate that TAO’s A/V Streaming Service does not introduce appreciable overhead in transporting data. To demonstrate this, we wrote a TCP-based data streaming component and integrated it with TAO’s A/V Streaming Service. The producer
Figure 23: Hardware for the CORBA/ATM testbed
Figure 24: CPU usage of the MPEG decoder
in this application establishes a stream with the consumer, using the CORBA A/V stream establishment mechanism. Once the stream is established, it streams data via TCP to the consumer. We measured the throughput, i.e., the number of bytes per second sent by the supplier to the consumer, obtained by this streaming application. We then compared this throughput with the following two configurations:
1. TCP transfer, i.e., by a pair of application processes that do not use the CORBA A/V Streaming Service stream establishment protocol. In this case, sockets and TCP were the network programming API and data transfer protocol, respectively. This is the “ideal’’ case, since there is no additional ORB-related or presentation layer overhead.
2. ORB transfer, i.e., the throughput obtained by a stream of octets passed through the TAO (Schmidt et al., 1998a) CORBA ORB. In this case, the IIOP data path was the data transfer mechanism.
We measured the throughput obtained by varying the buffer size of the sender, i.e., the number of bytes written by the supplier in a single write system call. In each stream, the supplier sent 64 megabytes of data to the consumer.

Figure 25: Throughput results

The results shown in Figure 25 indicate that, as expected, the A/V Streaming Service does not introduce any appreciable overhead to streaming the data. In the case of using IIOP as the data transfer layer, the benchmark incurs additional
performance overhead. This overhead arises from the dynamic memory allocation, data copying, and marshaling/demarshaling performed by the ORB’s IIOP protocol engine (Gokhale and Schmidt, 1996). In general, however, a well-designed ORB can achieve performance equivalent to sockets for larger buffer sizes due to various optimizations, such as eliding (de)marshaling overhead for octet data (Gokhale and Schmidt, 1999). The largest disparity occurred for smaller buffer sizes, where the performance of the ORB was approximately half that of the TCP and A/V streaming implementations. As the buffer size increases, however, ORB performance improves considerably and attains nearly the same throughput as TCP and A/V streaming. Clearly, there is a fixed amount of overhead in the ORB that is amortized and minimized as the size of the data payload increases.
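For reference, the sender side of such a measurement can be sketched with plain POSIX sockets as follows. Connection setup and error handling are elided, and the function itself is illustrative rather than the benchmark’s actual code.

```cpp
// Hedged sketch: measuring sender-side throughput for a given buffer
// size, in the style of the experiment above.  Plain POSIX sockets;
// connection setup is omitted.
#include <sys/time.h>
#include <unistd.h>
#include <vector>

double
measure_throughput (int sock, size_t buffer_size, size_t total_bytes)
{
  std::vector<char> buffer (buffer_size, 0);

  timeval start, end;
  gettimeofday (&start, 0);

  size_t sent = 0;
  while (sent < total_bytes)
    {
      ssize_t n = write (sock, &buffer[0], buffer_size);
      if (n <= 0)
        break;  // Connection closed or error.
      sent += static_cast<size_t> (n);
    }

  gettimeofday (&end, 0);
  double secs = (end.tv_sec - start.tv_sec)
    + (end.tv_usec - start.tv_usec) / 1e6;
  return sent / secs;  // Bytes per second.
}
```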
Stream Establishment Latency
This experiment measures the time required to establish a stream using TAO’s implementation of the CORBA A/V stream establishment protocol, described in the section Interaction Between Components in the CORBA Audio/Video Streaming Service Model. We measured the stream establishment latency for the two concurrency strategies: process-based and reactive. The timer starts when the consumer gets the object reference for the supplier’s MMDevice servant from the Naming Service. The timer stops when the stream has been established, i.e., when a TCP connection has been established between the consumer and the supplier.

Figure 26: Stream establishment latency results
We measured the stream establishment time as the number of concurrent consumers establishing connections with the supplier increased from 1 to 10. The results are shown in Figure 26. When the supplier’s MMDevice is configured to use the process-based concurrency strategy, the time taken to establish a stream is higher, due to the overhead of process creation. For instance, when 10 concurrent consumers establish a stream with the producer simultaneously, the average latency observed is about 2.25 seconds with the process-based concurrency strategy, whereas with the reactive concurrency strategy the latency is only approximately 0.4 seconds. The process-based strategy is well-suited for supplier devices that have multiple streams, e.g., a video camera that broadcasts a live feed to many clients. In contrast, the reactive concurrency strategy is well-suited for consumer devices that have few streams, e.g., a display device that has only one or two streams.
FUTURE WORK
Our next steps are to enable end-to-end QoS for the flow protocols. TAO’s pluggable A/V protocol framework enables us to provide QoS guarantees using either IP-based QoS protocols (such as RSVP and Differentiated Services) or ATM. For example, if an application wants to use ATM QoS guarantees, it can choose the ATM AAL5 data transfer protocol. Conversely, if it wants to use RSVP or Differentiated Services QoS provisions, it can choose the TCP or UDP data transfer protocols. TAO’s pluggable A/V protocol framework helps applications choose flow and data transfer protocols dynamically, in accordance with the data they are streaming. For example, applications can use reliable TCP for audio transmission and unreliable UDP for video transmission.
The CORBA A/V Streaming Service specification has provisions for applications to specify their QoS requirements when setting up a connection. These QoS parameters can be specified for each flow through a sequence of name/value properties. The specification leaves it to the A/V Streaming Service implementation to translate application-level QoS parameters (such as video frame rate) to network-level QoS parameters (such as bandwidth). We are planning to build a QoS framework that the A/V Streaming Service can use to ensure end-to-end QoS for all its flows. We have identified three main components for this framework:
1. QoS mapping. This component translates QoS specifications between different levels, such as application and network, in order to reserve sufficient network resources at connection establishment. Good mapping rules are required to prevent reserving too many (or too few) resources.
2. QoS monitoring and adaptation. This component measures end-to-end QoS parameters over a finite time period and takes actions based on the measured QoS and the application requirements. It facilitates renegotiation of the QoS parameters between the sender and receiver.
3. QoS-based transport API. This component provides calls for provisioning control (renegotiation and violation notification) and for media transfers enforcing end-to-end network QoS. The ACE framework provides QoS APIs that supply these capabilities; the ACE QoS APIs use the GQoS and RAPI APIs to enforce end-to-end network QoS.
CONCLUDING REMARKS
The demand for high quality multimedia streaming is growing, both over the Internet and within intranets. Distributed object computing is also maturing at a rapid rate due to middleware technologies like CORBA. The flexibility and adaptability offered by CORBA makes it attractive for use in streaming technologies, as long as the requirements of performance-sensitive multimedia applications can be met. This chapter illustrates an approach to building standards-based, flexible, adaptive multimedia streaming applications using CORBA.
There is a great deal of activity in the codec community to design new formats for audio and video transmission. Active research is also being done on new flow and data transfer protocols for multimedia. In such situations, a flexible framework that makes use of the A/V interfaces and abstracts the network and protocol details is needed to adapt to new developments. In this chapter we present a pluggable A/V protocol framework that provides the capability to rapidly adapt to new flow and data transfer protocols. With the growing demand for real-time multimedia streaming and conferencing, driven by increasing network bandwidth and the spread of the Internet, TAO provides the first freely available, open-source implementation of the CORBA Audio/Video Streaming Service specification, i.e., flow interfaces, point-to-multipoint binding and multipoint-to-multipoint binding for conferencing applications. Our experience with TAO’s A/V implementation indicates that the standard CORBA specification defines a flexible and efficient model for developing high-performance multimedia streaming applications.
While designing and implementing the CORBA A/V Streaming Service, we learned the following lessons:
1. We found that CORBA simplifies a number of common network programming tasks, such as parsing untyped data and performing byte-order conversions.
2. We found that using CORBA to define the operations supported by a supplier in an IDL interface made it much easier to express the capabilities of the application.
3. Our performance measurements revealed that while CORBA provides solutions to many recurring problems in network programming, using CORBA for data transfer in bandwidth-intensive applications is not as efficient as using lower level protocols like TCP, UDP or ATM directly. Thus, an important benefit of the TAO A/V Streaming Service is to provide applications the advantages of using CORBA IIOP in their stream establishment and control modules, while allowing the use of more efficient data transfer protocols for multimedia streaming.
4. Enhancing an existing A/V streaming application to use CORBA was a key design challenge. By applying patterns, such as State, Strategy (Gamma et al., 1995) and Reactor (Schmidt et al., 2000), we found it was much easier to address these design issues. Thus, the use of patterns helped us rework the architecture of an existing MPEG A/V player and make it more amenable to distributed object computing middleware, such as CORBA.
5. Building the CORBA A/V Streaming Service also helped us improve TAO, the CORBA ORB used to implement the service. An important feature added to TAO was support for nested upcalls. This feature allows a CORBA-enabled application to respond to incoming CORBA operations while it is invoking a CORBA operation on a remote object.
During the development of the A/V Streaming Service, we also applied many optimizations to TAO and its IDL compiler, particularly for sequences of octets and the CORBA::Any type.
All the C++ source code, documentation and benchmarks for TAO and its A/V Streaming Service are available at www.cs.wustl.edu/~schmidt/TAO.html.
APPENDIX 1. DESIGN PATTERNS USED IN THE TAO A/V STREAMING SERVICE
This section outlines the intents of the patterns used in TAO’s A/V Streaming Service and its pluggable A/V protocol framework. The references explore each pattern in greater depth.
• Abstract Factory pattern [Gamma et al., 1995]: This pattern provides an interface for creating families of related or dependent objects without specifying their concrete classes.
• Acceptor-Connector pattern [Schmidt et al., 2000]: This pattern decouples the connection and initialization of cooperating peer services in a distributed system from the processing performed by these peer services once they are connected and initialized.
• Adapter pattern [Gamma et al., 1995]: This pattern allows two classes to collaborate that were not designed originally to work together.
• Component Configurator pattern [Schmidt et al., 2000]: This pattern decouples the implementation of services from the time when they are configured.
• Double Dispatching pattern [Gamma et al., 1995]: In this pattern, when a call is dispatched to a method on a target object from a parent object, the target object in turn makes method calls on the parent object to access certain attributes in the parent object.
• Extension Interface pattern [Schmidt et al., 2000]: This pattern prevents bloated interfaces and keeps client code from breaking when developers add or modify the functionality of existing components. Multiple extensions can be attached to the same component, each defining a contract between the component and its clients.
• Facade pattern [Gamma et al., 1995]: This pattern provides a unified higher-level interface to a set of interfaces in a subsystem that makes the subsystem easier to use.
• Factory Method pattern [Gamma et al., 1995]: This pattern defines an interface for creating objects, but lets subclasses decide which class to instantiate.
• Leader/Followers pattern [Schmidt et al., 2000]: This pattern provides a concurrency model in which multiple threads efficiently demultiplex events received on I/O handles shared by the threads and dispatch the event handlers that process those events.
• Layers pattern [Buschmann et al., 1996]: This pattern helps to structure applications that can be decomposed into groups of subtasks in which each group of subtasks is at a particular level of abstraction.
• Reactor pattern [Schmidt et al., 2000]: This pattern demultiplexes and dispatches requests that are delivered concurrently to an application by one or more clients.
• State pattern [Gamma et al., 1995]: This pattern allows an object to alter its behavior when its internal state changes. The object will appear to change its class.
• Strategy pattern [Gamma et al., 1995]: This pattern defines and encapsulates a family of algorithms and makes them interchangeable.
• Template Method pattern [Gamma et al., 1995]: This pattern defines the skeleton of an algorithm in an operation, deferring certain steps to subclasses.
APPENDIX 2. OVERVIEW OF THE CORBA REFERENCE MODEL
CORBA Object Request Brokers (ORBs) [Object Management Group, 2000] allow clients to invoke operations on distributed objects without concern for the following issues:
• Object location: CORBA objects can either be collocated with the client or distributed on a remote server, without affecting their implementation or use.
• Programming language: The languages supported by CORBA include C, C++, Java, Ada95, COBOL, and Smalltalk, among others.
• OS platform: CORBA runs on many OS platforms, including Win32, UNIX, MVS, and real-time embedded systems like VxWorks, Chorus, and LynxOS.
• Communication protocols and interconnects: The communication protocols and interconnects that CORBA runs on include TCP/IP, IPX/SPX, FDDI, ATM, Ethernet, Fast Ethernet, embedded system backplanes, and shared memory.
• Hardware: CORBA shields applications from side effects stemming from differences in hardware, such as storage layout and data type sizes/ranges.

Figure 27. Components in the CORBA 2.x Reference Model

Figure 27 illustrates the components in the CORBA 2.x reference model, all of which collaborate to provide the portability, interoperability and transparency outlined above. Each component in the CORBA reference model is outlined below:
• Client: A client is a role that obtains references to objects and invokes operations on them to perform application tasks. Objects can be remote or collocated relative to the client. Ideally, a client can access a remote object just like a local object, i.e., object->operation(args). Figure 27 shows how the underlying ORB components described below transmit remote operation requests transparently from client to object.
• Object: In CORBA, an object is an instance of an OMG Interface Definition Language (IDL) interface. Each object is identified by an object reference, which associates one or more paths through which a client can access an object on a server. An object ID associates an object with its implementation, called a servant, and is unique within the scope of an Object Adapter. Over its lifetime, an object has one or more servants associated with it that
implement its interface.
• Servant: This component implements the operations defined by an OMG IDL interface. In object-oriented (OO) languages, such as C++ and Java, servants are implemented using one or more class instances. In non-OO languages, such as C, servants are typically implemented using functions and structs. A client never interacts with servants directly, but always through objects identified by object references.
• ORB Core: When a client invokes an operation on an object, the ORB Core is responsible for delivering the request to the object and returning a response, if any, to the client. An ORB Core is implemented as a run-time library linked into client and server applications. For objects executing remotely, a CORBA-compliant ORB Core communicates via a version of the General Inter-ORB Protocol (GIOP), such as the Internet Inter-ORB Protocol (IIOP), which runs atop the TCP transport protocol. In addition, custom Environment-Specific Inter-ORB Protocols (ESIOPs) can also be defined.
• ORB Interface: An ORB is an abstraction that can be implemented in various ways, e.g., one or more processes or a set of libraries. To decouple applications from implementation details, the CORBA specification defines an interface to an ORB. This ORB interface provides standard operations to initialize and shut down the ORB, convert object references to strings and back, and create argument lists for requests made through the dynamic invocation interface (DII).
• OMG IDL Stubs and Skeletons: IDL stubs and skeletons serve as the “glue’’ between the client and servants, respectively, and the ORB. Stubs implement the Proxy pattern [Gamma et al., 1995] and provide a strongly-typed, static invocation interface (SII) that marshals application parameters into a common message-level representation. Conversely, skeletons implement the Adapter pattern [Gamma et al., 1995] and demarshal the message-level representation back into typed parameters that are meaningful to an application.
• IDL Compiler: An IDL compiler transforms OMG IDL definitions into stubs and skeletons that are generated automatically in an application programming language, such as C++ or Java. In addition to providing programming language transparency, IDL compilers eliminate common sources of network programming errors and provide opportunities for automated compiler optimizations [Eide et al., 1997].
• Dynamic Invocation Interface (DII): The DII allows clients to generate requests at run time, which is useful when an application has no compile-time knowledge of the interface it accesses. The DII also allows clients to make deferred synchronous calls, which decouple the request and response portions of two-way operations to avoid blocking the client until the servant responds.
• Dynamic Skeleton Interface (DSI): The DSI is the server’s analogue to the client’s DII. The DSI allows an ORB to deliver requests to servants that have no compile-time knowledge of the IDL interface they implement. Clients making requests need not know whether the server ORB uses static or dynamic skeletons. Likewise, servers need not know whether clients use the DII or SII to invoke requests.
• Object Adapter: An Object Adapter is a composite component that associates servants with objects, creates object references, demultiplexes incoming requests to servants, and collaborates with the IDL skeleton to dispatch the appropriate operation upcall on a servant. Object Adapters enable ORBs to support various types of servants that possess similar requirements.
This design results in a smaller and simpler ORB that can support a wide range of object granularities, lifetimes, policies, implementation styles, and other properties.
• Interface Repository: The Interface Repository provides run-time information about IDL
interfaces. Using this information, it is possible for a program to encounter an object whose interface was not known when the program was compiled, yet be able to determine what operations are valid on the object and make invocations on it using the DII. In addition, the Interface Repository provides a common location to store additional information associated with interfaces to CORBA objects, such as type libraries for stubs and skeletons.
• Implementation Repository: The Implementation Repository contains information that allows an ORB to activate servers to process servants. Most of the information in the Implementation Repository is specific to an ORB or OS environment. In addition, the Implementation Repository provides a common location to store information associated with servers, such as administrative control, resource allocation, security, and activation modes.
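As an illustration of the DII described above, a client might build and invoke a request at run time as follows. The operation name “ping’’ and its long argument are hypothetical; the _request/add_in_arg/invoke calls are part of the standard CORBA C++ mapping.

```cpp
// Hedged sketch: a dynamic invocation using the standard CORBA C++
// mapping.  The operation "ping" and its signature are hypothetical.
#include "tao/corba.h"

CORBA::Long
dii_ping (CORBA::Object_ptr obj)
{
  // Build a request for an operation discovered at run time, e.g.,
  // via the Interface Repository.
  CORBA::Request_var req = obj->_request ("ping");

  req->add_in_arg () <<= CORBA::Long (42);  // One in parameter.
  req->set_return_type (CORBA::_tc_long);   // Declare the result.

  req->invoke ();                           // Synchronous call.

  CORBA::Long result = 0;
  req->return_value () >>= result;
  return result;
}
```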
APPENDIX 3. SUPPORTING MULTIPLE ENDPOINT BINDING SEMANTICS IN TAO’S A/V STREAMING SERVICE
The CORBA A/V Streaming Service can construct different topologies for establishing streams between stream endpoints. For instance, one-to-one, one-to-many, many-to-one, and many-to-many sources and sinks may need to be configured in the same stream binding. The need for particular stream endpoint bindings is dictated by the multimedia application. For example, a video-on-demand application may require a point-to-point binding when sources and sinks are pre-selected, whereas a video-conferencing application may require a multipoint-to-multipoint binding to receive from and transmit to various sources and sinks simultaneously. This section illustrates the various stream and flow endpoint bindings that have been implemented in TAO’s A/V Streaming Service and shows how stream endpoints are created and connections are established. In TAO’s A/V Streaming Service, we have implemented the standard point-to-point and point-to-multipoint bindings of the stream endpoints. In addition, we have used these configurations as building blocks for multipoint-to-multipoint bindings.

Point-to-Point Binding
Below, we describe the sequence of steps during a point-to-point stream establishment, as defined by the CORBA A/V specification and implemented in TAO’s A/V Streaming Service. In our example, we consider stream establishment in a video-on-demand (VoD) application similar to the MPEG player application described in the case studies section. As shown in Figure 28, the VoD server and VoD client each have a device with two flows, audio and video. The audio flow is carried over TCP and the video over UDP. The client must first locate the server’s MMDevice reference and then pass its own MMDevice as the A party and the server’s MMDevice as the B party to the StreamCtrl.
Step 1: Endpoint creation. At this point, the VDev and StreamEndpoint are created for this stream from the MMDevices. The client and server applications can choose either a Process_Strategy, where the endpoints are created in a separate process, or a Reactive_Strategy, where the endpoints are created in the same process. The pluggable A/V protocol framework in TAO’s A/V Streaming Service provides flexible Concurrency
Figure 28. Video-on-Demand Consumer and Supplier
Strategies [Mungee et al., 1999] to create the endpoints.
Step 2: Creation of flow endpoints. To create a full profile, an MMDevice can act as a container for FDevs. In this case, the MMDevice will create a FlowProducer or FlowConsumer from the FDev, depending on the direction of the flow specified in the flow specification parameter. The flow direction is always with respect to the A side. Thus, the direction “out’’ means that the flow originates from the A side to the B side, whereas “in’’ means that the flow originates from the B side to the A side. In the above case, the server is streaming data to the client. Therefore, the direction of the flow for both audio and video will be “in,’’ and the MMDevice will create a FlowProducer from the audio and video FDevs on the server and a FlowConsumer from the audio and video FDevs on the client. These FlowProducers and FlowConsumers are then added to the StreamEndpoint using the add_fep operation. The advantage of using the flow interfaces is that the FDevs can be shared across different applications. In our VoD server, for example, the audio and video could run as two separate processes containing only the flow objects, and a control process could add the FDevs from these two processes to the stream. Both flows can then be controlled through the same StreamCtrl interface. This configuration is more scalable and extensible than the implementation of the MPEG player described in the case studies section, where the audio and video were treated as two separate streams.
Step 3: VDev configuration. The StreamCtrl then calls set_peer on each of the VDevs with the other VDev. For light profiles, multimedia application developers are responsible for implementing the set_peer call to check whether all flows are compatible. For full profiles, the VDev interface is not used, because the FlowEndPoints contain these configuration operations.
Step 4: Stream setup. During this step the actual connections for the different flows are established. For light profiles, the flows do not have any interfaces, so the flow specification must contain the transfer information for each flow. For example, the following flow specs are typically passed to the bind_devs call from the VoD client:
“audio\in\MIME:audio/mpeg\\TCP=ace.cs.wustl.edu;10000” and
“video\in\MIME:video/mpeg\\UDP=ace.cs.wustl.edu;8080”
In these flow specs, the client is offering to listen for a TCP connection, and the server will connect to the client. This configuration might be useful if the server is behind a firewall.
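From the application’s perspective, the setup described so far is initiated by a single bind_devs call. The following hedged sketch shows the client side of this light profile setup; the variable names are illustrative and the QoS is left empty, while bind_devs and the flow spec strings follow the A/V specification and the example above.

```cpp
// Hedged sketch: light profile point-to-point setup for the VoD
// example.  Names are illustrative; the flow specs mirror the text.
#include "orbsvcs/AVStreamsC.h"

CORBA::Boolean
setup_vod_stream (AVStreams::StreamCtrl_ptr streamctrl,
                  AVStreams::MMDevice_ptr client_mmdev,  // A party.
                  AVStreams::MMDevice_ptr server_mmdev)  // B party.
{
  AVStreams::flowSpec flow_spec;
  flow_spec.length (2);
  flow_spec[0] = CORBA::string_dup
    ("audio\\in\\MIME:audio/mpeg\\\\TCP=ace.cs.wustl.edu;10000");
  flow_spec[1] = CORBA::string_dup
    ("video\\in\\MIME:video/mpeg\\\\UDP=ace.cs.wustl.edu;8080");

  AVStreams::streamQoS qos;  // No QoS constraints in this sketch.

  // Triggers endpoint creation, VDev configuration and stream setup.
  return streamctrl->bind_devs (client_mmdev, server_mmdev,
                                qos, flow_spec);
}
```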
The StreamCtrl calls connect on one of the StreamEndpoints, passing the other StreamEndpoint, the QoS, and the flow spec.
Step 5: Stream QoS negotiation. The StreamEndpoint will first check whether the other StreamEndpoint has a negotiator property defined. If it does, the StreamEndpoint calls negotiate on the negotiator, and the client and server then negotiate the QoS. TAO’s A/V Streaming Service provides a default implementation that can be overridden by applications. The StreamEndpoint then queries the “AvailableProtocols’’ property on the other StreamEndpoint. If there is no common protocol, the stream setup will fail and the StreamOpDenied exception will be thrown.
Step 6: Light profile connection establishment. The A party StreamEndpoint will then try to set up the stream for all its flows. For light profiles, the following steps are performed for each flow:
1. The StreamEndpoint extracts the flow protocol and data transfer protocol information from the flow spec entry for this flow. If a network address is not specified, then a default stream endpoint is picked.
2. The StreamEndpoint then performs the following actions:
a. It goes through the list of flow protocol factories in the AV_Core instance to find a matching flow protocol. If no flow protocol is specified, it passes the protocol as the flow protocol string. TAO’s A/V Streaming Service provides “no-op’’ implementations for all data transfer protocols so that the layering of the architecture is preserved and a uniform API is presented to the application. These no-op flow protocols do not process the frames; they simply pass them to the underlying data transfer protocol.
b. If a flow protocol factory matches the specified flow protocol/data transfer protocol, the StreamEndpoint then checks for the data transfer protocol factory that matches the protocol specified for this flow.
c. After finding a matching data transfer protocol factory, it creates a one-shot acceptor for this flow, passing the FlowProtocolFactory to the acceptor.
d. If the flow protocol factory has an associated control protocol factory, the StreamEndpoint tries to match the data transfer factory for that, as well.

Figure 29. Acceptor Registry

Figure 29 illustrates the sequence of steps outlined above. In each step, the StreamEndpoint uses base interfaces, such as Protocol_Factory, Transport_Factory, and AV_Acceptor. Therefore, it can be extended easily to support new flow and data transfer protocols. In addition, the address information is opaque to the StreamEndpoint and is passed down to an Acceptor that knows how to interpret it. Moreover, since the flow and data transfer protocols can be linked dynamically via the ACE
Service Configurator mechanisms, applications can take advantage of these protocols by simply changing the name of the protocols in the flow spec.
After completing the preceding steps, the StreamEndpoint calls the request_connection operation on the B StreamEndpoint with the flowspec. The StreamEndpoint_B performs the following steps for each flow:
1. It extracts the flow and data transfer protocol information from the flow spec entry for this flow. If a network address is not specified, then a default stream endpoint is picked.
2. The StreamEndpoint then performs the following actions:
a. It finds a flow protocol factory matching the flow protocol specified for this flow; in the absence of a flow protocol, it tries to match a null flow protocol for the specified data transfer protocol.
b. It finds a matching data transfer protocol factory and creates a connector for it. It then calls connect on the connector, passing it the flow protocol factory.
c. Upon establishing a data transfer connection, the connector creates a protocol object for this flow.
d. The flow protocol factory typically creates the application-level callback object and sets the protocol object on the Base_EndPoint interface passed to it.
e. If an address was not specified for this flow, the StreamEndpoint performs similar steps to listen for those flows, extracts the network endpoints, and inserts them into the flowspec to be sent back to the A StreamEndpoint.
After receiving the reverse flowspec, the A StreamEndpoint connects to all the flows on which the B StreamEndpoint is listening and also sets the peer address for connectionless protocols, such as UDP.
Step 7: Full profile connection establishment. In the full profile, the flow specification does not contain the data transfer information for each flow, since the flows are represented by flow interfaces that need not be collocated in the same process. A StreamCtrl can be used to control different flows, each of which could reside on a different machine. In this case, each FlowEndPoint will need to know the network address information. In the full profile stream setup, the bind operation is called on the StreamCtrl, passing the two StreamEndpoints.

Figure 30. Full Profile Point-to-Point Stream Establishment

Figure 30 illustrates the sequence of steps performed for a full profile point-to-point stream setup. Each of these steps is outlined below:
1. Flow endpoint matching. The StreamCtrl obtains the flow names in each StreamEndpoint by querying the “flows’’ property. For
each flow name, it then obtains the FlowEndPoint using the get_fep method on the StreamEndpoint. If the flowspec is empty, all the flows are considered; otherwise, only the specified flows are considered for stream establishment. It then goes through the list of FlowEndPoints, trying to find a match between the FlowEndPoints on the A and B sides. Two FlowEndPoints are said to match if is_fep_compatible returns true. This call checks whether the format and the protocols of the two FlowEndPoints match. Applications can override this behavior to perform more complex checks, such as checking for semantic nuances of device parameters. For example, one FlowEndPoint may want only a French audio stream, whereas the other FlowEndPoint may support only English. These requested semantics can be checked by querying the “devParams’’ property and checking the value of “language.’’ The StreamEndpoint then tries to obtain a FlowConnection from the StreamCtrl. The application developer can set the FlowConnection object for each flow using the StreamCtrl. All operations on a stream are applied to the contained FlowConnections, and by setting specialized FlowConnections the user can customize the behavior of the stream operations. If the stream does not have a FlowConnection, then a default FlowConnection is created and set for that flow. The StreamEndpoint then calls connect on the FlowConnection with the producer and consumer endpoints and the flow QoS.
2. Flow configuration. The FlowConnection calls set_peer on each of the FlowEndPoints during the connect operation, which lets the FlowEndPoints check and set the peer FlowEndPoint’s configuration. For example, a video consumer can check the ColourModel, ColourDepth, and VideoResolution and allocate a window for the specified resolution, as well as other display resources, e.g., colormaps. In the case of audio, the quantization property can be used by the consumer to allocate appropriate decoder resources.
3. Flow connection establishment. In this step, the FlowConnection calls go_to_listen on one of the FlowEndPoints with the is_mcast parameter set to false, and also passes the flow protocol that was set on the FlowConnection using the use_flow_protocol operation. The FlowEndPoint can raise a failedToListen exception, in which case the FlowConnection calls go_to_listen on the other FlowEndPoint. In TAO’s implementation, go_to_listen performs the sequence of operations shown in Figure 29 to accept on the selected flow protocol and data transfer protocol and also, if needed,

Figure 31. Connector Registry
the control protocol for the flow. Since the FlowEndPoint also derives from Base_EndPoint, the Callback and Protocol_Object objects will be set on the endpoint. In the case of a FlowProducer, the get_timeout operation will be invoked on the Callback object to register for timeout events. The FlowConnection then calls connect_to_peer on the other FlowEndPoint with the address returned by the listening FlowEndPoint, along with the flow name. In the case of connectionless protocols, such as UDP, the listening FlowEndPoint may need to know the reverse channel on which to send the data, in which case it can call the get_rev_channel operation to obtain it. When the FlowEndPoint calls connect_to_peer, the sequence of steps shown in Figure 31 occurs to connect to the listening endpoint. With the above sequence of steps, a stream is established in a point-to-point binding between two multimedia devices.
Point-to-Multipoint Binding
TAO’s point-to-multipoint binding support is essential for handling broadcast/multicast streaming servers. With new technologies, such as Web caching [Fan et al., 1998], multicast updates of Web pages and streaming media files are becoming commonplace. In addition, it has become common for websites to broadcast live events using commercial-off-the-shelf (COTS) technologies, such as RealPlayer. For example, during the 1999 Cricket World Cup, millions of people listened to live commentaries of the matches from the BBC website. In such cases, it is ideal for servers to use multicast technologies, such as IP multicast, to reduce server connections and load. TAO’s point-to-multipoint binding provides such an interface for a source to multicast its flows to multiple sinks, as shown in Figure 32. TAO’s A/V Streaming Service implementation provides a binding based on IP multicast [Deering and Cheriton, 1990]. In this section we explain the sequence of steps that lead to a point-to-multipoint stream establishment in both the light and full profiles.

Figure 32. Point-to-Multipoint Binding

Step 1: Adding a multipoint source. A multipoint source MMDevice must be added before any sinks can be added to the stream. For example, a multicast server could add itself to the StreamCtrl and expose the StreamCtrl interface through a standard CORBA object location service, such as Naming or Trading. If the B party MMDevice parameter to bind_devs is nil, the source is assumed to be a multicast source. As with a point-to-point stream, the endpoints for the source device are created, i.e., the StreamEndpoint and VDev for light profiles, and the StreamEndpoint containing FlowProducers for the full profile. Unlike the point-to-point stream, however, there can only be FlowProducers in the MMDevice. Figure 33 shows the creation of endpoints in the point-to-multipoint binding.
Step 2: Configure the multicast interface. In the case of a multipoint binding there can be numerous sinks. Therefore, the CORBA A/V Streaming Service specification provides
an MCastConfigIf interface, which is used instead of point-to-point VDev configurations.

Figure 33. Creation of Endpoints in the Point-to-Multipoint Binding

Upon addition of a multipoint source, the StreamCtrl creates a new MCastConfigIf interface and sets it as the multicast peer of the source VDev. This design allows the stream binding to use multicasting technologies to distribute the stream configuration information, instead of using numerous point-to-point configurations. The MCastConfigIf interface provides operations to set the initial configuration of the stream, e.g., via the set_initial_configuration operation. This operation can be called by the source VDev during the set_MCast_peer call. This information is conveyed to the multicast sink VDev during the set_peer call on the MCastConfigIf when a multicast sink is added. The MCastConfigIf performs the configuration operation using point-to-point invocations on all sink VDevs.
Step 3: Adding multicast sinks. When a sink wants to join a stream as a multicast sink, it can call bind_devs with a nil A party MMDevice. This call will create the endpoints for the multicast sink, i.e., the StreamEndpoint and the VDev. For full profiles, the StreamEndpoint will contain FlowConsumers. For light profiles, the VDev is added to the MCastConfigIf.
Step 4: Multicast connection establishment. The StreamCtrl then calls connect_leaf on the multicast source endpoint for the multicast sink endpoint. In TAO, the connect_leaf operation will throw the notSupported exception. The StreamCtrl will then try the IP multicast model using the multiconnect call on the source StreamEndpoint. The following steps occur when multiconnect is called on StreamEndpoint_A for each flow in the full profile:
1. The StreamEndpoint makes sure that the endpoint is indeed a FlowProducer.
2. It then checks to see if a FlowConnection interface exists for this flow in the StreamCtrl, which is obtained through the Related_StreamCtrl property.
3. In the absence of a FlowConnection, the StreamEndpoint_A will create a FlowConnection and set the multicast address to be used for this flow on the FlowConnection. An application can configure this address by passing it to the StreamEndpoint during its initialization. The A/V specification does not define how multicast addresses are allocated to flows. Thus, TAO’s StreamEndpoint uses a base multicast address, assigns different ports for the flows, and sets the FlowConnection on the StreamCtrl. We ultimately plan to strategize this allocation so applications can decide on the multicast addresses to use for each flow.
4. The StreamEndpoint then calls add_producer on the FlowConnection.
5. The call to add_producer will result in a connect_mcast on the FlowProducer, passing the multicast address with which to connect. The FlowProducer then returns the
address to which it will multicast the flow. If the return address is complete with a network address, then IP multicast is used. In contrast, if the return address specifies only the protocol name, an ATM-style multicast is used.
6. In addition, the FlowConnection creates a MCastConfigIf if it has not been created and sets it as the multicast peer on the FlowProducer. Since the same MCastConfigIf is used for both FlowEndPoint and VDev, the parameters to MCastConfigIf are passed as CORBA objects. It is the responsibility of MCastConfigIf to check whether the peer is a VDev or a FlowEndpoint.
7. The connect_mcast does the actual connection to the multicast address and results in the sequence of steps for multicast accept using the pluggable A/V protocols.

Figure 34. Connection Establishment in the Point-to-Multipoint Binding

Figure 34 illustrates these steps graphically. The steps described above occur for each multipoint sink that is added to the stream. TAO’s pluggable A/V protocol framework is configured with both full profile and light profile objects. It is also configured in the point-to-point and point-to-multipoint bindings. Thus, the control and management implementation objects can be closed for modification, yet new flow and data transfer protocols can be added flexibly to the framework without modifying these interface implementations. A similar set of steps happens when multiconnect is called on the StreamEndpoint_B.
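From the application’s perspective, the point-to-multipoint steps above reduce to two bind_devs calls with a nil party. The following hedged sketch shows both sides, with illustrative variable names and an empty QoS; the nil-party convention follows the A/V specification as described above.

```cpp
// Hedged sketch: application-visible calls for a point-to-multipoint
// stream.  A nil B party marks the device as a multicast source; a
// nil A party adds a multicast sink.
#include "orbsvcs/AVStreamsC.h"

void
add_multicast_source (AVStreams::StreamCtrl_ptr streamctrl,
                      AVStreams::MMDevice_ptr source_mmdev)
{
  AVStreams::streamQoS qos;   // Empty QoS for the sketch.
  AVStreams::flowSpec flows;  // Empty: bind all flows.
  streamctrl->bind_devs (source_mmdev,
                         AVStreams::MMDevice::_nil (),
                         qos, flows);
}

void
add_multicast_sink (AVStreams::StreamCtrl_ptr streamctrl,
                    AVStreams::MMDevice_ptr sink_mmdev)
{
  AVStreams::streamQoS qos;
  AVStreams::flowSpec flows;
  streamctrl->bind_devs (AVStreams::MMDevice::_nil (),
                         sink_mmdev, qos, flows);
}
```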
Multipoint-to-Multipoint Binding
The multipoint-to-multipoint binding is important for applications, such as videoconferencing, where there are multiple source and sink participants. The CORBA A/V Streaming Service specification does not mandate any particular protocol for multipoint-to-multipoint binding, leaving it to implementers to provide this feature. In TAO, we provide a multipoint-to-multipoint binding by extending the point-to-multipoint binding based on IP multicast. We apply a Leader/Follower pattern [Schmidt et al., 2000] for the sources, where the first source that is added to the stream becomes the leader for the multipoint-to-multipoint binding, and every other source becomes a follower. This design implies that all stream properties, such as format and codec, will be selected by the leader.
REFERENCES

Arulanthu, A. B., O'Ryan, C., Schmidt, D. C., Kircher, M., and Parsons, J. (2000). The design and performance of a scalable ORB architecture for CORBA asynchronous messaging. In Proceedings of the Middleware 2000 Conference. ACM/IFIP.
Box, D. (1997). Essential COM. Addison-Wesley, Reading, MA.
Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., and Stal, M. (1996). Pattern-Oriented Software Architecture: A System of Patterns. Wiley & Sons.
Chen, S., Pu, C., Staehli, R., Cowan, C., and Walpole, J. (1995). A distributed real-time MPEG video audio player. In Fifth International Workshop on Network and Operating System Support of Digital Audio and Video.
Clark, D. D. and Tennenhouse, D. L. (1990). Architectural considerations for a new generation of protocols. In Proceedings of the Symposium on Communications Architectures and Protocols (SIGCOMM), 200-208, Philadelphia, PA. ACM.
Deering, S. E. and Cheriton, D. R. (1990). Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(2), 85-110, May.
Eide, E., Frei, K., Ford, B., Lepreau, J., and Lindstrom, G. (1997). Flick: A flexible, optimizing IDL compiler. In Proceedings of ACM SIGPLAN '97 Conference on Programming Language Design and Implementation (PLDI), Las Vegas, NV. ACM.
D. D. (1996). Vaudeville: A High Performance, Voice Activated Teleconferencing Application. Department of Computer Science, Technical Report WUCS-96-18, Washington University, St. Louis.
Fan, L., Cao, P., Almeida, J., and Broder, A. (1998). Summary cache: A scalable wide-area Web cache sharing protocol. In SIGCOMM 98, 254-265. SIGS.
Gamma, E., Helm, R., Johnson, R. and Vlissides, J. (1995). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA.
Gill, C. D., Levine, D. L., and Schmidt, D. C. (2001). The design and performance of a real-time CORBA scheduling service. Real-Time Systems, The International Journal of Time-Critical Computing Systems, special issue on Real-Time Middleware, 20(2).
Gokhale, A. and Schmidt, D. C. (1996). Measuring the performance of communication middleware on high-speed networks. In Proceedings of SIGCOMM '96, 306-317, Stanford, CA. ACM.
Gokhale, A. and Schmidt, D. C. (1998). Measuring and optimizing CORBA latency and scalability over high-speed networks. IEEE Transactions on Computers, 47(4).
Gokhale, A. and Schmidt, D. C. (1999). Optimizing a CORBA IIOP protocol engine for minimal footprint multimedia systems. IEEE Journal on Selected Areas in Communications, special issue on Service Enabling Platforms for Networked Multimedia Systems, 17(9).
Harrison, T. H., Levine, D. L. and Schmidt, D. C. (1997). The design and performance of a real-time CORBA event service. In Proceedings of OOPSLA '97, Atlanta, GA. ACM.
Hu, J., Mungee, S. and Schmidt, D. C. (1998). Principles for developing and measuring high-performance Web servers over ATM. In Proceedings of INFOCOM '98.
Hu, J., Pyarali, I. and Schmidt, D. C. (1997). Measuring the impact of event dispatching and concurrency models on Web server performance over high-speed networks. In Proceedings of the 2nd Global Internet Conference. IEEE.
Huard, J. F. and Lazar, A. (1998). A programmable transport architecture with QoS guarantees. IEEE Communications Magazine, 36(10), 54-62.
IETF. (2000a). Differentiated services (diffserv). Retrieved on the World Wide Web: http://www.ietf.org/html.charters/diffserv-charter.html.
IETF. (2000b). Integrated services (intserv). Retrieved on the World Wide Web: http://www.ietf.org/html.charters/intserv-charter.html.
ISO. (1993). Coding of Moving Pictures and Audio for Digital Storage Media at up to About 1.5 Mbit/s. International Organization for Standardization.
Kuhns, F., Schmidt, D. C. and Levine, D. L. (1999). The design and performance of a real-time I/O subsystem. In Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium, 154-163, Vancouver, British Columbia, Canada. IEEE.
Kuhns, F., Schmidt, D. C., O'Ryan, C. and Levine, D. (2001). Supporting high-performance I/O in QoS-enabled ORB middleware. Cluster Computing: The Journal on Networks, Software, and Applications.
McCanne, S. and Jacobson, V. (1995). Vic: A flexible framework for packet video. In ACM Multimedia 95, 511-522, New York. ACM Press.
Meyer, B. (1989). Object-Oriented Software Construction. Prentice Hall, Englewood Cliffs, NJ.
Mungee, S., Surendran, N. and Schmidt, D. C. (1999). The design and performance of a CORBA audio/video streaming service. In Proceedings of the Hawaiian International Conference on System Sciences.
NRG, LBNL. (1995). LBNL Audio Conferencing Tool (vat). Retrieved on the World Wide Web: ftp://ftp.ee.lbl.gov/conferencing/vat/.
Object Management Group. (1999). The Common Object Request Broker: Architecture and Specification. Object Management Group, 2.3 edition.
Object Management Group. (2000). The Common Object Request Broker: Architecture and Specification. Object Management Group, 2.4 edition.
OMG. (1996). Property Service Specification. Object Management Group, 1.0 edition.
OMG. (1997a). Control and Management of A/V Streams Specification. Object Management Group, OMG Document telecom/97-05-07 edition.
OMG. (1997b). CORBAServices: Common Object Services Specification, Revised Edition. Object Management Group, 97-12-02 edition.
O'Ryan, C., Kuhns, F., Schmidt, D. C., Othman, O. and Parsons, J. (2000). The design and performance of a pluggable protocols framework for real-time distributed object computing middleware. In Proceedings of the Middleware 2000 Conference. ACM/IFIP.
Pyarali, I., Harrison, T. H. and Schmidt, D. C. (1996). Design and performance of an object-oriented framework for high-performance electronic medical imaging. USENIX Computing Systems, 9(4).
Pyarali, I., O'Ryan, C., Schmidt, D. C., Wang, N., Kachroo, V. and Gokhale, A. (1999). Applying optimization patterns to the design of real-time ORBs. In Proceedings of the 5th Conference on Object-Oriented Technologies and Systems, San Diego, CA. USENIX.
RealNetworks. (1998). RealVideo Player. Retrieved on the World Wide Web: http://www.real.com.
Ritchie, D. (1984). A stream input-output system. AT&T Bell Labs Technical Journal, 63(8), 311-324.
Schmidt, D. C. (1995). Reactor: An object behavioral pattern for concurrent event demultiplexing and event handler dispatching. In Coplien, J. O. and Schmidt, D. C. (Eds.), Pattern Languages of Program Design, 529-545. Addison-Wesley, Reading, MA.
Schmidt, D. C., Levine, D. L. and Mungee, S. (1998a). The design and performance of real-time object request brokers. Computer Communications, 21(4), 294-324.
Schmidt, D. C., Mungee, S., Flores-Gaitan, S. and Gokhale, A. (1998b). Alleviating priority inversion and non-determinism in real-time CORBA ORB core architectures. In Proceedings of the 4th IEEE Real-Time Technology and Applications Symposium, Denver, CO. IEEE.
Schmidt, D. C., Mungee, S., Flores-Gaitan, S. and Gokhale, A. (2001). Software architectures for reducing priority inversion and non-determinism in real-time object request brokers. Journal of Real-Time Systems, special issue on Real-Time Computing in the Age of the Web and the Internet.
Schmidt, D. C., Stal, M., Rohnert, H. and Buschmann, F. (2000). Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2. Wiley & Sons, New York, NY.
Schmidt, D. C. and Suda, T. (1994). An object-oriented framework for dynamically configuring extensible distributed communication systems. IEE/BCS Distributed Systems Engineering Journal (Special Issue on Configurable Distributed Systems), 2, 280-293.
Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, V. (1994). RTP: A transport protocol for real-time applications. Internet-Draft.
Stevens, W. R. (1993). TCP/IP Illustrated, Volume 1. Addison-Wesley, Reading, MA.
Stevens, W. R. (1998). UNIX Network Programming, Volume 1: Networking APIs: Sockets and XTI, Second Edition. Prentice Hall, Englewood Cliffs, NJ.
Sun Microsystems, Inc. (1992). Sun Audio File Format. Sun Microsystems, Inc.
Vxtreme. (1998). Vxtreme Player. Retrieved on the World Wide Web: http://www.microsoft.com/netshow/vxtreme.
Wollrath, A., Riggs, R. and Waldo, J. (1996). A distributed object model for the Java system. USENIX Computing Systems, 9(4).
Chapter IV
MPEG-4 Facial Animation and its Application to a Videophone System for the Deaf

Nikolaos Sarris and Michael G. Strintzis
Aristotle University of Thessaloniki, Greece
INTRODUCTION
This chapter aims to introduce the potential contribution of the emerging MPEG-4 audio-visual representation standard to future multimedia systems. This is attempted through the 'case study' of a particular example of such a system--'LipTelephone'--a special videoconferencing system being developed in the framework of MPEG-4 (Sarris et al., 2000b). The objective of 'LipTelephone' is to serve as a videophone that will enable lip readers to communicate over a standard telephone connection. This will be accomplished by combining model-based with traditional video coding techniques, in order to exploit the information redundancy in a scene of known content while achieving high-fidelity representation in the specific area of interest, which is the speaker's mouth. Through this description, it is shown that the standard provides a wide framework for the incorporation of methods that until recently were the subject of pure research. Various such methods are referenced from the literature, and one is proposed and described in detail for every part of the system being studied. The main objectives of the chapter are to introduce students to these methods for the processing of multimedia material, to provide researchers with a reference to the state of the art in this area, and to urge engineers to use the present research methodologies in future consumer applications.
CONVENTIONAL MULTIMEDIA CODING SCHEMES AND STANDARDS
The basic characteristic and drawback of digital video transmission is the vast amount of data that needs to be stored and transmitted over communication lines. For example, a typical black-and-white videophone image sequence with 10 image frames per second and dimensions 176 x 144 requires a transmission rate of about 2 Mbits/sec (8 bits/pixel x [176x144] pixels/frame x 10 frames/sec). This rate is extremely high even for state-of-the-art communication carriers and demands high compression, which usually results in image degradation. In recent years many standards have emerged for the compression of moving images. In 1992 the Moving Picture Experts Group (MPEG) completed the ISO/IEC MPEG-1 video-coding standard, while in 1994 the MPEG-2 standard was also approved (MPEG, online). These standards contributed immensely to multimedia technological developments, as they both received Emmy awards and made interactive video on CD-ROM and digital television possible (Chariglione, 1998). The ITU-T (formerly CCITT) organization established the H.261 standard in 1990 (ITU-T, 1990) and H.263 in 1995 (ITU-T, 1996), which were especially targeted at videophone communications. These have achieved successful videophone image sequence transmission at rates of approximately 64 Kbps (Kbits per second). The techniques employed in these standards were mainly based on segmentation of the image into uniformly sized 8x8 rectangular blocks. The contents of the blocks were coded using the Discrete Cosine Transform (DCT), and their motion in consecutive frames was estimated so that only the differences in their positions had to be transmitted. In this way spatial and temporal correlation is exploited in local areas and great compression rates are achieved. The side effects of this approximation, however, are visual errors on the borders of the blocks (blocking effect) and regularly spaced dots on the reconstructed image (mosquito effect). These effects are more perceptible when higher compression rates are needed, as in videophone applications. In addition, these standards impose the use of the same technique on the whole image, making it impossible to distinguish the background or other areas of limited interest, even when they are completely still (Schafer & Sikora, 1995; IMSPTC, 1998). These problems are easily tolerated during a usual videoconference session, where sound remains the basic means of communication, but the system is rendered useless to potential hearing-impaired users who exploit lip reading for the understanding of speech. Even for these users, however, image degradation could be tolerated in some areas of the image. For example, the background does not need to be continuously refreshed, and most areas of the head, apart from the mouth area, do not need to be coded with extreme accuracy. It is therefore obvious that multimedia technology at that time could benefit immensely from methods that utilize knowledge of the scene contents, as these could detect different objects in the scene and prioritize their coding appropriately (Benois et al., 1997).
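The arithmetic can be checked directly (a minimal sketch in Python, using only the figures quoted above):

    # Raw bitrate of the uncompressed black-and-white videophone sequence.
    width, height = 176, 144        # QCIF frame dimensions (pixels)
    bits_per_pixel = 8              # luminance only (black-and-white)
    frames_per_second = 10

    bits_per_frame = width * height * bits_per_pixel      # 202,752 bits
    bitrate = bits_per_frame * frames_per_second          # 2,027,520 bits/sec
    print(f"{bitrate / 1e6:.2f} Mbits/sec")               # ~2.03 Mbits/sec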
THE MPEG-4 STANDARD
In acknowledgment of the previously mentioned limitations of the existing standards, MPEG launched in 1993 a new standard called MPEG-4, which was approved in Version 1 in October 1998 and in Version 2 in December 1999. MPEG-4 is the first audio-visual representation standard to model a scene as a composition of objects with specific characteristics and behavior (MPEG, online; Koenen et al., 1997; Abrantes & Pereira, 1997; Doenges, 1998). Every such object may be real or synthetic and can be coded using a different technique and requiring different quality in its reconstruction. A whole subgroup of MPEG-4 called Synthetic/Natural Hybrid Coding (SNHC) was formed to develop the framework for combining synthetic and natural content in the same scene (an example scene is shown in Figure 1).

Figure 1: A virtual scene example combining real and synthetic objects (Kompatsiaris, 2000)

In addition, the standard provides a detailed framework for the representation of the human face. This is accomplished by a 3D face object, which is associated mainly with two sets of attributes: the Facial Definition Parameters (FDPs) (shown in Figure 2) and the Facial Animation Parameters (FAPs) (a representative subset of which is included in Table 1). The first describes a set of characteristic feature points on the face, while the second provides a set of facial deformation parameters, which have been found to be capable of describing any human expression. Both sets of parameters were built based on the physiology of the human face. In particular, FAPs are based on the study of minimal facial actions and are closely related to muscle actions. The units of measurement for the FAP values are relative to the face dimensions shown in Figure 3. The non-rigid motion dictated by any particular FAP may be uni- or bi-directional on one 3D axis, as shown in Table 1.

Figure 2: The MPEG-4 Facial Definition Points (ISO/IEC, 1998)

At this point it should be made clear that the standard does not provide any means of performing the necessary operations in order to segment the image and detect the positions of feature points. However, having detected the positions of these feature points and calculated the values of the animation parameters, MPEG-4 provides a model-based coding scheme, which multiplexes this information together with the representations of any other objects the scene may contain and transmits them over the communication channel. The MPEG-4 decoder on the receiver side demultiplexes the coded information and displays the reconstructed image.
Table 1: Some of the FAPs as described by MPEG-4 (ISO/IEC, 1998)

FAP name            | FAP description                                                                    | Units | Uni- or Bidir | Pos motion | FDP group num | FDP subgrp num
open_jaw            | Vertical jaw displacement (does not affect mouth opening)                          | MNS   | U             | down       | 2             | 1
lower_t_midlip      | Vertical top middle inner lip displacement                                         | MNS   | B             | down       | 2             | 2
raise_b_midlip      | Vertical bottom middle inner lip displacement                                      | MNS   | B             | up         | 2             | 3
stretch_l_cornerlip | Horizontal displacement of left inner lip corner                                   | MW    | B             | left       | 2             | 4
stretch_r_cornerlip | Horizontal displacement of right inner lip corner                                  | MW    | B             | right      | 2             | 5
lower_t_lip_lm      | Vertical displacement of midpoint between left corner and middle of top inner lip  | MNS   | B             | down       | 2             | 6
Figure 3: The MPEG-4 Facial Animation Parameter Units (ISO/IEC, 1998)
The coding gain using this technique for the transmission of a human face image sequence is obvious, as the 3D model has to be transmitted only at the beginning of the session, and only the set of 68 FAPs (numbers) has to be transmitted for every frame. Experimental coding methods based only on the 3D face object have been reported to deliver videoconferencing scenes at bandwidths as low as 10 Kbps (Abrantes & Pereira, 1999).
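To put this gain in perspective, the FAP stream can be compared with the raw sequence computed earlier (a sketch; the 16 bits per FAP value is an illustrative assumption, not a figure from the chapter):

    # Approximate FAP-stream bandwidth, assuming (hypothetically) 16 bits
    # per FAP value and no further entropy coding of the parameter stream.
    num_faps = 68                  # FAPs defined by MPEG-4
    bits_per_fap = 16              # assumed quantization (illustrative only)
    frames_per_second = 10         # videophone frame rate used earlier

    bitrate = num_faps * bits_per_fap * frames_per_second   # 10,880 bits/sec
    print(f"{bitrate / 1000:.1f} Kbits/sec")                 # ~10.9 Kbps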
'LIP-TELEPHONE': A CASE STUDY
In previous sections the conventional multimedia coding standards were examined, and their limitations with respect to demanding videoconferencing systems were pointed out. Furthermore, we highlighted some innovative features of the emerging MPEG-4 standard that may be utilized to overcome the aforementioned limitations. The present section describes in detail how the capabilities of MPEG-4 can be exploited in future videoconferencing systems to improve the quality of the reconstructed multimedia stream, even in situations where the communication channel bandwidth is limited. These capabilities are demonstrated by a 'case study' of the 'LipTelephone' system, which is being developed in the Information Processing Lab under the Greek GSRT 98AMEA23 research project. LipTelephone will be a special videophone system that will enable the lip-reading deaf and hearing impaired to communicate over a standard telephone line. This will be accomplished by providing much better image quality and fidelity in the speaker's mouth area than that provided by standard videophone systems. The development team aims to achieve this by prioritizing the need for image quality and fidelity in separate parts of the image, as described in the following sections. It will be shown that the amenities of the MPEG-4 standard are necessary for the use of the proposed system, in parallel with the development and use of special image processing algorithms, which will also be described in detail.
General Description of the System
The reason this application is useful to lip readers is that existing videophone systems lack the resolution and accuracy in the provided image that are necessary for the visual understanding of speech. We believe that these limitations can be overcome with the help of the recently emerged video coding standard MPEG-4. In particular, the purpose of this work is to combine a way of transmitting the area of the speaker's mouth with high resolution and fidelity, while achieving a very high compression rate for the rest of the image, which is not critical for the visual understanding of speech. The MPEG-4 standard may greatly assist in this cause, as it has provided a framework which allows:
1. Compression, by use of conventional methods, of a whole image or specially defined parts of the image (i.e., rectangular or even irregular areas of the whole frame).
2. Special coding of a human face, describing its position, shape, texture and expression, by means of two sets of arithmetic parameters: the Facial Definition Parameters (FDPs) and the Facial Animation Parameters (FAPs).
When using this second method, given a three-dimensional (3D) model of the speaker's head, the only data that have to be transmitted are the values of the Facial Animation Parameters, which have been found to be capable of describing any facial expression. The 3D head model will be known to the receiver end, as it will be transmitted at the beginning of the conference session. Based on past reports on MPEG-4 systems (Sarris et al., 2000b), we estimate that, together with the transmission of the sound which will be needed to make the system usable both to deaf and to hearing persons, the required bandwidth for satisfactory quality will lie around 15-20 Kbps, while excellent quality will be assured with 30-40 Kbps. Using the first method to code the area of the mouth, which requires high quality in the decoded image stream, it is expected that the extra bandwidth needed will not exceed 20 Kbps. Therefore, the complete bitstream combining both methods with satisfactory quality is not expected to exceed 48 Kbps, which can be transmitted through a simple ISDN line, widely available to the public at low cost. The design and implementation of this system on dedicated hardware is also in progress by INTRACOM SA, the largest private telecommunication company in Greece (which also participates in the aforementioned research project), to ensure that the final product will work robustly and in real time, providing the required image quality.
As explained in the previous paragraphs, MPEG-4 provides the framework for developing model-based coding schemes, but it is open to the particular techniques that may be used to achieve the image analysis and modeling within these schemes. Thus, for the present system, techniques need to be developed for the following tasks:
1. Three-dimensional modeling of the human face: A 'neutral' 3D facial model has to be created and properly labeled.
2. Segmentation of the videophone image and detection of the characteristics of the face: The face and some of its characteristic points have to be detected within the first image frame.
3. Adaptation of a 3D face model to the speaker's face: The 'neutral' face model has to be adapted to the particular speaker according to the characteristic points detected.
4. 2D feature tracking: The characteristic points of the face have to be tracked in subsequent image frames.
5. Estimation of the 3D motion of the speaker's head: The 3D motion of the whole head must be estimated based on the observed 2D motion of the feature points.
6. Development of the user interface and communication channel handling issues: A window-based user interface has to be developed, and the issues of transmitting the encoded stream over the communication channel resolved.

Figure 4: LipTelephone block diagram: Initialization phase (above), tracking phase (below)

The above procedures are highlighted in the block diagram of the system shown in Figure 4. As seen, the system operation is divided into two phases:
1. The initialization phase, where the feature points are extracted from the first image frame and the 'neutral' face model is adapted to these characteristics. This process takes place only at the beginning of a conference session.
2. The tracking phase, where the feature points are tracked in subsequent image frames; from these the mouth area is detected and coded, and the 3D motion of the speaker's head is estimated.
The high-resolution coding of the detected mouth area and the creation of the MPEG-4 stream are issues handled fully by the standard. MPEG-4 provides a rate-controlled DCT-based coding algorithm, which is an improved version of the MPEG-2 codec with the added capability of defining areas in the image with specific coding characteristics. Therefore, having detected the position of the mouth area in every frame, this area will be coded with the provided algorithm so that, when reconstructed, it displays the area of the mouth with the desired quality and accuracy. Then, the speaker's head and mouth objects will be multiplexed in a stream according to the MPEG-4 format. In the ensuing sections each of the above tasks is described in detail by analyzing the techniques being developed and comparing them to other methods described in the literature.
Three-Dimensional Modeling
The creation of three-dimensional graphic models was initiated by the need to describe real-world scenes electronically in an accurate and realistic way. Since our world is three-dimensional, 3D models are essential for this purpose. These models are composed of vertices (points) in the 3D space and polygonal interconnections between these vertices. The surface formed by these adjacent polygons aims to approximate the real-world surface. The denser the vertices are (i.e., the smaller the polygons), the better the approximation. In the proposed system we must represent the three-dimensional form of a human speaker. Adequate detail in modeling the head, and particularly the face, is essential. Various methods are used for the creation of such models:
1. Purely synthetic creation of the model with the aid of special computer design software (example screenshots of such applications are shown in Figure 5). The modeling starts with an initial rough approximation of the human head (which may vary from a simple 3D ellipsoid to an ordinary head model), which is manually modified until the desired facial characteristics have been created.
2. By use of special hardware. Special laser scanners can produce accurate 3D models of small- to medium-sized artifacts, but the cost of these devices is still too high to integrate in a general-purpose product.
3. By processing of multiview images. Given a set of views (two or more) of a scene, special techniques may be employed to detect feature correspondences and construct a 3D model (Tzovaras et al., 1999). This technique, although cost efficient, as it requires no special hardware, has not yet proved to be robust enough for real-world applications.
4. Knowledge-based modeling. Based on knowledge of the characteristics of the subject to be modeled (i.e., the speaker's head), methods have been developed which adapt a similar 'generic' model by deforming it in a natural-looking way (Zhang, 1998; Terzopoulos & Waters, 1990; Magnenat-Thalmann et al., 1998; Escher & Magnenat-Thalmann, 1997).

Figure 5: Commercial applications for 3D modeling

Extensive research in the area of 3D modeling has resulted in the development of a variety of formats for the representation of 3D scenes. Two of the most popular and easy to use are VRML (Virtual Reality Modeling Language) and the Open Inventor format from Silicon Graphics, Inc. Both formats provide a basic structure to describe a simple model, which consists of two lists: a list of nodes/vertices with their 3D coordinates and a list of polygons with the numbers of the nodes they interconnect. In addition to these two lists, many features are provided to make the model easier to use and more realistic. For example, keywords are provided for the insertion of primitives such as circles, squares, etc.; transformations such as translation, rotation or scaling can be easily applied; and different colors can be defined for separate objects within the model. A simple example of the definition of a 3D model in VRML is shown in Figure 6. Moreover, a vast variety of 3D models are publicly available, including models of the human face with various characteristics (male, female, young, aged, etc.). Most of these models are built based on knowledge of the basic anatomy of the human face, combining the use of special capturing devices with post-processing by state-of-the-art 3D modeling applications. The resulting model is made even more realistic by the process of texture mapping, which colors the surface of the model with the actual texture of a face provided by its photograph.
In the particular case of the 'LipTelephone' system, techniques are sought to detect the positions of the known face characteristics (eyes, nose, mouth, etc.) and deform an initial face model so that it adapts naturally to these features. A single model will be used for this purpose (the one shown in Figure 5), with every node labeled by a special tag showing the part of the face to which it belongs (e.g., mouth, left eye, etc.). This labeling has been achieved with the development of an interactive tool, which highlights one node at a time and requests the user to identify the face part. Obviously, that is an off-line process, which is needed every time a new 'neutral' model has to be incorporated into the system.

Figure 6: A simple 3D model in VRML

#VRML V1.0 ASCII
Separator {              # Separates different objects in the model
  Coordinate3 {          # Set of 3D nodes
    point [
      0.000000 0.029621 0.011785,    # node No0
      0.000000 0.025242 0.015459,    # node No1
      0.014062 0.026902 0.008709,    # ...
      0.007543 0.019862 0.012705,
      0.011035 0.018056 0.007652,
      0.012934 0.018587 0.007649,
      0.012934 0.018587 0.007649,
      0.011035 0.018056 0.007652,
      0.015393 0.016240 0.004451,
      0.000000 0.029621 0.011785
    ]
  }
  IndexedFaceSet {       # Set of polygons (triangles)
    coordIndex [
      0, 1, 3, -1,
      3, 4, 5, -1,       # a triangle which connects nodes No3-No4-No5
      6, 7, 8, -1,
      6, 8, 9, -1,
      0, 1, 8, -1,
      1, 8, 9, -1
    ]
  }
}
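The same two-list structure maps directly onto a vertex array and a triangle index array (a minimal Python illustration of the representation, not part of the chapter's implementation):

    import numpy as np

    # Vertices (3D nodes) and triangles (index triples), mirroring Figure 6.
    vertices = np.array([
        [0.000000, 0.029621, 0.011785],   # node No0
        [0.000000, 0.025242, 0.015459],   # node No1
        [0.014062, 0.026902, 0.008709],
        [0.007543, 0.019862, 0.012705],
        [0.011035, 0.018056, 0.007652],
        [0.012934, 0.018587, 0.007649],
    ])
    triangles = np.array([[0, 1, 3],      # each row connects three nodes
                          [3, 4, 5]])

    # The modeled surface is the union of these triangles; the denser the
    # vertices, the better the approximation of the real surface.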
For the accurate calculation of the speaker's face characteristics in 3D, two views at right angles (front and profile) will be needed, as shown in later sections. This technique was selected because it does not require the use of special equipment and was found to be fast and reliable. Although the adaptation of the model needs to be performed only once, at the beginning of a conference session, it must be fast enough to avoid annoying the user. To achieve even better performance, a set of previously adapted models will be stored by the system, so that the procedure does not have to be repeated for someone who has previously participated in a conference session. The particular face model, shown in Figure 5, has neutral characteristics and a medium density of vertices (the face consists of 240 vertices). These features make it easy to adapt to other faces and quick to handle through the various stages of the system.
Segmentation of the Image and Facial Feature Extraction
The segmentation of an image in general, and particularly the detection and extraction of facial features, are common problems in many image processing applications such as facial recognition, tracking and modeling. In the special case of videoconferencing images, the detection of the face becomes a simpler task, as the programmer is assisted by knowledge of the scene content (i.e., it is known that the image scene contains one face approximately in the middle of the scene). In general face detection applications, techniques based on neural networks (NNs) (Lawrence et al., 1997; Sarris et al., 1999b; Sarris et al., 2000c), principal component analysis (PCA) (Craw et al., 1999; Turk & Pentland, 1991), or analysis of color distribution (Chai & Nghan, 1999; Sarris et al., 2000a; Sobottka & Pittas, 1998) have proved to be efficient and reliable. General NN and PCA methods are based on the degree of correlation of the given image with a type of template representation of a face. In this way they manage to detect the presence and position of a face, and sometimes even recognize its identity by comparing it to a number of images in a given database. More complicated techniques, based on deformable contours (Cootes et al., 1995) or on deformable 2D and 3D models (DeCarlo & Metaxas, 1996), have also been reported to accurately detect the position and shape of a face. These are based on an initial contour approximation of a face (either 2D or 3D), which is deformed by the application of specific forces until the desired fitting to the given face has been achieved.
In the current application, however, the extraction of facial features is necessary for the adaptation of a generic face model. These features consist of the exact locations of particular points, such as outer left eye (i.e., the outermost/leftmost point of the left eye), inner left eye, leftmost mouth, etc. Therefore, a result in far greater detail is sought than that of detecting a rectangular or ellipsoid area, or even the irregular contour containing the face.
In summary, possible methods for extracting the face region from the rest of the scene can be based either on temporal or on spatial homogeneity. Methods using temporal criteria may separate the user by his homogeneous movement in the scene. However, such methods will not separate the face from the rest of the visible body (hair, neck, etc.) and will also not work when the user is standing still. Spatial criteria may be applied, as the region should generally be similarly colored, connected and have an ellipsoidal shape. However, difficulties arise in their implementation, because dark areas, like eyes and facial hair, do not have the same color as the rest of the face. Moreover, the skin color may differ among users, and the proximity of the neck of the user, as well as objects in the background with color similar to that of the skin, can produce confusion.
In order to deal with these difficulties, a semi-automatic method was implemented to train a neural network to recognize image blocks contained in the facial region (a more detailed analysis is given by Sarris et al., 2000c). This is achieved by directing the user to position his face in a highlighted area, as shown in Figure 7. The contents of this area are then used as positive examples for the training of a feed-forward neural network, while the contents of the rest of the image scene (outside the highlighted area) are used as negative examples for the training of the same neural network.

Figure 7: Selection of the neural network training area

To avoid dependence on the varying illumination of the scene, only the chrominance components Cb and Cr of the YCbCr color space are used to describe the contents of the scene. Moreover, the feature vectors used for the training of the neural network are not built solely from the direct chrominance values, because the resulting neural network would then not be capable of separating skin pixels from similarly colored pixels in the background of the scene. Rather, both the facial and the background areas are broken into consecutive rectangular blocks, from which the Cb and Cr histograms are quantized and concatenated to form the training vectors.
The neural network, trained from the image shown in Figure 7, is used to determine facial blocks in subsequent frames captured from the camera. This is accomplished by dividing every such frame into rectangular blocks and constructing a feature vector from each block in the same way as in the training process (i.e., the Cb and Cr histograms are built, quantized and concatenated). The neural network is then consulted with each feature vector to decide whether the particular block belongs to the face of the user or not. Results are quite satisfactory (two samples are shown in Figure 8), although a small number of blocks are misclassified. To compensate for these errors, a number of post-processing operations, similar to the ones proposed in Chai and Nghan (1999), are performed on the output of the neural network. Finally, a connected component algorithm is employed to find all connected regions and accept only the one with the largest area, based on the knowledge that only one face should be present in the scene. Typical results from these post-processing operations are shown in Figure 8.

Figure 8: (from left to right) Camera image, facial region as identified by the neural network, corrected facial region after post-processing operations

Having detected the facial region in the image, specific operators are implemented to locate the exact positions of the eyes, eyebrows, nose and mouth. To locate the position of the eyes robustly, a temporal criterion was employed by detecting the blinking of the eyes, as proposed by Bala et al. (1997). To detect the exact eye region within each window, the connected regions are sought within the window, and the region with the greatest area is considered to represent the eye. The proposed algorithm proves to be robust in all situations where the user's eyes are clearly visible in the scene. A sample result is shown in Figure 9.

Figure 9: Contour refinement and blink detection

Having reliably located the positions of the eyes, we employ an iterative adaptive thresholding algorithm on the image luminance Y, within the facial area, until connected thresholded areas are found at the expected positions. The thresholding operation is performed adaptively, according to the mean luminance value observed within the detected facial area. False features are eliminated by exploiting geometric knowledge of the structure of the face. Sample results for the frontal and profile view of a face are shown in Figure 10. The exact positions of the feature points are found as the edge points of the detected areas.

Figure 10: The facial regions and features extracted by thresholding in the frontal and profile image views. In the profile view the rectangles show the search areas for the particular features.
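The block-feature construction described above can be sketched as follows (a simplified Python illustration; the 8x8 block size and the 16-bin quantization are assumptions for the example, not values specified in the chapter):

    import numpy as np

    def block_feature(cb_block, cr_block, bins=16):
        """Quantized Cb and Cr histograms of one block, concatenated."""
        # Chrominance samples are 8-bit values in [0, 255].
        h_cb, _ = np.histogram(cb_block, bins=bins, range=(0, 256))
        h_cr, _ = np.histogram(cr_block, bins=bins, range=(0, 256))
        feat = np.concatenate([h_cb, h_cr]).astype(np.float64)
        return feat / feat.sum()            # normalize by block size

    def frame_features(cb, cr, block=8):
        """Split the chrominance planes into blocks, one vector per block."""
        feats, positions = [], []
        for y in range(0, cb.shape[0] - block + 1, block):
            for x in range(0, cb.shape[1] - block + 1, block):
                feats.append(block_feature(cb[y:y+block, x:x+block],
                                           cr[y:y+block, x:x+block]))
                positions.append((y, x))
        return np.array(feats), positions

    # Each feature vector is then classified by the trained feed-forward
    # network as belonging to the face or to the background.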
Three-Dimensional Facial Model Adaptation
This section describes a method for the adaptation of a generic three-dimensional face to the particular person's face viewed by the camera. Several methods have been proposed in the literature for the deformation of 3D face models. In Lee et al. (1995) and Eisert and Girod (1998), the required geometry of the face is captured by a 3D laser scanner. In Lee et al. (1995) the generic geometric model is deformed by applying physical spring forces to predetermined nodes, according to the positions of the corresponding features in the target face geometry. In Eisert and Girod (1998) the generic model is built using triangular B-spline patches, and the deformation involves the displacement of the spline control points which correspond to facial features. Lee et al. (1999) deformed a generic face model using a three-dimensional geometric transformation, aiming to fit the two orthogonal views of the 3D model to the corresponding photographs taken of a target face. The transformation used, called Dirichlet FFD, was proposed by Moccozet and Magnenat-Thalmann (1997) and is a type of the Free Form Deformation (FFD) introduced by Sederberg and Parry (1986). Pighin et al. (1997) estimated the required 3D positions of the facial features from a series of captured image frames of the target face and transformed the generic model by applying an interpolation function based on radial basis functions. Information from one view of the target face is utilized by Zhang (1998) to measure the face, eye and mouth dimensions, which are used to adapt a simple 3D face model by rigidly transforming the whole face and locally correcting the position and orientation of the face, eyes and mouth. Geometric assumptions have to be made, however, as the 3D characteristics of the features cannot be totally deduced from only one view.
Our approach (detailed in Sarris & Strintzis, 2000) differs from those of all the above methods in that it treats the facial model as a collection of facial parts, which are allowed to be deformed according to separate affine transformations. This is a simple method and is effective for face characterization, because the physiological differences in characteristics between faces are based on precisely such local variations, i.e., a person may have a longer or shorter nose, narrower eyes, etc.
The required face geometry is captured from two orthogonal views of the face, acquired by the camera and segmented as described in the previous section. Having located the facial features as described in the previous section, their edge points (e.g., leftmost-eye, rightmost-eye, etc.) are defined as the feature points of the face. Given the positions of these feature points in 2D, standard geometrical calculations, assisted by least squares methods, provide the required positions of the feature points in the 3D space. Having calculated these required 3D positions for the facial feature points, the 3D facial model needs to be deformed so that its corresponding nodes are displaced towards these positions, while maintaining the natural characteristics and symmetry of the human face.
The first part of this adaptation process involves a rigid transformation of the model. Thus, the model is rotated and translated to match the pose of the real face. The rotation and translation transformations are calculated by a slight modification of the 'spatial resection' problem in photogrammetry (Haralick & Shapiro, 1993). A non-linear minimization is performed so that the rotation and translation transformations bring the model feature nodes as close as possible to their required positions. The rigid transformation may align the generic model with the required face and scale it to meet the total face dimensions. However, the local physiology of the face cannot be altered in this way. Thus, in the second step of the adaptation process, the model is transformed in a non-rigid way, aiming to further displace the feature nodes, bringing them as close as possible to their exact calculated positions while the facial parts retain their natural characteristics. To perform this adaptation, the model is split into face parts (left eye, right eye, mouth, etc.) and a rigid adaptation is performed on every part separately. This means that every 3D face part is rotated, translated and stretched so as to minimize the distances of the feature points belonging to that face part from their required positions. This is accomplished with the following transformations (a code sketch of these steps follows at the end of this section):
1. Centering at the origin: The center of the face part is found as the 3D center of the feature nodes contained, and the whole part is translated towards the origin so that this center is on the origin. The same is done with the set of required feature positions (i.e., their center is found and they are translated towards the origin in the same manner).
2. Alignment: The face part is rotated around the origin so that three lines connecting three pairs of feature nodes are aligned with the corresponding lines connecting the required feature positions. This is accomplished by minimizing the differences in the gradients of these lines.
3. Stretching: The face part is scaled around the origin with different scale factors for every axis, so that the distances of the transformed nodes from their required positions are minimized.
4. Translation: After the stretching transformation, the face part is translated back towards its original position by adding the position vector of the face part center calculated and subtracted in Step 1.

Figure 11: Local adaptation of the right eyebrow (upper part) and right eye (lower part)

Results of these steps for two face parts are shown in Figure 11. Between all face parts, one series of nodes is defined as border nodes (i.e., these nodes do not belong to any face part). The positions of these nodes after the deformation are found by linear interpolation of the neighboring nodes belonging to face parts. This is done to ensure that a smooth transition is made from one facial part to the other, and possible discontinuities are filtered out. Thus, the final deformed model adapts to the particular characteristics implied by the feature points (e.g., bigger nose or smaller eyes) while keeping the generic characteristics of a human face (e.g., smoothness of the skin and symmetry of the face). Figure 12 shows the results of the rigid and non-rigid adaptation procedures.

Figure 12: Rigid (upper) and non-rigid (lower) adaptation of the face model shown on the front and profile views
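The four transformation steps can be sketched as follows (a simplified Python illustration; for the alignment step it substitutes a standard SVD-based least-squares rotation for the line-gradient minimization described above, and the argument layout is hypothetical):

    import numpy as np

    def adapt_part(part_nodes, feature_idx, targets):
        """Rigidly adapt one face part: center, align, stretch, translate.
        part_nodes: (N, 3) nodes of the part; feature_idx: indices of the
        feature nodes within the part; targets: (K, 3) required positions."""
        feats = part_nodes[feature_idx]

        # 1. Centering: translate part and targets so both centers sit at
        #    the origin.
        c_part, c_tgt = feats.mean(axis=0), targets.mean(axis=0)
        p = part_nodes - c_part
        f = feats - c_part
        t = targets - c_tgt

        # 2. Alignment: least-squares rotation of the feature nodes onto the
        #    targets (SVD/Kabsch stand-in for the gradient minimization).
        u, _, vt = np.linalg.svd(f.T @ t)
        d = np.sign(np.linalg.det(vt.T @ u.T))
        r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
        p, f = p @ r.T, f @ r.T

        # 3. Stretching: per-axis scale factors minimizing the remaining
        #    feature-to-target distances (closed-form least squares).
        scale = (f * t).sum(axis=0) / (f * f).sum(axis=0)
        p = p * scale

        # 4. Translation: move the part back by adding the center removed
        #    in Step 1, as described in the text.
        return p + c_part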
2D Feature Tracking and 3D Motion Estimation
This section presents a method that tackles the problem of tracking the facial rigid and non-rigid 3D motion. Among the techniques that have been proposed for the estimation of facial motion (DeCarlo & Metaxas, 1996; Essa et al., 1994; Guenter et al., 1998; Pighin et al., 1999; Terzopoulos & Waters, 1993), most employ dense optical flow estimation algorithms and are not suitable for reliable real-time computation because of their complexity. Feature-based tracking methods have generally been avoided, as they tend to give unreliable results. Successful attempts to simplify this task by using artificial markers or heavy make-up on the tracked faces (Guenter et al., 1998; Terzopoulos & Waters, 1993) proved the suitability of feature-based tracking; however, these techniques are not applicable to real, unrestricted systems. The minimization of an error function over the set of facial expressions and face positions spanned by the 3D model, by altering the values of facial position and expression parameters, is proposed by Pighin et al. (1999). A similar approach, where the facial expressions are represented by a set of eigenvectors, is proposed by Valente and Dugelay (2000). Although in theory these two methods can directly yield the required values of facial motion in the parametric space introduced, they are based on the assumptions that an extremely realistic rendering of the model is available and that lighting conditions can be perfectly compensated.
In the above methods, several parametric models of the face have been introduced for the interpretation of facial motion and expressions. However, the only standardized facial animation parameter space, recently introduced by MPEG-4, i.e., the FAPs, has not been sufficiently explored. Only one method (Eisert & Girod, 1998) has been proposed in the literature for the extraction of FAPs. This is based on a linearized dense optical flow motion estimation method that utilizes a spline-based 3D facial model for the extraction of FAPs. This model, however, can only be generated using specialized 3D scanning equipment, while the motion estimation method proposed suffers from the aforementioned complexity of dense optical flow.
The present section first proposes a technique to improve the reliability of feature-based tracking methods for 3D facial motion estimation. This is achieved by introducing a criterion for assessing the reliability of tracked features that correspond to 3D model nodes, which is combined with two other criteria of accuracy of a feature tracking algorithm to provide a measure of confidence for the estimated position of every feature. Furthermore, the framework standardized by MPEG-4 is utilized to provide a non-rigid facial motion model, which guides the feature tracking algorithm. These techniques are integrated into a system which, without the use of any face markers, estimates the 3D rigid and non-rigid motion of points of a human face and, through these, the Facial Animation Parameters (FAPs).

Facial Feature Tracking
Various methods have been proposed for the tracking of motion in images, based on dense optic flow, block matching, Fourier transforms or pel-recursive methods, as described by Dufaux and Moscheni (1995). In the present work a feature-based approach has been preferred, both because of its speed and because of its natural suitability to this knowledge-based node-point correspondence problem (Sarris et al., 1999a). In general, the features to be tracked may be corners, edges or points, and the particular set of features must be selected such that they may be tracked easily and reliably. A widely used feature-based algorithm is that proposed by Kanade, Lucas and Tomasi, often referred to as the KLT algorithm in the literature (Tomasi & Kanade, 1991; Shi & Tomasi, 1994). The KLT algorithm is based on the minimization of the sum of squared intensity differences between a past and a current feature window, which is performed using a Newton-Raphson minimization method. Although the KLT algorithm has proved to yield satisfactory results on its own, in the present system it is very important to assess the results of tracking, so that the optimum set of feature correspondences is used in the stage of the model adaptation. For this reason, the tracked feature points are sorted according to two criteria introduced by and closely related to the operation of the KLT algorithm, and a third criterion related to the nature of the 3D model to be adapted. These criteria are defined as follows:
Trackability: The ability of a 2D feature point to be tracked reliably, which is related to the roughness of the texture within its window (Tomasi & Kanade, 1991; Shi & Tomasi, 1994).
Dissimilarity: The sum of squared intensity differences within the feature window W, between the two consecutive image frames I and J. Dissimilarity indicates how well the feature has been tracked (Tomasi & Kanade, 1991; Shi & Tomasi, 1994).
Reliability: Every tracked feature point corresponds to a node (feature node) on the 3D face model, which has been adapted to the face located in the previous image frame. The reliability metric of a node is defined as cos θ, where θ is the angle formed by the optical axis and the normal vector to the surface at the node in question, as seen in Figure 13.

Figure 13: The tangent plane and normal vector of a node in relation with the optical axis

These three criteria are combined to provide a measure of confidence for a feature point that has been tracked. According to this measure we choose the best-tracked feature points for use in the next step, which is the estimation of facial motion.
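A tracking-and-ranking step of this kind can be sketched with the pyramidal KLT tracker available in OpenCV (an illustration only: the window size, the score combination and the use of OpenCV's per-feature error as a stand-in for the dissimilarity criterion are assumptions, not the chapter's implementation):

    import cv2
    import numpy as np

    def track_and_rank(prev_gray, curr_gray, prev_pts, normals, optical_axis):
        """Track feature points with KLT and rank them by confidence.
        prev_pts: float32 array of shape (N, 1, 2); normals: (N, 3) unit
        node normals; optical_axis: unit 3-vector."""
        curr_pts, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, curr_gray, prev_pts, None,
            winSize=(15, 15), maxLevel=2)
        status, err = status.reshape(-1), err.reshape(-1)

        # Reliability criterion: cos(theta) between each node normal and
        # the optical axis.
        reliability = normals @ optical_axis

        # Combined confidence: higher reliability and lower dissimilarity
        # give a higher score; lost features get the lowest possible score.
        score = np.where(status == 1, reliability / (1.0 + err), -np.inf)
        order = np.argsort(score)[::-1]       # best-tracked features first
        return curr_pts, score, order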
Rigid 3D-Motion Estimation
Having found the set of the positions of the feature points tracked most reliably according to these criteria, we need to move the head model in such a way that its feature nodes project as closely as possible to these positions. This means that we need to compute the 3D motion parameters (R, T) which will minimize the distance of these projections from their positions tracked in the image. Many methods have been proposed in the literature for the solution of this problem (Aggarwal & Nandhakumar, 1988), and the accuracy of their solution depends highly on the reliability of the given feature correspondences. Having ensured from the previous step that the selected feature correspondences are suitable for the given tracking method, we employ the method proposed by Tsai and Huang (1984), improved by Weng et al. (1989) and enhanced to include the focal length of our camera, in order to estimate R and T up to a scaling factor. Further, using the 3D model coordinates in the previous frame, we compute this scaling factor and determine an absolute solution for R and T. Furthermore, this solution is optimized by an iterative minimization algorithm, as recommended by Weng et al. (1993). Having determined R and T, the new rigidly transformed 3D positions of all the model nodes are known.
As stressed in the previous section, the tracked feature points are ranked according to their trackability, dissimilarity and reliability, and only the ones with sufficiently high rank are used for the estimation of the 3D rigid motion. Thus, many features may not be accurately positioned by the tracking procedure in the current frame. The correct positions of these feature points are calculated from the estimated R and T matrices. In this way, these points are relocated in the image, as shown in Figure 14, so that they can be tracked in subsequent frames. Obviously, a reliable relocation always requires a reliable 3D motion estimation.

Figure 14: Example of the assessment of tracked features, the rigid fitting of the 3D model and the relocation of the mistracked (rejected) features by projection of the corresponding transformed model nodes (rejected feature points are shown by an empty box, accepted ones by a full box and the corrected positions by crosses)

Moreover, as the possibility always exists that the real features are not visible due to occlusions caused by head rotation, we check again the sign and magnitude of the reliability metric. If it is negative (θ > 90°), the tangent plane at the node is oriented away from the camera, and thus the feature point is invisible in the image frame. Therefore, this feature point is not taken into consideration for tracking in the next frame. Even if the reliability metric is positive, its magnitude has to be greater than a threshold, as there is a high possibility that the re-projected feature lies close to the border of the face. In that case, a small calculation error may cause the feature to be assigned to the background, which of course is undesirable.
This method has proven to be reliable even under extreme rotation conditions, where most other methods fail. In Figure 16 the tracking results are shown for a moving head image sequence. There, it is evident that although the feature tracking module may fail in the correct estimation of the positions of some feature points, our method succeeds in selecting the most accurately tracked points and thus calculates the accurate 3D rigid motion of the head.

Figure 15: Examples of tracking along a line: black boxes show rigid positions of feature points; white boxes show non-rigid positions of feature points

Figure 16: Results of rigid and non-rigid tracking
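The relocation and visibility test can be sketched as follows (a Python illustration assuming a simple pinhole camera with focal length f; the threshold value and axis conventions are hypothetical):

    import numpy as np

    def relocate_and_filter(nodes, normals, R, T, f, cos_thresh=0.1):
        """Relocate rejected features by projecting the rigidly transformed
        model nodes, and drop nodes that face away from the camera."""
        moved = nodes @ R.T + T                 # apply estimated rigid motion
        n_rot = normals @ R.T                   # rotate the node normals too

        # Pinhole projection onto the image plane (camera at the origin,
        # looking along +Z).
        pts2d = f * moved[:, :2] / moved[:, 2:3]

        optical_axis = np.array([0.0, 0.0, -1.0])   # toward the camera
        reliability = n_rot @ optical_axis          # cos(theta) per node

        # A negative cosine (theta > 90 deg) means the surface faces away
        # from the camera; a small positive value means the feature lies
        # near the border of the face, so it is also excluded.
        visible = reliability > cos_thresh
        return pts2d, visible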
Non-Rigid 3D Facial Motion Estimation As explained in the beginning of the chapter, the MPEG-4 standard has defined a set of 68 parameters, (the Facial Animation Parameters; FAPs), to fully describe the allowed rigid and non-rigid motions that can be observed on a human face with respect to a so-called ‘neutral pose.’ Three FAPs are used to describe the rotation of the head around the neck (FAP values 48-50: head pitch, yaw and roll, respectively). After estimating the rigid motion of the head as described in the previous section, these FAPs can be readily computed from the Euler angles of the rotation matrix R. Most of the other FAPs affect only one FDP describing its non-rigid motion, but one FDP may be affected by more than one FAP. For each of these parameters, the direction of movement is specified by the MPEG-4 standard, while the FAP value describes the algebraic value of the non-rigid motion. Once the rigid part of the motion has been successfully recovered by the proposed approach, the local 2D motion estimates provided by the feature tracker can be used to determine the non-rigid motion of the corresponding feature point. Although only the 2D motion of each feature point is available, its 3D non-rigid motion can be recovered since the possible axes of movement are specified by the MPEG-4 standard. Three cases may be identified for the evaluation of the possible number of degrees of freedom for a non-rigid feature point: The feature point can move in any of three possible directions. This happens only for the eyes, tongue and jaw. However, the eyes can be handled as separate rigid objects, while the tongue and jaw are given three degrees of freedom only to support exaggerated movement (e.g., by cartoon characters). Thus, in our case of real human face image sequences, the possible number of degrees of freedom will always be smaller than three. The feature point can only move in two directions (two degrees of freedom). For example, FDP 2.1 can be translated along the X-axis using FAP 15 and along the Z-axis using FAP 14. If (x, y, z) are the initial values of the corresponding wireframe node in the neutral pose (e.g., time 0), then (x+F15, y, z+F14) is the position of the same node at frame t if only non-rigid motion exists. However, under the presence of rigid motion, the resulting position of the 3D node will be:
$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = R \begin{bmatrix} x + F_{15} \\ y \\ z + F_{14} \end{bmatrix} + T$$

Using the projection equations we obtain a system of two equations with two unknowns, which can be solved for the unknown FAP values F15 and F14.
• The feature point can only move in one direction (one degree of freedom). This case can be dealt with in a similar way, resulting in an over-determined system of two equations with one unknown.
Note that the tracking procedure used by the KLT transform has two degrees of freedom. Although this kind of tracking procedure is suitable for FDP nodes that are affected by two FAPs, it is not suitable for FDP nodes affected by only one FAP (one degree of freedom). For such nodes, it is more suitable to use a 1D feature tracker, which determines the feature position along a 2D line. This line is determined using the above equation and the projection equation on the image plane. For this reason, we have constrained the KLT tracking algorithm to optionally search along a single 2D line, which
is the projection of the calculated 3D axis of permitted non-rigid motion. Examples of non-rigid tracking along a line are shown in Figure 15. A moving head image sequence illustrating both rigid and non-rigid motion was used to demonstrate the results of our method. The R and T matrices calculated by the rigid-motion estimation module were used to transform the head model adapted to the face in frame 1, which is projected on every subsequent frame to illustrate the accuracy of the proposed technique. The non-rigid motion estimation method produced an MPEG-4-compatible FAP file, which was given to an MPEG-4 FAP player to animate a synthetic face and illustrate the correspondence between the captured image sequence and the synthetically produced face animation. Both from the projection of the 3D model and from the observation of the rotation of the synthetic face, it is evident that both the rigid and the non-rigid motion of the head are estimated with adequate accuracy in every frame.
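To make the two computations above concrete, the sketch below recovers the head-pose FAPs from the Euler angles of R and solves the two-equation system for F15 and F14 for a two-degree-of-freedom node. It is a minimal illustration, not the authors' implementation: it assumes a simple pinhole projection (u = f·x′/z′, v = f·y′/z′), an R = Rz·Ry·Rx Euler convention and NumPy, and all names are hypothetical.

```python
import numpy as np

def head_pose_faps(R):
    # Euler angles (pitch, yaw, roll) of the rotation matrix R, assuming
    # the convention R = Rz(roll) @ Ry(yaw) @ Rx(pitch); these angles
    # correspond to FAPs 48-50.
    yaw = np.arcsin(-R[2, 0])
    pitch = np.arctan2(R[2, 1], R[2, 2])
    roll = np.arctan2(R[1, 0], R[0, 0])
    return pitch, yaw, roll

def solve_f15_f14(R, T, node, uv, f):
    # node = (x, y, z): wireframe node in the neutral pose.
    # uv = (u, v): tracked 2D position of the feature in the current frame.
    # The moved node is q = R @ (node + [F15, 0, F14]) + T; the projection
    # u = f*qx/qz, v = f*qy/qz gives two equations linear in (F15, F14).
    u, v = uv
    base = R @ np.asarray(node, float) + T   # part independent of the FAPs
    ca, cb = R[:, 0], R[:, 2]                # coefficients of F15 and F14 in q
    A = np.array([[f * ca[0] - u * ca[2], f * cb[0] - u * cb[2]],
                  [f * ca[1] - v * ca[2], f * cb[1] - v * cb[2]]])
    rhs = -np.array([f * base[0] - u * base[2],
                     f * base[1] - v * base[2]])
    f15, f14 = np.linalg.solve(A, rhs)
    return f15, f14
```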
User-Interface and Networking Issues
The interface of the system has been designed to be user friendly and simple, providing the following operations:
• Establishment of a connection over a telephone line (standard or ISDN), or over the Internet. The users are able to dial a telephone number and establish a connection using a standard modem over a standard telephone line, or provide an IP address if an Internet connection exists.
• Real-time transmission and reception of video. The system is able to transmit and receive the video images of the speakers in real time.
• Real-time transmission and reception of audio. The system is able to transmit and receive audio in case one of the users is hearing. When this feature is disabled, the bandwidth gained is exploited to improve the video image.
• Real-time transmission and reception of text. A text window is provided where the users may exchange messages in real time.
• Real-time transmission and reception of graphics. A 'whiteboard' window is provided where users may draw or paste areas of their desktop. In this way many concepts can be explained by diagramming information or using a sketch.
An initial Graphical User Interface (GUI), which has been developed to incorporate the menus and buttons needed for these operations, is shown in Figure 17. The leftmost button below the video image is used to place a call and establish a connection with another LipTelephone user. The next three buttons to the right launch the helper application windows (whiteboard, chat and file transfer), while the two on the right are used to enable or disable video and audio. The 'File' menu provides operations to capture and save to the local disk still images or video. The 'Options' menu controls the settings for the provided video and audio. The 'Drivers' menu lets the user select the capture card and camera to be used (if more than one is available). The additional windows needed during operation are shown in Figure 18.

Figure 17: The main window of the Graphical User Interface

For the communication of data between the two parties, we have selected the Real-time Transport Protocol (RTP) (Schulzrinne et al., 1996) over an IP connection which is established either over a telephone line (standard or ISDN) or over the Internet. RTP is the
Figure 18: The helper application windows as provided by NetMeeting
Internet-standard protocol for the transport of real-time data and has been preferred over TCP as it provides special functionality suited for carrying real-time content, such as timestamps and control mechanisms for synchronizing different streams with timing properties. RTP consists of a data and a control part. The data part of RTP is a thin protocol providing support for applications with real-time properties such as continuous media (e.g., audio and video), including timing reconstruction, loss detection, security and content identification. The control part of RTP provides support for real-time conferencing of groups of any size within the Internet. This support includes source identification and support for gateways like audio and video bridges as well as multicast-to-unicast translators. It offers quality-of-service feedback from receivers to the multicast group as well as support for the synchronization of different media streams.
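As a concrete illustration of the data part, the fixed 12-byte RTP header defined in RFC 1889 carries exactly the fields mentioned above: a sequence number for loss detection, a timestamp for timing reconstruction and an SSRC identifier for source identification. The sketch below packs such a header in Python; it is an illustration only, and the payload-type value 96 is an arbitrary dynamic-range choice, not something mandated by the text.

```python
import struct

def rtp_header(seq, timestamp, ssrc, payload_type=96, marker=0):
    # RFC 1889 fixed header: version 2, no padding, no extension, no CSRCs.
    byte0 = 2 << 6                                  # V=2, P=0, X=0, CC=0
    byte1 = (marker << 7) | (payload_type & 0x7F)   # M bit + payload type
    return struct.pack("!BBHII",
                       byte0, byte1,
                       seq & 0xFFFF,                # loss detection
                       timestamp & 0xFFFFFFFF,      # timing reconstruction
                       ssrc)                        # source identification

packet = rtp_header(seq=1, timestamp=90000, ssrc=0x1234ABCD) + b"<payload>"
```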
Results
Results from our initial experiments with the system have shown satisfactory image quality at a frame rate of 10 fps (frames per second). User tests with hearing-impaired subjects have shown that 10 fps is adequate for lip-reading, provided that the image is clear within the area of the lips. In a CIF-sized image (352x288 pixels), the bounding box of the lips is less than 100x50 pixels for the usual position of the speaker's face (i.e., positioned so that the face covers almost the whole area of the image). Adequate image quality (25 dB minimum) was achieved in our experiments for the lip area at a coding rate of 0.5 bpp (bits per pixel). This means that the bandwidth required for the lip area is around 20 Kbps (Kbits per second) at 10 fps. This is the main bandwidth overhead of the system, as the sound can be coded at 2 Kbps with reasonable quality (11 KHz sampling rate), and for the rest of the head only six floating point numbers (240 bytes per second) need to be transmitted for the description of the rigid motion of the head, or 68 floating point numbers–all FAPs–(about 2.7 Kbytes per second) for the description of both the rigid and the non-rigid motion of the head. Furthermore, these numbers can be arithmetically coded as proposed in the MPEG-4 standard, providing a bitstream requiring a maximum bandwidth of 2 Kbps. This means that the whole system requires a bandwidth of less than 25 Kbps operating at 10 fps, which is achievable over a standard telephone line or even some fast Internet connections. The initialization of the system, which involves locating the facial features and adapting the face model, requires 5-10 seconds but can be performed in parallel with the establishment of the connection, thus making the delay invisible to the user. During normal operation the system can currently code frames at a rate of 5-7 fps on a standard Pentium II PC at 450 MHz with 256 MB RAM, but work is currently under way to optimize the speed of the system, which is expected to reach 10 fps.
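The bandwidth budget above can be checked with a few lines of arithmetic. This back-of-the-envelope sketch assumes 4-byte floats and takes the lip bounding box at its 100x50-pixel upper bound, so its total sits slightly above the figures quoted in the text:

```python
fps = 10
lip_bits = 100 * 50 * 0.5 * fps     # 100x50 px at 0.5 bpp -> 25,000 bit/s max
audio_bits = 2_000                  # ~2 Kbps speech coding (11 KHz sampling)
rigid_bytes = 6 * 4 * fps           # R and T as six floats -> 240 bytes/s
all_faps_bytes = 68 * 4 * fps       # all FAPs -> 2,720 bytes/s (~2.7 KB/s)
faps_coded_bits = 2_000             # after MPEG-4 arithmetic coding (maximum)

total_bits = lip_bits + audio_bits + faps_coded_bits
print(f"worst-case total: {total_bits / 1000:.1f} Kbps")  # ~29 Kbps upper bound
```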
CONCLUSIONS
In this chapter an attempt has been made to introduce the capabilities provided by the MPEG-4 audiovisual standard for content-based video coding. It has been shown that MPEG-4 has standardized the format of content-based coded data, but most of the methods needed to perform the coding have been left open so that many different implementations can be proposed. Many such methods, which have been the object of past and present research, have been referenced from the literature, and a particular method has been proposed and described in detail for every part of the videoconferencing application. The purpose of this study has been to introduce students to such image processing techniques, to provide researchers with a reference to the state of the art in this area, and to urge engineers to use the present research methodologies in future consumer applications. The way to future multimedia applications, at least in the area of content-based coding, is now clearly visible with the help of the MPEG-4 standard. It is left to industry to elaborate and embed the developed methodologies in the applications to come.
REFERENCES
Abrantes, G. A. and Pereira, F. (1999). MPEG-4 facial animation technology: Survey, implementation and results. IEEE Transactions on Circuits and Systems for Video Technology, 9(2).
Aggarwal, J. K. and Nandhakumar, N. (1988). On the computation of motion from sequences of images-A review. Proceedings of the IEEE, 76(8), 917-935.
Bala, L. P., Talmi, K. and Liu, J. (1997). Automatic detection and tracking of faces and facial features in video sequences. Proceedings of the Picture Coding Symposium (PCS). Berlin, Germany.
Benois-Pineau, J., Sarris, N., Barba, D. and Strintzis, M. G. (1997). Video coding for wireless varying bit-rate communications based on area of interest and region representation. International Conference on Image Processing (ICIP'97), 3, 555-558. Santa Barbara, CA, USA.
Chai, D. and Ngan, K. N. (1999). Face segmentation using skin color map in videophone applications. IEEE Transactions on Circuits and Systems for Video Technology, 9(4).
Chiariglione, L. (1998). Impact of MPEG standards on multimedia industry. Proceedings of the IEEE, 86(2).
Cootes, T. F., Di Mauro, E. C., Taylor, C. J. and Lanitis, A. (1995). Flexible 3D models from uncalibrated cameras. Proceedings of the British Machine Vision Conference, 147-156. BMVA Press.
Craw, I., Costen, N., Kato, T. and Akamatsu, S. (1999). How should we represent faces for automatic recognition? IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8).
DeCarlo, D. and Metaxas, D. (1996). The integration of optical flow and deformable models with applications to human face shape and motion estimation. Proceedings of the CVPR, 231-238.
Doenges, P. K. (1998). Synthetic/natural hybrid coding mixed media content in MPEG-4. MPEG-4 Seminar, Waseda University, Tokyo.
Dufaux, F. and Moscheni, F. (1995). Motion estimation techniques for digital TV: A review and a new contribution. Proceedings of the IEEE, 83(6), 858-876.
Eisert, P. and Girod, B. (1998). Analyzing facial expressions for virtual conferencing. IEEE Computer Graphics and Applications, 70-78.
Escher, M. and Magnenat-Thalmann, N. (1997). Automatic 3D cloning and real-time animation of a human face. Proceedings of Computer Animation. Geneva, Switzerland.
Essa, I., Darell, T. and Pentland, A. (1994). Tracking facial motion. Proceedings of the IEEE Workshop on Non-Rigid and Articulate Motion. Austin, Texas.
Guenter, B., Grimm, C., Wood, D., Malvar, H. and Pighin, F. (1998). Making faces. Proceedings of the SIGGRAPH, 55-66.
Haralick, R. M. and Shapiro, L. G. (1993). Computer and Robot Vision. Volume II, 125-150. Addison Wesley.
ISO/IEC. (1998). MPEG video and SNHC. Text of ISO/IEC FDIS 14496-3: Audio. Doc. ISO/MPEG N2503, Atlantic City MPEG Meeting.
ITU-T. (1990). Video codec for audiovisual services at p×64 kbit/s. ITU-T Rec. H.261. Geneva.
ITU-T. (1996). Video coding for narrow telecommunications channels at <64 kbit/s. ITU-T Rec. H.263. Geneva.
Kalva, H. et al. (1999). Implementing multiplexing, streaming and server interaction for MPEG-4. IEEE Transactions on Circuits and Systems for Video Technology, 9(8).
Koenen, R., Pereira, F. and Chiariglione, L. (1997). MPEG-4: Context and objectives. Image Communication Journal, 9(4), 295-304.
Kompatsiaris, I. and Strintzis, M. G. (2000). Spatiotemporal segmentation and tracking of objects for visualization of videoconference image sequences. IEEE Transactions on Circuits and Systems for Video Technology, 10(8).
Lawrence, S., Giles, L., Tsoi, A. and Back, A. (1997). Face recognition: A convolutional neural network approach. IEEE Transactions on Neural Networks, 8(1), 98-113.
Lee, W. S., Escher, M., Sannier, G. and Magnenat-Thalmann, N. (1999). MPEG-4-compatible faces from orthogonal photos. Proceedings of the International Conference on Computer Animation, 186-194. Geneva, Switzerland.
Lee, Y., Terzopoulos, D. and Waters, K. (1995). Realistic modeling for facial animation. Proceedings of the SIGGRAPH, 55-62.
Magnenat-Thalmann, N., Kalra, P. and Escher, M. (1998). Face to virtual face. Proceedings of the IEEE, 86(5).
Moccozet, L. and Magnenat-Thalmann, N. (1997). Dirichlet free-form deformations and their application to hand simulation. Proceedings of the International Conference on Computer Animation, 93-102. Geneva, Switzerland.
Moving Picture Experts Group (MPEG). Available on the World Wide Web at: http://www.cselt.it/mpeg.
Pighin, F., Auslander, J., Lischinski, D., Salesin, D. and Szeliski, R. (1997). Realistic facial animation using image-based 3D morphing. Technical Report UW-CSE-97-01-03. University of Washington.
Pighin, F., Szeliski, R. and Salesin, D. (1999). Resynthesizing facial animation through 3D model-based tracking. Proceedings of the International Conference on Computer Vision (ICCV), 1. Corfu, September.
Sarris, N. and Strintzis, M. G. (2000). Three-dimensional facial model adaptation. International Conference on Image Processing (ICIP). Vancouver, Canada.
Sarris, N., Grammalidis, N. and Strintzis, M. G. (2000c). A novel neural network technique for the detection of human faces in visual scenes. IEEE 5th Seminar on Neural Network Applications in Electrical Engineering (NEUREL). Belgrade, Yugoslavia.
Sarris, N., Karagiannis, P. and Strintzis, M. G. (2000a). Automatic extraction of facial feature points for MPEG-4 videophone applications. Proceedings of the IEEE International Conference on Consumer Electronics (ICCE). Los Angeles, USA.
Sarris, N., Makris, D. and Strintzis, M. G. (1999a). Three-dimensional model-based rigid tracking of a human head. International Workshop on Intelligent Communication Technologies and Applications with Emphasis on Mobile Communications. Neuchatel, Switzerland.
Sarris, N., Simitopoulos, D. and Strintzis, M. G. (2000b). LipTelephone: A videophone for the deaf. Proceedings of the Medical Infobahn for Europe (MIE 2000). Hanover, Germany.
Sarris, N., Tzanetos, G. and Strintzis, M. G. (1999b). Three-dimensional model adaptation and tracking of a human face. 4th European Conference on Multimedia Applications, Services and Techniques (ECMAST). Madrid, Spain.
Schafer, R. and Sikora, T. (1995). Digital video coding standards and their role in video communications. Proceedings of the IEEE, 83(6).
Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, V. (1996). RTP: A transport protocol for real-time applications. RFC 1889, Audio-Video Transport Working Group.
Sederberg, T. and Parry, S. R. (1986). Free-form deformation of solid geometric models. Proceedings of the SIGGRAPH, 151-160.
Shi, J. and Tomasi, C. (1994). Good features to track. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 593-600.
Sobottka, K. and Pitas, I. (1998). A novel method for automatic face segmentation, facial feature extraction and tracking. Signal Processing: Image Communication, 12, 263-281.
Terzopoulos, D. and Waters, K. (1990). Physically based facial modeling, analysis and animation. Journal of Visualization and Computer Animation, 1, 73-80.
Terzopoulos, D. and Waters, K. (1993). Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 569-579.
The Image and Multidimensional Signal Processing Technical Committee (IMSPTC). (1998). The past, present and future of image and multidimensional signal processing. IEEE Signal Processing Magazine, March.
Tomasi, C. and Kanade, T. (1991). Detection and Tracking of Point Features. Shape and Motion from Image Streams: A Factorization Method-Part 3. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Tsai, R. Y. and Huang, T. S. (1984). Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), 13-26.
Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71-86.
Tzovaras, D., Kompatsiaris, I. and Strintzis, M. G. (1999). 3D object articulation and motion estimation in model-based stereoscopic videoconference image sequence analysis and coding. Image Communication, 14(9).
Valente, S. and Dugelay, J. L. (2000). Face tracking and realistic animations for telecommunicant clones. IEEE Multimedia, 7(1).
Weng, J., Ahuja, N. and Huang, T. S. (1993). Optimal motion and structure estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9), 864-884.
Weng, J., Huang, T. S. and Ahuja, N. (1989). Motion and structure from two perspective views: Algorithms, error analysis and error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5), 451-474.
Zhang, L. (1997). Tracking a face for knowledge-based coding of videophone sequences. Signal Processing: Image Communication, 10, 93-114.
Zhang, L. (1998). Automatic adaptation of a face model using action units for semantic coding of videophone sequences. IEEE Transactions on Circuits and Systems for Video Technology, 8(6), 781-795.
Chapter V
News On Demand
Mark T. Maybury
The MITRE Corporation, USA
Recently scientists have been focusing on a new class of application that promises on-demand access to multimedia information such as radio and broadcast news. This chapter describes how the synergy of speech, language and image processing has enabled a new class of information-on-demand news systems. We describe the ability to automatically process broadcast video 7x24 and serve it to the general public in individually tailored personalcasts. We describe several systems and identify some remaining challenging research areas.
MULTIMEDIA INFORMATION ON DEMAND
Information on demand, the ability to provide information tailored to specific user needs, promises new capabilities for research, education and training, and electronic commerce (e.g., on-line information access, question answering and customer service). Whereas significant commercial activity has focused on providing access to documents, Web pages and structured data sources, less attention has been given to multimedia information on demand. Achieving effective multimedia information on demand, however, requires a confluence of capabilities from several fields, including image, speech and language processing, information retrieval, information extraction, translation, summarization and presentation design. Over the past several years, scientists have been exploring a range of systems to provide tailored, content-based access to multimedia including text, imagery, audio and video (Maybury, 1993). We describe two systems to illustrate this new functionality: BBN's Rough 'n Ready and MITRE's Broadcast News Navigator (BNN).
ROUGH 'N READY
BBN's Rough 'n Ready (Kubala et al., 2000) uses a 60,000-word automatic speech transcription system to enable real-time monitoring of broadcast news with approximately
18.8% word error rates for broadcast news sources. As Figure 1 illustrates, Rough 'n Ready provides a "rough" transcription of content that is "ready" for browsing. The left-most column in Figure 1 illustrates the system's ability to detect a particular speaker (e.g., Elizabeth Vargas), or at least a speaker change and the speaker's gender. The center column shows the transcribed words, with color highlighting of named entities (people, locations, organizations) detected by natural language processing. Finally, using topic detection and tracking software, the third column lists keywords indicative of the subject of the associated text segment. Using the system, a user can either browse or search news for relevant stories. Topic Detection and Tracking (TDT) has become a common task, with progress measured by community-based evaluation across different tools (Allan et al., 1998; Wayne, 1998).
BROADCAST NEWS NAVIGATOR (BNN)
MITRE's Broadcast News Navigator (BNN) (Maybury et al., 1997) goes beyond spoken language processing and synergistically combines techniques for processing speech, language and imagery, providing more sophisticated presentation of and access to news. The Web-based BNN gives the user the ability to browse, query (using free text or named entities) and view stories or their multimedia summaries. For example, Figure 2 displays all stories about Princess Diana on CNN Prime News during August-September 1997. For each story, the user is given the ability to view its closed-caption text, named entities (i.e., people, places, organizations, time, money), a generated multimedia summary, or the full original video of the story. The user can also graph trends of named entities in the news for given sources and time periods. For example, Figure 3 graphs the onset and abatement of stories on Princess Diana and Mother Teresa, September 8-15, 1997.

Figure 1: BBN's Rough 'n Ready, ABC World News Tonight, January 31, 1998

Figure 2: Tailored multimedia news

Figure 3: Temporal visualization of named entities

Analyzing audio, video and text streams from digitized video, BNN segments stories, extracts named entities, summarizes stories and designs presentations to provide the end user with content-based, personalized Web access to the news (Maybury et al., 1997). For example, within the video stream color histograms are used to classify frames and detect scene changes. In the audio stream, algorithms detect silence and speaker changes and transcribe the spoken language. Finally, the closed-caption stream and/or speech transcription is processed to extract named entities. This fully automated broadcast news system stands in contrast to the current method of manual transcription and summarization of broadcast news (e.g., via closed-captioning services), which is expensive, error prone and can result in dissemination delays. BNN has been integrated into a larger system called GeoNODE (Hyland et al., 1999), which correlates named entities across stories to create story clusters and then partitions these clusters from a constructed hypergraph to automatically identify topics.
Our analyses show that in key tasks, such as segmenting stories, audio and imagery processing can enhance algorithms which are based only on linguistic cues (e.g., explicitly stated anchor welcomes, anchor-to-reporter handoffs, story introductions). For example, silence detection, speaker change detection and key frame detection (e.g., black frames, logos) can improve the performance of text-only story segmentors. By modeling story
News On Demand 129
transitions using hidden Markov models and learning the most effective combination of cross-media cues (Boykin & Merlino, 1999), successive versions of the system have incrementally increased performance. As illustrated in Figure 4, the system's version 2.0 performance over a range of broadcast sources (e.g., CNN, MS-NBC and ABC), averaging across all cues, is 38% precision and 42% recall. In contrast, performance for the best combination of multimodal cues rises to 53% precision and 78% recall. When visual anchor booth recognition cues are specialized to a specific source (e.g., ITN broadcasts, which have more regular visual story change indicators), the performance rises to 96% precision and recall.

Figure 4: BNN segmentation performance by version (precision and recall, all cues vs. best cues, for version 2.0 and the ITN-specialized version)

Given properly delineated story boundaries, BNN is able to summarize the story in a variety of fashions. This includes extracting (Aberdeen et al., 1995) the most significant named entities, extracting visual elements (e.g., keyframes) and/or summarizing the story text, and creating from these elements a multimedia summary. We have integrated Carnegie Mellon University's SPHINX-II speech system into BNN and have begun experiments in extracting named entities from transcribed stories. For example, Figure 5 shows a manual transcript of a segment of a story with human markup of locations (bold italic) and organizations (bold underlined). In contrast, Figure 6 shows an automated transcription of a news segment followed by automated markup of locations (bold italic) and organizations (bold underlined). Notice the errors in the automated transcript: omission of punctuation and content (e.g., "A," "AND") and substitutions (e.g., "EVELYN" for "CONSIDERING," "TRIAL" for "PANEL"). Also note errors in the named entity identification in Figure 6 ("BLACK" is identified as a location; "UNITED STATES" and "CONGRESSIONAL BLACK CAUCUS" are missed). In the DARPA HUB-4 information extraction evaluation, the version of the system described in Palmer et al. (1999) achieved 71-81% accuracy on broadcast news speech data with a wide range of word error rates (WERs) ranging from 13% to 28%. The model achieved 88% accuracy on reference transcriptions with a WER of 0% (like Figure 5). Even when dealing with closed-captioned text, we face a 10-15% word error rate because of errors introduced during manual transcription. The word error rates for the best automated speech transcription systems (depending upon processing speed) range widely from 13-28% on studio quality speech (e.g., anchor segments) to 40% or higher on shots with degraded audio (e.g., reporters in the field, speakers calling in over the phone, music in the background). Furthermore, neither closed captions nor speech transcripts have case information to use, for example, to recognize proper nouns. In addition, speech transcripts contain
Figure 5: Manual transcription and extraction
WHITE HOUSE OFFICIALS SAY THE PRESIDENT IS CONSIDERING TWO STEPS: A NEW PRESIDENTIAL PANEL MODELED ON THE 1968 KERNER COMMISSION, WHICH CONCLUDED THERE WERE TWO SOCIETIES IN THE UNITED STATES, ONE BLACK, ONE WHITE SEPARATE AND UNEQUAL. A WHITE HOUSE SPONSORED CONFERENCE THAT WOULD INCLUDE CIVIL RIGHTS ACTIVISTS, EXPERTS, POLITICAL LEADERS AND OTHERS THE NEW HEAD OF THE CONGRESSIONAL BLACK CAUCUS SAYS THE PRESIDENT SHOULD BE DOING ALL THIS AND MORE.
Figure 6: Automated transcription and extraction
WHITE HOUSE OFFICIALS SAY THE PRESIDENT IS EVELYN TWO STEPS WHAT NEW PRESIDENTIAL TRAIL OF ALL ON THE NINETEEN SIXTY EIGHT KERNER COMMISSION WHICH CONCLUDED THE OUR TWO SOCIETIES IN THE UNITED STATES ONE BLACK ONE WHITE SEPARATE AND UNEQUAL WHITE HOUSE SPONSORED CONFERENCE THAT WILL INCLUDE CIVIL RIGHTS ACTIVISTS EXPERTS POLITICAL LEADERS AND OTHERS AND THE NEW HEAD OF THE CONGRESSIONAL BLACK CAUCUS OF THE PRESIDENT TO BE DOING ALL THIS AND MORE.
no punctuation and can contain disfluencies (e.g., hesitations, false starts), which further deteriorate performance. However, it is important to point out that the types of errors introduced during human transcription are distinct from those made by automated transcription, which can have different consequences for subsequent named entity processing. In the HUB-4 evaluations (DARPA Broadcast News Workshop, 1999; Hirschman, 1998), named entity extraction on clean, caseless and punctuationless manual transcripts was approximately 90%, in contrast to the best extraction performance on newswire with case, which was approximately 94%. Related evaluations are investigating effective information dissemination and retrospective retrieval, including access to massive and multilingual sources (Harman, 1998). Significantly, MITRE's named entity extraction system, called Alembic1 (Aberdeen et al., 1995), consists of rule sequences that are automatically induced from an annotated corpus using error-based transformational learning. This significantly enhances cross-domain portability and system build time (from years or months to weeks) and also results in more perspicuous rule sets. The Alembic group is presently pursuing the induction of relational structure with the intent of automated event template filling.
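The word error rates quoted throughout this discussion are word-level Levenshtein distances (substitutions, insertions and deletions) normalized by the length of the reference transcript. A minimal sketch of the computation, not any particular evaluation tool:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance divided by the reference length.
    r, h = reference.split(), hypothesis.split()
    d = list(range(len(h) + 1))          # DP row against the empty reference
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i             # prev holds d[i-1][j-1]
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1,                       # deletion
                      d[j - 1] + 1,                   # insertion
                      prev + (r[i - 1] != h[j - 1]))  # substitution or match
            prev, d[j] = d[j], cur
    return d[len(h)] / len(r)

# Comparing the automated transcript of Figure 6 against the manual one of
# Figure 5 gives a WER in the same ballpark as the rates reported above.
```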
BNN USER STUDIES
We have performed a number of user studies (Merlino & Maybury, 1999) with BNN and discovered that users can perform their information-seeking tasks both faster and more
accurately by using optimal mixes of news story summaries. For example, using the BNN display shown in Figure 2 (a key video frame plus the top three entities identified per segmented story, which we term a "skim" in Figure 7), users can find relevant stories with the same rate of precision, but in one-sixth the time required to retrieve (unsegmented) indexed video using a digital VCR. If we add a little more information to the display (e.g., an automatically generated one-line summary of each story and a list of extracted topics), the user's retrieval rate slows to only one-third the time of the indexed video (called "Full" in Figure 7). However, recall and precision are as good as if the user had watched the full video. Results for this identification task are summarized in Figure 7, with the best performance for both precision and recall resulting from "full" or "story details" displays integrating multiple media elements (e.g., key frame, summary, top named entities, access to video and closed caption). In addition to this story retrieval task, we measured performance in question answering (called the comprehension task). As expected, on average answering comprehension questions took 50% longer (approximately 4 seconds per question/story) than story identification (approximately 2 normalized seconds per story). The performance of users on comprehension tasks using presentations such as that shown in Figure 2 was not as good. We attribute this to multiple factors, including the fact that summaries may be less valuable for comprehension tasks, but also because of limits in the state of the art in information extraction. Nevertheless, on a satisfaction scale of 1 to 10 (1 dislike, 10 like), for both story retrieval and question-answering tasks, users prefer mixed-media displays like that in Figure 2 about twice as much (7.8 average rating for retrieval, 8.2 for comprehension) over other displays including text transcripts, video source, summaries, named entities or topic lists (average ratings of 5.2 and 4.0 for retrieval and comprehension tasks, respectively). Interestingly, giving users less information (e.g., only the three most frequently occurring named entities in a story) reduces task time without significantly affecting performance. This broadcast news understanding experimentation illustrates how synergistic processing of multimodal streams to segment, extract, summarize and tailor content can enhance information-seeking task performance and user satisfaction. It is also the case that the user's task performance can be enhanced by increasing intelligence in the interaction between user and system.

Figure 7: Average precision vs. time for story identification
[Figure 7 plots precision (roughly 0.75 to 1.0) against average time per story (0 to 6 seconds) for the display conditions Topic, Skim, Story Details, Full Details, Summary, 3 Named Entities, Key Frame, All Named Entities, Video and Text; the ideal operating point combines high precision and high recall at low time.]
NEWS ON DEMAND RESEARCH
There are many fascinating research areas raised by this new class of systems. These include:
1. Multilingual news processing. Our increasingly connected global marketplace will bring with it new opportunities and challenges for consumers to access the world's news in much the same way they access the world's Web today.
2. News content visualization. The growing volume of global materials will motivate methods for visualizing individual news programs, tracking topics and visualizing events across news programs, and visualization methods for access to cross-media news repositories.
3. Agent-mediated news analysis. With the advent of intelligent multimedia interfaces that both process multimodal input and generate multimedia output (Maybury, 1993a; 1993b), there are many possibilities for redefining the interaction style with a non-linear medium such as video.
4. Collaborative media access. The use of collaboration systems (e.g., recommender systems) promises new methods of virtual collaborative news analysis.
5. Usability. These systems need to operate in public kiosks, home and mobile applications, so usability of this new class of system is critical (Maybury, 1998).
6. Machine learning. Machine learning of algorithms using multimedia corpora promises portability across users, domains and environments. There remain many research opportunities in machine learning applied to multimedia interaction, such as on-line learning from one medium to benefit processing in another (e.g., learning new words that appear in newswires to enhance spoken language models for transcription of radio broadcasts). A central challenge will be the rapid learning of explainable and robust systems from noisy, partial and small amounts of learning material.
7. Evaluation. Community-defined evaluations will be essential for progress; the key to this progress will be a shared infrastructure of benchmark tasks with training and test sets to support cross-site performance comparisons.
CONCLUSION
The use of multimedia information and interfaces continues to rise, and our ability to harness multimedia for user benefit will remain a key challenge as we move into the next millennium. Methodologies such as corpus-based systems and biologically inspired processing give us reason to be optimistic that we will improve our understanding of, and achieve regular and predictable progress with, sophisticated multimedia interfaces to information and people.
ACKNOWLEDGMENTS I would like to thank Lynette Hirschman, David Palmer, John Burger and Andy Merlino for providing the spoken language processing examples for BNN. I thank Warren Greiff for discussions regarding probabilistic models of news understanding and retrieval. I also thank Stanley Boykin and Andy Merlino for providing BNN story segmentation performance results.
ENDNOTE 1 The Alembic Workbench, which enables graphical corpus annotation, can be downloaded from http://www.mitre.org/resources/centers/it/g063/nl-index.html.
REFERENCES
Aberdeen, J., Burger, J., Day, D., Hirschman, L., Robinson, P. and Vilain, M. (1995). MITRE: Description of the Alembic system as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD, November 6-8, 141-155.
Allan, J., Carbonell, J., Doddington, G., Yamron, J. and Yang, Y. (1998). Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. February, 194-218.
Boykin, S. and Merlino, A. (1999). Improving broadcast news segmentation processing. IEEE International Conference on Multimedia and Computing Systems. Florence, Italy, June 7-11.
DARPA Broadcast News Workshop. (1999). February 28-March 3, Herndon, VA.
Harman, D. (1998). The text retrieval conferences (TRECs) and the cross-language track. In Rubio, A., Gallardo, N., Castro, R. and Tejadaeck, A. (Eds.), Proceedings of the First International Conference on Language Resources and Evaluation, 517-522. European Language Resources Association, Granada, Spain.
Hirschman, L. (1998). The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language, 12, 281-305.
Hyland, R., Clifton, C. and Holland, R. (1999). GeoNODE: Visualizing news in geospatial context. AFCEA Federal Data Mining Symposium. Washington, DC.
Kubala, F., Colbath, S., Liu, D., Srivastava, A. and Makhoul, J. (2000). Integrated technologies for indexing spoken language. In Maybury, M. (Ed.), Special Section on News On Demand, Communications of the ACM, 43(2), 48-56.
Maybury, M. T. (Ed.). (1993). Intelligent Multimedia Interfaces. Menlo Park: AAAI/MIT Press. Retrieved on the World Wide Web: http://www.aaai.org:80/Press/Books/Maybury1.
Maybury, M. T. (1993). Planning multimedia explanations using communicative acts. In Intelligent Multimedia Interfaces, Cambridge, MA: AAAI/MIT Press, 60-74.
Maybury, M. T. (Ed.). (1997). Intelligent Multimedia Information Retrieval. Menlo Park: AAAI/MIT Press. Retrieved on the World Wide Web: http://www.aaai.org/Press/Books/Maybury-2.
Maybury, M., Merlino, A. and Morey, D. (1997). Broadcast news navigation using story segments. ACM International Multimedia Conference, Seattle, WA, November 8-14, 381-391.
Maybury, M. T. and Wahlster, W. (Eds.). (1998). Readings in Intelligent User Interfaces. Menlo Park, CA: Morgan Kaufmann.
Merlino, A. and Maybury, M. (1999). An empirical study of the optimal presentation of multimedia summaries of broadcast news. In Mani, I. and Maybury, M. (Eds.), Automated Text Summarization. Cambridge, MA: MIT Press.
Palmer, D., Burger, J. and Ostendorf, M. (1999). Information extraction from broadcast news speech data. DARPA Broadcast News Workshop, February 28-March 3, Herndon, VA.
Wayne, C. (1998). Topic detection and tracking (TDT): Overview and perspective. DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, Lansdowne, VA. Retrieved on the World Wide Web: http://www.nist.gov/speech/tdt98/tdt98.htm.
Chapter VI
A CSCW with Reduced Bandwidth Requirements Based on a Distributed Processing Discipline Enhanced for Medical Purposes
Iraklis Kamilatos and Michael G. Strintzis
Informatics and Telematics Institute, Greece
Medical teleconsultation with modern high-power workstations can be implemented using distributed computing systems. This chapter presents an open telecooperation architecture for such a system. The resulting medical Computer Supported Cooperative Work (CSCW) tool is evaluated experimentally.
INTRODUCTION
A meeting of scientists from various disciplines sharing a common interest in "how people work" was organized in 1985 by Paul Cashman and Irene Greif. The objective of the meeting was to understand how technology could support such work. During the meeting the term "Computer Supported Cooperative Work" (CSCW) was defined for the first time. Following this meeting a number of researchers and developers have shown interest in the subject. The acronym has been criticized as being too long, and it may be confused with CSC, which stands for Computer-Supported Collaboration. Another criticism was that it states an aim and not
the reality, through its use of the term "cooperative." Alternative names are in use, such as workgroup computing and groupware; these terms move the focus from the group activity to the technical approach and are restricted to the description of small organisational units. The new systems needed to be tuned to the way that people work and interact in a group or in an organization, and to the effects that technology could have on these interactions. Although these ideas were seen as new at the time, they were not brand new, as they had been echoed for some time by engineers and people who worked in the Management Information Systems field in their attempts to improve the success rates of large systems development. However, CSCW started from the opposite end from that of "Office Automation," by looking at the needs of the system users based on the knowledge that social psychologists, anthropologists, organisational theorists, educators, economists and anyone else could provide on the understanding of group activities. The applications that can be contained in the category of CSCW are desktop conferencing systems, collaborative authorship applications, electronic mail with its refinements and extensions, and electronic meeting rooms or group support systems. Other applications include distance learning, workflow management, concurrent engineering, Computer-Assisted Software Engineering (CASE), Computer-Assisted Design/Computer-Assisted Manufacturing (CAD/CAM), real-time network conferences and medical teleconferencing.
Medical teleconferencing requires the electronic transmission of medical images from one location to another for the purposes of interpretation and/or consultation. Users in different locations may simultaneously view the images. This allows more timely interpretation of the medical images, gives greater access to secondary consultations and improves continuing education. Appropriately utilized, medical teleconferencing can improve access to quality medical interpretations and thus significantly improve patient care. There are a number of commercially available packages for CSCW, such as Intel® ProShare™ and SGI® InPerson™. However, these packages cannot use 12-bit images and they do not support image files conforming to the ACR/NEMA and DICOM standards for medical image processing. Also, their image processing functionality is limited, while a teleradiological package should have a wide selection of image processing tools. Ideally, the CSCW program should be directly connected to the medical modality in order to acquire the data directly, without operator intervention. This would improve the accuracy, the security and the integrity of the data. Medical telecooperation or teleconsultation software packages often assume that all necessary bandwidth can be allocated to each session according to its needs. In this chapter, we present a telecooperation package that drastically reduces the bandwidth demand for message exchange between stations by minimizing the volume of messages exchanged, based on the design proposed in Singh, Gupta and Levoy (1994). The telecooperation system can be viewed as a distributed computing system, in which a collection of processors functions as a single unit feeding information to the visualization and interaction systems. This approach takes advantage of the local resources at the end points and is in line with the current trend towards virtual parallel systems based on multiprocessor systems.
The developed program acts as a manager between the two end stations and, based on the message passing model, exchanges the minimum number of message primitives required for the systems to remain synchronized. An internal platform uses memory and process replication to minimize the volume of messages exchanged. Process replication also permits the use of local resources such as scanners and graphics cards, and may further be used to parallelize the operation on a local level.
The resulting CSCW system, which we named TeleWorks, was tailored to medical applications as described in Kopsacheilis, Kamilatos, Strintzis and Makris (1997) and allows users in physically separated places to interact with the medical data in a visual environment. The interaction includes image processing, pointing and annotation of the images. In this chapter the emphasis is on the full description of the distributed nature of the telecooperation package, as well as the message passing mechanism used to achieve the best results. Locally distributed systems are increasingly useful as LAN speed and computational power increase. They provide a means for constructing virtual computers for special purposes, including graphics processors and vector processors (Kim & Purtilo, 1994), with increased speed of computation. Such systems may be used as intermediate layers between the user interface and the network, enabling real-time telecooperation.
THE CLIENT-SERVER MODEL
A distributed system consists of a collection of processors connected via a communication network. The network is subject to latency and bandwidth restrictions. A distributed algorithm can be viewed as a sequence of local computations interleaved with communication steps, so that computation and communication overlap. In our work a replicated shared memory is used for storing the prefetched data. The parallelism in depth, seen in Figure 1, is modified by moving the end server to the client, as in Figure 2, and can be viewed as a point-to-point connection in a client-server configuration.

Figure 1: The parallelism in depth

Figure 2: The modified parallelism of the distributed system in depth

Each server of the configuration acts as an automatic answering machine when a remote client contacts it. As long as the server is active, a call can be placed to the remote machine. Since the software is destined for medical use, extra precautions are taken to ensure that the data travelling across the network are encrypted. Each of these client-servers can be viewed as part of a distributed unit, shown in Figure 3. This distribution is implemented by a bus connection of PCs or workstations on a LAN or a WAN, as the application uses the TCP/IP protocol to exchange data. This protocol was preferred to others (notably UDP) because of its fast response and its reduced-size data packets. The distributed system can be further
expanded dynamically to more stations, such as additional interaction systems, to enable more users to cooperate remotely. The commands entered at one end of the distributed system are processed by the system, routed appropriately between the clients and servers, and surface as results at the visualization points of the system. Inputs to the distributed system, such as heart monitoring units, can also be incorporated for data acquisition. Each of the client-server units shown in Figure 2 contains the managers of the system. These managers handle the messages and filter any unnecessary information transmitted from the network to the interaction system. The interaction system includes the interface (2D windows, data graphs, pointing devices), the session manager, the system queue manager, the group manager and the system window-tree manager. The first three managers create the distributed interactive visualization platform on which the system window-tree manager and the user interface rest. The interaction system is synchronized using small messages between the end stations, and the resulting bandwidth needed is extremely low.
THE PC TELEWORKS ARCHITECTURE
The internal architecture of TeleWorks is shown in Figure 4. TeleWorks by its nature provides both the server and the client functions in a single software package. The server function is the answering system of TeleWorks, responding to incoming calls and handling and re-routing these calls for a group of users. The client function is the component of the system handling the communication between the remote sites. Each interaction system has a number of windows showing the information needed for teleconsultation. Each window on the end-user side may communicate with the corresponding remote system window and may also view its own data. The communication layer was designed so that it can be easily ported to different hardware platforms. TeleWorks versions were created for PC (MS-Windows '95™, MS-Windows NT™), Silicon Graphics (IRIX™) and Sun (Solaris™). The message passing approach used was based on the X-Windows philosophy. Each unit of the distributed system consists of the session manager, the system queue manager, the group manager and a system window-tree manager, as shown in Figure 4. The first three of these managers constitute the "Interaction System."
Figure 3: The distributed systems with the end points of interaction
USER INTERFACE
PC TeleWorks uses the user-friendly interface of Windows '95. For the Unix version of the program, the X-Windows Motif environment is used. In practical experimental validation, the simplicity of the user interface proved to be almost as valuable as the actual visualization functionality of the program (Makris, Kamilatos, Kopsacheilis & Strintzis, 1998). The tools provided for image manipulation and analysis were developed in C++ so as to allow an object-oriented implementation of the layers that separate the complicated distributed system. In order to create a homogeneous environment suitable for several medical data types, a main window is created with child windows in it. These child windows may contain images and dialogue boxes. The latter may contain text that indicates the condition of the system and information or commentary on the image displayed. A layered approach has been adopted for the system implementation, as shown in Table 1 and elaborated in the sequel. A typical screen of a session, shown in Figure 6, incorporates various sources of information including an MRI image, an angiography video, an ECG and a dialogue box. A password is needed for the secure connection between the end points, as shown in Figure 5.

Figure 4: The TeleWorks operational model
Table 1: TeleWorks layers diagram

Window-Tree Manager
Queue Manager
Group Manager
Session Manager
⇐ Network ⇒
The graphical interface, based on the description in Makris, Kamilatos, Kopsacheilis and Strintzis (1998), was implemented using the Microsoft Foundation Classes (MFC) on PC platforms and the Motif environment on Unix platforms. Commands can be entered through the toolbars at the top of the main window, through the main window menu or, for a specific window, through the popup menu accessed with a right-button click. The physician can access the images by using the custom "Open file" dialogue, which is placed under the "File" menu. The "Save" operation permits the storage of the image currently in view. TeleWorks stores the annotation data and the image processing data in image layers separate from the original image data, which remains unaltered. When in session, the remote site receives a message naming the file that was opened. At this point the physician is ready to manipulate and analyze the image. A range of tools is provided to the user for image manipulation (zooming and panning), image data windowing and levelling (brightness and contrast) and image analysis (distance and angle measurement). Image grayscale inversion can be performed to highlight details which are hidden in the original image. For video data, possible operations include rewind, forward and play, as well as a scrolling bar that shows the position of the frame in the sequence. Through this scroll bar the user can access an intermediate image of the video sequence directly, without forwarding or rewinding the entire sequence. If a conference between a physician and a radiologist is to be arranged, the images and other patient files are transferred before the meeting to the physician's remote unit through the built-in transfer protocol, which includes encryption and line/data recovery facilities. The radiologist starts the teleconsultation by activating the TeleWorks program and leaves the system to act as an answering machine. The physician, when ready, accesses an address book, as seen in Figure 7, and places the call. The remote user in turn accepts the connection, and the program proceeds with a secure key exchange mechanism that leads to a securely connected session (Figure 8). A "Secure" label remains present in the main window status area as long as the encryption remains in effect. Instead of a username and a password, a smartcard may be supplied, storing an identification number that can be used by the system for user identification. The smartcard will also contain an address book, which will be automatically updated from the system's address database whenever the user enters the system.
SESSION MANAGER

Figure 5: The username and password part of the security features used by the program

Figure 6: A typical TeleWorks window screen

Figure 7: The TeleWorks connections address book

Figure 8: Dialogue stating that a secure connection has been established with the stated remote end

On PC platforms, the low-end message passing mechanism of the Microsoft Windows '95 built-in network "WinSock" extension was used to build a global system for message
passing functions. For Unix platforms the built-in network functions provided by the operating system were used, as they are equivalent to the low-end "WinSock" instructions. This global system can handle a number of message passing primitives, parse them appropriately and forward them to the appropriate manager of the system. The parser is included in the session manager, as are the security handling functions. The higher layers of the system receive commands from this layer. All the other managers only pass simple instructions to the session manager, which defines all the communication parameters of the system. To the rest of the managers, the session manager can be viewed as the equivalent of the Remote Procedure Call (RPC) interface that the compilers of the RPC language provide. This message exchange between the managers is transparent to the end users, with the exception of the cases where a serious error occurs, such as a link break. In such cases, the end user is informed of the malfunction of the link. The encryption of the data also takes place at this manager's level, just before the data enters the network. Another function of this manager is file transfer. The file transfer function applies compression (Kamilatos & Strintzis, 2000) prior to the file transmission and decompression after the file arrival. The distribution of the information by the session manager is simple. The distribution of the task is achieved as the CPU of each of the end points performs the task instructed by the peer station, and the produced results are displayed on that end point's interaction system. A checksum is produced at a final synchronization point that both systems reach at the end of the execution of a command, confirming the success of the distribution. This approach has the trademarks of a distributed system:
• It sends out a command to a remote system.
• The CPU at the remote end performs a function.
• A confirmation of the result is returned to the initial station.
• The results are distributed over LAN and WAN.
Thus, a telecooperation system can be seen as a distributed system. As the session manager relies only on the TCP/IP protocol and is not tightly coupled to the network media, multiple low-bandwidth links can also lead to enhanced performance. The telecooperation program, along with a video conference system, is integrated in the TeleWorks medical CSCW system.
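The "answering machine" behaviour of the server side reduces, in essence, to a TCP accept loop that hands each incoming call to a handler. The sketch below illustrates the idea in Python rather than the C++/WinSock code of TeleWorks itself; the port number and the handler are hypothetical:

```python
import socket
import threading

def answering_machine(handler, host="0.0.0.0", port=5060):
    # Listen for incoming TeleWorks calls; each accepted connection is a
    # session, served on its own thread so the interface stays responsive.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    while True:
        conn, peer = srv.accept()   # 'pick up' the incoming call
        threading.Thread(target=handler, args=(conn, peer),
                         daemon=True).start()
```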
GROUP MANAGER
The purpose of the group manager is to receive messages from the session manager and activate the appropriate functions within the system in order to achieve a higher telecooperation transparency. Each window is registered with the group manager at the other end, and both remain active as long as the session lasts. The group manager keeps a list of the active connections and forwards each message either to one specified end user or to all end users in the case of a multicast conference. The group manager also informs each user of the composition of the telecooperating group of users and may forward messages addressed personally to one participant rather than to the whole group. For the Microsoft Windows version, the Message Pipeline was used in order to avoid recursive calls and eventual blockage of the system.
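The forwarding logic amounts to a small routing table from user IDs to open connections. A minimal sketch (not the MFC Message Pipeline implementation), with all names hypothetical:

```python
class GroupManager:
    # Routes messages either to one named participant or to the whole group.
    def __init__(self):
        self.members = {}                  # user-id -> open connection

    def join(self, user_id, conn):
        self.members[user_id] = conn       # register a new participant

    def forward(self, msg, to=None):
        if to is not None:
            self.members[to].sendall(msg)  # personal message
        else:
            for conn in self.members.values():
                conn.sendall(msg)          # multicast to the conference
```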
QUEUE MANAGER
The system allocates the needed resources according to its requirements. The queue manager oversees the local queue of messages as well as the departure of messages for the remote units, aiming to keep the system balanced and to avoid instability or non-real-time performance. Window raster sharing is commonplace in teleconferencing tools. However, this method requires the needless transmission of an often very large amount of data. By contrast, TeleWorks only transmits the description of the image-processing tool that is to be used, and the processing itself is done almost simultaneously at both ends of the distributed system. Similarly, the queue manager also makes sure that only the beginning and the end coordinates of an annotation are transmitted, and not its intermediate stages, which are filtered out. The bandwidth required to transmit these end coordinates is minuscule.
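The annotation filtering can be pictured as a tiny state machine around the pointer events: intermediate drag positions update only the local view, and a single annotation message leaves the queue when the stroke ends. A sketch under those assumptions, with hypothetical names:

```python
class AnnotationFilter:
    # Only the start and end points of a stroke are queued for transmission;
    # every intermediate drag position is consumed locally.
    def __init__(self, send_annotation):
        self.send_annotation = send_annotation
        self.start = None
        self.last = None

    def on_press(self, x, y):
        self.start = self.last = (x, y)

    def on_drag(self, x, y):
        self.last = (x, y)          # redraw locally; nothing is transmitted

    def on_release(self):
        (x1, y1), (x2, y2) = self.start, self.last
        self.send_annotation(x1, y1, x2, y2)   # one small message per stroke
```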
WINDOW-TREE MANAGER
The system window-tree manager is the memory-resident database that represents the windows/objects that are available. This database is sized according to the current settings of the conference. Each window internally has a local identification number (LIN) that corresponds to a remote identification number (RIN). When a message is received, the RIN is translated to the LIN and the instruction carried by the message is conveyed to the appropriate local window. No LIN-to-RIN translation is needed when a message originates from the local unit and is destined for the network. The database of the active windows also includes the name of the file where the displayed data are stored, and an extra window layer protects the data from the user's manipulations and annotations. This extra window layer is transparent to the user and can be made visible when the file format of the saved data does not support layers of image information. An annotation layer also exists and is kept in constant synchronization with the image data. The window-tree manager also keeps a 32-bit cyclic redundancy checksum (CRC), which corresponds to the data currently displayed on the screen and is checked at various points of the CSCW operation in order to verify that all the users view the same image data.
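The ID translation and the consistency check can be sketched in a few lines. Here `zlib.crc32` stands in for whatever 32-bit CRC TeleWorks actually computes, and the window object is hypothetical:

```python
import zlib

class WindowTreeManager:
    def __init__(self):
        self.rin_to_lin = {}    # remote window ID -> local window ID
        self.windows = {}       # LIN -> window object holding raw pixel bytes

    def dispatch(self, rin, instruction):
        # Incoming messages carry the sender's ID; translate before delivery.
        lin = self.rin_to_lin[rin]
        self.windows[lin].apply(instruction)

    def view_checksum(self, lin):
        # 32-bit CRC of the currently displayed data; both ends compare it
        # to verify that all users view the same image.
        return zlib.crc32(self.windows[lin].pixels) & 0xFFFFFFFF
```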
MESSAGE PRIMITIVES
Each message that is to be exchanged is encapsulated within other messages that are destined to the layer below it in the stack structure of the TeleWorks software (see Table 1). Precisely as in the layered structure of the Open Systems Interconnection (OSI) model, each layer communicates with the opposite layer across the network using messages that pass through the layers below it. This scheme maintains the transparency of the process and guarantees that the message reaches the proper destination layer.
MESSAGE STRUCTURE
The session manager forwards the messages to their proper destination. Another role of the session manager is to verify the origin of each message through the IP address stamped on the packet.
The messages can be broadly divided into three families:
• Annotation
• Image processing
• Data exchange/general purpose
Both the annotation and the image processing messages carry data for the window for which they are destined. The image processing messages normally contain only one parameter, describing the value of the function that has been requested. The annotation messages contain the coordinates of the starting and ending points, which are of the form (x1,y1,x2,y2); i.e., the annotation message has the form:

MessageID | WindowID | X1 | Y1 | X2 | Y2
while each image processing message is of the form:

MessageID | WindowID | Parameter
To minimize communication errors, all messages use integer values as parameters or coordinate points; float values are converted to integer values for the communication with the remote image processing system. The third type of message listed has two parts, the message identification (MessageID) and the actual data (Data) to be transferred, in the form:

MessageID | Data
This type of message is used in all cases of variable-length data, such as a text message or an image transfer. These messages do not address a specific window but rather the whole of the TeleWorks system environment: the destination of an image transfer is a file on the storage disk, and the destination of a text message is a dialogue box displayed on top of any images currently in view. The small size of all the above packets results in very rapid data delivery and virtually real-time operation across the data link. Their small size (typically fewer than 48 bytes) also makes the system compatible with ATM cells, since an ATM cell carries up to 48 bytes of useful data within its 53-byte total. The security subsystem of TeleWorks provides the following methods and procedures to support confidentiality and data integrity, according to the recommendation of WG6 of CEN/TC251 (Makris, Argiriou & Strintzis, 1997):
• All users are authenticated to the network.
• There exist procedures to check messages for completeness/accuracy.
• All messages carry identifiers showing the source user-id and terminal.
• Passwords transmitted over communication lines, external or internal to the site, are encrypted.
• The system uses encryption on a node-to-node basis.
• There are key management and distribution procedures for encryption.
• Procedures for start-up and close-down of encryption circuits are also included.
The recommendations of the European project SEISMED (Secure Environment for Information Systems in MEDicine) (Sanders, Furnell & Warren, 1996) on cryptographic systems and key management were also followed. The encryption methods supported are DES and IDEA. Message authentication codes, computed by hashing (MD5), are appended to the data prior to encryption. This creates a slight overhead but is necessary to ensure data integrity.
ANNOTATION MESSAGES
A major innovation introduced in TeleWorks is the use of vector rather than raster graphics to transmit the annotation messages. Thus, only the type of annotation and the coordinates of its beginning and end need to be transmitted. This can be compared to the early generation of Computer-Aided Design/Computer-Aided Engineering (CAD/CAE) systems, where strict limits existed on the available processing power and memory; in data communication, the limitation on memory size may be seen as equivalent to a limitation on the volume of message transfers. In general, the coordinates of the annotation lie inside the viewing-area window (Figure 9). Thus, in theory, either the absolute screen coordinates (Origin A in Figure 9) or the coordinates relative to the viewing-area window (Origin B in Figure 9) may be used. However, the second option is reliable only under perfect synchronization of the two ends, while the first option requires the transmission of quite large ranges of numbers. In TeleWorks this conflict was resolved as a result of the users' insistence that the position of the viewing-area windows within the screen be the same at both ends. Thus, the position of Origin B of the viewing-area window is always known at both ends, and therefore the parsimonious transmission of the annotation coordinates relative to Origin B is possible.
Figure 9: Diagram showing the coordinate origins considered by the proposed annotation system
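To make this concrete, the conversion between screen-absolute and window-relative coordinates amounts to a pair of subtractions. The sketch below uses hypothetical names and assumes, as stated above, that the window's top-left corner sits at the same screen position at both ends.

struct Point { int x; int y; };

// Convert an annotation endpoint from absolute screen coordinates
// (measured from Origin A) to coordinates relative to the viewing-area
// window (measured from Origin B), and back. windowPos is the screen
// position of the window's top-left corner, identical at both ends.
Point toRelative(Point absolute, Point windowPos) {
    return { absolute.x - windowPos.x, absolute.y - windowPos.y };
}

Point toAbsolute(Point relative, Point windowPos) {
    return { relative.x + windowPos.x, relative.y + windowPos.y };
}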
The transmission of "texture" with vector graphics is possible and useful, e.g., for varying the type of the text characters of the annotation. The resulting annotation has the following format:

AnnotationID | WindowID | X1 | Y1 | X2 | Y2 | Width | TextureID
The width of the annotation line object can be assigned with the minimal addition of a single extra byte, while texturing can be achieved by adding another two bytes, making a range of 65,536 different textures available to the user of the system. The same annotation system may be used to mark a sequence (cine) of two-dimensional images.
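One plausible wire layout for such an annotation message is sketched below; beyond the extra width byte and two texture bytes stated above, the field widths are assumptions, since the chapter specifies only the field order.

#include <cstdint>

#pragma pack(push, 1)           // no padding: this mirrors an on-the-wire layout
struct AnnotationMessage {
    uint16_t annotationId;      // type of annotation (line, text, ...)
    uint16_t windowId;          // destination window (a RIN on the wire)
    int16_t  x1, y1;            // start point, relative to Origin B
    int16_t  x2, y2;            // end point, relative to Origin B
    uint8_t  width;             // the single extra byte for line width
    uint16_t textureId;         // the two extra bytes: 65,536 textures
};
#pragma pack(pop)
// sizeof(AnnotationMessage) == 15 bytes here, comfortably inside the
// 48-byte payload of a single ATM cell, as noted earlier.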
THE DATABASE
In order to store the data that are to be used by the TeleWorks system, a database was created. It is a hierarchical database mirroring the structure of a hospital. At the top of the structure, the database holds basic information about the hospital. The hospital has departments, and each department is staffed by a number of people. Patients are admitted to a department of the hospital and are observed by a member of the staff. Each patient has a set of examinations that are kept on record. The key tables, shown in bold in Figure 10, are interrelated so as to keep a full record of a patient's medical examinations, the departments involved, and the medical personnel taking part in the examination and hospitalization of the patient. Each member of the hospital staff is treated as a user of the database, and security data are associated with the username that is given. Thus the data kept in this database are protected by the security measures appropriate to medical data. These measures fall into three categories: the level of access granted to the user,
Figure 10: The TeleWorks database structure
the patients that a user can access, and the information that a user can exchange with another user during a TeleWorks session. Special attention has been given to controlling which data can and cannot be exchanged during a session. To achieve this, sub-tables are created "on the fly," gathering the data that can be exchanged during a session and disallowing access to the data that cannot be sent over the session. To make this more flexible, a gateway allows the user to include extra information in the data-exchange structure if necessary; these extra data are valid for exchange only during one session, after which the original data-exchange structure is restored. Another table allows bi-directional translation between the DICOM standard, used by a number of modalities, and the structures that the database stores and the TeleWorks system is able to utilize. This has proven very useful, as it can be easily modified and allows other standards to be included. One such standard has already been included: it allows the interface of an HP intensive-care-unit monitor to exchange information with the database through an appropriate module that is built into certain versions of the TeleWorks system. A record of the session, in terms of pointers to the data exchanged and the data exchanged dynamically during the session, is kept for future reference together with the discussed patient data; this part of the database can be seen as a recorder of the session. Another session record is also kept, which states the date and time of the session and its participants, together with a sequence of pointers to the patients whose data were exchanged during the session.
RESULTS
TeleWorks can operate on a true-color graphics card and is compatible with the DICOM standard, and hence has the ability to use data such as the patient name, identification number, date, type of examination and type of image. The system fulfills the viewing requirements in Makris, Kamilatos, Kopsacheilis and Strintzis (1998), including the support of:
• An ergonomic user interface supporting both inexperienced beginners and skilled experts who use the system in their daily routine.
• Data and function synchronization during teleconferences. The communication partner's cursor is also visible on the screen, and both users have full access to all viewing functions.
• Advanced viewing/reviewing functionality, including image analysis and annotation with graphics and text.
• Interactive basic image manipulation functions (level/window functions, magnification, inversion of gray values).
• Display of two-dimensional images as cine sequences.
• Availability of the maximum possible screen space for the images, by providing movable toolbars for the user to alter the image size.
The system is easily customized to the needs of the specific user. The software is built using modules that can be added or removed according to those needs; for example, modules such as 3D reconstruction can be added as extensions to the existing system. The system also has the following desirable features:
• Due to its modular construction, it can be easily translated to other languages, making it friendlier to local users.
• It may be implemented on a variety of hardware platforms with heterogeneous processing capabilities. Versions of the system exist for many operating systems, including Unix and Windows NT.
• The user has the ability to customize the software interface up to a point. Also, some external programs, including the teleconferencing package, can be started from within TeleWorks as plug-ins to the main program.
For clinical validation, a first group of ISDN trials was carried out from March 1997 to August 1997. A second group of ISDN trials, initiated in January 1998, lasted until December 1999. The following hospitals participated in the international trials:
• University Hospital Duesseldorf
• Queen Elizabeth Hospital Birmingham
• AHEPA University Hospital Thessaloniki
• University Hospital Erasme Brussels
In parallel, locally developed ISDN networks interconnected smaller hospitals and large university hospitals for remote expert consultation. Figure 11 depicts the network at Thessaloniki, which was used to interconnect the small PANAGIA hospital and the AHEPA University Hospital. The participants in this network were the Aristotle University of Thessaloniki (AUTH), the PANAGIA radiology laboratory and the AHEPA University Hospital. PANAGIA is a small hospital located in a suburb of Thessaloniki, and AHEPA is a large university hospital in the city center. The contributions of the participating organizations were as follows:
• AUTH provided the other units with technical support, participated in tests and debugging procedures, consulted on system adjustments, performed the necessary software installations and trained the clinical users.
• The Radiology Laboratory of PANAGIA produced computer tomography (CT) scans on film, which were digitized by a high-quality film scanner and transferred via the ISDN connection to the PANAGIA Department of Internal Medicine for examination; the TeleWorks application was used for these transfers. CT or magnetic resonance (MRI) images on film were brought to the PANAGIA Department of Internal Medicine for examination, converted to digital form by means of the high-quality film scanner, and then locally stored or used for diagnosis. For cases requiring extra advice, the digital images were forwarded to the AHEPA Department of Neurosurgery via the ISDN connection for interactive consultation using TeleWorks. Teleconsultation based on TeleWorks proceeded in two phases: prefetching of the images in deferred time and discussion of the patient case in real time.
International trials were also conducted, connecting Thessaloniki and the test sites in Duesseldorf, Brussels and Birmingham. The trials consisted of regular (two to three times per week) MRI-based teleconsultations between the teams of experts at the University Hospital Erasme in Brussels (led by Professor Baleriaux) and the less experienced neurological and neurosurgical units at the University Hospital Duesseldorf, the Queen Elizabeth Hospital and the AHEPA University Hospital. Figure 12 depicts the topology of these trials, which was based on the European ISDN infrastructure.
PERFORMANCE DATA
The reduced bandwidth requirement of the proposed system allows the integration of the telecooperation application with other data streams, such as those produced in a teleconference. The resulting integrated CSCW system was tested successfully over EuroISDN lines with
Figure 11: The network at Thessaloniki that was used for the ISDN trials
Figure 12: Topology of the international trials performed between the four test sites
a total of 64 Kbits/sec (8 Kbytes/sec) for one-way communication. The telecooperation tool requires approximately 48 bytes per packet. Thus, if one packet is generated every second, due to the movement of the mouse for example, a data rate of 48 bytes/sec is needed. In the worst possible case, where both users are transmitting data to each other over an ISDN data link with a capacity of 16 Kbytes/sec, a rate of 48 × 2 = 96 bytes/sec is required. This leaves the system with a capacity of (16 × 1024) − 96 = 16,288 bytes/sec ≈ 15.9 Kbytes/sec, or about 127.25 Kbits/sec, free for use by the teleconferencing component.
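Laid out step by step, the worst-case budget is:

    link capacity      : 16 Kbytes/sec = 16 × 1024      = 16,384 bytes/sec
    CSCW traffic       : 48 bytes/packet × 2 directions =     96 bytes/sec
    remaining capacity : 16,384 − 96                    = 16,288 bytes/sec
                         ≈ 15.9 Kbytes/sec ≈ 127.25 Kbits/sec for teleconferencing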
CONCLUSIONS
A medical CSCW tool for very-low-bandwidth channels was described. This tool combines an off-the-shelf video conference component with a sophisticated telecooperation program that minimizes the volume of message exchanges between the two end stations. The tool is DICOM compatible and has been designed for use on PC and Unix platforms running the Windows NT, Windows 95, Solaris and IRIX operating systems. The CSCW tool was successfully validated, operating well in repeated international medical teleconsultation sessions. Due to its low bandwidth requirement, the tool has also been considered for purposes other than medical imaging, such as the restoration of art objects (e.g., paintings) over long distances using ISDN technology as the intermediate carrier.
ACKNOWLEDGMENTS
This work was supported by the European Union in the framework of the projects SAMMIE2 (Telematics for Health Project HC1044) and HIM3 (TEN-IBC Project B3014), and by the Greek Secretariat for Research and Technology in the framework of the Greek National Project IHIS (Integrated Hospital Information System) and PABE.
REFERENCES
Cabral Jr., J. E. and Kim, Y. (1996). Multimedia systems for telemedicine and their communications requirements. IEEE Communications Magazine, July, 16, 21-27.
Kamilatos, I. and Strintzis, M. G. (2000). A CSCW tool for medical purposes with reduced bandwidth requirements. Computer Supported Cooperative Work (CSCW), Kluwer Academic Publishers, submitted for publication.
Kopsacheilis, V., Kamilatos, I., Strintzis, M. G. and Makris, L. (1997). Design of CSCW applications for medical teleconsultation and remote diagnosis support. Medical Informatics Journal, 2, 121-132.
Makris, L., Argiriou, N. and Strintzis, M. G. (1997). Network access and data security design for telemedicine applications. Medical Informatics Journal, 22, 133-142.
Makris, L., Kamilatos, I., Kopsacheilis, E. V. and Strintzis, M. G. (1998). Teleworks: A CSCW application for remote medical diagnosis support and teleconsultation. IEEE Trans. on Information Technology in Biomedicine, June, 62-74.
Sanders, P. W., Furnell, S. M. and Warren, M. J. (1996). Baseline security guidelines for health care IT and security personnel. In Data Security in Health Care-Volume 2, Technical Guidelines. The SEISMED Consortium (Eds.), Technology and Informatics 32, IOS Press: 189-234.
Singh, J. P., Gupta, A. and Levoy, M. (1994). Parallel visualization algorithms: Performance and architectural implications. IEEE Computer, July, 45-55.
Chapter VII
A Layered Multimedia Presentation Database for Distance Learning Timothy K. Shih Tamkang University, Taiwan
Multimedia presentations are suitable for instruction delivery. In a distance learning environment, multimedia presentations are lecture materials to be broadcast among a number of workstations connected by networks. In order to manage these course materials efficiently, a multimedia database management system (MDBMS) is essential. We propose an MDBMS which has five layers. Attributes of the elements in each layer, as well as database operations, are discussed. The system supports storage sharing and object reuse. The system is implemented on Windows 98 with support from a conventional database management system. We also present an instruction-on-demand system, which is an application of the underlying MDBMS. The instruction-on-demand system is used in the realization of several computer science-related courses in our university.
INTRODUCTION
Multimedia computing and networking change the way people interact with computers. In line with new multimedia hardware technologies, as well as well-engineered multimedia software, multimedia computers with the assistance of the Internet are changing our society into a distanceless and colorful global community. Yet, even as these visions are gradually being realized, many technical problems remain to be solved. This chapter summarizes state-of-the-art research topics in multimedia databases and addresses the problems from the perspective of multimedia applications. Theoretical details are omitted from the discussion, not because they lack importance, but to avoid tediousness. A list of carefully selected references serves as suggested reading for those participating in the research of this new territory.
In order to support the production of multimedia applications, the management of multimedia resources (e.g., video clips, pictures, sound files) is important. For instance, multimedia presentations can be designed as building blocks, which can be reused. To facilitate multimedia application design, many articles indicate the need for a multimedia database (Chen, Wu & Shen, 1994; Paul et al., 1994; Rody & Karmouch, 1995; Yoshitaka et al., 1994; Johnson, 1999; Kaji & Uehara, 2000; Ozsu, 1999). A multimedia database is different from a traditional relational database in that the former is object-oriented while the latter relies on entity relations. Moreover, a multimedia database needs to support binary resource types of large and variable sizes. Due to the amount of binary information that needs to be processed, the performance requirements of a multimedia database are high. Clustering and indexing mechanisms that support multimedia databases are thus important. Discussions of research issues in multimedia database management systems can be found in Paul et al. (1994), Johnson (1999), Kaji and Uehara (2000), and Ozsu (1999). A distributed database supporting the development of multimedia applications is introduced in Chen, Wu and Shen (1994). A mechanism for formal specification and modeling of multimedia object composition is found in Little and Ghafoor (1990), which also considers the temporal properties of multimedia resources. A database system for video objects is discussed in Lin, Chang and Lee (1994). A content-based querying mechanism for retrieving images is given in Yoshitaka et al. (1994). Layered multimedia data modeling (Schloss & Wynblatt, 1995) suggests a mechanism to manage multimedia data. In addition to the general discussion on multimedia database management systems (MDBMSs), other articles take an approach similar to ours. The work discussed in Chen et al. (1995) proposes a multimedia data model and a database to support hypermedia presentations and the management of video objects. Its specialized video server with an incremental retrieval method supports VCR-like functions for heterogeneous video clips. Its multimedia DBMS is designed from scratch, which is similar to our approach, and the system also supports object composition/decomposition. However, no specific reuse mechanism is emphasized in the discussion; only an object-oriented data model was proposed. The system also provides a global data-sharing mechanism, including a video tool and an image collaboration tool, which are integrated with a distributed environment. A multimedia database for news-on-demand applications is proposed in Ozsu et al. (1995). That database follows international standards, such as SGML (Standard Generalized Mark-Up Language) and HyTime (Hypermedia/Time-Based Structural Language). Its Visual Query Interface supports presentation, navigation and querying. A multimedia-type system, especially useful for structured text and presentation information, is also proposed. This database takes an object-oriented approach, which we use as well. Similar to their standardization approach, we follow the standard multimedia file formats by Microsoft, which are used worldwide. The work discussed in Ozsu et al. (1995) has a multimedia-type system, which provides a limited object composition mechanism. However, as with the work discussed in Chen et al. (1995), no explicit reuse mechanism is provided.
The research in Chen, Wu and Shen (1994) uses an object-oriented approach to design a client-server database environment and a multimedia class library to support multimedia applications. Its graphical object editor based on OCPN (Object Composition Petri Net) allows scheduling and composing of multimedia objects. The implementation uses the Raima Data Manager/Object Manager (a database system) for its storage model, which is similar to our early approach. However, we later redesigned the system from scratch for reasons of extensibility, as discussed below.
CORBA (Common Object Request Broker Architecture) is a specification and architecture standard supported by the OMG (Object Management Group). CORBA aims at a distributed open architecture and specification through an object-oriented client-server structure that achieves interoperability among independently developed applications. Based on the CORBA Object Broker standard, the work discussed in Thimm et al. (1996) proposes a database component of the VORTEL project. The implementation relies on the MOSS (Media Object Storage System) and AMOS (Active Media Object Store) projects for underlying support. The distribution of the database structure and the interoperability are based on DEC's implementation of the CORBA-compliant Object Request Broker. Another database design following CORBA is discussed in Ricarte and Tobar (1996). The Multiware database, one portion of the Multiware Platform project, takes an object-oriented approach and adopts an object database standard (ODMG-93) announced by the ODMG (Object Database Management Group). Using specialized servers for each media type, the prototype uses the Orbix implementation of CORBA and the ObjectStore object-oriented database system. This chapter is organized as follows: first, we discuss several important issues of multimedia database management systems. A distributed instruction-on-demand system is then discussed, and a multimedia database hierarchy that supports the instruction system is proposed. The data and control flow of the instruction system, and how they affect our multimedia database, are discussed next; finally, we draw some conclusions.
MULTIMEDIA DATABASE MANAGEMENT SYSTEMS
In this section, issues of multimedia database development are discussed. First, we look at the fundamental requirements of an MDBMS from the perspective of its functionality. We then discuss the architecture of an MDBMS from the viewpoint of a software system, and survey various approaches to developing an MDBMS. Finally, reusability in an MDBMS is discussed.
Requirements of an MDBMS
Multimedia objects are different from traditional text or numerical documents in that multimedia objects usually require a large amount of memory and disk storage. Also, the operations applied to multimedia objects are different (e.g., displaying a picture or playing a video clip, which is different from displaying a text paragraph). A multimedia database management system should be able to provide the following basic functions:
• handle image, voice and other multimedia data types;
• handle a large number of multimedia objects;
• provide a high-performance and cost-effective storage management scheme; and
• support database functions, such as insert, delete, search and update.
Multimedia objects are mostly Binary Large OBjects (BLOBs). It is common for a video clip to occupy more than 100 MB of disk storage, and a video server may store hundreds of video clips. Due to the huge amount of storage required, an MDBMS needs a sophisticated storage management mechanism, which should also be cost-effective. The storage management scheme needs to support fundamental database operations as well. Moreover, an MDBMS should take the following issues into consideration:
• composition and decomposition of multimedia objects;
• operations on multimedia objects with media synchronization;
• persistent objects;
• content-based multimedia information retrieval;
• concurrent access and locking mechanisms for distributed computing;
• security;
• consistency, referential integrity and error recovery;
• long transactions and nested transactions; and
• indexing and clustering.
A multimedia object usually does not exist by itself. A typical multimedia presentation or document may contain a number of objects of various types, such as pictures, MIDI music and text. An object-oriented approach with different types of links of various semantics can be used to compose/decompose multimedia objects. To present the composed objects, multimedia operations, such as fast forward, suspend, resume and slow-motion play of a video clip, other operations to display pictures and text, as well as operations to play a sound file, need to be synchronized. It is important to point out that multimedia objects are embedded with timing constraints, which are relatively less important in a traditional database document. Of course, these multimedia operations with time constraints should not affect the persistency of multimedia objects. Content-based multimedia information retrieval has become a very important new research issue in the literature of multimedia computing. Unlike a searching scheme based on the text or numerical comparisons usually used in a traditional database, the searching and matching criteria of multimedia information are hard to model. Imagine that a multimedia presentation designer wants to find a picture which contains a house and a car. It is difficult to write such a query specification, and it is even harder to match the specification against the large number of picture files in a multimedia database. The same difficulty exists in searching for a video clip or a sound file. A naive multimedia resource browser is not enough to provide an MDBMS user with this kind of content-based information retrieval. On the other hand, when the MDBMS is realized in a distributed environment, one needs to consider concurrent access (with reasonable performance) and locking mechanisms, especially when the MDBMS is to support a CSCW (Computer-Supported Cooperative Work) environment. In a distributed environment, network security issues are essential. Another important issue is related to the update operations of the database: referential integrity is essential in that database operations such as insertion and deletion need to maintain the consistency of the object semantics in the multimedia database. If an exception condition occurs due to hardware failure or software error, the MDBMS should be able to recover to a previously consistent state. Since multimedia objects contain a large amount of data, long transactions--in terms of access time--are a requirement; and since multimedia objects usually exist in a compound form, nested transactions are necessary. In order to support all of the above, an MDBMS requires a fast indexing mechanism to locate multimedia objects and a performance-guaranteed disk storage clustering scheme to achieve a reasonable quality of service.
Architecture of an MDBMS
A multimedia database usually contains three layers in its architecture:
• the interface layer,
• the object composition layer, and
• the storage layer.
The tasks dealt with in the interface layer include object browsing, query processing and the interaction of object composition/decomposition. Object browsing allows the user to find multimedia resource entities to be reused. Through queries, either text-based or visualized, the user specifies a number of conditions on the properties of resources and retrieves a list of candidate objects; suitable objects are then reused. Multimedia resources, unlike text or numerical information, cannot be effectively located using a text-based query language. Even a natural-language description in text form can hardly retrieve precisely a picture or a video with certain content. Content-based information retrieval research focuses on mechanisms that allow the user to effectively find reusable multimedia objects, including pictures, sound, video and other forms. After a successful retrieval, the database interface should help the user to compose/decompose multimedia documents. The second layer works in conjunction with the interface layer to manage objects. Typically, object composition requires a number of links, such as the association links, similarity links and inheritance links of an object-oriented system, to specify different relations among objects. These links are specified either via the database graphical user interface or via a number of application program interface (API) functions. The last layer, the storage management layer, involves two performance-related issues: clustering and indexing. Clustering means organizing multimedia information physically on a hard disk (or an optical storage) such that, when retrieved, the system is able to access the large binary data efficiently. Usually, the performance of retrieval needs to guarantee some sort of Quality of Service and to achieve multimedia synchronization. Indexing means that a fast locating mechanism is essential to find the physical address of a multimedia object; sometimes, the scheme involves a complex data or file structure.
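As an illustration of the indexing idea in the storage layer, the sketch below maps object identifiers to physical BLOB locations so that a clustered read can start without scanning the disk; all names are hypothetical, not the system's actual structures.

#include <cstdint>
#include <string>
#include <unordered_map>

struct BlobLocation {
    uint64_t offset;  // byte offset of the first cluster on disk
    uint64_t length;  // total size of the binary object
};

class BlobIndex {
public:
    void add(const std::string& objectId, BlobLocation loc) {
        index_[objectId] = loc;
    }
    // Fast lookup: returns true and fills 'out' if the object is indexed.
    bool find(const std::string& objectId, BlobLocation& out) const {
        auto it = index_.find(objectId);
        if (it == index_.end()) return false;
        out = it->second;
        return true;
    }
private:
    std::unordered_map<std::string, BlobLocation> index_;
};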
MDBMS Development Approaches
In spite of the many proposed MDBMSs, a problem remains. To increase the efficiency of multimedia presentation design, the reuse of previously designed presentations is key, yet not many presentation design systems have an underlying database management system to support presentation reuse. In our research, the development of a reuse mechanism was a necessity from the beginning. As exhibited in the systems described above, the strategies for storing multimedia resources follow four basic approaches:
• rely on a regular file system;
• use a traditional database management system (e.g., a relational DBMS), with the support of an object-oriented interface;
• use an object-oriented database management system, with user interface support; or
• design the database from scratch, based on object-oriented concepts.
The first approach relies on the users to manage multimedia resources by themselves. There is no support for the reuse of presentations. Most presentation systems allow the user to cut and paste portions of a presentation; however, this is not an ideal strategy, due to the general limitations of a file system, such as the inflexibility of object composition and sharing. The second approach relies on a relational DBMS. However, due to the nature of multimedia presentations, it is easier to organize a presentation using an object-oriented methodology (Chen et al., 1995; Gibbs, 1991; Ozsu et al., 1995; Vazirgiannis & Mourlas, 1993). The difference between the table-based organization and the object-oriented nature of multimedia information makes the use of a relational database system inefficient (Ozsu et al., 1995). Thus, the third approach overcomes the second by using an object-oriented DBMS as the underlying system. Even though the underlying
system may not be designed specially for multimedia data, we found that most object-oriented DBMSs provide a binary data type, which is useful for storing pictures, sound, video, etc. The last approach results in the most efficient system in general; however, it is quite time-consuming to design everything from scratch. An early version of our database took the third approach. However, as we attempted to extend the database to a distributed environment, we found it difficult to solve some implementation problems, due to the lack of control over the object-oriented DBMS and the special requirements of an MDBMS specified below:
• Quality of Service: Multimedia resources require a guarantee of presentation quality. We have to use our own file structure and program to guarantee the Quality of Service (QoS); a traditional OODBMS does not support QoS for multimedia objects.
• Synchronization: Synchronization of multiple resource streams plays an important role in a multimedia presentation, especially when the presentation is running across a network.
• Networking: A distributed database requires a locking mechanism for concurrent access control. Using our own low-level design, it is easy for us to control implementation issues such as the two-phase locking mechanism, database administration and the control of traffic on the network.
Also, as pointed out by the work discussed in Chen et al. (1995), most OODBMSs are designed for non-multimedia information. The multimedia database implemented in Chen et al. (1995) takes the last approach. For these reasons, we also took the last approach in our new version of the database implementation. In our system, we have a sophisticated storage management and indexing mechanism for the efficient retrieval of large multimedia resources.
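As a rough sketch of the two-phase locking idea mentioned above (an assumed design, not the system's actual code): locks only accumulate while a transaction runs (the growing phase) and are all released together when it ends (the shrinking phase).

#include <mutex>
#include <set>
#include <string>

class LockManager {
public:
    bool acquire(const std::string& objectId) {
        std::lock_guard<std::mutex> g(m_);
        if (locked_.count(objectId)) return false;  // held by another transaction
        locked_.insert(objectId);
        return true;
    }
    void release(const std::string& objectId) {
        std::lock_guard<std::mutex> g(m_);
        locked_.erase(objectId);
    }
private:
    std::mutex m_;
    std::set<std::string> locked_;
};

class Transaction {
public:
    explicit Transaction(LockManager& lm) : lm_(lm) {}
    // Growing phase: acquire locks as objects are touched.
    bool lock(const std::string& id) {
        if (held_.count(id)) return true;     // already held by this transaction
        if (!lm_.acquire(id)) return false;   // conflict: caller waits or aborts
        held_.insert(id);
        return true;
    }
    // Shrinking phase: commit or rollback releases everything at once.
    void finish() {
        for (const auto& id : held_) lm_.release(id);
        held_.clear();
    }
private:
    LockManager& lm_;
    std::set<std::string> held_;
};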
Reusability
From the perspective of software development, reusability is one of the most important factors in improving the efficacy of a multimedia database. This chapter focuses on a mechanism to support reuse. Many articles discuss reusability. One of them (Tracz, 1998) analyzes nine commonly believed software reuse myths and points out why reusability has not played a major role in improving software productivity. Another article (Kaiser et al., 1987) points out three common approaches to achieving software reuse: subroutine libraries, software generators and object-oriented languages. However, each individual approach raises problems; therefore, the authors take the best features from each and develop an object-oriented language called Meld. Another reuse mechanism, via building blocks, is proposed in Lenz et al. (1987). The authors apply the concept to systems programming and achieve great success. They also indicate that there are three important aspects of reusability: the abstraction level, customization methods and reusability conditions. In Gargaro et al. (1987), reusability criteria are proposed for the Ada language. The authors suggest that, to be reusable, a program must be transportable, context independent and independent of the runtime environment. According to the discussion given in Prieto-Diaz et al. (1987), there are two levels of reuse: the reuse of knowledge and the reuse of software components. In order to support the fast retrieval of reusable objects, a classification scheme is required to group together objects sharing a common characteristic. A classification scheme organizing a software reuse library is proposed in Prieto-Diaz (1991). Instead of using a centralized repository, the author proposes a generic design that can be applied to different environments. Another article (Ghezala et al., 1995) proposes two concepts, domain and theme, to allow software component classification by the services that components offer and by application domain.
The value of object reuse needs to be evaluated. A quantitative study of reuse through inheritance is found in Bieman et al. (1995). The authors study several software systems containing a large number of classes; the results show that reuse through inheritance is far less frequent than expected. Another article (Mieman et al., 1995) measures cohesion versus reuse through inheritance, and determines that reuse through inheritance results in lower cohesion. Fortunately, many articles propose solutions for software reusability. The RSL (Reusable Software Library) prototype (Burton et al., 1987) is a software system incorporating interactive design tools and a passive database of reusable software components. Each component is associated with attributes to be used in its automatic retrieval. Another author (Fischer, 1987) claims that reusable components are not enough: software engineers need an intelligent support system to help them learn and understand how to reuse components in a software environment, and a set of tool kits is implemented to support object reuse. The NUT system [31,32] is an object-oriented programming language/environment that supports class and object reuse. In NUT, classes are prototypes; when they are reused, initial values (possibly nil values) are assigned according to the class specification. Objects can also be implicitly generated; for instance, program objects can be generated when equations are processed. Knowledge in the NUT environment can be reused in the following ways:
• as a super-class;
• when a new object is created via the "new" operator;
• as a prototype for specifying components of another class;
• as the specification of a predicate; or
• as a specification of problem conditions in problem statements.
Only the first two are typical of conventional object-oriented systems. The above discussions consider reusability from the perspective of software development. A multimedia presentation contains a collection of presentation components, as does a software system. To achieve object reuse in a multimedia database, we consider four tasks:
• Declare Building Blocks: Presentation sections are the building blocks of a multimedia presentation. These sections are objects to be reused.
• Retrieve Building Blocks: A presentation system should provide a tool to assist the user in retrieving appropriate building blocks.
• Modify Building Blocks: The building blocks to be reused may be modified to fit the needs of a new presentation.
• Reorganize Database Storage: Database storage should be reorganized so that information sharing is feasible and disk space is used efficiently.
Building block declaration is achieved by means of an object grouping mechanism at different levels of our database. These reusable objects can be retrieved using a browser. After finding the reusable presentation components, the presentation designer establishes links and possibly modifies the components. Finally, we have a storage management mechanism to allow information sharing on the disk.
A DISTANCE UNLEASHED INSTRUCTION-ON-DEMAND SYSTEM
Distance learning, or the so-called remote classroom, relies on network and multimedia facilities to provide teachers with an instruction environment, via either analog or digital transmission, such that the instructor and his/her students can join the class from different
locations. On the other hand, with the availability of high-performance networks, Video-on-Demand (VOD) systems allow customers to access digital video randomly and remotely. In this chapter, we propose a system that combines functions of the above two types of systems, namely an instruction-on-demand system. The proposed system supports instructors in designing course materials, including lectures, homework and tests. Students and the instructor in a class communicate with each other through the help of the system. If necessary, students can also review lectures and other course materials at home. Our research focuses span several software levels. Issues include:
• instruction-on-demand applications;
• the multimedia documentation database;
• the multimedia information networking system;
• the disk accessing control mechanism; and
• quality of service and media synchronization.
In order to address these different levels, the software architecture is divided into four levels. Figure 1 gives the system architecture of the proposed instruction-on-demand system. The user interface layer in the architecture has a number of applications, including a multimedia lecture design system. Since a multimedia lecture is essentially a multimedia presentation, our lecture design system aims to provide a lecture design and playback facility. A lecture designed via this facility can be accessed by students on a remote computer. An instruction administrator keeps the information on which students are enrolled in the class. The whiteboard subsystem provides an environment for the students and the instructor to discuss, in either verbal or text form.
Figure 1: Software architecture of the proposed instruction-on-demand system
[The figure shows the layer stack: instructor and student workstations, the instruction administrator and the whiteboard system at the top (Control API); the multimedia instruction-on-demand DBMS (DBMS API); the multimedia information networking library (MPI functions); the distributed disk accessing system (DDA API); J++ and C++ over Windows 95/NT and BSD Unix; and the MPC hardware and network infrastructure at the bottom, with a control daemon alongside the layers.]
The second layer of the architecture is a multimedia database management system (MDBMS). The purpose of this MDBMS is to support many instruction-on-demand applications. Lectures, assignments, tests and other course materials are stored in the MDBMS with the help of an underlying multimedia networking and disk accessing system. The hierarchy of this MDBMS is designed to fit the needs of instruction and consists of five layers. An object-oriented approach is used to facilitate the design; therefore, one of the focuses of this MDBMS is on its instruction reusability. While instruction objects of various types are stored in the MDBMS hierarchy, the system also supports a dynamic type check scheme allowing object composition/decomposition. Type constructors and converters based on the concepts of time and space are also provided. The implementation environment is distributed. A concurrent accessing mechanism is used, as well as mechanisms for database transactions, versioning, administration and multimedia synchronization. The target application environment is heterogeneous, including different types of operating systems connected by various network infrastructures. The third layer is a multimedia information networking library, which is based on the Message Passing Interface (MPI). MPI is a standard communication mechanism designed for parallel machines, especially those with distributed memory. We aim to provide a solution to the communication facility of the multimedia instruction-on-demand system. We investigated MPI functions and source code provided by a vendor; our objective is to realize the communication standard not only on Unix machines (as the provided source code does), but also on the windowing systems by Microsoft. On top of the MPI functions, we design and implement an encapsulation layer, which is accessed by our multimedia database management system (a minimal sketch of such an encapsulation is given below). For the sake of performance efficiency, we also extend the MPI functions by enhancing the underlying disk accessing mechanism to suit our database architecture. We believe that this extended MPI encapsulation will play an important role in our system. The last layer of our system design is a distributed disk accessing system, which is built on top of Windows and Unix. In order to achieve Quality of Service, the disk scheduling mechanisms of our system rely on the logical block read/write statements of SCSI to access local disk data; disk buffers are used for remote disk access. A control daemon is designed to control system status, which includes database accessing parameters, communication parameters, synchronization parameters and disk accessing parameters. The control is centralized: the control daemon is installed on the instructor workstation, with a status buffer shared by all workstations. In the following section, we will focus our discussion on the multimedia database.
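The sketch below illustrates what such an encapsulation over the standard MPI C API could look like: the database code sends and receives raw byte buffers without touching MPI directly. The wrapper names and the tag value are assumptions, not the system's actual interface.

#include <mpi.h>
#include <vector>

const int DB_MESSAGE_TAG = 42;   // illustrative tag reserved for DBMS traffic

// Send a serialized database message to a peer workstation.
void dbSend(const std::vector<unsigned char>& buf, int destRank) {
    MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_UNSIGNED_CHAR,
             destRank, DB_MESSAGE_TAG, MPI_COMM_WORLD);
}

// Receive a database message of unknown length: probe first for the size.
std::vector<unsigned char> dbReceive(int srcRank) {
    MPI_Status status;
    MPI_Probe(srcRank, DB_MESSAGE_TAG, MPI_COMM_WORLD, &status);
    int count = 0;
    MPI_Get_count(&status, MPI_UNSIGNED_CHAR, &count);
    std::vector<unsigned char> buf(count);
    MPI_Recv(buf.data(), count, MPI_UNSIGNED_CHAR, srcRank,
             DB_MESSAGE_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return buf;
}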
THE PROPOSED MULTIMEDIA DATABASE HIERARCHY
In order to support the instruction-on-demand system, an in-house-designed multimedia database management system is used. The database hierarchy is designed for storing multimedia lectures, student homework and tests. These multimedia presentations are compound objects consisting of various levels of materials. The hierarchy of the MDBMS has five levels:
1. Database layer
2. Course layer
3. Lecture layer
4. Scene layer
5. Resource layer
As shown in Figure 2, the MDBMS allows multiple databases. There are two types of databases: the instruction database (i.e., InstDatabase) and the class database (i.e., ClassDatabase). Each instruction database contains one or more courses. Related courses can be grouped. A course group (i.e., CG), through a declaration, can become a course object class (i.e., COC) in a class database. A COC is a reusable component: it can be instantiated to another course group with alternative properties. The purposes of object reuse include information sharing and prototyping. Information sharing, in this circumstance, means that a COC and other COCs can share the same lecture object class (i.e., LOC). Prototyping means that the internal structure of a course remains intact in an instantiation process, before the object group is assigned new properties. In the same way, a course has one or more lectures. A lecture group can be declared as a lecture object class for reuse. A lecture, in its broad meaning, can be a lecture presentation, homework or a test. A lecture can be designed in our presentation design interface, or by using other commercial multimedia presentation software. A lecture can also have a script attached; if so, the script keeps the sequence of interactions produced by the user who navigates through the lecture, and the lecture can be played back if necessary. A lecture is a collection of windows, called scenes in our system. Scenes can be reused as well. A scene object can inherit properties from other scenes. A scene sends out navigation messages (issued by the user) to scenes (including the origin scene) in order to control navigation sequences. Similarly, a scene object class (i.e., SOC) can be reused. Finally, a scene contains a number of presentation resources. A presentation resource may have a synchronization requirement with respect to another presentation resource, and two resources may have a coexistence relation or a similarity relation. The resource layer is the last layer of the database hierarchy; Binary Large OBjects (BLOBs) in this layer can be shared by object groups or object classes. Note that the object class hierarchy of a class database is different from that of an instruction database. Course, lecture and scene object classes can be the top-level objects in the hierarchy; in this way, the three types of object classes can be identified for reuse at the top level. Each object in the hierarchy has a number of properties. These properties help not only the identification of an object, but also the organization of a multimedia object. Properties of objects in our MDBMS are given below:
• Database:
  - aggregation links (al): pointers to instruction or class databases.
• Database layer:
  - name: a unique name of the database.
  - keyword: one or more keywords used to describe the database.
  - aggregation links (al): pointers to course groups belonging to the database.
  - version: the version of this database.
  - date/time: the date and time this database was created.
  - author: author name and copyright information of the creator.
  - privilege table: database access privilege information.
• Course layer:
  - name: a unique name of the course.
  - keyword: one or more keywords used to describe the course.
  - aggregation links (al): pointers to lecture groups belonging to the course.
Figure 2: The proposed instruction database architecture
[The figure depicts the MDBMS as a tree: instruction databases (InstDatabase) aggregate courses (C), which aggregate lectures (L), scenes (S) and resources (R); course, lecture and scene groups (CG, LG, SG) can be declared as course, lecture and scene object classes (COC, LOC, SOC) in a class database.]
  - version: the version of this course.
  - date/time: the date and time this course was created.
  - author: author name and copyright information of the creator.
  - similarity links (sl): logical connections pointing to other courses which have attributes similar to the current course. Similarity links help a presentation designer to locate similar courses.
• Lecture layer:
  - name: a unique name of the lecture.
  - keyword: one or more keywords used to describe the lecture.
  - aggregation links (al): pointers to scene groups that are used in the current lecture.
  - version: the version of this lecture.
  - date/time: the date and time this lecture was created.
  - author: author name and copyright information of the creator.
  - similarity links (sl): logical connections pointing to other lectures which have attributes similar to the current lecture. Similarity links help a presentation designer to locate similar lectures.
• Scene layer:
  - name: a unique name of the scene.
  - keyword: one or more keywords used to describe the scene.
  - aggregation links (al): pointers to resources that are used in the current scene.
  - inheritance links (il): pointers to other scenes that inherit properties from the current scene.
  - usage links (ul): messages from the current scene to the destination scenes, including possible parameters.
  - presentation data: presentation data used in the scene.
  - scene layouts: screen coordinates of presentation resources.
  - version: the version of this scene.
  - date/time: the date and time this scene was created.
  - similarity links (sl): logical connections pointing to other scenes which have attributes similar to the current scene. Similarity links help a presentation designer to locate similar scenes.
• Resource layer:
  - name: a unique name of the resource.
  - keyword: one or more keywords used as the description of a multimedia resource.
  - usage: how the resource is used (e.g., background, navigation or focus).
  - medium: what multimedia driver is used to carry out this resource (e.g., sound card driver, animation driver or MPEG-coded video driver).
  - model: how the resource is presented (e.g., table, map, chart or spoken language).
  - temporal endurance: how long the resource lasts in a presentation (e.g., 20 seconds or permanent).
  - synchronization tolerance: how a participant perceives the synchronization delay of a resource. For instance, a user typically expects an immediate response after pushing a button for the next page of text, but might tolerate a two-second delay for a video playback.
  - detectability: how strongly a resource attracts attention (e.g., high, medium or low).
  - startup delay: the time between a request and the start of the presentation of the corresponding resource, especially when the resource is on a remote computer connected via the network.
  - hardware limitation: what kind of hardware is essential to present the resource with a minimal quality of service (e.g., MPC level 1, level 2, level 3 or other limitations).
  - version: the version of this resource file.
  - date/time: the date and time this resource file was created.
  - resolution: the resolution of this resource file, specified as X * Y screen units, or K-bit for sound.
  - start/end time: for non-permanent resources, the starting cycle and the ending cycle of the piece of video, sound or other kind of resource. A cycle can be a second, one-tenth of a second or an interval between two consecutive video scenes of a video clip.
  - resource descriptor: a logical descriptor to a physical resource data segment on the disk.
  - association links (ol): pointers to other resources that have a coexistence relation with the current resource.
  - synchronization links (yl): logical pointers to other resources which should be presented starting at the same time as the current resource.
  - similarity links (sl): logical connections pointing to other resources which have attributes similar to the current resource. Similarity links help a presentation designer to locate alternative resources.
In addition, each object group or object class in the course layer, the lecture layer and the scene layer has the following properties:
• name: a unique name of the group/class.
• keyword: one or more keywords used to describe the object.
• structure: group or class.
• similarity links (sl): logical connections pointing to other objects which have attributes similar to the current object.
These properties help the user to identify an object group or object class. The operations of our database system are based on the concept of object orientation. In the following sub-sections, we discuss database operations as well as other issues of the proposed database.
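Before turning to operations, the sketch below renders a resource-layer object and its links as a C++ structure. The field names follow the property list above, but the types and layout are assumptions, not the system's actual schema.

#include <string>
#include <vector>

struct ResourceObject {
    std::string name;                      // unique name of the resource
    std::vector<std::string> keywords;     // descriptive keywords
    std::string usage;                     // background, navigation or focus
    std::string medium;                    // driver used to carry out the resource
    int temporalEndurance = 0;             // seconds; 0 could mean permanent
    int startupDelayMs = 0;                // request-to-presentation delay
    long resourceDescriptor = -1;          // logical pointer to the disk segment
    std::vector<ResourceObject*> associationLinks;     // coexistence (ol)
    std::vector<ResourceObject*> synchronizationLinks; // start together (yl)
    std::vector<ResourceObject*> similarityLinks;      // alternatives (sl)
};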
Database Operations
Our database server handles database operations from the user's program by means of application program interface (API) functions. Some of these functions come from traditional database systems, whereas others are new for handling multimedia resources. The database APIs are divided into the following groups:
• Database Access Commands: provide the service of accessing an anchor database, such as opening or closing the database.
• Object Access and Reuse Commands: allow the user to create and reuse objects.
• Presentation Commands: present, stop, pause and resume presentation objects.
• Object Insert/Delete Commands: insert, delete or update an object group.
• Link Commands: set object links in the database.
• Object Select Commands: allow the user to search for object groups, object classes, courses, lectures, scenes or resources which meet specific conditions.
• Support Commands: provide a mechanism to add additional restrictions to objects.
• Transaction Commands: support the concept of database transactions and allow the commit/rollback process.
• Network Commands: provide services such as registering a multimedia presentation station or copying database objects from one station to another.
• Privilege Commands: set the access privileges of objects for users.
These database commands are implemented as C++ classes and methods. We aim to provide the next version of the system with a visualized approach, so that users can create and manipulate their database applications through a graphical user interface.
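To suggest how these command groups might surface as a C++ class, a declaration-only sketch follows. The chapter does not list the actual signatures, so every method name and parameter here is an assumption; bodies are omitted.

#include <string>
#include <vector>

class MDBMSClient {
public:
    // Database Access Commands
    bool openDatabase(const std::string& name);
    void closeDatabase();
    // Object Access and Reuse Commands
    long declareObjectClass(long objectGroupId);     // group -> class
    long instantiateObjectGroup(long objectClassId); // class -> group
    // Presentation Commands
    void present(long objectId);
    void stop(long objectId);
    void pause(long objectId);
    void resume(long objectId);
    // Object Select Commands
    std::vector<long> selectByKeyword(const std::string& keyword);
    // Transaction Commands
    void beginTransaction();
    void commit();
    void rollback();
};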
Object Reuse in the Database
Reusability was a focus when we designed our database hierarchy. It is an advantage of an object-oriented database system that classes can be reused, and we follow this approach. Moreover, the hierarchy of our database allows information sharing of multimedia BLOBs. These concepts are discussed in the following subsections.
Object Reuse via Declaration and Instantiation
There are three levels of objects in the database hierarchy that can be reused: courses, lectures and scenes. The basic block to be reused is an object group. A singular object can be bound in a group for reuse as well. An object group, after its creation, can be declared as an object class. The object class can be instantiated to another object group for reuse. These reuse operations are implemented by the pseudocode given below:

Algorithm: declare COC from CG
Input: CG
Output: COC
{
    create a new copy of CG
    discard aggregation links to the new CG from the database layer
    declare lecture object classes from the aggregated lecture groups
    maintain similarity links of the course group
    reestablish an aggregation link from a class database
    change the structure property of the new object to 'class'
}

Algorithm: instantiate CG from COC
Input: COC
Output: CG
{
    create a new copy of the COC
    discard the aggregation link from a class database
    instantiate lecture groups from the aggregated lecture object classes
    reestablish aggregation links to the new CG from the database layer
    change the structure property of the new object to 'group'
    add or change properties of the new CG if necessary
}

Algorithm: declare LOC from LG
Input: LG
Output: LOC
{
    similar to the course layer
}

Algorithm: instantiate LG from LOC
Input: LOC
Output: LG
{
    similar to the course layer
}

Algorithm: declare SOC from SG
Input: SG
Output: SOC
{
    create a new copy of SG
    discard aggregation links to the new SG from the lecture layer
    discard external inheritance and external usage links
    maintain similarity links of the scene group
    maintain aggregation links to resources of each scene in the scene group
    reestablish an aggregation link from a class database
    change the structure property of the new object to 'class'
}

Algorithm: instantiate SG from SOC
Input: SOC
Output: SG
{
    create a new copy of the SOC
    discard the aggregation link from a class database
    reestablish aggregation links to the new SG from the lecture layer
    reestablish external inheritance and external usage links
    change the structure property of the new object to 'group'
    add or change properties of the new SG if necessary
}

The declaration of an object class involves the creation of database space for the new class. Since a course group uses a number of lecture groups, the declaration of a course object class enforces the declarations of the used lecture object classes as well. The instantiation process is the reverse of a declaration process; however, additional database space is used. In general, we treat object groups and object classes as separate entities in the database so that a modification of one will not affect the content of the other. Object reuse in the lecture layer is similar to that in the course layer. However, reuse in the scene layer is different. Since a scene is essentially a window, the navigation among windows is implemented by message
passing (which is called usage links in our system). Moreover, a scene can inherit information from another scene by means of an inheritance link. Therefore, the declaration and instantiation of a scene object class or a scene group need to manipulate usage links and inheritance links. These two types of links can be internal or external. An internal link has its source and destination scenes both lying within an object group (or an object class). An external link of an object group or class has either the source or the destination of the link outside the object group or class. When an object group is declared as an object class, all external links are discarded. When the class is reused, its external links are reestablished (or changed).
Object Reuse via Storage Sharing
Whether in an object group or an object class hierarchy, the lowest level of objects consists of multimedia resources. Resources are shared among groups and classes, or between a group and a class. The update of a resource will therefore change the multimedia instruction content of the entire database. Hence, if resources of different versions are to be used in the database application, the reestablishment of aggregation links is necessary.
Concurrency and Locking
We implement the database server in a distributed environment. Concurrent access control is centralized by means of a locking mechanism. Object locking is divided into the following levels:
• lock/unlock a course (C), a course group (CG) or a course object class (COC)
• lock/unlock a lecture (L), a lecture group (LG) or a lecture object class (LOC)
• lock/unlock a scene (S), a scene group (SG) or a scene object class (SOC)
• lock/unlock a resource (R)
As a result, there are 10 types of objects that can be locked. In an object-oriented database, locking depends on the granularity of the objects in an object hierarchy in two ways: object composition and class inheritance. Object composition requires the maintenance of locks at different levels, so that locking a component still leaves room to lock the containing object; conversely, locking a compound object at a higher level enforces locking on its descendant objects. Similarly, in a class hierarchy, when a user is about to change a parent class, its child classes cannot be changed by another user. In the database hierarchy we propose, the relations among objects are based on object composition. These relations can be summarized as follows:
• resource is-part-of scene
• scene is-part-of scene group
• scene group is-part-of lecture
• lecture is-part-of lecture group
• lecture group is-part-of course
• course is-part-of course group
• resource is-part-of scene object class
• scene object class is-part-of lecture object class
• lecture object class is-part-of course object class
To support the locking mechanism used in object-oriented database systems, we have defined an object locking compatibility table. In general, if a container holds a read lock by one user, its components (and the container itself) can be given read access by another user, but not write access. However, the parent objects of the container can be given both read and write access by another user.
The object locking compatibility table

          COCr COCw LOCr LOCw SOCr SOCw  Rr  Rw
   COCr    Y    N    Y    N    Y    N    Y   N
   COCw    N    N    N    N    N    N    N   N
   LOCr    Y    Y    Y    N    Y    N    Y   N
   LOCw    Y    Y    N    N    N    N    N   N
   SOCr    Y    Y    Y    Y    Y    N    Y   N
   SOCw    Y    Y    Y    Y    N    N    N   N
   Rr      Y    Y    Y    Y    Y    Y    Y   N
   Rw      Y    Y    Y    Y    Y    Y    N   N

The object locking compatibility table (continued)

          CGr CGw  Cr  Cw LGr LGw  Lr  Lw SGr SGw  Sr  Sw  Rr  Rw
   CGr     Y   N   Y   N   Y   N   Y   N   Y   N   Y   N   Y   N
   CGw     N   N   N   N   N   N   N   N   N   N   N   N   N   N
   Cr      Y   Y   Y   N   Y   N   Y   N   Y   N   Y   N   Y   N
   Cw      Y   Y   N   N   N   N   N   N   N   N   N   N   N   N
   LGr     Y   Y   Y   Y   Y   N   Y   N   Y   N   Y   N   Y   N
   LGw     Y   Y   Y   Y   N   N   N   N   N   N   N   N   N   N
   Lr      Y   Y   Y   Y   Y   Y   Y   N   Y   N   Y   N   Y   N
   SGr     Y   Y   Y   Y   Y   Y   Y   Y   Y   N   Y   N   Y   N
   Sr      Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N   Y   N
   Sw      Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N   N   N   N
   Rr      Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N
   Rw      Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N   N
Write access to the locked container object itself is, of course, prohibited. The object locking compatibility table is divided into two parts: one for the object classes and the other for the object groups. The table is used in the implementation as follows. When a user program is about to lock an object for a type of access (e.g., Cw for write access to a course), the server looks at the column of Cw. If the parent or child objects of the course object are already locked with a certain access type, the entry in the corresponding row of the Cw column determines the outcome: "Y" means that the requested access to the course object is possible, and "N" means that the access is denied. For instance, if the course group containing the course object, or the course object itself, is already locked for read or write, the course cannot be locked for write. However, locking a child object of the course object will not affect the locking access of the course object.
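To make this lookup concrete, the class part of the table, as reconstructed above, can be transcribed into a boolean matrix and consulted before granting a lock. This is only an illustrative sketch; the enum, matrix and function names are ours, not the system's:

#include <iostream>

// Lock types for the class hierarchy, in table order.
enum LockType { COCr, COCw, LOCr, LOCw, SOCr, SOCw, Rr, Rw };

// compat[held][requested] transcribes the first part of the
// object locking compatibility table (true = "Y", false = "N").
const bool compat[8][8] = {
    //           COCr  COCw  LOCr  LOCw  SOCr  SOCw  Rr    Rw
    /* COCr */ { true, false,true, false,true, false,true, false },
    /* COCw */ { false,false,false,false,false,false,false,false },
    /* LOCr */ { true, true, true, false,true, false,true, false },
    /* LOCw */ { true, true, false,false,false,false,false,false },
    /* SOCr */ { true, true, true, true, true, false,true, false },
    /* SOCw */ { true, true, true, true, false,false,false,false },
    /* Rr   */ { true, true, true, true, true, true, true, false },
    /* Rw   */ { true, true, true, true, true, true, false,false },
};

// A requested lock is granted only if it is compatible with every
// lock already held on the object itself, its ancestors and descendants.
bool canGrant(LockType requested, const LockType* held, int nHeld) {
    for (int i = 0; i < nHeld; ++i)
        if (!compat[held[i]][requested]) return false;
    return true;
}

int main() {
    LockType held[] = { LOCr };           // a lecture class is read-locked
    std::cout << canGrant(SOCw, held, 1)  // write a contained scene class?
              << std::endl;               // prints 0: denied
    return 0;
}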
THE INSTRUCTION AND CONTROL FLOW
The distance learning instruction-on-demand system is divided into five parts. As shown in Figure 3, the multimedia database management system (MDBMS) sits below the presentation system, which accepts commands and interactions from the user and sends database queries to the MDBMS.
[Figure 3: Data and control flows of the instruction-on-demand system. The figure shows the presentation system (instructor and student interfaces, whiteboard system), the MDBMS with the instruction administrator and instruction player, the network system with its control daemon, compression/packing modules and network library, and the disk system with network and disk resource reservation control over local and remote disk units.]
These queries are application program interface (API) function calls in the instruction player. The database system relies on the network system and the disk system to transmit and store instructions. BLOBs are compressed by a standard CODEC (i.e., compression and decompression) mechanism. Since the MPI (message passing interface) functions of the network system send only binary streams, we pack multimedia instruction into a special structure before transmission and unpack the instruction afterward; a sketch of this packing idea is given below. To the MDBMS, the transmission and data storage are transparent. That is, the MDBMS only issues store and retrieve commands for logical blocks to and from network addresses; the physical network resource and disk resource allocation is handled by the network and disk systems. The control of the system is centralized. Control integration is applied via a control daemon and a centralized control state. The control state includes several types of information, such as multimedia synchronization status, database locking status, communication status and disk access status.
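The packing structure itself is not documented in the chapter. As a minimal sketch of the idea, assuming a hypothetical header layout (the resource id, sequence number, payload length and codec id fields are our inventions), packing and unpacking a compressed block for a binary-stream send could look like this:

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical wire header for one packed instruction block; the actual
// structure used by the system is not documented in the chapter.
// Same-endianness peers are assumed, since the struct is copied raw.
struct PackedBlockHeader {
    uint32_t resourceId;   // which multimedia resource this block belongs to
    uint32_t seqNo;        // position of the block within the resource
    uint32_t payloadLen;   // number of compressed payload bytes that follow
    uint8_t  codecId;      // which CODEC compressed the payload
};

// Pack header + compressed payload into one contiguous binary buffer,
// suitable for a binary-stream send such as MPI_Send.
std::vector<uint8_t> pack(const PackedBlockHeader& h, const uint8_t* payload) {
    std::vector<uint8_t> buf(sizeof(h) + h.payloadLen);
    std::memcpy(buf.data(), &h, sizeof(h));
    std::memcpy(buf.data() + sizeof(h), payload, h.payloadLen);
    return buf;
}

// Unpack on the receiving side: recover the header, then point at payload.
PackedBlockHeader unpack(const std::vector<uint8_t>& buf,
                         const uint8_t** payload) {
    PackedBlockHeader h;
    std::memcpy(&h, buf.data(), sizeof(h));
    *payload = buf.data() + sizeof(h);
    return h;
}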
CONCLUSIONS
In this chapter, we discussed research issues of distributed multimedia databases. We also proposed an MDBMS to support our instruction-on-demand system. We implemented our system on Windows 95. We hope that the proposed database hierarchy will ease the design and reuse of multimedia presentations. There are other issues in multimedia database research, including transaction processing, persistency, versioning, security and referential integrity; most of these are also issues of traditional database research. Multimedia database research has become an important topic in the multimedia computing community, and future multimedia database applications will be even more interesting. For instance, intelligent agents based on Artificial Intelligence techniques will further assist multimedia database software in interacting with its users. Also, intelligent searching mechanisms will add power to content-based retrieval of multimedia information. High-speed processors, high-performance multimedia hardware architectures and networks, and high-capacity optical storage will support the need for high-quality multimedia database applications in the future.
Chapter VIII
QoS-Aware Digital Video Retrieval Application
Tadeusz Czachorski, Polish Academy of Sciences, Poland
Stanislaw Jedrus, Polish Academy of Sciences, Poland
Maciej Zakrzewicz, Poznan University of Technology, Poland
Janusz Gozdecki, AGH University of Technology, Poland
Piotr Pacyna, AGH University of Technology, Poland
Zdzislaw Papir, AGH University of Technology, Poland

Delivering high-quality video content to customers is now expected to be one of the driving forces for the evolution of the Internet, as it can be deployed in many niches of the emerging e-market. This chapter presents an originally developed Video Retrieval application with its unique features, including a flexible user interface based on an HTTP browser for content querying and browsing, support for both unicast and multicast addressing, and user-oriented control of the QoS of video streaming in Integrated Services IP networks. The remaining part of the chapter is devoted to selected methods of information systems modelling required for predicting system performance and the influence of different control mechanisms on the quality of service perceived by end users.
DIGITAL VIDEO RETRIEVAL AND STREAMING
The global Internet is absorbing video. You no longer have to hurry home for the evening TV news, and you no longer have to record the sport competitions you are interested in: all important political, sport and entertainment events are on the Net, and more and more regular movies can be downloaded from the Net as well. Digital video storage, retrieval and delivery to customers was the only logical way for the Internet to evolve, as multimedia broadband services experts convince us (Cahners In-Stat Group [CISG], 2000; Riley & Richardson, 1997; Minoli, 1995). This statement is a straightforward consequence of the stunning, though still controversial, success of the digital sound format MP3. CISG (2000) reported that streaming media is taking off and will drive the market for servers delivering multimedia content from the edges of the Internet to high populations of potential customers. CISG (2000) also found that new technologies will make it possible for servers to greatly increase the number of simultaneously delivered streams. CISG (2000)
suggests as well that streaming media has the potential to make all Web sites more eye-catching to consumers and, therefore, to keep them at a site and attract new consumers.
Servers and software for digital media retrieval and streaming are sophisticated database systems tuned for the management of high-quality graphical or video content. Traditional database systems were used to store and manage alphanumeric data types. Databases that additionally contain images, text, audio and video are called multimedia databases (Khoshafian & Baker, 1996).
Video is a sequence of images of a real-life event, usually recorded with a video camera. In contrast to image data, video is time-dependent: the data has a meaningful interpretation only if presented with respect to a constantly progressing time scale. Video data items are described with a set of attributes like the width and height of each frame (in pixels), compression/file format, compression quality, frame/bit rates, video duration and color depth (bits/pixel). Video inherits all image content-based features, like keywords, color histograms and object shapes, but it also adds some specific features like object motion (spatiotemporal object movement), camera motion and representative frames (Flickner et al., 1995). A DBMS should provide a set of video manipulation methods, like:
- export/import to/from external data sources,
- cut (create a subset of the original video),
- real-time playback and recording: play forward, play backward, fast forward, set working point, remove working point, jump to working point.
The multimedia data types show drastically higher storage requirements compared to traditional alphanumeric data types. A single multimedia data item consumes kilobytes (simple JPEG-compressed images) to gigabytes (high-definition television) of memory, and usually data compression methods are employed to reduce the data size. This also implies the requirement for high data transfer rates in the case of time-dependent continuous media. The requirement to store and manage multimedia information in a database significantly influences DBMS implementation (Aberer & Klas, 1992; Adjeroh & Nwosu, 1997; Khoshafian & Baker, 1996). The most interesting topics of DBMS technology which are important for multimedia databases include data storage, transaction and query processing, and data protection.
The Video Retrieval Service is a derivative of Video-on-Demand, which is defined in the ATMF specification as an asymmetrical service that transfers digital, compressed and encoded video and audio information from a server (typically a video server) to a client (typically a set-top terminal, STT). At the consumer's set-top terminal, the streams are reassembled, uncompressed, decoded and presented on a display (ATM Forum, 1996a, p. 6). In a Video-on-Demand service, the user has a predetermined level of control over the selection of the video content as well as the time of viewing. It offers the functionality of a home VCR, including select/cancel, start, stop, pause (with or without frame freeze), fast forward and reverse. More advanced implementations may enable scanning forward or in reverse, setting and resetting memory markers, showing counters and jumping to different indexes. The Video-on-Demand service ensures synchronization of the audio and video streams as well as time base recovery.
It is likely to be used for entertainment and educational/training purposes, allowing subscribers to access a library of programs (e.g., movies) collected in a digital repository. Depending on the implementation, the user may have various levels of control over the video playback startup time and service type. It is assumed that Video-on-Demand will allow for the individual treatment of a user. This means that the user will be provided with means
to express his or her subject of interest and personal preferences. The selection of content and the setup of other parameters of a session will also be left to the user. Video users expect some kind of labels for video sequences: prose descriptions of the subject, content, duration and additional information. Such information may include the origins of a movie, its distributor, licensing terms and the license holder. Some extra information may also be necessary. Potential users often postulate an option of segmenting a film into several independent parts of variable length and the ability to play them back independently of each other, in larger groups, or as a whole. It is also suggested that individual parts be indexed to avoid time-consuming browsing in order to find a particular fragment suitable, e.g., for a given lecture. Indexing emphasizes the need for labels. Apart from distance learning, the possible areas for the Digital Video Library are e-commerce, the real estate market, telemedicine, administration with banking, entertainment and several others where playback of a queried high-quality video content is concerned. The whole setup could also be deployed as a demonstration kit when exposing the capabilities of broadband networking equipment.
VIDEO RETRIEVAL SERVER
The Video Retrieval server and related service components have been developed and tested in the AC362 "Broadband Trial Integration" (BTI) project with the financial support of the European Commission within the framework of the "Advanced Communications Technologies and Services" (ACTS) program (Andersen et al., 2000). The server is responsible for digital content acquisition, storage and making the content available to video content users.
The system has been built around the Oracle Video Server (OVS). Its main role is to store the video content and deliver video streams to clients in real time. OVS is equipped with mechanisms to control system resources, which prevent overload: the number of served clients is limited so that all of them receive an appropriate quality of service. The video recordings are stored on disks managed by the OVS. Although the OVS can support QoS guarantees applied to the storage system, it does not support bandwidth reservation and multicast connections on the network. These functions are performed by RSVP/Proxy modules located on both the server and the client sides. The reservation parameters (T-spec) are taken from the database. The main role of the database is to keep information about the available films and to provide search capabilities.
Internet Explorer acts as a front-end application. The displayed pages are formatted using an HTML layout, and data acquisition (for example, query criteria) is done via HTML forms. Video content (films) is displayed using the Oracle Video Player ActiveX control. Many other ActiveX controls are used to control the play process (START, STOP, RWD, FWD) or serve as dynamic objects (scrollbars, counters).
Components
The Video Retrieval server consists of the following (Gozdecki et al., 2000):
- video server,
- application server,
- video encoding station,
- primary and secondary video storage,
- content management system,
- database,
- content upload module.
In order to be able to support numerous users simultaneously, there is a need for a robust video server. A video server is dedicated hardware through which the content is made available for access as a stream of time-critical audio and video with a common time base. A stream is an active process of reading such a mix of data from the server's primary storage and delivering it through a distribution network to the user who requested the content. The video server is highly integrated and optimised hardware dedicated to the efficient delivery of such streams to concurrent video service users. This is achieved by appropriate resource allocation and scheduling of access to the data striped across primary storage disks (online storage). The video server has an internal video flow scheduler and manager and a built-in media pump for timely streaming of video content towards the video service users.
The video server works together with the application server. The application server is a computer system on which the Video Retrieval Service is running. The server interacts with service consumers (users) through a user interface. The application server has a session control module, an interaction module, a reservation system module, a video server control module and an application insert module. The session control module performs session setup, maintenance and tear-down. The interaction module provides a user interface to handle requests from users; processing a request may result in the initiation of a video flow through the video server control module. The server is also responsible for supplying textual and other data which complement the time-critical flows supported by the video server. The reservation system module enables scheduling of a video session in advance and reserving appropriate system resources. The video server control module is responsible for controlling the behaviour of the video server. The application insert module transfers the executable code of a Movies-on-Demand (MoD) application to a set-top terminal when a user decides to use the Movies-on-Demand service.
The video encoding station (VES) module is used for digital video content preparation. It comprises video sources and video encoders, allowing a service provider to prepare and upload his or her own content into the video server in addition to what is acquired from professional content providers. He or she may also want to perform simple nonlinear edits to introduce permanent insertions into the content.
Due to the high volume of video information, the video server may use multiple storage options for the video repository. Fast, random-access magneto-optical disks will be used as online video storage (primary storage). Video titles which are most frequently used are kept there. The devices should sustain high data transfer rates and ensure a high level of reliability. These requirements can be met by means of a RAID technique or a similar one which employs content striping and parity checks and provides redundancy. The disks should be hot swappable to avoid service outages during repair. Optionally, DVD changers may also be used here. The online video storage module should be co-located with the video server to ensure reliable feeding of video streams. Slower devices (e.g., tape drives) serve as secondary storage, used as a cost-effective extension to the primary storage.
They contain video titles that are less frequently used; these are reloaded to the online storage before video streaming begins. The primary role of the content management module is to enable management of the video content, related user data (e.g., textual descriptions, trailers, screenshots) and meta-data.
The database is a central repository for detailed information on the available video titles, their prose descriptions and the technical properties of the corresponding video files. It also contains the meta-data information required by the application server as well as by the video server module. The database structure is implementation dependent. The content upload module is responsible for communication with external content providers.
The modules of the Video Retrieval Service interact over implementation-specific interfaces. The data exchanged over the interfaces falls into three types: user data, meta-data and device control data. User data are video and data flows resulting from browsing the database. Meta-data are auxiliary data which facilitate (e.g., speed up) access to the requested piece of user information or provide an indication of the appropriate handling of such information. Device control data are commands issued to control individual devices and to read status information. For the modules of the video server residing on a single hardware platform, communication is performed by means of inter-process communication over a bus, crossbar or a hardware system of another type, as shown in Figure 1. The modules running on separate machines may be clustered or may communicate over the network interfaces.
The Video Retrieval Service provides a separate interface for content management (Gozdecki et al., 2000). It supports activities specific to a content manager, including content update, assignment of new titles to a category and subcategory, preparation of detailed descriptions, and making titles available for users. It also enables the definition of film fragments (episodes) within a film that can later be referred to by users. These episodes may then be enlisted in the user interface as fragments of a given movie, so that the user may access them directly, without the need to watch the whole film or to browse through it in order to find an episode of interest. This idea of advanced content indexing enables, for example, indexing the most interesting, spectacular or valuable film fragments.
[Figure 1: Internal structure of the Video Retrieval server, showing the video encoding station, content download, content management system, and online and secondary video storage connected by a device control channel.]
The content manager interface, a part of the application restricted to the content manager only, is used to fill the database with content descriptors (content labels). Among the descriptors are the default resource requirements of a video film, expressed in network-related terms, as well as application-related descriptors (the encoding type, video resolution, number of frames per second, etc.). The former are used by the application to reserve appropriate network resources during streaming, while the latter are intended to notify users about the basic characteristics of the content they demand.
User Interface
The Video Retrieval Service has been equipped with a front end which provides the user with a consistent user interface (Gozdecki et al., 2000). The interface is embedded into the Web browser. The browser runs on a typical PC in order to minimise the cost of a video terminal on the client side and thus make the service more affordable. The support for IP6 multicast and resource reservation is provided by a software module developed in the BTI ACTS AC362 project, due to the lack of commercial implementations available at the time of application development.
The interface allows users to choose a title from a list of available educational films and to perform video content browsing. Together with the content comes the description of a selected item, including general and supplementary information such as film duration, image resolution and others. The simple title selection feature has been enriched with a database query mechanism, which allows the scope of a search to be limited to a particular category or subcategory of films and enables keyword search. The user interface thus offers advanced search capabilities that allow the user to define his or her subject of interest; the search criteria may include film title, author, category/subcategory or keywords. When a user requests the playback of a film, methods to communicate network requirements are invoked and result in the reservation of appropriate resources along the video transmission path in the network.
The user interface differs for "active" and "passive" video clients (Gozdecki et al., 2000). The active video client is a person who can communicate with the video server, select a video for streaming and receive it. He or she decides what video will be sent over the network and has full control over the video: he or she can start/stop/pause/rewind it, etc. The active client is also entitled to decide on the resource reservation parameters. The active video client, when deciding on the video film to watch, initiates the setup of a point-to-multipoint session. At the very beginning he or she is the only client receiving the video flow destined for the session. Later on, other clients, so-called passive clients, may join via the network and thus enlarge the audience.
Passive video clients are those who have limited control over the video. They can get a list of running point-to-multipoint sessions and join the preferred one. Their functionality is similar to that of a person watching TV, who can choose a channel out of those currently "on air," but can neither modify the TV program nor become a broadcaster. Nevertheless, the passive video client can express his or her quality-of-service preference by modifying the initial QoS settings for the video he or she receives. He or she does this by invoking the resource reservation functions of the application in the regular way. The QoS request is then processed by the QoS management software that is part of the operating system. The request then traverses upstream along the video distribution path and is merged with other reservations, according to the rules of the RSVP protocol. The ability to differentiate the QoS between the members of a single session, though implemented in the application, is not supported in ATM networks, due to limitations of the current UNI
signalling standard (ATM Forum, 1996b). Every user of the application has a free choice to become an “active” client or a “passive” one. Each film may have an associated hypertext description. The description is displayed on the right panel of the screen. Below the description the user may find supplementary information about the particular film like its category and subcategory, video file name, total film length (playing time and byte size), etc.
QoS Support in the Video Retrieval Service
The implementation of mechanisms for increasing transmission efficiency and guaranteeing the quality of service in transport networks will influence the penetration of new multimedia services. Such mechanisms include the new IP protocol and models of IP networks with QoS. This section explains both issues as related to the Video Retrieval Service.
The QoS approach incorporated in the video retrieval system follows the Integrated Services (Controlled Load) model. The reservation functions constitute two kinds of software modules: one for the application server and one for the application client. Video is streamed using commercially available software with neither RSVP nor IP6 support; thus, suitable RSVP and IP6 modules were designed. These modules consist of (Figure 2):
- a reservation manager providing RSVP signalling with RSVP daemons on the server and client machines in order to reserve network resources for video server MPEG streams;
- a proxy module which adapts MPEG streams carried in IP4 packets to the IP6 network and translates addresses between these two kinds of networks.
The client modules are launched automatically on the user's machine when he or she starts the application. The reservation manager module on the server, being a sender of data, initiates an RSVP session. Whenever a user makes a choice and decides to watch video content, the RSVP modules communicate with each other and set up a reservation. The resource reservation is made in the following way (a sketch of the T-spec parameters involved is given after this list):
- The client loads a Web page from the application server. The page contains a number of fields to simplify the selection of the appropriate film. When the choice is made, the HTTP server runs, by means of the CGI interface, a process which sends a message with the appropriate T-spec parameters, read from the database, to the reservation manager. T-spec parameters describe the compressed audio/video stream and define the resources needed for its delivery with a corresponding QoS.
[Figure 2: Network architecture of the Video Retrieval Service. On the server side, a WWW server and video database, the Oracle Video Server, an RSVP manager and an IP4/IP6 proxy; on the client side, Internet Explorer 4.0 with a video player plug-in, an RSVP manager and an IP6/IP4 proxy. Session signalling and HTTP run over IP6 unicast (best-effort), while video flows as IP4 unicast behind the proxies and as IP6 unicast/multicast with RSVP between them.]
- The reservation manager transmits an RSVP Path message downstream to the application client. The source and destination addresses in the Path message are the same as those used for the data flow, so that the message defines the RSVP session. Along the way, each RSVP-enabled intermediate network element (router) stores state for the path.
- Upon reception of the Path message, the reservation manager running at the client node reads out the T-spec parameters and responds with an RSVP Resv (reservation) message. It does not modify the set of parameters proposed in the Path message. The reservation message traverses upstream to the previous RSVP-enabled hop, following exactly the reverse path back to the server. The flow descriptors in the Resv message define the reservation.
- The Resv message is forwarded hop-by-hop back towards the server until either it reaches the server, or the reservation is merged into an existing reservation, or it is discarded because of a shortage of resources at an intermediate node.
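The chapter does not reproduce the stored T-spec values. In the Integrated Services model a T-spec is conventionally a token bucket description, so a sketch of what the reservation manager might read from the database could look as follows; the field layout and the derivation from film bit rates are our assumptions:

#include <iostream>

// Standard IntServ T-spec parameters (token bucket). Whether the system
// stores exactly these fields per film is our assumption.
struct TSpec {
    double tokenRate;      // r: sustained rate of the stream [bytes/s]
    double bucketDepth;    // b: tolerated burst size [bytes]
    double peakRate;       // p: peak rate [bytes/s]
    int    minPolicedUnit; // m: smallest policed packet size [bytes]
    int    maxPacketSize;  // M: largest packet of the flow [bytes]
};

// Rough per-film T-spec for an MPEG stream: mean bit rate plus headroom
// for burstiness. The factors are illustrative defaults, not measured ones.
TSpec tspecForFilm(double meanBitRate, double peakBitRate) {
    TSpec t;
    t.tokenRate      = meanBitRate / 8.0;
    t.peakRate       = peakBitRate / 8.0;
    t.bucketDepth    = t.tokenRate * 0.5;  // ~0.5 s of burst tolerance
    t.minPolicedUnit = 64;
    t.maxPacketSize  = 1500;               // typical Ethernet MTU
    return t;
}

int main() {
    TSpec t = tspecForFilm(4.0e6, 8.0e6);  // 4 Mbit/s mean, 8 Mbit/s peak
    std::cout << "token rate: " << t.tokenRate << " B/s\n";
    return 0;
}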
COMPARISON WITH OTHER SYSTEMS
Today there are many commercially available video streaming engines, like the RealSystem Server from Real Networks (Real Networks, 2000), the QuickTime Streaming Server from Apple Computer (Apple Computer, 2000) or the Oracle Video Server used in the proposed Video Retrieval Service. The Video Retrieval Service is a complete platform for an IP network operator, covering browsing, retrieving and finally streaming a video to customers.
Other systems for video retrieval are worth mentioning as well. The platform developed at the University of Maryland (2000), based on a commercial video server, implements an original user interface and is proposed for use in a large network of distributed servers. Another example is a system from the Massachusetts Institute of Technology (Gemmell, Schooler & Kermode, 1998) that deploys a multicast addressing scheme to solve architecture scalability issues, and yet another is a video streaming system developed at Purdue University (2000) embedded in a distance learning platform. Halsall (2001) extracts the most important features of video delivery architectures over the Internet. The main features of the proposed Video Retrieval Service are a flexible user interface based on an HTTP browser, both unicast and multicast addressing styles and user-oriented control of QoS requirements in IP networks (Gozdecki et al., 2000).
The Video Retrieval application has been extensively tested by Polish and Danish telecom operators in a field trial incorporating broadband islands interworking over an ATM link (Andersen et al., 2000). Several conclusions were drawn about the further development work needed to convert the Video Retrieval application into a commercial service. In particular, software modules for session monitoring, authentication, authorization and accounting (AAA) have to be added.
The performance of the Video Retrieval Service was measured by qualitative methods (Rao & Hwang, 1996) and resulted in several interesting findings regarding the users' behavior and attitude to the system (Andersen et al., 2000). In general, the users were enthusiastic about the efficiency of learning with the aid of multimedia applications in a cooperative environment. One commonly stated advantage was that such an environment gives them much "freedom" in deciding about the subject, scope, depth and pace of learning. The user interface plays a dominant role in achieving end-user satisfaction and thus should be considered vital. A regular user has no idea about the underlying technical issues and is not willing to adapt his or her behavior to a network-oriented way of thinking.
QoS control has to be defined in such a way that it is clearly understood by a regular user, for example, by defining separate quality levels like "high," "moderate" or "low." Other means, like continuous-scale controls, raise a question for the user: "How much quality is enough?" The other issue, of how to map the discrete values into traffic contracts or profiles, is something that has to be solved by application/network experts. Users need to be motivated to understand the consequences of QoS differentiation, relating the quality they get to the cost they pay, regardless of whether the service provider really applies any tariffs or not. Otherwise, the user will always choose the "high" quality profile, ignoring the consequences of his or her choice.
INFORMATION SYSTEMS' MODELLING
The prediction of information system performance, and of the influence of different control mechanisms on the quality of service, requires the use of modelling. There are two major types of models: analytical and simulation (the latter more detailed but also requiring more computation). Simulation models use special software tools. Analytical models are based on queuing theory: network resources (transmission links) are represented as service stations, and the transmitted (and queued in switches) traffic (packets, cells) is represented as a stream of customers. There are numerous techniques to build and solve these models; the most useful ones are reviewed below.
Mean Value Analysis (MVA), presented in Haverkort (1999), makes use of simple relationships between a device's mean response time (queuing time plus service time) R, its throughput X (the number of jobs completed per time unit), its mean service time B, the mean number Q of customers present in the system (waiting or in service) and the device utilisation (fraction of time the device is busy) U:
$$U = XB, \qquad Q = \frac{U}{1-U}, \qquad R = \frac{B}{1-U}$$
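A small sketch applying these three relations to a single service station; the numbers in main are arbitrary examples:

#include <iostream>

// Apply the MVA relations U = X*B, Q = U/(1-U), R = B/(1-U)
// for a single service station, valid for U < 1.
struct DeviceMetrics { double U, Q, R; };

DeviceMetrics mva(double throughputX, double meanServiceB) {
    DeviceMetrics m;
    m.U = throughputX * meanServiceB;  // utilisation
    m.Q = m.U / (1.0 - m.U);           // mean number of customers
    m.R = meanServiceB / (1.0 - m.U);  // mean response time
    return m;
}

int main() {
    // e.g., 80 requests/s against a 10 ms mean service time
    DeviceMetrics m = mva(80.0, 0.010);
    std::cout << "U=" << m.U << " Q=" << m.Q << " R=" << m.R << " s\n";
    return 0;  // U=0.8, Q=4, R=0.05 s
}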
Simplicity is the advantage of this method; its disadvantage is that only mean values are considered, whereas the influence of the higher moments of the interarrival and service time distributions on the performance of a queuing system is important. For transient-state analysis the fluid flow approximation (Sharma & Tipper, 1993) can be used. The approximation uses a first-order differential equation for the system averages:

$$\frac{d\bar{Q}(t)}{dt} = f_{in}(t) - f_{out}(t)$$

where $\bar{Q}(t)$ is the average number of customers in the system and $f_{in}(t)$, $f_{out}(t)$ are the average flows into and out of the system. The function $f_{out}(t)$ is itself expressed in terms of $\bar{Q}(t)$ with the use of the Pollaczek formula (Sharma & Tipper, 1993), so the above equation has an implicit form and should be solved numerically.
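As an illustration of such a numerical solution, the sketch below integrates the equation with a forward Euler step. The outflow closure mu * Q/(Q+1) is a deliberate simplification standing in for the chapter's Pollaczek-based expression:

#include <cstdio>

// Forward-Euler integration of dQ/dt = f_in(t) - f_out(t).
// The outflow is modelled simply as mu * Q/(Q+1) (server utilisation
// approximated from the mean queue); this is not the chapter's closure.
int main() {
    double Q = 0.0;          // mean number of customers, Q(0) = 0
    const double mu = 100.0; // service rate [customers/s]
    const double dt = 1e-4;  // time step [s]
    for (int step = 0; step < 200000; ++step) {
        double t = step * dt;
        double fin = (t < 10.0) ? 80.0 : 120.0;  // arrival rate jumps at t=10 s
        double fout = mu * Q / (Q + 1.0);
        Q += dt * (fin - fout);
        if (step % 50000 == 0)
            std::printf("t=%5.1f  Q=%.3f\n", t, Q);
    }
    return 0;  // Q settles near 4 while stable, then grows under overload
}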
dQ (t ) = f in (t ) − f out (t ) dt where Q (t ) is the average number of customers in the system, f in (t ) , f out (t ) are average flows in and out of the system. The function f out (t ) is itself expressed by Q (t ) with the use of Pollaczek formula (Sharma and Tipper, 1993) so the above equation has an implicit form and should be solved numerically. Another approach expresses the behaviour of the analysed system by a continuous or discrete time Markov chain having states corresponding to the states of the considered system. The system of linear equations relating state probabilities (they are differential equations in transient-state analysis and algebraic ones in a steady-state analysis) is built and solved. Markov models are able to express various synchronization constraints related to control mechanisms in a network, but their application encounters several numerical
problems, such as model sizes that easily exceed 50,000 or 100,000 states, and ill-conditioning or stiffness of the equations corresponding to the model. A considerable effort has already been made to overcome these problems. Explicit differential solution methods (Runge-Kutta), special stable implicit methods, and a uniformization (randomization) method based on the reduction of continuous-time Markov chains to a discrete-time Markov chain subordinated to a Poisson process have been proposed and tested. Some efficient approximate methods based on the use of Krylov subspaces (the original matrix of an infinitesimal generator is projected onto a Krylov subspace; the new matrix has the same eigenvalues, but its dimension is considerably smaller) have also been applied. Nevertheless, this approach is still a challenge. A monograph (Stewart, 1994) presents an excellent review of these problems.
Diffusion approximation is more detailed than MVA and the flow approximation: the use of the two first moments of the interarrival and service time distributions reduces errors, and the computational effort related to diffusion approximation is considerably smaller than in the numerical analysis of Markov chains. We briefly recall the principles of the diffusion approximation (Gelenbe, Mang & Feng, 1996). Let A(x) and B(x) denote the general interarrival and service time distributions in a service station, respectively. Their means and variances are:
$$E[A] = \frac{1}{\lambda}, \quad E[B] = \frac{1}{\mu}, \quad D^2[A] = \sigma_A^2, \quad D^2[B] = \sigma_B^2$$
Squared coefficients of variation are equal to $C_A^2 = \sigma_A^2 \lambda^2$ and $C_B^2 = \sigma_B^2 \mu^2$. In practice, we are not able to determine the distribution $p(n,t)$ of the number $N(t)$ of customers present in a queue at time $t$. The diffusion approximation replaces the process $N(t)$ by a diffusion process $X(t)$. The diffusion equation

$$\frac{\partial f(x,t;x_0)}{\partial t} = \frac{\alpha}{2}\frac{\partial^2 f(x,t;x_0)}{\partial x^2} - \beta\frac{\partial f(x,t;x_0)}{\partial x}$$

is to be solved with properly chosen parameters and initial and boundary conditions to obtain the density function $f(x,t;x_0)$ of the process that approximates the queue length distribution. For G/G/1 and G/G/1/N queues, to ensure the same mean and variance of changes for both processes $N(t)$ and $X(t)$, the values of the parameters $\alpha$, $\beta$ are chosen as

$$\alpha = \sigma_A^2 \lambda^3 + \sigma_B^2 \mu^3 = C_A^2 \lambda + C_B^2 \mu, \qquad \beta = \lambda - \mu.$$

The boundaries limit the process $X(t)$ to the interval $[0, N]$. When the process comes to $x = 0$, it remains there for a time, which corresponds to the idle time of the system, and then jumps to $x = 1$; when it comes to $x = N$, it stays there for a time during which the queue is full and then jumps to $x = N - 1$. The jumps are called instantaneous return processes. The pdf $f(x,t;x_0)$ of the above process may be expressed as a superposition of densities $\varphi(x,t;x_0)$ of a diffusion process with another kind of boundary condition: absorbing barriers placed at $x = 0$ and $x = N$, where the process finishes when it reaches one of them. The solution is obtained in terms of Laplace transforms and then numerically inverted (Czachorski, 1993). This transient solution assumes that the parameters of the model are constant. In a network of queues, however, the output flows of stations change
continuously; hence the input parameters of each station are also changing during a transient period. We are obliged to discretize these changes and keep the parameters constant within a relatively small time interval $\Delta t$. The transient solution at the end of each interval $\Delta t$ allows one to determine $\rho(t)$ and then $\lambda(t)$ and the parameters of the output stream. This solution also serves as the initial condition for the solution in the next interval: for the $n$-th interval, $t \in [(n-1)\Delta t, n\Delta t]$, the density function of the diffusion process of station $i$ is $f(x,t;\psi_n(x))$, where $\psi_n(x) = f(x, t=(n-1)\Delta t; \psi_{n-1}(x))$. This approach allows us to take into consideration the autocorrelation and self-similarity of input streams. Control decisions can also be included in the model and changed every $\Delta t$ according to the current congestion.
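The chapter's transient solution is analytical (Laplace transforms inverted numerically). Purely as an illustration of the diffusion equation's behaviour, the same equation can also be stepped with an explicit finite-difference scheme; the sketch below handles the barriers only crudely and does not reproduce the instantaneous return processes:

#include <cstdio>
#include <vector>

// Explicit finite-difference step for
//   df/dt = (alpha/2) d2f/dx2 - beta df/dx   on [0, N],
// started from a point mass at x0 = 5. Barrier handling is crude
// (simple reflection plus renormalisation); the chapter's
// instantaneous return processes are not modelled.
int main() {
    const int    N = 50;                   // queue capacity
    const double lambda = 0.8, mu = 1.0;
    const double CA2 = 1.0, CB2 = 1.0;     // squared variation coefficients
    const double alpha = CA2 * lambda + CB2 * mu;
    const double beta  = lambda - mu;
    const double dx = 1.0, dt = 0.2;       // stable: (alpha/2)*dt/dx^2 = 0.18 < 0.5
    std::vector<double> f(N + 1, 0.0), g(N + 1, 0.0);
    f[5] = 1.0;                            // initial condition: x0 = 5

    for (int step = 0; step < 500; ++step) {
        for (int i = 1; i < N; ++i)
            g[i] = f[i]
                 + dt * (0.5 * alpha * (f[i+1] - 2*f[i] + f[i-1]) / (dx*dx)
                         - beta * (f[i+1] - f[i-1]) / (2*dx));
        g[0] = g[1]; g[N] = g[N-1];        // crude reflecting boundaries
        double sum = 0.0;
        for (double v : g) sum += v;
        for (double& v : g) v /= sum;      // renormalise to a density
        f = g;
    }
    double mean = 0.0;
    for (int i = 0; i <= N; ++i) mean += i * f[i];
    std::printf("approx. mean queue length: %.2f\n", mean);
    return 0;
}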
Source Modeling
Depending on the planned model complexity, only basic characteristics such as a mean value or a variance of the observed traffic intensity may need to be evaluated. More complicated models need spectral analysis results (autocorrelation and spectral density functions) for their development. In the case of network traffic or an MPEG stream, self-similarity analysis appears to be more descriptive than spectral analysis (Park & Willinger, 2000). On the basis of the computed characteristics, various models can be developed (e.g., Garret & Willinger, 1994; Meier-Hellstern & Fischer, 1992; Erramilli & Wang, 1995), differing in their complexity and thus in their approximation accuracy.
The traffic description is given as subsequent samples $X_i$ of traffic intensity at equispaced time instants. The mean traffic intensity is equal to:
$$E[X] = \frac{1}{N}\sum_{i=1}^{N} X_i.$$
The traffic intensity variance is then given by:
$$D^2[X] = \frac{1}{N-1}\sum_{i=1}^{N} \left(X_i - E[X]\right)^2.$$
Spectral analysis introduces two more functions describing traffic intensity series. The autocorrelation function is defined by the following formula:
$$ACR_X(k) = \frac{1}{N-|k|}\sum_{i=1}^{N-|k|} X_i X_{i+|k|}.$$
The power density is defined as the Fourier transform of the autocorrelation function:
$$S_X(\omega) = \frac{1}{N}\sum_{k=-K}^{K} ACR_X(k)\cos(k\omega).$$
The autocovariance function, computed as the autocorrelation function of the centred time series, is often more convenient than the autocorrelation function $ACR_X(k)$:

$$ACV_X(k) = \frac{1}{N-|k|}\sum_{i=1}^{N-|k|} \left(X_i - E[X]\right)\left(X_{i+|k|} - E[X]\right).$$
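These estimators translate directly into code. A small sketch computing the mean, the variance and the autocovariance from a sampled intensity series (all function names are ours):

#include <cstdio>
#include <vector>

// Sample mean E[X] over the intensity series, as defined above.
double mean(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v;
    return s / x.size();
}

// Sample variance D2[X] with the 1/(N-1) factor used above.
double variance(const std::vector<double>& x) {
    double m = mean(x), s = 0.0;
    for (double v : x) s += (v - m) * (v - m);
    return s / (x.size() - 1);
}

// Autocovariance ACV_X(k) at non-negative lag k for the centred series.
double autocov(const std::vector<double>& x, int k) {
    double m = mean(x), s = 0.0;
    int n = static_cast<int>(x.size()) - k;
    for (int i = 0; i < n; ++i) s += (x[i] - m) * (x[i + k] - m);
    return s / n;
}

int main() {
    std::vector<double> x = {3, 5, 4, 6, 8, 7, 5, 4, 6, 9};
    std::printf("E[X]=%.2f D2[X]=%.2f ACV(1)=%.2f\n",
                mean(x), variance(x), autocov(x, 1));
    return 0;
}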
In most cases the traffic observed in networks (on any level of the time scale) belongs to a specific class of statistical processes, namely self-similar processes, with a power-law autocorrelation function

$$\lim_{k \to \infty} ACR_X(k) \approx c_1 k^{-\beta},$$

where $c_1$ is a constant factor and $\beta$ determines the shape of the curve. The $\beta$ coefficient is related to the so-called Hurst exponent:

$$H = 1 - \frac{\beta}{2}.$$
The Hurst exponent is a kind of long-time dependence measure. The value 0.5 indicates that the analyzed data is not long-time dependent. Values in (0.5, 1) mean that the data exhibits long-time dependence (the larger H is, the longer the trends), as stated by Leland, Willinger, Taqqu and Wilson (1993). Another property of a self-similar process is a power-law decay of its variance with the time scale. A given process X viewed at a specific time scale m is defined as:
$$X_k^{(m)} = \frac{1}{m}\sum_{j=0}^{m-1} X_{k \cdot m + j}.$$
The above-mentioned property is described with the following formula:
$$\lim_{m \to \infty} D^2\left[X^{(m)}\right] \approx c_2 m^{-\beta}.$$
The autocorrelation function of the self-similar process also exhibits a power-law decay:
$$\lim_{m \to \infty} ACR_{X^{(m)}}(k) = c_3 m^{-\beta}.$$
The parameter $\beta$ in the above formulae can take values from the interval (0, 1); the related Hurst exponent H then takes values from the interval (0.5, 1), and H = 0.5 indicates autocorrelation equal to zero.
The characteristics of network traffic influence the performance of transmission systems, especially those in which the dispersion of packet delays is crucial. According to the specifications, the most important QoS parameters for the transmission of information are the Cell/Packet Loss Ratio, the transfer delay and the delay variation. Assuming that the communication channel is stable, the mechanisms are well designed and no extraordinary conditions take place, loss of information should be exceptional and thus irrelevant. An absolute transfer delay is relevant mostly in the case of interactive systems, where a user expects some bounds on the system reaction time; this can be of special importance in the case of teleconferences. However, in the case of the Video Retrieval System presented here, the delay variation is the most important QoS parameter. The delay variation introduced by network mechanisms has to be eliminated in the receiver by buffering techniques. The size of this buffer depends on the probability density function of the transfer delay and should be assigned
to ensure minimum probability of overflow due to a large delay variation.
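Dimensioning such a playout buffer amounts to choosing a quantile of the measured delay distribution: the buffer must absorb the delay variation up to the (1 - epsilon) quantile, where epsilon is the tolerated probability of a late packet. A sketch with invented sample data:

#include <algorithm>
#include <cstdio>
#include <vector>

// Given measured per-packet transfer delays [s], return the playout
// buffer depth (in seconds of delay variation) that is exceeded with
// probability at most eps: the (1 - eps) quantile of delay minus the
// minimum delay.
double bufferDepth(std::vector<double> delays, double eps) {
    std::sort(delays.begin(), delays.end());
    size_t idx = static_cast<size_t>((1.0 - eps) * (delays.size() - 1));
    return delays[idx] - delays.front();   // delay variation to absorb
}

int main() {
    // Illustrative samples only; a real design would use a measured
    // or modelled transfer-delay distribution.
    std::vector<double> d = {0.020, 0.022, 0.021, 0.035, 0.024,
                             0.023, 0.050, 0.022, 0.026, 0.021};
    double depth = bufferDepth(d, 0.1);    // tolerate 10% late packets
    std::printf("buffer must absorb %.0f ms of delay variation\n",
                depth * 1000.0);
    return 0;
}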
CONCLUSIONS
Delivering high-quality video content to customers is now expected to be one of the driving forces for the evolution of the Internet, as it can be deployed in many niches of the emerging e-market. However, bare video streaming is a rather coarse service; it has to be fitted with additional features like retrieving, browsing and even composing that enable regular users to manipulate the content in a friendly environment. According to the commonly adopted rules of component-based software engineering, the Video Retrieval Service is built around a video streaming engine.
Performance evaluation of computer systems and networks provides methods and tools to design efficient mechanisms suitable for multimedia networking. The presented methods include both source and system models, allowing for complex performance evaluation and parameter estimation for modern multimedia applications. One of the most important aspects of streaming application design is buffer dimensioning: in the case of multimedia systems, the buffering techniques should trade off low delay against the smoothing of traffic intensity.
REFERENCES
Aberer, K. and Klas, W. (1992). The impact of multimedia data on database management systems. Technical Report TR-92-065, ICSI, Berkeley, CA, USA.
Adjeroh, D. A. and Nwosu, K. C. (1997). Multimedia database management: Requirements and issues. IEEE Multimedia, July-September, 24-33.
Andersen, N. E., Azcorra, A., Berger, M., Bertelsen, E., Carapinha, J., Hvass, P., Kjaergaard, J. K., Maliszewski, J. and Papir, Z. (2000). Applying QoS control through integration of IP and ATM. IEEE Communications Magazine, 38(7), 130-136.
Apple Computer. (2000). QuickTime Streaming Server. Retrieved on the World Wide Web: http://www.apple.com/quicktime/products/qtss/, 15th March 2001.
ATM Forum. (1996a). AudioVisual Multimedia Services: Video-on-Demand Specification 1.1.
ATM Forum. (1996b). User-Network Interface Specification version 4.0.
Cahners In-Stat Group. (2000). Streaming Media Servers: Rich Content Via Internet Protocol [Report MB00-13VS]. Retrieved on the World Wide Web: http://www.instat.com/CATALOG/cat-dt.htm, 15th March 2001.
Czachorski, T. (1993). A method to solve diffusion equation with instantaneous return processes acting as boundary conditions. Bulletin of Polish Academy of Sciences, 41(4).
Erramilli, A. and Wang, J. (1995). An application of deterministic chaotic maps to model packet traffic. Queueing Systems, 20, 171-206.
Flickner, M., et al. (1995). Query by image and video content: The QBIC system. IEEE Computer, September, 23-32.
Garrett, M. W. and Willinger, W. (1994). Analysis, modeling and generation of self-similar VBR video traffic. Proceedings of the ACM Sigcomm, 269-280.
Gelenbe, E., Mang, X. and Feng, Y. (1996). Diffusion cell loss estimates for ATM with multiclass bursty traffic. Computer Systems - Science and Engineering, Special Issue: ATM Networks, 11(6), 325-334.
Gemmell, J., Schooler, E. and Kermode, R. (1998). A scalable multicast architecture for one-to-many telepresentations. IEEE Multimedia Systems 98, Austin, TX.
Gozdecki, J., Pacyna, P., Papir, Z., Stankiewicz, R. and Szymanski, A. (2000). Network-based digital video library system. 10th International Packet Video Workshop "Packet Video 2000."
Halsall, F. (2001). Multimedia Communications. Addison-Wesley.
Haverkort, B. R. (1999). Performance of Computer Communication Systems. John Wiley & Sons Ltd.
Khoshafian, S. and Baker, A. B. (1996). Multimedia and Imaging Databases. Morgan Kaufmann Publishers.
Leland, W. E., Willinger, W., Taqqu, M. S. and Wilson, D. V. (1993). On the self-similar nature of Ethernet traffic. Proceedings of the ACM Sigcomm, 183-193.
Meier-Hellstern, K. S. and Fischer, W. (1992). The Markov-modulated Poisson process (MMPP) cookbook. Performance Evaluation, 18, 149-171.
Purdue University. (2000). Multimedia Support Infrastructure. Retrieved on the World Wide Web: http://www.cs.purdue.edu/msi/site-visit/13-enh.ppt.
Minoli, D. (1995). Video Dialtone Technology. McGraw-Hill, Inc.
Park, K. and Willinger, W. (2000). Self-Similar Network Traffic and Performance Evaluation. John Wiley & Sons Inc.
Rao, K. R. and Hwang, J. J. (1996). Techniques and Standards for Image, Video and Audio Coding. Prentice Hall.
Real Networks. (2000). RealSystem Server. Retrieved on the World Wide Web: http://www.realnetworks.com, 15th March 2001.
Riley, M. J. and Richardson, I. E. G. (1997). Digital Video Communications. Artech House.
Sharma, S. and Tipper, D. (1993). Approximate models for the study of nonstationary queues and their applications to communication networks. ICC Geneva, 352-358.
Stewart, W. J. (1994). Introduction to the Numerical Solution of Markov Chains. Princeton University Press.
University of Maryland. (2000). Office of Academic Computing Services. Retrieved on the World Wide Web: http://video.bsos.umd.edu/main/viewing.htm, 15th March 2001.
Chapter IX
Network Dimensioning for MPEG-2 Video Communications Using ATM

Bhumip Khasnabish
Verizon Labs, Inc., USA
This chapter discusses various issues related to the shaping of Motion Picture Experts Group (MPEG) video for generating constrained or controlled variable bit rate (VBR) data streams. MPEG-2 defines a set of standards for coding and compression of digital video. VBR video can offer constant picture quality without incorporating too much processing overhead in the network or transmission system's architecture. In addition, it can offer substantial (20% to 50%) savings in both storage and transmission bandwidth requirements compared to constant bit rate (CBR) video. Either source coding or shaping of the encoder's output, or a combination of both, can be used for adapting MPEG-2 video streams for transmission over a real-time VBR (rt-VBR)-type asynchronous transfer mode (ATM) channel. For experimental purposes, the VBR video traces are produced by defining peak and average bit rates over the group of pictures (GOP) and the entire clip, respectively. Specification and development of the traffic contract parameters for transmission of VBR MPEG-2 video using the rt-VBR-type ATM service are then presented. The traffic parameters are determined for (i) different values of the sustained cell rate (SCR), where SCR is varied from the average rate to within a few percent of the peak cell rate (PCR), and (ii) a number of different 'averaging intervals,' ranging from one frame (i.e., 1/30th of a second for 30 frames/sec video) to an entire GOP (e.g., 0.5 sec when the GOP length is 15 frames in a 30 frames/sec video). The burstiness of MPEG-2-encoded video streams varies widely depending on the category of video encoded. The 'averaging interval' has a significant impact on the values of the traffic contract parameters. Typically, values of the maximum burst size (MBS) are higher and PCR values are lower when traffic is averaged over an entire GOP structure as opposed to averaging over a few frames. As the SCR value is increased from the average value, the MBS values decrease and reach a minimum asymptote. The decrease is sharpest when the rates are averaged over a few frames as opposed to an entire GOP. PCR/SCR ratios
are highest for rates averaged over a sub-GOP. It is possible to determine the conforming effective bandwidth (EBW) or equivalent capacity (EC) needed for transmitting the VBR video while maintaining the same perceptual quality as CBR but with less transmission bandwidth. This can be achieved using the current configurations of commercially available ATM switches. The results presented in this chapter can be utilized not only for network and nodal (buffer) capacity engineering, but also for delivering user-defined quality of service (QoS) to the customers. They are very useful for cost-effective design, engineering and operations of video service offerings using wireline (e.g., HFC, xDSL and FTTN/C) and wireless (e.g., L/M-MDS) networks. For example, when CBR video is used, bandwidth allocation is usually performed at approximately 6 Mbps per video stream, but for VBR video the effective bandwidth requirement per stream could be as low as 3 Mbps. This directly translates into approximately doubling the transmission capacity of the video delivery system.
INTRODUCTION
A large number of emerging telecommunications and video service providers are currently planning to offer digital video services1 to both residential and business customers. For any real-time service like voice and video, the application-level quality of service (QoS) requirements are more stringent compared to those for non-real-time services like data and image transmission, where it is possible to retransmit the erroneous segment(s) of a file. Real-time full-motion digital video (Chariglione, 1997) is bursty in nature, and the burstiness depends on the frequency of changes in the background and the movements of the foreground objects. Therefore, without rate control the output bit stream of a video encoder will be of variable rate, since it depends on the complexity of the scene, the degree of motion and the number of scene changes. Although it is possible to statistically multiplex the bursty sources to achieve bandwidth savings, uncontrolled burstiness may lead to wastage of both storage and transmission bandwidth. Since uncompressed real-time entertainment-quality video is bursty and of high bit rate (on the order of tens of Megabits/sec), it is not unlikely that at times excessively high processing, storage (buffering) and transmission capacities are demanded from the network. However, both dynamic allocation of bandwidth and use of exceptionally high bandwidth are expensive propositions for cost-effective delivery of entertainment video to the masses. These considerations motivated the early implementers and providers of digital video services to use a scaled-down (compressed) and constant bit rate channel for delivering real-time video to the customers, even if it meant acceptable (instead of high) quality of video, and inefficient utilization of the channel. The digital video encoder manufacturers accordingly implemented additional bit stuffing and a rate-control buffer in the encoder so as to generate a constant bit rate stream for transmission and distribution applications. This is achieved by using (i) coarse quantization, which generates a smaller number of bits per picture, when excessive scene changes or motion occur, or (ii) finer quantization and/or bit stuffing when fewer than the desired number of bits per picture are produced. These rate adaptations not only lead to variable video quality, but also at times may cause wastage of bandwidth because the encoder may use stuffing bits to maintain the constant bit rate. Now, if the burstiness of the video can be controlled using either some pre-specified source coding constraints and/or some constraints as dictated by the transmission service category, e.g., ATM supports both constant and variable bit rate channels (ATM Forum,
1996), without sacrificing the video quality, it may be possible to achieve substantial savings on the usage of storage and networking resources. In addition, it would be possible to offer constant-quality video instead of the variable-quality video which most of the service providers distribute/deliver today. Savings in storage resources, i.e., server capacity, imply that more movies or video sequences can be stored using the same capacity. Savings in networking resources imply that the same amount of switching, buffering and transmission capacity can deliver larger amounts of video content to the customers while maintaining the application (here, video) level QoS requirements. We discuss these and related issues in this chapter. The remainder of this chapter is organized as follows. In the next section, architectures of various residential video delivery systems are briefly described. Because of the hierarchical (or layered) approach used in MPEG-2, which defines a set of standards used for coding and compression of digital video, it is possible to achieve flexible (i.e., varying bit rate and quality) coding of digital video; these capabilities are briefly mentioned. The techniques for generating variable bit rate (VBR) MPEG video streams are then presented, together with applications of and motivations for using VBR video. Transmission, switching, and switch/multiplexer buffer requirements for transmission of individual and/or a set of multiplexed VBR video streams are described next. Finally, recommendations and some concluding remarks are presented.
RESIDENTIAL VIDEO DELIVERY SYSTEMS' ARCHITECTURES
Traditional (analog) video service providers deliver analog video to residential customers using Cable TV networks or direct broadcast satellite (DBS2) technology, mainly for entertainment purposes (Khasnabish, 1997; Kwok, 1998). Digital representation of video opens up the possibility of computerized (or automated) processing of multimedia information, allowing consumers to archive, index and retrieve programs in a content-based manner. Digitally encoded information is also (or can be made) much more resilient to degradation during transmission, distribution and storage, giving the consumer better picture and sound quality. This makes it possible for the consumer to interact with the audio-visual programming in ways that were not possible with one-way analog broadcast systems. Digital transmission of audio-visual information also enables multimedia communication over the same network that supports voice/data communication (e.g., POTS or PSTN) using various digital subscriber line technologies (xDSLs). In the next two sections, we discuss these and related issues from the points of view of network and transport architectures. Details of these architectures and their cost-effectiveness (i.e., the payback period) can be found in Khasnabish (1997).
Traditional Systems
Most of the traditional residential video delivery systems deliver one-way analog video using either full coaxial cable (CATV) or hybrid fiber-coax (HFC) networks. A typical HFC network is shown in Figure 1. These networks offer very little or no capability for real-time interaction with the service provider's facilities for content selection. For wireless delivery of video, cost-effective small dish antennas for receiving DBS signals started appearing in the market only recently. Although they were originally intended for delivery of analog/digital video to customer premises, at least one company (e.g.,
DirecTV/DirecPc, see http://www.direcpc.com/, 2001, for details) is trying to utilize these for interactive delivery of digital video and Internet information services. Figure 2 presents a high-level description of one such architecture. To support interactivity, such systems can use the public switched telephone network (PSTN) as the return path from the customer's premises. All of the above systems are based on shared-media communication. A number of problems exist in shared-media architectures: security, reliability and billing cannot be easily guaranteed for individual users; infrastructural support for delivering integrated telephone, data and video services cannot be easily achieved; and operation, administration, management and provisioning (OAM&P) technologies are not as mature as those used in PSTN networks. Traditional CATV networks are designed to offer low-cost unidirectional 6 MHz bandwidth for analog TV services to the home over coaxial cable. A tree-and-branch architecture with a series of unidirectional amplifiers (to improve signal strength and reliability) is commonly used in such networks. The wireline option for upgrading CATV networks is probably the HFC architecture (Figure 1), which uses an all-optical backbone network to interconnect the head-ends. From the head-end, fiber links run to regional distribution centers which just provide opto-electric conversion; after that, it is all coaxial cable in the local feeder loop. Traditional analog broadcast services occupy the spectrum between 50 MHz and 550 MHz for about 80 channels, where each channel uses 6 MHz of bandwidth. For example, using 8-bit/Hz QAM or 16-VSB modulation, a 6 MHz channel could deliver 38 Mbps of data (the remaining bits are used for error
Figure 1: Architecture of Hybrid Fiber Coax (HFC) networks. TV programs are received at the headend, carried over a fiber network (fiber links of <25 km) to optical network units (ONUs), and then over coaxial cable (~3 km, <500 homes per ONU) together with the analog TV program and power feed to the homes.
correction and control) to the home (Khasnabish, 1997). If constant bit rate (e.g., 6 Mbps) digital video is used, a maximum of six video streams can be accommodated in one analog channel. The video stream multiplexing capability of an analog channel would increase significantly if variable bit rate video were used. And, since some of the leading Telecom service providers are currently deploying HFC networks for video delivery services, it is important to study these in detail so as to widen viewers' choice of TV programs/content. Other options for upgrading the video distribution networks would be to use wireless channels to the home. Two possible techniques for high-quality (digital) video distribution and for offering digital video and fast Internet access using low-power, high-frequency radio signals over short to medium distances are LMDS and MMDS, as described in Table 1. Investments in these systems could be incremental, in the sense that they can grow as new customers are added to the system. Using 6-bit/Hz QAM, up to 28 Mbps of data can be accommodated in a standard 6 MHz analog channel, which is sufficient for transmitting four (each with a 6 Mbps rate) MPEG-2-based digital video streams. When variable bit rate MPEG-2 encoding is used, each MPEG-2 video stream could be as low as 3 Mbps, and up to nine streams can be accommodated in a standard 6 MHz analog channel. Thus, when VBR video is used, it is possible to accommodate more than 200 digital video streams over 33 analog TV channels in the downstream; the use of VBR digital video thereby widens viewers' choice of TV programs/content (a short calculation follows the next paragraph). General technical challenges associated with wireless video delivery systems are as follows: determining proper heights of transmitter and receiver antennas; dynamically
Figure 2: Interactive video delivery using a wireless distribution hub and PSTN networks. TV programs received at the service provider's headend are uplinked to a distribution hub (xEO satellite, L/M-MDS hub, WLL hub, etc.) and downlinked to a transceiver at the customer premises; an Internet service provider's intranet and the PSTN provide the (additional) return path.
adjusting transmitter power and/or activating multiple transmitters to ensure complete coverage of the service area with adequate signal strength at the receiver during adverse transmission conditions, e.g., heavy rainfall, branches of trees and leaves in the line-of-sight path, etc.
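The stream-packing arithmetic quoted above can be reproduced in a few lines of Python. This is a minimal sketch: the 28 Mbps channel payload, the per-stream rates and the 33-channel downstream are taken from the text, and the function and constant names are ours.

```python
def streams_per_channel(channel_mbps, stream_mbps):
    """How many digital video streams fit in one analog channel's payload."""
    return int(channel_mbps // stream_mbps)

CHANNEL_MBPS = 28          # ~6-bit/Hz QAM payload of a 6 MHz channel (from the text)
DOWNSTREAM_CHANNELS = 33   # MMDS downstream analog channels (from Table 1)

cbr = streams_per_channel(CHANNEL_MBPS, 6.0)   # 4 CBR streams per channel
vbr = streams_per_channel(CHANNEL_MBPS, 3.0)   # 9 VBR streams per channel
print(f"CBR: {cbr * DOWNSTREAM_CHANNELS} streams; VBR: {vbr * DOWNSTREAM_CHANNELS} streams")
# CBR: 132 streams; VBR: 297 streams -- i.e., "more than 200" with VBR video
```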
Other Systems
The Telecommunications Act of 1996 allows the plain old telephone service (POTS) providers to offer broadband services, such as broadcast video, high-speed Internet access and video conferencing, in addition to traditional narrow band services (such as wired and wireless telephony). These new digital multimedia services require significantly more bandwidth than is necessary for POTS, so the POTS providers are trying to increase the switching and transmission capacities of their networks. One way to achieve this is to use various types of digital subscriber line (xDSL) modem technologies, which enable high-bit-rate data transmission over the existing copper plant (subscriber loop). Traditional Telecom service providers have conducted a number of field trials using ADSL modems for delivering broadband data to homes and corporations. It is also possible to use this technique for digital video delivery services. ADSL can support a bit rate of around 6 Mbps from the Central Office (CO) to the user terminal and around 1.0 Mbps from the user terminal to the CO over a distance of less than 10 km. HDSL uses a 2B1Q-based modulation technique to enable normal twisted-pair telephone lines to handle a full T1 payload (1.536 Mbps) using two pairs of wires over a distance of around 5 km. VDSL can deliver data at the OC-1 rate (51.84 Mbps) to the home using twisted-pair telephone lines over a distance of 0.50 km; VDSL may also be used for switched digital video services in both POTS and HFC networks. RADSL can deliver data at the OC-1 rate (51.84 Mbps) to the home using twisted pair or coaxial cable over a distance of around 3 km.
Table 1: Video distribution using wireless transmission technologies (each analog TV channel is 6 MHz)

Direct Broadcast Satellite (DBS). Frequency band: Ku-band, 11-18 GHz. Forward/reverse path bandwidth: 500 MHz downstream (11.7-12.2 GHz) / 500 MHz upstream (14.0-14.5 GHz). Channel capacity: 150 digital channels. Area coverage: thousands of km radius, using a ~50 cm home dish antenna.

Local Multipoint Distribution System (LMDS). Frequency band: 27.5-29.5 GHz. Forward/reverse path bandwidth: 850 MHz downstream (27.5-28.35 GHz) / 150 MHz upstream (29.2-29.35 GHz). Channel capacity: 42 analog channels. Area coverage: maximum 5 km radius, using a ~30 cm home dish antenna.

Multichannel Multipoint Distribution System (MMDS). Frequency band: 2.15-2.70 GHz. Forward/reverse path bandwidth: 198 MHz downstream / 4 MHz upstream. Channel capacity: 33 analog channels (120 or more digital channels). Area coverage: ~50 km radius, using a ~60 cm home dish antenna.
Other techniques would be to use broadband networking architectures such as Fiber to the home (FTTH), Fiber to the node (FTTN) and Fiber to the curb (FTTC). Note that FTTC/N networks, coupled with xDSL for the "last mile" to the customer premises, can extend the capabilities of fiber plants closer to the end user, and thereby can facilitate the migration of narrow band networks to broadband networks. For delivering switched digital video, some Telecom and video service providers are currently investigating FTTN/C architectures (Figure 3), which can be extended to FTTH (Figure 4) networks in the future. FTTH is the ultimate wireline architecture for broadband services to the home; the raw bandwidth could be OC-3 (155 Mbps) or higher. It can use single-hop or multi-hop point-to-point networks with Dense Wavelength Division Multiplexed (DWDM) switches in the COs, and active/passive optical networks as distribution and access networks with an optical network unit (ONU) at or near the home. Low-cost FTTH deployment can be achieved if passive optical couplers are used for splitting the fiber from the CO into multiple fibers, since a smaller number of laser transceivers are then needed per home coverage. Transmission of analog signals over fiber (and coaxial cable) is still economically more attractive than its digital counterpart, especially for multi-channel systems using both amplitude and frequency modulation techniques. Opto-electronic and analog-to-digital conversions may then reside inside the ONU. Furthermore, optical networks are inherently more secure, because a cut in the fiber is easily detected. Various other fixed and mobile wireless architectures are also being proposed. They include broadband personal communication networks (BPCN), wireless local loops (WLL), including the local/multi-channel multi-point distribution systems (L/M-MDS) discussed before, and systems based on satellites orbiting in various earth orbits (xEO satellites). Details of these architectures can be found in Khasnabish and Banerjea (1998, 1999) and Khasnabish (1997).
Figure 3: Video delivery over ATM/VDSL using FTTN/FTTC networks (asymmetric access links: 6-52 Mbps downstream, 3-6 Mbps upstream)
A broadband switch at the Central Office feeds an ATM network over fiber to a remote node/ONU serving a 200- to 500-home area; from there, ATM runs over copper (<1 km) through a POTS/VDSL splitter (S) and VDSL modem (V) to a network terminal (NT), carrying ATM to the desktop and set-top box. (V: VDSL Modem; S: POTS/VDSL Splitter; ONU: Optical Network Unit; NT: Network Terminal)
Transmission Protocols
The two most commonly discussed wide area transmission protocols are TCP/IP and ATM (ATM Forum, 1996; Kwok, 1998; Pitts and Schormans, 1996). Both are supported over a variety of wireline and wireless transmission media. TCP/IP uses a window-based flow control mechanism; TCP supports packets of around 500 bytes, whereas IP supports packets of around 9 KBytes. The concept of QoS does not yet exist in TCP/IP-based networks, and usually over-provisioning-based techniques are used to meet the QoS requirements of a service. The asynchronous transfer mode (ATM) of data communication uses a 53-byte packet, or cell (including 5 bytes for the header), as the basic protocol data unit (PDU) for transmission and switching. It supports both constant and variable bit rate data transmission with guaranteed quality of service. We acknowledge that there is room for debate on whether TCP/IP-based services over xDSL links would be more popular (and useful) than ATM technology-based transport for real-time digital video services. Our focus in this chapter is on ATM-based distribution of digital video, since it supports the coexistence of constant and variable rate connections and can guarantee the quality of service to the end terminals. In addition, some of the recently proposed switched digital video architectures (as shown in Figure 3) use ATM transport technology and could utilize variable bit rate MPEG video. Therefore, it is necessary to investigate the issues related to transporting VBR MPEG video streams over ATM, so that the potential bandwidth and cost savings can be quantified. Figure 5 shows an architecture for MPEG video delivery over IP/ATM/ADSL using PSTN networks. The details of this architecture remain open for future investigations.
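The per-cell overhead of carrying MPEG-2 transport stream packets over ATM follows directly from the PDU sizes just mentioned. The sketch below assumes the commonly used packing of two 188-byte TS packets plus an 8-byte AAL5 trailer into exactly eight cells; the helper name is ours.

```python
import math

ATM_CELL = 53          # bytes on the wire
ATM_PAYLOAD = 48       # payload bytes per cell
AAL5_TRAILER = 8       # trailer bytes (PDU padded to a multiple of 48, trailer included)
TS_PACKET = 188        # bytes per MPEG-2 transport stream packet

def cells_for_ts_packets(n_ts):
    """Cells needed to carry n_ts TS packets in one AAL5 PDU."""
    pdu = n_ts * TS_PACKET + AAL5_TRAILER
    return math.ceil(pdu / ATM_PAYLOAD)

# Two TS packets per AAL5 PDU: 2*188 + 8 = 384 = 8 * 48, so no padding is wasted.
cells = cells_for_ts_packets(2)
overhead = (cells * ATM_CELL) / (2 * TS_PACKET)
print(cells, f"cells; ATM line bytes per TS byte: {overhead:.3f}")
# A 6 Mbps transport stream therefore needs about 6 * 1.128 = 6.77 Mbps of ATM line rate.
```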
Figure 4: Architecture of the Fiber to the home (FTTH) networks. Fiber links run from the fiber network to ONU/splitters, each of which serves several homes. (ONU: Optical Network Unit)
Figure 5: Video delivery over IP/ATM/XDSL using PSTN networks. The source of information (cable/satellite feed, VHS VCR, etc.) undergoes MPEG-2 encoding and packetization; the stream crosses a router-based wide area data network and an ATM segment, and reaches the customer over the subscriber copper loop (<4 km) of the PSTN through an xDSL (CAP/DMT) modem, terminating at an HTTP and MPEG-2 client on the LAN with an MPEG-2 system decoder and RTP packetizer.
DIGITAL VIDEO CODING STANDARDS
The Motion Picture Experts Group (MPEG) was established to develop a common format for coding and storing digital video and associated audio information (Chariglione, 1997). MPEG completed the first phase of its work in 1991 with the development of the MPEG-1 standard (ISO/IEC 11172). Its goal was to produce video with a quality similar to that of a VHS video tape recorder using a bit rate of approximately 1.2 Mbps. In response to a need for greater input format flexibility, higher data rates and better error resilience, MPEG started to develop extensions to the MPEG-1 specification. This work led to the MPEG-2 standard (ISO/IEC 13818). Although the MPEG-2 standard was originally designed for compressing broadcast-quality video at 4 Mbps to 6 Mbps, it currently supports compressed video from 1.5 Mbps to 20 Mbps at main level, and the rate could be as high as 100 Mbps at high level. A series of low-bit-rate extensions of the MPEG standards (e.g., MPEG-4) is currently being developed (see http://www.mpeg.org, 2001, for details); however, our focus in this chapter is on MPEG-2 video shaping for efficient transmission using ATM. The MPEG-1 and MPEG-2 specifications are similar, but the details are different. MPEG-2 is a superset of MPEG-1 and includes additional frame formats and encoding options. The MPEG specifications contain three major parts: system, audio and video. The
audio and video portions describe the encoding processes, while the system specification describes bit stream syntax, multiplexing and synchronization. Algorithms used for reconstructing MPEG-2-based moving images at the decoder assume lossless and constant delay transmission over the networks. These conditions are not realistic for any operational network. Loss of information and network delay variation add various forms of distortion to the reconstructed images. The MPEG standard does not prescribe the encoding process; instead, the standard specifies the data input format for the decoder as well as detailed specifications for interpreting this data. The data format is commonly referred to as syntax while the rules for interpreting the data are called the decoding semantics. The encoding process is not standardized and may vary from application to application depending upon a particular application’s requirements and complexity limitations. An encoder is only constrained to produce bit streams which are syntactically correct and which follow the decoding semantics; therefore, enhancements in the encoding process are possible even though the standards have been finalized. MPEG video compression exploits spatial and temporal redundancies which occur in video. Spatial redundancy can be utilized by simply coding each frame separately. This technique is referred to as intraframe coding. Additional compression can be achieved by taking advantage of the fact that consecutive frames are often almost identical. This temporal compression has the potential for a major bandwidth reduction over simply encoding each frame separately and is referred to as interframe coding. In the simplest form of interframe coding, the difference between two successive frames is coded by subtracting the frames and by using an intraframe technique to code the difference. However, if the entire frame is moving, such as in scenes where the camera is panning or zooming, this technique does not work well. MPEG coding provides motion compensation techniques to solve this problem.
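As a toy illustration of why interframe coding pays off, the following sketch (ours, using NumPy and synthetic frames) codes the difference between two consecutive frames and counts nonzero samples as a crude proxy for the bits an intraframe technique would then have to spend.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two consecutive "frames": a static background with a slightly moved bright block.
prev = rng.integers(0, 32, (64, 64)).astype(np.int16)
curr = prev.copy()
curr[20:28, 22:30] = prev[20:28, 20:28] + 100   # foreground object shifted 2 px right

residual = curr - prev   # simplest interframe coding: code the frame difference

# The residual is mostly zero, so it compresses far better than the raw frame.
print("nonzero samples, raw frame:", np.count_nonzero(curr))
print("nonzero samples, residual :", np.count_nonzero(residual))
```

When the whole frame moves (panning or zooming), this plain difference stays large, which is exactly the case motion compensation addresses.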
Hierarchical Coding
MPEG-2 video coding is hierarchically organized into six syntax layers (Chariglione, 1997). The specification defines three types of frames (display units), each of varying size: self-contained or intracoded (I) frames, predictive (P) frames, which code the block-by-block difference with the previous frame, and bi-directionally predicted (B) frames, which code the differences with the previous and next frames. Several frames are grouped together in a pattern to form a Group of Pictures (GOP), as shown in Figure 6. A series of GOPs (usually of constant length) produces a video sequence, as per the specifications in the sequence header. At the sub-frame level, block, macro-block and slice layers have been defined, which are the basic layers of coding and processing of the video signal.
Figure 6: Format of a Group of Pictures (GOP). The I-frames are self-contained, while the P- and B-frames are predicted from the surrounding frames (GOP size is 15 frames)
I B B P B B P B B P B B P B B | I ...
(I = Intracoded Frame; P = Predictive Coded Frame; B = Bidirectional Predictive Coded Frame)
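The pattern of Figure 6 follows mechanically from the GOP parameters used in the experiments later in the chapter (GOP length N and anchor-frame spacing M); a minimal sketch, with names of our choosing:

```python
def gop_pattern(n=15, m=3):
    """Display-order frame types for a GOP of length n with anchor spacing m
    (n=15, m=3 gives the I BB P BB P ... pattern of Figure 6)."""
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")       # one self-contained anchor per GOP
        elif i % m == 0:
            frames.append("P")       # forward-predicted anchors
        else:
            frames.append("B")       # bi-directionally predicted frames
    return "".join(frames)

print(gop_pattern())   # IBBPBBPBBPBBPBB -> 1 I, 4 P and 10 B frames
```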
A macro-block contains 16x16 pixels in the luminance space and 8x8 pixels in the chrominance space for the simplest color sub-sampling format (4:2:0). A macro-block is encoded by searching the previous frame for the closest match. In a frame with a fixed background and a moving foreground object, the foreground object can be represented by macroblocks from the previous frame and an offset which represents the movement of the object.
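A hedged sketch of the search just described: exhaustive block matching that returns the offset (motion vector) minimizing the sum of absolute differences (SAD). Real encoders use cleverer search orders and sub-pixel refinement; the frames here are synthetic and the function names are ours.

```python
import numpy as np

def best_match(prev, block, top, left, search=4):
    """Search `prev` around (top, left) for the 16x16 region that minimizes the
    SAD against `block`; return the motion vector (dy, dx) and its SAD."""
    h, w = prev.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y and 0 <= x and y + 16 <= h and x + 16 <= w:
                sad = np.abs(prev[y:y+16, x:x+16].astype(int) - block.astype(int)).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
    return best, best_sad

# Toy frames: the current macro-block is the previous one shifted by (0, +2).
rng = np.random.default_rng(2)
prev = rng.integers(0, 255, (64, 64), dtype=np.uint8)
curr_block = prev[16:32, 18:34]                # content moved 2 px to the right
mv, sad = best_match(prev, curr_block, 16, 16)
print("motion vector:", mv, "SAD:", sad)       # expect (0, 2) with SAD 0
```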
Transport and Program Streams
The MPEG-2 system standard defines methods for multiplexing one or more audio, video or data elementary streams (Chariglione, 1997). Each stream is packetized, and timestamps are added to form a packetized elementary stream (PES). The PES streams for audio, video and data (optional) are then multiplexed to form a single output stream for storage or transmission applications. The objective is to provide the necessary syntax to synchronize the decoding and presentation of video and audio information. The MPEG-2 system standard specifies two types of streams: the program stream (PS) and the transport stream (TS). Program streams use long variable-length packets, while transport streams use short fixed-size packets. Program streams are similar to MPEG-1 system streams; they are used for multiplexing together elementary streams with a common time base and are commonly used in storage applications such as digital versatile disks (DVDs). The use of PS for transmission applications increases the likelihood of an erroneous packet, since long frames are more susceptible to error. Transport streams are used for multiplexing together streams that do not have a common time base. The transport stream packets are of fixed length (188 bytes) so that techniques to minimize the effect of damaged or lost packets can be efficiently implemented. Transport streams can be used in harsh environments, e.g., for storage and transmission in lossy and error-prone media. Of the 188 bytes in the packet, at least four bytes are used for a header, with the remaining bytes available for the storage of data, audio or video information. The mandatory TS packet header contains information regarding synchronization, packet priority, program identification and error status. The TS can contain additional header fields providing timing information, the number of packets until the next splice and private data information. For constant bit rate (CBR) video, both PS and TS contain continuous streams of PDUs irrespective of whether video information is present in the original video signal; stuffing bits are used when required to maintain the bit rate characteristics of the video stream. Variable bit rate (VBR) coding generates a bursty stream of PDUs, so that the envelope of the video stream tends to closely match the variation of information in the original video signal. Therefore, the empty PDUs in the stream can be used for transferring other types of information, e.g., non-real-time data services, Internet traffic, etc.
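The fixed 4-byte TS header just described is easy to decode. Here is a minimal, self-contained sketch; the field names follow ISO/IEC 13818-1 terminology, and the sample packet is hand-made for illustration.

```python
def parse_ts_header(packet: bytes) -> dict:
    """Decode the 4-byte header of a 188-byte MPEG-2 transport stream packet."""
    if len(packet) != 188 or packet[0] != 0x47:     # 0x47 is the TS sync byte
        raise ValueError("not a valid TS packet")
    return {
        "transport_error":    bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "transport_priority": bool(packet[1] & 0x20),
        "pid":                ((packet[1] & 0x1F) << 8) | packet[2],
        "scrambling_control": (packet[3] >> 6) & 0x03,
        "adaptation_field":   (packet[3] >> 4) & 0x03,
        "continuity_counter": packet[3] & 0x0F,
    }

# A hand-made packet carrying PID 0x0100, payload only, continuity counter 5.
pkt = bytes([0x47, 0x41, 0x00, 0x15]) + bytes(184)
print(parse_ts_header(pkt))
```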
Levels and Profiles
MPEG-2 standards (Chariglione, 1997) define (a) five profiles, for basic configurations/complexities of the encoding and decoding methods, and (b) four levels, for classifying image sizes and setting the ranges of the rate control parameters. Of these, the combination of main level and main profile (MP@ML, i.e., 720Hx576V pixels/frame, 30 frames/sec, 15 Mbps maximum) is used for generating broadcast-quality (the standard CCIR 601 format) video. For every combination of level and profile, bit rate variability can be achieved by dynamically adjusting the number of bits/frame. In addition, it is possible to produce low bit rate video using lower levels and simpler profiles; however, we limit our discussion in this
chapter to the MP@ML combination only.
VARIABLE BIT RATE MPEG VIDEO
Full-motion real-time digital video with high and/or constant quality is bursty in nature (Chariglione, 1997), and the burstiness depends on the frequency of changes in the background and the movements of the foreground objects. Although it is possible to statistically multiplex the bursty sources to achieve bandwidth savings, uncontrolled burstiness may lead to wastage of both storage and transmission bandwidth. And since real-time entertainment-quality video is of high bit rate (on the order of tens of Megabits/sec), it is not unlikely that at times excessively high processing, storage (buffering) and transmission capacity is demanded from the network. These considerations motivated the early implementers and providers of digital video services to use a constant bit rate channel for delivering real-time video to the customers. The digital video encoder manufacturers accordingly implemented an additional rate control buffer in the encoder so as to generate a constant bit rate stream for transmission and distribution applications. The rate control buffer of the MPEG-2 encoder dictates the quantization level (hence the number of bits/picture) used in the macro-block and slice layers of coding. The encoder maintains a constant bit rate at its output irrespective of the number of bits (which would actually be required for a fixed quantization level) needed for encoding the pictures or frames. This is achieved by using (i) coarse quantization, which generates a smaller number of bits per picture, when excessive scene changes or motion occur, or (ii) finer quantization and/or bit stuffing when fewer than the desired number of bits per picture are produced. These rate adaptations not only lead to variable video quality, but also at times may cause wastage of bandwidth because the encoder may use stuffing bits to maintain the constant bit rate. Now, if the burstiness of the video can be controlled, using either some pre-specified source coding constraints and/or some constraints as dictated by the transmission service category, without sacrificing the video quality, it may be possible to achieve substantial savings on the usage of network resources (buffering, processing and transmission capacity) while maintaining the application (here, video) level quality of service. Without source rate control, the output bit rate of a video encoder depends on the complexity of the scene, the degree of motion and the number of scene changes. Some open-loop rate control techniques can still be applied to control the bit rate and burstiness in order to optimize the characteristics of VBR video for transmission and/or storage applications. These include defining fixed mean and maximum quantization scales for different types of MPEG-2 frames. Such video streams will have controlled burstiness and variability in their bit rate characteristics.
Techniques for VBR Video Coding
Variable bit rate encoding of MPEG video can be achieved using single-pass (or real-time) and multi-pass techniques. In both cases, the available options for producing a VBR video stream include (a) source/coding rate control in the MPEG-2 encoder, (b) shaping the output bit stream of the encoder (as utilized in this chapter), and (c) encoding rate control using feedback from the network. Option (a) and option (b), or a combination of both, are the most desirable, while option (c) is the least desirable because of the complexities involved in the implementation and management of a feedback-based rate control mechanism. Degradation of video quality due to rate control/adaptation depends not only on the video characteristics and category (sports, news, etc.) but also on the rate adaptation technique used. The real challenge is to find a compromise
between coding and rate shaping constraints such that acceptable video and audio quality are maintained without putting excessive burdens on switch buffer, processing, and transmission capacities (hence costs).
Source/Coding Rate Control
MPEG-2 video encoding offers a rich array of possible methods for rate control (Chariglione, 1997) in every possible combination of level and profile. A fundamental parameter is the quantizer scale, since this controls the instantaneous bit rate used at the macroblock level. The quantizer scale can be set globally at the picture layer, and the desired rate can be achieved by varying the scale for different types of pictures. The quantizer scale can also be adjusted at the slice or macroblock layer to provide finer granularity in rate control. The block layer (the lowest layer of video coding) contains DCT coefficients of the slow- or no-motion or DC (n=0) and rapid-motion or AC (n>0) types. The discrete cosine transform (DCT) decomposes a block of data into a weighted sum of spatial frequencies (Haskell et al., 1997). For example, the DCT can be used to transform the spatial information of an 8x8 block into the frequency domain. The DCT coefficients can then be quantized, using a quantizer step size of 8, 4 or 2 for the DC coefficient (the zero-frequency term) and a variable quantization step size for the AC coefficients (the non-zero-frequency terms). The DC (direct current) coefficient represents the spatially non-varying term, and the AC (alternating current) coefficients represent the spatially varying terms; this analogy is derived from the two types (direct and alternating) of electricity or electric current used in practice. Since most of the energy of the blocks is packed into the low-frequency coefficients, it is possible to either drop the higher-order coefficients or use fewer quantization levels for them without causing any visible impairment in the picture. Consequently, it is possible to produce VBR video by dropping a certain percentage of DC- and AC-type DCT coefficients without noticeable degradation in picture quality (a small numeric sketch follows the list below). For inter-coded or non-intracoded macroblocks (which usually exist in P and B frames), the low-frequency or DC coefficients can be dropped first, whereas for intracoded macroblocks (which usually exist in I and P frames) the AC coefficients are dropped first, since they play a less important role in video reconstruction (Chariglione, 1997; Gringeri et al., 1998a, 1998b). Note that P and B frames occur more frequently than I frames in a video stream. The picture or frame layer controls the display of video, and it is possible to control the bit rate variability of video by setting different thresholds for different types of frames in the video buffer verifier (VBV) of the encoder. The GOP layer is the editing level of video coding, and by dynamically adjusting the GOP structure, the video bit rate can be controlled as well. Other methods for generating variable bit rate (VBR) video include the following; the bit rate variability in these cases results from using (piece-wise) constant rates in the segments of a multi-segment video bit stream.
Video Resolution (defined in the video sequence layer) Based Rate Control: For NTSC video, 720Hx480V pixels/frame represents the D1 resolution, which needs the highest bit rate. Similarly, 544Hx480V pixels/frame is referred to as the 3/4 D1 resolution, which needs a medium bit rate, and 352Hx480V pixels/frame is called the 1/2 D1 resolution and needs the lowest bit rate.
Chrominance (defined in the video sequence layer) Based Rate Control: Black-and-white video always needs a lower bit rate compared to color video. However, if there is any detailed information in the color of video segments, that information may be lost during black-and-white transmission.
Rate Control by Repeating Frame(s): This technique is commonly known as the
'pull-down' technique. It is widely used for playing 24 frames/sec (FPS) film as 30 FPS NTSC video. In this method, every alternate pair of frames is mapped into a group of three frames by using a blank frame or repeating a frame. Thus, a group of four frames in 24 FPS film is mapped into a set of five frames, and the resulting frame rate becomes 5x(24/4) = 30. Similar pull-down techniques can be used for displaying various low-frame-rate video as well. The cost of transmitting low-frame-rate video is also lower.
Rate Control by Shaping Encoder's Output: This method shapes the encoder's output to match the processing and buffering capacity of a switch and the desired transmission bandwidth. A number of parameters, like peak bit rate, average bit rate and switch/multiplexer buffer size, can be utilized to control the video bit rate without affecting the video quality. Note that the use of a large switch buffer may reduce the loss rate of video bits, but it may also adversely affect (because of large delay/delay variation) the liveliness or real-time characteristics of the video.
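The quantize-and-drop idea discussed above can be made concrete with a small sketch. This is illustrative only: it applies SciPy's 2-D DCT to one synthetic 8x8 block, uses a single flat quantizer step instead of the MPEG-2 quantizer matrices, and cuts off high frequencies at a threshold of our choosing.

```python
import numpy as np
from scipy.fft import dctn, idctn   # 2-D DCT-II and its inverse (SciPy >= 1.4)

rng = np.random.default_rng(3)
block = rng.integers(0, 255, (8, 8)).astype(float)   # one 8x8 luminance block

coef = dctn(block, norm="ortho")            # coef[0, 0] is the DC term, the rest are AC

# Coarse, VBR-style rate control: quantize, then drop the highest AC frequencies.
qstep = 16.0
quantized = np.round(coef / qstep)
u, v = np.meshgrid(range(8), range(8), indexing="ij")
quantized[u + v > 8] = 0                    # zero the highest spatial frequencies

bits_proxy = np.count_nonzero(quantized)    # crude stand-in for the coded bit cost
recon = idctn(quantized * qstep, norm="ortho")
print("nonzero coefficients kept:", bits_proxy)
print("mean absolute reconstruction error:", np.abs(recon - block).mean())
```

Raising qstep or lowering the frequency cut-off trades reconstruction error for fewer coefficients, which is the bit rate versus distortion lever the text describes.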
Network/Multiplexer Feedback-Based Encoding Rate Control
In this method, the feedback signal from the network and/or video stream multiplexer is used to control the encoding bit rate of the video. The feedback signal is usually related to the availability of buffer space and transmission bandwidth on the channel. Because of the time delay involved in receiving the feedback signal and reacting to it, this type of technique may adversely affect video compression and channel multiplexing efficiency. Furthermore, additional signaling and other implementation complexities are also involved in this technique, which make it less attractive.
Real-Time (or One-Pass) and Multi-Pass Coding
Variable bit rate MPEG-2 video can be produced using both single-pass (i.e., real-time) and multi-pass techniques. In the case of single-pass VBR encoding, unless the coding and rate control parameters are selected very carefully, the quality of the video may vary significantly over a single video session. This is because the amount of time and flexibility available for bit rate optimization is very limited (Chariglione, 1997; Schuster and Katsaggelos, 1997; www.sonic.com, 2001). Consequently, it may be very difficult to achieve any significant gain in bandwidth utilization over CBR transmission. In the case of multi-pass VBR encoding, the following steps can be used.
Video Characterization: This is the video pre-processing or pre-viewing phase. The objectives are to determine (i) the frequency of scene changes, (ii) the degree of motion of the foreground and background objects, and (iii) the coding complexity of the scenes and the various objects in the video sequence. This pass helps decide the GOP structure to be used. It also allows one to choose a set of parameters which can be used to generate various bit rate profiles for a target video quality. It may be possible to use the information from the profile file for determining the conditions for virtually open-loop or unconstrained VBR video coding.
Rate Profile Generation: In this pass, the upper and lower bounds on the number of bits allocated to the various types of frames within the GOP are varied. In each profile, attempts are made to allocate large numbers of bits while coding complex and high-motion segments of a scene and fewer bits to relatively still scenes, while maintaining the target video quality. Several such profiles are generated during this phase.
Video Coding: The target traffic control parameters of the channel (e.g., the parameters for a type of ATM service) can be used in this pass to determine the GOP structure and to select one of the bit rate profiles generated in the second pass, while still maintaining the
target video quality. The quantization scale (defined at the slice layer and, in some cases, at the macro-block level) can now be used to maintain the required bit rate profile while satisfying the transmission requirements of the channel. This may result in highly granular coding (frame- and/or sub-frame-level rate control instead of GOP-level control) satisfying the target quality requirements. MPEG-2 test model 5 (TM5) (Chariglione, 1997) suggests that the same quantizer_scale be used for I and P frames, while for B frames it can differ by a factor of 1.4. Note that the capabilities of TM5 can be used to generate constrained VBR MPEG-2 video streams.
Applications of VBR Video
It is interesting to note that constant bit rate video provides variable video quality, whereas variable rate video has the potential to offer constant-quality video (Chariglione, 1997; Kwok, 1998; Schuster and Katsaggelos, 1997). This is because of the possibility of degradation of video quality when the generated bit rate tends to exceed the allocated peak bit rate of the video stream, which happens when significant changes of the background scene and/or excessive movements of the foreground objects occur. VBR video can be used for both storage (e.g., DVD) and transmission applications.
Storage Applications
Variable rate encoding is commonly used in storage applications such as DVDs. In these applications, the bit utilization is optimized over the entire encoding session to produce the desired peak and average rates for the stream. In this case, the encoder needs to maintain a target video quality while not exceeding the specified peak and long-term average bit rates; i.e., the source control tries to optimize video quality for the DVD characteristics. For video-on-demand or near-video-on-demand applications, it may be necessary to transcode3 the output of the DVD so that the bandwidth requirements of the output stream match the available network buffer and transmission bandwidth. The storage requirements for a movie clip encoded using VBR encoding are much smaller than those for constant bit rate (CBR) coding. When clips of similar video quality are compared, VBR encoding shows significant storage savings. For example, to code a 15-minute video clip using CBR encoding at 6 Mbps, 675 MBytes of storage is required, whereas recording the same clip using VBR encoding with a peak rate of 6 Mbps and a mean rate of 3 Mbps requires only 355 MBytes of storage. This translates into almost a 50% savings in disk space for storing clips of very similar video quality. Note that the amount of savings is also highly dependent on the category and complexity of the video clip.
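The storage figures above follow from the mean rate and the clip length; a one-function sketch (ours), taking 1 MB as 10^6 bytes, reproduces them.

```python
def clip_storage_mb(mean_mbps, minutes):
    """Storage for a clip coded at a given mean bit rate (1 MB = 10**6 bytes)."""
    return mean_mbps * 1e6 * minutes * 60 / 8 / 1e6

print(clip_storage_mb(6.0, 15))   # CBR at 6 Mbps        -> 675.0 MB
print(clip_storage_mb(3.0, 15))   # VBR at 3 Mbps mean   -> 337.5 MB floor; the
                                  # measured clip took 355 MB, ~47% below the CBR file
```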
Transmission Applications
For these applications, it is required to bound the burstiness of the video stream. One way to achieve this would be to have the quantization levels determined by a rate control algorithm that takes into account the fullness of the VBV buffer as well as the parameters which determine the characteristics (buffer, bandwidth, etc.) of the transmission channel. Bandwidth savings due to VBR encoding during transmission of independent video streams can be achieved by varying the bit rate according to the scene complexities and motion requirements. This is because bandwidth allocation for transmission of CBR video is done using the peak bit rate of the video stream, whereas for VBR transmission a bit rate much lower than the peak bit rate can be used, as shown in Figure 7a and Figure 7b. Note that the characteristics of the transmission channel in Figure 7b are such that it can efficiently accommodate bursts, and therefore the possibility of maintaining constant video quality
remains high. When a number of VBR video streams are statistically multiplexed together, the bandwidth savings may depend on specific requirements. For some applications, it may be necessary to maintain the grouping of the video streams, whereas for others it may be necessary to keep the aggregate bit rate of the streams constant. For example, for switched video applications (FTTC/N networks), the grouping of the channels may not be uniform throughout the network; in these cases the rate must be controlled for each stream and not for the group of streams, and transcoding may be performed when needed. For maintaining a constant aggregate bit rate of the multiplexed stream, open-loop techniques, closed-loop techniques or a combination of both can be used. Open-loop techniques include mixing real-time VBR video streams with time-delayed streams, multiplexing a large number of low/medium-bit-rate, less bursty streams with a small number of high-bit-rate bursty streams, etc. Feedback-based rate control methods are less attractive for the reasons mentioned earlier.
Characterizing VBR MPEG Video Streams
To investigate the MPEG frame-level bit utilization pattern, and the savings in transmission bandwidth and storage, a 15-minute-long video segment from a high-action NTSC movie is used. The flow of the activities for generating, analyzing and shaping the video streams is presented in Figure 8. This video clip is encoded at main profile and main level using the Minerva Encoder (an MPEG-2 encoder based on the C-Cube CL4040 VideoRISC-3 chip-set). The video resolution used is 720Hx480V pixels/frame and the frame rate is 29.97 frames per sec (FPS). The Group of Pictures (GOP) structure is set at N=15 and M=3, and hence a GOP contains one intracoded (I) frame, four predicted (P) frames and 10 bidirectionally coded (B) frames. The peak rate (e.g., 6 Mbps) is defined over a GOP, and the mean or average rate (e.g., 3 Mbps) is specified over the duration of the entire clip. It is possible to enable scene change detection during encoding of the video sequence to empower the encoder to start a new GOP (i.e., insert a new I frame) if a significant scene change is detected. The encoder uses a two-pass method to produce a variable bit rate video stream: profiling the video on the first pass and encoding it during the second pass.
Figure 7: Comparison of video quality for CBR and VBR transmission
(a) Constant bit rate (CBR) video: the natural video stream is smoothed to a constant rate using the rate control buffer in the encoder, which entails a loss of video quality (bandwidth in Mbps versus time).
(b) Variable bit rate (VBR) video: the constrained or controlled variable rate video stream tracks the natural video stream between the average and peak rates (bandwidth in Mbps versus time).
Figure 8: Experimental set-up for generating and analyzing VBR MPEG-2 video bit streams. Analog video from a VTR or laserdisc is fed to a VBR MPEG encoder, configured with the video rate (defined in terms of minimum, maximum and average rates), the GOP structure and the video resolution; the resulting digital MPEG-2 video clips are analyzed on a Pentium-based PC for frame size statistics, the ATM traffic contract, effective bandwidth and multiplexing gain.
Table 2 shows the frame size statistics for several 15-minute-long video clips, as obtained during a series of experiments in our lab. It is clear that the size of the frames varies significantly and that, without smoothing or rate control, MPEG video is variable rate in nature. This is because, without source rate control, the output bit rate of a video encoder depends on the complexity of the scene, the degree of motion and the number of scene changes. Some open-loop rate control techniques (like fixed mean and maximum quantization scales for different types of frames) can be used to optimize the characteristics of variable rate video for transmission or storage applications. The results for the high-action NTSC movie show that the amount of storage capacity required to store a VBR movie clip is significantly smaller (by as much as 37%) than that needed for storing CBR clips of the same length (15 minutes), as shown in Table 3. This is because, when VBR encoding is used, a larger number of bits is used to encode significant changes and movements, and fewer bits are used to encode relatively small foreground movements and changes in the background objects. The number of bits needed for coding I frames is both larger and burstier compared to those for B and P frames, as shown in Table 4. This is because I frames are self-contained or intra-coded frames, and most of the time they represent significant changes of the background objects/scenes. P and B frames are predicted and bi-directionally coded frames, and hence they contain fewer bits and are less bursty in nature. Figure 9 and Figure 10 demonstrate the frame size variation with time for high-action and talking-head type video clips. It is interesting to see that for P and B frames the variation of size with time is not very significant, but this is not the case for I frames, which are self-contained or intra-coded frames. This is because these are the anchor frames, and they are responsible for encoding significant changes in the background objects/scenes. Finally, we observe that the distributions of the bit contents of all three types of MPEG frames (i.e., I, P and B frames) tend to follow the Gamma or slanted Gaussian distribution, but the tail of the distribution for the I frames is long and very bursty. This is shown in Figure 11 using histograms of frame sizes. It implies that there are very few (less than 10%) I frames with very large sizes; although they tend to appear very infrequently, they have a significant impact on video quality.
Table 2: Frame size statistics for various categories of 15-minute-long video clips
Title of the Clip | Frame Length (Bits): Min | Max | Average | Variance | No. of Frames
Movie #1     | 32256 | 839936 | 102393 | 4.71E+09 | 26999
Movie #2     | 34816 | 815360 | 102419 | 5.77E+09 | 26999
Cartoon      | 31744 | 837888 | 103332 | 5.77E+09 | 26999
News         | 19968 | 887040 | 103106 | 7.88E+09 | 26999
Sports       | 20480 | 842496 | 103137 | 6.79E+09 | 26999
Talking Head | 24064 | 445184 |  66713 | 4.72E+09 | 28229
Table 3: Storage requirements for CBR and VBR video clips

Video Quality | File Size, VBR                    | File Size, CBR | File Size Reduction
High          | 347 MB (Peak 5 Mbps, Mean 3 Mbps) | 450 MB (4 Mbps) | 23%
High          | 355 MB (Peak 6 Mbps, Mean 3 Mbps) | 563 MB (5 Mbps) | 37%
High          | 439 MB (Peak 6 Mbps, Mean 4 Mbps) | 675 MB (6 Mbps) | 35%
Table 4: Bit requirements for I, P and B frames in an action movie (Clip: Total Recall)

Frame Type | Frame Length (Bits): Min | Max | Average | Variance | No. of Frames
I-Frames   | 103168 | 839936 | 267840   | 8.46E+09 |  1840
P-Frames   |  59904 | 358400 | 147065.7 | 2.21E+09 |  7160
B-Frames   |  32256 | 252672 |  67709.26 | 5.23E+08 | 17999
All Frames |  32256 | 839936 | 102393.2 | 4.71E+09 | 26999
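Given a per-frame bit trace like the ones summarized above, the chapter's traffic contract parameters can be approximated numerically. The sketch below is ours and heavily simplified: the trace is synthetic (Gamma-distributed frame sizes), PCR is taken as the largest windowed cell rate for a chosen averaging interval, and MBS is estimated from the worst-case backlog of a leaky bucket drained at the SCR, which only stands in for a full GCRA conformance test.

```python
import numpy as np

FPS = 30.0
CELL_PAYLOAD_BITS = 48 * 8   # usable bits per ATM cell

# Hypothetical per-frame sizes in bits; in practice this comes from an encoder trace.
rng = np.random.default_rng(4)
frame_bits = rng.gamma(shape=4.0, scale=25_000.0, size=900)   # ~30 s of video

def windowed_cell_rate(bits, frames_per_window):
    """Cell rate (cells/s) averaged over consecutive windows of the given length."""
    n = len(bits) // frames_per_window * frames_per_window
    window_bits = bits[:n].reshape(-1, frames_per_window).sum(axis=1)
    cells = np.ceil(window_bits / CELL_PAYLOAD_BITS)
    return cells / (frames_per_window / FPS)

def mbs_estimate(bits, scr, pcr):
    """Worst leaky-bucket backlog at drain rate SCR, scaled to a burst sent at PCR."""
    backlog, worst = 0.0, 0.0
    for b in bits:
        backlog = max(0.0, backlog + np.ceil(b / CELL_PAYLOAD_BITS) - scr / FPS)
        worst = max(worst, backlog)
    return int(np.ceil(worst * pcr / (pcr - scr)))

scr = 1.2 * windowed_cell_rate(frame_bits, len(frame_bits))[0]   # 20% above the mean
for w in (1, 5, 15):   # averaging interval: one frame ... one GOP (N=15)
    pcr = windowed_cell_rate(frame_bits, w).max()
    print(f"window={w:2d} frames  PCR={pcr:9.1f} cells/s  "
          f"MBS~{mbs_estimate(frame_bits, scr, pcr)} cells")
```

Consistent with the chapter's observations, averaging over an entire GOP lowers the reported PCR, while shrinking the averaging interval raises it.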
Figure 9: Variation of frame size with time in a high-action type VBR video stream. The plot shows the sizes (in bits) of the I, P and B frames over the first 1000 frames of a VBR clip with a 6 Mbps peak and 3 Mbps average rate.
Figure 10: Variation of frame size with time in a talking-head type VBR video stream. The plot shows the sizes (in bits) of the I, P and B frames over the first 1000 frames of a VBR clip with a 3 Mbps peak and 2 Mbps average rate.
Figure 11: Histogram of I, P and B frame sizes in a high-action type VBR video stream. The number of occurrences (on a logarithmic scale) is plotted against the frame size in bits.
significant impact on video quality.
Audio and Video Quality
It is interesting to observe that there exists a trade-off between image distortion and video bit rate, as shown in Figure 12. Note that there also exist inverse relationships (Schuster and Katsaggelos, 1997) between image distortion and coding delay, and between bit rate and coding complexity. When variability in the video bit rate is allowed, the video quality tends to remain constant, because all the video frames suffer a fixed amount of distortion. When the bit rate remains constant, both video quality (distortion) and coding delay tend to be variable. These relationships hold in general, irrespective of the category (action movie, sports, news, cartoons, etc.) of the video. Recently we studied (Gringeri et al., 1998a) the effects of various ATM network transmission level impairments (such as bit error/loss, cell loss and delay jitter) on audio and
video quality for constant bit rate (CBR) MPEG-2 streams. The results are presented in Table 5. It is evident that a majority of the artifacts are caused by excessive delay variations or delay jitter. This is because, for timely playback of real-time information, it is imperative that all the necessary bits are received before the playback begins. Otherwise, underflow or overflow of the receiver buffer may occur frequently, which results in degradation of audio and video quality. When VBR bit streams are used for video transmission/distribution, it is expected that the decoder or the integrated receiver decoder (IRD) would have some degree of built-in tolerance4 to transmission errors and delay jitter. This is because bit rate variability in the reception of the data stream is expected to occur in any realistic operational network.
TRANSMISSION OF VBR MPEG VIDEO USING ATM
The asynchronous transfer mode (ATM) of data communication uses a 53-byte packet or cell (including a 5-byte header) as the basic protocol data unit (PDU) for transmission and switching (ATM Forum, 1996; Pitts and Schormans, 1996). It supports both constant and variable bit rate data transmission with guaranteed Quality of Service (QoS) defined in terms of bit rate(s) and burst size (when applicable).
Figure 12: Trade-off between the bit rate of a video stream and the distortion of the image (image distortion and coding delay versus bit rate and coding complexity; a constant bit rate and complexity yields variable quality and delay, whereas a variable bit rate of the video stream allows constant quality and delay)
Table 5: Description and causes of audio/video artifacts for CBR MPEG video

Artifact: Tiling or Pixelation (formation of small blocks with distinct boundaries). Cause: bit error in the payload section of the TS PDU.
Artifact: Screen Blanking (loss of video). Cause: bit loss or error in the TS PDU header; incorrect PDU length; CDV or delay jitter.
Artifact: Motion Jerkiness (irregular or unnatural motion). Cause: bit loss or error in the TS PDU header; CDV or delay jitter.
Artifact: Frame Freezing (screen freezes; similar to motion jerkiness but of longer duration). Cause: incorrect PDU length.
Artifact: Audio Breakup (loss or severe distortion of the audio signal). Cause: bit error in the payload section of the TS PDU; CDV or delay jitter.
Artifact: Audio Noise (audio anomalies and/or intermittent glitches). Cause: bit error in the payload section of the TS PDU; CDV or delay jitter.
Artifact: Error Blocks (small solid color blocks, usually green or yellow, appear on the screen). Cause: bit error in the payload section of the TS PDU.
Artifact: Color Cycling (loss of color stability; colors cycle through a range of hues). Cause: incorrect PDU length; CDV or delay jitter.
Artifact: Audio-Video Mis-synchronization (loss of lip synchronization between the audio and video). Cause: excessive CDV or delay jitter.
Although the ATM adaptation layer 1 (AAL-1) was designed to offer network-level clocking for real-time CBR services, the current trend is towards using ATM adaptation layer 5 (AAL-5) for CBR/VBR MPEG-2 video transmission. AAL-5 was designed for transmission of error-sensitive, non-real-time, bursty data traffic; it is widely available, and it incurs the same amount of overhead as AAL-1 (which can be used for CBR video) when two MPEG-2 transport stream (TS) packets are multiplexed together for transmission applications. We compare and contrast the generation and transmission of MPEG-2 video streams that are friendly to ATM service contracts (CBR and rt-VBR only). For CBR encoding of video, the output rate control buffer dictates the quantization level used in the macroblock and slice layers of coding; no further shaping of the output bit stream is performed. For VBR encoding of video, the available options include (a) source/coding rate control, (b) output rate control and (c) network feedback-based rate control, as discussed earlier. Degradation of video quality in these cases due to rate control/adaptation depends not only on the video characteristics and category, but also on the rate adaptation technique used. The real challenge is to find acceptable coding and rate-shaping constraints such that acceptable video and audio quality is maintained without imposing excessive burdens (hence costs) on switch buffering, processing and transmission capacities.
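To make the AAL-5 overhead concrete, the following back-of-the-envelope sketch (ours, not the chapter's) computes the transmission overhead when two 188-byte MPEG-2 TS packets are carried in one AAL-5 PDU; the 376 payload bytes plus the 8-byte AAL-5 trailer fill exactly eight 48-byte cell payloads, so no padding is needed.

```python
# ATM cell overhead when two 188-byte MPEG-2 TS packets share one AAL-5 PDU.
TS_PACKET = 188          # bytes per MPEG-2 transport stream packet
AAL5_TRAILER = 8         # bytes; 376 + 8 = 384 fits exactly into 8 cells
CELL_PAYLOAD = 48        # payload bytes per ATM cell
CELL_TOTAL = 53          # bytes per ATM cell including the 5-byte header

payload = 2 * TS_PACKET                 # 376 bytes of useful MPEG-2 data
pdu = payload + AAL5_TRAILER            # 384 bytes -> exactly 8 cells
cells = pdu // CELL_PAYLOAD             # 8 cells
wire_bytes = cells * CELL_TOTAL         # 424 bytes actually transmitted

overhead = (wire_bytes - payload) / payload
print(f"{cells} cells, overhead = {overhead:.1%}")  # 8 cells, overhead = 12.8%
```

This is the same (424 - 376)/376, or roughly 13%, overhead figure noted with Figure 16 later in the chapter.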
ATM Service Classes and Traffic Contracts For ATM-based transport networks, constant bit rate (CBR), variable bit rate (VBR), available bit rate (ABR) and unspecified bit rate (UBR) type service categories have been defined (ATM Forum, 1996; Pitts and Schormans, 1996). The simplest of these is the CBR type service for which the only traffic parameter that needs to be defined is the peak cell rate (PCR), given that the network already supports a specified cell delay variation tolerance
(CDVT), as needed by all types of ATM services. For real-time and non-real-time VBR (rt-VBR and nrt-VBR) services, the PCR, sustained cell rate (SCR) and maximum burst size (MBS) need to be defined. For ABR type service, the PCR needs to be specified, a minimum cell rate (MCR) is guaranteed, and feedback-based rate control is incorporated using the resource management (RM) type ATM cell. And, as the name suggests, the UBR type service guarantees neither MCR nor QoS; only the peak cell rate needs to be defined for this type of service. The QoS parameters for all types of ATM service (except UBR) include the cell loss ratio (CLR) parameter. For CBR and rt-VBR type services, the QoS parameters also include the cell transfer delay (CTD) and cell delay variation (CDV) parameters (ATM Forum, 1996), because these services are intended for the transmission of real-time traffic. We limit our discussion in this section to CBR and rt-VBR type ATM services for constant and variable rate MPEG-2 video transmission.
Shaping of MPEG Video Streams for VBR Transmission
As mentioned earlier, video bit streams are naturally of variable rate. To generate CBR video, the output rate control buffer of the MPEG-2 encoder dictates the quantization level (and hence the number of bits per picture) used in the macro-block and slice layers of coding; no output reshaping-based rate control is performed. The encoder maintains a constant bit rate at its output irrespective of the number of bits (i.e., the number that would actually be required at a fixed quantization level) needed for encoding the pictures or frames. This is achieved by using (i) coarse quantization, which generates a smaller number of bits per picture, when excessive scene changes or motion occur, or (ii) finer quantization and/or bit stuffing when fewer than the desired number of bits per picture are produced. Hence the PCR value for the CBR type ATM service is set at more than or approximately equal to the bit rate at which the encoder is operating. This is shown in Figure 13a. It is interesting to note that a certain amount of bandwidth is wasted because of the bit rate variability of the original video stream.
Now, for generating controlled VBR bit streams, either source/coding rate control, or transmission-channel-characteristics-based rate control, or a combination of both, can be performed. Experiments on source/coding rate control are currently being performed using the MPEG-2 test model 5 (TM5) based software encoder and decoder. The results can then be used for combined source and channel-coding-based video bit rate control as well. These are topics of future investigation, and the results will be published when they become available. In this chapter we present the results of rate control by reshaping the output of the MPEG-2 encoder.
As mentioned earlier, for rt-VBR type ATM services, a set of three parameters {PCR, SCR, MBS} needs to be specified using, e.g., the generic cell rate algorithm (GCRA) (ATM Forum, 1996). GCRA is a virtual scheduling or leaky-bucket algorithm which can be used to check the conformance of traffic arrivals with a traffic contract. GCRA(I, L) means that the increment is set to I and the limit is set to L. If a traffic source sends ATM cells too fast, i.e., before their scheduled times (calculated theoretically from the traffic contract parameters), it is sending too many non-conforming cells and thus exceeding its limit. The conforming cells are only those which arrive at or later than their scheduled times. For example, a periodic ATM cell stream compliant with GCRA(T, 0) and GCRA(Ts, τs) and having period B×Ts can transmit B cells at the peak cell rate PCR = 1/T with an inter-burst spacing TI = B×(Ts − T) + T, resulting in a mean cell rate of SCR = 1/Ts and a burst length of B. Note that τs is calculated as (MBS − 1)×(Ts − T), given that the {PCR, SCR, MBS = B} parameters of the traffic contract are already defined.
For video services, the PCR is usually set at the maximum allowable bit rate of the data
stream. The value of MBS determines the maximum number of cells that can be transmitted at PCR; the higher the value of MBS, the lower becomes the burstiness of the stream. Also, the MBS must be selected within the limit of the available buffer space of the switches in the transmission path. For example, the MBS may need to be less than half of the multiplexer/switch buffer size for acceptable delay variation and cell loss probability (e.g., less than 10^-4) (Guerin, Ahmadi and Naghshineh, 1991; Mark and Ramamurthy, 1996) when the switch/multiplexer is designed to support real-time traffic. The value of SCR can be selected (i.e., it is a tunable parameter) in such a way that the MBS remains within an acceptable limit and the ratio between PCR and SCR becomes as high as possible. Note that the ratio between PCR and SCR affects both (a) the bandwidth allocated to an individual data stream, and (b) the achievable multiplexing gain when a number of VBR streams are mixed together for transmission over, e.g., a CBR channel. If, for all reasonable (or acceptable) values of MBS, the SCR value becomes approximately equal to the PCR value, CBR type ATM service can be used. Once the {PCR, SCR, MBS} parameters are determined for a video stream, bandwidth can be allocated to the stream (when PCR >> SCR) using the required effective bandwidth or equivalent capacity of the stream, as explained in the next section and qualitatively shown in Figure 13b. Notice that if the characteristics of the bit streams support bandwidth allocation using effective bandwidth, the remaining available link capacity (i.e., the difference between PCR and EBW, as shown in Figure 13b) can be used for serving non-real-time bursty traffic sources.
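As an illustration of the conformance test just described, the following is a minimal sketch (our code, with hypothetical names) of the GCRA(I, L) virtual scheduling algorithm: setting I = 1/PCR with L = 0 gives the peak-rate test, and I = 1/SCR with L = τs gives the sustained-rate test.

```python
# A sketch of the GCRA(I, L) virtual scheduling algorithm described above;
# the function name and structure are ours, not from the cited specification.

def gcra_conforming(arrival_times, I, L):
    """Classify each cell arrival as conforming or not under GCRA(I, L).

    I: increment (e.g., 1/PCR for the peak-rate test, 1/SCR for the
       sustained-rate test); L: limit (0 for PCR, the burst tolerance
       tau_s = (MBS - 1) * (Ts - T) for SCR), all in the same time unit."""
    tat = arrival_times[0] if arrival_times else 0.0  # theoretical arrival time
    results = []
    for t in arrival_times:
        if t < tat - L:                 # cell arrived too early: non-conforming
            results.append(False)       # TAT is left unchanged
        else:
            results.append(True)
            tat = max(tat, t) + I       # schedule the next theoretical arrival
    return results

# Example: PCR = 10 cells per unit time, so I = 0.1 and L = 0.
print(gcra_conforming([0.0, 0.05, 0.2, 0.3], I=0.1, L=0))
# -> [True, False, True, True]: the second cell arrives 0.05 too early
```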
Effective Bandwidth or Equivalent Capacity
Effective bandwidth (EBW) is defined as the amount of link capacity (or transmission bandwidth) needed to serve a traffic source so as to satisfy a pre-specified cell loss probability (CLP) and delay/delay-variation criteria. Another bandwidth requirement measure, similar to the EBW, is called the equivalent capacity (EC). It can be shown that the logarithm of the inverse of CLP increases almost linearly with the switch or multiplexer buffer size. Therefore, increasing the storage capacity (i.e., buffer) in the switch reduces the value of CLP. However, this also adds more delay, and may increase the delay variation suffered by the data stream. The EBW and EC measures commonly use statistical characterizations of traffic sources.
In the mathematical analysis and estimation of the required EBW presented by Guerin, Ahmadi and Naghshineh (1991), two approaches are discussed. One is based on adopting a two-state fluid-flow model to capture the basic behavior of a data source. The model assumes that the transmission state of each individual source alternates between "idle, or off" (the zero-transmission period) and "active, or on" (transmission at the source peak rate). This approach estimates the upper bound of the needed effective bandwidth (λeff) for a source as follows.
λeff_Guerin = PCR × [ C − Bmux + sqrt( (C − Bmux)² + 4·ρ·Bmux·C ) ] / (2C)

where:
C = α·b·(1−ρ)·PCR
α = ln(1/CLP)
PCR = peak cell rate
Bmux = buffer size of the statistical multiplexer
ρ = source utilization
b = mean burst period
CLP = cell loss probability
The effective bandwidth given by this flow approximation for N multiplexed sources is simply the sum of the individual effective bandwidth values.
The second approach is presented by Mark and Ramamurthy (1996). This method is based on a Gaussian approximation model of the aggregated traffic. It is assumed that a large number N of sources with similar characteristics (mean rate µi, standard deviation σi), and with the same or very similar quality of service requirements, are multiplexed together. Multiplexing N video sources with similar characteristics to form a composite broadcast stream is a typical example of such an assumption. The effective bandwidth (λeff) for the N sources is estimated using the Gaussian approximation as follows.
λeff_Gauss = µ + η·σ

where:
η = sqrt( −2·ln(CLP) − ln(2π) )
µ = Σ (i=1..N) µi
σ² = Σ (i=1..N) σi²
The Gaussian distribution of the diffusion process approximation here depends on the mean and variance of the on-off periods of a fluid source. It is assumed that the on and off periods of a source are distributed according to Erlang-k distributions. For k=1 the source model is a Markov on-off fluid source, and constitutes an upper bound on the EBW calculation. As k increases, the variability in the on-off periods decreases to give a more optimistic measure of the EBW. The EBW estimation of a source becomes as follows.
λeff_Mark = SCR + θ

where:
θ = [ SCR·(PCR − SCR)·MBSs / PCR ] × [ ln(1/CLP) / (2·k·Bmux) ]
MBSs = (PCR − SCR)·MBS / PCR
MBS = maximum burst size
SCR = sustainable cell rate
The EBW (or EC) requirements can be used both for ATM connection admission control (CAC) and for dimensioning or capacity provisioning of transport or backbone network links. When used for CAC, the objective is to characterize a single traffic source using approximate formulations (Guerin, Ahmadi and Naghshineh, 1991; Mark and Ramamurthy, 1996) that are suitable for real-time implementation, because the admission control decision must be made in real time. In addition, it is the EBW measure of a source that determines whether or not there would be sufficient gain (bandwidth savings) from
blind-multiplexing5 of traffic sources. These are studied in detail in this chapter. For backbone or transport links, which provide long-haul transmission facilities for a large number of multiplexed traffic streams, the stationary distribution of the number of bits in the link, which is usually Gaussian (Guerin, Ahmadi and Naghshineh, 1991) when the number of streams being multiplexed is large enough, can be used for determining the EBW. Note that in most cases, blind statistical multiplexing of a number of VBR traffic sources results in an approximately constant bit rate data stream.
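The two closed-form estimates above translate directly into code. The following sketch (ours; the function names are assumptions, and rates and buffer sizes must be supplied in consistent units, e.g., cells/sec and cells, or bits/sec and bits) implements the Guerin et al. fluid-flow bound and the Gaussian aggregation approximation.

```python
# Illustrative implementations of the two effective-bandwidth estimates
# reproduced above (Guerin, Ahmadi and Naghshineh, 1991; Gaussian model).

import math

def ebw_guerin(pcr, rho, b, bmux, clp):
    """Fluid-flow upper bound for a single on-off source.
    pcr: peak rate, rho: utilization (mean/peak), b: mean burst period,
    bmux: multiplexer buffer size, clp: target cell loss probability."""
    alpha = math.log(1.0 / clp)
    c = alpha * b * (1.0 - rho) * pcr
    return pcr * (c - bmux + math.sqrt((c - bmux) ** 2
                                       + 4.0 * rho * bmux * c)) / (2.0 * c)

def ebw_gaussian(means, stds, clp):
    """Gaussian approximation for N aggregated sources:
    lambda_eff = mu + eta * sigma."""
    eta = math.sqrt(-2.0 * math.log(clp) - math.log(2.0 * math.pi))
    mu = sum(means)
    sigma = math.sqrt(sum(s * s for s in stds))
    return mu + eta * sigma

# Example: a 6 Mbps-peak source with 3 Mbps mean (rho = 0.5), then an
# aggregate of ten similar sources (illustrative numbers only).
print(ebw_guerin(pcr=6e6, rho=0.5, b=0.1, bmux=1e5, clp=1e-6))
print(ebw_gaussian(means=[3e6] * 10, stds=[1e6] * 10, clp=1e-6))
```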
Some Experimental Results
In this section we discuss the results of some experiments on shaping VBR MPEG video streams to generate bit streams that comply with an rt-VBR type ATM service contract. The VBR MPEG video streams are generated using the experimental set-up described earlier, that is, by defining the peak rate (e.g., 6 Mbps) over a group of pictures and the average rate (e.g., 3 Mbps) over the duration of the entire video clip. The encoder creates profiles of the bit streams in the first pass and generates specification-compliant bit streams in the second pass. The output of the second pass is the raw VBR MPEG video stream. This stream is parsed using an internally developed software package to produce a file which contains MPEG frame-level information such as frame number, frame type (I, P or B) and frame size. This information is then used to generate shaped (i.e., compliant with specific ATM service contract parameters) VBR MPEG video streams. The following steps can be used to achieve this:
(a) Choose an averaging interval; this can vary from one frame time (1/30 sec.) to a GOP period (e.g., 15 frame times or 0.5 sec.).
(b) Characterize the stream, i.e., determine the values of SCR and PCR (sustained and peak cell rates, respectively) and the MBS parameter using the GCRA for rt-VBR type ATM service (ATM Forum, 1996).
(c) Tune the SCR using a simple weighted averaging algorithm, e.g., vary the SCR tuning parameter x in the equation SCRnew = (1−x)·SCRold + x·PCRold, 0 < x < 1, and compute the new values for SCR and MBS; the value of PCR remains the same as found in step (b).
(d) Repeat steps (a)-(c) with a new value of the averaging interval.

Figure 13: Transmission bandwidth allocation for (a) CBR and (b) VBR video services
Figure 13(a): bandwidth (Mbps) versus time for CBR service; the natural video stream is smoothed to a constant rate using the rate control buffer in the encoder, at the cost of some loss of video quality, and the gap between the allocated constant rate and the smoothened video stream is wasted bandwidth.
Figure 13(b): bandwidth (Mbps) versus time for VBR service; a constrained or controlled variable rate video stream is allocated its effective rate, which lies between the average rate and the peak rate of the natural video stream, yielding bandwidth savings in comparison with peak rate based bandwidth allocation.
Once again, the flow of activities for generating, analyzing and shaping the video streams is as presented in Figure 8. The results for the high-action NTSC movie demonstrate the following. The value of MBS is higher when the averaging interval is larger, e.g., 15 frames; the value of MBS decreases when the averaging interval is smaller, e.g., one frame. Also, as the SCR approaches the PCR value, the MBS decreases and asymptotically reaches a constant value (faster for smaller averaging intervals), as shown in Figure 14. These results imply that rt-VBR type video streams degenerate to CBR type video streams (i.e., SCR = PCR, and MBS becomes large) when the averaging interval is large, i.e., greater than or equal to one GOP period. Figure 14 therefore demonstrates the significance of the averaging interval in maintaining the burstiness of the video streams.
The ratio between PCR and SCR is higher when the averaging interval is smaller, e.g., one frame; this is very distinct when the SCR value is close to the average rate of the bit stream. As the SCR becomes closer to the PCR value, the PCR/SCR ratio approaches unity and the stream degenerates into a CBR video stream. These results are shown in Figure 15.
The transmission bandwidth savings from using rt-VBR type ATM service could be approximately 20% over that needed for CBR type ATM transmission, as shown in Figure 16a and Figure 16b. Note that the effective bandwidth requirements for rt-VBR type ATM transmission are computed with a switch/multiplexer buffer size of 10 Kcells and a cell loss probability of 10^-6. Two independently developed techniques (Guerin, Ahmadi and Naghshineh, 1991; Mark and Ramamurthy, 1996) are used for computing the EBW (or EC) requirements. The relative effectiveness of these two methods is discussed in detail in Gringeri et al. (1998b).
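The characterization steps (a)-(d) can be sketched as follows. This is our illustrative reading, not the internally developed package mentioned above: frame sizes are averaged over the chosen interval, the SCR is tuned between the mean and the peak, and the MBS is estimated from the maximum backlog of a leaky bucket drained at SCR, converted to a burst length at PCR via the MBSs relation given earlier.

```python
# A sketch (ours) of the characterization/tuning workflow in steps (a)-(d).
CELL_BITS = 376  # 47 useful bytes per cell when two TS packets share 8 cells

def characterize(frame_bits, fps, interval, x):
    """frame_bits: per-frame sizes in bits; interval: frames per averaging
    window; x: SCR tuning parameter in SCR = (1-x)*Mean + x*Peak."""
    n = len(frame_bits) - len(frame_bits) % interval
    # (a) average the bit rate over each interval
    rates = [sum(frame_bits[i:i + interval]) * fps / interval
             for i in range(0, n, interval)]
    pcr = max(rates) / CELL_BITS                 # peak cell rate (cells/s)
    mean = sum(rates) / len(rates) / CELL_BITS
    scr = (1 - x) * mean + x * pcr               # (c) tune the SCR
    # (b) maximum backlog of a bucket drained at SCR, in cells
    dt = interval / fps
    backlog = worst = 0.0
    for r in rates:
        backlog = max(0.0, backlog + (r / CELL_BITS - scr) * dt)
        worst = max(worst, backlog)
    # Burst at PCR producing that backlog: backlog = MBS * (PCR - SCR) / PCR
    mbs = worst * pcr / (pcr - scr) if pcr > scr else float("inf")
    return pcr, scr, mbs

# (d) repeat over averaging intervals and tuning values, e.g.:
# for interval in (1, 5, 15):
#     for x in (0, 0.2, 0.4, 0.8):
#         print(interval, x, characterize(trace, 30, interval, x))
```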
Figure 14: Variation of the maximum burst size (MBS) value with averaging interval and sustained cell rate (SCR); MBS (cells, log scale) is plotted against SCR = (1−x)·Mean + x·Peak, for x from 0 to 0.95, with averaging intervals of 1, 5, 10 and 15 frames and one GOP (high-action movie)
Figure 15: Variation of the ratio between peak and sustained cell rate (PCR/SCR) with averaging interval and sustained cell rate (SCR); the PCR/SCR ratio (roughly 1 to 9) is plotted against SCR = (1−x)·Mean + x·Peak, for x from 0 to 0.95, with averaging intervals of 1, 5, 10 and 15 frames and one GOP (high-action movie)
Impact of Switch Buffer Size
For the transmission of constant bit rate data streams using CBR type ATM service, it can be shown that, as long as the bit rate of the stream does not exceed the link capacity, i.e., link capacity utilization remains less than 100%, the cell loss probability (CLP) becomes virtually independent of the available buffer space beyond a certain threshold. For example, on an OC-3 (155 Mbps) link, when the input cell rate is around 140 Mbps, a CLP of less than 10^-9 can be achieved (Pitts and Schormans, 1996) with less than 100 cells of buffer space. This, however, may not be the case if a burst of 140 cells arrives at the input of a busy channel where
there is only a 100-cell buffer in the switch; in this case the excess 40 (= 140 − 100) cells will be lost. These are pathological cases, and the cost of designing switches or buffering to handle them may be prohibitively high unless some other intelligent techniques are used. When a switch supports the transmission of real-time variable bit rate streams, the value of CLP depends not only on cell-level loss but also on burst-level loss, and hence the buffer requirements need to be dimensioned accordingly. Note that, for delay-sensitive traffic, burst-level loss can be associated with burst-level delay as well, and hence buffer dimensioning for protection against cell-level loss becomes a secondary concern.
As discussed earlier, the VBR video streams generated by an MPEG-2 encoder need to be reshaped, without sacrificing video quality, in order to use the parameters6 for CAC and to adapt the streams for transmission using the rt-VBR type ATM service. It appears that the value of MBS decreases, and the PCR/SCR ratio (hence the burstiness) increases, when the averaging interval is small, e.g., one frame period or 1/30 sec. for a video frame rate of 30 frames/sec. However, there exists a trade-off between the value of SCR and the MBS. Note that a larger value of MBS contributes to increased bandwidth savings and multiplexing gain7, but it may also call for a very large switch buffer, which may not be desirable for supporting the transmission of real-time video streams. This is because a large buffer may increase cell delay, delay variation and switching requirements, and a larger cell delay variation can cause underflow/overflow of the decoder's buffer, which may result in screen blanking and other types of image distortion (see Table 5 for details).
Figure 17 demonstrates the effects of varying MBS and SCR on the effective bandwidth, under the assumption that the switch buffer size (Bmux) is large enough to support the required MBS values. It appears that the EBW is almost always lower for higher values of MBS. Also, the EBW value is always minimum when SCR is equal to the stream's average value (i.e., when x=0.5). Therefore, given the value of the buffer size, the SCR values can be chosen appropriately so as to minimize the effective bandwidth requirements.

Figure 16: Comparison of bandwidth requirements for CBR and VBR video streams
Figure 16(a): bandwidth (Mbits/sec) versus SCR = (1−x)·Mean + x·PCR for a VBR stream (6 Mbps peak, 3 Mbps average) and a CBR stream (6 Mbps peak); the averaging interval is 15 frames. The ATM transmission overhead is [(424 − 376)/376] or ~13%.
Figure 16(b): bandwidth (Mbits/sec) versus SCR = (1−x)·Mean + x·PCR for averaging intervals of 5 and 15 frames, using a clip from a high-action movie; CBR uses peak cell rate (PCR) based bandwidth allocation, while VBR uses effective bandwidth (EBW) based bandwidth allocation.
Multiplexing of VBR MPEG Video Streams
When CBR type ATM service is used for the transmission of MPEG-2 video, the amount of bandwidth that needs to be allocated is the peak bit rate of the video stream. Thus, if the link capacity is 45 Mbps (a DS3 link) and each video stream has a peak bit rate of 6 Mbps, then the number of video streams that can be multiplexed together (assuming that the delay/delay-variation constraints are satisfied) is [45/6] or 7 streams, with a link capacity utilization of [(7×6)/45] or about 93%. However, when rt-VBR type ATM service is used, it may be possible to allocate bandwidth to the connection according to the effective bandwidth (EBW) requirements. For example, if the peak bit rate of the video stream is 6 Mbps and the mean bit rate is 3 Mbps, then, by selecting an appropriate averaging interval, if the SCR can be tuned to a value close to the mean bit rate of the stream, and the MBS value also remains within the available buffer space in the switch (e.g., less than 20 Kcells) for an acceptable cell loss rate (hence video quality), EBW can be used for video transmission using rt-VBR type ATM service. For the above case, the EBW or EC becomes approximately 4.5 Mbps. Consequently, for the same 45 Mbps transmission link, the number of uncorrelated video streams that can be multiplexed together becomes [45/4.5] or 10 streams, with (45/45) or 100% utilization of the link capacity. The corresponding savings in link capacity from using VBR video is [7×(6 − 4.5)] or 10.5 Mbps. Consequently, 43% [= (10 − 7)/7] more video streams can be packed into the same transmission bandwidth.
For the transmission of multiplexed constant bit rate video streams using CBR type ATM service, it can be shown that, as long as the aggregated bit rate of all the streams does not exceed the link capacity, i.e., link capacity utilization remains less than 100%, the cell loss probability (CLP) becomes virtually independent of the available buffer space beyond a certain limit.
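The DS3 sizing arithmetic in the preceding paragraph can be reproduced with a few lines of code (a sketch; the stream parameters are those of the example above).

```python
# The DS3 sizing arithmetic from the paragraph above, made explicit.
link = 45.0            # Mbps (DS3)
peak, ebw = 6.0, 4.5   # Mbps per video stream (peak rate vs. effective bandwidth)

n_cbr = int(link // peak)   # 7 streams under peak-rate allocation
n_vbr = int(link // ebw)    # 10 streams under EBW-based allocation

print(f"CBR: {n_cbr} streams, utilization {n_cbr * peak / link:.1%}")  # 93.3%
print(f"VBR: {n_vbr} streams, utilization {n_vbr * ebw / link:.1%}")   # 100.0%
print(f"Extra streams with VBR: {(n_vbr - n_cbr) / n_cbr:.0%}")        # 43%
```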
Figure 17: Variation of effective bandwidth with multiplexer buffer size for a high-action movie; the effective bandwidth (Mbits/sec) is plotted against SCR = (1−x)·Mean + x·Peak for Bmux values of 5,000, 10,000, 20,000 and 30,000 ATM cells; the averaging interval is 5 frame-periods, or 5/30 sec.
For example, on an OC-3 link (155 Mbps), when about 140 streams (each with a PCR of 1 Mbps) are multiplexed together, a CLP of less than 10^-9 can be achieved (Pitts and Schormans, 1996) with less than 100 cells of buffer space. This, however, may not be the case if all the sources generate cells at the same time, i.e., 140 sources generate 140 cells where there is only a 100-cell buffer in the switch; in this case the excess 40 (= 140 − 100) cells will be lost. In addition, the CLP usually becomes smaller when the ratio between the switch buffer size and the MBS is larger. However, when a number of real-time VBR video streams are multiplexed together, the final value of the switch buffer must be chosen as:
Bmux-final = Minimum { Bmux-1, Bmux-2, ..., Bmux-N }
where Bmux-i is the size of the switch buffer needed to support the shaped (or resultant) MBS value of the i-th VBR video stream. This may affect the loss criterion for some of the streams; the effective bandwidth (EBW) values for those streams need to be recalculated (see Figure 17) using the new value of the switch buffer size (Bmux-final). It is therefore recommended that the individual VBR video streams be shaped in such a way that the Bmux requirements of each of them (i.e., the values of Bmux-i) are very close to each other. In addition, the number of real-time bursty sources (N) being multiplexed together should be restricted in such a way that the instantaneous total input rate only rarely (e.g., one in tens of thousands of cell slots) exceeds the cell slot rate; it should be assumed that all the excess-rate cells are lost. In terms of probability, the design should be such that Prob[(N×PCR) > LinkCapacity] is less than 10^-4.
Consequently, an OC-3 switch/link with a large buffer (e.g., more than 100 Kcells) may be sufficient for serving multiplexed random and/or non-real-time VBR streams, but for real-time VBR video streams, a higher capacity switch/link (e.g., an OC-12) with a smaller buffer (e.g., 25 Kcells) would be more useful, and is hence recommended.
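The two dimensioning rules above can be sketched as follows. The buffer rule is exactly the minimum just stated; for the admission rule we assume, purely for illustration, independent on-off sources that are each at peak rate with probability mean/peak (this source model is our assumption, not the chapter's).

```python
# A sketch of the two dimensioning rules above, under an assumed
# independent on-off source model for the overload probability.
from math import comb

def bmux_final(per_stream_buffers):
    """Bmux-final = min{Bmux-1, ..., Bmux-N}, as specified in the text."""
    return min(per_stream_buffers)

def max_admissible(link_mbps, pcr_mbps, p_on, target=1e-4):
    """Largest N with Prob[aggregate instantaneous rate > link] < target,
    assuming each source is independently 'on' (at PCR) with prob. p_on."""
    k_max = int(link_mbps // pcr_mbps)   # most sources that fit at peak
    n = 1
    while True:
        overload = sum(comb(n, k) * p_on**k * (1 - p_on)**(n - k)
                       for k in range(k_max + 1, n + 1))
        if overload >= target:
            return n - 1
        n += 1

print(bmux_final([12000, 9000, 15000]))      # -> 9000 cells
print(max_admissible(45.0, 6.0, p_on=0.25))  # -> 8 streams on a DS3 link
```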
The situation becomes more complicated (and interesting, albeit challenging!) when the switch supports data transmission from a mixture of CBR, rt-VBR and/or other types of random traffic sources using the same buffer space. One technique to solve this problem is to hard-partition the total switch buffer in order to allocate a small amount of buffer for the high-priority service to be used by rt-VBR and CBR type traffic.
There are two potential problems associated with blindly packing a larger number of VBR video streams over the same link instead of using constant bit rate video streams. First of all, in switched digital video architectures (e.g., fiber to the node/curb), channel grouping is not necessarily always maintained during switching. Therefore, it may be necessary to use some transcoding techniques (hence incurring additional delay) when regrouping the video streams. Secondly, since each variable rate video stream has its own traffic contract, it may occasionally be necessary to switch and manipulate the streams independently in order to avoid potential 'explosions' in the demand for link bandwidth. These 'explosions' usually occur under blind statistical multiplexing when the peaks of multiple video streams occur (or tend to occur, even when the streams are not highly correlated) at the same time during transmission. Thus, the data stream obtained by blindly multiplexing a group of variable bit rate video streams may not always form a constant/constrained bit rate stream.
One way to prevent the occurrence of these 'explosions' is to artificially insert sporadic gaps in the colliding GOPs of the video streams, as shown in Figure 18. The decoder can 'play' previously received frame(s) for the gap(s) it receives in the GOP. These gaps help maintain the bit rate characteristics of the video stream. The details of inserting gap(s) in the GOP will be discussed in a future article. Other possible techniques for controlling the bit rate and burstiness of individual video streams are as discussed earlier in this chapter. In addition, for controlling the burstiness and bit rate variability of the multiplexed video stream, one or more of the following methods can be used.
(a) Use feedback from the network to dynamically regroup the channels and/or renegotiate the traffic contracts of the video streams; these are not very attractive because of the additional processing, signaling and management overheads involved.
(b) Multiplex different categories of video streams which are not significantly correlated, e.g., multiplex a few high-action movie streams with a large number of news and talking-head type video streams.
(c) Mix real-time video streams with delayed and/or non-real-time video streams.
The details of these techniques remain open for future investigation.
CONCLUSIONS AND FUTURE WORK
In this chapter, we have discussed the applications, challenges and benefits of using VBR MPEG-2 video in comparison with CBR type video. For CBR encoding of video, the output rate control buffer dictates the quantization level used in the macroblock and slice layers of coding; no further shaping of the output bit stream is needed. For VBR encoding of video, source/coding rate control in the encoder, shaping of the encoder's output bit stream and network feedback-based encoding rate control can be used. Because of the complexities involved in implementation and management, the network feedback-based rate control technique is not very attractive. Degradation of video quality due to rate control/adaptation depends not only on the video characteristics and category (e.g., action movie, cartoon, sports, talking head, news, etc.), but also on the rate adaptation technique used.
We have analyzed MPEG-2 video traces and have determined their traffic contract parameters for rt-VBR type ATM service using an output-shaping-based rate control technique.
The required traffic parameters are determined for (i) different values of SCR, where SCR is varied from the average rate to within a few percent of the PCR, and (ii) a number of different averaging intervals, ranging from one frame (i.e., 1/30th of a second for 30 frames/sec video) to an entire GOP (e.g., 0.5 sec. when the GOP length is 15 frames in a 30 frames/sec video). The burstiness of MPEG-2 coded video streams varies widely depending on the category of the video being encoded. The averaging interval has a significant impact on the values of the traffic contract parameters. Typically, the MBS values are higher and the PCR values are lower when traffic is averaged over an entire GOP structure as opposed to averaging over a few frames. As the SCR value is increased from the average value, the MBS decreases and approaches a minimum asymptote. The decrease is sharpest when the rates are averaged over a few frames as opposed to an entire GOP. PCR/SCR ratios are highest for rates averaged over a sub-GOP interval.
Currently available ATM and other packet switches seem to have sufficiently large buffer space to support the transmission of VBR type MPEG video streams so as to yield significant savings in network bandwidth. It appears that, using only the output-shaping-based rate control technique, it is possible to achieve a 20% savings in transmission bandwidth without any significant loss in video quality, and the storage space savings always tend to be higher. Also, realistic8 values of the traffic contract parameters for supporting transmission of VBR type MPEG-2 video using rt-VBR type ATM service can be achieved, although short-term burstiness may reduce the savings for transmission applications. We realize that savings in transmission bandwidth can also be achieved by using source rate control of MPEG-2 encoding in such a way that the characteristics of the video stream adhere to the traffic contract with practical8 MBS values. That way the demand on ATM networks may be less stringent, and shaping of the encoder's output may not be needed at all.
Figure 18: Statistical multiplexing of VBR video streams: (a) blind multiplexing, which can produce an explosion in bandwidth requirements over time; (b) multiplexing with gap(s) inserted in the GOP, which avoids the explosion
Consequently, our current efforts are devoted to implementing some of the proposed source/coding rate control methods (e.g., using a constant quantization scale) using the test model 5 (TM5) software of MPEG-2 encoding. The objective is to compare the results with those obtained when only output-shaping-based rate control is used for generating rt-VBR type ATM service compliant bit streams. We plan to integrate the results in such a way that combined source rate control and output-shaping-based (channel rate) control can be achieved, in order to maximize the savings in nodal (processing and buffer) and network transmission resources. We are also simulating the encoder's output shaping and the effects of multiplexing a number of VBR video streams, with the objective of verifying the EBW metric obtained using the analytical formulations. In addition, some transcoding techniques are being developed to effectively control the burstiness of the video streams.
We recommend that video encoding, multiplexing and transmission equipment manufacturers expand the capabilities of their equipment to enable realization of the maximum benefits of VBR transmission of MPEG video. Emerging telecom and video service providers should press their vendors to add these special features to enable more effective utilization of the transmission and storage resources already deployed in their switched digital video (e.g., FTTC/N) networks.
ENDNOTES
1. It is interesting to note that digital television promises cinema-quality audio and video, and a whole new set of digital data services. Also, the Federal Communications Commission (FCC) of the USA recently mandated that all video transmission in the USA should be digital by the year 2006.
2. Direct broadcast satellite (DBS) networks use geo-stationary satellites, which orbit the earth at approximately 22,300 miles above the surface.
3. Re-encoding a video stream with a different set of constraints but without fully decoding it.
4. A recent buyers' guide in a multimedia magazine (http://www.msdmag.com/, 1999) reveals that almost 70% of the commercially available MPEG decoders implement some sort of error concealment technique.
5. That is, multiplexing without taking into consideration the instantaneous availability of buffer, processing and transmission capacity of the switch.
6. Peak cell rate (PCR), sustained cell rate (SCR) and maximum burst size (MBS).
7. This happens because of the increased burstiness of the stream.
8. That is, an MBS value lower than the available switch/multiplexer buffer space, and a PCR value much larger than the SCR.
ACKNOWLEDGMENTS
The author would like to thank Alvin O. Jimenez-Rivera of the University of Puerto Rico-Mayaguez, Puerto Rico, and Steve Gringeri and Roman Egorov of GTE Labs. Inc., Waltham, MA, USA. They contributed to designing and running the experiments reported in this chapter. Bert Basch and Dean Casey of GTE Labs, Inc. provided additional support without which this work could not have been completed. Finally, Ashmita, Inrava and Srijesa excused me for the time spent writing this chapter, which should have been spent with them; I am greatly indebted to them.
REFERENCES
ATM Forum. (1996). ATM Traffic Management Specification Version 4.0.
Chiariglione, L. (1997). MPEG and multimedia communications. IEEE Transactions on Circuits and Systems for Video Technology, 7(1).
Gringeri, S., Khasnabish, B., Lewis, A., Shuaib, K., Egorov, R. and Basch, B. (1998a). Transmission of MPEG-2 video streams over ATM. IEEE Multimedia Magazine, 5(1).
Gringeri, S., Shuaib, K., Egorov, R., Lewis, A., Khasnabish, B. and Basch, B. (1998b). Traffic shaping, bandwidth allocation and quality assessment for MPEG video distribution over broadband networks. IEEE Network Magazine, 12(6).
Guerin, R., Ahmadi, H. and Naghshineh, M. (1991). Equivalent capacity and its application to bandwidth allocation in high-speed networks. IEEE Journal on Selected Areas in Communications (J-SAC), 9(7).
Haskell, B. G., Puri, A. and Netravali, A. N. (1997). Digital Video: An Introduction to MPEG-2. New York, NY, USA and London, UK: Chapman & Hall.
Khasnabish, B. (1997). Broadband to the home (BTTH): Architectures, access methods and the appetite for it. IEEE Network Magazine, 11(1).
Khasnabish, B. and Banerjea, A. (1998). Transmission and distribution of digital video, part I: Architecture, control and management, guest editorial. IEEE Network Magazine, 12(6).
Khasnabish, B. and Banerjea, A. (1999). Transmission and distribution of digital video, part II: Field trials and prototype implementations, guest editorial. IEEE Network Magazine, 13(2).
Kwok, T. (1998). ATM: The New Paradigm for Internet, Intranet, and Residential Broadband Services and Applications. New Jersey, USA: Prentice Hall.
Mark, B. L. and Ramamurthy, G. (1996). UPC-based traffic descriptors for ATM: How to determine, interpret and use them. Telecommunication Systems, 5.
Pitts, J. M. and Schormans, J. A. (1996). Introduction to ATM: Design and Performance. England: John Wiley and Sons Ltd.
Schuster, G. M. and Katsaggelos, A. K. (1997). Rate-Distortion-Based Video Compression. Boston, MA, USA: Kluwer Academic Publishers.
Tatipamula, M. and Khasnabish, B. (Eds.). (1998). Multimedia Communications Networks: Technologies and Services. Boston, MA, USA: Artech House Publishers.
LIST OF ACRONYMS
AAL: ATM Adaptation Layer
ABR: Available Bit Rate
ATM: Asynchronous Transfer Mode
CAC: Connection Admission Control
CATV: Cable TV or Community Access TV
CBR: Constant Bit Rate
CDV: Cell Delay Variation
CDVT: Cell Delay Variation Tolerance
CLP: Cell Loss Probability
CO: Central Office
CPE: Customer Premise Equipment
DBS: Direct Broadcast Satellite
DCT: Discrete Cosine Transform
DVD: Digital Versatile Disk
EBW: Effective BandWidth
EC: Equivalent Capacity
ETE: End To End
FPS: Frames Per Second
FTTX: Fiber To The X (X = Node, Curb, Home)
GCRA: Generic Cell Rate Algorithm of ATM
GOP: Group Of Pictures
HFC: Hybrid Fiber Coax
HTML: HyperText Markup Language
HTTP: HyperText Transfer Protocol
MBS: Maximum Burst Size
ML@MP: Main Level and Main Profile
MPEG: Motion Picture Experts Group
NIC: Network Interface Card
NNI: Network to Network Interface
NPC: Network Parameter Control
NTSC: National Television System Committee
OAM&P: Operations, Administration, Management and Provisioning
ONU: Optical Network Unit
PCR: Peak Cell Rate in the context of ATM, and Program Clock Reference in MPEG
PDU: Protocol Data Unit
PES: Packetized Elementary Stream
POTS: Plain Old Telephone Service
PPL: Pixels (picture elements) Per Line
PS: Program Stream
PSTN: Public Switched Telephone Network
QoS: Quality of Service
rt-VBR: real-time VBR type ATM service
RTP: Real-time Transport Protocol
SCR: Sustained Cell Rate
SNR: Signal to Noise Ratio
TM5: Test Model 5 of MPEG
TS: Transport Stream
UNI: User to Network Interface
UPC: Usage Parameter Control
URL: Uniform Resource Locator
VBR: Variable Bit Rate
VBV: Video Buffer Verifier
xDSL: various types of Digital Subscriber Line technologies
Chapter X
VBR Traffic Shaping for Streaming of Multimedia Transmission Ray-I Chang, Meng-Chang Chen, Ming-Tat Ko and Jan-Ming Ho CCL, IIS, Academia Sinica, Taiwan
INTRODUCTION
In multimedia applications, media data such as audio and video are transmitted from a server to clients via a network according to some transmission schedule. Unlike conventional data streams, media transmission requires end-to-end quality-of-service (QoS) to provide jitter-free playback. Therefore, each data packet is assigned timing constraints for its transmission. When network resources are allocated exclusively in fixed-size chunks to serve different data streams, it is simple to support constant-bit-rate (CBR) transmission. Grossglauser and Keshav (1996) investigated the performance of CBR traffic in a large-scale network with many connections and switches. They concluded that the network queuing delay for CBR transmission is less than one cell time per switch, even under heavy loading. Besides, resource allocation and admission control are simple because there are no variations in resource requirements. However, media streams are notably variable-bit-rate (VBR) in nature due to the coding and compression technologies applied (Garrett and Willinger, 1994). The average data rate of an MPEG-1 movie is usually less than 25% of its peak data rate. This is inherently at odds with the goals of designing efficient real-time network transmission and admission control mechanisms capable of achieving high resource utilization (Sen et al., 1997). The conventional CBR service model, which allocates the peak data rate to transmit a VBR stream, would waste bandwidth; furthermore, it requires a large client buffer. To ameliorate this problem, we need a good traffic shaping algorithm that transmits VBR video in a less bursty (i.e., smoother) manner by exploiting different performance measurements. In a multimedia system, we usually measure the performance of a transmission schedule by the following four indices: peak bandwidth, network utilization, initial delay and client buffer.
• Peak bandwidth is the maximum network bandwidth allocated for media transmission. A user request is admitted if the peak bandwidth required is smaller than the available bandwidth of the current network.
• Network utilization is the ratio of the total bandwidth consumed to the total bandwidth allocated. Maximizing the network utilization is beneficial for the server and the network, as well as for a client who pays on a per-unit-time basis. Generally, higher network utilization means that more users can be served at the same time.
• Initial delay is the length of the time interval from the time the client sends the media request to the time the client starts playing the received data. It is an important QoS indicator for users.
• Client buffer acts as a reservoir to regulate the difference between the transmission rate and the playback rate. It is an important resource for users to prevent playback jitters, i.e., buffer overflow and underflow.
While serving a VBR media stream, a good transmission schedule is designed to minimize the peak bandwidth, initial delay and buffer size required, and to keep the network utilization as large as possible. Then, the variance of the transmission rates is minimized (smoothed) to reduce the traffic burstiness. Moreover, the end-to-end QoS of the media transmission needs to be guaranteed to support jitter-free playback (Lam et al., 1994; Ott et al., 1992).
In past years, different approaches have been proposed to shape the traffic burst for high network utilization, smaller buffer size and short initial delay. In McManus and Ross (1996), the CRTT (constant-rate transmission and transport) method was presented to transmit VBR media data at a constant bandwidth. Given the available transmission bandwidth and initial delay, CRTT minimizes the required buffer size by a dynamic programming technique. It takes O(n² log n) computation time, where n is the number of data frames of the media stream. In Sen et al. (1997), a speedup algorithm that takes O(n log n) computation time is proposed. It examines the tradeoffs between the transmission rate, the client buffer size, the network utilization and the initial delay. Although the CRTT transmission approach is simple in terms of admission control and transmission scheduling, it has the drawback of requiring a large buffer and delay. To reduce the required buffer, a piecewise CRTT (PCRTT) method was introduced, which evenly divides the media stream into sub-streams and applies CRTT to each sub-stream. Based on a similar idea, the RCBR (renegotiated CBR) method (Grossglauser et al., 1995) was proposed to use the average data rates in different sub-streams. Given the initial delay and client buffer, the MCBA (minimum changes bandwidth allocation) (Feng et al., 1996) and CBA (critical bandwidth allocation) (Feng and Sechrest, 1995) methods were proposed to minimize the number of bandwidth changes and the peak bandwidth required, respectively. In Salehi et al. (1996), the MVBA (minimum variability bandwidth allocation) method was proposed to minimize the bandwidth variation for media transmission by a shortest-path algorithm (Reibman and Berger, 1995). Although these conventional traffic shaping methods reduce some of the problem parameters in media transmission, they do not achieve optimal schedules that minimize the initial delay and the client buffer while maximizing the bandwidth utilization at the same time. For example, in MVBA, the allocated initial delay and the bandwidth utilization are not optimized, as discussed in Zhang and Hui (1997).
In this chapter, a novel traffic shaping approach is presented to optimize both the resource allocation and the resource utilization for VBR media transmission. We then extend this idea to online transmission problems. More issues concerning admission control can be found in Chang et al. (1998). The approach takes O(n log n) computation time to examine the tradeoffs between the transmission rate, the client buffer size, the network utilization and the initial delay. (Different from the CBR transmission schedule
considered in Sen et al. (1997), the problem setting considered in Chang et al. (1998), is a piecewise CBR transmission schedule.)
PROBLEM SETTING
In this chapter, we consider the end-to-end transmission of a VBR media stream. When a user request is presented, media data are first retrieved from the storage subsystem by a disk retrieval scheduler (Chang and Zakhor, 1994; Chen et al., 1993; Wang et al., 1997; Chang et al., 2000). The network transmission scheduler then transmits the retrieved data from server to client at the proper time. On the client side, incoming data are temporarily stored in the client buffer and consumed frame-by-frame periodically by the playback scheduler. If a frame arrives late or is incomplete at its playback time, unpleasant jittery effects will be perceived by the audience. To avoid jittery playback, the transmission schedule must always be ahead of its related playback schedule so that the client buffer does not underflow. On the other hand, the transmission schedule must avoid sending more data to the client buffer than the buffer can store; otherwise, overflow occurs, the client loses data, and extra bandwidth is required for retransmission. While serving a media stream, a good transmission schedule is designed to minimize resource allocation and maximize resource utilization without playback jitter. In this chapter, to concentrate on the formalization of the media transmission problem, the disk retrieval scheduler is assumed to always retrieve sufficient data before they are requested by the network transmission scheduler. More detailed descriptions of the design and implementation of a multimedia disk retrieval scheduler are given in Chang et al. (2000).
A media stream V can be represented by a set of frames { f0, f1, ..., fn-1 }, where n is the number of frames and fi is the i-th frame. We assume that the media stream is played at t = 0 and that the time to play the i-th frame is i·Tf, where Tf is the playback time interval between adjacent frames. (For example, Tf = 1/30 second in an MPEG video stream.) In this chapter, without loss of generality, we let Tf = 1 (unit time). The i-th accumulative frame size of V is Fi = Fi-1 + fi, with the initial value F-1 = 0. The media stream size is the total frame size |V| = Fn-1. As the client plays the media stream frame-by-frame periodically (fi is consumed at the i-th frame time), the playback schedule can be denoted by its cumulative playback function F(.):

F(t) = Fx = Σ (i=0..x) fi ; ∀ x ≤ t < (x+1)

Figure 1 illustrates the relations between a media stream and its cumulative playback function. Note that F(.) is a nonnegative stair function with jumps at times t = 0, 1, ..., n-1. The lower corner and the upper corner at time t are F(t)- = F(t-1) and F(t)+ = F(t), respectively. Based on the same idea, we define the transmission schedule G(.) as the function that cumulates the amount of media data received at the client. Assuming that the media data are transmitted at rate r(t) at time t, the transmission schedule is defined as the integral of r(t):

G(t) = ∫ (x=0..t) r(x) dx
Figure 1: A VBR media stream (frame size fi versus time i, with playback interval Tf) and its cumulative playback function F(i)
Note that this function is continuous and monotonically non-decreasing. The peak bandwidth of the network channel allocated for media transmission is measured as follows:
Peak bandwidth: r = max{ r(t) | ∀ t }
Let ts = min{ t | r(t) > 0 } and te = max{ t | r(t) > 0 } be the start time and the end time of the transmission schedule G(.), respectively. The value tc = te − ts is the connection time of the network channel allocated (called the channel holding time in Sen et al., 1997). We can compute the network utilization of the allocated channel as follows:
Network utilization: u = |V| / (r × tc) × 100%
Network idle rate: 1 − u
According to the definition, G(t) is the amount of data sent by the server up to time t. Assume that there are no transmission errors and that the network delay is zero. Then G(t) also represents the amount of data received by the client up to time t. If the client starts the playback of the media stream at time 0, the initial delay is:
Initial delay: d = −ts
As the media data must be transmitted before being received and played, the start time ts < 0. Note that, at the client, G(t) and F(t) represent the cumulative data received and consumed up to time t, respectively. The buffer occupancy b(t) = G(t) − F(t) is the amount of transmitted data temporarily stored in the client buffer at time t. The minimal client buffer size required for media transmission and playback can be computed as follows:
Client buffer: b = max{ b(t) | ∀ t }
Obviously, b is no smaller than the maximum frame size and no larger than the stream size. An example illustrating the cumulative playback function, the initial delay and the buffer size is shown in Figure 2. In this chapter, a transmission schedule is said to be feasible if it can provide jitter-free playback. By definition, a jitter-free transmission schedule demands a complete media frame before its playback; the cumulative transmission function G(t) must not be smaller than the cumulative playback function F(t). Besides, the buffer occupancy must not be larger than the specified buffer size. A feasible transmission schedule satisfies the following conditions:
F(t) ≤ G(t) ≤ H(t)
where H(t) is the upper bound of G(t) and, in this chapter, H(t) = min{ |V|, F(t)- + b }.
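Under a simple discretization (one slot per frame time, which is our assumption), the four performance indices and the feasibility test can be computed directly from a cumulative schedule, as in the following sketch; here the buffer requirement is obtained as the maximum occupancy rather than being checked against a given H(t).

```python
# A minimal sketch (our discretization) of the definitions above: the four
# performance indices and the underflow test F(t) <= G(t) for a schedule G.
from itertools import accumulate

def metrics(frames, G, d):
    """frames: per-frame sizes; G[k]: cumulative bits received by slot k,
    where slot 0 starts transmission and playback starts at slot d."""
    F = [0] * d + list(accumulate(frames))   # cumulative playback function
    V = F[-1]
    rates = [G[0]] + [G[k] - G[k - 1] for k in range(1, len(G))]
    r = max(rates)                           # peak bandwidth (bits/slot)
    tc = len([x for x in rates if x > 0])    # crude connection time (slots)
    u = V / (r * tc)                         # network utilization
    b = max(g - f for g, f in zip(G, F))     # required client buffer
    feasible = all(f <= g for f, g in zip(F, G))
    return r, u, d, b, feasible

# Example: 4 frames; transmission starts 1 slot before playback (d = 1).
frames = [10, 30, 10, 10]
G = [20, 40, 50, 60, 60]                     # cumulative schedule
print(metrics(frames, G, d=1))               # (20, 0.75, 1, 30, True)
```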
Figure 2: The relations among the cumulative transmission function, the cumulative playback function, the initial delay and the client buffer size
STORED MEDIA TRANSMISSION
When designing a transmission schedule, two important resources are considered: network bandwidth and client buffer. In this chapter, a transmission schedule is said to be optimal if it allocates the minimal resources (both network bandwidth and client buffer) and has the maximal resource utilization. Given a media stream, we first introduce Algorithm-L to allocate the resources required for media transmission. Given a media stream V, the transmission schedule L(.) obtained with network bandwidth r is as follows.
Algorithm-L: L(t) = max{ F(t), L(t+1) − r }; ∀ 0 ≤ t < (n−1)
The initial value is L(n−1) = |V| = F(n−1). Note that the media data are transmitted to the client only when they are necessary for providing jitter-free playback. As the media data are transmitted and stored in the client buffer as late as possible, this algorithm determines the minimal buffer occupancy at any time t. It achieves the minimal client buffer size b = max{ L(t) − F(t) | ∀ t } and the minimum initial delay d = L(0)/r for the given transmission bandwidth r. Besides, given any transmission schedule G(.) with peak bandwidth r, we have L(t) ≤ G(t) for any time t. The achieved result L(.) is called the minimal r-bounded transmission schedule.
Lemma-1: L(.) is the minimal r-bounded transmission schedule.
Lemma-2: L(.) requires the minimal buffer size and initial delay among all r-bounded transmission schedules.
In this chapter, Algorithm-L is introduced to allocate the minimal resources required. Based on the system resources allocated, Algorithm-A is introduced to maximize the utilization of the bounded resources. The obtained transmission schedule A(.) is as follows:
Algorithm-A: A(t) = min{ H(t), A(t−1) + r }; ∀ 0 ≤ t < (n−1)
The initial values are A(−d) = 0 and A(0) = L(0), where the minimal initial delay d is decided
by Algorithm-L. Note that, as the media data are transmitted to the client as early as possible, this algorithm maximizes the utilization of the given resources: the transmission bandwidth r and the client buffer b. The obtained transmission schedule is more robust against network errors. Besides, for any transmission schedule G(.) that has the same transmission bandwidth r and client buffer b, we can prove that G(t) ≤ A(t) for any time t. The obtained result A(.) is called the maximal (r, b)-bounded transmission schedule.
Lemma-3: A(.) is the maximal (r, b)-bounded transmission schedule.
Lemma-4: A(.) has the maximal utilization of buffer size and network bandwidth among all (r, b)-bounded transmission schedules.
We have shown that, given any (r, b)-bounded transmission schedule G(.),
G(t) ≤ A(t) for all t. Let (r, b) be the minimal transmission bandwidth r and client buffer b obtained by Algorithm-L. The maximal (r, b)-bounded transmission schedule A(.) determines the upper bound of the transmission schedules that optimize both the resource allocation and utilization. It has the minimal end time te for all (r, b)-bounded transmission schedules. In this subsection, by applying Algorithm-L and the minimal end time te, we can determine the minimal (r, b)-bounded transmission schedule R(.) as follows:
R(t) = max{ F(t), R(t+1) - r }; ∀ 0 ≤ t < (n-1)

The initial value R(te) = |V| = A(te). Given any transmission schedule G(.) that has the optimal resource allocation and utilization, we can prove that R(t) ≤ G(t) for all t. R(.) determines the lower bound of the transmission schedules that optimize both the resource allocation and utilization; R(.) is the minimal (r, b)-bounded transmission schedule. Figure 3 shows the upper bound A(.) and the lower bound R(.) for all the optimal transmission schedules with the same peak bandwidth r, buffer size b, initial delay d and network utilization u. Instead of giving a fixed schedule result, our approach gives the upper bound and the lower bound of the optimal transmission schedules. It allows users to determine their own optimal schedules under various QoS requirements and resource constraints to support differentiated services.

Figure 3: An example that applies MVBA to the upper bound and the lower bound of the optimal transmission schedules

For example, if we want to smooth the variance of the transmission bandwidths applied, we can apply the MVBA algorithm (Salehi et al., 1996) to the upper bound and the lower bound of the optimal transmission schedules. (The proof of the optimal smoothness is similar to the proof shown in Salehi et al., 1996.) The result obtained not only has the minimal bandwidth variance but also the optimal resource allocation. Moreover, the resource utilization is also optimized. It is better than the original MVBA algorithm, which does not provide the optimal initial delay and network utilization. The same idea can be applied to minimize the number of bandwidth changes by MCBA (Feng et al., 1996).
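A minimal sketch of the two recursions may make the backward (Algorithm-L) and forward (Algorithm-A) passes concrete. The discretization to integer frame slots is ours, and for simplicity we take H(t) = min(|V|, F(t) + b), ignoring the F(t)- subtlety:

    def algorithm_L(F, r):
        # Backward pass: send data as late as possible under peak rate r.
        n = len(F)
        L = [0.0] * n
        L[n - 1] = F[n - 1]                      # L(n-1) = |V|
        for t in range(n - 2, -1, -1):
            L[t] = max(F[t], L[t + 1] - r)       # Algorithm-L recursion
        b = max(L[t] - F[t] for t in range(n))   # minimal client buffer
        d = L[0] / r                             # minimal initial delay
        return L, b, d

    def algorithm_A(F, r, b, L0):
        # Forward pass: send data as early as possible under rate r, buffer b.
        n = len(F)
        V = F[n - 1]
        H = [min(V, F[t] + b) for t in range(n)]  # buffer/stream-size bound
        A = [0.0] * n
        A[0] = L0                                 # A(0) = L(0)
        for t in range(1, n):
            A[t] = min(H[t], A[t - 1] + r)        # Algorithm-A recursion
        return A

    F = [1.0, 3.0, 4.0, 7.0, 8.0]                 # toy cumulative playback curve
    L, b, d = algorithm_L(F, r=2.0)
    A = algorithm_A(F, r=2.0, b=b, L0=L[0])
    print(L, b, d, A)

Any schedule G with F(t) ≤ G(t) ≤ H(t) then lies between the printed L (lower) and A (upper) curves, which is exactly the schedulable band of Figure 3.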
Experimental Results

In this subsection, a two-hour-long MPEG-encoded movie, Star Wars, is examined to evaluate the effectiveness of the proposed algorithm. This movie includes 174,136 frames. The frame rate is 24 fps (the number of frames played per second) and the average frame size is 1.9 KB. We have tested the input media stream with different performance parameters, such as the buffer size, the initial delay, the network bandwidth and the network idle rate obtained. Note that our algorithm is proved optimal in resource allocation and utilization. The relation between the initial delay and the client buffer size required is shown in Figure 4(a). Figure 4(b) presents the relation between the initial delay and the network idle rate obtained. Experimental results for more test media streams can be found in Chang et al. (1997).

Figure 4: (a) The relation between the initial delay and the client buffer size required; (b) the relation between the initial delay and the network idle rate obtained
ONLINE MEDIA TRANSMISSION

Window-Based Online Traffic Smoothing

For an online-generated media stream V = { f0 f1 f2 ... }, we only know the sizes of the frames that have been generated (e.g., f0, f1, ..., and ft at time t). Besides, as some media data may already be transmitted (e.g., f0 f1 ... fx where x ≤ t), only a limited amount of media data can be applied for traffic smoothing. The available media data can be viewed as a window of the entire media stream, so it is intuitive to apply a window-based scheme to smooth the available media data. Given the available client buffer and the playback delay, the goal of a good online window-based traffic-smoothing algorithm is to minimize the allocated peak bandwidth with the minimum computation overhead. As shown in Figure 5, the playback delay D is defined as the time between the generation and the playback of the first media frame; the time d is between the transmission and the playback of the first media frame. As the playback starts at t = 0, the first frame f0 must be generated at time -(d + W) and the server starts to transmit at time -d for continuous playback.

Figure 5: The window-based online smoothing method with window size W and playback delay D

The set wt = { ft+d, ft+d+1, ..., ft+d+W-1 } represents
the frames generated during the time period [t-W, t). In the window-based method, at any time t, the frames fj (j < t+d) are already transmitted, and the frames fj (j > t+d+W-1) are not generated yet. Therefore, only the frames in wt can be considered for traffic smoothing. As the media data in the window can be viewed as pre-stored data, the transmission schedule can be decided from the media data in this time window under the constraints of F(t) and H(t). Based on this concept, a window-based online traffic-smoothing method called SLWIN(k) was proposed in 1997 (Rexford et al., 1997). The constant value k represents the sliding distance of the window. SLWIN(k) is a static window-sliding approach: it periodically applies the traffic-smoothing algorithm to the online-generated media data with time period k. During every period, k new media frames are generated, and the specified window moves forward k frames to construct a new window. As the window size W is not smaller than k, these k newly generated frames can be pre-stored without being immediately transmitted. The server can smooth the traffic in this window to schedule the transmission of media data to the client in advance of each burst, so the required peak bandwidth can be reduced. A sketch of this sliding-window loop is shown below.
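The following skeleton is our reading of the SLWIN(k) control flow, not Rexford et al.'s code; smooth_window stands in for whatever offline smoothing routine (e.g., MVBA) is applied to each window, and the commit-first-k semantics and the (rate, duration) segment format are simplifying assumptions:

    def slwin(frames, W, k, smooth_window):
        # frames: frame sizes in generation order (assumed fully known here
        # for simplicity; online, each window would wait for k arrivals).
        schedule = []                # transmission segments over all windows
        start = 0
        while start < len(frames):
            window = frames[start:start + W]   # the W frames available now
            # Commit only the first k frames' worth of transmission; the
            # rest will be re-smoothed when the window slides forward.
            schedule.extend(smooth_window(window, commit=k))
            start += k                         # static sliding distance
        return schedule

    def constant_rate(window, commit):
        # Trivially simple smoother used only to make the sketch runnable:
        # transmit the window at its average rate for `commit` slots.
        rate = sum(window) / len(window)
        return [(rate, min(commit, len(window)))]

    print(slwin([3, 1, 4, 1, 5, 9, 2, 6], W=4, k=2,
                smooth_window=constant_rate))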
Dynamic Sliding Distance

The drawback of SLWIN(k) is in the selection of the sliding distance k. As the traffic-smoothing algorithm is applied periodically, SLWIN(k) requires O(n * W / k) time, where n is the number of media frames. Experiments show that, in such a static window-sliding method, it is hard to decide the best sliding distance for minimizing both the peak bandwidth and the computation cost. If the maximum sliding distance k = W is applied, SLWIN(W) has the minimum computation cost O(n); however, the required peak bandwidth is large. Although SLWIN(1) with the minimum sliding distance 1 can achieve a small peak bandwidth, it requires O(n * W) computation cost. Note that, to reduce the peak bandwidth required for traffic smoothing, the bandwidth allocated for transmitting each frame should intuitively be decided with as much information about future frames as possible. Given a constant window size W, the number of future frames used to allocate bandwidth for each frame is decided by the sliding distance k (k ≤ W). For minimizing the computation cost, the applied sliding distance should be as large as possible. Our basic idea is to find the maximum sliding distance for each window without increasing the peak bandwidth allocated by SLWIN(1). Thus, given the same client buffer, playback delay and window size, our allocated peak bandwidth can be the same as that achieved by SLWIN(1). A simple example describing the basic idea of the proposed algorithm is shown in Figure 6.

Figure 6: By comparing the obtained transmission schedules for two successive windows, we can decide the maximum possible sliding distance

We compare the traffic-smoothing results for two successive windows wi and wi+1 obtained by SLWIN(1). Based on the applied transmission segment construction algorithm (MVBA), we can easily prove that only the last transmission segment at wi will be adjusted at wi+1. The bandwidths allocated for the other transmission segments are decided and will not be changed at future windows. Thus, we can decide the dynamic sliding distance for wi by selecting the start point of the last transmission segment as the start point of the next window. The sliding distance is 1 if there is only one transmission segment. In the example of Figure 6, our approach only executes the traffic-smoothing algorithm at windows wi and wi+5; it is not necessary to execute the traffic-smoothing algorithm at windows wi+1, wi+2, wi+3 and wi+4, as the conventional SLWIN(1) algorithm does. In our dynamic sliding distance scheme, we keep the media data in the last transmission segment for the next smoothing window. Thus, in each window, the most information about the media data (in terms of the limited window size) can be applied to smooth the coming traffic. Given an online media stream, the client buffer, the playback delay and the window size, the transmission schedule obtained by our above algorithm (which combines the proposed
dynamic sliding distance scheme and the MVBA scheme) is the same as that obtained by SLWIN(1). As our sliding distance is usually large, the required computation cost is less than that of SLWIN(1). Based on SLWIN(1), our above algorithm minimizes the peak bandwidth allocated for each window independently. For example, assume that the peak bandwidth allocated for the previous windows is large and the traffic size in the current window is small. If the traffic in different windows is smoothed independently, as with the above algorithm, the bandwidth allocated for transmitting the current window is minimized without considering the peak bandwidth allocated in the previous windows. Now assume that the traffic is bursty in the next windows. By applying the above algorithm, as with SLWIN(1), the allocated peak bandwidth would be large. As concluded in offline traffic-smoothing methods, media frames should be transmitted ahead of their playback times, as early as possible, to reduce the required bandwidth; this is called the aggressive work-ahead scheme for media transmission. Different from the SLWIN(1) approach, the aggressive work-ahead scheme smooths the media data in each window in a dependent way: the peak bandwidth allocated for the previous windows is considered as an input parameter when smoothing the traffic in the current window. Even though the traffic size of the current window is small, we still utilize the peak bandwidth to transmit the media data as early as possible. Thus, the peak bandwidth allocated for transmitting the bursty traffic in the next windows is smaller than that obtained by SLWIN(1). Based on the aggressive work-ahead scheme, we dynamically select the best sliding distance to keep more information available for smoothing the next un-transmitted media data. However, in some cases, the aggressive scheme must be applied in a lazy manner. We can prove that the peak bandwidth allocated by the proposed algorithm is better than that allocated by SLWIN(1).
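As a sketch, the dynamic sliding distance can be read off the segment list produced for a window: everything before the final segment is committed, and the next window starts where the final segment starts. The segment representation (start_frame, rate) is our own illustrative choice:

    def dynamic_slide(segments, window_start):
        # segments: [(start_frame, rate), ...] for the current window.
        # All segments except the last are final; the last one may still be
        # adjusted when more frames arrive, so the next window begins there.
        if len(segments) <= 1:
            return 1                      # only one segment: slide by 1 frame
        last_start = segments[-1][0]
        return max(1, last_start - window_start)

    # Example: three segments computed for a window starting at frame 10.
    print(dynamic_slide([(10, 2.0), (14, 3.5), (17, 1.2)],
                        window_start=10))   # -> 7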
Proposed Online Traffic-Smoothing Algorithm

Assume that the media frames in the current window are fi+1 fi+2 ... fi+W, where fj (i+1 ≤ j ≤ i+W) is the first un-transmitted frame (i = -d and j = 0 for the first window). For each transmission segment, we denote (s, G(s)) and (e, G(e)) as the start-point and the end-point of the transmission. The variable rs represents the transmission bandwidth in this segment, where re-1 = re-2 = ... = rs. The allocated peak bandwidth can be computed as max{ ri | ∀ i }. For the first transmission segment in the window fi+1 fi+2 ... fi+W, the start-point is (i, G(i) = F(i) + Q), where Q = F(j-1) - F(i) is the pre-transmitted traffic size (called the backlog of the current window). Following the traffic-smoothing algorithm, given the start-point (s, G(s)) (usually the end-point of the previous transmission segment), we want to find the maximum index e for a jitter-free transmission segment that satisfies F(t)+ ≤ G(t) ≤ H(t)- for all time t. We can prove that the computation cost of the proposed traffic-smoothing algorithm is the same as that of SLWIN(W). A detailed description of the proposed algorithm and the sub-procedure "Jitter-Free" (for identifying the jitter-free transmission segment) is shown as follows.
Algorithm: Online Traffic-Smoothing

    // F(.) is the input cumulative playback function.
    // H(t) = F(t) + b, where b is the available buffer.
    // G(.) is the cumulative transmission function.
    Initialize the start point (s, G(s)) = (-d, 0), the end-point at time t = -1, and tH = tF = 0.
    Initialize the peak bandwidth r_max.  // r_max = the average rate of the 1st window.
    repeat {
        // The current window contains:
        // (a) the un-transmitted frames in the previous window, and
        // (b) the newly generated frames.
        ss = s;  // the start point of the current window
        repeat {
            t = t + 1;
            RF(t) = (F(t)+ - G(s)) / (t - s);  // RF(t): the lowest test rate at time t
            RH(t) = (H(t)- - G(s)) / (t - s);  // RH(t): the highest test rate at time t
            if ((RH(tH) < RF(t)) or (RF(tF) > RH(t))) {
                call procedure Jitter-Free;
            }
            else if (ft is not the last frame in the window) {  // try the next frame
                if ((RH(tH) >= RH(t)) and (H(tH)- < F(i+W))) { tH = t; }
                if (RF(tF) <= RF(t)) { tF = t; }
            }
        } until (ft is the last frame in the window)
        if (ft is the last frame in the stream) {  // end of stream
            rs = max{ min{ r_max, RH(tH) }, RF(tF) };
            e = s + (F(t) - G(s)) / rs;
            G(e) = F(t);
            output segment < G(s), G(e), rs >;  // the end of this transmission schedule
            r_max = max{ r_max, rs };
        }
        else if (s == ss) {  // there is only one transmission segment
            rs = max{ min{ r_max, RH(tH) }, RF(tF) };
            e = s + 1;
            G(e) = G(s) + rs;
            output segment < G(s), G(e), rs >;
            r_max = max{ r_max, rs };
            s = t = e;  // the next start point is (s, G(s)) = (e, G(e))
            tH = tF = s + 1;
        }
        else {
            // Do nothing when there is more than one transmission segment:
            // the earlier segments were already output by Jitter-Free.
        }
    } until (ft is the last frame in the stream)
    // End of algorithm

Procedure: Jitter-Free

    {
        RF(t) = (F(t)+ - G(s)) / (t - s);  // the lowest test rate at time t
        RH(t) = (H(t)- - G(s)) / (t - s);  // the highest test rate at time t
        if (RH(tH) < RF(t)) {  // the segment is upper-bounded by H(tH)
            G(tH) = H(tH)-;
            rs = RH(tH);
            output segment < G(s), G(tH), rs >;
            r_max = max{ r_max, rs };
            s = t = tH;
            tH = tF = s + 1;
        }
        else if (RF(tF) > RH(t)) {  // the segment is lower-bounded by F(tF)
            G(tF) = F(tF)+;
            rs = RF(tF);
            output segment < G(s), G(tF), rs >;
            r_max = max{ r_max, rs };
            s = t = tF;
            tH = tF = s + 1;
        }
    }
Experimental Results

In this subsection, we evaluate the proposed method by generating a Star Wars movie stream online. It has a large frame size and high frame-size variation, as seen in many other real-world media streams. Comparisons are made to the offline scheduler and the previous online scheduler (Rexford et al., 1997). The window size W used in the experiments is the number of frames in a GoP (group of pictures). The related playback delay (W + 1) is acceptable for real-world applications. The initial value of the peak bandwidth is assigned to be zero. Note that, if the smoothing window starts with a large I-frame, the required bandwidth will be higher. As the applied window size is the GoP size, the second window for traffic smoothing has to start with an I-frame. We extend the size of the first window to W + 1 to include this I-frame, as suggested in Rexford et al. (1997); this adjustment is intuitive. We evaluate the proposed method by three different parameters: the peak bandwidth allocated, the required window-sliding number and the network idle rate. Note that, in our proposed algorithm, the selected window-sliding sizes are not smaller than one; therefore, our required window-sliding number is no greater than that of SLWIN(1). This property can be easily proved. Given an acceptable buffer size b = 90 KB for real-world applications, Figure 7(a) shows that our required window-sliding number is only 75% of the window-sliding number obtained by SLWIN(1), the best previous method. When comparing the required bandwidth, our method's is more than 13% smaller than SLWIN(1)'s. The obtained network idle rate is more than 4% smaller than that of SLWIN(1), as shown in Figure 7(b). Experiments show that, when transmitting a real-world, highly bursty VBR stream, our method can utilize the allocated bandwidth effectively and efficiently. The required bandwidth and the network
idle rate are smaller than those obtained by previous methods under the same initial delay and buffer size. On the other hand, the proposed method has the same performance as previous methods for small buffer sizes. Comparing the required window-sliding number, our approach is better than SLWIN(1). Although the SLWIN(W) method with a W-frame hopping window could compute faster than our method, it has the drawbacks of requiring a higher bandwidth and a higher network idle rate.
Figure 7: Comparisons of the proposed method with SLWIN(1), SLWIN(W) and the offline scheduler on the basis of (a) window-sliding number and (b) bandwidth idle rate

CONCLUSION

In this chapter, a traffic-shaping scheme is introduced to decide suitable transmission
schedules. It is proved optimal in resource allocation and utilization. Experiments show that our algorithm achieves a marked improvement over the conventional approaches in both the client buffer size and the network idle rate obtained. The proposed approach is shown to be efficient and flexible in supporting continuous media transmission, and it can be easily extended to traffic smoothing of online media. Note that, instead of giving a fixed result, our approach also provides the schedulable region for all optimal transmission schedules. We have proved that all the schedule results in this region have the minimal initial delay and client buffer for the network channel applied. This provides the flexibility to design a more precise scheme for admission control (Chang et al., 1998; Chang, 1999). Note that, at any time t, A(t) represents the maximal amount of media data that could be received by a client without buffer overflow, and the minimal amount of media data that should have been received is R(t). Let the media stream V be packed as packets { p0, p1, ... } for network transmission, and let the cumulative size be Px = p0 + p1 + ... + px. When a transmission schedule G(.) is specified, packet px is transmitted/received at the time t where G(t) = Px. Based on the upper bound A(.) and the lower bound R(.), we can specify the schedulable region (sx, ex) for each data packet px to design a good network scheduling and flow/error control algorithm for multiplexing multiple media streams.
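For instance, a sketch of how the per-packet schedulable region could be derived from A(.) and R(.) follows; the slot discretization and the bisect-based inversion are our own illustrative choices:

    from bisect import bisect_left

    def schedulable_region(A, R, P):
        # A[t], R[t]: upper/lower bounds on cumulative data sent by slot t
        # (both non-decreasing); P[x]: cumulative size through packet p_x.
        # Packet p_x may be received no earlier than the first slot where A
        # reaches P[x], and no later than the first slot where R reaches it.
        regions = []
        for Px in P:
            sx = bisect_left(A, Px)   # earliest slot with A(t) >= Px
            ex = bisect_left(R, Px)   # latest slot: R(t) reaches Px here
            regions.append((sx, ex))
        return regions

    A = [2, 4, 6, 8, 8]               # toy upper bound
    R = [0, 2, 4, 6, 8]               # toy lower bound
    print(schedulable_region(A, R, P=[2, 5, 8]))  # -> [(0, 1), (2, 3), (3, 4)]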
REFERENCES

Apostol, T. M. (1973). Mathematical Analysis.
Chang, E. and Zakhor, A. (1994). Scalable video data placement on parallel disk arrays. IS&T/SPIE Symposium on Electronic Imaging Science and Technology.
Chang, R. I. (1999). An optimal approximation method to characterize the resource tradeoff functions for media servers. Multimedia Storage and Archiving Systems IV (SPIE-3846), 382-391.
Chang, R. I., Chen, M. C., Ho, J. M. and Ko, M. T. (1997). Designing the on-off CBR transmission schedule for jitter-free VBR media playback in real-time networks. IEEE RTCSA, 1-9.
Chang, R. I., Chen, M. C., Ho, J. M. and Ko, M. T. (1997). Optimizations of stored VBR video transmission on CBR channel. SPIE VVDC, 382-392.
Chang, R. I., Chen, M. C., Ho, J. M. and Ko, M. T. (1998). Characterize the minimum required resources for admission control of pre-recorded VBR video transmission by an O(nlogn) algorithm. IEEE ICCCN.
Chang, R. I., Chen, M. C., Ho, J. M. and Ko, M. T. (1999). An effective and efficient traffic smoothing scheme for delivery of online VBR media streams. IEEE INFOCOM.
Chang, R. I., Shih, W. K. and Chang, R. C. (1998). A new real-time disk scheduling algorithm and its application to multimedia systems. Lecture Notes in Computer Science (LNCS-IDMS98).
Chang, R. I., Shih, W. K. and Chang, R. C. (2000). Real-time disk scheduling for multimedia applications with deadline-modification-scan scheme. Real-Time Systems.
Chen, M. C., Kandlur, D. D. and Yu, P. S. (1993). Optimization of the grouped sweeping scheduling (GSS) with heterogeneous multimedia streams. ACM Multimedia Conference, 235-242.
Feng, W., Jahanian, F. and Sechrest, S. (1996). Optimal buffering for the delivery of compressed prerecorded video. IASTED International Conference on Networks.
Feng, W. and Sechrest, S. (1995). Smoothing and buffering for delivery of prerecorded compressed video. IS&T/SPIE Multimedia Computing and Networking, 234-242.
Garrett, M. and Willinger, W. (1994). Analysis, modeling and generation of self-similar VBR video traffic. ACM SIGCOMM, 269-280.
Grossglauser, M., Keshav, S. and Tse, D. (1995). RCBR: A simple and efficient service for multiple time-scale traffic. ACM SIGCOMM.
Grossglauser, M. and Keshav, S. (1996). On CBR service. IEEE INFOCOM.
Knightly, E. W., Wrege, D. E., Liebeherr, J. and Zhang, H. (1995). Fundamental limits and tradeoffs of providing deterministic guarantees to VBR video traffic. ACM SIGMETRICS, 47-55.
Krunz, M. and Hughes, H. (1995). A traffic model for MPEG-coded VBR streams. ACM SIGMETRICS, 47-55.
Lam, S. S., Chow, S. and Yau, D. K. Y. (1994). An algorithm for lossless smoothing of MPEG video. ACM SIGCOMM.
McManus, J. M. and Ross, K. W. (1996). Video on demand over ATM: Constant-rate transmission and transport. IEEE INFOCOM.
Ott, T., Lakshman, T. V. and Tabatabai, A. (1992). A scheme for smoothing delay-sensitive traffic offered to ATM networks. IEEE INFOCOM.
Reibman, A. R. and Berger, A. W. (1995). Traffic descriptors for VBR video teleconferencing over ATM networks. IEEE/ACM Transactions on Networking.
Rexford, J., Sen, S., Feng, W., Kurose, J., Stankovic, J. and Towsley, D. (1997). Online smoothing of live variable-bit-rate video. NOSSDAV.
Salehi, J. D., Zhang, Z. L., Kurose, J. F. and Towsley, D. (1996). Supporting stored video: Reducing rate variability and end-to-end resource requirements through optimal smoothing. ACM SIGMETRICS.
Sen, S., Dey, J. K., Kurose, J. F., Stankovic, J. A. and Towsley, D. (1997). Streaming CBR transmission of VBR stored video. SPIE Symposium: Multimedia Network.
Sohraby, K. (1993). On the theory of general on-off source with applications in high-speed networks. IEEE INFOCOM, 401-410.
Wang, Y. C., Tsao, S. L., Chang, R. I., Chen, M. C., Ho, J. M. and Ko, M. T. (1997). A fast data placement scheme for video server with zoned-disks. SPIE VVDC, 92-102.
Zhang, H. and Knightly, E. W. (1995). A new approach to support delay-sensitive VBR video in packet-switched networks. NOSSDAV.
Zhang, J. and Hui, J. (1997). Traffic characteristics and smoothness criteria in VBR video traffic smoothing. IEEE ICMCS, 3-11.
Chapter XI
RAC: A Soft-QoS Framework for Supporting Continuous Media Applications

Wonjun Lee
Ewha Womans University, Korea

Jaideep Srivastava
University of Minnesota, USA
In this chapter, we present a novel disk admission control algorithm that exploits the degradability property of continuous media applications to improve the performance of the system. The algorithm is based on setting aside a portion of the resources, i.e., disk I/O bandwidth, as reserves and managing them intelligently so that the total utility of the system is maximized. This reserve-based admission control strategy (RAC) is a compromise between purely greedy and non-greedy strategies, and it leads to an efficient protocol that improves the performance of the system. While the protocol keeps the admission decision simple, it also yields better system performance by reserving some resources for important future applications. The framework consists of two parts. One is the reserve-based admission control mechanism, in which new streams arriving during periods of congestion are offered lower QoS instead of being blocked. The other is a resource scheduler for continuous media with dynamic resource allocation, which achieves higher utilization than non-dynamic schedulers by effectively sharing the available resources among contending streams and by reclamation, a scheduler-initiated negotiation that reallocates resources among streams to improve overall QoS.
INTRODUCTION

The emergence of varied high-speed networked multimedia systems opens up the possibility that a much more varied collection of continuous media applications could be handled in real time. However, due to the limited resources available in such systems, we still need to develop more intelligent mechanisms for efficient admission control, negotiation, resource allocation and resource scheduling, so that we can optimize the total system utilization. In particular, there has been increased interest in I/O issues for multimedia or continuous media (CM). The goal of conventional disk scheduling policies is to reduce the cost of seek operations and to achieve a high throughput, while providing fair access to every process that seeks their services. The goal of disk scheduling for continuous media is to meet the deadlines of the periodic I/O requests generated by the stream manager to meet rate requirements; an additional goal is to minimize buffer requirements. In order to ensure the continuous and stringent real-time constraints of video delivery in CM servers (Harinath et al., 2000; Haskin & Schmuck, 1996; Leland et al., 1994), several factors such as disk bandwidth, buffer capacity and network bandwidth should be considered carefully and handled efficiently. Reservations of these resources are required to support an acceptable level of display quality and to provide on-time delivery. In particular, disk bandwidth may be the most important factor, given that the I/O bandwidth reserved for each stream on disk depends on the latency overhead time, the transfer time, the defined cycle length and the contention among multiple streams. Hence, we should be able to guarantee that the request of each stream can be fairly supported with good disk utilization and server cost-performance. The goal of disk/server scheduling for CM is to satisfy QoS requirements (Vin, Goyal & Goyal, 1994) by meeting the deadlines of the periodic I/O requests generated by a server-resident CM stream manager, with minimum buffers and a fair scheduling algorithm (Kenchammana-Hosekote & Srivastava, 1997). A video stream being viewed requires timely delivery of data, but it is able to tolerate some loss of data for small amounts of time. The problem here (admission control and disk I/O bandwidth management in CM server systems) is the dual of network call admission control and dynamic bandwidth management (Lau & Lui, 1995). Admission control in CM servers or video-on-demand systems restricts the number of applications supported on the resources. For example, the applications may be video streams, and the resources used could be connections on the network or on continuous media servers (Lee & Sabata, 1999, 2001; Lee et al., 1998). To handle these issues in CM server systems, we propose an adaptive admission control strategy which achieves better performance than the conventional greedy admission control strategies generally used for CM servers. It recognizes that CM (e.g., video) applications can tolerate certain variations in QoS parameters, and it includes an algorithm for sharing processing resources at the server so that the available resources are shared effectively among contending streams (Lee & Srivastava, 1998). The proposed algorithm also provides for reclamation (i.e., scheduler-initiated negotiation) to reallocate resources among streams and improve the overall QoS.
In this chapter, we present a dynamic and adaptive admission control strategy that provides fair disk bandwidth scheduling and better performance for video streaming. It efficiently services multiple clients simultaneously and satisfies the real-time requirement for continuous delivery of video streams at specified bandwidths in distributed environments. This new stream scheduler provides a proficient admission control functionality which can optimize disk utilization with a good response ratio for client requests, in particular under heavily loaded traffic conditions. We will show how our algorithm provides a better admission ratio (that is, a lower input stream rejection ratio, computed as the ratio of the number of rejected streams to the number of total input streams) for multiple CM stream inputs, while still presenting similar levels of stream quality (how well the requested rate is attained in the CM server) compared to the basic greedy admission control algorithm. We will present a comparison of simulation results on the behavior of conventional greedy admission control mechanisms with that of our admission control and scheduling algorithm. The rest of this chapter is organized as follows. First, we briefly describe an assumed CM server architecture and the admission control constraints. The detailed description of our admission control and scheduling algorithm is then presented, followed by quantitative results of experimental evaluations and a description of the experimental designs and parameters. We then examine the related work of other researchers. Finally, concluding remarks and on-going work are offered.
OVERALL CM SERVER ARCHITECTURE

In this section, we outline the assumed CM server system and its architecture (Lee & Sabata, 1999; Lee et al., 1998). Figure 1 depicts the architecture of a CM server, which plays out multiple (single) streams to requesting clients across the network. Its basic admission control mechanism and constraints are presented in this section as well.
System Overview

Our CM server system is based on a client-server model and two file systems: the Presto File System (PFS) (Agrawal et al., 1997; Gemmell & Christodoulakis, 1992) and the conventional Unix file system (UFS). The core of the CM server comprises the Network Manager, the QoS Manager, the I/O Managers and the Proxy Servers. These sub-modules in the CM server are threads running in a single process (Harinath et al., 1997; Lee et al., 1998).

Figure 1: CM server-clients architecture
The clients and the server can either be on the same local machine or located across the network. The Network Manager instantiates a Proxy Server upon receipt of a new client request, and the Proxy Server communicates with the client thereafter; that is, the Network Manager delegates the connection management functionality to the Proxy Server once the latter is operational. The client's request is a generic structure that can be used for all types of streams; it contains elements such as the data rate (fps), the sampling rate, the client's pid, the file system (UFS/PFS) and the stream type (AU, MJPEG, etc.). A sketch of such a request structure is given below. The Network Manager is intended to work closely with the QoS Manager module, since it knows exactly how many connections are still active and what the stream types and sizes are. A Proxy Server is created for each client connection. After the Network Manager instantiates a Proxy Server for a new client request, it waits for the next client request. Each Proxy Server related to a certain client's request handles the send/receive operations between the CM server and the client across the network. It also interacts with the QoS Manager and an I/O Manager (a thread created by the Proxy Server) to handle admission control, buffer management and I/O operations. The Proxy Server cooperates with its peer, the I/O Manager, in a producer/consumer fashion. Each connection is associated with a Proxy Server and an I/O Manager. The Proxy Server is designed to handle generic streams and has no knowledge of the stream type; the I/O Manager corresponding to the stream type does all the work and passes individual frames into the common buffer for the Proxy Server to get and send to the client.
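A minimal sketch of the request structure follows, assuming field names of our own choosing (the chapter specifies only the kinds of elements carried, not their exact definitions):

    from dataclasses import dataclass

    @dataclass
    class ClientRequest:
        # Generic request carried from the client to the CM server; the
        # field names are illustrative, not the chapter's actual struct.
        data_rate_fps: int     # requested playback rate, frames per second
        sampling_rate: int     # audio sampling rate, Hz
        client_pid: int        # requesting client's process id
        file_system: str       # "UFS" or "PFS"
        stream_type: str       # e.g., "AU", "MJPEG"

    req = ClientRequest(data_rate_fps=24, sampling_rate=8000,
                        client_pid=4711, file_system="PFS",
                        stream_type="MJPEG")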
Admission Control

The admission control for input CM streams is managed by the QoS Manager, which is composed of the Admission Controller and the QoS Handler. The Admission Controller provides two constraint tests: (1) the I/O bandwidth test and (2) the available buffer test. Each request (stream) arrives with some rate value (e.g., a playback rate in frames per second), and the Admission Controller determines whether to admit it or not. In the following subsections, we describe the basic admission control constraints in terms of the availability of I/O bandwidth and buffer. The QoS Handler module takes care of data rate handling according to a rate input parameter provided by the client's request. In case the playing time gets delayed due to some kind of system overhead, this module drops some frames appropriately so as to maintain the data rate. This module also controls the dynamic QoS negotiation and management.
Basic Conditions of Admission Control

The following two constraints must be checked for admission control whenever (1) a new client's request arrives; (2) a rate control operation (such as Set_Rate, Pause, Resume or Fast_Forward) is received; (3) the playing of a running stream is over; or (4) the resources in the reserves have not been used for some time.
I/O Bandwidth Constraint

To check the availability of disk I/O bandwidth in admission control, we use an I/O bandwidth test (see Figure 2). The first element (ts) represents the maximum disk seek latency overhead in a cycle. The second component is the sum of the rotational latencies incurred for retrieving each disk block (for each request ri) in a given cycle length (Tsvc). The last component describes the time to transfer ri during a cycle. Finally, the overall disk access time in a given cycle must complete within the service cycle length (Tsvc).

Figure 2: Disk I/O bandwidth constraint
Buffer Constraint

The total buffer requirement must be smaller than the available buffer size; here we assume a dual-buffer mechanism. Hence, the requirement of admission control is to satisfy both the I/O bandwidth constraint and the available buffer constraint. If either of them is not satisfied, the input request must be rejected.
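A sketch of the combined admission test follows. Since the exact constraint equation appears only in Figure 2, the form used below (maximum seek overhead, plus per-block rotational latency, plus transfer time, all bounded by the service cycle) is only a plausible reading of the description, and the dual-buffer accounting is likewise an assumption:

    import math

    def admit(requests, new_rate, t_seek, t_rot, block, rate_xfer, T_svc,
              buf_total, buf_used):
        # requests: per-stream data rates r_i (bytes/sec) already admitted.
        rates = requests + [new_rate]
        busy = t_seek                    # worst-case seek overhead per cycle
        for r in rates:
            n_blocks = math.ceil(r * T_svc / block)  # blocks per cycle
            busy += n_blocks * t_rot                 # rotational latency
            busy += r * T_svc / rate_xfer            # transfer time
        io_ok = busy <= T_svc                        # I/O bandwidth constraint
        buf_ok = buf_used + 2 * new_rate * T_svc <= buf_total  # dual buffers
        return io_ok and buf_ok

    # e.g., one running 150 KB/s stream, a new 200 KB/s request:
    print(admit([150e3], 200e3, t_seek=0.02, t_rot=0.00417, block=2048,
                rate_xfer=10e6, T_svc=1.0, buf_total=8e6, buf_used=0.3e6))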
QoS Factors in Continuous Media Streaming

In order to achieve high performance in designing our CM server, we should consider the resource constraints as well as the properties of CM streams. These CM streams have their own features and special QoS metrics. Since we decided to design our CM server and clients on top of the lossy UDP, we adopted QoS metrics suitable for a lossy protocol. These QoS metrics play an important role in our QoS-driven CM server (in particular, in the QoS Manager). With respect to QoS metrics, Steinmetz (Shenoy et al., 1998; Smith, 1994; Steinmetz & Engler, 1996) has surveyed and specified the QoS factors that affect human perception of various aspects of multimedia. However, Steinmetz did not consider the lossiness of the streams, making those parameters unsuitable for evaluating lossy CM applications. Wijesekera (Vin, Goyal & Goyal, 1994) defined a set of metrics that are suitable for the lossy nature of UDP-based CM communication. The continuity of a CM stream is measured by three components: rate, drift and content. For the purposes of describing these metrics, we envision a CM stream as a flow of data units (referred to as logical data units, LDUs, in the uniform framework of Smith, 1994). The ideal rate of a flow and the maximum permissible deviation from it constitute our rate parameters. Given the ideal rate and the beginning time of a CM stream, there is an ideal time for a given LDU to arrive or to be displayed. Given the envisioned fluid-like nature of CM streams, the appearance time of a given LDU may deviate from this ideal.
The rate variations can be measured more accurately by drift parameters. Our drift parameters specify aggregate and consecutive non-zero drifts from these ideals over a given number of consecutive LDUs in a stream. For example, the first four LDUs of two example streams, with their expected and actual times of appearance, are shown in Figure 3.

Figure 3: Two example streams used to explain metrics

In the first example stream, the drifts are respectively 0.0, 0.8, 0.2 and 0.2 seconds; accordingly, it has an aggregate drift of 1.2 seconds per 4 time slots and a non-zero consecutive drift of 1.2 seconds. In the second example stream, the largest consecutive non-zero drift is 0.2 seconds and the aggregate drift is 0.3 seconds per 4 time slots. The reason for the lower consecutive drift in stream 2 is that its unit drifts are more spread out than those in stream 1. In addition to the timing and rate, the ideal contents of a CM stream are specified by the ideal contents of each LDU. Due to loss, delivery or resource overload problems, the appearance of the LDUs may deviate from this ideal and may consequently lead to discontinuity. The metrics of continuity are designed to measure the average and bursty deviations from the ideal specification. A loss or repetition of an LDU is considered a unit loss in a CM stream. (A more precise definition is given in Vin & Rangan, 1992.) The aggregate number of such unit losses is the aggregate loss of a CM stream, while the largest consecutive non-zero loss is its consecutive loss. In the example streams of Figure 3, stream 1 has an aggregate loss of 2/4 and a consecutive loss of 2, while stream 2 has an aggregate loss of 2/4 and a consecutive loss of 1. The reason for the lower consecutive loss in stream 2 is that its losses are more spread out than those of stream 1. Human response to video and audio is quite interesting. According to Vin & Rangan (1992), up to 23% of aggregate video loss and 21% of aggregate audio loss are tolerable. The acceptable values for the consecutive loss of both video and audio are approximately 2 LDUs. Up to about 20% of video and 7% of audio rate variations are tolerable. They give upper bounds for the aggregate and consecutive drift factors (ADF and CDF). Given certain resources, we want to support as many CM streams as possible whose QoS parameters are all acceptable. When the load of the CM server is low, it is possible to meet this requirement. However, as more and more clients request CM streams, the QoS of the CM server will degrade; it is a good choice to make it degrade gracefully. To guarantee that
each client is served with some reasonable quality, admission control is also necessary. Thus, one of the bottom lines we should consider with respect to resource management among competing applications in CM servers is to meet the above continuity metrics of CM as far as possible.
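A sketch of how these continuity metrics could be computed from per-LDU timestamps follows; the input layout mirrors the worked example above, but the function names are ours:

    def drift_metrics(expected, actual):
        # Per-LDU drifts between ideal and observed appearance times.
        drifts = [abs(a - e) for e, a in zip(expected, actual)]
        aggregate = sum(drifts)
        consec, best = 0.0, 0.0            # largest run of non-zero drifts
        for x in drifts:
            consec = consec + x if x > 0 else 0.0
            best = max(best, consec)
        return aggregate, best

    def loss_metrics(losses):
        # losses: 1 if the LDU was lost or repeated (a unit loss), else 0.
        aggregate = sum(losses) / len(losses)
        consec, best = 0, 0                # largest consecutive loss run
        for x in losses:
            consec = consec + 1 if x else 0
            best = max(best, consec)
        return aggregate, best

    # Stream 1 from the example: drifts 0.0, 0.8, 0.2, 0.2 seconds,
    # and two adjacent unit losses out of four LDUs.
    print(drift_metrics([0, 1, 2, 3], [0.0, 1.8, 2.2, 3.2]))  # -> (1.2, 1.2)
    print(loss_metrics([0, 1, 1, 0]))                          # -> (0.5, 2)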
ADAPTIVE ADMISSION CONTROL AND RESOURCE SCHEDULING

In this section, we describe the widely used greedy admission control mechanism and its limitations. Next, we explain a dynamic and adaptive admission control and scheduling algorithm that involves QoS degradation and adaptation to enhance system resource utilization.
Background

In the prior section, we explained the two fundamental admission constraints in terms of I/O bandwidth and buffer size. Here we consider the first constraint (I/O bandwidth) as the condition for the admission decision in the commonly used greedy admission control and scheduling mechanisms. Most of the existing admission control approaches follow a purely greedy strategy, in the sense that a new application (video stream) is accepted only if the server can give the client all the requested resources. The major disadvantage of the greedy methods, which account for much of the work in soft real-time systems such as continuous media servers, is that they have focused either on conservative approaches based on bandwidth peaks or on statistical methods (Crovella & Bestavros, 1996; Tripathi & Raghavan, 1998) that model arrival rates and stream bandwidths with probability distributions and determine a satisfactory level of performance for the disk and network subsystems, either separately or as a system. These approaches are too conservative and admit too few streams, thereby under-utilizing the server resources. Although probabilistic methods exist to amortize the cost of this failure, this is undesirable in general (Lee & Srivastava, 2000). We propose an enhanced admission control algorithm that is capable of readjusting resources according to the amount of remaining resources. The key idea is to assign a portion of the resources as the reserves; when the applications start to dip into the reserves, another strategy is invoked. Under heavily loaded traffic (when a fairly large number of client requests want service; for example, when the total disk bandwidth utilization exceeds 70%), if the remaining available resources become smaller than some value (the threshold Treserve), we assign only a portion of the requested resources to the new requests, according to the done_ratio and the available resources. For the degraded streams to adapt based on the availability of resources, resource negotiation is required. Resource negotiation (reclamation) occurs under these circumstances: (1) when a running request returns its resources back to the system; (2) when set_rate video functions, such as Fast_Forward, are received; or (3) after some period elapses without any further resources being used. We start with a brief description of the typical greedy approach and then discuss our new strategy, which is the focus of this chapter. In the following sections, we will describe the experimental evaluation of the performance of our approach. Table 1 describes the attributes used in the algorithms.
Table 1: Attributes used in the algorithms

Attribute            Description
si                   Stream i
qi                   Initially requested resource of si
vi                   Currently serviced resource of si (= qi * di)
RS                   Set of all running streams: { si | Fmin,i < di <= 1.0 }
DS                   Set of degraded streams: RS - { si | di = 1.0 }
smin                 Stream j whose dj is the smallest in DS
m                    Number of streams in RS
done_ratio           Ratio of serviced resource to requested resource for a stream
remaining_ratio      Ratio of resource not yet serviced to requested resource for a stream (= 1.0 - done_ratio)
remaining_ratioi     Amount of resource not yet serviced in stream i
sum_remaining_ratio  Total sum of remaining_ratio over the degraded streams: sum of remaining_ratio of sk, where sk is in DS
MIN_FRACT            Minimum amount of resource to be assigned to a stream
MIN_RESERVE          Minimum amount of reserve which must be maintained in DRA mode
di                   done_ratio of si
ei                   remaining_ratio of si (= 1.0 - di)
Fmin,i               MIN_FRACT of si
Ttotal               Total available resources initially given
Treserve             Amount of resources assigned to the reserve
Tfree_res            Available resources in the reserve
Tfree_non_res        Available resources outside the reserve
Talloc               Resources allocated to the requesting stream
Tavail               Available resources to assign (= Ttotal - Tused)
Tused                Total resources currently used
SU(t)                Total system utilization, defined as the sum of dk over sk in RS at time t
QT(t)                Total video quality, defined as SU(t)/m at time t
Basic Greedy Algorithm

The naive approach to admission control is to use a greedy strategy, where applications (streams) are admitted as long as there are resources (Ernst & Biersack, 1996; Leland et al., 1994; Vin, Goyal & Goyal, 1994). On application arrival, the policy admits the application if the resources available are greater than the resources requested. Upon application departure, the policy simply adds the resources released by the application to the resources available. Table 2 presents the details of the greedy strategy. Initially, Ttotal is the total amount of resources available, out of which Tused is currently being used by the applications. An application's resource request is expressed by the requested QoS level qi. The system can admit a new application if the currently available resources (Tavail = Ttotal - Tused) are at least qi (Step 3). The system then allocates Talloc (= qi) to the requesting application (Step 4), and Tused and Tavail are adjusted accordingly (Steps 5-6). Upon departure, Tused is decreased by Talloc, and Tavail is readjusted to Ttotal - Tused (Steps 10-11).

Table 2: Algorithm for greedy admission control

Greedy Admission Control (event Ei, qi)    // Ei = {START, CLOSE}
 1  switch EVENT of
 2  case START:                  // with parameter qi (requested resource)
 3      if (Tavail >= qi) then   // enough resources: admit the application
 4          Talloc <- qi;
 5          Tused <- Tused + qi;
 6          Tavail <- Ttotal - Tused;
 7      else                     // not enough resources
 8          Reject this application instance;
 9  case CLOSE:                  // with parameter appl_inst_ID
10      Tused <- Tused - Talloc;
11      Tavail <- Ttotal - Tused;
12      Delete the application instance;
13  end switch
Reserve-Based Admission Control and Scheduling Algorithm

The key advantage of the basic greedy strategy is that it is simple. However, due to its other shortcomings, we should consider another strategy that allows more streams to run concurrently in the continuous media server, by degrading the requested quality of newly arriving streams and by adapting the returned resources to these streams. Since we cannot make a good estimate of the future traffic, it is very difficult to implement a perfect algorithm for determining the amount of degraded resources. Nevertheless, we believe that our new soft-QoS framework noticeably reduces the rejection ratio and lets more input streams run in the system. Given the average seek time (ts), the cycle length (Tsvc), the disk block size (b), the rotation time (tr) and the disk transfer rate (R), the consumption rate of each client request (qi) is a variable in the I/O bandwidth constraint equation. We initialize the total available rate (Tavail) with Tsvc (= Ttotal), and set Treserve (the threshold used to check the congested_bit). Initially, the congested_bit is 0 (not congested). The resource (here, I/O bandwidth) is congested if the resource usage (Tused) is more than (Tsvc - Treserve). We can set aside the Treserve amount of resources for admitting more requests at a reduced quality. Here, Tused + Tavail = Tsvc. That is, the congested_bit is set (= 1) only when Tavail becomes smaller than Treserve; otherwise, the basic I/O bandwidth constraint is just applied (i.e., the new request is admitted without being degraded). While the congested_bit is set, we restrict the amount of resources assigned to the new request, because there is no longer sufficient resource remaining. We explain the details in the following.
On adding the request qi, if the resource remains uncongested, then we admit it with no degradation (Talloc = qi). That is, while the congested_bit is reset (= 0), new requests are simply admitted. Let the current usage be Tused; then adding the application qi will increase the usage to Tnew = Tused + qi. If (Tnew >= Tsvc - Treserve), then adding qi will make the resource congested. When the resource gets congested, we first use the unused resources outside the reserves, Tfree_non_res = max(0, Tsvc - Treserve - Tused). The amount of available resources in the reserves is Tfree_res = min(Treserve, Tsvc - Tused). The new application is then allocated Talloc = Tfree_non_res + min(qi - Tfree_non_res, Psi(Tfree_res)). That is, we allocate the resources that are outside the reserves and, in addition, we give at most Psi(Tfree_res) of the resources in the reserves. Psi(.) is a function of Tfree_res that returns an appropriate amount of resources according to Tfree_res; several different policies, such as fixed, variable and exponential policies, could be applied for this function. If Psi(Tfree_res) is too small (in this case, we had better not support the new request because the display quality may be too poor), we simply reject the request. If a running stream (sk) finishes, then its resource vk is reclaimed: we return the resource occupied by the stream and adjust (if required) the resource allocation of the streams not being fully serviced. Here the unused resource is allocated to applications such that the least-serviced streams get the returned resource first. A sketch of the admission decision is shown below.
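The following sketch illustrates this reserve-based decision; Psi here is a simple fixed-fraction policy of our own choosing, standing in for whichever fixed, variable or exponential policy is actually configured:

    T_svc = 1.0          # service cycle length (total rate budget)
    T_reserve = 0.2      # resources set aside as reserves
    MIN_FRACT = 0.05     # below this, a degraded grant is not worth admitting

    def psi(free_res):
        # Illustrative fixed-fraction reserve policy: release at most half
        # of what is left in the reserves to any single request.
        return 0.5 * free_res

    def rac_admit(q, T_used):
        # Returns the allocation for request q, or None to reject.
        T_avail = T_svc - T_used
        if T_used + q < T_svc - T_reserve:
            return q                       # uncongested: full allocation
        free_non_res = max(0.0, T_svc - T_reserve - T_used)
        free_res = min(T_reserve, T_svc - T_used)
        grant = free_non_res + min(q - free_non_res, psi(free_res))
        if grant < MIN_FRACT:
            return None                    # too degraded to be useful
        return min(grant, q, T_avail)

    print(rac_admit(0.3, T_used=0.4))   # uncongested: full 0.3
    print(rac_admit(0.3, T_used=0.7))   # congested: degraded grant of 0.2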
Negotiation (Reclamation) Algorithm

We have implemented two heuristic-based methods and tested their performance:
(1) Dynamic Reserve Adaptation (DRA): This is invoked only if there are applications that need resources and do not have them. On the departure of streams, DRA returns the resources assigned to the leaving streams (qi) back to the available resource pool and re-calculates the new Tavail using the total available resource (i.e., the old Tavail + qi). Among the requests that are not yet fully serviced, we select the request (smin) from the queue (DS) whose done_ratio is the smallest, and assign the proper resource to it. The maximum-Psi(Tfree_res) rule applies here: we assign Talloc = min(Psi(Tfree_res), remaining_ratiomin) to the selected request (smin). The loop continues until there is no more available resource to assign or Talloc is too small to assign.
(2) Reclamation within Returned Reserve (RWR): This is similar to the Dynamic Reserve Adaptation method, but the difference is that it distributes only the resources returned by the leaving streams (vi); in DRA, we used Tavail (instead of vi) for reclamation. The Talloc per stream is calculated according to the ratio of remaining_ratio to sum_remaining_ratio: Talloc = Tavail * remaining_ratiok / sum_remaining_ratio. The key idea here is not to touch the reserves, but to utilize only the resource returned due to the departure of streams. We will validate the better performance of this policy compared to that of DRA.
When the resources in the reserves are not used for a long time, some of them are reclaimed: for every period k, if there are no requests to the reserve, we release some amount of the reserve to be reclaimed. This is a kind of reserve adaptation method. A sketch of the RWR distribution step is given below.
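The RWR distribution step could look like the following sketch; the stream records and the in-place update are our own framing of the proportional rule above:

    def rwr_reclaim(returned, degraded):
        # degraded: list of dicts with 'q' (requested rate) and 'done'
        # (done_ratio) for the streams in DS. Distribute only the returned
        # resource, proportionally to each stream's remaining_ratio.
        sum_remaining = sum(1.0 - s['done'] for s in degraded)
        if sum_remaining <= 0:
            return
        for s in degraded:
            remaining = 1.0 - s['done']
            alloc = returned * remaining / sum_remaining
            # Convert the granted rate back into a done_ratio increment,
            # without exceeding full service.
            s['done'] = min(1.0, s['done'] + alloc / s['q'])

    streams = [{'q': 0.3, 'done': 0.5}, {'q': 0.2, 'done': 0.8}]
    rwr_reclaim(returned=0.1, degraded=streams)
    print(streams)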
Advantages of Applying the RAC Algorithm

Compared to the basic greedy admission control strategy, we expect the following advantages from our new strategy:
(1) In heavily loaded traffic situations, we can have more streams running concurrently (instead of simply rejecting some of them due to the lack of resources) by sacrificing some of the rates of newly admitted requests, under the assumption that the rate given to the admitted streams remains in a tolerable range: reducing the rejection ratio.
(2) As soon as some resources become available due to the completion of running applications, we assign them to the degraded requests in the queue (DS) according to done_ratio, thereby performing QoS adaptation among the degraded applications: dynamic and adaptive QoS negotiation.
(3) According to our observations, the new scheme still achieves a play-out quality of CM streams comparable to that achieved by the basic greedy admission control strategy: no quality degradation.
(4) The total system utilization is increased: enhancing system utilization.
The detailed experimental evaluation results with respect to the above advantages are described in the following section.
SIMULATION RESULTS

We performed extensive simulations to validate our admission control and scheduling algorithm. In this section, we present the performance results obtained from simulations under various load conditions. For the results presented in this section, we simulated an environment with the parameters described in Table 3. This configuration corresponds to a Sun Ultra Sparcstation with a Seagate Barracuda 4 GB disk (ST34371N).
Experiment Design

We simulate a situation where applications with soft-QoS requirements arrive at the system in an arbitrary order. Each application also comes with a play-out rate (which maps into an amount of required resource) that depends on the user profile and the data content, and which sets the value of its data rate parameter. For example, we may have a set of video distribution applications that provide streaming video with different types of content to remote users. Ideally, the application load should be characterized using experimental data.
Table 3: Simulation environment

Parameter                 Value
System                    Sun Ultra Sparcstation 1
OS                        SunOS 5.5
Hard Disk Type            Seagate Barracuda (ST34371N)
Formatted Capacity        4.35 GB
Interface                 Ultra SCSI
Internal Transfer Rate    80-122 Mb/sec
External Transfer Rate    20 Mb/sec (sync)
Average Latency           4.17 msec
Average Access (read)     9.4 msec
Disk Block Size           2 KB
However, for the simulations, we assume the following attributes:
• lambda: the average arrival rate of the Poisson process according to which applications arrive.
• mud: the mean of the Gaussian process describing the duration of applications.
• sigmad: the standard deviation of the Gaussian distribution for the duration of applications.
The applications arrive according to a Poisson process with average inter-arrival rate lambda. The duration of the applications is described by a Gaussian process with mean mud and standard deviation sigmad. The experiments are run by fixing the generating-process parameters and then generating many application traffic traces (a sketch of such a trace generator is given below); the admission control and scheduling algorithm is tried on each one of the traces. We measure the following metrics:
• Accumulated number of admitted streams over time: Streams are admitted if there are minimally sufficient resources available (> MIN_FRACTi). The basic greedy algorithm and the two policies of RAC are compared by evaluating their effect on the number of admitted streams.
• Admission ratio: The admission ratio is given as admitted streams / total input streams.
• Total system utilization: The total system utilization is computed as the sum of the done_ratios of the streams supported by the resource: total_system_utilization = sum of done_ratiok, where sk is in RS.
• Total quality: The total quality is computed as the total system utilization divided by the number of currently running streams: total_quality = total_system_utilization / number_of_current_running_streams.
• Effect of the resource reserve size: We study the effect of the amount of resources kept aside as the reserves (Treserve) on the number of admitted streams over time.
Figure 4 shows the data rates of the input streams as they change over time. Heavy traffic is generated by setting lambda = 0.333, mud = 90 and sigmad = 3.5; medium traffic is generated with lambda = 0.125, mud = 60 and sigmad = 3.5. The values on the x-axis are normalized. Under the heavy traffic, the resource demands arrive more frequently than under the medium traffic.

Figure 4: Request data rate over time
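A sketch of the trace generator, under the stated arrival and duration assumptions (the per-stream rate model is our own addition, since the chapter leaves it to the user profile and content):

    import random

    def generate_trace(lam, mu_d, sigma_d, horizon, seed=0):
        # Poisson arrivals (exponential inter-arrival times with rate lam)
        # and Gaussian durations, as in the experiment design.
        rng = random.Random(seed)
        trace, t = [], 0.0
        while True:
            t += rng.expovariate(lam)                  # next arrival time
            if t > horizon:
                break
            dur = max(1.0, rng.gauss(mu_d, sigma_d))   # stream duration
            rate = rng.uniform(0.05, 0.25)             # rate (illustrative)
            trace.append((t, dur, rate))
        return trace

    heavy = generate_trace(lam=0.333, mu_d=90, sigma_d=3.5, horizon=1000)
    medium = generate_trace(lam=0.125, mu_d=60, sigma_d=3.5, horizon=1000)
    print(len(heavy), len(medium))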
To provide more general input traffic modeling for the simulation, reflecting practical Internet and/or real-time communication traffic, we would have to adopt another traffic model (e.g., self-similar traffic; Wijesekera & Srivastava, 1996, 2000) in addition to the current simple Poissonian server access traffic model.
Simulation Results

We present the detailed results of each experiment in the following subsections.
Experiment 1: Number of Running Streams

The major advantage of RAC is that it allows more streams to be admitted and run simultaneously with a tolerable degradation of quality. In this subsection, we compare the efficiency of the RAC (RWR and DRA) algorithms and the greedy algorithm in terms of the accumulated number of admitted streams. Figure 5 shows the accumulated numbers of admitted streams over time under heavy and medium traffic. Under traffic loads that demand more than the available resources, the accumulated occurrences of admission decrease (rejections increase), both in the basic method and in our strategy. Overall, RAC achieves better performance than the basic greedy algorithm in terms of the admission rate, and the RWR mode in particular gives the best performance. The heavier the traffic, the better the performance of the RAC algorithm compared to the basic greedy algorithm (as shown in Figure 5). Put differently, the RWR algorithm achieves a 200-300% increase in the number of streams that can be serviced simultaneously by the server, and the DRA algorithm achieves about a 100% increase. In other words, our RAC algorithm (both RWR and DRA) noticeably reduces the rejection ratio compared to the basic greedy method.
Figure 5: Number of admitted streams over time
Experiment 2: Admission Ratio The effectiveness of the admission control criteria was measured by the number of admitted streams, the total quality and the system utilization. Here we compare the RAC method to the basic greedy method in terms of the admission ratio under different application load conditions. Under the heavy load, RWR achieves the best admission ratio (a 150~250% increase), and DRA the next best (about a 100% increase). Under the medium load, the increase rates of the two RAC methods are smaller than under the heavy load. Unlike under the heavy load, the DRA method performs even better than RWR. This is because when the traffic load is not too heavy, DRA is able to utilize the resources more efficiently by taking some resources off the reserves: the probability of the reserves running out under medium/light loads is smaller than under the heavy load. RWR, in contrast, uses only the released resources, without dipping into the reserves. In either case, the RAC strategy (both RWR and DRA) noticeably increases the admission ratio compared to the greedy method. Degradation makes the applications use a lower (but, given the properties of CM, fully reasonable) amount of resources; therefore, more applications can be supported on the resource. In any case, as we will examine in the following subsection, the RWR algorithm still achieves a total quality equivalent to that of the greedy algorithm (i.e., no quality loss).
Experiment 3: System Utilization In this section, we mainly focus on visualizing the effect of the RAC algorithm with respect to total system utilization and display quality. To demonstrate that the RWR and DRA algorithms improve the total quality as defined above, we plot the corresponding curves. Figure 7 plots the total system utilization, computed as the sum of each application's done_ratio. The curves are mainly affected by the applications' arrival distribution; however, it can be observed that the total system utilization achieved by RWR is much higher during most of the period.
Figure 6: Admission ratio over time
Figure 7: Total system utilization over time
In the later part of the simulation under heavy traffic, the quality appears slightly increased. This phenomenon arises because most admitted streams are closed at the same time in the last part of the simulation. Overall, the RWR algorithm is able to utilize the system resources more efficiently than the greedy algorithm.
Experiment 4: Total Quality To investigate the display quality of the RAC algorithm, we examined another factor. The total quality is quantified by dividing the sum of each application's done_ratio by the number of currently running streams. The total quality achieved by the RWR algorithm is almost the same as that achieved by the greedy algorithm. Only in the initial part of the simulation is it occasionally below 1.0, but it stays above 0.85, which is fairly good; after that, the quality remains at almost 1.0. In contrast, the total quality of the DRA algorithm is quite low most of the time. This arises because the DRA algorithm gives away the reserves to future streams, which arrive shortly before some of the currently running streams depart. Under medium traffic the performance of DRA gets better, but it is still too low. We conclude that dipping into the reserves for resource reclamation should be done with care. In the RWR mode, the reserves are held off and only the returned resources are reclaimed.
Experiment 5: Number of Streams vs. Threshold We were interested in studying the effect of the amount of resources kept aside for the reserves (Treserve) on the performance of the system. Using simulation experiments, we verified that for the fixed policy for resource assignment (where we assigned 1/k of the resources to the applications after the resource utilization entered the reserves), the performance depended on Treserve. This observation is explained by noting that, from a system perspective, keeping more resources in the reserves implies that the system believes that more applications can be supported without a loss of quality. If the statistics of the application traffic are kept constant, there is an optimal value of Treserve that maximizes the total quality. We ran simulations with the fixed policy and varied the amount of resources assigned to the reserves. As expected, we observed that for fixed application traffic characteristics, the total quality varied with the value of Treserve. Figure 8 shows the relationship between the number of admitted streams and time for four different Treserve values. From the experimental observations, we see that the system benefit is maximized for a reserve value between 0.2 and 0.3.
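A sweep of this kind can be automated as in the minimal sketch below; the run_simulation callable is hypothetical and stands in for one full admission-control run returning the average total quality for a given Treserve.

```python
def find_best_reserve(run_simulation, candidates=(0.1, 0.2, 0.3, 0.4)):
    """Try several reserve thresholds and return the best one.

    run_simulation(treserve=...) is assumed (hypothetically) to run the
    fixed-policy simulation and return the resulting average total quality.
    """
    results = {t: run_simulation(treserve=t) for t in candidates}
    best = max(results, key=results.get)
    return best, results
```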
CONCLUSIONS AND FUTURE WORK We presented an integrated dynamic and adaptive admission control and scheduling algorithm for CM server systems. We have also presented the results of a simulation evaluation of our algorithm and a greedy algorithm with respect to several metrics designed to measure the admission ratio, total quality and bandwidth utilization. We observed that under heavy traffic, our algorithm achieves much better performance than the greedy algorithm. Our admission control and scheduling strategy provides three principal advantages over conventional mechanisms. First, it guarantees better total system utilization. Second, it provides better disk utilization and a larger admission ratio for incoming continuous media streams, which is a major advantage of our scheme: in other words, it significantly reduces the rejection ratio and increases the total system benefit, in particular when the system traffic becomes heavy. Third, it still delivers acceptable play-out quality compared to the conventional greedy admission control algorithm. In conclusion, using our scheme, more streams can run with an acceptable range of data quality on a given system resource. Our ongoing and future work includes extending the QoS metrics considered in our algorithm to multi-dimensional ones (e.g., frame size, buffer requirement, compression quality factor, scalability, network bandwidth, etc.). Toward this end, we are
Figure 8: Number of admitted streams vs. threshold
studying the constructs of user utility functions and resource request functions from which we could formulate the system and user demands more specifically.
ACKNOWLEDGMENTS This work was partially sponsored by Air Force contract number F30602-96-C-0130 to Honeywell Inc., via subcontract number B09030541/AF to the University of Minnesota.
REFERENCES Agrawal, M., Harinath, R., Huang, J., Richardson, J., Lee, W., Srivastava, J., Su, D., Wadhwa, S. and Wijesekera, D. (1997). Sonata: Active views in a distributed object-oriented system. In IEEE HPDC-6, Portland, Oregon, August. Anderson, D., Osawa, Y. and Govindan, R. (1992). A file system for continuous media. ACM Trans. Computer Systems, 10(4), 311-337. Chen, M., Kandlur, D. and Yu, P. (1992). Design and analysis of a grouped sweeping scheme for multimedia storage management. In Proceedings of International Workshop on Network and Operating System Support for Digital Audio and Video, San Diego, November. Chen, Z., Tan, S. M., Campbell, R. H. and Li, Y. (1995). Real-time video and audio on the World Wide Web. In Proceedings of World Wide Web Conference, Boston, MA, December. Crovella, M. and Bestavros, A. (1996). Self-similarity in World Wide Web traffic: Evidence and possible causes. In Proceedings of ACM SIGMETRICS'96, Philadelphia, PA, May. Ernst, F. T. and Biersack, W. (1996). Statistical admission control in video servers with constant data length retrieval of VBR streams. In Third International Conference on Multimedia Modeling, Toulouse, France, November. Gemmell, D. and Christodoulakis, S. (1992). Principles of delay-sensitive multimedia data storage and retrieval. ACM Trans. Information Systems, 10(1), 51-90. Harinath, R., Lee, W., Parikh, S., Su, D., Wadhwa, S., Wijesekera, D., Srivastava, J. and Kenchammana-Hosekote, D. (1997). A multimedia programming toolkit/environment. In Proceedings of International Conference on Parallel and Distributed Systems (ICPADS'97), Seoul, Korea, December. Harinath, R., Lee, W., Su, D., Srivastava, J. and Richardson, J. (2000). Active view systems in an object-oriented system for distributed multimedia applications. Submitted for publication in Journal of Parallel and Distributed Computing. Haskin, R. and Schmuck, F. (1996). The Tiger Shark file system. In COMPCON'96. Kenchammana-Hosekote, D. and Srivastava, J. (1994). Scheduling continuous media in a video-on-demand server. In Proceedings of ICMCS-1994, May. Kenchammana-Hosekote, D. and Srivastava, J. (1997). I/O scheduling for digital continuous media. ACM-Springer Multimedia Systems Journal, 5(4), 213-237. Lau, S. and Lui, J. C. S. (1995). A novel video-on-demand storage architecture for supporting constant frame rate with variable bit rate retrieval. In 5th NOSSDAV, Durham, NH, April. Lee, W. and Sabata, B. (1999). Admission control and QoS negotiations for soft-real-time applications. In Proceedings of the IEEE International Conference on Multimedia Computing Systems (ICMCS), Florence, Italy, June.
Lee, W. and Srivastava, J. (1998). CORBA evaluation of video streaming with QoS provisioning. In Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems (SRDS'98), West Lafayette, Indiana, October. Lee, W. and Srivastava, J. (2001). An algebraic QoS-based resource management model for competitive multimedia applications. To appear in Journal of Multimedia Tools and Applications (MTAP). Kluwer Academic Publishers, 13(2), 197-210. Lee, W., Su, D., Ngo, H. Q. and Srivastava, J. (1998). A QoS-driven networked continuous media server. In Proceedings of SPIE PC'98, Beijing, China, September. Lee, W. and Srivastava, J. (2000). QoS-based evaluation of file systems and distributed system services for continuous media provisioning. Information & Software Technology, Elsevier Science, 42, December. Leland, W. E., Taqqu, M. S., Willinger, W. and Wilson, D. V. (1994). On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on Networking, February, 2(1), 1-15. Makaroff, D., Neufeld, G. and Hutchinson, N. (1997). An evaluation of VBR disk admission algorithms for continuous media file servers. In Proceedings of the ACM Multimedia Conference, Seattle, WA, December. Martin, C., Narayanan, P. S., Ozden, B., Rastogi, R. and Silberschatz, A. (1996). The Fellini multimedia storage server. In Multimedia Information Storage and Management. Kluwer Academic Publishers. Rangan, P. V. and Vin, H. M. (1993). Designing a multiuser HDTV storage server. IEEE JSAC, January, 11(1), 153-164. Reddy, A. and Wyllie, J. (1994). I/O issues in a multimedia system. Computer, 27(3), 69-74. Shenoy, P. J., Goyal, P., Rao, S. S. and Vin, H. M. (1998). Symphony: An integrated multimedia file system. In Proceedings of SPIE/ACM MMCN'98, San Jose, CA, January. Smith, B. C. (1994). Implementation techniques for continuous media systems and applications. Ph.D. Dissertation, University of California, Berkeley. Steinmetz, R. and Engler, C. (1996). Human perception of media synchronization. IEEE JSAC, 14(1), 61-72. Steinmetz, R. and Blakowski, G. (1996). A media synchronization survey: Reference model, specification and case studies. IEEE JSAC, 14(1), 5-35. Steinmetz, R. and Engler, C. (1996). Human perception of media synchronization. Technical Report, IBM European Research Center. Tripathi, S. K. and Raghavan, S. V. (1998). Networked Multimedia Systems: Concepts, Architectures and Design. Prentice Hall. Vin, H. M., Goyal, P. and Goyal, A. (1994). An observation-based approach for designing multimedia servers. In Proceedings of ICMCS'94, Boston, MA, May. Vin, H. M., Goyal, P. and Goyal, A. (1994). A statistical admission control algorithm for multimedia servers. In Proceedings of ACM Multimedia '94, Boston, MA, May. Vin, H. M. and Rangan, P. V. (1992). Admission control algorithms for multimedia on-demand servers. In Proceedings of the Third NOSSDAV. Wijesekera, D. and Srivastava, J. (1996). Quality of Service (QoS) metrics for continuous media. Multimedia Tools and Applications, September, 3(1), 127-166. Wijesekera, D. and Srivastava, J. (2000). Experimental evaluation of loss perception in continuous media. ACM Springer Multimedia Systems Journal.
Chapter XII
A Model for Dynamic QoS Negotiation Applied to an MPEG4 Application Silvia Giordano, ICA Institute, Switzerland Piergiorgio Cremonese, Wireless Architect, Italy Jean-Yves Le Boudec, Laboratoire de Reseaux de Communications, Switzerland Marta Podesta, Whitehead Laboratory, Italy
The traffic generated by multimedia applications presents a great amount of burstiness, which can hardly be described by a static set of traffic parameters. The dynamic and efficient usage of resources is one of the fundamental aspects of multimedia networks: the traffic specification should reflect the real traffic demand while, at the same time, optimizing the resources requested. This chapter presents a model for dynamically renegotiating the traffic specification (RVBR), shows how it can be integrated with the traffic reservation mechanism RSVP, and gives an example of an application able to accommodate its traffic by managing QoS dynamically. The remainder of this chapter focuses on the technique used to implement RVBR, taking into account problems deriving from delay during the renegotiation phase, and on the performance of the application with MPEG4 traffic.
INTRODUCTION Future applications will make use of different technologies such as voice, data and video. These multimedia applications require, in many cases, a better service than best-effort. This service is generally expressed in terms of Quality of Service (QoS), whereas network efficiency depends crucially on the degree of resource sharing inside the network. Achieving both the applications' QoS requirements and network resource efficiency is extremely important for several reasons, such as network dimensioning or traffic charging. The evolution of multimedia applications has shown that QoS management must be supported by the network as well as at the application layer. Resource optimization is possible only if requests for reservation fit the effective resource occupation as closely as possible. It follows that applications must be enabled to directly manage the QoS in order to limit the loss of resources.
The introduction of the renegotiable variable bit rate (RVBR) service (Giordano, 1999, 2000) at the application layer is meant to simplify and generalize this task. Whenever renegotiation is underway, the RVBR scheme generates a traffic specification conforming to the real demand, so as to renegotiate the network resources in an optimal way while guaranteeing QoS to the traffic flows. The RVBR service uses the knowledge of the past status of the system and the profile of the traffic expected in the near future, which can be either pre-recorded or known by means of exact prediction. We propose an example of a multimedia application (called Armida) supporting dynamic QoS management based on RSVP that integrates RVBR services. Armida provides MPEG4 streaming video over an IP network in a Microsoft environment. The rest of the chapter is organized as follows. In the next section we provide a short overview of the RSVP protocol. Then we describe the RVBR mechanism as defined in Giordano (2000). In the fourth section we introduce the Armida application, pointing out the component implementing the signalling protocol. Finally, we provide a set of results related to a real case in which we compared the required bandwidth (derived from the generated traffic) and the reserved QoS, varying the number of performed re-negotiations. We also provide an analysis of re-negotiation cost in terms of the time required to set up the new QoS.
BACKGROUND QoS Management via RSVP QoS management on the Internet is performed via the Resource ReSerVation Protocol (RSVP) (Braden, 1997). RSVP is the signalling protocol implementing QoS management according to the model defined by the Integrated Services (IS) group within the IETF (Wroclawski-2210, 1997). RSVP allows the reservation of resources for a flow, seen as a sequence of datagrams. The sender sends the characteristics of the traffic in the Tspec traffic descriptor, contained in the PATH message. The receiver tries to set up a reservation related to the received PATH message by issuing a RESV message. The reservation is periodically refreshed (the suggested refresh period is currently 30 seconds), i.e., the PATH and the RESV messages are reissued. IS defines three classes of services: Guaranteed Service (GS) (Shenker, 1997), Controlled-Load Service (CLS) (Wroclawski-2211, 1997) and Best-Effort Service. In the rest of the chapter we will focus only on the CLS. CLS provides the client data flow with a quality of service closely approximating the QoS that the same flow would receive from an unloaded network element, but uses capacity (admission) control to assure that this service is received even when the network element is overloaded. The end-to-end behaviour offered by the controlled-load service to an application, under the assumption of a correctly functioning network, is expected to show little or no delay and congestion loss. The sender provides the information about the data traffic it will generate in the Tspec. The parameters specified by the Tspec are:
• peak rate p
• bucket rate r
• bucket size b
In addition, there is a minimum policed unit m and a maximum packet size M. The service offered by the network ensures that adequate network resources are available for that traffic. The controlled-load service is well suited to those applications that can usefully characterize their traffic requirements and are not too sensitive to occasional delays or losses. As this work was unfolding, the media was busily hyping RSVP as a panacea: the magic cure that would bring an end to all network woes. As is often the case with over-hyped technologies, RSVP and IS failed to deliver on the promises (DeSousa, 1999). However, RSVP found a new application in the configuration of traffic handling mechanisms. The ISSLL working group of the IETF has developed a model by which RSVP signalling is used with diffserv traffic handling in order to enable QoS in a scalable manner. In addition, by listening to RSVP signalling, network devices are more readily able to identify and classify traffic in order to determine the appropriate traffic handling mechanisms to apply (DeSousa, 1999).
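Returning to the double-bucket Tspec above, its conformance test is a pair of token buckets checked jointly. The sketch below is a generic illustration of such a policer (not code from any RSVP implementation); it checks a packet stream against (p, r, b) with minimum policed unit m and maximum packet size M.

```python
class TspecPolicer:
    """Dual-bucket conformance check for an IntServ-style Tspec (p, r, b).

    The peak bucket has rate p and depth M (one maximum-size packet);
    the token bucket has rate r and depth b. Packets smaller than the
    minimum policed unit m are counted as m bytes.
    """
    def __init__(self, p, r, b, m, M):
        self.p, self.r, self.b, self.m, self.M = p, r, b, m, M
        self.peak_tokens, self.tokens, self.last = M, b, 0.0

    def conforms(self, t, size):
        """Return True if a packet of `size` bytes arriving at time t conforms."""
        size = max(size, self.m)
        if size > self.M:
            return False
        dt = t - self.last
        self.last = t
        # Refill both buckets for the elapsed time, capped at their depths.
        self.peak_tokens = min(self.M, self.peak_tokens + self.p * dt)
        self.tokens = min(self.b, self.tokens + self.r * dt)
        if size <= self.peak_tokens and size <= self.tokens:
            self.peak_tokens -= size
            self.tokens -= size
            return True
        return False
```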
Dynamic QoS: RVBR The RVBR service is based on a renegotiable VBR traffic specification, and offers a scheme to optimize the traffic specification for the next period of time during which it is valid (Giordano, 1998, 1999). The RVBR service uses the knowledge of the past status of the system and the profile of the traffic expected in the near future, which can be either pre-recorded or known by means of exact prediction. This scheme suits perfectly the dynamics of the traffic generated by multimedia applications. Moreover, it naturally integrates with the soft state mechanism of RSVP, which allows for renegotiating resources. The elements of an RVBR source, as shown in Figure 1, are a renegotiable leaky bucket specification (with rate r and depth b) plus a fixed-size buffer X drained at most at the renegotiable peak rate p. Therefore, the RVBR service is described with two leaky bucket specifications (Giordano, 1999, 2000). In the case of RSVP, the bucket associated with the peak rate p has the MTU size, hence it is fixed. We further assume it to be equal to zero to simplify the computation, given that this is not a limitation. The observation time is divided into intervals, and Ti = (ti, ti+1) represents the i-th interval. Inside each interval the system does not change. The parameters of the RVBR service in Ti are indicated with (pi, ri, bi). The RVBR service is completely defined by:
• the time instants ti at which the parameters change
• the RVBR parameters (pi, ri, bi), for each interval Ti
• the fixed shaping buffer capacity X

Figure 1: RVBR reference configuration

The input-output characterization of the RVBR service follows straightforwardly as a special case of the time-varying leaky bucket shaper (Giordano, 2000), which is defined by means of Network Calculus theory (Le Boudec, 2000; Thiran, 2000) as:

R*(t) = min { σ0i(t − ti) + R*(ti), inf_{ti < s <= t} { σi(t − s) + R(s) } }   (eq1)

where σ0i, representing the service curve taking into account the initial conditions at time ti, is defined as:

σ0i(u) = min_j ( rij · u + bij − qj(ti) )
and qj(t) is the level of the j-th bucket, defined, for t in Ti, as:

qj(t) = max { sup_{ti < s <= t} { R*(t) − R*(s) − rij(t − s) }, R*(t) − R*(ti) − rij(t − ti) + qj(ti) }   (eq2)

Moreover, w(t) is the backlog in the shaping buffer at time t in Ti:

w(t) = max { sup_{ti < s <= t} { R(t) − R(s) − σi(t − s) }, R(t) − R(ti) − σ0i(t − ti) + w(ti) }   (eq3)

An RVBR source is a time-varying leaky bucket shaper with two renegotiable leaky buckets (J = 2): one with rate ri and depth bi, and a second with rate pi and depth always equal to zero, plus a buffer of fixed size X. Therefore, in equations (eq1), (eq2) and (eq3), σi and σ0i are given by:

σi(u) = min ( pi · u + bi1, ri · u + bi2 )
σ0i(u) = min ( pi · u + bi1 − q1(ti), ri · u + bi2 − q2(ti) )

In the following we will show how RVBR can be used to implement signalling-adaptive applications.
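In discrete time, equations (eq1)-(eq3) reduce to a simple per-step update of the bucket level and of the shaping-buffer backlog. The sketch below is a simplified version of one RVBR shaper step under the assumption made in the text that the peak bucket has depth zero; the function and variable names are illustrative.

```python
def rvbr_step(arrival, state, p, r, b, X, dt=1.0):
    """One discrete-time step of a (simplified) RVBR shaper.

    state = (w, q): shaping-buffer backlog and leaky bucket level.
    The output is limited both by the peak rate p (zero-depth peak
    bucket) and by the leaky bucket with rate r and depth b.
    Returns the traffic released in this step and the new state.
    """
    w, q = state
    w += arrival
    # Largest output satisfying both constraints over this step.
    out = max(0.0, min(w, p * dt, (b - q) + r * dt))
    w -= out
    q = max(0.0, q + out - r * dt)   # bucket fills with output, drains at r
    if w > X:
        raise OverflowError("shaping buffer of size X overflowed")
    return out, (w, q)
```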
The Application of the RVBR Service to RSVP Real-life examples of this service are the traffic shaping done at a source sending over VBR connections as defined in ITU (1998), and Internet traffic that takes the form of an IntServ specification with RSVP reservation (Braden, 1997; Wroclawski, 1997). In RSVP, the sender sends a PATH message with a Tspec object that characterizes the traffic it is willing to send. When we consider a network that provides a service as specified for the Controlled Load Service (CLS) (Wroclawski, 1997), the Tspec takes the form of a double bucket specification (Wroclawski, 1997) as given by the RVBR service. There is a peak rate p and a leaky bucket specification with rate r and bucket size b. Since, with RSVP as the reservation protocol, the reservation has to be periodically refreshed, p, r and b need to be reissued at each renegotiation time. There is no additional signalling cost in applying a Tspec renegotiation at that point, even if there is some computational overhead due to the computation of the new parameters, to the call admission control, etc. It is important to note here that, contrary to the negotiation of a new connection, with the renegotiation the reservation is never interrupted. If the network cannot support the requested traffic specification, the old traffic specification is restored and the network may not be able to accommodate the next traffic. Mechanisms to prevent this failure from occurring are still under study. Here we assume that either (1) the Tspec is accepted all over the network, as well as at the destination, such that the source can transmit conforming to its desired traffic specification, or (2) the source can adapt to transmit with the old Tspec, even if at the price of a reduced quality. We assume that at any time ti = 30 · i the application knows (because pre-recorded or predicted) the traffic for the next 30 seconds. We further assume to know the cost to the network of the Tspec (indicated by the cost function u · r + b) and the upper bounds to the bucket size (bmax) and to the bucket rate (rmax). The backlog w(ti) and the bucket level q(ti) can be measured in the system. In order to use the RVBR service in the RSVP with CLS scenario, we are faced with the problem of computing the leaky bucket parameters. Therefore, we describe the case of a source that wants to reserve the resources for the next interval. For the RVBR service, this is equivalent to the problem of computing the RVBR parameters for the next interval. Here we present the approach that we used in the simulations (Giordano, 1998). As we will see later, in real cases this approach can require some modifications. From equations (eq1) and (eq2) it follows that:

R(t) − R(s) <= σi(t − s) + X, for t in Ti, ti < s <= t
R(t) − R(ti) <= σ0i(t − ti) − w(ti) + X, for t in Ti
These equations give a necessary and sufficient condition for a minimum pi. This, in analogy to the work in Le Boudec (1997), can be seen as the effective bandwidth of the arrival stream in Ti taking into account the backlog at time ti. Therefore, given that pi is computed independently from ri and bi, the problem of finding a complete optimal parameter set (pi, ri, bi) for the RVBR service is reduced to the problem of finding the optimal parameters ri and bi. This optimization problem, when the cost function is linear, c(ri, bi) = u · ri + bi, for fixed values of u, is modelled in Giordano (2000) in the form of an algorithm (LocalOptimum):

Algorithm 1: LocalOptimum(X, {R(t)} for t in Ti, bmax, rmax, u, w(ti), q(ti), ti+1)

If bmax < sup_{s in Ti} { βi(s) − rmax · s − X } then there is no feasible solution;
Else {
    pi = max( sup_{t,s in Ti} (R(t) − R(s) − X) / (t − s),
              sup_{t in Ti} (R(t) − R(ti) − X + w(ti)) / (t − ti) );
    If (u = 0) then {
        x0 = min(rmax, pi);
    } else {
        x0 = sup_{s in Ti} (βi(s) − βi(u)) / (s − u);
        xA = sup_{s in Ti, s > 0} (βi(s) − X − bmax) / s;
        xB = sup_{s in Ti, s > 0} (βi(s) − X) / s;
        if (x0 >= min(xB, rmax, pi)) then x0 = min(xB, rmax, pi)
        else if (x0 <= xA) then x0 = xA;
    }
    ri = x0;
    bi = sup_{s in Ti} { βi(s) − X − s · x0 };
}
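When βi is known only through a finite set of sample points, the same optimum can be approached numerically. The sketch below is an illustrative brute-force counterpart of LocalOptimum (not the closed-form algorithm itself): it minimizes the linear cost u·r + b(r), with b(r) = sup_s(βi(s) − X − r·s), subject to b <= bmax and r <= rmax.

```python
def local_optimum_numeric(beta, X, b_max, r_max, u, steps=1000):
    """Brute-force search over candidate rates r for a sampled beta.

    beta: list of (s, beta_s) sample points of the traffic bound for
    the next interval. Returns the (r, b) pair minimizing u*r + b,
    or None when even r = r_max would need a bucket deeper than b_max.
    """
    def bucket_for(r):
        # Smallest bucket depth making rate r feasible: sup_s(beta(s) - X - r*s)
        return max((bs - X - r * s for s, bs in beta), default=0.0)

    best = None
    for i in range(steps + 1):
        r = r_max * i / steps
        b = max(0.0, bucket_for(r))
        if b > b_max:
            continue  # this rate is infeasible under the bucket bound
        cost = u * r + b
        if best is None or cost < best[0]:
            best = (cost, r, b)
    return None if best is None else (best[1], best[2])
```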
RVBR When the Traffic is Specified by its Arrival Curve The LocalOptimum algorithm uses functions that require the exact traffic. In the real case of video stream applications, the module that implements RVBR has to access information related to the network and, therefore, even if the traffic is pre-recorded and stored, it is very unlikely to have access to the exact traffic. With the aim of using the algorithm in a real application, we propose an approximation of some functions, which originally work with the exact traffic, so that they work with
smaller and less precise information: the exact traffic R(t) for t in Ti is substituted by upper bound functions. We introduce the function:

αi(u) = min ( pαi · u, rαi · u + bαi )

where
pαi = sup_{t,s} (R(t) − R(s)) / (t − s)
rαi = (R(ti+1) − R(ti)) / (ti+1 − ti)
bαi = sup_t (R(t) − rαi · t)+

and a second function that takes into account the bucket level q(ti) at the transient period:

αi0(u) = min ( pα0i · u, rαi · u + bαi − q(ti) )

where
pα0i = sup_t (R(t) − R(ti)) / (t − ti)

These functions are arrival curves (Le Boudec, 2000) of R(t), i.e., upper bounds to the traffic R(t). With the introduction of αi and αi0 we can approximate the function βi and the optimal peak rate pi, which in the RVBR are originally computed from the exact traffic R(t). Therefore, indicating with w(ti) the backlog in the shaping buffer at time ti, the function βi is given by:

βi(s) = max ( αi(s), αi0(s) + w(ti) + q(ti) )

and the minimum pi by:

pi = max ( sup_s (αi(s) − X) / s, sup_s (αi0(s) − X + w(ti)) / s )

αi and αi0 can be used in a real implementation, because they are computed with only four parameters (pαi, rαi, bαi, pα0i) that can be easily stored and passed from the application level to the RVBR module.
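The four parameters can be computed directly from the cumulative traffic of the next interval. The sketch below is illustrative: it assumes the traffic is available as samples R[k] of the cumulative arrivals at instants t[k], and extracts (pαi, rαi, bαi, pα0i) from such a trace.

```python
def arrival_curve_params(R, t):
    """Compute (p_alpha, r_alpha, b_alpha, p_alpha0) from a sampled trace.

    R[k] is the cumulative traffic observed at time t[k] over the next
    interval; these four scalars are the only information the RVBR
    module needs from the application.
    """
    n = len(t)
    # Peak rate: steepest slope between any two sample instants.
    p_alpha = max((R[j] - R[i]) / (t[j] - t[i])
                  for i in range(n) for j in range(i + 1, n))
    # Mean rate over the whole interval.
    r_alpha = (R[-1] - R[0]) / (t[-1] - t[0])
    # Burst above the mean-rate line, clipped at zero (the (.)+ operator).
    b_alpha = max(max(R[k] - R[0] - r_alpha * (t[k] - t[0])
                      for k in range(n)), 0.0)
    # Peak rate measured from the interval start (transient term).
    p_alpha0 = max((R[k] - R[0]) / (t[k] - t[0]) for k in range(1, n))
    return p_alpha, r_alpha, b_alpha, p_alpha0
```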
AN MPEG-4 APPLICATION: ARMIDA ARMIDA is an MPEG4 (ISO-1, 1998)-compliant client-server platform. It provides a video streaming feature, potentially requiring a large amount of bandwidth with strict QoS bounds to satisfy audio/video requirements. It provides features to manage several multimedia streams (named Elementary Streams), which are combined together by the client according to synchronization requirements. In the version used in this work, the ARMIDA architecture is characterized by the introduction of the intermediate DMIF (Delivery Multimedia Integration Framework) (ISO, 1998) layer to make the application independent from the network. The final architecture consists of three layers: the "real" application, the DMIF filter (the part independent of the delivery technology and the part dependent on the delivery technology) and the Daemon process. DMIF provides QoS management aspects and mechanisms to gather information about data transfer and resource utilization. A standard definition of the QoS format would be needed to guarantee a general mapping from user QoS (DMIF) to network QoS (RSVP, etc.), providing information about data sources. The following parameters are passed from the application to the DMIF:
• MAX_AU_SIZE: maximum size of an access unit, expressed in bytes.
• AVG_BITRATE: average bit rate, expressed in bytes/second.
• MAX_BITRATE: maximum bit rate, expressed in bytes/second.
• SERVICES_CONSTRAINT: indicates whether the traffic requires strict delay bounds.
• TIME_LENGTH: the whole duration of the stream to be transmitted by the application.
• BURST_SIZE: the burst dimension that the application foresees to send.
These values can be calculated because ARMIDA is a video-streaming application, where the traffic is known in advance and measurements can be made in order to obtain these values.
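A natural, though not standardized, mapping from these DMIF parameters onto the CLS double-bucket Tspec is sketched below; the correspondence chosen here (MAX_BITRATE to the peak rate, AVG_BITRATE to the bucket rate, BURST_SIZE to the bucket depth) is an assumption for illustration only.

```python
def dmif_to_tspec(max_bitrate, avg_bitrate, burst_size):
    """Map DMIF QoS parameters onto a CLS double-bucket Tspec (p, r, b).

    A plausible mapping, not mandated by the standards: the peak rate
    follows MAX_BITRATE, the bucket rate follows AVG_BITRATE, and the
    bucket depth follows BURST_SIZE. DMIF rates are in bytes/second.
    """
    p = max_bitrate   # peak rate
    r = avg_bitrate   # token (bucket) rate
    b = burst_size    # bucket depth
    return p, r, b
```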
ARMIDA with RVBR The introduction of RVBR within ARMIDA has an impact on three layers:
1. Application layer: Several QoS descriptors must be handled. The current standard defines that only a single QoS descriptor can be associated with an Elementary Stream. In order to introduce the renegotiation, it is necessary to define the association between several QoS descriptors and an Elementary Stream. Additionally, a structure for maintaining and managing several QoS descriptors for the same data flow is needed: a data flow must store more than one QoS descriptor and a reference to the related portion of data.
2. DMIF layer: At each re-negotiation, it must provide the new QoS descriptors to the Daemon layer.
3. Daemon layer: Several re-negotiation phases must be managed. It must be able to
• identify when an old QoS expires in order to start a new QoS
• interact with RSVP, which must modify the Tspec sent with the PATH message asking for a new reservation.
The video stream has to be divided into time intervals, each of which requires a homogeneous QoS descriptor and whose duration has to be greater than that of the soft state, so that the application is protected from the loss of RSVP packets. We can divide the total duration T of the stream into a set of intervals related to each required renegotiation: T = {T1, T2, …, Tn}. In ARMIDA we implemented both the GS and the CLS, but the RVBR is more suitable with the CLS. Unlike the GS, a CLS reservation is successful when the first RESV message is received, even if the reservation characteristics are unknown, because the CLS does not require bounds on end-to-end datagram queuing. Therefore, a reservation failure due to lack of resources is not possible. In the case of a GS reservation, when we introduce the renegotiation we can no longer assure the service for the entire traffic. For example, if in the interval Ti we require more resources than the resources used in the interval Ti-1, there is no guarantee that the network can satisfy the request. Therefore, the resulting service is guaranteed only interval by interval (local guarantee), as opposed to the GS (global guarantee).
Figure 2: ARMIDA architecture (the application layer, with the application, client or server, and the DMIF filter connected through the DAI; the DMIF layer, connected through the DNI; and the daemon layer, with TCP/IP, TCP/IP-RSVP and ATM daemons for the different delivery protocols)
RSVP provides features to ask for a new reservation conforming to the traffic characteristics. The nature of the application, a streaming of pre-recorded video, allows all the generated traffic information (e.g., rate, burst, peak, etc.) to be known. This enables the implementation of a completely deterministic system based on the knowledge of parameters that suit perfectly the dynamics of the generated traffic. The reservation mechanism provided by RSVP is based on the re-sending of a control message (PATH) at every pre-defined time interval (the default is 30 sec), carrying information about the required QoS. The usual behaviour is that the RSVP daemon re-sends the same PATH until the end of the data communication. The idea is to exploit this mechanism also to change the reservation, sending a new PATH message carrying the values related to the new traffic descriptor defined in the Tspec. The re-negotiation must take into account the following:
• The new reservation must avoid removing resources when these are still needed (e.g., when the new reservation starts before the new interval).
• The new reservation must guarantee enough resources to the application according to its requirements (e.g., when a new reservation starts late).
The main problem is to determine when the new reservation phase must start, to be sure that data sending and signalling are synchronized so as to always guarantee the requested resources.
DYNAMIC QOS FOR MULTIMEDIA TRAFFIC In the following we present the performance of ARMIDA when used with the RVBR capability. We first introduce a more adaptive approach that is not constrained by a fixed renegotiation period, and then we illustrate our results, which were also shown in a real demonstration at the Telecom99 exposition (DeSousa, 1999).
Renegotiation Time Evaluation To evaluate when the new reservation phase must start, we need to know the time spent to install a generic reservation, i.e., the time spent to send a PATH message and receive a RESV message. This delay takes variable values. To model it, we assume an exponential distribution of the delay, where the probability that the delay x is less than T is y:

P{ x <= T } = y, with P{ x <= T } = 1 − e^(−λT)

where λ = 1/E{x}. The mean value E{x}, and hence λ, is a function of the network properties. In our experiments, server and client belong to the same network, which is half-duplex with a bandwidth of 10 Mbit/s. By considering two access time periods (needed to complete the reservation) and a processing time on the terminal of 20%, we have E{x} = 240 milliseconds. This result is confirmed by experimental observations, in which we measured the delay spent to install the reservation for each request to change it. Then, with E{x} known, it is possible to estimate the time T needed to set up the reservation, observed with probability y: y = 0.9 ⇒ T = 552.62 msec. Knowing T, we can decide when a new reservation phase must start by considering the Tspec associated with the actual time interval and with the previous and successive intervals. Let Θ be the set of QoS defined as follows:
Θ = {θi : θi is the QoS related to the i-th interval of the data sequence}
We can define on Θ a total order < as follows. Let θ1, θ2 ∈ Θ be defined as θ1 = (r1, b1, p1), θ2 = (r2, b2, p2):

θ1 < θ2 if (r1 + 0.5 · b1) < (r2 + 0.5 · b2)   (eq4)

We can define a relationship between Tspec and Θ, mapping each θ ∈ Θ on the Tspec carrying the related values. In the case of the first time period, the actual Tspec has to be compared with the next one via (eq4) in order to assure that, at the end of the first interval, a correct reservation for the data associated with the second interval has been installed. Let Tspec1 and Tspec2 be the Tspecs related to the first and the second interval:
• If Tspec2 < Tspec1, the procedure needed to set up the second reservation can start at the end of the first interval, because the resources actually reserved are sufficient to assure correct data sending.
• If Tspec1 < Tspec2, the procedure has to be anticipated in order to guarantee enough resources to the second group of data.
Let TRi be the time needed to set up the reservation related to the i-th interval. In the first case, the PATH message containing the new Tspec can be sent at the end of the first interval; in the second case, the PATH message has to be sent T seconds before the end of the time period. In the case of the successive time periods, i.e., the general case, we also need to consider the previous reservation, because in order to evaluate the duration of the i-th reservation, we have to know when this reservation has taken place. We can define the starting time TSi of each reservation and, from it, the duration TDi of each reservation as follows:
• TDi = Ti − TRi, if Tspeci < Tspeci-1 and Tspeci > Tspeci+1;
• TDi = Ti − T + TRi, if Tspeci < Tspeci-1 and Tspeci < Tspeci+1;
• TDi = Ti − T + (T − TRi) = Ti − TRi, if Tspeci > Tspeci-1 and Tspeci < Tspeci+1;
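The timing rules above can be made concrete in a few lines. The sketch below (with illustrative names) computes the set-up deadline T from the exponential-delay model and uses the total order of (eq4) to decide whether the next PATH message must be anticipated.

```python
import math

def setup_deadline(mean_delay, confidence):
    """Time T within which a reservation installs with the given
    probability, assuming exponentially distributed set-up delay."""
    return -mean_delay * math.log(1.0 - confidence)

def tspec_weight(r, b):
    """The total order of (eq4): Tspecs are compared by r + 0.5*b."""
    return r + 0.5 * b

def must_send_early(tspec_now, tspec_next):
    """True when the next interval needs more resources, so the new
    PATH has to be sent T seconds before the interval boundary.
    Tspecs are given here as (r, b) pairs; p plays no role in (eq4)."""
    (r1, b1), (r2, b2) = tspec_now, tspec_next
    return tspec_weight(r1, b1) < tspec_weight(r2, b2)

# E{x} = 240 ms and y = 0.9 give T ~ 552.6 ms, as in the text.
T = setup_deadline(0.240, 0.9)
```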
Figure 3: Behaviour of renegotiation time
In Figure 5 we show a practical case referring to a video stream of 280 sec where renegotiation actions are performed every 60 seconds. The allocated QoS is always greater than or equal to the QoS required by the application. The trial has been performed on an Ethernet LAN.
QoS Input Parameters We must distinguish the QoS defined at the network layer from the QoS defined at the application layer. RSVP provides features to set up a communication according to a QoS defined at the network layer; the same is provided by DMIF. Because of the quite initial state of DMIF standardization (related to the QoS definition) and the presence of fields still to be defined, we propose to introduce the following parameters in order to support RSVP with RVBR:
• SERVICES_CONSTRAINT indicates whether there are restrictions on the traffic delay; it is a property of the entire ES.
• INTERVAL_NUMBER specifies the number of time intervals composing the stream.
• MAX_AU_SIZE is the maximum size of an access unit; it can be considered a global parameter.
• TIME_LENGTH is the total length of the stream.
The parameters listed above are common to the complete stream; there are also some parameters that have to be replicated for each time interval:
• AVG_BITRATE is the average bit rate of each interval.
• MAX_BITRATE is the maximum bit rate of each interval.
• BURST_SIZE is the burst size of each interval.
• INTERVAL_TIME is the duration of each interval.
It follows that the QoS DMIF descriptor of each Elementary Stream is composed of two main parts:
• global parameters,
• an array of repeated parameters, whose dimension is equal to |{T1, …, Tn}|.
The total QoS parameters are computed for each interval, according to the previous rules; for this purpose, the interval lengths of the Elementary Streams that will be composed in the same TransMux have to be equal, to allow a correct combination of them.
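A two-part descriptor of this kind can be represented as sketched below (an illustrative data structure, not the DMIF wire format).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IntervalQoS:
    """Parameters replicated for each time interval of the stream."""
    avg_bitrate: int      # bytes/second
    max_bitrate: int      # bytes/second
    burst_size: int       # bytes
    interval_time: float  # seconds

@dataclass
class StreamQoS:
    """QoS DMIF descriptor of an Elementary Stream: global parameters
    plus one entry per renegotiation interval (|{T1,...,Tn}| entries)."""
    services_constraint: bool   # strict delay bounds required?
    max_au_size: int            # bytes
    time_length: float          # seconds
    intervals: List[IntervalQoS]

    @property
    def interval_number(self) -> int:
        return len(self.intervals)
```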
Traffic Renegotiation We present here some experiments with a real MPEG4 video of 2.2 Mbytes transmitted in 280 seconds over a LAN with a low average load. Initially all the buffers and buckets are empty (zero initial conditions). The file is pre-recorded frame by frame and, given that we do not
Figure 4: Resources overlapping (the two cases Tspeci < Tspeci+1 and Tspeci+1 < Tspeci, showing the intervals Ti and Ti+1, the set-up deadline T and the set-up time TRi+1)
Figure 5: Example of renegotiation time (required bandwidth B_required vs. reserved bandwidth B_reserved over time)
use any scheduling or pre-fetch, we know R(t) over the entire interval. We also assume that the Tspec is accepted all over the network, as well as at the destination, such that the source can transmit conforming to its traffic specification. As follows from the formulas given above, when we renegotiate we also know R*(t) for any previous time, and we can measure the buffer and the leaky bucket content. We obtain the optimal shaper parameters by applying the algorithm LocalOptimum. Figure 6 and Figure 7 show the effective generated traffic and the allocated resources with renegotiation intervals defined at 60 sec and with no renegotiation (280 sec). Increasing the number of segments with different QoS parameters, we need fewer resources, with a lower waste of bandwidth (allocation greater than the real traffic) and a limited amount of buffer for traffic shaping. We have also shown that the renegotiation activity adds an insignificant communication overhead in the case of RSVP. We have a relatively small improvement for the peak, because the input traffic is not very bursty in the initial and final parts. On the contrary, there is a real improvement for the mean rate (and also for the bucket size, as shown in Figure 9). The rate needed with renegotiation is much smaller, because it adapts to the current transmission and is not constrained by a fixed initial negotiation as in Figure 6. In Figure 8 we compare the behaviour of the system with different reallocation intervals: 60 sec, 120 sec and no reallocation. Here the benefit of renegotiation in terms of resources is evident, which is beneficial both to the network (which has to allocate the resources) and to the application (which, very likely, has to pay for them). Figure 9 illustrates the setting of the bucket size for the same experiment. By the analysis of this figure, it is evident that the renegotiation is effective for the bucket size, because the algorithm works on smaller intervals. The comparison of these curves confirms that the scheme we propose allows users to optimize the resources reserved in the network (expressed in terms of bucket), with a limited overhead deriving from the effort needed for renegotiation. These results indicate that renegotiation is an efficient mechanism for allowing better use of network resources at the very low price of implementing a service like RVBR.
Figure 6: Example allocation without renegotiation (approximation with one interval; rate, mean and peak over time)
Figure 7: Example of renegotiation every 60 sec (rate, mean and peak over time)
CONCLUSIONS We presented an example of a traffic-accommodating application able to dynamically manage the QoS according to the traffic requirements. This is obtained with the introduction of the RVBR service, which allows the traffic specification of a connection to be modified, while keeping the connection active, in order to support the traffic QoS requirements. In this respect, we introduced the video-on-demand application RVBR-enabled ARMIDA, which also became the first instance of an RVBR-enabled application that renegotiates the RSVP traffic specification. We solved several issues arising with the integration and the utilization of this model and illustrated the performance of the application with MPEG4 traffic. We carried out this study on a real network, with the client and server exchanging RSVP messages containing a Tspec renegotiated according to the RVBR service. We compared the results of transmitting an MPEG4 video trace with renegotiation (at different renegotiation periods) and without renegotiation. The measurements performed, as well as the real-time
Figure 8: Comparison among different reallocation intervals (average and peak rate with no reallocation and with reallocation every 120, 90 and 60 sec)
Figure 9: Bucket definition for renegotiation time equal to 60 (bucket_1) and with no renegotiation (bucket_6): bucket size over time
demonstration we conducted at Telecom99, showed that renegotiation allows better use of network resources and that, with protocols such as RSVP, where there is no additional cost for signalling (or so we mainly assume), it is better to renegotiate. We also presented an advanced study on the renegotiation period that takes into account the possibility of using different periods for the same traffic. Some important aspects (for example, a recovery mechanism in case of fault) are still under study.
REFERENCES Bernet, Y. (2000). The complementary roles of RSVP and differentiated services in the full-service QoS network. Communication Magazine, February. Braden, R., Zhang, L., Berson, S., Herzog, S. and Jamin, S. (1997). RFC2205: Resource ReSerVation Protocol (RSVP). IETF. De Sousa, P. (1999). Global communication newsletter. Communication Magazine. Giordano, S. and Le Boudec, J. Y. (2000). On a class of time varying shapers with application to the renegotiable variable bit rate service. Journal on High Speed Networks.
Giordano, S. and Le Boudec, J. Y. (1999). The renegotiable variable bit rate service. IFIP99. Giordano, S. and Le Boudec, J. Y. (1998). QoS-based integration of IP and ATM: Resource renegotiation. In Proceedings of 13th IEEE Computer Communications Workshop. ISO. (1998). Information Technology, Generic Coding of Moving Pictures and Associated Audio Information, Part 6: Delivery Multimedia Integration Framework. ISO. ISO-1. (1998). Information Technology, Generic Coding of Moving Pictures and Associated Audio Information, Part 1: Systems. International Standard Organization. ITU-T Recommendation Q.2963.2. (1998). Broadband integrated services digital network (B-ISDN) digital subscriber signalling system No. 2 (DSS 2) connection modification: Modification procedure for sustainable cell rate parameters. ITU Telecommunication Standardization Sector, Study Group 13. Le Boudec, J. Y. (1997). Network calculus, deterministic effective bandwidth, VBR trunks. IEEE Globecom 97, November. Le Boudec, J. Y. and Thiran, P. (2000). A short tutorial on network calculus I: Fundamental bounds in communication networks. Proceedings of ISCAS 2000, Geneva, May. Le Boudec, J. Y., Thiran, P. and Giordano, S. (2000). A short tutorial on network calculus II: Min-plus system theory applied to communication networks. Proceedings of ISCAS 2000, Geneva, May. Shenker, S., Partridge, C. and Guerin, R. (1997). RFC2212: Specification of guaranteed quality of service. IETF. Shenker, S. and Wroclawski, J. (1997). RFC2216: Network element service specification template. IETF. Wroclawski, J. (1997). RFC2210: The use of RSVP with IETF integrated services. IETF. Wroclawski, J. (1997). RFC2211: Specification of controlled-load network element service. IETF.
Chapter XIII
Playout Control Mechanisms for Speech Transmission over the Internet: Algorithms and Performance Results Marco Roccetti University of Bologna, Italy
Sophisticated applications of Internet multimedia conferencing will become increasingly important only if their users perceive the quality of the communications as sufficiently good. The results of extensive experiments have shown that audio is frequently perceived as one of the most important components of multimedia communications. Unfortunately, the current architecture of the Internet is not a good environment for real-time audio communications, since very high transmission delay and transmission delay variance (known as jitter) may be experienced, impairing human conversations. Hence, in the absence of network support to provide guarantees of quality to users of Internet voice software, an alternative for coping with problems caused by delay and delay jitter is to use adaptive control mechanisms. These mechanisms are based on the idea of using a voice reconstruction buffer at the receiver in order to add artificial delay to the audio stream to smooth out the jitter. In this chapter, we describe three different control mechanisms that are able to dynamically adapt the audio application to the network conditions so as to minimize the impact of delay jitter (and packet loss). We also present a set of performance results we have gathered from extensive experimentation with an Internet audio tool we have designed and developed in order to conduct voice-based audio conversations over the Internet.
INTRODUCTION The value of integrating real-time and traditional data services onto a common network is well known. Telecommunication companies have proposed ATM as a standard for upgrading the Internet to provide both real-time and data services. In contrast, it has been empirically demonstrated that real-time multimedia services may be added to traditional IP networks that were originally designed for data transmission only. However, since the IP community has not considered the provision of Quality of Service guarantees with the same intensity as the ATM community, the current Internet service model offers a flat, classless, best-effort service to its users. This is not a good environment for real-time audio transmissions, since the delay between the arrivals of subsequent packets becomes dependent on the traffic condition of the network. It is a common experience that very high packet delay and packet delay variance (known as jitter) may be experienced over many congested Internet links. As a consequence, audio packet loss percentages, due to the effective loss and damage of packets as well as to belated arrivals, may be very large. User studies demonstrate that, among the many possible metrics that influence the user perception of audio, the most important disrupting factors are the packet audio playout delay and the packet loss rate. With the term playout delay, we refer to the total amount of time experienced by the audio packets of a given talkspurt between the instant they are generated at the source and the instant they are played out at the destination. Summarizing, such a playout delay consists of: i) the collection time needed for the transmitter to collect audio samples and prepare them for transmission, ii) the transmission time needed for the transmission of audio packets from the source to the destination over the underlying transport network, and finally iii) the buffering time, that is, the amount of time that a packet spends queued in the destination buffer before it is played out. Thus, in the absence of network support to provide guarantees of quality to users of Internet voice software, an interesting alternative for coping with problems caused by jitter and high packet loss is to use adaptive control mechanisms. Typically, these mechanisms are based on the idea of using a voice reconstruction buffer at the receiver in order to add artificial delay to the audio stream to smooth out the jitter. Clearly, the longer the scheduled playout delay, the more likely it is that an audio packet will arrive at the destination before its scheduled playout deadline has expired. However, too long a playout delay can significantly impair human interactive conversations. In this chapter, we will describe the most characterizing features of a number of those control mechanisms that try to dynamically adapt the audio application to the network conditions so as to minimize the impact of delay jitter and packet loss. We will also report on a set of performance results we have gathered from extensive experimentation with an Internet audio tool we have designed and developed in order to conduct unicast, voice-based audio conversations over the Internet. The remainder of this chapter is organized as follows. In the next section, we present a brief overview of digital audio coding techniques that aims at describing the state of the art of audio compression methods for human conversational speech.
In the third section, three important algorithms are detailed that are typically used for transmitting human speech over the Internet. In the fourth section, performance figures derived from several experiments are discussed that illustrate the adequacy of those mechanisms in transmitting speech across the Internet. Finally, the conclusions complete the chapter in the last section.
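To make the reconstruction-buffer idea of this introduction concrete, the sketch below shows a minimal adaptive playout estimator in the style of the autoregressive schemes common in the literature; the class name, the smoothing constant and the safety factor k are illustrative assumptions, and the specific mechanisms evaluated in this chapter are the ones detailed in the third section.

```python
class PlayoutEstimator:
    """Minimal adaptive playout scheduler.

    d_hat and v_hat are exponentially weighted estimates of the network
    delay and of its variation (jitter); each new talkspurt is scheduled
    for playout d_hat + k*v_hat after its generation time. The default
    smoothing constant is a value used in the literature.
    """
    def __init__(self, alpha=0.998002, k=4.0):
        self.alpha, self.k = alpha, k
        self.d_hat, self.v_hat = 0.0, 0.0

    def on_packet(self, sent_at, received_at):
        """Update the delay and jitter estimates from one packet."""
        delay = received_at - sent_at
        self.d_hat = self.alpha * self.d_hat + (1 - self.alpha) * delay
        self.v_hat = (self.alpha * self.v_hat
                      + (1 - self.alpha) * abs(delay - self.d_hat))

    def playout_time(self, talkspurt_start):
        # Artificial buffering delay added to smooth out the jitter.
        return talkspurt_start + self.d_hat + self.k * self.v_hat
```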
SPEECH CODING: BASIC CONCEPTS Sound is a physical phenomenon caused by the vibration of matter (Steinmetz and Nahrstedt, 1999). As a material vibrates, momentary changes in air pressure are produced. These changes of air pressure are modeled as waves propagating from the point of origin through the air. When a wave reaches a human ear, a sound is perceived. The pattern of these oscillations is called a waveform. Waveforms are often periodic, since they reproduce the same shape at regular intervals termed periods. The frequency of a sound is the reciprocal value of the period and is measured in hertz (Hz), or number of periods in a second. The total sound frequency range may span from the so-called infra-sound frequencies (from 0 to 20 Hz) to the so-called hyper-sound frequencies (from 1 GHz to 10 THz). Usually, multimedia communication systems exploit only the frequency range of human hearing, from 20 Hz to 20 kHz. Sounds in the human hearing frequency range are called audio, and the waves in this range acoustic signals. Human speech and music are examples of acoustic signals in the human hearing frequency range. Sound is also characterized by means of amplitude. The amplitude of a sound is a function of the sound pressure level (i.e., the total energy under the envelope of the waveform) and is measured in units of decibels (dB). The amplitude of a sound is perceived by a human ear as loudness. A change in the sound pressure level of 10 decibels corresponds to a perceived doubling in the loudness of that sound; the limit of human hearing is 0 to 10 dB, and a difference in sound level of 1 to 3 dB is hardly perceived. The analog representation of a sound wave is typically obtained by exploiting an analog electrical signal. In such a case, an amplitude of 0.7 volts driving a load of 600 ohms corresponds to a sound pressure level of 0 dB (Westwater, 1998). The continuous curve of a sound waveform may not be directly represented in a digital computer (as an analog signal). However, a digital representation of sound may be obtained by means of a technique known as "Pulse Code Modulation" (PCM for short). This technique is applied to the smooth, continuous electrical signal captured by an analog device (the microphone). The amplitude of the waveform is measured (sampled) at regular time intervals and converted into an integer value. Each of these instantaneous measurements is a sample. The rate at which a continuous waveform is sampled is called the sampling rate. Sampling rates are measured in Hz. As early as 1924, Nyquist realized the existence of a fundamental limit concerning the sampling rate for lossless digitization: in order to achieve lossless digitization, the sampling rate must be at least twice the maximum frequency of the signal. In other words, an ordinary telephone line, which has a frequency bandwidth of 300 to 3,400 Hz, translates into a need to digitize voice samples at 8,000 Hz. Sampling the line faster than 8,000 Hz would be useless, since the higher frequency components that such sampling could recover have already been filtered out by the maximum frequency (3,400 Hz). The process of conversion from the analog audio signal to its digital representation is called analog-to-digital (A/D) conversion. The reverse process, performed for converting digital audio samples into an acoustic signal, is termed digital-to-analog (D/A) conversion. Just as the waveform is sampled at discrete time intervals, the value of each sample is also discrete.
The number of discrete levels in the A/D conversion defines the accuracy of the sample. The process of representing continuous information by a limited number of discrete levels is known as quantization, and the number of bits used for measuring the amplitude of the waveform determines the resolution of a sample. Typical audio converters use 8 to 16 bits of accuracy in the sampling, yielding from 256 to 65,536 possible discrete values. Summarizing, the two following
different examples may be considered. The CD music standard has a sampling rate of 44,100 Hz, which means that CD music waveforms are sampled 44,100 times per second. Taking into account the human ear bandwidth, which is equal to 20,000 Hz – 20 Hz = 19,980 Hz, a sampling rate of 44,100 Hz may capture frequencies up to 22,050 Hz, which is close to the upper limit of human hearing. CD-quality audio is quantized with 16-bit linear PCM. Hence, each second of CD-quality music requires a total of 705,600 bits. Instead, in order to permit the use of conventional telephone systems, human speech is sampled at a rate of 8,000 Hz. With an 8-bit per sample quantization scheme, a second of voice-quality digitized audio requires 64,000 bits. The examples of telephone-quality and CD-quality audio reported above demonstrate that the important parameters for the format specification of audio are the sampling rate and the sample resolution.

As far as speech coding is concerned, the algorithm at the basis of the PCM conversion process does not compress audio samples. We have already mentioned that, with 8 bits per digitized sample, voice samples amount to 64 Kbps (Kilobits per second). This enormous amount of data puts transmission networks under severe stress; for example, it cannot be supported by an existing 28.8 Kbps modem. One possible solution to this problem is to employ more efficient encoding algorithms to transform analog voice signals into digital samples. Many such algorithms exist that were originally standardized for use over the Public Switched Telephone Network (PSTN). At the sending side, speech samples are compressed before being transmitted across the network; at the receiving side, the inverse coding algorithm decodes the samples. Compression is necessary to reduce the bandwidth requirement of the digitized audio data. These speech compression schemes reduce the network bandwidth consumed, at the cost of reduced audio quality and increased compression complexity. The perfect compression algorithm is still waiting to be developed: typically, compression schemes that produce good-quality speech at high compression ratios either limit the content of the audio or are computationally complex.

A first class of speech coding schemes provides satisfactory audio quality. The following two algorithms belong to this class. Adaptive Differential Pulse Code Modulation (ADPCM) and Code Excited Linear Prediction (CELP) operate, respectively, at 32 Kbps and 16 Kbps while still maintaining toll (or telephone) quality, at the cost of increasing complexity in the encoding algorithms. Another standard speech coding scheme is GSM (originally Groupe Spécial Mobile), which was developed for use over cellular networks. The target quality of service of GSM is slightly less than toll, but the algorithm, which is able to operate at roughly the same bit rate as CELP, is much less complex. A second class of coding algorithms provides (much less than toll) synthetic quality. These algorithms operate at very low bit rates but produce mechanical-sounding speech. Among these, the most important are Linear Predictive Coding (LPC) and the ITU-T standard G.723.1. They both operate at a very low bit rate (approx. 5 Kbps and below) and are based on a principle that is also at the basis of both the CELP and GSM coders.
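The arithmetic behind the uncompressed bit-rate figures quoted above is simple enough to make explicit. The following C fragment is our illustration (the function name is ours): one second of single-channel PCM needs the sampling rate times the sample resolution in bits.

#include <stdio.h>

/* Bits needed for one second of uncompressed, single-channel PCM:
   the sampling rate times the per-sample resolution. */
static long pcm_bits_per_second(long sampling_rate_hz, int bits_per_sample)
{
    return sampling_rate_hz * (long) bits_per_sample;
}

int main(void)
{
    /* CD-quality music: 44,100 Hz, 16-bit linear PCM. */
    printf("CD quality: %ld bits/s\n",
           pcm_bits_per_second(44100, 16));   /* 705,600 */

    /* Telephone-quality speech: 8,000 Hz, 8-bit samples. */
    printf("Telephone quality: %ld bits/s\n",
           pcm_bits_per_second(8000, 8));     /* 64,000 */
    return 0;
}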
A summary of the expected bandwidth (expressed in Kbps), the computational complexity (expressed in instructions per second, or IPS) and the quality of the above-mentioned coding schemes is reported in Table 1 (Watson and Sasse, 1998).
Table 1: Comparison of speech coding schemes

Technique     Bandwidth (Kbps)   Complexity (IPS)   Quality
PCM           64                 10,000             Toll
ADPCM         32                 100,000            Toll
CELP-G.728    16                 1,000,000,000      Toll
GSM           13.2               10,000,000         Less than Toll
LPC           2.4-5.6            10,000,000         Synthetic
G.723.1       5.3-6.3            1,000,000          Synthetic
SPEECH TRANSMISSION OVER THE INTERNET

The transmission of speech over the Internet (i.e., Internet telephony) is still often dismissed as an impractical application because of the poor quality experienced by many users of Internet audio tools. Internet audio services must operate in a bandwidth-, delay- and packet loss-constrained environment, and these constraints have been passed down to the codec development efforts of the ITU (International Telecommunication Union). As mentioned above, several codecs have been designed recently that work well under scarce network bandwidth. For example, the ITU codecs CELP-G.729 and G.723.1 have been designed for transmitting audio data at bit rates ranging from 8 Kbps down to 5.3 Kbps.

However, available network bandwidth is not the only requirement to meet for quality audio. Indeed, each step in the audio data flow pipeline mentioned above (from coding to transport, to reception, to decoding) adds delay to the overall transmission. While some delays are relatively fixed (such as coding and decoding), others depend on network conditions. The problem with today's Internet is that routers operate on a FIFO basis and statistically multiplex traffic from different sources. The impact of this behavior on real-time traffic is to introduce high delay variation (known as jitter) into the inter-packet timing relationship, and even high packet loss rates. For example, very high average packet delays and packet delay variations, on the order of 500-1,000 msec, may be experienced over many congested Internet links. As a consequence, packet loss percentages, due to the actual loss and damage of packets as well as to belated arrivals, may vary between 15% and 40%. User studies indicate that telephony users find round-trip delays greater than about 300 msec more reminiscent of a half-duplex connection than of a real-time conversation. However, user tolerance of delay may vary significantly from application to application: the most critical users may require delays of less than 200 msec, but more tolerant users may be satisfied with delays of 300-500 msec. Finally, many experimental tests have revealed that random independent packet loss rates of up to 10% have little impact on speech recognition (Kostas et al., 1998).

In the absence of network support providing quality guarantees to users of Internet voice software, an interesting approach to cope with the problems caused by jitter and high packet loss rates is to use control mechanisms. Such control mechanisms adapt the behavior of the audio application to the network conditions so as to minimize the impact of delay jitter and packet loss.
Thus, to transport audio over a non-guaranteed, packet-switched network like the Internet, all the existing Internet audio tools, such as NeVoT (Schulzrinne, 1992), vat (Jacobson and McCanne, 2001), rat (Hardman et al., 1998) and FreePhone (Bolot and Vega Garcia, 1996), typically operate by using datagram-based connections (e.g., IP/UDP/RTP), as described in the following. Audio samples are encoded (with some form of compression), inserted into packets that carry creation timestamps and sequence numbers, transported by the network, received in a playout buffer, decoded in sequential order and finally played out by the audio device, as seen in Figure 1. A symmetric scheme is used in the other direction for interactive conversation.

A typical packetized audio segment may be considered as consisting of short bursts of energy (called talkspurts), during which the audio activity is carried out, separated by silent periods (during which no audio packet is generated). During a packet audio connection over the Internet, in order for the receiving site to reconstruct the audio conversation, the audio packets constituting a talkspurt must be played out in the order they were emitted at the sending site. Clearly, jitter has to be removed from audio packet streams, since it renders the speech unintelligible. Hence, in order to compensate for variable network delays, all the existing Internet audio tools use a voice reconstruction buffer at the receiver to add artificial delay to the audio stream and smooth out the jitter. Received audio packets are first queued into the buffer, and then the periodic playout of each packet is delayed for some quantity of time beyond the reception of the first packet, as seen in Figure 2. This mechanism must be adaptive, since jitter on the Internet may vary significantly with time. In this way, dynamic playout buffers can hide packet delay variance at the receiver, at the cost of additional delay. Needless to say, a critical trade-off exists between the amount of delay that is introduced in the buffer and the percentage of late packets that are not received in time for playout (and are consequently considered lost). The longer the additional delay, the more likely it is that a packet will arrive before its scheduled playout time. But if, on the one side, too large a percentage of audio packet loss may impair the intelligibility of an audio transmission, on the other side, too large a playout delay may disrupt the interactivity of an audio conversation (Kostas et al., 1998; Panzieri and Roccetti, 1997).
Figure 1: Audio data flow over the Internet (audio signal → encoder → packetizer at the sender; the Internet; dynamic playout buffer → decoder → audio device driver at the receiver)
Figure 2: Smoothing out jitter delay at the receiver (packet generation times vs. buffered playout times over time; the playout buffer absorbs the network delay variance)
Hence, playout control mechanisms adaptively adjust the playout delay in order to keep this additional buffering delay as small as possible, while minimizing the number of packets delayed past the point at which they are scheduled to be played out. In the remainder of this section, we present three different playout delay control mechanisms that have been designed to dynamically adjust the audio packet playout delay (and the receiver buffer depth) of packet audio applications in order to compensate for the highly variable Internet packet delays. All three mechanisms keep the playout delay constant throughout a given talkspurt, but permit different playout delays in different talkspurts. For the sake of brevity, only the relevant features of these three nontrivial mechanisms are reviewed.
Speech Transmission over the Internet: Mechanism #1

In this section we briefly describe the adaptive playout delay adjustment algorithm proposed in Ramjee et al. (1994), on which several Internet audio tools, such as NeVoT (Schulzrinne, 1992), rat (Hardman et al., 1998) and FreePhone (Bolot and Vega Garcia, 1996), are based. As mentioned above, a receiving site in an audio application buffers packets and delays their playout time. Such a playout delay is usually adaptively adjusted from one talkspurt to the next. In order to implement this playout control policy, the playout adjustment mechanism proposed in Ramjee et al. (1994), denoted in the following as mechanism #1, makes use of the two following important assumptions: 1) an external mechanism exists that keeps the two system clocks at the sending and the receiving sites synchronized (for example, the Internet-based Network Time Protocol, NTP); 2) the delays experienced by audio packets on the network follow a Gaussian distribution. Based on these assumptions, the playout control mechanism #1 works by calculating the playout time pi for the first packet i of a given talkspurt as:
pi = ti + di + k * vi,

where ti is the time at which audio packet i is generated at the sending site, di is the current average delay estimate (i.e., the average value of the time interval between the generation of previous audio packets at the sender and the time instants at which those packets have been played out at the receiver), k ∈ {1, 2, 4} is a variation coefficient (whose effect can be enforced through shift operations) providing some slack in the playout delay for arriving packets (the larger the coefficient, the more packets are played out, at the expense of longer playout delays), and finally vi is the average delay variation estimate. From an intuitive standpoint, the reported formula sets the playout time far enough beyond the average delay estimate that only a small fraction of the arriving packets should be lost due to late arrivals. The playout point for any subsequent packet j of that talkspurt is computed as an offset from the point in time when the first packet i in the talkspurt was played out:

pj = pi + tj - ti.

The estimates of both the average delay and the average delay variation are computed using the well-known stochastic gradient algorithm (Ramjee et al., 1994):

di = α * di-1 + (1 - α) * ni,
vi = α * vi-1 + (1 - α) * |di - ni|,

where the constant α (usually set to 0.998) is a weight that characterizes the memory properties of the estimation, while ni is the total delay introduced by the network (i.e., the difference between the time at which packet i is received at the receiving site and ti). In essence, the playout adjustment mechanism #1 is a linear filter that has proven to be slow in catching up with changes in delay, but quite good at maintaining a steady-state value when the control value (1 - α) is set to be very low. Hence, in order to circumvent its inability to adapt to very large changes in transmission delay, the playout adjustment mechanism #1 has subsequently been equipped with a delay spike detection and management mechanism. Several studies, in fact, have indicated the presence of spikes in end-to-end Internet delays. A spike is a sudden, large increase in the end-to-end network delay, followed by a series of audio packets arriving almost simultaneously. In cases where a delay spike spans multiple talkspurts, it is important to react to it quickly. The method proposed to react to spikes is reported in Table 2 and works as follows (Ramjee et al., 1994). Two different modes of operation are used, depending on whether a transmission delay spike has been detected or not (the spike-detected mode and the normal mode, respectively). For every packet that arrives at the receiver, the mechanism computes the end-to-end network delay ni, then checks the current mode (part 2 of the algorithm in Table 2) and, if necessary, switches it. In particular, if the delay variation between two consecutive packets is larger than some multiple of the current average delay variation, then the algorithm switches to the spike-detected mode. The end of a spike is detected in a subtler way. The observation exploited to detect the end of a spike is based on experimental measurements of the spike phenomenon, and is as follows. Typically, at the end of a spike, a series of packets arrive one after another, almost simultaneously, at the receiver.
The reason is that, as the packets of a talkspurt are transmitted at regular intervals by the sender, at the end of a spike the last belated packets of the spike arrive together with subsequent packets that are experiencing progressively smaller delays. To take into
account this phenomenon, the variable var is used in the algorithm with an exponentially decaying value that adjusts to the slope of the spike. When this variable reaches a small value, revealing that there is no longer a significant slope, the normal mode of the algorithm is resumed. During the spike-detected mode, the mechanism uses the delay of the first packet of the talkspurt as the estimated playout delay for each packet in the talkspurt. In the normal mode, instead, the mechanism calculates the playout delay as described at the beginning of this section (part 3 of the algorithm in Table 2).
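Putting the pieces together, the following C fragment is one possible rendering of mechanism #1's estimation logic, including the spike handling of Table 2 below. The struct, the function names and the packaging are ours; the constants (α = 0.998 and the spike thresholds 800 and 63) come from the description above, so this should be read as a sketch of the published algorithm rather than as the authors' actual implementation.

#include <math.h>

enum est_mode { NORMAL, IMPULSE };

struct playout_state {
    double d;            /* average network delay estimate (di)     */
    double v;            /* average delay variation estimate (vi)   */
    double n1, n2;       /* previous network delays n(i-1), n(i-2)  */
    double var;          /* decaying slope estimate during a spike  */
    enum est_mode mode;
};

#define ALPHA 0.998

/* Update the estimates with the network delay ni of a newly received
   packet (ni = receiver timestamp - sender timestamp). */
void on_packet(struct playout_state *s, double ni)
{
    if (s->mode == NORMAL) {
        if (fabs(ni - s->n1) > 2.0 * fabs(s->v) + 800.0) {
            s->var = 0.0;                 /* beginning of a spike */
            s->mode = IMPULSE;
        }
    } else {
        s->var = s->var / 2.0 + fabs(2.0 * ni - s->n1 - s->n2) / 8.0;
        if (s->var <= 63.0) {             /* slope has died out   */
            s->mode = NORMAL;
            s->n2 = s->n1;
            s->n1 = ni;
            return;                       /* keep pre-spike estimates */
        }
    }
    if (s->mode == NORMAL)
        s->d = ALPHA * s->d + (1.0 - ALPHA) * ni;
    else
        s->d = s->d + (ni - s->n1);       /* follow the spike */
    s->v = ALPHA * s->v + (1.0 - ALPHA) * fabs(ni - s->d);
    s->n2 = s->n1;
    s->n1 = ni;
}

/* Playout time for the first packet of a talkspurt generated at sender
   time ti; k in {1, 2, 4} trades longer delays against late losses. */
double first_playout_time(const struct playout_state *s, double ti, int k)
{
    return ti + s->d + k * s->v;
}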
Speech Transmission over the Internet: Mechanism #2

Another adaptive on-line delay adjustment algorithm, based on the same two assumptions as mechanism #1, has been presented in Moon et al. (1998).
Table 2: The pseudo-code of the Internet audio mechanism #1 with spike detection, reported from Ramjee et al. (1994)
(1)  ni = Receiver_Timestamp – Sender_Timestamp;
(2)  IF (mode == NORMAL)
        IF (|ni – ni-1| > 2 · |v| + 800)
           var = 0;                      /* detected beginning of a spike */
           mode = IMPULSE;
     ELSE
        var = var/2 + |2 · ni – ni-1 – ni-2| / 8;
        IF (var <= 63)
           mode = NORMAL;                /* end of spike */
           ni-2 = ni-1;  ni-1 = ni;
           return;
(3)  IF (mode == NORMAL)
        di = α · di-1 + (1 – α) · ni;
     ELSE
        di = di-1 + (ni – ni-1);
     vi = α · vi-1 + (1 – α) · |ni – di|;
(4)  ni-2 = ni-1;  ni-1 = ni;
     return;
This mechanism, which we denote as mechanism #2, is based on the history experienced during the transmission of the audio samples in a given audio connection. The algorithm tracks the network delays of received audio packets and efficiently maintains delay percentile information. That information, together with an appropriate delay spike detection algorithm, is used to dynamically adjust talkspurt playout delays. In essence, the main idea behind the algorithm is to collect statistics on packets already arrived and then to use them to calculate the playout delay. Instead of using some variation of the stochastic gradient algorithm to estimate the playout delay, each packet delay is recorded and the distribution of packet delays is updated with each new arrival. When a new talkspurt starts, the algorithm proposed in Moon et al. (1998) calculates a given percentile point (say q) over the last w arrived packets, and uses it as the playout delay for the new talkspurt.

In addition, as in the Internet audio mechanism #1, this algorithm detects and accommodates delay spikes, in the following manner. Upon the detection of a delay spike, the algorithm stops collecting packet delays and follows the spike (until the detection of the spike end) by using as playout delay the delay experienced by the packet that commenced the spike. Upon detecting the end of the delay spike, the algorithm resumes its normal operation mode. Internet audio mechanism #2 is fully detailed in Moon et al. (1998) but, for the sake of completeness, we report it here in Table 3. As already mentioned, this mechanism also operates in two modes. For each packet arriving at the destination, the current operation mode is checked, and one mode is switched into the other, if necessary (lines 1-7). When in normal mode, the delay distribution is updated according to the statements in lines 9-22. The beginning of a spike is detected when a packet arrives with a delay (di) larger than some multiple of the current playout delay (pk). In such a case, the algorithm switches to the spike-detected mode. The end of a spike is detected, instead, when the most recently received packet (during a spike) has a delay (di) that is less than some multiple of the playout delay in effect before the current spike (old_d). In this case, the operation mode is set back to normal. In the algorithm reported in Table 3, the two parameters head and tail are used (in lines 5 and 2) for detecting the beginning and the end of a spike. To conclude this section, it is worth mentioning that Moon et al. (1998) have experimentally shown that their algorithm outperforms other existing delay adjustment algorithms over a number of measured audio delay traces, and performs close to a theoretical optimum over a range of parameters of interest.
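To make the bookkeeping of Table 3 concrete, the following C fragment sketches the normal-mode percentile tracking only (the spike handling is omitted). The window size, the delay granularity and the initialization are our assumptions, and the order of the histogram updates during the downward scan follows our reading of the pseudo-code.

#include <string.h>

#define W    10000             /* window size w (illustrative)            */
#define MAXD 4000              /* delays assumed smaller than MAXD units  */

static int    delays[W];       /* circular window of the last W delays    */
static int    distr_fnc[MAXD]; /* histogram of the delays in the window   */
static int    curr_pos;        /* next window slot to overwrite           */
static int    count;           /* number of delays <= curr_delay          */
static int    curr_delay;      /* current q-th percentile (in units)      */
static double q = 0.97;        /* target percentile (illustrative)        */

void init_window(void)
{
    memset(delays, 0, sizeof delays);
    memset(distr_fnc, 0, sizeof distr_fnc);
    distr_fnc[0] = W;          /* the window starts as W delays of 0      */
    count = W;
    curr_delay = 0;
    curr_pos = 0;
}

/* Record the delay di (in units, 0 <= di < MAXD) of a packet received in
   normal mode; curr_delay then tracks the q-th percentile of the last W
   delays and is used as playout delay at the start of each talkspurt. */
void record_delay(int di)
{
    int old = delays[curr_pos];
    if (old <= curr_delay)     /* the evicted delay leaves the window     */
        count -= 1;
    distr_fnc[old] -= 1;

    delays[curr_pos] = di;     /* the new delay enters the window         */
    curr_pos = (curr_pos + 1) % W;
    distr_fnc[di] += 1;
    if (di <= curr_delay)
        count += 1;

    while (count < W * q)      /* percentile moved up: scan upwards       */
        count += distr_fnc[++curr_delay];
    while (count > W * q)      /* percentile moved down: scan downwards   */
        count -= distr_fnc[curr_delay--];
}

Note that the two scans are approximate: a call may overshoot by one histogram bin, which the opposite scan of a later call corrects, as in the published pseudo-code.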
Speech Transmission over the Internet: Mechanism #3

Recently, a new mechanism (denoted as mechanism #3 in the remainder of the chapter) has been designed to dynamically adapt the talkspurt playout delays to the network traffic conditions (Roccetti et al., 2001a). It assumes neither the existence of an external mechanism for maintaining an accurate clock synchronization between the sender and the receiver, nor a specific distribution of the end-to-end transmission delays experienced by the audio packets. This mechanism is at the basis of an Internet audio tool, called BoAT, developed at the University of Bologna (Roccetti et al., 1999). Succinctly, the technique for dynamically adjusting the talkspurt playout delay is based on obtaining, at periodic intervals, an estimate of the upper bound for the packet transmission delays experienced during an audio communication. Such an upper bound is periodically computed using round-trip time values obtained from the packet exchanges of a three-way handshake protocol performed between the sender and the receiver of the audio communication. At the end of such a protocol, the receiver is provided with the sender's estimate of an upper bound for the
Table 3: The pseudo-code of the Internet audio mechanism #2, reported from Moon et al. (1998)
(1)   IF (mode == SPIKE)
(2)      IF (di <= tail · old_d)        /* the end of a spike */
(3)         mode = NORMAL;
(4)   ELSE
(5)      IF (di > head · pk)            /* the beginning of a spike */
(6)         mode = SPIKE;
(7)         old_d = pk;                 /* save pk to detect the end of the spike later */
(8)      ELSE
(9)         IF (delays[curr_pos] <= curr_delay)
(10)           count -= 1;
(11)        distr_fnc[delays[curr_pos]] -= 1;
(12)        delays[curr_pos] = di;
(13)        curr_pos = (curr_pos + 1) % w;
(14)        distr_fnc[di] += 1;
(15)        IF (delays[curr_pos - 1] <= curr_delay)
(16)           count += 1;
(17)        WHILE (count < w · q)
(18)           curr_delay += unit;
(19)           count += distr_fnc[curr_delay];
(20)        WHILE (count > w · q)
(21)           count -= distr_fnc[curr_delay];
(22)           curr_delay -= unit;
transmission delay, which can be used to dynamically adjust the talkspurt playout delay. The proposed mechanism guarantees that the talkspurt playout delay may be dynamically set from one talkspurt to the next, provided that intervening silent periods of sufficiently long duration are exploited for the adjustment.

The idea at the basis of mechanism #3 is the following. When the sender transmits the first packet of an audio talkspurt, it timestamps that packet with the value (say C) of the reading of its own clock. When this first packet arrives at the receiver, the receiver also sets its clock to C and immediately schedules the presentation of that first packet. Subsequent audio packets belonging to the same talkspurt are also timestamped at the sender with the reading of the sender's clock at the instant when the packet is transmitted. When those subsequent packets arrive at the receiving site, their
attached timestamp is compared with the value of the reading of the receiver's clock. If the timestamp attached to the packet is equal to the value of the receiver's clock, the packet is immediately played out. If the timestamp attached to the packet is greater than the value of the receiver's clock, the packet is buffered and its playout is scheduled after a time interval equal to the (positive) difference between the value of the timestamp and the value of the receiver's clock. Finally, if the timestamp attached to the packet is less than the value of the receiver's clock, the packet is simply discarded, since it is too late for presentation. However, due to the fluctuating delays of real transmissions, the values of the clocks of the sender and the receiver at a given time instant may differ by a quantity ∆:

TS - TR = ∆,

where TS and TR are, respectively, the readings of the local clocks at the sender and at the receiver, and ∆ is a nonnegative quantity ranging between 0, a theoretical lower bound, and ∆max, a theoretical upper bound on the transmission delays introduced by the network between the sender and the receiver.

Hence, a crucial issue of the mechanism is an accurate dimensioning of the playout buffer: both buffer underflow and overflow may result in discontinuities in the playout process. The worst-case scenario for buffer underflow (corresponding to the case when packets arrive too late for presentation) is clearly when the first packet arrives after a minimum delay (e.g., 0) while a subsequent packet arrives with maximum delay (e.g., ∆max). It is possible to show that in this case the subsequent packet (transmitted by the sender when its clock shows a time value equal to X, and consequently timestamped with the value X) may arrive at the receiver too late for playout, precisely when the receiver's clock shows the value X + ∆max. This consideration suggests a practical and safe method for preventing buffer underflow: the receiver delays the setting of its local clock by an additional quantity equal to ∆max when the first packet of the talkspurt is received. With this simple modification, the policy guarantees that all the audio packets of the talkspurt that suffer a transmission delay not greater than ∆max will be on time for playout.

However, the above-mentioned technique introduces another problem: that of playout buffer overflow. The worst-case scenario for buffer overflow occurs in the following circumstance: the first packet of a talkspurt suffers the maximum delay (i.e., ∆max), and a subsequent audio packet experiences minimum delay (e.g., 0). It is possible to show that in such a case the subsequent audio packet (transmitted at time X) may arrive at the receiving site when the receiver's clock is equal to X - 2 * ∆max. In conclusion, this example dictates that the playout buffer dimension may never be less than the maximum number of bytes that may arrive in an interval of 2 * ∆max.

In mechanism #3, a technique has been devised to estimate an upper bound for the maximum transmission delay. This technique exploits the so-called Round Trip Time (RTT) and is based on a three-way handshake protocol. It works as follows. Prior to the beginning of the first talkspurt in an audio conversation, a probe packet is sent from the sender to the receiver, timestamped with the clock value of the sender (say C).
Upon the reception of this probe packet, the receiver sets its own clock to C and immediately sends back to the sender a response packet carrying the same timestamp C. Upon receiving this response packet, the sender computes the value of the RTT by subtracting the value of the timestamp C from the current value of its local clock. At that moment, the difference between the two clocks is equal to an unknown quantity (say t0), which may range from a theoretical lower bound of 0 to a theoretical upper bound of RTT. Unfortunately, t0 is unknown, and a rough approximation of this value might result in
both playout buffer underflow problems and packet loss due to late arrivals. Based on these considerations, the sender, after having received the response packet from the receiver and calculated the RTT value, sends to the receiver a final installation packet carrying the previously calculated RTT value. Upon receiving this installation packet, the receiver sets its local clock by subtracting the transmitted RTT value from the value shown by its clock at the arrival of the installation packet. Hence, at that precise moment, the difference between the two clocks at the sender and at the receiver is equal to a value ∆ given by:

∆ = TS - TR = t0 + RTT,

where ∆ ranges in the interval [RTT, 2 * RTT], depending on the unknown value of t0, which, in turn, may range in the interval [0, RTT].

On the basis of the time difference imposed by the above-mentioned protocol between the two system clocks at the sender and at the receiver, the following audio packet playout/buffering strategy may be performed at the receiver's site. When an audio packet (emitted by the sender) arrives at the receiver (i.e., it is delivered to the application level of the receiving host), its generation timestamp t is compared with the value T of the receiver's clock, and a decision is taken according to the rules depicted in Table 4. If t < T, the packet is discarded, having arrived too late (w.r.t. its playout time) to be buffered (first row in Table 4). If t > T + ∆, the packet is discarded, having arrived too far in advance of its playout time to be buffered (second row in Table 4). Instead, if T < t ≤ T + ∆, the packet has arrived in time to be played out, and it is placed in the first empty location of the playout buffer at the receiver's site (third row in Table 4).

Using the same rate r adopted for the sampling of the original audio signal at the sender, the playout process at the receiver's site fetches audio packets from the buffer and sends them to the audio device for playout, as discussed in the following. When the receiver's playout clock shows a value equal to T, the playout process searches the buffer for the audio packet with timestamp T. If such a packet is found, it is fetched from the buffer and sent to the audio device for immediate playout, while the buffer location where the audio packet was found is marked as empty (fourth row in Table 4). If the packet timestamped with T is not present in the buffer, then the playout process replaces the corresponding audio sample with an equally long silent period. In essence, a maximum transmission delay equal to ∆ is left to the audio packets to arrive at the receiver in time for playout, and consequently a playout buffering space proportional to ∆ is required for packets with early arrivals. Based on the aforementioned playout/buffering policy, it is easy to show that both buffer overflow and underflow are always prevented, provided that the transmission delay experienced by the audio packets from the sender to the receiver never exceeds the maximum transmission delay value ∆.

Table 4: Mechanism #3: buffering/playout policy at the receiver
Condition          Policy
t < T              packet discarded (late arrival)
t > T + ∆          packet discarded (premature arrival)
T < t ≤ T + ∆      packet buffered (waiting for playout)
t = T              packet sent to audio device for playout
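The decision rules of Table 4 translate directly into code. The following C fragment is our rendering (the enum and the names are ours), where t is the packet's generation timestamp, T the current reading of the receiver's clock and delta the clock difference installed by the synchronization protocol:

enum action {
    DISCARD_LATE,        /* t < T: too late for its playout time     */
    DISCARD_PREMATURE,   /* t > T + delta: too early to be buffered  */
    PLAY_NOW,            /* t == T: send to the audio device         */
    BUFFER               /* T < t <= T + delta: wait for playout     */
};

enum action classify_packet(long t, long T, long delta)
{
    if (t < T)
        return DISCARD_LATE;
    if (t > T + delta)
        return DISCARD_PREMATURE;
    if (t == T)
        return PLAY_NOW;
    return BUFFER;
}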
A symmetric scheme may be used in the other direction to perform the playout of the audio packets that flow from the receiver to the sender. In order for the proposed policy to adapt to the highly fluctuating end-to-end delays experienced over wide-area, packet-switched networks (like the Internet), the above-mentioned synchronization technique is first carried out prior to the beginning of the first talkspurt of the audio conversation, and then periodically repeated throughout the entire conversation. Hence, each time a new value of the RTT is computed by the sender, it may be used by the receiver for adaptively setting the value of its local clock and the playout buffer dimension. This method guarantees that both the additional playout delay introduced and the buffer dimension are always proportioned to the traffic conditions.

However, it may not be possible to replace, on the fly, the current value of the receiver's clock and the dimension of its playout buffer during a talkspurt. In fact, such a per-packet adaptive adjustment of the synchronization parameters might introduce either gaps inside the talkspurt or even timing collisions among audio packets (Roccetti et al., 2001a). Consequently, the installation of the new synchronization values at the receiver is carried out only during periods of audio inactivity, when the sender generates no audio packets. A formal theorem (with a corresponding algorithm) is provided in Roccetti et al. (2001a) showing that the installation of a synchronization may be conducted during a silent period (detected by the sender), without introducing either gaps or packet collisions inside the talkspurts of the audio conversation, only if the silent period is equal to or greater than a given known value. In simple words, the computed variation of the playout delay from one talkspurt to the next is accommodated at the receiver by artificially elongating or reducing the silent periods of the human conversation. In particular, an improvement of the transmission delay experienced by audio packets (equal to δ milliseconds) is accommodated by the playout delay control mechanism by installing an up-to-date playout delay value at the receiver that causes an artificial contraction of the silent period chosen for the installation. Such a reduction of the silent period perceived by the receiver (w.r.t. the original duration of the corresponding silent period generated by the sender) amounts to exactly δ milliseconds. Conversely, a deterioration of the transmission delay experienced by audio packets (equal to δ milliseconds) causes the receiver to perceive a silent period whose original duration is elongated by δ milliseconds.

Another problem with the mentioned policy is related to the possibly high value of the obtained RTT, which may be caused by the fact that either the probe packet or the response packet (during the first two phases of the synchronization operation) has suffered a very high delay spike. As a consequence, a very high value for the playout delay would be introduced, thus impairing the interactivity of the audio conversation. This problem has been solved in Roccetti et al. (2001a) by adopting a policy inspired by the delay spike detection and management mechanism proposed in Moon et al. (1998).
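To make the clock arithmetic of the three-way handshake concrete, the following toy C program simulates its three phases under fixed, assumed one-way delays (the values d1, d2 and d3, like all the names, are our illustrations). It verifies that, whatever the installation delay d3, the final clock difference equals t0 + RTT and therefore lies in [RTT, 2 * RTT]:

#include <stdio.h>

int main(void)
{
    long d1 = 80, d2 = 120, d3 = 50; /* assumed one-way delays (msec)   */

    long S = 100000;                 /* sender's clock at probe time    */

    /* Phase 1: the probe carries C = S; the receiver adopts C.         */
    long C = S;
    S += d1;                         /* sender clock while probe flies  */
    long R = C;                      /* receiver clock := C             */

    /* Phase 2: the response (timestamp C) flies back; the sender
       derives the RTT from its own clock.                              */
    S += d2;
    R += d2;
    long RTT = S - C;                /* = d1 + d2                       */

    /* Phase 3: the installation packet carries the RTT; on its arrival
       the receiver steps its clock back by RTT.                        */
    S += d3;
    R += d3;
    R -= RTT;

    /* The residual difference is t0 + RTT, with t0 = d1 unknown to the
       protocol, hence bounded by [RTT, 2 * RTT].                       */
    printf("RTT = %ld, delta = %ld, bounds = [%ld, %ld]\n",
           RTT, S - R, RTT, 2 * RTT);
    return 0;
}

With the values above, the program prints delta = 280 for RTT = 200, consistent with the [RTT, 2 * RTT] interval.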
PERFORMANCE RESULTS

The need for silent intervals that allow a playout delay control mechanism to adjust to fluctuating network conditions is common to all three Internet audio mechanisms described in the previous section. This makes all the described control schemes particularly suitable for voice-based conversational audio with intervening silent periods between subsequent talkspurts. Hence, in order to assess the efficacy of those mechanisms, an
accurate model of the talkspurt/silence characteristics of conversational speech is necessary. In particular, an accurate modeling of the voice activity characteristics of conversational speech is mandatory for understanding whether sufficient (and sufficiently long) silent intervals occur in typical human conversations to permit the periodic activity of dynamically setting the playout delay from one talkspurt to the next. To this aim, in Roccetti et al. (2001b) an eight-state Brady model of conversational speech has been exploited in order to assess, respectively, the overall quantity, the frequency and the duration of silent intervals within human conversational speech. The motivations behind the choice of this particular model (succinctly represented in Figure 3) are its accuracy and its ease of implementation for carrying out simulations and analyses concerning the main on-off characteristics of human conversations. It is also worth mentioning that this particular model of human conversational speech was originally introduced in Stern et al. (1996) in order to carry out performance modeling and analysis of new-generation wireless communication systems. Figure 3 is divided into quadrants, with each quadrant representing a different state for parties A and B engaged in a conversation. In particular, this model has been shown to be able to reproduce the three following different types of silences occurring in human speech: i) listening pauses, which occur when a party is silent and listening to the other party; ii) long speaking pauses, which occur between phrases or sentences while a party is speaking; and iii) short speaking pauses, which occur between words or syllables while a party is speaking.

The first question to be answered in order to assess the feasibility of the Internet playout delay control mechanisms presented in the previous section is whether a sufficient total amount of silent intervals occurs in human conversational speech. A positive answer to this question has come from the studies developed in Roccetti et al. (2001b). Based on the use of the Brady model, it has been calculated that the total quantity of silent intervals within a simulated two-party, one-hour-long packetized conversation amounts to about 63-66%, depending on the voice packetization interval, which is typically chosen in the range of 10-30 milliseconds. This result is summarized in Figure 4, where the total number of silent intervals (with their relative durations) is shown, as obtained in a simulated one-hour-long, two-party conversation.
Figure 3: Modified 8-state Brady model for the on-off characteristics of conversational speech (states: A talks/B silent, with a short-silence-gap state while A talks; double talk with A interrupted; double talk with B interrupted; mutual silence with A having spoken last; mutual silence with B having spoken last; B talks/A silent, with a short-silence-gap state while B talks; transitions labeled with rates α)
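While the full eight-state model of Figure 3 is defined by the transition rates given in Stern et al. (1996), a much simpler two-state sketch already conveys the kind of simulation involved. The following C program is our simplification (not the model actually used in Roccetti et al., 2001b): it alternates exponentially distributed talkspurts and silences whose mean durations are set to the 10-msec packetization averages reported in the text.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Draw an exponentially distributed duration with the given mean. */
static double exp_sample(double mean_msec)
{
    double u = (rand() + 1.0) / ((double) RAND_MAX + 2.0);  /* u in (0,1) */
    return -mean_msec * log(u);
}

int main(void)
{
    /* Mean durations taken from the averages reported in the text for a
       10-msec packetization interval (talkspurts: 244 msec, silences:
       465 msec); these are assumptions of this simplified sketch. */
    const double MEAN_TALK = 244.0;
    const double MEAN_SIL  = 465.0;

    double t = 0.0, talk = 0.0;
    long silences = 0;

    while (t < 3600.0 * 1000.0) {       /* one simulated hour (msec) */
        double burst = exp_sample(MEAN_TALK);
        double gap   = exp_sample(MEAN_SIL);
        talk += burst;
        silences += 1;
        t += burst + gap;
    }
    printf("silent intervals: %ld, silence: %.0f%%\n",
           silences, 100.0 * (t - talk) / t);
    return 0;
}

Even this crude sketch lands close to the reported figures: roughly 5,000 silent intervals and about 65% silence over a simulated hour.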
As seen from the figure, the smaller the packet size (i.e., 10 milliseconds), the larger the number of silent intervals (i.e., 5,075). Since another important parameter that influences the efficacy of the Internet playout delay control mechanisms is the frequency of the intervening silent periods, a further question has to be addressed, concerning the frequency of those silent intervals within a human conversation. Needless to say, the more frequent those silent intervals are, the more likely it is that the Internet control mechanisms will be successful in dynamically adjusting the playout delay. The Brady model has been used in Roccetti et al. (2001b) to understand how many different talkspurts (and consequently silent intervals) are to be expected in a simulated two-party, one-hour-long packetized conversation. From this simulative experiment, the following important result has been observed: the smaller the chosen voice packet size, the larger the total number of silence periods and the shorter the average talkspurt length (down to about 244 milliseconds). Conversely, the larger the packet size, the larger the average duration of a talkspurt (451 milliseconds). The main results concerning the quantity and the duration of the talkspurts are depicted in Figures 5 and 6, where, respectively, the total quantity of packetized talkspurts (with duration smaller than a fixed amount) and the percentage of talkspurts (with length smaller than a fixed amount) are reported.

As the duration of the intervening silent periods in human speech may be artificially reduced or elongated to accommodate, at the receiver, changes of the audio packet transmission delays, another important question to be answered concerns the average length of the intervening silent periods in a human conversation. In Roccetti et al. (2001b), the average length of the silent periods obtained in a simulated two-party, one-hour-long conversation was measured as ranging in the interval 465-770 milliseconds, again depending on the packetization interval. In particular, the larger the packet size (i.e., 30 milliseconds), the larger the average silence duration (i.e., 770 milliseconds). The main results regarding the silent interval duration are summarized in Figure 7, where the percentage of silent periods (of length larger than a fixed amount) out of the total quantity of all the obtained silent periods is reported.

Finally, based on the positive results obtained with the use of the Brady model, a working prototype implementation of the Internet playout control mechanism #3 has been developed.

Figure 4: Total amount of silent periods (number of silences of length greater than or equal to x, vs. silence length in msec, for packet sizes of 10, 20 and 30 msec)
Figure 5: Total amount of talkspurts (number of talkspurts of length smaller than or equal to x, vs. talkspurt length in msec, for packet sizes of 10, 20 and 30 msec)
Figure 6: Percentage of talkspurts (with duration smaller than x) w.r.t. the total number of talkspurts, for packet sizes of 10, 20 and 30 msec
The prototype was implemented using the C programming language and the development environment provided by both the Linux and the BSD UNIX operating systems (Roccetti et al., 2001a, 2001b). The UNIX socket interface and the datagram-based UDP protocol were used to transmit the audio packets. All the audio packets had a length of 30-40 msec and were produced by sampling speech at 8 kHz and encoding it with the ITU-T G.729 standard, which provides coding of speech at a bit rate of 8 kbit/s while maintaining a satisfactory audio quality. The performance of the prototype implementation was evaluated using an IP-based internetworked infrastructure connecting a Pentium-based PC situated at the Laboratory of Computer Science of Cesena (a remote site of the University of Bologna) and a SUN SPARCstation 5 situated at the CERN Institute in Geneva. The route between Cesena and Geneva was typically quite lossy (20% on average), had about 20 hops and had a bandwidth bottleneck (i.e., 512 Kbps) in the link interconnecting Cesena with Bologna. Many experiments were carried out in two different periods: September-October 1997 (Roccetti et al., 2001a, 2001b) and January-June 1999 (Roccetti, 2000). Each experiment consisted of transmitting a pre-recorded, 10-minute-long audio file from Cesena to Geneva.
Figure 7: Percentage of silent intervals (with duration larger than x) w.r.t. the total number of silent periods, for packet sizes of 10, 20 and 30 msec
In each experiment the following important parameters were measured: the average playout delay and the audio packet loss rate. An example of the performance results gathered in the experimental audio transmissions conducted during the period January-June 1999 is reported in Table 5. Another important experimental result concerns the total number of failed attempts of the playout delay control mechanism #3 at dynamically changing the playout delay value from one talkspurt to the next. Such unsuccessful attempts amount on average to only 8-10% of the total number of attempts, showing a good ability of mechanism #3 to adapt the playout delay value to the fluctuating network traffic conditions.

Finally, it is worth reporting an additional simulative experiment, fully detailed in Roccetti et al. (2001a, 2001b), that was set up in order to better assess the performance of the Internet audio mechanism #3. A software simulator was developed, using the C programming language, that reads in the transmission delay of each packet from a given real audio trace, detects whether the packet would have arrived before the playout time computed by the playout delay control mechanism, and executes the algorithm. The simulator is also able to calculate the average playout delay and the packet loss for each given trace. We ran the simulator on each audio delay trace obtained in the period September-October 1997, using the receiver buffer size as the control parameter to be varied to achieve different loss percentages (Moon et al., 1998). Using this simulation technique, the corresponding average playout delays were obtained as a function of the loss percentages. By running the simulator over all the experimental traces and then averaging the results, the plot of the playout delay shown in Figure 8 was obtained. In order to provide the reader with an understanding of the effect that various delay and loss rates (as well as buffer dynamics) have on the quality of the perceived audio, we have reported an approximate and intuitive representation of three different ranges for the quality of the perceived audio (Kostas et al., 1998). The three following audio quality ranges have been used: “good” for delays of less than 200/250 milliseconds and low loss rates, “potentially useful” for delays of about 300-350 milliseconds and higher loss rates, and finally “poor” for delays larger than 350 milliseconds and very high loss rates.
Table 5: Summarization of performance results: average playout delay (msec) and loss percentage

Trial   Start Time            Playout Delay   Packet Loss
1       01:00pm 10/1/1999     216             4%
2       02:00pm 20/1/1999     209             5%
3       08:00am 30/1/1999     177             8%
4       08:00am 2/2/1999      199             7%
5       02:00pm 20/2/1999     202             6%
6       01:00pm 10/3/1999     213             5%
7       07:00am 7/4/1999      182             7%
8       04:00pm 18/4/1999     179             8%
9       10:00am 15/5/1999     196             6%
10      07:00pm 22/6/1999     190             6%
Figure 8: Performance of playout delay control mechanism: average playout delay (msec) vs. packet loss (%), with the GOOD, POTENTIALLY USEFUL and POOR quality regions marked
As seen from the figure, and based on the consideration that audio of acceptable quality may be obtained only if low delays are achieved while the loss percentage does not exceed 10%, we can conclude that mechanism #3 exhibits very good performance.
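A minimal, stand-alone version of such a trace-driven simulator is sketched below. It simplifies the real tool by assuming a fixed candidate playout delay per run (swept externally to produce curves like that of Figure 8) and a trace format of one network delay (in msec) per line; both assumptions, and all the names, are ours.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s trace_file playout_delay_msec\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    double deadline = atof(argv[2]);    /* candidate playout delay   */
    long total = 0, late = 0;
    double n;                           /* per-packet network delay  */

    while (fscanf(f, "%lf", &n) == 1) {
        total += 1;
        if (n > deadline)               /* misses its playout point  */
            late += 1;
    }
    fclose(f);

    if (total > 0)
        printf("playout delay %.0f msec -> loss %.2f%% (%ld/%ld)\n",
               deadline, 100.0 * late / total, late, total);
    return 0;
}

For example, invoking the compiled program as "playout_sim trace.txt 200" would report the percentage of packets in trace.txt that would have missed a 200-msec playout deadline.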
CONCLUSION

Delay adaptation in the presence of fluctuating network delays is a crucial issue in determining the audio quality of real-time speech transmission over the Internet. With this in view, in this chapter we have focused on three different control algorithms that have been
proposed for adaptively adjusting the playout delay of audio packets in the face of varying network delays. An important characteristic common to all three mechanisms is that they try to dynamically adapt the audio application to the network conditions so as to balance the trade-off between packet playout delay and packet loss. The performance figures derived from several experiments have illustrated the adequacy of these mechanisms for human speech transmission across the Internet.
ACKNOWLEDGMENTS

The author is indebted to Microsoft Research Europe (Cambridge, UK) for supporting his work under a grant. The author also thanks the Italian MURST and CNR for their financial support.
REFERENCES

Bolot, J. and Vega Garcia, A. (1996). Control mechanisms for packet audio in the Internet. In Proceedings of IEEE SIGCOMM '96, San Francisco (CA).
Bolot, J., Crepin, H. and Vega Garcia, A. (1995). Analysis of audio packet loss on the Internet. In Proceedings of Network and Operating System Support for Digital Audio and Video, Durham (NC), 163-174.
Hardman, V., Sasse, M. A. and Kouvelas, I. (1998). Successful multiparty audio communication over the Internet. Communications of the ACM, 41(5), 74-80.
Jacobson, V. and McCanne, S. (2001). vat. Available on the World Wide Web at: ftp://ftp.ee.lbl.gov/conferencing/vat/. Accessed March 2001.
Kostas, T. J., Borella, M. S., Sidhu, I., Schuster, G. M., Grabiec, J. and Mahler, J. (1998). Real-time voice over packet-switched networks. IEEE Network, 12(1), 18-27.
Moon, S. B., Kurose, J. and Towsley, D. (1998). Packet audio playout delay adjustment: Performance bounds and algorithms. ACM Multimedia Systems, 6(1), 17-28.
Panzieri, F. and Roccetti, M. (1997). Synchronization support and group-membership services for reliable distributed multimedia applications. ACM Multimedia Systems, 5(1), 1-22.
Ramjee, R., Kurose, J., Towsley, D. and Schulzrinne, H. (1994). Adaptive playout mechanisms for packetized audio applications in wide-area networks. In Proceedings of IEEE INFOCOM '94, Montreal (Canada).
Roccetti, M., Ghini, V., Balzi, D. and Quieti, M. (1999). BoAT: The Bologna optimal Audio Tool. Available on the World Wide Web at: http://www.radiolab.csr.unibo.it/BoAT/src. Accessed March 2001.
Roccetti, M. (2000). Adaptive control mechanisms for packet audio over the Internet. In Proceedings of the SCS Euromedia Conference, Antwerp (Belgium), 151-155.
Roccetti, M., Ghini, V., Pau, G., Salomoni, P. and Bonfigli, M. E. (2001a). Design and experimental evaluation of an adaptive playout delay control mechanism for packetized audio for use over the Internet. To appear in Multimedia Tools and Applications. Kluwer Academic Publishers.
Roccetti, M., Ghini, V. and Pau, G. (2001b). Simulative and experimental analysis of an adaptive playout delay adjustment mechanism for packetized voice across the Internet. To appear in the International Journal of Modelling and Simulation. IASTED Press.
Schulzrinne, H. (1992). Voice communication across the Internet: A network voice terminal. Technical Report, Department of ECE and CS, University of Massachusetts, Amherst (MA).
Steinmetz, R. and Nahrstedt, K. (1999). Multimedia: Computing, Communications and Applications. NJ: Prentice Hall.
Stern, H. P., Mahmoud, S. A. and Wong, K. K. (1996). A comprehensive model for voice activity in conversational speech: Development and application to performance analysis of new-generation wireless communication systems. Wireless Networks, 2(4), 359-367.
Vega Garcia, A. (1996). Mécanismes de contrôle pour la transmission de l'audio sur l'Internet. Doctoral Thesis in Computer Science, University of Nice Sophia Antipolis, Ecole Doctorale SPI, France.
Watson, A. and Sasse, M. A. (1998). Measuring perceived quality of speech and video in multimedia conferencing applications. In Proceedings of ACM Multimedia '98, Bristol (UK).
Westwater, R. (1998). Digital audio presentation and compression. In Furht, B. (Ed.), Handbook of Multimedia Computing. CRC Press, 135-147.
Chapter XIV
Collaboration and Virtual Early Prototyping Using the Distributed Building Site Metaphor

Fabien Costantini and Christian Toinard
CEDRIC, France
Rapid Prototyping within a virtual environment offers new ways of working. In order to reduce the design time and define better specifications, concurrent engineering must be addressed at the early stage of the concept phase. This chapter defines the related concepts, namely Virtual Prototyping, Rapid Prototyping and Virtual Early Prototyping (VEP). VEP improves on Rapid Prototyping by providing important leverage on the problem of collaboration. The state of the art shows that the current solutions offer only a limited collaboration. Within the context of an extended team, these solutions do not address how to move easily from one style of working to another, nor do they define how to manage the rapid design of a complex product. Moreover, the different propositions suffer mainly from the client-server approach, which is inefficient in many ways and limits the openness of the system. We also explain that the Internet protocols are best suited to developing collaborative services within a VEP system. This state of the art enables us to explain why CORBA, MPEG-4 and multimedia protocols are not adapted to solving the problem of collaboration.

A case study is used to show that our solution enables an efficient collaboration. The chapter presents a global methodology enabling different styles of work, through which an extended team easily manages the concurrent design of a complex product. A way to start a project among a geographically dispersed team is proposed, making it possible to manage different design teams in a secure way over the Internet. Afterwards, the different teams hold a kick-off meeting to set up the initial proposal of the specification. Then, each team works on a system design in a distributed and collaborative way; thus, private works are easily merged and consolidated. Work reviews solve the interdependencies between the different
systems. At last, a project review enables the different proposals to be conciliated into a satisfactory solution. Our solution provides practical advantages, namely a design in context that avoids speculation by default, a method that breaks down complexity, a decrease of the design time and an increase of the quality. From the case study, a set of general requirements for a VEP tool is derived, and the major functional services are identified.

Afterwards, the chapter presents a new solution to satisfy the VEP requirements. It proposes new collaboration services that can be used to distribute a virtual scene between the designers. Our solution, called the Distributed Building Site Metaphor (DBSM), enables project management, meeting management, parallel working, disconnected work and meeting work, real-time validation, real-time modification, real-time conciliation, real-time awareness, easy motion between these styles of work, consistency, security and persistency. In contrast with the other solutions, our services enable parallel work while preserving consistency. These services neither require nor implement reliable multicasting. They are fully distributed and do not require any specific quality of service from the underlying network. DBSM can add collaboration to any stand-alone application.
INTRODUCTION

Motivation

Virtual Early Prototyping enables concurrent engineering at the early stage of a product design. Most of the time, an important part of a product's cost is committed during the design phase, but tools to reduce the time needed to design a product and to examine different design alternatives are missing. VEP addresses the concept phase of a product development, where most of the design choices are made. The concept phase is thus critical for the quality and the cost of the product. A VEP tool must allow different designers to quickly produce the best virtual prototype out of a large set of design alternatives; the workers must therefore be able to explore different design alternatives quickly and easily.

The concept design activities of a complex system can be performed by project-oriented teams that face important and tight deadlines. Concept design is not limited to a single enterprise: the members of a global design force can include partners, suppliers, contractors and customers. This global design force is organized into several design teams responsible for the different systems. The best specifications must be defined by these different teams through successive iterations, easily mixing individual works, alternatives for the different systems and conciliation milestones to reach a global specification. This extended team must collaborate freely and easily, without forcing a given enterprise to have more privileges and information than the others. At the same time, the design environment must provide the ability to respect the responsibilities and expertise of the different members.

Collaborative systems focus on natural social interactions that let people easily move between different styles of work. They propose metaphors, like that of Greenberg (1998), that ease people's transitions across the different styles of work. Generally, the room metaphor is considered as a container for several documents and a space where workers meet. Thus, workers easily move from one room to another and can introduce documents into the different rooms. But these collaboration metaphors do not address the way designers can collaborate during the concept phase of complex engineering products. Moreover, these collaborative systems do not manage virtual shared worlds.
Industrial solutions like Dassault (1998) start from a 3D representation of the product. They mainly support design review, which is a limited style of collaboration. Some, like Hewlett-Packard (1999), offer a direct modification of the shared 3D world, but they use a central server to achieve a consistent update. These solutions are not easy to use, since they do not address clearly how the designers can cooperate freely and efficiently. For example, they do not support mobile working, iterative validation and conciliation. Moreover, a central server is not an open solution, because it requires a unique enterprise to host the shared world. Centralization also has some usability drawbacks, since each participant must have an account and sufficient rights before being able to work. Thus, a central management deprives the participants of their know-how, data management and rights. Moreover, these solutions do not support an easy motion between different styles of work. For example, it is not simple for mobile workers to prepare proposals for a global task before reaching a meeting to conciliate the different proposals in real time. Generally, current solutions address a single way of working, which is the meeting, and tackle mainly the design review.

Thus, metaphors are missing to achieve concurrent engineering at the early phase of a complex industrial product. First, collaborative models do not tackle virtual prototyping. Second, Computer Aided Design (CAD) systems only support a limited style of collaboration. On the other side, the distributed virtual environments are guided by historical requirements coming from the simulation of battlefields. Thus, many (Barrus, 1996; Defense Modeling and Simulation Office, 1997; Broll, 1998; Hagsand, 1996; Macedonia, 1995) mainly consider how to reduce the network traffic due to a high number of moving objects. Generally, a way to divide the scene statically or dynamically is provided for scalability. These systems do not support parallel working (Barrus, 1996) or do not guarantee the consistency of the work (Defense Modeling and Simulation Office, 1997; Broll, 1998; Hagsand, 1996). Moreover, client-server pitfalls (Broll, 1998; Hagsand, 1996) or quality of service requirements (Defense Modeling and Simulation Office, 1997) limit these solutions.

Thus, adapted distributed virtual environments are missing. First, despite what they claim, the current environments do not address the concurrent engineering requirements, because they do not preserve consistency or they limit the ability of parallel working. Second, they do not address how to move easily between different styles of work and how to support different kinds of collaboration and conciliation. Third, a central server introduces a bottleneck and a failure point in the system. At last, the different Quality of Service (QoS) approaches do not inter-operate and are as yet poorly deployed on the Internet; requiring a specific QoS from the underlying network therefore forces the participants to share the same network architecture and limits the ease of deployment and the openness of the solution.

As the industrial solutions and the propositions from the literature do not address well the collaboration services needed at the early stage of a product design, this chapter defines the Virtual Early Prototyping requirements. It shows that the Distributed Building Site Metaphor (DBSM) provides a good solution that satisfies these requirements.
Our metaphor enables real-time collaborative design and parallel work within a virtual shared world while preserving the consistency of the design when multiple participants work simultaneously. It enables different styles of work like mobile working on a global design, meeting, real-time design, real-time review, real-time conciliation and real-time validation. DBSM can be used for both co-located and distributed work. It is a fully distributed approach without any central server. The solution does not require any specific quality of service from the underlying network. It can be deployed over the standard Internet architecture while providing robust authentication and confidentiality. As the solution consumes reduced bandwidth, even low-bandwidth links support real-time work.
Chapter Organization

The BACKGROUND provides definitions and discussions of collaborative virtual environments. It incorporates a state of the art that gives an up-to-date overview of both 3D collaborative systems and distributed virtual environments. This section also explains why existing communication standards like CORBA (Object Management Group, 1999), MPEG-4 (Eleftheriadis, 1998) and real-time protocols cannot help in developing collaborative services for a VEP system. It shows, however, that GroupWare tools and multimedia environments can be used as complementary technologies during distributed sessions. The USE CASE IN AN EXTENDED TEAM presents the style of collaboration that is supported by our VEP solution. It shows how the participants set up a design project. Different alternatives are designed concurrently starting from an elementary proposal. Participants resolve conflicts through a day-to-day consolidation. The REQUIREMENTS section defines a complete set of requirements for collaborative VEP environments. These address functional needs like project management, meeting management and shared world consistency, but also non-functional constraints like ease of deployment over the Internet and heterogeneity support. THE DISTRIBUTED BUILDING SITE METAPHOR describes DBSM, which is a fully distributed solution. It uses regular email during the initial phase of project establishment. Afterwards, the collaboration services are fully distributed using only multicast communications that are available in any standard Internet Protocol (IP) suite. DBSM is thus fully supported by the Internet architecture. A DESIGN PATTERN APPROACH FOR DBSM IMPLEMENTATION introduces how DBSM can be implemented using Design Patterns. More precisely, it focuses on two important patterns solving respectively the distributed designation and automatic scene aggregation problems. Finally, the CONCLUSION gives the lessons learned and further directions to improve the solution.
BACKGROUND

Definitions

Virtual Prototyping

This is a way of designing a manufacturing product using a CAD system. These systems are mainly devoted to producing a 3D representation of the product. Each piece of the product is finely drawn, showing every internal detail of that piece. These systems enable a high numerical precision of the 3D prototype. Using a CAD system requires long training. Moreover, designing a product with high numerical precision is a lengthy activity. So, a time-consuming design is the price to pay for a precise model and 3D representation of the product. The major CAD system is CATIA from Dassault Systèmes. It is widely used in different branches of activity like aeronautics or automotive manufacturing.
Rapid Prototyping

This is a pre-design activity carried out during the concept phase of an industrial product. A Rapid Prototyping system provides a 3D representation of the product. In contrast with Virtual Prototyping, a designer does not yet make a precise prototype. At this stage, Rapid Prototyping lets the main pieces of the product emerge gradually as the designer's ideas evolve. The different pieces of the product are roughly defined. At this stage, a designer does not think about the internal details of each piece of his product. In contrast with a CAD system, a Rapid Prototyping system is easy to learn. Moreover, defining a rough model requires only a couple of hours.
Virtual Early Prototyping

This is an improvement of the Rapid Prototyping concept. Virtual Early Prototyping also occurs during the concept phase of product development. It is another tool to make a 3D representation of the developed product, with the same functionality as a Rapid Prototyping system. In contrast, however, Virtual Early Prototyping offers a better level of design and a high degree of interaction. For example, CoCreate OneSpace (Hewlett-Packard, 1999) is a basic Virtual Early Prototyping system. Thus, the time to design decreases while the quality of the specification increases. First, VEP enables one to evaluate advanced characteristics of the product at the early stage of the development. For example, a tool like CATIA (Dassault, 2000) checks interference between components. It can automatically propose modifications to respect these functional rules. Thus, the system goes one step further in assisting the designer. Second, the human-machine interactions are easier than ever with an ergonomic interface that is adapted to the major operations carried out by a designer. Thus, the system is not a general-purpose tool with inadequate services but proposes best practices to the designer. For example, a best practice for air-conditioning design is a widget that automatically routes ducts from the air conditioner to the different air vents and computes the flow (i.e., rate and pressure) at each air vent. Third, human-to-human interactions are supported. Different designers, each with specific knowledge, participate and debate the baselines of the future product. Here, the different participants examine several alternatives in order to achieve the best specifications, taking into account the expertise of each designer and satisfying the major requirements of the product. For example, electricians and hydraulicians collaborate to design a new product. Different teams are formed to design the different systems in parallel, in accordance with the product requirements. Each designer can propose different alternatives (i.e., 3D objects) connected with others' work (i.e., other 3D objects). These different alternatives will be used to define the best specification. The different works are related to each other by day-to-day consolidation. A validation and conciliation phase enables the global requirements to be respected. For example, the electrician respects minimal distances from the fuel system. Human-human interactions take place within a 3D scene that is managed through a virtual environment. Thus, human-human interactions can be transported by a scene modification. Typically, a participant creates a new 3D object that is taken into account by the other participants later on. For example, while traveling, the electrician uses his laptop to create a new electric system within the 3D scene. After that mobile work, the electrician shares the proposed system as soon as possible by initiating a meeting with the concerned people. Thus, other designers will work in accordance with that new electric system. Human-human interactions are easy to use and support different styles of work. The system does not give any unnecessary rights to a given participant, even if he is the leader of the project. Thus, the expertise and property rights of the extended team are respected.
State of the Art and Justification of the Approach

Different aspects are considered in this section. First, industrial prototyping systems are briefly described in order to show the services they propose and their implementation. Second, different distributed virtual reality solutions are presented. As a VEP system must provide efficient human-human interactions and collaboration within a 3D world, it is important to see whether propositions from the literature can be used to develop the system. Third, an overview of different communication standards shows which ones are best suited to develop a VEP system. Throughout this section, emphasis is put on the services that are missing for a VEP tool. Thus, the reader gains a first level of understanding of the required services. At the end of the section, a justification of our solution is given in contrast to the other proposals.
Industrial Prototyping Systems

CATIA 4D Navigator

Dassault (1998) provides collaborative navigation of the scene through synchronization of viewpoints, telepointers and annotations. Thus, a simple form of design review is achieved. Typically, the designers share one viewpoint. One designer actively pilots the viewpoint throughout the 3D scene to conduct a visual inspection of the work. The passive designers fly through the scene as controlled by the pilot. Thus, the system allows simultaneous viewing of the scene by multiple participants. Each participant has a telepointer that is transmitted to the distant participants. Basically, a telepointer enables a participant to locate the corresponding participant within the scene. Each participant can use his telepointer to show a direction or make an annotation. Thus, the participants share the same scene and the same viewpoint, and observe the different pointers. 4D Navigator provides the ability to make notes, request design modifications or ask questions through an online redlining capability. GroupWare toolkits like Microsoft NetMeeting can be used alongside 4D Navigator. Thus, the designers can hold a videoconference during the 4D Navigator session and talk with each other using the voice transmission of the GroupWare toolkit. Typically, voice helps to comment on the annotations made during the review. From a communication point of view, 4D Navigator uses a server that broadcasts the scene and the position of the viewpoint to each participant. So, this is a client-server architecture without any advanced distribution mechanism. This product is limited to design review, as a designer cannot directly modify the 3D scene (e.g., add a new 3D object) while distant designers maintain real-time awareness of the modifications. There is no possibility for a given worker to make a private improvement of the scene before adding the modifications to the shared scene. Thus, mobile design by an extended team is not supported. Real-time conciliation of the shared scene is not supported since real-time modification of the shared scene is not permitted.
CoCreate OneSpace

Hewlett-Packard (1999) proposes better collaboration services. As in Dassault (1998), designers hold distributed review meetings. In the same way, shared viewpoints and pointers are proposed and the system enables simultaneous viewing. The major difference is the possibility to modify the scene on the fly. For example, one participant creates a new 3D object by activating a modification command. The characteristics of the newly created object are sent to all connected participants in the form of a graphical update. This incremental update guarantees that all participants see the results of the modification immediately. The communications use a client-server architecture. After the initial upload of the scene, the participating stations render the 3D scene locally. Typically, a station sends a command to the server. The server processes the command and sends an incremental update to the different stations. The network traffic is reduced, but the server has to compute the update to be sent. So, the load on the server can be high if each participant sends a large number of commands. The server has to be administered. In the case of an extended team, the server must be hosted by a central entity and the different partners must request access rights from the central administrator. So, this solution introduces set-up and administration costs. It is not feasible to have a fully distributed team, as hosting and administration are devoted to a single entity. Mobile working and easy merging of the different works are not addressed. So, the extended team cannot alternate easily between mobile working and the sharing process. The working phase is only supported during a connection phase with the central server; thus, the designers become dependent on connectivity, and disconnected work cannot be easily integrated within the shared scene.
Distributed Virtual Reality

Locales

Barrus (1996) allows combining different rooms into a global world. The combination relies on transformation matrices in order to connect adjacent rooms that have communication doors between them. A room is called a locale. Thus, a user located in a room can observe the events of an adjacent room through the door. The transformation matrix allows representing the neighboring rooms and moving smoothly from one room to another. In that context, a neighboring room can be connected dynamically in order to combine works achieved separately in different rooms. The participants use different locales to work separately. This solution provides three interesting features: graceful motion between neighboring locales; improved performance through decomposition of the world into chunks that can be processed separately; and combination of separately built objects through dynamic connection of locales. A server maintains each locale, using a multicast address to send the update messages. By opening a network endpoint on the corresponding multicast address, a client can efficiently obtain the locale updates. Thus, a client obtains information about the locales surrounding its focus of attention without receiving messages from irrelevant locales. As each locale is assigned to a server maintaining its state, it is a centralized solution where the server multicasts the updates. The locale metaphor is well suited to visiting different locales that are connected through doors. A concept design activity cannot easily be mapped onto the locale metaphor, since the different pieces of the global scene are not disjoint locales but systems interleaved with each other. Thus, the locale metaphor is not adapted to the concept design activity, and no security is provided.
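To make the transformation-matrix idea concrete, the sketch below is our own illustration, not code from the Locales system; it assumes NumPy and homogeneous 4x4 matrices, and shows how a point expressed in one locale's coordinate frame can be re-expressed in a neighboring locale's frame:

```python
import numpy as np

def placement(tx, ty, tz, yaw):
    """4x4 homogeneous transform: rotation about the vertical axis, then translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0, tx],
                     [s,  c, 0.0, ty],
                     [0.0, 0.0, 1.0, tz],
                     [0.0, 0.0, 0.0, 1.0]])

# Hypothetical placement of locale B inside locale A's frame, e.g., a room
# 10 units away through a door, rotated a quarter turn.
b_in_a = placement(tx=10.0, ty=0.0, tz=0.0, yaw=np.pi / 2)

# A point known in B's frame (homogeneous coordinates), rendered in A's frame.
point_in_b = np.array([1.0, 2.0, 0.0, 1.0])
point_in_a = b_in_a @ point_in_b

# The inverse mapping lets a user in B interpret events that happen in A.
point_back_in_b = np.linalg.inv(b_in_a) @ point_in_a
print(point_in_a[:3], point_back_in_b[:3])
```

Connecting a new locale dynamically then amounts to registering one more such matrix, which is what allows separately produced works to be combined.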
High-Level Architecture

The Defense Modeling and Simulation Office (1997) addresses the simulation of a great number of moving objects. The solution aims at real-time behavior that is as exact as possible. It provides sophisticated solutions to reduce the number of state transmissions and to recover transmission errors. The Defense Modeling and Simulation Office (1997) defines a dynamic partitioning (which is not perfect) according to the motion of the objects. Each moving object defines an area of interest, and events falling into that area are observed by the moving object. The difficulty is to change the area of interest dynamically according to the motion of the object. A given object is not guaranteed to receive all the relevant events. Moreover, this standard defines different orderings of events that can be used in a simulation context. The Defense Modeling and Simulation Office (1996) identifies different qualities of ordering. It is mainly interested in a logical ordering of events where the application associates logical dates with the events. That ordering supports a distributed simulation by guaranteeing that the events are processed at the corresponding logical time. These properties do not address a real-time simulation or a collaborative activity. Ownership services are defined. The ownership management allows federates to transfer ownership of object attributes. The ownership transfer is associated with publication and subscription services. Ownership acquisition requires current publication of and subscription to the attribute. The standard does not define different levels of ownership like read and write permissions. It is not integrated within a scheme where a shared scene is distributed among different designers in order to be processed on a private basis. So, distribution of the shared scene among mobile designers is not supported. In fact, the ownership management concerns distributed simulation more than collaborative design. Security is not addressed at all. HLA does not standardize the protocols, in order to leave freedom in the solutions. In practice, a run-time is under development. It assumes resource reservation (i.e., the Resource Reservation Protocol, RSVP) and reliable multicasting. These requirements seem necessary in the event of a great number of state transmissions associated with moving objects.
Distributed Interactive Virtual Environment

Hagsand (1996) considers moving objects and aims at precise real-time behavior. Solutions reduce the number of state transmissions. In that context, only fresh events interest the recipient, and the system improves performance by relying on that feature. The system makes heavy use of multicasting and partial replication. With DIVE, the application has the responsibility for the placement of the copies of the shared world. It must define which objects must be present in the different copies. This means that not all of the state is replicated at each copy. A modification is multicast to update the copies that have joined the corresponding address. Consistency is achieved by associating an object with a given owner for a long time. Thus, concurrent requests must wait during that period. In fact, the solution makes the assumption that concurrent modifications seldom occur. Since partial replication means that no copy holds the complete state of the world, a server must be used to preserve the persistency of the scene. The server receives all the updates and writes the changes to persistent storage. The first member of a session loads the initial world from the server. A subsequent member requests its initial world state over a multicast address and receives an up-to-date copy from the closest member. This solution provides a protocol for locating the copy nearest to the user for a given object. When an entity detects a missing event or simply requests an object, it multicasts a request. The closest copy replies with its latest version of the object. That latest version is not necessarily the freshest one: another copy can have better knowledge of the state of that object. Only when nobody has a better proposition does the closest copy alone reply to the request.
The system implements a classical dead-reckoning module, like IEEE (1995), in order to keep the number of updates low: between two updates, a receiving entity evaluates the position using linear and angular velocity. Since a server is required for downloading the scene and for persistency, an extended team cannot use the same tool during mobile work and the sharing process. Moreover, the solution does not define consistency properties that can support distributed validation and conciliation. Security is not addressed, since the solution provides neither authentication nor confidentiality.
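The dead-reckoning extrapolation mentioned above can be sketched as follows. This is a minimal illustration in Python; the field names and the example values are our assumptions, not DIVE's actual data model:

```python
from dataclasses import dataclass

@dataclass
class RemoteObject:
    """Last state received for a remote object (field names are illustrative)."""
    position: tuple          # (x, y, z) at the time of the last update
    velocity: tuple          # linear velocity, units per second
    heading: float           # orientation angle, radians
    angular_velocity: float  # radians per second
    timestamp: float         # local time of the last update

def dead_reckon(obj: RemoteObject, now: float):
    """Extrapolate the pose between updates from linear and angular velocity."""
    dt = now - obj.timestamp
    x, y, z = obj.position
    vx, vy, vz = obj.velocity
    position = (x + vx * dt, y + vy * dt, z + vz * dt)
    heading = obj.heading + obj.angular_velocity * dt
    return position, heading

plane = RemoteObject((0.0, 0.0, 100.0), (50.0, 0.0, 0.0), 0.0, 0.1, timestamp=10.0)
print(dead_reckon(plane, now=10.5))   # estimated pose half a second later
```

A sender typically transmits a new update only when the true pose drifts from this prediction by more than a threshold, which is what keeps the update rate low.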
Distributed World Transfer Protocol

Broll (1998) relies on a server for transmission of the virtual world, reliability of the updates, persistency, supervision of participant connections and consistency support. In essence, that solution is close to Hagsand (1996). Proxies manage copies of the shared world to reduce the load on the server. The server and the copies are updated when a peer multicasts events. The system provides a recovery mechanism. Each peer has the ability to receive all the events. When recovery cannot be achieved, a participant has to reconnect to the server. Broll (1998) provides a simple concurrency control achieved through a lock mechanism. The approach is locking on a per-interaction basis instead of a per-object basis. The server releases the locks within a certain time. That concurrency control is well suited for special requirements (like controlling concurrent motions of objects). Broll (1998) integrates solutions to limit the bandwidth that is necessary when working at a VRML level of abstraction. The system improves performance using cells. The cells statically define a mapping of the world. Each peer has to connect to a central server. The server transmits the mapping and the world state at connection time. The solution does not describe how to add new cells to the world dynamically. Being close to DIVE, this solution suffers from the same drawbacks. An extended team cannot use it to alternate mobile and sharing phases. The consistency needed to support distributed validation and conciliation is not addressed. The solution does not provide any security.
World2World

Sense8 (1997) proposes a client-server architecture to distribute and synchronize simulation data. A central server receives all the modification events and forwards them to the other participants. The basic idea is that each update is sent using a point-to-point message over the User Datagram Protocol (UDP). UDP offers a better throughput than the Transmission Control Protocol (TCP). Moreover, the server can perform some optimization when receiving two modifications of the same object (e.g., two translations of the same object). In that case, it will only forward the latest modification and discard the first one. This optimization is only possible for the same type of modification (e.g., two translations). The data are structured hierarchically within a server using a tree. Each node in the tree can be associated with a lock. When a client wants to modify a sub-tree associated with a node, it first has to acquire the lock before being able to update that node. The server prohibits all other users from adding, updating or removing properties of that protected sub-tree until the lock is released. Thus, the system guarantees the consistency of the concurrent operations. Finally, each client has an update rate. If a client produces more updates than allowed, then the client will send only the most recent update for each shared property. Thus, the bandwidth is limited and the server is not overloaded.
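The hierarchical locking just described can be illustrated with a small sketch. This is our own in-process Python mock-up, not Sense8's networked server; it only mimics the lock-then-update discipline applied to a node and the sub-tree it guards:

```python
import threading

class SceneNode:
    """A node of a hierarchical scene tree guarded by a lock (illustrative)."""
    def __init__(self, name):
        self.name = name
        self.children = {}
        self.properties = {}
        self.lock = threading.Lock()

    def update(self, key, value):
        # A client must hold the node's lock before touching the sub-tree the
        # node guards; meanwhile, other clients cannot add, update or remove
        # properties under it.
        with self.lock:
            self.properties[key] = value

root = SceneNode("world")
root.children["engine"] = SceneNode("engine")
root.children["engine"].update("position", (1.0, 0.5, 0.0))
print(root.children["engine"].properties)
```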
In contrast with the others, this solution does not use multicasting of events. So, the server has to send each update explicitly to each client through a point-to-point message. That solution uses more bandwidth than multicasting. The major advantage of the solution is the optimization that a central server enables. But this server also introduces a bottleneck into the system and must be administered. The solution does not address the distribution of a shared scene among mobile designers. No mechanism is provided to merge mobile work into a global, consistent scene.
Cavern

Leigh (1997) addresses design review and collaboration services with immersion in the virtual world. A client uses a server with supercomputing resources and high-performance networking. This centralized solution requires resource reservation and reliable transmission. Through a persistent object server, the system supports persistency. Either intermittent snapshots can be created or entire collaboration experiences can be recorded for later review. A client must specify a Quality of Service (QoS) in order to obtain the desired bandwidth and jitter. These requirements are due to the nature of the collaboration, where immersion generates a great volume of data with real-time constraints. The solution aims at loading large scientific data (like video or CAD formats). Caches using timestamps improve performance. That solution cannot be used easily within an extended team. First, it does not address the distribution of the shared scene among mobile workers. Second, security is not considered. Finally, supercomputing is not a scalable solution for an extended team that is dispersed among several companies.
Communication Standards

Internet

The Internet Protocol (IP) is the network protocol that is present at each router and station. It supports point-to-point, broadcast and multicast transmissions but is unreliable. At each Internet station, UDP and TCP are the two transport protocols available to transmit a message from one process to another. These two transport protocols use the IP services. Using UDP, a process can send a message to a single process (point-to-point), to all the processes of a single network (broadcast) or to a group of processes (multicast) disseminated over different networks. These three kinds of communication are fully supported by any Internet station. UDP is unreliable. TCP is reliable but offers a lower throughput. As described in the previous section, Distributed Virtual Environments make intensive use of multicasting. The reason is that a DVE requires each station to have its own copy of the world, so that the station can render the scene locally. A multicast packet is the best way to send each event (creation, update and deletion) to the group of copies. Sending each copy a point-to-point message requires N packets, when one multicast packet suffices. Thus, using multicast transmission reduces the bandwidth. Moreover, a multicast message can be sent directly by a peer to the multicast group. It does not require a server to relay the message to the group. Thus, multicasting can be used to avoid a server bottleneck.
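As an illustration of how little machinery IP multicast needs, the sketch below uses plain Python sockets; the group address, port and payload are made up. The two halves would normally run in separate processes, receiver first:

```python
import socket
import struct

GROUP, PORT = "239.1.2.3", 5000   # illustrative multicast group and port

# Sender: one datagram reaches every copy that joined the group.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
sender.sendto(b"UPDATE object=42 x=1.0 y=2.0 z=0.0", (GROUP, PORT))

# Receiver: bind the port, join the group, then read updates as usual.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
receiver.bind(("", PORT))
membership = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
receiver.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
data, peer = receiver.recvfrom(1500)   # blocks until an update arrives
```

Note that the sender needs no server and no knowledge of the group membership, which is exactly the property that avoids a central bottleneck.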
CORBA

At this time, the industrial implementations support the CORBA 2 standard (Object Management Group, 1999). CORBA is mainly a client-server architecture where the client invokes a method on the server. A client cannot invoke a method on a group of servers. CORBA solves the heterogeneity problem between client and server. The only way to send a message to a group is to use the Event Service. But this service is inefficient, as it needs one method invocation per recipient and each invocation is carried out over a TCP connection. The CORBA Audio/Video (AV) Streaming Service specification (Object Management Group, 1997) defines an architecture for implementing open distributed multimedia applications. But, in essence, it is a new interface that encapsulates multimedia transport protocols. As explained further below, multimedia protocols do not correspond to the requirements of collaborative virtual environments. So, CORBA is not suited to Distributed Virtual Reality, as it cannot support multicasting.
Multimedia Transport

RTP (Schulzrinne, 1994) is gaining acceptance as a transfer protocol for streaming audio and video flows over the Internet. It is now widely adopted by GroupWare toolkits like NetMeeting. The basic idea of multimedia protocols like RTP is to send data using UDP, which provides a good throughput. The multimedia protocol is then responsible for associating a date with the samples that are transmitted over UDP. When a reception buffer is full, the receiver can sort the samples according to their emission date to preserve the regularity of the audio or video. Thus, the receiver is able to drop samples without producing long silences (i.e., it does not drop consecutive samples). In fact, collaborative virtual environments are mainly interested in solutions adapted to a great number of participants. They also require preserving the consistency of the objects. Playing samples at a regular rate does not answer these requirements. Still, multimedia protocols and GroupWare toolkits are communication tools that let designers talk with each other during a session of collaborative virtual prototyping. Thus, the collaborative virtual environment supports real-time modeling and advanced collaboration styles of design, while the GroupWare product allows users to interact more naturally despite distance barriers.
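The receiver-side reordering described above can be sketched as follows. This is a simplified illustration of ours; real RTP carries sequence numbers and media-clock timestamps in a binary header, which we reduce here to a plain timestamp:

```python
import heapq

class JitterBuffer:
    """Reorder timestamped samples before playback (illustrative sketch)."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.heap = []                      # (timestamp, sample) min-heap

    def push(self, timestamp, sample):
        heapq.heappush(self.heap, (timestamp, sample))
        if len(self.heap) > self.capacity:
            # Buffer full: drop one sample instead of stalling playback;
            # dropping isolated samples avoids long audible silences.
            heapq.heappop(self.heap)

    def drain(self):
        """Yield the buffered samples in emission order."""
        while self.heap:
            yield heapq.heappop(self.heap)

buf = JitterBuffer(capacity=4)
for ts, sample in [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]:
    buf.push(ts, sample)                    # arrivals out of order
print(list(buf.drain()))                    # played back in timestamp order
```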
MPEG-4

Eleftheriadis (1998) is a client-server architecture devoted to the transmission of an audiovisual scene combining 3D objects, audio and video, together with synchronization information. At the server side, the audiovisual scene information is compressed and supplemented with synchronization information. The server passes the scene to a delivery layer that multiplexes it into one or more coded binary streams. At the receiver side, these streams are de-multiplexed and decompressed. The media objects are composed according to the scene description and synchronization information and presented to the end user. The end user may interact with the presentation. Interactions are processed locally or transmitted to the server. The main objective is to allow a server to control how the receiver will behave in terms of buffer management and synchronization when rendering the audiovisual sequences. The model allows the sender to specify when information is removed from these buffers and to schedule data transmission so that overflow does not occur.
Although some interactions can be sent to the server, the system is not specifically designed to let the user modify the data on the server. Moreover, if one user wants to send information to other users, the interaction must be sent to the server and processed before being forwarded to the distant users. Finally, the system does not permit a user to manage access rights or put locks on nodes to resolve concurrent updates.
Justification of our Solution

DBSM is an open platform that enables the participants to move easily from private work (i.e., a disconnected phase) to a virtual building site aimed at collaborating on a shared global scene. That global scene is distributed dynamically among the different workers to improve the virtual building site and to propose different design alternatives. The resulting work is disseminated within the different private workspaces when leaving the building site. The global scene remains accessible when entering the building site again later. That way, the different workers in the building site can form a conciliation meeting. Moreover, mobile designers do not need any network access to improve a subset of the shared scene. Before entering the building site again, a mobile worker selects the new proposal he wants to share with the other participants. Thus, an update of the virtual building site is carried out through an automatic merging of the different proposals. The virtual building site can be compared to the building of a house. When a worker is on the building site, he observes the evolutions carried out by the other workers. At the same time, he brings new assemblies into the building site to adjust them in accordance with the other elements. The building site is the place for real-time cooperation, validation and conciliation to respect the specification of the house. The main difference with a real building site is that a designer can take the virtual building site home to improve his elements or define new assemblies. The solution authorizes different styles of working. First, the designers prepare their work before reaching the building site, as any reasonable worker would do. This disconnected work improves a subset of the global scene that was collected during a previous meeting. Second, they collaborate in real time by modifying and discussing alternative proposals during a meeting. The global scene reforms automatically through implicit relationships. Despite the distribution of the global scene, our solution guarantees a consistent progression of the work. Thus, the designers do not spend time rebuilding previous results. Third, this metaphor can be used to perform co-located work where participants are in the same room and discuss directly while observing a copy of the global scene on their machines. Moreover, distributed work, where participants are physically at different sites, is enabled. Thus, participants discuss through chatting facilities and GroupWare toolkits while interacting in real time on the global shared scene. Fourth, parallel working is fully supported in the different styles of work. During a preparation phase, parallel work is achieved without any communication. During a meeting, parallel work is achieved while preserving the work consistency. Finally, different styles of validation, conciliation and responsibility sharing are supported. Designers share the responsibility for the different systems. Thus, each participant easily controls how others can modify his work. The participants can build and juxtapose different alternatives to achieve real-time conciliation. Real-time modification can also help to reach the conciliation. Each designer can run a validation tool that processes the local copy of the shared scene.
Since each designer has a fresh and consistent copy, the validation tool is guaranteed to process an accurate and correct scene. A precise consistency property is provided to support validation and conciliation. Thus, the designers define the best specifications in real time while preserving others' responsibilities. The next section describes a case study where DBSM is used for the virtual early prototyping of a new aircraft. The design of a new aircraft involves several teams to design 11 systems. That case study walks you through our solution for the concurrent engineering of aircraft systems. The time to design the aircraft is reduced because of parallel work, validation facilities and the usability of the tool. The quality of the aircraft is increased since a wide range of alternatives can be defined, examined and conciliated. A major point of our solution is that it is fully distributed over standard best-effort services. Thus, it is deployed easily on any IP (Internet Protocol) network infrastructure. It does not require any specific quality of service from the underlying network, like resource or bandwidth reservation. In contrast with Broll (1998) and Hagsand (1996), our solution implements neither a reliable multicast nor an ordered multicast (Toinard, 1999) (i.e., a causal and total order), which are well known to scale poorly. Moreover, a reliable or ordered multicast does not guarantee work consistency. The logical ordering used in parallel simulations is useless here, as it addresses reproducibility. Concurrent design is not confronted with reproducibility but with parallel working and consistency. Here, parallel working is not limited, and consistency is achieved using a lightweight protocol that is processed only when required. It is not necessary for a distant participant to observe all the states of a given object, because the owner maintains the correct state. An inconsistency is recovered when required through the ownership transfer. As the ownership transfer runs over an unreliable transport (UDP, the User Datagram Protocol), one multicast and two point-to-point acknowledgments are required. In contrast with a reliable multicast, the ownership transfer is a low-cost protocol that does not recover errors continuously yet preserves work consistency. If required (typically on user demand), a peer resynchronizes its copy with the current states of the distant nodes to get a fresh copy of the global scene. Finally, our solution can be deployed securely over the Internet. It uses standardized authentication tools that are embedded in any email client. A project can be set up and meetings scheduled using email. Email also serves to distribute a session key. Thus, confidentiality is achieved using that session key. Our solution therefore enables a straightforward and lightweight implementation on any standard workstation.
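The ownership transfer exchange (one multicast plus two point-to-point acknowledgments) can be sketched schematically as below. This is our illustration only: the message names, JSON framing and addresses are invented, not DBSM's actual wire format:

```python
import json
import socket

def send(sock, dest, msg):
    """One UDP datagram; dest is either the multicast group or a single peer."""
    sock.sendto(json.dumps(msg).encode(), dest)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
GROUP = ("239.1.2.3", 5000)      # hypothetical meeting address
OWNER = ("10.0.0.7", 5001)       # current owner of object 42
REQUESTER = ("10.0.0.9", 5001)   # designer asking for ownership

# 1. One multicast: the requester announces the transfer to every copy.
send(sock, GROUP, {"type": "OWN_REQ", "object": 42, "requester": REQUESTER})

# 2. First point-to-point acknowledgment: the current owner grants ownership
#    and ships the authoritative state, repairing any stale copy on the way.
send(sock, REQUESTER, {"type": "OWN_GRANT", "object": 42, "state": {"x": 1.0}})

# 3. Second point-to-point acknowledgment: the new owner confirms; from now
#    on it maintains the correct state of the object.
send(sock, OWNER, {"type": "OWN_ACK", "object": 42})
```

The point of the three-message shape is that errors are repaired only at transfer time, instead of continuously as a reliable multicast would do.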
USE CASE IN AN EXTENDED TEAM

To design the different systems of the virtual prototype, several teams are involved. These teams may be geographically distributed among several countries and companies. Thus, several partners, including contractors and suppliers, cooperate during the design phase. The case study shows the style of working and organization that is typically supported by our metaphor. The designers start working from an initial proposal and converge towards the optimal specification through several iterations. Suppose, for example, that three teams are disseminated over different sites. The project coordinator is at Site 1. The electrical and hydraulic teams are located at Sites 2 and 3.
Project Start-Up

In order to initiate the design, the project coordinator sends an email to each team manager including the project definition, the role of each team, the milestone dates and other organizational data. Each team manager sends an email in reply, accepting the definition, the role and the organizational proposal. When the project coordinator has received all the replies, he sends a final email to confirm the start-up of the project. Since emails are transmitted between the different sites and teams over the standard Internet, security must be guaranteed. Here, S/MIME enables each team manager to authenticate the project coordinator. In turn, the project coordinator authenticates each team manager when he receives the reply. Confidentiality of the exchanged data is also provided in a standard way through S/MIME.
Kick-Off Meeting

The second phase is to schedule a kick-off meeting where the project manager distributes the initial virtual prototype, which can be an empty shape that the designers fill throughout the design process. For example, the initial virtual prototype can define the shape of a new car. It would make no practical sense to change the shape, since it does not really evolve during the life of the car. Thus, automotive designers can fill a given shape/volume with a motor, seats, gears, air-conditioning and so on. The project coordinator sends an email to propose a kick-off meeting. Each team manager acknowledges the proposed schedule. Scheduling the meeting can require the exchange of several emails. Finally, the coordinator sends a final email to confirm the negotiated schedule. That final email includes the communication address and encryption data that permit a straightforward and secure meeting. Thus, each team member can securely invite another team member to the scheduled meeting through email transmission of the received data. It is a practical management architecture, since the team members can generally change and each team manager wants to decide which members must participate in the meeting. It authorizes decentralized management: the project coordinator controls which teams can participate, and each team manager independently controls the participation rights of his team members. Unauthorized persons cannot reach the meeting since they do not receive the appropriate communication address and protection key. Even if an unauthorized person succeeds in listening on the meeting address, she cannot use the encrypted data. That security mechanism is simple to use since, from the end user's point of view, it relies only on widely standardized secure email. Thus, the end user does not have any specific system to learn. Moreover, it increases the usability of the solution, since a mobile designer can use the system from any unprotected communication channel. Thus, a participant can securely reach the meeting from any available network extremity all over the world. After that scheduling phase, any invited designer can reach the meeting independently at the scheduled time. For that purpose, he uses a menu of the VEP tool to select the corresponding entry of his agenda. The VEP tool automatically establishes the communication channel with the distant participants. Since the system relies mainly on multicasting, the end user does not have to connect to any server in order to reach the meeting. That communication establishment simplifies the interface, since a designer enters the meeting just by knocking at the scheduled door, without having to know a server address.
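As a sketch of how a session key distributed by email can protect the subsequent multicast traffic, consider the following. It is our illustration using the third-party Python cryptography package; the text does not specify which cipher DBSM actually uses:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# The coordinator generates the session key once; it travels only inside the
# S/MIME-protected confirmation email, never in clear over the network.
session_key = Fernet.generate_key()

# An authorized participant encrypts a scene update with the shared key
# before multicasting it to the meeting address.
cipher = Fernet(session_key)
packet = cipher.encrypt(b"CREATE pipe-37 at (4.2, 0.0, 1.5)")

# Invited members holding the key recover the update; an eavesdropper on the
# multicast group sees only ciphertext.
assert cipher.decrypt(packet) == b"CREATE pipe-37 at (4.2, 0.0, 1.5)"
```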
For the kick-off meeting, all the participants at a given site are co-located in the same room. Thus, they can talk directly with each other. To talk with a distant team, each team manager typically starts a NetMeeting session with the distant manager. The distant network extremity is transmitted using the VEP tool by creating a chat object within the shared world. Thus, the basic problem of discovering the network extremity of a distant NetMeeting session is easily resolved through the VEP tool. It is a very practical way to establish a NetMeeting session. Thus, the teams benefit at the same time from video-conferencing and scene-sharing facilities. When the project coordinator enters the meeting, he selects the initial prototype he wants to introduce into the shared scene. Thus, any participant gets a copy of the initial structure he will have to work on. During the kick-off meeting, the project coordinator transmits the requirements of the aircraft as direct interactions on the shared scene but also as NetMeeting data. These requirements can be discussed in real time through the scene sharing and the NetMeeting session. The different teams work in real time on the shared scene to allocate the space of the major systems for which they are responsible. At the end of this first meeting, the designers have the requirements and a first level of specification. Each participant leaves the meeting by storing his copy of the shared scene locally. Thus, he brings home the environment (the initial prototype plus the distant systems) and his own systems. Typically, he will later use his local store on a disconnected basis to improve his work in accordance with the requirements and the initial specification.
It is a Very Fast and Efficient Way to Cooperate

First, the different participants do not physically move to a common location to reach the meeting. They simply reach the virtual building site from their regular working location. A team member can even reach the building site from any temporary location to which he has moved. Thus, a meeting does not need to be canceled or delayed because participants are traveling. Second, the participants work in real time during that first meeting to directly define and share a first iteration of the specification. Since the system is easy to use and the scene is shared directly, the delay to transmit and share a first iteration is reduced. As each participant observes the distant interactions in real time, that first iteration forms a design context. Thus, the first iteration does not contain speculation about distant intentions and choices. Third, that first iteration respects mutual responsibilities, since each interaction respects the role devoted to each team. For example, the electrical team cannot create a regular hydraulic system, since that operation is permitted only to the hydraulic team. In fact, for proposing alternatives to a given problem (e.g., a fuel system located close to an electrical system), the VEP tool can manage exceptions, like an electrician creating a hydraulic system. In that case, the resulting system is displayed with a special rendering (e.g., a zebra texture to distinguish the exceptional system). Fourth, the operations (creation, update and deletion of objects) are carried out in a consistent way according to the protections of the associated objects. That property has a practical impact on the progression and the correctness of the design. Thus, work does not need to be redone because of a synchronization problem between two parallel workers. Moreover, that property avoids an inconsistent modification of the considered object.
Distributed Design

Within the distributed organization, each team independently designs the system for which it is responsible.
For example, the air-conditioning team uses the result of the first iteration to improve the air-conditioning by adding new ventilation routes. This operation is carried out on the global scene resulting from the first iteration. Thus, each participant (or team) can work on a disconnected basis without communication with the other participants (or teams). That distributed design relies on a human-centered approach where private work provides the baseline of further real-time interactions and discussions. Thus, teamwork enables the user to complete an individual work. This approach presents several practical advantages. First, it is an efficient collaboration principle, since the designers do not have to work at the same time on the same project to achieve their work. Second, it increases productivity, since a designer can efficiently prepare his work outside the virtual building site without any external perturbation. Third, it is a very natural and efficient approach to collaboration where the private work serves as a baseline for the team cooperation. Assume that the air-conditioning team built a new air-conditioning system during a previous meeting based upon the first iteration. Afterwards, each member prepared several improvements on a private basis. Then, the air-conditioning team participates in a co-located meeting to join the private works. Thus, each member introduces his work. These different private works are used to reform the improved system automatically. Each participant gets a copy of that new proposal on his laptop. During that meeting, the participants consolidate the proposal to introduce it later as a partial specification for the second global iteration. Since the authorized designer has moved the air conditioner, the systems designed by the other members are no longer connected to the air conditioner. Different design choices are possible to resolve the problem (e.g., move the air conditioner back, or update the other systems). As the team is aware that the proposal could also conflict with other systems (i.e., systems that have not been introduced into the meeting), they prepare two design alternatives. The first one includes the new air conditioner position and the second one corresponds to the previous position of the air conditioner. Thus, the team has consolidated the proposal by producing two distinct alternatives. Each participant can check interference between components using a validation tool like the one proposed in Dassault (2000). The validation tool can process the local copy of that participant, since his copy contains all the network data and information required by the interference checker. The VEP tool displays the results of the flow simulation as shared objects for mutual awareness. It is a very powerful property, since each team guarantees valid proposals (i.e., where, for example, interferences are detected and resolved in real time).
So, Using DBSM Presents Several Advantages

First, a team proposes a correct system where inconsistencies between the private contributions of the team members have been resolved and the requirements are satisfied. Thus, a future conciliation phase is shortened, since the global conciliation will only check that new entries are correct. Second, each consolidation and validation becomes independent of possible inconsistencies in the distant systems. Thus, despite the interdependencies between the different systems, it becomes feasible to achieve a partial validation, reducing the range of control to only one system. Without partitioning between the concurrent evolutions of the different systems, it could be completely impossible to do a partial validation of each system. Moreover, a global validation would then become even more difficult.
Third, partitioning is an efficient specification approach for designing a complex system, since it is impossible for a designer to take into account all the interdependencies in real time. Thus, a team focuses on its problem without being flooded by other constraints. Fourth, day-to-day consolidation and validation enable a participant to update the system with private work as soon as possible. Thus, the team in charge of the system can work quickly on an up-to-date system. The consolidation and validation phases are guaranteed to process a consistent and fresh state of the shared scene. Without that technical characteristic, any consolidation and validation would be useless, since the result could be irrelevant due to inconsistent proposals. As we can see, a technical property in the field of distributed systems can have a major impact on the usability of the solution.
Work Review

Different work reviews can be necessary to resolve conflicts between several systems and discard a maximum of alternatives. We use a simple example, because the main point is to understand the basic principle of the work review. Let us consider a first example. The hydraulic and electrical managers organize a meeting to review the possible problems between different alternatives. Upon entering the meeting, each designer observes a shared scene including the proposed hydraulic and electrical networks. Thus, the participants immediately observe unsatisfied constraints. Typically, a collision is detected between the two systems. In order to resolve the collision, the two designers can define several alternatives in real time. Costantini et al. (2000) describe practical ways to resolve such a collision. Several advantages are provided. First, conflicts are resolved as soon as they are detected, in cooperation with the designer responsible for the other system. Second, the VEP tool enables sketching different alternatives, since updating a subset of the scene is a rapid way to define a new alternative. Thus, a maximum number of solutions is browsed within a short time. Third, design in context enables defining real alternatives. Once again, the ability to process a consistent and fresh copy of the shared scene enables design in context to solve real problems. Through frequent reviews, the different teams avoid speculation by default, improve the quality of the solution and decrease the design time. A second example involves the air-conditioning and hydraulic systems. The two corresponding teams start a meeting to review several alternatives and resolve collisions between their two systems. As described previously, the air-conditioning team prepared several alternatives to resolve collisions. Once the global scene is merged, the air-conditioning manager observes several collisions with the hydraulic pipes. Instead of updating the shared scene immediately to cope with the conflicts, he introduces another alternative into the shared scene and computes the collisions again. Thus, the manager can select the alternative that minimizes the collisions. Afterwards, a direct update of the selected alternative enables resolution of the remaining collisions. In contrast with the previous way of working, the prepared alternatives speed up the review and increase the number of solutions the participants can examine. This is only a demonstration example, but all along the design, the designers are given several opportunities to prepare different alternatives. First, a designer can have several proposals that he prepares during his private work. Thus, he can quickly propose his ideas to the other designers during a meeting and discuss their respective advantages. Second, a
team offers several possibilities that satisfy the requirements. Thus, the global interdependencies will be exercised with a greater degree of freedom. Third, the work review naturally leads to different alternatives to resolve the conflicts and satisfy the overall requirements. Obviously, the number of alternatives increases with the number of teams and systems. That is why several team-to-team reviews can be used to eliminate the irrelevant alternatives in parallel. Thus, the number of explored possibilities increases.
REQUIREMENTS

As a VEP system shall augment the Rapid Prototyping concept with efficient human-human interactions, collaboration and distribution facilities, this section focuses on the corresponding requirements. It formalizes the different styles and phases of working presented in the previous case study section. This section corresponds to the proposal made in Costantini et al. (Tech. Report 2000); further details can be found in that paper. We do not address here the needs associated with respecting functional constraints and human-machine interactions. In fact, these two needs are not specific to a VEP system. Moreover, sharing a 3D world enables functional constraints and human-machine interactions to be computed locally. So, the emphasis is placed on human-human interactions, collaboration and distribution needs. The requirements of Virtual Early Prototyping are now described. The objective is to enable the designers to quickly set up and discuss proposals producing a 3D prototype. Thus, the best specification is defined in a collaborative way.
Designer Definition

Designer's name: each designer has a unique name (e.g., "Franck Cossgatoichega from Virtual Engineering Design Inc. at 205 Smart Avenue, London, England"). A designer shall provide his name to participate in a specification activity.
Project Management

A project is a formal organization of multiple designers who carry out the global specifications of an industrial product.

Project name: a designer creates a project by defining its name (e.g., "Specifications of Paradise, the third millennium space shuttle").

Dynamic project membership: a project membership sets up the names of the different designers that are project members. Within the project, a designer is simply called a member with reference to his project membership. The group is built dynamically by adding or removing members during the lifetime of the project.

Responsibility sharing: different responsibilities are associated with the designers. For example, one designer shall define the electrical routes while another is responsible for the mechanical structure. Responsibilities evolve during the project lifetime. Moreover, a responsibility cannot be used to perform unauthorized actions during the project lifetime.

Project negotiation: a project definition is negotiated among the relevant participants to set up the responsibilities and roles accordingly. That negotiation phase occurs before significant data are exchanged. At any time, a project can be renegotiated while respecting the current responsibilities. The project negotiation can use various channels (like email, fax, phone calls, etc.). The received information authorizes negotiation of the project participation. This information cannot serve to take unauthorized roles or actions.
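A minimal sketch of such a project record, with dynamic membership and per-member roles, might look as follows (our illustration in Python; DBSM's actual data model is not given in the text):

```python
from dataclasses import dataclass, field

@dataclass
class Member:
    name: str                                  # the unique designer name
    roles: set = field(default_factory=set)    # e.g., {"electrical routes"}

@dataclass
class Project:
    name: str
    members: dict = field(default_factory=dict)

    def add_member(self, member: Member):
        """Dynamic membership: members join during the project lifetime."""
        self.members[member.name] = member

    def remove_member(self, name: str):
        self.members.pop(name, None)

project = Project("Specifications of Paradise, the third millennium space shuttle")
project.add_member(Member("Franck Cossgatoichega", {"electrical routes"}))
```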
Distributed Specification

Each designer can prepare an individual 3D scene. This 3D proposal is elaborated using local tools. Afterwards, the proposals are merged into a global specification. Thus, an extended team produces a distributed specification.

Individual proposal: it contains 3D objects but also organizational data, namely geographical positions, responsibility data, annotations and computing resources. A single designer makes an individual proposal. It is a subset of a global specification. It can contain results from a previous global specification. An individual proposal is produced mainly on a private basis during a disconnected phase. But it can be introduced later into the global specification during a meeting phase.

Global specification: though the designers elaborate individual proposals, these local proposals can be used to define a global specification of the product. A global specification is the union of the different individual proposals (see the sketch below). Different iterations of a global specification are produced. Each iteration is produced mainly during a meeting phase.

Parallel work: two individual proposals are carried out simultaneously by two different designers. Thus, the concept phase can be shortened as much as possible. Parallel work must be supported during both a disconnected and a meeting phase.
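The union of individual proposals can be sketched as follows. This is our illustration; it assumes that object identifiers are globally unique, for example qualified by the proposing team, which is one way to realize the distributed designation problem mentioned in the chapter organization:

```python
def merge_proposals(proposals):
    """Build a global specification as the union of individual proposals.
    Each proposal maps object identifiers to object records."""
    global_scene = {}
    for proposal in proposals:
        for object_id, obj in proposal.items():
            # Unique identifiers guarantee that the union never silently
            # overwrites another designer's work.
            assert object_id not in global_scene, f"identifier clash: {object_id}"
            global_scene[object_id] = obj
    return global_scene

electrical = {"elec/route-1": {"owner": "electrical team", "points": [(0, 0), (5, 0)]}}
hydraulic = {"hydr/pipe-3": {"owner": "hydraulic team", "points": [(0, 1), (5, 1)]}}
print(sorted(merge_proposals([electrical, hydraulic])))
```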
Disconnected Work

This is a way to work on the specification in isolation, without any network connection. Typically, it enables a mobile designer to prepare some work on a laptop while traveling. It is also a way to prepare work at the office before joining the extended team. Thus, disconnected work can be seen as a preparation phase or as a local improvement of the global specification. That disconnected work produces a subset of the global specification.

Responsibility control: a disconnected designer can only perform actions in accordance with the role and rights that have been granted by the project management or during a meeting phase (see the sketch after this list). For example, if a designer works on a version retrieved during a previous meeting, then he can only modify objects for which rights were granted to him during that meeting.

Local validation: each designer can validate his individual proposal during disconnected work. Thus, he checks that his local proposal respects the functional rules of the system. Typically, he verifies that his local proposal is correct before entering a meeting. This local validation does not guarantee correctness among the different local proposals. So, a global validation will be necessary during a meeting.

Local modification: a designer can improve a global specification locally by modifying it. Typically, he recovers a global specification during a meeting and continues to improve it on a disconnected basis. Thus, he locally produces a new proposal that will be used during a further meeting.

Parallel work: two disconnected tasks are carried out simultaneously. Thus, two disconnected designers produce two different local proposals in parallel. These two local proposals will be merged in a future meeting.
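The responsibility control above can be sketched as a simple check applied to every offline edit. This is our illustration; DBSM's real rights model is certainly richer than a single write flag:

```python
class RightsError(Exception):
    """Raised when a disconnected edit exceeds the rights granted earlier."""

def apply_local_edit(scene, grants, designer, object_id, new_state):
    # An offline edit is accepted only if the matching right was granted
    # (by project management or during a previous meeting) before the
    # designer went offline.
    if (designer, object_id, "write") not in grants:
        raise RightsError(f"{designer} may not modify {object_id}")
    scene[object_id] = new_state

grants = {("electrician", "elec/route-1", "write")}
scene = {"elec/route-1": {"points": [(0, 0), (5, 0)]}}
apply_local_edit(scene, grants, "electrician", "elec/route-1",
                 {"points": [(0, 0), (5, 2)]})
print(scene)
```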
Meeting Management
Designers located at different sites independently reach a meeting to join their different proposals. Thus, a global specification is automatically obtained and shared. That
global specification is submitted to a discussion process. The global correctness is validated through real-time conciliation and real-time modification.
Dynamic discovery: the different designers dynamically discover a meeting when it is created. Thus, a designer can decide to reach a new meeting after discovering it.
Real-time membership: different designers independently join the meeting. Meeting membership is updated when a new designer comes in. During a meeting, a designer is named a participant with reference to his meeting membership. Membership is accessible to each participant. Knowledge of distant members provides information describing each role and responsibility.
Privacy: a meeting can be restricted to certain participants. Authorized participants are project members that are allowed to reach the meeting. Unauthorized members cannot reach a meeting or cannot use the information that is transmitted during that meeting. Typically, only authorized members can process the information exchanged during a meeting. If a member receives unauthorized information, he cannot interpret or process it. Thus, confidentiality is achieved during a meeting.
Global scene merging: each designer joins the meeting with an individual proposal. These individual proposals are merged into a global shared scene. Thus, a global specification is built by an automatic union of individual proposals satisfying the responsibilities of the different designers.
Real-time validation: each designer validates the global scene by checking that others do not make conflicting or incorrect proposals. The system must guarantee that a validation is processed on a consistent scene; otherwise, processing a validation would detect non-existing conflicts. Thus, the global specification is validated during the meeting.
Real-time modification: a designer improves the global scene by modifying his proposal or others' proposals. The modifications are carried out while satisfying the different responsibilities. This reduces the design time because the designers collaborate in real time to improve the specifications through an immediate modification of the 3D scene.
Real-time awareness: during a meeting, any participant observes the distant interactions as quickly as allowed by his processing speed. For that purpose, the system minimizes the flow of data that must be processed by a station.
Real-time conciliation: the system provides the required services to negotiate the improvement that must be carried out. First, a designer can juxtapose a counter-proposal with the current proposal. Thus, different proposals for a subset of the global scene are defined and compared in real time. Second, the conciliation must be done on a consistent scene. This is a constraint similar to real-time validation. Thus, the participants have the guarantee that their consensus is negotiated using consistent data.
Parallel work: two independent tasks are carried out simultaneously. Thus, the meeting does not degrade the ability to work in parallel. For example, two participants are able to create, modify or delete two different objects simultaneously. So, in case of concurrent modifications, performance does not degrade due to inadequate locking mechanisms.
Guided visit: the system provides the required tools to facilitate the overview of the global scene. Typically, one participant guides the visit, authorizing the others to follow him in real time. Thus, the participants review the specification in a synchronized way.
Typically, shared viewpoints and annotations are supported. Moreover, GroupWare tools enable participants to talk with each other despite the distance barrier.
Iteration
The designers can iterate several times, moving easily from disconnected work to meeting work. A new phase of work enriches the work carried out during the previous phases. Thus, moving between different styles of work does not impoverish the global specification.
Individual iteration: an individual proposal contains a subset of the global specification. A designer builds an individual proposal starting from a given subset. Thus, the designer works on the view of interest. During an iteration he modifies the subset; during a further meeting, the resulting proposal can be used to update the global specification. During an individual iteration, he creates, modifies and deletes objects of his proposal. So, the subset changes, increases or decreases according to the different operations carried out during the individual iteration.
Distributed global iteration: the starting state of a global iteration is a collection of individual proposals. The different participants process a global iteration in a distributed way. During a meeting phase, the distributed modifications produce the results of a global iteration. Those results define a new global specification. When multiple global iterations are processed within different meetings, the corresponding global specifications incorporate results from individual iterations that are interlaced among those meetings.
Consistent progression: a new iteration shall incorporate the consistent results from previous iterations. Thus, the system guarantees a consistent progression of the work, and the designers do not spend time rebuilding previous results. Let us give some counter-examples. First, a participant has the ability to destroy distant work because the system does not correctly manage rights that are not persistent from one meeting to the next. Second, during the same meeting, an old state wrongly supersedes a newer state of a given object because the system does not preserve object consistency. Clearly, the system must avoid those inconsistent behaviors in order to guarantee a consistent progression within a single meeting but also between different meetings.
Constraints
Non-functional requirements shall be respected to guarantee the usability and efficiency of the solution.
Simplicity: the solution must be simple and lightweight. Simplicity gives the solution a good chance of being implemented. A lightweight solution can run on any standard workstation.
Portability: the solution must use only standardized and widely supported services and protocols. Thus, the solution can be ported easily, through recompilation and minor changes, to different operating systems (Unix systems, Windows, MacOS, etc.).
Internet deployment: the solution can be deployed easily over the Internet. It shall not require any specific quality of service from the underlying network. Thus, the solution runs widely over any kind of underlying network.
Heterogeneity: the collaborative system authorizes heterogeneous machines to interoperate. Thus, different platforms can participate in a meeting.
Zero administration: the system requires no administration effort to be deployed. Thus, any standard user manages projects and meetings without any system or network administration skills. It also means avoiding a central server where a system administrator is required to set up users with customized profiles providing access to the collaboration data. Even if administration procedures are required, they shall be completely distributed among the designers and easy to use.
Low bandwidth: the solution can run over low-bandwidth networks, including 33.6 kbps modem links. Even with slow links, good real-time performance must be achieved. So, the system must support mechanisms reducing the bandwidth consumption without degrading real-time awareness.
Modularity: collaboration services must be independent of the application and the virtual reality environment. Thus, any application can use the collaboration services within any kind of virtual environment.
Persistency: a designer has the ability to make his work persistent. Thus, he stores different steps of his progression to be able to retrieve them later on.
Security: the collaboration services guarantee authenticity and confidentiality. Through authentication, it is determined who is participating before revealing sensitive information or entering into a collaboration process. Thus, malicious people cannot usurp the identity of another designer. Confidentiality guarantees that only authorized persons can access the information.
THE DISTRIBUTED BUILDING SITE METAPHOR
We propose a novel solution answering the VEP requirements described in the previous section. This solution is called the distributed building site metaphor (DBSM). The metaphor defines the specifications of a distributed environment allowing a 3D world to be shared. It is a fully distributed approach without any central server. The metaphor gives the general specifications of the distributed system and defines how the required properties are provided; in other words, it gives the principles of the solution. The major assumption is that the metaphor uses only an unreliable multicast transport. It also requires regular email for the preparation phases. Thus, project management and meeting scheduling use email. Afterwards, the solution uses multicasting to share the 3D world and support several modes of real-time collaboration.
Project Management
Project negotiation: the designer creating a project becomes the manager of his project. Afterwards, the project management can be transmitted among the different participants. As project manager, the creator establishes the project attributes, namely the project name, the project membership, the responsibilities of the participants and various collaboration, graphic or organizational data. The manager sends an email including the project definition to the different members. Each recipient replies using email: he accepts the project as is or asks for some modifications. For example, a member can request other responsibilities. That negotiation phase does not require any specific tools, but a robot can assist a member by processing an email entry, negotiating the reply with the local member and replying to the sender using an email. Thus, several email exchanges can be involved. At the end of that negotiation phase, the manager sends a project confirmation to the participants. Security is achieved using standard authentication and encryption procedures like S/MIME that are integrated within any email client (e.g., Netscape Navigator or Microsoft Outlook). For that purpose, any participant uses a digital signature (i.e., an X509 certificate including the user's public key; the private key is kept secret at the user side) to be authenticated.
Management transfer: the manager can transfer the project management at any time to another person. For that purpose the project is renegotiated using the same protocol. The difference is that the confirmation message includes a signed text from the granting manager. The signed text is computed using the private key of the granting manager (i.e., the creator). The new manager of the project includes the signed text within any further email. Thus, the new owner gives proof that he really received the project management from the previous manager.
Dynamic project membership: project membership is set up dynamically during the project lifetime. The entry or departure of any member is under the control of the current manager. Updating the project membership is achieved by running the project negotiation protocol again. Project membership can differ from meeting membership.
Initial scene transmission: the project confirmation can include the initial 3D scene to be used for the project. It gives the initial definition of the product. Considering an aircraft engineering project, the aircraft structure would be the initial scene transmitted to each participant. Thus, each designer could design the systems corresponding to his responsibility (air conditioning, seats, electricity, hydraulics…). Moreover, an initial scene can be downloaded later using FTP or HTTP services. So, there is no obligation to transmit the initial scene at confirmation time.
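To make the transfer proof concrete, the following minimal sketch signs and verifies a transfer statement with an RSA key pair using the pyca/cryptography library; the statement text, key parameters and library choice are illustrative assumptions, standing in for the X509/S-MIME tooling the chapter actually relies on.

# A minimal sketch of the management-transfer proof; RSA/PKCS1v15 and the
# statement below are illustrative stand-ins, not the authors' exact scheme.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

granting_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# The granting manager signs a statement transferring the project management.
statement = b"Project management transferred to the new manager"
proof = granting_key.sign(statement, padding.PKCS1v15(), hashes.SHA256())

# Any member later checks the proof against the granting manager's public key
# (normally extracted from his X509 certificate).
try:
    granting_key.public_key().verify(proof, statement,
                                     padding.PKCS1v15(), hashes.SHA256())
    print("transfer proof is authentic")
except InvalidSignature:
    print("forged transfer claim")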
Meeting Preparation
After the project negotiation phase, a scheduling phase enables preparation for a meeting.
Scheduling: any project member can send an email to a group of participants to schedule a meeting. Typically, this group is a subset of the project members, but the group of participants can be extended to any required designer. Each member replies with an email, and several emails can be exchanged. Finally, the meeting initiator sends an email to confirm the meeting. As with project negotiation, authentication and confidentiality are achieved through secure email and X509 certificates. Thus, mutual authentication is guaranteed between the meeting initiator and the participants.
Address allocation: during the scheduling phase, the initiator proposes a multicast address. Each recipient replies, indicating whether the requested address is already allocated for another purpose. The reply can contain a range of available addresses. In the latter case, the manager selects a new address and runs the scheduling negotiation again. In practice, each recipient can use a local directory service (i.e., the Lightweight Directory Access Protocol (Wahl, 1997)) to reserve the requested address. By consulting the already reserved addresses, he replies with the available addresses. If no directory service is configured, a recipient simply accepts the proposed address. This is a temporary allocation that will be used at the beginning of the meeting to contact each other. Further on, we describe how conflicting addresses are detected and resolved automatically during the meeting.
Key distribution: the scheduling phase also serves to distribute a secret session key PKS. PKS enables symmetric encryption during the scheduled meeting. The first email proposes a schedule and enables the receiver to authenticate the meeting initiator using the included certificate. Each recipient X replies with his own certificate. Thus, the initiator authenticates each participant X. When sending the confirmation message to X, the initiator sends the key PKS. That key is encrypted through secure email (i.e., S/MIME). Thus, X gets the key PKS that will be used later during the meeting.
Thus, the session key PKS is securely distributed. First, mutual authentication is achieved. Second, only an authenticated participant X can decrypt the confirmation and get the session key PKS. The system uses standardized cryptographic systems: PKS is a DES key, so any standard tool can be used to generate it. The manager can use a robot to send the confirmation email. A participant can also use a robot to retrieve and store the session key PKS. Those robots provide automation, but the user can also perform the operations without any assistant robot. In fact, the required operations are simple, as various tools for those cryptographic systems are now widely available and easy to use.
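The sketch below illustrates the session-key idea with the pyca/cryptography library; Fernet (AES-based) stands in for the DES key PKS named in the text, since the mechanism (a shared symmetric key that makes the traffic undecryptable for outsiders) is the point here, not the particular cipher.

# Illustrative sketch only: Fernet replaces the DES cipher named in the text.
from cryptography.fernet import Fernet, InvalidToken

pks = Fernet.generate_key()   # generated by the meeting initiator
session = Fernet(pks)         # held by every confirmed participant

# An event multicast during the meeting is encrypted with the session key.
event = session.encrypt(b"move node X.1.3 to (12.0, 4.5, 0.8)")
print(session.decrypt(event))

# A peer that never received PKS cannot read the meeting traffic.
try:
    Fernet(Fernet.generate_key()).decrypt(event)
except InvalidToken:
    print("unauthorized peer cannot decrypt the event")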
Distributed Scene Tree
Scene tree: the application uses a scene tree that is composed of different objects. The scene tree is an abstraction of the application scene. It describes graphical information but also data that are relevant for the collaboration activity. Each object defines a sub-tree whose leaves are elementary nodes (i.e., a beam or an electrical route). Each node contains attributes, namely geometrical attributes, organizational attributes, collaboration attributes and processing resources. A node is typically a graphical object. Organizational attributes consist of father-son relationships. Collaboration attributes contain mainly protocol states and protection attributes. A node can also contain computing resources, like code to animate the corresponding object.
Figure 1 shows a simplified scene dealing with an aircraft subassembly design. User A owns the equipped structure and hydraulic system. B owns the air system. As C does not manage any system, he is not the owner of any system. When a designer owns a node N, it means that he can do operations like updating attributes (i.e., size, position, protection attributes, etc.) associated with N or creating a child node N.M. Here, designer A owns the nodes X, X.1, X.1.1, X.1.3, Y.1, Z, Z.1.2. Designer B owns the nodes X.2, X.1.2, X.2.1, Y, Y.3. Designer C owns the nodes Y.5, Z.1.
Distributed scene: no central server maintains the global scene tree. It is automatically built during a meeting. The global scene is accessible as a whole if all the designers are present in the meeting. Figure 2 presents the partial scene that is processed if only designers B and C are present. The important feature is that designers B and C can work despite the absence of A.
[Figure 1: Simplified scene dealing with an aircraft subassembly design]
In that case only B and C's nodes are processed. All the nodes owned by B and C are present in the partial scene. Each node of the partial scene can be processed without any restriction. For example, C adds a pipe (Z.1.3) to the subsystem (Z.1) while B changes some attributes of the route (X.2.1).
During a meeting each designer has a copy of the scene. Any modification is processed on the local copy, and a modification event is multicast to the distant copies. When receiving the event, a distant peer processes the modification on his local copy. Thus, any distant participant observes the modification. As will be described further, each participant has the ability to bring his nodes into an isolated workspace in order to continue working on a disconnected basis. Thus, the scene is distributed among the different designers. The scene is processed on a distributed basis during both a meeting and disconnected work.
Protection: each node has protection attributes (read right and write right). The owner manages the protection attributes. Thus, the owner prevents or authorizes modifications by the other designers. For example, by disabling the write permission, the owner prevents distant participants from deleting or modifying the corresponding node. Disabling the write permission also prevents other designers from attaching a new branch at the corresponding node.
Distributed designation: at the first level of the scene tree, a unique name is attributed locally when a designer creates a new node. A unique name contains the IP address (@IP) of the creation machine plus a local date (e.g., the name X=@IPA,dateA defines the Equipped Structure of Figure 1). The subsequent nodes are designated according to their creation time (e.g., @IPA,dateA,1.3 corresponds to the first branch and third leaf created by A under the node @IPA,dateA). Only the owner can create a child node (e.g., A created the node @IPA,dateA,1.3 as owner of @IPA,dateA,1). That way, each computation peer defines unique names in a distributed way. Those distributed names can be created on a disconnected basis. This is an interesting approach, as a designer does not need any name server to create a unique name. The father-son relationships are implicitly maintained by the distributed names, and thus the system does not have to store those relationships explicitly. If a designer does not have the right to create some type of node (e.g., a designer that is an air system manager does not have the privilege to create a hydraulic system), the application controls the permission. The distributed system provides the basic services enabling the application to respect the different responsibilities and permissions.
[Figure 2: Partial scene dealing with an aircraft subassembly design]
The application uses the project definition (that was transmitted to each designer) and the protection attributes of the nodes to control the validity of the requested operation. In Figure 2, the application uses the protection attributes of Z.1 and decides that participant C, as owner, has the permission to create the child node Z.1.3.
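A minimal sketch of distributed designation and protection follows, assuming names of the form IP address plus local date plus tree position as described above; the class and field names are illustrative, not the authors' implementation.

import time

class SceneNode:
    def __init__(self, name, owner):
        self.name = name          # unique in time and space, e.g. "192.0.2.7:988123456.1.3"
        self.owner = owner        # the designer currently owning the node
        self.readable = True      # protection attributes, managed by the owner
        self.writable = True
        self.version = 0          # incremented by the owner at each update
        self.child_count = 0      # used to number the next child

def create_first_level(local_ip, owner):
    # No name server is needed: the IP address plus a local date suffices.
    return SceneNode("%s:%d" % (local_ip, int(time.time())), owner)

def create_child(parent, requester):
    # Only the owner of the father node may create a child; the child's name
    # extends the father's name, so father-son links stay implicit.
    if requester != parent.owner:
        raise PermissionError("only the owner of the parent may create a child")
    parent.child_count += 1
    return SceneNode("%s.%d" % (parent.name, parent.child_count), requester)

x = create_first_level("192.0.2.7", "A")   # e.g. the equipped structure X
x1 = create_child(x, "A")                  # e.g. the structure X.1, owned by A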
Disconnected Work
Isolated workspace: during disconnected work, a user builds an individual proposal within an isolated workspace. He creates, deletes or modifies nodes within that isolated workspace without any communication at all. Thus, a designer can work on a disconnected basis using a subset of the global scene. The subset comes from an initial transfer of the scene or from a previous meeting. That subset contains all the nodes that belong to the designer, but it can also contain nodes from other designers. For example, designer A is the manager of the assembly project. During the project negotiation he sends the initial scene represented by Figure 3. According to the project definition, designer B starts a disconnected work and creates an individual proposal within his isolated workspace, as depicted by Figure 4.
[Figure 3: The initial scene sent during the project negotiation]
[Figure 4: An individual proposal created within an isolated workspace during disconnected work]
As described later, the ownership of a node can be transferred during a meeting. For example, after the first meeting the isolated workspace of designer C looks like Figure 1. It means that the ownership of the nodes Y.5 and Z.1 was transferred to designer C during the first meeting. After the first meeting, C can then modify these two nodes during disconnected work (i.e., he can update their attributes, delete those nodes, or create child nodes). Thus, on a disconnected basis, an isolated workspace permits improvements starting from a previous meeting.
Responsibility control: the application can check that an operation is in accordance with the designer's responsibilities. For example, the application enables designer B to create an air system because he has the corresponding responsibility included in the project definition. After the first meeting, the application can also permit designer C to update the attributes of node Y.5 during disconnected work.
Parallel work: two disconnected tasks are carried out simultaneously. For example, designers A and B work in parallel starting from the initial scene described in Figure 3. A modifies the scene by adding the pump Z.1.2. At the same time, B builds the air system as presented in Figure 4. Thus, two disconnected designers produce two proposals in parallel. These two proposals will be merged during a future meeting; for example, the first meeting automatically builds the scene presented in Figure 4.
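Because distributed names cannot collide, the automatic merge of two disconnected proposals reduces to a union. The sketch below assumes each proposal is a mapping from unique node name to node state, which is an illustrative simplification of the scene tree.

# Illustrative union of individual proposals keyed by distributed unique names.
proposal_a = {"ipA:t0": "Assembly", "ipA:t0.3.2": "Pump Z.1.2"}      # A's work
proposal_b = {"ipB:t1": "Air system Y", "ipB:t1.1": "Segment Y.1"}   # B's work

def merge(*proposals):
    global_scene = {}
    for proposal in proposals:
        # Distributed names are unique, so the union is unambiguous; version
        # numbers (omitted here) would keep the freshest state of a node that
        # appears in several proposals.
        global_scene.update(proposal)
    return global_scene

shared_scene = merge(proposal_a, proposal_b)   # built automatically at the meeting
print(sorted(shared_scene))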
Meeting
Basically, a meeting runs in four phases. The first phase is a real-time membership. The second phase achieves a global scene merging; at the end of that phase, each participant has a copy of the global scene, and a synchronization protocol guarantees that each participant has exactly the same copy. During a third phase, different operations can be processed, namely real-time modifications, real-time review, real-time validation and real-time conciliation. Fourth, the participants leave the meeting, bringing a subset of the shared space into their isolated workspaces.
Real-time membership: during that first phase, an entering peer multicasts the participant's name. Distant participants reply by multicasting their names. Thus, a new participant locally maintains his knowledge of the meeting membership. The project definition enables a participant to access the role and responsibility of any distant member. Moreover, the protocol allocates a session color to each participant. This is a user-friendly mechanism allowing the objects to be colored according to their owner. Let us describe the protocol more precisely. The entering peer multicasts a request including its network extremity (IP address + port number). In response, the distant peers multicast their name, network extremity and color. To terminate the membership phase, the entering participant multicasts a second message with the chosen color. When two peers conflict for the same color, the peer with the smallest extremity changes its color and multicasts a new message. Thus, each participant builds its local membership knowledge of the other participants. Figure 5 shows Christian entering concurrently with Antoine. The conflicting colors are resolved at the end of Christian's membership phase, and Christian's peer multicasts a new message. The mechanism can have minor effects, like seeing user Christian change from blue to green during the same meeting.
[Figure 5: Christian entering concurrently with Antoine]
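The color-conflict rule of the membership phase can be sketched as follows, assuming a network extremity is an (IP, port) pair compared lexicographically; message transport is left out and all names are illustrative.

# Sketch of the session-color conflict rule only; transport is omitted.
def resolve_color_conflict(me, peer, free_colors):
    """If both peers chose the same color, the smaller extremity changes."""
    if me["color"] == peer["color"] and me["extremity"] < peer["extremity"]:
        me["color"] = free_colors.pop(0)   # pick a fresh session color
        return True                        # a new message must be multicast
    return False

christian = {"extremity": ("192.0.2.1", 4001), "color": "blue"}
antoine = {"extremity": ("192.0.2.9", 4001), "color": "blue"}

if resolve_color_conflict(christian, antoine, ["green", "yellow"]):
    # Matches Figure 5: Christian moves from blue to green and re-multicasts.
    print("Christian multicasts his new color:", christian["color"])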
Global scene merging: after the first phase, a second protocol enables each participant to share exactly the same scene. For that second phase, a participant chooses the individual proposal that he wants to add to the global scene. Receiving the proposal, a distant participant multicasts his own proposal. So, the entering peer receives the distant individual proposals, which are assembled to compose a copy of the global shared scene. At the same time, the distant peers update their local copies with the entering proposal. A node that does not have a read permission is not multicast; that way, distant participants cannot observe unauthorized nodes. As the meeting was scheduled previously, the application knows the list of all the participants that planned to come. Thus, the application can prohibit multicasting of the nodes of the proposal that are owned by other participants. This has two advantages. First, it reduces the set-up time of the copies. Second, it avoids sending out-of-date nodes, as the owners will enter their up-to-date states. Thus, each peer has a local copy of the global scene at the end of that second phase. Moreover, the global scene respects the responsibilities because the modifications carried out during disconnected work have been done according to the protection rules. In fact, the entering peer uses a synchronization protocol. This synchronization avoids overloading the network with simultaneous multicasts from the different peers. Moreover, it recovers from transmission errors, guaranteeing that each peer has an exact copy of the shared scene. This protocol is described precisely below.
Copies synchronization: this synchronization protocol is used during the global scene merging, but a peer can use it at any time to resynchronize the copies with the freshest state of each node. As each owner has an up-to-date state for his nodes, fresh states are collected from the different owners.
Peer A (i.e., an entering peer) multicasts a synchronization request including a list of the nodes owned by A. Each element of the list contains the name of a node plus the current version number of that node. Afterwards, A multicasts a set of messages to provide a state for each announced node. When a distant peer B has received all the states from A, it multicasts a synchronization reply including a list of the nodes owned by B. This list is followed by the corresponding states.
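The exchange just described can be sketched as follows, with plain tuples standing in for the wire format (the implementation actually transmits character strings, as noted later under DBSM Qualities); all structures and field names are illustrative.

# Illustrative messages of the copies-synchronization protocol.
def make_sync_request(owned_nodes):
    # Announce each owned node together with its current version number.
    return ("SYNC_REQUEST", [(n.name, n.version) for n in owned_nodes])

def make_state_messages(owned_nodes):
    # One state message per announced node follows the request.
    return [("STATE", n.name, n.version, n.state) for n in owned_nodes]

def missing_or_stale(announced, local_copy):
    # A receiver asks for retransmission of any node it misses, or for which
    # it received a version lower than the announced one.
    return [name for name, version in announced
            if name not in local_copy or local_copy[name].version < version]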
To avoid network saturation if all the peers reply at the same time, different mechanisms are proposed. First, the network extremities can be used to schedule concurrent replies. Each peer acquired the network extremities of the distant peers during the membership phase. When A finishes multicasting its states, the peer with the smallest network extremity starts to reply. Thus, the network extremities order the replies. Second, each peer P includes a list in its synchronization request or synchronization reply. If a distant peer R misses a node of that list, then R asks P to retransmit a state, providing the node's name. Moreover, R must receive a state with a version that is greater than or equal to the announced version; otherwise, R asks P to retransmit a state in order to receive a fresher version.
This solution is efficient in many ways. First, it guarantees that each peer has a fresh copy. Second, during that protocol multiple senders do not overload a peer, as only one peer can multicast at a time. Third, it recovers from transmission errors. So, at the end each peer has recovered a consistent copy including all the freshest states. This service can be used before starting a validation or conciliation procedure. A participant can also use this service when he thinks that he is missing many states. In that case, the peer sends a synchronization request asking for point-to-point replies. Distant peers reply directly to the requester using point-to-point exchanges. Thus, the peers that did not request the synchronization do not receive the responses, and their computation load is reduced.
Real-time modifications: a designer can create, delete or modify a node according to the permission rules defined by the protection attributes and the project definition. At that time, his peer multicasts an event to the distant peers. Each event contains the name of the corresponding node and all its attributes, namely graphical data, node version and ownership. A receiving peer computes a new state for its local copy. To prevent old events from overwriting a newer state, the receiving peer uses the version number to discard old versions.
Ownership transfer: to modify a node owned by someone else, a peer must first request the ownership from the current owner. The owner refuses the ownership transfer when the write attribute is disabled. Otherwise, the requesting peer receives the current state and the ownership. The transfer runs in three phases. First, the requester multicasts a message to locate the owner. Second, the owner replies to the requester with a point-to-point message including the ownership and the current state of the node. Third, when receiving the reply, the granted peer sends a point-to-point acknowledgement that terminates the transfer. When receiving that acknowledgement, the granting peer sets its ownership to false, being sure that the granted peer received the ownership.
The ownership transmission must be reliable. For that purpose, a local number is associated with the request. The owner replies with the same number. When receiving the reply, the requester sends an acknowledgment including the same number. The requester resends his request in the absence of a response, and the granting peer resends the reply in the absence of an acknowledgment. Faulty situations, where an object is left without any owner due to a transmission error, are thus avoided. When receiving a request from A, the granting peer sets the node to pending. Thus, it does not reply to any request that arrives from another requester B. That concurrent peer B periodically multicasts its request, waiting for the new owner A to reply.
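Combining the ownership and versioning rules, a minimal sketch of the three-phase transfer and of the stale-event check might look like this; the transport primitives are stubbed with print and every name is illustrative rather than the authors' implementation.

def multicast(message):            # stand-in for unreliable multicast
    print("multicast:", message)

def send(peer, message):           # stand-in for a point-to-point message
    print("to", peer, ":", message)

class Replica:
    def __init__(self, name):
        self.name, self.version, self.state = name, 0, None
        self.writable, self.owned_here, self.pending = True, False, False

def on_ownership_request(node, requester, request_id):
    # The owner received the multicast locate request (phase 1); it grants
    # only if the node is writable and no transfer is already pending.
    if node.writable and not node.pending:
        node.pending = True
        send(requester, ("GRANT", node.name, node.version, node.state, request_id))

def on_grant(node, version, state, request_id, granter):
    # Phase 2 arrived at the requester: take the state and the ownership.
    node.version, node.state, node.owned_here = version, state, True
    send(granter, ("ACK", node.name, request_id))   # phase 3 ends the transfer

def on_ack(node):
    # Safe: the acknowledgment proves the granted peer holds the ownership.
    node.owned_here, node.pending = False, False

def apply_event(node, event_version, new_state):
    # An old event never overwrites a newer state of the local copy.
    if event_version > node.version:
        node.version, node.state = event_version, new_state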
Freshness of a copy: a copy can contain old states for the nodes owned by distant participants, but owned nodes obviously are up to date. It is impossible to guarantee that a copy has only fresh nodes, because the transmission time of a modification cannot be bounded. Moreover, it cannot be the same for the different peers. Even if the system used an underlying network guaranteeing the same transmission time for the group of participants (which is hard and costly), one could not guarantee such a property if a transmission error occurs. In practice, transmission time is thus unbounded. The important point is that an old state cannot overwrite a fresh one. Here, the version number maintained by the owner guarantees that old versions are discarded.
Object consistency: the ownership transfer guarantees that any modification is carried out starting from the latest state of the node, because a peer must recover the ownership before being able to modify that node. Thus, inconsistent modifications, where a peer would modify a node without being its owner, are avoided. This is a very important property that guarantees a consistent progression of the work. It assures that a designer will not modify a node without observing its latest state. Thus, a designer will not have to redo work that has not been observed by a concurrent worker. This basic mechanism can be used to achieve object consistency: a peer requests the ownership of the sub-tree composing the object. Thus, the designer is guaranteed to have the latest state of the sub-tree. Having the latest version, he can then decide whether it is worth modifying. One can also imagine a participant requesting the ownership of the whole scene, thus making sure he has a consistent scene. In contrast, the synchronization protocol provides a fresh copy of the whole scene without requesting any ownership transfer.
Real-time awareness: at each modification, event passing keeps the distant peers synchronized. In contrast with point-to-point transmission, multicasting reduces the transmission time by emitting the event once and propagating it simultaneously to the different participants. Moreover, a receiving peer speeds up the delivery of recent events by throwing away old pending events. Thus, the workload is reduced because the receiving peer does not process out-of-date updates. This mechanism speeds up the relevant events in the pending queue. Missing states are not recovered: it is not necessary for a distant participant to observe all the states of a given node, because the owner maintains the correct state.
Parallel work: two tasks dealing with two distinct nodes are processed simultaneously. These two tasks do not communicate with each other. Two tasks for the same node are serialized: the current owner processes the first task; afterwards, the ownership transfer is carried out with the distant task. Thus, the second task starts processing the node after the end of the first task. Serialization is required to guarantee a consistent progression of the work, as explained previously. Thus, parallel working is limited only when required.
Real-time review: a shared viewpoint can be created by any participant as a standard graphical node. Afterwards, the ownership of the shared viewpoint can be transmitted to any participant who wishes to guide the visit. A guiding request is multicast to the distant peers. Observing the guiding request, a receiving participant can start the visit. Thus, his local viewpoint is synchronized with the shared viewpoint.
At any time, a guided participant can stop the visit to recover his local viewpoint. Thus, the participants review the specification in a synchronized way through a simple shared node. During the review, any participant can place an annotation by creating a new node whose attributes define a plain text. Participants can work together by jointly selecting and referring to specific scene nodes. For example, a participant moves his telepointer to select
and highlight a node (setting the attributes with a typical color) in order to reference this element. Thus, the distant participants observe the telepointer motion, and the highlighting refers the participants to the node involved.
Address allocation: a multicast address was reserved during the meeting preparation. If another application uses the same multicast address, DBSM detects the situation automatically because received messages cannot be decrypted using the session key PKS. DBSM automatically multicasts a new address. Receiving the proposed address, each peer joins that new multicast address. That way, the session is not interrupted. Before proposing a new multicast address, DBSM joins and listens to that address. If DBSM does not receive any message during a given period, the new address is free and can be multicast to the other peers. It should be noted that currently only a small set of applications uses multicast addresses. Thus, the probability that the chosen address is in use at the same time is really low. Moreover, using an occupied address does not prohibit the application from running; reading irrelevant messages simply slows down the application.
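The detection rule can be sketched with the same symmetric stand-in as before: any datagram that fails to decrypt under the session key is taken as foreign traffic, and the peer starts migrating to a fresh address. The function names and the Fernet cipher are illustrative assumptions.

from cryptography.fernet import Fernet, InvalidToken

session = Fernet(Fernet.generate_key())   # the session key shared by the peers

def on_datagram(payload):
    try:
        return session.decrypt(payload)   # a genuine DBSM event
    except InvalidToken:
        propose_new_address()             # another application uses the address
        return None

def propose_new_address():
    # DBSM would probe a fresh multicast address, listen for a silent period,
    # then multicast it to the peers so that the session is not interrupted.
    print("proposing a new multicast address")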
A designer can duplicate a sub-set of the scene. The original node can be write-protected but the duplicated node belongs to the requester. Thus, he can update the duplicated node to propose another solution. The different participants can visualize the current proposal and new proposal at
Thus, the participants discuss and compare the two proposals accordingly. A real-time modification of the global scene is another way to make a new proposal. A basic property that can be used by a conciliation procedure is the synchronization procedure, which guarantees that the different participants have a consistent copy. Thus, having the same view, they can discuss in a consistent manner; they do not lose time reasoning about out-of-date data. Chatting facilities support the discussion process: a chatting object is associated with a node; it can be created by any participant and written by everyone. Moreover, that tool can be completed with a GroupWare toolkit providing voice transmission.
Leaving: a leaving participant selects the nodes that he wants in his isolated workspace. All the owned nodes are automatically included in his isolated workspace. Nodes owned by distant participants can be selected. Thus, the leaving participant defines the subset of the shared scene he includes in his isolated workspace. The operation is processed locally by marking the excluded nodes within the local copy; the unmarked nodes build up the isolated workspace. It is thus a lightweight procedure. At the end of the selection, the peer multicasts a leaving message to inform the distant participants of its departure. Then, the peer leaves the multicast address. It shall be noted that the leaving message does not need any acknowledgment: unreliable leaving suffices, as the peer already has the required nodes within his local copy. Before sending the leaving message, the peer will normally save the isolated workspace within a local store. Thus, the designer is able to recover his isolated workspace at any time. Further, the saved workspace enables disconnected work.
DBSM Qualities
Simplicity: the solution uses unreliable multicasting and a lightweight consistency protocol. Thus, complex solutions are avoided by using a novel approach that focuses on consistency.
Portability: the solution uses standard multicasting and email communication. Thus, the solution can be easily ported to any machine supporting the Internet.
Internet deployment: the solution does not require any specific quality of service from the underlying network, like resource reservation or reliable multicasting. It is deployed using the standard ability of routers to find a path to each member of the multicast address. In some cases, tunnels are set up to go through deficient routers. Thus, the solution is deployed easily over the Internet.
Heterogeneity: standard encoding rules, as used by ASN.1 compilers, can solve the heterogeneity problem. At the time of writing, our implementation transmits each message as a character string. This is possible because the state of a node has a small size; thus, string conversion does not greatly increase the required bandwidth.
Zero administration: a user requires only email and an application using the collaboration library. Thus, any designer uses the collaboration services in a straightforward way. It does not require administration privileges or the installation of specific servers. The system relies on a standard configuration of electronic mail and routers.
Performance: it should be good because client-server bottlenecks and reliable-multicasting pitfalls are completely eliminated. Various techniques improve performance (i.e., speed-up of events, synchronization with load control). Moreover, dead reckoning can be used when the shared viewpoint moves quickly within the scene.
Low bandwidth: the different communication protocols use only the multicasting of events. Thus, less bandwidth is required than when sending a message to each peer.
Moreover, each peer receives only a small amount of data, as the modifications are simply incremental updates.
Modularity: a copy of the scene is stored as application data. A local interaction event is processed directly by the application to update the local copy accordingly. The update uses the services of the virtual environment for a 3D representation of the local copy. It uses the collaboration services to transmit an event to the distant peers. Thus, the application, the collaboration services and the virtual environment are implemented as independent modules.
Persistency: at any time a designer can decide to store a subset of the nodes locally. This subset includes owned nodes, and it can also include distant nodes. Thus, persistency is achieved in a fully distributed way through local saves. A global scene is thus distributed among the different permanent stores available among the designers.
Security: certificates enable the authentication of the designers. Thus, any participant checks the identity of another user. The application can then define what that authenticated user is permitted to do according to his responsibility and the required protections. Moreover, encryption guarantees confidentiality: each message is encrypted with the session key PKS. Thus, only the authorized members that securely received the session key PKS can decrypt the messages transmitted during a meeting.
A DESIGN PATTERN APPROACH FOR DBSM IMPLEMENTATION
In this final section, we focus on how two important aspects of the DBSM can be implemented with a Design Patterns (Gamma, 1994) approach. The first pattern illustrates the distributed designation problem. The second pattern shows how the scene aggregation could be implemented. For a more complete discussion of those patterns, see Costantini et al. (2000).
A Pattern for Distributed Designation
Name: Hierarchical Designation Manager
Example
Consider the example (Figure 1) of the previous section. Distant designers can collaboratively set up a virtual prototype and attach simulation modules to the different subsystems. Typically, designers cooperate to allocate space for different sub-systems and connect these sub-systems to show their relationships. During the process, designers can interact with the virtual object hierarchy and provide real-time inputs to the simulators on a shared virtual scene. Figure 1 shows a simplified scene tree dealing with a subassembly design. We want these sub-system objects to be managed in a fully distributed and efficient way while keeping the work consistent.
Context
In a collaborative framework, workers need to share objects on the network. These objects have owners and may depend on other objects. Then, we need a component that
manages a distributed applicative scene tree that allows these objects to be handled hierarchically in a fully distributed way.
Problem
The nodes of the global scene must keep unique names in time and space among the different private spaces and replicas. The solution must guarantee several properties:
• Unique names should be easy to manage and use.
• Two private spaces or replicas cannot contain the same object with two distinct references. For example, during a disconnected phase, the pump can be present in two different private spaces and still be designated by the same name. During a meeting, two replicas use the same name to access the pump object (local or distant).
• If a name designates two distinct nodes, then only one node is active and the other is a pending node that cannot be processed. The pending node must disappear and be replaced by the active one. The system must guarantee that a pending node exists only within a private space. In contrast, two versions of the same node have the same name.
• The distributed scene tree may not reflect the applicative scene tree: objects can be private and remain in the local private space, or be public in the global shared space.
• A user creating a node must be able to use a dynamic Internet Protocol address. Despite dynamic address allocation, the name must remain unique in time and space.
Solution
A peer (i.e., an entity processing a replica or a private space) computes a unique name locally when a user creates a new distributed node (i.e., a Network Decorator Leaf1 or Network Decorator Composite as defined in the sequel) within the scene tree. For example, a unique name contains the Internet Protocol (IP) address (@IP) of the creation machine, a local timestamp and the position within the tree (e.g., @IPA,stampA,1.3 corresponds to the first child and third grandson of the node @IPA,stampA). A timestamp includes a date, corresponding to the local creation date, plus a random number. Moreover, a node locally maintains the names of its children. For example, node @IPB,stampB,1 stores locally the list of its children (@IPB,stampB,1.1, @IPB,stampB,1.2, ...). To create or to delete a name, a user must be the owner of the father. For example, to create the name @IPB,stampB,1.3, the user must own node @IPB,stampB,1.
Thus, a node (e.g., @IPA,stampA) can be created at the first level on a disconnected basis without being attached to any particular shared scene. This enables a user to prepare some work without knowing in advance which scene he will enter his work into. For example, one designer prepares an electric sub-system and another designer prepares a hydraulic sub-system. Afterwards, they reach a meeting to conciliate their works. Figure 6 gives the automatic merging of the corresponding scene tree. The root of the tree corresponds to the name of the meeting. Note that the unique names do not correspond directly to the graphical position of the objects (e.g., the first vent is named @IPB,stampB,1,3 while the second vent is named @IPB,stampB,1,1).
[Figure 6: Automatic merging of the corresponding scene tree]
[Figure 7: Structure]
Participants
Applicative Part
• Work Object: abstract work object class that defines all the generic attributes and methods that a VR work object may have (position, orientation, color, …).
• Work Leaf1 (Pump): a concrete applicative object that may be wrapped by a Network Object Decorator on demand.
• Work Composite inherited objects (Route): concrete groups of applicative objects.
Distributed Part
• Network Decorator: abstract class reflecting the generic network mechanisms of the Work Object abstract class. For example, it can implement at this level the distribution of abstract Work Object methods like moveObject(), resizeObject(), ….
• Network Decorator Leaf1 (PumpDecorator): this object inherits from an abstract Work Decorator and wraps a concrete applicative object from the scene tree. It adds network control to the abstract Work Object methods in order to update the replicas of the underlying object on the network.
• Network Decorator Composite: this group forms, with the Network Decorator, the composite part of the distributed object tree implementation.
• Unique Names Repository: this Singleton implements the local unique naming access methods (CreateUniqueName(), RemoveUniqueName(), IsNameDistant(), …) and is used by the Network Decorator objects. It also handles the dynamic construction of the global name list by storing the distant unique names received from the remote participants.
Collaboration
When entering a meeting, Network Decorator objects substitute for the local Application Scene Tree objects. On the other participants' sides, Network Decorator objects are constructed and then create replicas of the distant objects. Note that the same network classes manage the local objects and their replicas. So there are as many concrete Network Decorator implementations as there are concrete applicative implementations.
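A minimal sketch of this collaboration follows, assuming the participant classes named above; the method set and the multicast hook are illustrative, since the chapter targets a full VR toolkit rather than this toy class hierarchy.

class WorkObject:                  # abstract applicative object
    def move(self, x, y, z):
        raise NotImplementedError

class Pump(WorkObject):            # a concrete Work Leaf
    def move(self, x, y, z):
        self.position = (x, y, z)

class NetworkDecorator(WorkObject):
    """Wraps a WorkObject and notifies the distant replicas of each update."""
    def __init__(self, wrapped, unique_name, multicast):
        self.wrapped, self.name, self.multicast = wrapped, unique_name, multicast

    def move(self, x, y, z):
        self.wrapped.move(x, y, z)                        # local update first
        self.multicast(("MOVE", self.name, (x, y, z)))    # then the replicas

pump = NetworkDecorator(Pump(), "@IPB,stampB,1.1", print)
pump.move(1.0, 2.0, 0.5)   # moves the local pump and emits a MOVE notification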
Consequences
Network Properties
• The first property is satisfied. The unique names are easy to create, since the Unique Names Repository creates each name locally. Thus, a unique name can be created both during disconnected work and during a meeting. A unique name is independent of the geographical position within the 3D scene, so objects can move without changing their names. As an example, a Network Decorator object can update the position of its underlying concrete Work Object, but its name at the network level remains the same. A unique name defines the position of the node within the global scene tree. Thus, a name is unique in time and space, but also defines a position within the distributed scene tree. This is useful since the application retrieves directly from a name which node is the father. Moreover, an application can decide to accept unordered message notifications and stack them until it has resolved all the dependencies between them. For instance, a Network Decorator Composite node notification may result in several concrete Network Decorator object notifications. As hierarchical information is attached to the notification, we can implement a best-effort algorithm that waits for all dependent node notifications before it updates the replicas locally.
• The second property is satisfied since the object name is defined at creation time and remains unchanged throughout the life of that object.
• The third property is guaranteed because to create a name (@IPB,stampB,1.1) a user must be the owner of the parent node (@IPB,stampB,1). So, the parent (a Network Decorator Composite object) must be present to create a child node. Using the local list of children, the parent guarantees that an active name cannot be reused. To be reused, the owner of the active name must be present and must delete the name. If the same name exists for another object, it can only belong to a user that is neither present nor owner. So, only a private space can contain the pending name. Moreover, the pending name will disappear when the user reaches a new meeting.
• The last property enables a communication peer to change its IP address without any difficulty. Already created names remain unique because the probability that two machines use the same address to produce the same name (same local time and same random number) is close to zero. Moreover, existing works can be reused through first-level nodes that do not contain any reference to a specific shared scene.
Structure Considerations
• Network implementation is well decoupled from object implementation: the Network Decorator does the network job, so almost no specific and complex network code obscures the existing code, which simplifies the developer's task.
• The existing applicative object structure must be stable: adding new applicative object types means adding new Network Decorators. However, note that many generic features may already be implemented in the abstract Network Decorator class, which reflects the Work Object generic methods augmented with network constraints. So even then, the effort of creating a new concrete Network Decorator class should be minimal.
• In some situations, the Network Decorators may have to know about the private internal attributes of their underlying applicative objects in order to dispatch object updates or to construct the replicas. This problem can be solved in most cases by using the Memento pattern (Gamma, 1994), which enables a participant to interface the applicative object with its decorator by exposing only a subset of the real object state to the decorator, without violating the encapsulation of the underlying object.
• Maintaining two different trees synchronized (a distribution one and an applicative one) is not easy. However, this structure allows the user to choose clearly what should be distributed and what should remain in the private local space.
• Direct object distribution may not be acceptable for performance reasons. In some situations, it is better to implement a Distributed Facade (Brown, 1999) instead of sending several disassociated object notifications. In our example, the distributed objects are lightweight objects that need to send few messages to their replicas. Moreover, for more abstract command notifications that do not apply specifically to an object structure, it can be necessary to implement a command-oriented pattern, which can be expressed as a combination of the Command pattern and the Decorator pattern (see Compounding Command (Vlissides, 1999)). The decorator part corresponds to the network transmission of the wrapped command.
See Also
Decorator, Singleton, Composite, Memento, Command (Gamma, 1994), Compounding Command (Vlissides, 1999), Distributed Facade (Brown, 1999).
A Pattern for Automatic Scene Aggregation
Name
Scene Aggregation Helper
Example
Collaboration requires moving easily between disconnected work and meeting work. In virtual prototyping, each participant improves different nodes of the global virtual scene during a disconnected phase. Thus, a disconnected workspace includes nodes from the previous meeting but also new nodes created during the disconnected phase. During a further meeting, each participant brings the disconnected nodes into the shared scene tree. Thus, the different entering nodes automatically re-form a shared scene tree.
Context
This service is typically present when mobile workers collaborate on a virtual prototype. Merging of disconnected works must be supported while satisfying a minimal “consensus.” Obviously, disconnected works cannot provide a global specification that satisfies all of the prototype requirements. For example, two disconnected designers can produce two conflicting improvements. During a meeting, the system can compute the shared scene to check that the requirements are respected. Detected conflicts must be resolved by interactions between the designers and direct modifications of the shared scene. End users do not want the system to resolve conflicts automatically, since they want to negotiate and reconcile the different proposals. Although it is not feasible to guarantee global consistency for the different entering proposals, the system must at least guarantee that the different proposals are associated with correct functional locations, to ensure a minimal consensus on the shared work.
Problem
A participant must be able to bring pieces of the scene puzzle and get a copy of the global scene. The different pieces must go to the right places in the puzzle. A right place is not only a 3D position but also a correct functional location within the scene tree. For each arrival, the distant participants must observe the added pieces. Moreover, each peer must recover from transmission errors and get a fresh state for each node of the global scene.
Solution
An application peer creates a Network List Object (i.e., an object including all of the entering nodes; see further details in the Structure section) to define the nodes entered by the local user. At creation time, the Network List Object multicasts several list messages to the distant peers for creating distant replicas. Each list message can be viewed as a part of the whole Network List Object. When a peer receives the total number of parts, it has a complete copy of the Network List Object. Several messages are required since UDP (User Datagram Protocol) limits the size of a multicast message. A list message is defined as follows: (LIST: VL, TotParts, NPart, (G1,V1), …, (GN,VN)), where VL is the version number of the list, TotParts is the total number of parts forming that list, NPart is the part number, and each couple (GX,VX) defines the unique name and the version number of an object X. Using TotParts parts, a peer announces the objects it enters into the meeting. Afterwards, the announcing peer multicasts an object state (State: VL, GX, VX, TX, SX) for each object X of the list, where TX is the type of X and SX is the state associated with VX. Thus, a distant peer recovers all of the announced objects. At the end of the scene aggregation, each peer has a copy of the shared scene. Since a distant peer has the list of the objects, it can request retransmission of a missing object X by requesting a version number equal to or higher than VX. Thus, a producing peer only has to retransmit the current version V'X of X (V'X ≥ VX). Since each state (State: VL, GX, VX, TX, SX) transports a list version VL, a receiving peer that misses the corresponding list message can request a retransmission by asking for the list part for the object GX. Generally, the producing peer will retransmit the corresponding list part (LIST: VL, TotParts, NPart, (G1, V1), …, (GX, VX), …, (GN, VN)). When the producer has added or deleted an object, the requested version VL is no longer available since the list has changed. In that latter case, the producer retransmits all the new parts (LIST: V'L, …) with V'L ≥ VL. After receiving the new list, the requester has the responsibility of requesting the retransmission of missing states based upon the received list. It should be noted that a selective list (SLIST: VL, TL, TotParts, NPart, (G1, V1), …, (GN, VN)) transmits only object references of type TL. A selective list is a subset of the list VL.
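As an illustration, the two message layouts might be captured as follows in Java; the field names mirror the notation above, but the classes are a sketch rather than the authors' actual wire format.

```java
// Sketch of the LIST message: one part of a Network List Object.
final class ListMessage {
    final int versionVL;          // version number of the list
    final int totParts;           // total number of parts forming the list
    final int nPart;              // this part's number (1..totParts)
    final String[] names;         // GX: unique object names in this part
    final int[] versions;         // VX: version number per object
    ListMessage(int vl, int tot, int n, String[] g, int[] v) {
        versionVL = vl; totParts = tot; nPart = n; names = g; versions = v;
    }
}

// Sketch of the State message: the state of one announced object.
final class StateMessage {
    final int versionVL;          // list version the state belongs to
    final String nameGX;          // unique name of the object
    final int versionVX;          // version of the transported state
    final String typeTX;          // object type, used to build the replica
    final byte[] stateSX;         // serialized object state
    StateMessage(int vl, String g, int vx, String t, byte[] s) {
        versionVL = vl; nameGX = g; versionVX = vx; typeTX = t; stateSX = s;
    }
}
```

Each serialized part must stay below the UDP datagram size limit, which is why a list is split into TotParts parts in the first place.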
Participants
• Network List Object:
  - There are as many list objects as meeting participants.
  - It manages several attributes such as (G1, V1), …, (GN, VN) and the list version VL.
  - On demand, it provides additional network control to update the replicas of the Network List Object on the network.
The other participants have been described for the previous pattern.
Collaboration
Typically, a GUI uses the Scene Aggregation when the local participant decides to enter prepared works. Thus, a new Network List Object is created and transmitted to update the distant copies. Each Network Decorator (Leaf and Composite) uses the Network List Object each time a modification is processed on the corresponding objects. Thus, each Network Decorator object modifies the Network List Object of the local participant. On demand (from the local user or a distant request), an update of the Network List Object is transmitted to the distant peers. For example, if a replica wishes to re-synchronize with fresh values, it sends a request for the Network List Object to check the freshness of distant objects.
Figure 8: Structure (class diagram: the Network Interface and Network Message Dispatcher connect to the Work Object, the Network Decorator with its Leaf and Composite subclasses, the Scene Aggregation and the Network List Object Singleton, through send state / receive state / notify change and dispatch list / update list interactions)
When a Network List Object is newly created for a distant participant, the creation method uses the local Network List Object to update the entering participant. Since a communication peer receives the version number of the Network List Object but also of the different Work Objects, it can request retransmission of missing list messages or object updates, as sketched below.
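A hedged sketch of this recovery rule (names illustrative): on receiving a state for object GX that carries list version VL and object version VX, a peer requests only versions at least as fresh as those it is missing, so out-of-date retransmissions are avoided.

```java
// Minimal recovery logic for a receiving peer.
final class Recovery {
    int knownListVersion = -1;
    final java.util.Map<String, Integer> known = new java.util.HashMap<>();

    void onState(int vl, String gx, int vx, Requests out) {
        // A list version newer than the one we hold means we missed a
        // list message: ask for the list part that references this object.
        if (vl > knownListVersion) out.requestListPartFor(gx);
        // Accept fresher object states; older ones are simply discarded.
        if (vx > known.getOrDefault(gx, -1)) known.put(gx, vx);
    }

    interface Requests { void requestListPartFor(String gx); }
}
```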
Consequences
Network Properties
• Each entering peer brings pieces of the scene puzzle using the Network List Object. A couple (GX, VX) transmits a piece X. The entering participant gets a reply from the distant participants. Thus, he completes the global scene as an assembly of the different responses. At the end of the scene aggregation, each participant gets a copy of the shared scene.
• Each piece X goes to the right place since the name GX contains the location within the scene tree.
• Each time a participant enters, the distant copies of the global scene are updated using the list messages. Moreover, the responses give an opportunity for the copies to resynchronize and to recover fresher states.
• When a receiving peer detects a transmission error (missing announced objects or missing list parts), it asks for newer versions. Thus, retransmission only provides the fresh version of the missing object or missing list. Retransmissions of out-of-date states are thus avoided.
• Since the List Object is transmitted only on demand to the distant peers, the system does not retransmit redundant information, i.e., the state of the Decorator Object that caused the List Object update.
• The List Object gives information about the global progression of the corresponding participant. It enables a receiving peer to quickly check whether the current state of a participant is fresh or whether states are missing. It provides a basic support for sending negative acknowledgements about missing values.
• It should be noted that a List Object can evolve while the local peer processes its Decorated Objects. The List Object evolves consistently with the current state of the peer. There is no need to lock the List Object while a peer processes its Decorated Objects or vice versa.
Structure Considerations
• Network implementation is well decoupled from any application processing: the cooperation between the Network List Object and the other Network Decorators does the network job. So, application code only has to create a List Object on user request to get a copy of the shared scene.
• As the List Object is processed through either local or distant requests, a Compounding Command (Vlissides, 1999) can be used. The decorator part enables the user to wrap and transmit the command to a distant peer.
• The communication between the Network Decorator and the Network List Object can be implemented efficiently with Observer (Gamma, 1994). In effect, the Decorator is the subject and the Network List Object observes its update notifications, as sketched below.
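A minimal sketch of that Observer wiring; the class and method names are illustrative, not the authors' code.

```java
import java.util.ArrayList;
import java.util.List;

// Each Network Decorator is a subject; the (singleton) Network List Object
// observes its updates so the list can be advanced consistently.
interface DecoratorObserver {
    void objectUpdated(String nameGX, int newVersionVX);
}

abstract class NetworkDecorator {
    private final List<DecoratorObserver> observers = new ArrayList<>();
    void attach(DecoratorObserver o) { observers.add(o); }
    protected void notifyUpdate(String gx, int vx) {
        for (DecoratorObserver o : observers) o.objectUpdated(gx, vx);
    }
}

final class NetworkListObject implements DecoratorObserver {
    private int listVersionVL;
    @Override public synchronized void objectUpdated(String gx, int vx) {
        // Record (gx, vx) in the local list and bump the list version;
        // the updated list is only transmitted on demand.
        listVersionVL++;
    }
}
```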
See Also
Compounding Command (Vlissides, 1999), Observer, Singleton (Gamma, 1994).
CONCLUSION
Our solution is efficient since it reduces design time. The key points of that reduction are the ability to work in parallel through a widely distributed specification, the easy motion between private and shared work, the good support for conciliation, and the incremental validation of the global specification. Second, the number of examined solutions increases. The key points are the ability to work anywhere in the world, easy consolidation and conciliation of private works into a global solution, real-time collaboration with distant participants, and good usability of the collaboration services. Third, the distribution can be integrated within any standalone application. Thus, a rapid prototyping tool can be developed, experimented with and improved on a standalone basis. Afterwards, collaboration can be integrated to support concurrent engineering without modifying the interface of the prototyping tool.
Fourth, deployment has proven to be very simple. The major reasons are a fully distributed solution, no hosting and management of any server, efficient security using secure email to distribute a session key, use of the standard Internet architecture, and automated components to manage the network channel and security. DBSM works on both Unix and NT. It has been used in a virtual prototyping application as described in Costantini et al. (March 2000, May 2000). Future work will provide several improvements. New human-human interactions will be added using the feedback of the users. The resynchronization service will be improved through a new management of object versions (Costantini et al., November 2000). A solution will be developed to cope with participants that cannot be reached using multicast, and better confidentiality will be provided using Diffie-Hellman shared secrets (Costantini, 2001). The eXtensible Markup Language will be used to access various CAD and Product Data Management systems.
REFERENCES
Barrus, J. W., Waters, C. and Anderson, D. B. (1996). Locales: Supporting large multiuser virtual environments. IEEE Computer Graphics and Applications, November, 16(6), 50-57.
Broll, W. (1998). DWTP–An Internet protocol for shared virtual worlds. International Symposium on the Virtual Reality Modeling Language VRML'98, ACM SIGGRAPH, Conference Proceedings, 49-56.
Brown, K., Eskelin, P. and Pryce, N. (1999). A mini-pattern language for distributed component design. 6th Pattern Languages of Programs Conference, Urbana, Illinois, August.
Costantini, F., Sgambato, A., Toinard, C., Chevassus, N. and Gaillard, F. (2000). An Internet-based architecture satisfying the distributed building site metaphor. IRMA 2000 Multimedia Computing Track, Anchorage, Alaska, 21-24 May, Conference Proceedings, 151-155, published by Idea Group Publishing.
Costantini, F., Sgambato, A., Toinard, C., Chevassus, N. and Gaillard, F. (2000). Distributed building site metaphor for concurrent engineering. MICAD 2000, 19th International Conference Dedicated to CAD/CAM/CAE and New Technologies for Design and Manufacturing, Paris, France, 28-30 March, Conference Proceedings, 187-194.
Costantini, F., Sgambato, A., Toinard, C., Chevassus, N. and Gaillard, F. (2000). Virtual early prototyping using the distributed building site metaphor. Technical Report CNAM-CEDRIC, submitted to WSCG'2000.
Costantini, F., Toinard, C. and Chevassus, N. (2000). Secure mobile replication for collaborative virtual reality. Intelligent Multimedia Computing 2000, Rostock-Warnemünde, Germany, 9-10 November, Conference Proceedings.
Costantini, F., Toinard, C. and Chevassus, N. (2000). Patterns for collaborative distributed virtual reality. ICSSEA 2000, Paris, France, 5-8 December, Conference Proceedings.
Costantini, F., Toinard, C. and Chevassus, N. (2001). Collaborative design using distributed virtual reality over the Internet. Internet Imaging 2001, San José, California, Conference Proceedings.
Dassault Systems CATIA Marketing Team. (1998). 4D Navigator. Retrieved on the World Wide Web: http://www.catia.ibm.com/prodinfo/.
Dassault Systems. (2000). Retrieved on the World Wide Web: http://www-3.ibm.com/solutions/engineering/escatia.nsf/public/v5_public.
Defense Modeling and Simulation Office. (1997). HLA Data Distribution Management Design Document Version 0.5. February, U.S. Department of Defense, Washington DC. Retrieved on the World Wide Web: http://www.dmso.mil/project/hla.
Defense Modeling and Simulation Office. (1996). HLA Time Management Design Document Version 1.0. August, U.S. Department of Defense, Washington DC.
Eleftheriadis, A., Herpel, C., Rajan, G. and Ward, L. (1998). Text for ISO/IEC FCD 14496-1 Systems. ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, May.
Gamma, E., Helm, R., Johnson, R. and Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
Greenberg, S. and Roseman, M. (1998). Using a room metaphor to ease transitions in groupware. Research Report 98/611/02, University of Calgary.
Hagsand, O. (1996). Interactive multiuser VEs in the DIVE system. IEEE Multimedia, 3(1), 30-39.
Handley, M. and Jacobson, V. (1998). SDP: Session description protocol. Request for Comments 2327.
Hewlett-Packard. (1999). CoCreate OneSpace: The Award-Winning, Web-Enabled, CAD-Independent, Real-Time Collaboration Solution.
IEEE Std 1278.1. (1995). Standard for distributed interactive simulation applications protocols. IEEE Computer Society.
Leigh, J., Johnson, A. E. and Defanti, T. A. (1997). Issues in the design of a flexible distributed architecture for supporting persistence and interoperability in collaborative virtual environments. Supercomputing, Conference Proceedings, 15-21.
Object Management Group. (1997). Control and management of A/V streams specification. OMG Document Telecom/97-05-07 Edition.
Object Management Group. (1999). The Common Object Request Broker: Architecture and Specification, Edition 2.3.
Macedonia, M. R., Zyda, M. J., Pratt, D. R., Brutzman, D. P. and Barham, P. T. (1995). Exploiting reality with multicast groups. IEEE Computer Graphics and Applications, 15(5), 38-45.
Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, V. (1994). RTP: A transport protocol for real-time applications. Internet Draft.
Sense 8. (1997). World2World Release 1 Technical Overview.
Toinard, C. and Chevassus, N. (1998). Virtual world objects for real-time cooperative design of manufacturing systems. Lecture Notes in Computer Science, Object-Oriented Technology: ECOOP'98 Workshop Reader, Springer Verlag, 525-528.
Toinard, C., Flori, G. and Carrez, C. (1999). A formal method to prove ordering properties of multicast systems. ACM Operating Systems Review, 33(4), 75-89.
Vlissides, J. and Helm, R. (1999). Compounding command. C++ Report, April.
Wahl, M., Howes, T. and Kille, S. (1997). Lightweight Directory Access Protocol (v3). Request for Comments 2251.
Chapter XV
Methods for Dealing with Dynamic Visual Data in Collaborative Applications – A Survey
Binh Pham
Queensland University of Technology, Australia
Many important collaborative applications require the sharing of dynamic visual data that are generated from interactive 3D graphics or imaging programs within a multimedia environment. These applications impose extensive computational and communication costs that cannot be supported by current network bandwidth. Thus, suitable techniques have to be devised to allow flexible sharing of dynamic visual data and activities in real time. This chapter first discusses important issues that need to be addressed from four perspectives: functionality, data, communication and scalability. Current approaches for dealing with these problems are then discussed, and pertinent issues for future research are identified.
INTRODUCTION
Much work has been devoted to the development of distributed multimedia systems in various aspects: storage, retrieval, transmission, integration and synchronization of different types of data (text, images, video and audio). However, such efforts have concentrated mostly on passive multimedia material which has been generated or captured in advance. Yet, many applications require active data, especially 3D graphics, images and animation, that are generated by interactively executing programs during an ongoing session of a multimedia application. For example, a user of an educational multimedia system may wish to simulate and analyze the behaviour of a scientific or economic phenomenon by varying the values of variables or parameters in application programs in order to view the effects. Similarly, a medical practitioner may wish to perform some image manipulation and processing in a specific manner to illustrate certain ideas to colleagues during a cooperative diagnosis or consultation session.
A common practice is to generate graphical or image outputs for each instance in advance, and to store and recall them for display as required. A more effective way to gain some insight would be to allow the user to run the programs on the fly and to change the values of the parameters at will, or to select the values according to the results of previous runs. Such activities also need to occur in real time. The main obstacle to this mode of usage is the limitation of current network bandwidth, which does not allow large amounts of graphical or image data to be transmitted in real time. This problem is exacerbated if such a multimedia system is to be operated in a collaborative environment where a number of participants wish to share the knowledge by running programs with different parameters and discussing results with each other.

While the boundary between computer graphics and image processing or analysis was traditionally strong, recent progress has gradually blurred this distinction as many techniques used in one area have been adapted for the other. In addition, some applications require both of these types of data and tasks. For example, some medical applications need both 2D digital images in various modalities (X-ray, MRI, PET, etc.) and volumetric data (e.g., 3D models of anatomical parts to aid surgery). A virtual environment such as a virtual museum may be constructed by combining 3D graphical modeling with the montage of 2D digital images. A visualization and analysis system for supporting mineral exploration requires the capability of both 3D geological modelling and processing of satellite images of terrains. Thus, it is pertinent to deal with both these types of tasks and data. Within the context of this chapter, we use the term 'visual data' to denote both digital images and graphical data, and discuss them together if the matter is applicable to both, while pointing out the differences whenever appropriate.

Computer-supported cooperative work (CSCW) has emerged as a popular and useful area of research, where the main focus has been on the design and implementation of appropriate architectures and tools to support the coordination of the interactions between participants. However, most collaborative applications dealt with so far are of a general-purpose type, such as brainstorming sessions to assist with group decision making using general shared tools (e.g., calendar, editors, spreadsheets, drawing board) (Stefik et al., 1987; Tou et al., 1994; Rheinhard et al., 1994; Brown et al., 1996; England et al., 1998). These applications also do not require intensive computational cost for the generation of large volumes of data or high bandwidth for their transmission over the network. To support collaborative applications which involve dynamic visual data, we need to consider how to distribute not only application data, but also rendering, processing, analysis and display tasks across multiple machines. The majority of existing collaborative work which involves interactive graphics is focused on specific applications, particularly on distributed virtual environments (DVEs) (Calvin et al., 1993; Elliot et al., 1994; Stytz, 1996; Macedonia, 1997) and immersive environments (Poston and Serra, 1994).
The main aim of these systems is to provide facilities for designing and handling objects and activities required for virtual environments; hence some of these facilities are not suitable for general-purpose graphics applications. There are also other pertinent issues inherent in such collaborative environments besides those of computational and communication costs, for example, how to model the spatial aspect of interactions in order to devise appropriate methods for supporting flexible and dynamic sharing of data and activities (Hagsand, 1996). However, this chapter only deals with the problems arising from extensive computational and communication requirements. Our intention is to provide a survey of current approaches for dealing with these problems, analyze their effectiveness and discuss the scope for future work in this area.
We first discuss the different collaborative modes, and then address the major issues raised by collaborative applications that involve dynamic visual data. These issues are viewed from four perspectives: functionality, data, communication and scalability. Methods for dealing with collaborative and interactive 3D graphics are classified into two main categories: the integrated and plug-in approach using graphics exchange formats, and the distributed approach using distributed graphics libraries. In view of their special characteristics, methods for dealing with collaborative runtime image processing and analysis are dealt with in a separate section. Finally, remaining issues for future research are identified and discussed.
COLLABORATIVE MODES
In a collaborative environment, the mode of collaboration may vary with different situations, depending on how tasks are synchronized, how data and tasks are communicated, and how aware each participant is of what the other participants are doing. Participants may perform tasks in a synchronous, asynchronous or sequential manner. They may carry out discussions using only one application program (single-tasking) or many application programs (multi-tasking). Each participant may see what another participant is doing (collaboration-aware) or may not for certain tasks (collaboration-unaware). Participants may also need or choose to share some data (public data) or not to share some data (private data). The main problems are not only to allow these situations to occur, but also to manage data consistency and to improve the efficiency of communication.

A common approach to managing collaboration within a network of heterogeneous computers is to use a client-server architecture, where the server is in charge of basic communication and client management, while the application clients provide user interfaces and application tools to perform activities. The clients are generally considered active components, while the server is a passive component because it waits for a request from a client for a service to be performed. Participants are free to register for a collaborative session and to terminate whenever they wish. Hence, the server needs to keep track of all events in order to inform newcomers of what has been discussed if they request it. In addition, the history records of collaborative sessions can also be very useful for certain purposes such as training and evaluation.

The most straightforward case occurs when participants perform tasks in a synchronous, collaboration-aware and single-tasking mode. All data for discussion in this case is also shared by all participants. Conflicts due to simultaneous requests from a number of participants may be resolved by a token mechanism controlled by the server, where a token is given to a participant when permission to perform a task is requested. When this task is completed, the token is passed on to another participant, who then issues the next request. This mode of collaboration is the most common, and is suitable for many situations such as consultations between residents and senior doctors, or between medical specialists. However, in other situations, it is desirable to allow other modes of collaboration. For example, in a remote learning situation, a student may wish to communicate with an instructor privately, while other students are exploring different tasks. Thus, the environment also has to support multi-tasking, asynchronous tasks, collaboration-unaware tasks and selective sharing of data.
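As an illustration of the token mechanism described above, here is a minimal server-side sketch; the class and method names are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// The server grants the floor to one client at a time and queues
// simultaneous requests until the current holder releases the token.
final class TokenManager {
    private String holder;                       // client holding the token
    private final Queue<String> waiting = new ArrayDeque<>();

    synchronized void request(String clientId) {
        if (holder == null) holder = clientId;   // grant immediately
        else waiting.add(clientId);              // otherwise queue the request
    }

    synchronized String release(String clientId) {
        if (!clientId.equals(holder)) throw new IllegalStateException();
        holder = waiting.poll();                 // pass the token on
        return holder;                           // next holder, or null
    }
}
```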
The local mode allows a participant to work independently without sharing data or work processes with others. This mode is useful when a participant wishes to investigate a certain line of thought and knowledge before forming firm ideas for exchange with others. The selective cooperative mode allows a participant to nominate specific participants with whom to communicate in a synchronous manner, and to share data and tasks. The remaining participants are unaware of these activities. In some situations, participants may wish to perform certain tasks in an asynchronous fashion. For example, they may wish to run a program to extract features or regions of interest for a later discussion, while carrying out a synchronous discussion with each other.

The peer-to-peer communication may be achieved directly via message managers, or indirectly via a single server or a number of distributed servers. While the former approach reduces latency and avoids bottlenecks that are caused by simultaneous requests to the servers, the latter approach has a number of advantages: it has a lower overhead and can offer a simpler process and a tailored service for each client. Although most current network communication still uses a point-to-point or unicast approach, multicast communication, which allows a single message to be sent to a number of machines belonging to a multicast group, has also been available for the last few years. While this approach reduces bandwidth requirements by avoiding duplicated packets, it does not cater for individual detailed requirements.
COLLABORATIVE APPLICATIONS INVOLVING DYNAMIC VISUAL DATA: MAJOR ISSUES
Applications that involve interactive 3D graphics and run-time image processing and analysis are often computationally intensive and require large amounts of data. This creates a number of serious problems when trying to share tasks and data in real time. We now discuss these problems from four perspectives: functionality, data, communication and scalability.
Functionality
There are a number of major applications that require the capability of real-time dynamic generation and processing of visual data in a collaborative environment: co-design, collaborative visualization and simulation, virtual workspaces, cooperative medical diagnosis and co-planning for surgery. Although the objects of interest and their representation formats differ in these applications, there are many common task requirements, which may be described in the following framework. Remote clients should be able:
• to have fast and reliable remote access to graphical models, digital images and other relevant information (e.g., structure, attributes, viewing and other application-dependent parameters);
• to share a graphical scene or a hierarchy of image operations in an unambiguous manner, and at different levels of granularity and hierarchy;
• to efficiently perform interactive tasks (e.g., modify graphical components; perform image filtering) on local machines, as well as requesting specific tasks to be performed efficiently on the server;
• to point, annotate, chat, and access reports and session history;
• to retain autonomy by electing to collaborate or not to collaborate on any parts of the graphical models or digital images and at any specific time (absolute synchronization, where each client transmits its state to all other clients for each update, is not only costly but also restrictive);
• to work on different computer platforms and communication networks.
Data
Data to be communicated consist not only of different types of visual data (e.g., graphical components, 2D images, image volumes of different formats), but also other data which contain information on specific tasks (e.g., when and how they should be performed). Data sharing may occur at various levels depending on the application requirements and available bandwidth:
• Only output images (with specific viewing parameters or other parameters) are sent to other hosts at their request. This mode of communication is simple and allows proprietary data and methods to be kept private, but limits the interactive ability.
• Raw data is shared by downloading at the start of the collaboration. This requires complete trust between participants.
• Raw data is stored at the central server, but participants are allowed to share visualization geometric primitives or image operations.
Common requirements for real-time sharing of dynamic visual data include:
• continuous flow of information between different application programs across multiple platforms and multiple networks;
• visual data are kept in one place (the server) and allow easy access;
• data representations should be designed to facilitate exchange and minimize network resources;
• the amount of data required to be replicated or distributed to remote clients should be kept to a minimum;
• graphical data should be refreshed and updated efficiently and consistently;
• local graphical and imaging operations should be allowed, and both synchronous and asynchronous modes should be provided.
Communication
Delays in interactive applications are caused by many factors: input/output devices; computational and rendering tasks; transfer time from data to hardware, then to the display screen; synchronization of tasks; waiting time for data between processing and successive updates; and the refresh frame rates of the display hardware. The following issues, which are applicable to communication in general, are also relevant for our requirements:
• network resources should be minimized, as all communications need to be done in real time;
• a mechanism should be available to notify remote clients of any change to distributed states;
• delivery should be reliable, consistent and secure;
• bottlenecks, deadlocks and conflicts should be avoided.
Scalability
As a collaborative multimedia environment may be shared by a very large number of participants, it is essential to address how well the environment can perform with an increasing number of participants.
The following factors have significant effects on scalability:
• complexity of tasks and data for communication,
• level of collaboration awareness,
• variability of computer platforms,
• network latency.
We now discuss methods for dealing with interactive 3D graphics and animation in collaborative applications.
COLLABORATIVE AND INTERACTIVE 3D GRAPHICS AND ANIMATION
Commonly used high-level graphics libraries such as OpenInventor (Wernecke, 1994) and Java3D (Sowizral et al., 1998) only support stand-alone applications. OpenInventor is an object-oriented 3D toolkit that facilitates application development as well as data movement between applications via its 3D interchange file format. Unlike OpenGL, where rendering is explicit by direct access to a frame buffer or by playing back a previously recorded display list, rendering in OpenInventor is performed by invoking the rendering operations encapsulated in objects (e.g., changing the lighting model, drawing style, transformation). A 3D scene can be created by constructing a 3D scene database which consists of these editable objects. OpenInventor takes advantage of the concept of scene graphs to encode the application state, so that the task of planning and keeping track of changes in graphical models is simplified and mistakes may be avoided. As this approach was also adopted later by a number of high-level graphics libraries with various minor variations, it is described here in some detail.

A scene graph is made up of nodes of three primitive types: shape nodes containing geometric data, property nodes for attributes such as material and colour, and group nodes containing pre-built groupings of nodes. An action may be performed on a scene by creating an instance of an action class and applying it to the root node of the scene graph, and then to other nodes by traversing the graph from top to bottom and left to right. To deal with dynamic behaviour, two more special types of objects are also provided: engines and sensors. Engine objects describe how other objects are connected or constrained (e.g., for animation), so that both motion and geometry are encapsulated in a single class. This allows active behaviour to be included in the description of an object, so that whenever the file containing this description is read, this behaviour is displayed automatically (e.g., a rotating wheel). Sensor objects detect when a change has taken place in the scene database and call a function supplied by the application. A user-defined callback function is invoked in the application whenever some changes are detected. Thus, 3D objects only move according to user interaction, sensor activity or as dictated by the engine attached to the scene graphs of the objects. Such an object-oriented approach facilitates the construction of complex and dynamic graphical objects and scenes.

Although a 3D graphics interchange file format is provided by OpenInventor to allow data export and import between applications, it does not allow a model to be shared or edited by multiple remote users. As the Internet gained acceptance as the major platform for exchanging shareware in the form of a variety of media, VRML (Virtual Reality Modeling Language) was developed to allow 3D graphical scenes to be constructed in a hierarchical fashion and exchanged via VRML files (Ames et al., 1997).
These files contain instructions (in text) for the VRML Web browser to draw 3D graphics objects. Another language to facilitate the development of 3D graphics in a Web environment is Java3D, which provides a high-level layer over the graphics libraries OpenGL and Direct3D. Java3D also uses a scene graph hierarchy similar to those provided by VRML and OpenInventor, which defines a complete description of a scene (model data, attributes, viewing parameters, transformations). As Java3D has cross-platform, threading and garbage collection capabilities, it has more appeal than VRML. However, both VRML and Java3D only deal with stand-alone applications and are not suitable for use in distributed and collaborative applications.

Early collaborative 3D graphics applications separated the task of graphics rendering from other tasks. For example, in CAD/CAM applications, data describing design prototypes were sent to a design centre where rendering took place and the resulting image was sent back. To manipulate the model or to navigate through a virtual scene, parameters were also sent to a high-end graphics workstation for generation (Macedonia and Noll, 1997). An alternative approach was to reduce image resolution or frame rate. In the case of Visinet (Lamotte et al., 1997), which was created to provide a platform on which designers, architects, city planners and engineers in Europe can work together, 3D scenes were sent through ATM, using black-and-white or dithered images. Later work on distributed virtual environments involves diverse applications such as office management, battle simulation, collaborative design, collaborative visualization and animation (England et al., 1998; Pang and Wittenbrink, 1997; Faure et al., 1999; Kindratenko and Kirsch, 1998).

Multicast protocols allow any participating site to receive a packet over a network without the communication endpoints needing to be known ahead of time. Multicast network traffic is more efficient than unicast because it only increases at a linear rate with the number of sites instead of a quadratic rate. However, as multicast technology is still not widespread due to its high cost, deploying multicast to distribute graphics scenes is still comparatively rare. Most work to date therefore falls into two main categories: the integrated and plug-in approach using exchange graphics formats, and the distributed approach using distributed graphics models.
Integrated and Plug-in Approach
This approach takes advantage of exchange graphics formats such as OpenInventor and VRML models as a means to communicate between different applications, or within a specific application which runs across inhomogeneous platforms. To minimize the traffic, one practice is to replicate graphical objects at all local sites and only transmit their modifications as they are generated (Holbrook et al., 1995). The main problem here is how to keep track of and maintain the consistency of these duplicate databases. In a collaborative environment where participants work on a common graphical scene (e.g., in OpenInventor format) which is replicated at each local machine, any change to the graphical scene initiated by one participant may be communicated to other participants by taking advantage of the concept of sensor objects. In single-user mode, a sensor and an appropriate callback function are first constructed. A sensor is then attached to a field, node or path as required. If changes are detected by the sensor, then a callback to the application is executed. In the collaborative mode, one extra task is added: in addition to the callback to the application on the local host, the detected changes are broadcast to the server and other clients in order to invoke a similar callback to the application on remote hosts. As sensors are attached to fields, nodes or paths in a scene hierarchy, messages that describe current changes can also be constructed in a hierarchical fashion by systematically following one branch of the hierarchy from the root to a leaf.
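The extra collaborative step can be sketched as follows; OpenInventor itself is a C++ toolkit, so this Java fragment only illustrates the idea, and the Broadcaster helper is hypothetical:

```java
// A change detected in the scene database triggers the local callback
// (single-user behaviour) plus a broadcast to the server and other clients.
interface SceneChange { String nodePath(); byte[] delta(); }

interface ChangeCallback { void onChange(SceneChange change); }

interface Broadcaster { void send(SceneChange change); }

final class CollaborativeSensor {
    private final ChangeCallback localCallback;
    private final Broadcaster broadcaster;

    CollaborativeSensor(ChangeCallback cb, Broadcaster b) {
        localCallback = cb; broadcaster = b;
    }

    // Invoked when the watched field, node or path changes.
    void changed(SceneChange change) {
        localCallback.onChange(change);   // single-user behaviour
        broadcaster.send(change);         // extra collaborative step
    }
}
```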
One essential requirement of a collaborative environment is to provide a mechanism for participants to know each other's viewpoints. The system Cspray, which is dedicated to collaborative visualization, uses the metaphor of a 'spray can' to allow participants to indicate their area of visualization and the features to be observed (Pang and Wittenbrink, 1997). This is done by constructing tools to spray these areas with fine particles and tools to highlight the features of interest. An 'eyecon', which gives information on the location and orientation of each participant, also assists with the sharing and interpretation of view-dependent images. The strength of this approach lies in its ability to allow participants to share data in a selective manner according to their needs and their resource capability (e.g., graphics capability). Confidentiality of raw data may also be sustained by sharing and updating only geometric primitives via a spray can that connects to a relevant data set. Access rights to the spray can are regulated by floor control policies, which separate managing and creating a shared spray can from using it by manipulating parameters. This system does not provide any special methods for dealing with the exchange of a high volume of graphical data, except via the compression and transmission of geometric primitives. However, it was reported from a number of public demonstrations of the system that the lag time between mouse events and display activities was tolerable (a couple of seconds for up to four participants over the Internet), but that it was difficult to coordinate participants and, in particular, to familiarize them with the complex interfaces and capabilities.

Another system for developing multiuser 3D virtual environments is mWorld (Dias et al., 1997), which uses the generic OpenInventor models and extends them by including a class of deformable objects. To create the key-frame animation of these deformable objects, all surface deformations are first computed by applying user-defined parameters. The OpenInventor engine is later instructed to switch between these surface deformation instances, which are contained in the nodes of the scene graph, and to render them. Thus, key-framed animation can be integrated into a multiuser virtual environment. In contrast with the Cspray system, mWorld follows the WYSIWYG (What You See Is What You Get) approach and uses the JESP (Joint Editing Service Platform) for session control and data communication. It supports all three types of collaboration: centralized (by global synchronization), distributed (by serialized events) and hybrid (application resources located locally and global communication), by providing different protocols for three separate layers: application, session manager and group communication (Figure 1). Consistency is achieved by using token-passing mechanisms. This generic framework can be easily adapted for other applications, and the separation of application and communication functions also facilitates the use of different network protocols with minimum changes.

Another example of a generic functional architecture for the development of multiuser 3D environments was presented by Lin and Smith (1999). This framework integrates existing modeling and simulation systems by grouping tools according to their functionality (e.g., storing, graphical, logical, spatial tools) and providing interfaces between them.
The input and output models are processed and checked for consistency in a pipeline fashion through different functional layers, each of which may contain multiple tools. This integrated environment can also readily be implemented as a distributed network of tools, where a tool in a higher functional layer may act as a server to tools in a lower functional layer (clients). Thus, existing software packages may be reused and participants do not need to acquire common software. To make data readable to different components of the integrated system, either data translators or exchange graphical data formats such as VRML are required.
In collaborative mode, both entity data and corresponding transformation rules need to be broadcast to participants. As this framework is organized in a modular fashion, it also allows flexible modification and extension.

Another way to cope with the diversity of graphics software is to design a special language to provide a common platform on which participants can plug in their own applications and run them locally. The interaction is carried out by using an exchange graphics format. An example of such a language is PAVRML (Faure et al., 1999), which provides a common distributed environment for animation based on VRML97 using a client-server architecture. A global database of the virtual world, which is kept by the server, is updated according to messages sent by participants who act as clients (Figure 2). The main advantage of this approach is that participants are not required to use the same graphics or animation software; furthermore, they need not know about each other's software and applications. Thus, legacy software may be used and the confidentiality of proprietary software can be protected. Some commercial software packages for collaborative applications also use this approach. An example is CoCreate, a software package released by Hewlett-Packard for collaborative CAD modelling (CoCreate, 2000). By using VRML as the underlying format for its 3D model viewer FirstSpace, it allows different CAD packages to be used. While geometric data is kept in the server, the graphical representation and structure information of the model are sent to clients to be rendered locally.

As the centralized client-server model usually does not scale well with constant streams of visual data updates, a common practice is to distribute control, to replicate application resources and session control at all sites, and to create tele-events in order to share the application. These tele-events, which encapsulate the descriptions of tasks in small messages, are transmitted to other sites, to be parsed and processed there locally. As visual data and application resources are replicated at the start of the collaborative session, time delay is only due to the transmission of these small messages, which is negligible. Table 1 provides a summary of the methods discussed, together with their main advantages and disadvantages.

Figure 1: Different Protocols (each participant's stack comprises an application, a session manager and a group communication layer, with separate protocols per layer, communicating over the Internet)

Figure 2: The Server (clients hold local databases and exchange VRML models and updates with the global database kept by the server over the Internet)
Table 1: Methods using the integrated and plug-in approach

Holbrook et al., 1995
Special characteristics: Graphical objects replicated locally; changes detected by sensors and communicated via callback functions.
Advantages: Messages built in a hierarchical manner.
Disadvantages: Hard to maintain consistency; poor scalability (slow response as the number of participants increases); cannot interactively edit graphical models.

Cspray (Pang & Wittenbrink, 1994, 1997)
Special characteristics: Provides a tool to identify a client's area of visualization; selective sharing of geometric primitives.
Advantages: Matches the service level to clients' requirements and resources; allows data confidentiality.
Disadvantages: Complex interfaces; cannot interactively edit graphical models.

mWorld (Dias et al., 1997)
Special characteristics: Three layers with separate protocols (application, session managers, group communication); extended OpenInventor models; integrated key-framed animation.
Advantages: Generic framework for different applications; easy-to-change network protocols.
Disadvantages: Cannot interactively edit graphical models.

Lin & Smith, 1999
Special characteristics: Groups tools according to functionality; entity pipelines with transformation rules to transform one entity layer into another; collaboration by broadcasting entity data and transformation rules.
Advantages: Generic framework; reusable software; flexible; extensible.
Disadvantages: Cannot interactively edit graphical models.

PAVRML (Faure et al., 1999)
Special characteristics: Special language providing a common distributed environment for animation.
Advantages: Allows software reuse; no need for common software.
Disadvantages: Cannot interactively edit graphical models.

HP CoCreate, 2000
Special characteristics: Allows different CAD packages to be used, communicating via VRML models.
Disadvantages: No data confidentiality; cannot interactively edit graphical models.
The integrated or plug-in approach avoids the need for extensive development of new software by allowing graphical output produced by existing software to be shared via graphical exchange formats. However, it restricts the flexibility of interaction. Ideally, participants would wish to be able to interactively modify graphical scenes at different levels of granularity and in a seamless manner. This is the motivation behind the distributed approach discussed in the next subsection.
Distributed Graphics Model Approach
There have been a few attempts to address the problem of sharing and distributing 3D graphics applications in order to improve performance. These methods differ in the types of data to be distributed, and in where and how they are distributed. For example, Performer (Rohlf and Helman, 1994) supports the distribution of 3D graphics rendering components across multiple processors, but not across multiple machines. Other software systems such as the Visualization Data Explorer (Lucas, 1992) or ATLAS (Fairen and Vinacua, 1997) allow different machines to share application data, but not to share graphical scenes. In these cases, the rendering task is allocated to one machine, although the results may be displayed on more than one machine. Neither of these approaches is suitable for interactive editing of graphical scenes and sharing them remotely at different levels of granularity.

To address this problem, Elliott et al. (1994) used a constraint graph to represent a graphical scene, where the constraints are based on those time-varying relationships that describe the changes in the scene. This framework allows users to share interactive and animated graphics applications across different machines by sending this constraint graph, after it is modified by a user, to other machines. As every processor must evaluate all constraints regardless of their relevance, this approach is inefficient and lacks flexibility because individual requirements are not catered for.

MacIntyre and Feiner (1998) overcame these drawbacks by using a language-based approach. They designed a high-level object-oriented graphics library called Repo-3D to allow rapid prototyping of distributed applications. This system, which introduces a new distributed programming language called Repo, is an extension of a 3D animation package written in Modula-3. Modula-3 was chosen because its capability to support threads and garbage collection facilitates the implementation of the distribution of objects and their removal when they are no longer required. Similar to OpenInventor and Java3D, both Repo-3D and Anim-3D also use a scene graph model which consists of graphical objects and properties, together with input event callbacks that are used for dealing with interactive behaviour. The main difference is that there is no file format for storing graphical objects. Instead, graphical objects are stored as Repo-3D programs, which are executed for their creation. This property was exploited to gain efficiency, because both the programs and the graphical objects are simultaneously updated after the objects are distributed. These graphical objects reside in a distributed shared memory and may be replicated in a similar manner as non-graphical data whenever required. This shared memory is organized by using the Shared Data Object Model introduced by Levelt et al. (1992), which allows both graphical and non-graphical data to be encapsulated in user-defined objects. These objects are subsequently shared by invoking the objects' remote method calls, which are very similar to those of Java RMI (remote method invocation) (Horstmann and Cornell, 1998). This data sharing approach provides users with the flexibility to decide which parts of the graphics models are to be shared, while at the same time allowing interactive activities and modifications of graphical scenes to be performed locally if required. The graphics objects and properties represent the common logical entities and their attributes in a graphical scene.
To encapsulate different types of dynamic behaviour, one additional attribute was introduced to denote the dynamic behaviour of a property (whether it is constant over time, changed at a specific time, changed according to a specified function of time, or dependent on the occurrence of another event). To minimize the inconsistencies that often occur while local variations are being performed and both local and global graphics objects are being updated, Repo-3D limits the types of local variations that may be applied to property values and the nodes of the graphical scene. The system takes care of the communication and synchronization of components via the use of input event callbacks, replication of objects and notification of changes to distributed objects. Thus, users are left to focus on the functionality of the application. One major drawback is that, unlike Java3D, Repo-3D was not built for different platforms. Furthermore, it is a research system and not widely available for public use. However, the approach used can be readily adapted for other object-oriented languages (e.g., Java3D). Table 2 gives a summary of these methods.
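Since the chapter notes that Repo-3D's remote calls resemble Java RMI, the style of sharing can be illustrated with a standard RMI interface; the node interface and its methods are illustrative, not Repo-3D's actual API:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// A shared scene node exposed through remote method calls: a client mutates
// the object through the interface, and the system propagates the change to
// the replicas rather than exchanging scene files.
interface SharedSceneNode extends Remote {
    void setProperty(String name, byte[] value) throws RemoteException;
    byte[] getProperty(String name) throws RemoteException;
    SharedSceneNode child(int index) throws RemoteException;
}
```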
COLLABORATIVE IMAGE PROCESSING AND ANALYSIS
While many diverse collaborative applications which require interactive 3D graphics have been studied, telemedicine appears to be the only notable collaborative application where the capability for runtime image processing and analysis has been investigated so far (Ando et al., 1999; Aubry et al., 1999; Pham, 2000). However, it is envisaged that this capability would benefit other sectors such as mineral exploration and applications that use GIS (Geographic Information Systems), where professionals (often of different expertise and background) need to process images on the fly in order to illustrate their ideas and analysis to each other.

Application programs which involve 3D graphics or image processing and analysis tasks are usually organized in a modular fashion, where each block of code is dedicated to a specific task and the combination of these tasks makes up an application. The combination can be achieved in a hierarchical or a pipeline fashion. In the first type, a program is composed of sub-programs, each of which is composed of sub-subprograms, and so on. For example, different parts of a medical image are processed to extract different types of features which could support a diagnosis when viewed in combination. In the second type, an application may consist of a sequence of programs, where the result of each program is used as the input for a subsequent program. For example, a sequence of tasks needs to be carried out to extract the contours of vertebrae from an X-ray image of a cervical spine. Edge detection is first done on the image, and the result is then thresholded to obtain edge points. The binary image resulting from thresholding is subsequently used to detect the contours of the vertebrae.

In a generic telemedicine environment, application programs are more diverse and may span both areas of computer graphics and image processing and analysis. For example, a telemedicine application may need a collection of tasks including: image enhancement, image manipulation, transformation, particular feature extraction, segmentation, quantification, 3D image reconstruction and extraction of regions or volumes of interest. The retrieval of images and similar cases including image data is also another common application. One current trend is to adopt the concept of "user-in-the-loop," with the aim of improving the results of these tasks by allowing users to interactively control the process at the appropriate stage.
Table 2: Methods using the distributed graphics model approach

Performer, 1994
Special characteristics: 3D graphics components distributed across multiple processors.
Disadvantages: No distribution across multiple machines.

Visualization Data Explorer, 1992; ATLAS, 1997
Special characteristics: Distributed across multiple machines; rendering performed on one machine.
Advantages: Efficient.
Disadvantages: Shared application data, not graphical scenes; cannot interactively edit and share graphical scenes.

Elliot et al., 1994
Special characteristics: Graphical scenes represented by a constraint graph.
Advantages: Shared interactive and animated graphics.
Disadvantages: Inefficient due to uniform constraint evaluation; allows no customised features.

MacIntyre & Feiner, 1998
Special characteristics: High-level object-oriented graphics library using scene graph models; input event callbacks; graphical objects stored as Repo-3D programs.
Advantages: Distributed shared memory for both graphical and non-graphical data; efficient, flexible sharing modes; no need for an exchange file format.
Disadvantages: Not built for different platforms; limited types of local graphical variation to ensure consistency; language is not widely used.
For example, physicians may use their own contextual and interpretive knowledge to guide the segmentation and visualization of regions and volumes of interest. This approach is worthwhile, but certainly harder to cater for in a shared environment. In these cases, not only the task but also a specific participant's intentions on how to carry out the task need to be communicated to other participants. Hence, there is a need to devise a method for the efficient handling of both types of programs in a shared environment.

Although those methods for dealing with interactive 3D graphics (discussed in the previous section) were designed for different applications, they can be readily adapted for telemedicine applications. Current bandwidth cannot support the real-time transmission of a large amount of image data; hence a current practice is to download all required images for a consultation session from the server in advance. Whenever a program is executed by a participant to perform some tasks on an image, a message is sent to the other participants via the server in order to invoke the same operations on their own computers. Thus, the problem is reduced to that of designing small messages that can encapsulate the description of each task together with the associated information required for the task to be performed (Pham, 2000). Figure 3 displays the architecture of such an example system.
Figure 3: The architecture. [Figure: each CLIENT comprises an image viewing/drawing environment, a chat text display environment, image query formation/manipulation controls, and image processing/manipulation programs; the SERVER comprises a communications manager, a case-base query manager and a database query manager, backed by a case library and an image/feature database. A message M generated at a client is distributed to all clients via the communications layer; image requests from clients are answered by image broadcasts from the server.]
Commonly used general-purpose image processing libraries such as KHOROS (1999) are modular in nature. They usually provide a software integration and development environment to facilitate the development of application programs. The libraries are built on a set of functions, each of which is designed to perform a specific imaging task with appropriate arguments. These arguments include input and output files as well as relevant parameters required for the task, which may be replaced by user-specified values whenever a function is called. Image processing programs that are developed from scratch, without using an image processing library, are also generally written in this fashion. Thus, an application program which requires various image processing tasks to be performed may be formed from these function calls. In such a case, we can construct a message that describes specific operations and executable environments from 'message primitives' in a hierarchical or pipeline manner. Each message primitive may be expressed in a syntax similar to that of a UNIX command together with its arguments. Sample templates for these message primitives may be found in Pham (2000). At a remote host, these messages are parsed and mapped back into executable commands, and the remote host invokes these commands in order to execute the same application tasks. As these messages are very small, the lag time that occurs when the same task is executed on different machines is negligible. Thus, participants feel that they can obtain and view the results in real time.
A client-server architecture can be used in this case, where the server is in charge of basic communication and client management, while the application clients provide user interfaces and application tools to perform activities. The communication can be implemented using Java networking and remote method invocation (RMI) tools (Courtois, 1997). The synchronization of tasks and clients can be achieved by using the concept of multi-threading, where each thread consists of a block of executing code and associated variables. Thus, programs can be executed simultaneously, and tasks can be synchronized at different speeds. To ensure the consistency of data and tasks, threads are also grouped together in a thread group in the order in which they should be executed. To improve efficiency, an asynchronous callback system is used with the help of a thread pool. All threads in the pool are created at initialization and put to sleep. When the server receives an item on a callback method (whose role is to distribute the task to all connected clients), a thread is notified so that it can distribute the item to all connected clients. Thus, the method can immediately return control to the calling client.
Other types of dynamic visual data that are very useful in collaborative applications are telepointers and interactive free-hand drawings, which provide mechanisms for participants to point at or mark specific items or areas of interest. In particular, as they simulate essential tools usually used by medical specialists to discuss images with each other for diagnostic purposes, it is worthwhile to include them in this section. For telepointers, the positional coordinates of a client's cursor are sent to other clients via the server. For interactive free-hand drawings, the number of points generated by a client is first reduced based on the rate of change of orientation, and the remaining points are sent progressively to other clients via the server. These data points are subsequently used for generating smooth spline curves at each client's local computer. Due to their simple nature, these two functions can be implemented as special threads that run concurrently with the main application. The draw-thread uses a first-in, first-out buffer to store incoming drawing or telepointer data, thus enabling the remotely called draw method to return immediately because it does not have to wait until the drawing is completed. When the draw-thread receives a new drawing, it is notified in order to read the buffer and draw item by item until the buffer is empty. The thread then returns to sleep status. The draw-thread can handle telepointer data from several connected clients by using a separate buffer for each client. This allows these clients to perform telepointing concurrently without causing problems for the display of all telepointers.
The distribution of objects can be handled via a client's broadcast-thread and the main GUI-thread in the server. It is the server's task to broadcast all these data items to all clients connected to it. The client's broadcast-thread contains a buffer for each type of data item. When a user draws an item, this item is written into the broadcast-thread's buffer, and the thread running in the background will broadcast the buffered data until the buffers are empty. This approach allows the calling method (in this case, the mouse handler that initiates the drawing) within the client's main thread to return control immediately to the GUI because it does not need to wait for the completion of the broadcast process.
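As a sketch of how such small task messages might be built and parsed, consider the following Java fragment. The pipe-separated, UNIX-command-like syntax and the command names (vedge, vthresh) are illustrative assumptions patterned on the description above, not the actual templates of Pham (2000).

    import java.util.ArrayList;
    import java.util.List;

    public class TaskMessage {

        // Build one primitive, e.g. "vthresh -in e.img -out b.img -level 128".
        static String primitive(String command, String... args) {
            StringBuilder sb = new StringBuilder(command);
            for (String a : args) sb.append(' ').append(a);
            return sb.toString();
        }

        // At the remote host, parse the message back into executable commands.
        static List<String[]> parse(String message) {
            List<String[]> commands = new ArrayList<>();
            for (String prim : message.split("\\|")) {
                commands.add(prim.trim().split("\\s+"));
            }
            return commands;
        }

        public static void main(String[] args) {
            // A pipeline of two (hypothetical) imaging commands joined with "|".
            String msg = primitive("vedge", "-in", "spine.img", "-out", "e.img")
                       + " | "
                       + primitive("vthresh", "-in", "e.img", "-out", "b.img", "-level", "128");
            // Each parsed command would be mapped to a local function call and
            // executed in order, reproducing the sender's pipeline.
            for (String[] cmd : parse(msg)) {
                System.out.println(String.join(" ", cmd));
            }
        }
    }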
CONCLUSION AND FUTURE RESEARCH
The needs for resource rationalization and productivity improvement have created an impetus for the development of collaborative workspaces. Dynamic visual data are increasingly required not only for presentation and marketing purposes, but as core material on which many important applications are based. Yet current bandwidth is not sufficient for direct transmission of large amounts of dynamic visual data to allow remote participants to collaborate as if they were sharing the same application in the same location. Furthermore, there is an increasing demand for taking these applications to mobile and wireless environments that offer even lower bandwidth (see, for example, Beltz et al., 1998). Thus, the problem of how to share such dynamic visual data in an efficient and effective manner is an important one. The methods presented in this survey have addressed this problem in three main ways: indirect communication of graphical modifications through graphics exchange formats, direct modification of distributed graphical objects, and indirect communication through small messages that encapsulate the operations and the executing environments of programs that produce visual data.
Although these methods have managed to deal with many important issues presented in the third section, the issue of scalability remains largely unexplored. In our experience, even with the two simplest tasks (telepointing and free-hand 2D drawing), the lag time increases significantly as the number of participants increases. This problem is aggravated as the complexity of tasks increases further, as required by all the above-mentioned applications. Thus, a number of issues remain for future research.
The first issue concerns the representations of visual data and the structure of programs for performing required tasks. Existing data structures and representations for 3D graphics models have been designed to suit the requirements of stand-alone applications. It is pertinent to ask whether we should redesign these representations to cater for collaborative applications. The Repo-3D system has offered a distributed model based on user specification. Can a scheme be found to automatically optimize the way the components of graphical objects are distributed and later re-combined in order to increase performance? Similarly, could we find a method to optimize the structures of application programs which produce these visual data (both 3D graphics and images) to facilitate the sharing of tasks as well as visual data? For a particular application, it is certainly feasible to identify subtasks that are usually performed by certain types of participants and to use this knowledge to optimize work distribution and enhance performance. A further question is whether it is possible to optimally adapt the distribution of work on the fly, as further knowledge becomes available.
The second issue is related to methods for coding and transmission of visual data. Are the standards provided by the SNHC (Synthetic/Natural Hybrid Coding) subgroup of MPEG-4 for efficient coding and transmission of graphics models and animation parameters suitable for collaborative applications and, if so, do they provide any extra benefits? Furthermore, should there be any standards for the distribution of graphical objects?
The last issue concerns the interplay of the two types of dynamic visual data (3D graphics and images) and their corresponding programs. So far, they have been treated independently of each other and in different ways. Yet an increasing number of applications deploy them in a seamless manner within a stand-alone mode. Could common representation schemes be found that enable the encapsulation of both types of data and programs so that they can be treated in the same way when used in collaborative modes?
REFERENCES
Ames, A. L., Nadeau, D. R. and Moreland, J. L. (1997). VRML 2.0 Sourcebook. New York: John Wiley & Sons.
Ando, Y., Kitamura, M., Tsukamoto, N., Kawaguchi, O., Kunieda, E., Kubo, A., Kohda, E., Hiramatsu, K., Sakano, T., Fujii, T., Okumara, A., Furukawa, I., Suzuki, J. and Ono (1999). Inter-hospital PACS designed for teleradiology and teleconference using a secured high-speed network. Proceedings of SPIE Medical Imaging 1999-PACS Design and Evaluation: Engineering and Clinical Issues, 420-429.
Beltz, C., Jung, H., Santos, L. and Strack, R. (1998). Handling of dynamic 2D/3D graphics in narrow-band mobile services. In Vine, J. and Earnshaw, R. (Eds.), Virtual Worlds on the Internet. Los Alamitos: IEEE Computer Society, 147-156.
Benford, S. D., Bowers, J. M., Fahlén, L. E., Mariani, J. and Rodden, T. R. (1994). Supporting co-operative work in virtual environments. The Computer Journal, 37(8), Oxford University Press.
Ben-Natan, R. (1995). CORBA: A Guide to the Common Object Request Broker Architecture. McGraw-Hill.
Brown, M. H., Najork, M. A. and Raisamo, R. (1996). Collaborative active textbooks: A Web-based algorithm animation system for an electronic classroom. Proceedings of the 1996 IEEE Symposium on Visual Languages, 197-198.
Calvin, J., Dickens, A., Gaines, B. and Metzger, R. P. (1993). The SIMNET virtual world architecture. Proceedings of the IEEE VRAIS '93, 450-455.
Chen, H., Nunamaker, J., Orwig, R. and Titkova, O. (1998). Information visualization for collaborative computing. Computer, 31(8), 75-81.
Chen, H. M. and Yun, D. Y. Y. (1998). MISSION-DBB: A distributed multimedia database system for high-performance telemedicine. In Wong (Ed.), Medical Image Databases, STC. Boston: Kluwer Academic Publishers, 283-302.
CoCreate. (2000). Available on the World Wide Web at: http://www.cocreate.com.
Courtois, T. (1997). Java 1.1: Networking & Communications. New Jersey: Prentice Hall PTR.
Dias, A. C. A., Belo, C. A. C. and Rebordao, J. M. (1997). mWorld: A multiuser 3D virtual environment. IEEE Computer Graphics & Applications, 17(2), 55-65.
Elliott, C., Schechter, G., Yeung, R. and Abi-Ezzi, S. (1994). TBAG: A high-level framework for interactive, animated 3D graphics applications. Proceedings of the ACM SIGGRAPH 94, 421-434.
England, D., Prinz, W., Simsarian, K. and Stahl, O. (1998). A virtual environment for collaborative administration. In Vine, J. and Earnshaw, R. (Eds.), Virtual Worlds on the Internet. Los Alamitos: IEEE Computer Society, 237-252.
Fairen, M. and Vinacua, A. (1997). ATLAS: A platform for distributed graphics applications. Proceedings of the VI Eurographics Workshop on Programming Paradigms in Graphics, 91-102.
Faure, F., Faisstnauer, C., Hesina, G., Aubel, A., Escher, M., Labrosse, F. and Nebel, J. C. (1999). Collaborative animation over the network. Proceedings of Computer Animation '99, Geneva.
Greenhalgh, C. (1999). Large Scale Collaborative Virtual Environments. London: Springer-Verlag.
Hagsand, O. (1996). Interactive multiuser VEs in the DIVE system. IEEE Multimedia, 3(1), 32-39.
Holbrook, H. W., Singhal, S. K. and Cheriton, D. R. (1995). Log-based receiver-reliable multicast for distributed interactive simulation. Proceedings of the ACM SIGCOMM '95, 328-341.
Horstmann, C. S. and Cornell, G. (1998). Core Java 1.1, Volume II: Advanced Features. California: Sun Microsystems Press.
Kindratenko, V. and Kirsch, B. (1998). Sharing virtual environments over a transatlantic ATM network in support of distant collaboration in vehicle design. Retrieved from the World Wide Web: http://www.ncsa.uiuc.edu/VEG/DVR/ve98/article.html.
KHOROS Version 2.1 Tutorial. (1999). Retrieved from the World Wide Web: http://splish.ee.byu.edu/tutorials/khoros/khoros.html.
Lamotte, W., Flerackers, E., Van Reeth, F., Earnshaw, R. and De Matos, J. M. (1997). Visinet: Collaborative 3D visualisation and VR over ATM networks. IEEE Computer Graphics and Applications, 17(2), 66-75.
Levelt, W., Kaashoek, M., Bal, H. and Tanenbaum, A. (1992). A comparison of two paradigms for distributed shared memory. Software: Practice and Experience, 22(11), 985-1010.
Lin, T. and Smith, K. (1998). A generic functional architecture for the development of multiuser 3D environments. In Vine, J. and Earnshaw, R. (Eds.), Virtual Worlds on the Internet. Los Alamitos: IEEE Computer Society, 85-100.
Liu, P. W., Chen, L. S., Chen, S. C., Chen, J. P., Lin, F. Y. and Hwang, S. S. (1996). Distributed computing: New power for scientific visualization. IEEE Computer Graphics & Applications, 16(3), 42-51.
Lucas, B. (1992). A scientific visualization renderer. Proceedings of the IEEE Visualization '92, 227-233.
Macedonia, M. R. and Noll, S. (1997). A transatlantic research and development environment. IEEE Computer Graphics & Applications, 17(2), 76-82.
Macedonia, M. R. (1997). A taxonomy for networked virtual environments. IEEE Multimedia, 4(1), 48-56.
MacIntyre, B. and Feiner, S. (1998). A distributed 3D graphics library. Proceedings of the ACM SIGGRAPH 98, Orlando, Florida, 361-370.
Pang, A. and Wittenbrink, C. (1997). Collaborative 3D visualization with CSpray. IEEE Computer Graphics & Applications, 17(2), 32-41.
Pham, B. (2000). Delivery and interactive processing of visual data for a cooperative telemedicine environment. Telemedicine Journal, 6(2), 261-268.
Poston, T. and Serra, L. (1994). The virtual workbench: Dextrous VR. Proceedings of the ACM VRST 94, 111-122.
Prakash, A. and Shim, H. S. (1994). DistView: Support for building efficient collaborative applications using replicated objects. Proceedings of the ACM CSCW '94, 153-162.
Reinhard, W., Schweitzer, J. and Volksen, G. (1994). CSCW tools: Concepts and architectures. Computer, 27(5), 28-36.
Rohlf, J. and Helman, J. (1994). IRIS Performer: A high performance multiprocessing toolkit for real-time 3D graphics. Proceedings of the ACM SIGGRAPH 94, 381-394.
Sowizral, H., Rushforth, K. and Deering, M. (1998). The Java 3D API Specification. Reading, MA: Addison-Wesley.
Stefik, M., Foster, G., Bobrow, D. G., Kahn, K., Lanning, S. and Suchman, L. (1987). Beyond the chalkboard: Computer support for collaboration and problem solving in meetings. Communications of the ACM, 30(1), 30-47.
Stytz, M. R. (1996). Distributed virtual environments. IEEE Computer Graphics & Applications, 16(3), 19-31.
Tou, I., Berson, S., Estrin, G., Eterovic, Y. and Wu, E. (1994). Prototyping synchronous group applications. IEEE Computer, 27(5), 48-56.
Wernecke, J. and Open Inventor Architecture Group. (1994). The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor, Release 2. Reading, MA: Addison-Wesley.
Chapter XVI
An Isochronous Approach to Multimedia Synchronization in Distributed Environments
Zhonghua Yang, Robert Gay and Chee Kheong Siew
Nanyang Technological University, Singapore
Chengzheng Sun and Abdul Sattar
Griffith University, Australia
In this chapter, we provide a new look at the synchronization issue in distributed environments. We attempt to apply the power of isochronous protocols, as advocated by Lamport, to multimedia synchronization. The approach is based on the use of synchronized physical clock time instead of any form of logical clock or sequence numbers, and thus clock synchronization across the distributed system is assumed. An isochronous protocol for achieving multimedia synchronization is presented. Derived from the globally synchronized clock, there exists a lattice structure in the system. Processes participating in a media conference execute a simple clock-driven protocol, and all significant events (the sending and delivering of media data) are restricted to occur at lattice points of the globally synchronized space/time lattice. This lattice structure greatly simplifies multimedia synchronization and readily maintains the temporal and causal relationships among the media. The basic simplicity of the approach makes it easier to understand the precise properties and behavior of a system. The availability of globally synchronized clocks (for example, via the new version of the Internet Network Time Protocol) and the predictable quality of service of advanced communication networks make the isochronous synchronization approach not only attractive but also practical.
INTRODUCTION
A multimedia system is characterized by the integrated computer-controlled generation, manipulation, presentation, storage and communication of independent discrete and continuous media data. The presentation of these data, and the synchronization between the various kinds of media data, are the key issues for this integration (Georganas, Steinmetz and Nakagawa, 1996). Clearly, multimedia systems have to precisely coordinate the relationships among all media. These relationships include temporal and spatial relationships: temporal relationships define the presentation schedule of the media, and spatial relationships define the location arrangements of the media. In this chapter, we are mainly concerned with temporal relationships and with multimedia synchronization mechanisms that ensure a temporal ordering of events in a multimedia system.
Synchronization Problems and Approaches
Three types of multimedia synchronization can be distinguished: intra-stream synchronization, inter-stream synchronization and inter-media synchronization (Schulzrinne, 1993; Crowcroft, Handley and Wakeman, 1999).
Intra-stream synchronization, also called playout synchronization, ensures that the receiver plays out the medium a fixed time after it was generated at the source, even though it has experienced variable end-to-end delay. In other words, intra-stream synchronization assures that a constant-rate source at the sender again becomes a constant-rate source at the receiver, despite delay jitter in the network. An example of intra-stream synchronization is a single stream of video frames: for a video with a rate of 25 frames per second, each frame must be displayed for 40 ms. If the arrival rate is irregular due to network delay, which is not uncommon, the jitter phenomenon occurs. Intra-stream synchronization thus affects the rate of presentation. Intra-stream synchronization is a basic part of the H.261 and MPEG coding systems; H.261 and MPEG (ITU, 1993; Mitchell, Gall and Fogg, 1996) specify not only an encapsulation of multiple streams but also how to carry timing information in the stream. In the Internet, the RTP media-specific timestamp provides a general-purpose way of carrying out the same function.
Inter-stream synchronization ensures that all receivers play the same segment of a medium at the same time. Inter-stream synchronization may be needed in collaborative environments; for example, in a collaborative session the same media information may be reacted upon by several participants. The easiest way of synchronizing between streams at different sites is to use a single time reference. There are several ways to provide this time reference:
• The network has a clock serving as a single reference. This approach is used in H.261/ISDN-based systems: a single clock time is propagated around a set of CODECs and multipoint control units (MCUs).
• The network deploys a clock synchronization protocol, such as NTP (the Network Time Protocol) (Mills, 1993). The timestamps of media packets are then derived from the globally synchronized clocks. In this chapter, we elaborate on this approach.
Inter-media synchronization is concerned with maintaining the required temporal relationships between two or more media. Lip-synchronization between video and audio is the classic example of inter-media synchronization, where the display of video must synchronize with the audio. The approaches used for inter-stream synchronization can also be used for inter-media synchronization.
Synchronization refers to time. In a distributed environment, the following timing factors can cause asynchrony among the media (Akyildiz and Yen, 1996):
• Different initial collection times. As there are several media senders in distributed communications, these senders must collect media objects and transmit them synchronously; otherwise, the temporal relationship among the media objects might be destroyed.
• Different skew, i.e., a time difference between different media objects (such as related audio and video) at the destination after traveling through the network. Thus, even if the media are in synchronization at the source, they may be out of synchronization at the sink.
• Different jitter, that is, on a communication channel there is a maximum difference in end-to-end delay experienced by any two consecutive media objects. Due to jitter, a media object may arrive before or after its playback time.
In a distributed multimedia environment, the presentation component at each destination needs to have the synchronization specification (information) at the moment an object is to be displayed. There exist three approaches for the delivery of the synchronization information to the destinations (Blakowski and Steinmetz, 1996):
• Delivering the complete synchronization information before the start of the presentation. This approach is often used in presentation and retrieval-based systems with stored data objects that are arranged to provide new combined multimedia objects.
• Using an additional synchronization channel. This approach is used in live synchronization, when the synchronization information is available only at the time the multimedia data are being captured. Because of the additional channel used, it is not very suitable for the case of multiple sources.
• Multiplexed data streams. In this approach, the multimedia data and the related synchronization information are multiplexed on one communication channel and delivered together. This approach is suitable only for different media at the same source location, and is not appropriate in distributed environments.
All these factors make multimedia synchronization a non-trivial research issue that has attracted extensive research (see the IEEE JSAC special issue for recent developments; Georganas, Steinmetz and Nakagawa, 1996). Very often, sophisticated mechanisms and algorithms are required to compensate for these timing differences. Surveys of multimedia synchronization can be found in Baqai, Khan and Ghafoor (1997), Blakowski and Steinmetz (1996), and Ehley, Ilyas and Furht (1995).
An Isochronous Synchronization Approach
In this chapter, we revisit the distributed synchronization issue using synchronized clocks. We present a clock-driven protocol for achieving multimedia synchronization (any one of the three types of synchronization). This approach is particularly suitable for the distributed collaborative multimedia environment where many-to-many multimedia communication is the basic interaction pattern. In this approach, multimedia synchronization is based on the use of synchronized physical clock time instead of any form of logical clock or sequence numbers, and thus clock synchronization across the distributed system is assumed. A real-time (synchronized) clock is incorporated in the system as a mechanism for initiating significant events (actions) as a function of real time. Derived from the globally synchronized clock, there exists a lattice structure in the system. One dimension of this lattice represents the progression of time; the other dimension is the processes in the system. Processes in the system execute a simple clock-driven protocol, and all significant events (the sending and delivering of messages for presentation) are restricted to occur at the lattice points of the globally synchronized space/time lattice, also called the event lattice (Kopetz, 1992). This lattice structure greatly simplifies multimedia synchronization and readily maintains the temporal and causal relationships among the media.
To the best of our knowledge, this is the first attempt to apply the power of the isochronous approach, as advocated by Lamport (1984), to multimedia synchronization. No previously published multimedia synchronization protocol that we know of has used time this directly and explicitly (Georganas, Steinmetz and Nakagawa, 1996). The trend in published multimedia synchronization protocols has been to make process synchronization independent of the execution rates of any components and to provide only ad hoc solutions to multimedia synchronization problems (see Georganas, Steinmetz and Nakagawa, 1996, for examples). The idea behind clock-driven, isochronous synchronization is very simple and intuitive in that, as argued by Lamport (1984), the easiest way to synchronize processes is to get them all to do the same thing at the same time. The basic simplicity of the approach makes it easier to understand the precise properties and behavior of a system. The availability of globally synchronized clocks and the predictable quality of service of advanced communication networks make the isochronous synchronization approach not only attractive but also practical. Using a simple mechanism based on the synchronized clock without requiring complex algorithms, the approach can equally well be applied to various multimedia applications in distributed environments, including live multimedia applications (live teleconferencing and CSCW) and stored media applications. We now present the system model and assumptions on which the synchronization is considered.
SYSTEM MODEL AND ASSUMPTIONS
We model a distributed multimedia system as a set P of n sequential processes {p1, p2, …, pn} that communicate only by exchanging messages. The messages exchanged range from simple text information to multimedia data objects such as video, audio and images. Each process executes a sequence of events, where an event is the sending, receiving or delivery of a message, or the process's clock reaching a certain value. A sending event does not necessarily specify a single destination; in many cases, for example in a CSCW environment or a multimedia teleconference, the message being sent has multiple destinations (the multicasting of messages). For this reason, in the following, sending and multicasting are used interchangeably.
All the events in the system form a partially ordered set, E, and this partial order is governed by the happen-before relation (→) that represents potential causality (Lamport, 1978). However, the exact order in which a process executes receive events depends on many factors, including process scheduling, network routing and flow control, and may not respect the causal order. Therefore, we use a separate event, the delivery event, to deliver the message to the application (for presentation) according to the causality. Using the happen-before relation, a distributed computation is represented as a partial order of events, denoted by K = (E, →), where E is the set of all events. For convenience, we denote by M(K) the set of all the messages exchanged in a distributed computation K.
The underlying communication network has the following properties, as illustrated in Figure 1:
• The communication network is not assumed reliable; that is, messages can be lost.
• For any event e that causes process pi to send a message, there is a bound on delay δ such that if event e occurs at time T, the message arrives at process pj by time T + δ, where time T and the time when the message arrives are both measured according to the process's local clock.
• There is a maximum communication delay, ∆, that a multimedia application can tolerate before degrading the quality of its services, determined by the requirements of the multimedia application. For example, in teleconferencing the acceptable ∆ for audio and video messages is 80 ms (Steinmetz, 1996). The acceptable maximum delay is called the validity time. In Figure 1, packet C arrives too late, beyond the validity time, and must be discarded.
• Since the communication delay is unpredictable, there may be a small number of messages that take longer than ∆ (> δ) to arrive. These messages become useless for multimedia applications and will be discarded.
Figure 1: Communication delay, validity time and playout synchronization. [Figure: packets A, B, C and D are generated at regular intervals at the sender and experience variable delay in the network; each is played out a fixed time after its generation time. Packet C arrives after its scheduled playout time (beyond the validity time) and misses its playout.]
In fact, the systems we consider are partially synchronous systems, where the components have some knowledge about time, although the information might not be accurate. In such systems, the execution is not completely lock step as it is in the synchronous model. It is not completely asynchronous either, in which case there would be no bound on message delays. Instead, our system model imposes some restrictions on the relative timing of events, and is called the partially synchronous (timing-based) model. As we will show below, algorithms designed using knowledge of the timing of events can be efficient (yet simple to understand), but they can also be fragile in that they may break down and not run correctly if the timing assumptions are violated. This type of system is more realistic than either completely synchronous or completely asynchronous systems, since real systems typically do use some timing information. For example, processes in a partially synchronous network might have access to synchronized clocks, or might know approximate bounds on process step time or message delivery time. The approaches to estimating delay are discussed later in this chapter.
In message-passing systems, if the happen-before relation is defined on the message-passing events, there are two specific cases concerning temporal order: causal order and ∆-causal order. The notion of causal order, as introduced by Birman and Joseph (1994), states that for any process the order in which it delivers messages must respect the happen-before relation of the corresponding sending of the messages. More formally:
Definition 1 (Causal Order)
A distributed computation K respects causal order if for any two messages m1 and m2 ∈ M(K) we have: if send(m1) → send(m2) and m1, m2 have the same destination process, then deliver(m1) → deliver(m2).
In this definition, nothing is mentioned about the time at which the messages are delivered; nor does it prescribe what to do in the cases of message loss and late arrival. However, as indicated above, in a distributed multimedia system messages have a limited validity time, after which they become useless and are allowed to be discarded. Messages that arrive at their destination within their validity time must be delivered before the expiration of that validity time and in their causal order. This motivates the following definition of ∆-causal order, introduced in Yavatkar (1992) and formalized in Baldoni, Mostefaoui and Raynal (1996). The ∆-causal order specifically takes the delivery time into consideration.
Definition 2 (∆-Causal Order)
A distributed computation K respects ∆-causal order if:
• all messages in M(K) that arrive within ∆ are delivered within ∆; all the others are never delivered (they are lost or discarded);
• all delivery events respect causal order, i.e., for any two messages m1 and m2 ∈ M(K) that arrive within ∆, we have: if send(m1) → send(m2) and m1, m2 have the same destination process, then deliver(m1) → deliver(m2).
Figure 2 illustrates the relationship between causal order and ∆-causal order, where the delivery of m4 before the receipt of m2 by process p3 violates the causal order, but the ∆-causal order is respected. Note that in the following discussion, we assume that all the message delays are included in the ∆ term.
Figure 2: Causal order and ∆-causal order. [Figure: processes P1, P2 and P3 exchange messages m1-m4 between times t1 and t4; m2 arrives at P3 after the interval ∆ has expired and is discarded, so delivering m4 before m2 respects ∆-causal order although it violates causal order.]
AN ISOCHRONOUS SYNCHRONIZATION: PRINCIPLES In this section, we discuss some principles underlying the multimedia synchronization protocol described in the next section.
Synchronized Clocks
In order to meet the real-time requirements of multimedia applications, each process p has access to its local clock, Cp. We denote by Cp(t) p's local clock time at real time t. All the local clocks in the system are approximately synchronized. The precision of the synchronized clock is π, i.e., |Cp(t) − Cq(t)| < π. Moreover, the granularity g of the synchronized clock is defined as the real-time duration between two consecutive global ticks. With the globally synchronized clock, each event e of a process is associated with a global time value (timestamp) at which the event occurred, denoted by C(e). Timestamps must guarantee the following clock condition in order to capture the causality relation between events:
∀ ei, ej ∈ E: if ei → ej then C(ei) < C(ej)
However, due to the granularity and precision of the global clock, the sending of a message and the delivery of the same message might occur during the same tick of the global clock; both events would then be associated with the same timestamp, violating the above clock condition. To cope with this problem, what we need is the g-precedence relation (Verissimo, 1994), extended from the happen-before relation:
Definition 3 (g-precedence)
An event ei is said to g-precede an event ej, denoted by ei ⇒ ej, if C(ej) − C(ei) > g.
Clock synchronization protocols exist that achieve a very small temporal uncertainty; for example, the Internet clock synchronization protocol can achieve an accuracy in the range of 10 ms (Mills, 1991). With the use of public broadcast time signals, sub-millisecond accuracy is practical (Mills, 1993). In a LAN environment, a precision of better than 10 microseconds can be realized with some hardware support (Kopetz and Ochsenreiter, 1987). This accuracy is suitable for global time synchronization and for distributed operation scheduling. Indeed, it is cost effective to have globally synchronized clocks in order to provide a global time base for timestamping.
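Expressed as code, the g-precedence test of Definition 3 is simply a guarded comparison of timestamps; the method name below is ours:

    // ei is known to g-precede ej only if their timestamps differ by more than g.
    static boolean gPrecedes(long cEi, long cEj, long g) {
        return cEj - cEi > g;
    }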
Estimating Delay
In general, the distribution of message/packet transit delay through a network is probabilistic in nature, and the acceptable delay can be evaluated from the end-to-end delay distribution by taking into account the validity time allowed by the multimedia application and the specified message loss characteristics. The delay actually experienced by a media packet can be divided into a fixed delay and a variable delay. The fixed delay is the same for each packet in a message transmission and arises from the propagation of packet signals and from fixed buffering delays in the network, the sender and the receiver. The variable delay results from queuing and other variable processing delays in the network. Compared with the fixed delay, the variable delay is the key factor. The nature of the variable delay is highly dependent on the nature of the network.
• In a local network, the variable delay is the result of contention for the network transport medium (e.g., Ethernet). The nature of the delay is highly dependent on the particular characteristics of the medium and the access protocols. In most cases, the variable delay is quite small.
• In a long-haul network where packets traverse multiple links and routers, the variable delay results from queuing at each link and router. It depends on the number of links and routers, utilization, packet size and link speed. Delays may be too large for multimedia transmission on certain networks.
Many methods can be used to estimate either the packet production time or the transport delay of an incoming packet. In this section, we discuss four general methods based on the exposition of Montgomery (1983): blind delay, roundtrip measurement, absolute timing and added variable delay.
Blind Delay
The simplest strategy for estimating the production time of an arriving packet is for the receiver to make a worst-case assumption about the delay encountered by the packet. Once the arrival time has been estimated, the receiver uses sequence information in subsequent packets to determine the proper playout time for each. We call this blind delay, because the receiver makes its estimate blindly, with no information on the actual packet generation time or transit delay (see Figure 1). The worst-case assumption is that the packet on which the estimate is based arrives with minimum transit delay, and that other packets may be delayed by significantly more time. Hence, the receiver must set the target playout time (validity time) for the first packet to be its arrival time plus a maximum variable delay (see Figure 1). The maximum delay is chosen so that packets that experience more than this much delay can be discarded in the playout procedure without unacceptable degradation of media quality. The delay estimate can be revised at any time, since a new estimate can be made on any packet.
This scheme is quite simple and requires no additions to the network to aid in media reconstruction and presentation. While simple, it does not achieve the best possible performance. For a local area network application, the maximum delay is relatively small and other fixed delays are small as well; blind delay may then be a very appropriate strategy, with the estimate being made either on the first packet of the call only, or on the first packet of each spurt of speech. For a long-haul network, however, the maximum delay may be too large for the overall delay introduced by this process to be tolerated.
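The blind-delay computation can be sketched in Java as follows; the class and field names are illustrative, and times are assumed to be milliseconds on the receiver's local clock.

    public class BlindDelay {
        private final long maxVariableDelayMs;  // chosen acceptable maximum
        private final long packetPeriodMs;      // e.g., 40 ms for 25 frames/s video
        private long basePlayout = -1;          // playout time of the reference packet
        private long baseSeq;

        BlindDelay(long maxVariableDelayMs, long packetPeriodMs) {
            this.maxVariableDelayMs = maxVariableDelayMs;
            this.packetPeriodMs = packetPeriodMs;
        }

        // Returns the target playout time, or -1 if the packet arrived too late.
        long playoutTime(long seq, long arrivalMs) {
            if (basePlayout < 0) {  // first packet: assume it had minimum transit delay
                basePlayout = arrivalMs + maxVariableDelayMs;
                baseSeq = seq;
            }
            long target = basePlayout + (seq - baseSeq) * packetPeriodMs;
            return arrivalMs <= target ? target : -1;  // late packets are discarded
        }
    }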
Roundtrip Measurement
The roundtrip measurement method is to actually measure the roundtrip delay on the communication path between the sender and receiver, and to assume that the delay is equally distributed between the two directions. This roundtrip delay is then used to estimate the one-way delay of a particular packet. Note that this technique is commonly used in maintaining synchronized clocks in a distributed network (Cristian, 1989). The measurement is made by sending a packet containing a local clock reading from the sender. When the packet arrives at the receiver, it is immediately sent back to the sender. When it arrives back at the sender, the sender calculates the roundtrip delay by subtracting the clock value recorded in the packet from the current local clock time. The roundtrip delay is then sent to the receiver. While this technique gives an accurate measurement of the roundtrip delay, the estimation of one-way delay may not be accurate, because the delay in the two directions may not be equal. On the other hand, it does reduce errors substantially over the blind delay method. In practice, this improvement may be sufficient to permit its use in a long-haul network with low variable delay.
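A minimal sketch of this estimate, under the symmetric-path assumption, follows; since both clock readings come from the sender's own clock, no clock synchronization is required.

    // At the sender, when the echoed packet returns:
    static long estimateOneWayDelayMs(long sentAtMs, long nowAtSenderMs) {
        long roundTrip = nowAtSenderMs - sentAtMs;  // both read from the sender's clock
        return roundTrip / 2;  // inaccurate if the two directions differ
    }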
Absolute Timing
In a system where synchronized clocks are maintained at the sender and receiver, the absolute time from the synchronized clock can be used for calculating the delay. In this case, each packet carries an indication of its production time (an absolute timestamp), and the receiver uses that to compute the target playout time. The timestamp need not be capable of encoding the absolute time, but it must encode time to a sufficient resolution to allow the receiver to unambiguously determine when the packet was produced. An 8-bit value with the least significant bit representing 1 ms should be sufficient for networks where the variable delay is less than 250 ms (Montgomery, 1983). The receiver can infer the higher-order bits not sent with the packet by examining the timestamp and its synchronized local clock.
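The receiver-side reconstruction of the full production time from such a truncated timestamp might look like the following sketch, assuming a 1 ms resolution, an 8-bit stamp and a total elapsed time under 256 ms; the method name is ours.

    // Splice the 8-bit stamp into the low byte of the local synchronized clock;
    // since the packet cannot have been produced in the future, step back 256 ms if needed.
    static long fullProductionTimeMs(int stamp8, long nowMs) {
        long candidate = (nowMs & ~0xFFL) | (stamp8 & 0xFF);
        if (candidate > nowMs) candidate -= 256;
        return candidate;  // nowMs - result is the delay experienced so far
    }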
Added Variable Delay
The fourth method for estimating the delay experienced by a packet in a packet network is to actually measure the variable delay where it occurs. The packet network keeps track of the delay experienced by a packet as it travels through the network. As indicated before, variable delay results from queuing and processing delays in packet switches and media terminals. The measurement can be made by carrying a delay stamp indicating the accumulated delay in each packet: each network element adds its delay to the delay stamp as the packet passes through, and the delay calculation can be made by each network element using only its local clock. With this technique, the delay estimate may be used either on each packet, or only occasionally, with relative timing information used on the other packets. However, it may be simpler for the network to perform the delay stamp calculation on every packet and for the receiver to perform the target time calculation on each. Like absolute timing, the added variable delay method provides a good measurement of packet network delay, and thus can achieve multimedia synchronization with minimal total delay.
The Ordering Properties of the Synchronization Protocol
In essence, what is really required for distributed multimedia synchronization is ordering; that is, a synchronization protocol must ensure that multimedia messages or streams are sent, delivered and presented in an order that is consistent with the expected behavior of the distributed multimedia system as a whole. The ordering property has the following three aspects:
• Same order: The multimedia messages are delivered to the destinations in the same order.
• Temporal order: Different destinations see the different messages in their temporal order.
• Simultaneity: Different destinations see the same messages at about the same time.
Note that the temporal order is a prerequisite for the causal order. If and only if the occurrence of an event e1 has preceded the occurrence of an event e2 in the domain of real time is it possible that e1 has an effect on e2. Conversely, if it can be established that e2 occurred after e1, then e2 cannot be the cause of e1.
As indicated above, when using clock values as event timestamps to preserve the temporal ordering of events and to achieve synchronization, clocks in the system must have sufficient granularity or resolution. The granularity g of a synchronized clock is defined as the real-time duration between two consecutive global ticks. Obviously, the temporal order of two or more events which occur between any two consecutive ticks of the synchronized clock cannot be re-established from their timestamps. This is a fundamental limit when using clock time for temporal ordering. The granularity requirement for deducing the temporal order from timestamps is stated as follows (Verissimo, 1994):
Granularity Condition. Given two events e1 and e2 in different processes, timestamped by a synchronized global clock of precision π and granularity g ≥ π, their temporal order can be guaranteed to be deduced from the timestamps only if:
|C(e2) − C(e1)| ≥ g + π
From this granularity condition, it is easy to see that the tightest value for g is the case where g = π, resulting in |C(e2) − C(e1)| ≥ 2g. On the other hand, if g >> π, as is the usual case in the multimedia synchronization setting, we have |C(e2) − C(e1)| ≥ g + π ≈ g, instead of 2g.
With globally synchronized clocks which satisfy the granularity condition, we can construct an action lattice (or event lattice) (Kopetz, 1992). One dimension of this lattice represents the progression of time; the other dimension is the processes in the system. Processes in the system are designed to execute a simple clock-driven protocol, which requires that the events of sending and receiving messages are restricted to occur only at the lattice points of the globally synchronized space/time lattice. Thus, whenever an action has to be taken, it has to be delayed until the next lattice point of the event lattice. This delay is the price we have to pay for the simple and intuitive synchronization protocol. This lattice is the basic mechanism for the isochronous approach to multimedia synchronization.
The lattice interval, ∆, is an important design parameter for the synchronization protocol. The following factors affect how to choose ∆:
• The bounded end-to-end communication delay, δ. Bounds on network delays can be guaranteed via network resource reservation (Braden, Zhang, Berson and Herzog, 1997), which includes admission control (Hyman, Lazar and Pacifici, 1992), real-time scheduling, and buffer reservation schemes at network nodes and the multimedia server (such as those proposed in Ferrari and Verma, 1990). For the stream type of multimedia, approaches to bounding the end-to-end delay have also been proposed (see, for example, Little and Ghafoor, 1991; Lamont and Georganas, 1994; Barberis and Pazzaglia, 1980).
• The validity time of the multimedia, beyond which the multimedia objects become useless.
• The granularity g and precision π of clock synchronization. As described earlier, to guarantee the temporal order property, the lattice interval ∆ must satisfy the granularity condition.
• The system scheduling delay, θ. This delay can be reduced by using a real-time operating system and resource reservation schemes.
In summary, ∆ must be large enough to accommodate all these factors; it must also be small enough not to unduly delay events. The detailed derivation of ∆ in a variety of environments can be found in Cristian, Aghili, Strong and Dolev (1995). ∆ can also be estimated during the negotiation phase of the protocol (using the estimation methods described above) or by a scheme for the real-time establishment of channels with deterministic delay bounds (Ferrari and Verma, 1990). Using the mechanisms described in this section, we are ready to present the isochronous protocol for multimedia synchronization in the next section.
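As a sketch, deferring events to lattice points and checking a candidate ∆ against the factors above might look as follows. The simple additive feasibility test is our own illustrative simplification, not the derivation of Cristian, Aghili, Strong and Dolev (1995); all values are in milliseconds of synchronized clock time.

    // Events occur only at multiples of the lattice interval delta.
    static long nextLatticePoint(long nowMs, long deltaMs) {
        return ((nowMs / deltaMs) + 1) * deltaMs;
    }

    // Illustrative feasibility test: delta must cover the bounded network
    // delay, clock granularity and precision, and scheduling delay, yet stay
    // below the validity time of the media.
    static boolean deltaFeasible(long delta, long netDelay, long g, long pi,
                                 long schedDelay, long validityTime) {
        return delta >= netDelay + g + pi + schedDelay && delta <= validityTime;
    }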
AN ISOCHRONOUS SYNCHRONIZATION PROTOCOL
We now describe a general clock-driven, isochronous protocol to achieve the desired multimedia synchronization. The protocol seeks to guarantee that a set of processes deliver the same messages at the same time and in the same temporal order. Here, "same time" must be understood to be limited by the clock skew (the clock synchronization precision) π: two processes undertaking to perform the same action at the same time may in fact do so as much as π time units apart. To guarantee some degree of real-time service for the protocol, the protocol is assumed to run under the control of a real-time operating system which provides a system call, schedule A(B) at T, meaning that the call schedules an action A with input parameter B at the local time T. The protocol executes its events on every clock tick (i.e., at the lattice points), and executes a NoOp event, doing nothing, if there is no communication event to take place on a clock tick. In practice, operations on clock ticks can be implemented by an interrupt-driven program; for example, the receipt of a message timestamped T causes the setting of a clock interrupt for T + ∆ which, in turn, causes the message to be processed at that time.
When a process disseminates a message, it does not do so immediately; rather, it waits until the next tick of its clock, then timestamps the message using the clock reading and sends out the message. When the message arrives at its destination, it is not sufficient for the destination process to process messages in ascending timestamp order as they are received. We must ensure that a process delivers a message only if no message with a smaller timestamp can subsequently be received. We say that a message is stable for p once no message with a lower timestamp can be delivered to the process p; clearly, a message should be processed only after it becomes stable. The lattice structure of the protocol operation makes message stability convenient to test: stability can be determined by exploiting the bounds on delivery delays (i.e., the next lattice point) and the process clocks. A message timestamped T by process p will be received by T + ∆ at every other process in the system, according to each process's local clock. Messages arriving later than T + ∆ are considered useless and discarded (because the multimedia message is beyond its validity time). The protocol is stated by the following rules:
Sending Rule: A process sends out a message at every clock tick; if there is no message to send, the process does nothing (or, one can think of it as sending out a NULL message) at the clock tick.
Stability Rule: A message is stable at process pi if the timestamp on the message is T and the clock at pi has a value equal to or greater than T + ∆.
Delivery Rule: Stable messages are delivered to the application in ascending order by timestamp.
Tie-breaking Rule: Two messages with the same timestamp from different processes are ordered according to their process ids, which are assumed to be unique.
These rules are further described using the following pseudo code:

    process pi sends a message m:
        wait-for-next-tick;
        T = clock time at the next tick;
        mcast(m, T, pi);            // multicast m timestamped with T
        add (m, T, pi) to queue_i

    process pi receives a message from process pj:
        receive(m, T);
        if the clock time at the next tick >= T + ∆
            it is a late message; discard;
        else
            insert (m, T, pj) into queue_j in timestamp order;
            schedule delivery(T) at time T + ∆

    process pi delivers messages:
        delivery(T):                // runs at clock time T + ∆
            from queue_1 ... queue_n take the messages timestamped with T;
            deliver them to the application;
            remove the messages from the queues.

A few observations follow regarding this clock-driven synchronization protocol. The ordering properties required by the synchronization are guaranteed. Noticeably, the temporal order, causal order and ∆-causal order are all respected without requiring additional sophisticated algorithms. Executing this protocol, all processes behave consistently towards the messages. All processes receive all messages before the next is sent (some processes may discard late messages, which cannot be tolerated by the multimedia system), and they act on each message at about the same time and in timestamp order. Since the system satisfies the granularity condition (g-precedence), the temporal order (and the causal and ∆-causal order) is guaranteed. In addition, the protocol is very simple, and there is no sophisticated buffering and scheduling mechanism built into it.
Generally, there is a tradeoff in the delay (∆) of the protocol. For a large value of ∆ (but of course still smaller than the validity time; otherwise, the protocol would break down), the protocol provides a stronger guarantee for the temporal order and causality, but also incurs a larger delivery latency. On the other hand, a smaller value of ∆ ensures quick delivery with small latency but may force packets that arrive later to be unnecessarily discarded. One might characterize this tradeoff as one of pessimism versus optimism. If the expectations from the network are generally optimistic, that is, message/packet loss is infrequent and large packet transmission latency is rare, as is the case for modern ATM networks and LANs, then ∆ can be chosen relatively small. Otherwise, if a pessimistic expectation is taken, ∆ can be chosen relatively large.
The protocol as presented here is clock-driven. Generally, distributed computing systems, including multimedia systems, fall into two classes: clock-driven, where the protocol makes use of an accurate time source, and event-triggered, where all activities are initiated as a consequence of the occurrence of events (e.g., significant state changes). The majority of distributed protocols/algorithms, up until now, have been event-driven. This is not without reason: in distributed systems, global time has not been readily available; clock synchronization used to be an attractive research issue and many clock synchronization algorithms have been devised. Nevertheless, clock synchronization algorithms with sufficiently high accuracy had not been widely deployed. However, the recent development in the availability of globally synchronized clocks has greatly changed the landscape. The improved Network Time Protocol (NTP) (version 3) enjoys synchronization to within a few tens of milliseconds in the global Internet of today, and clock synchronization for LANs can obtain an accuracy as high as a few microseconds. The introduction of the global positioning system (GPS) in the 1990s has further changed the picture (Herring, 1996). GPS has introduced an inexpensive way to obtain accurate time using a radio receiver, which consists of nothing more than a GPS receiver and a network interface. Time obtained in this manner is accurate to a few tens of microseconds. Accuracy such as this is adequate for even the most demanding real-time applications. With this development and such accurate timing sources in place, we believe that clock-based distributed synchronization, as advocated by Lamport (1984), provides a promising, yet simple and intuitive, alternative.
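A compact Java sketch of the receiving side of this protocol is given below, with clock access and multicast abstracted away; the class and method names are ours. A single priority queue ordered by (timestamp, process id) suffices here, since that ordering realizes both the delivery and the tie-breaking rules.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    public class IsochronousReceiver {
        static final class Msg implements Comparable<Msg> {
            final long t; final int pid; final Object payload;
            Msg(long t, int pid, Object payload) {
                this.t = t; this.pid = pid; this.payload = payload;
            }
            // Delivery rule plus tie-breaking rule: order by timestamp, then process id.
            public int compareTo(Msg o) {
                return t != o.t ? Long.compare(t, o.t) : Integer.compare(pid, o.pid);
            }
        }

        private final long delta;
        private final PriorityQueue<Msg> pending = new PriorityQueue<>();
        IsochronousReceiver(long delta) { this.delta = delta; }

        // Receiving: discard late messages, queue the rest in timestamp order.
        void onReceive(Msg m, long localClockNow) {
            if (localClockNow >= m.t + delta) return;  // beyond validity time: discard
            pending.add(m);
        }

        // Run at every clock tick (lattice point). A message stamped T is stable
        // once the clock reads T + delta, since no earlier-stamped message can
        // still arrive; stable messages are handed to the application in order.
        List<Msg> onTick(long localClockNow) {
            List<Msg> deliverable = new ArrayList<>();
            while (!pending.isEmpty() && localClockNow >= pending.peek().t + delta) {
                deliverable.add(pending.poll());
            }
            return deliverable;
        }
    }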
THE ISOCHRONOUS SYNCHRONIZATION AND RTP/RTCP
In this section, we discuss our isochronous synchronization approach in the context of the emerging transport protocols, particularly RTP (a transport protocol for real-time applications) (Schulzrinne, Casner, Frederick and Jacobson, 2000) and its companion protocol, RTCP (the Real-Time Control Protocol).
The Synchronization Support in RTP/RTCP
RTP is the real-time transport protocol within the Internet Integrated Services Architecture, which is designed to provide quality of service guarantees beyond the current TCP/IP best-effort service model (Yang, Gay, Sun and Siew, 2001). RTP provides end-to-end network transport functions suitable for applications transmitting real-time data (e.g., audio, video or simulation data) over multicast or unicast network services (Schulzrinne, Casner, Frederick and Jacobson, 2000). The data transport is augmented by a control protocol (RTCP) to allow monitoring of the data delivery in a manner scalable to large multicast networks, and to provide minimal control and identification functionality. RTP and RTCP are designed to be independent of the underlying transport and network layers. Since RTP/RTCP is an emerging important Internet protocol suite for real-time applications, it is interesting to see how RTP/RTCP facilitate synchronization for multimedia applications. In this chapter, we examine the features and facilities available in RTP/RTCP for media synchronization; the reader interested in knowing more about RTP/RTCP in detail is directed to Schulzrinne, Casner, Frederick and Jacobson (2000) and the related chapter in this book. The noticeable features associated with media synchronization are the 32-bit timestamp field in RTP data packets, and the 64-bit NTP timestamp field and 32-bit RTP timestamp field in RTCP control packets. The Internet RTP standard states that although RTP does not mandate running the Network Time Protocol (NTP) (Mills, 1993) to provide clock synchronization, running NTP is very useful for synchronizing streams transmitted from separate hosts.
Achieving Isochronous Synchronization Using RTP/RTCP
When RTP transports media data packets, each packet is timestamped. The timestamp reflects the sampling instant of the first octet in the RTP data packet. The RTP standard states that the sampling instant is derived from a clock that increments monotonically and linearly in time, to allow synchronization and jitter calculations. The resolution of the clock should be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter; the analysis in the previous section provides a guideline for this resolution. During RTP operation, the companion control protocol (i.e., RTCP) periodically transmits control packets to all participants in the session, using the same distribution mechanism as the RTP data packets. The most important control packets are the sender report (SR) and the receiver report (RR). One of the primary functions of RTCP is to provide, in these reports, feedback on the quality of the data distribution and information for inter-media synchronization. The RTP standard requires that the NTP timestamp (based on synchronized clocks) and the corresponding RTP timestamp (based on data packet sampling) are included in RTCP packets by data senders. This correspondence between the RTP timestamp and the NTP timestamp may be used for intra- and inter-media synchronization for sources whose NTP timestamps are synchronized. Using the timestamp mechanisms in RTP/RTCP and the lattice structure described in this chapter, our isochronous approach can be readily applied to multimedia systems, particularly in distributed many-to-many conferencing environments.
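As an illustration of how the (NTP, RTP) timestamp pair carried in a sender report can be exploited, the sketch below linearly maps an RTP media timestamp onto the NTP wallclock timeline; two streams whose senders run synchronized clocks can then be aligned on that common timeline. The function name, the numeric values and the 8000 Hz audio clock rate are assumptions of this example, not part of the RTP specification, and 32-bit wraparound of the RTP timestamp is ignored for brevity.

def rtp_to_wallclock(rtp_ts, sr_ntp, sr_rtp, clock_rate):
    """Map an RTP timestamp to NTP wallclock seconds using the
    (NTP, RTP) correspondence from the last RTCP sender report.

    sr_ntp     -- NTP time, in seconds, carried in the sender report
    sr_rtp     -- RTP timestamp carried in the same sender report
    clock_rate -- media sampling clock in Hz (e.g., 8000 for telephone audio)
    """
    # RTP timestamps increase monotonically and linearly with sampling
    # time, so the offset from the SR's RTP timestamp converts to seconds.
    return sr_ntp + (rtp_ts - sr_rtp) / clock_rate

# Example: a packet 3840 samples after the report corresponds to
# 46853.125 + 3840/8000 = 46853.605 s on the shared NTP timeline.
wallclock = rtp_to_wallclock(163840, 46853.125, 160000, 8000)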
Estimating the Transmission Delays Using RTCP
The sender reports and receiver reports that RTCP periodically transmits also include the following fields, which can be used to estimate the various delays:
• NTP timestamp (64 bits): Indicates the wallclock time (absolute date and time) represented using the timestamp format of the Network Time Protocol (NTP), which is in seconds relative to 0 hour UTC on 1 January 1900.
• Inter-arrival jitter (32 bits): An estimate of the statistical variance of the RTP data packet inter-arrival time, measured in timestamp units and expressed as an unsigned integer.
• Last SR timestamp (LSR) (32 bits): The middle 32 bits out of the 64 in the NTP timestamp received as part of the most recent RTCP sender report (SR) packet.
• Delay since last SR (DLSR) (32 bits): The delay, expressed in units of 1/65,536 seconds, between receiving the last SR packet by receiver SSRC_r from source SSRC_n and sending this reception report block.
With this timing information available, we can measure the round-trip propagation delay between senders and receivers. An example of round-trip estimation is shown in Figure 3.

Figure 3: RTP/RTCP round-trip time calculation. Sender SSRC_n issues SR(n) carrying its NTP timestamp (ntp_sec = 0xb44db705, ntp_frac = 0x20000000); receiver SSRC_r returns RR(n) carrying lsr = 0xb705:2000 (46853.125 s) and dlsr = 0x0005:4000 (5.250 s); the report arrives back at SSRC_n at time A = 0xb710:8000 (46864.500 s).

In the example, source SSRC_n computes the round-trip propagation delay to SSRC_r by recording the time A when the reception report block is received. It calculates the total round-trip time (A - LSR) using the last SR timestamp (LSR) field, and then subtracts the DLSR field to leave the round-trip propagation delay (A - LSR - DLSR), thus:

    A          0xb710:8000    (46864.500 s)
    DLSR     - 0x0005:4000    (    5.250 s)
    LSR      - 0xb705:2000    (46853.125 s)
    ---------------------------------------
    delay      0x0006:2000    (    6.125 s)
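The same arithmetic can be written as a short sketch, assuming A, LSR and DLSR are all given as 32-bit integers in the middle-32-bit NTP format (16 bits of seconds, 16 bits of fraction), exactly as in the figure:

def rtcp_rtt(a, lsr, dlsr):
    """Round-trip propagation delay A - LSR - DLSR; all arguments are
    16.16 fixed-point seconds (the middle 32 bits of an NTP timestamp)."""
    rtt_fixed = (a - lsr - dlsr) & 0xFFFFFFFF   # wrap-safe modulo-2**32 subtraction
    return rtt_fixed / 65536.0                  # convert 16.16 fixed point to seconds

print(rtcp_rtt(0xb7108000, 0xb7052000, 0x00054000))   # values above -> 6.125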
RTP's Requirement for Synchronized Clocks
The RTP protocol states the importance of synchronized clocks in distributed environments and stresses that "the resolution of the clock must be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter (one tick per video frame is typically not sufficient)." However, it does not prescribe precisely what is considered "sufficient." In this chapter, we complement RTP by presenting a more formal treatment in terms of g-precedence, and thus provide a guideline for implementing media synchronization using RTP.
RELATED WORK AND FUTURE TRENDS
The most seminal work on using absolute time to obtain synchronization in distributed systems was done by Lamport (1984), who argues that the method can be used for any desired form of synchronization in a distributed system. In a sense, our work is an exercise in applying Lamport's approach in the context of multimedia synchronization. The use of the action lattice has been proposed in real-time distributed systems (Kopetz and Grunsteidl, 1993). The idea of isochronous synchronization, which is similar to a RISC processor executing one instruction every clock cycle, was used in ATM cell switching (Li, Ofek, Segall and Sohraby, 1994; Ofek and Yung, 1994). In the area of multimedia synchronization, the most closely related work used in-band clock distribution to achieve source-destination synchronization (Li and Ofek, 1996). Escobar et al. presented an adaptive flow synchronization protocol that permits synchronized delivery of media data to and from distributed sites based on synchronized clocks. Their protocol timestamps data at the sources and buffers data at the destinations for a period of equalization delay (Escobar, Partridge and Deutsch, 1994); in this approach, the buffer management for the equalization is non-trivial. Ferrari's synchronization approach also assumes all of the network nodes' clocks to be synchronized, but the media synchronization relies on all of the internetwork's gateway nodes (Ferrari, 1992). Each packet is timestamped by the source, and when a packet arrives at each gateway node, the synchronized clock is read; the gateway delays (buffers) the packet until it can be delivered according to the synchronization specification. Our approach assumes globally synchronized clocks and uses the action lattice structure for synchronization;
thus no complex buffer management is required.
While the future trends of multimedia synchronization are generally difficult to predict, there is a clear indication of how inter-media synchronization should be approached, as seen from the emerging Internet protocols, particularly in the distributed setting. The Internet standard RTP/RTCP protocol clearly suggests running the network time synchronization protocol NTP "for synchronizing streams transmitted from separate hosts" (Schulzrinne, Casner, Frederick and Jacobson, 2000: 10). The RTCP control packet carries timestamps in the format of the Network Time Protocol (NTP), and the standard stresses the importance "to choose a commonly used clock so that if separate implementations are used to produce the individual streams of a multimedia session, all implementations will use the same clock" (Schulzrinne, Casner, Frederick and Jacobson, 2000: 27). In addition, the RTCP sender reports are designed to establish the correspondence between the RTP timestamps of data packets and the NTP timestamps of control packets, which can be "used for intra- and inter-media synchronization for sources whose NTP timestamps are synchronized." We believe that with the QoS guarantees provided by the emerging networks and with easily installed NTP, multimedia synchronization based on globally synchronized clocks will be a simple, promising approach.
CONCLUSION
Synchronization is a key issue in multimedia system research, and many synchronization mechanisms and protocols have been published. In this chapter, we presented a new isochronous synchronization protocol for distributed multimedia systems. In distributed systems, there exist two approaches to protocol design, event-driven and clock-driven, and most protocols have taken the event-driven approach. While using physical time based on a globally synchronized clock for obtaining synchronization was advocated a long time ago (Lamport, 1984), this approach has not been popular in the distributed systems research community. The synchronization protocol presented in this chapter is designed in the spirit of Lamport's work. In conjunction with the lattice structure, the protocol achieves the required synchronization, guaranteeing the temporal order, including the causal order and the ∆-causal order, without additional sophisticated algorithms for respecting causality. The only assumptions of the isochronous protocol are synchronized clocks in the distributed system and known bounds on the network communication delays. As argued in the chapter, these assumptions can be readily satisfied with the deployment of modern networks. We believe that the proposed clock-driven isochronous protocol presents a new look at the synchronization issue in distributed multimedia environments.
REFERENCES
Akyildiz, I. F. and Yen, W. (1996). Multimedia group synchronization protocols for integrated services networks. IEEE Journal on Selected Areas in Communications: Special Issue: Synchronization Issues in Multimedia Communication, 14(1), 162-173.
Baldoni, R., Mostefaoui, A. and Raynal, M. (1996). Causal delivery of messages with real-time data in unreliable networks. Journal of Real-Time Systems.
Baqai, S., Khan, M. F. and Ghafoor, A. (1997). Multimedia communication–Synchronization. In Grosky, W. I., Jain, R. and Mehrotra, R. (Eds.), The Handbook of Multimedia Information Management, Chapter 11, 335-363. Prentice-Hall.
Barberis, G. and Pazzaglia, D. (1980). Analysis and optimal design of a packet-voice receiver. IEEE Transactions on Communications, COM-28(2), 217-227.
Birman, K. and Joseph, T. (1994). Reliable communication in the presence of failure. In Birman, K. P. and Renesse, R. (Eds.), Reliable Distributed Computing with the ISIS Toolkit. IEEE CS Press, 176-200.
Blakowski, G. and Steinmetz, R. (1996). A media synchronization survey: Reference model, specification and case study. IEEE Journal on Selected Areas in Communications: Special Issue: Synchronization Issues in Multimedia Communication, 14(1), 5-35.
Braden, B., Zhang, L., Berson, S. and Herzog, S. (1997). Resource ReSerVation Protocol (RSVP)—Version 1 functional specification. IETF, RFC 2205.
Cristian, F. (1989). Probabilistic clock synchronization. Distributed Computing, 3, 146-158.
Cristian, F., Aghili, H., Strong, R. and Dolev, D. (1995). Atomic broadcast: From simple message diffusion to byzantine agreement. Information and Computation, 118(1), 158-179.
Crowcroft, J., Handley, M. and Wakeman, I. (1999). Internetworking Multimedia. Morgan Kaufmann Publishers.
Ehley, L., Ilyas, M. and Furht, B. (1995). A survey of multimedia synchronization techniques. In Furht, B. and Milenkovic, M. (Eds.), A Guided Tour of Multimedia Systems and Applications. IEEE CS Press, 230-256.
Escobar, J., Partridge, C. and Deutsch, D. (1994). Flow synchronization protocol. IEEE/ACM Transactions on Networking, 2(2), 111-121.
Ferrari, D. (1992). Delay jitter control scheme for packet-switching internetworks. Computer Communications, 15(6), 367-373.
Ferrari, D. and Verma, D. C. (1990). A scheme for real-time channel establishment in wide area networks. IEEE Journal on Selected Areas in Communications, 8(3), 368-379.
Georganas, N., Steinmetz, R. and Nakagawa, N. (Eds.). (1996). IEEE Journal on Selected Areas in Communications: Synchronization Issues in Multimedia Communications, 14(1).
Herring, T. A. (1996). The global positioning system. Scientific American, 274(2), 32-38.
Hyman, J. M., Lazar, A. A. and Pacifici, G. (1992). Joint scheduling and admission control for ATS-based switching nodes. In Proceedings of the ACM SIGCOMM, 223-234.
ITU. (1993). Recommendation H.261 (03/93)-Video codec for audiovisual services at p x 64 kbit/s. ITU, Geneva, March.
Kopetz, H. (1992). Sparse time versus dense time in distributed real-time systems. In Proceedings of the Twelfth International Conference on Distributed Computing Systems, Yokohama, Japan. IEEE Computer Society, 460-467.
Kopetz, H. and Grunsteidl, G. (1993). TTP—A time-triggered protocol for fault-tolerant real-time systems. In Digest of Papers, The 23rd Int'l Symposium on Fault-Tolerant Computing, Toulouse, France. IEEE CS Press, 524-533.
Kopetz, H. and Ochsenreiter, W. (1987). Clock synchronization in distributed real-time systems. IEEE Transactions on Computers, C-36(8), 933-939.
Lamont, L. and Georganas, N. D. (1994). Synchronization architecture and protocols for multimedia news service application. In Proceedings of the ICMCS, Boston.
Lamport, L. (1978). Time, clocks and the ordering of events in a distributed system. Communications of the ACM, 21(7), 558-565.
Lamport, L. (1984). Using time instead of timeout for fault-tolerant distributed systems. ACM Transactions on Programming Languages and Systems, 6(2), 254-280.
Li, C. S. and Ofek, Y. (1996). Distributed source-destination synchronization using in-band clock distribution. IEEE Journal on Selected Areas in Communications: Special Issue: Synchronization Issues in Multimedia Communication, 14(1), 153-161.
Li, C. S., Ofek, Y., Segall, A. and Sohraby, K. (1994). Pseudo-isochronous cell switching in ATM networks. In Proceedings of the IEEE INFOCOM'94 (V.2), 428-437.
Little, T. D. C. and Ghafoor, A. (1991). Multimedia synchronization protocols for broadband integrated services. IEEE Journal on Selected Areas in Communications, 9(9), 1368-1382.
Mills, D. L. (1991). Internet time synchronization: The network time protocol. IEEE Transactions on Communications, 39(10), 1482-1493.
Mills, D. L. (1993). Precision synchronization of computer network clocks. ACM Computer Communications Review, 24(2), 28-43.
Mitchell, J. L., Gall, D. L. and Fogg, C. (1996). MPEG Video Compression Standard. Chapman & Hall.
Montgomery, W. A. (1983). Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communications, SAC-1(6), 1022-1028.
Ofek, Y. and Yung, M. (1994). The integrated metanet architecture: A switch-based multimedia LAN for parallel computing and real-time traffic. In Proceedings of the IEEE INFOCOM'94 (V.2), 802-811.
Schulzrinne, H. (1993). Issues in designing a transport protocol for audio and video conferences and other multiparticipant real-time applications. IETF, Internet Draft.
Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, V. (2000). RTP: A transport protocol for real-time applications. IETF, Internet Draft.
Steinmetz, R. (1994). Multimedia encoding standards. ACM/Springer Multimedia Systems, 1(5).
Steinmetz, R. (1996). Human perception of jitter and media synchronization. IEEE Journal on Selected Areas in Communications: Special Issue: Synchronization Issues in Multimedia Communication, 14(1), 61-72.
Verissimo, P. (1994). Ordering and timeliness requirements of dependable real-time programs. Journal of Real-Time Systems, 7, 105-128.
Yang, Z., Gay, R., Sun, C. and Siew, D. (2001). Building Internet multimedia applications: An integrated service architecture and media frameworks. In Syed, M. R. (Ed.), Multimedia Networking: Technology, Management and Applications. Hershey, PA: Idea Group Publishing.
Yavatkar, R. (1992). MCP: A protocol for coordination and temporal synchronization in multimedia collaborative applications. In Proceedings of the International Conference on Distributed Computing Systems. IEEE, 606-613.
Chapter XVII
Introduction to Multicast Technology
Gábor Hosszú
Budapest University of Technology and Economics, Hungary
Multimedia communication over the Internet needs a multicasting delivery scheme. In this chapter, the sophisticated group management and routing protocols required for multicasting are presented. The different kinds of transport protocols for satisfying the special requirements of multimedia applications are also included. Finally, the current design principles of multicast-based multimedia applications are discussed.
INTRODUCTION
Most of the widely used traditional Internet applications, such as Web browsers and e-mail, operate between one sender and one receiver. However, in a lot of new software, one sender transmits to a group of receivers at the same time. These programs increase the users' ability to communicate and collaborate, leveraging more value from the network investment. Typical applications include video and audio conferencing for remote meetings, updates on the latest election results, replication of databases and Web site information, collaborative computing activities, and live network transmission of TV news or multimedia training.
The Internet Multicast Service (McCanne, 1997) is a network technology which extends the traditional, best-effort unicast delivery model of the Internet Protocol (IP) with efficient multi-point packet transmission. With Internet multicast, a single packet is sent to an arbitrary number of receivers by replicating the packet within the network at fan-out points along a distribution tree rooted at the packet's source. This extension to IP, called IP multicast, is an efficient, standards-based solution that is supported in local networks by the majority of the standard operating systems. With IP multicast, applications send one copy of the information to a group address, reaching all recipients who want to receive it. Without multicasting, the same information must be either carried over the network multiple times, once for each recipient, or broadcast to everyone on the network, consuming unnecessary
bandwidth and processing resources and limiting the number of participants. IP multicast technologies address the mechanisms needed at different levels in the network and internetworking infrastructure to efficiently handle group communications. Under development since the early 1990s, IP multicast is an important advance in IP networking (Johnson, 1997c).
In order to reach the necessary level of reliability of multicast delivery, the packet transmission process must be controlled by so-called transport protocols. Since the multicasting technology does not completely cover the whole Internet, sophisticated applications are also needed to produce real-time multimedia services. In this chapter, multimedia transport on the Internet and the IP multicasting technology, including the routing and transport protocols, will be described. After them, the popular Multicast Backbone (MBone) is discussed. Lastly, the different aspects of the policy of multicast applications are presented, detailing the main multicast application design principles, including the lightweight sessions, the tightly coupled sessions and the virtual communication architectures on the Internet.
MULTIMEDIA TRANSPORT ON THE INTERNET
The Problem of the Reliable or Real-Time Transport
Due to the advances in multimedia and network technologies, multimedia has become an indispensable feature on the Internet. Multimedia networking means building the hardware and software infrastructure and application tools to support multimedia transport on networks in such a way that users can communicate in multimedia (Liu, 1998). There are other ways to transmit multimedia data, like dedicated links and cables, but they are not practical because they require special installation and new software. Without an existing technology, like the Local Area Network (LAN) and Wide Area Network (WAN), the software development may be very expensive. In the Internet, the mature LAN and WAN technologies based on the IP protocol stack connect large networks all over the world. In fact, the Internet has become the platform of most networking activities. This is the primary reason to develop the multimedia-supporting protocols over the Internet. Another advantage of running multimedia over IP is that users can have integrated data and multimedia service over one single network, without investing in another network's hardware and building up the interface between the two networks.
Technologies like live audio and video transmission in teleconferencing or live broadcasting, called real-time streaming, enable the transmission of digitized audio/video information (in the following referred to as multimedia data) from one user to another via the Internet (Kretschmer, 1998). To avoid transmitting the same multimedia stream separately to each user, a multicast-capable protocol has to be used when the number of receivers is large. To run real-time traffic or multimedia over the Internet as a shared datagram network, a number of issues must be solved:
• High bandwidth is required. Since multimedia means extremely dense data and heavy traffic, the hardware has to provide sufficient bandwidth.
• Multimedia applications are usually related to multicast, i.e., the same data stream, not multiple copies, is sent to thousands of receivers. At a videoconference, for instance, the video data need to be sent to all participants at the same time.
• Real-time traffic is necessary. The real-time applications require the delivery of a data transmission with a limited delay.
• Multimedia data streams are generally bursty. The Internet is a packet-switching datagram network where packets are routed independently across shared networks. Some sophisticated transport mechanisms must be used to take care of the timing issues so that multimedia data can be played back continuously with proper timing and synchronization. Bursty means that the traffic is occasionally much higher than average; this problem is not solvable by just increasing the available bandwidth. Furthermore, for most multimedia applications, the receiver has a limited buffer. If no measure is taken to smooth the data stream, it may overflow or underflow the available application buffer. If data arrive too fast, the buffer overflows and some data packets are lost, resulting in poor quality. Conversely, if data arrive too slowly, the buffer underflows and the application starves.
The traditional Internet service model is the so-called best effort (McCanne, 1997). The network does not guarantee that packets reach their destinations or that packets are delivered in the order they were produced. Rather, routers simply attempt, without guarantee, to deliver packets toward their destination. If a link fails, the routing protocol needs some time to compute new routes around the failure. During this period, packets may be reordered, lost in routing loops or even duplicated. These phenomena mean that a best-effort network does not support real-time traffic.
Most of the multicast transport mechanisms have been designed with a specific application in mind (Obraczka, 1998). For instance, several existing multicast transport protocols address the requirements of delay-sensitive, real-time services, such as multimedia conferencing tools. These applications can tolerate a certain degree of data loss, but they are sensitive to packet delay variance. On the other hand, traditional data dissemination services, such as multipoint file transfer, are not delay-sensitive, but require that objects be delivered in their entirety, retransmitting if a transfer fails.
The Transmission Control Protocol (TCP) sends a byte-oriented data stream via the packet-oriented IP network; it is a bidirectional point-to-point connection without any guaranteed minimum rate, according to the best-effort principle (Mödeker, 1998). TCP organizes acknowledgments (ACKs) for received packets and, if required, retransmissions of lost packets in order to secure the error-free transmission of the data stream. If a multimedia application sends data with a constant rate and network problems arise, the sending process is blocked until retransmission. Depending on the actual network conditions, this can lead to a data stream break-off on the destination side. The TCP/IP transport and network protocol stack of the Internet was designed for this type of traffic (Johnson, 1997a). With multimedia traffic, however, if a receiver has to wait for a TCP retransmission, there can be a noticeable and unacceptable gap in the playout of the real-time data. Furthermore, TCP does not provide timing information, which is a critical requirement in multimedia transport. Alternatively, the User Datagram Protocol (UDP) is as unreliable as IP itself. Thanks to its simple specification, it is suitable as a basis for developing new protocols specifically tailored to the needs of real-time streaming applications.
Such protocols must recognize packet losses, packet jitter and packets arriving in the wrong order, and respond appropriately. On the basis of the above, multimedia applications use a simpler transport framework (Johnson, 1997a) than TCP. Most playback algorithms can tolerate missing data much better than the lengthy delays caused by retransmissions, and they do not require guaranteed in-sequence delivery. A number of protocols have been developed to enhance the Internet
architecture and improve support for playout and interactive multimedia applications, e.g., the real-time-oriented protocols such as the Real-time Transport Protocol (RTP), the Resource ReSerVation Protocol (RSVP) and the Real-Time Streaming Protocol (RTSP) (Liu, 1998).
Service Quality and QoS
In the previous section we found that the Internet itself does not offer a single level of quality: some areas of the network exhibit high levels of congestion and consequently poor quality, while other areas display consistently high levels of service quality. It is important to ask what the components of service quality are and how they can be measured. Service quality on the Internet can be expressed as the combination of network-imposed properties such as delay, jitter, bandwidth and reliability (Ferguson, 1998).
Delay
The delay is the elapsed time for a packet to be passed from the sender, through the network, to the receiver. The higher the delay, the greater the stress that is placed on the transport protocol to work efficiently. For the TCP protocol, higher levels of delay imply greater amounts of data held in transit in the network, which in turn places stress on the counters and timers associated with the protocol. It should also be noted that TCP is a self-clocking protocol, where the sender's transmission rate is dynamically modified by the signal timing coming back from the receiver via the acknowledgments. The greater the delay between sender and receiver, the less responsive the feedback loop becomes, and the protocol thus grows insensitive to short-term dynamic changes in network load. For interactive multimedia applications, large delays cause the system to appear unresponsive.
Jitter
The jitter is the variation in end-to-end transit delay. It is measurable as the absolute value of the first differential of the sequence of individual delay measurements. High levels of jitter cause the TCP protocol to make a very pessimistic estimate of the round-trip time, causing the protocol to work very slowly, which is unacceptable in real-time multimedia applications. In such cases, jitter distorts the signal, which can only be rectified by enlarging the receiver's reassembly playback queue; this in turn increases the delay of the signal, making interactive sessions very hard to maintain.
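For reference, RTP estimates interarrival jitter with an exponentially smoothed mean of exactly this delay differential, using a gain of 1/16; the sketch below follows that scheme. The class name is illustrative, and arrival times and timestamps are assumed to be expressed in the same time units.

class JitterEstimator:
    """Smoothed interarrival jitter in the spirit of RTP's estimator:
    J += (|D| - J) / 16, where D is the change in transit time
    (arrival time minus media timestamp) between consecutive packets."""

    def __init__(self):
        self.jitter = 0.0
        self.prev_transit = None

    def update(self, arrival, timestamp):
        transit = arrival - timestamp
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter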
Bandwidth The bandwidth is the maximal data transfer rate that can be sustained between two end points. It should be noted that this is limited not only by the physical infrastructure of the traffic path within the transit networks, which provides an upper bound to available bandwidth, but also by the number of other flows which share common components of this selected end-to-end path.
Reliability
The reliability is commonly conceived as a property of the transmission system and, in this context, it can be taken as the average error rate of the medium. Reliability can also be a byproduct of the switching system, in that a poorly configured or weakly performing switched system can alter the order of packets in transit, delivering packets to the receiver in a different order than that in which they were originally sent by the source, or even dropping packets
through transient routing loops. TCP cannot distinguish between losses due to packet corruption and losses due to congestion: any packet loss invokes the same congestion-avoidance response from the sender, reducing the sender's transmit rate even though no congestion may have occurred on the network. In the case of UDP-based multimedia applications, unreliability induces distortion in the reconstructed analog signal at the receiver's end.
Differentiated Service Quality
The differentiated service quality refers to differentiation of one or more of the four basic quality metrics (delay, jitter, bandwidth and reliability), discussed above, for a particular category of traffic. Service quality can be defined as delivering consistently predictable service, including required network reliability, low delay, low jitter and high availability. QoS, however, can be interpreted as a method to provide preferential treatment to some arbitrary amount of network traffic, as opposed to treating all traffic as best effort. If a network cannot give a reasonable level of service quality, then attempting to provide some method of differentiated QoS on the same infrastructure is almost impossible.
Requirements for QoS
Real-time multimedia applications impose different requirements on the network, since the traffic they generate must be delivered on a certain schedule or it becomes useless (Cisco, 1995a). Without proper network planning and configuration, the bandwidth requirements of multimedia applications can crowd out other traffic by reducing its available bandwidth, or other traffic can use all available bandwidth, leaving too little to allow consistent QoS for the multimedia applications. Before network managers create a peaceable domain wherein all network flows can efficiently commingle, they must gain an understanding of the various types of flows that their networks must support. A flow is a sequence of messages that have the same source, destination (one or more) and QoS requirements. Applications that generate real-time traffic have very specific QoS requirements, which are communicated to the network through a flow specification referred to as a flowspec. The flowspec is a data structure used by internetwork hosts to request special network services, often guaranteeing how the internetwork will handle some of the hosts' traffic (a sketch of such a structure appears after the list below). The Internet carries all types of traffic, each type having different characteristics and requirements. For instance, a file transfer application requires that some quantity of data is transmitted in an acceptable amount of time, while Internet telephony requires that most packets get to the receiver in less than 0.3 seconds. If enough bandwidth is available, best-effort service fulfills all of these requirements. When resources are scarce, however, real-time traffic suffers from congestion. In general, multimedia applications can be divided into three basic types of traffic:
• Playback: multimedia stream transfer with constant rate
• Interactive real-time communication
• Data dissemination
For a more sophisticated classification, the number of users can be taken into account. In this way, six different sets of applications and QoS techniques can be differentiated:
1) Loss-free data transport: Conventional data applications, such as file transfers, need loss-free reliability without strong time constraints. The traditional transport model supports such applications; however, in the case of a huge amount of data, resource reservation may be needed.
2) Scalable data dissemination: With a large number of receivers, the reliability requirement can lead to scaling problems due to the retransmissions. Sophisticated multicast transport protocols have been developed to serve such applications.
3) Available bit rate applications: New applications such as multimedia mail can operate with a wide range of available bandwidth. These applications need little bandwidth to function slowly and run faster as they gain access to more bandwidth. Depending on the given network conditions, the conventional TCP-based transport with best-effort quality can be enough; however, the quality of the playback is sometimes poor.
4) Constant bit rate applications: Audio traffic, video codecs and LAN TV have injected constant bit rate traffic into networks. These applications cannot function with less bandwidth than some minimum, application-specific requirement, nor do they benefit from extra bandwidth. Running in circuit-switched WAN environments, they have received dedicated bandwidth.
5) Variable bit rate applications: The one-to-one applications can easily adapt to varying network conditions.
6) Variable bit rate applications with low latency: The interactive multimedia applications are bursty in nature and have varying traffic, fluctuating between low and high bandwidth requirements.
The different available services are described in Table 1.
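As a concrete illustration of the flowspec idea mentioned above, here is a minimal sketch of such a data structure. The field set (token-bucket rate and depth, delay and jitter bounds) is modeled loosely on RSVP-style traffic specifications; the names and the example values are assumptions of this sketch, not a wire format.

from dataclasses import dataclass

@dataclass
class FlowSpec:
    """A host's QoS request for one flow (illustrative only)."""
    token_rate: float      # sustained data rate, bytes/second
    bucket_depth: int      # burst allowance, bytes
    peak_rate: float       # maximum instantaneous rate, bytes/second
    max_delay_ms: float    # end-to-end delay bound
    max_jitter_ms: float   # delay-variation bound
    loss_tolerant: bool    # True for playback media, False for file transfer

# Two of the traffic classes above expressed as flowspecs:
lan_tv   = FlowSpec(512_000, 64_000, 512_000, 400.0, 30.0, True)     # constant bit rate
transfer = FlowSpec(0.0, 0, 0.0, float("inf"), float("inf"), False)  # loss-free transport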
Latency and Jitter
Real-time, interactive applications such as desktop conferencing are sensitive to accumulated delay, which is referred to as latency (Cisco, 1995a). For instance, telephone networks are engineered to provide less than 400 ms round-trip latency. Multimedia networks that support desktop multimedia conferencing must also be engineered to the same latency bound. The contribution of the network to latency can be described by the well-known parameters: the propagation delay, the transmission delay, the processing delay and the jitter.
Table 1: Quality of Service requirements

                      | Point-to-Point                       | Multipoint
Data dissemination    | Loss-Free Data Transport:            | Scalable Data Dissemination:
                      |   file transfer                      |   software upgrade, news service
Playback              | Available Bit Rate:                  | Constant Bit Rate:
                      |   multimedia mail, multimedia notes  |   LAN TV
Real-Time Interactive | Variable Bit Rate:                   | Variable Bit Rate with Low Latency:
                      |   self-paced training, retail kiosk  |   desktop conferencing, corporate broadcasts
IP MULTICASTING
The Multicast Extension of the Internet Protocol
There are three types of Internet Protocol Version 4 (IPv4) addresses: unicast, multicast and broadcast. Protocol version 4 refers to the standardized version of the Internet Protocol. Unicast addresses are used for transmitting a message to a single destination node. With broadcast addresses, the message is transmitted to all nodes in a subnetwork. For delivering a message to a group of destination nodes that are not necessarily in the same subnetwork, multicast addresses are used.
Unlike the original IP, the extended IP multicast does not address one computer, but so-called multicast group addresses. These follow the Internet standard dotted decimal notation; the so-called Class D addresses range from 224.0.0.0 through 239.255.255.255 and are used to send the same data to several machines simultaneously (Daviel, 1995). The ranges of the different sets of possible IP addresses are shown in Table 2. There are thus 2^28 possible multicast group addresses, allowing roughly 268 million different multicast transmissions at the same time. Addresses between 224.0.0.0 and 224.0.0.255 are reserved for maintenance protocols and are not forwarded outside of the subnet. IP multicast is an extension of the IP protocol. It supports many-to-many connections and as efficient routing of packets as possible. This is especially important to real-time streaming applications, since here one transmitter gives many recipients the same data stream.
The key concept in IP multicast is the multicast group (Keshav, 1998). Such a group has two defining properties:
• packets sent by any member of a group are received by all other members;
• members may join or leave a group without reference to other members.
IP multicasting is the transmission of an IP datagram to a host group, a set of zero or more hosts identified by a single IP destination address (Deering, 1989). A multicast datagram is delivered to all members of its destination host group with the same best-effort reliability as regular unicast IP datagrams; that is, the datagram is not guaranteed to arrive intact at all members of the destination group or in the same order relative to other datagrams. The multicast group concept is powerful, since it gives a primitive for arbitrarily grouping widely dispersed processes. It can provide an efficient infrastructure for advanced services such as group communication, fault-tolerant computing, information dissemination and hierarchical Web caching. The IP multicast service model is composed of three parts (Crowcroft, 1998):
• Senders send to a multicast address.
• Receivers express an interest in a multicast address.
• Routers cooperate to deliver traffic from the senders to the receivers.
Table 2: The different classes of the IP addresses
Name    | Target                          | Address range
Class A | Large network addresses         | 1.0.0.0 to 126.255.255.255
Class B | Middle-sized network addresses  | 128.0.0.0 to 191.255.255.255
Class C | Small network addresses         | 192.0.0.0 to 223.255.255.255
Class D | Multicasting                    | 224.0.0.0 to 239.255.255.255
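The Class D test in the table reduces to checking the top four address bits (1110); a minimal sketch:

import ipaddress

def is_multicast(addr: str) -> bool:
    """True if addr lies in the Class D range 224.0.0.0-239.255.255.255,
    i.e., its top four bits are 1110. (The standard library also exposes
    this as ipaddress.IPv4Address(addr).is_multicast.)"""
    return int(ipaddress.IPv4Address(addr)) >> 28 == 0b1110

assert is_multicast("224.0.0.1")        # reserved maintenance range, still Class D
assert is_multicast("239.255.255.255")  # top of the multicast range
assert not is_multicast("192.0.2.1")    # an ordinary Class C unicast address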
Sending multicast traffic is similar to sending unicast traffic. To receive multicast traffic, however, an interested host must inform its local router of its interest in a particular multicast address, which it does using the Internet Group Management Protocol (IGMP). The membership of a host group is dynamic: hosts may join and leave groups at any time. There is no restriction on the location or number of members in a host group. A host may be a member of more than one group at a time, and a host need not be a member of a group to send datagrams to it. A host group may be permanent or transient. A permanent group has a well-known, administratively assigned IP address; it is the address, not the membership of the group, which is permanent, and at any time a permanent group may have any number of members, even zero. Those IP multicast addresses that are not reserved for permanent groups are available for dynamic assignment to transient groups, which exist only as long as they have members. An IP multicast capable subnetwork requires two main protocols:
• A host-level protocol, to allow an application to notify the local router that it has joined a multicast group, and to start the data flow from all senders.
• A multicast router-level protocol, to let all the routers that have multicast group members on their LANs establish connections with each other, in order to ensure that the packets sent to the group address are forwarded to all receivers within the intended scope.
Host Conditions for Multicasting
To support IP multicast and the sending/receiving of multimedia streams by multicasting, the nodes and the network infrastructure between them must be multicast enabled, including intermediate routers (Johnson, 1997b). The requirements for native IP multicast at the node hosts are:
• Support for IP multicast transmission and reception in the TCP/IP protocol stack.
• Software that supports IGMP, which handles requests to join a multicast group and receive multicast traffic.
• A network interface card which filters for LAN data-link-layer addresses mapped from network-layer IP multicast addresses.
• IP multicast application software, e.g., audio conferencing.
To run or evaluate IP multicast on a LAN, only the above are needed. No routers need to be involved for a host's adapter to create or join a multicast group and share multicast data with other hosts on that LAN segment. Expanding IP multicast traffic to a WAN requires that:
• All intermediate routers between the senders and receivers must be IP multicast capable. In fact, many new routers have support for IP multicast.
• Firewalls may need to be reconfigured to permit IP multicast traffic. This can be a serious obstacle, since security-minded firewall administrators may distrust IP traffic that differs from the normal unicast.
IP multicast has broad and growing industry backing, and is supported by many vendors of network infrastructure elements such as routers, switches, TCP/IP stacks, network interface cards, desktop operating systems and application software. Figure 1 shows the components of a multicast-based transmission (Johnson, 1997b). To send an IP multicast datagram, the sender specifies an appropriate destination address, which represents a host group (Johnson, 1997b). IP multicast datagrams are sent using the same operation used for unicast datagrams.
Figure 1: The structure of a multicast session and the underlying protocol stack between two applications. On both the sender and the receiver side, the stack runs from the multicast application, through the multicast transport protocols and the UDP transport protocol, down to the network protocols (IP, ICMP, IGMP) and the network interface (driver and card); the peers communicate through the multicast application session, the multicast transport session, the UDP-level transport session, the network connection and the physical link.
Whereas an IP unicast address is statically bound to a single local network interface on a single IP network, an IP host group address is dynamically bound to a set of local network interfaces on a set of IP networks. An IP host group address is not bound to a set of IP unicast addresses. Multicast routers do not need to know the list of member hosts for each group, only the groups for which there is at least one member on their local network.
Group Management
Hosts willing to receive multicast data packets need to inform their neighboring routers that they are interested in getting the multicast messages sent to certain multicast groups. This way, each node can become a member of one or more multicast groups and receive the multicast packets sent to those groups. The IGMP is used by hosts to join IP multicast sessions and by routers to communicate information about group members on their directly attached subnets (Johnson, 1999). However, additional mechanisms are necessary, such as:
• Learning about forthcoming IP multicast sessions, via out-of-band mechanisms such as e-mail or Web sites that are used for announcements
• Announcing sessions
• Determining temporary multicast addresses and providing them for sessions
• Issuing invitations, e.g., for conferences
• Negotiating parameters such as membership
• Sending media encodings and encryption keys
• Adding or deleting members during the session
• Other control functions
Routers also use the IGMP to periodically check whether the known group members are still active (Banikazemi, 1997). Based on the information gained from the IGMP, the router can decide whether it should forward received multicast messages to its subnetwork. After receiving a multicast packet sent to a certain multicast group, the router checks whether there is at least one member of that particular group on its subnetwork. If so, the router forwards the message to that subnetwork; otherwise, it discards the multicast packet. This procedure is the last phase of transmitting a multicast
packet. Figure 2 shows the relation between the IGMP group membership protocol and the inter-router multicast routing protocol (Maufer, 1997), where IPMC denotes a multicast-capable router. In the IP multicast scheme, each individual multicast group (that is, a group of hosts that have all subscribed to a particular application) can be identified by a particular Class D IP address. Each host can register itself as a member of selected multicast groups using the IGMP (Cisco, 1995b). This protocol allows hosts to join and quit multicast groups dynamically, and it operates in both router and host modes (Deering, 1989). IP multicast allows a host to register dynamically for the multicast sessions in which it chooses to participate. Since IP multicast uses logical group addresses instead of individual destinations, a source does not need to know the location or the address of its destinations; it just sends to the group address and the network takes care of the required routing and packet duplication. Similarly, a client does not need to know where the information comes from; it just joins a multicast group and receives all datagrams that are being sent to that group address.
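In practice a host does not speak IGMP directly: joining through the socket API causes the kernel to issue the IGMP membership report. The following receiver sketch uses the standard IP_ADD_MEMBERSHIP socket option; the group address and port are illustrative.

import socket
import struct

GROUP, PORT = "239.1.2.3", 5004   # illustrative transient group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Joining the group makes the kernel send an IGMP membership report,
# telling the local router to forward this group's traffic onto the subnet.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

data, sender = sock.recvfrom(1500)   # receive one multicast datagram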
Scoping
Each IP multicast packet uses the time-to-live (TTL) field of the IP packet header as a scope-limiting parameter. This parameter can be set to a value between zero and 255. Every time a router forwards the packet, it decrements the TTL field in the packet header; the TTL field thus controls the number of hops that an IP multicast packet is allowed to propagate. A multicast packet whose TTL has expired, that is, has reached 0, is dropped without an error notification to the sender. This mechanism prevents messages from being needlessly transmitted to regions of the worldwide Internet that lie beyond the subnets containing the multicast group members (Johnson, 1997b).
Figure 2: Multicast delivery scheme. Multicast-capable routers (IPMC 1 through IPMC 4), connected through the Internet, run the multicast routing protocol among themselves and the IGMP protocol towards the hosts on their attached subnetworks.
A local multicast reaches all immediately neighboring members of the destination host group; that is, the TTL is 1 by default. If a multicast datagram has a TTL greater than 1, the multicast router attached to the local network takes responsibility for internetwork forwarding. The datagram is forwarded to other networks that have members of the destination group. On those other networks that are reachable within the IP time-to-live, an attached multicast router completes delivery by transmitting the datagram as a local multicast. TTL thresholds in multicast routers prevent datagrams with less than a certain TTL from traversing certain subnets. This provides a convenient mechanism for confining multicast traffic to within campus or enterprise networks. An example setting for the TTL field is 1 for the local net, 15 for the site, 63 for the region and 127 for the whole world.
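Sending to a group is an ordinary UDP send; the only multicast-specific step on the sender side is choosing the TTL scope. A sketch, reusing the illustrative group address from the receiver example and the conventional site scope of 15:

import socket

GROUP, PORT = "239.1.2.3", 5004   # same illustrative group as above

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Scope the transmission: 1 = local net, 15 = site, 63 = region, 127 = world.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 15)
sock.sendto(b"hello, group", (GROUP, PORT))   # the same call as for unicast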
IP Multicast for Multimedia Applications
The new one-to-many multimedia applications are compelling the need for advances in traffic handling to overcome bottlenecks (Johnson, 1997c). Besides offering an efficient multipoint delivery mechanism, IP multicast has a number of other attractive properties that make it well suited for large-scale multimedia communication applications (McCanne, 1997):
• IP multicast is both efficient (in terms of network utilization) and scaleable (in terms of control overhead). There are no centralized points of failure, and protocol control messages are distributed across receivers. Group maintenance messages flow from the leaf subnetworks (those from which no branches lead further downstream) toward the source.
• IP multicast offers a large range of scope-control mechanisms. Applications can limit the distance a multicast packet travels, and the network can be configured with hierarchical administrative scope boundaries to localize the reach of a multicast transmission in accordance with administrative policies.
• IP multicast gives a flexible, dynamic and anonymous model for group membership. Senders need not explicitly know about every receiver; conversely, receivers need not know an individual sender's identity to receive its traffic. Instead, sources simply send packets destined to a multicast group, and receivers tune in to the transmission by subscribing to that group.
On the basis of the combination of these main properties, namely scalable protocol performance, user-defined scope controls and flexible group membership, IP multicast is an excellent building block for robust and scalable application-level protocols. IP multicasting refers loosely to a set of four related problems (Keshav, 1998):
• routing: finding a path from a sender to the other members of the multicast group;
• reservation: reserving resources along such a path;
• reliability: assuring that all packets from a sender reach all receivers, despite packet loss and corruption; and
• flow control: regulating the rate at which a sender sends packets so that intermediate links and receivers are not overloaded.
These problems are more or less orthogonal, in that we can pick and choose solutions to each problem independently.
Routing Algorithms and Protocols
General Considerations
Routing protocols have traditionally been designed to provide the optimal route from one network to another. Multipoint routing requires routers to locate the routes to many networks at once, and to do so as efficiently as possible (Cisco, 1995b). To be of practical use, IP multicast must be efficient, scale well and be incrementally deployable (Keshav, 1998):
Algorithms IP multicast traffic for a particular pair of source and destination groups is transmitted from the source to the receivers via a spanning tree that connects all the hosts in the group. Different IP multicast routing protocols use different techniques to construct these multicast spanning trees; once a tree is constructed, all multicast traffic is distributed over it (Johnson, 1997d). Routers keep track of these groups dynamically and build distribution trees that chart paths from each sender to all receivers. When a router receives traffic for a multicast group, it refers to the specific tree that it has built for the sender. There are a lot of different general-purpose multicast routing protocols that can build the distribution trees for multipoint routing, in the following they are introduced: There are a lot of different general-purpose multicast routing algorithms: • Flooding. This is the simplest technique for delivering multicast datagrams to all routers in an internetwork. The router forwards a packet to be forwarded on all interfaces except the one on which the packed arrived. The flooding algorithm is simple, but does not scale for a large number of recipients since it uses all available path across the internetwork instead of just a limited number (Maufer, 1997). • Spanning Trees. The spanning tree selects a subset of the Internet topology by defining a tree structure where only one active path connects any two routers. Its disadvantage is that it centralizes traffic on a small number of links and may not provide the most efficient path between the source subnetwork and group members. • Reverse Path Broadcasting (RPB). This builds a group-specific spanning tree for each potential source subnetwork. Its main limitation is that it does not take into account multicast group membership when it builds the distribution tree. As a result, datagrams may be unnecessarily forwarded to subnetworks that have no members in the destination group.
• Truncated Reverse Path Broadcasting (TRPB). This eliminates unnecessary traffic on leaf subnetworks by determining the group membership on each leaf subnetwork, but it does not consider group membership when building the branches of the distribution tree.
• Reverse Path Multicasting (RPM). This creates a delivery tree that spans only subnetworks with group members, and the routers and subnetworks along the shortest paths to them. It allows a source-rooted spanning tree to be pruned so that datagrams are only forwarded along branches that lead to members of the destination group. Its drawback is that each router is required to maintain state information for all groups and each source, which can lead to serious scaling problems.
• Steiner Trees (ST). This minimizes the number of links in the tree; however, the resulting tree can give slower delivery paths than the other spanning trees.
• Center-Based Trees (Crowcroft, 1998; Ballardie, 1993). Unlike the other algorithms, which build a source-rooted, shortest-path tree for each pair of source and group, this algorithm constructs a single delivery tree that is shared by all members of a group. In effect, it maps the multicast group address to a particular unicast address of a router, while allowing a different center-based tree for each group. Multicast traffic for each group is sent and received over the same delivery tree, regardless of the source.
These algorithms are used in the different multicast routing protocols introduced in the following. One approach assumes that the multicast group members are densely distributed throughout the network and that bandwidth is plentiful. The so-called dense-mode multicast routing protocols rely on the multicast routing algorithms above to propagate information to other network routers. These protocols use a data-driven approach to construct multicast distribution trees. Dense mode is most useful when (Cisco, 1999):
• Senders and receivers are in close proximity to one another.
• There are few senders and many receivers.
• The volume of multicast traffic is high.
• The stream of multicast traffic is constant.
The other approach to multicast routing assumes that the multicast group members are sparsely distributed throughout the network and that bandwidth is not necessarily widely available, e.g., across many regions of the Internet or where users are connected via dial-in connections. It is important to note that sparse mode does not imply that the group has few members, just that they are widely dispersed. In this case, flooding would unnecessarily waste network bandwidth and could hence cause serious performance problems. Thus, sparse-mode multicast routing protocols must rely on more selective techniques to set up and maintain multicast trees. The sparse-mode protocols use a receiver-driven process: a router becomes involved in the construction of a multicast distribution tree only when one of the hosts on its subnet requests membership in a particular multicast group. These protocols are designed to restrict multicast traffic to only those routers interested in receiving it. Sparse-mode multicast routing is most useful when (Cisco, 1999):
• There are few receivers in a group.
• Senders and receivers are separated by WAN links.
• The type of traffic is intermittent.
IP multicast routing protocols usually follow one of these two basic approaches, the dense-mode or the sparse-mode algorithms, depending on the expected distribution of multicast group members throughout the internetwork.
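The reverse-path idea underlying RPB and RPM (and the reverse-path forwarding used by dense-mode PIM, discussed next) reduces to one test per packet; a minimal sketch, with the unicast routing lookup assumed to be supplied by the router's routing table:

def rpf_accept(pkt_source, arrival_iface, unicast_route):
    """Reverse-path check: accept a multicast packet only if it arrived on
    the interface this router would itself use to reach the packet's source;
    anything else is an off-tree duplicate and is dropped.

    unicast_route -- function mapping a source address to the outgoing
                     interface of the shortest path back to that source."""
    return arrival_iface == unicast_route(pkt_source)

# Example with a toy routing table:
routes = {"10.0.0.7": "eth0"}
assert rpf_accept("10.0.0.7", "eth0", routes.get)       # on-tree: accept
assert not rpf_accept("10.0.0.7", "eth1", routes.get)   # off-tree: drop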
Dense-Mode Protocols
The routing algorithms above are used in concrete routing protocols. A large class of these, called dense-mode routing protocols, was developed for applications that send traffic to high concentrations of LANs. Examples of such applications include LAN TV, corporate broadcasts and financial broadcasts. These protocols are described in the following subsections.
a) Distance Vector Multicast Routing Protocol
The DVMRP constructs a different distribution tree for each source and its destination host group (Waitzmann, 1988; Johnson, 1997d). Each distribution tree is a minimum spanning tree from the multicast source, as the root of the tree, to all the multicast receivers, as the leaves of the tree. The distribution tree provides a shortest path between the source and each multicast receiver in the group, based on the number of hops in the path, which is the DVMRP metric. A tree is constructed on demand, using a broadcast-and-prune technique, when a source begins to transmit messages to a multicast group. Since new members may join the group at any time, and since these new members may depend on one of the pruned branches to receive the multicast transmission, DVMRP periodically reinitiates the construction of the spanning tree.
b) Protocol Independent Multicast Dense-Mode
To address the requirements of both local-area and wide-area communication, the Protocol Independent Multicast (PIM) protocol has the two modes introduced above: dense and sparse. Protocol Independent Multicast Dense-Mode (PIM-DM) is similar to DVMRP in the sense that both employ the Reverse Path Multicasting (RPM) algorithm to construct source-rooted distribution trees (Deering, 1996). The main differences between DVMRP and PIM-DM are that:
• PIM-DM is completely independent of the unicast routing protocol that is used on the network, while DVMRP relies on specific mechanisms of its associated unicast routing protocol; and
• PIM-DM is less complex than DVMRP.
PIM allows routers to learn dynamically which multicast group data needs to be forwarded through each interface of the router. It gives a scalable, multi-enterprise solution for multicast capability that enables networks running any unicast routing protocol to support IP multicast. PIM can be integrated into existing networks running the Routing Information Protocol (RIP), Interior Gateway Routing Protocol (IGRP), Enhanced IGRP, Integrated Intermediate System-to-Intermediate System (IS-IS) or Open Shortest Path First (OSPF). Using a technique known as reverse-path forwarding, dense-mode PIM works by flooding incoming multicast traffic out of every interface except the one through which the source can be reached. A router that has no need for a specific data stream replies with a prune message, which causes the router that performed the flooding to remove the replying router's interface from its flood list. To summarize, the PIM protocol provides routers the ability to dynamically locate the best path for multicast traffic. Dense-mode PIM is best suited for broadcast applications such as LAN TV, corporate broadcasts and resource location. As an example of industrial support, Cisco routers implement PIM and can use both types, dense mode as well as sparse mode (Cisco, 1995b).
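To make the mechanism concrete, the following minimal Python sketch illustrates the reverse-path forwarding check and the flood-and-prune forwarding decision described above. It is only an illustration: the routing-table contents, interface names and function names are hypothetical and do not come from any actual DVMRP or PIM implementation.

# Minimal sketch of a reverse-path forwarding (RPF) check, as used by
# DVMRP and PIM-DM. All table contents and names are hypothetical.

# Hypothetical unicast routing table: source prefix -> the interface
# through which the shortest path back to that source leads.
unicast_next_hop_iface = {
    "10.1.0.0/16": "eth0",
    "10.2.0.0/16": "eth1",
}

def rpf_accept(source_prefix, arrival_iface):
    """Accept a multicast packet only if it arrived on the interface the
    router itself would use to reach the source (the RPF interface);
    anything else is a duplicate and is dropped."""
    return unicast_next_hop_iface.get(source_prefix) == arrival_iface

def flood(source_prefix, arrival_iface, all_ifaces, pruned_ifaces):
    """Dense-mode behaviour: flood out of every interface except the
    arrival interface and those pruned by downstream routers."""
    if not rpf_accept(source_prefix, arrival_iface):
        return []  # failed the RPF check: drop the packet
    return [i for i in all_ifaces
            if i != arrival_iface and i not in pruned_ifaces]

# A packet from 10.1.0.0/16 arriving on eth0 passes the RPF check and is
# flooded out of eth1 and eth2; eth3 has been pruned by a prune message.
print(flood("10.1.0.0/16", "eth0", ["eth0", "eth1", "eth2", "eth3"], {"eth3"}))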
c) Multicast Open Shortest Path First
The original unicast routing protocol Open Shortest Path First (OSPF) routes messages along least-cost paths, where cost is expressed in terms of a link-state metric. The multicast-capable protocol MOSPF is an extension of the unicast protocol OSPF that makes use of the OSPF link-state database to calculate source-based multicast trees (Moy, 1994). In addition to the number of hops in a path, other network performance parameters that can influence the assignment of cost to a path include, among others:
• An application's desired QoS. For instance, if an application requires low latency, a path involving a satellite link should be assigned a high cost.
• Load-balancing information. For instance, a link that has very little traffic on it might be assigned a lower cost than a heavily utilized link, in an attempt to balance traffic on the network.
MOSPF is designed for use within a single routing domain, e.g., a network controlled by a single organization. MOSPF is dependent on the use of OSPF as the accompanying unicast routing protocol, just as DVMRP includes its own unicast protocol. In an OSPF/MOSPF network, each router maintains an up-to-date image of the topology of the entire network. This link-state information is used to construct multicast distribution trees. Link-state routing protocols operate by having each router periodically send a message listing its neighbors and how far away they are (Crowcroft, 1998). These routing messages are flooded throughout the entire network, so every router can build up a map of the network. These maps can then be used to build forwarding tables, so that the correct next hop for a given packet can be decided quickly. To extend this functionality to multicasting, each MOSPF router periodically collects information about multicast group membership via IGMP. This information, along with the link-state information, is flooded to all other routers in the routing domain. All routers calculate exactly the same tree, since they periodically share link-state information. However, MOSPF does not scale well, due to the periodic flooding of link-state information among the routers (Johnson, 1997d).
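Since every MOSPF router holds the same link-state map, each router can run the same shortest-path computation and arrive at the identical tree. The following Python sketch shows such a computation, Dijkstra's algorithm, over a small hypothetical topology; the router names and link costs are illustrative only and stand in for the contents of the OSPF link-state database.

import heapq

# Hypothetical link-state map: router -> {neighbor: link cost}. Every
# router floods its own neighbor list, so all routers end up holding
# the same map and therefore compute identical trees.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}

def shortest_path_tree(source):
    """Dijkstra's algorithm over the link-state map: returns each
    router's predecessor on its least-cost path from the source."""
    dist, parent = {source: 0}, {source: None}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was found already
        for neighbor, cost in topology[node].items():
            if d + cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = d + cost
                parent[neighbor] = node
                heapq.heappush(heap, (d + cost, neighbor))
    return parent

# The source-based multicast tree for a sender behind router "A":
print(shortest_path_tree("A"))  # {'A': None, 'B': 'A', 'C': 'B', 'D': 'C'}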
(d) Simple Multicast Routing Protocol
Apple Computer's Simple Multicast Routing Protocol (SMRP) provides multicast routing functions for AppleTalk traffic. Applications produced by Apple, such as QuickTime Conference (QTC), support SMRP; furthermore, Apple supports QTC on IP networks using IP multicast (Cisco, 1995b).
Sparse-Mode Protocols
Different versions of the Center-Based Trees algorithm are applied to create well-scaling routing protocols for the internetwork. They address the needs of environments where there might be many data streams at a given moment, but each stream goes to a relatively small number of the LANs on the Internet. The sparse-mode routing protocols are described in the following.
a) Core Based Trees
Some multimedia applications, such as distributed multimedia gaming and distributed interactive simulation, have many active senders within a single multicast group. Unlike DVMRP or MOSPF, which construct a shortest-path tree for each pair of source and group, the Core Based Trees (CBT) protocol constructs a single tree, rooted at a Core router, that is shared by all members of the group. Multicast traffic for the entire group is sent and received over
the same tree, regardless of the source. This use of a shared tree can provide significant savings in the amount of multicast state information that is stored in individual routers. A CBT shared tree has a Core router that is used to construct the tree (Johnson, 1997d). Routers join the tree by sending a join message to the Core. When the Core receives a join request, it returns an ACK over the reverse path, thus forming a branch of the tree. In this manner, CBT concentrates traffic onto a smaller subset of links than would be used for source-based trees. The resulting concentration of traffic around the Core is a potential problem for this approach to multicast routing. Some versions of CBT support the use of multiple Cores, since load balancing might be achieved by using more than one Core.
b) Protocol Independent Multicast Sparse-Mode
Sparse-Mode PIM (PIM-SM) addresses the needs of environments where there might be many data streams at a given moment, but each stream goes to a relatively small number of the LANs on the Internet. Some examples of applications suited for PIM-SM include desktop conferencing and collaborative computing. For this type of traffic, a flood-and-prune approach such as Reverse Path Forwarding would waste bandwidth. Instead, PIM-SM works by designating a router as a rendezvous point (RP), similar to the Core of CBT. When a sender needs to transmit data, it first registers with the RP. Likewise, a would-be receiver must first join the group by registering with the RP (Cisco, 1995b). In effect, PIM-SM constructs a distribution tree around the RP router. PIM-SM assumes that no hosts want the multicast traffic unless they specifically ask for it. PIM-SM is optimized for environments where there are many multipoint data streams (Cisco, 1999), each going to a relatively small number of the LANs in the internetwork.
PIM-SM is more flexible than CBT. While CBT trees are always group-shared trees, with PIM-SM an individual receiver may choose to construct either a group-shared tree or a shortest-path tree. There are advantages to each type of distribution tree. The shared tree is relatively easy to construct, and it reduces the amount of state information that must be stored in the routers. Accordingly, a shared tree conserves network resources if the multicast group consists of a large number of low-data-rate sources. However, shared trees cause an aggregation of traffic around the RP, a phenomenon that can result in performance degradation if there is a large volume of multicast traffic. Another disadvantage of shared trees is that traffic often does not traverse the shortest path from source to destination. The PIM-SM architecture supports both types of distribution trees (Johnson, 1997d).
c) Border Gateway Multicast Protocol
The Border Gateway Multicast Protocol (BGMP) is an inter-domain routing protocol (Crowcroft, 1998). BGMP has a different goal than the other multicast routing protocols: it does not build trees of routers, but rather bidirectional shared trees of domains. Within a domain, any of the other multicast routing protocols can be used, and BGMP then provides multicast routing between domains. The hard problem for BGMP is deciding where to put the root of a shared tree. It places the root in the domain that has been allocated the multicast address. Thus, if the conference initiator obtains the address from its local multicast address allocation server, the tree will be rooted in the session initiator's domain. For TV-style broadcasts this is optimal, but for multi-party interactive applications it may be less so. Figure 3 presents the classification of some multicast routing protocols.
Figure 3: Classification of multicast routing protocols
• Dense mode: distance vector protocols (DVMRP, PIM-DM) and link-state protocols (MOSPF)
• Sparse mode: shared-tree protocols (CBT, PIM-SM) and shared-tree-of-domains protocols (BGMP)
TRANSPORT PROTOCOLS
The multicast routing protocols use various approaches to construct multicast delivery trees for efficient multicast transmission. The resulting delivery trees provide a route from a multicast sender to the receivers. However, without additional mechanisms, such routing is not guaranteed to provide a specified QoS (Johnson, 1999). To support the development of group communication applications such as multipoint data dissemination and multi-party conferencing tools, many multicast transport protocols have been proposed and implemented. The purpose of transport protocols is to provide end-to-end services that are specific to a certain range of applications, that is, services that are not common to all applications and thus not required in the network service (Crowcroft, 1998). Most of the transport protocols are layered on top of a network-layer multicast protocol, such as IP multicast; however, there are application-level transport protocols, too. Earlier multicast transport mechanisms were designed as a general solution to the group communication problem. More recently it has been recognized that group communication applications have a much wider range of requirements than unicast applications (Obraczka, 1998). For multimedia over IP, the solution is to classify all traffic, allocate priority to different applications and make reservations. The Integrated Services Working Group in the IETF (Internet Engineering Task Force) developed an enhanced Internet service model called Integrated Services that includes best-effort service and real-time service (Liu, 1998). The real-time service enables IP networks to provide QoS to multimedia applications. The Resource ReSerVation Protocol (RSVP), together with the Real-Time Transport Protocol (RTP), the Real-Time Control Protocol (RTCP) that works in conjunction with RTP, and the Real-Time Streaming Protocol (RTSP), gives a working basis for real-time services. Integrated Services allows applications to configure and manage a single infrastructure for multimedia applications and traditional applications alike. It is a comprehensive approach to providing applications with the type of service they need and in the quality they choose. The RSVP, RTP, RTCP and RTSP transport protocols are the foundation of real-time services.
Reliability in Multicast Transport
Reliable delivery is required by many real-time and non-real-time applications. For unicast services, error detection and correction in the TCP layer provide reliability. For reliable multicast, many approaches to tracking ACKs and to detecting and correcting errors have been developed, since scalability problems arise from the fact that an IP multicast sender may have thousands of recipients. Reliable multicast protocols provide error correction schemes that are designed to overcome the limitations of unreliable multicast datagram delivery without loading the network. There are many approaches to reducing the number of ACKs in a reliable multicast
service. The error correction mechanisms can vary, depending on the application requirements and the multicast environment (Johnson, 1999). Mechanisms employed by reliable multicast services include multicasting ACKs to reduce duplicate requests, using separate polling services and holding information for retransmission at intermediate points. Some solutions use a predefined error correction protocol, while others support the flexibility to choose from a set of reliable protocols. Applications supporting simultaneous reliable unicast and multicast retransmission can be useful, enabling network administrators to transition gradually from a unicast to a multicast-capable network.
Taxonomy
We can classify multicast transport protocols according to the following features (Obraczka, 1998):
• Data transfer: the mechanism used to propagate data (not control information) among participating sites.
• Reliability mechanism: how the protocol recovers from data loss.
• Loss recovery (repair request): how the protocol requests retransmission of lost data.
• Feedback control: how the protocol controls the amount of control information receivers generate.
• Retransmission: how the protocol propagates data that needs to be retransmitted.
• Flow control: the mechanism used to prevent the source from overflowing the receivers and from causing network congestion.
• Locus of control: whether the protocol uses a central site to control functions such as message ordering, group membership and retransmissions.
• Ordering of datagrams: what ordering guarantees the protocol offers and how those guarantees are met.
• Group management: how the protocol manages multicast group membership.
• Target applications: whether the protocol was proposed or implemented to support a specific application.
General Purpose Protocols
These earlier protocols are not designed for a specific application.
Reliable Broadcast Protocol
The Reliable Broadcast Protocol (RBP) provides multipoint communication between sites connected by a local-area broadcast network (Obraczka, 1998). RBP combines Negative and Positive Acknowledgments (NACKs and ACKs) to achieve reliability, ordering and fault tolerance. Messages are multicast to the group through the token site; the token site multicasts an ACK for each message it receives. ACKs inform the sender that the token site received a message. Since the ACKs carry a global timestamp, receivers use them to order messages and to detect lost ones.
Multicast Transport Protocol
The Multicast Transport Protocol (MTP) provides reliable multicast delivery on a one-to-many and many-to-many basis (Magnus, 1999). It can be used as the transport layer of any network architecture, provided the datalink layer includes some support for multicast. Clients
of MTP need not be concerned with the geographical location and population of the web members (the web being the MTP term for the group). It uses only a small number of control packets, which carry the bulk of the synchronization and control information. MTP is a Negative Acknowledgment (NACK)-based protocol. When a group member detects a lost message, it sends a NACK to the message producer, which multicasts the requested message to the entire group.
Reliable Multicast
The Reliable Multicast Protocol (RMP) provides a totally ordered, reliable multipoint transport service on top of an unreliable multicast service. RMP evolved from RBP. Besides using multicast where available, RMP extended RBP with a name service that advertises the existence of multicast groups, and with a flow and congestion control mechanism. RMP's name service uses the RMP protocol itself to manage its distributed database. RMP's group membership protocol allows sites to join or leave a group and updates the members' group membership view accordingly.
Xpress Transport
The Xpress Transport Protocol (XTP) was designed to support a wide range of applications, and it thus provides all the functionality of TCP, UDP, etc. In multicast mode, retransmissions are multicast to the group while NACKs are unicast back to the source; XTP does not, however, provide any feedback control mechanism.
Uniform Reliable Group Communication
The Uniform Reliable Group Communication Protocol (URGC) provides reliable and ordered communication among the members of a group. To achieve ordering and to decide when to purge old messages, URGC uses centralized control, which rotates among all members of the group. All sites keep a history of all processed messages, which allows other sites to recover missing messages. Besides broadcasting its message-ordering decisions, the current coordinator also informs the group which member is the most up to date. In this way, through unicast communication, other group members can recover missing messages from the most up-to-date site.
Reliable Multicast Frameworks
The main point of the Reliable Multicast Frameworks (RMF and RMFP) is that, instead of a specific protocol targeted at a specific class of reliable multicast applications, a framework for reliable multicast can be used to develop a variety of multicast protocols for applications with different requirements. RMF introduces the idea of a universal receiver, which is able to operate with any sender using self-identifying packets (Obraczka, 1998).
Support of the Multipoint Interactive Applications
Real-Time Transport and Control
RTP is a real-time transport protocol that provides end-to-end delivery services to support applications transmitting real-time data (Johnson, 1997a). RTP is defined in Schulzrinne (1990a), along with a profile for carrying multimedia over RTP in Schulzrinne (1996b). RTP is presently used by some commercial systems implemented
as a transport protocol for multimedia streams on a number of platforms. RTP services include:
• payload type identification,
• sequence numbering and
• timestamping.
RTP provides end-to-end delivery services, but it does not provide all of the functionality that is typically provided by a transport protocol. RTP runs on top of UDP to utilize its multiplexing and checksum services. The RTP header carries the timing information necessary to synchronize and display multimedia data, and to determine whether packets have been lost or have arrived out of order. The data carried by RTP in a packet is the so-called RTP payload, whose format depends on the type of the actual multimedia stream. The header specifies the payload type, thereby allowing multiple data and compression types. RTP does not provide any mechanism to ensure timely delivery or provide QoS guarantees. It does not guarantee delivery or prevent out-of-order delivery; furthermore, it does not assume that the underlying network is reliable. Some adaptive applications do not require such guarantees, but for those that do, RTP must be accompanied by other mechanisms to support resource reservation and to provide reliable service.
The Real-Time Control Protocol (RTCP) is the control protocol that works in conjunction with RTP (Johnson, 1997a). Each participant in an RTP session periodically transmits RTCP control packets to all other participants. This feedback can be used by the application to control performance and for diagnostic purposes. RTCP's primary function is to provide information to an application regarding the quality of data distribution. Each RTCP packet contains sender and/or receiver reports with statistics useful to the application. These statistics, including the number of packets sent, the number of packets lost, interarrival jitter, etc., are useful for the sender, the receivers and third-party monitors. RTCP is defined to monitor the QoS and to convey information about the participants in an ongoing session. The latter aspect of RTCP may be sufficient for loosely controlled sessions, i.e., where there is no explicit membership control and set-up, but it is not necessarily intended to support all of an application's control communication requirements (Schulzrinne, 1990a). RTCP also carries a transport-level identifier for an RTP source, called the canonical name, which is used to keep track of the participants in an RTP session. Receivers use the canonical name to associate multiple data streams from a given participant in a set of related RTP sessions, e.g., to synchronize multimedia.
RTP represents a new style of protocol: it is intended to be flexible enough to provide the information required by a particular application, and it is often integrated into the application rather than implemented as a separate layer. RTP is a protocol framework that is deliberately not complete. Unlike conventional protocols, in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed (Clark, 1990). If both audio and video media are used in a conference, they are transmitted as separate RTP sessions (Schulzrinne, 1990a). RTP and RTCP packets are transmitted for each medium using different UDP port pairs and/or multicast addresses.
There is no direct coupling at the RTP level between the audio and video sessions, except that a user participating in both sessions should use the same distinguished canonical name in the RTCP packets for both so that the sessions can be associated.
One motivation for this separation is to allow some participants in the conference to receive only one medium if they choose. Despite the separation, synchronized playback of a source’s audio and video can be achieved using timing information carried in the RTCP packets for both sessions.
a) Abstract Network Elements
In a real conference scenario, not all sites may want to receive media data in the same format. Consider the case where participants in one area are connected through a low-speed link to the majority of the conference participants, who enjoy high-speed network access. Instead of forcing everyone to use a lower-bandwidth, reduced-quality audio encoding, an RTP-level relay called a mixer (defined exactly below) may be placed near the low-bandwidth area. This mixer resynchronizes incoming audio packets to reconstruct the constant 20 ms spacing generated by the sender, mixes the reconstructed audio streams into a single stream, translates the audio encoding to a lower-bandwidth one and forwards the lower-bandwidth packet stream across the low-speed link. The RTP header includes a means for mixers to identify the sources that contributed to a mixed packet, so that proper talker indication can be provided at the receivers.
Some of the intended participants in the audio conference may be connected with high-bandwidth links but might not be directly reachable via IP multicast. For instance, they might be behind an application-level firewall that does not let any IP packets pass. For these sites, mixing may not be necessary, in which case another type of RTP-level relay called a translator (defined exactly below) may be used. Two translators are installed, one on either side of the firewall, with the outside one tunneling all received multicast packets through a secure connection to the translator inside the firewall. The translator inside the firewall sends them again as multicast packets to a multicast group restricted to receivers on the internal subnetwork.
RTP thus meets the requirement of supplying services simultaneously to recipients with different QoS attributes: group members having low quality requirements or low bandwidth do not dictate the QoS attributes for all other communication partners. The translator and the mixer are defined for these purposes. These RTP-level devices may be implemented for a variety of purposes. An example is a video mixer that scales the images of individual people in separate video streams and composites them into one video stream to simulate a group scene.
b) Synchronization Source (SSRC)
The source of a stream of RTP packets is identified by a 32-bit numeric Synchronization Source (SSRC) identifier carried in the RTP header, so as not to be dependent upon the network address. All packets from an SSRC form part of the same timing and sequence number space, so a receiver groups packets by SSRC for playback. Examples of SSRCs include the sender of a stream of packets derived from a signal source such as a microphone, a camera or an RTP mixer. An SSRC may change its data format, e.g., audio encoding, over time. The SSRC identifier is a randomly chosen value meant to be globally unique within a particular RTP session. A participant need not use the same SSRC identifier for all the RTP sessions in a multimedia session; the binding of the SSRC identifiers is provided through RTCP. If a participant generates multiple streams in one RTP session, e.g., from separate video cameras, each stream must be identified by a different SSRC.
c) Contributing Source (CSRC) This is a source of a stream of RTP packets that has contributed to the combined stream produced by an RTP mixer. The mixer inserts a list of the SSRC identifiers of the sources that contributed to the generation of a particular packet into the RTP header of that packet. This list is called the CSRC list. An example application is audio conferencing where a mixer indicates all the talkers whose speech was combined to produce the outgoing packet, allowing the receiver to indicate the present talker, even though all the audio packets contain the same SSRC identifier of the mixer.
d) End System in Multimedia Streaming
An end system is an application that generates the content to be sent in RTP packets and/or consumes the content of received RTP packets. An end system can act as one or more synchronization sources in a particular RTP session, but typically only one.
e) Translator
The translator is an intermediate system that forwards RTP packets with their SSRC identifiers intact. Examples of translators include devices that convert encodings without mixing, replicators from multicast to unicast and application-level filters in firewalls. Keeping the SSRC intact enables the recipients to identify the original transmitter of the multimedia stream behind the firewall. A translator can be used, e.g.:
• to convert an already existing multimedia stream to a lower data rate;
• to change the coding format;
• to tunnel through a firewall.
When the new multimedia stream is generated, it must be ensured that the timestamps are adapted. A recipient can then recognize, on account of the identical SSRC, that both streams have identical contents; it can therefore switch between the two streams as desired and synchronize them by means of the timestamps.
f) Mixer
The mixer is an intermediate system that receives RTP packets from one or more sources, possibly changes the data format, combines the packets in some manner and then forwards a new RTP packet. Since the timing among multiple sources is not generally synchronized, the mixer makes timing adjustments among the streams and generates its own timing for the combined stream. Thus, all data packets originating from a mixer are identified as having the mixer as their SSRC: since a common timestamp cannot be generated from the different sources, a new SSRC is set and a new timestamp is generated. All sources involved are appended to the new RTP header as CSRCs in order to provide a further identification of the original streams.
g) Multiplexing RTP Sessions
For efficient protocol processing, the number of multiplexing points should be minimized. In RTP, multiplexing is provided by the destination transport address (network address and port number), which defines an RTP session. For instance, in a teleconference composed of audio and video media encoded separately, each medium should be carried in a separate RTP session with its own destination transport address. It is not intended that the audio and video be carried in a single RTP session and demultiplexed based on the payload
type or SSRC fields. Interleaving packets with different payload types but using the same SSRC would introduce several problems:
• If one payload type were switched during a session, there would be no general means to identify which of the old values was replaced by the new one.
• An SSRC is defined to identify a single timing and sequence number space. Interleaving multiple payload types would require different timing spaces if the media clock rates differ, and would require different sequence number spaces to tell which payload type suffered packet loss.
• The RTCP sender and receiver reports can only describe one timing and sequence number space per SSRC and do not carry a payload type field.
• An RTP mixer would not be able to combine interleaved streams of incompatible media into one stream.
• Carrying multiple media in one RTP session precludes the use of different network paths or network resource allocations if appropriate, and the reception of a subset of the media if desired, e.g., just the audio if the video would exceed the available bandwidth. Furthermore, it precludes receiver implementations that use separate processes for the different media, whereas using separate RTP sessions permits either single- or multiple-process implementations.
Using a different SSRC for each medium but sending them in the same RTP session would avoid the first three problems but not the last two.
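The separation of media into distinct RTP sessions can be made concrete with a small sketch: one UDP socket per session, each with its own destination transport address. The addresses and ports below are hypothetical; by convention an RTP session uses an even port, with RTCP on the next odd port.

import socket

# One RTP session per medium, each with its own transport address.
AUDIO_RTP = ("233.0.0.1", 5004)  # audio session: RTP 5004, RTCP 5005
VIDEO_RTP = ("233.0.0.1", 5006)  # video session: RTP 5006, RTCP 5007

audio_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
video_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Receivers demultiplex purely by transport address: audio packets
# arrive only on the audio session's port, video only on the video one.
audio_sock.sendto(b"<rtp audio packet>", AUDIO_RTP)
video_sock.sendto(b"<rtp video packet>", VIDEO_RTP)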
h) RTP Packet Format
Figure 4 shows the RTP packet header (Schulzrinne, 1990a). The meaning of the fields:
• version (V): 2 bits. This field identifies the version of RTP. The version defined in the current specification is 2.
• padding (P): 1 bit. If the padding bit is set, the packet contains one or more additional padding octets at the end, which are not part of the payload. Padding may be needed by some encryption algorithms with fixed block sizes or for carrying several RTP packets in a lower-layer protocol data unit.
• extension (X): 1 bit. If the extension bit is set, the fixed header is followed by exactly one header extension. The extension mechanism is provided to allow individual implementations to experiment with new payload-format-independent functions that require additional information to be carried in the RTP data packet header.
Figure 4: The RTP packet header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
• CSRC count (CC): 4 bits. The CSRC count contains the number of CSRC identifiers that follow the fixed header.
• marker (M): 1 bit. The interpretation of the marker is defined by a profile. It is intended to allow significant events such as frame boundaries to be marked in the packet stream.
• payload type (PT): 7 bits. This field identifies the format of the RTP payload.
• sequence number: 16 bits. The sequence number increments by one for each RTP data packet sent, and may be used by the receiver to detect packet loss and to restore packet sequence. The initial value of the sequence number is randomized in order to make known-plaintext attacks on encryption more difficult, even if the source itself does not encrypt, since the packets may flow through a translator that does. Techniques for choosing unpredictable numbers are presented in Eastlake (1994).
• timestamp: 32 bits. The timestamp reflects the sampling instant of the first octet in the RTP data packet. The sampling instant must be derived from a clock that increments monotonically and linearly in time to allow synchronization and jitter calculations. The resolution of the clock must be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter; for instance, one tick per video frame is generally insufficient.
• SSRC: 32 bits. The SSRC field identifies the synchronization source.
• CSRC list: 0 to 15 items, 32 bits each. The CSRC list identifies the contributing sources for the payload contained in this packet. The CC field gives the number of identifiers. Mixers insert the CSRC identifiers, using the SSRC identifiers of the contributing sources.
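The fixed header lends itself to a compact parser. The following Python sketch unpacks the fields of Figure 4 from a received packet; it is a minimal illustration of the header layout, not a complete RTP implementation (it ignores, for instance, header extension and padding handling).

import struct

def parse_rtp_header(packet):
    """Parse the 12-byte fixed RTP header of Figure 4, plus the CSRC
    list whose length is given by the CC field."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F  # CSRC count
    csrcs = struct.unpack("!%dI" % cc, packet[12:12 + 4 * cc])
    return {
        "version": b0 >> 6,           # should be 2
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": ts,
        "ssrc": ssrc,
        "csrc_list": list(csrcs),
    }

# Example: version 2, no padding or extension, CC = 0, payload type 0.
pkt = struct.pack("!BBHII", 0x80, 0x00, 1234, 567890, 0xDEADBEEF)
print(parse_rtp_header(pkt))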
i) Implementations
RTP can be implemented as a separate protocol layer, though it is usually integrated into the streaming applications themselves. It provides these applications with additional facilities for synchronization, identification, demultiplexing and error detection of multimedia streams in distributed multimedia systems.
Multicast Transport Protocol - 2
The Multicast Transport Protocol - 2 (MTP-2) is a modification of the original MTP, including some useful features to support remote collaboration, such as teleconferencing services. One of the modifications is the possibility of immediate joins: unlike in MTP, new sites are not required to wait until all transmit tokens are in the master's possession, and all messages older than the join message are ignored by the new member (Bormann, 1994).
Scalable Reliable Multicast
An interesting application, the distributed whiteboard tool called wb (Floyd, 1995), uses the Scalable Reliable Multicast (SRM) transport protocol to distribute data reliably among all receivers. Participants multicast NACKs to request retransmission of lost data, and any member that has the requested data can answer. To avoid generating multiple copies of retransmitted data, retransmissions are multicast to the group. To further reduce the multiple-copy problem, a site waits a random period of time before sending a NACK or retransmitting data, and suppresses its own transmission if it hears the same NACK or retransmission from another member of the group. One of the problems with SRM is that the algorithm ends up consuming a lot of bandwidth when there is little correlation of losses among receivers (Kausar, 1998). For
instance, in a group of 1,000 receivers, when only one receiver loses a packet, all 1,000 receivers need to process the multicast NACK and repair packets. This causes significant overhead. Also, if one particular set of hosts requires a packet, it is not desirable to multicast the packet to the whole group. One possible method of improving SRM's efficiency is to use localized recovery. The idea is to multicast NACKs and repairs locally, to a limited area instead of to the whole group. Using the TTL (Time to Live) field in the IP packet header is one possible way to implement such scope control.
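As an illustration of TTL-based scope control, the sketch below sets the multicast TTL on a UDP socket before sending a repair request, so that routers and tunnels whose threshold exceeds the chosen TTL will not forward it. The group address, port and TTL value are hypothetical.

import socket

LOCAL_SCOPE_TTL = 15  # e.g., keep repair traffic site-local

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Lower the IP TTL of outgoing multicast packets so that NACKs and
# repairs stay within a limited area instead of reaching the whole group.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, LOCAL_SCOPE_TTL)
sock.sendto(b"<NACK for sequence number 42>", ("233.0.0.2", 6000))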
Log-Based Reliable Multicast
The Log-Based Reliable Multicast (LBRM) (Holbrook, 1995) uses a logging server to log all packets transmitted by the source and to retransmit lost packets. The source keeps data that has already been sent until it gets an ACK from the logging server. LBRM uses NACKs between the receivers and the logging server, and ACKs between the logging server and the source. Logging servers may be organized in a hierarchy. These servers decide, on the basis of the number of received ACKs, whether they should multicast or unicast a lost packet.
Reliable Adaptive Multicast
The Reliable Adaptive Multicast Protocol (RAMP) provides NACK-based reliable delivery (Koifman, 1996). A RAMP sender starts by sending a connect message to an IP multicast address. The sender can start multicasting the data after it receives at least one accept. Data messages contain monotonically increasing sequence numbers, and once a receiver detects a gap in the message sequence, it requests a retransmission by unicasting a NACK back to the sender. The application selects whether retransmitted messages are multicast to the group or unicast to the requester. RAMP uses a simple rate-based flow control in which the sender adjusts its transmit rate based on the feedback received from the slowest receiver in the group.
Transport Protocol for Reliable Multicast
The Transport Protocol for Reliable Multicast (TRM) provides reliable multipoint communication to support interactive multimedia applications such as shared whiteboards and network editors (Sabata, 1996). TRM assumes no knowledge of group membership at the source and allows receivers to join and leave an existing multicast group. Group membership management is handled by the application.
Real-Time Streaming
The application-level Real-Time Streaming Protocol (RTSP) aims to provide a robust protocol for streaming multimedia in one-to-many applications over unicast and multicast, and to support interoperability between clients and servers from different vendors (Johnson, 1999). Streaming breaks data into many packets, sized appropriately for the bandwidth available between the client and server. When the client has received enough packets, the user software can be playing one packet, decompressing another and receiving a third, so the user can begin listening almost immediately, without having to download the entire media file. Sources of data for streaming can include both live data feeds and stored clips (Johnson, 1997a).
RTSP is considered more of a framework than a protocol. It is intended to control multiple data delivery sessions and to provide a means for choosing delivery channels such as UDP, TCP and IP multicast, as well as delivery mechanisms based on RTP. Control mechanisms such as session establishment and licensing issues are being addressed. RTSP is designed to work on top of RTP, both to control and to deliver real-time content. Thus, RTSP realizations are able to take advantage of RTP improvements, such as the new standard for RTP header compression (Johnson, 1997a).
Resource Reservation
The Resource Reservation Protocol (RSVP) is a network control protocol that allows a data receiver to request a special end-to-end QoS for its data flows (Liu, 1998). Real-time applications use RSVP to reserve the necessary resources at routers along the transmission paths, so that the requested bandwidth is available when the transmission actually takes place. IP multicast works in concert with other IP protocols and services, such as QoS requests, to support real-time multimedia. While RTCP can provide feedback on reception quality, other protocol mechanisms are needed to request timely delivery and guarantee a specific QoS for a session between a sender and receivers (Johnson, 1997a).
RSVP is a resource reservation setup protocol designed for integrated services on the Internet. An application invokes RSVP to request a specific end-to-end QoS for a data stream. RSVP aims to set up guaranteed-QoS resource reservations efficiently, to support unicast and multicast routing protocols and to scale well for large multicast delivery groups. A receiving host uses RSVP to request a specific QoS from the network for a particular data stream from a data source. An atomic RSVP reservation request consists of a specification of the desired end-to-end QoS (e.g., peak or average bandwidth and delay bounds) and a definition of the set of data packets to receive that QoS. RSVP is useful for environments where QoS reservations can be supported by reallocating rather than adding resources. For multicast, a host sends IGMP messages to join a host group and then sends RSVP messages to reserve resources along the delivery path(s) of that group. RSVP gives access to integrated services on the Internet, where hosts and networks work in concert to achieve a guaranteed quality of end-to-end transmission. All the hosts, routers and other network infrastructure elements between the receiver and the sender must support RSVP. Each of them reserves system resources such as bandwidth, CPU and memory buffers in order to satisfy the QoS request (Johnson, 1999). RSVP QoS control request messages are sent to reserve resources at all the nodes (routers and hosts) along the reverse delivery path to the sender. Note that RSVP is receiver-initiated: RSVP requests resources in only one direction. For multicast, the reservation request need only travel to a point where it merges with another reservation for the same source stream. This receiver-oriented design is intended to accommodate large multicast groups and dynamic group membership (Johnson, 1997a).
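The first step of the sequence above, joining the host group, can be sketched in a few lines of Python: setting the IP_ADD_MEMBERSHIP socket option causes the host's IP stack to send an IGMP membership report to the local router. The group address and port are hypothetical, and any RSVP reservation would have to be requested separately.

import socket
import struct

GROUP, PORT = "233.0.0.3", 7000  # hypothetical group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# ip_mreq: 4-byte group address + 4-byte local interface (INADDR_ANY).
mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                   socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# The call below blocks until a datagram addressed to the group arrives.
data, sender = sock.recvfrom(1500)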
Support of Data Distribution Applications
Multicast Usenet Feeds
Multicast Usenet Feeds (Muse) was developed to multicast news articles on the MBone (Lidl, 1994). The Muse transmitter accepts a news article through a standard NNTP interaction; in this way, to other news feeds a Muse feed looks like an NNTP feed.
Multicast Dissemination
The Multicast Dissemination Protocol (MDP) provides a reliable multicast framework for file distribution (Macker, 1996). The MDP sender fragments the file to be transmitted into a sequence of MDP Maximum Data Units (MDUs), which are multicast to the group using the UDP/IP multicast suite. MDP receivers assemble the received data units into the original file. MDP has a complex recovery mechanism, including an operation mode in which the receivers may request repairs at any time during the transmission of a file. The MDP source can also request ACKs from receivers, who answer with information about their present state.
Adaptive File Distribution
The Adaptive File Distribution Protocol (AFDP) provides a reliable file distribution service on top of UDP (Cooperstock, 1996). AFDP senders are called publishers and receivers are called subscribers. A special subscriber in a group, the group's secretary, is responsible for managing group membership, authorizing publishers and determining the transmission mode to be used. AFDP allows sites to join and leave a distribution group by contacting the group's secretary, but it does not provide a recovery mode to handle site failures or network partitions.
Tree-Based Multicast Transport
The Tree-Based Multicast Transport Protocol (TMTP) provides reliable multicast communication support for one-to-many bulk data dissemination applications (Yavatkar, 1995). For error and flow control, TMTP organizes the group members into a control tree. Each subtree, or domain, is rooted at a domain manager; domains typically correspond to subnets. The control tree grows and shrinks as members join and leave the group. TMTP supports dynamic group membership, where sites can join and leave a multicast group at any time during the life of a session. A session directory service provides primitives for creating, deleting, joining and leaving a multicast group; however, knowledge of group membership is not assumed.
Reliable Multicast Transport
The Reliable Multicast Transport Protocol (RMTP) supports reliable one-to-many bulk data transmission (Lin, 1996). Like TMTP, RMTP addresses feedback implosion by organizing the group members into a control tree, along which feedback is propagated and processed. Intermediate nodes in the control tree, called designated receivers, are responsible for buffering data received from the source, processing ACKs from their downstream nodes in the distribution tree and retransmitting lost packets. RMTP is applied in Lucent Technologies' e-cast reliable multicast product family (Obraczka, 1998).
Multicast File Transfer
The Multicast File Transfer Protocol (MFTP) is another multicast protocol that supports reliable one-to-many data transmission; it was developed by StarBurst Communications (Miller, 1997). MFTP consists of a group management protocol and a data transmission protocol. Using the data transmission protocol, the MFTP sender announces a file transfer session periodically during the announcement phase. MFTP is used in StarBurst's multipoint file transfer software family.
Reliable Multicast Data Distribution
In the Reliable Multicast Data Distribution Protocol (RMDP), the problem of ensuring reliable data delivery to large groups, and of adapting to heterogeneous clients, is solved by a Forward Error Correction (FEC) technique based on erasure codes (Kausar, 1998). The basic idea behind the use of erasure codes is that the original source data, in the form of a sequence of k packets, is transmitted by the sender along with n additional redundant packets, and the redundant data can be used to recover lost source data at the receivers. A receiver can reconstruct the original source data once it receives a sufficient number (any k) of the transmitted packets. The main benefit of this approach is that different receivers can recover from different lost packets using the same redundant data. In principle, this idea can significantly reduce the number of retransmissions, since a single retransmission of redundant data can potentially benefit many receivers simultaneously.
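The erasure-code idea can be illustrated with the simplest possible redundant packet, a single XOR parity over k source packets, which lets a receiver reconstruct any one lost packet. This toy Python sketch is only a stand-in for the real codes (such as Reed-Solomon codes) that tolerate several losses; the packet contents are arbitrary examples.

# A single XOR parity packet over k source packets: any one lost
# packet out of the k+1 transmitted can be reconstructed.
def xor_parity(packets):
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return parity

source = [b"AAAA", b"BBBB", b"CCCC"]  # k = 3 source packets
sent = source + [xor_parity(source)]  # plus one redundant packet

# Suppose the second packet is lost in transit: XOR-ing everything that
# did arrive reproduces it, whichever single packet was lost.
received = [sent[0], sent[2], sent[3]]
recovered = xor_parity(received)
assert recovered == source[1]
print(recovered)  # b'BBBB'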
Combining Interactivity and Reliability
Interactive and dissemination services differ in their delivery delay (or delay variance) and reliability requirements (Obraczka, 1998). While interactive applications are prepared to sacrifice reliability in favor of real-time delivery, data dissemination tools tolerate higher delivery delays but require reliable delivery. The resilient multicast delivery model addresses the different reliability and interactivity requirements among different applications, as well as among different receivers in the same multicast group. The idea is that, in applications such as interactive teleconferencing, different participants have different levels of participation: some are purely spectators, while others are more active. Passive receivers would rather have high playback quality, which can be achieved by delayed playback combined with retransmissions.
Structure-Oriented Resilient Multicast
The IP multicast delivery mechanism provides a popular basis for the delivery of continuous media to many participants in a conferencing application. However, the best-effort nature of multicast delivery results in poor playback quality in the presence of network congestion and packet loss. Contrary to the widespread belief that the real-time nature of continuous media applications precludes the recovery of lost packets using retransmissions, these applications in fact offer an interesting tradeoff between the desired playback quality and the required degree of interactivity (Xu, 1997). In the resilient multicast model, each receiver in a multicast group can decide its own tradeoff between reliability and real-time requirements. To be effective, error recovery mechanisms in such a model need to be both fast (due to the real-time constraint) and low in overhead (due to the high volume of continuous media data). To realize the resilient multicast model, the Structure-Oriented Resilient Multicast (STORM) protocol was designed, in which senders and receivers collaborate to recover from lost packets using two key ideas:
1. Group participants self-organize themselves into a distribution structure and use the structure to recover lost packets from adjacent nodes.
2. The distribution structure is dynamic, and a lightweight algorithm is used to adapt the structure to changing network traffic conditions and group membership.
Comparison of the Various Transport Protocols
The tables below summarize some properties of the various multicast transport protocols, based on Obraczka (1998). Table 3 shows the loss recovery and retransmission properties of some protocols. It should be noted that the MFTP protocol uses multicast for NACKs, while unicast is applied for ACKs. Table 4 presents the reliability mechanisms and data transfer methods applied in the protocols. Table 5 shows the locus of control and the target applications of the various transport protocols. The tables demonstrate the diversity of the various multicast transport protocols, which is due to the different requirements of their target applications.
Table 3: The loss recovery (repair request) and datagram retransmission mechanisms of the various transport protocols

                                        Retransmission
Loss recovery       Unicast                   Unicast/multicast   Multicast                 Nothing
(repair request)
Unicast             Muse, RBP, URGC, STORM    LBRM, RAMP, RMTP    MTP, MTP-2, AFDP, TMTP    -
Multicast           -                         RMP                 SRM, TRM, MDP, MFTP       -
Nothing             -                         -                   -                         RTP

Table 4: The mode of data packet transfer and the applied reliability mechanism in the various transport protocols

                                        Applied reliability mechanism
Data transfer                 ACK based   NACK based                           Both used   Nothing
Uni-, multi- and broadcast    -           AFDP                                 -           -
Multicast                     RMTP        MDP, MTP, MTP-2, Muse, RMP, SRM,     LBRM        RTP
                                          RAMP, TRM, TMTP, STORM
Broadcast                     URGC        -                                    RBP         -
Unicast and multicast         -           XTP                                  -           -

Table 5: The locus of control and the target application of the various transport protocols

                                        Target application
Locus of control    General purpose                  Interactive    Interactive multimedia,    Network news, file
                                                     simulation     stream media               and data transfer
Centralized         MTP, MTP-2, RBP, RMP,            LBRM           -                          AFDP
                    XTP, URGC
Distributed                                          -              RAMP, RTP, SRM, TRM,       MDP, MFTP, Muse,
                                                                    STORM                      TMTP, RMTP
MULTICAST BACKBONE (MBONE)
Internet multicast requires new routing protocols and forwarding algorithms, but not all routers can be upgraded simultaneously (McCanne, 1997). To solve this problem, the requisite infrastructure has been put in place incrementally, as a virtual multicast network embedded in the traditional unicast-only Internet; it is called the Internet Multicast Backbone (MBone).
The Structure
The MBone is an experimental, cooperative volunteer effort spanning several continents. Research and testing of multicast protocols and services have been conducted extensively on the MBone, which is, in fact, a virtual network layered on top of the physical Internet to support routing of IP multicast packets. The reason for it is that the IP multicast function has not yet been integrated into many production routers. The MBone virtual network is composed of islands that can directly support IP multicast, such as multicast LANs like Ethernet, linked by virtual point-to-point links called tunnels. The tunnel endpoints are typically hosts whose operating system supports IP multicast and which run the mrouted multicast routing daemon (Casner, 1994). The number of participating sites has grown rapidly as a result of interest and utility. The growth of the MBone is illustrated in Figure 5, which presents the number of connected subnets per year (McCanne, 1997).

Figure 5: The growth rate of the MBone-connected subnets, 1992-1996 (no accurate data are available after 1996 due to the diversity of the MBone)

IP multicast packets are encapsulated into unicast IP packets for transmission through unicast paths that are called unicast tunnels, or shortly, tunnels. The packets in the tunnel look like normal unicast datagrams to the intervening routers and subnets. A multicast router that wants to send a multicast packet across a tunnel prepends another IP header and sets the destination address in the new header to the unicast address of the multicast router at the other end of the tunnel. The multicast router at the other end of the tunnel receives the packet, strips off the encapsulating IP header and forwards the packet as required. The MBone topology is a combination of mesh and star: the backbone and regional networks are linked by a mesh of tunnels among mrouted hosts located at the interconnection points of the backbones. Some redundant tunnels are configured with higher
metrics for robustness. Each regional network has a star hierarchy hanging off its node of the mesh to fan out and connect to all customer networks that want to participate. The MBone is a cooperative project combining the networking knowledge distributed among the participants, so that no single individual is loaded with the responsibility of designing and managing the whole topology. When a new regional network wants to join, it makes a request on the appropriate MBone mailing list, and the participants at the nearby nodes answer and cooperate in setting up the ends of the appropriate tunnels. To keep the fanout down, this sometimes means breaking an existing tunnel to insert a new node, so that three sites have to work together to set up the tunnels. Knowing which nodes are close requires knowledge of both the MBone logical map and the underlying physical network topology (ISI, 1999; Rajvaidya, 1999). For instance, Mantra (Monitor and Analysis of Traffic in Multicast Routers) is a tool for monitoring various aspects of multicast at the router level. The results from Mantra depict snapshots of various constituents of the multicast infrastructure from the point of view of the routers being monitored. Within a regional network, the network's own manager can freely manage the tunnel fanout hierarchy in conjunction with the end-user participants. Because most of the MBone nodes are connected with lines of at least T1 speed, restricted traffic can also be carried over slower lines. Each tunnel has an associated threshold against which the packet's IP time-to-live (TTL) value is compared. By convention, higher-bandwidth sources (such as multimedia transmissions) send with a smaller TTL value, so they can be blocked at such tunnels, while lower-bandwidth sources (such as compressed audio) are allowed through.
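The threshold convention amounts to a one-line check: a tunnel forwards a multicast packet only if the packet's remaining TTL is not below the tunnel's configured threshold. The values in this Python sketch are illustrative only, not a recommended configuration.

# Sketch of the MBone tunnel threshold check; the threshold and TTL
# values are hypothetical examples.
TUNNEL_THRESHOLD = 128  # e.g., a slow tunnel meant to admit audio only

def tunnel_forwards(packet_ttl, threshold=TUNNEL_THRESHOLD):
    return packet_ttl >= threshold

print(tunnel_forwards(127))  # video sent with a small TTL: blocked
print(tunnel_forwards(191))  # audio sent with a larger TTL: forwarded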
Multicast Routing for the MBone
The MBone is a virtual network layered on top of the Internet. Due to the difference between the MBone and the Internet topology, the multicast routers must run their own topology-discovery protocol, i.e., routing protocol, in order to decide where to forward multicast packets. Most of the MBone routers today use DVMRP as the routing protocol. Other multicast routing protocols, such as MOSPF and PIM, are also being applied in the MBone, but DVMRP is still used by the majority of MBone routers (Banikazemi, 1997). The non-DVMRP routers can interoperate with DVMRP routers by implementing a subset of DVMRP. DVMRP routers treat the MBone topology as a single, flat routing domain. This means that each router maintains a separate entry in its routing table for every subnet in the MBone and exchanges periodic routing messages with each of its neighbors identifying every subnet. The method is therefore sensitive to additional processing costs whenever there is a topological change anywhere in the MBone. As the number of subnets grows exponentially, those routing costs grow similarly. The solution to this problem in unicast routing is hierarchical routing; in a similar manner, a two-level hierarchical routing model was created to solve this problem in the MBone (Thyagarajan, 1995). This approach involves partitioning the MBone into non-overlapping regions, using DVMRP as the inter-region routing protocol; intra-region routing is accomplished by any of various existing multicast protocols. A feature of this approach is the independence of the higher level routing protocol from the subnet addresses, which allows easy incremental deployment with small changes to the existing intra-region protocols. Recently a new multicast service model has been introduced, called Source-Specific Multicast (SSM), which solves some scaling problems of the shared trees and Rendezvous Points used in wide-area multicast routing (Quinn, 2001). The SSM service model is based on the source-filtered IGMP Version 3 and on an improved PIM-SM, called Source-
Specific PIM (Sheperd, 2000). Source filtering means that the host can accept or reject each individual source of a certain multicast group. In the SSM model a channel is introduced as a pair of source and group addresses, and the receivers choose a channel instead of a group when they join a multicast session (Bhattacharyya, 2001).
Tunneling
Many Internet hosts, while capable of running multicast applications, cannot access the MBone because:
• The routers that connect them to the Internet do not yet support IP multicast routing.
• Their operating systems cannot support a tunneled implementation of IP multicast routing.
Therefore, IP tunneling is a provisional technique used to connect islands of multicast routers separated by links that do not support IP multicast. In this model, multicast datagrams are encapsulated in standard point-to-point unicast datagrams. A tunnel is a path between two multicast routers over which a multicast packet is transmitted by encapsulating it inside a unicast IP packet addressed to the router at the far end of the tunnel. Tunneling in the context of multicast thus refers to the encapsulation of multicast packets in IP unicast datagrams in order to route them through the non-multicast-routing-capable parts of the Internet. Figure 6 presents a part of the MBone virtual network: five IP multicast-capable (IPMC) routers connecting four multicast islands together. The IPMC routers in the MBone apply special routing protocols to create a distribution tree on the basis of the unicast channels among them. The UDP Multicast Tunneling Protocol (UMTP) enables a host to establish a connection to the MBone by tunneling multicast UDP datagrams inside unicast UDP datagrams (Finlayson, 1999). By using UDP, this tunneling can be implemented as a user-level application, without requiring changes to the operating system of the host. It is important to note, however, that this tunneling mechanism is not a substitute for proper multicast routing and should be used only in environments where multicast routing cannot be used. The UMTP is implemented by the liveGate multicast tunneling server (Live, 1999).

Figure 6: Application of IP tunnels among multicast islands
[Figure 6 shows multicast islands with IPMC routers 1-5 interconnected by IP unicast tunnels across a non-multicast-capable network, with native IP multicast inside each island.]
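As promised above, here is a minimal one-way sketch of the encapsulation idea in Python. It is illustrative only: the framing is a simple assumption, not the actual UMTP wire format, and the group, port and remote endpoint address are hypothetical:

```python
import socket
import struct

# Minimal one-way tunnel endpoint sketch: multicast datagrams received on
# the local island are wrapped in unicast UDP datagrams addressed to the
# remote tunnel endpoint, which can unwrap and re-multicast them.
GROUP, PORT = "224.2.2.2", 5004           # hypothetical session
REMOTE_ENDPOINT = ("198.51.100.7", 9010)  # hypothetical far end of the tunnel

mcast = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mcast.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
mcast.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
mcast.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

uni = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    payload, _ = mcast.recvfrom(2048)
    # Prefix the group and port so the far end knows where to re-multicast.
    header = socket.inet_aton(GROUP) + struct.pack("!H", PORT)
    uni.sendto(header + payload, REMOTE_ENDPOINT)
```

Because everything runs over ordinary UDP sockets, such an endpoint needs no kernel support for multicast routing, which is exactly why UMTP can be deployed as a user-level application.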
Connecting to the MBone
The mTunnel is an application that tunnels multicast packets over a unicast UDP channel (Parnes, 1998). Several multicast streams can be sent over the same tunnel while the tunnel still uses only one port, which is useful when tunneling through a firewall. The primary goal of the application is to allow easy tunneling of multicast over, e.g., a modem or an ISDN connection. The mTunnel has a built-in Web server allowing easy access to information about the present tunnels; this server listens by default on port 9000 of the machine on which mTunnel is started. The mTunnel also listens to session announcements for easier tunneling of known sessions. Very useful documentation is available for anyone who wants to connect to the MBone.
Some MBone Tools
IP multicast and its scalability have been exploited in the design of a number of end-to-end protocols and multicast applications, such as:
• audio conferencing tools (real-time media transport of audio with accompanying control information): nevot, rat, vat, etc.;
• video conferencing tools: ivs, nv, vic, etc.;
• reliable multicast transport of persistent data that underlies shared tools: wb, nte;
• session directories to create and announce the existence of multicast sessions, e.g., sdr, multikit, isc, mmcc, gwTTS.
The first packet audio conference was held using the Network Voice Protocol (NVP) implemented within vt. Improving upon vt, new algorithms for counteracting network jitter through receiver buffering were elaborated, and the tool was enhanced with a graphical user interface, creating the new application called Visual Audio Tool (vat) at the Lawrence Berkeley National Laboratory (LBL). Later, the nevot audio tool was the first to apply the early versions of RTP, nowadays the IETF standard for Internet packet audio. The rat is an audio conferencing application that uses a forward error correction scheme in which redundant information is coded at lower quality; hence, modest packet loss rates cause minor quality degradation rather than more noticeable audio break-ups. The INRIA released its video conferencing tool, ivs, an integrated multimedia conferencing system that uses H.261 for video compression (ITU, 1993). The Network Video tool (nv) from Xerox PARC is a video-only application that became a de facto standard for MBone video.
The shared whiteboard application wb adds a further medium to the audio and video channels, for shared graphical annotation and updating of online documents. Unlike multimedia streams, which can continue in the presence of packet loss with a momentary degradation of quality, wb requires reliable transport, since drawing state is persistent and a lost drawing update must eventually be retransmitted to all the affected participants. The design of a protocol that meets this requirement is a fundamentally difficult problem, since the loss recovery algorithm must scale to very large numbers of receivers in an environment like the Internet, where network partitions are common and packet loss rates are high. For wb a new approach to reliable multicast, namely Scalable Reliable Multicast, was developed. The video conferencing application vic from the LBL uses the Tenet real-time networking protocol (Ferrari, 1990) and supports the light-weight session (LWS) architecture.
The application nte (Network Text Editor) also requires reliable delivery and adopts the LWS design philosophy. The Session Directory (sd) is a tool designed to allow the advertisement and joining of multicast conferences on the MBone. It was originally modeled on the sd tool written by Van Jacobson at LBL, but implements a later version of the session description protocol than the original sd does. It is used to reserve or allocate media channels and to view advertised channels. The sd also makes periodic announcements to well-known multicast addresses and ports (Johnson, 1999). A user creates and advertises a conference or session with sd; in turn, each participant uses sd to automatically launch all the media tools pertaining to that session, freeing the user from the complicated configuration of multicast addresses, ports and scopes. Handley and Jacobson refined this work into the Session Description Protocol (SDP), and a much-improved SDP-based session directory tool called sdr was then developed (Handley, 1995).
A session description expressed in SDP is a short, structured, textual description of the name and purpose of the session, together with the media, protocols, codec formats, timing and transport information. This information is needed to decide whether a session is likely to be of interest and to know how to start the media tools needed to participate in it (Crowcroft, 1998). SDP does not include a transport protocol; it is intended to be carried by different transport protocols as appropriate, including RTSP, the Hyper-Text Transport Protocol (HTTP) and electronic mail using the MIME extensions. SDP carries the timing, formats, protocols, contents, etc., applied in the session, and is used by both the Session Announcement Protocol (SAP) and the Session Initiation Protocol (SIP). SAP delivers information about actual sessions on a well-known multicast address, while SIP invites a specific user to a session (Schefström, 1999). The sd_listen is an MBone session directory; it listens for multicasts of addresses and descriptions of MBone broadcasts (Daviel, 1995) and is compatible with the LBL sd. The gwTTS is the University of Virginia's tele-tutoring system, isc is an integrated session controller, and mmcc is a multimedia conference control application (Esler, 1998).
The liveGate is a server program (Live, 1999) that runs on a computer already connected to the MBone. It enables a non-MBone-connected computer to access and participate in multicast services, using UMTP to create tunnels. The liveCaster (Live, 1999) application can multicast MP3 audio streams or small pieces of textual data, such as prices and quotes, to a potentially unlimited audience, even if the available Internet connection has low bandwidth. To receive and play audio data from liveCaster, the user needs to be part of the MBone and requires an MP3 player tool that can receive RTP-encapsulated MPEG audio data. There are many free RTP-capable MP3 players, e.g., FreeAmp, Audioactive, playRTPMPEG and Cisco Systems' IP/TV viewer. LiveCaster uses a multicast channel to announce the contents of a file system directory. It runs in one of two modes:
• audio mode: used to stream MP3 audio data;
• attribute mode: applied to announce textual attributes.
The multikit is a distributed, multicast-based directory browser (Live, 1999).
Just as a graphical file system browser displays directories in a file system, multikit displays one or more multicast directories, each containing announcements that have been multicast over the MBone. These announcements may describe upcoming or ongoing multimedia events, such as lectures, conferences, concerts or other types of data sources. Using multikit, the user can find out information about these events, including whether they are presently taking place or will take place in the future. Participation in any desired event is also possible, simply by double-clicking on an announcement; this usually launches the separate applications that perform the actual multimedia services. In this respect, multikit can be thought of as an interactive program guide for Internet multicast services. However, multikit also allows users to create their own announcements and to create and organize their own directories. An optional built-in tunneling mechanism allows non-MBone-connected users to access multicast services.
On the basis of the issues presented, Figure 7 shows the hierarchically structured protocol stack of current multimedia delivery, from packet transmission up to the upper-level application protocols (Crowcroft, 1998).
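To make the SDP discussion above concrete, here is a minimal session description of the kind that sdr announces. Every field value is an illustrative assumption; the media lines advertise PCM audio (payload type 0) and H.261 video (payload type 31), matching the codecs of the era:

```
v=0
o=alice 2890844526 2890842807 IN IP4 192.0.2.5
s=Multicast routing seminar
i=A hypothetical MBone session announced via SAP
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 31
```

The c= line carries the multicast group and TTL scope, and each m= line names the port and RTP payload format, which is exactly the information a directory tool needs to launch the corresponding media tools automatically.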
POLICY OF MULTICAST APPLICATIONS
Media Session Control
Multiparty collaborative applications need to transmit data reliably and efficiently in order to provide guaranteed QoS. Multimedia applications often involve a large number of participants and are interactive in nature, with participants dynamically joining or leaving. IP multicast gives scalability and efficient routing, but it does not provide the reliability these multimedia applications require. There are two models of conference/session control:
• formal/tightly coupled, and
• informal/loosely coupled/light-weight.
Light-weight sessions lack conference membership control and explicit conference control mechanisms: they are multicast media conferences without an explicit session membership and control mechanism (Crowcroft, 1998). Such a light-weight session consists of a number of many-to-many media streams based on RTP/RTCP using IP multicast. Typically, the only conference control information provided during the course of a light-weight session is that distributed in the RTCP session information, e.g., an approximate membership list with some attributes per member. By contrast, tightly coupled conferences, where the media streams flow on a mainly one-to-many or one-to-one basis, require an explicit conference control mechanism. In such a communication model, a user interface is provided where the chair can choose to give the floor to one of the participants, so that only one person at a time can talk, take control of the shared whiteboard or use the video channel.
Figure 7: The multimedia streaming protocol stack
[Figure 7 layers, top to bottom: Session directory (SDP) and Conference Control; Shared Tools, Audio and Video over RTP/RTCP; HTTP, SMTP and SDAP; TCP and UDP; IP Integrated Services Forwarding with RSVP.]
Light-Weight Sessions
The IP multicast-based network architecture, in which thin application-level protocols are built on top of IP multicast and its powerful group concept, is expressed in the Light-Weight Sessions (LWS) architecture, with its light-weight, loosely coupled and decentralized communication model (McCanne, 1997). In LWS, the collection of multicast senders and receivers communicating within one or more interdependent multicast groups, together with their respective protocols and protocol instances, is collectively called a session.
An important design goal that has consistently underscored the development of the LWS architecture is an explicit attempt to accommodate the highly heterogeneous and loosely controlled nature of the Internet. The component networks of the Internet are heterogeneous, with diverse failure modes and mixed levels of reliability. To account for this, LWS protocols are deliberately designed to be robust and forgiving. For instance, if a broken network link causes a temporary network partition, the higher-level LWS protocols do not reset or terminate the session. Instead, the session continues its existence as a set of partitioned sub-sessions. Later, when the network re-forms, the session recovers and integrates independently generated activity and state across the healed partition. LWS achieves this robustness to network failures by exploiting receiver-driven, soft-state protocols that treat recovery from such failures as part of normal protocol function.
The robustness and scalability of the LWS architecture are embodied in several design principles that are exploited consistently across the protocols and applications in this framework:
• Shared Multicast Control. By multicasting control information in addition to data, hosts can intelligently share information to improve the scalability and robustness of their underlying protocols.
• Adaptation. A key principle in LWS is to adapt the application to the network rather than the network to the application. LWS applications are typically delay-adaptive, load-adaptive, loss-adaptive, or a combination thereof. They make no a priori assumptions about the underlying network's QoS, but instead adapt to the available network resources through dynamic measurements.
• Soft State. In this model, each session member periodically announces its presence and/or a state binding by multicasting a message to the group. In turn, receivers tune into the group, listen to the announcements and assemble the announced state. Each piece of state eventually times out unless refreshed by a subsequent announcement (see the sketch after this list). With soft state, error recovery is designed into the protocol and need not be treated as a special case. This leads to highly robust distributed systems that gracefully accommodate network partitions, partial system failures, misbehaving implementations and so on.
• Receiver-Driven Design. If a multicast protocol makes decisions or carries out actions at the source on behalf of its receivers, then the scalability of that protocol is necessarily constrained, since the source must explicitly interact with an arbitrary number of receivers. Instead, in a receiver-driven design, distributing application or protocol decisions across the receivers addresses heterogeneity and scalability simultaneously. To maximize scalability, the sender's protocol should be as receiver-ignorant as possible.
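As referenced in the Soft State item above, the announce-listen-timeout cycle can be sketched in a few lines of Python. The group address, port, member identifier and timer values are all assumptions chosen for illustration:

```python
import socket
import struct
import time

# Soft-state membership sketch: each member periodically announces itself
# to the group, and every listener expires entries that are not refreshed.
GROUP, PORT = "224.3.3.3", 6000
ANNOUNCE_INTERVAL = 5.0            # seconds between announcements
STATE_TIMEOUT = 3 * ANNOUNCE_INTERVAL

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
sock.settimeout(1.0)

members = {}                       # member id -> time of last announcement
last_announce = 0.0
while True:
    now = time.time()
    if now - last_announce >= ANNOUNCE_INTERVAL:
        sock.sendto(b"ANNOUNCE host-42", (GROUP, PORT))  # hypothetical id
        last_announce = now
    try:
        data, addr = sock.recvfrom(512)
        if data.startswith(b"ANNOUNCE "):
            members[data[9:]] = now            # refresh the soft state
    except socket.timeout:
        pass
    # Soft state: entries vanish by themselves when no longer refreshed,
    # so recovery from failures is part of normal protocol operation.
    members = {m: t for m, t in members.items() if now - t < STATE_TIMEOUT}
```

Note that no explicit leave or error message is ever needed: a crashed or partitioned member simply stops refreshing its state, and everyone else's view converges on its own.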
Tightly Coupled Sessions
This type of conference may also be multicast based and use RTP/RTCP, but in addition it has an explicit session membership mechanism and may have an explicit session control mechanism. Tightly coupled sessions are initiated either by invitation or at the user's own initiative.
In the first case, a call-up mechanism is necessary, which can be combined with the explicit conference membership technique. In the latter case, the rendezvous mechanism can be handled by the same session directory that handles LWSs, with the addition of a description of the contact mechanism to be used to join the conference. The most conventional tightly coupled conferences are based on the ITU H.323 or T.120 protocols, which were initially designed to work over switched networks such as ISDN. In its original design, the scheme depended on a centralized model of control and on the reliable, constant-bit-rate properties of ISDN networks.
Virtual Communication Architectures over the Internet
A great deal of effort has been put into the development of network-aware applications. These applications provide increasing functionality to handle networking problems such as limited throughput, delay jitter and packet loss, as well as end-system heterogeneity (Jonas, 1997). With the increasing number of users on the Internet, the number of networking applications and of networking and network-related protocols and standards has increased as well. Several products that provide real-time streaming over intranets and the Internet follow from the high interest in multimedia applications. This growing number of streaming applications, coding methods, compression standards and multipoint services has led to a situation where heterogeneity is an increasing problem. For instance, a real-time decoder on a client system must match the coding format of the encoder on the server system. If servers use many different formats, clients must become very sophisticated to be able to decode yet another format, or multiple client applications with different user interfaces have to be installed at the receiving end-system. The same applies to encryption, error correction, synchronization and other features that are generally realized by independent protocols. Due to the great number of available, mostly incompatible, streaming protocols, the user has to install a special client application (a so-called player) for each system used, and these client applications usually need to be maintained and updated. To avoid this effort, it would be desirable if the user had to install only one client and the corresponding protocol conversions were done transparently in the network. With this approach, it would also be possible to offer a multimedia stream at different bit rates in order to provide each user with the stream best suited to the specific bandwidth, such as a 28.8 kbit/s modem, 64 kbit/s ISDN or 10 Mbit/s Ethernet.
In current multimedia streaming applications, the end user has a take-it-or-leave-it choice: a client application must be capable of handling the server's formats. On the networking level, applications must adapt to varying networking conditions. Servers and clients must be able to detect networking problems and to react appropriately by implementing features such as forward error correction, error detection, error recovery, congestion and rate control, stream synchronization and playout buffers. The requirements increase further in a multipoint environment, where multiple receivers might experience different networking conditions. In most applications, QoS problems at the networking level are either ignored or handled by downscaling (everybody perceives a reduced quality) or media cutting, which means that audio is delivered to all participants but video only to some. The same data stream is sent to all receivers, independent of their individual networking capabilities.
KISS: An Implemented Architecture
A possible solution to the problems listed in the previous section is the Communication Infrastructure for Streaming Services (KISS), a proposal for a virtual architecture over the Internet (Kretschmer, 1998; Mödeker, 1998). It makes it possible for a data stream, such as a live audio transmission, to be altered while it flows through a communication network. A reason for doing this is that the world has many different requirements of services, and the network should take care of this (Jonas, 1998). The approach attempts to realize an environment where the communication network provides sophisticated autonomous services to help end-system applications. The principal idea is that many features of present applications could be provided by network operators rather than users, releasing end users from the responsibility of installing and updating services that can be provided by network operation centers. The communication system architecture implemented in the KISS system moves complexity from the end-system into the network. It consists of the following three parts:
1) Network Access Point (NAP). The NAP organizes the Internet KISS network and constitutes an access point for end applications.
2) Service Agent (SA). An SA supplies content-independent services, such as the conversion of data formats (e.g., transcoding), in the KISS context.
3) End Application (EA). The EA is KISS client software that serves the standard application programs, providing an interface to the KISS virtual network.
The NAP acts as a mediator between the end-system applications (servers and clients) and the service-enhanced networks with their service applications. A NAP is an application that is installed somewhere in the Internet and waits for other applications to request its service; a well-known multicast channel is used for queries to it. KISS consists of a loose network of NAPs, which communicate via a suitable protocol and constitute the access points for end applications. A NAP has the following tasks:
• communication with EAs and other NAPs;
• announcement of available multimedia streams;
• organization and monitoring of multimedia streams;
• requesting service agent services.
The concept of KISS does not contradict the LWS principle of adapting the application to the network rather than the network to the application: KISS itself is strongly adapted to the network by the NAP and EA applications, while at the same time the end user is spared this kind of adaptation. Typically, an external service provider (EA) announces streaming services to a NAP in the form of an abstract service description (SD), and the NAP announces them to the other NAPs in accordance with the SD specifications. A client (EA) can then inquire about the details of all available SDs from a NAP and request the described multimedia streams. The connection between EA and NAP consists of a bi-directional control link and an arbitrary number of data links, which are usually unidirectional. Control communication is integrated into the client-network interface (CNI), which also describes NAP localization. It should be noted that neither server nor client requires information about the organization and activities of the network. Behind a NAP, the CNI is extended to the network-network interface (NNI), which adds methods such as requesting and announcing service information.
This information is required by the NAP if a client requests a multimedia stream. Basically, multimedia streams are transmitted via multicast; however, it should also be possible to request them from a NAP as a unicast stream if an EA is not connected to a multicast-capable network.
The SAs provide the NAPs with services, to be requested by them in a way that is transparent to the EAs. These services are the actual extensions of the services supplied by KISS. For instance, an SA can adapt the data rate of a multimedia stream to the client's bandwidth or convert a multimedia stream from coding format A to coding format B. Further possible applications are:
• the tunneling of firewalls;
• the interconnection of several KISS clouds;
• the synchronization of several streams;
• mixing several multimedia streams into a single one;
• forward error correction (FEC);
• unicast-multicast conversion.
SAs can be concatenated arbitrarily to realize complicated scenarios. Multimedia streams should be transmitted via multicast wherever possible in order to save network resources. Within the KISS network (between NAPs/SAs), only multicast is used. From the EA's point of view, the only corresponding communication endpoint is a NAP. This separation affects clients and servers in such a way that server applications are not involved in any form when a client joins or leaves, which results in a simple, flexible and powerful design (Deering, 1989). An example scenario of the KISS virtual architecture is presented in Figure 8.
The design goals of the KISS system are the following:
• The EAs should be kept as simple as possible.
• The overall system should allow the delivery of client-specific services, network capabilities and communication protocols.
• KISS should scale to a large number of end-systems. This means that the server system cannot be involved when clients connect, disconnect or change service requirements.
Figure 9 shows some basic procedures in the KISS system, using a scenario of one server, one client, two NAPs and one SA. The SA can translate format A to format B. During service registration, service A is announced over the network. The query service is a request from a potential client for the service in format B. The NAP1 connected to this client sends a query for the announced service, but in format B, since NAP1 itself cannot convert format A to B. During the ongoing service, the client can temporarily stop the streaming with a pause message to its NAP; this message is forwarded by the other components of KISS to the appropriate server.
Figure 8: An example scenario
[Figure 8: An example KISS network using the MBone: NAPs and SAs form the KISS network, interconnecting LAN 1 to LAN 4, where servers and clients attach through their local NAPs.]
Figure 9: Basic procedures in the KISS architecture
[Figure 9 is a message sequence chart between Server, NAP1, SA (A to B), NAP2 and Client, showing Install A, Register Service, Announce A, Get Service, Query Service / Query B, Info B, Set up Service, Client A / Client B, Service A, Continue A, Service Active, and the Pause A / Pause B exchanges with their Accept acknowledgements.]
The communication protocol used by the KISS network is the Service Access and Management Protocol (SAMP), which is based on UDP/IP multicast (Mödeker, 1998). Every NAP and every SA receives SAMP packets on a multicast datagram socket; packets are addressed either to a well-known IP multicast group or directly to the receiver. A SAMP packet consists of a header specifying, among other things, the packet type, and a variable list of additional information whose content depends on the packet type.
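The chapter does not give SAMP's wire format, so the following Python sketch only illustrates the stated structure (a header carrying the packet type, followed by a variable list of additional information); the type codes and the field layout are assumptions:

```python
import struct

# Hypothetical SAMP-like encoding: 1-byte packet type, 1-byte item count,
# then length-prefixed information items. This is an illustration of the
# described structure, not the real SAMP format.
PKT_ANNOUNCE, PKT_QUERY, PKT_PAUSE = 1, 2, 3   # hypothetical type codes

def pack_samp(pkt_type, items):
    """Build a packet: header (type, item count) plus length-prefixed items."""
    body = b"".join(struct.pack("!H", len(i)) + i for i in items)
    return struct.pack("!BB", pkt_type, len(items)) + body

def unpack_samp(packet):
    """Recover the packet type and the variable list of information items."""
    pkt_type, count = struct.unpack_from("!BB", packet, 0)
    items, offset = [], 2
    for _ in range(count):
        (length,) = struct.unpack_from("!H", packet, offset)
        offset += 2
        items.append(packet[offset:offset + length])
        offset += length
    return pkt_type, items

# Example: announce a stream in format A; the resulting payload would be
# sent over UDP to the well-known SAMP multicast group.
msg = pack_samp(PKT_ANNOUNCE, [b"stream=radio1", b"format=A"])
assert unpack_samp(msg) == (PKT_ANNOUNCE, [b"stream=radio1", b"format=A"])
```

The variable item list is what lets one packet format serve announcements, queries and pause messages alike, which matches the single-protocol design of the KISS control plane.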
CONCLUSIONS
Multicasting is a key technology in multimedia systems; however, it requires more sophisticated communication software than normal unicast delivery. In this chapter, the necessary group management and routing protocols were discussed. Due to network heterogeneity and the special requirements of multimedia applications, different kinds of transport protocols have been developed to enhance the reliability and timing capabilities of the Internet. The current design principles of multicast-based multimedia applications were also presented.
REFERENCES
Ballardie, T., Francis, P. and Crowcroft, J. (1993). Core-based trees (CBT): An architecture for scalable inter-domain multicast routing. Proceedings of ACM SIGCOMM'93 Symposium on Communications, Architectures and Protocols, September.
Banikazemi, M. (1997). IP multicasting: Concepts, algorithms and protocols. Survey Paper, Ohio State University, August.
Bhattacharyya, S. et al. (2001). A framework for source-specific multicast deployment. Internet Draft, Internet Engineering Task Force, March.
Bormann, C., Ott, J., Gehrcke, H. C., Kerschat, T. and Seifert, N. (1994). MTP-2: Towards achieving the S.E.R.O. properties for multicast transport. Proceedings of the International Conference on Computer Communications and Networks (ICCCN) 94, September.
Casner, S. (1994). Frequently Asked Questions (FAQ) on the Multicast Backbone (MBONE). Retrieved on the World Wide Web: ftp://venera.isi.edu/faq.txt, December.
Cisco. (1995a). Building consistent quality of service into the network. Cisco Systems Inc., Packet Magazine Archives, First Quarter. Available on the World Wide Web at: http://www.cisco.com/warp/public/674/6.html.
Cisco. (1995b). IP multicast streamlines delivery of multicast applications. Cisco Systems Inc., Packet Magazine Archives, First Quarter.
Cisco. (1999). Multicast routing. Cisco Systems Inc. Available on the World Wide Web at: http://www.cisco.com/warp/public/614/17.html, March.
Clark, D. D. and Tennenhouse, D. L. (1990). Architectural considerations for a new generation of protocols. SIGCOMM Symposium on Communications, Architectures and Protocols (Philadelphia, Pennsylvania), IEEE, Computer Communications Review, 20(4), 200-208, September.
Cooperstock, J. R. and Kotsopoulos, S. (1996). Why use a fishing line when you have a net? An adaptive multicast data distribution protocol. In Proceedings of Usenix'96.
Crowcroft, J., Handley, M. and Wakeman, I. (1998). Internetworking Multimedia. UCL Press, December.
Daviel, A. (1995). Linux Multicast FAQ. Available on the World Wide Web at: http://andrew.triumf.ca/pub/linux/multicast-FAQ, December.
Deering, S. (1989). Host extensions for IP multicasting. Network Working Group RFC 1112, August.
Deering, S. E., Estrin, D., Farinacci, D., Jacobson, V., Liu, C. G. and Wei, L. (1996). The PIM architecture for wide-area multicast routing. IEEE/ACM Transactions on Networking, 4(2), April, 153-162.
Eastlake, D., Crocker, S. and Schiller, J. (1994). Randomness recommendations for security. RFC 1750, Cybercash, MIT, December.
Esler, M. (1998). Linux Multicast Information. Retrieved on the World Wide Web: http://www.cs.washington.edu/homes/esler/multicast, May.
Ferguson, P. and Huston, G. (1998). Quality of service on the Internet: Fact, fiction or compromise? INET'98 Conference. Available on the World Wide Web at: http://www.isoc.org/inet98/proceedings/6e/6e_1.htm.
Ferrari, D. and Verma, D. (1990). A scheme for real-time communication services in wide-area networks. IEEE Journal on Selected Areas in Communications, April, 8(3), 368-379.
Finlayson, R. (1999). The UDP multicast tunneling protocol. Network Working Group Internet Draft. Available on the World Wide Web at: http://www.live.com/umtp.txt, January.
Floyd, S., Jacobson, V., McCanne, S., Liu, C. G. and Zhang, L. (1995). A reliable multicast framework for light-weight sessions and application-level framing. 1995 ACM SIGCOMM Symposium on Communications, Architectures and Protocols, October, 342-356.
Handley, M. and Jacobson, V. (1995). SDP: Session description protocol. Internet Draft, November.
Holbrook, H. W., Singhal, S. K. and Cheriton, D. R. (1995). Log-based receiver-reliable multicast for distributed interactive simulation. 1995 ACM SIGCOMM Symposium on Communications, Architectures and Protocols, October, 328-341.
ISI. (1999). Mapping the MBONE. Available on the World Wide Web at: http://www.isi.edu/scan/mbone.html, March.
ITU. (1993). Video codec for audiovisual services at p*64kb/s. ITU-T Recommendation H.261, March.
Johnson, V. and Johnson, M. (1997a). Higher level protocols used with IP multicast. IP Multicast Initiative (IPMI), Stardust Technologies, Inc. Retrieved on the World Wide Web: http://www.ipmulticast.com/community/whitepapers/highprot.html.
Johnson, V. and Johnson, M. (1997b). How IP multicast works. IP Multicast Initiative (IPMI), Stardust Technologies, Inc. Retrieved on the World Wide Web: http://www.ipmulticast.com/community/whitepapers/howipmcworks.html.
Johnson, V. and Johnson, M. (1997c). IP multicast backgrounder. IP Multicast Initiative (IPMI), Stardust Technologies, Inc. Retrieved on the World Wide Web: http://www.ipmulticast.com/community/whitepapers/backgrounder.html.
Johnson, V. and Johnson, M. (1997d). Introduction to IP multicast routing. IP Multicast Initiative (IPMI), Stardust Forums, Inc. Retrieved on the World Wide Web: http://www.ipmulticast.com/community/whitepapers/introrouting.html, February.
Johnson, V. and Johnson, M. (1999). IP multicast APIs & protocols. The IP Multicast Channel at Stardust.com. Retrieved on the World Wide Web: http://www.stardust.com/multicast/whitepapers/apis.htm, August.
Jonas, K. (1997). Forget the Net! Proceedings of IEEE DMS'97 (4th Pacific Workshop on Distributed Multimedia Systems), Vancouver, Canada, July 23-25, 103-109.
Jonas, K., Kretschmer, M. and Mödeker, J. (1998). Get a KISS-communication infrastructure for streaming services in a heterogeneous environment. ACM Multimedia 98 Electronic Proceedings.
Kausar, N. and Crowcroft, J. (1998). End-to-end reliable multicast transport protocol requirements for collaborative multimedia systems. Department of Computer Science, University College London.
Keshav, S. and Paul, S. (1998). Centralized multicast. Cornell CS Technical Report, IEEE/ACM Transactions on Networking. Retrieved on the World Wide Web: http://www.cs.cornell.edu/skeshav, April.
Koifman, A. and Zabele, S. (1996). Ramp: A reliable adaptive multicast protocol. Proceedings of the IEEE Infocom'96, San Francisco, CA, March, 1442-1451.
Kretschmer, M. (1998). Specification and implementation of an enhanced RTP translator with dynamic configuration features. Thesis, FH Köln, GMD Research Series No. 8, GMD.
Lidl, K., Osborne, J. and Malcolm, J. (1994). Drinking from the firehose: Multicast USENET news. Proceedings of the 1994 Winter USENIX Conference.
Lin, J. C. and Paul, S. (1996). RMTP: A reliable multicast transport protocol. Proceedings of the IEEE INFOCOM'96, March, 1414-1424.
Liu, C. (1998). Multimedia Over IP: RSVP, RTP, RTCP, RTSP. Retrieved on the World Wide Web: http://www.cis.ohio-state.edu/~jain/cis788-97/ip_multimedia/index.htm, January.
Live. (1999). LiveGate, multikit, liveCaster. Live Networks, Inc. Retrieved on the World Wide Web: http://www.live.com.
Macker, J. and Dang, W. (1996). The multicast dissemination protocol (MDP) framework. Internet Draft, Internet Engineering Task Force, November.
Magnus. (1999). MTP: Multicast Transport Protocol. Retrieved on the World Wide Web: http://ganges.cs.tcd.ie/4ba2/multicast/magnus/.
Maufer, T. and Semeria, C. (1997). Introduction to IP multicast routing. Internet Draft, 3Com Corporation, GlobeCom Network, March.
McCanne, S. (1997). Scalable multimedia communication with Internet multicast, light-weight sessions and the MBone. University of California, Berkeley. Retrieved on the World Wide Web: http://www.ncstrl.org.
Miller, K., Robertson, K., Tweedly, A. and White, M. (1997). Starburst multicast file transfer protocol (MFTP) specification. Internet Draft, Internet Engineering Task Force, January.
Mödeker, J. (1998). Transparent network access for multimedia streaming services. GMD IMK.IBS, Thesis, FH Köln, March.
Moy, J. (1994). Multicast extensions to OSPF. Network Working Group RFC 1584, March.
Obraczka, K. (1998). Multicast transport protocols: A survey and taxonomy. IEEE Communications Magazine, January, 36, 94-102.
Parnes, P. (1998). Multicast Tunnel-mTunnel. Retrieved on the World Wide Web: http://www.cdt.luth.se/~peppar/progs/mTunnel.
Quinn, B. and Almeroth, K. (2001). IP multicast applications: Challenges and solutions. Internet Draft, Internet Engineering Task Force, March.
Rajvaidya, P. (1999). MANTRA: Monitoring the multicast at global scale. CAIDA, CS.UCSB. Retrieved on the World Wide Web: http://steamboat.cs.ucsb.edu/mantra/home.html, November.
Sabata, B., Brown, M. J. and Denny, B. A. (1996). Transport protocol for reliable multicast: TRM. Proceedings of the IASTED International Conference on Networks, January, 143-145.
Schefström, D. (1999). RTP: A Framework for Real-Time Applications. Retrieved on the World Wide Web: http://www.cdt.luth.se/~dick/smd074/99-LP1/RTPConference.
Schulzrinne, H., Casner, S., Frederick, R. and Jacobson, V. (1996). RTP: A transport protocol for real-time applications. Network Working Group RFC 1889, January.
Schulzrinne, H. (1996). RTP profile for audio and video conferences with minimal control. Network Working Group RFC 1890, January.
Chapter XVIII
IP Multicast: Inter Domain, Routing, Security and Address Allocation Antonio F. Gómez-Skarmeta, University of Murcia, Spain Pedro M. Ruiz, University Carlos III of Madrid, Spain Angel L. Mateo-Martínez, University of Murcia, Spain
INTRODUCTION
Without doubt, multicast communication, as a means of one-to-many or many-to-many delivery of data, has become a hot topic in multimedia environments. Many parties are interested in multicast: the research community, standards groups and Internet Service Providers (ISPs), among others. Although IP multicast is a very good solution for internetworking multimedia in many-to-many communications, there are still issues that have not been completely solved. Protocols are still evolving, and new protocols are constantly coming up to solve these issues, because that is the only way to make multicast a true Internet service.
The main goal of this chapter is to describe the evolution of IP multicast from the obsolete MBone (Multicast Backbone) and intra-domain multicast routing to the current inter-domain multicast routing scheme. We will pay special attention to the challenges and problems that still need to be solved, the problems that have already been solved, and the way they were solved. We will paint a complete picture of the state of the art, explaining the idea behind each protocol and how all these protocols work together. Some of the topics that we will discuss broadly are related to address allocation, security and authentication, scope control and so on. We will explain our view of the problems, the work that has been done worldwide on these issues, and also the developments that we have made in order to solve some of these problems. We will give some results and recommendations. Copyright © 2002, Idea Group Publishing.
Authentication
With the current IP multicast model, there is no way to control who can join a certain multicast group, or even who can send datagrams addressed to a certain multicast group. At present there is only one proposal, in the draft "IGMP Extensions for IP Multicast Sender and Receiver Authentication," written by N. Ishikawa of the NTT Information Centre. We have developed a system for controlling access to IP multicast networks based on this work. We plan to integrate our system with the current ideas expressed in the IGMPv3 protocol, which is being defined these days.
Scope Control
Another drawback of the current IP multicast model is related to scope control. In the same way that you cannot control who accesses IP multicast networks, it is not possible to limit the scope of the sent datagrams. The system we have developed also covers this issue by adding some ideas to Ishikawa's proposal. Without solving these problems, it is not possible for IP multicast to become a true Internet service, because malicious users could conspire to clog your network and nothing could be done about it. This makes ISPs think twice before offering the service to their customers.
Address Allocation
Another problem that arises when deploying IP multicast across the whole Internet is related to address allocation. The current model is based on sdr and the assumption that if there is no announced session using a certain multicast group, then nobody is using that group. Nevertheless, the evolution of IP multicast towards inter-domain routing protocols makes this solution obsolete, and a more dynamic and robust solution is needed, one that integrates with the new multicast routing models; MASC, GLOP and AAP have been proposed to solve this problem. Finally, we will present some conclusions, future work and future research opportunities for people working on the deployment of IP multicast.
BACKGROUND
There are various ways to provide interactive multimedia services with real-time requirements. The first solution we can use is ISDN, as the great variety of H.320 videoconference systems over this kind of switched network shows. H.320 is not just a single standard, but a set of standards covering topics such as audio and video formats, data transmission and control signalling. Although H.320 is a good solution for remote videoconferencing, it has some drawbacks. The first is the fact that it does not use the Internet "standard" for communicating between remote sites, so an extra, independent tool is needed to access the service. Nowadays the Web browser is thought to be the killer application for accessing Internet services, so this is not a friendly solution.
H.323 is very close to H.320. Actually, the protocols and formats defined and used in H.323 are very similar (if not identical) to the ones used with H.320. The main difference between the two protocols is their objective: while H.320 was designed for digital switched networks based on 64 kbps channels, H.323 was designed to be used over datagram networks with unguaranteed bandwidth. So H.323 can be used in low-bandwidth environments.
In addition, H.323 is based on IP, so it can be perfectly integrated with other Internet technologies and services. However, it is based on the traditional IP unicast service, so it does not scale well when there are more than two participants in the multimedia conference. In order to deal with conferences with more than two participants, an MCU (Multipoint Control Unit) is needed. The MCU is the network element in charge of replicating flows, i.e., all participants send their data to the MCU, which then has to distribute that data to the rest of the participants. The MCU can therefore become a bottleneck for conference performance.
As an alternative, IP multicast can be used to handle many-to-many communications, because it is based on the concept of group addresses. A group address is not related to any particular host on the network, but to a group of hosts that are subscribed to that multicast address. IGMP (Internet Group Management Protocol, Fenner, 1997) is used by a host to inform its locally attached router about the group addresses it is interested in. IGMP is also used between routers to exchange their member lists in order to route multicast groups. The main idea of IP multicast is that the routers are in charge of the distribution of packets sent to the multicast group, so there is no need for an MCU and packets are replicated only when needed. Nevertheless, the distributed nature of IP multicast brings other problems that do not appear in H.320 or H.323 environments. Some of these problems will be studied in depth throughout the rest of this chapter.
MAIN THRUST OF THE CHAPTER
IP multicast was first introduced by Steve Deering in his PhD thesis in 1988. If we compare the evolution of IP multicast and the WWW, which were introduced at about the same time, we find that IP multicast deployment has been very slow. For example, there are many more HTTP servers than multicast-enabled routers. Why this difference in the speed of deployment? One of the various causes is that the Web was very easy to deploy with the technology, equipment and protocols in use in those days. IP multicast, however, needs some extra requirements from the network equipment that could not be met by that technology. In fact, IP multicast requires additional "intelligence" in the network and an important amount of complexity in the routers. It was very difficult to manage this kind of service in an infrastructure that had only been used to offer a unicast, best-effort service. Another cause of the slow deployment of multicast is that only technicians and investors were interested in multicast in its initial stages; there were no carriers or ISPs working on the issue. Finally, IP multicast still has various problems that make ISPs think twice before offering the multicast service to their customers. Several of these problems have been shown in Skarmeta (1999), and some other extra problems have been identified in Diot (2000).
In this section we will look at the evolution and deployment of the current IP multicast service. We will start by looking at the first virtual multicast topology, called the MBone, and the protocols that made this infrastructure work. Then, we will explain the newer protocols that allow us to use native multicast and how these protocols have evolved from offering intra-domain multicast routing to inter-domain multicast routing.
Finally, we will offer a detailed view of the current challenges and how they are being addressed. We need to keep in mind that multicast is nowadays a work in progress, so we will pay special attention to the current challenges that still need to be solved.
MBone: The First Step Towards Multicast Deployment
From the first Internet-wide experiments in 1992 to the middle of 1997, all the standardisation and deployment effort focused on a single flat topology. This topology is in contrast to the Internet topology, which is based on a hierarchical routing structure. The first step in deploying IP multicast was to develop routing protocols for this initial multicast topology. However, at the beginning of 1997, the multicast community realised the need for a hierarchical multicast structure and inter-domain routing. So, the existing IP multicast routing protocols were renamed "intra-domain IP multicast routing protocols" and the standardisation of inter-domain solutions started. We will start by explaining the intra-domain protocols and concepts, and then we will see how these protocols evolved into inter-domain protocols.
The Standard IP Multicast Model
One of the first things that we need to explain is the idea behind IP multicast. The typical IP unicast service works fine in one-to-one communications, but when we try to use the same service to carry one-to-many or many-to-many communications, various issues appear that need to be improved:
• the bandwidth used, and
• scalability.
What is the general idea behind multicast? IP multicast is a technology based on the use of group addresses instead of the typical IP unicast addresses. Using multicast addresses we can reach a great number of receivers by sending only one datagram, which is replicated only when needed. With this simple idea, we can address the drawbacks mentioned previously:
1. The bandwidth used is reduced, because only one datagram needs to be sent instead of one datagram per receiver.
2. This approach is more scalable than IP unicast, because the source can send data to a lot of receivers using the same processing power and bandwidth as if it were sending to only one receiver.
For example, if we think of a unicast video server offering video streams on demand, the cost of such equipment is proportional to the number of simultaneous streams that we want to serve. If we want a less expensive and more scalable solution, we need to use IP multicast: different requests for the same content can be served using only one flow. This approach is more scalable than unicast in all cases except when every receiver wants to receive different content; in that case, both approaches need the same resources.
The idea we have just outlined was first defined by Steve Deering in his PhD thesis and is commonly known as the standard IP multicast model (Deering, 1991). In order to understand the whole evolution of inter-domain multicast routing, it is very important to understand this model. It can be summed up as follows: IP multicast senders only need to address datagrams to the proper multicast group, or class D address. The source of the datagrams does not need to register or schedule transmission.
IP multicast receivers only need to join the multicast group that they want to receive. There is no need to register, synchronise or negotiate with a centralised group management entity. Multicast-enabled routers conspire to deliver IP multicast traffic. With this model in mind, an IP multicast backbone called the MBone was created. Which protocols make multicast possible? How do all these protocols fit together to construct the MBone? We will answer these questions in the next section.
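The three points of the model map directly onto the socket API. The following minimal Python sketch (the group address and port are illustrative assumptions) shows both halves on one host:

```python
import socket
import struct

# Minimal sketch of the standard IP multicast model.
GROUP, PORT = "239.1.2.3", 5000

# Receiver: it simply joins the group; there is no registration,
# synchronisation or negotiation with a centralised management entity.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# Sender: it simply addresses a datagram to the class D group address;
# it does not need to register or schedule the transmission.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)  # scope limit
tx.sendto(b"hello, group", (GROUP, PORT))

print(rx.recvfrom(1500))   # the multicast-enabled routers do the delivery
```

Neither endpoint knows who the other members are; all the replication and delivery work is left to the conspiring routers, which is precisely what the rest of this section describes.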
Protocols Inside the MBone
As you may suppose from the previous discussion, in this section we explain the protocols that make multicast possible and the way they fit together. These protocols can be divided into two major groups: host-to-router protocols and router-to-router protocols.
On the one hand, host-to-router protocols, as you can imagine from their name, are used by end systems to interact with their local multicast-enabled router or routers. The typical example of a host-to-router protocol is the Internet Group Management Protocol (IGMP, Fenner, 1997). This protocol is used by end systems to inform their local routers of their interest in receiving certain multicast groups. We will examine the way IGMP works later in this section. The host-to-router protocols are responsible for implementing the first two points of the standard IP multicast model.
On the other hand, router-to-router protocols are responsible for implementing the third point of the standard IP multicast model. They are also called IP multicast routing protocols, and the best-known examples are the Distance Vector Multicast Routing Protocol (DVMRP, Waitzman, 1998), Protocol Independent Multicast Dense Mode (PIM-DM, Deering, 1998) and Protocol Independent Multicast Sparse Mode (PIM-SM, Estrin, 1998). All these protocols can be classified according to various aspects. For example, depending on the kind of distribution trees being constructed, they can be classified as building unidirectional or bi-directional trees. However, the typical classification distinguishes flood and prune protocols (also known as dense mode protocols) from sparse mode protocols. We will discuss these protocols more broadly later in this section.
The initial MBone infrastructure was based on two basic protocols: IGMP and DVMRP. Of course, some other protocols such as IP, UDP, etc., were used as well. In those initial stages, not all the routing equipment in the Internet was capable of multicast routing, so the only way to create a worldwide multicast network was to interconnect several multicast-enabled networks via tunnels. The routing function was provided, based on DVMRP, by a multicast routing daemon process known as mrouted running on a workstation. All the multicast traffic between two multicast routers (also known as mrouters) was encapsulated into IP unicast datagrams addressed to the other end of the tunnel. The mrouters were responsible for forwarding the unicast-encapsulated traffic coming in on an incoming interface over the appropriate set of outgoing interfaces, determined according to DVMRP. In the early MBone the multicast routing function was only a controlled way of flooding. An additional way of controlling this routing, called pruning, is based on proper management of the IGMP-based interest mechanism. If no host on a subnet
expresses interest in receiving a certain multicast group's traffic, then no multicast data traffic comes over the tunnel to that subnet. So, traffic only flows over links having active receivers attached to them. This virtual topology is shown in Figure 1.
[Figure 1: Typical MBone topology: multicast networks interconnected by a tunnel across a multicast-disabled network.]
IGMP acts like the glue between the multicast-enabled network and the end systems. It is used by routers to learn about the existence of members on their directly attached subnets. There are various IGMP message types, but all of them have a header like that shown in Figure 2.
[Figure 2: IGMP header information: an 8-bit Type field, an 8-bit MRT field and a 16-bit Checksum in the first 32 bits, followed by the 32-bit Group Address.]
Multicast-enabled routers send Group Membership Queries (IGMP type 1) addressed to the group 224.0.0.1 (all multicast systems). When a host receives such a query, it replies with a Group Membership Report (IGMP type 2) addressed to the multicast group that it is interested in. There are some other IGMP messages that are used between multicast-enabled routers, such as IGMP-DVMRP (IGMP type 3), which is used for DVMRP updates. Another important message is the IGMP-Prune. Whenever an mrouter detects that there are no hosts interested in a certain multicast group on one of its outgoing interfaces, it sends an IGMP-Prune to the mrouter at the other end of the tunnel on the interface where it was joined, asking the neighbour mrouter to stop sending that traffic.
The original DVMRP creates multicast trees using the flood and prune mechanism. Constructing the distribution tree, called a Reverse Shortest Path Tree, involves the following steps (a sketch of the central RPF check follows this list):
• The source multicasts each datagram on its local network. An attached mrouter receives those packets and starts flooding on all its outgoing interfaces.
• Each of the neighbour mrouters performs an RPF (Reverse Path Forwarding) check. That is to say, it checks whether the incoming interface for the packet is the interface that it would use as an outgoing interface toward the source. If the RPF check is successful, then the packets are forwarded over all the interfaces in the outgoing interface list. Otherwise, the packets are dropped and, depending on the protocol, an IGMP-Prune may be sent on that interface.
• If the RPF check is successful but the mrouter realises that it has no attached hosts nor neighbours interested in receiving the packet (note that the mrouter discovers this using IGMP), it sends an IGMP-Prune toward the source on the RPF interface.
• Mrouters receiving prune messages create prune state for the interface on which the prune is received. If an mrouter receives prune messages for that group on all interfaces except the RPF interface, it sends a prune message of its own toward the source on the RPF interface.
As we have previously said, these kinds of protocols are commonly known as dense mode protocols because they work better when the topology is densely populated with group members. Their main advantage is that the distribution tree is very efficient. However, sending traffic on all interfaces and waiting for others to tell you that they are not interested in it is not a very scalable approach. In addition, mrouters that do not take part in the distribution tree need to store the prune state.
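The RPF check at the heart of this flood-and-prune procedure is easy to express in code. The following Python sketch is purely illustrative; the routing-table shape, prefixes and interface names are assumptions:

```python
# Hypothetical sketch of the RPF check used in DVMRP-style flood and prune.
# A real router would consult its full DVMRP routing table; here a tiny
# prefix-to-interface map stands in for it.
unicast_route = {"10.0.0.0/8": "eth0", "192.168.0.0/16": "eth1"}

def rpf_check(source_prefix, incoming_iface):
    """Accept a multicast packet only if it arrived on the interface this
    router would itself use to send unicast traffic back toward the source."""
    return unicast_route.get(source_prefix) == incoming_iface

def forward(source_prefix, incoming_iface, outgoing_ifaces, pruned):
    if not rpf_check(source_prefix, incoming_iface):
        return []                 # drop; possibly send a prune on that link
    # Flood on every downstream interface that has not been pruned.
    return [i for i in outgoing_ifaces if i not in pruned]

print(forward("10.0.0.0/8", "eth0", ["eth1", "eth2", "eth3"], {"eth2"}))
# -> ['eth1', 'eth3']
```

The check guarantees loop-free flooding without any per-packet coordination between routers, which is why it remains the basic acceptance test in the dense mode protocols discussed next.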
From the MBone to Native Multicast
Over recent years, the MBone has grown tremendously. It has been evolving from a virtual network into a real network that is being integrated into the Internet. In parallel with this evolution, many vendors have started to offer native multicast support in their equipment, and the evolution from the MBone to native IP multicast networks began. Ongoing research has led to the development and deployment of two additional dense mode protocols, PIM-DM and MOSPF (Moy, 1994). In addition, some sparse mode protocols have been widely developed and deployed.
MOSPF
MOSPF stands for Multicast Extensions to OSPF. As its name indicates, this protocol defines several extensions to the OSPF protocol, making it able to route IP multicast datagrams. In the same way that OSPF routers build their own unicast routing topology by flooding link state messages, MOSPF routers share the same view of group membership in their area. To achieve this, a new OSPFv2 link state advertisement (LSA) describing the location of multicast destinations has been added, called the group-membership-LSA. Each MOSPF router in the area can then calculate the path of a multicast datagram by building the shortest-path tree (using the Dijkstra algorithm) rooted at the packet's source. All branches not containing multicast members are pruned from the tree. Such a tree is shown in Figure 3.
[Figure 3: Shortest-path tree for a (source, group) pair, rooted at the source and spanning routers A through I, with non-member branches pruned.]
As we can see, since each router calculates the complete map of IP multicast sources and receivers, it is very easy to obtain the best distribution path from senders to receivers. This works fine when all sources and receivers are in the same area. When forwarding multicast to different OSPF areas, a router in one area does not have perfect information about the topology of other areas, so only incomplete or approximated shortest-path trees can be built. This may lead to some inefficiency in routing. This protocol sits between dense mode and sparse mode protocols. It can be considered a dense mode protocol because membership information is
broadcast to each MOSPF router in the area. However, it can also be considered a sparse mode protocol because data is only sent to those receivers that specifically request it. The main advantage of MOSPF is that it builds very good distribution trees; however, it suffers from poor scaling. In the same way that dense mode routers need to store state even when there are no interested receivers, MOSPF routers need to store unwanted state when there are no senders. Nowadays, MOSPF is not widely used, and people prefer other native protocols such as PIM-DM or PIM-SM.
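To make the MOSPF computation concrete, the following Python sketch builds a Dijkstra shortest-path tree from the source and then prunes the branches without group members. The topology, link weights and membership set are assumptions for illustration:

```python
import heapq

# Tiny link-state database: every MOSPF router holds the same picture of
# the topology and of group membership, so each can compute this locally.
links = {"S": {"A": 1, "B": 4}, "A": {"C": 1}, "B": {"C": 2}, "C": {}}
members = {"C"}                  # routers with attached group members

def shortest_path_tree(source):
    """Dijkstra rooted at the packet's source; returns child -> parent."""
    dist, parent, heap = {source: 0}, {}, [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in links[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent

parent = shortest_path_tree("S")
# Keep only the branches that lead to a group member; prune the rest.
kept = set()
for m in members:
    while m in parent:
        kept.add((parent[m], m))
        m = parent[m]
print(sorted(kept))              # edges kept: S->A and A->C; B is pruned
```

Because the tree is recomputed from the link-state map, no flooding of data is ever needed, but the price is exactly the per-router state the text describes.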
PIM-DM
Protocol Independent Multicast (PIM) has been split into two protocols: a dense-mode protocol called PIM-DM and a sparse-mode version called PIM-SM. PIM-DM (Deering, 1998) is very similar to DVMRP; there are only two major differences between them. The first is that DVMRP computes its own routing table to determine the best path to the source, whereas PIM-DM uses the routing table of the underlying unicast routing protocol. That is why it is called Protocol Independent Multicast: routing decisions do not depend on a particular underlying unicast routing protocol. This use of the underlying routing table also applies to PIM-SM. The second difference is that DVMRP tries to avoid sending unnecessary packets to neighbours that would generate prune messages based on a failed RPF check; it does this by adding to the set of outgoing interfaces only those that use the given router to reach the source, that is to say, only the interfaces that will pass the RPF check. Instead of using this mechanism, PIM-DM uses a much simpler approach: all multicast traffic is forwarded on all outgoing interfaces. The trade-off is that sometimes unnecessary packets are forwarded to routers that must then generate prune messages because of the resulting RPF check failure. Of course, PIM-DM uses native multicast instead of encapsulating multicast traffic in unicast UDP datagrams.
As a dense mode protocol, PIM-DM has many of the same problems that we explained for DVMRP, and a new family of protocols started to be developed and deployed. This family of protocols was designed to be used in environments where the group members are sparsely distributed, that is, where there are few receivers and they are widely distributed. In this case,
instead of flooding the whole network and making many routers state that they are not interested in receiving the traffic, it is better not to forward anything and make the interested group members (via their locally attached multicast-enabled routers) send explicit join messages towards a core node that acts as a "meeting point." The most popular sparse mode protocols are explained below.
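Before turning to those protocols, the RPF check that both DVMRP and PIM-DM rely on can be sketched as follows (a minimal illustration; the route lookup is reduced to a dictionary):

def rpf_check(source, arrival_iface, unicast_routes):
    # Accept a multicast packet only if it arrived on the interface this
    # router would itself use to reach the source (per the unicast table).
    return unicast_routes.get(source) == arrival_iface

def pim_dm_forward(source, arrival_iface, unicast_routes, ifaces):
    # PIM-DM style: on a successful RPF check, flood on every other
    # interface; DVMRP would instead restrict the set to neighbours whose
    # own RPF check is known to succeed.
    if not rpf_check(source, arrival_iface, unicast_routes):
        return []                      # failed RPF check: drop (and prune)
    return [i for i in ifaces if i != arrival_iface]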
CBT
CBT stands for Core-Based Trees (Ballardie, 1997). This is the simplest and earliest center-based tree protocol. It was designed primarily to improve the scalability of dense-mode protocols. When a receiver joins a multicast group, its local CBT router obtains the Core router for the group and sends a JOIN_REQUEST message to the next hop router towards the Core. Every router in the path from the local CBT router to the Core maintains forwarding state for that group, and a JOIN_ACK is sent back to the previous router, either by the Core router itself or by another router on the unicast path between the sending router and the Core that is already on the tree. In this way, a Core-based multicast distribution tree is built. This process is shown in Figure 4. As the figure shows, CBT builds and maintains a shared bi-directional tree for delivering the packets addressed to a certain multicast group. This tree is then used by all the senders and receivers in the group. Packets can flow up the tree towards the Core and down the tree, away from the Core, depending on the location of the source. This is different from the PIM-SM approach, where distribution trees are unidirectional.
Figure 4: Process for building a CBT shared tree
(Left: JOIN_REQUEST messages from receivers A, B and C propagate towards the Core; right: data packets flow along the resulting bidirectional shared tree.)
Tree maintenance is achieved by each downstream router periodically sending a "keepalive" message (ECHO_REQUEST) to its upstream neighbour. On multicast-enabled links, the keepalive is multicast to the group 224.0.0.15 (ALL-CBT-ROUTERS). The receipt of a keepalive is answered with an ECHO_REPLY message, which lists all the groups for which the router has state. This mechanism is used to maintain consistent information between parent and child. The sending and receiving of multicast datagrams by hosts conform to the typical IP multicast model. With this approach, the Core router can become a single point of failure, so CBT also allows multiple Core routers to be specified. This adds some redundancy in case a Core becomes unreachable.
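The join procedure described above might be sketched like this (a toy model using RFC 2201's message names; message delivery and the unicast next hop towards the Core are passed in as parameters):

class CBTRouter:
    # Toy model of CBT join handling.
    def __init__(self, name, is_core=False):
        self.name, self.is_core = name, is_core
        self.pending = {}    # group -> downstream hop awaiting a JOIN_ACK
        self.children = {}   # group -> set of on-tree child hops

    def on_join_request(self, group, prev_hop, next_hop_to_core, send):
        self.pending[group] = prev_hop
        if self.is_core or group in self.children:
            self.on_join_ack(group, send)    # already on the tree: ack at once
        else:
            send(next_hop_to_core, ("JOIN_REQUEST", group), self)

    def on_join_ack(self, group, send):
        child = self.pending.pop(group)      # confirm the forwarding state
        self.children.setdefault(group, set()).add(child)
        send(child, ("JOIN_ACK", group), self)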
PIM-SM
PIM-SM was developed to avoid the limitations of the CBT protocol. As we will see, PIM-SM has several similarities with CBT. One of the key concepts in PIM-SM is the notion of the Rendezvous Point (RP), which is very similar to what CBT calls the Core. PIM-SM works this way. The first step is configuring the RP. There may be several candidate RPs, but there can only be one RP per group; this mapping between multicast group and router is accomplished by the bootstrap protocol, which we will look at later. For the moment it is enough to know that an RP acts as a meeting point between senders and receivers. When a host wants to receive IP multicast datagrams addressed to a certain multicast group G, it sends an IGMP Membership Report for group G on its local network in response to a previously sent IGMP Query from its locally attached router. That router then creates forwarding state for group G and sends a PIM Join for group G on its outgoing interface towards the RP. This process is repeated hop by hop until the Join for group G arrives at the RP. At this moment, forwarding state has been created in each router along the path from the receiver to the RP, so multicast datagrams can flow along this path. Moreover, the tree is a reverse shortest-path tree, because the join messages follow the reverse path from receivers to the RP. When a sender starts transmitting datagrams, irrespective of whether it is a group member or not, its locally attached router receives the packets and starts sending the traffic encapsulated in Register messages addressed to the unicast IP address of the RP. If the RP does not have forwarding state for the group (this happens when nobody has sent Join messages to the RP), the RP sends a Register-Stop message towards the sender. This message avoids wasting bandwidth between the source and the RP. It is also possible for the RP to send a Join message towards the sender's locally attached router; in this way, multicast forwarding state is established along the path between the sender and the RP, avoiding the overhead caused by encapsulating traffic from the sender to the RP. The important advance over CBT is that a receiver may switch from the shared distribution tree to a shortest-path tree. This makes PIM-SM less critically dependent on RP placement than CBT is on the Core location.
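As an illustration of the RP's role, here is a hedged sketch of Register handling at the RP; the function and parameter names are ours, not from the specification:

def rp_handle_register(joined_groups, group, source_router, packet,
                       send, forward_shared_tree):
    # joined_groups: groups for which this RP holds (*,G) state from receivers.
    if group in joined_groups:
        forward_shared_tree(packet)          # decapsulate onto the shared tree
        send(source_router, ("JOIN", group)) # optionally join towards the source
                                             # so later packets arrive natively,
                                             # without Register encapsulation
    else:
        send(source_router, ("REGISTER_STOP", group))  # no receivers: quench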
From Intra-Domain Multicast to Inter-Domain Multicast
When trying to define a way to use native IP multicast across the whole Internet, some problems arise. In the same way that the growth of the Internet required an inter-domain hierarchical routing topology to be built, it would be desirable to have a hierarchical
multicast routing topology for the whole Internet. To achieve this goal, research into inter-domain multicast routing began. But what was going wrong with the intra-domain approaches?
• DVMRP and PIM-DM can't be used because their flood-and-prune philosophy makes all routers, even off-tree routers, keep per-source state. This is not scalable.
• MOSPF can't be used because it does not scale well enough, owing to the continuous flooding of membership information.
• CBT and PIM-SM also cannot be used because the mechanism used for the group-to-RP mapping (or group-to-Core in the CBT case) limits their scalability.
So, inter-domain multicast routing has evolved out of the need to provide scalable, hierarchical, Internet-wide multicast (Stardust, 2000). Several protocols have been developed to provide the required functionality, but this technology is still relatively immature, and various efforts to solve the issue are under way. Nowadays a near-term solution is being used, tested and deployed. It lacks elegance and long-term scalability, but it is functional. There is also ongoing work to find a long-term solution: some of the proposals are based on the current IP multicast model explained previously, while others redefine the multicast service model to make the problem simpler. We will start by explaining the interim solution and then deal with the new proposals.
Interim Solution: MBGP/PIM-SM/MSDP
This solution is supported by three basic protocols: MBGP is responsible for carrying routing information between Autonomous Systems (ASs), PIM-SM is used within the ASs, and MSDP is used for inter-domain announcement of multicast-active sources. As we have previously seen, PIM-SM behaves well when used within a single domain, but if we are looking for Internet-wide multicast, it shows several scalability problems. Thus, some external protocol is needed to interconnect different multicast domains. The solution is to carry IP multicast routes in BGP (Rekhter, 1995). We need to make multicast routing hierarchical in the same manner as unicast routing. In unicast routing, route aggregation and abstraction are provided by the Border Gateway Protocol (BGP). BGP allows a network manager to use any routing protocol within his domain; BGP provides the abstraction among domains. So, when packets are unicast to hosts within other external domains, we only need to select the best external link to reach that domain. To make that decision, we need information that BGP exchanges between domains using TCP connections (to achieve reliability). Each AS advertises the set of routes that it can reach together with an associated cost, and with this information the set of ASs that needs to be traversed to reach the destination can be computed. It is also possible to announce network prefixes grouping several networks instead of exchanging information on a network-by-network basis. Routing is still done on a hop-by-hop basis, but less information needs to be exchanged between ASs by the border routers. With these advantages in mind, people started to think of a similar approach for multicast routing. The result of this effort is an extension to the BGP4 protocol called Multiprotocol Extensions to BGP4 (MBGP) (Bates, 1998).
In order to make these extensions backwards compatible, as well as to simplify the introduction of multiprotocol capabilities into BGP-4, two new attributes are defined: Multiprotocol Reachable NLRI (MP_REACH_NLRI) and Multiprotocol Unreachable NLRI (MP_UNREACH_NLRI). If we look at the format of these new attributes, we can see a new field called "Subsequent Address Family Identifier." This identifier allows BGP to exchange not only IPv4-related reachability information, but also that of other protocols like IPv6, IPX, etc. In the multicast case, this field can specify unicast, multicast or unicast/multicast. With this approach, instead of every router needing to know the whole multicast topology, each router only needs to know its own domain topology and the paths to reach external domains. In addition, this way of extending the protocol allows an AS to have different topologies for unicast and multicast traffic. An example topology is shown in Figure 5. For example, if domain A in the figure advertises reachability for multicast, the message will say something like, "I've got a path to multicast sources on the networks listed within this message." So, MBGP does not carry information about specific multicast groups. The information carried by MBGP is instead used when the RP or one of the receivers needs to send a Join message towards the source, that is, when constructing the source-rooted distribution tree. Now a new question arises: does MBGP suffice for providing inter-domain multicast routing? The answer is negative, because MBGP does not provide an inter-domain tree construction mechanism: there is neither a defined format for inter-domain joins nor standard retransmission timers, etc. The interim solution uses PIM-SM for building multicast trees among the domains having group members. But this approach shows various problems that need to be solved, and to reduce them a new protocol called the Multicast Source Discovery Protocol (MSDP) (Farinacci, 2000) has been proposed. What is the problem that MSDP tries to solve? If we try to connect sparse mode domains (using PIM-SM, for example), the problem is how to inform an RP in one domain that there are active sources in other domains, given that MBGP only allows for multicast reachability exchanges. It is also problematic for an ISP to have its routers elected as RPs for certain groups when there are neither senders nor receivers interested in those groups within its domain. The near-term solution allows for various RPs instead of a single root RP. Actually, there is still only one RP per domain for a group (as PIM says).
Figure 5: Different MBGP topologies for unicast and multicast traffic (domains A through D connected by separate multicast and unicast peerings)
However, multiple domains can now be involved. Moreover, if there are group members within different domains, every domain will have its own distribution tree based on its local RP, but without some additional mechanism there is no way to interconnect the various multicast trees, and no way for RPs to communicate with each other when one of them receives a Register message from one of its sources. How does MSDP avoid these problems? It works by having an element (usually called the MSDP SA originator) running in the RP within each domain. This element announces the existence of active sources to MSDP peers in other domains. These announcements are distributed using reliable TCP connections. If the multicast sources coming in an MSDP announcement are of interest, the normal source-tree building mechanism in PIM-SM is used to deliver multicast data over an inter-domain distribution tree. When a source in a PIM-SM domain originates traffic to a multicast group, its locally attached PIM designated router sends the data encapsulated in a PIM Register message to the local RP of the domain. The RP then sends a Source-Active (SA) message to its MSDP peers. This message contains the address of the data source, the group address the data source sends to, and the IP address of the RP in the local domain. So, different sources are advertised using different SA messages. Each MSDP peer that receives these messages performs a peer-RPF check and floods the message away from the originating RP: the BGP routing table is used to determine which peer is the next hop towards the RP that originated the message. This peer is called the RPF peer. If an MSDP peer receives an SA message from a non-RPF peer towards the originating RP, it drops the message; otherwise, it forwards the message to all its other MSDP peers (this processing is sketched below). When an MSDP peer that is also an RP for its own domain receives a new SA message, it checks whether it knows about group members interested in the group named in the SA message. If so, it triggers a join event towards the data source as if a Join/Prune message addressed to the RP itself had been received. This sets up a branch of the source tree to this domain, and subsequent data packets are forwarded down the shared tree inside the domain. If leaf routers within the domain decide to join the source tree, they act in the same way as when joining a source tree within their own domain. Finally, if an RP in a domain receives a Join for a new group G, and it is caching SAs, the RP should trigger a join event for each source within all the SAs it caches for that group. This mechanism is called flood-and-join: if an RP is not interested in a group G, it only needs to ignore the SAs for that group; otherwise it joins the distribution tree. You can now imagine why this near-term solution is called MBGP/PIM-SM/MSDP. But how good is it? Let's discuss the limitations of the current solution. The MBGP/PIM-SM/MSDP solution is relatively simple once you understand all the abbreviations and the factors that guided the design of the protocols. Although some people say that this solution is very complex, it is not much more complex than some other Internet protocols. Its main advantage is that it works based on existing protocols, which is making its use quite widespread. Its main drawback is that it cannot serve as a long-term solution, because it may suffer from scalability problems.
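Here is the promised sketch of SA processing at an MSDP peer, covering peer-RPF flooding plus the join trigger at an RP; the lookup helpers are assumed, illustrative callables:

def on_sa_message(sa, from_peer, rpf_peer_towards, other_peers,
                  have_local_members, join_source_tree):
    # sa = (source, group, origin_rp); helpers stand in for BGP lookups
    # and PIM-SM machinery.
    source, group, origin_rp = sa
    if from_peer != rpf_peer_towards(origin_rp):
        return                              # SA from a non-RPF peer: drop it
    for peer in other_peers:
        if peer != from_peer:
            peer.send(sa)                   # peer-RPF flooding, away from the RP
    # If this router is also an RP and knows interested local members,
    # graft a branch of the (source, group) tree towards the data source:
    if have_local_members(group):
        join_source_tree(source, group)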
Because of the way MSDP works, if multicast starts working well and its use becomes widespread enough, the overhead of MSDP will be significant. Imagine 2,000 multicast sources in a domain: 2,000 SA messages need to be sent periodically by the MSDP SA originator. Moreover, if an MSDP peer does SA caching, it will need very large tables for looking up all the senders in external domains. So,
although MSDP is not scalable and will be insufficient for the future, it works well with the current level of multicast use while no long-term solution is available. MSDP also suffers from problems when managing dynamic groups. With intra-domain routing protocols, when a source starts transmitting, the network creates some type of routing state to accommodate that traffic and to control its distribution. In the case of MSDP, however, when a source starts sending data, information about the existence of the source must be propagated before routing state can be created. It is likely that when groups are dynamic, either because of bursty sources or frequent group member join/leave events, the overhead of managing the group will be significant. The following two problems may arise:
1) As we have previously seen, SA messages are sent periodically. So, a significant delay may appear between the moment when new receivers join via their local RP and the moment when a new SA message reaches their local MSDP agent. To avoid this problem, MSDP peers may be configured to cache SA messages. It is also possible for an MSDP peer to send an SA-Request message to a caching MSDP peer. Although this mechanism solves the problem, it imposes extra complexity and reduces scalability because of the extra state that needs to be maintained.
2) The second problem is related to bursty sources. This type of source is characterised by sending short packet bursts separated by silent periods of several minutes. The typical example is an application announcing sessions using the SAP protocol. These tools periodically send a single packet addressed to the well-known multicast group 224.2.127.254. Suppose one of these packets is sent by a host: when it arrives at the local domain RP, the RP floods an SA message towards its MSDP peers. By the time members of that group within external domains get the inter-domain distribution tree built, it is too late and they cannot hear the announcement. One might think that at least the tree is now built for the following announcement, but if the next announcement from this source is not sent before the forwarding state expires (after about three minutes), the forwarding state is discarded because no packets are being sent. In this way, packets from bursty sources never reach group members within external domains. To solve this problem, the MSDP protocol says that SA messages will carry the first 'n' data packets. This is not very elegant, but it solves the problem. Of course, it is difficult to standardise this mechanism precisely because of its lack of elegance, so people in the MSDP WG are making proposals that allow data to be carried in either GRE or UDP packets.
SECURITY, POLICY AND AUTHENTICATION
With the widespread use of the Internet, security problems started to appear. People started to demand new services from the network, and the protocols started to evolve to meet those requirements. The Internet evolved from an experimental technology into a real service used for e-commerce and even money transactions. In this environment, services that had not been taken into account in the initial stages of the Internet, such as security, authentication, integrity, privacy and non-repudiation, started to be demanded. Until these requirements were met, the Internet was not regarded as a real service. Nowadays, a lot of effort is being put into these issues: many protocols for e-commerce are appearing, building Public Key Infrastructures (PKIs) is a hot topic, and almost all critical applications are starting to use cryptography. It is usually thought that these issues only matter when money transactions are involved; however, they also need to be taken into account in multimedia environments, as these too are becoming an Internet
business. The typical example is a Video on Demand (VoD) server whose clients pay for access to certain multimedia content. Covering these issues in unicast communications is straightforward with current technology; however, when multicast communications are involved, the problem is much more complex. In this chapter we review the current security, policy and authentication problems and look at possible solutions.
Shortcomings in the Current Architecture
The current IP multicast architecture is widely used, but only a few ISPs offer this technology as a service to their customers. Some years ago, everybody thought that the main cause was poor stability. Nowadays, IP multicast is much more stable than it was; however, ISPs still think twice before offering it. Why? We believe it is because they consider it still to be in an immature stage. In fact, it has several problems that need to be addressed. Some of them are presented below:
• Denial of service attacks. With the current approach it is very simple for a malicious user to clog network links or even prevent people from taking part in multicast sessions. Imagine the disaster if you have paid to receive a concert via multicast and then cannot see anything.
• Policy of use. With the current protocols, it is not possible to establish who can access the IP multicast network, who can send datagrams to which groups, and so on. Nowadays IP multicast is an all-or-nothing service: you can allow everybody in your network to do anything (send or receive without any control over the scope of those packets) or allow nobody to use IP multicast.
• Authentication. This is the process of checking that users are who they claim to be. This is one of the issues making it difficult to establish a policy on IP multicast usage; in fact, if you cannot authenticate a user, you cannot impose any restrictions on that user.
• Key distribution. The current approach to encrypting sessions is based on an algorithm like DES or IDEA with a common group key that needs to be distributed to all the participants in some unstandardised manner. The problem arises when one of the members leaves the group and a new group key must be redistributed. This process is not very scalable, and various approaches are being defined in the Secure Multicast Research Group of the IRTF (Secure Multicast Group Charter).
• Billing. How can an ISP bill for this service? Who pays for it, the source or the receivers? These problems also need to be solved, and nowadays there is no proper way to do so.
In the following subsections, we explain in more detail how and when these problems arise and the efforts being made to solve them.
Current Approaches to Solve the Problems
In the current IP multicast model, there are no restrictions on a source sending datagrams addressed to a certain multicast group. Under this model, it is very easy for a malicious user to mount a DoS attack. For example, a user on your university's network can start sending a lot of garbage data addressed to the multicast group and port used by a tele-teaching course, preventing all the students from properly receiving the contents of the course. Another example is sending a lot of data over low-bandwidth links and clogging the network.
Nowadays, the only way to avoid this kind of DoS attack is to make certain changes to the current IP multicast model in order to provide IP multicast authentication services. You could stop these attacks if you could authenticate who is sending the data and where the data is addressed; depending on those parameters, you could then decide whether the datagrams should be accepted or dropped. So, providing authentication alone is not sufficient: we also need to establish a policy and to have the proper mechanisms to enforce it. Of all these services, the most important one is authentication, because it makes it possible to offer all the others. For example, you need to authenticate the source of some datagrams in order to bill for them. In the last few years, little attention has been paid to these issues: people were more interested in routing or real-time data transmission, so there are not many proposals to solve these problems. However, a lot of people are now becoming aware of them and starting to work hard on them. One of the proposals was made by Ishikawa et al. (1998) and is called IGMP Extensions for IP Multicast Senders and Receivers Authentication; we have developed a system based on this work that we describe later. Another, which we will also examine, is the current IGMPv3 draft defined by the IETF (Cain, 1999).
IGMP Extensions for IP Multicast Senders and Receivers Authentication
This section describes the extensions to IGMPv2 that allow IP multicast senders and receivers to be authenticated, preventing unauthorised users from sending and receiving IP multicast datagrams. The three main objectives are:
• A method for preventing malicious senders from injecting traffic into the network.
• A method ensuring that unauthorised people do not receive multicast data.
• An authentication mechanism that is independent of the underlying IP multicast routing protocol.
This mechanism is based on two kinds of entities, called ingress routers and egress routers. An ingress router authenticates a directly attached IP multicast sender sending multicast datagrams into an IP multicast network. An egress router is responsible for authenticating a directly attached IP multicast receiver. So, an IP multicast sender sends IP datagrams through an ingress router; the IP multicast datagrams travel towards egress routers through the IP multicast routers within the network, each running some multicast routing protocol. Finally, an egress router sends the IP multicast datagrams to its locally attached IP multicast receivers that have joined the host group. For user authentication, a challenge-response mechanism similar to that used in CHAP (Simpson, 1996) is employed. When an IP multicast sender starts to send IP multicast datagrams, an ingress router may optionally authenticate it using the challenge-response mechanism. The method used for the authentication can be RADIUS (Rigney, 1997) or another. The important point is that the ingress router needs to authenticate the sender before forwarding the datagrams it sends. If the authentication succeeds, IP multicast datagrams sent by the sender travel towards the egress routers through the IP multicast routers; otherwise, the ingress router silently discards all the datagrams sent by that sender. This mechanism prevents an unauthorised user from sending IP multicast datagrams to the Internet. When an IP multicast receiver starts to receive IP multicast datagrams, an egress router may optionally authenticate it using the challenge-response mechanism. An egress router
may optionally use RADIUS as the authentication server when authenticating the IP multicast receiver. If the authentication succeeds, the egress router starts to forward the IP multicast datagrams to the receiver; otherwise, it does not. This mechanism prevents an unauthorised user from receiving IP multicast datagrams from the Internet. A sketch of the challenge-response exchange appears after the following list. Although this approach can work well, it has some open issues:
• Receivers on shared media networks. On shared media networks like Ethernet, where packets are seen by every host connected to the LAN, this approach does not work well. The problem arises whenever a local host on the LAN is authorised to receive certain content: since the host is authorised, its local egress router will start to forward that content onto the LAN, and from that moment every host on the LAN (even unauthorised ones) can receive the content by joining the proper multicast group.
• Granularity of filtering IP multicast datagrams. An ingress router drops IP multicast datagrams sent from unauthenticated IP multicast senders based only on their source IP addresses, even if user IDs are used for authenticating the senders.
• It is not possible to control the scope of the sent packets. This approach only allows us to decide whether a packet is allowed to be forwarded depending on the source IP of the sender. It would be useful to decide whether the packet should be forwarded depending not only on the source address but also on its scope; such an extension would allow a maximum allowed TTL to be assigned to a certain user.
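For illustration, the CHAP-style exchange (RFC 1994) can be reproduced in a few lines; the identifier, secret and roles below are invented for the example:

import hashlib, os

def chap_response(identifier, secret, challenge):
    # CHAP response per RFC 1994: MD5 over identifier || secret || challenge;
    # the shared secret itself never crosses the wire.
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# Router side: issue a random challenge, then compare the host's answer with
# a locally computed one (the secret could equally be checked via RADIUS).
challenge = os.urandom(16)
answer = chap_response(1, b"shared-secret", challenge)          # host side
print(answer == chap_response(1, b"shared-secret", challenge))  # True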
Multicast Access Control
We have developed an IP multicast access control system based on the ideas of the previous work but using a different implementation approach. The main problem we tried to solve was access control in multicast environments: our system should provide control over who joins the MBone, who sends data addressed to a certain multicast group, and so on. With such a system, it is possible for universities to offer IP multicast services to their students without any risk of flooding or clogging external networks. Our system is based on various distributed entities that can be placed anywhere in the network:
• Musersd. A daemon running on the multicast router (currently a Linux box with a modified kernel allowing it to act as an ingress router) that is used for communicating policy changes to the multicast router's kernel.
• Mauthclient. The client for managing and establishing the policy, which can be any Java-enabled browser.
• Mauthserver. This element acts as a mediator between the mauthclient and musersd. It is used for communicating multicast session announcements to the client. It also offers fault tolerance, because its strategic position allows it to recover the system in case of a failure at either of its ends.
Figure 6 shows the system architecture. In order to make the multicast router work as an ingress router (in the sense of Ishikawa et al., 1998), we had to modify the multicast routing code in the Linux kernel. To be more precise, we introduced some system calls allowing us to configure and establish a multicast forwarding policy using the musersd daemon. In addition, our system calls allow the policy to be reconfigured on-line, without needing to reboot or reload anything.
The policy can be established based on three different properties: source, destination and TTL. When the manager uses the Web interface (mauthclient) to create or change the current policy, he can configure which sources can address multicast datagrams to which multicast groups and what the maximum TTL is that each sender is allowed to use. Setting the maximum TTL to a local scope (usually 15) for every student is thus sufficient to be sure that no student is going to interfere with external sessions, while students can still receive conferences or meetings originating outside their university; they could also participate in those sessions by contacting the manager. The Web interface for managing the system shows all the active sessions and allows the policy to be configured easily; a snapshot of the interface is shown in Figure 7. For communication between the different entities, we have developed a simple control protocol. It follows the typical request/response pattern; in addition, the mauthserver uses the object serialisation facility offered by Java for communicating changes in the active sessions, new sessions, or sessions that have expired. The key issue here has been the addition of new structures and functionality to the Linux kernel. These additions make a typical (Linux) multicast router become an ingress router. More specifically, we added internal structures for storing the policy information (source IP, multicast group and maximum TTL), added system calls allowing senders to be registered or unregistered without restarting anything, and modified some routines so that the ingress router checks whether datagrams are authorised to be forwarded or need to be dropped. A sketch of this check is given below. The system currently works well and is independent of the multicast routing algorithm used. This is ongoing work, and we are planning various improvements, such as avoiding the need for a manager and improving the authentication process.
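The forwarding check that our modified kernel performs can be paraphrased as follows (a user-space sketch with a made-up policy table; the real implementation lives in kernel structures):

# Hypothetical user-space mirror of the kernel policy table described above:
# (source IP, multicast group) -> maximum TTL the sender may use.
policy = {("155.54.12.7", "224.2.156.43"): 15}   # made-up entries

def allow_forwarding(src, group, ttl):
    # Forward only registered senders that respect their allowed scope.
    max_ttl = policy.get((src, group))
    return max_ttl is not None and ttl <= max_ttl

print(allow_forwarding("155.54.12.7", "224.2.156.43", 15))   # True
print(allow_forwarding("155.54.12.7", "224.2.156.43", 127))  # False: TTL too big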
IGMPv3
IGMPv3 is the latest version of the IGMP protocol, used by IPv4 hosts to report their IP multicast group memberships to their locally attached multicast-enabled routers. Version 3 of IGMP adds support for source filtering, that is, the ability for a system to report interest in receiving packets only from specific sources, or from all but specific sources, sending to a certain multicast group. This new information in IGMPv3 membership reports may be used by multicast routing protocols to avoid delivering multicast packets from specific sources on networks where there are no interested receivers. For maintaining socket state, a tuple like the following is used for every interface and multicast address pair:
(interface, multicast-address, filter-mode, source-list)
where filter-mode can be either INCLUDE or EXCLUDE and source-list is a list of up to 64 source IP addresses. For example, if I only want to receive multicast datagrams addressed to 224.129.34.12 from 130.206.1.169 on my 155.54.95.10 interface, the entry would look like:
(155.54.95.10, 224.129.34.12, INCLUDE, {130.206.1.169})
Different sockets on the same machine have different entries, so a per-interface state also needs to be maintained. More specifically, every interface uses a table with entries of the form:
(multicast-address, filter-mode, source-list)
Figure 6: Architecture of the whole system
Figure 7: Interface for managing the whole system
So, when a multicast datagram arrives, it is first checked at the interface layer to see whether it needs to be delivered to the socket layer; if it passes, the socket layer delivers it to the proper sockets. The general rules for deriving the per-interface state from the per-socket state are as follows: for each distinct (interface, multicast-address) pair that appears in any socket state, a per-interface record is created for that multicast address on that interface. Considering all socket records containing the same (interface, multicast-address) pair:
• If any such record has a filter mode of EXCLUDE, then the filter mode of the interface record is EXCLUDE, and the source list of the interface record is the intersection of the source lists of all socket records in EXCLUDE mode, minus those source addresses that appear in any socket record in INCLUDE mode. For example, if the socket records for multicast address m on interface i are:
from socket s1: (i, m, EXCLUDE, {a,b,c,d})
from socket s2: (i, m, EXCLUDE, {b,c,d,e})
from socket s3: (i, m, INCLUDE, {d,e,f})
then the corresponding interface record on interface i is:
(m, EXCLUDE, {b,c})
• If all such records have a filter mode of INCLUDE, then the filter mode of the interface record is INCLUDE, and the source list of the interface record is the union of the source lists of all the socket records. For example, if the socket records for multicast address m on interface i are:
from socket s1: (i, m, INCLUDE, {a,b,c})
from socket s2: (i, m, INCLUDE, {b,c,d})
from socket s3: (i, m, INCLUDE, {e,f})
then the corresponding interface record on interface i is:
(m, INCLUDE, {a,b,c,d,e,f})
Note that if the source list grows to more than 64 elements, it becomes an EXCLUDE {} list, that is, it will accept datagrams from all sources. These merging rules are sketched in code below. What this protocol defines at the router side is basically the behaviour of an ingress router. However, it does not define how to authenticate the source sending the datagrams, which makes us think that in the near future a scheme like the one we have previously described could be standardised.
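These merging rules translate directly into code; the following sketch reproduces the two examples above (set-based, ignoring the 64-source overflow rule):

def merge_socket_records(records):
    # Derive the per-interface (filter-mode, source-list) from the socket
    # records of one (interface, multicast-address) pair, as described above.
    # records: list of (mode, set_of_sources) with mode 'INCLUDE'/'EXCLUDE'.
    excludes = [srcs for mode, srcs in records if mode == "EXCLUDE"]
    includes = [srcs for mode, srcs in records if mode == "INCLUDE"]
    if excludes:
        result = set.intersection(*excludes)
        for srcs in includes:
            result -= srcs
        return ("EXCLUDE", result)
    return ("INCLUDE", set.union(*includes) if includes else set())

# The examples from the text:
assert merge_socket_records([("EXCLUDE", {"a","b","c","d"}),
                             ("EXCLUDE", {"b","c","d","e"}),
                             ("INCLUDE", {"d","e","f"})]) == ("EXCLUDE", {"b","c"})
assert merge_socket_records([("INCLUDE", {"a","b","c"}),
                             ("INCLUDE", {"b","c","d"}),
                             ("INCLUDE", {"e","f"})]) == ("INCLUDE", set("abcdef"))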
MULTICAST ADDRESS ALLOCATION
In previous sections we have seen how multicast routing has evolved from intra-domain algorithms based on flood-and-prune to more sophisticated and efficient inter-domain algorithms. These inter-domain algorithms distinguish between routing within a domain and routing among different domains. The basic idea behind this new kind of routing algorithm is to reuse the idea of hierarchical routing based on autonomous systems (ASs) to route multicast datagrams. Unicast routing is based on the division of the entire Internet into several ASs, where each AS is a part of the Internet under the control of a single entity, with its own internal structure and its own routing behaviour. These ASs use BGP (Border Gateway Protocol) to peer with other ASs on the Internet. As every AS has its own unicast address range, ASs can inform their peers about how to route packets between them. However, multicast addresses do not belong to any AS; in fact, hosts from different ASs may be simultaneously subscribed to a given multicast group.
In the next sections we are going to see how multicast address allocation has evolved from the traditional model based on random selection to new schemes used with inter-domain routing algorithms.
The Traditional Model
Sdr's address allocation scheme constitutes the de facto standard in multicast address allocation. Whenever sdr needs to select an address, it picks one at random from among the unused addresses. But what does sdr consider an unused address? The answer is simple: an unused address is one that is not used by any of the advertised sessions and is not a well-known multicast address. Although this is the simplest way to choose an address, it has several disadvantages and problems:
• It can produce collisions between addresses used by different sessions.
• It is not possible to develop true inter-domain routing based on this mechanism.
An example of the first problem: suppose a conference uses a certain multicast group A with a TTL scope limited to Europe. Then a user from the USA, who cannot know of the existence of this conference, creates another conference using the same multicast group A, but with world scope. Users in Europe then receive data from both conferences. The probability of using the same group/port is very low when the number of groups in use is small, but it increases steeply once the fraction of addresses in use reaches a certain threshold, as the sketch below illustrates. The second problem with this allocation model is caused by the random nature of the choice: there is no way to define the concept of a "multicast domain" that would make it possible to extend BGP to the multicast model.
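A quick back-of-the-envelope calculation shows how steeply the clash probability rises with random selection; the address space size below is an arbitrary example, not sdr's actual pool:

def collision_probability(groups_in_use, address_space):
    # Birthday-style estimate of the chance that `groups_in_use` independent
    # random allocations contain at least one clash.
    p_no_clash = 1.0
    for k in range(groups_in_use):
        p_no_clash *= (address_space - k) / address_space
    return 1.0 - p_no_clash

# E.g., with ~65,000 usable addresses in a scope, a few hundred concurrent
# sessions already make clashes likely:
print(collision_probability(300, 65000))   # ~0.50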
The Multicast Address Allocation Model
Although administrative scoping can solve some of the problems of the traditional address allocation model, it cannot provide a way to define the concept of a "multicast domain," so a more sophisticated address allocation architecture is needed. This architecture assumes the use of administrative scoping (Meyer, 1998), because TTL scoping is a poor scope model. The most important properties of this multicast allocation mechanism are robustness, timeliness, low probability of clashes and good address space utilisation (Thaler, 2000).
• Robustness: An application requiring a multicast address should always be able to obtain one, even in the presence of network failures.
• Timeliness: A short delay is acceptable before a client obtains an address with a reasonable level of uniqueness.
• Low probability of clashes: An allocation scheme should always be able to allocate an address that does not clash with the address of another session.
• Address space packing: When the address space is scarce, it is important to utilise it efficiently. Simply partitioning the address space produces fragmentation.
The main problem is that these properties can be contradictory. For example, guaranteeing that no clashes occur requires partitioning the address space, which can produce poor address space utilisation. So, we must prioritise among these properties. The typical architecture of this model is divided into three layers:
• Layer 1: This is the protocol that a multicast client uses to request a multicast address from a multicast address allocation server (MAAS). It is the server's responsibility to guarantee that an assigned address is not granted to more than one client. Examples
of protocols at this layer are MADCAP (Hanna, 1999) or HTTP access to a Web server that performs the allocation.
• Layer 2: This is the protocol used among the different MAASs belonging to the same domain to ensure that they do not allocate duplicate addresses. This layer is also used by MAASs to obtain multicast address ranges. AAP is an example of a protocol at this layer.
• Layer 3: This is the inter-domain protocol that allocates address ranges to Prefix Coordinators. Individual addresses or subranges may then be allocated out of these ranges by MAASs inside the domain. MASC (Estrin, 1998) is an example of a protocol at this layer.
The MASC/BGMP Architecture
The main drawback of the current allocation scheme is that addresses are unstructured because of the random selection. The Multicast Address Set Claim (MASC) protocol (Estrin, 1998) and the Border Gateway Multicast Protocol (BGMP) (Thaler, 1998) are frameworks intended to extend the IP unicast AS model to IP multicast. MASC is a hierarchical address allocation architecture based on the existing inter-domain topology (local networks belong to a regional network, which is part of a national network, and so on). MASC dynamically allocates addresses using a listen-and-claim-with-collision-detection approach: child domains request addresses from their parent domain, which obtains its own addresses from its parent domain, and so on. A child domain listens to the multicast addresses selected by its parent, selects a subrange and then propagates its selection. Claimers wait for a period to detect collisions before informing the domain's MAASs of the acquired range.
Figure 8: Multicast address allocation architecture (Prefix Coordinators peer with each other at Layer 3; MAASs coordinate within a domain at Layer 2; clients request addresses from MAASs at Layer 1)
Requirements for Inter-Domain Multicast Routing
An inter-domain multicast routing architecture must meet several requirements, such as scalability, stability, policy, conformance with the IP service model and intra-domain independence (Kumar, 1996).
• Scalability: To develop a scalable system, the amount of information exchanged between inter-domain routers must be minimised, and the address allocation model should scale well as the number of groups increases.
• Stability: It is not desirable for distribution trees to be frequently rebuilt, since rebuilding implies additional control traffic as well as potential packet loss on sessions in progress.
• Conformance with the IP service model: IP multicast senders do not need to be joined to a multicast group in order to send data to it. Moreover, multicast applications expect to send data whenever it is available; therefore, it is important to minimise computation time in the router.
• Intra-domain multicast routing independence: Each domain can run any intra-domain routing protocol independently of, and transparently to, the inter-domain routing.
Multicast Address Set Claim
MASC is used by special nodes in a domain (normally the border routers) to acquire multicast address ranges to be used by the Multicast Address Allocation Servers (MAASs) in that domain. These nodes communicate among themselves to ensure that no collisions are produced. MASC domains have a hierarchical structure that reflects the inter-domain topology. A domain that is part of a higher-level domain is a child MASC domain of that domain, which is called the parent domain. A domain that has no parent domain is called a top-level domain. We will use an example to explain the MASC protocol. Suppose a hierarchy of domains as shown in Figure 9.
Figure 9: Address allocation using MASC (domains A through G; domain A holds 224.0.0.0/16 and domain C already uses 224.0.2.2/28)
Domain A advertises its address ranges (i.e., 224.0.0.0/16) to all of its child domains. Now, domain B claims an address range (i.e., 224.0.2.0/24) from its parent domain A's address space and informs its parent and its directly connected siblings that it wants that address range. Then A propagates this claim to all of its children. If there is some domain C that is currently using this range (or a subrange of it), it sends back a collision announcement. For example, if C is already using the address range 224.0.2.2/28, it sends a collision announcement to B. When B hears the collision announcement, it gives up its current claim and selects another multicast address range to claim. If B does not receive any collision announcement within a certain period of time, it assumes that there is no collision and communicates the acquired address range to its local MAASs, as well as to other domains through BGP; a sketch of this loop is given below. The parent domain, A in this case, monitors address space utilisation and how much of its address space is neither assigned to any child domain nor used by itself. When address utilisation reaches a certain threshold, domain A has to claim more addresses. Since A is a top-level domain, it has no parent domain; however, it still uses the same claim mechanism, but in this case it considers 224.0.0.0/4 as the address range held by a notional parent domain.
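The claim loop can be summarised as follows (an abstract sketch; the candidate selection, announcement and collision-waiting are passed in as callables because MASC leaves their transport to the protocol machinery):

def claim_range(pick_candidate, announce, collision_within_wait):
    # Listen-and-claim with collision detection, as described above; the three
    # parameters stand in for MASC's real selection, announcement and timers.
    while True:
        candidate = pick_candidate()       # e.g., 224.0.2.0/24 out of 224.0.0.0/16
        announce(candidate)                # the parent propagates the claim
        if not collision_within_wait(candidate):
            return candidate               # no sibling objected: keep the range
        # a collision announcement arrived: abandon this claim and retry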
Border Gateway Multicast Protocol
The Border Gateway Multicast Protocol (BGMP) (Thaler, 1998) is an attempt to extend the traditional IP unicast routing architecture to the IP multicast model, designing a true inter-domain multicast routing protocol. BGMP is based on ideas from the intra-domain protocols, but it has a different goal: it does not build trees of routers, but rather bi-directional shared trees of domains. Inside a domain, any intra-domain protocol can be used; for the inter-domain routing, BGMP is used. The principal problem of CBT and PIM-SM, and in general of any sparse-mode intra-domain routing protocol, is that they do not scale well because of the method for mapping multicast addresses to the unicast address of the Rendezvous Point. BGMP does not have this problem, because this work is done by the multicast address allocation architecture using MASC. Figure 10 shows how BGMP builds a distribution tree. The receiver R joins the multicast group 224.2.127.254, which is in the range assigned to multicast domain C, so its local router sends a Domain-Wide Report (DWR) to all the border routers of domain A (its local domain). Router A1 knows that the best route for that multicast address is 224.1/16, received from its peer B1, so it sends a join message to B1. Router B1 repeats the same mechanism, so B3 sends a join message to C1, and the multicast distribution tree is built. Now, sender S starts sending data to the multicast group. Its data flows to the border routers of domain D. D2 is not on the best path to domain C, so it sends a prune message towards S (see Endnote 1), but D1 is on the best path and forwards the data from S. Now data can reach the root domain for the multicast address and can be distributed over the shared distribution tree previously built. Data is distributed among multicast domains using this mechanism, and inside each domain it is distributed by whatever intra-domain routing algorithm is in use. For example, router C3 has no neighbouring domain to distribute data to, so data is not sent to C3; router C1 does, so data from S has to be forwarded from C2 to C1.
Figure 10: Formation of a BGMP shared tree (domain A, allocated 224.1.192/20, holds receiver R; domain B is allocated 224.2/16; domain C, allocated 224.1.128/20, is the root domain for the group; domain D, allocated 224.1.130/24, holds sender S; arrows show BGMP Join messages and Domain-Wide Reports)
When data reaches router B3, it has to be forwarded through domain B towards router B1. B3 receives the data from C1, but B3 is not on the best path to sender S, so data injected by B3 could conflict with the intra-domain distribution tree built for the source. Therefore, B3 encapsulates the data and forwards it to B2, which distributes it through domain B. At this moment, data is flowing from S to R along the shared tree, but B3 still has to encapsulate data and send it to B2. As B2 is not on the shared tree, it can initiate a shortest-path branch by sending a source-specific join towards S via router D2. When D2 receives this join message, it joins the distribution tree in domain D. Then, B2 sends a BGMP prune to B3 and starts dropping the encapsulated data from B3, because it is now also receiving the data from D2. This prune message is propagated towards the root domain if B3 has no other branch of the shared tree. As we can see, the key question is deciding which domain is the multicast root domain for a group: in BGMP, the root domain is the domain to which the multicast address is allocated.
Multicast Address Allocation Protocol (AAP)
AAP (Handley, 1999) is a protocol to be used by the MAASs within a domain; it aims at coordinating the MAASs among themselves. AAP complies with the Multicast Address Allocation Architecture and is designed with its requirements in mind. MAASs and Prefix Coordinators exchange data by sending messages to a scope-relative multicast address, so all the MAASs and Prefix Coordinators receive these messages, because they are joined to this group. Prefix Coordinators periodically send Address Space Announcement (ASA) messages announcing the set of multicast addresses they have available. MAASs send Address In Use (AIU) messages to announce addresses they have allocated. As we have seen above, when a MAAS needs to allocate one or more addresses, it sends an Address Claim (ACLM) message listing the address range it claims. If addresses assigned to another MAAS collide with the claimed addresses, that MAAS sends a collision message and the first MAAS claims another address range.
A MAAS can pre-allocate addresses using the Address Intent To Use (AITU) message: the MAAS periodically sends an AITU message announcing the pre-allocated range. If another MAAS detects that this range conflicts with one already assigned, it acts in the same way as for a conflicting ACLM message. When a MAAS cannot meet its allocation requirements, it reports this to other MAASs using an Address Not Available (ANA) message. A MAAS also periodically sends Address Space Report (ASRP) messages to indicate to the Prefix Coordinators how much of its current allocation space is in use and how many addresses it needs to satisfy its demands. With these messages, a Prefix Coordinator decides whether it needs to acquire more addresses.
Multicast Address Dynamic Client Allocation Protocol (MADCAP)
Another approach to the multicast address allocation problem is based on the principles of the Dynamic Host Configuration Protocol (DHCP). This approach uses a client/server model in which hosts request addresses from address allocation servers. When a client needs a multicast address, it unicasts or multicasts a request to one or more MADCAP servers, which respond with a unicast message to the client. MADCAP (Hanna, 1999) defines a set of messages for the communication between MADCAP clients and MADCAP servers. However, it does not define how MADCAP servers acquire multicast addresses, because it assumes that another mechanism exists for this purpose (for example, MASC).
Protocol Overview
The Internet Assigned Numbers Authority (IANA) has reserved a multicast address to be used as the MADCAP server multicast address: the address with a relative offset of -1 from the last address of a multicast scope. MADCAP clients send a DISCOVER message to this reserved address in order to discover MADCAP servers that can probably satisfy their requests. Servers that can satisfy the request temporarily reserve the addresses needed and send an OFFER message back to the client. The client then selects a server and sends a multicast REQUEST message including an identifier of the chosen server. The chosen server responds with an ACK or NAK message, and the other servers release the addresses they had temporarily reserved. A toy version of this exchange is sketched below.
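The following toy version of the exchange uses invented addresses and a trivial pool; it mimics the message sequence described above rather than the actual MADCAP wire format:

class MadcapServer:
    # Toy MADCAP server: reserves an address on DISCOVER, commits on REQUEST.
    def __init__(self, server_id, pool):
        self.server_id, self.pool, self.reserved = server_id, pool, {}

    def discover(self, client_id):
        addr = self.pool.pop()                  # temporarily reserve an address
        self.reserved[client_id] = addr
        return ("OFFER", self.server_id, addr)

    def request(self, client_id, chosen_server_id):
        addr = self.reserved.pop(client_id, None)
        if chosen_server_id != self.server_id:  # client chose another server:
            if addr:
                self.pool.append(addr)          # release the reservation
            return None
        return ("ACK", addr) if addr else ("NAK",)

servers = [MadcapServer(i, [f"239.255.0.{i}0"]) for i in (1, 2)]
offers = [s.discover("client-A") for s in servers]      # multicast DISCOVER
chosen = offers[0][1]                                   # pick a server
replies = [s.request("client-A", chosen) for s in servers]
print([r for r in replies if r])                        # [('ACK', '239.255.0.10')]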
GLOP Addressing
MASC and other multicast address allocation architectures require substantial evolution of current multicast routing algorithms. GLOP (Meyer, 1999), by contrast, proposes a static allocation scheme with global scope.
Figure 11: Structure of a GLOP address (bits 0-7: the fixed prefix 233; bits 8-23: the 16-bit AS number; bits 24-31: local bits)
For this purpose, IANA has allocated the 233/8 range to be used with GLOP. The remaining bits are used as explained in the following example. Consider AS 5662: in binary, left-padded with 0s, we get 0001011000011110. Mapping the high-order octet to the second octet of the address and the low-order octet to the third, we get the prefix 233.22.30/24. This gives a /24 address range (256 addresses) to every AS on the Internet.
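The mapping is trivial to compute; a small function (ours, for illustration) reproduces the example:

def glop_prefix(asn):
    # Map a 16-bit AS number to its GLOP /24: 233.H.L/24, where H and L
    # are the high and low octets of the AS number.
    assert 0 <= asn < 2**16
    return f"233.{asn >> 8}.{asn & 0xFF}.0/24"

print(glop_prefix(5662))   # 233.22.30.0/24, matching the example above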
DEVELOPMENT OPPORTUNITIES FOR THE FUTURE
As already noted, IP multicast is a work in progress and many issues remain to be solved; thus, there is a wide range of development opportunities. We describe some of them below, but you will surely find more challenges.
Group Key Management
Using cryptographic algorithms to hold private videoconferences over unicast connections is straightforward. However, if you intend to hold private videoconferences in multicast environments, key management becomes a tricky issue. Various proposals have been made, but none of them solves the problem well enough. It would be very interesting to find a scalable way to re-key multicast groups.
Reliable Multicast
Multicast videoconferences commonly use UDP datagrams over IP multicast, that is, an unreliable delivery mechanism. However, many applications do not work well with this kind of delivery. When transmitting video, if a datagram gets lost, it is better to forget about it than to ask for retransmission; but applications like distributed whiteboards or distributed text editors need reliable delivery, because they cannot cope with losing characters or even words. The main problems with reliable multicast are:
• Feedback implosion: feedback from receivers may overwhelm a source (see the sketch below).
• How to recover lost packets.
• The requirements imposed by the applications.
Many different protocols, such as AFDP, RAMP, RMP and so on, have been developed and proposed, but none of them is perfect: some are very scalable but poorly reliable, others very reliable but poorly scalable. A multicast protocol that is both scalable and reliable has yet to be defined.
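As an example of the feedback-implosion countermeasure several of these protocols use, here is a sketch of randomised NACK suppression (the names and the timer bound are illustrative, not taken from any particular protocol):

import random

def nack_backoff(rtt_estimate, seen_nack_for_packet):
    # Each receiver delays its NACK by a random time and suppresses it if
    # another receiver's (multicast) NACK for the same packet arrives first,
    # so the source hears roughly one report per loss instead of thousands.
    delay = random.uniform(0, 2 * rtt_estimate)   # randomised timer
    # ...wait `delay` seconds, overhearing multicast NACKs meanwhile...
    if seen_nack_for_packet():
        return None          # someone else already asked: stay silent
    return "SEND_NACK"       # this receiver reports the loss for everyone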
Address Allocation
Although we have explained some proposed protocols that try to solve the multicast address allocation problem, this is ongoing research and many issues still need to be addressed. For example, there is currently no real support for the presented protocols, because not enough multicast tools implement these new models of multicast address allocation. As a result, these protocols have not been tested in a real working framework and cannot be truly validated.
Another problem related to address allocation is that no definitive model has prevailed over the rest. Moreover, the interoperability of all these new protocols has not been tested.
Inter-Domain Multicast Routing
As this chapter has shown, much work is currently being done to find a proper inter-domain multicast routing algorithm. It would therefore be very interesting to develop tools for testing this kind of proposal: for example, an application allowing these new algorithms to be simulated could be very useful, since it is better to test their behaviour before implementing them on a router.
Authentication and Access Control
Without doubt, an application providing authentication in multicast environments could become the killer application in the near future, because such cryptographic services are in high demand. As soon as multicast is widely used, these services will also be demanded for multicast applications.
Integration with Other Videoconferencing Technologies
It is also very important in the near future to integrate ITU's videoconferencing standards and IETF multicast videoconferencing. ITU's videoconferencing protocols, such as H.320, H.323, H.310 and so on, interoperate very well with each other. However, few applications actually exist that allow MBone-H.32X interoperation.
ENDNOTE
1 We assume flood-and-prune intra-domain routing in order to simplify the example. In sparse mode routing, the mechanism is analogous.
REFERENCES
Ballardie, A. (1997). Core-based trees (CBT) multicast routing architecture. RFC 2201, September.
Bates, T., Chandra, R., Katz, D. and Rekhter, Y. (1998). Multiprotocol extensions for BGP-4. RFC 2283, February.
Cain, B., Deering, S. and Thyagarajan, A. (1999). Internet group management protocol, version 3. Internet Draft, November.
Crowcroft, J., Handley, M. and Wakeman, I. (1998). Internetworking Multimedia. UCL Press, November.
Deering, S., Estrin, D., Farinacci, D., Jacobson, V., Helmy, A., Meyer, D. and Wei, L. (1998). Protocol independent multicast version 2 dense mode specification. Internet Draft, November.
Deering, S. (1991). Multicast routing in a datagram internetwork. PhD Dissertation.
Diot, C., Levine, B., Lyles, B., Kassem, H. and Balensiefen, D. (2000). Deployment issues for the IP multicast service and architecture. IEEE Network.
Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., Handley, M., Jacobson, V., Liu, C., Sharma, P. and Wei, L. (1998). Protocol independent multicast sparse mode (PIM-SM): Protocol specification. RFC 2362, June.
Estrin, D., Govindan, R., Handley, M., Kumar, S., Radoslav, P. and Thaler, D. (1998). The multicast address-set claim (MASC) protocol. Internet Draft, August.
Farinacci, D., Rekhter, Y., Lothberg, P. and Kilmer, J. (2000). Multicast source discovery protocol (MSDP). Internet Draft, February.
Fenner, W. (1997). Internet group management protocol, version 2. RFC 2236, November.
Gómez-Skarmeta, A. F., Mateo-Martínez, A. L. and Ruiz-Martínez, P. M. (1999). Access control in multicast environments: An approach to senders authentication. Proceedings of the IEEE LANOMS'99, 1-13.
Handley, M. and Hanna, S. R. (1999). Multicast address allocation protocol (AAP). Internet Draft, September.
Hanna, S., Patel, B. and Shah, M. (1999). Multicast address dynamic client allocation protocol (MADCAP). RFC 2730, December.
Ishikawa, N., Yamanouchi, N. and Takahasi, O. (1998). IGMP extensions for authentication of IP multicast senders and receivers. Internet Draft, August.
Kosiur, D. (1998). IP Multicasting. Wiley Computer Publishing.
Kumar, V. (1996). MBone: Interactive Multimedia on the Internet. New Riders.
Kumar, S., Radoslav, P., Thaler, D., Alaettinoglu, C., Estrin, D. and Handley, M. (1996). The MASC/BGMP architecture for inter-domain multicast routing.
Meyer, D. and Lothberg, P. (1999). GLOP addressing in 233/8. Internet Draft, November.
Meyer, D. (1998). Administratively scoped IP multicast. RFC 2365, July.
Moy, J. (1994). Multicast extensions to OSPF. RFC 1584, March.
Rekhter, Y. and Li, T. (1995). A border gateway protocol 4 (BGP-4). RFC 1771, March.
Rigney, C., Rubens, A., Simpson, W. and Willens, S. (1997). Remote authentication dial in user service (RADIUS). RFC 2138, April.
Secure Multicast Group Charter (SMuG). Available on the World Wide Web at: http://www.irtf.org/charters/secure-multicast.html.
Simpson, W. (1996). PPP challenge handshake authentication protocol (CHAP). RFC 1994, August.
Stardust White Paper. (2000). The evolution of multicast. Stardust.com Inc.
Thaler, D., Estrin, D. and Meyer, D. (1998). Border gateway multicast protocol (BGMP): Protocol specification. Internet Draft, November.
Thaler, D., Handley, M. and Estrin, D. (2000). The Internet multicast address allocation architecture. Internet Draft, January.
Waitzman, D., Partridge, C. and Deering, S. (1998). Distance vector multicast routing protocol. RFC 1075, November.
Chapter XIX
Mobile Multimedia over Wireless Network Jürgen Stauder Thomson Multimedia R&D, France Fazli Erbas University of Hannover, Germany
In the last few years the rapidly growing Internet has pushed new multimedia applications in the fields of entertainment, communication and electronic commerce. The next step in the information age is mobile access to multimedia applications: everything everywhere any time! This tutorial chapter addresses a key point of this development: data transmission for mobile multimedia applications in wireless cellular networks. The networks addressed are existing standardized terrestrial wireless systems such as GSM, D-AMPS, IS-95 and PDC, including their evolutions HSCSD, GPRS, HDR, IS-136+ and IS-136HS. Furthermore, proprietary satellite networks like Orbcomm, Globalstar, ICO, Ellipso and Courier are considered. Finally, future high-bandwidth terrestrial/satellite third-generation systems based on the UMTS standard, as well as future proprietary systems like Astra-Net, Skybridge, Teledesic and Spaceway, are discussed. For each of these networks, an overview of the data channels is given with respect to their capacity, temporal organization, error characteristics, delay and availability. Further, the architecture, the functions and the capacities of the mobile terminals are reviewed. Having studied this chapter, the reader will be able to answer questions like:
• Which network will be capable of transmitting real-time video?
• Will rainfall interrupt my mobile satellite Internet connection?
• When will high-bandwidth wireless networks be operational?
• How can existing multimedia applications be tuned to be efficient in wireless networks?
The chapter closes with a glossary of terms, a reference list pointing to in-depth literature and a list of Web sites of companies and organizations providing useful information.
INTRODUCTION
In recent years, the rapidly growing Internet has pushed new multimedia applications. The next step in the information age is mobile access to these multimedia applications: everything everywhere any time! Out of office will no longer mean out of touch: remote employees can access the same data and use the same tools, thanks to mobile phones or wirelessly connected personal digital assistants (PDAs). Consumers can demand multimedia services whenever they want and wherever they are. Email, home banking and e-commerce are the first services that have left the Internet to enter wireless terrestrial communication networks such as GSM (ETSI, 1999a) in Europe, D-AMPS (USDC, IS-54) and IS-95 in the USA or PDC in Japan. The number of users of mobile terrestrial services will grow from 430 million in 2000 to 940 million in 2005 (ETSI, 1999a). Among these, in 2005 more than 50 million people will use mobile infotainment appliances, according to the UMTS Forum (2000). What is needed to make mobile multimedia happen? Three components have to be realized: multimedia services, multimedia networks and multimedia terminals. Multimedia services are offered by service providers in the fields of:
• broadcast (TV and audio channels, editors);
• information on demand (video, audio, weather, documents);
• communication (voice and video telephony);
• commerce (banking, electronic commerce, publicity); and
• industry (collaborative work, VPN).
These services are nowadays tuned to run on PCs or PDAs, either independently or connected via the Internet or an intranet. To make them mobile, they have to be adapted to the characteristics of multimedia networks and multimedia terminals. The main concern of this chapter is the cooperation between multimedia services and wireless cellular global networks. For network developers, the question is: what constraints does multimedia transmission impose on wireless networks? For example, which network delay is tolerable for a real-time video/audio transmission from the USA to Europe? For multimedia experts, the question is rather: which constraints do the existing or foreseen wireless network standards impose on multimedia applications? For example, what error rate should a three-dimensional virtual travel agent application expect when serving a mobile user in the left lane of a German highway? This tutorial chapter follows the multimedia expert's view of the problem. The section FROM FIXED TO WIRELESS NETWORKS introduces the existing and future wireless cellular global networks. Furthermore, the networks' data channels are presented, focusing on their capacity, their temporal organization, their error characteristics, their delay and their availability. The following section MOBILE NETWORK TERMINALS outlines the architecture of terminals that receive, process, display and retransmit multimedia content over such channels. The adaptation of multimedia applications to the constraints of data channels and mobile terminals is then discussed in the section MOBILE MULTIMEDIA APPLICATIONS. The chapter closes with the section CONCLUSION AND FUTURE TRENDS, followed by a glossary of this chapter's terms and a list of references.
FROM FIXED TO WIRELESS NETWORKS
Networks connect terminals (computers, phones, pagers) among themselves and/or with servers (WAP, World Wide Web or broadcast servers) for communication purposes.
The nature and the conditions of these communications are manifold, and thus a variety of networks exist. Historically, networks for computer interconnection (LAN, Internet), for broadcasting (radio, TV) and for telecommunication (telephone) were separate. In the future, all these networks will tend to merge with each other. Figure 1 presents telecommunication-oriented networks according to three criteria. The first criterion is the mobility of the terminals. In fixed networks, terminals can be moved within the radius of the cable. In cordless networks, the terminals can move inside a spatial cell (e.g., a room). In wireless networks, the terminals can freely move inside the region that is covered by the network (e.g., nationwide). The second criterion is the type of modulation used. Earlier network standards used analog modulation techniques (AM, FM), while modern networks employ digital modulation techniques (PSK, FSK, QPSK, GMSK) to cope with difficult transmission conditions and to allow higher data rates. The third criterion addresses the terrestrial nature of most of the existing (often only nationwide) networks. Some globally operating telephone networks and the new standard UMTS also employ satellites to ensure a larger geographic coverage.
Fixed Networks
Fixed networks are characterized by static terminals (e.g., telephones) and static connections (e.g., cables). Furthermore, the terminals have a fixed network address: from the network's point of view, a terminal is always reachable via the same cable. A well-known type of fixed network is the telephone network (PSTN, public switched telephone network). In the following, more advanced digital telecommunication networks are reviewed: the synchronous ISDN and B-ISDN networks, as well as the asynchronous ATM network. Finally, their relation to the Internet is outlined.
Figure 1: Fixed, cordless and wireless networks. The figure classifies telecommunication networks as fixed (analog: PSTN; digital: ISDN), cordless (analog: CT0/CT1; digital: CT2/3, DECT, PHS (Japan), HIPERLAN, 802.11 (USA), SWAP (USA), Bluetooth) and wireless/cellular, the latter divided into analog terrestrial (NMT (Scand.), TACS (UK), AMPS (USA), C/D-Netz (RFA), Radiocom (F)), digital terrestrial (GSM, IS-95 (USA), D-AMPS (USA), PDC (Japan)), digital satellite (Orbcomm, Globalstar, ICO, Thuraya, Courier, Teledesic, Skybridge, GEOs) and digital satellite+terrestrial (UMTS).

ISDN
The Integrated Services Digital Network (ISDN) is defined by a set of CCITT/ITU standards for digital transmission over ordinary telephone copper wire as well as over other media (Boisseau, 1994; ITU, 1988-99). The ISDN technology integrates voice, text and (audio, video, multimedia) data, as well as services such as telephony and fax (Sigmund, 1996). ISDN is
generally available from phone companies in most urban areas all over the world. There are two levels of service: the Basic Rate Interface (BRI), intended for homes and small enterprises, and the Primary Rate Interface (PRI), for professional users. Both rates include a number of B (bearer) channels and a single D (data) channel. The B channels carry data, voice and other services; the D channel carries control and signalling information. The BRI consists of two 64-kbps B channels and one 16-kbps D channel (user data rates), so a BRI user can have up to 128 kbps. The PRI consists of more channels; in Europe, for example, the PRI offers 30 B channels and one D channel.
B-ISDN
B-ISDN is the broadband counterpart to ISDN. B-ISDN aims at broadband networks over fibre optic and radio media, while ISDN transmits over ordinary telephone copper wires. B-ISDN is both a concept and a set of services. The first concept of B-ISDN was simply:
• to add new high-speed channels to the existing channel spectrum,
• to define new broadband user-network interfaces and
• to rely on the existing ISDN protocols.
The bit rate available to a broadband user typically ranges from about 50 Mbps up to hundreds of Mbps. B-ISDN will support services of both constant and variable bit rates; data, voice, still and moving picture transmission; and multimedia applications that may combine data, voice and picture components.
ATM
ATM (asynchronous transfer mode) is a dedicated connection-oriented packet switching technology. In ISDN and PSTN, the communication is connection oriented, i.e., an established channel is kept open for the whole duration of the communication. In ATM, only dedicated connections, i.e., virtual or logical connections, are established. Actual data is transmitted packet by packet, only when needed. These packets, called cells, are each 53 bytes long. A cell is processed asynchronously and independently of other cells. Because ATM is designed for ease of hardware implementation (rather than software), faster processing speeds are possible. The pre-specified bit rates are either 155 Mbps or 622 Mbps, though the speed of ATM networks is expected to reach higher bit rates as technology evolves. Today, ATM is mostly mapped onto existing isochronous (connection-oriented) networks. ATM is a key component of broadband ISDN (B-ISDN) (ITU-T, 1988-99; Händel, 1994). Apart from the ITU-T, the ATM Forum defined some ATM interface standards.
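To make the cell format concrete, the following Python sketch (our own; the field packing follows the standard 53-byte UNI cell layout of a 5-byte header plus 48-byte payload) builds one cell:

def build_atm_cell(vpi: int, vci: int, pt: int, clp: int, payload: bytes) -> bytes:
    assert len(payload) == 48, "an ATM payload is always 48 bytes"
    gfc = 0  # generic flow control, unused here
    # UNI header: GFC (4 bits), VPI (8), VCI (16), PT (3), CLP (1) = 32 bits
    header32 = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
    hec = 0  # a real implementation computes a CRC-8 over the first 4 header bytes
    cell = header32.to_bytes(4, "big") + bytes([hec]) + payload
    assert len(cell) == 53
    return cell

cell = build_atm_cell(vpi=1, vci=42, pt=0, clp=0, payload=bytes(48))
print(len(cell), "bytes per cell")  # -> 53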
Internet
The Internet is a worldwide system of computer networks, a network of networks, in which users at any computer can, if they have permission, get information from any other computer. It was conceived by the Advanced Research Projects Agency (ARPA) of the U.S. government in 1969 and was first known as the ARPANet. The original aim was to create a network that would allow users of a research computer at one university to be able to "talk to" research computers at other universities. A side benefit of ARPANet's design was the maintenance of network functionality even if parts of it were destroyed in military attacks or other disasters. Today, the Internet is an increasingly private, cooperative and self-sustaining facility accessible to hundreds of millions of people worldwide. Physically, the Internet uses a portion of the total resources of the currently existing public and private telecommunication networks.
What technically defines the Internet is its use of a set of protocols called TCP/IP (Transmission Control Protocol/Internet Protocol). The TCP transport layer protocol permits the establishment of reliable connections between user terminals. The IP network layer protocol transmits the data connection-less, i.e., packet switched (Gilligan, 1997).
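A minimal socket-level illustration of the two styles (our own sketch; the loopback address and port are arbitrary): TCP provides a reliable, connection-oriented byte stream on top of connection-less IP, whereas a UDP socket simply hands isolated datagrams to IP with no delivery guarantee.

import socket

udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"hello", ("127.0.0.1", 9999))  # fire and forget: no acknowledgement
udp.close()

tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# tcp.connect(("example.org", 80))  # would perform a handshake and
#                                   # retransmit lost segments automatically
tcp.close()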
From Fixed to Wireless: Radio Access Methods
Wireless networks replace the cable by a radio connection. The key technology is radio access. Radio access methods aim at partitioning the scarce time-frequency space among as many mobile stations (MSs) as possible. In this section, the methods FDMA, TDMA and CDMA are introduced. To organize the radio access, the area covered by the network is partitioned into spatial cells, which are ideally hexagonal. Each cell is served by a base station (Mehrotra, 1994), as shown in Figure 2. For FDMA and TDMA, the time-frequency space is partitioned among the cells such that non-neighbouring cells may re-use parts of it, i.e., two such cells may use the same frequency at the same time. In general, a mobile terminal (called MS, mobile station) is in contact with one base station. The power radiated by the base station is sufficient to provide adequate radio coverage for all MSs travelling in its cell.
FDMA
Early cordless systems and early first-generation wireless networks used Frequency Division Multiple Access (FDMA). Each MS uses its own frequency band; for several MSs, several bands are necessary. For example, the British CT2 cordless system uses 100 kHz bands between 864 and 868 MHz (Steele, 1999). FDMA does not use the time-frequency space optimally. One reason is that some space has to be left between the bands to limit crosstalk. Furthermore, the channel capacity cannot be managed flexibly: either an MS uses a band or not; a "half band" is not possible. Finally, the distribution of frequencies over the cells is fixed, which limits the flexibility of the network to react to dynamic traffic.
Figure 2: Geographical coverage of a wireless cellular network, organized into hexagonal cells, each served by a base station.

TDMA
In Time Division Multiple Access (TDMA), several MSs use the same frequency band by temporal multiplex (see Figure 4). Each time-frequency segment is called a slot.
Figure 3: Frequency Division Multiple Access (FDMA), used in first-generation wireless networks. Each MS (MS 1, MS 2, MS 3) occupies its own frequency band for the whole time.

Figure 4: Time Division Multiple Access (TDMA) with frequency hopping, used in GSM and D-AMPS. MS 1, MS 2 and MS 3 occupy different time slots and change frequency band from slot to slot.
For example, the GSM wireless network standard uses slots of size 577 µs by 200 kHz (Mouly, 1992). For a standard telephone call, an MS uses one slot per TDMA frame (eight temporal slots) in the bands between 890 MHz and 915 MHz for transmission, and one other slot per TDMA frame in the bands 935-960 MHz for reception. As shown in Figure 4, the MS may use frequency hopping. The continuous change of frequency band has three advantages. First, the radio resource can be partitioned flexibly: mobile stations may even use different amounts of bandwidth (not shown in Figure 4). This is interesting for offering different services or for giving a non-used slot from a less active MS to a more active one. Second, an illegal listener has difficulties tracking the communication. Finally, fading can be better compensated: fading is a temporal phenomenon at singular frequencies. Due to multiple echoes and signal propagation (see the introduction to second-generation wireless networks), the received signal energy can drop drastically. With frequency hopping, only some slots per MS are affected, so that the effect can be partially compensated by channel coding or by retransmission.
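A quick back-of-the-envelope check of these GSM numbers (our own arithmetic on the figures quoted above):

slot_us = 577          # slot duration in microseconds, as quoted above
slots_per_frame = 8    # one TDMA frame carries eight slots
frame_ms = slot_us * slots_per_frame / 1000
print(f"TDMA frame duration: {frame_ms:.3f} ms")  # ~4.6 ms
# Each standard telephone call occupies one slot per frame,
# i.e., one eighth of the 200 kHz carrier.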
CDMA
Another promising radio access method is Code Division Multiple Access (CDMA). In CDMA, all MSs use the same frequencies all the time (see Figure 5). The separation between the different MSs is achieved using different codes (Steele, 1999). First, the data bit-stream is modulated; for example, the American standard IS-95 uses QPSK. Then, the temporal
portion of the modulated signal that corresponds to a data bit is multiplied by a unique code of N code bits running at N times the speed of the data bits. The period of one code bit of the N-bit code sequence is called a chip. For example, IS-95 uses N=64; with a data bit rate of 19.2 kbps, this gives a rate of 1.229 Mchip/s over the air per MS. At the receiver, the signal is multiplied by the same code sequence. The effect is that all signals other than the intended one are spread in their spectrum; after application of a band pass, these signals remain only as small interfering amplitudes. An inconvenience of CDMA is that the base station needs to receive all MSs with the same power in order to decode the different channels. This imposes a position-dependent power control on each MS. Furthermore, the total data rate of a cell decreases with an increasing number of MSs due to the interference inherent in CDMA; this effect is called cell breathing. Finally, the capacity of a CDMA-based network can be increased only by additional base stations, while FDMA- or TDMA-based networks can be adapted simply by additional frequencies.
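The following toy Python example (our own illustration, using two orthogonal Walsh-style codes, in the spirit of but much simpler than IS-95) shows how correlating with the right code recovers one MS's bits even though all signals overlap on the air:

N = 64                           # chips per bit, as in IS-95
code_a = [1, -1] * 32            # two orthogonal Walsh-style spreading codes
code_b = [1, 1, -1, -1] * 16

def spread(bits, code):
    # replace each data bit (+1/-1) by the bit times the whole chip sequence
    return [b * c for b in bits for c in code]

def despread(signal, code):
    bits = []
    for i in range(0, len(signal), len(code)):
        corr = sum(s * c for s, c in zip(signal[i:i + len(code)], code))
        bits.append(1 if corr > 0 else -1)
    return bits

# both MSs transmit at once; their chip streams simply add on the air
tx = [x + y for x, y in zip(spread([1, -1, 1], code_a),
                            spread([-1, -1, 1], code_b))]
print(despread(tx, code_a))  # -> [1, -1, 1]: MS A's bits, despite MS B's signal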
FDD and TDD
The time-frequency space is divided among several MSs as described by the FDMA, TDMA or CDMA techniques. In the case of duplex (bidirectional) communication, the time-frequency portion allocated to one MS has to be divided into downlink (to the MS) and uplink (from the MS). Frequency Division Duplex (FDD) uses two different frequencies; Time Division Duplex (TDD) uses two different time segments.
Cordless Systems
Cordless systems are the first step towards wireless networks. They are designed for mobile radio coverage over relatively small distances, such as in the home or office. In cordless systems, the MSs can freely move inside a cell but cannot move from cell to cell without interrupting the communication. This restricted mobility is typical for a cordless system. Furthermore, the cells are small; thus, the transmission characteristics are relatively good and the employed technologies are not too complicated (Mehrotra, 1994). The cordless systems shown in Figure 1 can be divided into two groups. The first group is telecommunication-oriented systems such as CT0-2, DECT and PHS. They are summarized in Table 1 and will be described in the following. The second group is data-transmission-oriented networks that interconnect devices such as PCs, printers and even mobile phones, such as HIPERLAN, 802.11, SWAP and Bluetooth. They are not in the scope of this chapter.

Figure 5: Code Division Multiple Access (CDMA), used in IS-95 and UMTS. All MSs (MS 1, 2, 3) use the whole frequency band all the time, separated only by their codes.
Table 1: Cordless systems

System               CT0        CT1              CT1+             CT2        DECT
Modulation           analogue   analogue         analogue         digital    digital
Frequency band [MHz] 1.6 / 47   914-916/959-961  885-887/930-932  864-868    1880-1900
Channels             8          40               80               40         approx. 120
Bandwidth            400 kHz    4 MHz            4 MHz            4 MHz      20 MHz
Access               FDMA       FDMA             FDMA             FDMA       FDMA/TDMA
Duplex               FDD        FDD              FDD              TDD        TDD
Channel allocation   fixed      dynamic          dynamic          dynamic    dynamic
Coverage             <1000 m    <300 m           <300 m           <300 m     <300 m
Cellular             no         limited          limited          limited    yes
Capacity             1 E/km2    200 E/km2        200 E/km2        250 E/km2  10000 E/km2
CT0-CT3
The first cordless systems, CT0 (employed in Asia) and CT1 (a CEPT standard), were based on analogue modulation techniques. They were followed by the digital systems CT2 (in Europe, notably the UK) and CT3 (by Ericsson). For example, CT2 offered about 40 channels, each with 32 kbps coded speech including encryption.
DECT
The Digital Enhanced Cordless Telecommunications (DECT) system is designed for home and business use, especially for smaller areas with a large number of users, such as corporate complexes. DECT divides its covered area into cells. It is based on FDMA, TDMA and TDD. DECT includes adaptive channel allocation, two-way talking capability, high capacity and more sophisticated features than CT2 systems. The modulation used is GMSK. The number of channels per carrier is 12 (per direction) with a carrier separation of 1.728 MHz. Speech is digitally encoded with 32 kbps ADPCM.
First-Generation Wireless Cellular Networks
What is the difference between wireless cellular, cordless and fixed? It is the degree of mobility. Fixed networks have a location-oriented connectivity: calling a number means calling a location. In cordless systems, the location is already less clear: the called telephone is somewhere inside the covered region, usually one or several buildings. But the telephone may not move during a call: leaving the cell of a base station means loss of connection. In wireless cellular networks, full mobility is supported: coming from a cell A into a neighboring cell B, the change from A's base station to B's is seamless, without interruption of communication. This change is called handover. In wireless cellular networks, the called number is a person, not a location. Such networks are therefore called personalized communication networks (PCNs). In the USA in the early 1920s, the Detroit police used mobile telephones in their cars on frequencies near 2 MHz. The first two-way mobile telephone system was switched on in the middle of 1933 by the New York City Police Department, also using the 2 MHz band (Mehrotra, 1994). In Germany, the first mobile telecommunication system started in 1926
Mobile Multimedia over Wireless Network 449
in trains between Hamburg and Berlin (David, 1996). Like these pioneering systems, the first generation of wireless cellular networks is based on analogue modulation technology: the speech signal is not digitised before modulation. However, all command and control of the network is digital. Like newer cordless systems such as DECT, the first-generation wireless networks make use of cellular technology (see Figure 2). First-generation wireless networks are numerous. They include the Nordic Mobile Telephone (NMT450 and NMT900), the American Advanced Mobile Phone Service (AMPS), the British Total Access Communications System (TACS; Steele, 1999), the German B- and C-Netz, the Japanese Nippon Advanced Mobile Telephone System (NAMTS), which is mainly based on the British TACS and also known as J-TACS, and the French Radiocom, known as RS2000. There are only minor differences between these analogue mobile radio systems. Table 2 summarises the technical characteristics of first-generation wireless networks (Mehrotra, 1994; David, 1996).
Second-Generation Wireless Cellular Networks
Second-generation wireless cellular networks make use of digital modulation technology. This means that the digitized signal or data is numerically modulated to an intermediate frequency and then analogically mixed up to the final carrier. This technology has its advantage in the demodulation: concurrently with the demodulation, the digital demodulator compensates transmission effects such as echoes (caused by the environment) or Doppler spectrum distortions (caused by a moving terminal). This strategy is called channel equalization and allows for much better system performance: larger distances between terminal and base stations, higher data rates, fewer errors. First-generation networks partially process their data digitally, but they use an analog modulation technique. Table 3 summarizes the second-generation wireless networks: the GSM system in Europe (ETSI, 1999a; Mouly, 1992), the D-AMPS (USDC, IS-54) system in the USA (Steele, 1999), the IS-95 system in the USA (Steele, 1999) and the PDC (JDC) system in Japan (Steele, 1999; Walke, 1999).
Table 2: First-generation wireless networks

System                 AMPS     TACS     NAMTS    NMT450   NMT900   C-Netz   RS2000
Frequency band [MHz]   825-890  890-960  870-940  453-468  890-960  451-466  406-430
HF channels            666      1000     600      180      1999     222      256
Carrier spacing [kHz]  30       25       25       25       12.5     20       12.5
Duplex spacing [MHz]   45       45       55       10       45       10       10
Modulation             FSK      PSK      PSK      FFSK     FFSK     FSK      FFSK
Table 3: Second-generation wireless networks

Network                GSM        D-AMPS (IS-54)  IS-95     PDC (JDC)
Frequency band [MHz]   890-960    824-894         824-894   810-1513
Carrier spacing [kHz]  200        30              1250      25
Duplex spacing [MHz]   45         45              45        130
Modulation             GMSK       π/4-DQPSK       QPSK      π/4-DQPSK
Radio access           TDMA/FDMA  TDMA/FDMA       CDMA      TDMA/FDMA
Duplex                 FDD        FDD             FDD       FDD

Network Architecture
In the following, the architecture of second-generation cellular terrestrial networks is described using GSM as an example. Let us first have a look at the physical architecture of a network: an operator of a public land mobile network (PLMN) possesses a (usually nationwide) network of so-called mobile services switching centers (MSCs). Each MSC is associated to a geographical region that is
partitioned into several location areas (LAs). An MSC memorizes in a local temporary database (the VLR, visitor location register) which mobile phones (MSs) are currently in which LA. Each LA is partitioned into several groups of cells. Each group of cells is managed by a base station controller (BSC). In each cell, a base station (BTS) can establish a radio connection to an MS and thus establish the link MSC-BSC-BTS-MS. This link is defined by three interfaces: the A interface (MSC-BSC), the Abis interface (BSC-BTS) and the radio interface (BTS-MS). All MSCs stay in contact with a central database called the home location register (HLR). In the HLR, the following information is listed for each mobile user: the telephone number (MSISDN), a PLMN-internal telephone number (IMSI), a technical user profile (type of MS and services) and the MSC of the region where the user is currently travelling. This latter information is always up to date as long as the mobile station is switched on. Let us now have a look at the protocol layer architecture of the network: what happens in the PLMN when a wired phone in a public switched telephone network (PSTN) calls a mobile phone subscribed to a PLMN? The called number (MSISDN) indicates to the PSTN that the called phone belongs to a specific PLMN. The call is conducted to the nearest PLMN entry, a so-called gateway MSC (GMSC). The GMSC then creates a connection to the mobile station using signaling messages, after which the actual user data (e.g., the coded voice) is transmitted to the phone. The signaling starts in the network layer, which is subdivided into the sub-layers communication management (CM), mobility management (MM) and radio resource management (RR). First, the CM creates a logical link to the mobile phone and prepares billing information. Then, the MM searches for the mobile phone: the GMSC asks the HLR, using the MSISDN number, for the MSC that the MS is currently visiting (the VMSC). The HLR provides the GMSC with an MSRN number indicating the path through the network from the GMSC to the VMSC. The VMSC then sends a paging message based on the IMSI number, via a reserved so-called beacon frequency, in all cells of the LA of the called mobile phone. Once the MS has awoken, the RR layer establishes a radio connection. Data segmentation, channel coding and modulation are carried out by the link and physical layers.
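The lookup chain described above (the GMSC asks the HLR, the HLR names the visited MSC, whose VLR knows the location area to page) can be mimicked with two dictionaries. The sketch below is our own toy model; all identifiers and numbers are made up:

HLR = {"+491701234567": {"imsi": "262011234567890", "vmsc": "MSC-Munich"}}
VLR = {"MSC-Munich": {"262011234567890": "LA-7"}}

def route_call(msisdn: str) -> str:
    subscriber = HLR[msisdn]             # GMSC queries the HLR by MSISDN
    vmsc = subscriber["vmsc"]            # HLR returns the visited MSC (via an MSRN)
    la = VLR[vmsc][subscriber["imsi"]]   # the VMSC's VLR knows the location area
    return f"page IMSI {subscriber['imsi']} in all cells of {la} via {vmsc}"

print(route_call("+491701234567"))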
Important keywords in wireless networks are security and mobility, both managed by the MM network layer. Security is guaranteed by authentication, encryption, frequency hopping and temporal identities. For authentication and encryption, see the next section on wireless network terminals. With frequency hopping (see Figure 4), a communication uses continuously changing frequencies. The temporal identity of a mobile user is the temporally changing number TMSI, which is used instead of the fixed number IMSI. In this way, an illegal listener has immense difficulties tracking a call. Mobility is ensured by roaming and by location management. Roaming is the free movement of an MS from one cell to another without interrupting the communication: a seamless switch between two base stations (BTSs), sometimes even between two BSCs or between two MSCs (even of different PLMNs). This kind of mobility is usually not possible in cordless systems such as CT-1. Location management is carried out by a combination of two principles: localization and searching. The localization principle obligates each MSC to memorize in its VLR the LA of all MSs that travel in the region managed by the MSC; this ensures coarse location management. The searching principle obligates the MS to listen to paging messages and to make contact when called; this ensures precise location management.
Services
The second-generation networks are still telephone/voice oriented. The main service is voice communication. The voice is coded adaptively between 1.2 and 9.6 kbps (IS-95), at 13 kbps (GSM channels TCH/H and TCH/F) or at 7.95 kbps (D-AMPS). The networks provide data channels in the same class of bandwidths. These services are connection-oriented, i.e., once a connection is established, the bandwidth is allocated and has to be paid for until the end of the communication. In the GSM system, a first connection-less service is the short message service (SMS), by which a message of 160 characters can be sent from phone to phone.
HSCSD
HSCSD (High Speed Circuit Switched Data) is a GSM evolution that entered the market during the year 2000. For example, since the end of 1999 a German operator has offered the data rate of three GSM TCH/F channels, corresponding to 38.4 kbps. Usually two GSM channels are enhanced and bundled to 28.8 kbps for each up- and downlink. This evolution can technically be implemented in a GSM system by a software update: the base station (BTS) and the mobile station (MS) use more than one slot per TDMA frame. However, a new MS has to be purchased: it must be capable of adapting its frequency generator faster, in order to process succeeding slots on different frequencies (frequency hopping). Using less redundant channel coding, the German operator proposed for mid-2000 even 49.5 kbps using three GSM channels. A maximum user data rate of 57.6 kbps can be achieved by bundling four channels. The resulting bundled channel is still connection-oriented, like the GSM TCH/F channel.
GPRS
GPRS (General Packet Radio Service), an ETSI standard (ETSI, 1996), is the most powerful second-generation evolution, available since 2000. The GPRS system uses the same radio access method as GSM (see Figure 4), but all slots are freely managed, allowing connection-less communication. Between communicating phones or servers (point-to-point or point-to-multipoint), only a logical link is established. Data is transmitted by packet switching. Being based on packet switching, the GSM slots are used in GPRS only when users are actually sending or receiving data. Rather than dedicating a radio channel to a mobile user for a fixed period of time, the available radio resource can be shared concurrently between several users. A user pays only for those slots actually used. Uplink (from the terminal) and
downlink (to the terminal) transmissions can have different and dynamically changing data rates. The interface to the user is based on packet-switched protocols like IP (IETF, 1981) or, notably for communication applications, connection-oriented protocols like X.25. Under good transmission conditions (that is, with little error protection), up to 162 kbps can be transmitted. The GPRS evolution requires substantial additional hardware in the GSM system. The GSM system is based on a network of MSCs, while the GPRS system comprises its own network of so-called serving GPRS support nodes (SGSNs). There are two types of links between the GPRS and GSM networks. The first link is a connection of all SGSNs to the HLR of the GSM system; by this, users can access GSM and GPRS services. The second link connects each SGSN with several BSCs (just as the GSM's MSC is connected to several BSCs). For this, each BSC is equipped with a packet control unit (PCU) that controls the packet flow between the GPRS network and the GSM base stations (BTSs). Packet switching networks cannot guarantee a fixed data rate to every user. The amount of successfully transmitted data as a percentage of the traffic load, i.e., the amount of data sent, is called throughput. Theoretically 100%, in practice the throughput falls when the traffic load exceeds half of the theoretical channel capacity. It decreases further when the multimedia application uses packet sizes that do not match well those of the link layer protocols (Brasche, 1997). An advantage of GPRS is the retransmission of erroneous packets by the link layer protocol MAC (medium access protocol). This leads to lower error rates (but higher delay) compared to the GSM channel TCH.
EDGE
EDGE (Enhanced Data Rates for GSM Evolution), an even faster evolution of GSM, is designed to deliver user data rates up to 384 kbps, enabling the delivery of multimedia and other broadband services to mobile phone and computer users. The EDGE standard uses the GSM carrier bandwidth and the GSM TDMA access scheme. EDGE primarily improves the radio interface. An eight-phase shift keying (8-PSK) modulation scheme was selected that encodes three bits per modulated symbol instead of one bit as GMSK does in GSM and GPRS. The GMSK modulation used by GSM will also be part of the EDGE system concept (Dahlmann, 1998). The roadmap for EDGE standardisation proposes two phases. The first phase places emphasis on enhanced GPRS (EGPRS) and enhanced circuit-switched data (ECSD). The second phase will define improvements for multimedia and real-time services. Another objective is to align EDGE with third-generation UMTS networks. The EDGE radio protocol strategy reuses the protocols of GSM and GPRS; however, due to the higher bit rates, some protocols have been modified to optimise performance. The EDGE concept includes a packet-switched mode (EGPRS) and a circuit-switched mode (ECSD). The current GSM/GPRS standard supports data rates from 12 to 28.8 kbit/s per time slot. By contrast, EGPRS will allow data rates from 12 to 48 kbit/s per time slot, which in a multislot configuration yields a data rate of 384 kbit/s.
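The multislot arithmetic behind these figures is simple; the sketch below (our own check, using the per-slot rate quoted in the text and the GPRS gross rate from Table 4) reproduces the 384 kbit/s EGPRS figure:

slots = 8                            # one full TDMA frame bundled
egprs_per_slot = 48.0                # kbit/s per slot, least-protected EGPRS coding
gprs_gross_per_slot = 171.2 / slots  # Table 4 gross rate -> 21.4 kbit/s per slot
print(f"EGPRS: {slots} x {egprs_per_slot:.0f} = {slots * egprs_per_slot:.0f} kbit/s")  # 384
print(f"GPRS gross per slot: {gprs_gross_per_slot:.1f} kbit/s")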
IS-95b, HDR, IS-136+, IS-136HS
For the American ANSI standard IS-95, with its 16 kbit/s data channel, two evolutions are foreseen. The first evolution, IS-95b, is comparable to HSCSD: several IS-95 channels are bundled. A second, long-term evolution is high data rate (HDR), or IS-95c, which bundles even more channels but requires a hardware upgrade. IS-136+ and IS-136HS (High
Speed) are close to the concept of GPRS and EDGE, respectively.
Second-Generation Data Channels
Table 4 shows the characteristics of data channels of second-generation wireless cellular networks (Steele, 1999; Walke, 1999; Saunders, 1999).

Table 4: Data channels of second-generation wireless networks

Data channel   User data      Added coding      Gross data    Connection      Error
               rate [kbps]    redundancy [%]    rate [kbps]   mode            correction
GSM TCH/F      12             90                22.8          connection      FEC
GSM TCH/H      6              280               22.8          connection      FEC
GSM SMS        -              148               -             packet          FEC, ARQ
GSM HSCSD      57.6           58                91.2          connection      FEC
GSM GPRS       101, 119, 162  70, 44, 6         171.2         packet          FEC, ARQ
GSM EDGE       384            <23               -             packet / conn.  -
D-AMPS         7.95           64                13            connection      -
IS-95          -              -                 16            connection      -
IS-95b         -              -                 64            connection      -
IS-95c (HDR)   -              -                 160           packet          -
PDC (JDC)      6.7            67                11.2          connection      FEC

Third-Generation Wireless Cellular Networks
Future customers will want to combine multimedia with seamless mobility from home to everywhere, resulting in higher demand for bandwidth and creating a significant shift towards new types of data-based services. Telecommunication networks matching this goal are called third-generation (3G) wireless networks.

UMTS Standardisation
The Universal Mobile Telecommunications System (UMTS; ETSI, 1999b) is the new third-generation wireless network being developed within the framework defined by the International Telecommunications Union (ITU) and known as IMT-2000. IMT-2000 has been defined by the ITU as an open international standard for a high-capacity, high-data-rate mobile telecommunications system incorporating both terrestrial-radio and satellite components. UMTS is standardized by ETSI in the IMT-2000 framework. ETSI cooperates with other regional and national standardization bodies around the world to produce the detailed standards. The current place of UMTS standardization is the Third-Generation Partnership Project (3GPP). UMTS is a mobile communications system that offers significant user benefits (UMTS Forum, 1997) such as:
• high-quality, high-bandwidth, wireless, personalised multimedia services and
• high mobility by integration of fixed, terrestrial and satellite networks.
At launch, terrestrial UMTS will have the capability for data rates up to 2 Mbps, but it is designed as an open system, which can evolve later on to incorporate new technologies as they become available. This will allow UMTS to eventually increase its capability above
those standardized, much in the same way that GSM will evolve from the original capability of 12 kbps for data to GSM-GPRS (about 119 kbps) and then to GSM-EDGE technology (384 kbps, UMTS Forum, 1998).
UMTS Radio Interface
In 1992, the World Radio Conference identified the frequency bands 1885-2025 MHz and 2110-2200 MHz for future IMT-2000 systems. Of these, the bands 1980-2010 MHz and 2170-2200 MHz are intended for the satellite part. In 2000, the bands 806-960 MHz, 1710-1885 MHz and 2500-2690 MHz were added. Technical studies within ETSI selected a new radio interface for UMTS called UMTS Terrestrial Radio Access (UTRA). In the USA, existing networks partially cover the frequency bands chosen for Europe. Europe and Japan have decided to implement the terrestrial part of UMTS in so-called paired and unpaired bands:
Paired: Uplink (from the terminal) and downlink (to the terminal) transmissions use different, paired frequency bands. Europe and Japan have decided to use the pair of bands 1920-1980 MHz and 2110-2170 MHz. The chosen modulation technique is wideband CDMA (W-CDMA) (Dahlmann, 1998); the duplex method is FDD.
Unpaired: Uplink and downlink transmissions use the same, so-called unpaired band, either 1900-1920 MHz or 2010-2025 MHz. The signals are modulated by W-CDMA and the duplex method is TDD.
The radio interface is composed of the OSI layers 1, 2 and 3 (transmission, link and network layers). Layer 1 is based on W-CDMA technology, and the TS 25.200 series (3GPP, 1999) describes the Layer 1 specification. Layers 2 and 3 of the radio interface are described in the TS 25.300 and 25.400 series, respectively.
Geographical Coverage of UMTS
The UMTS radio access system UTRA will support operation with high spectral efficiency and service quality in all the physical environments in which wireless and mobile communication takes place. Today's user lives in a multi-dimensional world, moving between indoor, outdoor congested (urban) and outdoor rural environments, with mobility ranging from essentially stationary through pedestrian up to very high speeds. In practical implementations of UMTS, some users may be unable to access the highest data rates at all times. For example, the physical constraints of radio propagation and the economics of operating a network will mean that the system might only support lower data rates in remote or heavily congested areas. Therefore, services will be adaptive to different data rates and other Quality of Service (QoS) parameters. In the early stages of UMTS deployment, traffic will probably be generated predominantly in locations such as airports and railway stations, which operators will cover immediately following network launch. However, users will want full coverage so that they can access their services wherever they are. To offer this, UMTS technology is being defined to enable roaming with other networks, for example GSM systems, which will be able to offer global coverage. The satellite component of UMTS will provide global coverage to a certain range of user terminals. The same services will be supported on both terrestrial and satellite systems. UMTS is being standardized to ensure that roaming and handover between satellite and terrestrial networks will be efficient. These satellite systems are planned to be implemented
using the S-band Mobile Satellite Service (MSS) frequency allocations.
UMTS Data Channels
The area covered by UMTS is partitioned into home, pico, micro, macro and global cells, as depicted in Figure 6. The transmission rate capability of UTRA will provide at least 144 kbps for full mobility applications in all environments, 384 kbps for limited mobility applications in the macro and micro cellular environments, and 2.048 Mbps for low mobility applications, particularly in the micro and pico cellular environments. The 2.048 Mbps rate may also be available for short-range or packet applications in the macro cellular environment, depending on deployment strategies, radio network planning and spectrum availability. In later phases of UMTS development, there may be a convergence with even higher data rate systems using mobile wireless Local Area Network (LAN) technologies (microwave or infrared) providing high data rates (e.g., 54 Mbps as HIPERLAN 2) in indoor environments. UMTS is also being designed to offer data rate on demand, where the network reacts flexibly to the user's demands, his or her profile and the current status of the network.
Figure 6: UMTS coverage is seamless and universal (UMTS Forum, 1998). The figure shows nested cell layers: a global satellite cell, suburban macro cells, urban micro cells and in-building pico and home cells, with audio-visual terminals roaming across networks with seamless end-to-end service.

UMTS Services
Customers want easy-to-use terminals and good value for money. This means that UMTS should offer:
• Services that are easy to use and customizable, in order to address individual users' needs and preferences.
• Terminals which allow easy access to these services.
• Low, competitive service prices that ensure a mass market.
Speech: According to market studies, speech will remain the dominant service up to the year 2005 for existing fixed and mobile telephone networks, including GSM. Users will
demand low-cost, high-quality speech from UMTS; however, the opportunity for increased revenue over today's systems will come from offering advanced data and information services. Long-term industry forecasts for UMTS show a strongly growing number of multimedia subscribers by the year 2010.
Virtual Home Environment: UMTS services are common throughout all UMTS user and radio environments. A personal user will experience a consistent set of services, a Virtual Home Environment (VHE). Users will always "feel" as if they are on their home network, even when roaming. VHE will ensure the delivery of the service provider's total environment, including for example a corporate user's virtual work environment, independent of the user's location or mode of access (satellite, terrestrial, in-house). VHE will also enable terminals to negotiate functionality with the visited network, possibly even downloading software, so that it provides "home-like" service, with full security, transparently across a mix of access and core networks. The ultimate goal is that all networks, signaling, connection, registration and any other technology should be invisible to the user, so that mobile multimedia services are simple and user-friendly.
World Wide Web: The number of IP networks and applications is growing fast. Most obvious is the Internet, but private IP networks (intranets) show similar or even higher rates of growth and usage. UMTS supports both IP and non-IP traffic in a variety of modes, including packet, circuit switched and virtual circuit. IP version 6 will allow parameters such as quality of service (QoS), bit rate and bit error rate (BER), vital for mobile operation, to be set by the operator or service provider. This development is done in the framework of the Internet Engineering Task Force (IETF). The IPv6 Forum joined the 3GPP in January 2000 to closely align the development of UMTS and IP. The ability to transport multimedia content over various types of networks, such as broadcast, telecommunications and Internet, requires industry to develop cross-platform interoperability, because the properties of the networks may have an impact on the content. In many cases several different kinds of networks will be cascaded, such as Ethernet, ATM, X.25 and UMTS.
UMTS Service Providers
Figure 7 shows how the interaction between users, network operators and service providers in a UMTS system can work (UMTS Forum, 1998). In UMTS networks, individual users need new ways to manage and control vast amounts of information from many diverse sources in order to fulfill their own objectives and interests. A user needs personalized services that cannot be satisfied using existing concepts of service provision. Content providers will deliver additional information to the mass market and to niche markets. Content provision is likely to emerge from today's Internet but will be able to cover subscription as well as non-subscription services. Content provision services will also extend to cover corporate intranets, where information is managed and delivered to closed user groups on an organizational or interest basis. New roles of business entities will be developed:
• Service providers that are not necessarily associated with a network operator.
• Content providers who are able to optimize existing content for UMTS.
• Content providers for new, specifically mobile services and content.
A content provider's service or application is able to set up the preferred QoS from the network in terms of preferred bandwidth, maximum error rate and delay or latency. New charging and billing mechanisms will be developed in order to dynamically price the new
Figure 7: UMTS service delivery functions (UMTS Forum, 1998)
The figure shows the home subscriber/user connected through an access network operator and a core network operator to value-added service providers (e.g., ISPs, corporate networks), content providers and a service broker, linked by payment, billing, accounting, communications and service management functions.
services in relation to the QoS requested and that actually delivered. UMTS will operate in a very heterogeneous environment converging fixed, mobile, satellite, private and public systems. This requires the integration of players from several fields such as Internet, finance, broadcast, entertainment and the media.
Satellite Networks
Several years before the UMTS standard, industrial consortia established several proprietary worldwide satellite networks based on proprietary technology. These satellite networks can be divided into three types. The first type of proprietary satellite network is narrow-band satellite networks mainly offering worldwide telephony service. These networks are either already operational (Orbcomm even since 1995) or planned for the near future (at the latest Courier by 2003). They employ low earth orbit (LEO, 0…1500 km) or medium earth orbit (MEO, 1500…7000 km) satellites. These networks are:
• Orbcomm: Operational since 1995, 35 light LEO satellites
• Globalstar: Operational since 1999, 52 light LEO satellites
• ICO (Inmarsat): Operational 2001, 10 MEO satellites
• Courier: Operational 2003, 72 LEO satellites
• Ellipso: 10 MEO satellites
The proposed services include GSM-like services such as telephony, SMS and low data rate transmission (between 2.4 and 9.6 kbps). Additional services are navigation and location-based services (LBS). Aiming at clients such as oil companies and seafaring, the networks are designed to allow between 45,000 (ICO) and 150,000 (Globalstar) simultaneous communications worldwide. In regions with terrestrial coverage, dual- or three-mode mobile phones are able to change from the satellite network to existing GSM or D-AMPS terrestrial networks
without interruption of the communication (a service called "roaming"). One of these systems (IRIDIUM, operational from 1999-2000 with 66 intelligent LEO satellites) already collapsed in early 2000. Possible reasons are the expensive and error-prone satellites, which used inter-satellite communication, as well as the lack of services complementary to GSM, e.g., through the use of multi-mode terminals. The second type of proprietary satellite network is based on broadcast GEO (geostationary orbit) satellites (Grami, 1997). These satellites do not move with respect to the earth and their footprints are fixed. Originally designed as analog or digital TV and radio broadcast satellites, they have been operational for a long time (e.g., Telesat since 1973). Recently, the services were extended: under the brand name DirecPC, an IP Internet downlink of 400 kbps is offered, where the low-bandwidth uplink is realized using a terrestrial (e.g., telephone) connection. To raise the number of users in the future and to reuse frequency bands, the satellites will be equipped with antennas generating a steerable beam much smaller than the usual footprint of a traditional GEO satellite. The satellite networks of the second type and their services are:
• Telesat: Coverage North America, 95 TV channels, Internet 400 kbps
• Hughes: Coverage North America, 210 pay-TV channels, Internet 400 kbps
• Eutelsat: Coverage Europe/Africa, 550 TV channels, Internet 400 kbps
The third type of proprietary satellite network comprises high-bandwidth future systems based on different technologies:
• Astra: Operational 2000, 9 GEO satellites (Europe, Africa), 6.5…38 Mbps downlink, terrestrial uplink
• Skybridge: Operational 2002, 80 low-complexity LEO satellites, 20...100 Mbps downlink, 2...10 Mbps uplink
• Spaceway: Operational 2002, 2 intelligent GEO satellites (North America), 6 Mbps downlink, 256 kbps uplink
• Teledesic: Operational 2003, 288 LEO satellites, 64 Mbps downlink, 2 Mbps uplink
These systems are designed for one million (Teledesic) to 10 million (Skybridge) simultaneous users. For communication, a fixed, so-called very small aperture terminal (VSAT) is needed. These networks therefore do not allow mobile multimedia.
Satellite Data Channels
As UMTS will incorporate a satellite segment, it is worthwhile to have a look at satellite data channels here. Existing and future satellite and terrestrial channels use various frequency bands. The lowest bands, the L and S bands between 0.5 and 2 GHz, are used by:
• GSM: 890-915 MHz (uplink) and 935-960 MHz (downlink)
• GSM-DCS: 1710-1785 MHz (uplink) and 1805-1880 MHz (downlink)
• UMTS: 1885-2025 and 1980-2010 MHz (terrestrial & satellite)
• Satellite telephony networks (Globalstar, ICO, ...)
The next band is the Ku band between 10 and 12 GHz, which is used by:
• Satellite television broadcast networks (ASTRA-NET, Telesat, Hughes, ...)
• Future high-bandwidth satellite networks (Skybridge)
Finally, the Ka band between 20 and 30 GHz is used by:
• Future high-bandwidth satellite networks (Teledesic, Spaceway)
Specific characteristics of satellite channels have to be taken into account by mobile multimedia applications. Important parameters are transmission errors and delay. For a UMTS-like 64 kbps link without channel coding, Inmarsat (1999) assumes a bit
error rate (BER) of typically 10^-6, higher than in terrestrial wireless channels. However, transmission may be further corrupted by heavy rainfalls, which Telesat expects to last typically from two to 15 minutes in northern America. The closer the employed wavelength comes to the size of raindrops, the more the signal will be attenuated. Whereas UMTS, with a transmission at 2 GHz (15 cm), is only slightly affected, current and future proprietary satellite networks decided to transmit in the Ku band (2 cm) or the Ka band (1 cm), which allow for higher data rates. The higher the band and the more distant the orbit, the heavier the rain attenuation. So one may set up the following list:
• Very low rain attenuation: S band (UMTS)
• Low rain attenuation: Ku band, LEO orbit (e.g., Skybridge)
• More rain attenuation: Ka band, LEO orbit (e.g., Teledesic)
• Medium rain attenuation: Ku band, GEO orbit (e.g., ASTRA-NET, Telesat)
• Higher rain attenuation: Ka band, GEO orbit (e.g., Spaceway)
A second parameter is the transmission delay. It can be decomposed into propagation delay, satellite system delay and network delay. The one-way propagation delay from one user to another implies in general two satellite hops, i.e., four times the earth-satellite distance. This turns out to be:
• LEO: 20 ms (e.g., Globalstar, 1,414 km)
• MEO: 140 ms (e.g., ICO, 10,390 km)
• GEO: 480 ms (e.g., Telesat and Inmarsat, 35,800 km)
The overall one-way delay additionally includes a system delay and a network delay of approximately 100 ms each. These delays may be higher in LEO satellite networks, which have to track the fast-moving satellites. In general, the higher the orbit, the larger the propagation delay. For MEO and GEO satellite systems, the overall delay is thus higher than in terrestrial networks like GSM.
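These propagation delays follow directly from geometry: two satellite hops mean the signal travels roughly four times the earth-satellite distance at the speed of light. A small Python check (our own, using the orbit altitudes quoted above):

C_KM_PER_S = 3.0e5  # speed of light, approximate

def one_way_delay_ms(orbit_km: float) -> float:
    # two hops = four earth-satellite legs
    return 4 * orbit_km / C_KM_PER_S * 1000

for name, alt in [("LEO (Globalstar, 1,414 km)", 1414),
                  ("MEO (ICO, 10,390 km)", 10390),
                  ("GEO (35,800 km)", 35800)]:
    print(f"{name}: {one_way_delay_ms(alt):.0f} ms")
# -> roughly 19, 139 and 477 ms, matching the 20/140/480 ms figures above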
MOBILE NETWORK TERMINALS Having reviewed existing and future wireless networks in the previous section, we will now focus on the mobile terminals that will receive, process, display and retransmit multimedia content. Since we live in a quickly changing world of multiple wireless network standards, multimedia terminals will be multi-mode terminals. Some mobile phones today are already compatible with more than one of the standards GSM, D-AMPS (IS-54), IS-95 and PDC; future multimedia terminals may additionally cover UMTS, DECT and PHS. This capability is essential to smoothly introduce new standards such as GPRS and UMTS. In high-bandwidth satellite networks like Skybridge, Teledesic and Spaceway, the so-called VSATs (very small aperture terminals) are not mobile: they need satellite antennas with apertures from 40 to 60 cm. Here, we will consider mobile terminals that are connected to terrestrial base stations in networks like GSM, D-AMPS (IS-54) and IS-95, to LEO satellites in networks like Globalstar and ICO, or to future third-generation networks. The following subsection outlines the architecture of mobile terminals. Two further subsections then examine two architectural aspects relevant to multimedia applications in more detail: the smart card subsystem, which is responsible for security, and the application interface (API) of the terminal.
Terminal Architecture
A mobile terminal actually consists of two subsystems: the terminal itself, also called mobile station (MS), and the smart card subsystem, also called subscriber identity module (SIM). The software of each subsystem consists of drivers, operating system (OS), application interface (API) and applications (see Figure 8). For the terminal subsystem, the drivers communicate with the basic terminal resources such as input, output, smart card interface and network interface. The input and output drivers depend on the type of the terminal. Mobile terminals may be cellular phones, pagers, personal digital assistants (PDA), Web pads or infotainment appliances, with preference on communication, messaging, organizing capabilities, Internet access or multimedia, respectively (Mobile Lifestreams, 2000). The network interface makes use of one of the transmission channels offered by the network (in GSM, for example, SMS or GPRS). The network interface implements the whole protocol stack, including encryption, channel coding and modulation in the physical layer. The smart card subsystem is technically, physically and commercially independent from the terminal. Whereas operators manage their wireless networks, they do not bill the mobile users directly but rather so-called service providers. The mobile user signs a contract with such a service provider and receives a smart card from it. The smart card is a fully operational microprocessor system responsible for security and subscriber identity. It is described further in the following subsection. In traditional terminals such as mobile phones, the applications of a terminal were burned in and made direct use of the basic resources. The API layer as shown in Figure 8 did not exist, the reason being the fixed set of services, such as telephony. To profit from other services, like the short message service (SMS) in GSM or fax, one had to buy a phone with these services implemented. In future mobile terminals, the services and algorithms will be downloadable as reconfigurable plug-in applications. Therefore, an operating system (OS) and a standardized application interface (API) will enable the cooperation between the basic resources and the applications, as shown in Figure 8. APIs are discussed in a subsection further below. Market-leading operating systems are the Palm OS from Palm (used, e.g., by Motorola and Nokia), Windows CE from Microsoft (used, e.g., by Siemens and Casio) and EPOC from Symbian (used, e.g., by Nokia, Ericsson and Motorola). Even a mobile Linux (e.g., by LinuxWorld) and proprietary solutions (such as Brew by Qualcomm) have been proposed. Mobile operating systems are not the focus of this chapter.
Figure 8: Mobile terminal software architecture
[Figure 8 depicts a layered stack: Application / Application Interface (API) / Operating System (OS) / Drivers (input, output, smart card, network)]
Smart Card
A smart card is a card embedded with either a microprocessor plus a memory chip or only with a memory chip with burned-in logic. The mobile terminal smart cards are microprocessor cards and can perform algorithmic operations on the memory content, while a memory-chip card (for example, a pre-paid phone card) can undertake only some predefined operations. Smart cards are manufactured by the operators and distributed by the service providers. Apart from its interface to the terminal, the on-card software was for a long time non-standardized and differed from operator to operator. Since late 1999, smart cards have become more intelligent: ETSI decided to include JavaCard in the standardized GSM SIM toolkit. SIM toolkit is the ETSI standard for the GSM smart card application interface (API). The main concern of today’s smart cards is security. The security aspect has been identified by marketing research as a high-priority need of high-tech consumers and is thus important for the acceptance of future mobile multimedia services. Security is ensured by user identification and encryption. Taking GSM as an example, the user identity is assured by two stored codes. The first one is the internationally unique user number called IMSI. The second one is a PLMN-wide unique secret code, called Ki, which is known only to the smart card and to the network management. The code Ki is never transmitted over the network; it leaves neither the smart card nor the network management. Using both codes and a secret, operator-dependent algorithm, called A3, the smart card assists the terminal in proving the user identity to the network. This so-called authentication is operator-dependent. It also works in a visited (e.g., foreign) network, since it is always carried out between the terminal and its home network management. The encryption of data is carried out in GSM systems by the terminals and the base stations. A standardized algorithm, called A5, and a secret, temporary encryption code are used. The encryption code is generated by the smart card or by the network management using a secret, operator-dependent algorithm, called A8, and the secret code Ki. The next smart card generation will be the UMTS Subscriber Identity Module (USIM). The smart card industry will offer cards with higher memory capacity, faster CPU performance, contact-less operation and greater capability for encryption. Contact-less cards can be used for electronic commerce or electronic ticketing without removing them from a wallet or from the mobile phone. New memory technologies can be expected to make larger programs and more data storage feasible. Several applications and service providers could be accommodated on the same card. The user can then decide which applications/services he wants to have on the card, much as he does for his computer’s hard disk. It is expected that all fixed and mobile networks will adopt the same or compatible lower layer standards for their subscriber identity cards to enable USIM roaming on all networks and universal user access to all services.
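To make the interplay of Ki, A3 and A8 concrete, here is a minimal Python sketch of the challenge-response flow described above. The real A3/A8 algorithms are secret and operator-dependent; the HMAC construction below is only an illustrative stand-in, not the actual GSM algorithms:

# GSM-style challenge-response: the network issues a random challenge
# RAND; the SIM answers with SRES = A3(Ki, RAND). Ki itself never leaves
# the smart card or the home network's authentication centre.
import hmac, hashlib, os

Ki = os.urandom(16)  # 128-bit shared secret on the SIM and at the operator

def a3_sres(ki: bytes, rand: bytes) -> bytes:
    """Stand-in for A3: derive the 32-bit signed response SRES."""
    return hmac.new(ki, b"A3" + rand, hashlib.sha1).digest()[:4]

def a8_kc(ki: bytes, rand: bytes) -> bytes:
    """Stand-in for A8: derive the 64-bit temporary key Kc used by A5."""
    return hmac.new(ki, b"A8" + rand, hashlib.sha1).digest()[:8]

# Network side: issue a challenge and precompute the expected answer.
rand = os.urandom(16)
expected_sres = a3_sres(Ki, rand)

# Terminal side: the SIM computes SRES from the received challenge.
sres = a3_sres(Ki, rand)

print("authenticated:", hmac.compare_digest(sres, expected_sres))
print("session key Kc:", a8_kc(Ki, rand).hex())

Note how this design lets authentication work in a visited network: only RAND and SRES travel between networks, never Ki.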
Terminal API
Future multimedia applications will run on a mobile terminal using the functions offered by the application interface (API). The API allows the abstraction of both the terminal and the network, providing a generic way for applications to access them. The API will allow the same application to be used on a wide variety of terminals and will also provide a common method of interfacing applications to networks. The API supports security, billing, subscriber information, service management, call management, SIM management, user interaction and content translation. Already known from set top boxes (STBs) and client/server systems, the employment of an API will accelerate the response time of service providers and mobile terminal manufacturers to new developments. Many examples of commercially successful client/server solutions can be found today, for example in banking and travel agencies. Client/server systems allow intelligence to be downloaded transparently (from a server) into the mobile terminal (the client). These applications are stored in the terminals only temporarily. Tasks that must remain centralized, such as databases, are held on centrally administered servers and respond to queries from the clients rapidly and efficiently. Two APIs have reached wider acceptance: WAP and MExE. While WAP is an open industry standard with the first products on the market since the end of 1999, MExE is in the process of standardization by ETSI (Mobile Lifestreams, 1999).
WAP
The Wireless Application Protocol (WAP) is based on the client-server model and incorporates a relatively simple micro-browser client into the mobile terminal. A mobile user will request multimedia content using the built-in micro-browser. The WAP micro-browser is based on the Wireless Markup Language (WML), which is, like HTML, compliant with the Extensible Markup Language (XML) of the WWW Consortium (W3C). Compared to HTML, WML is designed to support scalable documents for the differently sized displays of the expected variety of mobile terminals. Furthermore, mobile terminal one-hand surfing with keypads, touch-screens and pens is supported. Finally, WML is transmitted in a binary, compressed format that significantly reduces the amount of data compared to HTML. The request of the browser is passed to a WAP gateway. This gateway retrieves the requested information either in HTML from a WWW server or directly in WML from a server supporting WAP applications. If the content being retrieved is in HTML format, a filter in the WAP gateway will translate it into WML, as far as possible. Such translators were announced in 2000 by several large companies. The requested information is then sent from the WAP gateway to the WAP client in the mobile terminal using whatever mobile network bearer service is available or most appropriate (in GSM, e.g., SMS or GPRS). WAP provides a full protocol stack (Mobile Lifestreams, 1999) built on top of the chosen bearer:
Figure 9: WAP-capable phone R380 (left) and PDA (right) from Ericsson
• WAE (wireless application environment): The API: WML, a script language and a telephone application
• WSP (wireless session protocol): Session management
• WTP (wireless transport protocol): Transport management
• WTLS (wireless transport layer security): Security management
• WDP (wireless datagram protocol): Adaptation to the bearer
WAP is already on the market. Many network operators offer WAP-based weather, traffic, sport, commercial news and headline services. They can be enjoyed on suitable devices/terminals as shown in Figures 9 and 10.
MExE
The ETSI standard “Mobile Station Application Execution Environment” (MExE) provides full application programming. These applications are connected via the network to other applications, e.g., servers. The applications are personalized and can be easily updated, because they are downloadable. They can perform intelligent services, since they are software defined. For this purpose, the standardization groups have decided to incorporate a downscaled Java version called J2ME (Java 2 Micro Edition). The proprietary programming language Java is designed to enable portable, dynamically downloadable and secure applications. The J2ME architecture defines a small core application interface (API). Extending this core API, J2ME is deployed in several configurations. Each configuration addresses a particular class of device and specifies a set of APIs and Java virtual machine features that multimedia applications can use. Application developers and content providers must design their code to stay within the bounds of the Java virtual machine features and of the APIs specified by that configuration. J2ME includes two 128 kb ROM standard configurations, which are subsets of the Java 2 Standard Edition core APIs. An MExE terminal will be able to inform the MExE server about its configuration. MExE will provide configurations that match and those that exceed WAP functionality.
Figure 10: WAP-capable Communicator 9110 (left) and a 3G terminal concept (right) from Nokia
Whereas WAP proposes a restricted number of operations like scripting, graphics, animation and text, MExE is fully programmable and aims at a wide range of man-machine interfaces such as voice recognition, icons and softkeys. Using the smart card functionalities, MExE includes a security framework to prevent unauthorized remote access to the user’s data. This is important since MExE provides, with Java, full programming capabilities, which are, in the absence of security mechanisms, a paradise for future mobile viruses.
MOBILE MULTIMEDIA APPLICATIONS Multimedia applications need to transmit data with different traffic characteristics under varying constraints. The traffic of audio at telephone quality is more or less constant and low (e.g., 13 kbps voice data rate in GSM), while the traffic of coded video, e.g., in the Common Intermediate Format (CIF, 352x288 pels) at 25 Hz frame rate, fluctuates strongly at data rates between 64 kbps and 1.5 Mbps (a short calculation at the end of this introductory discussion illustrates the raw data rate behind these figures). Even more bursty is file transfer or the transfer of 3D graphical data. The constraints of multimedia applications concern the transmission delay and the transmission errors that a specific application can tolerate. Such constraints are called quality of service (QoS). Errors in voice telephony may be tolerated, while a delay of more than some hundreds of milliseconds is already perceptible. During transmission of coded video, errors in sensitive data portions like coding mode information or motion compensation vectors lead to very disturbing artifacts. If video and audio are transmitted together, the synchronization of the transmission delay is important. Finally, file or graphical data transmission usually tolerates a greater delay. Furthermore, transmission errors can be corrected by re-transmission. Multimedia applications can be classified using the following criteria:
• Degree of interactivity: Some applications such as videoconferencing, teleworking and games need more interactivity (and a lower transmission delay) than others like remote database access, video on demand or email.
• Distribution type: Broadcasting applications like television and information services are distributed by multicast (one sender, many listeners), while other applications like database access, groupware, videoconferencing and games transmit point-to-point or point-to-group.
• Computational complexity: Compared to workstations, PCs and even set top boxes (STBs), mobile terminals have restricted resources, which need to be considered by mobile multimedia applications:
• Limited memory (e.g., 128 kb)
• Limited CPU power
• Limited (battery) power
• Limited bandwidth connection (e.g., 12 kbps user data rate in GSM)
• Error-prone transmission
• Input/output requirements: Different applications need different input/output devices. In today’s mobile terminals one can find:
• Input devices such as microphones, keypads, keyboards, touch-screens and pens
• Output devices such as loudspeakers, screens of 1-40 lines, acoustic signals
• No input devices
• No output devices
Future mobile multimedia services may be:
• Public information services (WWW, interactive shopping, on-line equivalents of printed media, on-line translations, location-based broadcasting services, intelligent search and filtering facilities)
• Educational services (virtual school, on-line science labs, on-line library, on-line language labs, training)
• Entertainment (audio on demand, games on demand, video clips, virtual sightseeing)
• Community services (emergency services, government procedures)
• Business information services (mobile office, VPN, narrowcast business TV, virtual work-groups)
• Communication services (video telephony, videoconferencing, unified messaging, personal location)
• Business and financial services (virtual banking, online billing, universal credit card)
• Location and location-based services (routing, transport telematics, 3D virtual city maps)
• Other services (telemedicine, security monitoring services, instant help line, expertise on tap, personal administration)
This section will first focus on the QoS aspect of future wireless cellular networks. Then, two exemplary applications will be discussed: video over wireless networks and remote database applications.
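As announced above, the following back-of-the-envelope check relates the CIF video figures to the underlying raw data rate. It assumes 4:2:0 chroma subsampling at 8 bits per sample, an assumption on our part, since the text does not specify the sampling format:

# Raw vs. coded bitrate for CIF video at 25 Hz; 4:2:0 sampling means
# 8 bits of luma plus 4 bits of subsampled chroma per pel on average.
width, height, fps = 352, 288, 25
bits_per_pel = 12

raw_bps = width * height * bits_per_pel * fps
print(f"raw CIF video: {raw_bps / 1e6:.1f} Mbps")      # ~30.4 Mbps

for coded_bps in (64_000, 1_500_000):                  # range quoted above
    print(f"compression ratio at {coded_bps / 1000:.0f} kbps: "
          f"{raw_bps / coded_bps:.0f}:1")

The resulting compression ratios of roughly 475:1 down to 20:1 explain why source coding, discussed under Application 1 below, is indispensable for mobile video.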
Quality of Service (QoS)
To run efficiently over networks, a multimedia application needs to negotiate so-called QoS parameters with the network. These are:
• End-to-end transmission delay (jitter, average)
• Bandwidth (average, peak, minimum, maximum burst)
• Call blocking rate (rate of service denial)
• Transmission errors (packet loss, error rate)
• Reliability and up-time
An end-to-end transmission may pass several access networks (e.g., GSM, UMTS, PSTN, cable, LMDS, satellite), several backbone networks (e.g., ATM, frame relay, PSTN) and also a private network (e.g., intranet, service provider). Thus, different protocol stacks have to work together to ensure the QoS. Triggered by the rapidly growing Internet, most of today’s networks–including future wireless cellular networks such as UMTS–offer the Internet Protocol (IP) as network layer protocol. For IP networks, the Internet Engineering Task Force (IETF) discusses two approaches to QoS: Integrated Services (IntServ) and Differentiated Services (DiffServ).
IntServ
The goal of IntServ is to guarantee an absolute end-to-end QoS. An IntServ communication starts with a terminal demanding a certain QoS. This demand is either denied or accepted by the IP network. IntServ is a so-called soft-state concept, since each router needs to be updated regularly about each IntServ data stream to assure the end-to-end QoS.
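A toy sketch of this soft-state, per-flow behavior follows (an illustration only; real IntServ routers use RSVP signaling, which is not modeled here, and the capacity, flow names and timeout are illustrative assumptions):

# Per-flow admission control with soft state: a reservation that is not
# refreshed within the timeout is silently reclaimed by the router.
import time

class IntServRouter:
    def __init__(self, capacity_bps: int, timeout_s: float = 30.0):
        self.capacity = capacity_bps
        self.timeout = timeout_s
        self.flows = {}  # flow_id -> (reserved_bps, last_refresh_time)

    def _expire_stale(self):
        now = time.monotonic()
        for fid, (_, seen) in list(self.flows.items()):
            if now - seen > self.timeout:
                del self.flows[fid]      # soft state: unrefreshed flows vanish

    def reserve(self, flow_id: str, bps: int) -> bool:
        """Accept the demand only if capacity remains; otherwise deny it."""
        self._expire_stale()
        in_use = sum(b for b, _ in self.flows.values())
        if in_use + bps > self.capacity:
            return False                 # demand denied by the network
        self.flows[flow_id] = (bps, time.monotonic())
        return True

    def refresh(self, flow_id: str):
        """Regular update keeping the per-flow state alive."""
        if flow_id in self.flows:
            bps, _ = self.flows[flow_id]
            self.flows[flow_id] = (bps, time.monotonic())

router = IntServRouter(capacity_bps=2_000_000)
print(router.reserve("video-1", 1_500_000))  # True: reservation fits
print(router.reserve("video-2", 1_000_000))  # False: would exceed capacity

The need to keep and refresh such state for every stream in every router is exactly what limits the scalability of IntServ.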
DiffServ
The concept of DiffServ is to assign a relative QoS, a so-called Class of Service (CoS), to packets. For this, a small additional header is sufficient for each packet. On arrival at a router, a packet is treated “better” the higher its class. “Better” means shorter queuing time and lower dropping rate in case of congestion.
DiffServ is a hard-state concept; the routers do not need to be updated. Furthermore, the treatment is fast, since it depends only on a small header. To accomplish an absolute end-to-end QoS, the application needs to monitor its data flow and, if necessary, adapt the CoS of its packets.
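From the application side, marking packets with a class can be as simple as setting the DSCP/ToS byte on a socket, as in the following sketch (the IP_TOS socket option is available, e.g., on Linux; the destination address is a documentation placeholder):

# Mark outgoing datagrams with a DiffServ code point so that routers can
# queue them by class. DSCP values: 46 = Expedited Forwarding (EF),
# 10 = Assured Forwarding class 1 (AF11); the DSCP occupies the upper
# six bits of the ToS byte, hence the shift by two.
import socket

EF = 46 << 2      # ToS byte 0xB8: low-delay class for, e.g., voice
AF11 = 10 << 2    # lower-priority assured class

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF)

# All datagrams sent on this socket now carry the EF code point; a
# DiffServ router gives them shorter queuing and a lower drop rate.
sock.sendto(b"voice frame", ("192.0.2.1", 5004))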
Application 1: Wireless Video
As a first mobile multimedia application, video transmission over wireless networks will be considered. Applications are video broadcast, video casting, interactive television (Stauder, 2001) and distributed video games. For video transmission (Hanzo, in preparation), the whole video chain has to be considered: acquisition–source coding–channel coding–modulation–transmission–demodulation–error correction/detection–source decoding–display. The network realizes modulation, transmission and demodulation by the transmission layer, the link layer and parts of the network layer. The video application realizes source and channel coding. Source coding compresses the video data by reduction of inherent redundancies and elimination of irrelevant parts of the signal. Channel coding adds redundancies to the compressed video data to detect or correct transmission errors at the receiver. Current video source coding standards such as ISO MPEG-4 (Pereira, 2000) and ITU H.263 (Girod, 1997) already contain approaches to join source and channel coding in an optimal manner (Le Léannec, 1999; Côté, 1999; Färber, 1998). Notably in the ITU standardization groups, wireless channels are considered (Nokia, 1999; Inmarsat, 1999). To adapt video applications to wireless networks, source coding and channel coding as well as the underlying protocol stack of the network have to be addressed:
• Enhancements of the network:
• Feedback channel: Can be used for two purposes. First, received but damaged data can be re-transmitted by sending a demand on a feedback channel. Second, the receiver can send error statistics to the transmitter to optimize the coding process (see coding mode selection).
• Enhancements of source coding:
• Coding mode selection: Generally, efficient video compression methods (modes) are based on prediction. These methods predict new data at the receiver using already transmitted data instead of transmitting the new data. In case of temporarily high transmission errors–made known to the transmitter via a feedback channel–coding modes with less prediction can be selected, since already transmitted data may be damaged and the prediction may be bad.
• Scalable source coding: The video is coded into several streams. A basic low-bitrate stream is always transmitted and allows the receiver to display a low-quality video. This stream is highly protected by channel coding. The other streams are less protected and enhance the video quality. The receiver uses these streams only if their errors can be corrected, or some of them are not transmitted at all if the feedback channel reports high error rates.
• Congestion and rate control: If the feedback channel reports high transmission loss, the source coder codes at lower quality and thus lower bitrate.
• Enhancements of synchronization:
• Synchronization markers: These are placed into the coded bit-stream. They allow the receiver to restart decoding at these points if transmission errors occurred before.
• Reversible run-length codes: Run-length codes are a coding technique to reduce redundancy. Usually, one single error in the transmitted bit-stream prevents the decoding of all following data. With a reversible run-length code, the data can be decoded from the beginning to the first occurring error and, in reverse direction, from the end to the last occurring error (a toy sketch after this list illustrates the idea).
• Enhancements of channel coding:
• Re-transmission: In wireless networks, burst errors often occur, i.e., several successive bits are damaged. Instead of correcting the data (which requires highly redundant channel codes and thus higher bit rates), damaged data is requested (via a feedback channel) to be re-transmitted.
• Data partitioning: Important parts of the data such as headers, motion vectors or coding mode switches are protected by a better performing channel code.
• Header repetition: Headers are transmitted twice, because any header loss prevents decoding of all data of the same packet.
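The following toy sketch illustrates the bidirectional decoding idea behind reversible codes. It is not the actual reversible variable-length code of MPEG-4; fixed-size, parity-checked records stand in for codewords here, so that a decoder can salvage the data on both sides of an error burst instead of discarding everything after the first error:

RECORD = 4  # 3 payload bytes + 1 parity byte per record

def parity(chunk):
    p = 0
    for b in chunk:
        p ^= b
    return p

def decode_bidirectional(stream):
    records = [stream[i:i + RECORD] for i in range(0, len(stream), RECORD)]
    ok = [len(r) == RECORD and parity(r[:3]) == r[3] for r in records]

    recovered = []
    for r, good in zip(records, ok):           # forward pass
        if not good:
            break
        recovered.append(r[:3])

    tail = []
    for r, good in zip(reversed(records), reversed(ok)):   # backward pass
        if not good:
            break
        tail.append(r[:3])
    # if no record was damaged, the forward pass already recovered everything
    tail = tail[:max(0, len(records) - len(recovered) - 1)]
    return recovered + list(reversed(tail))

blocks = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
data = b"".join(bytes(p) + bytes([parity(p)]) for p in blocks)
damaged = bytearray(data)
damaged[5] ^= 0xFF                  # burst error hits the second record
print(decode_bidirectional(bytes(damaged)))
# -> first, third and fourth records survive despite the mid-stream error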
Application 2: Wireless Database Access
As a second sample application, remote database access and groupware applications are addressed (Saunders, 1999). Such systems, already running on isolated handhelds or on networked PCs, have to be adapted to wireless networks. Several measures are possible:
• Caching: Remote database access terminals usually communicate frequently with the server to ensure system consistency. Caching, i.e., local storage, of some data can reduce these communications and thus prevent a slow-down of the application if the terminal is connected via a low-bandwidth wireless network (see the sketch following this list). On the other hand, caching is limited, since the local memory of mobile terminals is limited.
• Middleware: The introduction of a middleware layer can adapt an existing application to the higher delays or higher error rates of wireless networks.
• Data compression: In some of today’s applications, e.g., HTML-based Web browsers, the data (HTML pages) is transmitted uncompressed. Dealing with low-bandwidth wireless networks, data compression can enhance the performance of applications. For example, in the industry standard WAP, the headers of WML pages are compressed. On the other hand, data compression requires some computational power–a resource that is limited in mobile terminals.
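A minimal sketch of the caching measure follows (the remote query function and the cache size are illustrative assumptions, not part of any particular product):

# Answers already seen are served from a small local store, saving round
# trips over the low-bandwidth link; the store is bounded because terminal
# memory is scarce, so the least recently used entry is evicted first.
from collections import OrderedDict

class CachedDatabaseClient:
    def __init__(self, remote_query, max_entries: int = 64):
        self.remote_query = remote_query   # function doing the network call
        self.cache = OrderedDict()
        self.max_entries = max_entries

    def query(self, key: str):
        if key in self.cache:
            self.cache.move_to_end(key)    # mark as recently used
            return self.cache[key]         # no radio transmission needed
        value = self.remote_query(key)     # slow: goes over the wireless link
        self.cache[key] = value
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False) # evict least recently used entry
        return value

The trade-off named above is visible in the code: a larger max_entries saves more transmissions but consumes more of the terminal's limited memory.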
CONCLUSION AND FUTURE TRENDS The main advances of the future UMTS network over most of the existing second-generation networks are: a) the CDMA radio access with its higher channel capacity, replacing the FDMA and TDMA methods; b) the seamless integration of in-house, terrestrial and satellite node points into one single network; c) gross data rates of up to 2 Mbps in mobile UMTS mode compared to 171 kbps in GSM GPRS; and d) new types of applications based, e.g., on WAP, Java and IP connectivity.
Commercial Success
The acceptance of a future UMTS network by mobile communication users is still unclear. The mobile customers will be tempted to change the network–e.g., from GPRS to UMTS–by value-added services. This strategy has already been applied to the GSM-GPRS transition: by replacing the GSM SMS service with higher data rates, packet-switched transmission and packet-based billing, the service providers motivate their clients to invest in a GPRS terminal. Service providers and operators expect that once the clients appreciate these new services, they will also move on to UMTS, which provides services at multimedia-capable data rates.
Political Aspects
The success of UMTS depends not least on the industry. After a long patent fight between Ericsson (Europe) and Qualcomm (USA) over CDMA, the original three UMTS proposals from Japan, Europe and the USA (Steele, 1999) were harmonized into UMTS Release 99. But some questions are still open. How will the mobile phone API industry standard WAP integrate into the future ETSI API standard MExE? Which of the three mobile operating systems Palm OS, Windows CE and EPOC will survive? Even a mobile version of Linux and Java-on-the-chip solutions have been proposed recently.
Technical Aspects
Besides the political and commercial side of future high-bandwidth wireless global networks, some technical questions are also still open. During its deployment in 2000, first criticisms appeared concerning WAP’s limitation to mere WML page browsing and the insufficient range of WAP services. Another question is security: are the encryption and authentication methods of wireless networks effective? One may have doubts here, since research teams of the University of California, USA, and the Weizmann Institute, Israel, cracked the secret GSM authentication and key generation algorithms (A3, A8) in eight hours and the standardized GSM encryption algorithm (A5) in two minutes, respectively, on an ordinary PC. To summarize, let us once again ask the question: when will high-bandwidth wireless networks be operational? The answer is: now! GPRS already transmits up to 171 kbps gross data rate. In the near future, UMTS will provide between 144 kbps and 2 Mbps–everywhere and at any time.
REFERENCES
Boisseau, M., Demange, M. and Munier, J. M. (1994). High Speed Networks. John Wiley & Sons Ltd.
Brasche, G. and Walke, B. (1997). Concepts, services and protocols of the new GSM phase 2+ General Packet Radio Service. IEEE Communications Magazine, 35(8), 94-104, August.
Côté, G., Shirani, S. and Kossentini, F. (1999). Robust H.263 video communication over mobile channels. Proceedings of the International Conference on Image Processing ICIP’99, Kobe, Japan, October 25-28.
Dahlman, E., Gudmundson, B., Nilsson, M. and Sköld, J. (1998). UMTS/IMT-2000 based on wideband CDMA. IEEE Communications Magazine, 36(9), 70-80, September.
David, K. and Benkner, T. (1996). Digitale Mobilfunksysteme. B. G. Teubner, Stuttgart, Germany.
ETSI. (1996). Digital cellular telecommunication system (phase 2+): General Packet Radio Service (GPRS); Overall description of the GPRS radio interface (Um) (GSM 03.64). European Telecommunications Standards Institute, TC-SMG, GPRS Adhoc Group, Draft Techn. Spec. 0.1.0, Sophia Antipolis, France, December.
ETSI. (1999a). Global System for Mobile Communication GSM. European Telecommunications Standards Institute. Retrieved December 6, 1999 on the World Wide Web: http://www.etsi.org.
ETSI. (1999b). Universal Mobile Telecommunications System UMTS. European Telecommunications Standards Institute. Retrieved December 6, 1999 on the World Wide Web: http://www.etsi.org.
Färber, N., Villasenor, J. and Girod, B. (1998). Extensions of ITU-T recommendation H.324 for error-resilient video transmission. IEEE Communications Magazine, 36(6), 120-128, June. Retrieved January 25, 2000 on the World Wide Web: http://www-nt.e-technik.uni-erlangen.de.
Gilligan, R., Thomson, S., Bound, J. and Stevens, W. (1997). Basic socket interface extensions for IPv6. IETF Network Working Group, April. Retrieved January 21, 2000 on the World Wide Web: http://www.ipv6.org/pub/rfc/rfc2133.txt.
Girod, B., Färber, N. and Steinbach, E. (1997). Performance of the H.263 video compression standard. Journal of VLSI Signal Processing: Systems for Signal, Image and Video Technology, Special Issue on Recent Development in Video: Algorithms, Implementation and Applications, 17(2-3), 101-111, November. Retrieved January 25, 2000 on the World Wide Web: http://www-nt.e-technik.uni-erlangen.de.
Grami, A. and Gordon, K. (1997). Multimedia satellites: A high-level assessment. International Workshop on Satellite Communication in the Global Information Infrastructure, JPL Pasadena, USA, June. Retrieved December 15, 1999 on the World Wide Web: http://www.telesat.ca/news/speeches/97-06.html.
Händel, R., Huber, M. N. and Schröder, S. (1994). ATM Networks: Concepts, Protocols, Applications, 2nd Edition. Addison Wesley.
Hanzo, L., Cherriman, P. J. and Streit, J. (in preparation). Modern Video Compression and Communications over Wireless Channels: From Second to Third Generation Systems, WLAN and Beyond. Book in preparation for publication. Retrieved January 23, 2000 on the World Wide Web: http://rice.ecs.soton.ac.uk/books.
IETF Internet Engineering Task Force. (1981). Internet protocol. RFC 791, J. Postel, September.
Inmarsat. (1999). Satellite component of IMT-2000. International Telecommunication Union Study Group 16, February, Christodoulides, L.
ITU-T International Telecommunications Union. (1988-1999). I-Series Recommendations for ISDN. Retrieved January 25, 2000 on the World Wide Web: http://www.itu.int.
Le Léannec, F., Toutain, F. and Guillemot, C. (1999). Packet loss resilient MPEG-4 compliant video coding for the Internet. Real Time Video over the Internet [Special issue]. Signal Processing: Image Communication, 15(1-2), 35-56, September.
Mehrotra, A. (1994). Cellular Radio Performance Engineering. Artech House.
Mobile Lifestreams. (1999). An Introduction to WAP. Retrieved December 14, 1999 on the World Wide Web: http://www.mobilegprs.com.
Mobile Lifestreams. (2000). FutureFoneZone. Retrieved January 14, 2000 on the World Wide Web: http://www.mobilegprs.com.
Mouly, M. and Pautet, M. B. (1992). The GSM System for Mobile Communications. CELL&SYS, Palaiseau.
Nokia. (1999). W-CDMA error pattern. International Telecommunication Union Study Group 16, February.
Pereira, F. (Ed.). (2000). Tutorial issue on the MPEG-4 standard [Special issue]. Signal Processing: Image Communication, 15(4-5), January. Retrieved January 23, 2000 on the World Wide Web: http://www.cselt.it/mpeg.
Saunders, S., Heywood, P., Dornan, A., Bruno, L. and Allen, L. (1999). Wireless IP: Ready or not, here it comes. Data Communications, September, 42-68.
Sigmund, G. (1996). Technik der Netze, 3rd Edition. R. v. Decker, Heidelberg, Germany.
Stauder, J. (2001). Let the sunshine on your screen: Introducing augmented reality into interactive television. Submitted to ACM Multimedia, Ottawa, September-October.
Steele, R. and Hanzo, L. (1999). Mobile Radio Communications. Wiley, Chichester.
UMTS Forum. (1997). A Regulatory Framework for UMTS. Report No. 1, June. Retrieved February 6, 2000 on the World Wide Web: http://www.umts-forum.org/.
UMTS Forum. (1998). The Path Towards UMTS: Technologies for the Information Society. Report No. 2. Retrieved February 6, 2000 on the World Wide Web: http://www.umts-forum.org/.
UMTS Forum. (2000). The UMTS Third-Generation Market: Structuring the Service Revenues Opportunities. Report No. 9, October. Retrieved March 12, 2001 on the World Wide Web: http://www.umts-forum.org/.
Walke, B. H. (1999). Mobile Radio Networks. Wiley, Chichester.
3GPP Third Generation Partnership Project. (1999). Technical Specification Group Radio Access Network; Physical layer–General description (3G TS 25.201 version 3.0.0).
GLOSSARY
The following glossary comprises the terms and names used in this chapter:
3GPP: Third Generation Partnership Project, http://www.3gpp.org
ADPCM: Adaptive Differential Pulse Code Modulation
AM: Amplitude Modulation
AMPS: Advanced Mobile Phone Service, American wireless network standard
ANSI: American National Standards Institute
API: Application Interface, the interface between application and terminal
ARPA: Advanced Research Project Agency
ARQ: Automatic Repeat Request, a correction method by re-transmission of damaged data
ASTRA: Proprietary satellite broadcaster, http://www.astra.com
ASTRA-NET: The Internet service of ASTRA, http://www.astra-net.com
ATM: Asynchronous Transfer Mode
ATM Forum: Organization for ATM evolution, http://www.atmforum.com
A3: Secret algorithm of GSM phones to generate a code proving their identity
A5: Standardized, non-secret GSM algorithm to encrypt data
A8: Secret algorithm of GSM phones to generate a code for encryption
B-ISDN: Broadband ISDN
BRI: Basic Rate Interface
BTS: Base Transceiver Station, the base station of a wireless network
CEPT: European Conference of Postal and Telecommunications Administrations, http://www.cept.org
CDMA: Code Division Multiple Access, used in IS-95 and UMTS
CM: Communication Management, upper network layer protocol of GSM
CONP: Connection-oriented network protocol, ISO 8348, used in GSM GPRS
Courier: Proprietary satellite telephone network, http://www.satcon.de
CT: Cordless Telephone
D-AMPS: Digital AMPS, American wireless network standard (IS-54)
DECT: Digital European Cordless Telecommunications System
DirecPC: Brand name of a satellite IP link service, http://www.direcpc.com
EDGE: Enhanced Data Rates for GSM Evolution, a GSM evolution
EPOC: Proprietary OS for mobile terminals, http://www.symbian.com
ETSI: European Telecommunications Standards Institute
FDD: Frequency Division Duplex
FEC: Forward Error Correction, a method for error protection using, e.g., CRC
FDMA: Frequency Division Multiple Access: Each MS has its own frequency
FM: Frequency Modulation
FSK: Frequency Shift Keying: Modulation technique using the carrier frequency
Globalstar: Proprietary satellite telephone network, http://www.globalstar.com
GMSC: Gateway MSC, the entry to and exit from a GSM network
GMSK: Gaussian Minimum Shift Keying, a band-efficient modulation technique
GPRS: General Packet Radio Service (ETSI, 1996), a GSM evolution
GSM: Global System for Mobile Communication (Mouly, 1992)
GSM MoU Association: A standardization/discussion body, http://www.gsmworld.com
HDR: High Data Rate (IS-95c), an IS-95 evolution
HSCSD: High Speed Circuit Switched Data, a GSM evolution
ICO: Proprietary satellite telephone network, http://www.ico.com
IETF: The Internet Engineering Task Force, http://www.ietf.org
IMSI: International Mobile Subscriber Identity, a GSM-internal user number
IP: Internet Protocol, the Internet’s network layer protocol (IETF, 1981)
IPv6 Forum: Organization for the development of IP version 6, http://www.ipv6.org
ITU: International Telecommunication Union, http://www.itu.int
ISDN: Integrated Services Digital Network
IS-54: See D-AMPS
IS-95: ANSI wireless cellular network standard
Java: A proprietary programming language, http://www.java.sun.com
JDC: Japan Digital Cellular, former name of PDC
LAN: Local Area Network
LBS: Location-Based Services, locating goods and persons by mobile terminals
MAC: Medium Access Control, the upper link layer protocol of GPRS
MM: Mobility Management, medium network layer protocol of GSM
MoU: Memorandum of Understanding
MS: Mobile Station, i.e., a mobile phone, a mobile terminal or a PDA
MSISDN: Mobile Station ISDN Number, the telephone number of a mobile phone
MSC: Mobile Services Switching Center, the node of a GSM network
MSRN: MS Roaming Number, a GSM-internal temporary user telephone number
NTT: Japanese wireless network operator, http://www.ntt.com
OS: Operating System, the basic software level of a computer
Palm OS: Proprietary OS for mobile terminals, http://www.palm.com
PAM: Pulse Amplitude Modulation
PCN: Personal Communication Network: A network allowing full mobility
PCS: Personal Communication Service: A service to mobile users
PCU: Packet Control Unit, GSM GPRS unit to control the BSC’s packet flow
PDA: Personal Digital Assistant: A low-cost notebook
PDC: Public Digital Cellular, Japanese wireless network standard, see also JDC
PHS: Personal Handy-phone System: Japanese cordless network
PLMN: Public Land Mobile Network, i.e., an operated terrestrial wireless network
PSK: Phase Shift Keying, a technique modulating the carrier phase
PSTN: Public Switched Telephone Network: Terrestrial wired network
QPSK: Quadrature Phase Shift Keying
Roaming: Being subscribed to one network, but temporarily and seamlessly using another
RR: Radio Resource Management, lower network layer protocol of GSM
SGSN: Serving GPRS Support Node, a node of the GSM GPRS network
SIM: Subscriber Identity Module, the smart card of a mobile phone
SIM toolkit: ETSI standard for the smart card API
Skybridge: Proprietary satellite network, http://www.skybridgesatellite.com
SMS: Short Message Service (a GSM packet data transmission service)
Spaceway: Proprietary satellite network, http://www.hns.com/spaceway/spaceway.htm
STB: Set Top Box, the receiver in digital television broadcast networks
TCH/F: Full Traffic Channel, the GSM data channel at 12 kbps user rate
TCH/H: Half Traffic Channel, the GSM data channel at 6 kbps user rate
TCP: Transmission Control Protocol, the Internet’s transport layer protocol
TCP/IP: Transmission Control Protocol/Internet Protocol
Teledesic: Proprietary satellite network, http://www.teledesic.com
Telesat: Private satellite broadcast provider, http://www.telesat.ca
TDD: Time Division Duplex
TDMA: Time Division Multiple Access: Temporal multiplex of several MS
TMSI: Temporary Mobile Station Identity, temporary version of the IMSI
UMTS: Universal Mobile Telecommunications System (Steele, 1999)
VPN: Virtual Private Network (a private network installed on a public network)
WAP: Wireless Application Protocol, http://www.mobilewap.com
Windows CE: Proprietary OS for mobiles, http://www.microsoft.com/windowsce
W3C: WWW Consortium, an industry consortium, http://www.w3.org
XML: Extensible Markup Language, a W3C specification underlying, e.g., HTML and WML
About the Authors
Syed Mahbubur Rahman is currently a professor at the Minnesota State University, Mankato, USA. He worked in several other institutions around the world including NDSU in the USA (1999), Monash University in Australia (1993-98), Bangladesh University of Engineering and Technology (BUET, 1982-92) and Ganz Electric Works in Budapest (1980-82). He was the head of the Department of Computer Science and Engineering of BUET from 1986 to 1992. He is co-chairing and is involved as a program/organizing committee member in a number of international conferences. He obtained his doctoral degree from Budapest Technical University in 1980. He supervised more than 30 research projects leading to Master’s and PhD degrees. His research interests include electronic commerce systems, multimedia computing and communications, image processing and retrieval, computational intelligence, pattern recognition, distributed processing and security. He has published 100+ research papers in his areas of interest.
***
Ray-I Chang earned his PhD degree in Electrical Engineering and Computer Science from National Chiao Tung University in 1996. His current research interests include real-time and distributed multimedia systems. Dr. Chang is a member of IEEE. Meng-Chang Chen received BS and MS degrees in Computer Science from National Chiao Tung University, Taiwan, in 1979 and 1981, respectively, and the PhD degree in Computer Science from the University of California, Los Angeles, in 1989. He joined AT&T Bell Labs in 1989, and then was an Associate Professor at the Department of Information Management, National Sun Yat-Sen University, Taiwan, from 1992 to 1993. Since then he has been with the Institute of Information Science, Academia Sinica, Taiwan. He has held an Associate Research Fellowship since July 1996. His current research interests include multimedia systems and networking, QoS networking, operating systems, data and knowledge engineering and knowledge discovery. Dr. Chen is a member of ACM. Fabien Costantini is a PhD student in Computer Science at CNAM-CEDRIC Research Laboratory, Paris, France. He worked from February 1999 to April 2000 with a European company manufacturing aeronautical systems, where he participated in the conception and development of a large collaborative virtual prototyping tool using the Distributed Building Site Metaphor. Currently, he focuses on a more general distribution framework that relies on DBSM principles as well as new architectural Design Patterns for implementing highly reusable and portable object-oriented distribution software solutions. Piergiorgio Cremonese is Senior/Principal Wireless Architect at Netikos. He has extensive research, publication, consulting and industrial experience in the design and analysis of advanced computer network and Internet applications. He participated in several European ACTS projects and was editor of the NIG-G12 Chains work of the European Community. He was responsible for design and development of applications and services for the main Italian telecom operators.
Tadeusz Czachorski was born in 1949. He received his MS degree in 1972, and PhD and habilitation degrees in Computer Science in 1979 and 1989, respectively, all from Silesian Technical University, Gliwice, Poland. He also received an MS degree in Physics in 1975 from Silesian University, Katowice, Poland. Since 1972 he has been a research worker at the Institute of Theoretical and Applied Informatics, Polish Academy of Sciences; since 1992 he has also been a professor at Silesian Technical University. He was a visiting professor at Paris-Sud, Paris-Nord and Versailles Universities (France). Since 1989 he has been the scientific secretary of the Information Science Committee, Polish Academy of Sciences. His main research interests include: queuing theory, modelling and performance evaluation of computer and telecommunication networks. Fazli Erbas was born in Germany in 1972. He received his diploma degree (Dipl-Ing) in Electrical Engineering from the University of Siegen, Germany, in 1999. He is currently working at the Institute for Communications (Institut für Allgemeine Nachrichtentechnik) at the University of Hannover, Germany, as a Research Assistant working towards the PhD degree. His research interests include communication networks and protocols, especially mobile wireless communication systems, mobility management, location management, as well as location techniques and location-based services. Robert Gay is currently a professor in the School of EEE at NTU and also Director and CEO of the ASP Centre. He obtained his B Eng, M Eng and PhD degrees at the University of Sheffield, England, in 1965, 1967 and 1970 respectively. Since obtaining his PhD, he has been involved in Education and R&D working in institutions such as Singapore University (1972-1979), Rutherford and Appleton Laboratory (England, 1979-1982), NTU (1982-1995 and 1999-present) and Gintic Institute of Manufacturing Technology (1989-1999). He has also been actively involved in promoting innovation in Singapore through work in various committees: Science Quiz (MOE), Science Centre Board, National CAD/CAM (NCB), Tan Kah Kee Young Inventors Award (TKK Foundation & NSTB), National IT Plan (NCB), Technopreneurship Incubation Center (ITE), Commercenet Singapore and Singapore Computer Society. He has more than a hundred publications in journals, conference proceedings and books. Silvia Giordano received her PhD at the beginning of 1999 from the Institute of Communications and Applications (ICA) at EPFL, Lausanne. She is currently working as senior/first assistant at the ICA Institute. Since 1999 she has been an Editor of IEEE Communications Magazine of the IEEE Communications Society. She was guest editor of the Feature Topic on Challenges in MANETs that appeared in Communications Magazine in June 2001, as well as for the Special Issue on Mobile Ad-Hoc Networks that will appear in Cluster Computing. Her current research interests include traffic control and mobile ad-hoc WANs. Antonio F. Gómez-Skarmeta was born in Santiago, Chile, in 1965. He received an MS degree in Computer Science from the University of Granada and BS (Hons.) and PhD degrees in Computer Science from the University of Murcia, Spain. During 1986-1992 he was an Assistant Professor in the Department of Computer Science at the University of Murcia, and since 1993 he has been an Associate Professor at the same department and university.
Dr. Gómez-Skarmeta has worked on different research projects, mainly at the national level, in the distributed artificial intelligence field (project M2D2) as well as in tele-learning, computer support for collaborative work, and new telematics services in broadband networks (SABA and Pupitre). He is also coordinator of a Socrates CDA (European Master on Soft Computing) and of a Leonardo project for Distance and Open Learning. He is a collaborator with different regional institutions for the development of the Information Society in the regional context and has also participated in the training of teachers in the NETD@YS activities in the Murcia Region, especially in the context of the CIEZ@Net project.
He has published more than 15 international journal papers and more than 50 conference papers. He is also a reviewer of different ESPRIT projects. Janusz Gozdecki studied at the AGH University of Technology in Cracow (Poland). In 1995 he received an MA degree in Telecommunication and joined the Department of Telecommunications at the AGH University. Since then he has been working on access networks, high-speed networks and multimedia systems. Jan-Ming Ho received his PhD degree in Electrical Engineering and Computer Science from Northwestern University in 1989. He received his BS in Electrical Engineering from National Cheng Kung University in 1978 and his MS at the Institute of Electronics of National Chiao Tung University in 1980. He joined the Institute of Information Science, Academia Sinica, Taiwan, R.O.C. as an Associate Research Fellow in 1989, and was promoted to Research Fellow in 1994. He visited IBM T. J. Watson Research Center in 1987 and 1988, Leonardo Fibonacci Institute for the Foundations of Computer Science, Italy, in 1992, and the Dagstuhl-Seminar on “Combinatorial Methods for Integrated Circuit Design,” IBFI-Geschäftsstelle, Schloss Dagstuhl, Fachbereich Informatik, Bau 36, Universität des Saarlandes, Germany, in 1993. His research interests include real-time operating systems with applications to continuous media systems, e.g., video-on-demand and videoconferencing, computational geometry, combinatorial optimization, VLSI design algorithms, and implementation and testing of VLSI algorithms on real designs. Gábor Hosszú received an MSc degree from the Technical University of Budapest in Electrical Engineering and the Academic degree of Technical Sciences (PhD) in 1992. After graduation he received a three-year grant from the Hungarian Academy of Sciences; he then joined the Microelectronics Co. in 1988. In 1990 he joined the Department of Electron Devices, where he has been a full-time Associate Professor. His recent interests are mainly in the area of media communication based on the Internet, multicast technology, CAD systems in a network environment and the VHDL language. He has published technical papers in different journals including the IEEE Transactions on Computer-Aided Design of Integrated Circuits (October 1993). He has just finished a book in the field of media communication (in Hungarian). Stanislaw Jedrus was born in 1975. He received his MS degree in 1998 in Computer Science from the Silesian Technical University in Gliwice, Poland. Since 1998 he has been working at the Institute of Theoretical and Applied Computer Science of the Polish Academy of Sciences. He received a PhD degree (also in computer science) from this Institute in 1999. His research interests include performance evaluation of computer and telecommunication networks, architecture of multimedia systems, multifractal theory and quantum computing. Vana Kalogeraki is a Research Scientist at Hewlett-Packard Laboratories in Palo Alto, California. Her research interests include distributed systems, real-time systems, resource management, fault tolerance and computer networks. She received a PhD in Electrical and Computer Engineering from the University of California, Santa Barbara in 2000. She received BS and MS degrees in Computer Science from the University of Crete, Greece, in 1994 and 1996, respectively.
Iraklis Kamilatos received his degree in Combined Engineering from Coventry University, Coventry, UK, and his MS degree in Telematics from the University of Surrey, Guildford, UK. He has been working as a Research Associate in the Information Processing Lab since January 1996, where he is pursuing his PhD degree under the supervision of Professor Michael Strintzis.
He has attended seminars in the areas of data communication, image processing and data compression and has gained practical experience working for university laboratories and private companies in Greece and Europe. He has been a member of the Technical Chamber of Greece (TCG) as an Electronic Engineer since 1996. His research interests include data communication, parallel systems and information processing. Bhumip Khasnabish was born in Kishoreganj, Bangladesh. He is a senior principal member of the technical staff at Verizon (originally GTE) Labs., Inc., Waltham, Massachusetts, where he works on various PSTN evolution, multimedia, wireless, and enterprise networking projects. Previously he worked on design, development, integration and testing of Nortel Networks’ Magellan Passport (a frame/cell switch) trunking and traffic management software modules. He earned a Ph.D. degree (1992) in Electrical Engineering from UW, Canada, and received the Roy Saheb Memorial Gold Medal (1975) and Commonwealth Scholarship (1984-1988). Bhumip has authored or co-authored more than 100 patents, books, book chapters, and articles published in various trade magazines and IEEE and other international journals, magazines and conference proceedings. He guest edited special issues of IEEE Network and IEEE Communications magazine and co-edited the 1998 Artech House (Boston, MA, USA) published book, Multimedia Communications Networks: Technologies and Services. He is a founding information director of ACM’s SIGMobile, and was the general chair of IEEE ComSoc’s EnterNet sponsored conference on Enterprise Networking and Computing (ENCOM-98). Bhumip is a senior member of IEEE and a member of the board of editors of the Journal of Network and Systems Management, Plenum Press. He is also an adjunct faculty member of Brandeis University in Waltham, MA, USA. Bhumip may be contacted at [email protected] or [email protected]. Ming-Tat Ko received a BS and an MS in Mathematics from National Taiwan University in 1979 and 1982, respectively. He received a PhD in computer science from National Tsing Hua University in 1988. He then joined the Institute of Information Science, Academia Sinica, Taiwan; currently, he is a research fellow. Dr. Ko’s major research interests include the design and analysis of algorithms, computational geometry, graph algorithms, real-time systems and computer graphics. Yamuna Krishnamurthy received an MS degree in Computer Science from Washington University, St. Louis, where her research focused on “Enabling QoS in the CORBA Audio/Video Streaming Service.” Presently, she is a research consultant at OOMWorks LLC where she is developing a scalable CORBA audio/video streaming service that enables application-level QoS adaptation. She can be reached at [email protected]. Jean-Yves Le Boudec graduated from École Normale Supérieure de Saint-Cloud, Paris, where he obtained the Agrégation in Mathematics in 1980. He received his doctorate in 1984 from the University of Rennes, France, and became an Assistant Professor at INSA/IRISA, Rennes. In 1987 and 1988 he was with Bell Northern Research, Ottawa, Canada, as a member of scientific staff in the Network and Product Traffic Design Department. In 1988, he joined the IBM Zurich Research Laboratory at Rüschlikon, Switzerland, where he was Manager of the Customer Premises Network Department. In 1994 he formed the Laboratoire de Reseaux de Communication at EPFL.
His interests are in the architecture and performance of communication systems. Wonjun Lee has been with the Department of Computer Science and Engineering in EIST at Ewha Womans University since 2001. He also held the position of Assistant Professor of Computer Science Telecommunications at the University of Missouri-Kansas City, USA, in 2000.
He received BS and MS degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1989 and 1991, respectively. He also received an MS in Computer Science from the University of Maryland, College Park, USA, in 1996 and a PhD in Computer Science and Engineering from the University of Minnesota, Minneapolis, USA, in 1999. His research interests and expertise include networked multimedia computing, distributed systems, real-time systems, databases and Internet technology. He is a member of IEEE and ACM. Angel L. Mateo-Martínez was born in Murcia (Spain) in 1975. He is a Computer Science Engineer graduated from the University of Murcia, where he did his Master’s Thesis related to IP Multicast and Multimedia Communications. During 1998-1999 he worked at RedIRIS (the Spanish National Educational and Research Network), being responsible for the IRIS-MBONE service, the service related to IP Multicast and collaborative multimedia tools. By the middle of 1999, he started to work at Fundación Integra, where he was the Manager of the telematics services at the Murcia Region Network Interconnection Center. Since February 2000, Professor Mateo-Martínez has been the manager of the videoconferencing service at the University of Murcia and he is also a professor in the Department of Computer Science at the University of Murcia. He is also working on his PhD related to IP Multicast and videoconferencing systems. Mark T. Maybury is Executive Director of MITRE’s Information Technology Division. Dr. Maybury has more than 50 publications to his name and is editor of Intelligent Multimedia Interfaces (AAAI/MIT Press, 1993), Intelligent Multimedia Information Retrieval (AAAI/MIT Press, 1997), co-editor of Readings on Intelligent User Interfaces (Morgan Kaufmann Press, 1998), Advances in Text Summarization (MIT Press, 1999) and Advances in Knowledge Management (MIT Press, 2001). He is also co-author of Information Storage and Retrieval (Kluwer Academic, 2000), a Board member of the Object Management Group and serves on the ACM IUI Steering Council. Dr. Maybury received his MPhil and PhD degrees from Cambridge University, England, in 1991. Peter Michael Melliar-Smith is a professor in the Department of Electrical and Computer Engineering at the University of California, Santa Barbara. His research interests include fault tolerance, resource management, distributed systems, and communication networks and protocols. He received a PhD in Computer Science from the University of Cambridge, England. Louise E. Moser is a professor in the Department of Electrical and Computer Engineering at the University of California, Santa Barbara. She is an associate editor for IEEE Transactions on Computers and has served as an area editor for Computer magazine in the area of networks. Her research interests span the areas of distributed systems, computer networks, and software engineering. She received a PhD in Mathematics from the University of Wisconsin, Madison. Piotr Pacyna received his MSc degree in Computer Sciences in 1995 from the AGH University of Technology in Cracow. Since then he has been working in the Department of Telecommunications of the AGH University, where he delivers lectures on broadband network communications and operating systems. He spent sabbatical leaves at Loracom in Nancy and at CNET France Telecom in Paris, where he worked on traffic modeling for video flows. He has also worked in ACTS projects. Mr.
Mr. Pacyna has co-authored three books, several research papers and expert reports for industry. His interests focus on broadband communications, video coding and transmission, and next-generation IP.

Zdzislaw Papir is a professor in the Department of Telecommunications at the AGH University of Technology in Cracow (Poland). He is currently lecturing on signal theory, modulation and
detection theory, and the modeling of telecommunication networks. During 1991-93 he made several visits to universities in Belgium, Germany and Italy, working on traffic modelling. During 1994-95 he served as a Design Department Manager for Polish Cable Television. Since 1995 he has also served as a consultant on broadband access networks for Polish telecom operators. His current research interests include the performance analysis of digital modulations used in broadband access networks and the integration of IP and ATM networking for the provisioning of multimedia services.

Binh Pham is currently Professor and Director of Research in the Faculty of Information Technology at the Queensland University of Technology, Brisbane, Australia. Prior to this, she held the IBM Foundation Chair in Information Technology at the University of Ballarat from 1995-99, and was an Associate Professor in the School of Computing & Information Technology at Griffith University from 1993-1995. Her research interests include computer graphics, multimedia, CAD, image analysis and intelligent systems.

Marta Podesta graduated from the University of Pisa in 1998. She worked for one year at Finsiel, where she was involved in the European project Diana, developing a multimedia application supporting QoS via RSVP. She is currently a researcher in the Whitehead laboratory (Livorno).

Marco Roccetti earned an Italian Dott Ing degree in Electronics Engineering from the University of Bologna in the academic year 1987/88. He is currently a Professor of Computer Science in the Department of Computer Science of the University of Bologna. From 1990 through 1998 he was with the Department of Computer Science as a Research Associate, and from November 1998 to October 2000 he was an Associate Professor in the same department. His research interests include the design, implementation and evaluation of multimedia computing and communication systems, and the performance analysis and simulation of distributed and parallel computing systems. Professor Roccetti is a member of the Society for Computer Simulation International.

Pedro M. Ruiz graduated from the University of Murcia with a degree in Computer Science. Mr. Ruiz has participated in several European and Spanish research projects related to videoconferencing, active networks, IPv6 and broadband networking. He works at RedIRIS (the Spanish national research network), where he coordinates the IP Multicast service and other advanced network projects. He also works as an Associate Professor in the Telematics Department of the University Carlos III of Madrid, where he has taught undergraduate lectures as well as graduate, postgraduate and master's courses.

Nikolaos Sarris was born in Athens, Greece, in 1971. He received a Master of Engineering degree in Computer Systems Engineering from the University of Manchester Institute of Science and Technology (UMIST), Manchester, United Kingdom. He is currently a PhD candidate in the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki (AUTH), Greece, where he is also employed as a Graduate Research Assistant. His research interests include image processing, computer vision, model-based three-dimensional image sequence analysis, synthesis and coding, and video coding standards. He is a member of the Technical Chamber of Greece.

Abdul Sattar is currently Professor (Chair in Information Technology) and Director of the Knowledge Representation and Reasoning Unit at Griffith University.
He holds BSc and MSc (Rajasthan), MPhil (JNU), MMath (Waterloo) and PhD (Alberta) degrees. His research interests include computational models of hypothetical reasoning, constraint satisfaction, temporal reasoning, non-monotonic reasoning and intelligent agents. He is a member of AAAI,
ACM and ISAI, and an executive member of the Australian Computer Society's National Committee on Artificial Intelligence and Expert Systems. He is the founding Secretary-Treasurer of the Pacific Rim International Conference on Artificial Intelligence (PRICAI) series.

Douglas C. Schmidt is an Associate Professor in the Electrical and Computer Engineering Department at the University of California, Irvine. He is currently serving as a Program Manager at the DARPA Information Technology Office (ITO), where he leads the national effort on distributed object computing middleware research. His research focuses on patterns, optimization principles and empirical analyses of object-oriented techniques that facilitate the development of high-performance, real-time distributed object computing middleware on parallel processing platforms running over high-speed networks and embedded system interconnects. He can be reached at [email protected].

Timothy K. Shih is Professor and Chairman of the Department of Computer Science and Information Engineering at Tamkang University, Taiwan, R.O.C. His research interests include multimedia computing and networking, distance learning, software engineering, and formal specification and verification. He joined the faculty of the Computer Engineering Department at Tamkang University in 1986. In 1993 and 1994 he was a part-time faculty member of the Computer Engineering Department at Santa Clara University, and he was a visiting professor at the University of Aizu, Japan, in the summer of 1999. Dr. Shih received his BS and MS degrees in Computer Engineering from Tamkang University and California State University, Chico, in 1983 and 1985, respectively. He received his PhD in Computer Engineering from Santa Clara University in 1993. Dr. Shih has published more than 200 papers and participated in many international academic activities, including the organization of DMS'98, SEMA'99, IMMCN'2000, ICPADS'2000, ICCLC'2000, MNS'2000, SEMA'2000, CAIIC'2000, MNS'2001, DMS'2001, Human.Society@Internet'2001, IMMCN'2002 and MNSA'2002. Dr. Shih has received many research awards, including Tamkang University research awards, NSC research awards (National Science Council of Taiwan) and the IIAS research award of Germany. He has also received funded research grants from the NSC, from the Institute of Information Industry, Taiwan, and from the University of Aizu, Japan. Dr. Shih has frequently been invited to give tutorials, panels and talks at international conferences and overseas research organizations, including DMS'99, ICCLC'2000, COMPSAC'2000, CAIIC'2000, MSE'2000, Santa Clara University (USA), Hiroshima City University (Japan), Iwate Prefecture University (Japan), the University of Aizu (Japan), City University of Hong Kong (Hong Kong) and Hosei University (Japan). Dr. Shih was invited by ACM Multimedia 2001 to demonstrate the virtual university software system. The contact address of Dr. Shih is: Department of Computer Science and Information Engineering, Tamkang University, Tamsui, Taipei Hsien, Taiwan 251, R.O.C. E-mail: [email protected], Fax: +886 2 26209749, Phone: +886 2 26215656 x2743, x2616.

Chee Kheong Siew is currently Director of the Information Communication Institute of Singapore (ICIS), School of EEE, Nanyang Technological University (NTU). He obtained his B Eng in Electrical Engineering from the University of Singapore in 1979 and his MSc in Communication Engineering from Imperial College in 1987.
After a six-and-a-half-year stint in industry, he joined NTU as a Lecturer in 1986 and was appointed Associate Professor in 1999. He was seconded to the National Computer Board (NCB) as Deputy Director, ICIS, in August 1995 and managed the transfer of ICIS from NCB to NTU in 1996. In January 1997 he was appointed Director of the Institute. His research interests include e-commerce, traffic shaping, neural networks and network performance. He is a member of IEEE and a senior member of SCS, Singapore.
Jaideep Srivastava received his PhD from the University of California, Berkeley, in 1988, and has since been on the faculty of the University of Minnesota, where he is a Professor. For more than 15 years he has been active as a researcher, educator, consultant and invited speaker in the areas of databases, artificial intelligence and multimedia. He has established and led a database and multimedia research laboratory, which has graduated 17 PhD and 36 MS students and published more than 125 papers in journals and conferences. Throughout his career Dr. Srivastava has collaborated actively with industry, both for joint research and for technology transfer. During a two-year sabbatical between 1999 and 2001, Dr. Srivastava spent time at Amazon.com as Chief Data Mining Architect, at Yodlee Inc. as Director of Data Analytics and at Chingari Inc. as Chief Technology Officer. Dr. Srivastava is an often-invited participant in technical as well as technology strategy forums. He has given more than 100 talks in various industry, academic and government forums, and has served on the program committees of a number of conferences and on the editorial boards of various journals. He has also served in an advisory role to the governments of India and Chile on various software technologies. Dr. Srivastava is a member of the ACM and a senior member of the IEEE.

Jürgen Stauder received the Dipl-Ing degree in Electrical Engineering from the University of Darmstadt, Germany, in 1990. From 1990 to 1998 he was a Research Assistant at the Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung at the University of Hannover, Germany, from which he received the PhD degree in 1999. He then spent 18 months at the IRISA Laboratory of INRIA in Rennes, France, as a Visiting Researcher and Lecturer. Since 1999 he has been with Thomson Multimedia, Corporate Research, France. His research interests are photometric image processing and computer graphics, with applications to augmented reality, video coding and multimedia networking.

Michael G. Strintzis received an undergraduate degree in Electrical Engineering from the National Technical University of Athens, Greece, in 1967, and MA and PhD degrees in Electrical Engineering from Princeton University, New Jersey, in 1969 and 1970, respectively. He then joined the Electrical Engineering Department of the University of Pittsburgh, Pennsylvania, where he served as Assistant (1970-1976) and Associate (1976-1980) Professor. Since 1980 he has been a Professor of Electrical and Computer Engineering at the University of Thessaloniki, Greece, and since 1999, Director of the Informatics and Telematics Research Institute, Thessaloniki, Greece. His current research interests include 2-D and 3-D image coding, image processing, biomedical signal and image processing, and DVD and Internet data authentication and copy protection. Since 1999 Dr. Strintzis has served as an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, and in 1984 he was awarded one of the Centennial Medals of the IEEE.

Chengzheng Sun is a professor (Chair in Internet Computing) in the School of Computing and Information Technology at Griffith University, Australia. He received PhDs in Computer Science and Engineering from the University of Amsterdam, the Netherlands, and from the National University of Defense Technology, China, respectively.
His major areas of expertise and research interests include Internet and Web computing technologies and applications, real-time groupware systems and CSCW, distributed operating systems and computer networks, mobile computing systems, and the parallel implementation of object-oriented and logic programming languages.

Naga Surendran works at Sylantro Systems, a telecommunications company developing application switches for Voice over IP, where he is involved in telephony call routing and call state management. He finished his Master's degree at Washington
University, St. Louis, where his thesis focused on "Pluggable Protocols for CORBA Audio/Video Streaming." He can be reached at [email protected].

Christian Toinard has been an associate professor in Computer Science at ENSEIRB, Bordeaux, France, since 1993. His research activities are carried out at CNAM-CEDRIC. From 1984 to 1989 he worked as an engineer in the field of real-time systems. From 1990 to 1993 he prepared a PhD thesis on multicast systems at the University of Paris 6. He works closely with industry through research contracts, and his research activity has shifted from the distributed control of manufacturing systems to the cooperative prototyping of aeronautical systems. He has published several articles on cooperative virtual prototyping and object-oriented distributed systems.

Zhonghua Yang is currently an associate professor at ICIS, School of EEE, NTU. His previous professional experience includes serving as a Senior Research Fellow at the Gintic Institute, Singapore; a Senior Research Scientist with the Distributed Systems Technology Center (DSTC) in Australia; an NSERC Visiting Professor at the University of Alberta, Canada; and a distributed systems researcher at Imperial College, England. While in China, he held various senior positions at a research institute of the Chinese aerospace industry and was a member of China's ISO Technical Committee. His major areas of expertise and interest are Internet computing, distributed systems and distributed object middleware. He holds a PhD from Griffith University, Australia.

Maciej Zakrzewicz is an assistant professor at the Institute of Computing Science at the Poznan University of Technology. He is currently delivering lectures on database systems, business system design and Internet standards at universities in Poland and Germany. He received his PhD in Computer Science from the Poznan University of Technology in 1998. His research interests include database systems, data mining and knowledge discovery in databases, and Internet technologies. Since 1999 he has been President of the Polish Oracle Users Group.
Index

A
A/V Stream Throughput 82
A/V Streaming Service 55
ACR/NEMA 135
admission control 36, 237, 252
allocation techniques 62
analysis 334
annotation 143
Application Data Units (ADUs) 20
application level framing 20
application QoS 6
application-level framing 80
applications 269
architecture of Vic 80
ATM 186, 444
ATM channel 186
Audio/Video (A/V) Streaming Service specification 55
authentication 450
authentication mechanism 30, 58
available bit rate applications 374
B
B-ISDN 444
bandwidth 19, 372
best effort 18
bit error rate (BER) 458
blind delay 358
broadcast 442
broadcast news 126, 127
C
C++ 62
callback interface 69
canonical name 26
causal order 354
CDMA 445
center-based trees 381
channel equalization 449
clock-driven protocol 351
coding mode selection 466
coding scheme 104
collaborative 333
collaborative services 1
collaborative systems 291
Common Object Request Broker Architecture 1, 2, 38
Common Object Request Broker Architecture (CORBA) 2
compressed video images 10
compression rates 103
computer graphics 334
computer supported cooperative work 134
computer-supported collaboration 134
connection-oriented transport 43
constant bit rate applications 374
consumer architecture 79
continuous media 237, 252
contributing source count 25
contributing source identifier 23
controlled load 18
coordinator 8
CORBA 38, 54
CORBA A/V Streaming Service Components 60
CORBA Internet Inter-ORB Protocol (IIOP) 55
CORBA Media Stream Framework 19
CORBA Media Streaming Framework 39
CORBA/ATM Testbed 82
cordless systems 445
CPU Usage of the MPEG decoder 82
cryptographic 144
CSC 134
CSCW 134, 135, 147
D
data transfer protocol 23, 55
database architecture 151
datagram-oriented transport 43
delay 371
delay jitter 269
delivery rule 361
DES 144
DICOM 135
DICOM standard 146
digital television 103
digital video 190
distance 151
distance learning 1, 151
distributed 1, 134
distributed multimedia systems 333
distributed object computing 2
distributed object technology 38
distributed systems 2, 136
distributed virtual environments 292
DVMRP 416
dynamic visual data 333
E
EDGE 452
elastic applications 18
encryption 448
end systems 23
end-to-end delivery 3
end-to-end quality of service 1
endpoint connection 7
EuroISDN 147
European ISDN 147
event lattice 353
extensible interface 73
extension interface 59
F
Facial Definition Parameters (FDPs) 104
Facial Feature Extraction 110
Fair 18
FDD 447
fixed-filter 36
flooding 380
flow descriptor 35
flow devices 40
flow endpoint 57
flow specification 58, 67
flows 57
flows and flow endpoints 40
G
g-precedence 357
General Inter-ORB Protocol 3
global positioning systems (GPS) 363
GPRS 451
granularity 357
granularity condition 360
graphical user interface 120
GSM 441
guaranteed service 18
H
H.323 26
handover 448
happen before relation 354
HDR 441
high bandwidth 370
HSCSD 441
I
IDEA 144
IDL 57
IGMP 413
image processing 134, 334
implementing Vic using TAO's A/V Streaming Service 80
information extraction 126
initialization phase 107
instruction 151
instruction on demand 151
integrated layer processing 20
inter-media synchronization 352
inter-stream synchronization 352
intermediate systems 23
Internet 17, 269
Internet access technologies 19
Internet Control Message Protocol (ICMP) 25
Internet control protocol 25
Internet Engineering Task Force 18
Internet Engineering Task Force (IETF) 18
Internet Integrated Services Model 18
Internet Multicast Backbone 398
Internet TCP/IP protocol architecture 20
intra-stream synchronization 4, 352
IP Multicast 412
IP-multicast 19
ISDN 147, 149, 443
isochronous synchronization 351
J
Java event model 48
Java Media Framework 17, 19
Java media packages 49
jitter 352, 372
JMF API 48
JMF interfaces 49
JMF RTP architecture 49
L
Lamport 351
language processing 126, 127
lattice structure 351
loss-free data transport 374
low latency 383
M
managing real-time 1
MBGP 422
MBone 19, 370, 398, 412
MD5 144
MDBMS 151
media capture 48
media controller 63, 66, 77
media frameworks 19
media presentation 48
media processing 49
media streaming 17
media-on-demand services 32
medical CSCW 141, 149
medical teleconferencing 135
medical telecooperation 135
message passing mechanism 139
message-passing systems 355
middleware 19, 54
mixer 23
mobile operating systems 460
Mobile Station Application Execution Environment 463
model-based 102
Moving Picture Experts Group 103
MPEG player application 75
MPEG video 193
MPEG-2 186
MPEG-2 Video Communications 186
MPEG-2 video streams for transmission 186
MPEG-4 102, 255, 260
Multi-Channel Multipoint Distribution Service 19
multi-participant multimedia conferences 22
multicast 369
Multicast Access Control 428
Multicast Address Allocation Model 432
Multicast Address Set Claim 434
multicast-capable protocol 370
multimedia 102, 126, 151, 186, 269
multimedia applications 1
multimedia conferences 22
multimedia data 370
multimedia database management system 151
multimedia device 7, 40
Multimedia Device Factory (MMDevice) 61
multimedia frameworks 17
multimedia presentation 151
multimedia streaming 1, 57, 388, 390
multimedia transmission 222
multimedia transport 370
multiplexing 26
Multipoint-to-Multipoint Binding 98
N
network dimensioning 186
networking 186
neural network 111
NTP 351
NTP timestamp 363
O
object 151
object bus 39
Object Management Group (OMG) 55
object request broker 2, 39, 55
object reuse 151
object-oriented multimedia middleware 38
OMA architecture 38
OMG IDL 45
ordering property 359
P
packet audio 269, 270
parallelism in depth 136
partial order 354
partially synchronous systems 355
payload format specification 22
personalcasts 126
PIM-DM 416
PIM-SM 416
playout delay 269, 270
pluggable A/V protocol 68
pluggable A/V protocol framework 55
point-to-multipoint binding 96
point-to-point binding 91
policy control 36
precision 357
predictive service 18
presentation description file 33
profile 22
profile specification 22
programs 134
property service 61
protocols 20
Q
Quality of Service 2, 35, 54, 172, 237, 255, 454
R
radio access methods 445
real time protocol 120
real-time 269
real-time media transport 17
Real-Time Stream-Control Protocol 32
real-time streaming 370
real-time streaming protocol (RTSP) 21
real-time telecooperation 136
real-time traffic 370
real-time transmission 120
real-time transport protocol 21
receiver report 26
reliability 372
remote method invocation 2
request broker 39
reservation styles 36
resource allocation 237
resource management 1
Resource ReSerVation Protocol (RSVP) 21, 35
resource scheduling 238
Reverse Path Broadcasting (RPB) 380
Reverse Path Multicasting (RPM) 381
reversible run-length-codes 466
roaming 451
roundtrip delay 358
RPC 141
RSVP 174, 255, 256
RTCP 363
RTP 352
RTP client and server 50
RTP relay 23
RTP session 50
RTP timestamp 363
RTP transmissions 22
RVBR 255, 256
S
satellite 441
scalable 466
scalable data dissemination 374
security 450
security subsystem 143
segmentation 110
SEISMED 144
sender report 26
sending rule 361
service model 18
service providers 442
service quality 373
session 21
session advertisement 26
session announcement protocol (SAP) 21, 28
session description protocol (SDP) 21
session directory 26
session initiation protocol (SIP) 21, 30
session invitation 26
session manager 50
shared datagram network 370
shared explicit 36
short message service (SMS) 451
simple flow protocol (SFP) 43, 59
simplex protocol 35
skew 361
smart card 459
Soft-QoS Framework 237, 245
source callback 81
source description 26
spanning trees 380
speech transcription 126
stability rule 361
stable 361
Steiner Trees (ST) 381
stream controller 8, 64
stream endpoint 7, 39, 57, 58, 64
stream endpoint 61
stream establishment latency 85
streams 39, 40, 57
supplier 8
supplier architecture 75
synchronization markers 466
synchronization source 25
synchronized clocks 352
synthetic/natural hybrid coding 104
systems 271
T
TAO 55
TAO's CORBA A/V Streaming Service 57
TCP/IP protocol 20
TDD 447
TDMA 445
teleconferencing 1
teleconsultation 135, 149
telecooperation 134
telecooperation architecture 134
terminals 441
three dimensional modeling 108
thresholding 112
throughput 452
tie-breaking rule 362
time-based media 49
timestamp 352
timing constraints 1
topic detection and tracking 126, 127
tracking phase 107
translator 23
transmission control protocol 371
transmission delay 459
transport protocols 385
Truncated Reverse Path Broadcasting (TRPB) 381
U
UMTS 441
UMTS Terrestrial Radio Access 454
user datagram protocol 371
user studies 126
using ATM 186
UTRA 454
V
variable bit rate applications with low latency 374
VBR Traffic Shaping 222
Vic 80
video 126
video buffering process 79
video coding 102
video retrieval 172, 174
video streaming 172, 175
video transmission 6, 103
video conferencing application 70, 8, 122, 413
videoconferencing system 102
videophone 103
videophone applications 103
virtual device 63
Virtual Home Environment (VHE) 456
virtual multimedia device 7
virtual prototyping 290
VRML 109
W
wildcard filter 36
Wireless Application Protocol (WAP) 462
wireless cellular networks 441
wireless database 467
wireless video 466