37. Multimedia

The articles in this section are:
Authoring Systems
Distributed Multimedia Systems
Document Interchange Standards
Hypermedia
Multimedia Audio
Multimedia Information Systems
Multimedia Video
Wiley Encyclopedia of Electrical and Electronics Engineering

Authoring Systems
Standard Article
Benjamin Falchuk and Ahmed Karmouch, University of Ottawa, Ottawa, Ontario, Canada
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W4802
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Multimedia Documents and Terminology; Multimedia Document Standards; Visual Techniques for Document Authoring; Mediadoc; Current and Future Issues in Authoring Systems; Conclusions; Acknowledgments.
AUTHORING SYSTEMS

Advances in technology allow for the capture and manipulation of multiple heterogeneous media, and so the demand for these media in digital multimedia documents is a natural progression. Multimedia documents containing heterogeneous media require complex storage, editing, and authoring tools. While there are numerous document standards, few have addressed all of the requirements of documents containing multiple, time-based media. Furthermore, representing document structures and playback scenarios visually is challenging, and only a small number of effective tools have emerged in both the commercial and research realms. Authoring systems for these documents must aid in both the spatial and temporal layout of media and may also have to support the modeling of user interactions with time-based media. This article discusses multimedia document standards, visual techniques for document authoring, and research issues for the future of authoring systems, including the impact of networked mobile agents.

MULTIMEDIA DOCUMENTS AND TERMINOLOGY

Structuring and laying out information for human consumption can be a difficult task. Publishers of newspapers and magazines face this challenge regularly as they lay out text and graphics onto printed pages. Publishers of multimedia content for human consumption via computer screens face even greater challenges because of both time and interactivity. Unlike a magazine and more like a movie, a multimedia presentation may have timing constraints. Furthermore, unlike both magazines and movies, multimedia presentations can be interactive, and so their perceived presentation may change each time they are experienced.

Before the technological advances of digital computers, a document usually meant a book. Authoring such a document required writing a linear story line that described events through time; the book provides the exact same story each time it is read. Creating a multimedia presentation is considerably more complex. Unlike a book, which we may call monomedia and linear, a multimedia presentation may use media that must occur simultaneously or in some related way, and all these relations must be specified by the author. Thus, from the authoring point of view, there is a need for methods and models of representing temporal relations in multimedia documents. One of the main issues in temporal models of documents is the model's flexibility to express different temporal relationships.

Throughout this article, we are mainly concerned with the temporal behavior of documents; other document attributes, including layout, quality, and playback speed, are not the primary focus of our investigation. Of main interest is the representation and modeling of multimedia scenarios that "play themselves back" as well as let the user interact with the running presentation, thereby driving it in a custom direction. Our focus is on those tools allowing authors to create scenarios that offer these nonhalting, transparent options to viewers.

To begin to understand the problem, consider Table 1. Various media types are categorized as having a temporal nature or not. If a media element has intrinsic attributes that relate directly to time-related qualities, it is considered a temporal media element. For example, since a video has a frame rate and an audio clip has a sample rate, they are both considered temporal. Text and images have no such attributes. However, as the third row of the table suggests, in a multimedia context any media type can be assigned temporal attributes in the form of relationships. An image, for example, can be programmed to be rendered only between the times 10 s and 15 s; thus, the image has a duration of 5 s. Furthermore, the temporal attributes of a media element may be related to other media as well as to time. For example, a text (e.g., a title) may be programmed to appear on the screen exactly 10 s after the appearance of a logo graphic.
Table 1. Various Media Types and Their Temporal Classification

Media classification                           Text   Image   Graphic   Video   Audio   Animation
Static media                                   Yes    Yes     Yes       No      No      No
Temporal media                                 No     No      No        Yes     Yes     Yes
Temporal in a multimedia document context      Yes    Yes     Yes       Yes     Yes     Yes

Finally, the temporal
attributes of a media element may be related to events that occur in a totally asynchronous fashion. For example, a logo may be programmed to appear anytime that the user moves the mouse over the top of a small icon. The act of moving the mouse over the icon is called an event, and it triggers the rendering of the logo graphic. This type of event is asynchronous because not only does the author of the multimedia presentation not know when it will happen, but the author does not even know if it will happen at all. Table 2 summarizes the types of relationships that can occur between media in a multimedia document (see also row 3 in Table 1).

Table 2. Different Types of Temporal Relationships That May Exist Between Media in a Multimedia Document

Synchronous: The media occurs at a specific time (relative to the starting reference time of 0 s) for a specific duration. For example, image I2 occurs at time 20 s for 10 s.
Relative: The media occurs relative to a temporal attribute of another. For example, image I2 occurs 5 s after the rendering of graphic G3.
Asynchronous: The media occurs relative to an event whose occurrence is not even guaranteed or, if it is, the absolute time of the event is not known in advance. For example, image I3 occurs if and when the mouse is moved over an icon I4.

Finally, we introduce the concept of the multimedia document and a sample of how one might look graphically. Figure 1 illustrates a multimedia document. When this document about Africa starts, the title, a video, and textual subtitles in synch with the video all start to play. At a particular point in the presentation, say between 10 s and 20 s, the video mentions African wildlife. At any time during this 10-s "window," the user is able to make a mouse selection on the "?" icon and switch to a short video that relates to African wildlife. This interaction is asynchronous since its time is unknown in advance. If the choice is not made, the original multimedia presentation continues "on its track."

Figure 1. A sample active multimedia document. A title box, video, and accompanying subtitles are presented. The appearance of the question mark denotes a "temporal window" in which the user may or may not make a selection to change the course of the presentation.

In this article, we make use of terminology, some of which originates elsewhere (1,2). Some required definitions are as follows:

• Events: Points at which the display of media objects (text, video, etc.) can be synchronized with other media objects. We focus on start and end events, but we can generalize to internal events.
• Synchronous events: Those with predictable times of occurrence (e.g., their temporal placement is known in advance).
• Asynchronous events: Those with unpredictable times of occurrence and duration (e.g., their time of occurrence cannot be known in advance).
• Temporal equality: A synchronization constraint requiring that two events either occur simultaneously or that one precedes the other by a fixed amount of time.
• Temporal inequality: A synchronization constraint requiring, for example, that for two events, A and B, they occur such that A precedes B by an unspecified duration, by at least some fixed time, or by at least some fixed time and at most another fixed time.
• Hypermedia: Implies store-and-forward techniques where user actions such as mouse selections on hot spots cause the system to retrieve a new "page" of data, which could be an image, text, video, etc. There are usually no temporal relationships between media.
• Passive multimedia: Implies a fully synchronized document that "plays itself back" through time, synchronizing all media objects together.
• Active multimedia: Implies that there are hypermedia-type choices presented to users during the playback of a multimedia document that allow the user's interaction to "drive" the playback.
• Scripting language: A language such as SGML (Standard Generalized Markup Language) or one of its instances, such as HTML (Hypertext Markup Language), that allows the addition of semantic and logical information to data using a markup language.
• Scenario: A term used for describing a completely specified (i.e., authored) multimedia presentation.

It should now be clear that both multimedia documents and the process of authoring them are complex. With inter- and intramedia relationships and asynchronous events, the onus is clearly on both the logical structure of the document and the authoring tool to aid the author in creating such a complex renderable entity. The remainder of this article covers the following topics: (1) existing and de facto standards related to multimedia documents, (2) visual techniques for document authoring, (3) recent research achievements, and (4) the impact of the mobile agent paradigm on authoring systems.
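To make the terminology concrete, the following minimal sketch (an illustration only, not taken from any particular authoring system; all class and field names are hypothetical) shows one way the three relationship types of Table 2 and the notion of a temporal window could be represented in an authoring tool's internal model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MediaObject:
    name: str          # e.g., "logo_graphic"
    media_type: str    # "text", "image", "video", ...

@dataclass
class SynchronousRelation:
    """Media occurs at an absolute time for a fixed duration (e.g., I2 at 20 s for 10 s)."""
    media: MediaObject
    start_s: float
    duration_s: float

@dataclass
class RelativeRelation:
    """Media starts a fixed offset after an event of another media object."""
    media: MediaObject
    anchor: MediaObject
    anchor_event: str   # "start" or "end"
    offset_s: float     # e.g., a title appears 10 s after the logo's appearance

@dataclass
class AsynchronousRelation:
    """Media is triggered by a user event whose time (or occurrence) is unknown in advance."""
    media: MediaObject
    trigger: str                             # e.g., "mouse_over_icon"
    window_start_s: Optional[float] = None   # temporal window in which the trigger is honored
    window_end_s: Optional[float] = None

# Example: the Africa document of Figure 1; the "?" icon is active between 10 s and 20 s.
wildlife_clip = MediaObject("wildlife_video", "video")
jump = AsynchronousRelation(wildlife_clip, trigger="mouse_select_question_icon",
                            window_start_s=10.0, window_end_s=20.0)
print(jump)
```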
MULTIMEDIA DOCUMENT STANDARDS

This section provides a brief overview of several key standards for document architecture, multimedia data format, and markup language standards for presentational applications. Standardization of these issues directly affects multimedia authoring systems. For example, authoring systems may be compliant with one particular standard (e.g., produce documents that conform to ODA [Open Document Architecture] or HTML) or may be compliant with none. Furthermore, since standards affect logical document structure, they may also affect the graphical user interfaces used to author those documents and are therefore relevant to this article.

Both SGML (3) and ODA (4) were designed to facilitate the representation and exchange of documents. They were tailored for subtly different settings: ODA for the office and SGML for a publishing environment. SGML represents a document as a grouping of logical elements. The elements are composed hierarchically, and together they form the logical structure of the document. SGML uses tags to mark up the text and create elements. Markup is distinguished from regular text by enclosing it in angle brackets (i.e., <Chapter>), and the end of a logical element is marked up with a slash as well as the brackets (i.e., </Chapter>). Furthermore, logical elements may have attributes (title, chapter, etc.). A document created with SGML refers to some document class stored as a Document Type Definition (DTD). The DTD determines the structure of the document, and users may define their own DTDs. Furthermore, DTDs are developed in an object-oriented fashion, so prototyping is facilitated by code reuse and DTD class specializations.

ODA is a more robust type of architecture. ODA provides tools for specifying how the document should be laid out on the page, whereas SGML does not. ODA documents consist of a profile and content. The profile contains attributes of the document (e.g., title, chapter), and the content of the document is the combination of text and graphics specified by the author, as well as two important structures, the logical and layout structures. At the top of the hierarchic logical structure is the logical root, and at the bottom are the basic logical objects, which are atomic. The layout structure is also
hierarchic, consisting of page sets, frames, and blocks that are laid out on the printed page as desired by the author. Layout structures are object oriented, so more than one may be defined for a given document. Specific layout structures are instances of generic layout structures, which are the templates for documents. An author can specify layout styles, so, for example, it is possible to specify that each figure should start on a new page. Neither SGML nor ODA can handle temporal information.

HTML (5) is an SGML DTD created for the purpose of enabling a worldwide distributed hypertext system. HTML has become a de facto standard for on-line Internet documents and is the native language of the World Wide Web (WWW) and its browsers, such as Netscape's Navigator and Microsoft's Internet Explorer.
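As an illustration of the markup approach shared by SGML and HTML, the fragment below sketches a tiny, hypothetical DTD-style element declaration and a conforming document instance; the element and attribute names are invented for this example and are not taken from any published DTD.

```
<!-- Hypothetical element declarations, in the spirit of an SGML DTD -->
<!ELEMENT Report  - - (Title, Chapter+)>
<!ELEMENT Title   - - (#PCDATA)>
<!ELEMENT Chapter - - (#PCDATA)>
<!ATTLIST Chapter  number NUMBER #REQUIRED>

<!-- A conforming document instance: tags delimit the logical elements -->
<Report>
  <Title>Annual Summary</Title>
  <Chapter number="1">Introductory text ...</Chapter>
  <Chapter number="2">Further text ...</Chapter>
</Report>
```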
The Hypermedia Time-based (HyTime) (6) document structuring language is a standard developed to permit human communications in a variety of media and to permit all media technologies to compete against each other in an environment that is capable of supporting all combinations of the media. HyTime was created for the digital publishing industry as a means to integrate all aspects of information collection and representation. It is based on the SGML standard and consists of several modules. Even though different platforms may have been used to create the information, by standardizing the syntax for documents HyTime brings together people, software, and departments. Representing abstract time dependencies (not addressed in other standards) and hyperlinks are the focus of HyTime, and it is therefore a model that can support any combination of multimedia, hypermedia, time, or space specifications. Furthermore, if the system cannot render some of the objects in the HyTime document, then blankness or darkness will be rendered to preserve the time/space relationships within the document.

HyTime's addressing capabilities allow the identification of hypermedia reference links from anchors to targets (a chapter of a document, a series of video frames, etc.). The target may be a file outside of the HyTime document and, as such, the model knows about media types but relies on an application-dependent SGML notation to tell it how to access media objects. Referencing elements within the same document is made easier, since each document has its own name space of unique names. Three modes of addressing are supported: by name, by position (in some arbitrary measurable universe or coordinate space), or by semantic construct. Elements can be linked together using different types of links: independent, property, contextual, aggregate, and span. Alignment of elements in HyTime is done in terms of bounding boxes called events (or grouped events called event schedules) that contain references to data. Each event is placed in a finite coordinate space (FCS) with one or more axes relating to some measurement domain (seconds, minutes, etc.) and addressable range (frame, text, etc.). The dimension of the event on each axis is called the extent, and the dimension is marked with quanta. Absolute as well as relative (by referencing other elements' quanta) temporal specifications can be made, as can delays. The event projection module maps the FCS of the source into the FCS of the target. In this way, events in a schedule in a source FCS can be first modified (e.g., using "wands") and then projected (e.g., using "batons") onto a target FCS's schedule. Further details on HyTime, including examples, may be found in (7).

Substantial effort has been made by the developers of ScriptX (8) at Kaleida to create a multimedia application platform that handles temporal elements.

VISUAL TECHNIQUES FOR DOCUMENT AUTHORING

It is clear that the potential complexity of multimedia documents dictates, to some extent, the complexity of the authoring tools. This section focuses on several important and significant authoring tools that allow authors graphically to model multiple media and their temporal relationships. The novelty of the graphical representations is emphasized over the particular document architectures, as is the ability of the particular model to represent interactive runtime events as opposed to predetermined absolute events.

General Timeline

Perhaps the most prevalent visual model is the timeline (9), a simple temporal model that aligns all events (start and end events of media objects) on a single axis that represents time. Since the events are all ordered in the way they should be presented, exactly one of the basic point relations, before (<), after (>), or simultaneous to (=), holds between any pair of events on a single timeline (see Fig. 2).

Figure 2. The basic timeline model. Although it is relatively simple, this model visually captures the intramedia relationships and is the basis for several commercial tools.

Although the timeline model is simple and graphical, it lacks the flexibility to represent relations that are determined interactively at runtime. For example, assume a graphic (e.g., a mathematical graph) is to be rendered on the screen only until a user action (e.g., a mouse selection) dictates that the next one should begin to be rendered. The start time of the graphic is known at the time of authoring. The end time of the graphic depends on the user action and cannot be known until presentation
time; hence the scenario cannot be represented on the traditional timeline, which requires a total specification of all temporal relations between media objects. Some commercial authoring and editing products that use this paradigm are Macromedia Director, Avid Media Composer, and Adobe Premiere.

An Enhanced Timeline Model

In Ref. 1, Falchuk, Hirzalla, and Karmouch present a visually and functionally enhanced timeline model that provides additional graphical entities that convey temporal information. In this approach, user actions are modeled as objects on the vertical axis of the timeline (usually reserved for media such as text, graphics, audio, and video). A new type of media object called choice is added and is associated with a data structure with several important fields: user_action, region, and destination_scenario_pointer. User_action completely describes what input should be expected from the viewer of the presentation; for instance, key-press-y or left-mouse. Region describes what region of the screen (if applicable) is a part of the action; for instance, rectangle(100,100,150,180) may describe a rectangle in which, if a user clicks with the mouse, some result is initiated. Destination_scenario_pointer is a pointer to some other part of the scenario or a different scenario. This media object "choice" may be placed directly on the traditional timeline.

Suppose there is a scenario in which a video of American presidents is being rendered, along with an audio track. The video serves to introduce us to three American presidents, Clinton, Bush, and Reagan. Suppose again we have rendered text boxes that display the name and the age of each president as they are introduced in the short video clip. Now suppose the authors wish to create some additional timelines, one for each president. In this way a user might make some selection during the playback, the result of which would be a "jump" to a more in-depth presentation of the president currently being introduced (i.e., active multimedia). To do this, each of the in-depth scenarios must be authored and then three choice objects must be added to the original timeline. The objects' data structures contain user_action = left-mouse, region = the appropriate layout location on the screen, and destination_scenario_pointer = the appropriate scenario. The choice objects are added in Fig. 3. In effect, the active multimedia scenario is finished. Since the choice data structures are completed, each such object refers to some other subscenario, as in Fig. 3.
Figure 3. Choice objects in an extended timeline and their associated destinations. Choices represent "windows of opportunity" for asynchronous events.
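As a concrete, purely illustrative rendering of the enhanced timeline just described, the sketch below models choice objects with the user_action, region, and destination_scenario_pointer fields and shows how a player loop might test whether a user event falls inside a choice's window of opportunity. The class names, screen coordinates, and window times are hypothetical and are not taken from Ref. 1.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x: int; y: int; w: int; h: int
    def contains(self, px: int, py: int) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h

@dataclass
class Choice:
    user_action: str          # e.g., "left-mouse" or "key-press-y"
    region: Rect              # screen area that is part of the action
    destination_scenario: str # pointer (here, a name) to the subscenario to jump to
    start_s: float            # window of opportunity on the timeline
    end_s: float

def handle_event(choices, action, px, py, t_now):
    """Return the destination scenario if an event activates a choice, else None."""
    for c in choices:
        in_window = c.start_s <= t_now <= c.end_s
        if in_window and action == c.user_action and c.region.contains(px, py):
            return c.destination_scenario   # terminate timeline1, start the destination
    return None                             # no choice taken; timeline1 keeps playing

# The presidents example: three choices on timeline1, one per introduced president.
choices = [
    Choice("left-mouse", Rect(100, 100, 150, 180), "timeline2_Clinton", 0.0, 10.0),
    Choice("left-mouse", Rect(100, 100, 150, 180), "timeline2_Bush",   10.0, 20.0),
    Choice("left-mouse", Rect(100, 100, 150, 180), "timeline2_Reagan", 20.0, 30.0),
]
print(handle_event(choices, "left-mouse", 120, 150, t_now=12.5))  # -> timeline2_Bush
```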
The resulting scenario is indeed an interactive one. Note that choice objects, like other media objects, have a duration, meaning that the user has a "window of opportunity" to make the action that initiates the choice. If the associated action is not made during this time, the user loses the chance to make it. That is, if the user does not make the mouse selection while the text object Clinton_text is being rendered, he or she will continue to see the rendering of timeline1, and a new choice (the one associated with President Bush) will be offered to the user. If the appropriate choice is made, rendering continues from the destination timeline and the original scenario is terminated.

In the enhanced visual model, the traditional rectangle that represents a media element on a timeline is split into three basic units, as shown in Fig. 4(a). The edges of these units, which represent start or end events, are either straight or bent lines. Straight lines represent synchronous events and bent lines represent asynchronous events. The actual time of an asynchronous event can only become known at runtime, and will be shifted by δ seconds to the right on the time axis, where δ is a nonnegative number representing the user response delay (see Fig. 4b). Again, though, at authoring time the value of δ is unknown.
Figure 4. The extended timeline. (a) Basic representation units. (b) User response delay.
The author may specify a limit to the value of δ where, if the user does not respond to some interactive prompt, a default response may be assumed (one that starts rendering the media object in the example shown in Fig. 4b). Therefore, defining the maximum value for δ is useful and necessary. The length of the unit from the left sharp point to the right straight line represents the maximum value of δ (see Fig. 4b). Based on the three basic units shown in Fig. 4(a), many different forms could be used, but only six of them are applicable to interactive scenarios. These six shapes and their descriptions are shown in Fig. 5. The assumptions used in forming the shapes are as follows:

1. An event that is temporally related to an asynchronous event is itself asynchronous. A simple example is that if the start time of a media element that has a specific duration is unknown, then its end time is also unknown.
2. The user response delay, δ, is a nonnegative value.

Figure 5. Applicable units for interactive multimedia scenarios: synchronous start and end events; synchronous start event, asynchronous end event; asynchronous start and end events; asynchronous start and end events, max end time; asynchronous start and end events, max start time; and asynchronous start and end events, max start time, max end time. These units reasonably capture the relevant event types.

Petri Nets

In the Petri Net model, media intervals are represented by "places" and relations by "transitions." Petri Nets are a formal graphical method appropriate for modeling systems that have inherent concurrency. The Petri Net model has been used extensively to model network communication and complex systems. A classification scheme exists that partitions Petri Nets into three classes: (1) those with Boolean tokens (places marked by at most one unstructured token), (2) those with integer tokens (places marked by several unstructured tokens), and (3) those with high-level tokens (places marked by structured tokens with information attached). As shown in Fig. 6, each of the basic point relations, before, simultaneous to, and after, can be modeled by a transition in conjunction with a delay place δ. The delay place has a nonnegative value that represents an idle time. If the delay is not known at authoring time, temporal inequalities can be expressed, thus allowing user interactions. However, unlike the timeline model, the graphical nature of the Petri Net model can become complex and difficult to grasp when the document becomes relatively large. Figure 6 shows a possible Petri Net representation of the scenario originally shown in Fig. 2.
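The following minimal sketch (not from any cited system; all names are invented) shows how media intervals and delay places might be encoded as a simplified timed Petri net in which a transition fires once all of its input places have completed.

```python
# Each place models a media interval (or a delay) with a fixed duration in seconds.
# A transition fires when every one of its input places has finished, then starts
# every output place. This mirrors Fig. 6: delay places implement "before" relations.

class Place:
    def __init__(self, name, duration_s):
        self.name, self.duration_s = name, duration_s

class Transition:
    def __init__(self, inputs, outputs):
        self.inputs, self.outputs = inputs, outputs

def schedule(start_places, transitions):
    """Compute absolute start times for every place, beginning the start places at t = 0."""
    start_time = {p: 0.0 for p in start_places}
    changed = True
    while changed:
        changed = False
        for t in transitions:
            if all(p in start_time for p in t.inputs):
                fire = max(start_time[p] + p.duration_s for p in t.inputs)
                for q in t.outputs:
                    if q not in start_time:
                        start_time[q] = fire
                        changed = True
    return start_time

# Audio plays for 30 s; a 5 s delay place then triggers an image (cf. the delay places of Fig. 6).
audio, delay, image = Place("audio", 30), Place("delay", 5), Place("image", 10)
times = schedule([audio], [Transition([audio], [delay]), Transition([delay], [image])])
print({p.name: t for p, t in times.items()})   # {'audio': 0.0, 'delay': 30.0, 'image': 35.0}
```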
Although this model is often used in research, no significant commercial products have emerged that use it. This reflects the model's lack of practicality in comparison to the timeline model.

Figure 6. Petri Net model for the scenario shown in Fig. 2.

MEDIADOC

The logical structure of MEDIADOC (10) uses abstract objects to represent the document in an aggregation hierarchy. Classes include independent objects, sequential objects, concurrent objects, and media objects. Media object classes have subclasses such as frame, sound, and image. Other classes, such as sequential and concurrent, allow specification of temporal relations. MEDIADOC supports the <, =, and > temporal relations between two objects. These relations are based on events that include time of day, scene start or end, object start or end, and other events. Temporal relations are represented graphically with a unique and effective representation. Figure 7 illustrates a simple scenario. Circles represent media objects and rectangles represent temporal information. The scene starts (right-facing triangle) with graphic G1 commencing immediately (= Start of Scene). G1 endures for 7 s. Five seconds after the termination of G1, T1 is rendered (5 s > G1). The scene then ends (left-facing triangle). MEDIADOC can also support choice graphically as a "splitter" that forms two threads of scenario, but the scenario starts to become unreadable if too many choices are used. To assist in editing, MEDIADOC provides a timeline representation (called a SORT graph) of the scenario in which nondeterminate objects are placed on the graph but shaded with a different texture. Furthermore, scenario verification can be done by the author or automatically by the system.

Figure 7. Graphical representation of synchronization specification using MEDIADOC.
CMIFed Multimedia Authoring

The CMIFed multimedia authoring environment (11) provides users with a novel graphical way of visualizing and representing multimedia scenarios. CMIFed offers the traditional timeline-type visualization, called the "channel view" (Fig. 8). Synchronizing media objects can be achieved by using CMIFed "sync-arcs." As shown in Fig. 8, sync-arcs synchronize the start of the audio track to the end of the logo graphic, as well as synchronizing the start of the video to the start of the audio. The hierarchy view is a novel way of visualizing both the structure of the scenario and the synchronization information using nested boxes. Boxes that are placed top to bottom are executed in sequential order, while those placed in left-to-right order are executed in parallel.

Figure 8. CMIFed channel view (left) and hierarchy view (right) of a temporal document. Sync arrows in the channel view represent synchronization and boxes in the hierarchy view represent parallel rendering.

Firefly

Firefly (9) is a powerful system that supports and models synchronous as well as asynchronous behaviors. Each media object is modeled by two connected rectangular nodes representing start and end events. Any other event that would be used for synchronization (such as a frame in a video) is called an internal event and is represented by a circular node that is placed between the start and end events. In Firefly, asynchronous events contained in a media item are represented by circular nodes that float above the start event. Temporal equalities between events are represented by labeled edges connecting these events. Figure 9 shows an example in which an image of a car is presented along with background data. The user may select different parts of the car (e.g., the door,
the hood) and be presented with a description of that particular part. User selections may or may not be made.

Figure 9. A temporal view of an active multimedia document using Firefly.

Macromind Director 6.0

Finally, a product by Macromedia Inc. called Director 6.0 (12) is a very popular suite for creating, editing, and executing multimedia presentations. Director features true objects and drag-and-drop behavior, and 100 channels for independent graphic elements called "sprites." Furthermore, Director allows instant publishing of interactive multimedia documents onto the World Wide Web and supports streaming media applications that allow bandwidth-intensive media such as video to begin to play immediately on remote machines as opposed to waiting for them to download completely.

Summary

Clearly there are a number of interesting, novel, and effective tools for authoring both passive and active multimedia documents. It can be noted that in almost all graphical authoring tools, the complexity of the visual representation grows with the complexity and nondeterminism of the document being authored. In particular, Petri Nets and Firefly documents tend to become unwieldy as the interactivity of the document is increased. Other methods, such as CMIFed and extended timelines, scale better. The simple timeline is likely the best choice for simple documents. However, if the document must model asynchronous events, the basic timeline model is not satisfactory. Extensions to the timeline model to support these types of events have been investigated.

CURRENT AND FUTURE ISSUES IN AUTHORING SYSTEMS

Current research into the areas of document standards, authoring, and editing is still active. This section first surveys some other important related work not covered in the previous section and then introduces the mobile agent paradigm. This paradigm for data access and interaction may have a profound effect on multimedia documents and authoring systems.

Other Related Work
Recent active areas in related research fields have focused on several key areas. These include adding temporal structure to multimedia data, architectures and data models, issues related to time-dependent data, and extensions to standards proposals to provide additional support. Examples of such research proposals follow.

Karmouch and Khalfallah (13) present an architecture and data model for multimedia documents and presentational applications. In this proposal, information objects are used to model the hierarchic structures of multimedia data, each of which has an associated presentation activity that models the rendering process for that object. Composite objects and aggregate objects can be used to model complex multimedia information. Presentational objects synchronize their subordinates using synchronous message passing. Communicating synchronous processes (CSPs) and a CSP language provide a mechanism for coordination. Multimedia documents have both logical and layout architectures and associated computing architectures that describe the activity of the presentation servers using a graphical notation. A conceptual schema for multimedia documents is proposed using an extension of the Entity-Relationship (ER) model.

Huang and Chu (14) propose an ODA-like multimedia document system. This system takes an object-oriented approach to extending the International Standards Organization (ISO) ODA to support continuous media as well as static media. The ODA structure is composed of the document profile, structure model, computational features, content architecture, and processing model. This proposal extends ODA such that it may model temporal aspects of continuous media (e.g., video and audio) that have a temporal duration. In this scheme, the control mechanism is described in the view of objects, and behavior properties are also viewed as objects, to which end the authors claim improved flexibility over ODA.

Schloss and Wynblatt (15) present a layered multimedia data model (LMDM) consisting of the following layers from the "top" down: Control (CL), Data Presentation (DPL), Data Manipulation (DML), and Data Definition (DDL). The DDL allows specification of objects, including persistent data or instructions to generate data. The DML allows the grouping of objects into multimedia events within a frame of reference with an event time. The DPL describes how data are to be presented to the user. This involves specifying the playback devices, display methods, etc. Multimedia presentations may be reused on different systems since the DPL is system independent. The CL describes how compound presentations are built from one or more other presentations. Signals can be accepted from I/O devices and users.
Jourdan, Layaida, and Sabry-Ismail (16) present a robust authoring environment called MADEUS that makes use of extended temporal constraint networks and implements a multimedia presentation layer. Finally, Fritzsche (17) presents a granularity-independent data model for time-dependent data. This work treats multimedia data more theoretically and abstractly than merely as video frames or audio samples.

The Mobile Agent Paradigm and Multimedia Authoring

In the classical client-server model there are two main entities. The server is the service provider that typically idles while it waits for well-formed requests to come onto the port that it monitors. The client is the service consumer that sends a particular request message to the server when it needs a service performed. The mobile-agent model is somewhat different. A mobile agent can be defined as a program that is
able to migrate from node to node on a network under its own control for the purpose of completing a task specified by a user. The agent chooses when and to where it will migrate and returns results and messages in an asynchronous fashion. In other words, mobile agents are sent to, and run beside, the remote data servers, and interaction with the remote data is not limited to the network Application Program Interface (API) otherwise afforded to clients. Mobile agents (see Ref. 18 for a thorough explanation) do not require network connectivity with remote services to interact with them. A network connection is used for a one-shot transmission of data (the agent and possibly its state and cargo) and then is closed. Agent results in the form of data do not necessarily return to the user using the same communications trajectory, if indeed
results at this node are gained or even expected. Alternatively, the agent may send itself to another node from the intermediate one, taking its partial results from the intermediate node with it. Table 3 illustrates some of the differences from the client-server model.

Table 3. A Comparison of the Client-Server and Mobile-Agent-Based Approaches to Document Composition

1. Client-server: Client requests must come from a fixed set of operations from within the server's API. Mobile agent: Agents are programs and have access to all of the constructs of their particular language (e.g., loops).
2. Client-server: Multiple invocations require multiple client requests. Mobile agent: Multiple invocations are iterated at the service itself, in local memory.
3. Client-server: The client iterates over the network. Mobile agent: Agents run remotely while the originating host remains free.
4. Client-server: The server decides which data to return to the client. Mobile agent: Agents filter data both locally and remotely.
5. Client-server: Data return directly to the client. Mobile agent: An agent may jump to any number of intermediate nodes.

In general, we can say that the mobile agent model offers several key advantages over the client-server model, including the following: (1) it uses less bandwidth by filtering out irrelevant data (based on user profiles and preferences) at the remote site before the data are sent back, (2) ongoing processing does not require ongoing connectivity, and (3) the model saves computing cycles at the user's computer. As most document authoring systems, database applications, and legacy systems are client-server based, the mobile agent paradigm will take time to become widespread. However, research laboratory prototypes (19) and some commercial products (20) have shown promise. Traditional servers will have to be extended with complex semantics to be able to support mobile agents, which are clearly much more than merely messages: they have goals, state, cargo, etc.

Figure 10 illustrates how a mobile agent, programmed in such a way as to collect media intermittently on certain topics, might be coded to travel from one agent server to another on a network. The agent knows which type of media to collect because it has either (1) been explicitly programmed by the user or (2) implicitly learned what the user likes by discovering patterns in the user's day-to-day computing tendencies. When the mobile agent has finished collecting media to satisfy its mission, it returns to the user's computer and presents the data to the user.

Figure 10. Pseudocode describing the filling in of a document's sections (left), and the general state diagram (right) showing the overall operation of the mobile agent (specify, refine task; query, exchange, negotiate, acquire; jump to next node; classify, structure, present). A mobile agent can also be considered a mobile document, programmed to "fill itself up" with appropriate media.
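The pseudocode panel of Figure 10 is not reproduced here; the sketch below is a hypothetical reconstruction in the same spirit, showing an agent that hops between agent servers, fills in the sections of a document template, and returns with its cargo. The server names and helper functions are invented for illustration; the "servers" are simulated as dictionaries so the example is self-contained.

```python
# A toy, self-contained simulation: "servers" are plain dictionaries mapping topics to
# media item names, and "migration" is simply iteration over them in itinerary order.
SERVERS = {
    "server_A": {"wildlife": ["lion.mpg"], "history": []},
    "server_B": {"wildlife": ["safari.wav"], "history": ["timeline.txt"]},
}

def query_local_media(server_name, topic):
    """Stand-in for querying media co-located with the agent at the current node."""
    return SERVERS[server_name].get(topic, [])

def run_agent(document_sections, itinerary):
    """Fill each section of a document template by visiting agent servers in turn."""
    collected = {section: [] for section in document_sections}
    for server in itinerary:                     # "jump to next node"
        for section in document_sections:        # "query, exchange, negotiate, acquire"
            collected[section].extend(query_local_media(server, section))
        if all(collected[s] for s in document_sections):
            break                                # mission satisfied; return home early
    return collected                             # "classify, structure, present" back at the user

print(run_agent(["wildlife", "history"], ["server_A", "server_B"]))
# {'wildlife': ['lion.mpg', 'safari.wav'], 'history': ['timeline.txt']}
```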
The key points are as follows:

• When the agent returns with a document to be presented, both the logical document structure and the document content are completely determined by the user's preferences and explicit choices.
• The impact on authoring systems is significant, particularly those that are distributed.

In this paradigm, the document authoring process can be thought of as consisting of three parts: (1) specifying the document preferences, (2) dispatching a mobile agent on a distributed system to collect media that can be used in the document, and (3) presenting the results in an arbitrary logical and layout format. This is a large departure from the traditional client-server approach of almost all distributed authoring systems (see Fig. 11).

Figure 11. The traditional approach to authoring (left) versus the mobile agent approach (right). The mobile agent approach may reduce network traffic and save the end-user time by physically co-locating with media servers and agents and intelligently gathering media.

CONCLUSIONS

This article has introduced the concept of multimedia documents of different kinds and provided definitions of important
and commonly used terms in this domain. Several standards that might support multimedia documents were briefly surveyed. Visual approaches to authoring asynchronous multimedia documents were surveyed, and then the mobile agent paradigm for authoring documents was introduced and its significance for traditional authoring systems emphasized. Multimedia documents containing heterogeneous media require complex storage, editing, and authoring tools. While there are numerous document standards, few have addressed all of the requirements of documents containing multiple, time-based media. Furthermore, representing document structures and playback scenarios visually is challenging, and only a small number of effective tools have emerged in both the commercial and research realms. Authoring systems for these documents must aid in both the spatial and temporal layout of media and may also have to support the modeling of user interactions with time-based media.

ACKNOWLEDGMENTS

We would like to thank our colleagues, Nael Hirzalla, Habib Khalfallah, and James Emery, for their contributions to this article.

BIBLIOGRAPHY

1. B. Falchuk, N. Hirzalla, and A. Karmouch, A temporal model for interactive multimedia scenarios, IEEE Multimedia, Fall 1995, 24-31.
2. M. Buchanan and P. Zellweger, Specifying temporal behavior in hypermedia documents, Proceedings of the ACM Conference on Hypertext, New York: ACM Press, Dec. 1992, pp. 262-271.
3. ISO 8879, Information Processing - Text and Office Systems - Standard Generalized Markup Language, 1986.
4. ISO 8613-1 (CCITT T.411), Open Document Architecture and Interchange Format - Introduction and General Principles, 1988.
5. T. Berners-Lee and D. Connolly, Hypertext Markup Language 2.0, IETF HTML Working Group, http://www.cs.tu-berlin.de/~jutta/ht/draft-ietf-html-spec-01.html.
6. ISO/IEC DIS 10744, Information Technology - Hypermedia/Time-based Structuring Language (HyTime), August 1992.
7. R. Erfle, Specification of temporal constraints in multimedia documents using HyTime, Electronic Publishing, 6 (4): 397-411, December 1993.
8. R. Valdes, Introducing ScriptX, Software Tools for the Professional Programmer, 19 (13): November 1994.
9. G. Blakowski, J. Huebel, and U. Langrehr, Tools for specifying and executing synchronized multimedia presentations, 2nd Int. Workshop on Network and Operating System Support for Digital Audio and Video, Nov. 1991.
10. A. Karmouch and J. Emery, A playback schedule model for multimedia documents, IEEE Multimedia, Spring 1996, 50-61.
11. R. Rossum et al., CMIFed: A presentation environment for portable hypermedia documents, Proc. of ACM Multimedia '93, New York: ACM Press, August 1993, pp. 183-188.
12. Macromedia Inc., Macromind Director 6.0, http://www.macromedia.com.
13. H. Khalfallah and A. Karmouch, An architecture and a data model for integrated multimedia documents and presentational applications, Multimedia Systems, 3: 238-250, 1995.
14. C. Huang and Y. Chu, An ODA-like multimedia document system, Software: Practice and Experience, 26 (10): 1097-1126, October 1996.
15. G. Schloss and M. Wynblatt, Providing definition and temporal structure for multimedia data, Multimedia Systems, 3: 264-277, 1995.
16. M. Jourdan, N. Layaida, and L. Sabry-Ismail, Presentation services in MADEUS: An authoring environment for multimedia documents, INRIA Research Report No. 2983.
17. J. C. Fritzsche, Continuous media described by time-dependent data, Multimedia Systems, 3: 278-285, 1995.
18. Special Issue on Agents, Communications of the ACM, 37 (7): 1994.
19. B. Ford and A. Karmouch, An architectural model for mobile agent-based multimedia applications, Proc. of CCBR'97, Ottawa, Canada, April 1997.
20. J. E. White, Telescript Technology: The Foundation for the Electronic Marketplace, General Magic White Paper, General Magic, 1994.
BENJAMIN FALCHUK AHMED KARMOUCH University of Ottawa
Wiley Encyclopedia of Electrical and Electronics Engineering

Distributed Multimedia Systems
Standard Article
Shahram Ghandeharizadeh and Cyrus Shahabi, University of Southern California, Los Angeles, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W4803
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Overview of Magnetic Disks; Continuous Display; Stream Scheduling and Synchronization; Optimization Techniques; Case Study; Acknowledgments.
DISTRIBUTED MULTIMEDIA SYSTEMS

Advances in computer processing and storage performance and in high-speed communications have made it feasible to consider continuous media (e.g., audio and video) servers that
scale to thousands of concurrently active clients. The principal characteristic of continuous media is their sustained bit rate requirement (1,2). If a system delivers a clip at a rate lower than its prespecified rate without special precautions (e.g., prefetching), the user might observe frequent disruptions and delays with video and random noises with audio. These artifacts are collectively termed hiccups. For example, CD-quality audio (2 channels with 16-bit samples at 44 kHz) requires 1.4 Megabits per second (Mbps). Digital component video based on the CCIR 601 standard requires 270 Mbps for its continuous display.

These bandwidths can be reduced using compression techniques due to redundancy in the data. Compression techniques are categorized into lossy and lossless. Lossy compression techniques encode data into a format that, when decompressed, yields something similar to the original. With lossless techniques, decompression yields the original. Lossy compression techniques are more effective in reducing both the size and bandwidth requirements of a clip. For example, with the MPEG standard (3), the bandwidth requirement of CD-quality audio can be reduced to 384 kilobits per second. MPEG-1 reduces the bandwidth requirement of a video clip to 1.5 Mbps. With some compression techniques, such as MPEG-2, one can control the compression ratio by specifying the final bandwidth of the encoded stream (ranging from 3 to 15 Mbps). However, there are applications that cannot tolerate the use of lossy compression techniques, for example, video signals collected from space by NASA (4). Most compression schemes are Constant Bit Rate (CBR), but some are Variable Bit Rate (VBR). With both techniques, the data must be delivered at a prespecified rate. Typically, CBR schemes allow some bounded variation of this rate based on some amount of memory at the display. With VBR, this variation is not bounded. VBR schemes have the advantage that, for the same average bandwidth as CBR, they can maintain a more constant quality in the delivered image by utilizing more megabits per second when needed, for example, when there is more action in a scene.

The size of a compressed video clip is quite large by most current standards. For example, a two-hour MPEG-2 encoded video clip, requiring 3 Mbps, is 2.6 Gigabytes in size. (In this paper, we focus on video due to its significant size and bandwidth requirements, which are higher than those of audio.) To reduce the cost of storage, a typical architecture for a video server employs a hierarchical storage structure consisting of DRAM, magnetic disks, and one or more tertiary storage devices. (Magnetic tape typically serves as a tertiary storage device (5).) As the different levels of the hierarchy are traversed starting with memory, the density of the medium and its latency increase while its cost per megabyte decreases. It is assumed that all video clips reside on the tertiary storage device. The disk space is used as a temporary staging area for the frequently accessed clips in order to minimize the number of references to tertiary storage. This enhances the performance of the system.

Once the delivery of a video clip has been initiated, the system is responsible for delivering the data to the settop box of the client at the required rate so that there is no interruption in service. The settop box is assumed to have little memory, so it is incumbent on the server and network to deliver the data in a "just in time" fashion.
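As a quick check on the storage figure quoted above, the sketch below (illustrative only) computes the size of a CBR clip from its bit rate and duration. The unit convention is an assumption: treating a megabit as 2^20 bits and a gigabyte as 2^30 bytes appears to be what yields the 2.6 GB figure in the text.

```python
def clip_size_gb(bit_rate_mbps: float, duration_hours: float, binary_units: bool = True) -> float:
    """Size of a CBR clip from its rate and duration.

    With binary_units=True, a megabit is 2**20 bits and a gigabyte is 2**30 bytes
    (an assumed convention, consistent with the 2.6 GB figure quoted in the text).
    """
    bits_per_mb = 2**20 if binary_units else 10**6
    bytes_per_gb = 2**30 if binary_units else 10**9
    total_bits = bit_rate_mbps * bits_per_mb * duration_hours * 3600
    return (total_bits / 8) / bytes_per_gb

print(round(clip_size_gb(3.0, 2.0), 2))    # two-hour MPEG-2 clip at 3 Mbps  -> ~2.64
print(round(clip_size_gb(1.5, 2.0), 2))    # two-hour MPEG-1 clip at 1.5 Mbps -> ~1.32
```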
(We will specify this requirement more precisely and formally later.) Note that the system will have some form of admission control, and while it is reasonable to make a request wait to be
initiated, once started, delivery in support of a hiccup-free display is required. An important concept is that of a stream, which is one continuous sequence of data from one clip. At any point in time, a stream is associated with a specific offset within the clip. Note that two requests for the same clip that are offset by some time are two different streams.

In this article, we start by assuming a set of videos stored on disk and describe techniques to stage data from disk to memory buffers. Subsequently, we explore the role of tertiary storage devices and address issues such as how a presentation can be displayed from a tertiary storage device, how data should be staged from tertiary onto the magnetic disk storage, and how pipelining can be used to minimize the incurred startup latency. The models are described assuming CBR encoded data. We do not investigate the networking issues (6) and assume, therefore, that the communication network will transmit a stream at a constant bit rate. Note that this is a simplifying assumption: there will be some amount of variation in the network transmission. To account for this, some amount of buffering is required in the network interface as well as the settop box. We do not consider this further; see (6) for a detailed treatment of buffer requirements to mask delays in end-to-end delivery of continuous media streams. In this context, the disk subsystem is responsible for delivering data from the disk image of a clip to buffers such that if RC is the play-out rate of the object in Mbps, then at t seconds after the start of the stream, at least RC × t megabits must have been delivered to the network interface. This ensures that the network never starves for data.

There are a number of ways of scheduling streams depending on the frequency of access to each video. The simplest would be to define a fixed schedule of start times a priori. At the other extreme, one could start a new stream for each request at the earliest possible time (consistent with available resources). Video-on-demand implies the latter policy, but there are variations that will be described later. With n videos, we define f_i(t) to be the frequency of access to the ith video as a function of time. This notation emphasizes that the frequency of access to videos varies over time. Some variations might be periodic over the course of a day, or some other period, while other variations will not be repetitive (e.g., a video becoming old over a couple of months). It is important, therefore, to design a system that can respond effectively to both variations and do so automatically. As we shall see, striping of objects across disks in an array alleviates this problem.

The following are the main performance metrics that constitute the focus of this article:

1. Throughput: the number of simultaneous displays supported by the system.
2. Startup latency: the amount of time elapsed from when a request arrives referencing a clip until the time the server starts to retrieve data on behalf of this request. Startup latency corresponds roughly to the usual basic measure of response time.

Alternative applications might sacrifice either startup latency in favor of throughput or vice versa. For example, a service provider that supports video-on-demand might strive to maximize its throughput by delaying a user by tens of seconds to initiate the display. If the same provider decides to support VCR functionalities, for example,
pause, resume, and fast forward and rewind with scan, then the startup latency starts to become important. This is because a client might not tolerate tens of seconds of latency for a fast forward scan. The impact of startup latency becomes more profound with nonlinear digital editing systems. To explain this, consider the role of such a system in a news organization, for example, CNN. With sports, a producer tailors a presentation based on highlights of different clips, for example, different events at the Olympics. Moreover, the producer might add an audio narration to logically tie these highlights together. Upon the display of the presentation, the system is required to display (1) the audio narration in synchrony with the video, and (2) the highlights in sequence, one after another, with no delay between two consecutive highlights. If the system treats the display of each highlight as a request for a stream, then the startup latency must be minimized between each stream to provide the desired effect. Later, we will describe scheduling techniques that hide this latency.

The rest of this paper is organized as follows. We provide an overview of the current disk technology. Subsequently, we describe scheduling techniques in support of a continuous display and the role of hierarchical storage structures. Then, we focus on optimization techniques that can enhance the performance of a continuous media server. Finally, we describe an experimental prototype named Mitra that realizes a number of the techniques in this paper.

OVERVIEW OF MAGNETIC DISKS

A magnetic disk drive is a mechanical device operated by its controlling electronics. The mechanical parts of the device consist of a stack of platters that rotate in unison on a central spindle; see (9) for details. A single disk contains a number of platters, as many as sixteen at the time of this writing. Each platter surface has an associated disk head responsible for reading and writing data. Each platter stores data in a series of tracks. A single stack of tracks at a common distance from the spindle is termed a cylinder. To access the data stored in a track, the disk head must be positioned over it. The operation to reposition the head from the current track to the desired track is termed a seek. Next, the disk must wait for the desired data to rotate under the head. This time is termed rotational latency.

The seek time is a function of the distance traveled by the disk arm (7-9). Several studies have introduced analytical models to estimate seek time as a function of this distance. To be independent of any specific equation, this study assumes a general seek function. Thus, let Seek(c) denote the time required for the disk arm to travel c cylinders to reposition itself from cylinder i to cylinder i + c (or i - c). Hence, Seek(1) denotes the time required to reposition the disk arm between two adjacent cylinders, while Seek(#cyl) denotes a complete stroke from the first to the last cylinder of a disk that consists of #cyl cylinders. Typically, seek time increases linearly with distance except for a small number of cylinders (7,8). For example, the model used to describe the seek characteristic of the Seagate ST31200W disk, consisting of 2697 cylinders, is

    Seek(c) = 1.5 + (0.510276 × sqrt(c))   if c < 108
              6.5 + (0.004709 × c)         otherwise          (1)
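A direct transcription of Eq. (1), useful for checking values such as the single-cylinder seek and the full stroke of the 2697-cylinder drive (the millisecond unit is an assumption, consistent with the seek times quoted elsewhere in the article):

```python
import math

def seek_ms(c: int) -> float:
    """Seek time of the Seagate ST31200W model of Eq. (1) for a c-cylinder move (assumed ms)."""
    if c <= 0:
        return 0.0
    if c < 108:
        return 1.5 + 0.510276 * math.sqrt(c)
    return 6.5 + 0.004709 * c

print(round(seek_ms(1), 2))      # Seek(1): adjacent-cylinder seek, ~2.01
print(round(seek_ms(2697), 2))   # Seek(#cyl): full stroke of the 2697-cylinder disk, ~19.2
```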
A trend in the area of magnetic disk technology is the concept of zoning. Zoning increases the storage capacity of a disk by storing more data on the tracks that constitute the outer regions of the disk drive. With a fixed rotational speed for the disk platters, this results in a disk with a variable transfer rate, where the data in the outermost region is produced at a faster rate. Figure 1 shows the transfer rate of the 23 different zones that constitute a Seagate disk drive. [Techniques employed to gather these numbers are reported in (10).]

Figure 1. Zone characteristics of the Seagate ST31200W magnetic disk (transfer rate in Mbps as a function of disk capacity in MB).

CONTINUOUS DISPLAY

This section starts with a description of a technique to support continuous display of CM objects assuming a platform that consists of a single disk drive with one zone. Subsequently, we extend the discussion to incorporate multizone disks. Next, we describe the role of multiple disk drives in support of environments that strive to support thousands of CM streams. Finally, we show how the architecture can be extended to support a hierarchy of storage structures in order to minimize the cost of providing online access to petabytes of data.

Single Disk Drive

In this article, we assume that a disk drive provides a constant bandwidth, RD. The approaches discussed can, however, be extended to multizone disk drives with variable transfer rates; interested readers can consult (11). We also assume that all objects have the same display rate, RC. In addition, we assume RD > RC. The first assumption is relaxed in Mix of Media Types, and the second in Hierarchical Storage Management.

To support continuous display of an object X, it is partitioned into n equisized blocks: X0, X1, . . ., Xn−1, where n is a function of the block size (B) and the size of X. We assume a block is laid out contiguously on the disk and is the unit of transfer from disk to main memory. The time required to display a block is defined as a time period (Tp):

Tp = B / RC                (2)
To support the continuous display of X, one can retrieve the blocks of X one after the other and send them to the user's display consecutively. This is a traditional production-consumption problem. Since RD, the rate of production, is larger than RC, the rate of consumption, a large amount of memory buffer is required at the user site. To reduce the amount of required buffer, one should slow down the production rate. Note that the consumption rate is fixed and dictated by the display bandwidth requirement of the object. Therefore, if Xi and Xj are two consecutive blocks of X, Xj should be in the user buffer by the time the consumption of Xi has been completed. This is the core of a simple technique, termed the round-robin schema (12,2).

With the round-robin schema, when an object X is referenced, the system stages X0 in memory and initiates its display. Prior to completion of a time period, it initiates the retrieval of X1 into memory in order to ensure a continuous display. This process is repeated until all blocks of an object have been displayed. To support simultaneous displays of several objects, a time period is partitioned into fixed-size slots, with each slot corresponding to the retrieval time of a block from the disk drive. The number of slots in a time period defines the number of simultaneous displays that can be supported by the system (N). For example, a block size of 1 MB corresponding to an MPEG-2 compressed movie (RC = 4 Mb/s) has a 2 s display time (Tp = 2). Assuming a typical magnetic disk with a transfer rate of 68 Mb/s (RD = 68 Mb/s) and a maximum seek time of 17 ms, 14 such blocks can be retrieved in 2 s. Hence, a single disk supports 14 simultaneous displays.

Figure 2. Time period.

Figure 2 demonstrates the concept of a time period and a time slot. Each box represents a time slot. Assuming that each block is stored contiguously on the surface of the disk, the disk incurs a seek every time it switches from one block of an object to another. We denote this as TW_Seek and assume that it includes the average rotational latency time of the disk drive. We will not discuss
rotational latency further because it is a constant added to every seek time. Since the blocks of different objects are scattered across the disk surface, the round-robin schema should assume the maximum seek time [i.e., Seek(#cyl)] when multiplexing the bandwidth of the disk among multiple displays. Otherwise, a continuous display of each object cannot be guaranteed.

Seek is a wasteful operation that reduces the number of simultaneous displays supported by the disk. In the worst case, the disk performs N seeks during a time period. Hence, the percentage of time that the disk performs wasteful work can be quantified as [N × Seek(d)]/Tp × 100, where d is the maximum distance between two blocks retrieved consecutively (d = #cyl with round-robin). By substituting Tp from Eq. 2, we obtain the percentage of wasted disk bandwidth:

wasteful = [N × Seek(d) × RC / B] × 100                (3)
By reducing this percentage, the system can support a higher number of simultaneous displays. We can manipulate two factors to reduce this percentage: (1) decrease the distance traversed by a seek (d), and/or (2) increase the block size (B). A limitation of increasing the block size is that it results in a higher memory requirement. Here, we investigate display schemas that reduce the first factor. Alternatively, by manipulating d while fixing the throughput, one can decrease the block size and obtain a system with a lower memory requirement for staging the blocks. The following paragraphs elaborate on this aspect.

Suppose N blocks are retrieved during a time period; then Tp = N × B/RD + N × Seek(#cyl). By substituting Tp from Eq. 2, we solve for B to obtain

Bround-robin = [RC × RD × N × Seek(#cyl)] / (RD − N × RC)                (4)
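As a concreteness check of Eqs. (3) and (4), the following sketch evaluates both formulas. It is an illustration only, assuming the example parameters used in this section (RD = 68 Mb/s, RC = 4 Mb/s, and a 17 ms full-stroke seek); the units are megabits and seconds.

```python
# Illustrative evaluation of Eqs. (3) and (4) with the example parameters used
# in this section (RD = 68 Mb/s, RC = 4 Mb/s, Seek(#cyl) = 17 ms).

RD = 68.0          # disk transfer rate (Mb/s)
RC = 4.0           # display rate (Mb/s)
SEEK_FULL = 0.017  # Seek(#cyl), in seconds

def b_round_robin(n: int) -> float:
    """Block size (in Mb) from Eq. (4) for N simultaneous displays."""
    return (RC * RD * n * SEEK_FULL) / (RD - n * RC)

def wasted_pct(n: int, seek_s: float, block_mb: float) -> float:
    """Percentage of disk bandwidth lost to seeks, Eq. (3)."""
    return (n * seek_s * RC / block_mb) * 100.0

n = 15
b = b_round_robin(n)                          # about 8.67 Mb, i.e. ~1.08 MB
print(b / 8.0, wasted_pct(n, SEEK_FULL, b))   # ~1.08 MB and ~12% wasted bandwidth
```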
From Eq. 4, for a given N, the size of a block is proportional to Seek(#cyl). Hence, if one can decrease the duration of the seek time, then the same number of simultaneous displays can be supported with smaller block sizes. This will save some memory. Briefly, for a fixed number of simultaneous displays, as the duration of the worst seek time decreases (increases), the size of the blocks shrinks (grows) proportionally with no impact on throughput. This impacts the amount of memory required to support N displays. For example, assume Seek(#cyl) = 17 ms, RD = 68 Mb/s, RC = 4 Mb/s, and N = 15. From Eq. 4, we compute a block size of 1.08 MB that wastes 12% of the disk bandwidth. If a display schema reduces the worst seek time by a factor of two, then the same throughput can be maintained with a block size of 0.54 MB, reducing the amount of required memory by a factor of two and maintaining the percentage of wasted disk bandwidth at 12%. The maximum startup latency observed by a request, defined as the amount of time elapsed from the arrival time of a request to the onset of the display of its referenced object, with this schema is
ℓround-robin = Tp                (5)
This is because a request might arrive a little too late to employ the empty slot in the current time period. Note that ℓ is the maximum startup latency (the average latency is ℓ/2) when the number of active users is N − 1. If the number of active displays exceeds N, then Eq. 5 should be extended with appropriate queuing models. This discussion holds true for the maximum startup latencies computed for the other schemas in this article.

In the following sections, we investigate two general techniques to reduce the duration of the worst seek time. While the first technique schedules the order of block retrieval from the disk, the second controls the placement of the blocks across the disk surface. These two techniques are orthogonal, and we investigate a technique that incorporates both approaches. The three main objectives are (1) maximizing the number of simultaneously displayed streams (i.e., throughput), (2) minimizing the startup latency time, and (3) minimizing the amount of required memory. Since these objectives are conflicting, there is no single best technique. Each of the described techniques strives to strike a compromise among the mentioned objectives.

Disk Scheduling. One approach to reduce the worst seek time is the Grouped Sweeping Scheme (13), GSS. GSS groups the N active requests of a time period into g groups. This divides a time period into g subcycles, each corresponding to the retrieval of N/g blocks. The movement of the disk head to retrieve the blocks within a group abides by the SCAN algorithm in order to reduce the incurred seek time in a group. Across the groups, there is no constraint on the disk head movement. To support the SCAN policy within a group, GSS shuffles the order in which the blocks are retrieved. For example, assuming X, Y, and Z belong to a single group, the sequence of block retrieval might be X1 followed by Y4 and Z6 (denoted X1 → Y4 → Z6) during one time period, while during the next time period it might change to Z7 → X2 → Y5. In this case, the display of (say) X might suffer from hiccups because the time elapsed between the retrievals of X1 and X2 is greater than one time period. To eliminate this possibility, (13) suggests the following display mechanism: the displays of all the blocks retrieved during subcycle i start at the beginning of subcycle i + 1. To illustrate, consider Fig. 3, where g = 2 and N = 4. The blocks X1 and Y1 are retrieved during the first subcycle. The displays are initiated at the beginning of subcycle 2 and last for two subcycles. Therefore, while it is important to preserve the order of groups across the time periods, it is no longer necessary to maintain the order of block retrievals within a group. The maximum startup latency observed with this technique is the summation of one time period (if the request arrives when the empty slot is missed) and the duration of a subcycle (Tp/g):
ℓgss = Tp + Tp/g                (6)
By comparing Eq. 6 with Eq. 5, it may appear that GSS results in a higher latency than round-robin. However, this is not necessarily true because the duration of the time period is different with these two techniques due to a choice of different block size. This can be observed from Eq. 2, where the duration of a time period is a function of the block size.
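As an aside, the within-group SCAN ordering that GSS applies can be sketched in a few lines of Python. The request names, cylinder positions, and the round-robin assignment of requests to groups below are hypothetical, chosen only to illustrate the idea.

```python
# Illustrative sketch of GSS grouping: the N active requests are divided into
# g groups; within a group, block retrievals are ordered by cylinder position
# (SCAN), while the order of the groups themselves is preserved across time
# periods. Blocks retrieved in subcycle i are displayed starting at subcycle i+1.

def gss_schedule(requests, g):
    """requests: list of (name, cylinder_of_next_block); returns the retrieval
    order for each of the g subcycles of one time period."""
    groups = [requests[i::g] for i in range(g)]           # hypothetical group membership
    return [sorted(group, key=lambda r: r[1]) for group in groups]

active = [("X", 1800), ("Y", 200), ("Z", 950), ("W", 40)]
for i, subcycle in enumerate(gss_schedule(active, g=2), start=1):
    print(f"subcycle {i}:", [name for name, _ in subcycle])
```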
Figure 3. Continuous display with GSS.
To compute the block size with GSS, we first compute the total duration of time contributed to seek times during a time period. Assuming the N/g blocks retrieved during a subcycle are distributed uniformly across the disk surface, the disk incurs a seek time of Seek(#cyl/(N/g)) between every two consecutive block retrievals. This assumption maximizes the seek time according to the square-root model, providing the worst-case scenario. Since N blocks are retrieved during a time period, the system incurs N seek times in addition to N block retrievals during a period; that is, Tp = N × B/RD + N × Seek[(#cyl × g)/N]. By substituting Tp from Eq. 2 and solving for B, we obtain

Bgss = [RC × RD / (RD − N × RC)] × N × Seek(#cyl × g / N)                (7)

By comparing Eq. 7 with Eq. 4, observe that the bound on the distance between two blocks retrieved consecutively is reduced by a factor of g/N, noting that g ≤ N. Observe that g = N simulates the round-robin schema. (By substituting g with N in Eq. 7, it reduces to Eq. 4.)

Other disk scheduling algorithms for continuous media objects are also discussed in the literature. Almost all of them (14–16) can be considered special cases of GSS because they follow the main concept of scheduling within a time period (or round). The only exception is SCAN-EDF (17).

Constrained Data Placement. An alternative approach to reduce the worst seek time is to control the placement of the blocks across the disk surface. There are many data placement
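The effect of g on the block size in Eq. (7) can be illustrated with a short sketch. The disk parameters reuse the Seagate figures quoted earlier (the seek model of Eq. (1), 2697 cylinders, RD = 68 Mb/s, RC = 4 Mb/s); the sketch is illustrative rather than definitive.

```python
# Illustrative evaluation of Eq. (7): block size under GSS as a function of g.

RD, RC, NUM_CYL = 68.0, 4.0, 2697   # Mb/s, Mb/s, cylinders (example values)

def seek_s(c: float) -> float:
    """Seek time in seconds for a distance of c cylinders, per Eq. (1)."""
    if c < 108:
        return (1.5 + 0.510276 * c ** 0.5) / 1000.0
    return (6.5 + 0.004709 * c) / 1000.0

def b_gss(n: int, g: int) -> float:
    """Block size (Mb) from Eq. (7); g = N reduces to round-robin, Eq. (4)."""
    return (RC * RD / (RD - n * RC)) * n * seek_s(NUM_CYL * g / n)

n = 15
for g in (15, 5, 1):                  # smaller g bounds seeks more tightly
    print(g, b_gss(n, g) / 8.0)       # block size in MB shrinks with g
```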
techniques described in the literature (18–22). In this section, we describe a technique termed Optimized REBECA (22) (OREO for short). OREO (23) reduces the worst seek time by bounding the distance between any two blocks that are retrieved consecutively. OREO achieves this by partitioning the disk space into R regions. Next, successive blocks of an object X are assigned to the regions in a round-robin manner, as shown in Fig. 4. The round-robin assignment follows the efficient movement of the disk head as in the SCAN algorithm (24). To display an object, the disk head moves inward (see Fig. 5) from the outermost region toward the innermost one. Once the disk head reaches the innermost region, it is repositioned to the outermost region to initiate another sweep. This minimizes the movement of the disk head required to simultaneously retrieve N objects because the display of each object abides by the following rules:

1. The disk head moves in one direction (inward) at a time.
2. For a given time period, the disk services those displays that correspond to a single region (termed the active region, Ractive).
3. In the next time period, the disk services the requests corresponding to region Ractive + 1.
4. Upon the arrival of a request referencing object X, it is assigned to the region containing X0 (say RX).
5. The display of X does not start until the active region reaches X0 (Ractive = RX).

To compute the worst seek time with OREO, note that the distance between two blocks retrieved consecutively is bounded by the length of a region (i.e., #cyl/R). This distance is bounded by 2 × #cyl/R when the blocks belong to two different regions. This only occurs for the last block retrieved during time period i and the first block retrieved during time period i + 1. To simplify the discussion, we eliminated this factor from the equations [see (22) for precise equations].
Figure 4. OREO.

Figure 5. Disk head movement.
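The round-robin region assignment and the inward sweep of the active region can be made concrete with a small sketch; the number of regions and the object length below are arbitrary illustration values, not taken from the article.

```python
# Minimal sketch of OREO's round-robin region assignment and active-region
# schedule, as described by the rules above.

R = 6  # number of disk regions (illustrative)

def region_of_block(i: int) -> int:
    """Region holding block Xi under round-robin assignment (cf. Fig. 4)."""
    return i % R

def active_region(t: int) -> int:
    """Region serviced during time period t: the head sweeps inward one region
    per period and returns to the outermost region after R periods (cf. Fig. 5)."""
    return t % R

# A display whose first block X0 lives in region 2 must wait until the active
# region reaches 2 before its first block can be retrieved.
print([region_of_block(i) for i in range(8)])   # [0, 1, 2, 3, 4, 5, 0, 1]
print([active_region(t) for t in range(8)])     # [0, 1, 2, 3, 4, 5, 0, 1]
```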
Thus, the worst seek time incurred between two block retrievals is Seek(#cyl/R). Furthermore, the system observes a long seek, Seek(#cyl), once every R time periods (i.e., after sweeping all R regions) to reposition the head to the outermost cylinder. To compensate for this, the system must ensure that after every R block retrievals, enough data has been prefetched on behalf of each display to eclipse a delay equivalent to Seek(#cyl). There are several ways of achieving this effect. One might force the first block, along with every Rth block thereafter, to be slightly larger than the other blocks. We describe OREO based on a fixed-size block approach that renders all blocks equisized. With this approach, every block is padded so that after every R block retrievals, the system has enough data to eclipse the Seek(#cyl) delay. Thus, the duration of a time period is

Tp = N × B/RD + N × Seek(#cyl/R) + Seek(#cyl)/R

By substituting Tp from Eq. 2, we solve for B to obtain

Boreo = [RC × RD / (RD − N × RC)] × [N × Seek(#cyl/R) + Seek(#cyl)/R]                (8)
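As with the earlier schemas, Eq. (8) is easy to evaluate numerically. The following sketch reuses the example disk parameters and the seek model of Eq. (1), and is intended only to illustrate how the block size shrinks as the number of regions R grows.

```python
# Illustrative evaluation of Eq. (8): OREO block size as a function of R.

RD, RC, NUM_CYL = 68.0, 4.0, 2697   # example parameters from this section

def seek_s(c: float) -> float:
    """Seek time in seconds for a distance of c cylinders, per Eq. (1)."""
    if c < 108:
        return (1.5 + 0.510276 * c ** 0.5) / 1000.0
    return (6.5 + 0.004709 * c) / 1000.0

def b_oreo(n: int, r: int) -> float:
    """Block size (Mb) from Eq. (8)."""
    return (RC * RD / (RD - n * RC)) * (n * seek_s(NUM_CYL / r) + seek_s(NUM_CYL) / r)

for r in (1, 4, 16):
    print(r, b_oreo(15, r) / 8.0)    # block size in MB shrinks as R grows
```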
By comparing Eq. 8 with Eq. 4, observe that OREO reduces the upper bound on the distance between two blocks retrieved consecutively by a factor of 1/R.

Introducing regions to reduce the seek time increases the average latency observed by a request. This is because during each time period, the system can initiate the display of only those objects that correspond to the active region. To illustrate this, consider Fig. 6. In Fig. 6(a), Y is stored starting with R2, while the assignment of both X and Z starts with R0. Assume that the system can support three simultaneous displays (N = 3). Moreover, assume a request arrives at time T1, referencing object X. This causes region R0 to become active. Now, if a request arrives during T1 referencing object Y, it cannot be serviced until the third time period even though sufficient disk bandwidth is available [see Fig. 6(b)]. Its display is delayed by two time periods until the disk head moves to the region that contains Y0 (R2). In the worst case, assume that (1) a request arrives referencing object Z when Ractive = R0, and (2) the request arrives when the system has already missed the empty slot in the time period corresponding to R0. Hence, R + 1 time periods are required before the disk head reaches R0 in order to start servicing the request. Hence, the maximum startup latency is computed as

ℓoreo = (R + 1) × Tp   if R > 2
ℓoreo = 2 × Tp         if R = 2                (9)
ℓoreo = Tp             if R = 1

Figure 6. Latency time.

Hybrid: Disk Scheduling Plus Constrained Data Placement. In order to cover a wide spectrum of applications, GSS and OREO can be combined. Recall that with OREO, the placement of objects within a region is unconstrained. Hence, the distance between two blocks retrieved consecutively is bounded by the length of a region. However, one can introduce the concept of grouping the retrieval of blocks within a region. In this case, assuming a uniform distribution of blocks across a region's surface, the distance between every two blocks retrieved consecutively is bounded by #cyl × g/(N × R). Hence

Tp = N × B/RD + N × Seek[#cyl × g/(N × R)] + Seek(#cyl)/R

By substituting Tp from Eq. 2, we solve for B to obtain

Bcombined = [RC × RD / (RD − N × RC)] × {N × Seek[#cyl × g/(N × R)] + Seek(#cyl)/R}                (10)

Observe that, with OREO + GSS, both reduction factors of GSS and OREO are applied to the upper bound on the distance between any two consecutively retrieved blocks (compare Eq. 10 with both Eqs. 7 and 8). The maximum startup latency observed with OREO + GSS is identical to that of OREO when R > 1 (see Eq. 9).
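The latency formulas quoted so far can be collected into a small helper for comparison. Note that Tp itself differs across the techniques because each uses a different block size, so the value passed in below is only an example.

```python
# Hedged sketch of the maximum-startup-latency formulas: Eq. (5) for
# round-robin, Eq. (6) for GSS, and Eq. (9) for OREO (and, per the text,
# for OREO + GSS when R > 1). Tp is taken as a given time-period length.

def latency_round_robin(tp: float) -> float:
    return tp                                    # Eq. (5)

def latency_gss(tp: float, g: int) -> float:
    return tp + tp / g                           # Eq. (6)

def latency_oreo(tp: float, r: int) -> float:    # Eq. (9)
    if r > 2:
        return (r + 1) * tp
    return 2 * tp if r == 2 else tp

tp = 2.0  # seconds, as in the 1 MB / 4 Mb/s example (illustrative only)
print(latency_round_robin(tp), latency_gss(tp, g=3), latency_oreo(tp, r=6))
```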
Buffer Management. The technique employed to manage memory impacts the amount of memory required to support N simultaneous displays. A simple approach to manage memory is to assign each user two dedicated blocks of memory: one for the retrieval of data from disk to memory and the other for the delivery of data from memory to the display station. Trivially, the data is retrieved into one block while it is consumed from the other. Subsequently, the roles of these two blocks are switched. The amount of memory required with this technique is

Munshared = 2 × N × B                (11)
Figure 7. Memory requirement per subcycle.
Note that B is different for the alternative display techniques: Bgss with GSS, Boreo with OREO, and Bcombined with OREO + GSS.

An alternative approach, termed coarse-grain memory sharing, reduces the amount of required memory by sharing blocks among users. It maintains a shared pool of free blocks. Every task (either the retrieval or the display task of an active request) allocates blocks from the shared pool on demand. Once a task has exhausted the contents of a block, it frees the block by returning it to the shared pool. As described later, when compared with the simple approach, coarse-grain sharing results in a lower memory requirement as long as the system employs GSS with the number of groups (g) smaller than N.

The highest degree of sharing is provided by fine-grain memory sharing. With this technique, the granularity of memory allocation is reduced to a memory page. The size of a block is a multiple of the page size. If P denotes the memory page size, then B = mP, where m is a positive integer. The system maintains a pool of memory pages (instead of blocks, as with coarse-grain sharing), and tasks request and free pages instead of blocks. In the following, we describe the memory requirement of each display technique with both fine- and coarse-grain sharing. The memory requirement of the round-robin schema is eliminated because it is a special case of GSS (g = N) and OREO (R = 1).

Coarse-Grain Sharing (CGS). The total amount of memory required by a display technique that employs both GSS and coarse-grain memory sharing is
Mcoarse = (N + N/g) × B                (12)

To support N simultaneous displays, the system employs N blocks for the N displays and N/g blocks for data retrieval on behalf of the group that reads its blocks from disk. To illustrate, consider the example of Fig. 7, where g = 2 and N = 4. From Eq. 12, this requires 6 blocks of memory [see (13) for the derivation of Eq. 12]. This is because the display of X1 and Y1 completes at the beginning of subcycle 2 in the second time period; these blocks can be swapped out in favor of Z2 and W2. Note that the system would have required 8 blocks without coarse-grain sharing. OREO and OREO + GSS can also employ Eq. 12 because the memory requirement of OREO is a special case of that with GSS. However, the block size (B) computed for each approach is different: Bgss with GSS, Boreo with OREO,
and Bcombined with OREO + GSS (see the earlier sections for the computation of the block size with each display technique).

Fine-Grain Sharing (FGS). We describe the memory requirement of fine-grain sharing with a display technique that employs GSS; this discussion is applicable to both OREO and OREO + GSS. When compared with coarse-grain sharing, fine-grain sharing reduces the amount of required memory because, during a subcycle, the disk produces a portion of some blocks while the active displays consume portions of other blocks. With coarse-grain sharing, a partially consumed block cannot be reused until it becomes completely empty. However, with fine-grain sharing, the system frees up the pages of a block that have been partially displayed. These pages can be used by other tasks that read data from disk.

Modeling the amount of memory required with FGS is more complex than with CGS. While it is possible to compute the precise amount of required memory with CGS, this is no longer feasible with FGS. This is because CGS frees blocks at the end of each subcycle, where the duration of a subcycle is fixed. However, FGS frees pages during a subcycle, and it is not feasible to determine when the retrieval of a block ends within a subcycle because the seek times incurred in a group are unpredictable. Therefore, we model the memory requirement within a subcycle for the worst-case scenario. Let t denote the time required to retrieve all the blocks in a group. Theoretically, t can be a value between 0 and the duration of a subcycle; that is, 0 ≤ t ≤ Tp/g. We first compute the memory requirement as a function of t and then discuss the practical value of t. We introduce t to generate another end point (besides the end of a subcycle) where the memory requirement can be modeled accurately. The key observation is that between t and the end of the subcycle, nothing is produced on behalf of a group, while the display of requests in a subcycle continues at a fixed rate of RC. Hence, we model the memory requirement for the worst case where all the blocks are produced, in order to eliminate the problem of the unpredictability of each block retrieval time in a subcycle. Assuming Si is the end of subcycle i, the maximum amount of memory required by a group is at Si + t because the maximum amount of data has been produced, and the minimum amount consumed, at this point. Observe that at a point x, where Si + t < x ≤ Si+1, data is only consumed, reducing the amount of required memory. Moreover, at a point y, where Si ≤ y < Si + t, data is still being produced.
The number of pages produced (required) during t is

produced = ⌈N/g⌉ × m                (13)

The number of pages consumed (released) during t is

consumed = ⌊(t/Tp) × N × m⌋                (14)

This is because the amount of data consumed during a time period is N × m pages, and hence the amount consumed during t is t/Tp of N × m pages. We use the floor function for consumption and the ceiling function for production because the granularity of memory allocation is in pages. Hence, neither a partially consumed page (floor function) nor a partially produced page (ceiling function) is available on the free list. Moreover, m is inside the floor function because the unit of consumption is a page, while it is outside the ceiling function because the unit of production is a block.

One might argue that the amount of required memory is the difference between the volume of data produced and consumed. This is an optimistic view that assumes everything produced before Si has already been consumed. However, in the worst case, all the N displays might start simultaneously at time period j, Tp(j). Hence, the amount of data produced during Tp(j) is higher than the amount consumed. This is because production starts during the first subcycle of Tp(j), while consumption starts at the beginning of the second subcycle. It is sufficient to compute this remainder (rem) and add it to produced − consumed in order to compute the total memory requirement, because all the produced data is consumed after Tp(j). To compute rem, Fig. 8 divides Tp(j) into g subcycles and demonstrates the amount of produced and consumed data during each subcycle. The total amount produced during each time period is N × m pages. During the first subcycle, there is nothing to consume. For the other g − 1 subcycles, 1/g of what has been produced can be consumed. Hence, from the figure, the total consumption during Tp(j) is (N × m/g) × (1/g) × [1 + 2 + ··· + (g − 1)]. By substituting 1 + 2 + ··· + (g − 1) with g(g − 1)/2, rem can be computed as

rem = N × m − [N × m × (g − 1)] / (2g)                (15)

Figure 8. Memory requirement of the jth time period.

The total memory requirement is produced − consumed + rem, or

Mfine = N × m + (N/g) × m − (t × N × m)/Tp − [N × m × (g − 1)] / (2g)                (16)
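To make the memory formulas concrete, the following sketch evaluates Eqs. (11), (12), and (16) for the Fig. 7 example (N = 4, g = 2). The block size, page count per block, and disk rate are assumed values, and the ceiling applied to N/g in Eq. (12) for the fractional case is an assumption.

```python
import math

# Illustrative comparison of the memory requirements: unshared (Eq. 11),
# coarse-grain sharing (Eq. 12), and the fine-grain approximation (Eq. 16).

def m_unshared(n, b):
    return 2 * n * b                                  # Eq. (11)

def m_coarse(n, g, b):
    return (n + math.ceil(n / g)) * b                 # Eq. (12); ceiling assumed when N/g is fractional

def m_fine(n, g, m, t, tp):
    """Pages required under fine-grain sharing, Eq. (16), floors/ceilings dropped."""
    return n * m + (n / g) * m - (t * n * m) / tp - (n * m * (g - 1)) / (2 * g)

n, g, b_mb = 4, 2, 1.0         # N displays, g groups, 1 MB blocks (example values)
m, tp = 16, 2.0                # 16 pages per block, 2 s time period (example values)
t = (n / g) * b_mb * 8 / 68.0  # practical minimum t, cf. Eq. (17), with RD = 68 Mb/s

print(m_unshared(n, b_mb), m_coarse(n, g, b_mb))      # 8.0 vs 6.0 blocks of memory
print(m_fine(n, g, m, t, tp) / m)                     # fine-grain requirement, in block units
```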
Note that Eq. 16 is an approximation because we eliminated the floor and ceiling functions from the equation. For large values of m, the approximation is almost identical to the actual computation. An interesting observation is that if the size of a page is equal to the size of a block, then Eq. 16 reduces to Eq. 12. This is because the last two terms in Eq. 16 correspond to the number of pages released during t and during the first time period, respectively. Since, with coarse-grain sharing, no pages are released during these two periods, the last two terms of Eq. 16 become zero, producing Eq. 12.

The minimum value of t is obtained when all the N/g blocks are placed contiguously on the disk surface. The time required to retrieve them is the practical minimum value of t and is computed as

tpractical = (N/g) × (B/RD)                (17)

The number of groups g impacts the memory requirement with both coarse- and fine-grain sharing in two ways. First, as one increases g, the memory requirement of the system decreases because the number of blocks staged in memory is N/g. On the other hand, this results in a larger block size in order to support the desired number of users, resulting in a higher memory requirement. Thus, increasing g might result in either a higher or a lower memory requirement. Yu et al. (13) suggest an exhaustive search technique to determine the optimal value of g (1 ≤ g ≤ N) in order to minimize the entire memory requirement for a given N.

An implementation of FGS (beyond the focus of this article) must address how the memory is managed. This is because memory might become fragmented when the pages of a block are allocated and freed incrementally. With fragmented memory, either (1) the disk interface should be able to read a block into m disjoint pages, or (2) the memory manager must bring m consecutive pages together to provide the disk manager with m physically contiguous pages to read a block into. The first approach would compromise the portability of the final system because it entails modifications to the disk interface.
With the second approach, one may implement either a detective or a preventive memory manager. A detective memory manager waits until memory becomes fragmented before reorganizing memory to eliminate this fragmentation. A preventive memory manager avoids the possibility of memory fragmentation by controlling how the pages are allocated and freed. When compared with each other, the detective approach requires more memory than the preventive one (and would almost certainly require more memory than the equations derived in this section). However, the preventive approach would most likely incur a higher CPU overhead because it checks the state of memory on every page allocation/release.

Multi-Zone Disks. To guarantee a continuous display, the techniques described in the previous section must assume the transfer rate of the innermost zone. An alternative is to constrain the layout of data across the disk surface and the schedule for its retrieval (25,26,11). This section describes three such techniques. With all of these techniques, there is a tradeoff between the throughput, the amount of required memory, and the incurred startup latency. We illustrate these tradeoffs starting with a brief overview of two techniques and then discussing the third approach in detail.

Track pairing (25) organizes all tracks into pairs such that the total capacity of all pairs is identical. When displaying a clip, a track pair is retrieved on its behalf per time period [alternative schedules are detailed in (25)]. Similar to the GSS discussion earlier, the system can manipulate the retrieval of physical tracks on behalf of multiple active displays to minimize the impact of seeks.

Assuming that the number of tracks in every zone is a multiple of some fixed number, (26) constructs Logical Tracks (LT) from the same-numbered physical track of the different zones. The order of tracks in an LT is by zone number. When displaying a clip, the system reads an LT on its behalf per time period. An application observes a constant disk transfer rate for each LT retrieval. This forces the disk to retrieve data from the constituting physical tracks in immediate succession, in zone order. Recall that we assumed a logical disk drive that consists of d physical disks. If d equals the number of zones (m), each physical track of an LT can be assigned to a different disk. This facilitates concurrent retrieval of the physical
tracks that constitute an LT. Similarly, to facilitate concurrent retrieval with track pairing, if d is an even number, then the disks can be paired such that track i of one disk is paired with track #cyl − i of another disk.

The third approach, as detailed in (11), organizes a clip at the granularity of blocks (instead of tracks). We describe two variations of this approach that guarantee a continuous display while harnessing the average transfer rate of the m zones, namely, FIXed Block size (FIXB) and VARiable Block size (VARB) (11). These two techniques assign the blocks of an object to the available zones in a round-robin manner, starting with an arbitrary zone. With both techniques, there is a family of scheduling techniques that ensure a continuous display. One might require the disk to retrieve m blocks of an object assigned to m different zones in one sweep. Assuming that N displays are active, this scheduling paradigm would result in N disk sweeps per time period and a substantial amount of buffer space. (With this scheduling paradigm, VARB would be similar to LT (26).) An alternative scheduling paradigm might multiplex the bandwidth of each zone among the N displays and visit the zones in a round-robin manner. It requires the amount of data retrieved during one scan of the disk on behalf of a display (m time periods that visit all the zones, starting with the outermost zone) to equal that consumed by the display. This reduces the amount of required memory. However, it results in a higher startup latency. We will focus on this scheduling paradigm for the rest of this section.

To describe the chosen scheduling technique, assume a simple display scheme (GSS with g = N). We choose the block size (the unit of transfer from a zone on behalf of an active display) to be a function of the transfer rate of each zone such that the higher transfer rate of the fast zones compensates for that of the slower zones. The display time of the block that is retrieved from the outermost zone (Z0) on behalf of a display exceeds the duration of a time period (Tp(Z0)); see Fig. 9. Thus, a portion of the block remains memory resident in the available buffer space. During the time period when the innermost zone is active, the amount of data displayed exceeds the block size, consuming the buffered data. In essence, the display of data is synchronized relative to the retrieval from the outermost zone. This implies that if the first block of an object is assigned to the zones starting with a zone other than the innermost zone, then its display might be delayed relative to its retrieval in order to avoid hiccups.

Figure 9. Memory required on behalf of a display with FIXB.

Both FIXB and VARB waste disk space due to (1) the round-robin assignment of blocks to zones, and (2) a nonuniform distribution of the available storage space among the zones. (In general, the outermost zones have a higher storage capacity relative to the innermost zones.) Once the storage capacity of a zone is exhausted, no additional blocks can be assigned to the remaining zones because doing so would violate the round-robin assignment of blocks. Table 1 compares FIXB and VARB by reporting the amount of required memory, the maximum incurred startup latency assuming fewer than N active displays, and the percentage of wasted disk space and disk bandwidth with the Seagate ST31200W disk.

Table 1. Seagate ST31200W Disk

                           FIXB                                              VARB
  N    Mem. (MB)  Max ℓ (s)  % Wasted     % Avg wasted    Mem. (MB)  Max ℓ (s)  % Wasted     % Avg wasted
                             disk space   band.                                 disk space   band.
  1      0.007      0.44       58.0         94.1            0.011      0.44       40.4         94.3
  2      0.023      0.92       58.0         88.2            0.044      0.92       40.4         88.6
  4      0.107      2.10       58.0         76.5            0.192      2.08       40.4         77.3
  8      0.745      6.03       58.1         53.1            1.059      5.88       40.4         54.6
 10      1.642      9.66       58.1         41.4            2.078      9.26       40.5         43.2
 12      3.601     16.15       58.2         29.7            4.040     15.06       40.6         31.9
 13      5.488     21.77       58.3         23.9            5.759     19.84       40.4         26.2
 14      8.780     31.05       58.0         18.0            8.511     27.26       40.6         20.5
 15     15.515     49.23       58.4         12.1           13.481     40.34       41.0         14.8
 16     35.259    100.9        58.3          6.3           24.766     69.53       41.3          9.2
 17    536.12    1392.5        73.8          0.4           72.782    192.4        42.2          3.5

The percentage of wasted disk space with both FIXB and VARB is dependent on the physical characteristics of the zones. While VARB wastes a lower percentage of the Seagate disk space, it wastes a higher percentage of another analyzed disk, the HP C2247 (11). If we assumed the transfer rate of the innermost zone as the transfer rate of the entire disk and employed the earlier discussion to guarantee a continuous display, the system would support twelve simultaneous displays, require 16.09 Mbytes of memory, and incur a maximum startup latency of 7.76 s. A system that employs either FIXB or VARB supports twelve displays while requiring less memory and incurring a higher startup latency; see the sixth row of Table 1. For a high number of simultaneous displays (16 and 17; see the last two rows of Table 1), when compared with FIXB, VARB requires a lower amount of memory and incurs a lower maximum startup latency. This is because VARB determines the block size as a function of the transfer rate of each zone, while FIXB determines the block size as a function of the average transfer rate of the zones, that is, one fixed-size block for all zones. Thus, the average block size chosen by VARB is smaller than the block size chosen by FIXB for a fixed number of users. This reduces the amount of time required to scan the disk (TScan in Fig. 9), which, in turn, reduces both the amount of required memory and the maximum startup latency; see (11) for details.

A system designer is not required to configure either FIXB or VARB using the vendor-specified zone characteristics.
Instead, one may logically manipulate the number of zones and their specifications by either (1) merging two or more physically adjacent zones into one and assuming the transfer rate of this logical zone to equal that of the slowest participating physical zone, or (2) eliminating one or more of the zones. For example, one might merge zones 0 to 4 into one logical zone, zones 5 to 12 into a second logical zone, and eliminate zones 13 to 23 altogether. With this logical zone organization, VARB supports 17 simultaneous displays by requiring 10 Mbytes of memory and observing a maximum startup latency of 6.5 s; the amount of required memory is lower than that required by the approach that assumes the transfer rate of the innermost zone for the entire disk (see the previous paragraph). Compared with a system that assumes the transfer rate of the innermost zone as the transfer rate of the entire disk, the throughput of the system is increased by 40% at the expense of wasting 35% of the available disk space. A more intelligent arrangement might even outperform this one. The definition of outperform is application dependent. A configuration planner that employs heuristics to strike a compromise between the conflicting factors is described in (11).

Multiple Disk Drives

The bandwidth of a single disk is insufficient for those applications that strive to support thousands of simultaneous displays. One may employ a multidisk architecture for these applications. In this article, we assume a homogeneous set of disk drives; that is, all the disk drives have an identical transfer rate of RD. The issues discussed here can be extended to a heterogeneous platform; interested readers are encouraged to consult (27).

Multidisk environments raise the interesting research problem of data placement, that is, on which disk (or set of disks) a single clip should be stored. A simple technique is to assign a clip to a disk in its entirety. The drawback is that a single disk storing all the hot objects might become a bottleneck. In (28), we proposed a detective mechanism to detect a bottleneck and resolve it by replicating the hot object(s) onto the other disk drives. However, we have learned that both partitioning the resources and a detective mechanism are not appropriate for such a setup. For the rest of this section, we describe three alternative data placement techniques in a
multidisk hardware platform. These techniques neither partition resources nor employ detective mechanisms to balance the load. That is, each CM object is striped across all the disk drives. Hence, resources (disk drives) are not partitioned, the load is distributed evenly across all the drives, and there is no need to detect bottlenecks to resolve load imbalances (i.e., bottlenecks are prevented rather than detected). The difference among the three techniques is mainly in their retrieval schedule.

RAID Striping. One way to utilize multiple disks is to follow the RAID (29) architecture. This has been done in Streaming RAID (30). Briefly, each block of an object is striped into fragments, where the fragments are assigned to the disks in a round-robin manner. For example, in Fig. 10(a), block X0 is declustered across the 3 disks. Each fragment of this block is denoted as X0,i, 0 ≤ i < D. Given a platform consisting of D disk drives, to retrieve a block, all D disk drives transfer the corresponding fragments simultaneously. Subsequently, the block is formed from the fragments in memory and sent to the client. Conceptually, a RAID cluster can be considered a single logical disk drive. Hence, the display techniques and memory requirements discussed earlier can be applied here with almost no modification. In theory, a RAID cluster consisting of D disk drives has a sustained transfer rate of D × RD. However, in practice, as one increases D, the seek time dominates the sustained transfer rate. This is because the seek time is fixed and is not improved as D increases. Therefore, as D grows, the RAID system spends a higher percentage of time doing seeks (wasteful work) as opposed to data transfer (useful work). In summary, the RAID architecture is not scalable in throughput [see Fig. 11(a)].

Figure 10. RAID (d = D) vs. round-robin retrieval (d = 1).

Figure 11. RAID (d = D) vs. round-robin retrieval (d = 1): (a) throughput and (b) maximum latency time as a function of the factor of increase in resources (memory + disk).

Round-Robin Retrieval. With round-robin retrieval, the blocks of an object X are assigned to the D disk drives in a round-robin manner. The assignment of X0 starts with an arbitrary disk. Assuming a system with three disk drives, Fig. 10(b) demonstrates the assignment of the blocks of X with this choice of value. When a request references object X, the system employs the idle slot on the disk that contains X0 (say di) to display its first block. Before the display of X0 completes, the system employs disk d(i+1) mod D to display X1. This process is repeated until all blocks of the object have been retrieved and displayed. This can be considered as if the system supports D simultaneous time periods, one per disk drive. Hence, the display techniques and memory requirements discussed earlier can be applied here with straightforward modifications [see (31)]. The throughput of the system (i.e., the maximum number of displays) scales linearly as a function of additional resources in the system. However, its maximum latency also scales linearly (see Fig. 11). To demonstrate this, assume that each disk in Fig. 10(b) can support three simultaneous displays. Assume that eight displays are active and that the assignment of object X starts with d0 (X0 resides on d0). If the request referencing object X arrives too late to utilize the idle slot of d0, it must wait three (i.e., D) time periods before the idle slot on d0 becomes available again (see Fig. 12). Hence, the maximum latency time is Tp × D.

Figure 12. Maximum latency time with striping.

Hybrid (Disk Clusters). As mentioned, neither RAID striping nor round-robin retrieval scales as one increases resources. In (31–33), we proposed a hybrid approach. Hybrid striping partitions the D disks into k clusters of disks, with each cluster consisting of d disks: k = D/d. Next, it assigns the blocks of object X to the clusters in a round-robin manner. The first block of X is assigned to an arbitrarily chosen disk cluster. Each block of an object is declustered (34) across the d disks that constitute a cluster. For example, in Fig. 13, a system consisting of six disks is partitioned into three clusters, each consisting of two disk drives. The assignment of the blocks of X starts with cluster 0. This block is declustered into two fragments: X0.0 and X0.1. When a request references object X, the system employs the idle slot on the cluster that contains X0 (say Ci) to display its first block. Before the display of X0 completes, the system employs cluster C(i+1) mod k to display X1. This process is repeated until all blocks of the object have been retrieved and displayed. Note that hybrid striping simulates RAID striping when d = D and round-robin retrieval when d = 1. By varying the number of disk drives within a cluster as well as the number of clusters, hybrid striping can strike a compromise between throughput and latency time. Given a desired throughput and latency (Ndesired, ℓdesired), (31) describes a configuration planner that determines a value for the configuration parameters of a system. The value of these parameters is chosen such that the total cost of the system is minimized.

Figure 13. Hybrid striping.
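The three layouts can be summarized by a tiny placement function. Disk counts and block indices below are illustrative only (d = D corresponds to RAID striping, d = 1 to round-robin retrieval, and intermediate d to hybrid striping); the function is a sketch rather than the article's implementation.

```python
# Minimal sketch of the striping layouts discussed above. With k = D/d clusters,
# the maximum startup latency grows with k (Tp x k), while the per-request
# bandwidth grows with d; d thus trades latency against throughput.

def placement(block: int, D: int, d: int):
    """Return (cluster, disks) holding the given block of an object whose
    assignment starts at cluster 0; each block is declustered over d disks."""
    k = D // d                        # number of clusters
    cluster = block % k               # round-robin assignment of blocks to clusters
    disks = list(range(cluster * d, cluster * d + d))
    return cluster, disks

D = 6
for d in (D, 1, 2):                   # RAID, round-robin retrieval, hybrid (cf. Fig. 13)
    print(f"d={d}:", [placement(i, D, d) for i in range(4)])
```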
Hierarchical Storage Management
The storage organization of systems that support multimedia applications is expected to be hierarchical, consisting of a tertiary storage device, a group of disk drives, and some memory (32). The database resides permanently on the tertiary storage device, and its objects are materialized on the
disk drives on demand (and deleted from the disk drives when the disk storage capacity is exhausted). A small fraction of a referenced object is staged in memory to support its display. Assume a hierarchical storage structure consisting of random access memory (DRAM), magnetic disk drives, optical disks, and a tape library (35) (see Fig. 15). As the different strata of the hierarchy are traversed starting with memory (termed stratum 0), both the density of the medium (the amount of data it can store) and its latency increase, while its cost per megabyte of storage decreases. At the time of this writing, these costs vary from $40/Mbyte of DRAM to $0.6/ Mbyte of disk storage to $0.3/Mbyte of optical disk to less than $0.05/Mbyte of tape storage. An application referencing an object that is disk resident observes both the average latency time and the delivery rate of a magnetic disk drive (which is superior to that of the tape library). An application would observe the best performance when its working set becomes resident at the highest level of the hierarchy: memory. However, in our assumed environment, the magnetic disk drives are the more likely staging area for this working set due to the large size of objects. Typically, memory would be used to stage a small fraction of an object for immediate processing and display. We define the working set (36) of an application as a collection of objects that are repeatedly referenced. For example, in existing video stores, a few titles are expected to be accessed frequently and a store maintains several (sometimes many) copies of these titles to satisfy the expected demand. These movies constitute the working set of a database system whose application provides a video-on-demand service. One might be tempted to replace the magnetic disk drives with the tertiary storage devices in order to reduce the cost further. This is not appropriate for the frequently referenced objects that require a fraction of a second transfer initiation
delays, that is, the time elapsed from when a device is activated until it starts to produce data. This delay is determined by the time required for a device to reposition its read head to the physical location containing the referenced data; this time is significantly longer for a tertiary storage device (ranging from several seconds to minutes) than for a magnetic disk drive (ranging from 10 to 30 ms). Similarly, the tertiary storage device should not be replaced by magnetic disk drives because (1) the cost of storage increases, and (2) it might be acceptable for some applications to incur a high latency time for infrequently referenced objects.

In this section, we first describe the different data flows among the three storage components of the system (memory, disk, tertiary). Next, a pipelining mechanism is explained to reduce the startup latency when a request references a tertiary-resident CM object. Finally, we describe some techniques to manage the disk storage space. Note that the assumed hardware architecture is identical to that of Space Management.

Data Flows. Assuming an architecture that consists of some memory, several disk drives, and a tertiary storage device, two alternative organizations of these components can be considered: (1) the memory serves as an intermediate staging area between the tertiary storage device, the disk drives, and the display stations, or (2) the tertiary storage device is visible only to the disk drives via a fixed-size memory. With the first organization, the system may elect to display an object from the tertiary storage device by using the memory as an intermediate staging area. With the second organization, the data must first be staged on the disk drives before it can be displayed. In (32), we capture these two organizations using three alternative paradigms for the flow of data among the different components:

• Sequential Data Flow (SDF): The data flows from tertiary to memory (STREAM 1 of Fig. 14), from memory to the disk drives (STREAM 2), from the disk drives back to memory (STREAM 3), and finally from memory to the display station referencing the object (STREAM 4).
• Parallel Data Flow (PDF): The data flows from tertiary to memory (STREAM 1), and from memory to both the disk drives and the display station in order to materialize (STREAM 2) and display (STREAM 4) the object simultaneously. (PDF eliminates STREAM 3.)
• Incomplete Data Flow (IDF): The data flows from tertiary to memory (STREAM 1) and from memory to the
display station (STREAM 4) to support a continuous retrieval of the referenced object. (IDF eliminates both STREAM 2 and STREAM 3.)

Figure 14. Three alternative dataflow paradigms.

Figure 14 models the second architecture (tertiary storage is accessible only to the disk drives) by partitioning the available memory into two regions: one region serves as an intermediate staging area between tertiary and the disk drives (used by STREAM 1 and 2), while the second serves as a staging area between the disk drives and the display stations (used by STREAM 3 and 4). SDF can be used with both architectures. However, neither PDF nor IDF is appropriate for the second architecture because the tertiary is accessible only to the disk drives. When the bandwidth of the tertiary storage device is lower than the bandwidth required by an object, SDF is more appropriate than both PDF and IDF because it minimizes the amount of memory required to support a continuous display of an object. IDF is ideal for cases where the expected future access to the referenced object is so low that it should not become disk resident (i.e., IDF prevents this object from replacing other disk-resident objects).

Pipelining Mechanism. With a hierarchical storage organization, when a request references an object that is not disk resident, one approach might materialize the object on the disk drives in its entirety before initiating its display. In this case, assuming a zero system load, the latency time of the system is determined by the time for the tertiary to reposition its read head to the starting address of the referenced object, the bandwidth of the tertiary storage device, and the size of the referenced object. Assuming that the referenced object is continuous media (e.g., audio, video) and requires a sequential retrieval to support its display, a superior alternative is to use pipelining (32) in order to minimize the latency time. Briefly, the pipelining mechanism splits an object into s logical slices (S1, S2, S3, . . ., Ss) such that the display time of S1 overlaps the time required to materialize S2, the display time of S2 overlaps the time to materialize S3, and so on and so forth. This ensures a continuous display while reducing the latency time because the system initiates the display of an object once a fraction of it (i.e., S1) becomes disk resident.

With pipelining, two possible scenarios might happen: the bandwidth of the tertiary is either (1) lower or (2) higher than the bandwidth required to display an object. (The discussion for the case when the bandwidth of tertiary is equivalent to the display rate is a special case of item (2).) The ratio between the
production rate of tertiary and the consumption rate at a display station is termed the Production Consumption Ratio (PCR = BTertiary/BDisplay). When PCR < 1, the time required to materialize an object is greater than its display time. Neither PDF nor IDF is appropriate because the bandwidth of tertiary cannot support a continuous display of the referenced object (assuming that the size of the first slice exceeds the size of memory). With SDF, the time required to materialize X is n/PCR time periods, while its display requires n time periods. If X is tertiary resident, then without pipelining, the latency time incurred to display X is n/PCR + 1 time periods. (Plus one because an additional time period is needed both to flush the last subobject to the disk cluster and to allow the first subobject to be staged in the memory buffer for display.) To reduce this latency time, a portion of the time required to materialize X can be overlapped with its display time.

When PCR > 1, the bandwidth of tertiary exceeds the bandwidth required to display an object. Two alternative approaches can be employed to compensate for the fast production rate: either (1) multiplex the bandwidth of tertiary among several requests referencing different objects, or (2) increase the consumption rate of an object by reserving more time slots per time period to render that object disk resident. The first approach wastes the tertiary bandwidth because the device is required to reposition its read head multiple times. The second approach utilizes more resources in order to prevent the tertiary device from repositioning its read head. For a more detailed description of pipelining, please refer to (32).

Figure 15. Hierarchical storage system.

Space Management. In general, assuming that the storage structure consists of n strata, we assume that the database resides permanently on stratum n − 1. For example, Fig. 15 shows a system with four strata in which the database resides on stratum 3. Objects are swapped in and out of a device at strata i < n based on their expected future access patterns, with the objective of minimizing the frequency of access to the slower devices at higher strata. This objective minimizes the average latency time incurred by requests referencing objects. At some point during the normal mode of operation, the storage capacity of the device at stratum i will be exhausted. Once an object ox is referenced, the system may determine
that the expected future reference to ox is such that it should reside on a device at this stratum. In this case, other objects should be swapped out in order to allow ox to become resident here. In this section, we focus on how to manage the disk space when objects are migrated in and out of the disk storage from the tertiary storage. We describe two orthogonal techniques. The first, termed EVEREST, manages the blocks of a continuous media object; it approximates a contiguous layout of a file (i.e., a block). The second technique, PIRATE, manages the entire CM object. Each block of the CM object, however, can be managed employing EVEREST. PIRATE is a replacement technique that replaces CM objects partially, striving to keep the head of each object disk resident. Therefore, PIRATE is a nice complement to pipelining. As explained later, however, PIRATE is only appropriate for single-user systems. For multiuser systems, objects should be swapped in their entirety.

EVEREST. With a mix of media types, a CM file system might be forced to manage different block sizes. Moreover, the blocks of different objects might be staged from the tertiary storage device onto the magnetic disk storage on demand. A block should be stored contiguously on disk. Otherwise, the disk would incur seeks when reading a block, reducing disk bandwidth. Moreover, it might result in hiccups because the retrieval time of a block might become unpredictable. To ensure a contiguous layout of a block, we considered four alternative approaches: disk partitioning, extent-based allocation (37–39), multiple block sizes, and an approximate contiguous layout of a file. We chose the final approach, resulting in the design and implementation of the EVEREST file system (40). Below, we describe each of the other three approaches and our reasons for abandoning them.

With disk partitioning, assuming media types with different block sizes, the available disk space is partitioned into regions, one region per media type. A region i corresponds to media type i. The space of this region is partitioned into fixed-size blocks, corresponding to B(Mi). The objects of media type i compete for the available blocks of this region. The amount of space allocated to a region i might be estimated as a function of both the size and the frequency of access of the objects of media type i (41). However, partitioning of disk space is inappropriate for a dynamic environment where the frequency of access to the different media types might change as a function of time. This is because when a region becomes cold, its space should be made available to a region that has become hot. Otherwise, the hot region might start to exhibit thrashing (42) behavior that would increase the number of retrievals from the tertiary storage device. This motivates a reorganization process to rearrange disk space. This process would be time consuming due to the overhead associated with performing I/O operations.

With an extent-based design, a fixed contiguous chunk of disk space, termed an extent, is partitioned into fixed-size blocks. Two or more extents might have different page sizes. Both the size of an extent and the number of extents with a prespecified block size (i.e., for a media type) are fixed at system configuration time. A single file may span one or more extents. However, an extent may contain no more than a single file. With this design, an object of a media type i is assigned one or more extents with block size B(Mi). In addition
to suffering from the limitations associated with disk partitioning, this approach suffers from internal fragmentation, with the last extent of an object being only partially occupied. This wastes disk space, increasing the number of references to the tertiary storage device.

With the multiple block size approach (MBS), the system is configured based on the media type with the lowest bandwidth requirement, say M1. MBS requires the block size of each media type j to be a multiple of B(M1); that is, B(Mj) = ⌈B(Mj)/B(M1)⌉ × B(M1). This might simplify the management of disk space so as to (1) avoid its fragmentation and (2) ensure the contiguous layout of each block of an object. However, MBS might waste disk bandwidth by forcing the disk to (1) retrieve more data on behalf of a stream per time period due to the rounding up of the block size and (2) remain idle during other time periods to avoid an overflow of memory. These effects are best illustrated using an example. Assume two media types, MPEG-1 and MPEG-2 objects, with bandwidth requirements of 1.5 Mbps and 4 Mbps, respectively. With this approach, the block size of the system is chosen based on MPEG-1 objects. Assume it is chosen to be 512 kbyte; B(MPEG-1) = 512 kbyte. This implies that B(MPEG-2) = 1365.33 kbytes. MBS would increase B(MPEG-2) to equal 1536 kbytes. To avoid an excessive amount of accumulated data at a client displaying an MPEG-2 clip, the scheduler might skip the retrieval of data one time period out of every nine time periods. The scheduler may not employ this idle slot to service another request because it is required during the next time period to retrieve the next block of the current MPEG-2 display. If all active requests are MPEG-2 video clips and a time period supports nine displays with B(MPEG-2) = 1536 kbytes, then with B(MPEG-2) = 1365.33 kbytes the system would support ten simultaneous displays (a 10% improvement in performance). In summary, the block size for a media type should approximate its theoretical value in order to minimize the percentage of wasted disk bandwidth.

The final approach, EVEREST, employs the buddy algorithm to approximate a contiguous layout of a file on the disk without wasting disk space. The number of contiguous chunks that constitute a file is a fixed function of the file size and the configuration of the buddy algorithm. Based on this information, the system can either (1) prevent a block from overlapping two noncontiguous chunks or (2) allow a block to overlap two chunks and require the client to cache enough data to hide the seeks associated with the retrieval of these blocks. Until now, we assumed the first approach. To illustrate the second approach, if a file consists of five contiguous chunks, then at most four blocks of this file might span two different chunks. This implies that the retrieval of four blocks will incur seeks, with at most one seek per block retrieval. To avoid hiccups, the scheduler should delay the display of the data at the client until it has cached enough data to hide the latency associated with four seeks. The amount of cached data is not significant. For example, assuming a maximum seek time of 20 ms, with MPEG-2 objects (4 Mbps) the client should cache 10 kbytes to hide each seek. However, this approach complicates the admission control policy because the retrieval of a block might incur either one or zero seeks.
With EVEREST, the basic unit of allocation is a page, also termed a section of height 0. (The size of a page has no impact on the granularity at which a process might read a section; this is detailed below.) EVEREST organizes these
Figure 16. Physical division of disk space into pages and the corresponding logical view of the sections, with an example base of ω = 2.
sections as a tree to form larger, contiguous sections. As illustrated in Fig. 16, only sections of size size(page) × ω^i (for i ≥ 0) are valid, where the base ω is a system configuration parameter. If a section consists of ω^i pages, then i is said to be the height of the section. ω height-i sections that are buddies (physically adjacent) might be combined to construct a height i + 1 section. To illustrate, the disk in Fig. 16 consists of 16 pages. The system is configured with ω = 2. Thus, the size of a section may vary from 1, 2, 4, 8, up to 16 pages. In essence, a binary tree is imposed upon the sequence of pages. The maximum height, computed by

S = log_ω (Capacity / size(page)),
is 4. To simplify the discussion, assume that the total number of pages is a power of ω. The general case can be handled similarly and is described below. With this organization imposed upon the device, sections of height i ≥ 0 cannot start at just any page number, but only at offsets that are multiples of ω^i. This restriction ensures that any section, with the exception of the one at height S, has a total of ω − 1 adjacent buddy sections of the same size at all times. With the base 2 organization of Fig. 16, each section has one buddy. With EVEREST, a portion of the available disk space is allocated to objects. The remainder, should any exist, is free. The sections that constitute the available space are handled by a free list. This free list is actually maintained as a sequence of lists, one for each section height. The information about an unused section of height i is enqueued in the list that handles sections of that height. In order to simplify object allocation, the following bounded list length property is always maintained: For each height i = 0, . . ., S, at most ω − 1 free sections of height i are allowed. Informally, this property implies that whenever there exists sufficient free space at the free list of height i, EVEREST must compact these free sections into sections of a larger height. A lazy variant of this scheme would allow these lists to grow longer and do compaction upon demand, that is, when large contiguous pages are required. This would be complicated, as a variety of choices might exist when merging pages. This would require the system to employ heuristic techniques to guide the search space of this merging process. However, to simplify the description, we focus on an implementation that observes the invariant described above.
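The bounded list length property lends itself to a compact illustration. The following Python fragment is a minimal sketch (not the EVEREST implementation itself): it keeps one list of free sections per height and, whenever ω sections accumulate at a height, collapses them into a single section one height up. The buddy checks and the data movement needed when the ω sections are not physically adjacent are deliberately omitted here; they are described below.

OMEGA = 2  # base of the buddy organization

class FreeLists:
    """One list of free-section start pages per height 0..max_height."""
    def __init__(self, max_height):
        self.max_height = max_height
        self.lists = {h: [] for h in range(max_height + 1)}

    def enqueue(self, height, start):
        """Add a free section, then restore the bounded list length property."""
        self.lists[height].append(start)
        # Whenever OMEGA sections accumulate at a height, compact them into
        # one section at the next height (this sketch assumes the merged
        # space can be made contiguous; see the swapping algorithm below).
        while height < self.max_height and len(self.lists[height]) >= OMEGA:
            merged = [self.lists[height].pop(0) for _ in range(OMEGA)]
            self.lists[height + 1].append(min(merged))
            height += 1

# Example: with OMEGA = 2, freeing pages 6 and 7 at height 0 leaves a single
# free section of height 1 that starts at page 6.
fl = FreeLists(max_height=4)
fl.enqueue(0, 6)
fl.enqueue(0, 7)
print(fl.lists[0], fl.lists[1])   # [] [6]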
The materialization of an object is as follows. The first step is to check whether the total number of pages in all the sections on the free list is greater than or equal to the number of pages (denoted no-of-pages(ox)) that the new object ox requires. If this is not the case, then one or more victim objects are elected and deleted. (The procedure for selecting a victim is based on heat (43). The deletion of a victim object is described further below.) Assuming enough free space is available at this point, ox is divided into its corresponding sections as follows. First, the number m = no-of-pages(ox) is converted to base ω. For example, if ω = 2 and no-of-pages(ox) = 13, then its binary representation is 1101. The full representation of such a converted number is m = d_{j−1} × ω^{j−1} + · · · + d_2 × ω^2 + d_1 × ω^1 + d_0 × ω^0. In our example, 1101 (binary) can be written as 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0. In general, for every digit d_i that is nonzero, d_i sections are allocated from height i of the free list on behalf of ox. In our example, ox requires 1 section from height 0, no sections from height 1, 1 section from height 2, and 1 section from height 3. For each object, the number of contiguous pieces, call it s, is equal to the number of ones in the binary representation of m or, with a general base ω, s = Σ_{i=0}^{j−1} d_i (where j is the total number of digits). Note that s is always bounded by ω log_ω m. For any object, s defines the maximum number of sections occupied by the object. (The minimum is 1 if all sections are physically adjacent.)

A complication arises when no section at the right height exists. For example, suppose that a section of size ω^i is required, but the smallest section larger than ω^i on the free list is of size ω^j (j > i). In this case, the section of size ω^j can be split into ω sections of size ω^{j−1}. If j − 1 = i, then ω − 1 of these are enqueued on the list of height i, and the remainder is allocated. However, if j − 1 > i, then ω − 1 of these sections are again enqueued at level j − 1, and the splitting procedure is repeated on the remaining section. It is easy to see that, whenever the total amount of free space on these lists is sufficient to accommodate the object, then for each section that the object occupies there is always a section of the appropriate size, or larger, on the list. This splitting procedure guarantees that the appropriate number of sections, each of the right size, will be allocated, and that the bounded list length property is never violated.
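The conversion of an object's page count into sections can be sketched just as briefly. The fragment below is an illustration rather than the actual allocator: it returns, for each height, how many sections an object of no_of_pages pages should occupy, matching the worked example above; splitting a larger free section when a height has no free section is only indicated by a comment.

OMEGA = 2

def sections_needed(no_of_pages, omega=OMEGA):
    """Base-omega digits of no_of_pages: {height: number of sections}."""
    digits = {}
    height = 0
    n = no_of_pages
    while n > 0:
        n, d = divmod(n, omega)
        if d:
            digits[height] = d   # allocate d sections of height `height`
        height += 1
    return digits

# If some height has no free section of its own, a larger free section would
# be split omega ways repeatedly, as described in the text above.
# Example from the text: omega = 2 and 13 pages (binary 1101) require one
# section each at heights 0, 2, and 3, i.e., at most three contiguous pieces.
print(sections_needed(13))                 # {0: 1, 2: 1, 3: 1}
print(sum(sections_needed(13).values()))   # 3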
When the system elects that an object must be materialized and there is insufficient free space, then one or more victims are removed from the disk. Reclaiming the space of a victim requires two steps for each of its sections. First, the section must be appended to the free list at the appropriate height. The second step ensures that the bounded list length property is not violated. Therefore, whenever a section is enqueued in the free list at height i and the number of sections at that height is equal to or greater than ω, then ω sections must be combined into one section at height i + 1. If the list at i + 1 now violates the bounded list length property, then once again space must be compacted and moved to height i + 2. This procedure might be repeated several times. It terminates when the length of the list for a higher height is less than ω. Compaction of ω free sections into a larger section is simple when they are buddies; in this case, the combined space is already contiguous. Otherwise, the system might be forced to exchange one occupied section of an object with one on the free list in order to ensure contiguity of an appropriate sequence of ω sections at the same height. The following algorithm achieves space-contiguity among ω free sections at height i.
1. Check if there are at least ω sections for height i on the free list. If not, stop.
2. Select the first section (denoted s_j) and record its page number (i.e., the offset on the disk drive). The goal is to free ω − 1 sections that are buddies of s_j.
3. Calculate the page numbers of s_j's buddies. EVEREST's division of disk space guarantees the existence of ω − 1 buddy sections physically adjacent to s_j.
4. For every buddy s_k, 0 ≤ k ≤ ω − 1, k ≠ j: if it exists on the free list, then mark it.
5. Any of the unmarked buddies s_k currently store parts of other object(s). The space must be rearranged by swapping these s_k sections with sections on the free list. Note that for every buddy section that should be freed, there exists a section on the free list. After swapping space between every unmarked buddy section and a free list section, enough contiguous space has been acquired to create a section at height i + 1 of the free list.
6. Go back to Step 1.

To illustrate, consider the organization of space in Fig. 17(a). The initial set of disk resident objects is {o1, o2, o3}, and the system is configured with ω = 2. In Fig. 17(a), two sections are on the free list at heights 0 and 1 (addresses 7 and 14, respectively), and o3 is the victim object that is deleted. Once page 13 is placed on the free list in Fig. 17(b), the number of sections at height 0 is increased to ω, and it must be compacted according to Step 1. As sections 7 and 13 are not contiguous, section 13 is elected to be swapped with section 7's buddy, that is, section 6 [Fig. 17(c)]. In Fig. 17(d), the data of section 6 is moved to section 13, and section 6 is now on the free list. The compaction of sections 6 and 7 results in a new section with address 6 at height 1 of the free list. Once again, a list of length two at height 1 violates the bounded list length property, and pages (4, 5) are identified as the buddy of section 6 in Fig. 17(e). After moving the data in Fig. 17(f) from pages (4, 5) to (14, 15), another compaction is performed, with the final state of the disk space emerging as in Fig. 17(g). Once all sections of a deallocated object are on the free list, the iterative algorithm above is run on each list, from the lowest to the highest height.

The previous algorithm is somewhat simplified because it does not support the following scenario: a section at height i is not on the free list; however, it has been broken down to a lower height (say, i − 1), and not all of its subsections have been used, so one of them is still on the free list at height i − 1. In such cases, the free list for height i − 1 should be updated with care because those free sections have moved to new locations. In addition, note that the algorithm described above actually performs more work than is strictly necessary. A single section of a small height, for example, may end up being read and written several times as its section is combined into larger and larger sections. This is eliminated in the following manner. The algorithm is first performed virtually, that is, in main memory, as a compaction algorithm on the free lists. Once completed, the entire sequence of operations that have been performed determines the ultimate destination of each of the modified sections. The scheduler constructs a list of these sections.
This list is inserted into a queue of housekeeping I/Os. Associated with each element of the queue is an estimated amount of time required to perform the task. Whenever the scheduler locates one or more idle slots in the time period, it analyzes the queue of work for an element that can be processed using the available time.

The value of ω impacts the frequency of preventive operations. If ω is set to its minimum value (i.e., ω = 2), then preventive operations will be invoked frequently because every time a new section is enqueued there is a 50% chance for a height of the free list to consist of two sections (violating the bounded list length property). Increasing the value of ω will therefore relax the system because it reduces the probability that an insertion into the free list will violate the bounded list length property. However, this also increases the expected number of bytes migrated per preventive operation. For example, at the extreme value of ω = n (where n is the total number of pages), the organization of blocks consists of two levels, and for all practical purposes EVEREST reduces to a standard file system that manages fixed-size pages. The design of EVEREST suffers from the following limitation: the overhead of its preventive operations may become significant if many objects are swapped in and out of the disk drive. This happens when the working set of an application cannot become resident on the disk drive.

In a real implementation of EVEREST, it might not be possible to fix the number of disk pages as an exact power of ω. The most important implication of an arbitrary number of pages is that some sections may not have the correct number of buddies (ω − 1 of them). However, we can always move those sections to one end of the disk, for example, to the side with the highest page offsets. Then, instead of choosing the first section in Step 2 of the object deallocation algorithm, the system can choose the one with the lowest page number. This ensures that the sections toward the critical end of the disk, which might not have the correct number of buddies, are never used in Steps 4 and 5 of the algorithm.

In (40) we report on an implementation of EVEREST in our CM server. Our implementation enables a process to retrieve a file using block sizes that are at the granularity of 1/2 kbyte. For example, EVEREST might be configured with a 64 kbyte page size. One process might read a file at the granularity of 1365.50 kbyte blocks, while another might read a second file at the granularity of 512 kbyte blocks.

The design of EVEREST is related to the buddy system proposed in (44–45) for an efficient main memory (DRAM) storage allocator. The difference is that EVEREST satisfies a request for b pages by allocating a number of sections such that their total number of pages equals b. The storage allocator algorithm, on the other hand, allocates one section that is rounded up to 2^⌈lg b⌉ pages, resulting in fragmentation and motivating the need for either a reorganization process or a garbage collector (39). The primary advantage of the elaborate object deallocation technique of EVEREST is that it avoids internal and external fragmentation of space, as described for traditional buddy systems [see (39)].

PIRATE. Upon the retrieval of a tertiary resident object (say, Z), if the storage capacity of the disk drive is exhausted, then the system must replace one or more objects (victims) in order to allow Z to become disk resident. Previous approaches (collectively termed Atomic) replace each of the
Figure 17. Deallocation of an object. The example sequence shows the removal of object o3 from the initial disk resident object set {o1, o2, o3}. Base two, ω = 2. (a) Two sections are on the free list already (7 and 14), and object o3 is deallocated. (b) Sections 7 and 13 should be combined; however, they are not contiguous. (c) The buddy of section 7 is 6. Data must move from 6 to 13. (d) Sections 6 and 7 are contiguous and can be combined. (e) The buddy of section 6 is 4. Data must move from (4, 5) to (14, 15). (f) Sections 4 and 6 are now adjacent and can be combined. (g) The final view of the disk and the free list after removal of o3.
victim objects in their entirety, requiring each object to either be completely disk resident or not disk resident at all. With the PartIal ReplAcement TEchnique (PIRATE) (12), the system chooses a larger number of objects as victims; however, it replaces only a portion of each victim in order to free up sufficient space for the blocks of Z. The inputs to PIRATE include (1) the size and frequency of access of each object X in the database, termed size(X) and heat(X), respectively (46), (2) the set of objects with a disk resident fraction, excluding Z (denoted F), and (3) the size of the object referenced by the pending request, size(Z). Its side effect is that it makes enough disk space available to accommodate Z. PIRATE deletes blocks of an object one at a time, starting with those that constitute the tail end of the object. For example, if PIRATE decides to replace those blocks that constitute 10 min of a 30 min video clip, it deletes those blocks that represent the last 10 min of the clip, leaving the first 20 min disk resident. The number of blocks that constitute the first portion of X is denoted DISK(X), while its deleted (non disk resident) blocks are termed ABSENT(X); ABSENT(X) = size(X) − DISK(X). (Note: the granularity of ABSENT(X), size(X), and DISK(X) is in blocks.) PIRATE complements pipelining because, by keeping the head of the objects disk resident, their displays can start immediately, minimizing the observed startup latency. On the other hand, since all the requests will access the tertiary, it is possible that the tertiary becomes the bottleneck. Hence, PIRATE is suitable for single user environments (such as personal computers (12)) or for cases where the sustained bandwidth of the tertiary is higher than the average load imposed by simultaneous requests.

Formal Statement of the Problem. The portion of disk space allocated to continuous media data types consists of C blocks. The database consists of m objects {o1, . . ., om}, with heat(oj) ∈ (0, 1) satisfying Σ_{j=1}^{m} heat(oj) = 1, and sizes size(oj) ∈ (0, C) for all 1 ≤ j ≤ m. The size of the database exceeds the storage capacity of the system (i.e., Σ_{j=1}^{m} size(oj) > C). Consequently, the database resides permanently on the tertiary storage device, and objects are swapped in and out from the disk. We assume that the size of each object is smaller than the storage capacity of the disk drive, size(oj) < C for 1 ≤ j ≤ m. Moreover, to simplify the discussion, we assume that the tertiary is not required to change tapes/platters or reposition its read head once it starts to transfer an object. Assume a process that generates requests for objects in which object oj is requested with probability heat(oj) (all requests independent). We assume no advance knowledge of the possible permutation of requests for different objects. Let F denote the set of objects with a disk resident fraction, excluding the one that is referenced by the pending request; size(F) = Σ_{X∈F} DISK(X). Moreover, assuming a new request arrives referencing object Z (F ← F − {Z}), we define free_disk_space as C − (size(F) + DISK(Z)). If ABSENT(Z) ≤ free_disk_space, then no replacement is required. In this study, we focus on the scenario where replacement is required; that is, ABSENT(Z) > free_disk_space. We define the latency time observed by a request referencing Z, denoted ℓ(Z), as the amount of time elapsed from the arrival of the request to the onset of the display. It is a function of DISK(Z) and B_Tertiary.
If DISK(Z) = size(S_{Z,1}), then the maximum value for ℓ(Z) is the worst reposition time of the tertiary storage device. One may reduce this latency time to zero by incrementing size(S_{Z,1}) with the amount of data corresponding to this time; that is,

(worst reposition time × B_Display) / size(block)

(This optimization is assumed for the rest of this paper.) If DISK(Z) > size(S_{Z,1}), then ℓ(Z) = 0 (due to the assumed optimization). Otherwise (i.e., DISK(Z) < size(S_{Z,1})), the system determines the starting address of the nondisk resident portion of Z (missing), and ℓ(Z) is defined as the total sum of (1) the time to reposition the tertiary to the physical location corresponding to missing, and (2) the materialization time of the remainder of the first slice,

size(block) × (size(S_{Z,1}) − DISK(Z)) / B_Tertiary

The average (expected value of the) latency as a function of requests can be defined as

μ = Σ_x heat(x) · ℓ(x)        (18)

The variance is

σ² = Σ_x heat(x) · (ℓ(x) − μ)²        (19)
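Equations (18) and (19) translate directly into a few lines of code. The sketch below is only an illustration and assumes the per-object latency ℓ(x) has already been derived from DISK(x), size(S_{x,1}), size(block), and B_Tertiary as described above; the heats and latencies used in the example are made up.

def latency_stats(heat, latency):
    """Average latency and variance per Eqs. (18) and (19).
    heat and latency are dicts keyed by object id; heats sum to 1."""
    mu = sum(heat[x] * latency[x] for x in heat)
    var = sum(heat[x] * (latency[x] - mu) ** 2 for x in heat)
    return mu, var

# Hypothetical example: three disk resident objects with different heats.
heat = {"X": 0.5, "Y": 0.3, "Z": 0.2}
latency = {"X": 0.0, "Y": 2.0, "Z": 5.0}   # seconds
mu, var = latency_stats(heat, latency)
print(round(mu, 2), round(var, 2))          # 1.6 3.64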
By deleting a portion of an object, we may increase its latency time, resulting in a higher μ and σ². However, once the disk capacity is exhausted, deletion of an object is unavoidable. In this case, it is desirable for some x in F to reduce DISK(x) such that enough disk space becomes available to render object Z disk resident in its entirety. The problem is how to determine those x and their corresponding fractions to be deleted so as to minimize both the average latency time and its variance. Unfortunately, minimizing the average latency time might increase the variance and vice versa. In the next section, we present simple PIRATE and demonstrate that it minimizes the average latency. Subsequently, extended PIRATE is introduced as a generalization of simple PIRATE, with a knob that can be adjusted by the user to tailor the system to strike a compromise between these two objectives.

Simple PIRATE. Figure 18 presents simple PIRATE. Logically, it operates in two passes. In the first pass, it deletes from those objects (say, i) whose disk resident portion is greater than the size of their first slice (S_{i,1}). By doing so, it can ensure a zero latency time for requests that reference these objects in the future (by employing the pipelining mechanism). Note that PIRATE deletes objects at the granularity of a block. Moreover, it frees up only sufficient space to accommodate the pending request and no more than that. (For example, if ABSENT(Z) is equivalent to size(X)/10, and X is chosen as the victim, then only the blocks corresponding to the last 1/10 of X are deleted in order to render Z disk resident.) If the disk space made available by the first pass is insufficient, then simple PIRATE enters its second pass. This pass deletes objects starting with the one that has the lowest heat (following the greedy strategy suggested for the fractional knapsack problem (47)). One might argue that a combination of heat and size should be considered when choosing victims. However, ABSENT(Z) blocks (where Z is the object required to
define pfs: potential_free_space; define rds: required_disk_space
rds ← ABSENT(Z) − free_disk_space
repeat
    victim ← object i from set F with (1) the lowest heat and (2) DISK(i) > size(S_{i,1})
    if (victim is NOT null) then
        pfs ← DISK(victim) − size(S_{victim,1})
    else
        victim ← object i from set F with the lowest heat
        pfs ← DISK(victim)
    if (pfs > rds) then
        DISK(victim) ← DISK(victim) − rds
        rds ← 0
    else
        DISK(victim) ← DISK(victim) − pfs
        rds ← rds − pfs
until (rds = 0)
Figure 18. Simple PIRATE.
become disk resident) are required to be deleted from the disk, independent of the size of the victims. The following proof formalizes this statement and proves the optimality of simple PIRATE in minimizing the latency time of the system.

Lemma 1: To minimize the average latency time of the system, during pass 2 PIRATE must delete those blocks corresponding to the object with the lowest heat, independent of the object sizes.

Proof: Without loss of generality, assume F = {X, Y}, size(block) = 1, and B_Tertiary = 1. Assume a request arrives at t0 referencing Z, and the disk capacity is exhausted. Let t1 be the time when a portion of X and/or Y is deleted. We define μ0 (μ1) to be the average latency at t0 (t1); see Eq. (18). Subsequently, DISK_0(i) and DISK_1(i) represent the disk resident fraction of object i at times t0 and t1, respectively. Let δ_i denote the number of blocks of object i deleted from the disk at time t1 (i.e., δ_i = DISK_0(i) − DISK_1(i)). By deleting X and/or Y partially, we increase the average latency by Δ. However, since deletion is unavoidable, the objective is to minimize Δ = μ1 − μ0, subject to ABSENT(Z) = δ_X + δ_Y. (In computing the average latency, we ignore the latency of the other objects in the database as well as the repositioning time of the tertiary. This is because it only adds a constant to both μ0 and μ1, which will be eliminated by the computation of Δ.)

μ0 = heat(X) · (size(S_{X,1}) − DISK_0(X)) + heat(Y) · (size(S_{Y,1}) − DISK_0(Y))
μ1 = heat(X) · (size(S_{X,1}) − DISK_1(X)) + heat(Y) · (size(S_{Y,1}) − DISK_1(Y))

Thus,

Δ = heat(X) · δ_X + heat(Y) · δ_Y
  = heat(X) · δ_X + heat(Y) · (ABSENT(Z) − δ_X)
  = heat(Y) · ABSENT(Z) + δ_X · (heat(X) − heat(Y))
Since heat(Y) · ABSENT(Z) and (heat(X) − heat(Y)) are constants, in order to minimize Δ we can only vary δ_X (this also fixes δ_Y because ABSENT(Z) = δ_X + δ_Y). If heat(X) > heat(Y), then (heat(X) − heat(Y)) is positive; hence, in order to minimize Δ, the value of δ_X should be minimized (i.e., the object with the higher heat should not be replaced). On the other hand, if heat(X) < heat(Y), then (heat(X) − heat(Y)) is negative; hence, in order to minimize Δ, the value of δ_X should be maximized (i.e., the object with the lower heat should be replaced). This demonstrates that the amount of data deleted from the victim(s) (δ_i) in order to free up disk space depends only on heat(i) and not on size(i).

Extended PIRATE. Extended PIRATE is a generalization of simple PIRATE that can be customized to strike a compromise between the two goals of minimizing either the average latency time of the system or the variance in the latency time. The major difference between simple and extended PIRATE is as follows: extended PIRATE (see Fig. 19) requires a minimum fraction (termed LEAST(x)) of the most frequently accessed objects to be disk resident. Logically, extended PIRATE operates in three passes. Its first pass is identical to that of simple PIRATE. If this pass fails to provide sufficient disk space for the referenced object, then during the second pass it deletes from objects until each of their disk resident portions corresponds to LEAST(x). If pass two fails (i.e., provides insufficient space to materialize the referenced object), it enters pass three. This pass is identical to pass 2 of simple PIRATE, where objects are deleted in their entirety starting with the one that has the lowest heat.

define pfs: potential_free_space; define rds: required_disk_space
rds ← ABSENT(Z) − free_disk_space
repeat
    victim ← object i from set F with (1) the lowest heat and (2) DISK(i) > size(S_{i,1})
    if (victim is NOT null) then
        pfs ← DISK(victim) − size(S_{victim,1})
    else
        victim ← object i from set F with (1) the lowest heat and (2) DISK(i) > LEAST(i)
        if (victim is NOT null) then
            pfs ← DISK(victim) − LEAST(victim)
        else
            victim ← object i from set F with the lowest heat
            pfs ← DISK(victim)
    if (pfs > rds) then
        DISK(victim) ← DISK(victim) − rds
        rds ← 0
    else
        DISK(victim) ← DISK(victim) − pfs
        rds ← rds − pfs
until (rds = 0)
Figure 19. Extended PIRATE.

With extended PIRATE, LEAST(X) for each disk resident object is defined as follows:

LEAST(X) = min(knob × heat(X) × size(S_{X,1}), size(S_{X,1}))        (20)
where knob is an integer whose lower bound is zero. The minimum function prevents LEAST(x) from exceeding the size of the first slice. When knob = 0, extended PIRATE is identical to simple PIRATE. As knob increases, a larger portion of each object becomes disk resident. Obviously, the ideal case is to increase the knob until the first slice of every object becomes disk resident. However, due to the limited storage capacity, this might be infeasible. By increasing the knob, we force a portion of some objects with lower heat to remain disk resident, at the expense of deleting from objects with a higher heat. By providing each request referencing an object a latency time proportional to the heat of that object, extended PIRATE improves the variance while not increasing the average dramatically. There is an optimal value for knob that minimizes σ². If the value of knob exceeds this value, then σ² starts to increase again.
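The interaction between the knob and the three passes is easy to see in code. The following Python sketch mirrors the pseudocode of Figs. 18 and 19 at a coarse level (it is an illustration, not the authors' implementation): each pass defines a floor below which a victim's disk resident portion is not trimmed, victims are always taken coldest first, and knob = 0 collapses the scheme to simple PIRATE. The object attributes and example numbers are made up.

def least(x, knob):
    return min(knob * x["heat"] * x["slice1"], x["slice1"])

def pirate(objects, required, knob=0):
    """Free `required` blocks by trimming disk resident fractions."""
    freed = 0
    passes = [
        lambda x: x["slice1"],      # pass 1: keep at least the first slice
        lambda x: least(x, knob),   # pass 2: keep at least LEAST(x)
        lambda x: 0,                # pass 3: delete objects entirely
    ]
    for floor in passes:
        for x in sorted(objects, key=lambda o: o["heat"]):   # coldest first
            if freed >= required:
                return freed
            reducible = max(0, x["disk"] - floor(x))
            take = min(reducible, required - freed)   # free no more than needed
            x["disk"] -= take
            freed += take
    return freed

# Hypothetical example: free 40 blocks from three partially resident clips.
objs = [
    {"name": "A", "heat": 0.6, "disk": 100, "slice1": 30},
    {"name": "B", "heat": 0.3, "disk": 80,  "slice1": 25},
    {"name": "C", "heat": 0.1, "disk": 50,  "slice1": 20},
]
pirate(objs, required=40, knob=0)
print([(o["name"], o["disk"]) for o in objs])   # C trimmed to 20, B to 70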
Figure 20. Status of the first slice of objects (disk resident versus absent portions, with objects ordered by decreasing heat): (a) knob = 0; (b) knob = C/Avg_Slice1.

Lemma 2: The optimal value for knob is C/Avg_Slice1.

Proof: Let U be the total number of unique objects that are referenced over a period of time. We define

Avg_Slice1 = Σ_x heat(x) × size(S_{x,1})
Avg_Heat = Σ_x heat(x) / U = 1/U
Avg_Least = knob × Avg_Heat × Avg_Slice1
The ideal case is when the LEAST of almost all the objects that constitute the database are disk resident (C is the total number of disk blocks):
C ≈ U × Avg_Least ≈ U × knob × (1/U) × Avg_Slice1
Solving for knob, we obtain knob ≈ C/Avg_Slice1. Substituting the optimal value of knob in Eq. (20), we obtain LEAST(X) = [heat(X) × size(S_{X,1}) / Σ_{i=1}^{m} heat(i) × size(S_{i,1})] × C. This is intuitively the amount of disk space an object X deserves. In (12), we employed a simulation study to confirm this analytical result. Note that because heat is considered in the computation of LEAST, with knob = C/Avg_Slice1 the average latency time degrades in proportion to the improvement in variance. In summary, when knob = 0, PIRATE replaces objects in a manner that minimizes the average latency time; when knob = C/Avg_Slice1, it minimizes the variance. To see this, consider the following discussion. In the long run, with knob = 0, PIRATE maintains the first slices of the objects with the highest heat disk resident, while the others compete with each other for a small portion of the disk space [see Fig. 20(a)]. To approximate the number of objects that become disk resident with knob = 0 (denoted ℵ_μ), we use the average size of the first slice (Avg_Slice1) as follows:

ℵ_μ ≈ C / Avg_Slice1        (21)
However, with knob = C/Avg_Slice1, PIRATE maintains only a minimum portion of all these ℵ_μ objects disk resident.
To achieve this, in the optimal case it requires ℵ_μ × Avg_Least = ℵ_μ × C/U of disk space. This is optimistic, because the ℵ_μ objects have the highest heats and thus a large minimum portion. While it is not realistic to use Avg_Least in the equation, it is useful as an approximation. The rest of the disk space, C − (ℵ_μ × C/U), can be used for the minimum portions of the other objects [see Fig. 20(b)]. Therefore, in the long run, the number ℵ of disk resident objects with knob = C/Avg_Slice1 (i.e., C/Max_Size × Max_Heat × knob < ℵ < U) is larger than ℵ_μ. However, with knob = 0 the first slices of the ℵ_μ objects are disk resident, while with knob = C/Avg_Slice1 only LEAST of each of the ℵ objects is disk resident. This results in the following tradeoff. On one hand, a request referencing an object Z has a higher hit ratio with knob = C/Avg_Slice1 as compared to knob = 0. On the other hand, a hit with knob = 0 translates into a fraction of a second of latency, while with knob = C/Avg_Slice1 it results in a minimum latency time of (size(S_{Z,1}) − LEAST(Z)) × size(block)/B_Tertiary. This explains why, with knob = C/Avg_Slice1, PIRATE improves the variance in proportion to the degradation in average latency.

STREAM SCHEDULING AND SYNCHRONIZATION

In this section, we investigate a taxonomy of scheduling problems corresponding to three classes of multimedia applications. The application classes and the corresponding scheduling problems include the following.

1. On-demand atomic object retrieval: With this class of applications, a system strives to display an object (audio
or video) as soon as a user request arrives referencing the object. The envisioned movie-on-demand and news-on-demand systems are examples of this application class. We formalize the scheduling problem that represents this class as the Atomic Retrieval Scheduling (ARS) problem.

2. Reservation-based atomic object retrieval: This class is similar to on-demand atomic object retrieval except that a user requests the display of an object at some point in the future. An example might be a movie-on-demand system where the customers call to request the display of a specific movie at a specific time; for example, Bob calls in the morning to request a movie at 8:00 p.m. Reservation-based retrieval is expected to be cheaper than on-demand retrieval because it enables the system to minimize the amount of resources required to service requests (using planning optimization techniques). The scheduling problem that represents this application class is termed Augmented ARS (ARS+).

3. On-demand composite object retrieval: As compared to atomic objects, a composite object describes when two or more atomic objects should be displayed in a temporally related manner (48). To illustrate the application of composite objects, consider the following environment. During the post-production of a movie, a sound editor accesses an archive of digitized audio clips to extend the movie with appropriate sound effects. The editor might choose two clips from the archive: a gun-shot and a screaming sound effect. Subsequently, she authors a composite object by overlapping these two sound clips and synchronizing them with the different scenes of a presentation. During this process, she might try alternative gun-shot or screaming clips from the repository to evaluate which combination is appropriate. To enable her to evaluate her choices immediately, the system should be able to display the composition as soon as it is authored (on demand). There are two scheduling problems that correspond to this application class: (1) Composite Retrieval Scheduling (CRS) and (2) Resolving Internal Contention (RIC). While CRS is the scheduling of multiple composite objects assuming a multiuser environment, RIC is the scheduling of a single composite object assuming a single request. RIC can also be considered a preprocessing step of CRS when constructing a composite object for each user in a multi-user environment.

CRS and RIC pose the most challenging problems and are supersets of ARS and ARS+. Researchers are just starting to realize the importance of customized multimedia presentations targeted toward an individual's preferences (49). Some studies propose systems that generate customized presentations automatically based on a user query (50). Independent of the method employed to generate a presentation, the retrieval engine must respect the time dependencies between the elements of a presentation when retrieving the data. To tackle CRS and RIC, we first need to solve the simpler problems of ARS and ARS+. To put our work in perspective, we denote the retrieval of an object as a task. The distinctive characteristics of the scheduling problems (ARS, ARS+, CRS, RIC) that distinguish
them from conventional scheduling problems (a more detailed comparison can be found in (50)) are that (1) tasks are I/O-bound and not CPU-bound, (2) each task utilizes multiple disks during its lifetime, (3) each task acquires and releases disks in a regular manner, (4) the pattern a task uses to employ the disks depends on the placement of its referenced object on the disks, and (5) there might be temporal relationships among the multiple tasks constituting a composite task. Independent of the application classes, and due to the above characteristics, tasks compete for system resources, that is, disk bandwidth. The term retrieval contention is used in this study to denote this competition among the tasks for disk bandwidth. This contention should be treated differently depending on the type of system load. Furthermore, an admission control component, termed contention prediction (cop), is required by all the scheduling algorithms to activate tasks in such a manner that no contention occurs among the activated tasks. Indeed, the problem of retrieval contention and its prediction is shared by all the scheduling problems and should be studied separately. In this section, we first extend the hardware architecture to support a mix of media types. Note that the terms time interval and time period are used synonymously in this section. Subsequently, we study the four formally defined scheduling problems. In (51), we proved that all of these scheduling problems are NP-hard in the strong sense. In addition, scheduling algorithms based on heuristics are introduced per problem in (51).

Mix of Media Types

Until now, we assumed that all CM objects belong to a single media type; for example, all are MPEG-2 compressed video objects. In practice, however, objects might belong to different media types. Assuming m media types, each with a bandwidth requirement of c_i, the display time of each block of the different objects must be identical in order to maintain fixed-size time intervals (and a continuous display for a mix of displays referencing different objects). This is achieved as follows. First, objects are logically grouped based on their media types. Next, the system chooses a media type i with block size Sub_i and bandwidth requirement c_i as a reference to compute the duration of a time interval (interval = Sub_i/c_i). The block size of a media type j is then chosen to satisfy the constraint interval = Sub_j/c_j = Sub_i/c_i (a short numeric sketch of this computation appears after the task definitions below).

Defining Tasks

Let T be a set of tasks where each t ∈ T is a retrieval task that corresponds to the retrieval of a video object. Note that if an object X is referenced twice by two different requests, a different task is assigned to each request. The time to display a block is defined as a time interval (or time period), which is employed as the time unit. For each task t ∈ T, we define

• r(t): Release time of t, r: T → N. The time at which t's information (i.e., the information of its referenced object) becomes available to the scheduler. This is identical to the time at which a request referencing the object is submitted to the system.
• ℓ(t): Length (size) of the object referenced by t, ℓ: T → N. The unit is the number of blocks.
• c(t): Consumption rate of the object referenced by t, 0 < c(t) ≤ 1. This rate is normalized by R_D. Thus, c(t) = 0.40 means that the consumption rate of the object referenced by t is 40% of R_D, the cluster bandwidth.
• p(t): The cluster that contains the first block of the object referenced by t, 1 ≤ p(t) ≤ D. It determines the placement of the object referenced by t on the clusters.

We denote a task t_i as a quadruple ⟨r(t_i), ℓ(t_i), c(t_i), p(t_i)⟩.
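As promised above, the block-size constraint for a mix of media types reduces to a one-line computation. The Python sketch below is an illustration with made-up rates: it picks media type 1 as the reference and sizes every other type's block so that all blocks have the same display time (i.e., the same time interval).

def block_sizes(reference_block_kb, rates_mbps):
    """Block size per media type; rates_mbps[0] is the reference type."""
    interval = reference_block_kb / rates_mbps[0]   # common display time
    return [interval * c for c in rates_mbps]       # Sub_j = interval * c_j

# Illustration with the MPEG-1/MPEG-2 rates used earlier: a 512 kbyte
# MPEG-1 (1.5 Mbps) reference block implies a 1365.33 kbyte MPEG-2 block.
print(block_sizes(512, [1.5, 4.0]))   # [512.0, 1365.3333333333333]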
Figure 21. Three levels of abstraction: a user interface (producing a display schedule), a logical abstraction, and a storage manager (producing a retrieval schedule); the storage manager is the focus here.
Atomic Retrieval Scheduling

The ARS problem is to schedule retrieval tasks such that the total bandwidth requirement of the scheduled tasks on each cluster during each interval does not exceed the bandwidth of that cluster (i.e., no retrieval contention). Moreover, ARS should satisfy an optimization objective. Depending on the application, this objective could be minimizing either (1) the average startup latency of the tasks or (2) the total duration of scheduling for a set of tasks (maximizing throughput). Movie-on-demand is one sample application.

Definition 1: The problem of ARS is to find a schedule σ (where σ: T → N) for a set T such that (1) it minimizes the finishing time (52) w, where w is the least time at which all tasks of T have been completed, and (2) it satisfies the following constraints:

• ∀t ∈ T, σ(t) ≥ r(t).
• ∀u ≥ 0, let S(u) be the set of tasks for which σ(t) ≤ u < σ(t) + ℓ(t); then ∀i, 1 ≤ i ≤ D, Σ_{t∈S(u)} R_i(t) ≤ 1.0, where

R_i(t) = c(t) if (p(t) + u − σ(t)) mod D = i − 1, and 0.0 otherwise.        (22)

The first constraint ensures that no task is scheduled before its release time. The second constraint strives to avoid retrieval contention. It guarantees that, at each time interval u and for each cluster i, the aggregate bandwidth requirement of the tasks that employ cluster i and are in progress (i.e., have been initiated at or before u but have not committed yet) does not exceed the bandwidth of cluster i. The mod function handles the round-robin utilization of clusters per task.
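The second constraint of Definition 1 can be checked mechanically from the task quadruples. The Python sketch below is illustrative only: given a candidate schedule σ, it sums the normalized consumption rates that each cluster would see in an interval and reports whether any cluster exceeds its bandwidth. The same per-interval check, driven by the lag parameters instead of release times, underlies the internal-contention test for composite tasks introduced later (Eq. 23). The task values in the example are made up.

def cluster_load(tasks, sigma, u, D):
    """Aggregate bandwidth requested from each of D clusters in interval u.
    Each task is (r, l, c, p): release, length in blocks, rate, first cluster."""
    load = [0.0] * D
    for name, (r, l, c, p) in tasks.items():
        if sigma[name] <= u < sigma[name] + l:        # task in progress at u
            load[(p + u - sigma[name]) % D] += c      # cluster used at u (Eq. 22)
    return load

def contention_free(tasks, sigma, D, horizon):
    return all(max(cluster_load(tasks, sigma, u, D)) <= 1.0
               for u in range(horizon))

# Hypothetical example with D = 3 clusters and two tasks starting on cluster 1.
tasks = {"t1": (0, 4, 0.6, 1), "t2": (0, 4, 0.5, 1)}
print(contention_free(tasks, {"t1": 0, "t2": 0}, D=3, horizon=8))   # False
print(contention_free(tasks, {"t1": 0, "t2": 1}, D=3, horizon=8))   # True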
Augmented ARS

ARS+ is identical to ARS except that there is a delay between the time that a task is released and the time that it should start. A sample application of ARS+ could be a movie-on-demand application where the customers reserve movies in advance. For example, at 7:00 p.m. Alice reserves GodFather to be displayed at 8:00 p.m. Hence, letting t be the task corresponding to Alice retrieving GodFather, r(t) = 7:00, but its start time is one hour later. Due to this extra knowledge, more flexible scheduling can be performed. The quadruple notation of a task t_i for ARS is augmented as t_i: ⟨r(t_i), §(t_i), ℓ(t_i), c(t_i), p(t_i)⟩ for ARS+. The lag parameter, §(t), determines the start time of the task. That is, the display of a task that is released at r(t) should not start sooner than r(t) + §(t). Other than this distinction in the definition of a task between ARS and ARS+, the definition of ARS+ is identical to that of ARS. Note that the first constraint of the definition remains ∀t ∈ T, σ(t) ≥ r(t).

Composite Objects

We conceptualize a system that supports composite objects as consisting of three components: a collection of user interfaces, a logical abstraction, and a storage manager (see Fig. 21). User interfaces play an important role in providing a friendly interface to (1) access existing data to author composite objects and (2) display objects. The logical abstraction tailors the user interface to the storage manager and is described further in the following paragraphs. The focus of this report is on the storage manager. The logical abstraction is defined to separate the storage manager issues from the user interface. This has two major advantages: simplicity and portability. It results in simplicity because different (and maybe inconsistent) representations of composite objects dictated by the user interface have no impact on the algorithms at the storage manager level. An intermediate interpreter is responsible for translating the user's representations and commands into a uniform, consistent notation at the logical level. The storage manager becomes portable because it is independent of the user interface. Hence, if future interfaces start to use goggles and headsets, the storage manager engine does not need to be modified.

At the logical level of abstraction, a composite object is represented as (X, Y, j), indicating that the composite object consists of atomic objects X and Y. The parameter j is the lag parameter. It indicates that the display of object Y should start j time intervals after the display of X has started. For example, to designate a complex object where the displays of X and Y must start at the same time, we use the notation (X, Y, 0). Likewise, the composite object specification (X, Y, 2) indicates that the display of Y is initiated two intervals after the display of X has started. This definition of a composite object supports the 13 alternative temporal relationships described in (48). Figure 22 lists these temporal relationships and their representation using our notation of a composite object. The first two columns of Fig. 22 demonstrate the basic 7 relationships between atomic objects X and Y; the rest of the relationships are the inverses of these 7 (note that the equal relation has no inverse). Our proposed techniques support all temporal constructs because they solve for (1) arbitrary j values, (2) arbitrary sizes for both X and Y, and (3) arbitrary clusters to start the placement of X and Y.
Allen relation and its composite object construct:
X before Y: (Y, X, j), size(X) < j
X equals Y: (Y, X, j), size(X) = size(Y) & j = 0
X meets Y: (Y, X, j), j = size(X)
X overlaps Y: (Y, X, j), 0 < j < size(X)
X during Y: (Y, X, j), j > 0 & size(X) < size(Y) − j
X starts Y: (Y, X, j), j = 0 & size(X) < size(Y)
X finishes Y: (Y, X, j), j = size(Y) − size(X) < size(Y)

Figure 22. Allen temporal relationships and their representation using our notation of a composite object.
Our notation extends naturally to the specification of composite objects that contain more than two atomic objects. A composite object containing n atomic objects can be characterized by n − 1 lag parameters, for example, (X1, . . ., Xn, j2, . . ., jn), where j_i denotes the lag parameter of object X_i with respect to the beginning of the display of object X1. To simplify the discussion, we assume integer values for the lag parameters (i.e., the temporal relationships are at the granularity of a time interval). For more accurate synchronization, such as lip-synching between a spoken voice and the movement of the speaker's lips, real values of the lag parameter should be considered. This extension is straightforward. To illustrate, suppose the time dependency between objects X and Y is defined such that the display of Y should start 2.5 seconds after the display of X starts. Assuming the duration of a time interval is one second, this time dependency at the task scheduling level can be mapped to (X, Y, 2). Hence, the system can retrieve Y after 2 s but employ memory to postpone Y's display for 0.5 s.
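The mapping from real-valued lags to an integer retrieval lag plus a buffered display delay is mechanical. The sketch below is an illustration of the notation only (not part of the storage manager): it takes a composite object (X1, . . ., Xn, j2, . . ., jn) and reproduces the 2.5 s example above.

import math

def lags(names, offsets):
    """Per-object (retrieval_lag, display_delay) in time intervals,
    relative to the start of the first object's display."""
    result = {names[0]: (0, 0.0)}
    for name, j in zip(names[1:], offsets):
        retrieval_lag = math.floor(j)                       # retrieve at this integer lag
        result[name] = (retrieval_lag, j - retrieval_lag)   # buffer the rest in memory
    return result

# Example from the text: Y starts 2.5 s after X with 1 s intervals, so Y is
# retrieved with lag 2 and its display is postponed by 0.5 interval in memory.
print(lags(["X", "Y"], [2.5]))   # {'X': (0, 0.0), 'Y': (2, 0.5)}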
Composite Retrieval Scheduling

The objectives of the Composite Retrieval Scheduling (CRS) problem are identical to those of ARS. The distinction is that with CRS, each user submits a request referencing a composite object. A composite object is a combination of two or more atomic objects with temporal relationships among them. The scheduler assigns a composite task to each request referencing a composite object. A composite task is a combination of atomic tasks. The time dependencies among the atomic tasks of a composite task are defined by the lag parameters §(t) of the atomic tasks. A sample application of CRS is the digital editing environment. An editor composes a composite task on demand, and the result should be displayed to the editor immediately. This is essential for the editor in order to evaluate her composition and possibly modify it immediately.

With CRS, each composite task consists of a number of atomic tasks. We use t to represent an atomic task and θ for a composite task. Similarly, T represents a set of atomic tasks while Θ is a set of composite tasks. A composite task is itself a set of atomic tasks; for example, θ = {t1, t2, . . ., tn}. Each atomic task has the same parameters as defined earlier, except for the release time r(t). Instead, each atomic task has a lag time denoted by §(t). Without loss of generality, we assume for a composite task θ that §(t1) ≤ §(t2) ≤ · · · ≤ §(tn). Subsequently, we denote the first atomic task in the set as car(θ); that is, car(θ) = t1. The lag time of a task determines the start time of the task with respect to §(car(θ)). Trivially, §(car(θ)) = 0. Briefly, §(t) determines the temporal relationships among the atomic tasks of a composite task. Each composite task, on the other hand, has only a release time r(θ), which is the time at which a request for the corresponding composite object is submitted.

Definition 2: An atomic task t (of a composite task) is schedulable at u if t can be started at u and completes at u + ℓ(t) − 1 without resulting in retrieval contention as defined in Def. 1.

Definition 3: A composite task θ = {t1, t2, . . ., tn} is said to be schedulable at u if ∀t ∈ θ, t is schedulable at u + §(t).

Definition 4: The problem of CRS is to find a schedule σ (where σ: Θ → N) for a set Θ such that (1) it minimizes the finishing time (52) w, where w is the least time at which all tasks of Θ have been completed, and (2) it satisfies the following constraints:

• ∀θ ∈ Θ, σ(θ) ≥ r(θ).
• θ is schedulable at σ(θ) (see Def. 3).

Resolving Internal Contention for CRS (RIC)

The CRS problem involves scheduling multiple composite tasks. RIC, however, focuses on the scheduling of a single composite task. The problem is that even scheduling a single composite task might not be possible due to retrieval contention among its constituent atomic tasks. RIC is very much like the clairvoyant ARS problem. The distinction is that a task can start sooner than its release time, employing upsliding. RIC can also be considered a problem similar to ARS+ where all the tasks have an identical release time but different start times. However, with ARS+, in the worst case a task (which cannot slide upward) can be postponed, while postponing a task with RIC would violate the defined temporal relationships. Furthermore, ARS+ and RIC have different objectives. For the above reasons, we study RIC as a separate scheduling problem. A composite object may have internal contention; that is, the atomic tasks that constitute a composite task may compete with one another for the available cluster bandwidth. Hence, it is possible that, due to such internal contention, a composite task is not schedulable even if there are no other active requests. In other words, it is not possible to start all the atomic tasks of θ at their start times.

Definition 5: Internal contention: Consider a composite task θ = {t1, t2, . . ., tn}, and ∀u ≥ 0 let S(u) ⊆ θ be the set of atomic tasks for which §(t) ≤ u < §(t) + ℓ(t). The composite task
has internal contention if ∃u, i (1 ≤ i ≤ D) such that Σ_{t∈S(u)} R_i(t) > 1.0, where

R_i(t) = c(t) if (p(t) + u − §(t)) mod D = i − 1, and 0.0 otherwise.        (23)

The above definition intuitively means that there exists no u such that θ is schedulable at u (see Def. 3). This problem is particular to composite objects because there is a dependency among the start times of the atomic tasks, and yet these atomic tasks can conflict with each other.

Definition 6: Resolving the internal contention for a composite task θ is to modify the start times of its constituent atomic tasks such that Def. 5 does not hold true for θ. Such a modification requires the use of memory buffers. Ideally, we should minimize the amount of required buffer space.

Definition 7: The problem of RIC is to resolve the internal contention for a composite task (as defined in Def. 6) while minimizing the amount of required memory.

OPTIMIZATION TECHNIQUES

In this section, we discuss some techniques to improve the utilization of a continuous media server. First, two techniques to reduce startup latency are explained. Next, three methods to improve the system throughput are described. Finally, we focus on retrieval optimization techniques for those applications where a request references multiple CM objects (i.e., the composite objects described earlier). We describe a taxonomy of optimization techniques which is applicable in certain applications with flexible presentation requirements.

Minimizing Startup Latency

Considering the hybrid striping approach described earlier, each request should wait until a time slot corresponding to the cluster containing the first block of its referenced object becomes available. This is true even when the system is not 100% utilized. To illustrate, conceptualize the set of slots supported by a cluster in a time period as a group. Each group has a unique identifier. To support a continuous display in a multi-cluster system, a request maps onto one group, and the individual groups visit the clusters in a round-robin manner (Fig. 23). If group G5 accesses cluster C2 during a time period, G5 would access C3 during the next time period. During a
Figure 23. Rotating groups.
given time period, the requests occupying the slots of a group retrieve blocks that reside in the cluster that is being visited by that group. Therefore, if there are C clusters (or groups) in the system, and each cluster (or group) can support N simultaneous displays, then the maximum throughput of the system is m = N × C simultaneous displays. The maximum startup latency is Tp × C because (1) the groups rotate (i.e., play musical chairs) with the C clusters, using each for a Tp interval of time, and (2) at most C − 1 failures might occur before a request can be activated (when the number of active displays is fewer than N × C). Thus, both the system throughput and the maximum startup latency scale linearly. Note that system parameters such as block size, time period, throughput, etc., for a cluster can be computed using the equations provided earlier, depending on the selected display technique. These display techniques are local optimizations that are orthogonal to the optimization techniques proposed in this section.

Even though the work load of a display is distributed across the clusters with a round-robin assignment of blocks, a group might experience a higher work load as compared to other groups. For example, in Fig. 24, if the system services a new request for object X using group G4, then all servers in G4 become busy, while several other groups have two idle servers. This imbalance might result in a higher startup latency for future requests. For example, if another request for Z arrives, then it would incur a two time period startup latency because it must be assigned to G5, G4 being already full. This section describes request migration and replication (53) as two alternative techniques to minimize startup latency.
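Before turning to the two techniques, the rotating-group admission rule itself can be sketched in a few lines. The fragment below is a simplified illustration (made-up occupancies, not a server implementation): it walks over the groups in the order in which they will visit the cluster holding the referenced object's first block and returns the number of full groups, each costing one time period Tp, that pass by before an idle slot is found.

def startup_latency(occupancy_in_visit_order, N):
    """occupancy_in_visit_order[k]: occupied slots in the group that visits
    the target cluster k time periods from now (k = 0 is the current period).
    Returns the startup latency in time periods, or None if all N*C slots
    are busy (the request must then wait for a display to terminate)."""
    for wait, busy in enumerate(occupancy_in_visit_order):
        if busy < N:
            return wait          # each earlier full group is one failure
    return None

# Hypothetical example: C = 6 groups with N = 2 slots each; the group
# currently at the target cluster is full, the next one has an idle slot.
print(startup_latency([2, 1, 2, 2, 0, 1], N=2))   # 1 (one time period)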
Figure 24. Load balancing.
thogonal to one another, enabling a system to employ both at the same time. Request Migration. By migrating one or more requests from a group with zero idle slots to a group with many idle slots, the system can minimize the possible latency incurred by a future request. For example, in Fig. 24, if the system migrates a request for X from G4 to G2, then a request for Z is guaranteed to incur a maximum latency of one time period. Migrating a request from one group to another increases the memory requirements of a display because the retrieval of data falls ahead of its display. Migrating a request from G4 to G2 increases the memory requirement of this display by three buffers. This is because when a request migrates from G4 to G2 (see Fig. 24), G4 reads X0 and sends it to the display. During the same time period, G3 reads X1 into a buffer (say, B0), and G2 reads X2 into a buffer (B1). During the next time period, G2 reads X3 into a buffer (B2), and X1 is displayed from memory buffer B0. (G2 reads X3 because the groups move one cluster to the right at the end of each time period to read the next block of active displays occupying its servers.) During the next time period, G2 reads X4 into a memory buffer (B3), while X2 is displayed from memory buffer B1. This round-robin retrieval of data from clusters by G2 continues until all blocks of X have been retrieved and displayed. With this technique, if the distance from the original group to the destination group is B, then the system requires B ⫹ 1 buffers. However, because a request can migrate back to its original group once a request in the original group terminates and relinquishes its slot (i.e., a time slot becomes idle), the increase in total memory requirement could be reduced and become negligible. C ⭈ (N ⫺1) When k ⱕ C ⭈ (N ⫺ 1) (with the probability of 兺k⫽0 p(k)), request migration can be applied due to the availability of idle slots. This means that Prob兵a group is full其 ⫽ 0. Hence, pf (0, k) ⫽ 1. If k ⬎ C ⭈ (N ⫺ 1) (with the probability m⫺1 of 兺k⫽C ⭈ (N ⫺1)⫹1 p(k)), no request migration can be applied because (1) no idle slot is available in some groups, and (2) the load is already evenly distributed. Hence, the probability of failures is:
\[ p_f(i, k') = \frac{\binom{C-i}{k'-i} - \binom{C-(i+1)}{k'-(i+1)}}{\binom{C}{k'}} \qquad (24) \]

where k′ = k − C·(N − 1). The expected latency with request migration is

\[ E[L] = \sum_{k=0}^{C(N-1)} p(k) \cdot 0.5 \cdot T_p + \sum_{k=C(N-1)+1}^{m-1} \left[ p(k) \cdot p_f(0, k') \cdot 0.5 \cdot T_p + \sum_{i=1}^{k'} p(k) \cdot p_f(i, k') \cdot i \cdot T_p \right] \qquad (25) \]

Object Replication. To reduce the startup latency of the system, one may replicate objects. We term the original copy of an object X its primary copy. All other copies of X are termed its secondary copies. The system may construct r secondary copies for object X. Each of its copies is denoted as R_{X,i}, where 1 ≤ i ≤ r. The number of instances of X is the number of copies of X, r + 1 (r secondary plus one primary). Assuming two instances of an object, by starting the assignment of R_{X,1} with a cluster different than the one containing the first block of its primary copy (X), the maximum startup latency incurred by a display referencing X can be reduced by one half. This also reduces the expected startup latency. The assignment of the first block of each copy of X should be separated by a fixed number of clusters in order to maximize the benefits of replication. Assuming that the primary copy of X is assigned starting with an arbitrary cluster (say C_i contains X0), the assignment of secondary copies of X is as follows. The assignment of the first block of copy R_{X,j} should start with cluster (C_i + j·⌈C/(r + 1)⌉) mod C. For example, if there are two secondary copies of object Y (R_{Y,1}, R_{Y,2}) and its primary copy is assigned starting with cluster C0, then R_{Y,1} is assigned starting with cluster C2, while R_{Y,2} is assigned starting with cluster C4. With two instances of an object, the expected startup latency for a request referencing this object can be computed as follows. To find an available server, the system simultaneously checks the two groups corresponding to the two different clusters that contain the first blocks of these two instances. A failure happens only if both groups are full, reducing the number of failures for a request. The maximum number of failures before a success is reduced to ⌊k/(2·N)⌋ due to the simultaneous searching of two groups in parallel. Therefore, the probability of i failures in a system with each object having two instances is identical to that of a system consisting of C/2 clusters with 2N servers per cluster. A request would experience a lower number of failures with more instances of objects. For an arbitrary number of instances (say j) for an object in the system, the probability of a request referencing this object to observe i failures is

\[ p_f(i, k) = \frac{\binom{m - j i N}{k - j i N} - \binom{m - j (i+1) N}{k - j (i+1) N}}{\binom{m}{k}} \qquad (26) \]

where 0 ≤ i ≤ ⌊k/(j·N)⌋. Hence, the expected startup latency is

\[ E[L] = \sum_{k=0}^{m-1} \left[ p(k) \cdot p_f(0, k) \cdot 0.5 \cdot T_p + \sum_{i=1}^{\lfloor k/(jN) \rfloor} p(k) \cdot p_f(i, k) \cdot i \cdot T_p \right] \qquad (27) \]
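As a numerical illustration of Eqs. (26) and (27), the short sketch below evaluates the expected startup latency for an object with j instances. It is a minimal sketch that assumes the reconstructed forms of the equations above; the busy-slot distribution p(k) is supplied by the caller, and the function names are invented for the example.

```python
import math

def p_fail(i, k, m, N, j):
    """Probability of exactly i failures for an object with j instances (form of Eq. 26)."""
    def surv(x):
        top, low = m - j * x * N, k - j * x * N
        return math.comb(top, low) if 0 <= low <= top else 0
    return (surv(i) - surv(i + 1)) / math.comb(m, k)

def expected_latency(p, m, N, j, Tp):
    """Expected startup latency (form of Eq. 27); p[k] = Prob{k slots are busy}, k = 0..m-1."""
    total = 0.0
    for k in range(m):
        total += p[k] * p_fail(0, k, m, N, j) * 0.5 * Tp     # success on the first probe
        for i in range(1, k // (j * N) + 1):                 # i failed probes before a success
            total += p[k] * p_fail(i, k, m, N, j) * i * Tp
    return total

# Example: m = 12 slots, N = 2 slots per group, one instance per object (j = 1),
# a 1 s time period, and a uniform load distribution over k = 0..11.
print(expected_latency([1.0 / 12] * 12, m=12, N=2, j=1, Tp=1.0))
```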
Object replication increases the storage requirement of an application. One important observation in real applications is that objects may have different access frequencies. For example, in a Video-On-Demand system, more than half of the active requests might reference only a handful of recently released movies. Selective replication for frequently referenced (i.e., hot) objects could significantly reduce the latency without a dramatic increase in storage space requirement of an
application. The optimal number of secondary copies per object is based on its access frequency and the available storage capacity. The formal statement of the problem is as follows. Assuming n objects in the system, let S be the total amount of disk space for these objects and their replicas. Let R_j be the optimal number of instances for object j, S_j denote the size of object j, and F_j represent the access frequency (%) of object j. The problem is to determine R_j for each object j (1 ≤ j ≤ n) while satisfying $\sum_{j=1}^{n} R_j \cdot S_j \leq S$. There exist several algorithms to solve this problem (54). A simple one, known as the Hamilton method, computes the number of instances per object j based on its frequency (see (53)). It rounds the remainder of the quota (Q_j − ⌊Q_j⌋) to compute R_j. However, this method suffers from two paradoxes, namely, the Alabama and Population paradoxes (54). Generally speaking, with these paradoxes, the Hamilton method may reduce the value of R_j when either S or F_j increases in value. The divisor methods provide a solution free of these paradoxes. For further details and proofs of this method, see (15). Using a divisor method named Webster (d(R_j) = R_j + 0.5), we classify objects based on their instances. Therefore, objects in a class have the same number of instances. The expected startup latency in this system with n objects is

\[ E[L] = \sum_{i=1}^{n} F_i \cdot E[L_{R_i}] \qquad (28) \]
where E[L_{R_i}] is the expected startup latency for an object having R_i instances (computed using Eq. 27).

Maximizing Throughput

A trivial way to increase the throughput of a continuous media server is to support multiple displays (or users) by utilizing a single disk stream. This can be achieved when many requests reference an identical CM object. The problem, however, is that these requests arrive at different time instances. In this section, we explain three approaches to hide the time differences among multiple requests referencing a single object:

1. Batching of requests (55–58): In this method, requests are delayed until they can be merged with other requests for the same video (a toy sketch of this idea follows the list). These merged streams then form one physical stream from the disk and consume only one set of buffers. Only on the network will the streams split at some point for delivery to the individual display stations.
2. Buffer sharing (59–64): The idea here is that if one stream for a video lags another stream for the same video by only a short time interval, then the system could retain the portion of the video between the two in buffers. The lagging stream would read from the buffers and not have to read from disk.
3. Adaptive piggy-backing (65): In this approach, streams for the same video are adjusted to go slower or faster by a few percent, such that it is imperceptible to the viewer, and the streams eventually merge and form one physical stream from the disks.

Batching and adaptive piggy-backing are orthogonal to buffer sharing.
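The toy sketch below (referenced from item 1 of the list) makes the batching idea concrete: requests for the same video that arrive within a fixed batching window are grouped so that each batch can be served from a single disk stream. The function and the data layout are hypothetical simplifications; real admission control and VCR interaction are considerably more involved.

```python
def batch_requests(arrivals, window):
    """arrivals: iterable of (arrival_time, video_id) pairs.
    Requests for the same video arriving within `window` seconds of the
    batch leader share one batch (i.e., one physical disk stream)."""
    batches = []           # list of (video_id, [arrival times])
    open_batch = {}        # video_id -> (leader_time, member list)
    for t, video in sorted(arrivals):
        leader = open_batch.get(video)
        if leader is not None and t - leader[0] <= window:
            leader[1].append(t)              # join the existing batch
        else:
            members = [t]                    # start a new batch led by this request
            open_batch[video] = (t, members)
            batches.append((video, members))
    return batches

# Example: a 5 s window merges the requests for "news" arriving at t = 1 and t = 4,
# while the request at t = 9 starts a new batch.
print(batch_requests([(1, "news"), (4, "news"), (9, "news"), (2, "movie")], window=5))
```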
Optimization Techniques for Applications with Flexible Presentation Requirements

In many multimedia applications, the result of a query is a set of CM objects that should be retrieved from a CM server and displayed to the user. This set of CM objects has to be presented to the user as a coherent presentation. Multimedia applications can be classified as either having Restricted Presentation Requirements (RPR) or Flexible Presentation Requirements (FPR). RPR applications require that the display of the objects conform to a very strict requirement set, such as temporal relationships, selection criteria, and display quality. Digital editing is an example of RPR (see CRS and RIC). In FPR applications, such as digital libraries, music juke-boxes, and news-on-demand applications, users can tolerate some temporal, selection, and display quality variations. These flexibilities stem from the nature of multimedia data and user queries.

RPR applications impose very strict display requirements. This is due to the type of queries imposed by users in such applications. It is usually the case that the user can specify what objects he/she is interested in and how to display these objects in concert. Multimedia systems have to guarantee that the CM server can retrieve all the objects in the set and can satisfy the precise time dependencies, as specified by the user. There have been a number of studies on scheduling continuous media retrievals for RPR applications; see (51,66,67,68). In (51,66,68) the time dependencies are guaranteed by using memory buffers, while in (67), they are guaranteed by using the in-advance knowledge available at the time of data placement.

FPR applications provide some flexibilities in the presentation of the continuous media objects. It is usually the case that the user does not know exactly what he/she is looking for and is only interested in displaying the objects that satisfy some criteria (e.g., show me today's news). In general, almost all applications using a multimedia DBMS fall into this category. In this case, depending on the user query, user profile, and session profile, there are a number of flexibilities that can be exploited for retrieval optimization. We have identified the following flexibilities in (69):

• Delay flexibility, which specifies the amount of delay the user/application can tolerate between the display of different continuous media clips (i.e., relaxed meet, after, and before relationships (48)). In some applications, such delays are even desirable in order for the user (i.e., human perception) to distinguish between two objects.
• Selection flexibility, which refers to whether the objects selected for display are a fixed set (e.g., two objects selected for display) or a suggestion set (e.g., display two objects out of four candidate objects). This flexibility is identified; however, we do not use it in our formal definitions. It is part of our future research.
• Ordering flexibility, which refers to the display order of the objects (i.e., to what degree the display order of the objects is important to the user).
• Presentation flexibility, which refers to the degree of flexibility in the presentation length and presentation startup latency.
[Figure 25 illustrates the system architecture: a user query (e.g., "Show me today's news"), the user profile, the session profile, and the meta-data are combined by the Profile Aware User Query Combiner (Parrot) into a Query-Script; the Profile Aware Retrieval Optimizer (Prime) accepts the Query-Script, which contains the user retrieval requirements and flexibilities (ordering, delay, display-quality, and presentation) as a formal definition, and generates a retrieval plan that is optimal for the current CM-Server load; the CM server then streams the selected clips to the user (e.g., a 90 s MPEG II USC vs. UCLA football clip, a 60 s MPEG I USC vs. Stanford waterpolo clip, a 75 s MPEG II IBM story, and a 60 s MPEG I ATT story, separated by gaps of a few seconds).]
Figure 25. System architecture.
• Display-quality flexibility, which specifies the display qualities acceptable to the user/application when data is available in multiple formats (e.g., MPEG I, MPEG II, etc.) and/or in hierarchical or layered formats (based on layered compression algorithms) (70,71).

With FPR applications, the flexibilities allow for the construction of multiple retrieval plans per presentation. Subsequently, the best plan is identified as the one that results in minimum contention at the CM server. To achieve this, three steps should be taken:

• Step 1: gathering the flexibilities
• Step 2: capturing the flexibilities in a formal format
• Step 3: using the flexibilities for optimization

In our system architecture (69), Fig. 25, the first two steps are carried out by the Profile Aware User Query Combiner (Parrot). It takes as input the user query, user profile, and session profile (e.g., type of monitor) to generate a query script as output. We assume that there exist intelligent agents that build user profiles either explicitly (i.e., by user interaction) and/or implicitly (i.e., by clandestine monitoring of the user actions, as in (72)). This query script captures all the flexibilities and requirements in a formal manner. The query script is then submitted to the Profile Aware Retrieval Optimizer (Prime), which, in turn, uses it to generate the best retrieval plan for the CM server. Using the query script, Prime defines a search space that consists of all the correct retrieval plans. A retrieval plan is correct if and only if it is consistent with the defined flexibilities and requirements. Prime also defines a cost model to evaluate the different retrieval plans. The retrieval plans are then searched (either exhaustively or by employing heuristics) to find the best plan depending on the metrics defined by the application.
In (69), we also describe a memory buffering mechanism, the Simple Memory Buffering (SimB) mechanism, that alleviates retrieval problems when the system bandwidth becomes fragmented. Our simulation studies show significant improvement when we compare the system performance for the best retrieval plan with that of the worst, or even the average, plan of all the correct plans. For example, if latency time (i.e., the time elapsed from when the retrieval plan is submitted until the onset of the display of its first object) is considered as a metric, the best plan found by Prime observes 41% to 92% improvement as compared with the worst plan, and 26% to 89% improvement as compared with the average plans when SimB is not applied (see (69)).
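The following sketch illustrates, with invented and much-simplified data structures, how a query script that records delay and ordering flexibilities can define a search space of candidate retrieval plans that an optimizer such as Prime could rank with a caller-supplied cost model; it is not the formal definition used in (69).

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass
class Clip:
    name: str
    length: float      # display length in seconds
    bandwidth: float   # Mb/s of the selected encoding

@dataclass
class QueryScript:
    clips: list
    max_gap: float       # delay flexibility: tolerable gap between clips (s)
    order_matters: bool  # ordering flexibility

def candidate_plans(qs, gap_step=1.0):
    """Enumerate correct plans: every permitted clip order and inter-clip gap."""
    orders = [tuple(qs.clips)] if qs.order_matters else permutations(qs.clips)
    for order in orders:
        gap = 0.0
        while gap <= qs.max_gap:
            start, plan = 0.0, []
            for clip in order:
                plan.append((clip.name, start))   # (clip, scheduled start time)
                start += clip.length + gap
            yield plan
            gap += gap_step

def best_plan(qs, cost):
    """Pick the plan with the lowest cost (cost model supplied by the caller)."""
    return min(candidate_plans(qs), key=cost)

# Example: prefer plans whose last clip starts as early as possible.
news = QueryScript([Clip("IBM story", 75, 3.0), Clip("ATT story", 60, 1.5)],
                   max_gap=5.0, order_matters=False)
print(best_plan(news, cost=lambda plan: plan[-1][1]))
```

The cost function stands in for the CM-server contention model; an exhaustive search is shown here, whereas a real optimizer would typically apply heuristics over a much larger space.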
CASE STUDY

The design and implementation of many CM servers have been reported in the research literature (e.g., (30,40,73,74,75)). Commercial implementations of CM servers are also in progress (e.g., Sun's MediaCenter Servers (76), Starlight Networks' StarWorks (77), and Storage Concepts' VIDEOPLEX (78); see Table 2). Many of the design issues that we discussed in this article have been practiced in most of the above prototypes. In this section, we focus on the implementation of Mitra (Mitra is the name of a Persian/Indian god with thousands of eyes and ears) (40), developed at the USC database laboratory.

Mitra: A Scalable CM Server

Mitra employs GSS with g = N, coarse-grain memory sharing, hybrid striping, a three-level storage hierarchy with SDF data flow, no pipelining, and an EVEREST replacement policy. Multi-zone disk drive optimization (11) as well as replication
Table 2. A Selection of Commercially Available Continuous-Media Servers

Vendor               Product              Max. No. of Users    Max. Streaming Capacity
Starlight            StarWorks-200M       133 @ 1.5 Mb/s       200 Mb/s
Sun                  MediaCenter 1000E    270 @ 1.5 Mb/s       400 Mb/s
Storage Concepts     VIDEOPLEX            320 @ 1.5 Mb/s       480 Mb/s (a)

(a) The VIDEOPLEX system does not transmit digital data over a network but uses analog VHS signals instead.
and migration optimizations have also been incorporated in Mitra. Mitra employs a hierarchical organization of storage devices to minimize the cost of providing on-line access to a large volume of data. It is currently operational on a cluster of HP 9000/735 workstations. It employs a HP Magneto Optical Juke-box as its tertiary storage device. Each workstation consists of a 125 MHz PA-RISC CPU, 80 MByte of memory, and four Seagate ST31200W magnetic disks. Mitra employs the HP-UX operating system (version 9.07) and is portable to other hardware platforms. While 15 disks can be attached to the fast and wide SCSI-2 bus of each workstation, we attached four disks to this chain because additional disks would exhaust the bandwidth of this bus. It is undesirable to exhaust the bandwidth of the SCSI-2 bus for several reasons. First, it would cause the underlying hardware platform to not scale as a function of additional disks. Mitra is a software system, and if its underlying hardware platform does not scale, then the entire system would not scale. Second, it renders the service time of each disk unpredictable, resulting in hiccups. Mitra consists of three software components:
1. Scheduler: This component schedules the retrieval of the blocks of a referenced object in support of a hiccup-free display at a PM. In addition, it manages the disk bandwidth and performs admission control. Currently, the scheduler includes an implementation of EVEREST, staggered striping, and techniques to manage the tertiary storage device. It also has a simple relational storage manager to insert and retrieve information from a catalog. For each media type, the catalog contains the bandwidth requirement of that media type and its block size. For each presentation, the catalog contains its name, whether it is disk resident (if so, the name of the EVEREST files that represent this clip), the cluster and zone that contain its first block, and its media type.
2. Mass Storage Device Manager (DM): Performs either disk or tertiary read/write operations.
3. Presentation Manager (PM): Displays either a video or an audio clip. It might interface with hardware components to minimize the CPU requirement of a display. For example, to display an MPEG-2 clip, the PM might employ either a program or a hardware card to decode and display the clip. The PM implements the PM-driven scheduling policy (40) to control the flow of data from the scheduler.

Mitra uses UDP for communication between the process instantiations of these components. UDP is an unreliable transmission protocol.
[Figure 26 shows twelve EVEREST disk volumes and an HP magneto-optical disk library (2 drives, 32 platters) attached over fast (80 Mb/s) and fast-and-wide (160 Mb/s) SCSI-2 buses to HP 9000/735 (125 MHz PA-RISC) workstations; DM (Disk Manager) processes DM 0 to DM 14, the Scheduler/user interface with its catalog, and PM clients (audio, MPEG-1, and MPEG-2 players) communicate through HP-NOSE over an ATM switch.]
Figure 26. Hardware and software organization of Mitra.
Mitra implements a light-weight kernel named HP-NOSE. HP-NOSE supports a window-based protocol to facilitate reliable transmission of messages among processes. In addition, it implements threads with shared memory, ports that multiplex messages using a single HP-UX socket, and semaphores for synchronizing multiple threads that share memory. An instantiation of this kernel is active per Mitra process. For a given configuration, the following processes are active: one scheduler process, a DM process per mass storage read/write device, and one PM process per active client. For example, in our twelve-disk configuration with a magneto-optical juke-box, there are sixteen active processes: fifteen DM processes and one Scheduler process (see Fig. 26). There are two active DM processes for the magneto-optical juke-box because it consists of two read/write devices (and 32 optical platters that might be swapped in and out of these two devices). The combination of the scheduler with the DM processes implements asynchronous read/write operations on a mass storage device (which are otherwise unavailable with HP-UX 9.07). This is achieved as follows. When the scheduler intends to read a block from a device (say a disk), it sends a message to the DM that manages this disk to read the block. Moreover, it requests the DM to transmit the block to a destination port address (e.g., the destination might correspond to the PM process that displays this block) and to issue a done message to the scheduler. There are several reasons for not routing data blocks to active PMs through the scheduler. First, it would waste network bandwidth with multiple transmissions of a block. Second, it would cause the CPU of the workstation that supports the scheduler process to become a bottleneck with a large number of disks. This is because a transmitted data block would be copied many times by the different layers of software that implement the scheduler process: HP-UX, HP-NOSE, and the scheduler.
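The scheduler-to-DM exchange just described can be pictured with the following sketch. It uses plain UDP sockets and an invented JSON message format purely for illustration; it is not the HP-NOSE interface, and a real implementation would fragment blocks larger than a datagram and rely on the window-based protocol for reliability.

```python
import json
import socket

def request_block(sched_sock, dm_addr, block_id, pm_addr):
    """Scheduler side: ask the DM managing a device to read block_id, send the data
    directly to the PM at pm_addr, and report completion back to the scheduler."""
    msg = {"op": "read", "block": block_id,
           "dest": list(pm_addr), "notify": list(sched_sock.getsockname())}
    sched_sock.sendto(json.dumps(msg).encode(), dm_addr)

def dm_loop(dm_sock, read_block):
    """DM side: serve read requests; the data block bypasses the scheduler and goes
    straight to the PM, keeping the scheduler's CPU and network load low."""
    while True:
        raw, _ = dm_sock.recvfrom(4096)
        msg = json.loads(raw)
        data = read_block(msg["block"])                # disk or tertiary read (returns bytes)
        dm_sock.sendto(data, tuple(msg["dest"]))       # block -> PM
        done = json.dumps({"op": "done", "block": msg["block"]}).encode()
        dm_sock.sendto(done, tuple(msg["notify"]))     # completion message -> scheduler
```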
ACKNOWLEDGMENTS We would like to thank Ali Dashti, Doug Ierardi, Seon Ho Kim, Weifeng Shi, and Roger Zimmermann for contributing to the presented material.
BIBLIOGRAPHY 1. S. Ghandeharizadeh and L. Ramos, Continuous retrieval of multimedia data using parallelism, IEEE Trans. Knowl. Data Eng., 1: 658–669, 1993. 2. D. J. Gemmell et al., Multimedia storage servers: A tutorial, IEEE Comput., 28 (5): 40–49, 1995.
8. D. Bitton and J. Gray, Disk shadowing, Proc. Int. Conf. Very Large Databases, September 1988. 9. J. Gray, B. Host, and M. Walker, Parity striping of disc arrays: Low-cost reliable storage with acceptable throughput, Proc. Int. Conf. Very Large Databases, August 1990. 10. S. Ghandeharizadeh, J. Stone, and R. Zimmermann, Techniques to quantify SCSI-2 disk subsystem specifications for multimedia, Technical Report USC-CS-TR95-610, Univ. Southern California, 1995. 11. S. Ghandeharizadeh et al., Placement of continuous media in multi-zone disks. In Soon M. Chung (ed.) Multimedia Information Storage and Management, chapter 2, Norwell, MA: Kluwer Academic, August 1996. 12. S. Ghandeharizadeh and C. Shahabi, On multimedia repositories, personal computers, and hierarchical storage systems. Proc. ACM Multimedia, 1994. 13. P. S. Yu, M. S. Chen, and D. D. Kandlur, Design and analysis of a grouped sweeping scheme for multimedia storage management. Proc. Int. Workshop Network Oper. Sys. Support Digital Audio Video, November 1992. 14. D. J. Gemmell and S. Christodoulakis, Principles of delay sensitive multimedia data storage and retrieval, ACM Trans. Inf. Sys., 10: 51–90, Jan. 1992. 15. D. J. Gemmell et al., Delay-sensitive multimedia on disks, IEEE Multimedia, 1 (3): 56–67, Fall 1994. 16. H. J. Chen and T. Little, Physical storage organizations for timedependent multimedia data, Proc. Foundations Data Organ. Algorithms FODO Conf., October 1993. 17. A. L. N. Reddy and J. C. Wyllie, I/O Issues in a Multimedia System, IEEE Comput. Mag., 27 (3): March 1994. 18. A. Cohen, W. Burkhard, and P. V. Rangan, Pipelined disk arrays for digital movie retrieval. Proceedings ICMCS ’95, 1995. 19. E. Chang and H. Garcia-Molina, Reducing initial latency in a multimedia storage system, Proc. IEEE Int. Workshop Multimedia Database Manage. Syst., 1996. 20. B. Ozden, R. Rastogi, and A. Silberschatz, On the design of a lowcost video-on-demand storage system, ACM Multimedia Syst., 4 (1): 40–54, February 1996. 21. P. Bocheck, H. Meadows, and S. Chang, Disk partitioning technique for reducing multimedia access delay, in Proc. IASTED/ ISMM Int. Conf. Distributed Multimedia Systems and Applications, August 1994, pp. 27–30. 22. S. Ghandeharizadeh, S. H. Kim, and C. Shahabi, On configuring a single disk continuous media server, Proc. 1995 ACM SIGMETRICS/PERFORMANCE, May 1995. 23. S. Ghandeharizadeh, S. H. Kim, and C. Shahabi, On disk scheduling and data placement for video servers, USC Technical Report, Univ. Southern California, 1996. 24. T. J. Teory, Properties of disk scheduling policies in multiprogrammed computer systems. Proc. AFIPS Fall Joint Comput. Conf., 1972, pp. 1–11.
3. D. Le Gall, MPEG: a video compression standard for multimedia applications, Commun. ACM, 34 (4): 46–58, 1991.
25. Y. Birk, Track-pairing: A novel data layout for VOD servers with multi-zone-recording disks, Proc. IEEE Int. Conf. Multimedia Comput. Syst., May 1995, pp. 248–255.
4. J. Dozier, Access to data in NASA’s Earth observing system (Keynote Address), Proc. ACM SIGMOD Int. Con. Manage. Data, June 1992.
26. S. R. Heltzer, J. M. Menon, and M. F. Mitoma, Logical data tracks extending among a plurality of zones of physical tracks of one or more disk devices, U.S. Patent No. 5,202,799, April 1993.
5. T. D. C. Little and D. Venkatesh, Prospects for interactive videoon-demand, IEEE Multimedia, 1 (3): 14–24, 1994.
27. R. Zimmermann and S. Ghandeharizadeh, Continuous display using heterogeneous disk-subsystems, Proc. ACM Multimedia 97, New York: ACM, 1997.
6. D. P. Anderson, Metascheduling for continuous media, ACM Trans. Comput. Syst., 11 (3): 226–252, 1993. 7. C. Ruemmler and J. Wilkes, An introduction to disk drive modeling, IEEE Computer, 27 (3): 1994.
28. S. Ghandeharizadeh and C. Shahabi, Management of physical replicas in parallel multimedia information systems, Proc. Foundations Data Organ. Algorithms FODO Conf., October 1993.
29. D. Patterson, G. Gibson, and R. Katz, A case for redundant arrays of inexpensive disks RAID, Proc. ACM SIGMOD Int. Conf. Manage. Data, May 1988.
55. A. Dan et al., Channel Allocation under Batching and VCR Control in Movie-On-Demand Servers, Technical Report RC19588, Yorktown Heights, NY: IBM Research Report, 1994.
30. F. A. Tobagi et al., Streaming RAID-A disk array management system for video files, 1st ACM Conf. Multimedia, August 1993.
56. A. Dan, D. Sitaram, and P. Shahabuddin, Scheduling policies for an on-demand video server with batching, Proc. ACM Multimedia, 1994, pp. 15–23. 57. B. Ozden et al., A low-cost storage server for movie on demand databases, Proc. 20th Int. Conf. Very Large Data Bases, Sept. 1994. 58. J. L. Wolf, P. S. Yu, and H. Shachnai, DASD dancing: A disk load balancing optimization scheme for video-on-demand computer systems, Proc. 1995 ACM SIGMETRICS/PERFORMANCE, May 1995, pp. 157–166. 59. M. Kamath, K. Ramamritham, and D. Towsley, Continuous media sharing in multimedia database systems, Proc. 4th Int. Conf. Database Syst. Advanced Appl., 1995, pp. 79–86. ¨ zden, R. Rastogi, and A. Silberschatz, Buffer replacement 60. B. O algorithms for multimedia databases, IEEE Int. Conf. Multimedia Comput. Syst., June 1996. 61. D. Rotem and J. L. Zhao, Buffer management for video database systems, Proc. Int. Conf. Database Eng., March 1995, pp. 439–448. 62. A. Dan and D. Sitaram, Buffer management policy for an on-demand video server, U.S. Patent No. 5572645, November 1996. 63. A. Dan et al., Buffering and caching in large-scale video servers. Proc. COMPCON, 1995. 64. W. Shi and S. Ghandharizadeh, Data sharing in continuous media servers, submitted to VLDB ’97, Athens, Greece, August 1997. 65. L. Golubchik, J. Lui, and R. Muntz, Reducing I/O demand in video-on-demand storage servers, Proc. ACM SIGMRTRICS, 1995, pp. 25–36. 66. C. Shahabi, S. Ghandeharizadeh, and S. Chaudhuri, On scheduling atomic and composite multimedia objects, USC Technical Report USC-CS-95-622, Univ. Southern California, 1995. 67. C. Shahabi and S. Ghandeharizadeh, Continuous display of presentations sharing clips, ACM Multimedia Systems, 3 (2): 76– 90, 1995. 68. S. T. Cambell and S. M. Chung, Delivery scheduling of multimedia streams using query scripts. In S. M. Chung (ed.), Multimedia Information Storage and Management, Norwell, MA: Kluwer, August 1996, Chapter 5. 69. C. Shahabi, A. Dashti, and S. Ghandeharizadeh, Profile aware retrieval optimizer for continuous media, submitted to VLDB ’97, Athens, Greece, August 1997. 70. K. Keeton and R. H. Katz, Evaluating video layout strategies for a high-performance storage server, ACM Multimedia Syst., 3 (2): May 1995. 71. S. McCanne, Scalable Compression and Transmission of Internet Multicast Video, PhD thesis, Berkeley: University of California, 1996. 72. C. Shahabi et al., Knowledge discovery from users web-page navigation, Proc. Res. Issues in Data Eng. RIDE Workshop, 1997. 73. P. Lougher and D. Shepherd, The design of a storage server for continuous media, Comput. J., 36 (1): 32–42, 1993. 74. J. Hsieh et al., Performance of a mass storage system for videoon-demand, J. Parallel and Distributed Comput., 30: 147–167, 1995. 75. C. Martin et al., The Fellini multimedia storage server, in S. M. Chung (ed.), Multimedia Information Storage and Management, Norwell, MA: Kluwer, August 1996, Chapter 5. 76. Sun MediaCenter Series, Server Models 5, 20, and 1000E, Sun Microsystems, Inc., 2550 Garcia Ave., Mtn. View, CA 940431100, 1996. 77. Starlight StarWorks 2.0, Starlight Networks, Inc., 205 Ravendale Drive, Mountain View, CA 94043, 1996.
31. S. Ghandeharizadeh and S. H. Kim, Striping in multi-disk video servers, High-Density Data Recording and Retrieval Technologies, Proc. SPIE, 2604, 1996, pp. 88–102. 32. S. Ghandeharizadeh, A. Dashti, and C. Shahabi, A pipelining mechanism to minimize the latency time in hierarchical multimedia storage managers, Comput. Commun., 18 (3): 170–184, March 1995. 33. S. Berson et al., Staggered striping in multimedia information systems, Proc. ACM SIGMOD Int. Conf. Manage. Data, 1994. 34. S. Ghandeharizadeh et al., Object placement in parallel hypermedia systems, Proc. Int. Conf. Very Large Databases, 1991. 35. M. Carey, L. Haas, and M. Livny, Tapes hold data, too: Challenges of tuples on tertiary storage, Proc. ACM SIGMOD Int. Conf. Manage. Data, 1993, pp. 414–417. 36. P. J. Denning, The working set model for program behavior. Commun. ACM, 11 (5): 323–333, 1968. 37. M. M. Astrahan et al., System R: Relational approach to database management, ACM Trans. Database Syst., 1 (2): 97–137, 1976. 38. H. T. Chou et al., Design and implementation of the Wisconsin Storage System, Softw. Practice Experience, 15 (10): 943–962, 1985. 39. J. Gray and A. Reuter, Chapter 13, Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 1993. 40. S. Ghandeharizadeh et al., A scalable continuous media server, Kluwer Multimedia Tools and Appl., 5 (1): 79–108, July 1997. 41. S. Ghandeharizadeh and D. Ierardi, Management of disk space with REBATE, Proc. 3rd Int. Conf. Inf. Knowl. Manage. CIKM, November 1994. 42. P. J. Denning, Working sets past and present, IEEE Trans. Softw. Engi., SE-6: 64–84, 1980. 43. S. Ghandeharizadeh et al., Placement of data in multi-zone disk drives, Technical Report USC-CS-TR96-625, Univ. Southern California, 1996. 44. K. C. Knowlton, A fast storage allocator, Commun. ACM, 8 (10): 623–625, 1965. 45. H. R. Lewis and L. Denenberg, Chapter 10, Data Structures & Their Algorithms, 367–372, New York: Harper Collins, 1991. 46. G. Copeland et al., Data placement in bubba, Proc. ACM SIGMOD Int. Conf. Manage. Data, 1988, pp. 100–110. 47. T. H. Cormen, C. E. Leiserson, and R. L. Rivest (eds.), Introduction to Algorithms. Cambridge, MA: MIT Press, and New York: McGraw-Hill, 1990. 48. J. F. Allen, Maintaining knowledge about temporal intervals, Commun. ACM, 26 (11): 832–843, 1983. 49. R. Reddy, Some research problems in very large multimedia databases, Proc. IEEE 12th Int. Conf. Data Eng., 1996. 50. G. Ozsoyoglu, V. Hakkoymaz, and J. Kraft, Automating the assembly of presentations from multimedia databases, Proc. IEEE 12th Int. Conf. Data Eng., 1996. 51. C. Shahabi, Scheduling the Retrievals of Continuous Media Objects, PhD thesis, Univ. Southern California, 1996. 52. M. Garey and R. Graham, Bounds for multiprocessor scheduling with resource constraints, SIAM J. Comput., 4 (2): 187–200, 1975. 53. S. Ghandeharizadeh et al., On minimizing startup latency in scalable continuous media servers, Proc. Multimedia Comput. Networking, Proc. SPIE 3020, Feb. 1997, pp. 144–155. 54. T. Ibaraki and N. Katoh, Resource Allocation Problems— Algorithmic Approaches, Cambridge, MA: The MIT Press, 1988.
78. The New Standard in Modular Video Server Technology, VIDEOPLEX Video-on-Demand, Storage Concepts, Inc., 2652 McGaw Avenue, Irvine, CA 92714, 1995.
SHAHRAM GHANDEHARIZADEH CYRUS SHAHABI University of Southern California
DOCUMENT INTERCHANGE STANDARDS

STANDARDS FOR DOCUMENT INTERCHANGE

Exchanging electronic texts between different formats has been a problem for many years. For example, it is difficult to convert a TeX (1) document into an MS Word (2) document. Proprietary solutions exist for specific purposes. For example, Microsoft invented the Rich Text Format (RTF) to facilitate the exchange between different versions of Word and other Microsoft Office software. But the existence of special conversion software (3) shows that even conversions between different versions of the same word processor can be problematic. Adobe's PDF (4) can be used to display complex texts on the Internet by Acrobat and a freeware application, but this solution is less suitable for printing, and reprocessing the PDF format requires yet other software. PostScript (5) is in heavy use as a printing standard but is unsuitable as a format for editing. Basically, ASCII text (6) is the only format that is more or less universally interchangeable.

As with other data types, standards exist that facilitate the interchange of documents. A standard is a documented agreement that contains technical specifications to ensure that objects can be used as described. For example, the format of credit cards, phone cards, and "smart" cards is derived from a standard. Adhering to the standard, which defines an optimal thickness (0.76 mm), ensures that the cards can be used worldwide.

The advent of multimedia and the World Wide Web has extended the concept of a document far beyond the classical media (stone, wood, paper, computer screens). ASCII and its successor UNICODE (7) are standards, but they are too limited to be useful for document interchange or archival. ASCII text does not contain any semantic information, nor does it contain any layout or other accessory information. Proprietary formats such as PDF may provide a solution in a specific environment. However, for archival purposes, open standards are required that are system and vendor independent. A truly universal solution for document archival consists of the use of a number of standards for document models. This article will describe the purpose of and the interaction between the standards that have been recently adopted in that area: SGML, CALS, HTML and XML, HyTime, DSSSL, CSS, MHEG, and PREMO. Together, they attempt to solve the problem of interchanging and reusing composite documents.

The Standardization Process

The standardization process is led by special organizations. The best known is the International Organization for Standardization (ISO) (8).
ISO is a federation of national standards bodies from some 100 countries, one from each country. ANSI (9), the American National Standards Institute, is the member body from the United States. The mission of ISO is to promote the development of standardization. Standards facilitate the international exchange of goods and services and help to develop intellectual and economic cooperation. ISO publishes international standards. The creation of a standard takes place according to precise rules and regulations. Before an International Standard (IS) is published, it is circulated among the voting participants of a committee as a Draft International Standard (DIS). The DIS is modified according to the comments received during the voting procedure and is finally published as an IS. The ISO staff is excluded from participating in the working groups and the voting committees and is not responsible for the technical content or quality of a standard.

Many ISO standards are joint publications with the International Electrotechnical Commission (IEC) (10). The close links between ISO and IEC are emphasized by the fact that the central secretariats of both organizations can be found in the same building, in Geneva, Switzerland. The IEC concentrates on standards in the fields of electricity, electronics, and related technologies. Other organizations create open standards, such as the Object Management Group (11), the W3 Consortium (12), the Internet Society [through the IETF (13)], and OASIS, formerly known as the SGML Open Consortium (14). The membership fee for participation in some of these organizations is directly proportional to a company's turnover, a fact that might bias the creation of a standard.

THE STANDARD GENERALIZED MARKUP LANGUAGE

The Standard Generalized Markup Language (SGML) (15,16) is a neutral (vendor and system independent) format that allows easy reuse of data. With SGML, data can be published simultaneously on paper or in electronic form from a single source.

Document Structures

A document has two structures:

• a physical structure, which defines the visual attributes of the text (and images) as it is laid out on a page (such as font type and size, white space, positioning of text); physical structure is also called specific structure, i.e., it is fit only for one purpose
• a logical structure, which defines the semantics of a text, identifying the meaning of the data (such as headings, titles, hypertext links, cross references) irrespective of a physical page; logical structure is also called generic structure, i.e., it may be reused for multiple purposes

We use SGML to model a document's logical structure.

Markup Languages

The machine and system independence, and hence the reusability, of SGML data is achieved by adding text strings called markup to a document. Markup originates in the publishing
industry. In traditional publishing, the manuscript is annotated by a copy-editor with layout instructions for the typesetter. These handwritten annotations are called markup.

Procedural Markup

Procedural markup refers to commands that fit alongside the text and directly influence its processing. For example, "set this text in a 12-point Helvetica bold typeface."

Descriptive Markup

Descriptive commands also provide descriptive information about their purpose, such as "this piece of text is a chapter title."

Generic Markup

In addition to delimiting parts of a document, generic markup indicates the semantics of these parts. For example, "chapter" rather than "new page, 18 point boldface Helvetica, centered." When text is marked as a chapter, any system can render it in the way it is best able to do. Style sheets of modern word processors are an example of generic markup. Generic markup has two benefits over procedural markup: generic markup achieves higher portability and is more flexible than procedural markup.

Generalized Markup

Generalized markup defines the rules for creating a generic coding language. Some generic markup languages are meta languages. New applications of data thus described become possible, such as automatic database loading. Or we can select a document or a part of it without necessarily having to scan its entire content. SGML regulates the definition of generalized markup languages.

The Abstract Nature of SGML

SGML permits its application to an infinite variety of document structures. A concrete markup language, i.e., grammar, describes the structure and semantics of a particular class of documents. It is defined in a document type definition (DTD). DTDs exist for books, articles, computer manuals, aircraft manuals, jurisprudence articles, mathematical formulas, tables, patent applications, drug applications, submarine maintenance manuals, and other, usually complex technical documents. The best known SGML application is the HyperText Markup Language (HTML), used by the World Wide Web. The logical parts of a document that are defined in the DTD are called elements. The elements in the HTML DTD describe a fairly broad class of general-purpose documents.

The SGML Syntax. SGML is a delimiter-based language. Like the names of the semantic parts of a document, the characters of the delimiters may be changed. Once a DTD is fixed and installed for use, the syntax is also fixed. Here are the most commonly used delimiters:

<         start-tag open delimiter
>         tag close delimiter
</        end-tag open delimiter
<html>    start-tag of the html element
</html>   end-tag of the html element
=         value indicator
"         literal string delimiter
&         entity reference open delimiter
;         entity reference close delimiter
&lt;      entity reference to the entity "lt"
An SGML parser scans a document for these delimiters. When one is found, it triggers a change in the way it recognizes the data following it.
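The mode switching that a parser performs at each delimiter can be observed with Python's standard html.parser module, which handles the HTML application of SGML; the toy subclass below merely prints what it recognizes and is not a general SGML parser.

```python
from html.parser import HTMLParser

class ShowMarkup(HTMLParser):
    # Each callback fires when the parser recognizes a delimiter in the input.
    def handle_starttag(self, tag, attrs):
        print("start-tag:", tag, dict(attrs))
    def handle_endtag(self, tag):
        print("end-tag:  ", tag)
    def handle_data(self, data):
        if data.strip():
            print("data:     ", data.strip())
    def handle_entityref(self, name):
        print("entity:   ", name)

# convert_charrefs=False keeps entity references visible to handle_entityref.
ShowMarkup(convert_charrefs=False).feed('<h1>Fish &amp; Chips</h1><img src="logo.gif">')
```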
Entities. An entity is a collection of characters that can be referenced as a unit. Among other things, entities allow the identification of characters that cannot be entered from the keyboard. For example, in some countries, there is no key corresponding to an e with an acute accent on it, é. This symbol is represented by the entity eacute. To refer to it, the entity is enclosed within an entity reference open delimiter (&) and an entity reference close delimiter (;). Entity references should also be used when a text character is required that is the same as an SGML delimiter. For example, instead of typing "<", the reference "&lt;" should be used, and instead of ">", "&gt;". Parameter entities are a special type of entity that can be used as a variable inside a markup definition.
Elements. Elements are the logical units that an author (or the software used by the author) should recognize and mark up. Examples of HTML elements are: html, header, title, body, img, a, p, h1, h2, etc. They are marked up by adding start-tags at the start of the element: <header>, <title>, <img>, <a>, <p>, <h1>, <h2>; and by adding end-tags at the end of the element: </header>, </title>, </a>, </h1>, </h2>. There are various different elements, depending on the data they contain. Elements can be nested inside each other, and can be defined as required, optional, or required and repeatable. Element names may be in upper, lower, or mixed case (i.e., <IMG>=<img>=<Img>).

Attributes. It is possible to qualify additional information about an element that goes beyond the structure. This information is given as the value of an attribute. For example, the <IMG> and <A> tags both have an attribute that specifies the URL of the image or the target of the link:

<img src = "http://lhcb.cern.ch/images/gif/lhcb.gif">

The <img> tag is an empty tag and has no content. In this example of an <a> tag, the URL is contained in the value of the HREF attribute:
<A name = "z0" href = "http://lhcb.cern.ch/default.htm">LHCb home page</A>

Notice that the value is always surrounded by double quotes (literal string delimiters), and that the attribute only appears on the element's start-tag. Attribute names may be specified in upper, lower, or mixed case letters (i.e., SRC=src=Src). Attribute values are literals and are left in the case in which they are specified. SGML has a number of predefined attribute types, such as unique identifiers and references thereto. For example, if the attribute "ID" in the tag <P ID="first"> is declared as a "unique identifier" in the DTD, it can be used as the target of a cross reference link. This could be done by <Pref IDREF="first"> if IDREF is declared as an attribute of type "reference to a unique identifier" in the DTD. SGML only permits references within the same document. To make a reference outside an SGML document, HyTime is needed.
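As an illustration of how an application can resolve such ID/IDREF cross references, the sketch below indexes elements by their ID attribute and follows an IDREF back to its target. It uses Python's standard xml.etree.ElementTree on a small XML fragment; the element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<doc>'
    '  <p id="first">The first paragraph.</p>'
    '  <pref idref="first">see above</pref>'
    '</doc>'
)

# Index every element that carries a unique identifier.
by_id = {el.get("id"): el for el in doc.iter() if el.get("id")}

# Resolve each reference to the element it points at.
for ref in doc.iter("pref"):
    target = by_id[ref.get("idref")]
    print(ref.text, "->", target.text)
```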
Document Type Definitions

A document type definition (DTD) is also called an SGML application. The DTD defines the grammar of a concrete generic markup language for a class of similar documents. It defines three types of markup commands: elements, attributes, and entities, and their usage constraints. It defines
• • •
the names of elements that are permissible how often an element may appear the order in which elements must appear whether markup, such as the start- or end-tag, may be omitted the contents of elements; i.e., the names of other elements that are allowed to appear inside them, down to the character data level tag attributes and their default values the names of all entities that may be used any typewriter conventions that can be exploited to ease adding markup
A DTD should not contain any information on how to process a document, or what it should look like. Some examples of well-known DTDs are

• The World Wide Web uses HTML, the HyperText Markup Language. HTML is the world's largest SGML application.
• Three DTDs designed by the AAP (Association of American Publishers) for books, scientific articles, and serials. This was the first major application of SGML.
• ISO 12083:1994 (17), "Electronic Manuscript Preparation and Markup." These are modernized and improved versions of the AAP document types, plus a DTD for mathematical formulas. This work was mainly done by a workgroup of the publications committee of the European Physical Society.
• Docbook (18). A DTD for technical documentation.
• The Text Encoding Initiative (19). A DTD for the humanities.
• The Continuous Acquisition and Life-Cycle Support (CALS) DTDs, which are discussed in the next section.

CONTINUOUS ACQUISITION AND LIFE-CYCLE SUPPORT

Continuous acquisition and life-cycle support (CALS) began as a Department of Defense (DoD) initiative in the 1980s to
Table 1. The CALS Standards

Standard               Description                                      Use
CALS MIL-STD-1840A     Automated Interchange of Technical Information   Overall standard for exchanging and archiving technical information.
IGES MIL-D-28000       Initial Graphic Exchange Specification           A graphics standard for representing 3-D CAD drawings.
SGML MIL-M-28001       Standard Generalized Markup Language             Technical documents should be marked up with SGML.
CCITT G4 MIL-R-28002   Group 4 Facsimile Standard                       A standard for describing raster (bit-mapped) data.
CGM MIL-D-28003        Computer Graphics Metafile                       A standard format for describing 2-D illustrations with geometric graphics objects.
exchange technical data with the government in electronic form rather than on paper. CALS is a management philosophy to improve the acquisition and life cycle support process. Moving to electronic creation, storage and transfer of information is not an idea specific to the DoD, and industry is very involved in policy and implementation related to CALS. Industry has informally renamed the CALS acronym to mean ‘‘Commerce at Light Speed.’’ Many government agencies like NASA, the Department of Commerce, and the Department of Energy are involved with encouraging CALS practices and the CALS strategy is practiced widely in the United Kingdom, Australia, Taiwan, Korea, Japan, and many other countries. The major CALS standard families are shown in Table 1.
THE HYPERTEXT MARKUP LANGUAGE AND THE EXTENSIBLE MARKUP LANGUAGE

Why HTML?

During the mid-1980s, research at CERN into the use of SGML for client-server documentation systems showed that there was a need for a format that could be easily translated and displayed on many different client platforms. Such a system was developed, and it offered a solution for IBM systems (searching, printing, previewing of printable documents). Fewer functions were available on other platforms (searching and viewing of ASCII text only). The system used the IBM RSCS protocol and could be used across Bitnet, a popular wide area network linking together IBM mainframe computers. It used SGML for text, but for graphics it was limited to proprietary formats. It soon became clear that a more general system was required, based on the Transmission Control Protocol/Internet Protocol (TCP/IP) with proper client-server format negotiation. This became the protocol called the Hyper Text Transfer Protocol (HTTP). To build hypertext into a documentary base, Tim Berners-Lee adopted a flexible document type which he called the Hyper Text Markup Language (HTML) (20). In the interest of simplicity, HTML contained the most common elements that were used by CERN's SGML system at that time, without imposing any particular structure. These were augmented with the tags that were required to support the hypertext paradigm of the web.
What's Next?

Despite the huge success of the Web, HTML has often been criticized because of its lack of structure. It is also not convenient in a distributed environment where an application would like to take advantage of the client's computing power. One solution to this problem is the Java (21) language and the HTML extensions that allow the execution of Java programs from pages on the Web. To ensure more efficient data transfer and to avoid the proliferation of HTML dialects, the Extensible Markup Language (XML) (22) is being proposed. XML is called extensible because it extends HTML towards a more complete support of SGML. With XML, it will be possible to send richer data over the Internet.

Extensible Markup Language

The Extensible Markup Language is a subset of SGML. The goal is to enable SGML to be served, received, and processed on the Web in the way that is now possible with HTML. For this reason, XML has been designed for ease of implementation and for interoperability with both SGML and HTML. The adoption of XML will make several types of applications much easier. For example,

• accessing specialized semantics (transferring personal data from one database to another)
• multiple presentation of data (depending on the reader's background)
• off-loading computation from the server to the client [for CORBA (23) type applications]
• obtaining personalized data (from newspapers)

These applications require data to be encoded using tags that describe a rich set of semantics. An alternative to XML for these applications is proprietary code embedded as "script elements" in HTML documents and delivered in conjunction with proprietary browser plug-ins or Java applets. A very interesting application of XML is the Channel Definition Format (CDF) (24) developed by Microsoft. CDF is an open specification that permits a web publisher to offer frequently updated collections of information, or channels, from any web
server for automatic delivery to compatible receiver programs on computers (‘‘server push’’).
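A small, hypothetical example of the kind of semantically tagged data that XML makes possible, processed on the client with Python's standard xml.etree.ElementTree; the channel markup shown is purely illustrative and is not the CDF syntax.

```python
import xml.etree.ElementTree as ET

channel = ET.fromstring("""
<channel name="Evening news">
  <item topic="business" length="75">IBM story</item>
  <item topic="sport" length="90">USC vs. UCLA football</item>
</channel>
""")

# Client-side selection by semantics rather than by layout:
for item in channel.findall("item[@topic='business']"):
    print(item.text, "-", item.get("length"), "s")
```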
HYPERMEDIA/TIME BASED STRUCTURING LANGUAGE

The Hypermedia/Time based structuring language (HyTime) is an extension of SGML (25,26) that describes connected or temporal information. It covers different aspects of linking, as well as features for multimedia purposes, including virtual time, scheduling, and synchronization.

Architectural Forms

HyTime facilities can be integrated into any SGML DTD via the technique of architectural forms. An architectural form can be compared to an object-oriented "framework." The design of a particular hypertext construct, such as a link, is defined by the architectural form, ready to be reused by the DTD writer. By giving an element in a DTD a specific HyTime attribute with values that are specified in the HyTime standard, a HyTime engine knows what action it should take.

Links and Locations

Hypertext systems like the World Wide Web do not differentiate between a link and its endpoints (the locations or anchors). The link information is treated as a whole; any editing change affects the entire linking construct. For example, the HTML anchor <A HREF="http://www.cern.ch/"> contains the target of the link. The drawbacks of this approach are obvious: as soon as the object behind the target ("http://www.cern.ch/") no longer exists or is moved elsewhere, the link needs to be updated. As the owner of the target is unaware of who links to it, this is an impossible task. HyTime makes a distinct difference between links and locations. A link is a reference between two or more locations. A location is the address of a potential anchor point, which is the actual, physical point where the link ends. This step of indirection guarantees easy maintenance of hyperlinks. HyTime brings about a consistent way of describing hyperlinks between any media. It should be noted that the World Wide Web is attempting to solve the moving target problem via Universal Resource Numbers, Uniform Resource Names, and similar efforts.

HyTime Links

Linking in SGML is achieved by assigning ID attributes to elements and referencing these through an IDREF attribute, which is limited to references inside the same document. HyTime defines two link architectural forms: contextual links (which are part of the document where the link markup resides) and independent links (stored externally to the document that the link markup connects). The richness of HyTime linking lies in the many ways one can describe locations in a document, often as a sequence of stepwise refined addresses known as location ladders. The separation of links from anchors makes links more robust. It allows for links between any kind of media, such as musical notes.
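The level of indirection that HyTime introduces between links and anchors can be sketched with two plain tables: links refer to stable location names, and only the location table records the current physical address, so moving a target means updating one entry rather than every link. The data below is invented for the illustration.

```python
# Location table (simplified): a stable location name -> current physical anchor.
locations = {
    "cern-home": "http://www.cern.ch/",
    "score-bar12": ("score.xml", "measure 12"),
}

# Independent links are stored apart from the documents they connect.
links = [
    {"from": "cern-home", "to": "score-bar12", "role": "illustrates"},
]

def resolve(link):
    """Follow both endpoints through the location table to their anchors."""
    return locations[link["from"]], locations[link["to"]]

# Moving a target only touches the location table; every link stays valid.
locations["cern-home"] = "http://home.cern/"
print(resolve(links[0]))
```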
DOCUMENT LAYOUT

Document layout is the skill of positioning text and images on a page or, for electronic media, on a screen. Document layout on electronic media is classically achieved via procedural markup. For example in TeX, the effect of the command \rm{text} will be that the word "text" will be typeset using a Roman font. Procedural markup can give powerful but nonportable results. Procedural markup commands depend on the system that will process the document (TeX in the case of the command \rm{text}). Another example that illustrates different results when procedural markup is used can be seen when an HTML document is viewed with different Web browsers. The <TABLE> tag was introduced by Netscape in version 2.0 of its Navigator. Older browsers like NCSA Mosaic will not display the table layout as intended, although all browsers are supposed to ignore unrecognized tags with no loss of content. The issue of rendering of different media on the Web is important but is not the main issue here.

By using generic markup, such as SGML, HTML, or XML, a document can be freely interchanged. However, the generic markup contains no information on the document layout. There are several ways in which this information can be captured. The Cascading Style Sheet (27) mechanism was invented by the W3 Consortium as a standard way to describe the layout of HTML documents. DSSSL (28) defines the document layout of SGML documents. When (and if) XML replaces HTML, a special version of DSSSL for online document delivery, DSSSL-O (29), will be used instead of the CSS mechanism to convey layout information over the Internet.

Cascading Style Sheets

Cascading Style Sheets (CSS) are interoperable style sheets that allow authors and readers to attach style (e.g., fonts, colors, and spacing) to HTML documents. They

• allow designers to express typographic effects
• allow externally linked as well as internal and inline style sheets
• are interoperable across Web applications
• support visual, as well as non-visual, output media
• are applied hierarchically (hence "cascading")

Unfortunately, the CSS mechanism is completely different from DSSSL.

Document Style Semantics and Specification Language

DSSSL (pronounced dis-sul), the Document Style Semantics and Specification Language, provides a standardized syntax and layout for SGML documents. DSSSL is declarative, in the sense that style specifications are made by describing final results, not by describing the procedures that are used to create the formatted results.

The Parts of the DSSSL Standard

DSSSL is divided into different parts. The most important of these are
• the Style Language
• DSSSL-Online
• the Transformation Language
• the Standard Document Query Language
• the Expression Language

The transformation process changes one SGML document (conforming to a certain DTD) into another SGML document (conforming to another DTD). The commands to do this are specified in the expression language. The style language describes the formatting of SGML documents; the style language also uses the expression language. Both the transformation language and the style language can address any object using the standard document query language. HyTime shares the standard document query language with DSSSL.
MULTIMEDIA AND HYPERMEDIA INFORMATION CODING EXPERT GROUP

The Multimedia and Hypermedia Information Coding Expert Group (MHEG) (30) is a hypermedia architecture for multimedia distribution. MHEG can run in environments with very small resources, such as set-top boxes, where Java-enabled browsers are an overload. Although it was originally developed for broadcast applications, MHEG has some substantial advantages for information and point-of-sale terminals as well as interactive TV. In the MHEG model, a video sequence is a "scene" with moving objects. The "scene" remains constant over some time and is only transmitted once by the server to the client. The moving objects, which are smaller in size, are transmitted continuously. Consequently, MHEG will relieve the load on the server. It employs an object-oriented model, generic enough to format different kinds of multimedia documents, and provides on any network the quality of service people expect from TV. In addition, it offers powerful models of spatial and temporal synchronization between different media, which are not provided for in other standards. The latest MHEG standard is MHEG-5.

Objectives

MHEG-5 was designed for interactive multimedia applications such as video on demand, home shopping, games, education, and information. The standard allows large applications to be distributed between server and client in a way that the client only requires a small amount of memory. HTML and MHEG-5 have many concepts in common, such as the focus on declarative code. However, HTML is a document description language, not a format for describing multimedia applications. MHEG-5 is built from the ground up with the needs of multimedia applications in mind, such as synchronization and speed control of streams, handling of stream events, etc. The Java language makes it possible to write applets for multimedia applications that can be embedded in HTML documents. However, the performance and size of current Java systems are inadequate for the limited resources of interactive TV, whereas current MHEG-5 implementations will fit in a few kB.
PRESENTATION ENVIRONMENT FOR MULTIMEDIA OBJECTS The Presentation Environment for Multimedia Objects (PREMO) (31) addresses the creation of, presentation of, and interaction with all forms of information using single or multiple media. PREMO is a project which has not yet resulted in a published standard. Objectives The aim of PREMO is the standardization of programming environments for the presentation of multimedia data. PREMO will support still computer graphics, moving computer graphics (animation), synthetic graphics of all types, audio, text, still images, moving images (including video), images coming from imaging operations, and other media types or combinations of media types that can be presented. PREMO complements the work of other emerging ISO standards on Multimedia, such as MHEG and HyTime. These standards do not aim at the presentation of media objects, but deal with aspects of the interchange of multimedia information. Description The Graphical Kernel System (GKS) was the first ISO standard for computer graphics. It was followed by a series of complimentary standards, addressing different areas of computer graphics such as PHIGS, PHIGS PLUS, and CGM. One of the main differences between PREMO and previous graphics standards is the inclusion of multimedia aspects. Technology has made it possible to create systems which use, within the same application, different presentation techniques that are not necessarily related to synthetic graphics, for example, video, still images, and sound. Examples of applications where video output, sound, etc., and synthetic graphics (e.g., animation) coexist are numerous. PREMO proposes development environments that are enriched with techniques supporting the display of different media in a consistent way and which allow for the various media-specific presentation techniques to coexist within the same system. PREMO needs to solve the problem of synchronization of video and sound presentation. This problem is well known in the multimedia community; its integration with the more general demands of a presentation system will obviously be a challenge. CONCLUSION The area of standards for document models is a very rapidly moving field. SGML and its existing applications will be further adopted and exploited. DSSSL and HyTime will inspire new, more sophisticated applications and software. The Internet and Java will push XML, and possibly DSSSL-O. The evolution of MHEG in the interactive TV world seems clear—perhaps MHEG and PREMO will pave the way to a true integration of mass market multimedia applications with the more conventional picture of a document that we have today.
BIBLIOGRAPHY
1. D. E. Knuth, The TeX Book, Reading, MA: Addison-Wesley, 1986. Available www: http://www.tug.org/ 2. Microsoft Corporation, Using Microsoft Word, Word Processing Program 97, Redmond, WA: Microsoft Corporation, 1997. Available www: http://www.microsoft.com/word/ 3. Microsoft Corporation, Word 6.0/95 Binary Convertor for Word 97 now available, Redmond, WA: Microsoft Corporation, 1997. Available www: http://www.microsoft.com/organizations/corpeval/500.htm 4. Adobe Systems Inc., Acrobat 3.0 User's Guide, San Jose, CA: Adobe Systems Inc., 1997. Available www: http://www.adobe.com/studio/tipstechniques/acrcreatepdf/main.html 5. Adobe Systems Inc., Postscript Language Reference Manual, Reading, MA: Addison-Wesley, 1990, 2nd ed. 6. The International Standard Corresponding to ASCII is ISO/IEC 646:1991 (ISO 7-Bit Coded Character Set for Information Interchange), Geneva: ISO, 1991. 7. The International Standard Corresponding to UNICODE is ISO/IEC 10646-1:1993 (Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and Basic Multilingual Plane), Geneva: ISO, 1993. 8. The International Organization for Standardization (ISO), 1 rue de Varembé, PO Box 56, CH-1211 Geneva 20, Switzerland. Available www: http://www.iso.ch 9. The American National Standards Institute (ANSI), 11 West 42nd Street, New York 10036. Available www: http://www.ansi.org 10. The International Electrotechnical Commission (IEC), 3 rue de Varembé, PO Box 131, 1211 Geneva 20, Switzerland. Available www: http://www.iec.ch 11. The Object Management Group (OMG), 492 Old Connecticut Path, Framingham, MA 01701. Available www: http://www.omg.org 12. The World-Wide Web Consortium, Massachusetts Institute of Technology, Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139. Available www: http://www.w3.org/pub/WWW/ 13. The Internet Society creates its standards through the Internet Engineering Task Force (IETF). Available www: http://info.isoc.org/ 14. The SGML Open Consortium, 7950 Hwy. 72 W. Suite G201, Madison, AL 35758. Available www: http://www.sgmlopen.org 15. The International Standard Corresponding to SGML is ISO/IEC 8879:1986 (Standard Generalized Markup Language), Geneva: ISO, 1986. 16. E. van Herwijnen, Practical SGML, Boston: Kluwer Academic, 1994, 2nd ed. 17. The International Standard Corresponding to ISO 12083 is ISO/IEC 12083:1994, Geneva: ISO, Electronic Manuscript Preparation and Markup, 1994. This standard is also available from ISO in electronic form with examples and explications as ‘‘The Annotated ISO 12083.’’ 18. The Davenport group maintains the Docbook DTD. Available www: http://www.ora.com/davenport/ 19. The Text Encoding Initiative. Available www: http://www.uic.edu/orgs/tei/
20. B. White, HTML and the Art of Authoring for the World-Wide Web, Boston: Kluwer Academic, 1996. 21. P. van der Linden, Just Java, Palo Alto, CA: Sun Microsystems Press, 1996.
22. Extensible Markup Language (XML). W3C Working Draft WD-xml-961114. Available www: http://www.w3.org/pub/WWW/TR/WD-xml-961114.html 23. The Common Object Request Broker: Architecture and Specification (2.0), Object Management Group, 1995. 24. Microsoft Corporation, Channel Definition Format, Redmond, WA: Microsoft Corporation, 1997. Available www: http://www.microsoft.com/standards/cdf.htm 25. The International Standard Corresponding to HyTime is ISO 10744:1992 [Hypermedia/Time-based Structuring Language (HyTime)], Geneva: ISO, 1992. 26. S. J. DeRose and D. G. Durand, Making HyperMedia Work: A User's Guide to HyTime, Boston: Kluwer Academic, 1994. 27. H. Lie and B. Bos, Cascading Style Sheets, level 1, 1996. Available www: http://www.w3.org/pub/WWW/TR/REC-CSS1-961217.html 28. The International Standard Corresponding to DSSSL is ISO/IEC 10179:1996 [Document Style Semantics and Specification Language (DSSSL)], Geneva: ISO, 1996. 29. DSSSL Online Application Profile, 1996. Available www: http://sunsite.unc.edu/pub/sun-info/standards/dsssl/dssslo/dssslo.htm 30. ISO/IEC, Coding of Multimedia and Hypermedia Information (MHEG), Geneva: ISO, ISO/IEC DIS 13522-1/4. 31. ISO/IEC, Presentation Environment for Multimedia Objects (PREMO), Geneva: ISO, ISO/IEC 14478. Available www: http://www.cwi.nl/Premo/premo.html
ERIC VAN HERWIJNEN CERN
HYPERMEDIA

In a general sense hypertext allows pieces of information to be connected to each other, and a reader is able to follow these connections directly. The connections are termed links and the action of following the links is termed navigation. The World Wide Web is the most widely accessible example of hypertext. Although it is a subset of the full hypertext model, it is an extremely powerful application. While hypertext is the original term, the term hypermedia is also in common use. Hypertext is not limited to use with text but can also include other media, such as images or even audio or video. Hypermedia often refers to this more general use of the term hypertext but can also indicate that there are links within a multimedia presentation. When reading a hypertext document, some parts of the text are highlighted to indicate that the reader can select them. The most common method of interaction is to select them by clicking with a mouse pointing device. For example, in Fig. 1, the words representing the starting point of a link are underlined. When the reader wishes to follow a link, the mouse cursor is placed over the words (for example, “Research Interests”), and the reader clicks the mouse button. The display then changes to show the new information, shown in Fig. 2. The area where the reader clicks is an anchor and is commonly referred to as a hotspot. An accessible and complete overview of multimedia and hypertext is given in J. Nielsen (20).
History The concept of hypertext has a very long history. Early religious works were a form of hypertext, in which scribes wrote comments on the original texts (these comments were referred to in later texts). These hypertexts were paper based. In 1945, before the advent of electronic computers, Vannevar Bush (1,2) proposed a mechanized system to implement the concept of hypertext. Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing. Douglas Englebart pioneered a system exhibiting a number of hypertext features, called NLS (online system) (3,4). This allowed researchers to create and share documents locally and remotely. NLS included an implementation of multiple document windows and the ability to refer directly to parts of the documents. Theodor Nelson coined the term hypertext, and in particular he wanted to use the technology available at the time to support writers. He designed a system, called Xanadu, which would allow every letter of any 1
Fig. 1. Example hypertext screen 1.
document to be directly referenced or included in any other document while preserving its original author. Copyright protection was one of the underlying issues. This gives a very brief view of some of the early hypertext visionaries. A large amount of research took place through the 1970s and 1980s, when different hypertext systems were implemented and used by limited numbers of people. This changed dramatically in the 1990s when the World Wide Web took off (5). Tim Berners-Lee demonstrated a text-only browser at the ACM Hypertext ’91 conference. During the few years that followed, the use of the Internet changed dramatically, based on the two main components of the Web—the document language, Hypertext Markup Language (HTML) (6), and the Hypertext Transfer Protocol (HTTP) (7). HTML documents were the first globally accessible hypertext documents.
Applications While hypertext is an enabling technology, and in essence domain neutral, there are properties of an application area that make it particularly suited to hypertext. These are that the information can be partitioned into self-contained parts and that the topic is relatively complex, making cross references to several topics useful. Medical knowledge is a characteristic example, since a part of the body is related in an anatomical way to neighboring parts but is also related in a functional way to other parts. For example, the lungs are close to the stomach but are part of a separate functional system. The underlying structure of the material can be reflected in the linking structure of the documents. Airplane and automobile manuals and collections of legal documents are similar types of examples.
Fig. 2. Example hypertext screen 2.
On-line news is a slightly different type of example. Here there are topics that remain, more or less, constant, such as today’s value of the New York Stock Exchange (NYSE). This can be linked to all the previous values but can also be linked to the values for today in the European and Asian markets. Hypertext allows the different connections to be explored by the reader, choosing the ones most relevant to the task. We discuss different application areas of hypertext in more depth later in this article.
Definitions of Hypertext, Multimedia, and Hypermedia Hypertext, multimedia, and hypermedia are commonly used terms with no consensus on their definitions, so we give the definitions as used throughout this article. A hypertext document is a collection of self-contained information units and referencing information, called links [Fig. 3(a)]. A hypertext presentation is the runtime manifestation of one or more hypertext documents with which a reader can interact. The information units in a hypertext document may include media types other than text. A commonly used term for this is hypermedia. A multimedia document is a collection of information units and associated synchronization information [Fig. 3(b)]. A multimedia presentation is the runtime manifestation of a multimedia document. A reader can interact with a multimedia presentation by, for example, starting or pausing the presentation. We use the term hypermedia document to denote a collection of information units along with referencing and synchronization information [Fig. 3(c)]. A hypermedia document is thus a collection of multimedia documents along with referencing information. A hypermedia presentation is the runtime manifestation of one or more hypermedia documents. A reader can interact with a hypermedia presentation either as a multimedia
Fig. 3. Hypertext, multimedia, and hypermedia documents.
presentation, by starting or pausing a multimedia presentation, or as a hypertext presentation, by navigating through the information units.
Hypertext Systems A hypertext system is the program for interpreting hypertext documents and presenting them to the reader. The system includes functionalities, such as displaying the text on the screen, highlighting the starting points for links, interpreting the reader’s interactions with the document, and following the chosen links. A wide variety of hypertext systems have been built. To give a flavor of the diversity, we give some illustrations here. KMS. Knowledge Management System (KMS) (1) is a frame-based hypertext system based on the research system ZOG (8). Frames can include text, graphics and links, where links are separated from the text and graphics and denoted by a marker. A three-button mouse is used extensively for interaction, and cursor feedback is used to explain the action of each of the mouse buttons. One of the design goals was to keep the response times of the system to under one second. This reduces user frustration and facilitates users in being able to reorient themselves by going back and forth in the document space. The structure of the linked frames is basically hierarchical, although crosslinks can be created. Notecards. The Notecards system (9) manipulates notes that consist of, for example, text or graphics in separate windows. These notes can be filed in one or more file boxes. Parts of text, in the text nodes, can link to other windows. These are indicated by drawing a box around the appropriate text. A map of links among members of a set of notes can be generated. The system was designed to provide support for information analysts, with the intention of allowing analysts to express the conceptual models they form. Intermedia. Intermedia’s documents are text based in scrollable windows, which can include graphics (10). The links are separate from the text and graphics and are denoted by a marker in the window. A description of the link destination is attached to the marker by placing a description close to the marker or, in the case that the link has several destinations, a menu is displayed. Intermedia is one of the few hypertext systems
that supports links with multiple destinations. Similar to NoteCards, Intermedia has a facility for displaying sets of links graphically. Because of the separation of links from endpoints, Intermedia can permit the same sets of information to use different or multiple sets of links. A set of links is called a web. Intermedia has been used to a great degree in teaching at Brown University. Mosaic. The Mosaic browser (11) was developed at the US National Center for Supercomputing Applications (NCSA). Mosaic added an important factor to the World Wide Web—a graphical user interface to browsing HTML documents. End users could read text in a user-friendly environment and point at the links they wanted to follow. Pictures and icons could also be included within a text page. This was extended further, allowing parts of pictures to be associated with different link destinations. If other document types were not displayable by the browser directly, then an appropriate external viewer could be invoked. In addition to the display of HTML documents, the browser also provided navigation tools, allowing users to go back to documents they had seen before and to mark those of particular interest. Another important success factor was that the browser was developed for the three common operating platforms: Microsoft Windows, Apple Macintosh, and X Windows. The Mosaic browser was developed commercially as Netscape. Microcosm. The philosophy behind the Microcosm system (12) is to allow the creation of links among documents that are not part of a specialist hypertext authoring environment. This requires the specification of anchors and links externally to the documents being linked. The Microcosm designers provide links without forcing the author to create each one separately by hand. They also provide the facility for creating links from any occurrence of a word without requiring the author to specify its position in the document (in other words, without forcing the author to specify the anchors individually). Microcosm also allows for the creation of linkbases. A linkbase is collection of links. Several linkbases can apply to the same sets of documents (for example, for different authors or for different groups of readers).
Application Areas While hypertext can be used for a broad range of purposes, a number of areas have made particular use of hypertext. These include computer-based learning, design histories, collaboration in writing, technical documentation, and entertainment. Computer-Based Learning. Study material can be provided to students as hypertext by just copying a text book. To make use of the advantages of hypertext, however, there has to be something more (for example, by allowing students each to have their own view of the information and allowing them to add their own comments). These can then be shared with a tutor or with their peers. Literature studies are a typical example of this type of use, where there is never a “correct” answer, and students can create their own interpretations and compare them with those of their peers. Another advantage of learning from the computer is that simulations of the ideas being explained (for example, the laws of gravity or the effects of heat loss from a house), can be illustrated directly by linking to a simulation. In medical domains, for example, X rays and videos of joint motion can be included in the material. The hypertext is already on line, so that the environment can be extended to record which material students have already seen, keep scores on tests, or even to prevent students from seeing material until they have read sufficient preparatory material. Collaborative Writing. Writing is not always a solitary process, and often collaboration is carried out with pen and paper. This is not always satisfactory, since paper cannot be stretched to accommodate long comments. Hypertext offers facilities to overcome this problem by allowing links to information inserted by a colleague. A number of the early hypertext systems were designed explicitly for collaborative writing, NLS in particular. This allowed researchers to speak to each other, see each other via video, point to the same documents
on their screens, and type additional material into their documents. Other systems allowed writers to produce text, pass it on to colleagues, have each colleague comment on it, and then pass all the comments back again. Design Histories. Hypertext systems are created by software engineers, and software engineers soon noticed that they could use hypertext for their own purposes. In particular, a number of design decisions are made in the course of creating a software system, and these decisions can be recorded in a hypertext system. For example, a proposition is stated, and linked to it are all the arguments for, all the arguments against, and the decision that was taken. Later in the software life cycle the decision can be reanalyzed and if, for example, an assumption turned out to be incorrect, the decision can be reversed—only after recording the new line of reasoning, of course. The gIBIS system (1) implemented these ideas and uses a graphical interface for illustrating the rhetoric of the argumentation. Technical Documentation. Complex technical documentation printed on paper is both very heavy and provides no easy means of finding what the reader wants. By putting the information on line and turning cross references into hypertext links, the reader’s life becomes much easier. To start with, readers can use an index or a search mechanism for finding the topic they are looking for. When they have found the right “page,” they can browse the links for explanations of, for example, terms they are not familiar with. Car and airplane maintenance are typical examples of this type of use. The added advantage of having the documentation on line is that the computer can be used for even richer illustrations. For example, if the reader wishes to change a spark plug, then the documentation can include a video with spoken commentary showing exactly where the components can be found and how they can be disassembled. Simulations can also be run on the computer to illustrate certain points in detail. As well as providing information, the computer can control information. For example, in a chemical plant or a power station the documentation can be directly connected with the software that runs the plant. This allows experienced controllers to use the documentation while running the plant and allows trainees to learn from the documentation while seeing the live values of parameters from the plant. An important aspect of technical documentation is that the artifact that it is describing is likely to outlive a number of generations of computers. This means that the information has to be stored in a system-independent way. Documents are often described using Standard Generalized Mark-up Language (SGML) (13). Entertainment. Hypertext offers the potential of a new art form. While paper “hypertexts” have been created, they tend to be cumbersome to read, since the reader has to “skip to page X” at every choice point in the story. By putting these online, a writer can offer the reader not only an imaginary world, but multiple worlds in which the reader can influence the progress of the “story.” There is no longer a single story, but threads of intertwined stories. In the film world, experiments have been done in which the same action has been shot from the perspective of different characters. The multiple streams are shown simultaneously (say on different television channels). A viewer can choose which character to follow. 
The difference with hypertext is that the viewer cannot see everything but must make a choice. Games can also be created using hyperlinks. For example, a world can be created and the reader can explore the world by following hyperlinks. In no time the reader can become completely disoriented, but in contrast to the technical applications, this was precisely the intention of the author. Again, the applications can go beyond simple hypertext and include videos and animation. Adventure games can be made in which players can interact with the environment (for example, draw in the environment, or pick up objects to take with them).
Visual Design for Anchors Paper-based documents have developed their own visual conventions during their extended existence. When creating hypermedia documents, extra information needs to be expressed to readers so they can identify
Fig. 4. Anchor and transition styles on following a link.
potential links and interact with them successfully. A powerful method of expressing this is visually. The problem faced by hypermedia designers is that they need to introduce new conventions, which may violate existing conventions or just add ugly clutter to the screen. Anchor style information specifies how the visual, or even audible, characteristics of an anchor are presented so that a user is aware that the anchor indicates the start of a link. Examples of anchor styles are as follows:
• There is a border around the anchor value, as illustrated in Fig. 4.
• There is a small icon next to the anchor value.
• For text items, a different color or style, such as underline or italic, is used.
• The anchor value changes appearance (e.g., color) when the mouse cursor is over it.
• The mouse cursor shape changes when it is over the anchor value.
When a user selects an anchor to follow a link, there may also be style information associated with this action. For example, the source anchor may highlight (to acknowledge that the action has been registered) before the destination of the link is displayed. The destination anchor may also be highlighted briefly to distinguish it from any other anchors present in the destination. The following are examples of anchor highlight styles:
• The appearance (e.g., width or color of the border) changes, as illustrated in Fig. 4.
• The appearance of the icon changes.
• For text items, the style and/or color changes.
• The anchor value flashes.
• The anchor value changes color.
The style of a source anchor may depend on other properties of the link emanating from it (for example, whether the reader has already seen the destination of the link). Further information on screen design for hypertext can be found in Ref. 19.
Fig. 5. Dexter hypertext model overview. A composite component (left) is linked to an atomic component (right).
Document Models While the user view is a useful interaction paradigm, in order to make full use of hypertext in a powerful publishing environment, we need to model hypermedia documents for processing. This allows information to be reused in multiple systems and also allows the same information to be reused multiple times within a single system. This work borders on that of electronic publishing using technologies such as SGML (13) and HyTime (14). Dexter. The Dexter hypertext reference model (15) was developed as a reference model to rationalize and make explicit the concepts embedded in the then existing hypertext systems. The Dexter model divides a hypertext system into three layers: a within-component layer, where the details of the content and internal structure of the different media items are stored; the storage layer, where the hypertext structure is stored; and the runtime layer, where information used for presenting the hypertext is stored and user interaction is handled. The Dexter model describes the storage layer in detail, and it is this layer that is most relevant to a hypermedia model and of which we give a brief description here. The Dexter model introduces atomic, composite, and link components and anchors. Atomic and composite components are related to each other via link components, where anchors specify the location of the ends of the links. This is shown schematically in Fig. 5. Each component has its own unique identifier. A reference to a component can be made directly to its unique identifier or via a more general component specification. The latter requires a resolver function to “resolve” it to a unique identifier (for instance, to allow the addressing of a component by means of an SQL [Structured Query Language] database query). Atomic Component. An atomic component contains four parts—presentation specification, attributes, a list of anchors, and content.
• The presentation specification holds a description of how the component should be displayed by the system.
• The attributes allow a semantic description of the component to be recorded.
• An anchor is composed of an anchor identifier and a data-dependent anchor value. The anchor identifier is unique within a component and allows the anchor to be referred to from a link component. The anchor value specifies a part of the content of the atomic component and is the only place in the model where the data type of the content is required. The anchor value is used as the base of the hotspot in the presentation of the document.
• The content is a media item of a single data type.
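The four parts listed above can be pictured as a plain data structure. The following C sketch is only an illustration; the Dexter model does not prescribe any concrete representation, and the type and field names here are hypothetical.

/* Hypothetical sketch of a Dexter atomic component (illustration only). */
typedef struct {
    char *id;      /* anchor identifier, unique within its component   */
    char *value;   /* data-dependent anchor value (e.g., a byte range) */
} Anchor;

typedef struct {
    char   *uid;               /* unique component identifier            */
    char   *presentation_spec; /* how the component should be displayed  */
    char  **attributes;        /* semantic description of the component  */
    Anchor *anchors;           /* list of anchors                        */
    int     num_anchors;
    void   *content;           /* media item of a single data type       */
} AtomicComponent;

A composite component and a link component (described next) reuse the same skeleton, with the content field replaced by a list of child components or a list of link endpoints, respectively.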
Composite Component. A composite component is a collection of other components (atomic, composite, or link) that can be treated as a single component. Its structure is the same as an atomic component with, in addition, a list of child components. The structuring of the components is restricted to a directed, acyclic graph. The anchors of a composite component refer to the content of that component. Link. A link is a connection among two or more components. Its structure is the same as an atomic component, with a list of specifiers replacing the content. A specifier defines an endpoint of the link. It consists of an anchor, a direction, and a presentation specification. A single link component can allow the expression of a range of link complexities, including a simple onesource, one-destination, uni-directional link (for example, links in HTML) or a far more complex multisource, multidestination, bidirectional link. Amsterdam Hypermedia Model. While the Dexter model is adequate for describing hypertext, it does not incorporate the temporal aspects of hypermedia. To provide such a model for hypermedia, four fundamental types of relationships must be described in addition to the media items included in the presentation: structural, timing, layout, and interaction relations. Structural relationships define logical connections among items, including the grouping of items to be displayed together and the specification of links among these groupings. Timing relations specify the presentation dependencies among media items, possibly stored at different sites. Layout specifications state where screen-based media are to be sized and placed, either in relation to each other or to the presentation as a whole. Interactions, through navigation, give the end user the choice of jumping to related information. The Amsterdam Hypermedia Model (AHM) (16) describes a model for structured, multimedia information. A diagrammatic impression of the AHM is given in Fig. 6. Timing Information. Timing information is needed to express when each of the media elements composing the presentation will appear on (and disappear from) the screen. This information can be given by specifying a time when an element should appear relative to the complete presentation (e.g., 30 s after the beginning of the presentation, a subtitle corresponding to the spoken commentary should appear) or relative to other elements being played in the presentation (e.g., 3.5 s after the spoken commentary begins, the subtitle should appear). The AHM allows the specification of timing relations defined between single items, groups of media items, or between a single item and a group. These timing relations are specified in the model as synchronization arcs. These can be used to give exact timing relations but can also be used to specify more flexible constraints, such as “play the second media item 3 ± 0.3 s after the first.” Structural Information. The structure items of the model are described here briefly. •
Composition plays an important role in multimedia presentations, where almost all presentations contain more than one media element. A composition structure allows a group of media items to be created that can then be treated as a single object. For example, a company logo can be grouped with a spoken commentary and included in several places in the overall presentation. The composite structure is used to store the timing information specified among its constituent elements.
• Anchors are a means of indexing into the data of a media item, allowing a part of the item to be referred to in a media-independent manner. The data specified by an anchor can be used at the beginning or end of a link. For example, within text, an anchor might define a character string; in an image, an area on the screen. In continuous media, such as video or animation, the area on the screen may change with time. Indeed, the hotspot may appear only for part of the duration of the media item—for example, the video item in Fig. 6 shows two hotspots, which are displayed at different times. One of the uses of a dynamic hotspot is to follow moving objects within a video or animation. In a composite item (for example, the scene represented by Fig. 6), an anchor can refer to the scene as a whole (allowing navigation to the complete scene) or indirectly to anchors in the media items. For example, a video of a bouncing ball may be accompanied by a text about the ball. Both the moving image of the ball and the word ball in the text are part of the same (composite)
anchor. The anchor in the composite representing the concept “ball” refers to two anchors—the anchor in the text item and one of the anchors in the video item. When the author wishes to relate other information to the ball’s description (e.g., how gravity affects bouncing), then a link can be made to the composite “ball” anchor.
• Links enable the end user to jump to information deemed by the author to be relevant to the current presentation. The link can lead to another presentation or a different part of the same presentation. Link context is needed for specifying which media items on the screen are affected when a link is followed. For example, in a presentation composed of multiple media items, the complete presentation may be replaced or only a part. There is also a choice of whether the information at the destination of the link replaces the scene that was being played, or whether it is displayed in addition. When the current presentation also contains one or more continuous media items, there is a further choice as to whether the current scene should stop or continue playing when the link is followed.
Fig. 6. Amsterdam hypermedia model overview.
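A flexible timing constraint of the kind quoted earlier (play the second media item 3 ± 0.3 s after the first) can be captured by a small record holding a nominal offset and a tolerance. This is a hypothetical C sketch, not a structure defined by the AHM itself.

/* Hypothetical synchronization arc: the dependent item should start between
   (offset - tolerance) and (offset + tolerance) seconds after the source. */
typedef struct {
    int    source_item;   /* media item or composite the arc starts from */
    int    dest_item;     /* media item or composite being constrained   */
    double offset_s;      /* nominal delay in seconds, e.g., 3.0         */
    double tolerance_s;   /* allowed deviation in seconds, e.g., 0.3     */
} SyncArc;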
Layout Information. Layout information includes the size of items displayed in a presentation and their position on the screen. Information about styles, such as font typefaces and styles, is also required. This information can be attached to individual media items, but in the model this is done through the use of
channels. This is particularly useful, for example, when a similar layout is required but the sizes of media items vary. Each media item is then scaled to fit into the assigned channel. Expressing the Model. The AHM does not prescribe a language for specifying a presentation conforming to the model; it could be expressed in a system-independent language such as the HyTime (Hypermedia/TimeBased Document Structuring Language) international standard (14) based on SGML (13). Similarly, transmission of (groups of) multimedia objects conforming to the model could be carried out using a standard such as MHEG (Multimedia and Hypermedia Expert Group) (17). SMIL (Synchronized Multimedia Integration Language) (18), which was developed for the World Wide Web, is based on the same concepts underlying the AHM.
Future Directions Looking at the World Wide Web (the Web), hypertext appears to be already “solved”: users can publish their own documents and access documents from Internet sites around the globe, and links can be made within and between any of these documents. There is, however, still plenty of work to be done and we discuss here a number of directions that we see as relevant in the near future of hypermedia documents: extensible document languages, multimedia, open hypermedia systems and metadata. Extensible Document Languages. HTML is a specific language supported as part of the Web, so that users are constrained by the features it offers. Work has been carried out to allow different document languages to be specified using the Extensible Markup Language (XML) (22). This allows content creators to define their own document types and yet have access to standard tools and playback environments. While this already allows an extra level of flexibility, the XML namespaces initiative (22) will go further and allow a single document to mix and match terms from different document specifications. Multimedia. Multimedia has existed in the CD-ROM world for nearly a decade. The introduction of SMIL (18) has made multimedia documents on the Web possible. Since SMIL is an XML application, this means that SMIL expressions can be used, via XML namespaces, in other XML documents. Any XML document would be able, for instance, to incorporate links, where the destination of the link varies depending on when the reader interacts with it. The difference between using SMIL as a vehicle for multimedia rather than a more programming-based construction is that the document can be manipulated by all the standard XML tools, allowing, for example, styles (as specified using Extensible Style Language [22]) to be applied in the same way that they can be applied to hypertext documents. Multimedia, and in particular temporal layout, will become an integral part of all documents, rather than something special that needs to be treated separately. Just as, for example, many word processing packages allow the creation of images. Open Hypermedia Systems. Although the Web appears to be an open hypermedia system (OHS), it suffers because all the information available on the Web has to conform to the standards and conventions of the Web. An author cannot, for example, make a word processor document available in its native format, have parts of it link to other documents, and expect other Web users to be able to view it. The open hypermedia initiative (21) is directed at enabling hypermedia functionality for existing applications and components without forcing the material to a particular document format. Using the services of an OHS, existing applications in the computing environment can become “hypermedia enabled,” thus supporting linking to and from information managed by the application without altering the information itself. Problems that need to be overcome are that an application that is OHS enabled should be enabled for all OHSs; hypermedia structures created by one OHS should be understood by all other OHSs; and single links should, for example, be able to have one end in one environment and the other in another. For example, a link from a cell in a spreadsheet should be able to link to a date in a diary application.
In other words, what we are now used to on the Web should become part of the desktop computing environment, where links and anchors are not just part of a particular hypertext browser or editor, but are ubiquitous. Metadata. Media objects, such as video or audio, can be made more accessible by associating with them descriptions of their content. Additional information about a document, such as author or creation date, is also useful. Both these types of information are metadata. Metadata can be useful for searching on, for example when looking for pictures on a specific topic, or for processing in a more general way, for example for validating a document. The document formats HTML and SMIL include constructs for providing metadata about the document. Because they are part of the Web “product mix,” they can also include Resource Description Framework (RDF) (22) information, which can be used to label the roles of document elements directly. Integration. The extensions noted previously will not occur in isolation from one another, but will develop in parallel and in combinations. Integration of all extensions in a single (complex) environment will soon be possible. An example application that involves mixing document languages, multimedia, and metadata is the flexible, or adaptive, hypermedia. These are presentations that alter depending on the user’s preferences, prior knowledge, the platform’s capabilities, or the available network bandwidth. From a single source document, decisions can be made as to which parts of the document should be displayed to the reader, or even generated. Different presentation styles are suited for different end-user platforms, mixing of document types is useful for richness of choice of type of presentation at runtime, multimedia allows temporal ordering of document elements and metadata can be used for choosing among multiple document elements. However far these developments will bring us in the short term, it is likely to be another 50 years before the visions of Bush, Englebart, and Nelson (2,3,4) will be realized in full.
BIBLIOGRAPHY 1. J. Conklin Hypertext: An introduction and survey, IEEE Comput., 20 (9): 17–41, 1987. 2. V. Bush As We May Think, The Atlantic Monthly, 176: 101–108, 1945. [Online], http://www.w3.org/History/1945/vbush/ http://ww.isg.sfu.ca/∼duchier/misc/vbush/ 3. D. C. Engelbart A conceptual framework for the augmentation of man’s intellect, in Howerton and Weeks (eds.), Vistas in Information Handling, Organization and Groupware, Washington, DC: Spartan Books, 1963, pp. 1–29. Republished in Irene Greif (ed.), Computer Supported Cooperative Work: A Book of Readings, San Mateo, CA: Morgan Kaufmann, 1988, pp. 35–65. Also republished in T. Nishigaki (ed.), NTT Publishing, 1992. 4. D. C. Engelbart W. K. English A research center for augmenting human intellect, AFIPS Conf. Proc., 33 (1): pp. 395–410. [Online], 1968. Available http://www2.bootstrap.org/ 5. R. Cailliau A Little History of the World Wide Web [Online], 1995. Available http://www.w3.org/History.html 6. D. Raggett A. Le Hors I. Jacobs HTML 4.0 Specification [Online], 1998. Available http://www.w3.org/TR/REC-html40/ 7. H. Frystyk Nielsen J. Gettys HTTP—Hypertext Transfer Protocol [Online], 1998. Available: http://www.w3.org/ Protocols/ 8. D. McCracken R. M. Akscyn Experience with the ZOG human-computer interface system, Int. J. Man-Mach. Stud., 21: 293–310, 1984. 9. F. G. Halasz T. P. Moran R. H. Trigg NoteCards in a Nutshell, ACM Conf. Human Factors Comput. Syst., Toronto, Canada, 45–52, 1987. 10. B. J. Haan et al. IRIS hypermedia services, Commun. ACM, 35 (1): 36–51, 1992. 11. B. R. Schatz J. B. Hardin NCSA Mosaic and the World Wide Web: Global hypermedia protocols for the Internet, Science, 265: 895–901, 1994. 12. W. Hall H. Davis G. Hutchings Rethinking Hypermedia: The Microcosm Approach, Dordrecht, The Netherlands: Kluwer, 1996.
13. ISO, SGML. Standard Generalized Markup Language, ISO/IEC IS 8879: 1985, 1985. 14. ISO, HyTime. Hypermedia/Time-based structuring language, ISO/IEC 10744, 1997. 15. F. Halasz M. Schwartz The Dexter Hypertext Reference Model, Commun. ACM, 37 (2): 30–39, 1994. 16. L. Hardman D. C. A. Bulterman G. van Rossum The Amsterdam Hypermedia Model: Adding time and context to the Dexter Model, Commun. ACM, 37 (2): 50–62, 1994. 17. ISO, MHEG Part 5, ISO/IEC IS 13522-5, 1997. 18. P. Hoschka (ed.) Synchronized Multimedia Integration Language, W3C Proposed Recommendation. Authors: S. Bugaj et al., [Online], 1998. Available http://www.w3.org/TR/REC-smil. 19. P. Kahn K. Lenk Screen typography: Applying lessons of print to computer displays, Seybold Report on Desktop Publishing, 7 (11): 3–15, 1993. 20. J. Nielsen Multimedia and Hypertext, The Internet and Beyond, Boston: AP Professional, 1995. 21. U. K. Wiil Open hypermedia systems working group message from the OHSWG chair, [Online], 1997. Available http://www.csdl.tamu.edu/ohs/intro/chair.html. 22. World Wide Web consortium, Technical reports and publications, [Online], 1998. Available http://www.w3.org/TR/.
LYNDA HARDMAN DICK C. A. BULTERMAN CWI
MULTIMEDIA AUDIO

Multimedia audio is an emerging and evolving topic that encompasses techniques of both analog and digital signal processing, speech, music, and computer systems architecture. This article will begin by describing the components of a multimedia audio system. This will be followed by an overview of the concepts of digital sampling and quantization, and the filtering issues that arise when dealing with audio data at different sampling rates. Computer architectures for moving audio streams within systems are then described. Popular audio compression algorithms will then be reviewed. Next follows a section on the musical instrument digital interface (MIDI), a section on methods for music synthesis, and a section on speech synthesis and recognition. Finally there is a section on 3-D audio and audio for virtual environments.
THE MULTIMEDIA AUDIO SUBSYSTEM Figure 1 shows a simplified architecture of a multimedia subsystem. As technologies evolve, some components and functionalities have transitioned back and forth between dedicated entertainment systems, computers, and video gaming systems. Examples of dedicated entertainment systems include televisions, home theater systems, future generation stereos, and so on. Computer systems with multimedia include not only desktop computers, but also web servers, palmtops, personal digital assistants (PDAs), and the like. Dedicated video gaming systems use many multimedia components, and add additional controllers and display options to the mix. Other systems which would fit in the multimedia system category might include interactive kiosks and information systems, video conferencing systems, and immersive virtual reality systems. The components shown in Fig. 1 and described below reflect a merging of audio functions from all of these areas. Multimedia Audio Subsystem Inputs Line Input(s). This is a high impedance external analog input. Common connection sources include an external mixer, home stereo preamp output, video tape audio outputs, compact disk players, cassette decks, and so forth. Approximate maximum signal levels for this input are 1 V peak to peak. Microphone Input(s). This connection is an external analog input from a microphone. Signal levels are low, at fractions of millivolts, so low-noise preamplification is required within the multimedia audio hardware. PC Speaker. The internal connection is to a traditional IBM PC speaker channel, which is capable of synthesizing only square waves at somewhat arbitrary frequencies. Because of the square-wave limitation, this channel is usually used only for ‘‘alert’’ messages consisting of beeps at different frequencies. CD Audio. This is usually an internal line-level analog connection to CD-ROM audio. MIDI Synthesizer. This is an internal line-level connection when housed on same board as the PC audio subsystem.
Digital Input(s). This is an external digital serial connection, usually via the Sony/Philips Digital Interface Format (SPDIF), connected via RCA coaxial connectors and wire, or optical fiber. This is generally considered a ‘‘high-end’’ feature, which can be added to many computer systems by the addition of a plug-in board, and is not included in most multimedia systems.
Figure 1. Multimedia subsystem showing inputs, outputs, and connections within the system.
Multimedia Audio Subsystem Outputs Speaker Output. This is typically a low-impedance 1 W to 4 W external analog output capable of driving loudspeakers directly. Line Output. This high-impedance analog output connects to an external amplifier, mixer, stereo preamplifier, or other device. Maximum signal levels approximate 1 V peak to peak. Headphone Output. This output drives stereo headphones. Sometimes the connector for this output automatically mutes the main speaker outputs when headphones are plugged in. Digital Output(s). See the Digital Input(s) subsection above. Mixer/Router—Between the Inputs and Outputs Gain/Volume. This includes input gain controls for all inputs, and master output gain controls. It usually has a log scale, because this most closely matches the human perception of loudness. Pan. This control allocates the signal from left to right, and is also called balance control. The simplest pan control is linear panning, which applies a gain of α to the signal routed to one channel, and a gain of (1.0 − α) to the other. So a panning value α of 0.0 would route all signal to the left speaker, 0.5 would route half-amplitude signals to both speakers, and 1.0 would route all signal to the right speaker. This panning scheme is problematic, however, because perceived loudness is related to power at the listener, and a center-panned signal can appear to be decreased in volume. To provide a more natural panning percept for humans, curves are designed so that a center-panned signal from each channel is 3 dB attenuated in each speaker (rather than 6 dB, as would be the case with linear panning).
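The constant-power pan law just described can be implemented by mapping the pan position onto a quarter circle so that the squared channel gains always sum to one. The C sketch below is a minimal illustration of this idea, not the exact curve used by any particular audio subsystem.

#include <math.h>

/* Constant-power panning: pan = 0.0 is full left, 0.5 is center, 1.0 is
   full right.  Because left*left + right*right is always 1, a center-panned
   signal is attenuated 3 dB per channel rather than 6 dB. */
void pan_gains(double pan, double *left, double *right)
{
    const double half_pi = 1.5707963267948966;  /* pi / 2 */
    double angle = pan * half_pi;
    *left  = cos(angle);   /* pan = 0.5 gives about 0.707 on both channels */
    *right = sin(angle);
}

With the linear scheme described above, the same center position would give a gain of 0.5 (6 dB down) on each channel, which is why a center-panned signal can sound quieter.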
Mute/Monitor/Solo. These are switch-like interfaces (usually buttons on a graphical user interface), which determine which signals get into or out of the system. Mute causes a signal to not be passed to the output; monitor causes a signal to be passed to the output. Solo causes only the selected signal channel(s) to be heard, and performs a mute operation on all other signal channels. Metering. Metering is a graphical indication of volume levels of inputs and/or outputs. Usually displayed as the log of the smoothed instantaneous power, metering is often designed to mimic the historic VU (volume unit) meters of radio and recording consoles.
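The metering behavior just described (the log of smoothed instantaneous power) might be computed per buffer as in the sketch below; the smoothing coefficient and the small floor added before the logarithm are arbitrary illustrative choices, not values taken from any particular mixer.

#include <math.h>

/* Return a smoothed level in dB for one buffer of samples in [-1, 1].
   *state carries the smoothed power between calls; start it at 0.0.
   smooth is a one-pole smoothing coefficient between 0 and 1. */
double meter_db(const float *buf, int n, double *state, double smooth)
{
    double power = 0.0;
    for (int i = 0; i < n; i++)
        power += (double)buf[i] * (double)buf[i];
    power /= (n > 0 ? n : 1);                            /* mean square power   */
    *state = smooth * (*state) + (1.0 - smooth) * power; /* one-pole smoothing  */
    return 10.0 * log10(*state + 1e-12);                 /* floor avoids log(0) */
}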
Other Audio-Related System Inputs Joystick. The joystick is technically not part of the audio subsystem of a multimedia computer, but the connector is often used to provide connections to MIDI (see below). MIDI (Musical Instrument Digital Interface). On PC architecture machines, MIDI is usually connected via a special cable that connects to the joystick port, but dedicated plug-in MIDI cards are also available. Other machines connect MIDI directly to the serial ports via an electrical adapter box, and still others require special hardware plug-in boards. MIDI input and output are almost always supported on the same joystick port, plug-in board, or serial adapter. See the section titled Musical Instrument Digital Interface for more information on the MIDI hardware and software protocols. Other Audio-Related System Outputs Joystick MIDI. See the Joystick Input subsection above. MIDI. See the MIDI Input subsection above.
Force Joystick and Other Haptic Devices. Haptics (1) is the combined touch senses of taction (sense of vibration and pressure on the skin) and kinesthesia (the sense of positions of limbs, body, and muscles). Haptic display devices are just beginning to appear as computer system peripherals. The simplest include simple vibrators attached to joysticks, embedded in gloves, in chair seats, and low-frequency audio speakers worn on the back. These are intended to provide the sensation of contact, vibration and roughness, and explosions for games. More sophisticated haptic devices do not use audio frequencies per se, but allow for precise forces to be exerted upon the user in response to algorithmic computations taking place in the computer. Connectors range from standard serial ports to specialized connectors from plug-in boards. Video Helmet/Glasses and Other Devices. These often include integrated headphone connections, with the headphones fed from the audio outputs of the system.
PCM WAVE STORAGE AND PLAYBACK Representing Audio as Numbers In a modern multimedia system, audio is captured, stored, transmitted, analyzed, transformed, synthesized, and/or played back in digital form. This means that the signal is sampled at regular intervals in time, and quantized to discrete values. The difference between quantization steps is often called the quantum. The process of sampling a waveform, holding the value, and quantizing the value to the nearest number that can be represented in the system is referred to as analog to digital (A/D) conversion. Coding and representing waveforms in this manner is called pulse code modulation (PCM). The device that does the conversion is called an analog-to-digital converter (ADC, or A/D). The corresponding process of converting the sampled signal back into an analog signal is called digital to analog conversion, and the device that performs this is called a DAC. Filtering is also necessary to reconstruct the sampled signal back into a smooth continuous time analog signal, and this filtering is usually contained in the DAC hardware. Figure 2 shows the process of analog-todigital conversion on a time/amplitude grid.
Figure 2. Linear sampling and quantization (time on the horizontal axis, amplitude on the vertical axis; q = quantization quantum, T = 1/(sampling rate)). At each timestep, the waveform is sampled and quantized to the nearest quantum value.
Sampling and Aliasing An important fundamental law of digital signal processing states that if an analog signal is bandlimited with bandwidth B, the signal can be periodically sampled at sample rate 2B, and exactly reconstructed from the samples (2). Intuitively, a sine wave at the highest frequency B present in a bandlimited signal can be represented using two samples per period (one sample at each of the positive and negative peaks), corresponding to a sampling frequency of 2B. All signal components at lower frequencies can be uniquely represented and distinguished from each other using this same sampling frequency of 2B. If there are components present in a signal at frequencies greater than one-half the sampling rate, these components will not be represented properly, and will alias as frequencies different from their true original values. Most PC-based multimedia audio systems provide a large variety of sampling rates. Lower sampling rates save processor time, transmission bandwidth, and storage, but sacrifice audio quality because of limited bandwidth. Humans can perceive frequencies from roughly 20 Hz to 20 kHz, thus requiring a minimum sampling rate of at least 40 kHz. Speech signals are often sampled at 8 kHz or 11.025 kHz, while music is usually sampled at 22.05 kHz, 44.1 kHz (the sampling rate used on audio compact disks), or 48 kHz. Other popular sampling rates include 16 kHz (often used in speech recognition systems) and 32 kHz. The maximum sampling rate available in multimedia systems is 50 kHz. To avoid aliasing, ADC hardware usually includes filters that automatically limit the bandwidth of the incoming signal as a function of the selected sampling rate. Quantization To store a value in a computer, it must be represented as a finite precision quantity. On a digital computer, that means that some number of binary digits (bits) are used to represent a number that approximates the value. This quantization is accomplished either by rounding to the quantity nearest to the actual value, or by truncation to the nearest quantity less than or equal to the actual value. With uniform sampling in time, a bandlimited signal can be exactly recovered, provided that the sampling rate is twice the bandwidth or greater; but when the signal values are rounded or truncated, the difference between the original signal and the quantized signal is lost forever. This can be viewed as a noise component upon reconstruction. A common analytical technique assumes that this discarded signal component is uniformly distributed randomly at ⫾1/2 the quantum for quantization by rounding, or from 0 to the quantum for quantization by truncation. Using this assumption gives a signal to quantization noise approximation of 6N dB, where N is the number of bits (2). This means that a system using 16 bit linear quantization will exhibit a signal to quantization noise ratio of approximately 96 dB. Most PC-based multimedia audio systems provide two or three basic sizes of audio words. Sixteen-bit data is quite common, as this is the data format used in Compact Disk systems. Eight-bit data is equally common, and is usually used to store speech data. Data size will be discussed in more detail in the section titled Audio Compression.
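As a concrete illustration of the quantization step and of the roughly 6N dB signal-to-quantization-noise figure quoted above, the following C sketch rounds a normalized sample to a 16-bit code; it assumes the input has already been limited to the range -1.0 to 1.0.

#include <math.h>

/* Quantize a sample in [-1.0, 1.0] to a signed 16-bit PCM code by rounding. */
short quantize16(double x)
{
    if (x >  1.0) x =  1.0;                  /* clip out-of-range input      */
    if (x < -1.0) x = -1.0;
    return (short)floor(x * 32767.0 + 0.5);  /* round to the nearest quantum */
}

/* Error introduced by quantizing and reconstructing one sample. */
double quantization_error(double x)
{
    return x - (double)quantize16(x) / 32767.0;  /* at most about half a quantum */
}

For N = 16, the 6N dB rule predicts a signal-to-quantization-noise ratio of roughly 96 dB, matching the figure given above.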
ANALOG MIXING VERSUS DIGITAL FILTERS As mentioned above, there are a number of sampling rates common in a multimedia audio system. Often these sampling rates must be simultaneously supported, as might be the case with speech narration and sound effects taking place along with music synthesis. To mix these signals in the digital domain requires sampling rate conversion (3,4), which can be computationally expensive if done correctly. The solution in some systems is to require that all media that will be played at the same time must be at the same sampling rate. Another solution is to do the conversion back to analog using multiple DACs, and then do the mixing in the analog domain. As processor power increases and silicon costs decrease, more systems are beginning to appear which provide software or hardware sampling rate conversion to one common system sampling rate (usually 44.1 kHz), and then mix all signals in the digital domain for playback on a single fixed sampling rate stereo DAC. SOFTWARE STREAMING MODELS Buffered (Vectorized) Data Architectures Nearly all multimedia systems contain some type of central processor. For audio input, the CPU performs acquisition of the samples from the audio input hardware, possibly compresses the samples, and then stores the resultant data, or transmits the data to another processing stage. For audio output the CPU would collect samples from another process or storage, and/or might possibly perform direct synthesis of sound in software. If the samples were compressed the CPU would perform decompression, and then transmit the sound samples to the audio output hardware. A karaoke application might perform many of these tasks at once, taking in a vocal signal via the ADCs, performing processing on the vocal signal, performing software synthesis of music, mixing the synthesized music with the processed voice, and finally putting the resulting signal out to the DACs. The CPU in many systems must not only perform audio functions, but also other computing tasks as well. For example, in running a game application on a desktop PC system, the CPU would respond to joystick input to control aspects of the game, perform graphics calculations, keep track of game state and score, as well as performing audio synthesis and processing. If the CPU were forced to collect incoming samples and/or calculate output samples at the audio sample rate, the overhead of swapping tasks at each audio sample would consume much of the processor power. Modern pipelined CPU architectures perform best in loops, where the loop branching conditions are stable for long periods of time, or at least well predicted. For this reason almost all audio subsystems work on a block basis, where a large number of samples are collected by the audio hardware before the CPU is asked to do anything about them. Similarly for output, the CPU might synthesize, decompress, process, and mix a large buffer of samples and then pass that buffer to the audio hardware for sample-rate synchronous transmission to the DACs, allowing the CPU to turn to other tasks for a time. A typical system might process buffers of 1024 samples, collected into system memory from the audio ADCs by DMA (direct memory access) hardware. This buffer size would correspond to a pro-
cessor service rate of only 44 times/s for a sampling rate of 44.1 kHz. Blocking Push/Pull Versus Callback There are two common architectures for handling buffered audio input and output—push/pull with blocking, or callback. In a push/pull with blocking architecture, the main audio process writes output (and/or reads input) buffers to the audio output software drivers. The audio driver software keeps a number of buffers (minimum two) for local workspace, and when those buffers are full, it ‘‘blocks’’ the calling write/read function until one or more of the available work buffers are empty/full. The audio driver software will then ‘‘release’’ the calling write/read function to complete its task. Another way to view this is that the function called to write and/or read doesn’t return until more samples are needed or available. Because of the blocking, the code in which the audio driver functions are called is often placed in a different execution ‘‘thread,’’ so that other tasks can be performed in parallel to the audio sample read/write operations. In a callback architecture, the audio drivers are initialized and passed a pointer to a function which is called by the audio subsystem when samples are needed or ready. This function usually just copies a buffer of samples in or out, and sets a flag so that another process knows that it must fill the local buffer with more samples before the next callback comes. SYNCHRONIZATION Since a multimedia system is responsible, by definition, for collecting and/or delivering a variety of media data types simultaneously, synchronization between the different types of data is an important issue. Audio often is the master timekeeper in a multimedia system, for two primary reasons. The first reason is that audio samples flow at the highest rate, typically tens of thousands of samples per second, as opposed to 30 frames/s or so for video and graphics, or 100 events/s for joystick or virtual reality (VR) controller inputs. This means that the granularity of time for audio samples is much smaller than those of the other data types in a multimedia system, and a timekeeper based on this granularity will have more resolution than one based on coarser time intervals of the other data types. The other reason that audio is often the master timekeeper is that the human perception system is much more sensitive to lost data in audio than it is to graphics or video data loss. An occasional dropped frame in video is often not noticed, but even one dropped sample in audio can appear as an obvious click. A dropped or repeated buffer of audio is nearly guaranteed to be noticed and considered objectionable. In some systems where audio serves as the master timekeeper, other data-delivery systems keep informed of what time it is by querying the audio drivers, and throttle their outputs accordingly, sometimes ignoring frames of their own data if the process gets behind the audio real-time clock. In other systems, the audio buffering blocking or callback process can be set up so that the frames of other data are synchronous with the audio buffers. For example, a video game might run at about 22 graphics frames/s, synchronous to audio buffers of size 1024 samples at 22.05 kHz sampling rate.
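The block-oriented, callback-style flow described above can be sketched as follows. The driver loop and function names here are hypothetical, not a real audio API; the point is that the synthesis code is invoked once per 1024-sample buffer rather than once per sample.

```python
# Illustrative sketch (hypothetical API, not a real audio driver) of
# block-based callback processing: the "driver" requests one buffer of
# 1024 samples at a time, so the application is serviced only about
# 44100/1024 ~ 43 times per second rather than at the sample rate.
import math

SAMPLE_RATE = 44100
BUFFER_SIZE = 1024

phase = 0.0
def audio_callback(n_frames):
    """Called by the (simulated) audio subsystem when it needs samples."""
    global phase
    buf = []
    step = 2 * math.pi * 440 / SAMPLE_RATE
    for _ in range(n_frames):
        buf.append(0.2 * math.sin(phase))   # synthesize a 440 Hz tone
        phase += step
    return buf

# Simulate one second of output: the callback fires once per buffer.
buffers = [audio_callback(BUFFER_SIZE) for _ in range(SAMPLE_RATE // BUFFER_SIZE)]
print("callbacks per second:", len(buffers))   # about 43
```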
In some systems such as game systems, the playback of large contiguous and mixed streams of audio is not required, but rather all that is required is the playback of small wave files and the synthesis of MIDI-controlled music (see the section titled ‘‘Musical Instrument Digital Interface’’). In these systems there is often a master timekeeper that synchronizes all events, and the audio event cues are dispatched to the wave and MIDI playback systems at (roughly) the correct times. Once a cue is received, each small wave file plays back in its entirety, and/or each MIDI event is serviced by the synthesizer. If the lengths of the individual sound files are not too long, reasonable synchronization can be achieved by this type of ‘‘wild sync,’’ where only the beginnings of the events are aligned. AUDIO COMPRESSION The data required to store or transmit CD quality audio at 44.1 kHz sampling rate, stereo, 16 bits, is 176 kbyte/s. For storage, this means that a three-minute song requires 30 Mbytes of storage. For transmission, this means that an ISDN line at 128 kbit/s is over 10 times too slow to carry uncompressed CD-quality audio. Audio compression, as with any type of data compression, strives to reduce the amount of data required to store and/or transmit the audio. Compression algorithms can be either lossless, as they must be for computer text and critical data files, or lossy, as they usually are for images and audio.
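The storage and transmission figures quoted above follow directly from the CD-quality parameters; the short check below is illustrative only.

```python
# Reproduce the CD-quality data-rate figures quoted in the text.
sample_rate = 44100          # Hz
channels = 2                 # stereo
bytes_per_sample = 2         # 16 bits

rate = sample_rate * channels * bytes_per_sample        # bytes per second
print(rate / 1000, "kbyte/s")                           # ~176 kbyte/s
print(rate * 180 / 1e6, "Mbyte for a 3-minute song")    # ~32 Mbyte
print(rate * 8 / 128e3, "x a 128 kbit/s ISDN line")     # ~11x too fast for ISDN
```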
μ-Law Compression Rather than using a linear quantization scheme, where the amplitude dynamic range is divided into equal pieces, μ-Law (or A-Law) nonlinear quantization (5) provides a simple method of data compression. Figure 3 shows a nonlinear quantization scheme. Since the quantization noise is related to the size of the quantum, having a small quantum ensures small quantization noise and a large signal to quantization noise ratio for small-amplitude signals. For large-amplitude signals, the quantum is large, but the signal to quantization noise ratio is held relatively constant. A common metric applied to μ-Law nonlinear sampling is that N bits performs roughly as well as
Figure 3. Nonlinear sampling and quantization. At each timestep, the waveform is sampled and quantized to the nearest quantum, but the quanta are small for small signal values, and larger for large signal values. This results in a roughly constant signal-to-noise ratio, independent of signal amplitude.
N + 4 linearly quantized bits. Eight-bit μ-Law is common for speech coding and transmission, providing 2 : 1 compression over 16 bit linear audio data with a perceptual quality roughly equal to that of a 12 bit system. μ-Law compression is lossy, since the original samples cannot be reconstructed from the compressed data. Adaptive Delta Pulse Code Modulation Adaptive delta pulse code modulation (ADPCM) (6) endeavors to adaptively change the quantum size to best match the signal. Further, the changes (deltas) in the signal are coded rather than the absolute signal values. At each sample, the change in the signal value relative to the previous value is computed. This delta is compared to the current adapted delta value, and the new adapted delta value is increased or decreased accordingly. All that is coded is the sign of the delta for the given step, and a three-bit value reflecting the current quantized adapted value. This allows 4 bits of information to store a 16 bit linearly quantized sample value, providing 4 : 1 compression over 16 bit linear audio data. ADPCM is supported by many multimedia audio systems, but it has not found broad popular use because the quality is not considered good enough for most program material. ADPCM compression is lossy. MPEG and Other Transform Coders Transform coders endeavor to use information about human perception to selectively quantize regions of the frequency spectrum, in order to "hide" the quantization noise in places that the human ear will not detect. Such coders strive for what is often called "perceptual losslessness," meaning that human listeners cannot tell the difference between compressed and uncompressed data. A transform coder performs frequency analysis using either the Fourier transform, subband decomposition using a filter bank, wavelet analysis, or a combination of these. The spectral information is inspected to determine the significantly loud features, and masking threshold curves (regions under which a sound could be present and not be detected due to the loudness of a nearby spectral peak) are drawn outward in frequency from these significant peaks. Quantization noise is then "shaped" by allocating bits to subband regions, so that the quantization noise lies under the masking threshold curves. Masking is also computed in time, recognizing that noise can linger significantly in time after a loud event without being heard, but cannot precede the loud event by as much time without being heard. Given some assumptions and suitable program material, the quantization noise will be allocated in such a way that a human listener will not be able to detect it. Transform coding is used in multimedia standard audio formats such as MPEG (Moving Picture Experts Group) (7) and Dolby's AC-2 and AC-3 (8). Such coders can achieve perceptually lossless compression ratios of approximately 4 : 1 or 8 : 1 on some program material, and even higher compression ratios with some small acceptable degradation. MUSICAL INSTRUMENT DIGITAL INTERFACE The musical instrument digital interface (MIDI) standard (9), adopted in 1984, revolutionized electronic music and soon
thereafter affected the computer industry. It is a shining example of what can happen when manufacturers of competing products agree on a simple hardware and software protocol standard, and in the case of MIDI it happened without the aid of a preexisting external organization or governmental body. The Basic Standard A simple two-wire serial electrical connection standard allows interconnection of musical devices over cable distances of up to 15 m, and longer over networks and extensions to the basic MIDI standard. The software protocol is best described as ‘‘musical keyboard gestural,’’ meaning that the messages carried over MIDI are the gestures that a pianist or organist uses to control a traditional keyboard instrument. There is no time information contained in the basic MIDI messages, and they are intended to take effect as soon as they come over the wire. Basic MIDI message types include NoteOn and NoteOff, Sustain Pedal Up and Down, Modulation amount, and PitchBend. NoteOn and Off messages carry a note number corresponding to a particular piano key, and a velocity corresponding to how hard that key is hit. Another MIDI message is Program Change, which is used to select the particular sound being controlled in the synthesizer. MIDI provides for 16 channels, and the channel number is encoded into each message. Instruments and devices can all ‘‘listen’’ on the same network, and choose to respond to the messages sent on particular channels. Certain messages are used to synchronize multiple MIDI devices or processes. These messages include start, stop, and clock. Clock messages are not absolute time messages, but simply ‘‘ticks’’ measured in fractions of a musical quarter note. A process or device wishing to synchronize with another would start counting ticks when it first received a MIDI start message, and would keep track of where it was by looking at the number of ticks that had been counted. These timing messages and some others do not carry channel numbers, and are intended to address all devices connected to a single MIDI network. Extensions to MIDI The most profound basic extension to MIDI has been the advent of General MIDI (9,10), and the Standard MIDI File Specifications (9). By the dawn of multimedia, MIDI had already begun to be a common data type for communication between applications and processes within the computer, with the majority of users not being musicians, and not even owning a keyboard or MIDI interface for their computer. General MIDI helped to assure the performance of MIDI on different synthesizers, by specifying that a particular program (algorithm for producing the sound of an instrument) number must call up a program that approximates the same instrument sound on all General MIDI compliant synthesizers. There are 128 such defined instrument sounds available on MIDI channels 1 to 9 and 11 to 16. For example, MIDI program 0 is grand piano, and MIDI program 57 is trumpet. On General MIDI channel 10, each note is mapped to a different percussion sound. For example, bass drum is note number 35 on channel 10, cowbell is note number 56, and so forth. The MIDI file formats provide a means for the standardized exchange of musical information. The growth of the World Wide Web has brought an increase in the availability
and use of MIDI files for augmenting web pages and presentations. A MIDI level 0 file carries the basic information that would be carried over a MIDI serial connection, including program changes, so that a well-authored file can be manipulated by a simple player program to configure a General MIDI synthesizer and play back a song with repeatable results. A MIDI level 1 file is more suited to manipulation by a notation program or MIDI sequencer (a form of multitrack recorder program that records and manipulates MIDI events). Data are arranged by "tracks," which are the individual instruments in the virtual synthesized orchestra. Metamessages allow for information that is not actually required for a real-time MIDI playback, but might carry information related to score markings, lyrics, composer and copyright information, and so forth. To help ensure consistency and quality of sounds, the Downloadable Sounds Specification provides means to supply PCM-sampled sounds to a synthesizer (see section on PCM Sampling and Wavetable Synthesis below). Once downloaded, author/application-specific custom sounds can be controlled using standard MIDI commands. Improvements, Enhancements, and Modifications of MIDI So far the most successful extensions to MIDI have been those that do not require a change in hardware or software to support existing legacy equipment and files. Compatibility has been at the heart of the success of MIDI so far, and this has proven to be important for improvements and extensions to MIDI. MIDI Time Code (MTC) is an extension to the basic MIDI message types, which allows for absolute time information to be transmitted over existing MIDI networks (9). MIDI Machine Control (MMC) specifies new messages that control audio recording and video production studio equipment, allowing for synchronization and location functions (9). MIDI Show Control (MSC) specifies new messages that are designed specifically to control stage lighting and special-effects equipment in real time (9). MIDI GS and XG are system specifications which extend the range of sounds and sound control beyond basic general MIDI. Extended control of effects and additional banks of sounds are compatible with general MIDI, and to some degree between GS and XG. Up-to-date information can be obtained from Roland (GS) and Yamaha (XG). Less successful extensions to MIDI include XMIDI and ZIPI, sharing the fact that they require new hardware to be useful. XMIDI (11) endeavors to extend the electrical and message standards by transparently introducing additional higher bandwidth transmissions on the existing MIDI wiring, but has not yet found adoption among manufacturers and the MIDI Manufacturers Association. ZIPI (12) is a more generic high-bandwidth network, requiring new hardware and protocols, but also including MIDI as a subsystem. Various serial systems and protocols for modern recording studio automation and control allow for MIDI data in addition to digital audio to flow over optical cabling. As of this writing, manufacturers of professional music and recording equipment have
yet to agree on a standard, and are pursuing their own system designs. SOUND AND MUSIC SYNTHESIS PCM Sampling and Wavetable Synthesis The majority of audio on computer multimedia systems comes from the playback of stored PCM (pulse code modulation) waveforms. Single-shot playback of entire segments of stored sounds is common for sound effects, narrations, prompts, musical segments, and so forth. For musical sounds, it is common to store just a loop, or table, of the periodic component of a recorded sound waveform and play that loop back repeatedly. This is called wavetable synthesis. For more realism, the attack or beginning portion of the recorded sound is stored in addition to the periodic steady-state part. Figure 4 shows short periodic loops and longer noise segments being used for speech synthesis, as described later. Originally called sampling synthesis in the music industry, all synthesis involving stored PCM waveforms has become known as wavetable synthesis. Filters are usually added to sampling and wavetable synthesis to allow for control of spectral brightness as a function of intensity, and to get more variety of sounds out of a given set of samples. In order to synthesize music, accurate and flexible control of pitch is necessary. In sampling synthesis, this is accomplished by dynamic sample rate conversion (interpolation). Most common is linear interpolation, where the fractional time samples needed are computed as a linear combination of the two closest samples. Sample interpolation is often misunderstood as a curve-fitting problem, but for best audio quality it should be viewed as a filter design problem (4,13). A given sample can only be pitch shifted so far in either direction before it begins to sound unnatural. This is dealt with by storing multiple recordings of the sound at different pitches, and switching or interpolating between these upon resynthesis. This is called ‘‘multisampling.’’ Additive and Subtractive Synthesis Synthesis of signals by addition of fundamental waveform components is called additive synthesis. Since any function can be uniquely represented as a linear combination of sinusoidal components, the powerful tool of Fourier analysis (14) gives rise to a completely generic method of sound analysis and resynthesis. When there are only a few sinusoidal compo-
nents, additive synthesis can be quite efficient and flexible. However, most sounds have a significant number of components which vary quite rapidly in magnitude and frequency. Essentially no multimedia subsystem sound synthesizers use strictly additive synthesis, but this could become a reality as processor power increases, and costs decrease. Some sounds yield well to subtractive synthesis, where a complex source signal is filtered to yield a sound close to the desired result. The human voice is well modeled in this way, because the complex waveform produced by the vocal folds is filtered and shaped by the resonances of the vocal tract tube. The section titled Speech will discuss subtractive synthesis in more detail. FM Synthesis For the first few years of PC-based audio subsystems, sound synthesis by frequency modulation (FM) was the overriding standard. Frequency modulation relies on the modulation of the frequency of simple periodic waveforms by other simple periodic waveforms. When the frequency of a sine wave of average frequency f_c (called the carrier) is modulated by another sine wave of frequency f_m (called the modulator), sinusoidal sidebands are created at frequencies equal to the carrier frequency plus and minus integer multiples of the modulator frequency. This is expressed as y(t) = sin[2πt(f_c + Δf_c cos(2πt f_m))], where I is the index of modulation, defined as I = Δf_c/f_m. Carson's rule (a rule of thumb) states that the number of significant bands on each side of the carrier frequency is roughly equal to I + 2. For example, a carrier sinusoid of frequency 600 Hz, a modulator sinusoid of frequency 100 Hz, and a modulation index of 3 would produce sinusoidal components of frequencies 600, 700, 500, 800, 400, 900, 300, 1000, 200, 1100, 100 Hz. Inspecting these components reveals that a harmonic spectrum (all integer multiples of some fundamental frequency) with 11 significant harmonics, based on a fundamental frequency of 100 Hz, can be produced by using only two sinusoidal generating functions. By careful selection of the component frequencies and index of modulation, and by combining multiple carrier/modulator pairs, many spectra can be approximated using FM. The amplitudes and phases of the individual components cannot be independently controlled, however, so FM is not a truly generic waveform synthesis method. The amplitudes of sideband components are controlled by Bessel functions, and since the very definition of FM is a nonlinear function transformation, the system does not extend additively for complex waveforms (or sums of modulator or carrier sine functions). Using multiple carriers and modulators, connection topologies (algorithms) have been designed for the synthesis of complex sounds such as human voices, violins, and brass instruments (15).
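A minimal FM synthesis sketch (not the article's code) using the example values above: carrier 600 Hz, modulator 100 Hz, index I = 3. It is written in the phase-modulation form commonly used in implementations, which produces the same pattern of sidebands at f_c ± k f_m.

```python
# Illustrative FM tone generator: sidebands appear at fc +/- k*fm with
# amplitudes governed by Bessel functions, as described in the text.
import math

def fm_tone(fc, fm, index, duration_s=0.5, sr=44100, amp=0.5):
    out = []
    for i in range(int(duration_s * sr)):
        t = i / sr
        out.append(amp * math.sin(2 * math.pi * fc * t
                                  + index * math.sin(2 * math.pi * fm * t)))
    return out

tone = fm_tone(600.0, 100.0, 3.0)
print(len(tone), tone[:3])
```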
Figure 4. Synthesizing the word "hi" from individual PCM wave files. This is done by using a recording of the whispered "ah" sound, followed by the vowel "ah," then by the vowel "ee." Since vowel sounds are quasiperiodic, only one period of the vowel sounds is stored.
Physical Modeling and Generalized Algorithmic Synthesis
Physical modeling synthesis (16) endeavors to model and solve the acoustical physics of sound-producing systems in order to synthesize sound. Classical theories and recent advances in knowledge about acoustics, combined with algorithm advances in digital signal processing and increases in processor power, allow certain classes of sound-producing sys-
tems to be physically modeled in real time. Unlike additive synthesis, which can use one powerful generic model for any sound, physical modeling requires a different model for each separate family of musical instrument or sound-producing object. The types of fundamental operations required to do physical modeling are varied, and include waveform playback, linear filtering, lookup tables, and other such operations. These represent nearly a complete superset of the fundamental operations required to do other types of synthesis. In the history of computer music, there is a recurring notion of a set of simple ‘‘unit generators,’’ which are used to build more complex algorithms for synthesis and processing (17). The need for multiple algorithms gives rise to a new type of generalized synthesizer architecture. New systems for multimedia sound synthesis, particularly those that run in software on a modern host microprocessor, are beginning to surface based on a generalized algorithmic framework. In such systems, the algorithm that best suits the particular sound can be used. Such systems face problems of resource allocation, because different algorithms exhibit different memory and processor load characteristics. Since, however, each synthesis algorithm is best suited to certain types of sounds, generalized algorithmic synthesis can yield the theoretical best quality for all sounds. Further, the flexible dynamic framework of a generalized software synthesizer allows for new algorithms to be added at a later time. The upcoming MPEG-4 audio layer standard includes a component called structured audio language (SAOL) (18), which provides a framework for manipulating audio in more parametric and flexible ways. There is a component allowing for algorithmic synthesis and processing specifications, thus supporting generalized algorithmic synthesis directly in the audio layer of the multimedia data. SPEECH Through science fiction, long before the advent of the personal computer, the public was introduced to the idea that all machines of the future would speak naturally and fluently, and understand all human speech. For this reason, of all the multimedia data types, speech is the one for which the public has the highest expectations, yet has proven to be the most difficult to implement effectively. Since the areas of speech synthesis, recognition, and coding are quite large, only a brief introduction to these topics will be provided here. Concatenative Phoneme Speech Synthesis A large amount of artificial digital speech is generated by concatenative waveform synthesis. Since the steady states of speech vowels are quasiperiodic waveforms, only one period of each vowel can be stored, and these can be looped to generate arbitrary length segments of these vowels. Since consonant sounds are not periodic, however, they must be stored in longer wave files. Each vowel and consonant is called a phoneme, and at least 40 phonemes are required to represent the sounds encountered in American English speech. To generate speech, these phoneme segments are concatenated together. Pitch shifting is accomplished by sample rate interpolation of the vowel and voiced consonant sounds. Figure 4 shows how the word ‘‘hi’’ might be synthesized using concatenative phoneme synthesis.
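The looping and pitch-shifting operations described above can be sketched in a few lines. In this illustration (not from the article) a single stored cycle of a vowel-like waveform stands in for a recorded vowel period; it is repeated and resampled by linear interpolation to produce an arbitrary-length segment at a new pitch.

```python
# Loop one stored cycle of a quasiperiodic waveform and re-pitch it by
# linear interpolation, as in wavetable/concatenative synthesis.
import math

period = [math.sin(2 * math.pi * k / 64) for k in range(64)]   # stand-in for one stored vowel cycle

def loop_vowel(period, n_samples, pitch_ratio=1.0):
    """Repeat one stored cycle, resampling it by linear interpolation."""
    out, pos = [], 0.0
    n = len(period)
    for _ in range(n_samples):
        i = int(pos)
        frac = pos - i
        a, b = period[i % n], period[(i + 1) % n]
        out.append(a + frac * (b - a))        # linear interpolation
        pos += pitch_ratio                    # >1 raises pitch, <1 lowers it
    return out

vowel = loop_vowel(period, 8000, pitch_ratio=1.25)
print(len(vowel), vowel[:3])
```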
Concatenative Diphone Speech Synthesis The transitions from one phoneme to another are not abrupt, because it takes time for the speech articulators (tongue, jaw, etc.) to move from one point to another. Concatenative phoneme synthesis does not allow for smooth transitions between phonemes, and mere cross-fading by interpolation from one phoneme to the other does not produce realistic transitions. More natural synthetic speech can be generated by storing "diphones," which are recordings of entire transitions from one phoneme to the other. This takes much more storage, because if there are N phonemes, there will be N² diphones, and the diphone transitions involving vowels are longer than simple single periods of a periodic waveform. In any given language, certain diphone transitions will not be required, and others occur so infrequently that it is acceptable to store only phonemes and do the lower-quality synthesis in cases where rarely used diphones are needed. Triphones can also be stored and synthesized. Source/Filter Speech Synthesis When the pitch of stored speech waveforms is shifted by sample rate interpolation, undesired distortions of the waveform and accompanying spectra occur. If the pitch shift range is small, this does not generate troubling perceptual artifacts. However, the pitch range of speech can be quite large, perhaps a factor of two for emotional speech. For large pitch shifts, the distortion manifests itself as making the sound seem like the size of the speaker's head is changing. This might be familiar to the reader from listening to tape recordings that have been sped up or slowed down. The speeding up might make the voice sound like it belongs to a chipmunk or mouse, for example. If the voice of a child or woman is desired, mere pitch shifting is not enough to give the right sound. For this reason, a more parametric synthesis model is required to allow for truly flexible speech synthesis over a large range of pitches and types of voices. In human vowels, the source waveform of the vocal fold vibration is filtered by the resonances of the vocal tract acoustic tube. For unvoiced consonants (s, f, etc.) the source generated by turbulence noise is filtered by the vocal tract resonances. Voiced consonants (z, v, etc.) contain components of both noise and quasiperiodic oscillation. A speech synthesizer based on a source/filter model can be used to generate more natural-sounding and flexible artificial speech. In such a synthesizer, source generators such as a waveform oscillator and a noise source are filtered by digital filters to shape the spectrum to match speech sounds. Transitions between phonemes are accomplished by driving the filter and source parameters from one point to another. Pitch changes are easily accomplished by simply changing the frequency of the waveform generator. Speech Coding and Compression Speech can be compressed and coded (19) using the same techniques that were described in the section titled Audio Compression. Having a parametric synthesis model closely based on the actual human speech mechanism, however, allows higher quality at higher compression ratios. Linear predictive coding (LPC) is a signal-processing technique that allows the automatic extraction of filter and source
parameters from a speech waveform (20). LPC is used to code speech with a recursive filter of order 8 to 20, updated a number of times per second (typically 20 to 200). The extracted source waveform can be parametrically modeled using a simple oscillator and noise source, or more realistically modeled using vector codebook quantization methods. High-quality speech transmission can be accomplished using only a few thousand bits per second, representing significant compression improvements over 8 kHz μ-Law coded speech. Speech Recognition Speech recognition systems (21) can be divided into four main groups, based on whether they are intended for isolated words or continuous speech, or designed to be speaker independent or trained for a particular speaker. A speaker-dependent, isolated word system is essentially the farthest from our experience as human speech recognizers, but such systems can be quite accurate and useful for command-type applications such as a voice interface to computer menu commands. Some command systems such as voice dialing must function as speaker-independent, isolated word systems, because the huge number of possible voices makes it impossible to train the system on a few individual voices. Speaker-independent, continuous speech systems have proven to be the most difficult to implement, because of many inherent problems. Some of these problems include identifying where words or phonemes start and end (called segmentation), determining the actual speech from the natural disfluencies and repetitions present in natural speech (uhm, er, I mean, well, uh, you know?), and dealing with the differences in individual speakers and voices. Speech recognizers are usually based on a "front-end," which processes the speech waveform and yields a few parameters per time period which are matched against a library of templates. Typical front-ends include spectrum analysis and linear predictive coding, as described in the "Speech Coding and Compression" subsection. The templates are often phoneme related. Statistical methods involving hidden Markov models (HMMs) are used to pick the most likely candidate words from a set of template matches. The application of HMMs can extend to the sentence level, helping to determine which are the most likely utterances, given cases of multiple possible ways to say the same thing, or given words that sound alike and might be easily confused by the recognizer.
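As an illustration of the LPC analysis mentioned above (a rough sketch, not the article's algorithm), the following code extracts predictor coefficients from one frame using the autocorrelation method and the Levinson-Durbin recursion; the frame length and filter order are arbitrary choices.

```python
# Frame-based LPC "front-end" sketch: autocorrelation + Levinson-Durbin.
import math

def autocorr(frame, lag):
    return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))

def lpc(frame, order=10):
    r = [autocorr(frame, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)            # a[1..order] are the predictor coefficients
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err                  # coefficients and residual energy

# A synthetic "voiced" frame: a decaying 200 Hz oscillation sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 200 * n / 8000) * 0.99 ** n for n in range(240)]
coeffs, residual = lpc(frame, order=8)
print([round(c, 3) for c in coeffs], round(residual, 6))
```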
3D AUDIO AND AUDIO FOR VIRTUAL ENVIRONMENTS
The human auditory system is good at using only the information at the eardrums to determine the locations of sound-producing objects. The primary perceptual cues are time delays between the two ears, amplitude differences between the two ears, filter functions related to the shadowing effects of the head and shoulders, and complex filter functions related to the twists and turns of the pinnae (outer ears). Three-dimensional audio (22) uses headphones or speakers to place sources at arbitrary perceived locations around the listener's head. If headphones are used and the appropriate cues are synthesized into the "binaural" signals presented to the two ears, virtual source locations can be manipulated directly. Using speakers requires more signal processing to cancel the effects of each speaker signal getting to both ears. Virtual reality systems using helmets with vision systems often have an integrated stereo audio capability, and the head-tracking systems used to manipulate the visual displays in these systems can be used to manipulate the audio displays as well. Multispeaker systems such as quadrophonic sound and ambisonics (23) have historically been used in certain research settings, but have not been broadly adopted for various reasons of economy, multiple competing standards, and lack of commercially available program material. Recently, various surround sound formats such as Dolby ProLogic, Dolby 5.1 Digital, DTS, and Dolby AC-3 are entering the home entertainment market. These promise to give a multispeaker capability to future multimedia systems, making more immersive audio experiences in the home likely. In addition to the movie and audiophile music titles, which will initially drive the market for these new systems, one should expect to see games and other multimedia content emerge, which can take advantage of the multiple speaker systems.
BIBLIOGRAPHY
1. G. Burdea, Force and Touch Feedback for Virtual Reality, New York: Wiley, 1996.
2. K. Steiglitz, A Digital Signal Processing Primer, with Applications to Digital Audio and Computer Music, Menlo Park, CA: Addison-Wesley, 1995.
3. R. Crochiere and L. Rabiner, Multirate Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1983.
4. J. Smith and P. Gossett, A flexible sampling rate conversion method, IEEE Proc. Acoust., Speech Signal Process., 2: 1984, pp. 19.4.1–19.4.4.
5. L. Rabiner and R. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, NJ: Prentice-Hall, 1978.
6. International Multimedia Association, ADPCM, IMA Compatibility Proc., Sect. 6, May 1992.
7. ISO/IEC Working Papers and Standards Rep., JTCI SC29, WG11 N0403, MPEG 93/479, 1993.
8. G. Davidson, W. Anderson, and A. Lovrich, A low-cost adaptive transform decoder implementation for high-quality audio, IEEE Publication # 0-7803-0532-9/92, 1992.
9. MIDI Manufacturers Association, The Complete MIDI 1.0 Detailed Specification, La Habra, CA: MMA, 1996.
10. S. Jungleib, General MIDI, Madison, WI: A-R Editions, 1995.
11. The XM spec: Is MIDI 2.0 finally at hand? Keyboard Magazine, June, 1995.
12. Special Issue on the ZIPI Music Interface Language, Comput. Music J., 18 (4): Cambridge, MA: MIT Press, 1994.
13. L. Rabiner, A digital signal processing approach to interpolation, Proc. IEEE, 61: 692–702, 1973.
14. R. Bracewell, The Fourier Transform and Its Applications, New York: McGraw-Hill, 1986.
15. C. Roads and J. Strawn (eds.), Foundations of Computer Music, Cambridge, MA: MIT Press, 1985.
16. Two Special Issues on Physical Modeling, Comput. Music J., 16 (4): 1992, 17 (1): 1993.
17. M. Mathews, The Technology of Computer Music, Cambridge, MA: MIT Press, 1969.
18. B. Grill et al. (eds.), ISO 14496-3 (MPEG-4 Audio), Committee Draft, ISO/IEC JTCI/SC29/WG11, document W1903, Fribourg CH, October 1997.
19. A. Spanias, Speech coding: A tutorial review, Proc. IEEE, 82: 1541–1582, 1994.
20. J. Makhoul, Linear prediction: A tutorial review, Proc. IEEE, 63: 561–580, 1975.
21. J. Picone, Continuous speech recognition using hidden Markov models, IEEE Mag. Acoust. Speech Signal Process., 7 (3): 26–41, 1990.
22. D. Begault, 3-D Sound for Virtual Reality and Multimedia, San Diego: Academic Press, 1994.
23. M. Gerzon, Ambisonics in multichannel broadcasting and video, J. Audio Eng. Soc., 33 (11): 859–871, 1985.
Reading List
D. O'Shaughnessy, Speech Communication: Man and Machine, Reading, MA: Addison-Wesley, 1987.
C. Roads, A Computer Music Tutorial, Cambridge, MA: MIT Press, 1996.
PERRY R. COOK Princeton University
MULTIMEDIA HOLOGRAPHIC STORAGE. See HOLOGRAPHIC STORAGE.
MULTIMEDIA INFORMATION DELIVERY. See VIEWDATA.
Multimedia Information Systems
W4806
Wasfi Al-Khatib, M. F. Khan, Serhan Data, Arif Ghafoor
Abstract
The sections in this article are
Requirements of Multimedia Information Systems
Notion of Time for Multimedia Data
Content-Based Retrieval of Multimedia Data
Multimedia Document Modeling and Retrieval
Conclusion
Table 1. Differences Between Conventional and Multimedia Data
Conventional Data | Multimedia Data
Types known to programming languages (character, integer, real) | Not generally known
Relatively small size | Large size (memory and bandwidth)
Fixed size atomic units | Variable size atomic units
Not highly interactive | Highly interactive
No special temporal requirements | Temporal synchronization needed
No special interface for querying | Special interface for querying
Frequent updating | Mostly archival
MULTIMEDIA INFORMATION SYSTEMS Multimedia information systems afford users conventional database functionalities in the context of multimedia data, including audio, image, and video data. Thus multimedia data can be queried on the basis of their semantic contents. Since such contents, in general, are not described in words as in conventional databases, conventional data indexing and search mechanisms cannot be used for processing queries on such data. How can we employ technology in order to obtain full database functionality from multimedia data stores? This article attempts to answer this question by describing the challenges, progress to date, and future directions in the area of multimedia information systems. Multimedia information technology will allow users to store, retrieve, share, and manipulate complex information composed of audio, images, video as well as text. A variety of fields, including business, manufacturing, education, computer-aided design (CAD)/computer-aided engineering (CAE), medicine, weather, and entertainment, are expected to benefit from this technology. A broad range of applications includes remote collaboration via video teleconferencing, improved simulation methodologies for all disciplines of science and engineering, and better human-computer interfaces (1). There is a potential for developing vast libraries of information including arbitrary amounts of text, video, pictures, and sound more efficiently usable than traditional book, record, and tape libraries of today. These applications are just a sample of the kinds of things that may be possible with the development and use of multimedia. As the need for multimedia information systems is growing rapidly in various fields, management of such information is becoming a focal point of research in the database community. Multimedia data possess certain distinct characteristics
from conventional data, as shown in Table 1. This proliferation of applications also explains partly why there is an explosion of research in the areas related to the understanding, development, and utilization of multimedia-related technologies. Depending on the application, multimedia data may have varying quality of presentation requirements. For example, in medical information systems, electronic images such as X rays, MRIs, and sonograms may require high-resolution storage and display systems. Systems designed to store, transport, display, and manage multimedia data require considerably more functionality and capability than conventional information management systems handling textual and numeric data. Some of the hardware problems faced include the following: Storage devices, which are usable on-line with the computers, are not ‘‘big’’ enough. The speed of retrieval from the available storage devices, including disks, is not sufficiently fast to cope with the demands of many multimedia applications. Conversely, storing multimedia data on disk is also relatively slow. Cache memories are a precious resource, but they are too small when it comes to multimedia, hence even greater demands for efficient resource management. Communication bandwidth tends to be another problem area for multimedia applications. A single object may demand large portions of bandwidth for extended periods of time. The problems of communication are compounded because of the delay-sensitive nature of multimedia. Storage problems for multimedia and for similar high-performance applications have been identified as deserving high priority. In multimedia information systems, mono-media may represent individual data entities that serve as components of some multimedia object such as electronic documents or medical records containing electronic images and sonograms. Furthermore these objects/documents can be grouped together for efficient management and access. It is essential that the user be able to identify and address different objects and to compose them both in time and space. The composition should be based on a model that is visually presentable to the user (see Fig. 1). It is therefore desirable that a general framework for spatiotemporal modeling should be available that can ultimately be used for composing and storing multimedia documents. The article is organized as follows. First we introduce the basic concepts of multimedia data, including fundamental pragmatics of multimedia information systems. This is followed by a description of peculiarities of audio, image, and video data, leading to the necessity of handling the temporal
Figure 1. An example multimedia document from the manufacturing domain, along with its document model.
dimension in multimedia data processing. Then we introduce the vital issues inherent in content-based retrieval of image and video data. In order to allow content-based queries on multimedia data, designers must employ novel data modeling and processing techniques. These models and techniques are also covered in this section. Real-world multimedia documents consist of a mix of text, audio, image, and video data. Therefore, techniques and models specific to each of the component media need to be combined in order to handle multimedia documents. In the final section, we study models and techniques used to describe, author, and query complex multimedia documents consisting of several component media. Finally, we present conclusions and general reflections on the technical future of multimedia information systems. REQUIREMENTS OF MULTIMEDIA INFORMATION SYSTEMS From the systems point of view, because of the heterogeneous nature of the data, storage, transportation, display, and management of multimedia data must have considerably more functionalities and capabilities than the conventional information management systems. The fundamental issues faced
by the multimedia information management researchers/designers are as follows:
• Development of models for capturing the media synchronization requirements. Integration of these models with the underlying database schema will be required. Subsequently, in order to determine the synchronization requirements at retrieval time, transformation of these models into a metaschema is needed. This entails the design of object retrieval algorithms for the operating systems. Similarly, integration of these models with higher-level information abstractions, such as Hypermedia or object-oriented models, may be required.
• Development of conceptual models for multimedia information, especially for video, audio, and image data. These models should be rich in their semantic capabilities for abstraction of multimedia information and be able to provide canonical representations of complex images, scenes, and events in terms of objects and their spatiotemporal behavior.
• Design of powerful indexing, searching, accessing, and organization methods for multimedia data. Search in
multimedia databases can be quite computationally intensive, especially if content-based retrieval is needed for image and video data stored in compressed or uncompressed form. Occasionally search may be fuzzy or based on incomplete information. Some form of classification/grouping of information may be needed to help the search process.
• Design of efficient multimedia query languages. These languages should be capable of expressing complex spatiotemporal concepts, should allow imprecise match retrieval, and should be able to handle various manipulation functions for multimedia objects.
• Development of efficient data clustering and storage layout schemes to manage real-time multimedia data for both single and parallel disk systems.
• Design and development of a suitable architecture and operating system support for a general purpose database management system.
• Management of distributed multimedia data and coordination for composition of multimedia data over a network.
Figure 2. Example of multimedia information management system: a database management layer for monomedia (text, image, audio, and video DBMSs), a multimedia object management and query processing layer, and an interactive user interface layer.
Accordingly we can perceive an architecture for a general purpose multimedia information system as shown in Fig. 2. The architecture consists of three layers, which include a monomedia database management layer, an object management layer, and a user interface layer. The monomedia database management layer provides the functionalities essential for managing individual media including formatted data (text and numeric) and unformatted data (audio, video, images). One of the key aspects of each database at this level is to maintain efficient indexing mechanism(s) and to allow users to develop semantic-based modeling and grouping of complex information associated with each media. The primary objective is to process content-based queries and facilitate retrieval of appropriate pieces of monomedia data, such as a video clip(s), parts of an image, or some desired audio segments. A major consideration at the time of retrieval is the quality of information that can be sustained
by the system (both at the database site and the user site). Therefore it is important that some quality of presentation (QoP) parameters, such as speed, resolution, or delay bounds, be specified by the user and maintained by the system at this layer. The middle layer provides the functionality of integration of monomedia for composing multimedia documents as well as integrating/cross-linking information stored across monomedia databases. Integration of media can span multiple dimensions including space, time and logical abstractions (e.g., Hypermedia or object oriented). Therefore the primary function of this layer is to maintain some metaschema for media integration along with some unconventional information, such as the QoP parameters discussed above. The objective is to allow efficient searching and retrieval of multimedia information/documents with the desired quality, if possible. Since there is a growing need for management of multimedia documents and libraries, the need for efficient integration models is becoming one of the key research issues in developing a general purpose multimedia DBMS. The interactive layer consists of various user interface facilities that can support graphics and other multimedia interface functionalities. In this layer various database query and browsing capabilities can be provided.
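As a purely illustrative sketch (all class and field names here are hypothetical, not drawn from the article), a quality of presentation specification might travel with a content-based query from the interactive layer, through the object management layer, to the appropriate monomedia database:

```python
# Hypothetical QoP and query structures passed between the three layers.
from dataclasses import dataclass

@dataclass
class QoP:
    resolution: str = "640x480"   # spatial quality requested by the user
    frame_rate: int = 30          # playback speed in frames per second
    max_delay_ms: int = 200       # tolerable start-up / delivery delay

@dataclass
class ContentQuery:
    media_type: str               # "video", "image", "audio", or "text"
    predicate: str                # content-based condition, e.g. a feature match
    qop: QoP

def route_query(query: ContentQuery):
    """Object management layer: dispatch to the appropriate monomedia DBMS."""
    print(f"dispatching {query.media_type} query '{query.predicate}' "
          f"with QoP {query.qop}")

route_query(ContentQuery("video", "clips showing a weld inspection scene", QoP()))
```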
NOTION OF TIME FOR MULTIMEDIA DATA A multimedia object may contain real-time data like audio and video in addition to the usual text and image data that constitute present-day information systems. Real-time data can require time-ordered presentation to the user. A composite multimedia object may have specific timing relationships among the different types of component media. Coordinating the real-time presentation of information and maintaining the time-ordered relations among component media is known as temporal synchronization. Assembling information on the workstation is the process of spatial composition, which deals
basically with the window management and display layout interface. For continuous media, the integration of temporal synchronization functions within the database management system is desirable, since it can make the storage and handling of continuous data more efficient for the database system. Also implementation of some standard format for data exchange among heterogeneous systems can be carried out more effectively. In this section we first elaborate on the problem of temporal synchronization of multimedia data for composing objects, followed by a discussion of modeling time. These models are then used to develop conceptual models for the multimedia data, as described in a later section. Temporal Synchronization Problem The concept of temporal synchronization is illustrated in Fig. 3 where a sequence of images and text is presented in time to compose a multimedia object. Notice in this figure that the system must observe some time relationships (constraints) among various data objects in order to present the information to the user in a meaningful way. These relationships can be natural or synthetically created (2). Simultaneous recording of voice and video through a VCR, is an example of natural relationship between audio and video information. A voiceannotated slide show, on the other hand, is an example of synthetically created relationship between audio and image information. In this case, change of an image and the end of its verbal annotation, represent a synchronization point in time. A user can randomly access various objects, while browsing through a multimedia information system. In addition to simple forward play-out of time-dependent data sequences, other modes of data presentation are viable and should be supported by a multimedia database management system. These include reverse play-out, fast-forward/fast-backward play-out, and random access of arbitrarily chosen segments of a composed object. Although these operations are quite common in TV technology (e.g., VCRs), these capabilities are very hard to implement in a multimedia system. This is due to the nonsequential storage of multimedia objects, the diversity in
Figure 3. Time-ordered multimedia data.
the features of hardware used for data compression, the distribution of data, and random communication delays introduced by the network. Such factors make the provision of these capabilities infeasible with the current technologies. Conceptually synchronization of multimedia information can be classified into three categories, depending on the ‘‘level of granularity of information’’ to be synchronized (3). These are the physical level, the service level, and the human interface level (3), as shown in Fig. 4. At the physical level, data from different media are multiplexed over single physical connections or are arranged in physical storage. This form of synchronization can be viewed as ‘‘fine grain.’’ The service level synchronization is ‘‘more coarse grain,’’ since it is concerned with the interactions between the multimedia application and the various media, and among the elements of the application. This level deals primarily with intermedia synchronization necessary for presentation or play-out. The human interface level synchronization is rather ‘‘coarse grain,’’ since it is used to specify the random user interaction to a multimedia information system such as viewing a succession of database items, also known as browsing. In addition to time-dependent relational classification (i.e., synthetic/natural), data objects can be classified by their presentation and application lifetimes. A persistent object is one that can exist for the duration of the application. A nonpersistent object is created dynamically and discarded when ob-
Figure 4. Levels of synchronization of multimedia data: human interface (presentation synchronization), service layer (stream synchronization), and physical layer.
solete. For presentation, a transient object is defined as an object that is presented for a short duration without manipulation. The display of a series of audio or video frames represents a transient presentation of objects, whether captured live or retrieved from a database. Henceforth we use the terms static and transient to describe presentation lifetimes of objects, while persistence expresses their storage life in a database. In another classification, multimedia data have been characterized as either continuous or discrete (4). This distinction, however, is somewhat vague, since time ordering can be assigned to discrete media, and continuous media are time-ordered sequences of discrete ones after digitization. We use a definition attributable to Ref. 4, where continuous media are represented as sequences of discrete data elements played out contiguously in time. However, the term continuous is most often used to describe the fine-grain synchronization required for audio or video. Modeling Time The problem of multimedia synchronizing at presentation, user interaction, and physical layers reduces to satisfying temporal precedence relationships among various data objects under real timing constraints. For such purpose, models to represent time must be available. Temporal intervals and instants provide a means for indicating exact temporal specification. In this section, we discuss these models and then describe various conceptual data models to specify temporal information necessary to represent multimedia synchronization. To be applicable to multimedia synchronization, time models must allow synchronization of components having precedence and real-time constraints, and they must provide the capability for indicating laxity in meeting deadlines. The primary requirements for such a specification methodology include the representation of real-time semantics and concurrency, and a hierarchical modeling ability. The nature of presentation of multimedia data implies that a multimedia system has various additional capabilities such as to handle reverse presentation, to allow random access (at an arbitrary start point), to permit an incomplete specification of inter-media timing, to handle sharing of synchronized components among applications, and to provide data storage for control information. In light of these additional requirements, it is therefore imperative that a specification methodology also be well suited for unusual temporal semantics and be amenable to the development of a database for storing timing information. The first time model is an instant-based temporal reference scheme which has been extensively applied in the motion picture industry, as standardized by the Society of Motion Picture and Television Engineers (SMPTE). This scheme associates a virtually unique sequential code to each frame in a motion picture. By assigning these codes to both an audio track and a motion picture track, inter-media synchronization between streams is achieved. This absolute, instant-based scheme presents two difficulties when applied to a multimedia application. First, since unique, absolute time references are assumed, when segments are edited or produced in duplicate, the relative timing between the edited segments becomes lost in terms of play-out. Furthermore, if one medium,
Figure 5. All possible temporal relations between two events α and β: α before β, α meets β, α overlaps β, α starts β, α equals β, α during⁻¹ β, and α finishes⁻¹ β (together with their inverses, 13 relations in all).
while synchronized to another, becomes decoupled from the other, then the timing information of the dependent medium becomes lost. This instant-based scheme has also been applied using musical instrument digital interface (MIDI) time instant specification (5). This scheme has also been used to couple each time code to a common time reference (6). In another approach, temporal intervals are used to specify relative timing constraints between two processes. This model is mostly applicable to represent simple parallel and sequential relationships. In this approach synchronization is accomplished by explicitly capturing each of the 13 possible temporal relations (2), shown in Fig. 5, that can occur between the processes. Additional operations can be incorporated in this approach to facilitate incomplete timing specification (4). CONTENT-BASED RETRIEVAL OF MULTIMEDIA DATA Image Data Modeling and Retrieval Traditionally research in image database systems has been focused on image processing and recognition aspects of the data. The growing role of image databases for information technology has spurred tremendous interest in data management aspects of information. Many challenges are faced by the database community in this area, including development of new data models and efficient indexing and retrieval mechanisms. To date, the general approach for image data modeling is to use multilevel abstraction mechanisms and support content-based retrieval using such abstractions. The levels of abstraction require feature extraction, object recognition, and domain-specific spatial reasoning and semantic modeling, as shown in Fig. 6. In this section, we use this figure as our focus of discussion and elaboration of few selected approaches proposed in the literature for developing such multilevel abstractions and associated indexing mechanisms. We discuss the important role played by the knowledge-based representation for processing queries at different levels. Feature Extraction Layer. The main function of this layer is to extract object features from images and map them onto a multidimensional feature space that can allow similarity based retrieval of images using their salient features. Features in an image can be classified as: global or local. Global features generally emphasize ‘‘coarse-grained’’ similaritybased matching techniques for query processing. Example queries include ‘‘Find images that are predominantly green,’’ or ‘‘Retrieve an image with a large round orange textured ob-
Figure 6. Processing and semantic modeling for an image database: a semantic modeling and knowledge representation layer (semantic identification using a specification knowledge base), an object recognition layer (object recognition using object models), and a feature extraction layer (feature extraction using feature specifications), all operating on image data and still video frames drawn from the multimedia data.
ject.’’ The global feature extraction techniques transform the whole image into a ‘‘functional representation.’’ The finer details among individual parts of the image are ignored. Color histograms, fast fourier transform, Hough transform, and eigenvalues are the well-known functional techniques that fall into this category. Local features are used to identify salient objects in an image and to extract more detailed and precise information about the image. The approach is ‘‘fine grained’’ in the sense that images are generally segmented into multiple regions and different regions are processed separately to extract multiple features. In other words, local features constitute a multidimensional search space. Features in the form of encoded vectors provide the basis for indexing and searching mechanisms of image databases. Typical features include gray scale values of pixels, colors, shapes, and texture. Various combination of features can be specified at the time of formulating database queries. Incorporating domain knowledge with local features can provide more robust and precise indexing and search mechanisms using similarity based measures. Different kinds of measures have been proposed in the literature. These include, among others, Euclidean distance, Manhattan distance, weighted distance, color histogram intersection, and average distance. The performance of these similarity-based search strategies depends on the degree of imprecision and fuzziness introduced by the types of features used and the computational characteristics of the algorithms. Choice of features, their extraction mechanisms, and the search process at this level are domain specific. For example, multimedia applications targeted for X-ray imaging, and geographic information systems (GIS) require spatial features such as shapes and dimensions. On the other hand, for applications involving MMR imaging, paintings, and the like, color features are more suitable. The feature extraction mechanism can be manual, automatic, or hybrid. The trade-off is between the complexity and robustness of the algorithm in terms of its precision and the cost incurred by the manual approach. Various systems have been prototyped that use a feature extraction layer similar to the one shown in Fig. 6. For example, in the Query by Image Content (QBIC) system (7), color, shape, and texture features are used for image retrieval. In this system features are extracted using a fully automatic im-
age segmentation method. A model is used to identify objects with certain foreground/background settings. The system allows querying of the database by sketching features and providing color information about the desired objects. A system that uses a combination of color features and textual annotation attributes for image retrieval is the Chabot system (8). The system uses the notion of ‘‘concept query’’ where a concept, like sunset, is recognized by analyzing images using color features. It uses a frame-based knowledge representation of image contents, which is pre-computed and stored as attributes in a relational data model. For improving the performance of the system, it uses textual annotation of images by keywords that are manually entered. A system that uses quantitative methods for edge detection to identify shape features in a radiological database, known as KMeD is presented in Ref. 9. This system employs a three-layer architecture, where the lowest layer, known as the representation layer, uses shapes and contours to represent features. This layer employs a semiautomatic feature extraction mechanism based on a combination of low-level image processing techniques and visual analysis of the image manually. From a functionality point of view, this layer reduces to the feature extraction layer of Fig. 6. Object Recognition Layer. Features extracted at the lower level can be used to recognize objects and faces in an image database. Such a process is carried out by a higher layer as shown in Fig. 6. The process involves matching features extracted from the lower layer with the object models stored in a knowledge base. During the matching process, each model is inspected to find the ‘‘closest’’ match. Identifying an exact match is a computationally expensive task that depends on the details and the degree of precision possessed by the object model. Occlusion of objects and the existence of spurious features in the image can further diminish the success of matching strategies. As pointed out earlier, some fuzziness and imprecision must be incorporated in the similarity measure in order to increase the success rate of queries and not to exclude good candidates. For this reason, examining images manually at this level is generally unavoidable. Identification of human faces is an important requirement in developing image databases. However, due to more inherent ‘‘structuredness’’ in human faces, models and features used for face recognition are different than those used for object recognition. Face recognition involves three steps: face detection whereby a face is located inside an image, feature extraction where various parts of a face are detected, and face recognition where the person is identified by consulting a database containing ‘‘facial models.’’ Several face detection and recognition systems for multimedia environments have been proposed (10). For face recognition, most of these systems use information about various prominent parts of a face such as eyes, nose, and mouth. Another technique decomposes face images into a set of characteristic features called eigenfaces (10). This technique captures variations in a collection of face images and uses them to encode and compare individual features. A third approach, motivated by neurocomputing, uses global transforms, such as Morlet transform, to determine salient features present in human faces (10). 
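The eigenfaces technique mentioned above can be sketched with a few lines of linear algebra. The fragment below is a minimal illustration rather than a production face recognizer; the image size, the number of retained components, and the nearest-neighbor matching rule are assumptions made for the example only.

```python
import numpy as np

def build_eigenfaces(faces, n_components=8):
    """faces: (N, H*W) array of flattened, equally sized gray-scale face images."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Principal components of the face set ("eigenfaces") via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]          # (n_components, H*W)
    weights = centered @ components.T       # coordinates of each face in eigenspace
    return mean, components, weights

def recognize(query, mean, components, weights):
    """Return the index of the gallery face closest to `query` in eigenspace."""
    w = (query - mean) @ components.T
    distances = np.linalg.norm(weights - w, axis=1)
    return int(np.argmin(distances))

# Toy usage with random data standing in for a face gallery.
gallery = np.random.rand(20, 64 * 64)
mean, comps, weights = build_eigenfaces(gallery)
probe = gallery[7] + 0.01 * np.random.rand(64 * 64)   # slightly perturbed gallery face
print(recognize(probe, mean, comps, weights))          # expected: 7 (usually)
```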
Extraction of features and object recognition are important phases for developing large-scale general purpose image database management systems. Significant results have been re-
Table 2. Survey of Different Image Database Systems
System | Feature Extraction (Process / Features) | Object Recognition (Process / Type of Knowledge Base) | Spatial Semantics (Process / Knowledge Base Support)
QBIC | Automatic / Color, shape | Hybrid / — | — / —
Chabot | Automatic / Color | Keywords / Frame based | — / —
KMeD | Hybrid / Shape | Hybrid / Attribute list of shape descriptors | Hybrid / Semantic nets
PICTION | Automatic / Facial shape | Automatic / Constraints | Automatic / Constraints
Yoshitaka et al. | Automatic / Shape | Manual / — | Automatic / Inclusion hierarchies
ported in the literature for the last two decades, with successful implementation of several prototypes. However, the lack of precise models for object representation and the high complexity of image processing algorithms make the development of fully automatic image management and content-based retrieval systems a challenging task. Spatial Modeling and Knowledge Representation Layer. The major function of this layer is to maintain the domain knowledge for representing spatial semantics associated with image databases. Queries at this level are generally descriptive in nature and are focused mostly on semantics and concepts present in image databases. For most of the applications, semantics at this level are based on ‘‘spatial events’’ (11) describing the relative orientation of objects with each other. Such semantics can provide high-level indexing mechanisms and support content-based retrieval for a large number of multimedia applications. For example, map databases and geographic information systems (GIS), are extensively used for urban planning and resource management. These systems require processing of queries that involve spatial concepts such as close by, in the vicinity, or larger than. In clinical radiology applications, relative sizes and positions of objects are critical for medical diagnosis and treatment. Some example queries in this application include ‘‘Retrieve all images that contain a large tumor in the brain,’’ or ‘‘Find an image where the main artery is 40% blocked.’’ The general approach for modeling spatial semantics for such applications is based on identifying spatial relationships among objects once they are recognized and marked by the lower layer using bounding boxes or volumes. Spatial relationships can be coded using various knowledge-based techniques. These techniques can be used to process high-level queries as well as to infer new information pertaining to the evolutionary nature of the data. Several formal techniques have been proposed to represent spatial knowledge at this layer. Table 2 summarizes the characteristics of several prototyped image database systems. Their key features are highlighted in the table. One of our observations from this table is that the underlying design philosophy of these systems is driven by the application domain. Development of a general purpose, automatic image database system capable of supporting arbitrary domains is a challenging task due to the limitations of existing image processing knowledge representation models. Video Data Modeling and Retrieval The key characteristic of video data that makes it different from temporal data such as text, image, and maps is its
spatial/temporal semantics. Video queries generally contain both temporal and spatial semantics. For example in the query ‘‘Find video clips in which the dissection of liver is described,’’ dissection is a spatiotemporal semantic. An important consideration in video data modeling is how to specify such semantics and develop an efficient indexing mechanism. Another critical issue is how to deal with the heterogeneity that may exist among semantics of such data due to difference in the preconceived interpretation or intended use of the information given in a video clip by different sets of users. Semantic heterogeneity has proved to be a difficult problem for conventional databases, with little or no consensus on the way to tackle it in practice. In the context of video databases, the problem is exacerbated. In general, most of the semantics and events in a video data can be expressed by describing the interplay among physical objects in time along with spatial relationships between these objects. Physical objects include persons, buildings, and vehicles. In order to model video data, it is essential to identify the component physical objects and their relationships in time and space. These relations may subsequently be captured in a suitable indexing structure, which may then be used for query processing. In event-based semantic modeling and knowledge representation issues in video data, we consider two levels of modeling: low level and high level, as shown in Fig. 7. The lowlevel modeling is concerned with the identification of objects, their relative movements, and segmentation and grouping of video data using image processing techniques. The high-level modeling is concerned with identifying contents and eventbased semantics associated with video data and representing these contents in conjunction with suitable structures for indexing and browsing. At this level, knowledge-based approaches can be used to process a wide range of content-based queries. Browsing models and structures can be used to allow users to navigate through groups of video scenes. The current approaches for low-level video data modeling can be further classified into two categories based on the types of processing carried out on the raw video data. The first approach is coarse grained and uses various video parsing techniques for segmenting video data into multiple shots. These shots are subsequently grouped for building higherlevel events. The second approach is fine grained and is primarily based on the motion analysis of objects and faces recognized in video data. Coarse-Grained Video Data Modeling Based on Segmentation. In this approach based on global features, video data are analyzed using image processing techniques. These techniques are applied at the frame level, and any significant
Table 3. Survey of Different Video Database Models
System | Spatial Temporal Models (Event Representation) | Modeling Approach | Mode of Capturing | Query Specification
Smoliar et al. | Predefined SCD-based model | Parsing, segmentation | Automatic | Visual browsing tool
Yeung et al. | Hierarchical scene transition graph | Parsing, segmentation | Semiautomatic | Visual browsing tool
Golshani et al. | Algebraic | Object identification and motion analysis | Automatic | Algebraic expressions
Day et al. | Spatiotemporal logic using objects and events | Object identification and motion analysis | Manual | Logical expressions
Bimbo et al. | Spatiotemporal logic using objects and events | Object identification and motion analysis | Semiautomatic | By sketch
Oomoto et al. | Algebraic using video objects | Segmentation | Manual | Visual SQL based
Weiss et al. | Algebraic using video expressions | Segmentation | Manual | Algebraic expressions
change in global features in a sequence of frames is used to mark a change in the scene. This process allows parsing and automatic segmentation of video into shots. For this reason it is often termed as scene change detection technique. Most of the existing approaches to scene change detection use color histograms as the global feature (13). In other words, a shot is defined as a continuous sequence of video frames that have no significant interframe difference in terms of their visual content (13). Subsequently shots are used to construct scenes and episodes and to build browsing structures for users to navigate through the video database. In order to develop high-level semantics based on this technique (Fig. 7), scenes are clustered based on some desired semantics, and descriptions are attached to these clusters. There are several ways to build this abstraction. One possibility is to identify key objects and other features within each scene using either image processing techniques or textual information from video caption, in case it is available. Domain specific semantics can be provided in form of sketches or reference frames to identify video segments that are closely related to these frames. Reference 14 takes advantage of the wellstructured domain of news broadcasting to build an a priori model of reference frames as a knowledge base to semantically classify the video segments of a news broadcast. Alternatively, the scenes of the segmented video can be examined manually in order to append appropriate textual description. Such description can then be used to develop high-level semantics and events present in different scenes.
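As a rough illustration of the histogram-based scene change detection described above, the sketch below declares a shot boundary whenever the color-histogram difference between consecutive frames exceeds a threshold. The bin count and threshold are arbitrary assumptions; practical systems tune them and add heuristics for gradual transitions.

```python
import numpy as np

def frame_histogram(frame, bins=8):
    """Normalized joint RGB histogram of a frame given as an (H, W, 3) uint8 array."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def detect_cuts(frames, threshold=0.4):
    """Return indices i where a cut is declared between frame i-1 and frame i."""
    cuts = []
    prev = frame_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = frame_histogram(frame)
        # L1 distance between normalized histograms lies in [0, 2].
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Toy usage: two synthetic "shots" with different dominant colors.
shot_a = [np.full((48, 64, 3), 30, dtype=np.uint8) for _ in range(5)]
shot_b = [np.full((48, 64, 3), 200, dtype=np.uint8) for _ in range(5)]
print(detect_cuts(shot_a + shot_b))   # -> [5]
```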
Figure 7. Semantic modeling of video data: coarse-grained video parsing and segmentation and fine-grained object recognition with motion detection and analysis at the low level, feeding iconic-based grouping and browsing and knowledge-based higher-level semantics at the high level.
Video segmentation techniques are also suitable for building iconic-based browsing environments. In this case a representative frame of each scene can be displayed to the user in order to provide the information about the persons and possible event present in that scene (15). Fine-Grained Video Data Modeling. In this approach, as shown in Fig. 7, detailed temporal information of objects and persons is extracted from the video data in order to identify high-level events and semantics of interest. In the following sections we elaborate on this modeling paradigm. Low-Level Modeling. The main function of this layer is to identify key objects and faces and perform motion analysis to track their relative movements. For this purpose each video frame is analyzed either manually or using image processing techniques for automatic recognition of objects and faces. The major challenge in this approach is to track the motion of objects and persons from frame to frame and perform detailed motion analysis for temporal modeling. Several approaches have been proposed in the literature to track motion of objects. Here we elaborate on two techniques. In one of these approaches the known compression algorithms are modified to identify objects and to track their motion. Such ‘‘semantic-based’’ compression approaches combine both image processing and image compression techniques. For example, in Ref. 16 a motion tracking algorithm uses forward and backward motion vectors of macroblocks used by an MPEG-encoding algorithm to generate trajectories for objects. These trajectories are subsequently used by the higher layer for semantic modeling. The second approach for motion tracking uses a directed graph model to capture both spatial and temporal attributes of objects and persons. The proposed model, known as video semantic directed graph (VSDG), is used to maintain temporal information of objects once they are identified by image processing techniques. This is achieved by specifying the changes in the 3-D projection parameters associated with the bounding volume of objects in a given sequence of frames. At the finest level of granularity, these changes can be recorded for each frame. Although such a fine-grained motion specification may be desirable for frame-based indexing of video data, it may not be required in most of the applications. In addition the overhead associated with such detailed specification may be formidable. Alternatively, a coarse-grained temporal specification can be maintained by only analyzing
frames for motion tracking at some fixed distance apart. Such skip distance depends on the complexity of events. There is an obvious trade-off between the amount of storage needed for temporal specification and the detailed information maintained by the model. Both of these approaches, and several others, can be used to build high-level semantics, as discussed next. Higher-Level Modeling of Video Data. Based on the information available from the low layer of Fig. 7, higher-level semantics can be built by the user to construct different views of the video data. There has been a growing interest in developing efficient formalisms to represent high-level semantics and event specifications as implied by the high level layer of Fig. 7. Several approaches have been proposed in the literature on this topic. The essence of these formalisms is the temporal modeling and specification of events present in video data. Semantic operators, which include logic, set, and spatiotemporal operators, are extensively used to develop such formalisms. Logical operators include the conventional boolean connectives such as not, and, or, if-then, only-if, and equivalent-to. Set operators like union, intersection, and difference are mostly used for event specification as well as for video composition and editing. Spatiotemporal operators, based on temporal relations, are employed for event specification and modeling. There are a total of 13 such possible operators, as shown in Fig. 5. In essence the approaches proposed in the literature use subsets and combinations of these operators. Temporal Interval-Based Video Modeling. In this section we describe the approaches of video models based on temporal intervals. The first approach is based on spatiotemporal logic and uses temporal and logical operators for specifying video semantics. The second approach uses spatiotemporal operators with set-theoretic operators to specify video events in form of algebraic expressions. Such operations include merge, union, intersection, and so on. As a result of set-theoretic operations, this approach is also useful for video production environments. In this category we discuss three distinct models. In our opinion, these are among the most comprehensive frameworks that are representations of other models in the field. Spatio-temporal Logic. An approach that uses spatial relations for representing video semantics is spatiotemporal logic (17). In this approach each object identified in a scene is represented by a symbol, and scenes are represented by a sequence of state assertions capturing the geometric ordering relationships among the projections of the objects in that scene. The assertions specify the dynamic evolution of these projections in the time domain. The assertions are inductively combined through the boolean connectives and temporal operators. Temporal and spatial operators, such as temporal/spatial eventually and temporal/spatial always are used for modeling video semantics in an efficient manner. Fuzziness and incomplete specification of spatial relationships are handled by defining multi-level assertions that provide general to specific detail of event specifications. For temporal modeling of video data, Ref. 11 uses the notion of generalized temporal intervals initially proposed in Ref. 18. The temporal specification of events in this approach is equivalent to the detailed event specifications of the approach discussed above. 
A generalized relation, known as an n-ary relation, is a permutation among n intervals, labeled 1 through n. The basis for this realization is that two consecu-
tive intervals satisfy the same temporal relation, which is being generalized. The n-ary relations are used to build the video semantics in form of a hierarchy. For this purpose, simple temporal events are first constructed from spatial events with a special condition that the n-ary operators are of type meets and all operands of a certain operation belong to the same spatial event. This allows one to represent the ‘‘persistence’’ of a specified spatial event over a sequence of frames, which gives rise to a simple temporal event that is valid for the corresponding range of frames with some duration. In order to recognize whether or not a simple event is present in video data, the constructed event is evaluated using the spatial and motion information of objects, captured in the VSDG model. Algebraic Models. These approaches use the temporal operators in conjunction with set operations to build formalisms that allow semantic modeling as well as editing capabilities for video data. For example, the framework discussed in Ref. 16 defines a set of algebraic operators to allow spatiotemporal modeling as well as video editing capabilities. In this framework temporal modeling is carried out by the spatiotemporal operators. These operators are usually defined through functions that map objects and their trajectories into temporal events. Based on lisplike operators for extracting items and lists, functions can be defined in order to perform various video editing operations such as inserting video clips, and extracting video clips and images from other video clips. Another algebraic video model is proposed in Ref. 19. The model allows hierarchical abstraction of video expressions representing scenes and events, which can provide indexing and content-based retrieval mechanisms. A video expression, in its simplest form, consists of a sequence of frames defined on raw data, which usually represent a meaningful scene. Compound video expressions are constructed from simpler ones through algebraic operations, which include creation, composition, and description operators that form the basis of this formalism. Composition operators include several temporal and set operations. The set operators allow performing set operations on various video segments represented by expressions. These operators can be used to generate complex video expressions according to some desired semantics and description. Content-based retrieval is maintained through annotating each video expression with field name and value pairs that are defined by the user. A similar approach is taken to develop an object-oriented abstraction of video data in Ref. 20. A video object in this approach is identical to a video expression in Ref. 21 and corresponds to semantically meaningful scenes and events. An object hierarchy is built using IS-A generalizations and is defined on instances of objects rather than classes of objects. Such generalizations allow grouping of semantically identical video segments. Hierarchical flow of information in this model is captured through interval inclusion based inheritance, where some attribute/value pairs of a video object A is inherited by video object B if the video raw data of B is contained in that of A. Set operators supporting composition operations, including interval projection, merge, and overlap constructs, are used for editing video data and defining new instances of video objects. In modeling of video data, some degree of imprecision is intrinsic. 
To manage such imprecision, the approach discussed in Ref. 17 uses a multilevel representation of video semantics as
mentioned earlier. The third level in that approach supports the most precise and detailed representation of spatio-temporal logic. This representation is equivalent to the one proposed in Ref. 17. The approach of Ref. 17, in practice, reduces to the one of Ref. 11 in the course of query evaluation. However, the approach of Ref. 17 is more pragmatic in terms of query formulation using visual sketches which provide an easier and more intuitive interface. The model presented in Ref. 16 has a limitation in the sense that it puts the burden on the user to define semantic functions related to the video object. Furthermore these functions must be defined in terms of object trajectories. Ultimately the high-level functions need to be evaluated to provide precise results for query processing. This necessitates reductions to low-level evaluation of algebraic functions that must be specified by the users. The other two algebraic approaches (19,20) require interactive formulation of video semantics by the user. These approaches afford us great flexibility in identifying the desired semantics. At the same time they put a great burden of responsibility on the user for such formulation, which may not be suitable for naive users. In addition these approaches may prove to be impractical for processing large amounts of video data because of the high cost of human interaction.
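The 13 temporal relations of Fig. 5 that underlie these interval-based models can be computed directly from interval endpoints. The fragment below is a small sketch of such a classifier; the tuple representation of an interval and the relation names are conventions chosen for this example, not part of any of the cited models.

```python
def temporal_relation(a, b):
    """Classify the relation between intervals a=(a1,a2) and b=(b1,b2), with a1<a2, b1<b2.
    Returns one of the 13 relations; 'X-inv' denotes the inverse of relation X."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:  return "before"
    if b2 < a1:  return "before-inv"
    if a2 == b1: return "meets"
    if b2 == a1: return "meets-inv"
    if a1 == b1 and a2 == b2: return "equals"
    if a1 == b1: return "starts" if a2 < b2 else "starts-inv"
    if a2 == b2: return "finishes" if a1 > b1 else "finishes-inv"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "during-inv"
    return "overlaps" if a1 < b1 else "overlaps-inv"

# Example: a video clip that starts 2 s into an audio track and ends with it.
print(temporal_relation((2, 10), (0, 10)))   # -> 'finishes'
print(temporal_relation((0, 5), (5, 9)))     # -> 'meets'
```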
MULTIMEDIA DOCUMENT MODELING AND RETRIEVAL An important problem that the multimedia community has to address is the management of multimedia documents. It is a general anticipation that parallel to the explosive growth in computer and networking technologies, multimedia repositories will soon become a reality and easy access to multimedia documents will make it essential to formally develop metaschema and indexing mechanisms for developing large-scale multimedia document management systems. As mentioned earlier, an important issue for managing large volumes of multimedia documents is the support of efficient indexing techniques to support querying of multimedia documents. Searching information about a document can be multidimensional and may span over multiple documents. These include searching by spatiotemporal structures, by logical organization, or by contents. For example, the query ‘‘Find documents that show a video clip of a basketball game accompanied by a textual information about other games’ results’’ requires searching documents by their spatiotemporal structures. Similarly the query ‘‘Find documents that describe the assembly process of the transmission system of a car’’ requires searching document database by contents. On the other hand, the query ‘‘Find all the other sections in this book that refers to the image of the Himalayas of Chapter 7’’ requires searching within a documents based on its logical structure. Another crucial component of multimedia document management is the integration of the data, which requires both temporal and spatial synchronizations of monomedia data to compose multimedia documents. In addition to this, logical organization of document components is desired to facilitate browsing and searching within and across documents. For managing documents, representation of composition and logical information in form of a suitable metaschema is essential for designing efficient search strategies.
Figure 8. Generic architecture for a multimedia document management system: a user interface on top of a metaschema for documents (spatiotemporal structure, logical structure, and contents) that integrates the underlying text, audio, image, and video database systems.
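A minimal sketch of the kind of per-document metaschema record suggested by Fig. 8 is given below. The field names and types are illustrative assumptions only; an actual system would derive them from the composition and organization models discussed next.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MediaItem:
    media_type: str                    # "text", "audio", "image", or "video"
    source_id: str                     # key into the corresponding monomedia database
    start: float                       # presentation start time (seconds)
    duration: float                    # presentation duration (seconds)
    region: Tuple[int, int, int, int]  # x, y, width, height on the display

@dataclass
class DocumentMetaschema:
    doc_id: str
    logical_structure: Dict[str, List[str]] = field(default_factory=dict)  # e.g. chapter -> sections
    items: List[MediaItem] = field(default_factory=list)                   # spatiotemporal structure
    content_keywords: List[str] = field(default_factory=list)              # content descriptors

# Hypothetical document combining a video clip with a timed caption.
doc = DocumentMetaschema(
    doc_id="assembly-manual-01",
    logical_structure={"chapter1": ["sec1.1", "sec1.2"]},
    items=[MediaItem("video", "clip-42", start=0.0, duration=30.0, region=(0, 0, 320, 240)),
           MediaItem("text", "caption-7", start=5.0, duration=10.0, region=(0, 250, 320, 40))],
    content_keywords=["transmission", "assembly"])
print(len(doc.items))
```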
A generic architecture that highlights the overall process of document creation, management, and retrieval is shown in Fig. 8. Our focus here is on the second layer of this architecture which deals with the composition and management aspects of multimedia documents. Temporal synchronization is the process of coordinating the real-time presentation of multimedia information and maintaining the time-ordered relations among component media. It is the process of ensuring that each data element appears at the required time and play-out for a certain time period. A familiar example is the voice-annotated slide show, where slides and voice data are played out concurrently. Spatial composition describes the assembly process of multimedia objects on a display device at certain points in time. For text, graphics, image, and video, spatial composition includes overlay and mosaic, and it requires processing such as scaling and cropping. For audio data, spatial operations include mixing of signals, gain, tone adjustment, and selectively playing out various audio signals on multichannel outputs (stereo quad, etc.). In the following sections we elaborate on two main aspects of document management; their spatiotemporal composition requirements and their organization models. Composition Models for Multimedia Documents In order to facilitate users to specify the spatiotemporal requirements, at the time of authoring a document, a composition model is needed. Recently various such models have been proposed in the literature, which include language-based models, time-interval based models, and object-oriented models (18,21–26). Conceptual Models for Multimedia Objects. A number of attempts have been made to develop conceptual models for representing multimedia objects. These models can be classified into five categories: graphical models, Petri-Net based models, object-oriented models, language-based models, and temporal abstraction models. Some models are primarily aimed at synchronization aspects of the multimedia data, while others are more concerned with the browsing aspects of the objects. The former models can easily render themselves to an ultimate specification of the database schema, as briefly discussed
later in this section. Some models, such as those based on graphs and Petri-Nets, have the additional advantage of pictorially illustrating synchronization semantics, and they are suitable for visual orchestration of multimedia presentations. These models are discussed next. Graphical Models. Labeled directed graphs have been extensively used to represent information (27). Hypertext systems provide an example of such a mechanism. This approach allows one to interlink small information units (data) and provides a powerful capability for users to navigate through a database. Information in such a system represents a ‘‘page’’ consisting of a segment of text, graphics codes, executable programs, or even audio/video data. All the pages are linked via a labeled graph, called hypergraph. The major application of this model is to specify higher-level browsing features of multimedia system. The essence of hypertext is a nonlinear interconnection of information, unlike the sequential access of conventional text. Information is linked via cross-referencing between keywords or subjects to other fragments of information. An application has been implemented (28) for interactive movies by using the hypertext paradigm. Various operations, such as updating and querying, can be performed on a hypergraph. Updating means changing the configuration of the graph and the content of the multimedia data. Querying operations include navigating the structure, accessing pages (read or execute), showing position in the graph, and controlling side effects. Basically it is a model for editing and browsing hypertext. The hypergraph model suffers from many shortcomings. The major drawback is that there is no specific mechanism to handle temporal synchronization among data items. Language-Based Models. In this approach a scripting language is used to describe the spatiotemporal structure of multimedia documents. The leading example is the HyTime model that uses SGML (Standard Generalized Markup Language). HyTime has been recognized as an ISO standard for multimedia document modeling in 1986 (21). SGML has gained increasing popularity recently through the fame of its child, HTML, though it is a result of a decades long effort. SGML basically defines a framework to describe the logical layout of the information in a structured format through a user-defined markup language. Defining metastructures involves location addressing of entities within data, querying of the structure and content of documents, and most important, specification of measurement and scheduling of data contents along spatial and/or temporal axes. This last feature of the standard, and the deserved popularity of markup schemes in data representation, make HyTime the ideal choice for multimedia document specification (21). On the other hand, the multimedia technology still lacks ‘‘HyTime-aware’’ methodologies capable of creating and analyzing HyTime documents from the database management points of view. A number of researchers have reported work involving SGML/HyTime structures (22,24). They mainly concentrate on document modeling and integrating HyTime-based infor¨ zsu et al. describe a mation with databases. In their work, O database application of SGML/HyTime documents for newson-demand applications (24). The documents follow a fixed logical structure, and the document database is restricted to a certain schema. The document units are mapped into database objects in conformance with a predefined type hierarchy. 
Their work emphasizes the importance of spatial and tempo-
ral analysis and indexing of multimedia documents but does not propose any approach to address this issue. In Reference 22 takes an alternative approach to the same problem: Storage and processing of structured documents within a DBMS framework is presented. This approach realizes the advantages of a general purpose scheme by a document insertion mechanism using super Document Type Descriptors that allows handling of arbitrary documents in the database (22). Like the new-on-demand application in Ref. 24, the scheme uses an object-oriented DB manager called VODAK. Spatiotemporal indexing is explicitly referenced as an important research problem, although no specific results have been reported. However, content-based and general indexing is briefly mentioned. In sum, the HyTime standard is expected to play a major role in leading the research activities in multimedia document modeling. However, the management aspects of HyTime-based documents in terms of searching and indexing are open research issues. Interval-Based Models for Multimedia Documents. Recently the use of Petri-Nets for developing conceptual models and browsing semantics of multimedia objects (18,25) has been proposed. The basic idea in these models is to represent various components of multimedia objects as places and describe their interrelations in the form of transitions. These models have been shown to be effective for specifying multimedia synchronization requirements and visualizing the composition structure of documents. One such model is used to specify object-level synchronization requirements. It is both a graphical and mathematical modeling tool capable of representing temporal concurrency of media. In this approach Timed Petri-Net has been extended to develop a model that is known as Object Composition Petri-Nets (OCPNs); see Ref. 18. The particularly interesting features of this model are the ability to capture explicitly all the necessary temporal relations. Each place in this Petri-Net derivative represents the play-out of a multimedia object, while transitions represent synchronization points. Several variations of the OCPN model have been proposed in the literature. One such variation deals with the spatial composition aspects of multimedia documents. For such composition, additional attributes are specified with each media place in the OCPN. These include the size and location of the display area for different media within a document, a priority vector that describes the relative ordering among changing background/foreground locations of intersecting spaces for media display with time; an ordered list of unary operations, such as crop and scale, applied to the data associated with the place, and a textual description about the contents of the media place. As mentioned, the HyTime model suffers from a drawback, that the extraction of various spatiotemporal and content semantics from this model can be quite cumbersome. On the other hand, the OCPN model not only allows extraction of the desired semantics and generation of a database schema but also has the additional advantage of pictorially illustrating synchronization aspects of the information. In this regard this model is unique and therefore is also well suited for visual orchestration of multimedia document. Organization Models for Multimedia Documents. From organizational structure point of view, a multimedia document
can be viewed as a collection of related information objects, such as books, chapters, and sections. The logical structure of objects can be maintained in the form of a metaschema associated with each document. Metainformation about such organization can be used for searching and accessing different parts of a document. Models for the logical structure of multimedia documents can be independent from the composition models. Such independence can support different presentation styles for a document that can be tailored to the target audience, as well as hardware display constraints. The well-known organizational modeling paradigm of documents is based on hypermedia. There are basically three types of links used in a hypermedia environment. These include the base structure links for defining the organization of documents, the associative links for connecting concepts and accessing the same information from different contexts, and referential links that provide additional information on a concept within a document. The HyTime model provides an elegant mechanism for the organizational structure of a document. Using SGML, a document’s logical content is described by specifying the significant elements in that document along with the attributes associated with each such element, in a hierarchical manner. For example, an SGML specification of a textual report document may declare that it contains a title, an author, and a body. Each of these elements would in turn have attributes specifying their structure. The hypermedia-based multimedia document models have several attractive features. For example, they allow efficient path-searching mechanisms for accessing information in various parts of the document (23). Furthermore they allow the development of object-oriented abstractions of documents. For this purpose the document components are represented in form of a set of nodes related to each other through IS-A, ISPART-OF, and AGGREGATE relationships. Associated with each node is a concept or a topic, and the semantic relationships among nodes are based on concepts. In other words, each node in this model is an information unit, and objectoriented abstractions between two nodes can be represented using structural links. Several hypermedia-based models of documents, with object-oriented abstractions have been proposed in the literature (22–25). The model presented in Ref. 22, in essence, is a HyTime model, as discussed earlier. Its hypermedia-based organization has been used to develop a multilayered architecture, known as VODAK. The layers consist of a conceptual schemata level for accessing several multimedia databases, a second level that supports document authoring environment by conceptualizing media objects, and a third level for the presentation of documents. The limitation in the design of VODAK system is that there is no explicit mechanism of supporting query based on contents associated with objects in a document. Recently the researchers in Ref. 23, have proposed a hypermedia-based document model that uses the object-oriented paradigm. They describe a unique indexing scheme based on the underlying multistructure information of document to optimize the index structure and to provide efficient access document elements. The document data model can be implemented using object-oriented technology. The model is augmented with an object-oriented query language syntax.
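To make the node-and-link organization concrete, the sketch below represents document components as nodes connected by typed links (base, associative, referential) and finds a path between two nodes with a breadth-first search. The link-type names follow the classification above; everything else (class names, the search strategy) is an illustrative assumption rather than any of the cited models.

```python
from collections import deque

class Node:
    def __init__(self, name, concept):
        self.name = name
        self.concept = concept
        self.links = []              # list of (link_type, target Node)

    def link(self, link_type, target):
        self.links.append((link_type, target))

def find_path(start, goal):
    """Breadth-first search over links; returns node names along a shortest path."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] is goal:
            return [n.name for n in path]
        for _, nxt in path[-1].links:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

book = Node("book", "report")
chapter = Node("ch7", "mountains")
image = Node("himalaya-img", "Himalayas")
book.link("base", chapter)           # organizational (base structure) link
chapter.link("base", image)
chapter.link("referential", book)    # back-reference providing additional context
print(find_path(book, image))        # -> ['book', 'ch7', 'himalaya-img']
```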
CONCLUSION
We have covered several issues pertaining to rapidly evolving multimedia information technology, namely data modeling, storage, indexing and retrieval, and synchronization of multimedia data. It is now widely accepted that one of the main requirements of multimedia information systems is a data model more powerful and more versatile than the relational model, without compromising the advantages of the relational model. The relational data model exhibits limitations in terms of complex object management, indexing and content-based retrieval of video/image data, and facility for handling the spatiotemporal dimensions of objects. To address these issues, we have emphasized two key requirements for multimedia databases: the process of spatiotemporal modeling, and the computational needs for automatic indexing of spatiotemporal data. We listed the general characteristics of a number of different media types, identifying the notion of time as the major characteristic that distinguishes multimedia data from traditional alphanumeric data. We have highlighted various challenges that need to be tackled before multimedia information systems become a reality. This area is expected to retain its popularity into the next millennium and produce visible outcomes that will find direct and pragmatic usage in our lives.
BIBLIOGRAPHY 1. A. Ghafoor and P. B. Berra, Multimedia database systems, in B. Bhargava and N. Adams, (eds.), Lecture Notes in Computer Science, Vol. 759, New York: Springer-Verlag, 1993, pp. 397–411. 2. T. D. C. Little and A. Ghafoor, Synchronization and storage models for multimedia objects, IEEE J. Select. Areas Commun., 8 (3): 413–427, 1990. 3. J. S. Sventek, An architecture for supporting multi-media integration, Proc. IEEE, Comput. Soc. Office Automation Symp., 1987, pp. 46–56. 4. R. G. Herrtwich, Time capsules: An abstraction for access to continuous-media data, Proc. 11th Real-Time Syst. Symp., 1990, pp. 11–20. 5. D. J. Moore, Multimedia presentation development using the audio visual connection, IBM Syst. J., 29 (4): 494–508, 1990. 6. M. E. Hodges, R. M. Sasnett, and M. S. Ackerman, A construction set for multimedia applications, IEEE Softw., 6 (1): 37–43, 1989. 7. M. Flickner et al., Query by image and video content: The QBIC system, Computer, 28 (9): 23–32, 1995. 8. V. E. Ogle and M. Stonebraker, Chabot: Retrieval from a relational database of images, Computer, 28 (9): 40–48, 1995. 9. C. C. Hsu, W. W. Chu, and R. K. Taira, A knowledge-based approach for retrieving images by content, IEEE Trans. Knowl. Data Eng., 8: 522–532, 1996. 10. M. Misra and V. K. Prasanna, Parallel computations of wavelet transforms, Proc. Int. Conf. Pattern Recognition, 1992. 11. Y. F. Day et al., Spatio-temporal modeling of video data for online object-oriented query processing, IEEE Int. Conf. Multimedia Comput. Syst., 1995, pp. 98–105. 12. A. Yoshitaka et al., Knowledge-assisted content-based retrieval for multimedia database, IEEE Multimedia, 1 (4): 12–21, 1994. 13. A. Nagasaka and Y. Tanaka, Automatic video indexing and full video search for object appearances, 2nd Working Conf. Visual Database Syst., 1991, pp. 119–133.
14. S. W. Smoliar and H. Zhang, Content-based video indexing and retrieval, IEEE Multimedia, 1 (2): 62–74, 1994. 15. M. M. Yeung et al., Video browsing using clustering and scene transitions on compressed sequences, Proc. IS&T/SPIE Multimedia Computing and Networking, 1995, pp. 399–413. 16. F. Golshani and N. Dimitrova, Retrieval and delivery of information in multimedia database systems, Inf. Softw. Technol., 36 (4): 235–242, 1994. 17. A. Del Bimbo, E. Vicario, and D. Zingoni, Symbolic description and visual querying of image sequences using spatio-temporal logic, IEEE Trans. Knowl. Data Eng., 7: 609–622, 1995. 18. T. D. C. Little and A. Ghafoor, Interval-based conceptual models for time-dependent multimedia data, IEEE Trans. Knowl. Data Eng., 5: 551–563, 1993. 19. R. Weiss, A. Duda, and D. K. Gifford, Composition and search with a video algebra, IEEE Multimedia, 2 (1): 12–25, 1995. 20. E. Oomoto and K. Tanaka, Ovid: Design and implementation of a video-object database system, IEEE Trans. Knowl. Data Eng., 5: 629–643, 1993. 21. ISO/IEC 10744, Information Technology—Hypermedia/TimeBased Structuring Language (HyTime), Int. Organ. for Standardization, 1992. 22. W. Klas, E. J. Neuhold, and M. Schrefl, Using an object-oriented approach to model multimedia data, Comput. Commun., 13 (4): 204–216, 1990. 23. K. Lee, Y. K. Lee, and P. B. Berra, Management of multi-structured hypermedia documents: A data model, query language, and indexing scheme, Multimedia Tools Appl., 4 (2): 199–223, 1997. 24. M. T. Ozsu et al., An object-oriented multimedia database system for a news-on-demand application, ACM Multimedia Syst. J., 3 (5/6): 182–203, 1995. 25. M. Iino, Y. F. Day, and A. Ghafoor, Spatio-temporal synchronization of multimedia information, IEEE Int. Conf. Multimedia Comput. Syst., 1994, pp. 110–119. 26. ISO 8613, Information Processing—Text and Office Systems– Office Document Architecture (ODA) and Interchange Format, Int. Organ. for Standardization, 1993. 27. F. W. Tompa, A data model for flexible hypertext database systems, ACM Trans. Inf. Syst., 7 (1): 85–100, 1989. 28. R. M. Sasnett, Reconfigurable video, MS thesis, Massachusetts Inst. Technol., Cambridge, MA, 1986.
WASFI AL-KHATIB, M. F. KHAN, SERHAN DAĞTAŞ, and ARIF GHAFOOR, Purdue University
Multimedia Video
Ahmed K. Elmagarmid and Haitao Jiang
Abstract: The sections in this article are Video Codecs; JPEG and MJPEG; QuickTime; Video Databases; Video Data Modeling; Video Cut Detection and Segmentation; Video Indexing; Video Query, Retrieval, and Browsing; Video Authoring and Editing; Video Conferencing; and Video-on-Demand.
MULTIMEDIA VIDEO
With advances in computer technology, digital video is becoming more and more common in various aspects of life, including communication, education, training, entertainment, and publishing. The result is massive amounts of video data that already exist in digital form or will soon be digitized. According to an international survey (1), there are more than six million hours of feature films and video archived worldwide, with a yearly rate of increase of about 10%. This would be equal to 1.8 million Gbyte of MPEG-encoded digital video data if all these films were digitized. The unique characteristics of video data (2,3) and its differences from other data types can be summarized in Table 1.
Both video and image data contain much more information than plain textual data. The interpretation of video and image data is thus ambiguous and dependent on both the viewer and the application. By contrast, textual data usually have a limited and well-defined meaning. Textual data are neither spatial nor temporal and can be thought of as one-dimensional data. Image data, however, contain spatial but not temporal information and can be regarded as two-dimensional data. Video data, on the other hand, have a third dimension, namely time. Compared to traditional data types like textual data, video and image data do not have a clear underlying structure and are much more difficult to model and represent. One single image is usually of the magnitude of kilobytes of data volume, and 1 min of full-motion video data contains 1800 image frames. The data volume of video data is said to be about seven orders of magnitude larger than that of a structured data record (3). Relationship operators such as equal defined for textual data are simple and well-defined. However, the relationships between video (or image) data are very complex and ill-defined. This causes many problems for video data indexing, querying, and retrieval. For example, there is no widely accepted definition of a simple similarity operator between two images or video streams.
Identifying the rich information content or features of video data helps us to better understand the video data, as well as to (a) develop data models to represent them, (b) develop indexing schemes to organize them, and (c) develop query processing techniques to access them. The video data content can be classified into the following categories (2,3):
• Semantic content is the idea or knowledge the data convey to the user. It is usually ambiguous and context-dependent. For example, two people can watch the same TV program and yet have different opinions about it. Such semantic ambiguity can be reduced by limiting the context or the application.
• Audiovisual content includes the audio signal, color intensity and distribution, texture patterns, object motions and shapes, and camera operations, among many others.
• Textual content provides important metadata about the video data. Examples are the closed caption of a news video clip, the title of the video clip, or the actors and actresses listed at the beginning of a feature film.
It should be pointed out that various contents of video data are not equally important. The choice and importance of the features depend on the purpose and use of the video data. In an application such as animal behavior video database management system (VDBMS), the motion and shape information of objects (in this case, animals) is the most important content of the video data. There may also be additional meta information, which is usually application specific and cannot be obtained directly from the video data. Such information is usually added during the annotation step of inserting video data into the video database (VDB)—for example, background information about a certain actor in a feature film video database.
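The annotation step mentioned above can be pictured as building a record that combines automatically extracted audiovisual features with manually supplied semantic and textual information. The sketch below is only an illustration under assumed field names; the single "mean color" feature stands in for the much richer audiovisual content a real VDBMS would extract.

```python
import numpy as np

def annotate_clip(clip_id, frames, caption_text, manual_notes):
    """Build a hypothetical VDB annotation record for one clip.
    `frames` is a list of (H, W, 3) uint8 arrays; mean color is a placeholder feature."""
    mean_color = np.mean([f.reshape(-1, 3).mean(axis=0) for f in frames], axis=0)
    return {
        "clip_id": clip_id,
        "audiovisual": {"mean_rgb": mean_color.round(1).tolist(),
                        "n_frames": len(frames)},
        "textual": {"closed_caption": caption_text},
        "semantic": {"notes": manual_notes},   # added manually at insertion time
    }

frames = [np.full((48, 64, 3), 120, dtype=np.uint8) for _ in range(10)]
record = annotate_clip("lion-hunt-003", frames, "Lions hunting at dusk",
                       ["predation behavior", "savanna"])
print(record["audiovisual"]["mean_rgb"])   # -> [120.0, 120.0, 120.0]
```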
Table 1. Comparison of Video Data with Other Types of Data
Criteria | Textual Data | Image Data | Video Data
Information | Poor | Rich | Very rich
Dimension | Static and nonspatial | Static and spatial | Temporal and spatial
Organization | Organized | Unstructured | Unstructured
Volume | Low | Median | Massive
Relationship | Simple and well-defined | Complex and ill-defined | Complex and ill-defined
VIDEO CODECS
One of the problems faced in using digital video is the huge data volume of video streams. Table 2 shows the data rate of some standard representations of uncompressed digital video data that are obtained by sampling the corresponding analog video stream at certain frequencies.
1. NTSC stands for National Television System Committee; it has image format 4 : 3, 525 lines, 60 Hz, and 4 MHz video bandwidth with a total of 6 MHz of video channel bandwidth. YIQ is the color space standard used in the NTSC broadcasting television system. The Y signal is the luminance. I and Q are computed from the difference between the color red and the luminance and the difference between the color blue and the luminance. The transformation from RGB to YIQ is linear (5):

\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.274 & -0.322 \\ 0.212 & -0.523 & 0.311 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}

NTSC-1 was set in 1948. It increased the number of scanning lines from 441 to 525 and replaced AM-modulated sound with FM. The frame rate is 30 frames/s.
2. PAL stands for phase alternating line and was adopted in 1967. It has image format 4 : 3, 625 lines, 50 Hz, and 4 MHz video bandwidth with a total of 8 MHz of video channel width. The frame rate is 25 frames/s. PAL uses YUV for color coding. Y is the same as in YIQ. U = 0.493 × (B − Y) is a function of the difference between blue and the luminance, and V = 0.877 × (R − Y) is a function of the difference between red and the luminance.
3. SECAM stands for Séquentiel Couleur à Mémoire; it has image format 4 : 3, 625 lines, 50 Hz, and 6 MHz video bandwidth with a total of 8 MHz of video channel bandwidth. The frame rate is 25 frames/s.
4. CCIR stands for Comité Consultatif International de Radio, which is part of the United Nations International Telecommunications Union (ITU) and is responsible for making technical recommendations about radio, television, and frequency assignments. The CCIR 601 digital television standard is the basis for all the subsampled interchange formats such as source input format (SIF), common intermediate format (CIF), and quarter CIF (QCIF). For NTSC (PAL/SECAM), it is 720 (720) pixels by 243 (288) lines by 60 (50) fields/s, where the fields are interlaced when displayed. The chrominance channels are horizontally subsampled by a factor of two, yielding 360 (360) pixels by 243 (288) lines by 60 (50) fields/s. CCIR 601 uses the YCrCb color space, which is a variant of the YUV color space.
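The uncompressed data rates in Table 2 follow directly from the sampling parameters above. The short sketch below reproduces that arithmetic (frame sizes, bytes per pixel, and frame rates are taken from the text) and applies the RGB-to-YIQ matrix; treat it as a worked example, not a reference implementation.

```python
import numpy as np

def data_rate_mbytes(width, height, bytes_per_pixel, fps):
    """Uncompressed video data rate in Mbyte/s (assuming 1 Mbyte = 10**6 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6

print(round(data_rate_mbytes(640, 480, 3, 30), 1))   # NTSC square pixel -> 27.6
print(round(data_rate_mbytes(768, 576, 3, 25), 1))   # PAL square pixel  -> 33.2
print(round(data_rate_mbytes(720, 486, 2, 30), 1))   # CCIR 601 (D2)     -> 21.0

RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.212, -0.523,  0.311]])

def rgb_to_yiq(rgb):
    """Convert an RGB triple (components in [0, 1]) to YIQ using the matrix above."""
    return RGB_TO_YIQ @ np.asarray(rgb, dtype=float)

print(rgb_to_yiq([1.0, 1.0, 1.0]))   # pure white -> Y = 1, I and Q approximately 0
```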
Clearly, video streams must be compressed to achieve efficient transmission, storage, and manipulation. The bandwidth problem becomes even more acute for high-definition television (HDTV), since uncompressed HDTV video can require a data rate of more than 100 Mbyte/s. The term video codec is a combination of the words video compression and decompression, which express its two major functions: (i) compress the video, and (ii) decompress the video during playback. Video codecs can be implemented in either hardware or software. Software codecs are slower than hardware codecs but are more portable and cost-efficient. Video can be compressed both spatially and temporally. Spatial compression applies to a single frame and can be lossy or lossless. Temporal compression identifies and stores the interframe differences. Both can be used in a video codec. Most extant video compression standards are lossy video compression algorithms. In other words, the decompressed result is not totally identical to the original data. This is so because the compression ratio of lossless methods [e.g., Huffman, arithmetic, Lempel-Ziv-Welch (LZW)] is not high enough for video compression. On the other hand, some lossy video compression techniques such as MPEG and MJPEG can
Table 2. Data Rate of Uncompressed Digital Video

Video Standard                          Image Size   Bytes/Pixel   Mbyte/s
NTSC square pixel (USA, Japan, etc.)    640 × 480    3             27.6
PAL square pixel (UK, China, etc.)      768 × 576    3             33.2
SECAM (France, Russia, etc.)            625 × 468    3             22.0
CCIR 601 (D2)                           720 × 486    2             21.0

NTSC, National Television System Committee; PAL, phase alternating line; SECAM, Séquentiel Couleur à Mémoire; CCIR, Comité Consultatif International de Radio.
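The Mbyte/s column of Table 2 follows directly from image size, bytes per pixel, and frame rate. The short calculation below reproduces it, assuming frame rates of 30 frames/s for NTSC and CCIR 601 and 25 frames/s for PAL and SECAM, and taking 1 Mbyte as 10^6 bytes; under these assumptions the SECAM entry comes out at 21.9, close to the 22.0 quoted in the table.

# Reproduces the Mbyte/s column of Table 2 from image size, bytes/pixel, and frame rate.
# The frame rates and the 10**6 bytes-per-Mbyte convention are our assumptions.

formats = {
    "NTSC square pixel": (640, 480, 3, 30),
    "PAL square pixel":  (768, 576, 3, 25),
    "SECAM":             (625, 468, 3, 25),
    "CCIR 601 (D2)":     (720, 486, 2, 30),
}

for name, (width, height, bytes_per_pixel, fps) in formats.items():
    rate = width * height * bytes_per_pixel * fps / 1e6   # Mbyte/s
    print(f"{name:20s} {rate:5.1f} Mbyte/s")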
Lossy compression algorithms are very suitable for video data, since not all information contained in video is equally important or perceivable to the human eye. For example, it is known that small changes in brightness are more perceivable than changes in color, so compression algorithms can allocate more bits to luminance (brightness) information than to chrominance (color) information. This makes the algorithms lossy, but viewers may not be able to see the data loss. One important issue in a video compression scheme is the tradeoff between compression ratio and video quality: higher quality implies a lower compression ratio and larger encoded video data. The speed of a compression algorithm is also important. An algorithm may achieve a high compression ratio yet be unusable in practice because of its computational complexity, since real-time video requires a decoding speed of about 25 frames/s. There are many video compression standards, including the CCITT/ISO standards, Internet standards, and various proprietary standards, based on different tradeoff considerations. Some codecs take a long time to compress a video but decompress very quickly; they are called asymmetric video codecs. Symmetric video codecs take about the same amount of time for compression and decompression.

Video codecs should not be confused with multimedia architectures. Architectures, such as Apple's QuickTime (6), are operating system extensions or plug-ins that allow the system to handle video and other multimedia data such as audio, animation, and images. They usually support certain codecs for different media, including video. We include QuickTime as an example later in this article.

JPEG AND MJPEG

JPEG is a standardized image compression mechanism. JPEG stands for Joint Photographic Experts Group, the original name of the committee that wrote the standard. It is designed for lossy compression of either full-color or gray-scale still images of natural, real-world scenes. It works well on photographs, naturalistic artwork, and similar images; however, it does not work as well on lettering, simple cartoons, or line drawings. JPEG exploits known limitations of the human eye, notably the fact that small color changes are perceived less accurately than small changes in brightness (a property that MPEG also exploits). Thus, JPEG is intended for compressing images that will be viewed by human beings. If images need to be machine-analyzed, the small errors introduced by JPEG may pose a problem even if they are invisible to the human eye. A useful property of JPEG is that the degree of information loss can be varied by adjusting compression parameters, which means that the image maker can trade off file size against output image quality. Another important aspect of JPEG is that decoders can also trade off decoding speed against image quality by using fast, though inaccurate, approximations to the required calculations. JPEG is a symmetric codec (approximately 1:1), and its image coding algorithm is the basis of the spatial compression in many video codecs such as MPEG and H.261. The coding process (shown in Fig. 1) involves the following major steps:
Figure 1. JPEG image coding: RGB-to-YIQ conversion (optional), DCT on 8 × 8 blocks of each plane, quantization, zigzag scan, DPCM on the dc component, RLE on the ac components, and entropy coding.
1. RGB to YIQ transformation (optional), followed by the DCT (discrete cosine transform) applied to each 8 × 8 block of each plane.
2. Quantization, which is the main source of the lossy compression. The JPEG standard defines two default quantization tables, one for luminance and one for chrominance.
3. Zigzag scan, which groups the low-frequency coefficients at the top of a vector by mapping each 8 × 8 block to a 1 × 64 vector.
4. DPCM (differential pulse code modulation) on the direct-current (dc) component. The dc component is large and varied but often close to the previous value, so the difference from the dc value of the previous 8 × 8 block is encoded.
5. RLE (run-length encoding) on the alternating-current (ac) components. The 1 × 64 vector contains many zeros and can be encoded as (skip, value) pairs, where skip is the number of zeros and value is the next nonzero component; (0, 0) is used as the end-of-block marker. (A small sketch of steps 3 and 5 appears at the end of this subsection.)
6. Entropy coding, which categorizes dc values into SSS (the number of bits needed to represent the value) and the actual bits; for example, if the dc value is 7, 3 bits are needed. Alternating-current (skip, value) pairs are encoded as composite symbols (skip, SSS) using Huffman coding. The Huffman table can be the default one or a custom table contained in the header.

MJPEG stands for motion JPEG. Contrary to popular perception, there is no MJPEG standard. Various vendors have applied the JPEG compression algorithm to individual frames of a video sequence and have called the compressed video MJPEG, but the resulting formats are not compatible across vendors. Compared with the MPEG standards, MJPEG is well suited to accurate video editing because of its frame-based encoding (spatial compression only). MJPEG has a fairly uniform bit rate and simpler compression, which requires less computation and can be done in real time. The disadvantage of MJPEG is that there is no interframe compression; thus its compression ratio is poorer than that of MPEG (about three times worse). MJPEG is one of the built-in JPEG-based video codecs in QuickTime.
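The sketch below makes steps 3 and 5 of the JPEG coding process concrete: it zigzag-scans a toy 8 × 8 block of quantized coefficients and run-length encodes the ac part into (skip, value) pairs terminated by the (0, 0) end-of-block marker. It is a minimal illustration under our own simplifications (toy block values, no DPCM or Huffman stage), not a compliant JPEG encoder.

# Minimal illustration of the zigzag scan (step 3) and (skip, value) run-length
# coding of ac coefficients (step 5); not a compliant JPEG encoder.

def zigzag_order(n=8):
    """Return the (row, col) visiting order of an n x n block along anti-diagonals."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else reversed(diag))   # alternate direction
    return order

def rle_ac(block):
    """Zigzag-scan a quantized block and encode its ac part as (skip, value) pairs."""
    coeffs = [block[r][c] for r, c in zigzag_order(len(block))]
    ac = coeffs[1:]                      # coeffs[0] is the dc value (DPCM-coded separately)
    pairs, skip = [], 0
    for v in ac:
        if v == 0:
            skip += 1
        else:
            pairs.append((skip, v))
            skip = 0
    pairs.append((0, 0))                 # end-of-block marker
    return coeffs[0], pairs

# Toy quantized block: mostly zeros, as is typical after quantization.
block = [[16, -3, 0, 0, 0, 0, 0, 0],
         [ 2,  0, 0, 0, 0, 0, 0, 0]] + [[0] * 8 for _ in range(6)]
dc, ac_pairs = rle_ac(block)
print(dc, ac_pairs)                      # 16 [(0, -3), (0, 2), (0, 0)]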
MPEG

MPEG (7) stands for Moving Picture Experts Group, which meets under the International Standards Organization (ISO) to generate standards for digital video and audio compression. The official name of MPEG is ISO/IEC JTC1 SC29 WG11. The MPEG video compression algorithm is a block-based coding scheme. In particular, the standard defines a compressed bit stream, which implicitly defines a decompressor; however, the choice of compression algorithm is up to the individual manufacturer as long as the bit streams produced are compliant with the standard, and this allows proprietary advantage to be obtained within the scope of a publicly available international standard. MPEG provides very good compression but requires expensive computations to decompress, which may limit the frame rate that can be achieved with a software codec. The current status of the MPEG standards is listed in Table 3.

MPEG-1. MPEG-1 (the ISO/IEC 11172 standard) defines a bit stream for compressed video and audio optimized to fit into a bandwidth (data rate) of 1.5 Mbit/s, which is the data rate of uncompressed audio CDs and DATs. The video stream takes about 1.15 Mbit/s, and the remaining bandwidth is used by the audio and system data streams. The data in the system stream provide information for the integration of the audio and video streams, with proper time stamping to allow synchronization. The standard consists of five parts: systems (ISO/IEC 11172-1), video (ISO/IEC 11172-2), audio (ISO/IEC 11172-3), compliance testing (ISO/IEC 11172-4), and simulation software (ISO/IEC 11172-5). MPEG-1 video coding is very similar to that of MJPEG and H.261. The spatial compression uses a lossy algorithm similar to that of JPEG, except that RGB image samples are converted to the YCrCb color space and the Cr and Cb components are then subsampled in a 1:2 ratio horizontally and vertically. The temporal compression is done using block-based motion compensation with macroblocks (16 × 16 blocks). Each macroblock consists of four Y blocks and the corresponding Cr and Cb blocks. Block-matching techniques are used to find the motion vectors by minimizing a cost function measuring the mismatch between a macroblock and each predictor candidate. MPEG-1 coded video may have four kinds of frames, with a
typical frame size of 4.8 kbyte and an overall compression ratio of about 27:1.

• I frames (intra-coded) are coded without any reference to other frames; that is, they are only compressed spatially. The typical size of an I frame is 18 kbyte, a compression ratio of about 7:1.
• P frames (predictive coded) are coded more efficiently by using motion-compensated prediction from the previous I or P frame and contain both intra-coded and forward-predicted macroblocks. The typical size of a P frame is 6 kbyte, a compression ratio of about 20:1.
• B frames (bidirectionally predictive coded) have the highest compression ratio and require references to both the previous and the next I or P frame for motion compensation. B frames have four kinds of macroblocks: intra-coded, forward-predicted, backward-predicted, and bidirectionally predicted (averaged). B-frame macroblocks can specify both past and future motion vectors, indicating that the two predictions are to be averaged. The typical size of a B frame is 2.5 kbyte, a compression ratio of about 50:1.
• D frames (dc-coded) contain only low-frequency information (the dc coefficients of blocks) and are totally independent of the rest of the data. The MPEG standard does not allow D frames to be coded in the same bit stream as I/P/B frames. They are intended for fast visible-search modes with sequential digital storage media and are rarely used in practice.

MPEG-1 video is strictly progressive (noninterlaced). The quality of MPEG-1 encoded video is said to be comparable to that of VHS video. Compared with H.261, MPEG-1 video coding allows larger gaps between I and P frames and thus increases the search range of the motion vectors, which are also specified to a fraction of a pixel (0.5 pixel) for better encoding. Another advantage is that the bit-stream syntax of MPEG-1 allows random access and forward/backward play. Furthermore, it adds the notion of a slice for resynchronization after lost or corrupted data.

MPEG-2. MPEG-2 (the ISO/IEC 13818 standard) includes ISO/IEC 13818-1 (MPEG-2 systems), ISO/IEC 13818-2 (MPEG-2 video), and ISO/IEC 13818-3 (MPEG-2 audio). It was approved in November 1994 by the 29th meeting of ISO/IEC JTC1/SC29/WG11 (MPEG), held in Singapore.
Table 3. MPEG Family of Video Codec Standards

Name             Objective                                                     Status
MPEG-1           Coding of moving pictures and associated audio for digital   ISO/IEC 11172 standard, completed in October 1992
                 storage media at up to about 1.5 Mbit/s
MPEG-2           Generic coding of moving pictures and associated audio       ISO/IEC 13818 standard, completed in November 1994
MPEG-3           NA                                                            No longer exists (has been merged into MPEG-2)
MPEG-4           Multimedia applications                                       Under development; expected in 1999
MPEG-5 and -6    NA                                                            Do not exist
MPEG-7           Multimedia content description interface                      Under development; expected in 2000
MPEG-2 was originally targeted at all-digital broadcast TV-quality video (CCIR 601) at coded bit rates of 4 Mbit/s to 9 Mbit/s, but it has been used in other applications as well, such as video-on-demand (VoD), digital video disc, personal computing, and so on. Other standards organizations that have adopted MPEG-2 include the DVB (digital video broadcast) project in Europe, the Federal Communications Commission (FCC) in the United States, the MITI/JISC in Japan, and the DAVIC consortium. The MPEG-2 system provides a two-layer multiplexing approach. The first layer, called the packetized elementary stream (PES), is dedicated to ensuring tight synchronization between video and audio. The second layer depends on the intended communication medium: the specification for error-free environments such as local storage is called the MPEG-2 program stream, and the specification addressing error-prone environments is called the MPEG-2 transport stream. However, the MPEG-2 transport stream is a service multiplex; in other words, it has no mechanism to ensure reliable delivery of the transport data. The MPEG-2 system is mandated to be compatible with MPEG-1 systems and has the following major advantages over MPEG-1:

• Syntax for efficient coding of interlaced video such as TV programs (e.g., 16 × 8 block-size motion compensation, dual prime, etc.) to achieve a better compression ratio; MPEG-1, on the other hand, is strictly defined for progressive video.
• More efficient coding through user-selectable DCT dc precision (8, 9, 10, or 11 bits), nonlinear macroblock quantization (a more dynamic step size than MPEG-1, to increase the precision of quantization at high bit rates), intra VLC tables, improved mismatch control, and so on.
• Scalable extensions, which permit the division of a continuous video stream into two or more bit streams representing the video at different resolutions, picture qualities, and frame rates. Currently, MPEG-2 has four extension modes: spatial scalability, data partitioning, SNR scalability, and temporal scalability.
• A transport stream suitable for computer networks and a wider range of picture sizes, from 352 × 240 up to as large as 16383 × 16383.
• High-definition TV (HDTV) encoding support at sampling dimensions up to 1920 × 1080 × 30 Hz and coded bit rates between 20 Mbit/s and 40 Mbit/s. HDTV was the goal of MPEG-3, but it was later discovered that the MPEG-2 syntax also works very well if an optimal balance between sample rate and coded bit rate can be maintained. HDTV is now part of the MPEG-2 High-1440 Level and High Level toolkits.

MPEG-2 also allows progressive video sequences, and its decoder can decode MPEG-1 video streams as well. More technical details about the differences between MPEG-2 and MPEG-1 video syntax can be found in Section D.9 of ISO/IEC 13818-2.

MPEG-4. MPEG-4 is a standard for multimedia applications being developed by MPEG and will become the ISO/IEC 14496 standard in 1999. MPEG-4 will provide a set of technologies to meet the needs of authors, service providers, and end users of multimedia applications. For authors, MPEG-4 will
improve the usability and flexibility of content production and will provide better management and protection of content. For network service providers, MPEG-4 will provide a set of generic QoS (quality of service) parameters for the various MPEG-4 media. For end users, MPEG-4 will enable a higher level of interaction with the content within author-defined limits. More specifically, MPEG-4 defines standard ways to do the following:

1. Represent units of aural, visual, and audio-visual content, called audio-visual objects (AVOs). The most basic units are called primitive AVOs; primitive AVOs include three-dimensional objects, two-dimensional backgrounds, voice, text and graphics, animated human bodies, and so on.
2. Compose AVOs to create compound AVOs that form audiovisual scenes. MPEG-4's scene composition borrows several concepts from VRML in terms of its structure and the functionality of its object composition nodes.
3. Multiplex and synchronize the data associated with AVOs so that they can be transported over network channels with a QoS appropriate to the nature of the specific AVOs. Each AVO is conveyed in one or more elementary streams, which are characterized by the requested transmission QoS parameters, the stream type, and the precision for encoding timing information. Transmission of such streaming information is specified in terms of an access unit layer and a conceptual two-layer multiplexer.
4. Interact with the audiovisual scene generated at the receiver's end within author-defined limits. Such interaction includes changing the viewing or listening point of the scene, dragging an object to a different location, and so on.

The coding of conventional images and video in MPEG-4 is similar to MPEG-1 and -2 coding and involves motion prediction and compensation followed by texture coding. MPEG-4 also provides content-based functionalities, which support the separate encoding and decoding of content (i.e., objects in a scene, video objects). This feature provides the most elementary mechanism for interactivity, namely flexible representation and manipulation of video object content in the compressed domain, without the need for further segmentation or transcoding at the receiver. The content-based functionalities are supported by MPEG-4's efficient coding of arbitrary shapes and transparency information. MPEG-4 extends the video coding functionalities of MPEG-1 and -2: it includes provisions to efficiently compress video with different input formats, frame rates, pixel depths, and bit rates, and it supports various levels of spatial, temporal, and quality scalability. MPEG-4 includes a very-low-bit-rate video core (VLBV), which provides algorithms and tools for applications with data rates between 5 kbit/s and 64 kbit/s, supporting low spatial resolutions (typically up to CIF) and low frame rates (typically up to 15 Hz). MPEG-4 also provides the same basic functionalities for video with higher bit rates and with spatial and temporal parameters up to ITU-R Rec. 601 resolution.

On February 11, 1998, ISO MPEG decided to adopt the joint proposal submitted by Apple, IBM, Netscape, Oracle,
Silicon Graphics, and Sun Microsystems to use the QuickTime File Format as the starting point for the development of a digital media storage format for MPEG-4.

MPEG-7. Similar to MPEG-3, MPEG-5 and MPEG-6 have never been defined. MPEG-7, formally known as the multimedia content description interface, is targeted at the problem of searching audiovisual content. MPEG-7 will specify a standard set of descriptors that can be used to describe various types of multimedia information. MPEG-7 will also standardize ways of defining other descriptors, as well as structures (description schemes) for the descriptors and their relationships. The descriptions are associated with the content to build up indexes and to provide fast and efficient search. The multimedia material can be video, audio, three-dimensional models, graphics, images, speech, and information about how these elements are combined in a multimedia presentation. MPEG-7 is also required to include other information about the multimedia material:

• The form of the material, which could help the user determine whether the material can be read. Examples are the codec used (e.g., JPEG, MPEG-2) and the overall data size.
• Conditions for accessing the material, which could include copyright information and price.
• Classification, which could include parental rating and content classification categories.
• Links to other relevant material, which could help the user search and browse.
• The context describing the occasion of the recorded material, such as ''1997 Purdue football game against Notre Dame.''

As we can see, MPEG-7 builds on other standard audio-visual representations such as PCM, MPEG-1, MPEG-2, and MPEG-4; in fact, one functionality of the standard is to provide references to suitable portions of these standards. MPEG-7 descriptors are independent of the ways in which the content is encoded or stored, but they are determined by the user domains and applications. Such an abstraction feature is very important. First, it implies that the same material can be described through different types of features. It is also related to the way the features can be extracted: low-level features such as shape, size, texture, and color can be extracted automatically, whereas higher-level features such as semantic content need much more human interaction. Second, content descriptions do not have to be stored in the same data stream or on the same storage system as the multimedia material; rather, they can be associated with each other through bidirectional links. In addition to the MPEG-4-like capability of attaching descriptions to objects within a scene, MPEG-7 will also allow different granularities in its descriptions, offering the possibility of different levels of discrimination. MPEG-7 is now in the process of accepting proposals, and the final international standard is expected in the year 2000.
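As a purely illustrative sketch in the spirit of the bullets above (and emphatically not MPEG-7's actual description definition language, which was still being defined at the time), a content description might be modeled as a simple record that points to, rather than contains, the described material. All field names and values below are hypothetical.

# Purely illustrative content-description record; NOT the MPEG-7 description
# definition language, just a toy data structure with hypothetical fields.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContentDescription:
    material_uri: str                  # link to the described material, stored separately
    form: str                          # e.g., codec used and overall data size
    access_conditions: str             # e.g., copyright information and price
    classification: List[str] = field(default_factory=list)   # e.g., parental rating
    related_links: List[str] = field(default_factory=list)    # links to other material
    context: str = ""                  # occasion of the recorded material

desc = ContentDescription(
    material_uri="file://archive/game.mpg",
    form="MPEG-2, 1.2 Gbyte",
    access_conditions="copyright: university archive; free for research use",
    classification=["sports"],
    context="1997 Purdue football game against Notre Dame",
)
print(desc.context)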
H.261

H.261 is the most widely used international video compression standard for videophone and video conferencing over the Integrated Services Digital Network (ISDN). The standard describes the video coding and decoding methods for the moving picture component of an audiovisual service at rates that are multiples (1 to 30) of 64 kbit/s, that is, up to about 2 Mbit/s. H.261 was completed and approved by the ITU in December 1990. The standard is suitable for applications using circuit-switched (telephone) networks as their transmission channels, since both basic and primary rate ISDN access are considered as communication channels within the framework of the standard. With a bit rate of 64 kbit/s or 128 kbit/s, ISDN can be used only for videophone; video conferencing is better served at a bit rate of 384 kbit/s or higher, where better-quality video can be transmitted. H.261 approximates entertainment-quality (VHS) video at a bit rate of about 2 Mbit/s. H.261 is usually used in conjunction with other control and framing standards, such as H.221, H.230, H.242, and H.320, for communications and conference control.

The actual encoding algorithm of H.261 is similar to, but incompatible with, that of MPEG. H.261 has two kinds of frames, I (intraframe) and P (interframe) frames, which compose the decoded sequence IPPPIPPP.... I frames are encoded similarly to JPEG; P frames are encoded by estimating the next frame from the current frame and coding the predicted differences using an intraframe mechanism. One important estimator is the motion-compensated DCT coder, which uses a two-dimensional warp of the previous frame and encodes the difference using a block transform (the discrete cosine transform). H.261 needs substantially less CPU power for real-time encoding than MPEG because of the real-time requirements of the applications it was designed for. The H.261 encoding algorithm includes a mechanism that optimizes bandwidth usage by trading picture quality against motion, so that a quickly changing picture will be of lower quality than a relatively static picture. H.261 used in this way provides constant-bit-rate encoding rather than constant-quality, variable-bit-rate encoding. H.261 supports two resolutions, QCIF (quarter common interchange format, 176 × 144) and CIF (common interchange format, 352 × 288).

H.263

H.263 is an international video codec standard for low-bit-rate communication (possibly less than 64 kbit/s) approved in March 1996 by the ITU. The coding algorithm is similar to that of H.261, with changes to improve performance and error recovery: (a) half-pixel rather than full-pixel precision is used for motion compensation; (b) some parts of the hierarchical structure of the data stream are now optional, so the codec can be configured for a lower data rate or better error recovery; and (c) four negotiable options are included to improve performance: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and forward and backward frame prediction similar to MPEG's P and B frames. H.263 supports five resolutions: SQCIF (128 × 96), QCIF, CIF, 4CIF (704 × 576), and 16CIF (1408 × 1152). The support for 4CIF and 16CIF makes H.263 competitive with higher-bit-rate video codec standards such as the MPEG standards. It is expected that H.263 will replace H.261 in many applications.
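To see why such aggressive compression is required at these channel rates, the sketch below compares raw QCIF and CIF data rates against the 64 kbit/s channel granularity mentioned above. The 4:2:0 chroma subsampling (an average of 12 bits per pixel) and the frame rates used are our own illustrative assumptions, not figures from the standards.

# Raw data rate of QCIF/CIF video versus 64 kbit/s ISDN channel units.
# 4:2:0 sampling (12 bits/pixel average) and the frame rates are illustrative assumptions.

resolutions = {"QCIF": (176, 144), "CIF": (352, 288)}
bits_per_pixel = 12          # 4:2:0: 8 bits luma plus 2 x 8 bits chroma per 4 pixels

for name, (w, h) in resolutions.items():
    for fps in (10, 30):
        raw_kbit = w * h * bits_per_pixel * fps / 1000.0
        channels = raw_kbit / 64.0       # equivalent number of 64 kbit/s channels, uncompressed
        print(f"{name} at {fps} frames/s: {raw_kbit:8.0f} kbit/s "
              f"(~{channels:5.0f} x 64 kbit/s channels uncompressed)")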
Cinepak

Cinepak is a standard software video codec optimized for 16- and 24-bit video content and CD-ROM-based
video playback. Cinepak is based on vector-quantization algorithms for video compression and decompression, and thus it is capable of offering variable levels of compression quality based on the time available for compression and the data rate of the target playback machine. Interframe compression is also used to achieve higher compression ratios. The average compression ratio of Cinepak is 20:1 compared with the original source video. As a highly asymmetric video codec (approximately 192:1), Cinepak decompresses quickly and plays reasonably well on both low-end machines (486s and even 286s) and high-end ones (Pentiums). Cinepak is implemented in Video for Windows as well as QuickTime, which creates portability across various platforms. It can also constrain data rates to user-definable levels for CD-ROM playback. Cinepak also has some weaknesses. First, Cinepak compression is complex and very time-consuming. Second, Cinepak must always compress video at least 10:1, so it is less useful at the higher data rates of 4× CD-ROM and above. Furthermore, it was never designed for very low bandwidth and, as a result, does not work very well at data rates under 30 kbyte/s. Current Cinepak licensees include Apple Computer for QuickTime on both MacOS and Windows, Microsoft for Video for Windows, 3DO, Atari Jaguar, Sega, NeXT Corporation's NeXTStep, Cirrus Logic, Weitek, Western Digital, and Creative Labs.

DVI, Indeo, and IVI

DVI is Intel's original name for its Digital Video Interactive codec, which is based on the region encoding technique. DVI has been replaced by Intel's Indeo technology (8,9) for scalable software capture and playback of digital video. Intel licenses Indeo technology to companies such as Microsoft, which then integrate it into products such as Microsoft's Video for Windows. Indeo technology can record 8-, 16-, or 24-bit video sequences and store the sequence as 24-bit for scalability on higher-power PCs. After introducing Indeo 2 and 3, Intel released Indeo 4 and 5 under the new name Intel Video Interactive (IVI), with many new capabilities. The IVI codec replaces Indeo 3.2's vector quantization technique with a more sophisticated interframe codec using a hybrid wavelet compression algorithm. Wavelet compression works by encoding the image into a number of frequency bands, each representing the image at a different level of sharpness. By representing the image based on frequency content, it is possible to choose which portion of the video data to keep in order to achieve the desired compression ratio. For example, high-frequency content, which makes up the fine detail of the frame image, can be reduced to achieve a considerable amount of compression without some of the characteristic blockiness of other codecs. Other interesting features of IVI include scaling, transparency, local window decoding, and access protection. Scaling means that the codec can adapt video playback to the processor power of a particular machine. The transparency feature of IVI lets video or graphical objects be overlaid onto either a video or a background scene and be interactively controlled, which is ideal for interactive video applications. Local video decoding gives programmers the ability to decode any rectangular subregion of a video image; the size and location of such subregions can be dynamically adjusted during playback. Each IVI video can also be password-protected, which is very useful to video developers
in controlling video distribution. IVI also allows the developer to place key video frames during the video compression process, and thus it supports rapid access to selected points in the video without giving up interframe compression. Another useful feature is on-the-fly adjustment of contrast, brightness, and saturation. IVI produces better image quality than codecs such as Cinepak, but it was designed for the Pentium II chip and MMX technology and requires a great deal of processor power; on low-end PCs, the quality of Indeo video playback can be very poor. Video files compressed with IVI are usually 1.5 to 2 times larger than those of MPEG-1; however, on a fast Pentium machine with an accelerated graphics card, Indeo plays back at a quality comparable to that of software MPEG-1 players. The distinctive advantage of IVI is its support for the features necessary to develop multimedia games and applications incorporating interactive video. Currently, IVI is available for use with Video for Windows, and a QuickTime version is also promised.
QUICKTIME

QuickTime (10,11) is often mistaken for a video codec. In fact, like Microsoft's Video for Windows, it is defined by Apple as an architecture standard for computer systems to accommodate video as well as text, graphics, animation, and sound. Unlike video codecs, multimedia architectures such as QuickTime are more concerned with defining a usable API so that program developers can generate cross-platform multimedia applications quickly and effectively. Thus, neither QuickTime nor Video for Windows specifies a particular video codec. Rather, they assume that all kinds of encoding and decoding will be available through hardware or software codecs, and they provide meta-systems that allow the programmer to name the encoding and provide translations. QuickTime is composed of three distinct elements: the QuickTime Movie file format, the QuickTime Media Abstraction Layer, and a set of built-in QuickTime media services. The QuickTime Movie file format specifies a standard means of storing digital media compositions. Using this format, one can store not only the media data individually but also a complete description of the overall media composition; such a description might include the spatial and auditory relationships between multiple video and audio channels in a more complex composition. The QuickTime Media Abstraction Layer specifies how software tools and applications access the media support services built into QuickTime. It also defines how hardware can accelerate performance-critical portions of the QuickTime system. Finally, the QuickTime Media Abstraction Layer outlines the means by which component software developers can extend and enhance the media services accessible through QuickTime. QuickTime also has a set of built-in media services that application developers can take advantage of to reduce the time and resources needed for development. QuickTime 3.0 includes built-in support for ten different media types, including video, audio, text, timecode, music/MIDI, sprite/animation, tween, MPEG, VR, and 3D. For each built-in media type, QuickTime provides a set of media-specific services appropriate for managing that particular media type.
QuickTime supports a wide variety of video file formats, including Microsoft's AVI (Audio/Video Interleaved), OpenDML (a professional extension of AVI), Avid's OMF (Open Media Framework), MPEG-1 video and audio, DV, Cinepak, Indeo, MJPEG, and many others. For example, QuickTime 3.0 (10) has built-in, platform-independent software support (a DV codec) for DV. This means that all current QuickTime-enabled applications can work with DV without any changes. DV data can be played, edited, combined with other video formats, and easily converted into other formats such as Cinepak or Indeo for CD-ROM video delivery. The QuickTime DV encoder can also encode video of other formats into DV, which can be transferred back to a DV camcorder using FireWire. QuickTime has the potential of becoming a computer-industry standard for the interchange of video and audio sequences. According to a recent survey (11), over 50% of all Web video is in QuickTime format; MPEG format is in second place and can also be played in any QuickTime application.
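The "meta-system" idea described above (naming an encoding and letting whichever codec claims that name handle the data) can be illustrated with a small registry sketch. This is a conceptual illustration only; it does not reflect the actual QuickTime or Video for Windows APIs, and all names in it are hypothetical.

# Conceptual sketch of a codec registry behind a multimedia architecture;
# this is NOT the QuickTime or Video for Windows API, only an illustration.

class CodecRegistry:
    def __init__(self):
        self._codecs = {}

    def register(self, name, decoder):
        """Associate an encoding name with a decoder supplied by a codec component."""
        self._codecs[name] = decoder

    def decode(self, name, data):
        """Dispatch compressed data to whichever codec claims the named encoding."""
        if name not in self._codecs:
            raise KeyError(f"no codec registered for encoding '{name}'")
        return self._codecs[name](data)

registry = CodecRegistry()
registry.register("cinepak", lambda data: f"decoded {len(data)} bytes with a Cinepak-style codec")
print(registry.decode("cinepak", b"\x00" * 16))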
VIDEO DATABASES

It is impossible to cope with the huge and ever-increasing amount of video information without systematic video data management. In the past, a similar need led to the creation of computerized textual and numeric database management systems (DBMSs). These alphanumeric DBMSs were designed mainly for managing simple structured data types. However, the nature of video data is fundamentally different from that of alphanumeric data, and it requires new ways of modeling, inserting, indexing, and manipulating data. A video database management system (VDBMS) can be defined as a software system that manages a collection of video data and provides content-based access to users (3). A generic video database management system is shown in Fig. 2. Similar to the issues involved in a traditional DBMS (12), a VDBMS needs to address the following important issues:

• Video data modeling, which deals with the issue of representing video data, that is, designing the high-level abstraction of the raw video to facilitate various operations. These operations include video data insertion, editing, indexing, browsing, and querying. Thus, modeling of the video data is usually the first step in the VDBMS design process and has a great impact on the other components. The video data model is, to a certain extent, user- and application-dependent.
• Video data insertion, which deals with the issue of introducing new video data into a video database. This usually includes the following steps: (1) extraction of key information (or features) from the video data for instantiating a data model; automatic feature extraction can usually be done using image processing and computer vision techniques for video analysis; (2) breaking the given video stream into a set of basic units, a process often called video scene analysis and segmentation; (3) manual or semiautomatic annotation of the video units, where what needs to be annotated usually depends on the application domain; and (4) indexing and storing the video data in the video database based on the extracted and annotated information.
• Video data indexing, which is the most important step in the video data insertion process. It deals with the organization of the video data in the video database to make user access more efficient. This process involves the identification of important features and the computation of search keys (indexes) based on them for ordering the video data.
• Video data query and retrieval, which deals with the extraction of video data from the database that satisfy certain user-specified query conditions. Because of the nature of video data, those query conditions are usually ambiguous in that the video data satisfying them are not unique. This difficulty can be partially overcome by providing a graphical user interface (GUI) and video database browsing capability to the users. Such a GUI can greatly help the user with query formulation, result viewing and manipulation, and navigation of the video database.

VIDEO DATA MODELING

Traditional data models, such as the relational data model, have long been recognized as inadequate for representing the rich data structures required by image and video data. In the past few years, many video data models have been proposed.
Figure 2. A generic video database system. Components shown include video input, video processing, video segmentation, video annotation, and video indexing, a video model and video metadata built over the raw video data, and user-facing elements (graphical user interface, video authoring, and video player).

Figure 3. Frames and shots of a CNN ''Headline News'' episode. Shots shown include ''Headline News'' flying-letters shots, black frames, an anchorperson shot, news reels, and a weather forecast.
They can basically be classified into the following categories (2): segmentation-based models, annotation layering-based models, and video object-based models. In order to provide efficient management, a VDBMS should support video data as one of its data types, just like textual or numerical data. The supporting video data model should integrate both the video content and its semantic structure, and structural and temporal relationships between video segments should also be expressible. Other important requirements for a video data model include the following.

Multilevel Video Structure Abstraction Support. There are two inherent levels of video abstraction: the entire video stream and the individual frames. For most applications, the entire video stream is too coarse as a level of abstraction; on the other hand, a single video frame is too brief to be the unit of interest. Intermediate abstractions, such as scenes, are therefore required, and thus a hierarchy of video stream abstractions can be formed. At each level of the hierarchy, it should be possible to attach additional information, such as shot type. Such multilevel abstraction makes it easier to reference and comprehend video information, as well as more efficient to index, browse, and store video data. The shot, which can be defined as one or more frames generated and recorded contiguously and representing a continuous action in time and space (13), is often considered the basic structural element for characterizing video data. Shots that are related in time and space can be assembled into an episode (14); Fig. 3 is an example representing the CNN ''Headline News'' episode structure. Another example is the compound unit-sequence-scene-shot hierarchical video stream structure in the VideoStar system (15). A scene is a set of shots that are related in time and space, and scenes that together convey a meaning are grouped into what is called a sequence. Related sequences are assembled into compound units, which can be recursively grouped into compound units of arbitrary level.

Spatial and Temporal Relationship Support. A key characteristic of video data is the associated spatial and temporal semantics, which distinguish video data from other types of data. Thus, it is important that the video model identify physical objects and their relationships in time and space in order to support user queries with temporal and spatial constraints. The temporal relationships between different video segments are also very important from the perspective of a user navigating through a video. There are thirteen distinct ways in
which any two intervals can be related (16), and they can be represented by seven basic cases (17) (before, meets, overlaps, during, starts, finishes, and equal), since six pairs of them are inverses of each other (a small sketch classifying interval pairs into these cases is given at the end of this section, just before the discussion of specific models). These temporal relations are used in formulating queries that contain temporal relationship constraints among the video segments (15,18). For spatial relations, most techniques are based on projecting objects onto a two- or three-dimensional coordinate system. Very few research attempts have been made to formally represent the spatiotemporal relationships of objects contained in video data and to support queries with such constraints.

Video Annotation Support. A video data model should support incremental and dynamic annotation of the video stream. Unlike textual data, digital video does not easily accommodate the extraction of content features, because fully automatic image and speech recognition is not yet feasible. Moreover, the structure of a video captures some aspects of the video material but is not suited for representing every characteristic of the material. It should be possible to make detailed descriptions of the video content that are linked not necessarily to structural components but more often to arbitrary frame sequences (18–20). Additionally, video annotations often change dynamically depending on human interpretation and application contexts. Currently, the video annotation process is mostly an off-line and manual process; because of this, GUIs are often built to help users input descriptions of video data. Video annotation will remain interactive in general until significant breakthroughs are made in the fields of computer vision and artificial intelligence.

Video Data Independence. Data independence is a fundamental transparency that should be provided by a VDBMS. One of the advantages of data independence is sharing and reuse of video data; that is, the same basic video material may be used in several different video documents. Sharing and reuse are critical in a VDBMS because of the sheer volume and rich semantics of video data, and they can be achieved by defining abstract logical video concepts on top of the physical video data (15,18). For example, Hjelsvold et al. (15) define the video content of a video document as a logical concept called VideoStream, which can be mapped onto a set of physically stored video data called StoredVideoSegments. On the other hand, logical video streams and logical video segments are proposed by Jiang et al. (18) as higher-level abstractions of physical video segments.
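The seven basic interval relations mentioned under spatial and temporal relationship support can be computed directly from segment endpoints. The following is a minimal sketch, assuming segments given as closed frame intervals; the six inverse relations are obtained simply by swapping the arguments.

# Classify the temporal relation between two video segments (frame intervals).
# Covers the seven basic cases (before, meets, overlaps, during, starts, finishes, equal);
# the six inverse relations are obtained by swapping the arguments.

def interval_relation(a_start, a_end, b_start, b_end):
    if (a_start, a_end) == (b_start, b_end):
        return "equal"
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_end == b_end and a_start > b_start:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return "inverse relation (swap the arguments)"

print(interval_relation(0, 100, 100, 250))   # meets
print(interval_relation(50, 80, 0, 200))     # during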
Segmentation-Based Video Data Models

Segmentation-based video models (14,21,22) rely heavily on image processing techniques. For a given video stream, scene change detection (SCD) algorithms (23) are usually used to parse and segment the stream into a set of basic units called shots. A representing video frame (RFrame) can then be selected from each shot; together, these frames represent the corresponding video stream. Features of the RFrames can be extracted and serve as video indices. In a more sophisticated video model, shots can also be matched or classified against a set of domain-specific templates (or patterns or models) in order to extract higher-level semantics and structures contained in the video, such as episodes, and a hierarchical representation of the video stream can then be built. One example of such a shot model in CNN news video is the anchorperson shot (14,24), which can be recognized from the locations of a set of features within frames; these features include the ''Headline News'' icon in the lower right corner and the title giving the anchorperson's name. The main advantage of segmentation-based video data models is that the indexing process can be fully automated. However, they also have several limitations (18). First, they lack flexibility and scalability, since the video streams are presegmented by the SCD algorithms. Second, the similarity measure between two frame images is often ill-defined and limited, making the template matching process unreliable. Third, they lack applicability for video streams that do not have well-defined structures; for example, a class lecture video may have no clear visual structure in terms of shots, so segmentation using SCD algorithms would be extremely difficult. Finally, only limited semantics can be derived from template matching processes, and the templates themselves are application-specific.

Annotation Layering-Based Models

Video annotations are often used to record the semantic video (or image) content and to provide content-based access in multimedia systems such as video-on-demand (VoD) (25). The difficulty of automatically creating video annotations may be one of the major limitations of annotation-based models; nevertheless, annotations can be generated by (1) using a closed-caption decoder, (2) detecting and extracting text that appears in the video using OCR techniques (26,27), or (3) capturing and transforming audio signals into text through voice recognition techniques (28,29). The basic idea of annotation-based models is to layer the content information on top of a video stream rather than to segment the video data into shots. One of the earliest annotation-based models is the stratification model (30). An example of a video model that extends this basic idea is the generic video data model in the VideoStar system (15). It allows free-text annotation of any arbitrary video frame sequence by establishing an Annotates relationship between a Frame Sequence and an Annotation. The sharing and reuse of video material is supported by the idea of a logical VideoStream. This model, however, supports only simple Boolean queries on the video annotations. Nested stratification is allowed in the Algebraic Video model (31); that is, logical video segments can be overlapped or nested. Multiple views of the same raw video segment can be defined, and video algebraic operators are used for the recomposition of the video material. Four kinds of interval relations (precede, follow, overlap, and equal) are
defined as attributes of a logical video segment. The Smart VideoText model (18,32) is based on multilevel video data abstraction and concept graph (CG) (33) knowledge representation. It not only supports Boolean and all possible temporal interval (16) constraints but also captures the semantic associations among the video annotations with CGs and supports knowledge-based query and browsing. The VideoText model allows multiple users to dynamically create and share free-text video content descriptions. Each annotation is mapped into a logical video segment, and these segments can overlap in arbitrary ways. To summarize, annotation layering-based video models have several advantages. First, they support variable video access granularities, and annotations can be made on a logical video segment of any length. Second, video annotations can easily be handled by existing, sophisticated information retrieval (IR) and database techniques. Third, multiple annotations can be linked to the same logical segment of video data, and they can be added and deleted independently of the underlying video streams; thus these models support dynamic and incremental creation and modification of video annotations, as well as users' views. Finally, annotation layering-based video models support semantic content-based video queries, retrieval, and browsing.

Video Object Models

Two prevailing data models used in current DBMSs are the relational and object-oriented models. The object-oriented model has several features that make it an attractive candidate for modeling video data. These features include capabilities for complex object representation and management, handling of object identities, encapsulation of data and associated methods into objects, and class hierarchy-based inheritance of attribute structures and methods. However, modeling video data using the object-oriented data model has also been strongly criticized (34,35), mainly for the following reasons:

• Video data are raw data created independently of their contents and database structure, which are described later, in the annotation process.
• In traditional data models such as the object-oriented model, the data schema is static: the attributes of an object are more or less fixed once they are defined, and adding or deleting attributes is impossible. However, the attributes of video data cannot be defined completely in advance, because descriptions of video data are user- and application-dependent, and the rich information contained in video data implies that semantic meaning should be added incrementally. Thus, a video data model should support an arbitrary attribute structure for the video data as well as incremental and dynamic evolution of the schemas and attributes.
• Many object-oriented data models support only class-based inheritance. However, for video data objects, which usually overlap or include each other, support for inclusion inheritance (35) is desired. Inclusion inheritance enables sharing of descriptive data among the video objects.

The notion of a video object is defined in the object-oriented video information database (OVID) (35) as an arbitrary
sequence of video frames. Each video object consists of a unique identifier, an interval represented by a pair of starting and ending frame numbers, and the contents of the video frame sequence, described manually by a collection of attribute-value pairs. The OVID video data model is schemaless; that is, it does not use a class hierarchy as a database schema as in an OODB system. Arbitrary attributes can be attached to each video object if necessary, which enables the user to describe the content of a video object in a dynamic and incremental way. Additionally, interval inclusion inheritance is applied to ease the effort of providing descriptive data when existing video objects are composed into new video objects, using the generalization hierarchy concept. This approach, however, is very tedious, since the description of video content is done manually by users and not through an automatic image processing mechanism.

VIDEO CUT DETECTION AND SEGMENTATION

One fundamental problem that has a great impact on all aspects of video databases is the content-based temporal sampling of video data (36). The purpose of content-based temporal sampling is to identify significant video frames to achieve better representation, indexing, storage, and retrieval of the video data. Automatic content-based temporal sampling is very difficult because the sampling criteria are not well defined; whether a video frame is important or not is usually subjective. Moreover, importance is usually highly application-dependent and requires high-level, semantic interpretation of the video content, which in turn requires the combination of very sophisticated techniques from computer vision and artificial intelligence. The state of the art in those fields, however, has not advanced to the point where semantic interpretation is possible. Satisfying results can still be obtained by analyzing the visual content of the video and partitioning it into a set of basic units called shots; this process is also referred to as video data segmentation. Content-based sampling can thus be approximated by selecting one representing frame from each shot, since a shot can be defined as a continuous sequence of video frames which have no significant interframe difference in terms of their visual content. A single shot usually results from a single continuous camera operation. This partitioning is usually achieved by sequentially measuring interframe differences and studying their variances, for example, by detecting sharp peaks. This process is often called scene change detection (SCD).
Figure 4. Example of an abrupt scene change (a) and a gradual scene change (b).
Scene changes in a video sequence can be either abrupt or gradual. Abrupt scene changes result from editing "cuts," and detecting them is often called cut detection. Gradual scene changes result from chromatic, spatial, or combined video edits such as zooms, camera pans, dissolves, fade-in/fade-out, and so on. An example of an abrupt scene change and a gradual scene change is shown in Fig. 4. SCD is usually based on some measurements of the image frame, which can be computed from the information contained in the images. This information can be color, spatial correlation, object shape, motion contained in the video image, or discrete cosine (DC) coefficients in the case of compressed video data. In general, gradual scene changes are more difficult to detect than abrupt scene changes and may cause many SCD algorithms to fail under certain circumstances. Existing SCD algorithms can be classified in many ways according to, among other things, the video features they use and the video objects to which they can be applied. Here, we discuss SCD algorithms in three main categories: (1) approaches that work on uncompressed full-image sequences; (2) algorithms that aim at working directly on compressed video; and (3) approaches that are based on explicit models. The last are also called top-down approaches, whereas the first two categories are called bottom-up approaches (3).

Preliminaries

We now introduce some basic notation and concepts, as well as several common interimage difference measurements. It should be noted that these measurements may not work well for scene change detection when used separately, so they are usually combined in SCD algorithms. A sequence of video images, whether fully uncompressed or spatially reduced, is denoted Ii, 0 ≤ i < N, where N is the length (number of frames) of the video data. Ii(x, y) denotes the value of the pixel at position (x, y) in the ith frame, and Hi refers to the histogram of the image Ii. The interframe difference between images Ii and Ij according to some measurement is represented as d(Ii, Ij).

DC Images and DC Sequences. A DC (discrete cosine) image is a spatially reduced version of a given image. It can be obtained by first dividing the original image into blocks of n × n pixels each and then computing the average value of the pixels in each block, which corresponds to one pixel in the DC image. For compressed video data (e.g., MPEG video), a sequence of DC images can be constructed directly from the compressed video sequence; such a sequence is called a DC sequence.
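A DC image as described above is simply a block-averaged thumbnail of the full frame. The following minimal sketch builds one from a grayscale frame represented as a NumPy array; the block size of 8 and the random test frame are our illustrative choices.

# Build a DC image by averaging each n x n block of a grayscale frame, as described
# above (for MPEG I frames the same values come directly from the DCT dc terms).
import numpy as np

def dc_image(frame: np.ndarray, n: int = 8) -> np.ndarray:
    """Spatially reduce a (H, W) frame by replacing each n x n block with its mean."""
    h, w = frame.shape
    h, w = h - h % n, w - w % n          # crop so the frame tiles evenly into blocks
    blocks = frame[:h, :w].reshape(h // n, n, w // n, n)
    return blocks.mean(axis=(1, 3))

frame = np.random.randint(0, 256, size=(288, 352)).astype(np.float64)
print(dc_image(frame).shape)             # (36, 44)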
Figure 5. (a) An example of a full image and its DC image, (b) template matching, (c) color histogram, and (d) χ² histogram.
Figure 5(a) is an example of a video frame image and its DC image. There are several advantages to using DC images and DC sequences in SCD for compressed video (27). First, DC images retain most of the essential global information of the image, so much of the analysis done on the full image can also be done on its DC image instead. Second, DC images are considerably smaller than the full image frames, which makes analysis on DC images much more efficient. Third, partial decoding of compressed video saves more computation time than full-frame decompression. Extracting the DC image of an I frame from an MPEG video stream is trivial, since it is given by its DCT coefficients. Extracting DC images from P frames and B frames requires interframe motion information, which may result in many multiplication operations; the computation can be sped up using approximations (37). It is claimed (27) that the reduced images formed from DC coefficients, whether they are precisely or approximately computed, retain the "global features" that can be used for video data segmentation, SCD, matching, and other image analysis.
Basic Measurements of Interframe Difference

Template Matching. Template matching is done by comparing the pixels of two images at the same locations, which can be formulated as

d(Ii, Ij) = Σx,y |Ii(x, y) − Ij(x, y)|

where the sum runs over all pixel positions (x, y) of the M × N image. Template matching is very sensitive to noise and object movement because it is strictly tied to pixel locations. This can cause false scene change detections and can be overcome to some degree by partitioning the image into several subregions. Figure 5(b) is an example of an interframe difference sequence based on template matching; the input video is the one that contains the first image sequence in Fig. 4.

Color Histogram. The color histogram of an image can be computed by dividing a color space (e.g., RGB) into discrete image colors called bins and counting the number of pixels that fall into each bin. The difference between two images Ii and Ij, based on their color histograms Hi and Hj, can be formulated as

d(Ii, Ij) = Σk |Hi(k) − Hj(k)|

where the sum runs over the n bins; it measures the difference in the number of pixels of the two images that fall into the same bin. In the RGB color space this becomes

dRGB(Ii, Ij) = Σk (|Hir(k) − Hjr(k)| + |Hig(k) − Hjg(k)| + |Hib(k) − Hjb(k)|)

where Hir, Hig, and Hib denote the red, green, and blue histograms of image Ii. Using only a simple color histogram may not be effective at detecting scene changes, because two images can be very different in structure and yet have similar pixel values. Figure 5(c) is the interframe difference sequence of the first video sequence in Fig. 4 measured by the color histogram.

χ² Histogram. The χ² histogram computes the distance measure between two image frames as

d(Ii, Ij) = Σk (Hi(k) − Hj(k))² / Hj(k)

which is used in many existing SCD algorithms. Experiments indicate that this method generates better results than other intensity-based measurements such as the color histogram and template matching. Figure 5(d) is the interframe difference sequence of the first video sequence in Fig. 4 measured by the χ² histogram.
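A minimal sketch of the three measurements above follows, for grayscale frames stored as NumPy arrays with pixel values 0 to 255 and 256 histogram bins (our choices). As discussed below, practical systems typically subdivide frames and combine several measures rather than using any one of these alone.

# Template matching, histogram difference, and chi-square histogram difference
# between two grayscale frames (values 0-255); bin count and dtypes are our choices.
import numpy as np

def template_matching(f1, f2):
    return np.abs(f1.astype(np.int64) - f2.astype(np.int64)).sum()

def histogram_difference(f1, f2, bins=256):
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return np.abs(h1 - h2).sum()

def chi_square_difference(f1, f2, bins=256):
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    h2 = np.where(h2 == 0, 1, h2)        # avoid division by zero in empty bins
    return (((h1 - h2) ** 2) / h2).sum()

a = np.random.randint(0, 256, size=(144, 176))
b = np.random.randint(0, 256, size=(144, 176))
print(template_matching(a, b), histogram_difference(a, b), chi_square_difference(a, b))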
Full Image Video Scene Change Detection

Most of the existing work on SCD is based on full-image video analysis. The differences between the various SCD approaches lie in the measurement function used, the features chosen to be measured, and the subdivision of the frame images. The existing algorithms use either intensity features or motion information of the video data to compute the interframe difference sequence. The intensity-based approaches, however, may fail when a peak is introduced by object or camera motion. Motion-based algorithms have the drawback of being computationally expensive, since they usually need to match image blocks across video
frames. After the interframe differences are computed, some approaches use a global threshold to decide whether a scene change has occurred. This is clearly insufficient, since a large global difference does not necessarily imply that there is a scene change; in fact, scene changes with globally low peaks are one of the main causes of failure of these algorithms. Scene changes, whether abrupt or gradual, are localized processes and should be checked accordingly.

Detecting Abrupt Scene Changes. Algorithms for detecting abrupt scene changes have been extensively studied, and accuracy rates of over 90% have been achieved. The following are some algorithms developed specifically for detecting abrupt scene changes, without taking gradual scene changes into consideration. Nagasaka and Tanaka (38) presented an approach that partitions the video frames into 4 × 4 equal-sized windows and compares the corresponding windows of the two frames. Every pair of windows is compared, and the largest difference is discarded; the remaining difference values are used to make the final decision. The purpose of the subdivision is to make the algorithm more tolerant of object movement, camera movement, and zooms. Six different types of measurement functions, namely difference of gray-level sums, template matching, difference of gray-level histograms, color template matching, difference of color histograms, and a χ² comparison of the color histograms, were tested. The experimental results indicate that a combination of image subdivision and the χ² color histogram approach provides the best results. Akutsu et al. (39) used both the average interframe correlation coefficient and the ratio of velocity to motion in each frame of the video to detect scene changes. Their assumptions were that (a) the interframe correlation between frames from the same scene should be high and (b) the ratio of velocity to motion across a cut should also be high. Hsu and Harashima (40) treated scene changes and activities in the video stream as a set of motion discontinuities which change the shape of the spatiotemporal surfaces. The sign of the Gaussian and mean curvature of the spatiotemporal surfaces is used to characterize the activities. Scene changes are detected using an empirically chosen global threshold; clustering and split-and-merge approaches are then used to segment the video.

Detecting Gradual Scene Changes. Robust gradual SCD is more challenging than its abrupt counterpart, especially when, for example, there is a lot of motion involved. Unlike abrupt scene changes, a gradual scene change does not usually manifest itself as a sharp peak in the interframe difference sequence and can thus be easily confused with object or camera motion. Gradual scene changes are usually determined by observing the behavior of the interframe differences over a certain period of time. For example, the twin-comparison algorithm (41) uses two thresholds Tb and Ts, with Ts < Tb, for camera breaks and gradual transitions, respectively. Frames whose consecutive-frame histogram difference d(Ii, Ii+1) satisfies Ts < d(Ii, Ii+1) < Tb are considered potential start frames of a gradual transition. For each potential start frame detected, an accumulated difference Ac(i) is computed over the following frames; the end of the gradual transition is declared when Ac(i) > Tb and the consecutive-frame difference d(Ii, Ii+1) falls back below Ts. To distinguish gradual transitions from other camera operations such as pans and zooms, the approach uses image flow computations.
Gradual transitions result in a null optical flow, whereas other camera operations produce characteristic types of flow. The ap-
proach achieves good results, with failures occurring due to either (a) similarity in color histograms across shots when color contents are very similar or (b) sharp changes in lighting such as flashes and flickering object. Shahraray (36), on the other hand, detected abrupt and gradual scene changes based on motion-controlled temporal filtering of the disparity between consecutive frames. Each image frame is subdivided, and image block matching is done based upon image intensity values. A nonlinear order statistical filter (42) is used to combine the image matching values of different image blocks; that is, the weight of an image match value in the total sum depends on its order in the image match value list. It is claimed that this match measure of two images is more consistent with a human’s judgment. Abrupt scene changes are detected by a thresholding process, and gradual transitions are detected by the identification of sustained low-level increases in image matching values. False detection due to the camera and object motions are suppressed by image block matching as well as temporal filtering of the image matching value sequence. SCD results can be verified simply by measuring the interframe difference of representing frames resulting from the SCD algorithm; high similarity would likely indicate a false detection. To improve the result of detecting fades, dissolves, and wipes which most existing algorithms have difficulties with, Zabih et al. (43) proposed an algorithm based on the edge changing fraction. They observed that new intensity edges appear (enter the scene) far from the locations of old edges during a scene change, and that old edges disappear (exit the scene) far from the locations of old edges. Abrupt scene changes, fades, and dissolves are detected by studying the peak values in a fixed window of frames. Wipes can be identified through the distribution of entering and exiting edge pixels. A global computation is used to guard the algorithm from camera and object motion. The experimental results indicate that the algorithm is robust against the parameter variances, compression loss, and subsampling of the frame images. The algorithm performs well in detecting fades, dissolves, and wipes but may fail in cases of very rapid changes in lighting and fast moving objects. It may also have difficulties when applied to video that is very dim where no edge can be detected. Scene Change Detection on the Compressed Video Data Two approaches can be used to detect scene changes on compressed video streams. The video stream can be fully decompressed, and then the video scene analysis can be performed on full frame image sequence. However, fully decompressing the compressed video data can be computationally intensive. To speed up the scene analysis, some SCD algorithms work directly on compressed video data without the full decompression step. They produce results similar to that of the full-image-based approach, but are much more efficient. Most of SCD algorithms in this category have been tested on the DCT-based standard compressed video since DCT (discrete cosine transformation)-related information can be extracted directly and doesn’t require full decompression of video stream. Some algorithms operate on the corresponding DC image sequences of the compressed video (27,44,45), while some use DC coefficients and motion vectors instead (46–49). They all need only partial decompression of the video.
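Before turning to the compressed-domain algorithms, the uncompressed-domain pipeline discussed above can be illustrated with a minimal sketch of the twin-comparison idea (41): consecutive-frame histogram differences are compared against the thresholds Tb and Ts, differences above Tb are flagged as camera breaks, and runs of intermediate differences are accumulated as candidate gradual transitions. The histogram metric, the use of a running sum as the accumulated comparison, the threshold values, and the grayscale frame arrays are simplifying assumptions of this sketch, not part of the published algorithm.

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=64):
    # Sum of absolute differences between gray-level histograms (placeholder metric).
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 255))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 255))
    return float(np.abs(ha - hb).sum())

def twin_comparison(frames, t_break, t_trans):
    # t_break (Tb) flags camera breaks; t_trans (Ts < Tb) flags potential starts
    # of gradual transitions, confirmed once the accumulated difference exceeds Tb.
    cuts, transitions = [], []
    start, accumulated = None, 0.0
    for i in range(len(frames) - 1):
        d = histogram_difference(frames[i], frames[i + 1])
        if d > t_break:                      # abrupt scene change
            cuts.append(i + 1)
            start, accumulated = None, 0.0
        elif d > t_trans:                    # potential gradual-transition frame
            if start is None:
                start = i
            accumulated += d                 # stand-in for the accumulated comparison Ac(i)
        elif start is not None:              # difference fell back below Ts
            if accumulated > t_break:        # enough total change to declare a transition
                transitions.append((start, i))
            start, accumulated = None, 0.0
    return cuts, transitions
```

In a complete implementation, the image-flow test described above would then be applied to the candidate intervals in order to reject pans and zooms.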
DC Image Sequence-Based Approach. Yeo and Liu (27,44,45) propose to detect scene changes in the DC image sequence of the compressed video data. Global color statistic comparison (RGB color histogram) is found to be less sensitive to the motion but more expensive to compute. Although template matching is usually sensitive to the camera and object motion and may not produce good results as the full frame image case, it is found to be more suitable for DC sequences because DC sequences are smoothed versions of the corresponding full images. Yeo’s algorithm uses template matching measurement. Abrupt scene changes were detected by first computing the interframe difference sequence and then applying a slide window of size m. A scene change is found if the difference between two frames is the maximum within a symmetric window of size 2m ⫺ 1 and is also n times the second largest difference in the window. The second criteria is for the purpose of guarding false SCD because of fast panning, zooming, or camera flashing. The window size m is set to be smaller than the minimum number of frames between any scene change. The selection of parameters n and m relates to the trade-off between missed detection rate and false detection rate; typical values can be n ⫽ 3 and m ⫽ 10. Gradual scene changes can also be captured by computing and studying the difference of every frame with the previous kth frame—that is, checking if a ‘‘plateau’’ appears in the difference sequence. Experimental results indicate that over 99% of abrupt changes and 89.5% of gradual changes can be detected. This algorithm is about 70 times faster than on full image sequences, which conforms to the fact that the size of the DC images of a MPEG video is only of their original size. Although there may exist situations in which DC images are not sufficient to detect some video features (27), this approach is nonetheless very promising. DC Coefficients-Based Approach. Arman et al. (46) detect scene changes directly on MJPEG video by choosing a subset of the DC coefficients of the 8 ⫻ 8 DCT blocks to form a vector. The assumption is that the inner product of the vectors from the same scene is small. A global threshold is used to detect scene changes; and in case of uncertainty, a few neighboring frames are then selected for further decompression. Color histograms are used on those decompressed frames to find the location of scene changes. This approach is computationally efficient but does not address gradual transitions. Sethi and Patel (47) use only the DC coefficients of I frames of a MPEG video to detect scene changes based on luminance histogram. The basic idea is that if two video frames belong to the same scene, their statistical luminance distribution should be derived from a single statistical distribution. Three statistical tests used are Yakimovsky’s likelihood ratio test, the 2 histogram comparison test, and the Kolmogorov– Smirnov test. Experiments show that the 2 histogram comparison seems to produce better results. DCT blocks and vector information of a MPEG video are used by Zhang et al. (48) to detect scene changes based on a count of nonzero motion vectors. It is observed that the number of valid motion vectors in P or B frames tended to be low when such frames lie between two different shots. Those frames are then decompressed, and full-image analysis is done to detect scene changes. 
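A minimal sketch of the motion-vector test just described might look as follows: the count of valid motion vectors in each P or B frame is examined, and frames whose counts drop sharply are marked as suspected shot boundaries for subsequent full decompression and full-image analysis. The neighborhood comparison, the window size, and the ratio below are assumptions of this sketch rather than parameters reported for the method.

```python
import numpy as np

def suspect_shot_boundaries(valid_mv_counts, window=10, ratio=0.2):
    # valid_mv_counts: number of valid motion vectors found in each P or B frame
    # during partial decompression of the MPEG stream.
    # A frame is marked as a suspected shot boundary when its count drops well
    # below the average of its neighborhood; window and ratio are illustrative.
    counts = np.asarray(valid_mv_counts, dtype=float)
    suspects = []
    for i, c in enumerate(counts):
        lo, hi = max(0, i - window), min(len(counts), i + window + 1)
        neighborhood = np.concatenate((counts[lo:i], counts[i + 1:hi]))
        if neighborhood.size and c < ratio * neighborhood.mean():
            suspects.append(i)   # candidate for full decompression and analysis
    return suspects
```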
The weakness of the approach of Zhang et al. (48) is that motion compensation-related information tends to be unreliable and unpredictable in the case of gradual transitions, which may cause the approach to fail. Meng et al. (49) use the variance of the DC coefficients of I and P frames and motion vector informa-
tion to characterize scene changes of MPEG-I and MPEG-II video streams. The basic idea is that frames tend to have very different motion vector ratios if they belong to different scenes and have very similar motion vector ratios if they are within the same scene. Their scene-detection algorithm works in the following manner. First an MPEG video is decoded just enough to obtain the motion vectors and DC coefficients. Inverse motion compensation is applied only to the luminance microblocks of P frames to construct their DC coefficients. Then the suspected frames are marked in the following ways: (a) An I frame is marked if there is a peak interframe histogram difference and the immediate B frame before it has a peak value of the ratio between forward and backward motion vectors; (b) a P frame is marked if there is a peak in its ratio of intracoded blocks and forward motion vectors; and (c) a B frame is marked if its backward and forward motion vector ratio has a peak value. Final decisions are made by going through the marked frames to check whether they satisfy the local window threshold. The threshold is set according to the estimated minimal scene change distance. Dissolve effect is determined by noticing a parabolic variance curve. It should be pointed out that the above algorithms also have following limitations. First, current video compression standards like MPEG are optimized for data compression rather than for the representation of the visual content and they are lossy. Thus, they do not necessarily produce accurate motion vectors. Second, motion vectors are not always readily obtainable from the compressed video data since a large portion of the existing MPEG video has I frames only. Moreover, some of the important image analysis, such as automatic caption extraction and recognition, may not be possible on the compressed data. Model-Based Video Scene Change Detection It is possible to build an explicit model of scene changes to help the SCD process (3,50,51). These model-based SCD algorithms are sometimes referred to as top-down approaches, whereas algorithms discussed above are known as bottom-up approaches. The advantages of the model-based SCD is that a systematic procedure based on mathematical models can be developed, and certain domain-specific constraints can be added to improve the effectiveness of the approaches. For example, the production model-based classification approach (3,51) is based on a study of the video production process and different constraints abstracted from it. The edit effect model contains both abrupt and gradual scene changes. Gradual scene changes such as fade and dissolve are modeled as chromatic scaling operations; for example, fade is modeled as a chromatic scaling operation with positive and negative fade rates. The algorithm first identifies the features that correspond to each of the edit classes to be detected and then classifies video frames based on these features by using both template matching and 2 histogram measurements. Feature vectors extracted from the video data are used together with the mathematical models to classify the video frames and to detect any edit boundaries. This approach has been tested with cut, fade, dissolve, and spatial edits, at an overall 88% accurate rate. Another example is an SCD algorithm based on a differential model of the distribution of pixel value differences in a video stream (50). 
The model includes: (1) a small-amplitude, additive, zero-centered Gaussian noise term that models camera, film, and other noise; (2) an intrashot change model
for pixel change probability distribution constructed from object, camera motion, angle change, focus, or light change at a given time and in a given shot; and (3) a set of shot transition models for different kinds of abrupt and gradual scene changes that are assumed to be linear. The SCD algorithm first reduces the resolution of frame images by undersampling to overcome the effects of the camera and object motion and make the compensation more efficient in the following steps. The second step is to compute the histogram of pixel difference values and count the number of pixels whose change of value is within a certain range determined by studying above models. Different scene changes are detected by checking the resulting integer sequence. Experiments show that the algorithm can achieve 94% to 100% detection rate for abrupt scene changes and around 80% for gradual scene changes. Evaluation Criteria for SCD Algorithm Performance It is difficult to evaluate and compare existing SCD algorithms due to the lack of objective performance measurements. This is mainly attributed to the diversity in the various factors involved in the video data. There are, however, still some common SCD performance measurements (23) given a set of benchmark video: (1) speed in terms of the number of frames processed per time unit; (2) average success rate or failure rate which includes both false detection and missed detection (a 100% scene change capture rate does not imply that the algorithm is good since it may have very high false change alarms); (3) accuracy in terms of determining the precise location and type of a scene change; (4) stability, that is, its sensitivity to the noise in the video stream (flashing of the scene and background noises often trigger the false detection); (5) types of the scene changes and special effects that the algorithm can handle; and (6) generality in terms of the applications it can be applied to and kinds of video data resources it can handle. Further improvement on existing SCD algorithm can be achieved in the following ways (23). First, use additional available video information such as closed caption and audio signal. Some initial efforts (29,52) on using audio signal have been made for video skimming and browsing support. Second, develop adaptive SCD algorithms that can combine several SCD techniques and self-adjust various parameters for different video data. Third, use a combination of various scenechange models. Different aspects of video editing and production process can be individually modeled for developing detectors for certain scene changes. Another idea is to develop new video codecs that include more information about the scene content (53). Current motion-compensated video codec standards like MPEG complicate the scene analysis task by partitioning the scene into arbitrary tiles, resulting in a compressed bitstream that is not physically or semantically related to the scene structure. A complete solution to the SCD problem, however, may require information available from psychophysics (54) and understanding the neural circuitry of the visual pathway (55). Techniques developed in computer vision for detecting motion or objects (56–58) can also be incorporated into SCD algorithms. VIDEO INDEXING Accessing video data is very time-consuming because of the huge volume and complexity of the data within the video da-
tabases. Indexing of video data is needed to facilitate the process, which is far more difficult and complex compared to the traditional alpha-numerical databases. In traditional databases, data are usually selected on one or more key fields (or attributes) that can uniquely identify the data itself. In video databases, however, what to index on is not as clear and easy to determine. The indexes can be built on audio–visual features, annotations, or other information contained in the video. Additionally, unlike alpha-numerical data, contentbased video data indexes are difficult to generate automatically. Video data indexing is also closely related to the video data model and possible user queries. Based on how the indexes are derived, existing work on video indexing can be classified into three categories: annotation-based indexing, feature-based indexing, and domain-specific indexing. Annotation-Based Indexing Video annotation is very important for a number of reasons. First, it fully explores the richness of the information contained in the video data. Second, it provides access to the video data based on its semantic content rather than just its audio–visual content like color distribution. Unfortunately, due to the current limitations of machine vision and imageprocessing techniques, full automation of video annotation, in general, still remains impossible. Thus, video annotation is a manual process that is usually done by an experienced user, either as part of the production process or as a post-production process. The cost of manual annotation is high and thus not suitable for the large collection of video data. However, in certain circumstances, video annotation can also be automatic captured from video signals as we discussed in the section entitled ‘‘Video Data Modeling.’’ Automatic video semantic content extraction using computer vision techniques with given application domain and knowledge has also been studied. One example is the football video tracking and understanding (59). Another example is the animal behavior video database (4). In addition, video database systems usually provide a user-friendly GUI to facilitate video annotation creation and modification. One of the earliest ideas for recording descriptive information about the film or video is the stratification model (30). It approximates the way in which the film/video editor builds an understanding of what happens in individual shots. A data camera is used during the video production process to record descriptive data of the video including time code, camera position, and voice annotation of who–what–why information. This approach is also called source annotation (3). However, the model doesn’t address the problem of converting this annotation information into textual descriptions to create indexes of the video data. It is common to simply use a preselected set of keywords (31) for video annotation. This approach, however, is criticized for a number of reasons (2,60). First, it is not possible to use only keywords to describe the spatial and temporal relationships, as well as other information contained in the video data. Second, keywords cannot fully represent semantic information in the video data and do not support inheritance, similarity, or inference between descriptors. Keywords also do not describe the relations between descriptions. Finally, keywords do not scale; that is, the greater the number of keywords used to describe the video data, the lesser the chance the video data will match the
query condition. A more flexible and powerful approach is to allow arbitrary free text video annotations (18,32,6) which are based on logical data abstractions. Jiang et al. (18) also further address the problem of integrating knowledge-based information retrieval systems with video database to support video knowledge inferencing and temporal relationship constraints. Another way to overcome the difficulties of keyword annotations is suggested by an annotation system called Media Stream (60,62). Media Stream allows users to create multilayer, iconic annotations of the video data. The system has three main user interfaces: Director’s Workshop, icon palettes, and media time lines for users to annotate the video. Director’s Workshop allows users to browse and compound predefined icon primitives into iconic descriptors by cascading hierarchical structure. Iconic descriptors are then grouped into one or more icon palettes and can be dropped into a media time line. The media time line represents the temporal nature of the video data, and the video is thus annotated by a media time line of icon descriptors. The creation of video indices however, is not discussed. The spatiotemporal relationships between objects or features in a video data can also be symbolically represented by spatial temporal logic (STL) (63). The spatial logic operators include before, after, overlaps, adjacent, contained, and partially intersects. Temporal logical operators include eventually and until. Standard boolean operators are also supported including and, or, and not. The symbolic description, which is a set of STL assertions, describes the ordering relationships among the objects in a video. The symbolic description is created for, and stored together with, each video data in the database and serves as an index. The symbolic description is checked when a user query is processed to determine matches. Feature-Based Indexing Feature-based indexing techniques depend mainly on image processing algorithms to segment video, to identify representing frames (RFrames), and to extract key features from the video shots or RFrames. Indexes can then be built based on key features such as color, texture, object motion, and so on. The advantage of this approach is that video indexing can be done automatically. Its primary limitation is the lack of semantics attached to the features which are needed for answering semantic content-based queries. One of the simplest approaches is to index video based upon visual features of its RFrames. A video stream can first be segmented into shots which can be visually represented by a set of selected RFrames. The RFrames are then used as indices into these shots. The similarity comparison of the RFrames can be based on the combination of several image features such as object shapes measured by the gray level moments and color histograms of the RFrames (64). This approach can be a very efficient way of indexing video data; however, types of query are limited due to the fact that video indexing and retrieval are completely based on the computation of image features. Video data can also be indexed on the objects and object motions which can be either interactively annotated or automatically extracted using motion extraction algorithms such as optical flow methods and block matching estimation techniques (65). Object motions can be represented by different combinations of primitive motions such as north and rotate-
to-left, or motion vectors (65). Motion vectors can then mapped by using spatiotemporal space (x ⫺ y ⫺ t) and are aggregated into several representative vectors using statistical analysis. Objects and their motion information are stored in a description file or a labeling record as an index to the corresponding video sequence. Notice that each record also needs to have a time interval during which the object appears. Multiple image features can be used simultaneously to index video data. They are often computed and grouped together as multidimensional vectors. For example, features used in MediaBENCH (66) include average intensity, representative hue values which are the top two hue histogram frequencies of each frame, and camera work parameters created by extracting camera motions from video (67). These values are computed every three frames in order to achieve realtime video indexing. Indexes are stored along with pointers to the corresponding video contents. Video data can be segmented into shots using a SCD algorithm based on index filtering by examining indices frame by frame and noticing the inter-frame differences. Thus, a structured video representation can be built to facilitate video browsing and retrieval operations. Domain-Specific Indexing Domain-specific indexing approaches use the logical (highlevel) video structure models, such as the anchorperson shot model and CNN ‘‘Headline News’’ unit model, to further process the low-level video feature extraction and analysis results. After logical video data units have been identified, certain semantic information can be attached to each of them, and domain specific indices can be built. These techniques are effective in their intended domain of application. The primary limitation is their narrow range of applicability, and limited semantic information can be extracted. Most current research uses collections of well-structured video such as news broadcast video as input. One of the early efforts with domain-specific video indexing was done by Swanberg et al. (14) in the domain of CNN news broadcasting video. Several logical video data models that are specific to news broadcasting (including the anchorperson shot model, the CNN news episode models, and so on) are proposed and used to identify these logical video data units. These models contain both spatial and temporal ordering of the key features, as well as different types of shots. For example, the anchorperson shot model is based on the location of a set of features including the icon ‘‘Headline News’’ and the titling of the anchorperson. Image-processing routines, including image comparison, object detection, and tracking, are used to segment the video into shots and interactively extract the key elements of the video data model from the video data. Hampapur et al. (21) proposed a methodology for designing feature-based indexing schemes which uses low-level imagesequence features in a feature-based classification formalism to arrive at a machine-derived index. A mapping between the machine-derived index and the desired index was designed using domain constraints. An efficacy measure was proposed to evaluate this mapping. The indexing scheme was implemented and tested on cable TV video data. Similarly, Smoliar et al. (22,68,69) used an a priori model of a video structure based on domain knowledge to parse and index the news pro-
gram. A given video stream is parsed to identify the key features of the video shots, which are then compared with domain-specific models to classify them. Both textual and visual indexes are built. The textual index uses a category tree and assigns news items to the topic category tree. The visual index is built during the parsing process, and each news item is represented as a visual icon inside a window that provides an unstructured index of the video database. A low-level index that indexes the key frames of video data is also built automatically. The features used for indexing include color, size, location, and shape of segmented regions and the color histograms of the entire image and nine subregions that are coded into numerical keys. VIDEO QUERY, RETRIEVAL, AND BROWSING The purpose of a video database management system (VDBMS) is to provide efficient and convenient user access to a video data collection. Such access normally includes query, retrieval, and browsing. The video data query and retrieval process typically involves the following steps. First, the user specifies a query using facilities provided by a GUI; this query is then processed and evaluated. The value or feature obtained is used to match and retrieve the video data stored in video database. In the end, the resulting video data is presented to the user in a suitable form. Video query is closely related to other aspects of VDBMS, such as video data indexing, since features used for indexing are also used to evaluate the query, and the query is usually processed by searching the indexing structure. Unlike alpha-numerical databases, video database browsing is critical due to the fact that a video database may contain thousands of video streams with great complexity of video data. It is also important to realize that video browsing and querying are very closely related to each other in the video databases. In a video database system, a user’s data access pattern is basically a loop of the ‘‘query-browse’’ process in which video queries and video browsing are intermingled. Playing video data can be thought of as the result of the process. User video browsing usually starts with a video query about certain subjects that the user is interested in. This makes the browsing more focused and efficient since browsing a list of all of the video streams in a video database is time and network resources consuming. Such initial queries can be very simple because the user isn’t familiar with the database content. On the other hand, a video query normally ends with the user browsing through the query results. This is due to the ambiguous nature of the video query; that is, it results in multiple video streams, some of which are not what the user wanted. Browsing is a efficient way of excluding unwanted results and examining the contents of possible candidates before requesting the playing of a video. Different Types of Queries Identifying different classes of user queries in a VDBMS is vital to the design of video query processing. The classification of the queries in a video database system can be done in many ways depending on intended applications and the data model they are based on, as well as other factors (3). A video query can be a semantic information query, meta information query, or audio–visual query. A semantic information query requires an understanding of the semantic content
of the video data. For an example, ‘‘find a video clip in which there is an angry man.’’ This is the most difficult type of query for a video database. It can be partially solved by semantic annotation of the video data, but its ultimate solution depends on the development of technologies such as computer vision, machine learning, and artificial intelligence (AI). A meta information query is a query about video meta data, such as who is the producer and what is the date of production. In most cases, this kind of query can be answered in a way that is similar to the conventional database queries. Meta data are usually inserted into the video database along with the corresponding video data by video annotation that is currently manually or semi-manually done off-line. An example of a query could be to find video directed by Alan Smithee and titled ‘‘2127: A Cenobite Space Odyssey.’’ This class also includes statistical queries, which are used to gather the information about video data without content analysis of the video. A typical example is to find the number of films in the database in which Tom Cruise has appeared. An audiovisual query depends on the audio and visual features of the video and usually doesn’t require understanding the video data. An example of such a query would be to find a video clip with dissolve scene change. In those queries, audio and visual feature analysis and computation, as well as the similarity measurement, are the key operations as compared to the textual queries in the conventional DBMS. Video queries can also based on the spatiotemporal characteristics of the video content as well. A video query can be about spatial, temporal, or both kind of information. An example of a spatial query is ‘‘retrieve all video clips with a sunset scene as background,’’ and the query ‘‘find all video clips with people running side by side’’ is a spatiotemporal query example. Depending on how a match is determined in the query evaluation, a video query can be classified as either an exact match-based query or a similarity match-based query. An exact match-based query is used to obtain an exact match of the data. One example is to find a CNN ‘‘Dollars and Sense’’ news clip from the morning of March 18, 1996. Similarity matchbased queries actually dominate the VCBMS because of the ambiguity nature of the video data. One example is ‘‘find a video clip that contains a scene that is similar to the given image.’’ A video query can have various query granularity which can be either video frames, clips, or streams. A frame-based query is aimed at individual frames of video data that are usually the atomic unit of the video database. A clip-based query is used to retrieve one or more subsets of video streams that are relatively independent in terms of their contents. A video stream-based query deals with complete video streams in the database. An example query is ‘‘find a video produced in 1996 that has Kurt Russell as the leading actor.’’ Queries can also be categorized according to query behavior. A deterministic query usually has very specific query conditions. In this case, the user has a clear idea what the expected result should be. A browsing query is used when a user is uncertain about his or her retrieval needs or is unfamiliar with the structures and types of information available in the video database. In such cases, the user may be interested in browsing the database rather than searching for a specific entity. 
The system should allow fuzzy queries to be formulated for this purpose.
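For the similarity match-based queries that dominate such systems, evaluation typically amounts to computing a feature vector from the query input (for instance, a color histogram of a supplied example frame) and ranking the stored index entries by their distance to it. The sketch below assumes an index of precomputed RFrame histograms and uses a simple L1 histogram distance; the feature choice, the distance measure, and the data layout are illustrative rather than those of any particular system described here.

```python
import numpy as np

def color_histogram(frame, bins=8):
    # Normalized joint RGB histogram used as the index feature (an assumed choice).
    hist, _ = np.histogramdd(frame.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3, range=((0, 255),) * 3)
    return hist.ravel() / hist.sum()

def similarity_query(query_frame, rframe_index, k=5):
    # rframe_index: list of (clip_id, histogram) pairs built off-line from the
    # representative frames (RFrames) of each shot; names are illustrative.
    q = color_histogram(query_frame)
    scored = sorted(((float(np.abs(q - h).sum()), clip_id)
                     for clip_id, h in rframe_index), key=lambda pair: pair[0])
    return [clip_id for _, clip_id in scored[:k]]
```

The resulting list of candidate clips would then be handed to the browsing interface, where the user can inspect the candidates before requesting playback, in keeping with the query-browse loop described above.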
There are many ways that a user can specify a video query. A direct query is defined by the user using values of features of certain frames, such as color, texture, and camera position. A query by example, which is also called query by pictorial example (QBPE) or Iconic Query (IQ), is very useful since visual information is very difficult to describe in words or numbers. The user can supply a sample frame image as well as other optional qualitative information as a query input. The system will return to the user a specified number of the best-match frames. The kind of query methodology is used in IBM’s QBIC system (70) and JACOB system (71). In an iterative query, the user uses a GUI to incrementally refine their queries until a satisfying result is obtained. The JACOB system is a practical example of this approach. Query Specification and Processing Video Query Language. Most textual query languages such as SQL have limited expressive power when it comes to specifying video database queries. The primary reason is that the visual, temporal, and spatial information of the video data can not be readily structured into fields and often has a variable-depth, complex, nested character. In a video database, queries about visual features can be specified, for examples, by using an iterative QBPE mechanism; and spatiotemporal queries can be expressed, for example, by TSQL or spatial temporal logic (STL). Queries dealing with the relationships of video intervals can be specified using a temporal query language like TSQL (TSQL 2, Applied TSQL2) (72,73). TSQL2 has been shown to be upward compatible with SQL-92 and can be viewed as an extension of SQL-92. However, not all SQL-92 relations can be generated by taking the time slices of TSQL2 relations, and not all SQL-92 queries have a counterpart in TSQL-92. The completeness and evaluation of the TSQL2 are discussed by Bohlen et al (72). STL (74) is proposed as a symbolic representation of video content, and it permits intentional ambiguity and detail refinement in the queries. Users can define a query through an iconic interface and create sample dynamic scenes reproducing the contents of the video to be retrieved. The sample scenes are then automatically translated and interpreted into STL assertions. The retrieval is carried out by checking the query STL assertions against the descriptions of every image sequence stored in the database. The description of a video sequence is used to define the object-centered spatial relationship between any pair of objects in every frame and created manually when the sequence is stored in the database. The VideoSTAR system uses a video query algebra (15,75) to define queries based on temporal relationships between video intervals. A GUI is developed to assist users interactively define queries with algebra operations include (a) normal set operations (AND, OR, and DIFFERENCE), (b) temporal set operations, (c) filter operations that are used to determine the temporal relationships between two intervals, (d) annotation operations that are used to retrieve all annotations of a given type and have nonempty intersections with a given input set, (e) structure operations that are similar to the above but on the structural components, and (f) mapping operations that map the elements in a given set onto different contexts that can be basic, primary, or video stream. VideoSQL is the video query language used in OVID (35), which allows users to retrieve video objects that satisfying certain
conditions through SELECT-FROM-WHERE clauses. Video SQL does not, however, contain language expressions for specifying temporal relations between video objects. Other Video Query Specifications. Despite its expressive power and formalism, defining and using certain video query language can often become very complex and computationally expensive. Some researchers simply combine important features of the video data to form and carry out queries. In these cases, the types of queries that can be defined and processed are usually limited. For an example, the MovEase system (76) includes motion information as one of the main features of the video data. Motion information, together with other video features (color, shape, object, position, and so on), is used to formulate a query. Objects and their motion information (path, speed) can be described through a GUI in which objects are represented as a set of thumbnail icon images. Object and camera motions can be specified by using either predefined generic terms like pan, zoom, up, down, or user input-defined motion descriptions, such as zigzag path. The query is then processed and matched against the preannotated video data stored. Results of the query are displayed as icons, and users can get meta information or the video represented by each icon image simply by clicking on it. Query Processing. Query processing usually involves query parsing, query evaluation, database index search, and the returning of results. In the query parsing step, the query condition or assertion is usually decomposed into the basic unit and then evaluated. After that, the index structure of video database is searched and checked. The video data are retrieved if the query assertion is satisfied (74) or if the similarity measurement (65) is maximum. The result video data are usually displayed by a GUI in a way convenient to the user [such as iconic images (76)]. One example is an on-line objectoriented query processing technique (77) which uses generalized n-ary operations for modeling both spatial and temporal contents of video frames. This enables a unified methodology for handling content-based spatial and spatiotemporal queries. In addition, the work devises a unified object-oriented interface for users with a heterogeneous view to specify queries. Another example is the VideoSTAR system (75) which parses the query and breaks it into basic algebraic operations. Then, a query plan is determined and many be used to optimize the query before it is computed. Finally, the resulting video objects are retrieved. VIDEO AUTHORING AND EDITING Digital video (DV) authoring usually consists of three steps: video capture and digitization, video editing, and final production. In the video capture step, raw video footage can be captured or recorded in either analog format or digital format. In the first case, the analog video needs to be digitized using a video capture board on the computer. The digital video is usually stored in a compressed format such as MPEG, MJPEG, DV, and so on. Analog recording, digitization, and editing using video capture cards and software tend to suffer from information loss during the conversion. However, this approach is very important since the majority of the existing video materials are on video tapes and films. According to an international survey (1), there are more than 6 million hours of feature films and video archived worldwide with a yearly
increase of about 10%. With the appearance of the DV camcorder, especially those with the Firewire interface, a superior digital video authoring and editing solution finally comes into reality. DV editing refers to the process of rearranging, assembling, and/or modifying raw video footage (or clips) obtained in the video capture step according to the project design. The raw video clips which may not come from the same resources can be trimmed, segmented, and assembled together on a time line in the video construction window. Possible edits also include transitions and filters, as well as many other operations such as title superimposition. Special effect transitions are commonly used for assembling video clips, which include various wipes and dissolves, 3-D vortex, page peel, and many others. Filters including video and audio filters can be used to change the visual appearance and sound of video clips. The examples of filters are Gaussian sharpen, ghosting, flip, hue, saturation, lightness, and mirror. During the video editing process, the user can preview the result in the software window on the computer screen or on an attached TV monitor. Digital video editing can be classified as linear or nonlinear, described in more detail later. In the final production step, the final editing results can be recorded back on a video tape or a CD. The final format of the video production depends on the intended application, for example, One should choose MPEG-1 video compression for CD application and MPEG-2 for TV quality video playback. In any case it is a good idea to keep the original DV tape or analog (Hi-8 or VHS) tape. Video capture board mentioned above is one of the key components of a video editing system and is responsible for digitizing analog video input into digital ones for desktop digital video editing. It is also widely used in other applications such as video conferencing. Some of the common or expected features of a video capture board are listed in the following. The actual features depend on each individual card and can make the card very expensive. • Real-time, full-screen (640 ⫻ 480 NTSC, 768 ⫻ 576 PAL), true color (24 bits), and full motion (30 frames/s, 25 frames/s PAL) capture and playback of NTSC, PAL, or SECAM analog video. • Analog output in NTSC, PAL, or SECAM in composite or S-video. This feature can be used to output the editing result back on to the video tape or preview the editing result on a TV monitor. • Support for multiple sampling rate audio data, along with the ability to record and play audio from voice grade to CD/DAT stereo quality. It also need to support the sychronization of the video and audio channels • Hardware support for video compression standards such as MJPEG, MPEG, and ITU H.261. It needs to also support audio compression standards (G.711, G.722, and G.728) and be compatible with QuickTime or AVI. • Software and developing tools for video editing and video conferencing, and so on. Linear Digital Video Editing Linear video editing systems are usually hardware-based and require edits to be made in a linear fashion. The concept behind linear editing is simple: The raw video footage which may be recorded on several tapes is transferred segment by segment from source machine(s) onto a tape in another video
recorder. In the process, the original segments can be trimmed and rearranged, unwanted shots can be removed, and audio and video effects can be added. An edit recorder, controlled by an editing controller, is used to control all of the machines and make final edit master. The edit controller can be used to shuttle tapes back and forth to locate the beginning and ending points of each needed video segment. These reference points are entered as either control track marks or time code numbers into the edit controller to make edits. There are two types of linear edits. Assemble editing allows video and audio segment to be added one after another, complete with their associated control track. However, the control track is difficult to record without any error during video edits. For example, any mis-timing during this mechanical process results in a glitch in the video. Insert editing requires a stable control track to be established first for stable playback. Video and audio segments can then be inserted over the prerecorded control track. Linear video editing is generally considered slow and inflexible. Although video and audio segments can be replaced within the given time constraints of the edited master, it is impossible to change the length of segments or insert shots into the edited master without starting all over again. This can be easily done with the more flexible and powerful nonlinear video editing. Despite its limitations, linear video editing is nonetheless an abandoned solution and it is still used even for DV editing (78) for a number of reasons. First, when editing long video programs, linear editing may actually save time when compared to the nonlinear editing. This is because, for example, there is no need to transfer video data back and forth between the video tapes and computer. Second, long digital video programs occupy a huge amount of disk space. A 1 h DV, for example, fills about 13 Gbyte-space. The file size constraints of the computer operating system limit the length of the video footage that can be placed on the disk and operated by nonlinear editors. So, the choice of linear or nonlinear editing is really application-dependent. The best solution may be a combination of both. Nonlinear Video Editing Nonlinear video editing (NLE) is sometimes called randomaccess video editing, which is made possible through digital video technologies. Large-capacity and high-speed disks are often used as the recording medium and video footage are stored in either compressed or uncompressed digital format. NLE supports random, accurate, and instant access to any video shot or frame in a video footage. It also allows the video segments to be inserted, deleted, cut, and moved around at any given point in the editing process. Nonlinear video editing supports a much wider range of special effects such as fades, dissolves, annotation, and scene-to-scene color corrections. It also supports many audio enhancement including audio filters and sound effects. Most NLE systems have multiple time lines to indicate the simultaneous presence of multiple audio and video sources. For example, one could have background music, the original sound track of the raw footage, and the voice of the narrator at the same time. One can instantly preview and make adjustments to the result at any give point of the NLE process. The video and audio segments can be clicked and dragged to be assembled on a designated time line. Video segments are often represented by thumbnail icons of its video frames with
adjustable temporal resolutions (one icon per 100 frames, for example). The results of nonlinear video editing can be converted into analog video signals and output back to a video tape, or stored in any given digital video format. Digital Video Camcorder and Digital Video Editing Using a DV camcorder, video is digitally captured and compressed into DV format before it is recorded onto the DV tape. There are two ways that the DV footage can be edited. One can still connect the analog output of the DV camcorder/VCR to the video capture board on the computer and edit the video as previously discussed. However, this approach is not recommended due to the quality loss in A/D (analog-to-digital) conversions and lossy codecs used in DV equipment. True endto-end high-quality digital video editing can be done using DV equipment (VCR or camcorder) and the Firewire (IEEE 1394). A single Firewire cable can carry all the DV data between DV devices and the computer including video, audio, and device control signals. It eliminates multiple cables required in the traditional digital video authoring and editing system. Sony first introduced the DV camcorders with the Firewire connector. This approach has no generation loss and is not necessarily more expensive than the first method. One may need to purchase a Firewire interface board; however, its price may be cheaper than many video capture boards. Firewire—IEEE 1394. Firewire (79), officially known as IEEE 1394, is a high-performance digital serial interface standard. Originated by Apple for desktop local area networks (LANs), it was later developed and approved in December 1995 by IEEE. IEEE 1394 supports data transfer rates of 12.5, 25, 50, 100, 200, and 400 Mbit/s which can easily meet the requirements of DV data transportation or even uncompressed digital video data at 250 Mbit/s. Data rate over 1 Gbit/s is under design. Other key advantages of IEEE 1394 include the following: • It is supported by 1394 Trade Association which has over 40 companies including Apple, IBM, Sun, Microsoft, Sony, and Texas Instruments. For example, Apple is the first to support Firewire in its operating system (Mac OS 7.6 and up) and provide Firewire API 1.0 in Mac OS 8.0. • It is a digital interface; there is no A/D conversion and data integrity loss. • It is physically small (thin serial cable), easy to use (no need for terminator and device ID, etc.), and hot pluggable. Hot pluggable means that 1394 devices can be added to or removed from the IEEE 1394 bus at any time, even when the bus in full operation. • It has scalable architecture which allows for the mixture of data rates on a single bus. • It has flexible topology which supports daisy chaining and branching for true peer-to-peer communication. Peer-to-peer communication allows direct dubbing from one camcorder to another as well as sharing a camcorder among multiple computers. • It supports asynchronous data transport which provides connectivity between computers and peripherals such as printers and modems and provides command and control for new devices such as DV camcorders. • It also supports isochronous data transport guarantees delivery of multiple time-critical multimedia data
streams at predetermined rates. Such just-in-time data delivery also eliminates the need for costly buffering. The current standard allows Firewire cable up to 4.5 m per hop; but with repeaters or bridges, over 1000 bus segments can be connected and thus can reach thousands of meters. Each firewire cable contains two power conductors and two twisted pairs for data signaling. Signal pairs are shielded separately; additionally, the entire cable is also shielded. The Firewire cable power is specified to be from 8 V dc to 40 V dc at up to 1.5 A. It is used to maintain a device’s physical layer continuity when the device is powered down or malfunctions and provide power for the devices connected to the bus. However, some manufacturers may have sightly different cables; for example, the Sony camcorder Firewire cable only has four wires with two power wires removed. Firewire is widely used for attaching DV camcorders to computers and as a high-performance, cost-effective digital interface for many other audio/video applications such as digital TV and Multimedia CDROM (MMCD). IEEE 1394 has been accepted as the standard digital interface by the Digital VCR Conference (DVC) and has been endorsed by European Digital Video Broadcasters (DVB) as their digital TV interface as well. The EIA (Electronic Industries Association) has also approved IEEE 1394 has the point-to-point interface for digital TV and the multipoint interface for entertainment systems. In the future, IEEE 1394, as a high-speed, low-cost, and user-friendly interface, is expected to improve existing interfaces such as SCSI. In fact, the American National Standards Institute (ANSI) has already defined Serial Bus Protocol (SBP) to encapsulate SCSI-3 for IEEE 1394. Various DV Video Format. DV is a digital video format (80) developed by DVC and adopted by over 50 manufacturers including Sony, Panasonic, JVC, Philips, Toshiba, Hitachi, Sharp, Thomson, Sanyo, and Mitsubishi. The DV specification was approved in September 1993 and is intended primarily for prosumer, eventually consumer applications. The DV format offer two tape cassette sizes: the standard 4 h (125 mm ⫻ 78 mm ⫻ 14.6 mm) and the mini 1 h (66 mm ⫻ 48 mm ⫻ 12.2 mm). Most of the DV VCRs will play both. The DV video compression algorithm is DCT-based and very similar to that of MPEG and MJPEG. First, RGB video is converted to a YUV digital component video. The luminance signal (Y) is sampled at 13.5 MHz, which provides a 5.75 luminance bandwidth for both the NTSC and PAL systems. For NTSC video, the R–Y (U) and B–Y (V) color difference signals are digitized at 3.375 MHz sampling rate, which provides a 1.5 MHz bandwidth for each chroma component. The result is 4 : 1 : 1 digital video. The PAL DV system samples each chroma component at 6.775 MHz yields there by a 3.0 MHz bandwidth per chroma component. However, PAL DV uses a 4 : 2 : 0 sampling schema that yields only half of the vertical chroma resolution of the NTSC DV format. Before compression, digital video frames are stored in a 720 ⫻ 480 pixel buffer where the correlation between two fields are measured. Two fields are compressed together unless the correlation is low, which indicates too much interfield motion. Each DCT macroblock consisting of four 8 ⫻ 8 blocks has its own quantization table (Q-table), which enables dynamic intraframe compression. The DV formation has a standard set of Q-tables. DV video compression ratio is 5 : 1. DV provides two digital audio record modes: 16-bit and 12-bit.
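As a rough consistency check, the luminance and chrominance sampling figures above, the 720 × 480 frame buffer, and the 5 : 1 compression ratio together imply the roughly 25 Mbit/s video data rate quoted for the DV family later in this section. The short calculation below assumes 8-bit samples and 30 frames/s NTSC.

```python
# Rough check of the DV video data rate implied by the figures above
# (assumes 8-bit samples, 30 frames/s NTSC, 4:1:1 chroma subsampling).
luma_bits_per_s = 720 * 480 * 8 * 30                # about 82.9 Mbit/s of luminance
chroma_bits_per_s = 2 * (720 // 4) * 480 * 8 * 30   # two chroma components at 1/4 horizontal resolution
raw_rate = luma_bits_per_s + chroma_bits_per_s      # about 124 Mbit/s uncompressed
dv_video_rate = raw_rate / 5                        # 5:1 DV compression, roughly 25 Mbit/s
print(round(raw_rate / 1e6), round(dv_video_rate / 1e6))   # prints: 124 25
```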
The 16-bit mode uses a sampling frequency of 48 kHz and 12bit mode operates at 32 kHz. DV format uses Reed–Solomon error correction on the buffered video data to prevent frame loss. Each DV track consists of four sectors: subcode, video, audio, and ITL. Subcode sector records timecode, an index ID for quick searches for specific scenes, and the PP-ID for Photo Mode recording and playback. Video sector records not only the video data but also the auxiliary data such as data and time, focus mode, AE-mode, shutter speed F-stop, and gain setting. ITI sector stores data for the DV device itself, such as tracking signal for audio dubbing. The separation of audio and video signals makes videoonly insert editing possible. DVCPRO is a professional variant of the DV by Panasonic. The main differences are the doubled tape speed needed for dropout tolerance and general recording robustness. It is also capable of 4⫻ normal speed playback which can be used to accelerate data transfer. DVCAM is Sony’s DV variation. DV and DVCAM uses 4 : 2 : 0, and DVCPRO uses 4 : 1 : 1 sampling rates for PAL. They all use 4 : 1 : 1 for NTSC and have a data rate of 25 Mbps. Panasonic also has DVCPRO-50 for the studio-quality video. Unlike DV, DVCPRO, and DVCAM which sample at 4 : 1 : 1, DVCPRO-50 provides a 4 : 2 : 2 sampling which is consistent with ITU-R BT.601-4 (CCIR-601) digital video standard. Such a sampling rate is sometimes preferred since it provides more color information and better compositing. The data rate of DVCPRO-50 is 50 Mbps, which is twice that of DV, and it supports lightly compressed picture (3.3 : 1) with a high signal-to-noise ratio. JVC’s Digital-S is another 50 Mbps video format. Together, they are known as DV422 and are compatible with each other. The 4 : 1 : 1 DV tapes can be played on the CV422 decks which can bump the output to 4 : 2 : 2 for post-production uses. Another advantage of DV422 is that it is closer to the MPEG-2 standard which samples at 4 : 2 : 0. Sony’s Betacam SX is yet another DV video format targeted at professional market. Betacam SX is similar to MPEG-2 and uses adaptive quantization and MPEG-2’s IB frame (IBIB. . .) compression to achieve a constant data rate of 18 Mbps with 4 : 2 : 2 sampling. Betacam SX thus has a higher compression ratio of 9.25 : 1. DV Board. DV boards are sometimes referred to as Firewire interface boards. This is because the Firewire interface is the most important component on the board since it enables the fast DV data transmission between the computer and DV equipment. Besides Firewire interface, a DV board usually contains the following: • DV codecs. Some DV boards come with software codecs that use the computer processor to decompress DV files for preview and editing. Software codecs are cost-effective, flexible, and easy to be upgraded. Other DV boards have a DV codec chip which frees the CPU from the compression/decompression procession and can be fast enough for full-motion, real-time playback. However, they are also much more expensive. Notice that a software DV codec can also make use of the hardware codec in the DV equipment connecting to the DV board. • Analog video/audio I/O ports. They are especially useful for previewing the DV on an analog TV monitor and mixing the analog video footage with DV files or converting analog video footage to DV format. The DV board may
also contain a chip that can compress the analog video to the DV or MPJEG digital video format. In this case, the DV board functions like the video capture board previously described. It usually supports full resolution, true color, and real-time compression of analog video. • Additional Firewire ports for connecting other Firewire peripherals such as a printer. They can also be used for synchronized video/audio playback and VCR control with time code for accurate video editing. End-to-End Digital Video Editing Using DV and Firewire. Editing DV with Firewire (78,81) requires DV equipment such as DV camcorder or DVCR with Firewire I/O port. A Firewire cable is used to connect the equipment to the computer which has a DV board with Firewire interface. Editing is done by using a nonlinear video editing software such as Adobe Premiere. The computer needs to have sufficient processor power and large amount of disk space. The data rate of DV is usually 3.7 Mbit/s, which means 222 Mbyte space per minute and 20 Gbyte for a 90 min DV footage. The disk drive also needs to be fast enough to accommodate the steady DV stream of 3.7 Mbit/s. Digital video authoring and editing using DV with Firewire consists of the following steps: Step 1. Shoot the video footage using a DV camcorder. As the video is being shot, it is compressed by the DV codec chips in the DV camcorder and recorded digitally on a DV tape which can also be played by the DVCR. Step 2. DV footage can be then transferred into the computer and stored on a hard disk through the Firewire which is connected to the Firewire interface of computer’s DV board. During the transferring process, the DV data is usually encapsulated into certain multimedia systems such as AVI or QuickTime. DV codecs are not involved during the transfer. Step 3. Video editing software such as Adobe Premiere can be used to work with DV data which are now encapsulated in some multimedia system format. Notice that the DV data in the computer so far are identical to what is on the DV tape; that is, no information is lost. The DV codec is only for decompressing the DV data when filters and/or transitions are to be added. Otherwise, the DV data are simply copied to the target file. The DV codec can be software or hardware on the DV board. During the editing process, the video can be previewed either on the computer screen or on a monitor. Monitor preview is usually supported by the analog port on DV board, DV camcorder, or DVCR. Such a Firewire interface board needs to have a DV codec hardware which increases the cost considerably. Step 4. After all the edits are done, the resulting DV file can be transferred back to the DV equipment via Firewire. It is obvious that the whole editing process has no generation loss. The result can also be transcoded into other digital formats such as MPEG, or outputs to Hi-8 or VHS tapes. The latter can be done through the analog I/O port of the DV board, DV camcorder, or DVCR. DV footage can be easily mixed with analog footage during above NLE process. If the DV camcorder or DVCR has analog input, the analog footage can be transferred into the DV
camcorder and then digitized, compressed, and recorded in the DV format. Another way is to make use of the analog I/O port on a DV board with a hardware DV codec. Such a board is capable of converting analog video to DV, but costs more. The third approach is to use a video capture board to digitize the analog video into MJPEG digital video clips. Such video clips can then be transcoded by video editing software into the DV format when a DV codec is present.

The advantages of video authoring and editing using DV and Firewire are obvious. First, the video is of high quality and free of noise. Experiments (78) show that DV video still has high quality (better than Betacam SP video digitized at the highest quality) even after being decompressed and recompressed 10 times. Second, the high-quality video also tends to compress better and is more tolerant of lower data rates. Third, DV has a steady data rate of 3.7 Mbyte/s, which is easier to handle and results in good playback. This approach is also cost-effective since the codec is hardwired inside the DV camcorder and DVCR. There is simply no need for an expensive video capture board.

VIDEO CONFERENCING

Video conferencing refers to interactive digital video and audio communication between a group of parties, who may be remotely located, through the use of computers over computer networks (82). Video conferencing is generally considered one type of data conferencing, which also includes text, graphics, and so on. Video conferencing has many important applications, such as tele-medicine and distance learning. Video conferencing requires real-time capture, sampling, coding, and transmission of both audio and video. Compression is critical to video conferencing due to the huge data volume involved. For example, an uncompressed full-motion CIF-size video stream needs a bandwidth of 30 frame/s × (352 × 288) pixel/frame × 8 bit/pixel = 24 Mbit/s. Some important video codecs are described in the section entitled ''Video Codecs.''

The analog audio signal is usually sampled at a rate ranging from 8 kHz to 48 kHz. This is based on the Nyquist theorem, since the human hearing range is 20 Hz to 20 kHz and the human voice ranges from 40 Hz to 4 kHz. Sampled values are then quantized into a number of discrete levels (256 for an 8-bit representation, or 65536 for a 16-bit representation) and then coded using the following methods:

• PCM (pulse code modulation), which includes uniform PCM, mu-law PCM, and A-law PCM. Uniform PCM uses equally spaced quantizer values and is an uncompressed audio encoding. Mu-law and A-law PCM use logarithmic quantizer step spacing, which can represent a larger value range using the same number of bits. Mu-law and A-law PCM can achieve a compression ratio of 1.75:1, and they are formally defined in ITU-T Recommendation G.711.

• ADPCM (adaptive differential pulse code modulation) encodes the difference between each sample and its predicted value based on the previous sample. The quantizing and prediction parameters of ADPCM are adaptive to the signal characteristics, and ADPCM can typically achieve a compression ratio of 2:1. There are several ITU-T recommendations which specify different ADPCM
audio encoding algorithms, including G.721, G.722, G.723, G.726, and G.727.

Unipoint and Multicast Video Conferencing

Video conferencing can be categorized in several ways. Depending on the number of parties involved, a video conference can be either point-to-point or multipoint. Point-to-point (circuit-switched) or unicast (packet-switched) video conferencing is the simplest form of video conference and involves only two sites. Both parties of a point-to-point conference must use the same video/audio coding algorithms and operate at the same speed. A multipoint (circuit-switched) or multicast (packet-switched) video conference involves multiple parties. In circuit-based multipoint video conferencing, each party talks to an MCU (multipoint control unit). For packet-based video conferences, a somewhat analogous software tool called the MSB (multisession bridge) is needed. The MCU uses the following methods to switch between the video conferencing participants:

1. Polling. The MCU switches between participants at a certain time interval.
2. Voice-Activated Switching or Picture Follows Voice. The MCU switches to the participant who has the highest audio level.
3. Continuous Presence. The MCU divides the window into several subwindows, one for each participant.
4. Chair Control. The MCU always presents the picture of the participant who is designated as the chair of the conference.

ITU-T Recommendation H.231 is a standard that covers MCUs and defines how several H.320-compatible video conferencing systems can be linked together. H.243 defines the MCU protocols. In multipoint video conferencing, all codecs must be mutually compatible, and the MCU must be compatible with the codecs. The video conference operates at the smallest frame size (FCIF or QCIF) and the lowest bandwidth of any of the nodes and node-MCU links.

Packet-Switched and Circuit-Switched Video Conferencing

Video conferencing can also be distinguished by the way the data are transmitted over the network: packet-switched or circuit-switched.

Packet-Switched Video Conferencing. Packet-switched communication is a method of data transfer where the information is divided into packets, each of which has an identification and a destination address. Packets are sent individually through a network and, depending on network conditions, may take different routes and arrive at their destination at different times and out of order. Unlike circuit-switched communication, bandwidth must be shared with others on the same network. In packet-switched video conferencing, the data can be transmitted over the Internet (e.g., using the MBONE, or Multicast BackbONE). The general bandwidth requirement is 192 kbit/s, in which 128 kbit/s is for video and 64 kbit/s is for audio. An advantage of packet-switched communication for video conferencing is the capability to more easily accommodate multipoint conferences. A disadvantage is the unpredictable timing of data delivery, which can cause problems for delay-
sensitive data types such as voice and video. Video packets that are received out of order may have to be discarded. Audio packets can be buffered at the receiver, reordered, and played out at a constant rate; however, this induces a delay which can be detrimental to interactive communication.

Circuit-Switched Video Conferencing. Circuit-switched communication is a method of data transfer where a path of communication is established and reserved for the duration of the session. A dedicated amount of bandwidth is allocated for exclusive use during the session. When the session is completed, the bandwidth is freed and becomes available for other sessions. Advantages of circuit-based communication for video conferencing include the availability of dedicated bandwidth and the predictability of data delivery. A disadvantage is that the session is primarily point-to-point and requires expensive MCUs to accommodate multipoint conferences. Also, the dedicated bandwidth tends to be wasted during periods of limited activity in a conference session. The general bandwidth requirement for circuit-based video conferencing over the POTN (Plain Old Telephone Network) is 128 kbit/s (video: 108 kbit/s, audio: 16 kbit/s, overhead: 4 kbit/s).

Video Conferencing Over Various Networks

Video conferencing can be classified based on the communication network it uses.

POTS-Based Video Conferencing. POTS (Plain Old Telephone Service) is the basic telephone service that provides access to the POTN. POTS is widely available but has very low bandwidth (the total bandwidth of a V.34 modem is only 33.6 kbit/s). ITU-T Recommendation H.324 is an interoperability standard for video conferencing operating over V.34 modems (33.6 kbit/s). H.324 uses H.263 for video encoding and G.723 as the audio codec (please refer to the section entitled ''H.323'').

ISDN-Based Video Conferencing. ISDN (integrated services digital network) is a digital service over the public switched network. ISDN has two access rates: the basic rate interface (BRI) and the primary rate interface (PRI). BRI provides two data channels of 64 kbit/s (B-channels) and one signaling channel of 16 kbit/s (D-channel). ISDN PRI provides 23 or 30 B-channels of 64 kbit/s and one D-channel of 64 kbit/s, but is much more expensive. ITU-T H.320 is the interoperability standard for ISDN-based video conferencing. It uses H.261 as the video codec and G.711 and G.728 as audio codecs (please refer to the section entitled ''H.320'').

B-ISDN-Based Video Conferencing. B-ISDN (broadband ISDN) is the high-speed, broadband extension of ISDN. It is a concept as well as a set of services and developing standards for integrating digital transmission services over a broadband network of fiber-optic and radio media. B-ISDN provides bandwidths ranging from 2 Mbit/s to 155 Mbit/s and up. It uses a fast cell-switching protocol called Asynchronous Transfer Mode (ATM) (83,84) as the underlying data link layer protocol. ATM has many advantages for video conferencing: (a) high bandwidth available instantly on demand; (b) greater efficiency than circuit switching, with statistical multiplexing that can combine many virtual circuits into one physical channel; (c) low cell delay variation, which is good for real-time video and audio; and (d) high resilience with dynamic alternative routing. B-ISDN can also be used to interconnect LANs to provide wide-area video conferencing. The Integrated Services
Working Group of the IETF developed a best-effort, real-time Internet service model which includes RTP (Real-Time Transport Protocol), RSVP (Resource Reservation Protocol), and RTCP (Real-Time Control Protocol). The interconnected LANs need to have these protocols and must be able to interwork with B-ISDN's access protocols such as AAL5. ITU-T Recommendations H.321 and H.310 are the interoperability standards for B-ISDN-based video conferencing. H.321 (Adaptation of H.320 Visual Telephone Terminals to B-ISDN Environments, adopted in March 1996) describes technical specifications for adapting narrow-band visual telephone terminals defined by H.320 to B-ISDN. H.310 (Broadband Audiovisual Communication Systems and Terminals, adopted in November 1996) specifies technical requirements for both unidirectional and bidirectional broadband audiovisual systems and terminals. With such high bandwidth, B-ISDN video conferencing uses MPEG-2/H.261 as the video codec and MPEG-1/MPEG-2/ITU G series for audio coding. Therefore, B-ISDN video conferencing can achieve very high video and audio quality. B-ISDN and ATM show great promise for video conferencing applications, but their deployment is currently limited.

LAN-Based Video Conferencing. The physical layer of LANs (local area networks) usually consists of 10 Mbit/s Ethernet, 100 Mbit/s Fast Ethernet, or 4 or 16 Mbit/s Token Ring segments. With much more bandwidth available than ISDN, LAN video conferencing can achieve picture quality similar to that of television. However, bandwidth management and scalability for a large number of users become a problem, since the network bandwidth is shared among all the participants and users in a LAN. H.323 is the ITU-T recommendation for LAN-based video conferencing. It defines terminals, equipment, and services for multimedia conferencing over a network without a Quality-of-Service (QoS) guarantee, such as a LAN. LAN-based video conferencing can also use UDP and RTP for point-to-point transmission of real-time video and audio, and RSVP, which works together with RTP. RSVP allows the router to reserve bandwidth for the smooth transmission of time-sensitive data such as video and audio.

Internet-Based Video Conferencing. The Internet uses IP (Internet Protocol) and two transport layer protocols: TCP and UDP. TCP (Transmission Control Protocol) provides a reliable end-to-end service by using error recovery and reordering. UDP (User Datagram Protocol) is an unreliable service without error recovery capability (83). Internet video conferencing applications primarily use UDP for video and audio data transmission. TCP is not practical because of its error recovery mechanism. If lost packets were retransmitted, they would arrive too late to be of any use. TCP is used by video conferencing applications for other non-time-sensitive data such as whiteboard data and shared application data. Notice that UDP is an unreliable transport protocol; in other words, packets may be lost, duplicated, delayed, or out of order. All these may not be a problem for highly reliable and low-delay LANs, but they will cause serious problems for wide-area Internet video conferencing. The above challenges of transmitting video and audio over the Internet have led to the development of a new transport protocol called the Real-Time Transport Protocol (RTP), proposed by the IETF-AVT (Audio/Video Transport Working Group). RTP (RFC 1889) provides support for sequencing, time stamps,
and QoS feedback. RTP is used in ITU-T Recommendation H.323. Most of the commonly used MBONE tools as well as video conferencing products on the market have implemented some version of RTP.

MBONE-Based Video Conferencing. MBONE (Multicast BackbONE) (85,86) is a virtual network that sits on top of the Internet and uses software multicast routers. Using the MBONE, it is possible to transmit video, audio, and other data in real time to multiple destinations throughout the global Internet. MBONE originated from the first two experiments to multicast live audio and video from meetings of the IETF (Internet Engineering Task Force) to other sites. Multicast has been implemented over LANs such as Ethernet and Fiber Distributed Data Interface (FDDI), and an Internet extension was defined in RFC 1112 in 1989 (87). Basically, the MBONE consists of ''islands'' supporting IP multicast, such as multicast LANs like Ethernet, connected by point-to-point links called ''tunnels.'' With IP multicast, data are transmitted to a host group (83,87) which includes all the participating hosts. Each host group is specified by a class D IP address in the range of 224.0.0.0 to 239.255.255.255. Multicast routers are responsible for delivering the sender's data to all receivers in the destination group. The Internet Group Management Protocol (IGMP) is used by multicast routers to determine what groups are active on a particular subnet. There are several routing protocols that multicast routers can use to efficiently route the data packets, including the Distance Vector Multicast Routing Protocol (DVMRP), Multicast Open Shortest Path First (MOSPF), and Protocol-Independent Multicast (PIM). If a router is not equipped with these routing protocols, it can use the tunneling technique, which means encapsulating the multicast packet inside a regular IP packet and setting the destination to another multicast router. Most major router vendors now support IP multicast.

Interoperability Standards

Interoperability standards are required for the video conferencing products from different vendors to work together. There are several organizations, including the ITU (International Telecommunication Union), IMTC (International Multimedia Teleconferencing Consortium), and PCWG (Personal Conferencing Working Group), that are working toward promoting and producing standards for video conferencing. Many standards have been proposed, such as the ITU-T G series standards for audio coding, H.261/H.263 for video coding, H.221/H.223 for multiplexing, and so on. The core standards of video conferencing are the ITU-T T.120, H.320, H.323, and H.324 series of standards. T.120 addresses real-time data conferencing, H.320 is for ISDN video conferencing, H.323 addresses video conferencing over LANs without a QoS guarantee, and H.324 is for low-bit-rate multimedia communication over the telephone network using V.34 modems. We are going to concentrate on these four major standards in the following.

T.120. (Data Protocols for Multimedia Conferencing) is a series of ITU-T recommendations for multipoint data communication services in a multimedia conferencing environment. It was adopted in July 1996 and has been committed to by over 100 key international vendors in-
cluding Microsoft, Apple, Intel, IBM, Cisco, MCI, AT&T, and so on. T.120 defines a hierarchical structure (Fig. 6) with defined protocols and service definitions between the layers (88). T.122 and T.125 define a connection-oriented service that is independent of the T.123 transport stacks operating below it. The lower-level layers (T.122, T.123, T.124, and T.125) specify an application-independent mechanism for providing multipoint data communication services. The upper-level layers (T.126 and T.127) define protocols for specific conferencing applications, such as shared whiteboarding and multipoint file transfer. T.120 covers the document (file and graphics) sharing portion of a multimedia teleconference and can be used within H.320, H.323, and H.324 or by itself. Other T.120 series recommendations are summarized as follows.

T.121. (generic application template), which was adopted in July 1996, provides guidance for application and application protocol developers on the correct and effective use of the T.120 infrastructure. It supplies a generic model for an application that communicates using T.120 services and defines a Generic Application Template specifying the use of T.122 and T.124 services.

T.122. (multipoint communication service for audiographics and audiovisual conferencing service definition) was adopted in March 1993. It defines network-connection-independent services, including multipoint data delivery (to all or a subset of a group), uniformly sequenced data reception at all users, resource control by applications using a token mechanism, and multiapplication signaling and synchronization.

T.123. (network-specific data protocol stacks for multimedia conferencing) was adopted in October 1996. The networks currently include ISDN, CSDN, PSDN, B-ISDN, and LANs. The communication profiles specified provide reliable point-to-point connections between a terminal and an MCU, between a pair of terminals, or between MCUs.

T.124. (generic conference control), which was adopted in August 1995, provides a high-level framework for conference management and control of multimedia terminals and MCUs. It includes Generic Conference Control (GCC) and other miscellaneous functions, including conference security.

T.125. (multipoint communication service protocol specification) was adopted in April 1994 and specifies a protocol to implement the Multipoint Communication Service (MCS) defined by T.122.

T.126. (multipoint still image and annotation protocol), which was adopted in August 1995, supports multipoint exchanges of still images, annotations, pointers, and remote events. The protocol conforms to the conference conductship model defined in T.124 and uses services provided by T.122 (MCS) and T.124 (GCC). T.126 includes components for creating and referencing archived images with associated annotations.

T.127. (multipoint binary file transfer protocol) was adopted in August 1995. It defines a protocol to support the interchange of binary files within an interactive con-
[Figure 6. Architecture of the ITU-T T.120 series recommendations: network-specific transport protocols (T.123) at the bottom; above them the multipoint communication service (MCS, T.122/T.125) and generic conference control (GCC, T.124); the generic application template (GAT, T.121); and, at the top, the node controller and applications using ITU-T standard application protocols such as still image exchange (T.126) and multipoint file transfer (T.127), possibly alongside nonstandard application protocols.]
ferencing or group working environment where the T.120 recommendation series is used. T.127 supports simultaneous distribution of multiple files, selective distribution of files to a subset of participants, and retrieval of files from a remote site.
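The layering in Fig. 6 can be captured in a few lines of code. The sketch below is purely illustrative: the layer names and recommendation numbers come from the figure and the descriptions above, while the list-based representation and the helper function are our own convenience and not part of any T.120 specification.

```python
# Illustrative model of the T.120 hierarchy (Fig. 6), ordered bottom-up.
# Layer names and recommendation numbers are taken from the text above;
# the data structure itself is just a convenient representation.
T120_STACK = [
    ("Network-specific transport protocols", ["T.123"]),
    ("Multipoint communication service (MCS)", ["T.122", "T.125"]),
    ("Generic conference control (GCC)", ["T.124"]),
    ("Generic application template (GAT)", ["T.121"]),
    ("Standard application protocols", ["T.126", "T.127"]),
]

def layers_below(recommendation):
    """Return the recommendations beneath the layer holding `recommendation`."""
    below = []
    for _, recs in T120_STACK:
        if recommendation in recs:
            return below
        below.extend(recs)
    raise ValueError(recommendation + " is not part of this sketch")

# A T.127 multipoint file transfer ultimately rides on everything beneath it.
print(layers_below("T.127"))  # ['T.123', 'T.122', 'T.125', 'T.124', 'T.121']
```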
H.320. H.320 (Narrow-Band Visual Telephone System and Terminal Equipment) is an ITU recommendation adopted in March 1996. Narrow-band refers to bit rates ranging from 64 kbit/s to 1920 kbit/s (64 kbit/s × 30). H.320 specifies video conferencing over circuit-switched networks like ISDN and
Table 4. ITU H.320 Recommendations
Video codec:
  H.261: Video codec for audiovisual service at p × 64 kbit/s. Please refer to the section entitled ''H.261.''
Audio codec:
  G.711: PCM (pulse code modulation) of voice frequencies; 8 kHz, 8-bit encoding, requiring 64 kbit/s of bandwidth.
  G.722: 7 kHz audio coding within 64 kbit/s.
  G.728: Coding of speech at 16 kbit/s using low-delay code-excited linear prediction.
Frame structure:
  H.221: Frame structure for a 64 kbit/s to 1920 kbit/s channel in audiovisual teleservices. It supports a variety of data rates from 300 bit/s up to 2 Mbit/s. H.221 uses double error correction for secure transmission and can be used in multipoint configurations. It allows the synchronization of multiple 64 kbit/s or 384 kbit/s connections and the control of the multiplexing of audio, video, data, and other signals within the synchronized multiconnection structure in the case of multimedia services such as video conferencing.
Control and indication:
  H.230: Frame-synchronous control and indication signals for audiovisual systems.
Communication procedure:
  H.242: System for establishing communication between audiovisual terminals using digital channels up to 2 Mbit/s. This recommendation describes all the point-to-point procedures involving the BAS codes in each frame of the control channel within the multiplexing structure specified in H.221.
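Earlier in this section, mu-law PCM was described as using logarithmic quantizer step spacing, and Table 4 lists G.711 among H.320's audio codecs. The sketch below illustrates only the continuous companding curve behind that idea; it is not the G.711 encoder itself, which specifies a segmented 8-bit format, and the sample values printed are arbitrary examples.

```python
import math

MU = 255  # mu-law parameter commonly associated with G.711 companding

def mu_law_compress(x):
    """Map a normalized sample x in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse mapping: recover an approximation of the original sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Small samples get relatively finer resolution than large ones:
for x in (0.01, 0.1, 0.5, 1.0):
    print(f"x={x:4.2f}  companded={mu_law_compress(x):5.3f}")
```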
Table 5. ITU-T H.323 Recommendations
Video codec:
  H.261: Video codec for audiovisual service at p × 64 kbit/s. Please refer to the section entitled ''H.261.''
  H.263: Video coding for low-bit-rate communication. Please refer to the section entitled ''H.263.''
Audio codec:
  G.711: Pulse code modulation (PCM) of voice frequencies.
  G.722: 7 kHz audio coding within 64 kbit/s (48, 56, and 64 kbit/s).
  G.723: Dual-rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s.
  G.728: Coding of speech at 16 kbit/s using low-delay code-excited linear prediction (3.1 kHz).
  G.729: Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP).
Control:
  H.245: Control protocol for multimedia communication. H.245 defines the syntax and semantics of terminal information messages and the procedures for in-band negotiation at the beginning of and during communication. The messages include receiving and transmitting capabilities as well as mode preferences from the receiving end, logical channel signaling, and control and indication. Acknowledged signaling procedures are specified for reliable audiovisual data communication.
Packetization and synchronization:
  H.225: Media stream packetization and synchronization for nonguaranteed quality-of-service LANs. H.225 specifies messages for call control including signaling, registration, and admissions, as well as packetization and synchronization of media streams.
Switched 56. H.320 was designed primarily for ISDN, as an ISDN BRI offers two 64 kbit/s data channels (B-channels) for video conferencing. An H.320 video conferencing system can also work over three ISDN BRI lines (6 B-channels, or 384 kbit/s), which are combined using an inverse multiplexer (IMUX). This yields better picture quality, since more bandwidth is allocated for video data, but it costs a lot more. H.320 includes a series of recommendations, which are summarized in Table 4.
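The p × 64 kbit/s structure makes the bandwidth arithmetic easy to reproduce. The short sketch below is illustrative only: the B-channel rate, the audio rates from Table 4, and the 64 kbit/s to 1920 kbit/s narrow-band range come from the text above, while the helper functions (and the choice to ignore framing overhead) are our own simplification.

```python
# H.320 bandwidth arithmetic (illustrative). B-channel and audio rates are
# the figures quoted in the text and Table 4; the functions are just helpers.
B_CHANNEL_KBITPS = 64
NARROWBAND_KBITPS = (64, 1920)                           # H.320's p x 64 range
AUDIO_KBITPS = {"G.711": 64, "G.722": 64, "G.728": 16}   # from Table 4

def aggregate_rate(bri_lines, b_channels_per_bri=2):
    """Total rate when an IMUX combines the B-channels of several BRI lines."""
    rate = bri_lines * b_channels_per_bri * B_CHANNEL_KBITPS
    low, high = NARROWBAND_KBITPS
    assert low <= rate <= high, "outside H.320's narrow-band range"
    return rate

def video_share(total_kbitps, audio_codec):
    """Rough share left for H.261 video once the audio codec is chosen
    (H.221 framing overhead is ignored here for simplicity)."""
    return total_kbitps - AUDIO_KBITPS[audio_codec]

print(aggregate_rate(1), video_share(aggregate_rate(1), "G.728"))  # 128 112
print(aggregate_rate(3))                                           # 384
```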
H.323. H.323 (Visual Telephone Systems and Equipment for Local Area Networks Which Provide a Nonguaranteed Quality of Service) is a series of recommendations by the ITU-T adopted in November 1996. H.323 extends H.320 to incorporate intranets, LANs, and other packet-switched networks. It describes terminals, equipment, and services for multimedia communication over LANs which do not provide a guaranteed quality of service. H.323 terminals and equipment may carry video, audio, data, or any combination, including videotelephony, and support for voice is mandatory. They may interwork with H.310/H.321 terminals on B-ISDN, H.320 terminals on N-ISDN, H.322 terminals on Guaranteed Quality
of Service LANs, H.324 terminals on the GSTN, and wireless networks. The H.323 series recommendations are summarized in Table 5.

H.324. H.324 (Terminal for Low-Bit-Rate Multimedia Communication) is a series of recommendations by the ITU-T adopted in March 1996. H.324 describes terminals for low-bit-rate multimedia communication over V.34 modems (total bandwidth of 33.6 kbit/s) on the General Switched Telephone Network (GSTN). H.324 terminals may carry real-time voice, data, and video, or any combination, including videotelephony. The H.324 series recommendations are summarized in Table 6. H.324 allows more than one channel of each type to be in use and uses logical channel signaling procedures. The content of each logical channel is described when it is opened, and procedures are provided for the expression of receiver and transmitter capabilities. This limits transmissions to what receivers can decode, and receivers may request a particular mode from transmitters. H.324 terminals may be used in multipoint video conferencing and interwork with H.320 terminals on the ISDN as well as terminals on wireless networks. Compared with H.320 (ISDN) and H.323 (LAN), H.324 specifies multimedia teleconferencing over the most pervasive commu-
Table 6. ITU-T Recommendation H.324
Video codec:
  H.263: Video coding at data rates less than 64 kbit/s. Please refer to the section entitled ''H.263.''
Audio codec:
  G.723: Audio codec for multimedia telecommunication at 5.3 or 6.3 kbit/s. It has a silence suppression mode so that the audio bandwidth can be used for other data when no audio is being transmitted.
Control:
  H.245: Control protocol for multimedia communication.
Multiplexing:
  H.223: Multiplexing protocol for low-bit-rate multimedia communications. H.223 specifies a packet-oriented multiplexing protocol which can be used between two low-bit-rate multimedia terminals or between a low-bit-rate multimedia terminal and an MCU or an interworking adapter. The protocol allows the transfer of any combination of digital voice, audio, image, and data over a single communication link. The control procedures necessary to implement H.223 are defined in H.245.
nication network (GSTN) today. As a result, H.324-based video conferencing products are prominent in the market.
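Because each umbrella standard mandates or permits different codecs (Tables 4 to 6), two endpoints must settle on codecs that both sides support, in the spirit of the H.245 capability exchange described in Table 5. The sketch below is illustrative only: the codec sets are taken from the tables above, and the selection rule (a simple set intersection with an arbitrary preference order) is our own simplification rather than the actual H.245 procedure.

```python
# Illustrative codec selection between two terminals. Codec sets are taken
# from Tables 4-6; the intersection-plus-preference rule is a simplification
# of the H.245 capability exchange, not the real negotiation procedure.
CODECS = {
    "H.320": {"video": {"H.261"},          "audio": {"G.711", "G.722", "G.728"}},
    "H.323": {"video": {"H.261", "H.263"}, "audio": {"G.711", "G.722", "G.723", "G.728", "G.729"}},
    "H.324": {"video": {"H.263"},          "audio": {"G.723"}},
}

def pick_common(local, remote, kind, preference):
    """Pick the first mutually supported codec of the given kind."""
    common = CODECS[local][kind] & CODECS[remote][kind]
    for codec in preference:
        if codec in common:
            return codec
    return None  # no common codec; a gateway would be needed

# An H.323 terminal talking to an H.324 terminal (e.g., through a gateway):
print(pick_common("H.323", "H.324", "video", ["H.263", "H.261"]))            # H.263
print(pick_common("H.323", "H.324", "audio", ["G.728", "G.723", "G.711"]))   # G.723
print(pick_common("H.320", "H.324", "video", ["H.263", "H.261"]))            # None
```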
VIDEO-ON-DEMAND

Video-on-Demand (VoD) is an interactive digital video system that works like cable television but allows subscribers to choose and view a movie from a large video archive at their leisure. VoD is sometimes referred to as Interactive TV (ITV), and it is one of the most important client/server applications of digital video. VoD involves video servers, which contain a large collection of digital video titles and deliver selected ones in stream mode to subscribers over the network. The client then decompresses the stream and plays it back at good quality (at least comparable to standard VHS). A VoD system must support VCR-like functions including pause, rewind, fast-forward, play, and so on. Such commands are issued by the subscribers, processed by set-top boxes, and sent to the video servers. Some of the key applications of VoD are video or film on demand, interactive games, distance learning, home shopping, and so on. VoD needs to be cost-effective in order to compete with existing video services such as video rental and cable TV.

Depending on the interactive capabilities they provide, VoD services can be classified into the following categories (89,90): Broadcast (No-VoD), Pay-per-view (PPV), Quasi VoD (Q-VoD), Near VoD (N-VoD), and True VoD (T-VoD). No-VoD service is similar to broadcast TV, in which the user is a passive viewer and has no control over the session. PPV service is similar to the existing PPV offered by cable TV companies, in which the subscriber can sign up and pay for certain programs. No-VoD and PPV subscribers have no control over the program viewing and have to receive the program at a schedule predetermined by the service provider. Q-VoD service allows limited user control of viewing by grouping users based on a threshold of interest, and users can switch between different viewing groups. N-VoD service provides staggered movie start times. The additional sessions allow viewers to jump from session to session to gain access to a different portion of the feature presentation. T-VoD service dedicates an entire session to a single user and provides the individual user with control over the presentation. The user can select the program at any time and has full-function virtual VCR capabilities such as fast-forward. T-VoD is the most difficult service to provide. A VoD system mainly consists of three components: set-top boxes, video servers, and the data delivery network, as shown in Fig. 7.

Set-Top Boxes

Set-top boxes interface TV equipment with the VoD services. A set-top box must contain a video decoder to decode the compressed video stream delivered from the server and convert it into a standard TV transmission format. It also needs to provide VCR-like functionality by allowing upstream (from the subscriber to the service provider) user commands. A set-top box may consist of the following components: (1) a powerful CPU, a RAM buffer for reducing network jitter, and a graphics chip for screen overlays; (2) a 1 GHz tuner for cable delivery of VoD programs, or an ADSL modem for ADSL delivery; (3) an error correction chip; (4) a hardware MPEG-2 decoder for real-time video data decompression and audio
[Figure 7. VoD system architecture: video servers connected by an ATM high-speed backbone to local video servers and an ATM-ADSL switching office interface, which reaches the subscribers' set-top boxes and TVs over ADSL.]
hardware; (5) an RGB color converter and a radio-frequency (RF) demodulator or baseband demodulator for telephone-line delivery of VoD programs; (6) an infrared receiver for remote control; and finally (7) a security chip to prevent theft. Set-top boxes should be of low cost. It is suggested that set-top boxes should be sold at around $150 to make VoD widely acceptable. They also need to be open and interoperable so that users can subscribe to several different VoD services.

Video Servers

Video servers store and provide user access to large collections of video titles. Their main functionalities include video storage, admission control, request handling, video retrieval, guaranteed stream transmission, video stream encryption, and support of virtual VCR functions. Designing a cost-effective, scalable, and efficient video server is a very challenging task. A video server should have the capacity to hold hundreds of terabytes of digital video and other information on different media such as magnetic tapes, optical write/read (W/R) disks, hard disks, or a random access memory (RAM) buffer. It must also support simultaneous, real-time access to hundreds of different video titles by hundreds or even thousands of subscribers.

Real-Time Disk Scheduling. Disk scheduling and admission control algorithms are needed for guaranteed real-time access to video storage. A common approach to real-time disk scheduling is to retrieve disk blocks for each stream in a round-robin fashion and keep each stream's block size proportional to its playback rate. This approach is known as quality proportional multisubscriber servicing (QPMS) (6), rate conversion (91), or the period transformation technique (92). Other real-time disk scheduling algorithms include:
Taking the priorities of the requests into consideration, the elevator disk scheduling algorithm can be easily extended to real-time disk scheduling. Tasks can be grouped into different priority classes, and their priorities are determined based on factors such as the tasks' deadlines. Each disk access request is assigned a priority, and the highest priority class with pending disk accesses is serviced using the elevator algorithm.

• The group sweeping scheme (GSS) (94), which minimizes both the disk access time and the buffer space. This algorithm assigns each request to a group. Groups are served in a round-robin fashion, and the elevator scheduling algorithm is used within each group. The algorithm's behavior can be adjusted by changing the group size and the number of groups. It approaches the elevator algorithm as the number of groups decreases, and it approaches the round-robin algorithm as the number of requests in each group increases.

• The prefetching disk scheduling algorithm, which can be extended for real-time disk scheduling to reduce the memory requirement of the media server. Examples of such extensions are the love page prefetching and delayed prefetching algorithms, which are used in the SPIFFI VoD system (95). Love page prefetching is a buffer pool page replacement algorithm that extends the global LRU algorithm (93) by distinguishing prefetched pages from referenced pages. Love page prefetching makes use of the fact that video is usually accessed in a strictly sequential manner (for example, watching a movie), and the probability of a data block in the RAM buffer being referenced again is not high (96). It uses two LRU chains: one for referenced pages and one for prefetched pages. When a new page is needed, the referenced page chain is searched first, and a page from the prefetched chain is taken if there are no available pages in the referenced page chain. The delayed prefetching algorithm delays the data prefetching until the last minute, thus reducing the size of the RAM buffer needed to store the prefetched video data.

CPU Admission and Scheduling Algorithms. The purpose of CPU admission control and scheduling algorithms is to ensure that a feasible schedule exists for all the admitted tasks. One example of a CPU admission control and scheduling algorithm is as follows (97). Isochronous tasks (also known as periodic tasks) are periodic network transmissions of video and audio data. These tasks need performance guarantees: throughput, bounded latency, and low jitter. Their priorities can be determined on a rate-monotonic basis (98); that is, a task with a higher frequency has a higher priority. A preemptive fixed-priority scheduling algorithm is used for isochronous tasks. Other real-time and non-real-time tasks can be scheduled using a weighted round robin, which can be preempted by isochronous tasks. General-purpose tasks have the lowest priorities, but they need a minimum CPU quantum to avoid starvation.

Video Storage Strategies. The video storage subsystem consists of control units, disk/tape storage, and an access mechanism. The video titles must be stored in a compressed digital format. MPEG-2 is often used since it is the video codec for
broadcast and HDTV video and is widely accepted by the cable and TV industry. Real-time video playback imposes strict delay and delay-variance requirements on the retrieval of video data from the storage subsystem. Video titles can be stored on many different media such as RAM, hard disks, optical R/W disks, and magnetic tapes. RAM provides the fastest data access but is prohibitively expensive. On the other hand, magnetic tapes are very cost-effective, but too slow for the multisession and real-time requirements of VoD. Thus, a video server normally uses a hybrid and hierarchical storage structure (96) in which disk arrays are used to store the video retrieved from tertiary storage and deliver the video at users' requests. If we assume the capacity of one disk to be 1 Gbyte and the transfer bandwidth to be 4 Mbyte/s, a 1000-disk system is large enough to store 300 MPEG-2 movies of 90 min each and support 6500 concurrent users (99). In order to deliver smooth, continuous, real-time video streams, a RAM buffer can be used to cache the popular portions of videos.

The arrangement of video titles across different storage media depends on the relative usage, the available bandwidth, and the level of interactivity supported. Such an arrangement is often referred to as a video data placement policy, with the goal of balancing the storage device load and maximizing the utilization of both bandwidth and space. One example is the Bandwidth-to-Space Ratio (BSR) policy (100). The BSR policy characterizes each storage device by its BSR, and each video stream by the ratio of its required bandwidth to the space needed to store it. The policy then dynamically determines how the video stream needs to be replicated and on which storage devices; this is done according to changes in users' demands. Another algorithm is the Dynamic Segment Replication (DSR) policy (101), which uses partial replication of the video streams to balance the load. DSR is based on the observation that a group of consecutive requests for a popular video stream can share the partial replica of the video stream generated by the previous request on the same video. Video placement can also be combined with video encoding to create multiresolution replications of the same video stream (102,103). Experiments show that such a schema can satisfy more user requests (with different QoS) than the one-resolution approach.

Several basic techniques, including striping, declustering, and replication, can also be used to increase video disk storage performance by interleaving a video title on multiple disks. Striping interleaves portions of disk blocks on multiple disks. The aim is to reduce the block access latency by parallel reading of the complete blocks. Declustering distributes blocks of files on several disks, thus allowing parallel block access from the same file and increasing the data rate of the video stream. Video titles can also be replicated among video servers based on user demand and access patterns (e.g., time/day of peak access, average number of simultaneous viewers) to balance the load.

Disk Failure Tolerance. The real-time, continuous video streams of VoD require storage media with very high availability and reliability. Although a single disk may be very reliable, a large disk array used in a media server system may have an unacceptably high failure probability. For example, if the mean time to failure (MTTF) of a single disk is on the order of 300,000 h, the MTTF of a 1000-disk array system will be
Table 7. Different Data Rates of ADSL Channels
Downstream bearer channels:
  n × 1.536 Mbit/s: 1.536, 3.072, 4.608, 6.144 Mbit/s
  n × 2.048 Mbit/s: 2.048, 4.096 Mbit/s
Upstream (duplex) bearer channels:
  C channels: 16, 64 kbit/s
  Optional channels: 160, 384, 544, 576 kbit/s

[Figure 8. Frequency spectrum of ADSL: POTS occupies the lowest frequencies (up to about 4 kHz), followed by ISDN, the ADSL upstream channel (roughly 80 kHz to 94 kHz), and the ADSL downstream channel (roughly 106 kHz to 120 kHz and above).]
just 300 h (99). Thus it is often necessary to sacrifice some of the disk space and bandwidth to improve the reliability and availability of the media server system. Usually, several parity (99,104,105) and mirroring (106) schemas can be used. For example, the streaming RAID schema (105) can effectively increase the MTTF of the disk array in the above example to 1100 years (99).

Data Delivery Network

The data delivery network connects subscribers and video servers; it includes the backbone network, the community network (or subscriber network), and the switch office. It delivers video streams and carries control signals and commands. Due to cost considerations, the subscriber network is usually based on twisted copper line or coax cable, whereas the backbone network is based on fiber or coax cable. Network technologies suitable for VoD are ADSL and ATM. Although ISDN is suitable for video conferencing, it does not meet the bandwidth requirement of VoD because its highest bandwidth is under 2 Mbit/s. To date, VoD trials have been conducted extensively on ADSL and ATM, with ATM forming the backbone from the video servers to the switch office, and ADSL linking the switch office to individual homes. The switch office is responsible for distributing video signals to individual subscribers (e.g., through ADSL).

ADSL. Asymmetric Digital Subscriber Line (ADSL) refers to the two-way capability of a twisted copper pair with analog-to-digital conversion at the subscriber end (e.g., through an ADSL modem) and an advanced transmission technology. ADSL coexists with POTS (lower 4 kHz) and ISDN (lower 8 kHz) service over the same twisted copper line by using higher frequencies in the spectrum for data transmission (see Fig. 8). They can be separated from each other by the ADSL modem at the subscriber's side by using filtering, such as passive filtering. This ensures POTS service in case of ADSL modem failure. The ADSL upstream and downstream channels can be separated by frequency division multiplexing (FDM) or can overlap each other. In the latter case, a technique called local echo cancellation is used to decode the resulting signal. ADSL can provide asymmetric transmission of data up to 9 Mbit/s downstream to the customer and 800 kbit/s upstream, depending on the line length and line and loop conditions. Table 7 lists some of the ADSL data rates (107). The actual ADSL downstream capacity also depends on the length of the copper loop (see Table 8) (108) and many other factors including wire gauge, bridged taps, and cross-coupled inter-
faces. Line attenuation increases with loop length and frequency, and it decreases as wire diameter increases (107). The asymmetric bandwidth characteristics of ADSL fit interactive video services such as VoD very well, since they need much higher bandwidth for the downstream data transmission (e.g., broadcast-quality MPEG-2 video needs 6 Mbit/s of bandwidth) than for the upstream user signaling (e.g., a rewind command). ADSL is usually used to provide dedicated asymmetrical megabit access for interactive video and high-speed data communication, such as Internet access, over a single telephone line. Another huge advantage of ADSL is that it can run over POTS (Plain Old Telephone Service) and thus can reach a vast number of customers. This is very important since the full deployment of broadband cable or fiber will take decades and enormous investment. In other words, ADSL helps make digital video services such as VoD marketable and profitable for the telephone companies and other service suppliers.

There are two modulation methods for ADSL, namely, DMT (Discrete Multitone) and CAP (Carrierless Amplitude/Phase modulation). DMT is usually preferred because of its higher throughput and greater resistance to adverse line conditions. It can effectively compensate for widely varying line noise conditions and quality levels. The basic idea of DMT is to divide the available bandwidth into a large number of subchannels, or carriers, using the discrete fast Fourier transform (FFT). The data are then distributed over these subchannels so that the throughput of every single subchannel is maximized. If some of the subchannels cannot carry any data, they can be turned off to optimize the use of the available bandwidth. DMT is used in the ANSI ADSL standard T1.413. ADSL transmits data in superframes, which consist of 68 ADSL frames and one additional frame for synchronization. Each ADSL frame contains two parts: the fast data and the interleaved data. The fast data may contain CRC error-checking bits and forward error correction bits. The interleaved data contain only the user data. Notice that the error correction can be used to reduce the impulse noise on the video signal, but it also introduces delay. Whether to employ error correc-
Table 8. Relationship Between the Loop Length and the ADSL Bandwidth
  Up to 18,000 ft: 1.544 Mbit/s (T1) downstream
  16,000 ft: 2.048 Mbit/s (E1) downstream
  12,000 ft: 6.312 Mbit/s (DS2) downstream
  9,000 ft (average line length for US customers): 8.448 Mbit/s downstream
Table 9. Some VoD User Trials
  Fairfax, VA (Stargazer). Company: Bell Atlantic. Technology: nCube/Oracle video server; ADSL (1.5 Mbps/64 kbps); MPEG-1 and MPEG-2. Service: VoD, home shopping, etc.
  Orlando, FL. Company: Time Warner, etc. Technology: SGI Challenge video server; ATM over fiber/coax at 45 Mbps, with 3.5 Mbps at the customer side for MPEG video. Service: VoD, home shopping, games, etc.
  Helsinki, Finland. Company: Helsinki Telephone Co. Technology: ADSL (2.048 Mbps/16 kbps), ATM as backbone. Service: VoD.
  Singapore. Company: Singapore Telecom. Technology: ATM/ADSL (5.5 Mbps/168 kbps) over fiber/copper. Service: VoD.
  Yokosuka, Japan. Company: NTT, Microsoft. Technology: ATM/ADSL over fiber/copper; MPEG-2. Service: VoD.
  Suffolk, England. Company: British Telecom. Technology: nCube/Oracle Media Server; ATM/ADSL (2 Mbps) over fiber/copper; MPEG-1, MPEG-2. Service: VoD, home shopping, games, etc.
  Germany. Company: Deutsche Telekom. Technology: ATM/ADSL over fiber/coax, satellite. Service: PPV, N-VoD, etc.

tion or not depends on the network and the type of data ADSL transmits. ADSL is often viewed as a transition technology to be used before existing copper lines can be converted to fiber or coax cables. A higher-speed variant of it, called VDSL, is under development. VDSL would provide 12.96 Mbit/s to 51.84 Mbit/s downstream and 1.6 Mbit/s to 2.3 Mbit/s upstream data rates, at the cost of shorter line lengths (4500 ft to 1000 ft) (108).

ATM. Asynchronous transfer mode (ATM) uses a fixed 53-byte cell (packet) for dynamic allocation of bandwidth. The cells have characteristics of both circuit-switched and packet-switched networks. A virtual path is set up through the involved switches when two end points wish to communicate. This provides a bit-rate-independent protocol that can be implemented on many network media such as twisted pair, coax, and fiber. ATM operates at very high speed; for example, SONET (Synchronous Optical Network) operates at 155 Mbit/s, and ATM could potentially operate at up to 2.2 Gbit/s over a cell-switched network. However, ATM requires broadband fiber and coax cables to fully achieve its capacity. ATM is ideal for VoD applications because of its high bandwidth and cell-switching capability, which is a compromise between delay-sensitive and conventional data transmissions. The ATM AAL1 protocol was designed for constant bit-rate services such as the transmission of MPEG video. The ATM Forum also proposed a standard for constant bit-rate AAL5, which can be used for both VoD and fast Internet access. The ATM backbone network can interlink with the ADSL network through an interface. Such an interface demultiplexes ATM signals and regenerates many 1.5 or 2 Mbit/s signals to feed the ADSL lines. The drawbacks of ATM are its limited availability and the high cost of related equipment. Thus, ATM is often used as the backbone network of VoD systems. In the future, ATM is expected to replace ADSL in VoD systems once copper twisted lines are upgraded to fiber lines or broadband coax cables.

VoD Trials

VoD is an emerging technology that still needs further product development and refinement before it can be accepted by
the average consumer. Many VoD user trials have been conducted, or are under way, around the world since the early 1990s, costing billions of dollars in investment. Some examples of VoD trials are listed in Table 9. Despite the failure of early user trials in the early 1990s due to unacceptably high cost, valuable information has been gathered for the statistical analysis of the related technologies and on the overall economic value of VoD, including:

• The feasibility of various VoD system architectures and networking technologies such as ADSL.
• Customer expectations about VoD services. For example, the usage-by-category data gathered during the Bell Atlantic VoD market trial (Stargazer) in 1995 support the view that customers desire diversified product offerings.
• Customer acceptance and satisfaction. For example, according to the results of the Stargazer VoD trial, the buy rate of VoD subscribers is significantly higher than that of cable PPV and video rental.

With the experience gained in the early phases of various VoD trials and recent advances in technology and standardization, VoD is becoming more and more affordable, especially as set-top boxes get cheaper. It is expected that VoD will become a reality in the near future; in other words, VoD will become not only commercially viable, but also ready for the market.
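To make the disk scheduling discussion in the Video Servers section above more concrete, the sketch below orders one priority class of pending requests with an elevator (SCAN) pass, following the priority-based extension described there. It is a simplified illustration rather than the scheduler of any particular video server; the cylinder numbers and priority classes in the example are arbitrary.

```python
# Elevator (SCAN) servicing of the highest-priority class of pending disk
# requests, as sketched in the Video Servers section. The cylinder numbers
# and priority classes below are arbitrary illustrative values.

def elevator_pass(cylinders, head, moving_up=True):
    """Order one class of requests (cylinder numbers) for a single SCAN pass."""
    higher = sorted(c for c in cylinders if c >= head)
    lower = sorted((c for c in cylinders if c < head), reverse=True)
    return higher + lower if moving_up else lower + higher

def next_batch(classes, head):
    """Service the highest-priority class (lowest number) with pending requests."""
    for priority in sorted(classes):
        if classes[priority]:
            batch = elevator_pass(classes[priority], head)
            classes[priority] = []
            return priority, batch
    return None, []

# Class 0: isochronous video reads; class 1: best-effort requests.
pending = {0: [95, 10, 40], 1: [77, 3]}
print(next_batch(pending, head=50))   # (0, [95, 40, 10])
print(next_batch(pending, head=95))   # (1, [77, 3])
```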
BIBLIOGRAPHY 1. D. E. Gibson, Report on an International Survey of 500 Audio, Motion Picture Films and Video Archives, talk given in the annual FIAT/IASA Conf., Bogensee, Germany, September 1994. 2. A. K. Elmagarmid et al., Video Database System: Issues, Products and Applications, Norwell, MA: Kluwer, 1997. 3. A. Hampapur, Design video data management systems, PhD thesis, Univ. Michigan, 1995. 4. H. Jiang and J. W. Dailey, Video database system for studying animal behavior, Proc. Multimedia Storage and Archiving Syst., Vol. 2916, 1996, pp. 162–173.
5. A. Watt, Fundamentals of Three-dimensional Computer Graphics, Reading, MA: Addison-Wesley, 1989. 6. H. M. Vin and P. V. Rangan, Designing a multiuser HDTV storage service, IEEE J. Selected Areas Commun., 11 (11): 153– 164, 1993. 7. D. L. Gall, MPEG: A video compression standard for multimedia applications, Commun. ACM, 34 (4): 46–58, 1991. 8. N. Johnson, Indeo video interactive: Back with a vengeance, Dig. Video Mag., February: 46–54, 1996. 9. J. Ozer, Indeo sets the standard for video quality, but some features may be ahead of their time, PC Magazine, January: 1996. 10. QuickTime technology brief—QuickTime 3.0, Apple Computer, Inc., 1997. 11. C. Wiltgen, The QuickTime FAQ [Online] Available http:// www.QuickTimeFAQ.com. 12. C. J. Date, An Introduction to Database Systems, Reading, MA: Addison-Wesley, 1975. 13. G. Davenport, T. G. A. Smith, and N. Pincever, Cinematic primitives for multimedia, IEEE Comput. Graphics Appl., 11 (4): 67– 74, 1991. 14. D. Swanberg, C.-F. Shu, and R. Jain, Knowledge guided parsing in video database, Proc. IS&T/SPIE Symp. on Electron. Image Sci. & Technol., 1993, pp. 13–24. 15. R. Hjelsvold and R. Midtstraum, Modeling and querying video data. Proc. 20th Int. Conf. on Very Large Data Bases, September 1994. 16. J. F. Allen, Maintaining knowledge about temporal intervals, Commun. ACM, 26 (11): 832–843, 1983. 17. T. D. C. Little and A. Ghafoor, Interval-based conceptual model for time-dependent multimedia data, IEEE Trans. Knowl. Data Eng., 5: 551–563, 1993. 18. H. Jiang, D. Montesi, and A. K. Elmagarmid, VideoText database systems, Proc. 4th IEEE Int. Conf. on Multimedia Computing and Syst., 1997, pp. 344–351. 19. R. Hjelsvold, Video information content and architecture. Proc. 4th Int. Conf. on Extending Database Technol., Cambridge, UK, March 1994. 20. T. G. A. Smith, If you could see what I mean . . . descriptions of video in an anthropologist’s notebook, Master’s thesis, MIT, 1992. 21. A. Hampapur, R. Jain, and T. Weymouth, Digital video indexing in multimedia systems, Proc. Workshop on Indexing and Reuse in Multimedia Syst., 1994. 22. H. J. Zhang et al., Automatic parsing of news video, Proc. 1st IEEE Int. Conf. on Multimedia Computing and Syst., 1994. 23. H. Jiang et al., Scene change detection techniques for video database systems, ACM Multimedia Syst., 6 (3): 186–195, 1998. 24. D. Swanberg, C.-F. Shu, and R. Jain, Architecture of multimedia information system for content-based retrieval, Proc. Audio Video Workshop, 1992. 25. T. D. C. Little et al., A digital ondemand video service supporting content-based queries, Proc. 1st ACM Int. Conf. on Multimedia, 1993, pp. 427–436. 26. R. Lienhart, Automatic text recognition for video indexing, Proc. 4th ACM Int. Multimedia Conf., 1996, pp. 11–20. 27. B.-L. Yeo, Efficient Processing of Compressed Images and Video, PhD thesis, Princeton Univ., January 1996. 28. T. Kanade et al., Informedia digital video library system: Annual progress report. Technical report, Carnegie Mellon Univ., Comput. Sci. Dept., Pittsburgh, PA, February 1997. 29. M. A. Smith and A. Hauptmann, Text, speech, and vision for video segmentation: The informedia project, AAAI Fall 1995 Symp. on Computational Models for Integrating Language and Vision, 1995.
30. T. G. A. Smith and G. Davenport, The stratification system: A design environment for random access video, Workshop on Networking and Operating Syst. Support for Digital Audio and Video, 1992. 31. R. Weiss, A. Duda, and D. Gifford, Content-based access to algebraic video, Proc. 1st IEEE Int. Conf. on Multimedia Computing and Syst., 1994. 32. F. Kokkoras et al., Smart VideoText: An intelligent video database system, TR 97-049, Dept. Comput. Sci., Purdue Univ., IN, 1997. 33. J. F. Sowa, Conceptual Structures: Information Processing in Minds and Machines, Reading, MA: Addison-Wesley, 1984. 34. J. Banerjee and W. Kim, Semantics and implementation of schema evolution in object-oriented database, Proc. ACM SIGMOD ’87, 1987, pp. 311–322. 35. E. Oomoto and K. Tanaka, OVID: Design and implementation of a video-object databse system, IEEE Trans. Knowl. Data Eng., 5: 629–643, 1993. 36. B. Shahraray, Scene change detection and content-based sampling of video sequences, Proc. Digital Video Compression: Algorithms and Technol., Vol. 2419, 1995, pp. 2–13. 37. B.-L. Yeo and B. Liu, On the extraction of DC sequence from MPEG compressed video, Int. Conf. on Image Processing, 1995. 38. A. Nagasaka and Y. Tanaka, Automatic video indexing and fullvideo search for object appearances, Proc. 2nd Working Conf. on Visual Database Syst., 1991, pp. 119–133. 39. A. Akutsu et al., Video indexing using motion vectors, Proc. Visual Commun. and Image Processing, 1992. 40. P. R. Hsu and H. Harashima, Detecting scene changes and activities in video databases, Proc. ICASSP ’94, Vol. 5, 1994, pp. 33–36. 41. H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, Automatic parsing of full-motion video, Multimedia Syst., 1: 10–28, July 1993. 42. H. G. Longbotham and A. C. Bovic, Theory of order statistic filters and their relationship to linear FIR filters, IEEE Trans. Acoust. Speech Signal Process., ASSP-37 (2): 275–287, 1989. 43. R. Zabih, J. Miller, and K. Mai, Feature-based algorithms for detecting and classifying scene breaks, Proc. 4th ACM Conf. on Multimedia, 1995. 44. B.-L. Yeo and B. Liu, Rapid scene analysis and compressed video, IEEE Trans. Circuits Syst. Video Technol., 5: 533–544, 1995. 45. B.-L. Yeo and B. Liu, A unified approach to temporal segmentation of motion JPEG and MPEG compressed video. Proc. 2nd IEEE Int. Conf. on Multimedia Computing and Syst., 1995. 46. F. Arman, A. Hsu, and M. Chiu, Image processing on compressed data for large video database, Proc. ACM Multimedia ’93, 1993, pp. 267–272. 47. I. K. Sethi and N. Patel, A statistical approach to scene change detection, Proc. Storage and Retrieval for Image and Video Database III, Vol. 2420, 1995, pp. 329–338. 48. H. J. Zhang et al., Video parsing using compressed data, Proc. Image and Video Processing II, Vol. 2182, 1994, pp. 142–149. 49. J. Meng, Y. Juan, and S. F. Chang, Scene change detection in a mpeg compressed video sequence, Proc. Storage and Retrieval for Image and Video Database III, Vol. 2420, 1995. 50. P. Aigrain and P. Joly, Automatic real-time analysis of film editing and transition effects and its applications, Comput. Graphics, 18 (1): 93–103, 1994. 51. A. Hampapur, R. Jain, and T. Weymouth, Digital video segmentation, Proc. ACM Multimedia ’94, 1994.
52. M. A. Smith and M. G. Christel, Automating the creation of a digital video library, Proc. ACM Multimedia '95, 1995, pp. 357–358. 53. V. M. Bove, What's wrong with today's video coding?, TV Technol., February: 1995. 54. M. R. W. Dawson, The how and why of what went where in apparent motion, Psychol. Rev., 98: 569–603, 1991. 55. M. Livingstone and D. O. Hubel, Segregation of form, color, movement and depth: Anatomy, physiology and perception, Science, 240: 740–749, 1988. 56. C. Cedras and M. Shah, Motion-based recognition: A survey, Image Vision Comput., 13 (2): 129–155, 1995. 57. F. Arman and J. K. Aggarwal, Model-based object recognition in dense-range images—a review, ACM Comput. Surv., 25 (1): 5–43, 1993. 58. M. S. Telagi and A. H. Soni, 3-D object recognition techniques: A survey, Proc. 1994 ASME Design Tech. Conf., Vol. 73, 1994. 59. S. S. Intille, Tracking using a local closed-worlds assumption: Tracking in the football domain, Tech. Rep. 296, MIT Media Laboratory, Perceptual Computing Section, August 1994. 60. M. Davis, Media streams: An iconic visual language for video annotation, Proc. Int. Symp. on Visual Languages, 1993, pp. 196–202. 61. R. Hjelsvold, VideoSTAR—A database for video information sharing, PhD thesis, Norwegian Inst. Technol., November 1995. 62. M. Davis, Knowledge representation for video, Proc. 12th Nat. Conf. on Artificial Intell., Vol. 1, Cambridge, MA: AAAI Press, 1994, pp. 120–127. 63. A. D. Bimbo, E. Vicario, and D. Zingoni, Sequence retrieval by contents through spatiotemporal indexing, Proc. Int. Symp. on Visual Languages, 1993. 64. F. Arman et al., Content-based browsing of video, Proc. ACM Multimedia '94, 1994. 65. M. Ioka and M. Kurokawa, Estimation of motion vectors and their application to scene retrieval, Tech. Rep. 1623-14, IBM Res., Tokyo Res. Laboratory, Shimotsuruma, Yamato-shi, Kanagawa-ken 242, Japan, 1993. 66. Y. Tonomura and A. Akutsu, A structured video handling technique for multimedia systems, IEICE Trans. Inf. Syst., E78-D: 764–777, 1994. 67. A. Akutsu and Y. Tonomura, Video tomography: An efficient method for camerawork motion vectors, Proc. ACM Multimedia '94, 1994. 68. S. W. Smoliar and H. J. Zhang, Content-based video indexing and retrieval, IEEE Multimedia, 1 (2): 62–72, 1994. 69. S. W. Smoliar, H. J. Zhang, and J. H. Wu, Using frame technology to manage video, Proc. Workshop on Indexing and Reuse in Multimedia Systems, 1994. 70. M. Flickner et al., Query by image and video content: The QBIC system, Computer, 28 (9): 23–31, 1995. 71. E. Ardizzone and M. Cascia, Automatic video database indexing and retrieval, Multimedia Tools Appl., 4 (1): 29–56, 1997. 72. M. H. Bohlen et al., Evaluating and enhancing the completeness of TSQL2, Tech. Rep. TR95-05, Comput. Sci. Dept., Univ. Arizona, 1995. 73. R. T. Snodgrass, The temporal query language Tquel, ACM Trans. Database Systems, 12: 299–321, 1987. 74. A. D. Bimbo and E. Vicario, A logical framework for spatio temporal indexing of image sequence, in S. K. Chang (ed.), Spatial Reasoning, Berlin: Springer-Verlag, 1993. 75. R. Hjelsvold, R. Midtstraum, and O. Sandst, Searching and Browsing a Shared Video Database, chapter Design and Implementation of Multimedia Database Management Systems, Norwell, MA: Kluwer, 1996.
AHMED K. ELMAGARMID
HAITAO JIANG
Purdue University