Three-Dimensional Television
Haldun M. Ozaktas · Levent Onural (Eds.)
Three-Dimensional Television Capture, Transmission, Display
With 316 Figures and 21 Tables
Prof. Haldun M. Ozaktas
Prof. Levent Onural
Department of Electrical Engineering, Bilkent University, TR-06800 Bilkent, Ankara, Turkey
Library of Congress Control Number: 2007928410

ISBN 978-3-540-72531-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media (springer.com)

© Springer-Verlag Berlin Heidelberg 2008

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: by the authors and Integra, India, using a Springer LaTeX macro package
Cover design: eStudio Calamar S.L., F. Steinen-Broo, Pau/Girona, Spain

Printed on acid-free paper
Preface
This book was motivated by, and most of its chapters derived from work conducted within, the Integrated Three-Dimensional Television—Capture, Transmission, and Display project, which is funded by the European Commission 6th Framework Information Society Technologies Programme and led by Bilkent University, Ankara. The project involves 19 partner institutions from 7 countries and over 180 researchers throughout Europe, and extends over the period from September 2004 to August 2008. The project web site is www.3dtv-research.org.

The editors would like to thank all authors who contributed their work to this edited volume. All contributions were reviewed, in most cases by leading experts in the area. We are grateful to all the anonymous reviewers for their help and their fruitful suggestions, which not only provided a basis for accepting or declining manuscripts, but also improved their quality significantly. We would also like to thank our Editor Christoph Baumann at Springer, Heidelberg for his guidance throughout the process.

We hope that this book will prove useful for those interested in three-dimensional television and related technologies, and that it will inspire further research that will help make 3DTV a reality in the near future.

The editors' work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
About the Authors
Gozde B. Akar received the BS degree from Middle East Technical University, Turkey in 1988 and the MS and PhD degrees from Bilkent University, Turkey in 1990 and 1994, respectively, all in electrical and electronics engineering. Currently she is an associate professor with the Department of Electrical and Electronics Engineering, Middle East Technical University. Her research interests are in video processing, compression, motion modeling and multimedia networking.

Anil Aksay received his BS and MS degrees in electrical and electronics engineering from Middle East Technical University, Ankara, Turkey in 1999 and 2001, respectively. Currently he is with the Multimedia Research Group at METU, where he is working as a researcher towards the PhD degree. His research interests include multiple description coding, image and video compression, stereoscopic and multi-view coding, video streaming and error concealment.

A. Aydin Alatan received his BS degree from Middle East Technical University, Ankara, Turkey in 1990, the MS and DIC degrees from Imperial College, London, UK in 1992, and the PhD degree from Bilkent University, Ankara, Turkey in 1997, all in electrical engineering. He was a post-doctoral research associate at Rensselaer Polytechnic Institute and the New Jersey Institute of Technology between 1997 and 2000. In August 2000, he joined the faculty of the Electrical and Electronics Engineering Department at Middle East Technical University.

Jaakko Astola received the PhD degree in mathematics from Turku University, Finland in 1978. Between 1979 and 1987 he was with the Department of Information Technology, Lappeenranta University of Technology, Lappeenranta, Finland. From 1987 to 1992 he was an associate professor in applied mathematics at Tampere University, Tampere, Finland. Since 1993 he has been a professor of signal processing at Tampere University of Technology. His research interests include signal processing, coding theory, spectral techniques and statistics.
Richard Bates is a research fellow within the Imaging and Displays Research Group at De Montfort University, where his main interests are software development for 3DTV displays and evaluating the usability and acceptability of 3DTV displays. He holds a PhD on human-computer interaction.

Kostadin Stoyanov Beev is a researcher in the Central Laboratory of Optical Storage and Processing of Information, Bulgarian Academy of Sciences. He received the PhD degree in 2007 in the area of the physics of wave processes. His MS degree in engineering physics has two specializations: quantum electronics and laser technique, and medical physics. His research interests are in optics, holography, material science, biophysics and evanescent waves.

Kristina Nikolaeva Beeva is a researcher in the Central Laboratory of Optical Storage and Processing of Information, Bulgarian Academy of Sciences. Her MS degree in engineering physics has two specializations: quantum electronics and laser technique, and medical physics. Her research interests are in the fields of optics, holography, laser technique and biophysics.

Philip Benzie obtained an honours degree in electrical and electronic engineering from Aberdeen University (AU) in 2001. His PhD, on the application of finite element analysis to holographic interferometry for non-destructive testing, was obtained in 2006 at AU. Currently his research interests include holographic imaging, non-destructive testing and underwater holography.

M. Oguz Bici received the BS degree in electrical and electronics engineering from Middle East Technical University (METU), Ankara, Turkey in 2005. Currently he is with the Multimedia Research Group of the Electrical and Electronics Engineering Department, METU, studying towards a PhD degree. His research interests are multimedia compression, error resilient/multiple description coding and wireless multimedia sensor networks.

Cagdas Bilen received his BS degree in electrical and electronics engineering from Middle East Technical University (METU), Ankara, Turkey in 2005. Currently he is an MS student in the Electrical and Electronics Engineering Department and a researcher in the Multimedia Research Group of METU. Among his research topics are image and video compression, error concealment, distributed video coding, multiple description coding, and stereoscopic and multiview video coding.

Sukhee Cho received the BS and MS degrees in computer science from Pukyong National University in 1993 and 1995, respectively. She received the PhD degree in electronics and information engineering from Yokohama National University in 1999. She is currently with the Radio and Broadcasting Research Division, Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea. Her research interests include stereoscopic video coding, multi-viewpoint video coding (MVC) and 3DTV broadcasting systems.
M. Reha Civanlar received the PhD degree in electrical and computer engineering from North Carolina State University in 1984. He is currently a vice president and director of the Media Lab at DoCoMo USA Labs. He was a visiting professor of computer engineering at Koc University in Istanbul for four years starting in 2002. Before that, he was the head of the Visual Communications Research Department at AT&T Labs Research. He is a recipient of the 1985 Senior Paper Award of the IEEE Signal Processing Society and a fellow of the IEEE.

Edilson de Aguiar received the BS degree in computer engineering from the Espirito Santo Federal University, Vitoria, Brazil in 2002 and the MS degree in computer science from Saarland University, Saarbrücken, Germany in 2004. He is currently a PhD student in the Computer Graphics Group at the Max-Planck-Institut (MPI) Informatik, Saarbrücken, Germany. His research interests include computer animation, motion capture and 3D video.

Stephen DiVerdi is a doctoral candidate at the University of California, Santa Barbara. He received his bachelors degree in computer science from Harvey Mudd College in 2002. His research covers the intersection of graphics, vision, and human-computer interaction, with an emphasis on augmented reality.

Funda Durupınar is a PhD candidate at the Department of Computer Engineering, Bilkent University, Ankara, Turkey. She received her BS degree in computer engineering from Middle East Technical University in 2002 and her MS degree in computer engineering from Bilkent University in 2004. Her research interests include physically-based simulation, cloth modeling, behavioral animation and crowd simulation.

Karen Egiazarian received the MS degree in mathematics from Yerevan State University, Armenia, the PhD degree in physics and mathematics from Moscow Lomonosov State University, and the DrTech degree from Tampere University of Technology. Currently he is a full professor in the Institute of Signal Processing, Tampere University of Technology. His research interests are in the areas of applied mathematics and signal and image processing.

G. Bora Esmer received his BS degree from Hacettepe University, Ankara in 2001 and the MS degree from Bilkent University, Ankara in 2004. He has been a PhD student in the area of signal processing since 2004. His areas of interest include optical information processing and image processing.

Christoph Fehn received the Dr-Ing degree from the Technical University of Berlin, Germany. He currently works as a scientific project manager at Fraunhofer HHI and as an associate lecturer at the University of Applied Sciences, Berlin. His research interests include video processing and coding, computer graphics, and computer vision for applications in the area of immersive media, 3DTV, and digital cinema. He has been involved in MPEG standardization activities for 3D video.
Atanas Gotchev received MS degrees in communications engineering and in applied mathematics from the Technical University of Sofia, Bulgaria, the PhD degree in communications engineering from the Bulgarian Academy of Sciences, and the DrTech degree from Tampere University of Technology, Finland. Currently he is a senior researcher at the Institute of Signal Processing, Tampere University of Technology. His research interests are in transform methods for signal, image and video processing.

Uğur Güdükbay is an associate professor at the Department of Computer Engineering, Bilkent University, Ankara, Turkey. He received his PhD degree in computer engineering and information science from Bilkent University in 1994. He then conducted research as a postdoctoral fellow at the Human Modeling and Simulation Laboratory, University of Pennsylvania, USA. His research interests include different aspects of computer graphics, multimedia databases, computational geometry, and electronic arts. He is a senior member of the IEEE and a professional member of the ACM.

Jana Harizanova is a researcher in the Central Laboratory of Optical Storage and Processing of Information of the Bulgarian Academy of Sciences. She recently received her PhD degree in the field of holographic and laser interferometry. Her main research interests include interferometry, diffraction optics and digital signal processing.

Tobias Höllerer is an assistant professor of computer science at the University of California, Santa Barbara, where he leads a research group on imaging, interaction, and innovative interfaces. Höllerer holds a graduate degree in informatics from the Technical University of Berlin and MS and PhD degrees in computer science from Columbia University.

Klaus Hopf received the Dipl-Ing degree in electrical engineering from the Technical University of Berlin, Germany. He has worked within government-funded research projects on the development of videoconferencing systems, and in research projects developing new technologies for the representation of video images in high resolution (HDTV) and 3D.

Namho Hur received the BS, MS and PhD degrees in electrical and electronics engineering from Pohang University of Science and Technology (POSTECH), Pohang, Korea in 1992, 1994, and 2000. He is currently with the Radio and Broadcasting Research Division, Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea. As a research scientist, he was with Communications Research Centre Canada (2003–2004). His main research interests are control theory, power electronics and 3DTV broadcasting systems.
Rossitza Ilieva received the MS degree in physics in 1968 and completed postgraduate studies in optics and holography in 1979 at Sofia University. She has experience in stereo imaging and holography (researcher at CLOSPI, Bulgarian Academy of Sciences, 1975–2005). Since then she has been a researcher in the Electrical and Electronics Engineering Department of Bilkent University, working on holographic 3DTV displays.

Peter Kauff is the head of the Immersive Media & 3D Video Group in the Image Processing Department of the Fraunhofer HHI. He has been with HHI since 1984 and has been involved in numerous German and European projects related to digital HDTV signal processing and coding, interactive MPEG-4-based services, as well as a number of projects related to advanced 3D video processing for immersive tele-presence and immersive media.

Jinwoong Kim received the BS and MS degrees from the Department of Electronics Engineering of Seoul National University in 1981 and 1983, respectively. He received the PhD degree from the Department of Electrical and Computer Engineering of Texas A&M University in 1993. He joined ETRI in 1983, where he is currently a principal member of research staff. He has been involved in many R&D projects, including the TDX digital switching system, an HDTV encoder system and chipset, and MPEG-7 and MPEG-21 core algorithms and applications. He is now 3DTV project leader at ETRI.

Metodi Kovachev received the MS degree in 1961 and the PhD degree in physics in 1982 from Sofia University. He has experience in optical design, stereo imaging, holography, optical processing, and electronics (director and senior researcher at CLOSPI, Bulgarian Academy of Sciences, 1975–2005). Since then he has been a senior researcher in the Electrical and Electronics Engineering Department of Bilkent University, working on holographic 3DTV displays.

Alper Koz received the BS and MS degrees in electrical and electronics engineering from Middle East Technical University, Ankara, Turkey in 2000 and 2002, respectively. He is currently a PhD student and a research assistant in the same department. His research interests include watermarking techniques for image, video, free view, and 3D television.

Hyun Lee received the BS degree in electronics engineering from Kyungpook National University, Korea in 1993 and the MS degree from KAIST (Korea Advanced Institute of Science and Technology), Korea in 1996. He enrolled in a doctoral course at KAIST in 2005. Since 1999, he has been with the Digital Broadcasting Research Division at ETRI (Electronics and Telecommunications Research Institute). His current interests include mobile multimedia broadcasting, digital communications and 3DTV systems.

Wing Kai Lee holds a PhD degree in optical engineering from the University of Warwick. He is currently working as a research fellow in the Imaging and Displays Research Group, De Montfort University. His research interests include 3D display and imaging, high speed imaging, holography, optical non-contact measurements, digital video coding, and medical imaging.
Marcus A. Magnor heads the Computer Graphics Lab at the Technical University of Braunschweig. He received his BA (1995) and MS (1997) in physics and his PhD (2000) in electrical engineering. He established the independent research group Graphics-Optics-Vision at the Max-Planck-Institut Informatik in Saarbrücken and received the venia legendi in computer science from Saarland University in 2005. His research interests entwine around the visual information processing pipeline. Recent and ongoing research topics include video-based rendering, 3DTV, as well as astrophysical visualization.

Aydemir Memişoğlu works as a software engineer at Havelsan A.Ş., Ankara, Turkey. He received his BS and MS degrees in computer engineering from Bilkent University, Ankara, Turkey in 2000 and 2003, respectively. His research interests include different aspects of computer graphics, specifically human modeling and animation.

Philipp Merkle received the Dipl-Ing degree from the Technical University of Berlin, Germany in 2006. He has been with Fraunhofer HHI since 2003. His research interests are mainly in the field of representation and coding of free viewpoint and multi-view video, including MPEG standardization activities.

Karsten Müller received the Dipl-Ing and Dr-Ing degrees in electrical engineering from the Technical University of Berlin, Germany, in 1997 and 2006 respectively. He has been with the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Berlin since 1997. His research interests are mainly in the field of representation, coding and reconstruction of 3D scenes. He has been involved in MPEG standardization activities.

Andrey Norkin received his MS degree in computer science from the Ural State Technical University, Russia in 2001 and the LicTech degree in signal processing from Tampere University of Technology, Finland in 2005. Currently he is a researcher at the Institute of Signal Processing, Tampere University of Technology, where he is working towards his PhD degree. His research interests include image and video coding and the error resilience of compressed images, video, and 3D meshes.

Alex Olwal is a PhD candidate at KTH (the Royal Institute of Technology) in Stockholm. He was a visiting researcher at Columbia University in 2001–2003 and at UC Santa Barbara in 2005. His research focuses on interaction techniques and novel 3D user interfaces, such as augmented and mixed reality. His research interests include multimodal interaction, interaction devices, and ubiquitous computing.

Levent Onural received his PhD in electrical and computer engineering from SUNY at Buffalo in 1985 (BS, MS from METU) and is presently a full professor at Bilkent University. He received a TUBITAK award in 1995 and an IEEE Third Millennium Medal in 2000, and is an associate editor of IEEE Transactions on Circuits and Systems for Video Technology. Currently he is leading the European Commission funded 3DTV Project as the coordinator.
Jörn Ostermann studied electrical engineering and communications engineering. Since 2003 he has been a full professor and head of the Institut für Informationsverarbeitung at the Leibniz Universität Hannover, Germany. He is a fellow of the IEEE, a member of the IEEE Technical Committee on Multimedia Signal Processing, and past chair of the IEEE CAS Visual Signal Processing and Communications (VSPC) Technical Committee. His current research interests are video coding and streaming, 3D modeling, face animation, and computer-human interfaces.

Haldun M. Ozaktas received a PhD degree from Stanford University in 1991. He joined Bilkent University, Ankara in 1991, where he is presently a professor of electrical engineering. In 1992 he was at the University of Erlangen-Nürnberg as an Alexander von Humboldt Fellow. In 1994 he worked as a consultant for Bell Laboratories, New Jersey. He is the recipient of the 1998 ICO International Prize in Optics and the Scientific and Technical Research Council of Turkey Science Award (1999), a member of the Turkish Academy of Sciences, and a fellow of the Optical Society of America.

Bülent Özgüç is a professor at the Department of Computer Engineering and the dean of the Faculty of Art, Design and Architecture, Bilkent University, Ankara, Turkey. He formerly taught at the University of Pennsylvania, USA, the Philadelphia College of Arts, USA, and the Middle East Technical University, Turkey, and worked as a member of the research staff at the Schlumberger Palo Alto Research Center, USA. His research areas include different aspects of computer graphics and user interface design. He is a member of IEEE, ACM and IUA.

Ismo Rakkolainen received the DrTech degree from the Tampere University of Technology, Finland in 2002. He has 26 journal and conference articles, 1 book chapter, 4 patents and several innovation awards. His primary research interests include 3D display technology, 2D mid-air displays, virtual reality, interaction techniques and novel user interfaces.

Tarik Reyhan received his BS and MS degrees from the Electrical Engineering Department of METU in 1972 and 1975, respectively. He received his PhD degree from the Electrical Engineering Department of the University of Birmingham, UK in 1981. He worked at ASELSAN from 1981 to 2001 and joined the Electrical and Electronics Engineering Department of Bilkent University in 2001. His areas of interest include R&D management, telecommunications, RF design, electronic warfare and night vision.

Simeon Hristov Sainov is a professor in the Central Laboratory of Optical Storage and Processing of Information, Bulgarian Academy of Sciences, and the head of the Holographic and Optoelectronic Investigations Group. He graduated from St. Petersburg State University, Faculty of Physics. His major fields of scientific interest are physical optics, near-field optics, holography and laser refractometry.
Ventseslav Sainov is the director of the Central Laboratory of Optical Storage and Processing of Information of the Bulgarian Academy of Sciences. He has expertise in the fields of light sensitive materials, holography, holographic and laser interferometry, shearography, non-destructive testing, 3D micro/macro measurements, and optical and digital processing of interference patterns.

Hans-Peter Seidel is the scientific director and chair of the Computer Graphics Group at the Max-Planck-Institut (MPI) Informatik and a professor of computer science at Saarland University. He has received grants from a wide range of organizations, including the German National Science Foundation (DFG), the German Federal Government (BMBF), the European Community (EU) and NATO. In 2003 Seidel was awarded the Leibniz Preis, the most prestigious German research award, by the German Research Foundation (DFG).

Ian Sexton holds a PhD on 3D display architecture, and his research interests include 3D display systems, computer architecture, computer graphics, and image processing. He founded the Imaging and Displays Research Group at De Montfort University and is an active member of the SID, sitting on its UK and Ireland Chapter Committee.

Aljoscha Smolic received the Dr-Ing degree from Aachen University of Technology in 2001. He is a scientific project manager at the Fraunhofer HHI and an adjunct professor at the Technical University of Berlin. His research interests include video processing and coding, computer graphics, and computer vision. He has been leading MPEG standards activities for 3D video.

Ralf Sondershaus received his Diplom (MS) degree in computer science from the University of Tübingen in 2001. From 2001 until 2003, he worked in the field of geographic visualization. From 2003, he was a PhD candidate at the Department for Graphical-Interactive Systems (GRIS) at the University of Tübingen, where he received his PhD in 2007. His research interests include compact multi-resolution models for huge surface and volume meshes, volume visualization and geographic information systems (GIS).

Nikolče Stefanoski studied mathematics and computer science with telecommunications as field of application. Since January 2004 he has been working toward a PhD degree at the Institut für Informationsverarbeitung of the Leibniz Universität Hannover, Germany. His research interests are coding of time-variant 3D geometry, signal processing, and stochastics.

Elena Stoykova is a member of the SPIE and scientific secretary of the Central Laboratory of Optical Storage and Processing of Information of the Bulgarian Academy of Sciences. She has expertise in the fields of interferometry, diffraction optics, digital signal processing, and Monte-Carlo simulation. She is the author of more than 100 publications in scientific journals and proceedings.
Phil Surman holds a PhD on 3D television displays from De Montfort University. He has been conducting independent research on 3D television for many years. He helped to instigate several European 3DTV projects and is currently working on multi-viewer 3DTV displays at the Imaging and Displays Research Group.

A. Murat Tekalp received the PhD degree in electrical, computer and systems engineering from Rensselaer, Troy, New York in 1984. He has been with the Eastman Kodak Company (1984–1987) and the University of Rochester, New York (1987–2005), where he was promoted to Distinguished University Professor. Since 2001, he has been a professor at Koc University, Istanbul, Turkey. He has been selected a Distinguished Lecturer by the IEEE Signal Processing Society and is a fellow of the IEEE.

Christian Theobalt is a postdoctoral researcher at the MPI Informatik in Saarbrücken, Germany and head of the junior research group 3D Video and Vision-based Graphics within the Max-Planck Center for Visual Computing and Communication. His research interests include free-viewpoint and 3D video, markerless optical motion capture, 3D computer vision, and image- and physics-based rendering.

George A. Triantafyllidis received the Diploma and PhD degrees from the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Greece in 1997 and 2002, respectively. He was with the Informatics and Telematics Institute, Thessaloniki, Greece from 2000 to 2004 as a research associate and has been a senior researcher there since 2004. His research interests include 3D data processing, medical image communication, multimedia signal processing, image analysis and stereo image sequence coding.

Libor Váša graduated from the University of West Bohemia in 2004 with a specialisation in computer graphics and data visualisation. Currently he is working towards his PhD degree in the Computer Graphics Group at the University of West Bohemia, in the field of dynamic mesh compression and simplification.

John Watson was appointed to a chair (professorship) in optical engineering at Aberdeen University in 2004. His research interests include underwater holography, subsea laser welding, laser induced spectral analysis and optical image processing. He is an elected member of the administrative committee of OES and a fellow of the IET and IOP.

Thomas Wiegand is the head of the Image Communication Group in the Image Processing Department of Fraunhofer HHI. He received the Dipl-Ing degree in electrical engineering from the Technical University of Hamburg-Harburg, Germany in 1995 and the Dr-Ing degree from the University of Erlangen-Nuremberg, Germany in 2000. He is associated rapporteur of ITU-T VCEG, associated rapporteur/co-chair of the JVT, and associated chair of MPEG Video.
Mehmet Şahin Yeşil is a PhD student at the Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey. He received his BS and MS degrees in computer engineering from Bilkent University, Ankara, Turkey in 2000 and 2003, respectively. His research interests are computer graphics, cryptography, and computer and network security. He works as an officer in the Turkish Air Force.

Kugjin Yun received the BS and MS degrees in computer engineering from Chunbuk National University of Korea in 1999 and 2001. He joined the Electronics and Telecommunications Research Institute (ETRI) in 2001, and is currently with the Broadcasting System Research Group. His research interests include 3D T-DMB and 3DTV broadcasting systems.

Xenophon Zabulis received his BA, MS and PhD degrees in computer science from the University of Crete in 1996, 1998, and 2001, respectively. He has worked as a postdoctoral fellow at the Computer and Information Science Department, at the interdisciplinary General Robotics, Automation, Sensing and Perception Laboratory, and at the Institute for Research in Cognitive Science, all at the University of Pennsylvania. He is currently a research fellow at the Institute of Informatics and Telematics, Centre for Research and Technology Hellas, Thessaloniki, Greece.
Contents
1 Three-dimensional Television: From Science-fiction to Reality
Levent Onural and Haldun M. Ozaktas

2 A Backward-compatible, Mobile, Personalized 3DTV Broadcasting System Based on T-DMB
Hyun Lee, Sukhee Cho, Kugjin Yun, Namho Hur and Jinwoong Kim

3 Reconstructing Human Shape, Motion and Appearance from Multi-view Video
Christian Theobalt, Edilson de Aguiar, Marcus A. Magnor and Hans-Peter Seidel

4 Utilization of the Texture Uniqueness Cue in Stereo
Xenophon Zabulis

5 Pattern Projection Profilometry for 3D Coordinates Measurement of Dynamic Scenes
Elena Stoykova, Jana Harizanova and Ventseslav Sainov

6 Three-dimensional Scene Representations: Modeling, Animation, and Rendering Techniques
Uğur Güdükbay and Funda Durupınar

7 Modeling, Animation, and Rendering of Human Figures
Uğur Güdükbay, Bülent Özgüç, Aydemir Memişoğlu and Mehmet Şahin Yeşil

8 A Survey on Coding of Static and Dynamic 3D Meshes
Aljoscha Smolic, Ralf Sondershaus, Nikolče Stefanoski, Libor Váša, Karsten Müller, Jörn Ostermann and Thomas Wiegand

9 Compression of Multi-view Video and Associated Data
Aljoscha Smolic, Philipp Merkle, Karsten Müller, Christoph Fehn, Peter Kauff and Thomas Wiegand

10 Efficient Transport of 3DTV
A. Murat Tekalp and M. Reha Civanlar

11 Multiple Description Coding and its Relevance to 3DTV
Andrey Norkin, M. Oguz Bici, Anil Aksay, Cagdas Bilen, Atanas Gotchev, Gozde B. Akar, Karen Egiazarian and Jaakko Astola

12 3D Watermarking: Techniques and Directions
Alper Koz, George A. Triantafyllidis and A. Aydin Alatan

13 Solving the 3D Problem—The History and Development of Viable Domestic 3DTV Displays
Phil Surman, Klaus Hopf, Ian Sexton, Wing Kai Lee and Richard Bates

14 An Immaterial Pseudo-3D Display with 3D Interaction
Stephen DiVerdi, Alex Olwal, Ismo Rakkolainen and Tobias Höllerer

15 Holographic 3DTV Displays Using Spatial Light Modulators
Metodi Kovachev, Rossitza Ilieva, Philip Benzie, G. Bora Esmer, Levent Onural, John Watson and Tarik Reyhan

16 Materials for Holographic 3DTV Display Applications
Kostadin Stoyanov Beev, Kristina Nikolaeva Beeva and Simeon Hristov Sainov

17 Three-dimensional Television: Consumer, Social, and Gender Issues
Haldun M. Ozaktas
1 Three-dimensional Television: From Science-fiction to Reality

Levent Onural and Haldun M. Ozaktas

Department of Electrical Engineering, Bilkent University, TR-06800 Bilkent, Ankara, Turkey
Moving three-dimensional images have been depicted in many science-fiction films. This has contributed to 3D video and 3D television (3DTV) being perceived as ultimate goals in imaging and television technology. This vision of 3DTV involves a ghost-like, yet high quality optical replica of an object that is visually indistinguishable from the original (except perhaps in size). These moving video images would be floating in space or standing on a tabletop-like display, and viewers would be able to peek or walk around the images to see them from different angles or maybe even from behind (Fig. 1.1). As such, this vision of 3DTV is quite distinct from stereoscopic 3D imaging and cinema.

Fig. 1.1. Artist's vision of three-dimensional television (graphic artist: Erdem Yücel)

3D photography, cinema, and TV actually have a long history; in fact, stereoscopic 3D versions of these common visual media are almost as old as their 2D counterparts. Stereoscopic 3D photography was invented as early as 1839. The first examples of 3D cinema were available in the early 1900s. Various forms of early 2D television were developed in the 1920s, and by 1929 stereoscopic 3DTV had been demonstrated. However, while the 2D versions of photography, cinema, and TV have flourished to become important features of twentieth century culture, their 3D counterparts have almost disappeared since their peak around 1950. Our position is that this was not a failure of 3D in itself, but a failure of the then only viable technology to produce 3D, namely stereoscopy (or stereography).

Stereoscopic 3D video is primarily based on the binocular nature of human perception, and it is relatively easy to realize. Two simultaneous conventional 2D video streams are produced by a pair of cameras mimicking the two human eyes, which see the environment from two slightly different angles. Then, one of these streams is shown to the left eye, and the other to the right eye. Common means of separating the right-eye and left-eye views are glasses with colored transparencies or polarization filters. Although the technology is quite simple, the necessity of wearing glasses while viewing has often been considered a major obstacle to the wide acceptance of 3DTV. But perhaps more importantly, within minutes after the onset of viewing, stereoscopy frequently causes eye fatigue and feelings similar to those experienced during motion sickness, caused by a mismatch of perceptory cues received by the brain from different sensory sources. Recently, with the adoption of digital technologies in all aspects of motion picture production, it has become possible to eliminate some of the factors which result in eye fatigue. This development alone makes it quite probable for stereoscopic 3D movies to become commonplace within a matter of years. Nevertheless, some intrinsic causes of fatigue may remain as long as stereoscopy remains the underlying 3D technology.

Stereoscopic 3D displays are similar to conventional 2D displays: a vertical screen or a monitor produces the two video channels simultaneously, and special glasses are used to direct one to the left eye and the other to the right eye. In contrast, autostereoscopic monitors are novel display devices where no special glasses are required. Covering the surface of a regular high-resolution digital video display device with a vertical or slanted lenticular sheet, and driving these monitors with so-called interzigged video, one can deliver the two different scenes to the left and the right eyes of the viewer, provided that the viewer stays in the correct position. (A lenticular sheet is essentially a transparent film or sheet of plastic with a fine array of cylindrical lenses. The ruling of the lenses can either be aligned or slanted with respect to the axes of the display.) Barrier technology is another way of achieving autostereoscopy: electronically generated fence-like optical barriers coupled with properly interzigged digital pictures generate the two or more different views required. It is possible to provide many more views than the two views of classical stereoscopy by using the autostereoscopic approach in conjunction with slanted lenticular sheets or barrier technology.
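As a concrete illustration of interzigging (our own minimal sketch, not taken from this book), the Python fragment below column-interleaves two views for a hypothetical two-view lenticular or barrier panel. Real autostereoscopic displays use calibrated, often slanted, subpixel-accurate mappings, so the even/odd column assumption here is a deliberate simplification.

```python
import numpy as np

def interzig_two_views(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Column-interleave two equal-size H x W x 3 views into one frame.

    Simplifying assumption: each lenticule (or barrier slit) covers exactly
    two pixel columns, routing even columns to the left eye and odd columns
    to the right eye. Commercial panels use slanted subpixel maps instead.
    """
    if left.shape != right.shape:
        raise ValueError("views must have identical dimensions")
    frame = left.copy()
    frame[:, 1::2, :] = right[:, 1::2, :]  # odd columns carry the right view
    return frame

# Quick check with synthetic flat-color views:
h, w = 480, 640
left = np.zeros((h, w, 3), np.uint8); left[..., 0] = 255    # all-red left view
right = np.zeros((h, w, 3), np.uint8); right[..., 2] = 255  # all-blue right view
print(interzig_two_views(left, right).shape)                # -> (480, 640, 3)
```

The same indexing idea generalizes to N views by cycling columns (or subpixels) modulo N, which is how multiview panels of the kind described next can be driven.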
Up to nine views are common, creating horizontal parallax with a viewing angle of about 20 degrees. Classical stereoscopy with its two views is not able to yield parallax in response to head movement. People watching three-dimensional scenes expect occlusion and disocclusion effects when they move with respect to the scene; certain parts of objects should appear and disappear as one moves around. This is not possible with two fixed views, producing an unnatural result if the observer is moving. Head-tracking autostereoscopic display devices have been developed to avoid this viewer position constraint; however, serving many users at the same time remains a challenge.

Free viewpoint video (FVV) functionality is another approach to allowing viewer movement. It offers the same functionality familiar from three-dimensional computer graphics: the user can interactively choose a viewpoint and viewing direction within a visual scene. In contrast to pure computer graphics applications, which deal with synthetic images, FVV deals with real world scenes captured by real cameras. As in computer graphics, FVV relies on a certain three-dimensional representation of the scene. If a virtual view (not an available camera view) corresponding to an arbitrary viewpoint and viewing direction can be rendered from that three-dimensional representation, free viewpoint video functionality will have been achieved. In most cases, it will be necessary to restrict the navigation range (the allowed virtual viewpoints and viewing directions) to some practical limits. Rendering stereo pairs from the three-dimensional representation not only provides three-dimensional perception, but also supports natural head-motion parallax.

Despite its drawbacks, stereoscopic 3D has found acceptance in some niche markets such as computer games. Graphics drivers that produce stereo video output are freely available. With the use of (very affordable) special glasses, ordinary personal computers can be converted into three-dimensional display systems, allowing three-dimensional games to be played. Stereo video content is also becoming available. Such content is either originally captured in stereo (as in some commercially available movies) or is converted from ordinary two-dimensional video. Two-dimensional to three-dimensional conversion is possible with user-assisted production systems, and is of great interest for content owners and producers.
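When the scene representation is a video frame plus a per-pixel depth map (one of the representation formats discussed later in this chapter), virtual views of the kind FVV needs can be rendered by depth-image-based warping. The sketch below is our own simplified illustration under strong assumptions (rectified cameras, normalized inverse depth, purely horizontal disparity) and omits the hole-filling and blending that any practical renderer requires.

```python
import numpy as np

def warp_view(ref: np.ndarray, inv_depth: np.ndarray, baseline_px: float) -> np.ndarray:
    """Forward-warp an H x W x 3 reference view to a nearby horizontal viewpoint.

    Assumptions (illustrative only): rectified cameras, inv_depth in [0, 1]
    (1 = nearest), and disparity = baseline_px * inv_depth. Disoccluded
    pixels stay black; real systems fill such holes from other views.
    """
    h, w = inv_depth.shape
    out = np.zeros_like(ref)
    ys, xs = np.mgrid[0:h, 0:w]
    new_xs = (xs + baseline_px * inv_depth).astype(int)
    valid = (new_xs >= 0) & (new_xs < w)
    # Write far pixels first so nearer pixels overwrite them (a simple z-buffer).
    order = np.argsort(inv_depth, axis=None)
    yi, xi = np.unravel_index(order, inv_depth.shape)
    keep = valid[yi, xi]
    out[yi[keep], new_xs[yi[keep], xi[keep]]] = ref[yi[keep], xi[keep]]
    return out
```

Rendering two such views with slightly different baselines yields the stereo pairs mentioned above, which is one way head-motion parallax can be supported from a single transmitted view plus depth.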
Stereoscopic 3D, whether in its conventional form as in the old stereoscopic cinema, or in its more modern forms involving autostereoscopic systems, falls far short of the vision of true optical replicas outlined at the beginning of this chapter. To circumvent the many problems and shortcomings of stereoscopy in a radical manner, it seems necessary to abandon the basic binocular basis of stereoscopy and, by turning to basic physical principles, to focus on the goal of true optical reconstruction of optical wave fields. Optically sensitive devices, including cameras and human eyes, do not "reach out" to the environment or the objects in it; they merely register the light incident on them. The light registered by our eyes, which carries the information about the scene, is processed by our visual system and brain, and thus we perceive our environment. Therefore, if the light field which fills a given 3D region can be recorded with all its physical attributes, and then recreated from the recording in the absence of the original object or scene, any optical device or eye embedded in this recreated light field will "see" the original scene, since the light incident on the device or eye will be essentially indistinguishable in the two cases. This is the basic principle of holography, a technique known since 1948. Holography is distinct from ordinary photography in that it involves recording the entire optical field, with all its attributes, rather than merely its intensity or projection ("holo" in holography refers to recording of the "whole" field). As expected, the quality of the holographic recording and reconstruction process will directly affect the fidelity of the created ghost-like images to their originals. Digital holography and holographic cinema and TV are still in their infancy. However, advances in optical technology and computing power have brought us to the point where we can seriously consider making this technology a reality. It seems highly likely that high quality 3D viewing will be possible as the underlying optics and electronics technologies mature.
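In standard textbook notation (our addition, not part of this chapter), the holographic principle can be stated compactly. Let O and R denote the complex amplitudes of the object and reference waves at the recording plane. The recorded intensity is

\[ I = |O + R|^2 = |O|^2 + |R|^2 + O R^* + O^* R , \]

and re-illuminating the recording with the reference wave gives

\[ I \, R = \left( |O|^2 + |R|^2 \right) R + |R|^2 \, O + R^2 \, O^* . \]

The term |R|^2 O is proportional to the original object wave, which is why an eye placed in the reconstructed field sees the original scene; the remaining terms correspond to the zero-order and conjugate-image contributions that practical holographic displays must separate or suppress.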
Integral imaging (or integral photography) is an incoherent 3D photographic technique which has been known since 1905. In retrospect, the technique of integral imaging can also be classified as a kind of holography, since it also aims to record and reproduce the physical light distribution. The basic principle is to record the incidence angle distribution of the incoming light at every point of recording, and then regenerate the same angular illumination distribution by proper back projection. The same effect is achieved in conventional holography by recording the phase and amplitude information simultaneously, instead of the intensity-only recording of conventional photography. The phase information is recorded using interference, and therefore holographic recordings require coherent light (lasers). Intensity recording, such as with common optical emulsion or digital photography, loses the direction information.

It is helpful to keep in mind the distinction between 3D displays and 3D television (3DTV). We use the term 3D display to refer to imaging devices which create 3D perception as their output. 3DTV refers to the whole chain of 3D image acquisition, encoding, transport/broadcasting, reception, as well as display. We have so far mostly discussed the display end of 3DTV technology. An end-to-end 3DTV system requires not only display, but also capture and transmission of the 3D content. Some means of 3D capture were already implicit in our discussion of displays. For example, stereoscopic 3DTV involves a stereoscopic camera, which is nothing but two cameras rigidly mounted side by side with appropriate separation. The recording process in integral imaging is achieved using microlens arrays, whereas holographic recording employs coherent light and is based on optical interference. In these conventional approaches, the modality of 3D image capture is directly related to that of 3D image reconstruction, with the reconstruction process essentially amounting to reversal of the capture process. In contrast, current research in 3DTV is targeting a quite different approach in which the input capture and output display modalities are completely decoupled and bridged by digital representation and processing.

In recent years, tremendous effort has been invested worldwide to develop convincing 3DTV systems, algorithms, and applications. This includes improvements over the whole processing chain, including image acquisition, three-dimensional representation, compression, transmission, signal processing, interactive rendering, and display (Fig. 1.2). The overall design has to take into account the strong interrelations between the various subsystems. For instance, an interactive display that requires random access to three-dimensional data will affect the performance of a coding scheme that is based on data prediction.

The choice of a certain three-dimensional scene representation format is of central importance for the design of any 3DTV system. On the one hand, it sets the requirements for acquisition and signal processing. On the other hand, it determines the rendering algorithms, the degree and mode of interactivity, as well as the need for and means of compression and transmission. Various three-dimensional scene representations are already known from computer graphics and may be adapted to 3DTV systems as well. These include different types of data representations, such as three-dimensional mesh models, multiview video, per-pixel depth, or holographic data representations. Different capturing systems which may be considered include multi-camera systems, stereo cameras, lidar (depth) systems, or holographic cameras. Different advanced signal processing algorithms may be involved on the sender side, including three-dimensional geometry reconstruction, depth estimation, or segmentation, in order to transform the captured data into the selected three-dimensional scene representation.
Fig. 1.2. Functional blocks of an end-to-end 3DTV system (from L. Onural, H. M. Ozaktas, E. Stoykova, A. Gotchev, and J. Watson, An overview of the holographic display related tasks within the European 3DTV project, in Photon Management II: SPIE Proceedings 6187, 2006)
Specific compression algorithms need to be applied to the different data types. Transmission over different channels requires different strategies. The vast amount of data, and the user interaction required for the FVV functionality essential to many systems, complicate this task even further. On the receiver side, the data needs to be decoded, rendered, and displayed. In many cases this may require specific signal conversion and display adaptation operations. Interactivity needs to be taken care of. Finally, the images need to be displayed. Autostereoscopic displays have already been mentioned, but there are also other, more ambitious types of displays, including volumetric displays, immersive displays and, of course, holographic displays. For those who have set their eyes on the ambitious applications of three-dimensional imaging, the fully-interactive, full parallax, high-resolution holographic display is the ultimate goal. Whether or not this is achievable depends very much on the ability to efficiently handle the vast amounts of raw data required by a full holographic display and the ability to exploit the rapid developments in optical technologies.

Current end-to-end 3DTV systems require tightly coupled functional units: the display and the capture unit must be designed together, and therefore compression algorithms are also quite specific to the system. However, it is quite likely that in future 3DTV systems, the techniques for 3D capture and 3D display will be totally decoupled from each other. It is currently envisioned that the information provided by the capture device will provide the basis for the computerized synthesis of the 3D scene. This synthesis operation will heavily utilize 3D computer graphics techniques (which are commonly used in computer animations) to assemble 3D scene information from multiple-camera views or other sets of complementary data. However, instead of synthetic data, the 3D scene information will be created from a real-life scene. Many techniques have been developed for the capture of 3D scene information. A common technique is based on shooting the scene simultaneously from different angles using multiple conventional 2D cameras. Camera arrays with up to 128 cameras have been discussed in the literature. However, acceptable quality 3D scene information can be captured using a much smaller number of cameras, especially if the scene is not too complex. The synthesized 3D video, created from the data provided by the capture unit, can then be either transmitted or stored.

An important observation is that 3D scenes actually carry much less information than one may initially think. The difference between 2D images and 3D images is not so much like the difference between a 2D array and a 3D array of numbers, since most objects are opaque and, in any event, our retinas are two-dimensional detectors. The difference is essentially the additional information associated with depth and parallax. Therefore, 3D video is highly compressible. Special purpose compression techniques have already been reported in the literature and research in this area is ongoing.
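To see why this compressibility matters, a rough back-of-envelope calculation (our own illustrative numbers, not figures from the chapter) compares raw data rates of a multiview rig against a single view plus an 8-bit depth map:

```python
# Back-of-envelope raw data rates; all format parameters are assumptions.
WIDTH, HEIGHT, FPS = 1280, 720, 25   # hypothetical camera format
RGB_BYTES = 3                        # 8-bit RGB

def raw_mbps(streams: int, bytes_per_pixel: float = RGB_BYTES) -> float:
    """Uncompressed rate in megabits per second."""
    return streams * WIDTH * HEIGHT * FPS * bytes_per_pixel * 8 / 1e6

print(f"single 2D view:      {raw_mbps(1):7.0f} Mbit/s")    # ~553
print(f"8-camera multiview:  {raw_mbps(8):7.0f} Mbit/s")    # ~4424
# One color view plus one 8-bit depth map = 4 bytes/pixel in total:
print(f"video + depth:       {raw_mbps(1, 4):7.0f} Mbit/s")  # ~737
```

Even before any coding, exploiting the depth-and-parallax structure shrinks the payload by a factor of about six in this example relative to sending all views; dedicated 3D coders (the subject of Chaps. 8, 9, and 11) then compress what remains.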
Transmission of such data is not too different from the transmission of conventional video. For example, video streaming techniques which are commonly used over the Internet can easily be adapted to the 3D case. Nevertheless, such adaptation does require some care, as the usability of incomplete 3D video data is totally different from the usability of incomplete 2D video, and packet losses are common in video streaming.

In order for the display to show the 3D video, the received data in abstract form must first be translated into driving signals for the specific 3D display device to be used. In some cases, this can be a challenging problem requiring considerable processing. Development of signal processing techniques and algorithms for this purpose is therefore crucial for the successful realization of 3DTV. Decoupling of image acquisition and display is advantageous in that it can provide complete interoperability by enabling the display of the content on totally different display devices with different technologies and capabilities. For instance, it may be possible to feed the same video stream to a high-end holographic display device, a low-end stereoscopic 3D monitor, or even a regular 2D monitor. Each display device will receive the same content, but will have a different signal processing interface for the necessary data conversion.
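The interoperability just described suggests a software structure in which a common abstract stream is handed to display-specific back ends. The toy interface below is purely our illustration; none of these class or method names come from this book or from any standard, and 'scene' stands in for whatever abstract 3D representation is received.

```python
from abc import ABC, abstractmethod

class DisplayAdapter(ABC):
    """Converts a display-independent 3D scene stream into device signals."""
    @abstractmethod
    def render(self, scene):
        ...

class Monitor2DAdapter(DisplayAdapter):
    def render(self, scene):
        return scene.view(0)                 # regular 2D monitor: one view

class StereoAdapter(DisplayAdapter):
    def render(self, scene):
        return scene.view(0), scene.view(1)  # glasses-based stereo pair

class MultiviewAdapter(DisplayAdapter):
    def __init__(self, n_views: int = 9):
        self.n_views = n_views               # e.g., a nine-view lenticular panel
    def render(self, scene):
        # Synthesize the intermediate views the panel needs from the same stream.
        return [scene.synthesize_view(i / (self.n_views - 1))
                for i in range(self.n_views)]
```

Each adapter consumes identical content; only the conversion differs, which is exactly the decoupling property argued for above.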
In the near future, it is likely that multiview video will be the common mode of 3DTV delivery. In multiview video, a large amount of 2D video data, captured in parallel from an array of cameras shooting the same scene from different angles, will be directly coded by exploiting the redundancy of the data, and then streamed to the receiver. The display at the receiving end, at least in the short term, will then create the 3D scene autostereoscopically. (In the long term, the autostereoscopic display may be replaced with volumetric or holographic displays.) Standardization activities for such a 3DTV scheme are well underway under the International Organization for Standardization Moving Picture Experts Group (ISO MPEG) and International Telecommunication Union (ITU) umbrellas.

Countless applications of 3D video and 3DTV have been proposed. In addition to household consumer video and TV, there are many other consumer applications in areas such as computer games and other forms of entertainment, and video conferencing. Non-consumer applications include virtual reality applications, scientific research and education, industrial design and monitoring, medicine, art, and transportation. In medicine, 3DTV images may aid diagnosis as well as surgery. In industry, they may aid the design and prototyping of machines or products involving moving parts. In education and science, they may allow unmatched visualization capability.

Advances in this area will also be closely related to advances in the area of interactive multimedia technologies in general. While interactivity is a different concept from three-dimensionality, since both are strong trends, it is likely they will overlap, and it will not be surprising if the first 3DTV products also feature a measure of interactivity. Indeed, since interactivity may also involve immersion into the scene, and three-dimensionality is an important aspect of the perception of being immersed in a scene, the connections between the two trends may be greater than might be thought at first.

Although the goals are clear, there is still a long way to go before we have widespread commercial high-quality 3D products. A diversity of technologies is necessary to make 3DTV a reality. Successful realization of such products will require significant interdisciplinary work. The scope of this book reflects this diversity. To better understand where each chapter fits in, it is helpful to again refer to the block diagram in Fig. 1.2.

Chapter 2 presents a novel operational end-to-end prototype 3DTV system with all its functional blocks. The system is designed to operate over a terrestrial Digital Multimedia Broadcasting (T-DMB) infrastructure for delivery to mobile receivers.

Chapters 3, 4, and 5 deal with different problems and approaches associated with the capture of 3D information. In Chap. 3, a novel 3D human motion capture system, using simultaneous multiple video recordings, is presented after an overview of various human motion capture systems. Chapter 4 shows that it is possible to construct 3D objects from stereo data by utilizing texture information. A totally different 3D shape capture technique, based on pattern projection, is presented in detail in Chap. 5.

Representation of dynamic 3D scenes is essential, especially when the capture and display units are decoupled. In decoupled operation, the data captured by the input unit is not directly forwarded to the display; instead, an intermediate 3D representation is constructed from the data. Then, this representation is used for display-specific rendering at the receiving end. Chapters 6 and 7 present examples of representation techniques within an end-to-end 3DTV system. In Chap. 6, a detailed overview of modeling, animation, and rendering techniques for 3D is given. Chapter 7, on the other hand, details a representation for the more specific case where the object is a moving human figure.

Novel coding or compression techniques for 3DTV are presented in Chaps. 8 and 9. Chapter 8 deals specifically with the compression of 3D dynamic wire-mesh models. Compression of multi-view video data is the focus of Chap. 9, which provides the details of an algorithm closely related to ongoing standardization activities.

Transport (transmission) of 3DTV data requires specific techniques which are distinct from their 2D counterparts. Issues related to streaming 3D video are discussed in Chap. 10. Chapter 11 discusses the adaptation of the multiple description coding technique to 3DTV.

Watermarking of conventional images and video has been widely discussed in the literature. However, the nature of 3D video data requires novel watermarking techniques specifically designed for such data. Chapter 12 discusses 3D watermarking techniques and proposes novel approaches for this purpose.

Different display technologies for 3DTV are presented in Chaps. 13, 14, and 15. Chapter 13 gives a broad overview of the history of domestic 3DTV displays together with contemporary solutions. Chapter 14 describes an immaterial pseudo-3D display with 3D interaction, based on the unique commercial 2D floating-in-the-air fog-based display.
immaterial pseudo-3D display with 3D interaction, based on the unique commercial 2D floating-in-the-air fog-based display. Chapter 15 gives an overview of the state of the art in spatial light modulator based holographic 3D displays. Chapter 16 discusses in detail the physical and chemical properties of novel materials for dynamic holographic recording and 3D display.

Finally, the last chapter discusses consumer, social, and gender issues associated with 3DTV. We believe that early discussion and investigation of these issues are important for many reasons. Discussion of consumer issues will help the evaluation of the technologies and potential products and guide developers, producers, sellers, and consumers. Discussion of social and gender issues may help shape public decision making and allow informed consumer choices. We believe that it is both an ethical and a social responsibility for scientists and engineers involved in the development of a technology to be aware of, and contribute to awareness regarding, such issues.

We believe that this collection of chapters provides good coverage of the diversity of topics that collectively underlie the modern approach to 3DTV. Though it is not possible to cover all relevant issues in a single book, we believe this collection provides a balanced exposure for those who want to understand the basic building blocks of 3DTV systems from a broad perspective. Readers wishing to further explore the areas of 3D video and television may also wish to consult four recent collections of research results [1, 2, 3, 4] as well as a series of elementary tutorials [5].

Parts of this chapter appeared in or were adapted from [6] and [7]. This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. O. Schreer, P. Kauff, and T. Sikora, editors. 3D Videocommunication: Algorithms, Concepts and Real-Time Systems in Human Centred Communication. Wiley, 2005.
2. B. Javidi and F. Okano, editors. Three-Dimensional Television, Video, and Display Technologies. Springer, 2002.
3. M. R. Civanlar, J. Ostermann, H. M. Ozaktas, A. Smolic, and J. Watson, editors. Special issue on three-dimensional video and television. Signal Processing: Image Communication, Vol. 22, issue 2, pp. 103-234, February 2007.
4. B. Javidi and F. Okano, editors. Special issue on 3-D technologies for imaging and display. Proceedings of the IEEE, Vol. 94, issue 3, pp. 487-663, March 2006.
5. K. Iizuka. Welcome to the wonderful world of 3D (4 parts). Optics and Photonics News, Vol. 17, no. 7, p. 42, 2006; Vol. 17, no. 10, p. 40, 2006; Vol. 18, no. 2, p. 24, 2007; Vol. 18, no. 4, p. 28, 2007.
6. L. Onural. Television in 3-D: What are the prospects? Proceedings of the IEEE, Vol. 95, pp. 1143-1145, 2007.
7. M. R. Civanlar, J. Ostermann, H. M. Ozaktas, A. Smolic, and J. Watson. Special issue on three-dimensional video and television (guest editorial). Signal Processing: Image Communication, Vol. 22, pp. 103-107, 2007.
2 A Backward-compatible, Mobile, Personalized 3DTV Broadcasting System Based on T-DMB

Hyun Lee, Sukhee Cho, Kugjin Yun, Namho Hur and Jinwoong Kim

Electronics and Telecommunications Research Institute (ETRI), 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-350, Republic of Korea
2.1 Introduction

Mobile reception of broadcasting services has recently received much attention worldwide; digital multimedia broadcasting (DMB), digital video broadcasting-handheld (DVB-H), and MediaFLO are examples. Among them, commercial terrestrial DMB (T-DMB) service was launched in Korea in 2005, the first in the world to provide mobile multimedia services. The Telecommunication Technology Association (TTA) of Korea and ETSI in Europe have established a series of specifications for T-DMB video and data services based on the Eureka-147 digital audio broadcasting (DAB) system [1, 2, 3, 4].

Multimedia-capable mobile devices are becoming the core of the portfolio of multimedia information generation and consumption platforms. People are expected to depend more and more on mobile devices for their access to and use of multimedia information. Various acquisition and display technologies are also being developed to meet the ever-increasing demand of users for higher-quality multimedia content. Reality is one of the major criteria for judging quality, and there have been many research activities on 3DTV and ultra-high definition TV (UDTV) concepts and systems [12, 13, 14]. UDTV is the result of a research direction aiming at immersive reality through ultra-high image resolution, wider color gamut coverage, and a big screen for a frame-less feeling. Research on 3DTV, on the other hand, pursues the direction of providing the feeling of depth, especially by exploiting human stereopsis. A 3D display that requires no glasses and gives a natural feeling of depth without eye fatigue has long been the goal of 3DTV researchers, and it remains a very challenging area. Even though a perfect 3D display is still a long way off, recent developments in display technology allow us to implement high-quality autostereoscopic 3D displays on small multimedia devices with reasonable cost overhead.

Providing mobility and increased reality in multimedia information services is thus a promising direction for the future. Specifically, 3D AV service
over the T-DMB system is attractive because (1) glasses-free 3D viewing on a small display is relatively easy to implement and well suited to a single-user environment like T-DMB; (2) T-DMB is a new medium and thus has more flexibility for adding new services on top of existing ones; and (3) the 3D AV handling capability of 3D T-DMB terminals has great potential to generate new types of services if combined with other components such as a built-in stereo camera. We believe that a portable, personal 3D T-DMB system will be a valuable stepping stone towards realizing the ideal 3DTV for multiple users at home.

In this chapter, we investigate various issues in implementing 3D AV services on the T-DMB system. There are four major issues in the new system development: (1) content creation, (2) compression and transmission, (3) display, and (4) service scenarios and business models. We first look into the system and functional requirements of the 3D T-DMB system, and then propose solutions for a cost-effective and flexible system implementation that meets these requirements. In Sect. 2.2, we give an overview of the T-DMB system in terms of system specification and possible services. In Sect. 2.3, the 3D T-DMB requirements and major implementation issues are described together with efficient solutions. In Sect. 2.4, the results of the system implementation along with simulation results are presented. In Sect. 2.5, various service scenarios for 3D AV content and business models are covered. Finally, we draw conclusions in Sect. 2.6.
2.2 Overview of the T-DMB System

The T-DMB system [3] is based on the Eureka-147 DAB system¹, which in particular serves as its physical layer. Due to the robustness of coded orthogonal frequency division multiplexing (COFDM) technology against multi-path fading, mobile receivers moving at high speed can reliably receive multimedia data that are sensitive to channel errors [5]. The enhancements over Eureka-147 DAB for the video service are achieved by highly efficient source coding technology for compressing multimedia data, object-based multimedia handling technology for interactivity, synchronization technology for the video, audio, and auxiliary data streams, multiplexing technology for combining the media streams, and enhanced channel error correction technology.
¹ European countries started research on DAB technology by establishing the Eureka-147 joint project at the end of the 1980s. The objective of the project was to develop a DAB system capable of providing high-quality audio service with mobile reception. In 1994, ETSI (European Telecommunications Standards Institute) adopted the basic DAB standard ETSI EN 300 401 [1]. The ITU-R issued Recommendations BS.1114 and BO.1130 relating to satellite and terrestrial digital audio broadcasting, recommending the use of Eureka-147 DAB, referred to as 'Digital System A', in 1994.
2.2.1 T-DMB Concept

The core concepts of T-DMB are personality, mobility, and interactivity. Personality means that T-DMB can provide individual users with personal services on portable devices (mobile phones, PDAs, laptop PCs, and DMB receivers). Mobility, another important concept of T-DMB, means that it offers seamless reception of broadcast content anytime, anywhere. Last but not least is interactivity, which enables bidirectional services linked with the mobile communication network; examples of such services are pay-per-view (PPV), on-line shopping, and Internet access.

2.2.2 T-DMB Protocol Stack

The broadcasting protocol stack supported in T-DMB is shown in Fig. 2.1. T-DMB accommodates audio and data services as well as a video service incorporating MPEG-2 and MPEG-4 technologies. For data services, Eureka-147 DAB supports a variety of transport protocols such as the multimedia object transfer (MOT) protocol, IP tunneling, and the transparent data channel (TDC) [2]. On top of these protocols, the DAB system can transport various kinds of data, such as:

• Program-associated data (PAD): text information related to the audio program, such as background facts, a menu of future broadcasts, and advertising information, as well as visual information such as singer images or CD covers;
• Non-program-associated data (NPAD): travel and traffic information, headline news, stock market prices, and weather forecasts, among other things.
Fig. 2.1. T-DMB broadcasting protocol stack [17]

2.2.3 Video Services

The main service of T-DMB is the video service, rather than the basic CD-quality audio and data services. Figure 2.2 shows how the various types of services are
composed for transmission.

Fig. 2.2. The conceptual transmission architecture for the video service²

There are two transmission modes for the visual data service: the packet mode data channel and the stream mode data channel.

• Packet mode: This is the basic data transport mechanism. The data are organized into data groups, each consisting of a header, a data field of up to 8191 bytes, and, optionally, a cyclic redundancy check (CRC).
• Stream mode: This mode provides a constant data rate at a multiple of 8 kbps or 32 kbps, depending on the coding profile. T-DMB video service data are normally carried in stream mode².
Details of the construction of the video service are shown in Fig. 2.3, which is the internal structure of the video multiplexer of Fig. 2.2. The video, audio, and auxiliary data streams which make up a video service are multiplexed into an MPEG-2 TS, and outer error-correction bits are added. The multiplexed and outer-coded stream is transmitted over the stream mode data channel. The initial object descriptor (IOD) generator creates the IOD, and the object descriptor (OD)/binary format for scene description (BIFS) generator creates OD/BIFS streams that comply with ISO/IEC 14496-1 [6]. Advanced video coding (AVC, MPEG-4 Part 10), which has high coding efficiency for multimedia broadcasting services at a low data transfer rate, is used to encode the video content, and bit sliced arithmetic coding (BSAC) is used to encode the audio content. The video and audio encoders generate encoded bit streams compliant with the AVC Baseline profile and BSAC, respectively. BIFS is also adopted to encode interactive data related to the video content. Each media stream is first encapsulated into an MPEG-4 sync layer (SL) packet stream, compliant with the ISO/IEC 14496-1 Systems standard [6]. The section generator creates sections compliant with ISO/IEC 13818-1 [8] for the input IOD, OD, and BIFS. Each PES packetizer then generates a PES packet stream compliant with ISO/IEC 13818-1 for each SL packet stream. The TS multiplexer combines the input sections and PES packet streams into a single MPEG-2 transport stream complying with ISO/IEC 13818-1. The MPEG-2 transport stream is encoded for forward error correction using Reed-Solomon coding and convolutional interleaving, and finally fed into the DAB sub-channel as a stream data service component.

² © European Telecommunications Standards Institute 2006. © European Broadcasting Union 2006. Further use, modification, and redistribution are strictly prohibited. ETSI standards are available from http://pda.etsi.org/pda/ and http://www.etsi.org/services products/freestandard/home.htm

Fig. 2.3. The conceptual architecture of the video multiplexer for a video service²
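The order of operations in Fig. 2.3 can be summarized in a short sketch. The following Python fragment is our own toy illustration, not a reference implementation: the framing helpers use placeholder headers and dummy Reed-Solomon parity, and only the processing sequence (SL packetization, section/PES generation, TS multiplexing, outer coding) is meant to be accurate.

    def sl_packetize(es: bytes, payload_size: int = 512) -> list:
        # Split an elementary stream into SL packet payloads (toy framing,
        # not the real ISO/IEC 14496-1 SL header layout).
        return [es[i:i + payload_size] for i in range(0, len(es), payload_size)]

    def pes_packetize(sl_packets: list, stream_id: int) -> list:
        # Wrap each SL packet into a PES packet (1-byte toy header instead
        # of the real ISO/IEC 13818-1 PES header).
        return [bytes([stream_id]) + p for p in sl_packets]

    def ts_multiplex(sections: list, pes_streams: list) -> list:
        # Combine the sections (IOD/OD/BIFS) and PES packet streams into one
        # transport stream; a real multiplexer schedules by timing and
        # bitrate, here we simply round-robin.
        out = list(sections)
        for packets in zip(*pes_streams):
            out.extend(packets)
        return out

    def outer_code(ts_packets: list) -> list:
        # RS(204,188): each 188-byte TS packet gains 16 parity bytes (dummy
        # zeros here; a real encoder computes them over GF(2^8)); the result
        # then passes through the convolutional interleaver.
        return [p[:188].ljust(188, b"\x00") + b"\x00" * 16 for p in ts_packets]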
2.3 Requirements and Implementation Issues of 3D T-DMB

2.3.1 3D T-DMB Requirements

The requirements of the 3D T-DMB system are as follows:

(1) Backward compatibility: Like other broadcasting services, the new 3D T-DMB service should be backward-compatible with the existing 2D T-DMB. This means that users with 2D T-DMB terminals should be able to receive 3D T-DMB services and view the content on their 2D displays. 3D T-DMB is based on the stereoscopic presentation of 3D visual scenes, and the video information can easily be represented as a reference 2D
video plus some type of additional information. 2D T-DMB receivers can use only the reference video information for 2D presentation. 3D T-DMB receivers, on the other hand, can use both data to generate stereoscopic video (a left-view and a right-view image for each video frame) and render it on a 3D display.

(2) Forward compatibility: This means that new 3D T-DMB terminals can receive 2D T-DMB services and view 2D visual information. This basically requires that the display of a 3D T-DMB terminal can be switched between 2D and 3D modes. The 3D T-DMB terminals should have a 2D mode in which they function exactly as 2D T-DMB terminals.

(3) 2D/3D switchable display: This is one of the essential requirements of the 3D T-DMB system, not only for the forward compatibility mentioned in (2) above, but also for providing various 2D/3D hybrid visual services. The latter will be explained in detail in Sect. 2.5.

(4) Low transmission overhead: The T-DMB system has a limited bit budget for signal transmission. Table 2.1 shows the available bandwidth for each operating mode, and Table 2.2 shows typical service allocations of T-DMB broadcasters in Korea. There is a trade-off between lowering the overhead bitrate and the 3D visual quality presented to the users. It is therefore very important to use highly efficient video compression schemes as well as efficient multiplexing and transmission schemes.

Table 2.1. Available bitrate for each protection level [1]

Protection level                         1-A   2-A   3-A    4-A    1-B    2-B    3-B    4-B
Convolutional coding rate                1/4   3/8   1/2    3/4    4/9    4/7    2/3    4/5
Available bitrate per ensemble (kbps)    576   864   1152   1728   1024   1312   1536   1824
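As a quick sanity check on Table 2.1 (our own back-of-the-envelope calculation, assuming the nominal 2304 kbps gross capacity of the DAB main service channel in transmission mode I):

    GROSS_MSC_KBPS = 2304  # nominal DAB main service channel capacity
    coding_rates = {"1-A": 1/4, "2-A": 3/8, "3-A": 1/2, "4-A": 3/4,
                    "1-B": 4/9, "2-B": 4/7, "3-B": 2/3, "4-B": 4/5}
    for level, rate in coding_rates.items():
        # 2-B and 4-B deviate slightly from Table 2.1, presumably due to
        # sub-channel size granularity; the other levels match exactly.
        print(level, round(GROSS_MSC_KBPS * rate), "kbps")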
Table 2.2. Typical service allocation of T-DMB broadcasters in Korea (Protection Level: 3-A)

Broadcaster   Service   Number of Channels   Bitrate (kbps)   Contents
KBS           Video     1                    544              KBS1 TV
              Audios    3                    128              Music, Business News
MBC           Video     1                    544              MBC TV
              Audios    3                    128              Radio, Business News
SBS           Video     1                    544              SBS TV
              Audios    3                    128              Radio, Traffic Information
YTN DMB       Video     1                    512              YTN TV
              Audios    2                    160              Music, Traffic Information
              Data      1                    329              Data
Korea DMB     Video     1                    544              Korea DMB TV
              Audios    2                    128              Music, Culture
U1 Media      Videos    2                    512/544          U1 Media TV/KBS2 TV

In addition to these requirements, other factors such as ease of content creation, flexibility in adapting to display evolution, and overall system safety in terms of '3D eye strain' should also be carefully taken care of in order to make 3D T-DMB a platform for viable and long-lasting 3D services.

2.3.2 3D T-DMB System Architecture and Transport

The 3D T-DMB system provides stereoscopic video as well as 3D surround audio. Stereoscopic video can be represented either as a video plus corresponding depth information [7] or as two (left and right) videos. Though the former has advantages in terms of flexibility and efficiency, the difficulty of acquiring accurate depth information for general scenes is still a big challenge. In our system, therefore, the video input is defined as two video signals which are to be compressed and transmitted. The system is designed to handle and carry 3D surround sound as well, which is rendered adaptively to the speaker configurations of 3D T-DMB
terminals (for example, stereo for portable devices and 5.1 channels in a car audio environment). Figure 2.4 shows the internal structure of the 3D T-DMB transmitting server. The proposed 3D T-DMB media processor consists of 3D video encoding, 3D audio encoding, MPEG-4 Systems encoding, MPEG-4 over MPEG-2 encapsulation, and channel coding parts. The MPEG-4 Systems encoding part is modified from its 2D counterpart so that it can generate SL packets which include the 3D T-DMB signals in a backward-compatible way. The MPEG-4 over MPEG-2 encapsulator converts SL packets to MPEG-2 TS packets; note that the program specific information (PSI) is also utilized in making MPEG-2 TS packets in the same block.

Now we look into the crucial idea of representing AV objects in MPEG-4 Systems in more detail. In a 3D T-DMB system, we have four AV objects in total, i.e., Vl, Va, As, and Aa, which are the left video, additional video, stereo audio, and additional audio, respectively. To achieve backward compatibility, we propose a scheme using two ODs, each consisting of two ESs. The two ODs are independent, but the two ESs within each OD are dependent. The dependence of the ESs is indicated simply by assigning the boolean value 'TRUE' to StreamDependenceFlag and by assigning the ES_ID of the reference stream to dependsOn_ES_ID in the ES_Descriptor, as scripted in Fig. 2.5. Next, according to the definition of MPEG-4 Systems, we assign 0x21 and 0x40 to the objectTypeIndication of Vl and As in the DecoderConfigDescriptor, respectively. In the case of Va and Aa, on the other hand, we assign 0xC0 and 0xC1 to the objectTypeIndication, indicating user-private streams in MPEG-4 Systems. Note that the dependent ESs of Va and Aa are ignored by current 2D T-DMB receivers, disturbing none of their functions, but are identified by new 3D T-DMB receivers.
Fig. 2.4. Block diagram of the T-DMB system including 3D AV services
ObjectDescriptor {  // OD for 3D Video
    ObjectDescriptorID 3
    esDescr [  // Description for Video (Left-view images) ES
        ES_Descriptor {
            ES_ID 3
            OCRstreamFlag TRUE
            OCR_ES_ID 5
            muxInfo muxInfo { ... }
            decConfigDescr DecoderConfigDescriptor {
                streamType 4  // Visual Stream
                bufferSizeDB 15060000
                objectTypeIndication 0x21  // reserved for ISO use
                decSpecificInfo DecoderSpecificInfoString { ... }
            }
            slConfigDescr SLConfigDescriptor { ... }
        }
    ]
    esDescr [  // Description for 3D Additional Video Data (Right-view images) ES
        ES_Descriptor {
            ES_ID 4
            StreamDependenceFlag TRUE
            dependsOn_ES_ID 3
            OCRstreamFlag TRUE
            OCR_ES_ID 5
            muxInfo muxInfo { ... }
            decConfigDescr DecoderConfigDescriptor {
                streamType 4  // Visual Stream
                bufferSizeDB 15060000
                objectTypeIndication 0xC0  // User Private
                decSpecificInfo DecoderSpecificInfoString { ... }
            }
            slConfigDescr SLConfigDescriptor { ... }
        }
    ]
}
Fig. 2.5. Backward compatible OD scheme for 3D video
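To illustrate why this signalling remains backward compatible, the following sketch (our illustration; the function and variable names are hypothetical and not part of any T-DMB specification) shows how a terminal may select streams purely by objectTypeIndication:

    KNOWN_2D_TYPES = {0x21, 0x40}    # AVC video, MPEG-4 audio
    PRIVATE_3D_TYPES = {0xC0, 0xC1}  # user-private values assigned to Va and Aa

    def select_streams(es_descriptors, is_3d_terminal):
        # es_descriptors: iterable of (ES_ID, objectTypeIndication) pairs.
        # A 2D terminal skips the user-private dependent streams, so the
        # reference (left) video and stereo audio play back unchanged.
        selected = []
        for es_id, oti in es_descriptors:
            if oti in KNOWN_2D_TYPES:
                selected.append(es_id)               # always decoded
            elif oti in PRIVATE_3D_TYPES and is_3d_terminal:
                selected.append(es_id)               # extra view / surround data
            # otherwise: unidentified type, ignored like a 2D T-DMB receiver
        return selected

    # The two video ESs of Fig. 2.5 carry (3, 0x21) and (4, 0xC0):
    assert select_streams([(3, 0x21), (4, 0xC0)], is_3d_terminal=False) == [3]
    assert select_streams([(3, 0x21), (4, 0xC0)], is_3d_terminal=True) == [3, 4]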
2.3.3 Content Creation

Since the resolution of T-DMB video is QVGA (320 × 240), we basically need to capture or generate two QVGA video signals for stereoscopic 3D T-DMB. At the moment, most 2D T-DMB content is obtained by down-sampling content of SDTV (Standard Definition TV) or HDTV (High Definition TV) resolution. As the T-DMB service becomes more popular in the future, we can expect DMB-specific content to be produced as well. The down-sampling ratio ranges from 4:1 up to 27:1, which can remove details of the image in some critical regions. Thus, unlike the 2D DMB case, this down-sampling can lead to a major quality deterioration of the resulting 3D video. Another problem is that, due to the small disparity values permitted on 3D T-DMB displays, large disparities in HDTV scenes cause discomfort when they are converted to DMB content by simple down-sampling. Disparities in CG content should also be limited appropriately.

2.3.4 Video Compression

Three factors should be considered for 3D T-DMB video compression: the limited transmission bandwidth of the T-DMB system, backward compatibility with the existing 2D DMB service, and, finally, exploiting the characteristics of human stereopsis. The basic concept of compression technology for stereoscopic video is a spatio-temporal prediction structure, since there exists inter-view redundancy between the different views captured at the same time. So far, MPEG-2 MVP (multi-view video profile), MPEG-4 based visual coding using temporal scalability, and AVC/H.264 based multi-view video coding (MVC) have been the typical stereoscopic video coding (SSVC) schemes with a spatio-temporal prediction structure. Their prediction structures are basically the same, except for the structure derived from the multiple reference frames of AVC/H.264.

For 3D video encoding in our 3D T-DMB system, we propose residual-downsampled stereoscopic video coding (RDSSVC), which downsamples the residual data based on AVC/H.264. Figure 2.6 shows a block diagram of the stereoscopic video coding based on AVC/H.264. In order to guarantee monoscopic 2D video service over the conventional T-DMB system, left-view images are coded with a temporal-only prediction structure without downsampling, whereas right-view images are coded with the spatio-temporal prediction structure on AVC/H.264 shown in Fig. 2.7. Motion and disparity are estimated and compensated at the original resolution. The residual data representing the prediction errors are downsampled before the transform and quantization. The downsampled residual data are then transformed, quantized, and coded by CAVLC. In decoding, after the inverse quantization and inverse transform, the reconstructed residual data are upsampled to the original resolution and compensated with the pixels in the corresponding blocks of the reference pictures, which have already been decoded. The left-view sequence is encoded and decoded exactly following the H.264 specification.

Fig. 2.6. Structure of residual-downsampling based stereoscopic video coding

By down-sampling the residual data before transmission and then up-sampling at the receiving side, apparent visual quality degrades in return for bit savings. However, the final perceptual quality of the stereoscopic 3D video will be about the same as that of the 'full bitrate' version, due to the 'additive nature' of human stereopsis [15]. Using RDSSVC, we could reduce the resulting bitrates while maintaining the perceptual quality of the stereoscopic video.
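The core of the residual-downsampling idea can be sketched in a few lines of NumPy. The averaging and nearest-neighbour filters below are placeholder assumptions, since the exact resampling filters are not detailed here:

    import numpy as np

    def downsample_residual(residual: np.ndarray) -> np.ndarray:
        # Halve the horizontal resolution of the prediction residual by
        # averaging horizontally adjacent samples (placeholder filter).
        return 0.5 * (residual[:, 0::2] + residual[:, 1::2])

    def upsample_residual(half: np.ndarray) -> np.ndarray:
        # Restore the original width by sample repetition (placeholder filter).
        return np.repeat(half, 2, axis=1)

    # Schematic right-view coding of one block:
    #   pred     = motion/disparity-compensated prediction at full resolution
    #   res      = block - pred
    #   res_half = downsample_residual(res)   # then 4x4 transform, Q, CAVLC
    #   ...decoder: entropy decode, IQ, inverse transform...
    #   rec      = pred + upsample_residual(res_half)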
Fig. 2.7. Structure of reference frames in stereoscopic video coding
2.3.5 Display

For the stereoscopic 3D T-DMB system, we use a parallax-barrier-type autostereoscopic LCD display. We implemented displays at two different resolutions: VGA (640 × 480) and QVGA (320 × 240). A parallax-barrier 3D display has several merits, such as easy implementation of the 2D/3D switching function, low cost overhead, and only a small increase in physical size. Though an autostereoscopic display without eye-tracking has a narrow, fixed viewing zone, the typical T-DMB usage pattern (holding the portable device in one hand and viewing the screen, usually for personal use) makes it a commercially acceptable display for stereoscopic image and video services. Allowing a larger viewing zone (in terms of viewing distance and head movement) while keeping crosstalk to a minimum is crucial to the acceptance of autostereoscopic displays. The VGA resolution has advantages over QVGA: higher perceptual quality and better suitability for 2D/3D hybrid services.

2.3.6 3D Audio

For the 3D audio signal, the 5.1-channel input signals are encoded in two paths: the first path is SSLCC (sound source location cue coding) encoding, which processes the input signal to generate a stereo audio As and additional surround information, and the second path is BSAC encoding of the stereo audio signal. If a T-DMB terminal has multi-channel speakers, the SSLCC-encoded information is added to the basic stereo audio to reproduce multi-channel 3D audio [16]. In addition to the 3D reproduction capability of the 3D T-DMB system, the BIFS-based object handling functions inherent in the MPEG-4 standards enable various interactive audio play modes.
2.4 Implementation and Simulation Results

2.4.1 System Implementation

Using a Pentium-IV PC and an autostereoscopic display, we implemented a prototype 3D T-DMB terminal. Figure 2.8 shows a photograph of the prototype system.

Fig. 2.8. Prototype of a 3D T-DMB receiver and its structure

The 3D T-DMB terminal consists of an MPEG-4 over MPEG-2 de-capsulator, an MPEG-4 Systems decoder, a 3D video decoder, a 3D audio decoder, and a scene compositor. The MPEG-4 over MPEG-2 de-capsulator reconstructs the SL packets of Vl, Va, As, Aa, and OD/BIFS. The PSI analyzer parses the IOD information and then hands it over to the IOD decoder. The MPEG-4 Systems decoding recovers the ESs of the 3D audio-visual signals as well as OD/BIFS from the SL packets. According to the OD information, both the 3D video decoder and the 3D audio decoder determine how to decode the signals. Next, the decoded video and audio signals enter the 3D video generator and the 3D audio
generator, respectively. If the display (VGA resolution) mode is set to '3D', the 3D video generator enlarges the left and right images to 320 × 480 each, by up-sampling them by a factor of two in the vertical direction, and interleaves the left and right video signals. In '2D' mode, the 3D video generator enlarges the left image to 640 × 480 and outputs the left video signal only. Similarly, the 3D audio generator produces multi-channel 3D audio signals by mixing As and Aa in
the case of '3D' mode. Normally, the stereo audio (As) is fed into the scene compositor. Finally, the scene compositor produces the synchronized video and audio signals.

2.4.2 Backward Compatibility

The backward compatibility issue of the 3D T-DMB system was stressed in the previous section. To verify this property of the proposed 3D T-DMB system, we tested it with 3D T-DMB bitstreams satisfying the new syntax and semantics explained above. We have verified that the proposed system satisfies the required backward compatibility with the T-DMB system. As mentioned previously, a conventional T-DMB receiver ignores the elementary streams for the additional video (Va) and the additional audio (Aa), because the 'OD/BIFS Decoder' in the 'MPEG-4 Systems Decoding' block simply ignores the Va and Aa streams with their unidentified (new) objectTypeIndication values.

2.4.3 Video Coding

It is also important to evaluate how much gain the proposed stereoscopic video coding achieves. Hence, we compared the performance of the proposed coding method with Simulcast and the conventional SSVC. The three coding methods were implemented using the JSVM (joint scalable video model) 4.4 reference software of H.264/AVC. We used two sequences, 'Diving' and 'Soccer', of 320 × 240 (QVGA) resolution with 240 frames as test sequences; they were captured by our stereoscopic camera. Figure 2.9 shows one frame each of the 'Diving' and 'Soccer' sequences. In Simulcast, the left- and right-view images are encoded as I-pictures once every second and the remaining images are encoded as P-pictures. In the temporal prediction, the three closest preceding frames are referenced, as shown in Fig. 2.10.
Fig. 2.9. One frame for ‘Diving’ (left) and ‘Soccer’ (right) sequences
Fig. 2.10. Structure of reference frames in Simulcast
For the proposed and the conventional stereoscopic video coding, left-view images are encoded as I-pictures once every second, while the remaining left-view images and all right-view images are encoded as P-pictures. For P-pictures, the structure of reference frames is shown in Fig. 2.7. The coding method for left-view images is exactly the same as in Simulcast. The coding of right-view images, on the other hand, includes temporal prediction from the two closest preceding frames and spatial (disparity) prediction from the left-view frame of the same time instant, in both the proposed and the conventional stereoscopic video coding. In the proposed coding, residual-downsampling is performed by halving the resolution of the residual data in the horizontal direction. In principle, residual-downsampling is applied to macroblocks with Inter prediction modes of size 8x4 and larger, because the transform operates on 4x4 blocks. In detail, the coding of left-view images includes all existing prediction modes of AVC without residual-downsampling: the Skip mode, the Inter modes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4), and the Intra modes (16x16, 4x4). In the coding of right-view images, on the other hand, residual-downsampling is applied or not according to the type of prediction mode: Skip and Intra modes are exempt, and macroblocks of the 4x8 and 4x4 types are also excluded from residual-downsampling.

Fig. 2.11. PSNR versus total bitrates of left and right-view sequences: (a) 'Diving' sequence; (b) 'Soccer' sequence

We show the coding results as RD (rate-distortion) curves of the proposed coding, the conventional stereoscopic video coding, and Simulcast in Figs. 2.11 and 2.12. In the experiments, coding bitrates were set by eleven QP values from 22 to 42 in steps of 2. For 3D T-DMB, the total coding bitrate available for 3D video is under 768 kbps, and the right-view video should be encoded under 384 kbps within this budget. Hence, Figs. 2.11 and 2.12 plot PSNR values versus coding bitrates under 768 kbps and under 384 kbps, respectively. Figure 2.11 shows the total coding efficiency for the left- and right-view sequences together. The proposed coding method is more efficient than the conventional stereoscopic video coding and Simulcast by up to 0.2 dB and by 0.2-2.2 dB, respectively, for both test sequences. The RD curves in Fig. 2.12 show the coding efficiency for the right-view sequence only. Here the proposed coding method is more efficient than the conventional stereoscopic video coding and Simulcast by up to 0.7 dB and by 0.2-2.3 dB, respectively. It can be noted in the RD curves that the gain of the proposed coding method grows as the coding bitrate decreases. As the bitrate becomes higher, the irreversible loss caused by residual downsampling cannot be recovered, regardless of how many bits are allocated. In practical situations, the bitrate of the right-view images can be reduced far below that of the left-view images, because human perception is dominated by the high-quality component of a stereo pair. This can be achieved by separate rate control for each view. Hence, future work should study rate control and test the subjective quality of stereoscopic video coding in order to obtain better coding efficiency.
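For reference, the partition-dependent selection rule described above can be restated compactly; the mode names follow AVC/H.264, and the rationale in the comment is our reading of the 4x4-transform constraint:

    # Residual downsampling halves the block width, so only Inter partitions
    # that are at least 8 samples wide (16x16, 16x8, 8x16, 8x8, 8x4) remain
    # transformable as 4x4 blocks after halving; Skip, Intra, 4x8, and 4x4
    # keep full resolution.
    DOWNSAMPLED_INTER_MODES = {"16x16", "16x8", "8x16", "8x8", "8x4"}

    def use_residual_downsampling(partition: str, is_inter: bool) -> bool:
        return is_inter and partition in DOWNSAMPLED_INTER_MODES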
Fig. 2.12. PSNR versus bitrates of right-view sequence: (a) 'Diving' sequence; (b) 'Soccer' sequence
2.5 3D T-DMB Service Scenarios and Business Models

The 3D visual services of the 3D T-DMB system can be divided into several categories. The main service type will be the full 3D AV service, which presents 3D video with accompanying 3D surround audio. Since 3D audio has its own merits, it may also be offered with or without 3D video. T-DMB broadcasters usually have more than one audio channel in their channel portfolio, and some users prefer audio channels to video channels. 3D images can be sent together with music programs so that users can enjoy a 3D image slide show while listening to music on the audio channels. In order not to exhaust users with eye-straining full 3D video programs, 2D and 3D visual content can be combined into hybrid types of content: a highlight scene of a program can be shown in eye-catching 3D video, while all other portions remain plain 2D video. For advertising videos, the main objects can be presented in 3D overlaid on conventional 2D visual content; it is well known that 3D visual scenes hold the viewer's attention longer than 2D presentations. A Picture-in-Picture (PiP) style could also be adopted for this kind of service. Figures 2.13 and 2.14 show examples of hybrid video services. For the various types of 3D or 2D/3D hybrid services to be supported by the 3D T-DMB system, issues like the data compression format, multiplexing and synchronization among several audio-visual objects, and methods of signalling/identification of the new service types must be addressed. We are currently working on these issues for an integrated 3D T-DMB system implementation.
Fig. 2.13. An example of a 2D/3D hybrid service in T-DMB (a partial 3D image over a background 2D image)
Fig. 2.14. An example of a 3D PiP service in T-DMB (2D PiP images over a background 3D image)
2.6 Concluding Remarks

We introduced a 3D T-DMB prototype system which can provide 3D AV services over T-DMB while maintaining backward compatibility with the existing T-DMB system. We implemented a prototype transmitting server and 3D T-DMB terminals, and have verified that the proposed concept of 3D AV services over T-DMB works well. Under the limited bandwidth and on the small display, subjective tests have shown that the developed system provides acceptable depth sensation and video quality. A similar approach could be applied to various applications such as terrestrial digital television, digital cable television, IPTV, and so on. T-DMB is a very attractive platform for successful commercial 3DTV trials due to its service characteristics: a small display, a single viewer, and a new medium. If the 3D T-DMB service is widely accepted, 'big' 3DTV at home will naturally be the next step.
Acknowledgement

This work was supported by the IT R&D program of MIC/IITA, [2007-S00401] Development of Glasses-free Single-user 3D Broadcasting Technologies.
References

1. ETSI EN 300 401 (2000) Radio Broadcasting Systems; Digital Audio Broadcasting (DAB) to Mobile, Portable and Fixed Receivers
2. ETSI EN 301 234 (2006) Digital Audio Broadcasting (DAB); Multimedia Object Transfer (MOT) Protocol
3. ETSI TS 102 428 (2005) Digital Audio Broadcasting (DAB); DMB Video Service; User Application Specification
4. TTAS.KO-07.0026 (2004) Radio Broadcasting Systems; Specification of the Video Services for VHF Digital Multimedia Broadcasting (DMB) to Mobile, Portable and Fixed Receivers
5. Hoeg W, Lauterbach T (2003) Digital Audio Broadcasting: Principles and Applications of Digital Radio. John Wiley & Sons, England
6. ISO/IEC 14496-1 (2001) Information Technology - Generic Coding of Audio-Visual Objects - Part 1: Systems
7. ISO/IEC CD 23002-3 (2006) Auxiliary Video Data Representation
8. ISO/IEC 13818-1 (2000) Information Technology - Generic Coding of Moving Pictures and Associated Audio Information: Systems, Amendment 7: Transport of ISO/IEC 14496 Data over ISO/IEC 13818-1
9. ITU-R Rec. BS.775-1 (1994) Multichannel Stereophonic Sound System With and Without Accompanying Picture
10. ITU-R Rec. BT.500-10 (2000) Methodology for the Subjective Assessment of the Quality of Television Pictures
11. Cho S, Hur N, Kim J, Yun K, Lee S (2006) Carriage of 3D Audio-Video Services by T-DMB. In: Proceedings of the International Conference on Multimedia & Expo (ICME), Toronto, pp. 2165-2168
12. Javidi B, Okano F (2002) Three-Dimensional Television, Video, and Display Technologies. Springer-Verlag, Berlin
13. Schreer O, Kauff P, Sikora T (2005) 3D Video Communication: Algorithms, Concepts and Real-Time Systems in Human Centered Communication. John Wiley, England
14. Sugawara M, Kanazawa M, Mitani K, Shimamoto H, Yamashita T, Okano F (2003) Ultrahigh-Definition Video System with 4000 Scanning Lines. SMPTE Journal 112:339-346
15. Pastoor S (1991) 3D Television: A Survey of Recent Research Results on Subjective Requirements. Signal Processing: Image Communication 4:21-32
16. Seo J, Moon HG, Beack S, Kang K, Hong JK (2005) Multi-channel Audio Service in a Terrestrial-DMB System Using VSLI-Based Spatial Audio Coding. ETRI Journal 27:635-638
17. Lee G, Yang K, Kim K, Hahm Y, Ahn C, Lee S (2006) Design of Middleware for Interactive Data Services in the Terrestrial DMB. ETRI Journal 28(5):652-655
3 Reconstructing Human Shape, Motion and Appearance from Multi-view Video

Christian Theobalt¹, Edilson de Aguiar¹, Marcus A. Magnor² and Hans-Peter Seidel¹

¹ MPI Informatik, Saarbrücken, Germany
² TU Braunschweig, Braunschweig, Germany
3.1 Introduction

In recent years, increasing research interest in the field of 3D video processing has been observed. The goal of 3D video processing is the extraction of spatio-temporal models of dynamic scenes from multiple 2D video streams. These scene models comprise descriptions of the shape and motion of the scene as well as of its appearance. With these dynamic representations at hand, one can display the captured real-world events from novel synthetic camera perspectives. In order to put this idea into practice, algorithmic solutions to three major problems have to be found: multi-view acquisition, scene reconstruction from image data, and scene display from novel viewpoints.

Human actors are presumably the most important elements of many real-world scenes. Unfortunately, it is well known to researchers in computer graphics and computer vision that both the analysis of the shape and motion of humans from video and their convincing graphical rendition are very challenging problems. To tackle these difficulties, we propose in this chapter three model-based approaches to capture the motion as well as the dynamic geometry of moving humans. By applying a fast dynamic multi-view texturing method to the captured time-varying geometry, we are able to render convincing free-viewpoint videos of human actors.

This chapter is a roundup of several algorithms that we have recently developed. It shall serve as an overview and make the reader aware of the most important research questions by illustrating them on state-of-the-art research prototypes. Furthermore, a detailed list of pointers to related work shall enable the interested reader to explore the field in greater depth. For all the proposed methods, human performances are recorded with only eight synchronized video cameras. In the first algorithmic variant, a template model is deformed to match the shape and proportions of the captured human actor, and it is made to follow the motion of the person by means of a marker-free optical motion capture approach. The second variant extends the first
one, and enables the estimation not only of the shape and motion parameters of the recorded subject, but also the reconstruction of dynamic surface geometry details that vary over time. The last variant shows how we can incorporate high-quality laser-scanned shapes into the overall workflow. Using any of the presented method variants, the human performances can be rendered in real-time from arbitrary synthetic viewpoints. Time-varying surface appearance is generated by means of dynamic multi-view texturing from the input video streams.

The chapter is organized as follows. We begin with details about the multi-view video studio and the camera system we employ for data acquisition, Sect. 3.2. In Sect. 3.3 we review the important related work. Our template body model is described in Sect. 3.4. The first marker-less algorithm variant is described in Sect. 3.5; here, the details of our silhouette-based analysis-through-synthesis approach are also explained. The second algorithmic variant, which enables capturing of time-varying surface details, is described in Sect. 3.6. Finally, Sect. 3.7 presents our novel approach to transfer the sequence of poses of the template model to a high-quality laser scan of the recorded individual. The nuts and bolts of the texturing and blending method are explained in Sect. 3.8. In Sect. 3.9, we present and discuss results obtained with each of the described algorithmic variants. We conclude the chapter with an outlook on future directions in Sect. 3.10.
3.2 Acquisition - A Studio for Multi-view Video Recording

The input to our system consists of multiple synchronized video streams of a moving person, so-called MVV sequences, which we capture in our multi-view video studio. The spatial dimensions of the room, 11 by 5 meters, are large enough to allow multi-view recording of dynamic scenes from a large number of viewpoints. The ceiling has a height of approximately 4 m. The walls and floor are covered with opaque black curtains and a carpet, respectively, to avoid indirect illumination in the studio. The studio features a multi-camera system that enables us to capture a volume of approx. 4 × 4 × 3 m with eight externally synchronized video cameras. We employ Imperx MDC-1004 cameras that feature a 1004 × 1004 CCD sensor with linear 12 bits-per-pixel resolution and a frame rate of 25 fps. The imaging sensors can be placed in arbitrary positions, but typically we resort to an approximately circular arrangement around the center of the scene. Optionally, one of the cameras is placed in an overhead position. The cameras are calibrated into a common coordinate frame. Color consistency across cameras is ensured by applying a color-space transformation to each camera stream. The lighting conditions in the studio are fully controllable. While a sequence is being recorded, the image data are captured in parallel by
eight frame grabber cards and streamed in real-time to a RAID system consisting of sixteen hard drives. Our studio now also features a Vitus Smart™ full-body laser scanner. It enables us to capture high-quality triangle meshes of each person prior to recording her with the camera system.
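A back-of-the-envelope calculation shows why such a storage backend is needed; we assume here that the 12-bit samples are streamed without packing overhead:

    cameras, width, height, bits_per_pixel, fps = 8, 1004, 1004, 12, 25
    bytes_per_frame = width * height * bits_per_pixel // 8  # ~1.5 MB per frame
    total_mb_per_s = cameras * bytes_per_frame * fps / 1e6
    print(f"sustained capture rate: {total_mb_per_s:.0f} MB/s")  # ~302 MB/s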
3.3 Related Work

Since the work presented here jointly solves a variety of algorithmic subproblems, we can capitalize on a huge body of previous work in the fields of optical human motion capture, optical human model reconstruction, image-based rendering, and mesh-based animation processing. We now give a brief overview of important related work in each of these fields.

3.3.1 Human Motion Capture

By far the most widely used commercial systems for human motion capture are marker-based optical acquisition setups [1]. They make use of the principle of moving light displays [2]. Optical markers, which are made either of a retroreflective material or of LEDs, are placed on the body of the tracked subject. Several special-purpose high-frame-rate cameras (often with specialized light sources) are used to record the moving person. The locations of the markers in the video streams are tracked, and their 3D trajectories over time are reconstructed by means of optical triangulation [3]. A kinematic skeleton is then matched to the marker trajectories to parameterize the captured motion in terms of joint angles [4]. The main algorithmic problems that have to be solved are the unambiguous optical tracking of the markers over time as well as the establishment of marker correspondences across multiple camera views [5]. Today, many commercial marker-based capturing systems are available, e.g. [6].

Although the accuracy of the measured motion data is fairly high, the application of marker-based systems is sometimes cumbersome and often impossible. The captured individuals typically have to wear special body suits. It is thus not possible to capture humans wearing normal everyday apparel, and therefore the captured video streams cannot be employed for further processing, e.g. texture reconstruction. However, for the application we have in mind, the latter is essential. Marker-free motion capture approaches bridge this gap and enable the capturing of human performances without special modification of the scene [7]. The principle, as well as the challenge, behind marker-free optical motion capture methods is to invert the nonlinear multi-modal map from the complex pose space to the image space by looking at specific features. Most methods in the literature use some kind of kinematic body model to track the motion. The models typically consist of a linked kinematic chain of bones and interconnecting joints, and are fleshed out with simple geometric primitives in
order to model the physical outline of the human body. Typical shape primitives are ellipsoids [8, 9], superquadrics [10, 11, 12], and cylinders [13, 14]. Implicit surface models based on metaballs are also feasible [15]. Many different strategies have been suggested to bring such a 3D body model into optimal accordance with the poses of the human in multiple video streams. Divide-and-conquer methods track each body segment separately using image features, such as silhouettes [16], and mathematically constrain their motion to preserve connectivity. Conceptually related are constraint propagation methods that narrow the search space of correct body poses by finding features in the images and propagating constraints on their relative placement within the model and over time [17, 18]. In [17], a general architectural framework for human motion tracking systems has been proposed which is still used in many marker-free capturing methods, e.g. analysis-by-synthesis. According to this principle, model-based tracking consists of a prediction phase, a synthesis phase, an image analysis phase, and a state estimation phase. In other words, at each time step of a motion sequence the capturing system first makes a prediction of the current pose, then synthesizes a view with the model in that pose, compares the synthesized view to the actual image, and updates the prediction according to this comparison. Tracking systems differ in the algorithmic strategy they employ at each stage.

Analysis-through-synthesis methods search the space of possible body configurations by synthesizing model poses and comparing them to features in the image plane. The misalignment between these features, such as silhouettes [15], and the corresponding features of the projected model drives a pose refinement process [19, 20, 21]. Physics-based approaches derive forces acting on the model which bring it into optimal accordance with the video footage [22, 23]. Another way to invert the measurement equation from pose to image space is to apply inverse kinematics [24], a process known from robotics which computes a body configuration that minimizes the misalignment between the projected model and the image data. Inverse kinematics inverts the measurement equation by linearly approximating it. The method in [25] fits a kinematic skeleton model fleshed out with cylindrical limbs to one or several video streams of a moving person. A combination of a probabilistic region model, the twist parameterization for rotations, and optical flow constraints from the images enables an iterative fitting procedure. An extension of this idea is described in [26], where, in addition to the optical flow constraints, depth constraints from real-time depth image streams are also employed. Rosenhahn et al. have formulated the pose recovery problem as an optimization problem. They employ conformal geometric algebra to mathematically express distances between silhouette cones and shape model outlines in 3D. The optimal 3D pose is obtained by minimizing these distances [27].

Recently, the application of statistical filters in the context of human motion capture has become very popular. Basically, all such filters employ a
process model that describes the dynamics of the human body and a measurement model that describes how an image is formed from the body in a certain pose. The process model enables prediction of the state at the next time step, and the measurement model allows the prediction to be refined based on the actual image data. If the noise is Gaussian and the model dynamics can be described by a linear model, a Kalman filter can be used for tracking [9]. However, the dynamics of the complete human body are non-linear. A particle filter can handle such non-linear systems and enables tracking in a statistical framework based on Bayesian decision theory [28]. At each time step, a particle filter uses multiple predictions (body poses) with associated probabilities. These are refined by looking at the actual image data (the likelihood). The prior is usually quite diffuse, but the likelihood function can be very peaky. The performance of statistical frameworks for tracking sophisticated 3D body models has been demonstrated in several research projects [13, 29, 30, 31].

In another category of approaches that have recently become popular, dynamic 3D scene models are reconstructed from multiple silhouette views and a kinematic body model is fitted to them [32]. A system that fits an ellipsoidal model of a human to visual hull volumes in real-time is described in [8]. The employed body model is very coarse and approximates each limb of the body with only one quadric. In [9], a system for off-line tracking of a more detailed kinematic body model using visual hull models is presented. The method described in [33] reconstructs a triangle mesh surface geometry from silhouettes, to which a kinematic skeleton is fitted. Cheung et al. also present an approach for body tracking from visual hulls [34]. In contrast, we propose three algorithmic variants of motion capture that employ a hardware-accelerated analysis-through-synthesis approach to capture time-varying scene geometry and pose parameters [35, 36, 37] from only eight camera views. By appropriate decomposition of the tracking problem into subproblems, we can robustly capture body poses without having to resort to computationally expensive statistical filters.

3.3.2 Human Model Reconstruction

For faithful tracking, but also for convincing renditions of virtual humans, appropriate human body models have to be reconstructed in the first place. These models comprise correct surface descriptions, descriptions of the kinematics, and descriptions of the surface deformation behavior. Only a few algorithms have been proposed in the literature to automatically reconstruct such models from captured image data. In the work by Cheung et al. [34], a skeleton is estimated from a sequence of shape-from-silhouette volumes of the moving person. A special sequence of moves has to be performed with each limb individually in order to make model estimation feasible. In the approach by Kakadiaris et al. [22], body models are estimated from multiple video streams in which the silhouettes of the moving person
have been computed. With their method, too, skeleton reconstruction is only possible if a prescribed sequence of movements is followed. In [38, 39], an approach to automatically learn kinematic skeletons from shape-from-silhouette volumes is described that does not employ any a priori knowledge and does not require predefined motion sequences. The method presented in [40] similarly proposes a spectral clustering-based approach to estimate kinematic skeletons from arbitrary 3D feature trajectories. In [41], a method is described that captures the surface deformation of the upper body of a human by interpolating between different range scans. The body geometry is modeled as a displaced subdivision surface. A model of the body deformation in dependence on the pose parameters is obtained by the method described in [42]. A skeleton model of the person is known a priori and the motion is captured with a marker-based system. Body deformation is estimated from silhouette images and represented with needles that change in length and whose endpoints form the body surface. Recently, Anguelov et al. [43] have presented a method to learn a parameterized human body model that captures both variations in shape and variations in pose from a database of laser scans. In contrast, two of our algorithmic variants describe a template-based approach that automatically builds the kinematic structure and the surface geometry of a human from video data.

3.3.3 Free-viewpoint Video

Research in free-viewpoint video aims at developing methods for photorealistic, real-time rendering of previously captured real-world scenes. The goal is to give the user the freedom to interactively navigate his or her viewpoint freely through the rendered scene. Early research that paved the way for free-viewpoint video was presented in the field of image-based rendering (IBR). Shape-from-silhouette methods reconstruct geometry models of a scene from multi-view silhouette images or video streams. Examples are image-based [44, 45] or polyhedral visual hull methods [46], as well as approaches performing point-based reconstruction [47]. The combination of stereo reconstruction with visual hull rendering leads to a more faithful reconstruction of surface concavities [48]. Stereo methods have also been applied to reconstruct and render dynamic scenes [49, 50], some of them employing active illumination [51]. On the other hand, light field rendering [52] is employed in the 3DTV system of [53] to enable simultaneous scene acquisition and rendering in real-time. In contrast, we employ a complete parameterized geometry model to pursue a model-based approach towards free-viewpoint video [35, 54, 55, 56, 57, 37]. Through commitment to a body model whose shape is made consistent with the actor in multiple video streams, we can capture a human's motion and dynamic surface texture [36]. We can also apply our method to capture personalized human avatars [58].
3.3.4 Mesh-based Deformation and Animation

In the last algorithmic variant we explain in this chapter, we map the motion captured with our template model onto a high-quality laser scan of the recorded individual. To this end, we employ a method for motion transfer between triangle meshes that is based on differential coordinates [59, 60]. The potential of these methods has been stated in previous publications; however, the focus has always been on deformation transfer between synthetic moving meshes [61]. Using a complete set of correspondences between different synthetic models, [62] can transfer the motion of one model to the other. Following a similar line of thinking, [63, 64] propose a mesh-based inverse kinematics framework based on pose examples, with potential application to mesh animation. More recently, [65] presents a multigrid technique for the efficient deformation of large meshes, and [66] presents a framework for performing constrained mesh deformation using gradient-domain techniques. Both methods are conceptually related to our algorithm and could also be used for animating human models. However, none of these papers provides a complete integration of the surface deformation approach with a motion acquisition system, nor does any of them provide a comprehensive user interface. We capitalize on and extend ideas from this field in order to develop a method that allows us to easily make a high-quality laser scan of a person move in the same way as the performing subject. Realistic motion for the scan, as well as non-rigid surface deformations, are generated automatically.
3.4 The Adaptable Human Body Model

While 3D object geometry can be represented in many ways, we employ a triangle mesh representation, since it offers a closed and detailed surface description and can be rendered very fast on graphics hardware. Since the template human model should be able to perform the same complex motion as its real-world counterpart, it is composed of multiple rigid body parts that are linked by a hierarchical kinematic chain. The joints between segments are suitably parameterized to reflect the kinematic degrees of freedom of the object. Besides the object pose, the shape and dimensions of the separate body parts must also be customized in order to optimally reproduce the appearance of the real subject. A publicly available VRML geometry model of a human body is used as our template model, Fig. 3.1a. It consists of 16 rigid body segments: one each for the upper torso, lower torso, neck, and head, and one pair each for the upper arms, lower arms, hands, upper legs, lower legs, and feet. A hierarchical kinematic chain connects all body segments. 17 joints with a total of 35 joint parameters define the pose of the template model. For global positioning, the model provides three translational degrees of freedom, which influence the position of the skeleton root located at the
Fig. 3.1. (a) Surface model and the underlying skeletal structure - spheres indicate joints and the different parameterizations used. (b) Schematic illustration of local vertex coordinate scaling by means of a Bézier scaling curve. (c) The two planes in the torso illustrate the local scaling directions
pelvis. Different joints in the body model provide different numbers of rotational degrees of freedom, just as the corresponding joints in an anatomical skeleton do. Figure 3.1a shows the individual joints in the kinematic chain of the body model; the respective joint color indicates whether it is a 1-DOF hinge joint, a 3-DOF ball joint, or a 4-DOF extended joint [35]. In addition to the pose parameters, the model provides anthropomorphic shape parameters that control the bone lengths as well as the structure of the triangle meshes defining the body surface. The first set of anthropomorphic parameters consists of a uniform scaling that stretches the bone as well as the surface mesh along the direction of the bone axis. In order to match the geometry more closely to the shape of the real human, each segment features four one-dimensional Bézier curves, $B_{+x}(u)$, $B_{-x}(u)$, $B_{+z}(u)$, $B_{-z}(u)$, that are used to scale the individual coordinates of each vertex in the local triangle mesh. The scaling is performed in the local +x, -x, +z, and -z directions of the segment's coordinate frame, which are orthogonal to the direction of the bone axis. Figure 3.1b shows the effect of changing the Bézier scaling values, using the arm segment as an example. Intuitively, the four scaling directions lie on two orthogonal planes in the local frame; for illustration, we show these two planes in the torso segment in Fig. 3.1c.
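To make the scaling concrete, the following minimal sketch (in Python, with hypothetical names; it assumes a local frame whose +y axis is the bone axis and 1D Bézier curves given by their control values) evaluates the four scaling curves and applies them to the local vertex coordinates of one segment:

```python
import numpy as np

def bezier_1d(control_values, u):
    """Evaluate a 1D Bezier curve at u in [0, 1] via de Casteljau's algorithm."""
    pts = np.asarray(control_values, dtype=float)
    while pts.size > 1:
        pts = (1.0 - u) * pts[:-1] + u * pts[1:]
    return float(pts[0])

def scale_segment(vertices, b_px, b_nx, b_pz, b_nz, bone_length):
    """Scale the local x and z coordinates of a segment's vertices by the
    four curves B_{+x}, B_{-x}, B_{+z}, B_{-z}; the bone axis is the local
    +y axis, so u parameterizes the position along the bone."""
    out = np.asarray(vertices, dtype=float).copy()
    for v in out:
        u = np.clip(v[1] / bone_length, 0.0, 1.0)  # position along the bone
        v[0] *= bezier_1d(b_px if v[0] >= 0.0 else b_nx, u)
        v[2] *= bezier_1d(b_pz if v[2] >= 0.0 else b_nz, u)
    return out
```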
3.5 Silhouette-based Analysis-through-synthesis

The challenge in applying model-based analysis to free-viewpoint video reconstruction is to find a way to automatically and robustly adapt the geometry model to the appearance of the subject as recorded by the video cameras. In general, we need to determine the parameter values that achieve the best match between the model and the video images. Regarding this task as an optimization problem, the silhouettes of the actor, as seen from the different camera viewpoints, are used to match the
model to the video images (an idea used in similar form in [67]): the model is rendered from all camera viewpoints, and the rendered images are thresholded to yield binary masks of the silhouettes of the model. The rendered model silhouettes are then compared to the corresponding image silhouettes [35, 54, 55, 57]. The comparison measure is the number of silhouette pixels that do not overlap. Conveniently, the exclusive-or (XOR) operation between the rendered model silhouette and the segmented video-image silhouette yields exactly those non-overlapping pixels. The energy function thus evaluates to

$$E_{\mathrm{XOR}}(\mu) = \sum_{i=0}^{N} \sum_{x=0}^{X} \sum_{y=0}^{Y} \left[ \left( P_s(x,y) \wedge \overline{P_m(x,y,\mu)} \right) \vee \left( \overline{P_s(x,y)} \wedge P_m(x,y,\mu) \right) \right] \qquad (3.1)$$
where $\mu$ denotes the model parameters currently considered, e.g. pose or anthropomorphic parameters, $N$ is the number of cameras, and $X$ and $Y$ are the dimensions of the image. $P_s(x,y)$ is the 0/1 value of pixel $(x,y)$ in the captured image silhouette, while $P_m(x,y,\mu)$ is its equivalent in the reprojected model image given the current model parameters $\mu$. Fortunately, this XOR energy function can be evaluated very efficiently in graphics hardware (Fig. 3.2). An overview of the framework used to adapt the model parameter values such that the mismatch score becomes minimal is shown in Fig. 3.2. A standard numerical optimization algorithm, such as Powell's method [68], runs on the CPU.
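As an illustration, the following CPU sketch (NumPy; the chapter's implementation evaluates this on the GPU) computes the mismatch score of (3.1) from binary silhouette masks:

```python
import numpy as np

def xor_energy(image_silhouettes, model_silhouettes):
    """Mismatch score of (3.1): the number of pixels covered by exactly one
    of the two silhouettes, summed over all camera views. Both arguments
    are sequences of binary (H, W) masks, one per camera."""
    return sum(int(np.count_nonzero(np.logical_xor(p_s, p_m)))
               for p_s, p_m in zip(image_silhouettes, model_silhouettes))
```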
Fig. 3.2. Hardware-based analysis-through-synthesis method: To match the geometry model to the multi-video recordings of the actor, the image foreground is segmented and binarized. The model is rendered from all camera viewpoints and the boolean XOR operation is executed between the foreground images and the corresponding model renderings. The number of remaining pixels in all camera views serves as matching criterion. Model parameter values are varied via numerical optimization until the XOR result is minimal. The numerical minimization algorithm runs on the CPU while the energy function evaluation is implemented on the GPU
As a direction-set method, Powell's algorithm maintains a number of candidate descent directions in parameter space. The optimal descent along one direction is computed using Brent's line-search method. For each new set of model parameter values, the optimization routine invokes the matching-function evaluation on the graphics card. One valuable benefit of model-based analysis is the low-dimensional parameter space compared to general reconstruction methods. The parameterized model provides only a few dozen degrees of freedom that need to be determined, which greatly reduces the number of potential local minima. Furthermore, many high-level constraints are implicitly incorporated, and additional constraints can easily be enforced by ensuring that all parameter values stay within their anatomically plausible range during optimization. Finally, temporal coherence is straightforwardly maintained by allowing only some maximal rate of change in parameter values from one time step to the next. The silhouette-based analysis-through-synthesis approach is employed for two purposes: the initialization of the model geometry (Sect. 3.5.1) and the computation of the body pose at each time step (Sect. 3.5.2).

3.5.1 Initialization

In order to apply the silhouette-based framework to real-world multi-view video footage, the generic template model must first be initialized, i.e. its proportions must be adapted to the subject in front of the cameras. This is achieved by applying the silhouette-based analysis-through-synthesis algorithm to optimize the anthropomorphic parameters of the model. This way, all segment surfaces can be deformed until they closely match the stature of the actor. During model initialization, the actor stands still for a brief moment in a pre-defined pose to have his silhouettes recorded from all cameras. The generic model is rendered for this known initialization pose, and without user intervention, its proportions are automatically adapted to the silhouettes of the individual. First, only the torso is considered. Its position and orientation are determined approximately by maximizing the overlap of the rendered model images with the segmented image silhouettes. Then the poses of the arms, legs, and head are recovered by rendering each limb in a number of orientations close to the initialization pose and selecting the best match as the starting point for refined optimization. Following the model hierarchy, the optimization itself is split into several sub-optimizations in lower-dimensional parameter spaces (Sect. 3.5.2). After the model has been coarsely adapted in this way, the uniform scaling parameters of all body segments are adjusted. The algorithm then alternates, typically around 5-10 times, between optimizing joint parameters and segment scaling parameters until it converges. Finally, the Bézier control parameters of all body segments are optimized in order to fine-tune the outline of each segment such that it complies with the
Fig. 3.3. (a) Template model geometry. (b) Model after 5 iterations of pose and scale refinements. (c) Model after adapting the Bézier scaling parameters
recorded silhouettes. Figure 3.3 shows the initial model shape, its shape after five iterations of pose and scale optimization, and its shape after Bézier scaling. From this point on, the anthropomorphic shape parameters remain fixed.

3.5.2 Marker-free Pose Tracking

The analysis-through-synthesis framework enables us to capture the pose parameters of a moving subject without having the actor wear any specialized apparel. The individualized geometry model automatically tracks the motion of the actor through optimization of the 35 joint parameters at each time step. The model silhouettes are matched to the segmented image silhouettes of the actor such that the model performs the same movements as the human in front of the cameras, Fig. 3.4. At each time step an optimal stance of the model is found by numerically minimizing the silhouette-XOR energy functional (3.1) in the space of pose parameters. To efficiently avoid local minima, the model parameters are not all optimized simultaneously. Instead, the hierarchical structure of the model is
Fig. 3.4. (a) One input image of the actor performing. (b) Silhouette XOR overlap. (c) Reconstructed pose of the template body model with kinematic skeleton
exploited. We effectively constrain the search space by exploiting structural knowledge about the human body, knowledge about feasible body poses, temporal coherence in the motion data, and a grid-sampling preprocessing step. Model parameters are estimated in descending order with respect to the individual impact of the segments on the silhouette appearance and their position along the kinematic chain of the model. First, the position and orientation of the torso are varied to find its 3D location. Next, arms, legs, and head are considered. Finally, hands and feet are examined, Fig. 3.5. Temporal coherence is exploited by initializing the optimization for one body part with the pose parameters found in the previous time step. Optionally, a simple linear prediction based on the two preceding parameter sets can be used. In order to cope with fast body motion that can easily mislead the optimization, we precede the numerical minimization step with a regular grid search (see the sketch at the end of this subsection). The grid search samples the low-dimensional parameter space of each limb at regularly spaced values and checks whether each corresponding limb pose is valid. Using the arm as an example, a valid pose is defined by two criteria: firstly, the wrist and the elbow must project into the image silhouettes in every camera view; secondly, the elbow and the wrist must lie outside a bounding box defined around the torso segment of the model. For all valid poses found, the error function is evaluated, and the pose that exhibits the minimal error is used as the starting point for a direction-set downhill minimization. The result of this numerical minimization specifies the final limb configuration. The parameter range from which the grid search draws sample values is adapted based on the difference in pose parameters between the two preceding time steps. The grid-sampling step can be computed at virtually no cost and significantly increases the convergence speed of the numerical minimizer. The performance of the silhouette-based pose tracker can be further improved by capitalizing on the structural properties of the optimization problem. First, the XOR evaluation can be sped up by restricting the computation to a sub-window in the image plane and excluding non-moving body parts
Fig. 3.5. Body parts are matched to the silhouettes in hierarchical order: the torso first, then arms, legs and head, finally hands and feet. Local minima are avoided by a limited regular grid search for some parameters prior to optimization initialization
from rendering. Second, the optimization of independent sub-chains can be performed in parallel. A prototype implementation using 5 PCs and 5 GPUs, together with the improved XOR evaluation, exhibited a speed-up of up to a factor of 8. Details can be found in [54].
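The following sketch summarizes the per-limb search described above; energy and is_valid_pose are assumed callables standing in for the GPU-based XOR evaluation of (3.1) and the pose-validity tests:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def fit_limb(energy, is_valid_pose, prev_params, search_range, steps=5):
    """Grid search over a limb's joint parameters, followed by direction-set
    refinement of the best valid sample (cf. Sect. 3.5.2).

    energy        : callable mapping a joint-parameter vector to the
                    silhouette-XOR score of (3.1).
    is_valid_pose : callable encoding the validity criteria (projection
                    into all silhouettes, torso bounding box).
    search_range  : per-parameter half-width of the grid, adapted from the
                    difference between the two preceding time steps.
    """
    axes = [np.linspace(p - r, p + r, steps)
            for p, r in zip(prev_params, search_range)]
    candidates = [c for c in itertools.product(*axes) if is_valid_pose(c)]
    start = min(candidates, key=energy) if candidates else tuple(prev_params)
    # Powell's method, a direction-set minimizer, refines the grid result.
    return minimize(energy, np.asarray(start), method="Powell").x
```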
3.6 Dynamic Shape and Motion Reconstruction

In Sect. 3.5, we presented a framework for robustly capturing the shape and motion of a moving subject from multi-view video. However, the anthropomorphic shape parameters of the model are captured from a static initialization posture and, in turn, remain static during the subsequent motion tracking. Unfortunately, this means we are unable to capture subtle time-varying geometry variations on the body surface, e.g. due to muscle bulging or wrinkles in clothing. This section presents an extension to our original silhouette-based initialization method that bridges this gap. By taking not only silhouette constraints but also a color-consistency criterion into account, we are able to reconstruct dynamic geometry details on the body model from multi-view video as well [36]. To achieve this purpose, our algorithm simultaneously optimizes the body pose and the anthropomorphic shape parameters of our model (Sect. 3.4). Our novel fitting scheme consists of two steps, spatio-temporally consistent (STC) model reconstruction and dynamic shape refinement. The algorithmic workflow between these steps is illustrated in Fig. 3.6. Our method expects a set of synchronized multi-view video sequences of a moving actor as input. The STC model is characterized by two properties. Firstly, at each time step of video its pose matches the pose of the actor in the input streams. Secondly, it features a constant set of anthropomorphic shape parameters that have been reconstructed not from a single initialization posture, but from a set of postures taken from sequences in which the person performs arbitrary motion. To reconstruct the STC representation, we employ our silhouette-based analysis-through-synthesis approach within a spatio-temporal optimization procedure, Sect. 3.6.1.
Fig. 3.6. Visualization of the individual processing steps of the Dynamic Shape and Motion Reconstruction Method
The spatio-temporally consistent scene representation is consistent with the pose and shape of the actor at multiple time steps, but it still features only constant anthropomorphic parameters. In order to reconstruct dynamic geometry variations we compute appropriate vertex displacements for each time step of video separately. To this end, we jointly employ a color- and silhouette-consistency criterion to identify slightly inaccurate surface regions of the body model, which are then appropriately deformed by means of a Laplacian interpolation method, Sect. 3.6.2.

3.6.1 Spatio-temporal Model Reconstruction

We commence the STC reconstruction by shape-adapting the template model using the single-pose initialization procedure described in Sect. 3.5.1. Thereafter, we run a two-stage spatio-temporal optimization procedure that alternates between pose estimation and segment deformation. Here, it is important to note that we do not estimate the respective parameters from single time steps of multi-view video, but always consider a sequence of captured body poses. In the first step of each iteration, the pose parameters of the model at each time step of video are estimated using the method proposed in Sect. 3.5.2. In the second step, the Bézier control values for each of the 16 segments are computed by an optimization scheme. We find scaling parameters that optimally reproduce the shape of the segments in all body poses simultaneously. A modified energy function sums the silhouette-XOR overlap contributions of each individual time step. Figure 3.7 shows the resulting spatio-temporally consistent model generated with our scheme.

3.6.2 Dynamic Shape Refinement

The spatio-temporal representation that we now have at our disposal is globally silhouette-consistent with a number of time steps of the input video sequence. However, although the match is globally optimal, it may not exactly match the silhouettes of the actor at each individual time step. In particular, subtle changes in body shape that are due to muscle bulging or deformation of the apparel are not modeled in the geometry. Furthermore, certain types
Fig. 3.7. (a) Adaptable generic human body model; (b) initial model after skeleton rescaling; (c) model after one and (d) several iterations of the spatio-temporal scheme
of geometry features, such as concavities on the body surface, cannot be captured from silhouette images alone. In order to capture these dynamic details in the surface geometry, we compute per-vertex displacements for each time step of video individually. To this end, we also exploit the color information in the input video frames. Assuming purely Lambertian surface reflectance, we estimate appropriate per-vertex displacements by jointly optimizing a multi-view color-consistency and a multi-view silhouette-consistency measure. Regularization terms that assess mesh distortions and visibility changes are also employed. The following steps are performed for each body segment and each time step of video:

3.6.2.1 Identification of Color-inconsistent Regions

We use color information to identify, for each time step of video individually, those regions of the body geometry which do not fully comply with the appearance of the actor in the input video images. To numerically assess the geometry misalignment, we compute for each vertex a color-consistency measure similar to the one described in [69]. By applying a threshold to this measure, we can decide whether a vertex is in a photo-consistent or photo-inconsistent position. After calculating the color-consistency measure for all vertices, all photo-inconsistent vertices are clustered into contiguous photo-inconsistent surface patches by means of a region-growing method. In Fig. 3.6, color-inconsistent surface patches are marked in grey.

3.6.2.2 Computation of Vertex Displacements

We randomly select a set of vertices M out of each color-inconsistent region identified in the previous step. For each vertex $j \in M$ with position $v_j$ we compute a displacement $r_j$ in the direction of the local surface normal that minimizes the following energy functional:

$$E(v_j, r_j) = w_I E_I(v_j + r_j) + w_S E_S(v_j + r_j) + w_D E_D(v_j + r_j) + w_P E_P(v_j, v_j + r_j) \qquad (3.2)$$

• $E_I(v_j + r_j)$ is the color-consistency measure.
• The term $E_S(v_j + r_j)$ penalizes vertex positions that project into image-plane locations that are very distant from the boundary of the silhouette of the person. The inner and outer distance fields for each silhouette image can be pre-computed by means of the method described in [70].
• $E_D(v)$ regularizes the mesh segment by measuring the distortion of triangles. We employ a distortion measure based on the Frobenius norm $\kappa$ [71]:

$$\kappa = \frac{a^2 + b^2 + c^2}{4\sqrt{3}\,A} - 1, \qquad (3.3)$$

where $a$, $b$ and $c$ are the lengths of the edges of a triangle and $A$ is its area. For an equilateral triangle the value is 0; for degenerate triangles it approaches infinity. To compute $E_D(v_j + r_j)$ for a displaced vertex $j$ at position $v_j + r_j$, we average the $\kappa$ values of the triangles adjacent to $j$.
• The term $E_P(v_j, v_j + r_j)$ penalizes visibility changes caused by moving vertex $j$ from position $v_j$ to position $v_j + r_j$. It has a large value if at position $v_j + r_j$ the number of cameras that see the vertex differs significantly from the number that see it at $v_j$.
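A minimal sketch of this per-vertex search follows; since the displacement is a single scalar along the normal, a bounded one-dimensional search stands in here for the quasi-Newton solver used in the text, and the four energy terms are assumed to be provided as callables:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def displace_vertex(v, n_hat, e_i, e_s, e_d, e_p, w, r_max):
    """Optimize the displacement r of one vertex along its unit normal
    n_hat by minimizing the weighted energy of (3.2)."""
    def energy(r):
        p = v + r * n_hat
        return (w["I"] * e_i(p) + w["S"] * e_s(p)
                + w["D"] * e_d(p) + w["P"] * e_p(v, p))
    # Bounded 1D minimization over the admissible displacement range.
    res = minimize_scalar(energy, bounds=(-r_max, r_max), method="bounded")
    return res.x
```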
The weights $w_I$, $w_S$, $w_D$, $w_P$ are found through experiments and are chosen in such a way that $E_I$ and $E_S$ dominate. We use the L-BFGS-B method [72], a quasi-Newton algorithm, to minimize the energy function $E(v_j, r_j)$. After calculating the optimal displacement for each of the randomly selected vertices in M individually, these displacements are used to smoothly deform the whole region by means of a Laplace interpolation method.

3.6.2.3 Laplacian Model Deformation

Using a Laplace interpolation method (see e.g. [73, 74]), each color-inconsistent region is deformed such that it globally complies with the sampled per-vertex displacements. The new positions of the vertices in a region form an approximation to the displacement constraints. Formally, the deformed vertex positions are found via a solution to the Laplace equation

$$L v = 0, \qquad (3.4)$$

where $v$ is the vector of vertex positions and the matrix $L$ is the discrete Laplace operator. The matrix $L$ is singular, and we hence need to add suitable boundary conditions to (3.4) in order to solve it. We reformulate the problem as

$$\min_{v} \left\| \begin{bmatrix} L \\ K \end{bmatrix} v - \begin{bmatrix} 0 \\ d \end{bmatrix} \right\|^2 \qquad (3.5)$$

This equation is solved in each of the three Cartesian coordinate directions ($x$, $y$ and $z$) separately. The matrix $K$ and the vector $d$ impose the individual sampled per-vertex constraints, which will be satisfied in a least-squares sense:

$$K_{ii} = \begin{cases} w_i & \text{if a displacement is specified for } i, \\ w_i & \text{if } i \text{ is a boundary vertex}, \\ 0 & \text{otherwise}. \end{cases} \qquad (3.6)$$

The elements of $d$ are:

$$d_i = \begin{cases} w_i \cdot (v_i + r_i) & \text{if a displacement is specified for } i, \\ w_i \cdot v_i & \text{if } i \text{ is a boundary vertex}, \\ 0 & \text{otherwise}. \end{cases} \qquad (3.7)$$
The values $w_i$ are constraint weights, $v_i$ is the position coordinate of vertex $i$ before deformation, and $r_i$ is the displacement for $i$. The least-squares solution to (3.5) is found by solving the linear system

$$\begin{bmatrix} L \\ K \end{bmatrix}^{T} \begin{bmatrix} L \\ K \end{bmatrix} x = \left( L^2 + K^2 \right) x = \begin{bmatrix} L \\ K \end{bmatrix}^{T} \begin{bmatrix} 0 \\ d \end{bmatrix} = K\,d. \qquad (3.8)$$
Appropriate weights for the displacement constraints are easily found through experiments. After solving the three linear systems individually for the x-, y- and z-coordinate directions, we obtain the new deformed body shape that is both color- and silhouette-consistent with all input views.
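The following sketch assembles and solves (3.8) for one coordinate direction with SciPy sparse matrices; the Laplacian L and the constraint data are assumed given:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def laplacian_deform_1d(L, constrained, targets, weights):
    """Solve the normal equations of (3.8) for one coordinate direction.

    L           : (n, n) sparse discrete Laplacian of the region's mesh.
    constrained : indices of displaced or boundary vertices.
    targets     : prescribed coordinate for each constrained vertex
                  (v_i + r_i for displaced vertices, v_i for boundary ones).
    weights     : constraint weight w_i for each constrained vertex.
    """
    n = L.shape[0]
    k = np.zeros(n)
    d = np.zeros(n)
    k[constrained] = weights
    d[constrained] = np.asarray(weights) * np.asarray(targets)  # as in (3.7)
    K = sp.diags(k)
    A = (L.T @ L + K.T @ K).tocsc()
    return spsolve(A, K.T @ d)

# One solve per Cartesian direction yields the deformed region, e.g.
# x_new = laplacian_deform_1d(L, idx, tx, w), and similarly for y and z.
```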
3.7 Confluent Laser-scanned Human Geometry

The commitment to a parameterized body model makes the motion estimation problem tractable. However, a model-based approach also implies a couple of limitations. Firstly, a template model is needed for each type of object that one wants to record. Secondly, the first two methods cannot handle people wearing very loose apparel such as dresses or wide skirts. Furthermore, while a relatively smooth template model enables easy fitting to a wide range of body shapes, more detailed geometry specific to each actor would improve rendering quality even more. Thus, to overcome some of the limitations imposed by the previous template-based methods, we present in this section a method to incorporate a static high-quality shape prior into our framework. Our new method enables us to make a very detailed laser scan of an actor follow the motion that was previously captured from multi-view video. The input to our framework is an MVV sequence captured from a real actor performing. We can apply either of the model-based tracking approaches, from Sect. 3.5 or Sect. 3.6, to estimate the shape and motion of the subject. Our algorithm then provides a simple and very efficient way to map the motion of the moving template onto the static laser scan. Subsequently, the scan imitates the movements of the template, and non-rigid surface deformations are generated on-the-fly. Thus, no manual skeleton transfer or blending-weight assignment is necessary. To achieve this goal, we formulate the motion transfer problem as a deformation transfer problem. To this end, a sparse set of triangle correspondences between the template and the scan is specified, Fig. 3.8a. The transformations of the marked triangles on the template model are mapped to their counterparts on the scan. Deformations for in-between triangles are interpolated over the surface by means of a harmonic field interpolation. The surface of the deformed scan at each time step is computed by solving a Poisson system. Our framework is based on the principle of differential mesh editing and only requires the solution of simple linear systems to map poses of the template
Fig. 3.8. (a) The motion of the template model is mapped onto the scan by only specifying a small number of correspondences between individual triangles. (b) The pose of the real actor, captured by the template model, is accurately transferred to the high-quality laser scan
to the scan. Due to this computational efficiency we can map postures even to scans with several tens of thousands of triangles at nearly interactive rates. As an additional benefit, our algorithm implicitly solves the motion retargeting problem, which gives us the opportunity to map input motions to target models with completely different body proportions. Figure 3.8b shows an example where we mapped the motion of our moving template model onto a high-quality laser scan. This way, we can easily use detailed dynamic scene geometry as the underlying shape representation during free-viewpoint video rendering. For details on the correspondence specification procedure and the deformation interpolation framework, we refer the reader to [37].
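As an illustration of the harmonic field interpolation mentioned above, the following sketch (using a uniform graph Laplacian for simplicity; [37] describes the actual framework) spreads sparse per-vertex values smoothly over a mesh by solving the Laplace equation with Dirichlet constraints:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def harmonic_interpolate(n_vertices, edges, fixed_idx, fixed_vals):
    """Interpolate sparse per-vertex values (e.g. the influence of marked
    correspondence triangles) as a harmonic field over the mesh."""
    edges = np.asarray(edges)
    rows = np.concatenate([edges[:, 0], edges[:, 1]])
    cols = np.concatenate([edges[:, 1], edges[:, 0]])
    # Uniform (graph) Laplacian L = D - W built from the edge list.
    W = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(n_vertices, n_vertices)).tocsr()
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W
    fixed_idx = np.asarray(fixed_idx)
    free = np.setdiff1d(np.arange(n_vertices), fixed_idx)
    u = np.zeros(n_vertices)
    u[fixed_idx] = fixed_vals
    # Solve L_ff u_f = -L_fc u_c for the unconstrained vertices.
    rhs = -L[free][:, fixed_idx] @ u[fixed_idx]
    u[free] = spsolve(L[free][:, free].tocsc(), rhs)
    return u
```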
3.8 Free-viewpoint Video with Dynamic Textures

By combining any of the three methods to capture dynamic scene geometry (Sects. 3.5, 3.6 or 3.7) with dynamic texture generation, we can create and render convincing free-viewpoint videos that reproduce the omni-directional appearance of the actor. Since time-varying video footage is available, the model texture does not have to be static. Lifelike surface appearance is generated using the projective texturing functionality of modern GPUs. Prior to display, the geometry of the actor as well as the calibration data of the video cameras are transferred to the GPU. During rendering, the viewpoint information, the shape of the model, the current video images, as well as the visibility and blending coefficients $\nu_i$, $\omega_i$ for all vertices and cameras are continuously transferred to the GPU. The color of each rendered pixel $c(j)$ is determined by blending all $l$ video images $I_i$ according to

$$c(j) = \sum_{i=1}^{l} \nu_i(j)\, \rho_i(j)\, \omega_i(j)\, I_i(j) \qquad (3.9)$$
where $\omega_i(j)$ denotes the blending weight of camera $i$, $\rho_i(j)$ is an optional view-dependent rescaling factor, and $\nu_i(j) \in \{0, 1\}$ is the local visibility. During texture pre-processing, the weight products $\nu_i(j)\rho_i(j)\omega_i(j)$ have been normalized to ensure energy conservation. Technically, (3.9) is evaluated for each fragment by a fragment program on the GPU. The rasterization engine interpolates the blending values from the triangle vertices. By this means, time-varying cloth folds and creases, shadows, and facial expressions are faithfully reproduced, lending a very natural, dynamic appearance to the rendered object. The computation of the blending weights and the visibility coefficients is explained in the following two subsections.

3.8.1 Blending Weights

The blending weights determine the contribution of each input camera image to the final color of a surface point. If surface reflectance can be assumed to be approximately Lambertian, view-dependent reflection effects play no significant role, and high-quality, detailed model texture can still be obtained by blending the video images cleverly. Let $\theta_i$ denote the angle between a vertex normal and the optical axis of camera $i$. By emphasizing for each vertex individually the camera view with the smallest angle $\theta_i$, i.e. the camera that views the vertex most head-on, a consistent, detail-preserving texture is obtained. A visually convincing weight assignment has been found to be

$$\omega_i = \frac{1}{\left(1 + \max_j(1/\theta_j) - 1/\theta_i\right)^{\alpha}} \qquad (3.10)$$
where the weights $\omega_i$ are additionally normalized to sum to unity. The parameter $\alpha$ determines the influence of vertex orientation with respect to the camera viewing direction, and thus the impact of the most head-on camera view per vertex, Fig. 3.9. Singularities are avoided by clamping the value of $1/\theta_i$ to a maximal value.
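A direct transcription of (3.10) follows, with the clamping of $1/\theta_i$ expressed as a lower bound on $\theta_i$ (the bound value itself is an assumption):

```python
import numpy as np

def blending_weights(thetas, alpha, theta_min=1e-2):
    """Per-vertex camera blending weights of (3.10); thetas holds the angle
    between the vertex normal and each camera's optical axis (radians)."""
    # Clamping theta from below bounds 1/theta and avoids the singularity.
    inv = 1.0 / np.maximum(np.asarray(thetas, dtype=float), theta_min)
    w = 1.0 / (1.0 + inv.max() - inv) ** alpha
    return w / w.sum()  # normalized to sum to unity, as in the text
```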
Fig. 3.9. Texturing results for different values of the control factor α: (a) α = 0, (b) α = 3, (c) α = 15
Although it is fair to assume that everyday apparel has purely Lambertian reflectance, in some cases the reproduction of view-dependent appearance effects may be desired. To serve this purpose, our method provides the possibility to compute view-dependent rescaling factors, $\rho_i$, for each vertex on-the-fly while the scene is rendered:

$$\rho_i = \frac{1}{\phi_i} \qquad (3.11)$$
where $\phi_i$ is the angle between the direction towards the outgoing (virtual) viewpoint and the direction towards input camera $i$.

3.8.2 Visibility

Projective texturing on the GPU has the disadvantage that occlusion is not taken into account, so hidden surfaces are textured as well. The z-buffer test, however, allows determining for every time step which object regions are visible from each camera. Due to inaccuracies in the geometry model, it can happen that the silhouette outlines in the images do not correspond exactly to the outline of the model. When projecting video images onto the model, a texture seam belonging to some frontal body segment may fall onto another body segment farther back, Fig. 3.10a. To avoid such artifacts, extended soft shadowing is applied. For each camera, all object regions of zero visibility are determined not only from the actual position of the camera, but also from several slightly displaced virtual camera positions. Each vertex is tested for visibility from all of these camera positions, and a triangle is textured by a camera image only if all three of its vertices are completely visible from that camera. While too generously segmented silhouettes do not affect rendering quality, too small outlines can cause annoying untextured regions. To counter such rendering artifacts, all image silhouettes are expanded by a couple of pixels prior to rendering: using a morphological filter operation, the object outlines of all video images are dilated, copying the silhouette boundary pixel colors to adjacent background pixel positions, Fig. 3.10b.
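The dilation step can be sketched as follows (SciPy; the nearest-silhouette-pixel lookup via a distance transform is one possible way to copy boundary colors outwards):

```python
import numpy as np
from scipy import ndimage

def dilate_silhouette(image, mask, pixels=2):
    """Expand a boolean silhouette mask by a few pixels and copy the
    nearest silhouette colors into the newly covered background
    positions (cf. Fig. 3.10b)."""
    grown = ndimage.binary_dilation(mask, iterations=pixels)
    # For every pixel, the indices of the nearest silhouette pixel.
    _, (iy, ix) = ndimage.distance_transform_edt(~mask, return_indices=True)
    out = image.copy()
    new = grown & ~mask
    out[new] = image[iy[new], ix[new]]
    return out, grown
```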
Fig. 3.10. (a) Small differences between object silhouette and model outline cause erroneous texture projections. (b) Morphologically dilated segmented input video frames that are used for projective texturing
3.9 Results

We have presented a variety of methods that enable us to capture the shape and motion of moving people. Coupling any of these methods with a texturing approach enables us to generate realistic 3D videos. In the following, we briefly discuss the computational performance of each of the described methods and comment on the visual results obtained when using them to generate free-viewpoint renderings. Free-viewpoint videos reconstructed with each of the described methods can be rendered in real-time on a standard PC. During display, the user can interactively choose an arbitrary viewpoint onto the scene. We have applied our method to a variety of scenes, ranging from simple walking motion through fighting performances to complex and expressive ballet dance. Ballet dance performances are ideal test cases as they exhibit rapid and complex motion. The silhouette-based analysis-through-synthesis method (Sect. 3.5) is capable of robustly following human motion involving fast arm movements, complex twisted poses of the extremities, and full body turns (Fig. 3.11). Even on a single Intel Xeon 1.8 GHz PC featuring a fairly old Nvidia GeForce 3 graphics board, it takes only approximately 3 to 14 seconds to determine a single pose. On a state-of-the-art PC, fitting times in the range of a second are feasible, and with a parallel implementation almost interactive frame rates can be achieved. Figures 3.11 and 3.12 show the visual results obtained by applying the dynamic texturing scheme to generate realistic time-varying surface appearance. A comparison to the true input images confirms that the virtual viewpoint
Fig. 3.11. Variant I: Novel viewpoints are realistically synthesized. Two distinct time instants are shown on the left and right with input images above and novel views below
Fig. 3.12. Variant I: Conventional video systems cannot offer moving viewpoints of scenes frozen in time. With our framework freeze-and-rotate camera shots of body poses are possible. The pictures show such novel viewpoints of scenes frozen in time for different subjects and different types of motion
renditions look very lifelike. By means of clever texture blending, a contiguous appearance with only few artifacts can be achieved. Although clever texture blending can cloak most geometry inaccuracies in the purely silhouette-fitted body model, a dynamically refined shape representation leads to even better visual quality. Improvements in rendering are due to improved geometry and, consequently, fewer surface-blending errors during projective texturing. Figure 3.13a illustrates the typical geometry improvements that we can achieve through our spatio-temporal shape refinement approach. Often, we can observe improvements at silhouette boundaries. Due to the computation of per-time-step displacements we can capture shape variations that are not representable by means of varying Bézier parameters. In general, the exploitation of temporal coherence already at the stage of STC model reconstruction leads to a better approximation of the true body shape, which is most clearly visible if the actor strikes extreme body postures as they occur in Tai Chi, Fig. 3.13b. However, the shape improvements come at the cost of having to solve a more complex optimization problem. In a typical setting, it takes around 15 minutes on a 3.0 GHz Pentium IV with a GeForce 6800 to compute the spatio-temporally consistent model. For STC
Fig. 3.13. Variant II: (a) The right subimage in each pair of images shows the improvements in shoulder geometry that are achieved by capturing time-varying surface displacements in comparison to pure silhouette matching (left subimages). (b) Different time instants of a Tai Chi motion; the time-varying shape of the torso has been recovered
reconstruction, we do not use the full 100 to 300 frames of a sequence, but rather select 5 to 10 representative postures. The surface deformation typically takes around 4 minutes per time step using our non-optimized implementation. Both purely template-based algorithmic variants are subject to a couple of limitations. Firstly, there certainly exist extreme body postures, such as the fetal position, that cannot be faithfully recovered with our motion estimation scheme. However, in all the test sequences we processed, body postures were captured reliably. Secondly, the application of clever texture blending enables us to generate free-viewpoint renderings at high quality. Nonetheless, the commitment to a segmented shape model may cause some problems. For instance, it is hard to implement a surface skinning scheme given that the model comprises individual geometry segments. Furthermore, it is hard to capture the geometry of people wearing fairly loose apparel, since the anthropomorphic degrees of freedom provide a limited deformation range. Even the per-time-step vertex displacements cannot capture all crisp surface details, since we implicitly apply a smoothness constraint during Laplacian deformation. Fortunately, our third algorithmic variant (Sect. 3.7) overcomes some of the limitations imposed by the template model by applying a laser-scanned shape model in our free-viewpoint rendering framework. While the template model is still
employed for capturing the motion, a scanned surface mesh is used during rendering. To make the scan mimic the captured motion of the actor, we apply a mesh deformation transfer approach that only requires the user to specify a handful of triangle correspondences. As an additional benefit, the motion transfer approach implicitly generates realistic non-rigid surface deformations. To demonstrate the feasibility of our method, we have acquired full-body scans of several individuals prior to recording their motion with our multi-camera setup. Figure 3.14 shows two free-viewpoint renditions of a dynamically textured animated scan next to ground-truth input images. In both cases, the renderings convincingly reproduce the true appearance of the actor. The high-detail human model also enables us to faithfully display the shape of wider apparel that cannot be fully reproduced by a deformable template. Also, the shapes of different heads with potentially different hair styles can be modeled more reliably. The scan model also has a number of conceptual advantages. For instance, a contiguous surface parameterization can now be computed, which facilitates higher-level processing operations on the surface appearance, as they are, for instance, needed in reflectance estimation [75, 76]. Nonetheless, a few limitations remain. For instance, subtle time-varying geometry details that are captured in the scans, such as folds in the trousers, appear imprinted into the surface while the scan moves. We believe that this can be solved by applying a photo-consistency-based deformation of the surface at each time step. Also, the placement of the correspondences by the user requires some training. In our experience, though, even inexperienced users rather quickly gained a feeling for correct correspondence placement. Despite these limitations, this is the first approach of its kind that enables capturing the motion of a high-detail shape model from video without any optical markers in the scene.
Fig. 3.14. Variant III: Our Confluent 3D Video approach enables the creation of free-viewpoint renditions with high-quality geometry models. Due to the accurate geometry the rendered appearance of the actor (left sub-images) nicely corresponds to his true appearance in the real world (right sub-images)
3.10 Conclusions

In this chapter, three powerful and robust methods to capture the time-varying shape, motion and appearance of moving humans from only a handful of multi-view video streams were presented. By applying a clever dynamic 3D texturing method that projects camera images onto the moving geometry representations, we can render realistic free-viewpoint videos of people in real-time on a standard PC. In combination, the presented methods enable high-quality reconstruction of human actors in a completely passive manner, which had not been possible before. Our model-based variants also open the door to attacking challenging new reconstruction problems that were previously hard due to the lack of suitable dynamic scene capture technology, e.g. cloth tracking. Depending on the application in mind, each of the variants has its own advantages and disadvantages; the question of which one is eventually the best has to be answered by the user.
Acknowledgements This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. B. Bodenheimer, C. Rose, S. Rosenthal, and J. Pella. The process of motion capture: Dealing with the data. In Proc. of Eurographics Computer Animation and Simulation, 1997.
2. G. Johansson. Visual perception of biological motion and a model for its analysis. In Perception and Psychophysics, 14(2):201–211, 1973.
3. M. Gleicher. Animation from observation: Motion capture and motion editing. In Computer Graphics, 4(33):51–55, November 1999.
4. L. Herda, P. Fua, R. Plaenkers, R. Boulic, and D. Thalmann. Skeleton-based motion capture for robust reconstruction of human motion. In Proc. of Computer Animation 2000, IEEE CS Press, 2000.
5. M. Ringer and J. Lasenby. Multiple-hypothesis tracking for automatic human motion capture. In Proc. of European Conference on Computer Vision, 1:524–536, 2002.
6. www.vicon.com.
7. T.B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. In CVIU, 81(3):231–268, 2001.
8. K.M. Cheung, T. Kanade, J.-Y. Bouguet, and M. Holler. A real time system for robust 3D voxel reconstruction of human motions. In Proc. of CVPR, 2:714–720, June 2000.
9. I. Mikić, M. Trivedi, E. Hunter, and P. Cosman. Articulated body posture estimation from multicamera voxel data. In Proc. of CVPR, 1:455ff, 2001.
10. C. Sminchisescu and B. Triggs. Kinematic jump processes for monocular 3d human tracking. In Proc. of IEEE International Conference on Computer Vision and Pattern Recognition, I:69–76, 2003.
11. D.M. Gavrila and L.S. Davis. 3D model-based tracking of humans in action: A multi-view approach. In CVPR 96, 73–80, 1996.
12. I.A. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. In Proc. CVPR, 81–87, Los Alamitos, California, USA, 1996. IEEE Computer Society.
13. H. Sidenbladh, M.J. Black, and D.J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In Proc. of ECCV, 2:702–718, 2000.
14. L. Goncalves, E. DiBernardo, E. Ursella, and P. Perona. Monocular tracking of the human arm in 3D. In Proc. of CVPR, 764–770, 1995.
15. R. Plaenkers and P. Fua. Articulated soft objects for multi-view shape and motion capture. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 2003.
16. A. Mittal, L. Zhao, and L.S. Davis. Human body pose estimation using silhouette shape analysis. In Proc. of Conference on Advanced Video and Signal-based Surveillance (AVSS), 263ff, 2003.
17. J. O'Rourke and N.I. Badler. Model-based image analysis of human motion using constraint propagation. In PAMI, 2(6), 1980.
18. Z. Chen and H. Lee. Knowledge-guided visual perception of 3d human gait from a single image sequence. In IEEE Transactions on Systems, Man and Cybernetics, 22(2):336–342, 1992.
19. N. Grammalidis, G. Goussis, G. Troufakos, and M.G. Strintzis. Estimating body animation parameters from depth images using analysis by synthesis. In Proc. of Second International Workshop on Digital and Computational Video (DCV'01), 93ff, 2001.
20. R. Koch. Dynamic 3D scene analysis through synthesis feedback control. In PAMI, 15(6):556–568, 1993.
21. G. Martinez. 3D motion estimation of articulated objects for object-based analysis-synthesis coding (OBASC). In VLBV 95, 1995.
22. I.A. Kakadiaris and D. Metaxas. 3D human body model acquisition from multiple views. In Proc. of ICCV'95, 618–623, 1995.
23. Q. Delamarre and O. Faugeras. 3D articulated models and multi-view tracking with silhouettes. In ICCV99, 716–721, 1999.
24. S. Yonemoto, D. Arita, and R. Taniguchi. Real-time human motion analysis and ik-based human figure control. In Proc. of IEEE Workshop on Human Motion, 149–154, 2000.
25. C. Bregler and J. Malik. Tracking people with twists and exponential maps. In Proc. of CVPR 98, 8–15, 1998.
26. M. Covell, A. Rahimi, M. Harville, and T.J. Darrell. Articulated pose estimation using brightness and depth constancy constraints. In Proc. of IEEE International Conference on Computer Vision and Pattern Recognition, 2:438–445, 2000.
27. B. Rosenhahn, T. Brox, and J. Weickert. Three-dimensional shape knowledge for joint image segmentation and pose tracking. To appear in International Journal of Computer Vision, 2006.
28. J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In Proc. of CVPR'00, 2:2126ff, 2000.
29. T. Drummond and R. Cipolla. Real-time tracking of highly articulated structures in the presence of noisy measurements. In Proc. of ICCV, 2:315–320, 2001.
30. J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. In Proc. of European Conference on Computer Vision, 2:3–19, 2000.
31. H. Sidenbladh, M. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In Proc. of ECCV, 1:784–800, 2002.
32. C. Theobalt, M. Magnor, P. Schüler, and H.-P. Seidel. Combining 2d feature tracking and volume reconstruction for online video-based human motion capture. In Proc. of the 10th Pacific Conference on Computer Graphics and Applications (Pacific Graphics 2002), pages 96–103, Beijing, China, 2002. IEEE.
33. A. Bottino and A. Laurentini. A silhouette based technique for the reconstruction of human movement. In CVIU, 83:79–95, 2001.
34. G. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In Proc. of CVPR, 2003.
35. J. Carranza, C. Theobalt, M.A. Magnor, and H.-P. Seidel. Free-viewpoint video of human actors. In Proc. of SIGGRAPH'03, 569–577, 2003.
36. E. de Aguiar, C. Theobalt, M. Magnor, and H.-P. Seidel. Reconstructing human shape and motion from multi-view video. In 2nd European Conference on Visual Media Production (CVMP), pages 42–49, London, UK, December 2005. The IEE.
37. E. de Aguiar, R. Zayer, C. Theobalt, M. Magnor, and H.-P. Seidel. A framework for natural animation of digitized models. Research Report MPI-I-2006-4-003, Saarbruecken, Germany, July 2006. Max-Planck-Institut fuer Informatik.
38. C. Theobalt, E. de Aguiar, M. Magnor, H. Theisel, and H.-P. Seidel. Marker-free kinematic skeleton estimation from sequences of volume data. In ACM Symposium on Virtual Reality Software and Technology (VRST 2004), pages 57–64, Hong Kong, China, November 2004. ACM.
39. E. de Aguiar, C. Theobalt, M. Magnor, H. Theisel, and H.-P. Seidel. M3: Marker-free model reconstruction and motion tracking from 3d voxel data. In 12th Pacific Conference on Computer Graphics and Applications, PG 2004, pages 101–110, Seoul, Korea, October 2004. IEEE.
40. E. de Aguiar, C. Theobalt, and H.-P. Seidel. Automatic learning of articulated skeletons from 3d marker trajectories. In Proc. of ISVC'06, 2006.
41. B. Allen, B. Curless, and Z. Popovic. Articulated body deformations from range scan data. In Proc. of ACM SIGGRAPH 02, 612–619, 2002.
42. P. Sand, L. McMillan, and J. Popovic. Continuous capture of skin deformation. In ACM Transactions on Graphics, 22(3):578–586, 2003.
43. D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rogers, and J. Davis. SCAPE - shape completion and animation of people. In ACM Transactions on Graphics (Proc. of SIGGRAPH'05), 24(3):408–416, 2005.
44. W. Matusik, C. Buehler, R. Raskar, S.J. Gortler, and L. McMillan. Image-based visual hulls. In Proc. of ACM SIGGRAPH 00, 369–374, 2000.
45. S. Würmlin, E. Lamboray, O.G. Staadt, and M.H. Gross. 3d video recorder. In Proc. of IEEE Pacific Graphics, 325–334, 2002.
46. T. Matsuyama and T. Takai. Generation, visualization, and editing of 3D video. In Proc. of 1st International Symposium on 3D Data Processing Visualization and Transmission (3DPVT'02), 234ff, 2002.
47. M.H. Gross, S. Würmlin, M. Näf, E. Lamboray, C.P. Spagno, A.M. Kunz, E. Koller-Meier, T. Svoboda, L.J. Van Gool, S. Lang, K. Strehlke, A. Vande Moere, and O.G. Staadt. blue-c: a spatially immersive display and 3d video portal for telepresence. In ACM Transactions on Graphics (Proc. of SIGGRAPH'03), 22(3):819–827, 2003.
48. M. Li, H. Schirmacher, M. Magnor, and H.-P. Seidel. Combining stereo and visual hull information for on-line reconstruction and rendering of dynamic scenes. In Proc. of IEEE Multimedia and Signal Processing, 9–12, 2002.
49. C. Lawrence Zitnick, S. Bing Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. In ACM TOG (Proc. SIGGRAPH'04), 23(3):600–608, 2004.
50. T. Kanade, P. Rander, and P.J. Narayanan. Virtualized reality: Constructing virtual worlds from real scenes. In IEEE MultiMedia, 4(1):34–47, 1997.
51. M. Waschbüsch, S. Würmlin, D. Cotting, F. Sadlo, and M. Gross. Scalable 3D video of dynamic scenes. In Proc. of Pacific Graphics, 629–638, 2005.
52. M. Levoy and P. Hanrahan. Light field rendering. In Proc. of ACM SIGGRAPH'96, 31–42, 1996.
53. W. Matusik and H. Pfister. 3D TV: A scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes. In ACM Transactions on Graphics (Proc. of SIGGRAPH'04), 23(3):814–824, 2004.
54. C. Theobalt, J. Carranza, M. Magnor, and H.-P. Seidel. A parallel framework for silhouette-based human motion capture. In Vision, Modeling and Visualization 2003 (VMV-03): Proc., pages 207–214, Munich, Germany, November 2003.
55. C. Theobalt, J. Carranza, M. Magnor, and H.-P. Seidel. Enhancing silhouette-based human motion capture with 3d motion fields. In Jon Rokne, Reinhard Klein, and Wenping Wang, editors, 11th Pacific Conference on Computer Graphics and Applications (PG-03), pages 185–193, Canmore, Canada, October 2003. IEEE.
56. C. Theobalt, J. Carranza, M. Magnor, and H.-P. Seidel. Combining 3d flow fields with silhouette-based human motion capture for immersive video. In Graphical Models, 66:333–351, September 2004.
57. C. Theobalt, J. Carranza, M. Magnor, and H.-P. Seidel. 3d video – being part of the movie. In ACM SIGGRAPH Computer Graphics, 38(3):18–20, August 2004.
58. N. Ahmed, E. de Aguiar, C. Theobalt, M. Magnor, and H.-P. Seidel. Automatic generation of personalized human avatars from multi-view video. In VRST '05: Proc. of the ACM Symposium on Virtual Reality Software and Technology, pages 257–260, Monterey, USA, December 2005. ACM.
59. M. Alexa, M.-P. Cani, and K. Singh. Interactive shape modeling. In Eurographics Course Notes, 2005.
60. O. Sorkine. Differential representations for mesh processing. In Computer Graphics Forum, 25(4), 2006.
61. R. Zayer, C. Rössl, Z. Karni, and H.-P. Seidel. Harmonic guidance for surface deformation. In Marc Alexa and Joe Marks, editors, Proc. of Eurographics 2005, 24:601–609, 2005.
62. R.W. Sumner and J. Popovic. Deformation transfer for triangle meshes. In ACM Transactions on Graphics, 23(3):399–405, 2004.
63. R.W. Sumner, M. Zwicker, C. Gotsman, and J. Popovic. Mesh-based inverse kinematics. In ACM Transactions on Graphics, 24(3):488–495, 2005.
64. K.G. Der, R.W. Sumner, and J. Popovic. Inverse kinematics for reduced deformable models. In ACM Transactions on Graphics, 25(3):1174–1179, 2006.
65. L. Shi, Y. Yu, N. Bell, and W.-W. Feng. A fast multigrid algorithm for mesh deformation. In ACM Transactions on Graphics, 25(3):1108–1117, 2006.
66. J. Huang, X. Shi, X. Liu, K. Zhou, L.-Y. Wei, S.-H. Teng, H. Bao, B. Guo, and H.-Y. Shum. Subspace gradient domain mesh deformation. In ACM Transactions on Graphics, 25(3):1126–1134, 2006.
67. H.P.A. Lensch, W. Heidrich, and H.-P. Seidel. A silhouette-based algorithm for texture registration and stitching. In Graphical Models, 63(4):245–262, 2001.
68. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C++. Cambridge University Press, 2002.
69. P. Fua and Y.G. Leclerc. Object-centered surface reconstruction: Combining multi-image stereo and shading. In International Journal of Computer Vision, 16(1):35–55, 1995.
70. M.N. Kolountzakis and K.N. Kutulakos. Fast computation of the Euclidean distance maps for binary images. In Information Processing Letters, 43(4):181–184, 1992.
71. P.P. Pebay and T.J. Baker. A comparison of triangle quality measures. In Proc. of the 10th International Meshing Roundtable, 327–340, 2001.
72. R. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. In SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
73. G. Farin. Curves and Surfaces for CAGD: A Practical Guide. Morgan Kaufmann, 1999.
74. Y. Lipman, O. Sorkine, D. Cohen-Or, D. Levin, C. Rössl, and H.-P. Seidel. Differential coordinates for interactive mesh editing. In Franca Giannini and Alexander Pasko, editors, Shape Modeling International 2004 (SMI 2004), pages 181–190, Genova, Italy, 2004. IEEE.
75. C. Theobalt, N. Ahmed, E. de Aguiar, G. Ziegler, H.P.A. Lensch, M. Magnor, and H.-P. Seidel. Joint motion and reflectance capture for creating relightable 3d videos. Research Report MPI-I-2005-4-004, Saarbruecken, Germany, April 2005. Max-Planck-Institut fuer Informatik.
76. C. Theobalt, N. Ahmed, H.P.A. Lensch, M. Magnor, and H.-P. Seidel. Enhanced dynamic reflectometry for relightable free-viewpoint video. Research Report MPI-I-2006-4-006, Saarbrücken, Germany, 2006. Max-Planck-Institut fuer Informatik.
4 Utilization of the Texture Uniqueness Cue in Stereo

Xenophon Zabulis

Informatics and Telematics Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
4.1 Introduction

The cue to depth arising from the assumption of texture uniqueness has been widely utilized in approaches to shape-from-stereo. Despite the recent growth of methods that utilize spectral information (color) or silhouettes to three-dimensionally reconstruct surfaces from images, the depth cue due to the texture-uniqueness constraint remains relevant, as it is utilized by a significant number of contemporary stereo systems [1, 2]. Certainly, combination with other cues is necessary for maximizing the quality of the reconstruction, since they provide additional information and since the texture-uniqueness cue exhibits well-known weaknesses, e.g. where texture is absent or at so-called "depth discontinuities". The goal of this work is to provide an approach to the utilization of the texture-uniqueness constraint that is prolific in terms of accuracy, precision, and efficiency, and that can thereafter be combined with other cues to depth. The uniqueness constraint assumes that a given pixel from one image can match no more than one pixel from the other image [3, 4]. In stereo methods, the uniqueness constraint is extended to assume that, locally, each location on a surface is uniquely textured. The main advantages of the cue derived from the uniqueness constraint over other cues to depth are the following. It is independent of silhouette extraction, which requires an accurate segmentation (e.g. [5]). It is also independent of any assumption requiring that cameras surround the scene (e.g. [6]) or share the same baseline (e.g. [7, 8]). Moreover, it does not require that cameras are spectrally calibrated, as in voxel carving/coloring approaches (e.g. [9, 10, 11]). The locality of the cue due to the uniqueness constraint facilitates multi-view and parallel implementations for real-time applications [12, 13, 14]. Traditionally, stereo correspondences have been established through a similarity search, which matched image neighborhoods based on their visual similarity [15, 16]. After the work in [17], volumetric approaches have emerged that establish correspondences among the acquired images after
backprojecting them onto a hypothetical surface. Space-sweeping methods [17, 18, 19, 20, 21, 22, 23] backproject the images onto a planar surface, which is "swept" along depth to evaluate different disparities. Orientation-optimizing methods [24, 25, 26, 27, 28] compare the backprojections onto hypothetical surface segments (surfels), which are evaluated at a range of potential locations and orientations in order to find the location and orientation at which the evaluated backprojections match best. The relation to the traditional, neighborhood-matching way of establishing correspondences is the following. In volumetric methods, a match is detected as a location at which the similarity of backprojections is locally maximized. This location is considered the 3D position of the surface point, and the image points at which it projects are considered as corresponding. Orientation-optimizing methods compensate for the projective distortion in the similarity-matching process and have been reported to be of superior accuracy to window-matching and space-sweeping approaches [1, 29]. The reason is that the matching process is more robust when the compared textures are relieved from the projective distortion, which differs for each camera. On the other hand, their computational cost is multiplied by the number of evaluated orientations of the hypothetical surface segment.

The remainder of this chapter is organized as follows. In Sect. 4.2, relevant work is reviewed. In Sect. 4.3, the uniqueness cue and relevant theory are defined in the context of volumetric approaches and an accuracy-increasing extension to this theory is proposed. In Sect. 4.4, computational optimizations that accelerate the above method and increase its precision are proposed. In addition, the theoretical findings of Sect. 4.3 are employed to define a space-sweeping approach of enhanced accuracy, which is then combined with orientation-optimizing methods into a hybrid method. Finally, in Sect. 4.5, conclusions are drawn and the utilization of the contributions of this work by stereo methods is discussed.
4.2 Related Work

Both traditional and volumetric techniques optimize a similarity criterion along a spatial direction in the world or in the image, to determine the disparity or position of points on the imaged surfaces. In traditional stereo (e.g. [15, 16]) or space-sweeping [17, 18, 19, 20, 21, 22, 23], a single orientation is considered, typically the one frontoparallel to the cameras. Orientation-optimizing techniques [24, 25, 26, 27, 28] consider multiple orientations at an additional computational cost, but also provide an estimate of the normal of the imaged surface. As benchmarked in the literature [1, 2] and explained in Sect. 4.3.2, the accuracy of sweeping methods is limited in the presence of surface slant, compared to methods that account for surface orientation.
Although space-sweeping approaches produce results of similar accuracy to the traditional, neighborhood-matching algorithms [1, 29], they exhibit decreased computational cost. Furthermore, the time-efficiency of sweeping methods is reinforced when they are implemented to execute on commodity graphics hardware [23, 30, 31]. Due to its "single-instruction multiple-data" architecture, graphics hardware executes in parallel the warping and convolution operations that are essential for the space-sweeping approach. Regarding the shape and size of sweeping surfaces, it has been shown [32] that projectively expanding this surface (as in [30, 31, 32, 33, 34]) exploits the available pixel resolution better than implementing the sweep as a simple translation of the sweeping plane [17, 18, 19, 20, 21, 22, 23]. In this context, a more accurate space-sweeping method is proposed in Sect. 4.4.3.

In orientation-optimizing approaches, the size of the hypothetical surface patches has been formulated as constant [24, 25, 26, 27, 28]. Predefined sets of sizes have been utilized in [28] and [24], in a coarse-to-fine acceleration scheme. However, the evaluated sizes were the same for any location and orientation of the patch, rather than modulated as a function of them.

Metrics for evaluating the similarity of two image regions (or backprojections) fall under two main categories: photoconsistency [20, 35] and texture similarity [21, 22, 23, 36, 37], the latter typically implemented using the SAD, SSD, NCC, or MNCC [38] operators. The difficulty with the photoconsistency metric is that radiometric calibration of the cameras is required, which is difficult to achieve and retain. In contrast, the NCC and MNCC metrics are not sensitive to the absence of radiometric calibration, since they compare the correlation of intensity values rather than their differences [39]. Finally, some sweeping-based stereo reconstruction algorithms match sparse image features [17, 18, 19] but are, thus, incapable of producing dense depth maps.

Global optimization approaches have also utilized the uniqueness constraint [7, 26, 40, 41, 42, 43], but they can yield local minima of the overall cost function and are much more difficult to parallelize than local volumetric approaches. As in local methods, the similarity operator is either an oriented backprojection surface segment (e.g. [26]) or an image neighborhood (e.g. [7]). Thus, regardless of how the readings of this operator are utilized by the reconstruction algorithm, the proposed enhancement of the hypothetical surface patch operator should only improve the accuracy of these approaches. Finally, the assumption of surface continuity [3, 4] has been utilized for resolving ambiguities as well as for correcting inaccuracies (e.g. [7, 41]). In traditional stereo, some approaches to enforce this constraint are to filter the disparity map [13], to bias disparity values toward coherence with neighboring ones [7], or to require inter-scanline consistency [42]. The continuity assumption has also been utilized in 2D [44], but seeking continuity in the image intensity domain. The assumption has also been enforced to improve the quality of the reconstruction in post-processing; an abundance of approaches for 3D filtering of the results exists in the deformable models literature (see [45] for a review).
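As an illustration of the texture-similarity metrics discussed above, the following minimal sketch computes NCC and MNCC [38] for two equally sized image windows. The function names and the zero-denominator guard are illustrative choices of this sketch, not part of any of the cited systems.

```python
import numpy as np

def ncc(w1, w2):
    """Normalized cross-correlation of two equally sized windows."""
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def mncc(w1, w2):
    """Modified NCC: 2*cov / (var1 + var2); bounded in [-1, 1] and
    better behaved than NCC for low-variance (weakly textured) windows."""
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    denom = (a * a).sum() + (b * b).sum()
    return 2.0 * (a * b).sum() / denom if denom > 0 else 0.0
```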
4.3 The Texture Uniqueness Cue in 3D

In this section, the texture uniqueness cue is formulated volumetrically, or in 3D, and it is shown that this formulation can lead to more accurate reconstruction methods than the traditional 2D formulation. Next, the spatial extent over which textures are matched is considered and an accuracy-increasing extension to orientation-optimizing approaches is proposed. It is noted that, henceforth, it is assumed that images portray Lambertian surfaces, which can also be locally approximated by planar patches. Extension of these concepts beyond the Lambertian domain can be found in [46].

4.3.1 Uniqueness Cue Formulation

Let $I_i$, $i = 1, 2$, be a calibrated image pair, acquired from two cameras with centers $o_{1,2}$ and principal axes $e_{1,2}$; the cyclopean eye is at $o = (o_1 + o_2)/2$ and the mean optical axis is $e = (e_1 + e_2)/2$. Let also a planar surface patch $S$, of size $\alpha \times \alpha$, centered at $p$, with unit normal $n$. Backprojecting $I_i$ onto $S$ yields image $w_i(p, n)$:

$$w_i(p, n) = I_i\big(P_i \cdot (p + R(n) \cdot [x'\ y'\ 0]^T)\big), \quad (4.1)$$

where $P_i$ is the projection matrix of $I_i$, $R(n)$ is a rotation matrix so that $R(n) \cdot [0\ 0\ 1]^T = n$, and $x', y' \in [-\alpha/2, \alpha/2]$ are local coordinates on $S$. When $S$ is tangent to a world surface, the $w_i$ are identical copies of the surface pattern (see Fig. 4.1, left). Thus $I_1(P_1 x) = I_2(P_2 x)$, $\forall x \in S$, and therefore their similarity is optimal. Otherwise, the $w_i$ are dissimilar, because they are collineations from different surface regions. Assuming a voxel tessellation of space, the locations of surface points and corresponding normals can be
Fig. 4.1. Left: A surface is projectively distorted in images $I_{1,2}$, but the collineations $w_{1,2}$ from a planar patch tangent to this surface are not (from [28], © 2004 IEEE). Right: Illustration of the geometry for (4.4)
recovered by estimating the positions at which similarity is locally maximized along the direction of the surface normal or, otherwise stated, positions that exhibit a higher similarity value than their (two) neighbors in that direction. Such a location will henceforth be referred to as a similarity local maximum or, simply, a local maximum. To localize the similarity local maxima, the function $\vec{V}(p) = s(p)\,\kappa(p)$ is evaluated, where

$$s(p) = \max_{n} \big( \mathrm{sim}(w_1(p, n), w_2(p, n)) \big), \quad (4.2)$$

$$\kappa(p) = \arg\max_{n} \big( s(p) \big), \quad (4.3)$$
where $s(p)$ is the optimal correlation value at $p$, and $\kappa(p)$ the optimizing orientation. The best matching backprojections are $w'_{1,2} = w_{1,2}(p, \kappa)$. The metric sim can be SAD, SSD, NCC, MNCC, etc. To evaluate sim, an $r \times r$ lattice of points is assumed on $S$. In addition, a threshold $\tau_c$ is imposed on $s$ so that local maxima of small similarity value are not interpreted as surface occurrences.

The parameterization of $n$ requires two dimensions and can be expressed in terms of longitude and latitude, which define any orientation within a unit hemisphere. To treat different eccentricities of $S$ equally, the orientation $c = [x_c\ y_c\ z_c]^T$ that corresponds to the pole of this hemisphere points to $o$; that is, $c = p - o$ (see Fig. 4.1, right). The parameterized orientations $n = [x_i, y_i, z_i]^T$ are then:

$$x_i = \frac{z_c \cdot x_c \cdot \cos\omega \cdot \sin\psi - y_c \cdot N_1 \cdot \sin\omega \cdot \sin\psi + x_c \cdot N_2 \cdot \cos\psi}{N_1 \cdot N_2},$$
$$y_i = \frac{z_c \cdot y_c \cdot \cos\omega \cdot \sin\psi + x_c \cdot N_1 \cdot \sin\omega \cdot \sin\psi + y_c \cdot N_2 \cdot \cos\psi}{N_1 \cdot N_2}, \quad (4.4)$$
$$z_i = \frac{z_c \cdot \cos\psi - N_2 \cdot \cos\omega \cdot \sin\psi}{N_1},$$

where $N_1 = \sqrt{x_c^2 + y_c^2 + z_c^2}$, $N_2 = \sqrt{x_c^2 + y_c^2}$, $\omega \in [0, 2\pi)$ and $\psi \in [0, \pi/2)$. The corresponding rotation that maps $[0\ 0\ 1]^T$ to the particular orientation is $R = R_x \cdot R_y$, where $R_x$ is the rotation matrix for a $\cos^{-1} z_k$ rotation about the $x$ axis. If $x_k \neq 0$, then $R_y$ is the rotation matrix for a $\tan^{-1}(y_k / x_k)$ rotation about the $y$ axis; otherwise, $R_y$ is the $3 \times 3$ identity matrix. The computational cost of the optimization for a voxel is $O(N r^2)$, where $N$ is the number of orientations evaluated for $n$.
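The formulation in (4.2)–(4.4) maps directly onto a per-voxel search. The following minimal Python sketch samples the hemisphere of orientations around the pole $c = p - o$ according to (4.4) and keeps the similarity-maximizing normal; the sampling resolution, the backproject routine (which would implement (4.1)) and the sim metric are assumptions made for illustration.

```python
import numpy as np

def orientations(c, n_psi=8, n_omega=32, psi_max=np.pi / 3):
    """Sample unit normals n(psi, omega) around the pole c = p - o, per (4.4).
    Assumes c is not parallel to the z axis (so that N2 > 0)."""
    xc, yc, zc = c
    n1 = np.sqrt(xc**2 + yc**2 + zc**2)
    n2 = np.sqrt(xc**2 + yc**2)
    yield c / n1                                   # psi = 0: the pole itself
    for psi in np.linspace(psi_max / n_psi, psi_max, n_psi):
        for omega in np.linspace(0.0, 2 * np.pi, n_omega, endpoint=False):
            xi = (zc * xc * np.cos(omega) * np.sin(psi)
                  - yc * n1 * np.sin(omega) * np.sin(psi)
                  + xc * n2 * np.cos(psi)) / (n1 * n2)
            yi = (zc * yc * np.cos(omega) * np.sin(psi)
                  + xc * n1 * np.sin(omega) * np.sin(psi)
                  + yc * n2 * np.cos(psi)) / (n1 * n2)
            zi = (zc * np.cos(psi) - n2 * np.cos(omega) * np.sin(psi)) / n1
            yield np.array([xi, yi, zi])

def optimize_voxel(p, o, backproject, sim):
    """Evaluate (4.2) and (4.3) for a voxel centered at p: s(p), kappa(p)."""
    best_s, best_n = -np.inf, None
    for n in orientations(p - o):
        w1, w2 = backproject(p, n)    # r x r backprojections, as in (4.1)
        s = sim(w1, w2)
        if s > best_s:
            best_s, best_n = s, n
    return best_s, best_n             # V(p) = s(p) * kappa(p)
```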
A reconstruction of the locations and corresponding normals of the imaged surfaces can be obtained by detecting the similarity local maxima that are due to the occurrence of a surface. These maxima can be detected as the positions where $s$ is maximized along the direction of the surface normal [28]. An estimate of the surface normal is provided by $\kappa$ since, according to (4.3), $\kappa$ should coincide with the surface normal. A suitable algorithmic approach to the computation of the above locations is given by a 3D version of the Canny edge detector [47]. In this version, the gradient is also 3-dimensional; its magnitude is given by $s(p)$ and its direction by $\kappa(p)$. The non-maxima suppression step of Canny's algorithm performs, in essence, the detection of similarity local maxima, since it rejects all voxels in $\vec{V}$ that are not local maxima along the surface normal. A robust implementation of the above approach is achieved following the work in [48], but substituting the 3D gradient with $\vec{V}$.

4.3.2 Search Direction and Accuracy of Reconstruction

Both optimizing for the orientation of the hypothesized patch $S$ and detecting similarity maxima in the direction of the surface normal increase the accuracy of the final reconstruction. The claimed increase in accuracy for the uniqueness cue is theoretically expected due to the following proposition: the spatial error of surface reconstruction is a monotonically increasing function of the angle between the normal vector of the imaged surface and the spatial direction over which a similarity measure is optimized (proof in Appendix A). This proposition constitutes a mathematical explanation of why space-sweeping approaches are less accurate than orientation-optimizing methods. Intuitively, the inaccuracy is due to the fact that the backprojections on $S$, when it is oriented differently than the imaged surface, do not correspond to the same world points, except for the central point of $S$. The above proposition also explains why similarity local maxima are optimally recovered when backprojections are evaluated tangentially to the surface to be reconstructed.

In addition to the above, when the search direction for local maxima is in wide disagreement with the surface normal, inhibition of valid maxima occurs, deteriorating the quality of the reconstruction even further. The reason is that an inaccurate search direction may point to and, thus, suppress validly occupied neighboring voxels. When $\kappa$ is more accurate, this suppression attenuates, because $\kappa$ points perpendicularly to the surface.

The following experiments confirm that detecting similarity local maxima along $\kappa$ and optimizing $n$ provides a more faithful reconstruction. The first experiment utilizes computer simulation to show that this improvement of accuracy occurs even in synthetic images, where noise and calibration errors are absent. In the experiment, a binocular camera pair that obliquely imaged a planar surface was simulated. A planar patch $S$, oriented so that its normal was equal to $e$, was swept along depth. At each depth, the locations of the surface points that were imaged on $S$, through the backprojection process, were calculated for each camera. Thus, at a given depth, a point on $S$ indicates a pair of world points occurring someplace on the imaged surface. For each depth, the distances of such pairs of world points were summed. In Fig. 4.2(a–c), the setup as well as the initial, middle and final position of the patch are shown. Figure 4.2(d) shows the sum of distances obtained for each depth value, for an $r \times r$, $r = 11$ grid on $S$. According to the prediction, the minimum of this summation function does not occur at $\delta = 0$, which is the correct depth. The dislocation of this minimum is the predicted depth error for this setup. The experiment shows that even in ideal imaging conditions,
Fig. 4.2. Left to right: (a) initial, (b) middle and (c) final position of a hypothetical patch (magenta). Line pencils show the lines of sight from the optical centers to the imaged surface, through the patch. The middle line plots the direction of $e$. The plot on the right (d) shows the sum of distances of points imaged through the same point on the patch, as a function of $\delta$; it indicates that the maximum of similarity is obtained at $\delta \neq 0$ ($< 0$, in this case) (from [55], © 2006 IEEE)
space-sweeping is guaranteed to yield some error if the scene includes any significant amount of slant.

In the second experiment (see Fig. 4.3), a well-textured planar surface was reconstructed considering an $S$ which assumed either solely the frontoparallel orientation (as in space-sweeping) or a set of orientations within a cone of opening $\gamma$ around $e$. Judging by the planarity of the reconstructed surface, the least accurate reconstruction was obtained by space-sweeping. Notice that in (c), due to the compensation for the projective distortion, backprojections $w_{1,2}$ were more similar than in (b). As a result, higher similarity values were obtained and, thus, more local maxima exhibited a similarity value higher than the threshold $\tau_c$. Figure 4.3(d) is discussed in Sect. 4.3.3.

4.3.3 Optimizing Accuracy in Discrete Images

In this subsection, the size $\alpha$ of $S$ and the corresponding image areas where $S$ projects are studied. A modulation of $\alpha$ is proposed to increase the accuracy of the patch operator, as it has been formulated to date [24, 25, 26, 28]. Finally, integration with the surface continuity assumption is demonstrated to relieve the result from residual inaccuracies.

In discrete images, the number of pixels subtended by the projection of $S$ is proportional to the reciprocal of distance squared and to the cosine of the relative obliqueness of $S$ to the cameras. Thus, in (4.3), when $\alpha$ is constant, the greater the obliqueness the fewer the image pixels from which the $r \times r$ image samples are obtained. Therefore, there will always be some obliqueness after which the same intensity values will start to be multiply sampled. In this case, as obliqueness and/or distance increase, the population of these intensities will tend to exhibit reduced variance, because they are sampled from ever fewer pixels. Thus, a bias in favor of greater slants and distances is predicted: mathematically, because variance occurs in the denominator of the correlation function; intuitively, because less image area now supports the similarity matching of backprojections on $S$ and, as a consequence, this matching becomes less robust.
Fig. 4.3. Comparison of methods for similarity local maxima detection. Clockwise from top left: (a) image from a horizontally arranged binocular pair (baseline was 156 mm), showing an XZ section in space at which $\vec{V}$ was calculated three ways: (b) plane-sweeping, (c) optimizing $n$, and (d) updated assuming surface continuity (see Sect. 4.3.3). In (a), checker size was 1 cm² and the target was ≈ 1.5 m from the cameras. In the maps (b–d), dark dots are local maxima, white lines are $\kappa$, voxel = 125 mm³, $r \times r = 21 \times 21$, $\alpha = 20$ mm. In (c), $\gamma = 60°$ and the orientations of $n$ were parameterized every 1°. Notice that the last two methods detect more local maxima, even though the same similarity threshold $\tau_c$ was used in all three conditions (from [55], © 2006 IEEE)
To observe the predicted phenomenon, surface orientation was estimated and compared to ground truth. Experiments were conducted with both real and synthetic images, to stress the point that the discussed inaccuracy cannot be attributed to noise or calibration errors and that, therefore, it must be contained in the information loss due to image discretization. In the first experiment (see Fig. 4.4), a binocular image pair was synthesized to portray a square, textured and planar piece of surface. Equations (4.2) and (4.3) were then evaluated for the central point on the surface. The similarity values obtained for each orientation of $n$ were arranged to form a longitude-latitude map, which can be read as follows. The longitude and latitude axes correspond to the dimensions defined by, respectively, modulating $\psi$ and $\omega$
Fig. 4.4. Two textures rendered on a 251 × 251 mm planar surface patch (left) and the corresponding similarity values obtained by rotating an $S$ concentric with the patch (right). In the maps, camera pose is at (0, 0), crosses mark the maximal similarity value and circles mark ground truth. In the experiment, $\alpha = 100$ mm, $r = 15$, $\gamma = 90°$. The binocular pair was ≈ 1.5 m from the patch and its baseline was 156 mm. The angular parameterization of $n$ was in steps of 0.5°. The errors for the two conditions, measured as the angle between the ground-truth normal and the estimated one, were 2.218° (top) and 0.069° (bottom) (from [56], © 2007 IEEE)
in (4.4); coordinates (0, 0) correspond to $c$. In the map, lighter pixels indicate a high similarity value and darker ones the opposite (henceforth, this convention is followed for all the similarity maps in this chapter). Due to the synthetic nature of the images, which facilitated a perfect calibration, only a small amount of the predicted inaccuracy was observed. In the second experiment, calibration errors give rise to even more misleading local maxima and, also, the similarity value at very oblique orientations of $n$ ($> 60°$) is observed to reach extreme positive or negative values. In Fig. 4.5, to indicate the rise of the spurious maximum at the extremes of the correlation map, the optimization was computed twice: once for $\gamma = 120°$ and once for $\gamma = 180°$. In both cases, the global maximum occurred at the extreme border of this map, thus corresponding to a more oblique surface normal, relative to the camera, than ground truth.

The above phenomenon can be alleviated if the size of the backprojection surface $S$ is modulated so that its image area remains invariant. In particular, the side of $S$ (or diameter, for a circular $S$) is modulated as:

$$\alpha = \frac{\alpha_0 \cdot d}{d_0 \cdot \cos\omega}; \qquad \omega = \cos^{-1}\frac{v \cdot n}{|v| \cdot |n|}, \quad (4.5)$$

where $v = p - o$, $d = |v|$, $\omega$ is the angle between $v$ and $n$, and $d_0$, $\alpha_0$ are initial parameters.
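A minimal sketch of the size modulation in (4.5); the default values of $\alpha_0$ and $d_0$ and the clamping of near-grazing angles are illustrative assumptions of the sketch, not prescribed by the method.

```python
import numpy as np

def patch_size(p, o, n, alpha0=20.0, d0=1500.0, omega_max=np.radians(80)):
    """Side of S per (4.5), keeping its image area roughly invariant
    to distance and obliqueness (units as in the experiments, mm)."""
    v = p - o                                            # viewing vector
    d = np.linalg.norm(v)                                # distance to patch center
    cos_w = abs(np.dot(v, n)) / (d * np.linalg.norm(n))  # sign of n irrelevant
    omega = np.arccos(np.clip(cos_w, 0.0, 1.0))
    omega = min(omega, omega_max)      # avoid blow-up as omega -> 90 degrees
    return alpha0 * d / (d0 * np.cos(omega))
```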
Fig. 4.5. Comparison of techniques. The experiment of Fig. 4.4 is repeated for the first two frames of the "Venus" Middlebury sequence and for two different values of $\gamma$: 120° (top map) and 180° (middle map). In the experiment, $\alpha = 250$ length units, the baseline was 100 length units and $r = 151$. The surface point for which (4.2) and (4.3) were evaluated is marked with a circle (left). The projection of $S$ subtended an area of ≈ 50 pixels. The bottom map shows the increase in accuracy obtained by the size modulation of $S$ with respect to obliqueness (see forward in text). The mapping of similarity values to intensities is individual for each map (from [56], © 2007 IEEE)
Notice that even for a single location, size still varies as $n$ is varied. Figures 4.5 and 4.6 show the angular and spatial improvement in accuracy induced by the proposed size modulation. They compare the reconstructions obtained with a patch whose size was modulated as above against those obtained with a constant-sized $S$, as practiced to date in [24, 25, 26, 28]. A "side effect" of the above modulation is that the larger the distance and the obliqueness of a surface, the lower the spatial frequency at which it is reconstructed. This effect is considered a natural tradeoff, since distant and oblique surfaces are also imaged at lower frequencies.

Assuming surface continuity has been shown to reduce inaccuracies due to noise or lack of resolution in a wide variety of methods and especially in global-optimization approaches (see Sect. 4.2). To demonstrate the compatibility of the proposed operator with these approaches and suppress residual inaccuracies, the proposed operator is implemented with feedback obtained from a surface-smoothing process. Once local maxima have been detected in $\vec{V}$, the computed $\kappa$'s are updated as follows. For voxels where a similarity local maximum occurs, $\kappa$ is replaced by the normal of the least-squares fitting plane through the neighboring occupied voxels. For an empty voxel $p_e$, the updated value of $\kappa$ is $\sum_j \beta_j \kappa(p_j) / \sum_j \beta_j$, where $j$ enumerates the occupied voxels $p_j$ within $p_e$'s neighborhood. After the update, local maxima are re-detected in the new $\vec{V}$. The results are more accurate, because similarity local maxima are detected along a more accurate estimate of the normal. Note that if the optimization of $n$ is avoided, the initial local maxima are less accurately localized and so are the updated $\kappa$'s.
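The normal of the least-squares fitting plane used in this update is obtainable as the direction of least variance of the neighboring occupied voxel centers. A short sketch follows; the inverse-distance choice for the weights $\beta_j$ is an assumption of the sketch, since the text leaves them unspecified.

```python
import numpy as np

def ls_plane_normal(points):
    """Normal of the least-squares plane through 3D points (n x 3 array):
    the singular vector of the centered data with the smallest singular value."""
    q = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(q, full_matrices=False)
    return vt[-1]

def update_kappa_empty(p_e, occupied, kappas, eps=1e-9):
    """Weighted average of the neighbors' normals for an empty voxel p_e,
    with beta_j chosen here as inverse distance (an assumption)."""
    beta = 1.0 / (np.linalg.norm(occupied - p_e, axis=1) + eps)
    k = (beta[:, None] * kappas).sum(axis=0) / beta.sum()
    return k / np.linalg.norm(k)
```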
Fig. 4.6. Comparison of techniques. Shown are a stereo pair (left column) and three separate calculations of $s$ across a vertical section through the middle of the foreground surface. The bottom figures are zoomed-in detail of the part that corresponds to the foreground; the z-axes (horizontal in the maps) are logarithmic. In the bottom figures, ground truth is marked with a dashed line. In the 2nd column, a fine $\alpha$ was used, hence the noisy response at the background. Using a coarse $\alpha$ (3rd column) yields a smoother response at greater distances, but diminishes any detail that could be observed at short range. In the 4th column, $\alpha$ is projectively increased, thus normalizing the precision of reconstruction by the area that a pixel images at that distance (from [56], © 2007 IEEE)
4.4 Increasing Performance

Two techniques are proposed for increasing the performance of the proposed implementation of the uniqueness cue. The first aims at providing high-precision results and the second at reducing computational cost. In addition, a hybrid approach is proposed that combines the rapid execution of space-sweeping with the increased accuracy of the proposed orientation-optimizing method. To enhance the accuracy of the space-sweeping part of the proposed approach, an enhanced version of space-sweeping is introduced, based on the conclusions of Sect. 4.3.2.

4.4.1 Precision

In volumetric methods, the required memory and computation increase by a cubic factor as voxel size decreases and, thus, computational requirements are quite demanding in high-precision reconstructions. The proposed technique refines the initial voxel-parameterized reconstruction to sub-voxel precision, given $\vec{V}$ and the detected similarity local maxima as input. The local maxima are in voxel parameterization and are treated as a coarse approximation of the imaged surface. The method densely interpolates through the detected similarity local maxima. This interpolation is guided by $\vec{V}$, in order for the result to pass through the locations where similarity ($s$, or $|\vec{V}|$) is locally maximized.
To formulate the interpolation, $S_f$ henceforth refers to the 0-isosurface of $G = |\nabla \vec{V}|$, or otherwise to the set of locations at which $G = 0$. In the vicinity of the detected local maxima, similarity is locally maximized and, thus, the derivative's norm ($G$) should be 0. The result is defined as the localization of $S_f$ at the corresponding regions. Guiding the interpolation with the locations of $S_f$ utilizes the obtained similarity values to accurately increase the precision of the reconstruction, rather than blindly interpolating through the detected similarity local maxima.

The interpolation utilizes the Radial Basis Function (RBF) framework in [49] to approximate the isosurface. This framework requires pivots to guide the interpolation, which in the present case are derived from the detected local maxima. For each one of them, the values of $G$ at the locations $p_{1,2} = p_m \pm \lambda\kappa$, where $p_m$ is the position of the local maximum, are estimated by trilinear interpolation. The pivots are assigned values $\xi_{1,2} = c \cdot G(p_{1,2})$, where $c$ is $-1$ for the closer of the two pivot points to the camera and $1$ for the other. Values $\xi_{1,2}$ are of opposite sign, to constrain the 0-isosurface to occur in between them. The value of $\lambda$ is chosen to be less than voxel size (e.g. 0.9) to avoid interference with local maxima occurring at neighboring voxels [50]. Function $G$ is approximated in high resolution by the RBF framework and the isosurface is extracted by the Marching Cubes algorithm [51]. The result is represented as a mesh of triangles. The proposed approach is, in essence, a search for the zero-crossings of $G$.

The computational cost of the above process is much less than the cost of evaluating $\vec{V}$ at the precision that is interpolated. However, it is still a computationally demanding process of complexity $O(N^3)$, where $N$ is the number of data points. Even though the optimization in [50], which reduces complexity to $O(N \log N)$, was adopted, the number of data points in wide-area reconstructions can be too large to obtain results in real time. To at least parallelize the process, the reconstruction volume can be tessellated into overlapping cubes and the RBF can be independently computed at each. Due to the overlap of the cubes, no significant differences in the reconstruction were observed between fitting the RBF directly to the whole reconstruction and fitting it in the individual cubes. The partial meshes are finally merged as in [52].
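A sketch of the pivot construction described above; `G` stands for a callable that trilinearly interpolates the gradient-norm volume, and $\lambda$ is in voxel units, both assumptions of this sketch.

```python
import numpy as np

def rbf_pivots(maxima, kappas, camera, G, lam=0.9):
    """Build (point, value) pivot pairs around each similarity local maximum.
    The two pivots straddle the maximum along kappa and receive values of
    opposite sign, forcing the 0-isosurface to pass between them."""
    pivots = []
    for pm, k in zip(maxima, kappas):
        p1, p2 = pm + lam * k, pm - lam * k
        # c = -1 for the pivot closer to the camera, +1 for the other
        if np.linalg.norm(p1 - camera) < np.linalg.norm(p2 - camera):
            pivots += [(p1, -G(p1)), (p2, +G(p2))]
        else:
            pivots += [(p1, +G(p1)), (p2, -G(p2))]
    return pivots
```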
4.4.2 Acceleration

Two hierarchical, coarse-to-fine iterative methods are proposed for the acceleration of the search for similarity local maxima.

The first is an iterative coarse-to-fine search that reduces the number of evaluated $n$'s in (4.3). In this formula, the exhaustive search computes $s$ for every $n$ within a cone of opening $\gamma$. At each iteration $i$: (a) the cone is canonically sampled and the optimizing direction $\kappa_i$ is selected amongst the sampled directions, (b) the sampling gets exponentially denser, but (c) only the samples within the opening of an exponentially narrower cone around $\kappa_{i-1}$ are evaluated. At each iteration, the opening $\gamma_i$ of the cone is reduced as $\gamma_{i+1} = \gamma_i / \delta$, $\delta > 1$ (in our experiments, $\delta = 2$). Iterations begin with $\kappa_1 = c$ and end when $\gamma_i$ falls below a precision threshold $\tau_\gamma$. For a voxel at $p$, the parameterized normals are given by (4.4), modulating $\psi$ to be in $[0, \gamma_i]$ and setting $c = p - o$. In Fig. 4.7, the accuracy of the proposed method is shown as a function of computational effort. As ground truth, the result of the exhaustive search was considered, which required 10800 invocations of the similarity function. It can be seen that after 3 iterations, which correspond to a speedup $> 7$, the obtained surface normal estimate is inaccurate by less than 3°. After 7 iterations, accuracy tends to be better than 1.5° (speedup ≈ 2). Given the correction of the surface normal in Sect. 4.3.3, the residual minute inaccuracies may be neglected without consequences for the quality of the reconstruction, and the process is stopped at the 3rd iteration. Also, in practice, a speedup of ≈ 20 is obtained, since in our implementation only the 1st iteration is performed if all samples are below the threshold.

The second method reduces the number of evaluated voxels, by iteratively focusing computational effort on the volume neighborhoods of similarity local maxima. It is based on a scale-space treatment of the input images. At each iteration, $\alpha_i = \alpha_0 / 2^i$ and $I_{1,2}$ are convolved with a Gaussian of $\sigma_i = \sigma_0 / 2^i$. Also, voxel volume is reduced by $1/2^3$ and correlation is computed only at the neighborhoods of the local maxima detected in the previous iteration. The effect of these modulations is that at initial scales correspondences are evaluated for coarse-scale texture, while finer scales utilize more image detail. Their purpose is to efficiently compare $w_{1,2}$ at coarse scales. At these scales, the projections of the points of $S$ in the image are sparse and, thus, even a minute calibration error causes significant miscorrespondence of their projections. Smoothing, in effect, decreases image resolution and, thus, more correspondences are established at coarse scales. In Fig. 4.8, the method is demonstrated. Utilizing 3 iterations of the above algorithm, no errors in the first iteration that led to a void in the final reconstruction were observed, but of course this tolerance is a function of the available image resolution. In our experiments, a speedup of ≈ 5 was observed on average.
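The first acceleration scheme can be sketched as follows; the canonical cone sampling and the per-direction score function (which would evaluate (4.2) for one $n$) are simplified stand-ins of this sketch.

```python
import numpy as np

def sample_cone(axis, gamma, n):
    """Roughly uniform unit directions within a cone of opening gamma
    around a unit axis (a simple construction, for illustration)."""
    a = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, a); u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    dirs = [axis]
    for psi in np.linspace(gamma / 4, gamma, 4):
        for omega in np.linspace(0, 2 * np.pi, max(n // 4, 1), endpoint=False):
            dirs.append(np.cos(psi) * axis
                        + np.sin(psi) * (np.cos(omega) * u + np.sin(omega) * v))
    return dirs

def coarse_to_fine_kappa(score, c, gamma=np.radians(60),
                         delta=2.0, tau_gamma=np.radians(1), n_samples=32):
    """Iteratively narrow the cone around the best direction so far:
    gamma_{i+1} = gamma_i / delta, until gamma falls below tau_gamma."""
    kappa = c / np.linalg.norm(c)      # iterations begin with kappa_1 = c
    while gamma >= tau_gamma:
        kappa = max(sample_cone(kappa, gamma, n_samples), key=score)
        gamma /= delta                 # exponentially narrower cone
        n_samples *= 2                 # exponentially denser sampling
    return kappa
```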
Fig. 4.7. Mean (left) and variance (middle) of the error of the angular optimization as a function of computational effort, measured in similarity metric invocations. In the experiment, the results of 10³ estimations were averaged; $\gamma_1 = 60°$, $\delta = 2$ and $\tau_\gamma = 1°$. The right plot illustrates the hierarchical evolution of the considered orientations, as points on the unit sphere
Fig. 4.8. Coarse-to-fine localization of similarity local maxima. Top: a digital photograph from a ≈ 40 cm-baseline binocular pair and, superimposed, a section in space that perpendicularly intersects the imaged piece of rock. The 2D map on the right shows the result of local maxima detection across this section. Marked in white are the detected local maxima at voxel precision, and in gray the result of their subvoxel approximation. These local maxima were then reprojected to the original image and marked (left). At the bottom, the three maps show the result of the coarse-to-fine computation of $\vec{V}$, for the same section in space. In the experiment, $\alpha_0 = 8$ cm, voxel = (4 cm)³, $r = 21$, $\sigma_0 = 5$ (from [57], © 2005 IEEE)
4.4.3 Sphere-sweeping

In this subsection, the geometry of space-sweeping is revisited and a spherical parameterization of the sweeping surface is proposed and evaluated. The proposed approach substitutes the backprojection plane of space-sweeping with a spherical sector that projectively expands from the cyclopean eye outwards. Using this backprojection surface, a visibility ray $v$ departing from the optical center is always perpendicular to the backprojection surface, for any eccentricity within the field of view (FOV) (see Fig. 4.9, left). Thus, the number of sampled image pixels per unit area of backprojection surface is maximized. In contrast, a frontoparallel planar surface is imaged increasingly slanted relative to $v$ as the considered point moves to the periphery of the image and, therefore, a smaller accuracy is expected, based on the conclusion of Sect. 4.3.3 (see [32] for a proof).

The method is formulated as follows. Let a series of concentric and expanding spherical sectors $S_i$ at corresponding distances $d_i$ from the cyclopean
Fig. 4.9. Illustration of the sector-based (left) and voxel-based (right) volume tessellations. Visibility is naturally expressed in the first representation, whereas in the second, traversing voxels diagonally is required (from [32], © 2006 IEEE)
eye ($C$). Their openings $\mu$, $\lambda$ in the horizontal and vertical direction, respectively, are matched to the horizontal and vertical FOVs of the cameras and tessellated by an angular step $c$. Parameterization variables $\psi$ and $\omega$ are determined as $\psi \in \{c \cdot i - \mu;\ i = 0, 1, 2, \ldots, 2\mu/c\}$ and $\omega \in \{c \cdot j - \lambda;\ j = 0, 1, 2, \ldots, 2\lambda/c\}$, with $\mu/c$ and $\lambda/c$ integers. For both $\psi$ and $\omega$, the value 0 corresponds to the orientation of the mean optical axis $e$. To generate the sectors $S_i$, a corresponding sector $S_0$ is first defined on a unit sphere centered at $O = [0\ 0\ 0]^T$. A point $p = [x\ y\ z]^T$ on $S_0$ is given by $x = \sin\psi$, $y = \cos\psi \sin\omega$, $z = \cos\psi \cos\omega$. Its corresponding point $p'$ on $S_i$ is then:

$$p' = d_i \left[ R_z(-\theta)\, R_y(-\phi)\, p + C \right], \quad (4.6)$$
where $R_y$ and $R_z$ are rotation matrices for rotations about the $y$ and $z$ axes, $v_{1,2}$ are unit vectors on the principal axes of the cameras, $v = (v_1 + v_2)/2$, and $\theta$ (longitude) and $\phi$ (colatitude) are $v$'s spherical coordinates. Computational power is conserved, without reducing the granularity of the reconstructed depths, when parameterizing $d_i$ on a disparity basis [53]: $d_i = d_0 + \beta^i$, $i = 1, 2, \ldots, i_N$, where $d_0$ and $i_N$ define the sweeping interval and $\beta$ is modulated so that the farthest distance is imaged at sufficient resolution.

The rest of the sweeping procedure is similar to plane-sweeping and is, thus, only briefly overviewed. For each $S_i$, the stereo images ($\geq 2$) are sampled at the projections of $S_i$'s points onto the acquired images, thus forming two ($2\mu/c \times 2\lambda/c$) backprojection images. Backprojecting and locally comparing images is straightforwardly optimized by a GPU, as a combination of image-difference and convolution operations (e.g. [37, 54]). The highest local maximum along a ray of visibility is selected as the optimum depth. Correlation values are interpolated along depth to obtain subpixel accuracy.

Parameterizing the reconstruction volume into sectors instead of voxels provides a useful surface parameterization, because the data required to compute visibility are already structured with respect to rays from the optical center. These data refer to a sector-interpretable grid (see Fig. 4.9), but
are structured in memory in a conventional 3D matrix. The application of visibility then becomes more natural, because the oblique traversal of a regular voxel space is sensitive to discretization artifacts. Finally, the computational acceleration obtained by graphics hardware is equally applicable to the resulting method as to plane-sweeping [32].
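A sketch of the sector generation in (4.6); the FOV openings, the angular step and the distance $d_i$ are illustrative parameters, and the placement of $C$ inside the scaled bracket follows the formula as printed.

```python
import numpy as np

def rot_y(a):
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

def sector_points(C, theta, phi, d_i, mu, lam, step):
    """Points of the spherical sector S_i at distance d_i, per (4.6);
    psi and omega sample the horizontal/vertical openings (mu, lam)."""
    R = rot_z(-theta) @ rot_y(-phi)
    psis = np.arange(-mu, mu + step / 2, step)
    omegas = np.arange(-lam, lam + step / 2, step)
    pts = np.empty((len(psis), len(omegas), 3))
    for a, psi in enumerate(psis):
        for b, om in enumerate(omegas):
            p = np.array([np.sin(psi),
                          np.cos(psi) * np.sin(om),
                          np.cos(psi) * np.cos(om)])   # point on S_0
            pts[a, b] = d_i * (R @ p + C)              # its point on S_i
    return pts
```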
Fig. 4.10. Comparison of plane- and sphere-sweeping. Each row shows an image from a 156 mm-baseline binocular pair (left) and a planar section (as in Fig. 4.8) in space, illustrating the reconstructions obtained from plane (middle) and sphere (right) sweeping. The reconstruction points shown are the ones that occur within the limits of the planar section. In the first two rows, the section intersects the reconstructed surface perpendicularly. In the 3rd row, the section is almost parallel to the viewing direction and intersects the imaged sphere perpendicularly to its equator. In the last row, the section is almost parallel to the image plane. The limits of the sections and the reconstructed points are reprojected onto the images of the left column. The superiority of spherical sweeping is pronounced at the sections of the reconstructions that correspond to the periphery of the image (from [32], © 2006 IEEE)
The proposed approach was compared to plane-sweeping on the same data and under the same experimental conditions. Images were 480 × 640 pixels and target surfaces occurred from 1 m to 3 m from the cameras. In Fig. 4.10, slices along depth that were extracted from the reconstructions are compared. Almost no difference between the two methods can be observed in the reconstructions obtained from the center of the images (top row). The results differ the most when comparing reconstructions obtained from the periphery of the images (remaining rows). In terms of reconstructed area, sphere-sweeping provided approximately 15% more reconstructed points.

4.4.4 Combination of Approaches

The methods presented in this chapter were combined into a stereo algorithm that couples the efficiency of space-sweeping with the accuracy of orientation optimization. Results are shown in Figs. 4.11, 4.12 and 4.13. The algorithm initiates by reconstructing a given scene with the sphere-sweeping method (see Sect. 4.4.3). Then the proposed orientation-optimizing operator is employed and similarity local maxima are detected. Indicatively, performing
Fig. 4.11. Reconstruction results. Images from three 20 cm-baseline binocular pairs (1st row). Demonstration of the coarse-to-fine spatial refinement of the reconstruction for the 1st binocular pair (2nd row), using the approach of Sect. 4.4.2. Multiview reconstruction of the scene (3rd row)
Fig. 4.12. Image from a binocular pair and reconstruction, utilizing the precision enhancement of Sect. 4.4.1. The output is a point cloud and image size was 640 × 480. The last row compares patch-based reconstructions for plane-sweeping (left), orientation optimization (middle) and enforcement of continuity (right) (from [55], © 2006 IEEE)
the correlation step for the last example required 286 sec on a Pentium at 3.2 GHz, where image size was ≈ 10⁶ pixels, voxel size was 10 cm, $r = 21$, $\gamma = 60°$. A wide outdoor-area reconstruction, in Fig. 4.11, demonstrates the multiview expansion of the algorithm. For multiple views, at the end of each scale iteration the space-carving rule [9] is applied to detect empty voxels and further reduce the computation at the next scale. At the last scale, the obtained $\vec{V}$'s from each view are combined with the algorithm in [28].
Fig. 4.13. Image from a 40 cm-baseline binocular pair and reconstruction, utilizing the precision enhancement of Sect. 4.4.1. The output isosurface is represented as a mesh and texture is directly mapped onto it
4.5 Conclusion

This chapter is concerned with the depth cue due to the assumption of texture uniqueness, as one of the most powerful and widely utilized approaches to shape-from-stereo. The factors that affect the accuracy of the uniqueness cue were studied, and the reasons that render orientation-optimizing methods superior in accuracy to traditional and space-sweeping approaches were explained. Furthermore, the proposed orientation-optimizing techniques improve the accuracy and precision of orientation optimization as practiced to date. Acceleration of the orientation-optimization approach is achieved by the introduction of two coarse-to-fine techniques that operate in the spatial and angular domains of the patch-based optimization. Finally, a hybrid approach is proposed that utilizes the rapid execution of a novel, accuracy-enhanced version of space-sweeping to obtain an initial approximation of the reconstruction result. This result is then refined, based on the proposed techniques for the orientation optimization of hypothetical surface patches.

The proposed extensions to the implementation of the uniqueness cue to depth can be integrated with diverse approaches to stereo. The size modulation
proposed in Sect. 4.3.3 is directly applicable to any approach that utilizes a planar surface patch, in either simple or multi-view stereo. Furthermore, the orientation-optimizing implementation of the uniqueness constraint has been demonstrated to be compatible with the assumption of surface continuity and is, thus, applicable to global optimization approaches to stereo.

A future research avenue of this work is the integration of the uniqueness cue with other cues to depth. The next direction of this work is the utilization of parallel hardware for the real-time computation of wide-area reconstructions, based on the fact that each of the methods proposed in this chapter is massively parallelizable. Computing the similarity $s$ and orientation $\kappa$ can be performed independently for each voxel and each orientation. Furthermore, the detection of similarity local maxima can also be computed in parallel, if the voxel space is tessellated into overlapping voxel neighborhoods. Regarding the proposed sweeping method, the similarity for each depth layer and for each pixel within this layer can also be computed independently. Once the similarity values for each depth layer are available, the detection of the similarity-maximizing depth value for each column of voxels along depth can also be performed independently.
A Appendix

It is shown that $s$ is maximized at the locations corresponding to the imaged surface only when the values of $s$ are computed from collineations that are parallel to the surface. Definitions are initially provided. Let:

• Cameras at $T$, $Q$ that image a locally planar surface.
• Sweeping direction $v$, which given a base point (e.g. the cyclopean eye) defines a line $L$, which intersects the imaged surface at $K$. The two general types of possible configurations are shown in Fig. 4.14.
• The hypothetical backprojection planar patch $S$, onto which the acquired images are backprojected. Patch $S$ is on $L$, centered at $D \in L$ and oriented as $v$.
• Function $b(X_1, X_2)$, $X_{1,2} \in \mathbb{R}^3$, the intersection of the line through $X_1$ and $X_2$ with the imaged surface. The surface point that is imaged at some point $A$ on $S$ is $b(A, T)$, where $T$ is the optical center. In the figure, $B = b(A, Q)$ and $C = b(A, T)$.
• Point $O$, the orthogonal projection of $A$ on the surface.
• $\theta$ and $\phi$, the acute angles formed by the optical rays through $A$, from $T$ and $Q$.
Assuming texture uniqueness, the backprojection images of $B$ and $C$ are predicted to be identical only when $B$ and $C$ coincide. Thus, the distance $|BC|$ for some point on $S$ is studied, assuming that when $|BC| \to 0$ the correlation of backprojections is maximized. As seen in Fig. 4.14,
Fig. 4.14. Proof figures. See corresponding text
• $|BC|$ is either $|OB| + |OC|$ (middle) or $|OC| - |OB|$ (left),
• $|AO| = |AB| \sin\theta = |AC| \sin\phi$,
• $|OC| = |AC| \cos\phi$, $|OB| = |AB| \cos\theta$, and $\phi < \theta < \pi/2$.

Thus, $|BC|$ is either

$$|AO|\,\frac{\tan\theta + \tan\phi}{\tan\theta \tan\phi} \quad \text{or} \quad |AO|\,\frac{\tan\theta - \tan\phi}{\tan\theta \tan\phi}.$$

Both quantities are positive: the first because $\theta, \phi \in (0, \pi/2]$ and the second because $\theta > \phi$ (see left of the above figure). Therefore, the monotonicity of $|BC|$ as a function of $\delta = |KD|$ is fully determined by $|AO|$. In the special case where $\theta$ or $\phi$ is $\pi/2$, say $\theta$, $|BC| = |OC| = |AC| \cos\phi = |AO| / \tan\phi$, and $|BC|$ is a monotonically increasing function of $\delta$. Thus, the similarity of the backprojection images on $S$ is indeed maximized when $D$ coincides with $K$. This case corresponds to $\psi = 0$ (see forward in text).

When $v \neq n$, it is only for point $D$ that $b(D, T)$ and $b(D, Q)$ will coincide when $\delta = 0$. The geometry for all other points on $S$ is shown in Fig. 4.14 (right). From the figure, it is shown that in this case $b(A, T)$ and $b(A, Q)$ coincide only when $\delta > 0$ ($|KD| > 0$). For all the rest of the points, the depth error is:
$$|KD|^2 = r^2 \tan^2\omega, \quad (4.7)$$
which shows that the error is determined not only by $r$ ($= |AD|$) but also by the "incidence angle" $\psi = \pi/2 - \omega$ between $v$ and $n$. Equation (4.7) shows that when $v = n$ ($\omega = 0$), the similarity of backprojections is maximized at the location of the imaged surface (when $\delta = 0$). In contrast, when $\omega \to \pi/2$ (or, when the surface is imaged from an extremely oblique angle), the reconstruction error ($|KD|$) tends to infinity. Since (4.7) holds for every point on $S$, it is concluded that the error in reconstruction is a monotonically increasing function of the angle $\omega$ between the search direction and the normal of the imaged surface.
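To make the growth rate in (4.7) concrete, a brief numeric illustration with arbitrarily chosen values of $r$ and $\omega$:

$$|KD| = r\tan\omega: \quad r = 5 \;\Rightarrow\; |KD| \approx 2.9\ (\omega = 30^{\circ}),\quad 5.0\ (\omega = 45^{\circ}),\quad 8.7\ (\omega = 60^{\circ}),\quad 28.4\ (\omega = 80^{\circ}).$$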
Acknowledgement This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3):7–42, 2002.
2. M. Z. Brown, D. Burschka, and G. D. Hager. Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008, 2003.
3. D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science, 194:283–287, 1976.
4. D. Marr and T. Poggio. A computational theory of human stereo vision. In Royal Society of London Proceedings Series B, Vol. 204, pp. 301–328, 1979.
5. A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):150–162, 1994.
6. K. M. Cheung, T. Kanade, J. Y. Bouguet, and M. Holler. A real time system for robust 3D voxel reconstruction of human motions. In IEEE Computer Vision and Pattern Recognition, Vol. 2, pp. 714–720, 2000.
7. V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In European Conference on Computer Vision, Vol. 1, pp. 379–393, 2002.
8. M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353–363, 1993.
9. K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):199–218, 2000.
10. W. Culbertson, T. Malzbender, and G. Slabaugh. Generalized voxel coloring. In Vision Algorithms: Theory and Practice, pp. 100–115, 1999.
11. G. Slabaugh, B. Culbertson, T. Malzbender, M. Livingston, I. Sobel, M. Stevens, and R. Schafer. Methods for volumetric reconstruction of visual scenes. International Journal of Computer Vision, 57(3):179–199, 2004.
12. J. Lanier. Virtually there. Scientific American, 284(4):66–75, 2001.
13. J. Mulligan, X. Zabulis, N. Kelshikar, and K. Daniilidis. Stereo-based environment scanning for immersive telepresence. IEEE Transactions on Circuits and Systems for Video Technology, 14(3):304–320, 2004.
14. N. Kelshikar, X. Zabulis, K. Daniilidis, V. Sawant, S. Sinha, T. Sparks, S. Larsen, H. Towles, K. Mayer-Patel, H. Fuchs, J. Urbanic, K. Benninger, R. Reddy, and G. Huntoon. Real-time terascale implementation of tele-immersion. In International Conference on Computational Science, pp. 33–42, 2003.
15. N. Ayache. Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception. MIT Press, Cambridge MA, 1991.
16. E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, New Jersey, 1998.
17. R. T. Collins. A space-sweep approach to true multi-image matching. In IEEE Computer Vision and Pattern Recognition, pp. 358–363, 1996.
18. J. Bauer, K. Karner, and K. Schindler. Plane parameter estimation by edge set matching. In 26th Workshop of the Austrian Association for Pattern Recognition, pp. 29–36, 2002.
19. C. Zach, A. Klaus, J. Bauer, K. Karner, and M. Grabner. Modelling and visualizing the cultural data set of Graz. In Virtual Reality, Archaeology, and Cultural Heritage, 2001.
20. K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):199–218, 2000.
21. C. Zhang and T. Chen. A self-reconfigurable camera array. In Eurographics Symposium on Rendering, 2004.
22. T. Werner, F. Schaffalitzky, and A. Zisserman. Automated architecture reconstruction from close-range photogrammetry. In CIPA International Symposium, 2001.
23. C. Zach, A. Klaus, B. Reitinger, and K. Karner. Optimized stereo reconstruction using 3D graphics hardware. In Workshop of Vision, Modelling, and Visualization, pp. 119–126, 2003.
24. A. Bowen, A. Mullins, R. Wilson, and N. Rajpoot. Light field reconstruction using a planar patch model. In Scandinavian Conference on Image Processing, pp. 85–94, 2005.
25. R. Carceroni and K. Kutulakos. Multi-view scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape & reflectance. International Journal of Computer Vision, 49(2–3):175–214, 2002.
26. O. Faugeras and R. Keriven. Complete dense stereovision using level set methods. In European Conference on Computer Vision, pp. 379–393, 1998.
27. A. S. Ogale and Y. Aloimonos. Stereo correspondence with slanted surfaces: Critical implications of horizontal slant. In IEEE Computer Vision and Pattern Recognition, Vol. 1, pp. 568–573, 2004.
28. X. Zabulis and K. Daniilidis. Multi-camera reconstruction based on surface normal estimation and best viewpoint selection. In IEEE International Symposium on 3D Data Processing, Visualization and Transmission, pp. 733–740, 2004.
29. D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In IEEE Workshop on Stereo and Multi-Baseline Vision, pp. 131–140, 2001.
30. M. Li, M. Magnor, and H. P. Seidel. Hardware-accelerated rendering of photo hulls. Eurographics, 23(3), 2004.
31. R. Yang, G. Welch, and G. Bishop. Real-time consensus-based scene reconstruction using commodity graphics hardware. In Pacific Graphics, 2002.
32. X. Zabulis, G. Kordelas, K. Mueller, and A. Smolic. Increasing the accuracy of the space-sweeping approach to stereo reconstruction, using spherical backprojection surfaces. In International Conference on Image Processing, 2006.
33. M. Pollefeys and S. Sinha. Iso-disparity surfaces for general stereo configurations. In European Conference on Computer Vision, pp. 509–520, 2004.
34. V. Nozick, S. Michelin, and D. Arquès. Image-based rendering using plane-sweeping modelisation. In International Association for Pattern Recognition – Machine Vision Applications, pp. 468–471, 2005.
35. S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999.
36. R. Szeliski. Prediction error as a quality metric for motion and stereo. In International Conference on Image Processing, Vol. 2, pp. 781–788, 1999.
37. I. Geys, T. P. Koninckx, and L. J. Van Gool. Fast interpolated cameras by combining a GPU based plane sweep with a max-flow regularisation algorithm. In IEEE International Symposium on 3D Data Processing, Visualization and Transmission, pp. 534–541, 2004.
38. H. Moravec. Robot rover visual navigation. Computer Science: Artificial Intelligence, pp. 105–108, 1980/1981.
39. J. Mulligan and K. Daniilidis. Real time trinocular stereo for tele-immersion. In International Conference on Image Processing, pp. 959–962, Thessaloniki, Greece, 2001.
40. S. Paris, F. Sillion, and L. Quan. A surface reconstruction method using global graph cut optimization. In Asian Conference on Computer Vision, 2004.
41. K. Junhwan, V. Kolmogorov, and R. Zabih. Visual correspondence using energy minimization and mutual information. In International Conference on Image Processing, Vol. 2, pp. 1033–1040, 2003.
42. Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):139–154, 1985.
43. I. J. Cox, S. L. Hingorani, S. B. Rao, and B. M. Maggs. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63(3):542–567, 1996.
44. P. Mordohai and G. Medioni. Dense multiple view stereo with general camera placement using tensor voting. In IEEE International Symposium on 3D Data Processing, Visualization and Transmission, pp. 725–732, 2004.
45. J. Montagnat, H. Delingette, and N. Ayache. A review of deformable surfaces: topology, geometry and deformation. Image and Vision Computing, 19:1023–1040, 2001.
46. H. Jin, S. Soatto, and A. Yezzi. Multi-view stereo beyond Lambert. In IEEE Computer Vision and Pattern Recognition, pp. 171–178, 2003.
47. J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
48. O. Monga, R. Deriche, G. Malandain, and J. P. Cocquerez. Recursive filtering and edge tracking: Two primary tools for 3D edge detection. Image and Vision Computing, 9(3):203–214, 1991.
49. G. Turk and J. F. O'Brien. Modelling with implicit surfaces that interpolate. ACM Transactions on Graphics, 21(4):855–873, 2002.
50. J. C. Carr, R. K. Beatson, J. B. Cherrie, T. J. Mitchell, W. R. Fright, B. C. McCallum, and T. R. Evans. Reconstruction and representation of 3D objects with radial basis functions. In ACM – Special Interest Group on Graphics and Interactive Techniques, pp. 67–76, 2001.
51. W. Lorensen and H. Cline. Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics, 21(4):163–169, 1987.
52. G. Turk and M. Levoy. Zippered polygon meshes from range images. In ACM – Special Interest Group on Graphics and Interactive Techniques, pp. 311–318, 1994.
53. J. X. Chai, X. Tong, S. C. Chan, and H. Y. Shum. Plenoptic sampling. In ACM – Special Interest Group on Graphics and Interactive Techniques, pp. 307–318, 2000.
54. C. Zach, K. Karner, and H. Bischof. Hierarchical disparity estimation with programmable 3D hardware. In International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, pp. 275–282, 2004.
55. X. Zabulis and G. Kordelas. Efficient, precise, and accurate utilization of the uniqueness constraint in multi-view stereo. In 3DPVT 2006, Third International Symposium on 3D Data Processing, Visualization and Transmission, University of North Carolina, Chapel Hill, 2006.
56. X. Zabulis and G. D. Floros. Modulating the size of backprojection surface patches, in volumetric stereo, for increasing reconstruction accuracy and robustness. In IEEE 3DTV-Conference 2007, Kos, Greece, 2007.
57. X. Zabulis, A. Patterson, and K. Daniilidis. Digitizing archaeological excavations from multiple views. In Proceedings of IEEE 3-D Digital Imaging and Modeling, 2005.
5 Pattern Projection Profilometry for 3D Coordinates Measurement of Dynamic Scenes

Elena Stoykova, Jana Harizanova and Ventseslav Sainov

Central Laboratory of Optical Storage and Processing of Information, Bulgarian Academy of Sciences
Introduction

Three-dimensional time-varying scene capture is a key component of dynamic 3D displays. Fast, remote, non-destructive and parallel acquisition of information is an inherent property of optical methods, which makes them extremely suitable for capture in 3D television systems. Recent advances in computers, image sensors and digital signal processing have become a powerful vehicle that motivates the rapid progress in optical profilometry and metrology and stimulates the development of various optical techniques for precise measurement of 3D coordinates in machine design, industrial inspection, prototyping, machine vision, robotics, biomedical investigation, 3D imaging, the game industry, cultural heritage protection, advertising, information exchange and other fields of modern information technologies. To meet the requirements of capture for the needs of dynamic 3D displays, optical profilometric methods and systems must ensure accurate, automated, real-time, full-field measurement of absolute 3D coordinates in a large dynamic range without loss of information due to shadowing and occlusion. The technical simplicity, reliability and cost of capturing systems are also crucial factors. Some of the already commercialized optical systems for 3D profilometry of real objects are based on laser scanning. As the scanning of surfaces is realized one-dimensionally in space and time (point by point or line by line) at limited speed, especially for large-scale scenes in outdoor conditions, these systems are subject to severe errors caused by vibration, air turbulence and other environmental influences and are not applicable for measurement in real time. Among existing techniques, the methods which rely on a functional relationship between the sought object data and the phase of a periodic fringe pattern projected onto and reflected from the object occupy a special place as a full-field metrological means with non-complex set-ups and processing algorithms that are easy to implement in outdoor and industrial environments. Pattern Projection Profilometry (PPP) includes a wide class of optical methods for contouring and shape measurement going back to the
classical shadow and projection moiré topography [1, 2] and the well-known triangulation, widely used since ancient times. Nowadays, pattern projection systems enable fast, non-ambiguous, precise measurement of the surface profile of a wide variety of objects, from plastic zones in the notch of micro-cracks in fracture mechanics [3] and micro-components [4] to cultural heritage monuments [5]. An optimized system equipped with a spatial light modulator (SLM) provides measurement accuracy up to 5·10⁻⁵ of the object size [1]. The main goal of this Chapter is to consider phase-measuring methods in pattern projection profilometry as a promising branch of structured light methods for shape measurement, emphasizing the possibility of applying these methods to time-varying scene capture in dynamic 3D displays. The Chapter consists of three Sections. Section 5.1 gives the basic principles of PPP and Phase Measuring Profilometry (PMP), describes the means for generation of sinusoidal fringe patterns, formulates the tasks of phase demodulation in a profilometric system and points out the typical error sources influencing the measurement. Section 5.2 deals with phase-retrieval methods. They are divided into two groups – multiple-frame and single-frame methods, i.e. temporal and spatial methods. Following this division, we start with the phase-shifting approach, which is outlined with its pros and cons. Special attention is dedicated to error-compensating algorithms and generalized phase-shifting techniques. Among spatial methods, the Fourier transform method is discussed in detail. The generic limitations, important accuracy issues and different approaches for carrier removal are highlighted. Space-frequency representations such as the wavelet and windowed Fourier transforms for phase demodulation are also considered. Other pointwise strategies for demodulation from a single frame, such as quadrature filters, phase-locked loop and regularized phase tracking, are briefly presented. The problem of phase unwrapping, which is essential for many of the phase retrieval algorithms, is explained with a classification of the existing phase-unwrapping approaches. The Chapter also includes the experimental set-ups developed by the Central Laboratory of Optical Storage and Processing of Information at the Bulgarian Academy of Sciences (CLOSPI-BAS), as well as technical solutions to the problems associated with measurement of the absolute 3D coordinates of objects and the loss of information due to the shadowing effect. Finally, we discuss the phase demodulation techniques from the point of view of observation of fast dynamic processes and the current development of real-time measurements in the PMP. This work is supported by the EC within FP6 under Contract 511568 “3DTV”.
5.1 Pattern Projection Profilometry

5.1.1 General Description

The principle of PPP is elucidated by the scheme depicted in Fig. 5.1. The optical axes of the projector system and the observation system cross at
Fig. 5.1. Schematic of pattern projection profilometry
a certain plane called the reference plane. Although there exist methods based on random pattern projection, PPP generally relies on structured light projection [1]. In structured light techniques a light pattern of a regular structure, such as a single stripe, multiple stripes, gradients, grids, binary bars, or intensity-modulated fringes, e.g. a sine-wave, is projected onto the object. Observed from another angle, the light pattern appears deformed by the object. Analysis of the deformed image captured with a CCD camera yields the 3D coordinates of the object, provided that the positions of the camera, the projector and the object are known. The procedure to obtain the required geometric relationships for calculation of coordinates is called camera calibration [6]. The accuracy of the measurement crucially depends on correct determination of the stripe orders in the reflected patterns and on their proper connection to the corresponding orders in the projected patterns. This presumes one or more patterns to be projected – the simpler the pattern structure, the larger the number of patterns required to derive the object's profile. For example, in the so-called Gray-code systems [7] several binary patterns of varying spatial frequency are projected. The number of projections needed to compensate for the scarce information in binary pattern projection is substantially reduced by intensity or colour modulation of the projected patterns. Projection of more complicated patterns with an increased number of stripes and intensity differences between the stripes enables more accurate but also more difficult interpretation of the captured images. A detailed review and classification of coded patterns used in projection techniques for coordinate measurement is presented in [8]. The patterns are unified in three subdivisions based on spatial, temporal (time-multiplexing) or direct codification. The first group comprises patterns whose points are coded using information from the neighbouring pixels. The advantage of such an approach is its capability for measurement of time-varying scenes. Its disadvantage is the complicated decoding stage due to the shadowing effect, as the surrounding area cannot always be recovered. The time-multiplexing approach
is based on measurement of intensity values for every pixel as a sequence in time. In practice, this is achieved by successive projection of a set of patterns onto the object surface, which limits its application to static measurements only. The codeword for a given pixel is usually formed by the sequence of intensity values for that pixel across the projected patterns. The third subdivision is based on direct codification, which means that each point of the pattern is identified just by itself. There are two ways to obtain pixel coordinates using this type of pattern: by increasing the range of colour values or by introducing periodicity in the pattern. These techniques are very sensitive to noise due to vibration, shadowing, saturation or poor illumination. Thus, preliminary calibration is needed in order to eliminate the colour of the objects using one or more reference images, which makes the method inapplicable for time-varying scene measurements. An attractive approach among structured light methods is phase measuring profilometry (PMP) [9, 10], also called fringe projection profilometry, in which the parameter being measured is encoded in the phase of a two-dimensional (2D) periodic fringe pattern. The main obligatory or optional steps of the PMP are shown schematically in Fig. 5.2. The phase measuring method enables determination of 3D coordinates of the object with respect to a reference plane or of absolute 3D coordinates of the object itself. The phase extraction requires a limited number of patterns, and some methods may need only one pattern, thus making real-time processing possible. Nowadays, the PMP is a highly sensitive tool in machine vision, computer-aided design, manufacturing, engineering, virtual reality, and medical diagnostics. A possibility for real-time remote shape control without the simultaneous physical presence of the two objects by using comparative digital holography is shown in [11]. For this purpose, the digital hologram of the master object is recorded at one location and transmitted via the Internet or a telecommunication network to the location of the tested object, where it is fed into a spatial light modulator (SLM).

5.1.2 Methods for Pattern Projection

In general, the pattern projected onto the object in the PMP is described by a periodic function, f ∈ [−1, 1]. Most of the developed algorithms
Fig. 5.2. Block-scheme of phase-measuring profilometry (blocks: projection, object, fringe pattern, phase retrieval, unwrapping, coordinates, constraints, processing algorithm)
in the PMP presume a sinusoidal profile of the fringes, which means that these algorithms are inherently free of errors only at perfectly sinusoidal fringe projection. Projection of purely sinusoidal fringes is not an easy task. Fringes that fulfil the requirement f = cos[. . .] can be projected by coherent light interference of two enlarged and collimated beams. As the fringes are in focus in the whole space, this method makes large-depth and large-angle measurements possible, however at a limited lateral field of measurement, restricted by the diameter of the collimating lens. The main drawbacks of interferometrically created fringes are the complexity of the set-up and its vulnerability to environmental influences, as well as the inevitable speckle noise produced by coherent illumination. An interesting idea for keeping the advantages of coherent illumination while avoiding the speckle noise is proposed in [12], where the light source is created by launching ultra-short laser pulses into highly nonlinear photonic crystal fibres. Use of a conventional imaging system with different types of single-, dual-, and multiple-frequency diffraction gratings, such as an amplitude or phase sinusoidal grating or a Ronchi grating, enlarges the field of measurement and avoids the speckle noise, however at the expense of higher harmonics in the projected fringes. In such systems, care should be taken to decrease the influence of the higher harmonics, e.g. by defocused projection of a Ronchi grating or by using an area modulation grating to encode an almost ideal sinusoidal transparency, as described in [9, 13]. A new type of projection unit based on a diffractive optical element in the form of a saw-tooth phase grating is described in [14]. The use of a programmable SLM, e.g. a liquid crystal display (LCD) [15, 16] or a digital micro-mirror device (DMD) [17, 18, 19], permits very precise control of the spacing, colour and structure of the projected fringes [20, 21], and miniaturization of the fringe projection system, enabling applications in space-restricted environments [22]. Synthetic fringe patterns produced by an SLM, however, also suffer from the presence of higher harmonics. The discrete nature of the projected fringes entails tiny discontinuities in the projected pattern that lead to loss of information. This problem is more serious for LCD projectors, whereas the currently available DMD chips with 4k × 4k pixel resolution make the digital fringe discontinuities a minor problem [23]. For illustration, Figs. 5.3–5.5 show schematically implementations of the PMP based on a classical Mach-Zehnder interferometer (Fig. 5.3) [24], on DMD projection (Fig. 5.4) [25] and on a phase grating (Fig. 5.5) [26]. The wrapped phase maps and 3D reconstructions of the objects for these three types of illumination are presented in Fig. 5.6.

5.1.3 Phase Demodulation

The 2D fringe pattern (FP) that is phase modulated by the physical object being measured may be represented by the following mathematical expression:

I(r, t) = IB(r, t) + IV(r, t) f[ϕ(r, t) + φ(r, t)]      (5.1)
Fig. 5.3. Fringe projection system based on a Mach-Zehnder interferometer: L – lens; BS – beam splitter; SF – spatial filter; P – prism; PLZT – phase-stepping device
Fig. 5.4. Fringe projection system based on computer-generated fringe patterns; L – lens
Fig. 5.5. Fringe projection system based on a sinusoidal phase grating; L – lens
where IB(r, t) is a slowly varying background intensity at a point r(x, y) and a moment t, IV(r, t) is the fringe visibility, which is also a low-frequency signal, and ϕ(r, t) is the phase term related to the measured parameter, e.g. the object profile. The phase term φ(r, t) is optional, being introduced during the formation of the waveform f or during the phase evaluation. The continuous FP (5.1) recorded at a moment t is imaged onto a CCD camera and digitized for further analysis as a 2D matrix of quantized intensities Iij ≡ I(x = iΔx, y = jΔy) with dimensions Nx × Ny, where Δx and Δy are the sampling intervals along the X and Y axes and define the spatial resolution, Nx is the number of columns and Ny is the number of rows. The camera spatial resolution is a crucial parameter for techniques based on the principle of optical triangulation. The brightness of each individual matrix element (pixel) is given by an integer that varies from the minimum intensity, equal to 0, to the maximum intensity, equal e.g. to 255. The purpose of computer-aided fringe analysis is to determine ϕ(r, t) across the pattern and to extract the spatial variation of
Fig. 5.6. Wrapped phase maps and 3D reconstruction of objects obtained with sinusoidal fringes generated using a) interferometer, b) DMD, c) phase grating
the parameter being measured. In the case of profilometry, once the phase of the deformed waveform is restored, nonambiguous depth or height values can be computed. The process of phase retrieval is often called phase evaluation or phase demodulation. The fringe density in the FP is proportional to the spatial gradient of the phase [27]. Hence evaluation of the fringe density is also close to phase demodulation. In general, phase retrieval includes the following steps: (i) a phase evaluation step, in which a spatial distribution of the phase, the so-called phase map, is calculated using one or more FPs. As the phase retrieval involves nonlinear operations, implementation of many algorithms requires some constraints to be applied. (ii) The output of the phase evaluation step, in most cases, yields phase values wrapped onto the range −π to π, which entails restoration of the unknown multiple of 2π at each point. This phase unwrapping step is central to these algorithms, especially for realization of automatic fringe analysis. (iii) Elimination of additional phase terms introduced to facilitate the phase measurement, by an adequate least-squares fit, an iterative process or some other method, is sometimes required. Historically, the PMP has emerged from classical moiré topography [28], in which the fringes modulated by the object surface create a moiré pattern. In the early days of moiré topography, operator intervention was required for assignment of fringe orders and determination of fringe extrema or interpositions. Over the years, phase-measuring systems with coherent and non-coherent illumination that realize the principles of moiré, speckle and holographic interferometry have been extensively developed for measurement of a wide range of physical parameters such as depth, surface profile, displacement, strain, deformation, vibration, refractive index, fluid flow, heat transfer, temperature gradients, etc. The development of interferometric methodology, image processing, and computer hardware governs the rapid progress in automation of fringe analysis. Gradually, a host of phase evaluation algorithms have been proposed and tested. A detailed overview of phase estimation methods is given in [29]. In terms of methodology, most algorithms fall into either of two categories: temporal or spatial analysis. A common feature of temporal analysis methods is that the phase value of a pixel is extracted from the phase-shifted intensities of this pixel. Spatial analysis methods extract a phase value by evaluating the intensity of a neighbourhood of the pixel being studied [30]. A typical temporal analysis method is phase-shifting profilometry. Typical spatial analysis methods are Fourier transform methods with and without carrier fringes. Recently, the wavelet transform method has started to gain popularity. A crucial requirement for implementation of any algorithm is the ability for automatic analysis of FPs [31]. Another important requirement for capture of 3D coordinates is to perform the measurement in real time. From this point of view, the methods capable of extracting phase information from a single frame are the most promising. In order to replace
the conventional 3D coordinate measurement machines using contact styli, the PMP should be able to measure diffusely reflecting surfaces and to derive correct information about discontinuous structures such as steps, holes, and protrusions [32].

5.1.4 Conversion from Phase Map to Coordinates

Usually, in the PMP the depth of the object is determined with respect to a reference plane. Two measurements are made, for the object and for the reference plane, yielding two phase distributions ϕobj(x, y) and ϕref(x, y), respectively. The object profile is retrieved from the phase difference, Δϕ(x, y) = ϕobj(x, y) − ϕref(x, y). Calibration of the measurement system, i.e. how to calculate the 3D coordinates of the object surface points from a phase map, is another important issue of all full-field phase measurement methods. The geometry of a conventional PMP system is depicted in Fig. 5.7. The reference plane is normal to the optical axis of the camera and passes through the intersection point of the optical axes of the projector and the camera. The plane XC OYC of the Cartesian coordinate system (OXC YC ZC) coincides with the reference plane and the axis ZC passes through the camera center. The plane P, which is taken to pass through the origin of (OXC YC ZC), is normal to the optical axis of the projector. The Cartesian coordinate system (OXP YP ZP), with the plane XP OYP coinciding with the plane P and the axis ZP passing through the center of the projector system, can be transformed to (OXC YC ZC) by rotations around the XC axis, YC axis, and ZC axis in sequence, through the angles α, β, and γ, respectively. The mapping between the depth and the phase difference depends on the positions and orientations of the camera and projector, the fringe spacing, the location of the reference plane, etc. It is important to note that the mapping is described by a non-linear function [33]. According to
Fig. 5.7. Geometry of the pattern projection system. The depth (or height) of the object point A with respect to the reference plane R is hA
the geometry depicted in Fig. 5.7, the phase difference at the camera pixel (x = iΔx, y = jΔy) is connected to the depth (or height) hij of the current point A on the object, as viewed by the camera at that pixel, with respect to the reference plane R by the expression [33]:

Δϕ(x = iΔx, y = jΔy) ≡ Δϕij = aij hij / (1 + bij hij)      (5.2)
where the coefficients aij = aij(LP, LC, d, α, β, γ) and bij = bij(LP, LC, d, α, β, γ) depend on the coordinates of point A, the rotation angles between (OXC YC ZC) and (OXP YP ZP), the fringe spacing d and the distances LP and LC of the projector and the camera, respectively, to the reference plane. If the PMP is used for investigation of a specularly reflective surface, which acts as a mirror, the phase of the FP recorded by the CCD is distorted proportionally to the slope of the tested object [34]. In this simple model it is assumed that the lateral dimensions, given usually by the x and y coordinates, are proportional to the image pixel index (i, j). However, this simplified model gives inaccurate formulas in the case of lens distortion and if the magnification varies from point to point, which destroys the proportionality of the x and y coordinates to the image index (i, j) [35]. Reliable conversion of the phase map to 3D coordinates needs a unique absolute phase value. This phase value can be obtained using some calibration mark, e.g. one or several vertical lines with known positions on the projector in digital fringe projection. Calibration of a PMP system based on DMD digital fringe projection is addressed in [23], where a new phase-coordinate conversion algorithm is described. In [36] calibration is governed by a multi-layer neural network trained using data about the FP's irradiance and the height directional gradients obtained for a test object. In this way, it is not necessary to know explicitly the geometry of the profilometric system.

5.1.5 Error Sources

An important issue of all phase determination techniques is their accuracy and noise tolerance. It seems logical to adopt the following general model of the recorded signal:

I(r, t) = Nm(r, t) {IB(r, t) + IV(r, t) f[ϕ(r, t) + φ(r, t) + Nph(r, t)]} + Na(r, t)      (5.3)
where the terms Nm(r, t), Na(r, t) and Nph(r, t) comprise the possible deterministic and random error sources. Depending on the processing algorithm and the experimental realization of the profilometric measurement, multiple error sources of different nature will affect the accuracy of phase restoration, and hence the 3D profile recovery becomes a challenging task. Environmental error sources such as mechanical vibration, turbulent and laminar air flows in the optical path, dust diffraction, parasitic fringes, and ambient light, which occur
during the acquisition of fringe data, are unavoidable, being especially crucial in interferometric set-ups. Error sources in the measurement system, such as the digitization error, a low sampling rate due to insufficient resolution of the camera, nonlinearity error, electronic noise, thermal or shot noise, imaging errors of the projector and the camera, background noise, calibration errors, optical system aberrations, beam power fluctuations and nonuniformity, frequency or temporal instability of the illuminating source, spurious reflections, defects of optical elements, low precision of the digital-data processing hardware, etc., occur in nearly all optical profilometric measurement systems, leading to random variations of the background and fringe visibility. Measurement accuracy can be improved by taking special measures, e.g. by using a high-resolution SLM to reduce the digitization error of the projector and by defocusing the projected FPs, or by selecting a CCD camera with a higher data depth (10 or 12 bits versus 8 bits). To reduce the errors due to calibration, a coordinate measuring machine can be used to provide the reference coordinates and to build an error compensating map [37]. Speckle noise affects the systems with coherent light sources [38, 39]. A special emphasis should be put on the systematic and random error sources, Nph(r, t), that influence the measured phase. Error sources such as miscalibration of the phase-shifting device or non-parallel illumination, which causes non-equal spacing in the projected pattern along the object, introduce a non-linear phase component. Methodological error sources such as shadowing, discontinuous surface structure, low surface reflectivity, or saturation of the image-recording system produce unreliable phase data. The accuracy of the measurement depends on the algorithm used for phase retrieval. For the local (pointwise) methods, the calculated output at a given point is affected by the values registered successively at this point or at neighbouring points, whereas in global methods all image points affect the calculated value at a single point. A theoretical comparison of three phase demodulation methods in PMP in the presence of white Gaussian noise is made in [40].
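For a concrete feel of the signal model (5.3), the following minimal Python sketch (the noise levels are arbitrary assumptions, not values taken from the studies cited above) synthesizes one noisy fringe pattern with the multiplicative, additive and phase noise terms switched on:

```python
import numpy as np

rng = np.random.default_rng(0)
Ny, Nx = 256, 256
y, x = np.mgrid[0:Ny, 0:Nx]

phi = 2 * np.pi * x / 32                 # object-related phase (a plain carrier here)
I_B, I_V = 120.0, 100.0                  # background intensity and fringe visibility

N_m = 1 + 0.02 * rng.standard_normal((Ny, Nx))   # multiplicative noise N_m(r, t)
N_a = 2.0 * rng.standard_normal((Ny, Nx))        # additive noise N_a(r, t)
N_ph = 0.05 * rng.standard_normal((Ny, Nx))      # phase noise N_ph(r, t)

# recorded intensity per (5.3) with f = cos, followed by 8-bit quantization
I = N_m * (I_B + I_V * np.cos(phi + N_ph)) + N_a
I = np.clip(np.round(I), 0, 255).astype(np.uint8)
```

Feeding such synthetic frames to a demodulation algorithm allows the error contributions discussed above to be studied in isolation by switching the individual noise terms on and off.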
5.2 Phase Retrieval Methods

5.2.1 Phase-shifting Method

5.2.1.1 General Description

A typical temporal analysis method is the phase-shifting (PS) algorithm, in which the phase value at each pixel of a data frame is computed from a series of recorded FPs that have undergone a phase shift described by a function φ(r, t). If the reference phase φi, i = 1, 2, . . . , M is kept constant during the capture time and is changed in steps between two subsequent FPs, the method is called phase stepping or phase shifting profilometry (PSP). In this case, to determine the values of IB(r), IV(r) and ϕ(r) at each point, at least three FPs (M = 3) are required. In the phase-integration modification of the method, the reference phase is changed linearly in time during the measurement [41].
The PSP is well accepted in many applications due to its well-known advantages, such as high measurement accuracy, rapid acquisition, good performance at low contrast and under intensity variations across the FP, and the possibility to determine the sign of the wave front. The PSP can ensure an accuracy better than 1/100th of the wavelength in determination of surface profiles [42]. As in all profilometric measurements, PSP operates either in a comparative mode with a reference surface or in an absolute mode. Usually, in PSP, phase evaluation relies on sinusoidal pattern projection:

I(r, t) = I0(r, t) + IV(r, t) cos[ϕ(r, t) + φ(r, t)]      (5.4)
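For numerical experiments, a stack of phase-stepped patterns of the form (5.4) is straightforward to synthesize. The sketch below (a Python illustration with an arbitrarily chosen object phase and M equal steps of 2π/M) produces the kind of input assumed by the demodulation sketches later in this Section:

```python
import numpy as np

def psp_frames(phi, M=4, I0=128.0, IV=100.0):
    """M fringe patterns per (5.4), with phase steps phi_m = 2*pi*m/M."""
    steps = 2 * np.pi * np.arange(M) / M
    frames = np.stack([I0 + IV * np.cos(phi + s) for s in steps])
    return frames, steps

# synthetic object phase: a linear carrier plus a smooth bump as surface relief
y, x = np.mgrid[0:256, 0:256]
phi_obj = 2 * np.pi * x / 24 + 2.0 * np.exp(-((x - 128)**2 + (y - 128)**2) / 2000.0)
frames, steps = psp_frames(phi_obj)
```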
Violation of the assumption f[. . .] = cos(. . .) causes systematic errors in the evaluated phase distribution. Two approaches are broadly used in phase shifting, one based on equal phase steps – typically multiples of π/2 – and the other based on arbitrary phase steps. These two approaches are usually referred to as conventional and generalized PSP [43, 44]. A modification of the method with two successive frames shifted at known phase steps and one frame shifted at an unknown phase step is proposed in [45]. All these phase-shifting algorithms can also be called digital heterodyning [46]. The most general approach for phase retrieval in the PSP with M FPs shifted at known phase steps is the least-squares technique [47, 48]. The digitized FPs are

I^m_ij = B^m_ij + V^m_ij cos(ϕij + φm),   m = 1, 2, . . . , M      (5.5)

with i = 1, 2, . . . , Ny, j = 1, 2, . . . , Nx. Under the assumption that the background intensity Bij = IB(iΔx, jΔy) and the visibility Vij = IV(iΔx, jΔy) have only pixel-to-pixel (intra-frame) variation, we have

B^1_ij = B^2_ij = . . . = B^M_ij = Bij   and   V^1_ij = V^2_ij = . . . = V^M_ij = Vij      (5.6)
Assuming also that the phase steps are known, the object phase is obtained from minimization of the least-squares error between the experimental intensities Î^m_ij and the calculated intensity distribution:

Sij = Σ_{m=1}^{M} (Î^m_ij − I^m_ij)² = Σ_{m=1}^{M} (Bij + aij cos φm + bij sin φm − Î^m_ij)²      (5.7)

The unknown quantities aij = Vij cos ϕij and bij = −Vij sin ϕij are found as the least-squares solution of the equation

Ω̂ij = (Bij, aij, bij)ᵀ = Ξ̂ij⁻¹ Ŷij      (5.8)
where, with all sums taken over m = 1, . . . , M,

Ξij = [ M          Σ cos φm          Σ sin φm
        Σ cos φm   Σ cos² φm         Σ cos φm sin φm
        Σ sin φm   Σ cos φm sin φm   Σ sin² φm ]

and

Ŷij = ( Σ Î^m_ij,   Σ Î^m_ij cos φm,   Σ Î^m_ij sin φm )ᵀ
The phase estimate is obtained in each pixel as

ϕ̂ij = tan⁻¹(−bij/aij)      (5.9)
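Because Ξij involves only the phase steps and not the recorded intensities, it is identical for all pixels, so (5.8) can be solved for the whole frame at once. A minimal vectorized Python sketch of (5.7)–(5.9), assuming globally known steps:

```python
import numpy as np

def lsq_phase(frames, steps):
    """Least-squares phase retrieval per (5.7)-(5.9).

    frames: (M, Ny, Nx) array of recorded intensities
    steps:  (M,) array of known phase steps
    Returns the wrapped phase estimate in (-pi, pi].
    """
    c, s = np.cos(steps), np.sin(steps)
    M = len(steps)
    # 3x3 system matrix of (5.8); it depends only on the steps, not the pixel
    Xi = np.array([[M,       c.sum(),       s.sum()],
                   [c.sum(), (c * c).sum(), (c * s).sum()],
                   [s.sum(), (c * s).sum(), (s * s).sum()]])
    # right-hand side of (5.8), one (B, a, b) triple per pixel
    Y = np.stack([frames.sum(axis=0),
                  np.tensordot(c, frames, axes=1),
                  np.tensordot(s, frames, axes=1)])
    B, a, b = np.linalg.solve(Xi, Y.reshape(3, -1)).reshape(Y.shape)
    return np.arctan2(-b, a)             # eq. (5.9), quadrant-correct

# e.g. applied to the stack generated in the sketch after (5.4):
# phase_map = lsq_phase(frames, steps)
```

Here np.arctan2 is used instead of a plain arctangent so that the signs of aij and bij place the estimate in the correct quadrant.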
In the case of so-called synchronous detection, the M FPs are equally spaced over one fringe period, φm = 2πm/M, and the matrix Ξij becomes diagonal. A more general approach is to take M equally shifted FPs and to determine the phase from

ϕ(x, y) = tan⁻¹ [ Σ_{m=1}^{M} bm Im(x, y) / Σ_{m=1}^{M} am Im(x, y) ]      (5.10)
The number of frames or “buckets” usually gives the name of the algorithm. Popular algorithms are the 3-frame algorithm with a step of 120° or 90°, as well as the 4-frame and 5-frame algorithms with a step of 90°:

ϕ̂ = arctan[(I4 − I2)/(I1 − I3)],   ϕ̂ = arctan[2(I4 − I2)/(I1 − 2I3 + I5)],   αi = (i − 1)π/2      (5.11)
5.2.1.2 Accuracy of the Measurement

The choice of the number of frames depends on the desired speed of the algorithm, its sensitivity to phase-step errors and to the harmonic content of the function f[. . .], and the required accuracy of the phase estimation. Errors in the phase step, φ(r, t), and a nonsinusoidal waveform are the most common sources of systematic errors in the PSP [46, 49]. A nonsinusoidal signal may be caused by the non-linear response of the detector [46]. The phase shift between two consecutive images can be created by different means, depending on the experimental realization of the profilometric system. Phase-shifting devices are often subject to nonlinearity and may not ensure good repeatability. Miscalibration of phase shifters may be
the most significant source of error [50]. In fringe-projection applications precise linear translation stages are used [51]. In interferometric systems the phase shifter is usually a mirror mounted on a piezoelectric transducer (PZT). In such systems instability of the driving voltage, nonlinearity, linear temperature drift and hysteresis of the PZT device, and tilt of the mirror affect the accuracy of the measurement. In the scheme presented in Fig. 5.3, a special feedback is introduced to keep the value of the phase step constant. For large-scale objects it is more convenient to create a phase shift by slightly changing the wavelength of the light source. A phase-shifting system with a laser diode (LD) source has been proposed in [52] and [53], in which the phase shift is created by a change of the injection current of the LD in an unbalanced interferometer. The phase shift can also be introduced by digitally controlling the SLM that is used for generation of fringes. As an example, an electrically addressed SLM (EA-SLM) is used to display a grating pattern in [42]. In [54] a DMD microscopic system is designed in which the three colour channels in the DMD projector are programmed to yield intensity profiles with a 2π/3 phase shift. Using the colour-channel switching characteristic and removing the colour filter, the authors succeed in projecting grey-scale fringes and, by proper synchronization between the CCD camera and the DMD projection, in performing one 3D measurement within 10 ms.
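The colour-channel technique of [54] essentially packs three fringe patterns, mutually shifted by 2π/3, into the R, G and B channels of a single projected image. A schematic Python sketch of such an image (the resolution and fringe period are arbitrary assumptions):

```python
import numpy as np

def rgb_fringe_image(width=1024, height=768, period=16):
    """One RGB image whose channels hold three fringes shifted by 2*pi/3,
    projected sequentially via DMD colour-channel switching (cf. [54])."""
    x = np.arange(width)
    img = np.empty((height, width, 3), dtype=np.uint8)
    for ch, shift in enumerate((0.0, 2 * np.pi / 3, 4 * np.pi / 3)):
        profile = 127.5 + 127.5 * np.cos(2 * np.pi * x / period + shift)
        img[..., ch] = np.round(profile)   # same fringe profile on every row
    return img
```

With the projector's colour filter removed, the three channels appear as successive grey-scale fringes within one video frame, which is what enables the fast acquisition quoted above.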
value of the phase step that is actually introduced ˆ by the phase-shifter, φi , may be presented as a series δφi = ε1 φi + ε2 φ2i + ε3 φ3i +. . ., with coefficients ε1 , ε2 , ε3 , . . . that depend on the phase shifter. The error analysis in [57] has indicated the linear and the quadratic phase step deviations as one of the main error sources degrading the phase measurement accuracy. If only the linear term is kept in δφi , the error induced in ϕ in all points of the FP for most of the PS algorithms is given by [57]: δϕ =
M ∂ϕ ∂Ii i=1
∂Ii
∂φi
δφi
(5.12)
The linear approximation (5.12) leads to a dependence of the systematic error, δϕ, on cos 2ϕ and sin 2ϕ [24, 58, 59, 60]. In fact, as shown in [57], the quadratic and cubic terms of the series for δφi also lead to a cos 2ϕ dependence of the systematic error. The influence of miscalibration and non-linearity of the phase shifter for different phase-stepping algorithms is studied in [61]. The other frequently addressed systematic error is the non-linearity caused by the detector. To study its effect on the measured phase, [57] uses a polynomial description of the intensity error, δIi = Îi − Ii = α1 Ii² + α2 Ii³ + α3 Ii⁴ + . . .,
M
cm Im =
m=1
1 IV P (ς) exp(jϕ) 2
(5.13)
in which the characteristic polynomial is defined by

P(ς) = Σ_{m=1}^{M} cm ς^m      (5.14)
where ς = exp(jφ) and cm = am + jbm. Surrel [66] shows that the error-compensating behaviour of any phase-shifting algorithm can be determined by analyzing the location and multiplicity of the roots of P(ς). This approach permits finding
the sensitivity of the phase-shifting algorithms also to the harmonic content of the FP [66] and obtaining a simplified expression for the phase quantization error. It has been found that for quantization with 8 or more bits this error is negligible for noiseless FPs, provided the intensity is spread over the whole dynamic range of the detection system. The analysis and simulations made in [67] show that in the most common CCD cameras a nominal 6-bit range is used out of the available 8-bit range, which leads to a phase error of the order of 0.178 radians. The accuracy is increased by a factor of four if a 12-bit camera is used [67]. Algorithms with specific phase steps to minimize the errors from miscalibration and nonsinusoidal waveforms have been derived using the characteristic polynomial. It is found that a (j + 3)-frame algorithm eliminates the effects of linear phase-shift miscalibration and harmonic components of the signal up to the j-th order. Vibration as a source of error is essential in interferometric set-ups. For example, testing of flat surfaces needs a very high accuracy of 0.01 μm. Vibration induces blurring and random phase errors during acquisition of the successive frames in the temporal PS. For this reason, interferometric implementation of the temporal PSP is appropriate whenever the atmospheric turbulence and the mechanical conditions of the interferometer remain constant during the time required for obtaining the interferograms [31]. The analysis made in [68] shows that low-frequency vibrations may cause a considerable phase error, whereas high-frequency vibration leads to a reduced modulation depth. In [68] a (2 + 1) algorithm is proposed in which two interferograms, separated by a quarter-wave step, are required to calculate the phase. A third normalizing interferogram, averaged over two phases that differ by 180°, makes it possible to evaluate the background intensity. A thorough analysis of the vibration degrading effect is made in [69, 70]. Applying a Fourier analysis, an analytical description of the influence of small-amplitude vibrations on the recorded intensity is obtained, and the relationship between the Fourier spectrum of the phase error and the vibration noise spectrum is found by introduction of the phase-error transfer function, which gives the sensitivity of the PS measurement to different noise frequency components. It is shown that immunity to vibration noise increases for algorithms with a higher number of recorded patterns. A max-min scanning method for phase determination is described in [71], and it is shown in [50] that it has good tolerance to small-amplitude low-frequency and high-frequency noise. Lower accuracy of phase demodulation and phase unwrapping should be expected in image zones with low fringe modulation or contrast, e.g. in areas with low reflectivity. Fringe contrast is an important characteristic for finding an optimal unwrapping path and for optimal processing of phase data such as filtering, improving visualization, and masking [72]. However, using high fringe contrast as a quality criterion of good data is not always reliable, because this feature of the FPs is insensitive to such surface structure changes as steps. In areas with steps which do not cast a shadow the fringe
contrast is high but the phase data are unreliable. Evaluation of the fringe contrast from several successively recorded images is inapplicable for real-time measurement. In [73] the fringe contrast and the quality of the phase data are evaluated from a single FP by a least-squares approach. It is rather complicated to perform in situ monitoring of the phase step, e.g. by incorporating additional interferometric arms. A more preferable approach is the so-called self-calibration of the phase steps [74], which makes use of the redundancy of the FPs. Some of the developed self-calibrating algorithms are pointwise, whereas others take into account the information contained in the whole FP. However, most of the developed self-calibration methods put restrictions on the number and quality of the FPs and on the performance of the phase shifters. Over the years, different approaches for deriving error-compensating algorithms have been proposed [49, 75]. Hibino [49] divides the PS algorithms into three categories according to their ability to compensate systematic phase-step errors. The first group comprises algorithms without immunity to systematic phase-step error, e.g. the synchronous detection algorithm. The second group consists of the error-compensating algorithms able to eliminate linear or nonlinear phase-step errors. The third group of algorithms compensates for systematic phase-step errors in the presence of harmonic components of the signal. To justify the compensating properties of the proposed algorithms, different approaches have been invented, such as averaging of successive samples [76], a Fourier description of the sampling functions [77], and an analytical expansion of the phase error [57]. The five-frame algorithm proposed by Schwider and Hariharan [78] has become very popular. Hariharan et al. show that the error of the five-frame algorithm has a quadratic dependence on the phase-step error. A new four-interferogram method for compensating linear deviations from the phase step is developed in [79]. To increase the accuracy, algorithms based on more frames have started to appear [76]. Algorithms derived in [80] based on seven or more camera frames prove to have low vulnerability to some phase-step errors and to low-frequency mechanical vibration. In [81] three new algorithms with π/2 phase steps are built on the basis of the Surrel six-frame algorithm [82] with a π/2 step, and four modifications of the conventional four-frame algorithm with a phase step of π/2 are studied using a polynomial model for the phase-step error. The ability to compensate errors is analyzed by a Fourier spectra analysing method. The main conclusion of the analysis is that it is possible to improve the performance of π/2 algorithms by an appropriate averaging technique. A self-calibrating algorithm proposed in [83] relies on the assumption of constant arbitrary phase steps between the consecutive FPs and a quasi-uniform distribution of the measured phase, taken modulo 2π in the range (0, 2π), over the recorded FP. When the assumed phase steps differ from the actual ones, the probability density distribution of the retrieved phase is no longer uniform and exhibits two maxima. Applying an iterative fitting procedure to a histogram built for the retrieved phase permits finding the actual phase steps and correcting
the demodulated phase. The algorithm is further improved in [84], where the visibility of fringes across the FP is assumed to be constant whereas the background is allowed to have only intraframe variations. The improved algorithm introduces a feedback to adjust the supposed phase shifts until the calculated visibility map becomes uniform. The merit of the algorithm is its operation at arbitrary phase steps, however at the expense of the constant visibility requirement. A general approach to diminish or eliminate some error sources in PS interferometry is proposed in [85]. A model for Nph(r, t) is built which takes into account the phase-step error and considers an interferometer with a spherical Fizeau cavity. A generic algorithm for elimination of the mechanical vibration during the measurement is also described. Reference [77] adopts a Fourier-based analysis to determine suitable sampling functions for the design of a five-frame PS algorithm that is insensitive to background variation when a laser diode is used as a phase shifter. Criteria are defined to check the algorithm's vulnerability to the background change. In addition, the authors evaluate the influence of the linear phase-shift miscalibration and the quadratic non-linearity of the detector error. An accurate method for estimation of the phase step between consecutive FPs is proposed in [51] for the case of a five-frame algorithm with an unknown but constant phase step, which permits calculating the phase step as a function of coordinates and using the so-called lattice-site representation of the phase angles. In this representation the distance of the corresponding lattice site to the origin depends on the phase step. In the ideal case all lattice sites that correspond to a given phase step but to different phases lie on a straight line passing through the origin of the coordinate system whose axes represent the numerator and denominator in the equation for phase-step calculation [78]. The error sources deform the shape and spread of both the histogram and the lattice-site representation patterns. Application of the latter to an analysis of the behaviour of four- and five-frame algorithms is made in [86]. It is proven that the lattice-site representation outperforms the histogram approach for detection of errors in the experimental data. A phase shifter in an interferometric set-up is vulnerable to both translational and tilt-shift errors during shifting, which results in a different phase-step value in every pixel of the same interferogram. An iterative algorithm that compensates both translational and tilt-shift errors is developed in [87], based on the fact that the 2D phase distribution introduced by the phase shifter is a plane. This plane can be determined by a first-order Taylor series expansion, which makes it possible to transform the nonlinear equations defining the phase-shift plane into linear ones. By using an iterative procedure both errors can be minimized. A liquid-crystal SLM may produce a nonlinear and spatially nonuniform phase shift [75].

5.2.1.3 Generalized Phase-shifting Technique

In the conventional phase-shifting algorithms the phase steps are known and uniformly spaced. In this case simple trigonometry permits derivation of
explicit formulas for the object phase calculation. It is also assumed that the background and the visibility of fringes have only pixel-to-pixel variation but remain constant from frame to frame. In the generalized PSP, which in recent years has gained increasing popularity because of the advantage of using arbitrary phase steps, these steps are unknown and should be determined from the recorded FPs. This is a frequently solved task in the PSP, e.g. for calibration of the phase shifter. Determination of the phase step between two consecutive interferograms is similar to signal-frequency estimation, which has attracted a lot of attention in the signal-processing literature. However, it is more complicated due to the fact that the background intensity (the dc component) is involved in the processed signal [88]. Determination of the phase step is equivalent to the task of phase-step calibration which, generally speaking, can be performed using two approaches: fringe tracking or calculation of the phase step from the recorded FPs [89]. In fringe tracking the size of the phase step is obtained from the displacement of fringes, following some characteristic features of the fringes, e.g. the positions of their extrema after fringe skeletonizing to find the centers of dark or bright interference lines [79]. An extensive overview of algorithms for determination of unknown phase steps from the recorded FPs is made in [90]. Phase-step determination in a perturbing environment is analyzed in [91]. Several methods, such as the Fourier series method and iterative linear and non-linear least squares methods, are compared on the basis of computer simulations, which prove the reliability of all of them for derivation of the phase step. Historically, the development of self-calibrating algorithms starts with the first phase-stepping algorithm proposed by Carré in 1966 [92]. The algorithm is designed to operate at an arbitrary phase step, φ, which is determined during the processing under the assumption of linear phase-step errors. It requires four phase-shifted images Ii = I0 + IV cos[ϕ + (i − 1.5)φ], i = 0, . . . , 3, under the assumption of the same background intensity, modulation, and phase step for all recorded images. The Carré algorithm's accuracy depends on the phase step. Carré recommends the value of 110 degrees as most suitable. The accuracy of the algorithm has been studied both theoretically [57] and by computer simulations [55] for the phase step π/2. Computer simulations and experiments performed in [79] for the case of white additive noise, and the Fourier analysis made in [93], confirm the conclusion of Carré that the highest accuracy is observed at 110°. In [94] a search for the best step that minimizes the error of the Carré algorithm is made by means of a linear approximation of the Taylor series expansion of the phase error. The linear approximation yields correct results only in the case of small error expansion coefficients. The obtained results also indicate φ = 110° as the best choice, but only when random intensity fluctuations (additive noise) are to be minimized. This value is not recommendable for compensation of a phase-step error or a systematic intensity error. The authors draw attention to the fact that the numerator in the Carré algorithm should be positive, which is fulfilled only for perfect images without noise. A number
of other algorithms with a fixed number of equal unknown phase steps have recently been proposed in [95, 96, 97]. Use of a fixed number of equal steps is certainly a weak point in measurement practice. This explains the urge of the phase-shifting community to elaborate more sophisticated algorithms with randomly distributed arbitrary unknown phase steps. Direct real-time evaluation of a random phase step in generalized PS profilometry without calibration of the phase shifter is realized in [98], where the phase step is calculated using a Fourier transform of straight Fizeau fringes that are simultaneously generated in the same interferometric set-up. The necessity of an additional optical set-up limits the method's application. Evaluation of the phase steps by a Lissajous figure technique is described in [99, 100]. The phase is determined by ellipse fitting, based on the Bookstein algorithm, of a Lissajous figure obtained when two phase-shifted fringe profiles are plotted against each other. The algorithm, however, is sensitive to noise and easily affected by low modulation of the FP. An improvement of the algorithm is proposed in [100], where Lissajous figures and elliptic serial least-squares fitting are used to calculate the object phase distribution. The algorithm has both immunity to errors in φ and a possibility for automatic calibration. Reduction of the phase error caused by linear and quadratic deviations of the phase step by means of a self-calibrating algorithm is proposed in [59]. The estimates of the phase steps are derived from each FP, and the exact phase difference between the consecutive patterns is calculated. The efficiency of the algorithm is proven by numerical simulation for up to 10% linear and 1% quadratic phase deviations, and by experiments with a Twyman–Green interferometer for gauge calibration. A phase-calibration algorithm for phase steps less than π that uses only two normalized FPs is proposed in [101]. For this purpose, a region that contains a full fringe (a region with an object phase variation of at least 2π) is chosen. The phase step is retrieved by simple trigonometry. A method for evaluation of irregular and unknown phase steps is described in [102], based on introduction of a carrier frequency in the FPs. The phase steps are determined from the phases of the first-order maximum in the spectra of the recorded phase-shifted FPs in the Fourier domain. The Fourier analysis can be applied to a subregion of the FP with high quality of the fringes. This straightforward and simple method works well only in the case of FPs with narrow spectra. Algorithms that exploit a spatial carrier use a relatively small number of interferograms. An improvement of the Fourier transform method based on whole-field data analysis is proposed in [103]. The phase step is obtained by minimization of the total energy of the first-order spectrum of the difference of two consecutive FPs, with one of them multiplied by a factor exp(jφ), where φ is equal to the estimated value of the phase step to be determined. Simulations and experiments prove that the algorithm is effective, robust against noise, and easy to implement. Based on a quadrature filter approach, Marroquin et al. propose in [104] an iterative fitting technique that simultaneously yields the phase steps and the object phase, which
is assumed to be smooth. In [89] a method is proposed that requires only two phase-stepped images. The phase step is estimated as an arccosine of the correlation coefficient of both images, without the requirement for constant visibility and background intensity. The method can show position-dependent phase-step differences, but it is strictly applicable only to areas with a linear phase change. To overcome the errors induced in the phase step by different sources, it is desirable to develop a pointwise algorithm that can compute the phase step and the object phase at each pixel [105]. The first attempt to deduce an algorithm with unknown phase steps using the least-squares approach belongs to Okada et al. [106]. It was soon followed by several proposals of self-calibrating least-squares PS algorithms [45, 107, 108, 109]. The essence of the least-squares approach is to consider both the phase steps and the object phase as unknowns and to evaluate them by an iterative procedure. This approach is especially reliable for FPs without spatial carrier fringes, presenting stable performance in the case of nonlinear and random errors in the phase step. The number of equations which can be constructed from M FPs, each consisting of Nx × Ny pixels, is M × Nx × Ny, whereas the number of unknowns is 3Nx × Ny + M − 1. This entails the requirement M × Nx × Ny ≥ 3Nx × Ny + M − 1 to ensure the object phase retrieval. To have stable convergence, the least-squares PS algorithms with unknown phase steps need comparatively uniformly spaced initial phase steps that are close to the actual ones. These algorithms usually are effective only at small phase-step errors and require long computational times. They are not able to handle completely random phase steps. As a rule, these methods are either subject to a significant computational burden or require at least five FPs for reliable estimation. The least-squares approach is accelerated in [109, 110], where the computationally extensive pixel-by-pixel calculation of the phase-step estimate is replaced with a 2 × 2 matrix equation for cos φ and sin φ. The phase step is determined iteratively as φ̂ = tan⁻¹(sin φ/cos φ) until the difference between two consecutive phase-step estimates falls below a predetermined small value. The limitations of the least-squares approach are overcome by an advanced iterative algorithm proposed in [111] and [112], which consists of the following consecutive steps: i) Using a least-squares approach, the object phase is estimated in each pixel under the assumption of known phase steps and intraframe (pixel-to-pixel) variations of the background intensity and visibility. ii) Using the extracted phase distribution, the phase steps φn = tan⁻¹(−dn/cn) are updated by minimization of the least-squares error

Sn = Σ_{i=1}^{Nx} Σ_{j=1}^{Ny} (Î^n_ij − I^n_ij)² = Σ_{i=1}^{Nx} Σ_{j=1}^{Ny} (B^n + cn cos ϕij + dn sin ϕij − Î^n_ij)²      (5.15)

under the assumption of interframe (frame-to-frame) variations of the background intensity and visibility, B^n_ij = B^n and V^n_ij = V^n, with cn = V^n cos φn and dn = −V^n sin φn.
iii) If the pre-defined convergence criteria are not fulfilled, steps i) and ii) are repeated. An improved iterative least-squares algorithm is constructed in [108], which minimizes the dependence of the differences between the recorded intensities and their recalculated values with respect to the phase-step errors. An iterative approach is considered in [113], where the phase steps are estimated by modelling an interframe intensity correlation matrix using the measured FPs. This makes the method faster, more accurate and less dependent on the quality of the FPs. The smallest eigenvalue of this matrix yields the random error of the intensity measurement. As few as four FPs are required for phase-step estimation. The developed iterative procedure is considerably simplified in comparison with the methods that rely on pixel-to-pixel calculation. An accuracy of 2 × 10⁻³ rad has been achieved. A pointwise iterative approach for phase-step determination, based on a linear predictive property and least-squares minimization of a special unbiased error function, is proposed in [88]. The algorithm works well only for a purely sinusoidal profile of the FP. Phase retrieval and simultaneous reconstruction of the object wave front in PS holographic interferometry with arbitrary unknown phase steps is proposed in [107]. Assuming a uniform spatial distribution of the phase step over the recorded interferogram, the authors obtain the following relationship between consecutive interferograms:

pn = (In+1 − In) / (4√(I0 Ir)) = (2/π) sin[(φn+1 − φn)/2]      (5.16)
where I0 and Ir are the intensities of the object and the reference waves. The parameter pn can be determined for all recorded interferograms, which further permits restoration of the complex amplitude of the object wave. The process is repeated iteratively until the difference φn+1 − φn becomes less than a small predetermined value. Computer simulations prove that the algorithm works well for any number of patterns M > 3. An extension of the algorithm is proposed in [114] for the case when only the intensity of the reference beam must be measured. The need for iterations, however, makes it unsuitable for real-time measurement, as the authors recommend at least 20 iterations in 1 min to reach the desired high accuracy. To avoid iterations and the need for alternating estimation of the object phase and the phase step, Qian et al. [115] propose to apply a windowed Fourier transform to a local area with carrier-like fringes in two consecutive FPs. The objective of [43] is to develop a generalized PS interferometry with multiple PZTs in the optical configuration that operates under illumination with a spherical beam in the presence of higher harmonics and white Gaussian intensity noise. These goals are achieved by a super-resolution frequency estimation approach in which the Z-transform is applied to the phase-shifted FPs, and their images in the Z-domain are multiplied by a polynomial called an annihilating filter. The zeros of this filter in the Z-domain should coincide
with the frequencies in the fringes. Hence, the parametric estimation of the annihilating filter provides the desired information about the phase steps. Pixelwise estimation of arbitrary phase steps from an interference signal buried in noise, in the presence of nonsinusoidal waveforms, by rotational invariance is proposed in [105]. First, a positive semidefinite autocorrelation matrix, which depends only on the step between the samples, is built from the M phase-shifted records at each pixel (i, j). The signal is separated from the noise by a canonical decomposition into positive definite Toeplitz matrices formed from the autocorrelation estimates. The phase steps are determined as frequency estimates from the eigendecomposition of the signal autocorrelation matrices. The exact number of harmonics in the signal is required. The method is extended to retrieve two distinct phase distributions in the presence of higher harmonics and arbitrary phase steps introduced by multiple PZTs [116]. In [117] the problem of using two or more PZTs in PS interferometry with arbitrary phase steps in the presence of random noise is solved by a maximum-likelihood approach. The developed algorithm should allow for compensation of a non-sinusoidal wavefront and for non-collimated illumination.

5.2.1.4 Phase Unwrapping

As has already been mentioned, the presence of the inverse trigonometric function arctan in the PS algorithms introduces an ambiguity in the measured phase distribution. The calculated phase is wrapped into the interval (−π, +π). The process of removing the 2π crossovers (unwrapping) can simply be described as subtracting or adding multiples of 2π to the wrapped phase data [118], which is equivalent to assigning the fringe order at each point:

$$\varphi_{unw}(i,j) = \varphi_{wr}(i,j) + 2\pi k(i,j) \quad (5.17)$$

where ϕunw(i, j) is the unwrapped phase at the pixel (i, j), ϕwr(i, j) is the experimentally obtained wrapped phase at the same point, and k(i, j) is an integer that counts the 2π crossovers from a starting point with a known phase value to the point (i, j) along a continuous path. Therefore, the phase unwrapping problem is the problem of estimating the correct value of k(i, j) in order to reconstruct the initial true signal [119]; a minimal 1D illustration is sketched below. The described unwrapping procedure performs well only in the case of a noise-free, correctly sampled FP without abrupt phase changes due to object discontinuities [120]. The basic error sources that deteriorate the unwrapping process are i) speckle noise, ii) digitization and electronic noise in the sampled intensity values, iii) areas of low or null fringe visibility, and iv) violation of the sampling theorem [44, 120, 121]. In addition, phase unwrapping algorithms should distinguish between authentic phase discontinuities and those caused by object peculiarities, coalescence [122], shadowing or non-informative zones due to the limited detector visibility range.
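As a minimal 1D illustration of (5.17), the fringe order k can be accumulated from the wrapped differences between neighbouring samples; the sketch below (functionally equivalent to numpy.unwrap) assumes exactly the noise-free, correctly sampled conditions noted above.

```python
import numpy as np

def unwrap_1d(phi_wr):
    """Unwrap a 1D wrapped phase profile per (5.17): choose the integer
    k at each sample so that no neighbour-to-neighbour jump exceeds pi.
    Fails on noisy or undersampled data, as discussed in the text."""
    phi_wr = np.asarray(phi_wr, dtype=float)
    k = np.cumsum(np.round(-np.diff(phi_wr) / (2 * np.pi)))  # fringe order
    phi_unw = phi_wr.copy()
    phi_unw[1:] += 2 * np.pi * k
    return phi_unw
```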
Over the years, much research has been aimed at developing different unwrapping techniques [123], which should find the middle ground between alleviating the computational burden and reducing the influence of the phase ambiguities [124]. One of the major problems in phase unwrapping is how to handle the unreliable data that may disturb the restoration of the actual data. A principal categorization of the algorithms that attack the major error sources is proposed in [125, 126], where three basic classes are outlined:

i) Global class. In global-class algorithms the solution is formulated in terms of the minimization of a global function. The most popular phase unwrapping approaches [125, 127, 128, 129, 130, 131, 132] are based on the solution of an unweighted or weighted least-squares problem [123, 133]. All the algorithms in this class are known to be robust but computationally intensive. In the presence of noise and other fringe discontinuities, however, the generalized least-squares approach leads to corrupted results. To overcome this disadvantage, time-consuming post-processing has to be utilized [126].

ii) Region class. An essential feature of these algorithms is the subdivision of the wrapped data into regions. Each region is processed individually, and on this basis larger regions are formed until all wrapped phase values are processed. This restricts local errors to the processed zone of the FP, preventing their propagation into the other regions. There are two groups of region algorithms: tile-based and region-based. In the tile-based approach [134, 135] the phase map is divided into a grid of small tiles, which are unwrapped independently by line-by-line scanning techniques, after which the regions are joined together. However, this algorithm is not successful in processing very noisy data. The region-based approach, proposed initially by Geldorf [136] and upgraded by other researchers [119, 128, 137, 138, 139, 140], relies on forming uniform regions of continuous phase. Each pixel is compared to its neighbour: if the phase difference is within a predefined value, the pixel and its neighbour are attached to the same region; otherwise, they belong to different regions. After that, the regions are shifted with respect to each other to eliminate the phase discontinuities.

iii) Path-following class, in which the data are unwrapped along an integration path. The path-following algorithms can be subclassified into three groups: path-dependent methods, residue-compensation methods and quality-guided path methods. The first group is characterized by phase integration along a predefined path (e.g. linear scanning, spiral scanning or multiple scan directions [141]); the simplest example of this type is proposed by Schafer and Oppenheim [142]. Despite their benefit of low computation time, these methods are not reliable in the presence of noise and other error sources because of the fixed integration path. The residue-compensation methods rely on finding the nearest residues (defined as unreliable phase data) and connecting them in pairs of opposite polarity by a branch-cut [143]. Uncompensated residues
could also be connected to the image border pixels. The unwrapping procedure is then carried out without crossing any of the placed branch-cuts, which limits the possible integration paths. Other similar approaches [144, 145] are also based on the branch-cut unwrapping strategy. These methods produce fast results, but an inappropriately placed branch-cut can lead to the isolation of some phase zones and a discontinuous phase reconstruction. Quality-guided path-following algorithms unwrap the most reliable data first, while the least reliable data are postponed in order to avoid error spreading. The choice of the integration path depends on the pixel quality as given by a quality map, first proposed by Bone [146], who uses the second difference as a criterion for data reliability: a threshold is set, and all phase data whose calculated second derivatives lie under it are unwrapped in any order. The method is improved in [147, 148] by introducing an adaptive threshold with an increasing threshold value, whose implementation allows all data to be processed. However, when a reliable quality map is not available, the method fails in the phase restoration. The accuracy of the produced quality map assures successful performance of the method [149] with different types of phase quality estimators, such as correlation coefficients [123, 150], phase derivative variance [151, 152] or fringe modulation [77, 153]. To illustrate some of the discussed phase unwrapping methods, we processed the wrapped phase map (Fig. 5.8) of two real objects – a plane and a complicated relief surface – produced experimentally by two-spacing projection PS interferometry [5]. The results are shown in Fig. 5.9. The Goldstein algorithm (Fig. 5.9a) identifies the low-quality phase values but does not create correct branch-cuts. The main advantage of this algorithm is the minimization of the branch-cut length, which allows for fast data processing. However, this approach is not efficient in the case of phase maps with sharp
Fig. 5.8. Wrapped phase map of a test (left) and a real (right) object
Fig. 5.9. Unwrapped phase map for a) Goldstein method, b) mask-cut method, c) minimum Lp – norm method, d) weighted multigrid method, e) conjugated gradient method, f ) least-squares method, g) quality-guided path following method and h) Flynn method
discontinuities. The same poor result is observed with the mask-cut algorithm (Fig. 5.9b), which upgrades the Goldstein method by introducing a quality map to guide the branch-cut placement. As with the Goldstein method, the incorrect interpretation of the phase data can be attributed to the low accuracy of the quality map. The phase unwrapping with all four minimum-norm methods fails (Fig. 5.9c–f) in the case of a complex phase map with low-quality noisy regions and discontinuities. A possible reason is the absence of a good quality map. Increasing the number of iterations improves the quality of the demodulated phase, but at the expense of a longer computation time. The quality-guided path-following method (Fig. 5.9g) successfully demodulates the processed phase map. The regions with bad quality values (due to noise and shadowing) are recognized thanks to the quality map that guides the integration path. The algorithm is fast and renders the small details well, which makes it suitable for processing complex phase maps. The Flynn method (Fig. 5.9h) also provides a phase reconstruction by effectively identifying the phase discontinuities, its main benefit being that it performs well without an accurate quality map. However, in comparison with the quality-guided path-following method, it renders details and flat surfaces more poorly and is more time consuming.

The involvement of the arctan function in the phase retrieval is an obstacle to achieving the two main goals of PMP: high measurement accuracy and unambiguous full-field measurement. Among the solutions to this problem is the so-called temporal phase unwrapping method [154, 155], which performs pixel-by-pixel unwrapping along the time coordinate by projecting a proper number of FPs at different frequencies. Thus, propagation of the unwrapping error to the neighbouring pixels is avoided (see the sketch below).
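The distinguishing feature of the temporal approach is that the 2π corrections are accumulated along the time axis of each pixel's phase sequence, never across the image. A minimal sketch, assuming a stack of wrapped phase maps ordered by increasing fringe number:

```python
import numpy as np

def temporal_unwrap(phase_stack):
    """Pixelwise temporal phase unwrapping. phase_stack: (T, Ny, Nx) wrapped
    phase maps recorded at increasing fringe density. Unwrapping along
    axis 0 never mixes neighbouring pixels, so a local error cannot
    propagate spatially; the last slice is the final high-sensitivity map."""
    return np.unwrap(phase_stack, axis=0)[-1]
```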
The first projected pattern in the temporal sequence consists of a single fringe, and the phase changes from −π to +π across the field of view [156]. If the number of fringes increases at subsequent time values as n = 2, 3, . . . , N, the phase range increases as (−nπ, nπ). For each n, M phase-shifted FPs are recorded. Therefore, the measured intensity depends on the pixel coordinates, the current number of fringes and the number of the phase-shifted patterns. The analysis made in [44, 157] shows that the error in the depth determination scales as N⁻¹ to N⁻³/². Obviously, temporal unwrapping is suitable for applications in which the goal is to derive the phase difference. Modifications of the original scheme have been tested, aiming to reduce the number of FPs used. As an example, in [158] two sinusoidal gratings with different spacings are used for fringe projection. The grating with the higher spatial frequency ensures the sensitivity of the measurement, while the coarse grating creates a reference pattern for the phase unwrapping procedure. Projection of tilted grids for determination of the absolute coordinates is proposed in [159]. In [20] an SLM is used to project fringes for surface contouring with a time-varying spatial frequency, e.g. linearly increasing, and the 3D coordinates are restored pixel by pixel through temporal unwrapping. In [157] an exponential increase of the spatial frequency of the fringes is used, which enhances the unwrapping reliability and reduces the time for data acquisition and phase demodulation. In [160] temporal unwrapping is combined with digital holography. The method requires a time-coded projection, which is a serious limitation. This limitation is overcome in [161], where the authors propose projection of a single FP obtained by merging two sinusoidal gratings with two different spacings 1/f1 > 1/f2. The following FP is recorded:

$$I(x,y) = I_B(x,y) + I_1(x,y) + I_2(x,y) = I_B(x,y) + I_{V1}(x,y)\cos[2\pi f_1 x + \varphi(x,y)] + I_{V2}(x,y)\cos[2\pi f_2 x + \varphi(x,y)] \quad (5.18)$$

Two phase maps ϕ1,2(x, y) are derived from the components I1,2(x, y), which are isolated from the registered FP and multiplied by the signals cos(2πf1,2x) and sin(2πf1,2x), respectively. Due to the relation f1ϕ2(x, y) = f2ϕ1(x, y), higher sensitivity is achieved, at least within the non-ambiguity interval of ϕ1(x, y). In [162] two Ronchi gratings of slightly different spacings are used for fringe generation. The small difference in the spacings is grounds for concluding that at a given point (x, y) both ϕ1,2(x, y) and their difference are monotone functions of the object depth or height, h. This allows for coarse and fine estimation of h. A multifrequency spatial-carrier fringe projection system is proposed in [22]. The system is based on two-wavelength lateral shearing interferometry and varies the spatial-carrier frequency of the fringes either by changing the wavelength of the laser light or by slight defocusing. In [163] a white-light Michelson interferometer produces varying-pitch gratings of different wavelengths which are captured and separated by a colour video camera using red, green
and blue channels. Parallel and absolute measurement of the surface profile with wavelength-scanning interferometry is given in [32]. Using Michelson and Fizeau interferometers, the authors report measuring objects with steps and narrow dips. The multiwavelength contouring of objects with steps and discontinuities is further improved in [164] by an optimization procedure for determining the minimum number of wavelengths necessary for phase demodulation. A pair of coarse and fine phase diffraction gratings is used for simultaneous illumination of an object at two angles in a PS interferometric system for flatness testing. The synthetic wavelength is 12.5 mm, and a height resolution of 0.01 mm is achieved. A PS approach without phase unwrapping is described in [165]. It includes calculation of the partial derivatives to build a 2D map of the phase gradient and numerical integration to find the phase map. The method proves to be less sensitive to phase-step errors and does not depend on the spatial nonuniformity of the illuminating beam or on the shape of the FP boundary. Projection of a periodic sawtooth-like light structure and the PS approach are combined in [166]. Projection of such a pattern is simpler in comparison with the sinusoidal profile. The phase demodulation procedures are described for right-angle triangle teeth and isosceles triangle teeth. The method requires uniform reflectivity of the surface. The recommended phase step φ is half the period of the projected pattern.

5.2.2 Absolute Coordinates Determination

Projection of two FPs with different spatial frequencies can be used for measurement of 3D coordinates, as proposed in [167]. The method is based on the generation in the (x′, y′, 0) plane of fringes with spacings d1 and d2 that are parallel to the y′ axis (Fig. 5.10). The y and y′ axes are perpendicular to the plane of the drawing. The phase of the projected fringes is determined
Fig. 5.10. Basic set-up for absolute coordinates determination
as ϕi = 2πx′/di, i = 1, 2. The phase is reconstructed in the xyz coordinate system, with the z axis oriented parallel to the optical axis of the CCD camera. The angle α is the inclination of the illumination axis z′ with respect to the observation axis z. The phase maps are determined by the five-step algorithm for each of the spacings. The smaller of the spacings is chosen to allow ten pixels per fringe period. The phase Δϕi(x, y) can be represented in the xyz coordinate system as

$$\Delta\varphi_i(x,y) = \varphi_i(x,y) - \varphi_0 = \frac{2\pi}{d_i}\cdot\frac{lx\cos\alpha + lz(x,y)\sin\alpha}{l - z(x,y)\cos\alpha + x\sin\alpha} - \varphi_0, \quad (5.19)$$
where i = 1, 2; z(x, y) is the relief of the object at the point (x, y), l is the distance from the object to the exit pupil of the illumination objective, and ϕ0 is an unknown calibration constant. Subtracting the obtained phase distributions and assuming Δϕ2(x, y) − Δϕ1(x, y) = 2πn(x,y), we obtain the expression for the coordinate z in the form

$$z(x,y) = \frac{n_{x,y}\,(l + x\sin\alpha) + \chi l x\cos\alpha}{n_{x,y}\cos\alpha - \chi l\sin\alpha}, \quad \chi = \frac{d_2 - d_1}{d_1 d_2} \quad (5.20)$$

The vertical interference fringes, generated with collimated laser light and a Michelson interferometer (one mirror of which is mounted on a phase-stepping device), are projected onto the plane (x′, y′, 0). Different spacings of the interference patterns are used for successive illumination of the object surface (d1 = 1 mm, d2 = 2 and 6 mm). The angle α of the object illumination is 30 deg. The wrapped phase maps for the different spacings of the projected FPs are presented in Fig. 5.11. Figure 5.12 gives the 3D reconstruction of the object. The method's sensitivity mainly depends on the accuracy with which the phase difference is measured, i.e., on the accuracy of the n(x,y) estimation. The influence of inaccuracies in determining l and α can be neglected. The measurement accuracy increases with the difference (d1 − d2) and with the illumination angle α, but it is not uniform over the length of the object and decreases as its transverse size increases. It is interesting to compare the obtained result to the two-wavelength holographic contouring of the same object presented in [168, 169]. In reconstruction with a single wavelength of the two-wavelength recorded hologram
Fig. 5.11. Phase maps obtained for different spacings of the projected interference patterns after median filtering and detection of low-quality zones; left) d1 = 1 mm, d2 = 2 mm; right) d1 = 1 mm, d2 = 6 mm
Fig. 5.12. Reconstructed 3D image from the difference phase map
the object's image is modulated, as a result of the interference of the two reconstructed images, by sinusoidal contouring fringes in the normal direction separated by a distance Δz which depends on both recording wavelengths and on the angle between the reference and the object beam (surface normal). A 10 mW CW temperature-stabilized diode laser, emitting two wavelengths in the red spectral region (∼ 635 nm) shifted by Δλ ∼ 0.08 nm, is used for recording a single-exposure reflection (Denisyuk-type) hologram onto a silver-halide light-sensitive material. The illumination angle is 30 deg. The image reconstructed in white light is shown in Fig. 5.13. The step between the contouring fringes is Δz = 1.83 mm.

5.2.3 Fourier Transform Method

5.2.3.1 Basic Principle and Limitations

The most common and simple way to demodulate the phase from a single FP is to use the Fourier transform for analysis of the fringes. Almost three decades of intensive research and application have made the Fourier-transform-based technique a well-established method in holography, interferometry and fringe projection profilometry. In two works [170, 171] published within a year, in 1982 and 1983, by Takeda and co-workers, it is shown that the 1D version of the Fourier transform method can be applied both to interferometry [170] and to PPP [171]. Soon after that, the method gained popularity under the name of Fourier fringe analysis (FFA) [172, 173, 174, 175, 176]. For 3D shape measurement the method
Fig. 5.13. Reconstructed image of the reflection hologram under white-light illumination
becomes known as Fourier transform profilometry (FTP). The FTP surpasses in sensitivity, and avoids all the drawbacks of, the previously existing conventional moiré technique used for 3D shape measurement, such as the need to assign the fringe order, poor resolution, and the inability to discern concave from convex surfaces [177, 178]. The computer-aided FFA is capable of registering a shape variation that is much less than one contour fringe in moiré topography [171]. Some years later, the 1D Fourier transform method was extended to process 2D patterns – first by applying the 1D transform to carrier fringes parallel to one of the coordinate axes [179, 180], and later by generalization of the method to two dimensions [175]. Actually, as has been reported in [174], the algorithm proposed in [175] had been in use since 1976 for processing stellar interferograms. The ability of the FFA to distinguish fully automatically between a depression and an elevation in the object shape laid the ground for automated processing in the FTP [171]. The main idea of the FFA is to add a linearly varying phase to the FP, i.e. to use in (5.1) φ(r) = 2πf0 · r, which can be done e.g. by tilting one of the mirrors in the interferometric setup or by using a diffraction grating for fringe projection. Obviously, the introduction of the carrier frequency f0 = (f0x, f0y) is equivalent to adding a plane in the phase space, as shown in Fig. 5.14. The expression for the recorded intensity becomes:
Fig. 5.14. Left: pattern with open fringes; middle: 3D presentation of the phase map without carrier removal; right: pattern with closed fringes
$$I(\mathbf{r}) = I_B(\mathbf{r}) + I_V(\mathbf{r})\, f[\varphi(\mathbf{r}) + 2\pi \mathbf{f}_0\cdot\mathbf{r}] = I_B(\mathbf{r}) + I_V(\mathbf{r})\sum_{p=1}^{\infty} A_p \cos\{p[\varphi(\mathbf{r}) + 2\pi \mathbf{f}_0\cdot\mathbf{r}]\} \quad (5.21)$$
where the dependence on the time variable, t, is omitted since we consider the case of phase retrieval from a single FP. The purpose of the carrier frequency introduction is to create a FP with open fringes in which the phase change is monotonic (Fig. 5.14). The further processing of (5.21) is straightforward and includes the following steps:

i) Fourier transform of the carrier-frequency FP that is modulated by the object

$$D(\mathbf{f}) = D_B(\mathbf{f}) + \sum_{\substack{p=-\infty \\ p\neq 0}}^{\infty} D_p(\mathbf{f} - p\mathbf{f}_0) \quad (5.22)$$

with $D_p(\mathbf{f}) = F\left\{\tfrac{1}{2} I_V(\mathbf{r}) A_p e^{jp\varphi(\mathbf{r})}\right\}$ and $D_B(f_x, f_y) = F\{I_B(x,y)\}$, where F{. . .} denotes the Fourier transform and f = (fx, fy) is the spatial frequency;
ii) selection of the fundamental spectrum that corresponds to one of the two first diffraction orders, D1(f − f0) or D1(f + f0), by proper asymmetric bandpass filtering;
iii) removal of the carrier frequency, D1(f − f0) → D1(f);
iv) inverse Fourier transform back to the spatial domain, F⁻¹{D1(f)};
v) extraction of the phase information from the resulting complex signal $\Psi(\mathbf{r}) = \tfrac{1}{2} I_V(\mathbf{r}) A_1(\mathbf{r}) e^{j\varphi(\mathbf{r})}$ in the spatial domain, whose argument is the sought phase:

$$\varphi(\mathbf{r}) = \tan^{-1}\frac{\mathrm{Im}[\Psi(\mathbf{r})]}{\mathrm{Re}[\Psi(\mathbf{r})]} \quad (5.23)$$
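Steps i)–v) map directly onto a few FFT calls. The following is a minimal NumPy sketch of the pipeline, assuming the carrier frequency is known in advance and using a simple rectangular filter window; the function name, the window shape and its size are illustrative choices.

```python
import numpy as np

def ffa_demodulate(fp, f0, halfwidth):
    """Single-frame FFA phase retrieval, steps i)-v) above.
    fp: 2D fringe pattern; f0 = (f0y, f0x): carrier in cycles/pixel;
    halfwidth: half-size of the rectangular band-pass window in bins."""
    ny, nx = fp.shape
    D = np.fft.fftshift(np.fft.fft2(fp))                    # i) 2D spectrum
    # location of the +1-order carrier peak on the centred frequency grid
    iy = ny // 2 + int(round(f0[0] * ny))
    ix = nx // 2 + int(round(f0[1] * nx))
    D1 = np.zeros_like(D)
    win = (slice(iy - halfwidth, iy + halfwidth + 1),
           slice(ix - halfwidth, ix + halfwidth + 1))
    D1[win] = D[win]                                        # ii) band-pass filter
    D1 = np.roll(D1, (ny // 2 - iy, nx // 2 - ix), (0, 1))  # iii) carrier removal
    psi = np.fft.ifft2(np.fft.ifftshift(D1))                # iv) inverse FFT
    return np.arctan2(psi.imag, psi.real)                   # v) wrapped phase
```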
As can be seen, the introduction of the carrier frequency separates in the Fourier domain the two counterparts of the fundamental spectrum from each other and from the background intensity contribution concentrated around the zero frequency (Fig. 5.15). Due to the global character of the Fourier transform,
Fig. 5.15. Schematic of the Fourier fringe analysis
the phase estimate calculated at an arbitrary pixel depends on the whole recorded FP. This means that any part of the pattern influences all other parts and vice versa. The successive steps of the FFA are illustrated in Fig. 5.16. Similar to the PS technique, the FFA returns a phase value modulo 2π that needs further unwrapping. As can be seen from (5.23), the phase is restored without the influence of the terms IB(x, y) and IV(x, y). This means that the Fourier algorithm is not vulnerable to the noise sources that create IB(x, y), e.g. stray light from the laboratory environment, unequal intensities in the two arms of the interferometer or the dark signal of the imaging system, nor to the noise contributions in IV(x, y), e.g. nonuniform intensity distribution of the illuminating beam, optical noise or nonuniform response of the CCD [181]. In most cases the higher-harmonics content is ignored, and the recorded pattern with open fringes looks like [173]:

$$I(x,y) = I_B(x,y) + \Psi(x,y)\exp[2\pi j(f_{0x}x + f_{0y}y)] + \Psi^{*}(x,y)\exp[-2\pi j(f_{0x}x + f_{0y}y)] \quad (5.24)$$

Fig. 5.16. Single frame phase retrieval with FFA
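A quick synthetic check of the sketch given earlier can be built directly from the open-fringe model (5.24); all numbers below are arbitrary.

```python
import numpy as np

# synthetic open-fringe pattern: background + carrier + smooth object phase
ny, nx = 256, 256
y, x = np.mgrid[0:ny, 0:nx]
f0 = (0.0, 0.1)                              # carrier: 0.1 cycles/pixel along x
phi = 3.0 * np.exp(-((x - 128) ** 2 + (y - 128) ** 2) / 2000.0)
fringes = 10.0 + 5.0 * np.cos(2 * np.pi * (f0[0] * y + f0[1] * x) + phi)

phi_hat = ffa_demodulate(fringes, f0, halfwidth=20)  # wrapped estimate of phi
```

Away from the image borders, phi_hat should reproduce phi up to an additive constant (modulo 2π); errors near the borders anticipate the leakage effects discussed further below.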
The 2D Fourier transform of (5.24) can be written in the form:

$$D(f_x, f_y) = D_B(f_x, f_y) + D_1(f_x - f_{0x}, f_y - f_{0y}) + D_1^{*}(f_x + f_{0x}, f_y + f_{0y}) \quad (5.25)$$

where the asterisk denotes the complex conjugate. If the background, the visibility and the phase vary slowly in comparison to (f0x, f0y), the amplitude spectrum is a trimodal function with a broadened zero peak DB(fx, fy) and with D1 and D1* placed symmetrically with respect to the origin. In this case, the three parts of the spectrum in (5.25) can be well isolated from each other. A 2D bandpass filter centered at (f0x, f0y) extracts the single spectrum D1(fx − f0x, fy − f0y), which is then shifted to the origin of the frequency domain (Fig. 5.15). The amplitude of the zero-order spectrum at each point in the frequency domain exceeds the amplitudes of the first orders at least twice, which restricts the size of the filter window to remain, roughly speaking, less than halfway between the zero- and first-order maxima. If after the filtering D1(fx − f0x, fy − f0y) remains where it is, a tilt is introduced in the restored height distribution [182]. The inverse Fourier transform of D1(fx, fy) yields, at least theoretically, the complex signal Ψ(x, y). The FTP uses optical geometries similar to those of projection moiré topography [171]. The most common and easiest to implement is the crossed-optical-axes geometry, like the one depicted in Fig. 5.1. For the measurement with a reference plane, the phase change is determined from Δϕ(x, y) = Im{log[Ψ(x, y)Ψr*(x, y)]} [183], where Ψr*(x, y) corresponds to the reference plane and is obtained after the inverse Fourier transform of the filtered positive or negative counterpart of the fundamental spectrum. The necessary condition to avoid overlapping of the spectra in (5.25), if we assume without loss of generality that the carrier fringes are parallel to the y axis, is given by [173]:

$$f_{0x} + \frac{1}{2\pi}\frac{\partial\varphi(x,y)}{\partial x} > 0 \quad\text{or}\quad f_{0x} + \frac{1}{2\pi}\frac{\partial\varphi(x,y)}{\partial x} < 0, \quad (x,y)\in S \quad (5.26)$$

The choice of the inequality depends on whether the positive or the negative counterpart of the first-order spectrum has been filtered; here S is the area occupied by the FP. This limitation on the phase variation, and hence on the depth variation within the object under investigation, is the main drawback of the FFA. Obviously, the above condition is satisfied only for open fringes with monotonic phase behaviour, which makes the FFA inapplicable to closed fringes. In order to avoid aliasing, (f1x)max ≤ (fnx)min for n > 1 and (f1x)min ≥ (fB)max should be satisfied (Fig. 5.15), where 2πfnx = n[2πf0x + ∂ϕ(x, y)/∂x] and (fB)max is the maximal frequency of the background spectrum. The above inequalities entail

$$\left|\frac{\partial\varphi(x,y)}{\partial x}\right| \le \frac{2\pi f_{0x}}{3} \quad\text{or}\quad \left|\frac{\partial h(x,y)}{\partial x}\right| \le \frac{L_0}{3d} \quad (5.27)$$
When the height variation exceeds limitation (5.27), aliasing errors hamper the phase retrieval. Application of the Fourier transform technique without spatial heterodyning was proposed by Kreis [44, 176]. Applying a proper bandpass filtering to the Fourier transform

$$D(f_x, f_y) = D_B(f_x, f_y) + D_1(f_x, f_y) + D_1^{*}(f_x, f_y) \quad (5.28)$$

of I(x, y) = IB(x, y) + Ψ(x, y) + Ψ*(x, y), an estimate D̂1(fx, fy) of D1(fx, fy) can be derived and the phase distribution restored from

$$\hat{\varphi}(x,y) = \arg\,F^{-1}[D(f_x > 0, f_y > 0)] \quad (5.29)$$

However, the distortions in the restored phase due to the possible overlapping of D1(fx, fy) and D1*(fx, fy) are more severe in this case than with spatial heterodyning. This technique is appropriate for objects which cause a slowly varying phase modulation centered about some dominant spatial frequency. In view of the obvious relations fx(x, y) = ∂ϕ(x, y)/∂x and fy(x, y) = ∂ϕ(x, y)/∂y, the phase estimate increases monotonically along X and Y [184], and the sign of the local phase variation is not restored.

5.2.3.2 Accuracy Issues and Carrier Removal

Obviously, the two possible ways to improve the accuracy of the FFA are to vary the carrier frequency or the width of the filter window in the frequency domain. To ensure a monotonic phase change throughout the FP the carrier frequency should be chosen large enough; however, it may happen that the carrier fringe period, $(f_{0x}^2 + f_{0y}^2)^{-1/2}$, exceeds the spatial resolution of the CCD camera. Therefore, high-resolution imaging systems are required for the measurement of steep object slopes and step discontinuities [185]. Besides, the introduction of the spatial carrier entails, as a rule, a change in the experimental setup, which could require sophisticated and expensive equipment that may not always be available. In addition, a change in the carrier frequency could hardly be synchronized with the dynamic behaviour of the object. The width of the Fourier-plane window affects the accuracy of the phase restoration and the spatial resolution in opposite ways [186]. The three terms in (5.25) are continuous functions throughout the Fourier domain. If the filter width is taken too large, information from the rejected orders of the Fourier transform will leak into the processed frequency window, leading to phase distortions. A decrease of the width worsens the spatial resolution. A trade-off between the accuracy of the phase determination and the spatial resolution is required. Obviously, for a real FP that is corrupted by noise, the demodulated phase estimate, ϕ̂(x, y), differs from the real phase given by (5.23). Since the noise covers the whole Fourier transform plane, a decrease in the filter width leads to considerable noise reduction. For optimal filtering, prior information on the noise and the bandwidth of the modulating signals is required.
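In the pipeline sketch given earlier a rectangular window was used; replacing it with a Gaussian apodization centered at the carrier peak is a one-line change. A hedged example (the width sigma is problem dependent, exactly the difficulty noted above):

```python
import numpy as np

def gaussian_window(shape, center, sigma):
    """2D Gaussian band-pass window centred at the carrier peak (iy, ix).
    sigma, in frequency bins, trades noise rejection (small sigma) against
    spatial resolution (large sigma)."""
    ny, nx = shape
    y, x = np.ogrid[:ny, :nx]
    return np.exp(-((y - center[0]) ** 2 + (x - center[1]) ** 2)
                  / (2.0 * sigma ** 2))

# inside the earlier FFA sketch: D1 = D * gaussian_window(D.shape, (iy, ix), 15)
```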
This dependence of the filter parameters on the problem to be solved makes automatic processing of the fringe patterns difficult [187]. Use of a tight square-profile filter window leads to 'filter ringing', causing distortions in the restored phase distribution [182]. A phase accuracy of approximately 0.01 fringe is obtained for a Gaussian apodization window centered at the carrier frequency [172]. The main advantage of a Gaussian filter is its continuous nature and the absence of zeros in its Fourier transform. Use of a 2D Hanning window is reported in [188], which provides better suppression of noise. The background phase distribution caused by optical aberrations can also be eliminated using a differential mode [181]. Substantial reduction of the background intensity can be achieved by the normalization procedure developed in [189], which includes determination of two enveloping 2D functions eb(x, y) and ed(x, y) obtained by applying surface fitting to the centre lines of the bright and dark fringes. The wrapped phase of the normalized fringe pattern

$$I_n(x,y) = A\,\frac{I(x,y) - e_d(x,y)}{e_b(x,y) - e_d(x,y)} + B \quad (5.30)$$
where A and B are normalization constants, remains the same as for the non-normalized pattern, but the contribution of the background is strongly diminished. A transform-domain denoising technique for processing speckle FPs, based on the discrete cosine transform with a sliding window and an adaptive thresholding, is developed in [190]. To decrease the noise influence, a method to enhance the FP by modifying the local intensity histogram before the Fourier transform is proposed in [184]. The modification is based on a monotonic transformation from the real intensity values to the ideal values, thus removing the noise without worsening the contrast. A background removal is proposed in [191] for the case of continuous registration of FPs by adding the patterns in series. After normalization to the grey-level range of the CCD camera, the intensity distribution of the resulting pattern gives the background estimate for a high number of added patterns. The method proves to be especially efficient for low-carrier-frequency FPs, for which the zero- and first-order peaks overlap to a great extent. Improvement of the spatial resolution without loss of phase demodulation accuracy is proposed and verified in [186]. The idea is to make use of the two complementary outputs of an interferometer, taking into account that the locations of constructive interference in the plane of the first output correspond to destructive interference at the second output, i.e. we have:

$$I_1(x,y) = I_{B1}(x,y) + I_{V1}(x,y)\cos[\varphi_1(x,y) + 2\pi(f_{0x}x + f_{0y}y)] \quad (5.31)$$
$$I_2(x,y) = I_{B2}(x,y) - I_{V2}(x,y)\cos[\varphi_2(x,y) + 2\pi(f_{0x}x + f_{0y}y)] \quad (5.32)$$

If precautions are taken to ensure perfectly equal contrasts and gains while recording the two interferograms by two different cameras, the zero-order spectrum vanishes upon subtraction of the Fourier spectra of both patterns. This
permits increasing the size of the window of the filter applied to the first-order spectrum, and the spatial resolution respectively, by a factor of 2. The FFA method with two complementary interferograms is very useful for images with high spatial frequencies in which the fundamental spectrum is not well localized, or for the case of undersampling [181]. Elimination of the zero order by registration of two FPs phase-shifted by π using a defocused image of a Ronchi grating is proposed in [192]. The authors report the contribution of the higher orders to be 25% of the fundamental spectrum. Projection of a quasi-sinusoidal wave and the π phase-shifting technique increase the acceptable height variation to |∂h(x, y)/∂x| ≤ L0/d [183]. A modification of the FFA which makes it suitable for a special class of closed FPs is proposed in [173]. The goal is achieved by transforming the closed FP into an open FP in a polar coordinate system using x = X + r cos θ, y = Y + r sin θ, where X, Y are the coordinates of the center of the closed FP in the Cartesian coordinate system and the point (X, Y) is chosen as the origin of the polar coordinate system. The FP in the r–θ space consists of straight open fringes that permit application of the conventional FFA. However, this is true only for a concave or convex phase surface with the origin of the polar coordinate system coinciding with the apex of the wavefront. The phase retrieved in the r–θ space is transformed back to the Cartesian coordinate system, and the phase map of the closed-fringe pattern is recovered.

The Fourier transform is calculated using a discrete Fourier transform (DFT). Use of the DFT leads to the so-called leakage for frequencies that are not integer multiples of (1/NxΔx, 1/NyΔy) [193]. Several authors point out that the error induced by the leakage effect in the retrieved phase is inevitable due to the discretization of the image by the CCD and the non-integer number of fringes within the image [172, 191, 193, 194]. The distortions caused by the leakage are negligible if the carrier frequencies ensure an integer number of fringes within the image and the object height distribution is also concentrated within the image [194], i.e. no phase distortions occur at the image boundaries. To avoid the leakage effect when large objects with a non-vanishing height at the image borders are monitored, a method is proposed in [194] in which the full image is divided into overlapping subimages by a window that slides along the axis normal to the carrier fringes, e.g. axis X. The window width is chosen approximately equal to one fringe period. If this width is NW, then Nx − NW consecutive subimages are processed. The next step is to apply the Fourier transform successively to all rows parallel to the X-axis of each subimage, thus achieving a local phase demodulation. The sliding pace is one discretization step per subimage, which in practice ensures phase recovery at each point of the image and explains why the method is called an interpolated or regressive 1D Fourier transform [194]. Briefly, the fringe pattern in each subimage I(xk, . . . , xk+NW, yl) is modelled by a single-frequency sine wave Ik,l(x) = Ak,l sin(2πfk,lx + ϕk,l) with frequency fk,l and phase ϕk,l connected to the height at the point (kΔx, lΔy). The Fourier
transform of the sine wave leads to a set of two non-linear equations which, when solved for the two largest Fourier coefficients, yield the required frequency fk,l and phase ϕk,l [194]. The frequencies evaluated by the proposed approach are not limited to the frequencies of the Fourier transform, and no leakage occurs. The necessity of finding the two largest spectral lines of the locally computed FFT involves sorting operations, which slightly increases the computational burden. The sine-wave modelling gives good results in image regions without abrupt phase changes, i.e. for smooth objects without height discontinuities. The discrete nature of the Fourier spectrum may cause distortions in the recovered phase at the step of removal of the heterodyning effect. If the sampling interval in the frequency domain is considerably large, it is difficult to translate the positive or negative component of the fundamental spectrum by exactly (f0x, f0y) to the origin. If the bias error in the shifted position of the fundamental spectrum is (δf0x, δf0y), the retrieved phase is given by

$$\hat{\varphi}(x,y) = \varphi(x,y)\exp\{-2\pi j(x\,\delta f_{0x} + y\,\delta f_{0y})\} \quad (5.33)$$
with |δf0x,y| ≤ 0.5Δf0x,y, where Δf0x and Δf0y give the resolution in the frequency domain. The modulation of the true phase may lead to considerable phase shifts in some parts of the object. Distortions in the recovered phase due to the discrete nature of the Fourier spectrum are studied in [175, 180]. The approach proposed there relies on background and carrier frequency evaluation by a least-squares fit of a plane in the part of the recorded image that is not affected by the object. The evaluated phase plane is subtracted from the retrieved phase in the spatial domain. However, this approach is rather cumbersome due to its inevitable dependence on the proper choice of the object-free area. An efficient approach is proposed in [195] to evaluate the phase map ψ(x, y) = 2πf0x x + ϕ(x, y) from the FP by computing the mean value of its first phase derivative along the X-axis

$$\overline{\left(\frac{\partial\psi(x,y)}{\partial x}\right)}_{S} = 2\pi f_{0x} + \overline{\left(\frac{\partial\varphi(x,y)}{\partial x}\right)}_{S} \quad (5.34)$$

where $\overline{(\,)}_{S}$ denotes averaging over the entire image S. It is reasonable to assume that the expectation of the derivative is given by 2πf0x. Thus, subtraction of the estimate of the mean value (5.34) from the 2D map of the first phase derivative along the X-axis is expected to yield the first derivative of the phase modulation caused by the object. The first derivative is calculated as the difference of the phase values at two adjacent pixels, ψ(xi+1) − ψ(xi). In [196] the carrier removal is performed using an orthogonal polynomial curve fitting algorithm. For the purpose, the intensity distribution along one row parallel to e.g. the X-axis is modelled by a sine wave whose Fourier transform, Fs(ω), can be represented theoretically by [196]:

$$F_s(\omega) = \frac{a}{j\omega - \zeta} + \frac{a^{*}}{j\omega - \zeta^{*}} \quad (5.35)$$
where ζ is the pole of Fs(ω). By fitting Fs(ω) with an orthogonal polynomial and using a least-squares approach, an estimate of the carrier frequency can be obtained from (5.35) as ω̂ = ζ̂. The algorithm is based on the assumption that the carrier frequency is the same throughout the whole image. To find the carrier frequency, [191] makes use of the sampling theorem applied to the amplitude of the Fourier transform in the spatial frequency domain. Using the interpolation formula [191]:

$$|D(f_x, f_y)| = \sum_{m,n} |D_{mn}|\,\mathrm{sinc}[\pi(f'_x - m)]\,\mathrm{sinc}[\pi(f'_y - n)], \quad f'_{x,y} = f_{x,y}\,(\Delta f_{x,y})^{-1} \quad (5.36)$$
with Dmn = D(mΔfx, nΔfy), one is able to calculate the carrier frequency precisely. An important drawback of carrier removal based on a frequency shift, or by applying the techniques described in [191, 195, 196], is the inability to remove a possible non-linear component of the carrier frequency. Such a situation is encountered when divergent or convergent illumination is used for grating projection onto a large- or small-scale object, which yields a carrier FP with non-equal spacing, for which the carrier removal by frequency shift fails [197]. To deal with this case, Takeda et al. proposed in [171] to use a reference plane. This solution entails implications such as the need for two measurements as well as the careful adjustment of the reference plane, and it increases the overall uncertainty of the measurement. Srinivasan et al. [198] propose a phase mapping approach without a reference plane. Methods have been developed that directly estimate a phase-to-height relationship from the measurement system geometry without estimating the carrier frequency. A profilometry method for a large object under divergent illumination is developed in [199] with at least three different parallel reference planes for calibration of the geometrical parameters of the system. The calibration permits direct conversion of the phase value, composed of both the carrier and the shape-related components, to a height value. However, high accuracy of determination of the geometrical parameters is required, which makes the calibration process very complicated. A general approach for the removal of a nonlinear-carrier phase component in the crossed-optical-axes geometry is developed in [200] for divergent projection of the grating with a light beam directed at an angle α to the normal to the reference plane and a CCD camera looking normally at it. If for this optical geometry the carrier fringes are projected along the Y-axis, the phase induced by them depends on the x-coordinate in a rather complicated way [200]:

$$\phi(x) = 2\pi\int_0^x f_{0x}(u)\,du = 2\pi p L_1 H \int_0^x \frac{du}{(L_2 + u\sin\beta)\,[H^2 + (d+u)^2]^{1/2}} + \phi(0) \quad (5.37)$$
where φ(0) is the initial carrier phase angle, p is the grating pitch, β is the angle between the grating and the reference plane, and L1, L2, H and d are distances characterizing the optical geometry. The authors propose to use a power series expansion for φ(x):

$$\phi(x) = \sum_{n=0}^{\infty} a_n x^n \quad (5.38)$$

and to determine the coefficients a0, a1, . . ., an, . . . by a least-squares method minimizing the error function:

$$\Omega(a_0, a_1, \ldots, a_N) = \sum_{(x,y)\in S}\left[a_0 + a_1 x + \ldots + a_N x^N - \varphi(x,y)\right]^2 \quad (5.39)$$
where S comprises all image points, ϕ(x, y) is the unwrapped phase, and the number N ensures an acceptable fit to ϕ(x, y). The method is generalized for carrier fringes with an arbitrary direction in the spatial plane. The phase-to-height conversion becomes much simpler upon successful elimination of the nonlinear carrier.

The reliability of the FFA is thoroughly studied in [193] at all steps involved in the phase demodulation, by means of a 1D model of an artificial, ideal, noise-free open-fringe sinusoidal FP with constant magnitudes of IB(r), IV(r) and ϕ(r) throughout the image. The purpose of the analysis is to identify only the errors inherent in the FFA. The filter used in the spatial frequency domain is a rectangular window apodized by a Gaussian function. As a result, an improved formula for phase derivation from the complex signal Ψ(x, y) is proposed.

One of the most serious problems of the FTP arises from objects with large height discontinuities, which are not band-limited, hindering application of the Fourier analysis. In addition, discontinuous height steps and/or spatially isolated surfaces may cause problems with the phase unwrapping [158, 201]. A modification of the FFA proposed in [182] makes phase unwrapping unnecessary by simple elimination of any wraps in the calculated phase distribution. This is achieved by proper orientation of the projected fringes and by independently choosing the angle between the illumination and viewing directions, θ, and the fringe spacing, L, so as to fulfil the requirement h(x, y)L⁻¹ sin θ ≤ 1 after removal of the carrier frequency. Obviously, the method is efficient only for comparatively flat objects, at the expense of a decreased resolution. A two-wavelength interferometer is developed in [202]. The recorded pattern is given by

$$I(x,y) = I_B(x,y) + I_{V1}(x,y)\cos[2\pi f_{1x}x + \varphi_1(x,y)] + I_{V2}(x,y)\cos[2\pi f_{2x}x + \varphi_2(x,y)] \quad (5.40)$$

where ϕ1,2 = 2πh(x, y)/λ1,2 and f1x,2x are inversely proportional to the two used wavelengths. The Fourier transform of the recorded pattern yields
$$D(f_x, f_y) = D_B(f_x, f_y) + \sum_{k=1}^{2}\left[D_1^k(f_x - f_{kx}, f_y) + D_1^{k*}(f_x + f_{kx}, f_y)\right] \quad (5.41)$$
Two first-order spectra are selected, e.g. D1¹ and D1²*, by bandpass filtering and shifted towards the origin of the coordinate system. After the inverse Fourier transform one obtains:

$$\Psi(x,y) = I_V(x,y)\cos[\Xi(x,y)]\exp[j\Gamma(x,y)] \quad (5.42)$$
where precautions are taken to provide IV(x, y) = IV1(x, y) = IV2(x, y), and Ξ(x, y) = πh(x, y)/Λ1 and Γ(x, y) = πh(x, y)/Λ2. Here Λ1 = λ1λ2/(λ1 + λ2) is the average wavelength and Λ2 = λ1λ2/(λ1 − λ2) is the synthetic wavelength. As can be seen, the phase modulation using the synthetic wavelength substantially increases the range of the interferometric measurement without the need for phase unwrapping.

Correct restoration of the 3D object shape and accurate phase unwrapping across big height variations and surface discontinuities can be achieved using multiple phase maps with various sensitivities. A method called the spatial-frequency multiplexing technique was proposed in [203]. The proposed idea is extended in [204] to a technique termed multichannel FFA. The key idea of the method is that phase discontinuities which are due not to the processing algorithm but to surface discontinuities will appear at the same location in FPs generated with differing carrier frequencies. These FPs can be projected simultaneously on the object surface if FFA is used [203]. The spectra that correspond to the multiple FPs used are separated in the frequency space by means of a set of bandpass filters tuned to the carrier frequencies of the fringes. An FFA interferometric technique for automated profilometry of diffuse objects with discontinuous height steps and/or surfaces spatially isolated from one another is designed and tested in [201]. It makes use of spatiotemporal specklegrams produced by a wavelength-shift interferometer with a laser diode as a frequency-tunable light source. The necessity to record and process multiple FPs under the stringent requirement of a vibration-free environment is the main drawback of the developed approach. Phase demodulation and unwrapping by FFA for discontinuous objects and big height jumps obtains further development in [205]; the merit of this work is that all the necessary information is derived from a single FP. This is achieved by combining the spatial-frequency multiplexing technique with the Gushov and Solodkin unwrapping algorithm [206]. The FP projected on the object consists of multiple sinusoids with different carrier frequencies:

$$I(x,y) = I_B(x,y) + I_V(x,y)\sum_{k=1}^{K}\cos[\varphi_k(x,y) + 2\pi(f_{kx}x + f_{ky}y)] \quad (5.43)$$
Defining a set of simultaneous congruence equations for the real height distribution and height distributions corresponding to wrapped phase maps ϕk (x, y)
and using the Gushov and Solodkin algorithm, the phase unwrapping can be done pixelwise. A coaxial optical sensor system is described in [207] for absolute shape measurement of 3D objects with large height discontinuities and holes, without shadowing. This is achieved by a depth-of-focus method with a common image plane for pattern projection and observation. The FFA is applied for evaluation of the contrast, IV(x, y). The absolute height distribution is determined from the translation distance of the image plane that ensures a maximum fringe contrast at each pixel, as in white-light interferometry. Absolute phase measurement using the FFA and temporal phase unwrapping is developed in [188]. Use of a four-core optical fibre for pattern projection and 2D Fourier analysis is demonstrated in [208]. The projected FP is formed as a result of the interference of the four wavefronts emitted from the four cores located at the corners of a square.

In its essence, the FTP is based on determination of the quadrature component of the signal, i.e. it is described by an approximation of the Hilbert transform. Kreis [209] was the first to apply a 2D generalized Hilbert transform for phase demodulation. However, the discontinuity of the used Hilbert transform operator at the origin leads to ringing in the regions where the phase gradient is close to zero. The accuracy of this approximation depends on the bandwidth of the processed signal. Contrary to the traditional opinion that it is not possible to find a natural isotropic extension of the Hilbert transform beyond one dimension and to apply the analytic signal concept to multiple dimensions, a novel 2D quadrature (or Hilbert) transform is developed in [210, 211] as the combined action of two multiplicative operators: a two-dimensional spiral phase signum function in the Fourier space and an orientational phase spatial operator. The quadrature component of the FP Ĩ(r) = I(r) − IB(r), obtained after the removal of the background, is given by the approximation

$$j\exp[j\theta(\mathbf{r})]\,I_V(\mathbf{r})\sin[\varphi(\mathbf{r})] \cong F^{-1}\{S(\mathbf{f})\,F[\tilde{I}(\mathbf{r})]\} \quad (5.44)$$

where θ(r) is the fringe orientation angle, and S(f) is the 2D spiral phase signum function

$$S(\mathbf{f}) = \frac{f_x + jf_y}{\sqrt{f_x^2 + f_y^2}} \quad (5.45)$$

The new transform shows effective amplitude and phase demodulation of closed FPs. A vortex phase element has been applied in [212] for demodulation of FPs. The phase singularity of the vortex filter transforms the FP into a pattern with open fringes in the form of spirals, which allows for differentiating between elevations and depressions in the object.

5.2.4 Space-frequency Representations in Phase Demodulation

5.2.4.1 Wavelet Transform Method

The FFA, as a global approach, exhibits unstable processing for patterns with low fringe visibility, non-uniform illumination and low SNR, as well as in the
presence of local image defects, which influence the entire demodulated phase map [213]. Since the spatial gradient of the phase is proportional to the fringe density, information about the latter is a step towards phase demodulation. This observation paves the ground for the introduction of space-frequency methods in fringe analysis. New phase retrieval tools have been applied, such as the windowed Fourier transform (WFT) [214, 215] and the continuous wavelet transform (CWT) [216]. The wavelet transform is a method that can detect the local characteristics of signals. This explains the extensive research in wavelet processing of FPs and interferograms during the last decade [217, 218]. It can be very useful for patterns with a great variation of the density and orientation of the fringes – the case in which the standard FFA fails [28]. The other methods that are capable of ensuring localized phase retrieval, such as the WFT or the regularized phase-tracking algorithm, generally require a priori information about the fringe density and orientation. In the wavelet transform analysis it is not necessary to choose the filter in the frequency domain. An extensive review of the wavelet transform can be found in [219]. The idea to apply the CWT to 2D fringe data was proposed independently in [220] and [213]. The CWT can be applied both to open and closed fringes. The CWT shows promising results as a denoising tool for interferograms in holography and speckle interferometry [221] and as a method to improve bad fringe visibility in laser plasma interferometry [213]. In white-light interferometry, the CWT proves to be very effective for detecting the zero optical path length [222]. The wavelets can be very useful for finding the zones with a constant law of variation of the fringes [223]. The CWT of a 1D function I(x) is defined as

$$\Phi(a,b) = \int_{-\infty}^{\infty} I(x)\,\psi_{a,b}^{*}(x)\,dx = \sqrt{a}\int_{-\infty}^{\infty}\hat{\psi}(af_x)\,D(f_x)\,\exp(jbf_x)\,df_x \quad (5.46)$$
where ψa,b(x) = |a|⁻¹ᐟ² ψ((x − b)/a); a ≠ 0 and b are the scaling and translation parameters, which are real, and ψ̂(fx) = F[ψ(x)]. The kernel of the transform is a single template waveform, the so-called mother wavelet ψ(x), which should satisfy the admissibility condition [219] in order to have a zero mean, and should exhibit some regularity to ensure the local character of the wavelet transform both in the space and frequency domains. The above conditions mean that the wavelet can be considered as an oscillatory function in the spatial domain and as a bandpass filter in the frequency domain. The scaling factor entails a change of the width of the analyzing function, thus making possible the analysis of both high-frequency and low-frequency components of a signal with good resolution. Usually, the mother wavelet is normalized to have a unit norm [219]. The wavelet transform decomposes the input function over a set of scaled and translated versions of the mother wavelet. The Fourier transform of the daughter wavelet ψa,b(x) is
$$\hat{\psi}_{a,b}(f_x) = |a|^{1/2}\,\hat{\psi}(af_x)\,\exp(-jbf_x) \quad (5.47)$$
where ψ̂(fx) is the Fourier transform of the mother wavelet. The wavelet transform Φ(a, b), plotted in the spatial coordinate/spatial frequency space, gives information about the frequency content, proportional to 1/a, at a given position b. In the case of FPs, the translation parameter b follows the natural sampling of the FP given by the pixel number, b → n, n = 0, . . . , N, where N is the total number of pixels. The parameter a ∈ [amin, amax] is usually discretized applying a log sampling a = 2^m, where m is an integer. Finer sampling is given by [219, 224]:

$$\psi_{\nu n}(x) = 2^{-(\nu-1)/N_\nu}\,\psi\!\left(2^{-(\nu-1)/N_\nu}x - n\right), \quad \nu = 1,\ldots,N_\nu \quad (5.48)$$
where the fractional powers of 2 are known as voices and the spatial coordinate x is also given in units of the pixel number. One should distinguish between the CWT and the discrete wavelet transform, which employs a dyadic grid and orthonormal wavelet basis functions and exhibits zero redundancy. The modulus of the CWT, |Φ(a, b)|², is the measure of a local energy density in the x–fx space. The energy of the wavelet ψa,b(x) in the x–fx space is concentrated in the so-called Heisenberg box centered at (b, η/a) with lengths aΛx and Λf/a along the spatial and frequency axes respectively, where $\int_{-\infty}^{\infty} x^2|\psi(x)|^2\,dx = \Lambda_x^2$, $\frac{1}{2\pi}\int_0^{\infty}(f_x - \eta)^2|\hat{\psi}(f_x)|^2\,df_x = \Lambda_f^2$ and $\frac{1}{2\pi}\int_0^{\infty} f_x|\hat{\psi}(f_x)|^2\,df_x = \eta$. The area of the box, ΛxΛf, remains constant. The plot of I(x) = IB(x) + IV(x) cos[2πf0x(x)x + ϕ(x)] as a function of position and frequency is called a scalogram [215, 216]. The huge amount of information contained in the CWT Φ(a, b) can be made more condensed if one considers the local extrema of Φ(a, b). Two definitions of CWT maxima are widely used:

i) wavelet ridges, used for determination of the instantaneous frequency and defined as

$$\frac{d\left(|\Phi(a,b)|^2/a\right)}{da} = 0; \quad (5.49)$$

ii) wavelet modulus maxima, used to localize singularities in the signals and defined as

$$\frac{d|\Phi(a,b)|^2}{db} = 0. \quad (5.50)$$

The choice of a proper analyzing wavelet is crucial for the effective processing of the FPs. The wavelet most frequently used in interferometry and profilometry is the truncated form of the Morlet wavelet, which is a plane wave modulated by a Gaussian envelope:

$$\psi(x) = \pi^{-1/4}\exp(j\omega_0 x)\exp\!\left(-\frac{x^2}{2}\right) \quad (5.51)$$
5 Pattern Projection Profilometry for 3D Coordinates Measurement
129
It is well-suited for processing of pure sinusoids or modulated sinusoids; ω0 is the central frequency. The correction term exp(−ω20 /2) which is introduced in the complete form of the Morlet wavelet ψ(x) = π−1/4 exp(jω0 x − ω20 /2) exp(−x2 /2) to correct for the non-zero mean of the complex sinusoid is usually neglected as vanishing at high values of ω0 . Usually ω0 > 5 is chosen ˆ (fx = 0) ≈ 0 [221], which means that the Morlet wavelet has five to ensure ψ ‘significant’ oscillations within a Gaussian window. The Morlet wavelet provides high frequency resolution. The Fourier transform and energy spectrum of the Morlet wavelet are √ √ √ (fx − ω0 )2 4 ˆ (fx ) = 2 π exp − ˆ (fx )|2 = 2 π exp −(fx − ω0 )2 ψ and |ψ 2 (5.52) For complex or analytic wavelets the Fourier transform is real and vanishes for negative frequencies. So, the Morlet wavelet removes the negative frequencies and avoids the zero-order contribution [221]. The Morlet wavelet produces a bandpass linear filtering around the frequency ω0 /a. Two other wavelets used in fringe analysis are the Mexican hat wavelet [216] and the Paul wavelet of order n [225]. The apparatus of the wavelet ridges can be used for phase demodulation of FPs, if the analyzing wavelet is constructed as ψ(x) = g(x) exp(jω0 x), where g(x) is a symmetric window function and ω0 > 2Λf , i.e. ψ(x) practically rejects negative frequencies. The CWT of the AC component of one row or column of the FP I(x,y) takes a form: Φ(a, b) =a
−1/2
∞ IV (x) cos ϕ(x)g
−∞
x−b a
ω 0 exp −j (x − b) a
dx =Z(ϕ) + Z(−ϕ)
(5.53)
where a−1/2 Z(ϕ) = 2
∞ IV (x + b) exp[jϕ(x + b)]g −∞
ω0 ! x! exp −j x dx a a
(5.54)
It is difficult to solve analytically the integral in (5.54), but in the case of small variation of visibility and phase of the fringes over the support of the analyzing wavelet ψa,b , we can use a Taylor series expansion of IV (x) and ϕ(x) to the first order to simplify (5.54). IV (x + b) ≈ IV (b) + xIV (b), ϕ(x + b) ≈ ϕ(b) + xϕ (b)
(5.55)
where IV (b) and ϕ (b) are the first order derivatives of IV and ϕ(b) with respect to x. Taking in view the symmetric character of g(x) and the condition ω0 > 2Λf , one obtains [216]:
130
E. Stoykova et al.
√ ω a 0 IV (b)ˆ − ϕ (b) exp[jϕ(b)] Φ(a, b) ≈ Z(ϕ) = g a 2 a
(5.56)
To make contributions of the second order terms negligible, the following nonequalities should be fulfilled for the second order derivatives [216]: ω02 |IV (b)| 2 |ϕ (b)| << 1 and ω << 1 0 |ϕ (b)|2 |IV (b)| |ϕ (b)|2
(5.57)
As it can be seen from the expression (5.56), there are two possible ways to determine the phase from the CWT behaviour on the ridge: i) from the rescaled scalogram 2 |Φ(a, b)|2 I 2 (b) ω0 = V g a − ϕ (b) ˆ a 4 a
(5.58)
which for the points on the ridge yields directly the instantaneous fre 2 quency ω0 a−1 R (b) = ϕ (b), where aR is the value that maximizes |Φ(a, b)| ; ii) from the phase of the CWT on the ridge √ aR Φ(aR , b) ≈ Z(ϕ) = IV (b) exp[jϕ(b)] (5.59) 2 In the gradient-based algorithm the phase is calculated by integration and no phase unwrapping is needed. In (5.59) the phase is determined modulo 2π. Obviously in the case of open fringes the phase of the ridge is exactly the same as the phase of the signal [226]. For the closed fringes the sign of the phase gradient should be determined independently, e.g. by employing PS technique [227]. The ridge extraction is correct only for the so-called analytic asymptotic limit of the intensity signal with respect to the width of the analyzing wavelet which is a severe limitation on the CWT method, especially if objects with cracks, holes, or such that introduce slow variations in the phase are to be evaluated [221]. Analysis made in [221] reveals that the CWT method exhibits high accuracy at large phase gradients whereas slow variations of the phase are not so well localized. Improvement of accuracy requires to adopt the higher spatial fringe frequency direction [221]. The gradient-based method needs also knowledge in advance of the sign of the first derivative of the phase, and also of the phase itself over a set of points. Along the ridge of the 3-D surface Φ(a, b) the true input signal is efficiently captured even when it is strongly contaminated with noise. The strong noise suppression is intrinsic for the ridge extraction procedure. In other words, the phase gradient in the vicinity of the pixel b is proportional to 1/aR . The ridge points can be found by a standard procedure applied to determine the maximum of a function or by more sophisticated algorithms to select accurately the ridge in noisy FPs. The direct approach when the maximum of Φ(a, b) is searched as the highest magnitude at each value of b works well at high SNRs. In [228] ridge detection is based on minimization by Monte-Carlo type methods of a penalty function
5 Pattern Projection Profilometry for 3D Coordinates Measurement
131
on the set of all possible ridge curve candidates taking into account a priori information of the signal and the noise. The procedure is shown to be robust to additive white noise. In [229] a cost function is built for the adaptive selection of the ridge: ∂ϕ(b) 2 2 db cos t[ϕ(b), b] = −C0 |Φ[ϕ(b), b]| db + C1 (5.60) ∂b b
b
where ϕ(b) is a parameter curve of b; C0 , C1 are the positive weight coefficients. The cost function is small for signals with large magnitude and a smooth parameter curve. In [220] the CWT is applied to the rows of a 2D interferogram using a Morlet analyzing wavelet. Each row after being extended by interpolation at its both edges to avoid discontinuities is processed separately from the other rows, and the phase is retrieved by integration from the phase gradient up to an additive constant. To find these constants, the CWT is applied also to the diagonal of the FP and the heights of the rows are adjusted correspondingly. Due to integration of the phase gradient, no phase unwrapping is required. Phase evaluation by integration of the phase gradient with a Paul wavelet is proposed in [230]. The efficiency of this analyzing wavelet is checked by means of simulated FPs which are processed row by row with extension of each row on both sides by zero padding. Reference [231] justifies CWT application in fringe profilometry with crossed-optical geometry. It is shown that the modulated phase Δϕ(x, y) = ϕ(x, y) − ϕr (x, y) can be obtained as a difference Δϕ(x, y) = ϕ(x, y) − ϕr (x, y) = ϕ(aR , b) − ϕr (arR , b)
(5.61)
where arR is the scale factor at the ridge of FP on the reference plane at every position b. The simulations and experiment prove that the CWT phase retrieval with the Morlet wavelet overcomes the limitation imposed by the FT profilometry on the height of the investigated objects. The same wavelet is used in [232] where the authors make comparison between the gradient-based CWT, phase-based CWT and FT profilometry. A technique which is totally different from the gradient-based and phasebased CWT phase demodulation of carrier fringe patterns is presented in [233]. It implies that the spatial carrier frequency is much larger than expansion of the spectrum associated with the height variation of the object. The basic idea is to isolate the carrier frequency from the spectrum of ϕ(x, y) on the frequency scale. For the purpose, the CCD-recorded spatial carrier FP is modulated by two sinusoidal waves generated at the carrier frequency fx0 and phase-shifted at π/2: Ms (x, y) = IB (x, y) sin(2πf0x ) +IV (x, y) {cos[4πf0x x + ϕ(x, y)] − sin ϕ(x, y)}
(5.62)
132
E. Stoykova et al.
Mc (x, y) = IB (x, y) cos(2πf0x ) +IV (x, y) {cos[4πf0x x + ϕ(x, y)] + cos ϕ(x, y)}
(5.63)
As it can be seen, the spectra of the terms IV (x, y) sin ϕ(x, y) and IV (x, y) cos ϕ(x, y) are the lowest on the frequency scale. Now the object spectrum can be separated from all other contributions in (5.62) and (5.63) by the CWT using as an analyzing wavelet the scaling function which puts the spectra of IV (x, y) sin ϕ(x, y) and IV (x, y) cos ϕ(x, y) in the approximation band. This method requires phase unwrapping. The shortcomings of the method are the necessity to know the carrier frequency and the constraint imposed on the range of object phase variation. In [234] the CWT is applied to remove irregular localized noise in a set of low-SNR phase-shifted moir´e interferograms which are typical for measurements in which the physical change is of the order of the measurement sensitivity. The 1D CWT is used for preprocessing of the phase-shifted patterns recorded using a four-step algorithm. It is shown that the ridge extraction of the CWT yields information about the local spatial frequencies at each position of the processed FP whereas the spurious frequencies connected with the noise have negligible contribution in the ridge formation. Thus, restoration of the local frequency map from the ridge detection permits to regenerate denoised FPs. Four-step PS interferometry and CWT are combined in [235] for phase demodulation with increased accuracy in moir´e interferometry with noisy FPs. To avoid the error in calculation of the CWT that arises from the finite length of the data sequences, the authors propose to introduce computer-generated carrier phase in a way that the modified input sequence contains enough spatial periods to neglect the finite length of the data array. Application of the CWT to the measurement of in-plane displacements by the digital speckle pattern interferometry (DSPI) and out-of-plane deflection by the projecting moir´e fringes has been recently reported for demodulation of fringes with non-uniform carrier frequency distribution over the image [236]. Considering a 1-D FP I(x) = IB (x) + IV (x) cos[2πf0x (x)x + ϕ(x)] with a non-uniform carrier f0x (x) and representing the phase ϕ(x) as a Taylor series near the point of interest b up to the linear term ϕ(x) ≈ ϕ(b) + ϕ (b)(x − b) on a limited support [b − as, b + as] for a Morlet analyzing mother wavelet with a support [−s, s], it is obtained [236]: ! √ ω2 Φ(a, b) = 2πIB (b) exp − 20 ! (5.64) √ 2 + 22π IB (b)IV (b) exp {jΩ(b)} exp − a2 Θ where Θ = Θ(a, b, ω0 ) and the phase Ω(b) = 2πf (b)b + ϕ(b) of Φ(a, b) contains information about the modulation phase ϕ(x) at any point b for a fixed value of a. The obtained expression permits to derive the phase change between two states of the object under condition that local value of the carrier frequency
5 Pattern Projection Profilometry for 3D Coordinates Measurement
133
is kept unchanged. Since the support window depends on the scaling parameter a, a statistical procedure is developed to derive the phase searching in the whole scaling range of the wavelet transform. A similar idea is proposed in [237] for elimination of phase distortion in phase-shifted fringe projection method caused by declination of the projected FP from the ideal sinusoidal form. By differentiating the above expression for Φ(a, b), the phase value Ω(b) is evaluated from # " Im[Φ(amax , b)] −1 Ω(b) = tan = 2πfx b + ϕ(b) (5.65) Re[Φ(amax , b)] where amax is obtained from dΦ(a, b)/da = 0. The described procedure for phase extraction requires phase unwrapping. Application of CWT for localization of defects in FPs in non-destructive testing is described in [238]. A new approach for micro-range distance measurement that makes use of moir´e effect and wavelet representation is described in [239]. The 1D wavelet analysis with a Mexican hat wavelet is used to determine the pitch of the moir´e pattern. In [226] a correlation is observed between a FP and its wavelet map. On the basis of this observation an algorithm is presented for reconstruction of lines of interference fringes, which, however, is effective only for vertical fringes. In all modifications of the CWT approach the 2-D FPs are simplified to 1D patterns [233]; therefore their implementation requires long computation times. At the same time, none of these modifications exhibit results that are superior to other widely used fringe processing techniques. 5.2.4.2 Windowed Fourier Transform Method The 1D WFT and inverse WFT of the function ζ(x) can be written in a form: ∞ ζ(x)g(x − u) exp(−jfx x)dx
Z(u, fx ) = −∞
1 ζ(x) = 2π
(5.66)
∞ ∞ Z(u, fx )g(x − u) exp(jfx x)dfx du
(5.67)
−∞ −∞
where Z(u, fx ) is the WFT spectrum. As it can be seen, the WFT is similar to the FT except for the symmetrical window function g(x). The WFT kernel is obtained by translation of the window at u and by modulation at frequency fx . The WFT processes the signal mainly in the local area defined by the extent of the window and signals that are separated by a distance greater than the window width do not influence each other. Thus the WFT spectrum gives information not only about contribution of different spectral components but also about where in the signal domain they occur. The resolution limit or the smallest Heisenberg box [215, 240] is achieved for a Gaussian window.
134
E. Stoykova et al.
The WFT with a Gaussian window is usually called the Gabor transform. The extension of the 1D WFT to the 2D is straightforward. Introduction of the window restricts the processed signal area and simplifies interpretation of the spectrum [214] which may consist of a single peak in the considered local area. Another advantage of the WFT can be effective noise reduction by implementation of a threshold that simply cuts off the low noise spectral amplitudes which are spread all over the frequency domain. Similar to CWT, these particular features of the WFT put foundations for two processing approaches known as the WF ridges method and the WF filtering method [214]. The Windowed Fourier filtering can be applied to phase-shifted patterns as well as to a single pattern with carrier fringes [241]. In the case of a four-step algorithm with π/2 phase step, the four recorded patterns can be combined to form the following complex signal [214, 240]: IP S (x) =
1 [I1 (x) − I3 (x) + jI4 (x) − jI2 (x)] = IV (x) exp[jϕ(x)] 2
(5.68)
that undergoes the WFT filtering. In the case of a carrier FPs the signal itself is formed of a background and two exponential functions. By using the definition of the WFT, filtering of (5.68) or of a carrier FP is described with the following expression [214]: 1 ¯ I(x) = 2π
b I(x) ⊗ g(x) exp(jfx x) ⊗ g(x) exp(jfx x) dfx
(5.69)
a
where ⊗ denotes a convolution with respect to the variable x, I(x) ⊗ g(x) exp(jfx x) denotes that the threshold has been applied, and all spectrum parts that are below the threshold are set to zero. By setting the integration limits a and b only the desired part of the spectrum is processed interactively. In the case of phase-shifted images one chooses a < 0 and b > 0 to include both negative and positive frequencies of the FP. For carrier FPs by choosing b > a > 0, rejection of the background and the negative frequencies is accomplished. The WFT noise filtering outperforms the conventional FT filtering [242]. However, the need to determine the threshold and the limits a and b from the recorded FPs is a serious shortcoming. To discretize optimally the continuous WFT, the authors of [242] apply frame theory to form a tight-windowed Fourier frame. It is shown by computer simulation that the tight frame is achieved if sampling intervals in the frequency domain are chosen inversely proportional to the spatial extensions of the 2D Gaussian kernel. The noise reduction achieved by the WF frame is better in comparison with the results obtained with the orthogonal wavelet transform. The windowed Fourier ridges method relies on the assumption [240] that both IB (x) and IV (x) and the first derivative of the phase ϕ (x) = dϕ(x)/dx
5 Pattern Projection Profilometry for 3D Coordinates Measurement
135
are slowly varying functions over the window extent. It can be shown that in this case the WFT consists of 3 terms [240]: Z(u, fx ) = Z(IB ) + Z(ϕ) + Z(−ϕ)
(5.70)
Z(IB ) = IB (u) exp(−jufx )ˆ g (fx ) 1 Z(±ϕ) = IV (u) exp {±j[ϕ(u) ∓ ufx ]} gˆ[fx ∓ ϕ (u)] 2
(5.71)
where we have
(5.72)
and the circumflex denotes the Fourier transform. The three terms of the WFT are separated in the frequency domain if ϕ (u) > Λf , i.e. if the fringe density is rather high. In this case the terms Z(IB ) and Z(−ϕ) are negligible on the ridge of the transform at fx = ϕ (u). This condition can be satisfied by introduction of carrier fringes. The carrier fringes are not necessary if the WFT is combined with the four-step PS technique [240]: Ii = IBi + IV i cos[ϕ + (i − 1)π/2], i = 1, 2, 3, 4
(5.73)
It is easy to show that [240] Z1 − Z3 = 2Z(ϕ) + 2Z(−ϕ), Z4 − Z2 =
2 2 Z(ϕ) + Z(−ϕ) j j
(5.74)
where Zi is the WFT of the pattern Ii . Obviously, Z(ϕ) can be readily computed from Z(ϕ) = 0.25(Z1 − Z3 + jZ4 − jZ2 )
(5.75)
even if ϕ (u) > Λf is not fulfilled but at the expense of recording four FPs instead of one. Capability of the WFT to localize abrupt changes in the fringe density or local frequency makes it a suitable algorithm for fault detection [232] and condition monitoring in optical non-destructive testing. The WFT combines insensitivity to noise of the FFA and sensitivity to local changes of crosscorrelation methods. Monitoring of local frequencies gives information about the FP evolution. The WFT or Gabor filtering is applied in [243] for effective removal of the spatial carrier in the case of real FPs, which could be hardly expected to have parallel and equally spaced spatial carrier fringes. To ensure effective phase demodulation in such cases, the method developed in [243] performs localized matching filtration with Gabor filters that form a specially constructed multi-channel Gabor spatial filter set which spans the total variation range of the carrier frequency f0 = (fx0 , fy0 ) = f0 (x, y). By comparing the outputs of the filters at each point (x, y), a value of the central frequency of the filter with the maximum output is assigned to f0 (x, y). However, the underlying theory puts the constraint on the analyzed phase field which should vary so slowly that its spectrum to be entirely covered by the
136
E. Stoykova et al.
frequency band of each of the Gabor filters. Strain contouring using a set of Gabor filters is described also in [244]. In [245] a dilating Gabor transform is introduced by using a Gaussian function with a changeable window in order to improve the WFT efficiency when processing FPs with a spectral content that strongly varies across the FP. It is shown that the Fourier transform spectrum can be represented as a sum of the spectra obtained by the Gabor transform at different locations of the FP. 5.2.5 Single Frame Methods As we have seen, phase demodulation is a relatively simple procedure for carrier-frequency FPs or for open fringes. However, introduction of carrier fringes in real time by technical means is a complicated task and restricts the spectrum of the signal that may be recovered. In many cases one encounters the problem to process wideband FPs without frequency that is dominant throughout the FP. Such are closed FPs in which the phase experiences nomonotonous change or FPs, in which the signal is noise dependent [246]. A frequently met problem is phase demodulation from patterns with partial-field fringes, in which the FP is available in a subregion of the image. The full-field methods as the Fourier transform applied to such a FP leads to artefacts at the borders. Suitable for the closed FPs is the PS technique but at the expense of acquisition of several frames which is unacceptable for time-varying scene capture. This motivates concentration of efforts on the development of phase demodulation techniques from a single FP which in general may consist of closed fringes. From the mathematical point of view phase demodulation of a single FP is an ill-posed problem because of the inherent sign ambiguity [140]. The ϕ1 (x, y) = (x2 + y 2 ), ϕ2 (x, y) = −(x2 + y 2 ), ϕ3 (x, y) = phase2 distributions 2 W (x + y ), ϕ (x, y) = (x2 + y 2 ) at x ≤ 0 and −(x2 + y 2 ) at x > 0 create 4 the same FP [30, 247], as follows from cos ϕ1 = cos ϕ3 = cos ϕ3 = cos ϕ4 ; W(.) is a phase wrapping operator. This makes impossible derivation of a unique solution from the observed data without introduction of prior constraints in the demodulation algorithm [140, 247]. Filtering an image or phase unwrapping of noisy images are also ill-posed problems due to unknown information near the borders of the filter and noise-generated inconsistencies [140]. A powerful tool for solution of such ill-posed problems is Bayesian estimation theory. The estimate is sought as the minimizer of a cost function [140] which contains data terms derived from the likelihood function and from the prior model. If the task is to estimate the function f (r) on the nodes of a regular lattice L from the observation data I(r) = Af (r) + N (r), available in r ∈ S, where A is a noninvertible operator, N is a random Gaussian field with variance σ2 , and S is the subset of L, the likelihood of I(r) is given by [248]: $ % 1 2 2 exp − (5.76) [Af (r) − I(r)] /2σ P I|f (f ) = PN (Af − I) = K r∈S
5 Pattern Projection Profilometry for 3D Coordinates Measurement
137
where K is a normalization constant. A computationally efficient algorithm is possible if the prior constraints can be expressed as interactions of neighbouring pixels, i.e. if a Markov random-field model is used. Markovian prior distribution of & f (r) is given by the Gibbs distribution Pf (f ) = exp − VC (f ) K , where K is a normalization constant and C
the potential functions VC (f ) describe the behaviour of f (r) on a set of cliques that form a neighbourhood system in L. A clique comprises one or more sites in L with any two of them being neighbours of each other. The maximum a posteriori estimator can be built from the posteriori distribution P f |I (f ) = P I|f (f )Pf (f ) of f (r) as a minimizer of the functional [248]: Φf,g (r) + ε VC (fˆ) (5.77) U (fˆ) = r∈S
C
where the function Φf,g (r) depends on the observation data and the noise model. The regularization parameter ε depends on the noise variance. The cost function (5.77) consists of data terms that ensure solution consistent with the observations and of regularization terms which take into the account some properties of the estimate. Following the natural expectation, a frequently applied constraint in the phase recovery problem is the smoothness of the phase field throughout the FP. In the case of globally smooth f (r) the popular Markov random field models are the first-order (membrane) field described by the potential function Vij (f ) = [f (i) − f (j)]2 between pairs of the nearest-neighbour sites, the second-order (thin-plate) model with the potential function Vijk (f ) = [−f (i) + 2f (j) − f (k)]2 for three neighbouring sites, lying on a line, and the potential function associated with the sites in the corners of a rectangular Vijkl (f ) = [−f (i) + f (j) − f (k) + f (l)]2 , where f (i) ≡ f (iΔx,y ). The maximum a posteriori estimate obtained as a result of minimization of (5.77) is equivalent to a low-pass linear filter acting on the pattern I(r). Such a filter has the advantage to be independent on boundary conditions thus making possible to process FPs with irregular shapes, to enable recovery of missing data and to interpolate data between the lattice nodes. The regularized approach could also be applied for creation of robust non-linear filters or of quadrature filters (QFs) [249]. The frequency response of a linear ) with ω = 2πf, e.g. a Gaussian QF is described by a window function gˆ(ω function as in the case of a Gabor filter, centered at a given carrier frequency 0 = 2πf0 which exceeds the spread of gˆ(ω ) on the frequency axis. In the spaω tial domain this QF has a complex impulse response whose real and imaginary 0 · r) and g(r) sin(ω 0 · r) respectively are connected through a parts g(r) cos(ω ). So, if Hilbert transform, where g(r) is the inverse Fourier transform of gˆ(ω ˜ r ) = IV (r) cos[ω 0 · r + ϕ(r)], in which the phase change a QF is applied to I( 0 · r and the filter window in the frequency ϕ(r) is small in comparison with ω domain covers the spectrum of ϕ(r), the output of filter tuned at f0 gives the ˜ r ) = 1 IV (r) exp{j[ω 0 ·r +ϕ(r)]}. Determination of ϕ(r ) from complex signal I( 2
138
E. Stoykova et al.
0 · r is fulfilled. In this signal is straightforward if the requirement ϕ(r) << ω ˜ general, to apply the QF to I(r) when the latter condition is not valid, one = 2πf at the point r to needs to know the sign of the spatial frequency ω correct the sign of the Hilbert transform output [250]. To extend the QF method for wideband or noisy FPs, in [251] it is proposed to apply the QF adaptively to a local part of the FP where it can be written in the form: ˜ r ) = IV (r ) cos[ϕ(r)] = IV (r) cos[ω (r ) · r + ϕ ˜ (r)] I(
(5.78)
Such an adaptive QF can be designed using the Bayesian estimation theory, with Markov random fields as prior models, under the additional constraints (r) is a that the FP is locally monochromatic and the dominant frequency ω smooth function across the FP. The idea is to build a complex image Ω(r) = (r). This ΩRe (r) + jΩIm (r) as an output of the QF with a tunable frequency ω complex image is subject to the following constraints [246]: (r ) · r + ϕ ˜ (r)] is locally monochroi) the real part ΩRe (r) = I˜V (r) cos[ω ˜ (r)| << ω(r) · r; this constraint means that in the matic, where |ϕ neighbouring points s = (x − 1, y) or s = (x, y − 1) one may write (r) · s + ϕ ˜ (r)]; ΩRe (r) ≈ I˜V (r) cos[ω ii) the imaginary part should approximate the corresponding quadrature im (r) · r + ϕ ˜ (r)]; age, i.e. ΩIm (r) ≈ I˜V (r ) sin[ω ˜ r ) ∝ I(r). iii) the real part must be proportional to the observed FP, i.e. I( Thus the phase ϕ(r) is determined from Ω(r). To fulfil the above constraints, the output of the filter, Ω(r), is constructed to minimize the following cost function [246]:
2
|Ω(r) − Ω(s) − 2[I(r) − I(s)]| +
( r, s)∈S
U (Ω) = +ε
(r) · (r − s)] |Ω(r) exp[−0.5j ω
(5.79)
( r, s)∈S
(s) · (s − r)]| − Ω(s) exp[−0.5j ω
2
where (r, s) ∈ S denotes that all nearest-neighbour pairs of sites r and s are included in the sums; S is the region with available data, which in general may have an irregular shape. Each point r = (x, y) away from the borders has four nearest neighbours, which are the points (x − 1, y), (x + 1, y), (x, y − 1) and (x, y + 1). It is obvious that the first sum in the cost function controls the resemblance between the observed FP and the constructed complex image Ω(r). The value of the second sum is vanishing if the constraints i) and ii) are implemented. The parameter ε controls the spectral properties of the filter. A large value of ε produces a narrowband QF, which can effectively remove a signal-dependent noise [246], as is the noise caused by contrast variations or the speckle noise, without distortions of the signal. As the local frequency
5 Pattern Projection Profilometry for 3D Coordinates Measurement
139
(r) is also unknown, the cost function (5.79) must be modified to incorporate ω (r), as is proposed in [248]: constraints set on the field ω ⎡ ⎤ )+ )⎦ U (Ω) = U (Ω) + μ ⎣ Vijk (ω Vijkl (ω (5.80) [i,j,k]
[i,j,k,l]
The minimization method proposed in [248] and characterized by the authors as computationally expensive and applicable only to low fringe denˆ (r) = sities is further improved in [246]. The idea is to build an estimate ω [ρ(r) cos θ(r), ρ(r) sin θ(r)] by sequentially applying a regularization procedure to estimate the frequency vector orientation, θ(r), the frequency sign and the fringe density ρ(r) at each point of the FP. The drawback of this approach is the necessity to minimize several cost functions which is a time-consuming procedure. The fringe orientation is given by the direction: [cos θ(r), sin θ(r)] · ∇ϕ(r) = Θ(r) · ∇ϕ(r) = 0 The fringe orientation at site r is determined with π ambiguity: + , ˜ r )/∂x π −1 ∂ I( θ(r) = tan ± ˜ 2 ∂ I(r)/∂y
(5.81)
(5.82)
A n-dimensional quadrature transform is derived in [250] from the approximate equation ∇I(r ) · ∇ϕ(r) ∼ = −IV (r) sin[ϕ(r)] |∇ϕ(r)|
2
(5.83)
which is valid for a slowly varying contrast function IV (r) and becomes exact if the contrast of fringes is constant across the pattern. The QF is n-dimensional because it can be applied in the case of r(x1 , x2 , . . . , xn ) in the form Qn {IV (r) cos[ϕ(r)]} = nϕ (r) · ∇I(r) |∇ϕ(r)|
−1
(5.84)
where nϕ (r) is the unit vector normal to the isophase contour at point r. The performance of Q2 for processing of closed FPs is compared to the vortex operator developed in [229]. Both algorithms proposed in [229] and [250] rely on two operators – an isotropic 2D Hilbert transform, and an operator which gives the orientation 2π of the fringes. Determination of the fringe orientation is more difficult task and requires a sequential approach. A regularized estimator for determination of an orientational vector field nϕ (r ) through minimization of a cost function is proposed in [252]. It has been demonstrated in [253] that information of fringe orientation can be derived from the local gradients of the fringe intensity in a normalized FP obtained after a suitable bandpass derivative filtering. To increase the accuracy of θ(r) estimation, especially in a low modulation zones, the authors apply neighbouring-direction averaging.
140
E. Stoykova et al.
An algorithm for accurate θ(r) extraction with four different derivative kernels is presented in [254]. Phase demodulation from a single FP based on phase-locked loop (PLL) has been proposed in a series of works [10, 255, 256]. The advantage of this technique is that the phase unwrapping is implicit in the PLL and, therefore, is not necessary. The principle of the PLL is well known since 1932 [257]. The output of the phase detector measures the difference between the phase of a discrete input signal and the phase of a digital controlled oscillator (DCO). From the output of the phase detector the digital filter produces a control signal which is fed into the DCO. The control signal changes the frequency of the DCO in a way to decrease the phase difference between the input signal and the DCO. If the control signal is equal to zero, the DCO generates a signal at a constant frequency which is called a free-running frequency of the DCO. In the case of phase demodulation, the PLL system locks and tracks the phase-modulated signal. The running frequency of the DCO is equal to the spatial carrier frequency of the fringes. In tracking state, the frequency of the DCO approaches the instantaneous frequency and is proportional to the control signal from the digital filter. Ideally, the control signal is a replica of the derivative of the modulating signal. A zero-order digital filter (a filter with only a proportional path) leads to a first-order digital PLL system, whereas the filter with a first-order infinite impulse response gives a second-order PLL [258]. The PLL can be applied only to open fringes. To achieve good results with the PLL, the background intensity distribution must be strongly attenuated or removed by a high-pass filter, e.g. by differentiation of the intensity distribution with respect to the x coordinate. The second assumption of the PLL method is that the visibility of fringes is constant and one may set it ˜ equal to 1.0. If applied to a row in a FP, I(x) = IV (x) cos[2πf0 + ϕ(x)], the discretized first-order PLL system which is a nonlinear dynamic system is usually described by [259] ˜ sin[2πf0 + ϕ ˆ (x + 1) = ϕ ˆ (x) + τI(x) ˆ (x)] ϕ
(5.85)
ˆ , in a point (x + 1) is determined from the Therefore, the phase estimate, ϕ ˜ sin[2πf0 + phase estimate in the previous point corrected by the term τI(x) ˆ (x)]. However, to construct the estimate an a priori knowledge about the ϕ spatial carrier frequency is required. In addition, the PLL system is not able to demodulate a low frequency carrier pattern modulated by a wideband signal. It may occur that the estimated phase map will be corrupted by the double fringe frequency. Another serious drawback of the first-order PLL algorithm is its low immunity to noise. To improve noise performance of the algorithm, a second-order PLL algorithm is proposed in [257]. The improved algorithm permits real time implementation; in [257] a frame rate of 25 processed frames per second is reported. However, the higher frequency disturbances of the PLL system itself are still present in the improved algorithm. The drawback of the PLL algorithms is their inability to handle FPs with rapid phase variations.
5 Pattern Projection Profilometry for 3D Coordinates Measurement
141
An idea to solve the phase modulation problem from the regularization point of view is realized also in the so called phase-tracking approach which gives a basis for creation of robust automatic algorithms for phase retrieval from wideband noisy FPs bounded by arbitrarily shaped pupils without any edge distortions. A regularized version of a phase-tracking detecting algorithm which can be employed for open or closed fringes is described in [30]. The regularized phase-tracking (RPT) yields a continuous phase estimate without phase unwrapping. The RPT uses a two terms cost function which implements the assumption of locally spatially monochromatic FP with smooth and continuous phase [30]: ˆ , ωx , ωy ) U= Ux,y (ϕ x,y∈S
Ux,y (ϕ0 , ωx , ωy ) =
˜ η) − cos ϕ (x, y, μ, η)]2 [I(μ, e
(5.86)
μ,η∈(Nxy ∩S)
ˆ (μ, η) − ϕe (x, y, μ, η)]2 m(μ, η) + γ[ϕ
ˆ (x, y) is the phase estimate that minimizes the cost function, S is the where ϕ region occupied by the analyzed FP, Nxy is the neighbourhood of the point of interest (x, y), the term Ux,y is the energy of the RPT system at the site (x, y), m(x,y) is an indicator which takes values of 1 and 0 to mark the previously ˜ y) is obtained from I(x, y) after subtraction of the used pixels. The FP I(x, background IB (x, y) and the normalization procedure IV (x, y) ≈ 1. The term ϕe (x, y, μ, η) is the local phase plane that is used to approximate simultaneously the observed data through a cosinusoidal model and the phase values that have already been estimated: ˆ (x, y) + ω ˆ x (x, y)(x − μ) + ω ˆ y (x, y)(y − η) ϕe (x, y, μ, η) = ϕ
(5.87)
ˆ x,y (.) are the x- and y- components of the local frequency. As it can be where ω seen, the second term in the cost function is small only if the phase estimate is very smooth. The parameter γ controls the smoothness of the phase estimate. Due to the multimodal character of the cost function U, the problem of finding its global minimum is difficult and computationally expensive. To avoid this obstacle, the authors of [30] develop a sequential demodulating algorithm. The phase is calculated by a propagative scheme from pixel to pixel. To start the processing, a seed point (x0 , y0 ) is chosen in S, preferably in the region with ˆ, ω ˆ x, ω ˆ y ) is optimized in the site low-frequency fringes. The function Ux,y (ϕ ˆ (x0 , y0 ) and ω ˆ x,y (x0 , y0 ), and the indicator m(x0 , y0 ) is (x0 , y0 ) by finding ϕ set to 1 to show that this site has been already processed. By using the indicator function the other pixels are processed sequentially, and the first iteration ˆ 1 (x, y) is obtained. The function Ux,y (ϕ ˆ , ωx , ωy ) of the estimated phase map ϕ is optimized in the site (x, y) by using a simple gradient descent. The output of the RPT system gives the estimated phase already unwrapped. The authors investigate also another approach to refine the first iteration estimate.
142
E. Stoykova et al.
The latter can be used as an input to the iteration conditional mode algorithm which is designed to find a maximum a posteriori estimator for images that are ˆ (x, y), ω ˆ x (x, y), ω ˆ y (x, y)]T with modelled as random vector fields ϑ(x, y) = [ϕ posterior Gibbs distributions of the form P (ϑ) = exp[−U (ϑ)]/Z, where Z is a normalization constant and U (ϑ) is an energy function. The algorithm finds a local minimum of U (ϑ) with respect to ϑ at each site (x, y) in a small number of steps. This shows that the RPT system can be considered as an adaptive narrowband filter. This makes the RPT more robust to noise when wideband FPs are processed. The cost function (121) ensures satisfactory results only for low noise closed fringes. To deal with a high noise level, the cost function must include an additional term. In [247] it is proposed to add a term that regularizes the requirement that a slightly phase-shifted FP should resemble the processed pattern. The authors recommend use of a constant phase shift between 0.1π and 0.3π rad; they also propose a new scanning strategy which initially demodulates pixels around the stationary points. The shortcoming of this fringe-follower regularized phase tracking is the need of low-pass filtering and a binary threshold operation. The RPT is a local processing approach [214] in which appropriate cosine elements are fitted to the FP in a local area. Thus spatially separated signals have no influence on each other. It can be used for phase-shifted FPs, a singlecarrier fringes FP, and a single closed fringe FP. In addition, the RPT shows very good behaviour close to the borders. The shortcomings are the necessity to optimize the estimates of the phase and the local frequency simultaneously, as well as the requirement to remove the background and to normalize the FP. In [40] an improved version of the RPT with a more robust minimization algorithm is used to demodulate squared-grating deflectograms. The propagative scheme of phase demodulation of open and closed fringes with introduction of a quality map is described in [260]. To avoid the necessity to normalize the FP, a modified cost function is proposed in [261] which assumes that the fringe modulation is also locally monochromatic and is described by IVe (x, y, μ, η) = IV (x, y) + βx (x, y)(x − μ) + βy (x, y)(y − η)
(5.88)
where βx,y are the local modulation frequencies. The RPT approach is further improved in [259] by combining quadrature estimation with the RPT. The sequential quadrature and phase tracking estimator builds the phase estimate by minimizing the following cost function: ˜ 2 r) ˜ r ) − cos ϕ ˆ (r)]2 + ∂ I( ˆ ˆ U = [I( + ω ( r ) sin ϕ ( r ) x ∂x ˜ 2 ∂ I( r) ˆ y (r) sin ϕ ˆ (r) + ∂y + ω
(5.89)
˜ r )/∂x and ∂ I( ˜ r )/∂y are calculated from the first-order The derivatives ∂ I( differences. The first term of the cost function tends to zero when the estimate ϕ( ˆ r ) is close to ϕ(r). The second term in (5.89) enforces the requirement on the
5 Pattern Projection Profilometry for 3D Coordinates Measurement
143
estimates ϕ( ˆ r ) and ω ˆ x,y (r) to make possible approximation of the quadrature of the signal. With this additional constraint the optimal values of ϕ( ˆ r ) and ω ˆ x,y (r) are found by search along the direction of the steepest descend of ˆ x, ω ˆy, ϕ ˆ are subject to optimization the cost function [259]. The functions ω 0 ∞ ˆ ˆ ˆ ˆ x (ri−1 )∞ , ω ˆ x (ri )0 = with initial conditions ϕ(ri ) = ϕ(ri−1 ) , ωx (ri )0 = ω ∞ ˆ ˆ ˆ ˆ ωx (ri−1 ) , where the index ∞ denotes a stable pair ϕ, (ωx , ωy ) of estimates in the previous estimated site. This sequential approach evaluates the phase without phase unwrapping. The quadrature phase-tracking system can be used to demodulate open fringes without the need to know the carrier frequency, unlike the case in the PLL approach. It outperforms the PLL system also by its ability to process very low frequency fringes without worry of overlapping of low and higher frequency spectra. The quadrature phase tracking can be used to demodulate closed fringe patterns if instead of row-by-row scanning strategy one follows the path of the fringes. The regularized techniques based on the minimization of quadratic functionals have been also applied for phase unwrapping in [217, 249]. In the proposed algorithms the smoothing regularization term serves to control interpolation (or extrapolation) especially in regions with bad data and to reduce the noise. This approach for phase unwrapping is extended to process ESPI images characterized with high level of speckle noise and phase discontinuities [262]. In [263] the authors propose to fit a global non-linear function in each pixel instead of a local plane using the genetic algorithm technique. The approach is checked for a polynomial fitting. Phase demodulation from a single FP with open or closed fringes based on numerical correlation between the measured FP and a virtual FP is developed in [264, 265]. The recorded FP is divided into zones and in each zone the FP is approximated with parallel, inclinable, and equidistant fringes and the correlation function in the zone of interest is minimized with respect to amplitude modulation, background illumination, pitch, fringe orientation, and phase. In [264] the virtual FP is built with a sinusoidal profile whereas in [265] the approach is extended to polynomial fitting which allows for acceleration of the computation. Phase demodulation by means of a non-linear filtering method based on the theory of the Markov stochastic process is developed in [266] as a recurrence procedure under the condition of a correlated phase noise. The recurrent procedure enables real-time processing of noisy data and phase retrieval without unwrapping.
5.3 Capture of Real Objects 5.3.1 Full-field Measurement In a simple pattern projection system only one part of the object surface is viewed both by projector and the image sensor which yields a solid angle of
144
E. Stoykova et al.
about 2π for reliable measurement. Measurement of surfaces with almost vertical structures as e.g. cylindrical surfaces and of front and back sides of a body requires 360 degrees of observation. In addition, effect of shadowing in objects with a strong surface tilt or distortions caused by non-linear recording due to specular reflection and diffraction at the object surface makes impossible observation within these parts of the image. To compensate for the loss of information, systems with multiple directions of illumination or observation are required. One of the main problems that should be solved for accurate performance of such a system is to make precise transformation of the coordinate systems attached to all sensors into a common global coordinate system. To determine accurately the relative orientation of the sensors, they could be fixed to mechanical devices that provide position information very precisely. Such systems are expensive and vulnerable to small angular inaccuracies that may cause large errors in coordinate calculation [267]. The matching of the point clouds obtained as a result of phase demodulation of the FPs recorded by different sensors can be done also numerically, e.g. by optimal fitting which sometimes may lead to ambiguous solutions. To overcome most of these illumination-caused difficulties, a 3D optical sensor with a periodic illumination from at least three different directions using a telecentric projection system is described in [268]. A grating with grey code and with a sinusoidal intensity transmission is used for generation of a structured light pattern. A nearly complete 3D measurement of coordinates is realized for objects with very complex surface profiles by object rotation. The system measures the phase within a number of patches from the object surface. Determination of orientation of the patches in the space permits to match them all in a global coordinate system. In [267] the PS measurement system comprises two cameras and one projector. The cameras are calibrated by using a reference object with a large number of circular targets which is imaged by both cameras at different viewing angles. The global coordinates of the reference object are restored using photogrammetric processing. To solve the problem with shadowing and to build a good estimator of 3D coordinates, we introduce two approaches – one with double symmetrical illumination (DSI) and the other with double symmetrical observation (DSO) of the object [25]. The DSI pattern projection system with an adjustable Michelson interferometer is presented in Fig. 5.17. One of the mirrors of the interferometer is attached to a PLZT for precise control of the phase step with an optoelectronic feedback. The light source is a HeNe laser with λ = 632.8 nm. Vertical interference fringes are divided by a beam splitter into two arms (left – LA and right – RA) at equal intensity and projected onto the specimen. Two series of FPs for a 5-frames phase-stepping algorithm are recorded at two different spacings, d1 and d2 . The angle of object illumination is α in both arms. A Peltier cooled CCD camera with 604 × 288 pixels and 8-bits grey-scale coding captures the deformed FPs. The wrapped phase maps are obtained as a difference between the calculated phase maps corresponding to d1,2 for LA and RA illumination, respectively. The unwrapped phase maps
5 Pattern Projection Profilometry for 3D Coordinates Measurement
145
Fig. 5.17. Block diagram of DSI approach a) projection system b) optical setup for DSI, L – lens; BS – beam splitter; M – mirror; SF – spatial filter; S – shutter, PLZT – phase-stepping device
are obtained with a quality-guided path following method [123]. The pixels estimation is performed by a phase derivative variance algorithm in a 3 × 3 window and a quality map is produced that indicates low-quality regions. After unwrapping, a new phase map is composed, as the shadow zones from the phase map corresponding to one illumination direction are replaced and adjusted with good quality zones from the other map. The DSI approach is applied for examination of a test object (Fig. 5.18) and a real object (Fig. 5.19). The shadow zones are successfully recovered in the composed surfaces. The DSO system is realized with Max-Zhender interferometer. The light source is an Ar+ laser with λ = 488 nm. Again two spacings, d1 and d2 , and 5-frames phase-stepping algorithm are used. Two CCD cameras (resolution 604 × 288 pixels) positioned symmetrically capture the deformed FPs. The part of the interference pattern is reflected on the phase detector for the optoelectronic feedback. The wrapped phase maps are obtained as a difference between the calculated phase maps corresponding to d1,2 for each point of view. The 3D distributions calculated from two observation points are transformed to a common coordinate system (x , y , z ). Figure 5.20 presents the experimental results for the DSO approach. The surface is successfully reconstructed, although some shadow zones are not recovered. The experimental results confirm that the increase of the angle and spacing difference ensure better sensitivity. The choice of a suitable illumination/observation angle is important to prevent partial recovery of the shadow zones and to avoid occurrence of areas that can not be observed with both CCD cameras. The real and the calculated from the measured data 3D coordinates of the test object are compared and the estimated error does not exceed ±3%.
146
E. Stoykova et al.
Fig. 5.18. Wrapped (top) and unwrapped (middle) phase maps and 3D visualization (bottom) for DSI measurement of a test object at d1 = 2.3 mm and d2 = 8.5 mm; shadow zones are masked with black colour. Left – LA direction, right – RA direction
5.3.2 Real-time Measurement The two possible ways to realize a real-time measurement are to develop processing approaches for phase demodulation of a single pattern (single-frame or single-shot acquisition) or to record multiple patterns at high acquisition speed that are processed by the well-known phase-stepping algorithms. As we have seen in the previous sections, during the last two decades, various methods for single frame fringe pattern demodulation have been explored. The straightforward Fourier transform phase demodulation with and without carrier fringes suffers from filtering problems caused by wideband noisy carriers and a limitation on object height variation. In addition, introduction of the carrier frequency which follows the rate of change in the observed object is not a trivial task and may require expensive equipment. The choice of filter parameters is problem dependent and requires preliminary information about the noise and bandwidth of the modulating signals. This jeopardizes automatic processing of the FPs which is one of the main requirements for realization of the capture process in a 3D dynamic display. The alternative spatial analysis
5 Pattern Projection Profilometry for 3D Coordinates Measurement
147
Fig. 5.19. Wrapped (top) and unwrapped (middle) phase maps and 3D visualization (bottom) for DSI measurement of a bronze statuette at d1 = 2 mm and d2 = 7.5 mm; shadow zones are masked with black colour. Left – LA direction, right – RA direction
Fig. 5.20. DSO experimental results of bronze statuette measurements with periods of interference patterns d1 = 2 mm and d2 = 3.8 mm; Shadow zones are masked with black colour. Wrapped phase (top left), unwrapped phase (bottom left) 3D reconstruction (right)
148
E. Stoykova et al.
method for phase retrieval from a single frame as the regularized phase tracking shows high accuracy both for patterns with open and closed fringes and is capable to process noisy images with irregular shape borders. It fits local plane surfaces to the recovered phase which makes unavoidable averaging over several pixels. Due to the fact that it seeks the phase estimate through minimization of a cost function, this approach involves iterative solving of a set of linear equations and is time-consuming. There have been developed other methods as the fitting-error modified spatial fringe modulation [78]; phase demodulation based on fringe skeletonizing when an extreme map is introduced by locating the fringes minima and maxima [187, 269, 270]; phase-stepping recovery of objects by numerical generation of multiple frames from a single recorded frame [271] or by developing a spatial modification based on assumption of slowly varying phase [272]. The main drawback of many of the spatial analysis methods is the inevitable averaging over several pixels in the neighbourhood of the point of interest which hampers investigation of high-frequency content FPs. This stimulates search and development of methods with multiple frames registration and real-time demodulation. A single-shot measurement is realized in [273] by simultaneous projection of three colour patterns (red, green and blue) on the object at different angles and Fourier analysis of the deformed image recorded by a single CCD camera. A phase-stepping method for measuring the 3-D surface profile of a moving object by projection of a sinusoidal grating pattern and continuous intensity acquisition by three phase-shifted linear array sensors positioned along the projected stripes is proposed in [274]. The method is restricted to objects moving at a constant speed. High-resolution 3D measurement of absolute coordinates using three phase-shifted fringe patterns coded with three primary colours and recorded at data acquisition speed of 90 fps is presented in [275]. Optimal intensity-modulation projection technique is proposed in [276] based on optimization procedure for rearranging the intensities in a projection pattern in order to improve detection of the stripe-order and to make possible measurement in real-time. A shadow moir´e system with three TV cameras that is able to measure the shape of an object in a dynamic event is described in [66, 277]. The entries to the three cameras permit to construct a general nonlinear function of the object depth, and using Newton–Raphson method for numerical analysis to find the object profile. To improve the accuracy of the measurement, a new algorithm is proposed that takes into the account the higher harmonics in the projected FP. The PS methodology can be applied for direct detection of the complex amplitude at the image sensor and to reconstruct the 3D image from four holograms that are sequentially recorded using reference waves phase-shifted at 0, π/2, π and 3π/2. The main advantage is the ability to register only the first-order diffracted wave. For real-time reconstruction quasi-PS digital holography is proposed by implementing a spatial division multiplexing technique. For the purpose, the digital hologram is divided into segments of 2 × 2 pixels. The four pixels are phase-shifted at the required phase step by numerical generation and are further extracted and relocated to construct four phase-shifted
5 Pattern Projection Profilometry for 3D Coordinates Measurement
149
holograms. The improved reconstruction scheme of the method is described in [278]. A method for automatic phase extraction from a single pattern with closed noisy fringes based on an arccosine function is developed in [187]. To overcome phase jumps of π and sign ambiguities, an extreme map is attached to the processed FP after adaptive weighted filtering for noise reduction and contrast enhancing. The extreme map indicates positions of fringe peaks and throughs throughout the entire area of the FP. High-speed 3-D surface contouring by DMD projection of a colour-encoded digital FP whose RGB components comprise three phase-shifted at 2π/3 FPs is tested in [279]. Using of a DMD permits to enhance substantially the contrast of the projected pattern. The image deformed by the object is recorded by a colour CCD camera and sent to a computer to separate the RGB components and to create three grey-scale phase-shifted images. The intensities for the red, blue and green channels are recalculated from the recorded intensities to compensate for the coupling between the three channels. With a standard video camera, a contouring speed up to 60 frames/s can be expected. A highspeed system in which a SLM is used to generate FPs with five optimized spacings is described in [280]. To implement a four-frame algorithm 20 FPs are recorded at video rate and processed in real time using a pipeline image processor which ensures measurements of 250,000 coordinates in less than 1 s. However, the pixilated nature of the SLM restricts the measurement accuracy up to 5.10−4 from the object size [1]. A single-shot PMP system based on the PS principle can be realized by simultaneous projection of four phase-shifted at π/2 sinusoidal FPs of equal intensity, contrast and spacing that are generated at four different wavelengths. To simplify the technical solution and to have better stability, we analyze the system realization by using of sinusoidal phase diffraction gratings [27]. The FP generation module consists of 4 blocks (FPG1-4) corresponding to four different wavelengths (λ1 − λ4 ) as is shown in Fig. 5.21 left, where the FPGs are FP Generators, DLs are 20 mW CW single mode diode lasers, G1 – G4 are phase gratings. The diode lasers emit at wavelengths: λ1 = 785 nm, λ2 = 808 nm, λ2 = 830 nm and λ4 = 850 nm. To optimize the optical efficiency of wavelength mixing, the interference mirrors (IM1 – 3) are used as follows: the mirror IM1 transmitting λ1 , λ2 and reflecting λ3 , λ4 ; the mirror IM2 transmitting λ1 and reflecting λ2 and the mirror IM3 transmitting λ3 and reflecting λ4 . The registration module (Fig. 5.21 right) consists of four synchronized CCD cameras. The spectral separation of the individual FPs is provided by a second set of mirrors IM1–3. The precise positioning and adjustments of cameras and optical elements ensure parallel recording of the FPs. The proposed four-wavelength system relies on the independence of the spatial period of the Fresnel diffraction pattern created by a sinusoidal phase grating with transmittance [281] t(x, y) = exp[im sin(2πx/d)], where m is the modulation parameter and d is the spatial grating period along the x axis, on the wavelength. At plane wave illumination, the complex amplitude at distance z behind the grating is a structure periodical along x and z:
150
E. Stoykova et al.
Fig. 5.21. Optical arrangement of the four-wavelengths PMP system: left) FP generation module right) registration module
U (x, y, z) =J0 (m) + 2
∞
{J2q (m) cos[4qπx/d]
q=0
× cos[(2q)2 πλz/d2 ] − i sin[(2q)2 πλz/d2 ] + (5.90) +iJ2q+1 (m) sin[(2q + 1)2πx/d] 2 2 2 2 × cos[(2q + 1) πλz/d ] − i sin[(2q + 1) πλz/d ] The y-axis is parallel to the fringes, Jq is the Bessel function of the order q and λ is the wavelength. By a proper choice of m, the influence of higher diffraction orders could be minimized. Figure 5.22 shows intensity distribution behind the phase grating (z = 0.1 m) at m = 0.3 and d = 0.4 mm as a
Fig. 5.22. Intensity distribution behind a phase grating in a plane parallel to its surface as a function of the wavelength
5 Pattern Projection Profilometry for 3D Coordinates Measurement
151
Fig. 5.23. Fourier spectrum of the intensity distribution in Fig. 27 (the zero-term is excluded)
function of the wavelength whereas Fig. 5.23 depicts the corresponding Fourier spectrum. As it can be seen, the projected FP is practically with an identical sinusoidal profile for the chosen spectral region from 785 nm to 850 nm, i.e the degrading effect of the higher frequency components is overcome. The requirement for close location along z of the Talbot planes corresponding to the four wavelengths is also fulfilled. These features of the used sinusoidal phase gratings are the most important for realization of real time operating PMP system for 3D coordinate measurement of dynamic scenes.
Acknowledgment This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References
1. Chen F, Brown GM, Song M (2000) Overview of three-dimensional shape measurement using optical methods. Opt Eng 39: 10–22
2. Tiziani H-J (1993) Optical techniques for shape measurements. In: Juptner W, Osten W (eds) Fringe’93, Akademie, Berlin, pp 165–174
3. Sainov V, Stoilov G, Tonchev D et al. (1996) Shape and normal displacement measurement of real objects in a wide dynamic range. In: Optical Metrology, Akademie Verlag, pp 52–60
4. Xie H, Liu Z, Fang D et al. (2004) A study on the digital nano-moiré method and its phase shifting technique. Meas Sci Technol 15: 1716–1721
5. Harizanova J (2006) Holographic and digital methods for recording and processing of information for cultural heritage protection. Ph.D. thesis, CLOSPI-BAS
6. Li J, Hassebrook L, Guan C (2003) Optimized two-frequency phase-measuring profilometry light-sensor temporal-noise sensitivity. J Opt Soc Am A 20: 106–115
7. Sansoni G, Corini S, Lazzari S et al. (1997) Three-dimensional imaging based on Gray-code light projection: characterization of the measuring algorithm and development of a measuring system for industrial applications. Appl Opt 36: 4463–4472
8. Pages J, Salvi J, Garcia R et al. (2003) Overview of coded light projection techniques for automatic 3D profiling. In: Proceedings of IEEE International Conference on Robotics & Automation, pp 133–138
9. Xian T, Su X (2001) Area modulation grating for sinusoidal structure illumination on phase-measuring profilometry. Appl Opt 40: 1201–1208
10. Gurov I, Hlubina P, Chugunov V (2003) Evaluation of spectral modulated interferograms using a Fourier transform and the iterative phase-locked loop method. Meas Sci Technol 14: 122–130
11. Baumbach T, Osten W, von Kopylow C et al. (2006) Remote metrology by comparative digital holography. Appl Opt 45: 925–934
12. Su W, Shi K, Liu Z et al. (2005) A large-depth-of-field projected fringe profilometry using supercontinuum light illumination. Opt Express 13: 1025–1032
13. Xue L, Su X (2001) Phase-unwrapping algorithm based on frequency analysis for measurement of a complex object by the phase measuring profilometry method. Appl Opt 40: 1207–1216
14. Schirripa-Spagnolo G, Ambrosini D (2001) Surface contouring by diffractive optical element-based fringe projection. Meas Sci Technol 12: N6–N8
15. Quan C, He XY, Wang CF et al. (2001) Shape measurement of small objects using LCD fringe projection with phase-shifting. Opt Commun 189: 21–29
16. Quan C, Tay CJ, Kang X et al. (2003) Shape measurement by use of liquid-crystal display fringe projection with two-step phase shifting. Appl Opt 42: 2329–2335
17. Huang P, Zhang C, Chiang F-P (2003) High-speed 3-D shape measurement based on digital fringe projection. Opt Eng 42: 163–168
18. Sitnik R, Kujavinska M, Wonznicki J (2002) Digital fringe projection system for large-volume 360 deg shape measurement. Opt Eng 41: 443–449
19. Sitnik R, Kujavinska M (2000) Opto-numerical methods for data acquisition for computer graphics and animation systems. In: Proceedings of SPIE 3958, pp 36–45
20. Saldner H, Huntley J (1997) Profilometry using temporal phase unwrapping and a spatial light modulator-based fringe projector. Opt Eng 36: 610–615
21. Mehta D, Dubey S, Hossain M et al. (2005) Simple multifrequency and phase-shifting fringe-projection system based on two-wavelength lateral shearing interferometry for three-dimensional profilometry. Appl Opt 44: 7515–7521
22. Chen L-C, Huang C-C (2005) Miniaturized 3D surface profilometer using digital fringe projection. Meas Sci Technol 16: 1061–1068
23. Chen L-C, Liao C-C (2005) Calibration of 3D surface profilometry using digital fringe projection. Meas Sci Technol 16: 1554–1566
24. Harizanova J, Sainov V (2006) Three-dimensional profilometry by symmetrical fringes projection technique. Opt Las Eng 44: 1270–1282
25. Harizanova J, Kolev A (2005) Comparative study of fringes generation in two-spacing phase-shifting profilometry. In: Proceedings of SPIE 6252, pp 21–25
26. Sainov V, Stoykova E, Harizanova J (2006) Real time phase stepping pattern projection profilometry. In: Proceedings of SPIE 6341, pp 63410P–63411/63416
27. Quan C, Tay C, Chen L (2005) Fringe-density estimation by continuous wavelet transform. Appl Opt 44: 2359–2365
28. Meadows D, Johnson W, Allen J (1970) Generation of surface contours by moiré patterns. Appl Opt 9: 942–947
29. Dorrio B, Fernandez J (1999) Phase-evaluation methods in whole-field optical measurement techniques. Meas Sci Technol 10: R33–R55
30. Servin M, Marroquin JL, Cuevas FJ (1997) Demodulation of a single interferogram by use of a two-dimensional regularized phase-tracking technique. Appl Opt 36: 4540–4548
31. Wang Z, Ma H (2006) Advanced continuous wavelet transform algorithm for digital interferogram analysis and processing. Opt Eng 45: 045601
32. Yamaguchi I, Yamamoto A, Yano M (2000) Surface topography by wavelength scanning interferometry. Opt Eng 39: 40–46
33. Guo H, He H, Yu Y et al. (2005) Least-squares calibration method for fringe projection profilometry. Opt Eng 44: 033603
34. Skydan O, Lalor M, Burton D (2005) Three-dimensional shape measurement of non-full-field reflective surfaces. Appl Opt 44: 4745–4752
35. Hu Q, Harding K (2007) Conversion from phase map to coordinate: comparison among spatial carrier, Fourier transform, and phase shifting methods. Opt Las Eng 45: 342–348
36. Cuevas FJ, Servin M, Stavroudis O et al. (2000) Multi-layer neural network applied to phase and depth recovery from fringe patterns. Opt Commun 181: 239–259
37. Huang P, Hu Q, Chiang F-P (2003) Error compensation for a three-dimensional shape measurement system. Opt Eng 42: 482–486
38. Yu F, Wang E (1973) Speckle reduction in holography by means of random spatial sampling. Appl Opt 12: 1656–1659
39. Liu H, Lu G, Wu S et al. (1999) Speckle-induced phase error in laser-based phase-shifting projected fringe profilometry. J Opt Soc Am A 16: 1484–1495
40. Berryman F, Pynsent P, Cubillo J (2003) A theoretical comparison of three fringe analysis methods for determining the three-dimensional shape of an object in the presence of noise. Opt Las Eng 39: 35–50
41. Lei Z, Kang Y, Yun D (2004) A numerical comparison of real-time phase-shifting algorithms. Opt Las Eng 42: 395–401
42. Bitou Y (2003) Digital phase-shifting interferometer with an electrically addressed liquid-crystal spatial light modulator. Opt Lett 28: 1576–1578
43. Patil A, Langoju R, Rastogi P (2004) An integral approach to phase shifting interferometry using a super-resolution, frequency estimation method. Opt Express 12: 4681–4697
44. Kreis T (2004) Handbook of Holographic Interferometry, Wiley-VCH GmbH, Weinheim
45. Li W, Su X (2001) Real-time calibration algorithm for phase shifting in phase-measuring profilometry. Opt Eng 40: 761–766
46. Rathjen C (1995) Statistical properties of phase-shift algorithms. J Opt Soc Am A 12: 1997–2008
47. Morgan C (1982) Least-squares estimation in phase-measurement interferometry. Opt Lett 7: 368–370
48. Greivenkamp J (1984) Generalized data reduction for heterodyne interferometry. Opt Eng 23: 350–352
49. Hibino K (1997) Susceptibility of systematic error-compensating algorithms to random noise in phase-shifting interferometry. Appl Opt 36: 2084–2093
50. Ding X, Cloud G, Raju B (2005) Noise tolerance of the improved max-min scanning method for phase determination. Opt Eng 44: 035605–035607
51. Gutmann B, Weber H (1998) Phase-shifter calibration and error detection in phase-shifting applications: a new method. Appl Opt 37: 7624–7631
52. Ishii Y, Chen J, Murata K (1987) Digital phase-measuring interferometry with a tunable laser diode. Opt Lett 12: 233–235
53. Ishii Y (1999) Wavelength-tunable laser-diode interferometer. Opt Rev 6: 273–283
54. Zhang C, Huang PS, Chiang F (2002) Microscopic phase-shifting profilometry based on digital micromirror device technology. Appl Opt 41: 5896–5904
55. Creath K (1988) Phase-measurement interferometry techniques. Prog Opt 26: 349–393
56. Schwider J (1990) Advanced evaluation techniques in interferometry. Prog Opt 28: 271–359
57. Van Wingerden J, Frankena HJ, Smorenburg C (1991) Linear approximation for measurement errors in phase shifting interferometry. Appl Opt 30: 2718–2729
58. Ahmad F, Lozovskiy V, Castellane R (2005) Interferometric phase estimation through a feedback loop technique. Opt Commun 251: 51–58
59. Li Y, Zhu Z, Li X (2005) Elimination of reference phase errors in phase-shifting interferometry. Meas Sci Technol 16: 1335–1340
60. Schwider J, Dresel T, Manzke B (1999) Some considerations of reduction of reference phase error in phase stepping interferometry. Appl Opt 38: 655–659
61. Arai Y, Yokozeki S (1999) Improvement of measurement accuracy in shadow moiré by considering the influence of harmonics in the moiré profile. Appl Opt 38: 3503–3507
62. Koliopoulos C (1981) Interferometric optical phase measurement techniques. Ph.D. Thesis, University of Arizona, Source: Dissertation Abstracts International, Volume: 42–08, Section: B, p 3319
63. Brophy C (1990) Effect of intensity error correlation on the computed phase of phase-shifting interferometry. J Opt Soc Am A 7: 537–541
64. Zhao B, Surrel Y (1997) Effect of quantization error on the computed phase of phase-shifting measurements. Appl Opt 36: 2070–2075
65. Zhao B (1997) A statistical method for fringe intensity-correlated error in phase-shifting measurement: the effect of quantization error on the N-bucket algorithm. Meas Sci Technol 8: 147–153
66. Surrel Y (1996) Design of algorithms for phase measurements by the use of phase stepping. Appl Opt 35: 51–60
67. Skydan O, Lilley F, Lalor M et al. (2003) Quantization error of CCD cameras and their influence on phase calculation in fringe pattern analysis. Appl Opt 42: 5302–5307
68. Wizinowich P (1990) Phase-shifting interferometry in the presence of vibration: a new algorithm and system. Appl Opt 29: 3271–3279
69. De Groot P (1995) Vibration in phase-shifting interferometry. J Opt Soc Am A 12: 354–365
70. De Groot P, Deck L (1996) Numerical simulations of vibration in phase-shifting interferometry. Appl Opt 35: 2172–2178
71. Ding X, Cloud G, Raju B (2004) Improved signal processing algorithm for the max-min scanning method for phase determination. Opt Eng 43: 63–68
72. Strobel B (1996) Processing of interferometric phase maps as complex-valued phasor images. Appl Opt 35: 2192–2198
73. Quan C, Tay C, Chen L et al. (2003) Spatial-fringe-modulation-based quality map for phase unwrapping. Appl Opt 42: 7060–7065
74. Cheng Y, Wyant J (1985) Phase-shifter calibration in phase-shifting interferometry. Appl Opt 24: 3049–3052
75. Hibino K, Oreb B, Farrant D et al. (1997) Phase-shifting algorithms for nonlinear and spatially nonuniform phase shifts. J Opt Soc Am A 14: 918–930
76. Schmit J, Creath K (1995) Extended averaging technique for derivation of error-compensating algorithms in phase-shifting interferometry. Appl Opt 34: 3610–3619
77. Afifi M, Nassim K, Rachafi S (2001) Five-frame phase-shifting algorithm insensitive to diode laser power variation. Opt Commun 197: 37–42
78. Hariharan P, Oreb B, Eiju T (1987) Digital phase-shifting interferometry: a simple error compensating phase calculation algorithm. Appl Opt 26: 2504–2505
79. Schwider J, Falkenstorfer O, Schreiber H et al. (1993) New compensating four-phase algorithm for phase-shift interferometry. Opt Eng 32: 1883–1885
80. De Groot P (1995) Derivation of algorithms for phase-shifting interferometry using the concept of a data-sampling window. Appl Opt 34: 4723–4730
81. Zhang H, Lalor M, Burton DR (1999) Error-compensating algorithms in phase-shifting interferometry: a comparison by error analysis. Opt Las Eng 31: 381–400
82. Zhao B, Surrel Y (1995) Phase-shifting: six-sample self-calibrating algorithm insensitive to the second harmonic in the fringe signal. Opt Eng 34: 2821–2822
83. Dobroiu A, Logofatu P, Apostol D et al. (1997) Statistical self-calibrating algorithm for three-sample phase-shift interferometry. Meas Sci Technol 8: 738–745
84. Dobroiu A, Apostol D, Nascov V et al. (1998) Self-calibrating algorithm for three-sample phase-shift interferometry by contrast leveling. Meas Sci Technol 9: 744–750
85. Zhu Y, Gemma T (2001) Method for designing error-compensating phase-calculation algorithms for phase shifting interferometry. Appl Opt 40: 4540–4546
86. Styk A, Patorski K (2007) Identification of nonlinear recording error in phase shifting interferometry. Opt Las Eng 45: 265–273
87. Chen M, Guo H, Wei C (2000) Algorithm immune to tilt phase-shifting error for phase-shifting interferometers. Appl Opt 39: 3894–3898
88. Guo H, Chen M (2005) Least-squares algorithm for phase-stepping interferometry with an unknown relative step. Appl Opt 44: 4854–4859
89. Brug H (1999) Phase-step calibration for phase-stepped interferometry. Appl Opt 38: 3549–3555
90. Patil A, Rastogi P (2005) Approaches in generalized phase shifting interferometry. Opt Las Eng 43: 475–490
91. Perry K, McKelvie J (1995) Reference phase shift determination in phase shifting interferometry. Opt Las Eng 22: 77–90
92. Carré P (1966) Installation et utilisation du comparateur photoélectrique et interférentiel du Bureau International des Poids et Mesures. Metrologia 2: 13–23
93. Freischlad K, Koliopoulos C (1990) Fourier description of digital phase-measuring interferometry. J Opt Soc Am A 7: 542–551
94. Kemao Q, Fangjun S, Xiaoping W (2000) Determination of the best phase step of the Carré algorithm in phase shifting interferometry. Meas Sci Technol 11: 1220–1223
95. Stoilov G, Dragostinov T (1997) Phase-stepping interferometry: five-frame algorithm with an arbitrary step. Opt Las Eng 28: 61–69
96. Kreis T (1993) Computer aided evaluation of fringe patterns. Opt Las Eng 19: 221–240
97. De Lega XC, Jacquot P (1996) Deformation measurement with object-induced dynamic phase shifting. Appl Opt 35: 5115–5120
98. Lai G, Yatagai T (1991) Generalized phase-shifting interferometry. J Opt Soc Am A 8: 822–827
99. Kinnstaetter I, Lohmann A, Schwider J et al. (1988) Accuracy of phase shifting interferometry. Appl Opt 27: 5082–5087
100. Wei C, Wang Z (1999) General phase-stepping algorithms with automatic calibration of phase steps. Opt Eng 38: 1357–1360
101. Chen X, Gramaglia M, Yeazell J (2000) Phase-shift calibration algorithm for phase-shifting interferometry. J Opt Soc Am A 17: 2061–2066
102. Goldberg K, Bokor J (2001) Fourier-transform method of phase-shift determination. Appl Opt 40: 2886–2894
103. Guo C, Rong Z, He J et al. (2003) Determination of global phase shifts between interferograms by use of an energy-minimum algorithm. Appl Opt 42: 6514–6519
104. Marroquin J, Servin M, Rodriguez-Vera R (1998) Adaptive quadrature filters for multiple phase-stepping images. Opt Las Eng 23: 238–240
105. Patil A, Rastogi P, Raphael B (2005) Phase-shifting interferometry by a covariance-based method. Appl Opt 44: 5778–5785
106. Okada K, Sato A, Tsujiuchi J (1991) Simultaneous calculation of phase distribution and scanning phase shift in phase shifting interferometry. Opt Commun 84: 118–124
107. Cai LZ, Liu Q, Yang XL (2003) Phase-shift extraction and wave-front reconstruction in phase-shifting interferometry with arbitrary phase steps. Opt Lett 28: 1808–1810
108. Guo H, Zhao Z, Chen M (2007) Efficient iterative algorithm for phase-shifting interferometry. Opt Las Eng 45: 281–292
109. Kim S-W, Kang M-G, Han G-S (1997) Accelerated phase-measuring algorithm of least squares for phase-shifting interferometry. Opt Eng 36: 3101–3106
110. Han G-S, Kim S-W (1994) Numerical correction of reference phases in phase-shifting interferometry by iterative least-squares fitting. Appl Opt 33: 7321–7325
111. Wang Z, Han B (2004) Advanced iterative algorithm for phase extraction of randomly phase-shifted interferograms. Opt Lett 29: 1671–1674
112. Wang Z, Han B (2007) Advanced iterative algorithm for randomly phase-shifted interferograms with intra- and inter-frame intensity variations. Opt Las Eng 45: 274–280
113. Yun H, Hong C (2005) Interframe intensity correlation matrix for self-calibration in phase-shifting interferometry. Appl Opt 44: 4860–4870
114. Cai LZ, Liu Q, Yang XL (2004) Generalized phase-shifting interferometry with arbitrary unknown phase steps for diffraction objects. Opt Lett 29: 183–185
115. Qian K, Soon S, Asundi A (2004) Calibration of phase shift from two fringe patterns. Meas Sci Technol 15: 2142–2144
116. Patil A, Rastogi P (2005) Rotational invariance approach for the evaluation of multiple phases in interferometry in the presence of nonsinusoidal waveforms and noise. J Opt Soc Am A 22: 1918–1929
117. Patil A, Langoju R, Rastogi P (2007) Phase-shifting interferometry using a robust parameter estimation method. Opt Las Eng 45: 293–297
118. Gorecki C (1992) Interferogram analysis using a Fourier transform method for automatic 3D surface measurement. Pure Appl Opt 1: 103–110
119. Baldi A (2003) Phase unwrapping by region growing. Appl Opt 42: 2498–2505
120. Meneses J, Gharbi T, Humbert P (2005) Phase-unwrapping algorithm for images with high noise content based on a local histogram. Appl Opt 44: 1207–1215
121. Herraez MA, Gdeisat MA, Burton DR et al. (2002) Robust, fast, and effective two-dimensional automatic phase unwrapping algorithm based on image decomposition. Appl Opt 41: 7445–7455
122. Schofield MA, Zhu Y (2003) Fast phase unwrapping algorithm for interferometric applications. Opt Lett 28: 1194–1196
123. Ghiglia DC, Pritt MD (1998) Two-Dimensional Phase Unwrapping, Wiley & Sons
124. Arines J (2003) Least-squares modal estimation of wrapped phases: application to phase unwrapping. Appl Opt 42: 3373–3378
125. Baldi A (2001) Two-dimensional phase unwrapping by quad-tree decomposition. Appl Opt 40: 1187–1194
126. Baldi A, Bertolino F, Ginesu F (2002) On the performance of some unwrapping algorithms. Opt Las Eng 37: 313–330
127. Takajo H, Takahashi T (1988) Noniterative method for obtaining the exact solution for the normal equation in least-squares phase estimation from the phase difference. J Opt Soc Am A 5: 1818–1827
128. Hung KM, Yamada T (1998) Phase unwrapping by regions using least-squares approach. Opt Eng 37: 2965–2970
129. Pritt MD, Shipman JS (1994) Least-squares two-dimensional phase unwrapping using FFTs. IEEE Trans on Geoscience and Remote Sensing 11: 706–708
130. Ghiglia DC, Romero LA (1996) Minimum Lp-norm two-dimensional phase unwrapping. J Opt Soc Am A 13: 1–15
131. Marroquin JL, Rivera M, Botello S et al. (1999) Regularization methods for processing fringe-pattern images. Appl Opt 38: 788–794
132. Lyuboshenko I, Maître H, Maruani A (2002) Least-mean-squares phase unwrapping by use of an incomplete set of residue branch cuts. Appl Opt 41: 2129–2148
133. He X, Kang X, Tay C et al. (2002) Proposed algorithm for phase unwrapping. Appl Opt 41: 7422–7428
134. Herraez MA, Burton DR, Lalor MJ et al. (2002) Fast two-dimensional phase-unwrapping algorithm based on sorting by reliability following a noncontinuous path. Appl Opt 41: 7437–7444
135. Stephenson P, Burton DR, Lalor MJ (1994) Data validation techniques in a tiled phase unwrapping algorithm. Opt Eng 33: 3703–3708
136. Geldorf J (1987) Phase unwrapping by regions. In: Proceedings of SPIE 818, pp 2–9
137. Huntley JM, Saldner H (1993) Temporal phase-unwrapping algorithm for automated interferogram analysis. Appl Opt 32: 3047–3052
138. Huang MJ (2002) A quasi-one-frame phase-unwrapping algorithm through zone-switching and zone-shifting hybrid implementation. Opt Commun 210: 187–200
139. Qiu W, Kang Y, Qin Q et al. (2006) Regional identification, partition, and integral phase unwrapping method for processing moiré interferometry images. Appl Opt 45: 6551–6559
140. Huang MJ, He Z (2002) Phase unwrapping through region-referenced algorithm and window-patching method. Opt Commun 203: 225–241
141. Robinson DW (1993) In: Reid GT, Robinson DW (eds) Interferogram Analysis: Digital Fringe Pattern Measurement Techniques, Institute of Physics Publishing, Bristol, pp 192–229
142. Oppenheim AV, Schafer RW (1975) Digital Signal Processing, Prentice Hall
143. Goldstein RM, Zebker HA, Werner CL (1988) Satellite radar interferometry: two-dimensional phase unwrapping. Radio Sci 23: 713–720
144. Cusack R, Huntley JM, Goldrein HT (1995) Improved noise-immune phase-unwrapping algorithm. Appl Opt 34: 781–789
145. Chen CW, Zebker HA (2000) Network approaches to two-dimensional phase unwrapping: intractability and two new algorithms. J Opt Soc Am A 17: 401–414
146. Bone DJ (1991) Fourier fringe analysis: the two-dimensional phase-unwrapping problem. Appl Opt 30: 3627–3662
147. Quiroga JA, Gonzalez-Cano A, Bernabeu E (1995) Stable-marriage algorithm for preprocessing phase maps with discontinuity sources. Appl Opt 34: 5029–5038
148. Lim H, Xu W, Huang X (1995) Two new practical methods for phase unwrapping. In: Proceedings of International Geoscience and Remote Sensing Symposium, NJ, IEEE, pp 196–198
149. Gao Y, Liu X (2002) Noise immune unwrapping based on phase statistics and self-calibration. Opt Las Eng 38: 439–459
150. Roth M (1995) Phase unwrapping for interferometric SAR by the least-error path. In: Technical Memorandum F1B0-95U-019 (JHU/APL, Laurel, MD)
151. Lu Y, Wang X, Zhang X (2007) Weighted least-squares phase unwrapping algorithm based on derivative variance correlation map. Optik 118: 62–66
152. Pritt MD (1996) Phase unwrapping by means of multigrid techniques for interferometric SAR. IEEE Trans Geosci Remote Sens 34: 728–738
153. Li W, Su XY (2002) Phase unwrapping algorithm based on phase fitting reliability in structured light projection. Opt Eng 41: 1365–1372
154. Huntley JM, Saldner H (1993) Temporal phase-unwrapping algorithm for automated interferogram analysis. Appl Opt 32: 3047–3052
155. Saldner H, Huntley J (1997) Temporal phase unwrapping: application to surface profiling of discontinuous objects. Appl Opt 36: 2770–2775
156. Huntley JM, Saldner HO (1997) Error-reduction methods for shape measurements by temporal phase unwrapping. J Opt Soc Am A 14: 3188–3196
157. Huntley JM, Saldner HO (1997) Shape measurement by temporal phase unwrapping: comparison of unwrapping algorithms. Meas Sci Technol 8: 986–992
158. Zhao H, Chen W, Tan Y (1994) Phase-unwrapping algorithm for the measurement of three-dimensional object shapes. Appl Opt 33: 4497–4500
159. Nadeborn W, Andra P, Osten W (1996) A robust procedure for absolute phase measurement. Opt Las Eng 24: 245–260
160. Pedrini G, Alexeenko I, Osten W et al. (2003) Temporal phase unwrapping of digital hologram sequences. Appl Opt 42: 5846–5854
161. Sansoni G, Redaelli E (2005) A 3D vision system based on one-shot projection and phase demodulation for fast profilometry. Meas Sci Technol 16: 1109–1118
162. Hao Y, Zhao Y, Li D (1999) Multifrequency grating projection profilometry based on the nonlinear excess fraction method. Appl Opt 38: 4106–4111
163. Gilbert B, Blatt J (2000) Enhanced three-dimensional reconstruction of surfaces using multicolor gratings. Opt Eng 39: 52–60
164. Wagner C, Osten W, Seebacher S (1999) Direct shape measurement by digital wavefront reconstruction and multiwavelength contouring. Opt Eng 39: 79–85
165. Paez G, Strojnik M (1999) Phase-shifted interferometry without phase unwrapping: reconstruction of a decentered wave front. J Opt Soc Am A 16: 475–480
166. Fang Q, Zheng S (1997) Linearly coded profilometry. Appl Opt 36: 2401–2407
167. Sainov V, Harizanova J, Stoilov G et al. (2000) Relative and absolute coordinates measurement by phase-stepping laser interferometry. In: Optics and Lasers in Biomedicine and Culture, Springer, pp 50–53
168. Sainov V, Harizanova J, Shulev A (2003) Two-wavelength and two-spacing projection interferometry for real objects contouring. In: Proceedings of SPIE 5226, pp 184–188
169. Sainov V, Stoykova E, Harizanova J (2006) Optical methods for contouring and shape measurement. In: Proceedings of ICO’06, Optoinformatics & Information Photonics, St. Petersburg, Russia, pp 130–132
170. Takeda M, Ina H, Kobayashi S (1982) Fourier-transform method of fringe-pattern analysis for computer-based topography and interferometry. J Opt Soc Am 72: 156–160
171. Takeda M, Mutoh K (1983) Fourier transform profilometry for the automatic measurement of 3-D object shapes. Appl Opt 22: 3977–3982
172. Kostianovski S, Lipson S, Ribak E (1993) Interference microscopy and Fourier fringe analysis applied to measuring the spatial refractive-index distribution. Appl Opt 32: 4744–4750
173. Ge Z, Kobayashi F, Matsuda S et al. (2001) Coordinate transform technique for closed-fringe analysis by the Fourier-transform method. Appl Opt 40: 1649–1657
174. Roddier C, Roddier F (1987) Interferogram analysis using Fourier transform techniques. Appl Opt 26: 1668–1673
175. Bone DJ, Bachor HA, Sandeman R (1986) Fringe pattern analysis using a 2-D Fourier transform. Appl Opt 25: 1653–1660
176. Kreis T (1986) Digital holographic interference phase measurement using the Fourier-transform method. J Opt Soc Am A 3: 847–856
177. Sciammarella CA (2000) Computer-assisted holographic moiré contouring. Opt Eng 39: 99–105
178. D’Acquisto L, Fratini L, Siddiolo AM (2002) A modified moiré technique for three-dimensional surface topography. Meas Sci Technol 13: 613–622
179. Macy W (1983) Two-dimensional fringe-pattern analysis. Appl Opt 22: 3898–3901
180. Nugent K (1985) Interferogram analysis using an accurate fully automatic algorithm. Appl Opt 24: 3101–3105
181. Liu J, Ronney P (1997) Modified Fourier transform method for interferogram fringe pattern analysis. Appl Opt 36: 6231–6241
182. Burton D, Goodall A, Atkinson J et al. (1995) The use of carrier frequency shifting for the elimination of phase discontinuities in Fourier transform profilometry. Opt Las Eng 23: 245–257
183. Su X, Chen W (2001) Fourier transform profilometry: a review. Opt Las Eng 35: 263–284
184. De Nicola S, Ferraro P, Gurov I et al. (2000) Fringe analysis for moiré interferometry by modification of the local intensity histogram and use of a two-dimensional Fourier transform method. Meas Sci Technol 11: 1328–1334
185. Srinivasan V, Liu HC, Halioua M (1984) Automated phase-measuring profilometry of 3-D diffuse objects. Appl Opt 23: 3105–3108
186. Vander R, Lipson SG, Leizerson I (2003) Fourier fringe analysis with improved spatial resolution. Appl Opt 42: 6830–6837
187. Quan C, Tay C, Yang F et al. (2005) Phase extraction from a single fringe pattern based on guidance of an extreme map. Appl Opt 44: 4814–4821
188. Kinell L (2004) Spatiotemporal approach for real-time absolute shape measurements by use of projected fringes. Appl Opt 43: 3018–3027
189. Hu X, Liu G, Hu C et al. (2006) Characterization of static and dynamic microstructures by microscopic interferometry based on a Fourier transform method. Meas Sci Technol 17: 1312–1318
190. Shulev A, Gotchev A, Foi A et al. (2006) Threshold selection in transform-domain denoising of speckle pattern fringes. In: Proceedings of SPIE 6252, pp 21–27
191. Lovric D, Vucic Z, Gladic J et al. (2003) Refined Fourier-transform method of analysis of full two-dimensional digitized interferograms. Appl Opt 42: 1477–1484
192. Li J, Su X, Guo L (1990) Improved Fourier transform profilometry for the automatic measurement of 3-D object shapes. Opt Eng 29: 1430–1444
193. Vucic Z, Gladic J (2005) Phase retrieval errors in standard Fourier fringe analysis of digitally sampled model interferograms. Appl Opt 44: 6940–6948
194. Vanherzeele J, Guillaume P, Vanlanduit S (2005) Fourier fringe processing using a regressive Fourier-transform technique. Opt Las Eng 43: 645–658
195. Li JL, Su XY, Su HJ et al. (1998) Removal of carrier frequency in phase-shifting techniques. Opt Las Eng 30: 107–115
196. Lu M, He X, Liu S (2000) Powerful frequency domain algorithm for frequency identification for projected grating phase analysis and its applications. Opt Eng 39: 137–142
197. Chen L, Quan C (2005) Fringe projection profilometry with nonparallel illumination: a least-squares approach. Opt Lett 30: 2101–2104
198. Srinivasan V, Liu H, Halioua M (1985) Automated phase-measuring profilometry: a phase mapping approach. Appl Opt 24: 185–188
199. Zhou W, Su X (1994) A direct mapping algorithm for phase-measuring profilometry. J Mod Opt 41: 89–94
200. Chen L, Tay CJ (2006) Carrier phase component removal: a generalized least-squares approach. J Opt Soc Am A 23: 435–443
201. Takeda M, Yamamoto H (1994) Fourier-transform speckle profilometry: three-dimensional shape measurements of diffuse objects with large height steps and/or spatially isolated surfaces. Appl Opt 33: 7829–7837
202. Onodera R, Ishii Y (1998) Two-wavelength interferometry that uses a Fourier-transform method. Appl Opt 37: 7988–7993
203. Takeda M, Kitoh M (1992) Spatiotemporal frequency multiplex heterodyne interferometry. J Opt Soc Am A 9: 1607–1614
204. Burton D, Lalor M (1994) Multichannel Fourier fringe analysis as an aid to automatic phase unwrapping. Appl Opt 33: 2939–2948
205. Takeda M, Gu Q, Kinoshita M et al. (1997) Frequency-multiplex Fourier-transform profilometry: a single-shot three-dimensional shape measurement of objects with large height discontinuities and/or surface isolations. Appl Opt 36: 5347–5354
206. Gushov V, Solodkin Y (1991) Automatic processing of fringe patterns in integer interferometers. Opt Las Eng 14: 311–324
207. Takeda M, Aoki T, Miyamoto Y et al. (2000) Absolute three-dimensional shape measurements using coaxial and coimage plane optical systems and Fourier fringe analysis for focus detection. Opt Eng 39: 61–68
208. Bulut K, Inci MN (2005) Three-dimensional optical profilometry using a four-core optical fibre. Opt Las Tech 37: 463–469
209. Kreis T (1986) Digital holographic interference phase measurement using the Fourier-transform method. J Opt Soc Am A 3: 847–856
210. Larkin KG, Bone D, Oldfield M (2001) Natural demodulation of two-dimensional fringe patterns. I. General background of the spiral phase quadrature transform. J Opt Soc Am A 18: 1862–1870
211. Larkin KG, Bone D, Oldfield M (2001) Natural demodulation of two-dimensional fringe patterns. II. Stationary phase analysis of the spiral phase quadrature transform. J Opt Soc Am A 18: 1871–1881
212. Jesacher A, Fürhapter S, Bernet S et al. (2006) Spiral interferogram analysis. J Opt Soc Am A 23: 1400–1409
213. Tomassini P, Giulietti A, Gizzi L et al. (2001) Analyzing laser plasma interferograms with a continuous wavelet transform ridge extraction technique: the method. Appl Opt 40: 6561–6568
214. Qian K (2004) Windowed Fourier transform for fringe pattern analysis. Appl Opt 43: 2695–2702
215. Sciammarella C, Kim T (2003) Determination of strains from fringe patterns using space-frequency representations. Opt Eng 42: 3182–3193
216. Watkins L (2007) Phase recovery from fringe patterns using the continuous wavelet transform. Opt Las Eng 45: 298–303
217. Marroquin J, Rivera M (1995) Quadratic regularization phase functional for phase unwrapping. J Opt Soc Am A 12: 2393–2400
218. Colonna de Lega X (1997) Processing of non-stationary interference patterns: adapted phase shifting algorithms and wavelet analysis. Application to dynamic deformation measurements by holographic and speckle interferometry. Ph.D. thesis, Swiss Federal Institute of Technology
219. Daubechies I (1992) Ten Lectures on Wavelets, SIAM, Philadelphia, PA
220. Watkins LR, Tan SM, Barnes TH (1999) Determination of interferometer phase distributions by use of wavelets. Opt Lett 24: 905–907
221. Federico A, Kaufmann G (2002) Evaluation of the continuous wavelet transform method for the phase measurement of electronic speckle pattern interferometry fringes. Opt Eng 41: 3209–3216
222. Kadooka K, Kunoo K, Uda N et al. (2003) Strain analysis for moiré interferometry using the two-dimensional continuous wavelet transform. Exp Mech 43: 45–51
223. Belyakov A, Gurov I (2003) Analyzing interference fringes by the wavelet method. J Opt Tech 70: 13–17
224. Daubechies I (1990) The wavelet transform, time-frequency localization and signal analysis. IEEE Trans Inf Theory 36: 961–1005
225. Dursun A, Ozder S, Ecevit F (2004) Continuous wavelet transform analysis of projected fringe patterns. Meas Sci Technol 15: 1768–1772
226. Belyakov A (2006) Analyzing interference-fringe patterns by discriminating the features of wavelet maps of symmetric wavelets. J Opt Tech 73: 183–187
227. Zheng R, Wang Y, Zhang X et al. (2005) Two-dimensional phase-measuring profilometry. Appl Opt 44: 954–958
228. Carmona R, Hwang W, Torresani B (1997) Characterization of signals by the ridges of their wavelet transforms. IEEE Trans Sig Process 45: 2586–2590
229. Liu H, Cartwright A, Basaran C (2004) Moiré interferogram phase extraction: a ridge detection algorithm for continuous wavelet transforms. Appl Opt 43: 850–857
230. Afifi M, Fassi-Fihri A, Marjane M et al. (2002) Paul wavelet-based algorithm for optical phase distribution evaluation. Opt Commun 211: 47–51
231. Zhong J, Weng J (2004) Spatial carrier-fringe pattern analysis by means of wavelet transform: wavelet transform profilometry. Appl Opt 43: 4993–4998
232. Qian K, Seah H, Asundi A (2005) Fault detection by interferometric fringe pattern analysis using windowed Fourier transform. Meas Sci Technol 16: 1582–1587
233. Zhou J (2005) Wavelet-aided spatial carrier fringe pattern analysis for 3-D shape measurement. Opt Eng 44: 113602
234. Liu H, Cartwright A, Basaran C (2003) Sensitivity improvement in phase-shifted moiré interferometry using 1-D continuous wavelet transform image processing. Opt Eng 42: 2646–2652
235. Liu H, Cartwright A, Basaran C (2004) Experimental verification of improvement of phase shifting moiré interferometry using wavelet-based image processing. Opt Eng 43: 1206–1214
236. Li H, Chen H, Zhang J et al. (2007) Statistical searching of deformation phases on wavelet transform maps of fringe patterns. Opt Las Eng 39: 275–281
237. Miao H, Quan C, Tay CJ et al. (2007) Analysis of phase distortion in phase-shifted fringe projection. Opt Las Tech 45: 318–325
238. Li X (2000) Wavelet transform for detection of partial fringe patterns induced by defects in non-destructive testing of holographic interferometry and electronic speckle pattern interferometry. Opt Eng 39: 2821–2827
239. Chang RS, Sheu J, Lin CH et al. (2003) Analysis of CCD moiré pattern for micro-range measurements using the wavelet transform. Opt Las Tech 35: 43–47
240. Qian K, Soon S, Asundi A (2003) Phase-shifting windowed Fourier ridges for determination of phase derivatives. Opt Lett 28: 1657–1659
241. Qian K (2004) Windowed Fourier transform method for demodulation of carrier fringes. Opt Eng 43: 1472–1473
242. Qian K, Soon S (2005) Two-dimensional windowed Fourier frames for noise reduction in fringe pattern analysis. Opt Eng 44: 075601
243. Yao W, He A (1999) Application of Gabor transformation to the two-dimensional projection extraction in interferometric tomography. J Opt Soc Am A 16: 258–263
244. Jun W, Asundi A (2002) Strain contouring with Gabor filters: filter bank design. Appl Opt 41: 7229–7236
245. Zhong J, Weng J (2004) Dilating Gabor transform for the fringe analysis of 3-D shape measurement. Opt Eng 43: 895–899
246. Marroquin J, Rodriguez-Vera R, Servin M (1998) Local phase from local orientation by solution of a sequence of linear systems. J Opt Soc Am A 15: 1536–1544
247. Servin M, Marroquin J, Cuevas F (2001) Fringe-follower regularized phase tracker for demodulation of closed-fringe interferograms. J Opt Soc Am A 18: 689–695
248. Marroquin J, Figueroa J, Servin M (1997) Robust quadrature filters. J Opt Soc Am A 14: 779–791
249. Rivera M, Marroquin J, Botello S et al. (2000) Robust spatiotemporal quadrature filter for multiphase stepping. Appl Opt 39: 284–292
250. Servin M, Quiroga J, Marroquin J (2003) General n-dimensional quadrature transform and its application to interferogram demodulation. J Opt Soc Am A 20: 925–934
251. Marroquin J, Servin M, Rodriguez-Vera R (1997) Adaptive quadrature filters and the recovery of phase from fringe pattern images. J Opt Soc Am A 14: 1742–1753
252. Villa J, De la Rosa I, Miramontes G et al. (2005) Phase recovery from a single fringe pattern using an orientational vector-field-regularized estimator. J Opt Soc Am A 22: 2766–2773
253. Zhou X, Baird J, Arnold J (1999) Fringe-orientation estimation by use of a Gaussian gradient filter and neighboring-direction averaging. Appl Opt 38: 795–804
254. Canabal H, Quiroga J, Bernabeu E (1998) Automatic processing in moiré deflectometry by local fringe direction calculation. Appl Opt 37: 5894–5901
255. Villa J, Quiroga J, Servin M (2000) Improved regularized phase-tracking technique for the processing of squared-grating deflectograms. Appl Opt 39: 502–508
256. Servin M, Malacara D, Cuevas F (1994) Direct phase detection of modulated Ronchi rulings using a phase locked loop. Opt Eng 33: 1193–1199
257. Gdeisat M, Burton D, Lalor M (2000) Real-time fringe pattern demodulation with a second-order digital phase-locked loop. Appl Opt 39: 5326–5336
258. Gdeisat M, Burton D, Lalor M (2002) Fringe pattern demodulation with a two-frame digital phase-locked loop algorithm. Appl Opt 41: 5471–5578
259. Servin M, Marroquin J, Quiroga J (2004) Regularized quadrature and phase tracking from a single closed-fringe interferogram. J Opt Soc Am A 21: 411–419
260. Rivera M (2005) Robust phase demodulation of interferograms with open or closed fringes. J Opt Soc Am A 22: 1170–1175
261. Legarda-Saenz R, Osten W, Jüptner W (2002) Improvement of the regularized phase tracking technique for the processing of nonnormalized fringe patterns. Appl Opt 41: 5519–5526
262. Rivera M, Rodriguez-Vera R, Marroquin J (1997) Robust procedure for fringe analysis. Appl Opt 36: 8391–8396
263. Cuevas F, Sossa-Azuela J, Servin M (2002) A parametric method applied to phase recovery from a fringe pattern based on a genetic algorithm. Opt Commun 203: 213–223
264. Joo W, Cha S (1996) Knowledge-based hybrid expert system for automated interferometric data reduction. Opt Las Eng 24: 57–75
265. Robin E, Valle V, Brémand F (2005) Phase demodulation method from a single fringe pattern based on correlation with a polynomial form. Appl Opt 44: 7261–7269
266. Gurov I, Sheynihovich D (2000) Interferometric data analysis based on Markov nonlinear filtering methodology. J Opt Soc Am A 17: 21–26
267. Reich C, Ritter R, Thesing J (2000) 3-D shape measurement of complex objects by combining photogrammetry and fringe projection. Opt Eng 39: 224–231
268. Kowarschik R, Kuhmstedt P, Gerber J et al. (2000) Adaptive optical three-dimensional measurement with structured light. Opt Eng 39: 150–158
269. Kreis T, In: Rastogi PK (ed) Holographic Interferometry: Principles and Methods, Springer, Heidelberg, pp 151–212
270. Tay CJ, Quan C, Yang FJ et al. (2004) A new method for phase extraction from a single fringe pattern. Opt Commun 239: 251–258
271. Liebling M, Blu T, Unser M (2004) Complex-wave retrieval from a single off-axis hologram. J Opt Soc Am A 21: 367–377
272. De Angelis M, De Nicola S, Ferraro P et al. (2005) Profile measurement of a one-dimensional phase boundary sample using a single shot phase-step method. Opt Las Eng 43: 1305–1314
273. Skydan O, Lalor M, Burton D (2005) Using coloured structured light in 3-D surface measurement. Opt Las Eng 43: 801–814
274. Yoneyama S, Morimoto Y, Fujigaki M et al. (2003) Three dimensional surface profile measurement of moving object by a spatial-offset phase stepping method. Opt Eng 42: 137–142
275. Zhang S, Yau S (2006) High-resolution, real-time 3D absolute coordinate measurement based on a phase-stepping method. Opt Express 14: 2644–2654
276. Lu C, Xiang L (2003) Optimal intensity-modulation projection technique for three-dimensional shape measurement. Appl Opt 42: 4649–4657
277. Arai Y, Yokozeki S, Yamada T (1995) Fringe-scanning method using a general function for shadow moiré. Appl Opt 34: 4877–4882
278. Awatsuji Y, Sasada M, Fujii A et al. (2006) Scheme to improve the reconstructed image in parallel quasi-phase-shifting digital holography. Appl Opt 45: 968–974
279. Huang P, Hu Q, Jin F et al. (1999) Color-encoded digital fringe projection technique for high-speed three-dimensional surface contouring. Opt Eng 38: 1065–1071
280. Coggrave C, Huntley J (2000) Optimization of a shape measurement system based on spatial light modulators. Opt Eng 39: 91–98
281. Goodman J (2004) Introduction to Fourier Optics, Roberts & Company Publishers
6 Three-dimensional Scene Representations: Modeling, Animation, and Rendering Techniques
Uğur Güdükbay and Funda Durupınar, Department of Computer Engineering, Bilkent University, 06800, Bilkent, Ankara, Turkey
Modeling the behavior and appearance of captured three-dimensional (3D) objects is a fundamental requirement for scene representation in a three-dimensional television (3DTV) framework. By using the data acquired from multiple cameras, it is possible to model a scene with high-quality visual results. In fact, the 3D scene capturing and representation phases are highly correlated. Information acquired in the capturing phase can be employed in the representation phase by using computer graphics and image processing techniques. The resultant model then allows users to interact with the scene; they do not just remain observers but become participants themselves. Thus, the main considerations for a scene representation technique are its accuracy, that is, how closely its results correspond to the original scene, and its efficiency, since real-time performance is required. 3D shape modeling is an essential component of scene representation for 3DTV. Time-varying mesh representations provide a suitable way of representing 3D shapes. With these methods, the static components of a scene are constructed only once and the other objects are modeled as dynamic components, which reduces the computational time needed to represent 3D scenes. Polygonal meshes are widely used in shape modeling due to their built-in hardware support. Thus, they are suitable for applications such as 3DTV where real-time performance is required. Alternatively, volumetric representations can be used in shape modeling. The basic volume elements of a 3D space, voxels, correspond to the 2D pixels of an image. Volumetric techniques require large amounts of data in order to represent a scene or object accurately. Images acquired from multiple calibrated cameras provide the necessary information for volumetric models; thus, these methods are intuitive for 3DTV. However, recent research shows that point-based approaches are the most suitable shape modeling techniques for 3DTV, because 3D data acquisition methods such as laser scanning already represent the scene in a point-based manner.
3D scene representation has two components: geometry and texture. Geometry representation is handled by modeling the shape of an object or a scene. Since the scenes mostly contain dynamic objects that move and deform in different ways, modeling the motion becomes important. Animation techniques that have potential for real-time hardware implementations are promising approaches to be used in a 3DTV framework. Texture representation is handled by the underlying rendering technique. Scan-line rendering techniques are suitable for 3DTV as they are hardware-supported and efficient. In addition, image-based rendering is a very successful and promising rendering scheme for 3DTV as it directly makes use of the captured images. This chapter provides introductory knowledge for the modeling, animation, and rendering techniques used in computer graphics. It is not an exhaustive survey of these topics and includes only representatives of each, focusing on techniques relevant to 3DTV. The interested reader is referred to the references for an in-depth discussion of the topics covered. The chapter is organized as follows. First, different 3D scene representation techniques, namely mesh-based representations, volumetric methods, and point-based techniques, will be discussed. Then, we will explain animation techniques for modeling object behavior. Finally, we will discuss illumination models and rendering techniques for 3D scenes containing different types of objects and lighting conditions.
6.1 Modeling

There are two main approaches to representing the shape of arbitrary free-form objects. The first approach, which is called Constructive Solid Geometry (CSG), models the shapes of free-form objects as a composition of geometrically and algebraically defined primitives, such as polygons, implicit surfaces, or parametric surfaces. This approach uses Boolean operations to combine regular shapes and is widely used in Computer-Aided Design tools. The second approach deforms regular shapes using deformation techniques, such as regular deformations [1] and Free-Form Deformations [2], to obtain irregular, free-form objects.
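To make the Boolean-composition idea concrete, the following minimal sketch (written for illustration only; the primitives, function names, and test point are arbitrary choices) represents each primitive by a signed distance function, so that union, intersection, and difference reduce to pointwise min/max operations:

import numpy as np

# Signed distance primitives: negative inside, positive outside
def sphere(center, radius):
    return lambda p: np.linalg.norm(p - center) - radius

def box(center, half_extents):
    def f(p):
        q = np.abs(p - center) - half_extents
        return np.linalg.norm(np.maximum(q, 0.0)) + min(q.max(), 0.0)
    return f

# Boolean operations in the CSG spirit
def union(f, g):        return lambda p: min(f(p), g(p))
def intersection(f, g): return lambda p: max(f(p), g(p))
def difference(f, g):   return lambda p: max(f(p), -g(p))

# A cube with a spherical bite taken out of one corner
solid = difference(box(np.zeros(3), np.ones(3)),
                   sphere(np.array([1.0, 1.0, 1.0]), 0.8))
print(solid(np.zeros(3)) < 0)   # True: the origin lies inside the solid

Min/max composition of signed distances is only one possible realization of CSG; boundary-representation and spatial-subdivision realizations are also common.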
Before going into the details of different shape representation techniques based on Euclidean geometry, we will say a few words about modeling the shapes of natural objects. Natural objects, such as mountains, clouds, and trees, cannot be described using equations, since these objects do not have regular shapes; their irregular or fragmented features cannot be realistically modeled using methods based on Euclidean geometry [3]. Fractal-geometry methods use procedures to model such objects [4]. L-systems (Lindenmayer systems) provide a mathematical formalism for the realistic modeling of plants and plant generation. The basic idea is to define complex objects, like plants, by successively replacing parts of simple initial objects using a set of rewriting rules. The rewriting rules are applied in parallel to the different parts of the objects [5].
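A minimal sketch of this parallel rewriting (our illustration, using Lindenmayer's classic two-symbol algae system as the rule set):

def lsystem(axiom, rules, n):
    # apply the rewriting rules to every symbol in parallel, n times
    for _ in range(n):
        axiom = "".join(rules.get(s, s) for s in axiom)
    return axiom

# Lindenmayer's algae system: A -> AB, B -> A
print(lsystem("A", {"A": "AB", "B": "A"}, 4))   # ABAABABA

In plant modeling the symbols are interpreted geometrically (e.g., as branch segments and turns), but the rewriting mechanism is exactly the one above.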
6.1.1 Polygonal Mesh Representations

The surface of a 3D object can be approximated using a number of planar polygons. A polygonal approximation to a 3D object has faces, edges, vertices, and normal vectors that identify the spatial orientation of the polygon surfaces. These are stored in geometric data tables. A vertex table stores the x-, y-, and z-coordinates of the vertices. Surfaces, or polygons, are stored in surface tables, which contain pointers to the vertex table for each vertex comprising that polygonal surface. Edge tables are useful for wireframe drawing purposes; they also represent edges using pointers to the vertex table [3]. Mostly, triangles are used for polygonal approximations of objects, since triangles can be processed in hardware by the graphics cards in today's computers. Figure 6.1 shows a simple object and its corresponding vertex, edge and surface tables. In addition, there are also some attributes associated with vertices and faces, such as the degree of transparency, surface reflectivity, and texture characteristics, which are stored in attribute data tables. These are necessary for shading polygonal surfaces. The normal vector of a polygonal surface is calculated by taking the cross product of two non-collinear vectors lying on the polygonal surface. The vertex normals are calculated by averaging the normals of the faces sharing a vertex.

Vertex table            Edge table             Surface table
Vertex  x   y   z       Edge  Start  End       Surface  Vertex 1  Vertex 2  Vertex 3
v1      x1  y1  z1      e1    v1     v2        S1       v1        v2        v4
v2      x2  y2  z2      e2    v2     v4        S2       v2        v3        v4
v3      x3  y3  z3      e3    v1     v4        S3       v1        v4        v5
v4      x4  y4  z4      e4    v4     v5
v5      x5  y5  z5      e5    v1     v5
                        e6    v2     v3
                        e7    v3     v4

Fig. 6.1. A polygonal object and its vertex, edge and surface tables
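The normal computations described above translate directly into code. The sketch below (our illustration; it uses the face list of Fig. 6.1 with arbitrary coordinate values) builds face normals via cross products and averages them into vertex normals:

import numpy as np

# Vertex and surface (face) tables in the style of Fig. 6.1;
# the coordinates are arbitrary example values.
vertices = np.array([[0.0, 0.0, 0.0],   # v1
                     [1.0, 0.0, 0.0],   # v2
                     [2.0, 1.0, 0.5],   # v3
                     [1.0, 1.0, 0.0],   # v4
                     [0.0, 1.0, 0.5]])  # v5
faces = [(0, 1, 3), (1, 2, 3), (0, 3, 4)]   # S1, S2, S3 (0-based indices)

def face_normal(a, b, c):
    # cross product of two non-collinear edge vectors on the face
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

face_normals = [face_normal(*vertices[list(f)]) for f in faces]

# vertex normal = normalized average of the normals of the incident faces
vertex_normals = np.zeros_like(vertices)
for f, n in zip(faces, face_normals):
    for v in f:
        vertex_normals[v] += n
vertex_normals /= np.linalg.norm(vertex_normals, axis=1, keepdims=True)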
When the polygonal approximations of objects are very large, containing millions of polygons, level-of-detail approximations of the models become inevitable. Polygonal model simplification is the main tool for obtaining different levels of detail of polygonal models. Progressive mesh representations that store different level-of-detail approximations of large models are used to visualize complex models with view-dependent visualization techniques. These techniques display the models at the level of detail suitable for the current viewpoint, so that polygons that do not contribute to the final image are not processed by the graphics pipeline [6, 7, 8]. Figure 6.2 shows a sphere rendered with two different levels of detail.

Fig. 6.2. Level-of-detail example on wireframe and smooth-shaded spheres: (a) low resolution; (b) high resolution

6.1.2 Parametric Surfaces

A parametric surface is defined as a mapping from 2-space to 3-space, since each parametric surface can be defined using two parameters. Parametric surfaces are represented by the following equation:

X(u, v) = \begin{bmatrix} x(u, v) \\ y(u, v) \\ z(u, v) \end{bmatrix}, \qquad u_0 \le u \le u_1, \quad v_0 \le v \le v_1 \qquad (6.1)

Normal vectors for parametric surfaces can be calculated by taking the cross product of the surface tangent functions. The surface tangent functions are found by taking the partial derivatives of the parametric surface function with respect to the surface parameters. As an example, the derivation of the parametric normal vector equation for the unit sphere is given in the following equations:

X(u, v) = \begin{bmatrix} \cos(u)\cos(v) \\ \cos(u)\sin(v) \\ \sin(u) \end{bmatrix}, \qquad -\frac{\pi}{2} \le u \le \frac{\pi}{2}, \quad -\pi \le v < \pi \qquad (6.2)

N(u, v) = \frac{\partial X}{\partial u} \times \frac{\partial X}{\partial v} \qquad (6.3)
N(u, v) = \begin{bmatrix} -\sin(u)\cos(v) \\ -\sin(u)\sin(v) \\ \cos(u) \end{bmatrix} \times \begin{bmatrix} -\cos(u)\sin(v) \\ \cos(u)\cos(v) \\ 0 \end{bmatrix} \qquad (6.4)

N(u, v) = -\begin{bmatrix} \cos^2(u)\cos(v) \\ \cos^2(u)\sin(v) \\ \sin(u)\cos(u) \end{bmatrix} = -\cos(u)\,X(u, v) \qquad (6.5)

Since X(u, v) is a unit vector, the outward unit normal of the sphere is X(u, v) itself; the sign of the cross product in (6.4) merely selects the inward direction.
Each coordinate of a point on a parametric surface can be calculated independently of the other coordinates; this makes parametric surfaces attractive for generating polygonal approximations of object surfaces. This is generally done by sampling a regular grid in the parameter space and then calculating the points on the parametric surface by plugging the parameter values at the grid locations into the parametric surface functions for each coordinate. The coordinates of the points on the parametric surface are stored in a two-dimensional array that corresponds to the grid of parameter values. Then, the polygons (triangles) are implicitly obtained by forming triangles on the grid. Such polygonal approximations are called regular meshes, since the polygons are formed from neighboring grid points in a regular way and the polygon information is not stored explicitly. The problem with parametric surfaces is that we only know the parametric surface functions for a limited set of regular objects. Examples of parametric surfaces that can be used for representing primitive objects are quadrics, superquadrics [9], and bi-cubic surfaces, such as B-spline, Hermite, and Bézier surfaces [10, 11]. Figure 6.3 shows examples of parametric surfaces, namely supertoroids with different parameters (a) and a Bézier surface (b).
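As an illustration of this grid-sampling procedure (a sketch written for this text, with an arbitrary grid resolution; the seam at v = ±π is simply duplicated), the following code tessellates the unit sphere of Eq. (6.2) into a regular triangle mesh:

import numpy as np

def tessellate_sphere(nu=16, nv=32):
    # Sample the parametric unit sphere of Eq. (6.2) on a regular (u, v)
    # grid and form triangles implicitly from neighbouring grid points.
    u = np.linspace(-np.pi/2, np.pi/2, nu)
    v = np.linspace(-np.pi, np.pi, nv)
    uu, vv = np.meshgrid(u, v, indexing="ij")
    # each coordinate is evaluated independently, as in Eq. (6.2)
    pts = np.stack([np.cos(uu)*np.cos(vv),
                    np.cos(uu)*np.sin(vv),
                    np.sin(uu)], axis=-1)         # shape (nu, nv, 3)
    tris = []
    for i in range(nu - 1):
        for j in range(nv - 1):
            a, b = i*nv + j, i*nv + j + 1
            c, d = (i + 1)*nv + j, (i + 1)*nv + j + 1
            tris += [(a, b, c), (b, d, c)]        # two triangles per grid cell
    return pts.reshape(-1, 3), tris

verts, tris = tessellate_sphere()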
Fig. 6.3. Examples of parametric surfaces: (a) supertoroids with different parameters; (b) a Bézier surface

6.1.3 Implicit Surfaces

An implicit surface equation has the following form:

f(x, y, z) = 0 \qquad (6.6)
Implicit surfaces divide the space into object interior and exterior regions. They allow us to talk about the solids defined by the interiors of the implicit surfaces. Implicit surfaces are especially useful for collision detection and response in computer animation and for ray-surface intersection tests in rendering applications such as ray tracing. However, they are not suitable for generating polygonal approximations of the surfaces of objects. Collision detection applications generally require testing whether a point p is inside or outside a surface, for which we can use the implicit equation of the surface:

f(p) \begin{cases} = 0, & p \text{ is on the surface} \\ > 0, & p \text{ lies outside the surface} \\ < 0, & p \text{ lies inside the surface} \end{cases} \qquad (6.7)

Implicit surface equations are also used for ray-surface intersection tests. A ray is represented parametrically as

r(t) = r_0 + t\,v \qquad (6.8)

where r_0 is the ray origin, v is the direction vector of the ray, and t is the ray parameter. Then, we can test whether a ray intersects an implicit surface f(x, y, z) = 0 by substituting the parametric ray equation into the implicit surface equation and solving for the ray parameter t:

f(r_0 + t\,v) = 0 \qquad (6.9)
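As a concrete instance of Eqs. (6.8) and (6.9) (our sketch, not a library routine), substituting the ray into the implicit equation of a sphere, f(p) = |p - center|^2 - radius^2 = 0, yields a quadratic in t:

import numpy as np

def ray_sphere(r0, v, center=np.zeros(3), radius=1.0):
    # Solve f(r0 + t v) = 0 for the implicit sphere; returns the
    # smallest non-negative root t, or None if the ray misses.
    oc = r0 - center
    a = np.dot(v, v)
    b = 2.0 * np.dot(oc, v)
    c = np.dot(oc, oc) - radius**2
    disc = b*b - 4*a*c
    if disc < 0:
        return None                       # ray misses the surface
    t1 = (-b - np.sqrt(disc)) / (2*a)
    t2 = (-b + np.sqrt(disc)) / (2*a)
    return min((t for t in (t1, t2) if t >= 0), default=None)

# a ray from (0, 0, -3) along +z hits the unit sphere at t = 2
print(ray_sphere(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0])))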
6.1.4 Subdivision Surfaces

Subdivision surfaces are another popular surface modeling scheme. The idea of subdivision surfaces was first introduced by Catmull and Clark [12] and by Doo and Sabin [13], independently, in 1978. Other notable subdivision schemes are Loop [14], Butterfly [15], and √3-Subdivision [16]. The algorithmic definition of subdivision surfaces distinguishes them from standard spline surfaces. Subdivision surfaces resemble both polygon meshes and patch surfaces, and they take the best aspects of each representation technique. For instance, they can represent smooth surfaces with arbitrary topology and can be rendered smoothly owing to the well-defined surface normal, unlike low-resolution polygonal geometry. Simplicity, efficiency, and ease of implementation are the main advantages of subdivision surfaces. Subdivision surfaces are constructed through recursive splitting and averaging operations. Splitting is performed by dividing a face into new faces, and averaging is performed by taking a weighted average of neighboring vertices to obtain a new vertex. Splitting and averaging operations are shown in Fig. 6.4. The Doo-Sabin subdivision scheme is illustrated in Fig. 6.5 and the Catmull-Clark subdivision scheme is illustrated in Fig. 6.6. The results of applying various subdivision schemes to a cube are shown in Fig. 6.7.
Fig. 6.4. Subdivision operations: (a) recursive splitting; (b) averaging
The shape of a subdivision surface is determined by a structured mesh of control points and a set of subdivision rules prescribing a procedure for refining the mesh to a finer approximation. The subdivision surface itself is defined as the limit of repeated recursive refinements. Subdivision surfaces satisfy all the usual requirements for surface representation that confront computer graphics practitioners. Starting with an initial polygonal mesh of arbitrary topology, a subdivision scheme is used to generate a new mesh that is the initial mesh for the next refinement. The repetitive application of this process will generate a sequence of polygonal meshes whose limit may be a smooth surface, assuming that appropriate conditions are satisfied [17]. This makes subdivision surfaces suitable as a multi-resolution mesh representation where switching between coarser and finer refinements can be easily achieved. The recursive nature of subdivision surfaces provides control over different levels of detail through adaptive subdivision. However, this nature also introduces a weakness for the modeling of sharp features such as creases or corners. Recently, some new techniques that perform modifications and additions to the subdivision rules have overcome this problem [18].
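As a compact example of the split-and-average recipe, the following sketch performs one Loop subdivision step, assuming a closed (watertight) triangle mesh; Loop's standard masks are used and boundary handling is omitted:

import numpy as np
from collections import defaultdict

def loop_subdivide(verts, faces):
    # One Loop subdivision step for a closed triangle mesh.
    verts = np.asarray(verts, dtype=float)
    neighbours = defaultdict(set)
    edge_opposite = defaultdict(list)   # edge -> opposite vertices of its two faces
    for i, j, k in faces:
        for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
            neighbours[a].update((b, c))
            edge_opposite[frozenset((a, b))].append(c)
    # averaging, part 1: one new vertex per edge (Loop's 3/8-1/8 mask)
    new_verts = list(verts)
    edge_vertex = {}
    for edge, (o1, o2) in edge_opposite.items():
        a, b = tuple(edge)
        edge_vertex[edge] = len(new_verts)
        new_verts.append(3/8*(verts[a] + verts[b]) + 1/8*(verts[o1] + verts[o2]))
    # averaging, part 2: reposition the original vertices (valence-dependent beta)
    for v, nbrs in neighbours.items():
        n = len(nbrs)
        beta = (5/8 - (3/8 + 1/4*np.cos(2*np.pi/n))**2) / n
        new_verts[v] = (1 - n*beta)*verts[v] + beta*sum(verts[u] for u in nbrs)
    # splitting: every triangle becomes four
    new_faces = []
    for i, j, k in faces:
        eij = edge_vertex[frozenset((i, j))]
        ejk = edge_vertex[frozenset((j, k))]
        eki = edge_vertex[frozenset((k, i))]
        new_faces += [(i, eij, eki), (j, ejk, eij), (k, eki, ejk), (eij, ejk, eki)]
    return np.asarray(new_verts), new_faces

# repeated refinement of a tetrahedron: 4 -> 16 -> 64 -> 256 triangles
verts = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
faces = [(0, 1, 2), (0, 3, 1), (0, 2, 3), (1, 3, 2)]
for _ in range(3):
    verts, faces = loop_subdivide(verts, faces)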
Fig. 6.5. The Doo-Sabin subdivision scheme: (a) generate new vertices with respect to Doo-Sabin subdivision masks; (b) form new faces inside the old faces by connecting the generated vertices; (c) form new faces for each edge in the coarser mesh by connecting the four new vertices adjacent to an old edge; (d) form new faces for each vertex in the old mesh by connecting the new vertices adjacent to an old vertex
Fig. 6.6. The Catmull-Clark subdivision scheme: (a) generate new vertices for each face; (b) generate new vertices for each edge; (c) move each original vertex to a new location; (d) form new faces using the generated vertices
6.1.5 Point-based Representations Points were first introduced as rendering primitives by Levoy [19] in 1985. As new display elements, points are also known as surfels [20]. Due to their
(a)
(b)
(c)
(d)
√ Fig. 6.7. Results of applying various subdivision schemes to a cube: (a) 3Subdivision; (b) Loop Subdivision; (c) Doo-Sabin Subdivision; (d) Catmull-Clark Subdivision. The control mesh is the unit cube drawn in wireframe. Courtesy of Tekin Kabasakal
Due to their structural simplicity and flexibility, point samples are used to model shapes. Although point-based representations use more modeling primitives, the primitives are simple and do not require explicit connectivity or topology information, which makes these methods efficient alternatives to mesh-based representations. Point sets do not have a fixed continuity class, contrary to meshes, which have piecewise linear C^0 continuity. The continuity problem for meshes is handled by smoothing techniques such as Gouraud shading or subdivision operations. In contrast, point-based methods specify connectivity information implicitly through the spatial interrelation among the points [21]. Point-based modeling is in some sense similar to image-based modeling, as it takes different views of an object as input and reconstructs the surface. However, point samples require more geometric information than image pixels, and they are view-independent [22]. Moreover, the ease of insertion, deletion, and repositioning of point samples makes these techniques suitable for dynamic settings with frequent changes of model geometry [23].

Point-based representations can be grouped into two classes: piecewise constant point sampling and piecewise linear surface splats [24]. Studies in the first group include Point Set Surfaces (PSS) [21, 25, 26]. PSS represent shapes by taking a weighted average of the points. Normally, they can only be applied to regular samples because the weighting scheme is based on a spatial scale parameter. Adamson and Alexa [26] extend PSS to irregular settings by generalizing the weighting scheme. Fleishman et al. [21] describe a progressive scheme, which reduces the amount of data required and improves modeling and visualization. They develop a simplification scheme for point sets to construct a base point set that represents a smoother version of the original shape; they then perform adaptive surface refinement. Reconstructing continuous surfaces from irregularly-spaced point samples without losing visual quality is an important challenge for point-based methods. Moreover, hidden surface removal and transparency must be handled correctly. These difficulties have been overcome by the introduction of surface splats, first proposed by Zwicker et al. [27]. Surface splatting uses samples of the surface of an object to represent it [28]. Surface splats provide better visual quality and more efficiency by using an Elliptical Weighted Average (EWA) filter, which reduces aliasing artifacts. The performance limitations of this technique, which was originally purely software-based, have recently been overcome by utilizing the latest GPU technology. Botsch et al. discuss the capabilities of GPUs for hardware-based surface splatting in [24].

6.1.6 Volumetric Representations

Spatial subdivision techniques provide a natural way to represent solid objects and 3D scenes. These techniques simplify many calculations on solid objects and 3D scenes, such as boolean operations to create complex objects from simpler ones, collision detection for animation, ray/surface intersections for ray tracing, and occlusion detection for the visualization of urban scenery.
The only disadvantage of these techniques is the high storage cost, since a solid object or a 3D scene is represented using a three-dimensional array. This cost is alleviated by using an adaptive subdivision of space instead of a uniform one. The unit element of a three-dimensional space is called a voxel. One common method to represent solid objects or 3D scenes is to use octrees, which are hierarchical tree structures. The three-dimensional space is partitioned into eight regions (octants), where each region corresponds to a node of the tree structure, and each octant is further subdivided recursively if necessary. In the case of regular subdivision, the subdivision process terminates when a pre-defined depth is reached. In adaptive subdivision, the subdivision process terminates when an octant is completely unoccupied or the minimum cell resolution is reached. The nodes of the octree point to the parts of the scene, or of the solid object, contained in the part of space to which each node corresponds. The octree representation is shown in Fig. 6.8.

Another spatial subdivision method to represent solid objects and 3D scenes volumetrically is Binary Space Partitioning (BSP) trees. The main idea is to adaptively partition the space into two regions with a plane. BSP trees are more efficient than octrees as they reduce the tree depth. They are especially useful for applications that require the subdivision of space into regions containing an equal number of scene objects. Spatial subdivision techniques can be used for different types of object representations, including polygon meshes and surface patches. Different algorithms, such as intersection tests, traverse the octree structure recursively starting from the root. Details of spatial subdivision techniques can be found in [29].

Voxel-based representations are also used to reconstruct an environment from images obtained by multiple calibrated cameras. These representations generally use a regular 3D voxel array or an octree subdivision, and the 3D scene is represented as a set of occupied voxels. The voxels can be colored and transparent, and the surface normals associated with occupied voxels are stored for rendering purposes. Volume rendering techniques can be used to render such voxel-based 3D scenes. Unless the voxels are very small, rendering the surfaces of voxel-based 3D data produces a blocky appearance.
Fig. 6.8. The octree representation
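A minimal C++ sketch of the adaptive octree subdivision described above; the overlap test that distributes objects into child octants is application-specific and is only indicated by a comment.

#include <array>
#include <memory>
#include <vector>

struct AABB { float min[3], max[3]; };   // axis-aligned octant bounds

struct OctreeNode {
    AABB bounds;
    std::vector<int> objects;                         // indices of contained objects
    std::array<std::unique_ptr<OctreeNode>, 8> child; // all empty for a leaf

    // Adaptive subdivision: stop when the octant is (nearly) empty or the
    // minimum resolution (maximum depth) is reached.
    void subdivide(int depth, int maxDepth, std::size_t maxObjects) {
        if (depth >= maxDepth || objects.size() <= maxObjects) return;
        for (int i = 0; i < 8; ++i) {
            child[i] = std::make_unique<OctreeNode>();
            for (int axis = 0; axis < 3; ++axis) {
                float mid = 0.5f * (bounds.min[axis] + bounds.max[axis]);
                bool high = (i >> axis) & 1;   // the octant index encodes which half
                child[i]->bounds.min[axis] = high ? mid : bounds.min[axis];
                child[i]->bounds.max[axis] = high ? bounds.max[axis] : mid;
            }
            // ... distribute this node's objects into child[i] by an overlap
            // test, then recurse:
            // child[i]->subdivide(depth + 1, maxDepth, maxObjects);
        }
    }
};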
Thus, refinement techniques should be applied to the meshes describing the surface to obtain a plausible appearance. Some volumetric 3D reconstruction techniques compute an outer-bound approximation of the scene geometry, called the visual hull, from silhouette images [30, 31, 32]. These techniques are applicable to images where foreground-background segmentation at each reference view is possible. The silhouette is the 2D projection of the corresponding 3D foreground object. The parts of the object's surface that also lie on the surface of the visual hull can be reconstructed using silhouette-based approaches.
6.2 Animation

An illusion of motion is created when slightly different images are viewed in succession. Animation is the process of organizing and filming immobile objects to produce the images necessary to create such an illusion of movement. Animation techniques can be categorized into two main groups: traditional animation and computer animation. Cartoon films are the most widespread examples of traditional animation. They are produced by a method called cel animation, in which animators draw and paint each frame by hand. Cartoon films have been an important sector of the entertainment industry since the 1930s, a consequence of the success of the Walt Disney Studios.

The second animation category is computer animation, which can be further subdivided into two groups: computer-assisted animation and computer-generated animation [33]. Computer-assisted animation is the computer-aided counterpart of traditional 2D cel animation. Paper, paint, brushes, and various drawing materials are replaced by computers, scanners, cameras, mice, etc.; the computer is mainly used for cel painting and inbetweening. In this way, traditional cartoon animation can be performed more efficiently and economically. Computer-generated animation is also known as true computer animation, where images are generated by rendering a 3D model. Motion is produced by modifying the model over time. The models have various parameters, such as polygon vertex positions, spline knot positions, joint angles, muscle contraction values, colors, and camera parameters. Animation is performed by varying the parameters over time and rendering the models to generate the frames along the way [34].

Fundamental principles of traditional animation, such as squash and stretch, timing and motion, anticipation, staging, follow through and overlapping action, straight ahead action and pose-to-pose action, slow in and out, arcs, exaggeration, secondary action, and appeal [35], can be formalized and used as high-level constructs in computer animation systems. In this way, most of the burden of generating realistic animation is left to the computer, since the elements of an animated character move in harmony according to these constructs.
The application of these principles ensures that the characters have a personality appealing to the audience.

6.2.1 Hierarchical Approaches

Hierarchical modeling approaches store a 3D scene in the form of a tree or a graph structure. A very important property of these approaches is that they unify modeling and animation. These representations store, in the nodes of a graph or a tree, the primitive objects that make up the scene hierarchy, including the lights and cameras (specified in the objects' local coordinate systems), together with the transformations that place them in world coordinates. Representative examples of such hierarchical techniques are scene graph and scene tree representations. Virtual Reality Modeling Language (VRML), Java3D, and Open Scene Graph are widely used scene graph Application Programming Interfaces [36]. Figure 6.9 illustrates the scene tree representation of a 3D scene.

A transformation hierarchy is a modeling technique for representing articulated structures, such as humans and robots. It uses tree structures to represent articulated bodies: an intermediate node contains 3D transformation(s) that apply to all the children of that node, and the leaf nodes correspond to primitive objects. Hierarchical modeling is implemented using a matrix stack that stores the transformation matrices in the hierarchy. A recursive algorithm traverses the model hierarchy and calculates the composite transformations that correspond to the intermediate nodes. The algorithm pushes the composite transformation matrices of the intermediate nodes onto the stack so that they can be popped and re-used for the other branches of the same node. The primitives in the leaf nodes are drawn by applying the composite transformation sequence from the root to that node, as the sketch below illustrates.
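A minimal C++ sketch of this recursive traversal. Here the composite matrix passed by value plays the role of the matrix stack (the call stack pushes it for each branch and pops it on return, so siblings reuse the parent's composite), and drawPrimitive is a hypothetical renderer entry point.

#include <vector>

// Minimal 4x4 matrix, identity by default, with composition.
struct Mat4 {
    float m[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
    Mat4 operator*(const Mat4& o) const {
        Mat4 r{};
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float s = 0;
                for (int k = 0; k < 4; ++k) s += m[i*4 + k] * o.m[k*4 + j];
                r.m[i*4 + j] = s;
            }
        return r;
    }
};

struct Node {
    Mat4 local;                 // transformation relative to the parent node
    std::vector<Node> children; // empty => leaf holding a primitive
};

// Recursive traversal: compose the root-to-node transformation and draw the
// primitives at the leaves with the composite matrix.
void draw(const Node& n, const Mat4& parentWorld) {
    Mat4 world = parentWorld * n.local;   // "push" the composite for this branch
    if (n.children.empty()) {
        // drawPrimitive(n, world);       // hypothetical renderer call
    } else {
        for (const Node& c : n.children) draw(c, world);
    }
}                                         // returning "pops" the composite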
Fig. 6.9. (a) A 3D scene; (b) corresponding scene tree representation
Transformation hierarchies do not let the animator control the end-effectors of an articulated structure. They cannot handle closed kinematic chains, such as keeping the feet on the ground, nor can they handle general constraints. Although there are more sophisticated techniques for modeling articulated structures, such as inverse kinematics, hierarchical modeling remains a principal tool for modeling and animation [37].

6.2.2 Keyframing

One of the biggest problems in traditional cel animation is the necessity to draw and paint each frame by hand, which makes it highly labor-intensive. Lead animators, who want to work more efficiently, draw only the most important frames, called the keyframes; low-level animators then draw the remaining frames between the keyframes. Computer animation, on the other hand, uses the computer to generate both the keyframes and the inbetween frames. The keyframes of a bouncing ball can be seen in Fig. 6.10, where the ball is depicted on the ground, at the highest point, and on the ground again.

Inbetween frames can be generated by interpolation techniques. One such technique is linear interpolation. If linear interpolation is used to generate inbetweens for a moving object, the object moves with constant velocity between keyframes, and discontinuities and sudden jerks can be observed in the motion at the keyframes. To obtain smooth motion, curve interpolation techniques such as Hermite or B-spline curves can be used. Inbetweens generated with different interpolation techniques can be seen in Fig. 6.11; a code sketch follows the figure. A bouncing ball does not have the same velocity throughout its path; the closer it is to the ground, the faster it moves. Thus, to obtain more realistic results, it is not sufficient to specify the path alone; the velocity changes must be specified as well. In addition, various other properties of the object, such as its shape and color, may change during the motion. Figure 6.12 shows the motion of a deformable bouncing ball.

6.2.3 Physically-based Modeling and Animation

Methods used for modeling the shape and appearance of objects are not suitable for dynamic scenes where the objects are moving: the models do not interact with each other or with external forces. In real life, the behavior and form of many objects are determined by their physical properties, such as mass, damping, and the internal and external forces acting on the object.
Fig. 6.10. The keyframes from the animation of a bouncing ball
Fig. 6.11. The inbetweens from the animation of a deformable bouncing ball generated with different interpolation techniques: (a) linear interpolation; (b) spline interpolation
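A minimal C++ sketch of the two kinds of inbetweening; the Catmull-Rom spline is used here as one concrete choice of smooth (Hermite-type) curve interpolation.

struct Key { float t; float value; };   // keyframe time and animated parameter value

// Linear interpolation between the two keys bracketing time t: constant
// velocity between keys, with velocity discontinuities at the keys.
float lerpKeys(const Key& a, const Key& b, float t) {
    float u = (t - a.t) / (b.t - a.t);
    return (1 - u) * a.value + u * b.value;
}

// Catmull-Rom spline through keys p0..p3, evaluated between p1 and p2 for
// u in [0,1]: gives C^1-continuous motion through the keyframe values.
float catmullRom(float p0, float p1, float p2, float p3, float u) {
    return 0.5f * (2*p1 + (-p0 + p2) * u +
                   (2*p0 - 5*p1 + 4*p2 - p3) * u*u +
                   (-p0 + 3*p1 - 3*p2 + p3) * u*u*u);
}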
The rigidity (or deformability) of objects is determined by the elastic and inelastic properties (such as internal stresses and strains) of the material. If we want to animate objects realistically, we must model their physical properties so that they follow pre-defined trajectories and interact with the other objects in the environment, just like real physical objects. Physically-based techniques achieve this by adding physical properties to the models, such as forces, torques, velocities, accelerations, mass, damping, and kinetic and potential energies. Physical simulation is then used to produce animation based on these properties. To this end, the equations of motion must be solved, so that the course of a simulation is determined by the initial positions and velocities of the objects, and by the forces and torques applied to the objects as they move.
Fig. 6.12. The motion of a deformable bouncing ball
Today, physical simulations are widely used in the film industry and in game development, and there are efficient techniques to approximate the physics involved.

When several objects are simultaneously involved in a computer animation, we encounter the problem of detecting and controlling object interactions. In such an animation, we may have more than one object moving around, or we may have impenetrable obstacles (such as walls) that do not move. When no special attention is paid to object interactions, the objects will sail through each other; this is usually not physically reasonable and produces a disconcerting visual effect. Whenever two objects attempt to penetrate each other (i.e., the surface of one object comes into contact with the surface of another), a collision is said to occur [38, 39]. The general requirement that arises is the ability to detect collisions. Some animation systems at present do not provide even minimal collision detection; they require the animator to visually inspect the scene for object interactions and respond accordingly. This is time-consuming and difficult even for keyframe or parameter systems, where the user explicitly defines the motion; it is even worse for procedural and dynamic animation systems, where the motion is generated by functions and laws defining the objects' behavior. Although automatic collision detection is expensive to code and to run, it is a considerable convenience for animators, particularly when more automated methods of motion control, such as dynamics or behavioral control, are used. The other related issue is the response to a collision once it is detected. Even keyframe systems could benefit from automatic suggestions about the motion of objects immediately following a collision; animation systems using dynamic simulation must respond to collisions automatically and realistically. Linear and angular momentum must be preserved, and surface friction and elasticity must be reasonable. An elaborate discussion of collision detection and response can be found in [40, 41].

6.2.3.1 Constraint-based Methods of Animation

Constraints provide a unified method to build objects and to animate them: the models assemble themselves as their elements move to satisfy the constraints. Constraints provide a way to specify the behavior of physical objects in advance without specifying their exact positions, velocities, etc. In other words, constraints are partial descriptions of the objects' desired behavior. So, given a constraint, we must determine the forces needed to meet it and then the forces needed to maintain it. A good deal of research has been done on the use of constraint-based methods to create realistic animation [42, 43, 44, 45]. Many constraint-based modeling systems have been developed, including constraint-based models of the human skeleton [46] (in which the connectivity of segments and the limits of angular motion on joints are specified), energy constraints [47], and dynamic constraints [48].
Examples of constraints are the point-to-nail constraint, which fixes a point on a model to a user-specified location in space; the point-to-point (attachment) constraint, which attaches two points on different bodies to create complex models from simpler ones; the point-to-path constraint, which requires some points on a model to follow an arbitrary user-specified path; and the orientation constraint, which aligns objects by rotating them [48]. Figure 6.13 shows a cloth patch, constrained at two corners, waving under gravity and wind forces.

6.2.3.2 Deformable Models

Modeling the behavior of deformable objects is an important aspect of realistic animation. To simulate the behavior of deformable objects, we must approximate a continuous model using discretization techniques, such as finite difference and finite element methods. For finite difference discretization, a deformable object can be approximated by a grid of control points that are allowed to move in relation to one another; the manner in which the points are allowed to move determines the properties of the deformable object. For example, to obtain the effect of an elastic surface, the grid points can be connected by springs. In fact, mass-spring systems are one of the simplest, yet most effective, ways of representing deformable objects, and they are very popular. By changing the spring forces acting on the particles that comprise an object, different deformable behaviors can be simulated; a minimal sketch is given below. To animate nonrigid objects in a simulated physical environment, the methods of elasticity and plasticity theory can be employed. However, such techniques are computationally demanding. Elasticity theory provides methods to construct the differential equations that model the behavior of nonrigid objects as a function of time.
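A minimal C++ sketch of one time step of a damped mass-spring system under gravity, integrated with semi-implicit (symplectic) Euler as one simple choice of solver; springs are assumed to be non-degenerate (nonzero length).

#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(float s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }
float length(Vec3 v) { return std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z); }

struct Particle { Vec3 pos, vel; float mass; };
struct Spring   { int i, j; float rest, k; };   // endpoints, rest length, stiffness

void step(std::vector<Particle>& p, const std::vector<Spring>& springs,
          float damping, float dt) {
    std::vector<Vec3> f(p.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        f[i] = p[i].mass * Vec3{0.0f, -9.81f, 0.0f};   // gravity
    for (const Spring& s : springs) {
        Vec3 d = p[s.j].pos - p[s.i].pos;
        float len = length(d);
        Vec3 dir = (1.0f / len) * d;
        float mag = s.k * (len - s.rest);              // Hooke's law
        f[s.i] = f[s.i] + mag * dir;                   // equal and opposite forces
        f[s.j] = f[s.j] - mag * dir;
    }
    for (std::size_t i = 0; i < p.size(); ++i) {
        f[i] = f[i] - damping * p[i].vel;              // simple velocity damping
        Vec3 a = (1.0f / p[i].mass) * f[i];
        p[i].vel = p[i].vel + dt * a;                  // update velocity first,
        p[i].pos = p[i].pos + dt * p[i].vel;           // then position (symplectic)
    }
}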
Fig. 6.13. A cloth patch constrained at two corners, waving under gravity and wind forces
To simulate the dynamics of elastically deformable models, there are two well-known approaches: the primal formulation [49] and the hybrid formulation [50]. These formulations use concepts from elasticity and plasticity theory and represent deformations of the objects using quantities from differential geometry, such as metric and curvature tensors [51]. The primal formulation works better for highly deformable materials, since it can handle nonlinear deformations; the hybrid formulation is better for highly rigid materials, since it can only handle small deformations that can be represented linearly. To create animation with deformable models, the differential equations of motion must be discretized, and the system of linked ordinary differential equations obtained from the discretization must be solved, as described in [50]. The finite difference or finite element methods can be used for the discretization.

In addition to the approaches that use elasticity theory to model the shapes and motions of deformable models, there are other approaches. Witkin et al. formulate a model for nonrigid dynamics based on global deformations with relatively few degrees of freedom [42]; this model is restricted to simple linear deformations that can be formulated by affine transformations. In [52], Pentland and Williams describe the use of modal analysis to create simplified dynamic models of nonrigid objects. This approach breaks nonrigid dynamics down into a sum of independent vibration modes and reduces the dimensionality and stiffness of the models by discarding high-frequency modes. Another method, based on physics and optimization theory, uses mathematical constraint methods to create realistic animation of flexible models [44]. This method uses reaction constraints for the fast computation of collisions of flexible models with polygonal models, and augmented Lagrangian constraints for creating animation effects such as volume-preserving squashing and the molding of taffy-like substances; flexible objects are modeled with the finite element method. Thingvold and Cohen [53] define a model of elastic and plastic B-spline surfaces that supports both animation and design operations; the motion of their models is controlled by assigning different physical properties and kinematic constraints to various portions of the surface. Metaxas and Terzopoulos [54] propose an approach for creating dynamic solid models capable of realistic physical behaviors, starting from common solid primitives such as spheres, cylinders, cones, and superquadrics [9]. Such primitives can deform kinematically in simple ways; to gain additional modeling power, the primitives are allowed to undergo parameterized global deformations (bends, tapers, twists, shears, etc.). Even though the models' kinematic behavior is stylized by the particular solid primitives used, the models behave in a physically correct way with prescribed mass distributions and elasticities. Metaxas and Terzopoulos also propose efficient constraint methods for connecting the dynamic primitives to make articulated models.
6.3 Rendering

Rendering techniques in computer graphics try to model the interaction of light with the environment to generate pictures of scenes [55]. This ranges from implementations of the Phong illumination model, which is a first-order approximation of the rendering equation [56], to very sophisticated global illumination techniques. More realistic renderings of scenes can be obtained by using complex methods such as ray tracing [57, 58], radiosity [59], and photon mapping [60], which calculate object-to-object interreflections, transmission, etc. Rendering techniques to be used in a 3DTV framework must generate realistic pictures and must be amenable to real-time implementations. A detailed discussion of real-time rendering can be found in [61].

6.3.1 Reflection and Illumination Models

Reflection models define the interaction of light with a surface. They take into account the material properties of the surface and the nature of the incident light, such as its wavelength, the angle of incidence, etc. The reflective properties of materials are fully described by the Bidirectional Reflectivity Distribution Function (BRDF) [62]. The BRDF is the ratio of the radiance reflected from a surface in a particular direction to the irradiance arriving at the surface from another direction; each of the incoming and outgoing directions is represented with two angles (hence bidirectional). The BRDF is composed of specular, uniform diffuse, and directional diffuse components.

Illumination models define the nature of the light reflected from or refracted through a surface. Local illumination models only calculate the direct illumination from light sources on object surfaces; they do not consider object-to-object light interactions (reflections, transmissions, etc.). Light incident at a surface is reflected, scattered, absorbed, or transmitted. One of the most popular local illumination models used in computer graphics is the Phong illumination model. This model has three components:
• Ambient light: the amount of illumination in a scene that is assumed to come from any direction and is thus independent of the presence of objects, the viewer position, or the actual light sources in the scene.
• Diffuse reflection: the light reflected in all directions from a point on the surface of an object. It does not depend on the viewer's position.
• Specular reflection: the component of illumination seen at a surface point of an object that is produced by reflection about the surface normal. It depends on the viewer's position and appears as a highlight.
When there is a single light source in the environment, the Phong illumination model is composed of these three components as follows (see Fig. 6.14):
Fig. 6.14. Vectors used in the Phong illumination model
I = k_a i_a + [k_d (\mathbf{L} \cdot \mathbf{N}) i_d + k_s (\mathbf{R} \cdot \mathbf{V})^{n_s} i_s],   (6.10)
where
• i_a is the ambient intensity,
• i_d is the diffuse intensity of the light source,
• i_s is the specular intensity of the light source,
• k_a is the ambient reflection coefficient,
• k_d is the diffuse reflection coefficient,
• k_s is the specular reflection coefficient,
• N is the unit normal vector,
• L is the unit direction vector to the light,
• R is the unit reflection vector,
• V is the unit direction vector to the viewer,
• n_s is a shininess constant that determines how light is reflected from a shiny point; it is very high for highly specular objects, such as mirrors, which produces very bright but small highlights.
The vectors used in the model are illustrated in Fig. 6.14. When there are multiple light sources in a scene, the contributions from the individual sources are summed as:

I = k_a i_a + \sum_{l=1}^{n} [k_d (\mathbf{N} \cdot \mathbf{L}_l) i_{ld} + k_s (\mathbf{R}_l \cdot \mathbf{V})^{n_s} i_{ls}]   (6.11)
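A direct C++ transcription of (6.10) for a single light source; the vectors are assumed to be unit length, and the dot products are clamped to zero so that surfaces facing away from the light receive no contribution.

#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Phong intensity for one color channel and one light source, as in (6.10).
float phong(float ka, float kd, float ks, float ns,   // material coefficients
            float ia, float id, float is,             // light intensities
            Vec3 N, Vec3 L, Vec3 R, Vec3 V) {         // unit vectors of Fig. 6.14
    float diff = std::max(0.0f, dot(L, N));              // no light from behind
    float spec = std::pow(std::max(0.0f, dot(R, V)), ns); // shininess exponent
    return ka * ia + kd * diff * id + ks * spec * is;
}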
6.3.2 Rendering Techniques

Rendering techniques are classified into object-space and image-space techniques. Object-space techniques calculate the intensity of light at each point on an object surface (usually represented using polygonal approximations) and then interpolate the intensity inside each polygon. Flat shading, Gouraud shading [63], and Phong shading are in this category. They use local illumination models, e.g., the Phong illumination model [64], to calculate the intensities of points, and a scan-line approach to render the polygons.
Radiosity is also an object-space technique; however, it is a global illumination algorithm that solves the rendering equation only for diffuse reflections. In contrast to object-space techniques, image-space techniques calculate intensities for each pixel of the image. Ray tracing is an image-space rendering algorithm: it sends rays into the scene from the camera through each pixel and recursively calculates the intersections of these rays with the scene objects.

To render a 3D scene, its visible parts for different views must be calculated. This requires the implementation of hidden surface algorithms together with rendering methods. Some rendering algorithms, such as ray tracing and radiosity, handle the visible surface problem implicitly, while in others that use local illumination models, such as Gouraud and Phong shading, it must be handled explicitly.

Images containing uniformly shaded objects are not very realistic, since real objects have textures, bumps, scratches, and dirt on them. There are several rendering techniques that add realism to the rendering of uniformly shaded 3D scenes; texture mapping [65, 66], environment mapping [67], and bump mapping [68] are representative examples. Since scan-line renderers, such as Gouraud shading, are amenable to hardware implementations, they are more appropriate for the real-time display capabilities required for 3DTV than sophisticated rendering techniques, such as ray tracing and radiosity. Image-based rendering is a recent and promising approach to the rendering of 3D scenes; such techniques directly render new views of a scene from acquired images, thus eliminating the need for an explicit scene representation phase.

6.3.2.1 Scan-line Renderers

Scan-line rendering is one of the most popular methods due to its low computational cost. Hardware implementation enables the rendering of very complex models in real time, but even without hardware support, scan-line algorithms offer very good performance. Scan-line algorithms work in object-space by iterating over the polygons (mostly triangles) of scene objects. First, the frame buffer, which holds the pixel intensity values, and the z-buffer, which holds pixel depth values relative to the camera, are initialized. Next, the polygons are painted by projecting them onto the screen and filling them by scan-converting them into a series of horizontal spans. While iterating over the scan lines to paint a polygon, the intersection points of the scan line with the polygon edges are computed, and the horizontal spans inside the polygon are painted pixel by pixel. For each pixel inside a polygon, intensity and depth values are calculated in order to paint the pixel correctly. Depending on the z-buffer depth value of a pixel, it is either colored or skipped: if the depth of a polygon pixel is less than the value stored for the respective screen pixel in the z-buffer, the z-buffer is updated and the pixel is colored with the corresponding value in the frame buffer; otherwise, it is ignored.
Flat shading: Using a local illumination model, e.g., the Phong illumination model, we can calculate an intensity value for the RGB color components at a single position on each polygon, and then fill the entire projected polygon with that value. This method quickly generates a curved-surface appearance for an object approximated with polygons.

Gouraud shading: Flat shading generates intensity discontinuities along polygon edges. Although increasing the number of polygons that compose an object gives a smoother appearance under flat shading, it requires more computational power. Gouraud shading was developed to generate a smooth appearance for objects using only a small number of polygons; it linearly interpolates the intensity values across the surface of a polygon. The basic steps of Gouraud shading are as follows:

• The vertex normal vectors are calculated by averaging the face normals surrounding the vertex (see Fig. 6.15 (a)):

N_V = \frac{\sum_{i=1}^{n} N_i}{|\sum_{i=1}^{n} N_i|}   (6.12)

• An illumination model is applied to each vertex to calculate the vertex intensity (brightness).
• Each projected polygon is shaded using a modified scan-line polygon filling algorithm. Moving from scan line to scan line, the intensity values of the pixels are linearly interpolated for each projected polygon. Any number of quantities can be interpolated at this step; for instance, colored surfaces are rendered by interpolating the R, G, and B color components. Figure 6.15 (b) illustrates how the intensity values are interpolated along the edges of the polygon and across the pixels inside it.
Fig. 6.15. Gouraud shading: (a) calculating vertex normals from face normals; (b) linear interpolation (lerp) of the intensities along the polygon edges and interiors
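A minimal C++ sketch of the interpolation in Fig. 6.15 (b): once the intensities at the span endpoints D and E have been obtained by interpolating down the polygon edges, the pixels of the span are filled by a second linear interpolation.

float lerp(float a, float b, float u) { return (1 - u) * a + u * b; }

// Shade one horizontal span of a projected polygon: iD and iE are the
// intensities at the edge intersections of the current scan line, and
// row points at the frame buffer row being filled.
void shadeSpan(float* row, int xD, int xE, float iD, float iE) {
    for (int x = xD; x <= xE; ++x) {
        float u = (xE == xD) ? 0.0f : float(x - xD) / float(xE - xD);
        row[x] = lerp(iD, iE, u);   // interpolated intensity for this pixel
    }
}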
Gouraud shading is a simple and fast technique that is supported by most graphics accelerators today. It does, however, have some deficiencies resulting from the linear interpolation scheme; for instance, discontinuities appear as odd-looking bright or dark bands, called Mach bands, on the surface of the object. It also fails to give good results when the color changes quickly, e.g., at specular highlights.

Phong shading: The disadvantages of Gouraud shading are overcome by Phong shading. The basic steps of the Phong shading algorithm are as follows:

• The vertex normal vectors are calculated by averaging the surface normals surrounding the vertex. This step is the same as the first step of Gouraud shading.
• The vertex normals are linearly interpolated over the polygon surface.
• A modified version of the scan-line polygon filling algorithm is applied to render the projected polygons. An illumination model is used to calculate pixel intensities from the interpolated normal vectors.
Phong shading gives more accurate results than Gouraud shading; however, since the intensities are calculated explicitly for each pixel, it requires more computation. Object rendering techniques using local illumination models are illustrated in Fig. 6.16.

6.3.2.2 Ray Tracing

Ray tracing tries to imitate the light-object interactions in nature by modeling the behavior of photons emitted from light sources. When photons hit objects, they bounce, losing some of their energy; when they have lost most of their energy, they are absorbed. If the objects are transparent or translucent, some of the light energy is transmitted. To imitate the behavior of photons for photorealistic image synthesis, we must take into account the effect of the photons that hit the image plane and reach our eyes. These photons emanate from the light sources and arrive at the image plane after successive bounces off the objects in the scene, thus contributing to the intensity and color of the pixels in the image; photons that do not reach the image plane make no contribution to the image. The trajectories that photons follow can be modeled with rays. Backward ray tracing starts from the eye and sends rays through the pixels of the image plane, instead of following the rays emitted from the light sources, to avoid tracing rays that do not contribute to the image. The light intensity of an image pixel is determined by the rate at which photons hit it and by their energies; the color of a pixel is determined by the distribution of the wavelengths of the incoming photons. The rays sent from the viewer (camera) through the image pixels are called eye (pixel) rays. If such a ray hits a light source in the scene, we use the intensity of the light source to determine the intensity of the pixel.
Fig. 6.16. Object rendering using local illumination models. (a) wireframe; (b) flat shading; (c) Gouraud shading; (d) Phong shading
If the ray does not hit anything in the scene, we set the intensity of the pixel to zero. If we hit a surface point, we recursively follow more rays to determine where the light striking that surface point came from. This is done by sending a reflection ray in the specular reflection direction at that point (calculated from the incoming ray direction and the surface normal) and a transmission ray according to the theory of refraction (Snell's law is used to calculate the transmission ray direction). We also send illumination (shadow) rays to the light sources to determine whether the surface point sees a light source. We add the contributions coming from the reflection and transmission directions and the contributions of the light sources that see the point to find the intensity and color. The reflection and transmission rays are recursive rays, just like eye rays, in the sense that when they hit a surface point, new reflection and transmission rays are fired; illumination rays are not recursive. Figure 6.17 illustrates how backward ray tracing works [58], and a skeleton of the recursive procedure is sketched after the figure. Figure 6.18 depicts a ray-traced scene.

In ray tracing, most of the time is spent on intersection calculations. Different objects require different methods to find the intersections; ray/surface intersections can be found easily for objects whose implicit functions are known.
Fig. 6.17. Backward ray tracing. (a) An eye ray E sent from the eye to a pixel is traced through successive bounces in the scene. Reflection rays are labeled with R, transmission rays with T, and shadow rays with S. (b) Corresponding ray tree. Reprinted from [58] with permission. © 1988 Elsevier
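A C++ skeleton of this recursive procedure. The scene services (intersection, local shading with shadow rays, reflection and transmission ray construction) are stubbed out here and would be supplied by a real implementation, and the fixed reflection and transmission coefficients stand in for per-material values.

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };
struct Hit  { bool found; Vec3 point, normal; };

Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 scale(float s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }

// Stub scene services -- a real ray tracer supplies these.
Hit  intersectScene(const Ray&)              { return {false, {}, {}}; }
Vec3 shadeLocal(const Hit&, const Ray&)      { return {0, 0, 0}; } // incl. shadow rays
Ray  reflectRay(const Hit& h, const Ray&)    { return {h.point, h.normal}; }
Ray  transmitRay(const Hit& h, const Ray& r) { return {h.point, r.dir}; } // Snell's law

// Backward ray tracing: an eye ray enters at depth 0; reflection and
// transmission rays recurse until maxDepth (non-adaptive depth control).
Vec3 trace(const Ray& r, int depth, int maxDepth) {
    const float kr = 0.3f, kt = 0.2f;   // stand-ins for per-material coefficients
    Hit h = intersectScene(r);
    if (!h.found) return {0, 0, 0};     // ray leaves the scene
    Vec3 color = shadeLocal(h, r);      // direct illumination via shadow rays
    if (depth < maxDepth) {
        color = add(color, scale(kr, trace(reflectRay(h, r),  depth + 1, maxDepth)));
        color = add(color, scale(kt, trace(transmitRay(h, r), depth + 1, maxDepth)));
    }
    return color;
}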
Techniques proposed to accelerate ray tracing generally try to make intersection tests faster by using bounding boxes, or to reduce the number of intersection tests by utilizing bounding volume hierarchies and spatial coherence schemes. To make ray-object intersection tests faster, simple bounding volumes enclosing the objects are tested against the rays first; only if a ray intersects the bounding volume is the real ray-object intersection test performed. Spatial coherence schemes first preprocess the scene to construct a spatial subdivision structure, such as a regular 3D grid (Spatially Enumerated Auxiliary Data Structure, SEADS) [69], uniform or adaptive octrees [70], or Binary Space Partitioning (BSP) trees [58]; the objects in the scene are stored in the nodes of the spatial subdivision structure. The ray tracing algorithm then makes intersection tests only for the objects in the nodes of the spatial subdivision structure that the ray passes through.
Fig. 6.18. An image generated with ray tracing. Courtesy of Okan Arıkan
Two other important acceleration techniques for ray tracing are adaptive depth control [71] and first-hit speed-up [72]. Ray tracing produces a ray tree for each eye ray, whose depth increases with each reflection and transmission that does not leave the scene. Since rays at low levels of the tree contribute little to the image, adaptive depth control stops firing reflection and transmission rays when the computed contribution for a point falls below a certain threshold. This is checked at an intersection point by multiplying the specular reflection and transmission coefficients for the intersections up to that point and comparing the product with a pre-defined threshold. Even for highly reflective scenes, the average ray tree depth does not exceed two if adaptive depth control is used. Since most of the intersection calculations are performed in the first step, Weghorst et al. [72] proposed using a z-buffer algorithm as a pre-processing step to determine the first hit; the ray tracing algorithm is then executed using the intersection points for the objects stored in the z-buffer. Ray tracing can only handle specular reflections and point light sources (although there are variations of ray tracing, like distributed ray tracing, that increase the realism of the rendering by adding spatial anti-aliasing, soft shadows, and depth-of-field effects, firing more rays and distributing the ray origins and directions statistically based on probability distribution functions) [73].

6.3.2.3 Radiosity

The main motivation for radiosity is to accurately model diffuse object-to-object reflections, since most real environments consist mainly of objects that reflect light diffusely: a very large proportion of the light energy comes from direct illumination from light sources and from diffuse reflections. For photorealistic image synthesis, the physical behavior of light must be modeled. Since the intensity and distribution of light are governed by energy transfer and conservation principles, these must be taken into account to accurately simulate light transport between light sources and materials in a scene [59].

Radiosity is a method for determining the intensity of light diffusely reflected within an environment. It is an object-space algorithm that solves for the intensity at discrete points or surface patches within the environment; the solution is thus independent of the viewer position. The radiosity solution (the intensities of the patches in the environment) is then input to a rendering algorithm (such as Gouraud shading) to compute the image for a particular view position. This final phase does not require much computation, and different views are easily obtained from the view-independent solution. This makes radiosity very attractive for applications such as architectural walkthroughs, where the geometry is fixed but the viewer position is dynamic [74, 75].
The main assumption of the method is that all the surfaces in the scene are perfect diffuse (Lambertian) reflectors. Unlike ray tracing, radiosity also assumes that the surfaces in the scene are decomposed into polygonal patches. Light sources and other objects are treated uniformly; the patches may be emitters (area light sources) or other objects that do not emit light. Radiosity, B, is defined as the energy leaving a surface patch per unit area per unit time, and is the sum of the emitted and reflected energy. The radiosity B_i of a patch i is given by

B_i \, dA_i = E_i \, dA_i + R_i \int_j B_j F_{dA_j dA_i} \, dA_j,   (6.13)
The form factor, F_{dA_j dA_i}, determines the fraction of the energy leaving dA_j that arrives at dA_i; the integral is over all patches j in the environment. R_i is the fraction of the incident light that is reflected from patch i in all directions, called the reflectivity of patch i. We can discretize an environment into n patches and assume that the radiosity and emittance are constant over each patch. If we replace F_{A_j A_i} by F_{ji} to simplify the notation, the radiosity of a discrete patch is given by

B_i A_i = E_i A_i + R_i \sum_{j=1}^{n} B_j F_{ji} A_j   (6.14)
The reciprocity relationship between two patches is given by

F_{ij} A_i = F_{ji} A_j \quad \text{and} \quad F_{ij} = F_{ji} \frac{A_j}{A_i}   (6.15)
Then, the radiosity equation becomes

B_i = E_i + R_i \sum_{j=1}^{n} B_j F_{ij}   (6.16)
For an environment containing n patches, we have a linear system of equations for the radiosities of the patches:

\begin{bmatrix}
1 - R_1 F_{11} & -R_1 F_{12} & \cdots & -R_1 F_{1n} \\
-R_2 F_{21} & 1 - R_2 F_{22} & \cdots & -R_2 F_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
-R_n F_{n1} & -R_n F_{n2} & \cdots & 1 - R_n F_{nn}
\end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}
=
\begin{bmatrix} E_1 \\ E_2 \\ \vdots \\ E_n \end{bmatrix}   (6.17)
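Because the matrix of (6.17) is diagonally dominant (as noted below), the system can be solved by simple iteration. A minimal C++ sketch of one Gauss-Seidel-style gathering sweep over B_i = E_i + R_i Σ_j F_ij B_j; repeated sweeps converge to the radiosity solution.

#include <vector>

// One gathering sweep: each patch i gathers radiosity from all patches j
// and is updated in place (F_ii = 0 for planar or convex patches, so the
// self term contributes nothing). B holds the current radiosity estimates,
// E the emittances, R the reflectivities, and F the n x n form factors.
void gatherSweep(std::vector<float>& B, const std::vector<float>& E,
                 const std::vector<float>& R,
                 const std::vector<std::vector<float>>& F) {
    const std::size_t n = B.size();
    for (std::size_t i = 0; i < n; ++i) {
        float gathered = 0.0f;
        for (std::size_t j = 0; j < n; ++j)
            gathered += F[i][j] * B[j];   // radiosity gathered from patch j
        B[i] = E[i] + R[i] * gathered;    // update patch i in place
    }
}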
The emittance values (E_i) are non-zero only for light sources, and the reflectivities (R_i) are known. The form factors F_{ij} are calculated from the geometry of the patches. The form factors for a patch can be calculated analytically by placing a hemisphere around the patch and using the relative orientation of, and distance from, this patch to the other patches. However, this is only possible for very simple geometries.
In most cases, approximation methods, such as the hemi-cube approach [76], are used to calculate the form factors. Note that the form factors F_{ii} are zero for planar or convex patches. Since the form factors from a patch to all other patches add up to one (\sum_{j=1}^{n} F_{ij} = 1) and R_i is always less than one, the matrix of the linear system (6.17) is diagonally dominant, so the iterative solution is guaranteed to converge [75].

The classical radiosity algorithm computes the radiosity of the patches one at a time by gathering the radiosities from all other patches; with this approach, it is not possible to obtain an intermediate solution for the patches while the algorithm runs. A variant of the algorithm, called progressive refinement radiosity [77], updates the radiosities of all patches in the scene by shooting the radiosity of one patch to all other patches. In this way, the radiosities of all the patches are updated simultaneously, and intermediate solutions can be obtained while the algorithm runs; if these partial solutions are rendered, the scene is lit progressively. The idea can be elaborated further by sorting the patches by their emittance values: if the patches with higher emittance values (the light sources) are processed first by shooting their radiosities to the other patches, very good approximations of the final image can be obtained in the early steps.

Hierarchical radiosity is another improvement that reduces the computational complexity of the classical radiosity algorithm [78]. The dominant term in the complexity comes from the form factor calculations, which are O(n^2) for a scene containing n patches, since the form factors from each patch to all other patches must be computed. During the solution, hierarchical radiosity computes the light interactions between separated groups of patches (clusters) as a single interaction. It starts with a set of coarse initial patches and forms a quadtree with respect to the form factor estimates. Some of the patches are then subdivided on the fly according to the form factor estimates and brightness values, and the radiosity solution is refined. Figure 6.19 shows two scenes rendered using hierarchical radiosity.

There are attempts to combine ray tracing and radiosity. Wallace et al. describe a multi-pass method in which an extended radiosity solution is applied in the first pass and a ray tracing solution in the second pass. The method successfully calculates, to some extent, the effects of the different light transport mechanisms: diffuse-to-diffuse, diffuse-to-specular, specular-to-specular, and specular-to-diffuse. It makes certain assumptions about the rendered scenes, e.g., that the number of specular surfaces is limited and that they cannot see each other, in order to prevent infinite reflections [79].

6.3.2.4 Photon Mapping

Photon mapping is a newer approach to the global illumination of scenes that makes realistic rendering more affordable. It uses forward ray tracing (i.e., sending rays from the light sources) to calculate the reflection and refraction of the photons.
Fig. 6.19. Images of the University of California, Berkeley Soda Hall (Rooms 380 and 420) generated with hierarchical radiosity. Courtesy of Ali Kemal Sinop
It is a two-step process (distributing the photons and then rendering the scene) that works for arbitrary geometric representations, including parametric and implicit surfaces; ray-surface intersections are calculated on demand. Figure 6.20 shows an image generated with photon mapping.

6.3.2.5 Image-based Rendering

Unlike the approaches described above, which render a 3D scene composed of objects modeled with different geometric modeling techniques, image-based rendering (IBR) directly renders a scene from pre-acquired photographs. High-quality visualization results can be obtained, depending on both the quality and the quantity of the reference images.
Fig. 6.20. An image of the Cornell Box generated with photon mapping. Courtesy of Atılım Çetin
The main motivation for IBR is to reduce the modeling bottleneck, since the creation of an object or scene model is a highly demanding task and it is expensive to represent all the surface details with geometric primitives.

The roots of IBR date back to texture and environment mapping techniques. In addition to their original function of approximating reflections of the environment on a surface, environment maps are also used to display an outward-looking view of the environment from a fixed location with varying orientation [80]. Chen [81] uses such a technique, employing 360-degree cylindrical panoramic images to construct a virtual environment; camera panning and zooming are simulated by digitally warping the images. Unfortunately, interpolation between two images by warping fails where previously occluded areas become visible. Another interpolation approach is to use corresponding feature points between two images and to compute the depth of each pixel using the camera positions. Chen and Williams [82] describe the view interpolation technique, which morphs between adjacent images to create an image for a new inbetween viewpoint. The method uses the camera's position and orientation and the range data of the images to determine a pixel-by-pixel correspondence between the images; the correspondence maps between two successive images are computed and stored as a pair of morph maps, and precomputing the morph provides efficiency. Another method, Layered Depth Images, solves the occlusion problem by associating more than one depth value with a pixel; these values correspond to the depth of each surface layer that a ray through the pixel intersects [83].

The 5D function that describes the intensity of light observed from every position and direction in 3D space is called the plenoptic function [84]. It is defined as p = P(θ, φ, V_x, V_y, V_z), where (V_x, V_y, V_z) represents a point in space, θ the azimuth angle, and φ the elevation angle. It is also possible to include a time parameter in the plenoptic function for dynamic scenes. IBR aims to reconstruct the plenoptic function from a set of images. In fact, in computer graphics terminology, the plenoptic function describes the set of all possible environment maps for a given scene [85]. Once this function is obtained, the reconstruction of the scene becomes straightforward.

Levoy and Hanrahan propose a technique called Light Field Rendering, based on the idea of interpreting the input images as 2D slices of the light field, a 4D function based on the plenoptic function [80]. The light field characterizes the radiance as a function of position and direction in unobstructed space; generating new views corresponds to extracting and resampling a slice. The Lumigraph is a similar method that also uses a 4D function, which is a subset of the plenoptic function [86].
The Lumigraph enables the generation of new images of an object independent of the geometric complexity or illumination conditions of the scene or object. McMillan and Bishop [85] also present an IBR system based on the sampling, reconstruction, and resampling of the plenoptic function.

IBR only requires the acquisition of photographs; scene and object representation is therefore comparatively easy [87]. Standard geometric and lighting techniques sometimes lack the proper models to simulate some real-world shading and appearance effects. Since IBR methods do not require explicit geometric models to render real-world scenes, they can reproduce real-world shading and appearance effects faithfully without having to model them explicitly. Although IBR methods have significant memory requirements (e.g., light fields) and their computational complexity is very high, the computational cost of interactively viewing the scene is independent of the complexity of the scene. Moreover, IBR techniques can also combine real-world photographs with computer-generated images to be used as pre-acquired images. Thus, with all these advantages, IBR is a promising approach for a 3DTV framework. However, many challenges, such as feature correspondence, camera calibration, and the construction of plenoptic functions, must be addressed for IBR to be applicable as a general rendering technique for complex dynamic scenes [88].

6.3.2.6 Volume Rendering

Volumetric data contains scalar values at 3D locations in space. The set of locations for which the volume data are defined determines the type of the volumetric data: if the scalar values are defined on a regular 3D array of locations, the data can be represented as a structured grid, where the connectivity between the vertices is defined implicitly; if the distribution of data points does not follow a regular pattern, the connectivity of the vertices must be defined explicitly. Such unstructured grids are generally represented using tetrahedral cells.

Volume rendering techniques are classified as direct and indirect. Indirect volume rendering methods, such as Marching Cubes [89], extract an intermediate geometric representation of the surfaces from the volume data and render it using surface rendering methods. Indirect methods are faster and more suitable for applications where the visualization of the surfaces of the volume data is important. Visual hull techniques can also be regarded as an indirect volume rendering approach, since they extract and render the surface of the scene geometry. Direct volume rendering techniques render the volume data without generating an intermediate representation, thus facilitating the visualization of the inside of a material, such as partially transparent body fluids. Structured volume data can be visualized directly in real time using special-purpose hardware [90].
Direct volume rendering algorithms for unstructured grids are classified as image-space, object-space, and hybrid. Image-space methods traverse image-space by casting a ray through each pixel; the ray is followed inside the volume to sample and composite the volume data along the ray. In object-space methods, the volume is traversed in object-space and the cells are depth-sorted with respect to the current viewpoint; the cells are then projected onto the image plane in sorted order and their contributions are composited. In hybrid methods, the volume is traversed in object order and the contributions of the cells to the final image are accumulated in image order [91]. The compositing step at the core of image-space methods is sketched below.

Volumetric datasets can also be rendered using advanced per-pixel operations available in the rasterization stage of the graphics hardware [92]. Although this approach works for both structured and unstructured grids, it is much more successful for structured grids, since it can use 3D textures. In both cases, it avoids any polygonal representation by using per-pixel operations. Currently, volume visualization techniques are most commonly used in medical imaging and in scientific simulations, such as computational fluid dynamics or geophysical simulations. Improvements in 3D scene capture technologies and high-performance computers make the acquisition of volume data easy, so volume visualization techniques can be used in other applications, such as 3DTV. However, this requires real-time implementations of these techniques, and although there is specialized hardware for the direct volume rendering of structured data, direct volume rendering of unstructured data, namely of tetrahedral mesh representations, is still far from real time.
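A minimal C++ sketch of the front-to-back compositing performed along each ray in image-space methods; the samples are assumed to have already been classified (scalar values mapped to color and opacity by a transfer function).

#include <vector>

struct RGBA { float r, g, b, a; };   // classified color and opacity of a sample

// Composite the samples taken along one viewing ray, front to back: each
// sample contributes in proportion to the transparency accumulated so far,
// and the traversal terminates early once the opacity saturates.
RGBA composite(const std::vector<RGBA>& samples) {
    RGBA out{0, 0, 0, 0};
    for (const RGBA& s : samples) {
        float w = (1.0f - out.a) * s.a;   // remaining transparency times opacity
        out.r += w * s.r;
        out.g += w * s.g;
        out.b += w * s.b;
        out.a += w;
        if (out.a > 0.99f) break;         // early ray termination
    }
    return out;
}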
6.4 Conclusions

3D shape modeling is an indispensable component of scene representation for 3DTV. Dynamic mesh representations provide a suitable way of representing 3D shapes. Polygonal meshes are amenable to hardware implementations and are thus suitable for a 3DTV framework, where real-time performance is required. Volumetric representations are a good alternative for 3DTV because images acquired from multiple calibrated cameras provide the necessary information for volumetric models. Point-based representations are also promising for 3DTV because the results of 3D data acquisition methods, such as laser scans, already represent the scene in a point-based manner.

Since scenes mostly contain dynamic objects, modeling the motion becomes important. Animation techniques that have potential for real-time implementation are promising approaches for a 3DTV framework. Today, physically-based modeling and animation techniques are widely used in the film industry and in game development, and there are efficient techniques to approximate the physics involved; thus, these techniques have potential for use in 3DTV.

Since scan-line renderers are amenable to hardware implementations, they are more appropriate for the real-time display capabilities required for 3DTV than sophisticated rendering techniques.
Point-based software renderers can realistically render models containing millions of points within a second. Image-based rendering is a very successful and promising rendering scheme for 3DTV, as it directly makes use of the captured images. However, many challenges must be addressed for IBR to be applicable as a general rendering technique for complex dynamic scenes.
Acknowledgment

This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. A. H. Barr, "Global and local deformations of solid primitives," ACM Computer Graphics (Proc. SIGGRAPH'84), Vol. 18, No. 3, pp. 21–30, Jul. 1984.
2. T. W. Sederberg and S. R. Parry, "Free-form deformation of solid geometric models," ACM Computer Graphics (Proc. SIGGRAPH'86), Vol. 20, No. 4, pp. 151–160, Aug. 1986.
3. D. Hearn and P. Baker, Computer Graphics with OpenGL, 3rd Edition. Englewood Cliffs, NJ: Prentice Hall, 2003.
4. B. Mandelbrot, The Fractal Geometry of Nature. New York: W. H. Freeman, 1982.
5. P. Prusinkiewicz and A. Lindenmayer, The Algorithmic Beauty of Plants (The Virtual Laboratory). Springer, 1996.
6. H. Hoppe, "Progressive meshes," ACM Computer Graphics (Proc. SIGGRAPH'96), pp. 99–108, Jul. 1996.
7. ——, "View-dependent refinement of progressive meshes," ACM Computer Graphics (Proc. SIGGRAPH'97), pp. 189–198, Jul. 1997.
8. D. Luebke and C. Erikson, "View-dependent simplification of arbitrary polygonal environments," ACM Computer Graphics (Proc. SIGGRAPH'97), pp. 199–208, Jul. 1997.
9. A. H. Barr, "Superquadrics and angle-preserving transformations," IEEE Computer Graphics and Applications, Vol. 1, No. 1, pp. 11–23, Jan. 1981.
10. R. Bartels, J. Beatty, and B. Barsky, An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Los Altos, CA: Morgan Kaufmann, 1987.
11. P. Bézier, Numerical Control — Mathematics and Applications. London: John Wiley & Sons, 1972.
12. E. Catmull and J. Clark, "Recursively generated B-spline surfaces on arbitrary topological meshes," Computer-Aided Design, Vol. 10, No. 6, pp. 350–355, 1978.
13. D. Doo and M. Sabin, "Behaviour of recursive subdivision surfaces near extraordinary points," Computer-Aided Design, Vol. 10, No. 6, pp. 356–360, 1978.
14. C. Loop, "Smooth subdivision surfaces based on triangles," Master's thesis, Department of Mathematics, University of Utah, 1987.
15. N. Dyn, D. Levin, and J. A. Gregory, "A butterfly subdivision scheme for surface interpolation with tension control," ACM Trans. on Graphics, Vol. 9, No. 2, pp. 160–169, 1990.
16. L. Kobbelt, "√3-Subdivision," ACM Computer Graphics (Proc. of SIGGRAPH'00), pp. 103–112, 2000.
17. D. Zorin and P. Schröder, "Subdivision for modeling and animation," ACM SIGGRAPH Course Notes, 2000.
18. A. Lee, H. Moreton, and H. Hoppe, "Displaced subdivision surfaces," ACM Computer Graphics (Proc. SIGGRAPH'00), pp. 85–94, Jul. 2000.
19. M. Levoy and T. Whitted, "The use of points as a display primitive," University of North Carolina at Chapel Hill, Tech. Rep. TR-85-022, 1985.
20. H. Pfister, M. Zwicker, J. van Baar, and M. Gross, "Surfels: Surface elements as rendering primitives," ACM Computer Graphics (Proc. of SIGGRAPH'00), pp. 335–342, 2000.
21. S. Fleishman, D. Cohen-Or, M. Alexa, and C. Silva, "Progressive point set surfaces," ACM Trans. on Graphics, Vol. 22, No. 4, pp. 997–1011, 2003.
22. J. Grossman and W. Dally, "Point sample rendering," in Proceedings of Eurographics Rendering Workshop, pp. 181–192, 1998.
23. M. Pauly, L. Kobbelt, and M. Gross, "Point-based multiscale surface representation," ACM Trans. on Graphics, Vol. 25, No. 2, pp. 177–193, 2006.
24. M. Botsch, A. Hornung, M. Zwicker, and L. Kobbelt, "High-quality surface splatting on today's GPUs," in Proceedings of Eurographics Symposium on Point-Based Graphics, pp. 17–24, 2005.
25. M. Alexa, J. Behr, D. Cohen-Or, S. Fleishman, D. Levin, and C. Silva, "Point set surfaces," in Proceedings of IEEE Visualization'01, pp. 21–28, 2001.
26. A. Adamson and M. Alexa, "Anisotropic point set surfaces," in Proceedings of AFRIGRAPH'06, 2006.
27. M. Zwicker, H. Pfister, J. van Baar, and M. Gross, "Surface splatting," ACM Computer Graphics (Proc. of SIGGRAPH'01), pp. 371–378, 2001.
28. S. Rusinkiewicz and M. Levoy, "QSplat: A multiresolution point rendering system for large meshes," in ACM Computer Graphics (Proc. of SIGGRAPH'00), pp. 343–352, 2000.
29. H. Samet, "The quadtree and related hierarchical data structures," ACM Computing Surveys, Vol. 16, No. 2, pp. 187–260, 1984.
30. A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, No. 2, pp. 150–162, 1994.
31. J.-M. Hasenfratz, M. Lapierre, J.-D. Gascuel, and E. Boyer, "Real-time capture, reconstruction and insertion into virtual world of human actors," in Proceedings of Eurographics Vision, Video and Graphics Conference, pp. 49–56, 2003.
32. G. Slabaugh, W. Culbertson, T. Malzbender, M. Stevens, and R. Schafer, "Methods for volumetric reconstruction of visual scenes," International Journal of Computer Vision, Vol. 57, No. 3, pp. 179–199, 2004.
33. R. Parent, Computer Animation: Algorithms and Techniques. Los Altos, CA: Morgan Kaufmann, 2001.
34. A. Witkin, "Animation," in Computer Graphics I Course Notes, School of Computer Science, Carnegie Mellon University, 1995.
35. J. Lasseter, "Principles of traditional animation applied to 3D computer animation," ACM Computer Graphics (Proc. SIGGRAPH'87), Vol. 21, No. 4, pp. 35–44, Jul. 1987.
36. E. Angel, Interactive Computer Graphics: A Top-Down Approach Using OpenGL. Addison-Wesley, 2006.
37. A. Witkin, "Hierarchical modeling," in Computer Graphics I Course Notes, School of Computer Science, Carnegie Mellon University, 1995.
38. D. Baraff, "Analytical methods for dynamic simulation of non-penetrating rigid bodies," ACM Computer Graphics (Proc. SIGGRAPH'89), Vol. 23, No. 3, pp. 223–232, Jul. 1989.
39. M. Moore and J. Wilhelms, "Collision detection and response for computer animation," ACM Computer Graphics (Proc. SIGGRAPH'88), Vol. 22, No. 4, pp. 289–298, Aug. 1988.
40. P. Jiménez, F. Thomas, and C. Torras, "3D collision detection: A survey," Computers & Graphics, Vol. 25, No. 2, pp. 269–285, 2001.
41. M. C. Lin and S. Gottschalk, "Collision detection between geometric models: A survey," in Proceedings of IMA Conference on Mathematics of Surfaces, pp. 37–56, 1998.
42. A. Witkin, M. Gleicher, and W. Welch, "Interactive dynamics," ACM Computer Graphics (Proc. SIGGRAPH'90), Vol. 24, No. 4, pp. 11–22, Aug. 1990.
43. A. Witkin and W. Welch, "Fast animation and control of nonrigid structures," ACM Computer Graphics (Proc. SIGGRAPH'90), Vol. 24, No. 4, pp. 243–252, Aug. 1990.
44. J. Platt and A. H. Barr, "Constraint methods for flexible models," ACM Computer Graphics (Proc. SIGGRAPH'88), Vol. 22, No. 4, pp. 279–288, Aug. 1988.
45. A. Witkin and M. Kass, "Spacetime constraints," ACM Computer Graphics (Proc. SIGGRAPH'88), Vol. 22, No. 4, pp. 159–168, Aug. 1988.
46. N. I. Badler, K. H. Manoochehri, and G. Walters, "Articulated figure positioning by multiple constraints," IEEE Computer Graphics and Applications, Vol. 7, No. 6, pp. 39–51, Nov. 1987.
47. A. Witkin, K. Fleischer, and A. H. Barr, "Energy constraints on parameterized models," ACM Computer Graphics (Proc. SIGGRAPH'87), Vol. 21, No. 4, pp. 225–232, Jul. 1987.
48. R. Barzel and A. H. Barr, "A modeling system based on dynamic constraints," ACM Computer Graphics (Proc. SIGGRAPH'88), Vol. 22, No. 4, pp. 179–188, Aug. 1988.
49. D. Terzopoulos, J. Platt, A. H. Barr, and K. Fleischer, "Elastically deformable models," ACM Computer Graphics (Proc. SIGGRAPH'87), Vol. 21, No. 4, pp. 205–214, Jul. 1987.
50. D. Terzopoulos and A. Witkin, "Physically based models with rigid and deformable components," IEEE Computer Graphics and Applications, Vol. 8, No. 6, pp. 41–51, Nov. 1988.
51. M. P. do Carmo, Differential Geometry of Curves and Surfaces. Englewood Cliffs, NJ: Prentice-Hall, 1974.
52. A. Pentland and J. Williams, "Good vibrations: Modal dynamics for graphics and animation," ACM Computer Graphics (Proc. SIGGRAPH'89), Vol. 23, No. 3, pp. 215–222, Jul. 1989.
53. J. A. Thingvold and E. Cohen, "Physical modeling with B-spline surfaces for interactive design and animation," ACM Computer Graphics (Proc. SIGGRAPH'90), Vol. 24, No. 4, pp. 129–137, Aug. 1990.
54. D. Metaxas and D. Terzopoulos, "Dynamic deformation of solid primitives with constraints," ACM Computer Graphics (Proc. SIGGRAPH'92), Vol. 26, No. 2, pp. 309–312, Jul. 1992.
55. D. Rogers, Procedural Elements for Computer Graphics (2nd Edition). Boston, MA: McGraw-Hill, 1997.
56. J. T. Kajiya, "The rendering equation," ACM Computer Graphics (Proc. SIGGRAPH'86), Vol. 20, No. 4, pp. 143–150, 1986.
57. T. Whitted, "An improved illumination model for shaded display," Communications of the ACM, Vol. 23, No. 6, pp. 343–349, 1980.
58. A. Glassner (editor), An Introduction to Ray Tracing. Academic Press, 1989.
59. C. Goral, K. Torrance, D. Greenberg, and B. Battaile, "Modeling the interaction of light between diffuse surfaces," ACM Computer Graphics (Proc. SIGGRAPH'84), pp. 213–222, 1984.
60. H. W. Jensen, Realistic Image Synthesis Using Photon Mapping. Addison-Wesley, 2001.
61. T. Akenine-Möller and E. Haines, Real-Time Rendering (2nd Edition). Natick, MA: A. K. Peters, Ltd., 2002.
62. F. Nicodemus, "Reflectance nomenclature and directional reflectance and emissivity," Applied Optics, Vol. 9, pp. 1474–1475, 1970.
63. H. Gouraud, "Continuous shading of curved surfaces," IEEE Trans. on Computers, Vol. C-20, No. 6, pp. 623–629, Jun. 1971.
64. B. T. Phong, "Illumination for computer generated pictures," Communications of the ACM, Vol. 18, No. 6, pp. 311–317, 1975.
65. J. F. Blinn and M. E. Newell, "Texture and reflection in computer generated images," Communications of the ACM, Vol. 19, No. 10, pp. 542–547, 1976.
66. P. Heckbert, "Survey of texture mapping," IEEE Computer Graphics and Applications, Vol. 6, No. 11, pp. 56–67, Nov. 1986.
67. N. Greene, "Environment mapping and other applications of world projections," IEEE Computer Graphics and Applications, Vol. 6, No. 11, pp. 21–29, 1986.
68. J. F. Blinn, "Simulation of wrinkled surfaces," ACM Computer Graphics (Proc. SIGGRAPH'78), Vol. 12, No. 3, pp. 286–292, Aug. 1978.
69. A. Fujimoto, T. Tanaka, and K. Iwata, "ARTS: Accelerated ray-tracing system," IEEE Computer Graphics and Applications, Vol. 6, No. 4, pp. 16–26, 1986.
70. B. Pradhan and A. Mukhopadhyay, "Adaptive cell division for ray tracing," Computers & Graphics, Vol. 15, No. 4, pp. 549–552, 1991.
71. R. A. Hall and D. P. Greenberg, "A testbed for realistic image synthesis," IEEE Computer Graphics and Applications, Vol. 3, No. 8, pp. 10–20, 1983.
72. H. Weghorst, G. Hooper, and D. P. Greenberg, "Improved computational methods for ray tracing," ACM Trans. on Graphics, Vol. 3, No. 1, pp. 52–69, 1984.
73. R. L. Cook, T. Porter, and L. Carpenter, "Distributed ray tracing," ACM Computer Graphics (Proc. SIGGRAPH'84), pp. 137–145, 1984.
74. I. Ashdown, Radiosity: A Programmer's Perspective. John Wiley & Sons, 1994.
75. A. Watt and M. Watt, Advanced Animation and Rendering Techniques. Addison-Wesley, 1992.
76. M. F. Cohen and D. P. Greenberg, "The hemicube: A radiosity solution for complex environments," ACM Computer Graphics (Proc. SIGGRAPH'85), Vol. 19, No. 3, pp. 31–40, 1985.
77. M. F. Cohen, E. C. Chen, J. R. Wallace, and D. P. Greenberg, "A progressive refinement approach to fast radiosity image generation," ACM Computer Graphics (Proc. SIGGRAPH'88), Vol. 22, No. 4, pp. 75–84, 1988.
78. P. Hanrahan, D. Salzman, and L. Aupperle, "A rapid hierarchical radiosity algorithm," ACM Computer Graphics (Proc. of SIGGRAPH'91), Vol. 25, No. 4, pp. 197–206, 1991.
79. J. R. Wallace, M. F. Cohen, and D. P. Greenberg, "A two-pass solution to the rendering equation: A synthesis of ray tracing and radiosity methods," ACM Computer Graphics (Proc. SIGGRAPH'87), Vol. 21, No. 4, pp. 311–320, 1987.
80. M. Levoy and P. Hanrahan, "Light field rendering," ACM Computer Graphics (Proc. SIGGRAPH'96), pp. 31–42, 1996.
81. S. Chen, "QuickTime VR – An image-based approach to virtual environment navigation," ACM Computer Graphics (Proc. of SIGGRAPH'95), pp. 29–38, 1995.
82. S. Chen and L. Williams, "View interpolation for image synthesis," ACM Computer Graphics (Proc. of SIGGRAPH'93), pp. 279–288, 1993.
83. J. Shade, S. Gortler, L.-W. He, and R. Szeliski, "Layered depth images," ACM Computer Graphics (Proc. SIGGRAPH'98), pp. 231–242, 1998.
84. E. Adelson and J. R. Bergen, "The plenoptic function and the elements of early vision," in Computational Models of Visual Processing. Cambridge, MA: MIT Press, 1991.
85. L. McMillan and G. Bishop, "Plenoptic modeling: An image-based rendering system," ACM Computer Graphics (Proc. SIGGRAPH'95), pp. 39–46, 1995.
86. S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, "The Lumigraph," ACM Computer Graphics (Proc. SIGGRAPH'96), pp. 43–54, 1996.
87. V. Popescu, "Forward rasterization: A reconstruction algorithm for image-based rendering," Ph.D. dissertation, Department of Computer Science, University of North Carolina at Chapel Hill, 2001.
88. S. B. Kang and H.-Y. Shum, "A review of image-based rendering techniques," in Proceedings of IEEE/SPIE Visual Communications and Image Processing (VCIP), pp. 2–13, 2000.
89. W. Lorensen and H. Cline, "Marching cubes: A high resolution 3D surface construction algorithm," ACM Computer Graphics (Proc. SIGGRAPH'87), Vol. 21, No. 4, pp. 163–169, Jul. 1987.
90. H. Pfister, J. Hardenbergh, J. Knittel, H. Lauer, and L. Seiler, "The VolumePro real-time ray-casting system," ACM Computer Graphics (Proc. SIGGRAPH'99), Vol. 33, pp. 245–250, 1999.
91. H. Berk, C. Aykanat, and U. Güdükbay, "Direct volume rendering of unstructured grids," Computers & Graphics, Vol. 27, No. 3, pp. 387–406, 2003.
92. R. Westermann and T. Ertl, "Efficiently using graphics hardware in volume rendering applications," ACM Computer Graphics (Proc. of SIGGRAPH'98), Vol. 32, No. 4, pp. 169–179, 1998.
7 Modeling, Animation, and Rendering of Human Figures

Uğur Güdükbay, Bülent Özgüç, Aydemir Memişoğlu, and Mehmet Şahin Yeşil

Department of Computer Engineering, Bilkent University, 06800, Bilkent, Ankara, Turkey
Human body modeling and animation has long been an important and challenging area in computer graphics. The reason for this is two-fold. First, the human body is so complex that no current model comes even close to its true nature. Second, our eyes are so sensitive to human figures that we can easily identify unrealistic body shapes or body motions.

Today many fields use 3D virtual humans in action: video games, films, television, virtual reality, ergonomics, medicine, biomechanics, etc. We can classify all these applications into three categories: film production for arts and entertainment; real-time applications, such as robotics, video games, and virtual environments; and simulations, such as computer-aided ergonomics for the automotive industry, virtual actors, biomedical research, and military simulations. The type of application determines the complexity of the models. For example, video games or virtual reality applications require the lowest possible computational cost relative to the capabilities of the model. For biomedical research, however, realism is essential and the animated model should obey physical laws. Hence, the models are designed and animated according to the specific area in which they are applied.

Humans are an indispensable part of dynamic 3D scenes. Therefore, human face and body specific representations and animation techniques should be heavily used in a 3DTV framework to achieve the goals of real-time implementation and realism. Techniques of 3D motion data collection, such as motion capture, can be incorporated in human model animation. Continuous video and motion recording at high sampling rates produce huge amounts of data; transmitting only keyframes, from which continuous motion can be regenerated using interpolation techniques, will reduce the size of the transmitted data significantly.

To study human modeling and animation, many techniques based on kinematics, dynamics, biomechanics, and robotics have been developed by researchers. In order to produce realistic animations, rendering is also an inseparable part of the process. Furthermore, hair, garments, interaction of multiple avatars, expression of feelings, behavior under extreme conditions (such as accidents,
deep sea diving, etc.) and many more real-life experiences make the problem as complicated as one's imagination.

The human body has a rigid skeleton. This is not the case with some other living, artificial, or imaginary objects. If the animation aims at a particular instance of bone fracture, perhaps for an orthopedic simulation, then the rules suddenly change. As long as the subject excludes these non-articulated body behaviors, there is a reasonable starting point: a skeleton, which is an articulated object with joints and rigid elements. It is natural, then, to assume that if a proper motion is given to the skeleton, one can build up the rest of the body on top of it. Layers include muscles, skin, hair, and garments that can be realistically rendered based on the skeleton motion, plus some external forces, such as wind and gravity, to add more realism, at least to hair and garments. This obviously is a reverse way of looking at things; it is the muscles that expand or contract to give motion to the skeleton. But if the ultimate aim is to generate a visually realistic animation, and if the muscles can be accurately modeled, the order in which the forces originate can be reversed. This makes skeletal motion the starting source of animation.

It is very difficult to fit all the aspects of human modeling and animation into the limited scope of a book chapter. Thus, this chapter discusses some aspects of human modeling, animation, and rendering, with an emphasis on multi-layered human body models and motion control techniques for walking behavior.
7.1 Articulated Body Models

Since the 1970s, researchers have proposed several different approaches for the realistic modeling of the human body and its movements. Human body modeling first consists of the basic structural modeling. This includes the definition of joints and segments, their positions and orientations, and the hierarchy between these components. It also includes defining the volume of the body, which is composed of muscles, fat, and skin. The second part of the problem, simulating human motion, is a complex task. It is very difficult to take into account all the interactions with the environment involved in even a simple movement. A realistic human model should provide accurate positioning of the limbs during motion, realistic skin deformations based on muscles and other tissues, realistic facial expressions, realistic hair modeling, etc.

In the early stages, humans were represented as stick figures, simple articulated bodies made of segments and joints. These articulated bodies were simulated using methods based on kinematics. More recently, dynamic methods have been used to improve the realism of the movement. However, since the human body is a collection of rigid and non-rigid components that are very difficult to model, dynamic and kinematic models did not meet the need. Consequently, researchers began to use human anatomy to produce human
models with more realistic behaviors. The models proposed can be divided into four categories: stick figure models, surface models, volume models, and multi-layered models (see Fig. 7.1).

7.1.1 Stick Figure Models

Early studies on human body modeling and animation date back to the seventies. Badler and Smoliar [1] and Herbison-Evans [2] proposed 3D human models based on volumetric primitives, such as spheres, prisms, or ellipsoids. The technological limitations allowed stick figures with only a few joints and segments and simple geometric primitives. These models are built by using a hierarchical set of rigid segments connected at joints. The complexity of these articulated figures depends on the number of joints and segments used. The motions were usually specified as a set of hierarchical transformations, controlled by the joint constraints so that the members do not break away from each other. Studies on directed motions of articulated figures by Korein [3] and the stick figure model by the Thalmanns [4] are representative of this category.

7.1.2 Surface Models

Surface models were proposed as an improvement on the stick models. A new layer, representing human skin, was introduced in addition to the skeleton layer [5]. Therefore, this model is based on two layers: a skeleton, which is the backbone of the character animation, and a skin, which is a geometric envelope of the skeleton layer. Deformations in the skin layer are governed by the motion of the skeleton layer. The skin layer can be modeled by using points and lines, polygons (used in Rendez-vous à Montréal [6]), and curved surface patches (e.g., Bézier, Hermite, and B-splines). In the case of polygons, the skin layer is deformed by attaching each mesh vertex to a specific joint. In this way, the motion of the skin layer follows the motion of the articulated structure.
Fig. 7.1. Taxonomy of articulated body models
However, muscle behavior is not modeled in this approach, and body parts may disconnect from each other during some motions. In spite of these deficiencies, these models are still very common in Web applications.

A solution to the skin deformation problem is to use a continuous deformation function based on joint angles. This method was first used by Komatsu [7] to deform the control points of biquartic Bézier patches. The Thalmanns introduced the concept of Joint-dependent Local Deformation (JLD) [8]. In both approaches, the skin is deformed algorithmically. First, the skin vertices are mapped to the corresponding skeleton segments in order to limit the influence of the joint connecting the segments. Next, a function of the joint angles is used to deform the vertices. These studies showed that specialized algorithms may help to achieve more realistic skin deformations, but there are two limitations. First, an algorithmic deformation is basically a mathematical approximation, mostly far away from the physical behavior of a model under various forces; thus algorithmic deformations cannot always accurately describe complex skin deformations of a real human body. Second, a graphic designer cannot easily direct the deformations because they are specified via an algorithm.

Stitching is the process of attaching a continuous mesh to a bone structure. In rigid-body animation, polygons are attached to the bones. The polygons are transformed by changing the matrix representing the corresponding bone. In stitching, each vertex of a polygon can be attached to a different bone. Therefore, each vertex can be transformed by a different matrix representing the bone to which the vertex is attached. Breaking up the skin in this way, so that the vertices are in the local space of the bone to which they are attached, simplifies the process of stitching. This type of attachment enables us to create a single polygon that "stitches" multiple bones by simultaneously attaching different vertices to different bones. A polygon must fill the gap formed as a result of a manipulation of the bones [9].

Although stitching is a useful technique, it has some problems. Unnatural geometries appear during extreme joint rotations. For example, rotating a forearm by 120 degrees using the stitching technique results in a shear effect at the elbow. A solution is to allow a vertex to be affected by more than one bone. This is called full skinning and is compatible with the behavior of the human body. In a real human, the skin on the elbow is not affected by a single bone but by both the upper and lower arms. In order to implement this, we must know the bones affecting each skin vertex and a weight for each bone specifying the amount of the effect. The position of each vertex is calculated using (7.1) [9]:

\text{new vertex position} = \sum_{i=0}^{N-1} \text{weight}_i \times \text{matrix}_i \times \text{vertex position},   (7.1)

where weight_i is the weight for bone i and matrix_i is the matrix used to transform the vertices attached to bone i. In the case of linear skinning, the sum of all weights is 1.0.
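A minimal sketch of (7.1) for a planar two-bone "elbow" is given below; the bone transforms, vertex positions, and weights are invented for illustration:

#include <cstdio>
#include <cmath>

struct Vec2 { float x, y; };

// A 2D rigid transform: rotation by 'angle' around 'pivot'.
struct Bone {
    Vec2 pivot;
    float angle;
    Vec2 apply(Vec2 v) const {
        float c = std::cos(angle), s = std::sin(angle);
        float dx = v.x - pivot.x, dy = v.y - pivot.y;
        return { pivot.x + c * dx - s * dy, pivot.y + s * dx + c * dy };
    }
};

int main() {
    Bone bones[2] = {
        {{0, 0}, 0.0f},                    // upper arm: fixed
        {{1, 0}, 1.0f}                     // forearm: bent at the elbow
    };

    // Skin vertices near the elbow with blended weights; the weights
    // of each vertex sum to 1.0 (linear skinning).
    Vec2  verts[3]      = {{0.8f, 0.1f}, {1.0f, 0.1f}, {1.2f, 0.1f}};
    float weights[3][2] = {{0.9f, 0.1f}, {0.5f, 0.5f}, {0.1f, 0.9f}};

    for (int v = 0; v < 3; ++v) {
        Vec2 p = {0, 0};
        for (int b = 0; b < 2; ++b) {      // Eq. (7.1): sum over bones of
            Vec2 t = bones[b].apply(verts[v]); // weight_b * matrix_b * vertex
            p.x += weights[v][b] * t.x;
            p.y += weights[v][b] * t.y;
        }
        std::printf("vertex %d: (%.2f, %.2f) -> (%.2f, %.2f)\n",
                    v, verts[v].x, verts[v].y, p.x, p.y);
    }
    return 0;
}

Because the blended weights interpolate between the two bone transforms, the vertices near the joint follow both bones partially, which is exactly what avoids the shear effect described above.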
Skinning is a simple technique that has a very low computational cost. For this reason, it is used in video games. Current research focuses on improving the skinning procedure by increasing the speed of computations. Sun et al. [10] use the concept of the normal-volume, i.e., they reduce the computational cost by mapping a high-resolution mesh onto a lower-resolution control mesh. In this way, the high-resolution object can be deformed by skinning the low-resolution surface control structure. Singh and Kokkevis choose surface-based Free-Form Deformations (FFDs) to deform the skin [11]. A very useful property of surface-oriented control structures is that they bear a strong resemblance to the geometry they deform. In addition, they can be automatically constructed from deformable objects.

Assigning weights is a semi-automatic process that requires a huge amount of human intervention; this significantly limits the skinning technique. In addition, a combination of weights for highly mobile portions of the skeleton may be very appropriate for one position of the skeleton, but the same weights may not be acceptable for another position. Therefore, there is no single combination of weights that provides an acceptable result for all parts of the body. In spite of these deficiencies, the skinning method remains one of the most popular techniques for skin deformation because of its simplicity.

Using predefined keyshapes is another approach for skin deformation [12, 13]. Keyshapes are triangular meshes in some skeletal positions. They are obtained via a digitization procedure [14]. The idea behind this technique is that new shapes are created by interpolation or extrapolation. The deformation-by-keyshapes technique differs from 3D morphing algorithms in that it is limited to smooth interpolation problems. However, this approach does not have the deficiencies of the skinning techniques, and it performs better than multi-layer deformation models. There is no limitation on the number of keyshapes, making the technique quite flexible.

7.1.3 Volumetric Models

Controlling the surface deformation across joints is the major problem of surface models. In volume models, simple volumetric primitives like ellipsoids, spheres, and cylinders are used to construct the body shape. A good example of volume models is metaballs. Metaballs are volumes that can join each other based on a function of nearness. They can do a better job than surface models, but it is really hard to control a large number of primitives during animation.

In the very early stages of computer graphics, volumetric models were built from geometric primitives such as ellipsoids and spheres to approximate the body shape. These models were constrained by the limited computer hardware available at the time.
Along with the advances in computer hardware technology, implicit surfaces were introduced as an improvement on these early models. Today, volumetric models are often able to handle collisions.

An implicit surface is also known as an iso-surface. It is defined by a function that assigns a scalar value to each 3D point in space. An iso-surface is then extracted from the level set of points that are mapped to the same scalar value. Skeletons, which are constructed from points and lines, are the source of the scalar field. Each skeleton produces a potential field whose distribution is determined by the field function. For the field function, it is common to use a high-order polynomial of the distance to the skeleton (generally higher than 4th order). This approach is known as the metaball formulation. Because they are naturally smooth, implicit surfaces are often used in the representation of organic forms. One of the first examples of this method is the "blobby man" created by Blinn [15]. It is generated from an implicit surface that is constructed using an exponentially decreasing field function. In a later study, Yoshimoto shows that a complete, realistic-looking, virtual human body can be created with metaballs at a low storage cost [16]. A more complicated implicit formulation is introduced by Bloomenthal [17].

Implicit surfaces have many properties that promote successful body modeling. Their most useful property is continuity, which is the main requirement for obtaining realistic shapes. There are two more advantages worth mentioning: first, due to the compact formulation of the field functions, little memory is required and, second, they are simple to edit since they are defined by point or polygon skeletons. However, undesired blending may be observed during animations. Hybrid techniques that are mixtures of surface deformation models and implicit surfaces have been proposed as a solution to this problem [18, 19].

Some volumetric models are adept at handling collisions between different models or different parts of the same model and at generating deformed surfaces in parallel. Elastic properties are included in the formulation of distance-based implicit surfaces by Cani-Gascuel [20]. In this work, a correspondence between radial deformation and the reaction force is established. A non-linear finite element model of a human leg derived from the Visible Human Database [21] was recently proposed by Hirota et al. [22]. It also achieves a high level of realism in the deformation.
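To make the metaball formulation concrete, the following is a minimal sketch of a summed scalar field probed along a line; the centers, radii, iso-value, and the particular polynomial falloff are invented for illustration and are not taken from the works cited above:

#include <cstdio>

struct Ball { float cx, cy, cz, R; };

// A common polynomial falloff, zero beyond radius R, giving smooth blends.
float field(const Ball& b, float x, float y, float z) {
    float dx = x - b.cx, dy = y - b.cy, dz = z - b.cz;
    float r2 = (dx * dx + dy * dy + dz * dz) / (b.R * b.R);
    if (r2 >= 1.0f) return 0.0f;
    float t = 1.0f - r2;
    return t * t * t;
}

int main() {
    // Two overlapping metaballs approximating a fusiform muscle belly.
    Ball balls[2] = {{0.0f, 0, 0, 0.6f}, {0.5f, 0, 0, 0.6f}};
    const float iso = 0.25f;     // the skin is the level set: field == iso

    // Probe along the x axis; sign changes of (field - iso) bracket
    // the implicit surface.
    for (float x = -0.8f; x <= 1.3f; x += 0.1f) {
        float f = field(balls[0], x, 0, 0) + field(balls[1], x, 0, 0);
        std::printf("x=%5.2f  field=%.3f  %s\n", x, f,
                    f >= iso ? "inside" : "outside");
    }
    return 0;
}

Because the two fields sum in their overlap region, the iso-surface blends smoothly between the primitives, which is the property that makes metaballs attractive for organic shapes and, as noted above, also the source of undesired blending during animation.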
7.1.4 Multi-layered Models

Lasseter emphasized that computers provide the advantage of building up an animation in layers to create complex movements [23]. The animator specifies different kinds of constraint relationships between different layers. Then, s/he can control the global motion from a higher level. The most common approach is to start with the skeleton and then add the muscles, skin, hair, and other components. A skeleton layer, a muscle layer, and a skin layer are the most common. This layered modeling technique is heavily used in human modeling [24]. In this approach, motion control techniques are applied to the skeleton layer and the other layers move accordingly.
The layered approach has been accepted both in the construction and animation of computer-generated characters. There are two types of models. The first relies on a combination of ordinary computer-graphics techniques, like skinning and implicit surfaces, and tends to produce a single layer from several anatomical layers [24]. The other group, inspired by the actual biology of the human body, tries to represent and deform every major anatomical layer and to model their dynamic interplay [25, 26].

The skeleton layer is an articulated structure that provides the foundation for controlling movement [19]. Sometimes, the articulated structure is covered by material bones approximated by simple geometric primitives [22, 27, 28]. Studies on the skeleton layer focus on the accurate characterization of the range of motion of specific joints. Different models of joint limits have been suggested in the literature: Korein uses spherical polygons as boundaries for spherical joints like the shoulder [3], whereas Maurel and Thalmann use joint sinus cones for the shoulder and scapula joints [29].

In the early muscle models, the foundations for the muscles and deformations used in a muscle layer are based on free-form deformations (FFDs) [30]. Muscles construct a relationship between the control points of the FFDs and the joint angles. This method has two limitations: the possibility that the FFD box does not closely approximate the muscle shape, and the fact that FFD control points have no physical meaning. Moccozet models the behavior of the hand muscles using Dirichlet free-form deformation; this is a generalization of FFD that provides more local control of the deformation and removes the strong limitation imposed on the shape of the control box [31].

Since implicit surfaces provide the most appropriate method for modeling organic forms, they are widely used to model the muscle layer. In [32], implicit primitives like spheres and super-quadrics are used to approximate muscles. On the other hand, in [19], the gross behavior of bones, muscles, and fat is approximated by grouped ellipsoidal metaballs with a simplified quadratic field function. This technique does not produce realistic results for the highly mobile parts of the human body, in which each ellipsoidal primitive is simultaneously influenced by several joints.

The muscles that significantly affect the appearance of human models are the fusiform muscles in the upper and lower parts of the legs and arms. These muscles have a fleshy belly, tapering at both extremities. Since an ellipsoid gives a good approximation of the appearance of a fusiform muscle, muscle models tend to use the ellipsoid as the basic building block when the deformation is purely geometric. Moreover, the analytic formulation of an ellipsoid provides scalability. When the primitive is scaled along one of its axes, the volume of the primitive can easily be preserved by adjusting the scaling parameters of the two remaining axes. Following a similar approach, the height-to-width ratio can be kept constant. This is the reason why a volume-preserving ellipsoid is used to represent a fusiform muscle in [25, 26].

A polyline called an "action line" is introduced by Nedel and Thalmann [33]. The action line is used for abstracting muscles. It represents the force produced
by the muscle on the bones and on a surface mesh deformed by an equivalent mass-spring network. A noteworthy feature of this mass-spring system is the introduction of angular springs that smooth out the surface and control the volume variation of the muscle.

B-spline solids can also be used for modeling muscles, as described by Ng-Thow-Hing and Fiume [34]. These have the ability to capture multiple muscle shapes (fusiform, triangular, etc.). They use three curves to characterize the attachment of the musculotendon onto a skeleton: origin, insertion, and axial. The origin and insertion curves represent the attachment areas of the tendons to the bones. The axial curve represents the line of action of the muscle.

Polygonal [22, 24, 25], parametric [19, 26, 35], subdivision [18, 36], and implicit [17, 37] surfaces have been used for modeling the skin. Polygonal surfaces are processed directly by the graphics hardware and are the best choice when speed and/or interactivity are needed. However, some surface discontinuities, which need to be smoothed out, may arise when polygonal surfaces are used. Parametric surfaces yield very smooth shapes. This makes them very attractive for modeling the skin. Shen and Thalmann [19] derive a lower-degree polynomial field function for the inner layers. Implicit surfaces are very good at representing organic forms. The main limitation of implicit surfaces is that it is difficult or impossible to apply texture maps for realistic rendering. Therefore, they are very seldom used to directly extract a skin but are used frequently for invisible anatomical layers. Subdivision surfaces best represent the skin layer. They have several advantages: (i) smoothness can be guaranteed by recursively subdividing the surface, (ii) a polygonal version suitable for rendering is automatically derived without further computations, and (iii) interpolating schemes can be used [36].

There are three ways of deforming the skin in multi-layered models [38]:

• Surface deformation models are applied to the skin; the skin is then projected back onto the inner anatomical layers.
• A mechanical model is used to deform the skin while keeping the skin a certain distance away from the material beneath.
• The skin is defined as the surface of a volumetric finite element (mass-spring) model of the body.
7.2 Exchangeable Articulated Models

Although modeling the basic structure of the articulated figure is a seemingly trivial part of human modeling, it becomes a challenging task when there is no standard for it. The H-Anim 1.1 Specification is the established standard for human body modeling; it defines the geometry and the hierarchical structure of the human body [39]. The Humanoid Animation (H-Anim) standard was developed by the Humanoid Animation Working Group of the Web3D Consortium to define interchangeable human models. The H-Anim standard specifies how to define
humanoid forms and behaviors in standard Extensible 3D Graphics/Virtual Reality Modeling Language (X3D/VRML). This group had flexibility as a goal, so no assumptions are made about the types of applications that will use humanoids. They also had a goal of simplicity, which led them to focus specifically on humanoids instead of trying to deal with arbitrary articulated figures.

In the H-Anim 1.1 Specification, the human body is represented by a number of segments (such as the forearm, hand, and foot) that are connected to each other by joints (such as the elbow, wrist, and ankle). The H-Anim structure contains a set of nodes to represent the human body. The nodes are Joint, Segment, Site, Displacer, and Humanoid. Joint nodes represent the joints of the body and are arranged in a strictly defined hierarchy. They may contain other Joint and Segment nodes. Segment nodes represent a portion of the body connected to a joint. They may contain Site and Displacer nodes. Site nodes are placements for cloth and jewelry; they can also be used as end-effectors for inverse kinematics applications (see Subsection 7.3.1). Displacer nodes are simply grouping nodes, allowing the programmer to identify a collection of vertices as belonging to a functional group for ease of manipulation. The Humanoid node stores information about the model. It acts as a root node for the body hierarchy and stores all the references to Joint, Segment, and Site nodes. An H-Anim compliant human body is in the "at rest" position: all the joint angles are zero and the humanoid faces the +z direction, with +y being up and +x to the left of the humanoid according to the right-handed coordinate system.

A simple XML data format to represent the human skeleton can be defined by conforming to the hierarchy of joints and segments of the body as named in the H-Anim 1.1 Specification. XML's structured and self-descriptive format provides an excellent method for describing the skeleton. Front and side views of the skeleton are given in Fig. 7.2(a) and a portion of the XML representation for the skeleton is given in Fig. 7.2(b).

Although the data format can be used for representing the full skeleton specified in the H-Anim standard, the complete hierarchy is too complex for most applications. In real-time applications such as games, a simplified hierarchy is more suitable. The H-Anim 1.1 Specification proposes four "Levels of Articulation" that contain subsets of the joints. The body dimensions and levels of articulation are suggested for information only and are not specified by the H-Anim standard. Levels of articulation are suggested both for simplicity and compatibility. Animators can share their animations and humanoids if they conform to the same level of articulation. "Level of Articulation Zero" has just a root joint. "Level of Articulation One" represents a typical low-end real-time 3D hierarchy that does not contain information about the spine and has a shoulder complex with insufficient detail. "Level of Articulation Two" contains much of the necessary information about the spine, skull, and hands. "Level of Articulation Three" represents the full H-Anim hierarchy.
Fig. 7.2. (a) Front and side views of the skeleton and (b) a portion of the XML representation for the skeleton
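Figure 7.2(b) itself is not reproduced here. A hypothetical fragment in the spirit of the format it describes might look like the following; the element and attribute names and the coordinate values are invented for illustration, although the joint and segment names mirror the H-Anim 1.1 hierarchy. The nesting of joint elements expresses the skeleton hierarchy, with segment elements attached to their parent joints:

<!-- hypothetical format; element names and coordinates are illustrative -->
<humanoid name="ExampleBody">
  <joint name="HumanoidRoot" center="0 0.82 0">
    <joint name="sacroiliac" center="0 0.91 0">
      <joint name="l_hip" center="0.10 0.91 0">
        <joint name="l_knee" center="0.10 0.49 0">
          <joint name="l_ankle" center="0.10 0.08 0">
            <segment name="l_foot"/>
          </joint>
          <segment name="l_calf"/>
        </joint>
        <segment name="l_thigh"/>
      </joint>
      <segment name="pelvis"/>
    </joint>
  </joint>
</humanoid>

Because the nesting parallels the joint hierarchy, such a file can be parsed directly into the tree of transformations used by the animation system.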
7.3 Motion Control Techniques

The different approaches of biomechanics, robotics, animation, ergonomics, and psychology are integrated to produce realistic motion-control techniques. Motion control techniques can be classified into two groups according to the level of abstraction at which the motion is specified: low-level and high-level. In low-level motion control, the user manually specifies the motion parameters, such as positions, angles, forces, and torques. In high-level motion control, the motion is specified in abstract terms, such as "walk", "run", "grab that object", or "walk happily" [40]. In animation systems using high-level motion control, the low-level motion-planning and control tasks are performed by the machine. The animator simply changes some parameters to obtain different kinds of solutions. To generate realistic animations, both kinds of motion control techniques should be used in an integrated manner [41]. Human motion control techniques can be classified as kinematics, dynamics, and motion capture.

7.3.1 Kinematics

Kinematics, which originates from the field of robotics, is one of the approaches used in motion specification and control. Kinematics is the study of motion in terms of position, velocity, and acceleration; it does not consider the underlying forces that produce the motion. Kinematics-based techniques animate
the articulated structures by changing the orientations of joints over time. Motion is controlled by the management of joint transformations over time.

In forward kinematics, the global position and orientation of the root of the hierarchy and the joint angles are directly specified to obtain different postures of an articulated body. The motion of the end-effector (e.g., the hand in the case of the arm) is determined by the joint transformations from the root of the skeleton to the end-effector. Mathematically, forward kinematics is expressed as

x = f(\Theta),   (7.2)

where \Theta is the set of joint angles in the chain and x is the end-effector position. After the joint transformations are calculated, the final position of the end-effector is found by multiplying the transformation matrices in the hierarchy. For example, in the case of the leg, the position of the foot is calculated by using the joint angles of the hip and knee. In order to work on articulated figures, Denavit and Hartenberg developed a matrix notation, called DH notation, to represent the kinematics of articulated chains [42]. DH notation is a link-based notation where each link is represented by four parameters: \Theta, d, a, and \alpha. For a link, \Theta is the joint angle, d is the distance from the origin, a is the offset distance, and \alpha is the offset angle. The relations between the links are represented by 4 × 4 matrices. Sims and Zeltzer [43] proposed a more intuitive method, called the axis-position (AP) representation. In this approach, the position and orientation of the joint and the pointers to the joint's child nodes are used to represent the articulated figure.
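To make (7.2) concrete, the following minimal sketch computes the foot position of a planar two-joint leg from the hip and knee angles by composing the joint rotations down the chain; the link lengths and angles are invented for illustration:

#include <cstdio>
#include <cmath>

int main() {
    const float L1 = 0.4f, L2 = 0.35f;     // thigh and calf lengths (assumed)
    float theta1 = 0.6f, theta2 = -0.9f;   // hip and knee angles (radians)

    // x = f(theta): rotations accumulate down the hierarchy, so the
    // calf direction depends on both the hip and the knee angle.
    float kneeX = L1 * std::cos(theta1);
    float kneeY = L1 * std::sin(theta1);
    float footX = kneeX + L2 * std::cos(theta1 + theta2);
    float footY = kneeY + L2 * std::sin(theta1 + theta2);

    std::printf("knee: (%.3f, %.3f)  foot: (%.3f, %.3f)\n",
                kneeX, kneeY, footX, footY);
    return 0;
}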
(7.3)
Figure 7.3(a) shows the joint orientations for the end-effector positioning of the right arm, and Fig. 7.3(b) shows goal-directed motion of the arm. Inverse kinematics is mostly used in robotics. Contrary to forward kinematics, inverse kinematics provides direct control over the movement of the end-effector. On the other hand, inverse kinematics problems are difficult to solve compared to forward kinematics, where the solution is found easily by multiplying the local transformation matrices of the joints in a hierarchical manner. Inverse kinematics problems are non-linear, and for a given position x there may be more than one solution for \Theta. There are two approaches to solving the inverse kinematics problem: numerical and analytical.
Fig. 7.3. (a) The joint orientations for the end-effector positioning of the right arm; (b) goal-directed motion of the arm
The most common solution method for the non-linear problem stated in (7.3) is to linearize it [44]. When the problem is linearized, the joint velocities and the end-effector velocity are related by

\dot{x} = J(\Theta)\,\dot{\Theta},   (7.4)

where

J_{ij} = \frac{\partial f_i}{\partial \Theta_j}.   (7.5)

The Jacobian J relates the changes in the joint variables to the changes in the position of the end-effector. J is an m × n matrix, where m is the dimension of the end-effector vector and n is the number of joint variables. If we invert (7.4), we obtain

\dot{\Theta} = J^{-1}(\Theta)\,\dot{x}.   (7.6)

Given the inverse of the Jacobian, computing the changes in the joint variables due to changes in the end-effector position can be achieved by an iterative algorithm. Each iteration computes the \dot{x} value from the actual and goal positions of the end-effector. The iterations continue until the end-effector reaches the goal. To compute the joint velocities \dot{\Theta}, we must evaluate J(\Theta) at each iteration. Each column of the Jacobian matrix
corresponds to a single joint. The changes in the end-effector position P(\Theta) and orientation O(\Theta) determine the Jacobian column entry for the ith joint according to

J_i = \begin{bmatrix} \partial P_x/\partial\Theta_i \\ \partial P_y/\partial\Theta_i \\ \partial P_z/\partial\Theta_i \\ \partial O_x/\partial\Theta_i \\ \partial O_y/\partial\Theta_i \\ \partial O_z/\partial\Theta_i \end{bmatrix}.   (7.7)

These entries can be calculated as follows: every joint i in the system translates along or rotates around a local axis u_i. If we denote the transformation matrix between the local frame and the world frame as M_i, the normalized transformation of the local joint axis is

\mathrm{axis}_i = u_i M_i.   (7.8)

Using (7.8), the Jacobian entry for a translating joint is

J_i = \begin{bmatrix} [\mathrm{axis}_i]^T \\ 0 \\ 0 \\ 0 \end{bmatrix},   (7.9)

and for a rotating joint it is

J_i = \begin{bmatrix} [(p - j_i) \times \mathrm{axis}_i]^T \\ [\mathrm{axis}_i]^T \end{bmatrix},   (7.10)

where p is the position of the end-effector and j_i is the position of joint i.
The linearization approach assumes that the Jacobian matrix is invertible (both square and non-singular), but this is not generally so. In the case of redundancy and singularities of the manipulator, the problem is more difficult to solve and new approaches are needed.

Unlike the numerical methods, analytical methods in most cases find solutions. We can classify the analytical methods into two groups: closed-form and algebraic elimination methods [45]. Closed-form methods specify the joint variables by a set of closed-form equations; they are usually applied to six degrees-of-freedom (DOF) systems with a specific kinematic structure. In the algebraic elimination methods, on the other hand, the joint variables are denoted by a system of multivariable polynomial equations. Generally, the degrees of these polynomials are greater than four; this is why algebraic elimination methods still require numerical solutions. In general, analytical methods are more common than numerical ones because analytical methods find all solutions and are faster and more reliable.
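As an illustration of the iterative scheme of (7.4)–(7.6), the following is a minimal sketch for the planar two-joint leg used earlier; the link lengths, target, and damping step are invented, and because the chain has exactly two joints and the target is a 2D position, the 2 × 2 Jacobian can be inverted in closed form:

#include <cstdio>
#include <cmath>

int main() {
    const float L1 = 0.4f, L2 = 0.35f;
    float t1 = 0.3f, t2 = 0.3f;             // initial joint angles
    const float goalX = 0.5f, goalY = 0.2f; // a reachable target

    for (int iter = 0; iter < 100; ++iter) {
        // Forward kinematics, x = f(theta).
        float x = L1 * std::cos(t1) + L2 * std::cos(t1 + t2);
        float y = L1 * std::sin(t1) + L2 * std::sin(t1 + t2);
        float ex = goalX - x, ey = goalY - y;   // xdot ~ remaining error
        if (ex * ex + ey * ey < 1e-8f) break;

        // Jacobian entries J_ij = df_i/dtheta_j, per Eq. (7.5).
        float j11 = -L1 * std::sin(t1) - L2 * std::sin(t1 + t2);
        float j12 = -L2 * std::sin(t1 + t2);
        float j21 =  L1 * std::cos(t1) + L2 * std::cos(t1 + t2);
        float j22 =  L2 * std::cos(t1 + t2);

        // thetadot = J^{-1} xdot, Eq. (7.6); near a singularity the
        // determinant tends to zero, the failure mode noted above.
        float det = j11 * j22 - j12 * j21;
        if (std::fabs(det) < 1e-6f) break;
        float d1 = ( j22 * ex - j12 * ey) / det;
        float d2 = (-j21 * ex + j11 * ey) / det;

        const float step = 0.5f;                // damping for stability
        t1 += step * d1;
        t2 += step * d2;
    }
    std::printf("theta = (%.3f, %.3f)\n", t1, t2);
    return 0;
}

The damping factor trades convergence speed for robustness; with a full step the iteration is a Newton step, which can overshoot when the linearization is poor.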
The main advantages of kinematics-based approaches are as follows: first, the motion quality depends on the model and the animator's capability; second, the cost of computation is low. However, the animator must still spend a lot of time producing the animations. These approaches cannot produce physically realistic animations, since the dynamics of the movements are not considered and the interpolation process leads to a loss of realism.

Reviewing the literature, we observe that Chadwick et al. use inverse kinematics in creating keyframes [24]. Badler et al. also propose an inverse kinematics algorithm to constrain the positions of the body parts during animation [46]. In addition, Girard and Maciejewski [47] and Sims and Zeltzer [43] generate leg motion by means of inverse kinematics. Their systems are composed of two stages: in the first stage, foot trajectories are specified; in the second stage, the inverse kinematics algorithm computes the leg-joint angles during movement of the feet. Welman investigates inverse kinematics in detail and describes constrained inverse kinematic figure manipulation techniques [48]. Baerlocher investigates inverse kinematics techniques for the interactive posture control of articulated figures [49]. Greeff et al. propose constraints to fix and pin the position and orientation of joints for inverse kinematics [50].

7.3.2 Dynamics

Since physical laws heavily affect the realism of a motion, dynamics approaches can be used for animation. However, these approaches require more physical parameters, such as the center of mass, total mass, and inertia. Dynamics techniques can be classified as forward dynamics and inverse dynamics.

Forward dynamics considers applying forces on the objects. These forces can be applied automatically or by the animator. The motion of the object is then computed by solving the equations of motion for the object as a result of these forces. Wilhelms gives a survey of rigid-body animation techniques using forward dynamics [51]. Although the method works well with a rigid body, the simulation of articulated figures by forward dynamics is more complicated because the equations of motion for articulated bodies must also handle the interaction between the body parts. This extension of the equations makes control difficult. In addition, forward dynamics does not provide an accurate control mechanism. This makes the method useful for tasks in which the initial values are known. Nevertheless, there are some examples of articulated-figure animation using forward dynamics [52, 53].

Inverse dynamics, however, is a goal-oriented method in which the forces needed for a motion are computed automatically. Although inverse dynamics applications are rarely used in computer animation, Barzel and Barr [54] were among the first users of the method. They generate a model composed of objects that are geometrically related. These relationships are represented by constraints that are used to denote forces on an object. These forces animate the figure in such a way that all the constraints are satisfied.
The Manikin system proposed by Forsey and Wilhelms [55] also animates articulated models by means of inverse dynamics. The system computes the forces needed when a new goal position for the figure is specified. Another system, called Dynamo, is introduced by Isaacs and Cohen [56]; it is based on keyframed kinematics and inverse dynamics. Their approach is an example of combining dynamic simulation and kinematic control. Ko also developed a real-time algorithm for animating human locomotion using inverse dynamics, balance, and comfort control [57].

All the studies outlined above describe the motion of a figure by considering geometric constraints. On the other hand, some researchers have developed inverse dynamics solutions based on non-geometric constraints; Brotman and Netravali [58], Girard [59], and Lee et al. [60] are examples. The combination of anatomical knowledge with the inverse dynamics approach generates more realistic motion. This composite system can also handle the interaction of the model with the environment, which is why this method is useful for virtual reality applications. However, dynamics techniques are computationally more costly than kinematic techniques and are rarely used as interactive animation tools. There are difficulties in using a purely forward dynamics system or an inverse dynamics system. Producing the desired movement with a high level of control requires hybrid solutions. The need to combine forward and inverse dynamics is discussed in [61].

7.3.3 Motion Capture

Since dynamics simulation could not solve all animation problems, new approaches were introduced. One of these methods animates virtual models by using human motion data generated by motion capture techniques. The 3D positions and orientations of points located on the human body are captured, and this data is then used to animate 3D human models. It is mainly used in the film and computer games industries. Human motion capture systems can be classified as non-vision based, vision-based with markers, and vision-based without markers [62]. In non-vision based systems, the movement of a real actor can be captured by using magnetic or optical markers attached to the human body. In vision-based systems, the motion of the human body is tracked with the help of cameras, either with markers attached to the human body [63] or without markers [64, 65].

The main advantage of motion capture techniques is their realistic generation of motion, quickly and with a high level of detail. Moreover, with additional computations, the 3D motion data can be adapted to a new morphology. Motion blending and motion warping are two techniques for obtaining different kinds of motions. There are also studies that generate smooth human motions interactively by combining a set of clips obtained from motion capture data [66]. Motion blending needs motions with different characteristics. It interpolates between the parameters of motions and has the advantage of low
computation cost. Motion warping techniques change the motion by interactively modifying the motion trajectories of different limbs. But this method suffers from the same problem as kinematic or procedural animation techniques: since they cannot handle dynamic effects, it is impossible to ensure that the resulting motions are realistic. In their studies, Bruderlin and Williams [67] work on signal processing techniques to alter existing motions. Unuma et al. [68] generate human figure animations by using Fourier expansions on the available motions. In contrast with the procedural and kinematic techniques, motion modification techniques provide for using real-life motion data to animate the figures. This has the advantage of producing natural and realistic-looking motion at an enhanced speed. On the other hand, this method is not convenient for modifying the movement captured in the data; realism can be lost while applying large changes to the captured data. With the advance of motion capture techniques, a vast amount of motion capture data is produced. Since the storage and processing of this data becomes very difficult, keyframe extraction from motion capture data [69] and compression techniques [70] become more and more important.

7.3.4 Interpolation Techniques for Motion Control

The parameters of body parts (limbs and joints) should be determined at each frame when a character is animated using kinematic methods. However, determining the parameters explicitly at each frame, even for a simple motion, is not trivial. The solution is to specify a series of keyframe poses at different frames. Following this approach, an animator need only specify the parameters at the keyframes. Parameter values for the intermediate frames, called in-betweens, are obtained by interpolating the parameters between these keyframes.

Another problem arises in searching for a suitable interpolation method. Linear interpolation is the simplest method to generate intermediate poses, but it gives unsatisfactory motion: due to the discontinuities in the first derivatives of the interpolated joint angles, this method generates a robotic motion. Obtaining more continuous velocity and acceleration requires higher-order interpolation methods like piecewise splines.

Intermediate values produced by interpolation generally do not satisfy the animator. Therefore, the interpolation process should be kept under control. For just a single DOF, the intermediate values constitute a trajectory curve that passes through the keyframe values. Interpolating a spline along with the keyframe values at both ends determines the shape of the trajectory. An interactive tool that shows the shape of a trajectory and enables an animator to change the shape could be useful. After a trajectory is defined, traversing it at a varying rate can improve the quality of the movement. Parameterized interpolation methods control the shape of a trajectory and the rate at which the trajectory is traversed.
Kochanek and Bartels describe an interpolation technique that relies on a generalized form of piecewise cubic Hermite splines [71]. At the keyframes, the magnitude and direction of the tangent vectors (tangent to the trajectory) are controlled by adjusting continuity, tension, and bias. Changing the direction of the tangent vectors locally controls the shape of the curve when it passes through a keyframe. On the other hand, the rate of change of the interpolated value around the keyframe is controlled by changing the magnitude of the tangent vector. Some animation effects, such as action follow-through and exaggeration [23], can be obtained by setting the parameters. This method does not have the ability to adjust the speed along the trajectory without changing the trajectory itself, because the three parameters used in the spline formulation influence the shape of the curve.

Steketee and Badler [72] offered a double interpolant method in which timing control is separated from the trajectory itself. As in the previous method, a trajectory is a piecewise cubic spline that passes through the keyframed values. In addition, the trajectory curve is sampled by a second spline curve, which controls the parametric speed at which the trajectory curve is traversed. Unfortunately, there is no one-to-one relation between the actual speed in the geometric sense and the parametric speed. Therefore, the desired velocity characteristic is obtained by a trial-and-error process. A way to obtain more intuitive control over the speed is to reparameterize the trajectory curve by arc length. This approach provides a direct relation between parametric speed and geometric speed. An animator can be provided with an intuitive mechanism to vary speed along the trajectory by being allowed to sketch a curve that represents distance over time [73].

In traditional animation, keyframes are drawn by experienced animators, and the intermediate frames are completed by less experienced animators. In this manner, the keyframe-based approach and traditional animation are analogous. The problem with keyframe-based animation is that it is not good at skeleton animation. The number of DOF is one of the main problems: when the number of DOF is high, an animator has to specify too many parameters for even a single key pose. Obviously, controlling the motion by changing many trajectory curves is a difficult process. The intervention of the animator should ideally happen at low levels, perhaps at the level of joint control. Another problem arises from the hierarchical structure of the skeleton. Since the positions of all other components depend on the position of the root joint, the animator cannot easily determine the positional constraints when creating each keyframe pose. The problem can be solved by specifying a new root joint and reorganizing the hierarchy, but this is rarely useful. The interpolation process also suffers from the hierarchical structure of the skeleton: it is impossible to calculate the correct foot positions in the intermediate frames by only interpolating joint rotations.
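The following is a minimal sketch of the arc-length reparameterization mentioned above: the trajectory is sampled into a cumulative chord-length table, and the table is inverted so that the curve can be traversed at a prescribed geometric speed. The quadratic test trajectory is invented for illustration:

#include <cstdio>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

Vec2 trajectory(float u) {              // any parametric curve works here
    return { u, u * u };
}

int main() {
    const int samples = 200;
    std::vector<float> s(samples + 1, 0.0f);  // s[i] = arc length up to u_i

    Vec2 prev = trajectory(0.0f);
    for (int i = 1; i <= samples; ++i) {
        Vec2 p = trajectory(i / (float)samples);
        s[i] = s[i - 1] + std::hypot(p.x - prev.x, p.y - prev.y);
        prev = p;
    }

    // Invert the table: find u such that arclength(u) = d. Stepping d
    // in equal increments yields constant geometric speed, independent
    // of the curve's parameterization.
    float total = s[samples];
    for (float d = 0.0f; d <= total; d += total / 5.0f) {
        int i = 0;
        while (i < samples && s[i + 1] < d) ++i;
        float frac = (s[i + 1] > s[i]) ? (d - s[i]) / (s[i + 1] - s[i]) : 0.0f;
        float u = (i + frac) / samples;
        Vec2 p = trajectory(u);
        std::printf("d=%.3f -> u=%.3f -> (%.3f, %.3f)\n", d, u, p.x, p.y);
    }
    return 0;
}

Replacing the equal increments of d with values read from a sketched distance-over-time curve gives exactly the intuitive speed control described above.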
7.4 Simulating Walking Behavior

Most of the work in motion control has been aimed at producing complex motions like walking. Kinematic and dynamic approaches for human locomotion have been described by many researchers [59, 74, 75, 76, 77, 78, 79]. A survey of human walking animation techniques is given in [80].

7.4.1 Low-level Motion Control

Spline-driven techniques are very popular as the low-level control mechanism for specifying the characteristics of movements. In spline-driven motion control, the trajectories of limbs, e.g., the paths of the pelvis and ankles, are specified using spline curves. Using a conventional keyframing technique, the joint angles over time are determined by splines. Since splines are smooth curves, they are well suited to the interpolation of motion parameters. Cubic splines are easy to implement, computationally efficient, and have low memory requirements. Local control over the shape of the curve, interpolation of the control points, and continuity control are desirable properties of cubic curves. Another advantage is that they are more flexible than lower-order polynomials in modeling arbitrary curved shapes [81]. A class of cubic splines, Cardinal splines, can be used to generate the position curves because Cardinal splines require little calculation and memory, yet exert local control over shape. In addition, there is a velocity or distance curve for these body parts, enabling us to change the nature of the movement. For a set of control points, a piecewise-continuous curve that interpolates these control points can be generated. Let the set of control points be given by

$$p_k = (x_k, y_k, z_k), \qquad k = 0, 1, 2, \ldots, n. \tag{7.11}$$
The piecewise-continuous curve generated from these control points is shown in Fig. 7.4(a). The parametric cubic polynomials that generate the curve between each pair of control points are given by:
Fig. 7.4. Cubic splines: (a) a piecewise-continuous cubic-spline interpolation of n + 1 control points; (b) parametric point function P(u) for a Cardinal spline section
$$\begin{bmatrix} x_u \\ y_u \\ z_u \end{bmatrix} = \begin{bmatrix} a_x u^3 + b_x u^2 + c_x u + d_x \\ a_y u^3 + b_y u^2 + c_y u + d_y \\ a_z u^3 + b_z u^2 + c_z u + d_z \end{bmatrix}, \qquad (0 \le u \le 1). \tag{7.12}$$
Cardinal splines are interpolating piecewise cubics with specified endpoint tangents at the boundary of each curve section, but they do not require explicit values for the endpoint tangents. Instead, the slope at a control point is calculated from the coordinates of the two adjacent points. A Cardinal spline section is specified by four consecutive control points: the middle two are the endpoints of the section, and the other two are used to calculate the slopes at the endpoints. If P(u) denotes the parametric cubic curve function for the section between control points p_k and p_{k+1} (Fig. 7.4(b)), the boundary conditions for the Cardinal spline section are

$$P(0) = p_k, \quad P(1) = p_{k+1}, \quad P'(0) = \tfrac{1}{2}(1-t)(p_{k+1} - p_{k-1}), \quad P'(1) = \tfrac{1}{2}(1-t)(p_{k+2} - p_k), \tag{7.13}$$

where the tension parameter t controls how loosely or tightly the Cardinal spline fits the control points. From these boundary conditions we obtain

$$P(u) = \begin{bmatrix} u^3 & u^2 & u & 1 \end{bmatrix} \cdot M_C \cdot \begin{bmatrix} p_{k-1} \\ p_k \\ p_{k+1} \\ p_{k+2} \end{bmatrix}, \tag{7.14}$$

where the Cardinal basis matrix is

$$M_C = \begin{bmatrix} -s & 2-s & s-2 & s \\ 2s & s-3 & 3-2s & -s \\ -s & 0 & s & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \qquad s = \frac{1-t}{2}. \tag{7.15}$$
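Equations (7.14) and (7.15) translate directly into code; the following sketch (illustrative names, assuming 3D points given as NumPy arrays) evaluates one Cardinal spline section:

```python
import numpy as np

def cardinal_point(p_km1, p_k, p_kp1, p_kp2, u, t=0.0):
    """Evaluate one Cardinal spline section, Eqs. (7.14)-(7.15), at u in [0,1]."""
    s = (1.0 - t) / 2.0
    MC = np.array([[-s,   2 - s, s - 2,   s],
                   [2*s,  s - 3, 3 - 2*s, -s],
                   [-s,   0.0,   s,       0.0],
                   [0.0,  1.0,   0.0,     0.0]])
    U = np.array([u**3, u**2, u, 1.0])
    G = np.array([p_km1, p_k, p_kp1, p_kp2])   # 4 x 3 geometry matrix
    return U @ MC @ G

# Sanity checks: the section interpolates its two middle control points.
pts = [np.array(p) for p in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]]
assert np.allclose(cardinal_point(*pts, u=0.0), pts[1])
assert np.allclose(cardinal_point(*pts, u=1.0), pts[2])
```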
The paths for the pelvis, ankle, and wrist motions are specified using Cardinal spline curves; these are the position curves. Another curve, for velocity, is specified independently for each body part. Thus, by changing the velocity curve, the characteristics of the motion can be changed. Steketee and Badler were the first to recognize this powerful method [72]; they call this velocity curve the "kinetic spline" and the method the "double interpolant" method. The kinetic spline may also be interpreted as a distance curve, since distance and velocity curves can easily be calculated from each other. This application makes use of the double interpolant by enabling the user to specify a position spline and a kinetic spline. The kinetic spline, V(t), is commonly used as the motion curve; however, V(t) can be integrated to obtain the distance curve, S(t). These curves are represented in two-dimensional space for easy manipulation. The position curve is a three-dimensional curve
Fig. 7.5. The articulated figure and the Cardinal spline curves
in space, through which the object moves. Control of the motion involves editing the position and kinetic splines. In this application, velocity curves are straight lines and position curves are Cardinal splines. Figure 7.5 shows the position curves for a human figure.

However, moving an object along a given position spline presents a problem because of the parametric nature of cubic splines. Suppose we have a velocity curve and a position curve to control the motion. We can find the distance traveled at a given time by integrating the velocity curve with respect to time. We must then find the point along the position spline to which the computed distance is mapped. Assume that we have a path specified by a spline Q(u), 0 ≤ u ≤ 1, and we are looking for a set of points along the spline such that the distance traveled along the curve between consecutive frames is constant. In principle, these points can be computed by evaluating Q(u) at equally spaced values of u, but this requires the parameter u to be proportional to the arc length, the distance traveled along the curve. Unfortunately, this is usually not the case. In the special cases where the parameter is proportional to arc length, the spline is said to be parameterized by arc length; without arc-length parameterization, an object can hardly move at a uniform speed along a spline [82].

The animator interactively shapes the curves and views the resulting animation in real time. The double interpolant method enables the animator to change the characteristics of the motion independently, by separately editing the curves that correspond to the position in 3D space, the distance, and the velocity. However, to produce the desired movement, a change in the kinetic spline must be reflected in the position curve.
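Arc-length parameterization is usually approximated numerically rather than computed in closed form. One common approach, sketched below under the assumption that Q maps [0, 1] to 3D points (names illustrative), tabulates cumulative chord lengths and inverts the table by linear interpolation:

```python
import numpy as np

def arclength_table(Q, samples=256):
    """Tabulate the cumulative arc length of a path Q: [0,1] -> R^3."""
    us = np.linspace(0.0, 1.0, samples)
    pts = np.array([Q(u) for u in us])
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # chord lengths
    s = np.concatenate(([0.0], np.cumsum(seg)))         # cumulative length
    return us, s

def u_at_distance(us, s, d):
    """Invert the table: parameter u at which arc length d is reached."""
    return np.interp(d, s, us)
```

With this mapping, the distance obtained by integrating the kinetic spline at each frame can be converted to a parameter value, so that equal distances along the curve correspond to the frames the animator intended.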
After constructing the position curves for end-effectors, such as the wrists and ankles, a goal-directed motion control technique is used to determine the joint angles of the shoulder, hip, elbow, and knee over time. The animator only moves the end-effector; the orientations of the other links in the hierarchy are computed by inverse kinematics. The system also enables the user to define joint angles that are not computed by inverse kinematics: using a conventional keyframing technique, the joint angles over time can be specified by the user, and cubic splines are then used to interpolate them.

7.4.2 High-level Motion Control

Walking motion can be controlled using high-level kinematic approaches that allow the user to specify the locomotion parameters. Given a straight traveling path on flat, obstacle-free ground and the speed of locomotion, walking can be generated automatically by computing the 3D path information and the low-level kinematics parameters. The user can adjust parameters such as step size, the time spent in double support, and the rotation, tilt, and lateral displacement of the pelvis to produce different walking motions.

Walking is a smooth, symmetric motion in which the body, feet, and hands move rhythmically in the required direction at a given speed. It can be characterized as a succession of phases separated by different states of the feet, because the feet drive the main part of the animation. During walking, the foot makes contact with the ground (footstrike) and lifts off the ground (takeoff). A stride is defined as the walking cycle in which four footstrikes and takeoffs occur; the part of this cycle between the takeoffs of the two feet is called a step. The phases for each leg can be classified into two: the stance phase, which is the period of support, and the swing phase, which is the non-support period. In the locomotion cycle, each leg passes through both the stance and the swing phases. There is also a period of time when both legs are in contact with the ground; this phase is called double support. The phases of the walking cycle are shown in Fig. 7.6. During the cyclic stepping motion, one foot is in contact with the ground at all times, and for a period both feet are in contact with the ground. These characteristics of walking greatly simplify the control mechanism.

The kinematic nature of walking must be dissected further to produce a realistic walking motion. For this purpose, Saunders et al. defined a set of gait determinants, which mainly describe pelvis motion [83]. These determinants are compass gait, pelvic rotation, pelvic tilt, stance leg flexion, plantar flexion of the stance ankle, and lateral pelvic displacement. In pelvic rotation, the pelvis rotates to the left and right relative to the walking direction; Saunders et al. quote 3 degrees as the amplitude of pelvic rotation in a normal walking gait. In normal walking, the hip of the swing leg falls slightly below the hip of the stance leg. For the side of the swing
Fig. 7.6. The phases of the walking cycle [76] (FSL/FSR: footstrike left/right; TOL/TOR: takeoff left/right). © 1989 Association for Computing Machinery, Inc. Reprinted by permission
leg, this happens after the end of the double-support phase. The amplitude of pelvic tilt is considered to be 5 degrees. In lateral pelvic displacement, the pelvis moves from side to side: immediately after double support, the weight is transferred from the center to the stance leg, so the pelvis sways alternately during normal walking. Moreover, individual gait variations can be achieved by modifying these pelvic parameters.

The main parameters of walking behavior are velocity and step length; experimental results show that these parameters are related. Saunders et al. [83] relate the walking speed to the walking cycle time, and Bruderlin and Calvert [76] state the correct time durations for a locomotion cycle. Based on experimental results, the typical walking parameters can be stated as follows:

$$velocity = step\ length \times step\ frequency, \tag{7.16}$$
$$step\ length = 0.004 \times step\ frequency \times body\ height. \tag{7.17}$$
Experimental data show that the maximum value of the step frequency is 182 steps per minute. The time for a cycle (t_cycle) can be calculated from the step frequency:

$$t_{cycle} = 2 \times t_{step} = \frac{2}{step\ frequency}. \tag{7.18}$$
The time for double support (t_ds) and t_step are related according to the following equations:

$$t_{step} = t_{stance} - t_{ds}, \tag{7.19}$$
$$t_{step} = t_{swing} + t_{ds}. \tag{7.20}$$
Fig. 7.7. Velocity curve of (a) left ankle and (b) right ankle
Based on the experimental results, t_ds is calculated as

$$t_{ds} = (-0.0016 \times step\ frequency + 0.2908) \times t_{cycle}. \tag{7.21}$$
Although t_ds can be calculated automatically when the step frequency is given, it is sometimes convenient to redefine t_ds to obtain walking animations with different characteristics. Using these equations, which define the kinematics of walking, a velocity curve is constructed for the left and right ankles. Figure 7.7 illustrates the velocity curves of the left and right ankles. The distance curves shown in Fig. 7.8 are generated automatically from the velocity curves. The ease-in, ease-out effect, produced by the speeding up and slowing down of the ankles, can be seen in the distance curves.
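Equations (7.16)-(7.21) can be bundled into a small routine. The sketch below is a direct transcription of the empirical formulas above, assuming the step frequency is given in steps per minute (so the resulting times are in minutes) and the body height in meters; all names are illustrative:

```python
def walking_timing(step_frequency, body_height):
    """Walking parameters from Eqs. (7.16)-(7.21).
    step_frequency: steps per minute (experimentally at most 182);
    body_height: meters. Times come out in minutes."""
    step_length = 0.004 * step_frequency * body_height      # (7.17)
    velocity = step_length * step_frequency                 # (7.16)
    t_cycle = 2.0 / step_frequency                          # (7.18)
    t_step = t_cycle / 2.0
    t_ds = (-0.0016 * step_frequency + 0.2908) * t_cycle    # (7.21)
    t_stance = t_step + t_ds                                # from (7.19)
    t_swing = t_step - t_ds                                 # from (7.20)
    return dict(velocity=velocity, step_length=step_length,
                t_cycle=t_cycle, t_step=t_step, t_ds=t_ds,
                t_stance=t_stance, t_swing=t_swing)

params = walking_timing(step_frequency=110, body_height=1.80)
```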
Fig. 7.8. Distance curve of (a) left ankle and (b) right ankle
7.5 Motion Control for a Multi-layered Human Model

In a multi-layered human model, the skin is deformed based on the transformations of the inner layers, namely the skeleton and muscle layers.
7.5.1 Skeleton Layer

The skeleton layer is composed of joints and bones and controls the motion of the body through the angles of the joints. To solve the inverse kinematics problem, software packages such as Inverse Kinematics using Analytical Methods (IKAN) are available. IKAN provides the functionality to control the movement of body parts [45]. It is a complete set of inverse kinematics algorithms for an anthropomorphic arm or leg, using a combination of analytic and numerical methods to solve generalized inverse kinematics problems including position, orientation, and aiming constraints [45]. For the arm, IKAN computes the joint angles at the shoulder and elbow that place the wrist at the desired location; for the leg, the rotation angles for the hip and knee are calculated. IKAN's methodology is built on a 7-DOF fully revolute open kinematic chain with two spherical joints connected by a single revolute joint. Although the primary work is on the arm, the methods are also suitable for the leg, since the kinematic chains of the two are similar. In the arm model, the spherical joints with 3 DOFs are the shoulder and wrist, and the revolute joint with 1 DOF is the elbow. In the leg model, the spherical joints with 3 DOFs are the hip and ankle, and the revolute joint with 1 DOF is the knee. Since the leg and the arm are both human arm-like (HAL) chains, only the details for the arm are explained here.

The elbow is considered to be parallel to the length of the body at rest. The z-axis points from the elbow to the wrist. The y-axis is perpendicular to the z-axis and is the axis of rotation for the elbow. The x-axis points away from the body along the frontal plane of the body. A right-handed coordinate system is assumed, and similar coordinate systems are assumed at the shoulder and at the wrist. The projection axis is always along the limb, and the positive axis points away from the body, perpendicular to the frontal plane of the body; the projection axis differs for the left and right arm. The wrist-to-elbow and elbow-to-shoulder transformations are calculated since they are needed to initialize the inverse kinematics solver in IKAN. The arm is initialized with a Humanoid object, and the shoulder, elbow, and wrist joints are named. During initialization, the transformation matrices are computed and the inverse kinematics solver is initialized. The orientations of the joints for the positioning of the right arm can be seen in Fig. 7.3(a). The joint angles are found automatically according to the end-effector positions. Figure 7.9 illustrates different motions of the skeleton (walking, jumping, squatting, running, and forearm motion).
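IKAN's actual interface is not reproduced here. As a flavor of the analytic component of such solvers, the following sketch computes the single revolute-joint angle (elbow or knee) from the limb lengths and the desired base-to-end-effector distance using the law of cosines; everything beyond that textbook step is an assumption for illustration:

```python
import math

def two_link_ik(target_dist, l_upper, l_lower):
    """Analytic elbow/knee flexion angle for a 2-link chain.
    Returns the interior angle (radians) that places the end effector
    at distance target_dist from the base joint (shoulder or hip)."""
    # Clamp to the reachable range [|l1 - l2|, l1 + l2].
    d = min(max(target_dist, abs(l_upper - l_lower)), l_upper + l_lower)
    cos_e = (l_upper**2 + l_lower**2 - d**2) / (2 * l_upper * l_lower)
    return math.acos(max(-1.0, min(1.0, cos_e)))  # guard rounding errors
```

The remaining degrees of freedom (the spherical joint at the shoulder or hip, and the swivel of the elbow about the shoulder-wrist axis) are what the combined analytic/numerical machinery of a full solver like IKAN resolves.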
Fig. 7.9. Different motions of the skeleton: (a) walking; (b) running; (c) jumping; (d) squatting; (e) forearm motion
Human movements require the muscles to perform different tasks; accordingly, the human body includes three types of muscle: cardiac, smooth, and skeletal [85]. Cardiac muscle, found only in the heart, pumps blood throughout the body. Smooth muscles are part of the internal organs and are found in the stomach, bladder, and blood vessels. Both of these muscle types are involuntary, as they cannot be consciously controlled. Skeletal muscles, on the other hand, control conscious movements: they are attached to bones by tendons and perform various actions simply by contracting and pulling the bones they are attached to towards each other. If only the external appearance of the human body is of interest, modeling the skeletal muscles serves the purpose.

Skeletal muscles are located on top of the bones and other muscles, and they are structured side by side and in layers. There are approximately 600 skeletal muscles, and they make up 40% to 45% of the total body weight. Skeletal muscle is an elastic, contractile material that originates at fixed origin locations on one or more bones and inserts at fixed insertion locations on one or more other bones [84]. The relative positions of these origins and insertions determine the diameter and shape of the muscle. In real life, muscle contraction causes joint motion; in many articulated body models, however, muscles deform as a consequence of joint motion in order to produce realistic skin deformations during animation.

There are two types of contraction: isometric (same length) and isotonic (same tonicity). In isotonic contraction, the muscle belly changes shape and the total length of the muscle shortens; as a result, the bones to which the muscle is attached are pulled towards each other. Figure 7.10 illustrates isotonic contraction. In isometric contraction, the changes in the shape of the muscle belly due to the tension in the muscle do not change the length of the muscle, so no skeletal motion is produced [38]. Although most body movements require both isometric and isotonic contraction, many applications consider only isotonic contraction, since isometric contractions have very little influence on the appearance of the body during animation.

In most muscle models, a muscle is represented at two levels: the action line and the muscle shape. Sometimes the muscle layer is represented with action lines only and the muscle shape is not considered (Fig. 7.11). The key idea behind this is that the deformations of the skin mesh can be calculated from the underlying action lines, and no muscle shape
Fig. 7.10. Isotonic contraction
Fig. 7.11. Action line abstraction of a muscle
is needed for most applications. The calculation of the muscle shape and its effect on the skin layer is a complicated and computationally intensive process.

7.5.2.1 Modeling of Muscles

An action line denotes the imaginary line along which the force applied to the bone is produced. The precise definition of this line is, however, not clear [38]. Many specialists assume the action line to be a straight line, but more commonly it is represented as a series of line segments (a polyline, in computer graphics terminology) [86]. These segments and their number are determined from the anatomy of the muscle. A muscle represented by an action line simulates the muscle forces and is basically defined by an origin and an insertion point. The control points on the action line guide the line and carry the forces exerted on the skin mesh. These force fields are inversely proportional to the length of the corresponding action-line segment. An example of this kind of action line is shown in Fig. 7.12.

7.5.2.2 Animation of Muscles

Animating muscles is a very complicated process because of the difficulty of determining the position and deformation of a muscle during motion. In computer-generated muscle models, the deformations of the muscles are generally inferred from the motion of the skeleton, the opposite of real life. The deformations of the skin layer are driven by the underlying bones and action lines. This allows the three-dimensional nature of the deformation problem
Fig. 7.12. The structure of an action line: control points and forces on these points
to be reduced to one dimension. The control points of the action lines that correspond to the insertion and origin of the muscle are attached to the skeleton joints so that their motion is dictated by the skeleton. The positions of all the remaining control points are obtained through linear interpolation at each animation frame. We first need to determine the local frame of each action-line control point, since the positions of the control points provide the information on how the surface mesh will expand or shrink over time. After the local frames are constructed, the action line is ready to be animated in correspondence with the underlying skeleton.

Since the insertion and origin points of the action line are fixed on the skeleton, movement of the skeleton layer is reflected on the action line as a decrease or increase in its length. In parallel with the overall change in action-line length, the length of each action-line segment also changes. Since the force fields at the control points are inversely proportional to the segment length, this variation in length also causes a change in the force fields, as demonstrated in Fig. 7.13. The next step in the animation is the deformation of the skin mesh due to the changes in the forces exerted on the skin vertices by the action-line control points. This deformation is automatically propagated to the skin layer via the anchors between the skin vertices and the action line. If a segment shortens, the force fields increase and cause the skin mesh to bulge; similarly, an increase in segment length results in a decrease in the force fields and a relaxation of the skin mesh.
Fig. 7.13. Deformation of an action line and force field changes: (a) rest position and initial force fields of an action line, (b) the action line shortens and forces increase due to muscle contraction, and (c) the action line lengthens and forces decrease due to muscle lengthening
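The chapter states only that the force fields are inversely proportional to the segment length; one way to realize that proportionality, taking the rest pose as reference (the exact force law and all names here are assumptions for illustration), is:

```python
def update_forces(control_points, rest_lengths, rest_forces):
    """Scale the per-control-point force magnitudes inversely with the
    current length of the corresponding action-line segment, as in the
    behavior of Fig. 7.13: shorter segments -> larger forces."""
    forces = []
    for i, (p0, p1) in enumerate(zip(control_points, control_points[1:])):
        length = sum((a - b) ** 2 for a, b in zip(p0, p1)) ** 0.5
        forces.append(rest_forces[i] * rest_lengths[i] / length)
    return forces
```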
7.5.3 Skin Layer

The skin is a continuous external sheet that covers the body. It accounts for about 16% of the body weight and has a surface area of 1.5 to 2.0 m² in adults [87]. Its thickness varies depending on the location.

7.5.3.1 Modeling the Skin Layer

There are basically three ways to model a skin layer. The first is to design a mesh from scratch, or modify an existing one, in a 3D modeler such as Poser [88]. Another is to laser-scan a real person, producing a dense mesh that truly represents a human figure. The last is to extract the skin layer from underlying components that already exist. Figure 7.14 shows the shaded-point and solid views of the P3 Nude Man model, one of the standard characters of the Poser software [88]. The model contains 17,953 vertices and 33,234 faces. The whole body is composed of 53 parts, such as the hip, abdomen, head, right leg, and left hand. This structure facilitates the binding of skin vertices to the inner layers.
Fig. 7.14. The shaded-point (a) and solid (b) views of the skin model
7.5.3.2 Animation of the Skin Layer

During attachment, each vertex in the skin is associated with the closest underlying body components (muscle and bone). Basically, a skin vertex is attached to the nearest point on its underlying component, so that shape changes in the underlying component are propagated through these anchors to the corresponding skin vertices [89].

Skin vertices are first bound to the joints of the skeleton layer in a multistep algorithm. To attach the skin to the joints, the skin vertices are transformed into the joint coordinate system. The skin model is decomposed into parts, which are basically groups of vertices of the skin mesh corresponding to some part of the body. In the first step of the attachment process, a particular joint is determined for each part of the skin, and the vertices of this part are attached to this joint; e.g., the right upper arm is bound to the right shoulder joint and the left thigh is anchored to the left hip joint (Fig. 7.15(a)). However, this first step is not sufficient for realistic deformation of the skin layer. Realism requires that some vertices be bound to more than one joint; in particular, the vertices near the joints need to be attached to two adjacent joints. The second step of the attachment process therefore binds the necessary vertices to two joints. For this purpose, a distance threshold is determined for each highly mobile joint. If the distance between a vertex and a joint is smaller than the threshold value, the vertex is bound to this second joint with a weight (Fig. 7.15(b)). The weights are inversely proportional to the distance between the vertex and the joint. (A sketch of this two-step binding follows Fig. 7.15.)

It is difficult to determine the most appropriate distance threshold for the binding operation. A small threshold value misses some of the vertices that should be attached to the joints. For example, some vertices in the back part of the hip need to be attached to the left-hip or right-hip joints in order to generate a realistic walking motion; since the distances between these vertices and the corresponding joints are larger than the threshold value, there may be holes in the skin mesh because of the unattached vertices.
Fig. 7.15. The attachment process: (a) the left thigh is bound to the left hip joint; (b) binding skin vertices to more than one joint based on a distance threshold. The vertices inside the ellipsoids are bound to more than one joint
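The two-step binding can be sketched as follows; for simplicity, the nearest joint stands in for the per-part assignment of the first step, and all names and shapes are illustrative:

```python
import numpy as np

def bind_vertex(vertex, joints, threshold):
    """Bind a skin vertex to its nearest joint and, if one is closer than
    `threshold`, to a second joint, with weights inversely proportional
    to the vertex-joint distance (cf. Fig. 7.15(b)).
    `joints` is a list of (name, position) pairs."""
    dists = sorted((np.linalg.norm(vertex - pos), name)
                   for name, pos in joints)
    selected = [dists[0]]                 # always bind the nearest joint
    for d, name in dists[1:]:
        if d < threshold:                 # optional second joint
            selected.append((d, name))
            break
    weights = [1.0 / max(d, 1e-9) for d, _ in selected]
    total = sum(weights)
    return [(name, w / total)             # normalized binding weights
            for (_, name), w in zip(selected, weights)]
```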
Increasing the threshold value is a solution to this problem, but it may cause unnatural results in other parts of the body during movement. Therefore, unlike the first two steps, which are fully automatic, a manual correction step is generally applied to overcome these deficiencies. This can be done by selecting some of the unattached vertices one by one with the mouse and binding them to the appropriate joints. In 3D Studio Max, this is performed by manually adjusting the 3D influence envelopes [90]. A 3D influence envelope encloses the skin vertices to be affected by the corresponding bone. Since the envelopes may overlap, some vertices are affected by more than one bone. Sometimes the automatically defined envelopes are larger or smaller than necessary, so a manual size adjustment may be required.

Anchoring points to the action lines of muscles is achieved by a similar process. An action line can be thought of as a skeleton: control points denote joints, and line segments correspond to bones. This allows us to reuse (with some extensions) the algorithms originally developed for mapping skin vertices to skeleton joints. In the skin-skeleton mapping algorithm, each skin vertex is bound to a particular joint; in the skin-action line mapping process, each skin vertex is again attached to a particular action line. An action line is composed of a set of control points, and each control point exerts a different force field on the skin mesh; thus, each skin vertex is bound to a number of control points on the action line (see Fig. 7.16).

Before applying muscle-induced deformations to the skin layer, each skin vertex is moved based on the skeletal pose. This step has two goals: to generate a smooth skin appearance and to simulate the adherence of the skin layer to the skeleton layer. Skin vertices are attached to the joints of the skeleton layer rather than to the limbs; but since the limbs are connected to the joints, the skin vertices move with the limbs of the underlying skeleton [89] (a sketch of this joint-driven deformation is given after this paragraph). Figure 7.17 shows different motions with the rendered skin. The position curves seen in Figs. 7.9 and 7.17 are the trajectories of the left ankle, right ankle, and pelvis; the position and velocity curves of each limb are generated automatically. Figure 7.18 demonstrates muscular and skeletal deformations: in the series of still frames, the right arm is raised forward and then flexed to demonstrate muscular deformation of the skin mesh.
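One common way to realize such joint-driven skin deformation is linear blend skinning, sketched below; this is a standard formulation, not necessarily the exact one used by the authors:

```python
import numpy as np

def skin_vertex(v_rest, bindings):
    """Linear blend skinning of one skin vertex. `bindings` is a list of
    (weight, bind_matrix_inv, current_matrix) per attached joint, where
    bind_matrix_inv is the inverse of the joint's rest-pose transform
    and current_matrix its transform in the current frame (4x4 each)."""
    v = np.append(v_rest, 1.0)                  # homogeneous coordinates
    out = np.zeros(4)
    for w, bind_inv, current in bindings:
        # Move the vertex into the joint's local frame, then out again
        # with the joint's current pose, blended by the binding weight.
        out += w * (current @ bind_inv @ v)
    return out[:3]
```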
Fig. 7.16. Binding skin vertices with action lines. Point j is attached to its nearest three control points, v1, v2, and v3
Fig. 7.17. Different motions of the human figure with the skin: (a) walking; (b) running; (c) jumping; (d) squatting; (e) forearm motion
The most tedious and time-consuming operation is binding the skin layer to the underlying layers. However, this does not affect the frame rate if it is done as a preprocessing step before animation. Depending on the application, skin meshes ranging from low to high resolution can be used.
Fig. 7.18. Muscular deformation on the upper arm
7.6 Conclusions

Multi-layered modeling of human motion based on an anatomical approach yields realistic, real-time results for generating animated avatars for three-dimensional computer animations. Films produced with these techniques have become very popular and widespread, many with major box office success. Until recently, the motion was specified by the animators; current developments in motion capture can now also provide data for animation, making human model animation an integral part of three-dimensional television systems. Building motion databases, using the various motion definitions from such databases, and modifying parts of this data whenever needed provides a very powerful tool for the entertainment industry for generating realistic three-dimensional humans in motion. Many other applications, ranging from medicine to flight simulators, need near-correct human models. The research outlined in this chapter will continue to this end and will increasingly become an important source of tools for three-dimensional display systems.
Acknowledgements We are grateful to Kirsten Ward for careful editing of the manuscript. This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. N. Badler and S. Smoliar, "Digital representations of human movement," ACM Computing Surveys, Vol. 11, No. 1, pp. 19–38, 1979.
2. D. Herbison-Evans, "Nudes 2: A numeric utility displaying ellipsoid solids," ACM Computer Graphics (Proc. of SIGGRAPH'78), pp. 354–356, 1978.
3. J. Korein, A Geometric Investigation of Reach. Cambridge, MA: The MIT Press, 1985.
4. N. Magnenat-Thalmann and D. Thalmann, Computer Animation: Theory and Practice. Berlin: Springer-Verlag, 1985.
5. N. Badler and S. Smoliar, "Graphical behavior and animated agents," in ACM Siggraph Course Notes #17, Advanced Techniques in Human Modeling, Animation and Rendering, pp. 19–38, 1992.
6. N. Magnenat-Thalmann and D. Thalmann, Synthetic Actors in 3D Computer-Generated Films. New York: Springer-Verlag, 1990.
7. K. Komatsu, "Human skin model capable of natural shape variation," The Visual Computer, Vol. 3, No. 5, pp. 265–271, 1988.
8. N. Magnenat-Thalmann and D. Thalmann, "The direction of synthetic actors in the film Rendez-vous à Montréal," IEEE Computer Graphics and Applications, Vol. 7, No. 12, pp. 9–19, 1987.
9. M. DeLoura, Game Programming Gems. Rockland, MA: Charles River Media Inc., 2000.
10. W. Sun, A. Hilton, R. Smith, and J. Illingworth, "Layered animation of captured data," The Visual Computer, Vol. 17, No. 8, pp. 457–474, 2001.
11. K. Singh and E. Kokkevis, "Skinning characters using surface-oriented free-form deformations," in Proceedings of Graphics Interface, pp. 35–42, 2000.
12. K. Lander, "Skin them bones," Game Developer, pp. 11–16, 1998.
13. P. P. Sloan, C. F. Rose, and M. Cohen, "Shape by example," in Proceedings of the Symposium on Interactive 3D Graphics, pp. 135–143, 2001.
14. D. Talbot, "Accurate characterization of skin deformations using range data," Master's thesis, University of Toronto, 1998.
15. J. Blinn, "A generalization of algebraic surface drawing," ACM Transactions on Graphics, Vol. 1, pp. 235–256, 1982.
16. S. Yoshimoto, "Ballerinas generated by a personal computer," The Journal of Visualization and Computer Animation, Vol. 3, No. 1, pp. 85–90, 1992.
17. J. Bloomenthal and K. Shoemake, "Convolution surfaces," ACM Computer Graphics (Proc. of SIGGRAPH'91), Vol. 25, No. 4, pp. 251–256, 1991.
18. A. Leclercq, S. Akkouche, and E. Galin, "Mixing triangle meshes and implicit surfaces in character animation," in Proceedings of Animation and Simulation, pp. 37–47, 2001.
19. D. Thalmann, J. Shen, and E. Chauvineau, "Fast human body deformations for animation and VR applications," in Proceedings of Computer Graphics International, pp. 166–174, 1996.
20. M. P. Gascuel, "An implicit formulation for precise contact modeling between flexible solids," ACM Computer Graphics (Proc. of SIGGRAPH'93), pp. 313–320, 1993.
21. The National Library of Medicine, "The visible human project," available at http://www.nlm.nih.gov/research/visible/visible_human.html.
22. G. Hirota, S. Fisher, A. State, C. Lee, and H. Fuchs, "An implicit finite element method for elastic solids in contact," in Proceedings of Computer Animation, pp. 136–146, 2001.
23. J. Lasseter, "Principles of traditional animation applied to 3D computer animation," ACM Computer Graphics (Proc. of SIGGRAPH'87), Vol. 21, No. 4, pp. 35–44, 1987.
24. J. E. Chadwick, D. R. Haumann, and R. E. Parent, "Layered construction for deformable animated characters," ACM Computer Graphics (Proc. of SIGGRAPH'89), Vol. 23, No. 3, pp. 243–252, 1989.
25. J. Wilhelms and A. Van Gelder, "Anatomically based modeling," ACM Computer Graphics (Proc. of SIGGRAPH'97), pp. 173–180, 1997.
26. F. Scheepers, R. Parent, W. Carlson, and S. May, "Anatomy based modeling of the human musculature," ACM Computer Graphics (Proc. of SIGGRAPH'97), pp. 163–172, 1997.
27. L. Porcher-Nedel, "Anatomic modeling of human bodies using physically-based muscle simulation," Ph.D. dissertation, Swiss Federal Institute of Technology, 1998.
28. F. Scheepers, R. Parent, F. May, and W. Carlson, "Procedural approach to modeling and animating the skeletal support of the upper limb," Department of Computer and Information Science, The Ohio State University, Tech. Rep. OSU-ACCAD-1/96/TR1, 1996.
29. W. Maurel and D. Thalmann, "Human shoulder modeling including scapulothoracic constraint and joint sinus cones," Computers & Graphics, Vol. 24, No. 2, pp. 203–218, 2000.
30. T. Sederberg and S. Parry, "Free-form deformation of solid geometric models," ACM Computer Graphics (Proc. of SIGGRAPH'86), Vol. 20, No. 4, pp. 151–160, 1986.
31. L. Moccozet, "Hand modeling and animation for virtual humans," Ph.D. dissertation, University of Geneva, 1996.
32. R. Turner and D. Thalmann, "The elastic surface layer model for animated character construction," in Proceedings of Computer Graphics International, pp. 399–412, 1993.
33. L. Nedel and D. Thalmann, "Real time muscle deformations using mass-spring systems," in Proceedings of Computer Graphics International, pp. 156–165, 1998.
34. V. Ng-Thow-Hing and E. Fiume, "Application-specific muscle representations," in Proceedings of Graphics Interface, pp. 107–115, 2002.
35. M. P. Gascuel, A. Verroust, and C. Puech, "A modeling system for complex deformable bodies suited to animation and collision processing," The Journal of Visualization and Computer Animation, Vol. 2, No. 3, pp. 82–91, 1991.
36. T. DeRose, M. Kass, and T. Truong, "Subdivision surfaces in character animation," ACM Computer Graphics (Proc. of SIGGRAPH'98), Vol. 32, pp. 85–94, 1998.
37. M.-P. Cani-Gascuel and M. Desbrun, "Animation of deformable models using implicit surfaces," IEEE Transactions on Visualization and Computer Graphics, Vol. 3, No. 1, pp. 39–50, 1997.
38. A. Aubel, "Anatomically-based human body deformations," Ph.D. dissertation, Swiss Federal Institute of Technology, 2002.
39. Humanoid Animation Working Group of Web3D Consortium, "H-Anim 1.1: specification for a standard humanoid," available at http://h-anim.org/.
40. O. Arikan, D. Forsyth, and J. O'Brien, "Motion synthesis from annotations," ACM Transactions on Graphics (Proc. of SIGGRAPH'03), Vol. 22, No. 3, pp. 402–408, 2003.
41. R. Boulic, P. Bécheiraz, L. Emering, and D. Thalmann, "Integration of motion control techniques for virtual human and avatar real-time animation," in Proceedings of ACM Symposium on Virtual Reality Software and Technology (VRST'97), pp. 111–118, 1997.
42. J. Denavit and R. Hartenberg, "A kinematic notation for lower-pair mechanisms based on matrices," Journal of Applied Mechanics (ASME), Vol. 22, No. 2, pp. 215–221, 1955.
43. K. Sims and D. Zeltzer, "A figure editor and gait controller for task level animation," in ACM Siggraph Course Notes #4, Synthetic Actors: The Impact of Artificial Intelligence and Robotics on Animation, pp. 164–181, 1988.
44. J. Zhao and N. Badler, "Inverse kinematics positioning using nonlinear programming for highly articulated figures," ACM Transactions on Graphics, Vol. 13, No. 4, pp. 313–336, 1994.
45. D. Tolani, A. Goswami, and N. Badler, "Real-time inverse kinematics techniques for anthropomorphic limbs," Graphical Models, Vol. 62, No. 5, pp. 353–388, 2000.
46. N. Badler, K. Manoochehri, and G. Walters, "Articulated figure positioning by multiple constraints," IEEE Computer Graphics and Applications, Vol. 7, No. 6, pp. 28–38, 1987.
47. M. Girard and A. Maciejewski, "Computational modeling for computer generation of legged figures," ACM Computer Graphics (Proc. of SIGGRAPH'85), Vol. 19, No. 3, pp. 263–270, 1985.
48. C. Welman, "Inverse kinematics and geometric constraints for articulated figure manipulation," Master's thesis, School of Computing Science, Simon Fraser University, 1993.
49. P. Baerlocher, "Inverse kinematics techniques for the interactive posture control of articulated figures," Ph.D. dissertation, Département d'Informatique, École Polytechnique Fédérale de Lausanne, 2001.
50. M. Greeff, J. Haber, and H.-P. Seidel, "Nailing and pinning: Adding constraints to inverse kinematics," in Proceedings of the 13th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG'2005), Short Papers, 2005.
51. J. Wilhelms, Dynamic Experiences. San Mateo, CA: Morgan-Kaufmann Publishers, pp. 265–280, 1991.
52. W. Armstrong and M. Green, "The dynamics of articulated rigid bodies for purposes of animation," The Visual Computer, Vol. 1, pp. 231–240, 1985.
53. J. Wilhelms, "Virya – a motion control editor for kinematic and dynamic animation," in Proceedings of Graphics Interface, pp. 141–146, 1986.
54. R. Barzel and A. H. Barr, "A modeling system based on dynamics," ACM Computer Graphics (Proc. of SIGGRAPH'88), Vol. 22, No. 4, pp. 179–188, 1988.
55. D. Forsey and J. Wilhelms, "Techniques for interactive manipulation of articulated bodies using dynamic analysis," in Proceedings of Graphics Interface, pp. 8–15, 1988.
56. P. M. Isaacs and M. F. Cohen, "Controlling dynamic simulation with kinematic constraints, behavior functions and inverse dynamics," ACM Computer Graphics (Proc. of SIGGRAPH'87), Vol. 21, No. 4, pp. 215–224, 1987.
57. H. Ko and N. Badler, "Animating human locomotion in real-time using inverse dynamics," IEEE Computer Graphics and Applications, Vol. 16, No. 2, pp. 50–59, 1996.
58. N. Brotman and A. Netravali, "Motion interpolation by optimal control," ACM Computer Graphics (Proc. of SIGGRAPH'88), Vol. 22, No. 4, pp. 309–315, 1988.
59. M. Girard, Constrained Optimization of Articulated Animal Movement in Computer Animation. San Mateo, CA: Morgan-Kaufmann Publishers, pp. 209–229, 1991.
60. P. Lee, S. Wei, J. Zhao, and N. Badler, "Strength guided motion," ACM Computer Graphics (Proc. of SIGGRAPH'90), Vol. 24, No. 4, pp. 253–262, 1990.
61. S. Loizidou and J. Clapworthy, Legged Locomotion Using HIDDS. Tokyo: Springer-Verlag, 1993.
62. T. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Computer Vision and Image Understanding, Vol. 81, No. 3, pp. 231–268, 2001.
63. G. Johansson, "Visual motion perception," Scientific American, Vol. 232, No. 6, pp. 75–80, 1975.
64. J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel, "Free-viewpoint video of human actors," ACM Transactions on Graphics (Proc. of SIGGRAPH'03), Vol. 22, No. 3, pp. 569–577, 2003.
65. A. Sundaresan and R. Chellappa, "Markerless motion capture using multiple cameras," in Proceedings of the Computer Vision for Interactive and Intelligent Environment (CVIIE'05), 2005.
66. O. Arikan and D. Forsyth, "Interactive motion generation from examples," ACM Transactions on Graphics (Proc. of SIGGRAPH'02), Vol. 21, No. 3, pp. 483–490, 2002.
67. A. Bruderlin and L. Williams, "Motion signal processing," ACM Computer Graphics (Proc. of SIGGRAPH'95), pp. 97–104, 1995.
68. M. Unuma, K. Anjyo, and R. Takeuchi, "Fourier principles for emotion-based human figure animation," ACM Computer Graphics (Proc. of SIGGRAPH'95), Vol. 29, No. 4, pp. 91–96, 1995.
69. K. Huang, C. Chang, Y. Hsu, and S. Yang, "Key probe: A technique for animation keyframe extraction," The Visual Computer, Vol. 21, No. 8–10, pp. 532–541, 2005.
70. O. Arikan, "Compression of motion capture databases," ACM Transactions on Graphics (Proc. of SIGGRAPH'06), Vol. 25, No. 3, pp. 890–897, 2006.
71. D. Kochanek and R. Bartels, "Interpolating splines with local tension, continuity, and bias control," ACM Computer Graphics (Proc. of SIGGRAPH'84), Vol. 18, No. 3, pp. 33–41, 1984.
72. J. Steketee and N. Badler, "Parametric keyframe interpolation incorporating kinetic adjustment and phrasing control," ACM Computer Graphics (Proc. of SIGGRAPH'85), Vol. 19, No. 3, pp. 255–262, 1985.
73. R. Bartels and I. Hardtke, "Speed adjustment for keyframe interpolation," in Proceedings of Graphics Interface, pp. 14–19, 1989.
74. N. Badler, C. B. Phillips, and B. L. Webber, Simulating Humans: Computer Graphics, Animation, and Control. Oxford University Press, 1999.
75. L. Bezault, R. Boulic, N. Magnenat-Thalmann, and D. Thalmann, "An interactive tool for the design of human free-walking trajectories," in Proceedings of Computer Animation, pp. 87–104, 1992.
76. A. Bruderlin and T. W. Calvert, "Goal-directed dynamic animation of human walking," ACM Computer Graphics (Proc. of SIGGRAPH'89), Vol. 23, No. 3, pp. 233–242, 1989.
77. ——, "Interactive animation of personalized human locomotion," ACM Computer Graphics (Proc. of SIGGRAPH'93), Vol. 29, No. 4, pp. 17–23, 1993.
78. H. Ko, "Kinematic and dynamic techniques for analyzing, predicting, and animating human locomotion," Ph.D. dissertation, Department of Computer and Information Science, University of Pennsylvania, 1994.
79. D. Zeltzer, "Motor control techniques for figure animation," IEEE Computer Graphics and Applications, Vol. 2, No. 9, pp. 53–59, 1982.
80. F. Multon, L. France, M.-P. Cani-Gascuel, and G. Debunne, "Computer animation of human walking: A survey," Journal of Visualization and Computer Animation, Vol. 10, pp. 39–54, 1999.
81. R. Bartels, J. C. Beatty, and B. A. Barsky, An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Los Altos, CA: Morgan Kaufmann, 1987.
82. B. Guenter and R. Parent, "Computing the arclength of parametric curves," IEEE Computer Graphics and Applications, Vol. 10, No. 3, pp. 72–78, 1990.
83. J. Saunders, V. T. Inman, and H. D. Eberhart, "The major determinants in normal and pathological gait," Journal of Bone and Joint Surgery, Vol. 35-A, No. 3, pp. 543–558, 1953.
84. P. Richer, Artistic Anatomy. Watson-Guptill Publications, 1981.
85. W. Maurel, Y. Wu, N. Magnenat-Thalmann, and D. Thalmann, Biomechanical Models for Soft Tissue Simulation. Berlin/Heidelberg: Springer-Verlag, 1998.
86. S. Delp and J. Loan, "A computational framework for simulating and analyzing human and animal movement," IEEE Computing in Science & Engineering, Vol. 2, No. 5, pp. 46–55, 2000.
87. Y. Lanir, Skin Mechanics. New York: McGraw-Hill, pp. 11–16, 1987.
88. MetaCreations Software, Inc., Poser 4, available at http://www.metacreations.com/products/poser (Poser 7 is available at http://www.e-frontier.com/go/poser).
89. L. Kavan and J. Zara, "Real time skin deformation with bones blending," in Proceedings of WSCG'2003 (Short Papers), pp. 69–74, 2003.
90. Autodesk, "3ds Max", http://usa.autodesk.com.
8 A Survey on Coding of Static and Dynamic 3D Meshes

Aljoscha Smolic¹, Ralf Sondershaus², Nikolče Stefanoski³, Libor Váša⁴, Karsten Müller¹, Jörn Ostermann³, and Thomas Wiegand¹

¹ Fraunhofer-Institute for Telecommunications, Heinrich-Hertz-Institut, Image Processing Department, Einsteinufer 37, 10587 Berlin, Germany
² Lehrstuhl Graphisch-Interaktive Systeme, Wilhelm-Schickard-Institut, Universität Tübingen, Sand 14, 72076 Tübingen, Germany
³ Institut für Informationsverarbeitung, Fakultät für Elektrotechnik und Informatik, Leibniz Universität Hannover, Appelstr. 9A, 30167 Hannover, Germany
⁴ Department of Computer Science and Engineering, Faculty of Applied Science, University of West Bohemia, Univerzitni 8, 306 14 Plzen, Czech Republic
In this chapter we survey recent developments in the area of compression of static and dynamic 3D meshes. In an introductory section we give a definition of meshes and define terms and notations related to them. Furthermore, we give an overview of coding techniques in general and describe the principles of mesh compression algorithms at an intuitive level. The following two sections give an overview of single-rate and progressive coding techniques for static and dynamic meshes, explaining them in more detail and pointing out the main ideas of each encoding approach. We conclude each section with a discussion providing an overall picture of developments in the mesh coding area, highlighting advantages and disadvantages of the presented approaches, and pointing out directions for future research.

The development of compression algorithms for static meshes was mainly driven by the community of 3D graphics hardware accelerators. The goal was to reduce the amount of bytes that need to be transferred from the main memory to the graphics card. Based on the pioneering work of Michael Deering, a variety of algorithms have been proposed that work well for both triangle and polygonal meshes. But it is not only the hardware community that benefits from compression techniques. Modern scanning devices are able to produce huge point clouds, which are converted to even bigger triangle soups by surface reconstruction algorithms. A famous example is the Digital Michelangelo project of Stanford, which contains scans of some of the most famous sculptures of Michelangelo; the biggest model consists of 386,488,573 polygons. The compression of such huge static meshes not only increases the rendering performance but also decreases
the storage cost on hard disks. Furthermore, the transmission of static meshes over networks becomes important for applications like virtual shopping malls.

Efficient storage and broadcasting of dynamic 3D content is of crucial importance for the commercial success of 3DTV technology. Dynamic 3D objects in their generic form, represented as a sequence of static meshes, require many times more storage than a single static mesh, so good compression techniques are essential for such applications to be successful. Static as well as dynamic meshes exhibit dependencies in the spatial and spatio-temporal directions, respectively, which can be exploited for their compression.
8.1 Introduction

We introduce the basic concept of meshes and establish the nomenclature that is often used to describe methods and algorithms for meshes. For a more detailed introduction, we refer the interested reader to [1]. Additionally, a short overview of coding techniques is given, as well as of the basic principles of mesh compression algorithms.

8.1.1 Basics on Meshes

A surface can be seen as a two-dimensional subset of R^3: each point on the surface is surrounded by a two-dimensional neighbourhood of other points on the surface. Due to this two-dimensional nature, the notion of a 2-manifold provides a more abstract description of a surface.

Definition 1 (2-manifold). A 2-manifold is a topological space which is locally Euclidean, i.e. every point has a neighbourhood that is topologically equivalent to an open disc in R^2.

Often, surfaces have boundaries where points have a neighbourhood that is topologically equivalent to a half disc. These surfaces are called manifolds with boundary. Although many surfaces in computer graphics are manifold or manifold with boundary, keep in mind that there exist surfaces that are neither. Figure 8.2 shows examples of manifold and non-manifold surfaces.

A surface can be decomposed into a collection of faces which are enclosed by edges and vertices (Fig. 8.1). From a mathematical point of view, the edges and vertices form a graph whose edges surround the faces. This graph can be embedded into a two-dimensional plane, i.e. it can be drawn. In order to embed a manifold without boundary, it must be cut, whereby the cut edges (and vertices) are doubled. The edges of the embedded graph have unique identifiers attached to them, such that duplicated edges (and vertices) can be identified as being one edge (or vertex). The collection of faces, edges, and vertices, together with the information on how they connect to each other, is called the connectivity.
Fig. 8.1. A sphere can be decomposed into a collection of faces where each face is bounded by edges and vertices. The faces can be embedded into a 2D plane. The collection of faces has the same topology as the sphere if the edges and vertices are identified by labels as it is shown for the top row of triangles
Taking the embedded 2D graph, each vertex can be assigned coordinates in 3D. Similarly, each edge can be mapped to an arc in 3D and each face to a surface patch in 3D. We say that each such element is mapped to its geometry. A mesh M is a pair M = (C, G) that represents a surface by the connectivity information C (a collection of faces, edges, and vertices) and the geometry G (3D coordinates). The faces, edges, and vertices are often called mesh elements.

Often, the mapping of an edge to its geometry is just a mapping to a straight line segment in 3D. A face is then mapped implicitly to the polygon that is formed by the line segments of its bounding edges. If a face is bounded by three edges and thus forms a triangle, the polygon of the line segments is a triangle, which always lies in a plane. But if a face is bounded by more than three edges, the line segments do not necessarily form a planar polygon. Especially for hardware-accelerated rendering, such polygons are often decomposed into triangles, which are then sent down the graphics pipeline for rendering.

The term polygonal connectivity summarizes the basic properties of a well-designed mesh connectivity.

Definition 2 (Polygonal Connectivity). The polygonal connectivity C is a quadruple (V, E, F, R) that consists of the set of vertices V, the set of edges E, the set of faces F, and the incidence relation R such that

• each edge is incident to its two end vertices,
• each face is incident to an ordered closed loop of k edges {e_1, ..., e_k} ⊆ E, where each edge is incident to both its end vertices,
• each face is incident to all end vertices of its incident edges, and
• the incidence relation is reflexive, i.e. a vertex, edge, or face is incident to itself.

Fig. 8.2. The sphere and the donut are manifold surfaces, but the workpiece is non-manifold at the tops of the cones. All three surfaces have one shell. The sphere has no hole (genus 0) and the donut has one hole (genus 1)
The term adjacency describes the relationships between mesh elements of the same type.

Definition 3 (Adjacency).

• Two faces are adjacent iff there exists an edge that is incident to both of them.
• Two edges are adjacent iff there exists a vertex that is incident to both of them.
• Two vertices are adjacent iff there exists an edge that is incident to both of them.
The valence of a vertex and the degree of a face are terms that are often used to characterize meshes.

Definition 4 (Valence and Degree).

• The valence of a vertex is the number of edges that are incident to the vertex.
• The degree of a face is the number of edges that are incident to the face.
To embed the mesh elements into three-dimensional space, geometry needs to be added. Note that an edge is hereby mapped to a straight line segment, which justifies the term polygonal geometry.

Definition 5 (Polygonal Geometry). The polygonal geometry G of a polygonal mesh is the mapping from the mesh elements in the connectivity C to elements in R^3 such that

• a vertex is mapped to a point in R^3,
• an edge is mapped to the line segment that connects both its end points,
• a face is mapped to the polygon that is enclosed by the line segments of its incident edges.
Often, meshes carry additional information along with their vertices, edges, and faces, describing physical properties like surface colors or surface normals. If the properties are given at the vertices only, they can be extended to edges and faces by linear interpolation.

If we look globally at a mesh, we can identify its shells and its genus. A shell is a part of the mesh that is edge-connected, i.e. any two faces of the part are connected by a path of faces such that two consecutive faces on the path are adjacent to each other. The genus is the number of holes of the mesh: a donut has one hole, as does a cup, whereas a sphere has none. Looking at the connectivity information with v = |V| vertices, e = |E| edges, and f = |F| faces, the Euler equation gives information about the relationship between the mesh elements
$$v - e + f = \chi.$$

Hereby, χ is called the Euler characteristic and depends on the number of shells, the genus, and the number of boundaries of the mesh. For closed manifold meshes, the Euler characteristic is given by the number of shells s and the genus g:

$$\chi = 2(s - g).$$

For manifold meshes with b (closed) boundary loops, the Euler characteristic simply changes to

$$\chi = 2(s - g) - b.$$

Meshes that are homeomorphic to a sphere, i.e. that consist of a single shell without holes and boundary, are called simple meshes. Each triangle of a simple mesh has exactly three edges, and each edge is adjacent to exactly two triangles. Putting both conditions together, we obtain

$$2e = 3f.$$

Substituting this equation into the Euler equation for simple meshes and considering an increasing number of faces and vertices, we find

$$v - e + f = 2 \;\;\Rightarrow\;\; v - \frac{1}{2}f = 2 \;\;\Rightarrow\;\; \frac{v}{f} = \frac{1}{2} + \frac{2}{f}.$$

So, for large meshes we get the approximations f ≈ 2v and e ≈ 3v: a mesh has roughly twice as many faces as vertices and three times as many edges as vertices. Furthermore, if we look at the valence v_i of each vertex and compute the average valence over all vertices of a simple mesh, using the fact that each edge is incident to both its end vertices, we find

$$\frac{1}{v} \sum_{i \in V} v_i \approx \frac{2e}{v} = 6.$$

So, the valence of a vertex in a large mesh is 6 on average.
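The Euler relations are easy to verify programmatically; the following snippet checks χ = 2 and 2e = 3f for an octahedron, a simple mesh with one shell and genus 0:

```python
def euler_characteristic(vertices, edges, faces):
    return len(vertices) - len(edges) + len(faces)

# Octahedron: 6 vertices, 8 triangular faces, edges derived from the faces.
V = range(6)
F = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1),
     (5, 2, 1), (5, 3, 2), (5, 4, 3), (5, 1, 4)]
E = {frozenset(e) for f in F
     for e in ((f[0], f[1]), (f[1], f[2]), (f[2], f[0]))}

assert euler_characteristic(V, E, F) == 2   # v - e + f = 6 - 12 + 8 = 2
assert 2 * len(E) == 3 * len(F)             # closed triangle mesh: 2e = 3f
```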
Many data structures exist that store the connectivity information and allow it to be queried. Based on the three types of mesh elements, we can identify queries from each mesh element type to every other mesh element type, see Fig. 8.3(a); for instance, we could query all faces that are incident to a vertex. The data structures are distinguished by which relations they store explicitly and, based on this storage, which queries they can answer and how fast.
Fig. 8.3. (a) All possible relationships between mesh elements that could be stored in a data structure. (b, c) The half-edge data structure introduces the concept of half edges: each edge is split into two half edges that have an orientation such that a half edge is inversely oriented to its opposite. Each vertex references an outgoing half edge (h), each face references one of its bounding half edges (e), and each half edge references the vertex it starts from (v), the face it belongs to (f), the half edge that is opposite to it (i), the next half edge in counter-clockwise order (n), and optionally the previous half edge (p). All half edges are stored in a single table such that every half edge has a unique index. For triangular meshes, the f and e references need not be stored, since they can be computed from the half-edge index using only div 3 and mod 3 operations, respectively
A very popular data structure for storing manifold meshes is the half-edge data structure, shown in Fig. 8.3(b, c). Each edge is split into two half edges that are opposite to each other. Each half edge has an orientation, chosen such that all half edges of a face form a cycle. The half-edge data structure for polygonal meshes maintains three tables: the half-edge table, the vertex table, and the face table. The half-edge table keeps a record for each half edge which contains references to other half edges, to the vertex where the half edge originates, and optionally to the face that the half edge belongs to. The vertex table stores, for each vertex, the geometry and the index of one half edge that originates at the vertex. The face table keeps just a reference to one half edge that bounds the face.

For the special case of triangle meshes, the face table can be skipped. The half edges are sorted by their triangles such that the three half edges of a triangle are stored one after another in the table. Given the index of a half edge h_i, the triangle index can be computed as h_i div 3; given a triangle index t_i, the index of the first half edge that belongs to the triangle can be computed as 3 · t_i. The half-edge data structure can store manifold meshes; non-manifold meshes can be transformed into manifold meshes by duplicating non-manifold edges or vertices. Alternatively, a non-manifold mesh can be stored in another data structure like the radial-edge data structure [2]. But many existing compression algorithms work on manifold meshes only and must be extended explicitly to handle non-manifold meshes.
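A minimal triangle-mesh half-edge record with the index arithmetic just described might look as follows (field and method names are illustrative):

```python
# Half edges of triangle t occupy table slots 3t, 3t+1, 3t+2, so the face
# reference is recovered by integer division and the in-triangle traversal
# by mod-3 rotation; only the origin vertex and the opposite twin need to
# be stored per half edge.

class HalfEdgeMesh:
    def __init__(self):
        self.origin = []    # origin[h]   = vertex the half edge h leaves
        self.opposite = []  # opposite[h] = index of the oppositely oriented twin

    def face(self, h):
        return h // 3                      # triangle that owns half edge h

    def next(self, h):
        return 3 * (h // 3) + (h + 1) % 3  # next half edge around the triangle

    def prev(self, h):
        return 3 * (h // 3) + (h + 2) % 3  # previous half edge around the triangle
```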
8.1.2 Basics on Encoding

Information theory provides the theoretical background on how to compress any type of data [3] – audio or video signals, but also computer graphics models. Any communication can be regarded as the process of transmitting information from a source to a receiver. In the time-discrete (sampled)
formulation – which is the only one of interest in the digital world – a source produces a sequence of symbols over time. These may be samples of an audio signal, images of a video sequence, or individual 3D meshes of a computer graphics animation. The Nyquist theorem gives the theoretical bounds for optimum sampling and reconstruction [4]. Sampling may be regular at equidistant time intervals, or irregular. More generally, sampling may also be carried out over other dimensions such as 2D or 3D space. A pixel of a digital image is a 2D sample of a light intensity and colour function. A coloured vertex of a 3D mesh is a 3D sample of an object surface. Source coding is the scientific discipline that provides the theoretical background on how to compress the source symbols. These may be continuous or discrete in amplitude. Discrete amplitude sources have a finite alphabet, i.e. the number of different possible symbols is limited. These may be sources whose symbols are already quantized. Take for instance the letters of the alphabet that any text is composed of: only a limited number of letters exist. Such discrete amplitude sources can be compressed losslessly, i.e. the receiver can reconstruct the source information exactly. Continuous amplitude sources can output symbols of any value, mostly within a certain range – for instance time samples of some sensor data, or any sequence of real numbers. Such a source cannot be compressed losslessly; only lossy compression is possible, i.e. the receiver can never reconstruct the information exactly, and a certain distortion is inherently introduced. This distortion, i.e. the inverse of quality, is a function of the bitrate, i.e. the cost that is spent to encode the symbols. Rate distortion theory provides the theoretical bounds for this trade-off between cost and quality. It defines a function to calculate the minimum bitrate for a certain quality and vice versa. Note that discrete amplitude sources may also be encoded lossily. The principles of efficient compression are the reduction of irrelevancy and of redundancy. Irrelevancy is that part of the data that is of no use to the receiver. It is mostly defined in terms of human perception. Humans are for instance not able to perceive a tone of a frequency higher than 20 kHz; such information, if contained in an audio signal, may therefore be removed. There will be no perceptual difference between the original and the filtered audio signal. The two signals are then called transparent, although there will be a measurable distortion. Irrelevancy reduction therefore always introduces distortion and is not reversible. An initial 3D mesh, for instance from a laser scanner, may be represented using 32 bit float values per coordinate for each vertex. The same 3D mesh when displayed on a computer monitor might be perceived equally well using only 16 bit per coordinate, as a fictitious example. Such an irrelevancy reduction would save half of the bitrate. But imagine now zooming very close to the object surface: differences between the 32 bit and the 16 bit representation may become noticeable. This illustrates that the display device and viewing conditions always influence the definition of irrelevancy as well.
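In practice, such an irrelevancy reduction is realized by quantization. The following minimal sketch (parameter names and the coordinate range are assumed for illustration) maps float coordinates onto a k-bit grid and back:

```python
def quantize(coords, k, lo, hi):
    """Map float coordinates in [lo, hi] onto a grid of 2**k cells."""
    cells = (1 << k) - 1
    return [round((c - lo) / (hi - lo) * cells) for c in coords]

def dequantize(indices, k, lo, hi):
    """Reconstruct approximate coordinates from the cell indices."""
    cells = (1 << k) - 1
    return [lo + i * (hi - lo) / cells for i in indices]

# With k = 16 instead of 32 bit floats, the reconstruction error is
# usually invisible on a monitor but may show up when zooming in.
```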
Transforms are an important mechanism for compression and signal processing in general [5]. The original data undergo a mathematical operation and the result is basically another set of numbers. In most practical cases the mathematical operation is invertible, i.e. the original data can be exactly recomputed from the transformed data. Most important are frequency transforms such as the Fourier transform, the discrete cosine transform (DCT), and the wavelet transform. The original data, such as blocks of audio samples or image pixels, are transformed into the frequency domain; instead of samples or pixels, frequency coefficients are then processed further. The benefit is that many subsequent compression or processing operations can be performed much better in the frequency domain. For instance, a DCT applied to a block of image pixels in most cases concentrates the largest part of the energy in a few coefficients. Those carry the largest part of the visual information important to the user. Other coefficients may be neglected or coarsely quantized with only a few bits. A decoder receives the quantized coefficients and reconstructs the block of image pixels by inverse DCT. This process is called transform coding; it is an integral part of image and video compression algorithms such as JPEG or MPEG, and it can be similarly applied to 3D mesh compression. Another important feature of some transforms is that they inherently produce a progressive or scalable representation of the original data. The resulting bitstream can be partially decoded to obtain a low-quality reconstruction, and the quality improves successively as more data are processed, up to full quality. This feature is of high importance for many applications such as Internet transmission. Instead of waiting for instance for a full image to be downloaded, a low-quality version can be displayed immediately after some portion of the bitstream has arrived. JPEG2000, for instance, exploits the properties of the wavelet transform for progressive encoding of images. Similarly, this principle is exploited for progressive encoding of 3D meshes. The key to redundancy reduction is the statistical dependency between source symbols. Pure redundancy reduction is always fully reversible, i.e. the receiver can reconstruct the source information exactly. One important mechanism for redundancy reduction is prediction [6]. Images, for instance, mostly contain regions of similar colour. If one pixel of such a region is already encoded (and known to the receiver), its colour value will be a good guess for a neighbouring pixel to be encoded next. Prediction here means subtraction of the predicted value derived from the already encoded pixel from the actual value. The difference is called the residual. In most cases the residuals will be small and can be encoded with fewer bits than the original colour values using entropy coding, which is outlined below. In the case of 3D meshes, already encoded vertices can be used for prediction of the vertex coordinates to be encoded next. Besides this spatial prediction, temporal prediction is also widely used for any type of multimedia data. Consecutive images of video sequences, for instance, are most often very similar. Previously encoded images can be used very well for prediction of pixels, blocks of pixels, or regions within the actual image to be encoded (motion compensation).
Fig. 8.4. Distribution of x-, y-, and z-components of prediction residuals of a 3D mesh [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
The same principle can be used for compression of dynamic 3D meshes. Figure 8.4 shows the distribution of prediction residuals of a 3D mesh. It is highly concentrated around zero, which means that the prediction works very well in most cases. These residuals can be entropy encoded (see below) much more efficiently than the original vertex coordinates due to the significantly reduced variance of the distribution. The statistics of a source can be represented by a probability density function (PDF) that specifies the probability that the source emits a certain symbol. Figure 8.4 is a typical example for mesh compression. Obviously some symbols, such as the value zero, have a very high probability; others, such as large values, are emitted rarely. This can be exploited for efficient compression using entropy coding, which is illustrated here without going into theoretical details. For simplicity, let us assume a discrete amplitude source, i.e. the symbols have already been quantized, according to some irrelevancy or rate-distortion criterion, to a limited number of allowed values. The source has a finite alphabet of M different symbols. Then it is possible to define a codebook that assigns a codeword to each symbol, in most cases a bitstring. For instance,
the modern English alphabet consists of 26 letters. A codebook can be defined as "a = 00000", "b = 00001", "c = 00010", etc. A decoder can perfectly reconstruct any text from the received bits. For such a representation, the minimum length N of the bitstrings is given by M ≤ 2^N. For the English alphabet this means 5 bit per symbol. However, some of the symbols are more likely to appear than others. For instance, the "e" has a very high probability in English text. The principle of variable length coding (VLC) exploits this for efficient compression. Basically, symbols with high probability get short codewords and symbols with low probability get long codewords. Averaged over all transmitted symbols, the resulting bitrate per symbol will be significantly reduced compared to fixed length coding if the PDF is not uniform. For instance, in the Morse alphabet the "e" gets a very short code, "·", while letters with lower probability get longer codes, e.g. "x = -··-". One problem with VLC is uniqueness: since the bitstrings have different lengths, the decoder must be able to tell where a specific codeword ends. Various algorithms are available for constructing a uniquely decodable VLC for a given source with its PDF, such as the Huffman algorithm. Thus entropy coding basically means optimally adapting a lossless encoding to a given source with its PDF. Various other algorithms for this are available. A very efficient class among them is arithmetic coding, which is used for instance in the latest MPEG and ITU video coding standards. Especially context-adaptive binary arithmetic coding (CABAC) [7] has proven excellent performance and can similarly be applied to 3D mesh compression.
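To make the VLC idea concrete, here is a minimal sketch of the standard Huffman construction (a generic textbook algorithm, not code from the cited standards):

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """probabilities: dict symbol -> probability. Returns symbol -> bitstring."""
    tiebreak = count()   # unique counter; avoids comparing dicts on equal probability
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # the two least probable subtrees ...
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))  # ... are merged
    return heap[0][2]

# huffman_code({"e": 0.5, "t": 0.25, "x": 0.25}) assigns "e" a one-bit
# codeword and the rarer symbols two bits each.
```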
8.1.3 Basics on Mesh Compression

The compression algorithms for meshes can be divided into two categories: single rate encoders and progressive encoders. Single rate encoders compress the mesh into a single bit stream which contains both the connectivity and the geometry information of the mesh. The decoder reads the bit stream and reconstructs all polygons and vertex positions. Progressive encoders first simplify the mesh by a sequence of simplification operations, which results in a base mesh that contains fewer polygons than the original mesh. The encoder compresses the connectivity and the geometry information of this base mesh into a bit stream, followed by a sequence of operations that undo the simplification operations. The decoder first reconstructs the base mesh, then decodes the operations and applies them to the base mesh until the original mesh is reconstructed. Ideally, the compressed bit stream that is created by a progressive encoder is of the same size as the bit stream that a single rate encoder produces. Note that a progressive encoder does not construct a multi resolution mesh but enables a streaming of the mesh whereby the quality of the mesh improves as more bits arrive. A well-designed progressive encoder optimizes the rate-distortion curve as shown in Fig. 8.5 such that the mesh is well approximated even if only a few bits have arrived.
Fig. 8.5. Comparison of rate-distortion curves
Single rate encoders often work by traversing the mesh using a half edge data structure, as shown in Fig. 8.6. To encode the connectivity, the traversal of the mesh generates a symbol for each triangle which is entropy encoded. Often, the encoding of the geometry is steered by the traversal order of the mesh. Each time a new vertex is visited by the traversal, its coordinates are predicted from already visited vertices and the difference between the predicted location and the true location is encoded. In order to improve the encoding, a quantization step first discretizes the vertex locations onto a given grid. The granularity is often referred to as a k-bit quantization if the grid has 2^k cells. Single rate decoders reverse the encoding process: because the decoder knows how the encoder worked, it can recreate the connectivity as well as the geometry information by decoding the bit stream and replaying the encoding process.
Fig. 8.6. (a) The connectivity of a mesh is encoded by an inspection of the mesh which produces a sequence of symbols that can be compressed well. Often, this inspection walks over the mesh and encodes the walk. The compression of geometry starts with a quantization of the coordinates to discrete locations, resulting in integer values that can be compressed similarly to the connectivity symbols. Often, the compression ratio of the geometry can be increased significantly by applying a prediction scheme based on already visited vertices, steered by the connectivity walk. (b) The decoder knows how the encoder works and can replay the encoding process in reverse
Note that due to the quantization of the geometry information, the mesh cannot be recreated at its complete original quality. But because the error is (in most cases) not visible and negligible, such an encoding is still referred to as lossless encoding (in contrast to the term lossy compression as it will be defined later on). Progressive encoders first simplify the mesh by a sequence of simplification operations. The resulting base mesh is compressed with a single rate encoder. The sequence of simplification operations is compressed afterwards by encoding a sequence of refinement operations that undo the simplification. Basically, the encoded representation of the refinement operations must specify where and how the decoder can refine the mesh. Progressive encoders differ in how they specify this information, i.e. how they specify the location inside the mesh as well as the type of a refinement (and simplification) operation. A well-designed progressive encoder improves the visual quality of the mesh very quickly as more bits of the model arrive and thus optimizes the rate-distortion curve: starting with a distorted base mesh, the distortion is decreased as quickly as possible. Additionally, encoders (both single rate and progressive) can be classified into lossy and lossless encoders. Lossless encoders create a bit stream from which the mesh can be recreated completely. But because the mesh itself is intended to approximate a surface in R³, it is often beneficial to choose a different mesh that also approximates the surface but can be compressed much better. The encoder first transforms the mesh into another mesh – a process called remeshing – and compresses the resulting mesh afterwards. Such encoders are lossy because the decoder reconstructs the modified mesh and not the original mesh. Remeshing techniques have opened the door for wavelet analysis of meshes, resulting in compact multi resolution representations of meshes based on wavelets. Up to now, only single meshes have been discussed. But modern animation frameworks are able to create a series of meshes that together form an animation. Such animated meshes are often called dynamic meshes. The next sections give an overview of the encoders for static (single) meshes and for dynamic meshes.
8.2 Compression of Static Meshes

The compression of static meshes was first explored for single rate encoders and was mainly driven by the hardware acceleration community in order to speed up the transmission of meshes to the graphics accelerator. With the advance of the Internet and the need to transmit meshes, progressive encoders have been developed that enable an early preview of the mesh, which gets more and more detailed as more data arrive. We first give an overview of the main approaches for single rate encoders and move on to progressive encoders later.
8.2.1 Single Rate Encoders

We want to summarize the most successful approaches for single rate encoders. The first part covers approaches to compress the connectivity information, while the second part covers geometry compression.

Connectivity Coding

Beginning with the hardware-supported work of Michael Deering, we move on to methods that consider the mesh as a connected graph and encode spanning trees of the graph. We describe the first method that uses such a scheme, Topological Surgery by Taubin et al. [8], and proceed with a description of methods that encode the tree as a traversal of mesh elements. Depending on the type of the traversed mesh elements, we distinguish face-based methods like Rossignac's EdgeBreaker [9], edge-based methods like Isenburg's FaceFixer [10], and vertex-based methods like the coder of Touma and Gotsman [11]. Finally, we describe predictive as well as spectral geometry compression techniques. Today's graphics hardware mainly supports the rendering of triangles because a triangle always lies in a plane, is always convex, and interior values can easily be expressed by linear combinations of the values at its three vertices. Each triangle is specified by its three vertices, where each vertex specifies three coordinates and possibly the surface normal, material properties and/or texture coordinates. If a coordinate is represented with 4 byte floating point values and the material is an RGBA colour composed of 1 byte per colour channel, the representation of a vertex needs a total of 3 × 4 + 4 = 16 bytes. A simple approach sends for each triangle all its three vertices, resulting in 48 bytes per triangle. Recall that the number of triangles per vertex in an average triangle mesh is six, so each vertex is transmitted six times. Modern graphics APIs support the transmission of a triangle mesh as a triangle strip (Fig. 8.7). The triangles of the mesh are ordered in a strip such that two consecutive triangles of the strip share an edge. From the second triangle on, the next triangle is formed by the two vertices of the shared edge and a new vertex, so for each triangle of the strip only the new vertex is transmitted. Because a triangle mesh contains twice as many triangles as vertices, the maximal gain is to transmit each vertex only twice. To improve on this, Deering [12] proposed to use a vertex buffer which temporarily stores vertices on the graphics accelerator. The transmitted triangle strip introduces either a new vertex, which is pushed onto the vertex buffer, or a reference into the vertex buffer that re-uses the information of a vertex that has been transmitted before. Deering first decomposes the mesh into triangle strips and identifies multiply used vertices. Given a triangle strip, the indices of the identified multiply used vertices are computed and the strip is stored as a triangle strip that marks multiply used vertices and uses indices to access them again. Deering calls his mesh representation the generalized triangle mesh; it is perfectly suited for a hardware implementation because the decoder needs a single run over the strip in order to decode the mesh. Deering restricted the size of the vertex buffer to at most 16 vertices, so no strip can contain more than 16 different references to vertices.
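For reference, decoding a plain triangle strip (without Deering's buffer re-use) takes one new vertex per triangle; the sketch below is our own illustration, not code from [12]:

```python
def strip_to_triangles(strip):
    """strip: list of vertex indices. Returns the list of decoded triangles."""
    triangles = []
    for i in range(2, len(strip)):
        a, b, c = strip[i - 2], strip[i - 1], strip[i]
        # Alternate the winding so all triangles keep the same orientation.
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles

# strip_to_triangles([0, 1, 2, 3, 4]) -> [(0, 1, 2), (2, 1, 3), (2, 3, 4)]
```

Deering's generalized triangle mesh adds mark and reference opcodes on top of this basic scheme.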
Fig. 8.7. This mesh can be represented as one triangle strip. The grey vertices are submitted twice as can be seen in the upper right triangle strip. Deering’s generalized triangle mesh marks doubly used vertices such that they can be identified and pushed onto a stack. When the vertex is referenced again, only an index into the stack needs to be transmitted. The O and M opcodes specify at which edge the new vertex is to be added
A triangle strip is part of a triangle spanning tree (Fig. 8.8). Trees are an important tool for understanding the basic ingredients that are needed to encode the connectivity of the mesh. A triangle mesh can be considered as a graph whose nodes are the vertices of the mesh and whose edges are the edges of the mesh. This connectivity graph has a dual graph whose nodes are the triangles, while its edges are defined between triangles that are adjacent to each other. For simple meshes, the Euler formula v − e + f = 2 holds. If we look at spanning trees of both the connectivity graph and its dual graph, we see that the two trees have v − 1 edges and f − 1 edges, respectively. For simple meshes, we see that the sum of both fulfils the Euler equation: v − 1 + f − 1 = v + f − 2 = e.
Fig. 8.8. Left: The connectivity graph (black) and the dual graph (grey). Right: Spanning trees of both graphs. Note that the number of edges of the vertex spanning tree is v − 1 and the number of edges of the dual spanning tree is f − 1, which sums up to v + f − 2 and equals e − 1, fulfilling the Euler equation with χ = 1 for this manifold mesh with one boundary loop
Fig. 8.9. Topological Surgery cuts the mesh along edges (grey) of a vertex spanning tree (a) resulting in a flat simple polygon (b). Both the vertex spanning tree (c) and the simple polygon are encoded. The decoder reconstructs the vertex spanning tree, doubles its edges into a boundary loop (d), identifies pairs of edges along the boundary loop and recreates the mesh by filling the decoded simple polygon between pairs of identified edges
Compressing both trees is sufficient to encode the connectivity completely; this was first exploited by Taubin with Topological Surgery [8] (Fig. 8.9). The encoder cuts the mesh along a vertex spanning tree, which results in a connected simple polygon, i.e. a polygon that has no internal vertices, as shown in Fig. 8.9. The simple polygon is decomposed into a triangle spanning tree. Both the vertex spanning tree and the triangle spanning tree are encoded. The decoder first decompresses the vertex spanning tree, doubles each of its edges, decodes the triangle spanning tree, and reinserts the resulting triangles between pairs of the doubled edges of the vertex spanning tree. Instead of compressing the spanning trees of the connectivity graph and its dual graph explicitly, they can be compressed together in a single run over the mesh. These types of algorithms differ in the type of the mesh elements that are traversed: faces, edges, or vertices. Rossignac proposed EdgeBreaker [9] (Fig. 8.11), which is a simplified version of Gumhold's Cut-Border Machine [13] (Fig. 8.10). A region is grown starting with a single triangle. The region contains all triangles that have already been encoded. The border of the region consists of edges of the mesh and divides the mesh into the region of processed triangles and a region of triangles that are still to be processed. The border is called the cut-border. Each of its edges connects a triangle of the inner region to a triangle of the outer region. A selected edge of the cut-border is called the gate and defines the triangle that is encoded next. This next triangle can be located in only five different ways with respect to the gate and the cut-border. Each possibility has an opcode assigned to it, and the opcode of the next triangle is added to the compressed stream. After the next triangle has been encoded, a new gate is chosen and the encoder continues. The decoder knows how the encoder worked and can reconstruct the mesh by replaying the role of the encoder (Fig. 8.12). A special situation arises when the decoder comes to a Split operation. Here, the new triangle is formed by the two vertices of the gate and another vertex that is located somewhere on the cut-border. Because EdgeBreaker does not write the index of this vertex into the bit stream, the decoder computes all indices of Split operations in a preprocessing step. Each Split operation has an End operation associated with it that closes the cut-border introduced by the Split.
Fig. 8.10. (a) The terminology of the traversal methods. An inner part grows by absorbing mesh elements of the outer part. A set of edges separates the inner part from the outer part and is called the cut-border. A selected edge of the cut-border, the so-called gate, determines the next triangle that is absorbed. There are a total of six possibilities for how the next triangle can be located with respect to the gate and the cut-border (b–g). (b) The next triangle has a vertex that has not been processed before. (c, d) The next triangle has a vertex that is the next or the previous vertex of the gate. (e) The next triangle has a vertex that is on the cut-border but neither the next nor the previous vertex of the gate. (f) The next triangle has a vertex that is both the next and the previous vertex. (g) The next vertex lies on a cut-border that is on the stack
All operations on the way through the symbol string from the Split to the End modify the length of the cut-border until it reaches zero after the End operation. Accumulating these length changes yields the correct split offset. The decoder of EdgeBreaker needs two runs over the bit stream, which makes a hardware implementation very difficult. Therefore, two other decoding techniques have been proposed: Wrap & Zip [14] and Spirale Reversi [15] (Fig. 8.13). The idea of Wrap & Zip is to postpone the identification of a vertex. The R and S operations create a dummy vertex that will later be identified with a different dummy vertex on the cut-border. When the decoder reaches an L or E operation, it is in the state to identify dummy vertices and to merge pairs of them, which is called zipping. Spirale Reversi decodes the symbol stream in reverse, starting with the last symbol. The operation that is performed is inverted such that E creates three vertices, R and L create one vertex, while C creates no vertex. So the decoder first finds an E operation and only later the accompanying S operation. The counting of the change of the cut-border length is done implicitly by modifying the cut-border during decoding. Although EdgeBreaker can be extended to handle polygonal meshes, algorithms that are designed to handle such meshes perform better.
Fig. 8.11. The final twelve stages of EdgeBreaker encoding a mesh and producing the sequence of symbols CRRCSCRRRERSERRE. Gray areas are encoded, white areas are to be encoded. Black vertices are encoded, white vertices have not been encoded yet
Isenburg introduced FaceFixer [10], a region-growing method that encodes edges. FaceFixer can therefore encode polygonal meshes and is not restricted to triangular meshes. Similar to EdgeBreaker, a set of opcodes specifies how the edges of a polygon that is incident to the current cut-border are located with respect to the cut-border. Touma and Gotsman [11] produce a stream of symbols that mainly contains the valences of processed vertices (Fig. 8.14). If a mesh has a low variance in the valences of its vertices, the stream of symbols contains only a few distinct valences and can be compressed well. Recall that large meshes have an average valence of six, so the valences tend to cluster around the value 6. Furthermore, the compression ratio can be increased significantly if the mesh is remeshed into a (semi-)regular mesh which contains only few distinct valences (in the best case just the valence 6).
Fig. 8.12. EdgeBreaker decodes the connectivity of the mesh by replaying the work of the encoder. A preprocessing step is necessary to compute the split offsets, i.e. the relative location of the split vertex along the boundary. Thereby, the changes of the length of the cut-border are accumulated as shown in the following table; the Split and End opcodes form nested pairs such that the difference between the cut-border lengths of a Split and its according End is the required relative index.

symbol |    | C  | R  | R  | C  | S  | C  | R  | R  | R  | E  | R  | S  | E  | R  | R  | E
dif    |    | +1 | −1 | −1 | +1 | +1 | +1 | −1 | −1 | −1 | −3 | −1 | +1 | −3 | −1 | −1 | −3
len    | 12 | 13 | 12 | 11 | 12 | 13 | 14 | 13 | 12 | 11 | 8  | 7  | 8  | 5  | 4  | 3  | 0

The relative vertex index of the marked Split computes to 13 − 8 = 5, which can easily be verified in the figures.
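This offset computation can be sketched in a few lines; the following is our own illustration of the preprocessing step, not code from [9]:

```python
def split_offsets(clers, initial_length):
    """Pair each S with its matching E and return the relative split indices."""
    delta = {"C": +1, "L": -1, "R": -1, "S": +1, "E": -3}
    length, stack, offsets = initial_length, [], {}
    for pos, symbol in enumerate(clers):
        length += delta[symbol]
        if symbol == "S":
            stack.append((pos, length))          # S and E nest like parentheses
        elif symbol == "E" and stack:            # the final E closes the whole border
            s_pos, s_length = stack.pop()
            offsets[s_pos] = s_length - length   # relative index of the split vertex
    return offsets

# split_offsets("CRRCSCRRRERSERRE", 12) yields 5 for the first S,
# matching the 13 - 8 = 5 computed in the table above.
```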
The TG coder first connects each boundary loop of the mesh to a dummy vertex, as shown in Fig. 8.14. Next, a focus vertex is selected and its valence is output together with the valences of the two other vertices of an incident triangle. The edges of this triangle are marked as conquered and belong to the initial cut-border. The cut-border is grown by selecting more vertices around the focus vertex and outputting their valences. The free edges of the focus vertex, i.e. the not-yet-processed edges, are iterated counter-clockwise around the focus vertex and the valences of their free vertices are output. The free edges and their free vertices are marked as conquered. When all free edges of the focus vertex have been processed, the focus moves on to the next vertex along the cut-border until all edges have been conquered. If the dummy vertex becomes the focus vertex, the special symbol "dummy" is output together with the valence of the dummy vertex. A Split situation can still arise, where the cut-border is split into two parts with one part being pushed onto a stack. The Split operation needs to encode the number of edges along the cut-border from the focus vertex to the opposite vertex. The valence-based coder of Touma and Gotsman was later improved by Alliez and Desbrun [16] with the goal of reducing the number of split situations.
Fig. 8.13. EdgeBreaker decoding with Spirale Reversi. The connectivity symbol string is decoded in reverse starting with the last symbol: ERRESRERRRCSCRRC. E creates three vertices, R and L create one vertex while C creates no vertex. S needs no split offset anymore because the gates are located such that they point to the three vertices that form the triangle
Because such split situations tend to arise in convex regions, vertices in concave regions are favoured as focus vertices over vertices in convex regions. Using a heuristic, the next focus vertex is chosen as the vertex with the minimal number of free edges. If this choice is not unique, an average number of free edges is considered which also takes neighbouring vertices along the cut-border into account.

Geometry Coding

Compressing the geometry of a mesh is mostly steered by the connectivity compression. Many modern algorithms use prediction techniques in order to achieve high compression ratios (Fig. 8.15). First, the coordinates of the vertices are quantized to a user-given number of bits, which results in integer values for each coordinate.
Fig. 8.14. The TG coder produces the sequence 7 6 5 5 4 5 6 5 5 5 5 4 Dummy 10 3 5 4 for this sample mesh (in contrast to EdgeBreaker: CRRCSCRRRERSERRE)
When the mesh traversal arrives at a new vertex, its coordinates are predicted from already processed vertices. The difference between the predicted position and the real position is encoded. The encoder performs the following steps:

• Quantize the coordinates to a given number of bits
• Initialize: encode the first three vertex locations
• Compute the prediction predict(i) of each subsequent location p_i (in the order given by the connectivity coder) and encode only the difference vectors d_i = p_i − predict(i)

The decoder undoes these steps:

• Initialize: decode the first three vertex locations
• Compute the prediction of each subsequent location, decode the difference vector, and compute the final position p_i = d_i + predict(i)

Fig. 8.15. Geometry compression: the original coordinates (top left) are quantized (top right) and predicted (bottom). The difference between the original and the predicted coordinates tends to be small and can be encoded efficiently. The prediction computes to predict(i) = p_{i−1} (bottom left) and predict(i) = 2p_{i−1} − p_{i−2} (bottom right)
Parallelogram prediction is one of the most frequently used techniques; it was introduced by Touma and Gotsman together with their valence-based connectivity compression [11]. As shown in Fig. 8.16, the three vertices p_1, p_2 and p_3 of the triangle at the gate predict the vertex p_predict = p_2 + p_3 − p_1, which differs only by a small difference d from the correct vertex p of the next triangle.
Fig. 8.16. Parallelogram Prediction
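A minimal sketch of this scheme (our own, with made-up coordinates) shows how encoder and decoder share the predictor so that only the residual d has to be transmitted:

```python
import numpy as np

def predict(p1, p2, p3):
    """Complete the parallelogram spanned by the gate triangle (p1, p2, p3)."""
    return p2 + p3 - p1

# Encoder side: transmit only the (typically small) residual d.
p1, p2, p3 = np.array([0, 0, 0]), np.array([2, 0, 0]), np.array([1, 2, 0])
p = np.array([3, 2, 1])                  # true position of the new vertex
d = p - predict(p1, p2, p3)

# Decoder side: the same predictor plus the residual recovers the vertex.
assert (predict(p1, p2, p3) + d == p).all()
```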
Spectral coding [17] can also be used to compress geometry information. The basic idea is to transfer concepts that are known from one-dimensional and two-dimensional signals to three-dimensional geometry. The signal is transformed into a new basis and the basis functions are sorted by their importance, like in the discrete cosine transform of the JPEG image compression standard. Low frequencies correspond to important information, while high frequencies correspond to details and can be skipped. Good approximations of the signal can be achieved by considering just the most important basis functions. For geometry, the eigenvectors of the Laplacian matrix correspond to the basis functions. The Laplacian L of a mesh is defined by the valences of its vertices and the connectivity graph as L = V − A, with V the diagonal valence matrix and A the adjacency matrix of the mesh, as shown in Fig. 8.17. An eigenanalysis yields the decomposition

$$L = U D U^T,$$

where U contains the eigenvectors of L sorted by their corresponding eigenvalues. Because the decomposition of such a large matrix runs into numerical instabilities (many eigenvalues tend to have very similar values such that the problem is ill-conditioned) and is expensive in terms of computation time, the mesh is partitioned into several pieces and an eigenanalysis is computed per piece. The geometry is expressed as a single matrix P that contains all coordinates of the mesh vertices p_i. This geometry matrix P is projected into the new basis by

$$\tilde{P} = U^T P.$$

The coordinates of P̃ that correspond to high frequencies are small and can be skipped, while the other coordinates can be encoded efficiently. Such a representation is not only very compact but also allows for a progressive transmission of the geometry information of the mesh: starting with the most important coordinates, the geometry gets more detailed as the less important coordinates arrive.
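A toy-sized sketch of this pipeline (our own, using numpy; real coders partition the mesh first, as noted above) looks as follows:

```python
import numpy as np

def spectral_compress(P, A, keep):
    """P: (n, 3) vertex coordinates; A: (n, n) 0/1 adjacency matrix;
    keep: number of low-frequency coefficients to retain."""
    V = np.diag(A.sum(axis=1))          # diagonal valence matrix
    L = V - A                           # Laplacian L = V - A
    eigenvalues, U = np.linalg.eigh(L)  # eigh sorts eigenvalues in ascending order
    P_tilde = U.T @ P                   # project the geometry into the spectral basis
    P_tilde[keep:, :] = 0               # drop the high-frequency coefficients
    return U @ P_tilde                  # lossy reconstruction of the geometry
```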
Fig. 8.17. A sample mesh and its adjacency matrix A together with a typical Laplacian star as it is defined by the Laplacian matrix L = V − A
Geometry Images [18] are a lossy coding technique where a manifold triangular mesh is remeshed onto a regular grid, i.e. an image. The geometry is represented by the RGB channels of the image, while the connectivity is given implicitly by the regular structure of the grid. The image can be compressed using any standard image encoder. Basically, the mesh is cut along a set of edges such that it becomes topologically equivalent to a disk and can be parameterized. The domain of the parameterization is the unit square. Given the parameterization, the unit square is point sampled at regular discrete locations. Every such location has a corresponding point on the original mesh (which does not need to be an original vertex but will most likely be a point inside a triangle of the original mesh) whose x, y, z coordinates are written into the RGB channels of the image. Special care needs to be taken to find the edges that form the cut as well as the parameterization. A geometry image is rendered simply by drawing two triangles for each 2 × 2 quad of the image and taking the RGB values as 3D coordinates. The result can be enhanced greatly if not only the geometry image is rendered but also a normal map, which usually has a higher resolution than the geometry image. The normal map defines all normals for the interior of a triangle and is mapped using standard texture mapping hardware. Normal maps are a derivative of the well-known bump mapping. Geometry images enable an easy multi resolution representation of the mesh: the image is simply stored as a mipmap. The compression ratios depend on the quality of the image encoder. They have later been optimized by using texture atlases instead of a single texture.

Discussion

The different approaches are difficult to compare because the algorithms and results presented in papers often use different coding back ends. One of the major limitations for achieving a high compression ratio is the handling of split operations, which arise for all face-, edge- and valence-based methods. Thus, Alliez and Desbrun could improve the compression ratios of the TG coder basically just with a more sophisticated handling of split operations. A theoretical analysis of the different algorithms is also difficult and could only be achieved for EdgeBreaker due to its simplicity. The original coder has an upper bound of 4 bits per vertex for simple meshes: the connectivity of any mesh, regardless of how that connectivity looks, can be encoded with at most 4 bpv. Of course, there are many meshes that can be compressed with a better compression ratio. Using a different coding back end, the theoretical upper bound of EdgeBreaker could later be improved to 3.55 bpv. Recall that the theoretical optimum for planar meshes with three boundary edges is 3.24 bpv, so EdgeBreaker comes close to this limit. Nevertheless, the compression ratios of
valence-based techniques tend to be better; they can compress a regular mesh to nearly nothing because the compressed sequence of symbols contains just sixes. The connectivity compression ratios are summarized in Table 8.1, expressed in bits per vertex (bpv). Geometry compression depends not only on the chosen technique but also on the chosen quantization level. Furthermore, the geometric distortion needs to be considered, especially for lossy geometry compression. Static mesh coding, exploiting spatial dependencies of adjacent polygons, is also part of MPEG-4. It is based on Topological Surgery. The MPEG-4 3D Mesh Coder (3DMC) allows a 30–40 times compression of the IndexedFaceSet node adopted from VRML describing a static polygonal mesh. The rate-distortion curve of Fig. 8.18 shows the lower surface error of lossy spectral geometry compression (KG) at low bit rates compared to the prediction coder of Touma and Gotsman (TG). With increasing bit rates, both techniques achieve a similar compression ratio with a similar distortion. In order to capture mesh errors that are visible to humans, Karni and Gotsman introduced a visual error metric. Given a vertex v_i and its set of incident vertices N(i), with l_ij the geometric distance from v_j to v_i, the basis of the visual error is given by a Laplacian

$$GL(v_i) = v_i - \frac{\sum_{j \in N(i)} l_{ij}^{-1} v_j}{\sum_{j \in N(i)} l_{ij}^{-1}}.$$

The visual error is defined based on both the geometric distance between vertices and the norm of the Laplacian difference. Given two meshes M and M̂ with n vertices each, the visual error is defined as

$$\left\| M - \hat{M} \right\|_V = \frac{1}{2n}\left( \left\| v - \hat{v} \right\| + \left\| GL(v) - GL(\hat{v}) \right\| \right),$$

where v denotes the geometry of all n vertices. Geometry images are usually sampled into a 256 × 256 pixel image, which is then compressed. But for high visual quality, a normal map must be stored together with the geometry image. The image of the normal map usually contains much more detail than the geometry image and thus cannot be compressed at the same ratio. Furthermore, the normal image has a higher resolution, e.g. 512 × 512.

8.2.2 Progressive Encoders

Progressively encoded meshes allow an intermediate mesh reconstruction at the decoder side while encoded mesh refinement data is being transmitted. We distinguish between lossless and lossy progressive coding techniques, depending on whether the original mesh or a modified mesh is compressed.
Table 8.1. Compression ratios for connectivity compression

Author | Name | Type | Connectivity (bpv) | Remarks
Deering 95 [12] | Generalized Triangle Mesh | Triangle strips | ∼ 8–11 | Implemented in Java3D
Taubin & Rossignac 98 [8] | Topological Surgery | Dual trees | ∼ 4 | –
Gumhold & Straßer 98 [13] | Cut-Border Machine | Face based | ∼ 4.36 | One symbol per face, f = 2v symbols; stores split offsets explicitly
Rossignac 99 [9] | EdgeBreaker | Face based | ∼ 3 | One symbol per face, f = 2v symbols; 4 bpv guaranteed (later improved to 3.55)
Isenburg & Snoeyink 00 [10] | FaceFixer | Edge based | ∼ 2.5 | One symbol per edge, e = 3v symbols; arbitrary polygons
Touma & Gotsman 98 [11] | The TG coder | Valence based | ∼ 2 | One symbol per vertex, v = 0.5f symbols; ∼ 0 when regular!
Alliez & Desbrun 01 [16] | Adaptive valence-based | Valence based | ∼ 1.85 | One symbol per vertex, v = 0.5f symbols; ∼ 0 when regular!
Fig. 8.18. The solid lines show the visual error of lossy spectral compression (Karni and Gotsman, KG) for different bit rates. The single dots show the visual error of the vertex-based Touma-Gotsman coder for the same bit rates
Lossless Progressive Coding Techniques

In order to compress a mesh progressively in a lossless way, it is gradually simplified to a coarse base mesh, whereby each simplification operation is recorded. Applying the inverse operations to the base mesh in reverse order restores the original mesh. Hoppe [19] first developed the concept of progressive mesh (PM) coding (Fig. 8.20), which was later improved by several authors [20, 21, 22]. The PM algorithm simplifies an orientable 2-manifold mesh with successive edge collapse operations (Fig. 8.19). An edge is collapsed by merging its two endpoints into one point; the two triangles incident to the edge are thus removed, and all edges that connected to the two original vertices are re-connected to the merged vertex. The vertex split operation is the inverse of the edge collapse operation: it inserts a new vertex into the mesh together with new edges and triangles.
Fig. 8.19. Edge collapse and vertex split operation
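A minimal index-based sketch of the edge collapse operation (our own illustration, ignoring the connectivity bookkeeping that a real half edge implementation would maintain):

```python
def edge_collapse(faces, a, b, positions, new_position):
    """Merge vertex b into vertex a and drop the two degenerate triangles."""
    positions[a] = new_position                        # e.g. midpoint or optimized point
    result = []
    for tri in faces:
        tri = tuple(a if v == b else v for v in tri)   # re-connect b's edges to a
        if len(set(tri)) == 3:                         # the triangles incident to (a, b)
            result.append(tri)                         # degenerate into lines and vanish
    return result
```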
Fig. 8.20. PM coding: a) original mesh c–d) progressive mesh representations
Applying a sequence of edge collapse operations to the original mesh yields a simplified mesh. Since edge collapse operations are invertible, the original mesh can be represented by the simplified mesh together with the sequence of corresponding vertex split operations. The order of the edge collapse operations is essential for the construction of the progressive mesh. Hoppe uses an energy function which takes into account the distortion introduced by a potential edge collapse operation. The algorithm assigns an energy value to each edge and uses a priority queue to determine the edge carrying the least energy, which corresponds to the least distortion. This edge is then selected for collapsing, which requires encoding a random access to edges. Hence, despite its innovative nature, the PM algorithm is not very efficient for compression; compression rates of about 35 bpv are reported [22]. Ronfard and Rossignac [23] suggested a different scheme for determining the edge collapse costs, based on three ingredients: a tessellation (topological) criterion, a geometrical criterion, and a relaxation process. The tessellation criterion is used to evaluate the edges, preventing triangle flips (Fig. 8.21). A triangle flip occurs when the normal of a triangle flips its direction by 180°. The criterion computes the angle A_t between the original and updated normal for all triangles t affected by an edge contraction, and selects the maximum value as the cost:

$$LTE(V_1, V_2) = K \max_{t \in \text{triangles}(V_1, V_2)} A_t.$$
The geometrical criterion is used to evaluate the distortion caused by the edge contraction. For each vertex, a "star" of incident edges is kept. From this star, one can determine a set of planes which meet at the vertex. The squared distance of a point x from any of these planes p can be determined as d(x, p) = (x^T p)².
Fig. 8.21. Triangle flip: original tessellation, valid collapse, invalid collapse
The geometrical criterion value is determined as the maximum of these distances from the new vertex position to all of the planes incident with the original vertices:

$$LGE(V_1, V_2) = \max_{p \in \text{planes}(V_1, V_2)} d(V_2, p).$$
From the equation it follows that the new position of the vertex is one of the positions of the original vertices. Note that the set of planes associated with the new vertex is the union of the sets of planes of the original vertices, i.e. it is not recomputed from the new tessellation of the neighbourhood of the new vertex. The overall cost is then determined as the maximum of LGE and LTE. The authors have also suggested a relaxation process which moves the replacement vertex to a position that better fits the original local shape. The proposed algorithm finds the minimum of the sum of distances to the set of incident planes and sets the position of the vertex to this minimum:

$$V_2^* = \arg\min_{x} \sum_{p \in \text{planes}(V_2)} d(x, p).$$
Note that the set planes(V_2) now already contains the union of the two original sets of planes. The authors discuss the possibility of using this optimized position in the cost computation of edges, but state that it would be inefficient. Garland [24] introduced quadric error metrics, a simple scheme to compute (and minimize) the squared distances of a point to a set of planes. With it, it is not necessary to store the set of planes explicitly as was done by Ronfard and Rossignac [23]. Each such quadric consists of 10 floating point values and must be stored for every vertex of a triangular mesh. If two points collapse into a new one, the quadric of the new point is simply calculated by adding the quadrics of both collapsing points, which means just 10 additions of floating point values. Furthermore, Garland introduced general pair contractions in contrast to pure edge contractions, which easily allows changes in the topology of the mesh during simplification. Later on, Garland et al. [25] extended the concept of quadric error metrics from triangular meshes to arbitrary simplicial complexes with support for attribute values like colours or texture coordinates. The concept of pair contraction is a simple augmentation of the edge contraction described above. The difference is that a pair can be either an edge or a couple of vertices which is not connected by an edge but lies very close together. Using pair contraction instead of edge contraction allows simplification of the topology of the mesh, i.e. connecting previously unconnected components of the mesh.
The algorithm works in the following steps:

• select valid pairs for contraction
• evaluate the pairs (see below)
• sort the pairs according to their evaluation
• iteratively contract the pairs and update the mesh
The key step is the evaluation of the pairs. First, we describe the situation at one vertex. A vertex is an intersection of the planes in which its incident triangles lie. To express the squared distance of a point x from a plane p, we use a simple dot product: d_p(x) = (x^T p)², where x is the position represented in homogeneous coordinates and p is the vector of coefficients of the implicit plane equation. If we now want to express the squared distance from the set of planes incident with some vertex v, we simply add the distances together:

$$d_v(x) = \sum_{p \in \text{planes}(v)} (x^T p)^2.$$

The sum can be rewritten as follows:

$$d_v(x) = \sum_{p \in \text{planes}(v)} (x^T p)(p^T x) = \sum_{p \in \text{planes}(v)} x^T (p p^T) x = x^T \Bigl( \sum_{p \in \text{planes}(v)} p p^T \Bigr) x = x^T Q x.$$

From the last expression it follows that computing the sum of squared distances from the set of planes incident with a given vertex can be expressed by a quadratic form. This form can be pre-computed for each vertex of the original mesh. When a pair is considered for contraction, the quadrics of its endpoints are simply added together. The final position after the contraction is found by minimizing the error measure. The minimum is found at the zero of the first derivative of the quadratic form, i.e. by solving the following set of equations:

$$\begin{bmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{21} & q_{22} & q_{23} & q_{24} \\ q_{31} & q_{32} & q_{33} & q_{34} \\ 0 & 0 & 0 & 1 \end{bmatrix} x^* = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix},$$
where q_xy are the elements of the summed quadric Q_12 = Q_v1 + Q_v2. Note that the last line of the matrix expresses that we are looking for a solution with homogeneous coordinate equal to 1. The overall error measure for a given pair is expressed as x^{*T} Q_12 x^*. This measure is then used to sort the pairs, and the pairs with the lowest error measure are contracted first.
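A minimal numerical sketch of this machinery (our own, using numpy; production code adds a fallback, e.g. the edge midpoint, when the system is singular):

```python
import numpy as np

def plane_quadric(p):
    """p = (a, b, c, d) for the plane ax + by + cz + d = 0, (a, b, c) unit length."""
    p = np.asarray(p, dtype=float).reshape(4, 1)
    return p @ p.T                                   # Q = p p^T

def optimal_position(Q):
    """Solve the 4x4 system above for the error-minimizing homogeneous point."""
    A = Q.copy()
    A[3, :] = [0.0, 0.0, 0.0, 1.0]                   # enforce homogeneous coordinate 1
    x = np.linalg.solve(A, np.array([0.0, 0.0, 0.0, 1.0]))
    return x[:3], float(x @ Q @ x)                   # position and error x^T Q x

# Three planes meeting at (1, 2, 3): the optimum recovers that point, error 0.
Q12 = sum(plane_quadric(p) for p in [(1, 0, 0, -1), (0, 1, 0, -2), (0, 0, 1, -3)])
position, error = optimal_position(Q12)
```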
Lindstrom and Turk [26] dropped the assumption that the edge collapse cost should be computed with respect to the original mesh. Their algorithm only uses the current simplified version of the mesh to compute the contraction cost of each edge. Their criterion is based on local volume preservation, which leads to global volume preservation – a property that some simple vertex placement schemes, including the one presented in [23], do not provide. We first describe the algorithm that sets the position of the new vertex after an edge collapse, and then show a cost function which is closely related to it. The new vertex position is set by searching for an intersection of three planes, each of which represents some constraint on the position of the vertex. The authors provide equations for several possible constraints and propose an ordering in which these constraints are evaluated and their planes constructed. It is possible that some of the constraints produce planes that are almost coplanar, i.e. an underdetermined system which is easily disturbed by rounding errors. Such a case is detected by checking the angle between the planes; cases where the angle is lower than 1° are called α-incompatible, and the next constraint from the priority ordering is selected. The possible constraints are (in the order in which they are evaluated):
• volume preservation
• boundary preservation
• volume optimization
• boundary optimization
• triangle shape optimization
The volume preservation constraint enforces that the volume is not changed after the edge is collapsed. The space between the original and the new tessellation of the neighbourhood of the collapsed edge is divided into tetrahedra, each having its base in one of the original triangles and its top at the new vertex (Fig. 8.22). The signed volume of a tetrahedron can be expressed as

$$V(t) = \frac{1}{6} \begin{vmatrix} v_x & v_{0x} & v_{1x} & v_{2x} \\ v_y & v_{0y} & v_{1y} & v_{2y} \\ v_z & v_{0z} & v_{1z} & v_{2z} \\ 1 & 1 & 1 & 1 \end{vmatrix},$$

where v is the location of the top of the tetrahedron, and v_0, v_1 and v_2 are the vertices of its base. Note that this volume is negative when the top is located below the base with respect to the normal of the base (i.e. when volume is removed) and positive when the top is above the base (i.e. when volume is added).
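As a small numerical aid (our own sketch using numpy), the determinant above translates directly into code:

```python
import numpy as np

def signed_volume(v, v0, v1, v2):
    """Signed volume of the tetrahedron with top v and base (v0, v1, v2)."""
    M = np.ones((4, 4))                  # the last row stays (1, 1, 1, 1)
    M[:3, 0], M[:3, 1], M[:3, 2], M[:3, 3] = v, v0, v1, v2
    return np.linalg.det(M) / 6.0

# Negative when v lies below the base triangle (volume removed),
# positive when it lies above (volume added).
```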
Fig. 8.22. Tetrahedra created in the neighbourhood of a contracted edge; from left to right: the two tetrahedra incident with the removed triangles, the tetrahedra incident with vertex v1, and the tetrahedra incident with vertex v2
Therefore, it suffices to sum the volumes and solve for zero:

$$\sum_{t \in \text{tetrahedra}(v_1, v_2)} V(t) = 0.$$
Solving this equality constrains the solution v to a plane. The boundary preservation constraint works in a similar manner to volume preservation, and it is evaluated only for contractions that affect at least one boundary edge. It starts with a signed area A defined for each original boundary edge e as

$$A(v, e) = \frac{1}{2}\left( v \times e_0 + e_0 \times e_1 + e_1 \times v \right).$$

Searching for a zero sum of signed areas is, however, only possible in the case of planar triangles, so the algorithm instead minimizes the following expression:

$$\Bigl( \sum_{e \in \text{boundary}} \frac{1}{2}\left( v \times e_0 + e_0 \times e_1 + e_1 \times v \right) \Bigr)^2.$$
The minimization of this term yields two additional planes. Volume optimization is similar to volume preservation, only this time the unsigned versions of the tetrahedron volumes are minimized, bringing the new surface as close as possible to the original one (the volume "between" the surfaces is minimized). The minimized expression takes the following form:

$$\sum_{t \in \text{tetrahedra}(v_1, v_2)} (V(t))^2.$$

This constraint yields up to three additional planes. The boundary optimization constraint is evaluated in an equivalent manner – the minimization expression, which produces up to three additional planes, has the following form:

$$\sum_{e \in \text{boundary}} (A(v, e))^2.$$
The triangle shape optimization constraint is only used when the previous constraints have not produced three α-compatible planes. This case usually occurs when the triangles of the original mesh are almost coplanar. In such a case, the last constraint simply prefers equilateral triangles to long ones. The minimized expression is

$$\sum_{i} (L(v, v_i))^2,$$
where L is the distance between two vertices and the v_i are the vertices adjacent to the new vertex v. The contraction cost that needs to be computed for each edge is finally obtained as a weighted sum of the volume optimization expression and the boundary optimization expression. Equal weights, as used in the paper for comparative testing against other simplification techniques, produced results comparable with the most efficient algorithms. Popovic and Hoppe [20] extended the PM approach, allowing more general operations for mesh simplification. They observed that edge collapse operations always preserve the topology of a mesh, e.g. they never change the genus of a mesh. In order to obtain a more general compression algorithm which can also be used to compress non-manifold meshes, they proposed a method called progressive simplicial complex (PSC). The basic operation performed on a mesh is the unification operation and, respectively, its inverse, the generalized vertex split operation. The unification operation merges two vertices which, in contrast to PM, do not have to share an edge. Four possible vertex configurations are acceptable. The generality of PSC comes at a price on the compression side, since more bits are demanded for connectivity coding than in PM. However, PSC can be applied to meshes of arbitrary topology. Taubin et al.'s progressive forest split (PFS) technique [21] is based on Hoppe's PMs. They achieved a more compact representation than PMs at the expense of reduced granularity. In this approach the difference between two successive levels of detail does not consist of only a single vertex split operation, as in PMs, but of a group of vertex splits, realized by a PFS operation (Fig. 8.23). In order to perform a PFS operation, rooted spanning trees are aligned with mesh edges, which are cut in a subsequent step. A subsequent triangulation fills the resulting crevice. Translations correct the positions of the new vertices which were obtained through the previous edge cutting procedure. Each forest split operation encodes the forest structure, the triangulation information, and the vertex position translations. A mesh is encoded progressively by decomposing it into layers using PFS operations. The highest compression ratios are achieved by minimizing the number of layers. Typical compression rates are slightly below 30 bpv. PFS is part of the MPEG-4 Version 2 standard. Pajarola and Rossignac proposed the compressed progressive mesh (CPM) method [22], which improves significantly on the PM approach [19].
Fig. 8.23. The forest split operation illustrated on a part of a mesh: a) red marked spanning tree b) resulting crevice after cutting through the tree c) triangulated crevice d) the refined mesh
It refines the mesh topology in batches, each of which increases the number of vertices by up to 50 percent. Grouping refinements into batches allows fewer bits to be used for connectivity compression. A butterfly scheme [27] is used for the prediction of the positions of new vertices from already decoded neighbouring vertices. CPM reportedly achieves bit rates of about 22 bpv. The compression techniques described so far are based on vertex split operations. Alliez and Desbrun proposed in [28] a completely different approach (VDC), which exploits vertex valences for compression. They observed that the average vertex valence in a mesh lies in the vicinity of 6 and that the entropy of the mesh connectivity depends on the distribution of these vertex valences. Their algorithm has two parts, a decimating conquest and a cleaning conquest, which are applied to the mesh iteratively. The decimating conquest first subdivides the mesh into patches. Each patch consists of the triangles incident to a common vertex of some valence n. In the decimating conquest the encoder enters a patch, removes the common vertex, outputs the valence n and re-triangulates the remaining hole (Fig. 8.24). This procedure is applied iteratively to neighbouring patches until all patches are processed. After the decimating conquest many vertices with valence 3 remain; these are decimated in the cleaning conquest. This approach preserves the statistical concentration of valences around 6. Geometry is encoded using simple barycentric prediction combined with a local coordinate frame: it is encoded as an offset from the predicted value relative to the local frame.
Fig. 8.24. A) a degree-6 patch B) removal of the middle vertex C) retriangulation of the resulting hole [28]
Fig. 8.25. Progressive transmission of a mesh using Alliez and Desbrun's approach [65] (Copyright 2007 IEEE)
Subsequent arithmetic coding reduces the remaining statistical dependencies. This progressive coding scheme leads to average compression rates of 14–20 bpv, which makes it one of the best state-of-the-art lossless coding schemes (Fig. 8.25). All compression methods presented so far compress a given mesh in a lossless way. However, triangular meshes are only piecewise linear approximations of real surfaces, i.e. one of many possible piecewise linear approximations with the same approximation error. Consequently, there is no need to compress a particular mesh if there exists another mesh representation of the same surface, within the same error bounds, which is better suited for compression.

Lossy Progressive Coding Techniques

Lossy progressive coding techniques compress a given mesh without encoding its original connectivity and geometry data. They measure distortion as the geometric distance between surfaces, i.e. connectivity and geometry can be treated as additional degrees of freedom. Multiresolution analysis and wavelets are the key techniques applied for lossy progressive compression. They allow a complex surface to be decomposed into a coarse low resolution part together with a hierarchy of fine details, called wavelet coefficients. In practice, many of the wavelet coefficients are small. The advantage of the decomposition process is that good surface approximations can be obtained even if small coefficients are discarded. Such a subband decomposition is important for efficient progressive compression: on the one hand it permits a signal to be represented as a coarse signal which can be progressively refined, and on the other hand it decorrelates the signal into an energetic low frequency part and low variance high frequency parts. This allows a more compact representation. Before the pioneering work of Lounsbery et al. [29], wavelets were successfully applied only to functions defined over Cartesian grids, e.g. audio, image and video signals. Lounsbery et al. extended the wavelet approach to 2-manifold surfaces of arbitrary type.
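To make the decomposition idea concrete on a Cartesian signal before moving to surfaces, the following sketch (ours, not taken from the surveyed papers) performs one Haar analysis step on a 1D signal: it splits the signal into a coarse half of pairwise averages and a detail half of half-differences, and verifies perfect reconstruction.

    import numpy as np

    def haar_analysis(signal):
        # One Haar step: pairwise averages (coarse) and half-differences (detail).
        s = np.asarray(signal, dtype=float).reshape(-1, 2)
        coarse = s.mean(axis=1)
        detail = (s[:, 0] - s[:, 1]) / 2.0
        return coarse, detail

    # Smooth signals produce many near-zero details, which is what makes
    # subband decompositions attractive for progressive compression.
    sig = np.sin(np.linspace(0.0, 3.0, 16))
    coarse, detail = haar_analysis(sig)
    reconstructed = np.column_stack((coarse + detail, coarse - detail)).ravel()
    assert np.allclose(reconstructed, sig)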
In general it is impossible to keep the regularity of Cartesian sample arrangements, as in audio, images, and video, when going to 2-manifold surfaces, since these are usually represented by meshes with irregular connectivity [30]. However, by creating semi-regular meshes (Fig. 8.27), an extension of the Cartesian wavelet transform [31] to 2-manifold meshes becomes possible [32]. Semi-regular meshes are constructed by a process of recursive quadrisection of mesh triangles (Fig. 8.26). Starting with a coarse base mesh, whose vertex positions are samples of the 2-manifold surface, each triangle is split into four subtriangles. The positions of the newly added vertices again represent surface samples. This process is repeated recursively, inducing a multi-resolution hierarchy, until a mesh representation is obtained which approximates the desired irregular 2-manifold mesh with sufficient accuracy. The semi-regular meshes obtained in this way possess large triangular patches with regular connectivity. A wavelet transform is applied iteratively between two neighbouring meshes in the multi-resolution representation, i.e. between a coarser and a finer mesh. It predicts the positions of the new vertices of the finer mesh, added during triangle quadrisection, based on the vertex positions of the coarser mesh. Prediction is usually performed using techniques from the area of subdivision surfaces [33]. Deviations of the predicted positions from the original ones represent wavelet details. These details can be compared to details obtained in a traditional signal processing setting, e.g. by high pass filtering of an image. A lossy progressive coding scheme for meshes consists of three parts: 1) semi-regular remeshing, 2) a wavelet transform which represents a mesh as a base mesh and a sequence of wavelet details, and 3) single rate encoding of the base mesh and entropy coding of the wavelet details. Khodakovsky et al. [34] proposed the progressive geometry compression (PGC) method. It uses the MAPS algorithm [35] to obtain a semi-regular mesh hierarchy (Fig. 8.27). Note that the connectivity information of a semi-regular mesh can be encoded very efficiently since it depends only on the base mesh and the number of subdivisions. However, the original connectivity cannot be reconstructed after remeshing. Khodakovsky et al. applied the Loop scheme [36] as a predictor during the wavelet transformation. This leads to a coarse base mesh and a sequence of wavelet coefficients.
Fig. 8.26. Quadrisection of a base mesh triangle
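As an illustration of the quadrisection step in Fig. 8.26, here is a minimal sketch (our own; the hypothetical project callback stands in for the surface sampling that a real remesher such as MAPS would provide):

    def quadrisect(vertices, triangles, project=None):
        # One quadrisection level: split every triangle into four by inserting
        # edge midpoints. The optional `project` callback maps each midpoint
        # back onto the original surface, so new vertices are surface samples.
        vertices = list(vertices)
        midpoints = {}  # edge (i, j) with i < j  ->  index of its midpoint

        def midpoint(i, j):
            key = (min(i, j), max(i, j))
            if key not in midpoints:
                m = tuple((a + b) / 2.0 for a, b in zip(vertices[i], vertices[j]))
                if project is not None:
                    m = project(m)          # sample the original surface
                midpoints[key] = len(vertices)
                vertices.append(m)
            return midpoints[key]

        refined = []
        for a, b, c in triangles:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            refined += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
        return vertices, refined

Each application multiplies the triangle count by four, which is why the connectivity of the whole hierarchy is determined by the base mesh and the number of subdivision levels alone.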
Fig. 8.27. A semi-regular representation of the face model generated using the MAPS algorithm
Since wavelet coefficients tend to decrease from coarse to fine scales, a zero-tree approach [32, 37] is used for progressive compression of the wavelet details. Each component of a wavelet detail, which in fact is a 3-dimensional vector, is compressed separately using a distinct zero-tree. The produced bits are subsequently interleaved to obtain a progressive bit-stream. This approach reportedly provides 12 dB or 4 times better visual quality than CPM at the same bit rates. The mean surface distance [38] is used here as the distance measure. Khodakovsky and Guskov [39] proposed another wavelet coder (NMC) which uses normal meshes [40] for efficient encoding of the wavelet details. Normal meshes have the property that wavelet details are represented as a scalar offset in the normal direction relative to a local coordinate frame. Consequently, wavelet details are described by only a scalar value, in contrast to the previously presented approach where wavelet details were 3-dimensional vectors. Khodakovsky and Guskov used the unlifted butterfly scheme [27] as predictor, since it is also used during normal remeshing [40]. Improvements of about 2–5 dB can be observed in comparison to the previous approach [34]. Morán and García (MG) [41] take a different approach, since they do not explicitly use a parameterization to generate a semi-regular mesh. They apply a modified version of Garland's quadric-based simplification algorithm [24] to obtain a base mesh. The butterfly scheme is employed as a predictor, starting with a prediction relative to the base mesh. After each prediction, wavelet details are determined which capture the difference between the original mesh (which may be irregular) and the current predicted mesh. A combination of a normal projection approach and a closest point approach is used to calculate the wavelet details. Wavelet details are encoded relative to a local coordinate frame tied to the coarser mesh. Observing that the normal component of the wavelet details carries more geometry information than the tangential components, the details are quantized non-uniformly, increasing the resolution in the normal direction at the expense of the tangential components. A bit-wise interleaving of the three detail components generates a single stream of scalar values.
Finally, entropy coding is performed using the SPIHT algorithm [37]. This compression algorithm performs slightly worse than PGC but always better than CPM. PGC probably shows better performance due to the better remeshing of MAPS. However, PGC is computationally more expensive.

Discussion

The PM approach allows progressive encoding of 2-manifold meshes with the finest (vertex) granularity, which requires random access to all edges of a mesh. From the compression point of view this leads to high bit-rates. PSC is an extension of the PM approach to arbitrary meshes based on simplices. It is a more general approach, paid for at the compression side with even higher bit-rates. Layered approaches like PFS and CPM, which group refinements into batches, improve the compression performance since access is restricted to only a subset of all vertices. VDC improves the compression performance even further, due to a deterministic access to vertices which exploits vertex valences. All these approaches employ mesh simplification techniques in order to obtain a lossless progressive mesh representation which is encoded afterwards. PSC, PM, PFS and CPM use edge collapse operations, whereas VDC applies patch-based vertex removal operations during simplification. There exists a long series of improvements in the area of mesh simplification techniques [23, 24, 25, 26] which can give impulses for further improvement of lossless progressive coding techniques for static meshes. The PGC approach remeshes the original mesh based on the MAPS parameterization, leading to a semi-regular mesh representation. With a subsequent wavelet transformation, gains of about 12 dB relative to CPM are achieved. These gains can be attributed to the increased regularity of a semi-regular mesh, which affords an efficient compression of the (not original) connectivity and an improved prediction setting. NMC employs a normal parameterization in order to obtain a semi-regular mesh. During wavelet compression, only scalar offsets in the normal direction are calculated relative to the previously predicted value. This leads to a significantly lower number of wavelet coefficients, i.e. one-third compared to PGC, improving the performance even further. PGC and NMC both require computationally intensive parameterizations during semi-regular remeshing. The MG approach avoids parameterizations. This leads on the one hand to a worse approximation quality than NMC, but on the other hand compression becomes computationally less intensive. Results of lossless and lossy progressive coding techniques are summarized in Table 8.2 and Table 8.3.

8.2.3 Huge Meshes

All algorithms presented so far assume meshes that fit into core memory.
Table 8.2. Bit-rates of lossless progressive coding techniques

Author                     Name  Bit-rate               Remarks
Popovic & Hoppe [20]       PSC   over 35 bpv            works also with non-manifold meshes, vertex granularity
Hoppe [19]                 PM    about 35 bpv           vertex granularity, random vertex access
Taubin [21]                PFS   slightly below 30 bpv  layer granularity, restricted random access
Pajarola & Rossignac [22]  CPM   about 22 bpv           layer granularity, restricted random access
Alliez & Desbrun [16]      VDC   14–20 bpv              layer granularity, deterministic access
As geometric datasets have continued to grow rapidly over the last years, huge meshes have appeared that are far too big to be stored in in-core memory. Approaches for compression and simplification of such huge meshes have therefore become important. File formats for meshes typically consist of a list of vertices specifying their coordinate values, and a list of triangles where each triangle stores three indices that point into the vertex list. So a triangle references three vertices; in reverse, each vertex is referenced by a number of triangles. Consider now that the triangles are loaded from disc in the order defined in the mesh file. A loaded triangle remains in in-core memory until it has been processed. If a triangle is loaded from disc, all its vertices must also be loaded. A vertex can be removed from in-core memory once the last triangle that references it has been loaded and processed. Note that knowing when a vertex is referenced for the last time is essential for removing the vertex from memory, but usual mesh file formats do not provide this information. Streaming Meshes [42] are an extension of file formats and can be treated as a stream that contains triangles and vertices. As usual, a triangle references three vertices. But if a triangle is the final triangle referencing a vertex, it does not store the plain index of this vertex but a marked index, notifying a streaming algorithm that this vertex is referenced for the last time and can be removed safely.

Table 8.3. Gains in dB of lossy progressive coding techniques

Author                     Name  Gain in dB, rel. to   Remarks
Khodakovsky et al. [34]    PGC   12, CPM               comp. intensive
Khodakovsky & Guskov [39]  NMC   2–5, PGC              comp. intensive
Morán & García [41]        MG    between PGC and NMC   fast
Using such a streaming representation of meshes, both simplification [43] and compression [44, 45] algorithms can be designed with only slight changes to classic algorithms. If the mesh itself does not exhibit high spatial coherence, i.e. if each triangle references vertices with very different indices, the mesh elements need to be re-sorted. Otherwise, the in-core buffer might not be sufficient to store the streaming mesh created from a mesh with a bad layout. As a side-effect, streaming meshes improve the performance of simplification or compression significantly: because the (re-sorted) mesh elements have high spatial coherence, cache misses are minimized, enabling optimal usage of the memory hierarchy of modern computers.
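The following sketch (ours; Streaming Meshes define a concrete file syntax that we do not reproduce here) shows the pre-processing idea: one pass over the triangle list determines, for every vertex, the last triangle that references it, so the stream can flag final references and let a consumer evict vertices from the in-core buffer.

    def mark_final_references(triangles):
        # First pass: remember, for every vertex, the index of the last
        # triangle that references it.
        last_use = {}
        for t, tri in enumerate(triangles):
            for v in tri:
                last_use[v] = t      # overwritten until the final reference
        # Second pass: emit each triangle with a flag per vertex index telling
        # a streaming consumer whether the vertex can now be evicted.
        return [[(v, last_use[v] == t) for v in tri]
                for t, tri in enumerate(triangles)]

    # a vertex flagged True is referenced for the last time in that triangle
    for tri in mark_final_references([(0, 1, 2), (1, 2, 3), (2, 3, 4)]):
        print(tri)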
8.3 Compression of Dynamic Meshes

In order to visualize dynamic scenery, the 3D representation of objects and scenes must be dynamic as well. Dynamic 3D objects and scenes are widely used in computer graphics for games, web-sites, movie and TV production, etc. In most cases such content is purely virtual, created by skilled operators who design the scenes using specific hardware and software platforms. This may involve hundreds of people working over months to produce a high quality Hollywood movie. On the other hand, more and more dynamic 3D content evolves from new applications like free viewpoint video. Here a real world dynamic scene is typically captured by multiple synchronized cameras. The 3D geometry is reconstructed and represented in a suitable format. Then such content can be used in any application just like conventional (i.e. virtual) 3D graphics. Coming from either source, dynamic 3D geometry can be conveniently represented using 3D meshes. The simplest approach is just to use completely independent meshes for every time instant, i.e. a succession of static meshes. But obviously this is not the most efficient approach. Instead, the geometry of moving 3D objects may be represented by a constant part and a dynamic part. The constant part of the representation reflects the fact that a certain object (virtual or reconstructed) with constant properties is under consideration. The dynamic part reflects the changing properties. For instance, the number of vertices and the connectivity can be kept constant. Motion and deformation can be constrained by a certain model with a limited number of parameters. Basically such a model describes the displacement of the vertices over time. Very often dynamic 3D models are represented and produced in this way. An initial 3D mesh with vertices and connectivity is defined. It is moved and deformed over time, controlled by animation parameters. The model defines the effect of each of the animation parameters on each of the vertices (which may be no effect in some cases). Such a representation is extremely compact.
It requires only one 3D mesh and a sequence of animation parameters instead of a sequence of 3D meshes. However, the drawback is that it requires a pre-defined model and is thus restricted to a certain type of object or motion/deformation in each specific case. Models for humans and faces, for instance, are widely used. Any type of physical knowledge may be integrated. Any type of geometric transformation that can be expressed as a mathematical function or algorithm can be applied. For a rigid body motion, for instance, only rotation and translation over time need to be specified, which means only 6 parameters per time instant. But also more complex physical models and 3D warping algorithms are widely used. Generic modules, each defining a certain model, may be combined to build more complex objects. However, in the most generic case no a-priori model may be imposed. Given an initial mesh with vertices and connectivity, each of the vertices shall be allowed to move freely over time. This shall be called a time-consistent dynamic 3D mesh. Any dynamic 3D scene can be composed using such time-consistent dynamic 3D meshes. Physically modelled objects with animation parameters and static objects may be integrated into such a scene as well. Further, the time consistency constraint may be imposed only over a certain period of time, as long as the constant part of the initial representation is still suitable for the object at hand. In principle this representation is suitable and efficient for any dynamic 3D scene. However, the amount of data may still be extremely huge. High quality 3D meshes can consist of millions of vertices. Transmission channels are limited and in many cases expensive. Mobile phones, for instance, already have impressive 3D rendering capabilities, and mobile 3D applications are expected to become more and more popular. For the success of such services and applications, involving for instance transmission over 3G networks, efficient compression is crucial. Further, storage is still an expensive resource, for instance on mobile phones. A first step to efficient compression is key-mesh animation. Instead of specifying the mesh for every time instant, only a number of important meshes – the key-meshes – is defined explicitly. The vertex positions in between are defined by interpolation functions, which may be linear interpolation in the simplest case. Key-mesh animation is widely used for production: the operator designs the key-meshes and the complete sequence is calculated by interpolation. But also other mesh sequences, e.g. resulting from 3D reconstruction, can be approximated piece-wise by such a key-mesh approach. Such a representation is already quite compact. However, the sequence of key-meshes – itself a time-consistent dynamic 3D mesh – can still be compressed more efficiently. The previous section has shown how to compress static 3D meshes. Compression of an animated mesh, meaning a static 3D mesh and a sequence of animation parameters, is rather trivial. The following section gives an overview of available algorithms for compression of time-consistent dynamic 3D meshes.
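As a minimal sketch of the key-mesh interpolation described above (function and variable names are ours), a decoder can reconstruct intermediate meshes from the two surrounding key meshes, which share one connectivity:

    import numpy as np

    def interpolate_mesh(key_times, key_meshes, t):
        # key_times: sorted key-mesh time stamps; key_meshes: list of (V, 3)
        # vertex arrays with common connectivity. Returns the vertices at t.
        i = np.searchsorted(key_times, t, side="right") - 1
        i = min(max(i, 0), len(key_times) - 2)
        t0, t1 = key_times[i], key_times[i + 1]
        w = (t - t0) / (t1 - t0)          # linear blend weight in [0, 1]
        return (1.0 - w) * key_meshes[i] + w * key_meshes[i + 1]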
Fig. 8.28. Different meshes of the level 2 Humanoid sequence sharing a common connectivity [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
Fig. 8.28 illustrates a few meshes from the Humanoid sequence as an example of such data. Figure 8.29 shows an example of a reconstructed mesh sequence. The dynamic 3D geometry of the person was reconstructed from 16 synchronized videos and represented by 3D meshes. The figure shows rendered views with texture mapping at 4 different times from 4 different virtual viewpoints. As in the static case, the techniques for compression of dynamic meshes can be classified into single rate encoders, progressive encoders and others. These are described in the following sections.

8.3.1 Single Rate Encoders

As in the static case, single rate encoders do not provide a progressive bitstream. It must be decoded completely and results in the full resolution mesh sequence.
Fig. 8.29. Virtual camera fly, rendered views at 4 different times from 4 different virtual viewpoints [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
However, if progressive decoding is not a requirement for the application at hand, they are very efficient. The key to efficient compression is prediction of vertex positions. Vertices within a mesh have a spatial coherence, meaning that a vertex is located close to its neighbours. The locations of the neighbours contain information about the location of the vertex itself, which can be exploited for spatial prediction. In addition, mesh sequences contain temporal coherence: vertex positions of a previously encoded mesh contain information about the current vertex positions to be encoded. This is exploited for temporal prediction. These basic principles of spatial and temporal prediction are employed in different ways in different implementations. They can be further classified into direct methods, clustering approaches, and combinations. Direct methods encode the vertices, or rather the prediction residuals, directly. Important examples are Interpolator Compression as defined in MPEG-4 AFX, and Dynapack, both described in the following subsections. Clustering approaches try to segment the meshes into clusters with more or less similar motion. Then the motion of all vertices in a cluster can be encoded using a small amount of representative data (e.g. displacement vectors). An efficient clustering-based mesh encoder called D3DMC is described afterwards, followed by a combination of direct and clustering approaches called RD-D3DMC. Naturally, this combination provides the best compression performance at the cost of significantly increased computational complexity. Suitable 3D error measures are then introduced. Finally, results of comparative experiments are given.

Interpolator Compression in MPEG-4 AFX

MPEG-4 is a rich multimedia framework that provides, for instance, efficient encoding for any type of multimedia data such as video, audio, graphics, text, etc. MPEG-4 compression of static meshes (3DMC) has already been mentioned in the previous section. Recently a new part of MPEG-4 called the Animation Framework eXtension (AFX) [46] has been released. It specifies several new computer graphics tools, among them a new algorithm for compression of dynamic 3D meshes called Interpolator Compression (AFX-IC) [47]. AFX-IC provides tools for reduction of the initial sequence into a sequence of key-meshes. The specified syntax provides the means for interpolation of the intermediate meshes. The remaining sequence of key-meshes is encoded using a classical DPCM and entropy coding structure. Each vertex to be encoded is predicted from previously encoded vertices. This may be a spatial neighbour, a temporal neighbour or a combination. In the case of temporal prediction, the same vertex from the previously encoded mesh (position index n, time index t-1) is subtracted from the actual vertex to be encoded. This means subtraction of the x-, y-, and z-coordinates:

d_t = v_n^t − v_n^{t−1}.
In the case of spatial prediction, an already encoded neighbouring vertex from the same mesh (position index m, time index t) is subtracted:

d_s = v_n^t − v_m^t.

In combined mode, the same vertex from the previously encoded mesh (position index n, time index t-1) is moved in the direction of the displacement of an already encoded neighbouring vertex from the same mesh (term v_m^t − v_m^{t−1}). The result is subtracted from the actual vertex to be encoded:

d_st = v_n^t − (v_n^{t−1} + (v_m^t − v_m^{t−1})).

The encoder tries all prediction modes. The decision about the best mode is taken for each coordinate separately. This means that the x-, y-, and z-coordinates of one vertex can be encoded in different modes. The decision is based on an entropy estimate, i.e. the mode is selected that will most likely cost the fewest bits. The prediction residual is entropy encoded and sent together with the mode decision information to the decoder. AFX-IC is relatively simple, thus easy to implement and of low computational complexity, but nevertheless efficient. Since it is an international standard, it may serve as a reference for the other methods described in the following.

Dynapack and Related Algorithms

The first to study dynamic mesh compression was Deering [12]. Each mesh is compressed independently. Only geometry information from the currently compressed mesh is used, and thus only the spatial coherence of the vertices is exploited. Ibarria and Rossignac [48] introduced Dynapack, which extends the approach of Deering to also take the temporal coherence of the vertices into account. They proposed a total of four predictors with different properties. The connectivity of the meshes remains constant and can be compressed once. Ibarria and Rossignac proposed to store the connectivity of the animated mesh in a corner table data structure, which supports mesh traversal operations. However, the algorithm itself is not restricted to this data structure; it can be implemented with any connectivity data structure that supports mesh traversal. In the following we use the vertex scheme shown in Fig. 8.30: the bold terms describe the geometry of the vertices, whereas the italic letters describe the indices of the vertices, which are given by the used data structure and are normally attached to exactly one triangle t. Dynapack compresses the first frame mesh using a spatial predictor and the following meshes using one of the spatial and temporal predictors:
• Compress the connectivity of the first frame mesh. This can be done using at most 4 bits per vertex (plus a small overhead to encode possible holes or handles). Encode the geometry using a spatial predictor, for instance the Lorenzo predictor [49] or the parallelogram predictor.
Fig. 8.30. The bold terms describe the geometry of the vertices whereas the italic letters describe the indices of the vertices which are given by the used data structure and are normally attached to exactly one triangle t
• For the following frame meshes, encode the vertices of a first triangle using a temporal predictor.
• Start to traverse the mesh by calling the recursive Dynapack method for the three vertices that need to be encoded first: Dynapack(rf), Dynapack(lf), and Dynapack(of).
Dynapack(v, f)
    If triangle(v) == -1
        Return
    If triangle(v) has not been visited
        If v has not been visited
            Encode(v - predict(v, f))
            Mark v as visited
        Mark triangle(v) as visited
        Dynapack(r, f)
        Dynapack(l, f)
Decompression works in the same manner and decodes the vertices.
Dynaunpack(v, f)
    If triangle(v) == -1
        Return
    If triangle(v) has not been visited
        If v has not been visited
            v = Decode() + predict(v, f)
            Mark v as visited
        Mark triangle(v) as visited
        Dynaunpack(r, f)
        Dynaunpack(l, f)
Ibarria and Rossignac examined the following four prediction methods predict(v_f):

Parallelogram Spatial Predictor
Initially invented by Touma and Gotsman, this predictor returns

predict(v_f) = n_f + p_f − o_f.

Temporal Predictor
This predictor simply returns the position of the vertex at the previous frame:

predict(v_f) = v_{f−1}.

Extended Lorenzo Spatial and Temporal Predictor
This is a generalized Lorenzo predictor, originally invented to compress regular samplings of four-dimensional scalar fields:

predict(v_f) = n_f + p_f − o_f + v_{f−1} − n_{f−1} − p_{f−1} + o_{f−1}.

Replica Spatial and Temporal Predictor
This predictor perfectly predicts rigid-body motions and uniform scaling transformations:

predict(v_f) = o_f + a A_f + b B_f + c C_f, with
A_f = p_f − o_f,  B_f = n_f − o_f,  C_f = (A_f × B_f) / ‖A_f × B_f‖^{1/2}.

The constants a, b, and c are computed from the geometry information of the previous frame, with D_{f−1} = v_{f−1} − o_{f−1}:

a = (A_{f−1}D_{f−1} ∗ B_{f−1}B_{f−1} − B_{f−1}D_{f−1} ∗ A_{f−1}B_{f−1}) / (A_{f−1}A_{f−1} ∗ B_{f−1}B_{f−1} − A_{f−1}B_{f−1} ∗ A_{f−1}B_{f−1})

b = (A_{f−1}D_{f−1} ∗ A_{f−1}B_{f−1} − B_{f−1}D_{f−1} ∗ A_{f−1}A_{f−1}) / (A_{f−1}B_{f−1} ∗ A_{f−1}B_{f−1} − B_{f−1}B_{f−1} ∗ A_{f−1}A_{f−1})

c = D_{f−1}(A_{f−1} × B_{f−1}) / ‖A_{f−1} × B_{f−1}‖^{3/2}

Here AD denotes the dot product of the vectors A and D, whereas ∗ denotes scalar multiplication.
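A NumPy sketch of the four predictors may clarify the formulas (names are ours; each function takes 3D positions of the stencil vertices, and the normalization of the Replica normal axis follows the scale-invariance property stated above):

    import numpy as np

    def parallelogram(n_f, p_f, o_f):
        # Touma/Gotsman rule: complete the parallelogram across the shared edge.
        return n_f + p_f - o_f

    def temporal(v_prev):
        # Same vertex, previous frame.
        return v_prev

    def extended_lorenzo(n_f, p_f, o_f, v_prev, n_prev, p_prev, o_prev):
        # Spatial parallelogram corrected by the temporal change of the stencil.
        return n_f + p_f - o_f + v_prev - n_prev - p_prev + o_prev

    def replica(o_f, n_f, p_f, o_prev, n_prev, p_prev, v_prev):
        # Express D = v - o of the previous frame in its local frame (A, B, C),
        # then rebuild the vertex in the current frame with the same a, b, c.
        A, B, D = p_prev - o_prev, n_prev - o_prev, v_prev - o_prev
        AxB = np.cross(A, B)                    # assumes a non-degenerate triangle
        G = np.array([[A @ A, A @ B], [A @ B, B @ B]])
        a, b = np.linalg.solve(G, np.array([A @ D, B @ D]))
        c = (D @ AxB) / np.linalg.norm(AxB) ** 1.5
        A_f, B_f = p_f - o_f, n_f - o_f
        C_f = np.cross(A_f, B_f)
        C_f = C_f / np.sqrt(np.linalg.norm(C_f))  # scale-covariant normal axis
        return o_f + a * A_f + b * B_f + c * C_f

Because the coefficients are computed in frame f − 1 and re-applied in the local frame of frame f, a triangle that merely rotates, translates or scales uniformly reproduces the vertex exactly, which is the stated property of the Replica predictor.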
Stefanoski and Ostermann [50] proposed a related connectivity-guided predictive compression approach. Vertex locations p_v^f are encoded frame-wise, starting with the first frame f = 1 and ending with the last frame f = F. As in Dynapack, they traverse the vertices v in a deterministic order, predicting vertex locations based on already encoded neighbouring vertex locations. Unlike in Dynapack, vertices are traversed in the order defined by the connectivity compression algorithm Edgebreaker [9]. Three different predictors are used for prediction: the already introduced parallelogram predictor for prediction in the spatial direction only, and the following two spatio-temporal predictors:

Angle Preserving Predictor
This is a non-linear predictor with angle preserving properties. Calculation of the predicted vertex location pred_angle(v, f) for location p_v^f is performed using orthogonal local coordinate frames (x_{f−1}, y_{f−1}, z_{f−1}) and (x_f, y_f, z_f) in frames f − 1 and f, respectively (see Fig. 8.31). These are attached at the corresponding edge centers m_{f−1} and m_f. Prediction is based on the coordinates (a_x, a_y, a_z)^T, which represent the vector p_v^{f−1} − m_{f−1} relative to the basis (x_{f−1}, y_{f−1}, z_{f−1}).
Fig. 8.31. Angle preserving predictor [50] (Copyright 2006 IEEE)
Motion Vector Averaging Predictor
The predicted vertex location pred_mvavg(v, f) is calculated from already encoded neighbouring vertex locations. Assuming that neighbouring vertices perform a similar motion, the motion vector of the current vertex v is calculated as the average of the motion vectors mv(v′, f − 1) of the N neighbouring vertices v′ ∈ N(v):

pred_mvavg(v, f) = p_v^{f−1} + (1/N) Σ_{v′ ∈ N(v)} mv(v′, f − 1).
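A one-function sketch of this predictor (our naming), taking the previous location of v and the motion vectors of its already encoded neighbours:

    import numpy as np

    def pred_mvavg(p_prev, neighbour_motion):
        # p_prev: location of v in frame f-1; neighbour_motion: list of motion
        # vectors mv(v', f-1) of the already encoded neighbours of v.
        return p_prev + np.mean(neighbour_motion, axis=0)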
The parallelogram predictor is applied only when encoding the first frame. For each of the subsequent frames, either the angle preserving or the motion vector averaging predictor is applied, depending on which predictor leads to the lowest average prediction error. One bit of side information per frame is encoded in order to specify the predictor used. Dynapack of Ibarria and Rossignac and the algorithm of Stefanoski and Ostermann are in principle very similar to AFX-IC. However, they employ more complex and thus more efficient prediction modes, as presented above. This leads to better performance at the cost of increased computational complexity. Another advantage of these two algorithms over AFX-IC is the connectivity-guided traversal of the mesh [51], which further enhances the efficiency of the spatial prediction modes. The algorithm of Stefanoski and Ostermann shows better performance than Dynapack because of the exploitation of non-linear dependencies and a frame-wise adaptation of the predictors.

D3DMC

The displacements of the vertices of a given object are not independent of each other. This leads to spatial and temporal coherence that can be exploited for prediction, as shown in the previous sections. However, this property of objects can also be exploited by clustering vertices, which is in most cases more efficient. For instance, many complex objects can be conceptually composed of parts that undergo approximately rigid motions. Consider a human body: the upper arm, for instance, can be treated approximately as a rigid body, at least for short time intervals. Then the motion of all vertices defining the surface of the upper arm can be represented by a few rigid body parameters. For perfect accuracy a residual can be added. The principle of clustering approaches thus lies in the segmentation of complex objects into parts whose motion can be represented by a number of parameters according to a certain motion model. A drawback of clustering is that overhead describing the segmentation needs to be signalled to the decoder. As will be shown below, for relatively simple meshes with a small number of vertices this overhead may become significant, making direct methods more efficient. The first to propose a clustering approach was Lengyel [52]. The mesh is segmented into clusters that undergo affine motion. One difficulty is to find an appropriate segmentation; this task may be computationally very complex. Improvements have since been proposed by other authors.
Fig. 8.32. Trilinear interpolation of 3D motion vectors [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
Spatial Clustering and Representation

Clustering may also be based on regular structures such as cubical volumes. Fig. 8.32 illustrates such an approach. The main concept of the spatial clustering used here is to represent a number of similar motion vectors within a cell by only a few substitution vectors. Assuming that a previously encoded mesh has already been subtracted (temporal prediction), only difference, displacement or motion vectors d need to be considered for encoding the actual mesh. Within homogeneous regions all these vectors will be very similar. Assume a cell of size Δx, Δy, Δz at position x_1, y_1, z_1. All motion vectors d_l of vertices within the cell are represented by the motion vectors o_1 ... o_8 of the eight corner vertices. For large homogeneous regions this may lead to a tremendous reduction of the number of vectors to be transmitted. The cell information (e.g. size Δx, Δy, Δz and position x_1, y_1, z_1) is introduced as overhead. Of course this clustering introduces an error: a decoder that receives only the representative motion vectors cannot recover the exact motion vectors. This error can be measured by e_d, the normalized sum of Euclidean distances between the original motion vectors d_l and their respective reconstructions d̃_l(t) from the corner motion vectors within the volume under consideration:

e_d = (1/L) Σ_{l=1}^{L} ‖d̃_l(t) − d_l(t)‖.
Octree Subdivision

The task for efficient encoding is to find an appropriate subdivision of the entire mesh into cells with approximately homogeneous motion, together with the representative corner vectors for each cell. This is a complex optimization problem that has to trade off the benefits (in terms of quality or distortion, measured e.g. as in the equation above) against the costs (in terms of necessary data rate). An efficient approach is to use an octree subdivision algorithm [53], [54]. The subdivision is performed top-down, starting with one single volume – such as the bounding cube – that contains the entire mesh. In the first subdivision step, the initial volume is split into eight octants, as shown in Fig. 8.33.
For each octant, a set of representative vectors as in Fig. 8.32 is computed from the contained motion vectors as follows. The trilinear reconstruction of the motion vectors d_l(t) from the representative vectors o_m(t) with associated weights w_{m,l}(t) is formulated as (see [53], [54]):

d_l(t) = Σ_{m=1}^{8} w_{m,l}(t) o_m(t).

This gives a set of L equations for all motion vectors d_l(t) within a spatial cell, with l = 1 ... L. From this set, a matrix equation d = W · o is created with weighting matrix W, which contains the known positions of all motion vectors by means of their trilinear weights. The representative vectors o_m(t) are thus calculated either from the pseudo-inverse of W, since W is a non-square matrix:

o = [W^T W]^{−1} W^T · d,

or by singular value decomposition and inversion of the decomposed matrices:

o = V · diag(1/x_i) · (U^T d), where W = U · diag(x_i) · V^T.

Then the error measure is computed for the calculated representatives. If the error is below a certain threshold, the representatives for the octant are found and further encoded. If the error exceeds the threshold, the octant is subdivided into 8 sub-octants and the process is repeated recursively, as illustrated in Fig. 8.33. The process stops when the complete initial volume is processed and all representatives are determined. A significant drawback of this approach is the need to specify a threshold to control the algorithm: it has to be determined experimentally and does not allow direct control of the resulting bit rate and quality. This is overcome in the RD-optimized extension described below. An octree subdivision algorithm follows a regular structure. It is relatively easy to implement and computationally very efficient. However, such a rigid structure cannot adapt optimally to the shape of objects.

Encoder

Based on the ideas in the previous sections, Zhang and Owen presented an efficient encoder for dynamic 3D meshes [53]. This work was extended by Müller et al. [55], integrating efficient arithmetic coding and a static mesh encoder. Fig. 8.34 shows a block diagram of the D3DMC encoder. The main structure is based on a DPCM loop with a 1st order predictor. The block diagram contains MPEG-4 3DMC as a fallback mode that is enabled through the Intra/Inter switch, which is fixed to either one for each 3D mesh of a sequence. This Intra mode is used for instance when the first mesh (I mesh) is encoded, i.e., when no prediction from previously decoded meshes is used.
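A sketch of the per-cell fitting and error evaluation described above (names are ours; np.linalg.lstsq internally uses the same SVD-based pseudo-inverse as the formulas):

    import numpy as np

    def trilinear_weights(pos, origin, size):
        # Trilinear weights of one vertex position relative to its cell corners.
        u, v, w = (np.asarray(pos) - origin) / size
        return np.array([(1-u)*(1-v)*(1-w), u*(1-v)*(1-w), (1-u)*v*(1-w),
                         u*v*(1-w), (1-u)*(1-v)*w, u*(1-v)*w, (1-u)*v*w, u*v*w])

    def fit_cell(positions, d, origin, size):
        # positions: (L, 3) vertex positions inside the cell; d: (L, 3) motion
        # vectors. Returns the 8 corner vectors o and the error measure e_d.
        W = np.array([trilinear_weights(p, origin, size) for p in positions])
        o, *_ = np.linalg.lstsq(W, d, rcond=None)       # least-squares corners
        e_d = np.linalg.norm(W @ o - d, axis=1).mean()  # normalized distance sum
        return o, e_d

In a recursive octree encoder, e_d would then be compared against the threshold to decide whether the cell's representatives are accepted or the cell is split into eight sub-octants.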
Fig. 8.33. Level 3 selective spatial octree subdivision [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
Additionally, temporal prediction can be switched off by the encoder in any other case, e.g., when the prediction error becomes too large. This mode provides backward compatibility to standard static 3DMC and ensures that D3DMC can never be worse than 3DMC. The predictive mode for mesh coding consists of the following steps:
1. The previously decoded mesh is subtracted from the current mesh to be encoded. This step can only be done if time-consistent meshes with a common connectivity are available.
Fig. 8.34. Block diagram of the D3DMC encoder [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
In the case of a change of connectivity, a fallback to 3DMC via the Intra/Inter switch is performed and the recursive process is reinitiated. Only the difference signal between original and prediction, i.e. the difference vectors, is further processed. This backward prediction scheme also ensures the suppression of the error drift that occurs in forward prediction structures [6].
2. Spatial clustering is applied to the difference vectors as described in the previous sections.
3. The substitute vectors are passed to an arithmetic coder using context-adaptive binary arithmetic coding (CABAC) [7] to efficiently adapt to the signal statistics.

RD-D3DMC

As explained before, clustering approaches are not always better than direct coding. Especially for small resolution meshes, regions with homogeneous motion may contain only a limited number of vertices, so the introduced overhead may become relatively large. Zhang and Owen therefore extended their algorithm to a hybrid version that selects between octree clustering and direct coding on a global per-mesh basis [53]. Müller et al. extended their algorithm as well, allowing switching on a per-cell basis [55]. The codec allows 3 different modes for each cell individually:
1. Direct Coding of the differential vectors. This prediction mode is beneficial if the motion vectors within the currently analyzed spatial volume are very different, or if only a few vertices remain within the volume.
2. Mean Replacement of all motion vectors by their mean vector. This prediction mode is useful if all motion vectors within a volume are very homogeneous. Only one vector is transmitted instead of the 8 in the case of trilinear interpolation.
3. Trilinear Interpolation of all differential motion vectors from the 8 corners as described before. This prediction mode is likely to be selected if the motion vectors exhibit a moderately smooth variation across the considered volume.
This highly flexible structure allows the encoder to adapt optimally to any type of mesh. However, the search space for finding the optimum mode is also greatly increased. Algorithms for the efficient selection of an optimum encoding mode are well known from video coding as rate-distortion (RD) optimization. In general, a specific way of encoding some data will cost a certain bitrate R and introduce a certain distortion D. With a Lagrangian parameter λ, a cost function can be defined as D + λR. RD optimization means finding the best suitable RD pair among all possible ways of encoding the given data. In practice this means trying all possible combinations of modes and selecting the best one, which may be a tremendous computational effort. Therefore, very often certain search strategies are applied that are likely to lead to a good result at reduced complexity.
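The per-cell decision then reduces to minimizing the Lagrangian cost; a toy sketch (mode names follow the list above, but the rate and distortion numbers are invented placeholders):

    def best_mode(candidates, lam):
        # candidates: (mode_name, distortion, rate_in_bits) tuples for one cell.
        return min(candidates, key=lambda m: m[1] + lam * m[2])

    cell_modes = [("direct", 0.002, 310), ("mean", 0.009, 45),
                  ("trilinear", 0.004, 120)]
    print(best_mode(cell_modes, lam=1e-5))

Sweeping λ trades rate against distortion: a small λ favours low distortion regardless of rate, while a large λ favours cheap modes such as mean replacement.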
Fig. 8.35. Coding structure for the RD-optimized Dynamic 3D Mesh Coder (top) and detailed structure of RD-optimization (bottom) [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
RD-D3DMC, as illustrated in Fig. 8.35, is performed bottom-up (in contrast to D3DMC) and starts with the fully subdivided volume, where each spatial cell contains only 8 vectors or fewer. At this level direct coding is at least as compact as trilinear interpolation (which would require 8 vectors to be transmitted), so only mean replacement and direct coding need to be evaluated at this minimum level.
The rate is computed for each cell and each mode by passing the residuals to CABAC. The distortion is computed as the normalized sum of Euclidean distances e_d, as defined above. All these RD pairs are stored. Then cells are merged following the octree structure. The RD calculation is repeated for each of the merged cells, now including the trilinear interpolation mode, and all RD pairs are stored as well. The process is repeated recursively until the highest octree level is reached. Then the overall rate and distortion are computed for all possible combinations of subdivisions and modes. From those the encoder can finally choose the optimum combination, i.e. the smallest distortion at a given bitrate or the smallest bitrate at a given distortion. Obviously this process is computationally extremely expensive. Especially for high resolution meshes, the number of possible combinations to be evaluated is tremendous. On the other side, the best possible result in terms of compression efficiency is ensured. Another advantage of RD optimization is that it overcomes the need for predefined error thresholds.

3D Error Measures

One of the common error measures in 3D mesh comparison is the Hausdorff distance [57]. For two meshes A and B to be compared, the minimal distance d_{A,i} between each point v_{A,i}(t) of mesh A and mesh B is calculated first:

d_{A,i} = min_{∀k} ‖v_{A,i}(t) − v_{B,k}(t)‖.

From this, the directional Hausdorff distance d_{A→B} is obtained as the maximum of all single distances:

d_{A→B} = max_{∀i} (d_{A,i}).
From this, the directional Hausdorff distance dA→B is obtained as the maximum of all single distances dA→B = max(dA,i ). ∀i
In the next step, the algorithm is applied vice versa to obtain dB→A . In general dA→B = dB→A , such that the general Hausdorff distance dA,B is taken as the maximum of both directional distances: dA,B = max(dA→B , dB→A ). In the case of time-consistent mesh sequences, which we also consider in this paper, a one-to-one mapping between mesh A and B is possible. Therefore a direct Euclidian distance measure can be applied, which gives a far better displacement representation. For mesh comparison, we used the average distance dm : N / 1 / /v A,i (t) − v B,i (t)/. dm = N i=1 In the case of the Hausdorff distance, only the maximum error of any two vertices is represented, such that all other displacements are neglected. The average Euclidian distance dm or average root mean squared error (AVGRMSE)
The average Euclidean distance d_m, or average root mean squared error (AVGRMSE), is used for mesh-to-mesh comparison, as it represents a common measure for mesh evaluation, implemented e.g. in "Mesh" [58] and "Metro" [38], two tools that automatically calculate distances between 3D meshes. "Metro" additionally provides a visual comparison. Often, mesh sequences are represented by only a subset of key meshes with varying temporal distances between them. A distortion of two successive key meshes also influences the distortion of all intermediate meshes that need to be interpolated after decoding. Thus the temporal distance between the meshes also needs to be included in an error measure. For this purpose the area distance was introduced in [47], which was also used in the experiments below. The main idea is to extend the 1D Euclidean distance to a 2D area distance measure D_A by adding the temporal distance as a 2nd dimension. This area distance is first computed separately for the x-, y-, and z-components between any successive pair of key meshes within a sequence. The calculation for the x-component is shown below: the error D_A(x) is the sum of all single area distances D_{A,n}(x) between adjacent key meshes at times t_n and t_{n+1}:
N −1
DA,n (x) with DA,n (x)
n=1
$ |d
=
n+1 (x)|+|dn (x)|
(tn+1 − tn ) if 2 |dn+1 (x)|2 +|dn (x)|2 2(|dn+1 (x)|+|dn (x)|) (tn+1 − tn )
sgn(dn (x)) = sgn(dn+1 (x)) . else
The single area distances D_{A,n}(x) are calculated from the trapezoidal areas defined by the spatial Euclidean distances d_n(x) and d_{n+1}(x) of the adjacent key meshes and the temporal distance t_{n+1} − t_n, as shown in Fig. 8.36. Since the Euclidean distance is calculated from 1D components, it can also become negative; the signs of the two distances d_n(x) and d_{n+1}(x) specify whether the area under investigation is a regular or a twisted trapezoid, and the area calculation is adapted accordingly, as shown by the arrow directions for the signed distances d_n(x) and d_{n+1}(x) in Fig. 8.36. For the y- and z-components, similar calculations are carried out. Finally, a normalized average distance D_A is calculated from the 3 separate area distances:
Fig. 8.36. Area calculation for non-crossing and crossing original and reconstructed animation paths for x-component, leading to regular and twisted trapezoidal areas for DA,n (x) respectively [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
D_A = (D_A(x) + D_A(y) + D_A(z)) / (3 (t_{N−1} − t_0) · d_{m,max}).
For normalization purposes, D_A is divided by the total temporal distance (t_{N−1} − t_0), as well as by the maximum spatial distance in the x-, y- and z-directions, d_{m,max}. The normalized average area distance was also used in MPEG core experiments for 3D graphics compression technology and is described in more detail in [47]. It is also used in the figures in the following section.

Comparative Coding Results

A comparative study between direct coding (AFX-IC), a clustering approach (D3DMC) and a combined method (RD-D3DMC) was presented in [54]. For the experiments, the "Humanoid" test set with different resolutions was selected, an animated sequence of 399 frames at resolutions of 498, 1940 and 7646 vertices per mesh. The sequence was reduced to 46 key meshes, which are actually encoded; all other meshes can be interpolated. The first graph, in Fig. 8.37, shows the results for the coarsest resolution with 498 vertices. Here, the fixed D3DMC performs worse than the standard AFX IC, since a relatively large percentage (∼34%) of the data rate is used for coding the spatial clustering structure. In comparison, the improved RD-optimized D3DMC performs similarly to AFX IC at bitrates above
Fig. 8.37. Distortion DA over bit rate for fixed and RD-optimized D3DMC and AFX IC, L1Humanoid L3, 46 key meshes, 498 vertices [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
10 kbit/s, and even better below. The improvement of the coder in comparison to the fixed version for low-resolution meshes mainly comes from the choice between different modes, where a larger part of the mesh is directly coded, so that the bitrate for the subdivision description is reduced. Since the RD-optimized coder always performs equal to or better than either of the other two, the additionally required spatial structure and mode selection information is compensated. Figure 8.38 evaluates the incremental gain coming from the different improvements of RD-D3DMC over D3DMC. The first is simply to add RD-optimization instead of the fixed threshold; only the trilinear interpolation mode is used in this case. This already provides a small gain. Adding the direct coding mode significantly improves the results. Finally, adding mean replacement as a third mode improves the compression performance further. Figure 8.39 shows the results for the medium mesh resolution of 1940 vertices. The clustering method performs significantly better at low and medium rates, while direct coding is slightly better at high rates. The optimum solution is to combine the best of both. At high rates and high accuracy, clustering gets close to direct coding: the octree structure tends to be very detailed without much gain from clustering vertices, while the overhead for signalling the clustering increases. In the third case, for the highest mesh resolution of 7646 vertices, the advantage of clustering over direct coding at low and medium rates becomes even larger
Fig. 8.38. Distortion DA over bit rate for Fixed and RD-optimized D3DMC and intermediate steps: RD-optimized with Trilinear Interpolation mode (TI) and RD-optimized with Trilinear Interpolation and Direct Coding modes, L1Humanoid L3, 46 key meshes, 498 vertices [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
Fig. 8.39. Distortion DA over bit rate for Fixed and RD-optimized D3DMC and AFX IC, L2Humanoid L3, 46 key meshes, 1940 vertices [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
(Fig. 8.40), as even more motion vectors can be clustered and coded by very few substitution vectors. At high rates all methods perform similarly. Visual examples for two different resolutions at lower data rates are shown in Fig. 8.41 (a) and (b). On the left of each figure the original mesh is shown, followed by the reconstruction results for AFX IC and D3DMC. Here, the "Mesh" tool from [58] was used. Both error images have been adapted to the same error scale to better highlight the differences between the methods. The lighter the color in the difference images, the larger the reconstruction error. The scales are shown to the left of each error image and additionally include the error histograms. For AFX IC the mesh surface is rather distorted due to coarse quantization, and the error histograms show a wide error distribution. In contrast, D3DMC shows only small reconstruction errors, with a narrow error distribution at very small values. Here, the spatial clustering of D3DMC clearly outperforms the plain direct coding approach used in AFX IC. The difference between the two approaches becomes even larger for the higher resolution mesh sequence in Fig. 8.41 (b), where the data rate for D3DMC is only 2/3 of that of AFX IC (the bit rates in Fig. 8.41 (a) are equal). If scalability and progressive decoding are not requirements, single rate encoders can be applied. They are relatively simple compared to progressive encoders and provide excellent compression performance.
Fig. 8.40. Distortion DA over bit rate for Fixed and RD-optimized D3DMC and AFX IC, L3Humanoid L3, 46 key meshes, 7646 vertices [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
A combination of direct coding and clustering in an RD-optimized framework ensures the best results in any case. However, as long as suitable optimization strategies that significantly reduce the computation are not available, its practicability is highly questionable. For a given application, a decision between direct coding and clustering therefore has to be taken. If only the highest quality (near lossless) is of interest, a direct coding method appears most attractive, due to its good performance and low complexity. If medium and low bitrates are envisaged, or if a single solution over all bitrates is required, a clustering approach such as the D3DMC described here should be chosen.
Fig. 8.41. Visual mesh reconstruction for 2 different resolutions, using the "Mesh" tool [58]: original mesh and reconstruction results for AFX-IC and D3DMC. (a): 1940 vertices, AFX IC: 60,1 kBit/s, D3DMC: 62,7 kBit/s; (b): 7646 vertices, AFX IC: 128,1 kBit/s, D3DMC: 86,4 kBit/s [54] (Copyright Elsevier, Signal Processing: Image Communication 2006)
8.3.2 Progressive Encoders

Multiresolution techniques were first studied for static meshes. They have been extended to code animated meshes as well. We describe the following methods:
• Lengyel (fitting affine predictors) [52]
• Shamir and Pascucci (TDAG) [59]
• Guskov and Khodakovsky (wavelet compression) [60]
• Mohr and Gleicher (dynamic mesh simplification) [61]
• Kircher and Garland (dynamic mesh simplification) [62]
• Alexa and Müller (PCA) [2]
• Karni and Gotsman (PCA with linear predictors) [63]
• Sattler et al. (clustered PCA) [64]
• Stefanoski et al. (scalable LPC) [65]
Lengyel [52] exploits several fitting predictors to compress animated meshes. A fitting predictor analyzes the animated mesh and selects from a set of predefined transformations the one that best approximates the global behavior of the vertices of the mesh. Lengyel groups the vertices of the mesh into sets and selects the transformation that best approximates the motion of each set of vertices. He supports affine transformations, free-form deformations, key-shapes, weighted trajectories, and skinning. Among these transformations one is chosen for each set of vertices. Lengyel then encodes the differences between the real positions of the vertices in each frame and the positions predicted by the transformation. If large sections of an animated mesh are transformed similarly by such a transformation, this approach is very effective. However, it is difficult to compute a partition of the vertices of the mesh into sets that yields "good" transformations. In order to avoid the expensive computation of an optimal partition, Lengyel proposes to select a triangle and to compute the transformation that best approximates the behavior of the vertices of that triangle. Triangles that undergo the same transformation are merged into sets. Shamir and Pascucci [59] extend multiresolution techniques for static triangular meshes to animated meshes. The approach aims to define a level of detail in both time and space. It is based on the principle of mesh simplification in each frame. Independent mesh simplifications are performed simultaneously in each level of detail, and a dependency graph, called a DAG, is created in order to record the inter-level dependencies between simplification operations. A DAG represents the levels of detail of a static mesh or frame; a cut through the graph corresponds to an adaptive resolution model. Shamir and Pascucci introduce the temporal directed acyclic graph (TDAG), which uses time-tags for all time-dependent information (Fig. 8.42). The TDAG is a very general data structure which also allows for encoding connectivity and topology changes. In a DAG for static meshes, the values of all nodes are static. In contrast, the TDAG for animated meshes stores nodes whose values can change over time.
Fig. 8.42. A TDAG contains multiple DAGs. The edges of the TDAG are marked by time intervals that represent the lifetime of the edge
To do this, the TDAG attaches time-tags to the different values in the nodes. A time-tag consists of a sequence of intervals (t_birth, t_death), where t_birth stands for the time of birth and t_death for the time of death of the value in a node. Thus, the value is alive in the time interval (t_birth, t_death). Note that a value can have multiple time intervals. Shamir and Pascucci propose an online algorithm that constructs a TDAG incrementally. For each mesh M_{i+1} a DAG is constructed and merged into the TDAG that has already been constructed for the meshes M_0, ..., M_i. In order that the constructed DAG for the mesh M_{i+1} conforms to the TDAG, the construction process is steered by a priority function that combines both current spatial constraints and temporal constraints. Each DAG is built by a process which successively collapses edges. Quadric error metrics together with topological and geometrical constraints are used as static spatial constraints, while the temporal constraints are formed by history constraints and are chosen to preserve the structure of the TDAG as much as possible. A cut through the DAG for a time t forms a valid mesh M_t^ε that approximates the exact mesh M_t up to an error ε at time t. Thus, a TDAG is parametric in two dimensions: time t and error ε. A TDAG can be queried with both parameters and returns the appropriate mesh. Mohr and Gleicher [61] propose an algorithm which simplifies a dynamic mesh and again produces a constant connectivity dynamic mesh, i.e. the simplification is performed on the single connectivity used for all frames. Their suggestion is to use the equivalent of Garland's simplification of a static mesh, with the only difference that the vertex quadric is evaluated for each frame and summed to form the collapse cost, i.e. the cost of collapsing an edge (v_1, v_2) into a final position v is expressed as
$$\sum_{i=1}^{k} v \left( Q_{v_1,i} + Q_{v_2,i} \right) v^T,$$
where $Q_{v,i}$ stands for the aggregate vertex quadric of vertex $v$ in frame $i$, and $k$ is the number of frames. The method generally provides better results than applying quadric-based simplification to a single frame, but other choices of the global criterion, such as using the maximum quadric value, are not discussed.

Kircher and Garland [62] have proposed an interesting algorithm for creating multiresolution dynamic meshes. They suggest using the quadric error metric (QEM) to produce a hierarchy of simplified versions of the mesh. They first create such a hierarchy for the first frame and subsequently search for so-called swaps, which update the structure of the hierarchy to better suit the geometry of subsequent frames. In the first step, a hierarchy of coarser representations of the first frame is created using the QEM-based method described earlier. The method is applied iteratively and its results are stored in a tree-like data structure. This structure consists of several levels, each representing a version of the mesh, and each node corresponds to a vertex. The contraction operation used in the QEM-based simplification contracts a given number of vertices (the branching order of the tree) at a given level to form a vertex at the next higher level. The original vertices and their coarser representation are connected by so-called contraction edges; these edges form a tree structure. Additionally, at each level there are edges which represent the actual topology at that level. These are strictly necessary only for the finest level, but they are useful for updating the mesh during the reclustering of subsequent frames. The multilevel mesh obtained by simplification of the first frame can be used for any other frame; however, it may not be optimal, because the geometry of subsequent frames is different, and the quadrics may therefore produce a higher error. The key idea is to reuse the hierarchy of the previous frame for the next frame and to update it by moving vertices from one cluster to another. It is likely that only small changes are present in the mesh, and therefore only a small number of updates will take place. Hence, instead of creating the whole structure from scratch for each frame, only a few so-called swaps are performed and encoded with the mesh. A swap is a primitive reclustering operation: moving a vertex $v$ from a cluster $a$ to a cluster $b$. A swap is fully described by the triplet $(v, a, b)$, and in order to be performed at a given time, it has to be valid and beneficial. A set of such swaps is performed and encoded with each frame of the mesh, which ensures that the hierarchy follows the geometry well throughout the animation.
A swap is valid when the following conditions are met:

• there is a vertex in the cluster $b$ which shares a topology edge with $v$ (i.e., $v$ lies on the border of $a$ towards $b$),
• $v$ is not the only vertex in $a$,
• $v$ is not a pinch vertex of $a$, i.e., when $v$ is removed from $a$, $a$ remains topologically compact.
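As an illustration, this validity test might be implemented along the following lines. The data layout (adjacency sets and cluster sets) is our own minimal sketch, not taken from [62]:

```python
# Sketch of the swap validity test for moving vertex v from cluster a to b.
# adj: dict mapping a vertex to the set of its topology neighbours.
# cluster: dict mapping a cluster id to the set of its vertices.
from collections import deque

def is_valid_swap(v, a, b, adj, cluster):
    # 1. v must lie on the border of a towards b: some neighbour is in b.
    if not any(n in cluster[b] for n in adj[v]):
        return False
    # 2. v must not be the only vertex in a.
    remaining = cluster[a] - {v}
    if not remaining:
        return False
    # 3. v must not be a pinch vertex: a minus v must stay connected,
    #    checked here by a BFS restricted to the remaining vertices of a.
    start = next(iter(remaining))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for n in adj[u]:
            if n in remaining and n not in seen:
                seen.add(n)
                queue.append(n)
    return seen == remaining
```

In a full encoder this check would be combined with the benefit estimate discussed next before a swap is accepted and encoded.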
A swap is beneficial if the QEM is reduced. The benefit of the swap can be roughly estimated as $b = Q_v(b) - Q_v(a)$, where $Q_v$ stands for the aggregate quadric of vertex $v$, and $a$ and $b$ are the positions of the vertices that correspond to the two clusters at the next coarser level of the mesh. This is a conservative estimate, as the positions of $a$ and $b$ may change due to the swap, and the benefit may therefore be even higher. However, it is necessary to ensure that not only the level immediately above the current one is positively influenced. Therefore, a generalized multilevel quadric metric is proposed, which sums the quadrics from all levels with weights ensuring that the contributions from each level are uniform:

$$E = \sum_{i=k+1}^{n} w_i \sum_{u \in M_i} E_u,$$

where $E_u$ is the quadric error at vertex $u$ of mesh level $M_i$. The weights $w_i$ are obtained recursively as

$$w_0 = 1, \qquad w_{i+1} = w_i \left( \frac{|M_{i+1}|}{|M_i|} \right)^{\beta},$$
where $|M_i|$ stands for the number of vertices at level $i$, and $\beta$ is an empirically determined constant that compensates for the growth of the quadric values; its value has been determined to be 1.9127. Once the multilevel structure is created, it can be cut at any level, and only the original topology of that level and its updates are then transmitted as the simplified version of the animation. This approach avoids the need to send the complete topology with each frame, while it still allows altering the topology according to the changes in geometry.

Guskov and Khodakovsky [60] introduce a wavelet-based compression method for triangular animated meshes of constant connectivity. They transform the frame meshes with an anisotropic wavelet transformation that runs on top of a progressive mesh hierarchy, and the wavelet details are encoded. The approach builds heavily on a previous approach for static triangular meshes [66]. A wavelet detail is computed every time a vertex is removed in the progressive mesh hierarchy. The wavelet detail is defined as the difference between
the actual position of the removed vertex and the position predicted from the coarser level. The efficiency is thus determined by the ability of the predictor to predict the position of the vertex well. The mesh is conceptually split into its connectivity, geometry, and parameterization information. The parameterization thereby contains the information how the mesh vertices sample the underlying shape of the object. For animated meshes it is often the case not only that the connectivity remains constant but also that the local parameterization is similar for all frame meshes. Thus, splitting the connectivity and parameterization information from the mesh decorrelates the geometry information and may result in good compression ratios. The connectivity and parameterization can be encoded (compressed) once and used for all frame meshes. The compression algorithm first encodes a parametric mesh, which can be any of the frame meshes but is usually the first one. The parametric information of this mesh is used to compress the remaining frame meshes.

Alexa and Müller [2] present an approach that computes principal animation components of an animated mesh and thus decouples the animation information from the underlying geometry. The approach supports progressive compression of the animation with levels of detail in both the spatial and the temporal domain and achieves high compression ratios due to the decoupling. The algorithm assumes key-frame meshes $M_i$ that can be interpolated to create the animation. All vertices of a key-frame mesh $M_i$ can be represented as a scalar vector $B_i$ that contains the vertex coordinates in a particular order; the order of the vertex coordinates must be the same in all key-frame meshes. In order to obtain a mesh in a key-frame animation, the key-frame meshes $M_i$ must be interpolated. Formally, the state $A$ of the mesh can be calculated as

$$A(t) = \sum_i a_i(t) B_i.$$
The $a_i(t)$ are the weights of the animation and are typically set to a vector that describes an interpolation between the nearest neighbors:

$$a_i(t) = \left( 0, \ldots, 0, \; \frac{t_{i+1} - t}{t_{i+1} - t_i}, \; \frac{t - t_i}{t_{i+1} - t_i}, \; 0, \ldots, 0 \right).$$

In order to separate geometry from animation information, an average geometry $\hat{B}_0$ can be extracted which contains the common properties of all frame meshes. Next, the main average differences of all frame meshes to $\hat{B}_0$ can be computed and stored in $\hat{B}_1$. This process can be repeated until the state $A$ can be expressed in another basis $\hat{B}_i$:

$$A(t) = \sum_i \hat{a}_i(t) \hat{B}_i.$$
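The change of basis can be made concrete with a small numerical sketch. The computation via singular value decomposition, described in the following paragraphs, is the method of [2]; the array shapes and function names below are our own illustration, and the rigid-body motion compensation of [2] is omitted:

```python
# Rough sketch of PCA-based animation compression via SVD.
# frames: (F, 3n) array, one flattened vertex-coordinate vector per frame.
import numpy as np

def pca_compress(frames, num_components):
    B = frames.T                                # (3n, F): columns are frame vectors
    B_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = num_components
    basis = B_hat[:, :k]                        # retained principal components
    coeffs = (np.diag(s[:k]) @ Vt[:k, :]).T     # (F, k) interpolation coefficients
    return basis, coeffs

def pca_reconstruct(basis, coeffs):
    # Reconstruct the (F, 3n) frame matrix from basis and coefficients.
    return (basis @ coeffs.T).T
```

Reconstruction with all components is exact; dropping components associated with small singular values yields the progressive levels of detail discussed below.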
This change of basis decouples the geometry and the animation information and allows for an efficient compression: $\hat{B}_0$ represents mainly the average geometry that is common to all key-frame meshes, and the $\hat{B}_i$ represent geometric changes to this average geometry with decreasing importance (with respect to the reconstruction of the animation).

Because the linear deviations $\hat{B}_i$ from the base geometry $\hat{B}_0$ cannot include rigid-body motions, the animation is first decomposed into a rigid-body motion and a soft-body motion. To do this, all meshes are translated such that their center of mass coincides with the origin. After that, an affine map $T_t$ is computed for each time step $t$ that minimizes the squared distance of the vertices to their corresponding vertices in the first mesh.

Alexa and Müller use principal component analysis (PCA) to compute the new basis $\hat{B}_i$ from the frame meshes. They use the singular value decomposition (SVD) to calculate the basis from the original key-frames, which are given by $B = (T_0 B_0, T_1 B_1, \ldots, T_{n-1} B_{n-1})$. The $T_i$ are the rigid-body motions for each frame mesh, and $B$ is an $n \times F$ matrix, where $n$ is the number of vertices in a frame mesh and $F$ is the number of frames. Applying the SVD to this matrix results in

$$B = \hat{B} S V^T,$$
where $S$ is the diagonal matrix of the singular values and $\hat{B}$ is the matrix that contains the new basis. The closer a singular value of $S$ is to zero, the closer the corresponding base shape is to being linearly dependent on the other base shapes. Figure 8.43 illustrates the principle of PCA-based compression. Note the change of basis (from $B$ to $\hat{B}$), which globally decorrelates the geometry information. A subsequent reduction of dimension, which corresponds to setting the lowest singular values to zero, leads to a more compact representation while inducing only a minimal error energy. Note that the SVD is very expensive to compute (both in time and space). If the memory needed to store the matrices exceeds the available memory, the meshes can be simplified first or some key-frame meshes can be skipped. The new interpolation coefficients $\hat{a}_i$ can be obtained by projecting the key-frames into the new basis. Note that the SVD creates an orthonormal basis, and thus it is sufficient to compute inner products of the key-frame meshes and the new basis vectors to obtain the desired projection. An animation can be reconstructed at different levels of detail: if a singular value is set to zero, the corresponding basis vector is not used for the animation.

Karni and Gotsman later extended the PCA approach of Alexa and Müller to compress the geometry even better by using a linear predictor [63]. They observed that components of the interpolation coefficients $\hat{a}(t_i)$ which are associated with large singular values in the matrix $S$ possess strong dependencies. Therefore they apply linear predictors
Fig. 8.43. Illustration of the principle of PCA-based compression. Each point represents a vector of vertex positions of a frame (here only in 3D). All points together represent all frames of a dynamic mesh
$$\hat{a}(t_i) = c_0 + \sum_{j=1}^{m} c_j \, \hat{a}(t_{i-j}),$$
of second ($m = 2$) or third ($m = 3$) order to predict the interpolation coefficients $\hat{a}(t_i)$. The optimal prediction weights $c_0, \ldots, c_m$ are calculated using a least-squares approach. Consequently, only the prediction errors between $\hat{a}(t_i)$ and its predicted value are encoded. Furthermore, by setting the lowest singular values to zero they obtain a more compact representation of the animation. Thus, by exploiting the coherence between interpolation coefficients, compression gains are increased significantly.

Sattler et al. propose in [64] a clustered PCA approach, which is in fact a combination of the cluster-based approach of Lengyel and the PCA-based approach of Alexa and Müller. They employ a data-driven procedure in order to identify mesh parts which are coherent over time: a clustering of vertex trajectories is performed. Clusters are determined in a manner that guarantees a compact representation of the vertex trajectories belonging to a cluster by a restricted set of eigen-trajectories (reduction of dimension). Standard principal component analysis is subsequently applied to compress each cluster. In comparison with previous approaches, clusters can be compressed more efficiently, i.e. with a smaller number of basis vectors or eigen-trajectories.

Stefanoski et al. [65] recently presented a linear predictive coding approach for dynamic 3D meshes supporting and exploiting scalability. The algorithm decomposes each frame $f$ of a mesh sequence into layers employing patch-based mesh simplification techniques.
Fig. 8.44. Multi-resolution representation of one frame of a dynamic mesh
This decomposition is consistent in time, leading to a time-consistent multi-resolution representation of a dynamic mesh (see Fig. 8.44). During the decomposition process the set of vertices $V$ is decomposed into $L$ disjoint subsets $V_l$ fulfilling

$$\bigcup_{l=1}^{L} V_l = V,$$
with each $V_l$ corresponding to a layer $l$. Vertex locations are encoded frame-wise, first encoding all vertex locations of the first frame $f = 1$ and ending with the vertex locations of the last frame $f = F$. Within each frame $f$, the vertex locations of the base layer are encoded first, i.e. $p_v^f$ with $v \in V_1$, and the vertex locations of the highest layer, i.e. $p_v^f$ with $v \in V_L$, are encoded last. The decomposition process guarantees that the neighbouring vertices $N(v)$ of a vertex $v \in V_l$ always lie one layer below, i.e. $N(v) \subset V_{l-1}$. For the predictive compression approach this means that the vertex locations of the 1-ring neighbourhood of $p_v^f$ in frame $f$ are already encoded and can be used for the prediction of $p_v^f$. A motion vector averaging predictor is employed, since all motion vectors of neighbouring vertices are known to the decoder. Besides this approach, which supports only spatial scalability and encodes all frames in order $f = 1, \ldots, F$, an alternative approach is presented that also supports temporal scalability: first all odd frames $f = 1, 3, 5, \ldots$ are encoded using the presented approach, and subsequently all even frames $f = 2, 4, 6, \ldots$ are encoded, predicting the vertex locations $p_v^f$ with an extended motion vector averaging predictor. It is realized by motion vector averaging prediction from two directions, i.e. from $f-1$ to $f$ and from $f+1$ to $f$; the average of these two values is used as the predicted value for $p_v^f$.

Discussion

Lengyel proposed in 1999 one of the first compression approaches for dynamic meshes. His approach exploits the dependencies of sets of vertices as they appear, e.g., in articulated motion. A combination of vertex clustering and affine prediction was applied in order to obtain a more compact representation of a dynamic mesh. One year later, Alexa and Müller proposed a different compression approach based on principal component analysis (PCA). Subsequently
several authors combined and improved these techniques. Karni and Gotsman employed PCA with linear predictors (PCA+LP) in order to further exploit dependencies between interpolation coefficients. Sattler et al. applied the PCA approach to sets of vertices, as in Lengyel's approach (CPCA), and obtained even higher compression gains. Guskov and Khodakovsky employed spatial wavelets (WL) in order to separate parameterization and connectivity information from the geometry. They assume that the parameterization and connectivity information do not change over time, i.e. the dynamic meshes perform only isometric deformations from frame to frame (they are parametrically consistent). Then parameterization and connectivity information have to be encoded only once, and the subsequent geometry compression leads to high compression gains. However, the assumption of parametrically consistent dynamic meshes does not always hold in practice. Recently, Stefanoski et al. presented a scalable linear predictive compression approach. They introduce a spatially scalable (SSLPC) and a spatio-temporally scalable (STSLPC) linear predictive compression algorithm. Their approach features low computational complexity and shows superior compression performance compared to the progressive encoders presented previously.

In Fig. 8.45 operational rate-distortion curves are presented, illustrating the compression performance of the coders. For the evaluation, the mesh sequence Cow was employed, consisting of 204 frames and 2094 vertices per frame. The bit-rate is calculated in bits per vertex and frame (bpvf), and the error is measured using a normalized vertex-wise L1 distance, denoted here as KG error, which was introduced in [63]. Here an error of 0.15 can be regarded as lossless with
Fig. 8.45. Rate-distortion curves of progressive coders. A KG error below 0.15 can be regarded as visually lossless
respect to visual quality. Thus, the WL approach shows better performance than the PCA-based approaches, whereas SSLPC and STSLPC outperform WL. Temporal scalability leads to additional gains, illustrated by the superiority of STSLPC over SSLPC.

A data structure for the representation of dynamic meshes was proposed by Shamir and Pascucci. Their general framework can be exploited to obtain a level-of-detail decomposition of a dynamic mesh in both directions, space and time. Mohr and Gleicher [61] and Kircher and Garland [62] extended existing approaches for static mesh simplification to dynamic meshes in order to obtain time-consistent multi-resolution mesh sequences. These promising developments point out directions of further research for lossless coding of dynamic meshes. An overview of the computational complexities of the presented progressive approaches can be found in Table 8.4.

Table 8.4. Overview of computational complexities

Author                    | Type of approach                                             | Complexity | Remarks
Lengyel [52]              | fitting of predictors (FP)                                   | **         | one of the first compression approaches for dynamic meshes
Shamir & Pascucci [59]    | graph representation (GR)                                    | *          | multi-resolution representation for dynamic meshes
Guskov & Khodakovsky [60] | wavelet (WL)                                                 | *          | good compression results for parametrically coherent dynamic meshes
Alexa & Müller [2]        | PCA                                                          | ***        | good compression results for dynamic meshes with a large number of frames
Karni & Gotsman [63]      | PCA with linear predictors (PCA+LP)                          | ***        | improved compression performance with respect to PCA
Sattler et al. [64]       | clustered PCA (CPCA)                                         | ***        | improved compression performance with respect to PCA+LP coders
Stefanoski et al. [65]    | spatially scalable linear predictive coding (SSLPC)          | *          | supports spatial scalability and shows superior performance compared to the above approaches
Stefanoski et al. [65]    | spatio-temporally scalable linear predictive coding (STSLPC) | *          | supports spatio-temporal scalability and shows superior performance compared to the above approaches
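For orientation, the error measure used in this comparison might be computed along the following lines. This is only a sketch of a normalized vertex-wise error in the spirit of [63]; the exact norm and normalization used there may differ in detail:

```python
# Sketch of a KG-style error between an original and a decoded mesh sequence.
import numpy as np

def kg_error(A, A_dec):
    """A, A_dec: (F, n, 3) arrays of vertex positions per frame.
    Returns a percentage: the distance between the sequences, normalized by
    the deviation of the original sequence from its per-frame mean."""
    diff = np.linalg.norm(A - A_dec)
    mean = A.mean(axis=1, keepdims=True)   # per-frame average vertex position
    denom = np.linalg.norm(A - mean)
    return 100.0 * diff / denom
```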
8.3.3 Others

Briceno et al. [67] introduced Geometry Videos, which extend the concept of geometry images to dynamic meshes (see Sect. 8.2.1). A geometry video provides a technique to handle an animated mesh like a video sequence. In order to represent a mesh as a geometry image, the mesh is cut such that its topology becomes equivalent to a disk. The cut mesh is parameterized over a square domain and sampled onto a square image, where the coordinates x, y, z are encoded as the RGB values of the image. The image can be compressed using any image compression technique, but wavelet-based schemes perform best because they allow storing additional information in sidebands such that the cut boundaries of the mesh match exactly after lossy compression. Geometry videos work in basically the same way and extend each of these steps to handle dynamic meshes such that all frame meshes are considered. Instead of computing a parameterization for each frame mesh separately, geometry videos compute a single parameterization that performs well for all frame meshes. Hence, temporal coherence between frames is exploited in order to yield good compression ratios and to decrease the jittering artefacts that are visible if each frame mesh has its own parameterization. The cut is computed iteratively, starting with a single edge. Every iteration improves the cut by adding a new edge that points to the vertex producing the highest distortion with respect to all frame meshes; by adding this edge to the cut, the parameterization relaxes the high distortion at that vertex. Next, a single parameterization is computed that works well for all frame meshes and uses the global cut. In theory, the parameterization should minimize a stretch metric considering the geometry of all frame meshes, but Briceno noticed that a parameterization computed from a single frame mesh works well and introduces an only slightly larger reconstruction error than a global parameterization. So an arbitrary frame mesh is used to compute a stretch-minimizing parameterization. Given both the global cut and the global parameterization, every frame mesh is converted into a geometry image. The sequence of geometry images forms a geometry video, which is compressed using standard 2D video compression techniques like MPEG.
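To illustrate the core sampling step, the following sketch quantizes a regularly resampled, parameterized mesh into an RGB image and back. The cut and the stretch-minimizing parameterization, which are the difficult parts, are assumed to be given; all names are illustrative:

```python
# Minimal sketch of the geometry-image step: x, y, z coordinates are
# normalized to [0, 255] and stored as RGB; the bounding box is kept
# as side information so the mapping can be inverted.
import numpy as np

def to_geometry_image(sample_positions):
    """sample_positions: (H, W, 3) array of 3D points obtained by sampling
    the parameterized mesh on a regular grid over the unit square."""
    lo = sample_positions.min(axis=(0, 1))
    hi = sample_positions.max(axis=(0, 1))
    normalized = (sample_positions - lo) / (hi - lo)
    image = np.round(normalized * 255).astype(np.uint8)
    return image, (lo, hi)

def from_geometry_image(image, bbox):
    lo, hi = bbox
    return image.astype(np.float64) / 255.0 * (hi - lo) + lo
```

A geometry video would then feed one such image per frame, all sharing the same cut and parameterization, into a standard 2D video codec.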
Acknowledgements

This work was supported by the EC within FP6 under Grant 511568 with the acronym 3DTV. We would like to thank Alexandru Salomie from Vrije Universiteit Brussel for providing the Humanoid mesh sequence (used for Figs. 8.28 and 8.37–8.41), the Digital Michelangelo Project of Stanford University for providing the David model (used for Figs. 8.5 and 8.6), Matthias Müller from ETH Zurich for providing the Cow mesh sequence (used for Figs. 8.44 and 8.45), and the Computer Graphics Lab of ETH Zurich for providing the Doo
Young data set (used for Fig. 8.29). The models Face (used in Figs. 8.21 and 8.27) and Head (used in Fig. ??) are frequently used in the computer graphics community for research purposes and are provided for download at several websites; their origin is not known to the authors. They have been downloaded from www.eecs.harvard.edu/~gotsman/rendering/vrml/face.wrl and www.its.caltech.edu/~matthewf/Meshes/head.obj.
References

1. M. Mäntylä, An Introduction to Solid Modeling, Computer Science Press, College Park, MD, 1988.
2. M. Alexa and W. Müller, Representing animations by principal components, Computer Graphics Forum, Vol. 19(3), 2000.
3. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, 1949. ISBN 0-252-72548-4.
4. H. Nyquist, Certain topics in telegraph transmission theory, Transactions AIEE, Vol. 47, pp. 617–644, April 1928. Reprinted as classic paper in: Proceedings IEEE, Vol. 90, No. 2, February 2002.
5. A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing (2nd Ed.), Prentice-Hall, New Jersey, 1999. ISBN 0137549202.
6. N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall Int., London, 1984.
7. D. Marpe, H. Schwarz, and T. Wiegand, Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 620–636, 2003.
8. G. Taubin and J. Rossignac, Geometric compression through topological surgery, ACM Transactions on Graphics, Vol. 17, No. 2, pp. 84–115, 1998.
9. J. Rossignac, Edgebreaker: Connectivity compression for triangle meshes, IEEE Transactions on Visualization and Computer Graphics, Vol. 5, No. 1, pp. 47–61, 1999.
10. M. Isenburg and J. Snoeyink, Face fixer: Compressing polygon meshes with properties. Proceedings of SIGGRAPH 2000, pp. 263–270, July 2000.
11. C. Touma and C. Gotsman, Triangle mesh compression. Proceedings of Graphics Interface 98, pp. 26–34, 1998.
12. M. Deering, Geometry compression. In SIGGRAPH '95 Conference Proceedings, pp. 13–20, 1995.
13. S. Gumhold and W. Straßer, Real time compression of triangle mesh connectivity. Computer Graphics Proceedings, Annual Conference Series, 1998 (ACM SIGGRAPH '98 Proceedings), pp. 133–140, July 1998.
14. J. Rossignac and A. Szymczak, Wrap & Zip: Linear decoding of planar triangle graphs, IEEE Transactions on Visualization and Computer Graphics, Vol. 5, No. 1, pp. 47–61, 1999.
15. M. Isenburg and J. Snoeyink, Compressing the property mapping of polygon meshes. Proceedings of Pacific Graphics 2001, pp. 4–11, October 2001.
16. P. Alliez and M. Desbrun, Valence-driven connectivity encoding of 3D meshes, Computer Graphics Forum, 20:480–489, 2001.
17. Z. Karni and C. Gotsman, Spectral compression of mesh geometry. Computer Graphics (Proceedings of SIGGRAPH), pp. 279–286, 2000.
18. X. Gu, S. J. Gortler, and H. Hoppe, Geometry images. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 355–361, ACM Press, 2000.
19. H. Hoppe, Progressive meshes. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 99–108, August 1996.
20. J. Popovic and H. Hoppe, Progressive simplicial complexes. In Computer Graphics (SIGGRAPH '97 Proceedings), 1997.
21. G. Taubin, A. Gueziec, W. Horn, and F. Lazarus, Progressive forest split compression. In SIGGRAPH '98, August 1998.
22. R. Pajarola and J. Rossignac, Compressed Progressive Meshes, Technical Report GIT-GVU-99-05, GVU Center, Georgia Institute of Technology, January 1999.
23. R. Ronfard and J. Rossignac, Full range approximation of triangulated polyhedra. In Proceedings of Eurographics '96, pp. 67–76, 1996.
24. M. Garland and P. S. Heckbert, Surface simplification using quadric error metrics. Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 209–216, 1997.
25. M. Garland and Y. Zhou, Quadric-based simplification in any dimension, ACM Transactions on Graphics, Vol. 24, No. 2, 2005.
26. P. Lindstrom and G. Turk, Fast and memory efficient polygonal simplification, Proceedings of the Conference on Visualization '98, Research Triangle Park, North Carolina, United States, pp. 279–286, 1998.
27. N. Dyn, D. Levin, and J. A. Gregory, A butterfly subdivision scheme for surface interpolation with tension control, ACM Transactions on Graphics, 9(2):160–169, April 1990.
28. P. Alliez and M. Desbrun, Progressive compression for lossless transmission of triangle meshes. In SIGGRAPH 2001 Conference Proceedings, pp. 198–205, 2001.
29. M. Lounsbery, T. D. DeRose, and J. Warren, Multiresolution analysis for surfaces of arbitrary topological type, ACM Transactions on Graphics, Vol. 16, No. 1, pp. 34–73, 1997. Originally available as TR-93-10-05, October 1993, Department of Computer Science and Engineering, University of Washington.
30. P. Schröder and W. Sweldens, Digital Geometry Processing, Course Notes, ACM SIGGRAPH, 2001.
31. S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1999.
32. E. J. Stollnitz, T. DeRose, and D. H. Salesin, Wavelets for Computer Graphics: Theory and Applications, Morgan Kaufmann Publishers Inc., 1996.
33. D. Zorin, P. Schröder, T. DeRose, L. Kobbelt, A. Levin, and W. Sweldens, Subdivision for Modeling and Animation, Course Notes, ACM SIGGRAPH, 2000.
34. A. Khodakovsky, P. Schröder, and W. Sweldens, Progressive geometry compression. Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 271–278, July 2000.
35. A. W. F. Lee, W. Sweldens, P. Schröder, L. Cowsar, and D. Dobkin, MAPS: Multiresolution adaptive parameterization of surfaces. In Proceedings of ACM SIGGRAPH 98, pp. 95–104, 1998.
36. C. Loop, Smooth Subdivision Surfaces Based on Triangles. Master's Thesis, University of Utah, August 1987.
37. A. Said and W. A. Pearlman, An image multiresolution representation for lossless and lossy compression, IEEE Transactions on Image Processing, Vol. 5, pp. 1303–1310, September 1996.
38. P. Cignoni, C. Rocchini, and R. Scopigno, Metro: Measuring error on simplified surfaces. Computer Graphics Forum, 17(2):167–174, 1998.
39. A. Khodakovsky and I. Guskov, Normal Mesh Compression. Submitted for publication, http://www.multires.caltech.edu/pubs/compression.pdf.
40. I. Guskov, K. Vidimče, W. Sweldens, and P. Schröder, Normal meshes, Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 95–102, July 2000.
41. F. Morán and N. García, Comparison of wavelet-based three-dimensional model coding techniques, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 7, July 2004.
42. M. Isenburg and P. Lindstrom, Streaming meshes. Proceedings of Visualization '05, pp. 231–238, October 2005.
43. H. Vo, S. Callahan, P. Lindstrom, V. Pascucci, and C. Silva, Streaming simplification of tetrahedral meshes, IEEE Transactions on Visualization and Computer Graphics, Vol. 13, No. 1, pp. 145–155, January/February 2007.
44. M. Isenburg, P. Lindstrom, and J. Snoeyink, Streaming compression of triangle meshes. Proceedings of the 3rd Symposium on Geometry Processing, pp. 111–118, July 2005.
45. M. Isenburg, P. Lindstrom, S. Gumhold, and J. Shewchuk, Streaming compression of tetrahedral volume meshes. Proceedings of Graphics Interface 2006, pp. 115–121, June 2006.
46. M. Bourges-Sevenier and E. S. Jang, An introduction to the MPEG-4 animation framework extension, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, pp. 928–936, 2004.
47. E. S. Jang, J. D. K. Kim, S. Y. Jung, M. J. Han, S. O. Woo, and S. J. Lee, Interpolator data compression for MPEG-4 animation, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, 2004.
48. L. Ibarria and J. Rossignac, Dynapack: Space-time compression of the 3D animations of triangle meshes with fixed connectivity. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003.
49. L. Ibarria, P. Lindstrom, J. Rossignac, and A. Szymczak, Out-of-core compression and decompression of large n-dimensional scalar fields. In Proceedings of Eurographics 2003, 2003.
50. N. Stefanoski and J. Ostermann, Connectivity-guided predictive compression of dynamic 3D meshes. Proceedings of ICIP '06 – IEEE International Conference on Image Processing, Atlanta, October 2006.
51. J. Rossignac, A. Safonova, and A. Szymczak, 3D compression made simple: Edgebreaker on a corner table. Proceedings of Shape Modeling International Conference, pp. 278–283, 2001.
52. J. E. Lengyel, Compression of time-dependent geometry. In Proceedings of the 1999 Symposium on Interactive 3D Graphics, pp. 89–95, ACM Press, 1999.
53. J. Zhang and C. B. Owen, Octree-based animated geometry compression, DCC '04, Data Compression Conference, Snowbird, Utah, USA, pp. 508–517, 2004.
54. K. Müller, A. Smolic, M. Kautzner, P. Eisert, and T. Wiegand, Rate-distortion-optimized predictive compression of dynamic 3D mesh sequences, Invited Paper,
Signal Processing: Image Communication, Vol. 21, Issue 9, pp. 812–828, Special Issue on Interactive Representation of Still and Dynamic Scenes, October 2006.
55. K. Müller, A. Smolic, M. Kautzner, P. Eisert, and T. Wiegand, Predictive compression of dynamic 3D meshes. In Proceedings of the International Conference on Image Processing, pp. 621–624, 2005.
56. J. Zhang and C. B. Owen, Hybrid coding for animated polygonal meshes: Combining delta and octree, International Conference on Information Technology: Coding and Computing (ITCC '05), Vol. I, pp. 68–73, 2005.
57. D. Huttenlocher, G. Klanderman, and W. Rucklidge, Comparing images using the Hausdorff distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 9, pp. 850–863, 1993.
58. N. Aspert, D. Santa-Cruz, and T. Ebrahimi, MESH: Measuring errors between surfaces using the Hausdorff distance. Proceedings of the IEEE International Conference on Multimedia and Expo, Vol. I, pp. 705–708, 2002.
59. A. Shamir and V. Pascucci, Temporal and spatial level of details for dynamic meshes. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pp. 77–84, ACM Press, 2001.
60. I. Guskov and A. Khodakovsky, Wavelet compression of parametrically coherent mesh sequences. In SCA '04: Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 183–192, ACM Press, 2004.
61. A. Mohr and M. Gleicher, Deformation Sensitive Decimation, Technical Report 4/7/2003, University of Wisconsin, Madison, 2003.
62. S. Kircher and M. Garland, Progressive multiresolution meshes for deforming surfaces, ACM/Eurographics Symposium on Computer Animation, pp. 191–200, 2005.
63. Z. Karni and C. Gotsman, Compression of soft-body animation sequences, Elsevier Computers & Graphics, 28, pp. 25–34, 2004.
64. M. Sattler, R. Sarlette, and R. Klein, Simple and efficient compression of animation sequences, Eurographics/ACM SIGGRAPH Symposium on Computer Animation, 2005.
65. N. Stefanoski, X. Liu, P. Klie, and J. Ostermann, Scalable linear predictive coding of time-consistent 3D mesh sequences, submitted to 3DTV-CON – The True Vision: Capture, Transmission, and Display of 3D Video, Kos Island, Greece, May 2007.
66. I. Guskov, W. Sweldens, and P. Schröder, Multiresolution signal processing for meshes. Proceedings of SIGGRAPH 1999, pp. 325–334, 1999.
67. H. M. Briceno, P. V. Sander, L. McMillan, S. Gortler, and H. Hoppe, Geometry videos: A new representation for 3D animations. In SCA '03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 136–146, Eurographics Association, 2003.
9 Compression of Multi-view Video and Associated Data

Aljoscha Smolic, Philipp Merkle, Karsten Müller, Christoph Fehn, Peter Kauff and Thomas Wiegand

Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Image Processing Department, Einsteinufer 37, 10587 Berlin, Germany
Digital media have significantly influenced and changed modern society over the last two decades. Media are increasingly produced, processed, stored, and transmitted in digital formats. Applications, terminals, and content are converging: television can be consumed with mobile phones, Internet access is possible with TV sets, and modern home PCs are powerful multimedia workstations with steadily increasing capabilities. A DVD not only contains the movie with video and audio but also a vast amount of supplementary information. Modern media formats integrate an increasing number of media types, such as video, audio, computer graphics, text, images, etc. into a single file or transmission format, and some of these formats further enable user interactivity with the content. An important driving factor in this area is the availability of international standards for digital media formats. International standards provide interoperability between different systems while still allowing for competition among equipment and service providers. ISO/IEC JTC 1/SC 29/WG 11 (Moving Picture Experts Group – MPEG) and ITU-T SG 16 Q.6 (Video Coding Experts Group – VCEG) are two of the international bodies that play an important role in digital media standardization. Recently, the convergence of new technologies from computer graphics, computer vision, multimedia, and related fields has also enabled the development of new types of media, such as 3D video (3DV) and free viewpoint video (FVV), that expand the user's sensation beyond what is offered by traditional media. 3DV, also referred to as stereo, offers a 3D depth impression of the observed scenery (note that the term 3D may have different meanings in the context of this book), while FVV allows for an interactive selection of viewpoint and viewing direction within a certain operating range, a feature also known from computer graphics. In order to enable 3DV and FVV to be used in applications, the whole processing chain, including image acquisition, 3D representation, compression, transmission, signal processing, interactive rendering, and 3D displays, needs to be considered [1]. The design has to take all
parts into account, since there are strong interrelations between all of them. For instance, an interactive display that requires random access to 3D data will affect the performance of a coding scheme that is based on data prediction. A more detailed overview of 3DV and FVV applications and systems is given below in Sect. 9.1. These are quite diverse, and various types of 3D scene representations can be employed, which implies a number of different data types. Some application scenarios may be based on proprietary systems, as for instance already employed for (post-)production of movies and TV content. On the other hand, there are also application scenarios that require interoperable systems, such as 3DTV broadcast or free viewpoint video on storage media such as DVDs. This may open up large consumer markets for 3D displays, set-top boxes, media, content, storage devices, etc., along with the corresponding equipment for production, transmission, etc. Therefore, the MPEG committee has in recent years been investigating the need for standardization in the area of 3D and free viewpoint video in a group called 3DAV (for 3D audiovisual) [2]. The committee has provided an overview of relevant technologies and has shown that a number of these technologies are already supported by standards such as MPEG-4. For the missing elements, new standardization activities have been launched. This includes for instance new tools for efficient and high-quality representation of 3D video objects, which have been adopted as part of the MPEG-4 computer graphics specification known as Animation Framework eXtension (AFX) [3, 4]. Other types of 3D scene representations for 3DV and FVV were not efficiently supported by existing standards; for those, too, new standardization activities have been launched. One of these activities, the new standard for multi-view video coding (MVC), is the main focus of Sect. 9.3 of this chapter. Multi-view video refers to a set of N temporally synchronized video streams coming from cameras that capture the same real world scenery from different viewpoints. Such multi-view video is widely used in various 3DV and FVV systems. Efficient compression and the availability of an open international standard are crucial for the success of this technology. Besides the video signals, associated data such as camera calibration parameters have to be considered. The most important special case of multi-view video is stereo video, which means exactly N = 2 videos, each one derived for projection into one eye of the user, in order to generate a depth impression. Such systems are already widely used in niche markets such as IMAX theatres and medical or scientific applications. Standards have been available for some time, but mass consumer markets have not developed so far, due to a number of imperfections of the video chain. These may be overcome due to recent technology developments, including a new efficient MPEG standard. Section 9.2 is devoted to these aspects of stereo video.
9.1 Applications and Systems

3DV and FVV have been introduced as new types of digital media that expand the user's sensation beyond what is offered by traditional media. More specifically, both provide new functionalities:

• 3DV provides a depth impression of the observed scenery,
• FVV allows for an interactive selection of viewpoint and viewing direction within a certain operating range, as known from computer graphics.
The two do not exclude each other. On the contrary, they can be very well combined within a single system, since both are based on a suitable 3D scene representation. In other words, given a 3D representation of a scene, if a stereo pair corresponding to the human eyes can be rendered, the functionality of 3DV is provided. If a virtual view (i.e., not an available camera view) corresponding to an arbitrary viewpoint and viewing direction can be rendered, the functionality of FVV is provided. In most cases the navigation range (allowed virtual viewpoints and viewing directions) is thereby restricted to practical limits. In principle, all 3D scene representations provide both functionalities; in the practical implementation of a system it may be decided to provide only one of them, i.e. either pure FVV or pure 3DV. Different technologies can be used for acquisition, processing, representation, and rendering, but all make use of multiple views of the same visual scene [5]. The camera setting and density (i.e., number of cameras) imposes practical limitations on navigation and on the quality of rendered views at a certain virtual position. Therefore, there is a classical trade-off to consider between costs (for equipment, cameras, processors, etc.) and FVV quality (navigation range, quality of virtual views).

9.1.1 3D Scene Representation

The choice of a 3D scene representation format is of central importance for the design of any 3DV or FVV system. On the one hand, the scene representation sets the requirements for acquisition and multi-view signal processing. For instance, using an image-based representation (see below) implies using a dense camera setting; a relatively sparse camera setting would only give poor rendering results for virtual views. Using a geometry-based representation (see below), by contrast, implies the need for sophisticated and error-prone image processing algorithms such as object segmentation and 3D geometry reconstruction. On the other hand, the 3D scene representation determines the rendering algorithms (and with that also navigation range, quality, etc.), interactivity, as well as compression and transmission if necessary. In the computer graphics literature, methods for 3D scene representation are often classified as a continuum between two extremes [6]. These principles can also be applied for 3DV and FVV, as illustrated in Fig. 9.1. The one extreme is represented by classical 3D computer graphics. This approach
[Diagram: continuum from image-based representations (ray-space, light-field, lumigraph) via depth-based ones (video plus depth, multi-view video plus depth, layered depth video) to geometry-based ones (3D point samples, video fragments, view-dependent video texture mapping, 3D mesh model)]
Fig. 9.1. 3D scene representations for 3DV and FVV
can also be called geometry-based modeling. In most cases scene geometry is described on the basis of 3D meshes. Real world objects are reproduced using geometric 3D surfaces with an associated texture mapped onto them. More sophisticated attributes can be assigned as well; for instance, appearance properties (opacity, reflectance, specular lights, etc.) can significantly enhance the realism of the models. Geometry-based modeling is used in applications such as games, the Internet, TV, movies, etc. The achievable performance with these models can be excellent, particularly if the scenes are purely computer generated. The available technology for both production and rendering has been highly optimized over the last few years, especially in the case of common 3D mesh representations. In addition, state-of-the-art PC graphics cards are able to render highly complex scenes with an impressive quality in terms of refresh rate, levels of detail, spatial resolution, reproduction of motion, and accuracy of textures. A drawback of this approach is that typically high costs and human assistance are required for content creation. Aiming at photo-realism, 3D scene and object modeling is often complex and time consuming, and it becomes even more complex if dynamically changing scenes are considered. Furthermore, automatic 3D object and scene reconstruction implies an estimation of camera geometry, depth structures, and 3D shapes. With some likelihood, all these estimation processes generate errors in the geometric model, and these errors then have an impact on the rendered images. Therefore, high-quality production of geometry models, e.g., for movies, is typically done user-assisted.

The other extreme in 3D scene representations in Fig. 9.1 is called image-based modeling and does not use any 3D geometry at all. In this case virtual intermediate views are generated from available natural camera views by interpolation. The main advantage is a potentially high quality of virtual view synthesis that avoids any 3D scene reconstruction. However, this benefit has to be paid for by a dense sampling of the real world with a sufficiently large number of natural camera view images. In general, the synthesis quality increases with the number of available views. Hence, typically a large number of cameras
has to be set up to achieve high-performance rendering, and a tremendous amount of image data therefore needs to be processed. Conversely, if the number of used cameras is too low, interpolation and occlusion artefacts will appear in the synthesized images, possibly affecting the quality. Examples of image-based representations are Ray-Space [7, 8] or light-field rendering [9], and panoramic configurations including concentric and cylindrical mosaics [10, 11, 12, 13]. None of these methods makes any use of geometry, but they either have to cope with an enormous complexity in terms of data acquisition or they apply simplifications restricting the level of interactivity.

In between the two extremes there exist a number of methods that make more or less use of both approaches and combine the advantages in some way. For instance, a Lumigraph [14, 15] uses a representation similar to a light-field but adds a rough 3D model. This provides information on the depth structure of the scene and therefore allows for reducing the number of necessary natural camera views. Other representations do not use explicit 3D models but depth or disparity maps. Such maps assign a depth value to each sample of an image. Together with the original 2D image, the depth map builds a 3D-like representation, often called 2.5D. This can be extended to Layered Depth Images [16], where multiple color and depth values are stored in consecutively ordered depth layers. Closer to the geometry-based end of the spectrum, methods are reported that use view-dependent geometry and/or view-dependent texture [17, 18]. Surface light-fields combine the idea of light-fields with an explicit 3D model [19, 20]. Furthermore, volumetric representations such as voxels (from volume elements) can be used instead of a complete 3D mesh model to describe 3D geometry [21, 22, 23, 24, 25]. A complete system for efficient representation and interactive streaming of high-resolution panoramic views has been presented in [26]. Other coding and transmission aspects of such data have also been studied, for example in [8, 11, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36].

9.1.2 3D Video

3D video provides the functionality of a 3D depth impression of the observed scene. In fact, this functionality, also known as stereo, is not new: extending visual sensation to the third dimension has been investigated for a long time, and commercial systems (e.g., in IMAX theatres or medicine) are available. However, acceptance in user mass markets (3DTV at home, DVDs, etc.) has not been reached yet. This may be overcome due to recent developments of autostereoscopic 3D displays, i.e., 3D displays that can be viewed without special glasses, and advanced 3D rendering that supports head motion parallax viewing [5]. Humans perceive natural depth because both eyes have a slightly different view onto the real world. Each eye sees a different image from a slightly different viewpoint. The brain merges this information and creates the perception
of a single 3D view. Technical systems can exploit this property for 3D displays. Two cameras (called stereo cameras) capture the scenery from viewpoints corresponding to the human eye positions. If a display system can ensure that each eye sees only the corresponding view, a 3D depth impression is generated. Such display systems have been known for decades. Most of them use special glasses to make sure that each eye perceives only the corresponding view. In an anaglyph representation the images are overlaid and glasses with color filters (e.g. red-green) are used. Other systems use shutter glasses with polarized video (e.g. IMAX theatres). However, the necessity to wear glasses is – among other individual drawbacks of different systems – regarded as an obstacle to wide acceptance of 3DV technology. But in niche markets 3DV is accepted and growing rapidly. This includes for instance computer games, which are produced in most cases using geometric models. Graphics drivers that produce a 3D output (shutter) are commercially available. Using appropriate glasses (which are inexpensive), one can turn a PC into a 3D display system and play computer games in 3D. Besides such pure computer graphics content, natural stereo video is also increasingly becoming available on the Internet, on DVDs, etc. Such content is either captured in stereo (such as IMAX movies) or is converted from available 2D video. 2D-3D conversion is possible in user-assisted production systems, providing an interesting option for content owners and producers. Further, new 3D displays are available where no glasses are necessary for a 3D impression. Such autostereoscopic displays use lenticular screens to display two images at the same time, where each eye can see only one of them.

Another drawback of classical stereo video systems that capture and display two fixed views to the user is that head motion parallax effects cannot be supported. A human watching a 3D scene expects occlusion and disocclusion effects when moving: parts of objects should appear and disappear. This is not possible with fixed views. Such pure 3DV functionality (the viewpoint being fixed to the position of the stereo camera) appears unnatural when moving. This may be overcome by adding even a limited free viewpoint video functionality (see Sect. 9.1.3) using depth-based stereo rendering. A video signal and a per-sample depth map are transmitted to the user; from the video and depth information, a stereo pair can be rendered [36]. Some 3D displays allow for user tracking with built-in camera sensors [37]. The user's eye positions are automatically tracked by the system. This is used to automatically adjust the 3D impression and to support head motion parallax viewing. Depending on the motion of the user, the rendered views are adjusted in real-time to the actual eye position. With that, occlusion and disocclusion effects are supported within a limited operating range corresponding to the motion of a user sitting on a chair in front of the screen. Since rendering is done at the receiver, the depth impression can be adjusted individually by the user in the same way it is done with color or brightness using a classical TV set [36]. A problem of the stereo approach with depth is content creation, i.e., the generation of depth or disparity information. Cameras that automatically
capture per-sample depth along with the video are available and are being further enhanced, but the quality of the captured depth fields is currently still limited. Algorithms for depth and disparity estimation have been studied extensively in the computer vision literature, and powerful solutions are available. However, it always remains an estimation that can only be solved up to a residual error probability; the true information is in general not accessible, and estimation errors influence the quality of rendered views. One approach is to use a stereo camera for capturing and to estimate depth/disparity from correspondences. Depth structure can also be estimated from motion and other properties of single-view video. However, a fully automatic and accurate depth/disparity capturing system is still to be developed. As mentioned before, user-assisted content generation is an option for specific applications.

9.1.3 Free Viewpoint Video

Free viewpoint video offers the same functionality that is known from 3D computer graphics. The user can choose a viewpoint and viewing direction within a visual scene, meaning interactive free navigation. This is illustrated in Fig. 9.2: a number of real cameras capture a real world scene, and within practical limits the user may freely navigate through it. In contrast to pure computer graphics applications, FVV targets real world scenes as captured by natural view cameras. This is potentially interesting for user applications (a DVD of an opera or concert where the user can freely choose the viewpoint) as well as for (post-)production. Systems for the latter are already being used (e.g., for sports, movies, EyeVision, 'Matrix' effects). In the simplest case of FVV, user navigation is restricted to the available camera positions, i.e. there is no generation of virtual intermediate views. Figure 9.3 illustrates the famous stop-motion effect (e.g., known from the movie 'The Matrix'). It is possible to freeze the FVV scene in time in analogy
Fig. 9.2. Free viewpoint video, interactive selection of virtual viewpoint and viewing direction within practical limits
Fig. 9.3. Stop-motion effect, six different viewpoints at the same point in time
to a freeze image of classical 2D video. Then it is still possible to navigate around the scene (in the spatial dimension) and to show it from different viewpoints at the same time. In this approach any virtual camera path can be produced afterwards. Figure 9.3 shows an example of six different viewpoints at the same point in time. Conventionally this effect is produced by placing a suitable number of synchronized cameras exactly along the line of desired navigation. The virtual camera flight is then created by displaying the original camera images consecutively, i.e., by pure switching from camera to camera without any virtual view generation. This approach requires accurate planning and a tremendous acquisition effort, and it is extremely difficult to change anything afterwards. Typically, if the result showing the virtual path is not satisfactory, a reshoot becomes necessary.

The figures above illustrate a model-based example of FVV [38]. Here a 3D object is reconstructed from multiple views and represented by its 3D geometry (mesh model) and associated appearance (multi-view video). The 3D video object is dynamic (moving and deforming over time) and provides the same functionality as conventional computer graphics models (free navigation, integration into scenes). We distinguish conventional computer graphics models from FVV in an application-specific way, as FVV uses natural camera views as the source for the scene (see Fig. 9.2). Such 3D geometry reconstruction includes a variety of advanced and potentially error-prone computer vision algorithms, such as camera calibration [39], segmentation [17], and shape-from-silhouette [38]. It conceptually remains an estimation that can theoretically only be solved up to a residual probability. However, by properly setting the environmental conditions, the residual probability of estimation errors can be reduced to make the approach practical. For instance, in some application
scenarios a blue-box studio environment may be used. Further, a priori knowledge about the scene content and the incorporation of corresponding models into the reconstruction process can significantly improve the results [40].

An alternative to classical 3D meshes for rendering is the use of 3D point clouds or video fragments [41, 42]. This representation uses unorganized point clouds in 3D, i.e., points with 3D coordinates but without connectivity. Additional attributes such as color or normal vectors are assigned to the points. Such a point cloud can be rendered for virtual viewpoints of the scene by projecting the points onto the screen (called splatting). The absence of connectivity is considered a big advantage over classical 3D meshes: for dynamic objects it is difficult to keep mesh connectivity constant over time, which is necessary for efficient representation and compression [38]. Some view 3D point clouds as a 'natural' extension of 2D video into 3D and consider them especially interesting for FVV. Compression of such data has been investigated in [35].

Other popular representation and rendering formats for FVV are based on per-sample depth information associated with the multiple views [43, 44]. Depth-based stereo rendering is described in more detail in the previous section. The described techniques for depth-based 3D rendering are easily extended to N views. Depending on the user position, a simple switching to the nearest original view with depth (or pair of views with disparity/depth) is possible. This extends the navigation range in front of the screen by the number of used camera views.

Finally, the functionality of FVV can also be realized with pure image-based representations. An example is the Free Viewpoint TV (FTV) system developed at the Tanimoto Lab at Nagoya University in Japan [7]. A scene is captured using a dense array of synchronized cameras. The camera signals are represented in a special format called 'Ray-Space' [8] that allows rendering the scene from any position (within practical limits) and does not rely on any geometry or depth reconstruction. Virtual intermediate views are generated from the available image data only. Therefore, rendering is based on signal processing rather than on computer graphics methods. With that, Ray-Space can be regarded as the natural extension of classical 2D video to a general multiple-view 3D scene representation and is a candidate for a generic video representation format in the future. Equivalently to Ray-Space, the term light-field is commonly used for static scenes in the computer graphics literature [9]. Compression of static light-field data has been studied for example in [27, 28]. In some systems, the multi-view video is directly processed by specific screens [45]. Depending on the user position, a different view is visible, providing the functionality of FVV. This is a purely image-based representation, which does not use intermediate view interpolation. Image-based methods require a dense sampling of the scene with many cameras. They are well suited for scenes for which geometry modeling approaches fail.
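To make the depth-based rendering idea of Sects. 9.1.2 and 9.1.3 concrete, the following is a much simplified sketch for a rectified camera setup, where each pixel is shifted horizontally by a disparity inversely proportional to its depth. Occlusion handling and hole filling, which are essential in practice, are omitted, and all parameter names are our own illustration:

```python
# Simplified depth-image-based rendering: for a rectified setup the
# disparity of a pixel is d = f * b / Z (focal length in pixels times
# baseline, divided by metric depth), and the virtual view is obtained
# by a naive horizontal forward warp of each pixel.
import numpy as np

def render_virtual_view(color, depth, focal_px, baseline_m):
    """color: (H, W, 3) image; depth: (H, W) metric depth per pixel.
    Returns the view of a virtual camera displaced by baseline_m."""
    H, W, _ = color.shape
    out = np.zeros_like(color)
    disparity = focal_px * baseline_m / depth   # in pixels
    for y in range(H):
        for x in range(W):
            xv = int(round(x - disparity[y, x]))
            if 0 <= xv < W:
                out[y, xv] = color[y, x]        # later writes may overwrite;
                                                # real renderers resolve this
                                                # by depth (z) ordering
    return out
```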
9.2 Stereo Video Compression

Stereo video is the most important special case of multi-view imagery, with N = 2 views. Compression of conventional stereo video has been studied for a long time, and corresponding standards are available. An overview is given in Sect. 9.2.1. Recently, a new standard has been released specifying a video plus depth format to enable efficient stereo with extended functionalities. This approach is outlined in Sect. 9.2.2.

9.2.1 Conventional Stereo Video

A conventional stereo pair consists of two images showing the same scene from two slightly different viewpoints, roughly corresponding to the distance between the human eyes. An example is shown in Fig. 9.4 [46]. Obviously both images are very similar, which makes them well suited for compression, e.g., with one image predicting the other. For instance, one of the two images can be encoded without reference to the other. Then, the second image can be predicted from the already encoded one, just as temporally related images are motion-compensated in video compression. The samples of both images correspond to each other through the 3D geometry of the scene and camera properties, including positions and internal
Fig. 9.4. A stereo image pair and associated disparity map (see [46])
camera properties such as the focal length. The displacement or disparity of each sample in one image with respect to the other, as illustrated in Fig. 9.4, is equivalent to a dense motion field between two consecutive images of a video sequence. Therefore, the same principles of motion estimation and motion compensation can be used for disparity estimation and disparity compensation in image prediction, with only the prediction error or residual encoded further. Nevertheless, some specific differences between motion compensation and disparity compensation need to be considered. The statistics of disparity vector fields are different from the statistics of motion vector fields. In the case of a parallel or rectified setup, the disparities are one-sided (biased) and can be relatively large: small disparity means large depth of the corresponding point in 3D, while 3D points close to the camera may have very large disparity values. (Note that black areas in Fig. 9.4 correspond to undefined disparity; the other values are true disparities, i.e. not the output of a stereo algorithm.) This may require adjustments to the entropy coding of the disparity vectors. In general, temporally adjacent images of a video sequence tend to be more similar than the views of a stereo pair at practical frame rates. Disocclusion effects, i.e., content that is visible in one image but occluded in the other and can therefore not be predicted, are on average more evident in a stereo pair than between two temporally adjacent video images. Further, specific differences in a stereo pair may come from incorrect white and color balance, but also from scene lighting and surface reflectance effects. The combination of inter-view and temporal prediction is the basic principle for efficient compression of conventional stereo video. A corresponding standard specification has already been defined in ITU-T Rec. H.262 | ISO/IEC 13818-2 MPEG-2 Video, the Multi-view Profile, as illustrated in Fig. 9.5 [47, 48]. In Fig. 9.5, the left eye view is encoded without reference to the right eye view, using standard MPEG-2. This ensures backward compatibility with the Main Profile of H.262 | MPEG-2 Video, since it is possible to decode the left eye view bitstream and to display 2D video. For the right eye view, inter-view prediction is allowed in addition to temporal prediction. However, the gain in compression efficiency compared to independent encoding of both video streams is rather limited. This is mainly due to the fact that temporal prediction already provides very good performance: typically, if temporal prediction is efficient for a certain image (e.g., for B pictures), additional inter-view prediction does not increase the coding performance significantly.
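To make the analogy between motion and disparity compensation concrete, the following is a minimal sketch (not part of the original text) of full-search disparity estimation and disparity-compensated prediction for a rectified pair. The block size, search range, and SAD matching criterion are illustrative assumptions, and the images are assumed to be grayscale NumPy arrays.

    import numpy as np

    def disparity_for_block(left, right, y, x, block=16, max_disp=64):
        """Full-search disparity estimation for one block of a rectified
        stereo pair. In a rectified setup the search is purely horizontal
        and one-sided, unlike the search for general motion vectors."""
        target = left[y:y + block, x:x + block].astype(np.int32)
        best_d, best_sad = 0, np.inf
        for d in range(0, min(max_disp, x) + 1):
            # Candidate block in the other view, shifted by disparity d.
            cand = right[y:y + block, x - d:x - d + block].astype(np.int32)
            sad = np.abs(target - cand).sum()  # sum of absolute differences
            if sad < best_sad:
                best_sad, best_d = sad, d
        return best_d

    def disparity_compensated_residual(left, right, y, x, d, block=16):
        """The encoder would transmit only this prediction residual."""
        target = left[y:y + block, x:x + block].astype(np.int32)
        pred = right[y:y + block, x - d:x - d + block].astype(np.int32)
        return target - pred

A real codec would of course combine this with temporal prediction and entropy-code the chosen disparity vectors, as discussed above.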
(Prediction structure of Fig. 9.5 — left view: I B B P B . . . ; right view: P B B B B . . . , with each right-view picture additionally predicted from the left view.)
Fig. 9.5. Illustration of prediction in H.262 | MPEG-2 Video Multi-view profile
Temporally neighboring images are on average more similar than spatially neighboring images (see also the evaluation in Sect. 9.3.2). For images that are coded as I pictures, i.e., without reference to other temporally adjacent images in the video sequence, a significant gain can be achieved by inter-view prediction. Typically, such I pictures are inserted into a video stream every 0.5–1 seconds to allow for random access and error robustness. In Fig. 9.5, the first picture of the left view is encoded as an I picture. With independent encoding of both video streams, the corresponding picture of the right view would also be encoded as an I picture for random access. But in H.262 | MPEG-2 Video multi-view coding, inter-view prediction can be applied, resulting in a significant increase of compression efficiency compared to coding this picture as an I picture. Research on compression of conventional stereo video has continued in several directions, including, for instance, optimum joint bit allocation for both channels, or abandoning backward compatibility to design more efficient inter-view prediction structures. Algorithms have been based on more up-to-date video codecs such as H.263 [49], MPEG-4 Visual [50] or H.264/AVC [51, 52, 53]. Knowledge about the human visual system and stereo perception has been incorporated into compression strategies. However, none of these developments, including the original Multi-view Profile, has reached commercial relevance so far, since stereo video has not yet developed into a relevant mass market.

9.2.2 Video Plus Depth Format

Section 9.1.2 describes the video plus depth format as an alternative to conventional stereo video, where the stereo pair is generated by view interpolation from one video and depth data associated with each sample. This format is – besides the advantages over conventional stereo video described in Sect. 9.1.2 – especially interesting from a compression efficiency point of view. Per-sample depth data can be regarded as a monochromatic, luminance-only video signal. This is illustrated in Fig. 9.6, which shows an image with an associated per-sample depth map. The depth range is restricted to a range between two extremes Znear and Zfar, indicating the minimum and maximum distance of the corresponding 3D point from the camera, respectively. The depth range is linearly quantized with 8 bits, i.e., the closest point is associated with the value 255 and the most distant point is associated with the value 0. With that, the depth map in the middle of Fig. 9.6 is specified, resulting in a grey-scale image. These grey-scale images can be fed into the luminance channel of a video signal and the chrominance can be set to a constant value. The resulting standard video signal can then be processed by any state-of-the-art video codec. Compression of video plus depth data has been investigated in the European ATTEST project [36]. Several state-of-the-art video codecs have been tested (MPEG-2, MPEG-4, H.264/AVC).
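The linear 8-bit depth quantization just described can be written compactly. Below is a minimal sketch (not from the text), with metric depth values given as a NumPy array:

    import numpy as np

    def quantize_depth(z, z_near, z_far):
        """Linear 8-bit quantization of metric depth: the closest point
        (z = z_near) maps to 255, the most distant (z = z_far) to 0."""
        z = np.clip(z, z_near, z_far)
        return np.round(255.0 * (z_far - z) / (z_far - z_near)).astype(np.uint8)

    def dequantize_depth(v, z_near, z_far):
        """Inverse mapping from 8-bit depth-map values back to metric depth."""
        return z_far - (v.astype(np.float64) / 255.0) * (z_far - z_near)

The resulting grey-scale frames would then be fed into the luminance channel of a standard video signal, with the chrominance set to a constant, exactly as described above.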
Fig. 9.6. 3D data representation format consisting of regular 2D color video in European digital TV format and accompanying 8-bit depth-images with the same spatio-temporal resolution
A general conclusion from these experiments was that depth data can be compressed very efficiently. As a rough number, 10–20% of the bit rate necessary to encode the color video is sufficient to encode the depth at good quality. This is due to the specific statistics of depth data, which are on average smoother and less structured than color data. The quantitative results were also confirmed by means of subjective testing. In the trials, 12 non-expert viewers were presented with virtual stereoscopic image material that had been synthesized from encoded/decoded depth information in which the distortion of the depth images was varied [54]. Based on a subjective comparison with 3D imagery generated from the original, unimpaired depth data (‘best quality’ reference), the participants rated the stimuli in terms of: 1) perceived image impairments; 2) depth quality. Additionally, the subjects were asked to verbally describe perceived image distortions and 3D artifacts as well as their viewing experiences. The rating results show that virtual stereoscopic images of acceptable quality can be generated from very low bit rate depth information. In fact, it was found that even rather severe depth-image coding distortions, such as visible blocking artifacts, do not translate into equally strong perceptible impairments in the synthesized views. However, at extremely low depth-image qualities, synthesis distortions such as blocking artifacts, jagged object contours and depth layering (cardboarding) appeared. Based on these observations, a new approach for 3DTV, backward compatible with classical DVB, was developed within the ATTEST project. Figure 9.7 illustrates the concept. It uses a layered bitstream syntax. The base layer is a conventional 2D color video encoded using MPEG-2. This base layer can be processed by any existing MPEG-2 decoder, providing backward compatibility. Additionally, the bitstream contains an advanced layer carrying the encoded depth information. Advanced systems may access this layer to decode the
Fig. 9.7. Layered bitstream format for video plus depth data, providing backward compatibility (from [55])
depth stream and then generate a stereo pair to be displayed in stereo by view interpolation. This concept is highly interesting due to its backward compatibility, compression efficiency and extended functionality, as described in Sect. 9.1.2. Moreover, it does not introduce any specific coding algorithms. It is only necessary to specify high-level syntax that allows a decoder to interpret two incoming video streams correctly as color and depth. Additionally, information about the depth range (Znear and Zfar) needs to be transmitted. Therefore, MPEG specified a corresponding container format, “ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information”, also known as MPEG-C Part 3, for video plus depth data [56] in early 2007. Transport is defined in a separate MPEG Systems specification, “ISO/IEC 13818-1:2003 Carriage of Auxiliary Data” [57]. This standard already enables 3DTV based on video plus depth. Moreover, H.264/AVC [51, 52] contains an option to convey the depth images through its auxiliary picture syntax. Here, the video codec for both the color video signal and the associated depth video signal is H.264/AVC. This approach is backward compatible with any existing deployment of H.264/AVC.
9.3 Multi-view Video Coding

A common element of the 3DV and FVV systems described above is the use of multiple views of the same scene that have to be transmitted to the user. The straightforward solution would be to encode all the video signals independently using a state-of-the-art video codec such as H.264/AVC [51, 52].
However, as in the stereo case described in Sect. 9.2.1, multi-view video contains a large amount of inter-view statistical dependencies, since all cameras capture the same scene from different viewpoints. These can be exploited for combined temporal/inter-view prediction, where images are predicted not only from temporally neighboring images but also from corresponding images in adjacent views. Investigations in MPEG have shown that such specific multi-view video coding (MVC) algorithms give significantly better results than independent encoding (“Call for Evidence” [58, 59, 60]): improvements of more than 2 dB were reported for the same bit rate. Since a “Call for Comments” [61] further showed large interest from industry in the systems and applications described above, MPEG decided to issue a “Call for Proposals” [62] for MVC technology along with related requirements [63]. The responses to the “Call for Proposals” were evaluated in January 2006. All submitted proposals were extensions of H.264/AVC. Therefore MPEG decided to make MVC an amendment (Amendment 4) to H.264/AVC, scheduled to be finalized in early 2008. MVC is described in the following sections. First, requirements, test data, and test conditions are described in Sect. 9.3.1. Section 9.3.2 investigates temporal versus inter-view correlation in more detail. MVC prediction structures and experimental results are presented in Sects. 9.3.3 and 9.3.4. Finally, Sect. 9.3.5 gives conclusions and outlines further research.

9.3.1 Requirements, Test Data, Test Conditions, and Evaluation

The overall structure of MVC, defining the interfaces, is illustrated in Fig. 9.8. The encoder receives N temporally synchronized video streams and generates one bitstream. The decoder receives the bitstream, decodes and outputs the N video signals.

MVC Requirements

The central requirement for any video coding standard is high compression efficiency. In the specific case of MVC this means a significant gain compared to independent compression of the video signals. Compression efficiency measures the trade-off between costs (in terms of bit rate) and benefits (in terms of video quality), i.e. the quality at a certain bit rate or the bit rate at a certain quality.
(N video signals → encoder → single bitstream → decoder → N video signals)
Fig. 9.8. Basic MVC structure defining interfaces
However, compression efficiency is not the only requirement for a video coding standard. Some requirements may even be contradictory, such as compression efficiency and low delay in some cases; then a good trade-off has to be found. In some cases the mechanism of defining so-called Profiles of a standard may help to satisfy specific application requirements. General requirements for video coding, such as minimum resource consumption (memory, processing power), low delay, error robustness, or support of different pixel and color resolutions, apply to all video coding standards. Some requirements are specific to MVC, as highlighted in the following. Temporal random access is a requirement for any video codec; for MVC, view random access becomes important as well. Both together ensure that any image can be accessed and, for instance, displayed. Random access can be provided by the insertion of I pictures that do not use any prediction from other pictures, as mentioned above. Scalability is a desirable feature for some video coding standards. This means that a decoder can access a portion of a bitstream in order to generate a low-quality video output, which may have reduced temporal or spatial resolution, or reduced video quality (signal-to-noise ratio, SNR). For MVC, view scalability is additionally required: a portion of the bitstream can be accessed in order to output a limited number of the N views. Backward compatibility is also required for MVC: a bitstream corresponding to one view, extracted from the MVC bitstream, shall conform to H.264/AVC. Quality consistency among views is also addressed; it should be possible to adjust the encoding, for instance, to provide approximately constant quality over all views. Parallel processing is required to enable efficient encoder implementation and resource management. Camera parameters (extrinsic and intrinsic) should be transmitted with the bitstream in order to support intermediate view interpolation at the decoder.

MVC Test Data and Test Conditions

The proper selection of test data and test conditions is crucial for the development of a video coding standard. The test data set must be representative of the envisaged area of applications, and therefore cover a wide range of different content properties. For MVC, 8 different multi-view test data sets have been used, with 5 to 16 camera views, including linear, arc, and array arrangements. Pixel resolutions are 640 × 480 and 1024 × 768, and frame rates are 15 frames/s, 25 frames/s and 30 frames/s. The applications target high-quality TV-type video rather than limited-channel communication-type video; therefore smaller resolutions like CIF or QCIF are not considered. The MVC test data set covers a wide range of different content types, indoor and outdoor scenes, fixed and moving camera systems, and different complexities of motion and spatial detail. In order to perform comparative evaluations, the test conditions also have to be fixed. For each test data set, three bit rates have been chosen
corresponding to low but acceptable, medium and high quality; exhaustive experimentation was necessary to find the proper settings. Depending on the properties and content of each particular test data set, a different bit rate per view was specified. Using the same bit rates allows different approaches to be compared in subjective tests, as described below. The main goal of MVC is to provide significantly increased compression efficiency compared to encoding all video signals individually. Therefore, independent encoding of all views using H.264/AVC was taken as the reference for coding performance comparisons. All test data sets were encoded this way using the same test conditions. The resulting decoded video signals (anchors) served as reference for objective and subjective comparison. Encoding was done using typical settings and parameters; however, hierarchical B pictures were not applied (see below), meaning that a sub-optimal H.264/AVC configuration was used, namely an IBBP. . . picture coding structure.

MVC Evaluation

Evaluation of video coding algorithms can in general be done using objective and subjective measures. The most widely used objective measure is the peak-signal-to-noise ratio (PSNR) of the luma signal, which is given as PSNR-Y = 10 log10(255²/MSE), with MSE being the mean squared error between the original and decoded video samples. Typically, PSNR values are plotted over bit rate and then allow comparison of the compression efficiency of different algorithms (e.g. anchor encoding vs. a proposed MVC scheme). This can be done in the same way for MVC. However, PSNR values do not always capture video quality as perceived by humans. Some types of distortions that result in low PSNR values do not affect human perception in the same way; one example is a shift of the picture by one sample sideways. Therefore, any video coding algorithm can finally only be judged in subjective evaluations. The formal MVC tests were conducted by MPEG using a Single Stimulus Impairment Scale (SSIS) test. This method has proven to deliver reliable results when used for evaluation of the visual quality of video codecs and is specified in ITU-R Rec. BT.500-11 [64]. In this subjective test, subjects are shown the decoded video signal from a candidate codec. The subjects judge the quality of the decoded video on a scale from bad to excellent. The votes of the subjects are statistically analyzed to quantify subjective quality. For statistical confidence, a large number of subjects needs to be involved. Display conditions, room conditions (including lighting and viewing distance), and the execution of test sessions (order of presented videos, display time, etc.) require careful design. Consequently, such formal subjective tests require a tremendous effort.
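As a small illustration, the PSNR-Y measure defined above can be computed as follows (a minimal sketch assuming 8-bit luma frames given as NumPy arrays):

    import numpy as np

    def psnr_y(original, decoded):
        """PSNR-Y = 10 log10(255^2 / MSE) for 8-bit luma frames."""
        diff = original.astype(np.float64) - decoded.astype(np.float64)
        mse = np.mean(diff ** 2)
        if mse == 0:
            return float('inf')  # identical signals
        return 10.0 * np.log10(255.0 ** 2 / mse)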
9.3.2 Temporal/Inter-view Correlation

The key to efficient MVC is inter-view prediction in addition to temporal prediction. Therefore, this section investigates the relationship between inter-view and temporal statistical dependencies, in order to evaluate whether inter-view prediction is useful and in which cases a gain can be expected. For the case of linear camera settings, the inter-view/temporal first-order neighbors are shown in Fig. 9.9. With the exception of the left- and rightmost cameras, each picture of the multi-view sequence has 8 inter-view/temporal neighbors. The purpose of the analysis is to determine how often a rate-distortion optimized encoder such as H.264/AVC would choose each of these modes if all of them are available. For the statistical analysis, the multi-view video sequences were combined into a single uncompressed video stream as illustrated in Fig. 9.12 and explained in Sect. 9.3.3. The resulting uncompressed video stream was fed into a standard H.264/AVC encoder restricted to using P and I pictures only. By proper manipulation of the reference picture management, the encoder could choose between all the prediction modes shown in Fig. 9.9. The results are shown in the bar graphs of Fig. 9.10 for the two data sets Ballroom and Race1, by means of the likelihood of prediction mode selection. Here, the prediction mode with the lowest Lagrangian cost value is chosen, using Lagrangian motion estimation as described in [65]. Lagrangian motion estimation determines the motion vector $\mathbf{m}_i$ for block $S_i$ by

$$\mathbf{m}_i = \arg\min_{\mathbf{m} \in M} \left\{ D_{\mathrm{DFD}}(S_i, \mathbf{m}) + \lambda_{\mathrm{MOTION}} \, R_{\mathrm{MOTION}}(S_i, \mathbf{m}) \right\},$$

where $M$ is the set of possible motion/disparity vectors and the distortion term is given by

$$D_{\mathrm{DFD}}(S_i, \mathbf{m}) = \sum_{(x,y) \in A_i} \left| s[x, y, t] - s'[x - m_x, \, y - m_y, \, t - m_t] \right|^p,$$

with $A_i$ the set of sample positions of block $S_i$, $p = 2$ for the sum of squared errors, and $s[\,]$ being the current and $s'[\,]$ a previously decoded picture that is referenced using the picture reference index $m_t$. $R_{\mathrm{MOTION}}(S_i, \mathbf{m})$ is the number of bits needed to transmit all components of the motion vector $(m_x, m_y, m_t)$. The size of the blocks $S_i$ in the experiment was 16 × 16, and the Lagrange parameter $\lambda_{\mathrm{MOTION}}$ was chosen to be 29.5. The search range covers ±32 integer pixel positions horizontally and vertically.
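The mode decision behind this analysis can be sketched as follows. This is an illustrative simplification, not the encoder's actual implementation: each candidate pairs a reference block, drawn from one of the eight temporal/inter-view neighbors of Fig. 9.9 within the search range, with an assumed bit cost for its vector (m_x, m_y, m_t).

    import numpy as np

    def lagrangian_best_candidate(block, candidates, lam=29.5):
        """Pick the prediction minimizing J = D_DFD + lambda * R_MOTION.
        candidates: list of (reference_block, rate_bits) pairs, where
        rate_bits approximates the cost of coding (m_x, m_y, m_t)."""
        best_idx, best_cost = None, np.inf
        for idx, (ref, rate_bits) in enumerate(candidates):
            # Distortion with p = 2: sum of squared errors over the block.
            d = np.sum((block.astype(np.int64) - ref.astype(np.int64)) ** 2)
            cost = d + lam * rate_bits
            if cost < best_cost:
                best_cost, best_idx = cost, idx
        return best_idx, best_cost

Counting, over all blocks, how often the winning candidate comes from a temporal versus an inter-view reference yields statistics of the kind plotted in Fig. 9.10.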
(3 × 3 grid of pictures over cameras C_{n−1}, C_n, C_{n+1} and time instants t_{n−1}, t_n, t_{n+1}: current picture P at (C_n, t_n); temporal neighbors T−, T+; inter-view neighbors L, R; combined neighbors T−L, T+L, T−R, T+R.)
Fig. 9.9. Prediction modes for first-order neighbor images
(Bar graphs: likelihood in % of selecting each prediction reference — t(n−1), C(n−1), t(n), C(n), C(n+1), t(n+1) — for the two sequences.)
Fig. 9.10. Probability of choice of prediction mode when minimizing a Lagrangian cost function in motion/disparity estimation for sequences Ballroom (left) and Race1 (right)
The first conclusion drawn from the analysis over a larger set of multi-view sequences is that temporal prediction is always the most efficient prediction mode. However, there are significant differences between the test data sets regarding the relationship between temporal and inter-view prediction. There is a connection to the spatio-temporal density of the multi-view data: inter-view prediction is used more often for low frame rates and very close cameras, which is intuitively understandable. Further, there is a connection to the scene complexity: inter-view prediction is used more often for scenes with rapidly moving objects, and less for scenes with large areas covered by static background. For more details about the statistical analysis, please refer to [66]. In any case, inter-view prediction is highly efficient for those images that would be encoded as I pictures in independent encoding, e.g. to provide random access. Here, I pictures can be replaced by inter-view predicted pictures, which increases compression efficiency significantly. A similar analysis coming to the same conclusions can be found in [67].

9.3.3 MVC Prediction Structures

Several research groups have addressed MVC and developed dedicated inter-view/temporal prediction structures to efficiently exploit all statistical dependencies within the multi-view video data sets (e.g. [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81]). Figure 9.11 shows a structure developed by Fraunhofer HHI for the case of a 1D camera arrangement (linear or arc), which was proposed to MPEG in response to the Call for Proposals. This scheme uses the prediction structure of hierarchical B pictures for each view in temporal direction.
(Picture types over views C0–C7 and time instants t0–t8 — C0: I B B B B B B B I; C2, C4, C6 and C7: P B B B B B B B P; C1, C3, C5: all B pictures, inter-view predicted from the neighboring views.)
Fig. 9.11. Inter-view/temporal prediction structure based on H.264/MPEG4-AVC hierarchical B pictures
Hierarchical B pictures provide significantly improved compression performance when the quantization parameters for the various pictures are assigned appropriately [82]. Additionally, inter-view prediction is applied to every second view, i.e. C1, C3 and C5. For an even number of views, the last view (C7) is coded as shown, starting with a P picture followed by hierarchical B pictures, which are also inter-view predicted from the previous view. Thus, the coding scheme can be applied to any multi-view setting with more than 2 views. To allow random access in the applications envisioned for MVC, I pictures are inserted (C0/t0, C0/t8, etc. in Fig. 9.11). The inter-view/temporal prediction structure in Fig. 9.11 applies hierarchical B pictures in both the temporal and the inter-view direction. This can be realized with a standard-conforming H.264/AVC encoder using its multiple reference picture syntax [83]. For this, the multi-view video sequences are combined into one single uncompressed video stream, as illustrated in Fig. 9.12, using a specific scan. This uncompressed video stream can be fed into standard encoder software, and the inter-view/temporal prediction structure of Fig. 9.11 can be realized by appropriate setting of the hierarchical B picture prediction scheme. This is a pure encoder optimization; the resulting bitstream is thus standard-conforming and can be decoded by any standard H.264/AVC decoder. The only changes to the H.264/AVC encoder are an increased Decoded Picture Buffer (DPB) size, to store all images needed in the proposed scheme, and a potentially larger number of output pictures per second than is currently allowed in H.264/AVC (the currently allowed maximum frame rate is 172 frames per second).
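For illustration, the coding order of one GOP of dyadic hierarchical B pictures can be generated recursively. This is a minimal sketch of the dyadic case only; the temporal-level-dependent quantization of [82], which is what yields the reported gains, is omitted.

    def hierarchical_b_coding_order(first, last):
        """Coding order for a GOP with dyadic hierarchical B pictures:
        the two key pictures first, then the middle picture of each
        interval, recursively. For first=0, last=8 this yields
        [0, 8, 4, 2, 1, 3, 6, 5, 7]."""
        def middles(lo, hi):
            if hi - lo < 2:
                return []
            mid = (lo + hi) // 2
            return [mid] + middles(lo, mid) + middles(mid, hi)
        return [first, last] + middles(first, last)

    print(hierarchical_b_coding_order(0, 8))  # [0, 8, 4, 2, 1, 3, 6, 5, 7]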
Fig. 9.12. Reordering and interleaving of multi-view input for compression with H.264/MPEG4-AVC
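A minimal sketch of such a reordering and its inversion is given below. The GOP-wise, view-by-view scan used here is one plausible choice; the exact scan of Fig. 9.12 is an assumption, since it is not spelled out in the text.

    def interleave_views(views, gop=8):
        """Combine N synchronized view sequences into one sequence for a
        standard H.264/AVC encoder. views[v][t] is the frame of view v at
        time t; frames are emitted GOP by GOP, view by view."""
        num_frames = len(views[0])
        out = []
        for start in range(0, num_frames, gop):
            for seq in views:
                out.extend(seq[start:start + gop])
        return out

    def deinterleave_views(frames, num_views, num_frames, gop=8):
        """Invert the reordering on the decoder side."""
        views = [[None] * num_frames for _ in range(num_views)]
        it = iter(frames)
        for start in range(0, num_frames, gop):
            for v in range(num_views):
                for t in range(start, min(start + gop, num_frames)):
                    views[v][t] = next(it)
        return views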
The presented MVC scheme requires only some high-level syntax specification that signals to the decoder “this is a multi-view sequence with N views”. Moreover, it requires a new profile and level of H.264/AVC allowing a larger DPB size and more pictures to be output per time interval. The decoder can then set the Decoded Picture Buffer size appropriately, decode the bitstream with existing tools, and invert the reordering of Fig. 9.12. Note that the base view C0 is not encoded with reference to any other view; this portion of the resulting bitstream thus conforms to the non-MVC H.264/AVC profile used, providing backward compatibility. The example above is for a Group of Pictures (GOP) length of 8, meaning that every 8th picture of the base view C0 is an I picture to allow random access. However, the syntax of hierarchical B pictures is very flexible, and multi-view GOPs of any length can be specified. Figure 9.13 illustrates possible settings for GOP lengths of 12 and 15. Other types of camera arrangements can be handled efficiently as well: Fig. 9.14 illustrates possible settings for a 5 × 3 camera array and a star setting with 5 cameras. Example results using this approach are shown in Figs. 9.15 and 9.16, plotting the PSNR-Y over bit rate, averaged over all views of a data set. Anchor coding results are represented by the dashed curve (Anchor). The results produced by the presented MVC scheme, utilizing hierarchical B pictures in inter-view and temporal direction, are shown by the solid curve (MVC). Independent encoding of each view, but with hierarchical B pictures in temporal direction, is represented by the dotted curve (Simulcast).
Fig. 9.13. Flexible length of multi-view GOPs (top: 15 pictures, bottom: 12 pictures)
The presented MVC scheme outperforms the anchor results significantly, by about 2 dB at all bit rates (note that the curves are interpolated from 3 specified rate points). However, a good portion of the gain already comes from the hierarchical B pictures in temporal direction (about half of it, averaged over all results). Nevertheless, the results prove that specific MVC algorithms, namely B pictures in inter-view direction exploiting inter-view statistical dependencies, significantly improve compression performance. As mentioned before, 3 rate points were determined in the test conditions for each multi-view test data set. For each rate point, a gain can be computed as the difference between the PSNR-Y of MVC and of the anchor encoding. For each test data set, an average gain can be computed as the mean of the gains at the 3 bit rates. Figure 9.17 depicts these average gains. Depending on the specific sequence, coding improvements of up to 3.2 dB are obtained. The gain strongly depends on the original setting of the multi-camera arrangement, namely the temporal and inter-view correlation. However, a good portion of the coding gain is already provided by using hierarchical B pictures in the time dimension only. For the Uli test data set, almost no gain has been achieved. Here the inter-view statistical dependencies are limited, or rather, the encoder is not able to exploit them. In this case the content (a person) is relatively close to the camera, resulting in large disparity.
Fig. 9.14. Possible prediction structures for array and star camera arrangements
Fig. 9.15. PSNR results for Ballroom data set
Although the cameras are quite close together (20 cm, as in almost all settings; only Rena and Akko & Kayo use closer spacings), the content in adjacent views is quite different. This results in a large disparity (the distance between the projections of the same 3D point into 2 different camera images). In the presented scheme, such disparities are basically estimated/compensated by the encoder on a block basis, in the same way as the motion estimation/compensation works.
Fig. 9.16. PSNR results for Race1 data set
Fig. 9.17. Average PSNR gains for all test data sets
In fact, the encoder does not know whether it is performing disparity or motion estimation/compensation; it simply operates on pictures in the reference buffer that are appropriately provided by the control unit. Therefore, such large disparities might not always be covered by the search range, or they result in extreme disparity/motion vectors that cost a large number of bits. Medium average gains of around 0.5 dB have been achieved for the Exit and Breakdancers data sets. For the other 5 data sets, significant gains of about 2 dB and more on average have been achieved. The quality distribution among the views is sequence-dependent. For an equal QP setting across all views, sequences with larger camera distance and higher scene complexity show larger deviations, e.g. the Race1 sequence, while sequences like Ballroom with very small camera distance show only small deviations, due to more similar content across all views. Normally, PSNR already provides a very good indication of the performance of a compression method. However, the final judgment can only be made by subjective tests, where humans evaluate visual quality. For this purpose, MPEG conducted exhaustive formal subjective tests to evaluate the performance of the responses to the Call for Proposals. In this case, the tests were done at the Technical University of Munich. From each of the 8 test data sets, 2 views were selected randomly for subjective evaluation at all 3 specified rate points. A method called Single Stimulus Multi Media (SSMM) was selected, which is a modified version of the Single Stimulus Impairment Scale (SSIS) and has proven efficiency and reliability in prior tests conducted by MPEG. In a Single Stimulus test, the test subjects have to rate the coded video in the absence of an unimpaired reference. To minimize the influence of the videos shown previously, each test
Fig. 9.18. Subjective results for Ballroom data set
case was shown twice. Showing all test cases twice also helped to verify that all subjects (20 “naïve viewers”) were able to reproduce their votes. A modified mean opinion score (MOS) with values from 0–10 was used to capture all the votes. Basically, the subjects judged the quality of each example individually by giving marks; the results were then evaluated statistically. Figures 9.18 and 9.19 show the MOS results of the presented MVC scheme in comparison to the MPEG anchors for 2 examples. These are the MOS values for the 2 randomly selected views at each bit rate. Note that different bit rates were selected for the different data sets. Obviously, the presented MVC scheme outperforms the anchors significantly in terms of subjective quality. Figure 9.20 compares the average MOS values over all test data sets and selected views, averaged for the low, mid and high bit rates separately.
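The statistical evaluation of the votes can be sketched as follows. This is an illustrative computation of the mean opinion score with a 95% confidence interval only; BT.500-11 prescribes the exact screening and analysis procedure actually used.

    import numpy as np

    def mos_with_ci(votes):
        """Mean opinion score and 95% confidence interval for one test
        case, given the individual subject votes on the 0-10 scale."""
        votes = np.asarray(votes, dtype=np.float64)
        mos = votes.mean()
        # Normal-approximation interval over the subject votes.
        half = 1.96 * votes.std(ddof=1) / np.sqrt(len(votes))
        return mos, (mos - half, mos + half)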
Fig. 9.19. Subjective results for Race1 data set
Fig. 9.20. Average subjective results over all data sets
Note that these bit rates were different for all data sets. Overall, the presented MVC proposal outperforms the MPEG anchors significantly; the gain decreases slightly at higher bit rates.

9.3.4 Simplified Prediction Structures

The presented approach is highly flexible and allows for a great variety of prediction structures, which can be designed for a given application. The example in Fig. 9.11 is quite complex, introducing many dependencies between images and views. This affects computational complexity and memory requirements, which may be a drawback for some applications. However, the flexible syntax allows for simpler structures obtained by omitting dependencies. The left part of Fig. 9.21 illustrates independent encoding of all views, but with hierarchical B pictures in the temporal dimension. This simulcast approach was already used in the experiments described above. The structure on the right of Fig. 9.21 extends this by applying inter-view prediction for the key pictures only. Key pictures are those at the beginning and the end of a GOP that are normally coded as I pictures. Here, inter-view prediction of key pictures is restricted to P pictures only; this structure is therefore referred to as KS_IPP.
(Picture types over views C0–C7 and time instants t0–t8 — left: every view coded as I B B B B B B B I; right, KS_IPP: C0 as I B B B B B B B I, C1–C7 as P B B B B B B B P with the key P pictures inter-view predicted.)
Fig. 9.21. Simulcast using hierarchical B pictures in temporal dimension only (left) and KS_IPP structure (right)
The evaluation in Sect. 9.3.2 has shown that most of the gain in MVC can be expected from replacing I pictures by predicted pictures. This is the only change of KS_IPP compared to simulcast in Fig. 9.21; thus, inter-view prediction introduces only minimal additional complexity. The base view with I pictures does not necessarily need to be the first view C0; it may be placed arbitrarily, as in Fig. 9.22 left. This structure is referred to as KS_PIP. Placing the I picture in the middle may increase compression efficiency. Finally, key pictures may also be encoded in B mode, as illustrated in Fig. 9.22 right. This structure is referred to as KS_IBP. It is even possible to extend the concept of hierarchical B pictures over the key pictures, by introducing higher levels. The alternative structures in Figs. 9.21 and 9.22 can easily be implemented using the same concept of interleaving (Fig. 9.12) and appropriate encoder control of the hierarchical B pictures, as explained above. Example results are shown in Figs. 9.23 and 9.24. For the Ballroom data set, the full MVC structure from Fig. 9.11 provides the best results. However, the performance of the simplified structures is very close, while their computational complexity is roughly 30% of that of full MVC. This is a big advantage for many applications. The performance of KS_IPP and KS_PIP is practically identical for both test data sets; the curves therefore cannot be distinguished in the figures. There is practically no gain from placing the base view in the middle. KS_IBP always performs slightly worse than full MVC. In the case of Race1, both KS_IPP and KS_PIP perform better than full MVC and KS_IBP. Apparently, using B picture coding for key pictures may have a negative effect on the overall performance. Figure 9.25 compares the average gains relative to anchor encoding for all test data sets and prediction structures. Full MVC performs best for 5 out of 8 test data sets. For the others, KS_IPP and KS_PIP perform slightly better. For the Uli test data set, even simulcast performs better than full MVC. In this case, using B pictures with reduced QP for key picture encoding results in decreased overall performance; the lower quality of key pictures encoded in B mode propagates over time. In general, the performance strongly depends on the test data set.
(Left, KS_PIP: as KS_IPP, but with the I-picture base view placed within the camera array rather than at C0. Right, KS_IBP: key pictures of every second view coded as B pictures, inter-view predicted from both neighboring views.)
Fig. 9.22. KS_PIP structure (left) and KS_IBP structure (right)
Fig. 9.23. Comparison of different prediction structures for Ballroom data set
A good portion of the gain already comes from applying hierarchical B pictures in the temporal dimension, as shown by the simulcast solution.

9.3.5 Conclusions and Further Research

The presented MVC scheme outperformed the other approaches, as proven by MPEG in objective and subjective evaluation. It performed best among all responses submitted to the MPEG Call for Proposals and was therefore chosen as the starting point for the further development of the MVC standard [84].
Fig. 9.24. Comparison of different prediction structures for Race1 data set
Fig. 9.25. Average gains for all test data sets and prediction structures
Other proposals performed close to it but required changes to the H.264/AVC syntax and en-/decoding. However, ideas from these proposals and later inputs are being further investigated and may be included in the final standard. The problem of large disparities could be solved by depth-based view interpolation prediction, as presented in [68, 69, 77]. The idea is to estimate depth either at the encoder (which requires overhead for sending the depth) or at the decoder, and to perform view interpolation or 3D warping for prediction. The principle is illustrated in Fig. 9.26. Assume that color and depth data are already encoded for camera 1 and camera 3. Then it is possible to synthesize an intermediate image corresponding to the in-between camera position 2 from these data by 3D warping, as explained before. Such an interpolated view might not be perfect for the whole image in terms of picture quality, but it might provide a useful additional source for prediction of the corresponding image to be encoded (camera 2 original) with significantly reduced disparity (ideally without any). Such algorithms have been tested exhaustively within core experiments conducted by MPEG and the JVT. However, the gains reported so far are marginal: only for very few test data sets with very close camera settings does such view interpolation prediction provide a gain, of up to 5% bit rate saving at the same visual quality. Further investigations are needed to optimize the performance; the core experiments in the JVT on view interpolation prediction are ongoing. Further basic problems for MVC are illumination and color inconsistencies, which also affect the exploitation of inter-view statistical dependencies. Usually such effects should be minimized by proper setting of the conditions;
Fig. 9.26. View interpolation prediction
however, an MVC algorithm should be able to cope with them as well, since proper white and color balancing of the input cannot be guaranteed. Also, the illumination (spotlights, shadows, etc.) varies considerably over the multi-view images due to the lighting conditions. These problems might be handled by proper illumination and color compensation, as proposed in [70, 71]. The basic idea is to modify the motion compensation at the macroblock level: before the sample values of the block to be encoded and the reference block are subtracted, the mean of each block is removed from the corresponding sample values. This assumes locally constant illumination and color variations, an appropriate model trading off accuracy and complexity. The algorithms have been tested in exhaustive core experiments conducted by MPEG and the JVT. Gains of up to 0.7 dB have been reported for some test data using illumination compensation, in comparison to MVC as described in Sect. 9.3.3. However, this strongly depends on the test data, and in some cases the gain is negligible or there is none at all. On average over all test data sets, a bit rate reduction of 5% was reported for the same visual quality. Therefore, illumination compensation has been adopted into the Joint Multi-view Video Model (JMVM) [85] as a technology under consideration. As an alternative to illumination compensation at the macroblock level integrated into the encoding process, appropriate pre-processing can be applied prior to encoding. Algorithms for illumination correction are well known in image and video processing; the corrected data can then be passed to a standard encoder. The big advantage of such an approach is that no changes to the encoder, decoder or bitstream syntax are necessary. A preliminary investigation in this direction is presented in [86]; however, its results are not complete, and its performance in comparison to integrated illumination compensation is not yet clear.
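A minimal sketch of this mean-compensated block matching follows; it is illustrative only, not the exact scheme of [70, 71].

    import numpy as np

    def mean_compensated_sad(block, ref_block):
        """Matching cost with illumination compensation: the mean of each
        block is removed before the difference is formed, modeling a
        locally constant illumination/color offset between the views."""
        b = block.astype(np.float64)
        r = ref_block.astype(np.float64)
        # The offset b.mean() - r.mean() would be sent as side information.
        return np.abs((b - b.mean()) - (r - r.mean())).sum()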
Another research direction is improving disparity estimation, compensation, and coding [87]. In the first design, disparity is treated identically to motion; however, as explained in Sect. 9.2.1, the statistical properties of disparity vectors can be quite different from those of motion vectors. Disparity estimation has been studied extensively in the computer vision literature. Usually, basic geometric properties and constraints are taken into account; for instance, the search can be done along epipolar lines. This may lead to better estimates and reduce the complexity. Further, specific disparity coding may improve the efficiency of inter-view prediction. Finally, specific coding modes for MVC, such as the inter-view direct mode [88], are under investigation. So far the reported results are not mature enough, and therefore the core experiments in the JVT on these issues are ongoing. High-level syntax is necessary to signal the properties of an MVC bitstream to a decoder, as explained before. Additional high-level syntax is under development to enable efficient random access, buffer management, and parallel processing. Regarding the specific MVC compression algorithms described in this section, the decisions about adoption into the standard still need to be taken. So far the benefit of each of them is rather limited; only illumination compensation is under consideration in the JMVM, as mentioned above. However, as decided by the JVT, adoption of any of these algorithms into the final standard will only happen as part of a complete package of algorithms that in sum provides a significant gain compared to MVC as described in Sect. 9.3.3. A rough measure for a significant gain is 30% bit rate savings at the same visual quality, which is far from being reached yet. Otherwise it would not be justified to change the available H.264/AVC core codec design, since MVC as in Sect. 9.3.3 is fully compatible. Thus, the MVC amendment to H.264/AVC, scheduled to be released in early 2008, may very well contain only new high-level syntax, at least in its first version. Outside of standardization, further directions in research on MVC are pursued. One important line is to combine scalability with MVC [89, 90, 91, 92, 93, 94]; so far the feature of scalability only comes with decreased compression efficiency. Distributed MVC is investigated, for instance, in [95]. Also, first work on efficient transport and delivery taking user interactivity into account has been presented [96, 97].
Acknowledgements

We would like to thank the Interactive Visual Media Group of Microsoft Research for providing the Breakdancers and Ballet data sets. The other test data have been provided to MPEG by Mitsubishi Electric Research Labs, KDDI Corp., Nagoya University, and Fraunhofer HHI. We further thank the
Computer Graphics Lab of ETH Zurich for providing the Doo Young multi-view data set. Finally, we would like to thank Daniel Scharstein from the Department of Computer Science of Middlebury College and Richard Szeliski from the Interactive Visual Media Group of Microsoft Research for providing the Cones stereo image pair and disparity map. This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References
1. L. Onural, A. Smolic, and T. Sikora, “An Overview of a New European Consortium: Integrated Three-Dimensional Television – Capture, Transmission and Display (3DTV)”, Proceedings of EWIMT04, European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, London, UK, November 25–26, 2004.
2. A. Smolic, and D. McCutchen, “3DAV Exploration of Video-Based Rendering Technology in MPEG”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 3, pp. 348–356, March 2004.
3. ISO/IEC JTC1/SC29/WG11, “ISO/IEC 14496-16/PDAM1”, Doc. N6544, Redmond, WA, USA, July 2004.
4. M. Bourges-Sevenier, and E.S. Jang, “An Introduction to the MPEG-4 Animation Framework Extension”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, pp. 928–936, 2004.
5. A. Smolic, and P. Kauff, “Interactive 3D Video Representation and Coding Technologies”, Proceedings of the IEEE, Special Issue on Advances in Video Coding and Delivery, Vol. 93, No. 1, pp. 98–110, January 2005.
6. S.B. Kang, R. Szeliski, and P. Anandan, “The Geometry-Image Representation Tradeoff for Rendering”, Proceedings of ICIP2000, IEEE International Conference on Image Processing, Vancouver, Canada, September 2000.
7. M. Tanimoto, “Free Viewpoint Television – FTV”, Proceedings of PCS 2004, Picture Coding Symposium, San Francisco, CA, USA, December 15–17, 2004.
8. T. Fujii, and M. Tanimoto, “Free-Viewpoint TV System Based on Ray-Space Representation”, SPIE ITCom, Vol. 4864–22, pp. 175–189, 2002.
9. M. Levoy, and P. Hanrahan, “Light Field Rendering”, Proceedings of ACM SIGGRAPH, pp. 31–42, August 1996.
10. H.Y. Shum, and L.W. He, “Rendering with Concentric Mosaics”, Proceedings of ACM SIGGRAPH, pp. 299–306, August 1999.
11. H.Y. Shum, K.T. Ng, and S.C. Chan, “Virtual Reality Using the Concentric Mosaic: Construction, Rendering and Data Compression”, Proceedings of ICIP2000, IEEE International Conference on Image Processing, Vancouver, Canada, September 2000.
12. R. Szeliski, and H.Y. Shum, “Creating Full View Panoramic Image Mosaics and Texture-mapped Models”, Proceedings of ACM SIGGRAPH, pp. 251–258, August 1997.
13. S.E. Chen, “QuickTime VR – An Image-Based Approach to Virtual Environment Navigation”, Proceedings of ACM SIGGRAPH, pp. 29–38, August 1995.
14. C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen, “Unstructured Lumigraph Rendering”, Proceedings of SIGGRAPH 2001, pp. 425–432, 2001.
15. S.J. Gortler, R. Grzeszczuk, R. Szeliski, and M.F. Cohen, “The Lumigraph”, ACM SIGGRAPH ’96, pp. 43–54, August 1996.
16. J. Shade, S. Gortler, L.W. He, and R. Szeliski, “Layered Depth Images”, Proceedings of SIGGRAPH ’98, Orlando, FL, USA, July 1998.
17. K. Mueller, A. Smolić, M. Droese, P. Voigt, and T. Wiegand, “Reconstruction of a Dynamic Environment with Fully Calibrated Background for Traffic Scenes”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 4, pp. 538–549, April 2005.
18. P. Debevec, C. Taylor, and J. Malik, “Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image Based Approach”, Proceedings of SIGGRAPH 1996, pp. 11–20, 1996.
19. D. Wood, D. Azuma, W. Aldinger, B. Curless, T. Duchamp, D. Salesin, and W. Stuetzle, “Surface Light Fields for 3D Photography”, Proceedings of SIGGRAPH 2000.
20. W.-C. Chen, J.-Y. Bouguet, M.H. Chu, and R. Grzeszczuk, “Light Field Mapping: Efficient Representation and Hardware Rendering of Surface Light Fields”, ACM Transactions on Graphics, 21(3), pp. 447–456, 2002.
21. P. Eisert, E. Steinbach, and B. Girod, “Automatic Reconstruction of Stationary 3-D Objects from Multiple Uncalibrated Camera Views”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 2, pp. 261–277, March 2000.
22. K.N. Kutulakos, and S.M. Seitz, “A Theory of Shape by Space Carving”, University of Rochester, Technical Report 692, May 1998.
23. T. Matsuyama, and T. Takai, “Generation, Visualization, and Editing of 3D Video”, Proceedings of Symposium on 3D Data Processing Visualization and Transmission, pp. 234–245, Padova, Italy, June 2002.
24. T. Matsuyama, X. Wu, T. Takai, and T. Wada, “Real-Time Dynamic 3-D Object Shape Reconstruction and High-Fidelity Texture Mapping for 3-D Video”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 3, pp. 357–369, March 2004.
25. S.M. Seitz, and C.R. Dyer, “Photorealistic Scene Reconstruction by Voxel Colouring”, International Journal of Computer Vision, 35(2), pp. 151–173, 1999.
26. C. Grünheit, A. Smolic, and T. Wiegand, “Efficient Representation and Interactive Streaming of High-Resolution Panoramic Views”, Proceedings of ICIP2002, IEEE International Conference on Image Processing, Rochester, NY, USA, September 22–25, 2002.
27. M. Magnor, and B. Girod, “Data Compression for Light-Field Rendering”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 3, pp. 338–343, April 2000.
28. M. Magnor, P. Ramanathan, and B. Girod, “Multi-view Coding for Image-Based Rendering Using 3-D Scene Geometry”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 11, pp. 1092–1106, November 2003.
29. T. Hamaguchi, T. Fujii, Y. Kajiki, and T. Honda, “Real-time View Interpolation System for Multi-view 3D Display”, SPIE Electronic Imaging, Vol. 4297A, pp. 212–221, January 2001.
30. T. Kobayashi, T. Fujii, T. Kimoto, and M. Tanimoto, “Interpolation of Ray-Space Data by Adaptive Filtering”, SPIE Electronic Imaging 2000, Vol. 3958, pp. 252–259, January 2000.
31. W.H. Leung, and T. Chen, “Compression with Mosaic Prediction for Image-Based Rendering Applications”, Proceedings of ICME2000, IEEE International Conference on Multimedia and Expo, New York, NY, USA, July 2000.
32. J. Li, H.Y. Shum, and Y.Q. Zhang, “On the Compression of Image Based Rendering Scene”, Proceedings of ICIP2000, IEEE International Conference on Image Processing, Vancouver, Canada, September 2000.
33. T. Pintaric, U. Neumann, and A. Rizzo, “Immersive Panoramic Video”, Proceedings of the 8th ACM International Conference on Multimedia, pp. 493–494, October 2000.
34. C. Zhang, and J. Li, “Compression of Lumigraph with Multiple Reference Frame (MRF) Prediction and Just-in-time Rendering”, Proceedings of DCC2000, IEEE Data Compression Conference, Snowbird, Utah, USA, March 2000.
35. E. Lamboray, S. Würmlin, M. Waschbüsch, M. Gross, and H. Pfister, “Unconstrained Free-Viewpoint Video Coding”, Proceedings of the IEEE International Conference on Image Processing (ICIP) 2004, Singapore, October 24–27, 2004.
36. C. Fehn, P. Kauff, M. Op de Beeck, F. Ernst, W. Ijsselsteijn, M. Pollefeys, L. Van Gool, E. Ofek, and I. Sexton, “An Evolutionary and Optimised Approach on 3D-TV”, Proceedings of IBC 2002, International Broadcast Convention, Amsterdam, Netherlands, September 2002.
37. S. Pastoor, “3D Displays”, in O. Schreer, P. Kauff, and T. Sikora (Editors), “3D Video Communication”, Wiley, 2005.
38. K. Mueller, A. Smolic, P. Merkle, M. Kautzner, and T. Wiegand, “Coding of 3D Meshes and Video Textures for 3D Video Objects”, Proceedings of PCS 2004, Picture Coding Symposium, San Francisco, CA, USA, December 15–17, 2004.
39. R.Y. Tsai, “A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-shelf TV Camera and Lenses”, IEEE Journal of Robotics and Automation, Vol. RA-3, No. 4, August 1987.
40. J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel, “Free-Viewpoint Video of Human Actors”, ACM Transactions on Graphics (Special Issue SIGGRAPH’03), Vol. 22, No. 3, pp. 569–577, July 2003.
41. S. Würmlin, E. Lamboray, O. Staadt, and M. Gross, “3D Video Recorder: A System for Recording, Processing and Playing Three-Dimensional Video”, Computer Graphics Forum 22 (2), Blackwell Publishing Ltd, Oxford, U.K., pp. 181–193, 2003.
42. S. Würmlin, E. Lamboray, and M. Gross, “3D Video Fragments: Dynamic Point Samples for Real-time Free-viewpoint Video”, Computers and Graphics 28 (1), Special Issue on Coding, Compression and Streaming Techniques for 3D and Multimedia Data, Elsevier Ltd, pp. 3–14, 2004.
43. C.L. Zitnick, S.B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-Quality Video View Interpolation Using a Layered Representation”, SIGGRAPH04, Los Angeles, CA, USA, August 2004.
44. P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, and R. Tanger, “Depth Map Creation and Image Based Rendering for Advanced 3DTV Services Providing Interoperability and Scalability”, Signal Processing: Image Communication, Special Issue on 3DTV, February 2007.
45. W. Matusik, and H. Pfister, “3D TV: A Scalable System for Real-Time Acquisition, Transmission and Autostereoscopic Display of Dynamic Scenes”, ACM Transactions on Graphics (TOG) SIGGRAPH, ISSN: 0730-0301, Vol. 23, Issue 3, pp. 814–824, August 2004.
46. D. Scharstein, and R. Szeliski, “High-Accuracy Stereo Depth Maps Using Structured Light”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), Vol. 1, pp. 195–202, Madison, WI, June 2003.
47. B. Haskell, A. Puri, and A. Netravali, “Digital Video: An Introduction To MPEG-2”, 1997.
48. ITU-T and ISO/IEC JTC 1, “Generic Coding of Moving Pictures and Associated Audio Information – Part 2: Video”, ITU-T Rec. H.262 | ISO/IEC 13818-2 (MPEG-2 Video), November 1994.
49. ITU-T, “Video Coding for Low Bit Rate Communication”, ITU-T Rec. H.263; version 1, November 1995; version 2, January 1998; version 3, November 2000.
50. ISO/IEC JTC 1, “Coding of Audio-Visual Objects – Part 2: Visual”, ISO/IEC 14496-2 (MPEG-4 visual version 1), April 1999; Amd. 1 (ver. 2), February, 2000; Amd. 2, 2001, Amd. 3, 2001, Amd. 4 (streaming video profile), 2001, Amd 1 to 2nd ed. (studio profile), 2001, Amd. 2 to 2nd ed. 2003.
51. ITU-T and ISO/IEC JTC 1, “Advanced Video Coding for Generic Audiovisual Services”, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, 2003, most recent version: 2005.
52. T. Wiegand, G.J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC Video Coding Standard”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 560–576, July 2003.
53. S. Sun, and S. Lei, “Stereo-view Video Coding Using H.264 Tools”, Proceedings of SPIE – Volume 5685, Image and Video Communications and Processing 2005, pp. 177–184, March 2005.
54. B. Quante, C. Fehn, L.M.J. Meesters, P.J.H. Seuntiens, and W.A. IJsselsteijn, “Report on Perception of 3D Coding Artefacts”, ATTEST Technical Report D8a, September 2003.
55. C. Fehn, “Depth-Image-Based Rendering (DIBR), Compression, and Transmission for a Flexible Approach on 3DTV”, Dissertation, Technical University, Berlin. Mensch & Buch Verlag, Berlin, Germany, 2006. ISBN 3-86664-118-4.
56. ISO/IEC JTC1/SC29/WG11, “Text of ISO/IEC FDIS 23002-3 Representation of Auxiliary Video and Supplemental Information”, Doc. N8768, Marrakech, Morocco, January 2007.
57. ISO/IEC JTC1/SC29/WG11, “Text of ISO/IEC 13818-1:2003/FDAM2 Carriage of Auxiliary Data”, Doc. N8799, Marrakech, Morocco, January 2007.
58. ISO/IEC JTC1/SC29/WG11, “Survey of Algorithms Used for Multi-view Video Coding (MVC)”, Doc. N6909, Hong Kong, China, January 2005.
59. ISO/IEC JTC1/SC29/WG11, “Call for Evidence on Multi-view Video Coding”, Doc. N6720, Palma de Mallorca, Spain, October 2004.
60. ISO/IEC JTC1/SC29/WG11, “Report of the Subjective Quality Evaluation for MVC Call for Evidence”, Doc. N6999, Hong Kong, China, January 2005.
61. ISO/IEC JTC1/SC29/WG11, “Call for Comments on 3DAV”, Doc. N6051, Gold Coast, Australia, October 2003.
62. ISO/IEC JTC1/SC29/WG11, “Call for Proposals on Multi-view Video Coding”, Doc. N7327, Poznan, Poland, July 2005.
63. ISO/IEC JTC1/SC29/WG11, “Requirements on Multi-view Video Coding v.2”, Doc. N7282, Poznan, Poland, July 2005.
64. ITU-R BT.500-11, “Methodology for the Subjective Assessment of the Quality of Television Pictures”, 2002.
348
A. Smolic et al.
65. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G.J. Sullivan, “RateConstrained Coder Control and Comparison of Video Coding Standards,” IEEE Transactions CSVT, Vol. 13, pp. 688–703, July 2003. 66. P. Merkle, K. M¨ uller, A. Smoli´c, and T. Wiegand, “Statistical Evaluation of Spatiotemporal Prediction for Multi-view Video Coding”, Proceedings of ICOB 2005, Berlin, Germany, October 27–28, 2005. 67. A. Kaup, and U. Fecker, “Analysis of Multi-Reference Block Matching for Multi-View Video Coding”, Proceedings of 7th Workshop Digital Broadcasting, pp. 33–39, Erlangen, Germany, September 2006. 68. E. Martinian, A. Behrens, J. Xin, and A. Vetro, “View Synthesis for Multi-view Video Compression”, Proceedings of PCS 2006, Picture Coding Symposium, Beijing, China, April 2006. 69. M. Kitahara, H. Kimata, S. Shimizu, K. Kamikura, Y. Yashimata, K. Yamamoto, T. Yendo, T. Fujii, and M. Tanimoto, “Multi-view Video Coding Using View Interpolation and Reference Picture Selection”, ICME 2006, IEEE International Conference on Multimedia and Exposition, Toronto, Ontario, Canada, July 2006. 70. J.-H. Kim, P.-L. Lai, A. Ortega, Y. Su, P. Yin, and C. Gomila, “Results of CE2 on Multi-view Video Coding”, Joint Video Team, Doc. JVT-T117, Klagenfurt, Austria, July 2006. 71. Y.-L. Lee, J.-H. Hur, Y.-K. Lee, S.-H. Cho, H.J. Kwon, N.H. Hur, and J.W. Kim, “Results of CE2 on Multi-view Video Coding”, Joint Video Team, Doc. JVT-T110, Klagenfurt, Austria, July 2006. 72. K.-J. Oh, and Y.-S. Ho, “Multi-view Video Coding Based on the Lattice-like Pyramid GOP Structure”, Proceedings of PCS 2006, Picture Coding Symposium, Beijing, China, April 2006. 73. X. Cheng, L. Sun, and S. Yang, “A Multi-view Video Coding Scheme Using Shared Key Frames for High Interactive Application ”, Proceedings of PCS 2006, Picture Coding Symposium, Beijing, China, April 2006. 74. Y. Yang, G. Jiang, M. Yu, F. Li, and Y. Kim, “Hyper-Space Based Multi-view Video Coding Scheme for Free Viewpoint Television”, Proceedings of PCS 2006, Picture Coding Symposium, Beijing, China, April 2006. 75. M. Flierl, A. Mavlankar, and B. Girod, “Motion and Disparity Compensated Coding for Video Camera Arrays”, Proceedings of Picture Coding Symposium, Beijing, China, April 2006. 76. A. Vetro, W. Matusik, H.P. Pfister, J. Xin, “Coding Approaches for End-toEnd 3D TV Systems”, Proceedings of Picture Coding Symposium (PCS), San Francisco, CA, USA, December 2004. 77. E. Martinian, A. Behrens, J. Xin, A. Vetro, and H. Sun, “Extensions of H.264/AVC for Multiview Video Compression”, IEEE International Conference on Image Processing (ICIP), Atlanta, USA, October 2006. 78. H. Kalva, L. Christodoulou, L.M. Mayron, O. Marques, and B. Furht, “Design and Evaluation of a 3D Video System Based on H.264 View Coding”, International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2006), Newport, RI, USA, May 22–23, 2006. 79. D. Socek, D. Culibrk, H. Kalva, O. Marques, and B. Furht, “Permutationbased Low-complexity Alternate Coding in Multi-view H.264/AVC”, IEEE International Conference on Multimedia & Expo (ICME) 2006, July 9–12, 2006, Toronto, Canada.
9 Compression of Multi-view Video and Associated Data
349
80. H. Kalva, L. Christodoulou, L. Mayron, O. Marques, and B. Furht, “Challenges and Opportunities in Video Coding for 3D TV”, Special Session on 3-D TV: Primed for Success? IEEE International Conference on Multimedia & Expo (ICME) 2006, July 9–12, 2006, Toronto, Canada. 81. C. Bilen, A. Aksay, G. Bozdagi Akar, “A Multi-View Video Codec based on H.264”, IEEE ICIP 2006, Atlanta, GA, USA, October 2006. 82. H. Schwarz, D. Marpe, and T. Wiegand, “Analysis of Hierarchical B Pictures and MCTF”, ICME 2006, IEEE International Conference on Multimedia and Expo, Toronto, Ontario, Canada, July 2006. 83. T. Wiegand, X. Zhang, and B. Girod, “Long-term Memory Motion-compensated Prediction,” IEEE Transactions on Circuits and Systems for Video Technical, Vol. 9, No. 1, pp. 70–84, February 1999. 84. A. Vetro, Y. Su, H. Kimata, and A. Smolic, “Joint Draft 1.0 on Multi-view Video Coding”, Joint Video Team, Doc. JVT-U209, Hangzhou, China, October 2006. 85. A. Vetro, Y. Su, H. Kimata, and A. Smolic, “Joint Multi-view Video Model”, Joint Video Team, Doc. JVT-U207, Hangzhou, China, October 2006. 86. U. Fecker, M. Barkowsky, and A. Kaup, “Improving the Prediction Efficiency for Multi-View Video Coding Using Histogram Matching”, Proceeding Picture Coding Symposium (PCS 2006), Beijing, China, April 2006. 87. J. Lu, H. Cai, J.-G. Lou, and J. Li, “An Effective Epipolar Geometry Assisted Motion-estimation Technique for Multi-view Image Coding”, IEEE International Conference on Image Processing, Atlanta, GA, USA, 8–11 October, 2006. 88. X. Guo, Y. Lu, F. Wu, and W. Gao, “Inter-view Direct Mode for Multi-view Video Coding”, IEEE Transaction on Circuits and Systems for Video Technology, Vol. 16, No. 12, pp. 1527–1532, 2006. 89. J. Garbas, U. Fecker, T. Tr¨ oger, and A. Kaup, “4D Scalable Multi-View Video Coding Using Disparity Compensated View Filtering and Motion Compensated Temporal Filtering”, International Workshop on Multimedia Signal Processing (MMSP), Victoria, Canada, October 2006. 90. M. Droese, C. Clemens, and T. Sikora, “Extending Single-view Scalable Video Coding to Multi-view Based on H.264/AVC”, Proceedings ICIP 2006, Atlanta, USA, October 2006. 91. W. Yang, Y. Lu, F. Wu, J. Cai, K.N. Ngan, and S. Li, “4D Wavelet-based Multi-view Video Coding”, IEEE Transaction on Circuits and Systems for Video Technology, Vol. 16, No. 11, pp. 1385–1396, 2006. 92. W. Yang, F. Wu, Y. Lu, J. Cai, K.N. Ngan, and S. Li, “Scalable Multiview Video Coding Using Wavelet”, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 6078–6081, 2005. 93. W. Yang, F. Wu, Y. Lu, J. Cai, K.N. Ngan, and S. Li, “Scalable Multiview Video Coding Using Wavelet”, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 6078–6081, 2005. 94. N. Ozbek, and A.M. Tekalp, “Scalable Multi-view Video Coding for Interactive 3DTV”, ICME 2006, IEEE International Conference on Multimedia and Exposition, Toronto, Ontario, Canada, July 2006. 95. X. Guo, Y. Lu, F. Wu, W. Gao, and S. Li, “Distributed Multi-view Video Coding”, Visual Communications and Image Processing, Vol. 6077, 2006.
350
A. Smolic et al.
96. E. Kurutepe, M.R. Civanlar, and A.M. Tekalp, “Interactive Multi-View Video Delivery with View-Point Tracking and Fast Stream Switching”, MRCS 2006, Istanbul, Turkey, September 2006. 97. E. Kurutepe, M.R. Civanlar, and A.M. Tekalp, “Interactive Transport of Multiview Videos for 3DTV Applications”, Packet Video Workshop 2006, Hangzhou, China.
10 Efficient Transport of 3DTV

A. Murat Tekalp1 and M. Reha Civanlar2

1 College of Engineering, Koç University, Istanbul, Turkey
2 DoCoMo Labs, Palo Alto, California, USA
10.1 Introduction

The main difficulty in the deployment of multi-view video services, including 3D and free-viewpoint TV, appears to be the large bandwidth requirement associated with the transport of multiple video streams. Inventing efficient solutions for transporting multi-view video signals over both broadcast channels and wired or wireless IP networks, solutions that are compatible with the existing infrastructure and international standards, is a challenging task. 3DTV transport has an inherent backwards-compatibility requirement because of the wide deployment of the existing 2D digital TV infrastructure. There are two potential backwards-compatible transport architectures for 3DTV: the DVB architecture for broadcast, and the Internet Protocol (IP) architecture for wired or wireless streaming. An end-to-end 3DTV system consists of 3D video representation and compression, transport protocols and systems, and a 3D display client/peer, as shown in Fig. 10.1. There are a number of 3D display technologies, which may require different 3D video representations. Different representations for 3D video and their compression have been addressed in [1]. For example, there are fixed-view stereoscopic displays, which require only two views at a time. There are also autostereoscopic displays, which require eight or more views at a time to provide limited free-view functionality. Both of these displays can be driven by either a raw multi-view video representation or the so-called "video-plus-depth" representation [2]. The "video-plus-depth" representation [2] has recently been proposed as a means of building 3DTV transport, providing fixed-view and/or limited free-view functionality, evolutionarily on the existing DVB infrastructure. This representation uses a regular video stream enriched with a depth map that provides a Z-value for each pixel. The 3D video is rendered at the receiver side with two or more views by using depth-image-based rendering (DIBR). MPEG has established a special international standard that focuses on 3DTV using the video-plus-depth representation.
Fig. 10.1. Block diagram of an end-to-end 3DTV system (3D video representation and compression, transport, and 3D display)
The Internet Protocol (IP) architecture is very flexible in accommodating a wide range of communication applications, ranging from the ongoing replacement of classical telephone services by voice-over-IP applications to new video-over-IP services. Transmission of video over IP is currently an active research and development area where significant results have already been achieved. There are already video-on-demand services, for both news and entertainment applications, offered over the Internet. Also, 2.5G and 3G mobile network operators have started to use IP successfully to offer wireless video services. Flexible transport of a variety of 3DTV representations over IP networks seems to be a natural extension of monoscopic video-over-IP applications. Video streaming architectures can be classified as i) server unicasting to one or more clients, ii) server multicasting to several clients, iii) peer-to-peer (P2P) distribution, where each peer forwards packets to another peer, and iv) P2P multicasting, where each peer forwards packets to several other peers. Multi-view video streaming protocols include RTP/UDP/IP, which is the current state of the art, and RTP/DCCP/IP, which is the next-generation protocol. Multicasting protocols can be supported at the network layer or the application layer. This chapter starts with a brief overview of 3D/multi-view video representations and encoding in Sect. 10.2. Section 10.3 discusses the transmission of encoded "video-plus-depth" data over the MPEG-2 transport stream for broadcast applications.
Fig. 10.2. Block diagram of a framework for server-to-client 3DTV unicast over IP (encoding options such as video-plus-depth, MVC/JMVM, and scalable MVC; server- or client-driven rate scaling; JVT and IETF RTP standardization; packet loss protection and error concealment for wireline and wireless IP clients)
Unicast and multicast streaming of multi-view encoded representations over IP are discussed in Sect. 10.4. A block diagram of a complete framework for end-to-end unicast transport over IP is shown in Fig. 10.2. This framework includes several options for video representation, encoder/decoder, streaming protocols, rate scaling and allocation strategies, error control and concealment schemes, and client support for various 3D displays. In Sect. 10.5, we present a stereo video unicast test-bed, including a server and multiple clients, as a demonstration of the concept. Conclusions and future research directions are discussed in Sect. 10.6.
10.2 Overview: 3D Video Representation and Encoding

A complete treatment of 3D and free-viewpoint video representations and their compression is given in [1, 2, 3], with special focus on related standardization activities in MPEG. Multi-view 3D video can be encoded implicitly, in the so-called video-plus-depth representation, or explicitly in raw form. There are several strategies for encoding raw multi-view video, including simulcast coding, scalable simulcast coding, multi-view coding, and scalable multi-view coding. These are briefly reviewed in the following for the sake of completeness.

10.2.1 Video-plus-Depth Representation and Encoding

The video-plus-depth representation has been proposed as a new data format for 3DTV by the European project ATTEST [4], to replace the usual end-to-end stereoscopic video chain, i.e., capturing, transmission, and display of two separate video streams, one for the left and one for the right eye. This representation uses a regular video stream in which each frame is enriched with a depth map providing a Z-value for each pixel. The final left and right views are reconstructed at the receiver side by using depth-image-based rendering (DIBR). This concept provides backwards compatibility with existing DVB services, efficient compression, and easy adaptation to different 3D display systems, viewing conditions, and user preferences. Encoding the depth map adds only a small overhead, typically about 10–20%, to the video bitrate. Hence, it is widely accepted that the video-plus-depth representation is a promising approach for near-term 3DTV broadcast systems, especially because it can be built evolutionarily on the existing DVB infrastructure. Recently, Fehn et al. [2] extended this to an "N video-plus-depth" representation for better rendering at the receiver side. They consider aspects of interoperability, scalability, and adaptability when different multi-baseline geometries are used for multi-view capturing and 3D reproduction. They also present a method for the creation of depth maps and an algorithm for depth-image-based rendering related to the system approach.
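To make the rendering step concrete, the following is a minimal Python sketch of depth-image-based rendering for a parallel (shift-sensor) camera pair. The inverse-depth quantization of the 8-bit depth map and the focal length, baseline, and clipping-plane values are illustrative assumptions rather than ATTEST parameters, and occlusion-aware hole filling is omitted.

    import numpy as np

    def render_virtual_view(frame, depth8, f=1000.0, baseline=0.05,
                            z_near=1.0, z_far=10.0):
        # Warp one color frame to a horizontally shifted virtual camera
        # using its per-pixel depth map (a minimal DIBR sketch).
        # depth8: 8-bit map, 255 = nearest plane (z_near), 0 = farthest (z_far).
        h, w = depth8.shape
        # Inverse-depth de-quantization of the 8-bit depth values.
        z = 1.0 / (depth8 / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
        # Pixel disparity for a parallel camera displaced by 'baseline' meters.
        disp = np.round(f * baseline / z).astype(int)
        out = np.zeros_like(frame)
        cols = np.arange(w)
        for y in range(h):
            order = np.argsort(-z[y])          # paint far-to-near: nearer wins
            x_new = cols[order] - disp[y, order]
            ok = (x_new >= 0) & (x_new < w)
            out[y, x_new[ok]] = frame[y, cols[order][ok]]
        return out                             # disocclusion holes stay black

The holes left by disocclusions are the main quality challenge of DIBR; they are typically filled by interpolation or, as in the N video-plus-depth extension above, by data from additional views.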
10.2.2 Multi-View Video Encoding

There are various approaches to encoding multi-view video, which provide a trade-off among random access, ease of rate adaptation, and compression efficiency. These are:

Simulcast Coding: The simplest approach to multi-view video coding is simulcasting, where each view is independently coded using the public-domain H.264/AVC or scalable video coding (SVC) reference codecs. The Joint Video Team (JVT) of ISO/IEC MPEG (Moving Picture Experts Group) and ITU-T VCEG (Video Coding Experts Group) has recently developed a widely accepted standard for single-view (monocular) video coding, called the H.264/AVC standard [5]. Both the standard and a reference encoder/decoder implementation are publicly available [5, 6]. The JVT is currently extending this standard for scalable video coding (SVC) [7]. This extension adds temporal, spatial, and quality (SNR) scalability enhancement layers on top of an H.264/AVC base layer. Scalable encoding provides flexibility in video rate adaptation, since video at any desired bitrate (over a range of rates) can be extracted from a single compressed bitstream. A reference encoder/extractor/decoder implementation is also available [8].

Multi-View Encoding: It is well known that compression efficiency in multi-view video coding can be improved, without sacrificing PSNR, by predicting one view from the others. Work on the standardization of multi-view video coding (MVC), which investigates various inter-view prediction structures and other options, has recently started under the JVT. A reference software implementation is publicly available [9].

Scalable Multi-View Coding: Approaches for scalable multi-view video coding (SMVC) have recently been proposed [10, 11, 12]. The basic SMVC implementation is an extension of single-view scalable video coding (SVC) to multi-view video [10, 11]. A hierarchical decomposition across views, analogous to the temporal decomposition, is applied in order to exploit the redundancy between views. The codec supports both open-loop and closed-loop encoding. The advantage of this approach lies in its compatibility with the state-of-the-art single-view H.264/AVC and SVC codecs. The hierarchical decomposition structure allows efficient access to all views and to the frames inside a view. This is especially important for video-based rendering and multi-view displays, which have different requirements. The chosen decomposition structure also supports parallel processing. The results show that SMVC provides superior rate-distortion performance compared to simulcast scalable coding of multiple views using JSVM 5.1. An alternative implementation of SMVC is the MVC base layer plus simulcast enhancement layers proposed in [12]. SMVC can be used for the transport of multi-view video over IP for interactive 3DTV by dynamically and adaptively combining temporal, spatial, and SNR scalability according to the network conditions [12, 15].
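As an illustration of the hierarchical decomposition just mentioned, the short sketch below generates the coding order and hierarchy level of a dyadic hierarchical-B structure; the same dyadic pattern can be applied across views. This is a generic textbook construction, not the exact JSVM/JMVM scheduling algorithm.

    def hierarchical_b_order(gop_size=8):
        # Coding order and hierarchy level for a dyadic hierarchical-B GOP.
        # Picture gop_size is the key picture (level 0); each following
        # level inserts the midpoints between already-coded pictures, so
        # every B picture can be bi-predicted from its two nearest coded
        # neighbors. Dropping the highest levels yields temporal scaling.
        schedule = [(gop_size, 0)]            # (display_index, level)
        step, level = gop_size, 1
        while step > 1:
            for idx in range(step // 2, gop_size, step):
                schedule.append((idx, level))
            step //= 2
            level += 1
        return schedule

    # For gop_size=8 this yields:
    # [(8,0), (4,1), (2,2), (6,2), (1,3), (3,3), (5,3), (7,3)]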
The demo server presented in Sect. 10.5 can support many of these different codecs. It is also possible to serve depth maps together with each view for intermediate view interpolation at the client side, although our demo server and clients do not currently support this option.
10.3 Video + Depth over MPEG-2 Transport Stream

The basic video-plus-depth format described in Sect. 10.2.1 has been standardized within MPEG (Moving Picture Experts Group). This work was initiated by a joint effort of Philips and Fraunhofer HHI (a partner of the 3DTV Network of Excellence). The idea was to standardize only the format itself, by means of metadata that conveys the meaning of the gray-level values in the depth imagery, together with some additional metadata required to signal the existence of an encoded depth stream to the receiver. The actual compression of the per-pixel depth information, on the other hand, has deliberately not been defined, so that any conventional MPEG video codec (e.g., MPEG-2 or H.264/AVC) can be used for this purpose. The new standard has been published in two parts. The specification of the depth format itself is called ISO/IEC 23002-3 (MPEG-C), and a method for transmitting video-plus-depth within a conventional MPEG-2 transport stream has become an amendment (Amd. 2) to ISO/IEC 13818-1 (MPEG-2 Systems). Both standards were finalized at the MPEG meeting in Marrakech, Morocco (January 2007). The technical details can be found in the standardization documents [13, 14]. A more complete treatment of 3DTV broadcasting can be found in [15].
10.4 Streaming Multi-view Video over IP

A multi-view video streaming server should perform, among other things, TCP-friendly video rate adaptation, RTP packetization, packet scheduling, and transmit-buffer management. The major research issues in multi-view video streaming are multi-view rate allocation/adaptation and packet loss protection on the server side for various streaming strategies. In streaming video over the Internet, the video rate must be adapted to the available throughput in order to avoid congestion. Furthermore, it is desirable that the video flow be friendly to competing TCP traffic. We provide a brief overview of streaming protocols in Sect. 10.4.1. Rate adaptation of stereo and multi-view video differs from that of monocular video, since rate allocation between views offers new flexibilities. Rate adaptation in a unicast server can be open-loop (Sect. 10.4.2) or client-driven, as in the case of single-user head-tracking display clients (Sect. 10.4.3). Application-layer multicasting and peer-to-peer streaming are discussed in Sect. 10.4.4. Packet loss protection at the server and error concealment at the client are discussed in Sect. 10.4.5.
10.4.1 Streaming Protocols

Today, the most widely used transport protocol for media/multimedia is the Real-time Transport Protocol (RTP) over UDP [16]. However, RTP/UDP does not contain any congestion control mechanism and can therefore lead to congestion collapse when large volumes of multi-view video are delivered. The Datagram Congestion Control Protocol (DCCP) [17] is designed as a replacement for UDP for media delivery, running directly over the Internet Protocol (IP) to provide congestion control without reliability. DCCP implements bi-directional unicast connections of congestion-controlled, unreliable datagrams; it can be thought of as TCP minus reliability and in-order packet delivery, or as UDP plus congestion control, connection setup, and acknowledgements. Despite the unreliable datagram flow, DCCP provides reliable handshakes for connection setup/teardown and reliable negotiation of options. Besides handshakes and feature negotiation, DCCP also accommodates a choice of modular congestion control mechanisms. Two congestion control schemes are currently defined in DCCP, one of which is selected at connection startup time. These are TCP-like Congestion Control [18] and TCP-Friendly Rate Control (TFRC) [19]. TCP-like Congestion Control, identified by Congestion Control Identifier 2 (CCID2) in DCCP, behaves similarly to TCP's Additive Increase Multiplicative Decrease (AIMD) congestion control, halving the congestion window in response to a packet drop. Applications using this congestion control mechanism respond quickly to changes in available bandwidth, but must tolerate the abrupt changes in congestion window typical of TCP. On the other hand, TFRC, identified by CCID3, is a form of equation-based flow control that minimizes abrupt changes in the sending rate while maintaining longer-term fairness with TCP. It is hence appropriate for applications that prefer a smooth sending rate, including streaming-media applications with a small or moderate receiver buffer. The TFRC rate: In its operation, CCID3 calculates an allowed sending rate, called the TFRC rate, using the TCP throughput equation; this rate is provided to the sender application upon request (a small numerical sketch of this calculation is given after the next paragraph). The sender may use this rate information to adjust its transmission rate in order to get better results. Hence, the server must use effective video rate adaptation methods, which are discussed in the following subsections.

10.4.2 Unicasting with Server-Driven Rate Adaptation

Server-driven rate adaptation refers to rate adaptation at the server side without any feedback from the client, i.e., open-loop rate adaptation. Several open-loop rate adaptation strategies for stereo and multi-view video at the server side, for the UDP and DCCP protocols, are reviewed in the following.
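As referenced in Sect. 10.4.1, the following is a minimal numerical sketch of the TCP throughput equation that TFRC uses to compute the allowed sending rate (the equation is the one given in RFC 3448; the packet size, round-trip time, and loss event rate below are illustrative values, not measurements):

    from math import sqrt

    def tfrc_rate(s=1460.0, rtt=0.100, p=0.01, b=1):
        # Allowed sending rate in bytes/s from the TCP throughput equation
        # (RFC 3448, Sect. 3.1): s = packet size in bytes, rtt = round-trip
        # time in seconds, p = loss event rate, b = packets per ACK.
        t_rto = 4.0 * rtt                     # retransmission timeout estimate
        denom = (rtt * sqrt(2.0 * b * p / 3.0)
                 + t_rto * 3.0 * sqrt(3.0 * b * p / 8.0) * p * (1.0 + 32.0 * p * p))
        return s / denom

    # Illustrative values: 1460-byte packets, 100 ms RTT, 1% loss events
    # give roughly 1.3 Mbit/s.
    print(8.0 * tfrc_rate() / 1e6, "Mbit/s")

The equation makes TFRC's key property visible: the allowed rate falls smoothly as the loss event rate p grows, rather than halving abruptly as in AIMD.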
In addition to the spatial and temporal redundancy present in monocular video, there are two other types of redundancy that can be exploited for efficient encoding and streaming of multi-view video: i) inter-view redundancy, which refers to the high correlation between the views and is most effective when the cameras are closely spaced, and ii) psycho-visual redundancy: it is well known that the human visual system (HVS) can perceive high frequencies in 3D from the higher-quality 2D view, even if the two 2D views do not have the same spatial/temporal resolution [20]. The MVC encoders discussed in Sect. 10.2.2 are all based on the exploitation of inter-view redundancy. We now discuss the exploitation of psycho-visual redundancy for effective server-driven rate adaptation. In monoscopic video compression, it is common practice to sub-sample the chrominance channels, since the HVS is less sensitive to variations in chrominance values. Similarly, in the theory of stereo perception, it is reported that the HVS can perceive high-frequency information in 3D from one of the views even if the other view is low-pass filtered [20]. Hence, spatial and/or temporal sub-sampling of one of the views can be performed to reduce the overall transmission bitrate. Of course, the sub-sampled view must be interpolated back to full resolution at the client before display. On-line rate adaptation can be performed either by adapting the encoding parameters of a real-time MVC-compatible encoder or by layer extraction from an off-line encoded SMVC bitstream. It has been shown that stereoscopic videos can be encoded using this approach at about 1.2 times the rate of monoscopic videos with little visual quality degradation [20, 21].

Rate Adaptation Using a Real-Time MVC Encoder [21]: Rate adaptation by downscaling one of the views can be achieved by i) spatial sub-sampling, ii) temporal sub-sampling, iii) scaling the quantization step-size, or iv) content-adaptive scaling using a combination of the above for each group of pictures (GoP) in an MVC-compatible real-time encoder. In content-adaptive scaling, we classify each GoP into one of four categories, as shown in Fig. 10.3, according to its low-level attributes, such as the amount of motion and spatial detail within the GoP. Segments with high temporal activity (high motion) need to be encoded at full temporal resolution for a smooth viewing experience. On the other hand, for a low-motion GoP, the temporal sampling rate can be reduced without significant loss of perceptual quality. Likewise, GoPs with high spatial detail should not be reduced to lower spatial resolutions, while it may be harmless to do spatial downsampling in the case of low spatial detail.

Rate Adaptation by Layer Extraction from an SMVC Bitstream [22]: In this case, the video is encoded off-line with a predetermined number of spatial, temporal, and SNR scalability layers. Content-aware bit allocation among the views is performed during bitstream extraction by adaptive selection of the number of spatial, temporal, and SNR scalability layers for each GoP according to the motion and spatial activity of that GoP. It has been shown that SMVC-encoded bitstreams with content-aware bit allocation among views can be used for stereo video transport over the Internet for interactive 3DTV [22].
Fig. 10.3. Classification of GoP
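The four-way GoP classification of Fig. 10.3 can be sketched as follows; the two activity measures and the thresholds are illustrative assumptions, since the chapter does not specify how motion and spatial detail are quantified.

    def classify_gop(motion_activity, spatial_detail, t_m=0.5, t_s=0.5):
        # Map a GoP to one of the four categories of Fig. 10.3 and return
        # the downscaling allowed for the auxiliary view. Both activity
        # measures are assumed normalized to [0, 1]; the thresholds t_m
        # and t_s are illustrative values.
        high_motion = motion_activity >= t_m
        high_detail = spatial_detail >= t_s
        if high_motion and high_detail:
            return "keep resolutions; scale the quantization step-size only"
        if high_motion:                  # high motion, low spatial detail
            return "spatial subsampling of one view"
        if high_detail:                  # low motion, high spatial detail
            return "temporal subsampling of one view"
        return "both spatial and temporal subsampling of one view"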
10.4.3 Unicasting with Client-Driven Rate Adaptation

There are also streaming strategies in which rate adaptation at the server side must be done using feedback from the client side. As an example, selective streaming of multi-view video to a single-user head-tracking display client is reviewed in this section. A novel client-driven multi-view video streaming system that allows a user to watch 3D video interactively with significantly reduced bandwidth requirements, by transmitting a small number of views selected according to his/her head position, was proposed in [23]. This system can be used to efficiently stream a dense set of multi-view sequences (light fields) or wider-baseline multi-view sequences together with depth information. The user's head position is tracked and predicted into the future in order to dynamically select the views that best match the user's current viewing angle. Prediction of future head positions is needed so that views matching the predicted head positions can be requested from the server ahead of time, to account for delays due to network transport and stream switching. The system allocates more bandwidth to the selected views in order to render the current viewing angle. Highly compressed, lower-quality versions of some other views are also requested in order to provide protection against having to display the wrong view when the current user viewpoint differs from the predicted viewpoint. The proposed system makes use of multi-view coding (MVC) and scalable video coding (SVC) concepts together to obtain improved compression efficiency while providing flexibility in bandwidth allocation to the selected views. The rate-distortion performance of the system has been demonstrated under different conditions. A block diagram of the system is shown in Fig. 10.4.
Fig. 10.4. Overview of the head-tracking selective multi-view video streaming system (head prediction, multi-view set selection, and a network module on the client; multiple pre-encoded multi-view sets with MVC base layer and enhancement layers on the server)
Suppose that we have a multi-view video with N views on a server. The client side first determines the user's current head position, and a Kalman-filter-based predictor predicts the user's head position d frames into the future. Then, an error measure is computed at the client to determine the number of views, M ≤ N, to be requested from the server. The server selectively streams the multi-view video sequence encoded at two quality levels: as a base layer, all M views are encoded using the MVC codec at a lower bitrate. On top of this base layer, an enhancement layer is encoded for each view, independently of the other enhancement layers to allow random access, in order to improve the quality of the selected views. This encoding scheme is illustrated in Fig. 10.5. Since the total bandwidth available to the user is assumed fixed, an increased proportion of the bandwidth needs to be allocated to the base layer as M increases. This necessitates an intelligent rate allocation scheme between the base-layer MVC and enhancement-layer streams. We assume that the server can host several sets of the same multi-view video, each set encoded using a different value of M and with a different rate allocation between the base and enhancement layers. The client switches to the appropriate set of streams according to its bandwidth, the user's predicted head position, and the current value of M. If there are no prediction errors, the received high-quality (base and two enhancement) streams are passed on to the display, which shows a high-quality view to each eye. This is illustrated in Fig. 10.6. The low-bitrate base-layer MVC enables the user to keep watching 3D video, albeit possibly at a lower quality, when the current user head position differs from the predicted position, until the correct high-quality streams arrive from the server. If there is a prediction error and the wrong set of high-quality streams arrives, the system displays a low-quality version of the desired views, which may be available in the base-layer MVC only. According to [20], humans perceive high-quality 3D video as long as one of the eyes sees a high-quality view. Therefore, in the presence of prediction errors, as long as at least one of the required views is delivered in high quality, the viewer might not even notice any loss of quality. If the prediction error is so severe that a required view is not delivered at all (i.e., it is not among the M views in the base layer), an error concealment method is employed (e.g., the nearest available views are displayed).
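A minimal sketch of a constant-velocity Kalman predictor for one head-position coordinate is given below. The chapter states only that a Kalman-filter-based predictor is used in [23]; the state model and the noise values here are assumptions made for illustration.

    import numpy as np

    class HeadPredictor:
        # Constant-velocity Kalman filter for one head coordinate.
        # The model and the noise values q, r are illustrative assumptions;
        # [23] does not publish its exact filter parameters.
        def __init__(self, q=1e-3, r=1e-2):
            self.x = np.zeros(2)                 # state: [position, velocity]
            self.P = np.eye(2)                   # state covariance
            self.F = np.array([[1.0, 1.0],       # position += velocity / frame
                               [0.0, 1.0]])
            self.q, self.r = q, r                # process / measurement noise

        def update(self, meas):
            # Time update (one frame), then correct with tracker measurement.
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.q * np.eye(2)
            innov = meas - self.x[0]             # we observe position only
            s = self.P[0, 0] + self.r            # innovation variance
            k = self.P[:, 0] / s                 # Kalman gain, shape (2,)
            self.x = self.x + k * innov
            self.P = self.P - np.outer(k, self.P[0, :])

        def predict_ahead(self, d):
            # Extrapolate d frames ahead to hide network/switching delay.
            return (np.linalg.matrix_power(self.F, d) @ self.x)[0]

    hp = HeadPredictor()
    for z in [0.00, 0.01, 0.03, 0.06]:           # tracker samples per frame
        hp.update(z)
    print(hp.predict_ahead(5))                   # position 5 frames ahead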
Fig. 10.5. Multi-view video encoding with MVC base layer and simulcast enhancement layers
10.4.4 Application-Layer Multicasting

Multicasting is a method to efficiently transport packets from one or more senders to a group of interested clients. The multicasting paradigm aims to avoid sending duplicate packets through the network, in order to utilize network resources more efficiently.
Fig. 10.6. Quality of the received video streams in the presence of head prediction errors
In network-layer multicast, the sender sends every packet only once. These packets are duplicated at multicast-enabled routers as needed and forwarded to the other members of the multicast group. Unfortunately, network-layer multicasting is not widely deployed in practice, because it requires multicast-capable routers. Although most new routers are multicast capable, security and other operational concerns also discourage network operators from enabling the multicast functionality. The alternative is application-layer multicast, which shifts that functionality to hosts: packet duplication, forwarding, and the management of distribution trees, illustrated in Fig. 10.7, are all accomplished in software at the end points. Application-layer multicast is not as efficient as network-layer multicast in two respects: some packets may end up traveling through more hops than in network-layer multicast, and some physical links may carry duplicate packets. Two other important factors that need to be considered when assessing the merits of an application-layer multicast protocol are the join latency and the amount of control-traffic overhead needed to construct and maintain the distribution tree. Various application-layer multicast protocols have been reported to be applicable to media transport. End System Multicast (ESM) uses the Narada protocol [24] to deliver video streams to viewers around the globe. A loss-resilient variant of the NICE protocol with probabilistic loss recovery was studied for video delivery over RTP in [25]. In [26], Setton et al. present results for a video streaming system using application-layer multicast, which introduces
Fig. 10.7. A sample distribution tree for a small multicast network
a rate-distortion optimized approach to upstream bandwidth allocation at the peers. Moreover, Hosseini and Georganas [27] have demonstrated a 3D video conferencing system utilizing an application-layer multicast protocol, and introduced the awareness-driven video concept to overcome the upstream bandwidth limitations of multi-user video conferencing applications by delivering the videos of only some of the users, which is essentially parallel to the selective-delivery idea. In [28], McCanne et al. show that the multicasting principle can be further adapted to the transmission of multimedia data by multicasting scalable multimedia data in multiple layers and giving the end points control over which layers to receive. This concept can readily be adapted to the transport of multi-view video streams: each view in the multi-view video can be assigned to a different layer, and viewers can choose which views they need to receive by joining the corresponding multicast groups. A network-layer multicast solution has been proposed for the efficient transport of dynamic light fields [29, 30], where each view is streamed to a different IP-multicast address, and a viewer's client joins the appropriate multicast groups to receive only the 3D information relevant to its current viewpoint. The set of selected videos changes in real time as the user's viewpoint changes because of head or eye movements. Techniques for reducing viewing-angle jumps during fast viewpoint changes have been investigated. The performance of the approach has been studied through network experiments.

10.4.5 Packet Loss Protection and Error Concealment

Congestion is the main cause of packet losses over the wired Internet. In contrast to the wired backbone, the capacity of the wireless channel is fundamentally limited by the available bandwidth of the radio spectrum and by various types of noise and interference, which lead to bit errors. Most network protocols discard packets with bit errors, thus translating bit errors into packet losses. Therefore, the wireless channel is the "weakest link" of future multimedia networks and, hence, requires special attention, especially when mobility gives rise to fading and error bursts. In particular, joint source and channel coding techniques have been developed for the efficient transmission of video streams over packet erasure channels, in both wired and wireless networks. Advanced channel coding techniques (such as Reed-Solomon, Turbo, and LDPC codes), in conjunction with schemes that add an unequal amount of redundancy to the data according to its importance, are used to effectively protect visual information from packet losses originating from congestion and/or bit errors over the wireless channel. In the following, we review robust channel transmission techniques, robust encoding techniques for robust transmission (e.g., multiple description coding), and error concealment techniques at the client (receiver) side.

Robust Multi-View Video Transmission Techniques: Recently, a monoscopic video coding scheme based on macroblock classification and unequal error protection of an H.264/AVC stream has been developed in [31]. Prior to
10 Efficient Transport of 3DTV
363
transmission, macroblocks are classified into three slice groups by examining their contribution to video quality. Since the transmission scenarios are over packet networks facing moderate to high packet loss rates, RS codes are used for channel protection. The RS protection level is selected for each slice group using a channel rate allocation algorithm based on dynamic programming techniques. This method is the first to utilize the explicit mode of H.264/AVC flexible macroblock ordering (FMO) in conjunction with channel coding techniques. The performance gain is attributed to the more efficient data organization, which allows better error concealment without sacrificing coding performance, and to the finer protection of slice groups arising from the unequal error protection strategy. The use of LT codes, which exhibit low encoding and decoding complexity and can adapt to the erasure rate of packet networks due to their rateless property, seems promising. The transmission of multi-view video encoded streams over packet erasure networks is examined in [32]. The proposed scheme employs a fully compatible H.264/AVC multi-view video codec. Macroblock classification into unequally important slice groups is achieved using the Flexible Macroblock Ordering (FMO) tool of H.264/AVC. Systematic LT codes are used for error protection due to their low complexity and advanced performance. The optimal slice grouping and channel rate allocation are determined by an iterative optimization algorithm based on dynamic programming. Experimental results clearly demonstrate that the method is promising. Stereoscopic video streaming using forward error correction (FEC) techniques is examined in [33]. Since packets are transmitted through IP networks, packetization is designed to recover from packet losses rather than from bit errors. Although the encoder is not layered, packets are classified into a number of layers depending on their impact on the overall 3D perception. Systematic LT and RS codes are examined for their performance.

Multiple-Description Coding of Stereo Video [34]: Two multiple description schemes for coding of stereoscopic video have been presented in [34]. The SS-MDC scheme exploits spatial scaling of one view. In the case of one channel failure, SS-MDC can reconstruct the stereoscopic video with one view low-pass filtered. SS-MDC can achieve low redundancy (less than 10%) for video sequences with lower inter-view correlation. The MS-MDC method is based on multi-state coding and is beneficial for video sequences with higher inter-view correlation. The encoder can switch between these two methods depending on the characteristics of the video.

Error Concealment at the Client: When dealing with low-bitrate videos, packet losses may lead to the loss of an entire frame of the video. Several studies exist in the literature on frame loss concealment algorithms for monoscopic video, but these methods may not be directly applicable to stereoscopic video. The authors of [35] propose a full-frame loss concealment algorithm for stereoscopic sequences. The proposed method uses the redundancy between the two views and previously decoded frames to estimate the lost frame. The results show that the proposed algorithm outperforms the
364
A. Murat Tekalp and M. Reha Civanlar
monoscopic methods when those methods are applied to the same view of a simulcast-coded stereo sequence.
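As a toy illustration of the FEC principle discussed in this section, the sketch below protects a group of equal-length packets with a single XOR parity packet, which can recover exactly one lost packet per group; practical systems such as [31, 32, 33] use Reed-Solomon or LT codes, which tolerate multiple losses and support unequal protection.

    def xor_parity(packets):
        # One parity packet over equal-length packets: a toy stand-in for
        # the RS/LT codes of [31, 32, 33]; it recovers at most one loss.
        parity = bytearray(len(packets[0]))
        for pkt in packets:
            for i, b in enumerate(pkt):
                parity[i] ^= b
        return bytes(parity)

    def recover_one(survivors, parity):
        # XOR the parity with all surviving packets to rebuild the lost one.
        lost = bytearray(parity)
        for pkt in survivors:
            for i, b in enumerate(pkt):
                lost[i] ^= b
        return bytes(lost)

    group = [b"left-view-frame1", b"rght-view-frame1", b"left-view-frame2"]
    p = xor_parity(group)
    assert recover_one([group[0], group[2]], p) == group[1]  # packet 1 lost

Unequal error protection then amounts to choosing how many such parity (or RS/LT) symbols to spend on each slice group or layer according to its importance.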
10.5 A Test-Bed for Unicast Stereo Video Streaming

An end-to-end prototype system for unicast streaming of stereo video over UDP has recently been developed [36, 37]. A block diagram of the prototype system is shown in Fig. 10.8. Multiple clients have been developed by modifying the VideoLAN client for different 3D displays. The prototype system currently operates over a LAN with no packet losses. Packet loss resilience and concealment techniques will be supported in the next release of the system. The server employs the RTP/UDP/IP protocol stack and can serve multiple clients simultaneously. The Session Description Protocol (SDP) is used to ensure interoperability with the clients. We assume that the encoded stereo bitstream consists of NAL units of both the left and right views. These NAL units are packetized for independent streaming over two separate channels using the Real-time Transport Protocol (RTP). The packet format is based on the RTP payload format for H.264 video. Three packetization modes are defined in this payload format. The Single NAL Unit mode and the Non-Interleaved mode are intended for low-delay applications. The FU-A (fragmentation unit without decoding order number) packetization structure is used to transfer NAL units whose sizes exceed the network MTU. These units are fragmented in the application layer, instead of relying on IP-layer fragmentation. Other packets with smaller sizes are sent as Single NAL Unit packets. The Session Description Protocol (SDP) is used to achieve interoperability between the stereo video server and the clients.
Fig. 10.8. Block diagram of the end-to-end stereoscopic video streaming test-bed (acquisition, encoder, video server, transmission, MVC/H.264 decoders, and display)
An additional session attribute is defined in order to specify the stereo data and to indicate which channel is the left and which one is the right:

    a=view : mono
    a=view : stereo <port-Left> <port-Right>

where "mono" and "stereo" specify the view type and the "<port>" pair provides access information for the corresponding views.
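For illustration, a hypothetical SDP session description using this attribute might look as follows; the addresses, ports, and payload number are made-up values, and only the a=view attribute itself comes from the test-bed:

    v=0
    o=3dtv-server 2890844526 2890842807 IN IP4 192.0.2.10
    s=Stereo Video Streaming Demo
    c=IN IP4 192.0.2.10
    t=0 0
    m=video 5000 RTP/AVP 96
    a=rtpmap:96 H264/90000
    a=view : stereo 5000 5002

A monocular client can simply ignore the a=view attribute and the second port, which is how backwards compatibility is preserved.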
Three clients for different types of display systems have been implemented: i) Client-1 supports an in-house polarized 3D projection display system; ii) Client-2 supports the autostereoscopic Sharp 3D laptop; iii) Client-3 supports a monocular display to demonstrate backwards compatibility. For all client implementations, the open-source software VideoLAN Client (VLC) has been modified. VLC is a highly portable multimedia player for various audio and video formats and streaming protocols. The modified VLC handles packets of the left and right views using two separate threads. NAL units received in the RTP packet payloads are sent directly to the decoder after de-packetization. Any MVC-compatible decoder can be used at the client. The decoder decodes and sends the decoded pictures to the video output modules. The video output units visualize the left and right frames in a synchronized manner by using the time information in the RTP timestamps. Finally, the in-house 3D projection display system uses a pair of Sharp MB-70X projectors, as shown in Fig. 10.9. Light from one of the projectors is polarized in the clockwise direction and light from the other projector in the counter-clockwise direction using circular polarization filters. Both projectors are precisely aligned to project onto a special silver screen covered with a neutral grey reflective dielectric material, which preserves the polarization of light during reflection. The users wear glasses with filters matching those of the projectors, ensuring that light from each projector is seen by only one eye. A single high-end PC with two display outputs drives the projectors using the extended-desktop feature. This setup results in a virtual desktop of 2048x768 pixels, with each projector displaying one half of the extended desktop at its 1024x768 native resolution. Thus, the left and right videos can simply be shown on the left and right halves of the extended desktop, such that they exactly overlap on the silver screen.
10.6 Conclusions and Future Directions

There appear to be two main directions for the transport of 3DTV: i) over the existing DVB infrastructure, and ii) over the Internet Protocol (IP). For the former, an MPEG standard using the so-called video-plus-depth encoded representation inside the MPEG-2 transport stream has been finalized. For the latter, a multitude of multi-view encoding and streaming strategies using RTP/UDP/IP or RTP/DCCP/IP can be considered; these constitute the main subject of this chapter.
Fig. 10.9. The client with polarized stereo projection display
Video streaming architectures can be classified as i) server unicasting to one or more clients, ii) server multicasting to several clients, iii) peer-to-peer (P2P) unicast distribution, where each peer forwards packets to another peer, and iv) P2P multicasting, where each peer forwards packets to several other peers. Multicasting protocols can be supported at the network layer or the application layer. The main current and future research issues are: i) determination of the best video encoding configuration for each streaming strategy – multi-view video encoding methods provide some compression efficiency gain at the expense of creating dependencies between views that hinder random access to views; ii) determination of the best rate adaptation method – here, adaptation refers to adaptation of the rate of each view as well as inter-view rate allocation depending on the available network rate and the video content, and to adaptation of the number and quality of transmitted views depending on the available network rate, the user's display technology, and the desired viewpoint; iii) packet-loss-resilient video encoding and streaming strategies, as well as better error concealment methods at the receiver; and iv) the best peer-to-peer multicasting design methods, including topology discovery, topology maintenance, forwarding techniques, exploitation of path diversity, methods for enticing peers to send data and to stay connected, and the use of dedicated nodes as relays.
References

1. A. Smolic, P. Merkle, K. Müller, C. Fehn, P. Kauff, and T. Wiegand, "Compression of multi-view video and associated data," in Three-Dimensional Television: Capture, Transmission, and Display, eds. H. Ozaktas and L. Onural, Springer, New York, 2007 (this book).
2. A. Smolic, K. Mueller, P. Merkle, C. Fehn, P. Kauff, P. Eisert, and T. Wiegand, "3D video and free viewpoint video – Technologies, applications and MPEG standards," Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Toronto, Ontario, Canada, July 2006.
3. C. Fehn, N. Atzpadin, M. Müller, O. Schreer, A. Smolic, R. Tanger, and P. Kauff, "An advanced 3DTV concept providing interoperability and scalability for a wide range of multi-baseline geometries," Proceedings of IEEE International Conference on Image Processing (ICIP), Atlanta, GA, USA, pp. 2961–2964, Oct. 2006.
4. C. Fehn, P. Kauff, M. Op de Beeck, F. Ernst, W. IJsselsteijn, M. Pollefeys, L. Van Gool, E. Ofek, and I. Sexton, "An evolutionary and optimised approach on 3D-TV," Proceedings of IBC 2002, International Broadcast Convention, Amsterdam, Netherlands, Sept. 2002.
5. ITU-T Rec. H.264, Advanced video coding for generic audiovisual services, Mar. 2005. http://www.itu.int/rec/T-REC-H.264
6. JM Reference Software, version JM 11.0, Aug. 2006. http://iphome.hhi.de/suehring/tml/
7. Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, "Scalable video coding – Working draft 1," Joint Video Team, Doc. JVT-N020, Hong Kong, China, Jan. 2005.
8. Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, "Joint scalable video model JSVM-4," Joint Video Team, Doc. JVT-Q202, Hong Kong, China, Oct. 2005.
9. Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, "Joint Multiview Video Model (JMVM) 1.0," Joint Video Team, Doc. JVT-T208, Hong Kong, China, Jul. 2006.
10. N. Ozbek and A. M. Tekalp, "Scalable multi-view video coding for interactive 3DTV," Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Toronto, Canada, July 2006.
11. M. Droese, C. Clemens, and T. Sikora, "Extending single-view scalable video coding to multi-view based on H.264/AVC," IEEE International Conference on Image Processing (ICIP), Atlanta, GA, USA, Oct. 2006.
12. E. Kurutepe, MS Thesis, Koc University, Sept. 2006.
13. ISO/IEC JTC 1/SC 29/WG 11, FPDAM of ISO/IEC 13818-1:200X/AMD 2 (Carriage of Auxiliary Video Data), eds. J. van der Meer and A. Bourge, WG 11 Document N8094, Klagenfurt, Austria, July 2006.
14. ISO/IEC JTC 1/SC 29/WG 11, Study of ISO/IEC FCD 23002-3: Representation of Auxiliary Video and Supplemental Information, eds. A. Bourge and C. Fehn, WG 11 Document N8482, Hangzhou, China, Oct. 2006.
15. C. Fehn, "3DTV broadcasting," in 3D Video Communication, eds. O. Schreer, P. Kauff, and T. Sikora, Wiley, New York, 2005.
16. H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," IETF, RFC 3550, Jul. 2003.
17. E. Kohler, M. Handley, and S. Floyd, "Datagram Congestion Control Protocol (DCCP)," IETF, RFC 4340, Mar. 2006.
18. S. Floyd and E. Kohler, "Profile for Datagram Congestion Control Protocol (DCCP) Congestion Control ID 2: TCP-like Congestion Control," IETF, RFC 4341, Mar. 2006.
19. S. Floyd, E. Kohler, and J. Padhye, "Profile for DCCP congestion control ID 3: TCP-Friendly Rate Control (TFRC)," IETF, RFC 4342, Mar. 2006.
20. L. B. Stelmach, W. J. Tam, D. Meegan, and A. Vincent, "Stereo image quality: effects of mixed spatio-temporal resolution," IEEE Trans. Circ. Syst. Video Technol., vol. 10, no. 2, pp. 188–193, 2000.
21. A. Aksay, C. Bilen, E. Kurutepe, T. Ozcelebi, G. Bozdagi-Akar, M. R. Civanlar, and A. M. Tekalp, "Temporal and spatial scaling for stereoscopic video compression," Proceedings of EUSIPCO, Florence, Italy, Sept. 2006.
22. N. Ozbek and A. M. Tekalp, "Content-aware bit allocation in scalable multi-view video coding," International Workshop on MRCS (LNCS 4105), Istanbul, Sept. 2006.
23. E. Kurutepe, M. R. Civanlar, and A. M. Tekalp, "Client-driven selective streaming of multi-view video for interactive 3DTV," to appear in IEEE Transactions on CSVT, Dec. 2006.
24. Y.-H. Chu, S. G. Rao, and H. Zhang, "A case for end system multicast," in SIGMETRICS, ACM Press, New York, pp. 1–12, 2000.
25. S. Banerjee, S. Lee, R. Braud, B. Battacharjee, and A. Srinivasan, "Scalable resilient media streaming," Proceedings of International Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 4–9, 2004.
26. E. Setton, J. Noh, and B. Girod, "Rate-distortion optimized video peer-to-peer multicast streaming," Proceedings of ACM Workshop on Advances in Peer-to-Peer Multimedia Streaming, ACM Press, New York, pp. 39–48, 2005.
27. M. Hosseini and N. D. Georganas, "Design of a multi-sender 3D videoconferencing application over an end system multicast protocol," Proceedings of ACM International Conference on Multimedia, ACM Press, New York, pp. 480–489, 2003.
28. S. McCanne, V. Jacobson, and M. Vetterli, "Receiver-driven layered multicast," in SIGCOMM, ACM Press, New York, pp. 117–130, 1996.
29. E. Kurutepe, M. R. Civanlar, and A. M. Tekalp, "A receiver-driven multicasting framework for 3DTV transmission," Proceedings of EUSIPCO, Antalya, Turkey, Sept. 2005.
30. E. Kurutepe, M. R. Civanlar, and A. M. Tekalp, "Interactive transport of multi-view videos for 3DTV applications," Proceedings of Packet Video, Hangzhou, China, 2006.
31. N. Thomos, S. Argyropoulos, N. V. Boulgouris, and M. G. Strintzis, "Robust transmission of H.264/AVC streams using adaptive group slicing and unequal error protection," EURASIP Journal on Applied Signal Processing, Special Issue on Advanced Video Technologies and Applications for H.264/AVC and Beyond, Feb. 2006.
32. S. Argyropoulos, A. S. Tan, N. Thomos, E. Arikan, and M. G. Strintzis, "Robust transmission of multi-view video streams using flexible macroblock ordering and systematic LT codes," Proceedings of 3DTV-CON, Kos Island, Greece, May 2007.
33. A. S. Tan, A. Aksay, C. Bilen, G. Bozdagi-Akar, and E. Arikan, "Error resilient layered stereoscopic video streaming," Proceedings of 3DTV-CON, Kos Island, Greece, May 2007.
34. A. Norkin, A. Aksay, C. Bilen, G. B. Akar, A. Gotchev, and J. Astola, "Schemes for multiple description coding of stereoscopic video," Proceedings of MRCS 2006, Istanbul, Turkey; Lecture Notes in Computer Science, vol. 4105, pp. 730–737, Springer-Verlag, Sept. 2006.
35. C. Bilen, A. Aksay, and G. Bozdagi-Akar, "Motion and disparity aided stereoscopic full frame loss concealment method," IEEE SIU 2007, Eskisehir, Turkey, June 2007.
36. S. Pehlivan, MS Thesis, Koc University, Aug. 2006.
37. A. Aksay, S. Pehlivan, E. Kurutepe, C. Bilen, T. Ozcelebi, G. Bozdagi-Akar, M. R. Civanlar, and A. M. Tekalp, "End-to-end stereoscopic video streaming with content-adaptive rate and format control," Signal Processing: Image Communication, February 2007.
11 Multiple Description Coding and its Relevance to 3DTV

Andrey Norkin1, M. Oguz Bici2, Anil Aksay2, Cagdas Bilen2, Atanas Gotchev1, Gozde B. Akar2, Karen Egiazarian1, and Jaakko Astola1

1 Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
2 Middle East Technical University, Ankara, Turkey
11.1 Motivation for Multiple Description Coding

3D visual scenes may be captured by stereoscopic or multi-view camera setups. The captured multi-view video can be compressed directly or converted to a more abstract 3D data representation, such as 3D dynamic meshes (mesh sequences), which is also subject to efficient compression [107, 108]. In any case, the efficiently compressed 3D visual data has to be transmitted over communications channels, such as wireless channels or best-effort networks. This raises the problem of error protection, since most of these channels are error-prone. Contemporary multi-view video compression methods utilize comprehensive temporal and inter-view prediction structures, and therefore channel errors occurring in one view can propagate not only through the subsequent frames of the same view but also to frames of other views. A common approach to error protection considers it a pure channel problem, separate from the source compression problem. This approach is based on Shannon's separation principle, which states that the source and channel coding tasks can be carried out independently with no loss of efficiency. Following this approach, raw source sequences are processed in a way that reduces the data rate as much as possible. Reliable transmission of the bitstream to the receiver is then provided by a channel coder. The transport mechanism has to be perfect, since a single error in the compressed bitstream might severely damage the reconstructed signal. According to the noisy channel coding theorem, near-perfect transmission can be achieved if the data rate does not exceed the channel capacity. This, however, is hardly achievable in practical cases. A widely used error-protection method operates at the transport layer of the OSI model [40], e.g., via the TCP protocol. There, error-free transmission is achieved by retransmitting packets that have been lost or corrupted. A problem with such a mechanism is that it causes delays and thus requires larger memory buffers. The delay is at least a packet round-trip time. A second
problem arises when packet losses are caused by network congestion. Trying to retransmit lost packets generates extra data traffic and makes the network even more congested. Furthermore, retransmissions are virtually impossible in digital broadcasting. During broadcasting, the loss of even a single packet may cause the transmitter to receive multiple retransmission requests, an effect called a feedback implosion. Another approach to reliable transmission over lossy channels is so-called forward error correction (FEC). The compressed bitstream data is distributed between packets, which are protected by block channel codes. The data from lost packets can then be reconstructed from the received packets. The block code length plays an important role in this approach. In terms of efficiency, long blocks are preferred, since short-length blocks generate bitstreams with a relatively large number of additional symbols. The law of large numbers also dictates the choice of longer blocks, since the number of errors is predicted more easily for long sequences. Just as with the previous approach based on retransmissions, a problem of delays and large memory buffers exists, this time caused by the long blocks. The two above-mentioned approaches tolerate no errors. They assume that all transmitted data is correctly received and, consequently, they spend a considerable amount of resources to guarantee this. These resources grow with the amount of data to be transmitted, and multi-view video is exactly such a case of a growing amount of data compared to single-view video. Pure channel coding approaches might therefore not be quite feasible in such cases. As an alternative, one can tolerate channel losses. Assuming that not all data sent reaches the decoder, one can concentrate on ensuring efficient decoding of only the correctly received data. In such a case, one needs to change the source coding accordingly and, more broadly, to consider the error protection problem as a joint source-channel problem. Multiple description coding (MDC) is a coding approach for communicating a source over an unreliable channel. The source is encoded into several descriptions, which are sent to the decoder independently over different channels. A certain amount of controlled redundancy is added to the compressed descriptions to make them useful even when received alone. The decoder can reconstruct the source from one description with low, yet acceptable, quality. The more descriptions received, the higher the reconstruction quality. Usually, the descriptions are balanced; that is, the descriptions are of equal rate and importance. In that case, the reconstruction quality depends only on the number of received descriptions and not on which particular descriptions are received. The simplest MDC framework (Fig. 11.1) considers communication over two independent channels. The encoder codes the source X into two redundant and mutually refining descriptions, which are sent separately over Channel 1 and Channel 2. There are three decoders at the receiver side. The central decoder (Decoder 0) outputs an estimate X̂0 of the source X based on the two received channels, while the side decoders (Decoder 1 and Decoder 2) receive one channel each and output the estimates X̂1 and X̂2.
Fig. 11.1. Scenario for MD coding with two channels and three receivers
one channel each and output estimates X̂1 and X̂2. The channels have two states: "on" (working) or "off" (failing). The decoder thus receives information from either one or two channels, depending on their states. When transmitted over a packet network, each description is placed in a separate packet protected with a block code. A packet at the receiver is considered either correctly received or lost. Increasing the number of descriptions increases the probability that at least one packet (i.e. one description) reaches the decoder. However, it also increases the coding redundancy and the decoder complexity.

In summary, MDC is an attractive coding approach, as it provides reliable source reconstruction using only part of the data sent to the decoder and requires no prioritized network transmission mechanisms. MDC is especially advantageous in short-delay media streaming scenarios such as video conferencing, and in broadcast over unreliable channels, where it provides acceptable reconstruction quality and prevents feedback implosions in the case of numerous packet losses. These features can be even more important when transmitting 3DTV content, characterized by a larger amount of data. In addition, content such as stereoscopic video or multi-view video possesses higher degrees of correlation, which can be successfully utilized in MDC schemes.

The rest of this chapter is organized as follows. In Sect. 11.2, we start with some theoretical aspects concerning rate-distortion (RD) theory and achievable RD bounds for MDC schemes. Then, in Sect. 11.3, we briefly survey MDC approaches for image coding. In Sect. 11.4, we survey MDC approaches for video coding; we need them as building blocks for MDC schemes for 3D visual data. In Sect. 11.5, results on MDC of stereoscopic video are presented, and Sect. 11.6 introduces results on MDC of 3D meshes.
11.2 Multiple Description Rate Distortion Region

Multiple description coding has its roots in information theory. A number of studies have addressed the information-theoretic aspects of MDC and studied bounds on achievable rates and distortions, thus providing tools for optimizing practical MDC schemes. In this section, we review the information-theoretic works on MDC to form a basis for better understanding the
problems discussed in the subsequent sections. The reader is also referred to the excellent survey of MDC by Goyal [32].

In single description coding, a rate-distortion pair (R, D) is called achievable if there exists a source code with rate R and distortion D. The rate distortion (RD) region is then defined as the closure of the set of all achievable rate-distortion pairs (R, D) [32]. For the analytical definition of the R(D) function, see [10]. Theoretical works on MDC consider generalizations of the RD region to the case of multiple descriptions. Most of these studies consider the classical multiple description (MD) case with two channels and three decoders (Fig. 11.1). Some papers describe achievable regions in the case of many descriptions. Unfortunately, even for the two-channel case, there is no general result for the multiple description RD region yet. Nevertheless, the MD RD region has been found for several special cases of interest.

A classical MDC scenario is shown in Fig. 11.1. Consider a sequence of i.i.d. random variables X1, X2, ..., XN. A variable X = (X1, X2, ..., XN) is coded into two descriptions, Description 1 and Description 2, with rates R1 and R2, respectively. The descriptions are independently sent over Channel 1 and Channel 2. The receiving side has three decoders. Decoder 1 gets the information from Channel 1 only, while Decoder 2 gets the information from Channel 2. Decoder 0 gets the information from both channels. Decoder 0 is also called the central decoder, while Decoder 1 and Decoder 2 are called side decoders. Decoder i estimates the variable X as X̂i, where i = 0, 1, 2. The distortion measures d1, d2, and d0 are given, and the distortions corresponding to each description are

$$d_i(x^N, \hat{x}_i^N) = \frac{1}{N} \sum_{k=1}^{N} E[d_i(x_k, \hat{x}_{ik})], \quad i = 0, 1, 2, \qquad (11.1)$$

where $x^N = (x_1, x_2, \ldots, x_N)$ and $\hat{x}_i^N = (\hat{x}_{i1}, \hat{x}_{i2}, \ldots, \hat{x}_{iN})$.

Unlike the single description case, the rate distortion relations for multiple descriptions have not been defined in general in terms of information-theoretic quantities such as entropy, mutual information, etc. The MD region can be defined as the set of all achievable quintuples (R1, R2, D0, D1, D2).

11.2.1 Achievable Rates for Multiple Descriptions

Consider the classical MDC scenario with two channels and three decoders (Fig. 11.1). A pair (R1, R2) is called an achievable rate for a given distortion D = (D1, D2, D0) if there exists a sequence of pairs of descriptions f1(x^N), f2(x^N) with rates R1 and R2 and reconstruction functions x̂0^N(f1, f2), x̂1^N(f1), x̂2^N(f2) such that, for sufficiently large N, the following inequality is met [26]

$$E\left[\frac{1}{N} \sum_{k=1}^{N} d_i(x_k, \hat{x}_{ik})\right] \leq D_i, \quad i = 0, 1, 2. \qquad (11.2)$$
The rate distortion region R(D) for distortion D = (D1, D2, D0) is defined as the closure of the set of achievable rate pairs (R1, R2) satisfying (11.2). An achievable rate region is any subset of the rate distortion region.

An achievable rate region has been defined by El Gamal and Cover [26] as follows. For a sequence of i.i.d. finite-alphabet random variables X1, X2, ... with probability function p(x) and distortion measures di(·,·), an achievable rate region for distortion D = (D1, D2, D0) is given by the convex hull of all (R1, R2) satisfying

$$\begin{aligned} R_1 &> I(X; \hat{X}_1), \\ R_2 &> I(X; \hat{X}_2), \\ R_1 + R_2 &> I(X; \hat{X}_1, \hat{X}_2, \hat{X}_0) + I(\hat{X}_1; \hat{X}_2), \end{aligned} \qquad (11.3)$$

where I(·;·) is the average mutual information, for some probability mass function $p(x)\,p(\hat{x}_1, \hat{x}_2, \hat{x}_0 \mid x)$ such that

$$E[d_i(X, \hat{X}_i)] \leq D_i, \quad i = 0, 1, 2. \qquad (11.4)$$
The achievable rate region defined by (11.3) gives sufficient conditions for the quintuple (R1, R2, D0, D1, D2) to be in the MD rate distortion region. To know the rate distortion region completely, one would need to find necessary and sufficient conditions. This has been done for the interesting case of "no excess rate sum", i.e. R1 + R2 = R(D0). Ahlswede has shown [2] that for the no excess rate case the El Gamal–Cover conditions (11.3) are necessary as well as sufficient. Thus, for this particular case, these conditions define the MD rate distortion region.

It had been conjectured that the El Gamal–Cover region (11.3) is tight in general and provides the complete RD region. By a counterexample, Zhang and Berger have shown that these conditions are sometimes not tight when R1 + R2 > R(D0) [106]. Thus, in general, (11.3) does not provide the complete RD region for multiple descriptions and should be considered just an inner bound. A work by Venkataramani et al. [90] has addressed the L-channel MDC problem, providing a generalization of the El Gamal–Cover theorem to L channels.

11.2.2 Rate Distortion Region for Gaussian Source and Squared Error Distortion

The MD region has been found for the special case of a memoryless Gaussian source and mean squared error distortion. For this source and distortion measure, Ozarow has shown that the El Gamal–Cover achievable rate region is also the complete RD region [64]. As known from rate-distortion theory [10], the rate distortion function of any memoryless source can be bounded by the rate distortion function of a Gaussian source with the same variance. Therefore, knowledge of the RD region of a Gaussian source is rather important and can be utilized in practical MDC designs.
Consider the source in Fig. 11.1 as a sequence of i.i.d. random variables {Xk} having a Gaussian distribution with variance σ². The distortion measure for all three decoders is squared error, di(x, x̂i) = (x − x̂i)², i = 0, 1, 2. The distortion-rate function D(R) of a Gaussian source is [10]

$$D(R) = \sigma^2 2^{-2R}. \qquad (11.5)$$

The obvious outer bound is [64]

$$\begin{aligned} D_i &\geq D(R_i) = \sigma^2 2^{-2R_i}, \quad i = 1, 2, \\ D_0 &\geq \sigma^2 2^{-2(R_1+R_2)}. \end{aligned} \qquad (11.6)$$
The Shannon’s source coding theorem implies that the RD function can be approached arbitrarily close when the codeblock length N approaches infinity [10]. If this held for the MD case, then (11.6) would define the achievable region. However, Ozarow has shown [64] that the actual achievable set of quintuples (R1 , R2 , D1 , D2 , D0 ) for Fig. 11.1 is formed by points satisfying D1 ≥ σ 2 2−2R1 , D2 ≥ σ 2 2−2R2 , D0 ≥ σ 2 2−2(R1 +R2 )
1 √ √ , 1 − ( Π − Δ)2
(11.7)
where Π = (1 − D1/σ²)(1 − D2/σ²) and Δ = D1D2/σ⁴ − 2^(−2(R1+R2)). Thus, the region defined by (11.7) is actually the RD region.

The last term in the expression for D0 shows that there is a penalty on D0 for small values of the distortions D1 and D2. The lowest central distortion is achieved when Π ≅ Δ and therefore D1 + D2 ≅ σ²(1 + 2^(−2(R1+R2))). Consequently, if either D1 or D2 is small, the other side distortion must be near σ²; that is, the second description is almost useless by itself [64]. Conversely, for R1 = R2 and D1 = D2 = D, the central distortion is not better than half of the side distortion. For a small value of D, this is far worse than the value D0 ≥ D²/σ² given by the bound (11.6). In other words, two descriptions together are twice as good as one if they do not need to be individually very good. On the contrary, if the side distortion constraints are severe, then two descriptions will not work better than one, because they are in fact the same description [64]. Venkataramani et al. have generalized Ozarow's outer bound to the case of L channels [90].

11.2.3 Inner and Outer Bounds for Squared Error Distortion

In the case of squared error distortion, inner and outer bounds for the MD region have been established by Zamir [105]. A sketch of the inner and outer bounds is given in Fig. 11.2. The inner bound is shown to be Ozarow's MD region (11.7) for the Gaussian source.
Fig. 11.2. Inner and outer bounds for the rate region, sketched in the (R1, R2) plane: rates above the bounds are achievable, rates below are non-achievable
Let R_X(D0, D1, D2) be the set of achievable rate pairs (R1, R2) at distortions (D0, D1, D2). Similarly, let D_X(R1, R2) be the set of achievable distortion triplets (D0, D1, D2) for a given rate pair (R1, R2). Let X be a random variable with a probability density function f(x) whose support is a set 𝒳. The outer bound can be found for any real memoryless source X with finite differential entropy: D_X(R1, R2) ⊆ D*(Px, R1, R2), where $P_x = 2^{2h(X)}/(2\pi e)$ is the entropy power of the source, $h(X) = -\int_{\mathcal{X}} f(x) \log_2 f(x)\,dx$ is the differential entropy, and D*(Px, R1, R2) is the set of all triplets (D1, D2, D0) satisfying

$$\begin{aligned} D_1 &\geq P_x 2^{-2R_1}, \\ D_2 &\geq P_x 2^{-2R_2}, \\ D_0 &\geq P_x 2^{-2(R_1+R_2)} \cdot \frac{1}{1 - \left(\sqrt{\tilde{\Pi}} - \sqrt{\tilde{\Delta}}\right)^2}, \end{aligned} \qquad (11.8)$$

where $\tilde{\Pi} = (1 - D_1/P_x)(1 - D_2/P_x)$ and $\tilde{\Delta} = D_1 D_2/P_x^2 - 2^{-2(R_1+R_2)}$. In [105], it was shown that

$$D^*(\sigma_x^2, R_1, R_2) \subseteq D_X(R_1, R_2) \subseteq D^*(P_x, R_1, R_2) \qquad (11.9)$$

or, conversely,

$$R^*(\sigma_x^2, D_0, D_1, D_2) \subseteq R_X(D_0, D_1, D_2) \subseteq R^*(P_x, D_0, D_1, D_2), \qquad (11.10)$$

where σx² and Px are the variance and the entropy power of X, and D*(σx², R1, R2) is Ozarow's MD distortion region for a Gaussian source given by (11.7). One can notice that (11.9) and (11.10) are actually the extension of Shannon's upper/lower bounds on the RD function to the case of multiple descriptions. It has also been shown that under high resolution conditions, i.e. D1, D2 → 0, the outer bound (the rightmost part of (11.9)) is asymptotically tight. We refer to [105] and [54] for more details.
11.2.4 Minimal Breakdown Degradation

A significant part of the MD literature is focused on the memoryless binary symmetric source (BSS) with the Hamming distortion measure. Consider the BSS as a sequence of i.i.d. random variables Xk taking the values 0 and 1 with probability 1/2. The average Hamming distortion between the source sequence x^N and its reconstruction x̂i^N is

$$d_H^N(x^N, \hat{x}_i^N) = \frac{1}{N} \sum_{k=1}^{N} d_H(x_k, \hat{x}_{ik}), \quad i = 0, 1, 2, \qquad (11.11)$$
where dH(x, x̂i) = 0 if x = x̂i and dH(x, x̂i) = 1 if x ≠ x̂i. The rates are R1 = R2 = 1/2 bits per symbol and the average expected central distortion is D0 = 0. The total rate R1 + R2 is the exact minimum rate needed to achieve zero central distortion. If one of the channels breaks down, the decoder has to estimate Xk based on the information from the other channel. The problem is to find all the achievable quintuples (R1, R2, D0, D1, D2) in the usual Shannon sense. This problem is called minimum breakdown degradation and has been treated by numerous authors [11, 100, 101, 102].

If one simply sends half of the bits on each channel, it is easy to see that the minimum side distortion the decoder can attain is D1 = D2 = 1/4. However, it has been shown that the minimum side distortion is lower, specifically D1 = D2 = (√2 − 1)/2 ≈ 0.207. Berger and Zhang have shown [11] that the bound (√2 − 1)/2 is tight for the case when the quintuples (R1, R2, D0, D1, D2) are achievable in the usual Shannon sense, i.e. D0 → 0 as the blocklength N → ∞. In this case, the result also coincides with the El Gamal–Cover achievable rate region. Nevertheless, when exact reconstruction in the central decoder is needed, the bound is not tight.

The works [1] and [106] study and generalize the results of Witsenhausen's hyperbola lower bound [100]. Ahlswede has proved in [1] that this bound is tight for the case of no excess rate and almost exact central reconstruction. It is also tight for the case of arbitrarily small excess rate and exact reconstruction. However, the bound is not tight for the case of exact central reconstruction and no excess rate.

11.2.5 Successive Refinement Problem

Successive refinement of information is a special case of the multiple description problem. In the successive refinement problem, Decoder 2 is removed from the MDC scheme, as shown in Fig. 11.3. Decoder 1 gets only the information from Channel 1 at rate R1. Decoder 0 is said to refine the information from Channel 1: it gets the information from both Channel 1 and Channel 2, with rates R1 and R2, respectively. The corresponding distortions are D1 and D0. It is said that we are successively refining a sequence of random variables {Xk} from distortion D1 to distortion D0 if the description of the source
11 Multiple Description Coding and its Relevance to 3DTV
379
Fig. 11.3. Successive refinement problem
is optimal at every stage, i.e. R1 = R(D1) and R = R1 + R2 = R(D0). The problem is said to be successively refinable in general if the successive refinement from distortion D1 to distortion D0 is achievable for every D1 ≥ D0 [27].

Successive refinement is not always possible. Equitz and Cover have shown [27] that the rate distortion problem is successively refinable if and only if the individual solutions of the rate distortion problem can be written as a Markov chain, i.e. one can write X̂1 → X̂0 → X as a Markov chain. Codes for successive refinement have a "tree structure", in which the coarse descriptions occur near the "root" of the tree and the finer descriptions are near the leaves [27]. Equitz and Cover have also shown that successive refinement is possible for Gaussian signals with squared-error distortion, for all finite-alphabet signals with Hamming distortion, and for Laplacian signals with absolute-error distortion.

As successive refinement is a special case of the MDC problem, one can use the achievable region of El Gamal and Cover (11.3) to find the achievable rates in the successive refinement problem. Moreover, Ahlswede [2] has shown that for the "no-excess rate" case (R1 + R2 = R(D0)) the conditions of El Gamal and Cover are necessary and sufficient. As successive refinement is a "no-excess rate" case, (11.3) gives the entire rate region [27].

The entire rate region for the successive refinement problem is given by the following relations. For a discrete memoryless source {Xk}, k = 1, 2, ..., with distribution p(x) and distortion measures d1 and d0, the quadruple (R1, R2, D1, D0) is achievable if and only if there exists a conditional distribution p(x̂1, x̂0 | x) such that the following four inequalities are satisfied [70]:

$$\begin{aligned} R_1 &\geq I(X; \hat{X}_1), \\ R_1 + R_2 &\geq I(X; \hat{X}_1, \hat{X}_0), \\ E[d_i(X, \hat{X}_i)] &\leq D_i, \quad i = 1, 0. \end{aligned} \qquad (11.12)$$
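As a quick worked instance of these relations for the Gaussian source of Sect. 11.2.2 (the numerical values are illustrative), using R(D) = (1/2) log2(σ²/D) from (11.5):

$$\sigma^2 = 1,\ D_1 = \tfrac{1}{4},\ D_0 = \tfrac{1}{64}: \quad R_1 = R(D_1) = \tfrac{1}{2}\log_2 4 = 1 \text{ bit}, \quad R_1 + R_2 = R(D_0) = \tfrac{1}{2}\log_2 64 = 3 \text{ bits},$$

so the refinement layer needs R2 = 2 bits per sample and, since the Gaussian source with squared-error distortion is successively refinable, no rate is wasted at either stage.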
11.2.6 Application of Rate Distortion Bounds

Finding the rate distortion region for the MD case is inherently more complicated than finding it for the single description case. Unlike the rate distortion function for the single description case, the rate distortion function for multiple descriptions has not been defined in general in terms of information-theoretic quantities.
The MD region is completely known only for the special case of a memoryless Gaussian source with squared error distortion. This special case is nevertheless quite important, as it bounds all other memoryless sources with the same variance. For this particular case, the rate distortion region coincides with the achievable region of El Gamal and Cover [26].

The results on multiple description RD bounds are used to evaluate the performance of MD codes, as demonstrated in Sect. 11.3. When optimizing the performance of an MDC technique, one can minimize the product D0D1. In many cases, a linear combination αD0 + βD1 of central and side distortions is a good performance measure. The weights α and β in the linear combination usually correspond to the probabilities of receiving both descriptions or only one description.
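For a concrete feel of how these bounds are used, the following minimal sketch evaluates Ozarow's region (11.7) and the weighted figure of merit above; the rate and distortion values, and the treatment of the regime where the penalty term is inactive, are illustrative assumptions rather than part of the chapter:

```python
import math

def ozarow_min_D0(R1, R2, D1, D2, var=1.0):
    """Smallest central distortion D0 permitted by Ozarow's region (11.7)
    for a memoryless Gaussian source of variance `var`, given side rates
    R1, R2 and side distortions D1, D2 with Di >= var * 2**(-2*Ri)."""
    base = var * 2.0 ** (-2 * (R1 + R2))
    Pi = (1 - D1 / var) * (1 - D2 / var)
    Delta = D1 * D2 / var**2 - 2.0 ** (-2 * (R1 + R2))
    if Delta <= 0:  # penalty term inactive; the plain bound (11.6) applies
        return base
    return base / (1.0 - (math.sqrt(Pi) - math.sqrt(Delta)) ** 2)

# Balanced example: R1 = R2 = 1 bit/sample, equal side distortions.
D_side = 0.4
D0 = ozarow_min_D0(1.0, 1.0, D_side, D_side)

# Weighted cost: alpha = P(both descriptions arrive), beta = P(only one).
p = 0.9
print(D0, p * D0 + (1 - p) * D_side)
```

Sweeping D_side traces the central/side trade-off: pushing the side distortion toward its single-description bound inflates the achievable D0, exactly the penalty discussed in Sect. 11.2.2.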
11.3 Multiple Description Coding of Images

A typical transform-based image coder consists of the following blocks: a transform, which maps the spatial-domain image representation to a transform domain for better decorrelation; a quantizer, which scans and quantizes the transform coefficients to achieve lossy compression; and an entropy coder, which removes the remaining statistical redundancy between the quantized coefficients in a lossless manner. When an MDC module is to be added to such a scheme to achieve channel error protection, a first and crucial problem is where to include it. MDC methods utilize various possibilities for adding controllable redundancy: some employ subsampling in the spatial or transform domain, while others employ transforms to create interleaved patterns to be distributed among descriptions, or special scanning and quantization of transform coefficients. In this section, we review the most common MDC approaches and practical existing schemes. Rather than being very comprehensive, our review introduces the principles of MDC of images and focuses on those approaches which are most relevant within the 3DTV context.

11.3.1 Multiple Description Scalar Quantization

A simple way to add error-protecting redundancy to the compressed image bitstream is to do so at the quantization stage, i.e. the stage where the loss of insignificant information occurs. This idea has been extensively developed in the works of Vaishampayan, who has suggested a theory of multiple description scalar quantizers [84], addressing the cases of fixed-rate quantization [84] and entropy-constrained quantization [87].

Multiple description scalar quantization (MDSQ) works as follows. Two side (coarse) quantizers with overlapping cells operate in parallel at the quantization stage. The quantized source can be reconstructed from the output of either quantizer with lower quality. When the outputs of the two quantizers are combined, they produce a higher quality reconstruction due to the resulting
smaller quantization cells. In a practical scheme, the encoder [84] first applies a regular scalar quantizer, mapping the input variable x to a quantization index I. Then, in a second step, an index assignment is applied, mapping each index I to a codeword index pair (i, j) in a codebook. Figure 11.4(a) [84] presents the index assignment matrix for the case of "staggered" index assignment. The cells of the quantizer corresponding to the index I are numbered in the matrix from 0 to 14. The row and column indices of the index assignment matrix form the index pair (i, j). Index i is included in Description 1, whereas index j is included in Description 2. The central decoder reconstructs the exact value of index I and the corresponding value X̂0. The side decoders estimate X as an expected value when one of the indices is fixed. Thus, the quality of the side reconstruction is determined by the number of diagonals in the index assignment matrix.

In Fig. 11.4(a), only 15 out of 64 cells in the index assignment matrix are occupied. Unoccupied cells constitute coding redundancy. Figure 11.4(b) shows an index assignment with three diagonals filled and lower redundancy. The highest redundancy is achieved when only the main diagonal of the index assignment matrix is filled. This corresponds to duplication of all data in both descriptions: the side distortions are equal to the central distortion, E[d0] = E[d1] = E[d2], and consequently the bitrate is doubled. If the index assignment matrix is full, there is no redundancy, resulting in high side distortions.

A high-rate analysis of MDSQ has been presented in [86], and the performance of MDSQ has been compared to Ozarow's rate-distortion bound [64] for squared error distortion and a memoryless Gaussian source (11.7). Comparing the optimal entropy-constrained quantizer with the theoretical bound, a 3.07 dB gap in the product of the average central and side distortions d0d1 has been identified [86]. It has been conjectured [76] that this gap is caused by the non-spherical form of the quantization cells. Therefore, the gap could be closed by constructing quantizers with more "spherical" cells; that is, cells with a smaller normalized second moment than a hypercube. Several solutions have been proposed, including trellis-coded quantization [44] and the multiple description lattice vector quantizer (MDLVQ).
Fig. 11.4. Index assignment: (a) Staggered quantization cells; (b) Higher spread quantization cells
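The two-diagonal ("staggered") assignment of Fig. 11.4(a) can be sketched in a few lines. This is an illustrative toy, not Vaishampayan's actual codebook construction; the reconstruction levels and the uniform prior used by the side decoder are assumptions:

```python
def encode(I):
    """Map central index I (0..14) to the staggered pair (i, j):
    cells fill the main diagonal and the first upper diagonal."""
    return I // 2, (I + 1) // 2

def central_decode(i, j):
    # The central decoder recovers I exactly: on these two diagonals, I = i + j.
    return i + j

def side_decode_1(i, levels):
    # Given only i, the index I is 2i or 2i+1; estimate X as the expected
    # reconstruction level (uniform prior assumed for this toy example).
    cands = [I for I in (2 * i, 2 * i + 1) if I < len(levels)]
    return sum(levels[I] for I in cands) / len(cands)

levels = [k / 15 for k in range(15)]   # toy reconstruction levels
i, j = encode(9)
print(i, j, central_decode(i, j))      # -> 4 5 9: exact central reconstruction
print(side_decode_1(i, levels))        # coarser estimate from Description 1 alone
```

Filling more diagonals (as in Fig. 11.4(b)) enlarges the set of candidate indices per row, lowering redundancy at the cost of a coarser side estimate.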
It has been shown [88] that MD vector quantizers are capable of closing the 3.07 dB gap as the vector dimension tends to infinity (N → ∞). An improvement of 0.3 dB is achieved in the two-dimensional case using an MD hexagonal lattice quantizer [76]. Originally, MDLVQ was limited to the balanced case (equal rates R1 = R2 and equal distortions D1 = D2). Diggavi et al. [24] generalized this method to asymmetric multiple description vector quantizers that cover the entire spectrum of the distortion profile, ranging from balanced to successively refinable descriptions. Further improvements of MDLVQ provide more operating points and a more flexible rate-distortion trade-off at the cost of a slight increase in complexity [33, 50]. A generalized MDVQ for more than two descriptions has also been designed [28].

One of the first multiple description image coders was proposed by Vaishampayan [85]. The idea of the coder is quite simple: multiple description scalar quantization is applied to the DCT coefficients of a JPEG coder. The indices obtained are entropy-coded separately in both descriptions, and the descriptions are sent to the destination in different packets. When only one description is received, a lower quality reconstruction is obtained; a higher quality reconstruction is obtained from both descriptions.

An MD image coder based on MDSQ has been proposed in [77, 78] by Servetto et al. In this coder, MDSQ is applied to each coefficient of the wavelet transform. Thus, two descriptions of the wavelet coefficients are created. Each description is then coded independently with a single description coder (e.g. SPIHT). Different subbands are coded with different MD scalar quantizers to achieve better redundancy allocation.

11.3.2 Multiple Description Transform Coding

Another MD approach adds redundancy immediately after the transform coding stage by means of a so-called pairwise correlating transform (PCT) [93]. The general framework is the following. First, the input signal is decorrelated using a proper transform (e.g. DCT). The resulting coefficients are ordered according to their variances and coupled into pairs. These pairs undergo a correlating transform, i.e. two uncorrelated coefficients at the PCT input give rise to two correlated coefficients at the PCT output. One transform coefficient is sent to Description 1 and the other is sent to Description 2. Through the explicitly added redundancy within the pairs of coefficients, a lost coefficient from a pair can be estimated from the received one. When both descriptions are received, the exact values of the variables can be determined by taking the inverse transform. The method has been extended to more general orthogonal [63] and non-orthogonal transforms [91, 94].

MDC using orthogonal correlating transforms was introduced in [63]. Consider at the input two independent Gaussian random variables A and B with variances σa² and σb², respectively, σa² > σb² [63]. The output random variables C and D with variances σc² and σd², respectively, are related to A and B by a unitary matrix T
via [C D]ᵗ = T [A B]ᵗ. The transform T controls the redundancy by varying the correlation between C and D. For example, T can be parametrized by an angular parameter, e.g. [63]

$$T = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix},$$

where the parameter θ relates to the amount of correlation introduced by T. After being correlated, the outputs C and D are quantized and entropy-coded.

Besides orthogonal correlating transforms, non-orthogonal transforms have also been developed [63]. A non-orthogonal transform is generally more efficient than an orthogonal one, as it can work over a larger redundancy interval. Further improvements of non-orthogonal transforms have been introduced in [94] and [95], where the optimal non-orthogonal transform, producing balanced rates and side distortions, has been derived. The basis vectors of the optimal transform have the same length and are rotated by the same angle in opposite directions from the axis corresponding to the variable with the higher variance. The optimal transform, giving equal side distortions and balanced description rates, has the following form [94]:

$$T = \begin{bmatrix} \sqrt{\dfrac{\cot\theta}{2}} & \sqrt{\dfrac{\tan\theta}{2}} \\[6pt] -\sqrt{\dfrac{\cot\theta}{2}} & \sqrt{\dfrac{\tan\theta}{2}} \end{bmatrix} \qquad (11.13)$$
The suggested non-orthogonal transform outperforms the orthogonal transform. The orthogonal and non-orthogonal transforms yield the same results only at two points (zero redundancy and maximum redundancy for orthogonal pairing) [63]. The point of maximum redundancy for the orthogonal transform is the rotation by the angle θ = π/4; for this rotation angle, the orthogonal and non-orthogonal transforms are in fact the same transform. A similar correlating transform was obtained independently by Goyal et al. [34, 35, 36]. Generalizing the work of Orchard et al. [63], a transform-based approach was developed for producing M descriptions of an N-tuple source. Several transform optimization results have been presented for memoryless Gaussian sources, including a complete solution for the N = 2, M = 2 case with arbitrary weighting of the descriptions.

An important tool for evaluating the performance of MD codes is the redundancy rate-distortion (RRD) function [63]. The redundancy is defined as ρ = R − R*, where R is the resulting rate of the MD coder with a central channel distortion D0, and R* is the rate of the best single description coder for the given distortion D0. Thus, ρ is the additional bitrate needed for one-channel reconstruction. The RRD function is then defined as ρ(D1; D0), where D1 is the averaged one-channel distortion. Since ρ depends weakly on D0, the RRD function can be written as ρ(D1) [63].
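To make the PCT mechanism concrete, here is a minimal numerical sketch of the orthogonal correlating transform above; the variances, the angle, and the linear MMSE estimator for a lost coefficient are illustrative assumptions, not parameters from [63]:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.pi / 6                        # correlation-controlling angle
T = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# Two uncorrelated coefficients with unequal variances (sigma_a > sigma_b).
A = rng.normal(0.0, 4.0, 10000)
B = rng.normal(0.0, 1.0, 10000)
C, D = T @ np.vstack([A, B])             # correlated pair: C -> Desc. 1, D -> Desc. 2

# Both descriptions received: invert T exactly.
A_hat, B_hat = np.linalg.inv(T) @ np.vstack([C, D])

# Description 2 lost: estimate D from C via the linear MMSE estimator
# E[D|C] = (cov(C,D)/var(C)) * C for zero-mean jointly Gaussian pairs.
D_est = (np.cov(C, D)[0, 1] / np.var(C)) * C
A_side, B_side = np.linalg.inv(T) @ np.vstack([C, D_est])
print("side MSE of A:", np.mean((A - A_side) ** 2))
```

Increasing θ toward π/4 strengthens the C-D correlation, improving the side estimate at the price of a larger redundancy, which is exactly the trade-off the RRD function measures.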
The performance of a JPEG-like MD coder using correlating transforms has been compared with the performance of a coder using MDSQ [95]. It has been shown that PCT exhibits good performance in the small redundancy region but fails to achieve good reconstruction quality at higher redundancies. The poor performance at high redundancies is due to the fact that the RRD curve of the correlating transform converges to a nonzero value σ_B²/2 [95] (in the region with redundancy close to zero it shows super-exponential decay). The performance of multiple description transform codes (MDTC) at higher redundancies is improved in the generalized MDTC (GMDTC) proposed by Wang et al. [94, 97]. This scheme uses explicit redundancy to correct the error resulting from reconstruction from a single description. For higher redundancies, GMDTC includes in the description carrying the coefficient C some information C⊥, an orthogonal complement of C in the Hilbert space spanned by (C, D) [94]. This orthogonal complement information is pure redundancy and is used only in the case of reconstruction from a single description. If the total redundancy is below a critical point ρ*, all the redundancy is allocated to the transform. Thus, this hybrid scheme combines super-exponential decay in the low redundancy region and near-exponential decay in the higher redundancy region.

11.3.3 MD-FEC

A general approach is to equip multiple descriptions with forward error correction (MD-FEC) [5, 56]. Its basic idea is to assign unequal numbers of FEC symbols to different parts of the compressed bitstream, depending on the informational importance of these parts and their contribution to the overall reconstruction quality. This idea is best applied to so-called "embedded" bitstreams, where the bytes of the compressed source are ordered according to their importance. The wavelet-based SPIHT encoder is an example of a compression algorithm generating an embedded bitstream [73]: its first bytes are the most important ones, and any subsequent byte refines the decoded image. Thus, the bitstream can be truncated to a given bit budget while still allowing reconstruction. In connection with FEC, the first bytes should be better protected than the later bytes.

We illustrate how MD-FEC works by means of an example. Seventeen data symbols are coded using eight FEC symbols; thus, a total of 25 symbols is to be transmitted, broken into five codes, as shown in Table 11.1. FEC is implemented by means of Reed-Solomon (RS) codes. Stronger RS codes are applied to the data located at the beginning of the bitstream, i.e. to the more important data. Namely, a (5,2)-code is applied to symbols 1 and 2, (5,3)-codes are applied to symbols 3 to 8, a (5,4)-code is applied to symbols 9 to 12, and symbols 13 to 17 are left unprotected. Then the symbols, including the FEC ones, are grouped vertically into multiple descriptions (packets). Each packet is protected with a parity code enabling error detection and sent to the receiver.
Table 11.1. Example of MD-FEC

           D1    D2    D3    D4    D5
  Code 1    1     2   FEC   FEC   FEC
  Code 2    3     4     5   FEC   FEC
  Code 3    6     7     8   FEC   FEC
  Code 4    9    10    11    12   FEC
  Code 5   13    14    15    16    17
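The unequal-protection profile of Table 11.1 can be expressed as a small allocation sketch. The actual RS encoding and decoding are omitted here; the code-strength list is simply the table's profile, so this only computes how many leading data symbols survive a given number of received descriptions:

```python
# (n, k) profile of Table 11.1: five RS codes over n = 5 descriptions,
# with k data symbols each; an (n, k) RS code tolerates n - k losses.
N_DESC = 5
ks = [2, 3, 3, 4, 5]                  # k for Code 1 .. Code 5

def decodable_symbols(received):
    """Leading data symbols recoverable when `received` of the
    five descriptions arrive intact."""
    total = 0
    for k in ks:
        if received >= k:             # RS(n, k) decodes from any k symbols
            total += k
        else:
            break                     # embedded stream: stop at first failure
    return total

for r in range(N_DESC + 1):
    print(f"{r} descriptions received -> first {decodable_symbols(r)} symbols")
```

Running the sketch reproduces the behaviour described next: two received descriptions yield symbols 1-2, three yield symbols 1-8, and all five yield all 17 symbols.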
At the receiver side, the decoder detects erroneous descriptions and uses the RS codes to reconstruct the lost data. As a (5,2) RS code can sustain a loss of three symbols, receiving any two descriptions makes it possible to decode symbols 1 and 2. Similarly, receiving any three descriptions makes it possible to decode symbols 1 to 8. When no descriptions are lost, all 17 symbols are reconstructed. For reconstruction with RS codes, it does not matter which descriptions are lost; only the number of lost descriptions matters. Thus, MD-FEC generates inherently balanced descriptions (having the same size and resulting in the same distortion when lost). MD-FEC is quite an attractive technique, as it can be applied to any coder which generates an embedded bitstream. The method can also be applied to a non-embedded bitstream after parts of the bitstream are rearranged in order of importance.

11.3.4 A Survey of MD Image Coding Approaches

An MDC approach developed by Goyal et al. exploits quantized overcomplete frame expansions for generalized multiple description coding [37, 38]. The authors propose the use of linear transforms from R^N to R^M with M > N, followed by scalar quantization, in the so-called quantized frame (QF) system. Each quantized coefficient may be considered as a description. This approach is similar to block channel codes, but with the order of the transform and quantization operations swapped [38]. Due to this interchanged order, the MDC system behaves differently from a conventional block channel code in the presence of erasures. A channel-code system with an (M, N)-code would not show any degradation of reconstruction quality until the number of erasures exceeded M − N, at which point the so-called "cliff effect", characterized by a fast loss of reconstruction quality, appears. In contrast, the QF system shows only mild degradation of quality when the number of erasures exceeds M − N: each transform coefficient brings some independent information, even those in excess of N coefficients [32]. Moreover, frame expansions provide quantization noise reduction [39], yielding better quality than conventional block channel coding when all the descriptions are received.

Several MDC schemes are based on spatial image subsampling, exploiting the fact that natural images possess a high degree of spatial correlation [30, 72].
Fig. 11.5. Example of polyphase transform when original image of size 4 × 4 is partitioned into 4 blocks [72]
Subsampling in 2D space may lead to different polyphase components, depending on the geometry of the subsampling lattice applied. Figure 11.5 shows a subsampling scheme generating four polyphase components. Polyphase subsampling is attractive since it can be adjusted for an arbitrary number of descriptions, making the coder easily tunable to changing channel conditions. At the receiver, pixels corresponding to the lost descriptions can be interpolated from the received neighbouring pixels. The redundancy is due to the spatial subsampling, which decreases the interdependencies between neighboring pixels and makes the compression less efficient.

Additional redundancy can be added in the form of redundant copies of polyphase components [72]. In that case, each description contains a fully-coded copy of one polyphase component and several redundant copies of other polyphase components coded at a lower rate. If the main copy of a polyphase component is lost, the available redundant copy of the same component included in another description is used. An optimized bit allocation algorithm chooses the number of redundant copies and their bitrates for each polyphase component [72].

The redundancy can also be added at a pre-processing stage before the polyphase transform [30]. Such a pre-processing procedure is shown in Fig. 11.6. The input image is transformed to the DCT domain to an array of size D × D, which is then zero-padded to size (D·M) × (D·M). The obtained (D·M) × (D·M) representation is transformed back to the spatial domain, where it is split into multiple descriptions by polyphase subsampling. Clearly, the redundancy can be adjusted by the padding parameter M.

Polyphase subsampling can be done in a transform domain as well. In the wavelet domain, such schemes are based on inter- and intra-scale dependencies between wavelet coefficients [9, 46].
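A minimal sketch of the spatial polyphase split of Fig. 11.5 follows; the 2 × 2 lattice and the neighbour-averaging used to fill a lost component are illustrative choices standing in for the interpolation discussed above:

```python
import numpy as np

def polyphase_split(img):
    """Split an image into four polyphase components on a 2x2 lattice:
    each component takes every second pixel in each direction."""
    return [img[r::2, c::2] for r in range(2) for c in range(2)]

def merge(components, shape):
    out = np.zeros(shape)
    for (r, c), comp in zip([(0, 0), (0, 1), (1, 0), (1, 1)], components):
        out[r::2, c::2] = comp
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
descs = polyphase_split(img)             # four descriptions

# Lossless case: all four descriptions received.
assert np.array_equal(merge(descs, img.shape), img)

# One description lost: fill its lattice sites by averaging the pixel
# above and the pixel to the left (a toy stand-in for real interpolation).
lost = 3
recon = merge([d if k != lost else np.zeros_like(d)
               for k, d in enumerate(descs)], img.shape)
recon[1::2, 1::2] = (recon[0::2, 1::2] + recon[1::2, 0::2]) / 2
print(recon)
```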
Fig. 11.6. Pre-processing block [30]
For lapped orthogonal transforms (LOT), such schemes have been generated by interleaving blocks of transform coefficients [18, 19, 20]. Blocks of LOT coefficients are split between two descriptions in such a way that neighbouring blocks are included in different descriptions. If a description is lost, the missing coefficients are replaced with zeros, resulting in a spatial interpolation due to the overlapping reconstruction functions. To control the amount of redundancy, different LOT bases have been developed [20]; these bases trade off coding gain against reconstruction gain.

Miguel et al. have proposed a SPIHT-based MD image coder [55]. The coder can produce a flexible number of descriptions N. Spatially disperse wavelet coefficient trees are grouped into N sets, and each tree set is independently coded with the SPIHT algorithm. Each description consists of one tree set coded at a higher rate and M − 1 (M ≤ N) redundant tree set copies coded at lower rates. Thus, the redundancy is formed by duplicating the coefficient trees. If a description is lost, the missing trees are obtained from their best quality redundant copy in the other descriptions [55]. Bit allocation is optimized for the target bitrate and the probability of description loss.

An MD coder producing two balanced descriptions compatible with JPEG2000 has been proposed by Tillo et al. [82]. The algorithm generates two JPEG2000 streams coded at rates R1 and R2 such that R1 > R2. Rate allocation in JPEG2000 is based on code-block (CB) truncation, so that Stream 2 has CBs truncated at a lower rate than the CBs of Stream 1. To generate balanced descriptions, CBs from Stream 1 are alternated with CBs from Stream 2, yielding two descriptions with the approximate rate (R1 + R2)/2.

11.3.5 2-stage Multiple Description Image Coding

This section describes in more detail a two-stage MD image coder [61]. Although the coder scheme is quite simple, it achieves good performance at low redundancy. It represents the original image in the form
of a coarse image approximation (shaper) and a residual image. The coarse approximation is duplicated in both descriptions, while the residual image is split into two descriptions using a checkerboard rearrangement of transform coefficient blocks.

In the encoder scheme (Fig. 11.7(a)), the shaper (blocks bordered by the dashed line) is generated by decimation by a factor M followed by a JPEG coder. Special attention is paid to the way the image is decimated and interpolated: a B-spline-based least-squares image resizing, ensuring minimum loss of information, is utilized [57]. Thus, most of the image information is concentrated in the decimated image, which is included in both descriptions. For the decimated image, a DCT-based coder is a reasonable choice. Alternatively, the shaper can be generated by a wavelet-based coder, e.g. SPIHT (Fig. 11.7(b)); in this case, the image resizing operation is inherently included in the scheme.

The residual image is coded by a JPEG-like coder using a block transform (denoted by T), which can be either the DCT or a lapped orthogonal transform (LOT). The transform coefficients are finely quantized with a uniform quantization step (Qr). Then, the transform blocks are split into two parts in a checkerboard manner (see Fig. 11.8) and entropy-coded. One part together with the shaper forms Description 1, while the second part combined again with the shaper forms Description 2. Thus, each description consists of the coarse image approximation and half of the transform blocks of the residual image.
Fig. 11.7. Varieties of proposed scheme: (a) shaper is obtained by spline resizing and JPEG coding; (b) shaper is obtained by SPIHT coding. Reprinted from [61], copyright 2006, with permission from Elsevier
Fig. 11.8. Checkerboard splitting of the residual image (in case of DCT): (a) residual image; (b) Description 1; (c) Description 2. Reprinted from [61], copyright 2006, with permission from Elsevier
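The checkerboard split of transform blocks can be sketched as follows; the block size and array dimensions are illustrative assumptions, and the zero-filling mirrors the one-channel reconstruction described below:

```python
import numpy as np

B = 8  # transform block size (e.g. 8x8 DCT blocks)

def checkerboard_split(coeff_img):
    """Split a block-transformed residual into two descriptions:
    blocks go to Description 1 or 2 in a checkerboard pattern."""
    d1, d2 = np.zeros_like(coeff_img), np.zeros_like(coeff_img)
    rows, cols = coeff_img.shape[0] // B, coeff_img.shape[1] // B
    for r in range(rows):
        for c in range(cols):
            block = coeff_img[r*B:(r+1)*B, c*B:(c+1)*B]
            target = d1 if (r + c) % 2 == 0 else d2
            target[r*B:(r+1)*B, c*B:(c+1)*B] = block
    return d1, d2

coeffs = np.random.randn(32, 32)         # stand-in for quantized coefficients
d1, d2 = checkerboard_split(coeffs)
assert np.array_equal(d1 + d2, coeffs)   # together they carry all blocks;
# if one description is lost, its blocks simply stay zero-filled.
```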
Therefore, no extra redundancy is added in the residual image coding when generating two descriptions instead of one. The obtained coder provides balanced descriptions both in terms of PSNR and bitrate. The amount of redundancy is also easily adjustable. An algorithm for optimal bit allocation subject to the probability of a channel error is also provided [61].

When the decoder receives both descriptions, reconstruction is straightforward. In the case of one-channel reconstruction, the lost coefficients of the residual image are simply filled with zeros. Then, the inverse quantization and inverse transform are applied. The shaper is obtained from the received description and added to the reconstructed residual image.

Figure 11.9 shows a test image reconstructed from one and from two descriptions by the 2-stage coders using DCT and LOT for the residual image coding. The coders exploit B-splines and JPEG for coding the shaper. One notices that DCT and LOT produce different visual artifacts when the image is reconstructed from a single description. In particular, DCT produces blocking artifacts caused by the different reconstruction quality of neighbouring blocks. LOT does not produce blocking artifacts; instead, it produces artifacts that look similar to ringing.

The 2-stage coder is also compared with other MD image coders based on JPEG. Here, the 2-stage coder exploits B-splines and JPEG for coding the shaper, and the residual image is coded with DCT. Two modifications of this coder are used for comparison: with and without postprocessing (called 2-stage+postfilt and 2-stage, respectively) [61]. The coder is compared with two MD coders presented in [95]. One of them is a JPEG-based MDTC image coder. The other coder (MDSQ) is based on applying multiple description scalar quantization [84] to the DCT coefficients of the JPEG coder. The test image Lena (512 × 512, 8 bpp) is used for comparison. The central distortion for the coders is fixed at D0 = 35.78 − 36.00 dB. Figure 11.10 presents the RD curves for the coders described above. Different operating points for the 2-stage coder are obtained by varying the
Fig. 11.9. Reconstructed image Lena, 2-stage coder, spline interpolation and DCT coding of shaper. DCT coding of the residual (R = 0.623 bpp, ρ = 28%): (a) reconstruction from both descriptions, D0 = 35.24 dB; (b) reconstruction from Description 1, D1 = 31.84 dB. LOT coding of the residual (R = 0.632 bpp, ρ = 27.5%): (c) reconstruction from both descriptions, D0 = 34.80 dB; (d) reconstruction from Description 1, D1 = 30.78 dB. Reprinted from [61], copyright 2006, with permission from Elsevier
downsampling factor for the shaper. Figure 11.10 demonstrates that the 2-stage coder substantially outperforms the MDTC coder over the whole range of redundancies; the difference is greater for low redundancies. The superior performance of the 2-stage coder in the low redundancy region is due to the downsampling before JPEG coding of the shaper. This operation makes it possible to obtain a higher PSNR at low bitrates compared with conventional JPEG compression. The 2-stage coder also performs better than, or comparably to, the MDSQ coder at higher redundancies, and in the middle range of redundancies it works better than MDSQ. Moreover, the coder achieves smaller redundancies than the coders based on MDSQ and MDTC. One notices that the 2-stage coder is able to produce a meaningful side reconstruction even with a redundancy of less than 5%.
Fig. 11.10. RD performance (PSNR in dB versus bitrate in bpp) of different coders; image Lena (512 × 512). Reconstruction from a single description. For MDTC and MDSQ, D0 = 35.78 dB. For the 2-stage coder and the 2-stage coder with post-filtering, D0 = 35.80 − 36.00 dB. Reprinted from [61], copyright 2006, with permission from Elsevier
11.4 Multiple Description Coding of Video

Recent video coding standards employ motion-compensated prediction to efficiently remove the temporal correlation in video sequences and to split the video information into prediction errors and motion vectors. Consequently, a typical MDC scheme for video has to address the coding of both prediction errors and motion vectors. In addition, special attention has to be given to the synchronization between the coder and decoder. The creation of multiple descriptions for the error signal can be handled exactly as in the case of images; that is, by subsampling in the spatial or transform domain, and/or by proper scanning and quantization of transform coefficients. The correlation in the temporal domain can be utilized by suitable temporal-domain subsampling. In this section, we address the most important problems of MDC of video by surveying relevant papers. For a more extensive treatment of the topic, we refer to the excellent overview paper by Wang et al. [96].

11.4.1 Prediction Loops Mismatch Control

Motion-compensated prediction combined with multiple descriptions gives rise to a problem of mismatch between the states of the encoder and decoder, since the loss of one description may cause a loss of synchronization between them. This problem can be solved either by generating extra prediction loops for every potential configuration of lost descriptions, or by embedding the prediction in each description separately.
An MDC approach using three prediction loops in the encoder to mimic different description loss scenarios has been proposed by Reibman et al. [67, 68]. In this approach, motion vectors and headers are simply duplicated in both descriptions. I-frames and prediction errors in P-frames are coded using one of the previously developed MD image coding approaches, e.g. by a pairwise correlating transform [63, 94]. The coder uses the central (two-channel) reconstruction as a reference picture in motion-compensated prediction. To deal with the prediction loop mismatch, the encoder has three different prediction loops, emulating the three possible decoding scenarios: both descriptions received, or either single description received [67, 68].

A general diagram of the MD video coder based on three prediction loops is shown in Fig. 11.11. The encoder stores three previous frames P0, P1 and P2, reconstructed from both descriptions, Description 1, and Description 2, respectively. For each block X, the encoder forms the corresponding block from the previously reconstructed frames through motion-compensated prediction. The residual R (the motion-compensated difference) in the central prediction loop (assuming both descriptions are available) is coded using a general MD image coder (labelled in the scheme as MDC encoder). The MD encoder produces two descriptions, R̃1 and R̃2, which are sent to Channel 1 and Channel 2. The encoder also has an MDC decoder block, which receives R̃1 and R̃2 and produces R̂1 and R̂2: two estimates of R from Description 1 and Description 2, respectively. If the decoder gets R̂1 or R̂2 only, the reconstruction in the absence of any additional information is Pi + R̂i. However, the side loop prediction errors M1 = X − P1 − R̂1 and M2 = X − P2 − R̂2 are also coded by single description encoders and included in the corresponding descriptions. Coding M1 and M2 controls the mismatch between the prediction loops in the encoder and decoder in the case of description loss.
Fig. 11.11. Framework for 3-prediction loops MD coding in the P mode. Adapted from [68]. Copyright 1999 IEEE
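The bookkeeping of the three loops can be sketched as below. This is a structural illustration only: scalar "blocks", a trivial quantizer, and a naive halving of R stand in for real motion compensation and the MDC encoder of Fig. 11.11:

```python
def quantize(v, step=0.5):
    return round(v / step) * step        # toy stand-in for the coders

def encode_block(X, P0, P1, P2):
    """One step of the three-loop encoder: central residual R plus
    side-loop mismatch signals M1, M2 (all signals are toy scalars)."""
    R = X - P0                           # central residual against P0
    R1_hat = quantize(R / 2)             # what side decoder 1 would see
    R2_hat = quantize(R / 2)             # what side decoder 2 would see
    M1 = X - P1 - R1_hat                 # mismatch if only Description 1 arrives
    M2 = X - P2 - R2_hat                 # mismatch if only Description 2 arrives
    return R, (R1_hat, M1), (R2_hat, M2)

# Central reference P0 and the side references P1, P2 drift apart after
# a loss; the coded M1/M2 pull the side loops back toward X.
X, P0, P1, P2 = 10.0, 9.0, 8.5, 8.8
R, (R1h, M1), (R2h, M2) = encode_block(X, P0, P1, P2)
print("side-1 reconstruction:", P1 + R1h + quantize(M1))   # close to X
```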
Three different algorithms for mismatch signal coding have been proposed [67]. Algorithm 1 completely codes the mismatch error to the precision of the quantization error. Algorithm 2 uses partial control of the mismatch error. Algorithm 3 does not use side prediction loops at all; instead, it allocates more redundancy to the single-channel reconstruction of the signal R. The decoder never uses the mismatch error signal when both descriptions are received [67].

The matching-pursuits multiple description coding (MP-MDC) of Tang et al. [80] is also based on the 3-loop scheme described earlier. This approach replaces the DCT structure with a matching-pursuits framework, where the redundancy in the multiple descriptions is controlled by the number of shared basis functions. Unlike the approach of Reibman, where the side loops are considered pure redundancy, the approach of Tang uses maximum likelihood estimation to improve the reconstruction quality of the central decoder by exploiting the side loop mismatch error signal.

The three-prediction-loop architecture, as illustrated in Fig. 11.11 [67], is constrained to two descriptions, because the number of drift-compensating contributions grows exponentially with the number of descriptions. Several MDC approaches achieve drift-free (mismatch-free) behaviour in the case of description loss by using separate predictors for each description. While two independent prediction loops completely solve the problem of prediction loop mismatch, they also degrade the prediction performance because of the lower quality reference frames used.

The independent flow MD video coder (IF-MDVC) introduced in [29] exploits a polyphase transform to create two or more descriptions. The input video undergoes a polyphase transform, providing N independent flows. Motion-compensated prediction is performed in each flow separately and is followed by entropy coding. When one description is lost, the lost samples are interpolated from the received samples [29]. This coder does not require any drift compensation, as the prediction is made in each description separately. The redundancy in this coder is due to the reduced coding efficiency caused by the breaking of cross-dependencies in the original video sequence.

An approach called mutually refining DPCM (MR-DPCM) solves the problem of prediction loop mismatch by using two independent prediction loops in the encoder [84]. Each description uses its own prediction loop, where the reference frame is the single description reconstruction. The DCT coefficients of the motion-compensated differences are quantized with the shifted quantizers (MDSQ) described in Sub-sect. 11.3.1 [84]. When the decoder receives both descriptions, they are combined to produce a higher quality reconstruction. Regunathan and Rose have enhanced the MR-DPCM scheme by using an estimation-theoretic approach [65].

A drift-free multiple description coder has been introduced by Boulgouris et al. [14]. This coder has one prediction loop: only the information common to both descriptions is used for motion-compensated prediction. This is achieved by coding the prediction error with a wavelet coder. Wavelet coefficients from the lower subbands are duplicated in both descriptions and used in the reference frame. The coefficients from the higher subbands are split between the two descriptions and are not used in motion-compensated prediction.
A simple MD video coder has been introduced by Reibman et al. [66]. This coder does not code the mismatch signal; the drift in the case of description loss is corrected only upon receiving an I-frame. Motion vectors and header information are duplicated in both descriptions. DCT coefficients with magnitudes greater than a predefined threshold are duplicated in both descriptions, while lower magnitude coefficients are alternated between the two descriptions. Bit allocation is optimized by choosing the threshold value for each DCT block. Thus, the decoder can decode any of the received descriptions with no additional operations. When two descriptions are received, the decoder simply merges them into one H.263-compliant stream.

11.4.2 Coding of Motion Vectors

Motion vectors are essential for video sequence reconstruction. Thus, a full MDC system has to take into account the coding of the motion vectors. A straightforward approach is to duplicate the motion vectors in all the descriptions in order to protect the motion information against channel losses. However, motion vectors can take a significant part of the target bitrate, especially in low-bitrate budgets. This justifies the need for efficient motion vector coding in MDC of video [52, 53].

The approach of Kim and Lee [53] exploits overlapped block motion compensation [62], which generates a smoother motion field than conventional block-based motion compensation. Blocks from the motion vector field are partitioned into two coarse fields using quincunx subsampling and are split between two descriptions (see the sketch after this subsection's survey). The decoder reconstructs a fine motion field if both descriptions are received, and a coarse motion field if only one description is received. The redundancy in this algorithm is minimal, which may cause high side distortions. The coder in [53] does not code the prediction mismatch between the encoder and decoder prediction loops; however, a combination with the 3-loop motion compensation scheme of [67] is possible.

Another MDC approach generates two descriptions from a mesh-based motion compensation field [98]. Motion vectors are associated with the nodes of the mesh and then split between two descriptions using quincunx subsampling. If both descriptions are received, a fine motion mesh is generated; when one description is lost, a coarse motion mesh is generated. A central residual image is obtained when both descriptions are used in motion compensation, and a side residual image is obtained when the motion compensation mesh is formed from a single description. The DCT coefficients of the I-frames and of the central residual images are coded into two descriptions using MDTC [94]. In order to control the mismatch, side information is added to both descriptions and used in the case when one description is lost.
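A minimal sketch of the quincunx splitting used by both approaches above follows; the field contents and the averaging used to coarsely fill the missing half are assumptions for illustration:

```python
import numpy as np

def quincunx_split(mv_field):
    """Split a motion vector field (H x W x 2) into two descriptions:
    lattice sites with even r+c go to one field, odd r+c to the other."""
    H, W, _ = mv_field.shape
    mask = (np.add.outer(np.arange(H), np.arange(W)) % 2 == 0)
    d1 = np.where(mask[..., None], mv_field, 0)
    d2 = np.where(mask[..., None], 0, mv_field)
    return d1, d2, mask

mv = np.random.randint(-4, 5, size=(4, 4, 2)).astype(float)
d1, d2, mask = quincunx_split(mv)
assert np.array_equal(d1 + d2, mv)   # fine field from both descriptions

# One description lost: build a coarse field, here by filling the missing
# sites with the mean of the received ones (real decoders interpolate locally).
coarse = d1.copy()
coarse[~mask] = d1[mask].mean(axis=0)
```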
11.4.3 Temporal Subsampling

Motion information can also be treated by applying temporal subsampling, in an approach known as video redundancy coding (VRC) [99].

Fig. 11.12. VRC with two threads and three frames per thread. Reprinted from [99]. Copyright 1998 IEEE
Temporally subsampled frame sequences (threads) are formed and processed in parallel, thus becoming independently decodable. After a fixed time, the threads converge to a Sync frame. Even if some of the threads are damaged, the other threads can be used to correctly reconstruct the Sync frame. This minimizes the number of I-frames, as there is no need for complete resynchronization. Figure 11.12 illustrates VRC with two threads and three frames per thread. As each thread is a temporally subsampled version of the original video and the prediction is done independently of the other threads, the motion vectors estimated between two consecutive P-frames can be longer and the prediction error can be greater. This adds a substantial penalty to the coding efficiency. Another source of redundancy in VRC is the Sync frames, which are coded multiple times [99]. In the H.264 standard [42], one can use SP-frames instead of VRC Sync frames [48]; this would eliminate the drift caused by inaccurate frame reconstruction from different Sync frames.

The VRC approach has been adopted and further developed by Apostolopoulos [7] into an approach referred to as Multiple States. A video sequence is split into a number of independently decodable bitstreams (e.g. even and odd frames), each with its own prediction process and state information. If one stream is corrupted, the video sequence can still be recovered from the other stream at half the original frame rate with no distortion. The novelty of the Multiple States approach is that it uses the data from multiple streams to recover the lost state. The frame reconstruction process is illustrated in Fig. 11.13. The lost frame can be reconstructed with sufficient quality to be used as a reference frame in the decoding of the stream it belongs to. Thus, the source video can be efficiently reconstructed to its full rate without using VRC Sync frames.
Fig. 11.13. Lost frame reconstruction in balanced multi-state video coding. Adapted from [8]. Copyright 2001 IEEE
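A sketch of the even/odd state splitting follows; the frame recovery here is plain temporal averaging, a deliberate simplification of the motion-compensated recovery actually used in multi-state coding:

```python
def split_states(frames):
    """Split a frame sequence into two independently decodable streams."""
    return frames[0::2], frames[1::2]

def recover_lost_frame(prev_frame, next_frame):
    # Toy recovery: average the temporally adjacent frames of the
    # surviving stream (real schemes use motion-compensated interpolation).
    return [(a + b) / 2 for a, b in zip(prev_frame, next_frame)]

frames = [[float(3 * t + k) for k in range(3)] for t in range(6)]
even, odd = split_states(frames)

# Frame 3 (odd stream) is lost; estimate it from even frames 2 and 4,
# then resume decoding the odd stream using the estimate as reference.
estimate = recover_lost_frame(frames[2], frames[4])
print(estimate)   # close to the original frames[3]
```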
A Multiple Description Motion Compensation (MDMC) approach has also been motivated by VRC. MDMC reduces the redundancy of VRC by utilizing a second-order predictor, i.e. by predicting the current frame from two previously decoded frames [91, 92]. A motion-compensated block of frame i has two motion vectors: the first motion vector MV1 points to frame i − 1, and the second vector MV2 points to frame i − 2. When searching for MV2, the search window is centred at the location pointed to by MV1, which provides a longer dynamic range for MV2. In coding the motion vectors, the coded information includes MV1 and ΔMV2 = MV2 − MV1. Both the central and side prediction errors are coded by DCT followed by entropy coding [92].

The encoder generates two descriptions: one includes all even frames and the other all odd frames. When both descriptions are received, motion compensation is performed as the weighted sum of the two compensated pictures. When only one description is received (e.g. containing odd frames), the decoder predicts the current frame from the previous frame of the same description. The predicted frame is then different from the one at the encoder. This mismatch between the predicted frames at the encoder and at the decoder is explicitly coded with a lower accuracy, following the three-prediction-loop approach [67] (see Sub-sect. 11.4.1). The predictor and the quantizer of the mismatch signal can be adjusted to control the level of redundancy and the reconstruction quality. Thus, unlike VRC, which has only a few operational points in its redundancy rate-distortion region, MDMC operates over a wide range of redundancies. A similar approach, called double-vector motion compensation (DMC), has been proposed in [16]; it uses a weighted superposition of two blocks in the two previous frames without coding the mismatch signal.

11.4.4 Unbalanced Multiple Description Coding

Most MDC approaches produce balanced descriptions. Thus, a lower quality reconstruction can be obtained from each of the descriptions separately, and a high quality reconstruction is obtained from both descriptions together (D0 < D1, D0 < D2, D1 = D2). However, some applications may require unbalanced descriptions. An unbalanced MDC (UMDC) is especially beneficial when combined with multiple-path transport, where each description is transmitted to the receiver over a separate link. If those links have different probabilities of packet losses, UMDC can improve the average reconstruction quality at the receiver [8]. In particular, UMDC assumes that each stream can have a different RD performance; that is, one description has a higher rate and is more important than the other. To obtain the unbalanced descriptions, one can vary the quantization, the frame rate or the spatial resolution of one description. However, it is important to maintain approximately the same visual quality in both video streams in order to avoid annoying variations in quality, the so-called flicker effect, in the case of no losses. An example of unbalanced video coding is given in Fig. 11.14; the descriptions in this approach have different frame rates.

Another approach to UMDC gives high priority to one of the descriptions, while the other description is considered pure redundancy [22, 23, 79].
Fig. 11.14. Lost frame reconstruction in unbalanced multi-state video coding. Adapted from [8]. Copyright 2001 IEEE
More specifically, the encoder creates a high resolution (HR) description and a low resolution (LR) description, where the central distortion is assumed equal to the higher quality side distortion (D0 = D1 < D2). The low-quality description is obtained by spatial downsampling of the original sequence and is used only in the case where the high-resolution description is lost. When the HR description is lost and the LR description is used for the reconstruction, a mismatch in the prediction loop appears. Then, an erasure recovery algorithm [31, 79] is used to estimate the lost samples in the HR sequence. The video sequence is considered as consisting of a number of series of “single pixel” video sequences [31]. Each “single pixel” sequence is obtained by taking one pixel in the first frame and considering the sequence of pixels in the subsequent frames obtained from the original pixel via motion-compensated prediction. The estimate x̂HR of the lost value x minimizes the MSE between the reconstructed HR sequence and the received LR sequence over a series of N subsequent frames. It has also been shown that the distortion depends on the number of consistency checks N, the variance and the temporal correlation of the quantization error [31].

11.4.5 MD-FEC for Video

The application of the MD-FEC scheme to the output of a SPIHT coder has been described in Sub-sect. 11.3.3. MD-FEC has also been applied to the output of an MC-based video coder [43]. To generate a progressive bitstream, the output of the MC-based video coder is rearranged in such a way that the more important information is transmitted at the beginning of the stream. The bitstream is split into groups of pictures (GOP) of predefined size; each GOP is rearranged independently of the other GOPs. In the rearranged GOP, the I-frame goes first, while other important data (such as header information and motion vectors) are transmitted next. This allows reconstruction of the coarse motion sequence from the beginning of the bitstream. Finally, the VLC codewords from each block corresponding to the k-th DCT coefficient are grouped together (the groups are ordered from low to high frequency content) [43]. The MD-FEC algorithm is applied to the rearranged bitstream as described in Sub-sect. 11.3.3. Nevertheless, the reader has to keep in mind that reordering does not make the bitstream truly embedded. If the rearranged GOP is truncated to size a, its quality is usually worse than the quality of a GOP coded in its natural order to the same size a. The MD-FEC algorithm can also be used with the data partitioning
of the H.264 standard [42]. MD-FEC can also be applied quite naturally to 3D-wavelet coders such as 3D-SPIHT.

11.4.6 3D-transform Based Coding

This section describes MD video coding based on 3D-transforms. One of the approaches to video coding is three-dimensional subband coding (3D-SBC) using wavelet transforms. Motion compensation can be incorporated into 3D-SBC using motion-compensated lifting [74]. This implementation, called motion-compensated temporal filtering (MCTF), can be used with any wavelet kernel and any motion model; it also enables temporal scalability in video coding. In an MDC scheme based on MCTF, low-frequency frames are duplicated in both descriptions and high-frequency frames are divided between the two descriptions. When reconstructing from a single description, the missing frames are estimated using motion vectors [89]. Another MD video coding method with a flexible number of descriptions is based on interframe wavelet-based scalable video coding [4]. MCTF and 2D spatial wavelet filtering are used to perform the 3D wavelet transform. The MD algorithm generates N descriptions from the spatio-temporal code-blocks coded at M different rates. A code-block copy with a higher rate goes to one description, while the redundant copies with lower rates are included in the other descriptions to facilitate reconstruction in the case of description loss. The motion vectors and the lowest-frequency code-blocks in both time and space are duplicated in all descriptions at a higher rate. This method can be applied to any type of wavelet-based video coding which utilizes independent code-block coding as the subband entropy coding technique. Motion estimation is the most complex part of hybrid video coders. If it can be replaced with a proper transform along the temporal axis, savings in overall encoding complexity can be achieved. The three-dimensional DCT (3D-DCT) can be used to efficiently remove correlation in both the spatial and temporal domains. A 3D-DCT video coder is also advantageous in terms of error resilience, as it suffers no error propagation into subsequent frames. An MD video coder based on 3D block transforms has been described in [60]. This MD video coder is an adaptation to video coding of the 2-stage image MD coding approach [61] described in Sect. 11.3.5. This scheme (called 3D-2sMDC) has a balanced computational load between the encoder and decoder and is able to work at the very low redundancy levels introduced by MD coding. The encoder scheme is shown in Fig. 11.15. At the first stage (dashed rectangle), a coarse sequence approximation is obtained and included in both descriptions. The second stage produces enhancement information, which has a higher bitrate and is split between the two descriptions. In the encoder, a sequence of frames is split into groups of 16 frames. Each group is split into 3D cubes of size 16 × 16 × 16, and a 3D-DCT is applied to each cube.
Fig. 11.15. 3D-2sMDC encoder scheme. Reprinted from [60]
The lower DCT coefficients forming the 8 × 8 × 8 cube are quantized with a uniform quantization step Qs and entropy-coded (see Fig. 11.16(a)), thus forming the shaper. The other coefficients are set to zero. The quantized shaper is transformed back to the spatial domain and subtracted from the original sequence. The obtained residual sequence is coded by a 3D block transform, which can be either a 3D-DCT or a hybrid 3D transform. The hybrid transform consists of a lapped orthogonal transform (LOT) in the vertical and horizontal directions and a DCT in the temporal direction. The transform coefficients are finely quantized with a uniform quantization step Qr. Then, the transform blocks are split into two parts in the manner shown in Fig. 11.16(b). The first part of the blocks and the shaper form Description 1, while the second part of the blocks and the shaper form Description 2. In the case of central reconstruction (reconstruction from two descriptions), each part of the residual sequence (X1 and X2) is received with the corresponding description and entropy-decoded. Inverse quantization and the inverse transform are applied to the coefficients, and the residual sequence is added to the shaper to obtain the reconstruction of the original sequence. The side decoder scheme can be obtained from Fig. 11.17 if the content of the dashed rectangle is removed. In this case, the shaper is reconstructed from its available copy in Description 1.
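A minimal sketch of the shaper stage, assuming SciPy's separable DCT is an acceptable stand-in for the coder's transform (quantizer indices are returned in place of the entropy-coded bitstream):

    import numpy as np
    from scipy.fft import dctn, idctn

    def shaper_stage(cube, Qs):
        """First stage of 3D-2sMDC (sketch): 3D-DCT a 16x16x16 cube, keep the
        8x8x8 low-frequency corner quantized with step Qs, zero the rest, and
        return the quantized indices plus the spatial-domain residual that
        the second stage codes."""
        cube = np.asarray(cube, dtype=np.float64)
        C = dctn(cube, norm='ortho')
        kept = np.round(C[:8, :8, :8] / Qs)            # coarse shaper indices
        coarse = np.zeros_like(C)
        coarse[:8, :8, :8] = kept * Qs                 # dequantized shaper
        residual = cube - idctn(coarse, norm='ortho')  # input to second stage
        return kept, residual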
Fig. 11.16. Coding patterns: (a) 3D-DCT cube for shaper coding: only coefficients in the gray volumes are coded, the other coefficients are set to zero; (b) split pattern for volumes of the residual sequence: gray - Description 1; white - Description 2. Reprinted from [60]
Fig. 11.17. Decoder scheme. Central reconstruction. Side reconstruction when Description 2 is missing. Reprinted from [60]
The residual sequence then has only half of the coefficient volumes, X1, and the missing coefficients X2 are filled with zeros. The subsequent stages of the decoding process are identical to the central reconstruction. The redundancy of 3D-2sMDC is determined only by the shaper quality, controlled by the shaper quantization step Qs. The quality of the central reconstruction is controlled by the quantization step Qr of the residual sequence. The complexity of 3D-2sMDC can be reduced even further. There is no need to compute all the coefficients of the forward 3D-DCT of size 16 × 16 × 16 (Fig. 11.16(a)). To perform a 3D-DCT of an N × N × N cube, one needs 3N^2 one-dimensional DCTs of size N. However, if only the N/2 × N/2 × N/2 low-frequency coefficients are needed, the row-column-frame (RCF) transform requires N^2 + N^2/2 + N^2/4 = 1.75N^2 DCTs per cube. To get the 8 lowest coefficients of the 1D-DCT, a pruned DCT is used. This yields a substantial reduction in computational complexity compared with the full separable DCT. The estimated overall complexity of the 3D-2sMDC encoder is 1.5 to 2 times lower than that of H.263 [41]. The difference between the complexity of the 3D-2sMDC encoder and H.263+ [41] with scalability is even greater (although they have similar compression efficiency). The following paragraphs show the simulation results for the 3D-2sMDC encoder. The comparison is done for the sequence Silent Voice (QCIF, 15 fps). The 3D-2sMDC coder is compared with the MDTC coder that uses three prediction loops in the encoder, described in Sub-sect. 11.4.1 [67, 68]. The 3D-2sMDC coder exploits the hybrid transform for coding the residual sequence. The rate-distortion performance of the two coders is shown in Fig. 11.18. The PSNR of the central reconstruction of the 3D-2sMDC coder is D0 ≈ 31.52 dB, and the central distortion of the MDTC coder is D0 = 31.49 dB. Figure 11.18 shows that the 3D-2sMDC coder outperforms the MDTC coder, especially in the low-redundancy region. The side reconstruction performance of the 3D-2sMDC coder can be explained as follows. An MC-based multiple description video coder has to control the mismatch between the encoder and decoder. This can be done, for example, by explicitly coding the mismatch signal, as is done in the MDTC coder [67, 68] (Sub-sect. 11.4.1). The 3D-2sMDC coder does not need to code the mismatch signal. Thus, it can achieve better performance at low redundancies.
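The 1.75N^2 operation count above can be mirrored in code; in this sketch the truncation after each one-dimensional pass stands in for a genuinely pruned DCT, so it reproduces the RCF transform ordering rather than the exact arithmetic savings:

    import numpy as np
    from scipy.fft import dct

    def rcf_lowfreq_3d_dct(cube):
        """Row-column-frame 3D-DCT keeping only the low-frequency corner.
        For an N x N x N cube this touches N^2 + N^2/2 + N^2/4 = 1.75 N^2
        one-dimensional DCTs instead of 3 N^2."""
        N = cube.shape[0]
        h = N // 2
        x = dct(cube, axis=0, norm='ortho')[:h, :, :]   # N^2 DCTs, keep N/2
        x = dct(x, axis=1, norm='ortho')[:, :h, :]      # N^2/2 DCTs
        x = dct(x, axis=2, norm='ortho')[:, :, :h]      # N^2/4 DCTs
        return x                                        # (N/2)^3 coefficients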
[Figure 11.18 plot: PSNR (dB) of the mean side reconstruction vs bitrate (kbps); curves: 3D-2sMDC (Scheme 1) and MDTC]
Fig. 11.18. Sequence Silent Voice, mean side reconstruction. D0 ≈ 31.53 dB. Reprinted from [60]
A drawback of the 3D-2sMDC coder is its relatively high delay. High delays are common for coders exploiting 3D-transforms (e.g., coders based on the 3D-DCT or 3D-wavelets). For example, the delay of 3D-2sMDC is slightly over half a second at 30 fps and about one second at 15 fps, since a group of 16 frames spans 16/30 ≈ 0.53 s and 16/15 ≈ 1.07 s, respectively. Figure 11.19 shows frame 13 of the sequence Tempete reconstructed from both descriptions (Fig. 11.19(a)) and from Description 1 alone (Fig. 11.19(b)). The sequence is coded by the 3D-2sMDC encoder (with the hybrid transform at the second stage) at the bitrate R = 880 kbps. One can see that although the frame reconstructed from one description has some distortions caused by the loss of transform coefficient volumes of the residual sequence, the overall picture is smooth and pleasant to the eye.
(a) Central reconstruction, D0 = 28.52 dB
(b) Side reconstruction, D1 = 24.73 dB
Fig. 11.19. Sequence Tempete; reconstruction of frame 13. Reprinted from [60]
11.5 Multiple Description Coding of Stereoscopic Video

In Sect. 11.4, the basic principles of multiple description coding of monoscopic video, together with the state-of-the-art techniques, have been surveyed. Similar ideas can also be applied to the MD coding of stereoscopic video. The basic difference between stereoscopic and monoscopic video is that the depth information is essential in stereoscopic video and should be maintained in each description. In this section, some recent methods developed for multiple description coding of stereoscopic video are reviewed and discussed. These approaches are based on the joint coding structure shown in Fig. 11.20. In the joint coding scheme, prediction in the left sequence employs only motion estimation, while prediction in the right sequence employs both motion and disparity estimation. The discussed MDC methods employ spatial and temporal downsampling of the views in stereoscopic video.

11.5.1 Spatial Scaling Stereo-MDC

There are two theories about the effects of unequal bit allocation between the left and right video sequences: fusion theory and suppression theory [25, 47, 103]. In fusion theory, it is believed that the total bit budget should be equally distributed between the two views. According to suppression theory, the overall perception of a stereo-pair is determined by the highest quality image. Therefore, one can compress the target image as much as possible to save bits for the reference image, so that the overall distortion is the lowest. The Spatial Scaling Stereo MDC (SS-MDC) approach [58] is based on these two theories. In [3], the perception performance of spatial and temporal downscaling for stereoscopic video compression has been studied. The obtained results indicate that spatial and spatiotemporal scaling provide acceptable perception performance at a reduced bitrate. This property of stereoscopic perception can be used in a multiple description coder in such a way that the scaled stereoscopic video serves for the side reconstruction. Figure 11.21 presents the scheme exploiting spatial scaling of one view (SS-MDC). In Description 1, left frames are predicted only from left frames, and right frames are predicted from both left and right frames. Left frames are coded with the original resolution; right frames are downsampled prior to encoding. In Description 2, right frames are coded with the original resolution
Fig. 11.20. Simple stereoscopic joint coder reference structure. Left sequence is coded independently; frames of right sequence are predicted from either right or left frames
Fig. 11.21. MDC scheme based on spatial scaling (SS-MDC). Reprinted from [58], copyright Springer-Verlag Berlin Heidelberg 2006, with kind permission of Springer Science and Business Media
and left frames are downsampled. When both descriptions are received, the left and right sequences are reconstructed in full resolution. If one description is lost due to channel failures, the decoder reconstructs a stereoscopic video pair where one view is low-pass filtered. A stereo-pair where one view has the original resolution and the other view is low-pass filtered provides acceptable stereoscopic perception. After the channel starts working again, the decoding process can switch back to the central reconstruction (where both views have high resolution) once the IDR picture is received.

11.5.2 Multi-state Stereo MDC

The MS-MDC scheme is shown in Fig. 11.22 [58]. The stereoscopic video sequence is split into two descriptions. Odd frames of both the left and right sequences belong to Description 1, and even frames of both sequences belong to Description 2. Motion-compensated prediction is performed separately in each description. In Description 1, left frames are predicted from preceding left frames of Description 1, and right frames are predicted from preceding right frames of Description 1 or from the left frames corresponding to the same time moment. The idea in this scheme is similar to that in video redundancy coding (VRC) [99] and multi-state coding [7]. If the decoder receives both descriptions, the original sequence is reconstructed at the full frame rate. If one description is lost, stereoscopic video
Fig. 11.22. Multi-state stereo MDC (MS-MDC). Reprinted from [58], copyright Springer-Verlag Berlin Heidelberg 2006, with kind permission of Springer Science and Business Media
is reconstructed at half the original frame rate. Another possibility is to employ a frame concealment technique for the lost frames. As one can see from Fig. 11.22, a missed (e.g. odd) frame can be concealed by employing the motion vectors of the next (even) frame, which uses only the previous even frame as a reference for motion-compensated prediction. This MDC scheme does not allow the coding redundancy to be adjusted. However, for some video sequences it makes it possible to reach bitrates lower than the bitrate of simulcast coding. The method can easily be generalized to more than two descriptions. MS-MDC also does not introduce any mismatch between the states of the encoder and decoder in the case of description loss.

11.5.3 Simulation Results

To compare the performance of the two stereoscopic MDC approaches, redundancy rate-distortion curves for different MDC-coded stereoscopic streams have been investigated. The bitrate generated by the SS-MDC coder is R = R* + ρsim + ρd, where R* is the bitrate obtained with the single-description coding scheme providing the best compression, ρsim is the redundancy caused by using simulcast coding instead of joint coding, and ρd is the bitrate spent on coding the downscaled sequences. Thus, the redundancy ρ = ρsim + ρd of the proposed method is bounded from below by the redundancy of simulcast coding ρsim. The redundancy of simulcast coding ρsim depends on the characteristics of the video sequence and varies from one sequence to another. The redundancy ρd of coding the two downsampled sequences can be adjusted to control the total redundancy ρ; it is adjusted by changing the scaling factor (factors of two in our implementation) and the quantization parameter QP of the downscaled sequence. In generating the results, two low-pass filters are used for downsampling and upsampling the frames in one of the views:

13-tap downsampling filter: {0, 2, 0, −4, −3, 5, 19, 26, 19, 5, −3, −4, 0, 2, 0}/64
11-tap upsampling filter: {1, 0, −5, 0, 20, 32, 20, 0, −5, 0, 1}/64

The filters are applied to all Y, U and V channels in both the horizontal and vertical directions, and picture boundaries are padded by repeating the edge samples. These filters are used in the Scalable Video Coding extension of H.264 [69] and explained in [75]. The downscaling is done by factors of 2 in both dimensions. In motion estimation of the downscaled sequence, frames with the original resolution are also scaled by the same factor for proper estimation.
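A sketch of this scaling chain with the quoted filters, assuming SciPy is available; mode='nearest' implements the edge-sample padding mentioned above, and the factor of 2 after zero insertion restores the DC gain:

    import numpy as np
    from scipy.ndimage import convolve1d

    down_filt = np.array([0, 2, 0, -4, -3, 5, 19, 26, 19, 5,
                          -3, -4, 0, 2, 0]) / 64.0
    up_filt = np.array([1, 0, -5, 0, 20, 32, 20, 0, -5, 0, 1]) / 64.0

    def downscale2(plane):
        """Low-pass in both directions, then drop every second sample."""
        t = convolve1d(plane.astype(np.float64), down_filt,
                       axis=0, mode='nearest')
        t = convolve1d(t, down_filt, axis=1, mode='nearest')
        return t[::2, ::2]

    def upscale2(plane):
        """Zero-insert, then interpolate with the upsampling filter."""
        h, w = plane.shape
        t = np.zeros((2 * h, 2 * w))
        t[::2, ::2] = plane
        t = convolve1d(t, 2.0 * up_filt, axis=0, mode='nearest')
        return convolve1d(t, 2.0 * up_filt, axis=1, mode='nearest')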
Three different stereoscopic sequences are used to test the performance of the stereoscopic MDC schemes: Train and Tunnel (720 × 576, 25 fps, moderate motion, separate cameras), Fun-fair (360 × 288, 25 fps, high motion, separate cameras) and Xmas (640 × 480, 15 fps, low motion, close cameras). Both algorithms are applied to these videos. In all the experiments, I-frames are inserted every 25 frames. The reconstruction quality is measured by PSNR. The PSNR value of a stereo-pair is calculated according to the formula

\mathrm{PSNR}_{pair} = 10 \log_{10} \frac{255^2}{(D_l + D_r)/2},

where Dl and Dr represent the distortions in the left and right frames [15]. In the experiments, the average PSNRpair is calculated over the processed sequences. Redundancy is calculated as the ratio (in percent) of the additional bitrate to the minimal bitrate R* of the joint coding scheme. To show the characteristics of the video sequences, we code them with the joint coder and the simulcast coder at the same PSNR. The results are shown in Table 11.2. The experiments for MD coding use the same values of D0 and R*, which are given in Table 11.2. One can see that the Train and Tunnel and Fun-fair sequences show low inter-view correlation, while the sequence Xmas shows high inter-view correlation. Thus, Xmas has a high redundancy of simulcast coding ρsim, which is the lower bound for the redundancy of the SS-MDC coding scheme. The SS-MDC scheme is tested for downsampling factors of 2 and 4 in both the vertical and horizontal directions. For each downscaling factor, we change the quantization parameter (QP) of the downscaled sequence to achieve different levels of redundancy. The results for the second scheme (MS-MDC) are given only for one level of redundancy. The reason is that this method does not allow the redundancy to be adjusted, since the coding structure is fixed as in Fig. 11.22. The redundancy of the MS-MDC method takes only one value and is determined by the characteristics of the video sequence. Figure 11.23 shows the redundancy rate-distortion (RRD) curves [63] for SS-MDC and the values for MS-MDC for the test sequences. The results are presented as PSNR of the side reconstruction (D1) vs redundancy ρ. The results for SS-MDC are given for scaling factors 2 and 4. For the sequence Xmas, simulation results for scaling factor 4 are not shown, as the PSNR is much lower than for scaling factor 2. The simulation results show that reconstruction from one description can provide acceptable video quality.

Table 11.2. Joint and simulcast coding. Reprinted from [58], copyright Springer-Verlag Berlin Heidelberg 2006, with kind permission of Springer Science and Business Media
Sequence            D0, dB    R* = Rjoint, Kbps    Rsim, Kbps    ρsim, %
Train and Tunnel    35.9      3624                 3904          7.7
Fun-fair            34.6      3597                 3674          2.2
Xmas                38.7      1534                 2202          43.5
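In code, the stereo-pair PSNR defined above is a one-liner (Dl and Dr being the per-view mean squared errors for 8-bit video):

    import numpy as np

    def psnr_pair(D_l, D_r):
        """Stereo-pair PSNR from the left/right MSE distortions,
        per the formula above (peak value 255)."""
        return 10.0 * np.log10(255.0 ** 2 / ((D_l + D_r) / 2.0))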
[Figure 11.23 plots, panels (a) and (b): side-reconstruction PSNR (dB) vs redundancy (%); curves: SS-MDC (Scal 2) and SS-MDC (Scal 4)]
(a) Train and Tunnel. MS-MDC: D1 = 30.7 dB, ρ = 41.4%.
(b) Fun-fair. MS-MDC: D1 = 26.8 dB, ρ = 24.3%.
(c) Xmas. MS-MDC: D1 = 29.6 dB, ρ = 30.1%.
Fig. 11.23. Redundancy rate-distortion curves for test sequences. Reprinted from [58], copyright Springer-Verlag Berlin Heidelberg 2006, with kind permission of Springer Science and Business Media
The SS-MDC method can perform over a wide range of redundancies. Downscaling with factor 2 provides good visual quality with acceptable redundancy. However, the performance of SS-MDC depends to a great extent on the nature of the stereoscopic sequence. The method can achieve very low redundancy (less than 10%) for sequences with lower inter-view correlation (Train and Tunnel, Fun-fair), but it has higher redundancy for stereoscopic video sequences with higher inter-view correlation (Xmas). The perceptual performance of SS-MDC is quite good, as the stereo-pair perception is mostly determined by the quality of the high-resolution picture. The MS-MDC coder usually operates at 30–50% redundancy and can provide an acceptable side reconstruction even without an error concealment algorithm (simply by copying the previous frame in place of the lost frame). MS-MDC should be used for sequences with higher inter-view correlation, where SS-MDC shows high redundancy.
Table 11.3. Fraction of MVs in the right sequence, MVs/(MVs + DVs). Reprinted from [58], copyright Springer-Verlag Berlin Heidelberg 2006, with kind permission of Springer Science and Business Media

Sequence            Joint    SS-MDC    MS-MDC
Train and Tunnel    0.94     0.78      0.90
Fun-fair            0.92     0.80      0.85
Xmas                0.66     0.56      0.61
The encoder can decide which scheme to use in an adaptive manner by collecting encoding statistics. Table 11.3 shows the statistics of motion vector (MV) prediction for the joint coding mode, SS-MDC and MS-MDC. The statistics are collected over the P-frames of the right sequence. The values in Table 11.3 show the fraction of motion vectors which point to frames of the same sequence, i.e. the ratio of the motion vectors to the sum of the motion and disparity vectors (DVs) in the right-sequence frames, MVs/(MVs + DVs). One can see that the value MVs/(MVs + DVs) correlates with the redundancy of simulcast coding ρsim given in Table 11.2. The value MVs/(MVs + DVs) could thus tell the encoder when to switch from SS-MDC to MS-MDC and vice versa. Both techniques produce balanced descriptions and provide stereoscopic reconstruction of acceptable quality in the case of one channel failure, at the price of moderate redundancy (in the range of 10–50%). They provide drift-free reconstruction in the case of description loss. The SS-MDC approach performs better for sequences with lower inter-view correlation, while the MS-MDC approach performs better for sequences with higher inter-view correlation.
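One possible realization of such a switching rule is sketched below; the threshold value is an illustrative assumption, not a figure from the chapter:

    def choose_stereo_mdc(num_mvs, num_dvs, threshold=0.8):
        """Pick a stereo MDC scheme from the fraction MVs/(MVs + DVs)
        collected over P-frames of the right view: a high fraction indicates
        low inter-view correlation (SS-MDC is cheap), a low fraction
        indicates high inter-view correlation (MS-MDC is preferable)."""
        frac = num_mvs / float(num_mvs + num_dvs)
        return ('SS-MDC' if frac >= threshold else 'MS-MDC'), frac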
11.6 Multiple Description Coding of 3D Geometry

3D mesh data consists of geometry and connectivity information. While the geometry specifies the 3D coordinates of the vertices and can be compressed in a lossy mode, the connectivity data describes the adjacency information between vertices and should be coded in a lossless mode. Recently, the issue of 3D mesh compression has been addressed in a number of publications. Another issue in mesh coding is error-resilient coding of geometry data, which is needed for transmission over an error-prone channel. Special care should be taken in case some parts of the compressed bitstream are lost or erroneous. Error-resilient coding of 3D meshes has not been studied as thoroughly as that of images and video, though several publications have addressed the topic. In [104], error resiliency is achieved by segmenting the mesh and transmitting each segment independently. These segments are then stitched at the
decoder using joint-boundary information, which is classified as the most important. The disadvantage of this method is that it does not provide a coarse-to-fine representation of the received data. Another method [6] achieves error resiliency by assigning optimal error protection codes to the layers of a progressively coded 3D mesh. This method is scalable with respect to both channel bandwidth and channel packet-loss rate, and it is applicable to any progressive compression method that produces a hierarchical bitstream. However, in this method, when data is lost and cannot be recovered at a coarse layer, nothing of the finer layer information can be used and the decoding process terminates. For scenarios in which the mesh data is transmitted over an error-prone channel, error resilience can be achieved by MDC of 3D geometry. In this case, the coded geometry information is split into multiple descriptions, while the compressed connectivity bitstream is included in each description. The extra redundancy added by the latter operation is acceptable, since the compressed connectivity size is much smaller than the compressed geometry size. For the geometry data, the aim is to generate independently decodable descriptions so that the precision of the 3D coordinate values is increased by receiving any of the descriptions. Depending on the principal idea employed, the MDC methods for meshes can be summarized in the following categories: Partitioning Vertex Geometry [45], Multiple Description Scalar Quantization (MDSQ) [12], and Partitioning Wavelet Coefficient Trees [59].

11.6.1 Partitioning Vertex Geometry

This approach is based on the requirement that the geometry information can be represented and coded in a lossy manner, while it is of vital importance that the connectivity information is represented and coded in a lossless manner. Following this requirement, 3D mesh vertices are partitioned and coded into different descriptions, while the connectivity information is included in all descriptions. Losing some descriptions causes the loss of the geometry coordinates of some vertices. In that case, the missing vertices can be estimated from the available ones. The true connectivity information is always available and undistorted, thus helping to estimate the missing vertices. Having the connectivity information coded in all descriptions is not costly, since its size is much smaller than the size of the geometry data [45]. The compression of the connectivity data is accomplished by a method referred to as topological surgery [81]. The compressed bitstream of connectivity data is tightly coupled with the encoded geometry information in each description. One advantage of the topological surgery approach is that it employs a vertex spanning tree, and this tree can be efficiently used to partition the vertices [81]. The compression of the partitioned geometry data in each description is accomplished through arithmetic coding of prediction residuals of the geometry
coordinates. Predictions in each description are performed using only the vertices in the same description to avoid any mismatch. To decrease the loss of compression performance caused by this prediction strategy, a surface-based prediction scheme is proposed [45].

11.6.2 Multiple Description Scalar Quantization Based Approach

Multiple Description Scalar Quantization (MDSQ) [84] has been described in Sect. 11.3.1 with applications to images and video. In this section, we describe how it can be applied to geometry data coding [12]. After the mesh geometry is transformed into the wavelet domain, two independently quantized sets of wavelet coefficients are created. Each description is then obtained by combining a set of coded quantized wavelet coefficients with the compressed bitstream of connectivity data. Progressive Geometry Compression (PGC) has been chosen as the starting compression scheme, as it provides a multiresolution mesh decomposition [51]. It is especially suitable for arbitrary-topology, highly detailed, and densely sampled meshes arising from geometry scanning. The original model in PGC is remeshed to have a semi-regular structure, which allows a wavelet transform based on subdivisions. The remeshing is done in such a way that the error caused by remeshing is smaller than the estimated discretization error. The obtained semi-regular mesh undergoes a Loop or butterfly-based wavelet decomposition to produce a coarsest-level mesh and wavelet detail coefficients. The coarsest-level connectivity is coded by the Touma and Gotsman (TG) coder [83], taking into account its irregular structure. The wavelet coefficient (edge-based) trees are coded by the SPIHT algorithm [73]. While PGC is specifically used in this approach, any other multiresolution mesh coding algorithm can easily be adapted to the proposed method. Detailed block diagrams of the encoder and decoder of the proposed algorithm are given in Figs. 11.24 and 11.25. After applying PGC and MDSQ, two sets of wavelet coefficients are obtained, as if they were two distinct coarsely quantized sets of wavelet coefficients. Then, each set of wavelet coefficients is coded by SPIHT. Descriptions are obtained by adding the TG-coded coarsest-level irregular connectivity data and the coarsest-level geometry data, which is uniformly
Fig. 11.24. Encoder block diagram. Reprinted from [12]. Copyright 2006 IEEE
Fig. 11.25. Decoder block diagram. Reprinted from [12]. Copyright 2006 IEEE
quantized with a chosen number of bits giving acceptable distortion (14 bits is a good suggestion), to each set. The coarsest-level geometry vertices are not quantized into two descriptions by MDSQ, because even small errors at this level cause significant visual distortion. Experiments are performed with the Bunny model, composed of 34835 vertices and 69472 triangles, as described in Sect. 11.6.1. The effects of two parameters of the index assignment matrix are investigated: R (bits per source symbol for the side decoders) and k (half the number of diagonals closest to the main diagonal). Tables 11.4 and 11.5 and Fig. 11.26 show the MDC performance for different R when k = 1, and Tables 11.6 and 11.7 and Fig. 11.27 show the performance for different k when R = 6. Results are given as compressed file sizes in bytes and the relative L2 distance as the objective distortion metric. The L2 distance between two surfaces X and Y is defined as

d(X, Y) = \left( \frac{1}{\mathrm{area}(X)} \int_{x \in X} d(x, Y)^2 \, dx \right)^{1/2},   (11.14)

where d(x, Y) is the Euclidean distance from a point x on X to the closest point on Y. Since this distance is not symmetric, it is symmetrized by taking the maximum of d(X, Y) and d(Y, X). The Metro tool [21] approximates this distance by sampling vertices, edges and triangles and taking the root mean square value of the shortest distances from points in X to the surface Y. The relative L2 distance is obtained by dividing the distance by the bounding box diagonal. All L2 errors in this section are given in units of 10^-4.
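A toy version of the index assignment behind these experiments, filling the diagonals of the side-index matrix nearest the main one; it sketches the MDSQ construction in the spirit of [84] rather than the exact assignment used in [12]:

    def mdsq_index_assignment(num_central_cells, k, R):
        """Toy MDSQ index assignment: enumerate cells (i, j) of the
        2^R x 2^R side-index matrix lying on the 2k+1 diagonals nearest the
        main one, and map successive central-quantizer cells to them. More
        diagonals (larger k) means more central cells, i.e. less redundancy
        but coarser side reconstructions."""
        side_levels = 2 ** R
        pairs = []
        for i in range(side_levels):
            for d in range(-k, k + 1):
                j = i + d
                if 0 <= j < side_levels:
                    pairs.append((i, j))   # (side-1 index, side-2 index)
                if len(pairs) == num_central_cells:
                    return pairs
        return pairs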
Table 11.4. File sizes (bytes) for different R when k = 1. Reprinted from [12]. Copyright 2006 IEEE

Model      ML 4,1    ML 5,1    ML 6,1    ML 7,1    ML 8,1    ML 9,1
Side 1     1757      2402      3548      5152      7775      11941
Side 2     1870      2420      3546      5217      7862      11931
Central    2227      3180      4715      6981      10614     16360
Table 11.5. Relative L2 errors (in units of 10^-4) for the case in Table 11.4. Reprinted from [12]. Copyright 2006 IEEE

Model      ML 4,1    ML 5,1    ML 6,1    ML 7,1    ML 8,1    ML 9,1
Side 1     34.59     24.61     16.48     11.35     7.71      5.04
Side 2     32.95     24.15     17.43     11.49     7.59      4.96
Central    15.58     9.35      6.15      3.87      2.49      1.74
Single     14.32     10.74     6.84      4.57      3.00      2.10
Visual illustrations are also shown in Fig. 11.28. Reconstructed models from one description are labeled Side1 and Side2, and the one using both descriptions is labeled Central. In addition, error values for a single-description coded model having the average file size of the side descriptions are given, labeled Single. Increasing the number of diagonals in the index assignment matrix decreases the redundancy by adding more index values to the central quantizer. This increases the side distortions and decreases the central distortion. However, the increase in the side distortions is greater than the decrease in the central distortion.

11.6.3 Partitioning Wavelet Coefficient Trees

This approach is also based on the Progressive Geometry Compression (PGC) scheme. It can be adapted to any mesh coding scheme employing a wavelet transform and zero-tree coding. In order to generate multiple descriptions, the wavelet coefficient trees are grouped into several sets which are coded independently. These sets are packetized into multiple descriptions in such a way that each description contains one tree set coded at a higher rate and several redundant tree sets coded at lower rates.
[Figure 11.26 plots: file sizes and relative L2 errors vs R for Side 1, Side 2, Central and Single]
Fig. 11.26. (a) File sizes; (b) relative L2 errors for different R when k = 1. Reprinted from [12]. Copyright 2006 IEEE
Table 11.6. File sizes (bytes) for different k when R = 6. Reprinted from [12]. Copyright 2006 IEEE

Model      ML 6,1    ML 6,2    ML 6,3    ML 6,4    ML 6,5    ML 6,6
Side 1     3548      4842      5728      7071      7902      8939
Side 2     3546      4840      5564      6980      7782      8958
Central    4715      6305      7452      8739      9805      10762
The general scheme is shown in Fig. 11.29 [59]. The wavelet coefficient trees are split into several sets Wi, i = 1 . . . N, and coded by the SPIHT algorithm at a high bitrate. Each description contains M copies of different tree sets (M ≤ N). Namely, the i-th description contains one set Wi coded at rate Ri,0 and M − 1 sets of redundant trees Wj, j ≠ i. These M − 1 tree sets represent coding redundancy and are coded at lower rates than Ri,0. The redundancy included in each description is obtained as a result of the optimization algorithm described in the following paragraphs. As the most important information in the embedded stream is located at the beginning of the bitstream, the redundant copies are used when the descriptions with the corresponding high-rate coded tree subsets are lost. If some descriptions are lost, the most important parts of the original trees in those descriptions will be recovered, because their copies at lower rates will be present in the received descriptions. The compressed coarsest mesh representation C with rate RC is included in every description to facilitate the inverse wavelet transform even if only one description is received. Duplicating the coarsest mesh C also increases the coding redundancy. The manner of grouping the coefficient trees into sets is particularly important, since different sets are reconstructed with different quality in the case of description loss. Therefore, 3D mesh locations corresponding to different tree sets will have different quality. To perform the grouping of the trees into sets, an ordering of the coarsest mesh vertices is performed, as proposed in [13, 49]. It provides an ordering of the vertices that has good locality and continuity properties. Then, the desired type of wavelet tree grouping is obtained by sampling the one-dimensional array [13, 49]. Two types of tree set splitting have been compared: first, grouping closely located trees together, and second, grouping spatially disperse trees.
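A sketch of this packetization with a round-robin placement of the redundant copies; the layout and the rate values are illustrative placeholders, since the actual rates come from the bit allocation optimized below:

    def packetize_descriptions(N, M, full_rate, redundant_rates):
        """Build N descriptions: description i carries tree set W_i at the
        full rate plus M-1 lower-rate copies of other sets, placed
        round-robin; the coarsest mesh C is duplicated in every packet."""
        assert len(redundant_rates) == M - 1
        descriptions = []
        for i in range(N):
            desc = {'coarsest_mesh': 'C', 'W%d' % i: full_rate}
            for m in range(1, M):
                desc['W%d' % ((i + m) % N)] = redundant_rates[m - 1]
            descriptions.append(desc)
        return descriptions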
Table 11.7. Relative L2 errors (in units of 10^-4) for the case in Table 11.6. Reprinted from [12]. Copyright 2006 IEEE

Model      ML 6,1    ML 6,2    ML 6,3    ML 6,4    ML 6,5    ML 6,6
Side 1     16.48     26.41     29.18     34.95     37.13     40.30
Side 2     17.43     25.57     28.69     34.11     35.43     39.59
Central    6.15      4.39      3.54      3.06      2.76      2.47
Single     6.84      4.90      4.24      3.34      2.98      2.64
[Figure 11.27 plots: file sizes and relative L2 errors vs k for Side 1, Side 2 and Central]
Fig. 11.27. (a) File sizes; (b) relative L2 errors for different k when R = 6. Reprinted from [12]. Copyright 2006 IEEE
The spatially close grouping is obtained by assigning successive vertices from the array to the same group. The disperse grouping is obtained by sampling the array in a round-robin fashion. It has been observed that the latter case yields annoying artifacts when only one description is received, and that the former case gives better visual quality in general. This is illustrated in Fig. 11.30, where the model Bunny is encoded into four descriptions and optimized for the packet loss rate PLR = 15%. One can see that although the grouping of disperse trees achieves lower objective distortion than the grouping of close trees, it produces annoying visual artifacts. Therefore, the remaining results of this work were obtained with the spatially close grouping method. The redundancy of the proposed algorithm is determined by the number of redundant tree copies, their rates and the coarsest mesh size. The problem of bit allocation is to minimize the expected distortion at the decoder subject to the probability of packet loss P and the target bit budget. A simple channel model is used, where the probability of packet loss P is assumed to be the same for each packet and independent of previous channel events. Another assumption
Fig. 11.28. Reconstructed models for ML(6,1): (a)–(b) reconstruction from one description (Side1 and Side2); (c) reconstruction from both descriptions (Central). Reprinted from [12]. Copyright 2006 IEEE
Fig. 11.29. TM-MDC encoder scheme. Reprinted from [59]
is that one packet corresponds to one description. If a description has to be fragmented into several packets, the probability of description loss P can be derived from the PLR. Suppose that N descriptions are generated. Then, the coefficient trees are split into N tree sets, and M copies of tree sets are included in one description (M ≤ N). Given P, it is easy to determine for each copy of a tree set the probabilities Pj, j = 0, . . . , M, that this copy is used for reconstruction, where P0 is the probability of using the full-rate copy of the tree set and PM is the probability of not receiving any copy of the tree set. The probabilities Pj can easily be found from P and the packetization strategy. Thus, we have to minimize the expected distortion

E[D] = \sum_{i=1}^{N} \sum_{j=0}^{M} P_j D_{ij}(R_{ij}),   (11.15)
where Dij is the distortion incurred by using the j-th copy of tree set i and Rij is the number of bits spent on the j-th copy of the i-th tree set. The optimization is performed under the following bitrate constraint:

\sum_{i=1}^{N} \sum_{j=0}^{M} R_{ij} + N R_C \le R,   (11.16)
Fig. 11.30. Reconstruction of model Bunny from one description (out of four descriptions) for different types of tree grouping. (a) Spatially disperse grouping; PSNR = 50.73 dB. (b) Spatially close grouping, group size is 10; PSNR = 48.51 dB. Reprinted from [59]
where R is the target bitrate and RC is the rate of the coarsest mesh. The rate of the coarsest mesh is held constant, with the geometry information quantized to 14 bitplanes. Optimization of the bit allocation requires computation of the D(R) function for every allocation step. Calculation of D(R) is a computationally expensive operation. However, each tree set contributes to the total distortion D. Since each tree set corresponds to a separate location on the mesh surface (defined by the root edge) when grouping spatially close trees, the distortions corresponding to separate tree sets can be considered additive. Therefore, a distortion-rate (D-R) curve Di(Ri) for each coefficient tree set is obtained in advance. Calculations of Di(Ri) are performed only once, before the optimization algorithm is used for the first time. The D-R curves are then saved and can be reused in the bit allocation algorithm for new values of R and P. The optimization is performed with the generalized Breiman, Friedman, Olshen and Stone (BFOS) algorithm [71]. The BFOS algorithm first allocates a high rate to each copy of the tree set. Then, the algorithm successively deallocates bits from the sets whose D(R) curves exhibit the lowest decay at the allocated bitrate. This process stops when the bit budget constraints are satisfied. In the case where the optimization yields zero rates for some redundant tree copies, these copies are not included in the descriptions. Experiments are performed for the models Bunny and Venus Head. In the experiments, model Bunny is coded into four descriptions at a total of 22972 Bytes (5743 Bytes per description) and into eight descriptions at a total of 25944 Bytes (3243 Bytes per description). Model Venus Head is coded into four descriptions at 24404 Bytes (6101 Bytes per description). The reconstruction distortion metric as in (11.14) is used for comparisons. It is also converted to PSNR as follows: PSNR = 20 log10(peak/d), where peak is the bounding box diagonal and d is the L2 error. In the experiments, three coders are compared. The first coder is denoted Tree-based Mesh MDC (TM-MDC). The second coder is a simple MDC coder in which each description contains the coarsest mesh and one set of wavelet coefficient trees. The sets of coefficient trees in both coders are formed from spatially close groups of trees of size 10. This coder is the same as TM-MDC optimized for P = 0 (for P = 0, no redundant trees are included in the descriptions). The third coder outputs an unprotected SPIHT bitstream. Figures 11.31 and 11.32 show the average distortions for reconstruction from different numbers of received descriptions for model Bunny coded into four and eight descriptions, respectively. The curves are generated for TM-MDC with bit allocations optimized for different P. One can observe from the figures that the coders optimized for higher values of P achieve higher PSNR values than the coders optimized for lower values of P when some descriptions are lost. Since the coder optimized for a higher value of P expects more description losses, it trades off a higher reconstruction distortion when all the descriptions are received for a lower distortion in the cases when some descriptions are lost.
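A greedy rendering of this deallocation loop for the objective (11.15) under the constraint (11.16); the D-R curves are assumed to be precomputed callables, and the step size is an illustrative choice:

    def bfos_deallocate(dr_curves, probs, r_start, budget, step=64):
        """Generalized-BFOS sketch: start every tree-set copy at r_start bits
        and repeatedly remove `step` bits from the copy whose probability-
        weighted distortion grows the least, until the total rate fits the
        budget. dr_curves[c] is the D-R curve of copy c, probs[c] its
        probability P_j of being used for reconstruction."""
        rates = {c: r_start for c in dr_curves}
        while sum(rates.values()) > budget:
            best, best_cost = None, float('inf')
            for c, r in rates.items():
                if r >= step:
                    cost = probs[c] * (dr_curves[c](r - step) - dr_curves[c](r))
                    if cost < best_cost:
                        best, best_cost = c, cost
            if best is None:
                break                      # cannot deallocate any further
            rates[best] -= step
        return rates                       # copies at rate 0 are dropped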
[Figure 11.31 plots: relative L2 error and PSNR vs number of received descriptions; bit allocations optimized for PLR 1% (ρ = 25%), 3% (ρ = 37%), 5% (ρ = 47%) and 15% (ρ = 73%)]
Fig. 11.31. Reconstruction of the Bunny model from different numbers of received descriptions. The results are given for bit allocations optimized for different packet loss rates (PLR). The redundancy ρ is given in brackets. (a) Relative L2 error; (b) PSNR. Reprinted from [59]
On the other hand, the coder optimized for a low value of P expects to receive more descriptions and tries to achieve a better distortion in the case when all the descriptions are received. This can be verified from the figures by observing that the coder optimized for P = 1% shows the best performance when all the descriptions are received. Figure 11.33 compares the performance of the proposed TM-MDC, the simple MD coder and unprotected SPIHT for model Bunny. The results are calculated for P = 0, 1, 3, 5, 10, 15, 20%. In the TM-MDC coder, the bit allocation is optimized for each P. For the simple MDC coder, the redundancy is always fixed at ρ = 10%, which occurs due to the inclusion of the coarsest-level data in each description. For each P, the average distortion is calculated by averaging the results of 100000 experiments (simulations of packet losses). For P = 0, the coders show the same performance, since the TM-MDC coder optimized for P = 0 becomes the same coder as the simple MDC coder. For higher packet loss rates, the performance of the simple MDC coder decreases dramatically, while the reconstruction quality of TM-MDC shows only mild degradation.
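The averaging procedure can be mimicked with a short Monte Carlo loop; the table mapping the number of received descriptions to a distortion value is a hypothetical stand-in for an actual decoder run:

    import random

    def expected_distortion(P, n_desc, distortion_by_count, trials=100000):
        """Estimate the mean distortion under i.i.d. packet loss with
        probability P, as in the chapter's 100000-trial simulations.
        distortion_by_count[m] is the distortion when m descriptions arrive."""
        total = 0.0
        for _ in range(trials):
            received = sum(random.random() > P for _ in range(n_desc))
            total += distortion_by_count[received]
        return total / trials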
[Figure 11.32 plot: PSNR (dB) vs number of descriptions received; curves: allocations for PLR 5% (ρ = 68%), PLR 10% (ρ = 87%), and unprotected SPIHT]
Fig. 11.32. Model Bunny encoded into 8 descriptions at a total of 25944 Bytes and reconstructed from different numbers of descriptions, compared to unprotected SPIHT. Reprinted from [59]
Fig. 11.33. The comparison of network performance of the proposed TM-MDC with a simple MDC scheme and unprotected SPIHT. (a) Relative L2 error; (b) PSNR. Reprinted from [59]
For P = 20%, the optimized TM-MDC coder shows a PSNR 15 dB higher than that of the simple MD coder. In Fig. 11.33, the TM-MDC results are shown under two different labels, namely TM-MDC (L2 dist.) and TM-MDC (Weibull).
Fig. 11.34. Reconstruction of Bunny model from (a) one description (48.36 dB), (b) two descriptions (63.60 dB), (c) three descriptions (71.44 dB), (d) four descriptions (74.33 dB). Reprinted from [59]
Fig. 11.35. Reconstruction of Venus Head model from (a) one description (53.97 dB), (b) two descriptions (65.18 dB), (c) three descriptions (72.51 dB), (d) four descriptions (77.08 dB). Reprinted from [59]
The former corresponds to the results obtained using the original D-R curves in the optimization, while the latter corresponds to the results obtained by modeling the D-R curves with the Weibull model [17] to decrease complexity. The modeling is achieved by fitting the original data to the function

D(R) = a - b\,e^{-cR^d},   (11.17)

where the real-valued parameters a, b, c and d are determined by nonlinear regression. It is observed that the performance of TM-MDC (Weibull) is very close to the performance of TM-MDC (L2 distance), which shows that the D-R curve modeling has been successful. Figure 11.34 shows visual reconstruction results for the model Bunny encoded with redundancy ρ = 63%. Figure 11.35 shows visual reconstruction results for the model Venus Head encoded with redundancy ρ = 53%. The reconstructed models correspond to reconstructions from one, two, three and four descriptions. One can see that even the reconstruction from one description provides acceptable visual quality.
11.7 Conclusions

This chapter has discussed multiple description coding and its relevance to 3DTV. MDC is a coding method for communicating a source over unreliable channels. The source is encoded into several descriptions, which are sent to the decoder independently. A certain amount of redundancy is added to the compressed descriptions to make them useful even when received alone. The decoder can reconstruct the source from one description with low, yet acceptable, quality. The more descriptions are received, the higher the reconstruction quality. The motivation for using multiple description coding in 3DTV arises from the vulnerability of the compressed bitstream to channel errors and erasures, and from the increasing amount of data when transmitting compressed stereoscopic and multi-view video over an error-prone channel. These issues have been discussed in detail in Sect. 11.1. Multiple description coding has a long history, evolving from an information-theoretic problem to practical coding approaches. Section 11.2 has reviewed the studies of the MD rate distortion region. The rate distortion function for multiple descriptions has not been found in general. The RD region for multiple descriptions is completely known only for the special case of a memoryless Gaussian source and the squared error distortion. This special case is quite important, as it provides a bound for all other memoryless sources with the same variance. For this particular case, the rate distortion region coincides with the achievable region of El Gamal and Cover [26]. For other sources with squared error distortion, one can use (11.3) to obtain a bound for the MD rate-distortion function. The results on the multiple description RD region have been used to evaluate the performance of multiple description codes. This has been discussed in Sect. 11.3. Sections 11.3 and 11.4 have surveyed approaches to MDC of images and video, which can be successfully applied to 3DTV as well. In particular, Sect. 11.3 has addressed the problem of adding controllable redundancy at one of the typical blocks of the transform coder, namely at the transform stage, the quantization stage, the scanning process, or even at the preprocessing stage and the channel coding stage. As argued in Sect. 11.4, MDC of video is inherently more complicated than MDC of images. In video coders based on motion compensation, multiple descriptions can be generated from the motion information and/or the motion-compensated difference. In MC-based coders, a mismatch between the prediction loops in the encoder and decoder can appear when reconstructing from a single description. Many MD video approaches try to control this mismatch, either by fully eliminating it, coding it partially, or ignoring it. Similar to MD image coding, multiple descriptions of video can be generated during the preprocessing stage (subsampling in the temporal or spatial domains) or at the channel coding stage (MD-FEC). MD video coders based on 3D-transforms are often constructed by adapting image coding approaches to coding in three dimensions.
As multi-view, stereo and some 3D mesh coders consist of similar blocks, the approaches to MDC of images and video can also be applied to coding multi-view or 3D mesh data. However, MDC of multi-view video and 3D meshes can also exploit other types of redundancy which are inherently present in these types of data. Section 11.5 has introduced MDC of stereoscopic video and related it to the monoscopic video approaches. Due to the more complicated prediction structure of stereoscopic video coding, it is difficult to apply MDC approaches based on the three-loop structure. Conversely, approaches based on temporal subsampling are readily applicable to MDC of stereoscopic and multi-view video. Two techniques for MDC of stereoscopic video have been presented. Both techniques produce balanced descriptions and provide stereoscopic reconstruction with acceptable quality in the case of one channel failure, at the price of moderate redundancy in the range of 10–50%. Both techniques provide drift-free reconstruction in the case of description loss. The performance of these approaches depends on the characteristics of the stereoscopic video sequence. The approach called SS-MDC performs better for sequences with lower inter-view correlation, while the MS-MDC approach performs better for sequences with higher inter-view correlation. A criterion for switching between the approaches is used by the encoder to choose the approach that provides better performance for the particular sequence or part of the sequence. One can consider extending the MDC methods developed for stereoscopic video to multi-view video coding. The MS-MDC approach can be readily applied to coding multi-view video. However, the inter-view prediction in multi-view video is less efficient compared to stereoscopic video due to the greater distance between the cameras. This should be taken into account when designing MDC schemes for multi-view video. In Sect. 11.6, three MDC techniques for 3D geometry coding have been presented. The TM-MDC and MDSQ-based methods use the progressive geometry compression approach, which makes it possible to achieve high compression ratios in coding highly detailed 3D meshes. The vertex geometry partitioning method is expected to perform competitively for 3D meshes with a small number of vertices. The advantage of the TM-MDC method is its ability to create more than two descriptions. This method can also be flexibly adapted to varying channel bandwidth and channel error rate. The D-R curve modeling used in TM-MDC significantly decreases the number of computations, making it possible to efficiently exploit this method in practical applications.
Acknowledgement

This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. R. Ahlswede, On multiple descriptions and team guessing, IEEE Trans. Inform. Theory IT-32, 807–814, 1983.
2. R. Ahlswede, The rate distortion theory for multiple descriptions without excess rate, IEEE Trans. Inform. Theory 31, no. 6, 502–521, 1985.
3. A. Aksay, C. Bilen, E. Kurutepe, T. Ozcelebi, G. Bozdagi Akar, R. Civanlar, and M. Tekalp, Temporal and spatial scaling for stereoscopic video compression, in Proc. EUSIPCO'06 (Florence, Italy) 8, Sept. 2006.
4. E. Akyol, A. M. Tekalp, and M. R. Civanlar, Scalable multiple description video coding with flexible number of descriptions, in Proc. IEEE Int. Conf. Image Processing 3, 712–715, Sept. 2005.
5. A. Albanese, J. Blomer, J. Edmonds, M. Luby, and M. Sudan, Priority encoding transmission, IEEE Trans. Inform. Theory 42, 1737–1744, 1996.
6. G. AlRegib, Y. Altunbasak, and J. Rossignac, An unequal error protection method for progressively transmitted 3-D models, IEEE Trans. Multimedia 7, 766–776, 2005.
7. J. Apostolopoulos, Error-resilient video compression through the use of multiple states, in Proc. Int. Conf. Image Processing 3, 352–355, Sept. 2000.
8. J. Apostolopoulos and S. Wee, Unbalanced multiple description video communication using path diversity, in Proc. Int. Conf. Image Processing 1, 966–969, Oct. 2001.
9. I. Bajic and J. Woods, Domain-based multiple description coding of images and video, IEEE Trans. Image Process. 12, 1211–1225, 2003.
10. T. Berger, Rate distortion theory, Prentice Hall, Englewood Cliffs, NJ, 1971.
11. T. Berger and Z. Zhang, Minimum breakdown degradation in binary source encoding, IEEE Trans. Inform. Theory IT-29, 807–814, 1983.
12. M. O. Bici and G. Bozdagi Akar, Multiple description scalar quantization based 3D mesh coding, in Proc. IEEE Int. Conf. Image Processing (Atlanta, US), Oct. 2006.
13. A. Bogomjakov and C. Gotsman, Universal rendering sequences for transparent vertex caching of progressive meshes, Computer Graphics Forum 21, 137–148, 2002.
14. N. Boulgouris, K. Zachariadis, A. Leontaris, and M. Strintzis, Drift-free multiple description coding of video, in Proc. IEEE Int. Workshop Multimedia Signal Processing 1, 105–110, 2001.
15. N. V. Boulgouris and M. G. Strintzis, A family of wavelet-based stereo image coders, IEEE Trans. Circuits Syst. Video Technol. 12, no. 10, 898–903, 2002.
16. C.-S. Kim, R.-C. Kim, and S.-U. Lee, Matching pursuits multiple description coding for wireless video, IEEE Trans. Circuits Syst. Video Technol. 11, 1011–1021, 2001.
17. Y. Charfi, R. Hamzaoui, and D. Saupe, Model-based real-time progressive transmission of images over noisy channel, in Proc. WCNC'03 (New Orleans, LA) 347–354, Mar. 2003.
18. D.-M. Chung and Y. Wang, Multiple description image coding based on lapped orthogonal transforms, in Proc. IEEE Int. Conf. Image Processing (ICIP'98) 1, 664–668, Oct. 1998.
19. D.-M. Chung and Y. Wang, Multiple description image coding using signal decomposition and reconstruction based on lapped orthogonal transforms, IEEE Trans. Circuits Syst. Video Technol. 9, 895–908, 1999.
20. D.-M. Chung and Y. Wang, Lapped orthogonal transforms designed for error-resilient image coding, IEEE Trans. Circuits Syst. Video Technol. 12, 752–764, 2002.
21. P. Cignoni, C. Rocchini, and R. Scopigno, Metro: Measuring error on simplified surfaces, Computer Graphics Forum 17, 167–174, 1998.
22. D. Comas, R. Singh, and A. Ortega, Rate-distortion optimization in a robust video transmission based on unbalanced multiple description coding, in Proc. IEEE Int. Workshop Multimedia Signal Processing (Cannes, France) 581–586, Oct. 2001.
23. D. Comas, R. Singh, A. Ortega, and F. Marques, Unbalanced multiple-description video coding with rate-distortion optimization, EURASIP J. Appl. Signal Processing no. 1, 81–90, 2003.
24. S. Diggavi, N. Sloane, and V. Vaishampayan, Asymmetric multiple description lattice vector quantizers, IEEE Trans. Inform. Theory 48, no. 1, 174–191, 2002.
25. I. Dinstein, M. G. Kim, A. Henik, and J. Tzelgov, Compression of stereo images using subsampling transform coding, Optical Engineering 30, no. 9, 1359–1364, 1991.
26. A. A. El-Gamal and T. M. Cover, Achievable rates for multiple descriptions, IEEE Trans. Inform. Theory 28, 851–857, 1982.
27. W. Equitz and T. Cover, Successive refinement of information, IEEE Trans. Inform. Theory 37, no. 2, 269–275, 1991.
28. M. Fleming and M. Effros, Generalized multiple description vector quantization, in Proc. Data Compression Conference 3–12, Mar. 1999.
29. N. Franchi, M. Fumagalli, R. Lancini, and S. Tubaro, A space domain approach for multiple description video coding, in Proc. IEEE Int. Conf. Image Processing 3, 253–256, Sept. 2003.
30. N. Franchi, M. Fumagalli, and R. Lancini, Flexible redundancy insertion in a polyphase down sampling multiple description image coding, in Proc. IEEE Int. Conf. Multimedia Expo 2, 605–608, Aug. 2002.
31. M. Fumagalli, D. Sagetong, and A. Ortega, Estimation of erased data in a H.263 coded stream by using unbalanced multiple description coding, in Proc. Int. Conf. Multimedia Expo 2, 13–16, July 2003.
32. V. Goyal, Multiple description coding: Compression meets the network, IEEE Signal Processing Mag. 18, 74–93, 2001.
33. V. Goyal, J. Kelner, and J. Kovacevic, Multiple description vector quantization with a coarse lattice, IEEE Trans. Inform. Theory 48, no. 3, 781–788, 2002.
34. V. Goyal and J. Kovacevic, Optimal multiple description transform coding of Gaussian vectors, in Proc. IEEE Data Compression Conf. 388–397, Mar. 1998.
35. V. Goyal and J. Kovacevic, Generalized multiple description coding with correlating transforms, IEEE Trans. Inform. Theory 47, no. 6, 2199–2224, 2001.
36. V. Goyal, J. Kovacevic, R. Arean, and M. Vetterli, Multiple description transform coding of images, in Proc. Int. Conf. Image Processing 1, 674–678, Oct. 1998.
37. V. Goyal, J. Kovacevic, and M. Vetterli, Multiple description transform coding: Robustness to erasures using tight frame expansions, in Proc. IEEE Int. Symp. Inform. Theory (Cambridge, MA) 408, Aug. 1998.
38. V. Goyal, J. Kovacevic, and M. Vetterli, Quantized frame expansions as source-channel codes for erasure channels, in Proc. IEEE Int. Conf. Data Compression 326–335, Mar. 1998.
11 Multiple Description Coding and its Relevance to 3DTV
423
39. V. Goyal, M. Vetterli, and N. Thao, Quantized overcomplete expansions in Rn : Analysis, synthesis, and algorithms, IEEE Trans. Inform. Theory 44, no. 1, 16–31, 1998. 40. ISO/IEC, Information technology – open systems interconnection – basic reference model: The basic model, ISO/IEC 7498-1: 1994(E), Nov. 1994. 41. ITU-T, Video coding for low bit rate communication, ITU-T Rec. H.263; version 1, Nov. 1995; version 2, Jan. 1998; version 3, Nov. 2000. 42. ITU-T and ISO/IEC JTC 1, Advanced video coding for generic audiovisual services, ITU-T Rec. H.264; ISO/IEC 14496-10 AVC, 2003. 43. A. Mohr, E. Riskin, A. Lippman, J. Goshi, and R. Ladner, Unequal loss protection for H.263 compressed video, in Proc. IEEE Data Compression Conf. 73–82, Mar. 2003. 44. H. Jafarkhani and V. Tarokh, Multiple description trellis coded quantization, in Proc. IEEE Int. Conf. Image Processing (ICIP98) 1, 669–673, Oct. 1998. 45. P. Jaromersky, X. Wu, Y. Chiang, and N. Memon, Multiple-description geometry compression for networked interactive 3D graphics, in Proc. ICIG’2004, 468–471, Dec. 2004. 46. W. Jiang and A. Ortega, Multiple description coding via scaling-rotation transform, in Proc. Int. Conf. Acoustics Speech Signal Processing 5, 2419–2422, Mar. 1999. 47. B. Julesz, Foundations of cyclopeon perception, The University of Chicago Press, 1971. 48. M. Karczewicz and R. Kurceren, The SP- and SI-frames design for H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. 13, 637–644, 2003. 49. Z. Karni, A. Bogomjakov, and C. Gotsman, Efficient compression and rendering of multi-resolution meshes, in Proc. IEEE Int. Conf. Visualization (Boston, US) Oct. 2002. 50. J. Kelner, V. Goyal, and J. Kovacevic, Multiple description lattice vector quantization: Variations and extension, in Proc. IEEE Data Compression Conf. (Snowbird, UT) 480–489, Mar. 2000. 51. A. Khodakovsky, P. Schr¨ oder, and W. Sweldens, Progressive geometry compression, in Proc. Comput. Graph. SIGGRAPH 2000, 271–278, 2000. 52. C.-S. Kim and S.-U. Lee, Multiple description motion coding algorithm for robust video transmission, in Proc. ISCAS 2000 4, 717–720, May 2000. 53. , Multiple description coding of motion fields for robust video transmission, IEEE Trans. Circuits Syst. Video Technol. 11, no. 9, 999–1010, 2001. 54. T. Linder, R. Zamir, and K. Zeger, The multiple description rate region for high resolution source coding, in Proc. IEEE Data Compression Conf. 49–158, Mar. 1998. 55. A. Miguel, A. Mohr, and E. Riskin, SPIHT for generalized multiple description coding, in Proc. IEEE Int. Conf. Image Processing 3, 842–846, Oct. 1999. 56. A. Mohr, E. Riskin, and R. Ladner, Generalized multiple description coding through unequal loss protection, in Proc. IEEE Int. Conf. Image Processing 1, 411–415, 1999. 57. A. Munos, T. Blu, and M. Unser, Least squares image resizing using finite differences, IEEE Trans. Image Processing 10, 1365–1378, 2001. 58. A. Norkin, A. Aksay, C. Bilen, G. Bozdagi Akar, A. Gotchev, and J. Astola, Schemes for multiple description coding of stereoscopic video, LNCS, in Proc. MRCS 2006 (Springer-Verlag Heidelberg) 4105, 730–737, Sept. 2006.
424
A. Norkin et al.
59. A. Norkin, M. O. Bici, G. Bozdagi Akar, A. Gotchev, and J. Astola, Waveletbased multiple description coding of 3-D geometry, in Proc. VCIP’07, Proc. SPIE (San-Jose, US) 6508, 65082I–1–65082I–10, Jan. 2007. 60. A. Norkin, A. Gotchev, K. Egiazarian, and J. Astola, Low-complexity multiple description coding of video based on 3D block transforms, EURASIP J. on Embedded Systems 2007, Article ID 38631, 11 pages, 2007. doi:10.1155/2007/38631. 61. , Two-stage multiple description image coders: Analysis and comparative study, Signal Processing: Image Communication 21/8, 609–625, 2006. 62. M. Orchard and G. Sullivan, Overlapped block motion compensation: an estimation-theoretic approach, IEEE Trans. Image Processing 3, no. 5, 693–699, 1994. 63. M. Orchard, Y. Wang, V. Vaishampayan, and A. Reibman, Redundancy rate distortion analysis of multiple description image coding using pairwise correlating transforms, in Proc. Int. Conf. Image Processing (Santa Barbara, CA) 608–611, Oct. 1997. 64. L. Ozarow, On a source-channel coding problem with two channels and three receivers, Bell Syst. Tech. J. 59, no. 10, 1909–1921, 1980. 65. S. Regunathan and K. Rose, Efficient prediction in multiple description video coding, in Proc. IEEE Int. Conf. Image Processing 1, 1020–1023, Sept. 2000. 66. A. Reibman, H. Jafarkhani, Y. Wang, and M. Orchard, Multiple description video using rate-distortion splitting, in Proc. IEEE Int. Conf. Image Processing (ICIP2001) 1, 978–981, Oct. 2001. 67. A. Reibman, H. Jafarkhani, Y. Wang, M. Orchard, and R. Puri, Multiple description coding for video using motion-compensated prediction, in Proc. IEEE Int. Conf. Image Processing (ICIP99) 3, 837–841, Oct. 1999. 68. , Multiple description coding for video using motion-compensated temporal prediction, IEEE Trans. Circuits Syst. Video Technol. 12, 193–204, 2002. 69. J. Reichel, H. Schwarz, and M. Wien, Scalable video coding – working draft 3, JVT-P201 (Poznan, PL) 24–29, July 2005. 70. B. Rimoldi, Successive refinement of information, IEEE Trans. Inform. Theory 40, no. 1, 253–259, 1994. 71. E. Riskin, Optimum bit allocation via generalized BFOS algorithm, IEEE Trans. Inform. Theory 37, 400–4002, 1991. 72. P. Sagetong and A. Ortega, Optimal bit allocation for channel-adaptive multiple description coding, in Proc. Video Commun. Image Processing (San Jose, CA) 53–63, Jan. 2000. 73. A. Said and W. Pearlman, A new, fast, and efficient image codec based on set partitioning in hierarchical trees, IEEE Trans. Circuits Syst. Video Technol. 6, no. 3, 243–250, 1996. 74. A. Secker and D. Taubman, Motion-compensated highly scalable video compression using an adaptive 3D wavelet transform based on lifting, in Proc. IEEE Int. Conf. Image Processing 2, 1029–1032, 2001. 75. S. A. Segall, Study upsampling/downsampling for spatial scalability, JVT-Q083 (Nice, FR, PL) 14–21 Oct. 2005. 76. S. Servetto, V. Vaishampayan, and N. Sloane, Multiple description lattice vector quantization, in Proc. IEEE Data Compression Conf. (Snowbird, UT) 13–22, Mar. 1999.
11 Multiple Description Coding and its Relevance to 3DTV
425
77. S. D. Servetto, K. Ramchadran, V. Vaishampayan, and K. Nahrstedt, Multipledescription wavelet based image coding, in Proc. IEEE Int. Conf. Image Processing (Chicago, IL) 1998. 78. , Multiple-description wavelet based image coding, IEEE Trans. Image Processing 9, no. 5, 813–826, 2000. 79. R. Singh and A. Ortega, Erasure recovery in predictive coding environments using multiple description coding, in Proc. IEEE 3D Workshop Multimedia Signal Processing 333–338, Sept. 1999. 80. X. Tang and A. Zakhor, Matching pursuits multiple description coding for wireless video, IEEE Trans. Circuits Syst. Video Technol. 12, 566–575, 2002. 81. G. Taubin and J. Rossignac, Geometric compression through topological surgery, ACM Trans. Graphics 17, no. 2, 84–115, 1998. 82. T. Tillo and G. Olmo, A novel multiple description coding scheme compatible with the JPEG2000 decoder, IEEE Signal Processing Lett. 11, 908–911, 2004. 83. C. Touma and C. Gotsman, Triangle mesh compression, in Proc. Graphics Interface (Vancouver, BC, Canada) Jun. 1998. 84. V. Vaishampayan, Design of multiple description scalar quantizers, IEEE Trans. Inform. Theory 39, no. 3, 821–834, 1993. 85. , Application of multiple description codes to image and video transmission over lossy networks, in Proc. 7th Int. Workshop Packet Video (Brisbane, Australia) 55–60, Mar. 1996. 86. V. Vaishampayan and J.-C. Batllo, Asymptotic analysis of multiple description quantizers, IEEE Trans. Inform. Theory 44, no. 1, 278–284, 1998. 87. V. Vaishampayan and J. Domaszewicz, Design of entropy-constrained multiple description scalar quantizers, IEEE Trans. Inform. Theory 40, no. 1, 245–250, 1994. 88. V. Vaishampayan, N. Sloane, and S. Servetto, Multiple description vector quantization with lattice codebooks: design and analysis, IEEE Trans. Inform. Theory 47, no. 5, 1718–1734, 2001. 89. M. van der Schaar and D. Turaga, Multiple description scalable coding using wavelet-based motion compensated temporal filtering, in Proc. IEEE Int. Conf. Image Processing 2, 489–492, Sept. 2003. 90. R. Venkataramani, G. Kramer, and V. Goyal, Multiple description coding with many channels, IEEE Trans. Inform. Theory 49, no. 9, 2106–2114, 2003. 91. Y. Wang and S. Lin, Error-resilient coding using multiple description motion compensation, in Proc. IEEE Int. Workshop Multimedia Signal Processing (MMSP01) 441–446, Oct. 2001. , Error-resilient coding using multiple description motion compensation, 92. IEEE Trans. Circuits Syst. Video Technol. 12, no. 6, 438–452, 2002. 93. Y. Wang, M. Orchard, and A. Reibman, Multiple description image coding for noisy channels by pairing transform coefficients, in Proc. IEEE First Workshop Multimedia Signal Processing (San Diego, CA) 419–424, June 1997. 94. , Optimal pairwise correlating transform for multiple description coding, in Proc. Int. Conf. Image Processing 1, 679–683, Oct. 1998. 95. Y. Wang, M. Orchard, V. Vaishampayan, and A. Reibman, Multiple description coding using pairwise correlating transforms, IEEE Trans. Image Processing 10, no. 3, 351–366, 2001. 96. Y. Wang, A. Reibman, and S. Lin, Multiple description coding for video delivery, in Proc. of IEEE 93, 57–70, 2005.
426
A. Norkin et al.
97. Y. Wang, A. Reibman, M. Orchard, and H. Jafarkhani, An improvement to multiple description transform coding, IEEE Trans. Image Processing 50, no. 11, 2843–2854, 2002. 98. Y. Wang and C. Wu, A mesh-based multiple description coding method for network video, in Proc. 18th Int. Conf. Advanced Information Networking and Application (AINA) 1, 549–554, Sept. 2004. 99. S. Wenger, G. Knorr, J. Ott, and F. Kossentini, Error resilience support in H.263+, IEEE Trans. Circuits Syst. Video Technol. 8, no. 7, 867–877, 1998. 100. H. S. Witsenhausen, On source networks with minimal breakdown degradation, Bell Syst. Tech. J. 59, no. 6, 1083–1087, 1980. 101. H. S. Witsenhausen and A. D. Wyner, Source coding for multiple descriptions. II: A binary source, Bell Syst. Tech. J. 60, no. 10, 2281–2292, 1980. 102. J. Wolf, A. Wyner, and J. Ziv, Source coding for multiple descriptions, Bell Syst. Tech. J. 59, no. 8, 1417–1426, 1980. 103. W. Woo and A. Ortega, Optimal blockwise dependent quantization for stereo image coding, IEEE Trans. on Cirquits Syst. Video Technol. 9, 861–867, 1999. 104. Z. Yan, S. Kumar, and C.-C. J. Kuo, Error resilient coding of 3-D graphic models via adaptive mesh segmentation, IEEE Trans. Circuits Syst. Video Technol. 11, 860–873, 2001. 105. R. Zamir, Gaussian codes and shannon bounds for multiple descriptions, IEEE Trans. Inform. Theory 45, no. 7, 2629–2636, 1999. 106. Z. Zhang and T. Berger, New results in binary multiple descriptions, IEEE Trans. Inform. Theory 33, 502–521, 1987. 107. A. Smolic, R. Sondershaus, N. Stefaroski, L. V´ aˇsa, K. M¨ uller, J. Ostermann, and T. Wiegand, A surrey to coding of static and dynamic 3D meshes, in Three Dimensional Television: Capture, Transmission, and Display, eds. H. M. Ozaktas and L. Onural, Springer, 2007(this book). 108. A. Smolic, P. Merkle, K. M¨ uller, C. Fehn, P. Kauff, and T. Wiegand, Compression of Multi-View Video and Associated Data, in Three Dimensional Television: Capture, Transmission, and Display, eds. H. M. Ozaktas and L. Onural, Springer, 2007(this book).
12 3D Watermarking: Techniques and Directions

Alper Koz1, George A. Triantafyllidis2, and A. Aydin Alatan1

1 Department of Electrical and Electronics Engineering, M.E.T.U., Ankara, Turkey
2 Informatics and Telematics Institute, Thessaloniki, Greece
12.1 Introduction

In the last decade, the use of 3D information to model and represent real-world scenes has become widespread in many applications, owing to rapid advances in computer graphics, capture technology, image-based rendering methods, and VLSI systems. Applications requiring 3D information first arose in the field of computer graphics, in order to create realistic visual content from available 3D models. The use of 3D information has since spread to animated films, video games, and stereoscopic displays. In the foreseeable future, we can expect even more sophisticated technologies, such as free-view television (FTV) [39], where the user can select an arbitrary viewpoint while watching a real scene (Fig. 12.1), and, ultimately, 3D holographic television, which is expected to generate an exact replica of a real scene via specially designed displays in our homes [46]. In this scenario, there is a strong need for techniques that protect the ownership rights of the original 3D data and prevent unauthorized duplication or tampering. It should be noted that 3D scene representations are entirely digital; hence, all digital rights management (DRM) solutions suitable for other digital modalities could easily be applied to 3D scene descriptions as well. For instance, in a typical encrypted TV broadcast, a content provider might deliver media containing the 3D scene information to the customer, who can then use a valid license to play the media (Fig. 12.2). An exception to this easy migration of general-purpose DRM solutions to 3D data is encountered in the field of data hiding. Although many approaches have been proposed for protecting the rights of 2D multimedia content owners by invisibly hiding owner information in the data, these methods cannot easily be extended to 3D content. This chapter therefore focuses on technologies that adopt solutions other than current DRM systems: 3D watermarking methodologies are examined in detail, after a brief introduction to the field of watermarking.
Fig. 12.1. A general scheme for free-view television
12.2 Fundamentals of Watermarking

The ease with which digital content can be copied rapidly, perfectly, and in unlimited numbers has created a copyright protection problem for content owners. Watermarking has been proposed as a solution to this problem [14, 17, 41, 54]. A watermark, a secret imperceptible signal, is embedded into the original data in such a way that it remains detectable as long as the perceptual quality of the content stays at an acceptable level.
Fig. 12.2. A typical DRM system
In the case of multiple ownership claims, the owner of the original data proves ownership by extracting the watermark from the watermarked content. A general scheme for digital watermarking is given in Fig. 12.3. The secret signature (watermark) is embedded into the cover image using a secret key at the coder (C). Only the owner of the data has this key, and it is not possible to remove the message from the data without knowledge of the key. The watermarked image then passes through the transmission channel, which includes degradations, transformations, and possible attacks on the image, such as lossy compression, geometric distortions, digital-to-analog and analog-to-digital conversions, and other signal processing operations. After the watermarked image passes through this channel, the decoder (D) attempts to extract the message.

12.2.1 Watermarking Applications

Although the main motivation behind digital watermarking is copyright protection, its application area is much wider, including broadcast monitoring, fingerprinting, authentication, and covert communications [20, 23, 29, 40]. By embedding watermarks into commercial advertisements, an automated system can monitor whether the advertisements are broadcast at the correct times [23, 29]. The system receives the broadcast data and searches for watermarks, identifying where and when each advertisement was broadcast. The same process can also be used for video and audio clips; musicians and actors may wish to ensure that they receive accurate royalties for broadcasts of their performances. Fingerprinting is an approach for tracing the source of illegal copies [23, 29]. The owner of the digital data may embed a different watermark into each copy of the content, customized for each recipient.
Fig. 12.3. A general block diagram for watermarking
In this manner, the owner can identify the customer by extracting the watermark if the data is supplied to third parties. Watermarking can also be used to authenticate digital content against changes: if the content is tampered with, the same change affects a fragile watermark embedded in it, indicating which part of the content has been altered. Covert communication is another possible application of digital watermarking [23, 29]. A secret message can be imperceptibly embedded as a watermark in a digital image or video, conveying information from the sender to the intended receiver while maintaining a low probability of interception by unintended receivers. There are also non-security applications of digital watermarking, such as the indexing of videos, movies, and news items, where markers and comments can be inserted for use by search engines [23]. Another application is the detection and concealment of image/video transmission errors [1]: for block-based coded images, summary data for each block is extracted and hidden in another block by data hiding; at the decoder side, this data is used to detect block errors.

12.2.2 Watermarking Requirements

A digital watermarking process is usually evaluated by the following criteria: perceptual transparency of the watermark; robustness against the attacks encountered; bit rate of the data embedding process; false positive rate of watermark detection; whether recovery requires access to the original signal; the speed and computational complexity of the embedding and retrieval processes; and the ability of the embedding and retrieval modules to integrate into standard encoding and decoding processes [14, 23, 29, 31]. Depending on the application, the constraints applied during evaluation differ. For example, in a video indexing application, robustness against attacks is of little importance, since the video is unlikely to undergo hostile signal processing operations. In a covert communication application, if real TV broadcasting is used as the communication channel, it is preferable to use a blind watermarking scheme, which does not need the original data during watermark detection. If the application is copyright protection, the insertion and detection of the watermark may be allowed to be time-consuming; in a broadcast monitoring application, on the other hand, watermark detection must be fast enough to keep up with real-time broadcasting. Considering the main motivation behind digital watermarking and most of its other applications, the major common requirements for useful and effective watermarks are imperceptibility, robustness against intended or unintended signal operations, and capacity.
Imperceptibility refers to the perceptual similarity between the original and watermarked data. The owner of the original data usually does not tolerate any degradation of the original; therefore, the original and watermarked data should be perceptually equivalent. The imperceptibility of the watermark is mostly tested by means of subjective experiments [42]. Robustness can be defined as the ability to detect the watermark after the watermarked data has passed through a particular signal processing operation. The attacks against which a watermarking method should be robust depend on the application. For instance, while robustness against channel transmission alone is sufficient for broadcast monitoring, this is not the case for copyright protection, where it is unknown in advance which signal processing operations will be applied to the watermarked data; the scheme should then be robust to any signal processing operation that preserves the quality of the watermarked data. Capacity refers to the ability to verify and distinguish between different watermarks with an arbitrarily low probability of error as the number of differently watermarked versions of an image increases [49]. In other words, it can be defined as the amount of information that can be hidden in the data without producing perceptible distortions in the content. There is a strong trade-off between these three requirements. Increasing capacity may produce visible deformations in the content. Increasing watermark strength to gain robustness may likewise degrade imperceptibility. Capacity is also inversely related to robustness: as the number of hidden bits increases, it becomes harder to extract the watermark without bit errors after attacks on the content. The optimal compromise between these requirements also depends on the application. While in some applications, such as broadcast monitoring, the number of hidden bits should be sufficient to differentiate all broadcasts from each other, others, such as copyright protection, may require only one bit of hidden information, indicating the owner of the content. A watermarking scheme should take all of these trade-offs into account to achieve the optimal solution.
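To make the embedding-detection flow and the strength trade-off concrete, a minimal additive spread-spectrum sketch for grayscale images is given below. The parameter names (alpha, key, threshold) are illustrative assumptions, not taken from any specific scheme cited in this chapter; increasing alpha raises robustness at the cost of imperceptibility.

# Additive spread-spectrum watermarking: a key-derived pseudorandom
# pattern is added to the cover image; detection correlates the received
# image with the same pattern and compares against a threshold.
import numpy as np

def embed(cover: np.ndarray, key: int, alpha: float = 2.0) -> np.ndarray:
    rng = np.random.default_rng(key)
    w = rng.choice([-1.0, 1.0], size=cover.shape)   # +/-1 watermark chips
    return np.clip(cover + alpha * w, 0, 255)

def detect(image: np.ndarray, key: int, threshold: float = 1.0) -> bool:
    rng = np.random.default_rng(key)
    w = rng.choice([-1.0, 1.0], size=image.shape)
    # correlation is ~alpha for a marked image and ~0 otherwise
    return float(np.mean((image - image.mean()) * w)) > threshold

cover = np.random.default_rng(0).integers(0, 256, (256, 256)).astype(float)
assert detect(embed(cover, key=42), key=42) and not detect(cover, key=42)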
12.3 3D Watermarking

During the last decade, watermarking has been one of the most active research topics, attracting the interest of researchers from different backgrounds, such as signal processing, communication and information theory, cryptography, and computational vision. However, most of this research effort has focused on the digital watermarking of still images [3, 16, 30], video [10], or audio streams [35]. Accordingly, watermarking technology for these media is approaching maturity. By contrast, 3D watermarking must still be considered immature, although there are many applications in which 3D content is utilized and consumed.
In this chapter, the term 3D watermarking refers to the watermarking of any representation of a 3D scene. Watermarking technology aims to protect a 3D representation by embedding hidden data into some component of that representation. Based on these components, scene representations can be categorized as illustrated in Fig. 12.4. While the representations on the left side of this plot depend mainly on the geometric structure of the scene, the representation at the rightmost end is based purely on images captured by cameras appropriately located in the scene. In a geometry-based representation, a scene has three fundamental components: geometry, texture, and a map defining the relation between the texture and the geometry. When an arbitrary view of the scene is formed during rendering, these three components are used together with the given lighting conditions of the scene. In an image-based representation, on the other hand, a scene is represented only by its 2D projections, i.e., the images captured by cameras; arbitrary views are generated by applying special interpolation techniques to the original camera images. Considering the basic components of each representation, the watermarking methods in the literature can be classified into three groups:
• Geometry watermarking
• Texture watermarking
• Image-based watermarking (i.e., watermarking for image-based representations of 3D scenes)
The definitions and requirements of the corresponding watermarking problems differ in each of these groups. The first group focuses mostly on protecting the intellectual property rights of the 3D geometric structure, which is the most significant part of any scene; most of the techniques in the 3D watermarking literature belong to this group [5, 25, 44, 52, 53, 61, 65].
Fig. 12.4. Categorization of scene representations [38]. Reprinted with kind permission of Springer Science and Business Media
The main aim of this group of techniques is extraction of the hidden watermark after any attack on the scene geometry. Such attacks include rotation, translation, uniform scaling, polygon simplification, randomization of points, mesh compression, remeshing, mesh smoothing, cut operations, local transformations, global transformations, and other operations that change the structure of the geometry while preserving visual quality at a desired level [43]. This variety of possible attacks makes 3D geometry watermarking more challenging than image and video watermarking. The second group deals with watermarking the texture information of the scene. Although there is only a small amount of research on texture watermarking within 3D watermarking, some recent papers introduce the problem and state its requirements [22]. The main goal of texture watermarking is to extract the watermark, originally hidden in the texture of the 3D object, from rendered images or videos (obtained by projecting the 3D object onto 2D image planes), thus protecting any visual representation of the object. Attacks on the texture include not only operations on the object's texture itself, such as subsampling, JPEG compression, or malicious attacks, but also modifications of the texture mapping and distortions of the object's geometric description [22]. The final group in 3D watermarking aims to protect image-based representations of a 3D scene. While the first two groups try to protect the intellectual property rights of the two important components of a 3D scene representation (geometry and texture), the third group approaches the problem as the watermarking of image sequences recording the same 3D scene, with the watermark to be extracted from any rendered image generated for an arbitrary view angle. Given the simplicity of image-based representation techniques compared with traditional 3D scene descriptions, and the rapid advances of recent years, watermarking methods for the protection of image-based representations of 3D scenes can be expected to become more attractive in the coming years. Table 12.1 summarizes these categories.

Table 12.1. 3D watermarking categories, their protected components, and typical attacks

Category                 | Protected Component          | Attacks
Geometry watermarking    | Geometry representation      | Geometric transformations; mesh operations (compression, simplification, smoothing, ...)
Texture watermarking     | Texture image                | Any processing of the texture; geometric distortions
Image-based watermarking | Images representing a scene  | Multi-view compression; image-based rendering operations
In the following sections, the requirements, problems, and state-of-the-art methods for each group are presented in detail.
12.4 3D Geometry Watermarking

Since geometry is the fundamental information source for a 3D scene in many applications, watermarking research has focused mostly on the geometric information of the scene, and most methods in the literature belong to this group. A general scheme for geometry watermarking is given in Fig. 12.5. The watermark is imperceptibly embedded into the object, and the watermarked object is delivered over the channel. The channel may introduce any distortion of the geometry arising from malicious or non-malicious use of the object; for instance, the model after the channel in Fig. 12.5 is a translated, rotated, scaled, and smoothed version of the originally watermarked model. The extraction stage then attempts to detect the watermark in the test object. Depending on the extraction algorithm, the original model may not be required. The special requirements and problems of geometry watermarking are summarized in the following subsections.

12.4.1 Requirements of 3D Geometry Watermarking

As with the general requirements for image and video watermarking, a watermarking system designed for the geometric structure of a 3D scene or object should satisfy requirements on capacity, robustness, and imperceptibility. Regarding capacity, the system should allow the embedding of nontrivial amounts of data.
Fig. 12.5. A general scheme for geometry watermarking
In addition, a watermarking scheme should be able to distinguish between different watermarks with a low probability of error as the number of differently watermarked versions of a digital content increases [49]. This requirement is critical especially for public watermarking schemes, where a different watermark is embedded into each copy for every licensee or buyer. The capacity requirement does not differ significantly from that of image and video watermarking. As a second requirement, a watermarking system for scene geometry should be robust to all geometric or topological operations, as long as the visual quality of the geometric model is not severely degraded. Such operations may include the following [43]:
• Rotation, translation, and uniform scaling
• Polygon simplification (often needed to achieve adequate rendering speed)
• Randomization of points
• Re-meshing (re-triangulation): generating equally shaped patches with equal angles and surface
• Mesh smoothing operations
• Cut (sectioning) operations: removing parts of the model, as in backface culling
• Local deformations
In addition to these requirements, the distortion of the geometry due to watermarking should be imperceptible. The imperceptibility requirement for a geometric model is, however, not a trivial problem; on the contrary, it is more complicated than for image and video watermarking. The following section presents the problems that arise in watermarking 3D geometric models while fulfilling these requirements.

12.4.2 Fundamental Problems in 3D Geometry Watermarking

Watermarking of 3D geometric models poses some specific problems, owing to a type of representation different from that of image and video signals. These problems are tabulated in Table 12.2. The first problem arises in synchronizing the watermark sequence to the 3D geometry data, since 3D models have no implicit data ordering. While the data in images and video frames are scan-line ordered, 3D model data, such as vertices, edges, and faces, can only be ordered after fixing an orientation and a position of the data in space [43]. Therefore, most 3D watermarking methods in the literature first convert the 3D model into a translation-, rotation-, and scale-invariant domain, and then apply the watermarking scheme to the model. For example, to achieve translation invariance, the center of mass of the object is usually chosen as the origin of the x, y, and z axes [25, 58]. Rotation invariance is obtained by placing the 3D object such that its principal component coincides with the z-axis of the xyz-space. Scale invariance is usually achieved by normalizing the r (radius) components of the model's vertices so that the distance of the furthest vertex from the origin equals one [58]. It is obvious that an attack could also distort the very parameters of the object used to achieve translation, rotation, and scale invariance, making watermark extraction an even more difficult problem.
Table 12.2. Problems in 3D geometry watermarking compared with image and video watermarking

                     | Image and Video | 3D meshes
Representation       | Two- and three-dimensional functions on a manifold grid | Lack of unique representation; different meshes can represent the same surface
Synchronization      | Scan-line ordering | Requires a fixed orientation and position of the data in space
Handling and Editing | Well-defined compression standards (JPEG, JPEG 2000, MPEG, H.26x); well-analyzed transforms used in transmission, compression, and filtering (DCT, FFT, wavelet, etc.) | No well-accepted compression standard; multi-resolution representation techniques are comparatively new
Robustness           | Common signal processing during compression and transmission, synchronization attacks, cropping, etc. | A high number of diverse attacks, including geometric transformations (translation, scaling, rotation, affine, etc.), local modifications, and topological changes
Imperceptibility     | Experimentally well-studied perceptual analysis, inherited from vision research | No experimental study on the perceptual limits of watermarks for geometry
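The invariant-domain normalization described above can be sketched in a few lines. The following is a minimal illustration, assuming the mesh geometry is an (N, 3) array of vertex coordinates; a real scheme must additionally resolve the remaining sign ambiguity of each principal axis.

# Canonical-pose normalization: centroid to origin, principal axis to z,
# furthest vertex at unit distance. A sketch, not a complete scheme.
import numpy as np

def normalize_pose(vertices: np.ndarray) -> np.ndarray:
    v = vertices - vertices.mean(axis=0)        # translation invariance
    _, eigvecs = np.linalg.eigh(np.cov(v.T))    # eigenvalues in ascending order
    R = eigvecs.T                               # rows: axes; last row = principal
    if np.linalg.det(R) < 0:                    # keep a right-handed frame
        R[0] *= -1
    v = v @ R.T                                 # principal component -> z-axis
    return v / np.linalg.norm(v, axis=1).max()  # scale invariance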
Another problem in 3D model watermarking is the lack of uniqueness in the representation of model data. Since the same surface can be represented by different sets of vertices, edges, and faces, a scheme that embeds the watermark into the values of a model's geometric primitives should also be able to extract the watermark from any other representation, as long as the visual quality of the model is not degraded. This makes watermark embedding and extraction in 3D models considerably more complicated. As mentioned before, a further challenge in 3D geometry watermarking is achieving robustness against a large number of diverse attacks. In addition to translation, rotation, and scaling of the object, more general transformations, such as affine transformations, are also common [25, 58]. Furthermore, local deformations may reshape part of the model, and topological modifications, such as resection of a part of the model, mesh simplification, or re-meshing, may also be applied.
A watermarking scheme for 3D geometric models should survive all such attacks, as long as the visual quality of the model is preserved at an acceptable level. Handling and editing of 3D geometric models is also a more serious problem than for their image and video counterparts [43]: it may involve complex geometric and topological operations, as mentioned above, and the common operations of 2D signal processing are difficult to extend to 3D content. The final problem in 3D watermarking concerns the imperceptibility requirement. In image and video watermarking [6, 21, 49], invisibility is ensured by using explicit perceptual measures [31], termed just noticeable differences (JNDs), to determine the embedding location and strength of the watermark. JNDs are determined in vision research through perceptual experiments on cosine patterns at different spatial and temporal frequencies. If a method does not exploit such perceptual metrics, at least the PSNR is used to assess watermark energy and imperceptibility. In the 3D case, however, setting up the perceptual experiments and determining JNDs is not a trivial problem; moreover, 3D processing lacks a metric for the general evaluation of invisibility and watermark energy, such as the PSNR in the image case.

12.4.3 3D Geometry-Based Watermarking Methods

As noted before, the main trend in 3D watermarking focuses on watermarking 3D geometry-based data, since in the majority of 3D applications the most valuable component of a scene is its geometry. In this chapter, geometry watermarking means watermarking the geometry component of the object(s) residing in a scene. An object can be represented by points, meshes, or voxels. Among these, mesh-based representation is the most common in 3D applications; hence, in watermarking research, 3D mesh watermarking has been examined in more detail than the other representations of 3D geometry. The classification presented in this chapter therefore mainly considers watermarking methods applicable to 3D meshes, although the few existing methods for 3D point and voxel representations are also placed within the given classification. In a 3D mesh representation, an object is formed of geometric primitives, such as points, lines, polygons, polyhedra, and connected polyhedra [44] (see Fig. 12.6). A watermarking scheme can exploit the geometric values of these primitives, such as the coordinates of a point, the length of a line, the area of a polygon, the volume of a polyhedron, the ratio of the areas of two polygons, the two quantities that define a set of similar triangles, or the ratio of the volumes of two polyhedra. Some examples of geometric primitives for a mesh representation are shown in Fig. 12.7. In addition to these geometric primitives, it is also possible to embed the watermark by changing the topology of the 3D model.
Fig. 12.6. An example for a mesh representation: Bunny mesh [60]
For instance, a watermarking scheme can use one of the two alternative ways of triangulating a quadrilateral, or two different mesh sizes, to embed a watermark bit of 1 or 0, as shown in Fig. 12.8 [44]. While methods exploiting geometric attributes form the majority of the geometry-based methods, approaches based on the topological properties of the 3D geometric model have received less attention in the literature.
Fig. 12.7. Some of the geometric primitives of a mesh model that could be exploited for watermarking: (a) coordinates of the vertices (X, Y, Z), length of the lines (l), area of a triangle (A); (b) ratio of the areas of two triangles (A1/A2); (c) volume of a tetrahedron (V); (d) ratio of the volumes of two tetrahedra (V1/V2); (e) normal vectors of the triangles forming the mesh
Fig. 12.8. Some examples of alternative topological structures for watermarking [45]. Reprinted with permission from IEEE. Copyright 1998 IEEE
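To see why the dimensionless primitives of Fig. 12.7 are attractive carriers, note that they can be computed directly from vertex coordinates and are unchanged by rigid motions and, for the ratios, by uniform scaling. A small sketch with illustrative names, not taken from any cited method:

# Two invariant primitives from Fig. 12.7: a triangle-area ratio and a
# tetrahedron-volume ratio. Both survive rotation, translation, and
# uniform scaling of the model.
import numpy as np

def triangle_area(p0, p1, p2):
    return 0.5 * np.linalg.norm(np.cross(p1 - p0, p2 - p0))

def tetra_volume(p0, p1, p2, p3):
    return abs(np.dot(np.cross(p1 - p0, p2 - p0), p3 - p0)) / 6.0

p = np.random.default_rng(1).normal(size=(5, 3))
area_ratio = triangle_area(p[0], p[1], p[2]) / triangle_area(p[1], p[2], p[3])
vol_ratio = tetra_volume(p[0], p[1], p[2], p[3]) / tetra_volume(p[1], p[2], p[3], p[4])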
The methods of 3D geometry watermarking can be divided into two main groups based on the embedding domain of the watermark: spatial domain and transform domain methods. The general classification is given in Fig. 12.9. Spatial domain methods embed the watermark directly into the values of the geometric primitives; most methods in this group first extract perceptually significant geometric primitives from the model and then embed the watermark into those primitives using a specially designed procedure (Fig. 12.10). In transform domain methods, a 3D object is decomposed into subsignals by applying a 3D geometry-based transformation, such as a wavelet transform [53] or mesh spectral analysis [52]; after the transformation, the watermark is embedded into the resulting transform coefficients (Fig. 12.11), as illustrated by the sketch below. The members of each group are explained in detail in the following sections, together with the advantages and disadvantages of each group.
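As a concrete, much-simplified illustration of the transform-domain branch, the following sketch projects mesh geometry onto the eigenvectors of the graph Laplacian of the mesh connectivity and perturbs mid-band coefficients. The band index and additive rule are illustrative assumptions; this is not the exact algorithm of [52] or [53], and the dense eigendecomposition limits it to small meshes.

# Transform-domain embedding via a spectral basis of the mesh:
# project vertex coordinates onto the Laplacian eigenbasis, perturb
# mid-frequency coefficients, and reconstruct the vertex positions.
import numpy as np

def spectral_embed(vertices, edges, bits, alpha=0.01, band=10):
    n = len(vertices)
    A = np.zeros((n, n))
    for i, j in edges:                    # adjacency from mesh connectivity
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian
    _, basis = np.linalg.eigh(L)          # columns ordered low -> high frequency
    spec = basis.T @ np.asarray(vertices) # spectral coefficients, shape (n, 3)
    for k, b in enumerate(bits):          # perturb mid-frequency coefficients
        spec[band + k] += alpha * (1.0 if b else -1.0)
    return basis @ spec                   # watermarked vertex positions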
Fig. 12.9. Categorization of 3D geometry-based watermarking
Fig. 12.10. A general scheme for the spatial domain methods
12.4.4 Spatial Domain Methods

Spatial domain methods can be categorized into three groups (Fig. 12.9). The first group embeds the watermark into the vertex coordinates of a 3D mesh model, while the second group exploits the other geometric primitives of the mesh. The third category follows a different strategy, applying 2D image watermarking methods to meshes after transforming the 3D geometry into a two-dimensional form; although this group has only a few representatives in the literature, it is treated as a separate category because of its originality.

12.4.4.1 Modifying Vertex Coordinates

The methods of this group modify the vertex coordinates of a 3D mesh model to embed the watermark. Some approaches add the watermark directly to the vertex coordinates (Fig. 12.12a), while others move a vertex, depending on the watermark bit, into one of two predefined regions around its location (Fig. 12.12b). These approaches are examined in the following sections.

Embedding the Watermark Directly into Vertex Coordinates

The first method in this group [9] modifies the vertex coordinates of a 3D model by modulating the watermark signal with a global scaling factor and a masking weight, and then adding it to the coordinate values. The masking weight for each vertex is determined by averaging the position differences between that vertex and its connected vertices. Since locations where sharp variations occur contain the perceptually significant parts of a 3D model, such a masking weight can give perceptually acceptable results.
Fig. 12.11. A general scheme for the transform domain methods
Fig. 12.12. Two different approaches to modifying the vertex coordinates for watermarking: (a) adding a watermark pattern (Xw, Yw, Zw) directly to the coordinates of the vertex; (b) moving the vertex into one of the predefined regions around it, according to the watermark bit
In fact, more deliberate masking weights could be obtained by using well-defined edge detectors tailored to 3D geometric models. The method is shown to be robust to additive random noise, MPEG-4 SNHC compression, and mesh simplification [9]; its advantage is its simple implementation compared to other spatial domain methods. In a similar method [64, 66], the watermark is again embedded additively into the vertex coordinates. The scheme also varies the strength of the watermark signal adaptively with respect to the local geometry, and embeds the watermark information by modifying the distances of the vertices to the center of the model. The method distributes the information corresponding to each watermark bit over the entire model via vertex scrambling. Scrambling the watermark over the entire model yields a scheme that is more resilient to cropping attacks, as well as to mesh simplification and additive noise, where most transform domain methods fail. In [58], the watermark is embedded into the radius components of the vertices in a spherical coordinate system. The 3D object is first translated into a new coordinate system so that its center of mass coincides with the origin. Then the vertices are converted to spherical coordinates (r, θ, ϕ), and all components rj (j = 1, 2, ..., N) are quantized into K uniformly spaced bins using

Q(rj) = INT( ((rj − rmin) / (rmax − rmin)) × K + 0.5 )

where K is selected according to the number of watermark bits, and rmax and rmin are the maximum and minimum values of the r-component. The sample means Ek(r), taken over the vertices whose radii fall in bin k, are calculated and arranged in descending order of vertex density per bin. The binary watermark wk is embedded into the sample means, in this descending order, as

E'k(r) = (1 + αRk) × Ek(r)

where Rk = −1 if wk = 0 and Rk = 1 otherwise, and the parameter α is the embedding strength.
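A compact sketch of this embedding rule, for the radial components only; the density-based ordering of the bins and the convex-set procedure for choosing α (described next) are omitted, and the handling of the edge bins is simplified.

# Bin-based embedding on the radial components r, in the spirit of [58]:
# quantize r into K bins and scale each bin (hence its sample mean E_k(r))
# by (1 + alpha * Rk). Assumes non-degenerate radii (r.max() > r.min()).
import numpy as np

def embed_radii(r: np.ndarray, bits, alpha: float = 0.02) -> np.ndarray:
    K = len(bits)
    q = ((r - r.min()) / (r.max() - r.min()) * K + 0.5).astype(int)
    q = np.clip(q, 1, K)                   # fold edge values into bins 1..K
    out = r.copy()
    for k, w in enumerate(bits, start=1):
        Rk = 1.0 if w else -1.0            # Rk = +1 for bit 1, -1 for bit 0
        out[q == k] *= 1.0 + alpha * Rk    # scales the bin mean E_k(r)
    return out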
To extract the watermark without the original mesh, it suffices to know the indices of the bins into which the watermark was embedded and the sample means Ek(r) before embedding. The critical point in the embedding process is the choice of the parameter α. To satisfy the trade-off between imperceptibility and robustness, the mesh model is projected onto two constraint convex sets (one for robustness and the other for imperceptibility), and the admissible range of α is determined from these sets. In the detection scheme, the watermarked object is passed through the same operations up to the determination of the sample means Ekw, which are then compared with Ek in each bin to extract the watermark bits. The method is robust to scaling and rotation without using the original object in the detection stage; in the case of an affine transformation, however, it requires the original object to determine the necessary transformations before extraction.

Modifying Vertex Positions with Respect to the Watermark Bit

The methods in this approach change the positions of vertices so as to satisfy requirements specially defined for each watermark bit. For instance, one method in this group [32, 56] enforces constraints on the probability distribution of the vertices surrounding a vertex in order to embed a watermark bit at that vertex. The first step is to translate the 3D object so that its center of mass corresponds to the origin of the axes. Then, to achieve rotation invariance, the object is rotated so that its principal component u coincides with the z-axis. Next, the model is converted to spherical coordinates, and the watermark is embedded into the r-component (the distance of a vertex from the origin) for scale invariance. To achieve robustness against mesh simplification, every watermark sample is embedded into a set of vertices rather than a single vertex. The r-Θ plane (in spherical coordinates) is decomposed into subranges Θj, and a random variable dr(u_i^s) is formed from the r-component of the vertex u_i^s, where u_i^s denotes the spherical coordinates of vertex ui:

dr(u_i^s) = r(u_i^s) − H(u_i^s)

Here H is a local neighborhood operation over the vertices around u_i^s; it is an approximation of r(u_i^s) that depends on the neighborhood of u_i^s and is chosen so that dr(u_i^s) follows a zero-mean Gaussian distribution with variance σ². The distribution of dr(u_i^s) is then modified in each subrange Θj according to the watermark bit. If the watermark bit equals 1, the r-components of some of the vertices v_i^s with dr(v_i^s) > bσ are altered so as to fall inside (0, bσ). If the watermark bit is −1, the r-components of some of the vertices v_i^s with dr(v_i^s) < −bσ are altered so as to fall inside (−bσ, 0).
This operation changes the probabilities prob(dr(v_i^s) > bσ) and prob(dr(v_i^s) < −bσ) in each subrange Θj of the watermarked 3D object. In the original object, both probabilities equal G(−b), where G denotes the error function, erfc, for a zero-mean Gaussian distribution with variance σ²; in the watermarked object, they become smaller than G(−b), which is the main logic used during watermark detection. The probability prob(dr(v_i^s) > bσ) is computed as the ratio of the number of vertices satisfying dr(v_i^s) > bσ to the total number of vertices in the model, and prob(dr(v_i^s) < −bσ) is calculated similarly. The method is claimed to be the first blind 3D object watermarking algorithm (i.e., one that does not require the original mesh during detection) robust to translation, rotation, scaling, and mesh simplification. However, cropping, one of the typical attacks, is not considered in the robustness tests. Another method that changes the positions of vertices for watermark embedding is proposed in [61]. In this method, each selected vertex is placed in one of two predefined regions around it, according to the watermark bit; these regions are determined from the local moments of the vertices in the neighborhood of the selected vertex. The algorithm consists of two steps [61]. In the first step, a chain of vertices and their neighborhoods is selected and ordered. The most appropriate vertices for watermarking are those in mesh areas consisting of small polygons; such regions are the counterparts of textured, detailed, or noisy image regions, which are considered appropriate for image watermarking. To ensure imperceptibility, modifications are made only in regions consisting of small polygons, selected using a threshold T(Vi) that depends on the distance D(Vi) of a vertex Vi to its neighboring vertices. In the second step [61], the locations of the selected vertices are changed according to their local neighborhood moments and the embedded information bit. The watermark code is embedded into a set of B vertices and their neighborhoods. Two separate geometric regions are defined in the space given by the set {Vi, N(Vi)}, one for embedding a bit of 0 and the other for a bit of 1, and the watermarked vertex is moved into the region corresponding to the watermark bit. The method determines these regions as two parallel planes derived from the geometry of the neighborhood N(Vi). In more recent work [4, 5], the method is improved by defining bounding ellipsoids, instead of parallel planes, for each watermark bit. In the detection stage, the same vertex selection and ordering procedure is applied to the test object, and the watermark is retrieved by determining the region in which each vertex lies. The comparison in [5] between this new approach and the method of [61] indicates the superiority of bounding ellipsoids over parallel planes in terms of imperceptibility and robustness.
The method is robust to common attacks such as rotation, scaling, and other affine transformations that change the vertex order; robustness against cropping is one of its exceptional advantages. In [47, 48], a 3D watermarking technique based on a Generalized Radon Transform (GRT) is described. It should be noted that this GRT is not a transform in the usual signal processing sense; it is a variation of the Radial Integration Transform (RIT) [19], namely the Cylindrical Integration Transform (CIT), which integrates the information of a 3D model over cylinders, beginning from its center of mass. Figure 12.13 illustrates the computation of the CIT for a model's vertices: the dots indicate the vertices, the line segments Li indicate lines ending on the surface of the bounding unit sphere, and the cylinders CYLi indicate the cylindrical integration areas. The watermarking technique embeds a specific model identifier, representing the 3D model's descriptors, into the vertices of the model via modifications of their locations. The method is robust to geometric distortions such as translation, rotation, and uniform scaling, as well as to any vertex reordering. However, it is vulnerable to attacks such as mesh smoothing, cropping, and local deformations, since these operations change the shape of the 3D model, which then cannot be used as a query model. Besides these methods proposed for copyright protection, there are also fragile watermarking methods focusing on authenticating the integrity of 3D meshes [27, 62]. The first method in this category embeds a fragile watermark by iteratively perturbing vertex coordinates until one predefined hash function applied to each vertex matches another predefined hash function applied to that vertex. Since the algorithm relies heavily on an ordered traversal of vertices, it is capable of detecting object cropping [27].
Fig. 12.13. Cylindrical integration transform (CIT) [47]. Reprinted with permission from IEEE. Copyright 2004 IEEE
However, the method does not take into account authentication after local modifications, vertex reordering, and certain particular attacks, such as floating-point truncation or quantization. In [27], these problems are handled: the proposed method not only achieves localization of malicious modifications, but is also robust to certain incidental data processing, such as quantization of vertex coordinates and vertex reordering.

12.4.4.2 Methods Based on Other Geometric Primitives

The methods in this class change geometric quantities of a 3D model other than the vertex coordinates. The method in [43], one of the earliest and best-known methods for 3D model watermarking, embeds the watermark information into the surface geometry. The scheme first maps surface normals onto a unit sphere and then subtly alters groups of similar normals in order to embed the watermark bits; a simplified sketch of this normal-binning stage is given after the list below. It should be noted that operations on a 3D model such as remeshing, mesh simplification, cropping, randomization of points, and similar attacks may cause substantial changes in the model's vertex and face set configuration, adjacencies, and topology without changing the perceived quality of the model; there is an almost infinite number of meshes representing or approximating any particular surface. Therefore, the watermark should be embedded into geometric primitives of the model that are not affected by such attacks as long as the visual quality of the model is not degraded. The idea in [43] is to use collections of surfaces as the embedding primitive: if the representation of the model changes due to these operations, the new vertex and face configuration must still maintain the global surface characteristics (size, orientation, and curvature) of at least the perceivable features; otherwise, a significant loss of visual quality would occur. The embedding and retrieval processes consist of several stages, as illustrated in Fig. 12.14. In this method [42], the embedding process takes as input an original model M, a key, and a bit string S of length N. The system calculates consistent face normals from the actual face normals of M; the original model M with consistent normals, or parts of it, denoted RVector in the illustration, must be stored. From the key, N non-overlapping bins (bin centers and radii) are derived, and the consistent face normals are sampled using these bins. The core embedding process then modifies model vertices in order to change the bin contents (normals) with respect to certain measures, called feature types. The outputs of the embedding process are the model M', the watermarked copy of M in which the bit string S has been embedded; the N original feature values, called FVector, needed as reference values in the retrieval process; and the aforementioned RVector. The retrieval process in [42] takes as input the watermarked model M', the key, the feature vector FVector, and the additional original features RVector. First, consistent normals are calculated from the actual face normals of M'.
Fig. 12.14. The dataflow between stages in embedding and retrieval process in [43]. Reprinted with kind permission of SPIE
face normals of M'. Next, M' is reoriented with respect to the original model M. For this process the RVector information is required. Then, the bins are constructed from the key (or alternatively from the center and radii information contained in the FVector). In the core retrieval process, the actual feature values are calculated and compared against those in FVector, which yields the bit string S' as a result. If S' differs from S, and a constant part of S indicates that the actual reorientation accuracy is insufficient, further orientations are tested until S' matches the constant fraction of the bit string S sufficiently well. With this system, Benedens [43] primarily aims at robustness against:
• Randomization of points
• Mesh altering (re-meshing) operations or attacks
• Polygon simplification
However, the method demonstrates robustness against only simplification attacks [43]. Moreover, one drawback of the algorithm is the large amount of a priori data needed before watermark retrieval. A more significant drawback is the amount of preprocessing needed before the core watermarking algorithm can be applied.
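To make the normal-grouping idea above concrete, the following minimal Python sketch computes face normals, maps them onto the unit sphere and groups them into key-derived bins. It illustrates only the general EGI-style grouping; the bin construction in [43] uses explicit bin centers and radii derived from the key, and the function names and pseudo-random bin generation here are our own assumptions.

import numpy as np

def face_normals(vertices, faces):
    v0, v1, v2 = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)   # unit normals, one per face

def bin_normals(normals, key, n_bins=16):
    rng = np.random.default_rng(key)
    centres = rng.standard_normal((n_bins, 3))            # illustrative key-derived bin centres
    centres /= np.linalg.norm(centres, axis=1, keepdims=True)
    return np.argmax(normals @ centres.T, axis=1)         # nearest centre (maximum dot product)

# toy tetrahedron
V = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
F = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
print(bin_normals(face_normals(V, F), key=42))

The actual embedding step of [43], which moves vertices so that the contents of selected bins encode the watermark bits, is omitted here.

Another scheme exploiting the normal vector distributions of surface patches of a 3D model is proposed in [34]. The method embeds the watermark into the consistent normal vectors. In order to satisfy robustness against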
partial geometric deformation, an equal number of watermark bits is embedded into each patch of the subdivided mesh. The method uses the EGI distribution of Benedens' algorithm [43] to achieve robustness against simplification and remeshing attacks. The main improvements over [43] are robustness against the cropping attack, achieved by spreading the watermark over different patches, and the exploitation of the normal vector distribution of each patch instead of the normal distribution of the entire model, as is the case in [43]. Another well-known method in this group is proposed in [44, 45]. The paper presents three watermarking methods, namely the triangle similarity quadruple algorithm (TSQ), tetrahedral volume ratio embedding (TVR) and the mesh density pattern embedding algorithm (MDP). In the first algorithm, a pair of dimensionless quantities, such as a pair of angles in a triangle, or the ratio of the lengths of two adjacent sides of a triangle together with the ratio of the height of a triangle to its base, is used as the geometrical embedding primitive to watermark triangle meshes; a small sketch illustrating the invariance of such quantities is given below. Since these dimensionless ratios are invariant to rotation, translation and uniform scaling, the watermark is robust against those operations. In addition, the watermark also survives resection and local deformation, since subscript arrangement to determine the watermark embedding locations and repeated embedding are utilized during watermark insertion. In the second method, the ratio of the volumes of a pair of tetrahedrons is selected as the embedding primitive. Since the ratio of the volumes of two polyhedrons is invariant to affine transformation, the watermark also survives affine transformation, as well as the aforementioned attacks of the first method. The Tetrahedral Volume Ratio (TVR) algorithm shows near optimal properties with respect to capacity, execution speed, and monitoring capabilities. Its significant drawbacks include vulnerability to remeshing operations, polygon simplification, and point randomization. Nevertheless, the algorithm is well suited for embedding public watermarks. The third method, Mesh Density Pattern embedding (MDP), first tessellates the given curved surfaces and then embeds a visible pattern by modulating the sizes of the triangles in the output mesh. This simple method survives any type of geometrical transformation. However, it is vulnerable to a remeshing attack that generates patterns with mostly identical shapes (angles and size).
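The invariance that TSQ relies on is easy to verify numerically. The short Python sketch below (an illustration only; the test triangle and variable names are arbitrary) computes the interior angles of a triangle before and after a similarity transformation, confirming that such dimensionless quantities are unchanged by rotation, translation and uniform scaling.

import numpy as np

def angles(tri):
    out = []
    for i in range(3):
        u = tri[(i + 1) % 3] - tri[i]
        v = tri[(i + 2) % 3] - tri[i]
        c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        out.append(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
    return np.array(out)                       # interior angles in degrees

tri = np.array([[0.0, 0, 0], [2, 0, 0], [0.5, 1.5, 0]])
th = 0.7                                       # rotate about z, scale by 3, translate
R = np.array([[np.cos(th), -np.sin(th), 0],
              [np.sin(th),  np.cos(th), 0],
              [0,           0,          1]])
tri_t = 3.0 * tri @ R.T + np.array([5.0, -2.0, 1.0])
print(angles(tri))                             # approx. [71.57, 45.0, 63.43]
print(angles(tri_t))                           # identical up to floating point error

12.4.4.3 Methods based on 2D (Image) Watermarking

In the last category, a different method based on 2D image watermarking is presented. Although there is only one representative of this group in the literature, it is categorized as a separate group due to its novelty. The proposed method [25] first extracts a 2D image from a 3D model and then exploits a DCT-based image watermarking algorithm instead of the usual 3D watermarking methods, which mainly embed the watermark by performing slight modifications on the vertex coordinates of 3D models. Extracting a 2D image from a 3D model also allows using any other image watermarking algorithm during 3D watermarking.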
Fig. 12.15. 2D grid on the side face of the scanning cylinder [25]. Reprinted with permission from IEEE. Copyright 2002 IEEE
In the extraction process of the 2D image, a cylinder, denoted as the scanning cylinder, is placed around the object. The radius and height of this virtual cylinder are set such that the 3D model just fits inside the cylinder. As a result, the cylinder undergoes the same amount of uniform scaling as the 3D model does, which achieves robustness to the scaling attack. Then, an Nv × Nu grid is constructed on the side face of the cylinder, as shown in Fig. 12.15. The grid points, where the ranges are calculated, are denoted by (ur, vr) ∈ {(0, 0), . . . , (Nv − 1, Nu − 1)}. The vertical line (vr = 0) lies in the direction of s3 and vr increases in the counterclockwise direction. Figure 12.16 shows the general situation of virtual ranging in [25]. From (ur, vr) on the grid, which corresponds to the point q' in (x, y, z) coordinates, a line is drawn towards the point q on the opposite side of the cylinder. The triangle abc represents one of the triangular faces which intersects the line segment q'q. Note that the triangle abc faces the point q', i.e., the order of vertices is a → b → c. Here, l is the range value that is to be found for every (ur, vr) on the grid. The corresponding 2D image of l values is linearly mapped to [0, 255] and a DCT-based algorithm is used for watermarking. The method is robust against mesh simplification and noise.
Fig. 12.16. Ranging situation [25]. Reprinted with permission from IEEE. Copyright 2002 IEEE
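The following Python sketch approximates the range image construction described above. Instead of the exact ray-to-face intersection used in [25], it simply bins the mesh vertices by height and azimuth on the scanning cylinder and keeps the largest radial distance per grid cell; the grid orientation and this vertex-binning shortcut are our simplifying assumptions.

import numpy as np

def cylinder_range_image(vertices, Nu=64, Nv=128):
    p = vertices - vertices.mean(axis=0)        # centre the model on the cylinder axis
    r = np.hypot(p[:, 0], p[:, 1])              # radial distance from the axis
    theta = np.mod(np.arctan2(p[:, 1], p[:, 0]), 2 * np.pi)
    z = p[:, 2]
    u = np.clip(((z - z.min()) / max(np.ptp(z), 1e-12) * Nu).astype(int), 0, Nu - 1)
    v = (theta / (2 * np.pi) * Nv).astype(int) % Nv
    img = np.zeros((Nu, Nv))
    np.maximum.at(img, (u, v), r)               # farthest vertex seen by each grid cell
    # linear mapping of the range values to [0, 255], as in [25]
    return np.round(255 * img / max(img.max(), 1e-12)).astype(np.uint8)

The resulting 8-bit image can then be fed to any 2D image watermarking algorithm, which is exactly the appeal of this approach.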
12.4.5 Transform Domain Methods

Transform domain methods embed the watermark into the coefficients that result after a transformation is applied to the 3D geometry model. In general, watermarking techniques in this category are spread-spectrum type methods, where the watermark is embedded additively into the decomposed signal at different resolutions after the transformation. Considering the transformations applied to the model, the transform domain methods can be categorized into two groups: methods based on spectral analysis [17, 24, 50, 51, 52] and methods based on the wavelet transformation [18, 19, 33, 53]. These two categories can be considered as the analogues of the DCT- or DFT-based methods and the wavelet-based methods in image watermarking, respectively. While the former transformations are applied to the entire image and give the high and low frequency content at only one level, the wavelet representation gives coarse and detail images at each subband level, obtained by applying recursive downsampling operations on the image.

12.4.5.1 3D Geometry Watermarking based on Spectral Analysis

Spectral methods embed the watermark into the signals resulting after a transformation is applied to the model in order to decompose the mesh model into low frequency and high frequency content. The transformations are constructed by using a set of orthogonal basis functions defined over the entire 3D mesh model. Most of the methods in this group embed the watermark in an additive manner after modulating the watermark with a local scaling factor in the direction of the orthogonal basis. The differences between the methods lie in the selection of the orthogonal basis for the construction of the transformation. In the first method of this group [17], a technique derived from progressive meshes [26] is utilized to construct the orthogonal basis over the meshes. The progressive meshes method is one of the multi-resolution surface representations that share similar properties with frequency-based transformations on images. The representation automatically determines the perceptually significant parts of the surface. In the proposed method, the original mesh is first converted into a coarse base mesh and a sequence of refinement operations by using the techniques of progressive meshes. Then, the method defines the orthogonal basis for each of these refinements over their corresponding neighborhood in the original mesh. The watermark is added to the 3D coordinates of the mesh vertices after it is multiplied by the orthogonal basis and a global scaling factor adjusting the watermark strength. In this way, the spread-spectrum principles used in image watermarking are adapted to embed information into the basis functions corresponding to perceptually significant features of the model. The method in [17] also proposes a solution to another challenge in 3D watermarking, namely extracting the watermark after attacks that modify not only the vertex positions but also the structure of the vertex sampling. In order to address this challenge,
an optimization technique is developed to resample the attacked mesh using the original mesh connectivity. With this technique, robustness against mesh simplification and other shape-preserving operations is obtained. In addition, the method also survives classical attacks on 3D meshes, such as smoothing, additive random noise, similarity transformations and other attacks, due to the spread spectrum principles used during watermark embedding. In another method based on spectral analysis [52], the orthogonal basis for the transformation is chosen as the eigenvectors of a Laplacian matrix derived from the connectivity of the polygonal mesh. The method first forms the Laplacian matrix, whose elements are determined using only the number of connections of each vertex to the other vertices in the model. Then, eigenvalue decomposition is applied to the Laplacian matrix to compute the mesh spectra. The decomposition produces a sequence of eigenvalues and a corresponding sequence of eigenvectors of the matrix. While the spectral coefficients of the smaller eigenvalues represent global shape features, the spectral coefficients of the larger eigenvalues represent local or detail shape features. Projecting the coordinates of a vertex onto a normalized eigenvector produces a mesh spectral coefficient of the vertex; a brief sketch of this construction is given below. The watermark is embedded into the mesh spectral coefficients by a spread spectrum approach similar to the previous one [17]. However, rather than using randomly generated numbers from a Gaussian distribution as the watermark, as in [17], an information sequence of 0s and 1s is embedded into the model. Due to the spread spectrum techniques utilized during watermark embedding, this method is also robust against similarity transformations (translation, rotation and scaling), random noise added to vertex coordinates, mesh smoothing and partial resection of meshes.
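A minimal numpy sketch of this construction follows: the combinatorial Laplacian is built from the connectivity alone, its eigenvectors serve as the spectral basis, and ±1 watermark bits are added to the low-frequency spectral coefficients. The patch decomposition, the coefficient scaling and the detection side of [52] are omitted, and all names and parameter values are illustrative assumptions rather than the published procedure.

import numpy as np

def graph_laplacian(n_vertices, edges):
    # combinatorial Laplacian L = D - A, built from connectivity only
    L = np.zeros((n_vertices, n_vertices))
    for i, j in edges:
        L[i, j] = L[j, i] = -1.0
    np.fill_diagonal(L, -L.sum(axis=1))
    return L

def spectral_embed(vertices, edges, bits, alpha=0.01):
    # requires len(bits) <= len(vertices) - 1
    L = graph_laplacian(len(vertices), edges)
    _, E = np.linalg.eigh(L)                  # columns: spectral basis, ascending frequency
    C = E.T @ vertices                        # spectral coefficients, one row per basis vector
    w = 2.0 * np.asarray(bits, float) - 1.0   # 0/1 bits -> -1/+1
    C[1:1 + len(w)] += alpha * w[:, None]     # skip the constant (DC-like) component
    return E @ C                              # watermarked vertex coordinates

Since the eigenvectors are orthonormal, mapping back with E exactly inverts the projection, so the perturbation lands only in the selected low-frequency shape components.

However, one shortcoming of the method compared to [17] is its weakness against the connectivity changes that can occur during remeshing or mesh simplification, since mesh spectral analysis depends on the connectivity of the mesh. Another disadvantage is the high computational cost of the numerical method used for the spectral analysis; this cost precluded the analysis and watermarking of mesh models with more than a few thousand vertices. In a recent version of this approach by the same researchers [51], the method in [52] is improved to handle these problems. In order to provide robustness against connectivity changes, mesh alignment and remeshing steps are included in the system before watermark extraction is applied to the tested mesh. Mesh alignment is achieved by minimizing the distance between the surfaces of the watermarked mesh and the tested mesh. In the remeshing step, the geometry of the tested mesh is resampled using the connectivity of the originally watermarked mesh. The second improvement is robustness against attacks that combine cropping with geometric transformations, mesh simplification, smoothing, and other interferences. This is achieved by means of per-patch alignment instead of per-model alignment. In the watermarking process, the model is first decomposed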
into suitable patches into which the watermark is embedded repeatedly. Such an alignment with respect to each patch enables the extraction of watermarks after cropping followed by geometric transformation. The last improvement of the method concerns computational cost. By using an efficient technique, the Arnoldi method, for the eigenvalue decomposition of the Laplacian matrix, the performance is increased more than tenfold. This made it possible to analyze a much larger mesh region for shape features to be modified for watermarking. A similar technique to [52] also uses the eigenvectors of the Laplacian matrix to define the transformation applied to the 3D model [12]. The difference lies in the watermark embedding scheme. Instead of using a spread spectrum based additive watermarking technique, the method uses a substitutive scheme which spreads the watermark over the three spectral axes after the transformation. Given that (Pi, Si, Ri) is one set of spectral coefficients obtained after the transformation is applied to the model, the method changes the middle coefficient of the set such that it becomes closer to the minimum or maximum coefficient of the set, according to the watermark bit. The scheme is robust against spectral compression, random noise added to the vertex coordinates and other common geometrical transforms, such as translation, rotation and scaling. However, the watermark fails under attacks changing the connectivity of the model. The last method in this group is based on the spherical harmonic transformation [36]. First, the method maps the coordinate information of 3D meshes to a unit sphere and then applies the spherical harmonic transformation, which can be interpreted as a combination of a Fourier transform and a latitude transform on the vertical and horizontal angles of the spherical coordinates, respectively. The watermark is embedded into the resulting transformation coefficients in an additive manner. The method is shown to be robust against noise addition, filtering, enhancement, rotation, translation and resampling, with no need for mesh alignment and remeshing. One advantage of the method compared to the previous ones is that it does not require a remeshing operation to provide robustness to resampling. The method is also robust to cropping, vertex reordering and simplification with preprocessing. However, the method requires additional processing to decrease the serious distortion resulting from the spherical harmonic transformation, which increases its computational complexity. Other than these methods watermarking the mesh representation of a geometric model, there are also methods for watermarking point representations. In [15], a watermarking system is proposed for 3D models represented by unstructured clouds of point samples. The algorithm is similar to the previous work of Ohbuchi [51, 52], with the specific parts adapted to point clouds. In a preprocessing step, the model is decomposed into a set of disjoint patches as in [51]. Each patch is transformed into the frequency domain, where the watermark is encoded into the spectral coefficients. The final watermarked model is then obtained by applying an
inverse transform to each patch. The proposed method is of the non-blind type, requiring the original model during watermark extraction from a given point cloud. During extraction, the original object and the potentially marked test object are aligned by a registration process. Then, the surface of the test object is resampled at the resolution of the original object. After a frequency transformation, the watermark is extracted using both the resampled test geometry and the original object, and the correlation is computed between the extracted and the original watermark. Finally, an algorithm for watermarking 3D volume data, based on the spread spectrum technique, is proposed in [63]. It is important to note that 3D volume data is a 4D signal, with three coordinates corresponding to the 3D position and a value of the signal at that position. In fact, the watermark component in this method is the intensity information of the volume. The method is robust and invisible in the sense that a 2D rendered image of the watermarked volume is perceptually indistinguishable from that of the original volume. The method differs from the previous ones in the transform domain where the watermark is embedded: the 3D DCT of the volume data is computed and the watermark is added similarly to the spread spectrum watermarking method of [14]. The method is robust against many attacks, such as geometrical distortions, addition of a constant offset to voxel values, addition of Gaussian and non-Gaussian noise, low pass, high pass and median filtering, local exchange of voxel slices, etc.
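As a rough illustration of this last scheme, the sketch below computes the 3D DCT of a volume and embeds a key-generated Gaussian watermark multiplicatively into its largest-magnitude AC coefficients, following the spread spectrum recipe of [14]; the coefficient selection rule and parameter values are our assumptions rather than the exact procedure of [63]. The function returns the marked volume plus the data a correlation detector would need.

import numpy as np
from scipy.fft import dctn, idctn

def watermark_volume(volume, key, n=1000, alpha=0.05):
    C = dctn(volume.astype(float), norm='ortho')    # 3D DCT of the intensity volume
    flat = C.ravel()
    order = np.argsort(np.abs(flat))[::-1]          # coefficients by magnitude
    idx = order[order != 0][:n]                     # skip the DC coefficient
    w = np.random.default_rng(key).standard_normal(n)
    flat[idx] *= 1.0 + alpha * w                    # multiplicative spread spectrum [14]
    return idctn(flat.reshape(C.shape), norm='ortho'), idx, w

12.4.5.2 3D Geometry Watermarking based on Wavelet Transformation

The second group of transform domain methods is based on the wavelet transformation [57]. Similar to the spectral analysis based methods, this group applies a wavelet transformation which decomposes the 3D mesh into subsignals at different resolutions and then embeds the watermark into the coefficients of each resolution by using spread spectrum techniques. The difference of this group from the previous one is the use of the wavelet transformation, which represents a mesh model in a more compact and natural form consisting of multiple resolutions. The first method in this group [53] is specifically designed for robustness against affine transformation. The method first decomposes a 3D polygonal mesh by using lazy wavelets induced on 3D polygonal meshes. Then, the watermark is added to geometric measures on the wavelet coefficients which are invariant to affine transformation. During the embedding process, a mechanism to limit the maximum geometric distortion on the vertices is also utilized to control the imperceptibility of the watermark. The watermark embedding process is illustrated in Fig. 12.17. During detection, both the watermarked and original meshes are passed through the wavelet transformation and the difference between their wavelet coefficients is calculated for detection (Fig. 12.18). The scheme is shown to be resistant against affine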
Fig. 12.17. Watermark embedding process [53]. Reprinted with kind permission of Springer Science and Business Media
transformation, partial resection, random noise added to vertex coordinates, and other attacks. As a limitation, the method requires the mesh to have 1-to-4 subdivision connectivity. In addition, robustness against the remeshing attack is not handled by the method. The advantages of [53] can be summarized as follows. First of all, the area where the watermark is embedded can easily be determined by selecting the largest wavelet coefficient vectors, so that the embedded data can be perceptually invisible. Secondly, the embedded watermark can be spread over various resolution levels. Hence, the localization of the watermark at high resolutions provides the ability to identify the distinct region of the watermarked polygon data on which a local modification has been made, while the global spreading
Fig. 12.18. Watermark Extraction Process [53]. Reprinted with kind permission of Springer Science and Business Media
of the watermark at low resolutions makes the embedded watermark invariant to local modifications of the geometry. This strategy makes the watermark more robust than one embedded by a spatial domain method. Thirdly, owing to the properties of the wavelet decomposition, the watermark can be embedded not only into the original polygon data, but also into the approximated polygon data at several higher resolution levels by executing the embedding process only once. Finally, the use of the lazy wavelet makes it possible to define a clear geometrical relation between the overall geometric tolerance and the upper bound of the modification of each wavelet coefficient vector. Therefore, the geometric error between the watermarked polygon data and the original can easily be controlled. Another method based on multi-resolution processing is proposed in [33]. The method is based on a multi-resolution decomposition of polygonal mesh shapes developed by Guskov et al. [28], which separates a mesh into detail and coarse feature sequences by repeatedly applying local smoothing combined with shape differencing. After the decomposition of the model, the watermark is added to the resulting coefficients after multiplication with a local and a global factor, as in the spread spectrum based methods. The method has shown good robustness against vertex reordering, noise addition, simplification, filtering, enhancement, cropping, etc. In addition, the watermarking algorithm integrates nicely with the other signal processing tools developed by Guskov et al. [28], which constitute a generic multi-resolution framework into which other proposed techniques fit as well. However, as with most other transform domain methods, the scheme requires registration and resampling stages before watermark extraction to bring the attacked mesh model back to its original location, orientation, scale, topology and resolution level. In addition, the method requires the original mesh to detect the watermark. In [19], a blind watermarking technique which exploits a class of 3D wavelets based on subdivision surfaces is presented. The method assumes that the mesh to be decomposed by wavelet analysis is a semi-regular mesh, obtained by regularly subdividing an irregular mesh. The method decomposes the 3D mesh into sublevels as illustrated in Fig. 12.19.
Fig. 12.19. Representations of a 3D model in different wavelet sublevels
Fig. 12.20. The general scheme of the algorithm in [19]
The watermark is added to the wavelet coefficients at a suitable resolution level after multiplication with a local factor. The overall scheme of the watermarking process is given in Fig. 12.20. Watermark detection is accomplished by computing the correlation between the watermark signal and the tested mesh. Robustness against geometric transformations, such as rotation, translation and uniform scaling, is achieved by embedding the watermark in a normalized version of the host mesh, obtained by means of Principal Component Analysis. The method is shown to be robust against additive noise, low pass filtering, translation, rotation, scaling and cropping. However, robustness against the remeshing operation is not handled by the method. The method is extended in [18] by using a roughness measure to improve imperceptibility.

12.4.6 Comparisons and Discussions on Geometry Watermarking

A brief comparison between the two categories of geometry watermarking methods is presented in Table 12.3. Based on this table, compared to the spatial domain methods, transform domain methods generally satisfy the trade-off between imperceptibility and robustness in a more reliable manner, since the applied transformation mostly separates the perceptually significant and insignificant parts of the 3D model. Therefore, it is easier in this category to find the perceptually significant part of the model in which to embed the watermark. The second advantage is the ability to spread the watermark over each resolution after the transformation. Such spreading makes the watermark more robust, especially against compression, noise addition and local deformations. Since the watermark is deeply embedded into each resolution of the model, watermark components remain inside the 3D model as long as the visual quality is preserved. However, the transform domain methods are computationally more complex than the spatial domain methods, which precludes their use in some practical applications. On the other hand, spatial domain methods are easier to implement and more robust, especially against the cropping attack, where most of the transform domain methods fail.
Table 12.3. Comparisons between different categories in 3D Geometry Watermarking

Spatial Domain Methods
Pros: Lower complexity; robustness against cropping
Cons: Difficulty in finding perceptually significant regions; weakness against local deformations

Transform Domain Methods
Pros: Robustness against compression and noise addition; good integration with visual perception
Cons: Higher computational cost; weakness against cropping
In Table 12.4, the proposed solutions against the main attacks in geometry watermarking are presented. Since one of the important problems in geometry watermarking is the diversity of attacks, a separate table of the proposed solutions is useful for a complete evaluation. In summary, the first four attacks in Table 12.4 (translation, rotation, scaling and affine transformation) are handled in the literature by embedding the watermark into primitives of the mesh model that are invariant under these attacks. Cropping is addressed by spreading the watermark over the entire model. For robustness against additive noise and compression, transform techniques which embed the watermark at different resolutions are utilized. Finally, mesh simplification and remeshing are addressed by resampling the attacked mesh using the original mesh connectivity. The methods are tabulated in Table 12.5 with respect to their robustness against the main attacks.

Table 12.4. Attacks and Proposed Solutions in 3D watermarking

1. Translation
• Positioning the object so that its center of mass corresponds to the origin of the coordinate system [56]
• Using metrics invariant under translation as an embedding primitive (e.g. a pair of angles in a triangle of a mesh [44, 45])

2. Rotation
• Rotating the object such that its principal component coincides with the z-axis of the coordinate system [56]
• Using metrics invariant under rotation as an embedding primitive (e.g. a pair of angles in a triangle of a mesh [44, 45])

3. Scaling
• Normalizing the radial components of the vertices of the model such that the distance of the furthest vertex from the origin is equal to one [58]
• Using metrics invariant under scaling as an embedding primitive (e.g. a pair of angles in a triangle of a mesh [44, 45])

4. Affine Transformation
• Using the original mesh to estimate and recover the distortions due to affine transformation [47]
• Embedding the watermark into metrics invariant under affine transformation (e.g. the ratio of the volumes of a pair of tetrahedrons in a mesh model [44, 45])

5. Cropping
• Distributing the information corresponding to each bit of the watermark over the entire model via vertex scrambling [47]
• Spreading the watermark over different surface patches of the mesh model and using the normal vector distribution of each patch for watermark embedding [34]

6. Additive Noise
• Embedding the watermark into the coefficients of each resolution after a frequency based transformation is applied to the mesh model (transform domain methods [33, 53])

7. Compression
• Embedding the watermark into the coefficients of each resolution after a frequency based transformation is applied to the mesh model (transform domain methods [33, 53])

8. Mesh Simplification
• Embedding each watermark sample into a set of vertices instead of a single vertex [56]
• Resampling the attacked mesh using the original mesh connectivity [17]
• Using metrics invariant under mesh simplification as an embedding primitive, such as normals of collections of surfaces [43]

9. Remeshing
• Using metrics invariant under mesh simplification, such as normals of collections of surfaces, as an embedding primitive [43]
• Resampling the attacked mesh using the original mesh connectivity [17]
Table 12.5. Methods and their robustness against major attacks in geometry watermarking; the attack corresponding to each number is given in Table 12.4 (B: blind, NB: non-blind, NH: not handled in robustness tests, √: robust, X: not robust). The rows cover the spatial domain methods based on vertex coordinates ([9], [64, 66], [58], [32, 56], [4, 5, 61], [47, 48]), on other primitives ([43], [34], [44]-TSQ, [44]-TVR, [44]-MDP) and on 2D methods ([25]), and the transform domain methods based on spectral analysis ([17], [52], [51], [12], [36]) and on wavelets ([53], [33], [18]). (The per-attack entries of the table are not reproduced here.)
The methods are also categorized as non-blind or blind in this table. While the non-blind methods require the original mesh in the detection process, the blind methods have no such constraint. It is difficult to speak of the superiority of one method over another, due to the lack of a unique 3D scene representation, coding scheme and benchmark for 3D geometry watermarking. However, if only the robustness against possible common attacks on 3D geometry is considered in the evaluation, the transform domain methods [33, 36, 51] seem the most promising.
12.5 Texture Watermarking

As mentioned earlier in this chapter, the majority of 3D watermarking methods are based on the geometry of the 3D model. Recently, a novel texture-based watermarking scheme was proposed in [22], presenting the problem and requirements of texture watermarking in detail. The main goal of the method is to extract the watermark, originally hidden in the textural component of the object, from the resulting projected images or videos, thus protecting the visual representation of the object. Attacks on the texture not only include operations on the texture image, such as subsampling, JPEG compression or malicious attacks, but also involve modifications of the texture mapping or distortions of the geometrical description of the object [22]. Therefore, the problem is much more complicated than in geometry watermarking. The method in [22] first describes the basic steps in texture watermarking. In the first step, the watermark is embedded into the texture of an object by using an image watermarking algorithm and the texture is mapped onto the object. The watermarked object can then be projected onto any image plane for further use in virtual scenes. Next, after utilization of the represented object in any imaging technology, the watermarked texture is reconstructed. This step requires the geometry and the texture mapping function for the recovery. The rendering parameters, including the location of the object with respect to the virtual camera and the intrinsic parameters of the virtual camera, are also necessary to reconstruct the texture from the 2D view of the object. In addition, knowledge of the lighting model and parameters for the scene may also be needed. Finally, the embedded watermark is extracted from the recovered image in the last step of the algorithm. The method mainly focuses on the reconstruction of the texture from the 2D views. This part is in fact the main contribution of the method, since the other steps, watermark embedding and extraction, are similar to a typical image watermarking algorithm. Common computer graphics techniques are used for the projection and rendering of the object. Specifically, the scheme estimates the rendering parameters of the virtual camera, which amount to the projection matrix, depending on the position of the virtual camera with respect to the model and its intrinsic parameters, such as the focal length and the size of the pixels.
Based on the simulation results presented for various conditions, ranging from ideal cases (a priori known rendering parameters) to more realistic scenarios (unknown rendering parameters estimated from a 2D view), and testing against attacks on both the geometry (mesh simplification) and the texture (JPEG compression), the performance of the method in [22] in terms of robustness is quite promising. As previously mentioned, texture watermarking is more challenging than the geometry based 3D watermarking methods, due to the diversity of attacks in this category. However, it is important to note that the two 3D watermarking approaches, texture and geometry based, could easily be combined in a unified watermarking scheme, since neither approach interferes with the other.
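As an illustration of the first step of this pipeline, the toy sketch below embeds a bit string into a grayscale texture by nudging one mid-frequency coefficient per 8×8 DCT block. This is a generic block-DCT image watermark, not the particular algorithm of [22]; the chosen coefficient position and strength are arbitrary assumptions, and any stronger image watermarking scheme could be substituted here, as the text notes.

import numpy as np
from scipy.fft import dctn, idctn

def embed_texture_bits(texture, bits, strength=4.0):
    img = texture.astype(float).copy()
    h, w = img.shape
    blocks = [(r, c) for r in range(0, h - 7, 8) for c in range(0, w - 7, 8)]
    for (r, c), bit in zip(blocks, bits):
        B = dctn(img[r:r + 8, c:c + 8], norm='ortho')
        B[3, 4] += strength if bit else -strength   # nudge a mid-band coefficient
        img[r:r + 8, c:c + 8] = idctn(B, norm='ortho')
    return np.clip(img, 0, 255)

The hard part of [22], reconstructing this texture from an arbitrary rendered 2D view before extraction, is not addressed by such a sketch.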
12.6 Watermarking for Image-based Representation of 3D Scenes

This section is devoted to a novel area of 3D watermarking, dealing with the protection of a 3D scene that is represented and rendered using 2D images taken from different viewpoints in the scene. Image-based representation and rendering (IBR) techniques have been developed over the last ten years as an alternative to traditional geometry-based rendering methods. The main aim of IBR techniques is to produce a projection of a 3D scene from an arbitrary viewpoint by using a number of original camera views of the same scene. This approach retains the effects already present in the original camera views and consequently yields more natural views than the traditional geometry-based methods, which mostly model the scene with a single texture and additional supporting textures, and therefore sometimes lack natural appearance. Moreover, IBR is often preferable because the required images are easier to obtain and simpler to handle than the geometric model, texture and texture map of the traditional approach [13]. Due to these advantages, IBR has attracted much attention from researchers in vision and signal processing and has shown significant progress in the last decade [13]. Indeed, real-time demonstrations of free-view TV, where viewers freely select the viewing position and angle, are already possible by applying IBR to transmitted multi-view video [11, 39]. With these recent advances, the copyright and copy protection problems for image-based represented scenes become more important. For instance, in the free-view TV application of IBR, a TV viewer might record a personal video for his or her arbitrarily selected view and misuse this content. A content provider should therefore be able to prove ownership of the recorded media in such a case. A recent study [7] presents the special requirements of watermarking for IBR and proposes a watermarking method for scenes rendered with light field rendering [37], one of the basic IBR techniques in the literature [13].
Fig. 12.21. The watermarking problem for Image-based Rendering (The illustrated multi-view video sequence is taken from Mitsubishi Electric Laboratory Archive [59])
12.6.1 Requirements of Watermarking for Image-based Rendering

First of all, concerning the robustness requirement, the watermark should not only be resistant to common video processing and multi-view video processing operations, it should also be extractable from a rendered video frame for an arbitrary view (see Fig. 12.21). In order to extract the watermark from such a rendered view, the watermark detection scheme should involve an estimation procedure for the virtual camera position and orientation from which the rendered view is generated. In addition, the watermark should also survive image-based rendering operations, such as frame interpolation between neighboring cameras and pixel interpolation inside each camera frame. IBR also extends the imperceptibility requirement for free-view watermarking. A watermark for video should be spatially invisible in each frame and temporally invisible in the temporal dimension of the video. Specifically, the watermark is embedded into the consecutive frames of a video such that it does not yield any flickering type of distortion in the temporal dimension. In parallel with this principle of video watermarking, the watermark for free-view video should be embedded into adjacent camera frames such that it does not yield any flickering type of distortion in the video resulting from the rendering process.

12.6.2 Proposed Watermarking Method for Free View TV with Light-field Rendering

In the literature, one of the best known and most useful IBR representations is the light field [37], due to its simplicity: only the original images are used to construct the virtual views. It is the authors' belief that this representation can also be expected to constitute one of the fundamental technologies of free-view TV [39], especially in the consumer electronics market, due to its cost advantages. Therefore, the proposed watermarking method [7] is specially
tailored for the extraction of watermarks from the virtual views generated by light field rendering. In light field rendering (LFR), a light ray is indexed as (uo, vo, so, to), where (uo, vo) and (so, to) are the intersections of the light ray with two parallel planes, namely the camera (uv) and focal (st) planes. The planes are discretized so that a finite number of light rays are recorded. If all the discretized points of the focal plane are connected to one point on the camera plane, an image (a 2D array of light fields) results. This resulting image is in fact a sheared perspective projection of the camera frame at that point [37]. The 4D representation of the light field can thus be interpreted as a 2D image array, as in Fig. 12.22. The watermark is embedded into each image of this 2D image array.

12.6.2.1 Watermark Embedding

Considering the mentioned requirements, the method in [7] embeds a watermark pattern (generated from a Gaussian density with zero mean and unit variance) into each camera image of the scene by exploiting the spatial sensitivity of the HVS [2]. For that purpose, the watermark is modulated with the output image obtained by filtering each camera image with a 3×3 high pass filter, and is then spatially added to that image. This operation decreases the watermark strength in the flat regions of the image, where the HVS is more sensitive, and increases the embedded watermark energy in the detailed regions, where the HVS is less sensitive. There are two critical points in the embedding stage. First, the watermark is embedded into the camera images, which are the sheared perspective projections of the original camera frames. Secondly, the watermark component added to each image pixel is determined according to the intersection of the light ray corresponding to that pixel (the ray from the camera center towards the position of that pixel) with the focal plane. It should be emphasized that the same watermark component is embedded into those pixels of different camera frames whose corresponding light rays intersect at the same point of the focal plane, as illustrated in Fig. 12.23.
Fig. 12.22. An illustration for 2D light field array
Fig. 12.23. Watermark Embedding in Free View TV [7]. Reprinted with permission from IEEE. Copyright 2006 IEEE
In this example, while the watermark components added to the first camera image are [w0, w1, w2, w3, . . .], the components embedded into the second camera image are [w1, w2, w3, . . .], i.e. beginning from w1. The rationale behind this procedure is to avoid the superposition of different watermark samples, coming from different camera frames, in the interpolation operations during rendering. Such a superposition would severely degrade the performance of the correlation operation in the watermark detection stage. Moreover, this procedure also avoids flickering type distortion during rendering, since the watermark component in any rendered view is identical to the watermark components in the adjacent original views. The proposed method applies the same watermarking operation to each light field image as follows [7]:

I∗uv(s, t) = Iuv(s, t) + α · Huv(s, t) · W(s, t)    (12.1)
where Iuv is the light field image corresponding to the camera at position (u, v), Huv is the output image after high pass filtering, α is the global scaling factor adjusting the watermark strength, W is the watermark sequence generated from a Gaussian distribution with zero mean and unit variance, and I∗uv is the watermarked light field image.
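A direct reading of (12.1) in Python might look as follows; the particular 3×3 high pass kernel is an assumption, as [7] does not specify it here.

import numpy as np
from scipy.signal import convolve2d

HP = np.array([[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]]) / 8.0              # assumed 3x3 high pass kernel

def embed_light_field_image(I_uv, W, alpha=0.1):
    # Eq. (12.1): flat regions get a weak watermark, detailed regions a strong one
    H_uv = convolve2d(I_uv, HP, mode='same', boundary='symm')
    return I_uv + alpha * H_uv * W

rng = np.random.default_rng(0)
I = rng.uniform(0.0, 255.0, size=(240, 320))     # stand-in camera image
W = rng.standard_normal(I.shape)                 # zero-mean, unit-variance watermark
I_star = embed_light_field_image(I, W)

12.6.2.2 Watermark Detection

A correlation-based scheme is proposed for watermark detection [7]. Rather than dealing with the typical attacks for image and video watermarking, the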
critical problem of multi-view video, the extraction of the watermark from an arbitrary view generated by LFR, is considered. Assuming that the position and rotation of the virtual view are known a priori, the first step in watermark detection is to apply the same rendering operations used in the generation of the arbitrary view to the watermark pattern W, in order to generate the rendered watermark Wren. The arbitrary view is filtered by a high pass filter and the normalized correlation between the resulting image and the rendered watermark is calculated. Then, the normalized correlation is compared to a threshold to detect the watermark. The overall structure of the watermark detection is shown in Fig. 12.24.
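A sketch of this detector is given below; the high pass kernel and the threshold value are illustrative assumptions.

import numpy as np
from scipy.signal import convolve2d

def detect_watermark(I_rendered, W_rendered, threshold=0.05):
    HP = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]) / 8.0
    I_hp = convolve2d(I_rendered, HP, mode='same', boundary='symm')
    rho = np.sum(I_hp * W_rendered) / (np.linalg.norm(I_hp) * np.linalg.norm(W_rendered))
    return rho > threshold, rho                  # decision and correlation value

12.6.3 Discussion on Free View Watermarking

The proposed method extracts the watermark successfully from an arbitrarily generated image, assuming that the position and rotation of the virtual camera are known. In order to extend the method to the case of an unknown virtual camera position and rotation, the transformations of the watermark pattern due to light field rendering operations are also analyzed [7]. Based on this analysis, camera position and homography estimation methods are proposed that take account of the operations in light-field rendering. The results show that watermark detection is achieved successfully for any unknown virtual camera position and orientation. The proposed method [7] mainly focuses on watermark insertion and extraction for static scenes consisting of only one depth layer, which forms the fundamental case in the development of IBR technology. The approach is expected to give competitive results for rendering algorithms that consider multiple layers, for scenes with limited structure variation. In general, watermarking for image-based representations gives a completely new direction to 3D watermarking. However, some points still need to be addressed in the watermarking of image-based represented scenes. First, the method should be extended to static scenes consisting of multiple depth layers.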
Fig. 12.24. Overall structure of the watermark detection process [7]: the rendered view Iren is high pass filtered to give Îren, the normalized correlation ⟨Îren, Wren⟩ / (||Îren|| ||Wren||) is computed and compared to a threshold, yielding a 1/0 decision. Reprinted with permission from IEEE. Copyright 2006 IEEE
Then, the problem should be addressed for dynamic scenes. In addition, an imperceptibility analysis is missing from the paper. A thorough imperceptibility analysis should include the possible distortions due to watermarking in the rendered video sequences, as well as in the individual images. Finally, the problem should also be addressed for alternative IBR techniques such as depth based rendering.
12.7 Summary and Conclusions

In this chapter, a rigorous literature survey of 3D watermarking techniques has been presented. The chapter categorizes 3D watermarking techniques according to the main components of the major representation techniques, as geometry-, texture- and image-based watermarking. The requirements and problems in each branch of 3D watermarking were briefly reviewed. The existing 3D scene watermarking methods mainly focus on the watermarking of 3D geometry data, mostly represented with mesh structures. These methods are classified as spatial domain methods, where the watermark is embedded into the values of geometric primitives, such as the coordinates of points, the length of a line, the area of a polygon or the volume of a polyhedron, and transform domain methods, where the watermark is embedded into the coefficients that result after a 3D geometry-based transformation is applied to the 3D geometry data. The pros and cons of each category are summarized in Table 12.3. Briefly, transform domain methods are better suited to determining the significant portions of the 3D object, and hence more robust to compression and noise attacks. On the other hand, spatial domain approaches are easier to implement and robust against geometrical attacks, such as cropping. The proposed solutions in the literature to the main attacks in geometry watermarking were examined separately, in order to give a more concise evaluation of 3D geometry watermarking. The robustness of the methods against the main attacks was also compared. Although it is difficult to state the superiority of one method over another, due to the lack of a unique 3D scene representation, coding scheme and benchmark for 3D geometry watermarking, some methods were found to be more promising, based on their robustness against possible common attacks on 3D geometry. One important observation of this chapter is that the existing 3D watermarking methods in the literature mostly work on the 3D geometry-based representation of 3D scenes. As pointed out before, in such a representation the scene geometry is modeled using meshes, point clouds or voxels, and texture and reflectance maps are then used as additional descriptions on top of the model. Currently, such methods are mostly proposed to protect the geometry description of a 3D object used in computer graphics applications. However, considering that geometry-based representations might also be a suitable framework for future technologies,
such as 3DTV, suitable extensions of these methods can be good candidates for the solution of the copyright problem and related problems, such as authentication, content labeling and content protection, in coming applications. On the other hand, alternative representation techniques for 3D scenes, such as image based modeling and rendering (IBR), have developed rapidly in recent years. By capturing a set of images from different viewpoints of a scene, these techniques are designed to reproduce the scene correctly at an arbitrary viewpoint. Compared to geometry-based models, this approach is advantageous in that images are easier to obtain, simpler to handle and more realistic to render. Noting that users could record a personal video for their arbitrarily selected views and misuse this content, this technology makes the copyright problem for image based represented scenes more apparent. Fortunately, there are already pioneering works in this new area of 3D watermarking.
Acknowledgement This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. A. Yilmaz, A.A. Alatan, "Error Concealment of Video Sequences by Data Hiding", IEEE International Conference on Image Processing, Barcelona, Spain, 2003.
2. A.B. Watson, "DCT quantization matrices visually optimized for individual images", Proceedings of SPIE on Human Vision, Visual Processing, and Digital Display IV, 1993.
3. A.G. Bors, I. Pitas, "Image Watermarking using DCT Domain Constraints", IEEE International Conference on Image Processing, pp. 231–234, 1996.
4. A.G. Bors, "Blind watermarking of 3D shapes using localized constraints", 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), 6–9 September 2004.
5. A.G. Bors, "Watermarking 3D Shapes Using Local Moments", IEEE International Conference on Image Processing, 2004.
6. A. Koz, A. Aydın Alatan, "Foveated Image Watermarking", IEEE International Conference on Image Processing, Rochester, NY, 2002.
7. A. Koz, C. Cigla, A. Aydin Alatan, "Free view Watermarking for Free view Television", IEEE International Conference on Image Processing, Atlanta, GA, USA, 8–11 October 2006.
8. A.B. Watson, J. Hu, J.F. McGowan III, "DVQ: A Digital Video Quality Metric based on Human Vision", Journal of Electronic Imaging, Vol 10, No 1, pp. 20–29, January 2001.
9. M. Ashourian, R. Enteshary, "A new masking method for spatial domain watermarking of three-dimensional triangle meshes", TENCON 2003, Conference on Convergent Technologies for Asia-Pacific Region, Vol. 1, pp. 428–431, Oct. 2003.
10. B. Hartung, F. Hartung, B. Girod, "Digital Watermarking of Raw and Compressed Video", Proceedings of European EOS/SPIE Symposium on Advanced Imaging and Network Technologies, Society of Photo-Optical Instrumentation Engineers, Bellingham, Wash., 1996.
11. C. Fehn, P. Kauff, O. Schreer, R. Schafer, "Interactive virtual view video for immersive TV applications", Proceedings of IBC'01, Amsterdam, Netherlands, Vol. 2, pp. 53–62, September 2001.
12. F. Cayre, P. Rondao-Alface, F. Schmitt, et al., "Application of spectral decomposition to compression and watermarking of 3D triangle mesh geometry", Signal Processing: Image Communication, Vol 18, No 4, pp 309–319, April 2003.
13. C. Zhang, T. Chen, "A survey on image based rendering-representation, sampling and compression", Signal Processing: Image Communication, Vol 19, 2004.
14. I.J. Cox, J. Kilian, T. Leighton, T. Shamoon, "Secure spread spectrum watermarking for multimedia", IEEE Transactions on Image Processing, Vol 6, No 12, pp 1673–1687, 1997.
15. D. Cotting, T. Weyrich, M. Pauly, M. Gross, "Robust Watermarking of Point Sampled Geometry", in Proceedings of International Conference on Shape Modeling and Applications, 2004.
16. E. Koch and J. Zhao, "Towards Robust and Hidden Image Copyright Labeling", Proceedings 1995 IEEE Workshop on Nonlinear Signal and Image Processing, IEEE CS Press, Los Alamitos, California, pp. 452–455, June 1995.
17. E. Praun, H. Hoppe, A. Finkelstein, "Robust Mesh Watermarking", in Proceedings of SIGGRAPH 99, pp. 69–76, 1999.
18. F. Uccheddu, M. Corsini, M. Barni, "Using a Roughness Measure to Improve the Invisibility of 3D Watermarks", COST 276 meeting, Ankara, Turkey.
19. F. Uccheddu, M. Corsini, M. Barni, "Wavelet-Based Blind Watermarking of 3D Models", International Multimedia Conference, Proceedings of the 2004 Workshop on Multimedia and Security, pp 143–154, 2004.
20. F.A.P. Petitcolas, "Watermarking Schemes Evaluation", IEEE Signal Processing Magazine, Vol 17, No 5, pp 58–64, September 2000.
21. G. Doërr, J.-L. Dugelay, "A Guide Tour of Video Watermarking", Signal Processing: Image Communication, Vol 18, No 4, pp 263–282, 2003.
22. E. Garcia, J.L. Dugelay, "Texture-based Watermarking of 3D Video Objects", IEEE Transactions on Circuits and Systems for Video Technology, Vol 13, No 8, pp 853–866, Aug. 2003.
23. G.C. Langelaar, I. Setyawan, R.L. Lagendijk, "Watermarking Digital Image and Video Data", IEEE Signal Processing Magazine, Vol 17, No 5, pp 20–46, September 2000.
24. G. Louizis, A. Tefas, I. Pitas, "Copyright Protection of 3D Images using Watermarks of Specific Spatial Structure", IEEE International Conference on Multimedia and Expo, ICME '02, Vol 2, pp 557–560, August 2002.
25. Han Sae Song, Nam Ik Cho, JongWeon Kim, "Robust Watermarking of 3D Mesh Models", IEEE International Conference on Image Processing, pp 332–335, 2002.
26. H. Hoppe, "Progressive Meshes", in ACM SIGGRAPH 96 Conference Proceedings, pp. 99–108, August 1996.
27. Hsueh-Yi Sean Lin, Hong-Yuan Mark Liao, Chun-Shien Lu and Ja-Chen Lin, "Fragile Watermarking for Authenticating 3D Polygonal Meshes", IEEE Transactions on Multimedia, Vol. 7, No. 6, December 2005.
28. I. Guskov, W. Sweldens, P. Schröder, "Multiresolution signal processing for meshes", Proceedings SIGGRAPH'99, pp. 49–56, 1999.
29. I.J. Cox, M.L. Miller, J.A. Bloom, "Watermarking Applications and their Properties", International Conference on Information Technology, Las Vegas, 2000.
30. J. Zhao, E. Koch, "A Digital Watermarking System for Multimedia Copyright Protection", Proceedings ACM Multimedia 96, ACM Press, New York, pp. 443–444, Nov. 1996.
31. J.F. Delaigle, "Protection of Intellectual Property of Images by Perceptual Watermarking", Ph.D. Thesis submitted for the degree of Doctor of Applied Sciences, Université Catholique de Louvain, Belgium.
32. A. Kalivas, A. Tefas, I. Pitas, "Watermarking of 3D Models Using Principal Component Analysis", Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), Vol 5, pp. 676–679, 2003.
33. K. Yin, Z. Pan, J. Shi, D. Zhang, "Robust Mesh Watermarking Based on Multiresolution Processing", Computers & Graphics, Vol. 25, pp. 409–420, 2001.
34. K.-R. Kwon, S.-G. Kwon, S.-H. Lee, T.-S. Kim, K.-I. Lee, "Watermarking for 3D Polygonal Meshes Using Normal Vector Distributions of Each Patch", IEEE International Conference on Image Processing, Vol 2, pp 499–502, September 2003.
35. L. Boney, A.I. Tewfik, K.I. Hamdy, "Digital Watermarks for Audio Signals", Proceedings EUSIPCO 96, Vol. III, VIII European Signal Processing Conference, pp. 1697–1700, 1996.
36. L. Li, D. Zhang, Z. Pan, J. Shi, K. Zhou, K. Ye, "Watermarking 3D Mesh by Spherical Parameterization", Computers & Graphics, Vol 28, pp 981–989, 2004.
37. M. Levoy, P. Hanrahan, "Light Field Rendering", Computer Graphics (SIGGRAPH'96), New Orleans, LA, pp. 31–42, August 1996.
38. M. Pollefeys et al., "3D Structure from Images - SMILE 2000", Springer Verlag, LNCS 2080, p. 164, 2000.
39. Masayuki Tanimoto, "FTV (Free Viewpoint Television) Creating Ray-Based Image Engineering", International Conference on Image Processing, Genova, September 11–14, 2005.
40. M. Maes, T. Kalker, J.-P.M.G. Linnartz, J. Talstra, G.F.G. Depovere, J. Haitsma, "Digital Watermarking for DVD Video Copy Protection", IEEE Signal Processing Magazine, September 2000.
41. N. Memon, P.W. Wong, "Protecting Digital Media Content", Communications of the ACM, Vol 41, pp 35–43, July 1998.
42. M.D. Swanson, M. Kobayashi, A.H. Tewfik, "Multimedia Data-Embedding and Watermarking Technologies", Proceedings of the IEEE, Vol. 86, No. 6, June 1998.
43. O. Benedens, "Watermarking of 3D Polygon Based Models with Robustness against Mesh Simplification", in Proceedings SPIE Security and Watermarking of Multimedia Contents, Vol 3657, pp 329–340, 1999.
44. R. Ohbuchi, H. Masuda, M. Aono, "Watermarking 3D Polygonal Models", Proceedings ACM Multimedia'97, 1997.
45. R. Ohbuchi, H. Masuda, M. Aono, "Watermarking Three Dimensional Polygonal Models Through Geometric and Topological Modifications", IEEE Journal on Selected Areas in Communication, Vol 16, No 4, pp 551–560, 1998.
46. L. Onural et al., "An Overview of the Holographic Display Related Tasks within the European 3D TV Project", SPIE 2006.
47. P. Daras, D. Zarpalas, D. Tzovaras, M.G. Strintzis, "Watermarking of 3D Models for Data Hiding", IEEE International Conference on Image Processing, ICIP 2004, Singapore, October 2004.
48. P. Daras, D. Zarpalas, D. Tzovaras, D. Simitopoulos, M.G. Strintzis, "Combined Indexing and Watermarking of 3D Models using the Generalized 3D Radon Transform", in Multimedia Security Handbook, B. Furht and D. Kirovski (Eds), CRC Press, pp 733–758, ISBN 0-8493-2773-3, December 2004.
49. R.B. Wolfgang, C.I. Podilchuk, E.J. Delp, "Perceptual Watermarks for Image and Video", Proceedings of the IEEE, Vol 87, No 7, July 1999.
50. R.A. Patrice, M. Benoit, "Blind Watermarking of 3D Meshes Using Robust Feature Points Detection", International Conference on Image Processing (ICIP05), Genova, Italy, 11–14 September 2005.
51. R. Ohbuchi, A. Mukaiyama, S. Takahashi, "A Frequency Domain Approach to Watermarking 3D Shapes", in Proceedings EuroGraphics 2002, Vol 21, No 3, 2002.
52. R. Ohbuchi, S. Takahashi, T. Miyazawa, A. Mukaiyama, "Watermarking 3D Polygonal Meshes in the Mesh Spectral Domain", in Proceedings Graphics Interface 2001, pp. 9–17, 2001.
53. S. Kanai, H. Date, T. Kishinami, "Digital Watermarking for 3D Polygons using Multiresolution Wavelet Decomposition", Proceedings Sixth IFIP WG 5.2 GEO-6, pp. 296–307, Tokyo, Japan, December 1998.
54. R.V. Schyndel, A. Tirkel, C. Osborne, "A Digital Watermark", in Proceedings of ICIP, IEEE Press, pp. 86–90, Nov. 1994.
55. S. Zafeiriou, A. Tefas, I. Pitas, "Blind Watermarking Schemes for Copyright Protection of 3D Mesh Objects", IEEE Transactions on Visualization and Computer Graphics, Vol. 11, No. 5, 2005.
56. S. Zafeiriou, A. Tefas, I. Pitas, "A Blind Robust Watermarking Scheme for Copyright Protection of 3D Mesh Models", IEEE International Conference on Image Processing, 2004.
57. E.J. Stollnitz, T.D. DeRose, D.H. Salesin, "Wavelets for Computer Graphics: Theory and Applications", Morgan Kaufmann, 1996.
58. S.-H. Lee, T.-S. Kim, S.-J. Kim, Y. Huh, K.-R. Kwon, K.-I. Lee, "3D Mesh Watermarking Using Projection onto Convex Sets", IEEE International Conference on Image Processing, 2004.
59. The Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, http://www.merl.com
60. The Stanford 3D Scanning Repository: http://graphics.stanford.edu/data/3Dscanrep.
61. T. Harte, A.G. Bors, "Watermarking 3D Models", IEEE International Conference on Image Processing, 2002.
62. M.M. Yeung, B.-L. Yeo, "Fragile Watermarking of 3D Objects", in International Conference on Image Processing, 1998.
470
A. Koz et al.
63. Y. Wu, X. Guan, M.S. Kankanhalli, Z. Huang, “Robust Invisible Watermarking of Volume Data Using the 3D DCT”, Computer Graphics International 2001. Proceedings, pp 359–362, July 2001. 64. Z. Yu, H.H.S. Ip, L.F. Kowk, “Robust Watermarking of 3D Polygonal Models Based on Vertex Scrambling”, Computer Graphics International, 2003. Proceedings, pp 254–257, July 2003. 65. Z. Karni, C. Gotsman, “Spectral Compression of Mesh Geometry”, Proceedings SIGGRAPH 2000, pp. 279–286, 2000. 66. Z. Yu, H.H.S. Ip and L.F. Kwok “A Robust Watermarking Scheme for 3D Triangular Mesh Models”, Pattern Recognition, Vol 36, No 11, pp 2603–2614, November 2003.
13 Solving the 3D Problem—The History and Development of Viable Domestic 3DTV Displays

Phil Surman¹, Klaus Hopf², Ian Sexton¹, Wing Kai Lee¹ and Richard Bates¹

¹ De Montfort University, UK
² Fraunhofer-Institute for Telecommunications – Heinrich-Hertz-Institut, Germany
13.1 Introduction

Domestic television and video display is central to one of the largest consumer electronics markets in the world, and the prize for developing a technically capable and commercially viable 3D video display system suitable for the home is likely to be great. Producing such a domestic 3D video system places great demands on innovation, research and development, but with recent advances in the enabling technologies such displays are now within our grasp. This paper starts by giving a brief history of the many attempts to produce a viable domestic 3D video display, introducing the pioneers who first initiated research on 3D domestic displays. It then outlines and discusses the essential requirements that must be met to fulfil viewer expectations of a viable and usable domestic 3D video display. These demands are then placed in the context of the historical attempts to produce viable 3D displays, showing how these attempts have informed current thinking by outlining the problems of each technological approach. The paper then goes on to describe possible contemporary approaches to producing domestic 3D video displays, discussing the current viability of each, and showing that although there are many current solutions, these are often not suitable for domestic use. The paper then shows the development, based on historical work and contemporary thinking and technology, of viable 3D domestic video displays for both single-viewer and multiple-viewer use that it is hoped will fulfil the demands of domestic use. The paper concludes with the prediction that within the next 10 years we will see domestic 3D video displays readily available and accepted by the marketplace.
13.2 A Brief History of Domestic 3D Video Display

Although 3D video and television have never been regularly demonstrated or broadcast, there have been several systems proposed and demonstrated over
the years, and there are those that are proposed for the future. These range from the first demonstrations of 3D television by John Logie Baird in the 1920s and 1940s through to a proposed Japanese system for 2020.

13.2.1 Baird's System

John Logie Baird, arguably the inventor of television, pioneered a 30-line mechanical television in the late 1920s, though this mechanical system was not adopted for broadcast as a superior electronic 405-line system was favoured by the BBC in the UK. After the liquidation of the Baird Company, Baird experimented independently on 3D colour television with screen sizes ranging from 60 cm to 76 cm, using a 600-line image interlaced six times. His first demonstration of 3D television (or stereoscopic television as it was then called) was made to a selected audience on 9th August 1928. Writing for 'Television' in September of that year, Dr C Tierney gave an account of that demonstration: "A man sitting before a transmitter was very clearly seen on the screen of a receiver situated in another part of the building, in perfect relief, showing the facial delineation and expression both with and without optical assistance. These experiments promise considerable development and importance in their practical application".

Baird continued development, and in 1940 produced a system where scanning was carried out by an electronic 'flying spot' method, similar to that used a few years earlier in cinema television demonstrations. Some mechanical operation was still included, in the form of a set of three rotating primary colour filters, on the same principle as his first colour television of 1928. This was demonstrated to reporters on a 75 cm × 60 cm screen in December 1940. Finally, Baird's stereoscopic television was developed for high definition (500-line) colour transmission and successfully demonstrated in late 1941. This technique did not require the wearing of special glasses by the viewer, but it was necessary for the viewer's head to stay in one position to see the stereoscopic effect.

13.2.2 Rollman's Anaglyph

In 1853 Wilhelm Rollman first illustrated the anaglyph principle of using blue and red line drawings on a black background to produce a 3D effect when observed through red and blue filter glasses. In 1858 Joseph D'Almeida used this principle to project 3D magic lantern slide shows using red and green filters, with the audience observing the images through red and green glasses. Louis Ducos du Hauron produced the first printed anaglyphs in 1891 by printing two negatives on the same paper, one in blue or green and one in red, to create the anaglyph effect. James Butterfield, Stanton Alger and Dan Symmes authored a patent for a 3D stereoscopic television system dated March 29, 1988; this included a theoretical discussion of a full-colour anaglyph method. These historical developments led to many later technologies for colour 3D anaglyph movies, 3D videos and 3D photography that were essentially reinventions of previous ideas.
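The anaglyph principle is straightforward to reproduce digitally: the left view supplies the red channel and the right view the green and blue channels, so that red/cyan glasses route each view to the correct eye. A minimal sketch in Python is given below; the file names are hypothetical, and NumPy and Pillow are assumed to be available.

    import numpy as np
    from PIL import Image

    def make_anaglyph(left_path, right_path):
        # Load the stereo pair; both images must share the same dimensions.
        left = np.asarray(Image.open(left_path).convert("RGB"))
        right = np.asarray(Image.open(right_path).convert("RGB"))

        anaglyph = np.empty_like(left)
        anaglyph[..., 0] = left[..., 0]    # red channel from the left view
        anaglyph[..., 1] = right[..., 1]   # green channel from the right view
        anaglyph[..., 2] = right[..., 2]   # blue channel from the right view
        return Image.fromarray(anaglyph)

    # Hypothetical file names, for illustration only.
    make_anaglyph("left.png", "right.png").save("anaglyph.png")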
By the late 1980s there were several colour polychromatic anaglyph TV broadcasts in the Los Angeles area, including the 1950s 3D films Inferno and Hondo, starring John Wayne, as well as original footage for a local Los Angeles news programme. This was followed by the BBC and the Independent Television companies in the UK using the Ana-Vision™ system of 3-D Images Ltd for transmission of 3D content. This system used coloured glasses to provide a full-colour technique for film and television.

13.2.3 Pulfrich's Stereo

In 1922, the German physicist Carl Pulfrich observed that when a pendulum swings in a straight line across the eyes of a viewer who has one eye looking through a grey lens and the other eye clear, the pendulum appears to move in an ellipse in three dimensions. The pendulum is seen going away from and then towards the viewer as it swings from left to right, or right to left, depending on which eye is looking through the darkened lens. Pulfrich happened to try this because of an eye injury that caused a cataract in his eye. The effect is caused by the nerve cells in the visual cortex receiving the brighter image processing it more rapidly than the neurons processing the darker image. This separation in time causes a spatial illusion, as the human brain resolves the two images as if they were slightly separated in space. The Pulfrich effect can also be observed when the camera capturing a scene moves laterally across the scene. For example, 3D is observed if the camera is moving from right to left and a dark filter is placed in front of the right eye. The effect is also observed if the camera movement is in the other direction with the filter placed in front of the left eye. The BBC used this effect and produced several 3D TV shows, with episodes of 'Dr Who' and 'Eastenders' made especially to exploit the Pulfrich 3D effect; in America an episode of 'Third Rock from the Sun' and an episode of 'Married with Children' were also made in Pulfrich 3D.

13.2.4 Polarization

There have been many early attempts at presenting a stereo pair, with alternating polarization orientations, on alternate frames of an interlaced scan to produce a 3D effect. However, none were particularly successful, as limited switching rates produced high levels of image flicker. A more recent and potentially successful polarized-glasses method is that used by Dynamic Digital Depth Group plc (DDD), who in conjunction with the Arisawa Manufacturing Co. Ltd. will be launching the first consumer solution for watching any television show, DVD, or video in 3D. It is anticipated that a new television system will be introduced that will allow the viewer to switch the television into 3D mode by activating the TriDef Vision+, a DDD-designed set-top box that converts any 2D video signal into 3D. The Vision+, combined with the television's pre-installed 2D/3D screen and a pair of polarized glasses for each
viewer, would allow 3D viewing from anywhere in the room. The screen consists of a series of orthogonally arranged polarizers that separate the left and right images on alternate pixel rows.

13.2.5 The Holograph

Stephen Benton invented white-light transmission holography whilst working on holographic television at Polaroid Research Laboratories in 1968, but there seem to be no outcomes of the work available for 3D television, and given the limited technology of the time it is unlikely that successful transmission of an acceptable holographic television image was achieved. Following this, Victor Komar in Russia produced holographic film movies in the 1970s and 1980s, but this work was eventually dropped due to lack of funds and there is no evidence that the work was extended to encompass video. Holographic video research has been carried out at MIT since 1989, but this has never been developed to the stage where it is used for television transmission. It is likely that no actual practical systems have been realised to date, primarily due to the huge amounts of data required for holographic video and the subsequent high transmission bandwidth needed by this data; although algorithms are used to reduce the bandwidth requirement considerably, it is still too high for a practical television system.
13.3 Contemporary Domestic 3D Technologies

Having briefly reviewed early work on producing 3D video displays, we now examine contemporary approaches that would be suitable for domestic use. These technologies range from the display of two differing images, one presented to each eye (binocular), through to many-image approaches (holography).

13.3.1 Binocular

Binocular approaches deliver a simple image pair (a left eye image and a differing right eye image) to the eyes of a viewer. Often methods are used where special glasses are worn in order to separate left and right images on a single screen to the appropriate eyes. Examples of this type of display, as reviewed previously, are anaglyph, where glasses with coloured filters separate the images; polarized glasses, where the two images are separated by orthogonally polarized light [1]; or shuttered glasses, where time multiplexing is used to present the left and right images sequentially to each eye [2]. It is also possible to display separate images to a viewer with the use of a head mounted display (HMD) or near-to-eye devices. However, for many applications the use of glasses is considered unacceptable and an autostereoscopic
approach must be adopted; displays that do not require the wearing of special eye wear are referred to as autostereoscopic. Typically, autostereoscopic displays produce separate images to the left and right eyes by forming viewing regions in space in front of the screen where either a left or a right image is seen; this is known as a binocular approach. It is illustrated in Fig. 13.1, where the viewer will perceive 3D if the left and right eyes are located in the regions L and R respectively.

Fig. 13.1. Binocular viewing zones

In Fig. 13.1 the viewing regions are fixed in space and the viewer must locate their eyes at the correct position in front of the screen. There is some tolerance to head movement, typically less than the eye separation distance of around sixty-five millimetres. Examples of the contemporary approaches used for these types of display are spatial multiplexing, where an image pair is interlaced on alternate display (typically an LCD) pixel columns and light is directed through the display to the eyes by either a vertically aligned lenticular screen [3, 4], a parallax barrier [5], a prismatic screen [6], or controlled light sources behind the LCD [7]. A second, similar approach uses images displayed on alternate pixel rows and light directed to the eyes using a holographic optical element (HOE) [8]. In addition, some displays are capable of being switched between the 2D and the 3D mode [5, 9]. However, due to the geometry of the display optics in all of these displays, other viewing zones are formed, for example L1 and R1 in Fig. 13.1, where 3D can also be observed. Hence if a viewer's left and right eyes are located in, say, R and L1 respectively, a pseudoscopic image is observed; this is where a 'reversed' and incorrect 3D image is seen. The approach to overcoming this problem, and the problem of allowing movement of the viewer, is to track the viewer's head or eye position, and then move the viewing regions accordingly onto the correct eyes. An early example of such a head tracked display is that of Alfred Schwartz [10]. This display produces the image by projection and uses a Fresnel lens to concentrate the light in the regions of the viewer's eyes. There have been several other head tracked systems reported over the past few years, which have incorporated different optical systems, some of these being: lenticular screens [11, 12], movable illumination sources [13], prismatic screens [14], HOEs [15], miscellaneous optical configurations [16, 17, 18, 19, 20, 21, 22, 23], projection and Fresnel lenses, twin screens with selectable light sources [24], single screens with micro polarizer multiplexing arrays with a separate LCD to steer light [25], twin screens with twin monochrome display light sources [26, 27, 28], single screens with a micro polarizer array and switched LED light sources [29] and finally twin LCDs whose images are projected on to a Fresnel field lens [30].

Clearly there are many approaches to steering a binocular set of images to the eyes of a viewer. Displays that present only a single stereo image pair have an advantage in that only the minimum amount of information (two images) need be displayed. However, there are some disadvantages associated with this, the principal ones being: lack of motion parallax, rivalry between the accommodation and convergence of the viewer's eyes, and 3D geometry distortions. Motion parallax gives the ability to 'look around' an object by providing an image that changes continuously with viewpoint. This is what happens when a scene is viewed naturally. Motion parallax is not inherently impossible in a head tracked display, as the images can be altered in accordance with the viewer's head position.

The conflict between accommodation and convergence (the natural movement of the eyes inward to view objects) is arguably the principal disadvantage of two-image methods [34]. Objects in the image will invariably appear to be away from (in front of or behind, 'A' in Fig. 13.2) the plane of the screen. When this is the case, the eyes focus on the screen (as this is where the images are displayed, 'B' in Fig. 13.2), but converge at the apparent distance of the object, as in Fig. 13.2. This obviously does not happen in natural viewing conditions, and any difference between the accommodation and convergence can potentially cause eyestrain and nausea.

Fig. 13.2. Accommodation/vergence rivalry

3D geometry distortions give rise to distortion in the depth of a 3D image and also false rotation of the image. These are shown in Fig. 13.3 and were first described in 1953 [35]. Apparent depth distortion makes the appearance of depth (how far away different parts of the image may be) increase with increasing viewing distance. Consider a scene that is observed by a viewer in position 'A' (Fig. 13.3) and whose image content is bounded by the rectangle 'a,a,a,a' (Fig. 13.3). When the viewer moves to position 'C' (Fig. 13.3), the boundary moves to the position indicated by line 'c,c,c,c' (Fig. 13.3). This boundary change indicates the distortions of objects depicted within it. Further, a possibly more noticeable artefact is the effect of false rotation. This can be understood by considering the virtual object 'OA' (Fig. 13.3) that appears to be one third of the distance between the screen and the viewer 'A' and on the line between the centre of the screen and the centre of the eyes. The position of this object will always fulfil these conditions, so that as the head position moves from 'A' to 'B', its position on the centre line moves to 'OB'.

Fig. 13.3. Image geometry distortions

13.3.2 Multi-view

In order to enable freedom of viewer movement, but without the use and limitation of the head position tracking found in binocular systems, a binocular system can be extended with a series of images presented across the viewing field to form a multi-view display. The use of many images, each viewable at only one eye location, enables the display of motion parallax, as each view can be a different perspective of the image. However, the problem of accommodation/convergence rivalry still exists, as the eyes are focused on the screen but the image may appear in front of or behind the screen. Changing the image with viewer position can prevent false rotation, and image geometry distortions can also be eliminated. The principal disadvantages of multi-view displays are the limited depth of the viewing region and the limited depth of field.
Fig. 13.4. Multi-view viewing zones
In Fig. 13.4 viewers will perceive 3D when their eyes are located in adjacent viewing zones. It can be seen that these only form a small proportion of the viewing field. When the eyes of the viewers are located in the shaded regions (images 1–6 in Fig. 13.4) each eye will correctly see a single image across the complete width of the screen. However, when an eye is away from these regions, it will perceive one or more different and incorrect images across the width of the screen. For example, the eye shown in Fig. 13.4 will see parts of three images across the width of the screen; image 3 will be seen on the right hand side of the screen, image 2 in the centre and image 1 on the left. This can be partially but not fully overcome by blending the differing images into each other as the eye traverses the viewing field, and by reducing the image disparity (the 3D depth of the perceived images) so that image points appear to be close to the plane of the screen, reducing any noticeable change in 3D effect as the eye moves. Although the practical implementation of a multi-view display may be fairly complex [36, 37], they can also be relatively simple; for example the Philips 9-view display [38], which employs a lenticular sheet to form the viewing zones, or the Sanyo 4-view display [39], which incorporates a parallax barrier. In the short term, due to their relative simplicity, it is likely that multi-view displays will provide the first generation of 3D displays for widespread use.

13.3.3 Holoform

A further development of multi-view 3D displays is the use of holoform imaging. We have seen that multi-view displays suffer from differences between accommodation and convergence causing possible viewer discomfort. A research group at the Telecommunications Advancement Organization of Japan (TAO) has identified the need for a large number of views in order to overcome these problems caused by accommodation and convergence [40]. Their approach is to provide what they term 'super multi-view' (SMV). Under these conditions, the pupil receives two or more parallax images. The authors claim this will cause the eye to focus at the same distance as the convergence, which is a significant finding. The number of views required for a SMV display is large, and a method of producing sufficient views is to use holographic techniques to produce holographic stereograms. Holographic stereograms, where many multiple views across the viewing field are produced holographically, have a second advantage, analysed in a paper by Pierre St Hilaire [41]. In this paper, the effect of the image appearing to 'jump' between adjacent views is considered. This phenomenon is similar to the aliasing that occurs when a waveform is under-sampled, i.e. when the sampling rate is less than twice the maximum frequency in the original signal. This criterion is of the same order as the figure obtained from research at the Heinrich Hertz Institute, where it has been determined that typically 20 views per inter-ocular distance are required for the appearance of smooth motion parallax [42]. Thus if a multi-view display presents a sufficiently large number of images whose pitch across the viewing field is small, the appearance of continuous motion parallax can be generated, as the individual differences between adjacent images will be very small and unobserved. Another benefit of holoform displays is that the vertical motion parallax is discarded in practical implementations, as this is not necessary in order for 3D to be perceived (provided the axis of the head is close to vertical). From these discussions, a lenticular display that claims to present continuous motion parallax has been developed [43]. Moving parallax barriers can also be used to provide continuous motion parallax. These may either use a barrier that is moved mechanically [44] or a solid state ferro-electric barrier [45, 46] to move and steer the viewing zones to the eyes of the viewer. Two recent holoform displays are the quasi-holographic display of Holografika [47], which incorporates a holographic optical element (HOE) screen, and Cambridge University's dynamic parallax barrier display [48]. In principle, a holoform display could consist of a vertically aligned lenticular screen with a high-resolution display mounted behind it. However, at the time of writing, there is no display available with sufficiently high resolution.

13.3.4 Volumetric

A volumetric display is one where an image is produced within a volume of space. The elements of the image are referred to as voxels ('volume pixels', as opposed to the pixels of a 2D image, which all lie in one plane), with the voxels produced by effectively 'slicing' up the image and displaying the individual planes to produce a 3D image. Volumetric displays fall within two basic categories [49], these being virtual image and real image. In a virtual image display, virtual (perceived depth rather than real depth) depth planes are produced either by a deformable mirror whose focal length varies with time [50, 51] or with a rotating lens [52]. 'Slices' of the image are produced sequentially, and the lens or mirror is used to scan the virtual image of each slice through space. Real image (the image
is displayed on a screen with actual depth) volumetric displays can be implemented in two ways, namely moving screens and static screens. As with virtual image volumetric displays, moving screens produce a series of 'slices' sequentially. The moving screen can consist of a screen on which an image is projected [53, 54, 55], for example a rotating spiral, or a screen moving in and out from the viewer from which the image illumination is emitted [56]. In static screen displays, as their name implies, there are no moving parts. The light may be 'piped' to the individual voxel positions [57]. Alternatively, images may be projected on to a series of stacked parallel screens where one of the stack is sequentially rendered opaque, thus effectively producing a moving screen on to which the individual image 'slices' may be projected [58]. The advantage of volumetric displays is that there is no accommodation/convergence rivalry, as each voxel occupies a given region of space as in real-life viewing conditions. Also, the image has motion parallax in both the horizontal and vertical directions. The disadvantage of volumetric displays for the presentation of video input (as opposed to computer-generated images) is that displayed surfaces are transparent, so images tend to appear non-solid.

13.3.5 Holographic

The ideal stereoscopic display would produce moving images in real time that exhibit all of the characteristics of the original scene. This would require the reconstructed wavefront of the scene to be identical to the original, and this could only be achieved using holographic techniques. The difficulties of this approach are the huge amount of computation necessary to calculate the fringe pattern of the hologram, and the high resolution required of the display, which has to be of the order of a wavelength of light (around 0.5 micron). Approximately ten million discretized samples per square millimetre are required to match the resolution of an optically produced hologram. This means that large amounts of redundant information have to be displayed. The earliest attempts at producing moving holographic images were made by the Spatial Imaging Group at MIT in 1989 [59]. This utilised acousto-optic modulators (AOMs) acting as a spatial light modulator (SLM) to modulate coherent light, and rotating mirrors to scan the beams. Initially, computation of the fringe pattern was slow, but this was improved by the use of a method known as diffraction-specific computation [60]. A later version used an 18-channel AOM and six tiled scanning mirrors [61]. Improvements on this display have been made by replacing the AOM with a focused light array consisting of laser diodes [62] and by eliminating moving parts [63]. The use of multiple LCDs to produce an effective 15-million pixel LCD [64, 65] and other methods [66] have been described by groups from Japan. Another approach to providing a spatial light modulator with a sufficiently high resolution is to use an optically addressed SLM (OASLM) whose information is supplied via a lower resolution electrically addressed SLM (EASLM) [67]. One option for the EASLM is the digital
micro-mirror device (DMD) [68, 69]. Although not strictly holographic, an interesting spin-off from the OASLM/EASLM technology is its application in a multi-view display [70] by a joint group from Cambridge University and the Korea Institute of Science and Technology. However, although research is ongoing, to date no viable domestic 3D display has been produced using these techniques. Although holography may not provide a viable solution for a 3D display in the short term, it is likely it will be practicable eventually. A major problem will be with image capture, where it will be difficult to reproduce naturally lit scenes, as holograms cannot be captured with ambient light. It may be that image synthesis is employed, or that some semi-holographic method will be adopted.
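The bandwidth problem can be made concrete with a rough back-of-the-envelope calculation based on the figure of ten million samples per square millimetre quoted above; the screen size, frame rate and bit depth below are illustrative assumptions only.

    # Rough estimate of the raw data rate of a hologram-resolution display.
    # Assumptions (illustrative): a 400 mm x 300 mm screen, 25 frames/s,
    # 8 bits per sample, and the 1e7 samples/mm^2 figure quoted in the text.
    samples_per_mm2 = 1e7
    width_mm, height_mm = 400, 300
    frame_rate = 25
    bits_per_sample = 8

    samples_per_frame = samples_per_mm2 * width_mm * height_mm
    bits_per_second = samples_per_frame * frame_rate * bits_per_sample

    print(f"samples per frame: {samples_per_frame:.2e}")        # ~1.2e12
    print(f"raw data rate: {bits_per_second / 8e12:.0f} TB/s")  # ~30 TB/s

Even under these modest assumptions the raw rate is tens of terabytes per second, which is why the text argues that no practical holographic television system has yet been realised.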
13.4 The Demands for Domestic 3D Video Display

Having briefly reviewed past attempts and contemporary work on producing viable domestic 3D displays, we have seen that none have been wholly successful, due to technological limitations and also perhaps because none so far fulfil viewer expectations of a viable and usable domestic 3D video display. It is necessary to define what we believe are the most important factors that would fulfil viewer expectations and so produce a viable 3D system. Examining the properties of domestic 2D televisions, these systems allow viewers to watch unencumbered by any special eye wear (such as shuttered, polarised or coloured glasses), they allow viewers to move around the domestic room freely and watch images without significant degradation, and they accommodate as many simultaneous viewers as desired. For a 3D system to be successful all three of these properties must be met. Hence, extending these to 3D systems, we can define that: 1) the 3D display must be autostereoscopic (not require any eye wear); 2) it must allow viewers to move freely about the domestic room and still see 3D (mobile viewer); and 3) it must accommodate multiple viewers simultaneously (multi viewer). These three key elements are summarised as the 'autostereoscopic, mobile, multi' viewer requirements. There are also a number of 3D-specific requirements that are desirable but not essential, as discussed in Sect. 13.3. These are that the display should provide motion parallax, have no accommodation/convergence rivalry, and finally provide an image that is apparently solid and not transparent or translucent. In order to determine the most suitable display for 3D requirements it is useful to summarise the results of the possible approaches already discussed. The generic types of 3D display are holographic, volumetric, holoform, multi-view and binocular. Table 13.1 summarises the performance of the various types, but does not take into account any other considerations, for example cost, complexity or whether the technology yet exists to support it.
Table 13.1. Autostereoscopic Display Requirements and Performance

Display type                          Number of   Tolerates viewer   Motion     Acc/conv   Image
                                      viewers     movement           parallax   rivalry    transparency
Holographic                           Multiple    Large              Yes        No         No
Volumetric                            Multiple    Large              Yes        No         Yes
Holoform                              Multiple    Large              Yes        No         No
Multi-view                            Multiple    Limited            Yes        Yes        No
Binocular, fixed (non head tracked)   Single      Very limited       No         Yes        No
Binocular, single-user head tracked   Single      Adequate           Possible   Yes        No
Binocular, multi-user head tracked    Multiple    Large              Possible   Yes        No
According to Table 13.1, holographic and holoform displays have all the necessary attributes for the ideal 3D display; that is, they have the potential to support multiple viewers who can move freely over a large area, and the image can replicate the original. Volumetric displays can also fulfil all these conditions with the exception of image transparency, which makes them unsuitable for video. Multi-view displays, and all types of binocular display, suffer from the disadvantage that the viewer's eyes focus on the screen but converge at the apparent distance of the point in the image where the eyes are fixated. Multi-view displays have a usable region of limited depth, as explained in the previous section, and binocular displays with fixed viewing regions place severe restrictions on the viewer's position. These restrictions on viewer position can be overcome with the use of head tracking. Holographic and holoform displays are potentially the most suitable, and their use as the next generation of 3D display must be considered. However, the large amount of information that has to be displayed is the principal barrier to the introduction of practical moving holographic displays. At present, the image volumes of these displays are small and the viewing regions limited. Applying the criterion defined previously, that 20 views per inter-ocular distance are required for the appearance of smooth motion parallax [42], equates to approximately 300 to 400 separate images per metre width of viewing field for smooth motion parallax and for the elimination of accommodation/convergence rivalry, as the short sketch below illustrates. A holoform display that meets these conditions would have to receive and present a formidable amount of information, around two orders of magnitude greater than a 2D display. Given the current technological barriers to producing the most optimal 3D display, the authors consider that the head tracked binocular option provides a viable solution for the next generation of autostereoscopic display. Single viewer and multi-viewer solutions for these displays are now presented.
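As a check on the figure quoted above, the following sketch derives the views-per-metre requirement directly from the 20-views-per-interocular-distance criterion; the 65 mm interocular distance is the usual assumed value.

    # Views required per metre of viewing field for smooth motion parallax,
    # from the criterion of 20 views per inter-ocular distance [42].
    views_per_ipd = 20
    ipd_m = 0.065    # assumed inter-ocular distance (metres)

    views_per_metre = views_per_ipd / ipd_m
    print(f"{views_per_metre:.0f} views per metre of viewing field")  # ~308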
13.5 A Single-viewer Solution to Domestic 3D Display

A solution found by the authors for a single viewer binocular autostereoscopic head tracked display suitable for domestic use is now described. The display enables the single viewer to move their viewing position both across the plane of the screen laterally (X) and toward and away from the plane of the screen (Z, a change of viewing distance), thus providing a comfortable degree of viewer movement. This movement is tracked, and the head positional information is used to alter the displayed images so as to also provide motion parallax. The display is called the 'Free2C' 3D display and is being developed by the Fraunhofer-Institute for Telecommunications, Heinrich-Hertz-Institut, Germany ('HHI') [31].

13.5.1 The Free2C 3D Display

The basic concept of the Free2C single-user 3D display is illustrated in Fig. 13.5. As can be seen from the illustration, a pair of stereoscopic views is reproduced simultaneously in a column-interleaved format on a conventional LCD (Liquid Crystal Display) panel, forming a spatially multiplexed left and right image pair on the LCD. The display is equipped with a lenticular lens raster which deflects the individual perspective images into the left and right eye respectively of a single viewer (Fig. 13.5). The display accommodates the head movement of the viewer by continually re-adjusting the position of the lenticular lens in relation to the LCD to steer the stereoscopic views onto the eyes of the viewer. The lenticular lens raster may be rapidly and accurately moved both in the lateral (X) plane and in the fore and aft viewing distance (Z) plane. Thus lateral (X) head movement is accommodated by moving the lenticular also in the X direction, and fore and aft (Z) movement of the head is accommodated by moving the lenticular also in the Z direction. The lenticular may move in both X and Z planes simultaneously. This approach provides the user with satisfying 3D reproduction
Fig. 13.5. Free2C display principle of operation
within a sufficiently large viewing area to allow comfortable head movement, and overcomes a major problem of many state-of-the-art autostereoscopic 3D displays that require a fixed head position. The viewer's actual head position is measured by a highly accurate, video-based tracking system (see Sect. 13.5.2). The technical specifications of the Free2C single-user 3D display are as follows. It is constructed from an LCD with a screen size of 21.3 inches diagonal and a spatial resolution of 1200 × 1600 pixels (3:4 portrait format). The achieved image quality (contrast 300:1, brightness 200 cd/m2) is equal to that of common monoscopic flat panel displays. Variation of the viewing distance fore and aft (Z) is feasible in a range of 400 mm to 1100 mm, and side to side head movements are possible in a range of approximately ±25° from the screen centre. The particular design of the lens raster plate ensures that the stereoscopic views are almost perfectly separated (ghosting < 2%). Hence the Free2C 3D display meets the essential requirements for comfortable viewing of extended stereoscopic depth volumes. The basic Free2C display and driving computer are illustrated in Fig. 13.6, with one practical implementation of the display for an information kiosk illustrated in Fig. 13.7.

13.5.2 A Note on Viewer Tracking

Robust and reliable head tracking systems are required to allow accurate steering of left and right eye images to the eyes of viewers. These systems must
Fig. 13.6. The Free2C single-user autostereoscopic 3D display (left) and driving computer (right)
Fig. 13.7. The integration of the Free2C technology in a kiosk system
provide: (a) high accuracy in terms of located head position; (b) robustness with respect to different users, fast head movements, and changes in scene background and illumination; and (c) automatic initialization and viewer detection procedures. Typically, state-of-the-art head trackers deploy a range of approaches, from passive (optical) markers and active (optical, acoustic, magnetic) emitters and receivers to inertial system components such as gyros, gravimeters and accelerometers. Some advanced systems even combine different components, e.g. optical and inertial subsystems, in order to make the tracker more robust against changes in the environment or occasional visual occlusions. Generally, these systems are intrusive, since they require the user(s) to be tethered to the measurement equipment, or at least to wear some parts of the equipment. However, the ever decreasing price/performance ratio of computing, coupled with decreases in video image acquisition cost, has triggered numerous research activities where machine-vision based approaches are used. These are non-intrusive and passive, and rely solely on the detection of head and facial features (such as the eyes) in images. Such systems are more appropriate to a 'walk-up-and-use' 3D display system. Systems have previously been developed such as the 'Blue Eyes' tracker from IBM Almaden for human-computer interaction, which uses a dual-light-source head tracker alternating between the dark pupil and bright pupil effects to detect viewer eyes in an image. Here one set of infrared LEDs mounted close to the camera axis creates a bright pupil image (red eye effect) and a second off-axis infrared LED source produces a dark pupil image. The two light
486
P. Surman et al.
sources are switched on and off alternately for successive video frames, thus allowing the disambiguation of bright pupils in the difference image. Another solution is the 'faceLAB' system developed by Seeing Machines, which is based upon a robust and flexible stereovision head model adapted to the facial features of a user. This system requires calibration to the user, and then extracts and interprets the location of facial features such as the eyebrows, pupils, irises, eye corners, mouth and nostrils to calculate a head position. However, neither of these systems is sufficiently accurate, or sufficiently free from initial user setup, to be suitable for a domestic 3D display.

13.5.3 The HHI Video Head Tracker

Due to the lack of a sufficiently suitable head tracking system, HHI developed a new system suitable for their display. This is a non-contact, non-intrusive video based system that provides a near real-time, high-precision single-person 3D video head tracker. The fully automated tracker employs an appearance-based method for initial head detection (requiring no calibration) and a modified adaptive block-matching technique for head and eye location measurements after head detection. The adaptive block-matching approach compares the current image with eye patterns of various sizes that are stored during initialization. Tracking results (shown as locating squares on the eyes) for three different users with three different scene backgrounds and illumination conditions are shown in Fig. 13.8. As can be seen from the figure, the tracking algorithm also works for viewers who wear glasses. Depending on the camera frame rate and resolution used, the head tracker locates the user's eye positions at a rate of up to 120 Hz. Measurements of head and eye position in three-dimensional space (X, Y, Z) are calculated with a resolution of 3 × 3 × 10 mm. If a single camera is used for tracking then the Z-coordinate is calculated from the user's interocular distance. This value can be specified manually; otherwise a default value of 65 mm is used, assuming that the viewer's eyes are oriented parallel to the display screen. If two (or more) cameras are used then this is supplemented with triangulation of the eye positions via the cameras' base distances, so that the head tracker can determine the Z-position even without prior knowledge of the user's eye separation.
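The single-camera depth calculation mentioned above follows directly from the pinhole camera model: the apparent pixel separation of the eyes shrinks inversely with distance. A minimal sketch of this relationship is given below; the focal length and pixel pitch are illustrative assumptions, not the actual parameters of the HHI tracker.

    def z_from_interocular(pixel_separation, focal_length_mm, pixel_pitch_mm,
                           ipd_mm=65.0):
        """Estimate viewer distance Z (mm) from the separation of the two
        detected eyes in the image, using the pinhole model Z = f * IPD / x,
        where x is the eye separation projected onto the sensor. Assumes the
        eyes are oriented parallel to the display screen."""
        x_mm = pixel_separation * pixel_pitch_mm   # separation on the sensor
        return focal_length_mm * ipd_mm / x_mm

    # Illustrative values: 8 mm lens, 5 micron pixels, eyes 130 px apart.
    print(f"Z = {z_from_interocular(130, 8.0, 0.005):.0f} mm")  # 800 mm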
Fig. 13.8. The HHI video head tracker in operation
Fig. 13.9. Tracked live (left) and reference (right) eye patterns
The application of two cameras also increases the accuracy in the Z-direction and extends the overall tracking range. For automatic initialization, the tracker finds the user's eye positions either by looking for simultaneous blinking of the two eyes, or by pattern fitting face candidates in an edge representation of the current video frame using a predefined set of rules. These face candidates are finally verified by one of two possible neural nets. After initial detection, the eye patterns that refer to the open eyes of the viewer are stored as a preliminary reference. Irrespective of the initialization method applied, the initial reference eye patterns are scaled (using an affine transformation) to correspond to six different camera distances (Fig. 13.9, right images). The resulting twelve eye patterns are used by the head tracker to find the viewer's eyes in the current live video images (Fig. 13.9, left images).

13.5.4 A Single Viewer Autostereoscopic Display

When combined, the HHI display and non-intrusive head tracking system provide a viable and usable domestic 3D display solution. However, the technology is only suited to a single viewer, as it is not possible to steer the lenticular lens raster that directs the displayed images to the left and right eyes to more than one viewer simultaneously. The only possible solution to this would be time multiplexing (moving the lenticular to one viewer and showing the images, then moving to the next viewer and showing the images, and so on); however, the lenticular has too great a mass for this to be accomplished sufficiently rapidly. The display therefore provides two of the three requirements for a domestic television 3D display, in that it is autostereoscopic and it allows viewers to move freely about the domestic room and still see 3D (mobile viewer), but it cannot accommodate multiple viewers simultaneously (multi viewer). Thus a different solution must be adopted for multiple viewers.
13.6 A Multi-viewer Solution to Domestic 3D Display

A solution found by the authors for a multiple viewer, mobile viewer, binocular autostereoscopic head tracked display suitable for domestic use is now
described. In a similar fashion to the single viewer display described previously, the display enables a viewer to move their viewing position both across the plane of the screen laterally (X) and toward and away from the plane of the screen (Z, a change of viewing distance), with this movement being tracked and the head positional information used to steer images to the eyes via novel optics. In addition, the display uses novel optics to generate multiple left and right eye images which may be steered independently. Thus the display is autostereoscopic, accommodates mobile viewers and accommodates multiple viewers, satisfying all three of the requirements outlined previously in Sect. 13.4. It is currently under development by the Imaging and Displays Research Group at De Montfort University (DMU), UK.

13.6.1 The DMU Multi-user 3D Display

The display operates in a similar manner to the Free2C display by using an LCD with left and right eye images interlaced on alternate pixel rows, coupled with a lenticular lens raster to steer light. However, here the similarity with the Free2C display ends. In the Free2C display the lenticular lens that steers images to the left and right eyes of the single viewer is located in front of the LCD screen displaying the left and right eye images, with movement of the exit pupils accomplished by movement of the lenticular. In the DMU display the lenticular is fixed and placed behind the LCD screen. Here the lenticular is used simply to focus light from light sources and light steering optics placed behind the screen through the left and right image pixel rows of the LCD, with steering of the exit pupils accomplished not by lenticular movement but by movement of the light source through steering optics located behind the screen and lenticular. This arrangement is illustrated in Fig. 13.10. There are two sets of steering optics and light sources located behind the display, with the light from each source following a different path through the display. Light for the left eye exit pupil is focussed on the lenticular so that it only falls on the left eye LCD image rows ('L' in Fig. 13.10), and light for the right eye exit pupil is focussed on the lenticular so that it only falls on the right eye LCD image rows ('R' in Fig. 13.10). Note that a parallax barrier could be used in place of the lenticular sheet to form the simplest multiplexing screen; however, the light throughput of a parallax barrier is limited to a maximum of approximately 50%. It is more efficient to perform this function using a lenticular screen that consists of horizontally aligned cylindrical lenses that have the same pitch as the parallax barrier. In this case the lenses enable potentially 100% of the light from the steering optics to be focused on to the LCD pixels. The lenticular screen is located close behind the LCD and its positioning is critical in order to maximise light throughput and minimise crosstalk, which is caused by light falling on the incorrect left or right eye LCD pixel rows. The core concept of the display is to produce image regions in space in front of the screen at the viewer's eye positions. These viewing regions are known
Fig. 13.10. Display screen layout
as exit pupils, and their formation can be explained by considering Fig. 13.11, where an exit pupil is formed with the use of a large lens and a vertical light source. Here the light source forms a real image at the centre of the exit pupil, such that an observer within the shaded region shown (Fig. 13.11) will see the illuminated screen image across the complete area of the screen. In order for 3D to be observed, two adjacent exit pupils must be formed; this is simply achieved by placing a second image source to one side of the existing source to produce an additional exit pupil.

Fig. 13.11. Exit pupil formation with lens

In principle, multiple viewers could be served by using several pairs of left and right eye light sources behind the screen of Fig. 13.11. However, in practice lens aberrations limit the region over which the exit pupils can be formed, such that the use of this display is limited to a single viewer who has limited freedom of movement. This single lens problem is overcome by replacing the illumination source and lens (as illustrated in Fig. 13.11) with multiple light sources and lenses to form light steering optics placed behind the lenticular lens and LCD screen. This principle is shown in Fig. 13.12.

Fig. 13.12. Steered light principle

In practice this concept is extended by replacing the simple multiple lenses of Fig. 13.12 with an array of co-axial lenses, as illustrated in Fig. 13.13a. The required exit pupils are formed by the series of cylindrical lenses with a light source placed behind each vertical lens (Fig. 13.13a).
Fig. 13.13. Steering array and aperture images
Fig. 13.14. Co-axial optical element
Here the light steering optics have illumination surfaces behind each optical array element, with each surface supplied by a linear array of individual light sources that can be switched independently in accordance with the viewers' head positions. A single lens element of the steering array is shown in Fig. 13.14. Note that the optical arrangement shown in Fig. 13.14 does not exhibit any off-axis aberrations, as both the illumination and the refracting surfaces are cylindrical and have a common axis placed at the centre of the aperture; for this reason, this configuration is termed co-axial. There are many optical elements (Fig. 13.14) in the steering optics array (Fig. 13.13a), as the apertures of each array element are narrower than the width of the screen. Hence the elements are arranged in the stacked configuration shown in Fig. 13.13a in order to provide a contiguous light source across the width of the array. The appearance of this illumination is shown in Fig. 13.13b. As this illumination provides the backlight for the display LCD it must light the complete height of the screen, and this is achieved by locating a vertically diffusing sheet in front of the optical elements. Figure 13.13 also shows that two arrays are used: the upper one illuminates the left pixel rows via the lenticular multiplexing screen, and the lower array the right pixel rows. The nature of the steering optics is such that there is no limit on the number of exit pupils that can be formed; the limit is set by the number of viewers who can physically fit within the viewing field, with each additional viewer simply requiring an additional light source to be illuminated behind the optical elements (Fig. 13.14). The overall construction of the display is shown in Fig. 13.15, illustrating the placement of the screen assembly (lenticular and LCD), the steering optics (light sources and optical elements) and the side folding mirrors. The side folding mirrors are surface-silvered and are used to optically extend the width of the steering optics array by forming virtual images of the array when the display is viewed off-axis.
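In software terms, steering an exit pupil reduces to choosing which LED in each element's linear array to light for a given tracked eye position. The sketch below illustrates the idea for a simple paraxial model in which the LED plane maps linearly onto the exit pupil plane; the magnification value and array geometry are illustrative assumptions, not the actual DMU design parameters.

    def led_index_for_eye(eye_x_mm, num_leds=256, led_pitch_mm=1.1,
                          magnification=-9.0):
        """Select the LED (0..num_leds-1) whose image through the steering
        optics lands closest to the tracked eye position eye_x_mm, measured
        from the display axis at the viewing plane. A negative magnification
        models the image inversion of a simple lens (assumed)."""
        # Position on the LED plane whose image falls at the eye.
        source_x_mm = eye_x_mm / magnification
        # Convert to an index in an array centred on the optical axis.
        index = round(source_x_mm / led_pitch_mm + (num_leds - 1) / 2)
        return max(0, min(num_leds - 1, index))

    # Two viewers' eyes at -200 mm and +310 mm from the display axis.
    for eye in (-200.0, 310.0):
        print(eye, "->", led_index_for_eye(eye))

Tracking a moving viewer then amounts to re-evaluating this mapping on every tracker update and switching LEDs accordingly, which is consistent with the switched-source steering described above.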
Fig. 13.15. Prototype multi-viewer display
13.6.2 Prototype Display

A prototype of the display has been constructed in order to demonstrate the operation of the display hardware. The optics that steer the exit pupils (Figs. 13.11 and 13.12) consist of two ten-element optical arrays that are located 800 millimetres behind the LCD and lenticular screen assembly (Fig. 13.10). The prototype steering optics array is shown in Fig. 13.16; this figure shows half of the left array, with the light sources illuminating the back of the optical elements of the steering array. Figure 13.17 shows the light from the steering optics illuminating and tracking several sets of viewer eyes printed on test target sheets of card. Note in this figure how the left and right eyes are illuminated separately (from the left and right steering arrays respectively), with a clear dark band between the eyes. Also note that more
Fig. 13.16. Light steering array prototype
than one viewer is accommodated. The eyes are tracked during movement by simply changing which light sources are lit behind the optical elements. This picture was taken with the screen assembly removed to provide clear photography of the exit pupils.

Fig. 13.17. Light exit pupils tracking viewer eyes

Illumination for the optical arrays is supplied by arrays of white surface-mount LEDs arranged around the periphery of the back surface of the optical elements. The LED illumination sources are constructed in modules of sixteen LEDs mounted with a pitch of 1.1 millimetres, with an integral driver chip, heat sink and condenser lens array, as depicted in Fig. 13.18. A 1.1 millimetre pitch was chosen as the smallest easily obtainable, and equates, via the steering optics, to a positional resolution for the exit pupil of approximately 10 mm at the eye of a viewer. This is quite acceptable, as it is considerably less than the interocular distance, so separation of the left and right exit pupils at the eyes is assured. Each of the optical elements of the display (Fig. 13.14) is illuminated by 16 LED modules, giving an illumination array of 256 LEDs per element; with 20 optical elements comprising the complete light steering array, the display requires 5120 LEDs.

Fig. 13.18. 16-element LED array

The screen assembly consists of a lenticular multiplexing screen and a high-resolution LCD (as shown in Fig. 13.10). The position of the multiplexing lenticular screen in relation to the LCD is crucial and must be adjusted to an accuracy of ±5 microns in the y-direction and ±50 microns in the z-direction. In the prototype the lenticular screen and LCD are mounted in a rigid frame that incorporates two fine adjustment screws that act on either side of the lenticular screen and move it in the y-direction. The position in the x-direction is not crucial, and accurate spacers set and fix the separation in the z-direction. The LCD used is a NEC 21" UXGA (1200×1600) panel that enables two standard 576-line images to be displayed. In standard form, the front polarizer that is adhered to the front glass substrate of the LCD would scatter light up to 5° from the display. This would cause unacceptable crosstalk, as it would scatter the directed light of the exit pupil; hence it is removed and replaced with a polarizer having smooth surfaces that do not scatter light. Vertical scattering of the exit pupils, however, is desirable, as this allows for variation in viewer height, and this is achieved with the use of a Physical Optics Corporation holographically produced light shaping diffuser (LSD) sheet that produces a 20° × 0.5° elliptical pattern. With the major axis of this in the vertical orientation, the sheet behaves as a vertical diffuser. The full prototype is shown in Fig. 13.19.
Fig. 13.19. Prototype multi-user display
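As a consistency check on the figures above, a few lines of arithmetic reproduce the LED count and the implied magnification of the steering optics; the roughly 9× magnification is our inference from the quoted pitches, not a stated design parameter.

    # Check of the prototype's illumination budget and exit pupil resolution.
    leds_per_module = 16
    modules_per_element = 16
    elements = 20                      # two ten-element arrays

    leds_per_element = leds_per_module * modules_per_element   # 256
    total_leds = leds_per_element * elements                   # 5120

    # The 1.1 mm LED pitch maps to ~10 mm at the viewer, i.e. the steering
    # optics magnify the source plane by roughly 10 / 1.1, about 9x.
    magnification = 10.0 / 1.1
    print(total_leds, "LEDs, source-to-pupil magnification ~%.1fx" % magnification)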
13.6.3 Prototype Performance

The prototype is extremely useful in determining the performance of this type of display. Although crosstalk is relatively high, the performance of this display is significant, as this is the first time a 3D display operating on this principle has been constructed. The primary problems encountered in the prototype are crosstalk and image banding. Investigation found that the crosstalk was caused by diffraction at the LCD. A certain amount of diffracted light always emerges from an LCD due to the periodic nature of its pixel structure. It was found that diffraction is particularly severe with the NEC LCD used in the prototype, as this has a vertical microstructure in the sub-pixels with a small pitch of fifteen microns. In Fig. 13.20 it can be seen that the first-order component of the diffraction pattern approaches a considerable 20% of the zero-order component.
Fig. 13.20. LCD diffraction plot (relative intensity versus deviation in milliradians)
Fig. 13.21. Exit pupil diffraction (relative intensity versus distance in mm; the lines L and R mark typical eye positions)
The effect of screen diffraction can be seen in Fig. 13.21, where the relative intensity profiles across the viewing field are shown. The continuous line of the graph shows the profile of the output of a single optical array element at a distance of 2.8 metres with no LCD in the light path. Here a beam approximately 100 millimetres wide is formed when 10 LEDs are illuminated. The lines L and R show a typical eye spacing, and the exit pupil intensity without the LCD is shown to fall from maximum to around 1% of this value well within this interocular distance, resulting in little crosstalk. However, with the LCD in place the profile changes dramatically, as shown by the dashed line. With the eyes positioned again at the lines L and R, the crosstalk is now in the region of 15%, a value close to being unacceptable. Although some level of diffraction is inevitable at the LCD, it is possible to reduce it to tolerable limits. For example, the prototype LCD could be rotated through 90°, in which case the effect of diffraction would be reduced due to the lack of a horizontal high spatial frequency sub-pixel component in this orientation, although this would adversely change the aspect ratio of the display. Another option would be to use a monitor-type LCD that has a simple pixel structure but a relatively restricted viewing angle. The most satisfactory solution would be the design of an LCD that is particularly suited to this application. As the original contiguous LCD backlight is effectively replaced by an array of discrete LED illumination sources, the appearance of banding is a potential problem. Variation in intensity and colour between the devices gives rise to the appearance of vertical banding. Here the variation in LED colour was more noticeable than the variation in LED brightness. Even though all of the LEDs used were chosen with a very tight specification in the same CIE chromaticity region, the colour variation could be clearly seen when the screen showed a blank white image. However, when there is an image on the screen, especially if it is moving, the effect becomes much less noticeable and quite acceptable.
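The location of the diffracted orders in Fig. 13.20 follows from the standard grating equation, sin θ = mλ/d. A short check against the fifteen-micron sub-pixel pitch quoted above is given below; the 550 nm wavelength is an assumed mid-visible value.

    import math

    wavelength_um = 0.55   # assumed mid-visible wavelength (microns)
    pitch_um = 15.0        # sub-pixel microstructure pitch quoted in the text

    # Grating equation: sin(theta) = m * lambda / d, for order m.
    for m in (1, 2):
        theta = math.asin(m * wavelength_um / pitch_um)
        print(f"order {m}: {theta*1000:.0f} mrad "
              f"({math.degrees(theta):.1f} deg)")

The first order lands near 37 mrad, consistent with the scale of Fig. 13.20; at the 2.8 metre viewing distance of Fig. 13.21 this corresponds to a displacement of the order of 100 mm, comparable to the eye spacing, which is why diffracted light raises the crosstalk at the other eye's position.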
13.6.4 Multi-modal Operation of the DMU Display

The flexibility of the steering optics and screen arrangement enables the display to operate in more than a simple binocular mode without any alteration to its hardware. As shown, in its simplest form the display presents a single stereo pair, with the left and right image regions steered to the positions of the viewers' left and right eyes. In principle, however, more than two images can be presented on the single screen in one of two ways. The images can be displayed simultaneously on alternating lines, or series of lines, using spatial multiplexing; each additional binocular image pair then requires a doubling of the number of pixel rows on the LCD to maintain the original screen resolution for each image. Alternatively, images can be presented sequentially using temporal multiplexing; each additional binocular image pair then requires a doubling of the LCD refresh rate to maintain the original screen refresh rate for each image. Thus a display presenting motion parallax to four viewers (generating four different binocular image pairs simultaneously) would require eight times the vertical resolution of a standard display, or would need to run at eight times the critical flicker frequency for flicker to be unnoticeable; this scaling is sketched in code after Sect. 13.6.5. Both of these solutions become possible with advancing technology, and such a display could offer different images to different viewers simultaneously from the same single screen, creating a multi-modal 3D image display system.

13.6.5 A Multiple Viewer Autostereoscopic Display

The prototype has provided a path to a viable and usable domestic 3D display solution, as it provides all three of the requirements for a domestic 3D television display: it is autostereoscopic, it allows viewers to move freely about the domestic room and still see 3D, and it can accommodate multiple viewers simultaneously. Although the display is still in prototype form, the principle of operation has been proven, and it is hoped that ongoing work will lead to a commercially viable solution within five years. In addition the display offers the possibility, without design modification, of multi-modal presentation of different images to different viewers once technology has advanced sufficiently to produce the required LCD performance.
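As a back-of-envelope illustration of the multiplexing scaling discussed in Sect. 13.6.4: the sketch below assumes baseline values of 576 rows and a 60 Hz refresh per view, chosen to match the prototype's two 576-line images; these baselines are our assumptions, not a specification from the text.

```python
def display_requirements(viewers, base_rows=576, base_rate_hz=60):
    """Resource scaling for multi-viewer operation. Each binocular pair
    is two views; spatial multiplexing multiplies the LCD pixel rows by
    the number of views, while temporal multiplexing multiplies the
    refresh rate instead."""
    views = 2 * viewers                  # one left and one right view each
    return {"rows_spatial": base_rows * views,
            "rate_temporal_hz": base_rate_hz * views}

print(display_requirements(1))  # {'rows_spatial': 1152, 'rate_temporal_hz': 120}
print(display_requirements(4))  # eight times a single 576-line, 60 Hz image
```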
13.7 Summary

There have been many approaches to producing viable and usable 3D domestic television and display solutions, ranging from historical attempts through to the contemporary work they inform. These technologies range from the basic display of the stereoscopic minimum of two differing images in binocular systems, through to the many-image display of holography. Considering all of
the approaches, three fundamental properties are prerequisites for a viable and usable domestic 3D television display: the system must be autostereoscopic; it must allow viewers to move freely about the domestic room and still see 3D; and it must accommodate multiple viewers simultaneously, in order to be as acceptable as existing 2D systems. These three key elements are summarised as the 'autostereoscopic, mobile, multi-viewer' (AMMV) requirements. There are also desirable 3D-specific requirements: a viable display should provide motion parallax, exhibit no accommodation/convergence rivalry, and provide an image that appears solid rather than transparent or translucent.

Of all the approaches examined to date, only binocular multiple mobile tracked viewer displays and holographic displays are likely to fulfil all requirements of a viable and usable display. Of these, holography may not provide a viable solution for a 3D display in the short term, although it is likely to become practicable eventually as technology advances and will provide a future solution. This leaves binocular tracked viewer systems as the prime candidate for a contemporary solution. The HHI Free2C binocular tracked viewer display described does provide a proven, available and viable 3D display solution; however, the technology is suited only to a single viewer, limiting its domestic use. The DMU binocular tracked viewer display prototype may offer a path to a viable and usable domestic 3D display solution, as it can provide all three of the requirements for a domestic 3D television display. Although the performance of the first prototype display is limited, the principle of operation has been proven, and a second multi-user prototype that addresses all the problems identified is currently under construction. It is anticipated that a head-tracked multi-user display will have the capability to serve up to five viewers (a typical television audience rarely exceeds five) located at up to three metres from the screen. The HHI single-target tracker is being developed for this purpose within the EU-funded MUTED project. The optics of the MUTED display will enable 3D to be observed up to three metres from the screen; if a viewer is beyond this distance, a 2D image will be seen.

Given the current technological barriers to producing an optimal 3D display, the authors consider that head-tracked binocular systems will provide the viable solution for the next generation of domestic autostereoscopic displays, with holographic systems superseding binocular systems as technology advances. It is predicted that within the next 10 years we will see domestic 3D video displays readily available and accepted by the marketplace.
Acknowledgements

This work has been carried out within three European Union-funded projects. The initial display and human factors work was funded by the ATTEST – IST-2001-34396 (Advanced Three-dimensional Television System
Technologies) Framework 5 project. Work has continued since the end of ATTEST under the Framework 6 3DTV Network of Excellence – IST-511568. The latest research is within the MUTED – IST-5-034099 (Multiple-User Three-dimensional Television Display) Framework 6 STREP project.
References

1. J. A. Norling (1953). “The Stereoscopic Art – a Reprint”, Journal SMPTE, Vol. 60, pp. 268–308.
2. P. Bos, T. Haven, and L. Virgin (1988). “High Performance 3D Viewing Systems Using Passive Glasses”, SID 88 Digest, pp. 450–453.
3. T. Bardsley (1995). “The Design and Evaluation of an Autostereoscopic Computer Graphics Display”, thesis submitted to De Montfort University.
4. J. Harrold, D. J. Wilkes, and G. J. Woodgate (2004). “Switchable 2D/3D Display – Solid Phase Liquid Crystal Microlens Array”, Proceedings of the 11th International Display Workshops, Niigata, Japan, pp. 1495–1496.
5. Sharp (2004). http://www.sharp3d.com/technology/howsharp3dworks/
6. A. Schwerdtner and H. Heidrich (1998). “The Dresden 3D Display (D4D)”, SPIE Proceedings, “Stereoscopic Displays and Applications IX”, Vol. 3295, pp. 203–210.
7. J. Eichenlaub (1997). “A Lightweight, Compact 2D/3D Autostereoscopic LCD Backlight for Games, Monitor and Notebook Applications”, Stereoscopic Displays and Virtual Reality Systems IV, Vol. 3012 (SPIE Proceedings), pp. 274–281.
8. D. Trayner and E. Orr (1996). “Autostereoscopic Display Using Holographic Optical Elements”, Stereoscopic Displays and Applications VI, Vol. 2653 (SPIE Proceedings), pp. 65–74.
9. J. Harrold, D. J. Wilkes, and G. J. Woodgate (2004). “Switchable 2D/3D Display – Solid Phase Liquid Crystal Microlens Array”, Proceedings of the 11th International Display Workshops (SID Proceedings IDW ’04), pp. 1495–1496.
10. A. Schwartz (1985). “Head Tracking Stereoscopic Display”, Proceedings of IEEE International Display Research Conference, pp. 141–144.
11. S. Ichinose, N. Tetsutani, and M. Ishibashi (1989). “Full-color Stereoscopic Video Pickup and Display Technique Without Special Glasses”, Proceedings of the SID, Vol. 3014, pp. 319–323.
12. G. J. Woodgate, D. Ezra, J. Harrold, N. S. Holliman, G. R. Jones, and R. R. Moseley (1997). “Observer Tracking Autostereoscopic 3D Display Systems”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems IV”, Vol. 3012, pp. 187–198.
13. J. Eichenlaub and J. Hutchins (1995). “Autostereoscopic Projection Displays”, Stereoscopic Displays and Virtual Reality Systems II, Vol. 2409 (SPIE Proceedings), pp. 48–54.
14. H. Heidrich, A. Schwerdtner, A. Glatte, and H. Mix (2000). “Eye Position Detection System”, Stereoscopic Displays and Virtual Reality Systems VII, Vol. 3957 (SPIE Proceedings), pp. 192–197.
15. D. Trayner and E. Orr (1997). “Developments in Autostereoscopic Displays Using Holographic Optical Elements”, Stereoscopic Displays and Virtual Reality Systems IV, Vol. 3012 (SPIE Proceedings), pp. 167–174.
16. T. Okoshi (1976). “Three Dimensional Imaging Techniques”, New York: Academic Press, pp. 129–142.
17. N. Tetsutani, S. Ichinose, and M. Ishibashi (1989). “3D-TV Projection Display System with Head Tracking”, Japan Display ’89, pp. 56–59.
18. P. Harman (1996). “Autostereoscopic Display System”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems IV”, Vol. 2653, pp. 56–64.
19. P. Harman (2000). “Autostereoscopic Teleconferencing System”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems VII”, Vol. 3957, pp. 293–301.
20. N. Tetsutani, K. Omura, and F. Kishino (1994). “A Study on a Stereoscopic Display System Employing Eye-position Tracking for Multi-viewers”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems”, Vol. 2177, pp. 135–142.
21. K. Omura, N. Tetsutani, and F. Kishino (1994). “Lenticular Stereoscopic Display System with Eye-Position Tracking and Without Special-Equipment Needs”, SID 94 Digest, pp. 187–190.
22. H. Imai, M. Imai, O. Yukio, and K. Kubota (1996). “Eye-position Tracking Stereoscopic Display Using Image Shifting Optics”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems IV”, Vol. 2653, pp. 49–55.
23. A. Arimoto, T. Ooshima, T. Tani, and Y. Kaneko (1998). “Wide Viewing Area Glassless Stereoscopic Display Using Multiple Projectors”, SPIE Proceedings, “Stereoscopic Displays and Applications IX”, Vol. 3295, pp. 186–191.
24. D. Ezra, G. J. Woodgate, B. A. Omar, N. S. Holliman, J. Harrold, and L. S. Shapiro (1995). “New Autostereoscopic Display System”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems II”, Vol. 2409, pp. 31–40.
25. S. A. Benton, T. E. Slowe, A. B. Kropp, and S. L. Smith (1999). “Micropolarizer-based Multiple-viewer Autostereoscopic Display”, SPIE Proceedings, “Stereoscopic Displays and Applications X”, Vol. 3639, pp. 76–83.
26. T. Hattori (2000a). “Sea Phone 3D Display”, http://home.att.net/~SeaPhone/3display.htm, pp. 1–6.
27. Y. Nishida, T. Hattori, S. Sakuma, K. Katayama, S. Omori, and T. Fukuyo (1994). “Stereoscopic Liquid Crystal Display II (Practical Application)”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems”, Vol. 2177, pp. 150–155.
28. T. Hattori, S. Sakuma, K. Katayama, S. Omori, M. Hayashi, and Y. Midori (1994). “Stereoscopic Liquid Crystal Display 1 (General Description)”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems”, Vol. 2177, pp. 143–145.
29. T. Hattori (2000b). “Sea Phone 3D Display”, http://home.att.net/~SeaPhone/3display.htm, pp. 7–9.
30. J.-Y. Son, S. A. Shestak, S.-S. Kim, and Y. J. Choi (2001). “A Desktop Autostereoscopic Display with Head Tracking Capability”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems VIII”, Vol. 4297, pp. 160–164.
31. HHI (2002). 3-D Display, http://atwww.hhi.de/~blick/3-D_Display/3-d_display.html
32. P. Surman (2002). “Head Tracking Two-Image 3D Television Displays”, Ph.D. thesis, De Montfort University, Leicester, UK, http://eigg.res.cse.dmu.ac.uk/publications/surman_phd.html
33. P. Surman, I. Sexton, R. Bates, K. C. Yow, and W. K. Lee (2005). “The Construction and Performance of a Multiviewer 3D Television Display”, Journal of the SID, Vol. 13, Issue 4, pp. 329–334.
34. P. A. Howarth (1996). “Empirical Studies of Accommodation, Convergence and HMD Use”, Hoso-Bunka Foundation Symposium, Tokyo.
35. N. Spotiswoode and R. Spotiswoode (1953). “Stereoscopic Transmission”, University of California Press, pp. 13 and 19–22.
36. A. R. L. Travis and S. R. Lang (1991). “The Design and Evaluation of a CRT-based Autostereoscopic Display”, Proceedings of the SID, Vol. 32/4, pp. 279–283.
37. N. A. Dodgson, J. R. Moore, S. R. Lang, G. Martin, and P. Canepa (2000). “A 50” Time-multiplexed Autostereoscopic Display”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems VII”, Vol. 3957, pp. 177–183.
38. S. T. de Zwart, W. L. IJzerman, T. Dekker, and W. A. M. Wolter (2004). “A 20-in. Switchable Auto-Stereoscopic 2D/3D Display”, Proceedings of the 11th International Display Workshops, Niigata, Japan, pp. 1459–1460.
39. Sanyo (2004). “Step Barrier System Multi-view 3-D Display Without Special Glasses”, advertising material published by Sanyo.
40. Y. Kajiki (1997). “Hologram-Like Video Images by 45-View Stereoscopic Display”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems IV”, Vol. 3012, pp. 154–166.
41. P. St Hilaire (1995). “Modulation Transfer Function of Holographic Stereograms”, originally published in “Applications of Optical Holography”, SPIE Proceedings, 1995.
42. S. Pastoor (1992). “Human Factors of 3DTV: An Overview of Current Research at Heinrich-Hertz-Institut Berlin”, IEE Colloquium “Stereoscopic Television”, Digest No. 1992/173, p. 11/3.
43. M. McCormick, N. Davies, and E. G. Chowanietz (1992). “Restricted Parallax Images for 3D TV”, IEE Colloquium “Stereoscopic Television”, Digest No. 1992/173, pp. 3/1–3/4.
44. H. B. Tilton (1985). “Large-CRT Holoform Display”, Proceedings of IEEE International Display Research Conference, pp. 145–146.
45. I. Sexton (1992). “Parallax Barrier Display Systems”, IEE Colloquium “Stereoscopic Television”, Digest No. 1992/173, pp. 5/1–5/5.
46. K. Perlin, C. Poultney, J. S. Kollin, D. T. Kristjansson, and S. Paxia (2001). “Recent Advances in the NYU Autostereoscopic Display”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems VIII”, Vol. 4297, pp. 196–203.
47. T. Baloch (2001). “Method and Apparatus for Displaying Three-dimensional Images”, United States Patent No. 6,201,565 B1.
48. C. Moller and A. R. L. Travis (2004). “Flat Panel Time Multiplexed Autostereoscopic Display Using an Optical Wedge Waveguide”, Proceedings of the 11th International Display Workshops, Niigata, Japan, pp. 1443–1446.
49. B. Blundell and A. Schwartz (2000). “Volumetric Three Dimensional Display Systems” (Wiley-Interscience), pp. 316–324.
50. A. C. Traub (1967). “Stereoscopic Display Using Rapid Varifocal Mirror Oscillations”, Applied Optics, June 1967, Vol. 6, No. 6, pp. 1085–1087.
51. S. Mckay, G. M. Mair, S. Mason, and K. Revie (2000). “Membrane-mirror-based Autostereoscopic Display for Teleoperation and Telepresence Applications”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems VII”, Vol. 3957, pp. 198–207.
52. J. Fajans (1992). “Xyzscope – A New Option in 3-D Display Technology”, SPIE Proceedings, “Visual Data Interpretation”, Vol. 1668, pp. 25–26.
53. A. Andreev, Y. Bobylev, S. Gonchukov, I. Kompanets, Y. Lazarev, E. Pozhidaev, and V. Shoshin (2004). “Experimental Model of Volumetric 3-D Display Based on 2-D Optical Deflector and Fast FLC Light Modulators”, SID Proceedings, “Advanced Display Technologies 2004”, pp. 279–283.
54. G. E. Favalora, R. K. Dorval, D. M. Hall, M. Giovinco, and J. Napoli (2001). “Volumetric Three-dimensional Display System with Rasterization Hardware”, Stereoscopic Displays and Virtual Reality Systems VIII, Vol. 4297 (SPIE Proceedings), pp. 227–235.
55. K. Miyamoto, Y. Sakamoto, T. Yamaguchi, and I. Fukuda (2004). “A Wide-Field-of-View Multi-color 3D Display”, Proceedings of the 11th International Display Workshops (SID Proceedings IDW ’04), p. 1499.
56. T. Yendo, T. Kajiki, T. Honda, and M. Sato (2000). “Cylindrical 3-D Video Display Observable from All Directions”, Proceedings of Pacific Graphics 2000.
57. D. L. Macfarlane and G. R. Schultz (1994). “A Voxel Based Spatial Display”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems”, Vol. 2177, pp. 196–202.
58. I. Kompanets and S. A. Gonchukov (2004). “Volumetric 3D Liquid Crystal Displays”, Information Display, Vol. 20, No. 5 (SID Publications), pp. 24–26.
59. M. Lucente and T. A. Galyean (1995). “Rendering Interactive Holographic Images”, Computer Graphics Proceedings, Annual Conference Series, 1995, pp. 387–393.
60. M. Lucente, S. A. Benton, and P. St.-Hilaire (1994). “Electronic Holography: The Newest”, International Symposium on 3-D Imaging and Holography, Osaka, Japan, November 1994.
61. M. Lucente (1997). “Interactive Three-dimensional Displays: Seeing the Future in Depth”, SIGGRAPH “Computer Graphics – Current, New and Emerging Display Systems”.
62. Y. Kajiki, H. Yoshikawa, and T. Honda (1996). “Three-Dimensional Display with Focused Light Array”, SPIE Proceedings, “Practical Holography X”, Vol. 2652, pp. 106–116.
63. J.-Y. Son, S. A. Shestak, S.-K. Lee, and H.-W. Jeon (1996). “Pulsed Laser Holographic Video”, SPIE Proceedings, “Practical Holography X”, Vol. 2652, pp. 24–28.
64. T. Honda (1995). “Dynamic Holographic 3D Using LCD”, Asia Display ’95, pp. 777–780.
65. K. Maeno, N. Fukaya, O. Nishikawa, K. Sato, and T. Honda (1996). “Electroholographic Display Using 15-megapixel LCD”, SPIE Proceedings, “Practical Holography X”, Vol. 2652, pp. 15–23.
66. N. Hashimoto, S. Morokawa, and K. Kitamura (1991). “Real Time Holography Using the High-resolution LCTV-SLM”, SPIE Proceedings, “Practical Holography V”, Vol. 1461, pp. 291–302.
67. M. Stanley, P. B. Conway, S. Coomber, J. C. Jones, D. C. Scattergood, C. W. Slinger, B. W. Bannister, C. V. Brown, W. A. Crossland, and A. R. L. Travis (2000). “A Novel Electro-optic Modulator System for the Production of Dynamic Images from Gigapixel Computer Generated Holograms”, SPIE Proceedings, “Practical Holography XIV and Holographic Materials VI”, Vol. 3956, pp. 13–22.
68. J. M. Younse (1993). “Mirrors on a Chip”, IEEE Spectrum, November 1993, pp. 27–31.
69. R. J. Gove (1994). “DMD Display Systems: The Impact of an All-digital Display”, Texas Instruments publication – reprint from SID 1994 International Symposium, Seminar, Exhibition, San Jose, California, June 12–17, 1994.
70. H.-W. Jeon, A. R. L. Travis, T. D. Collings, T. D. Wilkinson, and Y. Frauel (2000). “Image Tiling System Using Optically Addressed Spatial Light Modulator for High-resolution and Multiview 3-D Display”, SPIE Proceedings, “Stereoscopic Displays and Virtual Reality Systems VII”, Vol. 3957, pp. 165–176.
14 An Immaterial Pseudo-3D Display with 3D Interaction

Stephen DiVerdi¹, Alex Olwal², Ismo Rakkolainen³,⁴ and Tobias Höllerer¹

¹ University of California at Santa Barbara, Santa Barbara, CA 93106, USA
² KTH, 100 44 Stockholm, Sweden
³ FogScreen Inc., 00180 Helsinki, Finland
⁴ Tampere University of Technology, 33101 Tampere, Finland
14.1 Introduction

Many techniques have been developed to create the impression of a 3D image floating in mid-air. These technologies all attempt to artificially recreate the depth cues we naturally perceive when viewing a real 3D object. For example, stereoscopic imaging simulates binocular disparity cues by presenting slightly different images of the same scene to the left and right eyes, which the brain interprets as a single 3D image. Virtual reality applications tend to track the user's head and render different views of the 3D object depending on where the user is in relation to the object, to simulate motion parallax. 3D applications in general simulate realistic imagery with perspective and complex shading algorithms to create the impression that the virtual object is seamlessly integrated with the 3D scene. We have applied these simulated depth cues to a novel immaterial display technology [1], creating an engaging new way to view 3D imagery. In addition, multiple unencumbered users can naturally manipulate the objects floating in mid-air. This chapter summarizes our work on the interactive, immaterial pseudo-3D display and its implications for the perception of 3D content.

Our novel walk-through pseudo-3D display and interaction system is based on the patented FogScreen, an "immaterial" indoor 2D projection screen [2, 3, 4], which enables high-quality projected images in free space. We have extended the basic 2D FogScreen setup in three major ways. First, we use head-tracking to provide motion parallax and correct perspective rendering for a single user. Second, we support multiple types of stereo vision technology for binocular disparity cues. Third, we take advantage of the two-sided nature of the FogScreen to render the front and back views of the 3D content on its two sides, so that the user can cross the screen to see the content from the back.

Commonly, users want to interact with the displayed objects, not just view them. The ability for an observer to walk completely through the FogScreen
makes it appropriate for enabling new viewing and interaction possibilities. While two traditional displays mounted back-to-back could present a similar pseudo-3D display to multiple viewers, the opaque displays would prohibit viewers from effectively collaborating across the two displays by obscuring users' views of each other, blocking or distorting speech, and making it difficult to pass physical objects across the display. This permeability of the FogScreen is important not only for imaging and visualization, but also as an additional perceptual cue of the virtual 3D content's integration into the physical environment.

Our interactive, dual-sided, wall-sized system allows a single user to view and manipulate objects floating in mid-air from any angle, and to reach and walk through them. Two individual, but coordinated, images are projected onto opposite sides of a thin film of dry fog, and an integrated 3D tracking system allows users on both sides to interact with the content, while the non-intrusive and immaterial display makes it possible to freely pass physical objects between users or move through the shared workspace. Our system opens up possibilities for a wide range of collaborative applications where face-to-face interaction and maximum use of screen real estate are desirable, as well as the maintenance of individual views for different users.

We first discuss related work in Sect. 14.2. The basic FogScreen is described in Sect. 14.3, and its pseudo-3D extension is explained in Sect. 14.4. Section 14.5 presents the results and evaluation of the pseudo-3D display; Sect. 14.6 discusses the interaction and input technologies; and Sect. 14.7 describes our interactive demonstration applications. Finally, we present future work and conclusions in Sects. 14.8 and 14.9, respectively.
14.2 Advanced Displays

A fundamental goal of all 3D displays is to create an illusion of depth, such that the user perceives a full 3D scene that seems to float in mid-air. This illusion can be generated in a variety of ways by artificially recreating the effects of depth cues from natural viewing. We briefly discuss the variety of such displays here.

Stereoscopic displays [5] provide slightly different images for the left and right eye, creating the appearance of 3D objects that float in front of or behind the screen. The viewing area for correct perspective is restricted, and user-worn glasses are required. Autostereoscopic displays [6] require no special glasses for stereoscopic viewing, but the correct viewing area and resolution are typically somewhat limited. The viewer's 3D position can be tracked, allowing the rendered images to be modified according to the user's perspective; this expands the viewing area and enables the user to experience parallax through head motion.

Traditional augmented [7] and virtual reality systems often use head-worn, tracked displays [8] which draw virtual images directly in front of the user's eyes. These setups typically provide only a private image which cannot be seen without cumbersome user-worn equipment. Artifacts such as misregistration and lag
detract from the sense of presence and may cause eye-strain, headache, and other discomforts.

Volumetric displays create a 3D image within a volume. The objects can be viewed from arbitrary viewpoints with proper eye accommodation. Unfortunately, existing displays create their 3D imagery in a fairly small enclosed volume that the viewer cannot enter. They also have problems with image transparency, where parts of an image that are normally occluded are seen through a foreground object.

Many research projects investigate large displays with user tracking as interactive surfaces. It has proven advantageous to use a screen material that supports rear-projection and tracking from behind the display, so that occlusion can be minimized. The HoloWall [9], MetaDESK [10] and Perceptive Workbench [11] use a diffusion screen for rear-projection, while IR illumination enables IR cameras to track objects near the surface of the screen. Projection screens like the dnp Holo Screen [12] and HoloClear [13] consist of a transparent acrylic plate coated with a holographic film, such that it only diffuses light projected from a 30–35° angle. These transparent displays show only projected objects, are single-sided and are not penetrable. Touchlight [14] uses such a screen and allows the gestures of users to be tracked through the screen. Hirakawa and Koike [15] combine user tracking with a transparent 2D screen for a projection-based optical see-through AR system, whereas ASTOR [16] achieves autostereoscopic AR with 3D imagery using a holographic optical element. A serious limitation of these setups, however, is their inherent single-sidedness, which limits collaboration.

There have been several displays using water, smoke or fog, an early example being the Ornamental Fountain of the late 19th century [17]. More recently, water-screen shows such as Water Dome [18], Aquatique Show [19] and Disney's Fantasmic [20] spray sheets of freely flowing or high-velocity water to create impressive displays for large audiences. The magnitude and wetness of these screens, as well as their large water consumption, make them impractical for indoor applications, and preclude viewers from comfortably passing through them or seeing clear images from short distances. Many types of fog projection systems [21, 22] have been used for art and entertainment purposes, but the rapid dispersion of the fog seriously limits the fidelity of projected images. The dispersion is caused by turbulence and friction in the fog flow, which disrupts the desired smooth planar surface, causing projected points of light to streak into lines. This streaking causes severe distortion of the image at off-axis viewing angles. The Heliodisplay [23] is a medium-sized (22"–42" diagonal) immaterial rear-projection display. It harvests humidity from the air by condensing it into water, which is then broken into fog. However, considering its single-sidedness, smaller format (compared to the 100 inches of the FogScreen) and tabletop setup, it is not a suitable basis for the kind of walk-through human-scale interactive display we are pursuing.
14.3 The "Immaterial" FogScreen

To achieve an immaterial display, we base our system on the FogScreen [2, 3, 4], which uses fog as a projection surface to create an image that floats in mid-air (see Fig. 14.1). If people walk through the FogScreen, the image instantly re-forms behind them. It allows projected interactive content, such as images or videos, to appear floating in free space. It also enables the creation of special effects like walking through a brick wall or writing fiery characters in thin air. FogScreens are currently used for special effects at various high-profile venues, events and trade shows. Entertainment is one major application area, including the performing arts [24], but the screens are increasingly used for other applications as well.

The FogScreen employs a patented method for forming a physically penetrable 2D particle display. The basic principle (see Fig. 14.2a) is the use of a large non-turbulent airflow to protect a flow of dry fog particles inside it from turbulence. The outer airflow may become slightly turbulent, but the inner fog layer remains flat and smooth, enabling high-quality projections. Ordinary tap water is broken into fine fog droplets and trapped inside this non-turbulent airflow. The resulting stable sheet of fog enables projection on a screen that is dry and feels like slightly cool air. The light from a standard projector is scattered by this sheet of fog, creating a rear-projection image.

The FogScreen works much like an ordinary screen in terms of projection properties. Light from a projector is scattered by the fog, creating an image that floats in mid-air. However, not all the light is scattered, so a bright projector is needed. A 5000 ANSI lumen projector is usually sufficient for lit environments such as trade shows, if the background is dark; in dark rooms dimmer projectors will suffice. The image can be viewed from most off-axis viewing directions, although an on-axis viewing direction towards the projector yields an optimal image. The
Fig. 14.1. The FogScreen can create fully opaque or very translucent high-quality images in mid-air. It can provide high visual detail
Fig. 14.2. left to right: (a) The principle of the FogScreen. (b) As the FogScreen image plane (grey cross-section) is not infinitesimally thin, the pixels may mix with neighboring ones when viewed or projected at steep angles
field-of-view is currently up to 120°, depending on flow quality and environmental issues such as air flow and the characteristics of the projected imagery. When viewed from a very steep angle, the thickness of the fog causes adjacent pixels to blur into one another, reducing image quality from the sides (see Fig. 14.2b).

The FogScreen characteristics require some consideration in order to design the best possible content and take advantage of the screen as a novel media space [24]. Projectors with SVGA resolution (800×600 pixels) are adequate due to the currently limited fidelity of the FogScreen. Tiny details like small text may be hard to view and use, so it is recommended to design large buttons and objects. The effective image resolution is optimal towards the top of the screen and deteriorates somewhat with increasing distance from the fog dispenser. Most natural imagery looks good on the screen, and color and contrast are vividly preserved.

We conducted a test on the effect of projection angle with a NEC WT-610 [25] ultra-short-throw projector, which can create a 100" diagonal image from as close as 25 inches. The projection angle over the screen area then varies between 40–70°. Because of the thickness of the fog, the images become quite blurry compared to conventional projector setups; only very large, uniform objects and text remained identifiable and legible. For the remainder of this work we used ordinary projectors.

The opacity of the FogScreen depends on many parameters, such as fog density, projector and image brightness, and the background of the viewing area. Depending on what imagery is projected onto the FogScreen, a variety of different effects can be achieved. When showing a normal photograph it acts much like a traditional projector screen. However, if the image is mostly black except for a few objects, the black areas appear transparent and create the effect of virtual objects floating in space in front of the user. If a full-screen image of a virtual environment is displayed without text or abstract imagery, it creates more of a portal effect, giving the impression of looking through a window into another world.
14.4 3D and Pseudo-3D Display Technologies

While fundamentally a 2D display technology, the basic FogScreen can be extended into a pseudo-3D display via dual-sided rendering, head-tracked rendering, and stereoscopic imaging. The screen affords a 100" diagonal image in the center of a large open viewing area that is limited only by the available space and the coverage of a 3D position tracker (see Fig. 14.3). By tracking a single viewer's head, using correlated projectors on each side and adjusting the projected 3D graphics rendering accordingly, we create a pseudo-3D display. This makes the 3D effect more convincing by showing the 3D object from the appropriate angle. The viewer can see objects floating in mid-air from both sides and freely walk around and through them to examine the scene from almost any angle. The eye cannot correctly focus on a real 3D point within the image, but an impression of depth is still achieved due to other monocular cues, most notably motion parallax. Stereoscopic imaging techniques can also be used with the FogScreen. These techniques were all informally evaluated by six different viewers to get an idea of 3D perception performance.

14.4.1 Stereoscopic Projection

We experimented with a variety of passive and active stereoscopic rendering techniques on our display.
Fig. 14.3. Our dual-sided prototype system setup consists of a FogScreen with a 100” diagonal screen, created by two SVGA projectors mounted 3 m above the floor. Interactivity is added via a whiteboard tracker or a laser range finder for 2D position tracking, and/or via four infrared cameras around the work-space for 3D position tracking
Passive stereoscopy with linear polarized glasses and filters [1] is possible, as the thin fog layer accurately preserves the polarization of rear-projected light. We used standard polarization filters and glasses for our experiments. Crosstalk between the left and right images is comparable with that resulting from the use of a standard silvered screen. Polarization requires two projectors, which raises the setup cost. The computer must also be able to drive two separate projectors for a single-sided display, so four different views must be rendered for dual-sided polarized stereo.

Passive stereoscopy with red-cyan colored glasses [1] also worked well, since the FogScreen maintains proper image colors. Red-cyan stereoscopy requires only a single projector, making the system less expensive and complex than polarized stereoscopy, but the effect is limited to monochromatic imagery. Since the FogScreen preserves image colors, the Infitec [26] passive stereo system could also be used, but so far we have not had one available for testing.

The last passive stereoscopy technique we tried was ChromaDepth [27], which color-codes an image by depth value, with red pixels nearest to the camera, followed by orange, yellow, green and finally blue in the background (see Fig. 14.7). A pair of diffraction-grating glasses shifts colors so that red areas appear near the user, while blue appears far away. The main advantage of this technique is that users not wearing the special glasses still see a single coherent image, instead of two superimposed views as with red-cyan or polarized stereo. The tradeoff is that ChromaDepth is more of an ad hoc technique for creating binocular depth cues and does not actually simulate the eye separation and focal length of the user's visual system, resulting in less effective 3D perception.

For active stereoscopy, we used a DepthQ 3D projector [28] with shutter glasses. While this projector model is quite affordable and may serve as an example of the ongoing reduction in costs for active stereo systems, this was still overall the most expensive option we explored, and the resulting image brightness was lower than that of cheaper passive stereo solutions. Initial results indicate that the quickly changing turbulence pattern of the fog surface causes a subtle difference between the left and right images of an active stereo projection, disrupting accurate separation and making it slightly more difficult than on a standard silvered screen to see a clear stereoscopic image. This problem will be partially solved as screen quality improves in the future.

14.4.2 3D Head Tracking

Motion parallax, achieved by tracking the user's head position, is a strong monoscopic depth cue. Almost any tracking technology suitable for virtual or augmented reality work could be used with our system. For this work, we employed a WorldViz Precision Position Tracker (PPT) wide-area 3D optical tracker [29] for head tracking. Our 3D tracking is explained in more detail in Sect. 14.6.2.
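To make the head-tracked rendering concrete, the sketch below shows the standard asymmetric ("off-axis") frustum construction that tracked-viewer displays of this kind typically rely on. It is our illustration of the principle, not code from the system described here, and the example dimensions are arbitrary:

```python
def off_axis_frustum(eye, screen_w, screen_h, near, far):
    """Asymmetric view frustum for a tracked viewer and a screen lying
    in the z = 0 plane, centred on the origin (eye z must be positive).
    Returns (left, right, bottom, top, near, far) in the convention of
    OpenGL's glFrustum."""
    ex, ey, ez = eye
    s = near / ez                        # project screen edges to near plane
    return ((-screen_w / 2 - ex) * s,    # left
            ( screen_w / 2 - ex) * s,    # right
            (-screen_h / 2 - ey) * s,    # bottom
            ( screen_h / 2 - ey) * s,    # top
            near, far)

# Example: a viewer 2 m from a 2.0 x 1.5 m screen, 0.3 m right of centre.
# The scene itself is rendered in eye coordinates (world translated by -eye),
# so the image shifts correctly as the tracked head moves.
print(off_axis_frustum((0.3, 0.0, 2.0), 2.0, 1.5, 0.1, 10.0))
```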
Fig. 14.4. left to right: (a) Prototype headset with an IR LED for viewer’s 3D tracking, and a hand-held IR pointer for 3D interaction. (b) Custom-made 2×3 cm miniature version of the PPT marker
We use an active LED marker on a headset for the head's 3D position tracking (see Fig. 14.4a). The marker could also be custom-made into a miniature version for stereoscopic glasses; Fig. 14.4b shows the current results of our miniaturization efforts. Our system works correctly for a single viewer, similar to virtual rooms and immersive workbenches [30].

14.4.3 Dual-sided Projection

To accentuate the sensation that these virtual objects actually exist in the physical world, the dual-sided capabilities of the FogScreen are used to show both the front and back of the objects, so that viewing the scene from opposite sides presents a consistent perception. Very little of the projected light actually reflects from the fog layer back towards the projecting source. Therefore, the image is predominantly visible to a viewer on the opposite side of a projector (viewing a rear-projected image); a front-projected image is extremely faint. This feature enables us to simultaneously project different images on the two sides of the FogScreen, with the back-projected image completely dominating the view. The faint front-projection image means that there will be slight ghosting in high-contrast regions, but, in our experience, the crosstalk is acceptable, and in fact negligible except where very bright imagery on one side coincides with very dark regions on the other. In cases where this cannot be avoided, dynamic photometric correction between the front and back projectors based on screen content could alleviate the effect.

More interestingly, two coordinated views of a 3D object can be shown, one on each side of the screen. For example, an application that displays a 3D object, such as a modeling and animation package, could show both sides of the object on the two sides of the FogScreen, creating a more convincing sense of presence of the virtual object in the physical environment. Figure 14.5 illustrates the dual-sided screen with the example of a cartoon shark seen from the front and back.
Fig. 14.5. The FogScreen allows two independent images to be projected on each side of the screen, such that opposite sides of a 3D object can be rendered on the screen for a pseudo-3D effect. These photographs of a static two-sided scene illustrate how the back-projected image completely overshadows the simultaneous front-projection (which finds its way through the screen to the ground and back wall behind it)
Our system is also well suited for multiple users. Different users who wish to view the 3D scene do not even need to be situated on the same side of the screen, as is the case with traditional display technologies. With conventional displays, users must crowd inside a small viewing area. A typical tabletop system allows users to spread out around the display, but each user sees the data from a different orientation, some upside-down, hindering collaboration. Dual-sided rendering on the FogScreen allows the same layout to be presented on both sides, but with text and images properly oriented for viewers, who can spread out on either side of the large display.
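A minimal sketch of the dual-sided viewpoint setup described above; this is our illustrative construction, assuming a screen in the z = 0 plane, not the authors' code:

```python
def dual_sided_eyes(front_eye):
    """Viewpoints for the two coordinated images: the back-side image is
    rendered from the front viewpoint reflected through the screen plane,
    so a viewer walking around the screen sees the object's front and
    back consistently."""
    ex, ey, ez = front_eye
    back_eye = (ex, ey, -ez)     # reflection through the screen plane
    return front_eye, back_eye

front, back = dual_sided_eyes((0.3, 1.6, 2.5))   # example position, metres
print("front projector renders from", front)
print("back projector renders from", back)

# Each viewpoint feeds its projector's off-axis projection; the back-side
# pass would additionally be mirrored horizontally so that text and
# layout read correctly from behind the screen.
```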
14.5 Pseudo-3D Experiments

While the image quality is not perfect at off-axis viewing angles, the system works reliably and produces an appealing and intriguing human-scale reach- and walk-through pseudo-3D display. The spectator can view floating objects from the front and back and freely walk around and through them and, with head-tracking, see the scene from any angle (see Figs. 14.6 and 14.7). As the projection plane is 2D, the eye does not accommodate to the correct distance, but even without stereoscopic effects the 3D nature of objects is emphasized by the impression of floating in free space. The 3D objects look fairly natural when viewed on-axis. As the viewing direction moves to the side, the image starts to degrade, finally becoming unusable when viewed more than about 60° off-axis.

Users who tried the head-tracking system took an initial period to become accustomed to the interaction, but found that the effect was convincing in
Fig. 14.6. A gray, non-textured teapot on the pseudo-3D FogScreen, as seen from the viewer's position
making it seem as though a 3D object floated in the space of the screen. We had similarly encouraging results when we demonstrated the dual-sided rendering – users would often naturally walk around or through the screen to see the full scene and they could easily get a better idea of the entire 3D volume without the need for interaction devices. To test the stereoscopic techniques, we showed our users a number of 3D images, including stereo photographs, random dot stereograms and rendered 3D images of simple geometric objects [31]. Overall, we found that the FogScreen is suitable for stereoscopic rendering, with an impressive 3D effect. The stereoscopic effect was dominant, and it became difficult to estimate where the screen plane lies. In particular, polarized stereo provided the clearest 3D perception effect, based on qualitative assessments of many users for a variety of 3D scenes. The downside to the required two projectors is that the stereo effect demands careful calibration of the two images. The different projection angles for corresponding pixels also created some minor loss of edge sharpness at depth discontinuities, due to pixel smearing. Red-cyan stereo also performed well,
Fig. 14.7. The pseudo-3D FogScreen, displaying the Stanford Bunny [31] in mid-air, here with ChromaDepth [27] stereoscopic imaging
but it inherently cannot reproduce colors as vivid as polarization. ChromaDepth had the weakest 3D effect, but this is not surprising, as it gives a poorer impression when viewed on a regular display as well. Active frame-sequential stereo did not perform quite as well as we had hoped, as bright light from the projector easily disrupted the infrared stereo sync. This can be alleviated by using a more powerful IR emitter, and avoided in most cases by placing the projector high enough that the FogScreen device occludes it, except in the near proximity of the screen. If infrared signal transmission is not an option (e.g. because of interference with a chosen tracking or interaction technology), wired sync solutions can be employed. When the sync signal was present, the stereo effect was quite perceptible and comparable with the passive stereo methods, but the passive solutions were also more cost-effective.

In general, users were able to see stereo imagery on the FogScreen reliably, but some types of imagery required more effort on the user's part than others. In particular, with random dot stereograms it generally took users some time for the 3D scene to become visible, while regular 3D geometry was easy and instantaneous to perceive. While the FogScreen can be viewed over a wide field-of-view for 2D imagery, effective stereo correspondence was experimentally determined to be limited to a 15–20° viewing angle.

We also compared the stereo performance of the FogScreen with a traditional silvered screen and found that the FogScreen creates a more pronounced sense of depth, whereas traditional screens naturally reproduce higher-resolution images due to their more precise nature. We measured the pronounced depth effect by having users estimate the extent along the viewing direction of the same geometry on the FogScreen and the regular screen, and found that users consistently estimated the same objects as roughly 50% longer when displayed on the FogScreen. Our theory is that this effect is rooted in the lack of a reference plane: objects on the FogScreen appear to float in mid-air, whereas objects on the regular screen are perceived as being anchored in front of the screen plane. Also, the projection "cones" emanating from the projectors, visualized by participating media beyond the transparent screen (clearly visible in Figs. 14.6 and 14.7), may have contributed to this exaggerated sense of depth.
14.6 Interaction Technologies

The immaterial nature of the display is important for enabling new interaction possibilities. While two traditional displays mounted back-to-back could present a similar display to multiple users, the opaque displays would prohibit users from effectively collaborating across the two displays by obscuring users' views of each other, distorting speech, and making it difficult to pass physical objects across the display. This permeability of the FogScreen is important not only for collaboration but also for the potential for direct interaction with the
3D scene. To enable interaction with our immaterial display, we investigated a number of tracking systems and input devices.

The ability for a user to walk completely through the FogScreen makes it appropriate as a portal in virtual or mixed reality environments. A CAVE™ with a FogScreen as a wall, for instance, would allow the user to easily enter the virtual environment from the outside. The sense of real immersion would still be achieved, since the user would be completely surrounded by displays. Furthermore, before entering the environment, the FogScreen portal could present an interface to the outside world that would allow a user to set parameters of the environment he or she was about to enter.

Collaboration among multiple people within a single application can be greatly enhanced by a dual-sided display. Multiple users who wish to cooperatively use an application with traditional display technologies must all stand in front of the same display, limiting the number of people who can effectively participate. Immersive workbenches and tabletop systems [30, 32] allow more users to share a workspace, but they do not all view the data from the same side: while a user on one side of the table may see text correctly oriented, a user on the other side will be unable to read it, as it will be upside-down. With our dual-sided display, the same interface and layout can be presented on both sides of the screen, but text and images can be properly oriented so all users can actively participate. Collaboration is also encouraged through the display's inherent support for face-to-face interaction and eye contact. Four people interacting with a shared scene, for example, can interact more effectively in groups of two on either side of the screen, seeing and reaching through to each other, than in one group of four in front of a one-sided display or tabletop system.

Interaction can also be an important component of perception: as the goal of the pseudo-3D FogScreen is to create the sensation that virtual objects exist in the physical space around the user, the ability to seamlessly interact with those objects reinforces their integration. 3D perception is aided by interaction through directly mapping the user's 3D gestures to virtual objects, creating a proprioceptive depth cue.

14.6.1 2D Tracking

The FogScreen appears to intrigue people as a passive, immaterial walk-through screen, and the natural inclination upon first seeing it is often to attempt to play with the virtual objects on the screen by touching them. By turning it into an interactive 2D computer touch screen, the application possibilities for the screen broaden significantly.

14.6.1.1 Ultrasound Tracking

One of our tracking solutions consists of a low-cost, off-the-shelf whiteboard tracker (eBeam Interactive). The device tracks an ultrasound emitter in a 2D plane using ultrasound sensors that are attached in one corner of the screen
(see Fig. 14.3). This allows one ultrasound-emitting pointer to be tracked as long as line-of-sight is maintained. We had to make some minor modifications to the emitting pointer to remove the need to push the wand against a solid screen. The device has trouble capturing ultrasound across larger surfaces, such as our 100" diagonal display, since the sensors are located on one side only. The previous eBeam System 1 model worked reliably, as its sensors sit on both sides of the screen and reasonable tracking accuracy is provided (typically ±2 cm on a 2 m wide screen). The fog flow and the ultrasound emissions of the device itself have no noticeable effect on the tracking. The accuracy is adequate for most entertainment and business applications, save very detailed, high-precision work. In addition to the spatial inaccuracy, the ultrasound tracking introduces a temporal delay of about 100 ms. In typical 'push-button' interaction this is almost unnoticeable, but it might present a problem in a fast-paced application such as a game. While eBeam is easy to use and install, its sensitivity to ambient noise can be a problem.

14.6.1.2 Laser Range Finder

We added support for a laser range finder to enable more intuitive, unencumbered interaction. We used an eye-safe Sick™ LMS-200 laser scanner mounted on the FogScreen device (see Fig. 14.8). It scans the environment by firing a series of short infrared laser pulses in a fan shape and measuring the time-of-flight from the firing to the return of an optical echo. It provides an accuracy of 10 mm and a statistical error of just 5 mm, which is adequate for our purpose. The Sick scanner also requires line-of-sight, but an emitter is no longer needed, so users are able to interact directly with their bare hands. The
Fig. 14.8. Unobtrusive tracking. left to right: (a) The Sick laser range finder 2D tracking system, highlighted in the upper left corner, allows users to intuitively interact with the FogScreen using their bare hands. (b) In the depicted demo application, the system leaves slowly fading sparklers or fiery traces wherever the user is touching the screen
tracker is triggered by objects intersecting the plane, and the largest detected object is chosen to represent the touch-screen input. As long as an object is present, the left mouse button is emulated as pressed. The laser range finder may be triggered by the fog itself and thus needs to be mounted 10–20 cm from the screen. The non-intrusive tracking allows natural interaction, while currently limiting the system to one user and one preferred side per scanner device.

The scanner transmits its information via a serial RS-422 link at up to 500 kbps. This is both too high a data rate for ordinary PC serial ports to receive and too high a sampling rate to be useful for our purposes. We constructed a simple AVR microcontroller-based interface card to receive the data from the scanner, compress it, and transmit the compressed data at 115,200 bps to the host. At the host, the data can either be used directly by custom software or by our emulator of a generic desktop mouse to control arbitrary legacy applications.

The laser scanner occasionally produces spurious, overly large range values, mainly due to a laser beam only partially hitting a target, so filtering is applied to the data stream. Spurious single-sample errors are discarded by a 3-tap 2D median filter. For the mouse emulator, the onset of the mouse button click is delayed to allow the hand to fully enter the beam, and a similar delay is introduced for the button release when the hand is removed. As the scanning rate is 75 Hz, the resulting delay is not noticeable to the user (a toy sketch of this filtering and debouncing is given at the end of Sect. 14.6.3).

14.6.2 Vision-based 3D Tracking

The two above-mentioned tracking technologies provide compact 2D tracking solutions that are straightforward to install and calibrate. The single-user constraint and the limitation to 2D interaction made us look into other tracking technologies, such as the WorldViz PPT, to enable multi-user 3D applications. The PPT is a wireless 3DOF vision-based system that uses from two to eight cameras to track small near-infrared (IR) LEDs in the environment. Our setup uses four cameras, two on each side of the display (see Fig. 14.3). The FogScreen is practically translucent in the IR spectrum, so it does not interfere with the visibility of our IR LEDs. Each LED marker needs to be visible to at least two cameras at all times, and up to eight markers can be tracked simultaneously. We use one LED to track the user's head position, and one LED for the hand position. Multiple LEDs can also be combined to provide more degrees of freedom, which obviously increases the size of the marker due to the required distance between LEDs. The use of two LEDs would allow a user to specify a vector, which can be useful for orienting objects on the screen; three LEDs allow 6DOF tracking.

There is no robust means of uniquely identifying a particular LED. Thus, LEDs that are close to each other or LEDs that move too quickly can confuse the tracker, causing IDs to be swapped or the appearance of more or fewer
LEDs than are actually present. These artifacts are particularly problematic for applications that need to maintain knowledge of tracked object identity between frames. To address this issue, we developed a proxy VRPN tracker server [33] that filters the PPT output into more reliable data (a toy reconstruction is sketched below). It analyzes the position and velocity of tracked objects to predict future positions, reducing swapping, and removes the duplicate-report artifact by eliminating tracking results that have very similar position and velocity.

The use of IR markers also imposes the requirement of controlled lighting, since many regular light sources emit IR radiation that generates noise in the near-IR camera image. This issue is evident especially in environments with daylight or bright incandescent spotlights, whereas standard fluorescent lighting does not have this problem. Specific spotlights (i.e. with minimal IR radiation) or IR filters could make it possible to use incandescent light sources in the environment.

14.6.3 Input Devices

Gesture-based interaction without the use of input devices is natural and direct, but is often limited to simple pointing, as in the case of our laser range finder. Systems that recognize multiple and more complex pointing gestures still tend to be limited to a single user, and while manipulation gestures can be intuitive, system commands, such as mode changes, might not be as easy to represent. We developed various wireless controllers to facilitate interaction for applications where discrete input in a comfortable form factor is desired. Our controllers take the form of 3-button joystick handles with an integrated or detachable custom-made miniature version of the PPT marker for 3D position tracking (see Fig. 14.9). The use of Bluetooth in combination with PPT markers enables multiple wireless 3DOF controllers to be simultaneously active in the system.
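Returning to the tracker proxy of Sect. 14.6.2: the sketch below is our illustrative reconstruction of the filtering idea (greedy nearest-neighbour matching against constant-velocity predictions, plus duplicate removal), not the actual VRPN server code, and the 2 cm duplicate radius is an assumed threshold.

```python
import math

class MarkerProxy:
    """Toy reconstruction of the proxy filter: constant-velocity
    prediction plus greedy nearest-neighbour matching keeps marker IDs
    stable, and near-coincident duplicate reports are dropped."""

    def __init__(self, dup_radius=0.02):
        self.tracks = {}                 # id -> (position, velocity) 3-tuples
        self.dup_radius = dup_radius
        self.next_id = 0

    def update(self, reports, dt):
        # 1. drop duplicate reports that nearly coincide in position
        unique = []
        for p in reports:
            if all(math.dist(p, q) > self.dup_radius for q in unique):
                unique.append(p)
        # 2. greedily match reports to constant-velocity predictions
        remaining, updated = list(unique), {}
        for tid, (pos, vel) in self.tracks.items():
            if not remaining:
                break
            pred = tuple(x + v * dt for x, v in zip(pos, vel))
            best = min(remaining, key=lambda p: math.dist(p, pred))
            remaining.remove(best)
            new_vel = tuple((b - x) / dt for b, x in zip(best, pos))
            updated[tid] = (best, new_vel)
        # 3. unmatched reports start new tracks
        for p in remaining:
            updated[self.next_id] = (p, (0.0, 0.0, 0.0))
            self.next_id += 1
        self.tracks = updated
        return {tid: pos for tid, (pos, _) in self.tracks.items()}

proxy = MarkerProxy()
proxy.update([(0.0, 1.6, 2.0), (0.5, 1.2, 1.0)], dt=1 / 60)
# the repeated report below is a near-duplicate and is discarded:
print(proxy.update([(0.01, 1.6, 2.0), (0.5, 1.21, 1.0), (0.5, 1.21, 1.0)],
                   dt=1 / 60))
```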
Fig. 14.9. Wireless joysticks with integrated 3DOF PPT marker and Bluetooth. left to right: (a) Symmetrical joystick. (b) Right-hand joystick
In addition to a trigger button, our controllers have two horizontally or vertically placed buttons that are conveniently accessible with the thumb. The controller with horizontal buttons is symmetrical and works for both left- and right-handed users (see Fig. 14.9a). It is ideal for transformations such as horizontal translation or rotation around the up-axis, as well as for applications that want to mimic left- and right-button mouse clicks. The other controller type is for right-handed users and has two buttons on the left side (see Fig. 14.9b). The buttons are closer to each other and thus require less thumb movement, but the smaller separation makes it more likely that novice users will confuse them. Additionally, our 2×3 cm miniature version of the PPT marker (see Fig. 14.4b) is sufficiently small and light to be held between two fingers or to be attached to body parts as a lightweight 3D marker. Its small size allows us to simulate unencumbered 3D hand tracking for experiments with multi-user hand-tracked-style 3D interaction.
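As promised in Sect. 14.6.1.2, here is a toy sketch of the laser-scanner input pipeline: a 3-tap median filter for single-sample range spikes and a short settle delay on the emulated mouse press and release. It is our illustration of the idea, not the authors' code; the three-scan settle time is an assumed value.

```python
from collections import deque
from statistics import median

class TouchEmulator:
    """Median-filtered, debounced touch input: single-sample spikes are
    rejected, and press/release are delayed by a few 75 Hz scan periods
    so the hand can fully enter or leave the beam."""

    def __init__(self, settle_scans=3):
        self.xs, self.ys = deque(maxlen=3), deque(maxlen=3)
        self.settle = settle_scans
        self.present = 0
        self.absent = 0
        self.button_down = False

    def update(self, hit):                      # hit = (x_mm, y_mm) or None
        if hit is None:
            self.absent += 1
            self.present = 0
            if self.button_down and self.absent >= self.settle:
                self.button_down = False        # delayed release
            return self.button_down, None
        self.absent = 0
        self.present += 1
        self.xs.append(hit[0])
        self.ys.append(hit[1])
        if not self.button_down and self.present >= self.settle:
            self.button_down = True             # delayed press
        pos = (median(self.xs), median(self.ys))  # 3-tap spike rejection
        return self.button_down, pos

emu = TouchEmulator()
for sample in [(10, 20), (11, 21), (500, 21), (12, 22), None, None, None]:
    print(emu.update(sample))   # the 500 mm spike never reaches the output
```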
14.7 Interaction Experiments

We developed and tested several types of interfaces using our interactive FogScreen system. Most of these interfaces were part of demo applications that were presented at ACM SIGGRAPH 2005. All except the fiery characters demo are based on the vision-based 3D tracking described in Sect. 14.6. The goal with each interface was to explore a different interaction mechanism using the unique capabilities of the FogScreen and our input devices. Each test was examined to see how it affected users' perception of, and reaction to, the 3D content.

14.7.1 2D Manipulation

The first action most users take upon seeing the FogScreen is to insert their hands into the display to "play" with objects on the screen. We took advantage of this natural tendency by developing a set of interfaces that involve the user directly touching the screen to interact with content.

14.7.1.1 Fiery Characters

Figure 14.8 shows the fiery characters demo, which allows users to play with slowly fading sparklers and lines of fire on the translucent screen. The density of fog is kept low, so only the bright fiery spots are visible and everything else is invisible. The demo uses 2D touch screen and mouse emulation by means of the eBeam tracker or Sick scanner – when a user touches the screen, a mouse button press is emulated, allowing users to draw in the air with their hands on the virtual screen.
14.7.1.2 Rigid Body Simulator

To create the sensation of manipulating real physical objects, we developed an application that lets users intuitively interact with realistically behaving virtual objects (see Fig. 14.10). Using a straightforward implementation of standard mechanical dynamics [34], our application simulates a number of virtual rigid bodies that bounce around as if in a low-gravity environment. The interface is simple and straightforward, in an attempt to closely mimic the manipulation of real objects – each user moves a single LED tracked by PPT to control a paddle. This paddle behaves as another rigid body in the simulation and allows the user to collide with the objects in the scene, directing their motion. The simulation and interaction are limited to 2D (all z-axis values are set to 0), as we found that without additional depth cues, perception of the z-axis placement of objects was difficult for some users. However, the paddles can be controlled from any point in our 3D interaction space; distance to the screen does not matter in this application. The 2D interface is so simple that even small children were able to immediately start playing with the simulation without any instruction. Because of the direct manipulation style of the interface, there is a sensation that the user is playing with real objects, rather than virtual ones. Unfortunately, this effect is somewhat reduced by the ability of the user to stand and interact with the screen at any depth because of the clamping of z-axis values.

Fig. 14.10. The rigid body simulator has a physically intuitive interface which is easy even for small children to understand and play with. A 2DOF controller lets users move a green paddle which collides realistically with other virtual objects in the scene, including a teapot and a torus

14.7.1.3 Consigalo

The dual-side and reach-through capabilities of the FogScreen are used in Consigalo, an engaging multiplayer game (see Fig. 14.11). Users control colored spherical cursors with the handheld PPT markers in the 3D space around the screen. The user touches the screen to grab objects at a location, or moves away from the screen to release objects. This allows players to pick up falling animals and sort them into the colored side goal areas by dragging them across the display surface. The 3D tracking emulates a touch screen here: when the PPT marker (see Fig. 14.9a) moves close to or through the screen, an event is triggered. This furthers the perception of virtual objects occupying the physical space of the screen by allowing users to 'grab' them when touching the screen. Consigalo also enables collaborative face-to-face 3D interaction by taking advantage of the screen transparency and dual-sidedness to allow play on either side of the screen – the players can even switch sides by moving through the screen while interacting with the animals.

Fig. 14.11. The touchscreen-style interaction of Consigalo makes it very easy for users to intuitively understand the grabbing action necessary to catch animals and score points. The dual-sided display is instrumental in allowing many users to participate and engage each other from across the screen

14.7.2 Navigation

To explore the concept of our display as a portal to a virtual environment, we created the virtual forest tour (see Fig. 14.12). The interface is a first-person point of view situated inside the environment, with one of our 3DOF wireless controllers (see Fig. 14.9) for interaction. The user can move the controller while holding the first button to change the velocity as the camera moves through the environment. The second button changes the viewing direction, and the third button controls the position of the light source in the scene. Altogether, the effect is a very natural game-style interface that makes it easy to navigate the environment. While the perception of moving through a virtual environment is clear, it still requires some willful suspension of disbelief, as navigation does not use any natural locomotive methods. The presence of the virtual environment could be improved in a few ways. Using head-tracking to provide parallax in the environment would be a major step forward, but we haven't evaluated such hybrid setups yet. Some sort of natural locomotion interface, such as the moving floor tiles [35], would also improve the sensation of navigating a real environment.
Fig. 14.12. The virtual forest tour acts as a portal from the real world into a virtual environment full of thousands of realistically rendered trees. People can explore the environment with a first-person game-style interface using a wireless 3DOF controller. We are currently exploring navigational interfaces taking into account the walk-through capability
14.7.3 3D Manipulation

As we present 3D content on the pseudo-3D FogScreen, we are also interested in interfaces for 3D manipulation of scenes. We developed two applications that use different techniques to provide depth cues for a 3D cursor used to select regions of a curved 3D surface displayed by the screen.

14.7.3.1 Elastic Surface Deformer

Our first test was a single-user modeling application we developed to explore the combination of real 3D interaction and pseudo-3D visualization in our system. The elastic surface deformer uses a 3-button controller (see Fig. 14.9b) to stretch and sculpt, as well as to move and rotate, an elastic 3D model of a human head (see Fig. 14.13). The front and back views of the 3D model are projected on opposite sides of the screen, such that the user can walk through the screen and see what the object looks like from the other side in a pseudo-3D fashion. We chose to use full 3D interaction in this application – users had to position the 3D cursor on the head surface to grasp and drag it. However, it proved slightly difficult to manipulate the 3D model, since the only available depth cue was cursor occlusion (perspective cues were not available, as the head was orthographically projected to facilitate a concurrent dual-sided view). While pure 2D interaction is too limited, a traditional solution to this problem would be to do selection in 2D and use relative 3D motion for dragging.

Fig. 14.13. The elastic face deformer's 3D interface gives users complete control over how they distort a virtual head, although selection was difficult due to insufficient depth cues

14.7.3.2 Sound Putty

The Sound Putty project extended the 3D manipulation of the elastic head deformer to a more abstract interactive art exhibit (see Fig. 14.14). Multiple users are able to simultaneously influence the behavior of a virtual putty-like fluid by moving attractors and repellers around it in 3D. To provide additional depth cues, head-tracked rendering was used in the single-user case to provide motion parallax. Small motions of the user's head provide slight parallax, which shows very clearly the depth of the surface, making correct 3D positioning possible. During interaction, the fluid will often move completely in front of the screen, no longer actually in the plane of the screen, but users are still able to effectively find and manipulate it. However, there was more of a learning curve associated with the head-tracked rendering, as users are not accustomed to that type of display. Proper calibration is critical to a believable experience – when the calibration was slightly off, it distorted users' perception, making input more difficult than on a regular 2D display, as users were grappling with figuring out what the image meant instead of focusing on the interaction. With proper calibration, and after a short learning curve, users had little difficulty interacting with this interface in 3D.

Fig. 14.14. Sound Putty presents the user with an abstract putty-like fluid that can be controlled to create interesting shapes and motions. The 3D interface was greatly enhanced by the addition of head-tracked rendering, which provides motion parallax depth cues to allow for 3D perception of the shape of the surface
14.8 Future Work

Our work with the FogScreen continues in many different areas. We are working towards improving the image quality of the display by further reducing turbulence in the fog flow. This will improve image fidelity and increase the effective field of view, as well as allow the realization of even larger screens. The quality of stereoscopic imagery, especially active stereo, should also improve with less turbulence. We are also interested in improving the sense of presence of virtual objects in the physical world by integrating haptic feedback into the interface. A haptic device such as the SPIDAR [36] would work well with minimal discomfort – a normal display would interfere with the necessary wires, but they can go through the FogScreen, and a SPIDAR is capable of the large range of motion necessary. It would also provide 3DOF tracking input for at least a single user, removing the need for an additional tracking solution. Finally, in the interest of developing a fully volumetric 3D display, we are currently investigating the use of multiple FogScreens in various configurations to allow images to occupy a tangible physical volume surrounding the user. This would overcome a fundamental limitation of existing volumetric displays, allowing the user to become fully immersed in the 3D visualization.
14.9 Conclusions

We have described a novel mechanism to create a pseudo-3D walk-through screen with interactive capabilities. The implemented system enables one to view and manipulate 3D objects in mid-air and observe them from different angles in a natural manner. Using it as an immaterial, head-tracked, dual-sided display has led to an enhanced visualization experience. It creates a strong visual effect of 3D objects floating in air, even when the image is not stereoscopic. This is a first step in the direction of a truly volumetric walk-through display. The addition of 2D and 3D interaction significantly expands the possibilities for applications of the FogScreen. It provides advantages over other displays by allowing unhindered multi-user collaboration, providing new interface potential, and subtly reinforcing the presence of virtual objects in the physical environment. Unlike many volumetric displays, the pseudo-3D FogScreen can be very large and does not restrict the user from "touching" the objects, leading to a more immersive experience. Engaging interaction with immaterial 3D objects
can be supported in a variety of ways, as we demonstrated at SIGGRAPH 2005 Emerging Technologies [37]. The FogScreen has proven itself as a captivating display technology that immediately generates interest and excitement in the audience. The feedback from our SIGGRAPH 2005 demonstration was unanimously enthusiastic about the dual-sided, interactive experience. Since then, our demos of head tracking and stereoscopy have been met with similar enthusiasm about the further improved perception of 3D imagery.
Acknowledgments

We wish to thank Karri Palovuori at Tampere University of Technology and Marc Breisinger, Nicola Candussi, Cha Lee, Rogerio Feris, Jason Wither and Alberto Candussi at the Four Eyes Lab, University of California at Santa Barbara. Special thanks to Andy Beall and Matthias Pusch from Worldviz LLC. This research was in part funded by grants from the NSF IGERT in Interactive Digital Multimedia, the Sweden-America Foundation, the Academy of Finland, the EC under contract FP6-511568 3DTV, TTY tukisäätiö, the Emil Aaltonen Foundation, Tekes, the Alfred Kordelin Foundation, the Jenny and Antti Wihuri Foundation, and the Finnish Cultural Foundation and its Pirkanmaa Branch. Additional funding came from a research contract with the Korea Institute of Science and Technology (KIST) through the Tangible Space Initiative Project.
References

1. DiVerdi S, Rakkolainen I, Höllerer T, Olwal A (2006). A Novel Walk-through 3D Display. Proc. SPIE Electronic Imaging, Stereoscopic Displays and Virtual Reality Systems XIII, San Jose, CA, USA, January 15–18, 2006, SPIE Vol. 6055, pp. 1–10.
2. Palovuori K, Rakkolainen I (2004). FogScreen. U.S. patent 6,819,487 B2. November 16, 2004.
3. Rakkolainen I, Palovuori K (2002). A Walk-thru Screen. IS&T/SPIE Electronic Imaging 2002, Proc. of Conference on Projection Displays VIII, San Jose, CA, USA, January 23–24, 2002, pp. 17–22.
4. FogScreen Inc. (2007). http://www.fogscreen.com. March 2007.
5. Pastoor S, Wopking M (1997). 3-D Displays: A Review of Current Technologies. Displays, Vol. 17, No. 2, April 1, 1997, pp. 100–110.
6. Halle M (1997). Autostereoscopic Displays and Computer Graphics. Computer Graphics, ACM SIGGRAPH, Vol. 31, No. 2, May 1997, pp. 58–62.
7. Azuma R, Baillot Y, Behringer R, Feiner S, Julier S, MacIntyre B (2001). Recent Advances in Augmented Reality. IEEE Computer Graphics and Applications, Vol. 25, No. 6, November–December 2001, pp. 24–35.
8. Sutherland I (1965). The Ultimate Display. Proc. of IFIP Congress 1965, Vol. 2, pp. 506–508.
9. Rekimoto J, Matsushita N (1997). Perceptual Surfaces: Towards a Human and Object Sensitive Interactive Display. In Workshop on Perceptual User Interfaces (PUI'97), October 1997, pp. 30–32.
10. Ullmer B, Ishii H (1997). The MetaDESK: Models and Prototypes for Tangible User Interfaces. Proc. of the ACM UIST'97 Symposium, pp. 223–232.
11. Leibe B, Starner T, Ribarsky W, Wartell Z, Krum D, Singletary B, Hodges L (2000). The Perceptive Workbench: Towards Spontaneous and Natural Interaction in Semi-Immersive Virtual Environments. Proc. of IEEE Virtual Reality 2000, March 2000, New Brunswick, NJ, USA, pp. 13–20.
12. dnp Holo Screen (2007). DNP, http://www.dnp.dk/. March 2007.
13. HoloClear (2007). HoloDisplays, http://www.holodisplays.com/. March 2007.
14. Wilson A (2004). TouchLight: An Imaging Touch Screen and Display for Gesture-Based Interaction. Proc. of ICMI'04, pp. 69–76.
15. Hirakawa M, Koike S (2004). A Collaborative Augmented Reality System Using Transparent Display. Proc. of ISMSE'04, pp. 410–416.
16. Olwal A, Lindfors C, Gustafsson J, Kjellberg T, Mattson L (2005). ASTOR: An Autostereoscopic Optical See-through Augmented Reality System. Proc. of IEEE and ACM ISMAR 2005, pp. 24–27.
17. Just PC (1899). Ornamental Fountain. U.S. patent 620,592. March 7, 1899.
18. Sugihara Y, Tachi S (2000). Water Dome – An Augmented Environment. Proc. of the Information Visualization Conference, London, July 2000, pp. 548–553.
19. Aquatique (2007). Aquatique Show International, http://www.aquatic-show.com/. March 2007.
20. Fantasmic Show (2007). Disney, http://disneyworld.disney.go.com/wdw/entertainment/entertainmentDetail?id=FantasmicEntertainmentPage. March 2007.
21. Desert Rain Project (2007). http://www.crg.cs.nott.ac.uk/events/rain/. March 2007.
22. Mee Fog Inc (2007). http://www.meefog.com/. March 2007.
23. IO2 Technology LLC (2007). Heliodisplay, http://www.io2technology.com/. March 2007.
24. Rakkolainen I, Erdem T, Erdem Ç, Özkan M, Laitinen M (2006). Interactive "Immaterial" Screen for Performing Arts. ACM Multimedia 2006, Interactive Arts Program, Santa Barbara, CA, USA, October 23–27, 2006.
25. NEC (2007). WT-610 Short Throw Projector, http://www.nec.co.uk/NEW_MultiSync_WT610.aspx. March 2007.
26. Infitec GmbH (2007). Infitec Interference Filters, http://www.infitec.net/. March 2007.
27. Bailey M, Clark D (1998). Using ChromaDepth to Obtain Inexpensive Single-image Stereovision for Scientific Visualization. Journal of Graphics Tools, Vol. 3, No. 3, pp. 1–9.
28. DepthQ 3D Projector (2007). Infocus, http://depthq.com/. March 2007.
29. WorldViz PPT 3D Optical Tracker (2007). http://www.worldviz.com/ppt/. March 2007.
30. Krueger W, Froehlich B (1994). The Responsive Workbench. IEEE Computer Graphics and Applications, Vol. 14, No. 3, pp. 12–15.
31. The Stanford 3D Scanning Repository (2007). http://www-graphics.stanford.edu/data/3Dscanrep/. March 2007.
32. Rekimoto J, Saitoh M (1999). Augmented Surfaces: A Spatially Continuous Work Space for Hybrid Computing Environments. Proc. of ACM CHI.
33. Taylor R (1998). VRPN: Virtual Reality Peripheral Network, http://www.cs.unc.edu/Research/vrpn/.
34. Baraff D (1989). Analytical Methods for Dynamic Simulation of Nonpenetrating Rigid Bodies. Proc. ACM SIGGRAPH, pp. 223–232.
35. Iwata H, Yano H, Fukushima H, Noma H (2005). CirculaFloor: A Locomotion Interface Using Circulation of Movable Tiles. Proc. IEEE Virtual Reality 2005, Bonn, Germany, March 12–16, 2005, pp. 223–230.
36. Kim S, Ishii M, Koike Y, Sato M (2000). Development of Tension Based Haptic Interface and Possibility of its Application to Virtual Reality. Proc. of ACM VRST, pp. 199–205.
37. Rakkolainen I, Laitinen M, Piirto M, Landkammer J, Palovuori K (2005). The Interactive FogScreen. A Demonstration and Associated Abstract at ACM SIGGRAPH 2005 Program: Emerging Technologies, Los Angeles, CA, USA, July 31–August 4, 2005. See also http://ilab.cs.ucsb.edu/projects/ismo/fogscreen.html.
15 Holographic 3DTV Displays Using Spatial Light Modulators

Metodi Kovachev¹, Rossitza Ilieva¹, Philip Benzie², G. Bora Esmer¹, Levent Onural¹, John Watson², and Tarik Reyhan¹

¹ Dept. of Electrical and Electronics Eng., Bilkent University, TR-06800 Ankara, Turkey
² University of Aberdeen, King's College, AB24 3FX, Scotland, UK
15.1 Introduction

All functional blocks of a 3DTV system, such as its capture, compression, transmission and display units, are important for a successful end-to-end operation. However, there is no doubt that the display unit has a special impact, since the viewer interacts directly with it. Construction of a 3D display unit which generates a replica of a 3D scene with an acceptable quality has long been a primary goal for researchers [1, 2]. Current 3D display implementations are usually based on stereoscopic or autostereoscopic technologies. However, a true 3D display unit, such as a holographic 3DTV display device, is much more desirable due to the superior 3D visual quality it promises. Naturally, a dynamic holographic device is needed for video operation. Spatial light modulator (SLM) technology is one convenient alternative for achieving a dynamic holographic display. An SLM is an array of pixels where each pixel modulates the phase and amplitude of light transmitted through or reflected from it [3]. Recently, multi-mega-pixel SLMs that can be electronically driven via a digital video interface (DVI) or video graphics adapter (VGA) have been developed [4, 42]. Although the developments in SLM technologies during the last decade have brought us new opportunities, currently achievable SLM parameters are still not sufficient for a satisfactory 3D display quality. Holography is based on representing and storing the 3D scene information as an interference pattern. Therefore, holographic recordings require a high spatial resolution. Various methods, such as compression of fringe patterns, generation of horizontal-parallax-only (HPO) holograms and computation of binary holograms, have been proposed to reduce the bandwidth requirements [5]. If an SLM is going to be used as a holographic display unit, a large array size with a small pixel pitch is essential. Holograms may be captured directly from charge-coupled devices (CCDs) with a high spatial resolution and dynamic range, or may be generated by computers or other means. A successful wavefront reconstruction can be achieved if the hologram features
match the SLM parameters such as pixel pitch, array size, pixel geometry, dynamic range, etc. A survey on liquid crystal (LC) SLMs is given in the next section. After a discussion of some different methods of hologram generation by computation in Sect. 15.3, both digital and optical reconstructions from such generated holograms are compared in Sect. 15.4. Conclusions are presented at the end.
15.2 Survey on Electro-optical Properties of LC Spatial Light Modulators

SLMs, as promising devices for holographic displays, are the subject of discussion in this section. Liquid crystal SLMs are electro-optical devices that can modulate transmitted or reflected light; they contain a two-dimensional array of discrete cells or pixels [3, 6]. Each pixel contains a liquid crystal layer sandwiched between two electrodes on glass substrates, and has a birefringence depending on the applied voltage. The applied voltage modulates the phase difference between the ordinary ray and the extraordinary ray in the pixel cell. This is equivalent to a rotation of the light polarization angle when the incident light is polarized at a 45° angle with respect to the fast polarizing axis orientation of the LC. A polarization analyzer converts the phase difference modulation to gray scale levels. One of the electrodes is usually common to all pixels in the SLM. A potential difference V(ζ, η) with respect to the common electrode can be applied at each pixel. Both electrodes may be transparent, or one transparent and one reflective (mirror) electrode may be used, depending on the design. The first case corresponds to the transmission mode and the second to the reflection mode. Nowadays SLMs consist of more than a million pixels, and any pixel may be addressed or driven independently. Pixel sizes vary from 7 μm up to 19 μm, and the number of pixels can go up to 3840 × 2048. Pixel size depends on photolithographic and microelectronic technologies. The pixels are usually square and arranged as a matrix with an aspect ratio of 4:3 or 16:9 for standard and panoramic displays, respectively. It is not physically possible for the entire surface to be active. The SLM matrix structure is geometrically determined by four parameters: the pixel pitch (Δζ and Δη) and the gap (gx and gy) along the X and Y directions (Fig. 15.1). The pixel pitch is the distance between the centers of neighboring pixels, and the gap is the non-active area between two neighboring pixels. The ratio between the gap area and the active zone determines the fill factor of the SLM; common commercial fill factors are > 90%. The gap usually has low transparency or low reflectivity and introduces an attenuation of the light. The ratio of the input and output light intensities determines the efficiency, which is about 50% to 75% for modern SLMs. Transmissive SLMs are manufactured by using etching technology on a transparent substrate. The relief created this way acts as a phase grating and
Fig. 15.1. Nomenclature for SLM pixel structure
creates multiple diffraction orders, as shown in Fig. 15.2, even when no voltage is applied. It is possible to use the multiple diffraction orders as an advantage, to enlarge the viewing zone or to increase the effective SLM resolution [7, 8]. Reflective SLMs are manufactured by planar technology on a silicon substrate. Passive and active SLM elements are manufactured by diffusion, and the relief on the silicon surface in this case has a depth less than the wavelength and introduces only a small phase modulation in the reflected light. Because of this, the energy in the higher diffraction orders is much less than the energy in the zeroth or first orders of diffraction. LCoS (Liquid Crystal on Silicon) technology typically offers high resolution and a high fill factor, and pixel edges are
Fig. 15.2. Diffraction orders produced by an SLM illuminated by a coherent laser source
smooth. The electronic circuits that control the formation of the image are fabricated on the silicon chip, which is coated with a highly reflective layer. The circuitry is behind the pixel and therefore does not create an obstruction along the light path. The liquid crystal used in the cells of SLMs is a birefringent medium whose birefringence depends on the applied voltage at the electrodes [9]. The voltage changes the refractive index of the liquid crystal in the extraordinary (fast) direction, and the cell works as a phase retarder under suitably polarized light, obtained by a polarizer in front of the SLM. The polarizer is oriented at 45° to the liquid crystal fast axis (Fig. 15.3). An analyzer after the SLM, oriented orthogonally with respect to the polarizer, transmits a portion of the light depending on the introduced retardation. In this case the SLM works in the amplitude-mostly mode. If the polarizer and analyzer are parallel to the fast axis of the liquid crystal, then the SLM works in the phase-mostly mode. Most SLMs can modulate the phase from zero to 2π (or −π to +π), and some of them up to 3π, depending on the wavelength (for example, Holoeye Photonics: LC 2002, LC-R 2500, LC-R 768; HDTV Phase Only Panel HEO 1080 P). Characteristics of Holoeye SLMs are presented in Figs. 15.4.a–15.4.g. The phase modulation is measured relative to the linearly polarized reference beam. Intensity modulation is presented in relative units with respect to a chosen value from the corresponding gray level curve of I(x, y). The Holoeye characteristics given in Figs. 15.4.a and 15.4.b show that the SLM behaviour depends strongly on the polarizer used. A polarizer with a better than 1:1100 attenuation ratio at 543 nm yields a high contrast ratio. Input-output SLM characteristics are important for correct design and display of computer generated holograms. The SLM characteristics for both modes depend on the polarizer and analyzer orientations, as shown in Figs. 15.4.c and 15.4.d. The phase and intensity modulation characteristics for the amplitude-mostly mode of the transmissive SLM LC 2002 are shown in
Fig. 15.3. Fast polarizing axis orientation of LC SLM
Fig. 15.4. Modulation characteristics of commercially available Holoeye SLMs. (a) Intensity modulation for LC 2002 at 543 nm; (b) at 633 nm; (c) phase and intensity modulation for different polarizer settings at 543 nm; (d) at 633 nm. (The phase modulation units are ×π radians)
Fig. 15.4.c, for a polarizer orientation of 330° and an analyzer at 0° with respect to the vertical axis of the SLM, at 543 nm. Phase and intensity modulation characteristics for the Holoeye LC-R 768 reflective SLM, for a polarizer orientation of 70° and without an analyzer, at 543 nm and at 633 nm, are shown in Figs. 15.4.e and 15.4.f, respectively. Using only input polarization, without an analyzer, leads to optimal light efficiency. Phase modulation for a polarizer orientation of 90°, and intensity modulation for polarizer and analyzer orientations of 90°, for the phase-only SLM HEO 1080 P (1920 × 1080 pixels) at 633 nm are shown in Fig. 15.4g. It is
Fig. 15.4. (continued) Modulation characteristics of commercially available Holoeye SLMs. (e) phase and intensity modulation for LC-R 768 at 543 nm; (f ) at 633 nm; (g) phase and intensity modulation for HEO 1080 P at 633 nm. (The phase modulation units are ×π radians). (Courtesy of Stephan Osten from Holoeye Photonics AG)
seen that the phase-mostly mode is always accompanied by a small amplitude modulation. Such characteristics are readily available in the documentation of producers such as Holoeye Photonics, CRLO Displays Ltd. and Displaytech Ltd. Another important characteristic is the crosstalk between neighboring pixels of an SLM. Experiments with a Sony SLM LCX012BL and a Holoeye LC 1004 [41] showed that the crosstalk is practically zero. Most commercial SLMs suitable for computer generated holography (with a pixel count of greater than 1000 × 1000) have a minimum writing time which yields a frame rate of about 300 fps. Since this frame rate is considerably higher than needed for video perception, single-SLM color displays are possible by allocating the available frames to the R, G, B color components in a time-multiplexed fashion. Alternatively, colour mixing can be achieved by using a simultaneous combination of transmissive and reflective SLMs.
15.3 Generation of Holograms by Computation

Methods and algorithms for computer generated holograms (CGHs) have been known for a long time [11, 28]. SLMs have been used as diffractive devices to reconstruct 3D images from CGHs [4, 11, 12, 13]. Two common methods to compute the diffraction field due to a planar object in 3D space are the Rayleigh-Sommerfeld (R-S) diffraction integral and the Fresnel-Kirchhoff diffraction formula. A comparison of the R-S and Fresnel-Kirchhoff diffraction integrals is given by Lucke [14]. The Fraunhofer formula, which is also called the far-field approximation, is valid for larger distances, $z \gg \frac{k(x^2 + y^2)_{\max}}{2}$, where (x, y) represent the maximum extent of the object and ≫ should be interpreted as "at least 15 times" [16]. The Fresnel diffraction formula should be used for smaller distances [15, 16]. For a successful reconstruction, the size of the SLM, the reconstruction wavelength, the distance between the SLM and the image location, and many other parameters must be carefully considered. Three computational methods, based on the R-S, Fresnel-Kirchhoff and bipolar intensity formulas, are investigated in this chapter. The diffraction computation is usually a demanding process; therefore, a number of algorithms have been employed to exploit redundancy and thus reduce the computation time [5, 17, 18, 21]. For instance, Lucente et al. utilised a bipolar intensity method [5]. Ito et al. applied this method for reconstruction using LCoS SLMs [19, 20]. Classical methods for hologram computation use wavefront propagation theory [22, 23, 24, 25]. The R-S diffraction integral, the Fresnel-Kirchhoff diffraction formula, and the Fresnel approximation are some well-known scalar diffraction field calculation methods [9, 12, 15, 16, 26, 27].
15.3.1 Rayleigh-Sommerfeld Diffraction

The scalar diffraction of monochromatic coherent light between two parallel planes in a homogeneous and linear medium can be expressed by the plane-wave decomposition (PWD). PWD and the R-S diffraction integral are equivalent [29, 30]. The difference between them is how they express the diffraction field relationship: the former uses the frequency domain, whereas the latter defines the relationship in the spatial domain. PWD is used here because of its simplicity in implementations. The diffraction field relationship between the input and output fields, by utilizing the plane wave decomposition, is
$$U(x', y', z) = \int_{-2\pi/\lambda}^{2\pi/\lambda}\int_{-2\pi/\lambda}^{2\pi/\lambda} \mathcal{F}[U(x, y, 0)]\, \exp[j(k_x x' + k_y y')]\, \exp(jk_z z)\, dk_x\, dk_y \tag{15.1}$$

where $\mathcal{F}$ is the 2D Fourier transform (FT) from the (x, y) domain to the (k_x, k_y) domain. The terms k_x, k_y and k_z are the spatial frequencies of the propagating monochromatic waves along the x, y and z axes, respectively. The spatial frequency k_z can be expressed in terms of k_x and k_y as $k_z = \sqrt{k^2 - k_x^2 - k_y^2}$, where $k = 2\pi/\lambda$. The expression in (15.1) can be rewritten as

$$U(x', y', z) = 4\pi^2\, \mathcal{F}^{-1}\!\left\{ \mathcal{F}[U(x, y, 0)]\, \exp\!\left( j\sqrt{k^2 - k_x^2 - k_y^2}\; z \right) \right\},$$

where $\mathcal{F}^{-1}$ is the inverse FT from the (k_x, k_y) domain to the (x, y) domain. Since we are dealing with propagating waves only, the diffraction field is band-limited. Moreover, to have a finite number of plane waves in the calculations, we work with periodic diffraction patterns. To obtain the discrete representation, (15.1) is sampled uniformly along the spatial axes with $x = nX_s$, $y = mX_s$ and $z = pX_s$, where $X_s$ is the spatial sampling period, n and m are integers, and p is a real variable. Uniform sampling is applied in the frequency domain with $k_x = 2\pi n'/NX_s$ and $k_y = 2\pi m'/NX_s$. The resultant discrete algorithm is

$$U_D(n, m, p) = N\, \mathrm{DFT}^{-1}\{\mathrm{DFT}[U_D(n, m, 0)]\, H_p(n', m')\}, \tag{15.2}$$

where $H_p(n', m') = \exp\!\left( j2\pi \sqrt{\beta^2 - n'^2 - m'^2}\; p/N \right)$ and $\beta = NX_s/\lambda$. The discrete diffraction field is $U_D(n, m, p) = U(nX_s, mX_s, pX_s)$. Effects of sampling of the R-S diffraction field on the reconstructed image are discussed in [43].
15.3.2 Fresnel-Kirchhoff Diffraction

When the distance z is sufficiently large, the observer is said to be in the Fresnel diffraction region. The condition on the distance z is [15, 16]:

$$z \gg \sqrt[3]{\frac{\pi}{4\lambda}\left[ (x'-x)^2 + (y'-y)^2 \right]^2_{\max}}$$
where ≫ should again be interpreted as "at least 15 times". For instance, for an SLM with a 17.8 mm diagonal size, a viewing zone of the same size, a pixel size of 12.1 μm, and illumination with a wavelength λ = 0.543 μm, the above condition becomes z > 0.530 m. The field over the hologram plane, U(x′, y′, z), according to Fresnel diffraction is [16]:

$$U(x', y', z) = \frac{\exp(jkz)}{j\lambda z} \iint U(x, y, 0)\, \exp\!\left\{ j\frac{k}{2z}\left[ (x'-x)^2 + (y'-y)^2 \right] \right\} dx\, dy \tag{15.3}$$

Equation (15.3) is the convolution of the object function U(x, y, 0) with the kernel K(x, y, z), which is given by

$$K(x, y, z) = -\frac{j}{\lambda z}\, \exp(jkz)\, \exp\!\left( jk\, \frac{x^2 + y^2}{2z} \right). \tag{15.4}$$

For the moment we drop the constant terms in (15.4), −j/λz and exp(jkz), for simplicity, and denote U(x, y, 0) by U(x, y). If we now have a 2D discrete array with overall dimensions X and Y (e.g. an SLM), then for the inner integral, with respect to x, we can write [12]

$$\int_{-X/2}^{+X/2} U(x, y)\, \exp\!\left[ jk\, \frac{(x'-x)^2}{2z} \right] dx = \int_{x_1}^{x_2} B\, dx + \int_{x_2}^{x_3} B\, dx + \int_{x_3}^{x_4} B\, dx + \ldots \tag{15.5}$$

where

$$B = U(x, y)\, \exp\!\left[ jk\, \frac{(x'-x)^2}{2z} \right]; \qquad U(x, y) = U(x_i, y) = c_i \;\; \text{for } x_i \le x \le x_{i+1} \text{ and } y = \sigma.$$

Here the c_i's and σ are constants. Thus we split the original integral into many integrals, each defined over a single pixel, and set the integral boundaries to coincide with the pixel boundaries. Over the area of a single pixel the input field is therefore constant, so it can be moved out of the integral:

$$\int_{x_i}^{x_{i+1}} B\, dx = U(x_i, y) \int_{x_i}^{x_{i+1}} \exp\!\left[ jk\, \frac{(x'-x)^2}{2z} \right] dx$$

For the exponent argument and the integral boundaries the following substitutions can be made: $(x'-x)\sqrt{2}/\sqrt{\lambda z} = \tau$; $dx = \sqrt{\lambda z}/\sqrt{2}\; d\tau$; $\tau|_{x=x_i} = \tau_i$. Each of the integrals in (15.5) can then be written in terms of the Fresnel integrals as

$$\int_{\tau_i}^{\tau_{i+1}} \exp\!\left( \frac{j\pi}{2}\tau^2 \right) d\tau = \int_{0}^{\tau_{i+1}} \exp\!\left( \frac{j\pi}{2}\tau^2 \right) d\tau - \int_{0}^{\tau_i} \exp\!\left( \frac{j\pi}{2}\tau^2 \right) d\tau = C(\tau_{i+1}) + jS(\tau_{i+1}) - C(\tau_i) - jS(\tau_i), \quad (1 \le i \le N),$$

where C(τ_i) and S(τ_i) are the cosine and sine Fresnel integrals [9, 16]. Integrals along the y direction can be calculated in the same way. If we take into account the relation between τ and x,

$$\tau_{i-j} = (x_i - x_j)\, \frac{\sqrt{2}}{\sqrt{\lambda z}},$$

and that each integral above is multiplied by the piece of the input field corresponding to the interval [x_i, x_{i+1}], we can write

$$\frac{\sqrt{\lambda z}}{\sqrt{2}}\, U(x_i, y)\left[ C(\tau_{(i+1)-j}) + jS(\tau_{(i+1)-j}) - C(\tau_{i-j}) - jS(\tau_{i-j}) \right], \quad (1 \le i, j \le N).$$

Therefore, for the sum of integrals in (15.5) along the x-direction we obtain

$$\frac{\sqrt{\lambda z}}{\sqrt{2}} \sum_{i=1}^{N} U(x_i, y)\left[ C(\tau_{(i+1)-j}) + jS(\tau_{(i+1)-j}) - C(\tau_{i-j}) - jS(\tau_{i-j}) \right].$$

This is the convolution in discrete form along the x-direction. In a similar way, for the y-direction we can write

$$\frac{\sqrt{\lambda z}}{\sqrt{2}} \sum_{k=1}^{M} U(x, y_k)\left[ C(\sigma_{(k+1)-l}) + jS(\sigma_{(k+1)-l}) - C(\sigma_{k-l}) - jS(\sigma_{k-l}) \right],$$

where $(y'-y)\sqrt{2}/\sqrt{\lambda z} = \sigma$; $dy = \sqrt{\lambda z}/\sqrt{2}\; d\sigma$; $\sigma|_{y=y_k} = \sigma_k$. Combining the expressions for both the x and y directions, one can derive the 2D convolution in its discrete form:

$$U(x_j, y_l; z) = -\frac{1}{2}\, \exp(jkz) \sum_{i=1}^{N} \sum_{k=1}^{M} U(x_i, y_k; 0)\, \Big\{ \big[ jC(\tau_{(i+1)-j}) + S(\tau_{(i+1)-j}) - jC(\tau_{i-j}) - S(\tau_{i-j}) \big] \big[ jC(\sigma_{(k+1)-l}) + S(\sigma_{(k+1)-l}) - jC(\sigma_{k-l}) - S(\sigma_{k-l}) \big] \Big\}. \tag{15.6}$$

The kernel, expressed by the terms in curly brackets in (15.6), can be easily calculated using standard algorithms [9]. This kernel takes into account the wavefront contribution of each pixel area in the diffraction pattern. The convolution can be calculated directly or by a discrete Fourier transform. The second step in the calculation of a hologram is to add a reference beam, collinear with the propagation direction for an in-line (on-axis) hologram, or at an angle for an off-axis hologram.
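Since C(·) and S(·) are standard special functions, the 1D kernel terms of (15.6) are straightforward to tabulate. The following is a small illustrative sketch, not the authors' code, using SciPy's Fresnel integrals (which follow the same exp(jπτ²/2) convention as above); the parameter values in the example call are merely illustrative:

```python
# Per-pixel 1D kernel terms [C + jS](tau_{(i+1)-j}) - [C + jS](tau_{i-j})
# of (15.6), tabulated for all index differences i - j.
import numpy as np
from scipy.special import fresnel  # returns (S, C)

def fresnel_kernel_1d(N, pitch, wavelength, z):
    """Length 2N-1 array of kernel terms over index differences i - j."""
    d = np.arange(-(N - 1), N)                    # index differences i - j
    step = pitch * np.sqrt(2.0 / (wavelength * z))
    tau = d * step                                 # tau_{i-j}
    S1, C1 = fresnel(tau + step)                   # at tau_{(i+1)-j}
    S0, C0 = fresnel(tau)                          # at tau_{i-j}
    return (C1 - C0) + 1j * (S1 - S0)

# The 2D kernel of (15.6) is the outer product of the x and y 1D kernels;
# the field then follows by a 2D discrete convolution (direct or FFT-based).
kx = fresnel_kernel_1d(N=512, pitch=12.1e-6, wavelength=635e-9, z=0.8)
```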
15.3.3 Bipolar Intensity Method

Now we consider the bipolar intensity method, which derives its name from producing an interference pattern that has both positive and negative values. In classical optical holography, we may consider that a hologram, H, consists of the combination of the complex valued wavefront of an object beam, U, and a reference beam, R, on the recording medium:

$$H = |R + U|^2 = |R|^2 + |U|^2 + UR^* + RU^*$$

where $R = |a_R(x, y)|\exp\{j\varphi_R(x, y)\}$ and $U = |a_U(x, y)|\exp\{j\varphi_U(x, y)\}$ are the reference and object beams, respectively. The first term in the above expression is called the reference bias and is a spatially invariant (DC) term. The second term denotes the object self-interference, which is a spatially varying pattern. This term can cause distortion in the reconstruction process; fortunately, this distortion is small for small objects and can be suppressed by using a tilted reference beam. Removal of these undesired terms improves the computational efficiency and reduces the noise generated by the object beam. The final sinusoidal terms describe the modulation of the object and reference beams, and these are actually the terms which are computed as the hologram. Elimination of the first two terms results in an interference pattern with positive and negative intensities; hence the name bipolar intensity method. The intensity can therefore be represented as

$$I_{\mathrm{bipolar}}(x, y) = 2|a_U(x, y)||a_R(x, y)|\cos[\varphi_R(x, y) - \varphi_U(x, y)],$$

where $\varphi_U(x, y)$ represents the phase of the object beam. Assuming that the object wavefront can be represented by a superposition of spherical wavefronts, setting the scaling factor $2|U||R|$ to unity, and omitting the $\varphi_R(x, y)$ term, which is taken as zero at the recording plane, we arrive at (15.7), the bipolar intensity formula used for calculating holograms. The intensity at each pixel location $(x_\alpha, y_\alpha)$ on the SLM is

$$I_{\mathrm{bipolar}}(x_\alpha, y_\alpha) = \sum_{j=1}^{\mathrm{No.pts.}} A_j \cos\!\left[ \frac{2\pi}{\lambda} \sqrt{(x_\alpha - x_j)^2 + (y_\alpha - y_j)^2 + z_j^2} \right], \tag{15.7}$$

where $x_j$, $y_j$, $z_j$ are the real coordinate locations of the points on the object and $A_j$ is the amplitude of the corresponding point. During the calculation of (15.7) it is necessary to normalise the bipolar intensity so that all values are positive and can be written onto the SLM. This is easily done by adding a proper DC offset to all pixels. The dynamic range of the intensity levels is then quantized according to the dynamic range of the SLM. The resultant pixel values can be directly written onto the SLM. Equation (15.7) enables the reconstruction of 3D scenes described as object
points, and can be easily implemented on commodity graphics hardware for fast real-time computation of holograms [31]. LC or LCoS SLMs can be used to display holograms [4, 11, 12, 13]. Usually in-line holograms are reconstructed with LCoS SLMs. The advantage of the in-line reconstruction geometry is that the reference and object beams are collinear; thus the resolution requirements on the SLM are less demanding [32]. Due to the limited spatial bandwidth product offered by the SLM, it is only possible to reconstruct holograms with a limited viewing angle and spatial resolution. A hologram can also be recorded directly by a CCD camera and then reconstructed numerically [33].
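As an illustration of how directly (15.7) maps onto code, the following minimal NumPy sketch computes a bipolar intensity hologram for a point-cloud object and applies the DC offset and quantization described above. This is a sketch only; the centred grid and the 8-bit dynamic range are assumptions, not the authors' implementation:

```python
# Bipolar intensity hologram per (15.7): one cosine zone pattern per object
# point, summed over the SLM grid, then offset and quantized for the SLM.
import numpy as np

def bipolar_hologram(points, amplitudes, wavelength, pitch, shape):
    """points: (M, 3) array of object point coordinates (metres), with the
    hologram plane at z = 0; amplitudes: (M,) point amplitudes A_j."""
    rows, cols = shape
    xa = (np.arange(cols) - cols / 2) * pitch     # pixel centre coordinates
    ya = (np.arange(rows) - rows / 2) * pitch
    XA, YA = np.meshgrid(xa, ya)
    I = np.zeros(shape)
    for (xj, yj, zj), Aj in zip(points, amplitudes):
        r = np.sqrt((XA - xj)**2 + (YA - yj)**2 + zj**2)
        I += Aj * np.cos(2 * np.pi * r / wavelength)
    # Add a DC offset so all values are positive, then quantize to the SLM's
    # dynamic range (8 bits assumed here).
    I -= I.min()
    return np.uint8(255 * I / I.max())
```

The per-point loop makes the data parallelism explicit, which is exactly what allows the GPU implementations cited above to evaluate (15.7) in real time for small point counts.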
15.4 Comparison of Numerical and Optical Reconstructions

A program for computation of the forward- or backward-propagated wavefront using the Fresnel diffraction integral (15.6) was written [12]. A hologram is obtained by adding a reference beam to the forward-propagated wavefront. In our case, numerical or optical reconstructions from the generated holograms are obtained by using the complex conjugate of the reference beam and backward-propagating the field. If the reference beam itself were used instead, the virtual image would be reconstructed. Holograms computed by the Rayleigh-Sommerfeld diffraction formula (15.2) and by the bipolar intensity method (15.7) are also calculated, and reconstructions from these holograms are obtained. The SLMs used during the optical reconstructions were LC for the transmission mode and LCoS for the reflection mode. The LC SLM, which is also commonly used in multimedia projectors, has a resolution of 1280 × 720 square pixels with a pitch of 12.1 μm and a gap of about 1 μm. A 635 nm diode laser is used during the reconstruction. For the LCoS SLM the resolution is 1900 × 1200 square pixels and the pitch is 8.1 μm; a He-Ne laser (632.8 nm) is used for illumination. The maximum diffraction angle for the LC SLM, determined by its pixel size, is 1.5°, and the minimum distance for a Gabor (in-line) hologram, to avoid overlapping of diffraction orders in the reconstructed image, is 350 mm. The maximum diffraction angle for the LCoS SLM is 2.2°. Experimental setups for optical reconstructions, for the reflection and transmission modes, are shown in Fig. 15.5. A Star Target is used as the object for computer simulations. This particular pattern is chosen to test the resolution of the overall process. Holograms of this object are calculated for a reconstruction distance of 800 mm. The R-S in-line hologram of the Star Target object given in Fig. 15.6 is shown in Fig. 15.7. In-line and off-axis Fresnel holograms of the same object are calculated using (15.6) and are shown in Figs. 15.8 and 15.9, respectively.
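The quoted maximum diffraction angles are consistent with the grating equation evaluated at the finest displayable grating period of two pixels, sin θ = λ/(2Δ), where Δ is the pixel pitch. A quick check in Python reproduces the values given above:

```python
# Maximum diffraction angle of a pixelated SLM from the grating equation,
# with the Nyquist-limited grating period of two pixels.
import math

def max_diffraction_angle_deg(wavelength, pitch):
    return math.degrees(math.asin(wavelength / (2 * pitch)))

print(max_diffraction_angle_deg(635e-9, 12.1e-6))   # LC SLM:   ~1.5 degrees
print(max_diffraction_angle_deg(632.8e-9, 8.1e-6))  # LCoS SLM: ~2.2 degrees
```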
Fig. 15.5. Assembled experimental setups. L denotes the laser; L1 and L2 are the collimating lenses; A shows the pinhole; P1 is the polarizer, and P2 is the analyzer
Reconstructed images from the Fresnel holograms (Figs. 15.8 and 15.9), obtained by computer simulation, are shown in Figs. 15.10 and 15.11. The diffracted field of the object at 800 mm distance is about two times the SLM size in both the x and y directions. The reconstructed image from the Fresnel in-line hologram (Fig. 15.10) is corrupted by the twin image, edge effects, and periodicity. The implied periodicity of the original pattern is a result of using the DFT algorithm in the computation of the convolution between the field U(x, y) and the kernel K(x, y, z). As seen from Fig. 15.11, the off-axis case overcomes the twin-image corruption, as expected. Optically reconstructed images, obtained with a red (635 nm) laser diode, are captured by a CCD camera (JenOptik AG type 11 MP CCD); the results are shown in Figs. 15.12–15.15. It is observed that the simulated (Figs. 15.10 and 15.11) and optically reconstructed (Figs. 15.12–15.15) Star Target images are well matched. A magnified portion of the reconstructed Star Target image from the off-axis hologram is shown in Fig. 15.12. The rays of the star are well reconstructed. The zeroth-order (directly transmitted) beam and the reconstructed image in the plus first order from the off-axis hologram are shown in Fig. 15.14. The image is
Fig. 15.6. Star Target object (First published by Springer [12])
Fig. 15.7. R-S in-line hologram of the Star Target object shown in Fig. 15.6
situated diagonally with respect to the zeroth order. The minus first order is the virtual image on the other side of the diagonal. The qualities of the images reconstructed by the LC SLM from the in-line R-S (Fig. 15.15) and Fresnel (Fig. 15.13) holograms are visually similar. In the series of reconstructed images there is a clearly perceivable difference between the in-line and off-axis image qualities, the off-axis being superior. However, the angle between the object and reference beams has to be carefully chosen, so that the reconstructed object pattern does not overlap with the other diffracted orders of the SLM. It must be taken into account that this
Fig. 15.8. Fresnel in-line hologram of the Star Target object shown in Fig. 15.6
Fig. 15.9. Fresnel off-axis hologram of the Star Target object shown in Fig. 15.6 (First published by Springer [12])
Fig. 15.10. Computer reconstructed image from the Fresnel in-line hologram shown in Fig. 15.8
Fig. 15.11. Computer reconstructed image from the off-axis hologram shown in Fig. 15.9 (First published by Springer [12])
Fig. 15.12. Optically reconstructed and magnified image from the off-axis hologram shown in Fig. 15.9 by using a red laser (First published by Springer [12])
angle is restricted by the pixel size. The optimal angle for the LC SLM used is 0.76° when the reconstruction distance from the SLM is 800 mm. Figure 15.16 illustrates an example where the chosen angle of the reference beam is 0.55°, which is less than optimal, and thus the reconstructed object overlaps with the zeroth order. It is also possible for the real and virtual images of the neighboring SLM orders to overlap; this occurs when the angle is larger than optimal. It is worth mentioning that good results are obtained with in-line holograms when the object is much smaller than the SLM size and has white letters over a black background. Such an example is shown in Figs. 15.17–15.19, where the object is in Fig. 15.17, the computed hologram
Fig. 15.13. Optically reconstructed image from the Fresnel in-line hologram shown in Fig. 15.8
Fig. 15.14. Optically reconstructed image from the Fresnel off-axis hologram shown in Fig. 15.9
is in Fig. 15.18 and the optically reconstructed image is in Fig. 15.19. To reduce the distortions caused by the first diffraction order and the DC term, the reconstructed image at the second diffraction order is taken. As already mentioned, a good and correct reconstruction can be achieved if the parameters of the SLM and the computed hologram match. However, it is highly desirable to compute the holograms in a generic fashion without considering the eventual physical SLM parameters; reconstructions using different SLMs from the same hologram will provide flexibility. In an attempt to experimentally check the possible degradation in quality when different SLMs are used during the reconstruction, we computed a hologram for a
Fig. 15.15. Optically reconstructed image from R-S in-line hologram shown in Fig. 15.7
Fig. 15.16. Optically reconstructed image when the reference beam angle is less than optimal
Fig. 15.17. “3DTV” Object
Fig. 15.18. In-line R-S hologram of Fig. 15.17
Fig. 15.19. Reconstructed image from Fig. 15.18
given SLM, but used a different SLM in addition to the original one for the reconstruction. An in-line hologram (Fig. 15.21) of the artificially created "sine-wave" object (Fig. 15.20) was computed using the bipolar method (15.7) for the LCoS SLM. An off-axis hologram of the Star Target object, calculated by the Fresnel off-axis method (15.6) for the LC SLM, is shown in Fig. 15.9. Then, optical reconstructions using both LCoS and LC SLMs are conducted. The obtained results are shown in Figs. 15.22–15.24.
Fig. 15.20. The artificially generated “sine-wave” object
Fig. 15.21. Computed hologram of the object shown in Fig. 15.20 created by (15.7) for a LCoS SLM
It is observed that the reconstruction quality is satisfactory even if an SLM other than the one intended during the computation is used for the reconstruction. Up to now, the presented experiments have been based on monochromatic wave propagation. As the next step, we evaluate the performance of the Fresnel hologram computation algorithm for colour holograms. Images of the off-axis colour Fresnel hologram (Fig. 15.26) of the 3DTV Project Logo (Fig. 15.25)
Fig. 15.22. A portion of the optically reconstructed image by the LCoS SLM of the “sine-wave” object shown in Fig. 15.20 from the hologram shown in Fig. 15.21
Fig. 15.23. Optically reconstructed image by the LC SLM of the “sine-wave” object shown in Fig. 15.20 from the hologram shown in Fig. 15.21. The bright rectangle obstructing the image is the zeroth order (First published by Springer [12])
Fig. 15.24. Optically reconstructed image from the off-axis hologram shown in Fig. 15.9 by the LCoS SLM (First published by Springer [12])
Fig. 15.25. 3DTV Project Logo as a colour object
Fig. 15.26. Colour hologram of the object in Fig. 15.25 (First published by Springer [12])
are reconstructed using the LC SLM, whose resolution is 1280 × 720 pixels. Optical and computer-simulation-based reconstructions are illustrated in Figs. 15.27 and 15.28, respectively. The colour images cannot be presented here because of the black-and-white printing process. The object (Fig. 15.25) is split into its R, G, B colour components. The R-component is shown in Fig. 15.29 in gray scale. Separate holograms are then computed for each component using (15.6). The hologram for the R-component is shown in Fig. 15.30. The colour hologram in Fig. 15.26 is generated by numerically superposing the calculated holograms for the R, G and B components, using their respective colours during the superposition. The reconstruction from the hologram corresponding to the R component by the LC SLM using a red laser is shown
Fig. 15.27. 3DTV Project Logo obtained by superposition of the optically reconstructed RGB components of Fig. 15.26 (First published by Springer [12])
Fig. 15.28. Numerically reconstructed 3DTV Logo from Fig. 15.26 (First published by Springer [12])
in Fig. 15.31. The image is captured by a digital camera. Similar SLM reconstructions are carried out for the G and B holograms, and each reconstruction is captured by the digital camera. The captured images for each colour component are then combined numerically to yield the colour image in Fig. 15.27. Numerical reconstructions, instead of optical SLM reconstructions, are also carried out for comparison, and the result is shown in Fig. 15.28. Various papers on colour CGH can be found in the literature [35, 36, 37, 38, 39]. Consequently, we can say that LC SLMs can be used to reconstruct colour holograms.
Fig. 15.29. Red-component of the colour object shown in Fig. 15.25
Fig. 15.30. Computed hologram of the Red component
Fig. 15.31. Optically reconstructed Red-component of the hologram given in Fig. 15.30
15.5 Conclusion

SLMs are promising devices for dynamic holographic displays. The quality of reconstructed images using the currently available SLMs is promising, but not yet satisfactory. An SLM pixel size of about 0.4–0.6 μm is needed to write good quality holograms. Nowadays, commercially available SLMs which could be used as media for holograms have a pixel size of 7–8 μm, and the number of pixels can go up to 3840 × 2048. Pixel size depends on photolithographic and microelectronic technologies. Thus, about a tenfold real or virtual improvement in pixel size is needed.
The algorithms used for CGH are also important. They must be efficient and fast enough to work in real time. Moreover, the resolution of the reconstructed 3D scene has to be equivalent to, or better than, that of commonly used 2D displays. At the moment, the algorithms used to compute holograms are efficient for real-time processing only of quite low resolution objects [13, 21, 40]. The algorithms based on the Fresnel-Kirchhoff diffraction formula and the Rayleigh-Sommerfeld diffraction integral provide similar reconstructed patterns when the distance along the optical axis is around 0.8 m. The holograms are calculated with a resolution of 1280 × 720 pixels, for an object with the same resolution, using a 3.6 GHz personal computer. The achieved computing speed of 3.25 × 10⁻⁵ s/point is better than the published results (about 10⁻⁴ s/point) for several algorithms [21, 31, 40]. Naturally, the computational complexity for colour holography is three times higher than for a monochrome holographic display. A match between the parameters of the SLM used during the reconstruction and the computed hologram is desirable for better quality. However, the conducted experiments show that reconstructions using different SLMs can be satisfactory, too. SLMs thus have the potential to be used for colour holographic displays as well.
Acknowledgements

The authors thank Stefan Osten, Holoeye Photonics AG, Berlin, Germany for providing the amplitude and phase modulation characteristics of Holoeye SLMs. This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. Onural L., Bozdagi G., Atalar A., New high-resolution display device for holographic three-dimensional video: principles and simulations, Optical Engineering, Vol. 33, No. 3, 835–844, 1994.
2. Fukaya N., Maeno K., Sato K., Honda T., Improved electroholographic display using liquid crystal devices to shorten the viewing distance with both-eye observation, Optical Engineering, Vol. 35, No. 6, 1545–1549, 1996.
3. Huignard J. P., Spatial light modulators and their applications, J. Optics, Vol. 18, No. 4, 181–186, 1987.
4. Bleha W. P., Sterling R. D., D-ILA™ Technology for high resolution projection displays, JVC ILA Technology Group, 20984 Bake Parkway, Suite 102, Lake Forest CA 92630 USA.
5. Lucente M., Computational holographic bandwidth compression, IBM Systems Journal, Vol. 35, No. 3/4, 349–365, 1996.
6. Yang H. and Lu M., Nematic LC modes and LC phase gratings for reflective spatial light modulators, IBM Journal of Research and Development, Vol. 42, No. 3/4, 401–410, 1998.
7. Mishina T., Okui M., Okano F., Viewing-zone enlargement method for sampled hologram that uses high-order diffraction, Applied Optics, Vol. 41, No. 8, 1489–1499, 2002.
8. Mishina T., Okano F., Yuyama I., Time-alternating method based on single-sideband holography with half-zone-plate processing for the enlargement of viewing zones, Applied Optics, Vol. 38, No. 17, 3703–3713, 1999.
9. Born M., Wolf E., "Principles of Optics", Pergamon Press, New York, 4th ed., 1970.
10. Cameron C. D., Pain D. A., Stanley M., Slinger C. W., Computational challenges of emerging novel true 3D holographic displays, in Critical Technologies for the Future of Computing, S. Bains, L. J. Irakliotis, eds., Proceedings of SPIE, Vol. 4109, 129–140, 2000.
11. Abookasis D., Rosen J., Three types of computer-generated holograms synthesized from multiple angular viewpoints of a three-dimensional scene, Applied Optics, Vol. 45, No. 25, 6533–6538, 2006.
12. Kovachev M., Ilieva R., Onural L., Esmer G. B., Reyhan T., Benzie P., Watson J., Mitev E., "Reconstruction of Computer Generated Holograms by Spatial Light Modulators", Proceedings IW MRCS 2006, Istanbul, Turkey, LNCS 4105, 706–713, 2006.
13. Fukushima S., Kurokawa T., Ohno M., Real-time hologram construction and reconstruction using a high-resolution spatial light modulator, Applied Physics Letters, Vol. 58, 787–789, 1991.
14. Lucke R. L., Rayleigh-Sommerfeld diffraction and Poisson's spot, European Journal of Physics, Vol. 27, 193–204, 2006.
15. Amuasi H., The Mathematics of Holography, Essays Towards the AIMS Postgraduate Diploma (2003/2004), African Institute for Mathematical Sciences, 2004. http://www.aims.ac.za/resources/archive/2003/henryessay2.0.pdf
16. Goodman J. W., "Introduction to Fourier Optics", Roberts & Company Publishers, U.S., 3rd edition, 2004.
17. Asundi A., Singh V. R., Sectioning of amplitude images in digital holography, Measurement Science and Technology, Vol. 17, 75–78, 2006.
18. Janda M., Hanak I., Skala V., Digital HPO Hologram Rendering Pipeline, EUROGRAPHICS short papers conference, Proceedings, 81–84, 2006.
19. Ito T., Okano K., Color electroholography by three colored reference lights simultaneously incident upon one hologram panel, Optics Express, Vol. 12, No. 18, 4320–4325, 2004.
20. Ito T., Holographic reconstruction with a 10-μm pixel-pitch reflective liquid-crystal display by use of a light-emitting diode reference light, Optics Letters, Vol. 27, No. 16, 1406–1408, 2002.
21. Lucente M., Optimization of hologram computation for real-time display, SPIE Proceedings, "Practical Holography VI", Vol. 1667, 32–43, 1992.
22. Liebling M., Unser M., Autofocus for digital Fresnel holograms by use of a Fresnelet-sparsity criterion, Journal of the Optical Society of America A, Vol. 21, No. 12, 2424–2430, 2004.
23. Choi K., Kim H., Lee B., Synthetic phase holograms for auto-stereoscopic image displays using a modified IFTA, Optics Express, Vol. 12, No. 11, 2454–2461, 2004.
15 Holographic 3DTV Displays Using Spatial Light Modulators
555
24. Leibling M., Thierry Blu, Unser M., Complex-wave retrieval from a single offaxis hologram, Journal Optical Society of America A, Vol. 21, No. 3, 367–377, 2004. 25. Plesniak W., Incremental update of computer-generated holograms, Optical Engineering, Vol. 42, No. 6, 1560–1571, 2003. 26. Mezouari S., Harvey A. R., Validity of Fresnel and Fraunhofer approximations in scalar diffraction, Journal of Optics A: Pure Applied Optics, Vol. 5, S86–S91, 2003. 27. Grilli S., Ferraro P., De Nicola S., Finizio A., Pierattini G., Meucci R., Whole optical wavefield reconstruction by digital holography, Optics Express, Vol. 9, No. 6, 294–302, 2001. 28. Kajiki Y., Ueda H., Tanaka K., Okamoto H., Shimizu E., Cylindrical large computer-generated holograms and hidden-point removal process, Proceedings of SPIE, Vol. 2652, 29–35, 1996. 29. Sherman G. C., Application of the convolution theorem to Rayleigh’s integral formulas, Journal Optical Society of America, Vol. 57, 546–547, 1967. 30. Lalor E., Conditions for the validity of the angular spectrum of plane waves, Journal Optical Society of America, Vol. 58, 1235–1237, 1968. 31. Ahrenberg L., Benzie P., Magnor M., Watson J., Computer generated holography using parallel commodity graphics hardware, Optics Express, Vol. 14, 7636–7641, 2006. 32. Kries T., Hologram reconstruction using a digital micromirror device, Optical Engineering, Vol. 40, No. 6, 926–933, 2001. 33. Schnars U., Juptner W., Direct recording of holograms by a CCD target and numerical reconstructions, Applied Optics, Vol. 33, No. 2, 179–181, 1994. 34. Vdovin G., LightPipes: beam propagation toolbox, OKO Technologies, The Netherlands, 1999. 35. Shimobaba T., Ito T., A color holographic reconstruction system by time division multiplexing with reference lights of laser, Optical Review, Vol. 10, No. 5, 339–341, 2003. 36. Choi K., Kim H., Lee B., Full-color autostereoscopic 3D display system using color-dispersion-compensated synthetic phase holograms, Optics Express, Vol. 12, No. 21, 5229–5236, 2004. 37. Suh H. H., Color-image generation by use of binary-phase holograms, Optics Letters, Vol. 24, No. 10, 661–663, 1999. 38. Sando Y., Itoh M., Yatagai T., Full-color computer-generated holograms using 3-D Fourier spectra, Optics Express, Vol. 12, No. 25, 6246–6251, 2004. 39. Sando Y., Itoh M., Yatagai T., Color computer-generated holograms from projection images, OSA, Optics Express, Vol. 12, No. 11, 2487–2493, 2004. 40. Munjuluri B., Huebschman M., Garner H., Rapid hologram updates for realtime volumetric information display, Applied Optics, Vol. 44, No. 24, 5076–5085, 2005. 41. Wernicke G., Krueger S., Kamps J., Gruber H., Demoli N., Duer M., Teiwes S., Application of a liquid crystal display spatial light modulator system as dynamic diffractive element and in optical image processing, Journal of Optical Communications, Vol. 25, No. 4, 141–148, 2004. 42. HoloEye Spatial Light Modulators. http://www.holoeye.com/spatial light modulators-technology.htm 43. Onural L., Exact analysis of the effects of sampling of the scalar diffraction field, Journal Optical Society of America A, Vol. 24, No. 2, 359–367, 2007.
16 Materials for Holographic 3DTV Display Applications

Kostadin Stoyanov Beev, Kristina Nikolaeva Beeva and Simeon Hristov Sainov

Central Laboratory of Optical Storage and Processing of Information, Bulgarian Academy of Sciences, Acad. G. Bonchev str., bl. 109, 1113 Sofia, Bulgaria
16.1 Introduction

Realization of a dynamic holographic 3D display – in which the 3D scene is encoded in terms of optical diffraction, transformed into a fringe pattern, converted into a signal for a spatial light modulator (SLM) and displayed in real time – is an extremely challenging enterprise [1]. Although 3D imaging systems based on stereoscopic and autostereoscopic displays are currently available [2, 3], the creation of a holographic 3D display is of great interest, since it is the only way to achieve transmission of "true" 3D images. The reason is the inherent property of holography to reconstruct a wavefront identical to the one emanating from the object. As a consequence, it is possible to reproduce the original 3D scene with all of the depth cues and with a resolution that would allow a truly realistic 3D image.

Efforts to create a holographic display have been directed at three different SLM technologies: acousto-optic modulators (AOMs), liquid crystal displays (LCDs) and digital micromirror devices (DMDs). The AOM-based SLM was the first to be described in the scientific literature [4]. Among the main drawbacks of that system are the necessity of converting the computer-generated hologram into an analog signal before applying it to the AOM, and the need for moving optical parts. The next step, driven by the fast development of liquid crystals, was to use LCDs [5]. The first systems were monochromatic, with a small viewing zone (only 3°) and a slow refresh rate (∼7 Hz). Recently, several technical solutions for obtaining color displays based on LCDs have been reported [6]; however, the image quality and refresh rate do not yet satisfy the requirements of a real color dynamic display. The Texas Instruments Digital Mirror Device (DMD) [7] possesses several key advantages – high light modulation efficiency (∼65%), high contrast ratio (∼1000:1) and the ability to operate over a wide spectral range. The main drawback of display systems using DMDs arises from the multiple Fraunhofer diffraction orders that occur under coherent illumination, due to the grating profile of the DMD.
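The origin of these spurious orders can be estimated from the grating equation: a periodic micromirror array of pitch p produces Fraunhofer orders at sin θ_m = mλ/p under normal-incidence coherent illumination. The following sketch is purely illustrative – the mirror pitch and wavelength are assumed values, not taken from the chapter:

    import math

    # Illustrative estimate: a micromirror array acts as a 2D grating of
    # period equal to the pixel pitch p, so coherent illumination produces
    # Fraunhofer orders at sin(theta_m) = m * lambda / p.
    wavelength = 633e-9   # He-Ne laser wavelength, m (assumed)
    pitch = 13.7e-6       # DMD mirror pitch, m (assumed, device-dependent)

    for m in range(0, 5):
        s = m * wavelength / pitch
        if s <= 1.0:
            print(f"order {m:+d}: {math.degrees(math.asin(s)):6.3f} deg")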
Taking into account the current state of the art in the area, it can be concluded that currently available commercial SLMs scarcely satisfy the demands of holographic display systems. This conclusion is supported by Holographic Imaging LLC [8], a joint venture formed by Qinetiq and Ford Motor Co to apply digital holography in industry. Qinetiq [9] has proposed a way to overcome the problem by combining the high frame rate of electrically addressed spatial light modulators (EASLMs) with the high resolution of optically addressable spatial light modulators (OASLMs). In other words, the system benefits from the wide temporal bandwidth of EASLMs and the wide spatial bandwidth of OASLMs. The working principle is to divide the computer-generated hologram into segments that are displayed sequentially on the EASLM and then projected onto the OASLM using suitable optics. A suitable technical solution is to employ an array of switchable lenses. Here one can benefit from the development of holographic optical elements, which have certain advantages over conventional optics. Since they are lightweight, stackable and conformable to different shapes, they can create complex optical systems in compact configurations unattainable with conventional optics. Holographic optical elements can also be multifunctional, combining several functions such as focusing, deflecting and filtering within a single element [10]. Moreover, the diffractive structure can itself be multiplexed, with several independent elements sharing the same volume. In waveguide applications they can serve as input-output couplers, beam splitters, deflectors, expanders, filters and attenuators. As a consequence of these extremely attractive features, holographic optical elements have found wide application in contemporary optics, in areas ranging from high-density optical storage and the super-resolution problem to display applications.

A key element for non-holographic autostereoscopic 3DTV is a holographic screen forming multiple viewing zones. It requires media with high diffraction efficiency and high signal-to-noise ratio for permanent holographic recording; the best candidates for this application still seem to be silver halide emulsions (see below). The possibility of controlling the properties of the hologram in real time broadens the described applications. A crucial point in realizing a holographic display system is the proper choice of materials, both for the switchable transmittance optics (in the Qinetiq concept) and for the final reversible display element. In conclusion, we should consider a wider range of dynamic materials, including not only reversible but also switchable media. Fortunately, the area of holographic recording materials has developed significantly in recent years, especially in the creation of holographic optical elements. In fact, the elaboration and improvement of photosensitive materials has been an object of interest since before the invention of holography; mostly, such investigations were connected with the photographic recording process. In general, the available photosensitive media are divided into two groups – silver-containing materials (with or without binder compounds) and silver-free media. The first group has found large-scale application in practice.
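The space-bandwidth bookkeeping behind this EASLM-to-OASLM tiling can be made concrete with a toy calculation. All numbers below are illustrative assumptions, not specifications of the Qinetiq system:

    # Back-of-envelope bookkeeping for the segment-tiling scheme sketched
    # above; every number is an assumption chosen only for illustration.
    easlm_pixels  = 1920 * 1080   # one EASLM frame (assumed)
    easlm_rate_hz = 2400          # fast binary EASLM frame rate (assumed)
    tiles         = 25            # segments optically tiled onto the OASLM (assumed)

    oaslm_pixels = easlm_pixels * tiles     # spatial bandwidth after tiling
    refresh_hz   = easlm_rate_hz / tiles    # temporal rate seen at the OASLM

    print(f"composite hologram: {oaslm_pixels/1e6:.1f} Mpixels "
          f"refreshed at {refresh_hz:.0f} Hz")

The trade is explicit: the OASLM accumulates spatial bandwidth at the cost of dividing the EASLM's temporal bandwidth among the tiles.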
By definition, photosensitive media are materials that experience physical and chemical changes under light illumination. The extent of these photomodifications follows the distribution and intensity of the light field. After post-exposure processing (or, in some cases, without any), these modifications lead to changes in the optical properties of the material, which form the recorded image. The recording method introduced by Dennis Gabor in 1948, which he called wavefront reconstruction, raises the requirements on photosensitive materials. By recording the interference pattern between the light scattered from the object and a reference wave, information on both the phase and the amplitude of the object wave is stored, even though the recording medium detects only the intensity distribution. The second step of this method – recognized as holography soon after Gabor's invention – is also well known: reconstruction of the recorded image with a "copy" of the reference wave, by diffraction from the recorded interference pattern. Thus the object wave is reconstructed. Notably, the holographic technique has given rise to the development of several areas of contemporary optics. Perhaps one of the most attractive applications has been art holography, which extends the possibilities of photography to the recording of three-dimensional images. Nowadays the holographic method finds application in areas such as interferometry [11, 12], high-density optical storage [13], diffractive optics [10] and display technology [14], including three-dimensional displays [1].

As mentioned above, employing the holographic technique increases the requirements on the recording media. This is a consequence of recording an interference pattern, for which the required spatial resolution is considerably higher than that needed in photography. There are also other specifics of holographic media, presented in the next part of the chapter. Although silver halide emulsions satisfy excellently the requirements for permanent holographic storage, the chemical development process impedes a range of applications. Along with the development of diffractive and holographic optics technology, there is also a need to develop and improve media for switchable and reversible storage. In the present chapter, the requirements on holographic materials, the development of non-silver recording media based on polymers, liquid crystals, inorganic crystals and composites, and the current trends in the development of dynamic holographic recording materials are discussed in turn. The part "Non-silver holographic recording media" to a certain extent follows the chronological sequence of non-silver development. In the subsequent part, the examined polymer, crystal and liquid crystal materials are evaluated from the point of view of applicability in 3D display systems. A table with some holographic characteristics of the different recording materials is also presented.
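Gabor's two-step principle can be sketched numerically. In the minimal 1D simulation below (arbitrary illustrative parameters, not from the chapter), the intensity |O + R|² is "recorded" and the hologram is then re-illuminated with the reference R; one of the resulting terms is proportional to the original object wave O:

    import numpy as np

    # Minimal 1D sketch of wavefront reconstruction: record |O + R|^2,
    # then re-illuminate with R. The product H*R contains the term |R|^2 * O,
    # i.e. a replica of the object wave.
    N = 1024
    x = np.linspace(-1e-3, 1e-3, N)               # aperture coordinate, m

    O = 0.5 * np.exp(1j * 2 * np.pi * x * 2e4)    # object wave: tilted plane wave
    R = np.ones(N, dtype=complex)                 # reference wave: on-axis plane wave

    H = np.abs(O + R) ** 2                        # recorded intensity (hologram)
    U = H * R                                     # reconstruction step

    # correlate U with O to confirm the object term is present
    c = np.vdot(O, U) / (np.linalg.norm(O) * np.linalg.norm(U))
    print(f"overlap of reconstruction with object wave: {abs(c):.2f}")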
16.2 Characteristics of Holographic Recording Media and Basic Requirements

To record a hologram, a suitable modification of the material properties under light illumination is required. To examine this process, consider the complex amplitude of a wave passing through a medium of thickness T:

$$A' = A\exp(-i\beta T)\exp(-\alpha T) = A\exp\left(-i\,\frac{2\pi n}{\lambda_a}\,T\right)\exp(-\alpha T),$$

where β = 2πn/λ_a is the propagation constant, α the amplitude absorption coefficient, n the refractive index and λ_a the recording wavelength in air. In a proper medium for holographic recording, after exposure and eventual development, one of the parameters α, n or T should change. Depending on which characteristic is modulated, the materials are distinguished as:
• Media with amplitude modulation (absorbing materials) – α is modulated;
• Media with phase modulation – light-induced changes of n or T.
Most recording materials exhibit a combined amplitude-phase modulation. It is essential to note that the elaboration and application of a particular holographic material depend on its concrete usage. Nevertheless, some basic characteristics of the holographic properties are common to all media.

1. Spatial resolution

The spatial resolution represents the maximal number of fringes per unit length that the material is capable of registering separately. For the mass-produced and widely used silver halide emulsions, this characteristic is connected to the grain size and to scattering. The scattering is a consequence of the difference in refractive index between the gelatin and the silver halide (most often the bromide) – about 1.5 and 2.25, respectively. Holographic silver halide emulsions are fine-grained: while the grain size in a conventional (photographic) emulsion is above 1 μm, it is less than 0.08 μm for holographic silver halides. The spatial resolution is measured in lines/mm. Typical values for the most widely employed holographic recording media, produced by Kodak and Agfa, are on the order of 2500–5000 lines/mm, with grain sizes under 60 nm. The Central Laboratory of Optical Storage and Processing of Information of the Bulgarian Academy of Sciences has developed an ultra-fine-grained silver halide emulsion for high-quality holographic recording, with grain size ∼10 nm and spatial resolution up to 10,000 lines/mm.
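The resolution requirement follows directly from the recording geometry: two plane waves intersecting at full angle θ produce fringes of spacing Λ = λ/(2 sin(θ/2)), and the material must resolve 1/Λ. A short calculation (with an assumed recording wavelength) shows why holographic emulsions need thousands of lines/mm:

    import math

    # Fringe spacing Lambda = lambda / (2 sin(theta/2)) for two plane waves
    # meeting at full angle theta; required resolution is 1/Lambda.
    lam_nm = 532.0                          # recording wavelength, nm (assumed)
    for theta_deg in (10, 30, 60, 180):     # 180 deg ~ counter-propagating beams
        spacing_nm = lam_nm / (2 * math.sin(math.radians(theta_deg) / 2))
        print(f"theta = {theta_deg:3d} deg -> {1e6/spacing_nm:7.0f} lines/mm")

For a reflection hologram (counter-propagating beams) this gives close to 4000 lines/mm at 532 nm, consistent with the resolution figures quoted above.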
2. Diffraction efficiency

The diffraction efficiency η is the ratio of the diffracted (I₁) and reconstructing (I₂) beam intensities:

$$\eta = \frac{I_1}{I_2}.$$
Depending on the type of holographic recording (thin or thick, reflection or transmission, amplitude or phase), the maximal theoretical values of the diffraction efficiency vary from 3.7% to 100% [15]. In practice, it is essential to account for the dependence of the diffraction efficiency on the contrast V of the interference pattern, the exposure H and the carrier frequency f:

$$\eta = F(H, V, f), \qquad V = \frac{I_{max} - I_{min}}{I_{max} + I_{min}},$$

where I_max and I_min are the maximal and minimal intensities in the interference pattern, respectively, and the exposure is H = I·t, with I the light intensity and t the exposure time. For correct wavefront reconstruction it is necessary to work in the linear part of the recording material's response, i.e. in the linear part of the τ–H (transmittance-exposure) function, where τ is the modulated complex transmittance of the material for a given value of V. This linear part of the curve determines the dynamic range of the material:

$$M = 10\log\frac{V_{max}}{V_{min}},$$

where V_max is determined by the nonlinear distortion coefficient and V_min by the noise in the holographic image.

3. Modulation transfer function

The modulation transfer function (MTF) is the dependence of the diffraction efficiency on the spatial frequency of the holographic recording. For high-quality holographic recording, the MTF of the recording material should be almost independent of spatial frequency.

4. Sensitivity

The sensitivity S is determined by the exposure required to achieve a given diffraction efficiency at given values of V and f [15]:

$$S = \frac{\sqrt{\eta}}{H\,T}.$$

5. Noise

Noise is the undesired light flux diffracted or scattered into the direction of the reconstructed wave. The noise can have various origins, connected both with the holographic recording and with the reconstruction process. Its level is quantified by the signal-to-noise ratio (S/N). This characteristic is critical for the information capacity of the holographic recording, as well as for the quality of the reconstructed image.
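A small worked example of the quantities just defined, with illustrative beam intensities and exposure time (assumed values):

    # Fringe visibility V and exposure H for two recording beams of
    # intensities I1 and I2 (assumed illustration values).
    I1, I2 = 0.9, 1.1                       # beam intensities, mW/cm^2
    Imax = I1 + I2 + 2 * (I1 * I2) ** 0.5   # constructive interference
    Imin = I1 + I2 - 2 * (I1 * I2) ** 0.5   # destructive interference
    V = (Imax - Imin) / (Imax + Imin)       # fringe contrast

    t_exp = 10.0                            # exposure time, s
    H = (I1 + I2) * t_exp                   # exposure H = I * t, mJ/cm^2

    print(f"visibility V = {V:.3f}, exposure H = {H:.0f} mJ/cm^2")

Note that nearly equal beam intensities give V close to 1, which maximizes the recorded modulation within the linear range.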
16.3 Non-silver Holographic Recording Media

After extensive investigation, silver halides nowadays exhibit excellent holographic properties. They are sensitized in the spectral range below 1000 nm [16], which also allows multi-color holographic recording. Sensitivity values >1100 cm/J have been obtained. This characteristic depends strongly on the grain size (a larger grain size leads to higher sensitivity) and is inversely related to the spatial resolution and the noise level. The highest values of spatial resolution reach 10,000 lines/mm [17]. The refractive index modulation is ∼0.02. Another excellent feature is the perfect stability of the recorded holograms [18]. The holograms exhibit both amplitude and phase modulation. An additional possibility for obtaining diffraction efficiencies up to 100% and pure phase recording is to bleach the holograms, but the "price" is a significant increase in the noise level. Dichromated gelatins are recording materials that provide phase modulation and high diffraction efficiency without the bleaching process and its high noise levels; however, they too are media for permanent holographic recording and also require "wet" post-processing. Nevertheless, as mentioned in the introduction, the current state of the art in diffractive and holographic optics requires materials without the "wet" processing present in the recording process of silver halide emulsions (and of dichromated gelatin). A number of contemporary applications, including the 3D display, require dynamic media. As a consequence, there is an increasing need to substitute silver halide emulsions. This has resulted in tremendous efforts in the development of holographic recording materials, ranging from crystals such as doped LiNbO3, LiTaO3, KNbO3 and Sn2P2S6 to different types of thermoplastics, photopolymers, azobenzene polyester films, liquid crystals and composite materials [15, 19].

16.3.1 Polymeric Recording Materials

The first approach to substituting silver halide emulsions is to use polymeric recording media. Light-sensitive polymers have been used in practice for almost two centuries. From the middle of the nineteenth century, the so-called photoresists found application in polygraphy; printed plates for electronic devices have been produced with this technology since 1940, and the first integrated circuit was created around 1960. This initiated the rapid growth of microelectronics, as well as of one of its foundations – photolithography. Nowadays holographic technology is one of the most promising areas of photopolymer application. Polymeric layers are extremely attractive for forming diffractive optical elements of various shapes and functions.

The photoinduced changes in polymeric recording media can be a consequence of various photochemical reactions of electron-excited molecules and of the subsequent physical and chemical processes. These conversions are limited to the illuminated areas of the material, and their degree depends on the
activating light intensity; the rate of these changes determines the modulation of the optical parameters. Usually one or two molecules take part in the primary reactions. In the simplest case the reaction is monomolecular, and the quantum efficiency Φ is introduced to describe the process: it represents the ratio of the number of reacted molecules to the number of absorbed photons. The change in the concentration c of the photosensitive molecules in an optically thin layer is described by the differential equation [20]:

$$dc = -I_0\,\sigma\Phi\,c\,dt,$$

where I₀ is the light intensity. The concentration decreases exponentially during the exposure:

$$c = c_0\exp(-I_0\sigma\Phi t) = c_0\exp(-\sigma\Phi H),$$

where H = I₀t is the exposure, measured in number of photons per unit area of the layer over the exposure time (also called the quantum exposure). If the exposure is instead measured in radiation power per unit area (W/cm²), the equation takes the form

$$c = c_0\exp\left(-\frac{I_0}{h\nu}\,\sigma\Phi t\right) = c_0\exp\left(-\frac{1}{h\nu}\,\sigma\Phi H\right),$$

where ν is the frequency of the activating radiation. The kinetics described by the last two equations is valid for monomolecular reactions. However, it is also applicable to bimolecular reactions when the excited molecule interacts with surrounding particles situated in its immediate vicinity; such a reaction is called pseudo-monomolecular and takes place when the concentration of the second reagent is significantly higher than that of the photosensitive molecules.

The primary photochemical reaction changes the qualitative composition of the material, thereby influencing its optical properties. Their alteration is a consequence both of the polymer matrix modifications and of subsequent dark processes such as diffusion of unreacted species, various relaxation processes, etc. The different types of photo-modification are:
• Crosslinking;
• Photodestruction;
• Photoisomerization;
• Photopolymerization.
Crosslinking is the formation of intermolecular bonds, rendering the illuminated areas of the polymer insoluble. In photodestruction, the length and weight of the polymer chains decrease, which increases the solubility. Photoisomerization is usually accomplished through the cis-trans conformational change of azo-derivative compounds. The primary photochemical reaction can
initiate polymer formation from low-weight compounds – monomers. In this case the primary photomodification is accomplished by the molecules of the photoinitiator, which form chemically reactive particles with unpaired electrons – free radicals. The interaction of a radical with a monomer molecule initiates chain polymerization reactions, creating molecules of hundreds and thousands of units.

Several types of polymeric recording media are distinguished according to their chemical mechanism, among them the so-called photoresists, photochromic azopolymers, anthracene-containing polymers and photopolymers. The photoresists change their solubility in certain solvents after illumination. Two types are recognized – positive and negative photoresists – depending on which area is dissolved after exposure, the illuminated or the dark one, respectively. The first type comprises polymers containing compounds that increase the solubility of the polymer molecules under light illumination; such media are the phenol-formaldehyde resins. Negative photoresists, under short-wavelength exposure, exhibit breaking of chemical double bonds followed by crosslinking, forming larger molecules that are insoluble in certain organic solvents such as xylene, benzine and others. Both types are used for holographic recording. The anthracene-containing polymers exhibit a photodimerization mechanism: derivative molecules in the excited state form pairs with molecules in the ground state. In consequence, both photochromism and photorefraction occur, due to changes in the absorption wavelength and in the molecular polarizability, respectively (see below).

A wide class of organic holographic recording media are the azo materials. The specific element of their structure is the azo-group, of which there can be one or more: two phenyl rings connected by a double nitrogen bond (−N=N−). Molecules with a high value of photoinduced anisotropy are obtained on the basis of azo-dyes. The azo-group exists in two configurations (Fig. 16.1) – trans and cis, of which the trans form is more stable. The trans-cis isomerization is accomplished under light illumination, while the reverse cis-trans transformation can also proceed by thermal relaxation (or optically, at another wavelength). The change in the absorption wavelength between the two isomers is referred to as photochromism.
Fig. 16.1. Trans-cis isomerization of azobenzene molecules: the −N=N− bond switches from trans to cis under illumination (hν), and back from cis to trans under illumination at another wavelength (hν′) or by thermal relaxation (kBT)
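The monomolecular kinetics introduced above can be evaluated directly. In the sketch below, the cross-section σ and quantum efficiency Φ are assumed illustration values, not measured data:

    import numpy as np

    # Decay of photosensitive-molecule concentration with quantum exposure:
    # c = c0 * exp(-sigma * Phi * H).
    sigma = 1e-17      # absorption cross-section, cm^2 (assumed)
    phi   = 0.5        # quantum efficiency (assumed)
    c0    = 1.0        # normalized initial concentration

    H = np.linspace(0, 5e17, 6)        # quantum exposure, photons/cm^2
    c = c0 * np.exp(-sigma * phi * H)
    for h, ci in zip(H, c):
        print(f"H = {h:.1e} photons/cm^2 -> c/c0 = {ci:.3f}")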
A large class of organic materials exhibits such changes. For the purposes of holographic recording, azobenzene liquid crystalline and amorphous polymers exhibiting photoisomerization and surface-relief creation mechanisms have been elaborated. In azobenzene polymers and liquid crystalline materials, the azo-group can be connected to the molecular chain, and the isomerization leads to a change in the refractive index [21]. There are also materials exhibiting an amorphous-liquid crystalline phase transition as a consequence of the cis-trans isomerization [22].

Photopolymers, in particular, are materials whose recording mechanism is a polymerization process. Photopolymers possess two indisputable advantages. On the one hand, due to the chain character of the reactions they have high quantum efficiency, resulting in high sensitivity. On the other hand, the activating radiation does not interact directly with the monomers but with added photoinitiator molecules. This allows the sensitivity spectrum to be shifted far from the initial monomer absorption peaks, in order to sensitize the material to a suitable laser wavelength.

Photopolymerization is a chemical process in which separate molecules (monomers) link together under light illumination, resulting in alteration of the mechanical and optical parameters of the medium: the volumetric refractive index and/or the thickness of the layer (surface-relief creation) are changed. Thus phase modulation occurs, and the whole process is "dry". Generally the recording proceeds by the following mechanism. Polymerization takes place during exposure in the bright regions of the interference pattern, at a rate proportional to the light intensity. A dark process follows, leading to uniform redistribution of the unreacted monomer over the layer. The last stage is fixing: the whole area is illuminated with uniform UV light. Higher densities, and hence a higher refractive index, are obtained in the areas where the initial polymerization took place along with the subsequent mass transfer. In many cases this process requires neither the dark stage nor post-fixing. Various systems with two monomers have been developed to realize one-step processing; such systems are the two-component materials described below. A polymer binder is also used to obtain single-step recording [23].

The photopolymerization process can proceed by three different mechanisms [24]. The chemical reaction can consist of double-bond opening by a cationic, anionic or free-radical mechanism. This type is referred to as addition or chain polymerization, owing to the fast process kinetics. In this case one of the double-bond ends becomes chemically active and can link covalently to another molecule; as a consequence, the double bond of that molecule becomes activated after the covalent reaction and reacts with another monomer. The process repeats itself until the reaction is terminated. Such monomers are derived from the acrylate, methacrylate or vinyl families. If multifunctional acrylates or vinyl monomers (containing more than one reactive group) are used, the resulting network is highly crosslinked.
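The exposure/dark-diffusion sequence described above can be caricatured with a one-dimensional numerical model: polymerization at a rate proportional to the local intensity, plus diffusion of unreacted monomer toward the depleted bright regions. All rate constants below are arbitrary illustration values:

    import numpy as np

    # Toy 1D model of photopolymer grating formation: local polymerization
    # proportional to intensity, plus monomer diffusion (grid-index units).
    N, steps = 256, 2000
    x = np.linspace(0, 2 * np.pi, N, endpoint=False)
    I = 1.0 + 0.9 * np.cos(x)           # normalized interference pattern
    m = np.ones(N)                      # unreacted monomer concentration
    p = np.zeros(N)                     # polymer concentration
    k_poly, D, dt = 0.002, 0.05, 0.01   # rate constant, diffusivity, time step

    for _ in range(steps):
        r = k_poly * I * m                  # local polymerization rate
        m -= r * dt
        p += r * dt
        # explicit diffusion of the remaining monomer
        m += D * dt * (np.roll(m, 1) - 2 * m + np.roll(m, -1))

    # density (hence refractive index) is highest where polymerization occurred
    print(f"polymer modulation depth: {(p.max() - p.min()) / p.mean():.2f}")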
Other types of monomers can react with only one other molecule, i.e. at least two functional groups are necessary in order to form a polymer network; to obtain crosslinking, multifunctional reactants are again required. In many cases the process involves the loss of a small molecule as a reaction product. The kinetics of this step polymerization is very different from that of addition polymerization: the molecular weight increases gradually during the process, in contrast to the very rapid chain growth of addition reactions. Such reactions take place, for example, in the formation of polyurethanes from diols and diisocyanates. Each alcohol can react with one isocyanate so as to link two molecules, until the remaining alcohol (or isocyanate) is exhausted.

The third type of polymerization chemistry involves ring-opening reactions. In this case one of the reactants has to contain a cyclic structure. Such reactions can lead to the formation of highly crosslinked networks. Common reactants contain epoxide groups – a three-membered ring with an oxygen atom as a member of the ring. The ring can be opened in the presence of nucleophilic groups such as thiol; simultaneously with the ring opening, the oxygen atom picks up hydrogen to form an alcohol.

The molecular conversions lead to changes in the density and the refractive index of the material; an analysis of the dependence of the refractive index modulation on the monomer-polymer conversion is presented below. In fact, all phototransformations exhibited by polymer materials (as well as their eventual post-processing) produce refractive index changes to some extent [13]. If this change Δn reaches values ∼10⁻⁴ or higher, the material is considered to possess photorefractive properties. To analyze these properties, the Lorentz-Lorenz formula describing the refractive index n of a mixture of particles is applied. It is convenient to use the following form:

$$\frac{n^2 - 1}{n^2 + 2} = \sum_i R_i c_i .$$

According to this relation, n is determined by the concentrations cᵢ of the material components and by their refractions Rᵢ. The refraction characterizes the contribution of a particle to the refractive index of the material; it is proportional to the molecular polarizability αᵢ:

$$R_i = \frac{4\pi}{3}\,\alpha_i .$$
The larger the change in polarizability of the photoproduct molecules, the larger the resulting difference in refractive index. Accordingly, photorefraction can be a consequence of processes that alter the qualitative composition in a way accompanied by a change in the polarizability of the components. Another source of photorefractive modulation is the alteration of the component concentrations cᵢ, leading to density (ρ) changes,
which is the case exhibited by photopolymers. Such a case can also be examined using the Lorentz-Lorenz formula. If Nᵢ is the number of i-type particles in a volume of the photosensitive material of mass m, then

$$\sum_i R_i c_i = \frac{\rho}{m}\sum_i R_i N_i .$$

Consider that, under the activating illumination, a conversion from the k-component to the l-component takes place; the indexes k and l refer to the initial and photoproduct molecules, respectively, whose concentrations change due to the photochemical conversions, accompanied by a change Δρ of the density. Then:

$$\Delta\Bigl(\sum_i R_i c_i\Bigr) = \sum_i R_i\,\Delta c_i = \frac{1}{m}\sum_i R_i\,\Delta(N_i\rho),$$

$$\Delta(N_i\rho) = N_i(H)\rho(H) - N_i\rho = (N_i + \Delta N_i)(\rho + \Delta\rho) - N_i\rho = \Delta N_i\,\rho + (N_i + \Delta N_i)\,\Delta\rho,$$

$$\frac{1}{m}\sum_i R_i\,\Delta(N_i\rho) = \frac{\Delta\rho}{m}\sum_i R_i N_i + \frac{R_k}{m}\,\Delta N_k\,(\rho + \Delta\rho) + \frac{R_l}{m}\,\Delta N_l\,(\rho + \Delta\rho).$$

Since the quantity of the formed product is N_p = ΔN_l = −ΔN_k (ΔN_k < 0), the above equation can be written as

$$\Delta\Bigl(\sum_i R_i c_i\Bigr) = \frac{\Delta\rho}{m}\sum_i R_i N_i + \frac{N_p}{m}(\rho + \Delta\rho)(R_l - R_k) = \frac{\Delta\rho(H)}{\rho}\sum_i R_i c_i + c_p(H)\,\Delta R,$$

where Δρ(H) is the density alteration caused by the exposure H, c_p(H) is the concentration of the photoproduct and ΔR = R_l − R_k is the change of the refraction of the activated particles due to the photoreaction. From the Lorentz-Lorenz formula we can write

$$\Delta\left(\frac{n^2 - 1}{n^2 + 2}\right) \approx \frac{6n\,\Delta n(H)}{(n^2 + 2)^2}.$$

As a result, the dependence of the refractive index on the molecular conversion and the density changes can be expressed as [20]:

$$\Delta n(H) = \frac{(n^2 + 2)^2}{6n}\left[\frac{\Delta\rho(H)}{\rho}\sum_i R_i c_i + c_p(H)\,\Delta R\right] = \frac{(n^2 + 2)(n^2 - 1)}{6n}\,\frac{\Delta\rho(H)}{\rho} + \frac{(n^2 + 2)^2}{6n}\,c_p(H)\,\Delta R .$$
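As a rough numerical check of the final expression, the sketch below (all input values assumed purely for illustration) evaluates Δn for a small density change and a small photoproduct term, using Σᵢ Rᵢcᵢ = (n² − 1)/(n² + 2):

    # Evaluate Delta n from the expression above; all inputs are assumptions.
    n        = 1.50    # refractive index of the layer (assumed)
    drho_rho = 5e-3    # relative density change Delta(rho)/rho (assumed)
    cp_dR    = 1e-4    # photoproduct term c_p(H) * Delta(R) (assumed)

    LL = (n**2 - 1) / (n**2 + 2)                         # Lorentz-Lorenz function
    dn = (n**2 + 2) ** 2 / (6 * n) * (drho_rho * LL + cp_dR)
    print(f"Delta n = {dn:.2e}")

With these assumptions Δn comes out in the 10⁻³ range, comfortably above the ∼10⁻⁴ threshold mentioned above for photorefractive behavior.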
Unfortunately, photopolymerization is not a local process: it propagates beyond the boundaries of the illuminated volume. In the weakly illuminated or dark regions the polymerization is accelerated or initiated by scattered light, thermal reactions and radical diffusion. As a consequence, the refractive index modulation Δn diminishes as the spatial frequency increases. Another problem is layer shrinkage, originating from the closer packing of the molecules that accompanies polymerization; it results in an undesired blue shift of the Bragg wavelength during recording. These problems are nowadays mitigated by the addition of supplementary compounds that take no part in the polymerization, i.e. compounds that are neutral in the recording process. The elaboration of such composites, along with other relatively new dynamic materials, is described in Sect. 16.4, which presents recent recording-media developments.

16.3.2 Photorefractive Crystals

The materials described above represent a major part of the organic recording media. Efforts to develop non-silver holographic materials have also been directed at the elaboration of holographic inorganic crystals. These have been extensively studied with the aim of achieving characteristics similar to silver halides together with reversibility of the recording process; thus photorefractive crystals have become a very widely studied class of recording materials.

The photorefractive effect refers to spatial modulation of the refractive index under nonuniform illumination, through space-charge-field formation [25] and electro-optic nonlinearity. The effect is a consequence of drift or diffusion separation of charge carriers photogenerated by the spatially modulated light distribution; the carriers become trapped and produce a nonuniform space-charge distribution. The more mobile charges migrate out of the illuminated region owing to the photovoltaic effect and are ultimately trapped in the dark regions of the crystal [26]. The resulting internal space-charge electric field modulates the refractive index. This effect was first discovered in 1966 as an optical damage mechanism in electro-optic crystals [27]. Soon after, photorefraction was recognized as potentially useful for image processing and storage [28]. Most applications and quantitative analyses of photorefractive materials are nowadays connected with the holographic technique. When two coherent beams overlap, the charge migration results in a sinusoidal space-charge field that modulates the refractive index. A refractive index grating is thus obtained, forming a hologram with read-write-erase capability, since the pattern can be erased by uniform illumination. Typically, the holograms are π/2 phase-shifted with respect to the illuminating interference pattern. A consequence is energy transfer between the two light beams interfering in the medium, recognized as asymmetric two-beam coupling. For sufficiently strong coupling, the gain may exceed the absorption and reflection losses in the sample, so optical amplification can occur.
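The π/2 shift between the fringe pattern and the index grating under diffusion-dominated transport can be visualized in a few lines (illustrative values only; this is a sketch of the geometry, not a transport simulation):

    import numpy as np

    # Under pure diffusion transport the index grating sin(Kx) is shifted by
    # pi/2 relative to the illuminating fringe pattern cos(Kx).
    K = 2 * np.pi / 2e-6           # grating vector for a 2-um fringe period
    x = np.linspace(0, 3e-6, 7)    # sample positions, m
    V, dn = 0.8, 1e-4              # fringe visibility, index amplitude (assumed)

    I_pattern = 1 + V * np.cos(K * x)    # light intensity
    n_grating = dn * np.sin(K * x)       # pi/2-shifted index grating

    for xi, Ii, ni in zip(x, I_pattern, n_grating):
        print(f"x = {xi*1e6:.1f} um: I = {Ii:.2f}, dn = {ni:+.1e}")

It is this spatial offset between gain and loss regions that makes the two-beam coupling asymmetric.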
It is important not to confuse the photorefractive mechanism with the large number of other, local effects leading to photoinduced refractive index modulation, such as photochromism, thermochromism, thermorefraction, generation of excited states, etc. At a basic level, the photorefractive effect requires both photoconductivity and a dependence of the refractive index on the electric field. The recording process consists of several steps. First, charge carriers are excited by the inhomogeneous illumination, leading to the appearance of spatially modulated currents. A charge density pattern is thus obtained and space-charge fields arise. Through the electro-optic effect, the necessary refractive index modulation is obtained [29].

Following these steps, the first physical process is the generation of mobile charges – electrons and holes – in response to the light field. Drift, the bulk photovoltaic effect and diffusion are involved in the formation of the charge density pattern. The drift current is a consequence of the Coulomb interaction of an external electric field with the charge carriers. Bulk photovoltaic currents are typical of noncentrosymmetric crystals; they are sometimes called "photogalvanic currents" to distinguish them from the usual photovoltaic effect. Besides the host material, this effect also depends on the doping and annealing of the crystal. Since the electron orbitals of the defects are oriented with respect to the crystal lattice so as to minimize the free energy, sensitivity to the light polarization is observed. Lorentz forces can cause additional currents and influence the photovoltaic effect; however, the magneto-photorefractive effect is negligible even in the presence of very strong fields. The other transport process is connected with diffusion currents, which are a consequence of the spatial variation of the free charges due to the inhomogeneous illumination.

Besides the question of where the charges originate, another key question is where they are trapped. These microscopic processes determine macroscopic properties such as absorption, absorption changes, conductivity and holographic sensitivity. Different models of these processes have been developed, among them the one-center model with and without electron-hole competition, the two-center model and the three-valence model; on their basis, more complex systems, such as the three-valence case with electron-hole competition, can also be described. To choose an appropriate charge transport model, not only the host material but also the light intensity (cw or pulsed laser regime), the doping and the thermal annealing should be considered [19]. The presence of trapping sites to hold the mobile charges is required, especially when longer storage lifetimes are desired. In general terms, a trapping site is a local region of the material where the charges are prevented from participating in transport for a certain time.

The last requirement for photorefractive media is the presence of refractive index modulation as a consequence of the local electric fields initiated by the illumination, charge generation and redistribution. If
the material exhibits a large electro-optic effect, the magnitude Δn of the refractive index modulation is related to the space-charge field E_sc as follows:

$$\Delta n = -\frac{1}{2}\,n^3 r_e E_{sc},$$

where r_e is an effective electro-optic coefficient. Since Δn is proportional to E_sc, sinusoidal variations of E_sc lead to sinusoidal refractive index modulation. Field-dependent refractive index modulation can also arise from the quadratic (Kerr) orientational effect, which is connected with light-induced birefringence in photorefractive materials [30].

At first, all known materials exhibiting the photorefractive mechanism were inorganic crystals such as LiNbO3, KNbO3, BaTiO3, Bi12SiO20, SrxBa1−xNbO3 (0 ≤ x ≤ 1), InP:Fe, GaAs, multiple-quantum-well semiconductors and several others [31]. The crucial influence of thermal treatment and of dopants was discovered very early [32, 33]. In order to improve the holographic characteristics of these crystals, the influence of different doping agents has been examined. In the following, the main features of some of the most widely studied types of crystal and the doping elements used are considered; a further description is available in [19].

16.3.2.1 LiNbO3 and LiTaO3

LiNbO3 and LiTaO3 crystals are among the most studied photorefractive materials. The highest values of refractive index modulation reported in the recent literature for photorefractive crystals (∼2 × 10⁻³) have been obtained in LiNbO3:Fe. The charge transport is a consequence of the bulk photovoltaic effect, and the dominant carriers are electrons. At low intensities the process is well described by the one-center charge model, no light-induced absorption changes are observed, and the photoconductivity increases linearly with the light intensity. If, in contrast, light-induced absorption changes appear, the one-center charge model is not sufficient to describe the processes, and the two-center model is used. The most widely used dopants for these crystals are Fe and Cu, and the features presented above are also observed in Fe- and Cu-doped crystals. In double-doped crystals, such as LiNbO3:Fe,Mn, a photochromic effect is observed. Other employed dopants are Mg and Zn. Mg does not influence the photovoltaic effect but increases the conductivity of the crystal. The addition of Zn (∼2–5%) also increases the conductivity, while at the same time yielding higher holographic sensitivity. Sensitization for infrared recording has been realized in LiNbO3:Fe and LiNbO3:Cu; it is performed by excitation with green pulses for subsequent infrared exposure [34].

16.3.2.2 BaTiO3

BaTiO3 crystals exhibit a sublinear increase of the photoconductivity with the light intensity, and light-induced absorption changes are observed.
Therefore the two-center and three-valence models are used to describe the charge transport processes [35]. The transport mechanisms are diffusion and drift in an externally applied electric field, but in doped materials an appreciable photovoltaic effect contributes as well. Usually the charge carriers are holes. Fe, Rh, Co, Ce, Mb and Cr are used as dopants. Iron doping leads to additional absorption, but no significant improvement of the photorefractive performance is observed; the photorefractive effect is strongly influenced by thermal annealing. Rhodium doping improves the photorefractive effect in the red and infrared regions, along with light-induced absorption changes. With double doping – BaTiO3:Fe,Rh – the charge transport becomes more complicated. The response time is improved by Co doping, and cerium-doped crystals exhibit higher light-induced absorption changes. Other dopants used in BaTiO3 crystals are Mb, Cr and Nb, although the best performance is observed in materials doped with Rh, Co and Ce. Another way to improve the response time is to heat the crystals.

16.3.2.3 Barium-strontium Titanate and Barium-calcium Titanate

Since tetragonal BaTiO3 is very difficult to produce (only small growth rates, up to 0.2 mm/h, are achieved), a possible solution is to use appropriate mixed crystals, such as Ba1−xSrxTiO3 (BST) and Ba1−xCaxTiO3 (BCT), 0 ≤ x ≤ 1. The one-center model successfully describes the charge transport in BST, while BCT shows sublinear conductivity and light-induced absorption, so the two-level model should be used. Bulk photovoltaic fields are observed in BCT, but they are not significant. In both crystals hole conductivity dominates, and the dominant charge driving forces are diffusion and external electric fields. A possible way to improve these crystals is to exploit the knowledge of charge transport in BaTiO3 and use dopants such as Rh, Co and Ce.

16.3.2.4 KNbO3

Potassium niobate crystals also exhibit sublinear conductivity and light-induced absorption changes, and the two-level model is employed to describe the charge transport. The charge carriers in doped crystals are usually holes; in undoped materials electron-hole competition takes place. The driving forces are diffusion and drift in external electric fields. Owing to the large conductivity, the bulk photovoltaic fields are negligible, although bulk photovoltaic currents are present. Blue-light irradiation enhances the absorption in the infrared; this effect can be used for infrared holographic recording. KNbO3 crystals doped with Ir show a higher effective trap density, but no significant improvement of the response time is observed. Other possible dopants are Ni and Rh. KNbO3:Rh crystals are electron-conductive; they show a photorefractive response time about 30 times shorter and a photoconductivity more than two orders of magnitude larger than undoped crystals. An
inconvenience is the complicated crystal growth. Other dopants for KNbO3 are Cu, Ru, Mn, Rb, Na, Ta, Ce and Co; among them, copper, rhodium and manganese favorably influence the photorefractive properties.

16.3.2.5 Other Crystals

Mixed potassium tantalate-niobate crystals, KTa1−xNbxO3, 0 ≤ x ≤ 1 (KTN), are also objects of investigation; again, Ir doping is used to increase the effective trap density. Other relatively well-examined crystals are the strontium-barium niobates, in which Ce and Rh are used as dopants. With Sn2P2S6 crystals, sensitivity values of ∼5000 cm/J have been obtained [36], which is relatively high for photorefractive crystals. Another group of photorefractive materials used for holographic recording are the sillenite-type crystals. Bismuth oxide crystals Bi12MO20, where M = Si, Ge, Ti, attract special attention owing to their high photosensitivity and high carrier mobility, which permit fast response times [37, 38]. Moreover, these crystals can be doped easily. An attractive option is doping with Ru, Re and Ir, which shifts the transmission spectra toward the red and near-infrared spectral range; holographic recording using He-Ne and low-cost diode lasers thus becomes possible. Ru-doped Bi12TiO20, for example, exhibits this effect most significantly [39]. A similar improvement of the photorefractive sensitivity at longer wavelengths is also observed in Ru-doped KNbO3 crystals [40]. However, the growth of KNbO3:Ru is rather complicated, since the incorporation of ruthenium is possible only if the crystals are grown at high speed, which diminishes the crystal quality and results in more defects in the crystal structure [19].

Usually the holographic sensitivity of photorefractive crystals is between tens and hundreds of cm/J. Spatial frequencies exceeding 2000 lines/mm are obtained. On the other hand, the refractive index modulation is relatively low – up to 10⁻³ – compared to other types of materials (silver halides, polymers).

16.3.3 Liquid Crystals

Another type of material finding application in dynamic holographic recording is the liquid crystal. Nowadays liquid crystals can be found almost everywhere, owing to their enormous success in conventional displays; nevertheless, they continue to be extensively studied for various contemporary applications and are also very attractive for holographic display realization. At the basis of the specific liquid crystal behavior is the combination of solid and isotropic-liquid properties: liquid crystals possess at the same time some properties typical of liquids along with others peculiar to crystals. A more precise denomination is mesomorphic materials, since they exhibit aggregation states appearing between the solid and liquid phases; they can also be called anisotropic liquids. From a macroscopic point of view they resemble liquids, but the strong anisotropy of their physical properties makes them more similar
to crystals [41]. At the same time, effects typical of crystals, such as the Pockels effect, are not observed in liquid crystals. Depending on the material, one or more mesomorphic states can appear when some thermodynamic parameter is changed. If the volume and the pressure are kept constant, the liquid crystalline state appears on variation of the concentration or the temperature. Liquid crystals are thus divided into two main groups – lyotropic and thermotropic. The former show a mesomorphic state when the concentration of the chemical compounds is changed, while the thermotropics reach these states through temperature variation. The mesomorphic state is observed in compounds exhibiting molecular orientational order; usually these are molecules elongated in some direction (rod-like). The typical liquid crystal molecule is about 20–40 Å long, while usually only ∼4–5 Å broad. The molecules are diamagnetic but possess a permanent dipole moment. Different categories of chemical structures exhibiting the liquid crystalline state exist, and new types continue to be synthesized; nevertheless, the organic materials exhibiting a mesomorphic state can be divided into several groups by symmetry considerations. Thus nematic, smectic and cholesteric phases are distinguished. Nematic liquid crystals are characterized by long-range orientational order and free spatial translation of the molecular centers of mass. They are optically uniaxial media with an unpolar crystallographic structure, i.e. the directions of the molecular ends are homogeneously distributed. A layered structure is typical of smectic liquid crystals. According to the Sackmann-Demus classification, the following smectic mesophases can be distinguished:
• Smectic phase A – consists of layers that move freely relative to one another, whose surfaces are formed by the molecular ends. The molecules are directed orthogonally to the layer surface and parallel to each other. Within the layers the molecules have no translational order, so they can move in two directions and spin around their long axis. This modification of the smectics appears at the highest temperatures and, on further heating, is transformed into the nematic or cholesteric (see below) mesophase or into an isotropic liquid.
• Smectic phase B – the molecules in the layers, orthogonal to the surface and parallel to each other, form a hexagonal packing. Ordinary and slanted B smectics are distinguished (in slanted B smectics the molecular tilt with respect to the layer surface differs from π/2).
• Smectic phase C – the long molecular axes (parallel to each other within a layer) form a temperature-dependent angle with the layer surface. Liquid crystalline compounds possessing optical activity can generate a chiral mesophase: each successive layer is turned with respect to the previous one, so that a twisted structure is obtained. If the molecular dipoles are oriented in a certain manner, such a structure possesses ferroelectric properties. The chiral smectics C, having a dipole moment perpendicular to
its long axis and recognized as ferroelectric liquid crystals, generally exhibit submillisecond or even microsecond switching times and thus attract significant attention. Unfortunately, the use of these extremely attractive dynamic features in practical applications is limited by technological problems connected with the specific orientation required of ferroelectric liquid crystals.
• Smectic phase D – optically isotropic. X-ray structural analysis has detected not a layered structure but a quasi-cubic lattice.
• Smectic phase E – characterized by a very high degree of three-dimensional order; rotation around the long molecular axis is absent.
• Smectic F and G phases exist as well; they are less studied and an object of current investigations.
The cholesteric liquid crystals are composed of optically active molecules; the direction of the long molecular axis in each successive layer (each consisting of parallel-oriented molecules moving freely in two directions) forms a certain angle with the direction in the previous layer. A spiral structure is thus formed, with a pitch dependent on the type of molecules and on external influences. This pitch corresponds to a rotation of the molecular orientation axis (the director – see below) by 2π, although the period of alteration of the physical properties corresponds to π. The so-called chiral nematics – molecules with a typical nematic structure but possessing optical activity in addition – are also classed with the cholesteric liquid crystals.

The predominant molecular orientation in a liquid crystal is characterized by a unit vector n, fulfilling the condition n = −n; it is known as the director. In nematic liquid crystals the director coincides with the direction of the optical axis. In the cholesteric mesophase it changes direction along the cholesteric spiral, and its components can be expressed as n_x = cos φ, n_y = sin φ, n_z = 0. In the A and B smectic phases the director coincides with the normal to the smectic planes, i.e. with the optical axis, similarly to the nematic case. In C and H smectics the director is tilted with respect to the layer normal and coincides with one of the two optical axes. The equivalence of the director orientations (the condition n = −n) is a consequence of the fact that macroscopic polarization effects are not observed in liquid crystals. A series of monographs examines the LC properties in detail [42, 43, 44].

The most attractive features of LCs (large optical anisotropy and field-induced refractive index modulation) are a consequence of their anisotropic nature (anisotropy of the dielectric and diamagnetic permittivity) and their electro-optical behavior. The electro-optical properties of LCs are governed by free-energy minimization in an external electric or magnetic field, leading to reorientation (and reorganization) of the LC molecules. In the case of positive anisotropy, the LC director tends to follow the applied field direction. If the anisotropy is
negative, the LC molecules rotate perpendicular to the external field. If the LC director does not satisfy the minimum-free-energy condition in the initial state, then under a sufficiently strong applied external field director reorientation takes place until a new stationary distribution is obtained. This effect is known as the Freedericksz transition and requires fields sufficient to overcome the elastic forces. The relation between the applied field E and the director angle θ is given by the expression [45]:

$$\frac{Ed}{2}\sqrt{\frac{\Delta\varepsilon}{4\pi K}} = \int_0^{\theta_m}\frac{d\theta}{\sqrt{\sin^2\theta_m - \sin^2\theta}} = F(k),$$

where d is the LC layer thickness, K the elastic coefficient and θ_m the angle of director deviation in the middle of the layer. The elliptic integral F(k) is tabulated for arbitrary values of k = sin θ_m < 1. For relatively small deviation angles it can be expanded in series; keeping the first two terms, the expression can be written as

$$E = \frac{\pi}{d}\sqrt{\frac{4\pi K}{\Delta\varepsilon}}\left(1 + \frac{1}{4}\sin^2\theta_m + \dots\right).$$

As a consequence, a deformation θ_m ≠ 0 is possible only if the applied field exceeds the value E₀, the threshold field of the Freedericksz transition:

$$E_0 = \frac{\pi}{d}\sqrt{\frac{4\pi K}{\Delta\varepsilon}}.$$

This effect changes the optical properties of the LC through the alteration of the effective refractive index:

$$n_{eff} = \frac{n_o n_e}{\sqrt{n_o^2\cos^2\theta + n_e^2\sin^2\theta}}.$$

Thus the possibility of reorienting the LC with an external field (in most practical applications, an electric field) enables control of its optical properties. This is an extremely attractive feature, finding application in wide areas of contemporary science and technology; indeed, the most prominent applications are in displays. For the aim of holographic recording, LC systems containing azo-dyes [46, 47, 48] are studied. Usually nematic LCs are used, since they are rather sensitive to the weak perturbing forces induced by electric, magnetic or optical fields. Nematics are also well known for their nonlinear optical-axis reorientational effects [49]. The nonlinear optical properties can be increased dramatically by doping the LC with traces of dye molecules, usually in the range 0.5% to 2% [50]. This is a possible way to increase the
diffraction efficiency of holographic gratings. The increase is a consequence of the absorption of the laser radiation, which photoexcites the dye molecules and initiates the mechanism for large refractive index changes through reorientation of the LC director. The reorientation of the LC director axis has been attributed to intermolecular interactions between the azo dye and the LC molecules [51, 52], and to an optically induced d.c. space-charge field (examined below) [53, 54]. Some of the holographic characteristics obtained with dye-doped liquid crystals are: sensitivity in the range 440–514 nm, spatial frequency above 1000 lines/mm and refractive index modulation up to 0.1 [47, 48].
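As an order-of-magnitude illustration of the threshold formula above: the Gaussian-units expression E₀ = (π/d)√(4πK/Δε) corresponds, in SI form, to a thickness-independent threshold voltage V_th = π√(K/(ε₀Δε)), with E₀ = V_th/d. The sketch below uses assumed, typical-order material parameters, not values from the chapter:

    import math

    # Freedericksz threshold in SI form: V_th = pi * sqrt(K / (eps0 * d_eps)).
    eps0  = 8.854e-12    # vacuum permittivity, F/m
    K     = 7e-12        # splay elastic constant, N (assumed)
    d_eps = 10.0         # dielectric anisotropy (assumed)
    d     = 5e-6         # cell thickness, m (assumed)

    V_th = math.pi * math.sqrt(K / (eps0 * d_eps))
    print(f"V_th = {V_th:.2f} V, E0 = {V_th / d / 1e5:.2f} kV/cm")

With these assumptions the threshold comes out below 1 V, which is why LC reorientation is so readily driven electrically.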
16.4 Recent State of the Art, New Materials Development, and Applicability of the Different Media Types to Holographic Display Systems

The materials described so far are the major part of the non-silver recording media. For the realization of a holographic 3D display, however, the recording material must in addition be dynamic. Among the presented holographic media, photorefractive crystals and liquid crystalline and polymeric materials containing azo-groups or exhibiting phase transitions [55] are representative of the dynamic recording materials.

Photorefractive crystals were among the earliest media widely considered for storage suitable for read-write hologram recording. Usually they are doped with transition metals such as iron, or with rare-earth ions such as praseodymium, and grown in large cylinders in the same way as semiconductor materials; thus large samples can be polished for thick hologram recording. Over the last several years, photorefractive materials have continued to be a subject of intensive study. Most of the available literature concerns LiNbO3 crystals. Some of the works relate to switching optimization [56] or to the dependence of the photorefractive effect on composition and light intensity [57]. Another direction of recent investigation is the use of new doping elements. In a cerium-doped crystal the photovoltaic constant has been measured to be only one third of that of the iron-doped one [58]. Other dopants recently used in combination with supplementary elements are In and Cr [59, 60]: the higher the Cr doping level, the larger the absorbance around 660 nm observed in double-doped LiNbO3:Cr:Cu crystals. Along with LiNbO3 crystals, LiTaO3 materials are also a subject of intensive investigation [61]. Similarly to LiNbO3, two-step infrared sensitization and recording in LiTaO3 crystals has been reported [62], realized via the pyroelectric effect. The other types of crystals also continue to be studied [63], and new materials such as Gd3Ga5O12 [64, 65] are being developed. The main advantage of photorefractive crystals is that no development of the holograms is required and all the processes are completely reversible.
Unfortunately, the difficult crystal growth and sample preparation mainly limit the applications of these materials. Moreover, the thickness of photorefractive crystals is typically about several mm. For optical storage applications thicker media are preferable, but for devices comprising layers and many elements this is rather undesirable. Thus, it would be an inconvenience in the realization of a 3D holographic display.

16.4.1 Photorefractive Organic Materials

In fact, until 1990 all known photorefractive materials were inorganic crystals. Photorefractivity in an organic crystal was first reported by the ETH Zurich group (in 1990) [66]. The material was a carefully grown nonlinear organic crystal, 2-cyclooctylamino-5-nitropyridine doped with 7,7,8,8-tetracyanoquinodimethane. Although the growth of high-quality doped organic crystals is a difficult process, since most of the dopants are expelled during the crystal preparation [67], there have been some subsequent investigations of such media [68, 69]. On the other hand, polymeric and/or glassy materials can be doped relatively easily with various molecules of different sizes. Also, polymers may be formed into thin films of different shapes, as well as applied in total internal reflection and waveguide configurations according to the application requirements [67]. The first polymeric photorefractive material was composed of an optically nonlinear epoxy polymer, bisphenol-A-diglycidylether 4-nitro-1,2-phenylenediamine, which was made photoconductive by doping with 30 wt% of the hole transport agent diethylaminobenzaldehyde-diphenylhydrazone. The first publication of its application in holography is presented in [70]. Another approach is not to dope electro-optical polymers with charge transport molecules, but to synthesize a fully functionalized side-chain polymer with multifunctional groups [71, 72]. Nevertheless, a faster and easier to implement approach is the guest-host chemical design. It enables a better way to test different combinations of polymers and molecules with photosensitivity, transport and optical activity [13]. Along with the other advantages, a further motivation to elaborate photorefractive polymers is a consequence of a particular figure-of-merit consideration comparing the refractive index changes possible in different materials in the case of equal density of the trapped charges. It can be defined as

\[
Q = \frac{n^3 r_e}{\varepsilon_r},
\]
where n is the refractive index, re – the effective electro-optic coefficient and εr – the relative low-frequency dielectric constant. Q approximately measures the ratio of the optical nonlinearity to the screening of the internal space-charge distribution by medium polarization. It is established that for inorganic materials it does not vary very much, due to the fact that the optical nonlinearity is driven mostly by the large ionic polarizability. In contrast, the nonlinearity in organics is a molecular property arising from the asymmetry of the electronic charge distribution in the ground and excited states [25]. As a consequence, the large electro-optic coefficients are not accompanied by large DC dielectric constants. Thus, an improvement in Q by more than 10 times is possible with organic photorefractive materials.
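To make the comparison concrete, the sketch below evaluates Q for order-of-magnitude parameter sets; all numbers are assumed, illustrative values (not measurements from the works cited here), chosen only to show that inorganic crystals cluster around similar Q while an organic material can exceed them by more than an order of magnitude.

```python
def Q(n, r_e, eps_r):
    """Figure of merit Q = n^3 * r_e / eps_r (units of r_e, here pm/V)."""
    return n**3 * r_e / eps_r

# Assumed, order-of-magnitude parameters: n, r_e [pm/V], eps_r
materials = {
    "LiNbO3-like inorganic": (2.2, 30.0, 30.0),
    "BaTiO3-like inorganic": (2.4, 1640.0, 3600.0),
    "organic PR polymer":    (1.7, 100.0, 4.0),
}
for name, (n, r_e, eps_r) in materials.items():
    print(f"{name:22s}: Q ~ {Q(n, r_e, eps_r):6.1f} pm/V")
```

Note how the very large electro-optic coefficient of the BaTiO3-like entry is almost entirely cancelled by its very large dielectric constant, while the organic entry gains from its small εr.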
On a basic level, the recording mechanism in photorefractive polymers does not differ from the one in photorefractive crystals, but different constituents give rise to the required properties. Examples of good charge generators (providing the first process in photorefractive materials) are donor-acceptor charge transfer complexes like carbazole-trinitrofluorenone, fullerenes such as C60, or dye aggregates well known in the photographic industry. In order to obtain dynamic recording, a reduction/oxidation process is required – the charge generation site has to oxidize back to its original state. In photorefractive polymers the holes are more mobile. The charge (hole) transport function is generally provided by a network of oxidizable molecules situated close enough to one another to allow hopping motion. Examples of such transporting molecules are carbazoles, hydrazones and arylamines, which are electron-rich and consequently have low oxidation potentials. Energetics requires the highest occupied energy level of the photogenerator to be lower than that of the transporting molecules. The physical processes initiating the charge transport are diffusion, as a consequence of charge density gradients, or drift in an externally applied electric field. Both generally proceed by charge transfer from transport site to transport site. Overall, in most polymeric materials the ability of the generated charges to move by diffusion alone in zero electric field is quite limited, so drift in the applied field is the dominant mechanism for charge transport. The other element of the photorefractive effect, especially when longer grating lifetimes are desired, is the presence of trapping sites that hold the mobile charges. In polymer photorefractive materials, transporting molecules with lower oxidation potential are used as deep hole traps [73]. The efforts to describe the photorefractive effect in polymers were initially connected with the application of the standard one-carrier model used for inorganic crystals. According to this model the space-charge field Esc is expressed by [74]:

\[
E_{sc} = \frac{m E_q (E_0 + i E_d)}{E_q + E_d - i E_0},
\]

where

\[
E_q = \frac{e N_A \left[1 - (N_A/N_D)\right]}{\varepsilon_0 \varepsilon_r K_G}
\]

is the trap-density-limited space-charge field for grating wavevector KG, ε0 is the permittivity of free space, \( m = 2\sqrt{I_1 I_2}/(I_1 + I_2) \) – the modulation depth of the optical intensity pattern, ND – the density of donors, NA – the density of acceptors providing partial charge compensation, and Ed = kB T KG /e – the diffusion field, where kB is Boltzmann's constant.
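A minimal numerical sketch of this one-carrier model is given below; the trap densities, dielectric constant, grating period and applied field are assumed, illustrative values (the ∼50 V/μm bias is of the order quoted later in the text for polymers), intended only to show the magnitudes of Eq and Ed and the spatial phase of Esc.

```python
import cmath
import math

e, eps0, kT_over_e = 1.602e-19, 8.854e-12, 0.0257  # SI; kT/e in volts at ~300 K

# Assumed, illustrative parameters:
N_A, N_D = 1e22, 1e25  # acceptor (compensating) and donor densities, m^-3
eps_r    = 4.0         # relative low-frequency dielectric constant
Lam      = 1e-6        # grating period, m
E0_app   = 5e7         # applied field, V/m (~50 V/um)
m        = 1.0         # modulation depth for equal-intensity beams

K_G = 2.0 * math.pi / Lam
E_q = e * N_A * (1.0 - N_A / N_D) / (eps0 * eps_r * K_G)  # trap-limited field
E_d = kT_over_e * K_G                                     # diffusion field

E_sc = m * E_q * (E0_app + 1j * E_d) / (E_q + E_d - 1j * E0_app)
print(f"E_q = {E_q/1e6:.2f} MV/m, E_d = {E_d/1e6:.3f} MV/m")
# A spatial phase near 90 deg between E_sc and the intensity pattern is what
# enables energy transfer (two-beam coupling) in photorefractive media.
print(f"|E_sc| = {abs(E_sc)/1e6:.2f} MV/m, "
      f"spatial phase = {math.degrees(cmath.phase(E_sc)):.1f} deg")
```

Under these assumptions the applied field greatly exceeds Eq, so the space-charge field saturates near the trap-limited value and is shifted close to 90° with respect to the light pattern.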
The corresponding equation for organic photorefractive materials [75] is quite similar, with the density of acceptors replaced by the density of traps. Additional field-dependent terms, arising from the field dependence of mobility and quantum efficiency, are present. Moreover, several additional physical effects should be taken into account. They are connected with the presence of shallow traps along with deep ones, evidenced by the sublinear intensity dependence; with the more complicated field dependence of the photogeneration efficiency, especially if sensitizers like C60 are used; and with the different trapping mechanism. In fact, the main reason for the complexity of the photorefractive effect in polymers was recognized a bit later. In 1994 Moerner and co-workers altered the picture of electro-optical nonlinearity by discovering an effect which does not exist in inorganic materials [30]. It is connected with orientational processes in the polymeric material as a consequence of the space-charge field formation. The latest photorefractive polymers, strongly exhibiting these effects, showed significantly higher performance (diffraction efficiency up to 100%, compared to earlier values of the order of several percent). This was achieved by improved fabrication conditions and the addition of a plasticizer. The plasticizer reduces the glass transition temperature, enabling orientation of the electro-optic chromophores. The reorientation of the birefringent chromophores enhances the refractive index modulation, which is fully reversible, reaching amplitude values ∼ 0.007 with response times of 100–500 ms. The other reason for the higher performance of these materials is the utilization of sample preparation conditions allowing the application of higher electric fields. Thus, the year 1994 was a turning point in the development of photorefractive polymers. The chromophore design was changed – the Pockels effect was replaced by orientational birefringence as the main driving mechanism. At its basis is orientational photorefractivity, in which the refractive index of the material is modulated by the orientation of the optically anisotropic dopant molecules with permanent dipole moments. It is a consequence of the internal space-charge field generation, driven by absorption, charge generation, separation and trapping, similarly to the traditional photorefractive materials.

16.4.2 Liquid Crystals

In the same year (1994), the first photorefractive liquid crystal materials were reported [76, 77]. The low-molar-mass liquid crystalline material 4′-pentyl-4-biphenylcarbonitrile, doped with small amounts of the sensitizing laser dye rhodamine 6G, was used. In fact, the ultimate extension of orientational photorefractivity is to consider materials consisting entirely of long, rod-shaped birefringent molecules which can be easily oriented in an external electric field – i.e. the liquid crystals [26]. It is well known that nematic liquid crystals possess large optical nonlinearities associated with director axis reorientation upon application of optical or electric fields.
In many aspects they are ideal for observing the photorefractive effect, owing to their intrinsic orientational response. Also, no nonlinear dopant is necessary, since the liquid crystal is itself the birefringent component. It is essential that in liquid crystals 100% of the medium contributes to the birefringence, in contrast to the fractional content of the nonlinear optical dopant in other systems. Furthermore, the field to which the molecules respond is orders of magnitude lower than in polymers: the field required for photorefractive liquid crystal reorientation is ∼0.1 V/μm, while in polymer systems it is ∼50 V/μm. Along with these advantages, the figure of merit, similarly to polymers, has improved rapidly within several years as a consequence of the elaboration of new liquid crystal mixtures and a better understanding of the charge transfer processes. Usually, the liquid crystal is sandwiched between two indium tin oxide (ITO)-coated glass slides, treated with a surfactant to induce alignment of the director perpendicular to the glass surfaces, i.e. homeotropic alignment [78]. The cell thickness is typically between 10 and 100 μm, fixed by Mylar spacers. The theoretical treatment is based on a steady-state solution assumption for the current density, resulting in the following expression for the diffusion field in the liquid crystal [76, 79]:

\[
E_{sc} = -\frac{m k_B T q}{2 e_0}\,\frac{D_+ - D_-}{D_+ + D_-}\,\frac{\sigma_{ph}}{\sigma_{ph} + \sigma_d}\,\sin qx,
\]
where m is the modulation index, σph – the photoconductivity, σd – the dark conductivity, e0 – the charge of the proton, and D+ and D− – the diffusion constants for the cations and anions, respectively. This equation determines the critical factors influencing the magnitude of the space-charge field. The photoconductivity relative to the dark conductivity and the difference in the diffusion coefficients of the cations and anions allow one set of charges to preferentially occupy the illuminated or the dark regions of the interference pattern. Even today, a complete theoretical understanding of the photorefractivity in polymers and liquid crystals is challenging. According to [49], a full theory should take into account effects like the mobility of various charge carriers, standard space-charge generation, space-charge fields due to optical fields and to conductivity and dielectric anisotropies [80], torques on the director axis, as well as flows and instabilities of the nematic liquid crystal. A large increase of the orientational photorefractive effect was first reported by Wiederrecht et al. [81]. It was obtained basically by two improvements. A eutectic liquid crystal mixture was used, along with an organic electron donor and acceptor combination with a well-defined and efficient photo-induced charge transfer mechanism. The eutectic mixture lowers the liquid crystalline to solid phase transition temperature and provides better photorefractive performance due to the greater reorientation angle of the molecules, a consequence of the lower orientational viscosity. The employed liquid crystals were low-molar-mass compounds with higher birefringence.
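Returning to the diffusion-field expression above, the following sketch evaluates its amplitude; the grating period, diffusivity asymmetry and conductivity ratio are assumed, illustrative values, meant only to show how the photoconductivity contrast and the cation/anion diffusivity difference set the scale of Esc.

```python
import math

kT_over_e = 0.0257        # V at ~300 K
Lam = 20e-6               # assumed grating period, m (coarse LC grating)
q   = 2.0 * math.pi / Lam
m   = 1.0                 # modulation index

D_asym  = 0.3             # assumed (D+ - D-) / (D+ + D-)
s_ratio = 0.5             # assumed sigma_ph / (sigma_ph + sigma_d)

E_sc_amp = m * kT_over_e * q / 2.0 * D_asym * s_ratio  # amplitude of E_sc, V/m
print(f"|E_sc| ~ {E_sc_amp:.0f} V/m = {E_sc_amp*1e-6:.4f} V/um")
# Either a vanishing photoconductivity contrast or equal ion diffusivities
# (D+ = D-) drives the space-charge field to zero.
```

The resulting fields are modest, which is consistent with the point made above: liquid crystals are usable precisely because their orientational response requires fields far smaller than those needed in polymers.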
Another direction in these investigations is to use high-molar-mass liquid crystals. New liquid crystal composites were developed containing both a low-molar-mass liquid crystal and a liquid crystal polymer, i.e. a high-molar-mass liquid crystal [77, 82, 83]. These composites in many respects have the best photorefractive figures of merit for strictly nematic liquid crystals [26]. Net beam-coupling coefficients greater than 600 cm⁻¹ and applied voltages ∼ 0.11 V/μm were obtained with He-Ne laser intensities lower than 10 mW per beam. The gratings operated in the Bragg regime. Later studies have examined the influence of the polymer liquid crystal molecular weight on the temporal and gain coefficients of photorefractive gratings [84]. An improvement in the response time was found for the lower molecular weight polymers. This was attributed to an overall decrease in the composite viscosity. The response time was shorter than 15 ms in the Bragg regime. Furthermore, the required applied voltage was lowered to ∼0.1 V/μm. In fact, the improvements in the holographic characteristics of these materials are connected with the development of composite materials. On the one hand, the possibility to combine the very large reorientational effects exhibited by low-molar-mass LCs with the longer grating lifetimes and higher resolution of nonmesogenic polymers is extremely attractive. On the other hand, the combination of liquid crystals and photopolymers enables the realization of switchable gratings, which find wide application in practice. These materials are known as polymer dispersed liquid crystals and will be discussed in detail in the following. In order to summarize the current performance of the organic photorefractive materials, some of the obtained characteristics are presented and compared with those of the photorefractive crystals. To gain efficient charge generation, and thus high holographic sensitivity, it is necessary to sensitize the recording material to some proper laser wavelength. In contrast to inorganic crystals, the spectral sensitivity of photorefractive organics can be changed using proper dopants. During recent decades, numerous sensitizers have been developed [85]. As a result, the spectral sensitivity of these materials is nowadays tuned through the entire visible spectrum and the near infrared, up to 830 nm. The spatial resolution is similar to that of inorganic crystals. Parameters such as the dynamic range and the material stability are also comparable with those of photorefractive crystals. The last two characteristics are not directly connected with the requirements of a holographic display material, but mostly with storage applications. For the purpose of holographic display realization, a critical parameter is the time response of the medium. The response is a rather complex process in photorefractive polymers and liquid crystals. It depends on several factors, including photogeneration efficiency, drift mobility and field-induced orientational dynamics of the chromophores. It is also strongly dependent on the applied electric field.
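To give a sense of scale for the net beam-coupling coefficient quoted above, the sketch below converts Γ ≈ 600 cm⁻¹ into the intensity gain exp(ΓL) over a few assumed interaction lengths typical of LC cell thicknesses (illustrative numbers, not from the cited experiments, and assuming the quoted net coefficient applies uniformly over the whole thickness).

```python
import math

Gamma = 600.0  # net two-beam coupling gain coefficient, 1/cm
for L_um in (10, 50, 100):
    gain = math.exp(Gamma * L_um * 1e-4)  # convert um -> cm
    print(f"L = {L_um:3d} um -> exp(Gamma*L) ~ {gain:7.1f}x")
```

Even a 100 μm cell would thus amplify a weak signal beam by two orders of magnitude; such gain figures, however, must be weighed against the response-time limitations just discussed.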
Thus, the fundamental research concerning the photorefractive materials (inorganic crystals, polymers and liquid crystals) has pointed out the limits of their applicability in real-time holography [86]. Inorganic photorefractive crystals and photorefractive polymers have a relatively slow response, and the latter in addition need a bias of about 10 kV across a sample of typical thickness. Nevertheless, the latest photorefractive liquid crystals and polymer dispersed liquid crystals exhibit a better performance. As a conclusion, it can be pointed out that the most promising mechanism for holographic 3D display applications is the reorientation of birefringent chromophores. It can be obtained through internal electric field creation (in the photorefractive media) or through optical cis-trans conformation processes (dye-doped liquid crystals). As a consequence, liquid crystalline materials are currently considered a leading candidate for those applications that do not require long storage times. The molecular reorientational effect (cis-trans conformation) is a much faster process than the charge generation, transport and trapping involved in the electro-optical effect in photorefractive materials. In this sense, it is preferable for 3D display applications. In fact, the dye-doped liquid crystal materials are quite new. Their performance and the possibility to realize holographic recording are a consequence of the so-called Janossy effect. In 1990 Janossy discovered that the optical reorientation of liquid crystals can be enhanced by up to two orders of magnitude by doping with certain dichroic dyes, if the dye molecule is excited anisotropically [87]. Such dyes are known to undergo photoinduced conformational changes, as is the case for azo dyes [88]. The excellent dynamic performance of the dye-doped liquid crystals achieved nowadays has resulted in significant scientific interest. Currently, the possibility to employ azobenzene dye-doped twisted-nematic liquid crystals for the recording of polarization holographic gratings is being studied [89]. High diffraction efficiency (exceeding 45%) is obtained. In addition to the polarization rotation when the laser beam is diffracted in the medium, this rotation angle can be controlled by the twist angle of the sample cells. Also, the layer undulations in cholesteric cells are used as switchable, weakly polarization-dependent 2D diffraction gratings of both Raman–Nath and Bragg types [90]. These experiments open a new possibility to extend liquid crystal applications through the use of chiral structures. Applying transverse-periodically aligned nematic liquid crystals, polarization-induced switching between diffraction orders of a transverse-periodic nematic LC cell has been realized [91]. Relatively new approaches to liquid crystalline materials are also connected with carbon nanotube doping and with combined fullerene C60 and dye doping [92, 93]. The applicability of dye-doped liquid crystal materials for relatively high-spatial-frequency recording has been shown in [47].

16.4.3 Polymers

In fact, the enhancement of optical reorientation by dye doping is not limited to liquid crystals, but is also present in isotropic liquids and amorphous polymers.
Although polymers are most often referred to as a promising medium for high density optical storage, and continue to be studied in this direction [94], they also find applications in real-time holography. The azo-containing materials are among the most studied dynamic polymeric holographic media. It is established that the modification of the optical properties of azo materials is due to the efficient photoisomerization of the –N=N– bond in the azobenzene group, initiated by the absorbed light; photoreorientation with polarized light is also well known [95, 96]. The molecular reorientation, a consequence of the angular hole burning due to multiple trans-cis-trans photoisomerization cycles, leads to photoinduced birefringence and dichroism [97]. The reversible photoisomerizations can also initiate mass transport, resulting in surface relief formation (surface diffraction gratings) [98]. The polymer mass redistribution induced by an interference pattern of two laser beams takes place well below the polymer's glass transition temperature. Different mechanisms have been proposed to explain the origin of surface relief gratings in azobenzene-functionalized polymers. They include thermal gradient mechanisms, asymmetric diffusion upon the creation of a concentration gradient [99], isomerization pressure [100], mean field theory (based on electromagnetic forces) [101], permittivity gradient [102] and gradient electric force [103]. Besides surface relief creation, photochromic conversion has attracted strong interest [104]. Overall, both azobenzene LC and amorphous polymers exhibiting photoisomerisation and surface relief creation show excellent holographic characteristics. The photo-isomerization mechanism allows a wider spectral sensitivity – up to 633 nm – while the surface relief materials usually work in the range of 244–532 nm. The achieved spatial frequencies are 6000 and 3000 lines/mm, respectively. The refractive index modulation exceeds 0.1. Nevertheless, the pursuit of dynamic holographic materials is enlarging the range of available media, mostly through the development of composites combining the advantages of different materials such as liquid crystals, various polymers, etc.

16.4.4 Polymer Dispersed Liquid Crystals

Polymer dispersed liquid crystals (PDLC) are relatively new materials, elaborated during the last two decades. Although they were first considered for other applications, they later found wide application in holographic recording. The first applications of PDLCs were the so-called “smart windows” formed by liquid crystal droplets distributed homogeneously in a polymer matrix, with electrically controlled optical behavior. Later, the recording of switchable holographic gratings enabled numerous applications as holographic optical elements. Another direction in PDLC development is to use the photorefractive effect in order to obtain reversible recording. The first descriptions of PDLCs were given by Fergason in 1985 [105] and by Doane [106] and Drzaic [107] in 1986.
Their main advantage is the combination of the unique liquid crystal properties with the possibility to realize photoinduced processes in the medium, including optical recording. The structures consist of micron or sub-micron birefringent liquid crystal droplets embedded in an optically transparent polymer matrix. The structure is fixed during a phase separation process. The phase separation of PDLCs can be accomplished by several mechanisms. The thermal method involves cooling of a common solution of thermoplastic material and liquid crystal (TIPS – thermally induced phase separation). Another way is to use a common solvent and its evaporation – solvent-induced phase separation (SIPS). Nevertheless, the most established technique nowadays is to employ polymerisation of monomeric precursors homogenized with the liquid crystal – polymerisation-induced phase separation (PIPS). The last case can be achieved optically (by UV irradiation). The next stage consists of free radical reactions initiating monomer-polymer conversion, leading to an increase of the polymer molecular weight in the presence of large volume fractions of liquid crystal. The final morphology consists of randomly dispersed liquid crystal domains with form, volume proportion and size determined by the illuminating light intensity, the volume ratio of the compounds in the pre-polymer mixture, and the temperature [108]. It is essential to note that the obtained morphology determines the further electro-optical properties of the film. Depending on the liquid crystal concentration, two main types of morphologies are observed after the phase separation process. In the case of relatively low amounts of liquid crystal, the morphology is of the “Swiss cheese” type – spherical or ellipsoidal droplets completely surrounded by the polymer matrix. The other type of morphology consists of two continuous phases (polymeric and liquid crystalline) – described as a “sponge” morphology. It is usually observed at liquid crystal concentrations exceeding 50%. A typical feature of this morphology is the coalescence of the liquid crystal droplets [109]. At a given liquid crystal concentration, the droplet size and distribution are determined by the polymerization kinetics. If the liquid crystal is extracted from the structure, the morphology can be observed by electron microscopy techniques. After the initial droplet formation, during the polymerization process, the droplet size increases as a consequence of liquid crystal diffusion from the areas where the polymer concentration (due to the monomer-polymer conversions) increases rapidly. The droplet size and distribution are determined not only by the diffusion process, but also by the polymer network propagation, leading to “gelation” above a given molecular weight and density of the matrix. At this moment the diffusion significantly diminishes and the droplet size (and shape) is fixed. Diameters from 0.02 to several micrometers are obtained. Control of the diameter is required for optimization of the further electro-optical properties of the material. The droplet distribution is random, except in the cases when a special surface treatment is performed in order to create a preliminary orientation of a layer of the material.
Some kind of arrangement is present within the droplets, usually nematic, but the overall direction of the molecular axes (the director) is different in each droplet. Overall, the director configuration within a droplet depends on the surface interactions, the elastic constants of the liquid crystal, and the presence of an externally applied field as well as its amplitude. In most cases, the optical axis (determined by the dipole moment) coincides with the molecular axis. As a consequence of the chaotic director distribution, the material is “opaque” and strongly diffuses light. This is a consequence of the refractive index mismatch between the droplets and the polymer matrix. Usually, the ordinary refractive index is chosen to be similar to that of the polymer. This equilibrium is used to “switch off” the highly scattering mode through the application of a sufficiently strong electric field. The same effect can be obtained through the influence of a magnetic field. The field (in practical applications, electric) has to overcome the resistance caused by the elastic forces of the liquid crystal in order to induce molecular reorientation. This reorientation corresponds to the Freedericksz transition [110], which explains its threshold behavior. The electric field is applied in such a manner that the incident light “sees” only the ordinary refractive index of the liquid crystal. In consequence, the material becomes transparent [111]. When the electric field is removed, the liquid crystal returns to its initial distribution governed by the elastic forces. Thus, two states of the PDLC films are obtained – a highly opaque “switched off” state and a transparent “switched on” state. The refractive index in the initial state is

\[
n_0 = \frac{n_e + 2 n_o}{3}.
\]

When a strong enough electric field is applied, the effective refractive index is no. As a consequence, the obtained optical anisotropy is

\[
\Delta n_{eff} = n_0 - n_o = \frac{\Delta n}{3}.
\]

In the case of low volume fractions of liquid crystal, Δneff is smaller than Δn/3. In order to increase the optical anisotropy, the percentage content of the liquid crystal in the pre-polymer mixture should be increased. Another advantage of PDLCs with higher liquid crystal concentration is the lower electric field required for the director reorientation. It is precisely the control, through an applied electric field, of the scattering from the birefringent liquid crystal droplets that is the basis for one of the most attractive applications of PDLCs – display technology. To some extent, conventional liquid crystal displays made from twisted nematics remain relatively expensive and difficult to produce. They also require additional optical elements (polarizers). On the basis of controlled light scattering, PDLCs find application in optoelectronics for different transmission windows, temperature sensors, color filters with variable optical density, etc. [112].
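A quick numerical check of the index relations above, with assumed E7-like indices (illustrative values, not data from this chapter):

```python
n_o, n_e = 1.52, 1.74             # assumed ordinary/extraordinary indices (E7-like)
n_avg  = (n_e + 2.0 * n_o) / 3.0  # field-off average index of a random droplet
dn_eff = n_avg - n_o              # index step between 'off' and 'on' states
print(f"n_avg = {n_avg:.3f}, dn_eff = {dn_eff:.3f} "
      f"(= dn/3 = {(n_e - n_o) / 3.0:.3f})")
# Choosing a polymer with index close to n_o (~1.52 here) makes the
# field-on state transparent, as described in the text.
```

With these numbers the switchable index contrast is about 0.07, which illustrates why even modest liquid crystal birefringence yields strong scattering in the field-off state.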
If such a solution of reactive monomers and liquid crystal is illuminated with a spatially modulated light distribution (an interference pattern), the exposure initiates a counter-diffusion process. It consists of liquid crystal transport to the dark regions of the interference pattern, governed by the monomer diffusion to the light areas and the polymer network growth in the light regions. The monomer diffusion is initiated by the concentration gradient due to the monomer-polymer conversions in the light regions. The phase separation process takes place in the dark parts of the material, where the liquid crystal is confined in droplets with sizes usually smaller than 0.5 μm. The liquid crystal droplets have randomly oriented directors. This method of forming holographic structures was first utilized by Sutherland and co-workers in 1993, who reported the recording of transmission diffraction gratings [113]. During the next year, the recording of reflection diffraction gratings was also realized [14]. The holographically formed PDLC structure consists of polymer-rich and liquid-crystal-rich layers, corresponding to the light distribution. Again, due to the refractive index mismatch, scattering occurs. Due to the periodicity of the structure, this scattering is coherent and leads to reconstruction of the holographic information. Thus, the medium exhibits phase modulation, and the structures are known as holographic polymer dispersed liquid crystals (HPDLC). Similarly to the conventional PDLCs, the ordinary refractive index of the liquid crystal is chosen to match that of the polymer matrix. As a result, the application of an electric field switches off the diffraction structure, since the index modulation disappears. Again, when the electric field is removed, the liquid crystal restores its initial configuration governed by the elastic forces. The consequence of this mechanism is the reversible switching of the diffraction grating [114]. The sensitization of HPDLC in the visible spectral range, i.e. to some proper laser wavelengths, is accomplished by the addition of an appropriate combination of dye and photoinitiator. The role of the dye consists in shifting the absorption peak of the material into the desired spectral range, while the photoinitiator is important for initiating the free-radical polymerization processes. The process is considered to proceed by the following mechanism [115, 116]. The photon absorption is accompanied by excitation of the dye molecule. The next process is electron transfer from the excited dye molecule to the initiator, usually belonging to the group of the amines. In consequence, a pair of ion radicals is formed. This process is immediately followed by proton transfer to the anion radical of the initiator from the co-initiator. As a result, a neutral amine radical is obtained, which initiates the photopolymerization. The co-initiator concentration significantly influences the free radical formation efficiency. This efficiency results in a higher polymerization velocity, which affects the size and the anisotropy of the droplets. In order to obtain high diffraction efficiency along with high spatial resolution in the case of reflection diffraction gratings, a morphology with a high concentration of small liquid crystal droplets is required [117]. Extremely fast photopolymerization is exhibited by the multifunctional monomers – the necessary time is of the order of seconds.
Another feature is that they form a highly crosslinked network. As a result, the liquid crystal droplet growth is limited and the droplet size does not exceed 0.5 μm. Usually two kinds of monomers are used in HPDLC recipes. The mechanism for one type is free-radical bond opening (addition polymerization), while the other exhibits a combination of free-radical and step reactions. The acrylate monomers with functionality higher than 4 satisfy the requirement of achieving significant molecular weight in several seconds. Urethane derivatives with functionality between 2 and 6 are also used. Often N-vinyl pyrrolidinone (NVP) is used as a reactive diluent in order to homogenize the initial mixture. Another class of monomers is the commercially available Norland resins; the most widely used is NOA 65 (Norland Optical Adhesive). The other basic component of the PDLC pre-polymer mixture is the liquid crystal. The most widely used liquid crystals are nematic, possessing positive anisotropy. High values of Δn and Δε are required. An important criterion for the material choice is the match between the ordinary refractive index and that of the polymer. Some of the most often employed liquid crystals are E7 and the BL series. They have values of Δn and Δε of ∼0.21–0.28 and ∼13–18, respectively [10]. An advantage of these liquid crystals is their good compatibility with the acrylate and NOA monomers. Another class is the TL series. They have limited solubility, but are distinguished by good stability, resistance and low driving voltages. Another approach to decrease the driving voltages is to add surfactant-like compounds (such as octanoic acid), whose role is to reduce the surface interactions. Since the morphology determines the properties of the film to a major degree [118], the specific organization inside the droplet is an important object of investigation. One of the most frequently applied methods is transmission imaging with polarization microscope analysis. The birefringent liquid crystal droplets change the polarization state of the light. The linear polarization is converted to elliptical, and from the rate of the polarization rotation, the organization inside the droplet with respect to the optical field can be estimated. Three different nematic organizations are distinguished. Radial and axial configurations are a consequence of normal anchoring at the droplet surface – homeotropic alignment. In the case of tangential (homogeneous) alignment of the liquid crystal molecules at the droplet surface, the configuration is bipolar [119]. Morphological investigations are also performed by scanning and transmission electron microscopy [120]. They provide information about the droplet distribution in the polymer matrix, but not about the configuration inside the droplets. The organization inside the droplet is determined by parameters like droplet size and shape, as well as the surface interactions with the surrounding polymer matrix. These parameters are dependent on the specific compounds as well as on the recording kinetics. It has to be pointed out that HPDLC possess very attractive properties as a medium for switchable holographic recording, which is apparent in the active investigations carried out by many research groups in recent years.
Mainly the high degree of refractive index modulation, the volume character of the recorded gratings, and the unique anisotropic properties and electro-optical behavior attract the scientists' attention. In addition, the whole recording process consists of a single step and allows the application of different optical schemes and geometries. As a consequence of the elaboration and optimization of new HPDLC materials, the following holographic characteristics have been obtained:

• Spectral sensitivity in almost the whole visible range, as well as in the infrared (770–870 nm), through the utilization of different dyes;
• Holographic sensitivity (S) exceeding 3 × 10³ cm/J;
• Spatial frequency > 6000 mm⁻¹;
• Refractive index modulation Δn ∼ 0.05.
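To relate these figures to achievable diffraction efficiency, the sketch below applies Kogelnik's coupled-wave formula for a lossless thick transmission phase grating at Bragg incidence (a standard result, not derived in this chapter); the wavelength, thickness and Bragg angle are assumed, illustrative values.

```python
import math

lam, d, theta = 532e-9, 10e-6, 0.0  # wavelength (m), thickness (m), internal Bragg angle

def eta(dn):
    """Kogelnik efficiency: eta = sin^2(pi * dn * d / (lam * cos(theta)))."""
    nu = math.pi * dn * d / (lam * math.cos(theta))
    return math.sin(nu) ** 2

for dn in (0.005, 0.01, 0.02, 0.03, 0.05):
    print(f"dn = {dn:.3f} -> eta = {eta(dn)*100:5.1f} %")
```

Note the non-monotonic behavior: for a 10 μm layer, an index modulation of a few times 10⁻² is already enough to reach (and pass) the first efficiency maximum, so thickness and Δn must be balanced against each other rather than both maximized.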
Owing to these characteristics, HPDLC are nowadays intensively investigated for numerous and varied practical applications. Simultaneously, the fundamental problems of optimizing the material performance [121, 122] and the underlying physical processes in such systems [123, 124] are also an object of extensive investigation. The study of the mesophase confined in small volumes, where the surface interactions play a major role, is an interesting and topical problem. There is no exact theoretical treatment of the simultaneous photopolymerization, phase separation and mass transfer processes responsible for the diffraction structures in HPDLC. On the other hand, HPDLC find wide application as holographic optical elements in areas like photonic crystals [125], high density information recording [126], electrically controlled diffractive elements [127], tunable focus lenses [128], electro-optical filters [129], and interconnectors and other elements for fiber optics [127]. They have recently been used as elements in information security systems [130] and as feedback elements of compact lasers in order to flip the generated wavelength [131]. Polarization holographic gratings in PDLC have also been reported [132]. First considered for display applications, HPDLC remain one of the most attractive candidates for the different approaches to color and, mostly, 3D displays. HPDLC enable color separation, applicable also to image capturing [133]. Another approach is to use waveguide holograms [134]. Investigations of HPDLC in total internal reflection geometries have already been performed. Slanted transmission diffraction gratings, where the applied electric field controls the total internal reflection conditions for the horizontally polarized light (electric vector parallel to the plane of incidence), have been realized [135]. Stetson and Nassenstein holographic gratings, representing total internal reflection and evanescent wave recording in extremely thin layers, have also been successfully recorded [136, 137]. The applications of HPDLC in these numerous and diverse areas are a consequence of the possibility to create different morphologies through the choice of compounds, concentration changes, recording geometry and kinetics.
As mentioned above, another direction of PDLC development is connected with the synthesis of photorefractive polymer dispersed liquid crystals (PR PDLC) as a medium for reversible holographic recording. Their elaboration is a consequence of the development of photorefractive organic materials. In PR PDLC the polymer typically provides the photoconductive properties required for the space-charge field formation, while the liquid crystal provides the refractive index modulation through the orientational nonlinearity. The major advantage of PR PDLC is the significantly lower electric field needed, compared to that required in polymer composites. The first PR PDLC systems were reported in 1997 by two groups [118, 138, 139]. The polymer/liquid crystal mixtures were similar, based on PMMA (poly-methyl methacrylate) polymers and E49 and E44 liquid crystals; the employed sensitizers were different. Also, the recorded gratings differed depending on the obtained regime – Bragg or Raman-Nath. Since these first experiments, the performance of PR PDLC materials has been considerably improved. Internal diffraction efficiencies reaching ∼100% and applied voltages of only about 8 V/μm have been obtained [140, 141]. Although they have such promising characteristics, the remaining weak points of PR PDLC are the relatively high scattering loss and the slow photorefractive dynamics due to the low mobility. The scattering losses are connected with the relatively high liquid crystal concentrations, resulting in droplets of bigger size compared to HPDLC. Although these disadvantages, mostly the response time, were successfully overcome by the substitution of PMMA with PVK (poly-N-vinylcarbazole), this was obtained at the cost of a low diffraction efficiency, reaching only several percent [142]. As a result, a number of physical studies were conducted in various PR PDLC systems in order to get a better understanding of the photorefractive mechanism and to optimize their performance [143, 144, 145, 146].
16.5 Conclusion

The realization of a 3D holographic display is still a challenging task. It requires the encoding of a 3D scene in terms of optical diffraction, transformation into the fringe patterns of the hologram, signal conversion for a spatial light modulator, and display in real time [1]. The ultimate element of this device should be a fast dynamic holographic material possessing high spatial resolution capability. Another problem is connected with the fact that the available spatial light modulators scarcely satisfy the demands of holographic display systems. Since the critical point is their poor spatial resolution, the most probable solution is to synthesize the whole diffraction structure in parts, i.e. to transfer the diffraction structure from the spatial light modulator to a reversible recording medium by multiplication (the Qinetiq concept). Thus, the final device should comprise a certain number of elements, including switchable diffractive optics and reversible recording media. The best
candidate for the switchable optical elements seems to be the nano-sized composite polymer-dispersed liquid crystals. They possess the main advantages of the organic media – simple (dry), one-step processing, high sensitivity, proper mechanical characteristics (plasticity) allowing easy integration in different compact devices, as well as high signal-to-noise ratio and spatial resolution. Recently, most of the efforts have been directed at improving the electro-optical performance, employing total internal reflection holographic recording set-ups, and developing new PDLC mixtures. On the other hand, the ideal reversible material seems to be lacking. As mentioned above, the difficult crystal growth and sample preparation limit the applications of photorefractive crystals; another disadvantage is their relatively high price. Despite the progress in photorefractive organic materials, a number of challenges remain. Among them is the necessity of optimizing each material individually, due to the inability to maximize both steady-state and dynamic performance at the same time. Sub-millisecond response times have not been achieved yet. Overall, the ideal material should have low operating voltages and fast response simultaneously. Also, no complete theoretical treatment exists. Other promising candidates are the dye-doped liquid crystals and the photochromic materials. An advantage of these recording media is the absence of an electric field in the write or read process. The required properties of this class of materials can be summarized as follows: thermal stability of both isomers; resistance to fatigue during cyclic write and erase processes; fast response; high sensitivity [104]. Another approach is to use biological materials, taking advantage of their improved properties due to natural evolution. One such biological material is the photochromic retinal protein bacteriorhodopsin, contained within the purple membrane of haloarchaea species, usually encountered in hypersaline environments [147, 148]. A possible problem, or perhaps an advantage, can be connected with the natural evolution of the biological species, depending on the ambient conditions, and the resulting change in their properties. Otherwise, spectral sensitivity in the range of 520–640 nm along with very high S values – exceeding 10⁶ cm/J – is obtained. The spatial resolution is higher than 1000 lines/mm. Also, more than 10⁶ write-erase cycles have been achieved. In order to illustrate and compare some of the material types presented in the text, a number of their holographic characteristics are presented in Table 16.1. Both materials for permanent and for dynamic (reversible and switchable) holographic recording are considered. Again, we should emphasize here the important application of holographic optical elements (both permanent and switchable) in the area of 3DTV. According to a statement of the 3DTV NoE project coordinator Levent Onural in a BBC interview, the expectation for the realization of pure holographic television is within the next 10–15 years. On the other hand, autostereoscopic displays for 3DTV are commercially available. Thus, the main challenge is to realize a multiple-viewing-zone screen, where a holographic technique has certain advantages over the lenticular systems.
Table 16.1. Comparison between some holographic characteristics of different recording materials (V = volume recording, S = surface recording; S [cm/J] denotes holographic sensitivity)

PERMANENT STORAGE

• Silver halide (V/S): spectral range < 1100 nm; S > 1100 cm/J; up to 10000 lines/mm; Δn = 0.02; thickness 7–20 μm; not rewritable; temperature range < 100 °C; lifetime: years.
• Dichromated gelatin (V): spectral range < 700 nm; S ∼ 100 cm/J; > 5000 lines/mm; Δn = 0.022; thickness 15–35 μm; not rewritable; temperature range < 200 °C; lifetime: years.
• Photopolymers (V): spectral range 514, 532, 650–670 nm; S = 0.5–6.7 × 10³ cm/J; > 5000 lines/mm; Δn = 0.012; thickness 5–500 μm; not rewritable; temperature range < 100 °C***; lifetime: > 10 years.

DYNAMIC RECORDING

• LiNbO3 (V): S = 0.02–0.1 (up to 40) cm/J; > 2000 lines/mm; response time* 0.5–20 s; driving voltage** ∼ kV/cm; Δn = 2 × 10⁻³; thickness 30–3000 μm; rewritable, > 10000 read cycles; temperature range < 500 °C; lifetime: years.
• LiTaO3 (V): spectral range 350–650 and 800–1000 nm; > 2000 lines/mm; response time 0.1–20 s; driving voltage ∼ kV/cm; Δn ∼ 10⁻³; rewritable, > 10000 read cycles; temperature range < 450 °C; lifetime: years.
• KNbO3 (V): spectral range 300–550 and 400–900 nm; > 2000 lines/mm; response time 1 ms–1 s; driving voltage ∼ kV/cm; Δn ∼ 10⁻⁴; rewritable, > 10000 read cycles; temperature range < 200 °C; lifetime: years.
• Sn2P2S6 (V): spectral range 550–1100 nm; S = 1000–5000 cm/J; > 2000 lines/mm; response time 0.5–500 ms; driving voltage ∼ kV/cm; Δn = 3 × 10⁻⁴; rewritable, > 10000 read cycles; temperature range < 66 °C; lifetime: years.
• Azobenzene LC and amorphous polymers / photo-isomerisation (S/V): spectral range 488, 514, 532, 633 nm; S ∼ 10² cm/J; > 6000 lines/mm; response time 10² s; Δn ∼ 0.1; thickness 2–10 μm; rewritable; temperature range < 80–120 °C***; lifetime: years.
• Azobenzene LC and amorphous polymers / surface reliefs (S/V): spectral range 244–532 nm; S ∼ 10² cm/J; > 3000 lines/mm; response time 10² s; Δn ∼ 0.1; thickness 3–5 μm; rewritable; temperature range < 80–120 °C***; lifetime: years.
• Photochromics (S/V): spectral range: visible; S ∼ 3 × 10² cm/J; > 1600 lines/mm; response time: ms; Δn ∼ 10⁻³; thickness > 100 μm; rewritable, > 10⁶ read cycles; lifetime: years.
• PDLC (PIPS/TIPS/photorefraction) (S/V): spectral range 360–532 and 770–870 nm; S > 3 × 10³ cm/J; > 6000 lines/mm; response time: ms; driving voltage ∼ 10 V/μm; Δn ∼ 0.05; thickness 20–100 μm; not rewritable for PIPS/TIPS, rewritable for PR; temperature range < 45–100 °C***; lifetime: years.
• Dye-doped nematic (S/V): spectral range 440–514 nm; S ∼ 3 × 10³ cm/J; > 1000 lines/mm; response time: ms; driving voltage ∼ 0.1 V/μm (PR); Δn ∼ 0.1; thickness 10–20 μm; rewritable; temperature range < 48–95 °C; lifetime: years.
• Bacteriorhodopsin in gelatine matrix (V/S): spectral range 520–640 nm; S = 4.7 × 10⁶ cm/J; > 1000 lines/mm; response time: ms; Δn = 2 × 10⁻³; thickness 30–40 μm; rewritable, > 10⁶ write-erase cycles; temperature range −20 to +40 °C; lifetime: > 10 years.

*For dynamic media only (for materials exhibiting permanent storage it is usually the time to obtain any diffracted signal from the hologram).
**Where an electric field is employed to switch the structure.
***Strongly dependent on the molecular weight and the polymer type.
Most probably, the investigations in the field of materials for display and switchable diffractive devices in the near future will be concentrated on nanoparticle–liquid crystal composites. The current development of nanoparticle dispersions has shown excellent holographic characteristics [149, 150]. The main advantage is the possibility to obtain extremely high refractive index modulation, since materials like TiO2 have a refractive index of almost 3. Also, low shrinkage and good sensitivity are obtained. In general, the process is similar to HPDLC grating formation – the photopolymerization process initiates mass transfer of the components. The challenge is to combine low-energy-consuming liquid crystal devices with the possibility to enhance the modulation by nanoparticle redistribution and to obtain the formation of reversible diffractive structures.
Acknowledgement This work is supported by EC within FP6 under Grant 511568 with the acronym 3DTV.
References

1. V. Sainov, E. Stoykova, L. Onural, H. Ozaktas, Proc. SPIE, 6252, 62521C (2006).
2. J.R. Thayn, J. Ghrayeb, D.G. Hopper, Proc. SPIE, 3690, 180 (1999).
3. I. Sexton, P. Surman, IEEE Signal Process., 16, 85 (1999).
4. J. Kollin, S. Benton, M.L. Jepsen, Proc. SPIE, 1136, 178 (1989).
5. J. Thayn, J. Ghrayeb, D. Hopper, Proc. SPIE, 3690, 180 (1999).
6. T. Shimobaba, T. Ito, Opt. Rev., 10, 339 (2003).
7. D. Dudley, W. Duncan, J. Slaughter, Proc. SPIE, 4985, 14 (2003).
8. www.holographicimaging.com.
9. www.qinetiq.com.
10. T.J. Bunning, L.V. Natarajan, V. Tondiglia, R.L. Sutherland, Annu. Rev. Mater. Sci., 30, 83 (2000).
11. V. Sainov, N. Mechkarov, A. Shulev, W. De Waele, J. Degrieck, P. Boone, Proc. SPIE, 5226, 204 (2003).
12. S. Guntaka, V. Sainov, V. Toal, S. Martin, T. Petrova, J. Harizanova, J. Opt. A: Pure Appl. Opt., 8, 182 (2006).
13. H. Coufal, D. Psaltis, G. Sincerbox, Holographic Data Storage, Springer: Berlin (2000).
14. K. Tanaka, K. Kato, S. Tsuru, S. Sakai, J. Soc. Inf. Disp., 2, 37 (1994).
15. R. Collier, C. Burckhardt, L. Lin, Optical Holography, Academic Press: New York, London (1971).
16. T. Petrova, P. Popov, E. Jordanova, S. Sainov, Opt. Mater., 5, (1996).
17. H. Bjelkhagen, Silver Halide Recording Materials for Holography and Their Processing, Springer-Verlag: Heidelberg, New York (1995); Vol. 66.
18. Ts. Petrova, N. Tomova, V. Dragostinova, S. Ossikovska, V. Sainov, Proc. SPIE, 6252, 155 (2006).
19. K. Buse, Appl. Phys. B, 64, 391 (1997).
20. V. Mogilnai, Polymeric Photosensitive Materials and Their Application (in Russian), BGU (2003).
21. G. Ponce, Tsv. Petrova, N. Tomova, V. Dragostinova, T. Todorov, L. Nikolova, J. Opt. A: Pure Appl. Opt., 6, 324 (2004).
22. T. Yamamoto, M. Hasegawa, A. Kanazawa, T. Shiono, T. Ikeda, J. Mater. Chem., 10, 337 (2000).
23. S. Blaya, L. Carretero, R. Madrigal, A. Fimia, Opt. Mater., 23, 529 (2003).
24. P.S. Drzaic, Liquid Crystal Dispersions, World Scientific: Singapore (1995).
25. O. Ostroverkhova, W.E. Moerner, Chem. Rev., 104, 3267 (2004).
26. G.P. Wiederrecht, Annu. Rev. Mater. Res., 31, 139 (2001).
27. A. Ashkin, G.D. Boyd, J.M. Dziedzic, R.G. Smith, A.A. Ballman, et al., Appl. Phys. Lett., 9, 72 (1966).
28. C.R. Giuliano, Phys. Today, April, 27 (1981).
29. K. Buse, Appl. Phys. B, 64, 273 (1997).
30. W. Moerner, S. Silence, F. Hache, G. Bjorklund, J. Opt. Soc. Am. B, 11, 320 (1996).
31. P. Günter, J. Huignard, Photorefractive Effects and Materials, Springer-Verlag: New York (1988); Vol. 61–62.
32. J. Amodei, W. Phillips, D. Staebler, Appl. Opt., 11, 390 (1972).
33. G. Peterson, A. Glass, T. Negran, Appl. Phys. Lett., 19, 130 (1971).
34. E. Krätzig, K. Buse, Two-Step Recording in Photorefractive Crystals, in Photorefractive Materials and their Applications, P. Günter, J.P. Huignard (Eds.), Springer-Verlag: Berlin, Heidelberg (2006).
35. H. Kröse, R. Scharfschwerdt, O.F. Schirmer, H. Hesse, Appl. Phys. B, 61, 1 (1995).
36. A.A. Grabar, I.V. Kedyk, M.I. Gurzan, I.M. Stoika, A.A. Molnar, Yu.M. Vysochanskii, Opt. Commun., 188, 187 (2001).
37. V. Marinova, M. Hsieh, S. Lin, K. Hsu, Opt. Commun., 203, 377 (2003).
38. V. Marinova, Opt. Mater., 15, 149 (2000).
39. V. Marinova, M. Veleva, D. Petrova, I. Kourmoulis, D. Papazoglou, A. Apostolidis, E. Vanidhis, N. Deliolanis, J. Appl. Phys., 89, 2686 (2001).
40. K. Buse, H. Hesse, U. van Stevendaal, S. Loheide, D. Sabbert, E. Krätzig, Appl. Phys. A, 59, 563 (1994).
41. F. Simoni, Nonlinear Optical Properties of Liquid Crystals and Polymer Dispersed Liquid Crystals, World Scientific: Singapore (1997).
42. W.H. de Jeu, Physical Properties of Liquid Crystalline Materials, Gordon and Breach: New York (1980).
43. I.C. Khoo, F. Simoni, Physics of Liquid Crystalline Materials, Gordon and Breach: Philadelphia (1991).
44. P.G. de Gennes, The Physics of Liquid Crystals, Oxford University Press: London (1974).
45. L. Blinov, Electro- and Magnetooptics of Liquid Crystals, Nauka: Moscow (1978).
46. S. Slussarenko, O. Francescangeli, F. Simoni, Y. Reznikov, Appl. Phys. Lett., 71, 3613 (1997).
47. F. Simoni, O. Francescangeli, Y. Reznikov, S. Slussarenko, Opt. Lett., 22, 549 (1997).
48. H. Ono, T. Sasaki, A. Emoto, N. Kawatsuki, E. Uchida, Opt. Lett., 30, 1950 (2005).
49. I. Khoo, IEEE J. Quantum Electron., 32, 525 (1996).
50. Y. Wang, G. Carlisle, J. Mater. Sci: Mater. Electron., 13, 173 (2002).
51. T. Kosa, I. Janossy, Opt. Lett., 20, 1230 (1995).
52. T. Galstyan, B. Saad, M. Denariez-Roberge, J. Chem. Phys., 107, 9319 (1997).
53. I. Khoo, S. Slussarenko, B. Guenther, M. Shin, P. Chen, W. Wood, Opt. Lett., 23, 253 (1998).
54. S. Martin, C. Feely, V. Toal, Appl. Opt., 36, 5757 (1997).
55. T. Yamamoto, M. Hasegawa, A. Kanazawa, T. Shiono, T. Ikeda, J. Mater. Chem., 10, 337 (2000).
56. L. Ren, L. Liu, D. Liu, J. Zu, Z. Luan, Opt. Lett., 29, 186 (2003).
57. W. Yan, Y. Kong, L. Shi, L. Sun, H. Liu, X. Li, Di. Zhao, J. Xu, S. Chen, L. Zhang, Z. Huang, S. Liu, G. Zhang, Appl. Opt., 45, 2453 (2006).
58. X. Yue, A. Adibi, T. Hudson, K. Buse, D. Psaltis, J. Appl. Phys., 87, 4051 (2000).
59. Q. Li, X. Zhen, Y. Xu, Appl. Opt., 44, 4569 (2005).
60. Y. Guo, L. Liu, D. Liu, S. Deng, Y. Zhi, Appl. Opt., 44, 7106 (2005).
61. M. Muller, E. Soergel, K. Buse, Appl. Opt., 43, 6344 (2004).
62. H. Eggert, J. Imbrock, C. Bäumer, H. Hesse, E. Krätzig, Opt. Lett., 28, 1975 (2003).
63. V. Marinova, S. Lin, K. Hsu, M. Hsien, M. Gospodinov, V. Sainov, J. Mater. Sci: Mater. Electron., 14, 857 (2003).
64. J. Carns, G. Cook, M. Saleh, S. Guha, S. Holmstrom, D. Evans, Appl. Opt., 44, 7452 (2005).
65. M. Ellabban, M. Fally, R. Rupp, L. Kovacs, Opt. Express, 14, 593 (2006).
66. K. Sutter, P. Günter, J. Opt. Soc. Am. B, 7, 2274 (1990).
67. W. Moerner, A. Grunnet-Jepsen, C. Thompson, Annu. Rev. Mater. Res., 27, 585 (1997).
68. J. Hulliger, K. Sutter, R. Schlesser, P. Günter, Opt. Lett., 18, 778 (1993).
69. G. Knopfle, C. Bosshard, R. Schlesser, P. Günter, IEEE J. Quantum Electron., 30, 1303 (1994).
70. S. Ducharme, J. Scott, R. Twieg, W. Moerner, Phys. Rev. Lett., 66, 1846 (1991).
71. L. Yu, W. Chan, Z. Bao, S. Cao, Macromolecules, 26, 2216 (1992).
72. B. Kippelen, K. Tamura, N. Peyghambarian, A. Padias, H. Hall, Phys. Rev. B, 48, 10710 (1993).
73. G. Malliaras, V. Krasnikov, H. Bolink, G. Hadziioannou, Appl. Phys. Lett., 66, 1038 (1995).
74. G. Valley, F. Lam, in Photorefractive Materials and Their Applications I, P. Günter, J. Huignard (Eds.), Springer-Verlag: Berlin (1988).
75. J. Schildkraut, A. Buettner, J. Appl. Phys., 72, 1888 (1992).
76. E. Rudenko, A. Shukhov, J. Exp. Theor. Phys. Lett., 59, 142 (1994).
77. I. Khoo, H. Li, Y. Liang, Opt. Lett., 19, 1723 (1994).
78. I. Khoo, Liquid Crystals: Physical Properties and Nonlinear Optical Phenomena, Wiley: New York (1995).
79. N. Tabiryan, A. Sukhov, B. Zeldovich, Mol. Cryst. Liq. Cryst., 136, 1 (1986).
80. G. Wiederrecht, B. Yoon, M. Wasielewski, Science, 270, 1794 (1995).
81. G. Wiederrecht, B. Yoon, M. Wasielewski, Science, 270, 1794 (1995).
82. H. Ono, N. Kawatsuki, Opt. Lett., 24, 130 (1999).
83. H. Ono, T. Kawamura, N. Frias, K. Kitamura, N. Kawatsuki, H. Norisada, Adv. Mater., 12, (2000).
84. H. Ono, A. Hanazawa, T. Kawamura, H. Norisada, N. Kawatsuki, J. Appl. Phys., 86, 1785 (1999).
85. K. Law, Chem. Rev., 93, 449 (1993).
86. S. Bartkiewicz, K. Matczyszyn, K. Janus, Real time holography – materials and applications, in EXPO 2000, Hannover (2000).
87. I. Janossy, A.D. Lloyd, B. Wherrett, Mol. Cryst. Liq. Cryst., 179, 1 (1990).
88. K. Ichimura, Chem. Rev., 100, 1847 (2000).
89. H. Ono, T. Sasaki, A. Emoto, N. Kawatsuki, E. Uchida, Opt. Lett., 30, 1950 (2005).
90. B. Senyuk, I. Smalyukh, O. Lavrentovich, Opt. Lett., 30, 349 (2005).
91. H. Sarkissian, S. Serak, N. Tabiryan, L. Glebov, V. Rotar, B. Zeldovich, Opt. Lett., 31, 2248 (2006).
92. I. Khoo, Opt. Lett., 20, 2137 (1995).
93. W. Lee, C. Chiu, Opt. Lett., 26, 521 (2001).
94. L. Dhar, MRS Bulletin, 31, 324 (2006).
95. A. Osman, M. Fischer, P. Blanche, M. Dumont, Synth. Metals, 115, 139 (2000).
96. J. Delaire, K. Nakatani, Chem. Rev., 100, 1817 (2000).
97. A. Sobolewska, A. Miniewicz, E. Grabiec, D. Sek, Cent. Eur. J. Chem., 4, 266 (2006).
98. A. Natansohn, P. Rochon, Photoinduced motions in azobenzene-based polymers, in Photoreactive Organic Thin Films, Z. Sekkat, W. Knoll (Eds.), Academic Press: San Diego (2002), p. 399.
99. P. Lefin, C. Fiorini, J. Nunzi, Opt. Mater., 9, 323 (1998).
100. C. Barrett, A. Natansohn, P. Rochon, J. Chem. Phys., 109, 1505 (1998).
101. I. Naydenova, L. Nikolova, T. Todorov, N. Holme, P. Ramanujam, S. Hvilsted, J. Opt. Soc. Am. B, 15, 1257 (1998).
102. O. Baldus, S. Zilker, Appl. Phys. B, 72, 425 (2001).
103. J. Kumar, L. Li, X. Jiang, D. Kim, T. Lee, S. Tripathy, Appl. Phys. Lett., 72, 2096 (1998).
104. E. Kim, J. Park, S. Cho, N. Kim, J. Kim, ETRI J., 25, 253 (2003).
105. J.L. Fergason, SID Int. Symp. Dig. Tech. Pap., 16, 68 (1985).
106. J.W. Doane, N.A. Vaz, B.-G. Wu, S. Zumer, Appl. Phys. Lett., 48, 269 (1986).
107. P.S. Drzaic, J. Appl. Phys., 60, 2142 (1986).
108. T.J. Bunning, L.V. Natarajan, V.P. Tondiglia, G. Dougherty, R.L. Sutherland, J. Polym. Sci., Part B: Polym. Phys., 35, 2825 (1997).
109. T.J. Bunning, L.V. Natarajan, V. Tondiglia, R.L. Sutherland, Polymer, 36, 2699 (1995).
110. P.G. De Gennes, J. Prost, The Physics of Liquid Crystals, 2nd ed., Oxford University Press: New York (1993).
111. J.W. Doane, N.A. Vaz, B.-G. Wu, S. Zumer, Appl. Phys. Lett., 48, 269 (1986).
112. G. Montgomery, J. Nuno, A. Vaz, Appl. Opt., 26, 738 (1987).
113. R.L. Sutherland, L.V. Natarajan, V.P. Tondiglia, T.J. Bunning, Chem. Mater., 5, 1533 (1993).
114. R. Pogue, R. Sutherland, M. Schmitt, L. Natarajan, S. Siwecki, V. Tondiglia, T. Bunning, Appl. Spectroscopy, 54, 12A (2000).
16 Materials for Holographic 3DTV Display Applications 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143. 144. 145. 146.
597
D. Neckers, J. Chem. Ed., 64, 649 (1987). D. Neckers, J. Photochem. Photobiol., A: Chem., 47, 1 (1989). K. Tanaka, K. Kato, M. Date, Jpn. J. Appl. Phys., 38, L277 (1999). T.J. Bunning, L.V. Natarajan, V.P. Tondiglia, G. Dougherty, R.L. Sutherland, J. Polym. Sci., Part B: Polym. Phys., 35, 2825 (1997). R. Ondris-Crawford, E.P. Boyko, B.G. Wagner, J.H. Erdmann, S. Zumer, J.W. Doane, J. Appl. Phys., 69, 6380 (1991). T. Bunning, L. Natarajan, V. Tondiglia, R. Sutherland, D. Vezie, W. Adams, Polymer, 36, 2699 (1995). G. De Filpo, J. Lanzo, F.P. Nicoletta, G. Chidichimo, J. Appl. Phys., 84, 3581 (1998). L. Petti, G. Abbate, W.J. Blau, D. Mancarella, P. Mormile, Mol. Cryst. Liq. Cryst., 375, 785 (2002). D.R. Cairns, C.C. Bowley, S. Danworaphong, A.K. Fontecchio, G.P. Crawford, Le Li, S. Faris, Appl. Phys. Lett., 77, 2677 (2000). A. Mertelj, L. Spindler, M. Copic, Phys. Rev. E., 56, 549 (1997). R. Sutherland, V. Tondiglia, L. Natarajan, S Chandra, T. Bunning, Opt. Express, 10, 1074 (2002). L. Criante, K. Beev, D.E. Lucchetta, F. Simoni, S. Frohmann, S. Orlic, Proc. SPIE, 5939, 61 (2005). G. Crawford, Optics and Photonics News, April, 54 (2003). H. Ren, Y. Fan, S. Wu, Appl. Phys. Lett., 83, 1515 (2003). R. Sutherland, L. Natarajan, V. Tondiglia, T. Bunning, Proc. SPIE, 3421, 8 (1998). L. Luccheti, S. Bella, F. Simoni, Liq. Cryst., 29, 515 (2002). D. Lucchetta, L. Criante, O. Francescangeli, F. Simoni, Appl. Phys. Lett., 84, 4893 (2004). D.E. Lucchetta, R. Karapinar, A. Manni, F. Simoni, J. Appl. Phys., 91, 6060 (2002). T. Fiske, L. Silverstein, J. Colegrove, H. Yuan, SID Int. Symp. Dig. Tech. Pap., 31, 1134 (2000). T. Suhara, H. Nishihara, J. Koyama, Opt. Commun., 19, 353 (1976). H. Xianyu, J. Qi, R. Cohn, G. Crawford, Opt. Lett., 28, 792 (2003). K. Beev, L. Criante, D. Lucchetta, F. Simoni, S. Sainov, J. Opt. A: Pure Appl. Opt., 8, 205 (2006). K. Beev, L. Criante, D. Lucchetta, F. Simoni, S. Sainov, Opt. Commun., 260, 192 (2006). H. Ono, N. Kawatsuki, Opt. Lett., 22, 1144 (1997). A. Golemme, B. Volodin, E. Kippelen, N. Peyghambarian, Opt. Lett., 22, 1226 (1997). N. Yoshimoto, S. Morino, M. Nakagawa, K. Ichimura, Opt. Lett., 27, 182 (2002). J. Winiarz, P. Prasad, Opt. Lett., 27, 1330 (2002). R. Termine, A. Golemme, Opt. Lett., 26, 1001 (2001). H. Ono, H. Shimokawa, A. Emoto, N. Kawatsuki, Polymer, 44, 7971 (2003). G. Cipparrone, A. Mazzulla, P. Pagliusi, Opt. Commun., 185, 171 (2000). H. Ono, H. Shimokawa, A. Emoto, N. Kawatsuki, J. Appl. Phys, 94, 23 (2003). A. Golemme, B. Kippelen, N. Peyghambarian,, Chem. Phys. Lett., 319, 655 (2000).
598
K. Beev et al.
147. B. Yao, Z. Ren, N. Menke, Y. Wang, Y. Zheng, M. Lei, G. Chen, N. Hampp, Appl. Opt., 44, 7344 (2005). 148. A. Fimia, P. Acebal, A. Murciano, S. Blaya, L. Carretero, M. Ulibarrena, R. Aleman, M. Gomariz, I. Meseguer, Opt. Express, 11, 3438 (2003). 149. N. Suzuki, Y. Tomita, Jpn. J. Appl. Phys., 42, L927 (2003). 150. Y. Tomita, N. Suzuki, K. Chikama, Opt. Lett., 30, 839 (2005).
17 Three-dimensional Television: Consumer, Social, and Gender Issues

Haldun M. Ozaktas

Bilkent University, TR-06800 Bilkent, Ankara, Turkey
This chapter is based on a series of discussions which were planned and carried out within the scope of the Integrated Three-Dimensional Television—Capture, Transmission, and Display project, which is a Network of Excellence (NoE) funded by the European Commission 6th Framework Information Society Technologies Programme. The project involves more than 180 researchers in 19 partner institutions from 7 countries throughout Europe and extends over the period from September 2004 to August 2008. The scope of the discussions encompassed consumer expectations and behavior, including current perceptions of three-dimensional television (3DTV), its potential for novelty and mass consumption, and other consumer and non-consumer applications and markets for the technology. Other areas discussed included the social dimensions of 3DTV in both consumer and non-consumer spheres, and how it compares with other high-impact technologies. Gender-related issues were also discussed to some degree.

Neither the manner in which the discussions were conducted, nor the way in which they were processed, was based on a scientific methodology. All discussions were informal in nature, with the moderator periodically putting up discussion points to raise new issues or focus the discussion. Most participants in these discussions were technical professionals or academicians with backgrounds in engineering and science who were members of the Network of Excellence. A number of professionals from other areas also enriched the discussions, and a small sample of laypersons and potential consumers was interviewed briefly.

Our reporting here by no means represents a sequential record of the live and e-mail discussions, which spanned a period of over two years. Opinions provided at different times and places were montaged thematically to achieve a unified presentation and were heavily edited. The discussions given here may seem naive (or worse, misguided) to social scientists with more sophisticated skills and more experience in thinking about such issues. Our hope is that if the content of this chapter does not actually illuminate the issues within its scope, it may at least shed light on the level of thinking and the concerns of those who are actively developing the technology.
In that case, we hope that this chapter will serve as an insider record of the ruminations of the developers of a technology about its implications, made during an intermediate stage of its development. It will certainly be interesting to consider these in retrospect ten or twenty years from now.
Part I: Introduction

17.1 Introduction

It would certainly be a mistake to look upon three-dimensional television (3DTV) as solely the latest in the line of media technologies from radio to black-and-white television to color television, although therein may lie its greatest economic potential. Complicated chains and networks of causality underlie the interaction between many technologies and society. It is important to distinguish between the impacts of social and technological entities, although they are intimately related. Television as a social institution has been thoroughly discussed, generally with a negative tone. Totally different, however, is the legacy of television in the sense of broadcasting technology, or in the sense of the cathode ray tube (CRT), the central technology in conventional television. This technology has been perfected for consumer television units, but today finds many applications, most notably in computer display terminals. Indeed, it can be argued that the CRT has had a greater impact in computing than in television. (Ironically, the liquid crystal display (LCD) found a place in computing first, and then later in television sets.)

Lastly, it is important to make a distinction between 3D displays and 3D television (3DTV). Here we use the term 3D display to refer to imaging devices which create 3D visual output. 3DTV refers to the whole chain of 3D image acquisition, encoding, transport/broadcasting, and reception, as well as display.

We must also be cautious when referring to the impact of a technology on society, as this implies one-way causation; technology may have an impact on society, but society also has an effect on technology. Such considerations complicate any prediction regarding the impact of 3DTV. However, it seems very likely that it will have an important impact. Home video, cable, broadcast, and games are potentially highly rewarding areas for early-entrance companies, since it may take a while before the technology can be emulated by others.

Widespread public acceptance of this technology is very difficult to predict and will depend largely on the quality attained. If only mediocre quality is feasible, market penetration may be shallow and short-lived, relying more on novelty aspects, which are known to wear off quickly. People may prefer a high-quality two-dimensional image to a medium-quality three-dimensional one, especially if there are limitations on viewing angle, contrast, equipment size, and cost. Even so, three-dimensional television has been so heavily portrayed in film and fiction that a significant number of consumers may show interest despite possible shortcomings. On the other hand,
if reasonably high quality can be attained, even at an initially high price, it is possible, and indeed likely, that the technology may supplant ordinary television in at least some contexts.

The potential consumer market should not blind one to the opportunities in other, more specialized applications. Most of these will not demand as high a quality as consumer applications, and may involve customers willing to pay higher prices. While it is not clear that 3DTV would be widely used for computer display terminals, there is a wide variety of specialized applications. These may include sophisticated computer games, professional simulators and virtual reality systems, teleconferencing, special-purpose applications including scientific and industrial visualization, inspection, and control, medical visualization and remote diagnosis and treatment including telesurgery, environmental monitoring, remote operation in hazardous environments, air traffic control, architectural and urban applications, and virtual preservation of perishable objects of cultural heritage.

If we accept that two-dimensional imaging and display technologies have had a positive impact on modern society, it seems almost certain that the above applications will produce a positive impact, even if 3DTV does not become a standard item in every home. For instance, the fact that people still travel to meet face-to-face is evidence that even modern teleconferencing cannot fully replace physical proximity. If 3DTV can come close enough, this would have a large impact on how meetings are conducted. This would include not only official and corporate meetings (reducing the cost of products and services to society), but also the meetings of civil society organizations, potentially increasing public participation at all levels.

Three-dimensional television should not be seen in isolation from other trends in media technology, most importantly interactive or immersive technologies. Clichés holding television responsible for the drop in theater attendance or reading will gain new strength if such technologies become widespread. The main question will again focus on what the new technologies will replace or displace.

In summary, the potential applications of the technology fall into two main categories: a three-dimensional replacement of present-day television, and a variety of specialized applications. The impact of the latter could be moderate to high benefits to society in economic and welfare terms. The impact of the former is less predictable, but there is the potential for very high economic returns to those who own the technologies.
17.2 Historical Perspective

I. Rakkolainen provided an extended account of pertinent historical observations, summarized at length in this section. He noted that many have dreamed of Holodeck- or Star Wars-like 3D displays and that 3D images have attracted
interest for over a century. The general public was excited about 3D stereophotographs in the 19th century, 3D movies in the 1950s, and holography in the 1960s, and is now excited by 3D computer graphics and virtual reality (VR). The technology of 3D displays has deeply intrigued the media and the public. Hundreds of different principles, ideas, and products have been presented, with potential applications to scientific visualization, medical imaging, telepresence, games, and 3D movies. The broad field of VR has driven the computer and optics industries to produce better head-mounted displays and other types of 3D displays; however, most such VR efforts involve wearing obtrusive artifacts, an experience in stark contrast with the ease of watching TV.

Immersion is an experience that encloses the user in a synthetically generated world. Contemporary 3D displays try to achieve this through elaborate schemes, but this is not only a matter of technology; the most important factor for immersion is not technical fidelity but the user's attitude and possibly the skill of the content author. A theater scene or a novel can be quite "immersive" although it does not involve very advanced technology.

Just before the first photographs were made in 1839, stereo viewing was invented. The first stereoscope tried to reproduce or imitate reality with the aid of an astonishing illusion of depth. A decade later, when less cumbersome viewing devices were developed, stereoscopic photography became popular. The stereo image pairs immersed the viewer in real scenes (they are still popular in the form of toys). Then, starting at the end of the 19th century, moving pictures reproduced a world of illusion for the masses.

The idea of synthetically reproduced reality is not new and does not necessarily rely on digital technology. In 1860 the astronomer and scientist Herschel wrote about his vision of representing scenes in action and handing them down to posterity. Cinema and TV have somewhat fulfilled his vision.

I. Rakkolainen went on to list a large number of popular mass-produced cameras of the late 19th century, each of which used slightly different technologies and designs with no standards: Academy, Brownie, Buckeye, Comfort, Compact, Cosmopolitan, Delta, Eclipse, Filmax, Frena, Harvard, Kamaret, Kodak, Kombi, Lilliput, Luzo, Nodark, Omnigraphe, Photake, Poco, Simplex, Takiv, Velographe, Verascope, Vive, Weno, Wizard, and Wonder. Only Kodak survived and became a huge business. The Kodak camera was by no means a superior technology. It used a roll film long enough for 100 negatives, but the key element of its success was perhaps that Kodak provided a photofinishing service for customers; apparently having to do the lab work was an obstacle for many. Rakkolainen believes that this resembles the current situation with 3DTV. The same enthusiasm that greeted photography, stereographs, and the Lumière brothers' Cinématographe at the end of the 19th century is now seen with 3DTV, virtual reality, and other related technologies.
Part II: Consumer Expectations and Behavior

17.3 Current Public Perceptions of "Three-dimensional Television"

What do lay people think of when confronted with the phrase "three-dimensional television"? This question was posed to people from different social and educational backgrounds. Among the brief answers collected we note the following:

• People think of the image/scene jumping out, or somehow extending from the front of the screen. Although not everyone had seen Princess Leia projected by R2D2 in Star Wars, the idea of a crystal ball is widespread in folklore. However, most people seem to imagine a vertical display like conventional TV, rather than a horizontal tabletop scenario.
• So-called "three-dimensional" computer games, which are not truly three-dimensional, but where the action takes place in a three-dimensional domain as opposed to early computer games which take place in "flatland."
• "Nothing."
M. Kunter noted that some thought of 3DTV as a full 3D projection of objects into the room, but nobody referred to the "Holodeck" scenario (being and acting in a virtual reality environment). This may be connected to A. Boev's distinction between what he refers to as convergent and divergent 3D displays: he defines convergent 3D as the case where the user stays outside the presentation; the presentation can be seen from different points of view, like observing a statue or attending the theater. He defines divergent 3D as the case where the user is inside the presentation, and is able to look around and change points of view. Boev noted that this is often compared to an immersive multimedia-type game, and is in some ways like listening to radio theater, which also puts the user in a similar state of mind of "being inside" the presentation. The observation that a radio play makes one feel inside, compared to TV where one feels outside, seems very important; the perception of insideness, which is considered an aspect of realism, does not necessarily increase with the amount of information conveyed.

According to I. Rakkolainen, 3DTV may take many different forms; it may be similar to today's TVs but with 3D enhancements, IMAX-like partially immersive home projection screens, tracking head-mounted displays, holographic displays, or perhaps "immaterial" images floating in the air. He emphasized that we should be open-minded about the possibilities. A. Boev noted that the very use of the term 3DTV was limiting in that it forced people to think of a box, and excluded other modalities. M. Karamüftüoğlu asked whether 3DTV would be immersive and/or interactive, or merely offer depth information. He also noted that the TV and computer box might disappear, with all such technologies converging to a ubiquitous, pervasive presence.
M. Özkan told an interesting anecdote exemplifying the power of media and marketing: he had asked TV sales staff in electronics stores whether they had heard about 3DTV, and amazingly they said that it was "coming soon." And what they were referring to was not any stereoscopic display, but a device displaying miniature football players on a table-like surface; they had seen it on a TV program featuring the 3DTV NoE! Furthermore, they linked this "near-future product" to recent price cuts in plasma and LCD TV screens. This anecdote is powerful evidence of how certain images can capture the public imagination.

H. M. Ozaktas recalled that one US telephone commercial from about ten years ago showed a family reunion for a child's birthday party taking place through teleconferencing. The image took up a whole wall, making it seem that the remote participants were in the other half of the room. Clearly, the makers of the commercial were trying to similarly capture the imagination of their audience.

F. Porikli observed that there is an imagination gap between the generation who grew up watching Star Wars episodes and earlier generations. People who have watched IMAX movies tended to imagine 3DTV as a miniature version of the movie theater in their homes. Younger generations are more open to the idea of a holographic tabletop display. In any event, people imagine that they will be able to move freely in the environment and still perceive the content in full 3D (which can lead to disappointment if the viewer position needs to be restricted). Since conventional TV viewing is a passive activity, people do not usually have the expectation that they should be able to interact with the scene or have any effect on the program they are watching.
17.4 Lay Persons' Expectations

What do lay people expect from such a product? What features, function, and quality do they expect?

Today people take for granted high-quality 2D images, and it would be unrealistic to expect them to put up with even moderately lower-quality images in 3DTV. If the images are not clear and crisp, or if they are hard to look at in any way, it is unlikely that people will watch. For a significant amount of TV content, 2D screens are already realistic enough, as M. Kautzner noted. Although in a technical sense one may think that 3D is more "realistic" than 2D, that may be a fallacy. "Realisticness" is very psychological: a clear, crisp color image is very realistic to a lot of people watching TV or a film, whereas a 3D image which deviates even a little from this crispness and contrast may look terrible. Humans possibly do not really miss true 3D information, since they can deduce enough of it from the content. Human imagination is such that even if we see a reduced representation of reality, such as a black-and-white photo, a 2D image, or even a sketchy caricature, we can fill it in in our minds and visualize its realistic counterpart. Black-and-white photos are quite realistic despite the loss of color information. Other than an arrow flying towards
you or a monster jumping at you (contrived actions familiar from the old colored-glass 3D films), it is not clear exactly what information 3DTV is going to convey that will be important to viewers. Thus if the only thing 3DTV has to offer is the novelty factor, it will not be a mass market. The opposite argument could be that, by the same token, people did not really need color information either; black-and-white TV was just fine, but color TV still caught on. Nevertheless, the introduction of color did not entail much sacrifice of quality; G. Ziegler remarked that 3DTV will have a difficult time if it is of lower quality, and this will be all the more true if it is difficult to watch or strains the eyes.

Some consumers expect the same kind of aquarium-like display as contemporary TV, but one that somehow conveys some sense of depth. Other consumers expect to be able to move around the display freely and to see the view from different angles. Another group of consumers totally lacks any vision of "true" 3D, and merely expects 3D graphics on a flat panel, as in current 3D computer games. And a significant group of consumers seems to have hardly any idea of what the term might imply. These observations imply that it may be important to educate potential consumers that the 3D we are talking about is something more than the 3D of a perspective drawing.

F. Porikli noted that while non-entertainment users of 3DTV may be willing to forgo several comfort or convenience features that are not pertinent to the application, the expectations of household entertainment consumers may be higher. People do not like the idea of wearing goggles or markers or beacons, and they certainly do not like having limited viewing positions or low resolution. Consistency of 3D image quality with respect to viewer motion and position is another important factor. As for price, Porikli believed that any display product costing over 5000 USD is not likely to be widely accepted.

I. Rakkolainen argued that rather than trying to achieve a perfect 3D display, tricks and approximations must be used to obtain a reasonably priced and good-enough display for general use. Indeed, while R&D group A may be focused on "true" or "real" holographic reconstruction, R&D group B may get to market with a sloppy, pseudo-, quasi-, really-not-deserving-the-name product which nevertheless satisfies these conditions. The question is, what aspects of 3DTV will be important and attractive to consumers, and which will be irrelevant? Maybe true 3D parallax and the ability to walk around, which are hallmarks of true 3D, may not matter; maybe people will be comfortable simply with more depth cues on a flat screen. J. Kim noted that, in fact, in many cases 2D cues are sufficient for depth perception.

N. Stefanoski commented on the issue of whether more information is always desirable. In some cases, conveying the maximum amount of information may be desirable (perhaps for sports events, teleconferencing, or virtual shops), but in other cases there will not be much consumer desire to choose the viewing perspective. In fact, in some cases fixing the perspective may be desired
for artistic reasons or genre convention (hiding the face of the murderer in a mystery film).

A. Boev noted the importance of studying consumer expectations. Although consumers are often "taught" what they need in the case of some products, for a novel and potentially expensive product it may be important to know what the buyers expect. The Nintendo Virtual Boy was promoted as a "3D game system," which made people expect images floating in the air. When people realized it only worked with glasses, almost everybody was heavily disappointed, and it was a failure.

A. Boev also emphasized the importance of two-way compatibility: 3DTV sets should be able to display 2D programs, and 2D sets should be able to display a 2D version of 3D programs. This would be a general expectation based on the history of the transition to color. F. Porikli also emphasized that, at the very least, any 3DTV should be backward-compatible with 2D content.

D. Kaya-Mutlu noted that TV is here being conceived largely as a visual medium, as a conveyor of visual information, and the viewer's relation to TV is being conceived mainly as a matter of decoding/processing visual information, as a matter of visual perception. This is understandable if we assume that the major contribution of 3DTV is the enhancement of images. She pointed out that this misses other important components of TV content, such as talk, and more importantly other social functions beyond being an information conveyor, such as providing background sound, serving as an accompaniment, or serving as a means to structure unstructured home time. These are all functions of the household TV set, and whether they will transfer to 3DTV may be an important determinant.
17.5 Sources of Public Perceptions and Conceptions

What past or present technologies or fictional sources have influenced people's conceptions of such a technology? Some of the answers collected were:

• Colored (or polarized) stereo glasses.
• Three-dimensional IMAX movies and other theme park movies.
• Depiction of such technologies in science fiction movies and novels, such as Star Trek and Star Wars.
• Still holograms.
• 3D computer games or similar rendered objects on conventional TV.
• Virtual reality or augmented reality.
S. Fleck noted in particular that 3D theaters in Disneyland, Europa-Park, etc. and IMAX theaters might have had the greatest influence; 3D versions of Terminator 2 and The Muppet Show are popular examples.
17.6 Potential for Novelty Consumption

Is there a novelty-oriented segment of the population willing to pay for expensive, relatively low-quality early consumer models?

R. Civanlar did not think that an early model of low quality regarding resolution, color, etc. would be acceptable; consumers are too accustomed to high-resolution, crisp images. However, low quality or restrictions on the 3D features may be acceptable, since consumers have not yet developed high expectations in that regard. Audiences might at first rush to watch the new 3DTV tabletop football games, but the novelty would quickly fade after a couple of times, and people would probably return to the comfort of their 2D sets. M. Kunter made a similar comment about IMAX theaters, which remain a tourist attraction but have never become established as cultural institutions like common movie theaters.

G. Ziegler thought that there may be a subculture of science fiction enthusiasts who would gladly pay for initially expensive hardware, not so much for the content they would watch as for the excitement of the experience they are familiar with from science fiction. He noted that 3D already has the status of a hobby with specialist suppliers such as www.stereo3d.com, which evaluates all kinds of exotic hardware from small and large companies. Purchasers of this equipment are not ordinary consumers but hobbyists who sometimes modify the hardware. Ultimately, however, this group is small and without large buying power.

Ziegler also noted that certain rich urban singles often have an interest in such gadgets; for them the design is of paramount importance, even more so than the features. G. Ger and Ö. Sandıkçı both thought that certain high-income customers might buy such a product for the sake of novelty if it were a status symbol; but they felt that such an outcome is socially divisive and not desirable.

I. Rakkolainen emphasized that it might make more sense to target the early models at businesses rather than at consumers of novelties; businesses, the military, and other special-applications customers can pay significantly greater amounts and take greater risks. He pointed out that some rich consumers might buy expensive technology if it gave them something new, fun, and useful. But he wondered if there are enough such consumers. The same seems to be the case for non-mainstream customers who are so attracted by the novelty that they are willing to put up with low quality. (P. Surman was of the opinion that such populations are more likely to be motivated by being the first to own a product, rather than being thrilled by the novelty factor.) Therefore, focusing on non-consumer markets seems to be strategically more advantageous.

Ziegler also noted that major companies like to use novel yet expensive technologies at fairs for promotional purposes. F. Porikli supported Rakkolainen, noting that some people pay huge sums for expensive artwork and hobby cars, so there is obviously a market for everything, but how big is
that market? Without convincing content support, Porikli thought it unlikely that expensive 3DTV products will ever reach any but the richest people. He also emphasized the importance of the non-household market: research labs, assisted surgery and diagnosis in medical settings, military applications, and video conferencing.

V. Skala believed that the handheld game industry might be an engine for future development. A. Ö. Yöntem also believed that there will be a significant demand for game consoles with 3D displays, generating considerable revenue.

R. Ilieva pointed out that one option in introducing 3DTV to the masses would be an approach involving small changes to ordinary TV sets. H. M. Ozaktas noted that, for instance, K. Iizuka of the University of Toronto has produced simple add-ons for cellular phones allowing them to transmit stereo images. Similar approaches may be technically possible for 3DTV, but it is not clear whether these would interest consumers.

I. Rakkolainen reported a quotation from Alan Jones in the newsletter 3rd Dimension (www.veritasetvisus.com). Jones suggested that a new level of technology must drop to 5–10 times the price of its predecessor to get users interested; when the price drops to only double, it starts getting widespread acceptance from early adopters. The price must fall to about 1.5 times the predecessor's before it can become a truly mass product. In summary, Jones felt that the future for 3D displays is bright, but they will not displace 2D, because there will continue to be uses for which 3D is not necessary.
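Jones's rule of thumb can be made concrete with a small sketch. The following is only an illustration: the threshold multiples are taken from the quotation above, the handling of the gap between the 5–10x and 2x bands is our own simplification, and the prices are hypothetical rather than market data.

    # A sketch of the price-ratio heuristic attributed to Alan Jones above.
    # The multiples (10x, 2x, 1.5x) come from the quotation; the example
    # prices below are hypothetical, not market data.

    def adoption_stage(new_price: float, predecessor_price: float) -> str:
        ratio = new_price / predecessor_price
        if ratio > 10:
            return "too expensive: little user interest"
        if ratio > 2:
            return "5-10x band: users start to get interested"
        if ratio > 1.5:
            return "~2x: acceptance by early adopters"
        return "<=1.5x: potential mass-market product"

    # Hypothetical predecessor (2D set) price of 500, candidate 3DTV prices:
    for price in (6000, 3500, 900, 700):
        print(f"{price}: {adoption_stage(price, 500)}")

Under these assumed numbers, a 6000-unit 3DTV set next to a 500-unit 2D set would attract little interest, while a 700-unit set would fall in the mass-market band.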
17.7 The Future Acceptance of Three-dimensional Television

Will commercial three-dimensional television replace two-dimensional television, or will it remain a novelty limited to only a certain fraction of consumers?

Many people seem to think that 3DTV may replace common television in the future. However, this thinking may reflect nothing more than a simple-minded extrapolation of a linear conception of progress moving from radio to television to color to 3D. It is important to understand the acceptance of new technologies in the context of competition between rival market forces.

D. Kaya-Mutlu recounted that in the 1950s, when TV had become a serious alternative to cinema, the American movie industry introduced technological novelties to attract viewers back to the movie theaters. The first novelty was 3D movies, but these did not have a long-lasting impact on viewers. The special eyeglasses that were required were the main reason for audience resistance. This is why many present researchers consider it important to develop approaches which do not necessitate the wearing of such equipment. The next novelty was Cinerama, which created an illusion of three-dimensionality without special eyeglasses. It garnered more public interest, but it too became a tourist attraction in a few places,
probably because it required a different and complex exhibition environment. Finally, it was the wide-angle process, Cinemascope, introduced by Twentieth Century Fox in 1953, that had the most long-lasting effect among these novelties. Cinemascope movies offered a wider screen image with color and stereo sound, and therefore contrasted sharply with the small, black-and-white TV image. Cinemascope movies also augmented the impression of telepresence. HDTV, which combines a large, high-quality image with good-quality sound, is an extension of this concept into the private home. Kaya-Mutlu thought that although it seems to be nothing more than a high-resolution version of ordinary 2DTV, HDTV could be a rival to 3DTV in the home market.

G. Ger noted that the different phases of acceptance of a new technology must be carefully studied to avoid strategic mistakes. She gave several examples of failed technologies, such as 3D movies and the picture phone. The picture phone had been the subject of science fiction for a long time; the public was already familiar with the concept and was even anticipating it. It seemed like the logical next step; just as television followed radio, picture phones would follow ordinary phones. The engineers working on it probably thought that price was the only obstacle and that it would surely drop in time. But it turned out that although people sometimes wanted to see the person they were talking to, more often they did not. Perhaps you are unshaven or without makeup during the weekend, or perhaps you do not want your body language to tell your boss that you are lying when you say you are too ill to come to work. Ger underlined that a technology is accepted if it fits well with the existing culture and needs of society.

On the other hand, H. M. Ozaktas noted that the present acceptance and popularity of free internet-based video-telephony (most notably Skype) should make us rethink these explanations; this could provide a lot of new data regarding people's acceptance and the factors underlying it. There are some useful questions: how do people use Skype, when do they prefer no video, when do they prefer to remain silent or inaccessible, and how do they combine voice and the accompanying chat features?

Ö. Sandıkçı noted that about eighty percent of new products fail, mostly because of a lack of understanding of consumers and their needs. She also gave the example of 3D movies, noting that they had the image of being weird and juvenile kid stuff, which probably guaranteed their failure. She warned that the association of 3DTV with 3D movies could hurt the success of 3DTV. She talked of the need to think about how the technology will fit into people's lives. For instance, referring to the tabletop 3DTV scenario, she noted that in a typical living room layout, the TV is not in the middle of the room. Therefore either the technology may affect the way people furnish their living rooms, or, if it asks for a major change from people, it may face resistance.

D. Kaya-Mutlu had already noted the fallacy of conceiving of TV merely as a visual medium, and of the viewer's relation to TV as merely a matter of decoding/processing visual information. Another important component of TV content is talk, and many TV programs are talk-oriented. More importantly,
ethnographic research on TV audiences (within a cultural studies framework), exploring the significance of TV in the everyday lives of families and housewives, has shown that TV has several social functions beyond being an information conveyor. This body of research has shown that the pleasures derived from TV content are not merely textual (which includes both the visual and the aural information). For example, James Lull, in his article "The Social Uses of Television" (1980), develops a typology of the uses of home TV which are not directly related to the content of TV programs. Distinguishing between structural and relational uses of TV, Lull points to the use of TV as a background sound, as an accompaniment, and as a means by which family members, especially housewives, structure unstructured home time. Lull also discusses how TV regulates the relations between family members. It has also been shown that TV is watched in an unfocused manner, at the same time as conversation and other domestic activity. Another aspect of this unfocused watching is the growing practice of zapping among channels.

Kaya-Mutlu said that Ö. Sandıkçı was very right in pointing to the need to talk about how the technology will fit into the lives of people. Since 3DTV will cater to an audience whose expectations and viewing habits/styles have been shaped by 2DTV content, its realistic images may not be enough to attract a wide audience. She thought that 3DTV assumes a focused/attentive viewer, while some evidence shows that many viewers watch TV in a distracted manner (for example, there are housewives who "watch" TV without even looking at the screen; they are perhaps more appropriately referred to as "TV listeners" rather than TV viewers). At the least, one can argue that 3DTV's popularity may depend on a radical change in the viewing habits and styles of the majority of viewers.

M. Karamüftüoğlu suggested that in order to avoid failure, it is important to talk with sociologists, philosophers, cultural theorists, and media artists. The essential ingredient of a successful commercial design is to iterate the design through interaction with potential consumers. He also noted the possibility of consumer resistance to obtrusive gadgets.

One of the most important issues brought up in this context was that of content. A 3DTV set does not mean anything without 3D content. The content is the real product; the set is just the device needed to view it. What is to be sold is the content, not the set. For instance, a typical CD or DVD player costs much less than any reasonable CD or DVD collection. And for content to be produced there must be demand, which can come only from customers already owning 3DTV sets, creating a chicken-and-egg situation. Therefore, Y. Yardımcı speculated that even if there were a small group of customers attracted to novelty, and even if they could support the production of the sets, would they reach the threshold necessary to justify the production of content? On the other hand, Yardımcı also cited research showing that purchases of high-definition (HD) television sets were rising at a faster rate than the number of viewers receiving HD programming. This was paradoxically setting the stage for a boom in content production and thus the solution of the
chicken-and-egg problem. F. Porikli, on the other hand, noted that a lesson learned from HDTV acceptance was that without sufficient content, it is not realistic to expect people to make such an investment.

R. Civanlar thought that acceptance will probably depend on the type of content. People may be willing to pay extra for 3D sports viewing. As for entertainment, special movies that use 3D effects would have to be produced. He mentioned a Sony theater in New York City that frequently shows high-quality 3D feature films and is usually full even though the tickets are not cheap. On the other hand, although this particular theater has been around for ten years or so, no new ones have opened. He believes that Sony produces special movies for this theater, probably not to make money but for reasons of prestige and promotion.

Another aspect of the content issue was brought forward by D. Kaya-Mutlu, who noted that each medium has its own esthetics. For example, Cinemascope promotes long shots instead of close-ups, whereas the small low-resolution TV screen promotes close-ups. That is partly why many major cinematic productions look crammed and are less pleasant to watch on TV. Kaya-Mutlu suggested that the growing popularity of HDTV is likely to prompt some changes in TV esthetics; these may also be valid for 3DTV.

I. Rakkolainen suggested that 3DTV technology could be used with many different kinds of content. Apart from broadcast TV, some of the content categories could include 3D games, virtual reality, and web content. He also noted that an interim appliance might be an ordinary TV most of the time but be switchable to 3D, perhaps with less resolution, when there is a special broadcast.

According to G. Ziegler, in order for big media companies to produce the content that would drive consumer demand, what is desperately needed is a common standard for 3DTV productions (at first for stereo 3DTV systems). Naturally, standardization requires a certain degree of maturity of a technology; he noted that some of the new stereo movies available may set a de facto standard until better standards are agreed upon.

F. Porikli believed that consumers will not stubbornly stick to 2DTV if the quality of 3DTV matches expectations. Even though transitions, such as that to HDTV, are painful, people find it difficult to go back once they are accustomed to the higher-quality content. Nevertheless, 2D displays will continue to be used in many applications due to their cost, their smaller size, and their robustness.

Another issue is whether 3DTV would ever become standard, or whether only selected special programs would be 3D. This may also depend on the restrictions and requirements 3D shooting brings to stage and set, an issue which does not seem to be widely discussed. The transition from radio to TV brought tremendous changes, whereas the transition from monochrome to color brought only minor ones. If the requirements coming from 3D shooting are excessive, it might not be worth the trouble and cost for programs where it
does not have a special appeal, and it may be limited to specific program categories, including sports and certain film genres.

J. Kim noted that since it was not yet clear what shape 3DTV would take, it was not easy to comment on public acceptance. 3DTV may evolve from stereoscopic 3DTV requiring special glasses, to multi-view autostereoscopic 3DTV, and finally to holographic 3DTV. Each type of 3DTV could engender a different response. For the first two types (which lay 3D functionality on top of existing 2D without replacing it), the primary determinant of acceptance will be how the added 3D video services fit users' needs for specific content types. These first two types of 3DTV should be backward-compatible: users should be able to switch to 2D viewing without losing anything more than depth perception. They should also be able to handle 2D/3D hybrid content. 3DTV systems, especially early ones, might exhibit various distortions, which would induce psychological fatigue with extended viewing. J. Kim therefore predicted that only selected programs would be shown in 3D mode.
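The two-way compatibility requirement raised here and in Sect. 17.4 can be summarized as a small decision table. The sketch below is only illustrative; the content and display categories and the fallback names are our assumptions, not part of any broadcast standard.

    # Illustrative sketch of the compatibility rule discussed above: a 3D set
    # should render 2D programs, and a 2D set should render a 2D version of
    # 3D (or hybrid) programs, losing nothing more than depth perception.
    from enum import Enum

    class Content(Enum):
        FLAT_2D = "2D"
        STEREO_3D = "stereo 3D"
        HYBRID = "2D/3D hybrid"

    class Display(Enum):
        FLAT_2D = "2D set"
        STEREO_3D = "3D set"

    def render_mode(content: Content, display: Display) -> str:
        if display is Display.STEREO_3D:
            # A 3D set shows everything; 2D content simply carries no depth.
            return "3D" if content is not Content.FLAT_2D else "2D passthrough"
        # A 2D set falls back to a single view of the stereo material.
        return "2D fallback" if content is not Content.FLAT_2D else "2D"

    for c in Content:
        for d in Display:
            print(f"{c.value} on {d.value}: {render_mode(c, d)}")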
17.8 Other Consumer Applications of Three-dimensional Television

Other than being a three-dimensional extension of common TV and video, what other consumer applications of 3DTV can you think of (games, hobbies, home movies, automotive, smart apartments, etc.)?

I. Rakkolainen believes that it is useful to distinguish between passive applications, such as TV and video or navigation aids in cars, and interactive applications, such as games. He also noted that the success of different applications will depend on the size and format of the displays that can be produced. He said that games and entertainment applications hold a lot of promise, because they can be adapted to many different display types. Indeed, many consider games an important potential application of 3DTV. A. Ö. Yöntem believed that a 3D game console designed to be connected to a 3DTV set would be very attractive to consumers. F. Porikli also agreed that since current TV displays support games, hobbies, entertainment content, etc., it was likely that 3D displays would also do so.

While many understand these to be more realistic and immersive versions of existing computer games, G. Ziegler suggested several less immediate examples. He first noted the success of the EyeToy, a camera-based device with simple 2D motion tracking; this is used for games, but it has other applications: it can be a personal training assistant that supervises your daily exercises. If optical 3D motion capture works reliably, such systems could easily be extended in exciting ways. He also noted DDR (Dance Dance Revolution), the Japanese dancing game, as an example of consumer interest in such devices and activity games.

Video conferencing is another potential application area. While many kinds of systems are already available for remote multi-party conferencing, they have still not replaced face-to-face meetings. Precisely what important features of
the face-to-face interaction are lost, and whether they can be provided by 3DTV, remain interesting questions. F. Porikli commented that just as Skype users find it difficult to go back to making traditional phone calls, people who experience 3D teleconferencing may not be willing to go back to conventional teleconferencing.

P. Surman noted that the display systems being developed have the capacity to present different images to different viewers, and this could be exploited for certain purposes, such as targeted advertising, where a viewer is identified and an image intended specifically for that viewer is not seen by anyone else. This could work for more than one viewer. Such technology could also be used to block undesirable scenes from young viewers. A TV set which can display two channels simultaneously to viewers sitting in different spots has already been introduced, marketed as a solution to family conflicts about which station to watch.

G. Ziegler noted potential applications mixing the concepts of 3DTV and augmented reality, where multi-camera recordings are projected into augmented-reality environments. He also noted several possibilities, such as the use of a webcam or a head-mounted display for "mixed reality" 3DTV viewing. Other applications include a virtual tourist guide and a virtual apartment walk-through.

J. Kim mentioned TV home shopping; people would like to see the goods as if they were in a store, and would appreciate the added 3D depth perception and the ability to look around objects. Only the goods for sale need to be shown in 3D, against a 2D background that includes the host and any other information. This mode of 2D/3D hybrid presentation could also be used for other programs such as news, documentaries, etc.

Applications of 3D displays to mobile phones were suggested by A. Ö. Yöntem, who argued that consumers would like to see a miniature of the person they are talking to. In this context, he also proposed the intriguing idea that 3D displays may form the basis of 3D "touch screens," although there are many questions about how to detect the operator's finger positions and purposeful motions.
17.9 Non-consumer Markets for Three-dimensional Television

What major markets other than the mass consumer market may arise? In other words, what could be the greatest non-consumer applications of 3DTV, in areas such as medicine, industry, and scientific and professional applications? Do these constitute a sizable market?

We have already noted I. Rakkolainen's position that such technologies should initially target business customers in high-cost professional areas like medical, military, and industrial visualization, followed by medium-cost applications like marketing and advertising. P. Surman also noted that such
applications constitute a sizable market and that these niche markets can justify a more expensive product. This could be useful for the development of a commercial TV product, as it could take ten years to develop an affordable TV display, but less time to produce a more expensive one. The niche markets would drive the development. Likewise, M. Özkan believed that due to the cost of initial products, professional areas such as military training, industrial design, and medicine were likely early application areas. R. Ilieva, along with others, believes that there is considerable potential in the medicine, education, and science markets.

While most agreed that industrial markets could tolerate higher prices, it was not clear that they would tolerate lower quality. T. Erdem noted that industrial applications may require even higher quality than consumer 3DTV applications. The consensus was that it depended on the application and could go either way.

G. Ziegler noted that 3DTV research may have many spill-over effects, in areas such as image analysis and data compression. This could lead to advances in areas such as real-time camera calibration, industry-level multi-camera synchronization, real-time stereo reconstruction, and motion tracking. For instance, classic marker-based motion tracking (also used in the movie industry) might become obsolete with the advent of more advanced markerless trackers that stem from the problem of 3D data generation for free-viewpoint video (FVV) rendering. Other applications might include remote damage repair, space missions, spying and inspection operations, remote surgery, minimally invasive surgery, and, regrettably, military operations such as remote-controlled armed robots. F. Porikli added remote piloting and virtual war-fields to the potential list of military applications.

An interesting point was made regarding professional applications. In some professional areas, the existing values, norms, vested interests, or skill investments of practitioners may result in resistance to the technology. While most physicians are used to adapting to sophisticated new equipment, years of clinical training and experience with 2D images may make them resistant to, or uncomfortable, working with 3D images. Also, their expectations of quality may be quite different from those of general consumers. As with many technologies, the issues of de-skilling and retraining arise. Many professionals learn over years of experience to "feel" the objects they work with, and when the technology changes, they cannot "feel" them any longer and feel almost disabled.

As a specific example, S. Fleck noted that he and his colleagues have been doing research in the field of virtual endoscopy for years, and that they have asked surgeons for their opinion of 3D visualization. While about two-thirds said that they would appreciate such capabilities, it was important for them to be able to use any such technology in a hassle-free way, with very low latency and high spatial resolution. They also insisted on maintaining the option of being able to fall back on the 2D visualization they were used to; they wanted to be sure that whatever extras the new technology might bring, they would not lose anything they were accustomed to. This is quite understandable given the
critical nature of their work. K. Ward, a doctor herself, observed great fear among the medical profession that new technology may not be as safe; anything that doctors do not have experience with feels less safe to them, and they hesitate to risk a bad outcome for their patients.

M. Özkan noted another potential reason for resistance from the medical establishment, which in theory should greatly benefit from 3D visualization in both training and practice. He underlined the resistance to even lossless digital image compression techniques for fear of costly malpractice lawsuits, and so was pessimistic regarding the adoption of 3D techniques in practice, but thought they may be more acceptable in training, especially remote training.

J. Kim reported on trials in Korea applying different kinds of information technology to medicine. He referred to two big issues: broadband network connections among remotely located hospitals and doctors for collaborative operation and treatment, and the exploitation of 3D visualization technologies for education and real practice. Accurate 3D models of human organs and bones, and their 3D visualization, would be very time- and cost-efficient in educating medical students. Doctor-to-doctor connections for collaborative operations are considered even more necessary and useful than doctor-patient connections for remote diagnosis and treatment. Kim believes the medical field will surely be one of the major beneficiaries of 3DTV.

H. M. Ozaktas noted that many examples of resistance to new technology are available in consumer applications as well; a new car design with different positions of the brakes, accelerator, and gearstick would not easily be accepted, even if tests showed it was safer and gave the driver better control. Likewise, despite its clear inferiority, the QWERTY keyboard is still standard, and very few people attempt to learn one of the available ergonomic keyboard layouts.

A number of participants, including C. Türün and V. Skala, emphasized the education market. Skala gave several examples from three-dimensional geometry where students had difficulty visualizing shapes and concepts; 3DTV may help them improve these skills. Indeed, the traditional book culture as well as the more recent visual culture are both heavily invested in 2D habits of thinking. H. M. Ozaktas agreed that perception of 3D objects may be improved with the use of 3D imaging in education, but argued that the applications to education should not be limited to this, suggesting that we should be able to, for instance, show simulations of a vortex in fluid mechanics or the propagation of a wave in electromagnetics. However, even very low-tech 2D animations which can add a lot to understanding are not often used in educational settings, despite their availability. Ozaktas gave the example of simple animations or simulations of electromagnetic waves and how useful they could be, but noted that most electromagnetics courses do not include such simulations. He concluded that customary habits and possibly organizational obstacles may come before technical obstacles in such cases.
Part III: Social Impact and Other Social Aspects

17.10 Impact Areas of Three-dimensional Television

Will the greatest impact of 3DTV be in the form of consumer broadcasting and video (that is, the three-dimensional version of current TV and video), or will the greatest impact be in other areas, such as medicine, industry, and scientific applications?

P. Surman believed that the greatest impact will be in the form of consumer broadcasting and video, since this will potentially be the most widespread application. I. Rakkolainen agreed that this may be the case in the long run; in the meantime, the greatest impact will be in special experiences created by the entertainment industry with high-end equipment. He noted that there are already very low-cost head-mounted displays for PCs and game consoles; they have not yet sold well, although they could become popular within 5–10 years. V. Skala also agreed that in the long run consumer 3DTV will have the greatest impact, but that in the meantime other professional areas will have a larger impact.
17.11 Social Impact of Three-dimensional Television

Television is currently understood as being a social ill. Its negative effects, including those on children, have been widely documented and are considered to far outweigh its positive aspects. In this light, what will be the effect of 3DTV technology? Will it further such social ills? Will it have little effect? Can it offer anything to reduce these ills?

There is a vast literature regarding the negative effects of ordinary television on children. The negative effects mentioned include the conveying of a distorted picture of real life, excessive exposure to violence, obesity due to the replacement of active play, unsociability due to the replacement of social encounters, and negative developmental effects due to the replacement of developmentally beneficial activities. In the early years, additional negative effects include negative influences on early brain development as a result of the replacement of real-person stimuli, and exposure to fast-paced imagery which affects the wiring of the brain, potentially leading to hyperactivity and attention-deficit problems.

S. Sainov noted that during their holographic exhibitions, children 2–5 years old and older people with minor mental deteriorations were very much impressed by 3D images; this suggests that the psychological impact of 3D images on TV screens should be taken into account. P. Surman noted that children are fascinated by 3D, suggesting this may be due to their greater ability to fuse stereo images.

R. Ilieva commented that although TV is a social ill, it has also had important positive aspects; it has brought knowledge of the world to low-income
it has brought knowledge of the world to low-income people who cannot travel and do not have access to other sources of information. 3D technology can have a positive impact on science education, but it is not clear how much 3D can add to the general information dissemination function of TV. P. Surman thinks it will have little effect, since there was no noticeable difference when color took over from monochrome. F. Porikli, however, thinks that both positive and negative impacts would be enhanced, since 3DTV has the potential to become a more convincing and effective medium than conventional TV.

A. Boev noted that a social ill is something that hinders the basic functions of society, such as socializing; by this definition, TV is a social ill, but Skype is not; playing computer games is a social ill, but writing in web forums is not. He said that even reading too much could be a social ill. He argued that perhaps the main feature which makes TV a social ill is its lack of interactivity. If TV were truly interactive (and went beyond just calling the TV host to answer questions), it would not be such a social ill. M. Karamüftüoğlu noted that this view can be criticized; for instance, some would argue that certain forms of Internet communication such as chats are poor substitutes for face-to-face human communication, that they distance people from their immediate family and friends, and actually have an antisocial effect. Karamüftüoğlu believes that 3DTV can be less of a social ill than present-day television only if it is made and used to convey more human knowledge. This should involve bodily, embodied tactile interaction and immersion, with affectivity and subjectivity.

G. Ziegler pointed out that there has been a radical transformation in the social isolation associated with playing computer games; many games are now networked and played interactively, sometimes in large role-playing communities. While these games may separate you from your local community, they make you a member of other communities. Such games may still be considered a social ill, given that they may isolate people from family, school, and work contacts. On the other hand, being able to network with others who share common interests, rather than being limited to people in one's immediate social environment, appears to be beneficial.

I. Rakkolainen pointed out that it is not so easy to claim that TV (or networked games, Skype, books) is a social ill for all. Some people get seriously addicted to TV, watching it 10 hours a day, while others may get addicted to excessive gardening, football, music playing, drawing, virtual reality, drugs, sex, and so forth in an attempt to escape real life. He argued that if anybody gets addicted to these things, the reason is usually not in the particular thing or technology, but somewhere deeper in their personality or history. Nevertheless, he agreed that attractive new technologies can make the old means of escape even more effective. Will 3D technologies have such an effect? Some people were already addicted to computer games in the 1980s, but the advent of superior graphics, 3D displays, and virtual reality will make it much easier for the masses to get immersed. Interactive technologies are more immersive, as they
require continuous attention. In the end, the implications of A. Boev's thesis that interactivity might improve the status of TV remained an open issue.

M. Özkan and J. Kim agreed with Rakkolainen that the technology is not intrinsically good or bad; it is how it is used that makes it good or bad. Lack of social interaction seems to be a growing problem in the developed world, and Özkan was not sure that it was fair to blame TV for it. He noted that the erosion of extended families and the disappearance of communal structures and living conditions, such as old-style neighborhood interactions and neighborhood shops, have all contributed to social isolation. V. Skala agreed with him that the major negative aspects of TV stem from its use as a tool to boost consumption and to indoctrinate people; he added that TV programmers do not try to produce value but just use sophisticated psychological techniques to keep people passively watching TV. On the other hand, Özkan continued, TV could potentially be used as an effective and economical education tool if public-interest broadcasting were more widespread. Given current trends, the move to 3DTV may increase its power and negative effects, a point also agreed to by Kim.

N. Stefanoski suggested that the spectrum of applications of 3DTV technology will be much wider than that of traditional television, with potential applications in the areas of medicine (telesurgery, surgery training, surgery assistance), industry (CAD), and the military (training and simulation in virtual environments). Immersive 3D environments could be created to improve the social environment of elderly and handicapped people, helping them to have more realistic-looking visual contact with other people and to interact with them. Thus, in judging the overall effects of the technology, we should not focus only on consumer 3DTV, but also consider the array of potential non-consumer applications which may be of considerable benefit to society.

D. Kaya-Mutlu noted that the question under discussion frames TV around the "effects" model of mass communication. However, this model of "strong effects" or "uniform influences" was challenged in the 1930s, 40s, and 50s (e.g., by the uses and gratifications approach to media consumption). It was shown that the media do not affect everybody uniformly; individual psychological differences, social categories (e.g., age, sex, income, education), and social relationships (e.g., family, friends, acquaintances) affect people's perception and interpretation of media content. In the 1980s, culturalist studies of audiences showed that consumers are not passive recipients of the meanings and identities encoded in the media. These studies redefined media consumption as a site of struggle. For example, Stuart Hall has argued that representations of violence on TV are not violence per se but discourses/messages about violence. Hall, David Morley, and other audience researchers showed that, depending on their social, cultural, and discursive dispositions, viewers are able to negotiate and even resist media messages.
17.12 Comparison to the Move to Color

Will the effects of moving to three dimensions be similar to the effects of moving to color?

A. Boev believes that merely adding another dimension (color, depth, even haptics or olfaction) to TV is not going to greatly affect the social impact of such a medium. On the other hand, P. Surman believes that once viewers have become accustomed to watching 3D images, 2D images will appear dull and lifeless. R. Ilieva also believes that the move to 3D will be more important than the move to color.

C. Türün noted that present 3D displays are not yet of sufficient quality to allow us to imagine what it might be like to experience 3DTV in which the images are almost impossible to distinguish from the real thing. If such a high-quality image were hanging in the air, not physically attached to a screen, people might have an experience which is difficult for us to imagine now. In this case, the move to 3D would be much more significant than the move to color, and may be comparable to the difference between a still photograph and a moving picture. G. Ziegler agreed with Türün, underlining that present display technologies, which are not quite "true" 3D, are not significantly different from ordinary television. He also noted that a more challenging target than the move to color might be the move to moving pictures: could we ever create the awe that the first moving pictures generated in their audiences? M. Özkan also agreed that only the transition to a "real" 3DTV system could be much more important than the transition to color.

However, many believe that in any event, interactive TV or immersive TV is almost certain to have a much larger impact than 3DTV. In other words, the addition of the third dimension may have less impact than the possibility of interaction or immersion. G. Ziegler noted that even very convincing 3D displays, if confined to relatively small spaces and restricted viewing conditions, will be far from creating the immersive experience of even ordinary cinema. This seems to indicate that simply the size of the display and the architecture can have more of an immersive effect than the perception of three dimensions. Therefore, Ziegler suggested that it would be worth investigating large-scale 3D display options, even if they were not true 3D or offered only a limited amount of depth perception; these might create a much more breathtaking experience than true 3D systems in confined or restricted viewing conditions.
17.13 Economic Advantages of Leadership

If Europe, the Far East, North America, or some other bloc becomes the first to establish standards and offer viable 3DTV technologies, especially to the home consumer market, what economic advantages may be expected?
P. Surman noted that for Europe, the value added could come from licensing and from the fact that there are no overwhelming barriers to displays being manufactured in Europe. R. Ilieva believed that if Europe became the first player to establish the standards and offer viable 3DTV technologies, especially to the consumer market, the economic advantages would be comparable to those for CDs and DVDs.

V. Skala did not think that Europe would be the main player. He also expressed pessimism regarding standards: they would take a long time to develop, and meanwhile the market would have already moved on. He thought that while there will be similar principles of coding, there will also be many variations (like NTSC, PAL, and SECAM). He guessed that the major Far Eastern countries may once again take the lead.

G. Ziegler was also skeptical of Europe's capacity to provide leadership, based on observation of earlier technologies such as HDTV and GPS. Nevertheless, he argued that if the EU could set forth certain common standards, it might give media producers and hardware manufacturers a huge home market for the fruits of their latest research; and if the rest of the world finds the new medium desirable, these companies will have an advantage. He also underlined that it is media standardization, and the media content that follows, which generate the revenue, not the hardware.

M. Özkan agreed; the size of the EU market makes it viable for consumer electronics companies to achieve economies of scale in producing a new EU-standard 3DTV. However, he also noted that the added value is not in the hardware but in the content. For traditional broadcast TV, the commercial model was simple: consumers pay for the equipment (so manufacturers target the end consumer, making branding and marketing important) and advertisement revenue pays for the content. With the move to digital TV, it has mostly been the service provider who pays for the equipment and recoups this cost through the monthly service charge to the end consumer. On most equipment, the manufacturer's brand is either invisible or clearly dominated by the brand of the service provider. A parallel business model has been in the works for game platforms. In such "service provider" subsidized equipment models, low-cost manufacturers have a clear advantage, and famous brands end up with a cost disadvantage because of the brand marketing costs (among other things) they incur. However, establishing a standard obviously creates a great advantage for those companies who own the intellectual property and patents. Hence, although manufacturing of the equipment might be done by non-European companies (or even by European companies operating offshore), if intellectual property is developed early on and included in the standards, that can establish a clear and lasting advantage.

C. Türün thought that at the current rate of development, no place is any more advanced than any other. But if something radical were to be achieved by a European company, such as an application far more extraordinary than mundane TV or film content, we would be able to talk about real economic advantages.
17.14 Comparison with Other High-impact Technologies

How large might the impact of consumer 3DTV technology be, compared to other related established consumer technologies such as audio and video, cellular communications, etc.?

P. Surman noted that the impact of 3DTV technology is likely to be high due to the high proportion of time people devote to watching TV. Also, viewing patterns are likely to change in the future; the TV set, the most familiar and easy-to-operate device in the home, will evolve into a media access gateway serving the information society. In contrast, A. Smolic said that when he imagines the world before TV, he believes that the move to 3D would be nowhere near as important as the introduction of TV itself. 3DTV should be considered more as another step in the development of TV than as a revolutionary new technology. Moreover, he predicted that 3DTV would not spread quickly and broadly, but rather would develop from niches; perhaps it would never completely replace 2DTV.

G. Ziegler approached this issue somewhat differently. Rather than thinking of end-to-end consumer TV, he looked at acquisition, compact representation, rendering, and display technologies separately. With regard to acquisition, for example, he noted that being able to acquire 3D surfaces of yourself or your surroundings could be of great interest for immersive online games, where you could quickly create a 3D avatar of yourself. If 3D tracking can be made good enough, it will open up exciting new possibilities for game entertainment and would likely be popular. As for compact representation, being able to compress 3D video so that you can store it on a DVD could be of interest for documentaries, but not for feature films, since these may remain a rather passive experience with 3D being merely an add-on (as in IMAX 3D theaters). However, in video conferencing, 3D would probably increase the feeling of telepresence, and might be successful. Free viewpoint rendering is probably mostly of interest for documentaries or plays; it is a radical change in filmmaking, since the director's control over point of view is lost. For this reason, free viewpoint movies may not appear soon. In video conferencing, however, it would be very desirable to be able to change viewpoints. In summary, many kinds of 3D displays could have market potential provided they did not cause eye strain, and provided there was a standardized media format (which did not soon become obsolete) and interesting high-quality media content.

S. Fleck noted that apart from TV programming in the conventional sense, many other forms of content for consumer 3DTV may emerge. The example he used was Google Earth; he noted attempts to produce basic anaglyphic (involving red and blue glasses) stereoscopic screenshots so that it would be possible to experience "3D" in Google Earth.
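As an aside, the anaglyph technique mentioned above is simple enough to sketch in a few lines of code. The following minimal sketch (in Python, not taken from the discussion) assumes a rectified stereo pair of equal dimensions stored in the hypothetical files view_left.png and view_right.png, and combines them so that red-cyan glasses deliver approximately one view to each eye.

import numpy as np
from PIL import Image

def make_anaglyph(left_path, right_path, out_path):
    # Red channel from the left view; green and blue (cyan) from the right.
    # Through red-cyan glasses, each eye then sees (approximately) only its own view.
    left = np.asarray(Image.open(left_path).convert("RGB"))
    right = np.asarray(Image.open(right_path).convert("RGB"))
    if left.shape != right.shape:
        raise ValueError("stereo pair must have identical dimensions")
    anaglyph = np.empty_like(left)
    anaglyph[..., 0] = left[..., 0]    # red   <- left eye
    anaglyph[..., 1] = right[..., 1]   # green <- right eye
    anaglyph[..., 2] = right[..., 2]   # blue  <- right eye
    Image.fromarray(anaglyph).save(out_path)

# Hypothetical file names, for illustration only:
make_anaglyph("view_left.png", "view_right.png", "anaglyph.png")

Such channel mixing is the simplest anaglyph scheme; practical implementations often attenuate or desaturate the channels to reduce retinal rivalry, but the principle is the same.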
17.15 Social Impact of Non-consumer Applications

Which non-consumer applications of 3DTV, including such areas as medicine, industry, the environment, and the military, may have significant beneficial or harmful impacts on society? (Example: telesurgery.) What could be the extent and nature of these impacts?

M. Özkan noted the steady increase of non-conventional training methods in the military. The U.S. Department of Defense is specifically supporting software companies to develop game-based e-learning and training systems. Such systems allow the military to train their personnel for very different environments and situations without risking lives. These users would likely welcome a realistic 3DTV system, and even higher-cost early implementations of 3DTV may find use in military training systems. (Such systems can obviously also be used for more humane applications, such as disaster readiness, first aid, and humanitarian aid; unfortunately, funds for such applications are much more limited.) G. Ziegler also noted implications for the battlefield, such as more realistic virtual training environments. He was concerned that 3D visualization might make war become even more of a "computer game" to commanders and thus distance and alienate them from the resulting human suffering.

Özkan also noted that industrial designers, specifically designers and manufacturers of 3D parts and systems, have always been very aggressive users of any 3D software and hardware that shortens production times. However, these applications might require very high resolution, creating the need for higher-quality, industrial-grade 3DTV. I. Rakkolainen observed that the trend in modern society to become very visual will be enhanced by any future 3D technologies; undesirable uses will be enhanced along with the desirable ones, and violent video games will become even more realistic.

G. Ziegler believes that telesurgery will have a huge impact. Young students have no problem looking at a screen while using surgical instruments in simulated surgery. Apparently, kids who have grown up with computer games will be quite ready to adapt to augmented reality, making such applications a reality. Another example he gave was in the area of aircraft maintenance. Printed aircraft repair manuals can run to many kilograms in weight and are very hard to handle during maintenance because their pages easily become soiled. Augmented reality displays could provide the page content, but 3DTV technology could do even more by superimposing actual objects from the manual, such as screws, onto the field of view, making it easier for maintainers to do their job.

Video conferencing is one area where the quality of face-to-face meetings still cannot be recreated. Ziegler pointed out that if 3DTV could make video conferencing more satisfactory, it could reduce business travel considerably, with enormous time and cost savings. H. M. Ozaktas added that this may be an incentive for companies to invest in rather expensive preliminary 3DTV models.
17.16 Implications for the Perception of Reality

Referring to the tabletop football scenario, M. Karamüftüoğlu noted that the first thing that comes to mind when 3DTV is mentioned is a telepresence type of experience. 3DTV is usually described as realistic in the sense that it captures all the information in a scene and not just a 2D projection of it. However, the perspective in the artist's depiction of tabletop football was not the natural perspective of a person in the stadium or on the field; it was rather a bird's-eye view. The distance between the human observer and the players is very short and the scale is very different.

Karamüftüoğlu made two further points regarding realism. First, the greater the amount of information conveyed, and the more absolute the information becomes, the less there is for viewers to do, and they will be pushed into a more and more passive role; this is already a criticism of conventional TV. Only the final dimension of realism, interactivity, could possibly reverse this. Indeed, the 3DTV image is absolute and real only in the passive sense; realism in a broader sense must also include interactivity. The main distinguishing feature between the real and a reproduction is not 2D versus 3D, nor black-and-white versus color, but the possibility of interaction with the actual scene as opposed to isolation from it. He explained that nothing short of tactile, bodily interaction with the objects would bring back realism.

TV content can be divided into two broad categories, factual programs and fictional programs, noted D. Kaya-Mutlu. Realism, especially in the case of fictional programs, should not be construed solely in the sense of realism of the images. One also needs to consider "emotional realism" (the recognition of the real on a connotative level). For example, the relevance of the characters and events in a film to the everyday lives of viewers may contribute much more to the impression of reality than the characters' realistic appearance; it is not the appearance of the characters which makes them realistic but rather their actions and their relation to events and to other characters in the film. The contribution of 3DTV to increasing realism should be evaluated within this broader understanding of realism. Nevertheless, Kaya-Mutlu suggested that it is interesting to think about the contributions of 3D images to the production or enhancement of emotional realism itself.

Another very interesting topic was brought up by G. Ziegler and elaborated in a response by H. M. Ozaktas. Ziegler talked about the implications of mixing real and animated characters, especially animated characters with realistic skin; this produces an erosion of trust in video images. Animated images are obviously not to be trusted, and everyone knows that still photographic images can be altered by "photomontage." However, realistic video images are still trusted because people know they must be real shots of real people; people "know" that such images cannot be fabricated. But as everything becomes fully manipulable, we are entering an era in which no media content can be taken to constitute "evidence."
You can believe it to the extent that you trust the source, but there is no longer any true first-hand witnessing-at-a-distance, since virtually all forms of transmitted data will be manipulable, even the most "realistic" ones of today. And people will know it.

I. Rakkolainen joined this discussion by suggesting that with the digitalization of all forms of media, it will be possible to create immersive and interactive 3D experiences; eventually, synthetic 3D objects will be indistinguishable from real objects captured with a camera. H. M. Ozaktas found this to be a strong statement with interesting implications. Ozaktas recalled the famous essay "The Work of Art in the Age of Mechanical Reproduction" by the Frankfurt School author Walter Benjamin, who wrote in the first half of the 20th century. With "recordings" on media no longer just representing true things, but becoming indistinguishable from them, art and esthetic theorists will have a lot to theorize about. Ozaktas surmised that maybe one can write about "The Work of Art in the Age of Indistinguishable Reproduction."

Ziegler thought that the ability to use digitized characters in computer games, and maybe even films, will make it possible to simulate existing actors or create non-existing ones digitally. This could drastically reduce film production costs and also encourage pirate productions, with a host of interesting and complex implications. Ozaktas noted that the ability to pirate an actor digitally would open up copyright and ownership issues much more complex than the current pirate copying and distribution of copyrighted material. Also, the "star-making" industry will be transformed. A star may be a real person, but it will no longer be necessary for him or her to actually act, or even be able to act; he or she will merely be the model for the digital representation used in the films. In some cases, the star will not correspond to any real person, and will be nothing more than an item of design in a studio. Ziegler also brought up the possibility of "reviving" actors or personalities who are no longer alive, through skeletal animation with realistic skin. The traditional capital of an actor was his or her actual physical presence, but now it would be merely his or her image, a commodity that can be sold and hired, even after the actor is dead. Celebrities like football stars or top models can be in Tokyo this morning for a televised fashion show and in Rio tomorrow for a talk show. Indeed, since most of us have never seen these people in the flesh, it is conceivable that totally imaginary personalities could be synthesized for public consumption. People may or may not know that these people do not actually exist, but perhaps it will be commonly accepted that they may not exist, just as we do not mind watching fictional films, knowing the events are not true. K. Köse noted that in certain circumstances, the availability of very convincing images could open the door to "persona piracy," the ability to convince others that you are someone else.

Ziegler also asked whether such media would offer alternative means of escape, involving addiction, and concluded yes; but since there are already enough paths of escape, this will not have a substantial impact. Mastery of even the earliest computer or video games demonstrated the potential for addiction and escapism, and realism may increase this considerably.
But will highly realistic and interactive media allow qualitatively new levels of escapism from the real world? Presently, many substitutes for the real world exist, including imitation eggs and sugar, but in the social realm, substitutes are usually poor, nowhere near the real thing. If this changes, more people may prefer the more controllable, lower-risk nature of artificial experiences, leading to a society of isolated individuals simulating human experiences with quite genuine sensory accuracy. Although sensorily, and therefore cognitively, equivalent, these experiences will not be socially equivalent and will have an effect on how society operates. Unless, of course, computers are programmed to synchronize and coordinate the virtual experiences of individuals so that the resulting experiences and actions are in effect comparable to present social interactions and their consequences. For example, person A is hooked up to a simulator virtually experiencing an interaction with person B, person B is likewise apparently interacting with A, and the two computers are linked such that the two simulations are kept consistent with each other. In that case, the distinction between simulation and true interaction disappears; such simulation is effectively a form of remote communication and interaction.

Ziegler also argued that children who experience such media concepts will not be impressed by conventional fairy tales and will lose interest in them. Generalizing to other aspects of human culture, it may be argued that poetry, novels, music, and even traditional cinema images may no longer be able to capture the interest of audiences. The erosion of interest in poetry and the theater in the 20th century may support this, but the continued interest in at least some forms of printed content and the plastic arts are counterexamples.

M. Karamüftüoğlu compared the 3DTV tabletop football scenario with the painting The Anatomy Lesson by Rembrandt. He noted how the way the professor held his hand conveyed knowledge about human anatomy. He contrasted this embodied tactile/haptic human knowledge with the disembodied, absolute/objective, machinic knowledge of the tabletop football scenario. Indeed, engineers attempt to produce more "realistic" images by going from black-and-white to color, from 2D to 3D, and so forth, but they often seem to be moving towards such objective, physical/machinic knowledge at the expense of more human knowledge. In the physicist's objective understanding of color, each wavelength corresponds to a different color, whereas human understanding of color is based on the three primaries, which are rooted in human physiology. Objective knowledge is independent of the human observer; thus "true 3D" aims to reconstruct a light field as exactly as possible and thus preserve as much objective information as possible. An artist's rough sketch or caricature, on the other hand, may do very poorly in terms of any objective measure of fidelity and information preservation, but may carry a very high degree of information about human nature, even including such intangible things as psychological states.

These discussions are not restricted to 3DTV; they apply to any technology that, like 3DTV, increases the accuracy and realism of remote experiences.
For example, odor-recording technologies allowing the recording and playback of smells are being developed; these devices analyze odors and then reproduce them by combining an array of non-toxic chemicals.
17.17 Interactivity

Although interactivity is not a defining characteristic of 3DTV, it comes up constantly in relation to 3DTV. Interactivity is a very important trend in all TV-type media, but it is possibly even more crucial for, perhaps even inseparable from, 3DTV. Interactive "TV" is almost certain to have a greater impact, good and bad, than 3DTV, regardless of the number of units sold.

A. Boev discussed different kinds of entertainment or "media" and their differing degrees of interactivity. Live events, such as theater and sports events, offer a degree of interactivity, with the opportunity to throw eggs, shout, sing, and engage in hooliganism. Historically, in some forms of staging the audience was allowed to shout, argue, or even decide on the course of action (as with gladiators). Nevertheless, with the exception of certain experimental theaters, being among the audience is generally a safe place to be, allowing one to be close to the action without risking too much. Different degrees of interactivity are sought by different audiences in different contexts, and ignoring this fact might lead to the rejection of a product. There might be a few forms of interactivity, such as the choice of point of view, which are special to 3DTV.

In the same context, D. Kaya-Mutlu noted that interactivity necessitates an active viewer. However, in her view, the popularity of TV is based on its being a passive medium. Interactivity may not be much desired in a medium so associated with leisure.
Part IV: Gender Related Issues

17.18 Effect on Gender Inequality and Gender Bias

Can you think of any aspect or application of 3DTV that will increase or decrease gender inequality or bias in the social and cultural sense?

Most participants who expressed an opinion on this issue did not believe there would be any major effect, apart from what is discussed in the following section. However, it is important to underline that most of the discussion participants, as well as most developers of 3DTV technology, are male. G. Ziegler pointed out that it is worth looking into such biases in the areas of computer games, general computer usage, and usage of other "high-tech" consumer gadgets (although the insights gained may not be specific to 3DTV). He suggested that women would become more interested in using computers when the available applications and games have a considerable component of social interaction. Likewise, if 3DTV becomes a tool of social interaction, more women will become interested in it.
Can you think of any applications of 3DTV that will benefit or harm men or women to a greater degree? (Example: a medical application treating a disease more common in men or in women.)

Not many examples were put forward, with the exception of the application areas suggested in the following section. Of these, the effects of pornography are largely perceived as negative, and more so for women. On the other hand, it has been argued that training and education applications may benefit women to a greater degree. Whether entertainment applications which selectively target men or women can be said to "benefit" them is open to debate. It was commented that the application of 3DTV to shopping may also be viewed as potentially exploitative.
17.19 Gender-differentiated Targeting of Consumers

D. Kaya-Mutlu noted that ethnographic audience studies have shown that TV viewing is a gender-differentiated activity, not only in program preferences but also in viewing styles. Researchers have found that while men prefer such programs as news, current affairs, documentaries, and adventure films, women prefer quiz shows, serials, soap operas, and fantasy movies (mostly talk-oriented TV genres for which 3D may not be necessary). But the home is a site of power relations, and when there is a clash of tastes, masculine preferences prevail. Researchers have also identified some differences between the viewing styles of men and women: while men watch TV in a focused and attentive manner, women watch it in a distracted manner (i.e., together with at least one other domestic activity, such as ironing or feeding the children). Kaya-Mutlu concluded that these gendered program preferences and viewing habits imply that 3DTV, which seems to favor visual information and encourages focused, attentive viewing, may be more responsive to male demands and tastes; this may encourage producers to reserve 3D for prime-time programs, since most daytime programs are addressed to children and housewives.

Can you think of any applications of 3DTV that will target either men or women as primary consumers? (Example: broadcasting of male-dominated sports events.)

Three applications were noted that may target men as primary consumers: male-dominated sports events, games applications, and pornography. It was suggested that such applications may help 3DTV by building initial niche markets. Y. Yardımcı and A. Smolic noted that, in particular, 3D pornography may become a popular industry, but many participants said that they would be uncomfortable, from an ethical perspective, with building 3DTV's success on such grounds. If 3DTV does become a new tool for disseminating this type of content, the result will be to increase the negative and exploitative impact of such content on society. It was also commented that, in any event,
it cannot be taken for granted that 3D will make this type of content more attractive to its consumers.

A number of shopping-related applications that may target women as primary consumers were noted by G. B. Akar. 3D virtual malls were described as a game-like environment in which you can navigate through shops. A 3D dress-up simulator would allow selected garments to be mixed and matched on an avatar and viewed from different angles. Finally, the use of 3D in telemarketing was noted as an enhancement that might make home shopping more attractive. Akar also suggested that the application of 3DTV in the areas of professional and vocational training (for instance, to become a surgeon, pilot, or technician) has the potential to benefit women in particular, because some studies seem to show that women are more inclined towards visual learning. She added that a similar potential exists in K-12 education, especially in subjects such as geometry, chemistry, biology, and physics, where visualization or simulation of complex structures or phenomena is vital. This may also help increase the interest of women in science and engineering.

Will it be important for companies to target one or the other gender to sell 3DTV consumer equipment or applications?

A. Boev and G. Ziegler mentioned market studies on which gender dominates the decision to purchase consumer electronics, and which features are decisive (technical parameters, design, etc.). Furthermore, some studies seem to indicate sex differences in perception and spatial ability, which may have implications especially for immersive technologies. Given that these issues are highly charged and there are many unresolved claims, it is difficult to draw any meaningful conclusions.
Acknowledgments

We would like to thank the following participants of the Network of Excellence for their contributions to the discussions which formed the basis of this work: Gozde B. Akar (Middle East Technical University, Ankara, Turkey), Atanas Boev (Tampere University of Technology, Tampere, Finland), Reha Civanlar (Koç University, Istanbul, Turkey, now DoCoMo Labs, Palo Alto, USA), Sven Fleck (University of Tübingen, Tübingen, Germany), Rossitza Ilieva (Bilkent University, Ankara, Turkey), Matthias Kautzner (Fraunhofer Institute for Telecommunications/Heinrich-Hertz-Institut, Berlin, Germany), Kıvanç Köse (Bilkent University, Ankara, Turkey), Matthias Kunter (Technical University of Berlin, Berlin, Germany), Haldun M. Ozaktas (Bilkent University, Ankara, Turkey), Mehmet Özkan (Momentum AŞ, Istanbul, Turkey), Ismo Rakkolainen (FogScreen Inc., Helsinki, Finland), Simeon Sainov (Bulgarian Academy of Sciences, Sofia, Bulgaria), Vaclav Skala (University of West Bohemia in Plzen, Plzen, Czech Republic), Aljoscha Smolic (Fraunhofer
Institute for Telecommunications/Heinrich-Hertz-Institut, Berlin, Germany), Nikolce Stefanoski (University of Hannover, Hannover, Germany), Philip Surman (De Montfort University, Leicester, United Kingdom), Cemil Türün (Yogurt Technologies Ltd., Istanbul, Turkey), Yasemin Yardımcı (Middle East Technical University, Ankara, Turkey), Ali Özgür Yöntem (Bilkent University, Ankara, Turkey), and Gernot Ziegler (Max Planck Institute for Informatics, Saarbrücken, Germany).

We are especially grateful to the following external participants for their contributions to the discussions: Güliz Ger (Bilkent University, Faculty of Business Administration, Department of Management, Ankara, Turkey), an expert on the sociocultural dimensions of consumption, consumption and marketing in transitional societies and groups, and related issues of globalization, modernity, and tradition; Murat Karamüftüoğlu (Bilkent University, Faculty of Art, Design, and Architecture, Department of Communication and Design, Ankara, Turkey), an expert on information retrieval theory, design, and evaluation, computer-mediated communication and collaborative work, computer semiotics, the philosophical foundations of information systems, and the organizational, social, and political implications of information systems; Dilek Kaya-Mutlu (Bilkent University, Faculty of Art, Design, and Architecture, Department of Graphic Design, Ankara, Turkey), an expert on film studies with an emphasis on audience studies, film reception, and Turkish cinema; Jinwoong Kim (ETRI, Radio and Broadcasting Research Laboratory, Daejeon, Republic of Korea), 3DTV Project Leader; Fatih Porikli (Mitsubishi Electric Research Labs, Cambridge, USA), Principal Member and Computer Vision Technical Leader; and Özlem Sandıkçı (Bilkent University, Faculty of Business Administration, Department of Management, Ankara, Turkey), an expert on culturally oriented issues in marketing, including advertising reception, gender and advertising, consumption culture, and the relationships between modernity, postmodernity, globalization, and consumption.

Special thanks go to Gozde B. Akar (Middle East Technical University, Faculty of Engineering, Department of Electrical and Electronics Engineering, Ankara, Turkey) and Yasemin Soysal (University of Essex, Department of Sociology, Colchester, United Kingdom) for their critical comments on Part IV. We would like to take this opportunity to also thank Levent Onural of Bilkent University for his support as leader of the Network of Excellence on Three-Dimensional Television. Finally, we are grateful to Kirsten Ward for her careful editing of the manuscript.

This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.