3D Video and Its Applications
Takashi Matsuyama · Shohei Nobuhara · Takeshi Takai · Tony Tung
Takashi Matsuyama, Graduate School of Informatics, Kyoto University, Sakyo, Kyoto, Japan
Shohei Nobuhara, Graduate School of Informatics, Kyoto University, Sakyo, Kyoto, Japan
Takeshi Takai, Graduate School of Informatics, Kyoto University, Sakyo, Kyoto, Japan
Tony Tung, Graduate School of Informatics, Kyoto University, Sakyo, Kyoto, Japan
ISBN 978-1-4471-4119-8
ISBN 978-1-4471-4120-4 (eBook)
DOI 10.1007/978-1-4471-4120-4
Springer London Heidelberg New York Dordrecht

Library of Congress Control Number: 2012940250

© Springer-Verlag London 2012

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
To our colleagues, students, and families
Preface
This book addresses the 3D video production technologies and applications developed in our laboratory at Kyoto University, Japan, over more than the past ten years. In 1996, we started the Cooperative Distributed Vision project, in which a group of network-connected active cameras monitors a 3D real-world scene to cooperatively detect and track people in real time. At the last stage of the project, in 1999, we applied the system to synchronized multi-view video capture to measure full 3D human shape and motion, which was followed around 2000 by the development of texture mapping methods to generate full 3D video. Since then, we have successively worked to improve multi-view video capture systems in both image resolution and object movable space, to implement parallel processing methods that reconstruct 3D shape and motion in real time on a PC cluster, and to develop accurate 3D shape and motion reconstruction algorithms as well as high fidelity texture mapping and lighting environment estimation methods. With these 3D video production technologies, in 2002 we started to explore applications of 3D video, including interactive 3D visualization, 3D content editing, and data compression methods, to cultivate the world of 3D video.

This book gives a comprehensive view of the state of the art of the 3D video production technologies and applications we developed, as well as of related contemporary visual information media technologies, which will help graduate students and young researchers understand the world of 3D video.
Since the employed technologies span a very wide range of technical disciplines, covering real-time synchronized multi-view video capture, object tracking with a group of active cameras, geometric and photometric camera calibration, parallel processing on a PC cluster, 2D image and video processing, 3D shape and motion reconstruction, texture mapping and image rendering, lighting environment estimation, attractive 3D visualization, visual content analysis and editing, 3D body action analysis, and data compression, we provide references to books and technical survey papers on these fundamental areas so that readers can acquire the background knowledge of 3D video.

Although we have established the technical skills and know-how for implementing multi-view video capture systems, and the quality of generated 3D video has been significantly improved with advanced technologies, a high fidelity 3D video production system and its casual use in everyday-life environments remain future research targets. We hope this book will promote further explorations of the world of 3D video.

Takashi Matsuyama
Shohei Nobuhara
Takeshi Takai
Tony Tung
Acknowledgements
Our research activities on 3D video over the past decade have been supported by the Japanese government under several different programs: the Research for the Future Program of the Japan Society for the Promotion of Science (1996–2000); a Grant-in-Aid for Scientific Research (2001–2003) and the National Project on Development of High Fidelity Digitization Software for Large-Scale and Intangible Cultural Assets (2004–2008) of the Ministry of Education, Culture, Sports, Science & Technology; and the Strategic Information and Communications R&D Promotion Programme of the Ministry of Internal Affairs and Communications (2006–2008). As of 2011, we have received another Grant-in-Aid for Scientific Research (2011–2012) and started a collaborative research project with Nippon Telegraph and Telephone Corporation to explore further advanced 3D video technologies and applications. We are very grateful for all this support.

Since the 3D video project has long been one of the major research topics in our laboratory, many undergraduate, master, and Ph.D. students, as well as faculty members and postdoctoral researchers, have been engaged in it. Some have stayed in our laboratory and contributed to this book; others are working in universities, research institutions, and companies to develop new information technologies. Among others, Prof. Toshikazu Wada and Dr. Xiaojun Wu established the foundations of 3D video production systems; Profs. Norimichi Ukita and Shinsaku Hiura implemented a real-time active multi-target tracking system with a group of network-connected pan-tilt-zoom cameras; Dr. Atsuto Maki, Mr. Hiromasa Yoshimoto, and Mr. Tatsuhisa Yamaguchi developed a 3D video capture system for an object moving in a wide area with a group of active cameras; and Prof. Hitoshi Habe and Dr. Lyndon Hill implemented a sophisticated 3D video coding method. Mr. Qun Shi developed a gaze estimation method from captured 3D video.
Without their efforts and enthusiasm, we could not have explored such a wide range of 3D video production technologies and applications as is covered in this book. Needless to say, our everyday research activities are supported by the staff of our laboratory and university. We would like to thank all of our former and current secretaries for their devoted work on project proposal writing, budget management, travel schedule planning, conference and workshop organization, paper and homepage preparation, and so on. Their charming smiles encouraged us to attack difficult problems in research as well as in management. Last, but of course not least, we would like to express our sincere appreciation to our families for their everlasting support and encouragement, especially to our wives, Mrs. Akemi Matsuyama, Yoshiko Nobuhara, Shiho Takai, and Bidda Camilla Solvang Poulsen.
Contents

1 Introduction
  1.1 Visual Information Media Technologies
  1.2 What Is and Is Not 3D Video?
  1.3 Processing Scheme of 3D Video Production and Applications
  References

Part I Multi-view Video Capture

2 Multi-camera Systems for 3D Video Production
  2.1 Introduction
    2.1.1 Single-Camera Requirements
    2.1.2 Multi-camera Requirements
  2.2 Studio Design
    2.2.1 Camera Arrangement
    2.2.2 Camera
    2.2.3 Lens
    2.2.4 Shutter
    2.2.5 Lighting
    2.2.6 Background
    2.2.7 Studio Implementations
  2.3 Camera Calibration
    2.3.1 Geometric Calibration
    2.3.2 Photometric Calibration
  2.4 Performance Evaluation of 3D Video Studios
  2.5 Conclusion
  References

3 Active Camera System for Object Tracking and Multi-view Observation
  3.1 Introduction
    3.1.1 Fundamental Requirements for Multi-view Object Observation for 3D Video Production
    3.1.2 Multi-view Video Capture for a Wide Area
  3.2 Cell-Based Object Tracking and Multi-view Observation
    3.2.1 Problem Specifications and Assumptions
    3.2.2 Basic Scheme of the Cell-Based Object Tracking and Multi-view Observation
    3.2.3 Design Factors for Implementation
    3.2.4 Cell-Based Camera Control Scheme
  3.3 Algorithm Implementation
    3.3.1 Constraints between Design Factors and Specifications
    3.3.2 Studio Design Process
    3.3.3 Cell-Based Camera Calibration
    3.3.4 Real-Time Object Tracking Algorithm
  3.4 Performance Evaluations
    3.4.1 Quantitative Performance Evaluations with Synthesized Data
    3.4.2 Quantitative Performance Evaluation with Real Active Cameras
  3.5 Designing a System for Large Scale Sport Scenes
    3.5.1 Problem Specifications
    3.5.2 Camera and Cell Arrangements
  3.6 Conclusion and Future Works
  References

Part II 3D Video Production

4 3D Shape Reconstruction from Multi-view Video Data
  4.1 Introduction
  4.2 Categorization of 3D Shape Reconstruction Methods for 3D Video Production
    4.2.1 Visual Cues for Computing 3D Information from 2D Image(s)
    4.2.2 Full 3D Shape Reconstruction
    4.2.3 Dynamic Full 3D Shape Reconstruction for 3D Video Production
  4.3 Design Factors of 3D Shape Reconstruction Algorithms
    4.3.1 Photo-Consistency Evaluation
    4.3.2 Visibility and Occlusion Handling
    4.3.3 Shape Representation and Optimization
  4.4 Implementations and Performance Evaluations of 3D Shape Reconstruction Algorithms
    4.4.1 3D Shape from Multi-view Images
    4.4.2 Simultaneous 3D Shape and Motion Estimation from Multi-view Video Data by a Heterogeneous Inter-frame Mesh Deformation
  4.5 Conclusion
  References

5 3D Surface Texture Generation
  5.1 Introduction
    5.1.1 Texture Painting, Natural-Texture Mapping, and Texture Generation
    5.1.2 Problems in Texture Generation
  5.2 Geometric Transformation Between a 3D Mesh and a 2D Image
  5.3 Appearance-Based View-Independent Texture Generation
    5.3.1 Notation and Studio Configuration
    5.3.2 Generating Partial Texture Images
    5.3.3 Combining Partial Texture Images
    5.3.4 Discussions
  5.4 View-Dependent Vertex-Based Texture Generation
    5.4.1 Algorithm
    5.4.2 Discussions
  5.5 Harmonized Texture Generation
    5.5.1 Background
    5.5.2 Algorithm Overview
    5.5.3 Mesh Optimization
    5.5.4 View-Dependent Texture Deformation
    5.5.5 Experimental Results
  5.6 Conclusions
  References

6 Estimation of 3D Dynamic Lighting Environment with Reference Objects
  6.1 Introduction
  6.2 Lighting Environment Estimation Methods
    6.2.1 Direct Methods
    6.2.2 Indirect Methods
  6.3 Problem Specifications and Basic Ideas
    6.3.1 Computational Model
    6.3.2 3D Shape from Silhouette and 3D Light Source from Shadow
    6.3.3 Basic Ideas
  6.4 Algebraic Problem Formulation
  6.5 Algorithm for Estimating the 3D Distribution of Point Light Sources with the Skeleton Cube
    6.5.1 Technical Problem Specifications
    6.5.2 Skeleton Cube
    6.5.3 Lighting Environment Estimation Algorithm
  6.6 Performance Evaluation
  6.7 Surface Reflectance Estimation and Lighting Effects Rendering for 3D Video
    6.7.1 Generic Texture Generation
    6.7.2 Lighting Effects Rendering
  6.8 Conclusions
  References

Part III 3D Video Applications

7 Visualization of 3D Video
  7.1 Introduction
  7.2 3D Video Visualization System
  7.3 Subjective Visualization by Gaze Estimation from 3D Video
    7.3.1 3D Face Surface Reconstruction Using Symmetry Prior
    7.3.2 Virtual Frontal Face Image Synthesis
    7.3.3 Gaze Estimation Using a 3D Eyeball Model
    7.3.4 Performance Evaluation
    7.3.5 Subjective Visualization
  7.4 Conclusion
  References

8 Behavior Unit Model for Content-Based Representation and Edition of 3D Video
  8.1 Introduction
  8.2 Topology Dictionary
    8.2.1 Dataset Clustering
    8.2.2 Markov Motion Graph
  8.3 Topology Description Using Reeb Graph
    8.3.1 Characterization of Surface Topology with Integrated Geodesic Distances
    8.3.2 Construction of the Multi-resolution Reeb Graph
    8.3.3 Robustness
    8.3.4 Advantage
  8.4 Behavior Unit Model
    8.4.1 Feature Vector Representation
    8.4.2 Feature Vector Similarity Computation
    8.4.3 Performance Evaluation
    8.4.4 Data Stream Encoding
    8.4.5 Data Stream Decoding
  8.5 Applications
    8.5.1 Behavior Unit Edition
    8.5.2 Semantic Description
  8.6 Performance Evaluations
    8.6.1 Topology Dictionary Stability
    8.6.2 3D Video Progressive Summarization
    8.6.3 Semantic Description
  8.7 Conclusion
  References

9 Model-Based Complex Kinematic Motion Estimation
  9.1 Introduction
  9.2 Skin-and-Bones Model for Kinematic Motion Estimation from 3D Video
  9.3 Reliability Evaluation Methods
    9.3.1 Reliability Measure Based on the Surface Visibility
    9.3.2 Reliability Measure Based on the Photo-Consistency
  9.4 Kinematic Motion Estimation Algorithm Using the Reliability Measures
  9.5 Performance Evaluation
    9.5.1 Quantitative Performance Evaluation with Synthesized Data
    9.5.2 Qualitative Evaluations with Real Data
  9.6 Conclusion
  References

10 3D Video Encoding
  10.1 Introduction
    10.1.1 Encoding 3D Visual Media into 2D Video Data
    10.1.2 Problem Specification for 3D Video Encoding
  10.2 Geometry Images
    10.2.1 Overview
    10.2.2 Cut Graph Definition
    10.2.3 Parameterization
    10.2.4 Data Structure Constraints
  10.3 3D Video Data Encoding
    10.3.1 Resolution
    10.3.2 Encoding and Decoding
  10.4 Stable Surface-Based Shape Representation
    10.4.1 Stable Feature Extraction
    10.4.2 Temporal Geodesic Consistency
    10.4.3 Stable Surface-Based Graph Construction
  10.5 Performance Evaluations
  10.6 Conclusion
  References

Index
Part I
Multi-view Video Capture
Rapid advances in digital technologies and the Internet over the past decade have made digital still and video cameras ubiquitous in everyday life. Most mobile phones, tablets, and laptop PCs are equipped with cameras, and a huge number of pictures and video streams are exchanged over the Internet every second. Moreover, 3D cinemas and 3D TV monitors have recently been put on the market and are gradually becoming popular. In city areas, a large number of cameras are deployed for security and traffic monitoring. A variety of computer vision technologies, such as auto-focusing on human faces and image stabilization against hand shake, are used in modern cameras. Quite recently, moreover, new types of camera incorporating highly advanced computer vision technologies have been commercialized to enhance the utility of cameras in everyday life: a 2.5D range imaging camera for interactive video games [1] and light-field cameras for interactively shifting the focused point in a recorded picture [2–4].

Since the major objective of these cameras is to make pictures beautiful, attractive, and enjoyable, many artificial visual effects are introduced to enhance "image quality" based on characteristics of the human visual system and psychology. As a result, the captured image data lose their quantitative physical grounding. For 3D video production, in contrast, cameras are used as physical sensors for processing based on geometry, photometry, and dynamics, which requires geometric, photometric, and dynamical camera calibration. Since the calibration accuracy determines the quality of the produced 3D video data, the selection of cameras and their calibration methods is one of the most important processes in 3D video production. Moreover, since 3D video is produced from multi-view video captured by a group of cameras surrounding an object (or objects) in motion, the layout design of the cameras and their mutual geometric, photometric, and dynamical calibration are required in addition to the calibration of each individual camera.
In this part, Chap. 2 discusses the geometric, photometric, and dynamical design and calibration methods of multi-view camera systems, including practical implementation technologies. Chapter 3 presents a novel multi-view active camera system for tracking and capturing an object moving in a wide area, which allows us to produce 3D video of sports such as ice skating.
References

1. Microsoft: Kinect (2010)
2. Levoy, M., Hanrahan, P.: Light field rendering. In: Proc. of ACM SIGGRAPH, pp. 31–42 (1996)
3. Raytrix GmbH: Raytrix-R29 3D lightfield-camera (2011)
4. Lytro: Lytro camera (2012)
Chapter 2
Multi-camera Systems for 3D Video Production
2.1 Introduction

As discussed in the previous chapter, 3D video records the full 3D shape, motion, and surface texture of an object in motion, rather than a pair of stereo video streams or 2.5D range data. To produce such data, the entire 3D object surface should be captured simultaneously. The practical way to do this is to employ a group of video cameras (in what follows, we refer to video cameras simply as cameras), place them so as to surround an object in motion, and reconstruct its 3D shape, motion, and surface texture from the group of multi-view video streams, each recording a partial 2D or 2.5D view of the object. While several advanced 3D video capture systems [5] are being developed that introduce Time-of-Flight cameras [25] and/or active-stereo cameras with structured light to capture 2.5D range video in addition to ordinary cameras, we do not consider such 2.5D cameras in this book; instead, we present 3D video production methods that reconstruct 3D object shape, motion, and surface texture from multi-view 2D video data. General limitations of current 3D video production technologies are:

• In principle, multiple objects in motion can be captured at the same time. In practice, however, since their mutual occlusions degrade the quality of 3D video data, most 3D video data are produced for a single object. In what follows, therefore, we assume that a 3D video stream of one object in motion is produced, except when we explicitly refer to multiple objects.
• Since reconstructing 3D object shape, motion, and surface texture in natural environments is very difficult due to dynamically changing background objects and lighting environments, most 3D video data are produced from multi-view video captured in well-designed studios.

As will be discussed in the next part, even assuming a single object in motion in a well-designed studio, many technical problems remain to be solved for producing high fidelity 3D video.
Table 2.1 Camera parameters and their effects

  Parameter  Effect
  Iris       The smaller the aperture, the deeper the depth of field, but the darker the image becomes
  Gain       The smaller the gain, the less noisy, but the darker the image becomes
  Shutter    The faster the shutter, the less motion-blurred, but the darker the image becomes
  Zoom       The smaller the zooming factor, the deeper the depth of field and the wider the field of view, but the lower the image resolution of the object becomes
This chapter presents and discusses the requirements, design factors, and implementation methods of a multi-view camera studio for 3D video production (a 3D video studio, for short). The basic policy we employed is to implement 3D video studios with off-the-shelf devices rather than to develop specialized ones for 3D video production. This is not only to develop cost-effective systems for casual use but also to investigate the essential problems in 3D video production. Thus, all devices introduced in this chapter and the next can easily be obtained to start research and development of 3D video.
2.1.1 Single-Camera Requirements

The requirements for 3D video studios can be classified into two categories: single-camera requirements and multi-camera requirements. The former include the following.

1. A camera should stay well focused on the object during its motion.
2. Captured video should not contain motion blur, even if the object moves fast.
3. The dynamic range of a camera should be adjusted to the lighting environment of the studio so that color data are captured accurately.
4. The resolution of a camera should be high enough to capture detailed object surface textures.
5. The field of view of a camera should be wide enough to cover the object in motion.

To satisfy these requirements, the following camera parameters should be adjusted: focus, iris (aperture size), gain, color balance, shutter speed (exposure time), zoom (focal length, i.e. field of view), and position and orientation (pan and tilt). Table 2.1 summarizes the effects of some of these parameters, which exhibit mutual dependencies and hence trade-offs. For example, while closing the iris and shortening the exposure time as much as possible help satisfy requirements 1 and 2, very powerful lighting is then required to satisfy requirement 3. Moreover, requirements 4 and 5 are in a trade-off relation, whose practical solution with active cameras will be given in Chap. 3. Thus, we need to find an acceptable set of parameters by considering the trade-offs between them.
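These trade-offs can be made quantitative with textbook thin-lens formulas. The following sketch is ours, not from the book; the lens, sensor, and motion values are hypothetical examples, and the helper names are illustrative, not an established API.

```python
# Illustrative sketch of the iris/shutter trade-offs in Table 2.1.
# All numerical settings below are hypothetical examples.

def depth_of_field(f_mm, f_number, focus_dist_mm, coc_mm=0.01):
    """Near/far limits of acceptable focus for a thin lens (standard formulas)."""
    H = f_mm ** 2 / (f_number * coc_mm) + f_mm          # hyperfocal distance
    near = focus_dist_mm * (H - f_mm) / (H + focus_dist_mm - 2 * f_mm)
    far = (focus_dist_mm * (H - f_mm) / (H - focus_dist_mm)
           if focus_dist_mm < H else float("inf"))
    return near, far

def motion_blur_px(speed_mm_s, exposure_s, f_mm, depth_mm, pixel_mm=0.005):
    """Image-plane blur in pixels for a point moving laterally at the given depth."""
    return speed_mm_s * exposure_s * (f_mm / depth_mm) / pixel_mm

# Example: a 12.5 mm lens at f/4 focused at 3 m, object walking at 1 m/s,
# 1/250 s shutter. Closing the iris (larger f_number) deepens the depth of
# field; shortening the exposure shrinks the blur; both darken the image.
near, far = depth_of_field(12.5, 4.0, 3000)
blur = motion_blur_px(1000, 1 / 250, 12.5, 3000)
print(f"in focus from {near / 1000:.2f} m to {far / 1000:.2f} m, blur {blur:.1f} px")
```

Rerunning with f/8 or a 1/1000 s shutter shows the same trade-off as Table 2.1: focus depth and sharpness improve while the light gathered per frame drops.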
2.1.2 Multi-camera Requirements

While the single-camera requirements are well known and much know-how has been developed for them in photography and cinema production, multi-camera requirements are rather specific to computer vision and to some modern cinematography with multiple camera systems. They include:

1. The accurate 3D positions and viewing directions of the cameras should be known, to integrate the captured multi-view video data geometrically.
2. The cameras should be accurately synchronized, to integrate the captured multi-view video data temporally.
3. The accurate brightness and chromatic characteristics of the cameras should be known, to integrate the captured multi-view video data chromatically.
4. Every object surface area should be observed by at least two cameras so that its 3D shape can be reconstructed by stereo-based methods; while visual cues in a single image, such as shading, can be used to reconstruct 3D object shape, absolute 3D depth cannot be computed, and many assumptions that are not always valid in the real world are required.

Requirements 1, 2, and 3 imply that the cameras should be well calibrated geometrically and photometrically as well as synchronized. While these requirements can be satisfied to reasonable accuracy with modern camera calibration methods, the last requirement is rather hard to satisfy; especially for objects with loose clothes, such as MAIKO, or objects performing complex actions, such as yoga, it cannot be satisfied completely. As will be discussed in detail in the next part, moreover, multi-view surface observability plays a crucial role in 3D shape and motion reconstruction (Chap. 4) and texture generation (Chap. 5) for 3D video production. Consequently, the camera layout should be designed carefully so that as many object surface areas as possible are observed.
As the first step toward 3D video production, this chapter establishes a technical understanding of how to find a feasible set of camera parameters that satisfies the above requirements in practice. Section 2.2 first discusses the design factors of 3D video studios and introduces the three 3D video studios we developed. Then, Sect. 2.3 presents geometric and photometric camera calibration methods. The introduction and calibration of active cameras for tracking and capturing multi-view video of an object moving in a wide area will be presented in Chap. 3. The calibration or estimation of lighting environments will be presented in Chap. 6, since it is closely related to the texture generation in Chap. 5. Section 2.4 evaluates the performance of the three 3D video studios we developed, where the accuracy of the geometric camera calibration is quantitatively evaluated. Section 2.5 concludes this chapter with discussions and future work.
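To illustrate why requirement 1 above matters, the following sketch (ours, not the authors' implementation) triangulates a surface point from two views by the standard linear (DLT) method, assuming the projection matrices are known from geometric calibration. The intrinsics and camera poses below are hypothetical examples.

```python
# Triangulation from two geometrically calibrated views (standard DLT method).
# With accurate projection matrices, multi-view observations of the same
# surface point can be integrated into a single 3D position.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover a 3D point from its pixel projections in two calibrated views."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # least-squares solution of A X = 0
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

# Hypothetical intrinsics and poses: two cameras 0.5 m apart, same orientation.
K = np.array([[1000, 0, 320], [0, 1000, 240], [0, 0, 1]], float)
P1 = K @ np.hstack([np.eye(3), [[0], [0], [0]]])
P2 = K @ np.hstack([np.eye(3), [[-500], [0], [0]]])

# Project a known point into both views, then recover it.
Xtrue = np.array([100.0, -50.0, 3000.0])
x1 = P1 @ np.append(Xtrue, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(Xtrue, 1); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))
```

Errors in the assumed positions or viewing directions propagate directly into the recovered 3D coordinates, which is why calibration accuracy defines the quality of the produced 3D video data.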
2 Multi-camera Systems for 3D Video Production
2.2 Studio Design

Here we discuss the technical problems of designing a 3D video studio with static cameras. While, as will be discussed later, such a system constrains the object movable space in order to satisfy the requirements described above, most 3D video studios developed so far have used static cameras to produce high-quality 3D video. The introduction of active cameras, which cooperatively track an object and capture multi-view high-resolution video, is a promising method to expand the object movable space. Since such an active multi-view video capture system should satisfy an additional requirement of dynamic camera control synchronized with the object motion, we confine ourselves to static cameras in this chapter and reserve the discussion of active cameras for Chap. 3.
2.2.1 Camera Arrangement

One of the most important design factors of a 3D video studio is how to determine (1) the number of cameras to be installed and (2) their spatial arrangement to achieve high 3D shape reconstruction accuracy. If we do not have any specific knowledge about the object shape or motion, or if we want to capture a variety of objects in the same studio, one reasonable solution is to employ a circular ring camera arrangement, where a group of cameras placed evenly along the ring observe the object performing actions at the ring center. We may call this a converging multi-camera arrangement. Figure 2.1 illustrates the three typical multi-camera arrangements: besides the converging arrangement, there are the diverging multi-camera arrangement for omni-directional image capture and the parallel multi-camera arrangement for multi-baseline stereo [23, 27] and light-field modeling [15].

The next design factors to be specified concern the placement of the camera ring and the number of cameras installed on it. In [28], we pointed out:

- The best observability of the object surface with a single-ring camera arrangement is achieved by locating the ring at the mid-height of the target object.
- Shape-from-silhouette methods for 3D shape reconstruction (Sect. 4.2.2.2) require at least nine cameras (40° spacing on the ring), and the reconstruction accuracy improves well as the number of cameras increases up to 16 (23°). Even with a larger number of cameras, the accuracy improvement is limited, since shape-from-silhouette methods can only reconstruct an approximate 3D shape of the object by definition (cf. "visual hull" in Sect. 4.2.2.2).
- Shape-from-stereo methods require at least 14 cameras (25°) for an optimal balance between matching accuracy and depth ambiguity; the wider the baseline between a pair of stereo cameras (i.e., wide-baseline stereo), the better the accuracy of the depth measurement, but the harder the stereo matching.
Fig. 2.1 Multi-view camera arrangements: (a) converging, (b) diverging, and (c) parallel arrangements
Fig. 2.2 Capturable area. Each camera can capture an object located in its field of view and within the depth-of-field (DoF) without blur. The capturable area of a multi-view camera system is given by the intersection of such “in-focus” areas
Hence, we conclude here that we need at least nine to 16 cameras for a 3D video studio with a single-ring camera arrangement. As will be shown later, in Sect. 2.2.7, practical 3D video studios are usually equipped with ceiling cameras in addition to the camera ring(s) to increase the observability of the top areas of an object.

The camera arrangement constrains the object movable space so as to guarantee the multi-view observability of the object surface. In general, the 3D observable space of a camera can be represented as a quadrilateral pyramid formed by its projection center and bounded image plane. Thus, intuitively, with a converging multi-camera arrangement, the object movable space is confined to the intersections of multiple quadrilateral pyramids (Fig. 2.2). That is, to guarantee the surface observation by at least two cameras, the object can only move in spaces where at least two pyramids intersect. Similar space limitations are introduced by focusing and zooming; they will be described later in this section.

It should be noted that with a sufficient number of cameras, not all of them need to capture the entire object image. That is, as long as all object surface areas can be observed by multiple cameras, some of the cameras can capture the object only partially by zooming in to increase the image resolution. In fact, one of our studios (Studio B in Table 2.3) employs this strategy to increase image resolution.

Table 2.2 Categorization of video cameras

| Feature           | Media production   | Machine vision                            | Consumer       |
|-------------------|--------------------|-------------------------------------------|----------------|
| Cost              | High               | Middle to low                              | Middle to low  |
| Quality           | High               | Middle to low                              | Middle to low  |
| Data transmission | HD-SDI             | IEEE1394b, 1000Base-T, CameraLink, USB3.0  | USB, IEEE1394a |
| Synchronization   | GenLock + Timecode | Trigger signal                             | N/A            |
| Lens              | PL- or PV-mount    | C-, CS-, or F-mount                        | Unchangeable   |
2.2.2 Camera

A large variety of commercial cameras are available on the market. They can be categorized by their application domains (Table 2.2). The first group is for professional media production, designed to achieve high-end quality: high resolution and high color depth. The second group is for industrial and machine vision; these cameras are originally designed for factory automation, robots, etc., and are relatively low-cost. The last group is for consumer use. They are widely available on the market, but not fully designed to interoperate with other cameras or controllers. Since 3D video studios require the synchronization of multiple cameras, consumer cameras cannot be used.

The important differences between media production and machine vision cameras are twofold. The first is in their image quality. Since media production cameras typically utilize a 3CCD system, they offer a full 8-bit depth for each color channel. On the other hand, most machine vision cameras utilize a 1CCD system with a Bayer color filter [2], and their effective color depth is reduced to one third. The second difference is in their synchronization mechanisms. While both media production and machine vision cameras accept a signal to control the video capture timing, there is an important difference in the temporal structures of the timing signals allowed. In the GenLock (generator lock) system for media production cameras, the signals should come regularly at a standardized interval such as 24 Hz, 29.97 Hz, etc. On the other hand, trigger systems for machine vision cameras allow signals to arrive at arbitrary timings. When selecting cameras, these two different synchronization mechanisms should be taken into account, especially when both types are employed in the same 3D video studio. Note that some machine vision cameras have yet another synchronization mechanism called "bus-sync", which makes all cameras on the same bus synchronize automatically without any additional signal.
Other practical factors when selecting cameras are the allowable cable length and the data transmission rate between a camera and its data receiver. HD-SDI (formally SMPTE-292M) connection for media production cameras and 1000Base-T
machine vision cameras (known as "GigE Vision" cameras, standardized by the AIA) allow a 100 m cable length. On the other hand, IEEE1394b (or FireWire 800), CameraLink, and USB3.0 connections for machine vision cameras allow only 3 m to 10 m without active repeaters, though some non-standard long cables are available on the market. Thus, the camera selection for real-time multi-view video capture should take into account the physical size of the 3D video studio, the bandwidth of the video data transfer, and the processing speed of the computers and storage devices.
2.2.3 Lens

While not usually discussed, the lens selection is very important to guarantee high-quality multi-view image capture, because a lens specifies the field of view, the amount of incoming light, and the depth of field.

The field of view can be computed from the physical imager size and the effective focal length of the lens. Suppose the imager size is W mm × H mm and the effective focal length is f mm. Then the horizontal and vertical field-of-view angles are simply given by

    FOV_H = 2 tan^-1 (W / 2f),
    FOV_V = 2 tan^-1 (H / 2f).     (2.1)

Imager sizes are often described by their "format", such as "1/1.8 inch sensor". For some historical reasons in optics, this number is equal to the diagonal size of the imager divided by 16 mm; that is, the diagonal length of a "1/1.8 inch sensor" is 1/1.8 × 16 = 8.89 mm.

The amount of light recorded by an imager through a lens is denoted by the F-number (or F-ratio, F-stop). The F-number is a dimensionless value given by the focal length divided by the effective aperture diameter of the lens. The larger the F-number, the smaller the lens opening, and the less light comes in. Therefore it is better to use a lens with a smaller F-number to capture brighter images of scenes under limited lighting environments.

The F-number also specifies the depth of field of a lens, which defines the depth range in which images can be captured without blur; a small F-number means a small depth of field. Since the physical pixel size is the finest resolvable point size in an image, blurring within this size does not introduce any effects in a captured image. This size is known as the circle of confusion, the maximum tolerable size of blurring. When a lens is focused at infinity, the distance D_H beyond which all object images are not blurred can be computed from the circle of confusion diameter c as

    D_H ≈ f^2 / (F c),     (2.2)
where f denotes the focal length and F the F-number. This distance D_H is called the hyperfocal distance. If the lens is focused at a distance d_f < D_H from the optic center, then the nearest and farthest distances between which all object images are not blurred are given as

    D_N ≈ D_H d_f / (D_H + d_f),     (2.3)
    D_F ≈ D_H d_f / (D_H - d_f),     (2.4)

respectively. Hence the depth of field is

    DOF = D_F - D_N = 2 D_H d_f^2 / (D_H^2 - d_f^2).     (2.5)

For example, let the physical pixel size be 4.4 µm × 4.4 µm, the focal length 6 mm, the F-number 1.4, and the focus distance 2.5 m. Then D_H ≈ (6 mm)^2 / (1.4 × 4.4 µm) = 5.84 m, D_N ≈ (5.84 × 2.5)/(5.84 + 2.5) = 1.75 m, and D_F ≈ (5.84 × 2.5)/(5.84 - 2.5) = 4.37 m. This means that when cameras are placed on a ring of 3 m radius, an object located within 1.25 m of the ring center can be captured in good focus without blur. However, if it moves more than 1.25 m = 3 m - 1.75 m toward a camera, then the image captured by that camera will not be well focused. That is, for a 3D video studio with a ring camera arrangement of radius R, the capturable area in terms of the depth of field can be approximated by the intersection of the concentric circles of diameter 2R - 2D_N and 2D_F - 2R, as illustrated in Fig. 2.2, which further constrains the movable space of an object.
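The depth-of-field arithmetic above is easy to check numerically. The following sketch (plain Python, all lengths in millimeters, function names are ours) reproduces the worked example:

```python
def hyperfocal(f, F, c):
    """Hyperfocal distance, Eq. (2.2): D_H ~ f^2 / (F c). All lengths in mm."""
    return f * f / (F * c)

def dof_limits(f, F, c, d_f):
    """Near and far in-focus limits, Eqs. (2.3)-(2.4), for focus distance d_f < D_H."""
    D_H = hyperfocal(f, F, c)
    D_N = D_H * d_f / (D_H + d_f)
    D_F = D_H * d_f / (D_H - d_f)
    return D_N, D_F

# Worked example from the text: c = 4.4 um pixel, f = 6 mm, F = 1.4, focus at 2.5 m.
D_H = hyperfocal(6.0, 1.4, 0.0044)               # about 5844 mm
D_N, D_F = dof_limits(6.0, 1.4, 0.0044, 2500.0)  # about 1751 mm and 4369 mm

# Ring of radius R = 3 m: the object stays in focus for every camera only while
# its distance to the ring center is below min(R - D_N, D_F - R).
R = 3000.0
r_capturable = min(R - D_N, D_F - R)             # about 1.25 m
```

Note that with these values the far limit D_F - R (about 1.37 m) is not the binding constraint; the near limit R - D_N is what restricts the object to about 1.25 m from the center.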
2.2.4 Shutter

The shutter speed controls the amount of motion blur as well as the incoming light. By shortening the shutter time, we can suppress motion blur at the cost of reducing the amount of incoming light. Similarly to the discussion on the depth of field, if the object motion appears smaller than the pixel size, then the image does not include any effects of motion blur.

There are two different types of shutter: global and rolling shutters. With the global shutter, all pixels in the imager start and end exposure simultaneously. In contrast, the rolling shutter makes each pixel line start exposure one by one, while a captured image can be transmitted frame-wise. This introduces illusory deformations into dynamic object images and makes 3D video production unnecessarily harder. Therefore we suggest global shutter cameras, most of which have CCD sensors.
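The pixel-size rule of thumb above can be turned into a shutter-time budget. A small sketch (the speed, depth, and lens values below are our own illustrative assumptions, not from the text), treating fronto-parallel motion under a pinhole model:

```python
def motion_blur_px(speed, exposure, f, pixel, depth):
    """Approximate image-space motion during the exposure, in pixels, for an
    object moving fronto-parallel at the given depth. speed in m/s, exposure
    in s, focal length f and pixel pitch in mm, depth in m."""
    f_px = f / pixel                    # focal length expressed in pixel units
    return speed * exposure * f_px / depth

# Assumed scenario: 1 m/s limb motion at 3 m depth, 6 mm lens, 4.4 um pixels.
blur_fast = motion_blur_px(1.0, 1 / 1000, 6.0, 0.0044, 3.0)  # below one pixel
blur_slow = motion_blur_px(1.0, 1 / 100, 6.0, 0.0044, 3.0)   # several pixels
```

Under these assumptions a 1/1000 s exposure keeps the motion under one pixel (no visible blur), while 1/100 s smears it across several pixels.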
Fig. 2.3 Single-view and multi-view chroma-keying. (a) In single-view chroma-keying, colored reflections from the background to the object surface are occluded from the camera. On the other hand, in multi-view environment (b), colored reflections are observed from multiple cameras
2.2.5 Lighting

In a 3D video studio, the camera arrangement constrains the arrangement of light sources as well as the object movable space. In general, cameras should not observe light sources directly, because strong direct light damages captured images. While ordinary single-camera systems avoid this problem by locating light sources next to the camera, such a light source arrangement cannot be used for multi-view ring camera systems: a light source placed near one camera is captured by the other cameras. Thus, one reasonable solution is to locate light sources on the ceiling and set the viewing directions of the cameras so that captured images do not include the ceiling (Fig. 2.3). To facilitate such a light source arrangement, 3D video studios should have sufficient ceiling height, as ordinary TV studios and theaters do.

As discussed before, to enhance the multi-view image capture capabilities of a 3D video studio, the amount of light incoming to an image sensor is reduced (1) with a smaller iris to make the depth of field, and hence the capturable area, wider, and (2) with a shorter shutter time to avoid motion blur. To compensate for these darkening effects, we should increase the lighting or the sensor gain, although the latter usually reduces the SN ratio of captured images.

Typical lighting systems consist of halogen lamps, fluorescent tubes, LEDs, etc. While they have different characteristics in initial cost, energy efficiency, lifetime, color, and so on, an important point for the 3D video studio design is whether or not they flicker. In particular, fluorescent tubes without inverters blink at 100 or 120 Hz (twice the AC frequency), making the global illumination level drift periodically. This should be avoided in a 3D video studio. Besides these continuous lighting devices, we can use lighting devices which flash synchronously with the camera exposures. For example, we can use projectors as programmable lights, or strobe lights to "freeze" images of an object in quick motion [31].
To make full use of such dynamic lighting, well-designed synchronization controls should be developed to coordinate video capture and lighting.
Another augmentation of the lighting is the introduction of structured light [1, 26] to realize active-stereo analysis. Since bright beams of structured light may disturb the human actions to be captured, infra-red structured lighting systems are used. In fact, advanced 3D video systems under development [5] employ such active sensing devices in addition to ordinary cameras. While studio and theater lighting design has been well studied and effective lighting is a very important design factor for producing attractive visual contents, this book does not cover it except in Chap. 6, which presents a method of estimating the 3D shapes, positions, and radiant intensities of distributed dynamic light sources.
2.2.6 Background

As will be discussed in Chap. 4, multi-view object silhouettes are very useful for the 3D object shape reconstruction. In particular, accurate silhouette contour extraction is crucial, since it directly determines the accuracy of the visual hull geometry (Sect. 4.2.2.2). In fact, the visual hull is often used as the initial estimate of the 3D object surface in practical algorithms (Sect. 4.4).

One straightforward solution for the silhouette extraction is to employ background subtraction or chroma-keying techniques. In the former, an object silhouette is given as the difference between a captured object image and the background image taken beforehand without any object. In the latter, a background with a known uniform color is prepared, and an object silhouette is extracted as the image regions having colors different from the background color. Both techniques are well studied and produce media-production-quality images in a studio setup.

However, it should be noted that chroma-keying in a multi-view camera studio introduces a non-negligible color bias into captured images (Fig. 2.3): blue or green light reflected from the background illuminates the object. In single-view chroma-keying, widely used in cinema and broadcast media production, this is known as "blue (or green) spill". It appears typically only around the occluding boundary, because most of the reflected light is occluded by the object. In 3D video studios, on the other hand, all surface areas are lit by colored reflections from the background. To avoid this color bias, we can use a gray background, as in Studios A and B in Fig. 2.4, or estimate the lighting environment in the 3D video studio by methods such as those presented in Chap. 6 and neutralize the illumination bias. The latter approach is left for future studies.
While we do not discuss object silhouette extraction methods in this book, even with state-of-the-art computer vision technologies it is still not possible to achieve perfect accuracy. Especially when an object wears very colorful clothes, like MAIKO with FURISODE, chroma-keying does not work well; moreover, wrinkles of her loose FURISODE are covered with soft shadows, and decorations in gold thread generate highlights. To cope with such complicated situations, ordinary 2D image processing methods alone are not enough, and hence advanced methods which integrate both multi-view 2D silhouette extraction and 3D shape reconstruction should be developed [8, 9, 13, 29, 32].

Fig. 2.4 Three 3D video studios developed at Kyoto University. The left column shows their interior scenes and the right the camera arrangements, respectively. The colored quadrilateral pyramids in the camera arrangements illustrate the projection centers and fields of view of the cameras

In summary, the problem of 3D video studio design can be regarded as the optimization of the object surface observability by a group of cameras, i.e., the surface coverage by well-focused, high-resolution, high-fidelity-color multi-view images. Since an object moves freely and performs complex actions, it is not possible to compute the optimal design analytically. Chapter 3 derives algebraic constraints in designing a 3D video studio with active cameras and analyzes their mutual dependencies to obtain a feasible solution. Finally, it should be noted that the 3D video studio design should be based on real-world physics, while the camera calibration discussed below is conducted on a simplified algebraic model.
2.2.7 Studio Implementations

Figure 2.4 and Table 2.3 show the three 3D video studios we have developed so far and their specifications, respectively. They were designed for different objectives. Studio A was designed to develop a 3D video studio with multi-view active cameras, which track and capture an object moving in a wide area. Its computational algorithm and technical details will be presented in Chap. 3.
Table 2.3 Specifications of the three 3D video studios developed at Kyoto University

| Feature            | Studio A                       | Studio B                       | Studio C                        |
|--------------------|--------------------------------|--------------------------------|---------------------------------|
| Objective          | Wide area                      | Accurate shape and color       | Transportable                   |
| Shape              | Square                         | Dodecagon                      | Rounded square                  |
| Size               | 10 m × 10 m, 2.4 m height      | 6 m diameter, 2.4 m height     | 6 m diameter, 2.5 m height      |
| Camera arrangement | High and low double rings with ceiling cameras | High and low double rings with ceiling cameras | Single ring with ceiling cameras |
| Camera             | Sony DFW-VL500 ×25             | Sony XCD-X710CR ×15            | Pointgrey GRAS-20S4C ×16        |
| Imager             | 1/3 inch 1CCD                  | 1/3 inch 1CCD                  | 1/1.8 inch 1CCD                 |
| Image format       | VGA/RAW                        | XGA/RAW                        | UXGA/RAW                        |
| Lens               | Integral, 5.5 mm to 64 mm      | C-mount, 6 mm & 3.5 mm         | C-mount, 6 mm & 3.5 mm          |
| Pan/tilt/zoom      | Active (with pan/tilt unit)    | Static                         | Static                          |
| Frame rate         | 12.5 fps                       | 25 fps                         | 25 fps                          |
| Capture PCs        | 25                             | 15                             | 2                               |
| Connection         | IEEE 1394a, 20 m cable         | IEEE 1394a, 20 m cable         | IEEE 1394b, 10 m cable          |
| Data rate          | 3.66 MB/s                      | 18.75 MB/s                     | 45.78 MB/s (366 MB/s per PC)    |
| Background         | Gray plywood                   | Gray plywood                   | Green screen                    |
| Lighting           | Overhead inverter fluorescent lights | Overhead inverter fluorescent lights | Overhead inverter fluorescent lights |
Studio B was designed to produce 3D video with accurate object surface geometry and texture for the digital archiving of traditional Japanese dances. Most of the multi-view video data used in this book were captured in this studio. Its gray static background eliminates the color bias discussed before and allows high-fidelity colored surface texture generation, which is an important requirement for digital archiving, especially of colorful Japanese clothes, KIMONO. Note, however, that chroma-keying with a gray background often introduces errors in object silhouettes: soft shadows at small wrinkles on object clothes are captured as gray regions. To remove such errors, image segmentation and/or 3D shape reconstruction methods should employ constraints on the connectivity of silhouette regions and the inter-viewpoint silhouette consistency [22].

Studio C was designed as a transportable 3D video studio to realize on-site 3D video capture. To minimize the studio equipment, it employs only two PCs to receive the 16 UXGA video streams, and a green screen background for easier silhouette extraction.
2.3 Camera Calibration

Following the 3D video studio design, geometric and photometric calibration should be carried out to obtain multi-view video data usable for 3D video production.
2.3.1 Geometric Calibration

2.3.1.1 Camera Model

The geometric camera calibration is the process that estimates the parameters of the geometric transformation conducted by a camera, which projects a 3D point onto the 2D image plane of the camera. Figure 2.5 illustrates the camera model used in this book. Note that this pinhole camera model simplifies the geometric transformations conducted by a physical camera and hence cannot represent important physical characteristics required to design a 3D video studio, such as the depth of field. While closely related, therefore, the 3D video studio design and the camera calibration should be considered as separate processes.

As shown in Fig. 2.5, the position of a 3D point in the scene is described by the vector ^W p = (x, y, z)^T in the world coordinate system W. ^W p is transformed to the camera coordinate system C by

    ^C p = R ^W p + T = (R | T) (^W p, 1)^T,     (2.6)

where R and T are the rotation matrix and the translation vector which describe the position and posture of the camera in the world coordinate system. Then the point ^C p in the camera coordinate system is transformed to (u, v)^T, the ideal position in the image coordinate system without considering the lens distortion:

    λ (u, v, 1)^T = A ^C p,

        | α  γ  u0 |   | ku  s   u0 | | f  0  0 |
    A = | 0  β  v0 | = | 0   kv  v0 | | 0  f  0 |,     (2.7)
        | 0  0  1  |   | 0   0   1  | | 0  0  1 |

where λ is a scale parameter which normalizes the third component of the left-hand-side vector to 1; by definition, λ is equal to the z-value (depth) of ^C p. f denotes the effective focal length of the camera in pixels, ku and kv denote the aspect ratio of the pixel, s denotes the skew parameter, and (u0, v0)^T the intersection point of the optic axis with the image screen represented in the image coordinate system.

Given (u, v)^T, its observed position (u', v')^T, which is transformed by the lens distortions, is modeled as a mapping in the normalized camera coordinates:

    (^N x', ^N y')^T = (1 + k1 r^2 + k2 r^4) (^N x, ^N y)^T,     (2.8)
Fig. 2.5 Camera model. A 3D point is first projected onto the ideal position (u, v) in the 2D image plane, and then shifted to the observed position (u , v ) by lens distortions
where r^2 = ^N x^2 + ^N y^2, and k1 and k2 are the radial distortion parameters. The normalized coordinate system is given by

    λ (^N x, ^N y, 1)^T = ^C p.     (2.9)

In other words, the matrix A in Eq. (2.7) of the normalized camera is the identity matrix. Finally, (u', v')^T is given as

    (u', v', 1)^T = A (^N x', ^N y', 1)^T.     (2.10)

In this camera model, R and T are called the extrinsic parameters. A is called the intrinsic parameter matrix, since it is independent of the camera position and posture. k1 and k2 are also independent of the extrinsic parameters, but are called the lens distortion parameters in particular. The geometric calibration is a process which estimates these extrinsic, intrinsic, and lens distortion parameters by observing some reference objects in the scene.
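Equations (2.6)-(2.10) chain together into a few lines of code. The following sketch (NumPy; the intrinsic and distortion values are made up for illustration) projects a world point to its observed pixel position:

```python
import numpy as np

def project(p_w, R, T, A, k1, k2):
    """World point -> observed pixel position, following Eqs. (2.6)-(2.10)."""
    p_c = R @ p_w + T                          # (2.6): world -> camera coordinates
    n = p_c[:2] / p_c[2]                       # (2.9): normalized coordinates
    r2 = n @ n
    n_d = (1.0 + k1 * r2 + k2 * r2 * r2) * n   # (2.8): radial distortion
    u, v, w = A @ np.array([n_d[0], n_d[1], 1.0])  # (2.10): apply intrinsics
    return np.array([u / w, v / w])

# Made-up intrinsics: f = 1360 px, square pixels, zero skew, center (800, 600).
A = np.array([[1360.0, 0.0, 800.0],
              [0.0, 1360.0, 600.0],
              [0.0, 0.0, 1.0]])
uv = project(np.array([0.1, 0.0, 2.0]), np.eye(3), np.zeros(3), A, -0.2, 0.05)
```

With zero distortion the point would land at (868, 600); the negative k1 pulls it slightly toward the image center, as barrel distortion does.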
2.3.1.2 Computational Methods for Static Camera Calibration

In general:

- The camera calibration should be done by placing reference objects around the 3D local area where the object to be captured in 3D video performs actions, because the accuracy of the camera calibration is guaranteed only around the reference objects.
- The camera calibration should employ a non-linear optimization such as the bundle adjustment as the final step to minimize a geometrically meaningful error metric such as the reprojection error.
Fig. 2.6 Planar pattern for the camera calibration. Left: observed image. Right: rectified image using estimated intrinsic and lens distortion parameters
This section introduces a practical four-step calibration procedure, while any calibration procedure can be used as long as the above-mentioned points are satisfied:

Step 1. Intrinsic and lens distortion parameter estimation by Zhang's method [33].
Step 2. Extrinsic parameter calibration by the 8-point algorithm [10].
Step 3. Non-linear optimization (bundle adjustment).
Step 4. Global scale and position adjustment.
2.3.1.2.1 Intrinsic and Lens Distortion Parameter Estimation

The most standard camera calibration method is the planar-pattern-based method proposed by Zhang [3, 4, 33]. Given a set of planar reference 3D points whose positions on the plane are known, it estimates the camera position and posture with respect to the reference, as well as the intrinsic and lens distortion parameters. Figure 2.6 shows the planar pattern used for the calibration. In this method, the planar pattern defines the world coordinate system. This method, however, cannot be used directly for the calibration of the multi-camera system in a 3D video studio:

- With the ring camera arrangement, the placement of the planar pattern is very limited if simultaneous observation by all cameras is to be guaranteed. One possible placement satisfying the simultaneous multi-view observation is on the floor, but then the accuracy of the pattern detection in observed images is degraded, because the cameras observe the plane at very shallow angles.
- The placement limitation can also degrade the overall calibration accuracy; the reference plane should be placed in the object action space to guarantee the calibration accuracy.

Note that a transparent planar pattern would solve these problems, although its specular surface reflections would introduce another placement limitation from lighting environments. Thus, we use Zhang's method only for the intrinsic and lens distortion
parameter estimation, which can be done for each camera independently, and employ a multi-view extrinsic parameter estimation method at the second step. With Zhang’s method, the intrinsic parameters represented by A in Eq. (2.7) and the lens distortion parameters k1 and k2 in Eq. (2.8) are estimated. Figure 2.6 compares a captured image of the reference pattern and its rectified image with the estimated parameters.
2.3.1.2.2 Extrinsic Parameter Estimation

Given the intrinsic and lens distortion parameters of each camera, we can compute the relative positions of multiple cameras by the linear 8-point [10], non-linear 5-point [20], or trifocal-tensor-based [6] algorithms from 2D-to-2D point correspondences (Fig. 2.7). To implement a practical extrinsic parameter estimation method, we have to develop methods to (1) obtain accurate 2D-to-2D point correspondences and (2) calibrate the multiple cameras from these correspondences.

Fig. 2.7 Extrinsic parameter estimation. With several known 2D-to-2D point correspondences in a pair of observed images (p1 to p1', ..., pn to pn'), the relative 3D position and posture of the two cameras (R and T) can be estimated up to scale

Fig. 2.8 2D-to-2D point correspondences by using chess corners (left, by [4]) and sphere centers (right)

For (1), we can make full use of the synchronized multi-view image capture: move a uniquely identifiable reference object(s) so as to scan the possible object action space, and regard the reference object positions in simultaneously captured multi-view images as corresponding points. To make this method work well, the feature point(s) on the reference object should be view-invariant: for example, 2D chess corners or the center of a 3D sphere (Fig. 2.8).

A simple solution for (2) is to use the 8-point algorithm for estimating the relative position and posture of each camera pair. Since the 8-point algorithm estimates the pair-wise relative position only up to a scale factor, we determine the relative positions of all cameras by the following process. Consider three cameras A, B, and C as the minimal setup for multi-camera calibration.

1. Use camera A as the reference, i.e., describe the positions and postures of B and C in the camera A coordinate system.
2. Estimate the relative position and posture of each pair A-B, B-C, and C-A. Note that we have an unknown scale factor for each pair of cameras: λ_AB, λ_BC, and λ_AC. Let the relative posture and position of camera Y w.r.t. camera X be ^X R_Y and ^X T_Y, which transform a point ^Y p in the camera Y coordinate system to the camera X coordinate system by ^X p = ^X R_Y ^Y p + λ_XY ^X T_Y. Here we can assume |^X T_Y| = 1 without loss of generality.
3. Let ^A 0 denote the origin of the camera A coordinate system.
4. The origin of the camera B coordinate system is represented by ^A R_B ^B 0 + λ_AB ^A T_B = λ_AB ^A T_B in the camera A coordinate system.
5. Similarly, the origin of the camera C coordinate system is represented by λ_AC ^A T_C in the camera A coordinate system.
6. On the other hand, the origin of the camera C coordinate system is represented by λ_BC ^B T_C in the camera B coordinate system, that is, by λ_BC ^A R_B ^B T_C + λ_AB ^A T_B in the camera A coordinate system.
7. By equating the above two representations of the origin of the camera C coordinate system, we obtain a constraint on the three scale factors. That is, the three coordinate systems of cameras A, B, and C are integrated into a common coordinate system with one scale factor.

By iteratively applying this method to the other cameras one by one, we can describe all camera positions and postures in the camera A coordinate system with a single scale factor. Notice that this process obviously accumulates calibration errors through the iteration. However, this is not a serious problem, since the following non-linear optimization will reduce these errors.

From a practical point of view, we can use this extrinsic parameter calibration to verify whether or not the calibration processes and the multi-camera system are working correctly. That is, if the 8-point algorithm fails to estimate the pair-wise positions and postures, i.e., if calibration errors such as the sum of the reprojection errors (described in the next section) are not acceptable, then examine whether

1. the corresponding point estimation introduced errors due to false-positive and/or false-negative detections, or
2. the multi-camera synchronization is not working properly, producing erroneous point correspondences.

Since both the calibration and the synchronization are the most crucial requirements for 3D video production, it is highly recommended to check the calibration errors before optimizing the parameters.
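The scale integration in step 7 amounts to a small linear least-squares problem. A sketch (NumPy; the three-camera geometry below is synthetic) that recovers the unknown scales once the A-B scale is fixed:

```python
import numpy as np

def chain_scales(T_ab, R_ab, T_bc, T_ac):
    """Recover the pair-wise scale factors of step 7, with the A-B scale
    fixed so that T_ab carries the global scale. Writing the origin of
    camera C in A's frame in two ways gives
        lam_ac * T_ac = lam_bc * (R_ab @ T_bc) + T_ab,
    three linear equations in the two unknown scales (least squares)."""
    M = np.column_stack([T_ac, -(R_ab @ T_bc)])
    sol = np.linalg.lstsq(M, T_ab, rcond=None)[0]
    return sol[0], sol[1]                        # lam_ac, lam_bc

# Synthetic geometry: A at the origin, B at (1,0,0), C at (1,2,2), identity rotations.
T_ab = np.array([1.0, 0.0, 0.0])                 # scale fixed: |A->B| = 1
T_ac = np.array([1.0, 2.0, 2.0]) / 3.0           # unit direction, true scale 3
T_bc = np.array([0.0, 2.0, 2.0]) / np.sqrt(8.0)  # unit direction, true scale 2*sqrt(2)
lam_ac, lam_bc = chain_scales(T_ab, np.eye(3), T_bc, T_ac)
```

With noisy pair-wise estimates the three equations are no longer exactly consistent, and the least-squares solution is precisely where the accumulated errors mentioned above enter before the bundle adjustment cleans them up.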
2.3.1.2.3 Bundle Adjustment

By the previous two steps, all calibration parameters have been estimated. One standard metric to evaluate the accuracy of the estimated parameters is the reprojection error. That is, for each corresponding point pair $p_i^k$ and $p_j^k$ of cameras $C_i$ and $C_j$, compute the 3D point $P^k$ from them by triangulation, and reproject $P^k$ onto the image planes again. Let $\breve{p}_i^k$ and $\breve{p}_j^k$ be the reprojections of $P^k$ on the image planes of cameras $C_i$ and $C_j$, respectively. Then the reprojection error is defined by

  E(C_i, C_j) = \sum_k ( \|p_i^k - \breve{p}_i^k\|^2 + \|p_j^k - \breve{p}_j^k\|^2 ).   (2.11)
The goal of the non-linear optimization is to minimize this error for all cameras. That is, it optimizes the set of parameters which minimizes

  E = \sum_{C_i \neq C_j \in C} E(C_i, C_j),   (2.12)
where C is the set of cameras. This optimization is called bundle adjustment; it optimizes the calibration parameters by adjusting the bundle of light rays from each camera center to its image feature points so that corresponding rays from multiple cameras intersect each other in 3D space. In practice this non-linear optimization is done by the Levenberg–Marquardt algorithm. Furthermore, a sparse implementation of the Levenberg–Marquardt algorithm can perform better, since the Jacobian of Eq. (2.12) is significantly sparse. In addition, as pointed out by Hernandez et al. [11], modifying the camera position T has computational effects very similar to shifting the image center (u0, v0), in particular for circular camera arrangements, and hence fixing (u0, v0) throughout the optimization can perform better. One important point in implementing the extrinsic parameter estimation is the method for estimating $P^k$ from $p_i^k$ and $p_j^k$. As discussed in [10], it is not a good idea to estimate $P^k$ as the midpoint of the common perpendicular to the two rays through $p_i^k$ and $p_j^k$, since it is not projective-invariant. Instead, [10] suggests using linear triangulation methods or solving a sixth-degree polynomial.
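The linear (DLT) triangulation and the reprojection error of Eq. (2.11) can be sketched as follows. This assumes 3x4 projection matrices and pixel coordinates as NumPy arrays; the function names are ours, not the book's.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 projection matrix P."""
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

def triangulate(P1, P2, p1, p2):
    """Linear (DLT) triangulation of one corresponding point pair."""
    A = np.array([p1[0] * P1[2] - P1[0],
                  p1[1] * P1[2] - P1[1],
                  p2[0] * P2[2] - P2[0],
                  p2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)          # null-space of A gives the point
    X = Vt[-1]
    return X[:3] / X[3]

def reprojection_error(P1, P2, pts1, pts2):
    """E(C_i, C_j) of Eq. (2.11), summed over all corresponding pairs."""
    err = 0.0
    for p1, p2 in zip(pts1, pts2):
        X = triangulate(P1, P2, p1, p2)
        err += np.sum((p1 - project(P1, X)) ** 2)
        err += np.sum((p2 - project(P2, X)) ** 2)
    return err

# Synthetic stereo pair: identical intrinsics, 1 m baseline along X.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])
X_true = np.array([0.2, -0.1, 4.0])
p1, p2 = project(P1, X_true), project(P2, X_true)
E = reprojection_error(P1, P2, [p1], [p2])
```

Minimizing this error over all camera parameters, rather than just evaluating it, is what the bundle adjustment does.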
2.3.1.2.4 Global Scale and Position Adjustment

The last step of the geometric calibration is to transform the world coordinate system used for the extrinsic parameter estimation into a physical one: determine the scale parameter of the common coordinate system to which all camera coordinate systems were transformed in the extrinsic parameter estimation. One simple practical method is to measure three points $p_o$, $p_x$, and $p_y$ on the studio floor: $p_o$ defines the origin of the physical coordinate system, and the directions from $p_o$ to $p_x$ and $p_y$ define the X- and Y-axes, respectively. The Z-direction is given by the cross product of the X- and Y-directions. For example, place a chessboard designed with physical measures on the floor (Fig. 2.9).

Fig. 2.9 Global scale and position adjustment using a chessboard on the floor

Let $\{R_i, T_i\}$ $(i = 1, \ldots, N)$ ($N$: number of cameras) denote the optimal extrinsic camera parameters obtained by the bundle adjustment. Then, select the two cameras $i'$ and $i''$ $(i' \neq i'')$ which can best observe the chessboard and apply Zhang's method [33] to estimate their rotations and translations w.r.t. the floor as $\hat{R}_j$ and $\hat{T}_j$ $(j = i', i'')$. The global scale parameter is given by comparing the distance between cameras $i'$ and $i''$ in the two different coordinate systems. That is,

  \lambda = \frac{|\hat{T}_{i'} - \hat{T}_{i''}|}{|T_{i'} - T_{i''}|}   (2.13)
is the global scale parameter to be applied to the result of the bundle adjustment. Finally, in order to describe the camera positions and postures w.r.t. the floor, $\{R_i, T_i\}$ $(i = 1, \ldots, N)$ should be transformed to

  R'_i = \hat{R}_{i'} R_{i'}^\top R_i,
  T'_i = \lambda \hat{R}_{i'} R_{i'}^\top (T_i - T_{i'}) + \hat{T}_{i'},   (2.14)
which represent the positions and postures of the cameras in the physical coordinate system. With this representation, we can easily design object actions in the 3D video studio. Note that the accuracy of the above process does not affect the reconstruction accuracy of the 3D object, because it uniformly transforms all camera coordinate systems by a common rotation, translation, and scale. The accuracy of camera calibration in each 3D video studio we developed will be shown later.
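The global adjustment of Eqs. (2.13) and (2.14) amounts to applying a single similarity transform to all cameras. The sketch below is our hedged interpretation, treating $T_i$ as camera positions and $R_i$ as camera-to-world postures; the function names are hypothetical.

```python
import numpy as np

def to_floor(Rs, Ts, i1, i2, Rh, Th):
    """Map bundle-adjusted poses {R_i, T_i} into the floor frame.

    Rh[j], Th[j] (j = 0, 1) are chessboard-based poses of the two
    selected cameras i1, i2, as in Eqs. (2.13)-(2.14)."""
    lam = np.linalg.norm(Th[0] - Th[1]) / np.linalg.norm(Ts[i1] - Ts[i2])
    A = Rh[0] @ Rs[i1].T                        # common rotation
    Rs_f = [A @ R for R in Rs]
    Ts_f = [lam * A @ (T - Ts[i1]) + Th[0] for T in Ts]
    return Rs_f, Ts_f, lam

def Rz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Ground truth in the floor frame, mapped by an arbitrary similarity to
# fake the bundle-adjustment frame; to_floor should undo it exactly.
P = [np.array([2.0, 0, 1]), np.array([0, 2.0, 1]), np.array([-2.0, 0, 1])]
Q = [Rz(0.1), Rz(1.2), Rz(2.3)]
s, Rw, tw = 0.5, Rz(0.7), np.array([3.0, -1.0, 2.0])   # unknown similarity
Ts = [Rw.T @ (p - tw) / s for p in P]                  # bundle-frame positions
Rs = [Rw.T @ q for q in Q]
Rs_f, Ts_f, lam = to_floor(Rs, Ts, 0, 1, [Q[0], Q[1]], [P[0], P[1]])
```

Recovering the ground-truth poses of a camera not used for the chessboard observation (camera 2 here) confirms that the two reference cameras suffice to fix the whole similarity.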
2.3.1.3 Active Camera Calibration While all geometric parameters of static cameras are fixed, those of active cameras can be dynamically changed during video capturing. Typical controllable parameters of active cameras include pan, tilt, dolly, and zoom. While pan, tilt, and dolly controls modify only the position of the projection center geometrically, zooming changes all camera parameters including the focal length, the projection center, the lens distortion, and the image resolution, since the zoom control modifies the entire optical system configuration of a camera.
Thus, from the viewpoint of camera calibration, active cameras without zooming form a reasonable, practically usable class of active cameras; the camera calibration process is only required to estimate the position of the projection center dynamically, while the other parameters are kept fixed. In [30], we developed the fixed-viewpoint pan-tilt camera, where (1) the pan and tilt axes intersect each other and (2) the projection center is aligned at the intersection point. With this camera, the projection center is fixed during any pan-tilt control, and hence it can be calibrated just as a static camera, which greatly facilitates the development of active object tracking systems to monitor 3D motion trajectories of objects [17] as well as high-resolution panoramic image capture systems. One important technical problem when employing active cameras is the synchronization between the camera control and the image capture. That is, since these two processes usually run asynchronously, some synchronization mechanism should be introduced to associate the viewing direction of a camera with each captured image. In [16], we proposed the dynamic memory architecture to virtually synchronize asynchronous processes. With this mechanism, each captured video frame can be annotated with synchronized pan and tilt parameter values. Note that pan and tilt values obtained from the camera controller are not accurate enough to be used as calibration parameters, and hence the ordinary camera calibration should be done using them as initial estimates. The calibration of active cameras, except for the fixed-viewpoint pan-tilt camera, involves many difficult technical problems including the camera model itself, and hence its accuracy is limited. We will discuss them in Chap. 3 in detail.
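The dynamic memory architecture of [16] is beyond this sketch, but its core effect, annotating each captured frame with pan/tilt values valid at its timestamp, can be approximated by interpolating the asynchronous control log at the frame times. The values below (frame rate, log samples) are illustrative assumptions.

```python
import numpy as np

# Asynchronous pan-angle log from the camera controller: (time [s], pan [deg]).
log_t   = np.array([0.00, 0.12, 0.25, 0.40, 0.55])
log_pan = np.array([10.0, 11.0, 13.0, 16.0, 18.0])

# Frame timestamps from the capture process (30 fps is an assumption).
frame_t = np.arange(0.0, 0.5, 1.0 / 30.0)

# Virtually synchronize: estimate the pan angle at each frame time by
# linear interpolation between the two nearest control samples.
frame_pan = np.interp(frame_t, log_t, log_pan)
```

As stated above, such interpolated values would only serve as initial estimates for an ordinary calibration, not as final calibration parameters.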
2.3.2 Photometric Calibration

A camera records the light flux converging to its projection center as a 2D array of pixel intensity values. While the geometric calibration models the geometric aspects of this imaging process, the light flux has photometric characteristics such as color (i.e. wavelength of light) and power (i.e. irradiance), which are also transformed through the imaging process. The goal of the photometric calibration is to rectify the photometric transformations introduced by a camera. Here, we consider the following two practical characteristics for the photometric calibration.

Gain: The gain defines the transformation from incident light intensities to image pixel values. First of all, to use cameras as physical sensors, γ correction should be done to make this transformation linear; most cameras transform incident light intensities nonlinearly into image pixel values to make captured images look natural on displays or printed paper. Since ordinary color cameras employ the RGB decomposition of incident light to record RGB image intensity values for each pixel, the gain is defined for each color channel. Then, the adjustment of RGB gains, which is called color balance
or white balance, should be done to capture high-fidelity color images. Moreover, image sensor sensitivity and electronic circuit characteristics vary from camera to camera even if they are of the same type, which makes color calibration of multi-camera systems much harder.

Vignetting: Ordinary lens systems introduce vignetting: central areas of an image become brighter than peripheral areas. That is, the latter receive less light than the former due to (1) multiple optical elements in a lens system (optical vignetting) and (2) the angle of incoming light (natural vignetting by the cosine-fourth law). Compared to color calibration, vignetting elimination is rather easy if the lens parameters are not dynamically changed.

In multi-camera systems, each camera observes a different part of the scene from a different viewpoint. This means that lighting environments vary from camera to camera. To calibrate lighting environments in a 3D video studio, 3D distributions of light sources and inter-reflections in the studio have to be modeled. These will be discussed in Chap. 6. In this section, we assume we can prepare uniform lighting environments for the photometric calibration and present two practical photometric calibration methods for multi-camera systems: relative and absolute methods. The former normalizes photometric characteristics to be shared by all cameras, while the latter establishes their transformations to standard ones defined by reference data.
2.3.2.1 Relative Multi-camera Photometric Calibration

A standard idea of gain and vignetting correction is to measure a specified point in the scene with different pixels of an image sensor by moving a camera: align the point at central and then peripheral image pixels one by one, and estimate the parameters of a vignetting model. Kim and Pollefeys [14] proposed a method which estimates vignetting parameters from overlapped image areas in a patch-worked panoramic image. This method is well suited to mobile cameras and can calibrate the spatial gain bias and vignetting of single-camera systems. For multi-camera systems, we proposed an idea of object-oriented color calibration in [21]. The idea is to optimize the vignetting and gain parameters of the cameras so as to minimize the observed color differences of a specified 3D object surface. The following process is applied to each color channel independently. Let $p$ denote an identifiable point on the 3D object surface and $p_{C_i}$ the pixel representing the projection of $p$ on the camera $C_i$ image plane. Then, the ideal intensity value $l$ at $p_{C_i}$ is transformed first by a simplified Kang-and-Weiss model [34] representing the lens vignetting:

  l' = \frac{1 - a r}{(1 + (r/f)^2)^2} \, l,   (2.15)
where $r$ denotes the distance from the image center $(u_0, v_0)$ to $p_{C_i}$, and $f$ and $a$ denote the vignetting parameters. Then the intensity is transformed by the gain adjustment
Fig. 2.10 (a) Originally captured multi-view images, (b) photometrically calibrated multi-view images. ©2009 IPSJ [22]
process as follows, assuming the γ correction has already been done:

  l'' = \alpha l' + \beta,   (2.16)
where α and β denote the scale and bias factors. Reversing these transformations, the ideal intensity can be estimated from the observed intensity:
(l − β)(1 + (r/f )2 )2 l = F l = . α(1 − ar)
(2.17)
Then, the goodness of the gain and vignetting parameters for $p$ can be evaluated by

  E(p) = \mathrm{VAR}\{ F_{C_i}(I_{C_i}(p_{C_i})) \},   (2.18)
where $C_i$ denotes a camera which can observe $p$ without occlusion, $I_{C_i}(p_{C_i})$ the observed intensity at $p_{C_i}$, $F_{C_i}$ the function defined in Eq. (2.17) for $C_i$, and $\mathrm{VAR}\{\cdot\}$ the function computing the variance. Note that $p$ should be on a Lambertian surface, because its radiance should be independent of the viewing angles of the $C_i$s. Let $P$ denote a set of Lambertian surface points. Then, apply the Levenberg–Marquardt method to estimate the optimal gain and vignetting parameters which minimize the following objective function:

  E = \sum_{p \in P} E(p).   (2.19)
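The correction chain (2.15)-(2.17) and the variance objective (2.18)-(2.19) can be sketched as follows. A real implementation would feed `E` to a Levenberg–Marquardt solver, which is omitted here; all names and the synthetic parameter values are ours.

```python
import numpy as np

def vignette_gain(l, r, f, a, alpha, beta):
    """Forward model: Eq. (2.15) followed by Eq. (2.16)."""
    l1 = (1.0 - a * r) / (1.0 + (r / f) ** 2) ** 2 * l
    return alpha * l1 + beta

def correct(l2, r, f, a, alpha, beta):
    """Inverse transform F of Eq. (2.17): observed -> ideal intensity."""
    return (l2 - beta) * (1.0 + (r / f) ** 2) ** 2 / (alpha * (1.0 - a * r))

def objective(observations, params):
    """E of Eq. (2.19): sum over points of the per-point variance (2.18).

    observations[p] = list of (camera index, radius r, observed intensity);
    params[c] = (f, a, alpha, beta) of camera c."""
    E = 0.0
    for obs in observations.values():
        vals = [correct(l2, r, *params[c]) for (c, r, l2) in obs]
        E += np.var(vals)
    return E

# Two hypothetical cameras observing two Lambertian points (l = 0.8, 0.5).
params = {0: (600.0, 1e-4, 1.1, 3.0), 1: (550.0, 2e-4, 0.9, -2.0)}
observations = {
    "p1": [(c, r, vignette_gain(0.8, r, *params[c])) for c, r in [(0, 100.0), (1, 250.0)]],
    "p2": [(c, r, vignette_gain(0.5, r, *params[c])) for c, r in [(0, 300.0), (1, 50.0)]],
}
E_true = objective(observations, params)   # vanishes at the true parameters
```

The objective is zero exactly when all cameras agree on the corrected intensity of every Lambertian point, which is what the relative calibration enforces.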
Figure 2.10 shows the result of the photometric calibration of multi-view images. Figure 2.11 demonstrates that variations in the photometric characteristics of uncalibrated cameras can introduce visible artifacts in images rendered from 3D video. Here the simplest view-independent texture generation method of Sect. 5.3 is used to demonstrate the color differences across the original images. Notice that the relative photometric calibration normalizes the photometric characteristics of the multi-view cameras so that multi-view observations of a 3D surface point give the same pixel intensity value. Hence it does not guarantee that the calibrated color is the "true" color of the object.
Fig. 2.11 Textures generated from Fig. 2.10(a) and (b), respectively. The red arrows indicate texture boundaries introduced by photometric characteristics variations among the cameras. ©2009 IPSJ [22].
Fig. 2.12 Macbeth color checker. The triplet of hexadecimal values attached to each color patch denotes approximated 8-bit RGB values [24]
2.3.2.2 Absolute Multi-camera Photometric Calibration

Assuming that the vignetting calibration is done, the absolute color calibration adjusts the RGB color channel gains of a camera so that the RGB values for reference color patterns coincide with predefined standard responses. Figure 2.12 shows a well-known color pattern called the Macbeth color checker, where each color patch is associated with predefined standard RGB values [24]. The color calibration with a standard color pattern also requires standard lighting environments: the pattern should be uniformly lit by a standard light source such as those defined by ISO/IEC standards.

As is well known, since RGB values denote spectral integrals, the accuracy of the above-mentioned RGB-based color calibration is limited. Thus, physics-based color calibration should be employed to attain truly absolute color calibration: estimate the spectral filtering characteristics of the RGB channels from a reference pattern and a light source whose spectral radiance and radiant characteristics are known, respectively. To evaluate the usability of standard color samples, such as the Munsell standard colors, in the physics-based color calibration, we measured the spectral radiance of 1,016 color samples lit by a standard light, where the spectral characteristics of each color sample are represented by 176 radiance intensity values from 380 nm to 730 nm with a 2 nm sampling pitch. Then, we computed the major principal components. Table 2.4 shows the eigenvalues and residual errors for the 16 major principal components.

Table 2.4 Dimensionality reduction of Macbeth colors by PCA

# of principal components   Eigenvalue    Approx. error
 0                          –             100.000
 1                          2.1544e−02     18.011
 2                          5.1743e−04      9.592
 3                          1.3787e−04      5.486
 4                          4.3802e−05      3.228
 5                          1.1529e−05      2.290
 6                          4.7269e−06      1.766
 7                          3.1202e−06      1.311
 8                          1.7696e−06      0.961
 9                          7.4854e−07      0.766
10                          4.4186e−07      0.624
11                          2.6615e−07      0.519
12                          1.9256e−07      0.428
13                          1.6722e−07      0.329
14                          8.3086e−08      0.269
15                          5.5517e−08      0.218
16                          3.2762e−08      0.182

From these results, we can observe that the spectral characteristics of the Munsell color samples can be represented by only a few major spectral bases. This implies that detailed spectral characteristics of cameras and lighting environments cannot be estimated with such color samples; the dimension of the spectral characteristic space is degenerate. To estimate the spectral characteristics of cameras, we need to utilize additional reference measurements given by special optical systems such as a spectrometer [19], a multi-spectral camera [18], or a hyper-spectral sensor [7]. These techniques play an important role in the digital archiving of cultural assets such as ancient tapestries, statues, etc. In addition, knowledge about the spectral characteristics of reference objects can help to calibrate such sensors. ISO/TR 16066:2003 [12] provides spectral color data of more than 50 thousand common objects, as well as their reflectance and transmittance characteristics, in order to calibrate the spectral response of image sensors.

While the absolute photometric calibration can be conducted for each camera independently before installation in a 3D video studio, the lighting environments of the studio should be estimated to obtain calibrated RGB values. As will be discussed in Chap. 6, the lighting environment estimation itself involves difficult problems. In particular, it would be almost impossible to estimate the 3D spatial distribution of the detailed spectral characteristics of the lighting environments, because an object in motion disturbs the lighting environments by its shadows as well as by inter-reflections with the background scene. In summary, a practical method for multi-camera photometric calibration is to employ the relative multi-camera photometric calibration and then normalize the RGB values based on the RGB responses of an absolutely calibrated camera.
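The PCA analysis behind Table 2.4 can be reproduced in shape (not in values, which require the measured spectra) as follows. The synthetic spectra here are an assumption: they are built from only three smooth bases, mimicking the degeneracy observed for the Munsell samples.

```python
import numpy as np

def pca_residuals(X, max_k):
    """Residual reconstruction error (%) of X using the top-k singular
    vectors, for k = 0..max_k, without centering (k = 0 gives 100%)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    total = np.linalg.norm(X)
    out = []
    for k in range(max_k + 1):
        Xk = (U[:, :k] * S[:k]) @ Vt[:k]       # rank-k approximation
        out.append(100.0 * np.linalg.norm(X - Xk) / total)
    return out

# Synthetic "color samples": 1016 spectra over 176 bands (380-730 nm at
# a 2 nm pitch), generated from just 3 smooth Gaussian bases.
rng = np.random.default_rng(0)
wl = np.linspace(380, 730, 176)
bases = np.stack([np.exp(-((wl - c) / 80.0) ** 2) for c in (420, 550, 680)])
X = np.abs(rng.random((1016, 3))) @ bases
res = pca_residuals(X, 6)
```

Because the synthetic data have rank 3, the residual error collapses to (numerically) zero at three components, which is the degeneracy the text describes: such samples cannot constrain the remaining spectral dimensions of a camera.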
Fig. 2.13 Multi-view videos captured by the three studios. Each line of the subtitles shows the title, captured length, and feature, respectively
2.4 Performance Evaluation of 3D Video Studios

Figure 2.13 shows multi-view videos captured by the three studios at Kyoto University described in Sect. 2.2.7. Each of them has different features, such as active tracking, complex and non-rigid object shapes, and complex motions. They will be used as input for our 3D video production algorithms described in the following chapters.
Table 2.5 Performances of the three studios described in Sect. 2.2.7

                                 Studio A   Studio B   Studio C
Capture space (m)
  Cylinder diameter              3.0        3.0        4.0
  Cylinder height                2.0        2.2        2.2
Effective resolution (mm/pix)    3.9        2.0        2.0
Calibration accuracy (mm)        4.3        2.4        3.4
Table 2.5 reports the performance measures achieved in the three studios. The capture spaces are approximated by cylinders in which the requirements for 3D video production are satisfied; the table lists the diameters and heights of these cylinders. As in most 3D video studios, the object movable space is severely limited to guarantee high object surface observability. The effective resolution denotes the average physical distance between two neighboring pixels at the center of the capture space. The calibration accuracy is computed as the average 3D distance between a pair of rays cast from a pair of corresponding points in different views. The accuracy in 2D, that is, the reprojection errors of corresponding points, is at sub-pixel level in all cases. The lower resolution and accuracy of Studio A can be ascribed to its lower camera resolution (VGA); Studio A was developed for tracking and multi-view object observation with pan/tilt/zoom active cameras. Studio C, on the other hand, was designed to realize a wider object movable space with almost the same number of cameras as Studio B. To this end, the field of view was increased by employing a larger imager (1/1.8 inch) as well as improving the camera resolution (UXGA). With these designs, the effective resolution of Studio C attained the same level as that of Studio B, while the calibration accuracy was degraded due to the enlarged capture area. In summary, to enlarge the capture space as well as improve the effective resolution and calibration accuracy, we need to increase the number of cameras or employ active pan/tilt/zoom cameras. This problem is discussed in the next chapter.
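The 3D calibration-accuracy metric above, the shortest distance between the two rays cast from a corresponding point pair, can be computed with the standard closest-distance formula for two lines. This is a generic sketch, not the studios' actual code.

```python
import numpy as np

def ray_distance(o1, d1, o2, d2, eps=1e-12):
    """Shortest distance between two 3D lines (origin o, direction d)."""
    n = np.cross(d1, d2)
    nn = np.linalg.norm(n)
    if nn < eps:                      # (nearly) parallel rays
        v = o2 - o1
        return np.linalg.norm(v - np.dot(v, d1) / np.dot(d1, d1) * d1)
    return abs(np.dot(o2 - o1, n)) / nn

# Two skew rays: along X through the origin, and along Y at height 1.
d = ray_distance(np.array([0.0, 0, 0]), np.array([1.0, 0, 0]),
                 np.array([0.0, 0, 1]), np.array([0.0, 1, 0]))
```

Averaging this distance over all corresponding point pairs, after the global scale of Sect. 2.3.1.2.4 has been applied, yields millimeter figures like those in Table 2.5.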
2.5 Conclusion

This chapter discussed the design factors of multi-camera systems for 3D video studios and introduced our three implementations. While advanced imaging devices, computers, and computer vision technologies make it rather easy to implement a 3D video studio, many problems remain in (1) camera selection and arrangement to guarantee multi-view observability of an object in motion, (2) geometric and photometric camera calibration to realize the "seamless" integration of multi-view video data, and (3) the design and calibration of lighting environments. These are crucial requirements for the successful 3D video production of Part II. As noted at the beginning of this chapter, we designed and implemented our 3D video studios with off-the-shelf cameras and lenses. Specially developed cameras
such as 4K and 8K cameras with professional lenses will improve the performance measures of 3D video studios shown in Table 2.5, while algorithms and technologies to solve problems (1) and (2) above are left for future studies. The second generation of 3D video studios will employ new imaging sensors such as time-of-flight cameras or active-stereo systems to directly obtain 2.5D video data. Their calibration, synchronization, and data integration with ordinary video cameras will require the development of new technologies. Similarly, it would be another interesting augmentation of 3D video studios to introduce audio capturing devices such as microphone arrays for recording 3D acoustic environments. To integrate 3D visual and acoustic scenes, cross-media synchronization and calibration methods should be developed.
References

1. Batlle, J., Mouaddib, E., Salvi, J.: Recent progress in coded structured light as a technique to solve the correspondence problem: a survey. Pattern Recognit. 31(7), 963–982 (1998)
2. Bayer, B.E.: US Patent 3971065: Color imaging array (1976)
3. Bouguet, J.-Y.: Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/
4. Bradski, G.: The OpenCV Library (2000). http://opencv.willowgarage.com
5. Virtualizing Engine. Private communication with Profs. Takeo Kanade and Yaser Sheikh, Robotics Institute, Carnegie Mellon University, PA (2011)
6. Fitzgibbon, A.W., Zisserman, A.: Automatic camera recovery for closed or open image sequences. In: Proc. of European Conference on Computer Vision, pp. 311–326 (1998)
7. Gevers, T., Stokman, H.M.G., van de Weijer, J.: Colour constancy from hyper-spectral data. In: Proc. of British Machine Vision Conference (2000)
8. Goldlüecke, B., Magnor, M.: Joint 3D-reconstruction and background separation in multiple views using graph cuts. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 683–688 (2003)
9. Guillemaut, J.Y., Hilton, A., Starck, J., Kilner, J., Grau, O.: A Bayesian framework for simultaneous matting and 3D reconstruction. In: Proc. of International Conference on 3-D Digital Imaging and Modeling, pp. 167–176 (2007)
10. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
11. Hernandez, C., Schmitt, F., Cipolla, R.: Silhouette coherence for camera calibration under circular motion. IEEE Trans. Pattern Anal. Mach. Intell. 29(2), 343–349 (2007)
12. ISO/TR 16066: Standard Object Colour Spectra Database for Colour Reproduction Evaluation (SOCS) (2003)
13. Ivanov, Y., Bobick, A., Liu, J.: Fast lighting independent background subtraction. Int. J. Comput. Vis. 37(2), 199–207 (2000)
14. Kim, S.J., Pollefeys, M.: Robust radiometric calibration and vignetting correction. IEEE Trans. Pattern Anal. Mach. Intell. 30(4), 562–576 (2008)
15. Levoy, M., Hanrahan, P.: Light field rendering. In: Proc. of ACM SIGGRAPH, pp. 31–42 (1996)
16. Matsuyama, T., Hiura, S., Wada, T., Murase, K., Toshioka, A.: Dynamic memory: architecture for real time integration of visual perception, camera action, and network communication. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 728–735 (2000)
17. Matsuyama, T., Ukita, N.: Real-time multitarget tracking by a cooperative distributed vision system. Proc. IEEE 90(7), 1136–1150 (2002)
18. Miyake, Y., Yokoyama, Y., Tsumura, N., Haneishi, H., Miyata, K., Hayashi, J.: Development of multiband color imaging systems for recordings of art paintings. In: Proc. of SPIE, pp. 218–225 (1998)
19. Morimoto, T., Mihashi, T., Ikeuchi, K.: Color restoration method based on spectral information using normalized cut. Int. J. Autom. Comput. 5, 226–233 (2008)
20. Nister, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 756–770 (2004)
21. Nobuhara, S., Kimura, Y., Matsuyama, T.: Object-oriented color calibration of multi-viewpoint cameras in sparse and convergent arrangement. IPSJ Trans. Comput. Vis. Appl. 2, 132–144 (2010)
22. Nobuhara, S., Tsuda, Y., Ohama, I., Matsuyama, T.: Multi-viewpoint silhouette extraction with 3D context-aware error detection, correction, and shadow suppression. IPSJ Trans. Comput. Vis. Appl. 1, 242–259 (2009)
23. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 15(1), 353–363 (1993)
24. Pascale, D.: RGB coordinates of the ColorChecker (2006). http://www.babelcolor.com/main_level/ColorChecker.htm
25. PMDTechnologies GmbH: CamCube3.0 (2010)
26. Salvi, J., Pagès, J., Batlle, J.: Pattern codification strategies in structured light systems. Pattern Recognit. 37(4), 827–849 (2004)
27. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, 7–42 (2002)
28. Starck, J., Maki, A., Nobuhara, S., Hilton, A., Matsuyama, T.: The multiple-camera 3-d production studio. IEEE Trans. Circuits Syst. Video Technol. 19(6), 856–869 (2009)
29. Toyoura, M., Iiyama, M., Kakusho, K., Minoh, M.: Silhouette extraction with random pattern backgrounds for the volume intersection method. In: Proc. of International Conference on 3-D Digital Imaging and Modeling, pp. 225–232 (2007)
30. Wada, T., Matsuyama, T.: Appearance sphere: Background model for pan-tilt-zoom camera. In: Proc. of International Conference on Pattern Recognition, pp. A-718–A-722 (1996)
31. Yamaguchi, T., Wilburn, B., Ofek, E.: Video-based modeling of dynamic hair. In: Proc. of PSIVT, pp. 585–596 (2009)
32. Zeng, G., Quan, L.: Silhouette extraction from multiple images of an unknown background. In: Proc. of Asian Conference on Computer Vision, pp. 628–633 (2004)
33. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
34. Zheng, Y., Yu, J., Kang, S., Lin, S., Kambhamettu, C.: Single-image vignetting correction using radial gradient symmetry. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
Chapter 3
Active Camera System for Object Tracking and Multi-view Observation
3.1 Introduction

As discussed in the previous chapter, most 3D video studios developed so far employ a group of static cameras, and hence the object's movable space is rather strictly constrained to guarantee high-resolution, well-focused multi-view object observation. This chapter presents a multi-view video capture system with a group of active cameras, which cooperatively track an object moving in a wide area to capture high-resolution, well-focused multi-view video data.
3.1.1 Fundamental Requirements for Multi-view Object Observation for 3D Video Production

In general, the multi-view object observation for 3D video production should satisfy the following basic requirements.

Requirement 1: Accurate camera calibration,
Requirement 2: Full visual coverage of the object surface, and
Requirement 3: High spatial image resolution.

Requirement 1 is crucial to reconstruct the 3D shape, motion, and surface texture of an object in motion. It includes both geometric and photometric camera calibration, as discussed in the previous chapter. From a practical point of view, large efforts and computations are required to calibrate a large number of cameras, and the calibration must be repeated when cameras are misaligned by hard object actions or replaced due to malfunctions. Thus, it is too naive an idea to introduce a huge number of cameras to satisfy requirements 2 and 3. The second requirement means that every point on the object surface must be observed by at least two cameras to estimate its 3D position by shape-from-stereo methods. Obviously, this requirement is very hard or often impossible to satisfy for an object in action; objects performing complex actions like Yoga or wearing
loose clothes like MAIKOs introduce heavy self-occlusions, which prevent object surface areas from being observed by the cameras. Moreover, action scenes with multiple performers continuously introduce mutual occlusions, which significantly limit the observability. Note that, as will be discussed in Chap. 4, 3D shape reconstruction methods can estimate, i.e. interpolate, the 3D object surface shape even for those parts which are not observable from any camera. In practice, therefore, it becomes an important design factor of 3D video studios to configure the group of cameras so that the object surface observability is maximized. The last requirement is very important to make 3D video as attractive a visual medium as ordinary 2D video. Besides employing high-resolution cameras, zoom-up video capture can increase the spatial resolution. However, since it limits the observable object surface area, the trade-off with the second requirement must be solved in the 3D video studio design. These three requirements set the design principles of 3D video studio development and constrain the capture space where an object can move. This is mainly because the mutually contradictory requirements 2 and 3 should be compromised with a limited number of cameras. In fact, as shown in Tables 2.3 and 2.5 and Fig. 2.4, the capture spaces of our 3D video studios are constrained within rather small 3D spaces, which prevents 3D video production of sports actions like ice skating and gymnastics.
3.1.2 Multi-view Video Capture for a Wide Area

There are two possible technical solutions to enlarge the capture space: (1) increase the number of cameras, or (2) employ active cameras which track the object dynamically.
3.1.2.1 Space Partitioning with a Group of Camera Rings The former is a simple solution, but it is less cost-effective in both equipment and calibration labor. For example, suppose a 3D video studio is equipped with N static cameras arranged along a ring as discussed in Sect. 2.2.1. Its capture space is defined by the intersection among the fields of view of the cameras (Fig. 3.1(a)). Then, by doubling the camera number to generate a pair of camera rings and placing the rings as illustrated in Fig. 3.1(b), we can easily enlarge the capture space. Here, it should be noted that neighboring capture spaces should have overlapping spaces to guarantee the following. • The consistent global camera calibration; by placing reference patterns for the extrinsic camera parameter calibration in the overlapping spaces, the camera calibration within each ring can be integrated. That is, both intra-ring and inter-ring camera calibration should be done by making use of the overlapping spaces.
Fig. 3.1 Capture space of multi-camera systems. (a) Capture space by the ring camera arrangement (Sect. 2.2.1). (b) Doubled capture space with a pair of camera rings
• The continuous observation of a moving object: multi-view video data captured in one ring should be seamlessly integrated with those captured in another ring based on the overlapping space between the two rings.
With this system architecture, however, enlarging the capture space M times requires at least M times more cameras and, moreover, M intra-ring and M(M − 1)/2 inter-ring calibration efforts, while N × (M − 1) cameras are left unused for object video capture when a single object performs actions in the studio.
3.1.2.2 Object Tracking with Active Cameras The introduction of object tracking with active cameras [20] would be a more costeffective system design. That is, pan/tilt/dolly/zoom camera controls can solve the trade-off between requirements 2 and 3 even with a limited number of cameras; by controlling camera positions and postures according to the object motion, all cameras can capture multi-view object video as designed. In fact, while in 2D, camera works in cinema productions enable zoom-up video capture of a person during his/her large space motion. However, in such active multi-camera systems requirement 1 above for accurate camera calibration becomes hard to satisfy: • Since pan/tilt/dolly/zoom controls are conducted by mechanical devices, their accuracies are limited compared to the pixel level accuracy for 3D shape reconstruction and texture generation. • Since camera position and posture parameters should be obtained well synchronized with video frame capture timing, accurate synchronization mechanisms between video capture and camera control are required. Although it is not impossible to solve these problems by developing sophisticated pan/tilt/dolly/zoom mechanisms, it would be better to develop active camera calibration methods from observed video data; special devices cost much and prevent people from producing 3D video in everyday life environments.
Active Camera System for Object Tracking and Multi-view Observation
Recall that the fixed-viewpoint pan/tilt camera described in Sect. 2.3.1.3 can be calibrated just as a static camera. However, it is hard to augment it with zoom control; the pan and tilt axes should be adaptively shifted so that their intersection point stays aligned with the projection center, which moves dynamically depending on the zooming factor. Moreover, such a camera cannot change its viewpoint to increase the observability of the object surface. Thus we have to develop calibration methods for general pan/tilt/dolly/zoom cameras. Two approaches to active camera calibration have been proposed: model-based and image-based calibration.

3.1.2.2.1 Model-Based Active Camera Calibration

Advanced studies on the modeling and calibration of active cameras have been reported [1, 6]. As a realistic pan/tilt/zoom unit model, Jain et al. [5] proposed a calibration method in which the pan and tilt axes can be located at arbitrary positions. In addition, the projection center can move according to zooming. In practice, however, the method requires the exact pan/tilt/zoom values at each image acquisition timing, and only allows prefixed zooming factors. While the pan/tilt/dolly control modifies the position and posture of a camera, the zoom control modifies all of the intrinsic, lens distortion, and extrinsic parameters. Moreover, complex optical mechanisms are employed to control zooming, and their accurate mathematical modeling itself is rather difficult. For example, Lavest et al. [7–9] proposed a zoom lens calibration method based on a thick lens model rather than a pinhole camera model, and developed a 3D shape reconstruction method by zooming; since zooming shifts the projection center, multi-zooming can realize multiple viewpoints for shape-from-stereo. Sarkis et al. [15] proposed an example-based calibration method, which interpolates arbitrary intrinsic parameters from neighboring real measurements based on a moving least squares approximation.
3.1.2.2.2 Image-Based Active Camera Calibration

Image-based active camera calibration has been widely studied as self-calibration of moving cameras [2, 3, 11–14, 19] and image mosaicing [18, 21]. In [14], Pollefeys et al. proved that the calibration of the intrinsic and extrinsic parameters except the skew parameter is possible from multiple video frames captured by a moving camera. In [18], Sinha and Pollefeys proposed a fixed-viewpoint pan/tilt/zoom and lens-distortion calibration method from images taken under different pan/tilt/zoom parameters. One important assumption in these approaches is that there exist a sufficient number of static feature points trackable over multiple frames, or static overlapping areas across images taken under different pan/tilt/zoom parameters. In 3D video studio environments, these static features or image areas should be taken from the background scene, since the object is moving. However, we are then faced with the following problems.
• While the off-line calibration can use well designed and uniformly placed markers, the active calibration should extract them dynamically from the background scene. As discussed in Sect. 2.2.6, however, the background of 3D video studios should be made as uniform as possible to facilitate object silhouette extraction as well as lighting environment estimation.
• As pointed out in the previous chapter, the calibration should be done by minimizing reprojection errors in the local area where the object surface exists. Calibration using background features does not satisfy this requirement and hence will introduce larger reprojection errors on the foreground object surface.
• The main task of active cameras installed in a 3D video studio is to capture high resolution multi-view object images by zooming up on the object. As a result, background areas in captured images are limited due to occlusion by the object. Thus it is hard to observe a sufficient number of uniformly distributed background features for calibration.
One idea to avoid these problems is to employ features on the object surface for the camera calibration. That is, first, background features provide an initial guess of the calibration parameters for each active camera, and then object surface features are used to determine 2D-to-2D correspondences across cameras. Finally, reprojection errors at such 2D-to-2D correspondences are minimized by a non-linear optimization. While we developed a pan/tilt camera calibration method based on this idea, its reprojection errors stayed more than two times larger than those of the off-line calibration. The reasons are:
• In the off-line calibration, positions of feature points can be estimated at sub-pixel accuracy based on knowledge about the calibration patterns, such as a regular chessboard or a sphere, which, moreover, increases the accuracy of their matching across cameras.
• Texture patterns on the object surface are not known and their geometry is not planar, i.e. not affine invariant in general. As a consequence, the accuracy of their detection and matching across cameras is limited.
• While calibration patterns can be placed to uniformly cover the capture space, the distribution of object surface features is usually biased.
In short, whichever of the model-based or image-based methods is employed, it is hard to make their calibration accuracy comparable to that of the off-line calibration.
3.1.2.2.3 Requirement for Camera Control

Besides the camera calibration, the most distinguishing problem in employing active cameras for 3D video studios is that the following new requirement should be satisfied.
Requirement 4: Track a moving object in real time while satisfying requirements 2 and 3.
In [20], we developed a cooperative multi-target tracking system with a group of active cameras to detect, track, and compute 3D object motion trajectories in a room. However, since its objective was multi-target tracking, the accuracy of the camera calibration was limited and, moreover, requirements 2 and 3 were not taken into account in the system design. Hence, it cannot capture multi-view object video data usable for 3D video production.
As noted in Sect. 2.3.1.3, the active camera control requires sophisticated real-time processing methods to continuously track an object whose motion is not known a priori. These include fast computation, prediction of the object motion, modeling of active camera dynamics, and real-time coordination among multiple cameras. In other words, the 3D video studio design with active cameras requires integrated analysis methods based on not only geometry and photometry but also dynamics, which we believe is a challenging problem that opens a new frontier of computer vision technologies.
In summary, while multi-view video capture of a wide area will enable new applications of 3D video in sports like ice skating and gymnastics, its realization is a challenging issue. This chapter introduces a novel idea named cell-based object tracking and presents a 3D video studio with active cameras which can capture high resolution well-focused multi-view video data of an object moving in a wide area. The novelty of the idea rests in the integration of both space partitioning with a group of camera rings and object tracking with active cameras to satisfy all four requirements.
Section 3.2 introduces the concept of the cell-based object tracking, followed by the practical algorithm of the cell-based multi-view video capture for 3D video production in Sect. 3.3, where we derive algebraic constraints on studio design factors to satisfy the four basic requirements specified before. Section 3.4 evaluates the performance of the algorithm with synthesized and real data. To prove the practical utility of the cell-based object tracking in the real world, Sect. 3.5 designs a multi-view video capture system for 3D video production of ice skating. Section 3.6 concludes the chapter with future studies.
3.2 Cell-Based Object Tracking and Multi-view Observation

3.2.1 Problem Specifications and Assumptions

The following problem specifications and assumptions are employed:
1. A single object of roughly known 3D size moves freely on the flat floor of a scene of known size. Note that the flat floor assumption is used just for simplicity, and the presented algorithm can be applied to a flying object that moves freely in 3D space. In this chapter, we assume that the 3D shape of the object can be modeled by its bounding cylinder of radius r and height h.
2. No a priori knowledge about the object motion except its maximum velocity is given.
3. Requirement 3 about image resolution is specified by the lowest allowable resolution. 4. The active cameras employed are PTZ (pan/tilt/zoom) cameras without the dolly control. Thus the movements of their projection centers are limited. 5. The cameras are arranged to surround the scene uniformly, which guarantees the high observability specified by requirement 2. The distance of the camera arrangement from the scene and the zoom control are mutually dependent and will be designed by the studio design algorithm to satisfy requirement 3. 6. The developed cell-based object tracking algorithm enables the active cameras to track the object continuously and capture well-calibrated multi-view video data satisfying requirements 1 and 4, in addition to requirements 2 and 3.
3.2.2 Basic Scheme of the Cell-Based Object Tracking and Multi-view Observation

In [22], we proposed a novel idea of cell-based object tracking and multi-view observation for 3D video production. Its objective is to capture high resolution well-focused multi-view video data of an object moving in a wide area with a limited number of active cameras. Figure 3.2 illustrates its basic scheme.
Step 1: Cell formation Partition the object movable space into a set of regular disjoint subspaces named cells (cf. Fig. 3.2(a)).
Step 2: Camera parameter determination For each camera, determine its pan/tilt/zoom parameters to best observe each cell (cf. Fig. 3.2(b)). That is, the cameras do not observe the object but the cells where the object moves.
Step 3: Camera calibration With the pan/tilt/zoom parameters for each cell, conduct the static camera calibration using a reference pattern located in each cell (cf. Fig. 3.2(c)).
Step 4: Object tracking While tracking and capturing multi-view video of an object, some cameras observe the object in cell A with the pan/tilt/zoom parameters predetermined for cell A, and others observe its neighboring cell(s) B with the pan/tilt/zoom parameters predetermined for cell B. Depending on the object motion, cameras for the multi-view object observation are dynamically switched from one group to another (Fig. 3.2(d)).
The ideas behind this scheme are:
• The calibration of each active camera is conducted by fixing its pan/tilt/zoom parameters for each cell, which allows us to employ ordinary calibration methods for static cameras with high accuracy. Thus, requirements 1, 2, and 3 can be satisfied by assigning an appropriate group of cameras to each cell.
• The dynamic control of each active camera is conducted on a cell-by-cell basis. That is, it can be done by computing which cell each camera should observe now and next, rather than by tracking the object continuously. Thus, requirement 4 can be satisfied together with requirements 1, 2, and 3.
Fig. 3.2 Basic scheme of the cell-based object tracking and multi-view observation
• Video data captured during camera motions are not used for 3D video production, because they are not calibrated. That is, only when an active camera stably fixes its gaze on a cell containing the object do its video data become calibrated and usable for 3D video production.
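The per-frame behavior of Step 4 and the last bullet can be sketched as a small control routine. This is a hypothetical illustration, not the authors' implementation: cells are represented by their center coordinates, each camera gazes at one cell via a fixed PTZ preset, and frames from cameras not yet gazing at the object's cell are discarded as uncalibrated.

```python
import math

# Minimal sketch of the cell-based control loop (Steps 1-4). All names here
# are illustrative stand-ins, not from the chapter: each camera either already
# gazes at its target cell (its frame is calibrated and kept) or must be
# redirected (its frames during the motion are dropped).

def nearest_cell(cells, pos):
    """Step 4: the object is attributed to the cell whose center is closest."""
    return min(cells, key=lambda c: math.dist(c, pos))

def control_step(cells, camera_targets, obj_pos):
    """One control tick: report usable frames, then retarget the cameras.

    camera_targets: dict camera_id -> cell center currently gazed at (PTZ
    preset fixed for that cell, hence statically calibrated). Returns the
    cameras whose current frame can be used for 3D video production.
    """
    active_cell = nearest_cell(cells, obj_pos)
    usable = [cam for cam, cell in camera_targets.items() if cell == active_cell]
    # Cameras gazing elsewhere are redirected to the active cell; their
    # frames while moving are not calibrated and hence not used.
    for cam in camera_targets:
        camera_targets[cam] = active_cell
    return usable
```

For example, with two cells and two cameras, only the camera already gazing at the object's cell contributes a usable frame on the first tick; both cameras gaze at it afterwards.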
3.2.3 Design Factors for Implementation

To implement a practical algorithm, the following three design factors should be fixed.
Space partitioning and cell arrangement: When an approximate object motion trajectory is known a priori, cells should be arranged to cover the trajectory. The system presented in [22] used this knowledge and aligned a sequence of cells along the trajectory. Without such knowledge, here, it is reasonable to partition
Fig. 3.3 Hexagonal cell partitioning and cell-based camera control rule (see text)
the object movable 2D floor using a regular tessellation. Among the regular tessellations, the hexagonal one realizes the most isotropic cell shape, which enables us to control active cameras isotropically. Moreover, the number of cells meeting at a corner is three in the hexagonal tessellation, whereas it is four and six in the square and triangular tessellations, respectively. Hence, it facilitates the cell-based camera control, as will be discussed below. It should be noted here that we do not need to introduce any overlapping areas between hexagonal cells: compare Figs. 3.1 and 3.3. Even without overlapping areas, the cell-based camera control realizes the continuous object observation, as will be explained in the next section.
Camera arrangement: Since the object can move freely on the floor and each camera should observe each cell at higher resolution than specified, a camera ring surrounding the entire floor is a reasonable design for the basic camera arrangement (Fig. 3.4). Recall that the camera ring is usually augmented with a group of ceiling cameras as shown in Fig. 2.4. As will be discussed later, the practical camera arrangement should be designed based on the controllable ranges of the pan/tilt/zoom parameters as well as the cell size, the floor size, and the maximum object velocity.
Cell-based camera control: As discussed before, when the object is in cell A as illustrated in Fig. 3.3 and observed by a group of cameras assigned to cell A, the other camera groups should become ready to observe the cells adjacent to A to ensure the continuous multi-view object observation. If this condition is always satisfied, the continuous multi-view object observation is achieved even if the cells do not have any overlaps.
In other words, when the object crosses a cell border, the object is observed by a pair of camera groups which are assigned to a pair of neighboring cells sharing the border; the field of view of each camera is set wide enough to cover its assigned cell, as will be described later. Thus, the cell-based camera control can be realized by solving a camera-to-cell assignment problem. The detailed discussions on this problem are given below, since the cell-based camera control should be designed to satisfy all four requirements specified before and has much to do with the above two design factors.
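The hexagonal partitioning itself is straightforward to implement; below is a minimal sketch using axial coordinates and cube rounding, a standard hex-grid technique. The coordinate convention (pointy-top hexagons, axial indices) is an illustrative choice of ours, not taken from the chapter; R is the cell radius (center-to-corner distance), as in the text.

```python
import math

# Point-to-cell lookup for a hexagonal tessellation of cell radius R,
# via axial coordinates and cube rounding (standard hex-grid technique).

def point_to_cell(x, y, R):
    """Return the axial index (q, r) of the hexagonal cell containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / R
    r = (2 / 3 * y) / R
    return cube_round(q, r)

def cube_round(q, r):
    """Round fractional axial coordinates to the nearest cell index."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr

def cell_center(q, r, R):
    """Inverse mapping: world coordinates of the center of cell (q, r)."""
    return (R * math.sqrt(3) * (q + r / 2), R * 1.5 * r)
```

With such a lookup, deciding which cell the object occupies reduces to one constant-time computation per frame, which suits the real-time control requirement.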
Fig. 3.4 Cell groups and their corresponding camera groups (see text)
3.2.4 Cell-Based Camera Control Scheme

Since an arbitrary object movement can be expressed as a movement from the current cell to one of its six neighboring cells, a naive camera-to-cell assignment method would be to partition the cameras into seven groups, assigning one group to the cell where the object exists, say cell A in Fig. 3.3, and the other six to its neighboring cells B, C, D, E, F, and G. Then, when the object moves into cell D next, the three camera groups assigned to cells C, E, and G are controlled to observe H, I, and J, respectively. This camera-to-cell assignment, however, is not cost-effective, because 6/7 of the installed cameras are not used for the object observation.
If the object trajectory is known a priori, then it can be represented by a sequence of cells. In this case, since each cell has only two neighboring cells, the set of cameras can be partitioned into two groups: one group observes the cell with the object and the other group is controlled to observe one of the neighboring cells depending on which of them the object is heading to. With this camera-to-cell assignment, the fraction of idle cameras can be reduced to 1/2. In [22], we developed a more sophisticated camera control method which maximizes the object observability and reduces the idle cameras.
Even without knowledge about the object trajectory, we can reduce the number of idle cameras by assuming the cameras can be controlled from one cell to another faster than the object moves. Suppose the object in cell A is approaching the corner shared with cells B and C, as illustrated with x in Fig. 3.3. Then, a pair of camera groups should be assigned to cells B and C to prepare for the object
observation. If the object changes its motion and approaches the corner shared with cells B and D as illustrated with x in Fig. 3.3, then the camera group assigned to C should be controlled to switch from cell C to D before the object crosses the border of cell D. With such quickly controllable cameras, the following camera-to-cell assignment reduces the fraction of idle cameras to 2/3. Note that, as will be proven later in Sect. 3.4.1.1, the average fraction of idle cameras is much smaller than 2/3 in practice, because the field of view of each camera is set wide enough to cover its assigned cell and, as a result, each camera can also observe parts of the cells neighboring its assigned cell.
Figure 3.4 illustrates the proposed camera-to-cell assignment. The set of hexagonal cells is partitioned into three disjoint cell groups so that the three cells sharing a corner belong to different cell groups. The set of cameras is also partitioned into three groups, and a one-to-one correspondence is established between the cell groups and the camera groups so that each camera group observes the cells in its corresponding cell group by dynamically controlling its pan/tilt/zoom parameters depending on the object motion. Three cells sharing a corner are given a unique cell cluster ID, within which, as shown in Fig. 3.4, each cell is assigned a unique cell ID ⟨i, j⟩, where i denotes the cell cluster ID and j the camera group ID.
The camera control is conducted as follows. Suppose the object is in cell ⟨4, 1⟩ in Fig. 3.4 and camera group #1 is capturing its multi-view video. Then, depending on which corner is the closest, camera groups #2 and #3 are controlled to observe the cells sharing that closest corner with cell ⟨4, 1⟩. Thus, if the object is wandering in cell ⟨4, 1⟩, camera group #2 may change its target cell among cells ⟨1, 2⟩, ⟨2, 2⟩, and ⟨4, 2⟩, and camera group #3 among ⟨1, 3⟩, ⟨4, 3⟩, and ⟨3, 3⟩, respectively.
This corner-based camera control also validates the hexagonal tessellation, because in the square and triangular tessellations, four and six cells would share a corner, respectively, increasing the number of idle cameras. In summary, with our cell-based object tracking and multi-view observation scheme, by increasing the number of cameras three times, high resolution well-focused multi-view object observation can be realized even if the object moves freely in a widespread area.
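The corner-based rule can be sketched geometrically: find the cell corner closest to the object, then compute the centers of the two neighboring cells that share it — these are the cells the other two camera groups must gaze at. The flat-top hexagon orientation and all function names below are illustrative assumptions, not the authors' implementation.

```python
import math

# Sketch of the corner-based camera control (Sect. 3.2.4). The current cell
# has center `center` and radius R; `obj` is the object position inside it.

def closest_corner(center, R, obj):
    """Corners of a flat-top hexagon lie at angles 0, 60, ..., 300 degrees."""
    corners = [(center[0] + R * math.cos(math.radians(60 * k)),
                center[1] + R * math.sin(math.radians(60 * k))) for k in range(6)]
    return min(corners, key=lambda p: math.dist(p, obj))

def target_cells(center, R, obj):
    """Centers of the two cells sharing the closest corner with the current one.

    Adjacent cell centers lie at distance sqrt(3)*R, at +/-30 degrees around
    the direction from the current center to the corner.
    """
    cx, cy = closest_corner(center, R, obj)
    theta = math.atan2(cy - center[1], cx - center[0])
    d = math.sqrt(3) * R
    return [(center[0] + d * math.cos(theta + s),
             center[1] + d * math.sin(theta + s))
            for s in (math.radians(30), math.radians(-30))]
```

As the object wanders inside its cell, re-evaluating `target_cells` each frame reproduces the switching among the three candidate cells per idle camera group described above.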
3.3 Algorithm Implementation

To implement a practical algorithm for the cell-based object tracking and multi-view observation presented above, we should design the camera arrangement, the cell size, the calibration method, and the real-time tracking algorithm. This section presents the design of their practical implementations based on the specifications: the approximate object size, its maximum speed, the entire 3D scene space, and the minimum allowable image resolution.
3.3.1 Constraints between Design Factors and Specifications

Suppose the camera ring is arranged to surround the scene as shown in Fig. 3.4 and its radius is given. The cell radius R is designed based on (1) the maximum object speed, (2) the camera control speed, and (3) the minimum allowable image resolution required for 3D video production. As illustrated in Fig. 3.3, intuitively, the cell radius R should be large enough to ensure that the camera group observing cell C can be switched to cell D before the object at x arrives at D. On the other hand, R should be small enough to ensure that the object is observed at more than the minimum allowable image resolution. This section first gives intuitive descriptions of how the design factors and the specifications are related, and then derives algebraic constraints on them.

3.3.1.1 Camera Control Rule and Cell Size

As the most fundamental constraint, here we derive the constraint on the cell radius R imposed by the maximum object speed and the camera control speed, and explain qualitatively how the design factors and the specifications are related. Figure 3.5 illustrates the worst case, which requires the quickest switching from one cell to another. Suppose the object is at p in cell ⟨1, j⟩ and is going straight to q in cell ⟨3, j⟩. To catch up with this object motion, the camera group j observing cell ⟨1, j⟩ should be switched to cell ⟨3, j⟩ when the object crosses the midpoint of p and q, since the corner point q then becomes the closest corner point to the object. The pan/tilt/zoom parameters of camera group j, which were adjusted for cell ⟨1, j⟩, should then be changed and fixed to those for cell ⟨3, j⟩ before the object arrives at q. Let v [m/sec] denote the maximum object velocity and τ [sec] the maximum allowable control time for a camera to switch its pan/tilt/zoom parameters from one cell to another. Then, R should satisfy

R ≥ 2τv. (3.1)

This inequality implies that the maximum image resolution is achieved when

R = 2τv, (3.2)
since the smaller the cell is, the larger the object appears in the image. If the maximum possible image resolution given above does not satisfy the specification, we should modify the specifications by slowing down the object motion, reducing the minimum allowable image resolution, employing a faster pan/tilt/zoom camera device, or enlarging the radius of the camera ring.
The dashed thick color lines in Fig. 3.5 denote rule borders, at which the camera control from one cell to another is triggered. They form another hexagonal tessellation with radius √3R. The color hatched areas of width τv in Fig. 3.5 illustrate the controllability constraint of the active cameras. That is, when the red hatched area
Fig. 3.5 Designing the camera control rule for camera group j and the cell radius R. Cells are represented by black and gray solid lines. Dashed thick color lines denote rule borders and colored hatched areas controllability constraints of active cameras (see text). ©2010 IPSJ [22]
does not overlap with cell ⟨1, j⟩, camera group j can be switched to that cell before the object enters from other neighboring group-j cells. The same condition holds for the blue hatched area. The width of such hatched areas, i.e. τv, can be reduced by using quickly controllable active cameras or by enlarging the radius of the camera ring. When active cameras are equipped with a large zooming capability to observe cells from rather far away, the radius of the camera ring can be increased to reduce τ. This is because the camera angle (i.e. the pan/tilt parameters) to be changed for the cell switching is reduced when the distance between the camera and the cells becomes large. In addition, a pair of cells between which a camera is switched can be observed with almost the same zooming factors from a distant camera, so the time for its zoom control is also reduced. Note that, strictly speaking, the camera control time varies depending on the camera position as well as on the pair of cells between which its pan/tilt/zoom parameters are switched. That is, the widths of the hatched areas illustrated in Fig. 3.5 vary from cell to cell as well as from rule border to rule border.
To disentangle the mutual dependencies among the design factors and the specifications, and to determine R to satisfy the specifications, the following sections derive algebraic constraints on the possible range of R, which are categorized into (1) image observation constraints and (2) camera control constraints. The former are derived to guarantee sufficiently high resolution object observation, and the latter to realize sufficiently fast camera control for the cell switching. Note that, as will be discussed below, the distance from a camera to a cell, d, and R are tightly coupled with each other, and hence most of the constraints include both. In practice, d can be determined based on the capture space and the possible camera arrangement space, which then constrains R. Note also that all analyses below are conducted on a 2D floor for simplicity, neglecting the object height and the heights of the camera positions. This is because the objective of the analyses is just to show how the field-of-view, image resolution, and depth-of-field parameters as well as the pan/tilt/zoom control speed constrain the cell radius R. Once a radius R satisfying all constraints is obtained, the detailed 3D parameter design can be done easily by simulation.
Fig. 3.6 Constraint between the cell radius R and the field-of-view of a camera
3.3.1.2 Image Observation Constraints

As discussed in Chap. 2, a camera has a set of controllable parameters to be optimized to capture object images sufficient for 3D video production: focal length, focus, shutter speed, iris size, and gain. Here, we assume that the gain and shutter parameters have already been optimized to capture the object with good exposure and a high S/N ratio, because their adjustment requires taking into account or controlling the lighting environment, which is out of the main scope of this book. In what follows, we analyze the constraints on R imposed by the field-of-view, image resolution, and depth-of-field, which are determined by the focal length, iris size, and focus parameters.
3.3.1.2.1 Constraint by Field-of-View

Figure 3.6 illustrates the constraint between the field-of-view (i.e. zooming factor) and the cell size. Let d [mm] denote the distance from the projection center to the cell center and r [mm] the radius of the bounding cylinder of the object. To capture the object in this cell,¹ the field-of-view at the distance d should be larger than 2 × (R + r). That is,

2(R + r)/d ≤ W/f (3.3)

⇔ R ≤ dW/(2f) − r, (3.4)

where W [mm] and f [mm] are the image width and focal length, respectively. Note that r and W are fixed constants, and f and d are design parameters for R [mm]. This constraint implies that the maximum possible cell radius is obtained when the camera is observing the nearest cell with the shortest focal length, i.e. the largest field-of-view. In other words, since the shortest focal length, the image width, and the object size are fixed, the minimum value of the distance d determines the maximum possible cell radius R. That is,

¹ We regard the object to be in a cell when the axis of its bounding cylinder is included in the cell.
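The bound in Eq. (3.4) is a one-line computation; the sketch below evaluates it for illustrative numbers (the sensor width, focal length, and distances are hypothetical values of ours, not from the text).

```python
# Upper bound on the cell radius from the field-of-view constraint, Eq. (3.4).

def max_cell_radius_fov(d, W, f, r):
    """R <= d*W/(2*f) - r; all lengths in millimetres."""
    return d * W / (2 * f) - r

# E.g. a sensor of width W = 7.2 mm at its shortest focal length f = 8 mm,
# observing the nearest cell 10 m away, with object radius r = 0.5 m:
R_max = max_cell_radius_fov(d=10_000, W=7.2, f=8, r=500)  # 4000 mm, i.e. 4 m
```

As the text notes, the bound grows with d, so the nearest cell with the widest field-of-view is the binding case.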
Fig. 3.7 Constraint between the cell radius R and the image resolution
Camera arrangement constraint 1: to make the cells as large as possible, which facilitates the camera control and reduces the calibration work, place the cameras as far as possible from the scene.
3.3.1.2.2 Constraint by Image Resolution

Figure 3.7 illustrates the constraint between the cell radius R and the image resolution. The maximum pixel coverage s [mm/pix], which is inversely proportional to the image resolution, is obtained when the object surface is located at d + R + r from the projection center:

s = (d + R + r)W / (fN), (3.5)

where N denotes the pixel count of the image width W. Note that r, N, and W are fixed constants, while f and d are control parameters to optimize s and R. This constraint implies that, given R, the minimum possible image resolution is obtained when the camera is observing the farthest cell with the largest focal length, i.e. the smallest field-of-view. That is,
Camera arrangement constraint 2: to make the minimum image resolution larger than specified, place each camera so that, with its largest focal length, the image resolution at the farthest scene point is larger than the specified minimum allowable image resolution.
In other words, to increase the image resolution, place the cameras as close as possible to the scene, which obviously contradicts camera arrangement constraint 1. In designing the arrangement of cameras and cells, the largest focal length is fixed and the minimum allowable image resolution is specified. Thus we can compute the maximum possible cell radius R as follows:

R = sfN/W − d − r, (3.6)
where s and f denote the maximum allowable pixel coverage and the largest focal length, respectively.
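Equation (3.6) can likewise be evaluated directly; the numbers below (pixel coverage, sensor, distances) are illustrative assumptions only.

```python
# Upper bound on R from the image resolution constraint, Eq. (3.6). s is the
# maximum allowable pixel coverage [mm/pix] at the farthest object surface
# point, f the largest focal length, N the pixel count of image width W.

def max_cell_radius_resolution(s, f, N, W, d, r):
    """R = s*f*N/W - d - r; lengths in millimetres."""
    return s * f * N / W - d - r

# E.g. s = 2 mm/pix, largest f = 40 mm, N = 1920 pix, W = 7.2 mm,
# farthest cell at d = 15 m, object radius r = 0.5 m:
R_max = max_cell_radius_resolution(s=2.0, f=40, N=1920, W=7.2, d=15_000, r=500)
```

Note the opposite dependence on d compared with Eq. (3.4): here a larger camera distance shrinks the admissible R, which is exactly the tension between the two camera arrangement constraints.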
Fig. 3.8 Constraint between the cell radius R and the depth-of-field
3.3.1.2.3 Constraint by Depth-of-Field

As discussed in Chap. 2, multi-view video data for 3D video production should capture the object without blur. This indicates that the cell size should be designed so that each camera can capture the object within its depth-of-field. Consider a cell located at d [mm] from the camera as shown in Fig. 3.8. In this figure, the near and far clips D_N and D_F of the depth-of-field are given by Eqs. (2.3) and (2.4). Hence this constraint can be expressed as follows:

d − R − r ≥ D_N ≈ D_H d_f / (D_H + d_f) = f² d_f / (f² + F c d_f),
d + R + r ≤ D_F ≈ D_H d_f / (D_H − d_f) = f² d_f / (f² − F c d_f), (3.7)

where F denotes the F-number, c = W/N the diameter of the circle-of-confusion, i.e., the physical size of a pixel, d_f the distance at which the lens is focused, and D_H the hyperfocal distance. In this constraint, c = W/N is a fixed constant, while d_f, F, and f are control parameters to optimize D_N, D_F, and R. This constraint implies that a larger F-number (smaller iris) and a shorter focal length (wider field-of-view and lower image resolution) make the maximum possible cell radius larger. Since the minimum image resolution is also given as a design specification, making the F-number large by using stronger illumination is the option without trade-offs.
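The depth-of-field condition is easy to check programmatically. The sketch below uses the thin-lens approximations on the right-hand sides of Eq. (3.7); the specific parameter values in the test are illustrative assumptions.

```python
# Depth-of-field check, Eq. (3.7): the whole cell (plus object radius) must
# lie between the near and far clips, with c = W/N the pixel size [mm].

def dof_clips(f, F, c, df):
    """Approximate near/far clip distances of the depth-of-field [mm]."""
    near = f * f * df / (f * f + F * c * df)
    far = f * f * df / (f * f - F * c * df) if f * f > F * c * df else float("inf")
    return near, far

def cell_within_dof(d, R, r, f, F, c, df):
    """True if the cell at distance d fits within the depth-of-field."""
    near, far = dof_clips(f, F, c, df)
    return d - R - r >= near and d + R + r <= far
```

When the focus distance reaches the hyperfocal distance (f² ≤ F·c·d_f), the far clip goes to infinity, which the sketch handles explicitly.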
3.3.1.3 Camera Control Constraints

While the image observation constraints described above are derived to capture multi-view video data usable for 3D video production, the camera control constraints specify the conditions that ensure continuous object tracking with the given pan/tilt/zoom control mechanisms.
Fig. 3.9 Constraint between the cell radius R and the pan speed. If the object moves from p to q, then the camera should start panning from cell ⟨1, j⟩ to ⟨3, j⟩ when the object goes beyond the midpoint between p and q
3.3.1.3.1 Constraint by Pan/Tilt Control

Since we assume that the object moves on a 2D plane and the cameras are arranged on a ring, the pan control plays the major role in active object tracking. Here, therefore, we analyze the constraint by the pan control alone; the constraint by the tilt control will be satisfied when that by the pan control is satisfied, because both controls share similar dynamics and can be done in parallel. Figure 3.9 illustrates the situation where a camera should switch its focused cells by the fastest pan control. As explained in Fig. 3.5, the camera should switch cells within R/(2v) [sec], where v denotes the maximum object speed. Let ωp [rad/sec] and T denote the angular velocity of the pan control and the computation delay, respectively. Then, we have the following constraint:

2 tan⁻¹(3R/(2d)) / ωp + T ≤ R/(2v). (3.8)
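Since Eq. (3.8) is implicit in R, the smallest admissible cell radius has no simple closed form, but it can be found numerically; a bisection sketch under illustrative parameter values:

```python
import math

# Numerical solution of the pan control constraint, Eq. (3.8). The feasible
# set is an interval [R*, inf) (the left side is bounded while the right side
# grows linearly in R), so bisection on the first crossing is valid.

def pan_constraint_ok(R, d, omega_p, T, v):
    """2*atan(3R/(2d))/omega_p + T <= R/(2v), Eq. (3.8)."""
    return 2 * math.atan(3 * R / (2 * d)) / omega_p + T <= R / (2 * v)

def min_cell_radius_pan(d, omega_p, T, v, hi=1e6):
    """Smallest R [mm] satisfying Eq. (3.8), by bisection on [0, hi]."""
    lo = 0.0
    if not pan_constraint_ok(hi, d, omega_p, T, v):
        raise ValueError("no admissible R below hi")
    for _ in range(100):
        mid = (lo + hi) / 2
        if pan_constraint_ok(mid, d, omega_p, T, v):
            hi = mid
        else:
            lo = mid
    return hi
```

For instance, with d = 10 m, a pan speed of π rad/sec, a 0.1 sec delay, and v = 2 m/sec (all hypothetical numbers), the minimum cell radius comes out below one metre.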
This constraint implies that the farther a camera is placed from the scene, the larger the cell radius is allowed to be, which is consistent with camera arrangement constraint 1.

3.3.1.3.2 Constraint by Zoom Control

Figure 3.10 illustrates the situation where a camera should switch its focused cells by the fastest zoom control. As with the pan control above, the camera should switch cells within R/(2v) [sec]. Figure 3.11 illustrates that the focal length needed to capture the entire cell at distance d is given by

f = (2d − R − r)W / (2√3(R + r)) [mm]. (3.9)
Therefore to zoom out from cell 3, j to 1, j , we need to shorten the focal length by √ 3RW (2(d − 3R) − R − r)W (2d − R − r)W [mm]. (3.10) − = √ √ R +r 2 3(R + r) 2 3(R + r)
3 Active Camera System for Object Tracking and Multi-view Observation
Fig. 3.10 Constraint between the cell radius R and the zoom control speed. If the object moves from p to q, then the camera zooms out from cell ⟨3, j⟩ to ⟨1, j⟩ when the object goes beyond the midpoint between p and q
Fig. 3.11 The field-of-view of a camera in the case of Fig. 3.10
Hence the constraint by the zoom control is given as

√3RW / (z(R + r)) + T ≤ R / (2v),   (3.11)

where z [mm/sec] is the zooming speed. Note that this constraint does not include d and hence purely specifies the allowable range of R given the zoom control speed. By solving this inequality for R, we have

R ≥ ( √(12v²W² + 8√3v²zTW − 4√3rvzW + 4v²z²T² + 4rvz²T + r²z²) + 2√3vW + 2vzT − rz ) / (2z).   (3.12)
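Both camera control constraints are easy to evaluate numerically. A sketch using the closed form of Eq. (3.12) and the inequality (3.8) (helper names are ours; the example numbers match the arena specifications used later in this chapter):

```python
import math

def min_cell_radius_zoom(v, r, W, z, T):
    """Smallest R [m] allowed by the zoom-control constraint (3.11),
    via the closed form (3.12). v [m/s], r [m], W [mm], z [mm/s], T [s]."""
    s3 = math.sqrt(3.0)
    disc = (12 * v**2 * W**2 + 8 * s3 * v**2 * z * T * W
            - 4 * s3 * r * v * z * W
            + 4 * v**2 * z**2 * T**2 + 4 * r * v * z**2 * T + r**2 * z**2)
    return (math.sqrt(disc) + 2 * s3 * v * W + 2 * v * z * T - r * z) / (2 * z)

def pan_ok(R, d, v, omega_p, T):
    """Pan-control constraint (3.8) at camera-to-cell distance d [m]."""
    return 2 * math.atan2(3 * R, 2 * d) / omega_p + T <= R / (2 * v)

# Example with arena-style specifications:
R_min = min_cell_radius_zoom(v=10.0, r=1.0, W=4.8, z=30.5, T=0.03)
# R_min is about 5.2 [m]; the pan constraint is then easily satisfied:
ok = pan_ok(R_min, d=31.6, v=10.0, omega_p=5 * math.pi / 3, T=0.03)
```

Since (3.11) does not contain d, the zoom speed alone fixes the lower bound on R, after which the pan constraint only needs to be verified per camera distance.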
3.3.2 Studio Design Process

As shown in the previous section, the key design factors, the cell radius R and the camera distance d, should be determined so as to satisfy all constraints described above. Note that, while not discussed above, d is rather strictly constrained by the physical studio environment, i.e. by where physical cameras can be placed. Hence, it is reasonable to take the following studio design process:

1. Select active cameras with high image resolution, small F-number, large zooming factor, and fast pan/tilt/zoom control.
2. Place the cameras uniformly to surround the capture space. Their distances to the scene should be made as large as possible while satisfying camera arrangement constraint 2. Note that, depending on the shape of the capture space, the camera ring should be deformed to cover the space uniformly. This book assumes the capture space is almost isotropic, where the ring camera arrangement can be employed. The optimal camera arrangement for an anisotropic capture space is left for future studies.
3. Lighting should be made as bright as possible to guarantee sufficient depth-of-field and reduce motion blur with a small iris size and fast shutter speed.
4. Compute the cell radius R which satisfies the constraints described by Eqs. (3.4), (3.6), (3.7), (3.8), and (3.12), assuming the cell placement rule described next. Note that the cell radius computation should be done for each camera, and we have to find an R that satisfies all constraints for all cameras. Since the constraints are valid only for the 2D scene, some margin should be introduced to determine R. One idea to reduce the margin is to use the maximum or minimum 3D distances from a camera to the capture space as d in the constraints. If no R satisfies the constraints, then employ higher performance cameras, modify the camera arrangement, and/or reduce the capture space, and recompute R.
5. Partition the scene into a group of hexagonal cells of radius R. That is, the 2D floor is partitioned into the hexagonal tessellation with radius R by fixing a cell at the center of the floor. This cell placement should be taken into account in the previous process of computing R.
6. Conduct a 3D simulation to verify that continuous object tracking can be realized while capturing multi-view video satisfying the specifications. If necessary, adjust the camera arrangement and/or the cell partitioning to satisfy the specifications or enhance the system performance.
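The hexagonal tessellation with a cell fixed at the center of the floor can be sketched as follows (a minimal illustration; cell-ID assignment and the rule borders are omitted, and the function name is ours):

```python
import math

def hex_cell_centers(width, height, R):
    """Centers of a flat-top hexagonal tessellation with circumradius R,
    with one cell fixed at the center of a width x height floor.
    Generated cells may overhang the floor boundary slightly."""
    dx, dy = 1.5 * R, math.sqrt(3.0) * R      # column / row spacing
    nx = int((width / 2) / dx) + 1
    ny = int((height / 2) / dy) + 1
    centers = []
    for q in range(-nx, nx + 1):
        off = dy / 2 if q % 2 else 0.0        # odd columns shifted half a row
        for s in range(-ny, ny + 1):
            centers.append((q * dx, s * dy + off))
    return centers
```

Neighboring cell centers end up √3·R apart, which is the quantity the rule borders of Sect. 3.3.1.1 are defined over.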
We will show three practical studio designs later in this chapter: two for laboratory environments and one for a large scale real world scene, an ice skate arena.
3.3.3 Cell-Based Camera Calibration

Once the cell partitioning is fixed, the next step is to (1) optimize the camera control parameters to maximize the spatial resolution of captured images, and (2) calibrate the cameras for each cell. Let G1, G2, and G3 denote three camera groups, each of which is assigned to the cell groups 1, 2, and 3, respectively, as illustrated in Fig. 3.4. Then the camera control parameter optimization and calibration are conducted as follows.

Step 1 Cell-wise control parameter optimization and calibration: For each cell ⟨i, j⟩,
Step 1.1 For each camera in the associated camera group Gj, optimize the control parameters so that it observes the entire cell ⟨i, j⟩ with the highest spatial resolution. In this optimization, we assume that the 3D object shape is modeled by a cylinder of the specified radius and height. Let E_{i,j} denote the set of the optimized control parameters for the cameras in Gj to capture cell ⟨i, j⟩.
Step 1.2 Calibrate the intrinsic and extrinsic parameters A_{i,j}, R_{i,j}, T_{i,j} of the cameras in Gj under E_{i,j} using the static camera calibration algorithm described in Chap. 2.

Step 2 Pair-wise integration of intra-camera-group calibrations: For each pair of neighboring cells ⟨i, j⟩ and ⟨i′, j′⟩,
Step 2.1 Suppose camera groups Gj and Gj′ observe cells ⟨i, j⟩ and ⟨i′, j′⟩ with the optimized control parameters E_{i,j} and E_{i′,j′}, respectively.
Step 2.2 Control the other camera group Gj″ to observe both cells ⟨i, j⟩ and ⟨i′, j′⟩ simultaneously.
Step 2.3 Estimate the intrinsic and extrinsic parameters of the cameras in Gj″. Notice that the extrinsic parameters of each camera group are described in its intra-group local coordinate system.
Step 2.4 Place at least three 3D reference points in each of cells ⟨i, j⟩ and ⟨i′, j′⟩, respectively, and estimate their 3D positions in each intra-group local coordinate system. Let ^k p_l^{⟨i,j⟩} denote the l-th 3D point in cell ⟨i, j⟩ described in the camera group k coordinate system.
Step 2.5 Solve the absolute orientation problem [10] between ^j p_l^{⟨i,j⟩} and ^{j″} p_l^{⟨i,j⟩}. This provides the optimal rigid transformation of the extrinsic parameters R_{i,j}, T_{i,j} into the group Gj″ coordinate system.
Step 2.6 Similarly, we can also transform R_{i′,j′}, T_{i′,j′} into the group Gj″ coordinate system, and as a result, the rigid transformation between R_{i,j}, T_{i,j} and R_{i′,j′}, T_{i′,j′} is obtained.

By repeating the above pair-wise integration process, all cell-based extrinsic camera parameters can be integrated into a single coordinate system. Note that, since the extrinsic camera parameter integration does not modify the intrinsic, lens distortion, and extrinsic camera calibration parameters computed for each cell, the accuracy of the camera calibration inside a cell is preserved, although the inter-cell camera calibration may introduce some spatial discontinuity between neighboring cells.
While the intra-cell camera calibration should be done for each cell, the number of inter-cell calibration processes can be reduced. That is, camera group Gj″ in the above method can be controlled to observe multiple cells associated with Gj and Gj′ simultaneously. Then we can integrate the intra-cell calibrations of such cells into a single local coordinate system defined by Gj″ at once. To reduce calibration labor further, we can select some cameras and make them observe the entire scene for establishing the global coordinate system. Let Ĝ denote the group of the selected cameras and G̃j (j = 1, 2, 3) the three camera groups excluding the selected ones. By observing 3D points both by Ĝ and G̃j, we can describe the positions and postures of the cameras in G̃j in the Ĝ camera coordinate system, i.e. the global coordinate system. For n cells, while the original pair-wise integration requires n(n − 1)/2 repetitions of Step 2.2 above to transform all intra-cell extrinsic camera parameters into the global coordinate system, this simplified procedure requires only n.
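The absolute orientation problem of Step 2.5 has a closed-form least-squares solution. The sketch below uses the SVD-based formulation (Horn's method [10] solves the same problem with unit quaternions, so this is a functionally equivalent stand-in):

```python
import numpy as np

def absolute_orientation(P, Q):
    """Rigid transform (R, t) such that Q ~= R @ P + t, from >= 3
    paired, non-collinear 3D points given as 3xN arrays.
    SVD-based least-squares solution of the absolute orientation
    problem; Horn's original method uses unit quaternions."""
    cP = P.mean(axis=1, keepdims=True)
    cQ = Q.mean(axis=1, keepdims=True)
    H = (Q - cQ) @ (P - cP).T                       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against reflection
    R = U @ D @ Vt
    t = cQ - R @ cP
    return R, t
```

Applied to the reference points expressed in the two group coordinate systems, (R, t) is exactly the rigid transformation used in Steps 2.5 and 2.6.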
3.3.4 Real-Time Object Tracking Algorithm

As described before, the cell-based object tracking and multi-view observation algorithm does not track or observe an object in motion continuously. Instead, it controls the pan/tilt/zoom parameters of each camera group to make its member cameras focus on its associated cell for the multi-view object observation, and switches focusing cells from one to another according to the object motion. The cell-based object tracking and multi-view observation is realized by the following real-time camera control method.

Step 1 The object starts its actions from the initial position in the specified cell. For example, suppose the object starts from x in cell A of Fig. 3.3. Then, capture the multi-view object video with the camera group assigned to cell A. Depending on the object's initial position in cell A, the other two camera groups are controlled to observe two neighboring cells of A, respectively (cells B and C in Fig. 3.3).
Step 2 The camera group focusing on the cell with the object detects the object from observed multi-view video frames and computes its 3D position in real time. Here a rough 3D shape reconstruction method is employed, and 3D video production is conducted later from the captured video data. Note here that if a camera group focusing on a cell adjacent to the cell with the object is not in motion, the multi-view video data captured by that camera group are also recorded for 3D video production. Consequently, all cameras can be used to capture multi-view object video data. In other words, the algorithm guarantees that at least 1/3 of the cameras capture multi-view object video.
Step 3 If the object crosses the rule border, then switch the corresponding camera group to the new cell as described in Sect. 3.3.1.1. When the camera group stops its motion, it starts capturing multi-view object video even if the object is not in its focusing cell.
Step 4 Go back to Step 2.
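The control loop above can be condensed into a toy simulation (all names are hypothetical; the rule border of Sect. 3.3.1.1 is approximated here by a nearest-cell-center rule, and the real system runs three camera groups asynchronously):

```python
import math

def cell_based_tracking(trajectory, cell_centers):
    """Toy simulation of Steps 1-4. The group watching the focused cell
    records the object (Step 2); when the object crosses into another
    cell (Step 3, with the rule border approximated by the
    nearest-center rule), the focus switches to that cell."""
    focus = min(range(len(cell_centers)),
                key=lambda c: math.dist(trajectory[0], cell_centers[c]))
    focus_log = []
    for p in trajectory:
        nearest = min(range(len(cell_centers)),
                      key=lambda c: math.dist(p, cell_centers[c]))
        if nearest != focus:      # object crossed the (approximate) border
            focus = nearest       # switch the responsible camera group
        focus_log.append(focus)   # this cell's group records the object
    return focus_log
```

For a trajectory moving between two cells, the log shows the focus switching exactly once, at the midpoint crossing:

```python
cell_based_tracking([(0, 0), (3, 0), (6, 0), (9, 0)], [(0, 0), (10, 0)])
# → [0, 0, 1, 1]
```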
3.4 Performance Evaluations

3.4.1 Quantitative Performance Evaluations with Synthesized Data

To quantitatively evaluate the performance of the cell-based object tracking and multi-view observation algorithm (the cell-based algorithm, in short), we first conducted experiments using a pair of 3D video sequences produced by the previous system [22] as ground truth data, whose original multi-view video data were captured in Studio A (Fig. 2.4 and Table 2.3). As shown in Fig. 3.12, the 3D video sequences represent a person walking in the studio. The red curves in Fig. 3.13 illustrate the pair of object motion trajectories reconstructed by the previous system. Regarding these 3D video sequences as actual ground truth object motions, we designed the following simulation setups:
Fig. 3.12 3D video sequence used for the simulation. ©2010 IPSJ [22]
Fig. 3.13 The object motion trajectories and the cell arrangement used for the performance evaluation. The hexagonal cells are given unique cell IDs consisting of ⟨cell cluster ID, camera group ID⟩. ©2010 IPSJ [22]
Scene space: 4 [m] × 4 [m] as illustrated in Fig. 3.13.
Object size: approximated by a cylinder of 0.9 [m] diameter and 1.8 [m] height.
Active cameras: the same pan/tilt/zoom control parameters and image resolution as those of the real active cameras installed in Studio A.
Maximum time for a camera to conduct the cell switching: τ = 1.0 [sec].
Fig. 3.14 Camera arrangement. The squares and circles represent the camera positions. The former are placed at 2,200 mm height and the latter at 500 mm to increase the observability of the object surface. ©2010 IPSJ [22]
Lowest allowable image resolution: pixel coverage s ≤ 8 [mm/pix].
Maximum object velocity: v = 0.3 [m/sec].
Camera arrangement: camera ring of 8 [m] radius with 24 cameras as illustrated in Fig. 3.14. To increase the observability of the object surface, half of the cameras are placed at 2,200 [mm] height and the other half at 500 [mm].

Then, the cell radius was determined as R = 0.6 [m] and the cell partitioning illustrated in Fig. 3.13 was obtained. Note here the following.

• The cell radius is set rather small to guarantee enough image resolution, because the image resolution of the active cameras in Studio A is just VGA (640 × 480).
• To satisfy the fundamental constraint, R ≥ 2τv, a slow object motion speed and a large camera ring radius were employed; the camera ring radius was set much larger than the actual one in Studio A (see Table 2.3). The practical process of designing the cell partitioning will be shown in the third experiment presented in this section.

As a baseline method, we designed the following multi-camera system:

• The same camera ring as illustrated in Fig. 3.14 was employed.
• The pan/tilt/zoom parameters of the 24 cameras on the ring were adjusted to observe the entire capture space and were then fixed, regarding the active cameras as static ones.

Then, using the multi-view video data virtually captured by both the cell-based and the baseline methods, new 3D video sequences were produced and compared with the original ones for performance evaluation. Here we used the volumetric 3D shape reconstruction method in Sect. 4.4.1.2 and the view-independent texture generation technique in Sect. 5.3. For the performance evaluation, we used the following measures.

Viewpoint usage: This evaluates how many cameras are observing the object at each video frame.
Pixel usage: This evaluates how many pixels cover the object surface.
3D shape reconstruction accuracy and completeness [16]: This evaluates how accurately and completely the object shape can be reconstructed from the observed multi-view video data.
Peak signal-to-noise ratio (PSNR) of rendered images: This evaluates the overall quality of the captured multi-view video data by comparing free-viewpoint object images generated from them with those from the original 3D video sequences.
3.4.1.1 Viewpoint Usage

This measure specifies how many of the installed cameras are used to observe the object. Since the baseline method by design observes the object with all cameras, its viewpoint usage is always 1. On the other hand, the viewpoint usage of the cell-based method varies depending on the object motion. Figure 3.15 shows the timing charts of the active camera control for the two test sequences. For each sequence, the upper graph shows the stop & go action profiles for each camera group and the lower the viewpoint usage. From these results, we can make the following observations:
• The cell-based method worked well to continuously capture multi-view object video; in all video frames, the object was observed by at least one of the three camera groups.
• Since the three camera groups are controlled to observe a cell cluster consisting of three neighboring cells that share a common corner, the object can be observed even by the camera groups whose assigned cells do not include the object. That is, the viewpoint usage of the cell-based method usually exceeds 1/3, and the full viewpoint usage (= 1) is attained quite often.
These results proved that the cell-based method is really effective in realizing the active tracking and multi-view observation of an object moving in a wide area with a limited number of cameras.
3.4.1.2 Pixel Usage

The pixel usage measures how many pixels cover the object surface. It is defined for video frames including object images as follows:

(Pixel usage at frame k) = (1 / |G(k)|) Σ_{i∈G(k)} N_i^O / N_i^I,   (3.13)
where G(k) denotes the set of cameras observing the object at frame k, N_i^O the number of pixels occupied by the object in the camera i image, and N_i^I the total number of pixels of the camera i image. Figure 3.16 shows the temporal profiles of the pixel usage for each test sequence by the cell-based and baseline methods, and Table 3.1 their average values and standard deviations. From these results, we can make the following observations:
Fig. 3.15 Timing charts of the active camera control for two test sequences. For each sequence, the upper graph shows the stop & go actions: the boxes along the horizontal axis represent the temporal intervals when each camera group fixes all views of its member cameras onto a cell to observe the object in its interior area or to wait for the object to move into the cell, and gaps between the boxes indicate that the camera group was in motion. Numbers in the boxes represent the cell cluster ID observed by each camera group. ©2010 IPSJ [22]
Fig. 3.16 Pixel usage. ©2010 IPSJ [22]
Table 3.1 Pixel usage. Each number shows the average ± standard deviation per sequence. ©2010 IPSJ [22]

             Cell-based [%]   Baseline [%]
Sequence 1   6.7 ± 0.32       1.9 ± 0.10
Sequence 2   7.2 ± 0.45       2.0 ± 0.16
• The average pixel usage of the cell-based method is limited to about 7 %. This is because the camera parameters are controlled to observe entire cells rather than the object itself. Even though the pixel usage is limited, high resolution object images can be captured with higher resolution cameras.
• Compared with the baseline method with wide-view static cameras, the cell-based method attained about three times better average pixel usage. This implies that the cell-based method can capture object images at approximately 1.8 times finer resolution.
These results proved that the cell-based method is effective for capturing finer resolution images of an object in motion with active pan/tilt/zoom controls.
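Equation (3.13) is a per-frame average of per-camera coverage ratios and can be transcribed directly (helper name is ours):

```python
def pixel_usage(object_pixels, image_pixels):
    """Eq. (3.13) at one frame: the mean over the cameras in G(k) of
    N_i^O / N_i^I. object_pixels[i] and image_pixels[i] are the object
    and total pixel counts of the i-th observing camera."""
    ratios = [n_o / n_i for n_o, n_i in zip(object_pixels, image_pixels)]
    return sum(ratios) / len(ratios)
```

For two VGA cameras (307,200 pixels each) covering the object with 3,072 and 6,144 pixels, the usage is (0.01 + 0.02)/2 = 1.5 %.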
3.4.1.3 Shape Reconstruction Accuracy and Completeness

With the given pair of 3D video sequences, the accuracy and completeness of the 3D object shape reconstructed from multi-view object images observed by the cell-based and baseline methods can be evaluated. Figures 3.17 and 3.18 show the 90 % accuracy and 10 mm completeness for each frame in the two sequences by the cell-based and baseline methods. The 90 % accuracy denotes the distance d (in cm) such that 90 % of the reconstructed surface is within d cm of the ground truth, and the 10 mm completeness measures the percentage of the reconstructed surface that is within 10 [mm] of the ground truth [17]. The viewpoint usage is also illustrated at the bottom of each figure to evaluate the performance. Figure 3.19 summarizes their averages and standard deviations. From these results, we can observe the following:
Fig. 3.17 10 mm-completeness and 90 %-accuracy for sequence 1. ©2010 IPSJ [22]
Fig. 3.18 10 mm-completeness and 90 %-accuracy for sequence 2. ©2010 IPSJ [22]
Fig. 3.19 Average performance of the cell-based and baseline methods in terms of 10 mm-completeness and 90 %-accuracy. ©2010 IPSJ [22]
• On average, the cell-based method performed equally well or better than the baseline method. • At frames where the viewpoint usage dropped to 1/3, the completeness and accuracy by the cell-based method were degraded. This is because the observability of the object surface with 1/3 of the cameras is limited. • At frames where the viewpoint usage was more than 2/3, the cell-based method outperformed the baseline method in both completeness and accuracy.
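The two benchmark-style measures [17] can be sketched as follows. Note that, strictly, accuracy uses reconstruction-to-ground-truth distances and completeness the opposite direction, so feeding both from a single distance array is a simplification for illustration:

```python
import numpy as np

def accuracy_completeness(dists, pct=90.0, tol=10.0):
    """pct% accuracy: the distance d such that pct% of the samples lie
    within d of the ground truth. tol completeness: the fraction of
    samples within tol. dists holds per-sample surface distances."""
    dists = np.asarray(dists, dtype=float)
    return float(np.percentile(dists, pct)), float(np.mean(dists <= tol))
```

With per-sample distances in millimetres and tol = 10, the second return value is exactly the 10 mm completeness used in Figs. 3.17-3.19.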
Fig. 3.20 Shape reconstruction results for frame 135 of sequence 1. The first row shows the error distributions of the reconstructed 3D object shapes, where colors indicate distances to the ground truth object surface: blue denotes large error and red small. The bottom row shows the rendered object images and their PSNR with respect to the ground truth image, i.e. the left image. ©2010 IPSJ [22]
To further analyze the effects of the viewpoint usage, we examined frames 135 and 95 of sequence 1 in detail, where the viewpoint usage values were 1/3 and 2/3, respectively. Figure 3.20 shows the results of the 3D shape reconstruction and object image generation for frame 135 of sequence 1, where only 1/3 of the cameras could observe the object in the cell-based method. The color pictures in the top row show the error distributions of the 3D shapes reconstructed from the multi-view object images observed by the cell-based and baseline methods, respectively. While large errors in blue are distributed widely in the baseline method, they are concentrated in the middle area in the cell-based method. These errors are introduced because the observability of such areas was greatly reduced due to the self-occlusion by the object's arm. That is, 1/3 of the cameras could not capture enough multi-view object images to resolve the self-occlusion. The bottom row shows object images rendered
Fig. 3.21 Shape reconstruction results for frame 95 of sequence 1. ©2010 IPSJ [22]
based on the original and the two reconstructed 3D object shapes, where the PSNR with respect to the ground truth image, i.e. the left image, was computed. (The PSNR computation method will be described below.) These results proved that the cell-based method can produce 3D video of comparable quality to the baseline method even with 1/3 of the cameras. To guarantee high observability, the total number of cameras should be increased; eight cameras are not enough to attain sufficient observability of an object in action.
To evaluate the performance of the cell-based method with high viewpoint usage, we analyzed frame 95 in sequence 1, where 2/3 of the cameras observed the object. Figure 3.21 shows the same data as in Fig. 3.20. Comparing these figures, the performance of the cell-based method is much improved and outperforms the baseline method in all measures: accuracy, completeness, and PSNR. The major reason for this is the increased object image resolution, i.e. the higher pixel usage. Figure 3.22 compares the pair of images taken by camera 1 at frame 95 in sequence 1 by the cell-based and baseline methods, which demonstrates the effectiveness of active zooming in the cell-based method. The increased object image resolution contributes significantly to both 3D shape reconstruction and texture generation.
Fig. 3.22 Images captured by camera 1 at frame 95 in sequence 1. ©2010 IPSJ [22]
3.4.1.4 PSNR of Rendered Images

As the final and overall performance evaluation, the PSNR was computed for object images rendered based on the 3D object shapes reconstructed from multi-view video data virtually captured by the cell-based and baseline methods. Object images rendered from the pair of given 3D video sequences were used as the ground truth data. The object image rendering was done with the following setups.

Viewpoint: For each video frame, place the viewpoint of the virtual camera 3 m away from the object front side.
Field of view: 32 degrees.
Image size and resolution: 250 × 450.
Texture generation: The view-independent texture generation in Sect. 5.3.

Figure 3.23 illustrates the dynamic PSNR profiles of the cell-based and baseline methods for the two 3D video sequences. This demonstrates that the cell-based method can produce much better 3D video data thanks to the finer resolution of the captured object images.
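The PSNR against the ground truth rendering can be computed in the usual way (a sketch; the book does not spell out its exact formula, so the standard MSE-based definition is assumed):

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio [dB] between a rendered image and the
    ground truth rendering; higher is better."""
    err = np.asarray(img, dtype=float) - np.asarray(ref, dtype=float)
    mse = float(np.mean(err ** 2))
    return float("inf") if mse == 0.0 else 10.0 * np.log10(peak ** 2 / mse)
```

For 8-bit images, a uniform error of one gray level gives 20·log10(255) ≈ 48.13 dB, a useful sanity check for the profiles in Fig. 3.23.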
3.4.2 Quantitative Performance Evaluation with Real Active Cameras

To evaluate the performance of the cell-based method in the real world, we implemented it in Studio A, whose specifications are shown in Fig. 2.4 and Table 2.3. Figure 3.24 illustrates the camera arrangement. Note that the two cameras placed at the center of the ceiling were not used in this experiment, in order to realize the camera ring arrangement with 23 active cameras: 13 cameras installed on the floor and 10 cameras on the ceiling. Each active camera was implemented by mounting a Sony DFW-VL500 camera with a computer-controllable zoom lens on a PTU-46 pan-tilt unit by Directed Perception, Inc. They were connected to a PC cluster system with 24 nodes: one master and 23 camera nodes to capture and control the 23 active cameras, respectively. The nodes were connected by 1000Base-T Ethernet, communications between nodes were implemented over UDP, and the system clocks were synchronized by NTP.
As shown in Fig. 3.24, the physical object movable space, i.e. the capture space, is limited to about 3 × 3 [m], since the physical studio size is limited to about 8 [m] square and the active cameras cannot capture focused object video when the object comes close
Fig. 3.23 PSNR of rendered images. ©2010 IPSJ [22]
Fig. 3.24 3D video studio for active object tracking and multi-view observation (Studio A in Fig. 2.4 and Table 2.3). The circles and squares indicate camera positions: the former are placed 500 mm above the floor looking upward and the latter at the ceiling 2200 mm above the floor looking downward. The numbers denote camera IDs, and the red/blue/green border colors indicate cell groups (cf. Table 3.2). ©2010 IPSJ [22]
Fig. 3.25 Object used in the real studio experiment: a stuffed animal toy on a radio controlled car. ©2010 IPSJ [22]
to the cameras. Thus we used a small object rather than a human for the experiment: a stuffed animal toy on a radio-controlled car, shown in Fig. 3.25. It moves at most at v = 0.208 [m/sec], and its volume is approximated by a cylinder of 0.800 [m] diameter and 0.500 [m] height. Note that this miniature-size experiment is due not to the cell-based method itself but to the limited studio space. The next section will give a system design for a large movable space, i.e. an ice skating arena.
Fig. 3.26 Cell arrangement and object trajectory in the studio experiment. ©2010 IPSJ [22]
Table 3.2 Camera groups and cells assigned to each group. ©2010 IPSJ [22]

Group   Member camera IDs             No. of assigned cells
1       2, 7, 9, 19, 20, 21, 22, 23   7
2       3, 5, 8, 13, 14, 16, 17, 18   5
3       1, 4, 6, 10, 11, 12, 15       5
By specifying the maximum time for a camera to conduct the cell switching as τ = 1.2 [sec] and the minimum allowable image resolution as pixel coverage s = 5 [mm/pix], we obtained the cell arrangement shown in Fig. 3.26, where the estimated object trajectory is overlaid. Table 3.2 shows the numbers of member cameras and assigned cells for each camera group.
Figure 3.27 shows the temporal profile of the viewpoint usage. As designed, the object was observed by more than seven cameras at all frames. Since the object moved back and forth, it traveled across rule borders several times around frame 400, which decreased the viewpoint usage; a camera group has to change its focusing cell when the object crosses its corresponding rule border. Note that even in such "busy" periods, the object was tracked and its multi-view videos were observed continuously by the camera groups focusing on the cells including the object.
Fig. 3.27 Viewpoint usage in the studio experiment. ©2010 IPSJ [22]
Fig. 3.28 Images captured by camera 1 at frames 200 and 400. The upper figure shows the fields of view and the lower the captured images. ©2010 IPSJ [22]
Figure 3.28 shows the fields of view and the images captured by camera 1 at frames 200 and 400. We can observe that the object size in the images, i.e. the pixel usage, is kept almost constant regardless of the object's distance from the camera. To evaluate the quality of the 3D video data, we applied the volumetric 3D shape reconstruction algorithm in Sect. 4.4.1.2 and the view-independent texture generation method in Sect. 5.3 to the captured multi-view video data. Figure 3.29 shows a pair of free-viewpoint visualizations of the produced 3D video at frame 395. Even though the object was observed by only one camera group with eight cameras, fine textures on the object surface, such as the harness or the letters written on the car, can be rendered. These results proved that the cell-based method works well in the real world even with off-the-shelf camera devices and PCs.
3.5 Designing a System for Large Scale Sport Scenes

The evaluations in the previous section proved that the cell-based method can successfully track an object moving freely in the scene, and can capture its multi-view
Fig. 3.29 Produced 3D video at frame 395 with eight cameras. ©2010 IPSJ [22]
video data in sufficient quality for 3D video production in a laboratory environment. The goal of this section is to demonstrate the practical utility of the cell-based method by designing a system for real world large scale sport scenes with off-the-shelf active cameras. As such a real world scene, we here choose figure skating. According to the standard regulations of the International Skating Union [4], the typical ice rink for short and free skating programs is defined by a rectangle of 60 × 30 [m]. We here discuss how we can design the camera and cell arrangements taking into account the physical characteristics of the target motion, the capture space size, and the active camera dynamics.
3.5.1 Problem Specifications

Figures 3.30 and 3.31 show the virtual ice skate arena we consider in this section. The capture target is a figure skater moving at most at 10 [m/sec] on the 2D ice rink.² The 3D volume of a skater is approximated by a cylinder of 2 [m] diameter and 2 [m] height, which will be enough to encase the skater even if she/he performs complex actions. We assume that the cameras can be installed

Horizontally: between 0 [m] and 30 [m] away from the rink (the area surrounding the rink in Fig. 3.31), and
Vertically: between 0 [m] (on the floor) and 10 [m] (on the ceiling) above the ice plane.

The minimum allowable image resolution is set to a pixel coverage s = 8 [mm/pix] on the object surface.

² While we do not know if this speed is reasonable, we had to limit the maximum speed due to the camera control speed of the off-the-shelf active cameras employed.
Fig. 3.30 Virtual ice skate arena
Fig. 3.31 Physical specifications of the virtual ice skate arena
As a typical off-the-shelf pan/tilt/zoom camera, we adopted the Sony EVI-HD7V, whose specifications are summarized in Table 3.3. For simplicity, we assume that its pan, tilt, and zoom control speeds are constant regardless of their current values. Note that the EVI-HD7V does not accept synchronization signals, and as a consequence, its multi-view video data cannot be used for 3D video production. We use it only to obtain the pan/tilt/zoom control characteristics of off-the-shelf active cameras.

Table 3.3 Specifications of SONY EVI-HD7V pan/tilt/zoom camera

Image sensor               1/3 inch (4.8 mm × 3.6 mm), CMOS
Image resolution           up to 1920 × 1080
Pan angle                  ±100 degrees
Tilt angle                 ±25 degrees
Pan speed                  300 degrees/sec
Tilt speed                 125 degrees/sec
Horizontal angle of view   70 (wide) ∼ 8 (tele) degrees
Focal length               3.4 (wide) ∼ 33.9 (tele) mm
Zoom speed                 30.5 mm/sec
Iris                       F1.8 ∼ close

In addition to the physical pan/tilt/zoom control speeds of the camera, the object detection and tracking processes add a delay including (1) image data acquisition, (2) object detection and 3D position localization from acquired image data, (3) PC-to-camera data transfer, etc. We here model this delay by a constant value T = 0.03 [sec]. As described in the previous chapter, the lighting environment plays an important role in designing the shutter, gain, and iris parameters, and as a result it affects the depth-of-field. In this section we assume that the cameras can close their irises up to F = 2.8 while capturing images with an acceptable S/N ratio. Table 3.4 summarizes the employed specifications.

Table 3.4 Assumed specifications

Object motion                 v ≤ 10 [m/sec]
Object radius                 r ≤ 1 [m]
Effective image resolution    s ≤ 8 [mm/pix]
Control delay                 T ≤ 0.03 [sec]
Horizontal pixel resolution   N = 1920 [pix]
Horizontal imager size        W = 4.8 [mm]
Focal length                  f = 3.4 ∼ 33.9 [mm]
Angular pan speed             ωp = 300/180 × π = 5π/3 [rad/sec]
Zoom speed                    z = 30.5 [mm/sec]
Iris                          F ≤ 2.8
3.5.2 Camera and Cell Arrangements

Based on the given specifications, this section derives arrangements of the cameras and cells which satisfy all requirements for 3D video production as well as the physical limitations. Firstly, the cameras should be distributed uniformly around the ice rink. Secondly, the distance from the object to each camera does not affect the image resolution as long as it can be compensated by zooming. Thirdly, the pan/tilt angles to be changed during the cell switching get larger when an active camera is placed closer to the rink. Hence the cameras should be located as far as possible from the rink. Taking these into account, we designed the camera arrangement illustrated in Fig. 3.32. In this figure, 24 cameras are mounted on the ceiling edges to make the camera-to-skater distance as large as possible. The biased distribution of the cameras reflects the oblate shape of the rink and makes the camera-to-object angles distribute evenly. The figure also illustrates the designed cell arrangement. The cell radius is determined as 5.2 [m], and the rink is covered by 32 cells in total. In what follows, we determine the key design factors, the cell radius R and the camera distance d, based on the constraints described in Sect. 3.3.1.
3.5 Designing a System for Large Scale Sport Scenes
Fig. 3.32 Results of the active camera arrangement and the cell arrangement. Small circles represent the camera positions
3.5.2.1 Image Observation Constraints

To examine the validity of the camera parameter setups, it is sufficient to analyze the nearest and farthest cells from the cameras. In the arena, supposing the cameras are installed on the ceiling at 10 [m] height and 30 [m] away from the rink, the shortest distance between a camera and a cell is d_min = √(30² + 10²) = 31.623 [m]. Similarly the longest distance to the farthest cell is d_max = √((30√5 + 30)² + 10²) = 97.596 [m]. Note that 3D distance values rather than the 2D ones in Sect. 3.3.1 were used to make the design process practical.

By substituting these values for d in the constraints described in Sect. 3.3.1.2, we obtain feasible ranges of focal length and cell radius as illustrated in Fig. 3.33. In this figure, the ranges of possible focal length and cell radius are indicated by the hatched areas where all inequalities hold. From these plots, we can observe that the maximum possible cell radius is given as the intersection of Eqs. (3.4) and (3.6). Then, by solving

R = dW/(2f) − r,   R = sfN/W − d − r   (3.14)

for f and R, we have

R = (√(2dsN + d²) − d − 2r)/2,   f = (√(2dsN + d²) + d)W/(2sN).   (3.15)
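As a numerical check of Eq. (3.15) at the nearest cell, the following sketch reproduces the bound stated in Eq. (3.16). Note that the object radius r is not restated in this section; r = 1.0 [m] is our assumption, chosen to be consistent with the numbers in the text:

```python
import math

# Worked check of Eq. (3.15) at the nearest cell. All lengths in [mm].
# ASSUMPTION: object radius r = 1.0 [m] (not given in this section).
s, N, W = 8.0, 1920, 4.8                         # pixel coverage, resolution, imager size
r = 1000.0                                       # assumed object radius [mm]
d_min = math.sqrt(30.0**2 + 10.0**2) * 1000.0    # ~31623 [mm]

root = math.sqrt(2.0 * d_min * s * N + d_min**2)
R = (root - d_min - 2.0 * r) / 2.0               # maximum cell radius, Eq. (3.15)
f = (root + d_min) * W / (2.0 * s * N)           # corresponding focal length

print(R / 1000.0, f)   # R is about 5.389 [m]; f falls inside the 3.4-33.9 [mm] zoom range
```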
Since we employ a uniform cell arrangement, the maximum possible cell radius is limited by the smaller of the two extreme, i.e. nearest and farthest, cases. Thus,
3 Active Camera System for Object Tracking and Multi-view Observation
Fig. 3.33 Image observation constraints represented by Eqs. (3.4), (3.6), and (3.7) for the two extreme cases: (a) nearest and (b) farthest. The hatched areas denote feasible ranges of focal length and cell radius
by substituting d in Eq. (3.15) with d_min, we obtain

R ≤ 5.389 [m]   (3.16)
from the image observation constraints. Note that the camera-to-object distance d_f in Eq. (3.7) is set to 30.5 [m] and 76.5 [m] for Figs. 3.33(a) and (b), respectively. These values make the center of the depth-of-field coincide with the cell center.
3.5.2.2 Camera Control Constraints

As discussed before, the quickest pan control is required when switching between two nearest cells. In the current camera arrangement, the shortest distance between a camera and its nearest cell is d_min. Figure 3.34(a) shows the constraint of Eq. (3.8) in this case, where the horizontal axis denotes R and the vertical axis the difference between the left-hand side and the right-hand side of Eq. (3.8). That is, the inequality holds where this value is negative, i.e., for R ≥ 0.941 [m].

Equation (3.12) provides the constraint on R in terms of zooming. By substituting the constant values into this inequality, we obtain

R ≥ 5.168 [m].   (3.17)
Figure 3.34(b) shows the plot of the inequality of Eq. (3.11) as a reference. Notice that the constraint on zoom does not depend on the distance between the camera and the cell. In summary, we obtain

R ≥ 5.168 [m]   (3.18)

from the camera control constraints.
Fig. 3.34 Camera control constraints represented by (a) Eq. (3.8) and (b) Eq. (3.11). In both graphs, the vertical axis denotes the difference between the left-hand side and the right-hand side of each inequality, which is hence satisfied in the negative-value area
3.5.2.3 Optimal Cell Size

By integrating the constraints on R from both the image observation constraints and the camera control constraints, we have

for image observation: R ≤ 5.389 [m],
for camera control: R ≥ 5.168 [m].   (3.19)
Consequently, the possible cell size is 5.168 ≤ R ≤ 5.389 [m]. Based on this condition, we designed the cells with R = 5.2 [m] to maximize the pixel usage; the smaller the cell, the larger the object image becomes within it. This design process also indicates that the zoom speed is the dominant factor limiting the cell size.

In summary, the above design process demonstrated that the cell-based method can be used to track an object and capture well-focused multi-view video of large scale sport scenes such as ice skating. In practice, however, the specified image resolution, i.e. pixel coverage s = 8 [mm/pix], is not sufficient to produce high fidelity 3D video compared to that attained in studios with static cameras in Table 2.5. Based on our simulation, s = 7 [mm/pix] is the minimum possible pixel coverage under the given specifications. To realize multi-view video capture of higher image resolution, it is necessary to employ active cameras that outperform the SONY EVI-HD7V in physical image resolution and zooming speed, as well as to reduce the computational delay T.
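The resulting interval and the chosen design value can be summarized as a tiny sanity check:

```python
# Feasible cell radius from Eq. (3.19) and the chosen design value [m].
R_MAX_IMAGE = 5.389    # image observation constraint
R_MIN_CONTROL = 5.168  # camera control constraint
R_DESIGN = 5.2         # cell radius adopted for the arena

assert R_MIN_CONTROL <= R_DESIGN <= R_MAX_IMAGE
print(f"feasible: {R_MIN_CONTROL} <= R <= {R_MAX_IMAGE}, chosen R = {R_DESIGN}")
```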
3.6 Conclusion and Future Works

This chapter presented a novel multi-view video capture system with a group of active cameras which cooperatively track an object moving in a wide area and capture
high resolution well-focused multi-view object video data. The novelty rests in the cell-based object tracking and multi-view observation, where the scene space is partitioned into a set of disjoint cells, and the camera calibration and the object tracking are conducted based on the cells. The cell-based system scheme was developed to satisfy the following fundamental requirements for multi-view video capture for 3D video production.

Requirement 1: Accurate camera calibration,
Requirement 2: Full visual coverage of the object surface,
Requirement 3: High spatial image resolution, and
Requirement 4: Track a moving object in real time while satisfying requirements 2 and 3.
To develop a practical cell-based method, a hexagonal tessellation was introduced for the cell arrangement on the 2D floor on which an object moves. Then, the sets of cells and cameras are each divided into three groups, and one-to-one correspondences are established between the cell and camera groups. We analyzed the algebraic constraints among various parameters to satisfy the above requirements, and designed three 3D video studios: two for laboratory environments and one for a large scale real world scene, an ice skating arena. The experimental results with synthesized and real world scenes demonstrated the effectiveness of the cell-based method in capturing high resolution well-focused multi-view video of an object moving in a wide area.

Throughout this part, we assumed that the cameras should be installed uniformly in the studio, and therefore the ring arrangement is a preferable choice in general. However, real applications such as 3D video for movie productions may require an anisotropic capture space and, consequently, a non-uniform camera distribution. In [22], we in fact developed a cell-based method which arranges a set of cells along the object trajectory specified by a scenario, assuming the camera arrangement is fixed. It is left for future studies to realize the simultaneous optimization of the camera and cell arrangements in various real world applications.

In addition, the multi-view video capture of multiple moving objects is another future study direction. While the system we developed in [20] realized versatile multi-object detection and tracking with a group of active fixed-viewpoint cameras, mutual occlusions between objects cannot be resolved with pan/tilt/zoom cameras. To avoid occlusions and guarantee high observability of multiple object surfaces, the 3D positions of the cameras should be controlled by a dolly mechanism, which is also an interesting future research problem for exploiting 3D video applications.
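The hexagonal cell tessellation and its partition into three groups described above can be sketched as follows. This is a minimal illustration with our own grid extent and coordinate convention; only the cell radius and the three-coloring idea follow the text (on a hexagonal tiling, coloring cell (q, r) with (q − r) mod 3 guarantees that adjacent cells never share a group):

```python
import math

def hex_cells(R, rings):
    """Centers of a hexagonal cell tessellation with cell radius R [m],
    each cell tagged with its group (a valid 3-coloring of the tiling)."""
    cells = {}
    for q in range(-rings, rings + 1):
        for r in range(-rings, rings + 1):
            if abs(q + r) <= rings:                 # keep a hexagonal extent
                x = math.sqrt(3.0) * R * (q + r / 2.0)
                y = 1.5 * R * r
                cells[(q, r)] = ((x, y), (q - r) % 3)
    return cells

cells = hex_cells(R=5.2, rings=3)
# Adjacent cell centers are sqrt(3)*R apart and never share a group, so
# each of the three camera groups can be assigned a disjoint set of cells.
```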
References

1. Davis, J., Chen, X.: Calibrating pan-tilt cameras in wide-area surveillance networks. In: Proc. of International Conference on Computer Vision, pp. 144–149 (2003)
2. Fitzgibbon, A.W., Zisserman, A.: Automatic camera recovery for closed or open image sequences. In: Proc. of European Conference on Computer Vision, pp. 311–326 (1998)
3. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
4. International Skating Union: Special Regulations & Technical Rules. Single and Pair Skating and Ice Dance (2008). Rule 342 Required rinks
5. Jain, A., Kopell, D., Kakligian, K., Wang, Y.-F.: Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 537–544 (2006)
6. Kitahara, I., Saito, H., Akimichi, S., Onno, T., Ohta, Y., Kanade, T.: Large-scale virtualized reality. In: CVPR2001 Technical Sketches (2001)
7. Lavest, J.M., Peuchot, B., Delherm, C., Dhome, M.: Reconstruction by zooming from implicit calibration. In: Proc. of International Conference on Image Processing, vol. 2, pp. 1012–1016 (1994)
8. Lavest, J.-M., Rives, G., Dhome, M.: Three-dimensional reconstruction by zooming. IEEE Trans. Robot. Autom. 9(2), 196–207 (1993)
9. Li, M., Lavest, J.-M.: Some aspects of zoom lens camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 18(11), 1105–1110 (1996)
10. Lu, C.-P., Hager, G.D., Mjolsness, E.: Fast and globally convergent pose estimation from video images. IEEE Trans. Pattern Anal. Mach. Intell. 22(6), 610–622 (2000)
11. Luong, Q.-T., Faugeras, O.D.: Self-calibration of a moving camera from point correspondences and fundamental matrices. Int. J. Comput. Vis. 22, 261–289 (1997)
12. Maybank, S.J., Faugeras, O.D.: A theory of self-calibration of a moving camera. Int. J. Comput. Vis. 8, 123–151 (1992)
13. Mendonca, P., Cipolla, R.: A simple technique for self-calibration. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 637–663 (1999)
14. Pollefeys, M., Koch, R., Gool, L.V.: Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. Int. J. Comput. Vis., 7–25 (1999)
15. Sarkis, M., Senft, C.T., Diepold, K.: Calibrating an automatic zoom camera with moving least squares. IEEE Trans. Autom. Sci. Eng. 6(3), 492–503 (2009)
16. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, 7–42 (2002)
17. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 519–528 (2006)
18. Sinha, S.N., Pollefeys, M.: Pan-tilt-zoom camera calibration and high-resolution mosaic generation. Comput. Vis. Image Underst. 103(3), 170–183 (2006)
19. Szeliski, R., Kang, S.B.: Recovering 3D shape and motion from image streams using nonlinear least squares. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 752–753 (1993)
20. Ukita, N., Matsuyama, T.: Real-time cooperative multi-target tracking by communicating active vision agents. Comput. Vis. Image Underst. 97, 137–179 (2005)
21. Wada, T., Wu, X., Tokai, S., Matsuyama, T.: Homography based parallel volume intersection: Toward real-time reconstruction using active camera. In: Proc. of CAMP, pp. 331–339 (2000)
22. Yamaguchi, T., Yoshimoto, H., Matsuyama, T.: Cell-based 3D video capture method with active cameras. In: Ronfard, R., Taubin, G. (eds.) Image and Geometry Processing for 3-D Cinematography, pp. 171–192. Springer, Berlin (2010)
Chapter 1
Introduction
1.1 Visual Information Media Technologies

Looking back on our modern history, information media technologies have repeatedly innovated our everyday lives and societies. For example, in the 15th century the printing press by J. Gutenberg enabled the mass copying of books and newspapers. The 19th century saw the deployment of the camera and photography to capture vivid real life scenes. The electronic technologies of the 20th century realized an explosive enrichment of information media with radio and TV broadcasting systems, as well as a variety of media capturing, recording, and presentation devices such as video cameras, CD/DVD, and electronic displays. Now in the 21st century, all of our personal and social activities are promoted by multimedia technologies tightly coupled with digital telecommunication technologies such as the Internet, the web, and mobile phones. Thus, it is not too much to say that information media technologies embody social innovations.

In the area of visual information media technologies, one can observe a line of evolution starting from the analog still 2D image (photography), followed by the analog 2D motion picture (movie, TV, and video), which were then enhanced and integrated with telecommunication by digital technologies. At the end of the last century, computer vision technologies enabled the exploration of another line leading to 3D images. In fact, the 3D shape of static objects can now be easily captured with off-the-shelf laser rangefinders [18, 21]. It is then natural to explore technologies for the 3D motion picture next.

Figure 1.1 illustrates the world of 3D visual media technologies, which consists of the fundamental disciplines of visual information processing. First of all, the physical visual world includes 3D object(s), camera(s), and light source(s), upon which the following line of processing methods fabricate the world of visual information media.
(I) Image Capture: 3D Object ⇒ 2D Image: Light rays emitted from the light source(s) are reflected on the object surface, and are then recorded in the form of 2D image(s) with the camera(s). To derive the computational model of this
Fig. 1.1 World of visual information media
physical imaging process, camera and light source calibration have to be conducted. Note that the image capture process degenerates the 3D information in the physical world into 2D image data. (II) Image Processing: 2D Image ⇒ 2D Image: A large library of image processing methods [3, 16] can be applied to the captured image data for frequency
and spatial filtering, edge and region segmentation, color, shape, and texture analysis. (III) Computer Vision: 2D Image ⇒ 3D Object: Computer vision technologies [22] enable the reconstruction of 3D object information from the captured 2D image(s): 3D shape from shading, texture, silhouette, motion, stereo vision, and so on. Note that since a camera can only observe the frontal surface of an object, multiple cameras have to be employed to reconstruct full 3D object information including the backside surface information. (IV) Editing/Coding: 3D Object ⇒ 3D Object: To produce attractive visual information media, various types of editing operation should be applied to the reconstructed 3D object: color, texture, shape, and motion editing. Moreover, a virtual scene can be designed by introducing artificially designed objects and/or modifying lighting environments. To facilitate these editing operations and/or reduce the data size for storage and transmission of 3D information, it is common to introduce efficient and flexible data structures and their coding methods. (V) Computer Graphics: 3D Object ⇒ 2D Image: Associated with 3D object editing and coding, computer graphics technologies [4] render virtual 3D objects to 2D image data, from which viewable physical photometric image(s) are generated with display monitors. (VI) Augmented/Mixed Reality: 2D Image ⇒ 2D Image: With immersive displays and see-through head-mounted displays, augmented/mixed reality technologies [2, 6] integrate the real physical and virtual computer-generated scenes to make us perceive the enriched visual world.
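As a concrete illustration of steps (I) and (V), the simplest computational model shared by image capture and rendering is the pinhole projection of a 3D point to 2D pixel coordinates. The following sketch uses illustrative intrinsic parameters of our own choosing, not values from this book:

```python
# Pinhole projection: a 3D point (x, y, z) in camera coordinates maps to
# pixel coordinates (u, v). f is the focal length in pixels and (cx, cy)
# the principal point; all parameter values here are illustrative.
def project(point, f=800.0, cx=320.0, cy=240.0):
    x, y, z = point
    assert z > 0, "point must lie in front of the camera"
    return (f * x / z + cx, f * y / z + cy)

u, v = project((0.1, -0.05, 2.0))   # a point 2 [m] in front of the camera
print(u, v)                          # approximately 360.0 220.0
```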
1.2 What Is and Is Not 3D Video?

Aiming to enrich the world of visual information media, we have been working for the last ten years to develop 3D Video as a new visual information medium and to cultivate its utilities in everyday life. Here 3D video [14] is a full 3D image medium that records dynamic visual events in the real world as is; it records time-varying full 3D object shape with high fidelity surface properties (i.e. color and texture). Its applications cover a wide variety of personal and social human activities: entertainment (e.g. 3D games and 3D TV), education (e.g. 3D animal picture books), sports (e.g. sport performance analysis), medicine (e.g. 3D surgery monitoring), culture (e.g. 3D archives of traditional dances), and so on.

S. Moezzi et al. [14] and T. Kanade et al. [10] are the pioneers who demonstrated that full 3D human actions could be generated from multi-view video data, and they have been followed by many computer vision researchers exploring 3D video production technologies and applications. Now several venture companies [20] commercially support 3D video capture systems and 3D video content generation.

3D video can be best characterized by its full 3D-ness; the full 3D shape and motion of a real world object, including its backside, is captured as is. While many 3D visual media technologies have been developed, their differences from 3D video can be characterized as follows.
3D TV and Cinema: 3D TV and cinema data are nothing but a pair of 2D stereo video images, from which 3D scenes are perceived in our brains. That is, with special glasses or special display monitors, pop-up motion pictures can be enjoyed. Their basic limitation, however, is that the viewing direction of the scene is fixed; the back, left, or right sides of the scene cannot be seen interactively. Nor can one edit image contents, change object locations or poses, or modify illuminations or shadows. With 3D video data, on the other hand, a pair of 2D stereo videos for 3D TV can be easily generated, and viewing directions and zooming can be changed interactively. That is, 3D video enables us to enjoy interactive full 3D scene observations.

3D Depth (Range) Image: Many off-the-shelf devices to capture 3D depth images are available: some employ laser beam time-of-flight [18, 21] and others stereo-based methods [19] to measure depth values at object surface points. Usually the captured data are represented by a range image where each pixel value denotes the depth from the camera center to a surface point. Note that even with this technology no backside surface data can be obtained. In this sense, the captured data should be called 2.5D images rather than 3D. Although 2.5D motion images can be captured in real time with modern range recorders like Kinect [24], 3D video production technologies are required to capture full 3D motion images. Ikeuchi et al. [8] developed a system to reconstruct the full 3D shape of huge static objects like big Buddha statues and temples by patch-working a group of 2.5D images observed from different positions. Since the process of capturing multi-view 2.5D images is time consuming, the target 3D objects must be static.

3D Motion Capture [12]: The principle of this technology is to attach markers to the object surface and measure their 3D position and motion with a group of cameras surrounding the object.
The measured data are just a group of dynamic 3D position sequences of markers, and hence no 3D object surface shape, color, or texture can be obtained. As its name suggests, this technology only captures 3D point motions, and 3D CG technologies must be employed to generate 3D motion pictures [17].

3D Animation [11]: The critical difference between 3D video and 3D animation lies in the fact that the former records natural real world objects while the latter designs and generates artificial ones. While 3D CG technologies enable us to generate very sophisticated object shapes, motions, and surface textures, they still stay at a natural-looking level. From a technical point of view, although both 3D video and 3D animation often share 3D mesh data for representing 3D object surface shape, they differ as follows:

1. The 3D mesh data of 3D video often change largely from frame to frame, since the 3D mesh data of each frame are reconstructed from a set of multi-view image frames, whereas a temporal sequence of 3D mesh data in 3D animation is generated based on pre-specified 3D motion data, keeping the mesh structure (i.e. the number of vertices and their connectivity) over time.
2. In 3D video the 3D object motion has to be estimated from a temporal sequence of reconstructed 3D mesh data, whereas in 3D animation the 3D object motion data are specified at the design step.
Table 1.1 Functional differences between image-based and model-based methods

Processing method | No. of cameras | Image quality | Imaged objects | Viewpoint for image rendering | Lighting environment editing | Content editing | Object motion analysis
Image-based | Several tens to a hundred (dense arrangement) | High | Entire scene (natural scene) | Constrained around cameras | NO | NO | NO
Model-based | Several to tens (sparse arrangement) | Average | 3D object | Completely free | YES | YES | YES
3. In 3D video the surface texture and color properties of each mesh face should be estimated from the observed multi-view images, while those of 3D animation are given at its design process.

3D CT Image [1]: In medicine 3D CT images are used for detailed medical examination and diagnosis. A 3D CT image is constructed by piling up a set of 2D CT images representing cross-sections of an object. It records the 3D feature distribution in the interior volume of the object, whereas 3D video represents 3D surface shape and texture alone. Even though the 3D CT image may be called a substantially full 3D image, it is still hard to capture objects in motion.

Free-Viewpoint TV [9, 15]: This technology has many features in common with 3D video, in particular in capturing and visualizing image data: both employ a multi-view camera system where a group of cameras is placed so as to surround an object, and both can interactively generate 2D and 2.5D object images viewed from arbitrary viewpoints. Their differences lie in the data representation and processing methods: the former employs image-based methods while the latter employs model-based ones. In the image-based method, a free-viewpoint 2D image of the scene is synthesized based on geometric relations among pixels in a set of multi-view images. To represent the geometric relations, the epipolar image [7], ray-space [5], and light field [13] representations of multi-view images were developed. The model-based method, on the other hand, explicitly reconstructs 3D object shape and motion to render free-viewpoint images. Table 1.1 summarizes the functional differences between them, from which we believe 3D video has much greater flexibility. In 3D video, object position, shape, and motion can be analyzed and edited, and lighting can be modified to render object images. Moreover, as will be described later in Chap. 7, the object motion can be viewed from the object's own viewpoint: for example, dancing actions can be viewed from the dancer's own viewpoint.
1.3 Processing Scheme of 3D Video Production and Applications

The processes aligned on the left in Fig. 1.2 illustrate the basic processes to produce a 3D video frame.
Fig. 1.2 Process of 3D video production and applications
(1) Synchronized Multi-View Image Acquisition: A set of multi-view images of an object is taken simultaneously by a group of distributed video cameras. The top-row images in Fig. 1.2 show samples of captured multi-view images of a dancing MAIKO, a young traditional Japanese dancer, who wears specially designed Japanese traditional clothes named FURISODE with long sleeves. Her
Fig. 1.3 Multi-view video capture system with a PC cluster
hair and cosmetic styles, decorations, clothing design patterns, and dances are genuine Japanese cultural assets. From a technical viewpoint, her very long thin sleeves and the free edges of the OBI (a gorgeous Japanese sash) swing widely during her dance; their 3D shape and motion reconstruction is a technical challenge we attacked in our project.

In general, the cameras are spaced uniformly around the object(s) so that the captured images cover the entire object surface (Fig. 1.3). Note that it is often hard to satisfy this requirement of full observation coverage of the object surface. Some parts of the surface are occluded by others even when capturing a single object. Moreover, heavy occlusions become inevitable when capturing multiple objects in action. Thus, in order to produce 3D video, methods that cope with self- and mutual occlusions have to be developed. Other technical issues to be solved include geometric and photometric camera calibration, and the layout and calibration of light sources, which will be discussed in Chap. 2. One of the technical challenges we attacked deals with the camera layout and control problem: how to capture a 3D video of an object moving in a wide area while guaranteeing high resolution surface texture observation. We will address these issues and their solutions in detail in Chap. 3.

(2) Object Silhouette Extraction: The model-based method requires the segmentation of the object region(s) in each observed image. In a well designed 3D video studio, background subtraction and/or chroma-key methods can be applied
Fig. 1.4 Taking multi-view video at a NOH theater: NOH is a traditional Japanese dance and is sometimes performed at night, lit by a group of small bonfires. In this figure, a MAIKO is illustrated instead of a NOH performer, because 3D videos of MAIKO dances will be used in this book
to generate a set of multi-view object silhouettes (second-row images from the top in Fig. 1.2). Although we ideally want to capture 3D videos in any natural environment, as shown in Fig. 1.4, their geometric, photometric, and dynamical complexities forced us to work in well controlled studio environments instead. Nevertheless, a 3D video capture system for everyday life environments remains one of our future research targets.

(3) 3D Object Shape Reconstruction: Many methods for 3D shape reconstruction from a set of multi-view images have been developed. Though simple, the volume intersection method remains one of the most popular. Each object silhouette is back-projected into the common 3D space to generate a 3D visual cone encasing the 3D object. Then, these 3D cones are intersected with each other to generate the voxel representation of the object shape (third picture from the top in Fig. 1.2). Since this method utilizes only silhouette information, many concave parts of the object cannot be reconstructed. Hence a 3D shape refinement process based on surface texture and motion information has to be performed. For such surface-based processing, the voxel data have to be converted into 3D surface mesh data (third picture from the bottom in Fig. 1.2). The second picture from the bottom in Fig. 1.2 illustrates a refined 3D shape obtained with our 3D mesh deformation method. Silhouette, texture, and motion information are integrated to accurately fit the 3D mesh to the object surface.

Note that the preceding process of object silhouette extraction can be integrated with the 3D shape reconstruction process. That is, an integrated method of segmentation and 3D shape reconstruction can be developed. In Chap. 4, we will survey methods of 3D shape reconstruction from multi-view video data and introduce several algorithms we developed for our 3D video project, with experimental results.

(4) Surface Texture Generation: With a reconstructed 3D object shape, the color and texture of each 3D mesh face are computed from the observed multi-view images (bottom picture in Fig. 1.2). The technical problems to be solved here are:

1. Since a mesh face may be observed differently in multiple images, how should the images be integrated to generate smooth and high resolution 3D video? Note that since 3D videos are usually observed interactively by changing viewing directions and zooming factors, texture generation methods should take such viewing directions and zooming factors into account to render high fidelity surface texture.
2. The most difficult problem in surface texture generation lies in the inevitable errors in camera calibration and 3D object shape reconstruction, which make the reconstructed 3D object shape not perfectly consistent with the set of multi-view images. Thus, a texture generation method that copes with such inconsistency has to be developed.
3. Since one of the appealing characteristics of 3D video is interactive observation with changing viewpoints, texture generation should be done in real time even if the production of 3D video data is done off-line.

We developed several texture generation methods to solve the above mentioned problems, which will be presented in Chap. 5 with rendered 3D video data. By repeating the above processes for each video frame, one can obtain 3D video data. Note that since motion is an important source of information for each of the above mentioned processing steps, methods can be developed for processing a new frame based on the analysis of the preceding one, or for analyzing multiple frames at once in one process. Practical usages of motion information will be introduced in Chap. 4.

(5) Lighting Environment Estimation: As mentioned in Sect. 1.1, light sources are one of the most important components of a visual scene.
Especially in traditional and artistic dances, dynamically controlled and spatially designed lighting effects play a crucial role in producing attractive visual events. Thus 3D video capture systems should be able to estimate the lighting environments under which the 3D object images are captured; the characteristics of the light sources, as well as the object shape, motion, and texture data, are what one wants to obtain by analyzing multi-view video data. As shown in what follows, the estimated lighting information can be used in 3D video visualization. Chapter 6 addresses a method to estimate dynamically changing lighting environments: the 3D locations and shapes of distributed light sources such as candle lights are estimated from observed video data.

Note that the three problems of 3D video production described so far, namely the object shape and motion reconstruction, the surface texture and color estimation, and the lighting environment estimation, have strong mutual dependencies. Hence it
is still left to future studies to solve the three problems at once without any assumptions, i.e. 3D video production in arbitrary environments.

With the methods mentioned above, 3D video data of object(s) can be produced. What should be done next is to cultivate the world of 3D video by developing 3D video data processing methods. These include the following applications that we developed:

(6) Editing and Visualization: With 3D video data, the most straightforward application is visualization. A 3D video editor and visualization system was therefore developed to produce attractive 3D video contents. The functionalities of the current system are limited to rather simple ones, such as the 3D layout and duplication of 3D video data in a 3D scene surrounded by an omni-directional video. The system allows us to interactively observe the 3D scene with a 3D display. One sophisticated visualization method which was developed can visualize human actions from the performer's own viewpoint: extract 3D gaze directions and motions from 3D video data and render the data based on the 3D gaze information. This is indeed a very unique visualization function that cannot be realized with image-based methods. With this function, for example, one can observe how a dancer like a MAIKO controls her eyes to produce emotional effects, and one can also learn where to look while performing complex body actions like juggling. These visualization methods are described in Chap. 7. More sophisticated 3D video editing and visualization systems include functions such as the following.

Editing Lighting Environments: Given the lighting environments estimated by the method in Chap. 6, object surface reflectance properties can be estimated: intrinsic color and degree of specularity.
3D video can then be visualized under arbitrary lighting environments, which will enhance the utility of 3D video data to produce a variety of visual media with different atmospheres: the same dance can be observed under daylight, sunset, candle lighting, and so on. Some simplified lighting environment editing examples are given in Chap. 6. Editing Body Actions: While the 3D video editor for visualization described previously does not modify the captured 3D video data themselves, the action editor conducts geometric and temporal transformations and modifications to generate new 3D video data. Its functionalities range from simple spatiotemporal transformation methods to sophisticated human action analysis. 1. 3D mesh data editing: With 3D video data, one can easily and freely edit 3D mesh data: a thin body can be inflated, and an action can be accelerated or slowed down. One technical problem to be solved for this kind of mesh data editing concerns the texture generation process, which should work consistently with such 3D mesh modifications. 2. Behavior unit editing: A sequence of complex actions captured in 3D video data can be decomposed into a group of simple behavior units: standing up, hand shaking, turning, walking, and so on (Fig. 1.5). By extracting such behavior units from captured 3D video data and weaving the behavior
Fig. 1.5 3D video data are decomposed into a group of behavior units in order to create new 3D video data of new actions. Here, 201 frames of a MAIKO dance sequence were classified into N behavior units (N = 4, 24, 72, 86). Each unit is represented by a unique color
units, new 3D video data of new object actions can be generated. While the captured 3D video data in each behavior unit are preserved, new 3D video data representing a variety of new sequences of object actions can be generated. Chapter 8 addresses methods to extract and edit behavior units from 3D video in order to produce new 3D video data. 3. Action editing: Estimate the kinematic structure and motion of an object from captured 3D video data and edit object actions based on its intrinsic kinematics: for human action editing, extract the bone structure and joint motion from captured 3D video data, and edit the kinematic structure and motion so that the human action can be well synchronized with that of a virtual object. Chapter 9 describes a method to estimate bone and joint locations and motions from 3D video data of a complex human body action such as a Yoga performance. Given such kinematic information, 3D animation technologies like the skin-and-bones model can be applied to edit actions. While this function enables us to use 3D video data as marker-less motion capture data, a problem left for future work is the estimation of bones and joints from 3D video data of a MAIKO wearing very complex clothes. (7) Data Representation and Coding: As shown in Fig. 1.1 and discussed in Sect. 1.1, data structure design and coding are among the main technical issues in cultivating visual information media. In fact, appropriate data structures and coding methods enable the exploration of new functionalities, and make data processing, recording, and transmission much easier. In the 3D video world, 3D mesh data with 2D observed surface texture patterns are produced from a set of 2D multi-view videos, which can then be transformed into a 3D skin-and-bones model as used in 3D animation. Each data model has its own functional advantages and limitations.
As shown in Table 1.1, for example, although image-based rendering from multi-view video produces high-fidelity images, the flexibility of content visualization and editing is limited. Since the naturalness of 3D video comes mainly from both the natural motions of clothes and
Fig. 1.6 3D video data compression is achieved by converting 3D video frames (top) into geometry images (bottom), where each pixel's RGB value represents the XYZ position of a 3D surface mesh vertex. Thus the standard MPEG video coding scheme can be used to stream 3D video data
their surface texture, we cannot use the skin-and-bones model as a universal data structure for 3D video even though it enables body action editing. Another important aspect of data model design is data compression. As a matter of fact, even with modern computing and telecommunication technologies, it is still of great importance to compress the huge data volumes of visual information media for easy storage and transmission. As is well known, a stream of 2D video data can be compressed efficiently with video coding methods like MPEG [23], which has allowed its wide use in everyday life. Hence, 3D video data compression is one of its most important applications. Chapter 10 addresses a method we developed for 3D video compression, which is well suited to MPEG video coding; the stored and transmitted data are nothing but a stream of 2D array data like 2D video (Fig. 1.6). In summary, this book first addresses 3D video production methods consisting of multi-view video capture, 3D shape and motion reconstruction, surface texture mapping, and lighting environment estimation in Chap. 2, Chap. 3, Chap. 4, Chap. 5, and Chap. 6, respectively, and then introduces its applications for visualization, action editing, and coding in Chap. 7, Chap. 8, Chap. 9, and Chap. 10.
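The geometry-image coding illustrated in Fig. 1.6, which packs the XYZ position of each mesh vertex into the RGB value of a pixel, can be sketched in a few lines of code. This is a hedged illustration, not the method of Chap. 10: it assumes the mesh vertices have already been parameterized onto a regular 2D grid, and it uses a simple 8-bit quantization against a known bounding box (the function names and layout are our own).

```python
import numpy as np

def mesh_to_geometry_image(vertices, width, height, bbox_min, bbox_max):
    """Pack XYZ vertex positions into an 8-bit RGB image.

    Assumes the mesh is parameterized so that its width*height vertices
    form a regular 2D grid (row-major order)."""
    lo = np.asarray(bbox_min, dtype=np.float64)
    hi = np.asarray(bbox_max, dtype=np.float64)
    v = np.asarray(vertices, dtype=np.float64).reshape(height, width, 3)
    # Normalize each coordinate into [0, 1] against the bounding box,
    # then quantize to 8 bits so that XYZ maps onto RGB.
    rgb = np.clip((v - lo) / (hi - lo), 0.0, 1.0)
    return np.round(rgb * 255.0).astype(np.uint8)

def geometry_image_to_mesh(img, bbox_min, bbox_max):
    """Inverse mapping: recover the (quantized) vertex positions."""
    lo = np.asarray(bbox_min, dtype=np.float64)
    hi = np.asarray(bbox_max, dtype=np.float64)
    return img.astype(np.float64) / 255.0 * (hi - lo) + lo
```

Each frame then becomes an ordinary RGB image, so a frame sequence can be fed to a 2D video codec; the price is a quantization error of at most half a quantization step per coordinate.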
References
1. Becker, C., Reiser, M., Alkadhi, H.: Multislice CT. Springer, Berlin (2008)
2. Bimber, O., Raskar, R.: Spatial Augmented Reality: Merging Real and Virtual Worlds. A K Peters, Wellesley (2005)
3. Bradski, G.: The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000)
4. Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F.: Computer Graphics: Principles and Practice in C, 2nd edn. Addison-Wesley, Reading (1996)
5. Fujii, T., Kimoto, T., Tanimoto, M.: Ray space coding for 3D visual communication. In: Picture Coding Symposium, pp. 447–451 (1996)
6. Gutierrez, M., Vexo, F., Thalmann, D.: Stepping into Virtual Reality. Springer, Berlin (2008)
7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
8. Ikeuchi, K., Oishi, T., Takamatsu, J., Sagawa, R., Nakazawa, A., Kurazume, R., Nishino, K., Kamakura, M., Okamoto, Y.: The great Buddha project: Digitally archiving, restoring, and analyzing cultural heritage objects. Int. J. Comput. Vis. 75, 189–208 (2007)
9. iview: http://www.bbc.co.uk/rd/projects/iview
10. Kanade, T., Rander, P., Narayanan, P.J.: Virtualized reality: constructing virtual worlds from real scenes. In: IEEE Multimedia, pp. 34–47 (1997)
11. Kerlow, I.V.: The Art of 3D Computer Animation and Effects. John Wiley and Sons, New York (2004)
12. Kitagawa, M., Windsor, B.: MoCap for Artists: Workflow and Techniques for Motion Capture. Focal Press, Waltham (2008)
13. Levoy, M., Hanrahan, P.: Light field rendering. In: Proc. of ACM SIGGRAPH, pp. 31–42 (1996)
14. Moezzi, S., Tai, L.-C., Gerard, P.: Virtual view generation for 3D digital video. In: IEEE Multimedia, pp. 18–26 (1997)
15. Eye Vision project: http://www.ri.cmu.edu/events/sb35/tksuperbowl.html
16. Rosenfeld, A., Kak, A.C.: Digital Picture Processing, 2nd edn. Academic Press, San Diego (1982)
17. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: Proc. of International Conference on Robotics and Automation, Shanghai, China (2011)
18. Konica Minolta 3D Laser Scanner: http://www.konicaminolta.com/
19. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, 7–42 (2002)
20. 4D View Solutions: http://www.4dviews.com/
21. MESA Swissranger: http://www.mesa-imaging.ch/
22. Szeliski, R.: Computer Vision: Algorithms and Applications. Springer, Berlin (2011)
23. Watkinson, J.: MPEG Handbook, 2nd edn. Elsevier, Amsterdam (2004)
24. Microsoft Kinect for Xbox 360: http://www.xbox.com/kinect
Part II
3D Video Production
This part addresses algorithms for 3D video production from multi-view video data captured by the systems presented in Part I. Before discussing the practical algorithms, here we introduce an overall computational scheme of 3D video production. Figure II.1 illustrates the physical scene components included in the world of 3D video:
• objects O in motion,
• background B,
• light sources L, and
• a group of fully calibrated cameras C.
Multi-view video data of objects O surrounded by background scene B are captured by multiple cameras C under lighting environments L. The light sources generate shadings, shadows, and highlights on the object and background surfaces. Moreover, interreflections are observed on closely located surfaces. Based on geometric and photometric analyses of the physical world, we can design a computational model of the world of 3D video production as shown in
Fig. II.1 Physical scene components included in the world of 3D video
Fig. II.2 Computational model for 3D video production
Fig. II.2, where circles denote physical entities and rectangles imply visual phenomena induced by interactions among the entities:
• Object O is characterized by shape So, texture To, and (generic) reflectance properties Ro.
• Background B is characterized by shape Sb, texture Tb, and (generic) reflectance properties Rb.
• Camera Ci produces images Ii.
The arrows in the figure denote dependencies between entities in the model: A ← B implies that A depends on B. In other words, given B, we can compute A: A = f(B). In this model, the object surfaces are lit by light rays coming directly from the light sources L as well as by rays interreflected from the objects O and the background B. We can model the light sources as a collection of point light sources even if their 3D shape and radiant intensity change dynamically. The lighting environments, the objects, and the background produce a complex light field in the scene. Here the light field includes all phenomena of interreflections, highlights, shadings, shadows, and occlusions. Then, the multi-view cameras C capture parts of the light field as multi-view videos I, respectively. The positions, directions, and characteristics of the cameras define which parts of the light field are recorded in the video data. Disparities between object appearances are induced by such multi-view observations. In the world of 3D video illustrated in Fig. II.1, we introduce a virtual camera Ĉ as the fifth component of the world. One straightforward reason for its introduction
is to model free-viewpoint visualization of produced 3D video. As will be discussed later in this part, moreover, the virtual camera plays a crucial role in 3D video production itself. That is, while most 3D video production processes are designed to generate view-independent (i.e. generic) 3D shape and texture, the complexity of the light field does not allow us to completely reconstruct such generic properties of the objects from limited multi-view video data. In the surface texture generation, especially, while the accurate estimation of generic surface reflectance properties in uncontrolled lighting environments is almost impossible, produced 3D video data should represent shiny surfaces and highlights as-is for high-fidelity visualization of object actions. To cope with this problem, the virtual camera is employed in the surface texture generation process to realize view-dependent 3D video production. Note that we model all entities except the background to be dynamic, i.e., to have time-varying characteristics. For example, lighting by a torch dynamically changes its shape (distribution of point light sources) and radiant intensity. The solid and dashed arrows in Fig. II.2 illustrate dynamic and static relationships, respectively. In this computational model, the scene reconstruction from observed images I is represented as an inverse problem of generating I from O, B, and L. That is, given a computational model including A ← B, we have to develop an inverse computation algorithm that estimates B from A. For the 3D video production, in particular, we need to estimate from I the object shape (position and normal) So, texture To, reflectance Ro, and lighting environment L, with the objects surrounded by unknown backgrounds B.
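At the geometric core of the forward dependency I ← (C, O, B, L) is the projection of 3D scene points into calibrated cameras. As a minimal illustration of the "given B, compute A" direction, the following sketch projects a world point through a standard pinhole camera model (the intrinsics K, rotation R, and translation t are assumed to come from camera calibration; the concrete camera pose in the example is hypothetical):

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into a calibrated pinhole camera.
    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector (world -> camera).
    Returns pixel coordinates (u, v) and the depth in the camera frame."""
    Xc = R @ X + t              # world -> camera coordinates
    x = K @ Xc                  # camera -> homogeneous image coordinates
    return x[:2] / x[2], Xc[2]  # perspective division; depth = Z in camera

# A hypothetical camera looking down the world +Z axis from the origin.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
uv, depth = project(K, R, t, np.array([0.0, 0.0, 4.0]))
# A point on the optical axis projects to the principal point (320, 240).
```

Each real camera Ci applies such a mapping to every visible surface point of So and Sb; the inverse problem above is to undo this many-to-one mapping from the images alone.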
Thus, the problem of 3D video production from multi-view video data can be represented as follows: Under the computational model I = f(C, So, To, Ro, L, Sb, Tb, Rb), estimate So, To, Ro, L, Sb, Tb, and Rb from I, C, and, if necessary, Ĉ. Obviously, this problem is significantly under-constrained and cannot be solved as is. Following Marr's Shape from X paradigm [1], computer vision has developed a wide variety of scene reconstruction algorithms which exploit visual cues such as shading, silhouette, disparity, etc., introducing assumptions about some of the unknowns in order to convert the original ill-posed problem into a well-posed one. For example, given an image taken under known (controlled) lighting environments, shape-from-shading algorithms estimate the normal directions of object surfaces with Lambertian reflectance properties. To convert the significantly under-constrained problem of 3D video production into a manageable one, the following assumptions are introduced:
1. Assume that the surface reflections Ro and Rb follow the Lambertian (isotropic reflection) model throughout this book. Note that this assumption does not mean that our algorithms, especially the texture generation methods, cannot cope with shiny object surfaces, even though the 3D video production of transparent objects is out of the scope of this book.
2. Neglect all interreflections between object and background surfaces.
3. As presented in Chap. 2, the multi-view video capture is conducted in a 3D video studio, which enables us to control the background scene: Sb, Tb, and Rb can be
considered as known if necessary. In other words, 3D video production in natural environments is left for future research.
4. Take a three-step computation process for 3D video production: (1) 3D shape So reconstruction (Chap. 4), (2) surface texture To generation (Chap. 5), and (3) lighting environment L estimation (Chap. 6). Each computation step introduces additional assumptions to fulfill its task, which will be described later in the corresponding chapters.
5. The 3D shape reconstruction and texture generation methods presented in this part do not assume that the scene includes only one object. That is, they can process multi-view video data of multiple objects in motion. This is partly because the controlled background assumption facilitates the identification of each object and partly because the estimation processes of So and To should be designed to manage self-occlusion even when the scene includes a single object in motion. Thus we assume that a 3D video data stream can be produced for each object in the scene, which will then be processed in the applications presented in the next part.
6. Since multi-view video data record the 3D surface shape and texture of object(s) in motion, a temporal sequence of 3D textured mesh data is employed as the data structure for representing a 3D video data stream in this book.
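Assumption 6 fixes the data structure of a 3D video stream: a temporal sequence of 3D textured mesh data. A minimal container for such a stream might look as follows (the field choices are illustrative only; the concrete representations used in this book are developed in the later chapters):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TexturedMeshFrame:
    """One frame of 3D video: a surface mesh plus its texture."""
    vertices: List[Tuple[float, float, float]]  # 3D vertex positions
    faces: List[Tuple[int, int, int]]           # triangles as vertex indices
    uv: List[Tuple[float, float]]               # per-vertex texture coordinates
    texture: bytes                              # encoded texture image

@dataclass
class Video3D:
    """A 3D video stream: a temporal sequence of textured mesh frames."""
    fps: float
    frames: List[TexturedMeshFrame] = field(default_factory=list)

    def duration(self) -> float:
        return len(self.frames) / self.fps
```

One such stream is produced per object in the scene (assumption 5), and the applications of Part III operate on these streams.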
References
1. Marr, D.: Vision. Freeman, New York (1982)
Chapter 4
3D Shape Reconstruction from Multi-view Video Data
4.1 Introduction
As illustrated in Fig. II.2, the problem of 3D video production, that is, to compute 3D textured mesh data of object(s) in motion from multi-view video data, is very complicated and significantly under-constrained. Hence we divide the problem into sub-problems and develop a solution method for each sub-problem one by one, introducing appropriate assumptions. As the first step, this chapter addresses 3D shape reconstruction methods from multi-view video data. The assumptions employed here include the following.
Surface reflection: All surfaces of foreground and background objects follow the Lambertian reflection model, i.e. no specular surfaces or transparent objects are included in the scene.
Lighting environments: The entire scene is illuminated by a uniform directional light. Moreover, lights are not attenuated irrespective of the distances between light sources and illuminated surfaces.
Interreflection: No interreflections between foreground and background object surfaces are taken into account, even if the surfaces have concavities and/or multiple surfaces come close to face each other.
Figure 4.1 illustrates the simplified computational model under these assumptions. The black arrows in the figure denote computational algorithms which transform input image data into the 3D object shape, while the gray arrows illustrate dependency relations as defined in Fig. II.2. With the above assumptions, interreflection and highlight in Fig. II.2 are simplified into "Lambertian reflection." Moreover, shading and shadow can be regarded as patterns painted on the surface, just like texture. In this simplified model, all pixels in multi-view images corresponding to the same 3D object surface point share the same value. In other words, object surface colors and textures are preserved even if camera positions and directions are changed, which enables 3D shape reconstruction methods to conduct appearance-based, i.e.
image-based, matching among multi-view images for estimating the 3D surface shape. This process is called shape from stereo. Note that, since it is quite common
Fig. 4.1 Computational model for 3D shape reconstruction. This is a simplified model of Fig. II.2. The gray and black arrows illustrate “generation” and “estimation” processes, respectively
that objects have specular surfaces, such as the shiny silk FURISODEs of MAIKOs, the appearance-based matching process among multi-view images is prone to errors, and hence the accuracy of the 3D shape reconstructed by shape from stereo is limited. As will be discussed later in this chapter, some 3D shape reconstruction methods employ object silhouettes to estimate the 3D object shape, which can work without the above-mentioned assumptions. Since shape from stereo and shape from silhouette are complementary to each other, their combination (the black arrows in Fig. 4.1) enables robust and accurate 3D shape reconstruction. Another important difference between Fig. 4.1 and Fig. II.2 is that the virtual viewpoint Ĉ in the latter is not included in the former. This implies that the 3D shape reconstruction is implemented as a view-independent generic property estimation process. Here, an essential question arises: "Can the generic object property, i.e. 3D shape, be obtained by appearance-based processing?" While the general answer is "no", the practical solution taken in this book is as follows.
• Since the above-mentioned assumptions are valid as a first-order approximation of the world of 3D video production, the 3D object shape can be reconstructed at some reasonable level of accuracy.
• Since one of the major applications of 3D video is the interactive visualization of real-world 3D object actions, the surface texture generation process should manage the introduced errors, as well as the specular surface properties neglected in the 3D shape reconstruction, by employing the information of virtual viewpoints. That is, the texture generation and visualization processes should be implemented as view-dependent processing. Chapter 5 will discuss and compare view-dependent and view-independent texture generation methods.
• As a step toward obtaining the complete generic properties of objects, i.e. 3D shape and surface reflectance, Chap. 6 will present a method of estimating lighting environments that include multiple dynamically changing light sources such as bonfires. The problem of estimating the 3D shape and reflectance of objects in motion from multi-view video data under known lighting environments would be a reasonable research topic for future studies.
Note that developing view-dependent 3D shape reconstruction methods, that is, introducing the virtual viewpoint Ĉ into the computational model of 3D shape reconstruction in Fig. 4.1, would be a possible future research topic for realizing higher-fidelity visualization of 3D video. 3D shape reconstruction has been a major research topic in computer vision, and a variety of different algorithms have been developed. The rest of this chapter first overviews and categorizes several important approaches to full 3D shape reconstruction of objects in motion from multi-view video data in Sect. 4.2. The categorization is done based on which object properties, i.e. surface or silhouette properties, are employed and on how object motion is estimated. Then, Sect. 4.3 discusses three essential computational components in designing 3D shape reconstruction algorithms, followed by several practical algorithms with experimental results in Sect. 4.4. Section 4.5 concludes the chapter with discussions on the presented ideas and algorithms and topics for future research.
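Under the simplified Lambertian model of Sect. 4.1, all pixels observing the same 3D surface point share the same value, which is exactly what appearance-based matching exploits. A minimal photo-consistency score based on this property can be sketched as follows; using the color variance across views is one possible choice among many, and the interface is our own, not an algorithm from this book:

```python
import numpy as np

def photo_consistency(samples):
    """Score a hypothesized 3D surface point by the variance of the pixel
    colors that observe it from different views (lower = more consistent).

    samples: (n_views, 3) array-like of RGB values, one row per camera."""
    s = np.asarray(samples, dtype=np.float64)
    return float(np.mean(np.var(s, axis=0)))  # mean per-channel variance

# Under the Lambertian model, a true surface point looks identical from
# every (unoccluded) camera, so its score is near zero; a point hypothesis
# floating in free space mixes unrelated colors and scores high.
on_surface = [[0.8, 0.2, 0.1]] * 4
off_surface = [[0.8, 0.2, 0.1], [0.1, 0.1, 0.9],
               [0.5, 0.5, 0.5], [0.0, 0.7, 0.2]]
```

Specular surfaces violate this constancy, which is precisely why shiny FURISODEs degrade stereo matching while leaving silhouette-based cues intact.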
4.2 Categorization of 3D Shape Reconstruction Methods for 3D Video Production
4.2.1 Visual Cues for Computing 3D Information from 2D Image(s)
Marr [30] proposed the computational model of human visual perception, in which several algorithms for 3D visual perception were explored. Augmenting the computational approach to visual perception, computer vision has enriched the visual cues for computing 3D information from 2D image(s). While the algorithms are usually referred to as Shape from X as a whole, some of them compute only surface normals without depth information and may be called pure shape without position from X, whereas others measure 2.5D depth values of visible object surface points and may be named depth from X. Still others estimate full 3D object volumes and may be called full 3D shape and position from X. The following summarizes the visual cues for 3D shape reconstruction.
Shading: Shading is a very effective visual cue for humans to perceive 3D object surfaces. Horn [16] first proposed an algorithm for estimating surface normals from
shading. Note that while the visible 3D object surface shape can be reconstructed by integrating surface normals, another cue should be employed to estimate its absolute depth. The originally introduced assumptions of a non-attenuating uniform directional light and a convex Lambertian surface without texture were relaxed by subsequent research to make Shape from Shading work in real-world scenes characterized by, for example, an attenuating proximal light source and a concave specular surface with non-uniform texture inducing complex mutual reflections [59]. Whatever models of lighting and reflectance are employed, they are crucial requisites for shape from shading.
Texture: Shape from Texture [18] estimates surface normals by analyzing the texture pattern deformations introduced by the projective imaging process, assuming that 3D object surfaces are covered with uniform texture. Note that since spatial texture patterns are a generic surface property, shape from texture can be designed to work without accurate models of lighting.
Shadow: Shadows carry useful information about 3D object shape and position, since the geometric relations among light sources, objects, and shadows share much with those among cameras, objects, and silhouettes. Thus, given calibrated light sources, effective constraints on the full 3D object shape and position can be computed by Shape from Shadow in just the same way as in the shape-from-silhouette methods described below. Chapter 6, on the other hand, will present a method of estimating the 3D shapes and positions of light sources from shadows and shading patterns on calibrated 3D reference object surfaces.
Silhouette: A silhouette represents a 2D object shape projected onto an image plane, which carries rich information about the full 3D object shape and position; the principle of computed tomography proves that an object image can be reconstructed from a group of its projected data.
While we cannot apply this shape-from-projection method of computed tomography directly to our problem,¹ a geometric constraint on the full 3D object shape and position can be derived by back-projecting an observed 2D silhouette into the 3D scene to generate a visual cone encasing the 3D object. By integrating the visual cones generated from multi-view images, the full 3D object shape and position can be reconstructed. This is the principle of Shape from Silhouette. While some concave parts of the object cannot be reconstructed, shape from silhouette works stably without accurate models of lighting or surface reflectance.
Stereo: Binocular stereo allows us to perceive the 3D scene. A variety of computational methods of Shape from Stereo have been developed in computer vision. The core computational process in shape from stereo rests in matching corresponding points in the left and right images. Practical implementations of this appearance-based matching define the degree of stability against variations of object surface reflectance properties and lighting environments. To increase the stability of
¹ The projection in computed tomography is characterized as an integral projection, where rays go through the interior area of an object and their degrees of attenuation are observed in the projected data. In computer vision, most objects in the real world are non-transparent and reflect light rays at their surfaces. Thus the projection in computer vision implies ray blocking or shadowing.
matching, active lighting devices which project high-contrast light patterns onto object surfaces have been introduced [22]. Such methods are referred to as Active Stereo. Note here that shape from silhouette and shape from stereo are complementary to each other; the latter employs object surface properties as visual cues for computing corresponding points in the left and right images, while surface boundaries play a crucial role in estimating the 3D object shape and position in the former.
Motion: Starting from computational models of human Shape from Motion perception, computer vision has explored computational methods including pure 3D shape from motion by optical flow analysis between contiguous image frames [35], 3D structure from motion by algebraic analysis of the position data of feature points in a dynamic sequence of images, such as factorization [52], and so on. As in shape from stereo, establishing correspondences between multiple dynamic images is the core computational process in shape from motion, where practical implementations define the degree of stability against variations of object surface reflectance properties and lighting environments.
Focus: As the focus of a camera lens changes, the blurring patterns in observed images vary depending on the 3D surface depth ranges. Employing this principle, many 2.5D depth sensing methods, named Shape from Focus [36] and Shape from Defocus [50], have been developed. As in active stereo, well-designed imaging devices allow accurate depth sensing, even in real time [37]. While shape from (de)focus methods require high-contrast surface textures, they work stably against variations of object surface reflectance properties and lighting environments.
Deviating from the scheme of shape from X, computer vision has also developed a straightforward technology to obtain a 2.5D depth image of an object surface.
It is referred to as range finding based on time-of-flight (ToF), which measures the time-of-flight of a laser beam between its emission and the observation of its beam reflected from an object surface. Many commercial laser range finders are available nowadays. They can be applied only to static objects and scenes because the scanning process of a laser beam takes some time to obtain a 2.5D depth image. In other words, laser range finders are a kind of 0D (point-wise) sensor. To overcome this disadvantage, 2D ToF sensors which realize the simultaneous sensing of time-of-flight over a certain area have been developed. Even though their resolution and accuracy are still limited, utilizing such cameras for 3D video production can be another research direction. Among the visual cues listed above, most 3D video production systems employ mainly shape from silhouette and shape from stereo. The reasons for this are:
• Shape from Silhouette works stably without accurate lighting and reflectance models, since the appearance of 2D silhouette boundaries is not much affected by the light reflection properties of the object surface, and
• Shape from Stereo can reconstruct precise 3D object shape even at concave parts once it can establish correspondences based on the object appearance. The three assumptions described at the beginning of the previous section allow us to conduct the appearance-based matching for shape from stereo.
• Real lighting environments in multi-view video capture systems are very complex and cannot be actively controlled for 3D shape reconstruction. Thus shape from shading and shape from shadow, which require accurate calibration of lighting environments, cannot be used.
• Shape from texture assumes that surface texture patterns are given a priori, whereas these are actually part of the information 3D video production systems have to compute.
• Human actions to be captured are very complicated. Especially in recording intangible cultural assets such as MAIKO dances, the motion patterns and shape deformations of long and loose clothes are too complex to be modeled by the algebraic formulations used in shape-from-motion methods.
• Active stereo, shape from (de)focus, and range finders require specially designed imaging devices having limited resolutions compared to ordinary video cameras. The introduction of such special imaging devices into 3D video studios, in addition to a group of high resolution video cameras, will be realized in next-generation 3D video studio systems [6].
In what follows, we will first give an overview of algorithms for full 3D shape reconstruction of a static object based on shape from stereo in Sect. 4.2.2.1 and shape from silhouette in Sect. 4.2.2.2, followed by integrated algorithms based on both shape from stereo and silhouette in Sect. 4.2.2.3. Then, Sect. 4.2.3 will present algorithms for full 3D shape and motion reconstruction of an object in motion.
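The shape-from-silhouette principle described above, intersecting the visual cones obtained by back-projecting multi-view silhouettes, has a particularly simple volumetric form: a voxel belongs to the visual hull if and only if its center projects inside the silhouette in every view. The following is a hedged sketch of this volume intersection under the stated assumptions (calibrated 3×4 projection matrices, binary silhouette images); the interface is our own illustration, not the algorithm of this chapter:

```python
import numpy as np

def visual_hull(voxel_centers, cameras, silhouettes):
    """Shape from silhouette by volume intersection (minimal sketch).

    A voxel survives iff its center projects onto a foreground pixel in
    every view. cameras: list of 3x4 projection matrices; silhouettes:
    matching list of binary (bool) images."""
    n = len(voxel_centers)
    keep = np.ones(n, dtype=bool)
    Xh = np.hstack([voxel_centers, np.ones((n, 1))])  # homogeneous coords
    for P, sil in zip(cameras, silhouettes):
        x = Xh @ P.T                                  # project all voxels
        u = np.round(x[:, 0] / x[:, 2]).astype(int)
        v = np.round(x[:, 1] / x[:, 2]).astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        fg = np.zeros(n, dtype=bool)
        fg[inside] = sil[v[inside], u[inside]]        # on the silhouette?
        keep &= fg                                    # intersect visual cones
    return keep
```

Each additional view carves away more free space, but concavities whose silhouette is never observed can never be removed, which is why silhouette cues are combined with stereo cues in the methods that follow.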
4.2.2 Full 3D Shape Reconstruction
Intensive explorations of 2.5D shape from X algorithms for over two decades from the 1970s, as well as rapid advancements in imaging technologies, led computer vision researchers to the exploration of full 3D shape reconstruction in the middle of the 1990s; it is natural to proceed first from 2D to 2.5D, and then from 2.5D to 3D, employing multi-view imaging technologies. Two pioneering studies, conducted almost at the same time, proposed algorithms for full 3D shape reconstruction of a human in motion based on shape from multi-view stereo and shape from multi-view silhouette, respectively. In 1997, Kanade et al. proposed a stereo-based approach [21]. This is a two-step algorithm which first reconstructs 2.5D shapes (depth maps) for each viewpoint by a conventional stereo method and then merges them to generate a full 3D shape. In this sense, it is a straightforward extension of 2.5D multi-baseline stereo [43] to full 3D reconstruction. On the other hand, Moezzi et al. proposed a silhouette-based approach [34]. The algorithm was designed based on shape from silhouette [27], which reconstructs an approximate full 3D shape of the object from multi-view silhouettes directly. They called this reconstruction process volume intersection and the resulting 3D textured volume 3D video. Note that both of these methods processed video frames one by one, and hence no 3D motion data were reconstructed or employed in their 3D shape reconstruction methods. Following these two studies, [32] and [49] proposed methods integrating these two reconstruction cues to achieve both robustness and accuracy. This section
4.2 Categorization of 3D Shape Reconstruction Methods for 3D Video Production
addresses characteristics of shape from stereo and shape from silhouette first, and then discusses ideas for their integration.
4.2.2.1 Shape from Stereo

Algorithms of full 3D shape reconstruction by stereo or by texture matching can be classified into two types.
4.2.2.1.1 Disparity-Based Stereo

The first type employs the two-step approach as proposed by Kanade et al. [21]. It first reconstructs multiple 2.5D shapes (depth maps) from different viewpoints, and then merges them to build a full 3D shape. This approach can employ well-studied narrow-baseline stereo methods as long as cameras are placed densely. However, when the number of cameras is reduced to save device and calibration costs, or when distantly located cameras as well as near cameras are used for stereo to obtain comprehensive 3D information, wide-baseline stereo techniques must be used, which are known to be harder than regular stereo at producing dense correspondences between multi-view images; the appearances of an object in wide-baseline stereo images often change too greatly to find correspondences.
In addition, since the first step generates depth maps, i.e. partial reconstructions of the 3D shape, from different viewpoints independently of each other, the second step may encounter inconsistent depth values in overlapping areas among the depth maps. Moreover, per-viewpoint occlusion handling in the first step may generate mutually inconsistent occluding boundaries. Solving these inconsistencies among the depth maps is not a trivial problem.
Note that algorithms for merging 2.5D depth maps can also be used for full 3D shape reconstruction from multi-view range sensor data. In fact, Ikeuchi et al. developed a full 3D shape reconstruction system for huge cultural assets like big Buddhas and ancient temples [19], where they proposed a global optimization method to seamlessly merge a large number of multi-view range sensor data. The volumetric 3D shape reconstruction algorithm by Hernandez et al. [14] (described later) can be regarded as an integration of the depth-fusion and volumetric stereo approaches.
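To make the merging step concrete, here is a toy sketch of depth-map fusion along a single ray (a hypothetical illustration, not the algorithm of [19] or [21]): two opposing orthographic views each contribute a truncated signed distance, and voxels far behind a view's surface get zero weight for that view, which is one simple way to average away per-view inconsistencies.

```python
import numpy as np

def fuse_two_views(z, z_front, z_back, trunc=0.6):
    """Fuse the depths seen by two opposing orthographic views along one ray
    into a truncated signed distance over voxel centers z (positive = outside
    the object). Voxels hidden far behind a view's surface are ignored for
    that view, so inconsistent per-view estimates are simply averaged away."""
    sdf = np.zeros_like(z, dtype=float)
    w = np.zeros_like(z, dtype=float)
    # front view looks along -z and sees the surface at z_front;
    # back view looks along +z and sees the surface at z_back
    for d in (z - z_front, z_back - z):   # per-view signed distance
        valid = d >= -trunc               # drop voxels hidden behind this view's surface
        sdf += np.where(valid, np.clip(d, -trunc, trunc), 0.0)
        w += valid
    fused = np.where(w > 0, sdf / np.maximum(w, 1), 0.0)
    return fused, w
```

For a slab occupying z in [-0.5, 0.5], the fused distance is negative inside and positive outside, with both views contributing where they agree.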
4.2.2.1.2 Volumetric Stereo

The second type of stereo-based approach reconstructs the 3D shape directly from multi-view images. While the first type processes pixels as the basic entities for appearance-based matching, the second type digitizes the 3D space into small unit cubes called voxels and processes them as the basic entities for full 3D shape reconstruction.
4 3D Shape Reconstruction from Multi-view Video Data
Kutulakos and Seitz [26] first proposed this concept as space carving in 1999. In this method, starting from far enough outside the object surface, voxels are examined one by one, sequentially from the outside toward the inside, to determine whether they satisfy photo-consistency or not. If a voxel does, it remains as a part of the object surface and the carving process along that direction stops. Otherwise it is carved away. This is based on the assumption that if a voxel is on the object surface, the colors of the pixels corresponding to that voxel in the multi-view images should be consistent with each other. Note that this assumption holds for objects with Lambertian surface reflectance. By applying this test from the outermost voxels of the scene, the algorithm terminates when all “exposed” voxels are identified. The resultant data are called the photo hull, since it is the largest 3D volume which satisfies the photo-consistency.
Under the assumption of Lambertian surface reflectance, this approach is free from the inter-viewpoint inconsistencies involved in the first type. Instead, it involves a serious problem: which multi-view images have to be used for the photo-consistency evaluation of a voxel? If all multi-view images were employed, no voxel would survive the carving. This is because, while some of the light rays connecting an exposed voxel with its projected pixels in the multi-view images intersect at the same object surface point, others intersect at different surface points occluding the exposed voxel. Hence, while the photo-consistency holds for the former, it is violated for the latter. This problem is called the visibility of cameras against a voxel and is essentially a chicken-and-egg problem: without knowing the 3D shape, we cannot determine which cameras can observe a given point in the scene, because self-occlusions are unknown.
Without knowing the cameras which can observe the 3D point in question, we cannot evaluate its photo-consistency and hence cannot estimate the 3D shape accurately. A practical method to resolve this problem is to introduce a visibility model which approximates the real visibility. We will discuss this point in Sect. 4.3.2 in detail.
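The carving loop itself can be sketched independently of how photo-consistency is computed. Below is a minimal, hypothetical version over a voxel set with a pluggable `photo_consistent` predicate; in a real system that predicate would compare pixel colors across the cameras currently able to see the voxel, which is exactly the visibility problem discussed above.

```python
def is_exposed(v, occupied):
    """A voxel is exposed if at least one of its 6 neighbors is empty."""
    x, y, z = v
    return any(n not in occupied for n in
               [(x+1,y,z), (x-1,y,z), (x,y+1,z), (x,y-1,z), (x,y,z+1), (x,y,z-1)])

def space_carve(occupied, photo_consistent):
    """Sketch of the space-carving loop: repeatedly test the exposed voxels
    and carve away the photo-inconsistent ones until nothing changes.
    `occupied` is a set of integer voxel coordinates; `photo_consistent` is a
    predicate standing in for the multi-view color-consistency test."""
    changed = True
    while changed:
        changed = False
        for v in sorted(occupied):        # snapshot of the current voxels
            if is_exposed(v, occupied) and not photo_consistent(v):
                occupied.discard(v)
                changed = True
    return occupied
```

With a toy predicate that accepts only one slab of a 3x3x3 cube, the outer layers are progressively exposed and carved until the consistent slab remains.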
4.2.2.2 Shape from Silhouette

The concept of 3D shape reconstruction from multi-view silhouettes was first proposed by Baumgart in computer graphics in 1974 [3, 4], and then revisited in computer vision by Martin in 1983 [31] and Laurentini in 1994 [27]. This method is widely called shape from silhouette, as well as visual cone intersection or volume intersection.
4.2.2.2.1 Visual Hull

The idea of shape from silhouette comes from the following question: given a set of object silhouettes taken from different viewpoints, what is the largest 3D shape whose 2D projections regenerate the given silhouettes?
Fig. 4.2 Visual hull in 2D
Fig. 4.3 Visual hull and object concavity
Figure 4.2 illustrates the answer in 2D. Suppose all the cameras are in general positions. Then the shape given by n-view silhouettes forms a 2n-gon called the visual hull. Notice that in this figure the silhouettes are 1D line segments and the visual hull is a 2D polygon.
In contrast to shape from stereo, this process does not involve any matching across multi-view images. In other words, a set of multi-view silhouettes is mathematically equivalent to the visual hull, and therefore once silhouettes are given, no additional “decisions” are required for the visual hull computation. This is the biggest advantage of shape from silhouette, since obtaining accurate silhouettes is relatively more stable than establishing dense (per-pixel) and accurate correspondences among multi-view images. This advantage holds particularly with a controlled background (for example, chroma-keying) and a sparse wide-baseline camera setup.
On the other hand, shape from silhouette has three important disadvantages. The first is its limited ability to reconstruct the object shape. Intuitively, the visual hull has difficulties in modeling surface concavities, as illustrated in Fig. 4.3. This is because the visual hull can only represent the 3D shape of surface parts that appear as 2D silhouette contours. The second is that the visual hull may include false-positive portions called phantom volumes. As illustrated in Fig. 4.4, the two areas B and C can coexist with the real object areas A and D in terms of silhouette consistency. Thus silhouette consistency is a necessary but not a sufficient condition, and false-positives can be introduced. To mitigate this problem, Miller et al. proposed the concept of the safe hull, in which phantoms are carved away if the carving does not change the 2D silhouettes of the object [33]. This assumption works well for regular scenes, but obviously does not always hold.
Fig. 4.4 Phantom volume
The last problem is the lack of robustness against false-negatives in the 2D silhouettes. Since the visual hull computation is a logical AND operation, one false-negative in any single viewpoint can spoil the entire result. This can be very critical if we use a large number of cameras which observe an object against different background scenes; the object silhouette extraction is then required to be robust against varying background images. Basically, increasing the number of cameras makes the visual hull closer to the real object shape, since more shape profiles are likely to be seen as 2D silhouette contours. However, it can also falsely carve the visual hull due to the increased possibility of false-negatives, in particular around 2D silhouette boundaries (see “Matting” in Chap. 2). To solve this problem, [5, 42, 62] proposed methods integrating multi-view silhouette extraction and visual hull construction into a single process.
4.2.2.2.2 Voxel-Based Visual Hull Computation

The simplest but widely used implementation of shape from silhouette is a voxel-based carving approach (Fig. 4.5). Suppose the entire volume is decomposed into voxels; then each voxel is carved if its projection onto an image falls outside of the silhouette but inside the image frame. If it falls outside of the image frame, i.e. the camera cannot observe the voxel, then the voxel is not carved, since the silhouette tells nothing about the scene outside of the image frame. The resultant volume consists of the set of voxels whose projections are all observed as parts of silhouettes. Note here that we assume that the camera layout in a 3D video studio is designed so that some of the observed images cover the entire areas of objects and others include enough space to separate the silhouettes of multiple objects into disjoint regions. These assumptions allow us to reconstruct the 3D shapes of objects as a group of disjoint 3D volumes. Unlike the space carving described before, the test for a voxel can be conducted independently of the others. This means that it can be done in arbitrary voxel order, and it is therefore suitable for parallel processing.
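The voxel test just described can be sketched in a few lines. The following toy implementation (hypothetical camera matrices and silhouette masks) keeps a voxel iff every camera that images it inside its frame sees it inside the silhouette:

```python
import numpy as np

def visual_hull(voxels, cameras, silhouettes):
    """Voxel-based shape from silhouette: a voxel survives iff every camera
    that images it projects it inside that camera's silhouette.
    voxels: (N, 3) voxel centers; cameras: list of 3x4 projection matrices P;
    silhouettes: list of binary (H, W) masks."""
    keep = np.ones(len(voxels), dtype=bool)
    hom = np.hstack([voxels, np.ones((len(voxels), 1))])   # homogeneous coords
    for P, sil in zip(cameras, silhouettes):
        uvw = hom @ P.T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        h, w = sil.shape
        inside_frame = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        in_sil = np.zeros(len(voxels), dtype=bool)
        in_sil[inside_frame] = sil[v[inside_frame], u[inside_frame]]
        # carve only voxels that fall inside the frame but outside the
        # silhouette; voxels outside the frame are kept, since the image
        # tells nothing about the scene beyond its boundary
        keep &= in_sil | ~inside_frame
    return voxels[keep]
```

With two toy orthographic cameras (one along z, one along x) whose silhouettes are single pixels, only the voxel consistent with both silhouettes survives.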
Fig. 4.5 Volumetric shape from silhouette
However, if we increase the voxel resolution, the computational cost as well as the required memory space increases in cubic order. For example, while 1 cm resolution for a 3 m by 3 m by 2 m space requires 18 M voxels, 1 mm resolution results in 18 G voxels. One straightforward solution is to use a compact or sparse representation of the 3D space. For example, Szeliski [51] proposed an octree-based approach in 1993 which achieves both memory and processing time efficiency.
Another inefficiency, innate in the naive voxel-based implementation, is its sheer number of 3D-to-2D perspective projection computations. Matsuyama et al. [32] proposed an efficient parallel algorithm which eliminates time-consuming 3D-to-2D perspective projections by representing the target 3D volume as a set of parallel 2D planes and directly back projecting the multi-view silhouettes into the 3D space: first a 2D image silhouette is back projected onto one of the parallel planes, and then its affine transformations give the back projections onto the other planes. With this 3D space representation, both the visual cone generation from multi-view silhouettes and the intersection of the cones can be done at each parallel plane independently, which can be implemented in parallel to realize real time 3D shape reconstruction of human actions at about 8 mm voxel resolution [60].
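The cubic growth of the naive voxel representation quoted above is easy to verify:

```python
def voxel_count(size_m, res_m):
    """Number of cubic voxels of side res_m (meters) needed to cover a box
    whose dimensions (in meters) are given by size_m."""
    nx, ny, nz = (round(s / res_m) for s in size_m)
    return nx * ny * nz

# The 3 m x 3 m x 2 m studio space from the text:
print(voxel_count((3, 3, 2), 0.01))    # 1 cm voxels -> 18000000 (18 M)
print(voxel_count((3, 3, 2), 0.001))   # 1 mm voxels -> 18000000000 (18 G)
```

A 10x finer resolution thus costs a factor of 1000 in both memory and carving tests, which is what motivates octree and plane-based representations.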
4.2.2.2.3 Surface-Based Visual Hull Computation

Since the essential 3D shape information is carried by the 2D silhouette boundaries, shape from silhouette can be implemented based on boundary data rather than volume data as follows: first generate the 3D surfaces of the visual cones from the 2D silhouette boundaries, and then compute the 3D object surface by intersecting the 3D visual cone surfaces. While the computation of 3D surface intersections may be complicated, the computational complexity can be reduced.
Fig. 4.6 Frontier points. A frontier point X is a point on a 3D object surface where the epipolar plane c1 c2 X is tangent to the object surface. This means that the epipolar lines are tangent to the apparent silhouette contours at x1 and x2 in the observed images, respectively, and the frontier point is exposed on the visual hull surface
Fig. 4.7 Visual cone by a partially captured object silhouette. The top surface of the 3D cone is false-positively generated by the image boundary
Lazebnik et al. [28] proposed an algorithm of this type. They first generate a polyhedral representation of a visual cone from each 2D silhouette boundary, regarding each boundary pixel as a vertex of the polyhedron and each line connecting a vertex with the projection center as an edge of the polyhedron. Then they compute the exact intersections among those polyhedral cones. This approach makes the best use of the pixel-level silhouette resolution, and the computed visual hull represents fine structures in the 2D silhouettes well. However, their cone-to-cone intersection process depends on accurate estimation of frontier points. Here, a frontier point is a point on a 3D object surface where the epipolar plane is tangent to the object surface (Fig. 4.6). In such cases, the epipolar lines are tangent to the apparent 2D silhouette contours (x1 and x2 in Fig. 4.6), and hence the frontier point is exposed on the visual hull surface. From a computational point of view, the estimation of a frontier point is a process of finding a pair of epipolar lines tangent to the 2D silhouette contours. This process is not trivial in practice, especially with noisy silhouettes. Franco and Boyer [9] proposed another algorithm which does not compute frontier points explicitly. It is more robust against noise, and runs reasonably fast.
While these methods avoid the explosion of the computational cost and reconstruct a visual hull that represents the 2D silhouette resolution well, they rely on the assumption that all cameras observe the entire areas of objects. As discussed in Chap. 2, we often want to employ some zoomed-up cameras to increase the image resolution for a specific object area such as a human face. For example, see Fig. 4.27,
where most of the multi-view images do not cover the entire object. With such cameras, the captured images do not cover the entire area of an object, which prevents the application of the surface-based visual hull computation methods. Moreover, this problem of “partial capture” is often encountered when an object is allowed to move in a wide area, as discussed in Chap. 3. Suppose one camera cannot capture the entire object in its image frame. Then we obtain an apparent silhouette boundary on the image boundary (Fig. 4.7). How to cope with such fake boundaries is not a trivial issue for the surface-based visual hull computation methods.
4.2.2.3 Shape from Stereo and Silhouette

As discussed so far, shape from stereo and shape from silhouette complement each other. Shape from stereo can reconstruct the complete 3D object shape if the appearance-based texture matching, or photo-consistency, and the visibility tests work accurately. (See Sect. 4.3.2 for detailed discussions on the visibility examination.) Such conditions, however, are unlikely to be satisfied without a reasonable estimation of the object surface geometry and reflectance. On the other hand, shape from silhouette can reconstruct the 3D object shape as the visual hull without depending on such critical conditions, although the reconstructed 3D shape is only an approximation of the real shape and may include erroneous phantom volumes. Thus, recent studies integrate these two 3D shape reconstruction algorithms.
A simple integration is a two-step approach which first utilizes shape from silhouette to generate an initial guess of the 3D shape and then feeds it to a shape from stereo algorithm. One drawback of this method is that the silhouette constraint is not used in the second step, which loosely follows the shape suggested by the visual hull but does not guarantee that the reconstructed shape coincides with the 2D multi-view silhouettes.
For 2.5D shape reconstruction, Fua and Leclerc [10] proposed an algorithm which combines stereo, shading, and silhouette cues. It models the 2.5D surface by a triangular mesh and deforms it so as to satisfy the constraints derived from these three visual cues. This active surface modeling can be regarded as an extension of 2D geometric Snakes [23] to 2.5D shape. Section 4.4.1 will present an active full 3D mesh-deformation method that integrates surface texture and silhouette constraints into a unified computational process.
Before concluding this section, it should be noted that the 3D shape reconstruction algorithms introduced here can work even when multiple objects are in a scene.
Their successful segmentation, however, depends on the camera layout, the object shapes, and their mutual positions. In the case of 3D video production of human actions, cameras looking straight downward from the ceiling facilitate the object segmentation. Thus, unless specified explicitly, we assume that the 3D shape of each object is well segmented from the others and that object-wise processing can be done.
4.2.3 Dynamic Full 3D Shape Reconstruction for 3D Video Production

Regardless of which visual cues are employed, the algorithms reviewed so far perform full 3D shape reconstruction of a static object. Their straightforward application to 3D video production is to reconstruct the 3D shape from each frame of the observed multi-view video data. This simple frame-wise 3D shape reconstruction, however, obviously misses an important reconstruction cue: motion, i.e. temporal continuity or consistency. Consequently, it is natural to augment the algorithms by introducing dynamical properties. In fact, a simultaneous 3D shape and motion reconstruction method based on mesh deformation [39] and a simultaneous 3D shape sequence reconstruction method [12] were developed. Note here that the former explicitly estimates 3D motion vectors between consecutive 3D mesh data, while the latter does not involve such 3D motion vector estimation.
Figure 4.8 illustrates the categorization of 3D shape reconstruction methods from multi-view video data. First they are categorized based on their reconstruction target data, and then characterized according to their reconstruction cues. Notice that here we use the term motion as equivalent to the inter-frame correspondence of 3D object surface points, and motion estimation to mean establishing correspondences between such points.
4.2.3.1 Frame-Wise Shape Reconstruction

The first group reconstructs a single 3D shape from multi-view video frame images. In this approach, multi-view video data are decomposed into per-frame multi-view images and processed independently. Since the input is then merely a set of multi-view images, this group includes a variety of algorithms designed for purposes other than 3D video, e.g. static object reconstruction [15, 58], as described in Sect. 4.2.2. Notice that appropriate camera settings (Chap. 2), e.g. a dense camera arrangement, can be designed for a static object so that high quality multi-view images can be observed. Since some static 3D shape reconstruction methods assume such well designed camera settings, their direct application to multi-view video data may not work well; the relative viewing positions and directions toward objects change dynamically as objects move. Recall that a group of active cameras was employed to cope with this problem in Chap. 3.
While this frame-wise 3D shape reconstruction is the most popular method (the reasons for this will be discussed later in this chapter), one should notice that the data structures representing the 3D surface shape vary a lot even if the sequence of 3D shapes itself varies smoothly. In fact, the vertex–edge structures of 3D mesh data, which are the most widely used data structure to represent 3D video, change greatly from frame to frame because their computation processes are conducted independently of each other. Thus 3D video application methods should be designed taking into account such temporal variations of the vertex–edge mesh
Fig. 4.8 Categorization of 3D shape reconstruction methods for 3D video production. Top: frame-wise reconstruction approach. Middle: simultaneous shape and motion reconstruction approach. Bottom: multi-frame 3D shape reconstruction approach
structure, as will be described in Part II of this book. Section 4.4.2, on the other hand, will present a mesh-deformation method that keeps the vertex–edge structure of the mesh over a sequence of multi-view video frames.

4.2.3.2 Simultaneous Shape and Motion Reconstruction

The second group estimates 3D shape and motion simultaneously from the multi-view images of a pair of consecutive frames. It includes two approaches: by deformation [39] and by matching [55]. The first approach tries to find a deformation from the 3D object shape at frame t to that at t + 1. Once a deformation is computed, it yields the 3D shape at t + 1 as well as the 3D motion from t to t + 1, represented by dense correspondences or deformation trajectories of points on the 3D surfaces. Section 4.4.2 will present a detailed implementation and performance evaluation of this approach.
The second approach tries to find pairs of corresponding 3D points in a pair of consecutive frames which satisfy the constraint that the 3D points at frames t and t + 1 should be at the same location on the 3D object surface. A pair of 3D points satisfying the constraint should be spatially photo-consistent across the multi-view images at t and t + 1, respectively, as well as temporally photo-consistent across the two consecutive frames. Vedula et al. [55] proposed an algorithm which formalizes this idea as voxel carving in 6D space, the Cartesian product of the two 3D voxel spaces corresponding to a pair of consecutive frames. Note that correspondences between 3D surface points over time can facilitate later processing such as texture generation and 3D video coding.
4.2.3.3 Multi-Frame 3D Shape Reconstruction

The third group reconstructs a sequence of 3D shape data simultaneously from multi-view video data. This approach has a chance to “interpolate” ambiguous or unreliable 3D reconstructions at a certain frame from its neighboring frames automatically. This is a kind of long-term temporal coherency constraint on the object's 3D shape. Goldluecke and Magnor [12] proposed an algorithm which extracts a hypersurface from the 4D (volume + time) space representing a temporal sequence of 3D shape data. With this approach, erroneous portions around self-occluded areas were successfully carved out. Note that this algorithm does not establish temporal correspondences between the 3D shape data at neighboring frames.
4.3 Design Factors of 3D Shape Reconstruction Algorithms

Through the comprehensive review of 3D shape reconstruction algorithms in the previous sections, we identified three essential factors for designing 3D shape reconstruction algorithms for 3D video production: photo-consistency evaluation, visibility and occlusion handling, and shape representations for optimization. In what follows, detailed discussions of each design factor will be given.
4.3.1 Photo-Consistency Evaluation

The concept of photo-consistency was first introduced by Kutulakos and Seitz in 1999 [26] (an earlier version can be found in the voxel coloring study by Seitz and Dyer in 1997 [44]). Suppose we are going to evaluate the photo-consistency at a 3D point p. The photo-consistency of p is given by comparing the multi-view images at the projections of p. This process involves two major computational factors. The first is how to
Fig. 4.9 Photo-consistency evaluation. The photo-consistency at the central white mesh point is evaluated by computing similarity measures among areas in multi-view images denoting 2D projections of the 3D patch window defined at the mesh point
determine which cameras can observe p, which is called the visibility test. We will describe it in detail in Sect. 4.3.2. The second is how to evaluate similarity measures among the image projections of p in the multi-view images, which is discussed here.
Let us assume a set of cameras C that can observe p. Then, if we compare the pixel colors at the projections of p, we can express the evaluation function of their photo-consistency as ρ(p, C). Comparison of single pixel colors, however, is not very discriminative or stable. Instead, a window-based method can be employed. To realize geometrically consistent image window generation, first a planar 3D rectangular patch centered at p is generated and the patch direction is aligned with the surface normal at p. Then, the patch is projected onto the image planes of C to determine the image windows for the similarity computation (Fig. 4.9). This computation process requires the surface normal n and curvature κ at p, which define the shape and texture of the 2D image window for each viewpoint in C. Consequently the photo-consistency evaluation function can be expressed as ρ(p, n, κ, C).
Here we come up with a chicken-and-egg problem: the photo-consistency evaluation for 3D shape reconstruction requires 3D shape properties. A practical solution is to introduce approximations and/or assumptions. For example, the normal n can be approximated by using the visual hull: n at a point p is approximated by the normal at the point on the visual hull closest to p. Another possible approximation is to assume that n is oriented toward the observable camera closest to p. The observable camera determination will be discussed later in Sect. 4.3.2. The curvature κ (or the 3D patch size) can be given a priori as knowledge about the target, or automatically adjusted by analyzing texture edge features. For example, if the projections of p have rich/poor textures, then we can make κ smaller/larger, respectively.
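The geometrically consistent window generation can be sketched as follows (a toy example with a hypothetical 3x4 camera matrix; the tangent-basis construction and sampling density are illustrative choices, not the book's implementation):

```python
import numpy as np

def patch_window(p, n, size, k, P):
    """Sample a k x k grid on the square tangent patch (side length `size`)
    centered at 3D point p with normal n, and project the samples with the
    3x4 camera matrix P. The result is the set of pixel positions of the
    geometrically consistent image window; a real system would then resample
    the image colors at these positions."""
    n = n / np.linalg.norm(n)
    # build an orthonormal basis (t1, t2) of the tangent plane at p
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, a)
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    s = np.linspace(-size / 2, size / 2, k)
    pts = np.array([p + u * t1 + v * t2 for u in s for v in s])  # (k*k, 3) patch samples
    hom = np.hstack([pts, np.ones((k * k, 1))]) @ P.T            # homogeneous projection
    return hom[:, :2] / hom[:, 2:3]                              # perspective division
```

For a fronto-parallel patch the projected window is simply a square grid of pixel positions; for slanted patches the same code yields the foreshortened window that makes the per-view comparisons geometrically consistent.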
4.3.1.1 Photo-Consistency Evaluation by Pair-Wise Texture Matching

One possible implementation of the photo-consistency evaluation is to integrate multiple pair-wise texture-matching results into a single score. If we follow a simple averaging strategy, then the photo-consistency evaluation function is given by

\rho(p, n, \kappa, C) = \frac{1}{|C|} \sum_{c_1, c_2 \in C} \rho_2(p, n, \kappa, c_1, c_2),    (4.1)

where |C| denotes the number of cameras in C, and \rho_2(\cdot) computes a pair-wise texture similarity/dissimilarity measure such as Sum-of-Absolute-Differences (SAD), Sum-of-Squared-Differences (SSD), Normalized Cross-Correlation (NCC), Zero-mean SAD (ZSAD), Zero-mean SSD (ZSSD), Zero-mean NCC (ZNCC), or Census, defined by

\mathrm{SAD}(q_{c_1}, q_{c_2}) = | q_{c_1} - q_{c_2} |_1,    (4.2)

\mathrm{SSD}(q_{c_1}, q_{c_2}) = \| q_{c_1} - q_{c_2} \|,    (4.3)

\mathrm{NCC}(q_{c_1}, q_{c_2}) = \frac{q_{c_1} \cdot q_{c_2}}{\| q_{c_1} \| \cdot \| q_{c_2} \|},    (4.4)

\mathrm{ZSAD}(q_{c_1}, q_{c_2}) = | (q_{c_1} - \bar{q}_{c_1}) - (q_{c_2} - \bar{q}_{c_2}) |_1,    (4.5)

\mathrm{ZSSD}(q_{c_1}, q_{c_2}) = \| (q_{c_1} - \bar{q}_{c_1}) - (q_{c_2} - \bar{q}_{c_2}) \|,    (4.6)

\mathrm{ZNCC}(q_{c_1}, q_{c_2}) = \frac{(q_{c_1} - \bar{q}_{c_1}) \cdot (q_{c_2} - \bar{q}_{c_2})}{\| q_{c_1} - \bar{q}_{c_1} \| \cdot \| q_{c_2} - \bar{q}_{c_2} \|},    (4.7)

\mathrm{Census}(q_{c_1}, q_{c_2}) = \mathrm{Hamming}\big( T(q_{c_1}), T(q_{c_2}) \big),    (4.8)

where q_c = q_c(p, n, \kappa) denotes the pixel values in the projected window for camera c \in C: the 3D patch window defined at p is rasterized into an m-dimensional vector. | \cdot |_1 and \| \cdot \| denote the l-1 and l-2 norms, respectively. \bar{q}_c is a vector of the same size as q_c whose elements are all set to the average value of the elements of q_c, i.e. |q_c|_1 / m. \mathrm{Hamming}(\cdot) denotes the Hamming distance between two binary vectors. T(\cdot) denotes the Census transform

T(q_c) = \bigoplus_{i=1}^{m} \xi\big( q_c(i), \hat{q}_c \big),    (4.9)

\xi(a, b) = \begin{cases} 0 & \text{if } a \le b, \\ 1 & \text{if } a > b, \end{cases}    (4.10)

where \bigoplus denotes bit-wise concatenation, and \hat{q}_c denotes the pixel value at the image window center position, i.e. the projection of p on the screen of c. \xi is a binary magnitude comparator. Notice that NCC and ZNCC measure similarities (their values range from 0 to 1 with NCC and from -1 to 1 with ZNCC; larger is better), while the others measure dissimilarities (smaller is better, 0 is the best).
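Assuming the two window vectors have already been rasterized, several of these measures translate directly into numpy (a minimal sketch; the Census version keeps the transform as a plain 0/1 array instead of packed bits):

```python
import numpy as np

def sad(q1, q2):                      # Eq. (4.2): l-1 distance
    return np.abs(q1 - q2).sum()

def ssd(q1, q2):                      # Eq. (4.3): l-2 distance
    return np.linalg.norm(q1 - q2)

def ncc(q1, q2):                      # Eq. (4.4): invariant to per-camera scale
    return (q1 @ q2) / (np.linalg.norm(q1) * np.linalg.norm(q2))

def zncc(q1, q2):                     # Eq. (4.7): also invariant to per-camera bias
    z1, z2 = q1 - q1.mean(), q2 - q2.mean()
    return (z1 @ z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))

def census_transform(q, center):      # Eqs. (4.9)-(4.10), as a 0/1 vector
    return (q > center).astype(int)

def census(q1, c1, q2, c2):           # Eq. (4.8): Hamming distance of transforms
    return int((census_transform(q1, c1) != census_transform(q2, c2)).sum())
```

The invariances discussed in Sect. 4.3.1.3 are easy to check: NCC is unchanged when one window is scaled, ZNCC when it is scaled and biased, and Census when it is transformed by any monotone mapping.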
4.3.1.2 Photo-Consistency Evaluation by Variance

Another implementation of the photo-consistency evaluation function is to compute the variance of colors among C [17, 26]. In this implementation, the photo-consistency evaluation function is given as follows:

\rho(p, n, \kappa, C) = \frac{1}{|C| - 1} \sum_{c \in C} \| q_c - \tilde{q}_C \|,    (4.11)

where \tilde{q}_C denotes the average of the q_c(\cdot), that is, \tilde{q}_C = \sum_{c \in C} q_c / |C|.
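A direct numpy transcription of this variance-style score (a sketch; Q stacks the rasterized window vectors of the cameras in C):

```python
import numpy as np

def variance_consistency(Q):
    """Variance-style photo-consistency: average deviation of each camera's
    window vector from the across-camera mean (smaller = more consistent).
    Q: (num_cameras, m) array of rasterized window vectors q_c."""
    mean = Q.mean(axis=0)                       # the average window over C
    return np.linalg.norm(Q - mean, axis=1).sum() / (len(Q) - 1)
```

Dropping the row with the brightest colors before calling it is one simple way to add robustness against specular highlights.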
4.3.1.3 Robustness Against Camera Characteristics and Surface Reflectance Properties

The choice of the photo-consistency evaluation method should reflect the assumptions on the input multi-view video data and the object surface properties. The use of the variance-based photo-consistency evaluation or SSD/SAD for pair-wise texture matching implies that the camera characteristics are reasonably identical and the object surface is well modeled as Lambertian. NCC implies that the camera characteristics are identical up to per-camera scale factors and the object surface is Lambertian. ZSAD, ZSSD, and ZNCC add a tolerance for per-camera biases to SAD, SSD, and NCC, respectively. Census is more robust against per-camera scale factors/biases as well as non-Lambertian specularities because it is sensitive only to whether each pixel value in an image window is larger than that of the center pixel. The variance-based photo-consistency evaluation can also be made robust against specularities by discarding the brightest color among the visible cameras C. Notice that there exists a trade-off between discriminability and tolerance to variations of camera characteristics and object surface properties: Census is more robust against non-Lambertian surfaces and poorly color-calibrated cameras, but less discriminative than SSD/SAD.
4.3.2 Visibility and Occlusion Handling

As described in the previous section, the photo-consistency evaluation requires knowledge about the visibility of objects. For example, as illustrated in Fig. 4.10, the photo-consistency evaluation of a point p on the object surface should be done using only the images observed from c2 and c3, which can observe p. The other images, observed from c1 and c4, must be excluded, since p is occluded by the object itself. A false visibility decision, such as regarding c1 as a visible viewpoint of p, misleads the photo-consistency evaluation. Given the 3D object shape, we can determine the visibility even for a point q not on the object surface (here q is visible from c1, c2, and c3), which allows us to eliminate q from the possible object surface points through photo-consistency evaluation.
Fig. 4.10 Physical visibility of the object surface. Given the 3D object shape, the visibility from any viewpoint can be correctly computed even for a point not on the object surface
Such physical visibility, however, can be determined only when an accurate 3D object shape is given. To solve this chicken-and-egg problem, several visibility models have been proposed in the literature.
4.3.2.1 State-Based Visibility The state-based visibility [48, 58] approximates the visibility of a point p by the visibility of the nearest point p on a shape proxy Sp of the real object. In many cases the visual hull of the object serves as the proxy. The visibility of p is determined by the standard Z-buffering technique in computer graphics with the proxy shape and camera arrangement. The occlusion Qc for camera c in the computational world model (Fig. II.2) is approximated as Qc ≈ Qc (p, Sp ) ≈ Q p , Sp visible if dc (p ) ≤ Dc (p , Sp ), = occluded otherwise,
(4.12) (4.13) (4.14)
where dc (p ) is the depth (z value) of p in the camera c coordinate, and Dc (p , Sp ) is the nearest depth of Sp at p in the camera c coordinate. For example, the visibility of p in Fig. 4.11 is approximated by that of p , the nearest point of p on the proxy surface, and the state-based visibility evaluation reports that c2 and c3 can observe p since they can observe p . The state-based visibility with a visual hull works well as long as the visual hull approximates well the real object shape. Suppose the visual hull coincides with the
4.3 Design Factors of 3D Shape Reconstruction Algorithms
Fig. 4.11 State-based visibility. The visibility of a point p is approximated by that of p′, the surface point on the visual hull closest to p
object shape. In this case no visibility test on the visual hull surface returns a false positive: if a camera is marked as visible for a point on the visual hull surface, the camera definitely observes that point. As long as the visual hull does not occlude a camera, there exist no other occluders in the scene by definition, and therefore the camera definitely observes the object. On the other hand, there may exist cameras which are falsely reported as “not visible”. This is because the visual hull may include phantom volumes which falsely occlude it. For example, while the camera c2 of Fig. 4.12 can physically observe both points q and q′, the state-based visibility evaluation using the visual hull reports that c2 cannot observe q because of the occluding phantom volume in the middle. Notice here that we cannot predict the existence of such false negatives, since we do not know which portions are phantom volumes.
Fig. 4.12 False self-occlusions by phantom volumes
Fig. 4.13 False visibility by the state-based visibility evaluation. Point p is physically observed by c1 and c2. However, the state-based visibility evaluation approximates the visibility of p by p′, and returns c2, c3, c4 and c5 as visible
An important fact here is that the above-mentioned observations are valid only for points on the visual hull surface. That is, the state-based visibility may not work for parts where the visual hull cannot model the object shape well, especially around concavities, as shown in Fig. 4.13. In this figure the point p on the real object surface is not exposed as a part of the visual hull surface. The state-based visibility evaluation approximates the visibility of p by that of p′, the closest point to p on the visual hull surface, and returns the camera c4 as the most front-facing one, followed by c3, c5, and c2. However, this evaluation result is not acceptable, since only c1 and c2 can physically observe the surface point p. To cope with these limitations of the state-based visibility evaluation, it makes sense to employ an iterative strategy which refines both the shape and the visibility, starting from the visual hull. We introduce an algorithm following this idea later in Sect. 4.4.1.
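As an illustration, the state-based visibility test of Eqs. (4.12)–(4.14) can be sketched in a few lines of Python. The sketch below is ours, not the book's implementation: it approximates the proxy surface S_p by a point cloud, builds a per-pixel depth buffer by Z-buffering, and tests the nearest proxy point p′ against it; `pixel_of` and `depth_of` are hypothetical stand-ins for a calibrated camera projection.

```python
import math

def nearest_proxy_point(p, proxy_points):
    """p' of Eq. (4.13): the proxy-surface point closest to p."""
    return min(proxy_points, key=lambda q: math.dist(p, q))

def state_based_visibility(p, proxy_points, pixel_of, depth_of, eps=1e-9):
    """Z-buffer test of Eq. (4.14): p is declared visible iff the depth of its
    proxy point p' does not exceed the nearest proxy depth D_c at that pixel."""
    zbuf = {}
    for q in proxy_points:          # rasterize the proxy into a depth buffer
        px, d = pixel_of(q), depth_of(q)
        if px not in zbuf or d < zbuf[px]:
            zbuf[px] = d
    pp = nearest_proxy_point(p, proxy_points)
    return depth_of(pp) <= zbuf[pixel_of(pp)] + eps

# Toy orthographic camera looking along +z: two proxy points share a pixel,
# the front one (z = 1) occludes the back one (z = 3).
proxy = [(0.0, 0.0, 1.0), (0.0, 0.0, 3.0)]
pixel_of = lambda q: (round(q[0]), round(q[1]))
depth_of = lambda q: q[2]
```

A point near the front surface is reported visible, while one near the back surface, whose proxy point fails the depth test, is reported occluded.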
4.3.2.2 Oriented Visibility

The oriented visibility is another representative visibility evaluation method, introduced by Lempitsky et al. in 2006 [29]. Given the position and normal direction of a point in the scene, this method determines its visibility based only on the camera arrangement. The occlusion Q_c of camera c for a point at p is given by using a hypothesized surface normal n as

    Q_c ≈ Q_c(p, n)                                            (4.15)
        = { visible   if n · d_c < θ and (p − o_c) · d_c > 0,
          { occluded  otherwise,                               (4.16)
where d_c and o_c denote the viewing direction from c to p and the position of c, respectively, and θ is a threshold. If θ = −0.5, then cameras observing p within an angle of 0° (front-facing) to 60° are accepted. Note that the normal n is given (1) as a result of reconstruction area digitization (Sect. 4.3.3.3), or (2) by using a shape proxy (e.g. the visual hull). In the first scenario, this model works without intermediate reconstructions or guesses of the object shape, but cannot handle self-occlusions. Therefore it is useful when no
reasonable initial guesses are available for the object shape. Once a reliable intermediate reconstruction is obtained, it is better to switch to the state-based visibility evaluation, as suggested by Lempitsky et al. In the second scenario, this works as a modified state-based visibility which is robust to false self-occlusions. For example, the normal of q in Fig. 4.12 is approximated by that of q′, say n′, and then the oriented visibility Q_c(q, n′) returns “visible” for all of c1, c2 and c3, though the state-based visibility falsely reports that c2 is not visible.

4.3.2.3 Integrated Evaluation of Photo-Consistency and Visibility

Another idea is to utilize the photo-consistency measure itself for the visibility evaluation. Consider the situation illustrated in Fig. 4.9. If the image windows for a set of cameras C are photo-consistent, the surface point in question is likely to be observable from these cameras, as well as to be at the correct position and orientation. Otherwise, the surface position, normal, and/or the visibility is wrong. In such cases, we can simply discard the hypothesized surface point or try to test the photo-consistency using a subset of C. Patch-based surface-growing approaches, described later in Sect. 4.3.3.3, follow the former strategy. The latter strategy was proposed by Vogiatzis et al. [56], who presented an occlusion-robust photo-consistency evaluation algorithm. If a subset of C forms a photo-consistent set of cameras for the surface point in question, then the unused cameras are labeled as invisible for that point. This approach works well for dense camera setups, because the subset of C must contain a sufficient number of cameras for reliable photo-consistency evaluation.
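The oriented visibility predicate of Eqs. (4.15)–(4.16) depends only on the camera arrangement and a hypothesized normal, so it fits in a few lines. The sketch below is our own illustration, using the unit viewing direction d_c from the camera centre o_c to p:

```python
import math

def oriented_visibility(p, n, o_c, theta=-0.5):
    """Eqs. (4.15)-(4.16): visible iff n . d_c < theta and (p - o_c) . d_c > 0,
    where d_c is the unit viewing direction from the camera centre o_c to p."""
    d = [p[i] - o_c[i] for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    facing = sum(n[i] * d[i] for i in range(3)) < theta       # within ~60 deg
    in_front = sum((p[i] - o_c[i]) * d[i] for i in range(3)) > 0
    return facing and in_front
```

With θ = −0.5, a camera looking straight down the normal is accepted, while one behind the surface or at a grazing angle is rejected.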
4.3.3 Shape Representation and Optimization

Three major models can be used to represent the 3D object shape in 3D video production: volume-based, surface-based, and patch-based representations. These representations have their own advantages and disadvantages, as well as well-matched optimization methods. In particular, the first two representations are widely used with discrete optimization methods based on graph-cuts [25] and belief propagation [8]. A great advantage of these methods is that the resultant 3D shape is guaranteed to be (semi-)optimal with respect to the specified objective function. In what follows, we discuss the three models of 3D shape representation together with 3D shape optimization methods based on the individual models.
4.3.3.1 Volume-Based Shape Representation

In the volume-based approach, the 3D shape is represented by a set of voxels. The goal of 3D shape reconstruction is then to find binary occupancies of the voxels that satisfy the photo-consistency.
This binary optimization is well suited to computational schemes based on graph-cuts [25]. In general, finding a min-cut of a graph G = {V, E} is equivalent to finding the global minimum of the following objective function:

    E(l) = Σ_{v∈V} E_d(l_v) + Σ_{(v_i,v_j)∈E} E_c(l_{v_i}, l_{v_j}).    (4.17)
Here v denotes a vertex representing a voxel and l_v its binary label: occupied or not occupied by an object. The first term is a per-voxel objective function which returns the cost of labeling v as l_v, and the second term is a pair-wise objective function which returns the cost of labeling two connected voxels v_i and v_j as l_{v_i} and l_{v_j}, respectively. Vogiatzis et al. [58] first proposed a method for the volumetric reconstruction of the full 3D shape with graph-cuts. In their approach the first term was used to represent a shape prior and the second term for photo-consistency values:

    E = − Σ_{v∈V_o} λ + Σ_{(v_i,v_j)∈E_o} ρ((v_i + v_j)/2),    (4.18)
where V_o and E_o denote the sets of graph nodes (voxels) and edges (connected voxels) of the estimated object. In this formalization the object is estimated as the collection of voxels labeled as source-side. Notice that λ is a positive constant. Use of a larger λ indicates that a larger object shape is preferred. In this sense, λ serves as a ballooning term, similarly to Snakes [23]. If λ ≤ 0, the minimization of E simply returns the trivial solution V_o = ∅, and hence E_o = ∅ and E = 0. An important advantage of this method is its flexibility to search a wide range of 3D shapes with greatly different topological structures. Since it can represent arbitrary 3D shapes (up to the voxel resolution, of course) in terms of the binary occupancy map, a wide range of 3D shapes are evaluated to find the optimal one through the optimization. This means that the above-mentioned method can reconstruct the object's 3D surface with the correct surface topology even if the topological structure of the visual hull is different from the real one, typically because of phantom volumes. On the other hand, the surface-based representation discussed in the next section has difficulty managing such global topological changes. The voxel-based graph-cuts method, however, has two disadvantages: high memory consumption and difficulty in embedding continuity constraints. The first problem is shared by the shape-from-silhouette methods presented in Sect. 4.2.2.2, and its straightforward solution is to follow a coarse-to-fine or hierarchical strategy [14]. The second problem is an essential limitation of graph-cuts. Since the graph-cuts algorithm imposes strong limitations on the objective functions, it is relatively hard to embed continuity-based constraints, such as curvature constraints, which are defined over more than two nodes (voxels).
One possible solution is to convert a graph with high-order cliques (n-ary relations) into a first-order graph with up to binary relations, as proposed by Ishikawa in 2009 [20], but this exponentially increases the graph size and is therefore memory-consuming.
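To make the role of Eq. (4.18) concrete, the following toy (ours, not from the book) enumerates all occupancy labelings of a small 1-D voxel chain whose two end voxels are fixed as background, and returns the labeling minimizing E = −Σλ + Σρ, with ρ charged on the edges crossing the object boundary (the estimated surface, cf. the surface integral of Eq. (4.19)). Graph-cuts finds the same minimum without exhaustive enumeration; the edge costs `rho` here are hypothetical.

```python
from itertools import product

def energy(labels, rho, lam):
    """E of Eq. (4.18) on a 1-D voxel chain: a ballooning reward -lam per
    occupied voxel, plus a photo-consistency cost rho[i] on every edge
    (i, i+1) that crosses the object boundary."""
    e = -lam * sum(labels)
    for i in range(len(labels) - 1):
        if labels[i] != labels[i + 1]:     # surface passes through this edge
            e += rho[i]
    return e

def best_labeling(rho, lam):
    """Exhaustive minimization; the two end voxels are fixed as background
    (outside the visual hull)."""
    n = len(rho) + 1
    interiors = product([0, 1], repeat=n - 2)
    return min(((0,) + it + (0,) for it in interiors),
               key=lambda l: energy(l, rho, lam))
```

With rho = [5, 0.1, 0.1, 5] and λ = 1 only the middle voxel is kept: carving through the cheap (photo-consistent) edges beats keeping the whole chain.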
4.3.3.2 Surface-Based Shape Representation

In contrast to the volume-based representation, the surface-based approach does not digitize the 3D volume; instead, 3D (connected) mesh data are employed to represent the 3D object surface. Suppose we have an initial guess of the object surface, which is usually given by 3D mesh data converted from the visual hull. The goal of 3D shape reconstruction is then to find a deformation of this mesh such that the deformed mesh satisfies photo-consistency over the entire object surface. One naive implementation of such deformation can be designed based on gradient descent optimization. Starting from the initial guess, it searches, for each surface point, for a point with a better photo-consistency value toward the inside of the surface, and it deforms the surface toward these points until convergence. Recall that the deformation should proceed toward the inside if we start from the visual hull. This approach fits well with the state-based visibility strategy, since it can update the visibility at each deformation step, but obviously it can result in a locally optimal solution. Sections 4.4.1.1 and 4.4.2 will present detailed mesh-deformation algorithms for a static object and an object in motion, respectively. This problem can also be rendered as a multi-label optimization. It digitizes possible deformation destinations of each surface point on the initial shape, and then finds a (semi-)optimal combination of digitized destinations [57]. By considering the digitized destinations as the label set for each vertex, an optimal solution can be found by standard techniques such as graph-cuts, belief propagation, or tree-reweighted message passing. By iteratively optimizing the deformation and updating the state-based visibility, we can expect the process to converge to a (semi-)optimal result. Advantages of this representation, in contrast to the volume-based representation, are its limited memory consumption and the ease of embedding continuity constraints.
It can represent the 3D shape at finer resolution with less memory, and it can smoothly introduce continuity constraints into its computational process, since surface normals, curvatures, and locally connected neighbors can easily be computed from the surface representation. The disadvantage is its poor flexibility in searching a wide range of different 3D shapes, which is the advantage of the volume-based representation. As long as we define the surface deformation as transforming each point on the surface to another position, the surface cannot change its global topology or the mesh structure inherited from the initial shape. To solve this problem, Zaharescu et al. proposed an algorithm for topology-adaptive mesh deformation [61].
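A single sweep of the naive gradient-descent deformation described above can be sketched as follows (our illustrative code, not the book's implementation): each vertex samples digitized positions inward, against its outward normal, and keeps the one with the best (lowest) photo-consistency cost ρ. The toy cost below stands in for a real photo-consistency function.

```python
import math

def deform_step(vertices, normals, rho, step=0.01, n_samples=10):
    """One inward search sweep: for each vertex p with outward normal n,
    test candidates p - k*step*n (k = 0..n_samples-1) and keep the one
    minimizing the photo-consistency cost rho (smaller = better)."""
    new_vertices = []
    for p, n in zip(vertices, normals):
        candidates = [tuple(p[i] - k * step * n[i] for i in range(3))
                      for k in range(n_samples)]
        new_vertices.append(min(candidates, key=rho))
    return new_vertices

# Toy cost: distance to a unit sphere standing in for the true surface.
rho = lambda q: abs(math.dist(q, (0.0, 0.0, 0.0)) - 1.0)
```

Starting slightly outside the sphere, as the visual hull would be, one sweep pulls a vertex onto it; iterating such sweeps while updating the state-based visibility is the full scheme.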
4.3.3.3 Patch-Based Shape Representation

The patch-based approach represents a 3D surface by a collection of small patches (or oriented points). This representation can be seen as a twofold integration: (1) integration of the volume-based and surface-based shape representations, and (2) combination of the oriented and state-based visibility evaluations. It can conduct
Fig. 4.14 Tetrahedral voxel space [29]. The entire volume space is first decomposed into unit cubes, and then each cube is decomposed into 24 tetrahedra. One such tetrahedron is shown in red
search in a wide range of surface topological structures like the volume-based representation, and it can employ a surface continuity constraint as in the surface-based representation. The patch-based representation can be implemented with two different computation schemes. One is a “surface-growing” approach and the other is a global optimization approach. The surface-growing approach [11, 13] adopts a best-first search method and starts the surface estimation by finding seed patches with higher confidence in their existence and accuracy. Then it searches for neighboring patches around the seeds. As for the visibility evaluation, this approach starts with the oriented visibility first, and then gradually switches to the state-based visibility as the surface grows. One disadvantage of this approach is the lack of a “closed surface constraint”. While both the volume-based and surface-based representations reconstruct closed surfaces by construction, the surface-growing approach is not guaranteed to generate such a closed 3D surface. Thus, as a post-processing step, we should employ a “meshing” algorithm such as Poisson surface reconstruction [24] or moving-least-squares (MLS) surfaces [1], which generate smoothly connected surfaces from a cloud of oriented points, filling small gaps and resolving intersections and/or overlaps. The other approach, utilizing graph-cuts for global optimization, was introduced by Lempitsky et al. in 2006 [29]. This approach digitizes the target volume into a collection of tetrahedra, and tries to find the best combination of their faces that represents the object's 3D surface. In other words, it generates all possible triangular patches in the target space and searches for their best combination in terms of the photo-consistency. Suppose the entire volume is digitized into the tetrahedral voxel space [29]. Figure 4.14 illustrates how a cubic voxel is decomposed into 24 tetrahedra.
Then, let us form a graph whose nodes denote the tetrahedra. Each node has n-links to its four neighbors, and s- and t-links to the source and the sink. The photo hull in this algorithm is given by a surface which minimizes

    E = λ ∫_V dV + ∫_S ρ dA.    (4.19)
Fig. 4.15 Local graph structure for minimizing Eq. (4.20). The red triangle indicates one tetrahedral voxel of Fig. 4.14. Each tetrahedral voxel corresponds to a node of the graph. It has four directed edges called n-links connecting it with the neighbors sharing one of its surface triangles. Note that the s- and t-links of each node are omitted
The first term represents a ballooning term, a shape prior working over the estimated volume V, and the second term represents photo-consistency measures over the estimated surface S. Its discrete representation is given by

    E = Σ_{v∈V} λ + Σ_{(v_i,v_j)} ρ((v_i + v_j)/2, n_ij),    (4.20)

where (v_i, v_j) is a pair of neighboring tetrahedra such that the surface S passes through their boundary triangular “patch”, located at (v_i + v_j)/2 with normal n_ij = (v_j − v_i)/|v_j − v_i|. This patch is used for the oriented-visibility and photo-consistency evaluations. By assigning the first term to the s-link weights and the second term to the n-link weights (Fig. 4.15), the min-cut of the graph is exactly equivalent to the minimization of Eq. (4.20), and the surface is given by the source-side tetrahedra. Notice that the n-links are directed edges. The link weight from v_i to v_j represents the photo-consistency value of a patch located at (v_i + v_j)/2 and directed from v_i to v_j, and the link weight from v_j to v_i represents the photo-consistency value of the patch located at the same position but with the opposite direction. Hence, in contrast to the surface-growing approach, the min-cut of this graph always returns a closed 3D surface with the best photo-consistency. However, it is not trivial to embed a surface smoothness or continuity constraint, due to the limitation of the graph-cuts formalization discussed for the volume-based representation.
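The geometric bookkeeping of Eq. (4.20), the boundary patch shared by two neighboring tetrahedra with centroids v_i and v_j, reduces to a midpoint and a unit direction n_ij; a minimal sketch (ours):

```python
import math

def patch_of(vi, vj):
    """Patch between neighbouring tetrahedra (Eq. 4.20): located at the
    midpoint (vi + vj)/2 and oriented along n_ij = (vj - vi)/|vj - vi|."""
    mid = tuple((a + b) / 2.0 for a, b in zip(vi, vj))
    d = [b - a for a, b in zip(vi, vj)]
    norm = math.sqrt(sum(x * x for x in d))
    n = tuple(x / norm for x in d)
    return mid, n
```

Swapping v_i and v_j yields the same midpoint with the opposite normal, which is exactly why the two directed n-links between a pair of nodes can carry different photo-consistency weights.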
4.3.3.4 Notes on Optimality

Both global optimization techniques, graph-cuts and belief propagation, are known to produce a (semi-)optimal result for the objective function. However, this does not mean that the output is the best reconstruction of the 3D object shape from the multi-view video data. It only ensures that the output shape returns the best score in terms of the specified objective function. That is, the objective function essentially includes the photo-consistency evaluation score computed based on the visibility, while its
correct value cannot be computed without the exact 3D shape of the object. Hence, if the visibility evaluation relies on an approximated 3D shape, then the optimization of the objective function may not lead us to the best photo hull estimation. Consequently, it should be carefully examined in what sense the output is the best.
4.4 Implementations and Performance Evaluations of 3D Shape Reconstruction Algorithms

Here we present several practical 3D shape reconstruction methods implemented taking into account the three design factors discussed in the previous section. First we introduce a pair of 3D shape reconstruction methods for a single frame: a surface-based mesh-deformation method and a volume-based graph-cuts method. Then we augment the former to realize simultaneous 3D shape and motion reconstruction. The performances of these methods are evaluated with experiments using real-world objects in motion, including a dancing MAIKO.
4.4.1 3D Shape from Multi-view Images

4.4.1.1 Intra-frame Mesh Deformation Algorithm

As discussed so far in this chapter, 3D shape reconstruction algorithms are characterized by their reconstruction cues, shape representation methods, visibility models, photo-consistency evaluation functions, and optimization methods. Here we introduce an intra-frame mesh-deformation approach, which

• uses the visual hull as the initial guess,
• represents the 3D shape by a triangular mesh surface,
• deforms the surface so as to satisfy the ZNCC photo-consistency with state-based visibility as well as a silhouette constraint, and
• renders the deformation as an iterative multi-labeling problem solved by loopy belief propagation.

Suppose we start the deformation from the visual hull surface, represented by a 2-manifold closed triangular mesh structure M = {V, E}, where V and E denote the sets of vertices and edges in M (Fig. 4.16). The goal is to find a set of vertex positions that are optimal in terms of photo-consistency and two additional constraints preserving the initial mesh structure.

4.4.1.1.1 Frame-and-Skin Model

We employ the following three types of constraint to control the mesh deformation.

1. Photometric constraint: A patch in the mesh model should be placed so that its texture, which is computed by projecting the patch onto a captured image, is consistent irrespective of the image it is projected on.
Fig. 4.16 3D mesh model. Left: a triangular mesh model. Right: 2- and 4-manifold meshes
Fig. 4.17 Frame-and-skin modeling of the 3D object shape
2. Silhouette constraint: When the mesh model is projected onto an image plane, its 2D silhouette should coincide with the observed object silhouette on that image plane.
3. Smoothness constraint: The 3D mesh should be locally smooth and should not intersect with itself.

These photometric, silhouette, and smoothness constraints together define a frame-and-skin model [38] which represents the original 3D shape of the object (Fig. 4.17(a)) as follows:

• First, the silhouette constraint defines a set of frames of the object (Fig. 4.17(b)).
• Then the smoothness constraint defines a rubber-sheet skin to cover the frames (Fig. 4.17(c)).
• Finally, the photometric constraint defines supporting points on the skin that have prominent textures (Fig. 4.17(d)).

In what follows, we describe energy functions at each mesh vertex to satisfy the constraints.
4.4.1.1.2 Photometric Constraint

We define an energy function E_p(v) which represents the photometric constraint of a vertex v ∈ V as follows:

    E_p = Σ_{v∈V} E_p(v),    (4.21)
Fig. 4.18 Silhouette constraint. ©2008 IEICE [41]
Fig. 4.19 Contour generator and apparent contour
    E_p(v) = ρ(p_v, ñ_v, κ, C̃_v),    (4.22)

where ρ(·) is a photo-consistency evaluation function which returns smaller values for better matching, p_v denotes the current position of v, and ñ_v and C̃_v denote the normal of v and the set of cameras that can observe v at its initial state. This means that the initial mesh shape serves as the shape proxy for the state-based visibility. In our iterative reconstruction (described later), we first use the visual hull as the initial mesh, and then use the result of each iteration as the initial mesh for the next one. κ is the local curvature at v, given a priori.
4.4.1.1.3 Silhouette Constraint

The second constraint is the silhouette constraint, which restricts the deformation so as to preserve the 2D silhouette outlines observed by the cameras (Fig. 4.18). The constraint enforcement is realized by first finding contour generators on the object surface from the multi-view images, and then embedding these points as anchors in the deformation. Contour generators consist of points on the object surface at which the viewing rays, passing through the camera center and the corresponding 2D silhouette contour points, are tangent to the surface (Fig. 4.19). In other
Fig. 4.20 Contour generator and camera arrangement
words, each point on the 2D silhouette outline corresponds to at least one 3D point on the object surface along the viewing line connecting it with the camera center. Note here that since the visual hull sometimes just clips the viewing line (Fig. 4.20), the 3D position of some contour generator points cannot be determined uniquely. To solve this problem, we employ dynamic programming to estimate the optimal contour generators while considering continuity (smoothness) between estimated points, as follows [53].

Suppose we have N cameras C_i (i ∈ [1, N]) and a binary silhouette image S_i for each C_i. We assume the outlines of S_i are given as a set of 2D curves. We denote the jth outline by s_{i,j}, and the xth pixel of s_{i,j} by s_{i,j}(x) (x ∈ [1, N_{s_{i,j}}]), where N_{s_{i,j}} is the number of pixels on s_{i,j}. Every 2D point s_{i,j}(x) of a silhouette outline has one or more corresponding 3D points {P_{i,j}(x)} lying on the object surface. We can expect each of them to have a high photo-consistency value with the camera image pixels, since it is a 3D point on the real object surface and is exposed as a point on the visual hull surface. Hence the object's visual hull gives the possible 3D positions of the contour generators for each s_{i,j}(x). We formalize the contour generator estimation problem as the minimization of an energy function E_{s_{i,j}} defined as follows:

    E_{s_{i,j}} = Σ_x ρ(p_x) + λ Σ_x E_d(p_x, p_{x+1}),    (4.23)

where p_x denotes the 3D point selected from {P_{i,j}(x)} corresponding to s_{i,j}(x), ρ(p_x) the photo-consistency term at p_x as defined in the previous section, and E_d(p_x, p_{x+1}) the distance between p_x and p_{x+1}, representing a smoothness term. Since we assumed that s_{i,j} forms a curve parameterized by a single variable x, the smoothness term of x depends only on its neighbor x + 1. Hence this minimization problem can be solved efficiently by dynamic programming, whereas a min-cut
Fig. 4.21 Smoothness constraint as a spring model
or belief-propagation framework would be computationally expensive. We denote by V_cg ⊂ V the optimal set of mesh vertices, each of which is closest to the corresponding {p_x} minimizing E_{s_{i,j}}. With V_cg, we define an energy function which represents the silhouette constraint as follows:

    E_s = Σ_{v∈V_cg} E_s(v),    (4.24)

    E_s(v) = ‖p_v − p̃_v‖²,    (4.25)
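The contour generator selection behind V_cg, i.e. the minimization of Eq. (4.23), is a Viterbi-style dynamic program over the silhouette pixels. The sketch below is our own illustration, with a hypothetical per-point cost `rho`; `candidates[x]` plays the role of {P(x)}, the possible 3D positions on the visual hull for the xth contour pixel.

```python
import math

def contour_generator_dp(candidates, rho, lam):
    """Minimize Eq. (4.23) by dynamic programming: pick one 3-D point per
    silhouette pixel, trading the data cost rho against the lam-weighted
    distance between consecutive picks."""
    n = len(candidates)
    cost = [rho(p) for p in candidates[0]]
    back = []
    for x in range(1, n):
        new_cost, ptr = [], []
        for p in candidates[x]:
            best_j = min(range(len(candidates[x - 1])),
                         key=lambda j: cost[j] + lam * math.dist(candidates[x - 1][j], p))
            new_cost.append(rho(p) + cost[best_j] +
                            lam * math.dist(candidates[x - 1][best_j], p))
            ptr.append(best_j)
        cost, back = new_cost, back + [ptr]
    j = min(range(len(cost)), key=lambda j: cost[j])
    path = [j]
    for ptr in reversed(back):    # backtrack the optimal chain
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[x][path[x]] for x in range(n)]
```

With photo-consistent candidates lying on one smooth sheet and outliers far away, the DP selects the smooth chain.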
where p_v and p̃_v denote the current position and the original position of vertex v ∈ V_cg. That is, p̃_v is the position closest to p_x which minimizes E_{s_{i,j}}.

4.4.1.1.4 Smoothness Constraint

The last constraint is the smoothness constraint, defined for each neighboring vertex pair. Its energy function is simply defined by the distance between the two vertices v_i and v_j as follows:

    E_c = Σ_{(v_i,v_j)∈E} E_c(v_i, v_j),    (4.26)

    E_c(v_i, v_j) = (‖p_{v_i} − p_{v_j}‖₂ − ‖p̃_{v_i} − p̃_{v_j}‖₂)²,    (4.27)
where (v_i, v_j) ∈ E denotes an edge in the mesh connecting v_i and v_j, and ‖p̃_{v_i} − p̃_{v_j}‖₂ is the original distance between the two vertices v_i and v_j in the initial state. This is a kind of elastic energy model where each vertex is connected to its neighbors by springs (Fig. 4.21).
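The spring model of Eqs. (4.26)–(4.27) is straightforward to evaluate; a minimal sketch (ours), with vertex positions given as 3D tuples and edges as index pairs:

```python
import math

def smoothness_energy(p, p0, edges):
    """Eqs. (4.26)-(4.27): elastic energy penalizing the squared change of
    each edge length w.r.t. the rest lengths of the initial mesh p0."""
    return sum((math.dist(p[i], p[j]) - math.dist(p0[i], p0[j])) ** 2
               for i, j in edges)
```

An undeformed mesh has zero energy; stretching an edge by one unit costs one.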
4.4.1.1.5 Formalization as a Multi-labeling Problem

Now we are ready to render the deformation as an iterative multi-labeling problem. The goal is to find a combination of vertex positions which minimizes

    E = E_p + E_s + E_c.    (4.28)
By assuming that each vertex of the initial mesh can move in the direction opposite to its surface normal, and that the movements are digitized with a certain unit distance, we can represent the next possible positions of each vertex by K labels indexed by a parameter k as follows:

    p_v(k) = p_v − μ k n_v    (k = 0, . . . , K − 1),    (4.29)

where p_v is the current position of v, n_v is the normal vector of v at p_v, μ is the unit distance or resolution, and k is the unit count, up to K − 1. Let v(k_v) denote the vertex v moved to the digitized position p_v(k_v). Using this representation, we have

    E = Σ_{v∈V} E_p(v(k_v)) + Σ_{v∈V_cg} E_s(v(k_v)) + Σ_{(v_i,v_j)∈E} E_c(v_i(k_{v_i}), v_j(k_{v_j}))    (4.30)

as the objective function of a multi-labeling problem. Since this function has only unary and binary terms, regular loopy belief propagation can find a (semi-)optimal solution efficiently. As discussed in Sect. 4.3.3.4, the initial result of optimizing Eq. (4.30) depends on the visibility defined by the visual hull. Therefore we conduct an iterative optimization of Eq. (4.30), updating the state-based visibility based on the deformation result of the previous iteration, until it converges.
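The label digitization of Eq. (4.29) simply enumerates K inward positions per vertex; a minimal sketch (ours):

```python
def candidate_positions(p, n, mu, K):
    """Eq. (4.29): digitized destinations p(k) = p - mu*k*n, k = 0..K-1,
    moving the vertex against its outward normal n."""
    return [tuple(p[i] - mu * k * n[i] for i in range(3)) for k in range(K)]
```

Label k = 0 keeps the vertex where it is; assigning one label per vertex and minimizing Eq. (4.30) over all assignments is the multi-labeling problem handed to loopy belief propagation.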
4.4.1.2 Volumetric 3D Shape Reconstruction Algorithm

As a practical volume-based 3D shape reconstruction algorithm, we here present a method based on Starck et al. [48], which employs the silhouette constraint defined in the intra-frame mesh-deformation algorithm as well as the following computational methods:

• voxel-based representation,
• the visual hull as the initial guess,
• state-based visibility,
• pair-wise ZNCC photo-consistency, and
• graph-cuts optimization.
Suppose we digitize the target space into voxels classified into three groups: voxels definitely outside the object surface, denoted by V_out; voxels definitely inside the object surface, denoted by V_in; and the other voxels, which may be carved out, denoted by V_main. V_out is given by the voxels located outside the visual hull. V_main is given by the voxels that are located inside the visual hull and whose distances to the visual hull surface are less than a certain threshold. V_in consists of the remaining voxels, i.e. those lying deeper inside than V_main (Fig. 4.22). The goal is to determine whether each voxel of V_main is a part of the object or not. These per-voxel binary decisions can be rendered as a min-cut/max-flow problem on a graph, as illustrated in Fig. 4.23. Voxels correspond to graph nodes, and are connected to their neighbors (here we employed a 6-neighborhood). These inter-voxel
Fig. 4.22 Voxel classification (see text)
Fig. 4.23 Graph structure for Eq. (4.31). The blue and red circles are V_in and V_out nodes, respectively. Blue and red lines denote edges with infinite weight, i.e. edges that cannot be cut
connections are called “n-links”. Each voxel has two additional links, one to the source and the other to the sink (called the “s-link” and “t-link”). The s-link and t-link weights for nodes in V_in and V_out are set to infinity, respectively. In addition, an edge connecting a pair of V_in nodes or a pair of V_out nodes is given infinite weight as well. This configuration makes any cut of the graph divide the V_main nodes into either the source side or the sink side, while keeping the V_in and V_out voxels on the source and the sink side, respectively. The question here is how to assign the other edge weights so that the source side of the min-cut represents the photo hull. In this algorithm, the photo hull surface is given by finding a surface which minimizes the following objective function [47]:

    E = − Σ_{v∈V} λ + Σ_{(v_i,v_j)∈E} ρ((v_i + v_j)/2),    (4.31)
where ρ(·) is the ZNCC photo-consistency computed with the state-based visibility function using the visual hull. The first term λ (a positive weight, as described in Sect. 4.3.3.1) is a per-voxel ballooning term serving as a shape prior. The second term
is a data term representing the photo-consistency measure at the point between v_i and v_j. This objective function can be minimized by finding a min-cut of the graph, mapping the first term to the s-link weights and the second term to the n-link weights. Notice that a smaller photo-consistency value ρ should correspond to a better score in this formalization. Furthermore, ρ ≥ 0 should hold, because negative n-link weights make the min-cut very hard to find. Since ZNCC takes values within [−1, 1], larger being better, we need to convert it to satisfy these requirements on ρ. As such a mapping function, Vogiatzis et al. [58] proposed the following, which converts the ZNCC value to [0, 1], smaller being better:

    ρ = 1 − exp(−tan²(π(ZNCC − 1)/4) / σ²),    (4.32)

where σ is a parameter controlling the hot range of the original ZNCC values. For example, σ = 0.5 maps all ZNCC values within [−1, 0] to approximately 1, and σ = 2 maps ZNCCs within [−1, −0.75] to approximately 1. This means that a smaller σ discards a wider range of lower ZNCC values as ρ = 1 (the worst photo-consistency), regardless of their original values.

In addition, we embed the silhouette constraints into the graph-cuts optimization process. Suppose the contour generators are given as a set of points V_cg, as in Sect. 4.4.1.1.3. These points can serve as additional “definitely object-side points”, as proposed by Tran and Davis [53]. This is simply done by assigning an infinite s-link weight to the nodes close to a point in V_cg. Just like the voxel nodes in V_in, the nodes having infinite s-link weight will be included in the source side, i.e. the object side, since the min-cut cannot cut such links.
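Eq. (4.32) in code (a direct transcription; the clamp at ZNCC = −1, where the tangent diverges, is our addition):

```python
import math

def zncc_to_rho(zncc, sigma):
    """Eq. (4.32): map a ZNCC score in [-1, 1] (larger = better) to a cost
    rho in [0, 1] (smaller = better)."""
    if zncc <= -1.0:
        return 1.0                    # limit of Eq. (4.32) as tan diverges
    t = math.tan(math.pi * (zncc - 1.0) / 4.0)
    return 1.0 - math.exp(-t * t / sigma ** 2)
```

A perfect match (ZNCC = 1) maps to ρ = 0; with σ = 0.5 everything below ZNCC = 0 is already pushed to ρ ≈ 1, matching the narrowing of the hot range described above.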
4.4.1.3 Performance Evaluations

4.4.1.3.1 Quantitative Performance Evaluations with Synthesized Data

In this evaluation, we conducted the following experiments:

Studio: Fig. 4.24 shows the studio with 15 XGA cameras, whose specifications are taken from the calibration data of Studio B in Table 2.3 and Fig. 2.4.

3D object: Generate a super-quadric [2] object of 1.2 [m] diameter by the following equations and place it at the center of the capture space in the studio:

    x(u, v) = a_x cosⁿ u cosᵉ v + b_x,
    y(u, v) = a_y cosⁿ u sinᵉ v + b_y,
    z(u, v) = a_z sinⁿ u + b_z,    (4.33)
    −π/2 ≤ u ≤ π/2,    −π ≤ v ≤ π,

where u and v are spherical coordinates, and n and e denote parameters controlling roundness/squareness. a_x, a_y, a_z and b_x, b_y, b_z denote scale and translation
4 3D Shape Reconstruction from Multi-view Video Data
Fig. 4.24 Multi-view video capture studio with 15 synchronized XGA cameras @ 25fps in the same settings as Studio B in Table 2.3 and Fig. 2.4
factors for x, y, z, respectively. The object is generated with n = 1.0, e = 3.0, ax = 60.0, ay = 60.0, az = 50.0, bx = 0.0, by = 0.0, bz = 50.0 as shown in Fig. 4.25(a). It was designed to have heavy concavities and wide thin protrusions, which model the complex 3D shapes of a dancing MAIKO; moreover, their accurate 3D reconstruction is not easy in general.

Multi-view images: The object surface is covered with random texture patterns, and a set of 15 multi-view images taken by the XGA cameras was generated as input data for the 3D shape reconstruction by the methods presented in Sects. 4.4.1.1 and 4.4.1.2.

Figure 4.25 shows the results of the 3D shape reconstruction, where colors denote surface distance values between the original synthesized object and the reconstructed object surfaces: red regions indicate well reconstructed surface areas (areas close to the ground truth), whereas green regions indicate poor reconstructions. Figure 4.25(b) shows the visual hull, indicating that the synthesized object has large concavities on its sides and that such concavities are not well reconstructed by shape from silhouette alone. Figure 4.25(c) shows the initial result of the mesh deformation using the visual hull as the initial shape. Figure 4.25(d) illustrates the result of the second iteration, deformed from Fig. 4.25(c). Figure 4.25(e) shows the result of the volumetric reconstruction, while Fig. 4.25(f) illustrates another result of the volumetric reconstruction with a smaller ballooning term.

Table 4.1 compares reconstruction accuracy and completeness quantitatively. Accuracy denotes the distance d (in cm) such that 90 % of the reconstructed surface is within d cm of the ground truth, and completeness measures the percentage of the reconstructed surface that is within 1 cm of the ground truth [45] (intuitively speaking, the percentage of "red" areas of Fig. 4.25). Max distance is the maximum distance from the reconstructed surface to the ground truth.
From these results we can observe that:
• The mesh-deformation algorithm performs better than the volumetric approach.
• The iterative mesh optimization can contribute to refining the 3D surface geometry.
• The volumetric reconstruction is sensitive to the ballooning factor. Reconstructions with smaller values tend to carve the volume too much, particularly in regions where the shape is thin, as shown in Fig. 4.25(f).
4.4 Implementations and Performance Evaluations of 3D Shape Reconstruction
Fig. 4.25 Reconstructions of a synthesized object. (a) Ground truth, (b) visual hull, (c) mesh deformation (first iteration), (d) mesh deformation (second iteration), (e) volumetric reconstruction, (f) reconstruction failure by the volumetric reconstruction due to an inappropriate ballooning term. Note that the colors in (b) to (e) indicate surface distance values to the ground truth surface (red is closer), and that the synthesized object (a) originally had a random texture over the surface, which was replaced by Gouraud shading for printing
Table 4.1 Reconstruction accuracy and completeness

                    Visual hull    Intra (first)   Intra (second)   Volumetric
                    Fig. 4.25(b)   Fig. 4.25(c)    Fig. 4.25(d)     Fig. 4.25(e)
90 % accuracy       5.36 cm        1.27 cm         1.19 cm          3.44 cm
1 cm completeness   35.77 %        95.22 %         98.54 %          37.35 %
Max distance        14.31 cm       4.31 cm         2.37 cm          7.83 cm
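The accuracy and completeness measures used in Table 4.1 can be sketched as follows, assuming the per-sample surface distances (in cm) between the reconstructed and ground-truth surfaces have already been measured; the helper names are ours:

```python
import math

def accuracy(distances, percentile=90.0):
    """Distance d such that `percentile` percent of the reconstructed
    surface samples lie within d of the ground truth."""
    ds = sorted(distances)
    idx = max(0, math.ceil(percentile * len(ds) / 100.0) - 1)
    return ds[idx]

def completeness(distances, threshold=1.0):
    """Percentage of reconstructed surface samples within `threshold`
    (here: cm) of the ground truth."""
    return 100.0 * sum(1 for d in distances if d <= threshold) / len(distances)
```

The "Max distance" row then simply corresponds to `max(distances)`.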
The last point is due to the fact that the graph-cuts formalization (Eq. (4.31)) accumulates its cost over the entire object surface and volume. To avoid this problem, the following strategies can be introduced into the cost minimization process:
• Find well photo-consistent partial surfaces, for which the objective function accumulates smaller (ideally 0) photo-consistency costs over the surfaces as well as ballooning costs over the volumes enclosed by the surfaces.
Fig. 4.26 Cost aggregation around a “bottleneck” shape
• Find a "bottleneck" where the object (source-side nodes) can be cut into two groups by cutting a small area, that is, by cutting a small number of n-links (Fig. 4.26). In this case the objective function can discard the ballooning cost over the volume (the red area of Fig. 4.26) and the photo-consistency cost over the surface (the green line). Instead, it must accept the photo-consistency costs over the cutting area (the bold black line). Since this area does not correspond to the real geometry, the per-node photo-consistency costs can be large there. However, if the area is small enough, the sum of such photo-consistency costs can be smaller than the accumulated ballooning and photo-consistency costs over the volume and surface.

Figure 4.25(f) shows an example of such a "bottleneck" case. The silhouette constraint tries to preserve thin areas. However, as a result of balancing the ballooning cost against the object thickness, only the silhouette constraint points remain as a part of the object, and the generated shape has many holes around them. The reason is twofold:
• The silhouette constraint defined in Sect. 4.4.1.1.3 is nothing but a silhouette "boundary" constraint. It helps to preserve the boundary, not the silhouette area inside it.
• The volumetric formalization does not have an explicit local surface connectivity constraint. In the case of the mesh deformation, on the other hand, the mesh edges explicitly define a strict connectivity between vertices. This allows the silhouette "boundary" constraint to serve as a silhouette "area" constraint, even though its definition does not strictly guarantee the area constraint.
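As a toy numeric illustration of the comparison described above (all numbers invented, not taken from the experiments):

```python
def keep_cost(n_voxels, surface_nlinks, lam):
    """Cost of keeping a thin protrusion: ballooning cost over its
    volume plus photo-consistency cost over its true-surface n-links."""
    return lam * n_voxels + sum(surface_nlinks)

def cut_cost(bottleneck_nlinks):
    """Cost of severing the protrusion at a narrow bottleneck."""
    return sum(bottleneck_nlinks)

# Illustrative numbers: a thin protrusion of 200 voxels whose true surface
# crosses 120 n-links of low cost 0.3 each, versus a bottleneck of only
# 10 n-links with poor (high) photo-consistency cost 0.9 each.
keep = keep_cost(200, [0.3] * 120, lam=0.05)
cut = cut_cost([0.9] * 10)
```

Even though each bottleneck n-link is individually expensive, the ten-link cut (total cost 9) is far cheaper than keeping the protrusion (total cost 46), so the min-cut carves the thin part away, as in Fig. 4.25(f).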
4.4.1.3.2 Qualitative Performance Evaluations with Real Data

To evaluate the performance of the mesh-deformation and the volumetric reconstruction methods on real world objects, we captured multi-view video data of a dancing MAIKO with FURISODE in Studio B, whose specifications and calibration accuracy are shown in Tables 2.3 and 2.5, and Fig. 2.4. Figure 4.27 illustrates a test set of captured multi-view image frames. It should be noted that her long thin sleeves and sash move widely in a complex manner as the MAIKO performs. Figure 4.28 shows (a) the visual hull and reconstructed 3D shapes by (b) the mesh-deformation and (c) the volume-based methods. In this evaluation, we used
Fig. 4.27 Multi-view image frames of a dancing MAIKO captured by the Studio B system
Fig. 4.28 (a) Visual hull and reconstructed 3D shapes by (b) the mesh-deformation and (c) the volume-based methods
5 mm mesh resolution for the mesh deformation, while the voxel resolution for the volumetric reconstruction was 10 mm. Both require approximately 1 GB of memory in total. The experimental results demonstrate that the mesh-deformation algorithm provides qualitatively better reconstructions. The upper and lower rows of Fig. 4.29 show close-up images of the same 3D shapes from the two different directions A and B illustrated in Fig. 4.28, from which we can observe:
1. The visual hull has some phantom protrusions and holes due to the complicated self-occlusions.
2. The mesh-deformation result clearly inherits the complicated mesh structure from the visual hull.
Fig. 4.29 Close-ups of the 3D shapes of Fig. 4.28. The directions for the upper and lower close-ups are shown in A and B in Fig. 4.28(b), respectively
3. The volumetric reconstruction successfully removed phantom volumes regardless of the complicated visual hull structure, as illustrated with the red circles in Fig. 4.29(c).

4.4.1.4 Discussions

4.4.1.4.1 Silhouette Constraint

The 3D shape reconstruction methods presented in this section employ the silhouette boundary constraint. The silhouette areas inside the boundaries are used only to compute the visual hull. This means that the mesh is deformed to keep the silhouette boundaries, and the volumetric method may introduce holes even in areas covered by silhouette interiors, as shown in Fig. 4.25(f). For a better use of the silhouette constraint, the following methods were developed.

Furukawa and Ponce [11] proposed a patch-based algorithm with a silhouette "area" constraint. It explicitly assigns at least a single patch to each silhouette pixel to ensure that the projection of the estimated 3D shape exactly matches the original multi-view silhouettes. Sinha et al. [46] proposed an iterative shape reconstruction algorithm which estimates a 3D shape first, and then modifies it so that its 2D projections exactly match the original multi-view silhouettes. This algorithm iteratively utilizes shape from stereo and shape from silhouette to seek a 3D shape which satisfies both constraints. Cremers and Kolev [7] proposed a reconstruction framework based on convex optimization. The shape is represented by a set of voxels, and the silhouette area constraint is defined as a per-pixel inequality constraint which aggregates binary voxel occupancies along the visual ray passing through the pixel. This approach integrates both shape from stereo and silhouette constraints in a single optimization framework.
4.4.1.4.2 Topology

In surface-based algorithms, simple mesh deformations based on vertex translations do not change the global surface topology. This implies that the visual hull, i.e. the initial shape for the deformation, must have the correct object topology. For example, if the object has a genus-1 shape like a torus, then the visual hull must have the same genus-1 topology. On the other hand, if we can assume that the visual hull and the real object have the same topology, then this simple deformation helps preserve the topology even for thin object areas.

The above-mentioned limitation can be relaxed by implementing complex deformation operations [61] or by changing the shape representation into a volumetric or patch-based one. One straightforward implementation is to employ a two-step approach. The first step utilizes the volume-based reconstruction with a relatively large (safer) volumetric cost λ, and then uses the resultant 3D shape as the input of the second step, which utilizes the mesh-based reconstruction. However, designing the first step to cull phantom (or safe) volumes is not a simple task.
4.4.1.4.3 Ballooning

To tackle the problem of the uniform volumetric ballooning cost (Fig. 4.25(f)), Hernandez et al. proposed an "intelligent ballooning" scheme, which introduces a non-uniform volumetric cost [14]. They defined the value of λ based on the result of multi-view 2.5D depth reconstruction: they first reconstruct multi-view 2.5D depth maps, and set a smaller λ for nodes placed in front of the depth maps. This approach can be seen as an extension of traditional "depth fusion" techniques used to reconstruct a single 3D structure from multiple 2.5D depth maps obtained by a laser range finder [19].
4.4.2 Simultaneous 3D Shape and Motion Estimation from Multi-view Video Data by a Heterogeneous Inter-frame Mesh Deformation In this section we present a simultaneous 3D shape and motion estimation algorithm based on a heterogeneous inter-frame mesh deformation [39, 40]. Suppose we have reconstructed the 3D shape of frame t as M t by the intra-frame mesh deformation presented in Sect. 4.4.1.1. The goal of the algorithm is to find an inter-frame deformation which deforms M t to M t+1 based on the multi-view image frames at t + 1. The deformation should map each vertex of M t to a corresponding position in M t+1 while preserving the mesh structure: the number of vertices and their mutual connectivities. The key idea is to employ a heterogeneous deformation strategy where vertices with prominent textures lead the deformation of neighbors while poorly textured
vertices follow such leading neighbors. This is based on the following observations. If all vertices of M t had rich textures and could be uniquely localized/identified on the surface, we would be able to establish exact vertex correspondences over frames. For real world objects, however, such prominent textures are limited to some parts of their surfaces. In addition, for complex human actions like MAIKO dances, surface motion patterns vary a lot depending on their locations: some parts like the head follow rigid motions, while others like loose and soft sleeves and sashes show very complicated motion patterns. Since the motion estimation for the former can be well modeled, it may lead the estimation of the latter. Thus the heterogeneity employed in the presented algorithm is twofold: the mesh deformation is controlled depending on both texture prominence and motion rigidity.

To implement the above-mentioned strategy, we employ the following constraints to deform M t to M t+1.

Photometric constraint: A patch in M t+1 should be placed so that its texture, which is computed by projecting the patch onto a captured image, is consistent irrespective of which of the multi-view images at t and t + 1 it is projected on. This is an extension of the photo-consistency constraint used in the intra-frame deformation in Sect. 4.4.1.1, and we call it a spatio-temporal photo-consistency constraint.

Silhouette constraint: When M t+1 is projected onto an image plane, its 2D silhouette should coincide with the observed object silhouette at frame t + 1 on that image plane.

Smoothness constraint: M t+1 should be locally smooth and should not intersect with itself.

Motion constraint: Vertices of M t should drift in the direction of the motion flow of their vicinity to reach those of M t+1. The drifting is controlled by the rigidity of each vertex (see below).

Collision constraint: While in motion, some parts of M t may collide with others and later get apart.
To preserve the mesh structure even when such surface collisions occur, we introduce a new collision constraint, which prevents M t+1 from intruding inside the surface. (The definition will be given in detail later.) We implement these constraints as forces working on each vertex of M t , which deform M t to M t+1 so that M t+1 is consistent with the observed multi-view video frames at t + 1. In what follows, we present a heterogeneous mesh-deformation algorithm as the computational scheme to realize such a deformation.
4.4.2.1 Heterogeneous Mesh Deformation Algorithm As discussed above, two types of heterogeneity are employed in the algorithm: texture prominence and motion rigidity. Here we address their implementation methods.
Fig. 4.30 Roughly estimated motion flow patterns. (a) Two 3D surfaces of consecutive frames. (b) Estimated motion flow pattern. ©2008 IEICE [41]
4.4.2.1.1 Heterogeneous Motion Model

As discussed above, complicated object actions like MAIKO dances cannot be modeled by rigid motions or their combinations. To represent such object actions, we introduce a heterogeneous motion model consisting of a mixture of surface warping and rigid motions. First we classify the vertices of M t into warping or rigid parts by computing and clustering motion flow patterns from frame t to frame t + 1 (cf. Fig. 4.30, and the practical algorithm described later):

Rigid part (Ca-1): A vertex representing an element of a rigid part of the object. It should move along with the others in the same part, keeping their shape.

Warping part (Ca-2): A vertex corresponding to a part of the object surface under free deformation. It should move smoothly along with its neighbors.
4.4.2.1.2 Vertex Identifiability

In general, we cannot expect that all the points on the object surface have prominent texture and that their 3D positions can be estimated by stereo matching. Hence not all the vertices of the mesh model are identifiable, and the photo-consistency constraint, which is used to estimate the vertex position on the 3D object surface, will not work well for such vertices. Thus the vertices should be classified into the following two types:

Cb-1: A vertex with prominent texture. It can be stably identified over frames to obtain a reliable local deformation, which its neighbors should follow.

Cb-2: A vertex with less prominent or no texture. Its deformation should be led by its neighbors.

We regard a vertex as identifiable if it has prominent texture and is also photo-consistent. It is then labeled as Cb-1 (identifiable), or otherwise as Cb-2.
4.4.2.1.3 Algorithm

With the above-mentioned definitions of the two types of heterogeneity, the heterogeneous mesh-deformation algorithm is designed as follows:

Step 1. Set the mesh M t at frame t as the initial estimation of M t+1 to be reconstructed. Note that M 0 is computed by the intra-frame mesh-deformation method described in Sect. 4.4.1.1.
Step 2. Compute roughly estimated motion flow patterns between frames t and t + 1.
Step 3. Categorize the vertices based on the motion flow patterns:
  Step 3.1. By clustering the estimated motion flow vectors, label a vertex with Ca-1 if it is an element of a rigid part, or Ca-2 otherwise.
  Step 3.2. Make the springs attached to vertices with label Ca-1 stiff. As will be shown in Fig. 4.32, the spring model is used to implement the smoothness constraint.
Step 4. Deform the mesh iteratively:
  Step 4.1. Compute the forces working at each vertex.
  Step 4.2. For a vertex having identifiability I(v) exceeding a certain threshold, that is, for a vertex with label Cb-1, let the force computed for it diffuse to its neighbors.
  Step 4.3. Move each vertex according to its force.
  Step 4.4. Terminate if all vertex motions are small enough. Otherwise go back to Step 4.1.
Step 5. Take the final shape of the mesh model as the object shape at frame t + 1, M t+1.

Note that M t and M t+1 share the same mesh structure, and the detailed motion vectors for all vertices, i.e. the vertex correspondences between M t and M t+1, are also obtained.

Note that for a vertex of type Ca-2 ∧ Cb-2, i.e. a vertex without prominent texture and not belonging to a rigid part, the position at t + 1 is interpolated by the smoothness constraint. On the other hand, a vertex of type Ca-1 ∧ Cb-1, i.e. a vertex with prominent texture and belonging to a rigid part, moves so as to lead the rigid part motion.
A vertex of type Ca-2 ∧ Cb-1 moves freely to satisfy the photo-consistency, while a vertex of type Ca-1 ∧ Cb-2 moves under combined influences from its neighbors. In what follows, the practical implementation methods to compute the forces are given, supposing M t = M t+1 = {V, E}, where V and E denote the sets of vertices and edges in M t as well as in M t+1.
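The iterative loop of Steps 4.1 to 4.4 can be sketched as follows. This is a minimal skeleton, not the actual implementation: the force and identifiability computations are passed in as assumed callbacks, the diffusion weight 0.5 and step size are arbitrary, and the toy example replaces the five forces by a simple pull toward hypothetical target positions:

```python
def heterogeneous_deform(mesh, compute_force, identifiability,
                         threshold=0.8, step=0.1, eps=1e-4, max_iter=2000):
    """Skeleton of Steps 4.1-4.4.  `mesh` maps vertex id -> (position,
    neighbor ids); `compute_force` and `identifiability` are callbacks."""
    pos = {v: p for v, (p, _) in mesh.items()}
    for _ in range(max_iter):
        # Step 4.1: per-vertex forces.
        force = {v: list(compute_force(v, pos)) for v in pos}
        # Step 4.2: identifiable (Cb-1) vertices diffuse their force to
        # neighbors; use a snapshot so diffusion is order-independent.
        lead = {v: force[v][:] for v in pos}
        for v, (_, neighbors) in mesh.items():
            if identifiability(v) > threshold:
                for u in neighbors:
                    for i in range(len(force[u])):
                        force[u][i] += 0.5 * lead[v][i]
        # Step 4.3: move each vertex according to its force.
        moved = 0.0
        for v in pos:
            pos[v] = tuple(p + step * f for p, f in zip(pos[v], force[v]))
            moved = max(moved, max(abs(step * f) for f in force[v]))
        # Step 4.4: terminate when all motions are small enough.
        if moved < eps:
            break
    return pos

# Toy 1D example (hypothetical forces): each vertex is pulled toward a target;
# vertex 0 plays the role of a Cb-1 vertex leading its neighbor.
mesh = {0: ((0.0,), (1,)), 1: ((1.0,), (0,))}
targets = {0: (2.0,), 1: (3.0,)}
pull = lambda v, pos: tuple(t - p for t, p in zip(targets[v], pos[v]))
ident = lambda v: 1.0 if v == 0 else 0.0
final = heterogeneous_deform(mesh, pull, ident)
```

In the toy run both vertices converge to their targets, with vertex 0 additionally dragging vertex 1 during the transient, which mimics how Cb-1 vertices lead poorly textured neighbors.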
4.4.2.2 Rigid Part Estimation In Step 2 and Step 3.1 of the algorithm, we need to estimate a rough motion flow and then find rigid parts from it.
Fig. 4.31 Motion flow clustering. (a) Roughly estimated motion flow patterns based on inertia and nearest neighbor search between the current shape at t and the visual hull at t + 1. (b) Short flow removal. (c) Clustering by position. (d) Sub-clustering by motion direction. ©2008 IEICE [41]
The motion flow pattern is computed by a nearest point search from M t to the visual hull at t + 1, VHt+1. Suppose each vertex is associated with a motion vector from M t−1 to M t, which is used to predict its motion from M t to M t+1. Then we estimate the vertex position at t + 1 by integrating both the surrounding motion flow pattern and the motion inertia. Let VHt+1 denote the visual hull of frame t + 1 and pv the position of vertex v on M t. Then the motion flow of v is defined as follows:

mv = pVH,v − pv,   (4.34)

where pVH,v denotes a vertex position on VHt+1 such that it is closest to pv + (d/dt)pv and satisfies npv · npVH,v > 0. Here pv + (d/dt)pv denotes the predicted position of v at t + 1 based on its motion history, and npVH,v the normal direction of VHt+1 at pVH,v. Then we estimate rigid parts by applying a clustering to mv. This is done by (1) removing short flows (Fig. 4.31(a) and (b)), (2) clustering mv based on the origin positions (Fig. 4.31(c)), and (3) sub-clustering based on the directions (Fig. 4.31(d)).
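A minimal sketch of the nearest-point search of Eq. (4.34), using a brute-force search over sampled visual hull vertices (the function and its interface are our own simplification):

```python
def motion_flow(p_v, vel_v, n_v, vh_points, vh_normals):
    """Eq. (4.34): m_v = p_VH,v - p_v, where p_VH,v is the visual-hull
    vertex closest to the predicted position p_v + dp_v/dt whose normal
    is compatible with that of v (n_v . n_VH > 0)."""
    pred = tuple(p + d for p, d in zip(p_v, vel_v))

    def dist2(q):
        return sum((a - b) ** 2 for a, b in zip(pred, q))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Keep only hull vertices whose normal faces the same way as v.
    candidates = [q for q, nq in zip(vh_points, vh_normals) if dot(n_v, nq) > 0]
    if not candidates:
        return (0.0, 0.0, 0.0)
    p_vh = min(candidates, key=dist2)
    return tuple(a - b for a, b in zip(p_vh, p_v))
```

Note that the flow is measured from the current position pv, even though the nearest-point search is anchored at the inertia-predicted position.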
4.4.2.3 Constraints as Vertex Forces

4.4.2.3.1 Photometric Constraint

Following the photometric constraint implementation of the intra-frame deformation in Sects. 4.4.1.1.2 and 4.4.1.1.5, the photo-consistent position of each vertex v ∈ V is searched along its normal direction. The possible vertex positions can then be defined as follows:

pv(k) = pv − μ k nv   (k = −K, . . . , K),   (4.35)
where pv denotes the current position of v, nv the normal vector of v at pv, μ the unit distance or resolution, and k the unit count. Let v(kv) denote the vertex v transformed to the digitized position pv(kv). With this representation, the most photo-consistent position for v is defined by

kp = arg max_k ρ(v(k)),   (4.36)

where ρ(v(k)) evaluates the photo-consistency using state-based visibility with M t. Notice that the photo-consistency evaluation function ρ(·) evaluates the multi-view textures of frame t captured at the position pv and of frame t + 1 captured at pv(k), in order to measure the spatio-temporal consistency. The texture prominence is then defined by ρ(v(kp))/ρ(v(k′p)), where k′p denotes the second best photo-consistent position. The force at v derived from the photometric constraint is controlled by this texture prominence measure as follows:

Fp(v) = pv(kp) − pv(0)   if ρ(v(kp))/ρ(v(k′p)) > threshold,
Fp(v) = 0                otherwise.   (4.37)
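Equations (4.36) and (4.37) can be sketched as follows, assuming the photo-consistency scores ρ(v(k)) along the normal have already been evaluated (the list-based interface and the threshold value are hypothetical):

```python
def photometric_force(positions, rho, threshold=1.2):
    """positions: candidate positions p_v(k) for k = -K..K, with the
    middle entry being p_v(0); rho: parallel photo-consistency scores
    (larger is better).  Returns the displacement of Eq. (4.37), or
    zero when the best score is not prominent over the second best."""
    order = sorted(range(len(rho)), key=lambda i: rho[i], reverse=True)
    best, second = order[0], order[1]
    center = len(positions) // 2          # index of p_v(0), assuming 2K+1 samples
    if rho[second] > 0 and rho[best] / rho[second] > threshold:
        return tuple(b - c for b, c in zip(positions[best], positions[center]))
    return (0.0, 0.0, 0.0)
```

When the score profile is flat (no prominent texture), the prominence ratio stays near 1 and the force vanishes, leaving the vertex to be led by its neighbors.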
4.4.2.3.2 Silhouette Constraint

First, the contour generator points Vcg on the visual hull at frame t + 1 are computed by the method described in Sect. 4.4.1.1.3. Then for each vertex v in M t, the existence of a point in Vcg is searched along the normal direction of v, just in the same way as above. Let pv(k) (k = −K, . . . , K) denote the possible positions of v defined by Eq. (4.35), and pv(ks) the one among them closest to the contour generator points. The force at v derived from the silhouette constraint is then defined as follows:

Fs(v) = pv(ks) − pv(0)   if ‖pv(ks) − pcg,v‖2 < threshold and npv · npcg,v > 0,
Fs(v) = 0                otherwise,   (4.38)

where pcg,v denotes the point in Vcg closest to pv(ks), and npv and npcg,v denote the normal vector on M t at the original position of v and that on the visual hull of t + 1 at pcg,v, respectively. The force Fs(v) navigates the vertex v toward the contour generator if there exists pv(ks) such that a point of Vcg is close enough to pv(ks) and has a normal direction similar to that of v.
4.4.2.3.3 Motion Constraint

As discussed earlier, not all the vertices have prominent textures. In the intra-frame mesh deformation, we can simply assume that all vertices move toward the inside, since the deformation starts from the visual hull. The inter-frame mesh deformation, on the other hand, has no such default motion.
Fig. 4.32 Extended spring model for the smoothness constraint. ©2008 IEICE [41]
The motion constraint in the inter-frame deformation is designed to provide a "default" motion direction for a vertex to deform. Similarly to the rigid part estimation, the force derived from the motion constraint for a vertex is defined as follows:

Fm(v) = pVH,v − pv,   (4.39)

where pVH,v denotes a vertex position on VHt+1 such that it is closest to pv + (d/dt)pv and satisfies npv · npVH,v > 0. Here pv + (d/dt)pv denotes the predicted position of v at t + 1 based on its motion history, and npVH,v the normal direction of VHt+1 at pVH,v. Notice that pVH,v is fixed throughout the iteration, but pv changes its position during the deformation.
4.4.2.3.4 Smoothness Constraint

We model the smoothness constraint by using an extended spring model (Fig. 4.32). In this model, vertex v has two spring groups:

• Structural springs connecting v to its neighboring vertices vj.
• Flex springs connecting v to the vertices v̌j such that the line between v and v̌j defines a diagonal of the quadrilateral obtained by merging a pair of neighboring triangle patches which share a pair of neighboring vertices of v.

Structural springs control distances between neighboring vertices, while flex springs control mesh folding. With spring constants ks(v, vj) and kf(v, v̌j) for the structural and flex springs, the force at v derived from the smoothness constraint is defined as follows:

Fi(v) = Σ_{j=1}^{N} fi(v, vj, ks(v, vj)) + Σ_{j=1}^{N} fi(v, v̌j, kf(v, v̌j)) − q̇v,   (4.40)

where vj denotes the j-th neighboring vertex of v, v̌j the j-th diagonally facing vertex of v, N the number of neighboring vertices, and q̇v the damping force of the spring, which changes proportionally to the velocity of v. fi(·) is the Hooke spring
force given by

fi(va, vb, k) = k · (‖qva − qvb‖ − l(va, vb)) / ‖qva − qvb‖ · (qva − qvb),   (4.41)
where l(va, vb) denotes the nominal length of the spring between va and vb. Note that the number of diagonally facing vertices is equal to the number of neighboring vertices N.
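The Hooke spring force of Eq. (4.41) is straightforward to implement (the zero-distance guard is our own addition to avoid division by zero):

```python
import math

def hooke_force(q_a, q_b, k, rest_length):
    """Hooke spring force of Eq. (4.41) between vertex positions q_a and
    q_b, with spring constant k and nominal length l(v_a, v_b)."""
    d = tuple(a - b for a, b in zip(q_a, q_b))
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    # Magnitude proportional to the stretch, directed along q_a - q_b.
    scale = k * (norm - rest_length) / norm
    return tuple(scale * x for x in d)
```

At the nominal length the force vanishes; otherwise its magnitude is k times the stretch, so stiffening the springs of Ca-1 vertices (Step 3.2) makes their neighborhoods resist non-rigid distortion.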
4.4.2.3.5 Collision Constraint

The repulsive force Fr(v) works on vertices where surface collisions occur. It prevents collided surfaces from intruding into the surface interior. To implement this force, the parts of the surface that may collide with each other need to be identified first. While surface collisions could be computed with accurate shape and motion data, this would not be a good idea in our case, since the accuracy of shape and motion is limited. We instead employ the visibility of each vertex for the collision detection. That is, for surface parts having faces close to others, no camera can observe them: the set of visible cameras Cv of v becomes empty. Note that the deformation of such invisible vertices, as well as colliding ones, should follow the neighbors while preventing surface intersections.

Suppose Fr(v) is initialized to 0 for all vertices and V∅ denotes the set of invisible vertices such that Cv = ∅. Then for each v ∈ V∅:

1. Select the set of vertices v′ ∈ Vd(v) from V∅ \ {v} such that the distance to v is closer than the shortest distance to the neighboring vertices of v.
2. Compute Fr(v) by

Fr(v) = − Σ_{v′ ∈ Vd(v)} (pv′ − pv) / ‖pv′ − pv‖³.   (4.42)
While general collision detection is known to be a time-consuming process, this algorithm can drastically reduce the number of vertices to be processed by reusing the visibility computation already performed for the photo-consistency evaluation.
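Equation (4.42) can be sketched as follows, assuming the set Vd(v) of nearby invisible vertices has already been selected as in step 1 (the interface is our own simplification):

```python
def repulsive_force(p_v, colliding_points):
    """Eq. (4.42): repulsive force at an invisible vertex v, summed over
    the nearby invisible vertices v' in V_d(v)."""
    f = [0.0, 0.0, 0.0]
    for p in colliding_points:
        d = [a - b for a, b in zip(p, p_v)]          # p_v' - p_v
        norm3 = sum(x * x for x in d) ** 1.5          # ||p_v' - p_v||^3
        if norm3 > 0.0:
            for i in range(3):
                f[i] -= d[i] / norm3                  # push v away from v'
    return tuple(f)
```

The inverse-square falloff makes the repulsion strong only when surfaces are about to interpenetrate, and symmetric contributions from opposite sides cancel out.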
4.4.2.4 Overall Vertex Force

The overall vertex force is given by

F(v) = ωp Fp(v) + ωs Fs(v) + ωi Fi(v) + ωm Fm(v) + ωr Fr(v),   (4.43)

where the ω terms are weighting coefficients. Starting with M t, the 3D shape of frame t, we can compute this per-vertex force for the heterogeneous mesh deformation. Each of the constituent forces navigates a vertex of M t so that its corresponding constraint is satisfied. The heterogeneous deformation finds a vertex position where the forces balance.
Fig. 4.33 Camera setup for the synthesized data
4.4.2.5 Performance Evaluations

4.4.2.5.1 Synthesized Data

Figure 4.34 shows the result of the inter-frame deformation for a synthesized object captured by the camera setup illustrated in Fig. 4.33. The left column shows a temporal sequence of the synthesized object, the second column their visual hulls, the third column the results of the intra-frame deformation, and the right column a temporal sequence of the 3D shapes computed by the inter-frame deformation.

Figure 4.35 shows the quantitative performance evaluation results. Accuracy evaluates the distance d (in cm) such that 90 % of the reconstructed surface areas are within d cm of the ground truth, and completeness measures the percentage of the reconstructed surface areas that are within 1.0 cm of the ground truth [45].

From these results, we can observe that the inter-frame mesh-deformation algorithm can (1) improve the shape compared with the visual hull, and (2) preserve the
Fig. 4.34 Results of inter-frame deformation for the synthesized data
Fig. 4.35 Reconstruction accuracy and completeness
mesh structure even if some surface parts collide with others, but (3) shows lower quality measures than the intra-frame deformation as time proceeds, even though it employs richer reconstruction cues, i.e., the temporal information.

The reason for the limited performance of the inter-frame deformation is threefold:

1. The mesh structure (vertex connectivity) inherited from the initial frame mesh is not guaranteed to be optimal for describing the 3D shapes of the other frames. Keeping the initial mesh structure over time can act as an excessive constraint on the shape reconstruction. The shape optimization in the intra-frame deformation, on the other hand, is free from such an additional constraint.
2. The intra-frame deformation can exploit a strong reconstruction cue given by the visual hull. That is, as long as the deformation starts from the visual hull, the real object surface is guaranteed to be found inside the visual hull. This allows the shape optimization process to seek a solution surface in a very limited direction. The inter-frame deformation, on the other hand, does not have such a solid direction in its optimization.
3. The sequential deformation process of the inter-frame mesh deformation accumulates errors over time, which degrades the accuracy of the shape reconstruction at later frames.

While we may be able to improve the performance of the inter-frame mesh deformation by taking the above limitations into account, it is a reasonable choice to employ frame-by-frame reconstruction methods to reconstruct a long sequence of 3D shapes in good quality, even if the mesh structure is not preserved over time. Thus the texture generation and 3D video applications (visualization, editing, kinematic structure analysis, and encoding) presented in the later chapters assume that the 3D shape reconstruction for a 3D video stream is conducted frame by frame.
Hence the 3D mesh structure varies over time and no correspondence between a pair of consecutive frame mesh data is established.
Fig. 4.36 Frames of MAIKO video captured by one of the multi-view cameras. ©2008 IEICE [41]
Fig. 4.37 Sequence of 3D shapes reconstructed by the inter-frame mesh deformation
4.4.2.5.2 Real Data

Figure 4.36 shows a temporal sequence of MAIKO video frames captured by one of the multi-view cameras. Figure 4.37 shows a sequence of 3D shapes reconstructed by the inter-frame mesh deformation, and Fig. 4.38(c) the dense and long-term 3D motion flow patterns representing the vertex motion trajectories between Figs. 4.38(a) and (b). These results empirically demonstrate that the inter-frame deformation can process dynamically complex object actions such as MAIKO dances.

To evaluate the performance of the inter-frame deformation for complex object motions including surface collisions, multi-view video data of a Yoga performance were analyzed. Figure 4.39 shows a temporal sequence of image frames captured by one of the multi-view cameras and Fig. 4.40 the result of the inter-frame deformation.

4 3D Shape Reconstruction from Multi-view Video Data

Fig. 4.38 Estimated long-term motion flow pattern. Each line (from blue to red) denotes a 3D motion flow

In this sequence, the subject crosses her legs, which touch each other completely at frames 13 to 14. Figures 4.41 and 4.42 show the topological structures of the 3D shapes in Fig. 4.40 and of the corresponding visual hulls. Here the topological structure of each 3D shape is computed by using a Reeb graph [54] (cf. the detailed computational algorithm in Chap. 8). We can observe that the topological structure is preserved over time with the inter-frame deformation, while those of the visual hulls change drastically due to the surface collisions at the legs. This can be observed well in the close-up views of frame 14 shown in Fig. 4.43. From these results, we can conclude that the proposed inter-frame deformation can process object actions involving heavy surface collisions.
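A cheap proxy for checking the topological structure of a closed triangle mesh (much simpler than the Reeb-graph method of [54] used above, and offered here only as an illustrative sketch) is the Euler characteristic chi = V - E + F, which equals 2 - 2g for a closed orientable surface of genus g. A surface collision that fuses the legs adds a handle (genus +1) and therefore lowers chi by 2, which is exactly the kind of change the visual hulls exhibit while the inter-frame deformation preserves chi.

```python
def euler_characteristic(faces):
    """V - E + F for a triangle mesh given as vertex-index triples;
    each undirected edge is counted once."""
    vertices, edges = set(), set()
    for a, b, c in faces:
        vertices.update((a, b, c))
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return len(vertices) - len(edges) + len(faces)

def torus_faces(m, n):
    """Triangulated m-by-n vertex grid that wraps around in both
    directions, i.e. a genus-1 torus (requires m, n >= 3)."""
    faces = []
    for i in range(m):
        for j in range(n):
            v00 = i * n + j
            v10 = ((i + 1) % m) * n + j
            v01 = i * n + (j + 1) % n
            v11 = ((i + 1) % m) * n + (j + 1) % n
            faces += [(v00, v10, v11), (v00, v11, v01)]
    return faces

# A tetrahedron is a closed genus-0 surface: chi = 2 - 2*0 = 2.
tetra = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
# A torus has one handle (genus 1): chi = 2 - 2*1 = 0.
```

Unlike a Reeb graph, chi does not describe where the topology changes, but it is sufficient to detect frames in which a visual hull acquires spurious handles.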
4.5 Conclusion

This chapter addresses 3D shape reconstruction from multi-view video data. We first reviewed Shape from X methods and concluded that shape from stereo and shape from silhouette are the most practical reconstruction methods, and that the state-of-the-art technologies integrate them since they work complementarily for accurate and robust reconstruction. As for 3D shape reconstruction for 3D video production, we categorized existing methods into three types: (1) frame-wise 3D shape reconstruction, (2) simultaneous 3D shape + motion estimation, and (3) 3D shape sequence estimation. Then we introduced the three essential design factors for 3D shape reconstruction algorithms: (1) photo-consistency, (2) visibility evaluation, and (3) shape representation and the associated computational model for optimization. Based on these design factors, we implemented several practical algorithms for frame-wise 3D shape reconstruction and simultaneous 3D shape and motion reconstruction, and evaluated their performance quantitatively with synthesized data as well as qualitatively with real-world multi-view video data of MAIKO dance and Yoga performance. Important conclusions we obtained include:

• According to the state-of-the-art technologies, the simultaneous 3D shape and motion reconstruction approach, i.e. the inter-frame mesh-deformation method, does not perform better than the frame-wise reconstruction methods in terms of the reconstructed 3D shape accuracy, while the former can manage complex object actions involving heavy surface collisions. This is because:
Fig. 4.39 Yoga video frames captured by one of the multi-view cameras. ©2008 IEICE [41]
Fig. 4.40 3D shapes reconstructed by the inter-frame deformation
Fig. 4.41 Topological structures computed from the 3D shapes in Fig. 4.40
– The mesh structure inherited from the initial frame mesh is not guaranteed to be optimal to represent 3D shapes over frames. Keeping the initial mesh structure over time can even work as an excessive constraint for the shape reconstruction, especially when the object shape changes largely over time as in MAIKO dances.
– The intra-frame deformation can exploit a strong reconstruction cue given by the visual hull, which allows the shape optimization process to seek a solution in a very limited direction, while the inter-frame deformation does not have such a solid direction in its optimization.
– The sequential computation process that deforms the initial mesh frame by frame over a long period is prone to accumulate errors.

• Consequently we employ the frame-wise reconstruction strategy in the following chapters. That is, each frame of a 3D video stream has a completely independent mesh structure, while the object changes its shape continuously.

In addition, this chapter has pointed out several open issues to be studied as future problems.
Fig. 4.42 Topological structures computed from the visual hulls
Fig. 4.43 Shapes and topological structures in a frame where parts of the object surface collide with each other. (a) 3D shape by the inter-frame deformation, (b) visual hull, (c) the topological structure of (a), (d) the topological structure of (b)
View-dependent 3D shape reconstruction: The introduction of a virtual camera/viewpoint into the computational process of 3D shape reconstruction from multi-view images can improve the reconstruction quality. The view-dependent 3D
shape reconstruction could weight multi-view images depending on the angles between the viewing directions of the virtual and actual cameras.

New cues and sensors: The introduction of other cues and/or devices such as shape from (de)focus/specularities/polarization, active stereo, ToF cameras, etc. (Sect. 4.2.1) will enable us to capture specular and transparent object surfaces as well as improve the 3D shape accuracy.

Problem formalization: Recent studies such as [7] showed that even the silhouette constraint can be better implemented with a new problem formalization (Sect. 4.4.1.4.1). Developing such new problem formalizations will improve the accuracy and robustness of dynamic 3D shape reconstruction from multi-view video data.

Segmentation: In this book we assume that accurate multi-view object silhouettes are available. While silhouettes are strong reconstruction cues, their accurate extraction in real environments is a long-standing segmentation problem in computer vision. Further studies on simultaneous estimation of the 3D shape and 2D silhouettes (Sect. 4.2.2.2.1), or on 3D shape reconstruction from incomplete silhouettes, will help make the 3D video technology more practical.

Limited visibility: Finally, we should note that the limited visibility of the object surface (Sect. 4.3.2) is the essential source of the problem in 3D shape reconstruction from multi-view images. That is, even if we increase the number of cameras, some parts of the object surface cannot be observed due to self-occlusions, especially when an object performs complex actions like Yoga. To cope with such self-occlusions, the inter-frame mesh-deformation strategy can be augmented to estimate properties of occluded surface parts from data obtained when they become visible. Such global dynamic data processing schemes should be investigated.
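The angle-based weighting mentioned under view-dependent 3D shape reconstruction above can be sketched concretely: weight each real camera by the cosine of the angle between its viewing direction and the virtual camera's, so that cameras aligned with the desired viewpoint dominate. The function name and the cosine-power scheme below are illustrative assumptions, not an algorithm from this chapter.

```python
import math

def view_weights(virtual_dir, camera_dirs, power=1.0):
    """Weight each real camera by the cosine of the angle between its
    viewing direction and the virtual camera's; cameras facing away
    receive weight 0. Directions are 3-vectors; the returned weights
    are normalized to sum to 1 (when any camera faces forward)."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return tuple(x / n for x in v)
    vd = norm(virtual_dir)
    raw = []
    for d in camera_dirs:
        c = sum(a * b for a, b in zip(vd, norm(d)))
        raw.append(max(0.0, c) ** power)
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw

# Virtual camera looking along +z; real cameras along +z, +x, and -z.
w = view_weights((0, 0, 1), [(0, 0, 1), (1, 0, 0), (0, 0, -1)])
```

Raising `power` sharpens the weighting toward the best-aligned camera, a knob that trades blending smoothness against view fidelity.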
References

1. Alexa, M., Behr, J., Cohen-Or, D., Fleishman, S., Levin, D., Silva, C.T.: Point set surfaces. In: The Conference on Visualization, pp. 21–28 (2001)
2. Barr, A.H.: Rigid Physically Based Superquadrics, pp. 137–159. Academic Press, San Diego (1992)
3. Baumgart, B.G.: Geometric modeling for computer vision. Technical Report AIM-249, Artificial Intelligence Laboratory, Stanford University (1974)
4. Baumgart, B.G.: A polyhedron representation for computer vision. In: Proceedings of the National Computer Conference and Exposition, AFIPS'75, pp. 589–596 (1975)
5. Campbell, N., Vogiatzis, G., Hernández, C., Cipolla, R.: Automatic 3D object segmentation in multiple views using volumetric graph-cuts. Image Vis. Comput. 28(1), 14–25 (2010)
6. Virtualizing Engine. Private communication with Profs. Takeo Kanade and Yaser Sheikh, Robotics Institute, Carnegie Mellon University, PA (2011)
7. Cremers, D., Kolev, K.: Multiview stereo and silhouette consistency via convex functionals over convex domains. IEEE Trans. Pattern Anal. Mach. Intell., 1161–1174 (2010)
8. Felzenszwalb, P., Huttenlocher, D.: Efficient belief propagation for early vision. Int. J. Comput. Vis. 70, 41–54 (2006)
9. Franco, J.-S., Boyer, E.: Efficient polyhedral modeling from silhouettes. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 414–427 (2009)
10. Fua, P., Leclerc, Y.G.: Using 3-dimensional meshes to combine image-based and geometry-based constraints. In: Proc. of European Conference on Computer Vision, pp. 281–291 (1994)
11. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
12. Goldlüecke, B., Magnor, M.: Space-time isosurface evolution for temporally coherent 3D reconstruction. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 350–355 (2004)
13. Habbecke, M., Kobbelt, L.: A surface-growing approach to multi-view stereo reconstruction. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
14. Hernandez, C., Vogiatzis, G., Cipolla, R.: Probabilistic visibility for multi-view stereo. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
15. Hernandez Esteban, C., Vogiatzis, G., Cipolla, R.: Multiview photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. 30, 548–554 (2008)
16. Horn, B.K.P., Brooks, M.J.: Shape from Shading. MIT Press, Cambridge (1989)
17. Hornung, A., Kobbelt, L.: Robust and efficient photo-consistency estimation for volumetric 3D reconstruction. In: Proc. of ECCV, pp. 179–190 (2006)
18. Ikeuchi, K.: Shape from regular patterns. Artif. Intell. 22(1), 49–75 (1984)
19. Ikeuchi, K., Oishi, T., Takamatsu, J., Sagawa, R., Nakazawa, A., Kurazume, R., Nishino, K., Kamakura, M., Okamoto, Y.: The great Buddha project: digitally archiving, restoring, and analyzing cultural heritage objects. Int. J. Comput. Vis. 75, 189–208 (2007)
20. Ishikawa, H.: Higher-order clique reduction in binary graph cut. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2993–3000 (2009)
21. Kanade, T., Rander, P., Narayanan, P.J.: Virtualized reality: constructing virtual worlds from real scenes. In: IEEE Multimedia, pp. 34–47 (1997)
22. Kang, S.B., Webb, J.A., Zitnick, C.L., Kanade, T.: A multibaseline stereo system with active illumination and real-time image acquisition. In: Proc. of International Conference on Computer Vision, pp. 88–93 (1995)
23. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. Comput. Vis. 1(4), 321–331 (1988)
24. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Symposium on Geometry Processing, pp. 61–70 (2006)
25. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26, 147–159 (2004)
26. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. In: Proc. of International Conference on Computer Vision, pp. 307–314 (1999)
27. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 16(2), 150–162 (1994)
28. Lazebnik, S., Furukawa, Y., Ponce, J.: Projective visual hulls. Int. J. Comput. Vis. 74, 137–165 (2007)
29. Lempitsky, V., Boykov, Y., Ivanov, D.: Oriented visibility for multiview reconstruction. In: Proc. of European Conference on Computer Vision, pp. 226–238 (2006)
30. Marr, D.: Vision. W. H. Freeman & Co, New York (1982)
31. Martin, W.N., Aggarwal, J.K.: Volumetric description of objects from multiple views. IEEE Trans. Pattern Anal. Mach. Intell. 5(2), 150–158 (1983)
32. Matsuyama, T., Wu, X., Takai, T., Nobuhara, S.: Real-time 3D shape reconstruction, dynamic 3D mesh deformation and high fidelity visualization for 3D video. Comput. Vis. Image Underst. 96, 393–434 (2004)
33. Miller, G., Hilton, A.: Safe hulls. In: Proc. 4th European Conference on Visual Media Production, IET (2007)
34. Moezzi, S., Tai, L.-C., Gerard, P.: Virtual view generation for 3D digital video. In: IEEE Multimedia, pp. 18–26 (1997)
35. Nagel, H.H., Enkelmann, W.: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Anal. Mach. Intell. 8, 565–593 (1986)
36. Nayar, S.K., Nakagawa, Y.: Shape from focus. IEEE Trans. Pattern Anal. Mach. Intell. 16, 824–831 (1994)
37. Nayar, S.K., Watanabe, M., Noguchi, M.: Real-time focus range sensor. IEEE Trans. Pattern Anal. Mach. Intell. 18, 1186–1198 (1996)
38. Nobuhara, S., Matsuyama, T.: Dynamic 3D shape from multi-viewpoint images using deformable mesh models. In: Proceedings of the 3rd International Symposium on Image and Signal Processing and Analysis, pp. 192–197 (2003)
39. Nobuhara, S., Matsuyama, T.: Heterogeneous deformation model for 3D shape and motion recovery from multi-viewpoint images. In: Proc. of International Symposium on 3D Data Processing, Visualization and Transmission, pp. 566–573 (2004)
40. Nobuhara, S., Matsuyama, T.: Deformable mesh model for complex multi-object 3D motion estimation from multi-viewpoint video. In: Proc. of International Symposium on 3D Data Processing, Visualization and Transmission, pp. 264–271 (2006)
41. Nobuhara, S., Matsuyama, T.: A 3D deformation model for complex 3D shape and motion estimation from multi-viewpoint video. IEICE Trans. Inf. Syst. J91-D(6), 1613–1624 (2008) (in Japanese)
42. Nobuhara, S., Tsuda, Y., Ohama, I., Matsuyama, T.: Multi-viewpoint silhouette extraction with 3D context-aware error detection, correction, and shadow suppression. IPSJ Trans. Comput. Vis. Appl. 1, 242–259 (2009)
43. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 15(1), 353–363 (1993)
44. Seitz, S., Dyer, C.: Photorealistic scene reconstruction by voxel coloring. Int. J. Comput. Vis. 25(3), 151–173 (1999)
45. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 519–528 (2006)
46. Sinha, S.N., Mordohai, P., Pollefeys, M.: Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In: Proc. of International Conference on Computer Vision, pp. 1–8 (2007)
47. Starck, J., Hilton, A.: Surface capture for performance based animation. IEEE Comput. Graph. Appl. 27(3), 21–31 (2007)
48. Starck, J., Hilton, A., Miller, G.: Volumetric stereo with silhouette and feature constraints. In: Proc. of British Machine Vision Conference, pp. 1189–1198 (2006)
49. Starck, J., Maki, A., Nobuhara, S., Hilton, A., Matsuyama, T.: The multiple-camera 3-D production studio. IEEE Trans. Circuits Syst. Video Technol. 19(6), 856–869 (2009)
50. Subbarao, M., Surya, G.: Depth from defocus: a spatial domain approach. Int. J. Comput. Vis. 13, 271–294 (1994)
51. Szeliski, R.: Rapid octree construction from image sequences. CVGIP, Image Underst. 58(1), 23–32 (1993)
52. Tomasi, C., Kanade, T.: Shape and motion from image streams: a factorization method—full report on the orthographic case. Int. J. Comput. Vis. 9, 137–154 (1992)
53. Tran, S., Davis, L.: 3D surface reconstruction using graph cuts with surface constraints. In: Proc. of European Conference on Computer Vision, vol. 3952, pp. 219–231 (2006)
54. Tung, T., Schmitt, F.: The augmented multiresolution Reeb graph approach for content-based retrieval of 3D shapes. Int. J. Shape Model. 11(1), 91–120 (2005)
55. Vedula, S., Baker, S., Seitz, S., Kanade, T.: Shape and motion carving in 6D. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2000)
56. Vogiatzis, G., Hernandez, C., Torr, P., Cipolla, R.: Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2241–2246 (2007)
57. Vogiatzis, G., Torr, P., Seitz, S.M., Cipolla, R.: Reconstructing relief surfaces. In: Proc. of British Machine Vision Conference, pp. 117–126 (2004)
58. Vogiatzis, G., Torr, P., Cipolla, R.: Multi-view stereo via volumetric graph-cuts. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 391–398 (2005)
59. Wada, T., Ukida, H., Matsuyama, T.: Shape from shading with interreflections under a proximal light source: distortion-free copying of an unfolded book. Int. J. Comput. Vis. 24(2), 125–135 (1997)
60. Wu, X.: Parallel pipeline volume intersection for real-time 3D shape reconstruction on a PC cluster. PhD thesis, Kyoto University (March 2005)
61. Zaharescu, A., Boyer, E., Horaud, R.: Transformesh: a topology-adaptive mesh-based approach to surface evolution. In: Proc. of Asian Conference on Computer Vision, pp. 166–175 (2007)
62. Zeng, G., Quan, L.: Silhouette extraction from multiple images of an unknown background. In: Proc. of Asian Conference on Computer Vision, pp. 628–633 (2004)
Chapter 5
3D Surface Texture Generation
5.1 Introduction

As discussed in the beginning of Part II, the complexity and limited availability of information in the computational model of 3D video production do not allow us to solve the problem of 3D video production from multi-view video data at once. Thus we take the following three-step solution: 3D shape reconstruction, surface texture generation, and estimation of lighting environments, while introducing assumptions to overcome the complexity and make up for the incompleteness of the input data.

First, the 3D shape reconstruction methods described in the previous chapter produce a sequence of 3D meshes from multi-view video data, assuming uniform directional lighting and a Lambertian object surface. In the second step, the surface texture generation methods presented in this chapter compute surface texture patterns for a 3D mesh data sequence based on the observed multi-view video data. The following are basic assumptions about the 3D mesh data:

• Shape reconstruction and texture generation processes can work even for scenes with multiple objects, although mutual occlusions can damage the quality of 3D video. In this chapter, however, we assume a single 3D object in the scene for the sake of simplicity. It should be noted that even for a single object, the texture generation process has to cope with self-occlusions.
• We assume in this chapter a sequence of 3D mesh data reconstructed independently from one frame to another. That is, the 3D mesh structure, i.e. the number of vertices and their connectivity relations, changes over frames and no explicit correspondence is established between the mesh data of consecutive frames.

With these assumptions, the problem of texture generation is defined as that of generating the surface texture of a 3D mesh frame from a set of multi-view video frames.
Figure 5.1 illustrates the computational model for texture generation, where the gray and black arrows illustrate "generation" and "estimation" processes, respectively.

T. Matsuyama et al., 3D Video and Its Applications, DOI 10.1007/978-1-4471-4120-4_5, © Springer-Verlag London 2012

Fig. 5.1 Computational model for texture generation. The gray and black arrows illustrate "generation" and "estimation" processes, respectively

First, suppose the 3D mesh data of an object at frame t, So, a set of multi-view images, Ii (i = 1, 2, . . . , N), and the calibration data of a group of cameras, Ci (i = 1, 2, . . . , N), are given. Then, the general computational model of 3D video production illustrated in Fig. II.2 in the beginning of this part is simplified into Fig. 5.1 assuming:

Lambertian reflection: If required, the texture generation process can assume an object surface that follows Lambertian reflection, i.e. Ro = Lambertian, and conduct appearance-based matching between multi-view images to estimate appropriate surface texture patterns depending on local object surface position and shape. Thus photo-consistency evaluation methods similar to those employed in 3D shape reconstruction (Sect. 4.3.1) can be employed for texture generation. As described below, however, the texture generation methods presented in this chapter can simulate non-Lambertian surface reflections such as highlights by introducing a virtual camera.

Non-attenuation of uniform directed lighting without interreflection: To simplify the complex light field phenomena, the same assumptions about lighting environments as for 3D shape reconstruction are introduced. That is, L ≈ non-attenuation of uniform directed lighting.

No background scene: Since the 3D shape of an object is segmented out from the background scene, we can eliminate the background scene from the computational model.
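The chapter does not fix a concrete representation for the calibration data Ci at this point; a common choice (an assumption here, not the book's notation) is a 3-by-4 projection matrix that maps a homogeneous 3D point to pixel coordinates, which is all the texture generation process needs to look up where a mesh vertex appears in image Ii.

```python
def project(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 camera matrix P
    (a list of three rows of four numbers); returns pixel (u, v)."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(r[k] * xh[k] for k in range(4)) for r in P)
    return (u / w, v / w)  # perspective division

# Toy calibration: focal length 500 px, principal point (320, 240),
# camera at the origin looking down +z (identity rotation, zero
# translation). All numbers are illustrative.
P = [
    [500.0, 0.0, 320.0, 0.0],
    [0.0, 500.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
u, v = project(P, (0.2, -0.1, 2.0))
# u = 500 * 0.2 / 2.0 + 320 = 370,  v = 500 * (-0.1) / 2.0 + 240 = 215
```

With one such matrix per camera, the mapping from a mesh vertex to its observations in I1, . . . , IN is a single matrix-vector product per view.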
The computational model for texture generation is best characterized by the introduction of a virtual camera, Ĉ. With this additional entity, view-dependent texture generation can be realized, which can render object surface textures with non-Lambertian reflection properties such as highlights, as well as partially specular shading and soft shadows induced by non-uniform lighting environments. These visual effects are crucial in rendering 3D video of MAIKO dances with shiny FURISODEs of silk embroidered with gold thread. The black arrows in Fig. 5.1 illustrate the computational flow of this view-dependent texture generation, which can simulate complex light field phenomena including highlights, shading, and shadows, even under unknown light sources, by controlling the integration process of multi-view images based on the virtual camera. In what follows, the computational scheme of texture generation is discussed and compared with similar methods employed in computer graphics and 3D animation.
5.1.1 Texture Painting, Natural-Texture Mapping, and Texture Generation

To develop computational methods of texture generation for 3D video production, we first have to clarify the differences from similar methods used in 3D CG animation, which we will refer to as texture painting. The left part of Fig. 5.2 illustrates the computational scheme of texture painting for 3D CG animation, while the right part illustrates texture generation for 3D video production. In the former, the 3D mesh data of a (static) object and its surface texture patterns are first designed manually by artists. Then, to animate the object, a dynamic sequence of 3D mesh data is generated by retargeting motion capture data and/or by manual motion design (Fig. 5.3 left). Since the designed surface texture patterns specify generic surface characteristics such as reflectance and roughness properties, i.e. diffuse, specular, transparency, and so on, 3D visual contents for 3D cinema, TV, and games can be rendered by computing object appearances including shading patterns, highlights, and shadows under specified lighting environments. This is the process of rendering texture patterns in visual CG contents, where all information required for the computation is given. That is, all information on objects, light sources, and the background scene as well as the virtual camera in Fig. II.2 is prepared to generate the light field, which is then projected onto the image plane of the virtual camera. Since the problem is thus fully specified, performance is evaluated by the "naturalness" of rendered contents and by computation speed.

Note that all 3D mesh data in 3D CG animation usually share the same mesh structure. This implies that once painted, the surface texture patterns are preserved over the sequence.
More specifically, the number of mesh faces (we will use "face" instead of "patch" in this chapter), their connectivity relations, and texture patterns are kept constant, while the 3D coordinate values of mesh vertices change over time. While sophisticated 3D CG animation methods allow dynamic change of mesh structure and
Fig. 5.2 Computational schemes of texture painting and texture generation
Fig. 5.3 Data representations in 3D CG animation and 3D video
texture over time, complete information for texture synthesis is prepared a priori as well.

To improve naturalness, natural-texture patterns captured by cameras are often mapped onto 3D mesh data. This process is referred to as natural-texture mapping (in what follows, we will call it texture mapping for the sake of simplicity) and shares much with texture generation for 3D video production. The major problems
Fig. 5.4 Computational scheme of natural-texture mapping
to be solved for texture mapping, which are shared by texture generation as well, are summarized as follows (see also Fig. 5.4).

Geometric transformation between 3D mesh and 2D image: Simple algebraic transformations cannot be employed to map 2D image data onto given 3D mesh data if the topological structures of the 3D mesh and the 2D image data are different. For example, consider how to map a bounded 2D planar image seamlessly onto the unbounded surface of a sphere. Since 3D mesh data representing complex human actions have very complex topological structures, which moreover vary over time, sophisticated geometric mapping methods have to be devised for the 3D video texture generation process. This problem is known as mesh parameterization and has been actively studied in computer graphics [12, 18, 19]. Section 5.2 presents a fundamental geometric mapping scheme for texture mapping, which will be employed afterwards for texture generation.

Estimation of generic surface reflectance properties from observed images: The estimation of object surface reflectance properties from observed images has been a major issue in computer vision [5, 11, 14, 16, 17, 22]. The left half of Fig. 5.4 illustrates a computational model of surface reflection and image observation, which implies that the problem of surface reflectance estimation from observed images is ill-posed without appropriate assumptions. That is, since the 3D surface shape is known in texture generation as well as in texture mapping, information about lighting has to be given or assumed for reflectance property estimation. While it is possible to estimate the reflectance properties of a static object under well-calibrated, controlled lighting [2, 3], such methods can
only be used for texture mapping, because the lighting environments in 3D video capture studios cannot be controlled and calibrated for an object that moves freely. In addition to these problems shared with texture mapping, the texture generation process has to solve several additional problems which are essential in 3D video production and will be discussed next.
5.1.2 Problems in Texture Generation

First, let us recall the assumptions introduced in the beginning of this chapter: Lambertian reflectance, non-attenuation of uniform directional lighting without interreflection, a segmented 3D object, frame-by-frame 3D mesh reconstruction, and the introduction of a virtual camera (Fig. 5.1). Under these assumptions, the texture generation methods presented in this chapter compute the object appearance viewed from the virtual camera using the reconstructed 3D mesh and captured multi-view video frame data (Fig. 5.2 right). In this sense, we can call them appearance-based texture generation. As an approach toward generic-property-based texture generation, a method of estimating dynamically changing lighting environments with a reference object of known 3D shape and reflectance properties will first be presented in Chap. 6. Then, a method of estimating the reflectance properties of the object surface and rendering visual contents under user-specified lighting environments will be presented in Sect. 6.7.

Compared to generic-property-based texture generation, appearance-based texture generation is easier to realize. Its computation process has to solve the following technical issues for rendering high-fidelity visual contents, which also best characterize the differences from texture painting and texture mapping.

Multiple observations of a mesh face: For each frame of video data, a face of the 3D mesh is usually observed in multiple images, from which we have to generate a texture for that face. How to integrate such multiple observations without degrading texture quality is a problem.

Errors in reconstructed 3D mesh data: Since calibration and 3D shape reconstruction processes cannot be perfect, reconstructed 3D mesh data are corrupted with errors and noise. Moreover, as the accuracy of the geometric mapping between the 3D mesh and observed images is limited, observed images cannot be consistently aligned on the 3D mesh.
How to resolve such unknown misalignments is a crucial problem for rendering high-fidelity visual contents.

Time-varying 3D mesh structures: Although the inter-frame mesh deformation method presented in Sect. 4.4.2 produces a sequence of 3D mesh data preserving the mesh structure over time, many 3D shape reconstruction methods process frames independently one by one and generate a temporally inconsistent sequence of 3D mesh data (see Fig. 5.3 right). Moreover, as the shape and calibration errors described above vary over time, motion information cannot be used in texture generation. Consequently, the texture generation process should be designed to work independently frame by frame.
Table 5.1 Texture generation for 3D video

Virtual-view dependence | Surface property | Processing unit | Algorithm
Independent             | Generic          | face            | Generic texture generation (Chap. 6)
Independent             | Appearance       | face            | View-independent texture generation (Sect. 5.3)
Dependent               | Appearance       | vertex          | View-dependent vertex-based texture generation (Sect. 5.4)
Dependent               | Appearance       | face            | Harmonized texture generation (Sect. 5.5)
Interactive viewing-direction control: The assumptions of uniform directional lighting and a Lambertian object surface enable us to treat shading, highlights, and shadows as surface-painted markers, just as in 3D shape reconstruction. In texture generation, however, since the viewing direction of 3D video is interactively specified and changed by a user, the user-specified information, i.e. the properties of the virtual camera, can be employed to render realistic appearances of shiny object surfaces; highlights and the appearance of shiny surfaces change according to the viewing direction. That is, the texture generation process should compute the object appearance according to the position and viewing direction of the virtual camera: fidelity factors of the observed multi-view images should be evaluated based on the user-specified viewing direction. With this view-dependent texture generation, object surface textures with non-Lambertian reflection properties such as highlights, as well as partially specular shading and soft shadows induced by non-uniform lighting environments, can be rendered even without knowing the generic surface reflectance properties or the accurate lighting environments. From a computational speed viewpoint, such a texture generation process should compute the object appearance from the user-specified viewpoint in real time, because the viewpoint is not known a priori and changes dynamically.

The remainder of this chapter is structured as follows (see Table 5.1). First, Sect. 5.2 introduces a basic concept of texture mapping, focusing on data structures for representing geometric mapping relations between complex 3D mesh and 2D image data. Then Sect. 5.3 presents a naive view-independent texture generation method, mainly to demonstrate how much the rendering of visual contents is degraded due to the problems described above.
The effectiveness of using information about the virtual camera in texture generation is demonstrated with an improved view-dependent texture generation method described in Sect. 5.4. Finally, Sect. 5.5, which is the main part of this chapter, presents a sophisticated texture generation method named harmonized texture generation, developed to solve the problems described above. Experimental results have proved its performance in generating high-fidelity visual contents comparable to the originally observed video data. Section 5.6 concludes the chapter and discusses future problems.
Fig. 5.5 3D mesh and transformed 2D mesh data
5.2 Geometric Transformation Between a 3D Mesh and a 2D Image

To map a texture image onto a 3D mesh, we need to know their geometric correspondence. In other words, we need to know how a 3D mesh surface is transformed onto a 2D image plane. The simplest method consists of decomposing a 3D mesh into discrete triangles and placing them uniformly on a 2D image plane. Figure 5.5(a) shows a 3D mesh of MAIKO observed from a certain viewpoint, and Fig. 5.5(b) a uniform decomposition of a 2D rectangular image plane: the correspondences between the 3D mesh triangles and the 2D triangles can be established by assigning unique IDs to corresponding 3D and 2D triangles and defining a geometric transformation between each pair of triangles with the same ID. This simple decomposition, however, has the following drawbacks that degrade the quality of visualized 3D video:

Shape distortion: The original shapes of the 3D mesh triangles are not preserved.
Uniform size: The original size differences among the 3D mesh triangles are not preserved.
Discontinuity: The connectivity relations among the 3D mesh triangles are not preserved.

While the former two degrade texture quality, the latter introduces jitters, i.e. visible noisy discontinuities, along triangular face boundaries. To solve these problems, sophisticated methods for geometric transformation between a 3D mesh and a 2D image have been developed, and are referred to as mesh parameterization. Although mainly used for texture mapping, they can be applied to other tasks: mesh editing, which converts complex 3D algorithms into simpler 2D computations, and mesh coding, which encodes 3D mesh vertex coordinates (x, y, z) into RGB pixel values of a 2D image. The latter application, which is known as geometry image [9], will be discussed in detail in Chap. 10.

Concerning mesh parameterization for texture generation, we basically use the smart uv project implemented in Blender [1]. The algorithm is defined as follows:

1. Cluster the faces of a 3D mesh into several groups based on their normal vectors. Note that the clustering does not rely on the connectivity between faces and only processes normal vectors.
2. For each group, compute the average normal vector of the member faces, and derive the orthogonal projection that transforms the faces from 3D space to a 2D plane. Weighting factors proportional to the face areas are employed in averaging the normal vectors, so that the normals of larger faces are respected. This process enables us to transform faces from 3D to 2D while preserving their shapes and sizes.
3. For each group, merge neighboring member faces to generate islands on the 2D plane.
4. Tile all generated islands in a specified 2D rectangular area as compactly as possible to obtain a uv unwrapped mesh. Here 'uv' denotes the 2D coordinate system of the 2D rectangular area.

This method enables us to transform a 3D mesh into a 2D image that almost preserves the shapes, sizes, and connectivity relations of the 3D mesh triangles. Figure 5.5(c) shows the result of the uv unwrapping of MAIKO with the smart uv project. We refer to this process as uv unwrapping of a 3D mesh: 'unwrapping' denotes the geometric transformation from a 3D mesh surface to a 2D image plane, and 'uv' may be referred to as texture coordinates, since surface texture patterns are recorded in the 2D rectangular area specified by the uv coordinate system.

As for the practical data structure, a face in a 3D mesh is represented by its three constituent vertices, each recording 3D coordinate values as well as texture coordinates. Note that the texture coordinates of a 3D vertex may be recorded differently depending on which face it is considered to belong to. When all faces in a uv unwrapped 2D mesh are painted and/or include texture patterns, we refer to it as a texture image. The uv unwrapping is also employed in 3D CG animation, since it enables artists to paint texture patterns with 2D image painting software instead of painting directly on a 3D mesh surface.
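The first two steps of the algorithm can be sketched as follows. This is an illustrative simplification, not Blender's actual smart uv project implementation; the greedy clustering, the angle threshold, and the helper-axis choice for building the projection basis are assumptions of this sketch:

```python
import numpy as np

def cluster_by_normal(normals, angle_limit_deg=66.0):
    """Greedily group faces whose normals lie within angle_limit_deg of a
    cluster's seed normal (step 1; connectivity is deliberately ignored)."""
    cos_limit = np.cos(np.radians(angle_limit_deg))
    unassigned = list(range(len(normals)))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        members, rest = [seed], []
        for f in unassigned:
            if np.dot(normals[seed], normals[f]) >= cos_limit:
                members.append(f)
            else:
                rest.append(f)
        unassigned = rest
        clusters.append(members)
    return clusters

def project_cluster(vertices_of_faces, normals, areas, members):
    """Step 2: orthogonal projection of a cluster onto the plane orthogonal
    to its area-weighted average normal, preserving shapes and sizes."""
    n = np.average(normals[members], axis=0, weights=areas[members])
    n /= np.linalg.norm(n)
    # Build an orthonormal basis (u, v) of the projection plane.
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper); u /= np.linalg.norm(u)
    v = np.cross(n, u)
    # Each 3D triangle becomes a 2D triangle in (u, v) coordinates.
    return [[(p @ u, p @ v) for p in vertices_of_faces[f]] for f in members]
```

A triangle lying in a cluster's average plane keeps its area exactly under this projection, which is the property the algorithm exploits to avoid the shape and size distortions of the naive uniform decomposition.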
The view-independent texture generation methods for 3D video employ uv unwrapping to generate and store surface texture data, while the view-dependent texture generation methods exploit augmented mesh parameterizations for interactive 3D video visualization. The next section presents appearance-based view-independent texture generation, where a uv unwrapped mesh image is textured based on observed multi-view images.
5.3 Appearance-Based View-Independent Texture Generation

In this section we present an appearance-based view-independent texture generation method for 3D video. We demonstrate how significantly rendered 3D video contents are degraded if we neglect the inaccuracy of the camera calibration and the errors of the shape reconstruction, and stick to view-independent methods similar to those in computer graphics without introducing a virtual camera into the texture generation process. Algorithm 1 describes the overall computational process. (The notation we employ will be explained in Sect. 5.3.1.) Namely, we first generate a partial texture image where the uv unwrapped mesh image is partially textured by
Algorithm 1: Appearance-based view-independent texture generation
foreach camera C in C do
  TC ← create a partial texture image (Sect. 5.3.2).
T ← combine the partial texture images with a specified method (Sect. 5.3.3).
a captured image, and then combine them with a certain criterion. This texture generation method does not take into account the position and direction of a virtual viewpoint, and hence the generated texture is static: it is generated before rendering and never changes during interactive visualization. In what follows, we first introduce the notation and the studio configuration, and then describe the partial texture generation process, followed by several methods of combining partial texture images into a full texture image of a given 3D mesh. Object images rendered from several viewpoints are presented to demonstrate what the artifacts look like. We discuss the reasons for their poor image quality, which leads us to view-dependent texture generation methods.
5.3.1 Notation and Studio Configuration

We first summarize the notation used in this chapter as follows:
• Ci (i = 1, 2, . . . , nc), nc = number of cameras: the ith camera placed in the multi-view video capture studio. ICi denotes an image captured by Ci.
• VCi: the viewing-direction vector of Ci.
• C: the set of cameras.
• M: a reconstructed 3D mesh of an object in motion at time t, which comprises vertices vi ∈ R3 (i = 1, 2, . . . , nv), nv = number of vertices, and faces fj (j = 1, 2, . . . , nf), nf = number of faces. Note that we do not denote t explicitly, since all the presented texture generation methods process one video frame at a time independently of the others. We use a mesh with triangular faces.1
• Nf: the normal vector of face f.
• p_f^I: the 2D positions of the three constituent vertices of face f on I.
• T: a generated texture image.

Figure 5.6 shows the configuration of the studio (Studio B in Table 2.3) including the cameras and a MAIKO. Note that the checker pattern of the floor was generated just for illustration; the real studio floor is uniformly painted gray, as is the wall. The

1 As described in Chap. 4, the resolution of reconstructed 3D mesh data is approximately 5 mm in the distance between a pair of neighboring vertices. Except for the vertex-based method described in Sect. 5.4, we will use a decimated mesh in this chapter. Since the other texture generation methods compute texture patterns of mesh faces, the fidelity of rendered object images can still be realized with high-resolution texture image data.
Fig. 5.6 Studio configuration
studio space is a dodecagonal prism: its diameter and height are 6 m and 2.4 m, respectively. It is equipped with 15 cameras (Sony XCD-710CR; XGA, 25 fps) that have identical lenses, except for three cameras: the frontal camera (Camera 13) and the ceiling cameras (Cameras 14 and 15) are designed to capture zoomed-up images of the object. In the figure, the positions, viewing directions, and fields-of-view of the cameras are illustrated by the quadrilateral pyramids. A set of captured multi-view images is shown in Fig. 5.7. At this frame, the MAIKO was out of the field-of-view of Camera 15. Note that most of the images include only partial views of the object; the intra-frame mesh deformation method presented in Sect. 4.4.1.1 was applied to reconstruct the full 3D object shape.

The input and output data of the texture generation method described in this section are the following, assuming that uv unwrapping has been applied to the 3D mesh before texture generation.

Input: Camera parameters, multi-view images, and a 3D mesh with uv texture coordinates.2
Intermediate output: A single texture image generated from multiple partial texture images.
Output: A rendered image of 3D video.
2 The original mesh consists of 142,946 vertices and 285,912 faces, and is decimated to 1,419 vertices and 2,857 faces for the face-based methods.
Fig. 5.7 A set of captured multi-view images
5.3.2 Generating Partial Texture Images

A partial texture image is a uv unwrapped mesh image in which only the surface areas observable from a camera are textured, using the image data observed by that camera. To generate a partial texture image, we need to know two transformations: (1) the projection from the 3D mesh to the image plane of the camera, and (2) the geometric transformation from the 3D mesh to the unwrapped mesh image. The former is given by camera calibration and the latter is computed by the uv unwrapping process described in Sect. 5.2. The partial texture image generation process includes the following two steps: (1) generating a depth map to efficiently compute the visibility of the 3D mesh surface from a specified camera, and (2) generating a texture image from the image captured by that camera. The detailed computational process of each step is described below.
5.3.2.1 Depth Map

To create a partial texture image from a captured image, we need to know which parts of the 3D object surface can be observed from the camera. More precisely, we need to know which surface point of the 3D shape each pixel of the captured image corresponds to. We can compute the correspondence by casting a ray from the projection center of the camera through a pixel on the image plane and taking, among the intersections between the ray and the 3D mesh surface, the point closest to the camera. In practice, we create a depth map to find the correspondences efficiently, instead of performing ray casting every time it is needed.
Algorithm 2: Generating depth map
DCi ← create a map with the same geometry as ICi and initialize it to infinity.
foreach face f in M do
  if Nf · VCi < 0 then
    p_f^{DCi} ← compute the 2D positions of the vertices of face f on DCi.
    foreach pixel p inside the triangle defined by p_f^{DCi} do
      P ← compute the 3D position on M corresponding to p.
      d ← compute the depth between P and the projection center of Ci.
      if d < DCi[p] then DCi[p] ← d
Fig. 5.8 Depth maps. Darker pixels are closer to the camera
The depth map is an image that contains depth values instead of colors, where depth denotes the distance from the projection center of the camera to a surface point of the 3D mesh. It is also called a depth buffer, Z-buffer, or W-buffer, and is often utilized in 3D computer graphics [8, 15]. Algorithm 2 shows the procedure to generate the depth map DCi for camera Ci. Figure 5.8 shows depth maps visualized as grayscale images, where darker pixels are closer to the camera. Note that the object is not observable from Camera 15, and thus no depth map is generated for that camera.
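Algorithm 2 is essentially a software z-buffer. The following is a minimal sketch under simplifying assumptions: the `project` callback stands in for the calibrated projection of camera Ci (returning pixel coordinates and depth), and back-face culling is omitted for brevity:

```python
import numpy as np

def barycentric(tri, q):
    """Barycentric coordinates of point q in 2D triangle tri, or None if outside."""
    a, b, c = tri
    m = np.array([b - a, c - a]).T
    det = np.linalg.det(m)
    if abs(det) < 1e-12:
        return None                       # degenerate triangle
    s, t = np.linalg.solve(m, np.asarray(q) - a)
    if s < 0 or t < 0 or s + t > 1:
        return None
    return np.array([1 - s - t, s, t])

def render_depth_map(vertices, faces, project, width, height):
    """Rasterize mesh faces into a depth map, keeping the smallest depth
    per pixel, as in Algorithm 2. `project` is an assumed helper mapping a
    3D point to (u, v, depth) on the image plane of the camera."""
    depth = np.full((height, width), np.inf)
    for f in faces:
        p = np.array([project(vertices[i]) for i in f])   # (3, 3): u, v, d
        # Bounding box of the projected triangle, clipped to the image.
        u0, v0 = np.floor(p[:, :2].min(axis=0)).astype(int)
        u1, v1 = np.ceil(p[:, :2].max(axis=0)).astype(int)
        for row in range(max(v0, 0), min(v1 + 1, height)):
            for col in range(max(u0, 0), min(u1 + 1, width)):
                b = barycentric(p[:, :2], (col + 0.5, row + 0.5))
                if b is not None:
                    d = b @ p[:, 2]                       # interpolated depth
                    if d < depth[row, col]:
                        depth[row, col] = d
    return depth
```

Pixels never covered by any face keep the value infinity, which corresponds to the "no surface observed" case in the book's depth maps.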
Algorithm 3: Partial texture generation
TCi ← create a texture image and initialize its pixels to uncolored.
foreach face f in M do
  if Nf · VCi < 0 then
    p_f^{TCi} ← compute the 2D positions of the vertices of f on TCi.
    foreach pixel p′ inside the triangle defined by p_f^{TCi} do
      P ← compute the 3D position on M corresponding to p′.
      d ← compute the depth between P and the projection center of Ci.
      p ← compute the 2D pixel position on ICi corresponding to P.
      if d − DCi[p] < threshold then TCi[p′] ← ICi[p].
Once a depth map is created, we can efficiently find the correspondence between image pixels and surface points by comparing values in the depth map with computed distance values between 3D points on the surface and the projection center of a camera.
5.3.2.2 Partial Texture Image Generation

Algorithm 3 shows the procedure for generating the partial texture image TCi for camera Ci using the depth map DCi. Note that the geometry (width and height) of the texture image, i.e. the uv unwrapped mesh image, can be specified arbitrarily, but powers of two are efficient for graphics processors. Figure 5.9 shows the partial texture images generated for the cameras. There is no partial texture image for Camera 15, for the same reason as in the depth-map generation.
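The visibility test at the heart of Algorithm 3 can be sketched per pixel as follows. The depth map and the projection are assumed given; the function name and the `project` helper are illustrative, not taken from the book:

```python
import numpy as np

def copy_if_visible(texture, uv_pixel, surface_point, depth_map, image,
                    project, threshold=0.01):
    """Copy one captured pixel into the partial texture image if the
    corresponding surface point is visible from the camera, as in
    Algorithm 3. `project` maps a 3D point to (u, v, depth)."""
    u, v, d = project(surface_point)
    ui, vi = int(round(u)), int(round(v))
    h, w = depth_map.shape
    if not (0 <= ui < w and 0 <= vi < h):
        return False                      # projects outside the image
    if d - depth_map[vi, ui] >= threshold:
        return False                      # occluded: another surface is closer
    texture[uv_pixel] = image[vi, ui]     # visible: copy the observed color
    return True
```

The threshold absorbs the small numerical difference between the rasterized depth and the recomputed point depth; a surface point passes the test only when it is (nearly) the closest surface along its viewing ray.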
5.3.3 Combining Partial Texture Images

After obtaining a set of partial texture images, we combine them with a certain criterion to create a complete texture image for a 3D mesh. Since the images have the same geometry, the combination process just computes a color value for each pixel from the partial texture images.3

3 We ignore the uncolored pixels in this process. Namely, the average and the median processes described later in this section calculate colors only from colored pixels, and the face normal and the face size processes also choose colors only from colored pixels.
Fig. 5.9 Partial texture images. The uncolored pixels are shown in white in the images
We developed the four combination processes described below: two image-based and two geometry-based. The former comprise the average and median processes, the latter the face normal and face size processes. Note that the view-dependent texture generation method described in Sect. 5.4 uses the average process in its computation, the harmonized texture generation method described in Sect. 5.5 uses the average, face normal, and face size processes implicitly, and the generic texture generation method described in Chap. 6 uses the median process. The characteristics of each combination process are summarized as follows:

Average: This process creates a full texture image by averaging the color values of the pixels at the same position in all partial texture images. (The uncolored pixels are ignored.) It can create a smooth texture image but introduces blurring artifacts in regions with strong specular reflections, i.e. highlights, because the captured colors around such regions vary considerably with the viewing directions of the cameras. Ghosting as well as blurring artifacts appear due to the inaccuracy of the camera calibration and errors in the reconstructed 3D shape.

Median: This process is similar to the average process but can reduce the blurring artifacts caused by strong highlights, because highlights appear only in a narrow range of reflection angles and can be suppressed by taking the median of the color values in the multi-view images. In this sense, this method can be used for estimating
the diffuse reflection parameter of the surface. In Chap. 6, the texture image is in fact generated with this method for estimating surface reflectance properties and computing lighting effects on a 3D object surface. However, the generated texture image is in general noisier than the one generated by the average process.

Face normal: This process extracts the texture of each face of the 3D mesh from the image captured by the best face-observing camera, i.e. the one most directly oriented toward the face. Since this method does not blend color values, it can create sharp and clear texture images, but textural discontinuities appear at face boundaries where the best face-observing camera switches from one to another.

Face size: Although this process is similar to the face normal process, it takes into account the face size instead of the normal vector. The pros and cons are the same as for the face normal process. This process works well when the size of an object in the observed multi-view images varies a lot: the highest-resolution, i.e. largest, texture image for each face is extracted to generate the texture image.

Figure 5.10 shows the complete texture images generated by the processes described above. We first examine the results of the image-based processes. The average process can generate a smooth texture image but introduces blurring artifacts (Fig. 5.10(a)). The median process, on the other hand, can generate a sharp and clear texture image. In particular, it reduces the ghosting artifacts (doubly projected patterns) on the sleeves, the body, etc., which are introduced by the average method (Fig. 5.10(b)). However, some noisy areas are introduced at the sash on the object's back, the sleeves, etc. These points will be observed more clearly in the rendering results in the next section. Secondly, the geometry-based processes can generate a sharper and clearer texture image than the image-based ones.
We can see very clear patterns on the clothes and the sash in Fig. 5.10(c) and Fig. 5.10(d). However, large textural discontinuities between mesh faces appear in the images. The sources of the discontinuities can be classified into two types: color mismatch and geometric misalignment. The color mismatch is caused by the inaccuracy of the color calibration of the cameras, as well as by the deviation of the reflectance properties from the assumed Lambertian model. The geometric misalignment is caused by the inaccuracy of the geometric calibration of the cameras and by errors in the shape reconstruction. The face size process produces more artifacts around the object's face, sleeves, etc. than the face normal process, because distorted texture images are selected whenever the sizes of their projected areas on the captured images are larger than the others, which makes the combined texture distorted. These points will also be shown clearly in the next section.
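Since all partial texture images share the same uv geometry, the two image-based combination criteria reduce to a per-pixel reduction over the stack of partial images. A sketch, with the assumption (ours, not the book's) that uncolored pixels are marked by NaN:

```python
import numpy as np

def combine(partials, mode="average"):
    """Combine partial texture images (each H x W x 3) pixel by pixel.
    Uncolored pixels are NaN and are ignored, mirroring footnote 3."""
    stack = np.stack(partials)              # (n_cameras, H, W, 3)
    if mode == "average":
        return np.nanmean(stack, axis=0)    # smooth, but blurs highlights
    if mode == "median":
        return np.nanmedian(stack, axis=0)  # robust to specular outliers
    raise ValueError(mode)
```

The median's robustness to outliers is exactly why it suppresses highlights: a specular color seen by only one or two cameras is an outlier in the per-pixel distribution and does not affect the median, while it pulls the mean.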
5.3.4 Discussions

Figure 5.11 shows renderings of 3D object images from different viewpoints based on the combined texture images shown in Fig. 5.10. As described in the previous
Fig. 5.10 Combined texture images with different criteria
section, each combination process has pros and cons, which can be observed more clearly in the close-up images in the figure. The image-based processes can generate smooth images, although blurring and ghosting artifacts are introduced, especially by the average process. This is because the 3D shape and the camera calibration are not perfectly accurate. In the experiment, in fact, the partial texture image generated from Camera 14 (the ceiling camera) is not consistent with the other images (see Fig. 5.9). Consequently, the average process introduces many artifacts into the texture image. The median process can remove such artifacts by excluding outliers, even though its blurring factor is larger than in the geometry-based processes.
Fig. 5.11 Object images rendered with different texture images
On the other hand, the geometry-based processes can generate very sharp and clear images. Nevertheless, although clearer patterns on the sash, the sleeves, and so on can be generated, discontinuity artifacts on the surface, especially on the sash, are introduced. As described in the previous section, the reflectance of the FURISODE is not perfectly Lambertian, since it is made of silk and some patterns are embroidered.

As readers may have noticed, the process of combining partial texture images described here has much in common with the photo-consistency evaluation of the 3D shape reconstruction described in Sect. 4.3.1. Even though their objectives are different, the former being a combination and the latter an evaluation, multi-view images are compared with each other with similar criteria in both cases. Moreover, both face the same problem: how to cope with non-Lambertian surfaces. In addition to this problem, texture generation for 3D video has to comply with contradicting requirements, namely smoothness and sharpness, under inaccurate camera calibration and erroneous shape reconstruction.

This section has shown that a simple view-independent texture generation method based on the texture mapping of CG cannot generate high quality texture for 3D video. To solve these problems, the next section introduces a view-dependent texture generation method, which brings the virtual viewpoint used for 3D video visualization into the texture generation process.
5.4 View-Dependent Vertex-Based Texture Generation

As discussed in the previous section, to generate high quality texture for 3D video we need to solve the following contradictory problems in addition to the non-Lambertian surface problem.

Smoothness: If blending is simply applied to smooth the texture, then blurring and ghosting artifacts are created.
Sharpness: If the best face-observing image is simply selected and copied to avoid blurring and ghosting, then texture discontinuity artifacts are created.

If perfectly accurate camera parameters and 3D object shape could be obtained, as in 3D CG animation, these problems would never arise. In practice, however, the errors are inevitable, and we have to solve the problems by augmenting the texture generation algorithms. In this section, we introduce two ideas to cope with these problems: view-dependent and vertex-based texture generation.

View-dependent: The view-dependent texture generation dynamically updates the texture image generated from multi-view images based on the virtual viewpoint of the 3D object [4]. Namely, it blends the multi-view images by controlling weighting factors computed from the angular differences between the viewing directions of the virtual camera and the real cameras. This method can reduce both blurring and discontinuity artifacts by smoothly blending the multi-view images as well as dynamically changing the best face-observing camera: larger weighting factors are given to cameras whose viewing directions are closer to that of the virtual camera. It ensures that exactly the same object appearance as in a captured image is rendered when the viewing direction of the virtual camera coincides with that of one of the real cameras. Moreover, the non-Lambertian surface problem can be mitigated by this method, since the texture generation can simulate the viewing-direction dependency of non-Lambertian surface appearances.
When the viewing direction of the virtual camera is closely aligned with the direction of a light ray specularly reflected on a non-Lambertian surface, the view-dependent texture generation method generates a texture image that represents the specular reflection well. When the viewing direction of the virtual camera is far from the specularly reflected light, on the other hand, a texture image without specular reflections is generated. Such dynamic appearance changes of a surface depending on the viewing angle make the surface be perceived as made of a shiny material. It should be noted that this simulation is valid only under the lighting environment in which the multi-view video data were captured. In other words, the view-dependent texture generation encodes the lighting environment into the generated texture images, together with the 3D object shape and its surface texture. Hence we cannot modify the lighting environment freely as in computer graphics; in this sense, the view-dependent texture generation is just one of the appearance-based texture generation methods under a fixed lighting environment.
Vertex-based: The vertex-based texture generation method generates a texture image by interpolating vertex colors. This method never generates discontinuity artifacts, because neighboring faces share the same vertices. One disadvantage, however, is that a very high-resolution 3D mesh is required to represent fine details of the surface texture pattern. As for vertex-based 3D shape representation and surface texture generation, see the point-cloud-based technologies in [7].

In what follows, we present the algorithm of the view-dependent vertex-based texture generation,4 and then discuss the quality of the rendered images and the problems of this method.
5.4.1 Algorithm

The input data are the same as in Sect. 5.3.1 except for the mesh resolution. We use the original mesh with 5 mm resolution rather than the decimated one, because the vertex-based method requires a dense mesh. We introduce the following notation in addition to that defined in Sect. 5.3.1:
• v: a vertex of M.
• Nv: the normal vector of vertex v, computed by averaging the normals of the faces that share vertex v.
• VĈ: the viewing direction of the virtual camera Ĉ.
• wCi: the weighting factor of camera Ci used for combining the multi-view images.
• c[v]: the vertex color of v.
• c̊[v, Ci]: the vertex color of v extracted from the image ICi captured by Ci.

Algorithm 4 shows the computational process. We first compute the color values of the vertices visible from each camera as an offline process. This vertex visibility test is done using multi-view depth maps as described in Sect. 5.3.2.1, while the color values of the invisible vertices are marked as uncolored. We then dynamically compute, in real time, the weighting factors and the color values of the vertices viewed from the virtual camera. The weighting factors are computed from the dot products between the viewing directions of the virtual camera and the real cameras: when the viewing direction of a real camera gets close to that of the virtual camera, its weighting factor becomes almost 1.0 whereas the others become almost 0.0.5 Thus, the generated texture image becomes equivalent to the image captured by that camera. After obtaining the color values of all vertices visible from the virtual camera, the texture of each face is generated by interpolating the vertex colors as follows:

4 An earlier version was published in [13].
5 The sharpness of this weighting factor attenuation is controlled by the γ value in Algorithm 4. We heuristically set γ = 50 in the following experiments.
Algorithm 4: View-dependent vertex-based texture generation
/* OFFLINE: Extract vertex colors from each camera. */
foreach vertex v in M do
  foreach camera Ci in C do
    c̊[v, Ci] ← extract a color value of v if it is visible from Ci.

/* ONLINE: Compute vertex color values for the virtual viewpoint. */
VĈ ← specify the viewing direction of the virtual viewpoint.
foreach camera Ci in C do

  wCi ← ( (VĈ · VCi + 1) / 2 )^γ ,    (5.1)

where γ denotes a control parameter for weighting.
foreach vertex v in M do
  if Nv · VĈ < 0 then
    /* Update weighting factors. */
    foreach camera Ci in C do
      if c̊[v, Ci] is uncolored then w̄Ci ← 0 else w̄Ci ← wCi
    /* Compute vertex color. */

    c[v] ← Σ_{i}^{nc} ( w̄Ci / Σ_{j}^{nc} w̄Cj ) c̊[v, Ci].    (5.2)

foreach face f in M do
  Generate the texture of f by interpolating the colors c[v] of all vertices of f (see text).
• All three vertices of a face have color values: generate the texture of the face by bilinear interpolation of the three values.
• Two vertices of a face have color values: generate the texture of the face by linear interpolation of the two values.
• One vertex of a face has a color value: paint the face with that value.
• No vertex of a face has a color value: no texture is generated (the face is painted with a default color such as gray).
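Equations (5.1) and (5.2) of Algorithm 4 can be sketched in NumPy as follows. Marking unobserved vertex colors with NaN is an assumption of this sketch, not the book's data structure; viewing directions are assumed to be unit vectors:

```python
import numpy as np

def camera_weights(v_virtual, v_cameras, gamma=50.0):
    """Eq. (5.1): weight each real camera by the angular closeness of its
    viewing direction to the virtual one; gamma sharpens the falloff
    (the book heuristically uses gamma = 50)."""
    v_virtual = v_virtual / np.linalg.norm(v_virtual)
    dots = v_cameras @ v_virtual            # (nc,) dot products
    return ((dots + 1.0) / 2.0) ** gamma

def blend_vertex_color(colors, weights):
    """Eq. (5.2): normalized weighted average of the per-camera vertex
    colors, with weights zeroed where the vertex was not observed."""
    w = weights.copy()
    w[np.isnan(colors).any(axis=1)] = 0.0   # uncolored: w-bar = 0
    if w.sum() == 0.0:
        return None                         # vertex seen by no camera
    return np.nansum(colors * w[:, None], axis=0) / w.sum()
```

With gamma = 50, a camera whose viewing direction coincides with the virtual one gets weight 1.0 while a camera at 90 degrees gets 0.5^50 (about 1e-15), so the rendered appearance effectively snaps to the coinciding real camera, as the text describes.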
Fig. 5.12 Images rendered by the view-dependent vertex-based texture generation
5.4.2 Discussions

Figure 5.12 shows 3D object images rendered by the view-dependent vertex-based texture generation. The viewpoints used for generating these images are the same as those used for Fig. 5.11. As shown in the figure, the method generates sharper and clearer images than the image-based processes, and smoother images than the geometry-based processes. For example, we can clearly see the face of the MAIKO and the colorful patterns on the sash and the sleeves, without any strong discontinuity artifacts on the surface. Figure 5.13 demonstrates the non-Lambertian surface appearances simulated by the view-dependent method: the shiny appearance of the sash changes depending on the viewing direction of the virtual camera, while the appearance is fixed with the view-independent method. These rendered images demonstrate the effectiveness of the view-dependent vertex-based method for 3D video texture generation.

To conclude this section, we have to mention that the quality of the view-dependent vertex-based texture generation remains lower than that of the captured images, because the rendered images are still blurred by the vertex-based texture interpolation. Although we can generate 3D mesh data of finer resolution by subdividing the faces of the original mesh, the improvement of the rendered image quality is limited by the calibration and shape reconstruction errors [13].
Fig. 5.13 Simulated appearances of the non-Lambertian surface, the sash of MAIKO. Upper: the view-dependent vertex-based method. Lower: the view-independent surface-based method with the surface normal process
Moreover, this method does not directly address the inaccuracy of the camera calibration and the errors in the shape reconstruction, even though it can reduce the artifacts to a certain level. In the next section, we introduce the best texture generation method for 3D video to our knowledge, namely harmonized texture generation, which copes with these problems and generates non-blurred texture by dynamically deforming the geometry of the texture as well as its color patterns.
5.5 Harmonized Texture Generation

5.5.1 Background

As discussed in the previous section, while the view-dependent vertex-based texture generation enables us to visualize 3D video with better image quality than the view-independent method, there remains room for further improvement. In this section, we present the harmonized texture generation method.6 Its most distinguishing characteristics are that the inaccuracy of the camera calibration and the errors in the shape reconstruction are explicitly computed and managed in the texture generation process, and that surface texture patterns can be generated in almost the same quality as the originally captured images.

The key ideas of the harmonized texture generation are (1) view-dependent deformation of local surface textures and geometric shapes, and (2) mesh optimization that works together with the deformation. Idea (1) enables us to render sharp and smooth images by dynamically deforming captured multi-view local images so that they become consistent with each other on the reconstructed 3D object surface. Idea (2) leads us to a mesh optimization that facilitates the texture deformation and rendering. These ideas enable the harmonized texture generation to generate high quality texture for 3D video even if the reconstructed 3D mesh deviates from the actual 3D object shape due to the inaccuracy of the camera calibration and errors in the shape reconstruction.

Incidentally, Eisemann et al. proposed floating textures [6], which reduce blurring and ghosting artifacts by dynamically warping local texture patterns depending on a virtual viewpoint so that the textures are consistent with each other. They assume that the camera parameters and the reconstructed 3D shape are almost accurate, though not perfect, and thus that small pixel-wise deformations are sufficient to compensate for the texture inconsistency.
Although the basic concept of the floating textures is similar to that of the harmonized texture generation, the differences are that (1) the former processes pixels while the latter processes mesh data, and (2) the latter uses a 3D mesh optimization for texture deformation that enables more effective deformations of local texture patterns. The distinguishing characteristics of the harmonized texture generation which the floating textures do not share are listed as follows:

6 An earlier version was published in [20].
Adaptive texture deformation: The degree of inconsistency among the captured multi-view images on the reconstructed 3D mesh, which is incurred by errors in the calibration and the shape reconstruction, varies over local surface areas; errors in concave areas are often larger than those in convex areas. In 3D video of objects of complex shape and motion, especially of MAIKO dances, accurate reconstruction of heavy concavities with a limited number of cameras is very difficult. To cope with this uneven shape error distribution, the harmonized texture generation conducts a coarse-to-fine 3D mesh optimization by evaluating the degrees of consistency in both local surface shape and texture. With this function, it can adaptively control where and by how much local textures should be deformed.

Nonlinear texture deformation: Coupled with the mesh optimization, the locally linear face-based texture deformation realizes a globally nonlinear texture deformation, which enables the harmonized texture generation to cope with larger errors in the calibration and the shape reconstruction.

Real-time rendering: The harmonized texture generation can render images in real time (≥ 30 fps) from 15 viewpoint images, since it performs the 3D mesh optimization as an offline process before rendering.
5.5.2 Algorithm Overview

The key idea of the harmonized texture generation, i.e. the view-dependent deformation of multi-view local surface textures coupled with the 3D mesh optimization, enables us to render non-blurred and high quality images even with large local errors in the calibration and the shape reconstruction. It admits the errors and harmonizes multi-view images based on a specified viewpoint.

The following notations are used to describe the computational process of the harmonized texture generation in addition to those defined in Sects. 5.3.1 and 5.4.1.

$\hat{M}^{C_i}$ $(i = 1, 2, \ldots, n_c)$: the mesh projected onto the image plane of camera $C_i$. We refer to this mesh as a projected mesh.

$p_v^{\hat{M}^{C_i}} \in \mathbb{R}^2$: the 2D position of a projected vertex of $\hat{M}^{C_i}$. This value is valid only when vertex $v$ is visible from $C_i$.

$I_f^{\hat{M}^{C_i}}$: the triangular image area, i.e. a group of pixels, in $I^{C_i}$ defined by the projected face $f$ of $\hat{M}^{C_i}$. We refer to this image as a face-image segment.

$I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$: a transformed version of $I_f^{\hat{M}^{C_j}}$, which is transformed so that its geometry as a group of pixels is aligned with that of $I_f^{\hat{M}^{C_i}}$. With this transformation, the texture discrepancy between $I_f^{\hat{M}^{C_j}}$ and $I_f^{\hat{M}^{C_i}}$ can be computed by a simple pixel-based similarity evaluation. We refer to this image as a transformed face-image segment.

$I_v^{\hat{M}^{C_i}}$: a polygonal image region on the image plane of camera $C_i$ consisting of a set of neighboring $I_f^{\hat{M}^{C_i}}$s that share vertex $v$. We refer to this image as a vertex-image segment.

$I_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$: a transformed version of $I_v^{\hat{M}^{C_j}}$, which is transformed so that its geometry as a group of pixels is aligned with that of $I_v^{\hat{M}^{C_i}}$. With this transformation, the texture discrepancy between $I_v^{\hat{M}^{C_j}}$ and $I_v^{\hat{M}^{C_i}}$ can be computed by a simple pixel-based similarity evaluation. We refer to this image as a transformed vertex-image segment.

$\bar{T}$: the harmonized texture image that is defined on the image plane of the user-specified virtual viewpoint. The geometry of the image, i.e. width, height, and resolution, is also specified by the user. $\bar{T}_f$ denotes a texture segment of face $f$ in $\bar{T}$.

As shown in Fig. 5.14, the computational flow of the harmonized texture generation is divided into two stages, i.e. the view-independent mesh optimization (offline) and the view-dependent texture deformation (online):

1. Mesh simplification: In the offline stage, we optimize the 3D mesh reconstructed from multi-view images so that the texture deformation in the succeeding stage can be facilitated. The mesh optimization is done based on a coarse-to-fine strategy. That is, we first reduce the number of faces of the 3D mesh so that the size of each face is large enough to include sufficient feature points for the later matching.

2. Mesh refinement and deformation vector computation: We then refine the mesh by subdividing faces for a precise texture deformation. In other words, the mesh optimization restructures the 3D mesh with respect to texture consistency as well as shape preservation. To assess surface texture misalignments among multi-view images incurred by shape reconstruction errors, we first compute deformation vectors for each projected vertex of the mesh on each image plane. A deformation vector for vertex $v$ on the image plane of camera $C_i$ represents a misalignment between its projected images in a pair of observed images $I^{C_i}$ and $I^{C_j}$ (cf. Fig. 5.17). The detailed computation algorithm is given below. With the deformation vectors, the degree of texture misalignment is computed for each face. Then, faces with large misalignments are recursively divided into sub-faces until the misalignment degree or the face size becomes small enough.
Note that we do not try to reconstruct a more accurate 3D shape in the refinement process because: (1) we have already optimized the shape by the 3D shape reconstruction process described in the previous chapter, and (2) it is hard to estimate the shape more accurately with inaccurate camera parameters. The deformation vectors are also used as key information for the view-dependent texture deformation in the succeeding stage.

3. Harmonized position computation: In the online stage, we first compute the harmonized position of each projected vertex on each image plane with the deformation vectors and the user-specified virtual viewpoint. Note that the harmonized position changes depending on the viewing direction. The harmonized positions of the three vertices of a face specify from which areas in each multi-view image the texture image of that face is extracted for view-dependent texture blending.

4. Texture extraction: Texture images are extracted for each face of the mesh using the harmonized positions from the multi-view images.

5. Texture transformation and blending: Extracted textures are transformed into harmonized texture images, in which the extracted textures can be compiled. We finally generate a non-blurred and high quality texture for the 3D video object by blending the harmonized texture images depending on the virtual viewpoint.

Fig. 5.14 Processing flow of the harmonized texture generation

In summary, the harmonized texture generation implicitly realizes a nonlinear texture deformation depending on the user-specified viewpoint as well as a mesh optimization for an effective deformation. It enables us to generate non-blurred and high quality texture images even with inaccurate camera parameters and errors in the shape reconstruction. In the following sections, we describe the algorithms of the processes above in detail.
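The transformed face-image segments introduced above align a triangular patch from one camera's image plane with its counterpart on another. As a minimal illustration of such an alignment, the sketch below warps a triangular region between two image planes with an affine transform estimated from the three vertex correspondences. This is our own NumPy sketch (function names, nearest-neighbor resampling, and the rectangular output grid are our choices, not the book's implementation; in practice only pixels inside the destination triangle would be kept):

```python
import numpy as np

def affine_from_triangles(src, dst):
    """2x3 affine A such that A @ [x, y, 1]^T maps the src points to the dst points."""
    src = np.asarray(src, float)           # 3x2 triangle vertices
    dst = np.asarray(dst, float)           # 3x2 triangle vertices
    G = np.hstack([src, np.ones((3, 1))])  # 3x3 homogeneous source points
    # Solve G @ A.T = dst for the 3x2 matrix A.T, then return A (2x3).
    return np.linalg.solve(G, dst).T

def warp_face_segment(image, tri_src, tri_dst, out_shape):
    """Resample `image` so that tri_src lands on tri_dst (nearest neighbor)."""
    A = affine_from_triangles(tri_dst, tri_src)  # inverse mapping: dst -> src
    out = np.zeros(out_shape, image.dtype)
    h, w = out_shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    sx, sy = np.rint(A @ pts).astype(int)        # source coordinates per pixel
    ok = (0 <= sx) & (sx < image.shape[1]) & (0 <= sy) & (sy < image.shape[0])
    out[ys.ravel()[ok], xs.ravel()[ok]] = image[sy[ok], sx[ok]]
    return out
```

With identical source and destination triangles the warp is the identity; a translated destination triangle shifts the sampled content accordingly.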
5.5.3 Mesh Optimization

As described in the previous section, the mesh optimization consists of two processes: mesh simplification and refinement. In the mesh simplification, we first reduce the number of faces of the 3D mesh by edge collapse operations with shape preservation [10]. Then, we collapse edges of faces whose textures extracted from multi-view images are inconsistent with each other. An ordinary mesh simplification method generally collapses redundant edges alone and does not include the process of merging texture-inconsistent faces, because such an additional operation usually increases the discrepancy between the original and reduced meshes. The harmonized texture generation, on the other hand, eliminates such texture-inconsistent faces and represents them by larger faces. Although it may sound strange, this is the point of the mesh simplification of the harmonized texture generation. The basic idea is described in what follows.

In surface areas with large texture inconsistency, the 3D shape is not accurately reconstructed, mainly due to heavy surface concavities. This is because it is difficult to accurately reconstruct concave areas by silhouette-based and wide-baseline stereo-based 3D shape reconstruction methods. Furthermore, inaccurate camera parameters can produce large texture inconsistencies, as multi-view images may be projected onto the 3D mesh jumping over sharp surface concavities. In order to extract consistent textures for such areas, we need to deform images more dynamically than in texture-consistent areas. To realize large dynamic image deformation, larger faces are needed, and therefore such texture-inconsistent faces are eliminated and represented by larger faces. Consequently, the mesh simplification method employed in the harmonized texture generation enables us to adaptively reduce the 3D mesh with respect to both shape preservation and texture inconsistency.
After the mesh simplification, we refine the mesh by subdividing faces to include sufficient deformation vectors where textures are inconsistent.
5.5.3.1 Mesh Simplification

Algorithm 5 shows the procedure of the mesh simplification. In the first step, the number of faces is reduced in order to make faces large enough to include sufficient image features. Then, the face-image-segment discrepancy of each face is computed by Algorithms 6 and 7 (see also Fig. 5.15) to evaluate the texture inconsistency of the faces. After obtaining a set of face-image-segment discrepancies, the face with the maximum discrepancy value is selected. If its area is smaller than a certain threshold, it is merged with one of its neighboring faces. By iterating this process until no face is merged, we can sufficiently simplify the mesh to realize the dynamic image deformation.
Fig. 5.15 Computing the texture discrepancy of a face
Algorithm 5: Mesh simplification
repeat
    Mesh simplification by edge collapse with shape preservation.
until average size of faces > threshold
continue ← true
repeat
    foreach face f in M do
        D_f ← compute face-image-segment discrepancy of f (see Algorithm 6).
    f ← find a face that has the maximum D_f and whose area < threshold.
    if f is empty then
        continue ← false
    else
        Find the one of the three edges of f which, after elimination, preserves shape the most, and collapse it.
until continue is false
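The greedy loop in the second half of Algorithm 5 can be sketched with placeholder data structures: each "face" below only carries an area, a precomputed discrepancy standing in for Algorithm 6, and a single neighbor to merge into. This is a toy sketch of the selection/merge logic, not the authors' mesh implementation (a real edge collapse updates connectivity and re-evaluates discrepancies):

```python
from dataclasses import dataclass

@dataclass(eq=False)  # identity semantics, so `in` tests compare objects
class Face:
    area: float
    discrepancy: float          # stand-in for the Algorithm 6 value D_f
    neighbor: "Face | None" = None

def simplify_by_discrepancy(faces, area_threshold):
    """Repeatedly merge the highest-discrepancy small face into its neighbor."""
    alive = list(faces)
    while True:
        candidates = [f for f in alive
                      if f.area < area_threshold and f.neighbor in alive]
        if not candidates:
            return alive        # no face qualifies any more: done
        worst = max(candidates, key=lambda f: f.discrepancy)
        worst.neighbor.area += worst.area   # "collapse": grow the neighbor
        alive.remove(worst)
```

Each iteration removes one face, so the loop terminates; the merged neighbor's grown area eventually exceeds the threshold, mimicking how texture-inconsistent regions end up covered by larger faces.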
5.5.3.2 Mesh Refinement and Deformation Vector Computation

Following the mesh simplification, the deformation vectors and the mesh refinement are computed by subdividing faces to include sufficient deformation vectors. Algorithm 8 shows the overall procedure of the mesh refinement and the deformation vector computation.

First, the deformation vectors are computed by vertex-image-segment matching (described later). Then, for each face, the discrepancy of the deformed face-image segment that is generated using the deformation vectors is computed. This discrepancy indicates how much the texture consistency of a face can be improved using the deformation vectors. A large value of the discrepancy implies that the deformation vectors are not sufficient to generate consistent texture and that face subdivision is required for more dynamic deformation.

Algorithm 6: Computing face-image-segment discrepancy of face f
foreach face f in M do
    foreach camera C_i in C do
        /* Compute a camera index that evaluates how well a camera captures a face directly from its front as well as in an image resolution. */
        $e_{C_i}$ ← compute camera index of face f by Algorithm 7.
    $C_\ast$ ← find the best face-observing camera that has the largest camera index.
    Compute the face-image-segment discrepancy of face f by

        $D_f = \frac{1}{n_C - 1} \sum_{i,\, i \neq \ast} e_{C_i} \left\| I_f^{\hat{M}^{C_i}} - I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}} \right\|$,   (5.3)

    where C and $n_C$ denote the set of cameras that can observe face f and the number of such cameras, respectively. The difference between $I_f^{\hat{M}^{C_i}}$ and $I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}}$ is given by

        $\left\| I_f^{\hat{M}^{C_i}} - I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}} \right\| = \frac{1}{3 n_p} \sum_{p} \sum_{c \in \{R,G,B\}} \left| I_f^{\hat{M}^{C_i}}[p, c] - I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}}[p, c] \right|$,   (5.4)

    where $I_f^{\hat{M}^{C_i}}[p, c]$ and $I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}}[p, c]$ denote the intensity of color band c at pixel p in the triangular area of each face-image segment, respectively, and $n_p$ the number of pixels in the triangular area. As described at the beginning of Sect. 5.5.2, $I_f^{\hat{M}^{C_i}}$ and $I_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}}$ share the same pixel-based representation. Note that the difference values are weighted with the camera indices. Hence, difference values computed for cameras with larger camera indices affect the texture discrepancy more strongly.

Furthermore, faces without textures can exist due to self-occlusion, as shown in Fig. 5.16(a), because a texture is generated only for a face all of whose vertices are visible from at least one camera. Thus, faces without textures are also
Algorithm 7: Computing camera index for face f
Compute the camera index $e_{C_i}$ defined by

    $e_{C_i} = \alpha \bar{d}_{C_i} + (1 - \alpha) \bar{s}_{C_i}$,   (5.5)

where

    $\bar{d}_{C_i} = \frac{d_{C_i}}{\sum_j^{n_c} d_{C_j}}, \quad d_{C_i} = \left( \frac{-V_{C_i} \cdot N_f + 1}{2} \right)^{\gamma_1}, \quad \bar{s}_{C_i} = \frac{s_{C_i}}{\sum_j^{n_c} s_{C_j}}$,   (5.6)

and
• $s_{C_i}$: area of the face projected on the image plane of $C_i$, and
• $\alpha$, $\gamma_1$: weighting coefficients. The values of $\alpha$ and $\gamma_1$ are heuristically defined as 0.3 and 10, respectively.
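Eqs. (5.5)-(5.6) can be evaluated in a few lines. The sketch below (our own NumPy version, with function and argument names of our choosing) blends how frontally each camera observes the face, via the viewing direction $V_{C_i}$ and face normal $N_f$, with the normalized projected area $s_{C_i}$:

```python
import numpy as np

def camera_indices(view_dirs, face_normal, proj_areas, alpha=0.3, gamma1=10):
    """Camera index e_Ci of Eq. (5.5) for every camera observing a face."""
    V = np.asarray(view_dirs, float)       # n_c x 3 unit viewing directions V_Ci
    n = np.asarray(face_normal, float)     # unit face normal N_f
    d = ((-V @ n + 1.0) / 2.0) ** gamma1   # frontal-ness term d_Ci of Eq. (5.6)
    s = np.asarray(proj_areas, float)      # projected face areas s_Ci
    d_bar = d / d.sum()                    # normalized over all cameras
    s_bar = s / s.sum()
    return alpha * d_bar + (1 - alpha) * s_bar   # Eq. (5.5)
```

With the heuristic exponent γ₁ = 10, the frontal-ness term decays sharply, so a camera looking straight at the face dominates cameras seeing it at a grazing angle even when the projected areas are equal.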
Algorithm 8: Mesh refinement and deformation vector computation
repeat
    a set of deformation vectors ← compute vertex-image-segment matching (see Algorithm 9).
    a set of discrepancies ← compute the deformed face-image-segment discrepancy of each face with the deformation vectors (see Fig. 5.18 and text).
    continue ← evaluate discrepancies and self-occlusion, and then subdivide faces if required (see Algorithm 10).
until continue is false
Fig. 5.16 Subdivision of partially occluded faces. The gray colored faces are partially occluded by blue faces. Some of such faces are subdivided as shown in (b) to be textured. The red circles and lines denote newly introduced vertices and faces. ©2009 ITE [21]
marked to be subdivided for generating textures with surface areas as large as possible (Fig. 5.16(b)). In the third step, faces are subdivided considering the discrepancy and the self-occlusion at the same time. By iteratively conducting the above processes, a refined mesh and a set of deformation vectors are finally obtained. In the following, we describe the detailed procedures of the algorithm.
Fig. 5.17 Computing deformation vectors by the vertex-image-segment matching
Computing deformation vectors by vertex-image-segment matching: Figure 5.17 illustrates the computation of a deformation vector by the vertex-image-segment matching. As shown in the figure, a vertex is projected onto each image plane with the camera parameters, as illustrated by green circles and dashed lines. Here the focus is on image $I^{C_j}$. The vertex-image segment $I_v^{\hat{M}^{C_j}}$ of projected vertex $p_v^{\hat{M}^{C_j}}$ is extracted from $I^{C_j}$, as represented by the red polygonal region. Then, it is transformed onto the image plane of camera $C_i$, which is denoted by the transformed vertex-image segment $I_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$, as shown in the figure. The best match position of $I_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ around $p_v^{\hat{M}^{C_i}}$ on $I^{C_i}$ is sought where the image appearances of $I_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ and $I^{C_i}$ are the most similar to each other. The best match position is denoted by $p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$, which is illustrated by a filled circle near the projected vertex $p_v^{\hat{M}^{C_i}}$ on $I^{C_i}$. If the 3D shape and the camera parameters were perfectly accurate, $p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ would coincide with $p_v^{\hat{M}^{C_i}}$, i.e. the projected position of vertex $v$ on the image plane of $C_i$. In practice, however, $p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ is located away from $p_v^{\hat{M}^{C_i}}$ due to errors in the calibration and shape reconstruction. The displacement from $p_v^{\hat{M}^{C_i}}$ to $p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ is denoted by the deformation vector $v_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}} = p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}} - p_v^{\hat{M}^{C_i}}$. Note that a set of deformation vectors is computed for each projected vertex on each image; the displacements change depending on which projected vertex is used as the key reference point. Algorithm 9 shows the procedure described above in detail.

Computing the deformed-face-image-segment discrepancy: Figure 5.18 illustrates the procedure of computing the discrepancy value of a deformed face-
Algorithm 9: Computing deformation vectors by the vertex-image-segment matching
$\hat{M} = \{\hat{M}^{C_i} \mid i = 1, 2, \ldots, n_c\}$ ← compute projected meshes with M and a set of cameras C.
foreach projected mesh $\hat{M}^{C_i}$ in $\hat{M}$ do
    foreach vertex v in M do
        foreach projected mesh $\hat{M}^{C_j}$ in $\hat{M}$ do
            if vertex v is not visible from $C_i$ then
                $v_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ ← undefined.
            else if vertex v is not visible from $C_j$ or $C_j = C_i$ then
                $v_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ ← 0
            else
                $I_v^{\hat{M}^{C_j}}$ ← extract the vertex-image segment from $I^{C_j}$.
                $I_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ ← transform $I_v^{\hat{M}^{C_j}}$ from the image plane of $C_j$ to that of $C_i$.
                /* Search the best matched position. */
                $p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$ ← search the corresponding position in $I^{C_i}$ which matches best with $I_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$.
                /* Compute the deformation vector. */
                $v_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}} \leftarrow p_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}} - p_v^{\hat{M}^{C_i}}$.
Note that the deformation vector denotes a 2D vector on the image plane of camera $C_i$.
image segment. The computation process is similar to the evaluation process of the texture difference of a face-image segment described in Algorithm 6. For each face $f$: (1) first determine its best face-observing camera $C_\ast$. (2) For each camera $C_i$: (2-1) compute the best match positions on $I^{C_\ast}$ for the three vertices of face-image segment $I_f^{\hat{M}^{C_i}}$ on $I^{C_i}$, respectively: $p_{v_k}^{\hat{M}^{C_\ast} \leftarrow \hat{M}^{C_i}}$, $k = 1, 2, 3$. (2-2) Extract the triangular area from $I^{C_\ast}$ defined by the three best match positions. This area is referred to as a deformed face-image segment of $f$, $\hat{I}_f^{\hat{M}^{C_\ast},\, C_i}$. (2-3) Transform the deformed face-image segment so that its geometry as a group of pixels coincides with $I_f^{\hat{M}^{C_i}}$. (2-4) Compute the discrepancy value between the transformed deformed face-image segment, $\hat{I}_f^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_\ast}}$, and $I_f^{\hat{M}^{C_i}}$ by the image difference computation used in Algorithm 6. (3) Just as in Algorithm 6, compute the deformed-face-image-segment discrepancy of $f$, $\hat{D}_f$, as the weighted average of the discrepancy values computed for all cameras that can observe $f$.

Fig. 5.18 Computing the deformed-face-image-segment discrepancy

Evaluation of discrepancies, self-occlusion, and subdivision of faces: With $\hat{D}_f$, we can evaluate the effectiveness of the deformation vectors computed for $f$. A small value of $\hat{D}_f$ implies that the deformation vectors are good enough to compensate for the image inconsistency on the face. Usually most faces have small values because the reconstructed 3D shape and camera parameters are almost accurate. In contrast, a large value of $\hat{D}_f$ implies that the texture patterns on $f$ have large mutual inconsistencies, which leads to subdividing $f$ to localize such inconsistencies. Moreover, self-occlusion is inspected as described at the beginning of this section to find texture-less faces, which are then also subdivided for possible texture generation on the subdivided faces. The face subdivision is performed by adding vertices at the center of each edge of a face. To preserve the topology of the 3D mesh, first, faces that are required to be subdivided are found, and then a type (Type 0 to 4) is assigned to all the faces based on the state of the adjacent faces, as shown in Fig. 5.19. Finally, the faces are subdivided at the same time after the assignment is completed. See Algorithms 10 and 11 for the detailed procedure.
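The best-match search at the heart of the deformation vector computation is a small template-matching problem: slide the transformed segment over a window around the projected vertex and keep the offset with the smallest difference. The sketch below is a minimal sum-of-squared-differences version in NumPy (exhaustive search, SSD as the similarity measure, and the window radius are assumptions of this sketch; the book does not commit to a particular matcher here):

```python
import numpy as np

def deformation_vector(image, template, proj_xy, radius=4):
    """Offset (dx, dy) around proj_xy minimizing the SSD between
    `template` and the corresponding patch of `image`."""
    th, tw = template.shape
    px, py = proj_xy
    best, best_off = float("inf"), (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = px + dx, py + dy
            patch = image[y:y + th, x:x + tw]
            if patch.shape != template.shape:
                continue  # window falls outside the image
            ssd = np.sum((patch - template) ** 2)
            if ssd < best:
                best, best_off = ssd, (dx, dy)
    return np.array(best_off)  # the 2D deformation vector in pixels
```

If shape and calibration were perfect the returned offset would be (0, 0); in practice the non-zero offset is exactly the misalignment the harmonized positions later compensate for.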
5.5.4 View-Dependent Texture Deformation

With the optimized mesh and the deformation vectors, textures are dynamically deformed depending on the user-specified viewpoint. Figure 5.20 illustrates the processes of the view-dependent texture deformation, including the harmonized position computation, the extraction of deformed texture segments, and the transformation and blending of texture segments.
Fig. 5.19 Subdivision patterns of a face. The central face is subdivided depending on the number of the adjacent faces that are supposed to be subdivided. Type 0 illustrates the face that will not be subdivided, and Types 1–3 illustrate how the face is subdivided depending on the states of its adjacent faces. Type 4 is different from the others, that is, the face is marked to be subdivided by the discrepancy evaluation step and/or the occlusion test. Red circles denote the vertices introduced by the subdivision process. ©2009 ITE [21]
Algorithm 10: Evaluation and subdivision of faces
Compute the average of the deformed-face-image-segment discrepancies by

    $\hat{D}_{avg} = \frac{1}{n_f} \sum_f \hat{D}_f$.   (5.7)

if $\hat{D}_{avg}$ > threshold or subdivision is required due to the occlusion then
    Subdivide faces (see Algorithm 11).
    return true
else
    return false
Algorithm 11: Subdivision of faces
foreach face f in M do
    if $\hat{D}_f$ > threshold or subdivision of f is required due to the occlusion then
        typeOfFace_f ← 4
foreach face f in M do
    if typeOfFace_f ≠ 4 then
        typeOfFace_f ← 0
        foreach adjacent face f_a of f do
            if typeOfFace_{f_a} = 4 then
                typeOfFace_f ← typeOfFace_f + 1
foreach face f in M do
    if typeOfFace_f > 0 then
        Subdivide face f as shown in Fig. 5.19 according to typeOfFace_f.
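The type assignment of Algorithm 11 reduces to a small counting pass over the adjacency structure: flagged faces get Type 4, and every other face gets a type equal to its number of Type-4 neighbors (selecting a pattern of Fig. 5.19). A minimal sketch with face ids and an adjacency map as placeholder data structures (our representation, not the book's):

```python
def assign_subdivision_types(face_ids, flagged, adjacency):
    """Type 4 for faces flagged by the discrepancy/occlusion tests;
    otherwise the count of flagged neighbors (Types 0-3)."""
    types = {}
    for f in face_ids:
        if f in flagged:
            types[f] = 4
        else:
            # adjacency maps a face id to its (up to three) neighbor ids
            types[f] = sum(1 for a in adjacency.get(f, ()) if a in flagged)
    return types
```

Doing the assignment in one pass before subdividing anything is what preserves the mesh topology: every face learns its final pattern before any new vertex is introduced.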
Fig. 5.20 Generating harmonized texture from the deformed textures
Harmonized position computation: When the viewing direction $V_{\hat{C}}$ of virtual camera $\hat{C}$ is specified, the weighting factors for the multi-view images can be computed, and then the harmonized positions for extracting deformed texture segments as well. Algorithm 12 shows the procedure of computing the harmonized positions with the deformation vectors. In the algorithm, the weighting factor $w$ is computed from the dot product between the real and virtual cameras' viewing directions, i.e. a larger value means that their directions are closer. The weighting factor controls how much a texture generated from a camera has to be deformed: a larger weighting factor realizes a smaller texture deformation. Namely, a texture generated from a camera with a larger value is not deformed so much, and hence the image captured by that camera is preserved. This enables us to generate texture patterns without any deformation for a camera having a viewing direction similar to the virtual one.7

Texture extraction: The texture segment of face $f$ on image $I^{C_i}$, $T_f^{\hat{M}^{C_i}}$, is defined as a triangular image area on $I^{C_i}$ specified by the harmonized positions of the three vertices of $f$. $T_f^{\hat{M}^{C_i}}$ is the deformed version of the face-image segment of $f$ on $I^{C_i}$, $I_f^{\hat{M}^{C_i}}$, which is illustrated as a green triangle in Fig. 5.20. With this deformation, the inconsistency among multi-view texture patterns can be reduced.

Texture transformation: After obtaining the texture segments of $f$ from the multi-view images, they are transformed onto the same image plane of the harmonized texture image. As described at the beginning of Sect. 5.5.2, the harmonized texture image is defined on the image plane of the virtual camera, and its geometry is specified by the user. See the first half of Algorithm 13 for details.

Texture blending: We finally blend the transformed textures with the weighting factors computed in Algorithm 12, and generate the harmonized texture image for the 3D video object. See the second half of Algorithm 13 for details.

Algorithm 12: Computing the harmonized position of vertex v on each image plane
$V_{\hat{C}}$ ← specify the viewing direction of virtual camera $\hat{C}$.
foreach camera $C_i$ in C do
    Compute the weighting factor by

        $w_{C_i} = \left( \frac{V_{\hat{C}} \cdot V_{C_i} + 1}{2} \right)^{\gamma_2}$,   (5.8)

    where $\gamma_2$ denotes a weighting coefficient.
foreach camera $C_i$ in C do
    foreach vertex v in M visible from $C_i$ do
        Compute the harmonized position of v on image $I^{C_i}$, $\bar{p}_v^{\hat{M}^{C_i}}$, with the weighting factors and the deformation vectors by

            $\bar{p}_v^{\hat{M}^{C_i}} = p_v^{\hat{M}^{C_i}} + \sum_j^{n_c} \frac{w_{C_j}}{\sum_k w_{C_k}} v_v^{\hat{M}^{C_i} \leftarrow \hat{M}^{C_j}}$.   (5.9)

7 We use a coefficient $\gamma_2$ for controlling the weighting factor, which is heuristically defined as $\gamma_2 = 50$. With this value, the normalized weighting factor (in Eq. (5.9)) for the camera that has the same viewing direction as the virtual one becomes almost 1.0 and the others 0.0.
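Eqs. (5.8)-(5.9) can be sketched directly in NumPy. In the minimal version below (our own function and argument names), `weighting_factors` implements Eq. (5.8) with the heuristic γ₂ = 50 from the text, and `harmonized_position` displaces a projected vertex by the weighted average of its deformation vectors as in Eq. (5.9):

```python
import numpy as np

def weighting_factors(virtual_dir, cam_dirs, gamma2=50):
    """w_Ci of Eq. (5.8): large when a real camera looks along the virtual one."""
    V = np.asarray(cam_dirs, float)        # n_c x 3 unit viewing directions
    v = np.asarray(virtual_dir, float)     # unit viewing direction of C-hat
    return ((V @ v + 1.0) / 2.0) ** gamma2

def harmonized_position(p_v, deform_vecs, weights):
    """p-bar of Eq. (5.9): deform_vecs[j] is the deformation vector of
    vertex v on camera Ci computed against camera Cj (n_c x 2)."""
    w = np.asarray(weights, float)
    D = np.asarray(deform_vecs, float)
    return np.asarray(p_v, float) + (w / w.sum()) @ D
```

With γ₂ = 50, a camera aligned with the virtual viewpoint gets a normalized weight of almost 1, so its vertices stay put and its captured image is rendered essentially undeformed, matching the behavior described above.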
5.5.5 Experimental Results

We evaluate the performance of the harmonized texture generation with two 3D video data streams of MAIKOs and one of Kung-fu. The data characteristics are as follows:

MAIKO 1: a woman wearing a green FURISODE (Fig. 5.21). Both sides of her body and a part of the broad sash have large concavities that could not be reconstructed accurately. The sash on her back is reconstructed thicker than the actual one, and thus the texture patterns on it that are extracted from the captured multi-view images are not consistent with each other.

MAIKO 2: a woman wearing a red FURISODE, as used in the previous sections.

Kung-fu: a man performing Kung-fu. His clothes are simpler than the MAIKOs'.

All data were captured in Studio B, as described in Sect. 5.3.1.
Algorithm 13: Texture extraction and blending
foreach face f in M visible from virtual camera $\hat{C}$ do
    /* Texture extraction and transformation. */
    foreach camera $C_i$ in C do
        if f is observable from $C_i$ then
            $T_f^{\hat{M}^{C_i}}$ ← extract the texture segment of f from $I^{C_i}$ using the harmonized positions.
            $\hat{T}_f^{\hat{M}^{\hat{C}} \leftarrow \hat{M}^{C_i}}$ ← transform $T_f^{\hat{M}^{C_i}}$ from $C_i$ to $\hat{C}$. Note this transformation makes $T_f^{\hat{M}^{C_i}}$ align pixel-wise with the face-image segment $I_f^{\hat{M}^{\hat{C}}}$ of f on the image plane of $\hat{C}$.
    /* Texture blending. */
    foreach pixel p in $\bar{T}_f$ do
        foreach camera $C_i$ in C do
            if $\hat{T}_f^{\hat{M}^{\hat{C}} \leftarrow \hat{M}^{C_i}}$ exists then $\bar{w}_{C_i} \leftarrow w_{C_i}$ else $\bar{w}_{C_i} \leftarrow 0$
        Compute the color at p by

            $\bar{T}_f[p] = \sum_i^{n_c} \frac{\bar{w}_{C_i}}{\sum_j \bar{w}_{C_j}} \hat{T}_f^{\hat{M}^{\hat{C}} \leftarrow \hat{M}^{C_i}}[p]$,   (5.10)

        where $\bar{T}_f[p]$ denotes the color value at pixel p in the face-image segment $I_f^{\hat{M}^{\hat{C}}}$ of f on the image plane of $\hat{C}$. Note that $\bar{T}_f$ and $\hat{T}_f^{\hat{M}^{\hat{C}} \leftarrow \hat{M}^{C_i}}$ share the same pixel-wise shape representation.
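The blending step of Eq. (5.10) is a per-pixel weighted average in which the weight is zeroed wherever a camera's transformed segment does not exist. A minimal NumPy sketch (using NaN to mark missing pixels is a representation assumed for this sketch, not the book's):

```python
import numpy as np

def blend_textures(segments, weights):
    """Eq. (5.10): per-pixel weighted average of transformed texture
    segments (n_c x H x W, NaN where a segment is missing)."""
    S = np.asarray(segments, float)
    w = np.asarray(weights, float)[:, None, None]  # broadcast over pixels
    valid = ~np.isnan(S)
    w = np.where(valid, w, 0.0)    # w-bar: drop cameras with no segment here
    S = np.where(valid, S, 0.0)
    wsum = w.sum(axis=0)
    return (w * S).sum(axis=0) / np.where(wsum > 0, wsum, 1.0)
```

Because the weights are renormalized per pixel, a pixel covered by only one camera simply takes that camera's color, while fully covered pixels blend all views according to Eq. (5.8).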
5.5.5.1 Mesh Optimization

Figure 5.22 shows the 3D mesh data of MAIKO 1 generated during the mesh optimization process. Figure 5.22(a) shows the original mesh with 107,971 vertices and 215,946 faces, and Fig. 5.22(b) the mesh simplified so that each face includes sufficient image features for matching. Figure 5.22(e) illustrates the simplified mesh colored with the texture discrepancy computed by Algorithm 6, where high discrepancy values are painted in red-yellow, medium in green, and low in light and dark blue. We can observe that large texture discrepancies appear on the side of the body, the sash, the sleeves, and the hem of the FURISODE. Texture inconsistencies of
Fig. 5.21 Captured images of MAIKO 1. ©2009 ITE [21]
the side and sash are caused by concavities. Others are mainly due to the occlusions from the ceiling cameras. The latter are not easy to compensate by the local texture deformation alone, because errors of the calibration and shape reconstruction produce large texture misalignments. In such occluded areas, very different multi-view texture patterns are generated on corresponding mesh surface faces. Actually, the texture of a head area observed from the ceiling camera is projected onto a part of the hem (a lower part of the mesh, which shows a larger difference value than the other areas). This causes an artifact of texture switching when the virtual camera moves between the ceiling cameras and the others. Figure 5.22(c) shows the mesh after the second step of mesh simplification based on the texture consistency. As shown in Fig. 5.22(f), faces with inconsistent texture are merged to occupy larger surface areas on the mesh. Figures 5.22(d) and (g) show the final result of the mesh optimization. The mesh refinement process partitions and localizes faces with inconsistent texture that were enlarged by the previous process. This result demonstrates that the original 3D mesh is adaptively optimized with regard to texture consistency and shape preservation.
5.5.5.2 Quantitative Performance Evaluation

We align a virtual camera with the same parameters as the real camera $C_i$, and render an image $I'^{C_i}$ without using $I^{C_i}$, the image captured by $C_i$. We then evaluate the performance of the harmonized texture generation by computing the PSNR between $I^{C_i}$ and $I'^{C_i}$. For comparison, we also evaluated the PSNRs of the view-independent
Fig. 5.22 Mesh optimization results. ©2009 ITE [21]. Images (a)–(d) show the original and resulting mesh data generated by the mesh optimization process. ‘Simplified-S’ and ‘Simplified-T’ denote the simplified mesh data with respect to shape preservation and texture consistency, respectively. Images (e)–(g) illustrate the spatial distributions of texture difference values for the mesh data in (b), (c), and (d), respectively, where red implies large texture difference values, green medium, and blue small
average method (described in Sect. 5.3) and the view-dependent vertex-based method (described in Sect. 5.4). We performed the evaluation for all cameras in the studio except the frontal and ceiling cameras, i.e., 12 cameras from the top-left in Fig. 5.21. Figure 5.23 illustrates the result of the performance evaluation, which demonstrates the harmonized texture generation can generate much higher fidelity images than the other methods.
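The leave-one-out evaluation above reduces to a standard PSNR computation between the held-out captured image and the image rendered from the remaining views. A minimal sketch (the 8-bit peak value of 255 is an assumption of this sketch; the book does not state the bit depth used):

```python
import numpy as np

def psnr(captured, rendered, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    err = np.asarray(captured, float) - np.asarray(rendered, float)
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")        # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher values mean the rendered view reproduces the held-out capture more faithfully, which is exactly the comparison plotted in Fig. 5.23.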
Fig. 5.23 Quantitative performance evaluation of three texture generation methods. ©2009 ITE [21]
5.5.5.3 Qualitative Image Quality Evaluation

Figure 5.24 shows close-up captured images and the corresponding images generated by the view-independent average method, the view-dependent vertex-based method, and the harmonized texture generation method. Although the first method generates textures with almost the same sharpness as the captured images, blurring and ghosting artifacts appear on the broad sash and the sleeves due to errors of the camera calibration and the shape reconstruction. The images generated by the view-dependent vertex-based method also contain blurring artifacts due to the interpolation of vertex colors as well as the errors. On the other hand, the harmonized texture generation method can obviously reduce such artifacts and render high quality images.
Fig. 5.24 Images for qualitative image quality evaluation (MAIKO 1). ©2009 ITE [21]
Fig. 5.25 Images for qualitative image quality evaluation (MAIKO 2)
Figure 5.25 illustrates another set of images rendered by the three above-mentioned methods and the view-independent face normal method (described in Sect. 5.3). As shown in the figure, the harmonized texture generation method demonstrates its effectiveness in drastically reducing blurring, ghosting, and discontinuity artifacts. Note also that the view-independent average method cannot render the shiny areas on the sash correctly, whereas the other three can generate highlights with high fidelity.
5.5.5.4 Real-Time Rendering

Lastly, Table 5.2 shows the computational time of each process of the harmonized texture generation for rendering one frame. The number of faces of the optimized MAIKO 1 and Kung-fu data is approximately 5,000. The specification of the PC used for the evaluation is CPU: Core2Duo 2.4 GHz, memory: 4 GB, GPU: GeForce 8800 GTX with 750 MB of VRAM, and the software is implemented in C# and Managed DirectX with Pixel Shader 2.0. Figure 5.26 shows rendered images of the Kung-fu stream. The harmonized texture generation method can render high quality images in real time (≥30 fps) using a 15-viewpoint video stream, which
Fig. 5.26 Rendered 3D video stream of Kung-fu. ©2009 ITE [21]

Table 5.2 Computational time for rendering one frame (msec). ©2009 ITE [21]

Process                                                | MAIKO 1 | Kung-fu
Weighting factors (Eq. (5.8))                          | <1      | <1
Harmonized texture coordinates (Eq. (5.9))             | 27.2    | 15.8
Texture generation and rasterization (Algorithm 13)    | <1      | <1
allows us to enjoy a dynamic 3D action like Kung-fu interactively while changing view directions as well as zooming factors (Fig. 5.26).
5.6 Conclusions

In this chapter, we discussed texture generation for 3D video. First, we verified that the simple view-independent methods fail to generate high quality textures: they introduce non-negligible blurring, ghosting, and discontinuity artifacts and cannot represent non-Lambertian reflections well. Then, we recognized that the artifacts are incurred by the inaccuracy of the camera calibration and errors of the shape reconstruction, which are inevitable in practical applications, and that dynamic appearance changes of non-Lambertian surfaces cannot be modeled by view-independent methods.

To cope with these problems, we introduced a virtual camera (viewpoint) into the computation model of texture generation to realize view-dependent texture generation methods. The first proposed method is the view-dependent vertex-based texture generation, where artifacts are reduced by generating texture patterns of mesh faces by vertex color interpolation, and dynamic variations of non-Lambertian surface appearances are simulated by computing vertex colors depending on the viewing direction of the virtual camera.

To attain high fidelity texture generation comparable to the originally captured images, the harmonized texture generation was proposed. It explicitly evaluates discrepancies among surface texture patterns projected onto the 3D mesh from observed multi-view images and deforms the texture patterns to reduce the discrepancies
depending on the viewing direction of the virtual camera. The quantitative and qualitative performance evaluations demonstrated its capability to generate smooth and sharp free-viewpoint images of a 3D video stream.

As discussed at the beginning of this chapter, there are two approaches to texture generation for 3D video: appearance-based and generic-property-based. The harmonized texture generation is the most sophisticated of the appearance-based methods. Realizing generic-property-based texture generation, on the other hand, is still an open problem because it requires an accurate lighting model as well as precise surface normal vectors of a moving object, neither of which is easy to obtain in the real world. In the next chapter, we will discuss lighting environment estimation and present a method of estimating complex and dynamically changing real-world lighting environments by analyzing shading and shadows on a reference object of specially designed 3D shape. We will then discuss generic surface reflectance estimation from multi-view video data and present an approach toward generic-property-based texture generation for 3D video.

Another possible direction for future research is the introduction of motion information into the computational model. So far, all the presented texture generation methods process one video frame at a time, because it is not easy to reconstruct 3D shape and motion accurately. With accurate motion information, local surfaces occluded in some frames could be observed in other frames to improve their texture generation.

Finally, we should note that while harmonized texture generation works better than the others, view-(in)dependent vertex-based texture generation methods are employed in the applications of 3D video discussed later in Chaps. 8 and 10. This is because the former is a face-based method and cannot be used when applications require modification of the mesh structure.
References

1. Blender Foundation: http://www.blender.org/
2. Dana, K.J., van Ginneken, B., Nayar, S.K., Koenderink, J.J.: Reflectance and texture of real-world surfaces. ACM Trans. Graph. 18(1), 1–34 (1999)
3. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: SIGGRAPH'00: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 145–156. ACM, New York (2000)
4. Debevec, P.E., Taylor, C.J., Malik, J.: Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In: SIGGRAPH'96: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 11–20. ACM, New York (1996)
5. Dror, R.O., Adelson, E.H., Willsky, A.S.: Surface reflectance estimation and natural illumination statistics. In: IEEE Workshop on Statistical and Computational Theories of Vision (2001)
6. Eisemann, M., Decker, B.D., Magnor, M., Bekaert, P., de Aguiar, E., Ahmed, N., Theobalt, C., Sellent, A.: Floating textures. In: Eurographics'08, vol. 27(2) (2008)
7. Euclideon: Unlimited Detail Technology. http://www.euclideon.com/
8. Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F.: Computer Graphics: Principles and Practice in C, 2nd edn. Addison-Wesley, Reading (1995)
9. Gu, X., Gortler, S.J., Hoppe, H.: Geometry images. In: ACM SIGGRAPH, vol. 21, pp. 355–361. ACM, New York (2002)
10. Hoppe, H.: Progressive meshes. In: SIGGRAPH'96: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 99–108. ACM, New York (1996)
11. Ikeuchi, K., Sato, K.: Determining reflectance properties of an object using range and brightness images. IEEE Trans. Pattern Anal. Mach. Intell. 13(11), 1139–1153 (1991)
12. Lévy, B., Petitjean, S., Ray, N., Maillot, J.: Least squares conformal maps for automatic texture atlas generation. In: SIGGRAPH'02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 362–371. ACM, New York (2002)
13. Matsuyama, T., Takai, T.: Generation, visualization, and editing of 3D video. In: First International Symposium on 3D Data Processing, Visualization and Transmission, pp. 234–245 (2002)
14. Miyazaki, D., Shibata, T., Ikeuchi, K.: Wavelet-texture method: appearance compression by polarization, parametric reflection model, and Daubechies wavelet. Int. J. Comput. Vis. 86(2), 171–191 (2010)
15. OpenGL Architecture Review Board, Shreiner, D., Woo, M., Neider, J., Davis, T.: OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 2.1, 6th edn. Addison-Wesley, Reading (2007)
16. Ramamoorthi, R., Hanrahan, P.: A signal-processing framework for inverse rendering. In: ACM SIGGRAPH, pp. 117–128 (2001)
17. Sato, Y., Wheeler, M.D., Ikeuchi, K.: Object shape and reflectance modeling from observation. In: ACM SIGGRAPH'97, pp. 379–387 (1997)
18. Sheffer, A., Lévy, B., Mogilnitsky, M., Bogomyakov, A.: ABF++: fast and robust angle based flattening. ACM Trans. Graph. 24(2), 311–330 (2005)
19. Sheffer, A., Praun, E., Rose, K.: Mesh parameterization methods and their applications. Found. Trends Comput. Graph. Vis. 2(2), 105–171 (2006)
20. Takai, T., Hilton, A., Matsuyama, T.: Harmonised texture mapping. In: Fifth International Symposium on 3D Data Processing, Visualization and Transmission (2010)
21. Takai, T., Matsuyama, T.: Harmonized texture mapping. J. Inst. Image Inf. Telev. 63(4), 488–499 (2009) (in Japanese)
22. Tominaga, S., Wandell, B.A.: Standard surface-reflectance model and illuminant estimation. J. Opt. Soc. Am. 6(4), 576–584 (1989)
Chapter 6
Estimation of 3D Dynamic Lighting Environment with Reference Objects
6.1 Introduction

As discussed at the beginning of Part II, a complex dynamic light field is generated in the world of 3D video (Fig. II.2), and a multi-view camera system can observe only a limited part of that light field even if the number of cameras is increased. Thus the problem of 3D video production from multi-view video is substantially under-constrained. To find a practical solution to this problem, the assumptions of a uniform directional light source and Lambertian reflection were introduced in Chap. 4. They allow the appearance of the object surface to be regarded as painted markers consistently observable from any viewpoint. In Chap. 5, on the other hand, a virtual camera was additionally introduced to simulate non-uniform light sources and non-Lambertian reflections, while the internal computation processes themselves still employ these assumptions. In short, the generic geometry of an object in motion, i.e. 3D object shape and motion, can be reconstructed only with limited accuracy, and the generic photometry, i.e. surface reflectance properties, cannot be estimated due to the lack of information about the lighting environment.

Thus, in this chapter, we address the problem of lighting environment estimation, which will greatly facilitate 3D shape reconstruction and texture generation. Here we consider lighting environments that include a group of proximal light sources with 3D volumes and spatially spreading ambient light, whose shapes, positions, and radiant intensities vary over time. These will be sufficient to model the light field in a 3D video studio.

This chapter first gives a brief survey of previously proposed lighting environment estimation methods in Sect. 6.2. Then the problem specifications, basic ideas, and assumptions for 3D dynamic lighting environment estimation with reference objects are discussed in Sect. 6.3, followed by an algebraic formulation of the problem of estimating 3D dynamic light sources from shadow and shading on reference objects in Sect. 6.4. A practical solution method is presented in Sect. 6.5, and its performance is evaluated in Sect. 6.6 with several experimental results in a real world scene where lighting environments vary dynamically.

T. Matsuyama et al., 3D Video and Its Applications, DOI 10.1007/978-1-4471-4120-4_6, © Springer-Verlag London 2012

As an application to 3D video,
Sect. 6.7 addresses a method of estimating surface reflectance properties from 3D video, which enables us to visualize 3D video contents under various lighting environments. Section 6.8 concludes the chapter and discusses future research problems.
6.2 Lighting Environment Estimation Methods

Compared to the comprehensive exploration of 3D shape reconstruction techniques, studies on lighting environment estimation have been limited. This may be because they require modeling of complex light fields, i.e. 3D spatially spreading phenomena caused by mutual interactions of light emission, attenuation, reflection, refraction, and occlusion. Moreover, light fields in natural environments vary dynamically in very complicated ways that cannot be modeled by rigid motions or continuous deformations. In other words, we have to model phenomena of dynamic interactions instead of segmenting out and identifying objects. This section surveys the lighting environment estimation methods proposed so far. They can be categorized into two types: direct and indirect methods.

6.2.1 Direct Methods

Direct lighting environment estimation methods, as the name suggests, directly capture images of light sources. Sato et al. [13] employed a pair of fish-eye lens cameras to capture a ceiling scene, and modeled the 3D distribution of light sources by mapping the observed image onto a 3D polygonal hemisphere reconstructed by stereo matching. Debevec [4] captured an image of a mirror sphere to generate a radiance map of the scene and represent a local light field. Unger et al. [17] extended this method with multiple mirror spheres to represent incident light fields in a 3D space by synthesizing the local light fields derived from each sphere.

To apply these methods to real world scenes with electric bulbs and fluorescent lights, cameras with very high dynamic range are required. As discussed in Chap. 2, 3D video studios often employ strong light sources to facilitate multi-view image capture of an object in motion. Moreover, while dynamic lighting environments can easily be modeled by capturing video images, light sources with directionally biased emission, e.g. LED lights, or lights with 3D volumes, e.g. candles, cannot be modeled, because these methods capture only the light rays converging to the projection centers of the cameras. Another limitation of the direct methods is that they cannot model proximal light sources, for which light ray attenuation should be taken into account. In short, the direct methods model light sources by their "appearances" observed from limited positions.
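The geometric core of the mirror-sphere approach can be sketched in a few lines: each pixel on the sphere image is mapped to the incident light direction it reflects, via the mirror reflection law. The sketch below is illustrative only, assuming an orthographic camera looking down the -z axis at a unit mirror sphere (the function name and coordinate convention are ours, not from the methods cited above).

```python
import numpy as np

def probe_direction(u, v):
    """Map normalized mirror-sphere image coordinates (u, v) in [-1, 1]
    to the unit direction the light arrived from at that pixel.

    Assumes an orthographic camera viewing a unit mirror sphere along -z;
    the reflection law d = 2(n.v)n - v recovers the incident direction."""
    r2 = u * u + v * v
    if r2 > 1.0:
        return None                              # pixel falls outside the sphere
    n = np.array([u, v, np.sqrt(1.0 - r2)])      # surface normal at the pixel
    view = np.array([0.0, 0.0, 1.0])             # direction toward the camera
    d = 2.0 * np.dot(n, view) * n - view         # mirror reflection of the view ray
    return d / np.linalg.norm(d)
```

Note that the center pixel maps to the direction straight back at the camera, while pixels near the sphere's rim map to directions almost behind the sphere, which is why a single mirror-ball image covers nearly the whole environment.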
6.2.2 Indirect Methods

Indirect methods capture image(s) of reference object(s) and estimate the positions and radiant intensities of the light sources that illuminate the object surface(s). That is, the indirect methods directly reconstruct 3D lighting environments. Since the dynamic range of radiance on ordinary object surfaces is not very wide, standard cameras can be used. The essential problems to be solved by the indirect methods rest in (1) how to characterize lighting environments: models of light sources and light field phenomena, and visual cues for light source estimation, and (2) how to design the reference object so as to facilitate the computational process of estimating lighting environments.

Marschner and Greenberg [8] proposed inverse lighting to estimate lighting environments from an image of a reference object with known shape and reflectance. They assumed that lighting environments can be modeled by a linear combination of basis lights on a large sphere surrounding the reference object. They first synthesized basis images from the object and the basis lights, and then estimated the weights of the basis images to model a captured image. As pointed out in [8], however, the Lambertian surface reflection model is too simple to localize light sources. Theoretically, as supported by [1, 11], the Lambertian reflection model acts as a low-pass filter in light source localization, and hence can estimate only a coarse spatial distribution of light sources. In other words, shading patterns on a reference object surface carry a limited amount of information for estimating the positions and radiant intensities of light sources.

Recall that the direct methods with mirror balls employed mirrored light source images to localize them sharply. This suggests the introduction of a reference object with specular surface reflection to take into account both shading and highlights. In fact, Zhou and Kambhamettu [22] analyzed specular highlights on a reference sphere to estimate the directions of light sources. On the other hand, Yang and Yuille [19] noted that image data around occluding boundaries carry strong constraints on the light source direction, and Nillius and Eklundh [9] proposed a method of estimating the light source direction by analyzing image intensities around occluding boundaries on a (locally) Lambertian surface. Wang and Samaras [18] and Zhang and Yang [20] proposed methods that estimate a few directional light sources from shading boundaries on a reference sphere, which also give a strong constraint on the light source direction.

While these methods assumed that lighting environments can be modeled by a set of directional light sources, proximal light sources, whose illuminating light angles and strengths vary depending on object surface positions and orientations, should be introduced to model real world lighting environments including 3D light sources like candles. In [16], we employed multiple Lambertian spheres to estimate complex lighting environments including proximal point light sources, directional light sources, and ambient light. We introduced the novel idea of the difference sphere to estimate the 3D positions and radiant intensities of proximal point light sources. A difference sphere
198
6
Estimation of 3D Dynamic Lighting Environment with Reference Objects
is generated by computing image intensity differences between corresponding surface points on a pair of reference spheres which have the same surface normal. Since the effects of directional light sources and ambient light are canceled out by the differencing, shading patterns on a difference sphere's surface enable us to estimate proximal point light sources. Although [6, 10] also studied the estimation of proximal light sources, it remains a challenging problem to estimate the 3D positions, shapes, and radiant intensities of multiple proximal light sources. To solve such a complex estimation problem, a new visual cue is required.

Sato et al. [14] proposed a method using the shadows cast by a reference object to estimate the radiant intensities of light sources placed at known 3D positions. As will be discussed later, shadows generated by reference object(s) carry rich information for localizing light sources. Their method, however, works only in the simple case where lighting environments are modeled by directional light sources without any proximal light sources.

The lighting environment estimation algorithm presented in this chapter is an indirect method that employs both shading and shadows to localize proximal light sources as well as the ambient light in the scene. The novelty of the presented algorithm rests in the design of a reference object, the skeleton cube, which facilitates shadow-based light source localization without any knowledge about possible light source positions or shapes.
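The cancellation behind the difference sphere can be verified numerically: points on two Lambertian spheres that share a surface normal receive identical ambient and directional contributions, so their intensity difference retains only the proximal-source terms. The sketch below uses made-up radiometric values and a simplified shading model (albedo folded into the source strengths); it is an illustration of the principle, not the method of [16].

```python
import numpy as np

def lambert_intensity(p, n, ambient, l_dir, e_dir, p_src, e_src):
    """Radiance at surface point p with unit normal n, lit by ambient
    light, a directional source (direction l_dir, strength e_dir), and
    a proximal point source at p_src with radiant intensity e_src
    (inverse-square falloff)."""
    i = ambient + e_dir * max(0.0, np.dot(n, l_dir))
    v = p_src - p                        # vector toward the point source
    r = np.linalg.norm(v)
    i += e_src * max(0.0, np.dot(n, v / r)) / r**2
    return i

# Two unit spheres at different centers; pick the surface points sharing normal n.
n = np.array([0.0, 0.0, 1.0])
c1, c2 = np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])
p1, p2 = c1 + n, c2 + n

l_dir = np.array([0.0, 0.0, 1.0])        # directional source from above
p_src = np.array([0.5, 0.0, 3.0])        # proximal point source near sphere 1

i1 = lambert_intensity(p1, n, 0.2, l_dir, 1.0, p_src, 5.0)
i2 = lambert_intensity(p2, n, 0.2, l_dir, 1.0, p_src, 5.0)

# The ambient (0.2) and directional (1.0 * n.l_dir) terms are identical at
# p1 and p2, so the difference is due to the proximal source alone.
diff = i1 - i2
```

The position dependence of the proximal term is exactly what survives the differencing, which is why the difference sphere isolates proximal point light sources.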
6.3 Problem Specifications and Basic Ideas

6.3.1 Computational Model

Even in a well designed 3D video studio, it is very hard to estimate the light field generated by many distributed proximal light sources; moreover, interreflections between object and background surfaces introduce further complications into the light field. In natural scenes such as the one illustrated in Fig. 1.4, lighting environments change dynamically due to sunlight variations in daytime or bonfires at night. To attack such a complicated problem, we introduce the following assumptions and specifications on the image capture environment.

Interreflection: Compared to direct light from light sources, interreflections are weak and negligible. That is, we assume there are no interreflections in the scene. As will be shown in Sect. 6.6, interreflections are actually not negligible at sharply concave surface areas in real world scenes.

Reference object: We assume that the 3D shape, position, and surface reflectance of the reference object(s) in the scene are known. That is, a reference object(s) is employed to estimate dynamically changing lighting environments. The reference object design includes:

• A special 3D shape, named the skeleton cube, designed to realize effective lighting environment estimation. The design principle and its effectiveness will be given later in detail.
6.3 Problem Specifications and Basic Ideas
199
• As discussed before, although highlights and specular reflections carry useful information for localizing light sources, the reference object surface is painted uniformly in matte white to render Lambertian reflection. This is because strong specular reflections would introduce non-negligible interreflections in the skeleton cube. The use of a reference object with a specular surface and the analysis of interreflections are left for future studies.

• We assume the scene contains a flat ground plane on which a performer is in action. The reference object is placed on the plane. The object position on the plane should be chosen close to the centroid of the performer's motion trajectory, because the reference object should "capture" lighting environments as close as possible to those actually illuminating the performer. Wide area light field modeling with multiple reference objects placed to cover the entire motion trajectory is also left for future studies. Note that the system for wide area multi-view observation described in Sect. 3 will require such wide area light field modeling.

• As will be discussed later, the orientation of the reference object has a large effect on the accuracy of the light source estimation. If the approximate spatial layout of the light sources is available, the object orientation should be adjusted to increase the accuracy. Otherwise, it can be readjusted based on the result of an initial estimation.
Video capture: The reference object is placed in the scene, and its video is captured by a calibrated camera; while the object is static, the lighting environment varies over time. Ideally, the reference object should be placed in a 3D video studio while a performer(s) is in action. In practice, however, we assume that the scene contains the reference object alone, because the performer could cast occlusions and shadows on the reference object, which would make the lighting environment estimation difficult. Hence, the method presented here is an off-line process, and on-line lighting environment estimation is left for future studies.

Camera calibration: Even though multi-view observation of the reference object by a group of distributed cameras would be useful to increase the estimation accuracy, the method presented here employs only a single camera. As a matter of fact, we want to present the computational algorithm without it being "disturbed" by calibration errors between multiple cameras. As will be described later in this chapter, since the presented algorithm directly analyzes observed pixel intensity values to estimate lighting environments, highly accurate camera calibration among multiple cameras would be required in both geometry and photometry. As discussed in Chap. 2, however, it is not trivial to attain highly accurate color calibration among multiple cameras. Thus, the extension to multiple cameras is left for future studies.

Light source model: Physical light sources have various 3D shapes that can change over time, such as candles and bonfires. Moreover, the strength of light emission in such 3D light sources varies spatially depending on 3D local position. To model such dynamic non-uniform 3D light sources, we employ a 3D distribution of point light sources. The problem of lighting environment estimation can then be solved by reconstructing 3D distributions of point light sources whose shapes and light emission patterns vary over time. Note that in the case of physical light sources like candles, light emission from a given 3D position is affected by the surrounding environment. That is, neighboring point light sources in the same group are not independent and have complex mutual interactions. The algorithm presented here neglects such interactions and simply estimates the 3D distributions and radiant intensities of point light sources.

In addition to proximal light sources, we introduce ambient light to model light reflections produced by the background scene. Since we can assume neither that the background scene is isotropic nor that light sources are uniformly placed, the ambient light should be modeled with a certain 3D structure. In practice, a large hemisphere of point light sources whose radiant intensities vary spatially and dynamically is employed to model the ambient light.

Fig. 6.1 Computational model for the lighting environment estimation. Colored circles denote known entities. The gray and black arrows illustrate "generation" and "estimation" processes, respectively

Figure 6.1 illustrates the computational model of the lighting environment estimation presented in this chapter. The lighting environment estimation algorithm analyzes shadows and shading patterns on the reference object surface in order to estimate 3D distributions of point light sources in the scene. Note that occlusion here denotes self-occlusions of the reference object, which can be computed. It should be noted that only shadows and shading patterns on the reference object surface are analyzed, because the photometric properties of the background scene surface are not known; the reference object may be placed on a shiny studio floor or on grass. The estimation algorithm is applied to each frame of the observed video data one by one. To model the dynamics of the estimated light sources, the result for the previous frame is used in analyzing a new frame.
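One way to lay out the hemisphere of point-light samples used for the ambient model can be sketched as follows. The radius, ring counts, and layout scheme here are arbitrary illustrative choices, not the parameters used in the chapter.

```python
import numpy as np

def hemisphere_sources(radius, n_rings=4, n_per_ring=8):
    """Place point-light samples on a hemisphere of the given radius
    centered on the scene, to model spatially varying ambient light.
    Returns an (n_rings * n_per_ring, 3) array of positions, z >= 0."""
    positions = []
    for i in range(n_rings):
        elev = (i + 0.5) * (np.pi / 2) / n_rings      # elevation band center
        for j in range(n_per_ring):
            azim = 2 * np.pi * j / n_per_ring
            positions.append([
                radius * np.cos(elev) * np.cos(azim),
                radius * np.cos(elev) * np.sin(azim),
                radius * np.sin(elev),                 # above the ground plane
            ])
    return np.array(positions)
```

Each sample carries its own time-varying radiant intensity, so the ambient term is not forced to be isotropic, matching the requirement stated above.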
Fig. 6.2 3D shape from silhouette vs. 3D point light source from shadow (see text)
6.3.2 3D Shape from Silhouette and 3D Light Source from Shadow

3D light source estimation from shadow has much in common with 3D shape from silhouette, as illustrated in Fig. 6.2. In both cases, a camera records the light flux converging at its projection center (Fig. 6.2(a)), while a point light source emits a light flux diverging from the point (Fig. 6.2(b)). An object blocks off the light flux from the background converging to the projection center to generate its silhouette on an image plane, while an object blocks off the light flux diverging from a point light source to generate its shadow in the scene. Both cases share the same 3D light flux geometry, called the visual cone in the former case, but with light rays traveling in opposite directions. When a point light source is placed opposite a camera, the directions of the 3D light flux pair are aligned and the object shadow is captured
just as the object silhouette, assuming uniform ambient light. This lighting method is called back-lighting, and it is a popular method of measuring the shape of a 2D object. For a 3D object, on the other hand, the shadow boundary does not in general coincide with the silhouette boundary.

As discussed in Sect. 4.2.2.2, an object silhouette on a known image plane carries rich information about 3D object shape. Similarly, a shadow on a known object/scene surface can be used to estimate the 3D position of a point light source. Moreover, multi-view object observation can be simulated by placing multiple reference objects: multiple objects are lit by a fixed point light source and observed from a fixed viewpoint, and their shadows are integrated to localize the point light source. From a computational viewpoint, however, there are critical differences between 3D object shape from silhouette and 3D point light source from shadow:

• In 3D object shape from silhouette (Fig. 6.2(a)), the light flux, i.e. the visual cone, can be computed from the projection center and the object silhouette. In 3D point light source estimation from shadow (Fig. 6.2(b)), on the other hand, matching between the 3D object and the 2D shadow¹ is required to reconstruct the 3D light flux and hence to localize the point light source. As is well known, matching a 3D object model with a 2D observed image is not an easy task and is essentially ambiguous. Moreover, since a 2D shadow area does not include rich texture, the matching with a 3D object model is more difficult than ordinary 3D model matching against a textured image.

• Increasing the available information sources using multi-view object observation is a general and effective way to resolve ambiguities. In 3D point light source estimation from shadow, the introduction of multiple reference objects facilitates the localization of point light sources. In fact, as will be addressed later, a pair of skeleton cubes is employed as the reference object in the real world experiments. 3D point light source localization with multiple reference objects looks very similar to projection center estimation in camera calibration. The difference between them is that in the latter, easily identifiable markers are placed on the calibration object surface to facilitate matching between different object postures, whereas the matching in the former must be done without such markers, as described above. Thus, the introduction of unique 3D shape protrusions into the reference object as 3D markers would facilitate the matching between the 3D reference object and its shadow.

• As shown in Fig. 6.2(c), multi-view object silhouettes can increase the accuracy of 3D object shape reconstruction by volume intersection, because each object silhouette and its corresponding visual cone can be computed independently of the others. On the other hand, when the scene includes multiple point light sources (Fig. 6.2(d)), multiple light fluxes are mixed in the space and generate overlapping shadows, which are recorded as intensity variation patterns in the observed image. The decomposition of such intensity variation patterns into overlapping shadows is required to reconstruct the multiple 3D light fluxes. The task is not easy and is essentially ambiguous, especially because the number of point light sources is not known and must be estimated simultaneously with the decomposition. Consequently, even if the lighting environments of an ordinary scene can be modeled by a 3D distribution of point light sources, their localization is a truly challenging problem.

• As shown in Fig. 4.4, 3D object shape from silhouette by itself cannot reconstruct the correct 3D object shape and sometimes introduces phantom volumes. In 3D object shape reconstruction, multi-view photo-consistency examination, e.g. stereo-based surface texture matching, can be applied to increase the shape reconstruction accuracy as well as to remove phantoms. Similar phenomena occur in 3D point light source estimation: since shadows carry only geometric information about object boundaries, the localization accuracy of 3D point light source estimation from shadow is limited, and hence phantom light sources are often introduced. Moreover, in addition to the position of each 3D point light source, its radiant intensity must be estimated. Analogically speaking, in 3D light source estimation, the problems of both 3D shape reconstruction and texture generation must be solved simultaneously. Consequently, besides shadow-based analysis, additional analysis processes corresponding to stereo analysis and/or texture generation, that is, shading and highlight analyses, should be employed simultaneously to realize 3D light source estimation.

Thus, even if the 3D shape of the reference object is known, the problem of 3D light source estimation from shadow is a challenging task that is essentially much more difficult than 3D shape from silhouette.

¹ While we assume here that the shadow is cast on a flat plane, shadows are often cast on 3D surfaces and carry 3D information.
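The geometric core of localizing a point light from shadows can be sketched concretely: once a 3D edge point of the reference object has been matched to its 2D shadow point on the ground plane, the light source must lie on the line through the pair, and several such lines can be intersected in the least-squares sense. This sketch assumes the hard part discussed above, the object-to-shadow matching, has already been solved; the function name and formulation are ours.

```python
import numpy as np

def light_from_shadows(object_pts, shadow_pts):
    """Least-squares 3D point closest to all lines through matched
    (object point, shadow point) pairs; each line must contain the
    point light source. Both arguments are (N, 3) arrays; at least
    two non-parallel lines are needed."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, s in zip(object_pts, shadow_pts):
        d = p - s
        d = d / np.linalg.norm(d)        # unit direction shadow -> object -> light
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the line
        A += P                           # accumulate normal equations of
        b += P @ s                       # sum_i |P_i (x - s_i)|^2
    return np.linalg.solve(A, b)
```

With noise-free matches the light position is recovered exactly; with real shadows, the residual of this fit also indicates how consistent the assumed matching is.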
6.3.3 Basic Ideas

Our basic ideas for solving the computational problems described above are as follows.

Skeleton cube: To facilitate the matching between the 3D reference object and its shadow, we designed the skeleton cube as the reference object. As illustrated in Fig. 6.3, the skeleton cube is a hollow cube consisting of a group of rectangular pillars whose surfaces are painted matte white. With this design, a camera can observe the internal pillar surfaces, onto which shadows of the pillars are cast. Since only the self-shadows of the skeleton cube are used for the analysis, the skeleton cube can be placed on any ground plane, no matter what reflectance properties it may have. Note that, as shown in Fig. 6.3, the exterior surfaces of the skeleton cube are not used for the lighting environment estimation.

Fig. 6.3 Skeleton cube. ©2010 Springer [12]

Although a single skeleton cube includes only six different surface normals, at most three of them can be observed from a camera. Thus, to increase the capability of localizing light sources from self-shadows, additional skeleton cube(s) with different orientation(s), say 45° off from the original one, should be introduced in the interior space of the original one (see Fig. 6.7), or in the exterior space to encase the original one. Increasing the number of pillars and their directional variety increases the chance that shadows are generated on the skeleton cube surface, and the accuracy of 3D point light source localization improves as well. Section 6.5.2 analyzes the characteristics of the skeleton cube in detail.

Shading: As described in Sect. 4.2.1, shading patterns carry rich information about 3D object surface shape. Thus, shading patterns on the skeleton cube surface provide additional useful information for estimating 3D point light sources. In other words, shape from shading plays the same role as stereo analysis in increasing the accuracy of 3D shape reconstruction. Moreover, while shadows carry only geometric information, shading patterns include useful information about the radiant intensities of light sources. Here again, it should be noted that shading patterns generated by multiple light sources overlap on the surface, and hence the observed intensity values must be decomposed to estimate the geometric and photometric properties of the light sources.

As shown in Fig. 6.3, the pillars of the skeleton cube are connected at joints, which form sharp 3D concave surfaces. If the skeleton cube surface were specular, strong interreflections might be produced at such concave surfaces, which would introduce further complications into the analysis process. Thus the skeleton cube is designed to have Lambertian reflection, and the presented algorithm therefore does not employ highlights even though they would facilitate light source localization. While not tested, the exterior surfaces of the skeleton cube could be painted with specular reflection properties without introducing interreflections. The augmentation of the presented algorithm with specular reflection analysis is left for future studies.
Optimization: Even with the well designed reference object, the 3D light source estimation problem is still under-constrained. Moreover, an integrated computational process of shadow and shading analyses should be developed. As a practical solution, we formalized the problem as an algebraic equation representing light emission, reflection (i.e. shading), and shadowing, and developed an optimization method to solve the equation, which will be presented later.
6.4 Algebraic Problem Formulation

This section derives the algebraic formulation from the assumptions and image capture environment specifications described in Sect. 6.3.1.

Let I(R, G, B, i, j, t) denote a color video frame of the reference object (the skeleton cube) observed at time t by a calibrated camera, where (i, j) denotes a 2D pixel position. Since the following formulation can be applied to each of the R, G, and B color channels and to each video frame independently, we omit these indices for simplicity. Then, the formulation of the first step consists of considering

I(i, j) = I(x),   (6.1)
where I(x) denotes the observed radiance at the 3D surface point x of the reference object which corresponds to pixel (i, j). The justifications for this are as follows.

• While the irradiance from x is attenuated depending on the distance from the surface, the area on the surface covered by I(i, j) increases depending on the distance. Accordingly, these two effects balance out, and we can regard a pixel value as the radiance of the corresponding surface point up to some scaling factor.
• Although physical measurements require the scaling factor, it is sufficient for lighting environment estimation to obtain the 3D distribution of point light sources with relative radiant intensities.
• The camera is well calibrated in its photometric properties as well as its geometry, and the pixel sensitivity over the entire imaging plane is normalized.

Figure 6.4 illustrates the geometry of the surface reflection. For a Lambertian surface lit by a group of point light sources L_i (i = 1, ..., N), the radiance I(x) can be modeled by

I(x) = \sum_{i=1}^{N} M(x, L_i) \, k_d \, \frac{N_x \cdot L_{x,i}}{r_{x,i}^2} \, L_{L_i},   (6.2)

where N_x denotes the unit surface normal at x, L_{x,i} and r_{x,i} denote the unit vector at x toward light source L_i and the distance between x and L_i, respectively, and L_{L_i} denotes the radiant intensity of point light source L_i. To model self-shadowing, M(x, L_i) is introduced: M(x, L_i) = 1 if light source L_i illuminates point x, and M(x, L_i) = 0 otherwise (see Fig. 6.5). The albedo k_d is set to 1 for simplicity.

Suppose the camera captures an image of the reference object, whose surface is sampled at M pixels. Then we have M radiance values at the corresponding object
Fig. 6.4 Geometry of the surface reflection. V denotes the unit vector toward the projection center of the camera, L the unit vector toward a point light source, and N the unit surface normal
Fig. 6.5 Modeling shadow observation. The green object on the right denotes an occluding object, and the black region at the bottom denotes a shadow. The mask term of a shadowed point is set to zero, which makes its modeled intensity zero. ©2010 Springer [12]
surface points: I(x_j) (j = 1, ..., M), and the following equation models the lighting environment:

\begin{bmatrix} I(x_1) \\ I(x_2) \\ \vdots \\ I(x_M) \end{bmatrix} =
\begin{bmatrix}
K_{11} & K_{12} & \cdots & K_{1N} \\
K_{21} & K_{22} & \cdots & K_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
K_{M1} & K_{M2} & \cdots & K_{MN}
\end{bmatrix}
\begin{bmatrix} L_{L_1} \\ L_{L_2} \\ \vdots \\ L_{L_N} \end{bmatrix},   (6.3)

where

K_{ji} = M(x_j, L_i) \, \frac{N_{x_j} \cdot L_{x_j,i}}{r_{x_j,i}^2}.   (6.4)

Equation (6.3) can be represented by

I = KL,   (6.5)

where K = (K_{ji}), and I and L denote the observed radiance vector and the radiant intensity vector of the point light sources, respectively.
When the 3D positions of all point light sources are given or assumed, K with M(x_j, L_i), L_{x,i} and r_{x,i} becomes known. Then, given a sufficient number of observed radiance values (i.e. M > N), it is in principle possible to solve Eq. (6.5) for L by linear least squares with non-negative variables (non-negative least squares, NNLS [7]):

\min_L \|KL - I\|^2  subject to  L \geq 0.   (6.6)
Namely, we can compute the radiant intensities of the point light sources, [LL1 , LL2 , . . . , LLN ]. Note that without the mask term representing shadowing effects, we cannot obtain a solution of Eq. (6.6), because the rank of K becomes nearly three [2]. In other words, as was shown in [11], shading patterns vary more smoothly than the 3D distribution of point light sources, while shadows are effective to localize the distribution [14].
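To make the formulation concrete, the following sketch builds a small synthetic K following Eqs. (6.4)–(6.6) and solves for L with SciPy's NNLS routine. All geometry (surface points, normals, candidate light positions) and the mask values are fabricated for illustration; only the structure of the computation follows the text.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# M surface points with unit normals, N candidate point-light positions
# (all values are synthetic, for illustration only).
X = rng.uniform(-50, 50, size=(200, 3))             # surface points x_j
Nx = rng.normal(size=(200, 3))
Nx /= np.linalg.norm(Nx, axis=1, keepdims=True)     # unit normals N_{x_j}
P = rng.uniform(-1000, 1000, size=(20, 3))          # candidate lights L_i
P[:, 2] = np.abs(P[:, 2]) + 500                     # keep lights above the scene

def build_K(X, Nx, P, mask):
    """K_ji = M(x_j, L_i) * (N_{x_j} . L_{x_j,i}) / r_{x_j,i}^2  (Eq. 6.4)."""
    D = P[None, :, :] - X[:, None, :]               # vectors x_j -> L_i
    r = np.linalg.norm(D, axis=2)                   # distances r_{x_j,i}
    Ldir = D / r[..., None]                         # unit vectors L_{x_j,i}
    cos = np.clip(np.einsum('jd,jid->ji', Nx, Ldir), 0.0, None)
    return mask * cos / r**2

mask = (rng.random((200, 20)) > 0.3).astype(float)  # synthetic mask term M
K = build_K(X, Nx, P, mask)

# Simulate observations from a known sparse set of radiant intensities.
L_true = np.zeros(20)
L_true[[3, 11]] = [1.0e6, 5.0e5]
I = K @ L_true

L_est, residual = nnls(K, I)                        # min ||KL - I||, L >= 0
```

For noise-free synthetic observations the non-negative solution coincides with the generating intensities; with real images, the conditioning issues discussed in Sect. 6.5.1 come into play.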
6.5 Algorithm for Estimating the 3D Distribution of Point Light Sources with the Skeleton Cube

6.5.1 Technical Problem Specifications

While the problem can be formulated as Eq. (6.6), we have to develop practical methods to solve the following problems.

Positions of point light sources: As assumed above, the 3D positions of all point light sources should be known to solve Eq. (6.6). A straightforward method would be to set a 3D regular grid in the scene and assume the point light sources are placed at the grid points (Fig. 6.6(a)). Such a 3D distribution of point light sources easily makes Eq. (6.6) under-constrained: M (the number of observed 2D data points) ≪ N (the number of 3D grid points). To avoid this problem, we have to develop an effective and efficient method of searching 3D space for the possible positions of point light sources.

Phantom light sources: Consider a pair of point light sources, illustrated as black circles in Fig. 6.6(b), that cast shadows on the ground plane. A pair of phantom point light sources, illustrated as white circles in Fig. 6.6(b), can generate the same shadows. Consequently, the solution of Eq. (6.6) can include both real and phantom point light sources. To mitigate this problem, we introduce the skeleton cube as a reference object and combine multiple cubes to generate as many varieties of self-shadows as possible. As will be discussed later, however, we still need a method of eliminating phantom point light sources from the solution of Eq. (6.6).

View field limitation: Since the view field of a camera is limited, shadows cast by some of the point light sources cannot be observed. As shown in Fig. 6.6(c), shadows from the black point light sources on the ground plane extend beyond the view
Fig. 6.6 Technical problems for estimating 3D distribution of point light sources. ©2010 Springer [12]
field and hence their radiant intensities cannot be accurately estimated by solving Eq. (6.6). The introduction of multiple cameras and/or a well designed placement of camera and reference object will be useful to avoid this problem.
6.5.2 Skeleton Cube

As discussed above, to increase the capability of localizing 3D point light sources by Eq. (6.6), the reference object should be designed so that it generates as many self-shadows as possible. Recall that shadows and shading patterns on the background scene cannot be used because its reflectance property is not known. As illustrated in Fig. 6.3, the skeleton cube is a hollow cube consisting of a group of rectangular pillars whose surfaces are painted in matte white. With this design, a camera can observe internal pillar surfaces, onto which shadows of the pillars are cast. Since a single skeleton cube includes only six different surface normals, an
Fig. 6.7 Dual skeleton cube
additional slanted skeleton cube, about 45◦ rotated from the original, is introduced into the interior space of the original one for experiments (Fig. 6.7). By increasing the number and the directional variety of pillars, the complexity of the mask matrix M = (M (x j , Li )) included in K of Eq. (6.6) is increased, which facilitates the localization of 3D point light sources as well as the suppression of phantoms. To quantitatively evaluate the characteristics of the skeleton cube, we conducted the following two simulations.
6.5.2.1 Effectiveness of Lighting Environment Estimation with Shadows

Since lighting environment estimation from Lambertian shading patterns alone is an ill-posed or numerically ill-conditioned problem, we employ shadows, i.e. the mask matrix M included in K, to make the problem better conditioned. We generated and compared Ks with and without M under the following conditions.

• Lighting environment: A set of point light sources are placed at 209 3D grid points ranging from (−1500, −1000, 0) to (1500, 1000, 2500) with a spacing of 500, as shown in Fig. 6.8.
• Camera position: (450, 450, 450).
• Skeleton cube
  – Cube and pillar sizes: The cube side length and the width of each pillar are set to 100 and 10, respectively.
  – Reflectance properties: kd = 1.0E+9, ks = 0.8E+2 and σ = 0.3. In this simulation, we assumed that the surface reflectance of the skeleton cube follows the simplified Torrance–Sparrow model [15] with the specular reflection component specified by ks and σ, which were estimated for the real skeleton cube under a controlled lighting environment.
  – Number of interior surface points observable from the camera: 2293.
Fig. 6.8 Simulation environments. The points illustrate the 3D positions of the point light sources. The skeleton cube and the camera are also placed in the specified positions, respectively. The grid lines are added for 3D illustration. ©2010 Springer [12]
Fig. 6.9 Complexity of K with and without M . The rows of each matrix K are aligned based on the surface normals of the skeleton cube. The colors denote values of matrix elements; blue for low, red for high, and white for zero. ©2010 Springer [12]
Figure 6.9 compares a pair of Ks with and without M , where the rows of each matrix K are aligned based on the surface normals of the skeleton cube. The colors denote values of matrix elements; blue for low, red for high, and white for zero.
Fig. 6.10 Contributions of singular values of matrix K. The horizontal axis denotes the order of the singular values from the largest and the vertical axis the contribution factor of each singular value. The enlarged part around the origin of the graph is illustrated together. ©2010 Springer [12]
Without M, the variations of the rows are roughly classified into three groups. This is because the number of surface normals observable from a single viewpoint is at most three, which implies that the rank of K without M degenerates to three. In other words, the rows within each group are quite similar to each other except for element-wise value variations due to specular surface reflection and light attenuation determined by the 3D light source positions. With M, on the other hand, the rows of K show much more variation, because the occluded, i.e. shadow generating, light sources vary depending on the observed point positions even if their surface normals are identical.

To quantitatively evaluate the effectiveness of the M produced by the skeleton cube, we apply the singular value decomposition to Ks with and without M to analyze their ranks with respect to the contributions of the major singular values. Figure 6.10 illustrates the relative contribution factors of the major singular values, from which we can observe:

• K without M has only three major singular values, which implies its rank degenerates to three. Consequently, Eq. (6.6) cannot be solved because the number of point light sources N is usually greater than three.
• With M, the contribution values of the three largest singular values are not as dominant, and the singular values after the third still show relatively high contributions. This proves that K with M has a higher rank and can be used to solve Eq. (6.6).
• While not tested, M for the dual skeleton cube shown in Fig. 6.7 would further increase the rank of K.
Fig. 6.11 Self-shadow generation capability of the skeleton cube (see text). ©2010 Springer [12]
In summary, although the simulation setting was limited, the results proved the effectiveness of the skeleton cube in enriching the amount of information available for solving Eq. (6.6). Note that while the algebraic analysis of K suggests that specular reflections on the reference object surface could increase the rank of K, strong specular reflections could also introduce non-negligible interreflections between local surfaces, especially in such complex 3D shapes as the skeleton cube. Thus, 3D lighting environment estimation with specular reference objects is left for future studies.
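The rank analysis above can be reproduced in a small numerical experiment. The sketch below builds a toy shading matrix K for an object with only three visible surface normals, lit by distant candidate lights, and compares the singular value contribution factors with and without a mask M. The mask here is randomly generated purely for illustration, not computed from actual skeleton cube geometry.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small object (side ~100) observed under distant candidate lights, so that
# rows sharing a surface normal are nearly proportional without the mask.
X = rng.uniform(-50, 50, size=(300, 3))        # surface points
Nx = np.eye(3)[rng.integers(0, 3, 300)]        # only three distinct normals
P = rng.normal(size=(60, 3))
P = 10000 * P / np.linalg.norm(P, axis=1, keepdims=True)
P[:, 2] = np.abs(P[:, 2])                      # lights in the upper half space

def shading_matrix(mask):
    """K per Eq. (6.4): clipped Lambertian shading times the mask term."""
    D = P[None] - X[:, None]
    r = np.linalg.norm(D, axis=2)
    cos = np.clip(np.einsum('jd,jid->ji', Nx, D / r[..., None]), 0, None)
    return mask * cos / r**2

def contribution_factors(K):
    """Relative contribution of each singular value (cf. Fig. 6.10)."""
    s = np.linalg.svd(K, compute_uv=False)
    return s / s.sum()

c_plain = contribution_factors(shading_matrix(np.ones((300, 60))))
c_masked = contribution_factors(shading_matrix(rng.random((300, 60)) > 0.4))

# "Effective rank": singular values contributing more than 1 %.
rank_plain = int((c_plain > 0.01).sum())
rank_masked = int((c_masked > 0.01).sum())
```

As in the simulation, the unmasked K has only about three significant singular values, while the masked K spreads its spectrum over many more.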
6.5.2.2 Self-Shadow Generation Capability

Intuitively, the more self-shadows are generated, the more accurately point light sources can be localized. To evaluate the self-shadow generation capability of the skeleton cube, we count, for each possible point light source position, the number of interior surface points of the skeleton cube that are self-occluded when viewed from that position. This number indicates how many surface points are covered by shadow when a single point light source is placed at the specified position in the scene. When the number is close to zero, almost all interior surface points of the skeleton cube are lit by a point light source at that 3D point, and hence only the shading information can be used to localize the light source. As noted before, however, light source estimation from shading has very limited localization capability.

In the simulation, the 3D space ranging from (−1500, −1500, 0) to (1500, 1500, 3000) is sampled with a spacing of 100 to define a set of 3D regular grid points. Figure 6.11 illustrates the 3D distribution of the self-occluded interior surface point
numbers, where red denotes small and blue large. Among 5,300 surface points, the minimum number of occluded points was 704 (when viewed from Vmin in Fig. 6.11) and the maximum 3540. This result demonstrates that the skeleton cube generates enough self-shadows when lit from its surrounding 3D space, except for some points above it such as Vmin in Fig. 6.11. With the dual skeleton cube illustrated in Fig. 6.7, the self-shadow generation capability is much enhanced.
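The mask term M(x, L_i) of Eq. (6.2) and the self-occlusion count used in this simulation can be computed by casting a segment from each surface point toward the light and testing it against the occluding geometry. Below is a minimal sketch based on the Möller–Trumbore ray–triangle test; the triangle soup representation and all function names are illustrative, not the book's implementation.

```python
import numpy as np

def ray_hits_triangle(orig, dest, tri, eps=1e-9):
    """True if the segment orig -> dest crosses triangle tri
    (Moller-Trumbore intersection test)."""
    v0, v1, v2 = tri
    d = dest - orig
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(d, e2)
    det = e1.dot(p)
    if abs(det) < eps:
        return False                       # segment parallel to the triangle
    inv = 1.0 / det
    s = orig - v0
    u = s.dot(p) * inv
    if u < 0 or u > 1:
        return False
    q = np.cross(s, e1)
    v = d.dot(q) * inv
    if v < 0 or u + v > 1:
        return False
    t = e2.dot(q) * inv
    return eps < t < 1 - eps               # hit strictly between x and the light

def mask_term(x, light, triangles):
    """M(x, L) of Eq. (6.2): 1 if the light reaches x, 0 if occluded."""
    return 0 if any(ray_hits_triangle(x, light, t) for t in triangles) else 1

def shadow_count(light, points, triangles):
    """Number of surface points self-occluded from a given light position,
    i.e. the quantity plotted in Fig. 6.11."""
    return sum(1 - mask_term(p, light, triangles) for p in points)
```

Evaluating `shadow_count` over a regular grid of candidate light positions and a triangulated skeleton cube mesh reproduces the kind of 3D occlusion map shown in Fig. 6.11.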
6.5.3 Lighting Environment Estimation Algorithm

Even with the skeleton cube, we have to develop a practical algorithm to solve the problems specified in Sect. 6.5.1. In particular, the regular 3D space sampling used to locate possible point light sources makes Eq. (6.6) heavily under-constrained. To solve this problem, we introduce a three-stage coarse-to-fine 3D space search strategy for localizing possible positions of point light sources. The search algorithm consists of the following stages.

Preprocessing: As described before, let I(R, G, B, i, j, t) denote a color video frame of the reference object and define

I(x) = 0.3 I(R, i, j, 0) + 0.59 I(G, i, j, 0) + 0.11 I(B, i, j, 0),   (6.7)

where I(R, i, j, 0), I(G, i, j, 0), and I(B, i, j, 0) denote the observed intensity values at (i, j) in the three color channels at the initial frame t = 0, respectively. Stages I and II are conducted with the image intensity I(x) defined above to estimate the 3D distribution of point light sources and their radiant intensities. Finally, Stage III computes the radiant intensities of the point light sources in the three color channels, respectively.

Stage I: Directional Search (Fig. 6.12):
Step I-1: First, assume the point light sources are distributed regularly on a hemisphere of a certain radius r0, whose origin is aligned with the center of the skeleton cube. Then, solve Eq. (6.6) by NNLS to obtain their radiant intensities. In practice, the regular sampling on the hemisphere is obtained by recursively subdividing an icosahedron. In the experiments, we used four recursions to obtain a set of regular triangular grid points on the hemisphere.
Step I-2: Repeat Step I-1 while increasing the radius (r0 < r1 < ··· < rK−1), where K denotes the number of different radius values employed. Note that the hemisphere is simply enlarged so as to preserve the regular triangulation on its surface. Possible point light source positions on each hemisphere can be denoted by a pair of angular coordinates (φi, θi) (i = 1, ..., N), where i denotes the vertex ID of the subdivided icosahedron and N the number of vertices.
Step I-3: Integrate the obtained results and extract the directions of possible point light sources, (φj, θj) (j = 1, ..., M):
Fig. 6.12 Directional search strategy (see text). ©2010 Springer [12]
Step I-3-1: For each radius rk, normalize the computed radiant intensity values of the sample points by dividing them by rk²:

\bar{L}(φi, θi, rk) = L(φi, θi, rk) / rk²,   (6.8)

where L(φi, θi, rk) denotes the computed radiant intensity value for the point light source at (φi, θi, rk).

Step I-3-2: For each (φi, θi) (i = 1, ..., N), compute the sum of the normalized radiant intensity values:

\hat{L}(φi, θi) = \sum_{k=0}^{K−1} \bar{L}(φi, θi, rk).   (6.9)

Step I-3-3: Let L* denote the maximum among \hat{L}(φi, θi) (i = 1, ..., N). Then extract the directions of possible point light sources, (φj, θj) (j = 1, ..., M), such that

\hat{L}(φj, θj) ≥ 0.1 L*.   (6.10)

Step I-4: If the sampling resolution of the hemisphere is high enough, go to Stage II. Otherwise, increase the sampling resolution on the hemisphere by triangular face subdivisions around (φj, θj) (j = 1, ..., M) alone
and go to Step I-1. Note that the original vertices (φi, θi) (i = 1, ..., N), except for (φj, θj) (j = 1, ..., M), are preserved for the next iteration.

Stage II: Distance Localization:
Step II-1: Distribute possible point light sources along the 3D lines specified by (φj, θj) (j = 1, ..., M). The distance range from the origin of the skeleton cube is specified taking into account the lighting environment to be modeled.
Step II-2: Solve Eq. (6.6) by NNLS to obtain the radiant intensities at the possible point light source positions.
Step II-3: Let pl (l = 1, ..., L) denote the point light source positions that have meaningful radiant intensity values.

Stage III: Point Light Source Localization, Ambient Light Model, and Color Radiant Intensity Estimation:
Distribute possible light sources at fine 3D regular grid points around pl (l = 1, ..., L), as well as at coarse regular sampling points on the hemisphere with radius r (≫ rK−1). Then, solve Eq. (6.6) by NNLS to obtain their radiant intensities. This final estimation is conducted for each color channel: three versions of Eq. (6.6), generated with I(x) = I(R, i, j, 0), I(x) = I(G, i, j, 0), and I(x) = I(B, i, j, 0), are solved by NNLS, respectively. The color radiant intensity of each point light source position is thus obtained.

Dynamic analysis: To model dynamic lighting environments captured as an image sequence, the above algorithm is applied to the first image at t = 0. Its result is then used as the initial 3D distribution of point light sources for the second image: possible point light sources are distributed around the positions obtained at the previous frame, and Stage III above is conducted. Of course, such simple sequential processing depends on the initial result and often accumulates errors, which can be avoided by re-running the algorithm from the beginning at certain temporal intervals.
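The hemisphere sampling underlying Stage I relies on recursive icosahedron subdivision. A minimal sketch (the function names are ours): each subdivision splits every triangle into four and projects the new midpoint vertices onto the unit sphere, giving 10·4ⁿ + 2 vertices after n subdivisions.

```python
import numpy as np

def icosahedron():
    """Unit icosahedron: 12 vertices, 20 triangular faces."""
    t = (1 + 5 ** 0.5) / 2
    v = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
         (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
         (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    f = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
         (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
         (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
         (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    V = np.array(v, float)
    return V / np.linalg.norm(V, axis=1, keepdims=True), f

def subdivide(V, F):
    """Split each face into four, projecting new midpoints onto the sphere."""
    V = list(V)
    cache, out = {}, []
    def mid(a, b):
        key = (min(a, b), max(a, b))
        if key not in cache:           # share midpoints between adjacent faces
            m = (V[a] + V[b]) / 2
            V.append(m / np.linalg.norm(m))
            cache[key] = len(V) - 1
        return cache[key]
    for a, b, c in F:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        out += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.array(V), out

V, F = icosahedron()
for _ in range(3):        # three subdivisions: 10 * 4**3 + 2 = 642 vertices
    V, F = subdivide(V, F)
```

Extracting the vertices above the ground plane yields the hemisphere samples; the 642-vertex lighting sphere used in Sect. 6.7 corresponds to three subdivisions.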
More advanced methods for dynamic 3D lighting environment estimation that employ temporal continuity constraints on observed radiance, as well as 3D dynamic shapes of point light source distributions, are left for future studies. Basic ideas behind this algorithm are the following. • As discussed before, shadows give rich information to localize point light sources even if they are mutually overlapping. Especially, shadow boundaries enable sharp directional localization, which the directional search in Stage I employs. • The coarse-to-fine search strategy facilitates directional search. Note that Step I-4 preserves the original sampling points on the hemisphere while increasing the sampling resolution around the possible point light source positions. This is because the former sampling is required to model ambient lights in the scene. Stage III also employs such roughly sampled points on the large hemisphere to model ambient lights.
• As discussed before and illustrated in Fig. 6.6(b), phantom light sources may be included in the solution of Eq. (6.6) even if the dual skeleton cube is employed. The algorithm tries to eliminate them by step-wise refinement: from the initial directions (φi, θi) (i = 1, ..., N), it obtains the refined directions (φj, θj) (j = 1, ..., M) at Stage I, and then the positions pl (l = 1, ..., L) at Stage II. The introduction of additional skeleton cube(s), located and oriented differently, is a practical and effective method for eliminating phantom light sources. However, the development of a more direct computational algorithm for phantom light source elimination is left for future studies.
6.6 Performance Evaluation

We evaluated the performance of the above algorithm in a real scene under the following configuration (Fig. 6.13).

Skeleton cube: A pair of skeleton cubes with 100 mm and 300 mm side lengths are combined to form a dual skeleton cube as illustrated in Fig. 6.7: the small cube is hung by a fine thread inside the large cube. We assume the surface reflectance is Lambertian and ignore interreflections.

Lighting environments: The light sources include a large candle consisting of four bundled small candles, and an ambient light generated by candle light reflected from the background scene surface. We slowly moved the candle around the dual skeleton cube in a dark room to capture a video sequence of a dynamically changing 3D lighting environment.

Camera: A Sony PMW-EX3 (1920×1080 pixels, 29.97 fps) was fixed in the scene. We calibrated the camera intrinsic parameters by easycalib [21] and the extrinsic parameters using the geometry of the skeleton cube.

Directional search: The initial hemisphere is generated by (1) recursively subdividing an icosahedron four times and (2) extracting the upper hemisphere above the ground plane, which has 301 vertices. The radius is changed from Rmin = 400 mm to Rmax = 1600 mm with 200 mm spacing. The coarse-to-fine search at Step I-4, which increases the angular resolution by subdividing selected hemisphere surface triangles, was conducted four times.

Depth localization: Along each detected direction (φj, θj) (j = 1, ..., M), 80 point light sources are placed with 15 mm spacing, starting from a point 400 mm away from the center of the skeleton cube.

Point light source localization and ambient light model: On a 3D regular grid of 10 mm spacing, the 26 neighboring grid points of each possible point light position detected by the distance localization are generated.
In addition, an icosahedron of 10 m radius is recursively subdivided twice to generate 71 vertices on its hemisphere, which are then used to model the ambient light in the scene.

Dynamic analysis: Select the point light source positions with meaningful radiant intensity values from those used in the previous video frame. Then add the 26 neighboring grid points for each selected position, and the 71 vertices on the large hemisphere representing the ambient light, and estimate their radiant intensity values for the next video frame.
Fig. 6.13 Configuration for performance evaluation. ©2010 Springer [12]
The left column of Fig. 6.14 shows the captured images, where the candle light position and 3D shape vary over time. The right column illustrates the estimated 3D distributions of point light sources and the synthesized images of the dual skeleton cube illuminated by the estimated lighting environments. Figure 6.15(a) illustrates the synthesized images of the dual skeleton cube without the point light sources. Figure 6.15(b) shows the image intensity differences between the observed and synthesized images of the dual skeleton cube: dark blue denotes small and red large. Table 6.1 gives the quantitative evaluation of the differences. These results show the following.

• Errors in most of the surface areas are low: the average errors are at most 2 % of the image intensity range of 255, which proves that the lighting environments in the scene were estimated almost correctly by the presented algorithm.
• Non-negligible errors, illustrated in light blue in Fig. 6.15(b), are observed in areas where multiple pillars of the dual skeleton cube are connected. These are likely caused by interreflections: even though the real world scene includes interreflections in such areas, the algorithms for both lighting environment estimation and image synthesis do not take them into consideration.

To visually evaluate the performance of the algorithm, we rendered a video sequence of virtual objects: a big statue was placed at the position of the dual skeleton cube and all objects were lit by the estimated dynamic 3D lighting environments (the right column of Fig. 6.15). The observed candle video is superimposed to verify the synchronized dynamic variations between the candle light and the object appearance. We can observe that the objects are photo-realistically illuminated, with the flickers of the candle as well as the soft shadows on the object surfaces reproduced by the estimated 3D lighting environments.
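The quantities reported in Table 6.1 (maximum, average, and variance of per-pixel intensity differences) can be computed straightforwardly; the function name and the optional validity mask below are illustrative, not the book's code.

```python
import numpy as np

def intensity_differences(observed, synthesized, valid=None):
    """Max / average / variance of per-pixel absolute intensity differences
    between observed and synthesized 8-bit grayscale images (cf. Table 6.1)."""
    d = np.abs(observed.astype(float) - synthesized.astype(float))
    if valid is not None:             # e.g. exclude screw holes, pillar edges
        d = d[valid]
    return d.max(), d.mean(), d.var()

# Tiny illustration on a 2x2 image pair:
mx, avg, var = intensity_differences(np.array([[10, 20], [30, 40]]),
                                     np.array([[12, 18], [30, 44]]))
```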
From these experimental results, we can observe the following:
• The directional search worked very well in estimating the 3D direction of the candle. This proves that shadows on the dual skeleton cube carry rich information for estimating the 3D directions of light sources and that the directional search can eliminate phantom light sources.
Fig. 6.14 Experimental results (I). The colors of the point light sources illustrated at the left side of (b) denote the estimated radiant intensity values: blue denotes low and red high. The green surface areas of the skeleton cube in (b) denote areas that were not used for the estimation since they were not suitable for sampling intensities, e.g. screw holes for assembling the skeleton cube and edges of the pillars. ©2010 Springer [12]
• The performance of the distance localization, on the other hand, is limited. Especially, in the initial frame shown at the top of Fig. 6.14(b), the localization along the estimated direction did not work well. This is because the distance localization relies on the numerical computation of shading and light attenuation, whose
Fig. 6.15 Experimental results (II). The colors in (b) denote image intensity differences between the observed and synthesized dual skeleton cube images: dark blue denotes small and red large. (c) shows virtual objects images illuminated by the estimated lighting environments, on which the captured candle images are superimposed. ©2010 Springer [12]
Table 6.1 Intensity differences between the observed and synthesized dual skeleton cube images. ©2010 Springer [12]

                          Initial frame   # 60    # 120   # 180   # 240
Max difference                    64.3    55.8     49.4    38.7    50.0
Average of differences            4.66    4.31     3.84    2.67    3.71
Variance of differences           18.4    15.1     11.1    5.91    10.7
stability and accuracy are limited by the limited intensity range of the observed images, which is 8 bits. Hence, when solving Eq. (6.6), NNLS is easily trapped in a local minimum and cannot exploit the geometric properties required to localize point light sources.
• The algorithm can estimate dynamic 3D lighting environments, although its spatial resolution and accuracy are limited.
• The light field reconstructed from the estimated 3D dynamic light sources models real world lighting environments compellingly, and can be used to render natural and artistic atmospheres in CG and 3D video contents.

The following are topics for future studies:
• While not tested, the introduction of additional skeleton cube(s) placed at different position(s) and orientation(s) in the scene could improve the performance. In particular, if the scene includes multiple dynamic 3D light sources, such methods with multiple skeleton cubes would be required.
• As noted before, accurate photometric camera calibration would enable multi-view observation of the scene, and improve the stability and accuracy of the algorithm.
• The introduction of highlight and interreflection analyses would improve the performance. In particular, making the skeleton cube surfaces specular would be a reasonable immediate augmentation: the exterior surfaces of the outer skeleton cube of the dual skeleton cube can be made specular without introducing interreflections.
• A challenging future study would be to introduce geometric constraints into Eq. (6.6): neighboring relations between object surface points as well as between point light sources. Most computer vision algorithms regard individual 2D or 3D points as basic elements for algebraic formulations and conduct optimization as described here. Since these elements represent geometric entities, their geometric and topological relations would improve the algebraic optimization process if algebraic representations of such relations were developed.
Recall that the algorithm presented here conducts 3D space searches to make use of geometric relations. Therefore, the geometric relations are represented procedurally instead of algebraically. • As illustrated in Figs. 6.9 and 6.11, characteristics of a reference object can be represented algebraically and geometrically. With these characterizations, it should be possible to compute the optimal reference object design and layout for a specified real world scene.
6.7 Surface Reflectance Estimation and Lighting Effects Rendering for 3D Video

This section addresses applications of estimated lighting environments: surface reflectance estimation and lighting effects rendering for 3D video. As discussed in Chap. 5, the texture generation algorithms presented there generate appearances of
an object in motion viewed from a specified viewpoint, a virtual camera, and hence cannot render 3D video under different lighting environments from those of the multi-view video capture studio. This section, on the other hand, first presents a generic texture generation algorithm, where generic surface reflectance properties are estimated based on known lighting environments. Then, with a sequence of 3D mesh data with generic texture, 3D video under a variety of lighting environments can be rendered.
6.7.1 Generic Texture Generation

6.7.1.1 Computational Model

Figure 6.16 illustrates the computational model employed here for generic texture generation. The assumptions used are as follows.

3D object shape: A sequence of 3D mesh data is generated by the 3D shape reconstruction process described in Chap. 4. Given the 3D object shape and camera viewpoints, we can compute occlusions.

Reflectance model: We assume that the object surface follows Lambertian reflection. Note that this does not mean that the algorithm presented in this section is applicable only to Lambertian surfaces. It extracts diffuse-reflection components from the observed radiance and then estimates the diffuse-reflectance parameters to model the generic texture. As is well known, diffuse-reflectance parameters characterize the properties of surface materials, while specular-reflectance parameters characterize the surface roughness (Fig. 6.17). Thus, the algorithm estimates the former while eliminating the latter. The estimation of specular reflection properties is more difficult than that of diffuse reflection, for the following reasons.

Surface normal: Since specular reflection patterns vary widely depending on surface normals, accurate surface normal estimation is required for their analysis. As discussed in Chap. 4, however, it is hard to realize such accurate estimation due to the limited accuracy of camera calibration and 3D shape reconstruction.

Lighting environments: The specular reflection is also sensitive to the direction of the incident light flux. As discussed in the previous section, however, the accuracy of lighting environment estimation is still limited.

Observability of specular reflection: The specular reflection, especially highlights, can be observed only in a narrow range of reflection angles. A limited number of cameras does not allow us to observe such a narrow band of specular reflections on an object in motion densely or consistently.
Lighting environments: We model the lighting environments of the 3D video studio as a spherical distribution of directional light sources whose radiant intensities gradually decrease from top to bottom, for the following reason.
222
6
Estimation of 3D Dynamic Lighting Environment with Reference Objects
Fig. 6.16 Computational model for the generic texture generation. Colored circles denote known entities. The gray and black arrows illustrate “generation” and “estimation” processes, respectively
Fig. 6.17 Reflection model at an object surface
While the lighting environment estimation algorithm presented in the previous section is designed primarily for proximal light sources, the lighting environments of the 3D video studio are designed to illuminate an object as uniformly as possible, as described in Chap. 2. Thus, a spherical distribution of directional light sources well approximates the physical lighting environments of the studio, including the direct illumination from the ceiling light tubes and the indirect illumination from the walls and the floor (see Fig. 2.4). Given the lighting environments and the 3D object shape, we can compute shading and shadows. Figure 6.18 illustrates the model of lighting environments for the generic texture generation. First, an icosahedron is recursively subdivided to obtain 642 vertices, at which point light sources are placed. The radiant intensities of the point light sources are modeled as follows, supposing the radius of the sphere is 1 and the origin is located at its center.

Zone (a): Point light sources placed at heights greater than 0.6 above the origin are given a radiant intensity of 1.0, representing direct illumination from the ceiling lights.
6.7 Surface Reflectance Estimation and Lighting Effects Rendering for 3D Video
223
Fig. 6.18 Lighting environments for the generic texture generation. A group of point light sources are placed on the spherical surface. Their radiant intensities vary depending on their heights, which models direct illuminations from the ceiling lights, and indirect illuminations from the walls and the floor
Zone (b): Point light sources placed at heights between −0.6 and 0.6 are given radiant intensities ranging from 0.1 (bottom) to 0.2 (top), representing indirect illumination from the walls.

Zone (c): Point light sources placed below −0.6 from the origin are given a radiant intensity of 0.2, representing indirect illumination from the floor.

These values were determined heuristically, because the lighting environment estimation algorithm presented earlier was not available when the multi-view video data used for the experiments were captured. Applying that algorithm would improve the performance of the generic texture generation.
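The zone-based light-sphere model above can be sketched as follows. The subdivision counts (12 → 42 → 162 → 642 vertices) and the zone intensities follow the text; taking the y-axis as vertical, interpolating linearly inside zone (b), and all function names are illustrative assumptions, not the book's implementation.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def icosahedron():
    # 12 vertices / 20 faces of a regular icosahedron on the unit sphere.
    t = (1 + 5 ** 0.5) / 2
    verts = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
             (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
             (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [normalize(v) for v in verts], faces

def subdivide(verts, faces):
    # Split each triangle into four, sharing edge midpoints, so the vertex
    # count grows 12 -> 42 -> 162 -> 642 over three subdivisions.
    verts, cache, new_faces = list(verts), {}, []
    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            a, b = verts[i], verts[j]
            verts.append(normalize([(a[k] + b[k]) / 2 for k in range(3)]))
            cache[key] = len(verts) - 1
        return cache[key]
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return verts, new_faces

def radiant_intensity(height):
    # Zone (a): ceiling lights; zone (c): floor; zone (b): walls, assumed
    # to interpolate linearly from 0.1 at height -0.6 to 0.2 at +0.6.
    if height > 0.6:
        return 1.0
    if height < -0.6:
        return 0.2
    return 0.1 + 0.1 * (height + 0.6) / 1.2

verts, faces = icosahedron()
for _ in range(3):
    verts, faces = subdivide(verts, faces)
lights = [(v, radiant_intensity(v[1])) for v in verts]  # y taken as vertical
```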
6.7.1.2 Computation Algorithm

As discussed in Sect. 6.4, the radiance of the object surface is given by Eq. (6.2). Since we assume that the lighting environment consists of multiple directional light sources, the equation is rewritten as

I(x) = Σ_{i=1}^{N} M(x, L_i) k_d^x (N_x · L_i) L_{L_i},    (6.11)

where x ∈ R^3 denotes a point on the object surface. M, k_d^x, N_x, i, L_i, and L_{L_i} denote the mask term, the diffuse albedo at x, the normal vector at x, a light source ID, the direction of light source i, and its radiant intensity, respectively. The radiance at point x is considered identical to the pixel value corresponding to the point in the image, as described in Sect. 6.4. The critical difference from Sect. 6.4 is that x can be observed from multiple cameras. In most cases, x has multiple radiance values observed in the multi-view images. To compute the radiance at x due only to the diffuse reflection, we take the median among the observed intensity values as discussed in Sect. 5.3.3; the median gives a good approximation of the diffuse reflection and eliminates effects
Fig. 6.19 Computational process of the generic texture generation
by the specular reflection. Note that here, each of the R, G, B color channels is processed independently. This color-channel-wise processing is conducted as follows. From Eq. (6.11), the diffuse-reflection property, i.e. the diffuse albedo, can be computed by

k_d^x = I(x) / Σ_{i=1}^{N} M(x, L_i) (N_x · L_i) L_{L_i}.    (6.12)
Figure 6.19 shows the computational process of the generic texture generation:

Step I: For each color channel, generate a texture image for the 3D mesh data by applying the median process to the multi-view images (for the texture image generation, see Sect. 5.2). We regard the texture map as the radiance map, each pixel of which represents I(x) in Eq. (6.12).

Step II: Generate the masked diffuse-reflection map that represents the diffuse-reflection component with the mask term, i.e. the denominator of the right-hand side of Eq. (6.12). Note that the geometry of this map is identical to that of the radiance map.

Step III: Generate the diffuse albedo map by applying pixel-wise division between the radiance map and the masked diffuse-reflection map, where each pixel represents the diffuse albedo in Eq. (6.12). Uncolored pixels in the diffuse albedo map in Fig. 6.19 denote undefined colors.

Figure 6.20 shows the computed results of the radiance map, the masked diffuse-reflection map, and the diffuse albedo map, where the results for the three color components were integrated to render these maps.

Fig. 6.20 Computed radiance map, masked diffuse-reflection map, and diffuse albedo map

As shown in Fig. 6.20(b), the intensities of pixels vary depending on surface normals and light source masking conditions, i.e. shadows. Figure 6.20(c) shows the estimated generic surface property, i.e. the diffuse albedo, whose variations are smaller than those in the radiance map. This implies that the lighting effects are removed by the method described above.
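The generic texture computation above reduces to a per-channel median over cameras followed by a pixel-wise division by the masked shading term, i.e. the denominator of Eq. (6.12). A minimal NumPy sketch, assuming per-pixel normals, per-light visibility masks, and per-camera radiance samples are already available (array layouts and names are illustrative):

```python
import numpy as np

def masked_diffuse_term(normals, light_dirs, light_ints, mask):
    # Denominator of Eq. (6.12): sum_i M(x, L_i) (N_x . L_i) L_{L_i}.
    # normals: (P, 3) unit normals; light_dirs: (N, 3) unit directions;
    # light_ints: (N,) intensities; mask: (P, N) visibility (1 = unoccluded).
    ndotl = np.clip(normals @ light_dirs.T, 0.0, None)      # (P, N)
    return (mask * ndotl * light_ints).sum(axis=1)          # (P,)

def diffuse_albedo_map(obs, normals, light_dirs, light_ints, mask):
    # obs: (C, P, 3) RGB radiance of P texture pixels seen by C cameras.
    radiance = np.median(obs, axis=0)        # Step I: median per channel
    denom = masked_diffuse_term(normals, light_dirs, light_ints, mask)
    albedo = np.full_like(radiance, np.nan)  # Step III: NaN = undefined pixel
    lit = denom > 1e-6
    albedo[lit] = radiance[lit] / denom[lit, None]
    return albedo
```

The median across cameras approximates the diffuse component by rejecting view-dependent specular outliers, as in the texture-map pipeline described above.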
6.7.2 Lighting Effects Rendering

Given a sequence of 3D mesh data with generic surface reflectance properties, free-viewpoint object images under arbitrary specified lighting environments can be rendered. Namely, the lighting effects of 3D video can be edited. Note that 3D video rendering with the generic texture requires lighting environments, which can be synthesized artificially or estimated by lighting environment estimation algorithms. Figure 6.21 shows 3D video of MAIKO rendered by modifying the lighting effects. As shown in the figure, the lighting effects are naturally represented on the MAIKO surface. Furthermore, specular reflections can be easily introduced by augmenting the diffuse albedo map; the specular albedo map is introduced by specifying the specular reflection properties of each pixel. Figure 6.22 illustrates 3D video of MAIKO with specular reflections under various lighting effects. The simplified Torrance–Sparrow model [15] with specular reflection properties specified by k_s and σ was used. The sash surface was given high specular reflection properties. Finally, Fig. 6.23 shows 3D video of MAIKO under the lighting environments estimated in Sect. 6.6, i.e. lighting environments with the moving candle. In addition, a high dynamic range image of a night scene is employed to create a realistic scene. For image rendering, the scale of the estimated lighting environments was adjusted to fit the size of the human MAIKO. As shown in the figure, the effects of the dynamic candle fire render a special atmosphere, which well represents Japanese intangible cultural assets. Since these lighting effects can be easily edited by any ordinary 3D CG software, the lighting environment estimation and the generic texture generation will expand the utility of 3D video as a new visual medium.

Fig. 6.21 3D video under various lighting effects. The white spheres in (a) and (b) denote light sources in the scene. The sky-lighting images, (c) and (d), were rendered with the lighting of the Blender Sky Texture [3]. The natural-lighting images, (e) and (f), were rendered with high dynamic range (HDR) image-based lighting [5]. The HDR image is courtesy of www.openfootage.net
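The specular rendering above uses the simplified Torrance–Sparrow model [15] with parameters k_s and σ; since its exact normalization is not reproduced in the text, the sketch below uses one common variant (a Lambertian term plus a Gaussian specular lobe in the angle between the normal and the half vector, attenuated by 1/(N·V)). All names and the normalization are assumptions:

```python
import numpy as np

def simplified_torrance_sparrow(n, l, v, kd, ks, sigma, light_int=1.0):
    # n: surface normal, l: direction to the light, v: direction to the viewer.
    # sigma models the surface roughness, i.e. the width of the specular lobe.
    n, l, v = (np.asarray(x, float) / np.linalg.norm(x) for x in (n, l, v))
    h = (l + v) / np.linalg.norm(l + v)            # half vector
    alpha = np.arccos(np.clip(n @ h, -1.0, 1.0))   # angle between N and H
    diffuse = kd * max(float(n @ l), 0.0)
    specular = ks * np.exp(-alpha ** 2 / (2 * sigma ** 2)) / max(float(n @ v), 1e-6)
    return light_int * (diffuse + specular)
```

A larger k_s produces the strong highlights visible on the sash and the hair ornaments in Fig. 6.22.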
6.8 Conclusions

This chapter presented a novel lighting environment estimation algorithm with a specially designed reference object, the skeleton cube, and its applications to generic texture generation and lighting effects editing. Experimental results demonstrated their effectiveness in rendering 3D video under various lighting environments, which will enhance the utility of 3D video as a new visual medium.
Fig. 6.22 3D video with edited specular reflections under various lighting effects. The lighting environments are identical to those of Fig. 6.21. Strong highlights on the sash and the hair ornaments can be observed in the images
As discussed before, the algorithm should be augmented in various respects to make it effective in real-world scenes. There are still many interesting research topics to be studied in the lighting environment estimation. Among others, as described in the last part of Sect. 6.6, the introduction of topological and geometric constraints into algebraic formulations will lead to substantial improvements of computer vision algorithms. While the diffuse-reflection properties were estimated from multi-view images, the estimation of the specular reflection properties requires dense multi-view observation of a specular surface area, as discussed in Sect. 6.7.1.1. With a limited number of cameras, temporal data integration methods should be developed; they are also required to improve the lighting environment estimation algorithm.
Fig. 6.23 3D video under the lighting environments with the candle fire estimated in Sect. 6.6. The 3D mesh data of MAIKO in the images are identical but rotated in the scene. The lighting environment consists of the candle fire estimated in Sect. 6.6 and HDR image-based lighting. The illumination of the candle fire varies for each image, i.e. frames # 60, # 120, and # 180 from the left. The HDR image is courtesy of www.openfootage.net
References

1. Basri, R., Jacobs, D.W.: Lambertian reflectance and linear subspaces. IEEE Trans. Pattern Anal. Mach. Intell. 25(2), 218–233 (2003)
2. Belhumeur, P.N., Kriegman, D.J.: What is the set of images of an object under all possible illumination conditions? Int. J. Comput. Vis. 28(3), 245–260 (1998)
3. Blender Foundation: http://www.blender.org/
4. Debevec, P.: Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In: SIGGRAPH'98: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 189–198 (1998)
5. Debevec, P.: Image-based lighting. IEEE Comput. Graph. Appl. 22(2), 26–34 (2002)
6. Hara, K., Nishino, K., Ikeuchi, K.: Determining reflectance and light position from a single image without distant illumination assumption. In: IEEE International Conference on Computer Vision, vol. 2, pp. 560–567 (2003)
7. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. Society for Industrial and Applied Mathematics, Philadelphia (1987)
8. Marschner, S.R., Greenberg, D.P.: Inverse lighting for photography. In: Fifth Color Imaging Conference, pp. 262–265 (1997)
9. Nillius, P., Eklundh, J.-O.: Automatic estimation of the projected light source direction. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. I, pp. 1076–1083 (2001)
10. Powell, M.W., Sarkar, S., Goldgof, D.: A simple strategy for calibrating the geometry of light sources. IEEE Trans. Pattern Anal. Mach. Intell. 23(9), 1022–1027 (2001)
11. Ramamoorthi, R., Hanrahan, P.: A signal-processing framework for inverse rendering. In: ACM SIGGRAPH, pp. 117–128 (2001)
12. Ronfard, R., Taubin, G. (eds.): Image and Geometry Processing for 3-D Cinematography. Geometry and Computing, vol. 5. Springer, Berlin (2010)
13. Sato, I., Sato, Y., Ikeuchi, K.: Acquiring a radiance distribution to superimpose virtual objects onto a real scene. IEEE Trans. Vis. Comput. Graph. 5(1), 1–12 (1999)
14. Sato, I., Sato, Y., Ikeuchi, K.: Illumination from shadows. IEEE Trans. Pattern Anal. Mach. Intell. 25(3), 290–300 (2003)
15. Solomon, F., Ikeuchi, K.: Extracting the shape and roughness of specular lobe objects using four light photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. 18, 449–454 (1996)
16. Takai, T., Maki, A., Niinuma, K., Matsuyama, T.: Difference sphere: an approach to near light source estimation. Comput. Vis. Image Underst. 113(9), 966–978 (2009)
17. Unger, J., Wenger, A., Hawkins, T., Gardner, A., Debevec, P.: Capturing and rendering with incident light fields. In: EGRW'03: Proceedings of the 14th Eurographics Workshop on Rendering, pp. 141–149 (2003)
18. Wang, Y., Samaras, D.: Estimation of multiple directional illuminants from a single image. Image Vis. Comput. 26(9), 1179–1195 (2008)
19. Yang, Y., Yuille, A.: Sources from shading. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 534–539 (1991)
20. Zhang, Y., Yang, Y.-H.: Multiple illuminant direction detection with application to image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 23(8), 915–920 (2001)
21. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
22. Zhou, W., Kambhamettu, R.: Estimation of illuminant direction and intensity of multiple light sources. In: European Conference on Computer Vision, pp. 206–220 (2002)
Part III
3D Video Applications
Following the multi-view video capture systems in Part I and the 3D video production methods in Part II, this part addresses applications of 3D video: visualization in Chap. 7, content-based encoding in Chap. 8, kinematic human motion estimation in Chap. 9, and data compression in Chap. 10. A major focus throughout these chapters is placed on data representations. The 3D video production methods presented in the previous part generate a 3D video stream from multi-view video data, which is represented by a sequence of textured 3D mesh data. Here we assume that the sequence represents a 3D video of a single object in motion and that each 3D video frame is produced independently of the others: the mesh structure (i.e. the number of vertices and their connectivities) varies over time without any temporal correspondences. Basically, each application analyzes this textured 3D mesh data to generate a new data representation suitable for its task. For example, the visualization methods in Chap. 7 transform the textured 3D mesh data into 2D images representing virtual views of an object in motion. The content-based encoding method in Chap. 8 partitions the sequence into a set of intervals representing atomic human behaviors, named behavior units. While these applications preserve the given textured 3D mesh data, the latter two chapters analyze non-textured 3D mesh data to transform it into new data representations: Chap. 9 produces a kinematic motion description of a human represented by a sequence of skeleton structures, and Chap. 10 generates a sequence of 2D images recording 3D mesh vertex coordinates. In these methods, the originally captured multi-view video data are employed to analyze and generate surface texture, for the following reasons.

• Since, as discussed in Chap. 5, the texture generation can introduce artifacts into the textured 3D mesh data, the multi-view video data, which do not carry such artifacts, should be employed to estimate an accurate kinematic motion description.

• Since the transformation from 3D mesh data into 2D images for the data compression significantly modifies the 3D mesh structures, the texture generation for the compressed 3D mesh data should be done using the multi-view video data separately.

Another focus throughout this part is the exploration of surface-based 3D object shape representation methods for analyzing object actions. First of all, multi-view video data carry object surface information and hence a 3D video stream
is naturally represented in a surface-based form, i.e. as textured 3D mesh data. Chapter 8 introduces the Reeb graph to represent the global topological structure of an entire 3D object surface. The kinematic motion estimation method in Chap. 9 conducts matching between a skin-and-bones human model and the 3D mesh data and establishes their mutual surface correspondences. The 3D mesh compression in Chap. 10 also analyzes the global object surface topology to search for a temporally stable surface cut graph for each 3D mesh frame, at which the 3D surface mesh is cut open and unfolded onto a 2D plane. This allows us to convert a 3D video stream into a 2D video stream.
Chapter 7
Visualization of 3D Video
7.1 Introduction

Visualization is one of the most standard applications of 3D video. Its essential functionality includes interactive free-viewpoint and 3D (pop-up) visualization of the captured scene as is. As discussed in Chap. 1, while the free-viewpoint visualization of a 3D scene can be achieved either by model-based or by image-based methods, the former has more flexibility than the latter in visualization as well as in content editing, which will be presented in the later three chapters.

• As demonstrated in Chap. 6, the lighting environments of 3D video can be modified to render natural and/or artistic atmospheres.

• While the viewpoint in the image-based visualization method is constrained to lie around the cameras used for multi-view video capture, the viewpoint in the model-based method can be placed anywhere.

This chapter presents a novel free-viewpoint visualization method for a 3D video stream of a single human in action represented by a sequence of textured 3D mesh data. The novelty rests in that the 3D video is visualized from the performer's viewpoint. Ordinary free-viewpoint visualization methods render the object action viewed from outside the scene; we may call this an objective, or third-person, view of the object action. With 3D video data, moreover, we can render a subjective, or first-person, view of the object action, where the action is visualized as if it were captured by a head-mounted camera. Such subjective visualization is very useful for understanding where to look when performing juggling or traditional dances; in MAIKO dances, for example, eye motions are very important for expressing mental feelings. While a given sequence of textured 3D mesh data is sufficient for the objective visualization, face and gaze positions and their motions must be computed from the data for the subjective visualization. In other words, the subjective visualization implies a content-based semantic visualization of a 3D video stream.
With ordinary free-viewpoint visualization methods, a user has to specify a viewpoint. The subjective visualization, on the other hand, automatically computes the appropriate viewpoint for each 3D video frame. In this sense, the subjective visualization implies an automatic viewpoint specification and control method. This chapter first introduces the standard objective visualization system we developed in Sect. 7.2. Section 7.3 describes the gaze estimation from 3D video for the subjective visualization with its quantitative performance evaluation results. Section 7.4 concludes the chapter with discussions and future work.

T. Matsuyama et al., 3D Video and Its Applications, DOI 10.1007/978-1-4471-4120-4_7, © Springer-Verlag London 2012
7.2 3D Video Visualization System

We developed an objective visualization system for 3D video, which includes the following functions (Fig. 7.1(a) shows captured multi-view video frames).

Omni-directional background image and video: A 3D video stream can be placed in a 3D scene that is rendered from a real omni-directional image or video (Fig. 7.1(b)) as well as from synthesized CG images.

3D object positioning and duplication: The 3D positioning of an object and its duplication can be done interactively (Fig. 7.1(c)).

Multiple objects: Multiple 3D video streams can be integrated into a scene. Figure 7.1(d) consists of two SAMURAIs, i.e. Japanese warriors, in action that were captured independently.

Real-time interactive visualization: A variety of visualization parameters such as viewpoint, view direction, zoom, and stop & go can be specified interactively (Fig. 7.1(e)).

Vertex, mesh, texture rendering: A 3D video frame can be visualized as a 3D distribution of colored mesh vertices, a 3D wireframe, or surface texture. In Fig. 7.1(f), the lower part of MAIKO is rendered as colored wireframe without surface texture.

Interactive parallax control for 3D TV: A pair of stereo images can be rendered by specifying a parallax parameter for pop-up 3D visualization (Fig. 7.1(g)).

The system can visualize a 3D video stream by both the view-dependent vertex-based texture generation method (Sect. 5.4) and the harmonized texture generation method (Sect. 5.5), while the latter requires preprocessing to convert a sequence of textured 3D meshes into the data structure for the harmonized texture generation. The computation times to render an image (800 × 800 pixels) by these methods with a modern PC (CPU: Intel Core i7 2.4 GHz, Memory: 16 GB, GPU: NVIDIA GeForce GTX 485M, Video memory: 2 GB) are as follows.

View-dependent vertex-based texture generation: 13.804 msec for an original 3D mesh with 142,946 vertices and 285,912 faces.
Harmonized texture generation: 68.264 msec for a simplified 3D mesh with 4,644 vertices and 9,308 faces.

While these functions enable versatile visualization of visual content, the editing capability of the system is limited. That is, since the given 3D video streams are preserved during visualization:
Fig. 7.1 Functions of the objective visualization system
• The order of actions in a stream cannot be changed.
• The speeds or action patterns cannot be modified.

Chapter 8 introduces a method of partitioning a stream into a set of intervals representing atomic human behaviors, named behavior units, which enables us to edit action orders to render a new 3D video stream. Chapter 9 produces a kinematic motion description of a human action from a 3D video stream. The development of an action editor which modifies the kinematic motion description to render a new 3D video is left for future studies: e.g. compose a 3D video stream of an action scene,
where multiple persons perform complicated fights, from a set of independently produced 3D video streams of individual persons' actions.
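The interactive parallax control listed among the system functions above can be sketched by offsetting a pair of virtual cameras symmetrically about the user-specified viewpoint; the book does not detail its parameterization, so the axis convention and names below are assumptions:

```python
import numpy as np

def stereo_eye_positions(viewpoint, view_dir, up, parallax):
    # Separate the left/right virtual cameras by `parallax` along the axis
    # orthogonal to the viewing direction and the up vector.
    view_dir = np.asarray(view_dir, float)
    view_dir = view_dir / np.linalg.norm(view_dir)
    right = np.cross(view_dir, np.asarray(up, float))
    right /= np.linalg.norm(right)
    offset = 0.5 * parallax * right
    vp = np.asarray(viewpoint, float)
    return vp - offset, vp + offset
```

Rendering the scene once from each returned position yields the stereo pair for pop-up 3D visualization.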
7.3 Subjective Visualization by Gaze Estimation from 3D Video

As discussed before, the subjective visualization is useful for 3D video of human performances such as traditional dances, juggling, fighting, etc. The objective visualization system described in the previous section can generate free-viewpoint images once the viewpoint is specified. Thus, what we have to do for the subjective visualization is to estimate the subjective viewpoint, i.e. the face position and gaze direction of the performer, from each 3D video frame, and apply the objective visualization with the estimated viewpoint. For this estimation, we employ the originally captured multi-view video data rather than the texture patterns generated on the 3D mesh data, because the texture generation process usually introduces artifacts, which may mislead the gaze estimation process. As the proverb "the eye is the window of the heart" says, gaze estimation and analysis from video data have been well studied in the psychology, human interface, and communication areas [5]. While wearable [10] or remote [11] gaze measurement devices can be used in constrained environments, computer vision technologies can realize gaze estimation in unconstrained natural environments, and hence many methods have been proposed [4]. This section presents a novel 3D gaze estimation method from a 3D video stream. Figure 7.2 illustrates the overall processing scheme. Given a sequence of 3D mesh data and corresponding multi-view video data (Fig. 7.2 I), the gaze estimation method applies the following processes to each video frame:

1. First extract 2D face regions in the multi-view images to estimate a rough 3D face surface area in each 3D mesh (Fig. 7.2 II and III).
2. Estimate the symmetry plane of the 3D face surface area: (1) first extract 3D feature points in the estimated 3D face surface area and then (2) generate the symmetry plane by evaluating symmetric properties among the feature points (Fig. 7.2 IV and V).
3. Reconstruct an accurate and high-resolution frontal face surface by applying a super-resolution 3D shape reconstruction technique with the symmetry prior (Fig. 7.2 VI).
4. Generate a virtual frontal face image with super-resolution (Fig. 7.2 VII).
5. Estimate the 3D gaze from the virtual frontal face image using a 3D eyeball model (Fig. 7.2 VIII).

The ideas behind this method are as follows. As discussed in Chaps. 4 and 5, the accuracy of the reconstructed 3D shape data is limited due to errors in the calibration and shape reconstruction processes, which could mislead the gaze estimation and/or decrease its accuracy. Fortunately, the 3D face surface is rather flat, which allows
Fig. 7.2 Computational processes for 3D gaze estimation (see text)
many cameras to observe it, and moreover, it has symmetric properties in both 3D shape and surface texture. Thus a super-resolution technique with a symmetry prior can be applied to increase the 3D shape accuracy and the image resolution, making full use of the original multi-view images. This method allows the face to move freely in the scene, and therefore realizes a non-contact and non-constrained 3D gaze estimation.
7.3.1 3D Face Surface Reconstruction Using Symmetry Prior

Here we present the super-resolution 3D face surface reconstruction algorithm using a symmetry prior, given a 3D mesh and corresponding multi-view images. It consists of (1) 3D face area detection, (2) symmetry plane estimation, and (3) 3D face surface reconstruction in super-resolution. The algorithm processes frames one by one sequentially.
7.3.1.1 3D Face Area Detection

The basic idea of the 3D face area detection is to use the 3D mesh as a voting space for accumulating partial evidence produced by applying an ordinary 2D face detector to each of the multi-view images. The evidence accumulation makes it possible to eliminate false-positive face detections in the 2D images and to localize an accurate 3D face area on the 3D mesh. Let M denote a 3D mesh of an object and Ii (i = 1, . . . , N) a set of corresponding multi-view images captured by cameras ci (i = 1, . . . , N). The face area detection algorithm (Fig. 7.2 II and III) is defined as follows. Note that in what follows, Step X denotes the process X illustrated in Fig. 7.2. First the algorithm detects a set of 2D face candidate regions Fi by applying a conventional 2D face detector to each Ii. The blue rectangles in Fig. 7.3 show Fi for each image. Fi may include false-positive face areas due to texture patterns which accidentally look like a human face. Then all Fi are mapped onto M for evidence accumulation.

Step II: Apply the Viola and Jones face detector [13] to each image Ii (i = 1, . . . , N) to obtain a group of face candidate regions Fi = {fij | fij ∈ Ii, j = 1, . . . , ni}, where ni denotes the number of face candidate regions in Fi.

Step III-1: Let M = {V, E} denote a 3D mesh consisting of a vertex set V and an edge set E. For each vertex v ∈ V, compute a per-vertex "faceness" score L(v) by the following method:
Step III-1-1: For each v, let L(v) = 0.
Step III-1-2: For each camera ci, let vi denote the projection of vertex v on image Ii. If vi falls in Fi, then let L(v) = L(v) + 1.

Step III-2: Compute the set of vertices Vc = {v | L(v) > 0}, and partition it into disjoint subgroups of connected vertices S = {si | s1 ∪ s2 ∪ · · · ∪ sn = Vc, sj ∩ sk = ∅ (j ≠ k), all vertices in si are connected}. Here n denotes the number of subgroups.
Step III-3: For each vertex group si in S, calculate the average of L(v) by

L̄_i(v) = ( Σ_{v ∈ s_i} L(v) ) / N(s_i),    (7.1)

where N(si) denotes the number of vertices in si.
7.3 Subjective Visualization by Gaze Estimation from 3D Video
239
Fig. 7.3 2D face detection in multi-view images. Blue rectangles denote the detected 2D face candidate regions
Step III-4: Find the si with the largest L̄_i(v) and denote it by Vf:

V_f = arg max_{s_i} L̄_i(v).    (7.2)

Return the sub-mesh Mf = {Vf, Ef} as the 3D face area.

Figure 7.4 shows the detected 3D face area Mf.
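The voting and group-selection steps above can be sketched as follows; vertex projection and the connected-component partitioning of Step III-2 are assumed to be done elsewhere, and all names are illustrative:

```python
def faceness_scores(num_vertices, projections, face_rects):
    # Steps III-1-1/2: L(v) counts the cameras whose detected 2D face
    # rectangles contain the projection of vertex v.
    # projections[i][v]: (x, y) projection of vertex v in image i.
    # face_rects[i]: list of (x0, y0, x1, y1) candidate boxes in image i.
    scores = [0] * num_vertices
    for proj, rects in zip(projections, face_rects):
        for v, (x, y) in enumerate(proj):
            if any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in rects):
                scores[v] += 1
    return scores

def best_face_group(groups, scores):
    # Steps III-3/4 (Eqs. (7.1), (7.2)): the connected vertex group with the
    # highest average score is returned as the face area vertex set V_f.
    return max(groups, key=lambda g: sum(scores[v] for v in g) / len(g))
```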
Fig. 7.4 Detected 3D face area Mf painted in skin color
7.3.1.2 Symmetry Plane Estimation

The assumption that human faces have symmetric properties in both 3D shape and surface texture allows us to reconstruct a more accurate 3D face surface than Mf and hence to generate a higher-resolution frontal face image than the captured images. The symmetry plane detection from Mf consists of two processes: (1) detect feature points Pe on Mf (Fig. 7.2 IV) and (2) apply RANSAC [2] to estimate the symmetry plane based on Pe (Fig. 7.2 V).

7.3.1.2.1 3D Feature Points Extraction

In order to find the symmetry plane that divides Mf into two symmetric parts, we first extract 3D feature points Pe on the local object surface specified by Mf. Note that to avoid possible artifacts introduced by the texture generation, we apply a stereo-based edge feature detection method to the multi-view images, as illustrated in Fig. 7.5. That is, we establish sparse but reliable 2D-to-2D correspondences to obtain 3D feature points by triangulation [8]. This algorithm is based on the wide-baseline stereo by Furukawa [3] and augmented by a bi-directional uniqueness examination to improve the accuracy and robustness of the matching.

Step IV-1: Project Mf back onto the multi-view images to localize the 2D face regions. Let c and c′ denote a pair of cameras whose images include well-captured 2D face regions. Rectify the images captured by c and c′ for stereo matching (Fig. 7.5(a)) and extract edge features from the 2D face regions in the rectified images.

Step IV-2: Eliminate edge features which do not cross the epipolar lines. Let IE and I′E denote the resultant edge feature images (Fig. 7.5(b)). Let e denote a point on an edge feature in IE, l′ the corresponding epipolar line in I′E, and E′ = {e′j | j = 1, . . . , n′} the points on the edge features in I′E intersecting with l′.

Step IV-3: Compute the texture similarity between e and each e′j ∈ E′ using the normal direction optimization [3] with the ZNCC photo-consistency evaluation (Sect. 4.3.1) (Fig. 7.5(c)). Let ê′j denote the point in E′ which gives the best
Fig. 7.5 Matching based on edge features. (a) Rectified images, (b) edge features crossing epipolar lines, (c) texture similarity computation with normal direction optimization, (d) an example of matched pair. In (d), the red rectangles illustrate the windows used to compute the texture similarity, the green lines the surface normals, and the blue circles the endpoints of the edge features. ©2009 IPSJ [9]
similarity. To enforce the uniqueness constraint, we accept the pair e and ê′j if and only if their similarity is significantly better than that of the second-best pair. Otherwise we reject the pair and leave e without a correspondence to avoid ambiguous matching.

Step IV-4: Validate the uniqueness of the correspondence in the opposite direction (ê′j → e ∈ IE). If there is another edge feature point in IE whose similarity with ê′j is comparable to that of e, reject the pair.

Step IV-5: By iterating Steps IV-2 to IV-4 for all e ∈ IE, we obtain the set of corresponding points between cameras c and c′. We denote this set Pc,c′ = {⟨p_i^c, p_i^{c′}⟩ | i = 1, . . . , nc,c′}, where ⟨p_i^c, p_i^{c′}⟩ denotes a corresponding point pair and nc,c′ the number of obtained correspondences.

Step IV-6: By collecting Pc,c′ computed from all possible pairs of cameras that can observe the face area Mf, we can compute a set of 3D feature points, Pe, from the set of matched 2D point pairs.
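The uniqueness tests of Steps IV-3 and IV-4 can be sketched over a precomputed similarity matrix (e.g. ZNCC scores between candidate edge points on corresponding epipolar lines); the margin threshold is an illustrative choice, not the book's value:

```python
import numpy as np

def unique_matches(sim, margin=0.1):
    # sim[e, e']: similarity between edge point e in I_E and e' in I'_E.
    # A pair is accepted only if each point is the other's best match by at
    # least `margin` over the runner-up, in both directions.
    matches = []
    rows, cols = sim.shape
    for e in range(rows):
        order = np.argsort(sim[e])[::-1]
        best = int(order[0])
        if cols > 1 and sim[e, best] - sim[e, order[1]] < margin:
            continue                  # ambiguous in the e -> e' direction
        col_order = np.argsort(sim[:, best])[::-1]
        if int(col_order[0]) != e:
            continue                  # e is not the mutual best for e'
        if rows > 1 and sim[e, best] - sim[col_order[1], best] < margin:
            continue                  # ambiguous in the e' -> e direction
        matches.append((e, best))
    return matches
```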
7.3.1.2.2 Symmetry Plane Estimation Using 3D Feature Points

With the reliable 3D feature point set Pe = {pi} (i = 1, …, N), we then estimate the symmetry plane π from Pe as follows (Fig. 7.2 V). The idea is to generate a candidate symmetry plane π and compare the texture pattern around each pi with that of its symmetric position with respect to π. If π is a valid symmetry plane, then the textures should be reasonably similar.
7 Visualization of 3D Video
Step V-1 Randomly pick two points pi, pj (i ≠ j) ∈ Pe, and repeat the following processing K ≤ N(N − 1)/2 times.

Step V-1-1 Compute the symmetry plane πij that places pi and pj in symmetric positions.

Step V-1-2 Based on the hypothesized symmetry plane πij, we can compute the symmetric position of each of the other N − 2 points. Let p̆k denote the symmetric position of pk (k ≠ i, j). Then we compare the textures at pk and p̆k. First we generate two L × L grids centered at pk and p̆k in the 3D space. Note that these two grids lie on planes perpendicular to the hypothesized symmetry plane, and the distance between neighboring grid points is d, a free parameter chosen according to the size of the 3D object. Since the 3D position of each grid point is computable, let pk^mn and p̆k^mn (0 ≤ m ≤ L, 0 ≤ n ≤ L) denote the grid points on the grids centered at pk and p̆k, respectively. Also, let Col(pk^mn) and Col(p̆k^mn) denote the color values of the grid points pk^mn and p̆k^mn, respectively, which are computed from the images of their best-observing cameras. Here we use Mf as the shape proxy for the state-based visibility evaluation (Sect. 4.3.2.1). Then the texture dissimilarity between pk and p̆k, d_pk, is computed by

d_pk = Σ_{0≤m≤L, 0≤n≤L} |Col(pk^mn) − Col(p̆k^mn)|.   (7.3)

Note that if either pk^mn or p̆k^mn is located outside the estimated face area, the point pair is considered an outlier and a fixed value diff is assigned to |Col(pk^mn) − Col(p̆k^mn)|. By computing d_pk for all pk (k ≠ i, j), we can evaluate the goodness of πij by

dij = Σ_{pk ∈ Pe∖{pi, pj}} d_pk.   (7.4)
Step V-2 Select the symmetry plane πij having the smallest dij as the symmetry plane π.

In experiments, we used L = 4 and d = 5 mm. The number of 3D feature points, N, was a few hundred, varying from frame to frame.
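Steps V-1 to V-2 amount to a RANSAC-style hypothesize-and-score loop. A minimal sketch, assuming a plane is represented by a unit normal and a point on it, and leaving the texture dissimilarity of Eq. (7.3) as a user-supplied callback:

```python
import numpy as np

def plane_from_pair(pi, pj):
    """Step V-1-1: the plane placing pi and pj in mirror positions has
    its normal along pi - pj and passes through their midpoint."""
    n = pi - pj
    n = n / np.linalg.norm(n)
    return n, 0.5 * (pi + pj)

def mirror(p, n, c):
    """Reflect p across the plane with unit normal n through point c."""
    return p - 2.0 * np.dot(p - c, n) * n

def plane_score(points, n, c, texture_dissim):
    """Eq. (7.4): sum the texture dissimilarities d_pk between each point
    and its mirrored position; texture_dissim is a placeholder callback
    standing in for Eq. (7.3)."""
    return sum(texture_dissim(p, mirror(p, n, c)) for p in points)
```

Step V-2 then keeps the candidate plane with the smallest score over all sampled pairs.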
7.3.1.3 3D Face Surface Reconstruction Using Symmetry Prior

The shape reconstruction algorithms in Sect. 4.4 estimate the 3D object surface geometry without any specific knowledge or object model, which results in the
limited reconstruction accuracy and the introduction of errors. By contrast, the 3D shape reconstruction algorithm in this section utilizes the knowledge of the symmetric properties of the human face to attain more accurate and higher-resolution 3D shape reconstruction (Fig. 7.2 VI). The algorithm is similar to the mesh-deformation algorithm in Sect. 4.4.1, but employs the symmetry constraint in deforming the mesh. This section first describes how we model the 3D face surface by a mesh model, and then introduces how we can utilize the symmetry prior as a constraint on the mesh deformation.

The processes described so far have generated the 3D face area Mf = {Vf, Ef} as a sub-area of the original 3D mesh surface M and estimated its symmetry plane π (Fig. 7.6(a)). With this symmetry plane, we first define the 3D face coordinate system as illustrated in Fig. 7.6(b): define the origin as the centroid of Vf and place the coordinate axes so that the symmetry plane π is aligned with the x = 0 plane; the X-axis is defined by the normal vector of π, and the Z-axis by the principal axis of the distribution of the points of Vf projected onto π. The Y-axis is computed by the cross-product of the other two axes.

Then we generate a new mesh Mc = {Vc, Ec} to model the higher-resolution 3D face surface: project Mf onto the y = 0 plane and define a bounded regular mesh Mc on the 2D projected region. The gray area in Fig. 7.6(c) illustrates Mc. That is, Vc and Ec denote the sets of grid points and edges in this projected region, respectively. Note that the sampling pitch of the regular grid can be designed to increase the spatial resolution. With this modeling, the 3D face surface reconstruction problem is transformed into that of finding the appropriate y value of each regular grid point in Vc (Fig. 7.6(d)).

Here the technical problems to be solved are the following. (1) How can we introduce the symmetry constraint into the mesh deformation? (2) How can we find the optimal y values for Vc?
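The face coordinate system construction described above can be sketched as follows. The sign convention Y = Z × X and the use of an SVD for the principal axis are assumptions made here for illustration; the book does not prescribe a specific implementation.

```python
import numpy as np

def face_coordinate_system(vertices, plane_normal):
    """Face frame of Fig. 7.6(b): origin at the centroid of V_f, X along
    the symmetry-plane normal, Z along the principal axis of the points
    projected onto the plane, and Y completing the frame."""
    V = np.asarray(vertices, float)
    origin = V.mean(axis=0)
    x = plane_normal / np.linalg.norm(plane_normal)
    P = V - origin
    P = P - np.outer(P @ x, x)                 # project points onto pi
    _, _, vt = np.linalg.svd(P, full_matrices=False)
    z = vt[0]                                  # principal direction on pi
    y = np.cross(z, x)                         # assumed sign convention
    return origin, x, y / np.linalg.norm(y), z
```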
First, we represent the symmetry prior by

y = f(x, z) = f(−x, z),   (7.5)

where the function f(x, z) returns the y value of the grid point at (x, z). Then, we introduce the following discrete representation of y values:

y = αi,   (7.6)

where i denotes an integer within a certain range, and α specifies the resolution of possible y values. This discrete modeling allows us to formalize the shape reconstruction problem as the multi-labeling problem introduced in Sect. 4.4.1. That is, we can formulate the shape reconstruction with the symmetry prior as the minimization of the following objective function:

E(Mc) = Σ_{v∈Vc, vx≥0} Ep(iv) + Σ_{(u,v)∈Ec, ux≥0, vx≥0} Ec(iu, iv),   (7.7)
Fig. 7.6 3D face shape reconstruction using symmetry prior. ©2011 IPSJ [7]
where vx and ux denote the x coordinate values of v and u ∈ Vc, respectively, and iv and iu the integer labels specifying the y values at v and u, respectively. Ep(iv) denotes the photo-consistency evaluation function at v and its symmetric position v̆. That is,

Ep(iv) = ρ(vx, αiv, vz) + ρ(v̆x, αiv̆, v̆z) = ρ(vx, αiv, vz) + ρ(−vx, αiv, vz),   (7.8)
where ρ(·) denotes the photo-consistency evaluation function based on the state-based visibility with M as the shape proxy (Sect. 4.3.2.1). Ec(iu, iv) evaluates the
smoothness in the y direction between a pair of connected grid points v and u:

Ec(iu, iv) = κ|αiu − αiv|,   (7.9)
where κ is a weighting factor that balances the photo-consistency and smoothness terms. This formalization forces the mesh deformation to satisfy the symmetry constraint defined by Eq. (7.5). We solve this minimization problem by belief propagation (Sect. 4.4.1), and obtain a 3D face surface satisfying both the photo-consistency and the symmetry constraint simultaneously.

In experiments, we used κ = 1.0, α = 1 mm, and a 2.5 mm grid resolution for Mc. Note that the original 3D mesh resolution, i.e. the average distance between adjacent vertices, was about 4.7 mm.
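A direct evaluation of the objective of Eqs. (7.7)–(7.9), given a candidate labeling, can be sketched as follows. The book minimizes this energy by belief propagation; here we only show how the energy itself is assembled, with `rho` a user-supplied photo-consistency cost and the data layout (dicts keyed by grid-point id) an assumption for illustration.

```python
def total_energy(labels, grid, edges, rho, alpha=1.0, kappa=1.0):
    """Energy of Eq. (7.7): per-vertex photo-consistency at a grid point
    and its mirror (Eq. 7.8), plus pairwise smoothness (Eq. 7.9).
    labels[v] is the integer label i_v; grid[v] gives the (x, z) position
    of grid point v; rho(x, y, z) is the photo-consistency cost."""
    e = 0.0
    for v, i in labels.items():
        x, z = grid[v]
        if x < 0:
            continue  # only x >= 0 points carry variables (symmetry)
        y = alpha * i
        e += rho(x, y, z) + rho(-x, y, z)          # E_p(i_v), Eq. (7.8)
    for u, v in edges:
        if grid[u][0] >= 0 and grid[v][0] >= 0:
            e += kappa * abs(alpha * labels[u] - alpha * labels[v])  # Eq. (7.9)
    return e
```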
7.3.2 Virtual Frontal Face Image Synthesis

With the optimized Mc, the virtual frontal view of Mc is generated for gaze estimation (Fig. 7.2 VII): (1) locate a virtual camera with focal length f at (0, Pcam, 0) and aim its view direction at (0, 0, 0) in the 3D face coordinate system, and then (2) generate the virtual frontal face image by rendering Mc from the virtual camera with the super-resolution technique proposed by Tung et al. [12]:

Step VII-1 Set a high-resolution pixel grid on the image plane of the virtual camera.

Step VII-2 Project each pixel of the original multi-view images, called source pixels, onto the pixel grid via Mc. That is, back-project the source pixels onto Mc first, and then project the points on Mc to the pixel grid of the virtual camera. In this process we choose the nearest grid point as the final projection point of each source pixel. In addition, we ignore source pixels whose projections are occluded by M.

Step VII-3 For each grid point with source pixel projections, compute its color by averaging the associated source pixel colors. Otherwise, interpolate the grid point color using the colors of its neighbors.

Figure 7.7 shows a synthesized virtual frontal face image, where the image resolution is increased by the super-resolution rendering process. In experiments, we used f = 430 mm, Pcam = 500 mm, and a virtual face image plane of 160 mm × 160 mm sampled with 400 × 400 pixels. Considering that the average of y values in a 3D face area is about 15 mm, the size of a virtual image pixel projected onto the 3D face surface is about 0.45 mm × 0.45 mm, which is much finer than 4.7 mm, the average distance between adjacent vertices in the original 3D mesh. Note that the resolution of the original multi-view images is higher than that of the original 3D mesh; it was estimated to be at most about 1.5 mm on the face area. That is, the super-resolution attained about three times higher resolution than the original images.
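Steps VII-2 and VII-3 are essentially a splat-and-average operation. A simplified sketch, assuming the source pixels have already been projected to continuous virtual-grid coordinates, and ignoring the occlusion test and the neighbor interpolation of empty cells:

```python
import numpy as np

def splat_to_grid(src_uv, src_colors, grid_shape):
    """Accumulate source-pixel colors at their nearest virtual-grid
    point and average (Steps VII-2/3 in spirit). Cells with no
    projection stay NaN, to be filled by a later interpolation pass."""
    acc = np.zeros(grid_shape + (3,))
    cnt = np.zeros(grid_shape)
    for (u, v), col in zip(src_uv, src_colors):
        i, j = int(round(u)), int(round(v))   # nearest grid point
        if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
            acc[i, j] += col
            cnt[i, j] += 1
    out = np.full(grid_shape + (3,), np.nan)
    mask = cnt > 0
    out[mask] = acc[mask] / cnt[mask][:, None]
    return out
```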
Fig. 7.7 Virtual frontal face image synthesis
7.3.3 Gaze Estimation Using a 3D Eyeball Model

At the last stage, the performer's gaze direction is estimated from the synthesized frontal face image based on a 3D eyeball model (Fig. 7.2 VIII). Figure 7.8 illustrates the structure of the model. The red arrow indicates the 3D gaze direction, and θ and ϕ denote the horizontal and vertical rotation angles of the eyeball, respectively. This model is designed based on the following three assumptions:

1. The eyeball is fixed inside the eye socket and can rotate horizontally and vertically around the eyeball center.
2. The gaze direction is defined by the 3D vector pointing from the eyeball center to the iris center.
3. The radius of the eyeball is equal to the diameter of the iris. This assumption is based on medical statistics.

To apply this model to the 3D gaze estimation, the eyeball model of the performer should first be estimated by the following off-line process:

Step VIII-1 Manually collect virtual frontal face images in which the eyes look straight forward.

Step VIII-2 For each image, detect the following eye feature points (Fig. 7.9) for each eye: the 2D eye corners, qa and qe, the 2D iris center, qc, and the intersection points between the iris border and the eye corner line connecting qa and qe, namely qb and qd. The eye corners are located by the AAM [1],
Fig. 7.8 3D eyeball model
Fig. 7.9 Eyeball center position estimation
and the iris is detected by applying the method of Kawaguchi et al. [6]. Note that all the feature points for the right eye illustrated in Fig. 7.9 are mirrored with respect to the symmetry plane to represent those for the left eye.

Step VIII-3 For each eye, let d denote the average 3D diameter of the iris, and consequently the eyeball radius. The 3D diameter of the iris is defined by the 3D distance between pb and pd on the face surface Mc, which are obtained by back-projecting qb and qd onto Mc, respectively.

Step VIII-4 For each eye, compute the average 2D relative position t of the iris center qc with respect to the eye corners qa and qe. That is, t denotes the weighting parameter that represents qc as the weighted average of qa and qe: qc = (1 − t)qa + tqe, where t = ‖qc − qa‖ / ‖qe − qa‖.

This process estimates the eye model parameters for the left and right eyes, respectively: dleft and tleft, and dright and tright. In what follows, we omit the suffix for simplicity.
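Steps VIII-3 and VIII-4 reduce to two small computations, sketched below; the averaging over the collected calibration images is omitted for brevity.

```python
import numpy as np

def eyeball_radius(pb, pd):
    """Step VIII-3: the 3D iris diameter, taken as the eyeball radius."""
    return float(np.linalg.norm(np.asarray(pb, float) - np.asarray(pd, float)))

def iris_relative_position(qa, qe, qc):
    """Step VIII-4: scalar t such that qc = (1 - t) qa + t qe."""
    qa, qe, qc = (np.asarray(q, float) for q in (qa, qe, qc))
    return float(np.linalg.norm(qc - qa) / np.linalg.norm(qe - qa))
```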
Fig. 7.10 Gaze estimation process. (a) Original face image generated based on the original 3D mesh M, (b) virtual super-resolution frontal face image generated based on the reconstructed face surface Mc , (d) detected irises (circles), (e) estimated eye corners (green dots) and gaze directions (red lines)
With the eyeball model parameters d and t, we compute the 3D gaze directions of the left and right eyes from each 3D video frame by the following process (Figs. 7.9 and 7.10). Note that all 3D points as well as the virtual frontal face image in the 3D gaze estimation below are represented in the face coordinate system defined in Sect. 7.3.1.3 and Fig. 7.6(b), which is dynamically defined depending on the 3D face position and direction in each 3D video frame.

Step VIII-5 Apply the following process to the left and right eyes, respectively.

Step VIII-6 Detect the 2D eye corners qa and qe, and the iris center qc from the synthesized frontal face image.

Step VIII-7 Compute the 3D iris center position pc by back-projecting qc onto Mc.

Step VIII-8 Compute the 3D eyeball center po by

po = pc̃ + (0, −d, 0)ᵀ.   (7.10)

Here pc̃ denotes the back-projection of qc̃ = (1 − t)qa + tqe onto Mc, where qc̃ represents the 2D position of the assumed iris center if the eye were looking straight forward.

Step VIII-9 Finally, the 3D gaze direction is given as the line passing through po and pc.
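Steps VIII-6 to VIII-9 combine into the following sketch; `back_project` is a placeholder callback for the projection of a 2D point of the virtual frontal image onto the face surface Mc, which the book performs via the reconstructed mesh.

```python
import numpy as np

def gaze_direction(qa, qe, qc, t, d, back_project):
    """Gaze ray from the eyeball centre p_o (Eq. 7.10) through the 3D
    iris centre p_c, in the face coordinate system."""
    qa, qe, qc = (np.asarray(q, float) for q in (qa, qe, qc))
    p_c = np.asarray(back_project(qc), float)        # Step VIII-7
    q_ct = (1.0 - t) * qa + t * qe                   # straight-ahead iris
    p_o = np.asarray(back_project(q_ct), float) + np.array([0.0, -d, 0.0])  # Eq. (7.10)
    g = p_c - p_o
    return p_o, g / np.linalg.norm(g)                # Step VIII-9
```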
7.3.4 Performance Evaluation

7.3.4.1 Shape Reconstruction Using the Symmetry Prior

We first analyze how the accuracy of the reconstructed 3D face shape is improved by introducing the symmetry prior. Figure 7.11 shows the input multi-view images. They were captured by 16 UXGA cameras running at 25 Hz with a 1 msec shutter speed. The detailed setting of the capture system is described as Studio C in Table 2.3.

We measure the contribution of the symmetry prior to the reconstruction accuracy by means of leave-one-out experiments. We keep one camera cf for evaluation, and use the other 15 cameras to render the face image viewed from camera
Fig. 7.11 Input multi-view images
cf by the rendering algorithm described in Sect. 7.3.2. Note that the size and resolution of the rendered image are adjusted to coincide with those of the image captured by cf. Let Îf denote the rendered image. Then we compute the mean squared error between the rendered image Îf and the originally captured image If:

MSE = (1/N) Σ_{(x,y)∈Îf} (Îf(x, y) − If(x, y))²,   (7.11)

where N is the total number of effective pixels in Îf.

Figure 7.12 shows the mean squared errors over 100 continuously captured frames. We can observe that the symmetry prior improves the fidelity of the rendering and hence the reconstruction accuracy. Recall that the synthesized virtual frontal face image has a higher resolution than the originally captured image, which was not evaluated in this experiment.
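The leave-one-out error of Eq. (7.11) can be sketched as follows, with `mask` selecting the effective pixels where the rendering is defined:

```python
import numpy as np

def masked_mse(rendered, captured, mask):
    """Eq. (7.11): mean squared error over the N effective pixels."""
    diff = (rendered - captured)[mask]
    return float((diff ** 2).mean())
```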
7.3.4.2 Gaze Estimation

In order to quantitatively evaluate the accuracy of the gaze estimation, we prepared a pair of multi-view video datasets. Figure 7.13 illustrates the experimental environment. A human subject stands about 2.5 m away from the wall and looks at
Fig. 7.12 Mean-squared-errors between the synthesized and the original images
(1) horizontally aligned markers one by one, and (2) vertically aligned markers one by one (Fig. 7.13 left and right, respectively). For each marker, its 3D position pm in the object-oriented coordinate system is measured manually.

We first selected the video frames in which the subject was stably looking at each marker. Then, for each selected video frame, we apply the above-mentioned gaze estimation processes from Step I to Step VIII to obtain the estimated 3D gaze vector neye. Note that the gaze estimation is conducted for the left and right eyes independently; that is, we obtain neye for each eye. The ground-truth 3D gaze direction vector ntrue is defined as the 3D vector from the eyeball center po computed by Eq. (7.10) to each 3D marker position. Then, the angular error of the 3D gaze estimation in each selected video
Fig. 7.13 Gaze estimation error evaluation. ©2011 IPSJ [7]
Fig. 7.14 Gaze estimation errors. The upward and downward triangles at the bottom in each figure denote the signs (i.e. positive or negative) of the errors by the method with both the symmetry prior and the super-resolution image rendering technique
frame is computed for each eye by

θ = arccos( (neye · ntrue) / (|neye| |ntrue|) ).   (7.12)
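Equation (7.12) in code, with a clamp added as a guard against floating-point round-off outside [−1, 1]:

```python
import numpy as np

def angular_error_deg(n_eye, n_true):
    """Eq. (7.12): angle between estimated and ground-truth gaze
    vectors, returned in degrees."""
    n_eye = np.asarray(n_eye, float)
    n_true = np.asarray(n_true, float)
    c = np.dot(n_eye, n_true) / (np.linalg.norm(n_eye) * np.linalg.norm(n_true))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
```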
The angular gaze estimation errors are evaluated for the left and right eyes as well as for the horizontal and vertical directions, respectively, which gives four different error evaluation results as shown in Figs. 7.14(a), 7.14(b), 7.14(c), and 7.14(d). In each figure, three computational methods are compared: without the symmetry prior, with the symmetry prior alone, and with both the symmetry prior and the super-resolution image rendering technique. The horizontal axis in each figure denotes the selected frame IDs where the subject was stably looking at each marker. The upward and downward triangles at the bottom in each figure denote the signs (i.e. positive or negative) of the errors by the method with both the symmetry prior and the super-resolution image rendering technique. Table 7.1 shows the average errors for the first and the third methods. Table 7.2 compares the numbers of iris
Table 7.1 Average gaze estimation errors

           Left, horizontal   Right, horizontal   Left, vertical   Right, vertical
Original   0.3986             0.4142              0.5159           0.4825
Proposed   0.3247             0.3245              0.4633           0.4317
Table 7.2 Gaze estimation failures in 100 frames

           Gaze estimation failure   Shape reconstruction failure   Iris detection failure
Original   5                         0                              5
Proposed   1                         0                              1
detection failures in 100 continuously captured frames. In all results, the symmetry prior improved the stability of the iris detection and the accuracy of the gaze direction estimation, while the improvement by the super-resolution is limited. This is because the iris localization is not very accurate. As is well known, errors in the horizontal direction are much smaller than those in the vertical direction because of the shape and movable range of human eyes. These results demonstrate the effectiveness of the presented method, although the accuracy of the gaze estimation is still limited.
7.3.5 Subjective Visualization

With a sequence of 3D gaze direction data, we can render a 3D video stream from the performer's viewpoint by locating a virtual camera at his/her eye position in each video frame: the projection center of the camera is placed at the middle point between the left and right eyes on the 3D mesh M, and its optical axis is defined by the average of the pair of gaze direction vectors computed for the left and right eyes.

Figure 7.15 shows some examples of the objective and subjective visualizations. In Figs. 7.15(a)–(c), MAIKO looks at her left hand while making a well-designed beautiful pose during her dance. In Figs. 7.15(d)–(f), a person performs juggling without looking at any bottle. These subjectively visualized images help us to better understand the intention and/or attention of a performer based on where the performer is looking while engaged in designed and/or trained actions.
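The per-frame virtual camera placement described above can be sketched as:

```python
import numpy as np

def subjective_camera(p_left, p_right, g_left, g_right):
    """Virtual camera for one frame: projection centre at the midpoint
    of the two eye positions, optical axis as the normalized mean of
    the left and right gaze directions."""
    centre = 0.5 * (np.asarray(p_left, float) + np.asarray(p_right, float))
    axis = np.asarray(g_left, float) + np.asarray(g_right, float)
    return centre, axis / np.linalg.norm(axis)
```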
7.4 Conclusion

This chapter introduced two visualization techniques for a 3D video stream: objective and subjective visualization. The former provides versatile browsing functions for 3D video streams, while its content editing capabilities are
Fig. 7.15 Examples of subjective visualization. (a) Input multi-view images, (b) objective visualization, (c) subjective visualization, (d) input multi-view images, (e) objective visualization, (f) subjective visualization. The red arrows in (b) and (e) and the red crosses in (c) and (f) denote the estimated gaze directions
limited. The next two chapters will present technologies for 3D video content analysis and editing.

The subjective visualization is a novel visualization method for a 3D video stream of a person in action. It makes full use of 3D video data to realize the performer's view rendering; unlike image-based rendering techniques for free-viewpoint visualization, it explicitly estimates the 3D gaze direction of a person in action from 3D
video data, and therefore can generate his/her subjective views of the 3D video data. We may call it semantic content-based visualization.

The algorithm for the 3D gaze estimation consists of the 3D face area detection, the symmetry plane estimation, the accurate face surface reconstruction with the symmetry prior, the super-resolution frontal face image generation, and the 3D gaze estimation based on the eyeball model. The algorithm worked stably to generate higher-resolution frontal face images, and the accuracy of the last process, estimating the iris position and gaze direction, was also improved, while the absolute estimation accuracy was still limited.

For further studies, we should improve the gaze estimation algorithm by exploiting temporal information. It should also be noted that view-dependent 3D shape optimization, such as the 3D face surface reconstruction presented here, will achieve better rendering, as noted in Sect. 4.1.
References

1. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 681–685 (2001)
2. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 381–395 (1981)
3. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
4. Hansen, D.W., Ji, Q.: In the eye of the beholder: a survey of models for eyes and gaze. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 478–500 (2010)
5. Just, M.A., Carpenter, P.A.: Eye fixations and cognitive processes. Cogn. Psychol. 8(4), 441–480 (1976)
6. Kawaguchi, T., Rizon, M., Hidaka, D.: Detection of eyes from human faces by Hough transform and separability filter. Electron. Commun. Jpn. 88(5), 29–39 (2005)
7. Kuroda, M., Nobuhara, S., Matsuyama, T.: 3D face geometry and gaze estimation from multi-view images using symmetry prior. In: Proc. of MIRU (2011) (in Japanese)
8. Nobuhara, S., Kimura, Y., Matsuyama, T.: Object-oriented color calibration of multi-viewpoint cameras in sparse and convergent arrangement. IPSJ Trans. Comput. Vis. Appl. 2, 132–144 (2010)
9. Nobuhara, S., Tsuda, Y., Ohama, I., Matsuyama, T.: Multi-viewpoint silhouette extraction with 3D context-aware error detection, correction, and shadow suppression. IPSJ Trans. Comput. Vis. Appl. 1, 242–259 (2009)
10. Sugimoto, A., Matsuyama, T.: Active wearable vision sensor: detecting person's blink points and estimating human motion trajectory. In: Proc. of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM2003), vol. 1, pp. 539–545 (2003)
11. Tobii Technology: X120 eye tracker
12. Tung, T., Nobuhara, S., Matsuyama, T.: Simultaneous super-resolution and 3D video using graph-cuts. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
13. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57, 137–154 (2004)
Chapter 8
Behavior Unit Model for Content-Based Representation and Edition of 3D Video
8.1 Introduction

The design of data structures is one of the most crucial problems when developing visual information processing systems. For example, while a 2D array is the most popular data structure to represent a 2D image, multi-resolution pyramid and quadtree representations [27] are sometimes employed to realize effective and efficient image data processing. Since the functionalities of visual information systems vary greatly with respect to the applications, a well-designed data structure and its processing algorithms should be developed to comply with the required functionality of each application.

The original 3D video data consist of a stream of textured 3D mesh data representing an object in motion. While our 3D video capture system can take multi-view videos of multiple objects in action, we assume in this chapter that 3D video data represent a single object action, since the 3D shape reconstruction process described in Chap. 4 can segment out each object one by one and generate a 3D video for each of them. The 3D video visualization and editing techniques presented in the previous chapter process textured 3D mesh data while preserving its content as well as the data structure; uniform geometric, temporal, and color transformations that preserve the original stream of textured 3D mesh data have only a limited visual impact when action and timing are not modified. In this chapter and the following ones, on the other hand, we will present data structures and algorithms to analyze, edit, and encode 3D video content: this chapter presents the behavior unit model representing a set of atomic actions contained in 3D video data, Chap. 9 presents the kinematic structure model representing the motion of an object skeleton, and Chap. 10 presents the geometry video model for encoding a textured 3D mesh data stream into a sequence of 2D array data.
Here in this chapter, we present a novel data representation for 3D video, named the behavior unit model, for content-based representation and editing of 3D video. Intuitively speaking, a behavior unit is defined as a partial interval of a 3D video data stream in which an object performs a simple action such as stand up, sit down, hands up, rotate, and so on, while preserving its overall topological structure.

T. Matsuyama et al., 3D Video and Its Applications, DOI 10.1007/978-1-4471-4120-4_8, © Springer-Verlag London 2012
In each behavior unit interval, the original textured 3D mesh data are encoded into a compact description representing a simple action. Once a 3D video data stream is partitioned into a set of behavior units, we can realize content-based processing of 3D video data using the behavior units as atomic data entities: editing, summarization, and semantic description of given 3D video data.

Since an object behavior captured in a 3D video data stream is usually very smooth and complicated, it is not easy to find behavior units and partition the stream with respect to their combination. To solve this problem, we introduce the topology dictionary [32], a new technique that achieves the behavior unit-based representation of 3D video. The topology dictionary is proposed as an abstraction for data streams of geometrical objects. Data that evolve in time cannot be compactly represented using geometry only, as the redundancy of information over time is not exploited. In particular, it is challenging to manipulate a data stream when it becomes very complicated: finding specific or relevant information is almost impossible. Moreover, when each data stream element is produced independently (as in 3D video), geometrical representations quickly show their limitations, as the data structures are inconsistent with each other, i.e. complex matching processes are required to find geometric relations between consecutive data stream elements, and noise such as reconstruction artifacts has to be handled.

However, we can observe that the topology of the data structure can remain very stable despite geometrical noise or short-term physical object motions. We therefore propose to use topology as a key property to characterize geometric data streams. We first define a topology-based descriptor for each 3D video frame and then extract a feature vector from each topology-based descriptor.
After applying these two stages of data abstraction, a 3D video data stream is represented by a sequence of feature vectors characterizing temporal variations of the topological structure of an object in motion. Clustering the feature vectors then allows the identification of typical topological structures of the object. Each cluster represents a behavior unit and is indexed in order to achieve a compact description of the 3D video data stream. Thus, the data stream can be modeled by a sequence of cluster indices. For example, long poses and repeated actions in a video of a human in motion can be efficiently identified and compactly encoded using this scheme. In addition to this index structure, the overall dynamic structure of the data stream is represented by a motion graph, which models the transitions between the different topology clusters.

In summary, the topology dictionary combines: (1) a dictionary of indices referring to patterns (i.e. clusters) extracted from data streams for encoding, and (2) a probabilistic graph that models the transitions between the patterns for content-based data manipulation. The topology dictionary can be used to represent any arbitrary data type, e.g. 2D or 3D, as long as a content-based (topology) descriptor can be defined to properly extract patterns in the data stream. Note that the kinematic structure-based skeleton
representation described in the next chapter requires a priori knowledge of the kinematic model of an object. Hence it can only be applied to a specific type of object, whereas the topology dictionary model can be applied to any type of object, even ones that have no kinematic structure, such as amoebas.

In the context of 3D video, enhanced Reeb graphs are employed as general 3D mesh topology descriptors. Reeb graphs have been efficiently applied to shape matching and retrieval in large databases of static 3D objects [35]. As a canonical representation of the topology of the surface, they are suitable for dynamic 3D mesh model encoding [36]. They allow the dictionary to automatically model complex sequences that other strategies, such as employing a skeleton model to represent human shape, cannot process correctly; as will be discussed in Chap. 9, it is not possible to match a human skeleton model to 3D video data of MAIKO dances.

The topology dictionary is then used for applications such as encoding, editing, and semantic description. Figure 8.1 presents an overview of topology dictionary-based 3D video data processing. Note that the topology dictionary stands for a general computational scheme for coding a dynamic sequence of geometric object data, while the behavior unit model refers to a practical implementation of the topology dictionary for 3D video data coding. Note also that in this chapter we do not explicitly process any texture map associated with each 3D mesh. It is assumed that the texture is encoded as a simple per-vertex attribute by vertex coloring, i.e. vertex-based texture generation methods such as those described in Chap. 5 are assumed to be used for visualization.

The rest of the chapter is organized as follows. Section 8.2 presents the definition of the topology dictionary, which features a dictionary-based encoding strategy and a Markov motion graph. Section 8.3 describes a practical implementation of the topological structure description using Reeb graphs, followed by the behavior unit model construction process in Sect. 8.4. Section 8.5 presents applications of the behavior unit model: encoding, editing, and semantic description of 3D video. Section 8.6 shows performance evaluations of the proposed applications. Section 8.7 concludes the chapter with discussions.
8.2 Topology Dictionary

The topology dictionary is a new technique to perform content-based encoding, editing, and semantic description of data streams of geometrical objects. It provides an abstraction of the data stream by combining two strategies: (1) dictionary-based encoding and (2) probabilistic motion graph modeling.

Using a dictionary (or codebook), a data stream can be encoded by simply searching for matches between a set of patterns contained in the dictionary and the data to be encoded. As the encoder finds a match in the data stream, it substitutes for the data a reference index corresponding to the pattern's position in the dictionary. Redundancies in the data stream can therefore be efficiently identified and processed (cf. vector quantization [7, 41]). Dictionaries can be either generated from training datasets or patterns
Fig. 8.1 Overview of the topology dictionary construction and application for 3D video data
extracted from data streams. We recall that, for 3D video, a pattern is a representation of a behavior unit as previously defined. It denotes a cluster of feature vectors derived from Reeb graphs, which are topology-based shape descriptors extracted from a temporal sequence of 3D mesh data.
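The dictionary-based encoding described above can be sketched as follows; `sim` is a placeholder for a similarity measure such as the SIM of Sect. 8.2.1, and `tau` an assumed acceptance threshold.

```python
def encode_stream(features, codebook, sim, tau):
    """Replace each frame's feature vector by the index of the most
    similar codebook pattern; -1 marks frames with no pattern whose
    dissimilarity 1 - sim falls below tau."""
    indices = []
    for f in features:
        best = max(range(len(codebook)), key=lambda i: sim(f, codebook[i]))
        indices.append(best if 1.0 - sim(f, codebook[best]) < tau else -1)
    return indices
```

Long poses and repeated actions then show up as runs and recurrences of the same index.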
The learning of dictionaries of feature vector clusters (a.k.a. bags of words) has been successfully applied to image categorization, segmentation and localization [6, 29, 37]. The topology dictionary extends these applications to 3D objects in motion by indexing and learning patterns in 3D video data streams. In our implementation, pattern extraction is realized by unsupervised clustering of feature vectors computed from 3D topology descriptors. In addition, the topology dictionary features a Markov motion graph structure. Nodes are states representing identified patterns, while edges represent transitions between these states. The Markov graph structure allows content-based manipulation of data streams (e.g. edition, summarization). Interesting applications of such graphs to 2D video segmentation and summarization have already been shown in [21, 38]. In particular, statistical information about the data stream content, such as the duration and occurrence probability of patterns, can be derived. This section describes the general scheme to create a topology dictionary. It consists of clustering the dataset of feature vectors and building a motion graph. Practical methods to compute feature vectors from 3D video data are given in Sects. 8.3 and 8.4.
8.2.1 Dataset Clustering

Given a set of feature vectors, the dataset clustering step allows the identification of patterns. A similarity measure (defined below) is used to find similar feature vectors in order to form clusters. Let us assume a 3D video stream composed of a set of 3D mesh models M = {m_1, . . . , m_T }, where m_t is contained in the t-th video frame. A feature vector is extracted for every model based on a topology-based shape descriptor (cf. Sect. 8.3). As a feature vector is an abstraction of a mesh, we will refer to the mesh m_t or its feature vector equally. In order to cluster M, the dataset is recursively split into subsets M_t and N_t as follows:

M_t = { n ∈ N_{t−1} : 1 − SIM_k(m_t, n) < τ },   (8.1)
N_t = N_{t−1} \ M_t,   (8.2)
where M_0 = ∅ and N_0 = M. M_t is a subset of M representing a cluster containing m_t and similar elements. Similarities between elements of M are evaluated using a similarity function SIM_k : M × M → [0, 1] and a threshold τ ∈ R (cf. below and Sect. 8.4.2 for details). The clustering step is a straightforward procedure: at each iteration step, from t = 1 to t = T, the closest matches to m_t are retrieved and indexed with the same cluster reference as m_t. Let us denote by C = {c_1, . . . , c_N } the N clusters created during the process by Eq. (8.1) (where the clusters are given by {M_t ≠ ∅}_t and N ≤ T ). Any visited element m_t already assigned to a cluster in C during a previous iteration step is considered as already classified and will not be processed subsequently. If N_t = ∅ or t = T, the recursive process terminates. As a result, the 3D video sequence M is
clustered into the set of topology classes C, where each class represents a behavior unit (cf. Sect. 8.4 for discussion). The next step consists of estimating the topology class probabilities {P(c_1), . . . , P(c_N)} (cf. Sect. 8.2.2). SIM_k is a similarity measure that computes a motion-to-motion matching score using a temporal window of frames, defined in the spirit of [28]:

SIM_k(m_i, m_j) = (1 / (2k + 1)) Σ_{t=−k}^{k} SIM(m_{i+t}, m_{j+t}).   (8.3)
The size of the window is chosen to be one third of a second in length (k = 3), as in [19]. Equation (8.3) integrates consecutive frames in a fixed time window, thus allowing the detection of individual poses while taking into account smooth transitions. As defined, the formulation accounts not only for differences in body posture but also for differences in motion speed. In practice, motion-to-motion matchings with SIM_k can be evaluated by first computing a frame-to-frame distance (or dissimilarity) matrix {1 − SIM(m_i, m_j)}_{ij}, and then convolving the matrix along its diagonals using a window of size 2k + 1 (as in [9, 10]). The essential process in the clustering lies in the computation of the similarity function SIM, which computes a similarity value between a pair of feature vectors. Since feature vectors are derived from the topological structures of an object in motion, similarity values should be computed based on those topological structures. That is, SIM(m_{i+t}, m_{j+t}) in Eq. (8.3) first evaluates the topological similarity between m_{i+t} and m_{j+t} and then computes similarity values between their attributes based on the correspondences between their topological structures. The practical algorithm to compute SIM(m_{i+t}, m_{j+t}) for a 3D video stream is defined in Sect. 8.4.2.

Clustering Evaluation   The clustering effectiveness is evaluated by the number of clusters found and should allow the identification of possible redundant patterns. The threshold τ is set according to the values of the similarity function SIM_k. The descriptor presented in this chapter (cf. Sect. 8.4) returns values in the range [0, 1], and τ was defined experimentally. An optimal setting of τ should return a set of clusters similar to what a (hand-made) ground-truth classification would produce. As shown in Fig. 8.2 and Fig. 8.3, τ = 0.08 returns qualitatively good clustering on humanoid datasets. Additional experiments are presented in Sect. 8.6.
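As a concrete illustration, the clustering loop of Eqs. (8.1)–(8.3) can be sketched as below. This is a minimal sketch, not the book's implementation: the frame-to-frame similarity matrix `sim` (in the book, computed with the Reeb-graph-based SIM of Sect. 8.4.2) is assumed given, and the names `simk_matrix` and `greedy_cluster` are ours.

```python
import numpy as np

def simk_matrix(sim, k=3):
    """Motion-to-motion similarity (Eq. 8.3): average the frame-to-frame
    similarity matrix along its diagonals over a window of 2k+1 frames
    (the window is truncated at the sequence boundaries)."""
    T = sim.shape[0]
    simk = np.zeros_like(sim)
    for i in range(T):
        for j in range(T):
            lo = max(-i, -j, -k)
            hi = min(T - 1 - i, T - 1 - j, k)
            simk[i, j] = sum(sim[i + t, j + t]
                             for t in range(lo, hi + 1)) / (2 * k + 1)
    return simk

def greedy_cluster(sim, k=3, tau=0.08):
    """Greedy clustering of Eqs. (8.1)-(8.2): each not-yet-classified frame t
    seeds a cluster gathering every unclassified frame n with
    1 - SIM_k(m_t, m_n) < tau."""
    simk = simk_matrix(sim, k)
    T = sim.shape[0]
    labels = [-1] * T
    n_clusters = 0
    for t in range(T):
        if labels[t] != -1:          # already classified: skip (cf. text)
            continue
        for n in range(T):
            if labels[n] == -1 and 1.0 - simk[t, n] < tau:
                labels[n] = n_clusters
        n_clusters += 1
    return labels, n_clusters
```

On a toy block-structured similarity matrix this recovers one cluster per block, mirroring the block pattern visible in the distance matrix of Fig. 8.2.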
8.2.2 Markov Motion Graph

The structure of a data stream (e.g. a video sequence) of an object in motion can be represented by a Markov motion graph that models the data evolution through successive states, such as scene changes [21]. In particular, when dealing with sequences of human actions, one can observe several repeated poses or actions,
Fig. 8.2 Distance matrix. The matrix contains shape dissimilarity computation between 1000 frames of a 3D video sequence of a yoga performance. Strong and weak similarities are represented by dark and light colors, respectively. The blocks allow us to identify behavior units with similar topological structures. ©2012 IEEE [34]
namely behavior units, that can be efficiently encoded and exploited. Many researchers have indeed used statistical models of human motion to synthesize new animation sequences. For example, in [1, 13, 16, 18] motion segments are identified using a motion database. A similar approach is introduced in the topology dictionary model in order to perform content-based manipulation of the data stream. Let us denote by C = {c_1, . . . , c_N } the set of N clusters obtained by clustering the T frames of the sequence S = {s_1, . . . , s_T }, where T = Σ_{i=1}^{N} N_i and N_i = card(c_i)
Fig. 8.3 Clustering evaluation. The topology dictionary allows the partition of a data stream into behavior units (atomic actions). Top color bars: number of clusters in a data stream captured at 25 fps with respect to the threshold τ (e.g. τ = 0.08 returns 253 clusters from a partition containing 434 temporal intervals). Each color stands for a cluster. Top histogram: number of frames belonging to each of the 434 intervals obtained for τ = 0.08. The histogram shows one long pose (>1 s), eight short actions (40 ms to 1 s), and 425 transition states (<40 ms). Middle histogram: number of frames belonging to each cluster for τ = 0.08. High numbers of frames stand for long or repeated actions. Lower numbers are likely to represent short actions or transition states. Bottom histogram: number of interval occurrences for each cluster for τ = 0.08. The statistics indicate that the sequence contains 80 repeated occurrences of atomic actions. Long and repeated actions can be compactly encoded using the topology dictionary (cf. Sect. 8.4.4). ©2012 IEEE [34]
is the size of the i-th cluster c_i. We recall that the clustering was performed on (topology-based) feature vectors extracted from the frames (cf. Sect. 8.2.1). Let us assume G = (C, E) is a weighted directed graph, where C are the vertices and E = {e_ij}_{i,j ∈ [1,N]} are the edges. We consider a probabilistic model where C represents states and E represents transitions between the states. The weights are defined as follows:
• The node weight P(c_i) = N_i / T is the occurrence probability of a cluster c_i. If P(c_i) ≫ 0, then c_i is a state representing a long or repeated action preserving the same topological structure.
Fig. 8.4 Probabilistic graph structure. A data stream structure can be modeled as a Markov motion graph which represents the transitions between extracted clusters (or states). It allows us to perform content-based analysis of the data stream. For example, cycles in the motion graph reveal the existence of repeated actions taking multiple different topological structures over time, the occurrence probabilities of states are given by node weights, and the transition probabilities between states are given by edge weights. ©2012 IEEE [34]
• The edge weight w_ij corresponds to the edge e_ij and models the transition probability between the two states c_i and c_j. w_ij is defined as the conditional probability:

P(c_j | c_i) = ( Σ_{(s_p, s_q) ∈ c_i × c_j} δ(q − p) ) / ( Σ_{(s_p, s_q) ∈ c_i × C\c_i} δ(q − p) ),   (8.4)

where

δ(x) = 1 if x = 1, and 0 otherwise,

and s_p is the p-th frame of S. The probability is normalized over N_i^+ = {c_j ∈ C \ c_i : ∃(s_p, s_q) ∈ c_i × c_j, q − p = 1}, so that Σ_{c_j ∈ N_i^+} P(c_j | c_i) = 1.
Let p_ik denote a path in the motion graph G that links c_i to c_k. The path is defined as a set of successive nodes linked two by two by a single edge as follows: p_ik = {c_{i_0}, e_{i_0 i_1}, c_{i_1}, e_{i_1 i_2}, . . . , c_{i_K}}, where i_0 = i and i_K = k. Then, the probability of p_ik in G under the Markov assumption can be evaluated using the following cost function:

E(p_ik) = Π_{i ∈ {i_0, . . . , i_{K−1}}} P(c_{i+1} | c_i) P(c_i),   (8.5)

where the most probable path p_max amongst all possible paths {p_ik} that link c_i to c_k verifies:

p_max = arg max_{p_ik} E(p_ik).   (8.6)
The evolution of a data stream can be monitored using a graph representation, as shown in Fig. 8.4. In the case of a sequence of animated 3D models, the motion graph representation allows users (e.g. CG artists or animators) to design new sequences by navigating through probable paths as defined by Eq. (8.6) and concatenating the behavior units corresponding to the clusters (states) belonging to the paths. For example, the 1000 frames of the Yoga sequence shown in Fig. 8.3 were partitioned into
434 temporal intervals (at τ = 0.08) which belong to 253 clusters. Statistics on the clusters, such as the number of frames contained in each interval and each cluster, and the number of occurrences of clusters, return one long pose (>1 s), eight short actions (40 ms to 1 s), and 80 repeated occurrences of atomic actions. The sequence can then be edited by shortening long poses and skipping atomic actions between redundant poses. The encoding strategy using the topology dictionary is presented in Sect. 8.4.4.
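The graph weights of Eq. (8.4) and the path cost of Eqs. (8.5)–(8.6) can be sketched from a per-frame sequence of cluster labels. This is a hedged sketch: `motion_graph` and `best_path` are hypothetical names, and `best_path` simply scores a given list of candidate paths rather than enumerating all paths of the graph.

```python
from collections import Counter, defaultdict

def motion_graph(labels):
    """Markov motion graph weights from a per-frame cluster label sequence.
    Node weight: P(c_i) = N_i / T. Edge weight (Eq. 8.4): fraction of the
    transitions leaving c_i that enter c_j."""
    T = len(labels)
    p_node = {c: n / T for c, n in Counter(labels).items()}
    out = defaultdict(Counter)            # c_i -> counts of successors c_j
    for a, b in zip(labels, labels[1:]):
        if a != b:                        # only transitions leaving c_i
            out[a][b] += 1
    p_edge = {(i, j): n / sum(cnt.values())
              for i, cnt in out.items() for j, n in cnt.items()}
    return p_node, p_edge

def best_path(p_node, p_edge, candidate_paths):
    """Most probable candidate path (Eqs. 8.5-8.6): maximize the product of
    P(c_next | c_i) * P(c_i) along the path."""
    def cost(path):
        e = 1.0
        for i, j in zip(path, path[1:]):
            e *= p_edge.get((i, j), 0.0) * p_node[i]
        return e
    return max(candidate_paths, key=cost)
```

With the weights in hand, a user navigating the graph as described above would prefer the candidate path with the highest cost value.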
8.3 Topology Description Using Reeb Graph

The clustering of 3D video data is a challenging problem. Although the inter-frame mesh deformation method presented in Sect. 4.4.2 generates sequences of 3D shape data sharing the same mesh structure, most 3D shape reconstruction methods generate 3D shapes one frame at a time, independently of each other. Hence we can assume that the data stream to be processed here is composed of 3D surface meshes having inconsistent mesh connectivity which, moreover, could be noisy. As discussed in Sect. 8.1, topology is a stable feature that can be used to represent and cluster geometrical data, as it overcomes issues inherent to geometrical representations (e.g. surface mesh connectivity, reconstruction artifacts, etc.). This section presents a practical implementation of topology description using the augmented Multi-Resolution Reeb Graph (aMRG) [35], which is an enriched multi-resolution Reeb graph [8]. The Reeb graph is an elegant solution to analyze 3D mesh topology and shape, as it gives a graphical representation of surface properties and can be compactly encoded as a feature vector. The reasons why aMRG was employed to cluster 3D video sequences are threefold:

Generality: The Reeb graph extraction is fully automatic and does not require any prior knowledge of the shape, position, or topology of the captured objects. It allows the system to model sequences so complex that even the fitting of a 3D skeleton of a priori known topology would fail (e.g. subjects wearing loose clothing like MAIKO dancers).

Effective 3D shape characterization: aMRGs are high-level 3D shape descriptors which have proven to be efficient for shape matching and classification tasks.

Efficient shape matching: Its multi-resolution property with a hierarchical node matching strategy makes the search in large databases tractable by by-passing the NP-complete complexity of the graph matching problem.

These features are detailed in this section.
8.3.1 Characterization of Surface Topology with Integrated Geodesic Distances

We assume that 3D models are defined as compact 2-manifold surfaces approximated by 3D meshes. Let S be a surface mesh. According to Morse theory [20],
a continuous function μ : S → R defined on S characterizes the topology of the surface at its critical points. The surface connectivity between critical points can then be modeled by the Reeb graph of μ, which is the quotient space defined by the equivalence relation ∼ as defined in [26]. Let us assume the points x ∈ S and y ∈ S; then x ∼ y if and only if:

μ(x) = μ(y) and y belongs to the same connected component of μ^{−1}(μ(x)).   (8.7)

The Morse function μ is defined as in [8]:

μ(v) = ∫_{p ∈ S} g(v, p) dS,   (8.8)
where g(v, p) is the geodesic distance on S between two points v and p belonging to S. That is, the Reeb graph of μ on S describes the connectivity of the level sets of μ. Note that the inverse function μ^{−1} is defined on R and returns regions of S corresponding to level sets of μ at given isovalues. As defined, μ has several useful properties. It is invariant to translation and rotation, its integral formulation over the whole object surface provides good stability to local noise on the mesh, and it gives a measure of the eccentricity of object surface points. Considering geodesic distances, points having greater values of μ are further from the object center, and vice versa. Moreover, the values of μ on the surface are stable as long as the surface topology remains unchanged. For example, the vertices located on the arms of a human model can have the same values of μ regardless of the arm positions. Furthermore, invariance to scale transformation can be obtained by normalizing μ with respect to its minimal and maximal values μ_min and μ_max:

μ_N(v) = (μ(v) − μ_min) / (μ_max − μ_min),   (8.9)
where μ_N : S → [0, 1] is the normalized function of μ. Extremal values of μ_N return surface critical point locations, which coincide with highly concave or convex regions. Note that geodesic distance calculations are computationally costly. The complexity is usually O(N^2), where N is the number of vertices of the mesh. We implemented the shortest-path algorithm of Dijkstra with a binary heap, as its time complexity is O(N log(N)) [5].
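The integrated geodesic Morse function of Eqs. (8.8) and (8.9) can be sketched on a mesh given as a weighted vertex-adjacency graph (vertex → list of (neighbor, edge length)). This is a simplification of the book's formulation: the surface integral is approximated by a plain sum over vertices, and `dijkstra`/`morse_mu` are our own names.

```python
import heapq

def dijkstra(adj, src):
    """Single-source geodesic distances along mesh edges (binary heap)."""
    dist = {v: float('inf') for v in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def morse_mu(adj):
    """mu(v) = sum of geodesic distances from v to all vertices (Eq. 8.8,
    surface integral approximated by a vertex sum), then normalized to
    [0, 1] per Eq. (8.9)."""
    mu = {v: sum(dijkstra(adj, v).values()) for v in adj}
    lo, hi = min(mu.values()), max(mu.values())
    return {v: (m - lo) / (hi - lo) for v, m in mu.items()}
```

On a simple chain of vertices, the central vertex gets μ_N = 0 and the extremities get μ_N = 1, matching the eccentricity property noted above.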
8.3.2 Construction of the Multi-resolution Reeb Graph

The multi-resolution Reeb graph is a set of Reeb graphs at various levels of resolution. The construction process consists of first building the graph at the highest
Fig. 8.5 Multi-resolution Reeb graph. The multi-resolution Reeb graph is a set of Reeb graphs of various levels of resolution. The graph at the highest resolution is obtained by partitioning the surface with respect to the Morse function values (cf. left). The Reeb graphs at lower resolutions are iteratively obtained from r = 6 to r = 1 (left to right, respectively). Reeb graphs of high resolution capture the surface topology (e.g. the cycle is captured for the resolution levels r ≥ 4). ©2012 IEEE [34]
resolution r = R (R > 1), and then iteratively deriving the graphs at lower resolutions until r = 1 (cf. Fig. 8.5). At resolution r = 0, the graph consists of one unique root node. A Reeb graph at resolution level R is constructed by:
1. Partitioning the range of μ_N, i.e. [0, 1], into 2^R regular intervals by iterative subdivisions, and assigning interval labels to surface points (i.e. mesh vertices) according to their μ_N values.
2. Creating a graph node for each surface region consisting of mutually connected surface points with the same interval label.
3. Linking the nodes whose corresponding regions are connected on the surface.
In practice, when the surface is represented by a triangular mesh, at each resolution each node corresponds to a set of connected triangles (and is placed at its centroid). The nodes created in step 2 above stand for the Reeb graph nodes, and the links created in step 3 stand for the Reeb graph edges. Reeb graphs at lower resolutions r < R are obtained by first merging the intervals of μ_N values two by two using a hierarchical procedure. Then, a parent node, i.e. a node in the lower-resolution Reeb graph, is assigned to each group of nodes in the higher-resolution graph whose corresponding surface regions are connected and share the same merged μ_N interval label. Hence, each node at resolution r > 1 has a unique parent node belonging to the Reeb graph at resolution r − 1 [8]. Note that the object surface is partitioned into regions with 2^r interval labels at resolution level r.
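The three construction steps above can be sketched as follows, assuming the mesh is given as an unweighted vertex-adjacency dict and `mu_n` maps each vertex to its normalized Morse value. `reeb_graph` is a hypothetical helper; region growing is done with a simple flood fill.

```python
def reeb_graph(adj, mu_n, R):
    """Reeb graph at resolution R over 2**R regular intervals of mu_N."""
    def interval(v):                 # step 1: interval label per vertex
        return min(int(mu_n[v] * 2 ** R), 2 ** R - 1)

    node_of = {}                     # step 2: connected same-label regions
    n_nodes = 0
    for v in adj:
        if v in node_of:
            continue
        stack, lab = [v], interval(v)
        node_of[v] = n_nodes
        while stack:                 # flood fill within one interval label
            u = stack.pop()
            for w in adj[u]:
                if w not in node_of and interval(w) == lab:
                    node_of[w] = n_nodes
                    stack.append(w)
        n_nodes += 1

    edges = set()                    # step 3: link adjacent regions
    for u in adj:
        for w in adj[u]:
            a, b = node_of[u], node_of[w]
            if a != b:
                edges.add((min(a, b), max(a, b)))
    return node_of, sorted(edges)
```

Running the same routine on the pairwise-merged intervals would yield the lower-resolution graphs, each node having a unique parent as described above.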
As defined, the multi-resolution Reeb graph captures the topological structure of 3D surface models at different levels of resolution. To obtain a finer shape description of 3D models, the Reeb graphs are enriched with geometrical and topological features. They are then called an augmented Multi-Resolution Reeb Graph (aMRG). aMRG was initially introduced in [35] to perform fine shape matching and retrieval in datasets of 3D art objects. This approach was first applied to 3D video data streams in [36] for 3D video compression, and later in [32] for 3D video description while introducing the topology dictionary. In an aMRG, each node embeds specific attributes that characterize the surface region it is assigned to (as described above). 3D model shape similarity can then be estimated by comparing the following node attributes:
• Relative area: area of the region with respect to the overall model surface,
• μ range: min and max values of μ in the region the node is assigned to,
• Local curvature statistics: histogram of Koenderink shape indices computed at each point of the surface region [15],
• Cord length statistics: histogram of Euclidean distances measured from each point of the region to the region's center of mass, namely cord lengths [22],
• Cord angle statistics: histograms of cord angles with respect to the 1st and 2nd principal axes of the region [22],
• Hough 3D descriptor: 3D histogram of normal orientations (azimuth θ ∈ [−π, π] and elevation φ ∈ [0, π]) and distances to the node center (r ≥ 0) computed at each point of the surface region [40].
Note that some attributes are invariant to rotation, translation, and scale transformations, whereas others are computed with respect to principal axes (e.g. cord angle statistics). The axes can be either the world coordinate axes, if the object is already oriented, or obtained by PCA [25]. Details concerning aMRG matching are given in Sect. 8.4.
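As an example of one node attribute, a cord-length histogram can be sketched as below. The 16 bins match the cordL[] entry of Table 8.2, and the normalization (histogram mass equal to the node's relative area) follows the convention of Sect. 8.4.2; the function name and the NumPy-based formulation are ours.

```python
import numpy as np

def cord_length_histogram(points, area_ratio, bins=16):
    """Cord lengths: Euclidean distances from each point of a surface
    region to the region's center of mass, binned into a histogram whose
    total mass equals the node's relative area a(m)/a(M)."""
    pts = np.asarray(points, dtype=float)
    cords = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    hist, _ = np.histogram(cords, bins=bins,
                           range=(0.0, cords.max() + 1e-12))
    return hist.astype(float) * (area_ratio / hist.sum())
```

The other histogram attributes (curvature, cord angles, Hough 3D) would be built analogously, each with its own binning of the per-point quantity.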
8.3.3 Robustness

As 3D models in a 3D video data stream usually contain reconstruction artifacts, the Reeb graph extraction has to be particularly robust to surface noise. Fortunately, the normalized Morse function introduced in Eqs. (8.8) and (8.9) is robust to local surface noise thanks to its integral formulation (as well as being invariant to rotation, translation and scale transformations). To evaluate the stability of the Reeb graph with regard to surface noise, we tested the Reeb graph extraction on 3D models of different resolutions (high and low). We observed that extra nodes might appear occasionally, especially at the extremities of the graphs. These are due to surface sampling implementation artifacts, as geodesic distances between vertices of a surface mesh are computed along the mesh edges. Hence geodesic distances on two meshes having different connectivity may differ. However, as defined, Eq. (8.8) can usually cope with all kinds of surface noise (including mesh connectivity changes) thanks to the integral formulation which smoothes local variations.
Fig. 8.6 Reeb graph robustness against surface noise incurred by the mesh simplification. Reeb graphs are constructed for 3D models of different mesh resolution (high and low): (a) 40th frame from the Tony sequence with 17,701 vertices and 1,335 vertices, (b) 25th frame from the Free sequence of [31] with 142,382 vertices and 6,500 vertices. The resolution of the Reeb graphs is r = 3. Despite some extra nodes at some extremities of the graphs, their overall structure and topology are well preserved. ©2012 IEEE [34]
Figure 8.6 shows (a) the 40th frame from the Tony sequence, captured and reconstructed in our laboratory, with 17,701 vertices and simplified to 1,335 vertices, and (b) the 25th frame from the Free sequence of [31] with 142,382 vertices and simplified to 6,500 vertices. The Reeb graphs are shown at the resolution r = 3. The simplifications were performed by edge collapsing in order to affect the geodesic measurements as much as possible. Despite some extra nodes at some extremities of the Reeb graphs, we can observe that their overall structure and topology are preserved.
8.3.4 Advantage

Structure extraction from an arbitrary shape is usually performed by fitting a 3D skeleton to the shape surface model, as in [3]. When successful, this kind of approach is powerful because the kinematic structure of the object can be extracted, and the structure joints can be tracked while the object is in motion. This topic will be discussed in detail in the next chapter. However, fitting a skeleton requires prior knowledge of the shape to be described: the skeleton has to be defined beforehand and cannot be fitted to arbitrary shapes [12]. On the other hand, the Reeb graph overcomes these limitations as it can characterize the topology and shape of arbitrary 3D models. No a priori knowledge of the model shape and topology is required, and no initial pose is required.
Fig. 8.7 Comparison of structure extraction techniques. The Reeb graph (in green) has several advantages compared to a skeleton fitting approach (in red) [3]. It extracts a consistent structure from arbitrary shapes without any prior knowledge, even though (a) limbs are not visible, and regardless of the model (b) topology, (c) orientation, and (d) complexity. ©2012 IEEE [34]
Figure 8.7 illustrates the advantage of using an automatic topological structure extraction method such as the Reeb graph, as opposed to a skeleton fitting technique (such as [3]). As can be observed, the Reeb graph can extract a consistent structure from arbitrary shapes without any prior knowledge, even though: (a) limbs are not visible, and regardless of the model (b) topology, (c) orientation, and (d) complexity.¹ Note that other approaches, such as the curve-skeleton [2, 4, 17], can be used to extract a graph with the homotopy property. Nevertheless, as shown in the next section, our approach is the most suitable for shape matching as it features a hierarchical multi-resolution structure; otherwise, graph matching computation in a huge dataset can quickly become intractable.
¹Raptor model provided courtesy of INRIA by the AIM@SHAPE Shape Repository.
8.4 Behavior Unit Model

This section presents the techniques we developed for behavior unit-based 3D video data representation, i.e. a practical implementation of the topology dictionary introduced in Sect. 8.2. The behavior unit is a novel data representation method for content-based representation and edition of 3D video. The behavior unit model represents a set of atomic actions. A behavior unit is defined as a partial interval of a 3D video data stream in which an object performs a simple action such as stand up, sit down, hands up, rotate, and so on, while preserving its overall topological structure. The partition of a 3D video stream results from the extraction and clustering of aMRG feature vectors, and the encoding is achieved by indexing the feature vectors using a topology dictionary model (cf. Sect. 8.2). In what follows we present: feature vector representation as an abstraction of the aMRG graph, feature vector similarity computation, retrieval performance in 3D video, and behavior unit-based encoding and decoding of a 3D video data stream. Note that the behavior unit model is defined based on atomic actions, which are represented by a continuous sequence of cluster (or pattern) indices, while a cluster is a set of feature vectors. As presented in Sect. 8.2.1, the clustering constrains the selected feature vectors to be similar in object topology as well as in object motion. The clustering is performed on a distance matrix whose elements are obtained by an integration over time using a time window (cf. Eq. (8.3)). Past, present and future 3D video frames of an atomic action are taken into account when performing similarity computation. Hence clusters are bound to contain 'continuous' feature vectors and are well suited to represent behavior units.
8.4.1 Feature Vector Representation

As introduced in Sect. 8.3, aMRG graphs capture the topology and shape of 3D models. The aMRG representation consists of a set of Reeb graphs at several levels of resolution that embed geometrical and topological attributes in each node. The aMRG feature vector representation consists of storing the graph structures and node attributes in a unique (binary) file. The attributes of each node are stored as tables of floats, and the multi-resolution graph structures are stored by indexing all edges as in a standard mesh coding method (e.g. the OFF format). Table 8.1 shows the data format for an aMRG. The size of a feature vector is then 2 + 4 ∗ (2 + res + Σ_i nbNodes[i] ∗ nbAttr + nbEdges ∗ 2) bytes. Note that the data structure above considers the simple case of one attribute per node encoded by one float. In practice, each node embeds the attributes presented in Sect. 8.3. Table 8.2 shows the data format for an aMRG node. In our implementation the attributes of each node are encoded in 784 bytes. The nature of each attribute, as well as the parameters (e.g. number of histogram bins), are set using heuristics [33].
Table 8.1 Data format for aMRG

Size in bytes                                         | ID        | Description
sizeof(char) = 1                                      | res       | aMRG highest resolution level R
sizeof(char) = 1                                      | nbAttr    | Number of attributes in each node
(res + 1) ∗ sizeof(int) = (res + 1) ∗ 4               | nbNodes[] | List containing the number of graph nodes at each resolution level
sizeof(int) = 4                                       | nbEdges   | Total number of graph edges
Σ_i nbNodes[i] ∗ nbAttr ∗ sizeof(float) = Σ_i nbNodes[i] ∗ nbAttr ∗ 4 |           | All graph node attributes
2 ∗ nbEdges ∗ sizeof(int) = 2 ∗ nbEdges ∗ 4           |           | All graph edges (pairs of node indices)
Table 8.2 Data format for aMRG node

Size in bytes                   | ID      | Description
sizeof(float) = 4               | a       | Relative area
sizeof(float) = 4               | l       | Length of μN interval
sizeof(int) = 4                 | i       | Index of μN interval (i ∈ [0, 2^r − 1])
16 ∗ sizeof(float) = 64         | curv[]  | Histogram of Koenderink shape indices
16 ∗ sizeof(float) = 64         | cordL[] | Histogram of cord lengths
16 ∗ sizeof(float) = 64         | cord1[] | Histogram of cord angles w.r.t. 1st axis
16 ∗ sizeof(float) = 64         | cord2[] | Histogram of cord angles w.r.t. 2nd axis
4 ∗ 8 ∗ 4 ∗ sizeof(float) = 512 | hough[] | Hough 3D descriptor
At resolution level r, the range of μ_N (i.e. [0, 1]) is partitioned into 2^r intervals. Each interval carries an interval index i ∈ [0, 2^r − 1], where i = 0 corresponds to the regions having the minimal values of μ_N. The length of an interval of μ_N is defined as the difference between the maximum and minimum values of μ_N in the interval. Once the number and size of attributes per node are fixed, the file size varies linearly with the number of graph nodes. The size of a feature vector can therefore be quite large, especially when dealing with high-resolution aMRG graphs, and similarity computation can quickly become intractable in huge datasets as graph matching complexity is NP-complete. However, the coarse-to-fine hierarchical matching strategy described in the next sections overcomes this issue.
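The byte layout of Table 8.1 can be illustrated with Python's struct module. `pack_amrg` is a hypothetical helper for the simplified one-float-per-attribute case; the resulting length equals 2 + 4·(2 + res + Σ_i nbNodes[i]·nbAttr + 2·nbEdges) bytes, matching the size formula of Sect. 8.4.1.

```python
import struct

def pack_amrg(res, nb_attr, nb_nodes, attrs, edges):
    """Serialize the aMRG layout of Table 8.1 (little-endian):
    res and nbAttr as chars; nbNodes[] and nbEdges as ints; then all node
    attributes as floats and all edges as pairs of node indices."""
    buf = struct.pack('<2B', res, nb_attr)           # res, nbAttr
    buf += struct.pack(f'<{res + 1}i', *nb_nodes)    # nbNodes[0..res]
    buf += struct.pack('<i', len(edges))             # nbEdges
    buf += struct.pack(f'<{len(attrs)}f', *attrs)    # node attributes
    for a, b in edges:                               # edge index pairs
        buf += struct.pack('<2i', a, b)
    return buf
```

For instance, a two-level graph (res = 1) with one attribute per node, three nodes in total and two edges serializes to 42 bytes.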
8.4.2 Feature Vector Similarity Computation

Similarity Evaluation   Assuming aMRG graphs M and N extracted from two 3D models, we denote their corresponding feature vectors equally by M and N,
respectively, as they are abstractions of the same objects. Feature vector similarity computation, as introduced in Sect. 8.2.1, consists of calculating Eq. (8.3) with the SIM function described below:

SIM(M, N) = (1 / (1 + R)) Σ_{r=0}^{R} Σ_{(m,n) ∈ C_r} sim(m, n),   (8.10)
where C_r ⊂ M × N contains all the pairs of topologically consistent nodes at resolution level r ∈ [0, R], whose definition will be given below, and sim : M × N → [0, 1] evaluates the similarity between two nodes m and n. sim returns a higher contribution when nodes are similar:

sim(m, n) = Σ_{k=0}^{nbAttr−1} λ_k δ_k(f_k(m), f_k(n)),   (8.11)
where nbAttr is the number of attributes embedded in each node, λ_k (with Σ_k λ_k = 1) is a weighting factor for the attribute f_k, and δ_k is a function that compares the attributes f_k depending on the types of attribute listed in Table 8.2:
• if f_k ≡ a, then δ_k(f_k(m), f_k(n)) = min(a(m)/a(M), a(n)/a(N)), where a(m) and a(n) are the areas of the surface regions associated with m and n, respectively, and a(M) and a(N) are the total areas of the surfaces associated with M and N, respectively. Note that a(m) and a(n) depend on the resolution level r, whereas a(M) and a(N) are independent of r.
• if f_k ≡ l, then δ_k(f_k(m), f_k(n)) = min(l(m)/l(M), l(n)/l(N)), where l(m) and l(n) are the lengths of the intervals of μ_N defined on the surface regions associated with m and n, respectively, and l(M) and l(N) are the sums of the lengths of the μ_N intervals associated with all the nodes of M and N, respectively. Note that l(m) and l(n) depend on the resolution level r, whereas l(M) and l(N) are independent of r.
B−1 min fk (m)[i], fk (n)[i] , δk fk (m), fk (n) =
(8.12)
i=0
where B is the number of histogram bins and f_k is a normalized histogram satisfying the following equation:

∀m ∈ M,  Σ_{i=0}^{B−1} f_k(m)[i] = a(m) / a(M).   (8.13)
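The per-attribute comparisons δ_k above can be sketched as follows. `sim_node` is a simplified version of Eq. (8.11) that, for brevity, assumes every attribute is a histogram; all function names are ours.

```python
def delta_area(a_m, a_M, a_n, a_N):
    """Global-descriptor comparison: min of the two relative areas."""
    return min(a_m / a_M, a_n / a_N)

def histogram_intersection(h1, h2):
    """Eq. (8.12): sum of bin-wise minima of two normalized histograms."""
    return sum(min(x, y) for x, y in zip(h1, h2))

def sim_node(attrs_m, attrs_n, weights):
    """Eq. (8.11): weighted sum of per-attribute comparisons delta_k,
    here using histogram intersection for every attribute."""
    return sum(w * histogram_intersection(attrs_m[k], attrs_n[k])
               for k, w in weights.items())
```

Because each histogram sums to the node's relative area (Eq. (8.13)), identical nodes contribute exactly their relative area to the global score, which is what makes SIM(M, M) = 1 in the equality case discussed next.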
Thus, if M = N then SIM(M, M) = 1. SIM computes the similarity between two feature vectors by summing the similarity scores obtained for each pair of matching nodes by sim at every level of resolution from r = 0 to R (the matching process is described below). Each similarity evaluation of a pair of nodes by
sim returns a (positive) contribution to the global similarity score given by SIM. As defined above, δ_k returns larger contributions when nodes are more similar. The descriptors a and l serve to characterize the global shape of the object. On the other hand, the local descriptors (namely, here, the histograms) serve to characterize details and variations on the surface of the objects. These descriptors were first introduced for shape matching and retrieval in databases of 3D art objects in [35]. Note that sim in Eq. (8.11) is positive, reflexive, symmetric and transitive. However, SIM in Eq. (8.10) is positive, reflexive and symmetric but not transitive, as M and N may not have the same node structure. In our implementation the weights {λ_k} are determined heuristically: λ_a = 0.2, λ_l = 0.3, λ_curv = 0.1, λ_cordL = 0.1, λ_cord1 = 0.1, λ_cord2 = 0.1, and λ_hough = 0.1, and correspond to the selected attributes {f_k} as described above.

Topology Matching   Topology matching is the process that consists of matching two aMRG graphs M and N based on their topological structure. The algorithm described below returns all the pairs of topologically consistent nodes between M and N. By definition [8, 35], two nodes m and n at resolution r, belonging to M and N, respectively, are topologically consistent if:
1. The parents of m and n, respectively, have been matched together at the level of resolution r − 1.
2. m and n have an equal interval index of μ_N (as defined in Table 8.2).
3. If m and n belong to a graph branch,² they must have the same label (if they have any) to be matched: when two nodes are matched, a label (e.g. α) is identically assigned to both of them if they do not have one yet. Then, the label is propagated to their connected neighbors³ in both graphs M and N following the two monotonic directions with increasing and decreasing values of μ_N. Note that label propagation is performed only for branches, and not for branching nodes.
4.
The parents of the neighbors of m and n, if they have ones, have been matched together at the level of resolution r − 1. In addition, the node matching procedure allows the matching between a node m and a set of nodes {n} when m is topologically consistent to all nodes of {n}. This alleviates some possible boundary issues (especially located at branch junctions) after segmentation of the object surface into regions caused by discretized values of μN (as μ is computed on the mesh vertices and using the mesh edges). Furthermore, a cost function loss is introduced to discriminate nodes or set of nodes that are all topologically consistent. loss is minimal when two candidates are similar and is used to find the best topologically consistent candidate. The function is defined in the spirit of Eq. (8.11), but involves only the node global descriptors a 2A
graph branch is a set of successive nodes linked two by two by a single edge. Two branches match together when all the nodes belonging to them match together. 3 A neighbor is a node belonging to an adjacent surface region. Neighboring nodes are connected by a Reeb graph edge at the same resolution level.
and l:

loss(m, n) = ε |a(m)/a(M) − a(n)/a(N)| + (1 − ε) |l(m)/l(M) − l(n)/l(N)|,   (8.14)
where ε = 0.5. When evaluating loss and sim (cf. Eq. (8.11)) for a set of nodes, the features representing the set can be obtained by a simple addition of the attributes embedded in each node of the set. In practice the calculation of loss also includes the neighboring nodes in order to increase the discrimination power. The matching process involves a coarse-to-fine strategy where topology consistencies between nodes of M and N are evaluated hierarchically from the root node at resolution r = 0 to the nodes at the finest level of resolution r = R. At each resolution level r, the nodes are sorted by their embedded μN values in ascending order and visited one by one. If two nodes m and n at resolution level r, belonging to M and N respectively, are found topologically consistent, then the matching pair (m, n) is inserted in a list Cr, and the matching process is repeated with their children nodes at resolution level r + 1. On the other hand, if two nodes m and n are not topologically consistent, then the matching process is aborted for these two nodes. The topology matching process terminates when all the pairs of topologically consistent nodes have been found, meaning that no matching process is running anymore. The set of pairs of consistent nodes {C0, . . . , CR} is then returned and used for the similarity computation described in Eq. (8.10). Figure 8.8 illustrates topology matching for a multi-resolution Reeb graph. Implementation details of the matching algorithm are as follows:
1. Let M and N denote two aMRG graphs to be compared. At the lowest resolution level r = 0, M and N are each represented by a single node; these two nodes are matched by default.
2. For each resolution level r > 0, the nodes {m ∈ M} are visited one by one following the interval index of μN they belong to, from 0 to 2^r − 1.
3. For each visited node m ∈ M, all the nodes {n ∈ N} at the same resolution r are taken into consideration with respect to the topology consistency rules defined above.
The candidates (node n or group of nodes {n}) that are topologically consistent with m are further discriminated using the cost function loss (cf. Eq. (8.14)).
4. The best candidate, i.e. the one that returns the smallest cost with loss, is paired with m and inserted in a list Cr ⊂ M × N.
5. Once all nodes {m ∈ M} at resolution level r have been visited, Cr contains all topologically consistent pairs (m, n) between M and N at resolution r.
6. Topology matching (i.e. steps 2 to 5) is repeated at r + 1 with all children nodes of the nodes that have been matched at r, and iteratively until the highest resolution r = R is reached.
In Fig. 8.8, at resolution r, several nodes have similar topology. The loss function is then evaluated between all the candidates to find the best matches. Nodes that are matched at resolution r are represented with identical colors. In addition, blue links with arrows show examples of matching between nodes at resolution r (m′ and n′), as well as between their children nodes (m and n) and groups of nodes at resolution level r + 1. Parent-to-child links are represented by yellow dashed arrows. Note that m and n belong to branches of M and N,
Fig. 8.8 Topology matching of multi-resolution Reeb graph. Nodes that are matched at resolution r are represented with identical colors. In addition, blue links with arrows show examples of matching between nodes at resolution r (m′ and n′), as well as between their children nodes (m and n) and groups of nodes at resolution level r + 1. Parent-to-child links are represented by yellow dashed arrows. Note that m and n belong to branches of M and N, respectively. A label (here α) is propagated in both branches as the pair (m, n) is formed. Consequently, the corresponding branches of M and N with label α are matched together
respectively. A label (here α) is propagated in both branches as the pair (m, n) is formed. Consequently, the corresponding branches of M and N with label α are matched together. Note that, by definition of a branch, the label is not propagated to branching nodes.
7. Finally, the aMRG similarity score SIM(M, N) between M and N is obtained by adding the node similarity values of all topologically consistent pairs {(m, n) ∈ Cr} at every level of resolution r = 0 . . . R using the sim function, which takes into account the embedded node attributes (cf. Eqs. (8.10) and (8.11)).
Multi-resolution Strategy The coarse-to-fine multi-resolution matching strategy has two major advantages. First, it is crucial for computational tractability when dealing with large databases: it avoids the NP-complete complexity issue of large graph matching by aborting the similarity evaluation process as soon as no consistent match is found between the graphs, starting from the lowest graph resolution (which has only a few nodes). Second, it provides a judicious matching scheme as
global shape and topology (e.g. body postures) are privileged over fine details (e.g. arm positions, fingers). The nodes are matched hierarchically using the topology consistency rules and similarity functions (cf. the sim function described above). The coarse-to-fine recursion is spread through the node children (starting from the root at r = 0) up to the highest resolution (r = R). Note that irrelevant nodes are discarded in the graph matching process as they would return weak matching scores (e.g. nodes resulting from surface noise due to 3D reconstruction artifacts). Further reading on the matching process can be found in [8, 35].
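The matching and scoring pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Node structure, the per-attribute similarity delta, the consistency test, and the omission of the normalization that makes SIM(M, M) = 1 are all simplifying assumptions; the heuristic weights are those quoted in the text.

```python
# Minimal sketch of coarse-to-fine aMRG matching (assumed data structures).
LAMBDA = {"a": 0.2, "l": 0.3, "curv": 0.1, "cordL": 0.1,
          "cord1": 0.1, "cord2": 0.1, "hough": 0.1}

def sim(m, n):
    """Weighted node similarity in the spirit of Eq. (8.11)."""
    def delta(x, y):  # toy per-attribute similarity: larger when attributes are close
        return 1.0 - abs(x - y) / (abs(x) + abs(y) + 1e-9)
    return sum(w * delta(m.f[k], n.f[k]) for k, w in LAMBDA.items())

def match(M_root, N_root, R, consistent):
    """Per-level lists C[r] of matched pairs, aborting early when none are found."""
    C = [[(M_root, N_root)]]          # r = 0: the two root nodes match by default
    for r in range(1, R + 1):
        Cr = [(cm, cn)
              for (m, n) in C[r - 1]  # expand only children of matched parents
              for cm in m.children for cn in n.children
              if consistent(cm, cn)]
        if not Cr:                    # no consistent pair at this level: abort
            break
        C.append(Cr)
    return C

def SIM(M_root, N_root, R, consistent):
    """Eq. (8.10)-style score: sum node similarities over all matched pairs."""
    return sum(sim(m, n) for Cr in match(M_root, N_root, R, consistent)
               for (m, n) in Cr)
```

The early abort in `match` is what keeps the evaluation tractable on large databases, as discussed below.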
8.4.3 Performance Evaluation

The performance of the 3D shape similarity computation method using aMRG presented above is evaluated against various shape similarity metrics for 3D video sequences of people with unknown temporal correspondence [9]. The performances of the similarity measures are compared by evaluating Receiver Operating Characteristics (ROC) for classification against ground truth on a comprehensive dataset of synthetic 3D video sequences consisting of animations of several people performing different motions (Fig. 8.9). The synthetic dataset is created using 14 articulated character models, each of which is animated using 28 motion capture sequences. The animated models are people with different gender, body shape and clothing, and height between 1.6 m and 1.9 m. Models are reconstructed using multiple view images, as a single connected surface mesh with 1 K vertices and 2 K triangles. Recognition performance is evaluated using ROC curves, plotting the true-positive rate (TPR), or sensitivity, in correctly detecting similarity against the false-positive rate (FPR), or one minus specificity, where similarity is incorrect:

TPR = ts / (ts + fd)   and   FPR = fs / (fs + td),   (8.15)
where ts denotes the number of true-similarity predictions, fs the false similar, td the true dissimilar and fd the false dissimilar when comparing the predicted similarity between two frames to the ground-truth similarity. For each similarity measure, dissimilarity values between all pairs of 3D shape data in the database are computed in the distance matrix S (cf. Sect. 8.2.1), whose element sij denotes the distance (or dissimilarity) between the ith and jth 3D shape data. Distances are normalized to the range sij ∈ [0, 1]:

sij ← (sij − smin) / (smax − smin),   (8.16)
where smin = 0 and smax is the maximal distance over all sij ∈ S of the whole database. A binary classification matrix C(α) = {cij(α)}ij for the shape descriptor is defined as:

cij(α) = 1 if sij < α, and cij(α) = 0 otherwise.   (8.17)
Fig. 8.9 Synthetic dataset. aMRG robustness and accuracy is evaluated using ground-truth data. (a) Illustration of 14 human models. (b) Six of the 28 motions
Fig. 8.10 ROC curves for the shape classification of 3D video sequences. The aMRG is one of the top performers for shape classification in 3D video. The comparison includes Shape Histograms (SHvr), Multi-Dimension Scaling (MDS), Spin Image (SI), Shape Distribution (SD) and Spherical Harmonics Representation (SHR). ©2012 IEEE [34]
The classification cij(α) for a given α is then compared to a ground-truth similarity classification, and the numbers of true and false similarity classifications, ts(α), td(α), fs(α), fd(α), are counted. The ROC performance for a given shape similarity measure is then obtained by varying the threshold α ∈ [0, 1] to obtain the true-positive TPR(α) and false-positive FPR(α) rates according to Eq. (8.15). The evaluations with the generated 3D video sequences demonstrate that the aMRG-based method is one of the top performers in the task of finding similar poses of the same person in 3D video compared to state-of-the-art shape matching techniques (Fig. 8.10 and [10]). The comparison includes Shape Histograms (SHvr), Multi-Dimension Scaling (MDS), Spin Image (SI), Shape Distribution (SD) and Spherical Harmonics Representation (SHR) (see [9] for additional details). As the aMRG-based method is particularly sensitive to topology changes compared to other approaches, it is our first choice as a topology-based shape descriptor.
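As a concrete illustration of the evaluation protocol, the following sketch normalizes a toy distance matrix, thresholds it, and counts classifications against a ground-truth similarity matrix, using the standard definitions TPR = ts/(ts + fd) and FPR = fs/(fs + td); the matrices here are illustrative, not data from the evaluation.

```python
# Hedged sketch of the ROC sweep (Eqs. (8.15)-(8.17)) on illustrative matrices.
def normalize(S):
    """Eq. (8.16) with s_min = 0: rescale distances into [0, 1]."""
    s_max = max(max(row) for row in S) or 1.0
    return [[s / s_max for s in row] for row in S]

def roc_point(S, G, alpha):
    """TPR/FPR at threshold alpha; S = normalized distances, G = ground truth."""
    ts = fs = td = fd = 0
    for s_row, g_row in zip(S, G):
        for s, g in zip(s_row, g_row):
            pred = 1 if s < alpha else 0          # Eq. (8.17)
            if pred and g:
                ts += 1
            elif pred and not g:
                fs += 1
            elif not pred and g:
                fd += 1
            else:
                td += 1
    tpr = ts / (ts + fd) if ts + fd else 0.0      # Eq. (8.15)
    fpr = fs / (fs + td) if fs + td else 0.0
    return tpr, fpr
```

Sweeping `alpha` over [0, 1] and collecting the (FPR, TPR) points traces the ROC curve for one similarity measure.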
8.4.4 Data Stream Encoding

Here we present a linear process to compactly encode a 3D video data stream based on behavior units. As described in the previous sections, the behavior units are obtained by first clustering feature vectors characterizing topological structures, indexing the clusters within a dictionary, and then partitioning the original data stream into a sequence of temporal intervals using the cluster indices. This method is a practical implementation of the topology dictionary (Fig. 8.1). Assuming a data stream S contains T frames, S = {s1, . . . , sT}, a data structure representing a behavior unit ki is created for each of the N clusters ci ⊂ S, where 1 ≤ i ≤ N. Each ki encompasses a specific pose of the object and all its variations within each temporal interval c_i^j that forms ci = ∪j c_i^j.
Let s_i^{j,min} and s_i^{j,max} denote the frames having, respectively, the smallest and biggest indices in the jth interval c_i^j ⊂ ci. ki contains one textured mesh (i.e. one mesh and one texture map), one graph structure (namely one Reeb graph at a chosen resolution), a table of node position offsets corresponding to all the node trajectories to transit from any s_i^{j,min} to ki, and a table of node position offsets corresponding to all the node trajectories to transit from ki to any s_i^{j,max}. The node trajectories are obtained by tracking the node positions in each c_i^j, which is a trivial task as long as the graphs are consistent with each other (as they should be in each cluster). In practice, noisy nodes and edges are removed so that any Reeb graph constructed from a frame in ci is topologically consistent with ki. Practically, ki should be chosen at the center of the cluster ci:

ki = arg min_{Mk ∈ ci} Σ_{M ∈ ci} (1 − SIM(Mk, M)),   (8.18)
where Mk and M are meshes in ci, and SIM is the similarity function which computes aMRG feature vector similarity as described in Sect. 8.4.2 (Eq. (8.10)). The encoding process consists of sequentially substituting each frame of a 3D video data stream with a cluster (or pattern) reference index. If a frame st does not belong to the same cluster ci−1 as the previous frame st−1, then a new data structure representing a behavior unit ki is created. If st belongs to the same cluster ci−1 as st−1, then only the position offsets of the graph nodes at t are stored into ki−1. It is then possible to recover the node trajectories between consecutive frames and reconstruct the mesh sequence by the skinning operation, which is presented in the next section. Let sm be the size of an encoded mesh plus a Reeb graph structure, and sg the size of an encoded set of node position offsets; the total data size σ of the encoded sequence is then σ ≤ sm · N + sg · T.
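A minimal sketch of this substitution step might look as follows; `cluster_of` stands in for the topology-dictionary lookup, and the per-frame offsets are represented abstractly (a real behavior-unit record would hold node position offsets, a textured mesh and a Reeb graph):

```python
# Hedged sketch of the sequential encoding step: a new behavior-unit record is
# opened whenever the cluster changes; otherwise only node offsets are appended.
def encode(stream, cluster_of):
    """Return (cluster index sequence, behavior-unit records) for a frame stream."""
    units, indices = [], []
    prev = None
    for frame in stream:
        c = cluster_of(frame)
        if c != prev:                    # cluster change: new behavior unit
            units.append({"cluster": c, "offsets": []})
        else:                            # same cluster: store node offsets only
            units[-1]["offsets"].append(frame)
        indices.append(c)
        prev = c
    return indices, units
```

The number of unit records grows with the number of temporal intervals, while the offsets grow with T, matching the size bound σ ≤ sm · N + sg · T.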
8.4.5 Data Stream Decoding

Model animation can be performed using several sophisticated techniques and CG software, where the model deformation process is usually guided by a skeleton
Fig. 8.11 Sequence reconstruction. 3D video sequences are reconstructed from encoded data stream by mesh skinning. (a) Textured mesh from initial 3D video data. (b) Reeb graphs extracted at resolution r = 4. (c) Reconstructed surfaces by mesh skinning. (d) Reconstructed surfaces with texture to be compared with (a). Here, one unique texture map (view-independent vertex-based texture) is used for the whole reconstructed sequence. (e) The overlay of both surfaces (initial surface in blue and reconstructed surface in red) shows that the overall surface shape is well reconstructed. Note that this 3D video was produced from multi-view video captured in our oldest studio rather than Studios A, B, or C in Chap. 2. ©2012 IEEE [34]
(e.g. Blender, Maya [3, 14]). First, a skeleton of the model is created interactively and subparts of the model are attached to bones (rigging process). Then, skeleton joints are manipulated and deformations are cast sequentially to the mesh (animation
process). Additional post-processing steps are usually applied to smooth discontinuities between submesh boundaries (skinning process), as well as mesh edition techniques to improve surface rendering [23, 30]. Skinning is a popular method for performing character and object deformation in 3D games and animation movies. In our framework, 3D video reconstruction from encoded frames is obtained using a mesh skinning method where surface deformations are driven by Reeb graphs. During the skinning process, the graph is bound to a single mesh object, and the mesh is deformed as the graph nodes move. Recall that the data structure representing a behavior unit ki contains the motion trajectory data of each node in the graph. As node coordinates change, transformation matrices associated with the vertices of the mesh deform them in a weighted manner. A weight defines how much a specific node influences vertices in the deformation process (e.g. 1.0 for rigid skinning, and less than 1.0 for smooth skinning). It is usual to set smoother skinning for vertices belonging to a joint area on a surface mesh [11]. The data stream reconstruction is performed for each cluster ci by considering each temporal interval c_i^j independently, where ci = ∪j c_i^j, in order to avoid surface topology change issues. A unique data structure ki, as introduced in the previous section, describes the behavior unit represented by the cluster ci: all of the 3D video frames whose corresponding feature vectors belong to ci are reconstructed by deforming ki according to the encoded node coordinates (for each c_i^j) using a mesh skinning method as described above. Figure 8.11 illustrates a sample of 3D video data reconstructed from an encoded data stream. Note that even though the implemented mesh skinning method is not optimal, no major reconstruction artifacts are noticed at video frame rate (25 fps).
We measured 3D position distortions in a 400 × 400 × 400 voxel grid (corresponding to a 2 m × 2 m × 2 m volume having 5 mm resolution), and we obtained the mean squared error MSE ∼ 0.005 and the peak signal-to-noise ratio PSNR ∼ 75 dB when computing surface distances between original and decoded mesh data using Hausdorff distance as metric.
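The graph-driven deformation can be illustrated with a toy linear-blend step. This is a deliberate simplification of the actual skinning (translational node offsets only, no full transformation matrices); the bindings and weights are illustrative.

```python
# Hedged sketch of graph-driven skinning: each vertex moves by the weighted sum
# of the displacements of the Reeb-graph nodes it is bound to (1.0 = rigid
# binding, < 1.0 per node = smooth binding across a joint area).
def skin(vertices, weights, node_offsets):
    """weights[v]: node id -> influence; node_offsets: node id -> (dx, dy, dz)."""
    out = []
    for v, (x, y, z) in enumerate(vertices):
        dx = dy = dz = 0.0
        for node, w in weights[v].items():
            ox, oy, oz = node_offsets[node]
            dx += w * ox
            dy += w * oy
            dz += w * oz
        out.append((x + dx, y + dy, z + dz))
    return out
```

Applying `skin` once per encoded frame, with the stored node offsets, replays a temporal interval from a single behavior-unit mesh.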
8.5 Applications

This section presents several applications of the behavior unit model, in particular for 3D video data editing and semantic description. As described in the previous sections, the model provides an abstraction of the structure of the data stream. Hence long poses, repeated actions, and slow motions can be identified in video sequences of humans in motion, and compactly encoded and manipulated.
8.5.1 Behavior Unit Edition

Let us assume G = (C, E) is a weighted directed graph, where C are the vertices and E are the edges. C = {c1, . . . , cN} denotes the set of N clusters obtained by
topology-based clustering of the T frames of the 3D video sequence S = {s1, . . . , sT}. As presented in Sect. 8.2.2 and Sect. 8.4, G is a motion graph representing the structure of the data stream S by its states and transitions in a probabilistic framework, and C is a set of identified behavior units. Behavior unit-based edition is performed by interacting with the motion graph G. In particular, estimations of path probabilities with Eq. (8.6) allow users to create new sequences of actions while preserving scenario realism. Our educated guess is that new sequences can be created by picking two behavior units cs and ct in C, and concatenating all the behavior units that correspond to the clusters belonging to the (most) probable paths linking cs and ct. To achieve high quality rendering, transitions between behavior units should be particularly well managed, as the transition between two surfaces with very different topology can be challenging [39]. Although this has not been fully investigated yet, we believe that implementing an ad-hoc surface skinning method should pay off. In what follows we present an unsupervised scheme to skim 3D video sequences by processing behavior units of human performances. The goal is to automatically produce shorter sequences while preserving scenario consistency. The motion graph is used to identify isolated and non-relevant patterns and progressively remove them. First, the set C is sorted with respect to the cluster weights P(ci) in order to identify the frames belonging to clusters having the highest and lowest probabilities:
• If P(ci) ≫ 0, then ci contains either (1) a long sequence of successive frames belonging to ci, or (2) a recurrent pose identified by frames belonging to ci scattered over the sequence. In case (1), long poses (e.g. low variations such as between frames #370 and #430 in Fig. 8.2) are compressed by encoding intermediate 3D video frames as described in the previous section.
In case (2), ci is represented as a cycle (or loop) junction node in the motion graph G. The strategy is therefore to gradually remove the small and non-relevant cycles. Let L denote a cycle in the motion graph. Then compute:

S(L) = Σ_{c ∈ L} P(c) / card{c ∈ L},   (8.19)

P(L) = Σ_{eij ∈ L} P(cj | ci) P(ci) / card{eij ∈ L},   (8.20)

where the size S(L) is the average weight of the cycle L, and the relevance P(L) is defined as the probability of the cycle L under the Markov assumption. The following weight is used to sort the cycles:

W(L) = λ · S(L) + (1 − λ) · P(L),   (8.21)

where λ ∈ [0, 1]; here λ = 0.5. Practically, the skimming process consists of removing redundant frames from video sequences, where small cycles with low weight values are selected as the first candidates for skimming. As will be discussed in Sect. 8.6, several other skimming strategies can be adopted.
Fig. 8.12 Behavior units with semantic labels. Arbitrary poses with annotations can be learned and indexed in the topology dictionary. ©2012 IEEE [34]
• If P(ci) ≈ 0, then ci contains few frames. Identified isolated patterns are reclassified into adjacent clusters: e.g. in the sequence {ci, ci, cj, ci, ci}, the frames {s ∈ cj} are reclassified into ci. In practice, the graph structures are modified (by node filtering) to be consistent with ci and avoid reconstruction artifacts. Finally, summarization can be processed iteratively up to some user-defined constraints, such as a limitation on the sequence size or compression ratio (cf. Sect. 8.6).
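The cycle ranking of Eqs. (8.19)–(8.21) and the iterative skimming loop can be sketched together as follows; the cycle descriptions, cluster probabilities and transition probabilities are illustrative inputs, not the book's data structures.

```python
# Hedged sketch: rank cycles by W(L), then remove the weakest cycles until a
# user-defined frame budget is met.
def cycle_weight(cycle, P_c, P_trans, lam=0.5):
    """W(L) = lam * S(L) + (1 - lam) * P(L) for a cycle of cluster indices."""
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))       # close the loop
    S = sum(P_c[c] for c in cycle) / len(cycle)           # size, Eq. (8.19)
    P = sum(P_trans[(i, j)] * P_c[i] for i, j in edges) / len(edges)  # Eq. (8.20)
    return lam * S + (1 - lam) * P                        # Eq. (8.21)

def skim(total_frames, cycles, target_frames, weight):
    """Remove low-weight cycles first; each cycle carries an id and a frame count."""
    kept, removed = total_frames, []
    for cyc in sorted(cycles, key=weight):                # weakest cycles first
        if kept <= target_frames:
            break
        kept -= cyc["frames"]                             # drop the cycle's frames
        removed.append(cyc["id"])
    return kept, removed
```

Other skimming orders (e.g. removing the biggest cycles first, as in Sect. 8.6.2) only change the sort key.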
8.5.2 Semantic Description

Semantic description of the data stream is obtained by specifying semantic labels for the identified behavior units. The semantic labels are obtained by first analyzing training datasets to prepare a set of prototype behavior units and then giving each behavior unit a semantic label such as “stand up, hands on hips”, “stand up, hands joined over the head, head looking at the hands”, etc. In practice, any behavior unit with a semantic label can be added, as learning and indexing can be performed on arbitrary data. For example, as shown in Fig. 8.12, models from various sources can be annotated for action recognition applications. With this training process, a semantically labeled topology dictionary is constructed. Then, given a new 3D video stream, each video frame is compared with the semantically labeled behavior units and is classified to the most similar behavior unit, converting the 3D video stream into a sequence of behavior unit indices and/or semantic labels.
In addition, as can be observed in Fig. 8.12, labels can be specified based on shape categorization in addition to topology. The Homer, woman and alien models have similar topological structures as they all stand up, but different shape features, as the limbs have different lengths and the body builds differ. These variations are captured by the graph node attributes defined in Table 8.2. Thus, the topology dictionary can perform classification and description based on shape as well as topology. Furthermore, as the behavior units we are considering are based only on shape and topology, there is no knowledge about content importance. However, a semantic weight describing the content importance can be added along with labels (or annotations) to behavior units obtained from training datasets. Hence, behavior units with lower importance weights could be removed first in the skimming process described in Sect. 8.5.1.
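The nearest-prototype classification step can be sketched as follows; `SIM` stands in for any of the similarity functions of Sect. 8.4.2, and the prototypes are hypothetical labeled behavior units.

```python
# Hedged sketch of semantic labeling: assign each new frame the label of the
# most similar labeled prototype behavior unit.
def label_frame(feature, prototypes, SIM):
    """prototypes: list of (feature_vector, label); returns the best label."""
    best_label, best_score = None, float("-inf")
    for proto, lab in prototypes:
        score = SIM(feature, proto)
        if score > best_score:
            best_label, best_score = lab, score
    return best_label
```

Running `label_frame` over a stream converts it into a sequence of semantic labels, as described above.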
8.6 Performance Evaluations

To assess the performance of the behavior unit model, several experiments were performed on various 3D video sequences. The Yoga sequence in Fig. 8.2 and the Tony sequence in Fig. 8.5 are interesting as they contain many human poses. These sequences are useful to set the parameter τ for sequences of humanoid objects. The MAIKO dataset is challenging for shape description as the subject wears a loose FURISODE which covers the arms and legs. Fortunately the Reeb graph is an effective tool to characterize arbitrary shapes. The Capoeira sequence represents quick moves of martial art. In these experiments, the MAIKO, Yoga, Capoeira and Tony sequences contain, respectively, 201, 7500, 300 and 250 frames from the original sequences (cf. Fig. 2.13). Every frame contains one 3D mesh consisting of about 30 K triangles with texture information. One uncompressed frame encoded in the standard OFF format requires 1.5 MB, which means 11.25 GB for 7500 frames. Feature vectors were computed on a Core2Duo 3.0 GHz machine with 4 GB RAM; the process itself requires less than 512 MB of RAM. A feature vector up to resolution level R = 5 is generated in 15 s with the current implementation (cf. [33] for binaries). The similarity computation between two aMRG models takes 10 ms. Other efficient computations of Reeb graphs can be found in the literature (e.g. [24]). All the independent steps can be run sequentially using a script file.
8.6.1 Topology Dictionary Stability

The core of the topology dictionary model relies on its ability to discriminate shape topology. The definition of the Morse function is therefore crucial (cf. Sect. 8.3). The ability of the dictionary to extract and classify patterns has been evaluated against different Morse functions and resolution levels R of Reeb graphs: the curves in
Fig. 8.13 Clustering of 500 frames of Yoga with respect to τ . Curves are obtained with different Morse function (geodesic integral defined in Eq. (8.8) and height function), and at different levels of resolution. ©2012 IEEE [34]
Fig. 8.13 named geodesic R = 5, geodesic R = 4, and geodesic R = 3 were obtained using the geodesic integral as Morse function, as defined in Eq. (8.8), and by computing similarities up to the resolution levels R = 5, R = 4, and R = 3, respectively. The curve named geodesic r = 4 was obtained with the geodesic integral as Morse function, but without summation of the coarse resolution levels when computing the similarity (i.e. only the contributions at level r = 4 were used in Eq. (8.10)). The curve named height R = 4 was obtained using the height function μ(v) = z as Morse function, and by computing similarities up to resolution level R = 4. The clustering performance is then evaluated with respect to the threshold τ. In Figs. 8.13 and 8.14, the sequences contain, respectively, 500 and 7500 frames of a Yoga session. They consist of a succession of various (complex) poses. The clustering behavior was analyzed with different values of τ and parameter settings. Finally, it turned out that the geodesic integral with R = 4 and τ = 0.1, or R = 3 and τ = 0.08, gives the best trade-offs between clustering performance and computation time for humanoid model sequences, in comparison to a hand-made clustering. The full Yoga sequence (7500 frames) contains 1749 clusters: 44 long poses or repeated actions (>1 s), 115 short actions (40 ms to 1 s) and 1590 transition states (<40 ms). Simple tests on 3D video data streams representing models with similar topological structures show that, using the same parameters (i.e. μ function, R and τ), distance estimations remain in the same range, as shown in Fig. 8.15: a parameter set can be re-used for objects belonging to similar categories. The aMRG graphs of Tony and Yoga indeed have similar skeleton-like structures. To cluster the MAIKO sequence, the clustering threshold τ is set to 0.2 (Fig. 8.16). The loose clothing makes the object shape more compact than humans with tight clothing.
Hence the shape characterization at the lower resolutions of the aMRG is less discriminant, and therefore τ has to be increased. In fact, we believe
Fig. 8.14 Clustering of 7500 frames of Yoga with respect to τ . The curve is obtained with the geodesic integral at R = 4. ©2012 IEEE [34]
Fig. 8.15 Distance evaluation between data streams presenting similar topological structures. The Yoga and Tony sequences contain frames having similar graph structures (e.g. frame #145 from Tony and frame #88 from Yoga, and frame #165 from Tony and frame #115 from Yoga). We observe that the distance evaluations share the same behavior and remain in the same range: frame #165 from Tony is closer to frame #145 than frame #100, and frame #115 from Yoga is closer to frame #88 than frame #993. Hence the same set of parameters can be used for both sequences for distance evaluation. ©2012 IEEE [34]
Tony (1 − SIM):  frames #145 vs. #165 → 0.33;  frames #165 vs. #100 → 0.39
Yoga (1 − SIM):  frames #88 vs. #115 → 0.28;  frames #115 vs. #993 → 0.37
Fig. 8.16 Clustering of MAIKO sequence into behavior units. Top left: similarity matrix computed over the 201 frames of the MAIKO sequence. Color bars: number of clusters in the sequence with respect to the threshold τ (e.g. τ = 0.2 returns 24 clusters and a partition containing 95 temporal intervals). Each color stands for a cluster index. Top histogram: histogram showing the number of frames belonging to each of the 95 intervals obtained for τ = 0.2. The histogram shows two short actions (40 ms to 1 s) and 93 transitions (<40 ms). Middle histogram: histogram showing the number of frames belonging to each cluster for τ = 0.2. High numbers of frames stand for long or repeated actions. Lower numbers are likely to represent transition states. Bottom histogram: histogram showing the number of occurrences of each cluster for τ = 0.2. The statistics indicates that the sequence contains 17 repeated atomic actions which can be compactly encoded by further processing using the topology dictionary. ©2012 IEEE [34]
that a relationship between τ and the global shape of the models can be established (e.g. τ = 0.08 for star-like shapes). This would allow us to set τ automatically. The MAIKO sequence describes a dancer in action, performing a 360 degree rotation and kneeling. Its 201 frames were clustered into 24 clusters with τ = 0.2, with a partition containing 95 temporal intervals. The statistics indicate that the sequence contains two short actions (40 ms to 1 s) and 17 repeated atomic actions. The longest state corresponds to the last part of the video, where the MAIKO slows down her motion and remains still. The short actions and (quick) transitions characterize the pace and activity of the MAIKO during her performance as she turns and moves her hand at the same time.
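The role of the threshold τ in forming clusters can be illustrated with a simple sequential scheme: a frame joins an existing cluster whose representative lies within distance τ (e.g. 1 − SIM), otherwise it opens a new cluster. This is a hypothetical sketch, not the authors' clustering algorithm; in the real system the inputs are aMRG feature vectors.

```python
# Hedged sketch of threshold-based clustering with threshold tau.
def cluster(features, dist, tau):
    """Return a cluster index per feature; representatives are first members."""
    reps, labels = [], []
    for f in features:
        for i, r in enumerate(reps):
            if dist(f, r) <= tau:         # close enough: join existing cluster
                labels.append(i)
                break
        else:                             # no representative within tau: new cluster
            reps.append(f)
            labels.append(len(reps) - 1)
    return labels
```

Increasing τ merges more frames into fewer clusters, which is why the compact MAIKO shapes require a larger τ than the star-like humanoid sequences.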
8.6.2 3D Video Progressive Summarization

The size of a sequence grows linearly at 1.5 MB per frame (11,250 MB for 7500 frames). Hence it becomes very difficult to search for specific information and
Fig. 8.17 Encoding and summarization ratio (skimming) of the Yoga sequence (7500 frames). The 7500 frame sequence has been reduced to 3660 encoded frames (50 %) as 1749 clusters were obtained: 44 long or repeated poses (>1 s), 115 short actions (40 ms to 1 s) and 1590 transition states (<40 ms). Then, 9095 cycle combinations were identified. The skimming of short actions (<2 s) produces a 3D video sequence of 2716 encoded frames, which is equivalent to a compression ratio of 3:1 and a saving space of 66 %. Another skimming scheme consisted of progressively removing the biggest cycles (in size S(L )). It returned a sequence of 1439 encoded frames (the ratio is 5:1 and saving space is 80 %). ©2012 IEEE [34]
navigate into long sequences. As presented in Sect. 8.5.1, our 3D video encoding process consists of two steps. First, redundancies in long poses, slow motions and transition states are located in the data stream using behavior unit modeling and compactly encoded. On the Yoga dataset, cluster encoding using the data structure presented in Sect. 8.4.4 returns a compression ratio of nearly 2:1, i.e. a space saving of 50 % (cf. Fig. 8.17). The 7500 frame sequence has been reduced to 3660 encoded frames as 1749 clusters were obtained: 44 long or repeated poses (>1 s), 115 short actions (40 ms to 1 s) and 1590 transition states (<40 ms). Second, using the motion graph structure, 656 cycle junction clusters and 9095 cycle combinations have been identified. Figure 8.18 shows similar frames identified in the Yoga sequence: frames #1783, #4179, #6987 and #7000 belong to the same cluster. Hence each of these frames corresponds to a repeated (atomic) action and to a cycle junction cluster. Thus, frames contained in cycles can be removed for 3D video data skimming (and summarization) using the motion graph. The skimming of short actions (<2 s) produces a 3D video sequence of 2716 encoded frames, which is equivalent to a compression ratio of 3:1 and a space saving of 66 %. Content-based summarization can then be performed progressively while keeping relevant information. Another possible skimming scheme consists of successively skipping the biggest cycles (by size S(L)). It returns a sequence of 1439 encoded frames (a 5:1 ratio and an 80 % space saving).
288
8
Behavior Unit Model for Content-Based Representation and Edition
Fig. 8.18 Behavior unit-based 3D video edition of the Yoga sequence. The topology dictionary models the 3D video data stream using behavior units. Frames #1783, #4179, #6987 and #7000 belong to the same cluster (or behavior unit). Hence (1) they can be encoded using a unique data structure (cf. Sect. 8.4.4), and (2) they belong to cycles of the Markov motion graph in the topology dictionary model (cf. Sect. 8.2.2). The sequence can then be skimmed by removing cycles, namely intermediate frames that lie between repeated behavior units. ©2012 IEEE [34]
Note that other strategies can be considered, such as maximizing content unicity by removing repeated instances of the same actions and shortening long actions. Figure 8.19 presents progressive summarization of the Yoga sequence. For presentation clarity, the experimental result is shown with 1000 frames of the Yoga sequence rather than its full 7500 frames. The 1000 frame Yoga sequence has 80 cycle junction clusters and 486 cycle combinations. Figure 8.20 presents progressive summarization of the MAIKO sequence, which has 17 cycle junction clusters and 257 cycle combinations.
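The greedy core of cycle-based skimming can be sketched as follows. The function below assumes a hypothetical per-frame cluster-id list (the output of topology clustering); whenever a cluster reappears within `max_cycle_len` retained frames, the frames retained in between traced a cycle of the motion graph, i.e. a repeated action, and are dropped. The actual system selects cycles by their weight W(L) and size S(L) rather than by a fixed window, so this is only an illustration.

```python
def skim_by_cycles(cluster_ids, max_cycle_len):
    """Greedy cycle skimming over a per-frame cluster-id sequence.

    When the cluster of frame i already occurs among the recently retained
    frames, frame i re-enters a cycle junction cluster: the frames retained
    after that earlier occurrence traced a cycle (a repeated action), so
    they are dropped and frame i itself is skipped.
    """
    kept = []  # indices of retained frames
    for i, c in enumerate(cluster_ids):
        window = kept[-max_cycle_len:]
        hit = None
        for back, k in enumerate(reversed(window)):
            if cluster_ids[k] == c:
                hit = back
                break
        if hit is None:
            kept.append(i)  # new pose: keep the frame
        else:
            del kept[len(kept) - hit:]  # drop the frames inside the cycle
    return kept
```

For instance, `skim_by_cycles(['A', 'B', 'C', 'B', 'D'], 3)` keeps frames `[0, 1, 4]`: the B, C, B cycle collapses onto its junction frame.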
8.6.3 Semantic Description

Semantic description of 3D video sequences is obtained by labeling topology clusters as described in Sect. 8.5.2. The system then returns a label once a class is identified. Action recognition is achieved by populating training datasets with selected models from various sources, such as models downloaded from the Internet as shown in Fig. 8.12, or designed with CG software as shown in Fig. 8.21. Figures 8.22, 8.23 and 8.24 illustrate 3D video data with content-based description using labeled clusters and an annotated training dataset.

Fig. 8.19 Progressive summarization of the Yoga sequence (1000 frames). Left: Values of W(L) (Eq. (8.21)) used for cycle thresholding, and ratio of summarized sequence size with respect to original sequence size. ©2012 IEEE [34]

Fig. 8.20 Progressive summarization of the MAIKO sequence (201 frames). Left: Values of W(L) (Eq. (8.21)) used for cycle thresholding, and ratio of summarized sequence size with respect to original sequence size. ©2012 IEEE [34]
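As a sketch of how labeled clusters yield annotations, the fragment below assigns each frame to its nearest cluster in topology-descriptor space and looks up that cluster's label. The `dist` argument abstracts the aMRG similarity measure; all names and the toy scalar descriptors are hypothetical.

```python
def assign_cluster(descriptor, cluster_descriptors, dist):
    # Nearest labeled cluster in topology-descriptor space
    return min(cluster_descriptors,
               key=lambda c: dist(descriptor, cluster_descriptors[c]))

def annotate_stream(frame_descriptors, cluster_descriptors, labels, dist):
    # One semantic label per frame; clusters without a label stay "unlabeled"
    return [labels.get(assign_cluster(d, cluster_descriptors, dist), "unlabeled")
            for d in frame_descriptors]
```

With 1-D toy descriptors, `annotate_stream([0.1, 0.9], {'c1': 0.0, 'c2': 1.0}, {'c1': 'warrior pose', 'c2': 'tree pose'}, lambda a, b: abs(a - b))` yields `['warrior pose', 'tree pose']`.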
8.7 Conclusion

This chapter presents the topology dictionary, a novel approach that achieves behavior unit-based editing of 3D video data for applications such as content-based encoding, summarization and semantic description. The topology dictionary has been proposed as an abstraction to represent data streams of geometrical objects. In particular, when the data streams become very complicated, geometry features quickly show their limitations. It is then a challenge to manipulate the data stream, and especially to look for specific or relevant information. In some cases, such as with 3D video data, every geometrical object composing the stream is obtained independently. Hence, no consistency exists between the geometrical structures of the objects, and the state of the art does not provide any efficient technique to handle the data stream. However, the topology of the data structure can be used as a stable feature to describe such data. As one can observe, using a topology descriptor, stable topology features can be preserved even though the geometrical representation is corrupted by noise and deformation. Taking advantage of this property, the topology dictionary has been developed as a combination of two ideas:
Fig. 8.21 Training dataset population. Arbitrary 3D models can be labeled and indexed in the topology dictionary for semantic description. Here 3D models from the Yoga sequence were deformed using CG software to mimic poses from other sequences. ©2012 IEEE [34]
Fig. 8.22 Semantic description of the Capoeira sequence. Behavior units in the training datasets are annotated with labels beforehand. Afterwards, as behavior units are identified in a test 3D video stream, the corresponding annotations are automatically displayed. Note that the frames #89 and #103 belong to the same clusters (behavior units) as the frames #155 and #135, respectively. ©2012 IEEE [34]
(1) a dictionary-based encoding strategy identifies relevant patterns in the data stream, and (2) a probabilistic graph models the structure of the stream, allowing content-based data manipulation.
Fig. 8.23 Semantic description of the Yoga sequence. Cluster labeling allows us to achieve content-based semantic description of a 3D video data stream. ©2012 IEEE [34]
Fig. 8.24 3D video semantic description. Training datasets populated with models from various sources allow us to extend the action recognition scope. ©2012 IEEE [34]
In this chapter, we show that this abstraction can be applied to 3D video data streams. Our implementation involves the use of the augmented Multi-resolution Reeb Graph (aMRG) as a robust topology descriptor for 3D shape matching and dataset clustering, and a Markov motion graph to model transitions between clusters. We present content-based encoding, summarization, and semantic description of various 3D video sequences. We believe that the topology dictionary opens many perspectives for future research and applications on 3D video. As the reader may have noticed, several data structures were employed in this study, which makes the overall system rich and complex. For further studies, we should provide an interactive tool to pick behavior units (or clusters) using a visual representation of the motion graph; the best path on the motion graph between the selected clusters would then return new sequences. As well, complex scenes containing animals or interacting people should be tackled, and the coding scheme should be further improved to handle complex texture maps such as the harmonized texture mapping introduced in Chap. 5.
References

1. Arikan, O., Forsyth, D.A.: Interactive motion generation from examples. ACM Trans. Graph. 21(3), 483–490 (2002)
2. Sharf, A., Lewiner, T., Shamir, A., Kobbelt, L.: On-the-fly curve-skeleton computation for 3D shapes. Comput. Graph. Forum 26(3), 323–328 (2007)
3. Baran, I., Popovic, J.: Automatic rigging and animation of 3D characters. ACM Trans. Graph. 26(3), 27 (2007)
4. Cornea, N., Silver, D., Yuan, X., Balasubramanian, R.: Computing hierarchical curve-skeletons of 3D objects. Vis. Comput. 21(11), 945–955 (2005)
5. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1, 269–271 (1959)
6. Fulkerson, B., Vedaldi, A., Soatto, S.: Localizing objects with smart dictionaries. In: Proc. of European Conference on Computer Vision, vol. 1, pp. 179–192 (2008)
7. Gray, R.M., Gersho, A.: Vector Quantization and Signal Compression. Kluwer Academic, Norwell (1992)
8. Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology matching for fully automatic similarity estimation of 3D shapes. In: Proc. of ACM SIGGRAPH, pp. 203–212 (2001)
9. Huang, P., Hilton, A., Starck, J.: Shape similarity for 3D video sequences of people. Int. J. Comput. Vis. 89(2–3), 362–381 (2010)
10. Huang, P., Tung, T., Nobuhara, S., Hilton, A., Matsuyama, T.: Comparison of skeleton and non-skeleton shape descriptors for 3D video. In: Proc. of International Symposium on 3D Data Processing, Visualization and Transmission (2010)
11. James, D.L., Twigg, C.D.: Skinning mesh animations. ACM Trans. Graph. 24(3) (2005)
12. Carranza, J., Theobalt, C., Magnor, M., Seidel, H.-P.: Free-viewpoint video of human actors. ACM Trans. Graph. 22(3), 569–577 (2003)
13. Lee, J., Chai, J., Reitsma, P.S.A., Hodgins, J.K., Pollard, N.S.: Interactive control of avatars animated with human motion data. ACM Trans. Graph. 21(3), 491–500 (2002)
14. Kho, Y., Garland, M.: Sketching mesh deformations. ACM Trans. Graph. 24(3), 934 (2005)
15. Koenderink, J.: Solid Shape. MIT Press, Cambridge (1990)
16. Kovar, L., Gleicher, M., Pighin, F.H.: Motion graphs. ACM Trans. Graph. 21(3), 473–482 (2002)
17. Palagyi, K., Kuba, A.: A parallel 3D 12-subiteration thinning algorithm. Graph. Models Image Process. 61(4), 199–221 (1999)
18. Molina-Tanco, L., Hilton, A.: Realistic synthesis of novel human movements from a database of motion capture examples. In: IEEE Workshop on Human Motion (2000)
19. Mizuguchi, T., Buchanan, J., Calvert, T.: Data driven motion transitions for interactive games. In: Eurographics Short Presentations (2001)
20. Morse, M.: The Calculus of Variations in the Large. American Mathematical Society Colloquium Publications, vol. 18. AMS, New York (1934)
21. Ngo, C.-W., Ma, Y.-F., Zhang, H.-J.: Video summarization and scene detection by graph modeling. IEEE Trans. Circuits Syst. Video Technol. 15(2), 296–305 (2005)
22. Paquet, E., Rioux, M.: A content-based search engine for VRML databases. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 541–546 (1998)
23. Park, S.I., Hodgins, J.K.: Capturing and animating skin deformation in human motion. ACM Trans. Graph. 25(3), 881–889 (2006)
24. Pascucci, V., Scorzelli, G., Bremer, P.-T., Mascarenhas, A.: Robust on-line computation of Reeb graphs: Simplicity and speed. ACM Trans. Graph. 26(3), 58 (2007)
25. Pearson, K.: On lines and planes of closest fit to systems of points in space. Philos. Mag. 2(6), 559–572 (1901)
26. Reeb, G.: On the singular points of a completely integrable Pfaff form or of a numerical function. C. R. Acad. Sci. Paris 222, 847–849 (1946)
27. Samet, H.: Foundations of Multidimensional Metric Data Structures. Morgan Kaufmann, San Mateo (2006)
28. Schödl, A., Szeliski, R., Salesin, D., Essa, I.: Video textures. In: Proc. of ACM SIGGRAPH, pp. 489–498 (2000)
29. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2008)
30. Sorkine, O., Alexa, M.: As-rigid-as-possible surface modeling. In: Proc. 5th Eurographics Symposium on Geometry Processing, pp. 109–116 (2007)
31. Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Comput. Graph. Appl. (2007)
32. Tung, T., Matsuyama, T.: Topology dictionary with Markov model for 3D video content-based skimming and description. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2009)
33. Tung, T.: An implementation of the augmented multiresolution Reeb graphs (aMRG) for shape similarity computation of 3D models. http://tonytung.org/
34. Tung, T., Matsuyama, T.: Topology dictionary for 3D video understanding. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
35. Tung, T., Schmitt, F.: The augmented multiresolution Reeb graph approach for content-based retrieval of 3D shapes. Int. J. Shape Model. 11(1), 91–120 (2005)
36. Tung, T., Schmitt, F., Matsuyama, T.: Topology matching for 3D video compression. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2007)
37. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary. In: Proc. of International Conference on Computer Vision, vol. 2, pp. 1800–1807 (2005)
38. Yeung, M., Yeo, B.L.: Segmentation of video by clustering and graph analysis. Comput. Vis. Image Underst. 71(1), 94–109 (1998)
39. Zaharescu, A., Boyer, E., Horaud, R.: Topology-adaptive mesh deformation for surface evolution, morphing, and multi-view reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 823–837 (2011)
40. Zaharia, T., Prêteux, F.: Indexation de maillages 3D par descripteurs de forme. In: Proc. Reconnaissance des Formes et Intelligence Artificielle (RFIA), pp. 48–57 (2002)
41. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Chapter 9
Model-Based Complex Kinematic Motion Estimation
9.1 Introduction

It is a challenging problem to obtain a semantic description, i.e. a semantic interpretation, of a 3D video stream, and especially a semantic interpretation of object actions and behaviors. For example, dense inter-frame object motion patterns computed by the algorithm presented in Sect. 4.4.2 are just a physical motion description and no semantics is involved. In Chap. 7, on the other hand, a semantic analysis of a 3D video stream was conducted to recognize a 3D human face and estimate its 3D gaze motion. The previous chapter introduced the topology dictionary to encode time-varying global surface characteristics of an object in motion, and described a 3D video stream in terms of behavior units. In general, appropriate knowledge should be given a priori to obtain a semantic description of physical data: Chap. 7 employed 3D face and eyeball models for human face recognition and gaze detection. The topology dictionary approach, on the other hand, first learns the behavior unit model from training 3D video stream(s) by clustering, and then applies the model to a new 3D video stream to obtain its behavior unit description. This chapter presents a method of computing a kinematic motion description from a 3D video stream of a complex human action such as Yoga, assuming that a kinematic model of a human is given a priori. Here the kinematic model is defined as a skeleton structure consisting of bones and joints, and a kinematic description includes such characterizations as how much a joint angle between a pair of connected bones changes over time. While the applicability of such a model-based method is limited to a specific class of objects, it can interpret 3D video streams of human actions (and animal behaviors) in terms of their kinematics. Then, object actions can be edited to produce a new 3D video stream integrating multiple 3D video streams produced independently: e.g.
a pair of independently produced 3D video streams of Kung-fu actions can be synchronized spatially and temporally to produce multi-party action scenes. Kinematic motion estimation is often referred to as motion capture, particularly in computer graphics and movie production [13]. Motion capture (or mocap for short) includes well-studied technologies to generate a kinematic motion description of
Fig. 9.1 3D skin-and-bones model. (a) 3D shape reconstructed from multi-view images. (b) 3D bone structure embedded into the 3D shape in (a). (c) Skin areas associated with the bones. Colors on the surface areas indicate their corresponding bone IDs. Areas with blended colors indicate that vertices in such areas are controlled by multiple nearby bones. ©2009 IEICE [17]
an object in motion by tracking markers attached to the object surface. While the capture system itself only detects and tracks marker motions, program libraries are available to produce a kinematic description of the object motion if the markers are attached to carefully chosen positions such as bony landmarks. One major limitation of motion capture is that markers must be attached to appropriate positions on the object surface, considering its kinematic structure as well as their observability during object actions.
As will be surveyed in the next section, a variety of computer-vision-based motion capture methods have been developed [15] which realize markerless, non-contact, and non-constrained motion capture. The object models they employ vary depending on the types of input data to be analyzed, as well as the types of output description to be produced. The goal of the algorithm presented in this chapter is to produce a kinematic motion description from a 3D video stream of a human in action, which leads us to the incorporation of a skin-and-bones model of a human [10]. The model consists of a closed 3D surface and an embedded 3D skeleton structure (see Fig. 9.1). Each bone of the skeleton is associated with a certain surface area, and surface deformations can be controlled by bone motions. As will be shown later in this chapter, this harmonized model of skeleton and surface facilitates the model matching for 3D video data. We first define manually a skin-and-bones model using 3D mesh data taken from a 3D video stream, and then apply the model fitting to each 3D video frame to compute the bone motion over time, i.e. the kinematic description.
Note that neither motion capture nor computer-vision-based model matching technologies can capture motions of humans in loose clothes such as dancing MAIKOs; as the clothes swing, markers move freely away from the body surface, and the model matching fails to estimate bone positions due to the large volumes and surfaces of the clothes. Thus we assume here that a 3D video stream is produced for a human wearing tight or less loose clothes. It should be noted that the kinematic motion description of traditional dances is very useful to archive and preserve intangible cultural assets facing extinction, which would be a good future research problem.
The essential problem in obtaining a kinematic motion description from a 3D video stream rests in how we can measure the degree of matching between the kinematic model and a 3D video frame; recall that a sequence of 3D mesh data is reconstructed frame by frame and no explicit correspondences between a pair of consecutive 3D meshes are established, due to the limited shape reconstruction capability of the inter-frame deformation algorithm in Sect. 4.4.2. If both the model and the 3D video frames did not include any errors, that is, if they were exactly equal to the actual object shape, then we could measure the degree of matching simply by some geometric distance measure like the Hausdorff distance between the kinematic model and a 3D video frame. In practice, however, this simple distance measurement does not work in the model fitting to a 3D video stream for the following reasons.
• 3D shapes reconstructed from multi-view videos include errors due to limited accuracies in both the camera calibration and shape reconstruction.
• Complex human actions often introduce self-occlusions and surface collisions, which introduce surface areas that are unobservable or invisible from a limited number of cameras.
The model matching algorithm presented in this chapter copes with these problems by first evaluating reliability measures for local surface areas and then conducting the matching to find the most reliable match between the model and a 3D video frame.
The rest of this chapter is organized as follows. Section 9.2 reviews related works on model-based object motion estimation. Following the categorization of model representation methods, we discuss technical problems to be solved in the model matching for a 3D video stream. Section 9.3 defines the reliability measures which evaluate both the reconstruction errors and the surface visibility.
Section 9.4 introduces a model-based kinematic motion estimation algorithm using the reliability measure. Then Sect. 9.5 demonstrates qualitative and quantitative performance evaluations, and Sect. 9.6 concludes this chapter.
9.2 Skin-and-Bones Model for Kinematic Motion Estimation from 3D Video

The object motion estimation from observed image data has been a major problem in computer vision, and various types of estimation methods have been developed [10, 15]: some compute physical motion patterns, and others kinematic motion descriptions. We identify 16 possible categories based on the following three axes (Table 9.1 lists major categories with references).
Types of input data: Image data to be analyzed are 2D or 3D.
Features for matching: Points, boundaries/surfaces, regions/volumes, or skeletons.
Matching algorithms: Model-based or example-based.
Table 9.1 Categorization of kinematic motion estimation methods

  References             Input  Feature     Matching
  [3, 11]                2D     Contour     Example
  [24]                   2D     Region      Example
  [1, 2, 4, 12, 22, 25]  2D     Silhouette  Example
  [6, 21]                2D     Contour     Model
  [7, 9]                 2D     Silhouette  Model
  [5]                    2D     Region      Model
  [19, 28]               3D     Volume      Example
  [14]                   3D     Skeleton    Model
  [16, 18, 26]           3D     Surface     Model
  [27]                   3D     Volume      Model
Note here that the model-based approach can be sub-categorized into (a) manual model design and (b) model learning from training data. While the model-based approach optimizes parameters included in a generic, i.e. parameterized, object model to find the best match with input data, the example-based approach searches an example database for the example that best matches the input data and obtains object motion parameters from the information associated with that best matching example. Note also that some categories include several computational algorithms producing different types of output data. Here, considering the characteristics of 3D video data as well as the complexity of possible human actions, it is reasonable to adopt the 3D model-based approach, because the example-based approach would require a huge example database storing a wide variety of complex human poses with kinematic annotations. Then, the problem consists of deciding which matching features are appropriate for the model matching: that is, whether a surface-based or a volume-based feature is appropriate, since both carry richer information than points, and the object skeleton is exactly what we want to produce. Here we employ a surface-based matching method, more specifically, the skin-and-bones model [10, 18], and estimate the optimal bone parameters which deform the skin surface to match the 3D video surfaces. This design is based on the following observations.
• The reconstructed 3D shape includes errors due to the limited accuracies of the camera calibration and shape reconstruction, which can mislead the matching.
• The original data, i.e. multi-view video data, carry rich surface information, which facilitates the surface-based matching even if the reconstructed object surface includes errors.
• While textured 3D mesh data are available as input data, the original multi-view video data should be used as input data for the model matching.
This is because the texture generation process can introduce artifacts into surface textures. Thus we process a sequence of 3D meshes associated with multi-view video data to produce a kinematic motion description of an object in motion.
• While the skin-muscle-bones model [20] can estimate a detailed kinematic motion description in terms of muscle motions, the resolution and accuracy of 3D video data recording entire human body actions are not high enough to find a match with such a complex model with so many parameters.
The skin-and-bones model consists of a closed 3D surface and an embedded 3D skeleton structure. Each bone of the skeleton is associated with a certain area of the surface, which allows us to deform the surface by controlling bone positions. Figure 9.1 illustrates our skin-and-bones model. Given a 3D video stream to be analyzed, the following process is first conducted to design the skin-and-bones model for that stream:
1. An appropriate frame t0 is selected from the input 3D video stream by hand, and the 3D mesh Mt0 at t0 is regarded as the model surface (Fig. 9.1(a)). Here t0 is chosen so that the surface Mt0 is reasonably close to the real object surface without any local surface collisions.
2. Then a skeleton model is manually embedded into Mt0 (Fig. 9.1(b)).
3. Finally, bone-to-surface correspondences (Fig. 9.1(c)) are established, assuming that the 3D position of each vertex on the surface mesh is controlled by nearby bones depending on their distance values [8].
With these bone-to-surface correspondences, the object surface can be deformed in accordance with skeleton postures. That is, given the positions of the bones after motion, do the following. (1) For each bone that controls a vertex, compute a new position of the vertex by applying the geometric transformation describing the motion of that bone. (2) Since multiple bones control a vertex, this yields a set of new positions for each vertex. (3) Compute the weighted average of these new positions, to which the vertex is moved. Here the weights are determined based on the distances from the vertex to its controlling bones.
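Steps (1) to (3) above amount to linear blend skinning. The sketch below, with hypothetical data structures, applies each controlling bone's rigid transformation to the vertex and blends the results with the distance-based weights (assumed normalized to sum to 1):

```python
def deform_vertex(p, bone_weights, transforms):
    """Move one vertex p by linear blend skinning.

    bone_weights: {bone_id: weight} for the bones controlling this vertex,
                  with distance-based weights assumed to sum to 1.
    transforms:   {bone_id: rigid transform, given as a function p -> p'}.
    """
    x = y = z = 0.0
    for bone, w in bone_weights.items():
        tx, ty, tz = transforms[bone](p)  # (1) per-bone transformed position
        x += w * tx                       # (2)-(3) weighted average
        y += w * ty
        z += w * tz
    return (x, y, z)
```

For example, a vertex controlled half-and-half by two bones translating it by +1 and +3 along x ends up translated by +2.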
The skeleton posture is defined by a set of parameters, say posture parameters, consisting of its global position and the joint angles between connected bones. Let the vector p denote a set of posture parameters, M(p) the surface deformed by p, and Mt0 = M(pt0) the initial model surface, where pt0 is the posture parameter for t0 (Fig. 9.1(c)). Note that while 3D mesh structures in a 3D video stream vary over time, Mt0 is used for the matching process with every 3D video frame, and hence dense vertex-based motion flow patterns can be obtained as well as the kinematic motion description. With the designed object model, the input 3D video stream is analyzed frame by frame starting from t0 backward and forward. In what follows, we assume that t0 denotes the initial 3D video frame for simplicity. The most essential problem in the model-based approach is how to measure the degree of matching between the model and a 3D video frame. In analyzing 3D video data, especially, the matching process should cope with the following problems:
Self-occlusion and surface collision: An object like a human often takes very complex postures, especially when performing dances and Yoga, which introduce significant self-occlusions and surface collisions. With a limited number of cameras,
moreover, many occluded and/or collided parts of the object surface cannot be observed from some of the cameras. Thus their 3D shape reconstruction is impossible or very unreliable.
Reconstruction error: As discussed at the beginning of Part II and in Chap. 4, the accuracy of 3D shape reconstruction from multi-view images is limited due to many factors, including calibration errors, non-Lambertian surface reflections, aperture problems, invalid visibility approximations, phantom volumes inherited from the visual hull, regularized optimizations, and so on. Note that 3D shape reconstruction methods starting from the visual hull cannot improve the original visual hull geometry, i.e. the roughly estimated shape, at parts which less than two cameras can observe.
Our ideas to cope with these problems are as follows:
• Each reconstructed 3D shape includes (1) well-reconstructed reliable surface areas, (2) poorly reconstructed unreliable surface areas, and (3) invisible surface areas; surface areas unobservable from the cameras may still be reconstructed due to the surface connectivity constraint employed in the optimization-based 3D shape reconstruction methods (see Sect. 4.4). Thus, we first evaluate reliability values of local surface areas and then design an objective function based on the reliability values, which is used in the optimization process to find the best match.
• The 3D shape reliability of each reconstructed local surface area can be evaluated based on the photo-consistency among its observed multi-view images.
• While occluded or collided 3D local surface areas can be estimated from deformed input 3D mesh data, they are given low reliability values.
In short, given a skin-and-bones model and multi-view video data, our model matching process evaluates the reliability value distribution over the 3D mesh data based on the photo-consistency and the surface visibility, and conducts the optimization with respect to highly reliable surface matches while neglecting less reliable surface areas. The next section addresses the reliability evaluation methods based on the photo-consistency and the surface visibility.
9.3 Reliability Evaluation Methods

Based on the discussions so far, we define two reliability measures in this section, based on the surface visibility and the photo-consistency, respectively.
9.3.1 Reliability Measure Based on the Surface Visibility

During complex actions, object body parts often come into contact with each other, as illustrated in Fig. 9.2, where the legs are tightly closed and the right-side hand is attached to the body.

Fig. 9.2 Invisible 3D surface areas by body contacts. ©2009 IEICE [17]

Since some surface areas of such contacting body parts collide with each other and are definitely unobservable from any camera, their 3D shape cannot be reconstructed in a 3D video frame, say in Mt. On the other hand, the object model Mt0 is selected so that the entire object surface is observable: e.g. the hands and legs in Mt0 are well opened away from the body. Consequently, some surface areas of Mt0 are not included in Mt, and the matching process should take this into account. That is, the standard ICP algorithm [23], which conducts the nearest-neighbor search from Mt0 to Mt, cannot work well; surface areas of Mt0 that are missing in Mt will mislead the matching process.
To solve this problem, we explicitly model possible surface collisions and assign low reliability values to collided surface areas. Let Mt denote a 3D mesh to be analyzed and M(p) a hypothesized object kinematic model for Mt. As will be described later, since the model matching is conducted from Mt0 = M(pt0) sequentially along a 3D video stream, we can assume that M(p) approximates Mt well. Then we compute surface collisions and reliability values in M(p) to match it to Mt as follows. Let v denote a vertex of M(p) controlled by bone bv, and v′ the vertex of M(p) closest to v that is controlled by a different bone bv′ ≠ bv. Then we define the signed distance d(v) from v to v′ by

d(v) = ‖v′ − v‖ if (v′ − v) · nv ≥ 0, and d(v) = −‖v′ − v‖ otherwise,   (9.1)

where nv denotes the surface normal direction of M(p) at v. This signed distance d(v) takes a negative value if v′ is located in the interior of the body part to which v belongs. With d(v), we define the reliability measure based on the surface visibility, ψc(v), as follows:

ψc(v) = 1 / (1 + exp(−αc (d(v) − τc))).   (9.2)
The value of ψc(v) ranges from 0 to 1. If d(v) is large, that is, if v is not close to other body parts, then ψc(v) gets close to 1. On the other hand, if d(v) is small or negative, that is, if v is close to or intersects with another body part, then ψc(v) approaches 0. αc and τc are control parameters for the mapping from d(v) to ψc(v), determined experimentally.
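A direct transcription of Eqs. (9.1) and (9.2) can look as follows; the values chosen for the control parameters αc and τc are purely illustrative, since the text only says they are set experimentally.

```python
import math

def signed_distance(v, v_prime, n_v):
    """Eq. (9.1): distance from v to the closest vertex v' of another body
    part, negated when v' lies on the interior side of the surface normal
    n_v at v (i.e. the parts interpenetrate)."""
    d = math.dist(v, v_prime)
    dot = sum((a - b) * n for a, b, n in zip(v_prime, v, n_v))
    return d if dot >= 0 else -d

def psi_c(d, alpha_c=50.0, tau_c=0.02):
    """Eq. (9.2): logistic visibility reliability in (0, 1); near 1 for
    vertices far from other body parts, near 0 close to a collision."""
    return 1.0 / (1.0 + math.exp(-alpha_c * (d - tau_c)))
```

With these illustrative parameters, a vertex 1 m away from any other body part gets a reliability close to 1, while an interpenetrating vertex (negative d) gets a reliability close to 0.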
Fig. 9.3 Phantom volume and spatial distribution of the photo-consistency measure. The volumes specified by the yellow dotted lines are phantom volumes. In the middle figure, the dark blue surface areas denote highly photo-consistent areas, and the light blue, green, and yellow surface areas have less photo-consistent measures. Gray areas illustrate surface areas observable by less than two cameras. ©2009 IEICE [17]
Ideally, the visibility measure ψc(v) should also take into account the spatial arrangement of the distributed cameras. That is, if the cameras for multi-view video capture are non-uniformly distributed in the scene, then the areas visible from the cameras become anisotropic, and hence ψc(v) should be designed taking into account the camera positions, directions, and zooming factors. In practice, however, since we assume that the cameras for 3D video production are placed uniformly, we define ψc(v) based only on the mutual surface proximity.
9.3.2 Reliability Measure Based on the Photo-Consistency Even if body parts are placed far away from the others, some of their surface areas may not be well captured in multi-view images due to self-occlusions. In addition, some surface areas may not be well photo-consistent due to calibration errors, nonLambertian surface reflections, invalid visibility approximations, etc. as discussed in Chap. 4. Moreover, phantom volumes may be introduced by the silhouette-based 3D shape reconstruction with a limited number of cameras. Figure 9.3, for example, shows a reconstructed 3D surface including these limitations, errors, and artifacts. The object shape specification and image capture environments will be described later in Sect. 9.5.1. The surface is colored based on the photo-consistency measure: dark blue areas are highly photo-consistent while the others, light blue, green, and yellow, are less or not photo-consistent. Gray areas such as phantom volume surface areas can be observed from less than two cameras. Based on this observation, the photo-consistency can be used to evaluate the reliability of reconstructed 3D surface areas. We define the reliability measure ψp (u) for each vertex u of Mt based on its photo-consistency. Let ρ(u) denote the Zeromean Normalized-Cross-Correlation photo-consistency defined in Sect. 4.4.1.1, which ranges from −1 to 1 (larger is better). When u is observed from less than
Fig. 9.4 Bipartite graph matching between the kinematic model M(p) and a 3D video frame Mt . (a) Straightforward vertex matching from M(p) to Mt . (b) Vertex matching neglecting unreliable ones. ©2009 IEICE [17]
two cameras, ρ(u) is set to −1. Then we define ψp(u) as follows:

    ψp(u) = 1 / (1 + exp(−αp (ρ(u) − τp))).    (9.3)
This function takes a value within [0 : 1]. For higher ρ(u), that is, if u is highly photo-consistent, ψp (u) gets closer to 1. When u is not photo-consistent or not well observable, ρ(u) becomes smaller and ψp (u) becomes close to 0. αp and τp are control parameters for the mapping from ρ(u) to ψp (u), and are determined heuristically.
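The mapping of Eq. (9.3) can be sketched as follows. The values of αp and τp below are illustrative placeholders (the chapter determines them heuristically), and the function name is ours:

```python
import numpy as np

def photo_consistency_reliability(rho, n_cameras, alpha_p=5.0, tau_p=0.5):
    """Eq. (9.3): map ZNCC photo-consistency rho in [-1, 1] to a
    reliability weight psi_p in (0, 1). Vertices observed by fewer than
    two cameras get rho forced to -1, hence a weight close to 0."""
    rho = np.where(np.asarray(n_cameras) < 2, -1.0, np.asarray(rho, float))
    return 1.0 / (1.0 + np.exp(-alpha_p * (rho - tau_p)))
```

With αp = 5 and τp = 0.5, for instance, a perfectly photo-consistent vertex (ρ = 1) maps to about 0.92, a vertex at the threshold (ρ = τp) maps to exactly 0.5, and an unobservable vertex (ρ = −1) maps to about 0.0006, effectively excluding it from the matching.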
9.4 Kinematic Motion Estimation Algorithm Using the Reliability Measures

With the two reliability measures, ψc(v) for vertices of M(p) and ψp(u) for vertices of Mt, the problem of finding the posture parameter p such that M(p) best matches Mt can be formalized as finding the optimal bipartite graph matching between the vertex sets of M(p) and Mt, as illustrated in Fig. 9.4. If we did not consider the vertex reliability, the matching process would search for the best matching vertex in Mt for each vertex of M(p) (Fig. 9.4(a)). With the vertex reliability measures defined above, on the other hand, the matching process establishes correspondences only between reliable vertices in M(p) and Mt, neglecting unreliable ones (Fig. 9.4(b)).
Thus, the kinematic motion estimation algorithm using the reliability measures can be formalized as the minimization of the objective function E(M(p), Mt), which is the weighted sum of the distances from each v ∈ M(p) to its
closest vertex uv ∈ Mt and the distances from each u ∈ Mt to its closest vertex vu ∈ M(p):

    E(M(p), Mt) = Σ_{v ∈ M(p)} (ψc(v)/Rc)(ψp(uv)/Rp) ‖v − uv‖²
                + Σ_{u ∈ Mt} (ψc(vu)/Rc)(ψp(u)/Rp) ‖u − vu‖²,    (9.4)
where ‖v − uv‖² and ‖u − vu‖² denote the squared Euclidean distances from v to uv and from u to vu, respectively, and Rc = Σ_{v ∈ M(p)} ψc(v) and Rp = Σ_{u ∈ Mt} ψp(u) denote the normalization factors for ψc(v) and ψp(u), respectively. We use the Levenberg–Marquardt algorithm to find the posture parameter pt that minimizes E(M(p), Mt), using pt−1 as the initial value.
In summary, the overall algorithm to compute the kinematic motion description from a 3D video stream, i.e. a sequence of posture parameters from ts to te, is defined as follows:
Step 1. Select t0 and build M(p) from Mt0 manually.
Step 2. Partition the sequence between ts and te at t0 into a pair of past and future intervals. Apply the following forward sequential matching process to the future interval.
Step 3. If t0 ≠ te, set t = t0 + 1. Otherwise go to Step 6.
Step 4. Find pt which minimizes Eq. (9.4) using pt−1 as the initial value for the Levenberg–Marquardt algorithm.
Step 5. If t = te, go to Step 6. Otherwise, go to Step 4 with t = t + 1.
Step 6. Apply the same matching process backward to the past interval. Then the kinematic motion description is obtained as the sequence of posture parameters pts, . . . , pte.
Step 7. If required, compute the motion trajectory of each vertex of M(pt0) by tracking the vertex positions in M(pts), . . . , M(pte).
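The bidirectional weighted energy of Eq. (9.4) and one step of the sequential matching can be sketched as below. This is a deliberately simplified stand-in: the posture space is reduced to a global translation instead of the articulated kinematic parameters p, nearest-neighbor search replaces the full bipartite matching, and a generic simplex solver replaces Levenberg–Marquardt; all function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import cKDTree

def energy(model_v, psi_c, frame_v, psi_p):
    # Eq. (9.4): weighted sum of squared distances from each model vertex
    # to its closest frame vertex and vice versa, with the reliability
    # weights normalized by Rc and Rp.
    Rc, Rp = psi_c.sum(), psi_p.sum()
    _, j = cKDTree(frame_v).query(model_v)   # u_v: closest frame vertex to v
    _, i = cKDTree(model_v).query(frame_v)   # v_u: closest model vertex to u
    d_mf = ((model_v - frame_v[j]) ** 2).sum(axis=1)
    d_fm = ((frame_v - model_v[i]) ** 2).sum(axis=1)
    return ((psi_c / Rc) * (psi_p[j] / Rp) * d_mf).sum() \
         + ((psi_c[i] / Rc) * (psi_p / Rp) * d_fm).sum()

def fit_frame(model_v, psi_c, frame_v, psi_p, p_prev):
    # One step of the sequential matching (Step 4): start from the
    # previous frame's parameters and minimize the energy.
    f = lambda p: energy(model_v + p, psi_c, frame_v, psi_p)
    return minimize(f, p_prev, method="Nelder-Mead").x
```

Because unreliable vertices carry near-zero weights ψc and ψp, they contribute almost nothing to the energy, which is exactly how phantom volumes and collided surface areas are neglected in the matching.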
9.5 Performance Evaluation

9.5.1 Quantitative Performance Evaluation with Synthesized Data

In order to evaluate the performance quantitatively, we conducted experiments using a synthesized articulated object with three arms of 100 cm length each. The multi-view images were captured in the virtual studio shown in Fig. 9.5, where 15 XGA cameras were placed in the same configuration as for the real data presented in the next section. Each arm of the object consists of two bones sharing a common central joint with the other arms. The object surface has rich texture patterns as shown in the picture on the right in Fig. 9.3.
Fig. 9.5 Synthesized object and multi-view image capture environments. ©2009 IEICE [17]
Figure 9.6 shows comparisons of the object posture estimation results by several algorithms, including our proposed one. The columns show, from left to right, (a) the synthesized object (ground truth), (b) the reconstructed 3D shape of 1 cm mesh resolution with photo-consistency measures, (c) posture estimation by ICP, (d) posture estimation by Ogawara et al. [18], (e) optimal posture estimation with ψc alone, (f) optimal posture estimation with ψp alone, and (g) optimal posture estimation by the proposed algorithm, respectively. The colors of Fig. 9.6(b) indicate photo-consistency measures (dark blue: highly photo-consistent; light blue, yellow to red: less photo-consistent). The gray areas are observed from fewer than two cameras. Figure 9.7 shows a close-up image of Fig. 9.6(b) at t6. We can observe that (1) the interior areas of the reconstructed surface are almost invisible from the cameras, and (2) the reconstructed shape has a fourth arm corresponding to a phantom volume produced by the silhouette-based shape reconstruction process. Since this phantom volume is not visible from the cameras, image-based 3D shape reconstruction methods could not remove it.
The red lines in Fig. 9.6(c) to (g) illustrate the estimated bone positions while the red lines in (a) show the ground truth positions. Note that the original 3D shapes in (b) are used also in (c) to (g), instead of the 3D surface shapes deformed by the posture parameters, to visually demonstrate how easily the estimated bones are mislocated as the surface observability decreases. The results by ICP were obtained by setting ψc(v) = ψp(u) = 1 for all vertices. The algorithm by Ogawara et al. extends the ICP matching method so that the closest point is selected taking into account the surface normal similarity as well as the Euclidean distance. Figure 9.8 shows the averaged localization errors of the joints. From these results, we can observe the following.
• When a reconstructed 3D shape includes phantom volumes (from t6 to t8 and from t11 to t13 in Fig. 9.6(b)), the naive ICP (Fig. 9.6(c)), Ogawara et al. [18] (Fig. 9.6(d)), and the optimal posture estimation with ψc alone (Fig. 9.6(e)) cannot estimate the bone positions correctly. This is because these methods cannot neglect phantom volumes in their matching process by definition. Figure 9.9 shows the close-up images of the posture estimation results at t7. Figures 9.9(a) to (f) show the ground truth, ICP, [18], ψc alone, ψp alone, and the proposed algorithm, respectively. The areas indicated by the dotted circles in (b), (c), and (d) demonstrate how the bones are mislocated. On the other hand, the methods using ψp can estimate the bone positions correctly regardless of the phantom volumes.

Fig. 9.6 Results of posture estimation (see text). ©2009 IEICE [17]
Fig. 9.7 Close-up image of Fig. 9.6(b) at t6 . Light blue, green, and yellow areas have low ψp values while dark blue areas have high ψp values. Gray areas are observed from zero or only one camera. The central gray protrusion, which is like a fourth arm, is a phantom volume generated by the silhouette-based shape reconstruction process. ©2009 IEICE [17]
Fig. 9.8 Averaged localization errors of the joints. ©2009 IEICE [17]
• When the arms collide with each other at t9 and t10, the naive ICP (Fig. 9.6(c)), Ogawara et al. [18] (Fig. 9.6(d)), and the method with ψp alone (Fig. 9.6(f)) cannot estimate the bone positions correctly. This is because they cannot manage missing (i.e. collided) surface areas in Mt. Figure 9.10 shows the close-up images of the posture estimation results at t9. The dotted circles in Fig. 9.10(b), (c), and (e) demonstrate how the bones are mislocated.
• Notice that the error curve of (f) in Fig. 9.8 jumps up at t8 due to the surface collision and stays high even though no surface collision is observed at the later stage. This is because the posture optimization was conducted sequentially frame by frame using the previous result as the initial posture for the optimization, which incurs error accumulation at the later stage. A similar error accumulation can be observed for (c).
• As shown in the error curve of (e) in Fig. 9.8, although the performance of the optimal posture estimation with ψc alone is degraded by phantom volumes, it works well even when surfaces collide with each other, as designed.
Fig. 9.9 Results of the posture estimation at t7. (a) Ground truth, (b) ICP, (c) Ogawara [18], (d) ψc alone, (e) ψp alone, (f) the proposed method. Red lines in each figure illustrate the estimated bone posture. Note that (b)–(f) in this figure correspond to (c)–(g) in Fig. 9.6, respectively. ©2009 IEICE [17]
Fig. 9.10 Results of the posture estimation at t9. (a) Ground truth, (b) ICP, (c) Ogawara [18], (d) ψc alone, (e) ψp alone, (f) the proposed method. Red lines in each figure illustrate the estimated bone posture. Note that (b)–(f) in this figure correspond to (c)–(g) in Fig. 9.6, respectively. ©2009 IEICE [17]
• As shown in the error curve of (d) in Fig. 9.8, the improved ICP with outlier elimination based on surface normal evaluation [18] behaves similarly to ICP, curve (c) in Fig. 9.8, in the first half. This is because (1) phantom volumes can have normal directions similar to those of the actual object, and (2) the uniform nearest-neighbor search establishes correspondences even for vertices of M(p) that have no corresponding points in Mt due to surface collisions. Figure 9.11 gives a detailed analysis of what happens when surface areas collide with each other.
• As illustrated by curve (g) in Fig. 9.8, the proposed algorithm using both ψc and ψp estimates the joint positions within 1 cm error over the entire motion period. Considering that the original mesh resolution is 1 cm, the algorithm can produce the kinematic motion description from a 3D video stream with reasonable accuracy.
9.5.2 Qualitative Evaluations with Real Data

Figure 9.12 shows a result of kinematic motion estimation from a 3D video of an object performing complicated Yoga postures over 3000 frames (2 minutes). The 3D video was captured in Studio B described in Fig. 2.4 with 15 XGA cameras at 25 fps. Figures 9.12(a) to (j) show ten representative postures including body surface collisions. For each pose, a pair of multi-view captured images and a pair of
Fig. 9.11 Limitations of the outlier elimination based on the surface normal evaluation. The upper row of (a) and (b) shows the reconstructed 3D shapes at t7 and t9 in Fig. 9.6, respectively. The planes illustrate the cutting planes used to analyze the model matching: the lower row illustrates (a) the cross-sections of Mt7 and M(t7) and (b) the cross-sections of Mt9 and M(t9), where the Mt are shown by the black bold lines and the M(t) by the gray bold dotted lines. The gray bold lines in (a) illustrate the phantom volume in the observed 3D shape. With phantom volumes, as shown in (a), the matching process can establish false correspondences (bold black arrows) between the phantom volume in Mt and M(t), because the former can include surface areas whose normal directions are similar to those of the nearest part of the latter. When surfaces collide with each other, as shown in (b), the matching process establishes false correspondences for the collided surface areas (bold wavy gray arrows in (b)) since the surface normal directions of Mt and M(t) are similar. ©2009 IEICE [17]
multi-view estimated posture images are shown. The postures are illustrated with red skeletons embedded in the reconstructed 3D shape data. We can observe that the proposed kinematic motion estimation algorithm works well even for complex human actions such as Yoga.
Figure 9.13 compares the posture estimation results by the proposed method and the naive ICP. While the posture looks simple, it includes heavy surface collisions over the entire object surface. Note that the performer first took a posture without surface collisions by opening her hands and legs and then gradually closed them to take the illustrated posture. Figures 9.13(a), (b), and (c) show the result of the proposed algorithm rendered from the right, front, and left side, respectively. Figures 9.13(d), (e), and (f) show the result of ICP. The red lines show the estimated bones. The textured 3D video frame data are overlaid to visually evaluate the estimation quality. Figure 9.14 shows close-up images of Fig. 9.13 at her shoulder and hand parts. From these figures we can conclude that the proposed method can estimate the bone postures correctly even for complex postures that ICP cannot manage well.
Fig. 9.12 Kinematic motion estimation of complex Yoga. ©2009 IEICE [17]
Fig. 9.13 Comparison between the proposed method and ICP. (a), (b), and (c) Result by the proposed method viewed from left, front and right side, respectively. (d), (e), and (f) Result by ICP. The red lines illustrate the estimated 3D bone postures. ©2009 IEICE [17]
Fig. 9.14 Close-up images of Fig. 9.13 at her shoulder and hand areas. (a) and (b) By the proposed method, and (c) and (d) by ICP. ©2009 IEICE [17]
9.6 Conclusion

This chapter presents a model-based 3D kinematic motion estimation algorithm from 3D video. Compared to the behavior unit model presented in the previous chapter, which partitions a 3D video stream into a set of behavior unit intervals and describes an entire object action by a probabilistic transition model among the behavior units, the algorithm can produce a kinematic description of human actions from a given 3D video, which allows us to analyze human actions quantitatively and, if required, edit them based on their kinematics.
The key point of the algorithm is to introduce a pair of reliability measures into the kinematic model matching. The first measure evaluates the visibility of the model surface areas in a 3D video frame to be analyzed. Depending on object postures, some surface areas collide with each other and cannot be observed from any camera. Since such unobservable surface areas cannot be reconstructed in the 3D video frame, we should evaluate which surface areas of the model cannot be observed and hence should be neglected in the matching process with the 3D video frame. To model this, we introduced a visibility measure for each surface point of the model.
The second reliability measure evaluates the photo-consistency at each surface point of the 3D video frame. As discussed in Chap. 4, the reconstructed 3D shape can include less photo-consistent surface areas because of calibration errors, non-Lambertian surface reflections, aperture problems, invalid visibility approximations (Sect. 4.3.2), phantom volumes inherited from the visual hull (Sect. 4.3.2.1), and/or regularized optimizations (Sect. 4.3.3). Since less photo-consistent surface areas in the 3D video frame should be discounted in the model matching, we introduced a photo-consistency measure for each surface point of the 3D video frame.
The kinematic model matching is defined as an optimization process which minimizes the distance measure between the model and a 3D video frame, where the distance function is weighted by the visibility and photo-consistency measures. In evaluating the photo-consistency measure, multi-view images rather than textured 3D mesh data are employed because the texture generation process may have introduced artifacts. It should be noticed that the idea of evaluating reliability measures for the reconstructed 3D shape based on observed multi-view images and managing errors and lack of information to attain a given task is shared with the texture generation presented in Chap. 5. Experimental results with a synthesized and a real 3D
video data demonstrated the performance of the proposed method. It can correctly estimate very complex object actions including heavy surface collisions as well as phantom volumes introduced by the limited camera observation.
An important future problem is kinematic structure learning, which estimates the bone-and-joint structure from a 3D video stream. The inter-frame deformation algorithm presented in Sect. 4.4.2 produces dense surface motion patterns, which can then be used to estimate solid object parts. In addition, the Reeb graph described in Sect. 8.3 produces a graph structure representing the global topological structure of the object surface, which can guide the estimation of the bone-and-joint structure. With these technologies, we will be able to develop a kinematic structure learning algorithm from a 3D video stream of a simple object action. A challenging future problem is the kinematic structure and motion estimation for complex object actions such as dancing MAIKOs, where only some object parts can be observed directly while the others must be estimated from the motion patterns of loose clothes. As noted previously, many traditional dances performed wearing cultural heritage decorations are facing extinction, and computer-vision technologies can be used to archive and preserve such intangible cultural assets.
References
1. Agarwal, A., Triggs, B.: 3D human pose from silhouettes by relevance vector regression. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR'04, pp. 882–888 (2004)
2. Agarwal, A., Triggs, B.: Learning to track 3D human motion from silhouettes. In: Proc. of International Conference on Machine Learning, pp. 9–16 (2004)
3. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)
4. Brand, M.: Shadow puppetry. In: Proc. of International Conference on Computer Vision, vol. 2, pp. 1237–1244 (1999)
5. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR'98, p. 8 (1998)
6. Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search. Int. J. Comput. Vis. 61, 185–205 (2005)
7. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. Int. J. Comput. Vis. 61, 55–79 (2005)
8. Fernando, R.: GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics. Pearson Higher Education, Upper Saddle River (2004). ISBN 0321228324
9. Gall, J., Stoll, C., Aguiar, E.D., Theobalt, C., Rosenhahn, B., Seidel, H.-P.: Motion capture using joint skeleton tracking and surface estimation. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2009)
10. Gibson, S., Mirtich, B.: A Survey of Deformable Modeling in Computer Graphics (1997)
11. Grauman, K., Shakhnarovich, G., Darrell, T.: Inferring 3D structure with a statistical image-based shape model. In: Proc. of International Conference on Computer Vision, pp. 641–647 (2003)
12. Howe, N.R.: Silhouette lookup for automatic pose tracking. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 15–22 (2004)
13. King, B.A., Paulson, L.D.: Motion capture moves into new realms. Computer 40, 13–16 (2007)
14. Ménier, C., Boyer, E., Raffin, B.: 3D skeleton-based body pose recovery. In: Proc. of International Symposium on 3D Data Processing, Visualization and Transmission, pp. 389–396 (2006)
15. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2), 90–126 (2006)
16. Mukasa, T., Miyamoto, A., Nobuhara, S., Maki, A., Matsuyama, T.: Complex human motion estimation using visibility. In: Proc. of IEEE International Conference on Automatic Face and Gesture Recognition, pp. 1–6 (2008)
17. Nobuhara, S., Miyamoto, A., Matsuyama, T.: Complex 3D human motion estimation by modeling incompleteness in 3D shape observation. IEICE Trans. Inf. Syst. J92-D(12), 2225–2237 (2009) (in Japanese)
18. Ogawara, K., Li, X., Ikeuchi, K.: Marker-less human motion estimation using articulated deformable model. In: Proc. of International Conference on Robotics and Automation, pp. 46–51 (2007)
19. Peng, B., Qian, G.: Online gesture spotting from visual hull data. IEEE Trans. Pattern Anal. Mach. Intell. 33(6), 1175–1188 (2011)
20. Plänkers, R., Fua, P.: Articulated soft objects for multiview shape and motion capture. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1182–1187 (2003)
21. Rogez, G., Orrite, C., Martinez, J., Herrero, J.: Probabilistic spatio-temporal 2D-model for pedestrian motion analysis in monocular sequences. In: Proc. of International Conference on Articulated Motion and Deformable Objects, pp. 175–184 (2006)
22. Rosales, R., Siddiqui, M., Alon, J., Sclaroff, S.: Estimating 3D body pose using uncalibrated cameras. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 821–827 (2001)
23. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proc. of International Conference on 3-D Digital Imaging and Modeling, pp. 145–152 (2001)
24. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parameter-sensitive hashing. In: Proc. of International Conference on Computer Vision, pp. 750–757 (2003)
25. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3D human motion estimation. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 390–397 (2005)
26. Starck, J., Hilton, A.: Spherical matching for temporal correspondence of non-rigid surfaces. In: Proc. of International Conference on Computer Vision, vol. 2, pp. 1387–1394 (2005)
27. Theobalt, C., Magnor, M., Schuler, P., Seidel, H.-P.: Multi-layer skeleton fitting for online human motion capture. In: Proc. of 7th International Fall Workshop on Vision, Modeling and Visualization, pp. 471–478 (2002)
28. Ukita, N., Hirai, M., Kidode, M.: Complex volume and pose tracking with probabilistic dynamical models and visual hull constraints. In: Proc. of International Conference on Computer Vision, pp. 1405–1412 (2009)
Chapter 10
3D Video Encoding
10.1 Introduction

Over the past decade, progress in computing and telecommunication technologies has made storage and transmission of visual information media ever more ubiquitous. Nowadays it is usual to stream a huge amount of data online in real time, e.g. over a LAN or the Internet. Hence it becomes crucial to design an efficient compression scheme for any data model to be transmitted through a network, especially when large bandwidth is required. The benefits of data compression are widely recognized and can be observed in everyday life. For example, the standard MPEG-4 codec is often used to efficiently compress a 2D HD video stream to be broadcast or played on any modern device (e.g. digital television, smart-phone, etc.).
As presented in the previous chapters and pointed out in Chap. 8, a 3D video stream consists of a sequence of textured 3D mesh data representing an object in motion. Each 3D video frame is produced independently using multi-view shape reconstruction and texture generation techniques as described in Chaps. 4 and 5, and is represented by a high-resolution textured 3D mesh [26, 27]. In particular, we recall that consecutive frames have no consistent geometrical structure: e.g. the vertex number and mesh connectivity can be totally different. Since a naive representation of textured 3D mesh data requires large memory space, a long 3D video stream is difficult to store and manipulate, and on-line streaming quickly becomes impracticable when the network bandwidth is limited.
This chapter presents a method we developed for 3D video encoding that transforms a 3D video stream into a 2D video stream. 3D video data can therefore be easily stored and transmitted by taking advantage of any mature 2D image encoding technology such as Windows Media, Quicktime, MPEG-4, Real Media, Flash, etc. Thus, we believe 3D video could become a visual medium considered as standard as 2D video in the near future.
It should be noticed that the 3D video compression methods presented in this chapter process vertex data and not face data. Hence, we assume that the surface
texture of 3D mesh is generated by a view-independent vertex-based texture generation method, where appearance-based RGB values or generic reflectance properties of mesh vertices are recorded to generate texture patterns on faces by interpolation. If required, we can also use the view-dependent vertex-based texture generation method described in Sect. 5.4. In that case, a set of multi-view RGB values is recorded for each vertex. Nevertheless, the data size to be encoded would then be increased. Note also that since the 3D video compression method largely modifies the structure of 3D mesh data, the harmonized texture generation method presented as the best view-dependent appearance-based texture generation method in Sect. 5.5 cannot be employed; its texture generation process is defined based on faces, instead of vertices, and hence completely depends on the original 3D mesh structure before compression.
10.1.1 Encoding 3D Visual Media into 2D Video Data

In [12], we proposed skin-off as a method for omni-directional video compression. The idea is as follows (Fig. 10.1):
1. An omni-directional view of a scene is captured by a group of cameras facing outward to observe the entire surrounding scene.
2. The captured multi-view images are seamlessly mapped onto a common spherical screen by an image mosaicing method using the camera calibration parameters. Ideally, all the camera projection centers should be aligned at the same position. Otherwise, in practice, an image warping method should be employed to compensate for the misalignment of the projection centers.
3. A regular polyhedron whose centroid is aligned with the centroid of the sphere is then introduced, and the omni-directional image on the spherical screen is mapped onto the regular polyhedral surface.
4. The regular polyhedral surface with the image patterns is cut open at edges and unfolded onto a 2D plane. Finally, a standard 2D image compression method is applied to the unfolded 2D image.
With this skin-off method, we can compress omni-directional video using ordinary 2D video compression algorithms. We tested several unfolding methods and evaluated their image compression performance. As illustrated in Fig. 10.1, the geometric transformations involved are all between 2D planes, so no significant geometric distortions are introduced. Note, however, that ordinary lossy image compression methods damage the texture continuity at the boundary parts of the unfolded image, painted red in Fig. 10.1, which may result in visible seams in the decoded omni-directional video. This is because different compression operations are applied at the boundary parts. As will be discussed later, how to preserve the texture continuity at the boundary parts of an unfolded 3D mesh through the encoding and decoding processes is a common problem in 3D mesh coding methods based on mesh unfolding.
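As a concrete, simplified illustration of steps 2–4, the sketch below resamples a spherical (equirectangular) panorama onto a cube — one choice of regular polyhedron — and lays the six faces out as a cross-shaped 2D image. The cross layout and face orientations are arbitrary choices made for this sketch, not the unfoldings evaluated in [12]:

```python
import numpy as np

def unfold_cube_from_equirect(pano, face_res=64):
    """Sample an equirectangular panorama (H x W x C) onto the six faces
    of a unit cube and unfold them into a 3x4 cross of face images."""
    H, W, C = pano.shape

    def face(u_axis, v_axis, origin):
        # Grid of 3D directions covering one cube face centered at 'origin'.
        s = np.linspace(-1, 1, face_res)
        u, v = np.meshgrid(s, s)
        d = origin + u[..., None] * u_axis + v[..., None] * v_axis
        d /= np.linalg.norm(d, axis=-1, keepdims=True)
        # Spherical lookup: longitude/latitude -> panorama pixel.
        lon = np.arctan2(d[..., 0], d[..., 2])        # [-pi, pi]
        lat = np.arcsin(np.clip(d[..., 1], -1, 1))    # [-pi/2, pi/2]
        x = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
        y = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).astype(int)
        return pano[y, x]

    X, Y, Z = np.eye(3)
    faces = {'front': face(X, Y, Z),   'back':   face(-X, Y, -Z),
             'left':  face(Z, Y, -X),  'right':  face(-Z, Y, X),
             'top':   face(X, Z, -Y),  'bottom': face(X, -Z, Y)}
    # Cross layout: a horizontal strip of four faces, plus top and bottom.
    r = face_res
    out = np.zeros((3 * r, 4 * r, C), pano.dtype)
    layout = {'left': (1, 0), 'front': (1, 1), 'right': (1, 2),
              'back': (1, 3), 'top': (0, 1), 'bottom': (2, 1)}
    for name, (row, col) in layout.items():
        out[row * r:(row + 1) * r, col * r:(col + 1) * r] = faces[name]
    return out
```

Just as described above, a lossy codec applied to this unfolded image treats the cut edges of adjacent faces independently, which is exactly where visible seams can appear after decoding.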
Fig. 10.1 The skin-off method for omni-directional video. Captured multi-view images are seamlessly mapped onto a common spherical screen by image mosaicing. Then, the spherical image is mapped onto a regular polyhedral surface, which is finally unfolded onto the 2D plane
In this skin-off method, since the shape of the 3D textured surface does not change, the cut position of the regular polyhedron is preserved over time. When we want to apply the skin-off method to 3D video coding, on the other hand, we have to find such stable paths for a sequence of dynamically changing textured 3D mesh data. Dynamic variations of the cut paths over time would create significant discontinuities in a temporal sequence of unfolded images, which makes it difficult to achieve satisfactory compression with ordinary 2D image compression methods. This is a major problem to be discussed in this chapter.
Inspired by geometry video [2], which shares a lot with skin-off, we developed a compression algorithm for 3D video data. Since geometry video was developed for animated 3D mesh sequences, it assumes that the mesh structure is preserved over time. On the other hand, as presented in Chap. 4, the mesh structure of 3D video data changes largely frame by frame due to the frame-wise 3D shape reconstruction strategy. Consequently, how to cope with this mesh structure variation over time is another major problem to be solved by an encoding algorithm for 3D video.
A geometry video consists of a sequence of geometry images [10], each of which represents the geometry of a surface mesh. Assuming 3D models represented by closed 3D surfaces (this assumption is reasonable given the reconstruction process described in Chap. 4), the encoding process of a geometry video is the following (Fig. 10.2):
1. Assuming a triangular surface mesh M with no hole, define a cut ρ on M.
2. Use ρ to open up M and parameterize the 3D surface on a 2D plane D, i.e. map ρ to the border of D and the 3D mesh vertices onto the interior of D. Note that, as will be described later, this parameterization/mapping is not a simple algebraic transformation but involves complex optimization processes.
3.
Digitize D and compute XYZ values at each pixel position by interpolating the XYZ values of the projected vertices in the pixel neighborhood. Note that each pixel now represents a 3D vertex with XYZ values and vertex connectivity relations. Edges in the 3D mesh are implicitly represented by 2D neighboring relations between pixels: 4-neighbor pixel connectivity implies quadrilateral 3D mesh faces, and 8-neighbor connectivity triangular faces. In this chapter, we basically use the former, i.e. the decoded 3D shape is represented by a 3D quadrilateral mesh as shown at the bottom left of Fig. 10.2. The resultant XYZ-valued 2D image is called the geometry image. Note that, associated with the computation of the XYZ pixel values,
Fig. 10.2 Geometry image sequence coding scheme. The processing steps to encode a 3D video frame into a geometry image for storage or transmission are the following: (1) an input mesh from a 3D video frame (e.g. a 3D model of MAIKO) is cut using a cut graph defined on the mesh surface; (2) the mesh is then opened up and parameterized on a 2D plane (as will be described later, the cut graph is mapped to the border of the 2D image); (3) 3D coordinate values of vertices are interpolated and sampled at pixels on the 2D plane, and the 3D coordinate values of the pixels are recorded in a geometry image (in the figure, the geometry image is illustrated as a color image regarding XYZ values as RGB values); finally, (4) the geometry image is compressed by a standard image compression method, or (5) a reconstructed mesh can be obtained from the geometry image by regarding pixel values as vertex positions of the reconstructed 3D mesh and pixel adjacency relations on the 2D plane as 3D edges between the vertices
the RGB values or reflectance properties of the 3D point specified by the XYZ values are computed, which are then recorded in another 2D color image. That is, a 3D video frame is encoded into a pair of 2D images with XYZ and RGB values, respectively.
4. Apply the above-mentioned encoding scheme to each frame of the 3D video data and, considering the geometry image as an ordinary RGB color image, apply a conventional video encoding method to the sequence of geometry images. The associated RGB images can be compressed in the same way.
5. To reconstruct the 3D video data stream from the compressed image sequence, the geometry information contained in each frame is recovered by decoding each geometry image: each RGB pixel triplet of the geometry image is converted into 3D surface vertex coordinates XYZ, and edges between the vertices are established based on the corresponding pixel neighboring relations. Note that the reconstructed mesh is regular. Then, texture generation is applied to the reconstructed 3D mesh using the RGB values of the vertices: texture patterns of 3D triangular or quadrilateral faces can be generated by bi-linear interpolation of the RGB values of the constituent vertices.
As described above, once a geometry image is computed, the computation of its associated RGB-valued image is almost straightforward. In what follows, therefore, we will discuss the computation of a geometry image sequence from a 3D mesh stream alone.
The most distinguishing characteristic of skin-off and geometry video is that any existing 2D video compression algorithm, codec, or software can be employed to encode omni-directional and 3D video data, respectively. Thus, we do not need to develop yet another compression method or data format for the storage and transmission of 3D video data.
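Steps 3 and 5 — sampling a geometry image from a parameterized mesh, and decoding it back to a regular quadrilateral mesh — can be sketched as follows. The cut and the 2D parameterization of steps 1–2 are assumed to be given as per-vertex (u, v) coordinates in [0, 1]², and the function names are illustrative:

```python
import numpy as np
from scipy.interpolate import griddata

def encode_geometry_image(uv, xyz, res=64):
    """Step 3: interpolate vertex XYZ positions, indexed by their 2D
    parameterization uv, onto a regular res x res pixel grid."""
    gu, gv = np.meshgrid(np.linspace(0, 1, res), np.linspace(0, 1, res))
    img = griddata(uv, xyz, (gu, gv), method="linear")
    # Pixels outside the convex hull of the samples get NaN from the
    # linear interpolator; fill them by nearest-neighbor lookup.
    nn = griddata(uv, xyz, (gu, gv), method="nearest")
    return np.where(np.isnan(img), nn, img)

def decode_quad_mesh(img):
    """Step 5: pixels become vertices, and the 4-neighbor adjacency of
    the pixel grid becomes the quadrilateral faces of a regular mesh."""
    h, w, _ = img.shape
    verts = img.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    quads = np.stack([idx[:-1, :-1], idx[:-1, 1:],
                      idx[1:, 1:], idx[1:, :-1]], axis=-1).reshape(-1, 4)
    return verts, quads
```

Since decoding only uses pixel values and pixel adjacency, the reconstructed mesh is regular regardless of the original mesh structure; the RGB image associated with the geometry image is computed and decoded in exactly the same way.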
10.1.2 Problem Specification for 3D Video Encoding

One major issue in geometry image encoding concerns the sub-sampling of projected vertex coordinate values on the 2D plane, which may not accurately preserve geometry. The cut graph should therefore be designed to preserve the original 3D shape as accurately as possible. Moreover, to achieve optimal encoding of 3D video, consecutive geometry images should be as similar as possible, as most 2D video compression techniques rely on temporal information coherence or redundancy. The mesh cutting step is then crucial for this purpose: finding consistent cuts between consecutive frames is indeed a sufficient condition to obtain geometry images that are stable over time. Furthermore, as mentioned previously, if a lossy compression method is used for encoding and alters the border of the geometry images, cracks may be observed on the reconstructed surface around the cut. Hence, a post-processing step (e.g. mesh joining or hole filling) may be necessary to preserve the topology of the initial mesh. In practice, the chrominance sub-sampling of 2D video compression should be disabled during the compression process to avoid information degradation. While these requirements are not hard to satisfy in 3D CG animation, for which the geometry video was originally proposed, they pose a challenging problem with real-world 3D video data: 3D video frames are reconstructed independently, and consecutive frames contain meshes with inconsistent connectivity, i.e. mesh topology and vertex number change from frame to frame, as discussed in Chap. 4. We therefore propose a new technique to characterize dynamic surfaces that vary with time. The state-of-the-art has provided numerous methods for characterizing
320
10
3D Video Encoding
Fig. 10.3 Surface-based graphs (in green) are computed on 3D surface models (in yellow), which can then be transformed into 2D images of 128 × 128 pixels (bottom), respectively. In the figure, the geometry images are illustrated as color images, regarding XYZ values as RGB values. 3D models (in gray) can be reconstructed by decoding these geometry images. The stable graphs return similar geometry images that vary smoothly, enabling optimal 2D image encoding
shape based on volume, surface, global or local properties (e.g. medial axis [1], skeleton-curve [7], Reeb graphs [13, 28], etc.). Although most of these descriptors can capture intrinsic shape properties (e.g. topology), they are not suited for dynamic representation as their structure usually varies over time. Thus, we introduce a new shape descriptor whose representation is stable over time. The descriptor is defined as a graph lying on the surface model and joining stable surface points that are tracked over the sequence (e.g. extremal points given by a Morse function, as described in Chap. 8). The positions of graph branches and joints are optimized using a Bayesian probabilistic framework driven by geodesic consistency cues while the surfaces undergo non-rigid deformations. Geodesic consistency of stable points is maintained over time to ensure graph stability. The stable surface-based graph brings a temporally consistent structure to geometrical data that are produced independently. In this chapter, we show that the descriptor can be applied to efficient 3D video data stream encoding. The graphs can be used as cut graphs that cut open the surface meshes for parameterization into a square domain [8]. Each frame of 3D video is then converted into a geometry image, as introduced previously. Since consecutive geometry images are kept similar thanks to the stable cut graphs, redundancy or coherence information can be optimally encoded (cf. Fig. 10.3). The remainder of the chapter is organized as follows: Sect. 10.2 presents an overview of geometry image creation and addresses the technical problems involved. In Sect. 10.3, we discuss important technical elements for efficiently encoding a 3D video stream: geometry image resolution, encoding and decoding. Section 10.4 presents our new algorithm for 3D video data encoding that relies on a stable surface-based shape representation. Section 10.5 describes various experiments on challenging datasets, showing remarkable stability and performance. Section 10.6 concludes the chapter with discussions.
10.2 Geometry Images

Geometry images were proposed as a method of storing an arbitrary 3D surface mesh model as a 2D image. They provide a 3D mesh structure with completely regular connectivity, suitable for optimal processing and hardware rendering. This section reviews the main steps of the algorithm [10] in detail.
10.2.1 Overview

In [10], a technique was proposed to remesh an arbitrary surface onto a completely regular structure, namely a geometry image. The original mesh is decomposed into a genus-0 chart, onto which the geometry is parameterized and sampled as a simple 2D array of quantized points, i.e. pixels, while surface normals and texture can be stored similarly in 2D arrays sharing the same implicit surface parameterization; here, the usual texture point coordinates UV, as described in Sect. 5.2, do not need to be encoded. In practice, a pair of 2D arrays is generated, recording at each pixel position the XYZ values of a 3D vertex and its color values respectively. As noted before, connectivity relations between vertices, i.e. 3D mesh edges, are implicitly represented by pixel adjacency relations. That is, conventional texture generation employs a 2D texture image indexed by UV coordinates to represent the surface texture patterns of a 3D mesh model, while the geometry image directly encodes the mesh and texture in a pair of images. The main challenge is to find a parameterization that maps all geometry information of a 3D surface onto a 2D plane. As presented in Sect. 5.2, CG software packages (e.g. MAYA, Blender, etc.) usually offer tools to automatically slice open a 3D mesh into several 2D planar pieces, i.e. charts, on which surface texture patterns can be directly modified by artists. The collection of charts is then placed in a 2D image called a texture atlas. Although charts are convenient for artists, they are not optimal for storage as they contain a lot of unused space (see Fig. 5.5). The geometry image, on the other hand, offers the ability to map an arbitrary surface directly into one unique chart within a square image domain (with a regular structure) with no gaps. It is therefore a good choice for 3D video data encoding, compression (storage) and streaming (transmission).
The following steps describe the process to create a geometry image from a 3D video frame that consists of one 3D mesh model.
1. Input mesh: The initial mesh M is generated using reconstruction techniques as described in Chap. 4. No assumptions are made about the mesh except that it is triangular and has no holes.
2. Define cuts: A cut graph ρ is found that is topologically sufficient to open the mesh into a disk. If the mesh has high genus, then handles must be cut. Additional cuts may be introduced to reduce mapping distortions at the subsequent mesh sampling step.
3. Parameterization: The mesh M is opened following ρ and mapped onto a flat square plane. The new mesh M′ has the topology of a disk and its boundary is aligned with that of the square plane. The position of each vertex of M′ is optimized on the plane, so that in the later sampling step the sampled mesh will be a good representation of the original mesh.
4. Sampling: Using the mapped mesh, the surface is sampled by a regular grid on the square. For each 3D sample, the coordinates XYZ are scaled and stored as RGB color components of image pixels. Scaling is necessary to make maximum use of the range of pixel values.
5. Compression: The geometry image is compressed using a conventional codec. Chrominance sub-sampling must be disabled to avoid unnecessary degradation of the mesh.
6. Decoding: When a user wants to retrieve the mesh M, the image is decompressed, and the RGB values are rescaled and stored as 3D mesh coordinates.
The overall process to encode a 3D mesh into a geometry image is shown in Fig. 10.2. The next two sections discuss the essential steps of the algorithm, which are (1) cut graph definition, and (2) parameterization of the boundary and surface. As mentioned previously in Sect. 10.1.2, a poor mesh reconstruction from the geometry image can result if the cut graph and parameterization are not well defined. The 3D-to-2D mapping can produce stretched or flat triangles on the 2D plane that will not be well captured by the sampling step. Objects with very long and thin features cannot be well represented by geometry images, as their triangles would be stretched so much under parameterization that they would be represented by only a few samples; more samples are needed for these regions. Under specific constraints on the parameterization borders, distortion can be avoided [5, 22]. For general cases, to obtain efficient geometry images, a cut graph ρ has to pass through the various extrema of the mesh M, that is, the surface protrusions or regions with high curvature.
All extrema are sought so that the initial cut can be augmented by additional cuts (called cut paths) passing through them.
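The scaling performed in the sampling step above can be illustrated with a small Python sketch. The per-axis bounding-box normalization and the function name are our own assumptions; the actual samples would come from interpolating the parameterized mesh on the grid:

```python
import numpy as np

def quantize_to_pixels(samples_xyz, bits=8):
    """Scale sampled XYZ coordinates into the full pixel range (sketch).

    samples_xyz: (n, n, 3) array of 3D points sampled on the square
    parameter domain. Returns the quantized geometry image plus the
    bounding box needed to invert the scaling at decoding time.
    """
    flat = samples_xyz.reshape(-1, 3)
    bbox_min = flat.min(axis=0)
    bbox_max = flat.max(axis=0)
    extent = np.maximum(bbox_max - bbox_min, 1e-12)  # avoid divide-by-zero
    levels = 2**bits - 1
    gimg = np.round((samples_xyz - bbox_min) / extent * levels)
    return gimg.astype(np.uint16 if bits > 8 else np.uint8), bbox_min, bbox_max
```

Using the full bounding box per frame is what makes "maximum use of the range of pixel values"; the bounding box must be kept alongside the image so the decoder can invert the scaling.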
10.2.2 Cut Graph Definition

Assuming a continuous surface of arbitrary genus represented by a 2-manifold triangle mesh M, a cut graph ρ is defined as a set of connected cuts, or cut paths, which are sufficient to open M into a topological disk. A cut path is a sequence of connected edges and vertices in M that lie between two cut nodes in ρ; cut nodes are defined as vertices in ρ having valence k ≠ 2. Most algorithms [2, 10, 11] consider the two following steps to define the cut graph ρ: (1) find an initial cut ρ0, and (2) augment ρ0 with additional cut paths (ρ ← ρ ∪ ρi, starting from ρ = ρ0) which pass through regions of the mesh that have high curvature, in order to improve the subsequent parameterization and reconstruction quality.
The first step is achieved by topological surgery [25] and returns a spanning tree, which is further trimmed of all edge trees. This leaves a graph ρ0 with only one arbitrary vertex if the genus is 0, or a set of connected loops if the genus is ≥ 1. This step is detailed at the end of this section. In the second step, ρ is usually determined by iteratively adding a cut path to the initial cut ρ0 using shape-preserving parameterizations of M on a disk, where stretched triangles (which are located in regions with high curvature, i.e. on surface protrusions) can be easily detected using an ad-hoc metric [9, 21]. At each iteration i, a new parameterization is calculated from the current cut graph ρ, and a cut path ρi that joins ρ to the triangle with maximal stretch is added: ρ ← ρ ∪ ρi. When an iteration returns no more significant change, no more cut paths are added, and the final cut graph is obtained: ρ = (∪i ρi) ∪ ρ0. Finally, a geometric-stretch parameterization over the 2D square is used for sampling and storage. (It has been shown that a square parameter domain returns a more accurate reconstruction when geometry images are compressed than a parameterization on a disk [10].) In [11], we proposed a similar method that additionally uses texture information to avoid routing cuts through detailed texture regions. Nevertheless, the algorithm was recently improved by applying a novel surface-based shape descriptor that can be used as a cut graph and has two main advantages:
• The proposed representation can produce temporally stable geometry images from a real-world 3D video data stream even though the mesh connectivity between consecutive frames is inconsistent (i.e. the geometry video technique [2] cannot be applied). The challenge is to define cut graphs whose global position changes as little as possible between frames that are produced independently. If the resulting geometry images vary smoothly, then the encoding efficiency is maximal.
• There is no longer any sequential parameterization. The intrinsic structure of the shape is characterized using a surface-based function, and temporally consistent cut paths passing through local extrema can be added at once, i.e. the costly parameterization step is calculated only once per mesh.
The definition of the initial cut ρ0 is detailed below. Our new algorithm is presented in Sect. 10.4. First, let us assume a 3D model represented by a closed surface, i.e. a triangular mesh M with no holes. This assumption is reasonable for 3D video data according to the reconstruction techniques presented in Chap. 4. However, if M is an open mesh, then the set of boundary edges B ⊂ M is initially included in the cut as a subset of ρ0 and remains unchanged throughout the process: ρ0 ← ρ0 ∪ {B}.
1. Remove an arbitrary seed triangle from M.
2. Repeat: identify an edge e ∉ B adjacent to exactly one triangle t, and remove both e and t. (The two remaining edges of t are left in the simplicial complex.) The order of triangle removals is given according to their geodesic distance to the seed triangle. Note that geodesic distances can be computed using Dijkstra's shortest path algorithm. The repetition ends when all triangles are removed. The set of removed triangles indeed forms a topological disk, no matter how high the
Fig. 10.4 Cut graph calculation. The figure shows cut graphs obtained at different steps of the algorithm. From left to right: input mesh, initial cut graph after topological surgery [25] (step 2), initial cut graph after trimming (step 3), and final cut graph after augmentation (cf. Sect. 10.4 for details)
genus of M is. Thus, the remaining edges (and all the vertices) form a topological cut ρ0 of M. Note that this step is comparable to topological surgery [25]. The cut graph is further trimmed of unnecessary edges in the next step.
3. Repeat: identify a vertex v adjacent to exactly one edge e, and remove both v and e. This step ends when all edge trees have been removed, leaving only connected loops. (Loops connect at cut nodes.)
4. Repeat: straighten each cut path in ρ0 so that the resulting cut is not too jagged. The shortest path that connects two adjacent cut nodes is computed, while the resulting cut path is kept close to the original cut path.
5. If the surface is of genus 0, then ρ0 consists of one single vertex. In that case, two adjacent mesh edges are added back to ρ0 so that the surface can be cut open. Loops appear only when the genus is ≥ 1.
As explained previously, in order to avoid sampling issues and degraded 3D shape reconstruction, the initial cut ρ0 is usually augmented with additional cuts so that the final cut graph passes through all surface extrema. Figure 10.4 illustrates the cut graphs obtained at different steps of the algorithm. Stable cut graph computation for 3D video data encoding is presented in Sect. 10.4.
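The construction of ρ0 above (face removal followed by trimming) can be sketched in Python. This is a simplified sketch: breadth-first order from the seed stands in for the geodesic ordering, and the straightening of step 4 and the genus-0 special case of step 5 are omitted:

```python
from collections import deque

def initial_cut_graph(triangles):
    """Initial cut rho_0 by triangle removal and trimming (sketch).

    triangles: vertex-index triples of a closed, connected 2-manifold
    mesh. A seed triangle is removed; then any triangle adjacent via an
    edge to exactly one already-removed neighbor is removed together
    with that shared edge. Surviving edges form the cut, which is
    trimmed of dangling (valence-1) vertices, leaving connected loops,
    or the empty set (a single vertex) when the genus is 0.
    """
    def edges_of(t):
        a, b, c = t
        return [tuple(sorted(e)) for e in ((a, b), (b, c), (c, a))]

    edge_tris = {}                       # edge -> adjacent triangle ids
    for ti, t in enumerate(triangles):
        for e in edges_of(t):
            edge_tris.setdefault(e, []).append(ti)

    removed_tris, removed_edges = {0}, set()
    queue = deque(edges_of(triangles[0]))
    while queue:
        e = queue.popleft()
        alive = [ti for ti in edge_tris[e] if ti not in removed_tris]
        if len(alive) == 1 and e not in removed_edges:
            removed_tris.add(alive[0])
            removed_edges.add(e)
            queue.extend(edges_of(triangles[alive[0]]))

    cut = set(edge_tris) - removed_edges

    changed = True                       # trim dangling edge trees
    while changed:
        changed = False
        valence = {}
        for u, v in cut:
            valence[u] = valence.get(u, 0) + 1
            valence[v] = valence.get(v, 0) + 1
        for e in list(cut):
            if valence[e[0]] == 1 or valence[e[1]] == 1:
                cut.discard(e)
                changed = True
    return cut
```

On a tetrahedron (genus 0) the trimmed cut is empty, matching step 5 above.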
10.2.3 Parameterization

The algorithm described in the previous section returns a cut graph ρ which is sufficient to cut open the mesh M to form a new mesh M′ that has the topology of a disk. (We recall that ρ consists of a set of edges forming a graph on M.) To create M′, each non-boundary edge in ρ is split into two boundary edges to form the opened cut ρ′; when an edge in ρ is split, the two resulting edges in ρ′ are mates. ρ′ is then a directed loop of edges that defines the boundary of M′.
Let D denote the 2D unit-square domain for the geometry image, which consists of an n × n array of XYZ data values. Φ denotes the parameterization, defined as a piecewise linear map from D to M′, which associates coordinates (s, t) ∈ D with each vertex in M′. D has a rectilinear n × n grid, where the grid points have coordinates (i/(n−1), j/(n−1)) with i, j = 0, . . . , n − 1. Φ is calculated at the grid points in order to sample the mesh geometry [10]. The geometry image samples serve to reconstruct an approximation of M. Some insights about sampling resolution are given in Sect. 10.3.1. The parameterization is created in two steps: (1) fix a mapping between the opened cut ρ′ and the boundary of the domain D, and then (2) compute a mapping of M′ onto D that is consistent with the boundary conditions. We give here a description of both steps (cf. [10] for further details):
Boundary Parameterization The strategy consists of mapping all border vertices that are on the opened cut ρ′ onto the parameterization border of D at sample point positions. In order to make sure that consecutive geometry images have the same orientation, the first edge of ρ′ to be laid on the border has to be tracked to the next frame's mesh. Furthermore, the following rules should be respected to correctly reconstruct geometry and avoid cracks around the cut where the surface joins:
• cut nodes in ρ′ must be mapped to sample points on the boundary of D,
• cut-path mates must be sampled identically on the boundary of D,
• no triangle in M′ can have all three of its vertices mapped onto the same border of D (otherwise it would be parametrically degenerate),
• an edge in ρ′ that spans one of the corners of D should be broken at the corner,
• extremities of the cut graph ρ (i.e. cut nodes with valence 1) should be avoided at the corners of D because they would result in poor geometric behavior (this can be avoided by rotating the boundary parameterization).
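The first of these rules, distributing the opened cut over the border sample points of D, can be sketched as follows. This is illustrative Python with an even-spacing policy of our own; mate matching and the corner rules are not enforced here:

```python
def map_cut_to_border(n_cut_vertices, n):
    """Assign cut vertices to border sample points of D (sketch).

    The border of an n x n grid has 4*(n - 1) sample points; the k-th
    cut vertex is assigned to an evenly spaced border sample, walking
    the border counter-clockwise from the bottom-left corner.
    Cut-path mates must later receive identical sample positions.
    """
    border = []
    border += [(i, 0) for i in range(n - 1)]             # bottom
    border += [(n - 1, j) for j in range(n - 1)]         # right
    border += [(i, n - 1) for i in range(n - 1, 0, -1)]  # top
    border += [(0, j) for j in range(n - 1, 0, -1)]      # left
    assert len(border) == 4 * (n - 1)
    step = len(border) / n_cut_vertices
    return [border[int(k * step)] for k in range(n_cut_vertices)]
```
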
Hence, when necessary, new vertices are introduced in M′ and triangles are split. To preserve mesh connectivity consistency across the cut (i.e. to avoid cracks in the reconstructed mesh), identical editing should be applied to mate edges. Note that an n × n geometry image can represent a surface with genus at most n.
Interior Parameterization A good parameterization should produce an accurate surface reconstruction after sampling. Amongst the numerous metrics which can be used to measure parameterization quality, the L2 geometric-stretch metric has proven to return good approximations (cf. [21] for a comprehensive analysis). Geometric stretch measures how much triangles are stretched on the reconstructed surface mesh when the parameter domain D is uniformly sampled. Thus, minimizing geometric stretch tends to uniformly distribute samples on the surface. The standard process used in [10, 21] is as follows:
1. With the fixed boundary parameterization, the interior of M′ is first simplified to form a progressive mesh representation [14], and the resulting (few) base mesh vertices are optimized within D.
Fig. 10.5 Mesh parameterization examples. Left to right: naive mesh unwrap, circular parameterization, and square parameterization
2. Vertex splits are applied from the progressive mesh to successively refine the mesh. For each inserted vertex, the parameterization of its neighborhood is optimized to minimize the stretch using a local non-linear optimization algorithm.
This two-step process creates a map of the 3D mesh M into the unit square D, whose border is the opened cut ρ′. Let us point out again that in the geometry image methods [2, 10, 11], the final parameter domain is square, as opposed to the parameter domains used during the iterative cut graph creation, which are circular: shape-preserving parameterization on a disk [9] is useful to detect stretched triangles, but a final parameterization on a square returns better reconstruction results, and is more suitable for storage and rendering [10]. The only constraint on the positions of the vertices in M′ before sampling is that they must retain the same triangulation as in M. To obtain the sampled mesh, linear basis functions (triangles) are used to define the reconstruction interpolant for the geometry. Note that in [15] a further step models the parameterization vertices with bicubic B-splines and stores the spline control points in the geometry image. The resulting meshes have much improved visual quality, but the method is restricted to genus-zero surfaces. In [20] a sphere-specific parameterization that employs a spherical stretch metric is used to remesh genus-zero surfaces onto the unit sphere. Figure 10.5 shows three unfolding methods for a 3D mesh.
Practical Implementation Our implementation of the parameterization on a square domain D is inspired by [10]. It uses the geometric-stretch metric and a first-order one-dimensional search to optimize each vertex position. The procedure is the following: First, let ρ′ denote an opened cut on a closed surface mesh M, and D denote the 2D unit-square domain for the geometry image.
1.
The border vertices on the opened cut ρ′ ⊂ M are mapped onto the parameterization border of D at sample points.
2. The remaining mesh in D is simplified to a single vertex following a progressive mesh representation, which decimates vertices one by one [14].
3. The mesh in D is then iteratively refined: for each vertex from the progressive mesh of M (taken in the inverse order of removal), a map point is added to
Fig. 10.6 Iteration steps of parameterization on square domain. Left to right: the vertices on the cut graph are initially split and mapped onto the square border and the remaining mesh is simplified to a single vertex located at the square centroid, then the interior mesh is iteratively refined while the vertex positions are optimized
Fig. 10.7 Stabilizing parameterization vertices. Left: polygon formed by neighboring vertices. Right: kernel of the polygon in yellow; triangulation is preserved if the central vertex remains in the kernel
the parameterization in D at the centroid of its neighboring map vertices. All map vertices are then moved to the centroid of their neighbors in the new mesh. As map vertices move, their neighboring vertices are no longer at their centroid positions, so this is repeated until the parameterization settles: all vertices are treated in turn, with the centroid stabilization applied each time. Each vertex position is optimized within the bounds of the kernel¹ of the polygon formed by its neighboring vertices (in the x and y directions, by a first-order search).
4. Step 3 is repeated until the total geometric stretch converges.
Figure 10.6 illustrates the algorithm at different iterative steps. During stabilization by moving to the centroid, only a few neighbors of the original mesh may have been expanded, and hence the polygon of neighboring vertices may have some very small angles. Moving the vertex to the centroid in this case may break the triangulation, therefore a test is performed to check whether the kernel of the polygon is representative of the polygon. If it is too small, then the vertex is moved to the centroid of the kernel. During vertex optimization, the search area is limited to the inside of the neighbor polygon's kernel in order to avoid breaking the triangulation (Fig. 10.7).

¹ The kernel of a polygon P is defined as the region of points a ∈ P such that for every point b ∈ P, the line segment [a, b] lies entirely within P.
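The centroid-relaxation loop of step 3 can be sketched as follows. This is illustrative Python: the kernel test and the first-order stretch search are omitted, so it reduces to a plain fixed-boundary barycentric relaxation:

```python
import numpy as np

def relax_interior(uv, neighbors, is_boundary, iters=200):
    """Fixed-boundary centroid relaxation (sketch).

    uv:          (n, 2) array of parameter-domain positions.
    neighbors:   neighbors[i] lists the vertex indices adjacent to i.
    is_boundary: boundary vertices stay fixed on the square border.
    Interior vertices are repeatedly moved to the centroid of their
    neighbors until positions settle; the kernel test that guards
    against flipped triangles is omitted for brevity.
    """
    uv = np.asarray(uv, dtype=float).copy()
    for _ in range(iters):
        new_uv = uv.copy()
        for i, nbrs in enumerate(neighbors):
            if not is_boundary[i]:
                new_uv[i] = uv[nbrs].mean(axis=0)
        if np.abs(new_uv - uv).max() < 1e-9:  # parameterization settled
            return new_uv
        uv = new_uv
    return uv
```

With a valid boundary, this iteration converges to a barycentric embedding; the stretch-based search of step 3 then refines it further.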
10.2.4 Data Structure Constraints

We give here a brief recap of the conditions to be considered when designing a data structure for geometry images. Note that every constraint applies equally to each geometry image produced when encoding a 3D video data stream.
Cut Graph Sufficiency. The cut graph ρ is composed of a number of individual cuts. ρ must be sufficient to open up a mesh M into a topological disk so that it can be sampled on a flat plane D. In order to preserve connectivity, there cannot be any holes in the parameterized mesh (M′ ⊂ D); this implies that the cut graph ρ must be a single connected component.
Cut Graph Routing. The routing of the cut graph ρ can be chosen freely, but it must satisfy the constraints of the parameterization because, after cutting, the opened cut ρ′ forms the border of the mesh M′.
Parameterization Border Requirements. The border ρ′ of the mesh M′ is the boundary of the parameterization and is predefined to be a fixed shape. If a cut path is poorly chosen, then fitting the border of the mesh to the square boundary of the parameterization may place all the vertices of a triangle on a straight line, making it unsuitable for sampling.
Parameterization Sampling. The size and shape of triangles in the parameterization can be arbitrary, but as the mesh M will be sampled, each mesh triangle should be represented as a triangle in the parameterization domain; sufficient sampling points must be inserted in order to preserve the mesh shape. In order to break up the connectivity of regions of the mesh with high curvature and improve sampling, additional cuts may be added.
Preserving Mesh Topology. In order to rejoin the cut edges and preserve the mesh topology, each cut (border) vertex in ρ′ must be placed exactly at a sample point. This means that the minimum sampling resolution must be large enough to allow every cut graph edge to be properly represented on the border.
Accurate Mesh Sampling.
As mentioned previously, at least one sample for every vertex in the original mesh M is required in order to accurately preserve surface geometry.
10.3 3D Video Data Encoding

Our strategy for 3D video encoding consists of converting a 3D video data stream into a sequence of geometry images, namely a geometry video. Assuming a 3D video of one object in motion, its geometrical information can be extracted frame-by-frame and encoded individually as a sequence of geometry images using the process described in the previous section. In addition, as discussed in Sect. 10.1, any fixed per-vertex attributes (e.g. colors, reflectance properties, etc.) can be stored in separate images, which are encoded and decoded for rendering a decoded 3D mesh. In Briceño's Geometry Video [2], a proxy mesh, that is, a mesh in which every edge is assigned its temporal average length across the sequence, is used for
finding a cut graph that is used for all frames of the video. (Using a proxy mesh is not always the best choice, but the method performs consistently well. A reference frame could be chosen heuristically as well, but it would increase human intervention and computation time.) However, this method is limited to a sequence of animated meshes having the same connectivity and cannot be directly applied to real-world 3D video data, because the mesh connectivity (and surface topology) changes with each frame. Our goal is to provide an accurate reconstruction of the 3D video data stream while achieving efficient data encoding. Hence, we need to define stable cuts across the sequence that return similar geometry images for each frame. Stable cut graph extraction is discussed in Sect. 10.4. In this section, we discuss important technical elements for efficiently encoding a 3D video: geometry image resolution (i.e. the sampling of D), as well as encoding and decoding strategies.
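The proxy-mesh idea can be sketched in a few lines of illustrative Python; note that it presupposes exactly the shared connectivity that real-world 3D video lacks:

```python
import numpy as np

def proxy_edge_lengths(frames, edges):
    """Temporal-average edge lengths for a proxy mesh (sketch of [2]).

    frames: list of (n, 3) vertex arrays sharing one connectivity
    (the assumption that fails for real-world 3D video data).
    edges:  list of (i, j) vertex-index pairs.
    Each proxy edge gets the temporal average of its length.
    """
    lengths = []
    for i, j in edges:
        per_frame = [np.linalg.norm(f[i] - f[j]) for f in frames]
        lengths.append(sum(per_frame) / len(per_frame))
    return lengths
```
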
10.3.1 Resolution

As mentioned in Sect. 10.2, it is expected that there is at least one sample in D for each vertex in the original mesh M. It may be necessary to have additional sample points to ensure that the full shape of the mesh is captured properly, unless vertices can be placed exactly at sample points of the parameterization. In order to make sure the overall 3D mesh joins together properly when recovered, it is necessary to place all the cut nodes of ρ′ exactly on sample points (cf. the parameterization process in Sect. 10.2.3). This means that the size of the border of D must at least correspond to the number of vertices in ρ′. However, such a resolution may not be sufficient to generate a reasonable approximation of the mesh, as it may be so small that it would be impossible to parameterize the mesh M within the constraints of the border. In [10], a minimum resolution is arbitrarily chosen and a multiple of it is used as the actual resolution. The image can then be downsampled while cut vertices remain on exact sample points, allowing the mesh to be properly re-joined. In [20], resolution is not explicitly mentioned, but one can observe that the geometry image sizes have been chosen to exactly capture all the vertices in M.
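One possible way to pick the image size under these constraints is sketched below; the rounding-to-a-multiple policy follows [10], while the search itself is our own simplification:

```python
def geometry_image_resolution(n_cut_border_vertices, n_mesh_vertices,
                              multiple=4):
    """Heuristic choice of the geometry image size n (sketch).

    The border of the n x n domain has 4*(n - 1) sample points, which
    must accommodate every vertex of the opened cut; the n*n samples
    should also cover every original mesh vertex. The minimum n is
    rounded up to a multiple (as in [10]) so the image can later be
    downsampled while cut vertices stay on exact sample points.
    """
    n = 2
    while 4 * (n - 1) < n_cut_border_vertices or n * n < n_mesh_vertices:
        n += 1
    return ((n + multiple - 1) // multiple) * multiple
```

For example, a mesh of 100 vertices with 8 cut-border vertices needs n ≥ 10, rounded up to 12 with a multiple of 4.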
10.3.2 Encoding and Decoding

Encoding Compression can be obtained by encoding the sequence of geometry images using a standard video codec while keeping good accuracy. The only constraint is that chrominance sub-sampling should be disabled to accurately preserve geometry. For example, a standard video frame resolution of 720 × 480 would support 345,600 vertices. Note that most videos would not require such a highly detailed mesh in order to render a convincing viewing experience. On the other hand, the resolution of the color channels limits the volume resolution. However, nowadays most formats support more bits than the traditional 8 bits
per color channel. For example, MPEG-4 defines profiles that allow for 10 or 14 bits per color channel [24]. Without scaling,² 8 bits per channel at 5 mm resolution gives 1.28 meters per volume axis. If 16-bit images are used, 327 meters per axis can be represented.
Decoding Decoding of compressed files is easily performed by sequentially applying the inverse process of the parameterization. Each RGB color coordinate in each geometry image is scaled back and converted to a 3D vertex position XYZ. Surface connectivity is given implicitly by the regular structure of the parameter domain D, i.e. the geometry image: edges are obtained by connecting each vertex to its up, down, left and right neighbors. Note that if a lossy compression method is used for encoding the geometry video, cracks may be observed on the reconstructed surface. In particular, if the compression alters the border of the geometry images, regions around the cut graph ρ on the reconstructed mesh may contain artifacts. Hence, a post-processing step like mesh joining or hole filling may be necessary to preserve the topology of the mesh.
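The quoted ranges follow from simple arithmetic, sketched here for checking (the 5 mm voxel size is the example value used above):

```python
def axis_extent_m(bits_per_channel, voxel_mm=5):
    """Extent per volume axis, in meters, addressable without scaling:
    a b-bit channel can address 2**b positions at a fixed voxel size."""
    return 2**bits_per_channel * voxel_mm / 1000.0

# 8 bits at 5 mm -> 1.28 m per axis; 16 bits -> 327.68 m per axis
```
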
10.4 Stable Surface-Based Shape Representation

As discussed previously in Sect. 10.2, the usual strategy to define a cut graph ρ on the mesh M consists of the following two steps [2, 10, 11]: (1) find an initial cut graph ρ0, and (2) augment ρ0 with additional cut paths (ρ ← ρ ∪ ρi) that pass through regions of the mesh that have high curvature, in order to improve the subsequent parameterization and reconstruction quality. The first step is achieved by topological surgery [25] and returns a spanning tree, which is further trimmed of all edge trees. This leaves a graph with only one arbitrary vertex if the genus is 0, or a set of connected loops if the genus is ≥ 1. In the latter step, ρ is usually determined by iteratively adding a cut path to a particular parameterization³ of M where stretched triangles can be easily detected [9, 21]. Although mesh parameterizations from such iterative search methods are efficient for obtaining reconstructions with high geometrical accuracy, the bottleneck of the algorithm remains the iterative process itself. The sequential parameterization is time consuming, making the conversion of long 3D video sequences tedious. Thus, we present an algorithm that relies on a novel surface-based shape descriptor, which is defined as graphs lying on surface models and joining surface points that are stable over time. The positions of graph branches and joints are optimized using a Bayesian probabilistic framework driven by geodesic consistency cues while the surfaces undergo non-rigid deformations. Geodesic consistency of stable points is maintained over time to ensure graph stability.

² With scaling, any volume size can be represented.
³ A shape-preserving parameterization on a disk [9].

The stable graphs can then be used as cut graphs that cut open surface meshes for parameterization into
a square domain (Sect. 10.2.3). They have two main advantages for 3D video encoding: (1) The proposed representation can produce temporally stable geometry images from a real-world 3D video data stream even though the mesh connectivity between consecutive frames is inconsistent (i.e. the geometry video technique [2] cannot be applied). (2) No sequential cut-path augmentation involving the parameterization (cf. Sect. 10.2.1) is required. Temporally consistent cut paths passing through local extrema can be added at once, i.e. the costly parameterization step is calculated only once per mesh. Our method transforms a 3D mesh of 3K vertices into a geometry image in about 40 s, whereas a conventional technique implemented with the CGAL [6] library requires about 1 min.
10.4.1 Stable Feature Extraction

Our basic assumption is that dynamic surfaces representing real-world objects in motion can be approximated by compact 2-manifold meshes, and present remarkable local properties that can be identified and tracked over time. We use geodesic distances to characterize the stable surface-intrinsic properties, as geodesic distances are invariant to pose orientation and translation, and robust to shape variations when adequately normalized [3, 13]. Given an object surface S, let μ : S → R denote the continuous function defined as:

μ(v) = ∫_S g(v, s) dS,   (10.1)
where g : S² → R is the geodesic distance⁴ between two points on S. Equation (10.1) is the geodesic integral function. Its critical points can be used to characterize shape (cf. Morse theory [17, 19]). For example, on a humanoid surface the local maxima usually correspond to limb extremities, while the global minimum corresponds to the center of the object. Moreover, these critical points can be tracked over time on non-rigid deformable objects using geometry information and topology matching (cf. Chap. 8).
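As a concrete illustration, Eq. (10.1) can be discretized on a mesh treated as an edge-weighted graph: geodesic distances are approximated by Dijkstra's algorithm (cf. footnote 4) and the integral is replaced by a sum over vertices. The sketch below is illustrative only — a toy path-shaped graph standing in for a limb, not the book's implementation:

```python
import heapq

def dijkstra(adj, src):
    """Approximate geodesic distances from src on a mesh given as an
    adjacency dict {vertex: [(neighbor, edge_length), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic_integral(adj):
    """Discrete version of Eq. (10.1): mu(v) = sum of geodesic
    distances from v to every vertex of the mesh."""
    return {v: sum(dijkstra(adj, v).values()) for v in adj}

# Toy path-shaped "surface": the extremities should maximize mu
# (limb tips), the center should minimize it (body center).
adj = {v: [] for v in range(5)}
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[a].append((b, 1.0))
    adj[b].append((a, 1.0))

mu = geodesic_integral(adj)
center = min(mu, key=mu.get)    # global minimum of mu
extremum = max(mu, key=mu.get)  # one of the maxima of mu
```

On this toy graph the center vertex (index 2) minimizes μ and the two endpoints maximize it, mirroring the body-center/limb-extremity behavior described above.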
10.4.2 Temporal Geodesic Consistency

Definition 1 Given a set of N points B = {b_1, ..., b_N} defined on a 2-manifold S, the points v_1 and v_2 on S are said to be geodesically consistent with respect to B if and only if:

∀i ∈ [1, N],  g(v_1, b_i) = g(v_2, b_i).   (10.2)
⁴Geodesic distances can be computed using Dijkstra's shortest-path algorithm, since the surfaces are approximated by 2-manifold meshes.
If the points in B do not present any particular alignment or symmetry, the geodesic consistency property can be used to uniquely locate points on S when N > 2.

Definition 2 Given a set of N points B^t = {b_1^t, ..., b_N^t} defined on a deformable 2-manifold S^t at time t, the points v_1^t and v_2^t on S^t are said to be temporally geodesically consistent with respect to B^t in [t_b, t_e] if and only if:

∀t ∈ [t_b, t_e], ∀i ∈ [1, N],  g(v_1^t, b_i^t) = g(v_2^{t+δ}, b_i^{t+δ}),   (10.3)

where t_b < t_e and t + δ ∈ [t_b, t_e]. The uniqueness property of Definition 1 holds here as well. Moreover, as the surfaces undergo non-rigid deformations (e.g. scale changes), the geodesic distances must be normalized to preserve geodesic consistency. Hence, for every t, g is normalized by the maximum geodesic distance over all pairs of points on S^t.
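A hedged sketch of the Definition 1 test, using a flat patch where the geodesic distance reduces to the Euclidean distance; the distance function `g`, the base points, and the tolerance are illustrative assumptions, not part of the book's pipeline:

```python
import math

def geodesically_consistent(g, v1, v2, base_points, tol=1e-9):
    """Definition 1: v1 and v2 are geodesically consistent w.r.t. B
    iff g(v1, b_i) == g(v2, b_i) for every base point b_i (up to tol).
    In practice g should be the *normalized* geodesic distance so the
    test survives scale changes (cf. Definition 2)."""
    return all(abs(g(v1, b) - g(v2, b)) <= tol for b in base_points)

# Flat-patch stand-in: geodesic distance == Euclidean distance.
g = math.dist

# Three base points with no alignment or symmetry: with N > 2 such
# points, a distance signature locates a point uniquely on the patch.
B = [(0.0, 0.0), (1.0, 0.0), (0.3, 0.9)]

p = (0.4, 0.7)
q = (0.7, 0.4)  # same distance to B[0] as p, different to the others
assert geodesically_consistent(g, p, p, B)
assert not geodesically_consistent(g, p, q, B)
```

The second assertion shows why more than two base points are needed: p and q agree on their distance to the first base point alone, and only the full signature distinguishes them.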
10.4.3 Stable Surface-Based Graph Construction

Definition 3 Let C^t = {c_1^t, ..., c_N^t} denote a set of stable features (e.g. local extrema) on S^t that can be tracked over time, i.e. in [t_b, t_e]. We define the surface-based shape descriptor T on S^t as a graph whose branches and joints are temporally geodesically consistent with respect to C^t in [t_b, t_e].

In particular, every branch of T is linked to a stable feature in C^t, and the joints of T represent branch intersections. After initialization at time t_b, the graph remains stable over time as long as C^t can be tracked. Stabilization is optimized using a probabilistic framework.
Initialization At t_b, the graph T is built by iteratively adding a branch linking a stable feature from the set C^{t_b} to the existing geometrical structure at a joint, until all points in C^{t_b} are linked. Let V^{t_b} denote the set that will contain all the joints of T at t_b; V^{t_b} is initially empty. The algorithm is as follows:

1. Populate a set C^{t_b} = {c_1^{t_b}, ..., c_N^{t_b}} with stable features extracted on S^{t_b}. In practice, we can choose the local maxima given by Eq. (10.1).
2. Extract an initial stable structure ρ_0 on S^{t_b}. In practice, we choose the global minimum given by Eq. (10.1) for a genus-0 surface, or the initial graph given by topological surgery [25] for a higher genus surface (cf. Sect. 10.2.2). ρ_0 is then either a point or a path⁵ on S^{t_b}. Set the graph on S^{t_b}: ρ ← ρ_0.
3. Choose the pair of points (c_j^{t_b}, v_j^{t_b}) ∈ C^{t_b} × ρ that has the minimal geodesic distance:

   (c_j^{t_b}, v_j^{t_b}) = arg min_{(c,v) ∈ C^{t_b} × ρ} g(c, v),   (10.4)

⁵A path on a surface is a set of points joined two-by-two by lines.
Fig. 10.8 Surface-based shape descriptor construction: (a) original mesh; (b) geodesic integral function μ (Eq. (10.1)) computed on the mesh surface; (c), (d) topological structures represented by Reeb graphs at resolutions 3 and 4, respectively (used for topology matching over time); (e) surface-based graph joining the critical values of μ
and create a branch ρ_j that links c_j^{t_b} to v_j^{t_b} using the shortest path on S^{t_b}. Update ρ ← ρ ∪ ρ_j, add v_j^{t_b} to the set V^{t_b}, and discard c_j^{t_b} for the rest of the algorithm.
4. Repeat Step 3 until every feature in C^{t_b} is linked to ρ. Finally, ρ = (∪_j ρ_j) ∪ ρ_0, and we set at t_b: T ← ρ.

In Step 3, the shortest path is used to avoid branch overlapping when linking the stable features to ρ. Figure 10.8 illustrates the surface-based graph construction. In particular, we show the critical values extracted from the surface-based function μ (cf. Eq. (10.1)), the corresponding Reeb graphs at resolutions 3 and 4, and the final surface-based shape representation. (As for the Reeb graph, see Sect. 8.3.) The Reeb graphs are shown for comparison and can be used for topology characterization; note that their structure is subject to surface variations while their extremities remain stable.

Stable Graphs For all t > t_b, a stable representation is obtained by building a graph whose branches and joints are temporally geodesically consistent with the graph at t − 1. In the general case, when dealing with dynamically changing shapes, a geometrical structure inherited from an initial representation at t_b is not guaranteed to remain optimal over time; keeping the initial structure can even act as an excessive constraint. The problem is therefore modeled as a Markov process, and the algorithm to construct the graph at t is the following:

1. Stable features C^t = {c_1^t, ..., c_N^t} at t are extracted on S^t using Eq. (10.1) (e.g. local maxima) and matched with the previous set C^{t−1} using geometry information and topology matching [28].
2. Extract an initial stable structure ρ_0^t on S^t as described previously for a surface of genus ≥ 1 (i.e. a path); otherwise ρ_0^t is chosen as a point on S^t geodesically consistent with ρ_0^{t−1} with respect to C^t. Set the graph at t: ρ^t ← ρ_0^t.
3.
Branches that link the stable features to the existing geometrical structure ρ^t are then added iteratively using a Bayesian probabilistic strategy, following the same order as in the initialization step. Let P^t = {p_i^t} denote the set of points forming the path (a graph branch) joining a stable feature c^t to a joint v^t at t, and
D^t = {d_i^t} denote the set of points forming the shortest path (given by Dijkstra's algorithm) joining c^t to v^t. The problem can be expressed as a MAP-MRF in which the posterior probability to maximize is

Pr(P^t | D^t, P^{t−1}) ∝ ∏_i E_d(p_i^t, d_i^t) E_p(p_i^t, p_i^{t−1}) ∏_{j∈N(i)} V(p_i^t, p_j^t),   (10.5)

where E_d and E_p are the local evidence terms for a point p_i^t to be at the positions inferred from d_i^t and p_i^{t−1} respectively, N(i) contains the indices of the neighbors of i, and V is a pair-wise smoothness term (so that P^t forms a path). E_d and E_p are modeled as follows:

E_d(p_i^t, d_i^t) = f_d( Σ_{k∈[1,N]} |g(p_i^t, c_k^t) − g(d_i^t, c_k^t)| ),   (10.6)

E_p(p_i^t, p_i^{t−1}) = f_p( Σ_{k∈[1,N]} |g(p_i^t, c_k^t) − g(p_i^{t−1}, c_k^{t−1})| ),   (10.7)
where f_d and f_p are Gaussian distributions centered on d_i^t and p_i^{t−1} respectively, g is the normalized geodesic distance, c_k^t ∈ C^t and c_k^{t−1} ∈ C^{t−1}. Hence, Eq. (10.5) estimates the probability of P^t being geodesically consistent with both the previous branch P^{t−1} and the shortest path D^t. Let P denote the optimal path joining the feature c^t to the joint v^t. We thus have to estimate

P = arg max_{ {P^t} } Pr(P^t | D^t, P^{t−1}),   (10.8)

where {P^t} denotes all the possible paths linking c^t to v^t. The first branch P_{j_1} links a feature c_{j_1}^t in C^t to ρ_0^t, and then ρ^t ← P_{j_1} ∪ ρ_0^t. The following branches iteratively link {c_j^t} to ρ^t and intersect ρ^t at the joints {v_j^t}. Nevertheless, the position of each joint at t has to be computed before the branch creation. v_j^t is obtained by taking the closest point on ρ^t to

v̄_j^t = arg min_{v∈ρ^t} λ·g(v̂_j^t, v) + (1 − λ)·g(v̌_j^t, v),   (10.9)

where v̂_j^t is the point on S^t geodesically consistent with v_j^{t−1} on S^{t−1} with respect to C^t, v̌_j^t is the intersection point given by the shortest path from c_j^t to ρ^t, and λ = 0.7. Note that each joint v_j^t is constrained to belong to branches geodesically consistent with the branches v_j^{t−1} belongs to; the structure of T is therefore maintained over time.
4. Repeat Step 3 until every feature in C^t is linked to ρ^t. Finally, the representation of the graph T at t is given by ρ^t ← (∪_j P_j^t) ∪ ρ_0^t.
5. Set t ← t + 1 and repeat Steps 1 to 4 for all t < t_e.
Note that the optimization problem in Step 3 can be efficiently solved by dynamic programming. The probabilistic framework is necessary because real-world geometrical data produced by capture systems in real environments usually contain changing and unpredictable noise (e.g. 3D video data can contain reconstruction errors).
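To make the dynamic-programming remark concrete: because V couples only neighboring points along a branch, Eq. (10.5) is a chain MRF, and the MAP path of Eq. (10.8) can be found exactly by a Viterbi-style recursion in the log domain. The following is a generic sketch of that recursion; the candidate sets, toy probabilities, and smoothness model are illustrative assumptions, not the authors' implementation:

```python
import math

def map_chain(candidates, unary, pairwise):
    """Viterbi DP for a chain MRF like Eq. (10.5): maximize
    prod_i unary[i][x_i] * prod_i pairwise(x_i, x_{i+1})
    over label assignments x, working with log-probabilities."""
    n = len(candidates)
    score = [math.log(u) for u in unary[0]]
    back = []
    for i in range(1, n):
        new_score, new_back = [], []
        for j, cj in enumerate(candidates[i]):
            # Best predecessor for candidate cj at node i.
            best_k = max(
                range(len(candidates[i - 1])),
                key=lambda k: score[k]
                + math.log(pairwise(candidates[i - 1][k], cj)),
            )
            new_score.append(
                score[best_k]
                + math.log(pairwise(candidates[i - 1][best_k], cj))
                + math.log(unary[i][j])
            )
            new_back.append(best_k)
        score, back = new_score, back + [new_back]
    # Backtrack the MAP labeling from the best final candidate.
    j = max(range(len(score)), key=score.__getitem__)
    path = [j]
    for bp in reversed(back):
        j = bp[j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]

# Toy branch with two candidate positions per node; the unary terms
# play the role of E_d * E_p, the pairwise term the smoothness V.
cands = [[0, 1], [0, 1], [0, 1]]
unary = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

def smooth(a, b):   # strong smoothness: favors staying on one label
    return 0.9 if a == b else 0.1

def weak(a, b):     # no coupling: nodes decided independently
    return 0.5
```

With strong smoothness, `map_chain(cands, unary, smooth)` returns `[0, 0, 0]` (the pairwise term overrides the last node's unary preference), whereas with `weak` coupling it returns the node-wise argmax `[0, 0, 1]`.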
10.5 Performance Evaluations

Datasets For the performance evaluations, we tested the algorithm on publicly available 3D video datasets reconstructed from multi-view images. The sequences were captured by the University of Surrey [23] and MIT CSAIL [30]. They show subjects wearing loose clothing while performing various actions, such as dancing or jumping; the surfaces can therefore vary considerably between two consecutive frames when the subjects move quickly. The proposed approach is compared to a conventional geometry image approach [10], namely Geometry Image Sequence (GIS), in which cut graphs are obtained by iterative parameterization [9] as described in Sect. 10.2.2.

Stability Evaluation To assess the stability of the descriptor and its ability to produce consistent geometry images, the mean square error (MSE) of pixel values between consecutive geometry images is computed. This allows us to estimate how much the geometry images vary over a sequence. Note that for this validation we use the 3D models processed and provided by [4]; as the surface topology is consistent over these sequences, a stable descriptor should return an optimal result. Table 10.1 shows the average MSE obtained on the various sequences. As expected, our approach shows remarkable stability between consecutive frames, with very low average MSE values; as stability is not handled by GIS, it returns high average MSE values. Figure 10.9 illustrates the stable graphs obtained with our approach against conventional GIS: our approach returns stable surface-based graphs, the geometry images are similar, and the reconstruction is compelling, whereas with GIS the cut graphs are not stabilized on the surface, although the mesh reconstruction remains accurate. Figure 10.10 shows the MSE values obtained on all evaluated sequences with the proposed method and GIS.

Reconstruction Accuracy To assess the reconstruction accuracy from geometry images, i.e.
how effective the cut-path definition is, the Hausdorff distance between the original mesh and the reconstructed mesh is computed using [18]. The average Hausdorff distances Δ between the ground truth sequences and the reconstructed surfaces are reported in Table 10.2. We observe similar performances for our approach and GIS: Δ is very low for both methods. Thus, our method can achieve accurate reconstruction with only one parameterization step, as opposed to the costly iterative parameterization employed by GIS (Sect. 10.2.2). In general, variations can occur when: (1) critical points are not well located (e.g. if geodesic distances are not computed exactly), (2) the tracking of stable features is lost, or (3) the surface undergoes very large deformations. In the latter case we can
Fig. 10.9 Stability evaluation on the Bouncing sequence. (a) Our approach returns stable surface-based graphs (in green); geometry images are similar and reconstruction is compelling (in gray). (b) With conventional GIS, cut graphs are not stabilized on the surface; however, mesh reconstruction has good quality

Table 10.1 MSE between consecutive geometry images

Sequence    MSE (GIS)   MSE (our approach)
Bouncing    35302       2886
Crane       28485       1670
Handstand   30671       1261
Kickup      27700       1938
Lock        22466       1700
Samba       35302       2886
observe a drift due to error accumulation; however, the geometry images will still vary smoothly. Figure 10.11 shows reconstruction examples, and the Hausdorff distances computed on all evaluated sequences are given in Fig. 10.12.

Encoding Performance Table 10.3 shows the 3D video encoding performance of the different strategies. Our method is clearly the most satisfactory: it achieves a better compression rate than GIS with similar 3D shape decoding accuracy.
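The stability metric used above is straightforward to reproduce; a minimal pure-Python sketch, assuming the geometry images are given as equally sized 2D arrays of pixel values (the toy frames are illustrative, not the evaluated data):

```python
def mse(img_a, img_b):
    """Mean squared error between two equally sized geometry images."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)

def sequence_stability(frames):
    """Average MSE over consecutive geometry images of a sequence;
    lower values mean the cut graphs (and hence the parameterizations)
    are more stable over time."""
    pairs = list(zip(frames, frames[1:]))
    return sum(mse(a, b) for a, b in pairs) / len(pairs)

# Toy 2x2 geometry images: identical frames give zero MSE, a uniform
# pixel shift of 2 gives an MSE of 4.
flat = [[0, 0], [0, 0]]
shifted = [[2, 2], [2, 2]]
assert mse(flat, shifted) == 4.0
assert sequence_stability([flat, flat, flat]) == 0.0
```

In practice each geometry-image pixel stores quantized (x, y, z) coordinates, so the same metric would be averaged over the coordinate channels as well.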
10.6 Conclusion

This chapter discusses techniques related to 3D video data stream encoding. As 3D video data require a large amount of space for storage, and large network bandwidth
Fig. 10.10 MSE comparisons. Our approach produces sequences of similar geometry images that vary very smoothly: the MSE between consecutive frames is very low
Fig. 10.11 Encoding of the Samba and Handstand sequences using our stable surface-based shape descriptor. The representation is stable even though the surfaces undergo strong variations. As can be observed, the surface-based graphs return similar geometry images

Table 10.2 Distance to ground truth (average Hausdorff distance Δ)

Sequence    Δ (GIS)   Δ (our approach)
Bouncing    0.158     0.162
Crane       0.1083    0.1157
Handstand   0.1026    0.0995
Kickup      0.136     0.145
Lock        0.128     0.130
Samba       0.071     0.071
for transmission, it is necessary to design an optimal data structure for compression and streaming. For this purpose, we propose a new method, based on geometry images [10], that transforms a 3D video data stream into a 2D video stream. 3D video data can therefore be easily stored and transmitted using any mature coding technology (e.g. Windows Media, QuickTime, MPEG-4, Real Media, Flash, etc.). As a 3D video data stream consists of individually reconstructed meshes whose topology and triangulation vary significantly over time, the geometry video technique [2], which was developed for animated mesh sequences, is not optimal for encoding. The challenge here is to design an algorithm that finds temporally stable cut graphs which open up meshes consistently across 3D video sequences. Transforming a 3D video data stream into a sequence of similar geometry images is indeed suitable for efficient compression and streaming, as most 2D video encoders rely on temporal consistency and redundancy. Hence, we introduce a novel surface-based shape descriptor for dynamic surfaces. The proposed representation consists of graphs lying on the surface models and is stable over time. Branches and joints are stabilized using temporal geodesic consistency rules within a probabilistic framework. Experiments reveal that the graphs
Fig. 10.12 Reconstruction accuracy Δ of 3D video data. Both strategies return similar performances
Table 10.3 3D video encoding. For each format, the size of each sequence is given in KB. H.264/MPEG-4 is used for lossless compression of the geometry images (128 × 128)

Sequence    #fr.   OFF (zip)   GIS     Our approach
Bouncing    174    16,300      304.4   248.3
Crane       173    14,100      283.7   192.7
Handstand   173    24,700      283.4   182.2
Kickup      219    29,900      365.1   252.2
Lock        249    32,400      388.2   271.2
Samba       174    22,200      304.0   211.4
show remarkable properties for dynamic surface encoding, as the descriptor brings a stable geometrical structure to 3D video data: it remains stable while the dynamic surfaces undergo non-rigid deformations. The temporally stable surface-based graphs are employed as cut graphs and enable 3D surface models to be transformed into 2D geometry images. We demonstrated the stability and performance of the proposed descriptor against a state-of-the-art method.

Note that surface topology changes are considered to be out of the scope of this chapter. As presented in Chap. 8, a 3D video data stream can be partitioned according to surface topology characterization (using the behavior unit model), and robust methods that ensure the topology consistency of dynamic surfaces obtained from a multi-view video system while providing consistent remeshing [3, 4] can then be applied independently to each partition before encoding. Hence, video content could be dynamically adapted to the device and network bandwidth, and cope with adaptive bit-rate streaming technology. For further improvement, additional surface features such as SIFT [16] or color (when available) may be used to improve graph stability [29, 31]. High-fidelity texture map encoding for optimal visualization is still under investigation; for example, as presented in Sect. 5.5, the harmonized texture generation method requires more complex data structures than the simple vertex-based texture generation assumed by the coding method presented in this chapter.
References

1. Blum, H.: A transformation for extracting new descriptors of shape. In: Models for the Perception of Speech and Visual Form (1967)
2. Briceño, H., Sander, P., McMillan, L., Gortler, S., Hoppe, H.: Geometry videos: a new representation for 3D animations. In: Eurographics/SIGGRAPH Symposium on Computer Animation, pp. 136–146 (2003)
3. Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Calculus of non-rigid surfaces for geometry and texture manipulation. IEEE Trans. Vis. Comput. Graph., 902–913 (2007)
4. Cagniart, C., Boyer, E., Ilic, S.: Probabilistic deformable surface tracking from multiple videos. In: Proc. of European Conference on Computer Vision (2010)
5. Carr, N., Hoberock, J., Crane, K., Hart, J.: Rectangular multi-chart geometry images. In: Proc. Fourth Eurographics Symposium on Geometry Processing (SGP), pp. 181–190 (2006)
6. CGAL: Computational Geometry Algorithms Library. http://www.cgal.org
7. Cornea, N., Silver, D., Yuan, X., Balasubramanian, R.: Computing hierarchical curve-skeletons of 3D objects. Vis. Comput. 21(11), 945–955 (2005)
8. Erickson, J., Har-Peled, S.: Optimally cutting a surface into a disk. CoRR cs.CG/0207004 (2002)
9. Floater, M.: Parametrization and smooth approximation of surface triangulations. Comput. Aided Geom. Des. 14(3), 231–250 (1997)
10. Gu, X., Gortler, S., Hoppe, H.: Geometry images. In: Proc. of ACM SIGGRAPH, pp. 355–361 (2002)
11. Habe, H., Katsura, Y., Matsuyama, T.: Skin-off: representation and compression scheme for 3D video. In: Picture Coding Symposium (2004)
12. Habe, H., Yamazawa, I., Nomura, T., Katsura, Y., Matsuyama, T.: Compression method for omni-directional video using polyhedral representations. J. Inst. Electron. Inf. Commun. Eng. J88-A(9), 1074–1084 (2005) (in Japanese)
13. Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology matching for fully automatic similarity estimation of 3D shapes. In: Proc. of ACM SIGGRAPH, pp. 203–212 (2001)
14. Hoppe, H.: Progressive meshes. In: Proc. of ACM SIGGRAPH, pp. 99–108 (1996)
15. Losasso, F., Hoppe, H., Schaefer, S., Warren, J.: Smooth geometry images. In: Eurographics Symposium on Geometry Processing, pp. 138–145 (2003)
16. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proc. of International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999)
17. Morse, M.: The Calculus of Variations in the Large. American Mathematical Society Colloquium Publications, vol. 18. AMS, New York (1934)
18. Cignoni, P., Rocchini, C., Scopigno, R.: Metro: measuring error on simplified surfaces. Comput. Graph. Forum 17(2), 167–174 (1998)
19. Pascucci, V., Scorzelli, G., Bremer, P.T., Mascarenhas, A.: Robust on-line computation of Reeb graphs: simplicity and speed. ACM Trans. Graph. 26(3), 58 (2007)
20. Praun, E., Hoppe, H.: Spherical parametrization and remeshing. In: Proc. of ACM SIGGRAPH, pp. 340–349 (2003)
21. Sander, P., Gortler, S., Snyder, J., Hoppe, H.: Signal-specialized parametrization. Microsoft Research Technical Report MSR-TR-2002-27 (2002)
22. Sander, P., Wood, Z., Gortler, S., Snyder, J., Hoppe, H.: Multi-chart geometry images. In: Eurographics Symposium on Geometry Processing, pp. 146–155 (2003)
23. Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Comput. Graph. Appl. (2007)
24. Sullivan, G., Topiwala, P., Luthra, A.: The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions. In: Proc. SPIE Conference on Applications of Digital Image Processing, vol. XXVII, pp. 454–474 (2004)
25. Taubin, G., Rossignac, J.: Geometric compression through topological surgery. ACM Trans. Graph. 17(2), 84–115 (1998)
26. Tung, T., Nobuhara, S., Matsuyama, T.: Simultaneous super-resolution and 3D video using graph-cuts. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2008)
27. Tung, T., Nobuhara, S., Matsuyama, T.: Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In: Proc. of International Conference on Computer Vision (2009)
28. Tung, T., Schmitt, F.: The augmented multiresolution Reeb graph approach for content-based retrieval of 3D shapes. Int. J. Shape Model. 11(1), 91–120 (2005)
29. Varanasi, K., Zaharescu, A., Boyer, E., Horaud, R.P.: Temporal surface tracking using mesh evolution. In: Proc. of European Conference on Computer Vision (2008)
30. Vlasic, D., Baran, I., Matusik, W., Popovic, J.: Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph. 27(3) (2008)
31. Zaharescu, A., Boyer, E., Varanasi, K., Horaud, R.P.: Surface feature detection and description with applications to mesh matching. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (2009)