Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5748
Joachim Denzler Gunther Notni Herbert Süße (Eds.)
Pattern Recognition 31st DAGM Symposium Jena, Germany, September 9-11, 2009 Proceedings
Volume Editors Joachim Denzler Herbert Süße Friedrich-Schiller Universität Jena, Lehrstuhl Digitale Bildverarbeitung Ernst-Abbe-Platz 2, 07743 Jena, Germany E-mail: {joachim.denzler, herbert.suesse}@uni-jena.de Gunther Notni Fraunhofer-Institut für Angewandte Optik und Feinmechanik Albert-Einstein-Str. 7, 07745 Jena, Germany E-mail:
[email protected]
Library of Congress Control Number: 2009933619
CR Subject Classification (1998): I.5, I.4, I.3, I.2.10, F.2.2, I.4.8, I.4.1
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-642-03797-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03797-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12743339 06/3180 543210
Preface
In 2009, for the second time in a row, Jena hosted an extraordinary event. In 2008, Jena celebrated the 450th birthday of the Friedrich Schiller University of Jena with the motto “Lichtgedanken” – “flashes of brilliance.” This year, for almost one week, Jena became the center for the pattern recognition research community of the German-speaking countries in Europe by hosting the 31st Annual Symposium of the Deutsche Arbeitsgemeinschaft f¨ ur Mustererkennung (DAGM). Jena is a special place for this event for several reasons. Firstly, it is the first time that the university of Jena has been selected to host this conference, and it is an opportunity to present the city of Jena as offering a fascinating combination of historic sites, an intellectual past, a delightful countryside, and innovative, international research and industry within Thuringia. Second, the conference takes place in an environment that has been heavily influenced by optics research and industry for more than 150 years. Third, in several schools and departments at the University of Jena, research institutions and companies in the fields of pattern recognition, 3D computer vision, and machine learning play an important role. The university’s involvement includes such diverse activities as industrial inspection, medical image processing and analysis, remote sensing, biomedical analysis, and cutting-edge developments in the field of physics, such as the recent development of the new terahertz imaging technique. Thus, DAGM 2009 was an important event to transfer basic research results to different applications in such areas. Finally, the fact that the conference was jointly organized by the Chair for Computer Vision of the Friedrich Schiller University of Jena and the Fraunhofer Institute IOF reflects the strong cooperation between these two institutions during the past and, more generally, between research, applied research, and industry in this field. The establishment of a Graduate School of Computer Vision and Image Interpretation, which is a joint facility of the Technical University of Ilmenau and the Friedrich Schiller University of Jena, is a recent achievement that will focus and strengthen the computer vision and pattern recognition activities in Thuringia. The technical program covered all aspects of pattern recognition and consisted of oral presentations and poster contributions, which were treated equally and given the same number of pages in the proceedings. Each section is devoted to one specific topic and contains all oral and poster papers for this topic sorted alphabetically by first authors. A very strict paper selection process was used, resulting in an acceptance rate of less than 45%. Therefore, the proceedings meet the strict requirements for publication in the Springer Lecture Notes in Computer Science series. Although not reflected in these proceedings, one additional point that also made this year’s DAGM special is the Young Researchers’ Forum, a special session for promoting scientific interactions between excellent
young researchers. The impressive scientific program of the conference is due to the enormous efforts of the reviewers of the Program Committee. We thank all of those whose dedication and timely reporting helped to ensure that the highly selective reviewing process was completed on schedule. We are also proud to have had three renowned invited speakers at the conference: – Josef Kittler (University of Surrey, UK) – Reinhard Klette (University of Auckland, New Zealand) – Kyros Kutulakos (University of Toronto, Canada) We extend our sincere thanks to everyone involved in the organization of this event, especially the members of the Chair for Computer Vision and the Fraunhofer Institute IOF. In particular, we are indebted to Erik Rodner for organizing everything related to the conference proceedings, to Wolfgang Ortmann for installation and support in the context of the Web presentation and the reviewing and submission system, to Kathrin M¨ ausezahl for managing the conference office and arranging the conference dinner, and to Marcel Br¨ uckner, Michael Kemmler, and Marco K¨orner for the local organization. Finally, we would like to thank our sponsors, OLYMPUS Europe Foundation Science for Life, STIFT Thuringia, MVTec Software GmbH, Telekom Laboratories, Allied Vision Technologies, Desko GmbH, Jenoptik AG, and Optonet e.V. for their donations and helpful support, which contributed to several awards at the conference and made reasonable registration fees possible. We especially appreciate support from industry because it indicates faithfulness to our community and recognizes the importance of pattern recognition and related areas to business and industry. We were happy to host the 31st Annual Symposium of DAGM in Jena and look forward to DAGM 2010 in Darmstadt. September 2009
Joachim Denzler Gunther Notni Herbert Süße
Organization
Program Committee

T. Aach, RWTH Aachen
H. Bischof, TU Graz
J. Buhmann, ETH Zürich
H. Burkhardt, University of Freiburg
D. Cremers, University of Bonn
J. Denzler, University of Jena
G. Fink, TU Dortmund
B. Flach, TU Dresden
W. Förstner, University of Bonn
U. Franke, Daimler AG
M. Franz, HTWG Konstanz
D. Gavrila, Daimler AG
M. Goesele, TU Darmstadt
F.A. Hamprecht, University of Heidelberg
J. Hornegger, University of Erlangen
B. Jähne, University of Heidelberg
X. Jiang, University of Münster
R. Koch, University of Kiel
U. Köthe, University of Heidelberg
W.G. Kropatsch, TU Wien
G. Linß, TU Ilmenau
H. Mayer, BW-Universität München
R. Mester, University of Frankfurt
B. Michaelis, University of Magdeburg
K.-R. Müller, TU Berlin
H. Ney, RWTH Aachen
G. Notni, Fraunhofer IOF Jena
K. Obermayer, TU Berlin
G. Rätsch, MPI Tübingen
G. Rigoll, TU München
K. Rohr, University of Heidelberg
B. Rosenhahn, University of Hannover
S. Roth, TU Darmstadt
B. Schiele, University of Darmstadt
C. Schnörr, University of Heidelberg
B. Schölkopf, MPI Tübingen
G. Sommer, University of Kiel
T. Vetter, University of Basel
F.M. Wahl, University of Braunschweig
J. Weickert, Saarland University
Prizes 2007
Olympus Prize
The Olympus Prize 2007 was awarded to Bodo Rosenhahn and Gunnar Rätsch for their outstanding contributions to the area of computer vision and machine learning.
DAGM Prizes
The main prize for 2007 was awarded to:
Jürgen Gall, Bodo Rosenhahn, Hans-Peter Seidel: Clustered Stochastic Optimization for Object Recognition and Pose Estimation
Christopher Zach, Thomas Pock, Horst Bischof: A Duality-Based Approach for Realtime TV-L1 Optical Flow
Further DAGM prizes for 2007 were awarded to:
Kevin Köser, Bogumil Bartczak, Reinhard Koch: An Analysis-by-Synthesis Camera Tracking Approach Based on Free-Form Surfaces
Volker Roth, Bernd Fischer: The kernelHMM: Learning Kernel Combinations in Structured Output Domains
Prizes 2008
Olympus Prize
The Olympus Prize 2008 was awarded to Bastian Leibe for his outstanding contributions to the area of closely coupled object categorization, segmentation, and tracking.
DAGM Prizes
The main prize for 2008 was awarded to:
Christoph H. Lampert, Matthew B. Blaschko: A Multiple Kernel Learning Approach to Joint Multi-class Object Detection
Further DAGM prizes for 2008 were awarded to:
Björn Andres, Ullrich Köthe, Moritz Helmstädter, Winfried Denk, Fred A. Hamprecht: Segmentation of SBFSEM Volume Data of Neural Tissue by Hierarchical Classification
Kersten Petersen, Janis Fehr, Hans Burkhardt: Fast Generalized Belief Propagation for MAP Estimation on 2D and 3D Grid-Like Markov Random Fields
Kai Krajsek, Rudolf Mester, Hanno Scharr: Statistically Optimal Averaging for Image Restoration and Optical Flow Estimation
Table of Contents
Motion and Tracking A 3-Component Inverse Depth Parameterization for Particle Filter SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˙ Evren Imre and Marie-Odile Berger
1
An Efficient Linear Method for the Estimation of Ego-Motion from Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Florian Raudies and Heiko Neumann
11
Localised Mixture Models in Region-Based Tracking . . . . . . . . . . . . . . . . . . Christian Schmaltz, Bodo Rosenhahn, Thomas Brox, and Joachim Weickert A Closed-Form Solution for Image Sequence Segmentation with Dynamical Shape Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frank R. Schmidt and Daniel Cremers Markerless 3D Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Walder, Martin Breidt, Heinrich B¨ ulthoff, Bernhard Sch¨ olkopf, and Crist´ obal Curio
21
31 41
Pedestrian Recognition and Automotive Applications The Stixel World - A Compact Medium Level Representation of the 3D-World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hern´ an Badino, Uwe Franke, and David Pfeiffer
51
Global Localization of Vehicles Using Local Pole Patterns . . . . . . . . . . . . . Claus Brenner
61
Single-Frame 3D Human Pose Recovery from Multiple Views . . . . . . . . . . Michael Hofmann and Dariu M. Gavrila
71
Dense Stereo-Based ROI Generation for Pedestrian Detection . . . . . . . . . . Christoph Gustav Keller, David Fern´ andez Llorca, and Dariu M. Gavrila
81
Pedestrian Detection by Probabilistic Component Assembly . . . . . . . . . . . Martin Rapus, Stefan Munder, Gregory Baratoff, and Joachim Denzler
91
High-Level Fusion of Depth and Intensity for Pedestrian Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcus Rohrbach, Markus Enzweiler, and Dariu M. Gavrila
101
Features Fast and Accurate 3D Edge Detection for Surface Reconstruction . . . . . . Christian B¨ ahnisch, Peer Stelldinger, and Ullrich K¨ othe
111
Boosting Shift-Invariant Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas H¨ ornlein and Bernd J¨ ahne
121
Harmonic Filters for Generic Feature Detection in 3D . . . . . . . . . . . . . . . . Marco Reisert and Hans Burkhardt
131
Increasing the Dimension of Creativity in Rotation Invariant Feature Design Using 3D Tensorial Harmonics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Henrik Skibbe, Marco Reisert, Olaf Ronneberger, and Hans Burkhardt Training for Task Specific Keypoint Detection . . . . . . . . . . . . . . . . . . . . . . . Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua Combined GKLT Feature Tracking and Reconstruction for Next Best View Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Trummer, Christoph Munkelt, and Joachim Denzler
141 151
161
Single-View and 3D Reconstruction Non-parametric Single View Reconstruction of Curved Objects Using Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin R. Oswald, Eno T¨ oppe, Kalin Kolev, and Daniel Cremers
171
Discontinuity-Adaptive Shape from Focus Using a Non-convex Prior . . . . Krishnamurthy Ramnath and Ambasamudram N. Rajagopalan
181
Making Shape from Shading Work for Real-World Images . . . . . . . . . . . . . Oliver Vogel, Levi Valgaerts, Michael Breuß, and Joachim Weickert
191
Learning and Classification Deformation-Aware Log-Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobias Gass, Thomas Deselaers, and Hermann Ney Multi-view Object Detection Based on Spatial Consistency in a Low Dimensional Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gurman Gill and Martin Levine
201
211
Active Structured Learning for High-Speed Object Detection . . . . . . . . . . Christoph H. Lampert and Jan Peters
221
Face Reconstruction from Skull Shapes and Physical Attributes . . . . . . . . Pascal Paysan, Marcel L¨ uthi, Thomas Albrecht, Anita Lerch, Brian Amberg, Francesco Santini, and Thomas Vetter
232
Sparse Bayesian Regression for Grouped Variables in Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sudhir Raman and Volker Roth Learning with Few Examples by Transferring Feature Relevance . . . . . . . Erik Rodner and Joachim Denzler
242 252
Pattern Recognition and Estimation Simultaneous Estimation of Pose and Motion at Highly Dynamic Turn Maneuvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Barth, Jan Siegemund, Uwe Franke, and Wolfgang F¨ orstner
262
Making Archetypal Analysis Practical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Bauckhage and Christian Thurau
272
Fast Multiscale Operator Development for Hexagonal Images . . . . . . . . . . Bryan Gardiner, Sonya Coleman, and Bryan Scotney
282
Optimal Parameter Estimation with Homogeneous Entities and Arbitrary Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jochen Meidow, Wolfgang F¨ orstner, and Christian Beder
292
Detecting Hubs in Music Audio Based on Network Analysis . . . . . . . . . . . Alexandros Nanopoulos
302
A Gradient Descent Approximation for Graph Cuts . . . . . . . . . . . . . . . . . . Alparslan Yildiz and Yusuf Sinan Akgul
312
Stereo and Multi-view Reconstruction A Stereo Depth Recovery Method Using Layered Representation of the Scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tarkan Aydin and Yusuf Sinan Akgul
322
Reconstruction of Sewer Shaft Profiles from Fisheye-Lens Camera Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sandro Esquivel, Reinhard Koch, and Heino Rehse
332
A Superresolution Framework for High-Accuracy Multiview Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bastian Goldl¨ ucke and Daniel Cremers
342
View Planning for 3D Reconstruction Using Time-of-Flight Camera Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christoph Munkelt, Michael Trummer, Peter K¨ uhmstedt, Gunther Notni, and Joachim Denzler
352
Real Aperture Axial Stereo: Solving for Correspondences in Blur . . . . . . . Rajiv Ranjan Sahay and Ambasamudram N. Rajagopalan Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Schick and Rainer Stiefelhagen Image-Based Lunar Surface Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . Stephan Wenger, Anita Sellent, Ole Sch¨ utt, and Marcus Magnor
362
372 382
Image Analysis and Applications Use of Coloured Tracers in Gas Flow Experiments for a Lagrangian Flow Analysis with Increased Tracer Density . . . . . . . . . . . . . . . . . . . . . . . . Christian Bendicks, Dominique Tarlet, Bernd Michaelis, Dominique Th´evenin, and Bernd Wunderlich Reading from Scratch – A Vision-System for Reading Data on Micro-structured Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ralf Dragon, Christian Becker, Bodo Rosenhahn, and J¨ orn Ostermann
392
402
Diffusion MRI Tractography of Crossing Fibers by Cone-Beam ODF Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hans-Heino Ehricke, Kay M. Otto, Vinoid Kumar, and Uwe Klose
412
Feature Extraction Algorithm for Banknote Textures Based on Incomplete Shift Invariant Wavelet Packet Transform . . . . . . . . . . . . . . . . . Stefan Glock, Eugen Gillich, Johannes Schaede, and Volker Lohweg
422
Video Super Resolution Using Duality Based TV-L1 Optical Flow . . . . . . Dennis Mitzel, Thomas Pock, Thomas Schoenemann, and Daniel Cremers HMM-Based Defect Localization in Wire Ropes — A New Approach to Unusual Subsequence Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Esther-Sabrina Platzer, Josef N¨ agele, Karl-Heinz Wehking, and Joachim Denzler Beating the Quality of JPEG 2000 with Anisotropic Diffusion . . . . . . . . . Christian Schmaltz, Joachim Weickert, and Andr´es Bruhn Decoding Color Structured Light Patterns with a Region Adjacency Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christoph Schmalz Residual Images Remove Illumination Artifacts! . . . . . . . . . . . . . . . . . . . . . Tobi Vaudrey and Reinhard Klette
432
442
452
462 472
Superresolution and Denoising of 3D Fluid Flow Estimates . . . . . . . . . . . . Andrey Vlasenko and Christoph Schn¨ orr
482
Spatial Statistics for Tumor Cell Counting and Classification . . . . . . . . . . Oliver Wirjadi, Yoo-Jin Kim, and Thomas Breuel
492
Segmentation Quantitative Assessment of Image Segmentation Quality by Random Walk Relaxation Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bj¨ orn Andres, Ullrich K¨ othe, Andreea Bonea, Boaz Nadler, and Fred A. Hamprecht Applying Recursive EM to Scene Segmentation . . . . . . . . . . . . . . . . . . . . . . Alexander Bachmann Adaptive Foreground/Background Segmentation Using Multiview Silhouette Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobias Feldmann, Lars Dießelberg, and Annika W¨ orner
502
512
522
Evaluation of Structure Recognition Using Labelled Facade Images . . . . . Nora Ripperda and Claus Brenner
532
Using Lateral Coupled Snakes for Modeling the Contours of Worms . . . . Qing Wang, Olaf Ronneberger, Ekkehard Schulze, Ralf Baumeister, and Hans Burkhardt
542
Globally Optimal Finsler Active Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher Zach, Liang Shan, and Marc Niethammer
552
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
563
A 3-Component Inverse Depth Parameterization for Particle Filter SLAM
Evren İmre and Marie-Odile Berger
INRIA Grand Est - Nancy, France
Abstract. The non-Gaussianity of the depth estimate uncertainty degrades the performance of monocular extended Kalman filter SLAM (EKF-SLAM) systems employing a 3-component Cartesian landmark parameterization, especially in low-parallax configurations. Even particle filter SLAM (PF-SLAM) approaches are affected, as they utilize EKF for estimating the map. The inverse depth parameterization (IDP) alleviates this problem through a redundant representation, but at the price of increased computational complexity. The authors show that such a redundancy does not exist in PF-SLAM, hence the performance advantage of the IDP comes almost without an increase in the computational cost.
1 Introduction
The monocular simultaneous localization and mapping (SLAM) problem involves the causal estimation of the location of a set of 3D landmarks in an unknown environment (mapping), in order to compute the pose of a sensor platform within this environment (localization), via the photometric measurements acquired by a camera, i.e. the 2D images [2]. Since the computational complexity of the structure-from-motion techniques, such as [6], is deemed prohibitively high, the literature is dominated by extended Kalman filter (EKF) [2],[3] and particle filter (PF) [4] based approaches. The former utilizes an EKF to estimate the current state, defined as the pose and the map, using all past measurements [5]. The latter exploits the independence of the landmarks, given the trajectory, to decompose the SLAM problem into the estimation of the trajectory via PF, and the individual landmarks via EKF [5]. Since both approaches use EKF, they share a common problem: EKF assumes that the state distribution is Gaussian. The validity of this assumption, hence the success of EKF in a particular application, critically depends on the linearity of the measurement function. However, the measurement function in monocular SLAM, the pinhole camera model [2], is known to be highly nonlinear for landmarks represented with the Cartesian parameterization (CP) [9], i.e., with their components along the 3 orthonormal axes corresponding to the 3 spatial dimensions. This is especially true for low-parallax configurations, which typically occurs in case of distant, or newly initialized landmarks [9]. A well-known solution to this problem is to employ an initialization stage, by using a particle filter [10], or simplified linear measurement model [4], and then to
switch to the CP. The IDP [7] further refines the initialization approach: it uses the actual measurement model, hence is more accurate than [4]; it is computationally less expensive than [10]; and it needs no special procedure to constrain the pose with the landmarks in the initialization stage, hence simpler than both [4] and [10]. Since it is more linear than the CP [7], it also offers a performance gain both in low- and high-parallax configurations. However, since EKF is an O(N²) algorithm, the redundancy of the IDP limits its use to the low-parallax case. The main contribution of this work is to show that in PF-SLAM, the performance gain from the use of the IDP is almost free: PF-SLAM operates under the assumption that for each particle, the trajectory is known. Therefore the pose-related components of the IDP should be removed from the state of the landmark EKF, leaving exactly the same number of components as the CP. Since this parameterization has no redundancy, and has better performance than the CP [7], its benefits can be enjoyed throughout the entire estimation procedure, not just during the landmark initialization. The organization of the paper is as follows: In the next section, the PF-SLAM system used in this study is presented. In Sect. 3, the application of IDP to PF-SLAM is discussed, and compared with [4]. The experimental results are presented in Sect. 4, and Sect. 5 concludes the paper.

1.1 Notation
Throughout the rest of the paper, a matrix and a vector are represented by an uppercase and a lowercase bold letter, respectively. A standalone lowercase italic letter denotes a scalar, and one with parentheses stands for a function. Finally, an uppercase italic letter corresponds to a set.
2 A Monocular PF-SLAM System
PF-SLAM maximizes the SLAM posterior over the entire trajectory of the camera and the map, i.e., the objective function is [5]

p_{PF} = p(X, M \mid Z),   (1)
where X and M denote the camera trajectory and the map estimate at the kth time instant, respectively (in (1) the subscript k is suppressed for brevity). Z is the collection of measurements acquired until k. Equation 1 can be decomposed as [1]

p_{PF} = p(X \mid Z)\, p(M \mid X, Z).   (2)
In PF-SLAM, the first term is evaluated by a PF, which generates a set of trajectory hypotheses. Then, for a given trajectory X^i, the second term can be expanded as [1]

p(M^i \mid X^i, Z) = \prod_{j=1}^{\gamma} p(m_j^i \mid X^i, Z),   (3)
where γ is the total number of landmarks, and M^i is the map estimate of the particle i, computed from X^i via EKF. Therefore, for a τ-particle system, (2) is maximized by a particle filter and γτ independent EKFs [1]. When a κ-parameter landmark representation is employed, the computational complexity is O(γτκ²). In the system utilized in this work, X is the history of the pose and the rate of displacement estimates. Its kth member is

s_k = [c_k \;\; q_k], \qquad x_k = [s_k \;\; t_k \;\; w_k],   (4)
where c_k and q_k denote the position of the camera center in 3D world coordinates, and its orientation as a quaternion, respectively. Together, they form the pose s_k. t_k and w_k are the translational and rotational displacement terms, in terms of distance and angle covered in a single time unit. M is defined as a collection of 3D point landmarks, i.e.,

M = \{ m_j \}_{j=1}^{\gamma}.   (5)
The state evolves with respect to the constant velocity model, defined as

c_{k+1} = c_k + t_k
q_{k+1} = q_k \otimes q(w_k)
t_{k+1} = t_k + v_t
w_{k+1} = w_k + v_w,   (6)
where q is an operator that maps an Euler angle vector to a quaternion, and ⊗ is the quaternion product operation. v_t and v_w are two independent Gaussian noise processes with covariance matrices P_t and P_w, respectively. The measurement function projects a 3D point landmark to a 2D point feature on the image plane via the perspective projection equation [9], i.e.,

[h_x \;\; h_y \;\; h_z]^T = r(q_k^{-1})\,[m_j - c_k]^T
z_j = \big[\, \nu_x - \alpha_x \tfrac{h_x}{h_z} \;\;\; \nu_y - \alpha_y \tfrac{h_y}{h_z} \,\big],   (7)
where r(q) is an operator that yields the rotation matrix corresponding to a quaternion q, and zj is the projection of the jth landmark to the image plane. (νx ; νy ) denotes the principal point of the camera, and (αx ; αy ) represents the focal length-related scale factors. The implementation follows the FASTSLAM2.0 [1] adaptation, described in [4]. In a cycle, first the particle poses are updated via (6). Then, the measurement predictions and the associated search regions are constructed. After matching with normalized cross correlation, the pose and the displacement estimates of all particles are updated with the measurements, zk . The quality of each particle is assessed by the observation likelihood function p(zk |X i , M i ), evaluated at zk . The resampling stage utilizes this quality score to form the new particle set. Finally, for each particle X i , the corresponding map M i is updated
with the measurements. The algorithm tries to maintain a certain number of active landmarks (i.e. landmarks that are in front of the camera, and have their measurement predictions within the image), and uses FAST [8] to detect new landmarks to replace the lost ones. The addition and the removal operations are global, i.e., if a landmark is deleted, it is removed from the maps of all particles.
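To make the cycle above concrete, the following sketch shows the two pieces that are fully specified by the text: the constant velocity propagation of Equation (6) and likelihood-proportional resampling. It is an illustrative reimplementation, not the authors' code; the quaternion convention (w, x, y, z), the Euler-angle-to-quaternion mapping, and all function names are assumptions made here.

```python
# Illustrative sketch: constant-velocity propagation of a particle's pose hypothesis,
# Eq. (6), and a simple likelihood-proportional resampling step.
import numpy as np

def euler_to_quat(w):
    """Map an Euler angle vector w = (roll, pitch, yaw) to a unit quaternion (w, x, y, z)."""
    r, p, y = 0.5 * np.asarray(w)
    cr, sr, cp, sp, cy, sy = np.cos(r), np.sin(r), np.cos(p), np.sin(p), np.cos(y), np.sin(y)
    return np.array([cr*cp*cy + sr*sp*sy,
                     sr*cp*cy - cr*sp*sy,
                     cr*sp*cy + sr*cp*sy,
                     cr*cp*sy - sr*sp*cy])

def quat_mul(a, b):
    """Quaternion product a ⊗ b, both given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def propagate(c, q, t, w, P_t, P_w, rng):
    """One step of the constant velocity model, Eq. (6), with additive process noise."""
    c_new = c + t
    q_new = quat_mul(q, euler_to_quat(w))
    q_new /= np.linalg.norm(q_new)            # keep the orientation quaternion normalized
    t_new = t + rng.multivariate_normal(np.zeros(3), P_t)
    w_new = w + rng.multivariate_normal(np.zeros(3), P_w)
    return c_new, q_new, t_new, w_new

def resample(weights, rng):
    """Return particle indices drawn in proportion to the observation likelihood scores."""
    w = np.asarray(weights) / np.sum(weights)
    return rng.choice(len(w), size=len(w), p=w)
```

In a full cycle, propagation would be followed by the measurement prediction, matching, pose update, weighting, and the per-particle EKF updates described above.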
3 Inverse-Depth Parameterization and PF-SLAM
The original IDP represents a 3D landmark, m_{3D}, as a point on the ray that joins the landmark, and the camera center of the first camera in which the landmark is observed [9], i.e.,

m_{3D} = c + \frac{1}{\lambda}\, n,   (8)

where c is the camera center, n is the direction vector of the ray and λ is the inverse of the distance from c. n is parameterized by the azimuth and the elevation angles of the ray, θ and φ, as

n = [\cos\phi \sin\theta \;\; -\sin\phi \;\; \cos\phi \cos\theta],   (9)
computed from the orientation of the first camera, and the first 2D observation, q and u, respectively. The resulting 6-parameter representation, IDP6, is

m_{IDP6} = [\,c \;\; \theta(u, q) \;\; \phi(u, q) \;\; \lambda\,].   (10)
This formulation, demonstrated to be superior to the CP [7], has two shortcomings. Firstly, it is a 6-parameter representation, hence its use in the EKF is computationally more expensive. Secondly, u and q are not directly represented, and their nonlinear relation to θ and φ [9] inevitably introduces an error. The latter issue can be remedied by a representation which deals with these hidden variables explicitly, i.e., a 10-component parameterization,

m_{IDP10} = [\,c \;\; q \;\; u \;\; \lambda\,].   (11)

In this case, n is redefined as

l = r(q)\,\Big[\, \tfrac{\nu_x - u_1}{\alpha_x} \;\;\; \tfrac{\nu_y - u_2}{\alpha_y} \;\;\; 1 \,\Big]^T, \qquad n = \frac{l}{\|l\|}.   (12)
With these definitions, the likelihood of a landmark in a particle, i.e., the operand in (3), is

p(m_j^i \mid X^i, Z) = p(s_j^i, u_j^i, \lambda_j^i \mid X^i, Z).   (13)

Consider a landmark m_j that is initiated at the time instant k - a, with a > 0. By definition, s_j^i is the pose hypothesis of the particle i at k - a, i.e., s_{k-a}^i (see (4)). Since, for a particle, the trajectory is given, this entity has no associated uncertainty, hence, is not updated by the landmark EKF. Therefore,

\left. \begin{array}{l} s_{k-a}^i \in x_{k-a}^i \in X^i \\ s_k^i = s_{k-a}^i \end{array} \right\} \;\Rightarrow\; p(m_j^i \mid X^i, Z) = p(u_j^i, \lambda_j^i \mid X^i, Z).   (14)
In other words, the pose component of a landmark in a particle is a part of the trajectory hypothesis, and is fixed for a given particle. Therefore, it can be removed from the state vector of the landmark EKF. The resulting parameterization, IDP3, is

m_{IDP3} = [\,u \;\; \lambda\,].   (15)

Since the linearity analysis of [7] involves only the derivatives of the inverse depth parameter, it applies to all parameterizations of the form (8). Therefore, IDP3 retains the performance advantage of IDP6 over CP. As for the complexity, IDP3 and CP differ only in the measurement functions and their Jacobians. Equation (7) can be evaluated in 58 floating point operations (FLOP), whereas when (12) and (8) are substituted into (7), considering that some of the terms are fixed at the instantiation, the increase is 13 FLOPs, plus a square root. Similar figures apply to the Jacobians. To put the above into perspective, the rest of the state update equations of the EKF can be evaluated roughly in 160 FLOPs. Therefore, in PF-SLAM, the performance vs. computational cost trade-off that limits the application of IDP6 is effectively eliminated, there is no need for a dedicated initialization stage, and IDP3 can be utilized throughout the entire process. Besides, IDP3 involves no approximations over CP; it only exploits a property of particle filters. A similar, 3-component parameterization is proposed in [4]. However, the authors employ it in an initialization stage, in which a simplified measurement function that assumes no rotation, and translation only along the directions orthogonal to the principal axis vector, is utilized. This approximation yields a linear measurement function, and makes it possible to use a linear Kalman filter, a computationally less expensive scheme than EKF. The approach proposed in this work, employing IDP3 exclusively, has the following advantages:
1. IDP is utilized throughout the entire process, not only in the initialization.
2. No separate landmark initialization stage is required, therefore the system architecture is simpler.
3. The measurement function employed in [4] is valid only within a small neighborhood of the original pose [4]. The approximation not only adversely affects the performance, but also limits the duration in which a landmark may complete its initialization. However, the proposed approach uses the actual measurement equation, whose validity is not likewise limited.
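The argument above is easiest to see in code: for a given particle, the anchor pose entering (8) and (12) is a constant read from the trajectory hypothesis, so only (u, λ) remains in the landmark state. The sketch below is illustrative only; the quaternion convention and the packing of the intrinsics into the tuple K are assumptions, not the authors' interface.

```python
# Illustrative sketch of the IDP3 measurement function: the anchor pose (c0, q0) is a
# constant taken from the particle's trajectory, so the EKF state is only (u1, u2, lambda).
import numpy as np

def quat_to_rot(q):
    """Rotation matrix r(q) for a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([[1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
                     [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
                     [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)]])

def idp3_to_point(u, lam, c0, q0, K):
    """Eqs. (8) and (12): back-project the first observation u and scale by 1/lambda."""
    nu_x, nu_y, a_x, a_y = K                      # principal point and focal scale factors
    l = quat_to_rot(q0) @ np.array([(nu_x - u[0]) / a_x, (nu_y - u[1]) / a_y, 1.0])
    n = l / np.linalg.norm(l)
    return c0 + n / lam

def project(m, ck, qk, K):
    """Eq. (7): perspective projection of landmark m into the camera at (ck, qk)."""
    nu_x, nu_y, a_x, a_y = K
    h = quat_to_rot(qk).T @ (m - ck)              # r(q^-1) equals r(q)^T for unit quaternions
    return np.array([nu_x - a_x * h[0] / h[2], nu_y - a_y * h[1] / h[2]])
```

An EKF over m_{IDP3} = [u λ] would linearize project(idp3_to_point(...)) with respect to u and λ only, which is exactly the three-parameter filter advocated here.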
4 Experimental Results
The performance of the proposed parameterization is assessed via a pose estimation task. For this purpose, a bed, which can translate and rotate a camera in two axes with a positional and angular precision of 0.48 mm and 0.001°, respectively, is used to acquire the images of two indoor scenes, with the dimensions 4x2x3 meters, at a resolution of 640x480. In the sequence Line, the camera moves on a 63.5-cm long straight path, with a constant translational and angular displacement of 1.58 mm/frame and 0.0325°/frame, respectively.
Fig. 1. Left: The bed used in the experiment to produce the ground truth trajectory. Right top: The first and the last images of Line and Hardline. Right bottom: Two images from the circle the camera traced in Circle.
Hardline is derived from Line by discarding 2/3 of the images randomly, in order to obtain a nonconstant-velocity motion. The sequence Circle is acquired by a camera tracing a circle with a diameter of 73 cm (i.e. a circumference of 229 cm), and moving at a displacement of 3.17 mm/frame. It is the most challenging sequence of the entire set, as, unlike Hardline, not only the horizontal and forward components of the displacement, but also the direction changes. Figure 1 depicts the setup, and two images from each of the sequences. The pose estimation task involves recovering the pose and the orientation of the camera from the image sequences by using the PF-SLAM algorithm described in Sect. 2. Two map representations are compared: the exclusive use of the IDP3 and a hybrid CP-IDP3 scheme. The hybrid scheme involves converting an IDP3 landmark to the CP representation as soon as a measure of the linearity of the measurement function, the linearity index proposed in [7], goes below 0.1 [7]. At a given time, the system may have both CP and IDP3 landmarks in the maps of the particles, hence the name hybrid CP-IDP3. It is related to [4] in the sense that both use the same landmark representation for initialization; however, the hybrid CP-IDP3 employs the actual measurement model, hence is expected to perform better than [4]. Therefore, it is safe to state that the experiments compare IDP3 to an improved version of [4]. In the experiments, the number of particles is set to 2500, and both algorithms try to maintain 30 landmarks. Although this may seem low, given the capabilities of the contemporary monocular SLAM systems, since the main argument of this work is totally independent of the number of landmarks, the authors believe that denser maps would not enhance the discussion. Two criteria are used for the evaluation of the results:
1. Position error: Square root of the mean square error between the ground truth and the estimated trajectory, in millimeters.
2. Orientation error: The angle between the estimated and the actual normals to the image plane (i.e., the principal axis vectors), in degrees.
The results are presented in Table 1 and Figs. 2-4.
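Both criteria can be written compactly. The sketch below is an illustration, not the evaluation code used for the experiments; it assumes poses given as camera centers (in millimeters) and unit quaternions, and takes the principal axis as the rotated z-axis of the camera frame.

```python
# Illustrative sketch of the two evaluation criteria: trajectory RMSE and the angle
# between estimated and ground-truth principal axis vectors.
import numpy as np

def position_error(est_centers, gt_centers):
    """Root mean square Euclidean error between estimated and ground-truth camera centers."""
    d = np.asarray(est_centers) - np.asarray(gt_centers)
    return np.sqrt(np.mean(np.sum(d * d, axis=1)))

def principal_axis(q):
    """Rotated +z axis of the camera frame for a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([2*(x*z + w*y), 2*(y*z - w*x), 1 - 2*(x*x + y*y)])

def orientation_error_deg(q_est, q_gt):
    """Angle in degrees between the estimated and ground-truth principal axis vectors."""
    a, b = principal_axis(q_est), principal_axis(q_gt)
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```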
Table 1. Mean trajectory and principal axis errors

Criterion                        |   Line           |   Hardline       |   Circle
                                 |  IDP3  | CP-IDP3 |  IDP3  | CP-IDP3 |  IDP3  | CP-IDP3
Mean trajectory error (mm)       |  7.58  | 11.25   |  8.15  | 12.46   | 22.66  | 39.87
Principal axis error (degrees)   |  0.31  |  0.57   |  0.24  |  0.54   |  0.36  |  0.48
The experiment results indicate that both schemes perform satisfactorily in Line and Hardline. The IDP3 performs slightly, but consistently, better in both position and orientation estimates, with an average position error below 1 cm. As for the orientation error, in both cases the IDP3 yields an error oscillating around 0.3°, whereas, in the CP-IDP3, it grows towards 1° as the camera moves. However, in Circle, the performance difference is much more pronounced: the IDP3 can follow the circle, the true trajectory, much more closely than the CP-IDP3. The average and peak differences are approximately 1.7 and 4 cm, respectively. The final error in both algorithms is less than 2% of the total path length. The superiority of the IDP3 can be primarily attributed to two factors: the nonlinearity of (8) and the relatively high nonlinearity of (6), when m_j is represented via the CP, instead of the IDP [9]. The first issue affects the conversion from the CP to the IDP3. Since the transformation is nonlinear, the conversion of the uncertainty of an IDP landmark to the corresponding CP landmark is not error-free. The second problem, the relative nonlinearity, implies that the accumulation of the linearization errors occurs at a higher rate in a CP landmark than in an IDP landmark. Since the quality of the landmark estimates is reflected in the accuracy of the estimated pose [7], IDP3 performs better. The performance difference is not significant in Line (Fig. 4), a relatively easy sequence in which the constant translational and angular displacement assumptions are satisfied, as seen in Table 1. Although Hardline (Figs. 2, 3 and 4) is a more difficult sequence, the uncertainty in the translation component is still constrained to a line, and PF can cope with the variations in the total displacement magnitude. Besides, it is probably somewhat short to illustrate the effects of the drift: the diverging orientation error observed in Figs. 2, 3 and 4 is likely to cause problems in a longer sequence. However, in Circle (Figs. 2, 3 and 4), there is a considerable performance gap. It is a sequence in which neither the direction, nor the components of the displacement vector are constant. Therefore the violation of the constant displacement assumption is the strongest among all sequences. Moreover, at certain parts of the sequence, the camera motion has a substantial component along the principal axis vector of the camera, a case in which the nonlinear nature of (6) is accentuated. A closer study of Fig. 3 reveals that it is these parts of the sequence, especially in the second half of the trajectory, in which the IDP3 performs better than the CP-IDP3 scheme, due to its superior linearization.
Fig. 2. Top view of the trajectory and the structure estimates. Left: Hardline. Right: Circle. G denotes the ground truth. Blue circles indicate the estimated landmarks.
Fig. 3. Trajectory and orientation estimates for Hardline (top) and Circle (bottom). Left: Trajectory. Right: Orientation, i.e., the principal axis. In order to prevent cluttering, the orientation estimates are downsampled by 4.
Fig. 4. Performance comparison of the IDP3 and the CP-IDP3 schemes. Top: Line. Middle: Hardline. Bottom: Circle. Left column is the Euclidean distance between the actual and the estimated trajectories. Right column is the angle between the actual and the estimated principal axis vectors.
5 Conclusion
The advantage the IDP offers over the CP, the relative amenability to linearization, is a prize that comes at the price of reduced representation efficiency, as the CP describes a landmark with the minimum number of components, whereas the IDP has redundant components. In this paper, the authors show that this is not the case in PF-SLAM, i.e., the IDP is effectively as efficient as the CP, by exploiting the fact that in a PF-SLAM system, for each particle, the trajectory is given, i.e., has no uncertainty; therefore, any pose-related parameters can be removed from the landmark EKFs. This allows the use of the IDP throughout the entire estimation procedure. In addition to reducing the linearization errors, this parameterization strategy removes the need for a separate feature initialization procedure, hence also reduces the system complexity, and eliminates the errors introduced in transferring the uncertainty from one parameterization to another. The experimental results demonstrate the superiority of the proposed approach to a hybrid CP-IDP scheme.
References
1. Montemerlo, M.: FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping. Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2003)
2. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Analysis and Machine Intelligence 29(6), 1052–1067 (2007)
3. Jin, H., Favaro, P., Soatto, S.: A Semi-Direct Approach to Structure from Motion. The Visual Computer 19(6), 377–394 (2003)
4. Eade, E., Drummond, T.: Scalable Monocular SLAM. In: CVPR 2006, pp. 469–476 (2006)
5. Durrant-Whyte, H., Bailey, T.: Simultaneous Localization and Mapping: Part I. IEEE Robotics and Automation Mag. 13(2), 99–110 (2006)
6. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual Modeling with a Hand-Held Camera. Intl. J. Computer Vision 59(3), 207–232 (2004)
7. Civera, J., Davison, A.J., Montiel, J.M.M.: Inverse Depth to Depth Conversion for Monocular SLAM. In: ICRA 2007, pp. 2778–2783 (2007)
8. Rosten, E., Drummond, T.: Fusing Points and Lines for High Performance Tracking. In: ICCV 2005, pp. 1508–1515 (2005)
9. Civera, J., Davison, A.J., Montiel, J.M.M.: Unified Inverse Depth Parameterization for Monocular SLAM. In: RSS 2006 (2006)
10. Davison, A.J.: Real-Time Simultaneous Localization and Mapping with a Single Camera. In: ICCV 2003, vol. 2, pp. 1403–1410 (2003)
An Efficient Linear Method for the Estimation of Ego-Motion from Optical Flow
Florian Raudies and Heiko Neumann
Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany
Abstract. Approaches to visual navigation, e.g. used in robotics, require computationally efficient, numerically stable, and robust methods for the estimation of ego-motion. One of the main problems for ego-motion estimation is the segregation of the translational and rotational component of ego-motion in order to utilize the translation component, e.g. for computing spatial navigation direction. Most of the existing methods solve this segregation task by means of formulating a nonlinear optimization problem. One exception is the subspace method, a well-known linear method, which applies a computationally high-cost singular value decomposition (SVD). In order to be computationally efficient, a novel linear method for the segregation of translation and rotation is introduced. For robust estimation of ego-motion the new method is integrated into the Random Sample Consensus (RANSAC) algorithm. Different scenarios show perspectives of the new method compared to existing approaches.
1 Motivation
For many applications visual navigation and ego-motion estimation is of prime importance. Here, processing starts with the estimation of optical flow, using a monocular spatio-temporal image sequence as input, followed by the estimation of ego-motion. Optical flow fields generated by ego-motion of the observer are getting more complex if one or multiple objects move independently of ego-motion. A challenging task is to segregate such independently moving objects (IMOs), where MacLean et al. proposed a combination of ego-motion estimation and the Expectation Maximization (EM) algorithm [15]. With this algorithm a single motion model is estimated for ego-motion and each IMO using the subspace method [9]. A key functionality of the subspace method is the possibility to cluster ego-motion and motion of IMOs. More robust approaches assume noisy flow estimates besides IMOs when estimating ego-motion with the EM algorithm [16,5]. Generally, the EM algorithm uses an iterative computational scheme, and in each iteration the evaluation of the method estimating ego-motion is required. This necessitates a computationally highly efficient algorithm for the estimation of ego-motion in real-time applications. So far, many of the ego-motion algorithms introduced in the past lack this computational efficiency.
Bruss and Horn derived a bilinear constraint to estimate ego-motion by utilizing a quadratic Euclidean metric to calculate errors between input flow and model flow [3]. The method is linear w.r.t. either translation or rotation and independent of depth. This bilinear constraint was used throughout the last two decades for ego-motion estimation: (i) Heeger and Jepson built their subspace method upon this bilinear constraint [9]. (ii) Chiuso et al. used a fix-point iteration to optimize between rotation (based on the bilinear constraint), depth, and translation [4], and Pauwels and Van Hulle used the same iteration mechanism optimizing for rotation and translation (both based on the bilinear constraint) [16]. (iii) Zhang and Tomasi as well as Pauwels and Van Hulle used a Gauss-Newton iteration between rotation, depth, and translation [20,17]. In detail, the method (i) needs a singular value decomposition, and the methods of (ii) and (iii) iterative optimization techniques. Here, a novel linear approach for the estimation of ego-motion is presented. Our approach utilizes the bilinear constraint, the basis of many nonlinear methods. Unlike these previous methods, here a linear formulation is achieved by introducing auxiliary variables. In turn, with this linear formulation a computationally efficient method is defined. Section 2 gives a formal description of the instantaneous optical flow model. This model serves as basis to derive our method outlined in Section 3. An evaluation of the new method in different scenarios and in comparison to existing approaches is given in Section 4. Finally, Section 5 discusses our method in the context of existing approaches [3,9,11,20,16,18] and Section 6 gives a conclusion.
2 Model of Instantaneous Ego-Motion
Von Helmholtz and Gibson introduced the definition of optical flow as moving patterns of light falling upon the retina [10,8]. Following this definition, Longuet-Higgins and Prazdny gave a formal description of optical flow which is based on a model of instantaneous ego-motion [13]. In their description they used a pinhole camera with the focal length f which projects 3-d points (X, Y, Z) onto the 2-d image plane, formally (x, y) = f/Z · (X, Y). Ego-motion composed of the translation T = (t_x, t_y, t_z)^t and rotation R = (r_x, r_y, r_z)^t causes the 3-d instantaneous displacement

(\dot{X} \;\; \dot{Y} \;\; \dot{Z})^t = -(t_x \;\; t_y \;\; t_z)^t - (r_x \;\; r_y \;\; r_z)^t \times (X \;\; Y \;\; Z)^t,

where dots denote the first temporal derivative and t the transpose operator. Using this model, movements of projected points on the 2-d image plane have the velocity

V := \begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{Z} \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix} T + \frac{1}{f} \begin{pmatrix} xy & -(f^2 + x^2) & f y \\ f^2 + y^2 & -xy & -f x \end{pmatrix} R.   (1)
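For illustration, the flow model of Equation (1) can be evaluated directly. The following sketch is an assumption-laden reimplementation, not code from the paper; it returns the model flow at a single image position and shows the expected expansion pattern for a purely forward translation.

```python
# Illustrative sketch: evaluate the instantaneous flow model of Eq. (1) at image
# position (x, y) for depth Z, translation T and rotation R.
import numpy as np

def model_flow(x, y, Z, T, R, f):
    """Return (u, v) predicted by the pinhole ego-motion model, Eq. (1)."""
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])                       # translational part, scaled by 1/Z
    B = np.array([[x * y, -(f**2 + x**2), f * y],
                  [f**2 + y**2, -x * y, -f * x]])      # rotational part, scaled by 1/f
    return (A @ T) / Z + (B @ R) / f

# Example: a purely forward translation produces a flow field that expands away from
# the image center (the focus of expansion).
if __name__ == "__main__":
    T, R, f = np.array([0.0, 0.0, 1.0]), np.zeros(3), 1.0
    print(model_flow(0.2, 0.1, 5.0, T, R, f))          # vector points away from the origin
```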
3 Linear Method for Ego-Motion Estimation
Input flow, e.g. estimated from a spatio-temporal image sequence, is denoted by Vˆ , while the model flow is defined as in Equation 1. Now, the problem is to find
parameters of the model flow which describe the given flow V̂ best. Namely, these parameters are the scenic depth Z, the translation T and rotation R. Based on Equation 1, many researchers studied non-linear optimization problems to estimate ego-motion [3,20,4,18]. Moreover, most of these methods have a statistical bias, which means that the methods produce systematic errors considering isotropic noisy input [14,16]. Unlike these approaches we suggest a new linearized form based on Equation 1 and show how to solve this form computationally efficiently with a new method. Further, this method can be unbiased. The new method is derived in three consecutive steps: (i) the algebraic transformation of Equation 1 which is independent of depth Z, (ii) a formulation of an optimization problem for translation and auxiliary variables, and (iii) the removal of a statistical bias. The calculation of rotation R with translation T known is then a simple problem.

Depth independent constraint equation. Bruss and Horn formulated an optimization problem with respect to depth Z which optimizes the squared Euclidean distance of the residual vector between the input flow vector V̂ = (û, v̂)^t and the model flow vector V defined in Equation 1. Inserting the optimized depth into Equation 1 they derived the so-called bilinear optimization constraint. An algebraic transformation of this constraint is

0 = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}^t \Bigg( \underbrace{\begin{pmatrix} f\hat{v} \\ -f\hat{u} \\ y\hat{u} - x\hat{v} \end{pmatrix}}_{=:M} - \underbrace{\begin{pmatrix} -(f^2 + y^2) & xy & fx \\ xy & -(f^2 + x^2) & fy \\ fx & fy & -(x^2 + y^2) \end{pmatrix}}_{=:H} \begin{pmatrix} r_x \\ r_y \\ r_z \end{pmatrix} \Bigg),   (2)

which Heeger and Jepson describe during the course of their subspace construction. In detail, they use a subspace which is orthogonal to the base polynomial defined by entries of the matrix H(x_i, y_i), i = 1..m, where m denotes the finite number of constraints employed [9].

Optimization of translation. Only a linearly independent part of the base polynomial H is used for optimization. We chose the upper triangular matrix together with the diagonal of matrix H. These entries are summarized in the vector E := (-(f^2 + y^2), xy, fx, -(f^2 + x^2), fy, -(x^2 + y^2))^t. To achieve a linear form of Equation 2, auxiliary variables (t_x r_x, t_x r_y, t_x r_z, t_y r_y, t_y r_z, t_z r_z)^t := K are introduced. With respect to E and K the linear optimization problem

F(\hat{V}; T, K(T)) := \int_{\Omega_x} [T^t M + K^t E]^2 \, dx \;\xrightarrow{\;T, K(T)\;}\; \min,   (3)
is defined, integrating constraints over all locations x = (x, y) ∈ Ω_x ⊂ ℝ² of the image plane. This image plane is assumed to be continuous and finite. Calculating partial derivatives of F(V̂; T, K(T)) and equating them to zero leads to the linear system of equations
0 = \int_{\Omega_x} [T^t M + K^t E] \cdot E^t \, dx,   (4)

0 = \int_{\Omega_x} [T^t M + K^t E] \cdot \Big[ M + \frac{\partial (K^t E)}{\partial T} \Big]^t dx,   (5)
consisting of nine equations and nine variables in K and T. Solving Equation 4 with respect to K and inserting the result, as well as the partial derivative for the argument T of expression K, into Equation 5 results in the homogeneous linear system of equations

0 = T^t \int_{\Omega_x} L_i L_j \, dx =: T^t C, \quad i, j = 1..3,   (6)

with

L_i := M_i - (D E)^t \int_{\Omega_x} E\, M_i \, dx, \quad i = 1..3, \qquad D := \Big[ \int_{\Omega_x} E E^t \, dx \Big]^{-1} \in \mathbb{R}^{6 \times 6}.
A robust (non-trivial) solution for such a system is given by the eigenvector which corresponds to the smallest eigenvalue of the 3 × 3 scatter matrix C [3].

Removal of statistical bias. All methods which are based on the bilinear constraint given in Equation 2 are statistically biased [9,11,14,18]. To calculate this bias we define an isotropic noisy input by the vector Ṽ := (û, v̂) + (n_u, n_v), with components n_u and n_v ∈ N(μ = 0, σ) normally distributed. A statistical bias is inferred by studying the expectation value ⟨·⟩ of the scatter matrix C̃. This scatter matrix is defined by inserting the noisy input flow Ṽ into Equation 6. This gives

\langle \tilde{C} \rangle = \langle C \rangle + \sigma^2 N \quad\text{with}\quad N = \begin{pmatrix} f & 0 & -f\langle x \rangle \\ 0 & f & -f\langle y \rangle \\ -f\langle x \rangle & -f\langle y \rangle & \langle x^2 + y^2 \rangle \end{pmatrix},   (7)
using the properties ⟨n_u⟩ = ⟨n_v⟩ = 0 and ⟨n_u²⟩ = ⟨n_v²⟩ = σ². Several procedures to remove the bias term σ²N have been proposed. For example, Kanatani suggested a method of renormalization, subtracting the bias term on the basis of an estimate of σ² [11]. Heeger and Jepson used dithered constraint vectors and defined a roughly isotropic covariance matrix with these vectors. MacLean used a transformation of constraints into a space where the influence by noise is isotropic [14]. Here, the last approach is used, due to its computational efficiency. In a nutshell, to solve Equation 6 considering noisy input we calculate the eigenvector which corresponds to the smallest eigenvalue of matrix C̃. Prewhitening of the scatter matrix C̃ gives Č := N^{-1/2} C̃ N^{-1/2}. Then the influence by noise is isotropic, namely σ²I, where I denotes a 3 × 3 unity matrix. The newly defined eigenvalue problem Čx = (λ + σ²)x preserves the ordering of λ and eigenvectors N^{-1/2}x compared to the former eigenvalue problem Cx = λx. Then the solution is constructed with the eigenvector of matrix Č which corresponds to the smallest eigenvalue. Finally, this eigenvector has to be multiplied by N^{-1/2}.
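A discrete version of the whole pipeline, Equations (2) to (7), fits in a few lines. The sketch below is the authors' method only in outline: the discretization (sums over flow vectors instead of integrals), the use of a Cholesky factor in place of the symmetric square root of N, and all names are choices made here for illustration.

```python
# Illustrative sketch of the proposed linear method: accumulate the constraint statistics
# over all flow vectors, remove the statistical bias by prewhitening, and read the
# translation direction off the smallest eigenvector of the scatter matrix.
import numpy as np

def estimate_translation(xs, ys, us, vs, f):
    """xs, ys: image coordinates relative to the principal point; us, vs: measured flow."""
    M = np.stack([f * vs, -f * us, ys * us - xs * vs], axis=1)          # n x 3, Eq. (2)
    E = np.stack([-(f**2 + ys**2), xs * ys, f * xs,
                  -(f**2 + xs**2), f * ys, -(xs**2 + ys**2)], axis=1)   # n x 6
    D = np.linalg.inv(E.T @ E)                    # discrete version of [sum E E^t]^-1
    L = M - E @ (D @ (E.T @ M))                   # L_i = M_i - (D sum(E M_i))^t E, Eq. (6)
    C = L.T @ L                                   # 3 x 3 scatter matrix
    # Bias removal by prewhitening, Eq. (7); any factor G with N = G G^t works here,
    # so a Cholesky factor is used in place of the symmetric square root.
    N = np.array([[f, 0.0, -f * np.mean(xs)],
                  [0.0, f, -f * np.mean(ys)],
                  [-f * np.mean(xs), -f * np.mean(ys), np.mean(xs**2 + ys**2)]])
    G_inv = np.linalg.inv(np.linalg.cholesky(N))
    w, V = np.linalg.eigh(G_inv @ C @ G_inv.T)    # eigenvalues in ascending order
    T = G_inv.T @ V[:, 0]                         # smallest eigenvalue -> translation
    return T / np.linalg.norm(T)
```

Because only the direction of T is observable from (6), the returned vector is normalized; the rotation R can afterwards be obtained linearly from Equation (2) with T fixed, as noted above.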
4 Results
To test the proposed method for ego-motion estimation in different configurations we use two sequences, the Yosemite sequence¹ and the Fountain sequence². In the Yosemite sequence a flight through a valley is simulated, specified by T = (0, 0.17, 0.98) · 34.8 px and R = (1.33, 9.31, 1.62) · 10⁻² deg/frame [9]. In the Fountain sequence the curvilinear motion with T = (−0.6446, 0.2179, 2.4056) and R = (−0.125, 0.20, −0.125) deg/frame is performed. The (virtual) camera employed to gather images has a vertical field of view of 40 deg and a resolution of 316 × 252 for the Yosemite sequence and 320 × 240 for the Fountain sequence. All methods included in our investigation have a statistical bias, which is removed with the technique of MacLean [14]. The iterative method of Pauwels and Van Hulle [18] employs a fix-point iteration mechanism using a maximal number of 500 iterations and 15 initial values for the translation direction, randomly distributed on the positive hemisphere [18].

Numerical stability. To show numerical stability we use the scenic depth of the Fountain sequence (5th frame) with a quarter of the full resolution to test different ego-motions. These ego-motions are uniformly distributed in the range of ±40 deg azimuth and elevation in the positive hemisphere. Rotational components for pitch and yaw are calculated by fixating the central point and compensating translation by rotation. An additional roll component of 1 deg/frame is superimposed. With scenic depth values and ego-motion given, optical flow is calculated by Equation 1. This optical flow is systematically manipulated by applying two different noise models: a Gaussian and an outlier noise model. The Gaussian noise model was specified in Section 3. In the outlier noise model a percentage, denoted by ρ, of all flow vectors is replaced by randomly constructed vectors. Each component of such a vector is drawn from a uniformly distributed random variable. The interval of this distribution is defined by the negative and positive of the mean length of all flow vectors. The outlier noise model represents sparsely distributed gross errors, e.g. caused by correspondences that were incorrectly estimated. Applying a noise model to the input flow, the estimation of ego-motion becomes erroneous. These errors are reported by: (i) the angular difference between two translational 3-d vectors, where one is the estimated vector and the other the ground-truth vector, and (ii) the absolute value of the difference for each rotational component. Again, differences are calculated between estimate and ground-truth. Figure 1 shows errors of ego-motion estimation applying the Gaussian and the outlier noise model. All methods show numerical stability, whereas the mean translational error is lower than approximately 6 deg for both noise models. The method of Pauwels and Van Hulle performs best compared to the other methods. Better performance is assumed to be achieved by employing numerical fix-point iteration with different initial values randomly chosen within the search space.
1 Available via anonymous ftp from ftp.csd.uwo.ca in the directory pub/vision.
2 Provided at http://www.informatik.uni-ulm.de/ni/mitarbeiter/FRaudies.
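The two perturbation models and the translational error measure described above can be written as follows. This is an illustrative sketch; representing a flow field as an n × 2 array and interpreting σ as a fraction of the image height are assumptions made here, not the authors' exact implementation.

```python
# Illustrative sketch of the Gaussian and outlier noise models and of the angular
# error between translation estimates.
import numpy as np

def add_gaussian_noise(flow, sigma_rel, image_height, rng):
    """Add isotropic Gaussian noise with std sigma_rel * image_height to every flow vector."""
    return flow + rng.normal(0.0, sigma_rel * image_height, size=flow.shape)

def add_outlier_noise(flow, rho, rng):
    """Replace a fraction rho of the vectors by uniform samples in [-m, m], m = mean length."""
    out = flow.copy()
    m = np.mean(np.linalg.norm(flow, axis=1))
    idx = rng.random(len(flow)) < rho
    out[idx] = rng.uniform(-m, m, size=(np.count_nonzero(idx), 2))
    return out

def translation_angle_deg(t_est, t_gt):
    """Angular difference between estimated and ground-truth 3-d translation vectors."""
    c = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```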
[Figure 1: four panels a) to d) plot the mean and the standard deviation of the angular error [°] against Gaussian noise σ [%] and outlier noise ρ [%] for the proposed method, Kanatani (1993), and Pauwels & Van Hulle (2006).]
Fig. 1. All methods employed show numerical stability in the presence of noise due to small translational and rotational errors (not shown). In detail, a) shows the mean angular error for Gaussian noise and c) for outlier noise. Graphs in b) and d) show the corresponding standard deviation, respectively. The parameter σ is specified with respect to the image height. Mean and standard deviation are calculated for a number of 50 trials. Table 1. Errors for estimated optical and ground-truth input flow of the proposed method. In case of the Yosemite sequence which contains the independently moving cloudy sky the RANSAC paradigm is employed which improves ego-motion estimates (50 trials, mean and ± standard deviation shown). (x) denotes the angle calculated between estimated and ground-truth 3-d translational vectors.
sequence             translation ∠(T_est, T_gt) [deg]   |Δr_x| [deg]          |Δr_y| [deg]          |Δr_z| [deg]
estimated optical flow; Brox et al. [2]; 100% density
Fountain             4.395                              0.001645              0.0286                0.02101
Yosemite             4.893                              0.02012               0.1187                0.1153
estimated optical flow; Farnebaeck [6]; 100% density
Fountain             6.841                              0.01521               0.05089               0.025
Yosemite             4.834                              0.03922               0.00393               0.07636
estimated optical flow; Farnebaeck [6]; 25% density
Fountain             1.542                              0.0008952             0.01349               0.003637
Yosemite             1.208                              0.007888              0.01178               0.02633
Yosemite (RANSAC)    1.134 ± 0.2618                     0.01261 ± 0.002088    0.008485 ± 0.002389   0.02849 ± 0.003714
ground-truth optical flow; 25% of full resolution
Fountain             0.0676                             0.000259              8.624e-006            0.0007189
Yosemite             5.625                              0.02613               0.1092                0.06062
Yosemite (RANSAC)    1.116 ± 1.119                      0.01075 ± 0.01021     0.004865 ± 0.006396   0.02256 ± 0.009565
Estimated optical flow as input. We test our method on the basis of optical input flow estimated by two different methods. First, we utilize the tensor-based method of Farnebaeck together with an affine motion model [6] to estimate optical flow. The spatio-temporal tensor is constructed by projecting the input signal to a set of base polynomials of finite Gaussian support (σ = 1.6 px and length l = 11 px, γ = 1/256). Spatial
averaging of the resulting tensor components is performed with a Gaussian filter (σ = 6.5 px and l = 41 px). Second, optical flow is estimated with the affine warping technique of Brox et al. [2]. Here, we implemented the 2-d version of the algorithm and used the following parameter values: α = 200, γ = 100, ε = 0.001, σ = 0.5, η = 0.95, with 77 outer fix-point iterations and 10 inner fix-point iterations. To solve the partial differential equations, the numerical method Successive Over-Relaxation (SOR) with parameter ω = 1.8 and 5 iterations is applied. Errors of optical flow estimation are reported by the 3-d mean angular error defined by Barron et al. [1]. According to this measure, optical flow is estimated for frame pair 8–9 (counting from index 0) of the Yosemite sequence with 5.41 deg accuracy for the method of Farnebaeck and with 3.54 deg for the method of Brox et al. For frame pair 5–6 of the Fountain sequence the mean angular error is 2.49 deg when estimating flow with Farnebaeck's method and 2.54 deg with the method of Brox et al. All errors refer to a density of 100% of the optical flow data. Table 1 lists the errors of ego-motion estimation for different scenarios. Comparing the first two parts of the table, we conclude that high accuracy of the optical flow estimates does not necessarily translate into high accuracy of the estimated ego-motion. In detail, the error of ego-motion estimation depends on the error characteristic (spatial distribution and magnitude of errors) within the estimated optical flow field. This characteristic, however, is not expressed by the mean angular error. One way to reduce the dependency on the error characteristic is to reduce the data set, leaving out the most erroneous data points. In general, this requires (i) an appropriate confidence measure to evaluate the validity or reliability of data points, and (ii) a strategy to avoid linear dependency in the remaining data w.r.t. ego-motion estimation. Farnebaeck describes how to calculate a confidence value in his thesis [6]. Here, this confidence is used to thin out the flow estimates; we retain 25% of all estimates, enough to avoid linear dependency for our configurations. The errors of ego-motion estimation are then reduced, as can be observed in the third part of Table 1. In the case of the Yosemite sequence, sparsification has a helpful side effect. The cloud motion is estimated by the method of Farnebaeck with low accuracy and low confidence. Thus, no estimates originating from the cloudy sky are contained in the data set used for the estimation of ego-motion. In the last part of Table 1, ground-truth optical flow is used to estimate ego-motion. In this case, the cloudy sky is present in the data set and deflects the estimates of ego-motion; e.g. the translational angular error amounts to 5.6 deg. To handle IMOs we use the RANSAC algorithm [7]. In a nutshell, the idea of the algorithm is to base the estimate on non-erroneous data points only. Therefore, initial estimates are computed on different randomly selected subsets of all data points, which are then enlarged with further consistent data points. The algorithm stops when an estimate is found that is supported by a data set of a certain cardinality. For the ground-truth flow of the Yosemite sequence, this method is successful in estimating ego-motion; the translational angular error now amounts to 1.116 deg (mean value).
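The way RANSAC is used here to suppress flow vectors stemming from IMOs such as the cloudy sky can be sketched as follows (an illustrative sketch only; estimate_egomotion, residual and the thresholds are placeholders, not the estimator and tuning actually used in the experiments):

import numpy as np

def ransac_egomotion(points, flows, estimate_egomotion, residual,
                     n_sample=20, n_trials=100, thresh=0.05, min_support=0.5):
    # points, flows        : arrays of image positions and optical flow vectors.
    # estimate_egomotion() : hypothetical estimator returning (T, R) from a subset.
    # residual()           : consistency of one flow vector with a hypothesis (T, R).
    best, best_inliers = None, None
    n = len(points)
    for _ in range(n_trials):
        idx = np.random.choice(n, n_sample, replace=False)
        T, R = estimate_egomotion(points[idx], flows[idx])       # initial estimate on a random subset
        err = np.array([residual(T, R, points[i], flows[i]) for i in range(n)])
        inliers = np.flatnonzero(err < thresh)                   # enlarge with consistent data points
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
            best = estimate_egomotion(points[inliers], flows[inliers])
        if len(inliers) >= min_support * n:                      # support large enough: stop
            break
    return best, best_inliers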
5
Discussion
A novel linear optimization method was derived to solve the segregation of the translational and rotational components, one of the main problems in computational ego-motion estimation [3,9,13].
Related work. A well-known linear method for ego-motion estimation is the subspace method [9]. Unlike our method, Heeger and Jepson used a subspace independent of the rotational part for the estimation of translation, using only m − 6 of m constraints. In the method proposed here, all constraints are used, which leads to more robust estimates. Zhuang et al. formulated a linear method for the segregation of translation and rotation employing the instantaneous motion model together with the epipolar constraint [21]. They introduced auxiliary variables, as a superposition of translation and rotation, then optimized w.r.t. these variables and translation. In a last step they reconstructed rotation from the auxiliary variables. Unlike their method, we used the bilinear constraint for optimization, defined the auxiliary variables differently, split up the optimization for rotation and translation, and finally had to solve only a 3 × 3 eigenvalue problem for translation estimation, instead of the 9 × 9 eigenvalue problem of Zhuang's approach. Moreover, applying this different optimization strategy allowed us to incorporate the method of MacLean to remove a statistical bias, which is not possible for the method of Zhuang.
Complexity. To achieve real-time capability in applications, a low computational complexity is vital. Existing methods for the estimation of ego-motion have a higher complexity than our method (compare Table 2). For example, [9] employs a singular value decomposition of an m × 6 matrix, and other approaches employ iterative schemes to solve nonlinear optimization problems [4,18,20]. Comparable to our method in terms of computational complexity is the method of Kanatani [11]. Unlike our approach, this method is based on the epipolar constraint.

Table 2. Average (1000 trials) computing times [msec] of methods estimating ego-motion, tested with a C++ implementation on a Windows XP platform, Intel Core 2 Duo T9300. (∗) This algorithm employs a maximum of 500 iterations and 15 initial values.

method                                      number of vectors
                                            25      225     2,025    20,164   80,089
new proposed method (unbiased)              0.05    0.06    0.34     4.56     22.16
Kanatani (unbiased)                         0.03    0.11    0.78     7.56     29.20
Heeger & Jepson (unbiased)                  0.08    2.44    399.20   n.a.     n.a.
Pauwels & Van Hulle, 2006 (unbiased)(∗)     0.16    0.81    6.90     66.87    272.95

Numerical stability. We showed that the optimization method is robust against noise, compared to other ego-motion algorithms [11,18]. Furthermore, the technique of pre-whitening is applied to our method to remove a statistical bias
as well. This technique was proposed by MacLean [14] for bias removal in the subspace algorithm of Heeger and Jepson [9] and was used by Pauwels and Van Hulle in their fix-point iteration, which iterates between coupled estimates of the translation and rotation of ego-motion [17]. Unlike other unbiasing techniques, MacLean's technique requires neither an estimate of the noise characteristic nor an iterative mechanism. With the statistical bias removed, the methods are consistent in the sense of Zhang and Tomasi's definition of consistency [20].
Outlier detection. To detect outliers in ego-motion estimation, in particular IMOs, several methods have been suggested, namely frameworks employing the EM algorithm [15,5], the Collinear Point Constraint [12] and the RANSAC algorithm [19]. In accordance with the conclusion of Torr's thesis, which found that the RANSAC algorithm performs best in motion segmentation and outlier detection, we chose RANSAC to achieve robust ego-motion estimation.
6
Conclusion
In summary, we have introduced a novel method for the separation of translation and rotation in the computation of ego-motion. Due to its simplicity the method has a very low computational complexity and is thus faster than existing estimation techniques (Table 2). First, we tested our method with a computed optical flow field, where ego-motion can be estimated exactly. Under noisy conditions the results show the numerical stability of the optimization method and its comparability with existing methods for the estimation of ego-motion. In more realistic scenarios utilizing estimated optical flow, ego-motion can be estimated with high accuracy. Future work will employ temporal integration of ego-motion estimates within the processing of an image sequence. This should stabilize ego-motion and optical flow estimation by exploiting the spatio-temporal coherence of the visually observable world.
Acknowledgements Stefan Ringbauer kindly provided a computer graphics ray-tracer utilized to generate images and ground-truth flow for the Fountain sequence. This research has been supported by a scholarship given to F.R. from the Graduate School of Mathematical Analysis of Evolution, Information and Complexity at Ulm University.
References 1. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. Int. J. of Comp. Vis. 12(1), 43–77 (1994) 2. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
3. Bruss, A.R., Horn, B.K.P.: Passive navigation. Comp. Vis., Graph., and Im. Proc. 21, 3–20 (1983) 4. Chiuso, A., Brockett, R., Soatto, S.: Optimal structure from motion: Local ambiguities and global estimates. Int. J. of Comp. Vis. 39(3), 195–228 (2000) 5. Clauss, M., Bayerl, P., Neumann, H.: Segmentation of independently moving objects using a maximum-likelihood principle. In: Lafrenz, R., Avrutin, V., Levi, P., Schanz, M. (eds.) Autonome Mobile Systeme 2005, Informatik Aktuell, pp. 81–87. Springer, Berlin (2005) 6. Farnebaeck, G.: Polynomial expansion for orientation and motion estimation. PhD thesis, Dept. of Electrical Engineering, Linkoepings universitet (2002) 7. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24(6), 381–395 (1981) 8. Gibson, J.J.: The Perception of the Visual World. Houghton Mifflin, Boston (1950) 9. Heeger, D.J., Jepson, A.D.: Subspace methods for recovering rigid motion i: Algorithm and implementation. Int. J. of Comp. Vis. 7(2), 95–117 (1992) 10. Helmholtz, H.: Treatise on physiological optics. In: Southhall, J.P, (ed.) (1925) 11. Kanatani, K.: 3-d interpretation of optical-flow by renormalization. Int. J. of Comp. Vis. 11(3), 267–282 (1993) 12. Lobo, N.V., Tsotsos, J.K.: Computing ego-motion and detecting independent motion from image motion using collinear points. Comp. Vis. and Img. Underst. 64(1), 21–52 (1996) 13. Longuet-Higgins, H.C., Prazdny, K.: The interpretation of a moving retinal image. Proc. of the Royal Soc. of London. Series B, Biol. Sci. 208(1173), 385–397 (1980) 14. MacLean, W.J.: Removal of translation bias when using subspace methods. IEEE Int. Conf. on Comp. Vis. 2, 753–758 (1999) 15. MacLean, W.J., Jepson, A.D., Frecker, R.C.: Recovery of egomotion and segmentation of independent object motion using the EM algorithm. Brit. Mach. Vis. Conf. 1, 175–184 (1994) 16. Pauwels, K., Van Hulle, M.M.: Segmenting independently moving objects from egomotion flow fields. In: Proc. of the Early Cognitive Vision Workshop (ECOVISION 2004), Isle of Skye, Scotland (2004) 17. Pauwels, K., Van Hulle, M.M.: Robust instantaneous rigid motion estimation. Proc. of Comp. Vis. and Pat. Rec. 2, 980–985 (2005) 18. Pauwels, K., Van Hulle, M.M.: Optimal instantaneous rigid motion estimation insensitive to local minima. Comp. Vis. and Im. Underst. 104(1), 77–86 (2006) 19. Torr, P.H.S.: Outlier Detection and Motion Segmentation. PhD thesis, Engineering Dept., University of Oxford (1995) 20. Zhang, T., Tomasi, C.: Fast, robust, and consistent camera motion estimation. Proc. of Comp. Vis. and Pat. Rec. 1, 164–170 (1999) 21. Zhuang, X., Huang, T.S., Ahuja, N., Haralick, R.M.: A simplified linear optic flowmotion algorithm. Comp. Graph. and Img. Proc. 42, 334–344 (1988)
Localised Mixture Models in Region-Based Tracking
Christian Schmaltz1, Bodo Rosenhahn2, Thomas Brox3, and Joachim Weickert1
1 Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Building E1 1, Saarland University, 66041 Saarbrücken, Germany
{schmaltz,weickert}@mia.uni-saarland.de
2 Leibniz Universität Hannover, 30167 Hannover, Germany
[email protected] 3 University of California, Berkeley, CA, 94720, USA
[email protected]
Abstract. An important problem in many computer vision tasks is the separation of an object from its background. One common strategy is to estimate appearance models of the object and background region. However, if the appearance is spatially varying, simple homogeneous models are often inaccurate. Gaussian mixture models can take multimodal distributions into account, yet they still neglect the positional information. In this paper, we propose localised mixture models (LMMs) and evaluate this idea in the scope of model-based tracking by automatically partitioning the fore- and background into several subregions. In contrast to background subtraction methods, this approach also allows for moving backgrounds. Experiments with a rigid object and the HumanEva-II benchmark show that tracking is remarkably stabilised by the new model.
1
Introduction
In many image processing tasks such as object segmentation or tracking, it is necessary to distinguish between the region of interest (foreground) and its background. Common approaches, such as MRFs or active contours, build appearance models of both regions, with their parameters being learnt either from a-priori data or from the images [1,2,3]. Various types of features can be used to build the appearance model. Most common are brightness and colour, but any dense feature set such as texture descriptors [4] or motion [5] can be part of the model. Apart from the considered features, the statistical model of the region is of great interest. In simple cases, one assumes a Gaussian distribution in each region. However, since object regions usually change their appearance locally, such a Gaussian model is too inaccurate. A typical example is the black and white stripes of a zebra, which lead to a Gaussian distribution with a grayish mean
We gratefully acknowledge funding by the German Research Foundation (DFG) under the project We 2602/5-1.
Fig. 1. Left: Illustrative examples of situations where object (to be further specified by a shape prior) and background region are not well modelled by identically distributed pixels. In (a), red points are more likely in the background. Thus, the hooves of the giraffe will not be classified correctly. In (b), the dark hair and parts of the body are more likely to belong to the background. Localised distributions can model these cases more accurately. Right: Object model used by the tracker in one of our experiments (c) and decomposition of the object model into three different components (d), as proposed by the automatic splitting algorithm from [6]. There are 22 joint angles in the model, resulting in a total of 28 parameters that must be estimated.
that describes neither the black nor the white part very well. In order to deal with such cases, Gaussian mixture models or kernel density models have been proposed. These models are much more general, yet they still impose the assumption of identically distributed pixels in each region, i.e., they ignore positional information. The left part of Fig. 1 shows two examples where this is insufficient. In contrast, a model which is sensitive to the location in the image was proposed in [7]. The region statistics are estimated for each point separately, thereby considering only information from the local neighbourhood. Consequently, the distribution varies smoothly within a region. A similar local statistical model was used in [8]. A drawback of this model is that it blurs across discontinuities inside the region. As the support of the neighbourhood needs to be sufficiently large to reliably estimate the parameters of the local distributions, this blurring can be quite significant. This is especially true when using local kernel density models, which require more data than a local Gaussian model. The basic idea in the present paper is to segment the regions into subregions inside which a statistical model can be estimated. Similar to the above local region statistics, the distribution model integrates positional information. The support for estimating the distribution parameters is usually much larger, though, as it considers all pixels from the subregion. Splitting the background into subregions and employing a kernel density estimator in each of those allows for a very precise region model relying on enough data for parameter estimation. Related to this concept are Gaussian mixture models in the context of background subtraction. Here, the mixture parameters are not estimated in a spatial neighbourhood but from data along the temporal axis. This leads to models which include very accurate positional information [9]. In [10], an overview of several possible background models ranging from very simple to complex models
is given. The learned statistics from such models can also be combined with a conventional spatially global model, as proposed in [11]. For background subtraction, however, the parameters are learned in advance, i.e., a background image or images with little motion and without the object must be available. Such limitations are not present in our approach. In fact, our experiments show that background subtraction and the proposed localised mixture model (LMM) are in some sense complementary and can be combined to improve results in tracking. Also note that, in contrast to image labelling approaches that also split the background into different regions, such as [12], no learning step is necessary. A general problem that arises when making statistical models more and more precise is the increasing number of local optima in the corresponding cost functions. In Fig. 1 there is actually no reason to assign the red hooves to the giraffe region or the black hair to the person. A shape prior and/or a close initialisation of the contour is required to properly define the object segmentation problem. For this reason we focus in this paper on the field of model-based tracking, where both a shape model and a good initial separation into foreground and background can be derived from the previous frame. In particular, we evaluated the model in silhouette-based 3-D pose tracking, where pose and deformation parameters of a 3-D object model are estimated such that the image is optimally split into object and background [13,6]. The model is generally applicable to any other contour-based tracking method as well. Another possible field of application is semi-supervised segmentation, where the user can incrementally improve the segmentation by manually specifying some parts of the image as foreground or background [1]. This can resolve the above ambiguities as well. Our paper is organised as follows: We first review the pose tracking approach used for evaluation. We then explain the localised mixture model (LMM) in Section 3. While the basic approach only works with static background images, we remove this restriction later in a more general approach. After presentation of our experimental data in Section 4, the paper is concluded in Section 5.
2
Foreground-Background Separation in Region-Based Pose Tracking
In this paper, we focus on tracking an articulated free-form surface consisting of rigid parts interconnected by predefined joints. The state vector χ consists of the global pose parameters (3-D shift and rotation) as well as n joint angles, similar to [14]. The surface model is divided into l different (not necessarily connected) components Mi , i = 1, . . . , l, as illustrated in Fig. 1. The components are chosen such that each component has a uniform appearance that differs from other components, as proposed in [6]. There are many more tracking approaches than the one presented here. We refer to the surveys [15,16] for an overview. Given an initial pose, the primary goal is to adapt the state vector such that the projections of the object parts lead to maximally homogeneous regions in the image. This is stated by the following cost function which is sought to be minimised in each frame:
Fig. 2. Example of a background segmentation. From left to right: (a) Background image. (b,c) K-means clustering with three and six clusters. (d,e) Level set segmentation with two different parameter settings.
E(χ) = − Σ_{i=0}^{l} ∫_Ω v_i(χ, x) P_{i,χ}(x) log p_{i,χ}(x) dx,   (1)
where Ω denotes the image domain. The appearance of each component i and of the background (i = 0) is modelled by a probability density function (PDF) p_i, i ∈ {0, . . . , l}. The PDFs of the object parts are modelled as kernel densities, whereas we will use the LMM for modelling the background, as explained later. P_{i,χ} is the indicator function for the projection of the i-th component Mi, i.e. P_{i,χ}(x) is 1 if and only if a part of the object with pose χ is projected to the image point x. In order to take occlusion into account, v_i(χ, x) : ℝ^{6+n} × Ω → {0, 1} is a visibility function that is 1 if and only if the i-th object part is not occluded by another part of the object in the given pose. Visibility can be computed efficiently using OpenGL. The cost function can be minimised locally by a modified gradient descent. The PDFs are evaluated at silhouette points x_i of each projected model component. These points x_i are then moved along the normal direction of the projected object, either towards or away from the components, depending on which region's PDF fits better at that particular point. The point motion is transferred to the corresponding change of the state vector by using a point-based pose estimation algorithm as described, e.g., in [7].
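For illustration, a discretised evaluation of the cost function (1) could look as follows (a sketch only; the actual implementation evaluates the PDFs merely at silhouette points during the gradient descent, and all names are illustrative):

import numpy as np

def region_energy(P, v, logp):
    # P[i]   : binary mask, 1 where component i (i = 0 is the background)
    #          is projected for the current state vector chi.
    # v[i]   : binary visibility mask of component i (1 = not self-occluded).
    # logp[i]: per-pixel log-probability log p_i(I(x)) under the PDF of region i.
    # Discrete version of E(chi) = -sum_i integral v_i P_i log p_i dx.
    E = 0.0
    for Pi, vi, lpi in zip(P, v, logp):
        E -= np.sum(vi * Pi * lpi)
    return E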
3
Localised Mixture Models
In the above approach, the object region is described very accurately by the object model, which is split into various parts that are similar in their appearance. Hence, the local change of appearance within the object region is taken well into account. The background region, however, is represented by a single appearance model, and positional changes of this appearance are so far neglected. Consider a red-haired person facing the camera while standing on a red carpet. Then, only a very small part of the person is red, compared to a large part of the background. As a larger percentage of the pixels lying outside the person are red, red pixels will be classified as belonging to the outside region. Thus, the hair will be considered as not being part of the object, which deteriorates tracking. This happens despite the fact that the carpet is far away from the hair.
The idea to circumvent this problem is to separate the background into multiple subregions, each of which is modelled by its own PDF. This can be regarded as a mixture of PDFs, yet the mixture components exploit positional information telling where the separate mixture components are to be applied. 3.1
Case I: Static Background Image Available
If a static background image is available, segmenting the background is quite simple. In contrast to the top-level task of object-background separation, the regions need not necessarily correspond to objects in the scene. Hence, virtually any multi-region segmentation technique can be applied for this. We tested a very simple one, the K-means algorithm [17,18], and a more sophisticated level set based segmentation, which considers multiple scales and includes a smoothness prior on the contour [19]. In the K-means algorithm the number of clusters is fixed, whereas the level set approach optimises the number of regions by a homogeneity criterion, which is steered by a tuning parameter. Thus, the number of subregions can vary. Fig. 2 compares the segmentation output of these two methods for two different parameter settings. The results with the level set method are much smoother due to the boundary length constraint. In contrast, the regions computed with K-means have more fuzzy boundaries. This can be disadvantageous, particularly when the localisation of the model is not precise due to a moving background, as considered in the next section.
After splitting the background image into subregions, a localised PDF can be assembled from the PDFs estimated in each subregion j. Let L(x, y) denote the labelling obtained by the segmentation; we then obtain the density

p(x, y, s) = p_{L(x,y)}(s),   (2)

where s is any feature used for tracking. It makes most sense to use the same density model for the subregions as used in the segmentation method. In the case of K-means this means a Gaussian distribution with fixed variance:

p_j^{kmeans}(s) ∝ exp( −(s − μ_j)² / 2 ),   (3)

where μ_j is the cluster centre of cluster j. The level set based segmentation method is built upon a kernel density estimator

p_j^{levelset}(s) = K_σ ∗ ( Σ_{(x,y)∈Ω_j} δ(s, I(x, y)) / |Ω_j| ),   (4)

where δ is the Dirac delta distribution and K_σ is a Gaussian kernel with standard deviation σ. Here, we use σ = √30. The PDF in (2) can simply be plugged into the energy in (1). Note that this PDF needs to be estimated only once for the background image and then stays fixed, whereas the PDFs of the object parts are reestimated in each frame to account for the changing appearance.
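The construction of such a localised mixture model from a background image can be summarised in a few lines (a sketch under simplifying assumptions: K-means on colour only, one Gaussian with fixed variance per subregion as in (3); library calls and names are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def build_lmm(background, k=3):
    # Label every background pixel by K-means clustering of its colour.
    h, w, c = background.shape
    features = background.reshape(-1, c).astype(float)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features).reshape(h, w)
    # One cluster centre mu_j per subregion defines the Gaussian in (3).
    centres = np.array([features[labels.reshape(-1) == j].mean(axis=0) for j in range(k)])
    return labels, centres

def background_logpdf(x, y, s, labels, centres):
    # Localised density (2)/(3): look up the subregion of pixel (x, y) and
    # evaluate the fixed-variance Gaussian around its cluster centre (up to a constant).
    j = labels[y, x]
    return -0.5 * np.sum((np.asarray(s, float) - centres[j]) ** 2)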
3.2
Case II: Potentially Varying Background
For some scenarios, generating a static background image is not possible. In outdoor scenarios, for example, the background usually changes due to moving plants or people passing by. Even inside buildings, the lighting conditions – and thus the background – typically vary. Furthermore, the background can vary due to camera motion. In fact, varying backgrounds appear in many applications and render background subtraction methods impossible. In general, the background changes only slowly between two consecutive frames. This can be exploited to extend the described approach to non-static images or to images where the object is already present. Excluding the current object region from the image domain, the remainder of the image can be segmented as before. This is shown in Fig. 5. To further deal with slow changes in the background, the segmentation can also be recomputed in each new frame. This takes changes in the localisation or in the statistics into account. A subtle difficulty appearing in this case is that there may be parts of the background not available in the density model because these areas were occluded by the object in the previous frame. When reestimating the pose parameters of the object model, the previously occluded part can appear and needs some treatment. In such a case we choose the nearest available neighbour and use the probability density of the corresponding subregion. That is, if Ω_j is the j-th subregion as computed by the segmentation step, the local mixture density is

p(x, y, s) = p_{j*(x,y)}(s)   with   j*(x, y) = argmin_j dist((x, y), Ω_j).   (5)

4
Experiments
We evaluated the described region statistics on sequence S4 of the HumanEva-II benchmark [20]. For this sequence, a total of four views as well as static background images are available. Thus, this sequence allows us to compare the variant that uses a static background image to the version without the need for such an image. The sequence shows a man walking in a circle for approximately 370 frames, followed by a jogging part from frame 370 to 780, and finally a "balancing" part until frame 1200. Ground truth marker data is available for this sequence, and tracking errors can be evaluated via an online interface provided by Brown University. Note that the ground truth data between frames 299 and 334 is not available; this part is therefore ignored in the evaluation. In the figures, we plotted a linear interpolation between frames 298 and 335. Table 1 shows some statistics over tracking results with different models. The first line in the table shows an experiment in which background subtraction was used to find approximate silhouettes of the person to be tracked. These silhouette images are used as additional features, i.e. in addition to the three channels of the CIELAB colour space, for computing the PDFs of the different regions. This approach corresponds to the one in [6]. Results are improved when using the LMM based on level set segmentation. This can be seen by comparing the first
Fig. 3. PDFs estimated for the CIELAB colour channels of the subregions shown in Fig. 5. Each colour corresponds to one region. From left to right: lightness channel, A channel and B channel. Top: estimated PDFs when using the level-set-based segmentation. Bottom: estimated PDFs when computing the subregions using K-means. Due to the smoothness term, the region boundaries are smoother, resulting in PDFs that are separated less clearly with the level-set-based method than with the K-means algorithm. Nevertheless, the level set approach performed better in the quantitative evaluation.
and third line of the table. The best results are achieved when using both the silhouette images and the LMM (fifth line). The level set based LMM yields slightly better results than K-means clustering. See Fig. 4 for a tracking curve illustrating the error per frame for the best of these experiments. Fig. 5 shows segmentation results without using the background image, hence dropping the assumption of a static background. Fig. 3 visualises the estimated PDFs for each channel in each subregion. Aside from some misclassified pixels close to the occluded area (due to tracking inaccuracies, and due to the fact that a human cannot be perfectly modelled by a kinematic chain), the background is split into reasonable subparts and yields a good LMM. Tracking is almost as good as the combination with background subtraction, as indicated by the lower part of Table 1, without requiring a strictly static background any more. The same setting with a global Parzen model fails completely, as depicted in Fig. 4, since fore- and background are too similar for tracking at some places. In order to verify the true applicability of the LMM in the presence of non-static backgrounds, we tracked a tea box in a monocular sequence with a partially moving background. Neither ground truth nor background images are available for this sequence, making background subtraction impossible. As expected, the LMM can handle the moving background very well. When using only the Parzen model for the background, a 90° rotation of the tea box is missed by the tracker, as shown in the left part of the lower row in Fig. 6. If we add Gaussian noise with standard deviation 10, the Parzen model completely fails (right part of the lower row of Fig. 6), while tracking still works when using the LMM.
Table 1. Comparison of different tracking versions for sequence S4 of the HumanEva-II benchmark as reported by the automatic evaluation script. Each line shows the model used for the background region, if images of the backgrounds were used, the average tracking error in millimetre, its variance and its maximum, as well as the total time used for tracking all 1200 frames.

Model                               BG image   Avg. error   Variance    Max.    Time
Parzen model + BG subtraction       yes        46.16        276.81      104.0   4h 31m
LMM (K-means)                       yes        49.63        473.90      114.2   4h 34m
LMM (level set segmentation)        yes        42.18        157.31      93.6    4h 22m
BG subtraction + LMM (K-means)      yes        42.96        178.19      92.6    4h 27m
BG subtraction + LMM (LS segm.)     yes        41.64        153.94      83.8    4h 29m
Parzen model                        no         451.11       24059.41    728.4   5h 12m
LMM (K-means)                       no         52.64        588.66      162.7   9h 19m
LMM (level set segmentation)        no         49.94        168.61      111.2   19h 9m
(Plot annotations of Fig. 4: the walking, jogging and balancing phases of the sequence, with curves for the Parzen model and the localised mixture model.)
Fig. 4. Tracking error per frame of some tracking results of sequence S4 from the HumanEva-II dataset, automatically evaluated. Left: LMM where background subtraction information is supplemented as an additional feature channel. This plot corresponds to the fifth line in Table 1. Right: Global kernel density estimator (red) and LMM (blue). Here, we did not use the background images or any information derived from them. These plots correspond to the last (blue) and third last (red) line of Table 1.
Fig. 5. Segmentation results for frame 42 as seen from camera 3. Leftmost: input image of frame 42 of the HumanEva-II sequence S4. Left: object model projected into the image. The different colours indicate the different model components. Right: segmentation with the level-set-based method and with K-means using 3 regions. The white part is the area occluded by the tracked object, i.e. the area removed from the segmentation process. Every other colour denotes a region. Although no information from the background image was used, the segmentation results still look good.
Fig. 6. Experiment with varying background. Upper row: Model of the tea box to be tracked, input image with initialisation in first frame, and tracking results for frame 50, 150 and 180. Lower row: Input image (frame 90), result when using LMM, result with Parzen model, and results with Gaussian noise with LMM and the Parzen model.
5
Summary
We have presented a localised mixture model that splits the region whose appearance is to be estimated into distinct subregions. The appearance of the region is then modelled by a mixture of densities, each applied in its local vicinity. For the partitioning step, we tested a fast K-means clustering as well as a multi-region segmentation algorithm based on level sets. We demonstrated the relevance of such a localised mixture model by quantitative experiments in model-based tracking using the HumanEva-II benchmark. Results clearly improved when using this new model. Moreover, the approach is also applicable when a static background image is missing. In such cases tracking is only successful with the localised mixture model. We believe that such localised models can also be very beneficial in other object segmentation tasks where low-level cues are combined with a-priori information, such as semi-supervised segmentation, or combined object recognition and segmentation.
References 1. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23(3), 309–314 (2004) 2. Criminisi, A., Cross, G., Blake, A., Kolmogorov, V.: Bilayer segmentation of live video. In: Proc. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 53–60. IEEE Computer Society, Los Alamitos (2006) 3. Paragios, N., Deriche, R.: Geodesic active regions: A new paradigm to deal with frame partition problems in computer vision. Journal of Visual Communication and Image Representation 13(1/2), 249–268 (2002) 4. Sifakis, E., Garcia, C., Tziritas, G.: Bayesian level sets for image segmentation. Journal of Visual Communication and Image Representation 13(1/2), 44–64 (2002)
5. Cremers, D., Soatto, S.: Motion competition: A variational approach to piecewise parametric motion segmentation. International Journal of Computer Vision 62(3), 249–265 (2005) 6. Schmaltz, C., Rosenhahn, B., Brox, T., Weickert, J., Wietzke, L., Sommer, G.: Dealing with self-occlusion in region based motion capture by means of internal regions. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2008. LNCS, vol. 5098, pp. 102–111. Springer, Heidelberg (2008) 7. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision 73(3), 243–262 (2007) 8. Morya, B., Ardon, R., Thiran, J.P.: Variational segmentation using fuzzy region competition and local non-parametric probability density functions. In: Proc. Eleventh International Conference on Computer Vision. IEEE Computer Society Press, Los Alamitos (2007) 9. Grimson, W., Stauffer, C., Romano, R., Lee, L.: Using adaptive tracking to classify and monitor activities in a site. In: Proc. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 22–29. IEEE Computer Society Press, Los Alamitos (1998) 10. Pless, R., Larson, J., Siebers, S., Westover, B.: Evaluation of local models of dynamic backgrounds. In: Proc. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 73–78 (2003) 11. Sun, J., Zhang, W., Tang, X., Shum, H.Y.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006) 12. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proc. Twelfth International Conference on Computer Vision. IEEE Computer Society Press, Los Alamitos (2008) 13. Dambreville, S., Sandhu, R., Yezzi, A., Tannenbaum, A.: Robust 3D pose estimation and efficient 2D region-based segmentation from a 3D shape prior. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 169– 182. Springer, Heidelberg (2008) 14. Bregler, C., Malik, J., Pullen, K.: Twist based acquisition and tracking of animal and human kinematics. International Journal of Computer Vision 56(3), 179–194 (2004) 15. Gavrila, D.M.: The visual analysis of human movement: a survey. Computer Vision and Image Understanding 73(1), 82–98 (1999) 16. Poppe, R.: Vision-based human motion analysis: An overview. Computer Vision and Image Understanding 108(1-2), 4–18 (2007) 17. Elkan, C.: Using the triangle inequality to accelerate k-Means. In: Proceedings of the Twentieth International Conference on Machine Learning, pp. 147–153. AAAI Press, Menlo Park (2003) 18. Gehler, P.: Mpikmeans (2007), http://mloss.org/software/view/48/ 19. Brox, T., Weickert, J.: Level set segmentation with multiple regions. IEEE Transactions on Image Processing 15(10), 3213–3218 (2006) 20. Sigal, L., Black, M.J.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated motion. Technical Report CS-06-08, Department of Computer Science, Brown University (September 2006)
A Closed-Form Solution for Image Sequence Segmentation with Dynamical Shape Priors
Frank R. Schmidt and Daniel Cremers
Computer Science Department, University of Bonn, Germany
Abstract. In this paper, we address the problem of image sequence segmentation with dynamical shape priors. While existing formulations are typically based on hard decisions, we propose a formalism which allows all segmentations of past images to be reconsidered. Firstly, we prove that the marginalization over all (exponentially many) reinterpretations of past measurements can be carried out in closed form. Secondly, we prove that computing the optimal segmentation at time t, given all images up to t and a dynamical shape prior, amounts to the optimization of a convex energy and can therefore be optimized globally. Experimental results confirm that for large amounts of noise, the proposed reconsideration of past measurements improves the performance of the tracking method.
1
Introduction
A classical challenge in Computer Vision is the segmentation and tracking of a deformable object. Numerous researchers have addressed this problem by introducing statistical shape priors into segmentation and tracking [1,2,3,4,5,6,7]. While in earlier approaches every image of a sequence was handled independently, Cremers [8] suggested to consider the correlations which characterize many deforming objects. The introduction of such dynamical shape priors allows to substantially improve the performance of tracking algorithms: the dynamics are learned via an auto-regressive model, and segmentations of the preceding images guide the segmentation of the current image. Upon a closer look, this approach suffers from two drawbacks:
– The optimization in [8] was done in a level set framework which only allows for locally optimal solutions. As a consequence, depending on the initialization, the resulting solutions may be suboptimal.
– At any given time the algorithm in [8] computed the currently optimal segmentation and only retained the segmentations of the two preceding frames. Past measurements were never reinterpreted in the light of new measurements. As a consequence, any incorrect decision would not be corrected at later stages of processing. While dynamical shape priors were called priors with memory in [8], what is memorized are only the decisions the algorithm took on previous frames – the measurements are instantly lost from memory; a reinterpretation is not considered in [8].
The reinterpretation of past measurements in the light of new measurements is a difficult computational challenge due to the exponential growth of the solution space: even if a tracking system only had k discrete states representing the system at any time t, then after T time steps there are k^T possible system configurations explaining all measurements. In this work silhouettes are represented by k continuous real-valued parameters: while determining the silhouette for time t amounts to an optimization over ℝ^k, the optimization over all silhouettes up to time T amounts to an optimization over ℝ^{k·T}. Recent works tried to address the above shortcomings. Papadakis and Memin suggested in [9] a control framework for segmentation which aimed at a consistent sequence segmentation by forward and backward propagation of the current solution according to a dynamical system. Yet this approach is entirely based on level set methods and local optimization as well. Moreover, extrapolations into the past and the future rely on a sophisticated partial differential equation. In [10] the sequence segmentation was addressed in a convex framework. While this allowed to compute globally optimal solutions independent of the initialization, it does not allow a reinterpretation of past measurements. Hence incorrect segmentations will negatively affect future segmentations. The contribution of this paper is to introduce a novel framework for image sequence segmentation which overcomes both of the above drawbacks. While [8,10] compute the best segmentation given the current image and past segmentations, here we propose to compute the best segmentation given the current image and all previous images. In particular, we propose a statistical inference framework which gives rise to a marginalization over all possible segmentations of all previous images. The theoretical contribution of this work is therefore two-fold. Firstly, we prove that the marginalization over all segmentations of the preceding images can be solved in closed form, which allows us to handle the combinatorial explosion analytically. Secondly, we prove that the resulting functional is convex, such that the maximum a posteriori inference of the currently best segmentation can be solved globally. Experimental results confirm that this marginalization over preceding segmentations improves the accuracy of the tracking scheme in the presence of large amounts of noise.
2
An Implicit Dynamic Shape Model
In the following, we will briefly review the dynamical shape model introduced in [10]. It is based on the notion of a probabilistic shape u defined as a mapping

u : Ω → [0, 1]   (1)

that assigns to every pixel x of the shape domain Ω ⊂ ℝ^d the probability that this pixel is inside the given shape. While our algorithm will compute such a relaxed shape, for visualization of a silhouette we will simply threshold u at 1/2. We present a general model for shapes in arbitrary dimension. However, the approach is tested for planar shapes (d = 2).
The space of all probabilistic shapes forms a convex set, and the space spanned by a few training shapes {u_1, . . . , u_N} forms a convex subset. Any shape u can be approximated by a linear combination of the first n principal components Ψ_i of the training set:

u(x) ≈ u_0(x) + Σ_{i=1}^{n} α_i · Ψ_i(x)   (2)

with an average shape u_0. Also, the set

Q := { α ∈ ℝ^n | ∀x ∈ Ω : 0 ≤ u_0(x) + Σ_{i=1}^{n} α_i · Ψ_i(x) ≤ 1 }

of feasible α-parameters is convex [10]. Any given sequence of shapes u_1, . . . , u_N can be reduced to a sequence of low-dimensional coefficient vectors α_1, . . . , α_N ∈ Q ⊂ ℝ^n. The evolution of these coefficient vectors can be modeled as an autoregressive system
α_i = Σ_{j=1}^{k} A_j α_{i−j} + η_{Σ^{−1}}   (3)

of order k ∈ ℕ, where the transition matrices A_j ∈ ℝ^{n×n} describe the linear dependency of the current observation on the previous k observations. Here η_{Σ^{−1}} denotes Gaussian noise with covariance matrix Σ^{−1}.
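For illustration, predicting the next coefficient vector from the learned autoregressive model (3) takes only a few lines (a sketch; A is assumed to be a list of the k transition matrices, and the noise term is omitted for the prediction):

import numpy as np

def ar_predict(alpha_history, A):
    # alpha_history: list of past coefficient vectors [alpha_{i-1}, ..., alpha_{i-k}].
    # A            : list of transition matrices [A_1, ..., A_k] from (3).
    # Returns the noise-free prediction alpha_i = sum_j A_j alpha_{i-j}.
    return sum(Aj @ a for Aj, a in zip(A, alpha_history))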
3
A Statistical Formulation of Sequence Segmentation
In the following, we will develop a statistical framework for image sequence segmentation which for any time t determines the most likely segmentation ut given all images I1:t up to time t and given the dynamical model in (3). The goal is to maximize the conditional probability P(αt | I1:t), where αt ∈ ℝ^n represents the segmentation ut := u0 + Ψ · αt. For the derivation we will make use of four concepts from probabilistic reasoning:
– Firstly, the conditional probability is defined as

P(A|B) := P(A, B) / P(B).   (4)

– Secondly, the application of this definition leads to the Bayesian formula

P(A|B) = P(B|A) · P(A) / P(B).   (5)
Fig. 1. Model for image sequence segmentation. We assume that all information about the observed images Iτ (top row) is encoded in the segmentation variables ατ (bottom row) and that the dynamics of ατ follow the autoregressive model (3) learned beforehand. If the state space were discrete with N possible states per time instance, then one would need to consider N^t different states to find the optimal segmentation of the t-th image. In Theorem 1, we provide a closed-form solution for the integration over all preceding segmentations. In Theorem 2, we prove that the final expression is convex in αt and can therefore be optimized globally.
– Thirdly, we have the concept of marginalization:

P(A) = ∫ P(A|B) · P(B) dB,   (6)
which represents the probability P(A) as a weighted integration of P(A|B) over all conceivable states B. In the context of time-series analysis this marginalization is often referred to as the Chapman-Kolmogorov equation [11]. In particle physics it is popular in the formalism of path integral computations. – Fourthly, besides these stochastic properties we make the assumption that for any time τ the probability for measuring image Iτ is completely characterized by its segmentation ατ as shown in Figure 1:
The segmentation ατ contains all information about the system in state τ. The rest of the state τ is independent noise. Hence, Iτ contains no further hidden information; its probability is uniquely determined by ατ.   (7)
With these four properties, we can now derive an expression for the probability P(αt | I1:t) that we would like to maximize. Using Bayes rule with all expressions in (5) conditioned on I1:t−1, we receive

P(αt | I1:t) ∝ P(It | αt, I1:t−1) · P(αt | I1:t−1).   (8)

Due to property (7), we can drop the dependency on the previous images in the first factor. Moreover, we can expand the second factor using Bayes rule again:

P(αt | I1:t) ∝ P(It | αt) · P(I1:t−1 | αt) · P(αt).   (9)

Applying the Chapman-Kolmogorov equation (6) to (9), we obtain

P(αt | I1:t) ∝ P(It | αt) ∫ P(I1:t−1 | α1:t) · P(α1:t−1 | αt) · P(αt) dα1:t−1,   (10)

where the product of the last two factors under the integral equals P(α1:t).
This expression shows that the optimal solution for αt requires an integration over all conceivable segmentations α1:t−1 of the preceding images. To evaluate the right hand side of (10), we will model the probabilities P(It |αt ), P(I1:t−1 |α1:t ) and P(α1:t ). Assuming a spatially independent prelearned color distribution Pob of the object and Pbg of the background, we can define p(x) := − log(Pob (x)/Pbg (x)) which is negative for every pixel that is more likely to be an object pixel than a background pixel. By introducing an exponential weighting parameter γ for the color distributions, P(It |αt ) becomes
P(It | αt) = Π_{x∈Ω} Pob(x)^{γ ut(x)} · Pbg(x)^{γ (1−ut(x))}
           ∝ exp( Σ_{x∈Ω} γ ut(x) log( Pob(x) / Pbg(x) ) )
           ∝ exp( −γ Σ_{i=1}^{n} (αt)_i · Σ_{x∈Ω} Ψi(x) · p(x) ) = exp( −γ ⟨αt, ft⟩ ),

where ft,i := Σ_{x∈Ω} Ψi(x) · p(x) denotes the i-th component of the data vector ft.
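In a discrete implementation, the data vector ft is simply a projection of the per-pixel log-likelihood ratio onto the eigenmodes (a sketch; Psi is assumed to hold the n eigenmodes as images, and P_ob, P_bg are the pre-learned colour likelihood maps):

import numpy as np

def data_vector(P_ob, P_bg, Psi, eps=1e-12):
    # p(x) = -log(P_ob(x) / P_bg(x)), negative where the object is more likely.
    p = -np.log((P_ob + eps) / (P_bg + eps))
    # f_{t,i} = sum_x Psi_i(x) * p(x), one entry per eigenmode.
    return np.array([np.sum(Psi_i * p) for Psi_i in Psi])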
To compute P(I1:t−1 | α1:t), we use assumption (7). Besides the information encoded in α1:t, the images Iτ contain no further information and are therefore pairwise independent:

P(I1:t−1 | α1:t) = Π_{τ=1}^{t−1} P(Iτ | α1:t) = Π_{τ=1}^{t−1} P(Iτ | ατ) = Π_{τ=1}^{t−1} exp( −γ ⟨ατ, fτ⟩ ).
The second equation holds again due to (7): since the probability for Iτ is uniquely determined by ατ, the dependency on the other states can be dropped. Now, we have to address the probability P(α1:t), which can recursively be simplified via (4):

P(α1:t) = P(αt | α1:t−1) · P(α1:t−1) = · · · = Π_{τ=1}^{t} P(ατ | α1:τ−1).

Using the dynamic shape prior (3), this expression becomes

P(α1:t) ∝ exp( − Σ_{τ=1}^{t} || ατ − Σ_{i=1}^{k} Ai ατ−i ||²_{Σ^{−1}} ).   (11)
To make this formula more accessible, we introduced k additional segmentation parameters α1−k, . . . , α0. These parameters represent the segmentation of the past prior to the first observation I1 (cf. Figure 1). To simplify the notation, we will introduce α := α1−k:t−1. These are the parameters that represent all segmentations prior to the current segmentation αt. Combining all derived probabilities, we can formulate the image segmentation as the following minimization task:

argmin_{αt}  − ∫ exp( − Σ_{τ=1}^{t} γ · ⟨fτ, ατ⟩ − Σ_{τ=1}^{t} || ατ − Σ_{j=1}^{k} Aj ατ−j ||²_{Σ^{−1}} ) dα.   (12)
Numerically computing this n · (t + k − 1)-dimensional integral of (12) leads to a combinatorial explosion. Even for a simple example of t = 25 frames, n = 5 eigenmodes and an autoregressive model of size k = 1, a 100-dimensional integral has to be computed. In [8], this computational challenge was circumvented by the crude assumption of a Dirac distribution centered at precomputed segmentation results – i.e. rather than considering all possible trajectories, the algorithm only retained for each previous time the one segmentation which was then most likely. In this paper, we will compute this integral explicitly and obtain a closed-form expression for (12), described in Theorem 1. This closed-form formulation has the important advantage that for any given time it allows an optimal reconsideration of all conceivable previous segmentations. To simplify (12), we write the integral as ∫ exp(−Q(α, αt)) dα. Note that Q is a quadratic expression that can be written as

Q(α, αt) = γ · ⟨ft, αt⟩ + ||αt||²_{Σ^{−1}} [I]  +  ⟨α, M α⟩ [II]  −  ⟨b, α⟩ [III]   (13)
with the block vector b and the block matrix M given by

b_i = −γ · f_i · 𝟙_{i≥1} + 2 A^T_{t−i} Σ^{−1} αt · 𝟙_{i≥t−k},

M_{i,j} = A^T_{t−i} Σ^{−1} A_{t−j} · 𝟙_{i,j≥t−k} + 𝟙 · 𝟙_{i=j≥1} − 2 A_{i−j} · 𝟙_{i≥1, k≥i−j≥1} + Σ_{1≤l≤k, 1≤i+l≤t−1, 1≤i−j+l≤k} A^T_l Σ^{−1} A_{i−j+l},

where 𝟙_{(·)} denotes the indicator of the stated condition on the block indices.
Despite their complicated nature, the three terms in (13) have the following intuitive interpretations: – I assures that the current segmentation encoded by αt optimally segments the current image. – II assures that the segmentation path (α−1 , . . . , αt ) is consistent with the learned autoregressive model encoded by (Ai , Σ −1 ).
– III assures that the current segmentation αt also consistently segments all previous images when propagated back in time according to the dynamical model. In dynamical systems such backpropagation is modeled by the adjoints A^T of the transition matrices.
In the next theorem we provide a closed-form expression for (12) that is free of any integration and can therefore be computed more efficiently. Additionally, we obtain a convex energy functional, so computing the global optimum of the image sequence problem becomes straightforward.

Theorem 1. The integration over all conceivable interpretations of past measurements can be solved in the following closed form:

P(αt | I1:t) = exp( −γ ⟨αt, ft⟩ − ||αt||²_{Σ^{−1}} + (1/4) ⟨Ms^{−1} b, b⟩ + const ).   (14)

Proof.
P(αt | I1:t) ∝ ∫ e^{ −γ⟨αt, ft⟩ − ||αt||²_{Σ^{−1}} − ⟨α, Ms α⟩ + ⟨b, α⟩ } dα
            = ∫ e^{ −γ⟨αt, ft⟩ − ||αt||²_{Σ^{−1}} − ||α − (1/2) Ms^{−1} b||²_{Ms} + (1/4)⟨Ms^{−1} b, b⟩ } dα
            ∝ exp( −γ ⟨αt, ft⟩ − ||αt||²_{Σ^{−1}} + (1/4) ⟨Ms^{−1} b, b⟩ ),

since the remaining Gaussian integral over α does not depend on αt.
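The marginalisation step above is the standard Gaussian integral; the identity ∫ exp(−⟨α, Ms α⟩ + ⟨b, α⟩) dα ∝ exp((1/4)⟨Ms^{−1} b, b⟩), with a constant independent of b, can be checked numerically in low dimensions (an illustrative sketch using a dense 2-d grid):

import numpy as np

def gaussian_integral(Ms, b, lim=20.0, n=400):
    # Numerically integrate exp(-<a, Ms a> + <b, a>) over a 2-d grid.
    xs = np.linspace(-lim, lim, n)
    X, Y = np.meshgrid(xs, xs)
    A = np.stack([X, Y], axis=-1)
    quad = np.einsum('...i,ij,...j->...', A, Ms, A)
    lin = A @ b
    dx = xs[1] - xs[0]
    return np.sum(np.exp(-quad + lin)) * dx * dx

Ms = np.array([[2.0, 0.3], [0.3, 1.0]])
b1, b2 = np.array([0.5, -0.2]), np.array([1.0, 0.7])
ratio_num = gaussian_integral(Ms, b1) / gaussian_integral(Ms, b2)
ratio_ana = np.exp(0.25 * (b1 @ np.linalg.solve(Ms, b1) - b2 @ np.linalg.solve(Ms, b2)))
# ratio_num and ratio_ana agree up to discretisation error.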
Theorem 2. The resulting energy E(αt) = − log(P(αt | I1:t)) is convex and can therefore be minimized globally.
Proof. The density function P(αt | I1:t) is the integral of a log-concave function, i.e., of a function whose logarithm is concave. It was shown in [12] that integrals of log-concave functions are log-concave. Hence, E is convex. Therefore, the global optimum can be computed using, for example, a gradient descent approach.
In [10], discarding all preceding images and merely retaining the segmentations of the last frames gave rise to the simple objective function

E1(αt) = γ · ⟨αt, ft⟩ + ||αt − v||²_{Σ^{−1}},   (15)

where v is the prediction obtained using the AR model (3) on the basis of the last segmentations. The proposed optimal path integration gives rise to the new objective function

E2(αt) = γ · ⟨αt, ft⟩ + ||αt||²_{Σ^{−1}} − (1/4) ⟨Ms^{−1} b, b⟩.   (16)
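For the simpler objective (15) the minimiser is even available in closed form, which illustrates why these convex energies are cheap to handle (a sketch; ft, v and Sigma_inv are assumed to be given, and (16) can be minimised analogously by gradient descent since it is convex):

import numpy as np

def E1(alpha_t, f_t, v, Sigma_inv, gamma=1.0):
    # E1(alpha_t) = gamma <alpha_t, f_t> + ||alpha_t - v||^2_{Sigma^{-1}}, cf. (15).
    d = alpha_t - v
    return gamma * alpha_t @ f_t + d @ Sigma_inv @ d

def argmin_E1(f_t, v, Sigma_inv, gamma=1.0):
    # Setting the gradient gamma f_t + 2 Sigma^{-1} (alpha_t - v) to zero gives
    # alpha_t* = v - (gamma / 2) * Sigma f_t.
    return v - 0.5 * gamma * np.linalg.solve(Sigma_inv, f_t)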
In the next section, we will experimentally quantify the difference in performance brought about by the proposed marginalization over preceding segmentations.
Fig. 2. Optimal Parameter Estimation. The tracking error averaged over all frames (plotted as a function of γ) shows that γ = 1 produces the best results for both methods at various noise levels (shown here are σ = 16 and σ = 256).
4
Experimental Results
In the following experiments, the goal is to track a walking person in spite of noise and missing data. To measure the tracking accuracy, we hand-segmented the sequence (before adding noise) and measured the relative error with respect to this ground truth. Let T : Ω → {0, 1} be the true segmentation and S : Ω → {0, 1} be the estimated one. Then we define the scaled relative error as

ε := ∫_Ω |S(x) − T(x)| dx / ( 2 · ∫_Ω T(x) dx ).

It measures the area difference relative to twice the area of the ground truth. Thus we have ε = 0 for a perfect segmentation and ε = 1 for a completely wrong segmentation (of the same size); a short computational sketch of ε is given below.

Optimal parameter estimation. In order to estimate the optimal parameter γ for both approaches, we added Gaussian noise of standard deviation σ to the training images. As we can see in Figure 2, the lowest tracking error (averaged over all frames) is obtained at γ = 1 for both approaches. Therefore, we will fix γ = 1 for the test series in the next section.

Robust tracking through prominent noise. The proposed framework allows to track a deformable silhouette despite large amounts of noise. Figure 3 shows segmentation results obtained with the proposed method for various levels of Gaussian noise. The segmentations are quite accurate even for high levels of noise.

Quantitative comparison to the method in [10]. For a quantitative comparison of the proposed approach with the method of [10], we compute the average error of the learned input sequence I1:151 for different levels of Gaussian noise. Figure 4 shows two different aspects. While the method in [10] exhibits slightly lower errors for small noise levels, the proposed method shows less dependency on noise and exhibits substantially better performance at larger noise levels. While the differences in the segmentation results for low noise levels are barely recognizable (middle row), for high noise levels the method in [10] clearly estimates incorrect poses (bottom row).
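The scaled relative error ε defined above can be computed directly from two binary masks (a small illustrative sketch):

import numpy as np

def scaled_relative_error(S, T):
    # S, T: binary (0/1) arrays of the estimated and the true segmentation.
    # epsilon = sum |S - T| / (2 * sum T); 0 = perfect segmentation,
    # 1 = completely wrong segmentation of the same size.
    return np.sum(np.abs(S.astype(int) - T.astype(int))) / (2.0 * np.sum(T))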
Fig. 3. Close-ups of segmentation results. The proposed method obtains correct segmentation results even in the presence of high Gaussian noise (σ ∈ {64, 512}).
Fig. 4. Robustness with respect to noise. Top: the average tracking error as a function of the noise level. Middle and bottom rows: segmentations for σ = 128 and σ = 2048, respectively, each comparing the method in [10] with the proposed method. Tracking experiments demonstrate that in contrast to the approach in [10], the performance of the proposed algorithm is less sensitive to noise and outperforms the former in the regime of large noise. While for low noise the resulting segmentations are qualitatively similar (middle row), for high noise levels the method in [10] provides an obviously wrong pose estimate (bottom row).
5 Conclusion
In this paper we presented the first approach for variational object tracking with dynamical shape priors which makes it possible to marginalize over all previous segmentations. Firstly, we proved that this marginalization over an exponentially growing space of solutions can be solved analytically. Secondly, we proved that the resulting functional is convex. As a consequence, one can efficiently compute the globally optimal segmentation at time t given all images up to time t. In experiments, we confirmed that the resulting algorithm allows reliable tracking of walking people despite prominent noise. In particular, for very large amounts of noise it outperforms an alternative algorithm [10] that does not include a marginalization over the preceding segmentations.
References
1. Leventon, M., Grimson, W., Faugeras, O.: Statistical shape influence in geodesic active contours. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 316–323 (2000)
2. Tsai, A., Yezzi, A., Wells, W., Tempany, C., Tucker, D., Fan, A., Grimson, E., Willsky, A.: Model-based curve evolution technique for image segmentation. In: Comp. Vision Patt. Recog., pp. 463–468 (2001)
3. Cremers, D., Kohlberger, T., Schnörr, C.: Nonlinear shape statistics in Mumford–Shah based segmentation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 93–108. Springer, Heidelberg (2002)
4. Riklin-Raviv, T., Kiryati, N., Sochen, N.: Unlevel sets: Geometry and prior-based segmentation. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 50–61. Springer, Heidelberg (2004)
5. Rousson, M., Paragios, N., Deriche, R.: Implicit active shape models for 3D segmentation in MRI imaging. In: Barillot, C., Haynor, D.R., Hellier, P. (eds.) MICCAI 2004. LNCS, vol. 3216, pp. 209–216. Springer, Heidelberg (2004)
6. Kohlberger, T., Cremers, D., Rousson, M., Ramaraj, R.: 4D shape priors for level set segmentation of the left myocardium in SPECT sequences. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4190, pp. 92–100. Springer, Heidelberg (2006)
7. Charpiat, G., Faugeras, O., Keriven, R.: Shape statistics for image segmentation with prior. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2007)
8. Cremers, D.: Dynamical statistical shape priors for level set based tracking. IEEE PAMI 28(8), 1262–1273 (2006)
9. Papadakis, N., Mémin, E.: Variational optimal control technique for the tracking of deformable objects. In: IEEE Int. Conf. on Comp. Vis. (2007)
10. Cremers, D., Schmidt, F.R., Barthel, F.: Shape priors in variational image segmentation: Convexity, Lipschitz continuity and globally optimal solutions. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition (2008)
11. Papoulis, A.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York (1984)
12. Prékopa, A.: Logarithmic concave measures with application to stochastic programming. Acta Scientiarum Mathematicarum 34, 301–316 (1971)
Markerless 3D Face Tracking

Christian Walder1,2, Martin Breidt1, Heinrich Bülthoff1, Bernhard Schölkopf1, and Cristóbal Curio1

1 Max Planck Institute for Biological Cybernetics, Tübingen, Germany
2 Informatics and Mathematical Modelling, Technical University of Denmark
Abstract. We present a novel algorithm for the markerless tracking of deforming surfaces such as faces. We acquire a sequence of 3D scans along with color images at 40 Hz. The data is then represented by implicit surface and color functions, using a novel partition-of-unity type method that efficiently combines local regressors using nearest neighbor searches. Both these functions act on the 4D space of 3D plus time, and use temporal information to handle the noise in individual scans. After interactive registration of a template mesh to the first frame, it is then automatically deformed to track the scanned surface, using the variation of both shape and color as features in a dynamic energy minimization problem. Our prototype system yields high-quality animated 3D models in correspondence, at a rate of approximately twenty seconds per timestep. Tracking results for faces and other objects are presented.
1 Introduction
Creating animated 3D models of faces is an important and difficult task in computer graphics due to the sensitivity of human perception of face motion. People can detect slight peculiarities present in an artificially animated face model, which makes the animator's job rather difficult and has led to data-driven animation techniques, which aim to capture live performance. Data-driven face animation has enjoyed increasing success in the movie industry, mainly using marker-based methods. Although steady progress has been made, there are certain limitations involved in placing physical markers on a subject's face. Summarizing the face by a sparse set of locations loses information, and necessitates motion re-targeting to map the marker motion onto that of a model suitable for animation. Markers also occlude the face, obscuring expression wrinkles and color changes. Practically, significant time and effort is required to accurately place markers, especially with brief scans of numerous subjects, a scenario common in the computer game industry. Tracking without markers is more difficult. To date, most attempts have made extensive use of optical flow calculations between adjacent time-steps of the sequence. Since local flow calculations are noisy and inconsistent, spatial coherency constraints must be added. Although significant progress has been made [1], the sequential use of between-frame flow vectors can lead to a continual accumulation of errors, which may eventually necessitate labor-intensive manual corrections [2]. It is also noteworthy that facial cosmetics designed to remove skin blemishes strike directly at the key assumptions of optical-flow-based methods. Non-flow-based methods include [3], where local geometrical patches are modelled and stitched together. [4] introduced a multiresolution approach which iteratively solves between-frame correspondence problems using feature points and 3D implicit surface models. Neither of these works uses color information. For face tracking purposes, there is significant redundancy between the geometry and color information. Our goal is to exploit this multitude of information sources, in order to obtain high-quality tracking results in spite of possible ambiguities in any of the individual sources. In contrast to classical motion capture, we aim to capture the surface densely rather than at a sparse set of locations. We present a novel surface tracking algorithm which addresses these issues. The input is an unorganized set of four-dimensional (3D plus time) surface points, with a corresponding set of surface normals and surface colors. From this we construct a 4D implicit surface model, and a regressed function which models the color at any given point in space and time. Our 4D implicit surface model is a partition of unity method like [5], but uses a local weighting scheme which is particularly easy to implement efficiently using a nearest neighbor library. By requiring only an unorganized point cloud, we are not restricted to scanners which produce a sequence of 3D frames, and can handle samples at arbitrary points in time and space as produced by a laser scanner, for example.

This work was supported by Perceptual Graphics (DFG), EU-Project BACS FP6-IST-027140, and the Max-Planck-Society.

Fig. 1. Setup of the dynamic 3D scanner. Two 640 by 480 pixel photon focus MV-D752160 gray-scale cameras (red) compute depth images at 40 Hz from coded light projected by the synchronized minirot H1 projector (blue). Two strobes (far left and right) are triggered by the 656 by 490 pixel Basler A601fc color camera (green), capturing color images at a rate of 40 Hz.
2 Surface Tracking
In this section we present our novel method of deforming the initial template mesh to move in correspondence with the scanned surface. The dynamic 3D scanner we use is a commercial prototype (see Figure 1) developed by ABW GmbH (http://www.abw-3d.de) and uses a modified coded light approach with phase unwrapping. A typical frame of output consists of around 40K points with texture coordinates that index into the corresponding color texture image.
Input. The data produced by our scanner consists of a sequence of 3D meshes with texture images, sampled at a constant rate. As a first step we transform each mesh into a set of points and normals, where the points are the mesh vertices and the corresponding normals are computed by a weighted average of the adjacent face normals, using the method described in [6]. Furthermore, we append to each 3D point the time at which it was sampled, yielding a 4D spatio-temporal point cloud. To simplify the subsequent notation, we also append to each 3D surface normal a fourth temporal component of value zero. To represent the color information, we assign to each surface point a 3D color vector representing the RGB color, which we obtain by projecting the mesh produced by the scanner into the texture image. Hence we summarize the data from the scanner as the set of m (point, normal, color) triplets {(x_i, n_i, c_i)}_{1 ≤ i ≤ m} ⊂ R^4 × R^4 × R^3.

Template mesh. In addition to the above data, we also require a template mesh in correspondence with the first frame produced by the scanner, which we denote by M_1 = (V_1, G), where V_1 ∈ R^{3×n} are the n vertices and G ⊂ J × J the edges, where J = {1, 2, ..., n}. The construction of the template mesh could be automated: for example, we could (a) take the first frame itself (or some adaptive refinement of it, for example as produced by a marching cubes type of algorithm such as [7]), or (b) automatically register a custom mesh as was done in a similar context in e.g. [1]. Instead we opt for an interactive approach, using the CySlice software package; this semi-automated step requires approximately 15 minutes of user interaction and is guaranteed to lead to a high-quality initial registration (see Figure 3, top left). We normally use a template mesh of 2100 vertices, but this is not an algorithmic restriction; higher-resolution meshes are demonstrated in the accompanying video (see footnote 1, page 47).

Output. The aim is to compute the vertex locations of the template mesh for each frame i = 2, ..., s, such that it moves in correspondence with the observed surface. We denote the vertex locations of the i-th frame by V_i ∈ R^{3×n}. Throughout the paper we refer to the j-th vertex of V_i as v_{i,j}. We also use ṽ_{i,j} ∈ R^4 to represent v_{i,j} concatenated with the relative time of the i-th frame, that is, ṽ_{i,j} = (v_{i,j}^T, Δi)^T, where Δ is the interval between frames.
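A minimal sketch of how the spatio-temporal triplets {(x_i, n_i, c_i)} can be assembled from per-frame data; the per-frame arrays and the frame interval are placeholders, and the normal and color computation ([6] and the texture projection) is assumed to have happened already.

```python
import numpy as np

# Assembling the spatio-temporal (point, normal, color) triplets from
# per-frame vertices V, normals N and RGB colors C.
def build_4d_cloud(frames, frame_dt):
    """frames: list of (V, N, C), each array of shape (m_i, 3)."""
    xs, ns, cs = [], [], []
    for i, (V, N, C) in enumerate(frames):
        t = np.full((len(V), 1), i * frame_dt)
        xs.append(np.hstack([V, t]))                       # x_i in R^4
        ns.append(np.hstack([N, np.zeros((len(N), 1))]))   # temporal component 0
        cs.append(C)
    return np.vstack(xs), np.vstack(ns), np.vstack(cs)

# Hypothetical input: three frames of ~40K points each, 40 Hz scanner
frames = [(np.random.rand(40000, 3), np.random.rand(40000, 3), np.random.rand(40000, 3))
          for _ in range(3)]
x, n, c = build_4d_cloud(frames, frame_dt=1.0 / 40.0)
```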
2.1 Algorithm
We take the widespread approach of minimizing an energy functional, E_obj., which in our case is defined in terms of the entire sequence of vertex locations, V_1, V_2, ..., V_s. Rather than using the (point, normal, color) triplets directly, we instead use summarized versions of the geometry and color, as represented by the implicit surface embedding function f_imp. and the color function f_col., respectively. The construction of these functions is explained in detail in Appendix A. For now it is sufficient to know that the functions can be set up and evaluated rather efficiently, are differentiable almost everywhere, and that

1. f_imp.: R^4 → R estimates the signed distance to the scanned surface given the spatio-temporal location (say, x = (x, y, z, t)). The signed distance to a surface S evaluated at x has absolute value |dist(S, x)|, and a sign which differs on different sides of S. At any fixed t, the 4D implicit surface can be thought of as a 3D implicit surface in (x, y, z) (see Figure 2, left).
2. f_col.: R^4 → R^3 similarly estimates a 3-vector of RGB values. Evaluated away from the surface, the function returns an estimate of the color of the surface nearest to the evaluation point (see Figure 2, right).

Fig. 2. The nearest neighbor implicit surface (left, intensity plot of f_imp.) and color (right, RGB plot of f_col.) models. Time and one space dimension are fixed, plotting over the two remaining space dimensions. Shown is a vertical slice through the data of a human face, revealing the profile contour with the nose pointing to the right. For reference, the zero level set of the implicit surface appears in both images as a green curve.

Modelling the geometry and color in this way has the practical advantage that as we construct f_imp. and f_col. we may separately adjust parameters which pertain to the noise level in the raw data, and then visually verify the result. Thereafter we approach the tracking problem under the assumption that f_imp. and f_col. contain little noise, while summarizing the relevant information in the raw data. The energy we minimize depends on the vertex locations through time, the connectivity (edge list) of the template mesh, the implicit surface model, and the color model, i.e., on V_1, ..., V_s, G, f_imp., and f_col.. With a slight abuse of notation, the functional is E_obj. ≡ \sum_{l ∈ terms} α_l E_l, where the α_l are parameters which we fix as described in Section 2.2, and the E_l are the individual terms which we now introduce. Note that it is possible to interpret the minimizer of the above energy functional as the maximum a posteriori estimate of a posterior likelihood in which the individual terms α_l E_l are interpreted as negative log-probabilities.

Distance to the Surface. The first term is straightforward: in order to keep the mesh close to the surface, we approximate the integral over the template mesh of the squared distance to the scanned surface. As an approximation to this squared distance we take the squared value of the implicit surface embedding function f_imp.. We approximate the integral by an area-weighted sum over the vertices, and the quantity we minimize is given by

E_imp. \equiv \sum_i \sum_j a_j f_imp.(\tilde{v}_{i,j})^2 .

Here, as throughout the paper, a_j refers to the Voronoi area [6] of the j-th vertex of M_1, the template mesh at its starting position; we state this simpler form here as it is easier to implement and more numerically stable.

Color. We assume that each vertex should remain on a region of stable color, and accordingly we minimize the sum over the vertices of the sample variance of the
color components observed at the sampling times of the dynamic 3D scanner. We discuss the validity of this assumption in Section 4. The sample variance of a vector of observations y = (y_1, y_2, ..., y_s) is V(y) \equiv \sum_{i=1}^{s} ( y_i - \sum_{i'=1}^{s} y_{i'}/s )^2 / s. To ensure a scaling which is compatible with that of E_imp., we neglect the term 1/s in the above expression. Summing these variances over the RGB channels, and taking the same approximate integral as before, we obtain

E_col. \equiv \sum_{i,j} a_j \big\| f_col.(\tilde{v}_{i,j}) - \sum_{i'} f_col.(\tilde{v}_{i',j}) / s \big\|^2 .

Acceleration. To obtain smooth motion we also minimize a similar approximation to the surface integral of the squared acceleration of the mesh. For a physical analogy, this is similar to minimizing a discretization in time and space of the integral of the squared accelerating forces acting on the mesh, assuming that it is perfectly flexible and has constant mass per area. The corresponding term is given by

E_acc. \equiv \sum_j a_j \sum_{i=2}^{s-1} \big\| v_{i-1,j} - 2 v_{i,j} + v_{i+1,j} \big\|^2 .

Mesh Regularisation. In addition to the previous terms, it is also necessary to regularize deformations of the template mesh, in order to prevent unwanted distortions during the tracking phase. Typically such regularization is done by minimizing measures of the amount of bending and stretching of the mesh. In our case however, since we are constraining the mesh to lie on the surface defined by f_imp., which itself bends only as much as the scanned surface, we only need to control the stretching of the template mesh. Due to space constraints we only briefly motivate our choice of regulariser. It is possible to use variational measures of mesh deformations, but we found these energies inappropriate in our experiments, as it was difficult to choose the correct amount by which to penalize the terms: either (1) the penalization was insufficient to prevent undesirable stretching of the mesh in regions of low deformation, or (2) the penalization was too great to allow the correct deformation in regions of high deformation. It is more effective to penalize an adaptive measure of stretch, which measures the amount of local distortion of the mesh, while retaining invariance to the absolute amount of stretch. To this end, we compute the ratio of the areas of adjacent triangles, and penalize the deviation of this ratio from that of the initial template mesh M_1, i.e.
E_reg. \equiv \sum_{i=2}^{s} \sum_{e \in G} a(e) \left( \frac{\mathrm{area}(\mathrm{face}_1(e_i))}{\mathrm{area}(\mathrm{face}_2(e_i))} - \frac{\mathrm{area}(\mathrm{face}_1(e_1))}{\mathrm{area}(\mathrm{face}_2(e_1))} \right)^2 .
Here, face_1(e) and face_2(e) are the two triangles containing edge e, area(·) is the area of the triangle, and a(e) = area(face_1(e_1)) + area(face_2(e_1)). Note that the ordering of face_1 and face_2 affects the above term. In practice we restore invariance with respect to this ordering by augmenting the above energy with an identical term with reversed order.
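The following sketch evaluates the stretch penalty E_reg. for a toy mesh sequence. The vertex arrays and the face_1/face_2 edge bookkeeping are illustrative assumptions, and the reversed-order term mentioned above is omitted for brevity.

```python
import numpy as np

# Adaptive stretch penalty: ratios of adjacent triangle areas are compared
# against the same ratios on the initial template mesh M1.
def tri_area(P, tri):
    a, b, c = P[:, tri[0]], P[:, tri[1]], P[:, tri[2]]
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

def e_reg(V_seq, face1, face2):
    """V_seq: list of (3, n) vertex arrays; V_seq[0] is the template M1."""
    ratio1 = [tri_area(V_seq[0], face1[e]) / tri_area(V_seq[0], face2[e])
              for e in range(len(face1))]
    a = [tri_area(V_seq[0], face1[e]) + tri_area(V_seq[0], face2[e])
         for e in range(len(face1))]
    E = 0.0
    for Vi in V_seq[1:]:
        for e in range(len(face1)):
            ratio_i = tri_area(Vi, face1[e]) / tri_area(Vi, face2[e])
            E += a[e] * (ratio_i - ratio1[e]) ** 2
    return E

# Toy example: two triangles sharing one edge; stretching one of them
# changes the area ratio and hence incurs a penalty.
V1 = np.array([[0.0, 1.0, 0.0, 1.0],
               [0.0, 0.0, 1.0, 1.0],
               [0.0, 0.0, 0.0, 0.0]])
face1, face2 = [(0, 1, 2)], [(1, 3, 2)]
V2 = V1.copy(); V2[:, 3] *= 2.0          # displace one vertex of the second triangle
print(e_reg([V1, V2], face1, face2))     # positive penalty
```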
2.2 Implementation
Deformation Based Re-parameterization. Optimising with respect to the 3(s − 1)n variables corresponding to the n 3D vertex locations of frames 2, 3, . . . , s
has the following critical shortcomings: (1) it necessitates further regularisation terms to prevent folding and clustering of the mesh, for example; (2) the number of variables is rather large; (3) compounding the previous shortcoming, convergence will be slow, as this direct parameterization is guaranteed to be ill-conditioned. This is because, for example, the regularisation term E_reg. acts in a sparse manner between individual vertices. Hence, loosely speaking, gradients in the objective function due to local information (for example due to the color term E_col.) will be propagated by the regularisation term in a slow, domino-like manner from one vertex to the next only after each subsequent step in the optimization. A simple way of overcoming these shortcomings is to optimize with respect to a lower-dimensional parameterization of plausible meshes. To do this we manually select a set of control vertices that are displaced in order to deform the template mesh. To this end, we take advantage of some ideas from interactive mesh deformation [8]. This leads to a linear parameterization of the vertex locations V_2, V_3, ..., V_s, namely V̄_i = V_1 + P_i B, where P_i ∈ R^{3×p} represents the free parameters and B ∈ R^{p×n} represents the basis vectors derived from the deformation scheme [9]. We have written V̄_i instead of V_i, as we apply another parameterized transformation, namely a rigid body transformation. This is necessary since the surfaces we wish to track are not only deformed versions of the template, but also undergo rigid body motion. Our vertex parameterization hence takes the form V_i = R(θ_i) V̄_i + r_i = R(θ_i)(V_1 + P_i B) + r_i, where r_i ∈ R^3 allows an arbitrary translation, θ_i = (α_i, β_i, γ_i) is a vector of angles, and R(θ) ∈ R^{3×3} is a rotation matrix.
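A small sketch of this vertex parameterization; the deformation basis B would come from the scheme of [8,9] but is replaced here by a random placeholder, and the Euler-angle construction is just one possible choice for R(θ).

```python
import numpy as np

# Low-dimensional parameterization V_i = R(theta_i)(V_1 + P_i B) + r_i.
def rotation(theta):
    a, b, c = theta
    Rx = np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(c), -np.sin(c), 0], [np.sin(c), np.cos(c), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def vertices(V1, B, P_i, theta_i, r_i):
    return rotation(theta_i) @ (V1 + P_i @ B) + r_i[:, None]

n, p = 2100, 30                           # template size and number of basis vectors
rng = np.random.default_rng(0)
V1 = rng.standard_normal((3, n))          # template vertex locations
B = rng.standard_normal((p, n))           # deformation basis (placeholder)
P_i, theta_i, r_i = np.zeros((3, p)), np.zeros(3), np.zeros(3)
V_i = vertices(V1, B, P_i, theta_i, r_i)
print(np.allclose(V_i, V1))               # identity parameters reproduce the template
```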
It turns out however, that optimizing the entire sequence would be problematic even if it were computationally feasible, due to the difficulty of finding a good starting
point for the optimization. Since the objective function is non-convex, it is essential to be able to find a starting point which is near to a good local minimum, but it is unclear how to initialize all frames 2, 3, ..., s given only the first frame and the raw scanner data. Fortunately, both the computational issue and that of the starting point are easily dealt with by incrementally optimizing within a moving temporal window. In particular, we first optimize frame 2, then frames 2-3, frames 2-4, frames 3-5, frames 4-6, etc. With the exception of the first two steps, we always optimize a window of three frames, with all previous frames held fixed. Importantly, it is now reasonable to simply initialize the parameters of each newly included frame with those of the previous frame at the end of the previous optimization step. Note that although we optimize on a temporal window with the other frames fixed, we include in the objective function all frames from the first to the current, eventually encompassing the entire sequence. Hence E_col. forces each vertex inside the optimization window to stay within regions that have a color similar to that "seen" previously by the given vertex at previous time steps. One could also treat the final output of the incremental optimization as a starting point for optimizing the entire sequence with all parameters unfixed, but we found this leads to little change in practice. This is not surprising as, given the moving window of three frames, the optimizer essentially has three chances to get each frame right, with forward and backward look-ahead of up to two frames.

Parameter Selection. We first determined the parameters of the implicit surface/color models and of the deformation-based re-parameterization, as these can be visually verified independently of the tracking. Choosing the other parameter values was fairly straightforward, as e.g. tracking color and staying near the implicit surface are goals which typically compete very little: either can be satisfied without compromising the other. Hence the results are relatively insensitive to the ratio α_imp./α_col.. To determine suitable parameter settings for α_imp., α_col., α_acc. and α_reg., we employed the following strategy. First, we removed a degree of freedom by fixing, without loss of generality, α_imp. = 1. Next we assumed that the implicit surface was sufficiently reliable, and treated the distance-to-surface term almost like the hard constraint E_imp. = 0 by setting the next parameter, α_col., to 1/100. We then took a sample dataset and ran the system over a 2D grid of values of α_acc. and α_reg., inspected the results visually, and fixed these two parameters accordingly for subsequent experiments.
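As a small illustration of the incremental scheme described at the beginning of this subsection, the following snippet generates the moving-window schedule (frame 2, frames 2-3, 2-4, 3-5, 4-6, ...); s is the total number of frames and windows are inclusive (first, last) pairs. The values are purely illustrative.

```python
# Moving-window schedule for incremental optimization over s frames.
def window_schedule(s):
    windows = [(2, 2), (2, 3)] + [(i, i + 2) for i in range(2, s - 1)]
    return [w for w in windows if w[1] <= s]

print(window_schedule(8))   # [(2, 2), (2, 3), (2, 4), (3, 5), (4, 6), (5, 7), (6, 8)]
```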
3 Results
Tracking results are best visualised with animation, hence the majority of our results are presented in the accompanying video1. Here we discuss the performance of the system, and provide images of results of the tracking algorithm, which ran on a 64 bit, 2.4 GHz AMD Opteron 850 processor with 4 GB of RAM, using a mixture of Matlab and C++ code. We focus on timings for face data, and only report averages since the timings vary little over identity/performance.

1 http://www.kyb.tuebingen.mpg.de/bu/people/mbreidt/dagm/
Fig. 3. A tracking example visualized by projecting the tracked mesh into the color camera image
The recording length is currently limited to 400 frames by operating system constraints. Note that this limitation is not due to our tracking algorithm, which has constant memory and linear time requirements in the length of the sequence. The dominating computation is the evaluation of the objective function and its gradient during the optimization phase, and of this, around 80% of the time is spent on nearest neighbor searches into the scanner data using the algorithm of [10], in order to evaluate the implicit surface and color models. Including the 1-2 seconds required to build the data structure of the nearest neighbor search algorithm for each temporal window, the optimization phase of the tracking algorithm required around 20 seconds per frame. Note that only a small fraction of the recorded data needs to be stored in RAM at any given time. Note also that the computation times seem to scale roughly linearly with template mesh density; for example, the four-fold upsampled template mesh in the video needed ≈ 3.5 times the computation time. The tracking results in the accompanying video are convincing, and exhibit very little accumulation of error, as can be seen from the consistent alignment of the template mesh to the neutral expression in the first and last frames. As no markers were used, the color camera images provide photo-realistic expression wrinkles. A challenging example is shown in Figure 3, where the algorithm convincingly captures complex deformations. Here we provide a few comments on the accompanying video, which contains far more results than this paper1. To test the reliance on color we applied face paint to the female subject. The deterioration in performance is graceful in spite of both the high specularity of the paint and the sparseness of the color information. To demonstrate that the system is not specific to faces we provide an example in which colored cloth is tracked using no change to the processing pipeline, except for a different template mesh topology. The cloth tracking exhibits only minor inaccuracies around the border of the mesh, where there is less information to resolve the ambiguities due to plain-colored and strongly shadowed regions. A final example in the video1 shows a uniformly colored, deforming, and rotating piece of foam being tracked using shape cues alone.
4 Discussion and Future Work
By design, our algorithm does not use optical flow calculations as the basis for the surface tracking. Rather, we combine shape and color information on a coarser scale,
under the assumption that the color does not change excessively on any part of the surface. This assumption did not cause major problems in the case of expression wrinkles, as such wrinkles tend to appear and disappear on a part of the face with little relative motion with respect to the skin. Hence, in terms of the color penalty in the objective function, wrinkles do not induce a strong force in any specific direction. Although there are other lighting effects which are more systematic, such as specularities and self-shadowing, we believe these do not represent a serious practical concern for the following reasons. Firstly, we found that in practice the changes caused by shadows and highlights were largely accounted for by the redundancy in color and shape over time. Secondly, it would be easy to reduce the severity of these lighting effects using light polarisers, more strobes, and lighting normalization based on a model of the fixed scene lighting. Due to the general lack of available data, we were unable to systematically compare the performance of our system with that of others. To make a first step towards establishing a benchmark, we intend to publish data from our system, in order to allow future comparisons. The tracking system we have presented is automated; however, it is straightforward to modify the energy functional we minimize in order to allow the user to edit the result, by adding vertex constraints for example. It would also be interesting to develop a system which can improve the mesh regularisation terms in a face-specific manner, by learning from previous tracking results. Another interesting direction is intelligent occlusion handling, which could overcome some of the limitations of structured light methods, and also allow the tracking of more complex self-occluding objects.
References
1. Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: High-resolution capture for modeling and animation. In: ACM SIGGRAPH (August 2004)
2. Borshukov, G., Lewis, J.P.: Realistic human face rendering for The Matrix Reloaded. In: SIGGRAPH 2003 Sketches. ACM Press, New York (2003)
3. Wand, M., Jenke, P., Huang, Q., Bokeloh, M., Guibas, L., Schilling, A.: Reconstruction of deforming geometry from time-varying point clouds. In: SGP 2007: Proc. Fifth Eurographics Symposium on Geometry Processing, Aire-la-Ville, Switzerland, ACM, Eurographics Association, pp. 49–58 (2007)
4. Huang, X., Zhang, S., Wang, Y., Metaxas, D., Samaras, D.: A hierarchical framework for high resolution facial expression tracking. In: Articulated and Non-Rigid Motion, Washington, DC, USA, vol. 1. IEEE Computer Society, Los Alamitos (2004)
5. Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., Seidel, H.P.: Multi-level partition of unity implicits. ACM Trans. on Graphics 22(3), 463–470 (2003)
6. Desbrun, M., Meyer, M., Schröder, P., Barr, A.H.: Discrete differential-geometry operators for triangulated 2-manifolds. VisMath 2, 35–57 (2002)
7. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: SGP 2006: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, Aire-la-Ville, Switzerland, ACM, Eurographics Association, pp. 61–70 (2006)
8. Botsch, M., Kobbelt, L.: An intuitive framework for real-time freeform modeling. In: SIGGRAPH, pp. 630–634. ACM, New York (2004)
9. Botsch, M., Sorkine, O.: On linear variational surface deformation methods. IEEE Trans. Visualization and Computer Graphics 14(1), 213–230 (2008)
10. Merkwirth, C., Parlitz, U., Lauterborn, W.: Fast nearest neighbor searching for nonlinear signal processing. Phys. Rev. E 62(2), 2089–2097 (2000)
A KNN Implicit Surface and Color Models
In this appendix, we motivate and define our nearest neighbor based implicit surface and color models. Our approach falls into the category of partition of unity methods, in which locally approximating functions are mixed together to form a global one. Let Ω be our domain of interest, and assume that we have a set of non-negative (and typically compactly supported) functions {ϕ_i} which partition unity, i.e., \sum_i ϕ_i(x) = 1, ∀x ∈ Ω. Now let {f_i} be a set of locally approximating functions for each sup(ϕ_i). The partition of unity approximating function on Ω is f(x) = \sum_i ϕ_i(x) f_i(x). The ϕ_i are typically defined implicitly by way of a set of compactly supported auxiliary functions {w_i}. Provided the w_i are non-negative and satisfy sup(w_i) = sup(ϕ_i), the following choice is guaranteed to be a partition of unity: ϕ_i = w_i / \sum_j w_j. Presently we take the extreme approach of associating a local approximating function f_i with each data point from the set x_1, x_2, ..., x_m ∈ R^4 produced by our scanner. In particular, for the implicit surface embedding function f_imp.: R^4 → R, we associate with x_i the linear locally approximating function f_i(x) = (x − x_i)^T n_i, where n_i is the surface normal at x_i. For the color model f_col.: R^4 → R^3, the local approximating functions are simply the constant vector-valued functions f_i(x) = c_i, where c_i ∈ R^3 represents the RGB color at x_i. Note that the above description constitutes a slight abuse of notation due to our having redefined f_i twice. To define the ϕ_i, we first assume w.l.o.g. that d_1 ≤ d_2 ≤ ... ≤ d_k ≤ d_i, ∀i > k, where x is our evaluation point and d_i = ‖x − x_i‖. In practice, we obtain such an ordering by way of a k nearest neighbor search using the TSTOOL software library [10]. By now letting r_i ≡ d_i / d_k and choosing w_i = (1 − r_i)_+, it is easy to see that the corresponding ϕ_i are continuous, differentiable almost everywhere, and that we only need to examine the k nearest neighbors of x in order to compute them. Note that the nearest neighbor search costs are easily amortized between the evaluation of f_imp. and f_col.. Larger values of k average over more local estimates and hence lead to smoother functions; for our experiments we fixed k = 50. Note that the nearest neighbor search requires Euclidean distances in 4D, so we must decide, say, what spatial distance is equivalent to the temporal distance between frames. Too small a spatial distance will treat each frame separately; too large will smear the frames temporally. The heuristic we used was to adjust the time scale such that on average approximately half of the k nearest neighbors of each data point come from the same time (that is, the same 3D frame from the scanner) as that data point, so that the other half come from the surrounding frames. In this way we obtain functions which vary smoothly through space and time. Note that it is easy to visually verify the effect of this choice by rendering the implicit surface and color models, as demonstrated in the accompanying video. This method is particularly efficient when we optimize on a moving window as discussed in Section 2.2. In this case, reasonable assumptions imply that the implicit surface and color models enjoy setup and evaluation costs of O(q log(q)) and O(k log(q)) respectively, where q is the number of vertices in a single 3D frame.
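The construction above translates almost directly into code. The sketch below evaluates f_imp. and f_col. at a single 4D query point using a brute-force neighbor search on random stand-in data; the actual system uses the TSTOOL nearest neighbor library [10] and the time-scale heuristic described above.

```python
import numpy as np

# k-NN partition-of-unity evaluation of the implicit surface and color models.
def evaluate(x, pts, nrm, col, k=50):
    d = np.linalg.norm(pts - x, axis=1)                  # 4D distances d_i
    idx = np.argsort(d)[:k]                              # k nearest neighbours
    r = d[idx] / d[idx[-1]]                              # r_i = d_i / d_k
    w = np.maximum(1.0 - r, 0.0)                         # w_i = (1 - r_i)_+
    phi = w / w.sum()                                    # partition of unity
    f_imp = phi @ np.einsum("ij,ij->i", x - pts[idx], nrm[idx])   # sum phi_i (x - x_i)^T n_i
    f_col = phi @ col[idx]                               # sum phi_i c_i
    return f_imp, f_col

# Random stand-in data: m samples of (point, normal, color) in 4D/3D.
rng = np.random.default_rng(1)
pts = rng.standard_normal((1000, 4))
nrm = rng.standard_normal((1000, 4)); nrm[:, 3] = 0.0
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
col = rng.random((1000, 3))
print(evaluate(np.array([0.1, 0.0, 0.0, 0.0]), pts, nrm, col))
```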
The Stixel World - A Compact Medium Level Representation of the 3D-World

Hernán Badino1, Uwe Franke2, and David Pfeiffer2

1 Goethe University Frankfurt
[email protected]
2 Daimler AG
{uwe.franke,david.pfeiffer}@daimler.com
Abstract. Ambitious driver assistance for complex urban scenarios demands a complete awareness of the situation, including all moving and stationary objects that limit the free space. Recent progress in real-time dense stereo vision provides precise depth information for nearly every pixel of an image. This raises new questions: How can one efficiently analyze half a million disparity values of next-generation imagers? And how can one find all relevant obstacles in this huge amount of data in real time? In this paper we build a medium-level representation named "stixel-world". It takes into account that the free space in front of vehicles is limited by objects with almost vertical surfaces. These surfaces are approximated by adjacent rectangular sticks of a certain width and height. The stixel-world turns out to be a compact but flexible representation of the three-dimensional traffic situation that can be used as the common basis for the scene understanding tasks of driver assistance and autonomous systems.
1 Introduction
Stereo vision will play an essential role for scene understanding in cars of the near future. Recently, the dense stereo algorithm "Semi-Global Matching" (SGM) has been proposed [1], which offers accurate object boundaries and smooth surfaces. According to the Middlebury database, three out of the ten most powerful stereo algorithms are currently SGM variants. Due to the computational burden, in particular the required memory bandwidth, the original SGM algorithm is still too complex for a general-purpose CPU. Fortunately, we were able to implement an SGM variant on an FPGA (Field Programmable Gate Array). The task at hand is to extract and track every object of interest captured within the stereo stream. The research of the last decades was focused on the detection of cars and pedestrians from mobile platforms. It is common to recognize different object classes independently; therefore the image is evaluated repetitively. This common approach results in complex software structures, which remain incomplete in detection, since only objects of interest are observed. Aiming at a generic vision system architecture for driver assistance, we suggest the use of a medium-level representation that bridges the gap between the pixel and the object level. To serve the multifaceted requirements of automotive environment perception and modeling, such a representation should be:
– compact: offering a significant reduction of the data volume,
– complete: information of interest is preserved,
– stable: small changes of the underlying data must not cause rapid changes within the representation,
– robust: outliers must have minimal or no impact on the resulting representation.
We propose to represent the 3D situation by a set of rectangular sticks named "stixels" as shown in Fig. 1(b). Each stixel is defined by its 3D position relative to the camera and stands vertically on the ground, having a certain height. Each stixel limits the free space and approximates the object boundaries. If, for example, the width of the stixels is set to 5 pixels, a scene from a VGA image can be represented by 640/5 = 128 stixels only. Observe that a similar stick scheme was already formulated in [2] to represent and render 3D volumetric data at high compression rates. Although our stixels are different from those presented in [2], the properties of compression, compactness and exploitation of spatial coherence are common to both representations. The literature provides several object descriptors such as particles [3], quadtrees, octtrees and quadrics [4][5], patchlets [6] or surfels [7]. Even though these structures partly satisfy our requirements, we refrain from using them since they do not achieve the degree of compactness we strive for.

Fig. 1. (a) Dense stereo results overlaid on the image of an urban traffic situation. The colors encode the distance: red means close, green represents far. Note that SGM delivers measurements even for most pixels on the road. (b) Stixel representation for this situation. The free space (not explicitly shown) in front of the car is limited by the stixels. The colors encode the lateral distance to the expected driving corridor, shown in blue.

Hernán Badino is now with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
Section 2 describes the steps required to build the stixel-world from raw stereo data. Section 3 presents results and properties of the proposed representation. Future work is discussed in Section 4 and Section 5 concludes the paper.
2 Building the Stixel-World
Traffic scenes typically consist of a relatively planar free space limited by 3D obstacles that have a nearly vertical pose. Fig. 1 displays a typical disparity input and the resulting stixel-world. The different steps applied to construct this representation are depicted in Fig. 2 and Fig. 3. An occupancy grid is computed from the stereo data (see Fig. 2(a)) and used for an initial free space computation. We formulate the problem in such a way that we are able to use dynamic programming, which yields a global optimum. The result of this step is shown in Fig. 2(c) and 3(a). By definition, the free space ends at the base-point of vertical obstacles. Stereo disparities vote for their membership to the vertical obstacles, generating a membership cost image (Fig. 3(c)). A second dynamic programming pass optimally estimates the height of the obstacles. An appropriate formulation of this problem allows us to reuse for this task the same dynamic programming algorithm that was applied for the free space computation. The result of the height estimation is depicted in Fig. 3(d). Finally, a robust averaging of the disparities of each stixel yields a precise model of the scene.
2.1 Dense Stereo
Most real-time stereo algorithms based on local optimization techniques deliver sparse disparity data. Hirschmüller [1] proposed a dense stereo scheme named "Semi-Global Matching" that runs within a few seconds on a PC. For road scenes, the "Gravitational Constraint" has been introduced in [8], which improves the results by taking into account that the disparities tend to increase monotonously from top to bottom. The implementation of this stereo algorithm on an FPGA allows us to run this method in real time. Fig. 1(a) shows that SGM is able to model object boundaries precisely. In addition, the smoothness constraint used in the algorithm leads to smooth estimations in low-contrast regions, as can be seen exemplarily on the street and the untextured parts of the vehicles and buildings.
2.2 Occupancy Grid
The stereo disparities are used to build a stochastic occupancy grid. An occupancy grid is a two-dimensional array which models occupancy evidence of the environment. Occupancy grids were first introduced in [9]; a review is given in [10]. The grids are computed in real time using the method presented in [11], which allows the uncertainty of the stereo disparities to be propagated onto the grid. We use a polar occupancy grid in which the image column represents the angular coordinate and the stereo disparity represents the range. Figure 2(a) shows an example of the polar occupancy grid obtained from the stereo result shown in Fig. 1(a). Only those 3D measurements lying above the road are registered as obstacles in the occupancy grid. Instead of assuming a planar road, we estimate the road pose by fitting a B-spline surface to the 3D data as proposed in [12].

Fig. 2. Occupancy grids: (a) the polar occupancy grid obtained from the disparity image shown in Fig. 1(a) (brightness encodes the likelihood of occupancy); (b) the result of applying background subtraction to (a); (c) the free space obtained from dynamic programming, shown in green, overlaid on a Cartesian representation of the occupancy grid.
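A much simplified sketch of accumulating such a polar grid from a disparity image: each valid disparity votes for one cell in its image column. The disparity-uncertainty model of [11], the B-spline road model of [12] and the restriction to measurements above the road are all omitted; the inputs are placeholders.

```python
import numpy as np

# Simplified polar occupancy grid: grid rows are disparity bins, columns are
# image columns; each valid pixel casts a unit vote.
def polar_occupancy_grid(disparity, valid, num_bins=128, max_disp=64.0):
    h, w = disparity.shape
    grid = np.zeros((num_bins, w))
    bins = np.clip((disparity / max_disp * num_bins).astype(int), 0, num_bins - 1)
    for u in range(w):
        rows = np.nonzero(valid[:, u])[0]
        np.add.at(grid[:, u], bins[rows, u], 1.0)
    return grid

# Usage with a hypothetical SGM disparity image
disp = np.random.rand(480, 640) * 64.0
grid = polar_occupancy_grid(disp, valid=disp > 1.0)
```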
2.3 Free Space Computation
The task in free space analysis is to find the first visible relevant obstacle in the positive direction of depth. Observing Fig. 2(a), this means that the search must start from the bottom of the image in the vertical direction until an occupied cell is found. The space found in front of this cell is considered free space. Instead of using a thresholding operation for every column independently, we use dynamic programming (DP) to find the optimal path cutting the polar grid from left to right. As proposed in [11], spatial smoothness is imposed by using a cost that penalizes jumps in depth, while temporal smoothness is imposed by a cost that penalizes the deviation of the current solution from a prediction. The prediction is obtained from the segmentation result of the previous cycle. In real-world scenes, an image column may contain more than one object. In the example considered here, the guardrail at the right and the building in the background in Fig. 1 both have a corresponding occupancy likelihood in the occupancy grid of Fig. 2(a). Nevertheless, by definition, the free space is given only up to the guardrail. Applying dynamic programming directly on the grid of
Fig. 2(a) might lead to a solution where the optimal boundary is found on the background object (i.e., the building) and not on the foreground object (i.e., the guardrail). To cope with this problem, a background subtraction is carried out before applying DP. All occupied cells behind the first maximum which is above a given threshold are marked as free. The threshold must be selected so that it is clearly larger than the noise expected in the occupancy grid. An example of the resulting background subtraction is shown in Fig. 2(b). The output of the DP is a set of vector coordinates (u, d̂_u), where u is a column of the image and d̂_u the disparity corresponding to the distance up to which free space is available. For every pair (u, d̂_u) a corresponding triangulated pair (x_u, z_u) is computed, which defines the 2D world point corresponding to (u, d̂_u). The sorted collection of points (x_u, z_u) plus the origin (0, 0) forms a polygon which defines the free space area from the camera point of view (see Fig. 2(c)). Fig. 3(a) shows the free space overlaid on the left image when dynamic programming is applied to Fig. 2(b). Observe that each free space point of the polygon in Fig. 3(a) indicates not only the interruption of the free space but also the base-point of a potential obstacle located at that position (a similar idea was successfully applied in [13]). The next section describes how to apply a second pass of dynamic programming in order to obtain the upper boundary of the obstacle.
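A compact sketch of the column-wise dynamic programming idea: given a per-cell cost derived from the (background-subtracted) occupancy grid, the globally optimal left-to-right path is found with a penalty on disparity jumps between neighboring columns. The linear jump penalty and the omission of the temporal prediction term are simplifications relative to [11].

```python
import numpy as np

# Left-to-right DP over a polar cost image: cost[d, u] is low where the
# free space boundary is likely to lie at disparity bin d in column u.
def free_space_dp(cost, jump_penalty=0.5):
    D, U = cost.shape
    acc = np.empty_like(cost)
    back = np.zeros((D, U), dtype=int)
    acc[:, 0] = cost[:, 0]
    d = np.arange(D)
    for u in range(1, U):
        # trans[d_new, d_prev]: accumulated cost of switching disparity bins
        trans = acc[:, u - 1][None, :] + jump_penalty * np.abs(d[:, None] - d[None, :])
        back[:, u] = np.argmin(trans, axis=1)
        acc[:, u] = cost[:, u] + trans[d, back[:, u]]
    path = np.empty(U, dtype=int)                 # backtrack the optimal path
    path[-1] = int(np.argmin(acc[:, -1]))
    for u in range(U - 1, 0, -1):
        path[u - 1] = back[path[u], u]
    return path

boundary = free_space_dp(np.random.rand(64, 200))   # one disparity bin per column
```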
2.4 Height Segmentation
The height of the obstacles is obtained by finding the optimal segmentation between foreground and background disparities. This is achieved by first computing a cost image and then applying dynamic programming to find the upper boundary of the objects. Given the set of points (u, d̂_u) and their corresponding triangulated coordinate vectors (x_u, z_u) obtained from the free space analysis, the task is to find the optimal row position v_t where the upper boundary of the object at (x_u, z_u) is located. In our approach, every disparity d(u, v) (i.e., the disparity at column u and row v) of the disparity image votes for its membership to the foreground object. In the simplest case a disparity votes positively for its membership to the foreground object if it does not deviate more than a maximal distance from the expected disparity of the object, and negatively otherwise. Such Boolean assignments make the threshold for the distance very sensitive: if it is too large, all disparities vote for foreground membership; if it is too small, all points vote for the background. A better alternative is to approximate the Boolean membership by a continuous variation with an exponential function of the form

M_{u,v}(d) = 2^{\,1 - \left( (d - \hat{d}_u) / \Delta D_u \right)^2} - 1,    (1)
where ΔD_u is a computed parameter and d̂_u is the disparity obtained from the free space vector (Sec. 2.3), i.e., the initially expected disparity of the
foreground object in column u. The variable ΔD_u is derived for every column independently as

\Delta D_u = \hat{d}_u - f_d(z_u + \Delta Z_u), \quad \text{where } f_d(z) = \frac{b \cdot f_x}{z},    (2)

and f_d(z) is the disparity corresponding to depth z, b is the baseline, f_x is the focal length and ΔZ_u is a parameter. This strategy has the objective of defining the membership as a function in meters instead of pixels, to correct for perspective effects. For the results shown in this paper we use ΔZ_u = 2 m. Fig. 3(b) shows an example of the membership values. Our experiments show that the explicit choice of the functional is not crucial as long as it is continuous. From the membership values the cost image is computed:

C(u, v) = \sum_{i=0}^{v-1} M_{u,v}(d(u, i)) - \sum_{i=v}^{v_f} M_{u,v}(d(u, i)),    (3)

where v_f is the row position such that the triangulated 3D position of disparity d̂_u at image position (u, v_f) lies on the road, i.e., the row corresponding to the base-point of the object. Fig. 3(c) shows an exemplary cost image. For the computation of the optimal path, a graph G_hs(V_hs, E_hs) is generated. V_hs is the set of vertices and contains one vertex for every pixel in the image. E_hs is the set of edges, which connect every vertex of one column with every vertex of the following column. The cost minimized by dynamic programming is composed of a data and a smoothness term, i.e.,

c_{u,v_0,v_1} = C(u, v_0) + S(u, v_0, v_1)    (4)

is the cost of the edge connecting the vertices V_{u,v_0} and V_{u+1,v_1}, where C(u, v) is the data term as defined in Eq. 3. S(u, v_0, v_1) applies smoothness and penalizes jumps in the vertical direction; it is defined as

S(u, v_0, v_1) = C_s \, |v_0 - v_1| \cdot \max\!\left(0,\; 1 - \frac{|z_u - z_{u+1}|}{N_Z}\right),    (5)

where C_s is the cost of a jump. The cost of a jump is proportional to the difference between the rows v_0 and v_1. The last term has the effect of relaxing the smoothness constraint at depth discontinuities. The spatial smoothness cost of a jump becomes zero if the difference in depth between the columns is equal to or larger than N_Z. The cost reaches its maximum C_s when the free space distance between consecutive columns is 0. For our experiments we use N_Z = 5 m and C_s = 8. An exemplary result of the height segmentation for the free space computed in Fig. 3(a) is shown in Fig. 3(d).
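The membership and cost computations of Eqs. (1) and (3) can be written compactly with prefix sums, as sketched below; the inputs d̂_u, ΔD_u and v_f are assumed to be supplied by the free space stage, and the toy values at the end are placeholders.

```python
import numpy as np

# Membership (Eq. 1) and cost image (Eq. 3) via per-column prefix sums.
def height_cost_image(disparity, d_hat, delta_d, v_f):
    H, W = disparity.shape
    C = np.zeros((H, W))
    v = np.arange(H)
    for u in range(W):
        m = 2.0 ** (1.0 - ((disparity[:, u] - d_hat[u]) / delta_d[u]) ** 2) - 1.0  # Eq. (1)
        prefix = np.concatenate(([0.0], np.cumsum(m)))      # prefix[v] = sum_{i<v} m[i]
        # Eq. (3): sum_{i<v} M  -  sum_{v<=i<=v_f} M
        C[:, u] = prefix[v] - (prefix[v_f[u] + 1] - prefix[v])
    return C

H, W = 480, 640
disp = np.random.rand(H, W) * 64.0
C = height_cost_image(disp, d_hat=np.full(W, 20.0),
                      delta_d=np.full(W, 3.0), v_f=np.full(W, H - 1))
```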
2.5 Stixel Extraction
Fig. 3. Stixel computation: (a) the result obtained from the free space computation with dynamic programming; (b) the membership values assigned for the height segmentation; (c) the cost image (grey values negatively scaled); (d) the resulting height segmentation.

Once the free space and the height for every column have been computed, the extraction of the stixels is straightforward. If the predefined width of the stixel
is more than one column, the heights obtained in the previous step are fused, resulting in the height of the stixel. The base point v_B, the top point v_T and the width of the stixel span a frame where the stixel is located. Due to discretization effects of the free space computation, which are caused by the finite resolution of the occupancy grid, the free space vector is condemned to a limited accuracy in depth. Further spatial integration over the disparities within this frame grants an additional gain in depth accuracy. The disparities found within the stixel area are registered in a histogram while regarding the depth uncertainty known from SGM. A parabolic fit around the maximum delivers the new depth information. This approach offers outlier rejection and noise suppression, which is illustrated by Fig. 4, where the SGM stereo data of the rear of a truck are displayed. Assuming a disparity noise of 0.2 px, a stereo baseline of 0.35 m and a focal length of 830 px, as in our experiments, the expected standard deviation for the truck at 28 meters is approx. 0.54 m. Since an average stixel covers hundreds of disparity values, the integration significantly improves the depth of the stixel. As expected, the uncertainty falls below 0.1 m for each stixel.
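A minimal sketch of the per-stixel depth refinement described above: the disparities inside the stixel frame are histogrammed and a parabola is fitted around the histogram maximum for sub-bin precision. The bin width is an assumption, and the SGM uncertainty weighting is omitted.

```python
import numpy as np

# Histogram of the disparities inside a stixel frame plus a parabolic peak fit.
def refine_stixel_disparity(disparities, bin_width=0.25):
    lo, hi = disparities.min(), disparities.max()
    edges = np.arange(lo, hi + 2 * bin_width, bin_width)
    hist, edges = np.histogram(disparities, bins=edges)
    k = int(np.argmax(hist))
    offset = 0.0
    if 0 < k < len(hist) - 1:
        y0, y1, y2 = hist[k - 1], hist[k], hist[k + 1]
        denom = y0 - 2 * y1 + y2
        if denom != 0:
            offset = 0.5 * (y0 - y2) / denom        # vertex of the fitted parabola
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[k] + offset * bin_width

d = 10.4 + 0.2 * np.random.default_rng(0).standard_normal(500)  # noisy toy disparities
print(refine_stixel_disparity(d))        # close to 10.4 despite the noise
```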
Fig. 4. 3D visualization of the raw stereo data showing a truck driving 28 meters ahead. Each red line represents 1 meter in depth. One can clearly observe the high scattering of the raw stereo data while the stixels remain as a compound and approximate the planar rear of the truck.
3 Experimental Results
Figure 5 displays the results of the described algorithm when applied to images taken from different road scenarios such as highways, construction sites, rural roads and urban environments. The stereo baseline is 0.35 m, the focal length 830 pixels, and the images have VGA (640 × 480 pixels) resolution. The color of the stixels encodes the lateral distance to the expected driving corridor. It is clearly visible that even filigree structures like beacons or reflector posts are captured in their position and extension. For clarity we do not explicitly show the obtained free space. The complete computation of stixels on an Intel Quad Core 3.00 GHz processor takes less than 25 milliseconds. The examples shown in this paper must be taken as representative results of the proposed approach. In fact, the method has successfully passed days of real-time testing in our demonstrator vehicle in urban, highway and rural environments.
Fig. 5. Evaluation of stixels in different real-world road scenarios showing a highway, a construction site, a rural road and an urban environment. The color encodes the lateral distance to the driving corridor. Base points (i.e., distance) and height estimates are in very good accordance with the expectation.

4 Future Work
In the future we intend to apply tracking to stixels based upon the principles of 6D-Vision [14], where 3D points are tracked over time and integrated with Kalman filters. The integration of stixels over time will lead to a further improvement of position and height. At the same time it will be possible to estimate velocity and acceleration, which will ease subsequent object clustering steps. Almost all objects of interest within the dynamic vehicle environment touch the ground. Nevertheless, hovering or flying objects such as traffic signs, traffic lights and side mirrors (an example is given in Fig. 5(b) at the traffic sign)
violate this constraint. Our future work also includes providing a dynamic height of the base-point.
5 Conclusion
A new primitive called stixel was proposed for modeling 3D scenes. The resulting stixel-world turns out to be a robust and very compact representation (not only) of the traffic environment, including the free space as well as static and moving objects. Stochastic occupancy grids are computed from dense stereo information. Free space is computed from a polar representation of the occupancy grid in order to obtain the base-points of the obstacles. The height of the stixels is obtained by segmenting the disparity image into foreground and background disparities, applying the same dynamic programming scheme as used for the free space
computation. Given height and base point, the depth of the stixel is obtained with high accuracy. The proposed stixel scheme serves as a well-formulated medium-level representation for traffic scenes. Obviously, the presented approach is also promising for other applications that obey the same assumptions about the underlying scene structure.
Acknowledgment The authors would like to thank Jan Siegemund for his contribution to the literature review and Stefan Gehrig and Andreas Wedel for fruitful discussions.
References
1. Hirschmüller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: CVPR (2005)
2. Montani, C., Scopigno, R.: Rendering volumetric data using the sticks representation scheme. In: Workshop on Volume Visualization, San Diego, California (1990)
3. Fua, P.: Reconstructing complex surfaces from multiple stereo views. In: ICCV (June 1996)
4. Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface reconstruction from unorganized points. In: Conference on Computer Graphics and Interactive Techniques, pp. 71–78 (1992)
5. Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., Seidel, H.P.: Multi-level partition of unity implicits. ACM SIGGRAPH 2003 22(3), 463–470 (2003)
6. Murray, D., Little, J.J.: Segmenting correlation stereo range images using surface elements. In: 3D Data Processing, Visualization and Transmission, September 2004, pp. 656–663 (2004)
7. Pfister, H., Zwicker, M., van Baar, J., Gross, M.: Surfels: Surface elements as rendering primitives. In: ACM SIGGRAPH (2000)
8. Gehrig, S., Franke, U.: Improving sub-pixel accuracy for long range stereo. In: VRML Workshop, ICCV (2007)
9. Elfes, A.: Sonar-based real-world mapping and navigation. Journal of Robotics and Automation 3(3), 249–265 (1987)
10. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. Intelligent Robotics and Autonomous Agents. The MIT Press, Cambridge (2005)
11. Badino, H., Franke, U., Mester, R.: Free space computation using stochastic occupancy grids and dynamic programming. In: Workshop on Dynamical Vision, ICCV, Rio de Janeiro, Brazil (October 2007)
12. Wedel, A., Franke, U., Badino, H., Cremers, D.: B-spline modeling of road surfaces for freespace estimation. In: Intelligent Vehicle Symposium (2008)
13. Kubota, S., Nakano, T., Okamoto, Y.: A global optimization algorithm for real-time on-board stereo obstacle detection systems. In: Intelligent Vehicle Symposium (2007)
14. Franke, U., et al.: 6D-Vision: Fusion of stereo and motion for robust environment perception. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005, vol. 3663, pp. 216–223. Springer, Heidelberg (2005)
Global Localization of Vehicles Using Local Pole Patterns
Claus Brenner
Institute of Cartography and Geoinformatics, Leibniz Universität Hannover, Appelstraße 9a, 30167 Hannover, Germany
[email protected]
Abstract. Accurate and reliable localization is an important requirement for autonomous driving. This paper investigates an asymmetric model for global mapping and localization in large outdoor scenes. In the first stage, a mobile mapping van scans the street environment in full 3D, using high accuracy and high resolution sensors. From this raw data, local descriptors are extracted in an offline process and stored in a global map. In the second stage, vehicles, equipped with simple, inaccurate sensors are assumed to be able to recover part of these descriptors which allows them to determine their global position. The focus of this paper is on the investigation of local pole patterns. A descriptor is proposed which is tolerant with regard to missing data, and performance and scalability are considered. For the experiments, a large, dense outdoor LiDAR scan with a total length of 21.7 km is used.
1 Introduction
For future driver assistance systems and autonomous driving, reliable positioning is a prerequisite. While global navigation satellite systems like GPS are in widespread use, they lack the required reliability, especially in densely built-up urban areas. Relative positioning, using video or LiDAR sensors, is an important alternative, which has been explored in many disciplines, like photogrammetry, computer vision and robotics. In order to determine one's position uniquely, a global map is required, which is usually based on features (or landmarks), rather than on the originally captured raw data. In robotics, features like line segments or corners have been extracted from horizontal scans [1]. However, such features exhibit a low degree of information and, especially for indoor sites, are not very discriminative. Recently, there has been much research in computer vision using high-dimensional descriptors (such as SIFT), extracted from images, which can be used for object recognition [2] or localization [3]. Thus, it is interesting whether high-dimensional descriptors can be found which are strictly based on geometry and work in large outdoor environments. In this paper, a large 3D scanned outdoor scene is used from which a map of features is derived. Upright poles are extracted, which can be done quite reliably using simple geometric reasoning. The pole centers then form a 2D pattern. 2D point pattern matching is a research topic of its own, with applications for star
pattern and fingerprint matching. Van Wamelen et al. [4] present an expected linear time algorithm which finds a pattern in O(n (log k)^(3/2)), where n is the number of points in the scene and k < n the number of points in the subset which is to be searched for in the scene. Bishnu et al. [5] describe an algorithm which is O(k n^(4/3) log n) in the worst case, but is reported to perform much better in practical cases. It is based on indexing the distances of point pairs. This paper generalizes this approach by encoding the local relations of two or more points.
2 Mobile Mapping Setup and Feature Extraction
We obtained a dense LiDAR scan of a part of Hannover, Germany, acquired by the Streetmapper mobile mapping system, jointly developed by 3D Laser Mapping Ltd., UK, and IGI mbH, Germany [6]. The scan was acquired with a configuration of four scanners. Postprocessing yields the trajectory and, through the calibrated relative orientations of the scanners and the GNSS/IMU system, a georeferenced point cloud. Absolute point accuracy varies depending on GPS outages; however, a relative accuracy of a few centimeters can be expected due to the high-accuracy fiber-optic IMU employed. Since we want to deal with local point patterns, relative accuracy is actually much more important than absolute accuracy. Fig. 1(a) shows an overview. Note that the scanned area contains streets in densely built-up regions as well as highway-like roads. The total length of the scanned roads is 21.7 kilometers, captured in 48 minutes, which is an average speed of 27 km/h. During that time, 70.7 million points were captured, corresponding to an effective measurement rate of 24,500 points per second. On average, each road meter is covered by more than 3,200 points. After obtaining the point cloud, the first step is to extract features. The basic motivation for this is that many of the scanned points do not convey much information, and by reducing the huge cloud to only a few important features, transmission, storage, and computation requirements are substantially reduced. Our first choice was poles, which are usually abundant in inner-city scenes and geometrically stable over time. Pole extraction uses a simple geometric model, namely that the basic characteristic of a pole is that it is upright, there is a kernel region of radius r1 where laser scan points are required to be present, and a hollow cylinder, between r1 and r2 (r1 < r2), where no points are allowed to be present (Fig. 1(b)). The structure is analyzed in stacks of hollow cylinders. A pole is confirmed when a certain minimum number of stacked cylinders is found. After a pole is identified, the points in the kernel are used for a least squares estimation of the pole center. Note that this method owes its reliability to the availability of full 3D data, i.e., the processing of the 3D stack of cylinders. Using just one scan plane parallel to the ground (as is often the case in robotics) or projecting the 3D cloud down to the ground would not yield the same detection reliability. The method also extracts some tree trunks of diameter ≤ r1, which we do not attempt to discard since they are useful for positioning purposes as well.
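The stacked-cylinder test lends itself to a compact implementation. The following sketch is illustrative only: the radii, slice height and vote threshold are assumed values (the paper does not state them), and the mean of the kernel points stands in for the least-squares center estimate.

```python
import numpy as np

def detect_pole_at(points, cx, cy, r1=0.15, r2=0.5,
                   z_min=0.0, z_max=4.0, slice_h=0.25, min_slices=8):
    """Check for an upright pole at the candidate center (cx, cy).

    points: (N, 3) array of scan points (x, y, z).
    A vertical slice 'votes' for a pole if it contains points inside the
    kernel radius r1 and no points in the hollow ring between r1 and r2.
    All numeric parameters are illustrative assumptions.
    """
    d_xy = np.hypot(points[:, 0] - cx, points[:, 1] - cy)
    z = points[:, 2]
    votes, kernel_pts = 0, []
    for z0 in np.arange(z_min, z_max, slice_h):
        in_slice = (z >= z0) & (z < z0 + slice_h)
        in_kernel = in_slice & (d_xy <= r1)
        in_ring = in_slice & (d_xy > r1) & (d_xy <= r2)
        if in_kernel.any() and not in_ring.any():
            votes += 1
            kernel_pts.append(points[in_kernel, :2])
    if votes < min_slices:
        return None
    # mean of the kernel points as a simple stand-in for the least-squares center
    return np.vstack(kernel_pts).mean(axis=0)
```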
Fig. 1. (a) Scan path in Hannover (color encodes height of scanned points, using temperature scale). (b) Illustration of the pole extraction algorithm, using cylindrical stacks. In this example, some parts of the pole were not verified due to missing scan points (middle), others due to an additional structure, such as a sign mounted to the pole (top). (c) Extracted poles for the intersection ‘Friederikenplatz’ in Hannover. (a) and (c) are overlaid with a cadastral map (which takes no part in any computation).
For the entire 22 km scene, a total of 2,658 poles was found, which is on average one pole every 8 meters (Fig. 1(c)). In terms of data reduction, this is one extracted pole (or 2D point) per 27,000 original scan points. Although the current implementation is not optimized, processing time is not critical; the method yields several poles per second on a standard PC. There are also some false detections in the scene, e.g. where a pole-like point pattern is induced by low point density or occlusion. Nevertheless, the pole patterns obtained are considered to be representative.
3 Characteristics of Local Pole Patterns
The idea for global localization of vehicles is to use the local pole pattern and match this to the global set of poles. As opposed to the general case of point pattern matching, one can assume that the scale is fixed. Also, there are additional constraints which arise from this special application, namely a combination of ‘horizon diameter’ and measurement accuracy. Since vehicles are not equipped with high accuracy positioning sensors, one can achieve a good measurement accuracy only for scenes of small extent. In order to keep the treatment general and map-centric, i.e. without the necessity to rely on a specific vehicle sensor, standard parameters are used in this paper. It is assumed that the ‘horizon’ of a vehicle is a 50 meter radius disk. Within this radius, the vehicle is deemed to be able to measure poles with an accuracy of ε = 0.1 m (or, alternatively, ε = 0.2 m). A pole accuracy of ε means that during matching, a pole from the reference map and one from the vehicle’s horizon are considered to form a matching pair if they are within a distance of ε. Note that these assumptions do not imply that an actual vehicle must be equipped with a 360◦, 50 m range sensor. Rather,
it is just an assumption regarding the scene extent and accuracy which seems feasible, either by direct measurement or by merging multiple scans along the path of the vehicle. Using the assumed parameters, one can derive the first pole pattern characteristics. Placing the center of the ‘horizon’ on every pole in turn, and counting the number of other poles within a radius of r = 50 m, one finds that the average number of neighbor poles is around 18, with a maximum of 41. The number of neighbors is not uniformly distributed; rather, there is a peak around 17, as can be seen from the histogram in Fig. 2(a). Around 50% of all poles have between 12 and 22 neighbors. To determine uniqueness, the following experiment was performed. Each of the poles pi, 1 ≤ i ≤ N = 2,658, is taken as center and its local set of neighbor poles Pi = {pj | j ≠ i, ||pi − pj||2 ≤ r} (with r = 50 m) is matched to every pole neighborhood Pj, j ≠ i, in the scene. Taking into account all possible rotations, it is counted how many poles match within a tolerance of 2ε, and the maximum over all these counts is taken, i.e. ni,max = max_{j≠i} matchcount(Pi, Pj). The result is shown as a histogram in Fig. 2(b), where the area of the bubbles reflects the count and the pole neighborhoods are sorted according to the number of neighbors |Pi|. For example, the bubble at (17, 2) represents the number of poles with 17 other poles in the neighborhood, for which ni,max = 2 (in this case, the bubble area represents a count of 101, which means that 101 out of the 155 poles with 17 neighbors (peak in Fig. 2(a)) fulfill this criterion). An important observation is that for poles with up to around 20 neighbors, there is a strong peak at two, which means that in the majority of those cases, if we take the center and three more points in the neighborhood, this is a pattern which is unique in the entire scene. This property is used in the next section for the design of the local descriptor. It can also be seen that ni,max may be as large as 20, for poles with |Pi| ≥ 36. As it turns out, this is due to alleys in the scene, where trees are planted with regular spacing. Along those alleys, there is a huge number of neighbors, many of which fit to other places along the alley. Nevertheless, one can see that even in this case, ni,max is only around 50% of |Pi|.
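The neighborhood statistics above can be reproduced with a few lines of code; the sketch below assumes the pole centers are available as an (N, 2) array and simply counts, for each pole, the other poles within the 50 m horizon.

```python
import numpy as np

def neighbor_counts(poles, r=50.0):
    """poles: (N, 2) array of pole centers; returns, for each pole,
    the number of other poles within distance r."""
    diff = poles[:, None, :] - poles[None, :, :]   # (N, N, 2) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)
    within = dist <= r
    np.fill_diagonal(within, False)                # exclude the pole itself
    return within.sum(axis=1)

# e.g. np.histogram(neighbor_counts(poles), bins=np.arange(0, 43))
# reproduces the statistics behind Fig. 2(a): a peak around 17-18 neighbors.
```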
4 A Local Pole Pattern Descriptor
4.1 The Curse of Dimensionality
Similar to the case of image retrieval from large databases, we are looking for a local ‘pole descriptor’ which can be retrieved quickly from a huge scene, containing millions of poles. One of the key properties of local descriptors used in vision (such as the SIFT descriptor [2]) is that through their high dimensionality, they are quite unique, even in large databases. Similarly, in our case, the patterns of local pole neighbors are high-dimensional and unique. For example, if |Pi | = 17, we can describe the center point and its 17 neighbors in a unique way using their local (x, y) or polar coordinates, which will yield 33 = 2 · 18 − 3 parameters (in general, k points in 2D will require 2k − 3 parameters when rotation and translation are not fixed). Different from the case of the SIFT descriptor,
Fig. 2. (a) Histogram of the number of poles in a local neighborhood (r = 50 m disk). (b) Histogram of the maximum number of poles in the neighborhood which match to another pole neighborhood in the scene. x-axis is the number of neighbors |Pi|, y-axis is the maximum number of matches ni,max, and the area of the bubbles represents the number of cases.
however, the number of dimensions would not be fixed but rather depend on the local scene content. Moreover, since one has to take into account missing or extra poles resulting from scene interpretation errors, descriptors of different dimensions would have to be compared. As is well known, common efficient indexing methods are not successful in databases of high dimension. For example, the popular kd-tree has a time complexity for retrieval of O(n^(1−1/d) + l), where n is the number of elements in the database, d is the dimension, and l is the number of retrieved neighbors. This means that for high dimensions, it is as efficient as a brute-force O(n) search over all elements. Therefore, one has to rely on approximations, for example searching for a near neighbor only (as done by Lowe [2]) or using quantization (as done by Nistér and Stewénius [7], where quantization into cells is given by a hierarchic clustering tree). In our case, quantization is straightforward, since the parameters required to express the pole pattern are geometric in nature, with given maximum range and measurement accuracy. For example, using a quantization of 2ε = 0.2 m within a total range of ±50 m would yield 500 discrete values, which requires approximately α ≈ 9 bits to encode. Thus, let us assume for the moment that the number of poles in all neighborhoods of the database (and the query) is the same, say k, in which case the dimension is d = 2k − 3, and each local neighborhood can be encoded into a single integer number using αd bits. Since each descriptor is just an integer with bounded range, one can search for an exact match instead of a nearest neighbor, using perfect hashing on a grid, with the remarkable time complexity of only O(1); however, this would require a (perhaps unrealistic) O(αd · n^3) preprocessing time [8]. Alternatively, using a search tree or sorting all database entries would still yield a time complexity of O(log n) and would require only O(n log n) preprocessing time.
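To make the "fold d dimensions into one integer" idea concrete, the following hedged sketch packs a quantized descriptor into a single key; the cell size and bit budget follow the numbers above, while the handling of the value range (in particular of the diameter, which can exceed 50 m) is a simplification.

```python
def descriptor_key(descriptor, cell=0.2, offset=50.0, bits_per_dim=9):
    """Fold a d-dimensional geometric descriptor into one integer key.

    Each value is quantized into cells of size `cell` within a +/- `offset`
    range (500 cells, i.e. roughly 9 bits per dimension), then the quantized
    indices are packed into a single integer.  Values outside the range would
    need a larger bit budget; this sketch does not handle that case.
    """
    key = 0
    for value in descriptor:
        q = int((value + offset) / cell)      # e.g. 0 .. 499 for values in [-50, 50)
        key = (key << bits_per_dim) | q
    return key
```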
However, there are two caveats. First, since there is noise in the measured query vector, quantization may lead to an error of ±1 (if the cell size is chosen as outlined above). That is, one needs to search the neighboring cells as well. Using a tree, this can be done in O(1), however only for a single dimension, which will not help, since we ‘folded’ d dimensions into one integer. Overmars [9] proposed an algorithm for range search on a grid which has retrieval time O(l + log^(d−2) n · √α), where l is again the number of returned results, requiring O(n log^(d−1) n) storage. However, since we are not performing a general range search, but rather are interested in the two direct neighbors only (i.e., if the value in one dimension is i, we have to look at i − 1, i, i + 1), we can simply search several times, which requires 3 searches for each dimension, i.e. a total of 3^d searches. This grows by a factor of 3 for each added dimension instead of log n and allows us to use O(n) storage. Still, for large d (remember that 17 neighbors yield d = 33) this is not practical. The second caveat is that we cannot assume a fixed dimension and have to allow for missing and additional poles in the query.
4.2 Design of the Local Pattern Descriptor
While we cannot defeat the curse of dimensionality, the following observation is the key to a practical solution. As we have seen in Section 3 (Fig. 2(b)), the maximum overlap of a pole with any other pole, ni,max, is usually quite small. Therefore, it suffices to take a subset of k points of a local neighborhood in order to perform a query. This suggests the following approach:
– Database construction. For every pole pi with neighborhood Pi, select all possible combinations of k − 1 poles from Pi (for a total of k). Compute a unique descriptor D for those points, which has a (fixed) dimension of d = 2k − 3. Store the value i under the key D in the database.
– Query. For a given scene, draw a random selection of k points and retrieve the set of possible solutions. Repeat this until there is only one solution remaining (see Algorithm 1). (This could also be replaced by a voting scheme.)
The unique descriptor D of a point set with k ≥ 2 points is formed as follows. First, the diameter of the points is determined (the largest distance between any two points in the set), which can be done in O(k log k). The diameter yields the first value of the descriptor. Then, the x-axis is defined along the diameter and the y-axis perpendicular to it. The orientation is selected in such a way that when the remaining k − 2 points are expressed in local coordinates, the extension in +y is larger than in −y. Then, all the (xi, yi) values are sorted in lexicographic order and are added one after the other to the descriptor, yielding a total of 1 + 2(k − 2) = 2k − 3 values. In order to prevent a structural change of the descriptor from being induced by a small error in coordinates, building the descriptor fails if during its construction any decision is closer than ε. Using a fixed k solves the problem that varying pole neighbor counts lead to different dimensions d. Also, selecting k as small as possible leads to a small d and thus to a small number of queries, 3^d (see the next section for scalability).
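A possible implementation of this descriptor is sketched below. The placement of the local origin (midpoint of the diameter), the resolution of the +y ambiguity (a 180° rotation of the frame), and the ambiguity test based on the summed y-coordinates are assumptions of this sketch, since the text does not fix these details; the O(k²) diameter search likewise replaces the O(k log k) method mentioned above.

```python
import numpy as np
from itertools import combinations

def pole_descriptor(pts, eps=0.1):
    """Descriptor of 2k-3 values for k >= 2 points, following the construction above.
    Returns None if an orientation decision during construction is closer than eps."""
    pts = np.asarray(pts, dtype=float)
    # diameter: largest pairwise distance (brute force O(k^2) for clarity)
    (i, j), diam = max(
        ((pair, np.linalg.norm(pts[pair[0]] - pts[pair[1]]))
         for pair in combinations(range(len(pts)), 2)),
        key=lambda t: t[1])
    origin = 0.5 * (pts[i] + pts[j])                # assumed origin: diameter midpoint
    x_axis = (pts[j] - pts[i]) / diam
    y_axis = np.array([-x_axis[1], x_axis[0]])
    rest = np.delete(pts, [i, j], axis=0) - origin
    if len(rest) == 0:
        return np.array([diam])                     # k = 2: the distance alone
    local = np.stack([rest @ x_axis, rest @ y_axis], axis=1)
    y_balance = local[:, 1].sum()
    if abs(y_balance) < eps:
        return None                                 # +y / -y decision is ambiguous
    if y_balance < 0:
        local = -local                              # rotate the frame by 180 degrees
    order = np.lexsort((local[:, 1], local[:, 0]))  # lexicographic: x first, then y
    return np.concatenate(([diam], local[order].ravel()))
```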
Algorithm 1. Database query.
1: Q is a local set of poles to be searched in the database
2: S ← {1, . . . , N } [the set of all indices in the database]
3: Select a point q from Q, which acts as the ‘center pole’
4: while |S| > 1 do
5:   Randomly select k − 1 points from Q \ {q}: {q1, q2, . . . , qk−1}
6:   Compute the unique descriptor, D, for the k points {q, q1, . . . , qk−1}
7:   Query the database for the key D, which yields a set of indices S1
8:   S ← S ∩ S1 [narrow the set of possible solutions]
9: return the single element in S
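A direct transcription of Algorithm 1 might look as follows; `pole_descriptor` and `descriptor_key` refer to the sketches above, the first pole of Q is assumed to act as the center pole q, and the ±1 neighboring-cell lookups (the 3^d queries of Section 4.1) are omitted for brevity.

```python
import random

def query_database(db, Q, k=3, max_draws=20):
    """Algorithm 1: narrow down the candidate set by repeated random draws.

    db : dict mapping descriptor keys to sets of pole indices
         (built off-line from all k-subsets of every pole neighborhood).
    Q  : local set of observed poles; Q[0] is used as the 'center pole' q.
    Returns the single remaining index, or None on failure.
    """
    q, others = Q[0], list(Q[1:])
    S = None                                   # None encodes "all indices"
    for _ in range(max_draws):
        draw = random.sample(others, k - 1)
        D = pole_descriptor([q] + draw)
        if D is None:
            continue                           # ambiguous construction, redraw
        S1 = db.get(descriptor_key(D), set())  # neighboring cells omitted here
        S = set(S1) if S is None else (S & S1)
        if S is not None and len(S) == 1:
            return next(iter(S))
    return None
```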
The random draws in the query part of the algorithm also solve the problem of erroneous extra poles in the scene. For example, considering the average number of 18 poles in a pole neighborhood, if only 50% of them (9) are captured by a vehicle, with an additional 3 captured in error, and k = 4, then the probability of a good draw is still (9 choose 4)/(12 choose 4) ≈ 25%, and the expected number of draws required to get at least one correct draw is 4.
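The quoted numbers follow directly from this hypergeometric reasoning, e.g.:

```python
from math import comb

# Probability that a random draw of k = 4 poles hits only correctly captured ones:
# 9 correct poles among 12 captured (3 erroneous extras).
p_good = comb(9, 4) / comb(12, 4)   # 126 / 495 ≈ 0.2545
expected_draws = 1 / p_good         # ≈ 3.9, i.e. about 4 draws on average
```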
4.3 Scalability
It remains to be determined how k should be selected. If it is small, this keeps the database size and the number of required queries (3^d) small, and gives better chances of picking a correct pole subset when erroneous extra poles are present. However, if it is too small, the returned set of keys S1 in Algorithm 1, line 7, gets large. In fact, one would like to select k in such a way that |S1| = O(1). Otherwise, if d is too small, |S1| will be linear in n. For a concrete example, consider k = 2, so that pairs of points are in the database. If the average number of neighbors is 18, and N = 2,658, then n = 18 · 2,658 = 47,844. If ε = 0.1 m, the error in distance (which is a difference) is 0.2 m. If the 47,844 entries are distributed uniformly in the [0, 50 m] range, 383 will be in any interval of length 0.4 m (±0.2 m). Thus, in the uniformly distributed case, we would expect that a random draw of a pair yields about 400 entries in the database, and it will need several draws to reduce this to a single solution, according to Algorithm 1. We will have a closer look at the case k = 3. In this case, we would expect about N · 18 · 17/2 = 406,674 different descriptors (indeed there are 503,024). How are those descriptors distributed? Since k = 3 implies d = 3, we can plot them in 3D space. From Fig. 3, one sees that the distribution is quite uniform. There is a certain point pattern evident, especially on the ground plane, which occurs with a spacing of 6 m and can indeed be traced back to a row of alley trees in the scene, planted at 6 m spacing. Fig. 3 supports the assumption that, in contrast to indoor scenes (typically occurring in robotics), the descriptors exhibit only little regularity.
Fig. 3. All descriptors D of the scene, plotted in 3D space (N = 2,658 poles, k = 3). The coordinates (x, y, z) correspond to the descriptor’s (d, x1 , y1 ).
If the distribution is approximately uniform, the question is how large the space spanned by all possible descriptors D is. The volume of the (oddly shaped) descriptor space for k = 3 is (2√3/3 − 2π/9) r^3 ≈ 0.46 r^3. To give an estimate, it is computed how many voxels of edge length 0.4 m (ε = 0.1 m, cf. the reasoning for the case k = 2 above) fit into this space, which is 891,736. Therefore, for k = 3, if the 503,024 descriptors are ‘almost’ uniformly placed in the ‘891,736 cell’ descriptor space, one can expect that a query will lead to a single result, as desired. In order to verify this, the following experiment was carried out. After filling the database, 10 queries are performed for every pole in the database, according to Algorithm 1. A descriptor from the database was considered to match the query descriptor if all elements were within a distance of 2ε. The number of iterations (draws) required to narrow down the resulting set to a single element (while loop in line 4 of Algorithm 1) was recorded into a histogram. At most 20 iterations were allowed, i.e. the histogram entries at ‘20’ mark failures. Again, the histogram entries are sorted according to the number of neighbors. Fig. 4(a) shows the case for k = 2 and ε = 0.1 m. It can be seen that in most of the cases, 5 iterations were required, with up to 10 or even more for poles with a large number of neighbors. For poles with only a few neighbors, there is a substantial number of failures. For ε = 0.2 m, the situation gets worse (Fig. 4(b)). There are more failures, and also more iterations required in general. Of course, this is the result of a descriptor space that is too small. Moving on to k = 3, we see that most poles are found within 1 or 2 iterations (Fig. 4(c)). (Note that point triplets will vote for all three of their endpoints if all sides are ≤ r, for which reason often 3 solutions are returned for the first query and another iteration is required.) When trying ε = 0.2 m (Fig. 4(d)), there is almost no change, which means that the descriptor space is large in relation to the number of descriptors.
Fig. 4. Histograms of the number of draws required to retrieve a pole uniquely from the database. x-axis is the number of poles in the neighborhood, y-axis is the number of draws required (with 20 being failures). The area of the bubbles represents the number of cases. All experiments for r = 50 m and (a) k = 2, ε = 0.1 m, (b) k = 2, ε = 0.2 m, (c) k = 3, ε = 0.1 m, (d) k = 3, ε = 0.2 m.
Finally, to give an estimation of the order of N for different k, we use the above reasoning (for r = 50 m, ε = 0.1 m, and an 18-pole neighborhood). For k = 2, there are 18N descriptors and 50/0.4 = 125 cells, so that N = 7. Similarly, for k = 3, there are 18 · 17/2 · N descriptors and 891,736 cells, so that N = 5,828. For k = 4 and k = 5 it follows that N = 1.3·10^7 (10^10 cells) and N = 4.8·10^10 (10^14 cells). Note that although 10^10 cells (for a database size of thirteen million poles) sounds large, this is on the order of the main memory of a modern desktop computer.
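These estimates follow one simple rule — the number of descriptors contributed per pole versus the number of available cells — which the following sketch reproduces (the "about one descriptor per cell" criterion is the approximation used in the text):

```python
from math import comb

def max_database_size(num_cells, avg_neighbors=18, k=3):
    """Rough scene size N for which the descriptor space stays sparsely filled:
    each pole contributes C(avg_neighbors, k-1) descriptors, and we ask for at
    most about one descriptor per cell."""
    return num_cells / comb(avg_neighbors, k - 1)

# max_database_size(125, k=2)      ≈ 7 poles
# max_database_size(891_736, k=3)  ≈ 5,828 poles
```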
5 Conclusions
In this paper, the use of local pole patterns for global localization was investigated. First, the characteristics of local pole patterns are determined, using a large scene captured by LiDAR and assumptions on the measurement range
and accuracy. Second, a local descriptor is proposed which has a constant dimension and allows for an efficient retrieval. Third, the structure and size of the descriptor space, the retrieval performance and the scalability were analyzed. There are numerous enhancements possible. When constructing the database, not all descriptors should be required and especially, clusters in descriptor space can probably be removed (similar to stop lists). Also, additional features like planar patches or dihedral edges can (and should) be used. Finally, experiments with real vehicle sensors are required to verify the assumptions regarding range and accuracy, and larger scenes would be needed to verify scalability.
Acknowledgements This work has been supported by the VolkswagenStiftung, Germany.
References
1. Arras, K.O., Siegwart, R.Y.: Feature extraction and scene interpretation for map-based navigation and map building. In: Proc. SPIE, Mobile Robots XII, vol. 3210, pp. 42–53 (1997)
2. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
3. Fraundorfer, F., Wu, C., Frahm, J.M., Pollefeys, M.: Visual word based location recognition in 3d models using distance augmented weighting. In: Fourth International Symposium on 3D Data Processing, Visualization and Transmission (2008)
4. Wamelen, P.B.V., Li, Z., Iyengar, S.S.: A fast expected time algorithm for the 2-D point pattern matching problem. Pattern Recognition 37(8), 1699–1711 (2004)
5. Bishnu, A., Das, S., Nandy, S.C., Bhattacharya, B.B.: Simple algorithms for partial point set pattern matching under rigid motion. Pattern Recognition 39(9), 1662–1671 (2006)
6. Kremer, J., Hunter, G.: Performance of the streetmapper mobile lidar mapping system in ‘real world’ projects. In: Photogrammetric Week, Wichmann, pp. 215–225 (2007)
7. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2161–2168 (2006)
8. Fredman, M.L., Komlos, J., Szemeredi, E.: Storing a sparse table with O(1) worst case access time. Journal of the ACM 31(3), 538–544 (1984)
9. Overmars, M.H.: Efficient data structures for range searching on a grid. Technical Report RUU-CS-87-2, Department of Computer Science, University of Utrecht (1987)
Single-Frame 3D Human Pose Recovery from Multiple Views
Michael Hofmann1 and Dariu M. Gavrila2
1 TNO Defence, Security and Safety, The Netherlands
[email protected]
2 Intelligent Systems Laboratory, Faculty of Science, University of Amsterdam (NL)
[email protected]
Abstract. We present a system for the estimation of unconstrained 3D human upper body pose from multi-camera single-frame views. Pose recovery starts with a shape detection stage where candidate poses are generated based on hierarchical exemplar matching in the individual camera views. The hierarchy used in this stage is created using a hybrid clustering approach in order to efficiently deal with the large number of represented poses. In the following multi-view verification stage, poses are re-projected to the other camera views and ranked according to a multi-view matching score. A subsequent gradient-based local pose optimization stage bridges the gap between the used discrete pose exemplars and the underlying continuous parameter space. We demonstrate that the proposed clustering approach greatly outperforms state-of-the-art bottom-up clustering in parameter space and present a detailed experimental evaluation of the complete system on a large data set.
1 Introduction
The recovery of 3D human pose is an important problem in computer vision with many potential applications in animation, motion analysis and surveillance, and also provides view-invariant features for a subsequent activity recognition step. Despite the considerable advances that have been made over the past years (see next section), the problem of 3D human pose recovery remains essentially unsolved. This paper presents a multi-camera system for the estimation of 3D human upper body pose in single frames of cluttered scenes with non-stationary backgrounds. See Figure 1. Using input from three calibrated cameras we are able to infer the most likely poses in a multi-view approach, starting with shape detection for each camera followed by fusing information between cameras at the pose parameter level. The computational burden is shifted as much as possible to an off-line stage – as a result of a hierarchical representation and matching scheme, algorithmic complexity is sub-linear in the number of body poses considered. The proposed system also has some limitations: Like previous 3D pose recovery systems, it currently cannot handle a sizable amount of external occlusion. It furthermore assumes the existence of a 3D human model that roughly fits the person in the scene.
Fig. 1. System overview. For details, please refer to the text, Section 3.1.
2 Previous Work
As one of the most active fields in computer vision, there is by now extensive literature on 3D human pose estimation. Due to space limitations we have to make a selection of what we consider most relevant. In particular, work that deals with 3D model-based tracking, as opposed to pose initialization, falls outside the scope of this paper; see recent surveys [1,2] for an overview of the topic. Work regarding 3D pose initialization can be distinguished by the number of cameras used. Multi-camera systems have so far been applied in controlled indoor environments. The near-perfect foreground segmentation resulting from the “blue-screen” type background, together with the many cameras used (> 5), allows pose to be recovered by Shape-from-Silhouette techniques [3,4,5]. Single-camera approaches for 3D pose initialization can be sub-divided into generative and learning-based techniques. Learning-based approaches [6,7,8,9] are fast and conceptually appealing, but questions still remain regarding their scalability to arbitrary poses, given the ill-conditioning and high dimensionality of the problem (most experimental results involve restricted movements, e.g. walking). On the other hand, pose initialization using 3D generative models [10,11] involves finding the best match between model projections and the image, and retrieving the associated 3D pose. Pose initialization using 2D generative models [12,13] involves a 2D pose recovery step followed by a 3D inference step with respect to the joint locations. In order to reduce the combinatorial complexity, previous generative approaches apply part-based decomposition techniques [14]. This typically involves searching first for the torso, then arms and legs [12,15,13]. This decomposition approach is error-prone in the sense that estimation mistakes made early on, based on partial model knowledge, cannot be corrected at a later stage. In this paper we demonstrate the feasibility of a hierarchical exemplar-based approach to single-frame 3D human pose recovery in an unconstrained setting (i.e. not restricted to specific motions, such as walking). Unlike [16], we do not cluster our exemplars directly in parameter space but use a shape similarity measure for both clustering and matching. Because bottom-up clustering does not scale with the number of poses represented in our system, we propose a hybrid approach that judiciously combines bottom-up and top-down clustering. We add a gradient-based local pose optimization
step to our framework in order to overcome the limitations of having generated the candidate poses from a discrete set. An experimental performance evaluation is presented on a large number of frames. Overall, we demonstrate that the daunting combinatorics of matching whole upper-body exemplars can be overcome effectively by hierarchical representations, pruning strategies and use of clustering techniques. While in this paper we focus on single-frame pose recovery in detail, its integration with tracking and appearance model adaptation is discussed in [17].
3 Single-Frame 3D Pose Estimation
3.1 Overview
Figure 1 presents an overview of the proposed system. Pre-processing determines a region of interest based on foreground segmentation (Section 3.3). Pose hypotheses are generated based on hierarchical shape matching of exemplars in the individual camera views (Section 3.4) and then verified by reprojecting the shape model into all camera views (Section 3.5). This is implemented in two stages for efficiency reasons: In the 2D-based verification stage, the reprojection is done by mapping the discrete exemplars to the other camera views, while in the subsequent 3D-based verification stage, the poses are rendered on-line and therefore modeled with higher precision. As a last step, a gradient-based local pose optimization is applied to part of the pose hypotheses (Section 3.6). The final output is a list of pose hypotheses for each single frame, ranked according to their multi-view likelihood.
3.2 Shape Model
Our 3D upper body model uses superquadrics as body part primitives, yielding a good trade-off between desired accuracy and model complexity [18]. Joint articulation is represented using homogeneous coordinate transformations x′ = Hx, H = (R(φ, θ, ψ), T), where R is a 3 × 3 rotation matrix determined by the Euler angles φ, θ, ψ, and T a constant 3 × 1 translation vector. We represent a pose as a 13-dimensional vector
π = (πtorso(φ, θ, ψ), πhead(φ, ψ), πl.shoulder(φ, θ, ψ), πl.elbow(θ), πr.sh.(φ, θ, ψ), πr.elb.(θ))   (1)
3.3 Pre-processing
The aim of pre-processing is to obtain a rough region of interest, both in terms of each individual 2D camera view and in terms of the 3D space. For this, we apply background subtraction [19] to each camera view and fuse the computed masks by means of volume carving [20]. In the considered environment with dynamic background and a limited number of cameras (3) we do not expect to obtain well-segmented human silhouettes of a quality suitable for solving pose recovery by SfS techniques [3,4,5]. However, approximate 3D positions of people in the scene can be estimated by extracting voxel blobs of a minimum size; this also yields information about the image scales to be used in the forthcoming single-view detection step (Section 3.4). Edge segmentation in the foreground regions then provides the features used in the subsequent steps.
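For illustration, the 13 pose parameters of Eq. (1) in Section 3.2 and the joint transform H = (R, T) can be laid out as follows; the parameter ordering and the 4×4 homogeneous form are assumptions of this sketch, not the paper's actual data structures.

```python
import numpy as np

# Illustrative encoding of the 13-dimensional upper-body pose of Eq. (1);
# angle names follow the text, the ordering is an assumption of this sketch.
POSE_DOF = [
    ("torso", ("phi", "theta", "psi")),
    ("head", ("phi", "psi")),
    ("l_shoulder", ("phi", "theta", "psi")),
    ("l_elbow", ("theta",)),
    ("r_shoulder", ("phi", "theta", "psi")),
    ("r_elbow", ("theta",)),
]  # 3 + 2 + 3 + 1 + 3 + 1 = 13 parameters

def joint_transform(R, T):
    """Homogeneous transform H = (R, T): x' = H x for points in homogeneous coordinates."""
    H = np.eye(4)
    H[:3, :3] = R   # 3x3 rotation determined by the Euler angles (phi, theta, psi)
    H[:3, 3] = T    # constant 3x1 translation to the joint location
    return H
```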
3.4 Single-View Shape Detection
Shape hierarchy construction. We follow an exemplar-based approach to 3D pose recovery, matching a scene image with a pre-generated silhouette library with known 3D articulation. To obtain the silhouette library, we first define the set of upper body poses by specifying lower and upper bounds for each joint angle and discretizing each angle into a number of states with an average delta of about 22◦. The Cartesian product contains anatomically impossible poses; these are filtered out by collision detection on the model primitives and through rule-based heuristics, more specifically a set of linear inequalities on the four arm angles. The remaining set P of about 15 × 10^6 “allowable” poses serves as input for the silhouette library, for which the exemplars are rendered using the 3D shape model (Section 3.2) assuming orthographic projection and, following [21,22,16], organized in a (4-level) template tree hierarchy, see Figure 2. We use a shape similarity measure (see sub-section below) for clustering as well as for matching, as opposed to clustering directly in angle space [16]. This has the advantage that similar projections (e.g. front/back views) can be compactly grouped together even if they are distant in angle space. However, bottom-up clustering does not scale with the number of allowable poses used here: On-line evaluation of our similarity measure would be prohibitively expensive; furthermore, computing the full dissimilarity matrix (approx. 2.3 × 10^14 entries) off-line is not possible either due to memory constraints. We therefore propose a hybrid clustering approach; see Figure 2 for an illustration of this process. We first set the exemplars of our third tree level by discretizing the allowable poses more coarsely, such that bottom-up clustering similar to [21] for creating the second and first hierarchy levels is still feasible. Then, we compute a mapping for each pose π ∈ P to the 3rd-level exemplar with the best shape similarity. Each 3rd-level exemplar will thus be associated with a subset P3i of P, where i is the exemplar index, such that ∪i P3i ≡ P. The 4th level is then created by clustering the elements of each assigned subset P3i and selecting prototypes in a number proportional to the number of elements in the subset. Each pose in P is thus mapped to a 4th-level exemplar, i.e. each 4th-level exemplar is associated with a subset P4i of P such that ∪i P4i ≡ P. The need for a 4th tree level for an increase in matching accuracy was indicated by preliminary experiments. In the hierarchy used in our experiments we have approximately 200, 2,000, 20,000 and 150,000 exemplars at the respective levels.
Hierarchical shape matching. On-line matching is implemented by a hierarchy traversal for each camera; search is discontinued below nodes where the match is below a certain (level-specific) threshold. Instead of using silhouette exemplars of different scales, we rescale the scene image using information from the preprocessing step (Section 3.3). After matching, the exemplars s ∈ S that pass the leaf-level threshold are ranked according to a single-view likelihood
p(Oc|s) ∝ p(Dc(s, ec))   (2)
where Oc is the observation for camera c and Dc (s, e) the undirected Chamfer distance [23] between the exemplar s and the scene edge image ec of camera c. We select the Kc best ranked matches for view c (Kc = 150 in our experiments, for all c) and expand the previously grouped poses from each silhouette exemplar as input for the next step. (On average, about 15,800 poses are expanded per camera in our experiments.)
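The Chamfer-based likelihood of Eq. (2) is typically evaluated via a distance transform of the edge image; the sketch below shows the exemplar-to-image direction only, whereas the undirected measure of [23] also averages the reverse direction.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(exemplar_pts, edge_image):
    """Directed Chamfer score of a placed silhouette exemplar against a scene edge map.

    exemplar_pts: (M, 2) integer (row, col) contour points of the exemplar,
                  already shifted to the tested image location.
    edge_image:   binary scene edge map e_c.
    """
    # distance to the nearest edge pixel, for every image location
    dt = distance_transform_edt(edge_image == 0)
    r, c = exemplar_pts[:, 0], exemplar_pts[:, 1]
    return dt[r, c].mean()   # lower means a better match
```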
Fig. 2. Schematized structure of the 4-level shape exemplar hierarchy (Section 3.4)
Fig. 3. Correction angle ϕ when transferring poses from orthographic to perspective projection (Section 3.5)
3.5 Multi-view Pose Verification
Given a set of expanded poses from the single-view shape detection step (Section 3.4), we verify all poses by reprojecting them into the other cameras and computing a multi-view likelihood. For efficiency reasons, this is implemented in a two-step approach. In a first step (“2D-based pose verification”), we map a pose extracted from one camera to the corresponding exemplars of the other cameras and match these exemplars onto their respective images. Due to the used orthographic projection, the mapping from a pose as observed in camera ci to the corresponding pose in camera cj is done by modifying the torso rotation angle ψtorso relative to the projected angle between cameras ci and cj on the ground plane. To account for the error made by the orthographic projection assumption, we add a correction angle ϕ as illustrated in Figure 3. The mapping from a (re-discretized) pose to an exemplar is then easily retrieved from a look-up table. The corresponding multi-view likelihood given a pose π is modeled as
p(O|π) ∝ p( Σc∈C Dc(sc, ec) )   (3)
where O is the set of observations over all cameras C, sc the exemplar corresponding to the pose π, and ec the scene edge image of camera c. For each pose, we also need to obtain a 3D position in the world coordinate system from the 2D location of the match on the image plane. We therefore backproject this location at various depths corresponding to the epipolar line in the other cameras in regions with foreground support and match the corresponding exemplars at these locations. For each pose π, the 2D location with the highest likelihood per camera is kept; triangulation then yields a 3D position x in the world coordinate system, with inconsistent triangulations being discarded. We obtain a ranked list of candidate 3D poses {π, x} of which the best L (L = 2000 in our experiments) are evaluated further.
In the second step (“3D-based pose verification”), the candidate 3D poses are rendered on-line, assuming perspective projection, and ranked according to a respective multi-view likelihood
p(O|π, x) ∝ p( Σc∈C Dc(rc, ec) )   (4)
where rc is the image of the shape model silhouette in camera c. This is a very costly step in the evaluation cascade due to the rendering across multiple camera views, but provides the most accurate likelihood evaluation because poses are no longer approximated by a subset of shapes, and due to the assumption of perspective projection. As a result, we obtain a ranked list of pose hypotheses, of which the best M (M = 30 in our experiments) enter the following processing step and the others remain unchanged.
3.6 Gradient-Based Local Pose Optimization
So far we have evaluated likelihoods given only poses π from a discrete set of poses P (Section 3.4). We can overcome this limitation by assuming that the likelihood described in Equation 4 is a locally smooth function on a neighborhood of π and x in state space and performing a local optimization of the parameters of each pose using the gradient ∇p(O|π, x). For a reasonable trade-off between optimization performance and evaluation efficiency, we decompose the parameter space during this step and optimize first over the world coordinate position x, followed by optimizations over πtorso, πhead, πl.shoulder, πr.shoulder, πl.elbow and πr.elbow respectively, evaluating the gradient once for each sub-step and moving in its direction until the likelihood value reaches a local maximum. Because the objective function used relies on rendering and therefore produces output on a fixed pixel grid, the gradients are approximated by suitable central differences, e.g. (p(O|π + ε/2) − p(O|π − ε/2))/ε, with the step size ε chosen according to the input image resolution.
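A minimal sketch of this block-wise local optimization is given below; the block decomposition order follows the text, but the step size, stopping rule and normalization are illustrative assumptions, and `likelihood` stands for the rendering-based evaluation of Eq. (4).

```python
import numpy as np

def local_pose_optimization(likelihood, pose, blocks, eps=0.05, max_steps=20):
    """Coordinate-block gradient ascent, as in Section 3.6 (a sketch).

    likelihood: function mapping a full parameter vector to p(O | pose, x).
    blocks:     list of index lists, e.g. [position, torso, head, shoulders, elbows],
                optimized one after the other.
    Gradients are approximated by central differences with step eps.
    """
    pose = np.asarray(pose, dtype=float).copy()
    for idx in blocks:
        grad = np.zeros_like(pose)
        for i in idx:
            d = np.zeros_like(pose)
            d[i] = 0.5 * eps
            grad[i] = (likelihood(pose + d) - likelihood(pose - d)) / eps
        if not np.any(grad):
            continue
        step = grad / np.linalg.norm(grad)
        best = likelihood(pose)
        for _ in range(max_steps):        # move along the gradient until no gain
            cand = pose + eps * step
            val = likelihood(cand)
            if val <= best:
                break
            pose, best = cand, val
    return pose
```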
4 Experiments
Our experimental data consists of 2038 frames from recordings of three overlapping and synchronized color CCD cameras looking over a train station platform, with various actors performing unscripted movements, such as walking, gesticulation and waving. The same generic shape model (Section 3.2) is used for all actors in the scene. We model the likelihood distributions (Equations 2, 3, 4) as exponential distributions, computed using maximum likelihood. Cameras were calibrated [24]; this enabled the recovery of the ground plane. Ground truth poses were manually labeled for all frames of the data set; we estimate their accuracy to be within 3 cm, considering the quality of calibration and labeling. We define the average pose error between two poses as
dx(π1, π2) = (1/|B|) Σi∈B de(v1^i, v2^i)   (5)
where B is a set of locations on the human upper body, |B| the number of locations, v i is the 3D position of the respective location in a fixed Euclidean coordinate system,
Fig. 4. (a) Cumulative number of correct pose hypotheses wrt. the number of selected shape exemplars (avg. over all frames and cameras). (b) Ratio of the number of correct pose hypotheses between both hierarchies (avg. over all frames and cameras).
and de (.) is the Euclidean distance. For the set of locations, we choose torso and head center as well as shoulder, elbow and wrist joint location for each arm. We regard a pose hypothesis as “correct”, if the average pose error to the ground truth is less than 10cm. We first compare our hybrid hierarchy clustering approach as described in Section 3.4 with a state-of-the-art clustering approach proposed in [16] in the context of hand tracking. There, clustering is performed directly in parameter space using a hierarchical k-means algorithm. We constructed an equivalent alternative hierarchy (“angleclustered hierarchy”) with the same number of exemplars on each level.To ensure a fair comparison, we evaluate the single-view shape detection step (Section 3.4) with the same tree-level specific thresholds for both shape hierarchies. Figure 4(a) shows the cumulative number of correct pose hypotheses in relation to the number of selected shape exemplars after single-view shape detection.Using our proposed hierarchy, we obtain about one order of magnitude more correct poses compared to the hierarchy clustered in parameter space. Figure 4(b) shows that the ratio of the number of correct poses between both hierarchies saturates at about 12. We additionally plot the ratio of the number of correct solutions, normalized by the number of extracted poses to take out the influence of a variable number of shape exemplars matched. Still, the proposed hierarchy generates about 9.5 times more correct hypotheses; we therefore continue all following experiments using this hierarchy. The considerably worse performance of the angle-clustered hierarchy is explained by the fact that equal distance in angle space does not imply equal shape (dis)similarity. In particular, the represented joint angles are part of an articulated model – for example, small changes of the torso rotation angle ψtorso will have a large effect on the projected silhouette if one or both arms are extended. Figure 5 shows a few example frames from our data set, together with recovered poses after executing the steps described in Sections 3.4 to 3.6. A quantitative analysis over all images in our data set (Figure 6(a)) shows the successive benefit of repeated likelihood evaluations and pruning of hypotheses in our cascaded framework: The average pose error of the best solution (among the K top-ranked) decreases after each
Fig. 5. Example result images (all three camera views shown each). Top row: Top-ranked pose hypothesis. Bottom row: Best pose hypothesis out of 20 best-ranked.
Fig. 6. (a) Average pose error of the best solution among the K best-ranked (K on x-axis), average over all frames of the data set. (b) Average pose error of the best solution among the K best-ranked (K ∈ {1, 10, 100}) for 150 frames of a sequence.
verification/optimization step. To obtain an average error of 10cm we need to disambiguate among the best 20 ranked pose hypotheses on average, while for 50 hypotheses, the average error decreases to approximately 8cm. Figure 6(b) provides a closer look at the average pose error for each frame in a sequence of 150 images. Between frames 10 and 80 the top-ranked pose hypothesis gives an acceptable estimate of the actual (ground truth) pose; considering more pose hypotheses for disambiguation can provide yet better accuracy. However, the spikes between frames 1-10 and 80-125 also show that our purely shape-based single-frame pose recovery does not succeed in all cases – our system still has some difficulties with more “ambiguous” poses, e.g. with hands close to the torso (see e.g. Figure 5, 2nd column), or when the silhouette does not convey sufficient information about front/back orientation of the person (see e.g. Figure 5, 3rd column). Many of these cases can be disambiguated by incorporating additional knowledge such as temporal information or enriching the likelihood function by learning an appearance model in addition to shape. Both approaches lead toward tracking and are thus out of scope for this paper, but are discussed e.g. in [17]. Figure 7 shows a plot of the average pose error before and after local pose optimization (Section 3.6), evaluated on 10 images from our data set. 1280 test input poses have been created by random perturbations π GT + N (0, Σ) of the ground truth pose πGT , with varying covariances Σ. Being a local optimization step, we expect the convergence
Fig. 7. Left: Plot of the average pose error in cm (Equation 5) before and after gradient-based local pose optimization (Section 3.6). Right: Example of local pose optimization, before (top row, avg. error 8.7cm) and after (bottom row, avg. error 5.9cm).
area to be close to the true solution; indeed, we can see that it is quite effective up to an input pose error of about 10cm. In addition to improving our overall experimental results (see Figure 6(a)), we expect that this transitioning from a discrete to a continuous pose space can also prove useful when evaluating motion likelihoods between poses in a temporal context that have been trained on real, i.e. undiscretized movement data. Our current system requires about 45-60s per frame (image triplet) to recover the list of pose hypotheses, running with unoptimized C++ code on a 2.6 GHz Intel PC. Currently the steps involving on-line rendering (Sections 3.5 and 3.6) and, to a lesser degree, single-view shape detection (Section 3.4) are our performance bottleneck. These components can be easily parallelized, allowing a near-linear reduction of processing speed with available processing cores.
5 Conclusion and Further Work
We proposed a system for 3D human upper body pose estimation from multiple cameras. The system combines single-view hierarchical shape detection with a cascaded multi-view verification stage and gradient-based local pose optimization. The exemplar hierarchy is created using a novel hybrid clustering approach based on shape similarity, and we demonstrated that it significantly outperforms a parameter-space clustered hierarchy in pose retrieval experiments. Future work involves extension to whole-body pose recovery, which would be rather memory intensive if implemented directly. A more suitable solution, better able to deal with partial occlusion, is to recover upper and lower body pose separately and integrate the results. Another area of future work involves extending the estimation to the shape model in addition to the pose.
References
1. Forsyth, D., et al.: Computational studies of human motion. Found. Trends. Comput. Graph. Vis. 1(2-3), 77–254 (2005)
2. Moeslund, T.B., et al.: A survey of advances in vision-based human motion capture and analysis. CVIU 103(2-3), 90–126 (2006)
3. Cheung, K.M.G., et al.: Shape-from-silhouette across time - parts I and II. IJCV 62 and 63(3), 221–247 and 225–245 (2005)
4. Mikic, I., et al.: Human body model acquisition and tracking using voxel data. IJCV 53(3), 199–223 (2003)
5. Starck, J., Hilton, A.: Model-based multiple view reconstruction of people. In: ICCV, pp. 915–922 (2003)
6. Agarwal, A., Triggs, B.: Recovering 3D human pose from monoc. images. TPAMI 28(1), 44–58 (2006)
7. Bissacco, A., et al.: Fast human pose estimation using appearance and motion via multidimensional boosting regression. In: CVPR (2007)
8. Kanaujia, A., et al.: Semi-supervised hierarchical models for 3d human pose reconstruction. In: CVPR (2007)
9. Shakhnarovich, G., et al.: Fast pose estimation with parameter-sensitive hashing. In: ICCV, pp. 750–757 (2003)
10. Kohli, P., et al.: Simultaneous segmentation and pose estimation of humans using dynamic graph cuts. IJCV 79, 285–298 (2008)
11. Lee, M.W., Cohen, I.: A model-based approach for estimating human 3D poses in static images. TPAMI 28(6), 905–916 (2006)
12. Mori, G., Malik, J.: Recovering 3D human body configurations using shape contexts. TPAMI 28(7), 1052–1062 (2006)
13. Ramanan, D., et al.: Tracking people by learning their appearance. TPAMI 29(1), 65–81 (2007)
14. Sigal, L., et al.: Tracking loose-limbed people. In: CVPR (2004)
15. Navaratnam, R., et al.: Hierarchical part-based human body pose estimation. In: BMVC (2005)
16. Stenger, B., et al.: Model-based hand tracking using a hierarchical Bayesian filter. TPAMI 28(9), 1372–1384 (2006)
17. Hofmann, M., Gavrila, D.M.: Multi-view 3d human pose estimation combining single-frame recovery, temporal integration and model adaptation. In: CVPR (2009)
18. Gavrila, D.M., Davis, L.: 3-D model-based tracking of humans in action: a multi-view approach. In: CVPR (1996)
19. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: ICPR (2), pp. 28–31 (2004)
20. Laurentini, A.: The visual hull concept for silhouette-based image understanding. TPAMI 16(2), 150–162 (1994)
21. Gavrila, D.M., Philomin, V.: Real-time object detection for “smart” vehicles. In: ICCV, pp. 87–93 (1999)
22. Rogez, G., et al.: Randomized trees for human pose detection. In: CVPR (2008)
23. Athitsos, V., Sclaroff, S.: Estimating 3D hand pose from a cluttered image. In: CVPR, pp. II.432–II.439 (2003)
24. Bouguet, J.Y.: Camera calib. toolbox for matlab (2003)
Dense Stereo-Based ROI Generation for Pedestrian Detection
C.G. Keller1, D.F. Llorca2, and D.M. Gavrila3,4
1 Image & Pattern Analysis Group, Department of Math. and Computer Science, Univ. of Heidelberg, Germany
2 Department of Electronics, Univ. of Alcalá, Alcalá de Henares (Madrid), Spain
3 Environment Perception, Group Research, Daimler AG, Ulm, Germany
4 Intelligent Systems Lab, Fac. of Science, Univ. of Amsterdam, The Netherlands
{uni-heidelberg.keller,dariu.gavrila}@daimler.com, [email protected]
Abstract. This paper investigates the benefit of dense stereo for the ROI generation stage of a pedestrian detection system. Dense disparity maps allow an accurate estimation of the camera height, pitch angle and vertical road profile, which in turn enables a more precise specification of the areas on the ground where pedestrians are to be expected. An experimental comparison between sparse and dense stereo approaches is carried out on image data captured in complex urban environments (i.e. undulating roads, speed bumps). The ROI generation stage, based on dense stereo and specific camera and road parameter estimation, results in a detection performance improvement by a factor of five over the state-of-the-art based on ROI generation by sparse stereo. Interestingly, the added processing cost of computing dense disparity maps is at least partially amortized by the fewer ROIs that need to be processed at the system level.
1 Introduction
Vision-based pedestrian detection is a key problem in the domain of intelligent vehicles (IV). Large variations in human pose and clothing, as well as varying backgrounds and environmental conditions, make this problem particularly challenging. The first stage in most systems consists of identifying generic obstacles as regions of interest (ROIs) using a computationally efficient method. Subsequently, a more expensive pattern classification step is applied. Previous IV applications have typically used sparse, feature-based stereo approaches (e.g. [1,9]) because of lower processing cost. However, with recent hardware advances, real-time dense stereo has become feasible [12] (here we use a hardware implementation of the semi-global matching (SGM) algorithm [7]). Both sparse and dense stereo approaches have proved suitable for dynamically estimating camera height and pitch angle, in order to deal with road imperfections, speed bumps, car accelerations, etc. Dense stereo, furthermore, holds the potential to also reliably estimate the vertical road profile (which feature-based stereo, due to its sparseness, does not). The more accurate estimation of the ground location of pedestrians can be expected to improve system performance, especially when considering undulating, hilly roads. The aim of this paper thus is to investigate the advantages of dense vs. sparse disparity maps when detecting generic obstacles in the early stage of a pedestrian
Fig. 1. Overview of the dense stereo-based ROI generation system comprising dense stereo computation, pitch estimation, corridor computation, B-Spline road profile modeling and multiplexed depth maps scanning with windows related to minimum and maximum extents of pedestrians
We are interested in both the ROC performance (trade-off between correct and false detections) and in the processing cost.
2 Related Work
Many interesting approaches for pedestrian detection have been proposed. See [4] for a recent survey and a novel publicly available benchmark set. Most work has proceeded with a learning-based approach by-passing a pose recovery step and describing human appearance directly in terms of low-level features from a region of interest (ROI). In this paper, we concentrate on the stereo-based ROI generation stage. The simplest technique to obtain object location hypotheses is the sliding window technique, where detector windows at various scales and locations are shifted over the image. This approach in combination with powerful classifiers (e.g. [3,13,16]) is currently computationally too expensive for real-time applications. Significant speed-ups can be obtained by including application-specific constraints such as flat-world assumption, ground-plane based objects and common geometry of pedestrians, e.g. object height or aspect ratio [9,17]. Besides monocular techniques (e.g. [5]), which are out of scope in this work, stereo vision is an effective approach for obtaining ROIs. In [20] a foreground region is obtained by clustering in the disparity space. In [2,10] ROIs are selected considering the x- and y-projections of the disparity space following the v-disparity representation [11]. In [1] object hypotheses are obtained by using a subtractive clustering in the 3D space in world coordinates. Either monocular or stereo, most approaches are carried out under the assumption of a planar road and no camera height and camera pitch angle variations. In recent literature on intelligent vehicles many interesting approaches have been proposed to perform road modeling and to estimate camera pitch angle and camera height. Linear fitting in the v-disparity [14], in world coordinates [6] and in the so-called virtual-disparity image [18] has been proposed to estimate the camera pitch angle and the camera height. In [11] the road surface is modeled by the fitting of the envelope of piecewise linear functions in the v-disparity space. Other approaches are performed by fitting of a quadratic polynomial [15] or a clothoid function [14] in the v-disparity space as well. Building upon this work, we propose the use of dense stereo vision for ROI generation in the context of pedestrian detection. Dense disparity maps are provided in real-time [7]. Firstly, camera pitch angle is estimated by determining the slope with highest probability in the v-disparity map, for a reduced distance
range. Secondly, a corridor of a predefined width is computed using the vehicle velocity and the yaw rate. Only points that belong to that corridor will be used for subsequent road surface modeling. Then, the ground surface is represented as a parametric B-Spline surface and tracked by using a Kalman filter [19]. Reliability on the road profile estimation is an important issue which has to be considered for real implementations. ROIs are finally obtained by analyzing the multiplexed depth maps as in [9] (see Figure 1).
3 Dense Stereo-Based ROI Generation
3.1 Modeling of Non-planar Road Surface
Feature-based stereo vision systems typically provide depth measurements at points with sufficient image structure, whereas dense stereo algorithms estimate disparities at all pixels, including untextured regions, by interpolation. Before computing the road profile, the camera pitch angle is estimated by using the v-disparity space. We assume that the camera is installed such that the roll angle is insignificant. Then, the disparity of a planar road surface (this assumption can be accepted in the vehicle vicinity) can be calculated by:

d(v) = a · v + b    (1)
where v is the image row and a, b are the slope and the offset, which depend on the camera height and tilt angle respectively. Both parameters can be estimated using a robust estimator. However, if we assume a fixed camera height, we can compute a slope histogram and determine the slope with the highest probability, obtaining a first estimate of the camera pitch angle. In order to put only good candidates into the histogram, a disparity range is calculated for each image row, depending on the tolerance of the camera height and tilt angle. The next step consists of computing a corridor of a pre-defined width using the vehicle velocity, the yaw rate, the camera height and the camera tilt angle. If the vehicle is stopped, a fixed corridor is used. In this way, a considerable amount of object points is not taken into account when modeling the road surface. This is particularly important when the vehicle is taking a curve, since most of the points in front of the vehicle correspond to object points. The road profile is represented as a parametric B-Spline surface as in [19]. B-Splines are a basis for the vector space of piecewise polynomials with degree d. The basis functions are defined on a knot vector c using equidistant knots within the observed distance interval. A simple B-Spline least squares fit tries to approximate the 3D measurements optimally. However, a more robust estimation over time is achieved by integrating the B-Spline parameter vector c, the camera
Fig. 2. Road surface modeling. Distances grid and their corresponding height values along with camera height and tilt angle.
Fig. 3. Wrong road profile estimation when a vertical object appears in the corridor for a consecutive number of frames. The cumulative variance for the bin in which the vertical object is located increases and the object points are eventually passed to the Kalman filter.
pitch angle α and the camera height H into a Kalman filter. Finally, the filter state vector is converted into a grid of distances and their corresponding road height values, as depicted in Figure 2. The number of bins of the grid is determined by the B-Spline sampling.
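As an illustration of the slope-histogram idea described above, the following Python sketch votes for the slope a of the road model d(v) = a · v + b using robust row-wise disparity statistics; the use of a per-row median and finite differences is an illustrative simplification, not the authors' implementation.

```python
import numpy as np

def dominant_road_slope(disparity, bins=200):
    """Illustrative slope-histogram estimate for the road model d(v) = a*v + b.

    disparity: dense disparity map (H x W); invalid pixels are assumed <= 0.
    Slope candidates are obtained as finite differences of the median row
    disparities in the lower image half; the histogram mode is returned.
    """
    H, _ = disparity.shape
    row_med = []
    for v in range(H // 2, H):                    # road region: lower image half
        valid = disparity[v][disparity[v] > 0]
        row_med.append(np.median(valid) if valid.size else np.nan)
    slopes = np.diff(np.asarray(row_med))         # a ~ d(v+1) - d(v)
    slopes = slopes[np.isfinite(slopes)]
    if slopes.size == 0:
        return 0.0
    hist, edges = np.histogram(slopes, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])        # slope with the highest vote
```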
3.2 Outlier Removal
In general, the method of [19] works well if the measurements provided to the Kalman filter correspond to actual road points. The computation of the corridor removes a considerable amount of object points. However, there are a few cases in which the B-Spline road modeling still leads to bad results. These cases are mainly caused by vertical objects (cars, motorbikes, pedestrians, cyclists, etc.) in the vicinity of the vehicle. Reflections in the windshield can cause additional correlation errors in the stereo image. If we include these points, the B-Spline fitting yields a solution which climbs or wraps over the vertical objects. In order to avoid this problem, the variance of the road profile for each bin, σi², is computed. Thus, if the measurements for a specific bin are out of the bounds defined by the predicted height and the cumulative variance, they are not added to the filter. Although this alternative can deal with spurious errors, if the situation remains for a consecutive number of iterations (e.g., when there is a vehicle stopped in front of the host vehicle), the variance increases due to the unavailability of measurements, and the points pertaining to the vertical object are eventually passed to the filter as measurements. This situation is depicted in Figure 3. Accordingly, a mechanism is needed in order to ensure that points corresponding to vertical objects are never passed to the filter. We compute the variance of all measurements for a specific bin and compare it with the expected variance at the given distance. The latter can be computed by using the associated standard deviations σm via error propagation from stereo triangulation [15,19].
Fig. 4. Rejected measurements for bin i at distance Zi: the measurement variance σi² is greater than the expected variance σei² in that bin
Fig. 5. Accepted measurements for bins i and i + 1 at distances Zi and Zi+1: the measurement variances σi² and σi+1² are lower than the expected variances σei² and σei+1² in these bins
If the computed variance σi² is greater than the expected one σei², we do not rely on the measurements but on the prediction for that bin. This is useful for cases in which there is a vertical object like the one in the example depicted in Figure 4. However, in cases in which the rear part of the vertical object produces 3D information for two consecutive bins, this approach may fail depending on the distance to the vertical object. For example, in Figure 5 the rear part of the vehicle yields 3D measurements in two consecutive bins Zi and Zi+1 whose variance is lower than the expected one for those bins. In this case, measurements will be added to the filter, which will yield unpredictable results. We therefore define a fixed region of interest in which we restrict measurements to lie. To that effect, we quantify the maximum road height changes at different distances and fit a second order polynomial, see Figure 6. The fixed region can be seen as a compromise between filter stability and response to sharp road profile changes (undulating roads). Apart from this region of interest, we maintain the aforementioned test on the variance, to see whether measurements corresponding to a particular grid bin are added to the filter or not.
Fig. 6. Second order polynomial function used to accept/reject measurements at all distances
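The per-bin gating just described can be summarized in a short sketch; the interface below (bin heights, expected variance from error propagation, Kalman prediction, fixed region bound) is hypothetical and only illustrates the decision logic.

```python
import numpy as np

def accept_bin_measurements(heights, expected_var, predicted_height, max_dev):
    """Sketch of the per-bin outlier gating (all names are illustrative).

    heights:          road-height measurements falling into one distance bin
    expected_var:     variance expected from stereo triangulation error propagation
    predicted_height: Kalman filter prediction of the road height for this bin
    max_dev:          bound of the fixed region of interest (from the polynomial)
    Returns the measurements to pass to the filter (possibly empty).
    """
    heights = np.asarray(heights, dtype=float)
    # restrict measurements to the fixed region of interest around the prediction
    in_region = heights[np.abs(heights - predicted_height) <= max_dev]
    if in_region.size < 2:
        return np.empty(0)
    # reject the whole bin if the measured spread exceeds the expected one
    if np.var(in_region) > expected_var:
        return np.empty(0)
    return in_region
```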
3.3 System Integration
Initial ROIs Ri are generated using a sliding-window technique where detector windows at various scales and locations are shifted over the depth map. In previous work [9], a flat-world assumption along with known camera geometry was used, so that the search space was drastically restricted. Pitch variations were handled by relaxing the scene constraints [9], e.g., including camera pitch and camera height tolerances. However, thanks to the use of dense stereo, a reliable estimate of the vertical road profile is computed along with the camera height and pitch angle. In order to easily adapt the subsequent detection modules, we compute new camera heights Hi and pitch angles αi for all bins of the road profile grid. After that, standard equations for projecting 3D points into the image plane can be used.
First of all, dense depth maps are filtered as follows: points Pr = (Xr, Yr, Zr) under the actual road profile, i.e., Zi < Zr < Zi+1 and Yr < hi, and over the actual road profile plus the maximum pedestrian size, i.e., Zi < Zr < Zi+1 and Yr > hi + Hmax, are removed since they do not correspond to obstacles (possible pedestrians). The resulting filtered depth map is multiplexed into N discrete depth ranges, which are subsequently scanned with windows related to the minimum and maximum extent of pedestrians. Possible window locations (ROIs) are defined according to the road profile grid (we assume the pedestrian stands on the ground). Each pedestrian candidate region Ri is represented in terms of the number of depth features DFi. A threshold θR governs the number of ROIs which are committed to the subsequent module. Only ROIs with DFi > θR trigger the evaluation of the next cascade module. Others are rejected immediately.
Pedestrian recognition proceeds with shape-based detection, involving coarse-to-fine matching of an exemplar-based shape hierarchy to the image data at hand [9]. Positional initialization is given by the output ROIs of the dense stereo-based ROI generation stage. The shape hierarchy is constructed off-line in an automatic fashion from manually annotated shape labels. On-line matching involves traversing the shape hierarchy with the Chamfer distance between a shape template and an image sub-window as a smooth and robust similarity measure. Image locations where the similarity between shape and image is above a user-specified threshold are considered detections. A single distance threshold applies for each level of the hierarchy. Additional parameters govern the edge density on which the underlying distance map is based.
Detections of the shape matching step are verified by a texture-based pattern classifier. We employ a multi-layer feed-forward neural network operating on local adaptive receptive field features [9]. Finally, temporal integration of detection results is employed to overcome gaps in detection and suppress spurious false positives. A 2D bounding box tracker is utilized, with an object state model involving bounding box position and extent [9]. State parameters are estimated using an α-β tracker.
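A minimal sketch of the depth-map filtering and the θR test might look as follows; the data layout (an N x 3 point array and per-bin road heights) is an assumption made for illustration.

```python
import numpy as np

def filter_points_by_road_profile(points, bin_edges, road_height, h_max):
    """Keep only points between the road surface and road + maximum pedestrian size.

    points:      N x 3 array of (X, Y, Z) world points from dense stereo
    bin_edges:   distance grid Z_0 < ... < Z_N of the road profile
    road_height: road height h_i per bin (same length as the grid)
    h_max:       maximum pedestrian height above the road
    """
    road_height = np.asarray(road_height, dtype=float)
    X, Y, Z = np.asarray(points, dtype=float).T
    bin_idx = np.clip(np.searchsorted(bin_edges, Z) - 1, 0, len(road_height) - 1)
    h = road_height[bin_idx]
    keep = (Y >= h) & (Y <= h + h_max)
    return points[keep]

def roi_triggers_next_module(depth_feature_count, theta_R):
    """A ROI is committed to the next cascade module only if DF_i > theta_R."""
    return depth_feature_count > theta_R
```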
4 Experiments
We tested our dense stereo-based ROI generation scheme on a 5 min (3942 image) sequence recorded from a vehicle driving through the canal area of the city of
Amsterdam. Because of the many bridges and speed bumps, the sequence is quite challenging for the road profiling component. Pedestrians were manually labeled; their 3D position was obtained by triangulation in the two camera views. Only pedestrians located in front of the vehicle in the area 12-27m in longitudinal and ±4m in lateral direction were considered required. Pedestrians beyond this detection area were regarded as optional. Localization tolerance is selected as in [9] to be X = 10% and Z = 30% as a percentage of distance for the lateral (X) and longitudinal (Z) direction. In all, this resulted in 1684 required pedestrian single-frame instances in 66 distinct trajectories, to be detected by our pedestrian system. See Figure 7 for an illustration of the results. We first examined the performance of the ROI generation module in isolation, see Figure 8. Shown are the ROCs (correctly vs. falsely passed ROIs) for various configurations (dense vs. sparse stereo, with/without pitch angle and road profile estimation). No significant performance difference can be observed between dense- or sparse-stereo-based ROI generation when neither pitch angle nor road profile is estimated. Estimating the pitch angle leads, however, to a clear performance improvement.
Fig. 7. System example with estimated road profile and pedestrian detection. (a) Final output with detected pedestrian marked red. The magenta area illustrates the system detection area. (b) Dense stereo image. (c) Corridor used for spline computation after outlier removal. (d) Spline (blue) fitted to the measurements (red) in profile view.
[ROC plot (stereo box filtering): detection rate vs. false positives per frame for the configurations Dense/Road profiling/Estimated pitch, Sparse/Flat world/Estimated pitch, Sparse/Flat world/Fixed pitch, and Dense/Flat world/Fixed pitch.]
Fig. 8. ROC performance of the stereo-based ROI generation module for different variations
Table 1. Comparison of the number of false positives and the total number of generated ROIs per frame for an exemplary threshold θR resulting in a detection rate of 92%
Configuration                 FPs/Frame   # ROIs/Frame
Dense - Road Profiling        1036        1549
Sparse - Pitch Estimation     1662        2345
Sparse - Fixed Pitch          3367        4388
Dense - Fixed Pitch           3395        4355
Incorporating the estimated road profile yields an additional performance gain. The total number of generated ROIs and false positives for an exemplary detection rate of 92% are summarized in Table 1. The number of ROIs that need to be generated can be reduced by a factor of 2.8 when utilizing road profile information compared to a system with a static camera position. Using camera pose information leads to a reduction of generated ROIs by a factor of 1.87. A reduced number of generated ROIs implies fewer computations in later stages of our detection system, and thus faster processing speed (approx. linear in the number of ROIs). We now turn to the evaluation on the overall system level, i.e. with the various ROI generation schemes integrated in the pedestrian classification and tracking system of [9]. Relevant module parameters (in particular the density threshold θR for stereo-based ROI generation) were optimized for each system configuration following the ROC convex hull technique described in [9]. See Figure 9. One observes that the relative ranking of the various ROI generation schemes is maintained, cf. Figure 8 (the dense stereo, fixed pitch and flat world case is not plotted additionally, as it has similar performance to the equivalent sparse-stereo case). That is, there is a significant benefit of estimating pitch angle, camera height and road profile, i.e. a performance improvement by a factor of five.
[ROC plot (overall system performance): detection rate vs. false positives per frame for Dense/Road profiling/Estimated pitch, Sparse/Flat world/Estimated pitch, and Sparse/Flat world/Fixed pitch.]
Fig. 9. Overall performance of system configurations with different ROI generation stages
5 Conclusions
We investigated the benefit of dense stereo for the ROI generation stage of a pedestrian detection system. In challenging real-world sequences (i.e. undulating roads, bridges and speed bumps), we compared various versions of dense and sparse stereo-based ROI generation. For the case of a flat world assumption and fixed camera parameters, sparse and dense stereo provided equal ROI generation performance (baseline configuration). The specific estimation of camera height and pitch angle resulted in a performance improvement of about a factor of three (reduction of false positives at the same correct detection rate). When estimating the road surface as well, the benefit increased to a factor of five vs. the baseline configuration. Interestingly, the added processing cost of computing dense, rather than sparse, disparity maps is at least partially amortized by the fewer ROIs that need to be processed at the system level.
References 1. Alonso, I.P., Llorca, D.F., Sotelo, M.A., Bergasa, L.M., de Toro, P.R., Nuevo, J., Ocana, M., Garrido, M.A.: Combination of Feature Extraction Methods for SVM Pedestrian Detection. IEEE Transactions on Intelligent Transportation Systems 8(2), 292–307 (2007) 2. Broggi, A., Fascioli, A., Fedriga, I., Tibaldi, A., Rose, M.D.: Stereo-based preprocessing for human shape localization in unstructured environments. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2003) 3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, CVPR (2005) 4. Enzweiler, M., Gavrila, D.M.: Monocular Pedestrian Detection: Survey and Experiments. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). IEEE Computer Society Digital Library (2009), http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.260
5. Enzweiler, M., Kanter, P., Gavrila, D.M.: Monocular pedestrian recognition using motion parallax. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2008) ´ 6. Fern´ andez, D., Parra, I., Sotelo, M.A., Revenga, P., Alvarez, S.: 3D candidate selection method for pedestrian detection on non-planar roads. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2007) 7. Franke, U., Gehrig, S., Badino, H., Rabe, C.: Towards Optimal Stereo Analysis of Image Sequences. In: Sommer, G., Klette, R. (eds.) RobVis 2008. LNCS, vol. 4931, pp. 43–58. Springer, Heidelberg (2008) 8. Gandhi, T., Trivedi, M.M.: Pedestrian protection systems: Issues, survey and challenges. IEEE Transactions on Intelligent Transportation Systems 8(3), 413–430 (2007) 9. Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. International Journal of Computer Vision 73(1), 41–59 (2007) 10. Grubb, G., Zelinsky, A., Nilsson, L., Ribbe, M.: 3D vision sensing for improved pedestrian safety. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2004) 11. Labayrade, R., Aubert, D., Tarel, J.P.: Real time obstacle detection on non flat road geometry through ’v-disparity’ representation. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2002) 12. van der Mark, W., Gavrila, D.M.: Real-Time Dense Stereo for Intelligent Vehicles. IEEE Transactions on Intelligent Transportation Systems 7(1), 38–50 (2006) 13. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(4), 349–361 (2001) 14. Nedevschi, S., Danescu, R., Frentiu, D., Marita, T., Oniga, F., Pocol, C., Graf, T., Schmidt, R.: High accuracy stereovision approach for obstacle detection on non-planar roads. In: Proc. of the IEEE Intelligent Engineering Systems, INES (2004) 15. Oniga, F., Nedevschi, S., Meinecke, M., Binh, T.: Road surface and obstacle detection based on elevation maps from dense stereo. In: Proc. of the IEEE Intelligent Transportation Systems, ITSC (2007) 16. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, CVPR (2007) 17. Shashua, A., Gdalyahu, Y., Hayun, G.: Pedestrian detection for driving assistance systems: single-frame classification and system level performance. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2004) 18. Suganuma, N., Fujiwara, N.: An obstacle extraction method using virtual disparity image. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2007) 19. Wedel, A., Franke, U., Badino, H., Cremers, D.: B-Spline modeling of road surfaces for freespace estimation. In: Proc. of the IEEE Intelligent Vehicle Symposium, IVS (2008) 20. Zhao, L., Thorpe, C.: Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems (ITS) 1(3)
Pedestrian Detection by Probabilistic Component Assembly
Martin Rapus (1,2), Stefan Munder (1), Gregory Baratoff (1), and Joachim Denzler (2)
(1) Continental AG, ADC Automotive Distance Control Systems GmbH, Kemptener Str. 99, 88131 Lindau, Germany
{martin.rapus,stefan.munder,gregory.baratoff}@continental-corporation.com
(2) Chair for Computer Vision, Friedrich Schiller University of Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
[email protected]
Abstract. We present a novel pedestrian detection system based on probabilistic component assembly. A part-based model is proposed which uses three parts consisting of head-shoulder, torso and legs of a pedestrian. Components are detected using histograms of oriented gradients and Support Vector Machines (SVM). Optimal features are selected from a large feature pool by boosting techniques, in order to calculate a compact representation suitable for SVM. A Bayesian approach is used for the component grouping, consisting of an appearance model and a spatial model. The probabilistic grouping integrates the results, scale and position of the components. To distinguish both classes, pedestrian and non-pedestrian, a spatial model is trained for each class. Below miss rates of 8% our approach outperforms state of the art detectors. Above, performance is similar.
1 Introduction
Pedestrian recognition is one of the main research topics in computer vision, with applications ranging from security problems, where e.g. humans are observed or counted, to the automotive safety area, for vulnerable road user protection. The main challenges are the varying appearance of pedestrians, due to clothing and posture, and occlusions, for example pedestrians walking in groups or behind car hoods. For automotive safety applications, real-time performance needs to be combined with high accuracy and a low false positive rate. Earlier approaches employed full-body classification. Among the most popular: Papageorgiou et al. [12] apply Haar wavelets with an SVM [15]. Instead of an SVM, a cascade based on AdaBoost [6] is used by Viola and Jones [16] to achieve real-time performance. An extensive experimental evaluation of histograms of oriented gradients (HOG) for pedestrian recognition is made by Dalal and Triggs [2]. In place of the constant histogram selection [2], Zhu et al. [19] use a variable selection made by an AdaBoost cascade, which achieves better results. Gavrila and Munder [7] recognize pedestrians with local receptive fields and several neural networks.
The achieved performance with full-body classification is still not good enough to handle the big variability in human posture. To achieve better performance, part-based approaches are used. These approaches are more robust against partial occlusions. Part-based approaches often consist of two steps, the first one detects components, mostly by classification approaches, while the second step groups them to pedestrians. One possible way to group components is to use classification techniques. Mohan et al. [11] use the approach proposed in [12] for the component detection. The best results per component are classified by a SVM into pedestrian and non-pedestrian. In Dalal’s thesis [3], the HOG-approach [2] is used for the component detectors. A spatial histogram for each component weighted by the results is classified by a SVM. Felzenszwalb et al. [5] determine the component model parameters (size and position) in the training process. For the pedestrian classification the HOG component feature vectors and geometrical parameters (scale and position) are used as input for a linear SVM. The fixed ROI configuration used in these approaches puts a limit on the variability of part configurations they can handle. To overcome this limitation, spatial models that explicitly describe the arrangement of components were introduced. In general, these approaches incorporate an appearance model and a spatial model. One of the first approaches is from Mikolajczyk et al. [10]. The components are detected by SIFT-like features and AdaBoost. An iterative process with thresholding is used to generate the global result via a probabilistic assembly of the components, using the geometric relations: distance vector and scale ratio between two parts, modeled by a Gaussian. Wu and Nevatia [18] use a component hierarchy with 12 parts and the full-body as root-component. The component detection is done by edgelet features [17] and Boosting [14]. For the probabilistic grouping the position, scale and a visibility value is incorporated. Only the inter-occlusion of pedestrians is considered. The Maximum-A-Posteriori (MAP) configuration is computed by the Hungarian algorithm. All results above a threshold are regarded as pedestrian. Bergtholdt et al. [1] use all possible relations between 13 components. For the component detection SIFT and color features are classified through randomized classification trees. The MAP configuration is computed with A*-search. A great number of parts is used by the last two approaches for robustness against partial occlusions. The computation time for the probabilistic grouping grows non-linearly with the number of components used, and with the number of component detection results. As a consequence these probabilistic based methods have no real-time performance on an actual desktop PC. Our approach is part-based. For real-time purpose our pedestrian detector is divided into the three parts, head-shoulder, torso, legs and for better classification performance we distinguish between frontal/rear and side view. HOGs [2] are used as component features. We make use of a variable histogram selection by AdaBoost. The selected histograms are classified through a linear SVM. Because similar histograms are selected with weighted fisher discriminant analysis (wFDA) [9] in comparison to a linear SVM, but in less training time, we apply wFDA as weak classifier. A Bayesian-based approach is used for component
grouping. To reduce the number of component detections thresholding is applied, keeping 99% true positive component detection rate. Our probabilistic grouping approach consists of an image matching and a spatial matching of the components. To use the component results for the image matching they are converted into probabilistic values. Invariance against scale and translation is achieved by using the distance vector, normalized through scale, and the scale ratio between two components. In comparison to existing approaches the spatial distributions are not approximated, instead the distribution histograms are used directly. We also differentiate component arrangements by class. Below miss rates of 8% our approach outperforms state of the art detectors. Above, performance is similar. The paper is organized as follows. Sect. 2 describes the component detection step, followed by the component grouping through a probabilistic model in Sect. 3. The results for the component detection and grouping step are discussed in Sect. 4. The conclusion forms Sect. 5 and the paper ends with an outlook in Sect. 6.
2 Component Detection
HOG features were proven best in [2], and thus adopted here for the component detection. The averaged gradient magnitude and HOG images for our components, derived through the INRIA Person dataset [2], are visualized in Fig. 1 and Fig. 2. Instead of the histograms the corresponding edges with weighted edge length are shown. Pedestrian contours are well preserved in the average edge images, while irrelevant edges are suppressed. A (slight) difference can be seen in the head component. In the frontal view, the whole contour is preserved and in the side view it is only the head contour, while the shoulder contour is blurred. Two different methods for the histogram selection are examined. One is a constant selection [2]: the image is divided into non-overlapping histograms, followed by an extraction of normalized blocks neighboring histograms. The other approach is similar to [19] and uses variable selection. The best histogram blocks (varying size and position) are selected using AdaBoost. We use the weighted Fisher discriminant analysis [9] as weak classifier. The classification of the generated feature vector is done by a linear SVM.
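For illustration, a component detector along these lines can be prototyped with off-the-shelf HOG features and a linear SVM (scikit-image and scikit-learn are assumed here); the paper's AdaBoost-based block selection with wFDA weak classifiers is not reproduced, and the 9-bin / 2x2-cell-block settings follow the experiments while the 8x8-pixel cells are an assumption.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(gray_patch):
    """HOG feature vector for one equally sized grayscale patch."""
    return hog(gray_patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_component_detector(pos_patches, neg_patches):
    """Minimal sketch of one component detector (e.g. frontal legs)."""
    X = np.array([hog_descriptor(p) for p in list(pos_patches) + list(neg_patches)])
    y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf
```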
3 Probabilistic Component Assembly
This step builds the global pedestrian detections out of the detected components V = {vHS, vT, vL}, where the superscripts HS, T and L stand for head-shoulder, torso and legs respectively, by applying the appearance and the spatial relationship. The probability P(L|I) to find a pedestrian, consisting of the mentioned components, with configuration L = {lHS, lT, lL} in the actual image I, with li as position and scale of the i-th component, is given by Bayes' rule:

P(L|I) ∝ P(I|L) · P(L)    (1)
Fig. 1. Average gradient magnitudes and average HOGs for the frontal/rear view components (head, torso and legs) - INRIA Person dataset
Fig. 2. Average gradient magnitudes and average HOGs for the side view components (head, torso and legs) - INRIA Person dataset
The first factor P(I|L) is the detection probability of the components at the position and scale given by L. The second factor P(L) represents the prior probability of a positive pedestrian component arrangement. Every head-shoulder detection is used as a starting point to find the corresponding MAP configuration by greedy search. In the following sections we will go into further detail.
3.1 Probabilistic Appearance Model
To compute P(I|L), the component results of the detection step are used. For this purpose the SVM results f(x) are converted into probabilistic values. From the many choices available, we preferred an approximation of the a posteriori curve P(y = 1|f(x)), i.e. the probability that a specific SVM result f(x) corresponds to a pedestrian component (y = 1), because the best fit was achieved by this model. Using Bayes' rule with the priors P(y = −1) and P(y = 1), and the class-conditional densities p(f(x)|y = −1) and p(f(x)|y = 1), we get:

P(y = 1|f(x)) = p(f(x)|y = 1)P(y = 1) / Σ_{i∈{−1,1}} p(f(x)|y = i)P(y = i)    (2)
The resulting a posteriori values for the frontal legs training set are shown in Fig. 3(b), derived with the class-conditional densities, which can be seen in Fig. 3(a). A sigmoid function s(z = f(x)) = 1/(1 + exp(Az + B)) is used to approximate
[Plots for the frontal legs: (a) SVM result histograms of the positive and negative class and (b) posterior probability with sigmoid approximation, both as a function of the SVM result.]
Fig. 3. (a) Distribution histograms and (b) the a posteriori curve approximated by a sigmoid function, for the frontal legs
the posterior. The parameters for s(z) are determined by the Maximum Likelihood method proposed by Platt [13], using the Levenberg-Marquardt method. To compute the sigmoid parameters, training sets for each component and view are used. Fig. 3(b) shows the approximated curve for the frontal legs. By assuming independence between the detectors for each component vi, P(I|L) is given by:

P(I|L) = ∏_{vi ∈ V} P(y = 1|fi(xi))    (3)
with xi as the extracted feature vector and fi as the result of the i-th component.
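The following sketch shows the sigmoid mapping and the product combination of Eq. (3); fitting A and B with a generic least-squares routine is a simple stand-in for Platt's Maximum Likelihood procedure, and all interfaces are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(z, A, B):
    # s(z) = 1 / (1 + exp(A*z + B))
    return 1.0 / (1.0 + np.exp(A * z + B))

def fit_sigmoid(svm_scores, labels):
    """Fit A, B so that sigmoid(f(x)) approximates P(y=1 | f(x));
    labels are 0/1 targets, a simple stand-in for Platt's procedure [13]."""
    params, _ = curve_fit(sigmoid, np.asarray(svm_scores, float),
                          np.asarray(labels, float), p0=(-1.0, 0.0), maxfev=10000)
    return params

def appearance_likelihood(component_scores, sigmoid_params):
    """P(I|L) as the product of the per-component posteriors, cf. Eq. (3)."""
    p = 1.0
    for f_x, (A, B) in zip(component_scores, sigmoid_params):
        p *= sigmoid(f_x, A, B)
    return p
```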
3.2 Probabilistic Geometric Model
Besides the appearance likelihood value P(I|L) for component configuration L, the probability of the spatial arrangement P(L) has to be computed. Invariance against scale and translation is achieved by using the relative distance vector dij and the scale ratio Δsij = sj/si between two components i and j:

dij = (dxij, dyij)^T = (1/si) · (xj − xi, yj − yi)^T    (4)
As is common in the literature [4], the model is expressed as a graph G = (V, E), with the components vi as vertices and the possible relations as edges eij between components i and j. Our model regards all possible component relations, except those between the same component in different views. Every edge eij gets a weight wij ∈ [0, 1], to account for the fact that component pairs of the same view appear
more likely than component pairs of different views. The weights are generated from the component training sets. With the priors P(li, lj) = P(dij, Δsij), the probability of the component arrangement L is given as:

P(L) = ∏_{eij ∈ E} wij P(li, lj) = ∏_{eij ∈ E} wij P(dij, Δsij)    (5)
The generated distribution histograms for the geometrical parameters dij and Δsij are used for the priors P (li , lj ). To distinguish between a pedestrian-like and non-pedestrian-like component arrangement, two spatial distributions are generated, one for the positive Pp (L) and one for the negative class Pn (L). Distribution histograms are also used for the negative class. The distributions are computed as follows: First the positive spatial distribution histograms are computed from training data. Afterwards, the spatial distributions for the negative class are generated, using only the hard ones, i.e. those lying in the distribution histogram range for the positive class. As final spatial result the difference between the positive and negative spatial result is used.
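A histogram-based pairwise spatial prior in the spirit of Eqs. (4) and (5) could be sketched as follows; the bin count, the value ranges and the (x, y, s) detection format are assumptions made for illustration.

```python
import numpy as np

class PairwiseSpatialPrior:
    """Sketch of the histogram-based spatial prior for one component pair (i, j)."""

    def __init__(self, bins=16, ranges=((-2, 2), (-2, 2), (0, 3))):
        self.bins, self.ranges = bins, ranges
        self.hist, self.edges = None, None

    @staticmethod
    def geometry(li, lj):
        # li = (x, y, s): position and scale of a component detection
        xi, yi, si = li
        xj, yj, sj = lj
        return np.array([(xj - xi) / si, (yj - yi) / si, sj / si])

    def fit(self, pairs):
        g = np.array([self.geometry(li, lj) for li, lj in pairs])
        self.hist, self.edges = np.histogramdd(g, bins=self.bins,
                                               range=self.ranges, density=True)
        return self

    def prob(self, li, lj):
        g = self.geometry(li, lj)
        idx = tuple(int(np.clip(np.digitize(g[k], e) - 1, 0, self.bins - 1))
                    for k, e in enumerate(self.edges))
        return float(self.hist[idx])
```

Two such priors, trained on positive and negative arrangements respectively, can then be evaluated and their results differenced, as described above.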
4 Experiments
The INRIA Person dataset [2] is used for our experiments. This dataset contains a training set with 2416 pedestrian labels and 1218 images without any pedestrians, and a test set with 1132 pedestrian images and 453 images not containing any pedestrians. Both sets have only global labels. For the component evaluation, part labels are needed, so in a first step we applied our component labels: head-shoulder, torso and legs, in front/rear and side view. In a second step the average label sizes were determined, see Table 1. Smaller labels were resized to the size given in Table 1. The numbers of positive training samples and test samples, for every component and view, are listed in Table 1. Some images have no component training labels because of occlusions. In a first experiment the component detection was evaluated, followed by testing the proposed probabilistic model from Sect. 3. Finally, the probabilistic method is compared to state-of-the-art detectors. Receiver Operating Characteristic (ROC) curves in log-log scale are used for the experimental evaluation of the miss rate, FalseNeg / (TruePos + FalseNeg), against the false-positive rate. The matching criterion is 75% overlap between detection and corresponding label.
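The evaluation quantities can be written down directly; the simple overlap measure below is only a sketch and may differ in detail from the 75% criterion used in the paper.

```python
def overlap(det, gt):
    """Intersection area relative to the ground-truth box area (illustrative)."""
    x1, y1, x2, y2 = det
    gx1, gy1, gx2, gy2 = gt
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    return (iw * ih) / ((gx2 - gx1) * (gy2 - gy1))

def miss_rate(true_pos, false_neg):
    # miss rate = FalseNeg / (TruePos + FalseNeg)
    return false_neg / float(true_pos + false_neg)
```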
4.1 Component Detection
The proposed component detection in Sect. 2 is evaluated. ”Unsigned” gradients, 9 orientation bins and a block size of 2x2 histogram cells are used as parameters for the HOG features. In this test the constant histogram selection is compared against a variable selection, as described in Sect. 2. The block sizes for the constant selection are: 16x16 pixels for the frontal torso and 12x12 pixels for
Table 1. Component sizes and the number of positive training/test samples

Part   View   Width  Height  # pos. Training-Samples  # pos. Test-Samples
head   front  32     32      1726                     870
head   side   32     32      678                      262
torso  front  40     45      1668                     846
torso  side   32     45      646                      286
legs   front  34     55      1400                     756
legs   side   34     55      668                      376
the remaining components/views. For the variable selection, the block size ranges from 8x8 up to the maximum, not limited to a specific scale. The negative training set was created by using the bootstrapping method given in [2]. The generation of regions of interest (ROIs) is done by a sliding window approach. ROIs are generated in different scales. The factor 1.2 is used between two scales. In all scales the step size is 4 pixels in both directions. For the SVM classifier training we use SVMlight [8]. The ROC curves for the component detection are shown in Fig. 4 and Fig. 5, divided into frontal/rear and side views. It confirms that variable selection (solid lines) yields better results than constant selection (dotted lines), except for the frontal head component. The results for the frontal/rear head with constant selection are slightly better than those with variable selection. An interesting observation is the obvious difference between the head and leg results, which is stronger in the frontal/rear view than in the side view. The leg component produces, at 10% miss rate, three times fewer false positives than the head. In the frontal view, similar results are received by head and torso. The ROC curves of the side torso and side legs intersect at 10% miss rate. Below 10% miss rate, fewer false positives are produced by the torso, and above 10% miss rate the legs generate fewer false positives. The computation time per component ROI is on average 0.025 ms, on a 1.8 GHz dual core PC, using only one core. At a resolution of 320x240 pixels, 20000 search windows are generated on average per component and view. The component detection at this resolution with full search takes about 3.1 seconds.
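The multi-scale sliding-window enumeration used above (step size 4 pixels, scale factor 1.2) can be sketched as follows; the window base size and the number of scales are illustrative.

```python
def sliding_windows(img_w, img_h, win_w, win_h, step=4, scale_factor=1.2, max_scales=10):
    """Enumerate (x, y, w, h) search windows over several scales."""
    windows = []
    s = 1.0
    for _ in range(max_scales):
        w, h = int(win_w * s), int(win_h * s)
        if w > img_w or h > img_h:
            break
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                windows.append((x, y, w, h))
        s *= scale_factor
    return windows
```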
4.2 Probabilistic Component Assembly
The proposed Bayesian approach to component assembly from Sect. 3 is evaluated here. In a first step the probabilistic approach is tested with and without the use of spatial distribution histograms for the negative class, and afterwards compared against state of the art detectors. These detectors are the one from Dalal [2] and the cascade from Viola and Jones [16]. Again the INRIA Person dataset is used as test set. First the probabilistic approach is evaluated. The results are given in Fig. 6. By using spatial distribution histograms from both classes we achieve better results. The difference between both curves is greater at higher false positive
[ROC curves in log-log scale (miss rate vs. false positives): front/rear components and side components with constant vs. variable selection; the probabilistic model with and without the negative spatial distribution; and the global detectors (Variable Legs Front/Rear, Global Dalal Classifier, Global Viola-Jones Cascade, Probabilistic Model).]
Fig. 4. Front/Rear component results
Fig. 5. Side component results
Fig. 6. Probabilistic grouping results
Fig. 7. State of the art detectors in comparison to our approach (blue line)
rates. At low miss rates, the additional use of spatial distributions for the negative class reduces the number of false positives compared to the common approach. In the following experiment the probabilistic approach is compared against state-of-the-art detectors. Fig. 7 shows the best probabilistic detector in comparison to the mentioned standard detectors and the best component result (frontal/rear legs). The results of our part-based approach are slightly better than the best state-of-the-art detector. Below 8% miss rate our probabilistic method outperforms the state-of-the-art detectors. Note that Dalal's detector takes a larger margin around a person, so in comparison to our approach more contextual information is incorporated. Fig. 8 shows some typical results of our approach. At a resolution of 320x240 pixels, after applying thresholding to the component detection results, we get on average about 400 detections per component and view. For this resolution, our probabilistic grouping approach takes 190 milliseconds on average on a 1.8 GHz PC.
Fig. 8. Some detection results (white - full body, black - head, green - torso, cyan - legs). No post-processing was applied to the images.
5 Conclusion
In this paper a Bayesian component-based approach for pedestrian recognition in single frames was proposed. Our pedestrian detector is composed of the headshoulder, torso and legs, divided into front/rear and side view for better recognition. For the component detection a variable selection of histograms of oriented gradients and SVM classification is applied. In the next step, the components are grouped by a Bayes-based approach. To shrink the number of candidates for the probabilistic grouping, thresholding is applied to all component results, so that 99% true positive component detection rate remains. Invariance against scale and translation is achieved by using the relative distance vector and scale ratio between the components. To make a better separation into positive and negative spatial component arrangements, distributions for both classes are generated. Instead of approximating these distributions, for example by a Gaussian, the computed distribution histograms are used directly. The results confirm the positive benefit of using distributions for both classes and not only for one. Below miss rates of 8% our approach outperforms state of the art detectors. Above, performance is similar.
6 Future Work
One main drawback of our approach is computation time, mainly of the component detection. Using a cascaded classifier would make the component detection faster. To improve the performance of our approach the narrow field of a pedestrian can be included as contextual information. First experiments show promising results. The performance of the front/rear views is much better than for the side views. To overcome this, left and right side views could be separated.
References 1. Bergtholdt, M., Kappes, J., Schmidt, S., Schn¨ orr, C.: A Study of Parts-Based Object Class Detection Using Complete Graphs. In: IJCV (in press, 2009) 2. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR, vol. 1, pp. 886–893 (2005)
3. Dalal, N.: Finding People in Images and Videos, PhD thesis, Institut National Polytechnique de Grenoble (July 2006) 4. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recognition. IJCV 61(1), 55–79 (2005) 5. Felzenszwalb, P., Mcallester, D., Ramanan, D.: A Discriminatively Trained, Multiscale, Deformable Part Model. In: CVPR, Anchorage, Alaska, June 2008, pp. 1–8 (2008) 6. Freund, Y., Schapire, R.E.: Experiments with a New Boosting Algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996) 7. Gavrila, D.M., Munder, S.: Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle. IJCV 73, 41–59 (2007) 8. Joachims, T.: Making large-Scale SVM Learning Practical. In: Sch¨ olkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1999) 9. Laptev, I.: Improvements of Object Detection Using Boosted Histograms. In: British Machine Vision Conference, September 2006, vol. 3, pp. 949–958 (2006) 10. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004) 11. Mohan, A., Papageorgiou, C., Poggio, T.: Example-Based Object Detection in Images by Components. PAMI 23(4), 349–361 (2001) 12. Papageorgiou, C., Evgeniou, T., Poggio, T.: A Trainable Pedestrian Detection System. In: IVS, pp. 241–246 (1998) 13. Platt, J.: Probabilities for SV Machines. In: Press, M. (ed.) Advances in Large Margin Classifiers, pp. 61–74 (1999) 14. Schapire, R.E.: The Strength of Weak Learnability. Machine Learning 5(2), 197– 227 (1990) 15. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc, New York (1995) 16. Viola, P., Jones, M.: Robust Real-time Object Detection. IJCV 57(2), 137–154 (2004) 17. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: ICCV, vol. 1, pp. 90–97 (2005) 18. Wu, B., Nevatia, R.: Detection and Segmentation of Multiple, Partially Occluded Objects by Grouping, Merging, Assigning Part Detection Responses. IJCV 82(2), 185–204 (2009) 19. Zhu, Q., Yeh, M.-C., Cheng, K.-T., Avidan, S.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: CVPR, pp. 1491–1498 (2006)
High-Level Fusion of Depth and Intensity for Pedestrian Classification
Marcus Rohrbach (1,3), Markus Enzweiler (2), and Dariu M. Gavrila (1,4)
(1) Environment Perception, Group Research, Daimler AG, Ulm, Germany
(2) Image & Pattern Analysis Group, Dept. of Math. and Computer Science, Univ. of Heidelberg, Germany
(3) Dept. of Computer Science, TU Darmstadt, Germany
(4) Intelligent Systems Lab, Fac. of Science, Univ. of Amsterdam, The Netherlands
[email protected], {uni-heidelberg.enzweiler,dariu.gavrila}@daimler.com
Abstract. This paper presents a novel approach to pedestrian classification which involves a high-level fusion of depth and intensity cues. Instead of utilizing depth information only in a pre-processing step, we propose to extract discriminative spatial features (gradient orientation histograms and local receptive fields) directly from (dense) depth and intensity images. Both modalities are represented in terms of individual feature spaces, in each of which a discriminative model is learned to distinguish between pedestrians and non-pedestrians. We refrain from the construction of a joint feature space, but instead employ a high-level fusion of depth and intensity at classifier-level. Our experiments on a large real-world dataset demonstrate a significant performance improvement of the combined intensity-depth representation over depth-only and intensity-only models (factor four reduction in false positives at comparable detection rates). Moreover, high-level fusion outperforms low-level fusion using a joint feature space approach.
1 Introduction
Pedestrian recognition is an important problem in domains such as intelligent vehicles or surveillance. It is particularly difficult, as pedestrians tend to occupy only a small part of the image (low resolution), have different poses (shape) and clothing (appearance), varying background, or might be partially occluded. Most state-of-the-art systems derive feature sets from intensity images, i.e. grayscale (or colour) images, and apply learning-based approaches to detect people [1,3,9,22,23]. Besides image intensity, depth information can provide additional cues for pedestrian recognition. Up to now, the use of depth information has been limited to recovering high-level scene geometry [5,11] and focus-of-attention mechanisms [8]. Given the availability of real-time high-resolution dense stereo algorithms [6,20],
Marcus Rohrbach and Markus Enzweiler acknowledge the support of the Studienstiftung des deutschen Volkes (German National Academic Foundation).
[Fig. 1 schematic: Offline Training of an Intensity Classifier and a Depth Classifier; Online Application of both classifiers leading to a Fused Decision.]
Fig. 1. Framework overview. Individual classifiers are trained offline on intensity and corresponding depth images. Online, both classifiers are fused to a combined decision. For depth images, warmer colors represent closer distances from the camera.
we propose to enrich an intensity-based feature space for pedestrian classification with features operating on dense depth images (Sect. 3). Depth information is computed from a calibrated stereo camera rig using semi-global matching [6]. Individual classifiers are trained offline on features derived from intensity and depth images depicting pedestrian and non-pedestrian samples. Online, the outputs of both classifiers are fused to a combined decision (Sect. 4). See Fig. 1.
2 Related Work
A large amount of literature covers image-based classification of pedestrians. See [3] for a recent survey and a challenging benchmark dataset. Classification typically involves a combination of feature extraction and a discriminative model (classifier), which learns to separate object classes by estimating discriminative functions within an underlying feature space. Most proposed feature sets are based on image intensity. Such features can be categorized into texture-based and gradient-based. Non-adaptive Haar wavelet features have been popularized by [15] and adapted by many others [14,22], with manual [14,15] and automatic feature selection [22]. Adaptive feature sets were proposed, e.g. local receptive fields [23], where the spatial structure is able to adapt to the data. Another class of texture-based features involves codebook patches which are extracted around salient points in the image [11,18]. Gradient-based features have focused on discontinuities in image brightness. Local gradient orientation histograms were applied in both sparse (SIFT) [12] and dense representations (HOG) [1,7,25,26]. Covariance descriptors involving a model of spatial variation and correlation of local gradients were also used [19]. Yet others proposed local shape filters exploiting characteristic patterns in the spatial configuration of salient edges [13,24].
(a) Pedestrian
(b) Non-Pedestrian
Fig. 2. Intensity and depth images for pedestrian (a) and non-pedestrian samples (b). From left to right: intensity image, gradient magnitude of intensity, depth image, gradient magnitude of depth
In terms of discriminative models, support vector machines (SVM) [21] are widely used in both linear [1,25,26] and non-linear variants [14,15]. Other popular classifiers include neural networks [9,10,23] and AdaBoost cascades [13,19,22,24,25,26]. Some approaches additionally applied a component-based representation of pedestrians as an ensemble of body parts [13,14,24]. Others combined features from different modalities, e.g. intensity, motion, depth, etc. Multi-cue combination can be performed at different levels: On module-level, depth [5,9,11] or motion [4] can be used in a pre-processing step to provide knowledge of the scene geometry and focus-of-attention for a subsequent (intensity-based) classification module. Other approaches have fused information from different modalities on feature-level by establishing a joint feature space (low-level fusion): [1,22] combined gray-level intensity with motion. In [17], intensity and depth features derived from a 3D camera with very low resolution (pedestrian heights between 4 and 8 pixels) were utilized. Finally, fusion can occur on classifier-level [1,2]. Here, individual classifiers are trained within each feature space and their outputs are combined (high-level fusion). We consider the main contribution of our paper to be the use of spatial depth features based on dense stereo images for pedestrian classification at medium resolution (pedestrian heights up to 80 pixels). A secondary contribution concerns fusion techniques of depth and intensity. We follow a high-level fusion strategy which allows to tune features specifically to each modality and base the final decision on a combined vote of the individual classifiers. As opposed to lowlevel fusion approaches [17,22], this strategy does not suffer from the increased dimensionality of a joint feature space.
3 Spatial Depth and Intensity Features
Dense stereo provides information for most image areas, apart from regions which are visible only by one camera (stereo shadow). See the dark red areas to the left of the pedestrian torso in Fig. 2(a). Spatial features can be based on either depth Z (in meters) or disparity d (in pixels). Both are inversely proportional, given the camera geometry with focal length f and the distance B between the two cameras:

Z(x, y) = f · B / d(x, y)  at pixel (x, y)    (1)
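A direct rendering of Eq. (1), converting a dense disparity map into the depth image used for feature extraction, might look like this (invalid disparities are assumed to be non-positive):

```python
import numpy as np

def disparity_to_depth(disparity, f, B):
    """Z(x, y) = f * B / d(x, y); invalid disparities map to infinity."""
    Z = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    Z[valid] = f * B / disparity[valid]
    return Z
```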
104
M. Rohrbach, M. Enzweiler, and D.M. Gavrila
(a) Intensity features
(b) Depth features
Fig. 3. Visualization of gradient magnitude (related to HOG) and LRF features on (a) intensity and (b) depth images. From left to right: Average gradient magnitude of pedestrian training samples, two exemplary 5×5-pixel local receptive field features and their activation maps, highlighting spatial regions of the training samples where the corresponding LRFs are most discriminative with regard to the pedestrian and non-pedestrian classes.
Objects in the scene have similar foreground/background gradients in depth space, irrespective of their location relative to the camera. In disparity space however, such gradients are larger, the closer the object is to the camera. To remove this variability, we derive spatial features from depth instead of disparity. We refer to an image with depth values Z(x, y) at each pixel (x, y) as depth image. A visual inspection of the depth image vs. the intensity image in Fig. 2 reveals that pedestrians have a distinct depth contour and texture which is different from the intensity domain. In intensity images, lower body features (shape and appearance of legs) are the most significant feature of a pedestrian (see results of part-based approaches, e.g. [14]). In contrast, the upper body area has dominant foreground/background gradients and is particularly characteristic for a pedestrian in the depth image. Additionally, the stereo shadow is clearly visible in this area (to the left of the pedestrian torso) and represents a significant local depth discontinuity. This might not be a disadvantage but rather a distinctive feature. The various salient regions in depth and intensity images motivate our use of fusion approaches between both modalities to benefit from the individual strengths, see Sect. 4. To instantiate feature spaces involving depth and intensity, we utilize wellknown state-of-the-art features, which focus on local discontinuities: Non-adaptive histogram of oriented gradients with a linear SVM (HOG/linSVM) [1] and a neural network using adaptive local receptive fields (NN/LRF) [23]. For classifier training, the feature vectors are normalized to [−1; +1] per dimension. To get an insight into HOG and LRF features, Fig. 3 depicts the average gradient magnitude of all pedestrian training samples (related to HOG), as well as exemplary local receptive field features and their activation maps (LRF), for both intensity and depth. We observe that gradient magnitude is particularly high around the upper body contour for the depth image, while being more evenly distributed for the intensity image. Further, almost no depth gradients are present on areas corresponding to the pedestrian body. During training, the local receptive field features have developed to detect very fine grained structures in the image intensity domain. The two features depicted in Fig. 3(a) can be regarded as specialized “head-shoulder” and “leg” detectors and are especially activated in the corresponding areas. For depth images, LRF features respond to larger structures in the image, see Fig. 3(b). Here, characteristic features
focus on the coarse depth contrast between the upper-body (head/torso) area and the background. The mostly uniform depth texture on the pedestrian body is a prominent feature as well.
4 Fusion on Classifier-Level
A popular strategy to improve classification is to split up a classification problem into more manageable sub-parts on the data level, e.g. using mixture-of-experts or component-based approaches [3]. A similar strategy can be pursued on the classifier level. Here, multiple classifiers are learned on the full dataset and their outputs are combined into a single decision. Benefits can be expected particularly when the classifiers involve uncorrelated features. We follow a Parallel Combination strategy [2], where multiple feature sets (i.e. based on depth and intensity, see Sect. 3) are extracted from the same underlying data. Each feature set is then used as input to a single classifier and their outputs are combined (high-level fusion). For classifier fusion, we utilize a set of fusion rules which are explained below. An important prerequisite is that the individual classifier outputs are normalized, so that they can be combined homogeneously. The outputs of many state-of-the-art classifiers can be converted to an estimate of posterior probabilities [10,16]. We use this sigmoidal mapping in our experiments. Let x_k, k = 1, ..., n, denote a (vectorized) sample. The posterior for the k-th sample with respect to the j-th object class (e.g. pedestrian, non-pedestrian), estimated by the i-th classifier, i = 1, ..., m, is given by p_ij(x_k). Posterior probabilities are normalized across object classes for each sample, so that

Σ_j p_ij(x_k) = 1    (2)
Classifier-level fusion involves the derivation of a new set of class-specific confidence values for each data point, q_j(x_k), out of the posteriors of the individual classifiers, p_ij(x_k). The final classification decision ω(x_k) results from selecting the object class with the highest confidence:

ω(x_k) = arg max_j q_j(x_k)    (3)
We consider the following fusion rules to determine the confidence q_j(x_k) of the k-th sample with respect to the j-th object class:

Maximum Rule. The maximum rule bases the final confidence value on the classifier with the highest estimated posterior probability:

q_j(x_k) = max_i p_ij(x_k)    (4)

Product Rule. Individual posterior probabilities are multiplied to derive the combined confidence:

q_j(x_k) = Π_i p_ij(x_k)    (5)
Sum Rule. The combined confidence is computed as the average of individual posteriors, with m denoting the number of individual classifiers:

q_j(x_k) = (1/m) Σ_i p_ij(x_k)    (6)
SVM Rule. A support vector machine is trained as a fusion classifier to discriminate between object classes in the space of posterior probabilities of the individual classifiers. Let p_jk = (p_1j(x_k), ..., p_mj(x_k)) denote the m-dimensional vector of individual posteriors for sample x_k with respect to the j-th object class. The corresponding hyperplane is defined by:

f_j(p_jk) = Σ_l y_l α_l · K(p_jk, p_jl) + b    (7)

Here, p_jl denotes the set of support vectors with labels y_l and Lagrange multipliers α_l. K(·, ·) represents the SVM kernel function. We use a non-linear RBF kernel in our experiments. The SVM decision value f_j(p_jk) (distance to the hyperplane) is used as confidence value:

q_j(x_k) = f_j(p_jk)    (8)
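To make the fixed fusion rules concrete, the sketch below applies Eqs. (4)-(6) to an array of already normalized posteriors; the array layout, the NumPy implementation and the toy values are illustrative assumptions and not part of the original system.

```python
import numpy as np

def fuse_posteriors(p, rule="sum"):
    """Fuse per-classifier posteriors into confidences q_j(x_k).

    p: array of shape (m, N, J) -- m classifiers, N samples, J classes,
       each p[i, k, :] assumed normalized to sum to 1 (Eq. (2)).
    Returns (q, labels): confidences of shape (N, J) and arg-max decisions (Eq. (3)).
    """
    if rule == "max":        # Eq. (4)
        q = p.max(axis=0)
    elif rule == "prod":     # Eq. (5)
        q = p.prod(axis=0)
    elif rule == "sum":      # Eq. (6): average of the individual posteriors
        q = p.mean(axis=0)
    else:
        raise ValueError("unknown fusion rule: %s" % rule)
    return q, q.argmax(axis=1)

if __name__ == "__main__":
    # Two classifiers (e.g. depth and intensity), three samples, two classes.
    p = np.array([[[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]],
                  [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]])
    for rule in ("max", "prod", "sum"):
        q, y = fuse_posteriors(p, rule)
        print(rule, y)
```

The SVM rule of Eqs. (7)-(8) would instead train a (kernel) SVM on the stacked posterior vectors p_jk and use its decision value as confidence.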
5 Experiments

5.1 Experimental Setup
The presented feature/classifier combinations and fusion strategies, see Sects. 3 and 4, were evaluated in experiments on pedestrian classification. Training and test samples comprise non-occluded pedestrian and non-pedestrian cut-outs from intensity and corresponding depth images, captured from a moving vehicle in an urban environment. See Table 1 and Fig. 4 for an overview of the dataset. All samples are scaled to 48 × 96 pixels (HOG/linSVM) and 18 × 36 pixels (NN/LRF) with an eight-pixel (HOG/linSVM) and two-pixel border (NN/LRF) to retain contour information. For each manually labelled pedestrian bounding box we randomly created four samples by mirroring and geometric jittering.
(a) Pedestrian samples
(b) Non-Pedestrian samples
Fig. 4. Overview of (a) pedestrian and (b) non-pedestrian samples (intensity and corresponding depth images)
Table 1. Dataset statistics. The same numbers apply to samples from depth and intensity images.
                         Pedestrians (labelled)   Pedestrians (jittered)   Non-Pedestrians
Training Set (2 parts)   10998                    43992                    43046
Test Set (1 part)         5499                    21996                    21523
Total                    16497                    65988                    64569
Non-pedestrian samples resulted from a pedestrian shape detection step with a relaxed threshold setting, i.e. containing a bias towards more difficult patterns. HOG features were extracted from those samples using 8 × 8 pixel cells, accumulated to 16 × 16 pixel blocks, with 8 gradient orientation bins, see [1]. LRF features (in 24 branches, see [23]) were extracted at a 5 × 5 pixel scale. Identical feature/classifier parameters are used for intensity and depth. The dimension of the resulting feature spaces is 1760 for HOG/linSVM and 3312 for NN/LRF. We apply a three-fold cross-validation to our dataset: the dataset is split up into three parts of the same size, see Table 1. In each cross-validation run, two parts are used for training and the remaining part for testing. Results are visualized in terms of mean ROC curves across the three cross-validation runs.
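A minimal sketch of this evaluation protocol is given below: per fold, classifier confidences and labels are turned into an ROC curve and the curves are averaged on a common false-positive grid. The scores and labels are placeholders; nothing here reproduces the actual HOG or LRF pipelines.

```python
import numpy as np

def mean_roc_over_folds(scores_per_fold, labels_per_fold, fpr_grid=None):
    """Average ROC curves over cross-validation folds on a common FPR grid.

    scores_per_fold / labels_per_fold: one 1-D array per fold with classifier
    confidences and ground-truth labels in {0, 1}. Returns (fpr_grid, mean_tpr).
    """
    if fpr_grid is None:
        fpr_grid = np.linspace(0.0, 0.05, 101)
    tprs = []
    for s, y in zip(scores_per_fold, labels_per_fold):
        order = np.argsort(-np.asarray(s))          # sweep the decision threshold
        y = np.asarray(y)[order]
        tpr = np.cumsum(y) / max(y.sum(), 1)
        fpr = np.cumsum(1 - y) / max((1 - y).sum(), 1)
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    return fpr_grid, np.mean(tprs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    folds_s = [rng.normal(size=200) + np.repeat([1.0, 0.0], 100) for _ in range(3)]
    folds_y = [np.repeat([1, 0], 100) for _ in range(3)]
    print(mean_roc_over_folds(folds_s, folds_y)[1][:5])
```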
5.2 Experimental Results
In our first experiment, we evaluate the performance of classifiers for depth and intensity separately, as well as using different fusion strategies. Results are given in Fig. 5(a-b) for the HOG/linSVM and NN/LRF classifiers, respectively. The performance of features derived from intensity images (black ◦) is better than that of depth features (red +), irrespective of the actual feature/classifier approach. Furthermore, all fusion strategies between depth and intensity clearly improve performance (Fig. 5(a-b), solid lines). For both HOG/linSVM and NN/LRF, the sum rule performs better than the product rule, which in turn outperforms the maximum rule. However, performance differences among the fusion rules are rather small; only for NN/LRF does the maximum rule perform significantly worse. By design, maximum selection is more susceptible to noise and outliers. Using a non-linear RBF SVM as a fusion classifier does not improve performance over fusion by the sum rule, but is far more computationally expensive. Hence, we only employ the sum rule for fusion in our further experiments. Comparing absolute performances, our experiments show that fusion of depth and intensity can reduce false positives over intensity-only features at a constant detection rate by approximately a factor of two for HOG/linSVM and a factor of four for NN/LRF: at a detection rate of 90%, the false positive rates for HOG/linSVM (NN/LRF) amount to 1.44% (2.01%) for intensity, 8.92% (5.60%) for depth and 0.77% (0.43%) for sum-based fusion of depth and intensity. This clearly shows that the different strengths of depth and intensity can indeed be exploited, see Sect. 3. An analysis of the correlation between the classifier outputs for depth and intensity confirms this.
Fig. 5. Pedestrian classification performance (ROC curves, detection rate vs. false positive rate) using spatial depth and intensity features. (a) HOG/linSVM classifier (HOG Depth, HOG Intensity, HOG Fusion Sum/Max/SVM/Prod), (b) NN/LRF classifier (corresponding variants), (c) best performing classifiers (NN/LRF Depth, HOG Intensity, HOG Int. + LRF Depth Joint Space SVM, HOG Int. + LRF Depth Fusion Sum) and joint feature space with 1-σ error bars
For HOG/linSVM (NN/LRF), the correlation coefficient between the depth and intensity outputs is 0.1068 (0.1072). For comparison, the correlation coefficient between HOG/linSVM and NN/LRF on intensity images is 0.3096. In our third experiment, we fuse the best performing feature/classifier combination for each modality, i.e. HOG/linSVM for intensity images (black ◦) and NN/LRF for depth images (red +), see Fig. 5(c). The result of fusion using the sum rule (blue *) outperforms all previously considered variants. More specifically, we achieve a false positive rate of 0.35% (at 90% detection rate), which is a reduction by a factor of four compared to the state-of-the-art HOG/linSVM classifier on intensity images (black ◦; 1.44% false positive rate). We additionally visualize 1-σ error bars computed from the different cross-validation runs. The non-overlapping error bars of the various system variants underline the statistical significance of our results. We further compare the proposed high-level fusion (Fig. 5(c), blue *) with low-level fusion (Fig. 5(c), magenta Δ). For this, we construct a joint feature space combining HOG features for intensity and LRF features for depth (normalized to [−1; +1] per dimension). A linear SVM is trained in the joint space to discriminate between pedestrians and non-pedestrians. A non-linear SVM was computationally not feasible, given the increased dimension of the joint feature space (5072) and our large datasets. The results show that low-level fusion using a joint feature space is outperformed by the proposed high-level classifier fusion, presumably because of the higher dimensionality of the joint space.
6 Conclusion
This paper presented a novel framework for pedestrian classification which involves high-level fusion of spatial features derived from dense stereo and intensity images. Our combined depth/intensity approach reduces false positives by a factor of four compared to the state-of-the-art intensity-only HOG/linSVM classifier. The proposed classifier-level fusion of depth and intensity also outperforms a low-level fusion approach using a joint feature space.
References
1. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
2. Duin, R.P.W., Tax, D.M.J.: Experiments with classifier combining rules. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 16–29. Springer, Heidelberg (2000)
3. Enzweiler, M., Gavrila, D.M.: Monocular pedestrian detection: Survey and experiments. In: IEEE PAMI, October 17, 2008. IEEE Computer Society Digital Library (2008), http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.260
4. Enzweiler, M., Kanter, P., Gavrila, D.M.: Monocular pedestrian recognition using motion parallax. In: IEEE IV Symp., pp. 792–797 (2008)
5. Ess, A., Leibe, B., van Gool, L.: Depth and appearance for mobile scene analysis. In: Proc. ICCV (2007)
6. Franke, U., Gehrig, S.K., Badino, H., Rabe, C.: Towards optimal stereo analysis of image sequences. In: Sommer, G., Klette, R. (eds.) RobVis 2008. LNCS, vol. 4931, pp. 43–58. Springer, Heidelberg (2008)
7. Gandhi, T., Trivedi, M.M.: Image based estimation of pedestrian orientation for improving path prediction. In: IEEE IV Symp., pp. 506–511 (2008)
8. Gavrila, D.M.: A Bayesian, exemplar-based approach to hierarchical shape matching. IEEE PAMI 29(8), 1408–1421 (2007)
9. Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV 73(1), 41–59 (2007)
10. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE PAMI 22(1), 4–37 (2000)
11. Leibe, B., et al.: Dynamic 3d scene analysis from a moving vehicle. In: Proc. CVPR (2007)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
13. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
14. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE PAMI 23(4), 349–361 (2001)
15. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38, 15–33 (2000)
16. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, pp. 61–74 (1999)
17. Rapus, M., et al.: Pedestrian recognition using combined low-resolution depth and intensity images. In: IEEE IV Symp., pp. 632–636 (2008)
18. Seemann, E., Fritz, M., Schiele, B.: Towards robust pedestrian detection in crowded image sequences. In: Proc. CVPR (2007)
19. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: Proc. CVPR (2007)
20. Van der Mark, W., Gavrila, D.M.: Real-time dense stereo for intelligent vehicles. IEEE PAMI 7(1), 38–50 (2006)
21. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
22. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appearance. IJCV 63(2), 153–161 (2005)
23. Wöhler, C., Anlauf, J.K.: A time delay neural network algorithm for estimating image-pattern shape and motion. IVC 17, 281–294 (1999)
24. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. IJCV 75(2), 247 (2007)
25. Zhang, L., Wu, B., Nevatia, R.: Detection and tracking of multiple humans with extensive pose articulation. In: Proc. ICCV (2007)
26. Zhu, Q., et al.: Fast human detection using a cascade of histograms of oriented gradients. In: CVPR, pp. 1491–1498 (2006)
Fast and Accurate 3D Edge Detection for Surface Reconstruction
Christian Bähnisch, Peer Stelldinger, and Ullrich Köthe
University of Hamburg, 22527 Hamburg, Germany
University of Heidelberg, 69115 Heidelberg, Germany
Abstract. Although edge detection is a well investigated topic, 3D edge detectors mostly lack either accuracy or speed. We show how to build a highly accurate subvoxel edge detector which is fast enough for practical applications. In contrast to other approaches we use spline interpolation in order to obtain an efficient approximation of the theoretically ideal sinc interpolator. We give theoretical bounds for the accuracy and show experimentally that our approach reaches these bounds, while the often-used subpixel-accurate parabola fit leads to much larger edge displacements.
1 Introduction
Edge detection is generally seen as an important part of image analysis and computer vision. As a fundamental step in early vision it provides the basis for subsequent high-level processing such as object recognition and image segmentation. Depending on the concrete analysis task, the accurate detection of edges can be very important. For example, the estimation of geometric and differential properties of reconstructed object boundaries such as perimeter, volume, curvature or even higher-order properties requires particularly accurate edge localization algorithms. However, especially in the 3D image domain, performance and storage considerations become pressing. While edge detection in 3D is essentially the same as in 2D, the trade-off between computational efficiency and geometric accuracy makes the design of usable 3D edge detectors very difficult. In this work we propose a 3D edge detection algorithm which provides accurate edges with subvoxel precision while being computationally efficient. The paper is organized as follows: First, we give an overview of previous work. Then we describe our new approach for edge detection, followed by a theoretical analysis of the edge location errors of an ideal subvoxel edge detector under the influence of noise. This analysis is based on the same optimality constraints as used in the Canny edge detector. Finally, we show experimentally that (in contrast to a parabola fit method) our algorithm is a good approximation of an ideal edge detector, and that the computational costs are negligible.
2 Previous Work
Since edge detection is a fundamental operation in image analysis, there exists a vast number of different approaches. Nevertheless, any new edge detection algorithm must compete with the groundbreaking algorithm proposed by Canny [2]. According to his definition, an edge is detected as a local maximum of the gradient magnitude along the gradient direction. This idea proved to be advantageous over other approaches, mostly because of its theoretical justification and the restriction to first derivatives, which makes it more robust against noise. The Canny edge detection algorithm has been extended to 3D in [10] by using recursive filters. However, both methods return only edge points with pixel/voxel accuracy, i.e. certain pixels (respectively voxels in 3D) are marked as edge points. Since the discrete image is generally a sampled version of a continuous image of the real world, attempts have been made to locate the edge points with higher accuracy. The edge points are then called edgels (i.e. edge elements, in analogy to pixels as picture elements and voxels as volume elements). Since a 2D edge separates two regions from each other, the analogue in 3D is a surface. Thus 3D edge points are also called surfels. One often cited example of a subpixel-precise edge detection algorithm is given in [3], where 2D edgels are detected as maxima of a local parabola fit to the neighboring pixels. The disadvantage of this parabola fit approach is that the different local fits do not stitch together into a continuous image. The same is true for the approaches presented in [12]. A different method for subvoxel-precise edge detection based on local moment analysis is given in [8], but it simply oversamples the moment functions, so a discretization error remains, only on a finer grid. An interpolation approach with higher accuracy has been proposed for 2D in [7,13,14]. Here, the continuous image is defined by a computationally efficient spline interpolation based on the discrete samples. With increasing order of the spline, this approximates the signal-theoretically optimal sinc interpolator; thus, in the case of sufficiently bandlimited images, the approximation error converges to zero. An efficient implementation of the spline interpolation can be found in the VIGRA image processing library [6].
3 The 3D Edge Detection Algorithm
In this section we first introduce the concepts and mathematical notions needed to give the term "3D edge" an exact meaning. A discussion of our algorithm to actually detect such edges follows.

3.1 Volume Function and 3D Edge Model
In the following, our mathematical model for a 3D image is the scalar-valued volume function with shape (w, h, d)ᵀ ∈ ℕ³,

f : w̄ × h̄ × d̄ → D,  with  n̄ = {0, ..., n − 1}

and an appropriately selected domain D, e.g. D = 255̄. The gradient of f at position p is defined as ∇f := ∇g_σ ∗ f, with ∇g_σ denoting the vector of spatial derivatives of the Gaussian g_σ(p) := (1/(√(2π) σ)) exp(−‖p‖² / (2σ²)) at scale σ. Note that the gradient can be efficiently computed using the separability of the Gaussian. The gradient of the volume function is the basis for our 3D edge model. The boundary indicator b := ‖∇f‖ expresses strong evidence for a boundary at a certain volume position in terms of a high scalar value. Adapting Canny's edge model [2] to 3D images, we define surface elements (surfels) as maxima of the boundary indicator b along the gradient direction ∇f of the 3D image function. Our detection algorithm for these maxima can be divided into two phases: a voxel-precise surfel detection phase and a refinement phase which improves the localization of the surfels to sub-voxel precision.
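A minimal sketch of the gradient and boundary indicator computation with separable Gaussian derivative filters is given below; the use of scipy.ndimage and the chosen σ are assumptions made for illustration (the paper's own implementation builds on the VIGRA library [6]).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def boundary_indicator(volume, sigma=1.0):
    """Compute the gradient field of f = g_sigma * f0 and the boundary indicator b = ||grad f||.

    Each gradient component uses a first-order Gaussian derivative along one axis
    (order=1) and plain Gaussian smoothing along the other two (order=0).
    """
    volume = np.asarray(volume, dtype=np.float64)
    grad = np.stack([
        gaussian_filter(volume, sigma,
                        order=[1 if a == axis else 0 for a in range(3)])
        for axis in range(3)
    ])
    b = np.sqrt((grad ** 2).sum(axis=0))
    return grad, b
```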
3.2 Phase 1: Voxel-Based Edge Detection
The detection of the voxel-precise surfels is basically an adaptation of the classical Canny edge detection algorithm [2]. However, we do not perform the commonly used edge tracing step with hysteresis thresholding, as this step becomes especially complicated in the 3D image domain. A second reason is that this is the most time consuming part of the Canny edge detection algorithm. Additionally, although non-maxima suppression is performed, the classical Canny edge detection algorithm can lead to edges several pixels wide, which is only alleviated by hysteresis. Therefore, we propose to use only one threshold (corresponding to the lower of the hysteresis thresholds) in combination with a fast morphological thinning with priorities, which ensures one-voxel-thick surfaces while preserving the topology of the detected voxel set. This also reduces the number of initial edges for subvoxel refinement, resulting in a significant speedup of the remaining algorithm. Thus, the steps of the first phase are:

1. Compute the gradient ∇f and the boundary indicator function b.
2. Compute a binary volume function a : w̄ × h̄ × d̄ → {0, 1}, marking surfels with "1" and background voxels with "0", by thresholding b with t and by using non-maximum suppression along the gradient direction:

a(p) := 1 if b(p) > t ∧ b(p) > b(p ± d), else 0,

with d being a vector such that p + d is the grid point of the nearest neighbouring voxel in the direction of ∇f, i.e.

d = (⌊u + 1/2⌋, ⌊v + 1/2⌋, ⌊w + 1/2⌋)ᵀ  and  (u, v, w)ᵀ = (1 / (2 sin(π/8))) · ∇f(p) / ‖∇f(p)‖,

where ⌊·⌋ is the floor operation.
3. Perform topology-preserving thinning with priorities on a.

For the last step of our algorithm we use a modified version of the 3D morphological thinning algorithm described in [5]: thinning is not performed in scan-line order of the volume; instead, surfels in a with a small boundary indicator value are preferred for removal. The outcome of the thinning step is then a one-voxel-
thin set of surfels such that its number of connected components is unchanged. A detailed description of thinning with priorities can be found in [7].
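A sketch of the thresholding and non-maximum suppression of step 2 is given below; the offset quantization follows the 1/(2 sin(π/8)) rule above, while the threshold t and the border handling are illustrative assumptions, and the priority-driven thinning of step 3 is omitted.

```python
import numpy as np

def voxel_surfels(grad, b, t):
    """Step 2: threshold b and suppress non-maxima along the gradient direction.

    grad: array (3, w, h, d) of gradient components, b: gradient magnitude,
    t: lower threshold on b. Returns a binary volume a(p).
    """
    w, h, d = b.shape
    norm = np.maximum(b, 1e-12)
    # offset d to the nearest neighbouring voxel in gradient direction
    off = np.floor(grad / norm / (2.0 * np.sin(np.pi / 8.0)) + 0.5).astype(int)
    a = np.zeros_like(b, dtype=bool)
    xs, ys, zs = np.nonzero(b > t)
    for x, y, z in zip(xs, ys, zs):
        dx, dy, dz = off[:, x, y, z]
        xp, yp, zp = x + dx, y + dy, z + dz
        xm, ym, zm = x - dx, y - dy, z - dz
        if not (0 <= xp < w and 0 <= yp < h and 0 <= zp < d):
            continue
        if not (0 <= xm < w and 0 <= ym < h and 0 <= zm < d):
            continue
        if b[x, y, z] > b[xp, yp, zp] and b[x, y, z] > b[xm, ym, zm]:
            a[x, y, z] = True
    return a
```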
3.3 Phase 2: Subpixel Refinement
The first phase of our algorithm yields surface points located at voxel grid positions. In the second phase, the localization accuracy of these points is improved to sub-voxel precision by means of iterative optimization methods based on line searches. Therefore, a continuous version of the boundary indicator is needed, i.e. it has to be evaluable at arbitrary sub-voxel positions. We use B-spline based interpolators for this purpose. They are especially suited as they provide an optimal trade-off between interpolation quality and computational costs: with increasing order n, spline interpolators converge to the ideal sinc interpolator and already have very good approximation quality for small values of n. While the computational burden also grows with n, they are still efficiently implementable for the order n = 5 used in our experiments. The continuous boundary indicator can be defined via a discrete convolution with the recursively defined B-spline basis functions β_n of order n:

b(p) := Σ_{i,j,k} c_ijk β_n(i − x) β_n(j − y) β_n(k − z)
with

β_n(x) := (1/n) [ ((n+1)/2 + x) β_{n−1}(x + 1/2) + ((n+1)/2 − x) β_{n−1}(x − 1/2) ]

and β_0 the centered box function, i.e. β_0(x) := 1 for |x| ≤ 1/2 and 0 otherwise. The coefficients c_ijk can be efficiently computed from the discrete version of the boundary indicator by recursive linear filters. Note that there is one coefficient for each voxel and that they have to be computed only once per volume. The overall algorithmic complexity for this is linear in the number of voxels with a small constant factor. More details on the corresponding theory and the actual implementation can be found in [7,13,14]. A B-spline interpolated boundary indicator also has the advantage of being (n − 1)-times continuously differentiable. Its derivatives can also be efficiently computed at arbitrary sub-voxel positions, which is very important for optimization methods that rely on gradient information. We can now work on the continuous boundary indicator to get sub-voxel accurate surfels. As we are adapting Canny's edge model to 3D images, we shift the already detected surfels along the gradient direction of the 3D image function such that they are located at maxima of the boundary indicator. This can be formulated as a constrained line search optimization problem, i.e. we search for the maximizing parameter α of the one-dimensional function φ(α) := b(p + α · d) with the constraint α ∈ (α_min, α_max) and with d being a unit-length vector at position p, collinear with ∇f, such that b increases in its direction, i.e. d := sgn(∇fᵀ∇b) · ∇f / ‖∇f‖. The interval constraint on α can be fixed for every surfel or computed dynamically, e.g. with bracketing (see e.g. [15]), which we use here. Maximizing φ can then be done via standard line search algorithms like the algorithm of Brent [1]
or the algorithm of Moré and Thuente [11]. Here we use the modification of Brent's algorithm presented in [15], which takes advantage of the available gradient information. In order to achieve even higher accuracy, the line searches defined by φ can be iterated several times. Any line-search-based optimization algorithm should be suitable for this. Here, we choose the common conjugate gradient method (see e.g. [15]) and compare its accuracy improvement to the single line search approach in sec. 5.
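The refinement of a single surfel can be sketched as follows: the boundary indicator is interpolated with a fifth-order spline (scipy's map_coordinates on prefiltered coefficients approximates the B-spline evaluation) and φ(α) = b(p + α·d) is maximized over a small bracket with a bounded scalar optimizer. The library choice, the fixed bracket and the use of a bounded Brent-type method are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import spline_filter, map_coordinates
from scipy.optimize import minimize_scalar

def refine_surfel(b, p, direction, alpha_max=1.0, order=5):
    """Shift a voxel-precise surfel p along `direction` to the sub-voxel
    maximum of the spline-interpolated boundary indicator b."""
    # B-spline coefficients c_ijk; in practice compute these once per volume
    coeffs = spline_filter(b, order=order)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)

    def phi(alpha):
        q = np.asarray(p, dtype=float) + alpha * direction
        val = map_coordinates(coeffs, q.reshape(3, 1), order=order,
                              prefilter=False, mode="nearest")
        return -float(val[0])                 # minimize the negative of b

    res = minimize_scalar(phi, bounds=(-alpha_max, alpha_max), method="bounded")
    return np.asarray(p, dtype=float) + res.x * direction
```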
4 Theoretical Analysis: Accuracy
In order to justify the localization accuracy of our algorithm we perform experiments on synthetic test volumes based on simple 3D edge models for which the true surfel positions are known from theory. For this it is necessary to carefully model and implement the corresponding image acquisition process, which we define as a convolution of a continuous volume with a 3D isotropic Gaussian point spread function (PSF) of scale σ_PSF, followed by sampling and quantization, with possible addition of white Gaussian noise. The volume is modeled via a binary volume function f₀ : ℝ³ → {0, S} with S ∈ ℝ such that the support f₀⁻¹(S) is either an open half space, a ball, or a cylinder, i.e. its surface corresponds to a planar, spherical or cylindrical shell. We investigate these three types of functions, since they allow us to estimate the localization accuracy for every possible case of a 3D surface. For example, if a surface is hyperbolic at some point p with principal curvatures κ₁ > 0, κ₂ < 0, then the localization errors should be bounded by the errors of two opposing cylinders having curvatures κ₁ and κ₂, respectively. The function f₀ is blurred by convolution with a Gaussian before sampling. The resulting function f = f₀ ∗ g_{σ_PSF} defines the ground truth for edge detection. More precisely, for a planar surface with unit normal vector n and distance s ∈ ℝ from the origin, the corresponding volume function reads
f_plane(p) := S · Φ_{σ_PSF}(pᵀn + s)  with  Φ_σ(x) := (1/2) (1 + erf(x / (√2 σ))).

Maxima of the gradient magnitude of f_plane then occur exactly at positions p with pᵀn + s = 0. For a ball B_R with radius R, a closed-form solution of the convolution integral with a Gaussian can be derived by exploiting the rotational symmetry of both functions and the separability of the Gaussian:

f_sphere(p) := S ∫_{B_R} g_σ(x − p) dx
            = (S/2) [ erf((R − r)/(√2 σ)) + erf((R + r)/(√2 σ)) ] − (S σ / (√(2π) r)) (e^{2rR/σ²} − 1) e^{−(R+r)²/(2σ²)},

with r = ‖p‖. The gradient magnitude of the blurred sphere is then the derivative of f_sphere with respect to r:

‖∇f(p)‖ = (S / (√(2π) r² σ)) e^{−(R+r)²/(2σ²)} ( σ² + rR + e^{2rR/σ²} (rR − σ²) )    (1)
Fig. 1. Normalized bias of a blurred sphere and cylinder with radius R and scaling σ of the PSF. Approximating functions given by (2) and (3) are indicated with dotted lines.
As there is no closed-form solution for the maxima r₀ of (1), fig. 1 shows numeric results. It plots the normalized displacement (r₀ − R)/σ against the ratio σ/R, in order to apply to arbitrary scales and radii. In practice, the most interesting part is the interval 0 ≤ σ/R ≤ 0.5, since otherwise the image content is too small to be reliably detectable after blurring. In this interval an approximation with error below 3 · 10⁻⁴ is given by

(r₀ − R)/σ = (0.04 − 145.5 (σ/R) + 345.8 (σ/R)² − 234.8 (σ/R)³) / (142.6 − 308.5 (σ/R) + 59.11 (σ/R)² + 327.7 (σ/R)³)    (2)
In the case of a cylinder of radius R, a closed-form solution exists neither for the convolution integral nor for the position of the maxima of its gradient magnitude (but it does for the gradient magnitude itself). This case is mathematically identical to the 2D case of a disc blurred by a Gaussian, which has been analyzed in detail in [9]. An approximating formula for the relative displacement with error below 3 · 10⁻⁴ is given in [7]:

(r₀ − R)/σ = 0.52 √(0.122 + (σ/R)²) − 0.476 (σ/R) − 0.255    (3)
4.1 Noisy 3D Images
Canny's noise model is based on the assumption that the surface is a step which has been convolved with both the PSF and the edge detection filters. Therefore, the total scale of the smoothed surface is σ_edge² = σ_PSF² + σ_filter². The second directional derivative in the surface's normal direction equals the first derivative of a Gaussian (and it is constant in the tangential plane of the surface). Near the true surface position, this derivative can be approximated by its first-order Taylor expansion:

s_xx(x) ≈ s_xxx(x = 0) · x = − S · x / (√(2π) σ_edge³),
where S is the step height. The observed surface profile equals the true profile plus noise. The noise is only filtered with the edge detection filter, not with the PSF. The observed second derivative is the sum of the above Taylor formula and the second derivative of the smoothed noise, f_xx(x) ≈ s_xxx|_{x=0} · x + n_xx(x). Solving for the standard deviation of x at the zero crossing f_xx(x) = 0 gives

StdDev[x] = √(Var[n_xx]) / |s_xxx(x = 0)|.

According to Parseval's theorem, the variance of the second directional derivative of the noise can be computed in the Fourier domain as

Var[n_xx] = N² ∫∫∫_{−∞}^{∞} ( 4π² u² G(u) G(v) G(w) )² du dv dw = 3 N² / (32 π^{3/2} σ⁷),

where G(·) is the Fourier transform of a Gaussian at scale σ_filter, and N² is the variance of the noise before filtering. Inserting, we get the expected localization error as

StdDev[x] = (N/S) · √3 (σ_PSF² + σ_filter²)^{3/2} / (4 π^{1/4} σ_filter^{7/2}),    (4)
where N/S is the inverse signal-to-noise ratio. In contrast to 2D edge detection, this error goes to zero as the filter size approaches infinity:

lim_{σ_filter → ∞} StdDev[x] = lim_{σ_filter → ∞} √3 N / (4 π^{1/4} S √σ_filter) = 0.

However, this limit only applies to perfectly planar surfaces. In the case of curved surfaces, enlarging the edge detection filter leads to a bias, as shown above, and there is a trade-off between noise reduction and bias.
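Equation (4) is easy to evaluate numerically when studying this trade-off; the helper below implements it directly, with the parameter values in the example chosen purely for illustration.

```python
import numpy as np

def predicted_stddev(noise_to_signal, sigma_psf, sigma_filter):
    """Expected localization error StdDev[x] for a planar surface, Eq. (4)."""
    sigma_edge_sq = sigma_psf ** 2 + sigma_filter ** 2
    return (noise_to_signal * np.sqrt(3.0) * sigma_edge_sq ** 1.5
            / (4.0 * np.pi ** 0.25 * sigma_filter ** 3.5))

if __name__ == "__main__":
    # N/S = 1/20 (SNR = 20), sigma_PSF = 0.9, a range of filter scales
    for sf in (0.5, 1.0, 2.0, 4.0):
        print(sf, predicted_stddev(1.0 / 20.0, 0.9, sf))
```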
5
Experiments: Accuracy and Speed
In this section we present the results of experiments which confirm our claims about the accuracy and speed characteristics of our algorithm. We start with artificial volume data generated by sampling the simple continuous volume functions given above. For the cylindrical cases we used 8-fold oversampled binary discrete volumes with subsequent numeric convolution and downsampling. In the following we always use spline interpolation of order five for the line search and conjugate gradient based algorithms. Accuracy results for test volumes generated from f_plane are shown in fig. 2. As model parameters we used unit step height, σ_PSF = 0.9 and σ_filter = 1, and various values for the sub-voxel shift s and the plane normal n. For the directions of n we evenly distributed fifty points on the hemisphere located at the origin. In fig. 2a results for noise-free volumes are shown. As one can see, both the line search and the conjugate gradient based methods possess very high accuracy and are several orders of magnitude better than the parabolic fit. In the presence of noise the accuracy of our method is still almost one order of magnitude better than the parabolic fit for a rather bad signal-to-noise ratio of SNR = 20, see fig. 2b.
Fig. 2. Comparison of sub-voxel accuracy of the three algorithms (parabolic fit, line search, conjugate gradient) on sampled instances of f_plane with σ_PSF = 0.9, σ_filter = 1.0, s ∈ {0.1, 0.25, 0.5, 0.75}, using fifty evenly distributed points on the hemisphere for n. (a) no noise and (b) SNR = 20, σ_filter = 2.0: localization error [voxel] vs. ground truth angle ϕ(t) = (α, β) [rad]; (c) measured mean standard deviation vs. predicted standard deviation according to eq. 4, computed for ten evenly distributed signal-to-noise ratios SNR ∈ [10, 100] (note that the scaling of the y-axis is logarithmic)
Fig. 3. Comparison of predicted and measured localization bias (predicted vs. measured dislocation [voxel]) for spherical (left) and cylindrical (right) surfaces using R = 5, σ_PSF = 0.9 with SNR = 10 for six evenly distributed filter scales σ_filter ∈ [0.2, 0.4]. Values have been averaged over 10 instances with different sub-voxel shift.
Finally, fig. 2c shows that the estimated standard deviation matches the prediction from theory very well. In fig. 3 we compare the predicted localization bias for spherical and cylindrical surfaces according to eq. 2 and eq. 3, respectively. Test volumes have been generated from eq. 1 for spheres and using oversampling as described above for
cylinders. As model parameters we used R = 5, σ_PSF = 0.9 and various values for σ_filter, with addition of Gaussian noise such that SNR = 10. For each set of model parameters, radii have then been estimated from 10 instances with the same model parameters but different sub-voxel shifts. From these figures we conclude that our algorithm correctly reproduces the localization bias and prevails over the parabolic fit, which exhibits a systematic error. For performance comparison, we measured execution time on a Linux PC with a Pentium D 3.4 GHz processor and 2 GB of RAM for test volumes with shape (200, 200, 200)ᵀ and two real CT volumes. Results are given in Table 1. As one can see, the line search based method is only ≈ 35% slower than the parabolic fit, and the conjugate gradient based method only ≈ 50% to ≈ 90% slower.

Table 1. Performance results for various test volumes and real CT volumes. The columns in the middle give run-times in seconds.

volume     shape               p. fit   l. search   cg      n. surfels
plane      (200, 200, 200)ᵀ     9.24    11.63       14.13   ≈ 39500
sphere     (200, 200, 200)ᵀ     9.76    14.03       18.80   ≈ 48000
cylinder   (200, 200, 200)ᵀ    10.91    15.51       20.74   ≈ 75200
lobster    (301, 324, 56)ᵀ      6.22     8.24       10.19   21571
foot       (256, 256, 207)ᵀ    20.01    26.74       34.00   74411
Fig. 4. Surface reconstructions for test-volumes and real CT-volumes using α-shapes [4] (α = 1) with SNR = 10 for the test-volumes
6 Conclusions
Based on the well-known Canny edge detector, we presented a new algorithm for subvoxel-precise 3D edge detection. The accuracy of our method is much better than the accuracy of a subvoxel refinement based on a parabola fit. Due to an efficient implementation of the spline interpolation and due to the use of fast voxel-accurate computations wherever possible, our algorithm is still computationally efficient. In order to justify the accuracy, we theoretically analyzed the measurement errors of an ideal Canny-like edge detector at infinite sampling resolution in the case of 3D planar, spherical and cylindrical surfaces. Our analysis showed that all experimental results are in full agreement with the theory, while this is not the case for the parabola fit method.
References
1. Brent, R.P.: Algorithms for Minimisation Without Derivatives. Prentice-Hall, Englewood Cliffs (1973)
2. Canny, J.: A computational approach to edge detection. TPAMI 8(6), 679–698 (1986)
3. Devernay, F.: A non-maxima suppression method for edge detection with sub-pixel accuracy. Technical Report 2724, INRIA Sophia Antipolis (1995)
4. Edelsbrunner, H., Mücke, E.P.: Three-dimensional alpha shapes. ACM Trans. Graph. 13(1), 43–72 (1994)
5. Jonker, P.P.: Skeletons in n dimensions using shape primitives. Pattern Recognition Letters 23, 677–686 (2002)
6. Köthe, U.: Vigra. Web Resource, http://hci.iwr.uni-heidelberg.de/vigra/ (visited March 1, 2009)
7. Köthe, U.: Reliable Low-Level Image Analysis. Habilitation thesis, University of Hamburg, Germany (2008)
8. Luo, L., Hamitouche, C., Dillenseger, J., Coatrieux, J.: A moment-based three-dimensional edge operator. IEEE Trans. Biomed. 40(7), 693–703 (1993)
9. Mendonça, P.R.S., Padfield, D.R., Miller, J., Turek, M.: Bias in the localization of curved edges. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 554–565. Springer, Heidelberg (2004)
10. Monga, O., Deriche, R., Rocchisani, J.: 3d edge detection using recursive filtering: application to scanner images. CVGIP: Image Underst. 53(1), 76–87 (1991)
11. Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Software 20, 286–307 (1994)
12. Udupa, J.K., Hung, H.M., Chuang, K.S.: Surface and volume rendering in three dimensional imaging: A comparison. J. Digital Imaging 4, 159–169 (1991)
13. Unser, M., Aldroubi, A., Eden, M.: B-Spline signal processing: Part I—Theory. IEEE Trans. Signal Process. 41(2), 821–833 (1993)
14. Unser, M., Aldroubi, A., Eden, M.: B-Spline signal processing: Part II—Efficient design and applications. IEEE Trans. Signal Process. 41(2), 834–848 (1993)
15. Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, Cambridge (2002)
Boosting Shift-Invariant Features
Thomas Hörnlein and Bernd Jähne
Heidelberg Collaboratory for Image Processing, University of Heidelberg, 69115 Heidelberg, Germany
Abstract. This work presents a novel method for training shift-invariant features using a Boosting framework. Features performing local convolutions followed by subsampling are used to achieve shift-invariance. Other systems using this type of feature, e.g. Convolutional Neural Networks, use complex feed-forward networks with multiple layers. In contrast, the proposed system adds features one at a time using smoothing spline base classifiers. Feature training optimizes the base classifier costs. Boosting sample re-weighting ensures that features are both descriptive and independent. Our system has a lower number of design parameters than comparable systems, so adapting it to new problems is simple. Also, the stage-wise training makes it very scalable. Experimental results show the competitiveness of our approach.
1 Introduction
This work deals with shift-invariant features performing convolutions followed by subsampling. Most systems using this type of feature (e.g. Convolutional Neural Networks [1] or biologically motivated hierarchical networks [2]) are very complex. We propose to use a Boosting framework to build a linear ensemble of shift-invariant features. Boosting ensures that the trained features are both descriptive and independent. The simple structure of the presented approach leads to a significant reduction of design parameters in comparison to other systems using convolutional shift-invariant features. At the same time, the presented system achieves state-of-the-art performance for the classification of handwritten digits and car side-views. The presented system builds a classification rule for an image classification problem, given in the form of a collection of N training samples {x_i, y_i}, i = 1, ..., N, where x is a vector of pixel values in an image region and y is the class label of the respective sample¹. The depicted objects are assumed to be fairly well aligned with respect to position and scale. However, in most cases the depicted objects of one class will exhibit some degree of variability due to imperfect localization or intra-class variability. In order to achieve good classification performance, this variability needs to be taken into account. One way to approach the problem is by using shift-invariant features (Sect. 2), namely features performing local convolution and subsampling.
¹ For simplicity we assume binary classification tasks (y ∈ {−1, 1}) throughout the paper. Extension to multi-class problems is straightforward using a scheme similar to AdaBoost.MH [3].
To avoid the complexity of the hierarchical networks commonly used with this type of feature, a Boosting scheme is used for feature generation (Sect. 3). In order to illustrate the effectiveness of our approach, a set of experiments on the USPS database (handwritten digit recognition, Sect. 4.1) and the UIUC car side-view database (Sect. 4.2) is conducted. The achieved performance compares well to state-of-the-art algorithms.
2 Shift-Invariant Features for Image Classification
The distribution of samples in feature space is influenced by discriminative and non-discriminative variability. While discriminative variability is essential for classification, non-discriminative variability should not influence the results. It is, however, hard for training systems to learn to distinguish the two cases, and usually high numbers of training samples are needed to do so. Therefore, prior knowledge is commonly used to design features suppressing non-discriminative variability while preserving discriminative information. Using such features can significantly simplify the training problem. While the global appearance of objects in one class is subject to strong variations, discriminative and stable local image structures exist - for example the appearance of wheels for the classification of vehicles. The relative image positions of these features may change due to changes of the point of view or deformations of the objects, but their appearance is relatively stable. Therefore the images of objects can be represented as a collection of local image features, where the exact location of each feature is unknown. Different approaches to handling location uncertainty exist, ranging from completely ignoring position information (e.g. bag of features) to the construction of complex hierarchies of object parts (e.g. [4]). In this work a model is used that searches for features in a part of the image described by p = [c₀, r₀, w, h], where c₀, r₀ describes the position and w, h the width and height of the region, respectively. We define the operator P(x, p) extracting patches of geometry p from feature vector x. To extract discriminative information, local-convolution features are used:

f(x) = sub(P(x, p) ∗ K),    (1)

where K is the convolution kernel². The subsampling operation sub(·) makes the result invariant to small shifts. For the experiments reported in Sect. 4, the subsampling operation sub(·) returns the maximum absolute value of the filter response³. Local convolutional features are mainly used in multi-layer feed-forward networks. The kernel matrices may be either fixed or tuned in training.
² The convolution is only performed for the range in which the kernel has full overlap with the patch.
³ Note that this subsampling operator is non-differentiable. For the backpropagation training used in Sect. 3.2, a differentiable approximation needs to be used.
Examples for the use of fixed kernels are the biologically motivated systems in [2] and [5]. An advantage of using fixed weights is the lower number of parameters to be adjusted in training. On the other hand, prior knowledge is necessary to select good kernels for a given classification problem⁴. Examples of systems using trained kernels are the unsupervised system in [1] and the supervised system in [6]. The advantage of training kernels is the ability of the system to adjust to the problem at hand and thus find compact representations. The hierarchical networks used with local convolution features are able to construct complex features by combining basic local convolution features. The cost for this flexibility is the high number of design parameters to be set. In order to provide a simple scheme for using local convolution features, a single-layer system is proposed in the next section.
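As a minimal sketch of one local convolution feature f(x) = sub(P(x, p) ∗ K): extract the patch at geometry p, filter it with kernel K in "valid" mode (full kernel overlap, cf. the footnote above) and pool with the maximum absolute response. The use of scipy's 2-D correlation (equivalent to convolution up to a kernel flip, which is irrelevant for a learned kernel) and the toy values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def local_convolution_feature(image, p, K):
    """f(x) = sub(P(x, p) * K) with max-abs pooling.

    image: 2-D array, p = (c0, r0, w, h) patch geometry, K: 2-D kernel.
    """
    c0, r0, w, h = p
    patch = image[r0:r0 + h, c0:c0 + w]             # P(x, p)
    response = correlate2d(patch, K, mode="valid")  # full-overlap filtering
    return np.abs(response).max()                   # shift-invariant pooling

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(36, 18))
    K = rng.normal(size=(5, 5))
    print(local_convolution_feature(img, (2, 4, 10, 12), K))
```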
3 Boosting Shift-Invariant Features
This section describes a boosting-based approach to train and use local convolution features. This avoids the complicated architecture, and thus the cumbersome design, of classifiers based on hierarchical networks. Using Boosting, we add features in a greedy stage-wise manner instead of starting with a predefined number of features which need to be trained in parallel. This makes the approach very scalable. Since one feature is trained at a time, only a small number of parameters is tuned simultaneously, simplifying the training of the kernel weights. In order to train features, differentiable base classifiers have to be used. Using gradient descent to train features bears a strong resemblance to the training of artificial neural networks (e.g. [7]). ANNs, however, use fixed transfer functions, while the approach presented here uses smooth base classifiers adapting to the class distributions. The use of adaptive transfer functions enables the ensemble to be very flexible even in the absence of hidden layers.

3.1 Boosting Smoothing Splines
Boosting is a technique to combine weak base classifiers to form a strong classification ensemble. The additive model has the form:

ŷ = sign(H(x))  with  H(x) = Σ_{t=1}^{T} α_t h_t(x).    (2)
Boosting training is based on minimizing a continuous cost function J on the given training samples {x_i, y_i}. The minimization is performed using functional gradient descent. In stage t the update h_{t+1} is calculated by performing a gradient descent step on J. The step width depends on the specific Boosting algorithm in use.
⁴ Though biologically motivated kernels like Gabor wavelets seem to give good performance on a wide range of image processing applications.
GentleBoost (GB) [3] (used in the experiments of Sect. 4) uses Gauss-Newton updates, leading to the GentleBoost update rule:

h_{t+1}(x) = E_c[c y | x] / E_c[c | x]  with  c_i = e^{−y_i Σ_{s=1}^{t} α_s h_s(x_i)},  J = Σ_{i=1}^{N} c_i  and  α_{t+1} = 1,    (3)

where E_c is the weighted expectation. The presented approach is not restricted to being used with GentleBoost but might be used with arbitrary Boosting schemes. The task of the base classifier is to select the rule h giving the lowest cost⁵. Typical choices of base classifiers are decision stumps, decision trees and histograms. We are, however, interested in base classifiers which are differentiable. Due to their cheap evaluation and simple structure we use univariate smoothing splines. A smoothing spline base classifier is represented as:

h(z) = aᵀ b(z),    (4)
where z is a scalar input, a represents the weights of the spline basis functions, and b returns the values of the spline basis functions evaluated at z. To construct a fit from scalar inputs z_i to outputs y_i, the weights a need to be calculated by solving a linear system of equations. In order to prevent overfitting, a trade-off between approximation error and complexity has to be found. We use P-Splines [8] for fitting penalized splines: a fixed, high number of equidistant support points is used and a parameter λ is tuned to adjust the amount of smoothing. P-Splines use finite differences of the spline weights a to approximate roughness. The weights a can then be calculated using

a = (B Δ_c Bᵀ + λ D Dᵀ)⁻¹ B Δ_c y,    (5)

where B = [b(z₁) ... b(z_N)]ᵀ denotes the matrix of values of the spline basis functions evaluated at z₁, ..., z_N, y = [y₁ ... y_N]ᵀ contains the sample classes, and Δ_c ∈ ℝ^{N×N} is a diagonal matrix containing the sample weights c₁, ..., c_N. The expression aᵀD calculates finite differences of a given degree⁶ on a. The roughness penalty can be chosen using cross validation.
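A sketch of the penalized, weighted spline fit behind Eq. (5), written in the standard design-matrix form (B as N × K matrix of basis values, first-degree difference penalty D), is given below. The linear "hat" B-spline basis on equidistant support points is an assumption made to keep the example short; the paper uses cubic P-splines.

```python
import numpy as np

def hat_basis(z, knots):
    """N x K matrix of linear B-spline (hat) basis functions on equidistant knots."""
    z = np.asarray(z, dtype=float)[:, None]
    step = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(z - knots[None, :]) / step, 0.0, None)

def pspline_fit(z, y, c, knots, lam):
    """Weighted penalized least-squares spline fit (cf. Eq. (5)).

    Solves (B^T W B + lam * D^T D) a = B^T W y with W = diag(c) and D the
    first-degree finite-difference matrix acting on the coefficients a.
    """
    B = hat_basis(z, knots)
    W = np.diag(c)
    K = len(knots)
    D = np.diff(np.eye(K), axis=0)          # first-degree differences
    a = np.linalg.solve(B.T @ W @ B + lam * (D.T @ D), B.T @ W @ y)
    return lambda znew: hat_basis(znew, knots) @ a

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.uniform(-1, 1, 200)
    y = np.sign(z)                          # class labels in {-1, +1}
    c = np.full(200, 1.0 / 200)             # boosting sample weights
    h = pspline_fit(z, y, c, np.linspace(-1, 1, 25), lam=1e-3)
    print(h(np.array([-0.5, 0.0, 0.5])))
```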
3.2 Training Features Using a Boosting Framework
A large group of base classifiers used with Boosting operate on one input feature at a time: h(x) = g(x^{(j)}) (component-wise base classifiers). The advantage of this approach is the simple nature and cheap evaluation of the resulting classification rules. Boosting of component-wise base classifiers can be used as a feature selection scheme, adding features to the final hypothesis one at a time.
⁵ The cost function depends on the Boosting algorithm used. GentleBoost uses the weighted squared error ε = Σ_{i=1}^{N} c_i (y_i − h(x_i))².
⁶ For classification, penalizing first-degree finite differences is a natural choice, leading to h(x) = cᵀy = const for λ → ∞.
In order to use a feature selection scheme, one needs a set of meaningful features first. However, providing such a feature set for arbitrary image classification problems is a difficult task, especially if the properties of good features are unknown. In general it would be more convenient to provide as little prior knowledge as possible and train features automatically. For Boosting feature generation - similar to Boosting feature selection - a mapping z = f(x) from ℝ^F to ℝ minimizing the weighted costs of the spline fit is sought:

f(x) ← min_{f(x)} ε_GB(h(f(x))) = min_{f(x)} Σ_{i=1}^{N} c_i (h(f(x_i)) − y_i)²,    (6)
where h(f(x)) is a weighted least squares fit to y. When using local convolution features, the kernel weights can be tuned using error backpropagation. This is similar to the training techniques used with Convolutional Neural Networks - a particularly simple scheme can be found in [9]. The complete scheme for building a classifier with local convolution features is shown in Alg. 1. Training time may be reduced, without deteriorating classification performance, by visiting only a limited number of random positions (line 5).

Algorithm 1. Boosting of local convolution features
Input. Training samples {x, y}_i, i = 1, ..., N
Input. Number of boosting rounds T
Input. Smoothing parameter λ
Input. Feature geometry
1   h₀(x) = ȳ
2   for t = 1, ..., T do
3       c_i ← e^{−y_i H(x_i)},  c_i ← c_i / (Σ_{i=1}^{N} c_i)
4       ε_min ← ∞
5       for all positions p do
6           Initialize convolution kernel K ← N(0, 1)
7           repeat
8               z_i = sub(P(x_i, p) ∗ K)
9               Fit base classifier h(z) to {z_i, y_i, c_i}
10              Calculate kernel gradient ΔK using back-prop
11              Update kernel K (e.g. using Levenberg-Marquardt)
12          until convergence or maximum number of rounds reached
13          ε ← Σ_{i=1}^{N} c_i (y_i − h(sub(P(x_i, p) ∗ K)))²
14          if ε < ε_min then
15              ε_min ← ε,  p_t ← p,  K_t ← K
16          end
17      end
18      Fit base classifier h_t(z) to {z_i, y_i, c_i},  z_i = sub(P(x_i, p_t) ∗ K_t)
19      Add h_t to ensemble
20  end
Output. Classifier: H(x) = Σ_{t=0}^{T} h_t(sub(P(x, p_t) ∗ K_t))
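A compact sketch of the outer loop of Alg. 1 under simplifying assumptions is shown below: candidate kernels are drawn at random and only scored (the back-propagation refinement of lines 7-12 is omitted), and the base classifier is a weighted linear fit instead of a smoothing spline; everything here is illustrative, not the authors' code.

```python
import numpy as np
from scipy.signal import correlate2d

def pool(image, p, K):
    """f(x) = sub(P(x, p) * K): patch extraction, valid filtering, max-abs pooling."""
    c0, r0, w, h = p
    patch = image[r0:r0 + h, c0:c0 + w]
    return np.abs(correlate2d(patch, K, mode="valid")).max()

def boost_features(images, labels, positions, T=10, n_candidates=20, seed=0):
    """Greedy stage-wise selection of local convolution features (simplified Alg. 1)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)
    N = len(images)
    H = np.zeros(N)                                   # ensemble output H(x_i)
    ensemble = []
    for t in range(T):
        c = np.exp(-labels * H)                       # boosting weights, line 3
        c /= c.sum()
        sw = np.sqrt(c)
        best = None
        for p in positions:                           # line 5
            for _ in range(n_candidates):
                K = rng.normal(size=(5, 5))           # line 6
                z = np.array([pool(x, p, K) for x in images])
                A = np.stack([z, np.ones(N)], axis=1)
                # weighted least-squares line h(z) = a*z + b as a base-classifier stub
                coef, *_ = np.linalg.lstsq(A * sw[:, None], labels * sw, rcond=None)
                err = np.sum(c * (labels - A @ coef) ** 2)   # line 13
                if best is None or err < best[0]:
                    best = (err, p, K, coef)
        _, p, K, coef = best                          # lines 14-15
        ensemble.append((p, K, coef))
        z = np.array([pool(x, p, K) for x in images])
        H += coef[0] * z + coef[1]                    # lines 18-19 (alpha_t = 1)
    return ensemble, np.sign(H)
```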
Fig. 1. Pooling feature (for handwritten digits 5 vs 8)
Combining Features. In higher layers of hierarchical networks, basic features are combined to build more complex features [1,2]. This type of feature interaction cannot be modeled by using Alg. 1 directly. We propose, rather than using hierarchical networks, to build complex features as linear combinations of local convolution features: z̃_i = vᵀ z_i, where z_i = [f₁(x_i), f₂(x_i), ...]ᵀ represents the values of all convolutional features learned so far and v contains their respective weights. While this approach may not be as powerful as using hierarchical networks, it comes at almost no extra cost. Algorithm 1 is adapted by feeding linear combinations of features into the base classifier in line 18. The weights v of the local convolution features are trained to optimize class separation. Typically, only a small number of features, say two or three, need to be combined - depending on the problem at hand. In cases where the maximum number of convolutional features to be used is limited (e.g. due to computational resources), performance may be improved by adding Boosting stages using combinations of the already learned features. The calculation of local convolutional features is much more expensive than the evaluation of base classifiers, so the extra costs are negligible.
4 Experiments
In order to show the competitiveness of our approach, experiments on two well-known image classification databases are conducted. The data sets are selected to have very different properties in order to illustrate the flexibility of our approach.
4.1 USPS Handwritten Digit Recognition
The first set of experiments is performed on the USPS handwritten digit recognition corpus. The database contains grayscale images of handwritten digits, normalized to have dimensions 16 × 16 leading to an input feature vector with 256 values. The training set includes 7, 291 samples, the test set 2, 007. Human error rate on this data set is approximately 2.5% ([10]). Penalized cubic smoothing spline base-classifiers with 100 support points are used to approximate class distributions. Spline roughness penalty, as well as the size of the convolution kernel were determined using cross validation. Kernels of size 5 × 5 with a subsampling area of 5 × 5 gave best results - this means each pooling feature operates on a 9 × 9 patch. Pairs of convolutional features are combined to model feature-interactions. An ensemble of 1000 base classifiers
Table 1. Test error rates on USPS database

method                      error [%]   error ext. [%]
human performance ([12])    2.5         –
neural net (LeNet1 [13])    4.2         –
tangent distance ([12])     3.3         2.5
kernel densities ([11])     –           2.4
this work                   3.1         2.6
Fig. 2. Classification error on USPS depending on the number of boosting rounds (black: original set, red: extended set). Note that features were trained until round 500. The remaining Boosting rounds add base classifiers combining already calculated features.
was built. Features were added in rounds 1 to 500; the remaining boosting rounds combined already trained local convolutional features. Experiments using an extended set of training patterns [11] suggest that the original training set is too small to achieve optimal performance. In the literature, different techniques are used to extend the training set. We build an extended training set by adding distorted versions of the training patterns (see [9]), increasing the number of training samples by a factor of five. Note that we did not extend the test set in any way. Figure 2 shows the test error with respect to the number of features used. Experiments using the original training set yielded an error rate of 3.1%. On the extended training set an error rate of 2.6% was achieved. Note that the error rate on the extended set drops from 3.0% to 2.6% between rounds 500 and 1000 without adding new convolutional features. Table 1 compares our performance to other published results. The results of the presented scheme are competitive with other state-of-the-art algorithms.
4.2 UIUC Car Classification
A second set of experiments was conducted using the UIUC car side view database [14]. The training set contains 550 images of cars and 500 images of background, each image of size 100 × 40. Again, cross validation was used to find good parameters. The best performance was achieved using convolution kernels of size 5 × 5 and a subsampling area of size 5 × 5.
Table 2. Test error rates on UIUC cars (this work: min, mean, max over ten runs)

method               error (single-scale set) [%]   error (multi-scale set) [%]
Lampert et al [15]       1.5                            1.5
Agarwal et al [14]      23.5                           60.4
Leibe et al [16]         2.5                            5.0
Fritz et al [17]        11.4                           12.2
Mutch et al [5]          0.04                           9.4
this work               (1.25) 1.55 (1.78)             (2.9) 3.6 (4.0)
Fig. 3. Examples of classification on single-scale test set (ground truth: blue, true positives green, false positives red)
The UIUC car database contains two test sets, both consisting of natural images containing cars. The first (single-scale) set consists of 170 images containing 200 cars. The cars in this set have the same scale as the cars in the training set. The second (multi-scale) test set consists of 107 images showing 139 cars. The dimensions of the cars range between 89 × 36 and 212 × 85. A sliding window approach was used to generate candidates for the classifier. For multi-scale test images the sliding window classifier was applied to scaled versions of the images. We used the same scales as in [14] (s = 1.2^{−4,−3,...,1}). Figure 3 shows some classification results on the single-scale test set. Performance evaluation was done in the same fashion as in the original paper [14]. Table 2 compares our results to the state of the art⁷. The results for the single- and multi-scale test sets are among the best reported. In particular, our results on the multi-scale test set are the best reported results using a sliding window approach. The error rate with respect to the number of features on the single-scale test set is shown in Fig. 4. Errors drop to a competitive level quickly. For an average error of below 2%, approximately 30 multiplications per pixel are used, giving a very efficient classifier.
⁷ To show the effect of the randomness of our approach, the results are given for multiple runs of the system.
To show the effect of the randomness of our approach the results are given for multiple runs of the system.
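The multi-scale sliding-window candidate generation described above could look roughly like the following sketch. The callable `score_fn`, the window size, the stride and the exact scale convention are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def sliding_window_detections(score_fn, image, win=(40, 100), stride=4,
                              scales=tuple(1.2 ** k for k in range(-4, 2)),
                              threshold=0.0):
    """Apply a window classifier to scaled versions of `image` (scales 1.2^k,
    k = -4..1) and report detections in original-image coordinates.

    `score_fn(patch)` is a hypothetical stand-in for the boosted classifier
    evaluated on a 40x100 (rows x cols) patch."""
    h_win, w_win = win
    detections = []
    for s in scales:
        scaled = zoom(image.astype(float), s, order=1)   # resize the image by s
        H, W = scaled.shape
        for y in range(0, H - h_win + 1, stride):
            for x in range(0, W - w_win + 1, stride):
                score = score_fn(scaled[y:y + h_win, x:x + w_win])
                if score > threshold:
                    # map the window back to original image coordinates
                    detections.append((x / s, y / s, w_win / s, h_win / s, score))
    return detections
```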
[Figure 4: two plots; left axes: recall vs. 1−precision, right axes: 1−fscore vs. number of features.]
Fig. 4. Left: recall-precision curve for UIUC cars (black: single scale, red: multi scale). Right: f-score on single-scale test set (min, mean, max over 10 runs).
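For completeness, the quantities plotted in Fig. 4 can be computed from the detection counts as in the short sketch below; how a detection is matched to the ground truth follows the protocol of [14] and is not repeated here.

```python
def detection_scores(num_true_pos, num_false_pos, num_ground_truth):
    """Recall, precision and f-score as used in the recall-precision and
    f-score plots of Fig. 4."""
    recall = num_true_pos / num_ground_truth
    precision = num_true_pos / (num_true_pos + num_false_pos)
    fscore = 2 * precision * recall / (precision + recall)
    return recall, precision, fscore
```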
5 Conclusion and Outlook
In this work a novel approach for generating shift-invariant features was presented. By using boosting to find meaningful features, the scheme is very simple and scalable. Performance, evaluated on the USPS handwritten digit recognition database and the UIUC car side view database, is competitive with state-of-the-art systems. The advantage of our method, compared to other systems using similar features, is the low number of design parameters and its modularity. The complexity of the trained classifier adapts to the problem at hand. Boosting techniques, like the use of cascades, can easily be incorporated. Future extensions of the presented method will include the use of multiple scales. Currently, features are generated at one fixed scale. While this is sufficient for the classification of handwritten digits and related problems, for real-world objects descriptive features will likely appear at multiple scales.
Acknowledgments We gratefully acknowledge financial support by the Robert Bosch GmbH corporate PhD program and the Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences at IWR, Heidelberg.
References 1. Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 2. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 411–426 (2007)
3. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. The Annals of Statistics 38(2) (2000) 4. Bouchard, G., Triggs, B.: Hierarchical part-based visual object categorization. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 710–715 (2005) 5. Mutch, J., Lowe, D.G.: Multiclass object recognition with sparse, localized features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 11–18 (2006) 6. Huang, F.J., LeCun, Y.: Large-scale learning with svm and convolutional nets for generic object categorization. In: Proc. Computer Vision and Pattern Recognition Conference (CVPR 2006). IEEE Press, Los Alamitos (2006) 7. Schwenk, H., Bengio, Y.: Boosting neural networks. Neural Comput. 12(8), 1869– 1887 (2000) 8. Eilers, P.H.C., Marx, B.D.: Flexible smoothing with b-splines and penalties. Statistical Science 11(2), 89–121 (1996) 9. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR 2003: Proceedings of the Seventh International Conference on Document Analysis and Recognition, Washington, DC, USA, Microsoft Research, p. 958. IEEE Computer Soc, Los Alamitos (2003) 10. Simard, P.Y., LeCun, Y.A., Denker, J.S., Victorri, B.: Transformation invariance in pattern recognition - tangent distance and tangent propagation. In: Orr, G.B., M¨ uller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 239–274. Springer, Heidelberg (1998) 11. Keysers, D., Macherey, W., Ney, H., Dahmen, J.: Adaptation in statistical pattern recognition using tangent vectors. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 269–274 (2004) 12. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study in handwritten digit recognition. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, 1994. Conference B: Computer Vision & Image Processing, vol. 2, pp. 77–82 (1994) 13. Cun, Y.L., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Howard, W., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems II (Denver 1989), pp. 396–404. Morgan Kaufmann, San Mateo (1990) 14. Agarwal, S., Awan, A., Roth, D.: Learning to detect objects in images via a sparse, part-based representation. In: IEEE Transactions on Pattern Analysis and Matchine Intelligence, vol. 26 (2004) 15. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8 (June 2008) 16. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. Int. J. Comput. Vision 77(1-3), 259–289 (2008) 17. Fritz, M., Leibe, B., Caputo, B., Schiele, B.: Integrating representative and discriminative models for object category detection. In: ICCV 2005: Proceedings of the Tenth IEEE International Conference on Computer Vision, Washington, DC, USA, pp. 1363–1370. IEEE Computer Society Press, Los Alamitos (2005)
Harmonic Filters for Generic Feature Detection in 3D

Marco Reisert1 and Hans Burkhardt2,3

1 Dept. of Diagnostic Radiology, Medical Physics, University Medical Center
2 Computer Science Department, University of Freiburg
3 Centre for Biological Signaling Studies (bioss), University of Freiburg
[email protected]
Abstract. This paper proposes a concept for SE(3)-equivariant non-linear filters for multiple purposes, especially in the context of feature and object detection. The idea of the approach is to compute local descriptors as projections onto a local harmonic basis. These descriptors are mapped in a non-linear way onto new local harmonic representations, which then contribute to the filter output in a linear way. This approach may be interpreted as a kind of voting procedure in the spirit of the generalized Hough transform, where the local harmonic representations are interpreted as a voting function. On the other hand, the filter has similarities with classical low-level feature detectors (like corner/blob/line detectors), just extended to the generic feature/object detection problem. The proposed approach fills the gap between low-level feature detectors and high-level object detection systems based on the generalized Hough transform. We will apply the proposed filter to a feature detection task on confocal microscopical images of airborne pollen and compare the results to a 3D-extension of a popular GHT-based approach and to a classification per voxel solution.
1 Introduction The theory of non-linear filters is well developed for image translations. It is known as Volterra theory. Volterra theory states that any non-linear translation-invariant system can be modelled as an infinite sum of multidimensional convolution integrals. More precisely, a filter H is said to be equivariant with respect to some group G, if gH{f} = H{gf} holds for all images f and all g ∈ G, where gf denotes the action of the group on the image f. For the group of translations (or the group of time-shifts) such filters are called Volterra series. In this paper we want to develop non-linear filters that are equivariant with respect to the Euclidean motion SE(3); therefore, we need a generalization of Volterra's principle to SE(3). In [1] a 2D non-linear filter was proposed that is SE(2)-equivariant. The filter was derived from the general concept of group integration, which replaced Volterra's principle. In this paper we want to generalize this filter to SE(3). The generalization is not straightforward because the two-dimensional rotation group SO(2) essentially differs from its three-dimensional counterpart SO(3). As already mentioned, the derivation of the filter in [1] was based on the principle of group integration. In this paper we want to follow a more pragmatic way and directly propose the 3D filter guided by its 2D analogue. Let us recapitulate the workflow of the holomorphic filter and give a sketch of its 3D counterpart. In a first step the holomorphic filter computes several convolutions with functions of the form $z^j e^{-|z|^2}$
where z = x + iy is the pixel coordinate in complex notation. Note that the monomial $z^j = r^j e^{ij\phi}$ is holomorphic. The results of these convolutions show a special rotation behavior, e.g. for j = 1 it behaves like a gradient field and for j = 2 like a 2nd-rank tensor field. Several products of these convolution results are computed. These products again show a special rotation behavior. For example, if we multiply a gradient field (j1 = 1) and a 2-tensor field (j2 = 2) we obtain a third-order field with j = j1 + j2 = 3. According to the transformation behavior of the products they are again convolved with functions of the form $z^j e^{-|z|^2}$ such that the result of the convolution transforms like a scalar (j = 0). This is the principle of the holomorphic filter which we want to generalize to 3D. The first question is, what are the functions corresponding to $z^j e^{-|z|^2}$ in 3D? We know that the real and imaginary parts of a holomorphic polynomial are harmonic polynomials. Harmonic polynomials solve the Laplace equation. As $z^j e^{-|z|^2}$ is a Gaussian-windowed holomorphic monomial, we will instead use a Gaussian-windowed harmonic polynomial for the 3D filter. The second question is, how can we form products of convolutions with harmonic polynomials that preserve their transformation behavior? We will find out that the Clebsch-Gordan coefficients known from quantum mechanics provide such products. Given two tensor fields of certain degrees we are able to form a new tensor field of another degree by a certain multiplication and weighted summation of the input fields. The weights in the summations are the Clebsch-Gordan coefficients. In [1] and [2] it was shown that the convolutions with the Gaussian-windowed holomorphic basis can be computed efficiently with complex derivatives. In fact, there is a very similar approach in 3D by so-called spherical derivatives [3]. The paper is organized as follows: in the following section we give a small overview of related work. In Section 2 we introduce the basics of spherical tensor analysis. We introduce the spherical product which couples spherical tensor fields and introduce basics about spherical harmonics. We also introduce so-called spherical derivatives that are the counterpart to the usual complex derivatives in 2D. They will help us to compute the occurring convolutions in an efficient manner. In Section 3 we introduce the Harmonic filter and show how its parameters can be adapted to a specific problem. Section 4 shows how the filter can be implemented efficiently and how it can be applied to feature detection in confocal microscopical images. In Section 5 we conclude and give an outlook on future work. 1.1 Related Work Volterra filters are the canonical generalization of the linear convolution to a nonlinear mapping. They are widely used in the signal processing community and also find applications in image processing tasks [4,5]. The filter proposed in this work might be interpreted as a kind of 'joint' Volterra filter for translation and rotation. Steerable filters, introduced in [6], are a common tool in early vision and image analysis. A generalization for non-group-like deformations was proposed in [7] using an approximative scheme. The harmonic filter computes a certain subset of Gaussian-windowed spherical moments in a first step, which is actually a steerable filter. The generalized Hough transform (GHT) [8] is a major tool for the detection of arbitrary shapes.
Many modern approaches [9,10] for object detection and recognition are
based on the idea that local parts of the object cast votes for the putative center of the object. If the proposed algorithm is used in the context of object detection, it may be interpreted as some kind of voting procedure for the object center. This voting interpretation also relates our approach to the Tensor Voting framework (TV) [11]. However, in TV the voting function does not depend on the local context. In contrast, the proposed filter is able to cast context-dependent votes.
2 Spherical Tensor Analysis In the following we briefly recapitulate the basic notions of 3D harmonic analysis as they were introduced in [3]. For introductory reading we recommend the literature [12] concerning the quantum theory of the angular momentum, while our presentation tries to avoid terms from quantum theory to also give non-physicists a chance to follow. See e.g. [13,14] for an introduction from an image processing/engineering viewpoint. 2.1 Preliminaries Let $\mathbf{D}^j_g$ be the unitary irreducible representation of a $g \in SO(3)$ of order $j$ with $j \in \mathbb{N}$. They are also known as the Wigner D-matrices (see e.g. [12]). The representation $\mathbf{D}^j_g$ acts on a vector space $V_j$ which is represented by $\mathbb{C}^{2j+1}$. The standard basis of $\mathbb{C}^{2j+1}$ is written as $\mathbf{e}^j_m$. We write the elements of $V_j$ in bold face, e.g. $\mathbf{u} \in V_j$, and write the $2j+1$ components in unbold face $u_m \in \mathbb{C}$ where $m = -j, \ldots, j$. For the transposition of a vector/matrix we write $\mathbf{u}^T$; the joint complex conjugation and transposition is denoted by $\mathbf{u}^\dagger = \overline{\mathbf{u}}^T$. Note that we treat the space $V_j$ as a real vector space of dimension $2j+1$, although the components of $\mathbf{u}$ might be complex. This means that the space $V_j$ is only closed under weighted superpositions with real numbers. As a consequence we observe that the components are interrelated by $\overline{u_m} = (-1)^m u_{-m}$. From a computational point of view this is an important issue: although the vectors are elements of $\mathbb{C}^{2j+1}$, we only have to store $2j+1$ real numbers. The standard coordinate vector $\mathbf{r} = (x, y, z)^T \in \mathbb{R}^3$ has a natural relation to elements $\mathbf{u} \in V_1$ in the form of

$$\mathbf{u} = \begin{pmatrix} w \\ z \\ -\overline{w} \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}}(x - iy) \\ z \\ -\frac{1}{\sqrt{2}}(x + iy) \end{pmatrix} = S\mathbf{r} \in V_1.$$

Note that S is a unitary coordinate transformation. Actually, the representation $\mathbf{D}^1_g$ is directly related to the real-valued rotation matrix $\mathbf{U}_g \in \mathbb{R}^{3\times3}$ by $\mathbf{D}^1_g = S\mathbf{U}_g S^\dagger$. Definition 1. A function $\mathbf{f} : \mathbb{R}^3 \to V_j$ is called a spherical tensor field of rank j if it transforms with respect to rotations as follows: $(g\mathbf{f})(\mathbf{r}) := \mathbf{D}^j_g \mathbf{f}(\mathbf{U}^T_g \mathbf{r})$ for all $g \in SO(3)$. The space of all spherical tensor fields of rank j is denoted by $\mathcal{T}_j$.
2.2 Spherical Tensor Coupling We define a family of symmetric bilinear forms that connect tensors of different ranks. Definition 2. For every $j \ge 0$ we define a family of symmetric bilinear forms of type $\bullet_j : V_{j_1} \times V_{j_2} \to V_j$, where $j_1, j_2 \in \mathbb{N}$ have to be chosen according to the triangle inequality $|j_1 - j_2| \le j \le j_1 + j_2$ and $j_1 + j_2 + j$ has to be even. It is defined by

$$(\mathbf{e}^j_m)^\dagger (\mathbf{v} \bullet_j \mathbf{w}) := \frac{1}{\langle j0 \mid j_1 0, j_2 0\rangle} \sum_{m = m_1 + m_2} \langle jm \mid j_1 m_1, j_2 m_2\rangle\, v_{m_1} w_{m_2}$$
where $\langle jm \mid j_1 m_1, j_2 m_2\rangle$ are the Clebsch-Gordan coefficients (see e.g. [12]). Up to the factor $\langle j0 \mid j_1 0, j_2 0\rangle$ this definition is just the usual spherical tensor coupling equation which is very well known in the quantum mechanics of the angular momentum. The additional factor is for convenience: it normalizes the product such that it shows a more gentle behavior with respect to the spherical harmonics, as we will see later. The characterizing property of these products is that they respect the rotations of the arguments, i.e. if $\mathbf{v} \in V_{j_1}$ and $\mathbf{w} \in V_{j_2}$, then for any $g \in SO(3)$

$$(\mathbf{D}^{j_1}_g \mathbf{v}) \bullet_j (\mathbf{D}^{j_2}_g \mathbf{w}) = \mathbf{D}^j_g (\mathbf{v} \bullet_j \mathbf{w})$$

holds. For the special case j = 0 the arguments have to be of the same rank due to the triangle inequality. Actually, in this case the new product coincides with the standard inner product $\mathbf{v} \bullet_0 \mathbf{w} = \mathbf{w}^\dagger \mathbf{v}$. Further note that if one of the arguments of $\bullet$ is a scalar, then $\bullet$ reduces to the standard scalar multiplication, i.e. $v \bullet_j \mathbf{w} = v\mathbf{w}$, where $v \in V_0$ and $\mathbf{w} \in V_j$. Another remark is that $\bullet$ is not associative. The introduced product can also be used to combine tensor fields of different rank by point-wise multiplication as $\mathbf{f}(\mathbf{r}) = \mathbf{v}(\mathbf{r}) \bullet_j \mathbf{w}(\mathbf{r})$. If $\mathbf{v} \in \mathcal{T}_{j_1}$ and $\mathbf{w} \in \mathcal{T}_{j_2}$ and j is chosen such that $|j_1 - j_2| \le j \le j_1 + j_2$, then $\mathbf{f}$ is in $\mathcal{T}_j$, i.e. a tensor field of rank j. 2.3 Spherical and Solid Harmonics We denote the well-known spherical harmonics by $\mathbf{Y}^j : S^2 \to V_j$. We write $\mathbf{Y}^j(\mathbf{r})$, where $\mathbf{r}$ may be an element of $\mathbb{R}^3$, but $\mathbf{Y}^j(\mathbf{r})$ is independent of the magnitude of $\mathbf{r}$. We know that the $\mathbf{Y}^j$ provide an orthogonal basis of scalar functions on the 2-sphere $S^2$. Thus, any real scalar field $f \in \mathcal{T}_0$ can be expanded in terms of spherical harmonics in a unique manner. In the following we use Racah's normalization (also known as semi-Schmidt normalization), i.e. $\langle Y^j_m, Y^{j'}_{m'}\rangle_{S^2} = \frac{1}{2j+1}\,\delta_{jj'}\delta_{mm'}$. One important and useful property is that $\mathbf{Y}^j = \mathbf{Y}^{j_1} \bullet_j \mathbf{Y}^{j_2}$. We can use this formula to iteratively compute higher-order $\mathbf{Y}^j$ from given lower-order ones. Note that $\mathbf{Y}^0 = 1$ and $\mathbf{Y}^1 = S\mathbf{r}$, where $\mathbf{r} \in S^2$. The spherical harmonics have a variety of nice properties. One of the most important ones is that each $\mathbf{Y}^j$, interpreted as a tensor field of rank j, is a fix-point with respect to rotations, i.e. $(g\mathbf{Y}^j)(\mathbf{r}) = \mathbf{Y}^j(\mathbf{r})$, or in other words $\mathbf{Y}^j(\mathbf{U}_g\mathbf{r}) = \mathbf{D}^j_g \mathbf{Y}^j(\mathbf{r})$. The spherical harmonics naturally arise from the solutions of the Laplace equation as the so-called solid harmonics $\mathbf{R}^j(\mathbf{r}) := r^j \mathbf{Y}^j(\mathbf{r})$.
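A direct, unoptimised sketch of the product of Definition 2 is given below, using SymPy's Clebsch-Gordan coefficients. The array indexing convention (component m stored at position m + j) is an assumption of this illustration, not part of the paper.

```python
import numpy as np
from sympy.physics.quantum.cg import CG

def spherical_product(v, w, j):
    """Couple spherical tensors v (rank j1) and w (rank j2) to a rank-j tensor,
    following Definition 2; components are indexed m = -j..j."""
    j1 = (len(v) - 1) // 2
    j2 = (len(w) - 1) // 2
    assert abs(j1 - j2) <= j <= j1 + j2 and (j1 + j2 + j) % 2 == 0
    norm = float(CG(j1, 0, j2, 0, j, 0).doit())   # the <j0 | j1 0, j2 0> factor
    out = np.zeros(2 * j + 1, dtype=complex)
    for m in range(-j, j + 1):
        acc = 0.0 + 0.0j
        for m1 in range(-j1, j1 + 1):
            m2 = m - m1
            if -j2 <= m2 <= j2:
                cg = float(CG(j1, m1, j2, m2, j, m).doit())
                acc += cg * v[m1 + j1] * w[m2 + j2]
        out[m + j] = acc / norm
    return out
```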
2.4 Spherical Derivatives This section provides the basic tools for dealing with derivatives in the context of spherical tensor analysis. In [3] the spherical derivatives are introduced. They connect spherical tensor fields of different ranks by differentiation. Proposition 1 (Spherical Derivatives). Let $\mathbf{f} \in \mathcal{T}_j$ be a tensor field. The spherical up-derivative $\nabla^1 : \mathcal{T}_j \to \mathcal{T}_{j+1}$ and the down-derivative $\nabla_1 : \mathcal{T}_j \to \mathcal{T}_{j-1}$ are defined as

$$\nabla^1 \mathbf{f} := \nabla \bullet_{j+1} \mathbf{f}, \quad (1)$$
$$\nabla_1 \mathbf{f} := \nabla \bullet_{j-1} \mathbf{f}, \quad (2)$$

where $\nabla = \left(\tfrac{1}{\sqrt{2}}(\partial_x - i\partial_y),\ \partial_z,\ -\tfrac{1}{\sqrt{2}}(\partial_x + i\partial_y)\right)$ is the spherical gradient and $\partial_x, \partial_y, \partial_z$ are the standard partial derivatives. Note that for a scalar function the spherical up-derivative is just the spherical gradient, i.e. $\nabla f = \nabla^1 f$. As a prerequisite for the Harmonic filter it is necessary to mention that the $j$-fold spherical derivative $\nabla^j$ of a Gaussian is just a Gaussian-windowed solid harmonic:

$$\nabla^j e^{-\frac{r^2}{2\sigma^2}} = (\sqrt{2\pi}\sigma)^3\, \mathbf{G}^j_\sigma(\mathbf{r}) = \left(-\frac{1}{\sigma^2}\right)^j \mathbf{R}^j(\mathbf{r})\, e^{-\frac{r^2}{2\sigma^2}}. \quad (3)$$

An implication is that convolutions with the $\mathbf{G}^j_\sigma$ are derivatives of Gaussian-smoothed functions, namely $\mathbf{G}^j_\sigma \ast f = \nabla^j (G_\sigma \ast f)$, where $f \in \mathcal{T}_0$. Note that we use the convention $\mathbf{G}^0_\sigma = G_\sigma = \frac{1}{(\sqrt{2\pi}\sigma)^3}\, e^{-\frac{r^2}{2\sigma^2}}$.
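For a scalar volume, the spherical gradient (and hence $\nabla^1 f$) can be sketched with finite differences as below; the axis-to-coordinate mapping and the component ordering are assumptions of this illustration.

```python
import numpy as np

def spherical_gradient(f):
    """Spherical gradient of a scalar volume f (rank 0 -> rank 1), with the
    component ordering ( (dx - i dy)/sqrt(2), dz, -(dx + i dy)/sqrt(2) ).

    Assumes the array axes of f are ordered (x, y, z); returns a complex array
    of shape (3,) + f.shape."""
    dfx, dfy, dfz = np.gradient(f.astype(float))
    out = np.empty((3,) + f.shape, dtype=complex)
    out[0] = (dfx - 1j * dfy) / np.sqrt(2)
    out[1] = dfz
    out[2] = -(dfx + 1j * dfy) / np.sqrt(2)
    return out
```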
3 Harmonic Filters Our goal is to build non-linear image filters that are equivariant to Euclidean motion. An SE(3)-equivariant image filter is given by the following Definition 3 (SE(3)-Equivariant Image Filter). A scalar image filter $\mathcal{F}$ is a mapping from $\mathcal{T}_0$ onto $\mathcal{T}_0$. We call such a mapping SE(3)-equivariant if $\mathcal{F}\{gf\} = g\mathcal{F}\{f\}$ for all $g \in SE(3)$ and $f \in \mathcal{T}_0$. Our approach may be interpreted as a kind of context-dependent voting scheme. The intuitive idea is as follows: compute for each position in the 3D space the projection onto the Gaussian-windowed harmonic basis $\mathbf{G}^j_\sigma$ for $j = 0, \ldots, n$. This can be done by a simple convolution of the image f with the basis, i.e. $\mathbf{p}^j := \mathbf{G}^j_\sigma \ast f$. Imagine this set of projections $\mathbf{p}^j$ as local descriptor images, where the set $[\mathbf{p}^0(\mathbf{r}), \mathbf{p}^1(\mathbf{r}), \ldots, \mathbf{p}^n(\mathbf{r})]$ of coefficients describes the harmonic part of the neighborhood of the voxel $\mathbf{r}$. Then, for each voxel, map these projections onto some new harmonic descriptors $\mathbf{V}^j(\mathbf{r}) = \mathbf{V}^j[\mathbf{p}^0(\mathbf{r}), \mathbf{p}^1(\mathbf{r}), \ldots, \mathbf{p}^n(\mathbf{r})]$ which can be interpreted as a local expansion of a kind
of voting function that contributes into the neighborhood of $\mathbf{r}$. The contribution stemming from the voter at voxel $\mathbf{r}'$ at position $\mathbf{r}$ is

$$V_{\mathbf{r}'}(\mathbf{r}) = G_\eta(\mathbf{r} - \mathbf{r}') \sum_{j=0}^{\infty} \mathbf{V}^j(\mathbf{r}') \bullet_0 \mathbf{R}^j(\mathbf{r} - \mathbf{r}'), \quad (4)$$

i.e. the voting function is just a Gaussian-windowed harmonic function. The final step is to render the contributions from all voxels $\mathbf{r}'$ together in an additive way by integration to arrive at

$$\mathcal{H}\{f\}(\mathbf{r}) := \int_{\mathbb{R}^3} V_{\mathbf{r}'}(\mathbf{r})\, d\mathbf{r}' = \sum_{j=0}^{n} \mathbf{G}^j_\eta \bullet_0 \mathbf{V}^j.$$
To ensure rotation-equivariance the $\mathbf{V}^j[\cdot]$ have to obey the following equivariance constraint: $\mathbf{V}^j[\mathbf{D}^0_g \mathbf{p}^0, \ldots, \mathbf{D}^n_g \mathbf{p}^n] = \mathbf{D}^j_g \mathbf{V}^j[\mathbf{p}^0, \ldots, \mathbf{p}^n]$. We will use the spherical product $\bullet$ as the basic building block for the equivariant nonlinearities $\mathbf{V}^j$. There are many possibilities to combine several spherical tensors by the products $\bullet$ in an equivariant way. Later we will discuss this in detail. 3.1 Differential Formulation A computationally expensive part of the filter are the convolutions: on the one hand, the projection of the input onto the harmonic basis and, on the other hand, the rendering of the output, also done by convolution. Equation (3) shows that there is another way to compute such projections: by the use of the spherical derivative. So, we can reformulate the filter as follows:

$$\mathcal{H}\{f\} := G_\eta \ast \sum_{j=0}^{n} \nabla_j\, \mathbf{V}^j[\nabla^0 f_s, \ldots, \nabla^n f_s] \quad (5)$$

with $f_s = G_\sigma \ast f$. In Algorithm 1 we depict the computation of the filter. Note that we only have to compute n spherical up-derivatives $\nabla^1$ if we implement them by repeated application, and the same actually holds for the down-derivative $\nabla_1$ if we follow Algorithm 1. 3.2 The Voting Function The probably simplest nonlinear voting function $\mathbf{V}^j$ is a sum of second-order products of the descriptor images $\mathbf{p}^j$, namely

$$\mathbf{V}^j[\mathbf{p}^0, \ldots, \mathbf{p}^n] = \sum_{\substack{|j_1 - j_2| \le j \le j_1 + j_2 \\ j_1 + j_2 + j\ \mathrm{even} \\ j_1, j_2 \le n}} \alpha^j_{j_1, j_2}\, \mathbf{p}^{j_1} \bullet_j \mathbf{p}^{j_2} \quad (6)$$
Algorithm 1. Filter Algorithm y = H{f}
Input: scalar volume image f
Output: scalar volume image y
1: Initialize y_n := 0 ∈ T_n
2: Convolve p^0 := G_σ ∗ f
3: for j = 1 : n do
4:   p^j := ∇^1 p^{j−1}
5: end for
6: for j = n : −1 : 1 do
7:   y_{j−1} := ∇_1 (y_j + V^j[p^0, . . . , p^n])
8: end for
9: Let y := y_0 + V^0[p^0, . . . , p^n]
10: Convolve y := G_η ∗ y
where $\alpha^j_{j_1,j_2} \in \mathbb{R}$ are expansion coefficients. We call the order of the products that are involved in $\mathbf{V}^j$ the order of the filter and denote it by N. Depending on the application, the features may or may not depend on the absolute intensity values of the input image. To become invariant against additive intensity changes one leaves out the zero-order descriptor $\mathbf{p}^0$. For robustness against illumination/contrast changes we introduce a soft normalization of the first-order ('gradient') descriptor $\mathbf{p}^1$. This means that in the for-loop of Alg. 1, lines 3-5, we introduce a special case for j = 1, namely

$$\mathbf{p}^1(\mathbf{r}) = \frac{1}{\gamma + s_{dev}(\mathbf{r})}\, \nabla^1 f(\mathbf{r}),$$

where $\gamma \in \mathbb{R}$ is a fixed regularization parameter and $s_{dev}(\mathbf{r})$ denotes the standard deviation computed in a local window around $\mathbf{r}$. The normalization makes the filter robust against multiplicative changes of the gray values and, secondly, emphasizes the 'structural' and 'textural' properties rather than the pure intensities. Besides $\gamma$, the filter has three other parameters: the expansion degree n, the width of the input Gaussian $\sigma$ and the width of the output Gaussian $\eta$. In the spirit of the GHT, the parameter $\sigma$ determines the size of the local features that vote for the center of the object of interest. To assure that every voxel of the object can contribute, the extent of the voting function should be at least half the diameter of the object.
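The soft normalization of the gradient descriptor could be sketched as follows; the window size and the particular estimator used for $s_{dev}(\mathbf{r})$ are assumptions of this illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def soft_normalized_gradient(grad, f, gamma=1.0, window=5):
    """Soft normalization of the rank-1 descriptor p^1 (Sec. 3.2): divide the
    spherical gradient by (gamma + local standard deviation of f).

    `grad` has shape (3,) + f.shape, e.g. the output of spherical_gradient(f)."""
    mean = uniform_filter(f.astype(float), size=window)
    mean_sq = uniform_filter(f.astype(float) ** 2, size=window)
    sdev = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    return grad / (gamma + sdev)
```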
4 Pollen Porate Detection in Confocal Data Analysis techniques for data acquired by microscopy typically demand a rotation- and translation-invariant treatment. In this experiment we use the harmonic filter for the analysis of pollen grains acquired with confocal laser scanning microscopy (see [15]). Palynology, the study and analysis of pollen, is an interesting topic with very diverse applications, for example in paleoclimatology or forensics. An important feature of certain types of pollen grains are the so-called porates, small pores on the surface of the grain. Their relative configuration is crucial for the determination of the species. We want to show that our filter is able to detect these structures in a reliable way. The dataset consists of 45 samples.
The images have varying sizes of about 80³ voxels. We labeled the porates by hand. The experimental setup is quite simple: we apply the trained harmonic filter to each pollen image and then select local maxima above a certain threshold as detection hypotheses. 4.1 Reference Approaches We use the ideas of Ballard et al. [8], Lowe et al. [9] and Leibe et al. [10] and extended them to 3D. The approach is based on the generalized Hough transform (GHT). Based on a selection of interest points, local features are extracted and assigned to a codebook entry. Each codebook entry is endowed with a set of votes for the object center, which are cast for each interest point. This approach closely resembles the idea of the implicit shape model by Leibe et al. [10], where we used a 3D extension of Lowe's SIFT features [9] as local features (for details see [16]). As a second approach we apply a simple classification scheme per voxel (VC). For each voxel we compute a set of expressive rotation-invariant features and train a classifier to discriminate the objects of interest from the background. This idea was, for example, used by Staal et al. [17] for blood vessel detection in 2D retinal images and by Fehr et al. [18] for cell detection in 3D. For details about the features and the implementation see [16]. 4.2 Training For the training of the harmonic filter (and for both reference approaches) we selected one(!) good pollen example, i.e. three porate samples. To train the harmonic filter we built an indicator image with pixels set to 1 at the centers of the three porates. The indicator image is just the target image y which should satisfy H{f} = y. As mentioned before, the linearity of the filter in its parameters makes it easy to adapt them. We use an unregularized least-squares approach. Due to the high dynamic differences between the filter responses corresponding to the individual parameters it is necessary to normalize the equation system to avoid numerical problems. We used the standard deviation of the individual filter responses taken over all samples in the training image. The σ parameter determining the size of the local features was chosen to be 2.5 pixels. The output width η determining the range of the voting function was chosen to be 8 pixels, which is about half the diameter of the porates. For the training of the reference approaches see again [16]. 4.3 Evaluation In Figure 1 we show two examples. The filter detects the porates but also shows some small responses within the pollen; however, the results are still acceptable. For quantitative results we computed Precision/Recall graphs. A detection was counted as successful if it is at most 8 (4) pixels away from the true label. In Figure 2 on the left we show a PR-graph for a varying expansion degree n with a low detection precision of 8 pixels. As one expects, the filter improves its performance with growing n. For n = 8 no further performance gain is observed. The runtime of the filter heavily depends on the number of spherical products to be computed. For example, for n = 6 we have to compute 46
Fig. 1. Mugwort pollen (green) with overlayed filter response (red) for two examples. The filter detects the three porates, but there are also some spurious responses within the pollen, because the pollen has also strong inner structures. 1 0.9
[Figure 2: three Recall vs. Precision plots; left legend: n = 3, 4, 5, 6, 7, 8; middle and right legends: GHT Harris, GHT DOG, GHT DHES, VC KNN, VC SVM, Harmonic Filter.]
Fig. 2. Precision/Recall graphs of the porate detection problem. Left: Comparison of the Harmonic filter for different expansion degrees (precision 8 pixels). Middle: Comparison with reference approaches (precision 8 pixels). Right: Comparison with reference approaches (4 pixels).
products. The computation of these products takes about 6 seconds on a P4 2.8 GHz. In Figure 2 in the middle we compare the result of the Harmonic filter with n = 7 to the reference approaches. The results of the GHT based on DoG interest points are comparable with the Harmonic filter. The voxel classification approach (VC) does not perform as well; in particular, the SVM-based classification performs quite poorly. Finally, we evaluated the PR-graph with a higher detection precision of 4 pixels. As already experienced in [1], the GHT-based approach has problems in this case, which probably has to do with the inaccurate and unstable determination of the interest points. Now both VC approaches outperform the GHT approaches, while the Harmonic Filter is clearly superior to all the others.
5 Conclusion In this paper we presented a general-purpose non-linear filter that is equivariant with respect to 3D Euclidean motion. The filter may be seen as a joint Volterra filter for rotation and translation. The filter locally senses a harmonic projection of the image function and maps this projection onto a kind of voting function which is also harmonic. The mapping is modelled by rotation-equivariant polynomials in the describing coefficients. The harmonic projections are computed in an efficient manner by the use of spherical derivatives of Gaussian-smoothed images. We applied the filter to a 3D detection problem. For low detection precision the performance is comparable to state-of-the-art approaches, while for high detection precision the approach clearly outperforms existing approaches.
Acknowledgements This study was supported by the Excellence Initiative of the German Federal and State Governments (EXC 294).
References 1. Reisert, M., Burkhardt, H.: Equivariant holomorphic filters for contour denoising and rapid object detection. IEEE Trans. on Image Processing 17(2) (2008) 2. Reisert, M., Burkhardt, H.: Complex derivative filters. IEEE Trans. Image Processing 17(12), 2265–2274 (2008) 3. Reisert, M., Burkhardt, H.: Spherical tensor calculus for local adaptive filtering. In: Tensors in Image Processing and Computer Vision (2009) 4. Thurnhofer, S., Mitra, S.: A general framework for quadratic volterra filters for edge enhancment. IEEE Trans. Image Processing, 950–963 (1996) 5. Mathews, V.J., Sicuranza, G.: Polynomial Signal Processing. J.Wiley, New York (2000) 6. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Pattern Anal. Machine Intell. 13(9), 891–906 (1991) 7. Perona, P.: Deformable kernels for early vision. IEEE Trans. Pattern Anal. Machine Intell. 17(5), 488–499 (1995) 8. Ballard, D.: Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981) 9. Lowe, D.: Distinct image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 10. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS. Springer, Heidelberg (2004) 11. Mordohai, P.: Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning. Morgan and Claypool, San Francisco (2006) 12. Rose, M.: Elementary Theory of Angular Momentum. Dover Publications (1995) 13. Miller, W., Blahut, R., Wilcox, C.: Topics in harmonic analysis with applications to radar and sonar. In: IMA Volumes in Mathematics and its Applications. Springer, New York (1991) 14. Lenz, R.: Group theoretical methods in Image Processing. Lecture Notes. Springer, Heidelberg (1990) 15. Ronneberger, O., Burkhardt, H., Schultz, E.: General-purpose Object Recognition in 3D Volume Data Sets using Gray-Scale Invariants. In: Proceedings of the International Conference on Pattern Recognition, Quebec, Canada. IEEE Computer Society Press, Los Alamitos (2002) 16. Reisert, M.: Harmonic filters in 3d - theory and applications. Technical Report 1/09, IIFLMB, Computer Science Department, University of Freiburg (2009) 17. Staal, J., Ginneken, B., Niemeijer, M., Viegever, A., Abramoff, M.: Ridge based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 23(4), 501–509 (2004) 18. Fehr, J., Ronneberger, O., Kurz, H., Burkhardt, H.: Self-learning segmentation and classification of cell-nuclei in 3D volumetric data using voxel-wise gray scale invariants. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, pp. 377– 384. Springer, Heidelberg (2005)
Increasing the Dimension of Creativity in Rotation Invariant Feature Design Using 3D Tensorial Harmonics

Henrik Skibbe1,3, Marco Reisert2, Olaf Ronneberger1,3, and Hans Burkhardt1,3

1 Department of Computer Science, Albert-Ludwigs-Universität Freiburg, Germany
2 Dept. of Diagnostic Radiology, Medical Physics, University Medical Center, Freiburg
3 Center for Biological Signalling Studies (bioss), Albert-Ludwigs-Universität Freiburg
{skibbe,ronneber,Hans.Burkhardt}@informatik.uni-freiburg.de,
[email protected]
Abstract. Spherical harmonics are widely used in 3D image processing due to their compactness and rotation properties. For example, it is quite easy to obtain rotation invariance by taking the magnitudes of the representation, similar to the power spectrum known from Fourier analysis. We propose a novel approach extending the spherical harmonic representation to tensors of higher order in a very efficient manner. Our approach utilises the so called tensorial harmonics [1] to overcome the restrictions to scalar fields. In this way it is possible to represent vector and tensor fields with all the gentle properties known from spherical harmonic theory. In our experiments we have tested our system by using the most commonly used tensors in three dimensional image analysis, namely the gradient vector, the Hessian matrix and finally the structure tensor. For comparable results we have used the Princeton Shape Benchmark [2] and a database of airborne pollen, leading to very promising results.
1 Introduction
In modern image processing and classification tasks we are facing an increasing amount of three-dimensional data. Since objects in different orientations are usually considered to be the same, descriptors that are rotationally invariant are needed. One possible solution are features which rely on the idea of group integration, where certain features are averaged over the whole group to become invariant [3]. Here we face the problem of deriving such features in an efficient manner. In the case of 3D rotations one of the most efficient and effective approaches utilises the theory of spherical harmonics [4]. This representation allows the group integration to be accomplished analytically. In practice, the magnitudes of certain subbands of the spherical harmonic representation have to be taken to become invariant. But there is one bottleneck that limits the creativity of designing features based on spherical harmonics: they represent scalar functions. This means that,
for example, vector-valued functions, like the gradient field, cannot be put into the spherical harmonics framework without losing the nice rotation properties (which are of particular importance for the design of invariant features). We are restricted to features with scalar components that are not interrelated by a global rotation; only then does a component-wise spherical harmonic transformation lead to rotation invariant features. This is where our new approach comes in. Imagine that all the fantastic features which have already been proposed on the basis of the spherical harmonic approach could be generalised to vector-valued or even tensor-valued fields. What we propose is exactly this: the natural extension of the spherical harmonic framework to arbitrarily ranked tensor fields, in particular including vector fields (e.g. gradient fields or gradient vector flow) and rank-2 tensor fields (e.g. the Hessian or the structure tensor). This is achieved by utilising the theory of spherical tensor analysis [1]. Doing so gives us the possibility to transform tensor fields of any rank into representations that share all the nice properties of ordinary spherical harmonic transformations. Additionally, we show how to compute these tensor field transformations efficiently by using existing tools for fast computation of spherical harmonic representations [5,6]. This paper is divided into six sections. In section 2 we introduce the fundamental mathematical definitions needed in the later sections. Section 3 introduces the tensorial harmonic expansion as a natural extension of the spherical harmonic expansion. We further show how rotation invariant features can be obtained in a manner similar to [4]. Section 4 addresses the problem of efficient tensor expansion and offers a solution utilising spherical harmonics. In section 5 we give all the details necessary to transform commonly used real cartesian tensors up to rank 2 into our framework. Finally, we present our experiments in section 6. We successfully applied our approach to commonly used tensors, namely vectors and matrices. The promising results of the examples aim to encourage the reader to consider the use of the approach proposed here. The conclusion points out some ideas that were not investigated here and might be considered in future research.
2 Preliminaries
We assume that the reader has basic knowledge in cartesian tensor calculus. We further assume that the reader is familiar with the basic theory and notations of the harmonic analysis of SO(3), meaning he should have knowledge both in spherical harmonics and in Wigner D-Matrices and their natural relation to Clebsch-Gordan coefficients. He also should know how and why we can obtain rotation invariant features from spherical harmonic coefficients [4], because we will adapt this approach directly to tensorial harmonics. A good start for readers who are completely unfamiliar with the theory of the harmonic analysis of SO(3) might be [7] where a basic understanding of spherical harmonics is given, focused on a practical point of view. The design of rotation invariant spherical harmonic features was first addressed in [4]. Deeper views into the theory are given in [8,1,9]. However, we first want to recapitulate the mathematical constructs and definitions which we will use in the following sections.
We denote by $\{\mathbf{e}^j_m\}_{m=-j\ldots j}$ the standard basis of $\mathbb{C}^{2j+1}$. The standard coordinate vector $\mathbf{r} = (x, y, z)^T \in \mathbb{R}^3$ has a natural relation to an element $\mathbf{u} \in \mathbb{C}^3$ by the unitary coordinate transformation S:

$$S = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 & -i & 0 \\ 0 & 0 & \sqrt{2} \\ 1 & -i & 0 \end{pmatrix} \quad (1)$$

with $\mathbf{u} = S\mathbf{r}$. Let $\mathbf{D}^j_g$ be the unitary irreducible representation of a $g \in SO(3)$ of order $j \in \mathbb{N}_0$, acting on the vector space $\mathbb{C}^{2j+1}$. They are widely known as Wigner-D matrices [8]. The representation $\mathbf{D}^1_g$ is directly related by S to the real-valued rotation matrix $\mathbf{U}_g \in \mathbb{R}^{3\times3}$, namely $\mathbf{D}^1_g = S\mathbf{U}_g S^*$, where $S^*$ is the adjoint (conjugate transpose) of S. Depending on the context we will also express the coordinate vector $\mathbf{r} \in \mathbb{R}^3$ in spherical coordinates $(r, \theta, \phi)$, which is closer to the commonly used notation of spherical harmonics, where

$$r = \sqrt{x^2 + y^2 + z^2}, \quad \theta = \arccos\frac{z}{\sqrt{x^2 + y^2 + z^2}}, \quad \phi = \mathrm{atan2}(y, x), \quad (2)$$

e.g. we sometimes write $f(r, \theta, \phi)$ instead of $f(\mathbf{r})$.

Definition 1. A function $\mathbf{f} : \mathbb{R}^3 \to \mathbb{C}^{2j+1}$ is called a spherical tensor field of rank j if it transforms with respect to rotation as

$$\forall g \in SO(3): \quad (g\mathbf{f})(\mathbf{r}) := \mathbf{D}^j_g \mathbf{f}(\mathbf{U}^T_g \mathbf{r}). \quad (3)$$
The space of all spherical tensor fields of rank j is denoted by $\mathcal{T}_j$. We further need to define the family of bilinear forms which we use to couple spherical tensors of different ranks.

Definition 2. For every $j \ge 0$ we define the family of bilinear forms $\circ_j : \mathbb{C}^{2j_1+1} \times \mathbb{C}^{2j_2+1} \to \mathbb{C}^{2j+1}$ that only exists for those triples of $j_1, j_2, j \in \mathbb{N}_0$ that fulfil the triangle inequality $|j_1 - j_2| \le j \le j_1 + j_2$:

$$(\mathbf{e}^j_m)^T (\mathbf{v} \circ_j \mathbf{w}) := \sum_{m_1=-j_1}^{j_1} \sum_{m_2=-j_2}^{j_2} \langle j_1 m_1, j_2 m_2 \mid jm\rangle\, v_{m_1} w_{m_2} = \sum_{m = m_1 + m_2} \langle j_1 m_1, j_2 m_2 \mid jm\rangle\, v_{m_1} w_{m_2} \quad (4)$$

where $\langle j_1 m_1, j_2 m_2 \mid jm\rangle$ are the Clebsch-Gordan coefficients (they are zero if $m_1 + m_2 \ne m$). One of the orthogonality properties of the Clebsch-Gordan coefficients that will be used later is given by

$$\sum_{m_1, m} \langle j_1 m_1, j_2 m_2 \mid jm\rangle\, \langle j_1 m_1, j_2' m_2' \mid jm\rangle = \frac{2j+1}{2j_2'+1}\, \delta_{j_2, j_2'}\, \delta_{m_2, m_2'} \quad (5)$$

where $\delta$ is the Kronecker symbol.
3 Rotation Invariant Features from Tensorial Harmonics
Combining all the previously defined pieces we can now formalise an expansion of a spherical tensor field $\mathbf{f} \in \mathcal{T}_\ell$ using the notation proposed in [1]:

$$\mathbf{f}(r, \theta, \phi) = \sum_{j=0}^{\infty} \sum_{k=-\ell}^{\ell} \mathbf{a}^j_k(r) \circ_\ell \mathbf{Y}^j(\theta, \phi) \quad (6)$$

with expansion coefficients $\mathbf{a}^j_k(r) \in \mathbb{C}^{2(j+k)+1}$, and the well-known spherical harmonics $\mathbf{Y}^j \in \mathbb{C}^{2j+1}$. Note that we always use the semi-Schmidt normalised spherical harmonics. In the special case where $\ell = 0$ the expansion coincides with the ordinary scalar spherical harmonic expansion. The important property of the tensorial harmonic expansion is given by

$$(g\mathbf{f})(\mathbf{r}) = \mathbf{D}^\ell_g \mathbf{f}(\mathbf{U}_g^T \mathbf{r}) = \sum_{j=0}^{\infty} \sum_{k=-\ell}^{\ell} \left(\mathbf{D}^{j+k}_g \mathbf{a}^j_k(r)\right) \circ_\ell \mathbf{Y}^j(\theta, \phi). \quad (7)$$

This means that a rotation of the tensor field by $\mathbf{D}^\ell_g$ causes the expansion coefficients $\mathbf{a}^j_k(r)$ to be transformed by $\mathbf{D}^{j+k}_g$. This is an important fact which we will use when we aim to obtain rotation invariant features from tensorial harmonic coefficients.

3.1 Designing Features

Facing the problem of designing features describing three dimensional image data, the spherical harmonic based method proposed in [4] is widely known and used to transform non-rotation-invariant features into rotation invariant representations, as seen e.g. in [10,11]. Considering eq. (7) it can easily be seen that for each coefficient $\mathbf{a}^j_k(r)$ a feature $c^j_k(r) \in \mathbb{R}$ can be computed that is invariant to arbitrary rotations $\mathbf{D}^\ell_g$ acting on a tensor field $\mathbf{f} \in \mathcal{T}_\ell$:

$$c^j_k(r) = \|\mathbf{D}^{j+k}_g \mathbf{a}^j_k(r)\| = \sqrt{\langle \mathbf{D}^{j+k}_g \mathbf{a}^j_k(r), \mathbf{D}^{j+k}_g \mathbf{a}^j_k(r)\rangle} = \sqrt{\langle (\mathbf{D}^{j+k}_g)^* \mathbf{D}^{j+k}_g \mathbf{a}^j_k(r), \mathbf{a}^j_k(r)\rangle} = \sqrt{\langle \mathbf{a}^j_k(r), \mathbf{a}^j_k(r)\rangle} = \|\mathbf{a}^j_k(r)\| \quad (8)$$

By now the generation of features is just the natural extension of the features proposed in [4], adapted to tensor fields of arbitrary order. In addition, we can also consider the interrelation of different coefficients of equal rank. For a tensor field of order $\ell$ we can combine $2\ell+1$ coefficients. For two different coefficients $\mathbf{a}^j_k(r)$ and $\mathbf{a}^{j'}_{k'}(r)$ with $j + k = j' + k'$ we can easily extend the feature defined above such that the following feature is also unaffected by arbitrary rotations:

$$c^{jj'}_{kk'}(r) = |\langle \mathbf{D}^{j+k}_g \mathbf{a}^j_k(r), \mathbf{D}^{j'+k'}_g \mathbf{a}^{j'}_{k'}(r)\rangle| = |\langle \mathbf{a}^j_k(r), \mathbf{a}^{j'}_{k'}(r)\rangle| \quad (9)$$
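A compact sketch of how the invariants of eqs. (8) and (9) could be assembled from a set of expansion coefficients is given below; the dictionary-based data layout is purely an assumption for illustration.

```python
import numpy as np

def tensorial_harmonic_invariants(coeffs):
    """Rotation-invariant features from tensorial harmonic coefficients.

    `coeffs` is assumed to map (j, k) to the coefficient vector a^j_k (a complex
    array of length 2(j+k)+1) at one fixed radius r. Returns the norms
    ||a^j_k|| (eq. 8) plus the cross magnitudes |<a^j_k, a^j'_k'>| for pairs of
    equal rank j+k = j'+k' (eq. 9)."""
    keys = sorted(coeffs.keys())
    feats = [np.linalg.norm(coeffs[key]) for key in keys]
    for i, (j, k) in enumerate(keys):
        for (j2, k2) in keys[i + 1:]:
            if j + k == j2 + k2:
                feats.append(abs(np.vdot(coeffs[(j, k)], coeffs[(j2, k2)])))
    return np.asarray(feats)
```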
4 Fast Computation of Tensorial Harmonic Coefficients
In the current section we want to derive a computation rule for the tensorial harmonic coefficients based on the ordinary spherical harmonic expansion. This is very important, since spherical harmonic expansions can be realized in a very efficient manner [6]. It is obvious that each of the components $(\mathbf{e}^\ell_M)^T \mathbf{f}(\mathbf{r})$ of a spherical tensor field $\mathbf{f} \in \mathcal{T}_\ell$ can be separately expanded by an ordinary spherical harmonic expansion:

$$(\mathbf{e}^\ell_M)^T \mathbf{f}(r, \theta, \phi) = \sum_{j=0}^{\infty} \mathbf{b}^j_M(r)^T \mathbf{Y}^j(\theta, \phi) \quad (10)$$
where the $\mathbf{b}^j_M(r) \in \mathcal{T}_j$ are the spherical harmonic coefficients. Combining eq. (10) and eq. (6) we obtain a system of equations which allows us to determine the relation between the tensorial harmonic coefficients $\mathbf{a}^j_k(r)$ and the spherical harmonic coefficients $\mathbf{b}^j_M(r)$:

$$(\mathbf{e}^\ell_M)^T \mathbf{f}(r,\theta,\phi) = \sum_{j=0}^{\infty}\sum_{k=-\ell}^{\ell} \left(\mathbf{a}^j_k(r) \circ_\ell \mathbf{Y}^j(\theta,\phi)\right)_M = \sum_{j=0}^{\infty}\sum_{k=-\ell}^{\ell}\sum_{M=m+n} a^j_{km}(r)\, \langle (j{+}k)m, jn \mid \ell M\rangle\, Y^j_n(\theta,\phi)$$
$$= \sum_{j=0}^{\infty}\sum_{k=-\ell}^{\ell}\sum_{m=-(j+k)}^{j+k}\sum_{n=-j}^{j} a^j_{km}(r)\, \langle (j{+}k)m, jn \mid \ell M\rangle\, Y^j_n(\theta,\phi) = \sum_{j=0}^{\infty}\sum_{n=-j}^{j} Y^j_n(\theta,\phi) \underbrace{\sum_{k=-\ell}^{\ell}\sum_{m=-(j+k)}^{j+k} a^j_{km}(r)\,\langle (j{+}k)m, jn \mid \ell M\rangle}_{=\, b^j_{M,n}(r)}$$
$$= \sum_{j=0}^{\infty}\sum_{n=-j}^{j} b^j_{M,n}(r)\, Y^j_n(\theta,\phi) = \sum_{j=0}^{\infty} \mathbf{b}^j_M(r)^T \mathbf{Y}^j(\theta,\phi) \quad (11)$$
With use of eq. (11) we can directly observe that

$$b^j_{M,n}(r) = \sum_{k=-\ell}^{\ell}\sum_{m=-(j+k)}^{j+k} a^j_{km}(r)\, \langle (j{+}k)m, jn \mid \ell M\rangle. \quad (12)$$

Multiplying both sides with $\langle (j{+}k')m', jn \mid \ell M\rangle$ results in

$$b^j_{M,n}(r)\, \langle (j{+}k')m', jn \mid \ell M\rangle = \sum_{k=-\ell}^{\ell}\sum_{m=-(j+k)}^{j+k} a^j_{km}(r)\, \langle (j{+}k)m, jn \mid \ell M\rangle\, \langle (j{+}k')m', jn \mid \ell M\rangle \quad (13)$$
Summing over all n and M leads to

$$\sum_{M,n} b^j_{M,n}(r)\, \langle (j{+}k')m', jn \mid \ell M\rangle = \sum_{k=-\ell}^{\ell}\sum_{m=-(j+k)}^{j+k} a^j_{km}(r)\, \underbrace{\sum_{M,n} \langle (j{+}k)m, jn \mid \ell M\rangle\, \langle (j{+}k')m', jn \mid \ell M\rangle}_{=\, \frac{2\ell+1}{2(j+k')+1}\,\delta_{k,k'}\,\delta_{m,m'}}. \quad (14)$$

Due to the orthogonality of the Clebsch-Gordan coefficients (5) all addends with $m \ne m'$ or $k \ne k'$ vanish:

$$\sum_{M,n} b^j_{M,n}(r)\, \langle (j{+}k')m', jn \mid \ell M\rangle = \frac{2\ell+1}{2(j+k')+1}\, a^j_{k'm'}. \quad (15)$$

Finally, we obtain our computation rule which allows us to easily and efficiently compute the tensorial harmonic coefficients $\mathbf{a}^j_k \in \mathcal{T}_{j+k}$ based on the spherical harmonic expansion of the individual components of a given tensor field $\mathbf{f}$:

$$a^j_{k'm'} = \frac{2(j+k')+1}{2\ell+1} \sum_{M=-\ell}^{\ell}\sum_{n=-j}^{j} b^j_{M,n}(r)\, \langle (j{+}k')m', jn \mid \ell M\rangle. \quad (16)$$
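The computation rule of eq. (16) translates directly into the small sketch below. The use of SymPy for the Clebsch-Gordan coefficients and the indexing convention are assumptions of this illustration; it is a correctness-oriented sketch, not the efficient FFT-based implementation referred to in the paper.

```python
import numpy as np
from sympy.physics.quantum.cg import CG

def tensorial_from_spherical(b, ell, j, k):
    """Tensorial harmonic coefficient a^j_k from the per-component spherical
    harmonic coefficients, following eq. (16).

    `b` is assumed to be a complex array of shape (2*ell + 1, 2*j + 1) with
    b[M + ell, n + j] = b^j_{M,n}(r) at one fixed radius."""
    J = j + k                       # rank of the coefficient a^j_k
    a = np.zeros(2 * J + 1, dtype=complex)
    for m in range(-J, J + 1):
        acc = 0.0 + 0.0j
        for M in range(-ell, ell + 1):
            for n in range(-j, j + 1):
                if m + n != M:      # Clebsch-Gordan coefficient vanishes
                    continue
                acc += b[M + ell, n + j] * float(CG(J, m, j, n, ell, M).doit())
        a[m + J] = (2 * J + 1) / (2 * ell + 1) * acc
    return a
```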
5 Transforming Cartesian Tensors into Spherical Tensors
The question that has not been answered yet is how these spherical tensor fields are related to cartesian tensor fields like scalars, vectors and matrices. In the following we show how cartesian tensors up to rank two can easily be transformed into a spherical tensor representation which then can be used to obtain rotation invariant features. For scalars the answer is trivial. For rank 1 it is the unitary transformation S that directly maps the real-valued cartesian vector $\mathbf{r} \in \mathbb{R}^3$ to its spherical counterpart. More complicated is the case of real-valued tensors $\mathbf{T} \in \mathbb{R}^{3\times3}$ of rank 2. Nevertheless, we will see that the vector space of real cartesian tensors of rank 2 covers tensors of rank 1 and 0, too. Due to this fact we can build up our system covering all three cases by just considering the current case. There exists a unique cartesian tensor decomposition for tensors $\mathbf{T} \in \mathbb{R}^{3\times3}$:

$$\mathbf{T} = \alpha \mathbf{I}_{3\times3} + \mathbf{T}_{anti} + \mathbf{T}_{sym} \quad (17)$$

where $\mathbf{T}_{anti}$ is an antisymmetric matrix, $\mathbf{T}_{sym}$ a traceless symmetric matrix and $\alpha \in \mathbb{R}$. The corresponding spherical decomposition is then given by

$$v^j_m = \sum_{m = m_1 + m_2} (-1)^{m_1}\, \langle 1m_1, 1m_2 \mid jm\rangle\, T^s_{1-m_1,\,1+m_2} \quad (18)$$
where $\mathbf{T}^s = S\mathbf{T}S^*$ and $\mathbf{v}^j \in \mathbb{C}^{2j+1}$, $j = 0, 1, 2$. Note that the spherical tensor $\mathbf{v}^0$ corresponds to $\alpha$, namely a scalar. The real-valued cartesian representation of $\mathbf{v}^1$ is the antisymmetric matrix $\mathbf{T}_{anti}$ or equivalently a vector in $\mathbb{R}^3$, and $\mathbf{v}^2$ has its cartesian representation in $\mathbb{R}^{3\times3}$ by a traceless symmetric matrix $\mathbf{T}_{sym}$.

Proposition 1. The spherical tensors $\mathbf{v}^0, \mathbf{v}^1, \mathbf{v}^2$ are the results of the spherical decomposition of the real-valued cartesian tensor $\mathbf{T} = \begin{pmatrix} t_{00} & t_{01} & t_{02} \\ t_{10} & t_{11} & t_{12} \\ t_{20} & t_{21} & t_{22} \end{pmatrix}$ of rank 2, with:

$$\mathbf{v}^0 = \frac{-(t_{00} + t_{11} + t_{22})}{\sqrt{3}},$$

$$\mathbf{v}^1 = \begin{pmatrix} \frac{1}{2}\left(t_{02} - t_{20} + i(t_{21} - t_{12})\right) \\ \frac{i}{\sqrt{2}}\left(t_{01} - t_{10}\right) \\ \frac{1}{2}\left(t_{02} - t_{20} - i(t_{21} - t_{12})\right) \end{pmatrix},$$

$$\mathbf{v}^2 = \begin{pmatrix} \frac{1}{2}\left(t_{00} - t_{11} - i(t_{01} + t_{10})\right) \\ \frac{1}{2}\left(-(t_{02} + t_{20}) + i(t_{12} + t_{21})\right) \\ \frac{-1}{\sqrt{6}}\left(t_{00} + t_{11} - 2t_{22}\right) \\ \frac{1}{2}\left((t_{02} + t_{20}) + i(t_{12} + t_{21})\right) \\ \frac{1}{2}\left(t_{00} - t_{11} + i(t_{01} + t_{10})\right) \end{pmatrix}$$

where $\mathbf{v}^0 \in \mathbb{C}^1$, $\mathbf{v}^1 \in \mathbb{C}^3$ and $\mathbf{v}^2 \in \mathbb{C}^5$.
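Proposition 1 can be written down directly as a small numerical routine; the component ordering of the returned vectors (m = -j..j from top to bottom, as listed above) is an assumption of this sketch.

```python
import numpy as np

def cartesian_to_spherical_rank2(T):
    """Spherical decomposition of a real 3x3 tensor into v0, v1, v2 as in
    Proposition 1."""
    t = np.asarray(T, dtype=float)
    v0 = np.array([-(t[0, 0] + t[1, 1] + t[2, 2]) / np.sqrt(3)], dtype=complex)
    v1 = np.array([
        0.5 * (t[0, 2] - t[2, 0] + 1j * (t[2, 1] - t[1, 2])),
        (1j / np.sqrt(2)) * (t[0, 1] - t[1, 0]),
        0.5 * (t[0, 2] - t[2, 0] - 1j * (t[2, 1] - t[1, 2])),
    ])
    v2 = np.array([
        0.5 * (t[0, 0] - t[1, 1] - 1j * (t[0, 1] + t[1, 0])),
        0.5 * (-(t[0, 2] + t[2, 0]) + 1j * (t[1, 2] + t[2, 1])),
        -(t[0, 0] + t[1, 1] - 2 * t[2, 2]) / np.sqrt(6),
        0.5 * ((t[0, 2] + t[2, 0]) + 1j * (t[1, 2] + t[2, 1])),
        0.5 * (t[0, 0] - t[1, 1] + 1j * (t[0, 1] + t[1, 0])),
    ])
    return v0, v1, v2
```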
6 Experiments
We perform experiments comparing tensorial harmonic descriptors derived from different tensors. For testing we use the Princeton Shape Benchmark (PSB) [2], based on 1814 triangulated objects divided into 161 classes. We represent the models in a 150³ voxel grid. The objects are translationally normalised with respect to their centre of gravity. We further perform experiments based on an airborne pollen database containing 389 files equally divided into 26 classes [12,11]. All pollen are normalised to a spherical representation with a radius of 85 voxels (Figure 1). In both experiments we compute the first- and second-order derivatives for each object and perform a discrete coordinate transform according to eq. (2) for the intensity values and the derivatives. For each radius (in steps of one voxel), the angles θ and φ are sampled in 64 steps for models of the PSB. In the case of the pollen database we use a spherical resolution of 128 steps for θ and 128 steps for φ. In addition to the ordinary spherical harmonic expansion (denoted as SH) of the scalar-valued intensity fields, we do the tensorial harmonic expansion of the following cartesian tensor fields according to Proposition 1 and eq. (16):
Fig. 1. The 26 classes of the spherically normalised airborne pollen dataset
Fig. 2. PSB containing 1814 models divided into 161 classes
Vectorial Harmonic Expansion (VH). Similar to spherical harmonics, the vectorial harmonics were first used in a physical context [13]. For convenience we prefer the representation of 2nd-order tensors using the axiator, despite the fact that gradient vectors only have rank 1 (eq. (18)). Using Proposition 1 we transform the cartesian gradient vector field into its spherical counterpart and do the tensorial harmonic expansion:

$$\nabla I^\times = \begin{pmatrix} 0 & -I_z & I_y \\ I_z & 0 & -I_x \\ -I_y & I_x & 0 \end{pmatrix} \quad (19)$$

where $\nabla$ is the nabla operator, $\times$ denotes the axiator, and we use the notation $I_x := \frac{\partial I}{\partial x}$.
Leave-one-out cross-validation.
Increasing the Dimension of Creativity
149
Table 1. PSB: Results of the test-set (left) and training set (right). The subscribed number 2 means features based on eq. (9), other wise based on eq. (8). To show the superiority of tensorial harmonics over the spherical harmonics we also give the results for the best corresponding SH-feature (SH∗ ) from [2]. Method StrH2 StrH HH2 VH2 VH HH SH SH∗
NN 61.6% 61.0% 58.5% 58.0% 57.7% 56.9% 52.5% 55.6%
1stT 34.3% 33.5% 31.5% 31.6% 30.8% 30.5% 27.2% 30.9%
2ndT 44.2% 43.6% 40.5% 40.7% 39.9% 39.7% 36.2% 41.1%
EM 26.1% 25.4% 24.5% 24.5% 23.7% 23.8% 21.6% 24.1%
DCG 60.9% 60.2% 58.5% 58.5% 57.6% 57.5% 54.5% 58.4%
Method StrH2 StrH HH2 VH2 VH HH SH
60
2ndT 44.5% 43.5% 42.2% 42.0% 40.0% 40.3% 36.2%
EM 25.1% 24.4% 23.7% 23.6% 22.5% 22.6% 20.2%
DCG 61.9% 61.3% 60.2% 59.7% 58.4% 58.9% 55.9%
90
correctly classified in %
correctly classified in %
1stT 34.6% 33.8% 31.8% 31.6% 30.4% 30.7% 26.8%
100
50
40
30
20
10
0
NN 61.7% 61.4% 59.3% 58.9% 56.6% 57.6% 55.8%
80 70 60 50 40
1 NN 2 NN 3 NN 4 NN minimum number of correct nearest neighbours
30
1
2
3 4 5 6 7 8 minimum number of correct nearest neighbours
9
10
Fig. 3. (left): LOOCV of the whole PSB dataset, demanding 1, 2, 3 and 4 correct NN. (right): LOOCV results of the pollen dataset, showing the performance when demanding up to 10 correct nearest neighbours.
We secondly perform experiments on the airborne pollen database. The expansions are done up to the 40th band. We compute features based on eq. (8) in the same manner as for the PSB experiment. The results of a LOOCV showing the performance of the features are depicted in the right graph of figure 3.
7
Conclusion
We presented a new method with which tensor fields of higher order can be described in a rotation invariant manner. We further have shown how to compute tensor field transformations efficiently using a componentwise spherical harmonics transformation. The conducted experiments concerning higher order tensors led to the highest results and have prooven our assumption that the consideration of higher order tensors for feature design is very promising. Taking advantage of the presence of different expansion coefficient with equal rank of higher order tensors additionally improved our results. But we also observed that we can’t give a fixed ranking of the performance of the investigated tensors. Considering
150
H. Skibbe et al.
the results of the PSB the structural harmonic features performed best. In contrast they have shown the worst performance in the pollen classification task. For future work we want to apply our method to tensors based on biological multi channel data. We further aim to examine features based on the gradient vector flow. Acknowledgement. This study was supported by the Excellence Initiative of the German Federal and State Governments (EXC 294).
References 1. Reisert, M., Burkhardt, H.: Efficient tensor voting with 3d tensorial harmonics. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008. CVPRW 2008, pp. 1–7 (2008) 2. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The princeton shape benchmark. In: Shape Modeling and Applications, pp. 167–178 (2004) 3. Reisert, M.: Group Integration Techniques in Pattern Analysis - A Kernel View. PhD thesis, Albert-Ludwigs-Universit¨ at Freiburg (2008) 4. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Symposium on Geometry Processing (June 2003) 5. Kostelec, P.J., Rockmore, D.N.: S2kit: A lite version of spharmonickit. Department of Mathematics. Dartmouth College (2004) 6. Healy, D.M., Rockmore, D.N., Moore, S.S.B.: Ffts for the 2-sphere-improvements and variations. Technical report, Hanover, NH, USA (1996) 7. Green, R.: Spherical harmonic lighting: The gritty details. In: Archives of the Game Developers Conference (March 2003) 8. Rose, M.: Elementary Theory of Angular Momentum. Dover Publications (1995) 9. Brink, D.M., Satchler, G.R.: Angular Momentum. Oxford Science Publications (1993) 10. Reisert, M., Burkhardt, H.: Second order 3d shape features: An exhaustive study. C&G, Special Issue on Shape Reasoning and Understanding 30(2) (2006) 11. Ronneberger, O., Wang, Q., Burkhardt, H.: 3D invariants with high robustness to local deformations for automated pollen recognition. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 425–435. Springer, Heidelberg (2007) 12. Ronneberger, O., Burkhardt, H., Schultz, E.: General-purpose Object Recognition in 3D Volume Data Sets using Gray-Scale Invariants – Classification of Airborne Pollen-Grains Recorded with a Confocal Laser Scanning Microscope. In: Proceedings of the International Conference on Pattern Recognition, Quebec, Canada (2002) 13. Morse, P.M., Feshbach, H.: Methods of Theoretical Physics, Part II. McGraw-Hill, New York (1953)
Training for Task Specific Keypoint Detection Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua CVLab EPFL Lausanne Switzerland
Abstract. In this paper, we show that a better performance can be achieved by training a keypoint detector to only find those points that are suitable to the needs of the given task. We demonstrate our approach in an urban environment, where the keypoint detector should focus on stable man-made structures and ignore objects that undergo natural changes such as vegetation and clouds. We use WaldBoost learning with task specific training samples in order to train a keypoint detector with this capability. We show that our aproach generalizes to a broad class of problems where the task is known beforehand.
1 Introduction State of the art keypoint descriptors such as SIFT [1] or SURF [2] are designed to be insensitive to both perspective distortion and illumination changes, which allows for images obtained from different viewpoints and under different lighting conditions to be successfully matched. This capability is hindered by the fact that general-purpose keypoint detectors exhibit a performance which deteriorates with seasonal changes and variations in lighting. A standard approach to coping with this difficulty is to set the parameters of the detectors so that a far greater number of keypoints than necessary are identified, in the hope that enough will be found consistently across multiple images. This method, however, entails performing unnecessary computations and increases the chances of mismatches. In this paper, we show that when training data is available for a specific task , we can do better by training a keypoint detector to only identify those points that are relevant to the needs of the given task. We demonstrate our approach in an urban environment where the detector should focus on stable man-made structures and ignore the surrounding vegetation, the sky and the various shadows, all of which display features that do not persist with seasonal and lighting changes. We rely on WaldBoost learning [3], similar in essence to the recent work [4] by the same authors, to learn a classifier that responds more frequently on stable structures. Task-specific keypoint detection is known to play an important role in human perception. Among the early seminal studies is that of Yarbus [5] where it was demonstrated that a subject’s gaze is drawn to relevant aspects of a scene and that eye movements are highly influenced by the assigned task, for instance memorization. To the best of our knowledge, these ideas have not yet made their mark for image-matching purposes. Our main contribution is to show that image matching algorithms benefit from incorporating task-specific keypoint detection. J. Denzler, G. Notni, and H. S¨uße (Eds.): DAGM 2009, LNCS 5748, pp. 151–160, 2009. c Springer-Verlag Berlin Heidelberg 2009
We begin this paper with a brief review of related approaches. Next, we discuss in more detail what constitutes a stable keypoint that an optimized detector should identify and introduce our approach to training such a detector. Experimental results are then presented for the structure and motion problem, where our goal is to build a keypoint detector, called TaSK (Task Specific Keypoint), that focuses on stable man-made structures. We also show the result of a keypoint detector that was trained to focus on face features. Finally, we conclude with a discussion.
2 Related Work State of the art keypoint detectors fall into two broad categories: those that are designed to detect corners on one hand, and those that detect blob-like image structures on the other. An extensive overview can be found in Tuytelaars et al. [6]. Corner-like detectors such as Harris, FAST [7] and Förstner [8] [9,10] are often used for pose and image localization problems. These detectors have a high spatial precision in the image plane but are not scale invariant and are therefore used for small baseline matching or tracking. The other category of keypoint detectors aims at detecting blob structures (SIFT [1], MSER [11] or SURF [2]). They provide a scale estimate, which renders them suited for wide-baseline matching [12,13] or for the purpose of object detection and categorization. Both detector types can be seen as general-purpose hand-crafted detectors, which for many applications run at a very high false positive rate to prevent failures from missed keypoints. Our approach is most related to the work of Šochman and Matas [4]. These authors emulate the behavior of a keypoint detector using the boosting learning method. They show that the emulated detector achieves equivalent performance with a substantial speed improvement. Rosten and Drummond [7,14] applied a similar idea to make fast decisions about the presence of a keypoint in an image patch. There, learning techniques are used to enhance the detection speed for general-purpose keypoint detection. Note that their work does not focus on task-specific keypoint detection, which is the aim of this paper. Similar in spirit is also the work of Kienzle et al. [15], in which human eye movement data is used to train a saliency detector.
3 Task Specific Keypoints Training data can be used in various ways to improve keypoint detection. We will describe two approaches in the following sections. 3.1 Detector Verification Suppose we are given a keypoint detector K and a specific task for which training data is available. The most natural way to enhance keypoint detection is based on a post-filtering process: among all detections which are output by the detector K, we are interested only in the keypoints that are relevant given the training data. Our enhanced keypoint detector would then output all low-level keypoints and add a classification stage which rejects unreliable keypoints based on the learned appearance.
Fig. 1. Keypoint detections by DoG (top) and our proposed detector TaSK (bottom). Note that TaSK is specialized to focus more on stable man-made structures and ignores vegetation and sky features.
3.2 Detector Learning In order to learn the appearance of good keypoints we need to specify how they are characterized. In particular we need to specify the conditions under which a pixel can be regarded as a good keypoint. We will use the following two criteria:
1. A good keypoint can be reliably matched over many images.
2. A good keypoint is well localized, meaning its descriptor is sufficiently different from the descriptors of its neighboring pixels.
All pixels that obey these criteria will constitute the positive input class to our learning, while the negative training examples are random samples of the training images. Our method is based on WaldBoost learning [3], similar in spirit to the work of Šochman and Matas [4]. Using our aforementioned training examples, we learn a classifier that responds more frequently on stable structures such as buildings and ignores unstable ones such as vegetation or shadows. Our eventual goal is to only detect keypoints that can be reliably matched. The advantage is not only a better registration, but also a speed-up in the calibration. For the WaldBoost training we used images taken by a panorama camera. These images have been taken from the same viewpoint every 10 minutes over the past four years. This massive training set captures light and seasonal changes but does not cover appearance variations which are due to changes in viewpoint. 3.3 Training Samples The generation of the training samples is an important preliminary step for the detector learning since the boosting algorithm optimizes for the provided training samples. In [3], the set of training samples fed into the boosting algorithm is the set of all keypoints identified by a specific detector. In so doing, the learned detector is naturally no more than an emulation of the detector for the training samples. Our research aims at generating a narrower set of training samples, which obey the criteria proposed in section 3.2. In a first step, we used the Förstner [8] operator to find keypoint candidates which are well localized in the images. In a second step, keypoints which are estimated to have poor reliability for reconstruction purposes are pruned. The automated selection of keypoints is based on two features: the number of occurrences of a keypoint and the stability of a descriptor at a specific position over several images of the sequences. The number of occurrences is simply a count of how many times a fixed pixel position has been detected as a keypoint in several images of the same scene. To illustrate our measure of stability, let p_i^j denote the position of the i-th keypoint in the j-th image, i = 1 . . . N_j, j = 1 . . . N_images. The union P = ∪_{i,j} p_i^j contains all the positions which have been detected in at least one image. In all the images a SIFT descriptor s_k^j is calculated for every single position p_k ∈ P. For the stability of the descriptor, Euclidean distances d_k^{j1,j2} = dist(s_k^{j1}, s_k^{j2}) are calculated and their median d_k = median(d_k^{j1,j2}), j1 ≠ j2, is determined. The more stable a keypoint is in time, the smaller its median will be. A pixel position is then classified as a good keypoint if its occurrence count is high and its descriptor median is low: two thresholds were thus set so that a reasonable number of keypoints is obtained for our training set (a couple of thousand per image). These keypoints form the positive training set. The negative training examples are randomly sampled from the same images such that they are no closer than 5 pixels to any positive keypoint. Given these training examples we apply WaldBoost learning, as described in the next section.
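To make the selection of positive samples concrete, the following sketch (illustrative Python, not the authors' implementation) computes the occurrence count and the descriptor-median stability described above; the data structures, function name and thresholds are assumptions made only for this example.

import numpy as np

def select_positive_samples(positions_per_image, descriptors_per_image,
                            min_count=5, max_median_dist=0.4):
    # positions_per_image:   per image, an iterable of (x, y) pixel positions
    #                        found by the keypoint candidate detector
    # descriptors_per_image: per image, a dict mapping a position (x, y) to the
    #                        SIFT descriptor computed at that position
    # min_count / max_median_dist: illustrative thresholds on occurrence count
    #                        and descriptor-median stability
    counts = {}
    for positions in positions_per_image:
        for p in positions:
            p = tuple(p)
            counts[p] = counts.get(p, 0) + 1

    positives = []
    for p, count in counts.items():
        descs = [d[p] for d in descriptors_per_image if p in d]
        if len(descs) < 2:
            continue
        # median of pairwise Euclidean descriptor distances over all images
        dists = [np.linalg.norm(descs[a] - descs[b])
                 for a in range(len(descs)) for b in range(a + 1, len(descs))]
        if count >= min_count and np.median(dists) <= max_median_dist:
            positives.append(p)   # stable, frequently re-detected position
    return positives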
4 Keypoint Boosting Boosting works by sequentially applying a (usually weak) classification algorithm to a re-weighted set of training examples [16,17]. Given N training examples x_1 . . . x_N together with their corresponding labels y_1 . . . y_N, it is a greedy algorithm which leads to a classifier H(x) of the form:

H(x) = Σ_{t=1}^{T} h_t(x) ,    (1)
where h_t(x) ∈ H is a weak classifier from a pool H chosen to be simple and efficient to compute. H(x) is obtained sequentially by finding at each iteration t the weak classifier which minimizes the training error weighted by D_t(x_i):

Z_t = Σ_{i=1}^{N} D_t(x_i) exp(−y_i h_t(x_i)) .    (2)
The weights of each training sample, D_t(x_i), are initialized uniformly and updated according to the classification performance. One possibility to minimize eq. 2 uses domain partitioning [17], as explained next. 4.1 Fuzzy Weak Learning by Domain-Partitioning The minimization of eq. 2 includes the optimization over possible features with response function r(x) and over the partitioning of the feature response into k = 1 . . . K non-uniformly distributed bins. If a sample point x falls into the k-th bin, its corresponding weak classification result is approximated by c_k. This corresponds to the real version of AdaBoost.¹ By this partitioning model, eq. 2 can be written as (for the current state of training t):

Z = Σ_{k=1}^{K} Σ_{r(x_i)∈k} D(x_i) exp(−y_i c_k) .    (3)
To compute the optimal weak classifier for a given distribution D(x_i), many features r are sampled and the best one, i.e. the one with minimal Z, is kept. The optimal partitioning is obtained by rewriting eq. 3 for positive (y_i = 1) and negative (y_i = −1) training data:

Z = Σ_{k=1}^{K} [ W_k^+ exp(−c_k) + W_k^− exp(c_k) ] ,    (4)

where W_k^± = Σ_{r(x_i)∈k} D^±(x_i) is the sum of the positive and negative weights D_k^± that fall into a certain bin k.
¹ For the discrete AdaBoost algorithm, a weak classifier estimates one threshold t_0 and outputs α ∈ {−1, 1} depending on whether a data point is below or above this threshold.
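To illustrate the domain-partitioning step, a minimal sketch follows (hypothetical Python, not the authors' code). It assumes the confidence-rated per-bin outputs c_k = ½ ln(W_k^+ / W_k^−) of Schapire and Singer's real AdaBoost [17], which the text implies but does not state explicitly; bin edges and the smoothing constant are illustrative.

import numpy as np

def partition_score(responses, labels, weights, bin_edges, eps=1e-6):
    # responses: feature responses r(x_i); labels: +1 / -1; weights: D(x_i)
    responses = np.asarray(responses)
    labels = np.asarray(labels)
    weights = np.asarray(weights)
    bins = np.digitize(responses, bin_edges)        # bin index of each sample
    K = len(bin_edges) + 1
    w_pos = np.zeros(K)                             # W_k^+ of eq. (4)
    w_neg = np.zeros(K)                             # W_k^-
    for k in range(K):
        in_k = (bins == k)
        w_pos[k] = weights[in_k & (labels == 1)].sum()
        w_neg[k] = weights[in_k & (labels == -1)].sum()
    c = 0.5 * np.log((w_pos + eps) / (w_neg + eps)) # per-bin confidence c_k
    Z = np.sum(w_pos * np.exp(-c) + w_neg * np.exp(c))   # score of eq. (4)
    return Z, c

The weak learner with the smallest Z over all sampled features is the one kept, as stated above.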
ALGORITHM: WaldBoost Keypoint learning
Input: h ∈ H, (x_1, y_1) . . . (x_N, y_N), θ_+, θ_−
initialize weights D(x_i) = 1/N; mark all training examples as undecided {y_i* = 0}
For t = 1 . . . T (number of weak learners in the cascade)
   sample training examples x_i from the undecided examples {y_i* = 0}
   compute weights D(x_i) w.r.t. H_{t−1} for all {y_i* = 0}
   For s = 1 . . . S (number of weak learner trials)
      - sample weak learner h_t ∈ H
      - compute response r(x_i)
      - compute domain partitioning and score Z [17]
   End
   - among the S weak learners keep the best and add h_t to the strong classifier H_T = Σ_t h_t
   - sequential probability ratio test [3]: classify all current training examples into y_i* ∈ {+1, −1, 0}
End
Fig. 2. WaldBoost Keypoint learning
After finding the optimal weak learner, Wald's decision criterion is used to classify the training samples into {+1, −1, 0}, while the next weak learner is obtained by only using the undecided, zero-labelled, training examples. The entire algorithm is shown in Fig. 2. For more information we refer to the work of Schapire et al. [17]. 4.2 Weak Classifier The image features which are used for the weak classifiers are computed by using integral images and include color as well as gradient features. For the minimization of eq. 4, we first randomly sample a specific kind of weak classifier and then its parameters. The weak classifiers include:
– ratio of the mean colors of two rectangles: compares two color components of two rectangles at two different positions (2+4+4 parameters).
– mean color of a rectangle: measures the mean color components of a rectangle (1+2 parameters).
– roundness and intensity: integral images are computed from the components of the structure tensor; roundness and intensity as defined by Förstner and Gülch [8] are further computed on a randomly sampled rectangle size (2 parameters).
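These rectangle features are cheap because of integral images; the following sketch (illustrative Python, with assumed function names) shows the standard summed-area-table lookup and, on top of it, a feature in the spirit of the "ratio of mean colors of two rectangles" above.

import numpy as np

def integral_image(channel):
    # summed-area table with an extra zero row/column for simple lookups
    ii = np.cumsum(np.cumsum(np.asarray(channel, dtype=float), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)), mode="constant")

def rect_mean(ii, x, y, w, h):
    # mean value inside the rectangle [x, x+w) x [y, y+h) in O(1)
    s = ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
    return s / float(w * h)

def two_rect_color_ratio(ii_c1, r1, ii_c2, r2, eps=1e-6):
    # ratio of the mean of color component c1 over rectangle r1 = (x, y, w, h)
    # to the mean of color component c2 over rectangle r2
    return rect_mean(ii_c1, *r1) / (rect_mean(ii_c2, *r2) + eps)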
5 Detector Evaluation Repeatability is a main criterion for evaluating the performance of keypoint detectors. In contrast to current studies by Mikolajczyk et al. [18] where a good feature detection was defined according to the percentage of overlap between the keypoint ellipses, we evaluate repeatability more specifically for the task of image calibration. The Mikolajczyk criterion is in fact not well suited to evaluate multi-view image calibration, where a successful calibration should result in a sub-pixel reprojection error. We are more interested in a keypoint location which only deviates by a few pixels from the ideal
[Fig. 3 plots: twelve panels, one per month difference (0m–11m); x-axis: accuracy [pixels], bins <1 to <6; y-axis: repeatability, 0 to 0.4; curves: DoG, TaSK, Harris, MSER, SURF.]
Fig. 3. Repeatability evaluation for seasonal changes. Repeatability scores for matching January with all other months (all images are taken at the same time of the day).
keypoint location. Our evaluation is performed as follows: given a reference image, we calculate all keypoints obtained from a specific detector on all images for which the transformation to the reference image is available. Repeatability is now defined as the percentage of detections in another image that lie within a radius of n, n = 1 . . . 6 pixels. Hence, for every keypoint in the reference image, we perform a search in the target image to identify the closest detection with respect to the ground truth localization. This event is placed in the n-th bin of the repeatability, while both keypoints are marked as already matched and not considered further. This procedure is repeated until all valid keypoints have been assigned to one of the bins.
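A minimal sketch of this bin-assignment procedure follows (illustrative Python; the greedy one-to-one matching and the normalization by the number of reference keypoints are assumptions of the example, which also assumes pixel-aligned images so that the ground-truth transformation is the identity).

import numpy as np

def repeatability(ref_pts, tgt_pts, max_radius=6):
    # ref_pts, tgt_pts: (N, 2) and (M, 2) keypoint positions in aligned images
    bins = np.zeros(max_radius, dtype=int)
    remaining = [np.asarray(t, dtype=float) for t in tgt_pts]
    for r in np.asarray(ref_pts, dtype=float):
        if not remaining:
            break
        d = np.array([np.linalg.norm(r - t) for t in remaining])
        j = int(np.argmin(d))                  # closest unmatched detection
        if d[j] < max_radius:
            bins[int(d[j])] += 1               # event goes into the n-th bin
            remaining.pop(j)                   # both keypoints are now matched
    # cumulative fraction of reference keypoints re-detected within < n pixels
    return np.cumsum(bins) / float(len(ref_pts))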
[Fig. 4 plots: three panels (1h, 4h, 12h difference); x-axis: accuracy [pixels], bins <1 to <6; y-axis: repeatability; curves: DoG, boost (TaSK), Harris, MSER, SURF.]
Fig. 4. Repeatability evaluation for daily changes in light. Time difference between the images is 1h (left), 4h (middle) and 12h (right).
Fig. 5. Keypoint detections by DoG (left) and our proposed detector TaSK (right). Note that TaSK is in this case specialized to focus more on face features.
5.1 Light and Seasonal Changes To evaluate the performance with respect to light and seasonal changes we used 65 images of a panorama camera. Images from different times of the day and from different months of the year are used. The set of images thus covers a great variety of lighting conditions such as different incident angles, intensity and inhomogeneity due to cloud coverage. All images are perfectly aligned. The repeatability measures are shown in Fig. 3 and Fig. 4. On the x-axis is the accuracy, that is, the distance between the closest pair of keypoints from two different images. On the y-axis is the ratio of the number of pairs with a certain distance to the total number of keypoints. The 12 subfigures in Fig. 3 show seasonal comparisons. An image from each month has been compared to an image from January. The time difference in months is indicated in the title of each subfigure. Depending on the appearance of the scene in the different months, the repeatability varies a lot. It is evident that the time differences of zero and one month result in the best repeatability. The 3 subfigures in Fig. 4 show comparisons between images taken at different times of the same day. The time difference in hours is indicated in the title of each subfigure.
From both figures it can be observed that the repeatabilities are almost always in the same range. Only for comparisons between images taken at different times of the same day are the repeatabilities significantly smaller. This is reasonable since the incident angle of the sunlight changes a lot during the day but much less during the year (recall that all images in Fig. 3 have been taken at noon). In the cases of extreme light changes (Fig. 4, middle and right) the TaSK detector outperforms all the other detectors and provides the most reliable keypoint detections under these very difficult conditions. For the less difficult seasonal changes the TaSK detector performs roughly second best after MSER. The good performance of MSER can be explained by the fact that the test images do not contain geometric transformations. Additionally, we measured how many detected keypoints lie in regions with stable structures (buildings, streets, mountains, ...) and regions with unstable structures (sky, vegetation, ...). Fig. 1 shows that the TaSK detector focuses its detections on stable regions, with 79% of the total number of keypoints lying in man-made structures, while the DoG detector has less than 59% of keypoints in those regions. In Fig. 5, we show the detection results of DoG and TaSK on faces. Note that in this case we have trained the TaSK detector on a different set of positive examples, which was selected by taking keypoint detections on faces as the positive set. Random samples of images which do not contain faces have been chosen as the negative set.
6 Conclusions This paper deals with the learning of task specific keypoint detectors (TaSK) by using boosting. Given training examples of good keypoints, we trained a classifier to distinguish the latter from random image patches. This results in a keypoint detector which produces high repeatability scores on challenging scenes with strong light and seasonal changes. As an example we trained a keypoint detector to work with higher repeatability on structure and motion applications. For this application, it is often a problem to match images with strong light and seasonal changes. General purpose keypoint detectors usually produce many keypoints on vegetation, which are a priori known to be ineffectual for matching. Our trained keypoint detector (TaSK) has this knowledge incorporated. In many applications such as pose estimation, structure from motion, object detection and categorization, general purpose detectors are used. We argued here that task specific keypoint detectors can increase the performance when tuned to the specific task, which is often known beforehand. To show this we also included an example of a keypoint detector for faces. Acknowledgements. This research was supported by Nokia Research Center and Deutsche Telekom Laboratories.
References 1. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int’l Journal of Computer Vision 60(2), 91–110 (2004) 2. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
ˇ 3. Sochman, J., Matas, J.: Waldboost - learning for time constrained sequential detection. In: Schmid, C., Soatto, S., Tomasi, C. (eds.) Proc. Int’l Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 150–157. IEEE Computer Society Press, Los Alamitos (2005) ˇ 4. Sochman, J., Matas, J.: Learning a fast emulator of a binary decision process. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) Proc. Asian Conf. on Computer Vision. LNSC, vol. II, pp. 236–245. Springer, Heidelberg (2007) 5. Yarbus, A.L.: Eye movements and vision. Plenum, New York (1967) (Originally published in Russian, 1962) 6. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Found. Trends. Comput. Graph. Vis. 3(3), 177–280 (2008) 7. Rosten, E., Drummond, T.W.: Machine learning for high-speed corner detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 430–443. Springer, Heidelberg (2006) 8. F¨orstner, W., G¨ulch, E.: A fast operator for detection and precise location of distinct points, corners and centers of circular features. In: Proceedings of the ISPRS Intercommission Workshop on Fast Processing of Photogrammetric Data, pp. 281–305 (1987) 9. Ouellet, J.-N., H´ebert, P.: ASN: Image Keypoint Detection from Adaptive Shape Neighborhood. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 454–467. Springer, Heidelberg (2008) 10. Agrawal, M., Konolige, K., Blas, M.R.: Censure: Center surround extremas for realtime feature detection and matching. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 102–115. Springer, Heidelberg (2008) 11. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proc. British Machine Vision Conf., pp. 384–393 (2002) 12. Vergauwen, M., Van Gool, L.: Web-based 3d reconstruction service. Mach. Vision Appl. 17(6), 411–426 (2006) 13. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: exploring photo collections in 3D. In: SIGGRAPH 2006, pp. 835–846. ACM Press, New York (2006) 14. Rosten, E., Porter, R., Drummond, T.: Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 99(1), 5555 15. Kienzle, W., Wichmann, F.A., Schlkopf, B., Franz, M.O.: Learning an interest operator from human eye movements. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop, p. 24. IEEE Computer Society, Los Alamitos (2006) 16. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 17. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning, 80–91 (1999) 18. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. Int’l Journal of Computer Vision 65(1-2), 43–72 (2005)
Combined GKLT Feature Tracking and Reconstruction for Next Best View Planning Michael Trummer1 , Christoph Munkelt2 , and Joachim Denzler1 1
Friedrich-Schiller University of Jena, Chair for Computer Vision Ernst-Abbe-Platz 2, 07743 Jena, Germany {michael.trummer,joachim.denzler}@uni-jena.de 2 Fraunhofer Society, Optical Systems Albert-Einstein-Straße 7, 07745 Jena, Germany
[email protected]
Abstract. Guided Kanade-Lucas-Tomasi (GKLT) tracking is a suitable way to incorporate knowledge about camera parameters into the standard KLT tracking approach for feature tracking in rigid scenes. By this means, feature tracking can benefit from additional knowledge about camera parameters as given by a controlled environment within a next-best-view (NBV) planning approach for three-dimensional (3D) reconstruction. We extend the GKLT tracking procedure for controlled environments by establishing a method for combined 2D tracking and robust 3D reconstruction. Thus we explicitly use the knowledge about the current 3D estimate of the tracked point within the tracking process. We incorporate robust 3D estimation, initialization of lost features, and an efficient detection of tracking steps not fitting the 3D model. Our experimental evaluation on real data provides a comparison of our extended GKLT tracking method, the former GKLT, and standard KLT tracking. We perform 3D reconstruction from predefined image sequences as well as within an information-theoretic approach for NBV planning. The results show that the reconstruction error using our extended GKLT tracking method can be reduced by up to 71% compared to standard KLT and by up to 39% compared to the former GKLT tracker.
1 Introduction and Literature Review
Three-dimensional reconstruction from digital images requires a solution to the correspondence problem. Feature tracking, especially KLT tracking [1], in an image sequence is a commonly accepted approach to establish point correspondences between images of the input sequence. A point correspondence between two images consists of the two image points that are mappings of the same 3D world point. Together with calibration data, in particular the intrinsic and extrinsic camera parameters, these point correspondences are used to estimate the position of the respective 3D world point. The original formulation of KLT tracking by Lucas and Kanade in [1] entailed a rich variety of extensions, lots of them reviewed by Baker and Matthews in [2]. Fusiello et al. [3] remove spurious correspondences by an outlier detection based
on the image residuals. Zinsser et al. [4] propose a separated tracking process by inter-frame translation estimation using block matching, followed by estimating the affine motion with respect to the template image. Recent research [5,6] deals with purposive 3D reconstruction within a controlled environment (e.g. Fig. 1) by planning camera positions that most support the respective task. Such planning methods calculate camera positions that, for instance, allow the most complete reconstruction of an object with a certain number of views or that optimize the accuracy of reconstructed points. This field of application is an example where additional knowledge about camera parameters is available and should be used to improve feature tracking. Heigl [7] uses an estimation of camera parameters to move features along their epipolar line, but he does not consider the uncertainty of the estimation. Trummer et al. [8] give a formulation of KLT tracking with known camera parameters regarding uncertainty, called Guided KLT tracking (GKLT), but still use the traditional optimization error function. In [9] the authors extend the error function and the optimization algorithm of GKLT to handle uncertainty estimation together with the estimation of transformation parameters. In this paper we present an extension of GKLT tracking resulting in combined tracking and reconstruction. We perform the reconstruction by robustly estimating the position of the respective 3D point. This step endows efficient detection of spurious tracking steps not fitting the current 3D model during the tracking process as well as reinitialization of lost features. We further compare our extended GKLT tracking method with standard KLT and previous GKLT tracking methods in the context of NBV planning using the NBV benchmark object proposed in [10]. The remainder of this paper is organized as follows. Section 2 gives a review of standard KLT tracking and the previous versions of GKLT tracking. In Sect. 3 we present our extended GKLT tracking for combined tracking and reconstruction. A comparison of the considered tracking methods within a NBV planning approach is carried out in Sect. 4. The conclusion of this paper and the outlook to future work is given in Sect. 5.
Fig. 1. Robotic arm Stäubli RX90L as an example of a controlled environment
2 Review of KLT and GKLT Tracking
This section briefly reviews the relevant tracking methods as seen from literature [1,2,8,9]. Thus the notation is defined and the previous extensions of KLT tracking for the usage of camera parameters are described.
2.1 KLT Tracking
Given a feature position in the initial frame, KLT feature tracking aims at finding the corresponding feature position in the consecutive input frame with intensity function I(x). The initial frame is the template image with intensity function T(x), x = (x, y)^T. A small image region and the intensity values inside describe a feature. This descriptor is called feature patch P. Tracking a feature means that the parameters p = (p_1, ..., p_n)^T of a warping function W(x, p) are estimated iteratively, trying to minimize the squared intensity error over all pixels in the feature patch. A common choice is affine warping by

W^a(x, p^a) = ( a11 a12 ; a21 a22 ) (x, y)^T + (Δx, Δy)^T    (1)

with p^a = (Δx, Δy, a11, a12, a21, a22)^T. Following the additive approach (cf. [2]), the error function of the optimization problem can be written as

ε(Δp) = Σ_{x∈P} ( I(W(x, p + Δp)) − T(x) )² ,    (2)
where the goal is to find arg min_{Δp} ε(Δp). An iterative update rule for Δp is found by first-order Taylor approximations of the error function (2).
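For illustration, a small sketch of the warp of eq. (1) and the error of eq. (2) follows (hypothetical Python; nearest-pixel sampling is used instead of interpolation for brevity, and the function names are not from the original implementation).

import numpy as np

def affine_warp(points, p):
    # W^a(x, p^a) of eq. (1): A x + (dx, dy), applied to all patch coordinates
    dx, dy, a11, a12, a21, a22 = p
    A = np.array([[a11, a12], [a21, a22]])
    return points @ A.T + np.array([dx, dy])

def ssd_error(I, T, patch_coords, p):
    # squared intensity error of eq. (2), summed over the feature patch P
    warped = np.rint(affine_warp(patch_coords, p)).astype(int)
    xs = warped[:, 0].clip(0, I.shape[1] - 1)
    ys = warped[:, 1].clip(0, I.shape[0] - 1)
    tx = patch_coords[:, 0].astype(int)
    ty = patch_coords[:, 1].astype(int)
    diff = I[ys, xs].astype(float) - T[ty, tx].astype(float)
    return float(np.sum(diff ** 2))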
2.2 Guided KLT Tracking
In comparison to standard KLT tracking, GKLT uses knowledge about intrinsic and extrinsic camera parameters to alter the translational part of the warping function. Features are moved along their respective epipolar line, but allowing for translations perpendicular to the epipolar line caused by the uncertainty in the estimate of the epipolar geometry. The affine warping function (1) is changed to

W^a_EU(x, p^a_EU, m) = ( a11 a12 ; a21 a22 ) (x, y)^T + ( −l3/l1 − λ1 l2 + λ2 l1 ,  λ1 l1 + λ2 l2 )^T    (3)

with p^a_EU = (λ1, λ2, a11, a12, a21, a22)^T and l1 ≠ 0; the respective epipolar line l = (l1, l2, l3)^T = F m̃ is computed using the fundamental matrix F and the feature position (center of the feature patch) m̃ = (x_m, y_m, 1)^T. The first version of GKLT [8] uses a weighting matrix in the parameter update rule to control the feature's translation along and perpendicular to the respective epipolar line. In [9] a new optimization error function for GKLT is proposed. The weighting matrix A_{w,Δw}, and thus the uncertainty parameter w, is included in the modified error function

ε(Δp_EU, Δw) = Σ_{x∈P} ( I(W_EU(x, p_EU + A_{w,Δw} Δp_EU, m)) − T(x) )² .    (4)
This results in an EM-like approach for a combined estimation of the uncertainty and the warping parameters. By this means, Guided KLT tracking uses additional knowledge about camera parameters to optimize the tracking process in the 2D image space.
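As a sketch of the guided translation in eq. (3) (illustrative Python, based on the reconstruction of the equation given above; function names are hypothetical):

import numpy as np

def epipolar_line(F, m):
    # epipolar line l = F m~ for the patch center m = (x_m, y_m)
    return F @ np.array([m[0], m[1], 1.0])

def guided_translation(l, lam1, lam2):
    # translation part of eq. (3): start at the point of the epipolar line with
    # y = 0, move lam1 along the line direction and lam2 perpendicular to it
    l1, l2, l3 = l
    if abs(l1) < 1e-12:
        raise ValueError("eq. (3) requires l1 != 0")
    tx = -l3 / l1 - lam1 * l2 + lam2 * l1
    ty = lam1 * l1 + lam2 * l2
    return np.array([tx, ty])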
3 Combining GKLT Tracking and Robust 3D Reconstruction
In this section we present a combined approach for GKLT feature tracking and 3D reconstruction of the respective 3D world point. We show how 2D tracking can benefit from an online 3D estimation using robust statistics. For a compact formulation, we denote our extended GKLT tracking method as GKLT3D.

Table 1. Comparing flowcharts of KLT and GKLT3D methods. Steps describe actions for tracking one feature in one frame. Further explanations are given in Sect. 3.

KLT method:
  init feature position: detect feature (new) OR init feature from last position (tracked) OR stop without tracking (lost)
  ↓ if initialized: KLT tracking

GKLT3D method:
  init feature position: detect feature (new) OR init feature from last position (tracked) OR reinit from back-projection (lost)
  ↓ always: GKLT tracking
  ↓ if tracking successful: init weights for 3D estimation
  ↓ robust estimation of 3D position
  ↓ check tracking step for acceptance
The GKLT3D tracking method consists of the following steps for tracking one feature in one frame, cf. Table 1. Initialize feature position. Since tracking in the KLT sense is an iterative optimization of feature transformation parameters, an initial solution is required. If the feature was tracked in the previous frame, it is straightforward to use the last parameter estimation as the initialization for the current frame, which corresponds to the condition of small baselines between consecutive frames. We also use this initialization technique for GKLT3D . In addition, GKLT3D reinitializes features that were lost in the previous frame or earlier and that were tracked in at least one frame. Thus a 3D estimation from at least two frames exists, in particular from the frame where the feature was detected and from at least one frame of successful tracking. For lost features we use the back-projection of the estimated 3D point to reinitialize the feature position for GKLT3D tracking. GKLT tracking. Having initialized the feature transformation, we perform 2D feature tracking by the GKLT method elaborated in [8,9]. In fact, this step of the GKLT3D method can be performed by any other tracking method including standard KLT tracking. However, we find it natural to further extend the existing GKLT tracking method that already uses knowledge about camera parameters.
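The reinitialization of lost features mentioned above amounts to projecting the current 3D estimate into the new frame; a short sketch (illustrative Python) assuming a 3x4 projection matrix P for that frame:

import numpy as np

def reinit_from_backprojection(X_hat, P):
    # back-project the estimated 3D point X_hat = (X, Y, Z) with camera P (3x4)
    x_h = P @ np.append(X_hat, 1.0)   # homogeneous image coordinates
    return x_h[:2] / x_h[2]           # initial 2D feature position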
Initialize weights for 3D estimation. After successful feature tracking we include the additional information about the actual feature position in the 3D estimation. Since we use an iterative estimation and robust statistics, we need to initialize each weight w_i ∈ [0, 1] for the feature position x_i in frame i. The only w_i we can know for sure is w_0 = 1; frame 0 is the initial frame where the feature is detected. The feature positions tracked in the following frames are afflicted with increasing uncertainty. It is more likely for them to be outliers. Thus we propose a strictly decreasing sequence (w_i^(init))_{i=0,1,...,n} with

w_0^(init) = 1  and  w_i^(init) < w_{i−1}^(init)  for all i > 0    (5)
as initialization for the weights w_i. In the presence of output weights from a previous 3D estimation, we initialize the position weights with

w_i^(init) = 1 for i = 0,   w_i^(init) = w_i^(prev) for 1 ≤ i ≤ n−1,   w_i^(init) = 0.5 for i = n    (6)
and hence ensure that w_0^(init) = 1, initialize the weight regarding the latest tracked position as w_n^(init) = 0.5, and use the previously adapted weights w_i^(prev), i = 1, ..., n − 1. Robust estimation of 3D position. For 3D reconstruction we use the known camera parameters and a robust adaptation of the standard direct linear transform (DLT) algorithm for 3D triangulation [11] to perform an estimation following the idea of iteratively reweighted least squares (IRLS) estimation [12]. Since the DLT algorithm yields an algebraically optimal rather than a least-squares estimate, we use a robust iteratively reweighted DLT (IRDLT) estimation of the 3D position. We apply the error norm proposed by Huber [13] as robust estimator,

ρ(e) = (1/2) e²  for |e| < t,   ρ(e) = t|e| − (1/2) t²  for |e| ≥ t    (7)

which yields the weight function
w(e) = (1/e) ∂ρ(e)/∂e =  1  for |e| < t;   −t/e  for |e| ≥ t and e < 0;   t/e  for |e| ≥ t and e ≥ 0    (8)
for error e and outlier boundary t. The IRDLT estimation algorithm performs the following steps to compute an estimate X̂ of the 3D point X from image points x_i and projection matrices P_i using weights w_i, i = 0, 1, ..., n:

preparation: init weights w_i for 3D reconstruction according to (6) if previously estimated weights are available, else according to (5)
1) perform 3D reconstruction using the weighted DLT algorithm
2) recompute weights w_i following (8)
3) if the changes of the w_i are small, stop; else go to 1)

These steps endow a computationally inexpensive and robust 3D estimate X̂ of the world point X.
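A compact sketch of such an IRDLT loop is given below (illustrative Python, not the authors' implementation). It assumes the usual two DLT rows per view, scales them by the square roots of the weights, and reweights with the Huber-derived weight of eq. (8) applied to the reprojection error; the outlier boundary and the convergence test are placeholders.

import numpy as np

def huber_weight(e, t):
    # weight function of eq. (8): 1 inside the boundary, t/|e| outside
    return 1.0 if abs(e) < t else t / abs(e)

def weighted_dlt(xs, Ps, w):
    # weighted DLT triangulation: two rows per view, scaled by sqrt(w_i)
    A = []
    for (x, y), P, wi in zip(xs, Ps, w):
        A.append(np.sqrt(wi) * (x * P[2] - P[0]))
        A.append(np.sqrt(wi) * (y * P[2] - P[1]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]

def irdlt(xs, Ps, w_init, t=2.0, max_iter=10):
    # xs: list of 2D positions x_i, Ps: list of 3x4 projection matrices P_i
    w = np.asarray(w_init, dtype=float)
    X = None
    for _ in range(max_iter):
        X = weighted_dlt(xs, Ps, w)                    # step 1)
        w_new = np.empty_like(w)
        for i, ((x, y), P) in enumerate(zip(xs, Ps)):
            proj = P @ np.append(X, 1.0)
            e = np.hypot(x - proj[0] / proj[2], y - proj[1] / proj[2])
            w_new[i] = huber_weight(e, t)              # step 2)
        if np.max(np.abs(w_new - w)) < 1e-3:           # step 3)
            w = w_new
            break
        w = w_new
    return X, w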
(a) View of the NBV test object ...
(b) ... inside a controlled environment.
Fig. 2. All-aluminium NBV test object proposed in [10]. Outstanding artistic design to provide optical surface structure and hence features for tracking.
Check tracking step for acceptance. Besides a robust estimation of the tracked point's 3D position, the IRDLT procedure yields weights w_i ∈ [0, 1]. These weights indicate how likely it is for a position x_i to be an outlier, where w_i = 1 states that position x_i in image i perfectly supports the estimated 3D position X̂. We use the weight w_n of the last tracked position x_n to decide for acceptance of the whole tracking step. If w_n < t_w, e.g. t_w = 0.5, we roll back the whole tracking step of GKLT3D, i.e. we restore the previous 3D estimation and delete position x_n. In this case the current feature position is reinitialized from the 3D estimate instead of the outlying tracked position for the consecutive frame. All the steps described above form the GKLT3D tracking method, cf. Table 1. The cycle for tracking one feature in one frame performs 2D GKLT tracking, robust estimation of the 3D position, and usage of the estimated 3D information. The outlier rejection is based on the coherence of the latest tracked image position and the robustly estimated 3D position of the respective world point. Thus the 2D tracking process benefits from the concurrent robust 3D estimation in terms of reinitialization of lost features, outlier detection regarding the robust 3D estimate, and, of course, in terms of the robustly estimated 3D position itself.
4 Experimental Comparison of KLT, GKLT and GKLT3D Tracking
We compare our GKLT3D to the standard KLT and GKLT tracking methods. As input data we use predefined image sequences as well as a planned sequence produced by the information-theoretic NBV planning approach described in [5]. Figure 2 shows the experimental setup. All image sequences are taken with a calibrated camera Sony DFW-VL 500 mounted on a robotic arm Stäubli RX90L providing position parameters. Figures 2(a) and 2(b) show the NBV test object proposed in [10]. The image sequences are taken from camera positions on a
(a) Initial frame.
(b) 3227 features (red boxes) selected.
Fig. 3. Initial frame and selected features
half sphere over the object. The test object itself is manufactured from its CAD model with an accuracy of 30 μm. From this CAD model we derive a very dense point cover of the object surface, in particular 10^6 points equally distributed on the object surface. After transformation to the robot coordinate frame this point set provides ground-truth reference data for the 3D reconstruction. For quantitative evaluation of the tracking and reconstruction results, we use the following criteria. We measure the tracking performance by noting the mean trail length μL and the standard deviation σL in frames. The reconstruction accuracy is measured by the mean error μE and standard deviation σE in mm. For this we calculate the distances between each reconstructed point and the respectively closest point from the reference point set. For a meaningful comparison of the reconstruction – and hence tracking – accuracies, we just use the trails in the 2D image space produced by each tracker to perform 3D reconstruction with the standard DLT triangulation algorithm. Thus we do not evaluate the robust 3D estimates from GKLT3D. Each 3D point available is included in the evaluation, i.e. each point that has been seen in at least two frames. Figure 3 shows the initial frame of all test sequences and the set of 3227 features selected along image edges for tracking within the predefined sequences. For NBV planning we reduce the set of features considering the planning runtime.
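For illustration, the accuracy measure can be computed as a nearest-neighbour distance to the reference point set; a brute-force sketch follows (illustrative Python, assuming both point sets live in the same metric coordinate frame).

import numpy as np

def reconstruction_error(reconstructed, reference):
    # mu_E / sigma_E: mean and standard deviation of the distance from every
    # reconstructed 3D point to its closest point of the dense reference set
    rec = np.asarray(reconstructed, dtype=float)   # (N, 3)
    ref = np.asarray(reference, dtype=float)       # (M, 3)
    dists = np.empty(len(rec))
    for i, p in enumerate(rec):
        dists[i] = np.sqrt(np.min(np.sum((ref - p) ** 2, axis=1)))
    return dists.mean(), dists.std()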
4.1 Comparison Using a Short Image Sequence
We perform feature tracking and 3D point reconstruction using as few as ten frames for tracking. The 11 frames, one for feature detection and ten for tracking, are taken moving the camera on a meridian of the half sphere over the object with the camera directed to the center of the corresponding sphere. Since the baseline between consecutive frames on the meridian is 0.375° and thus very small, the whole sequence covers a small baseline only. Considering the fact that small 2D position errors cause large 3D errors in the presence of a small baseline, the reconstruction results emphasize the tracking accuracy. The results in Table 2 show that the mean trail length compared to the standard KLT is increased by about 11% with both the GKLT and the GKLT3D
Table 2. Comparison of trail lengths (μL, σL) and reconstruction errors (μE, σE) for tracking the features from Fig. 3(b) in a short sequence of 11 frames, one frame for feature detection. GKLT3D offers the best accuracy.

            μL (frames)        σL (frames)        μE (mm)            σE (mm)
KLT         9.56               2.60               7.62               25.27
GKLT        10.67 (+11.61%)    1.29 (−50.38%)     3.46 (−54.59%)     6.30 (−75.07%)
GKLT3D      10.64 (+11.30%)    1.36 (−47.70%)     2.75 (−63.91%)     2.74 (−89.16%)
tracker. More notably, GKLT3D reduces the mean reconstruction error by about 64% and the standard deviation by about 89% compared to the standard KLT tracker. With respect to GKLT, GKLT3D reduces μE by 20.52% and σE by 56.51%. GKLT3D benefits from the removal of spurious tracking steps not fitting the 3D estimation.
4.2 Comparison Using a Long Image Sequence
In addition to the tracking evaluation using a short image sequence, we further apply the tracking methods to the same image features within a long sequence of 201 frames, one for feature detection and 200 for tracking. In this sequence the camera positioning covers a large baseline and change of the viewing direction. By this means, we achieve a meaningful evaluation of the tracking durations.

Table 3. Comparison of trail lengths (μL, σL) and reconstruction errors (μE, σE) for tracking the features from Fig. 3(b) in a long sequence of 201 frames, one frame for feature detection. GKLT3D shows by far the best tracking duration and reconstruction accuracy in the comparison.

            μL (frames)         σL (frames)        μE (mm)            σE (mm)
KLT         23.47               21.22              9.10               27.26
GKLT        33.88 (+44.35%)     21.70 (+2.36%)     4.34 (−52.31%)     6.69 (−75.46%)
GKLT3D      91.06 (+287.98%)    41.90 (+97.46%)    2.65 (−70.88%)     2.38 (−91.27%)

As shown in Table 3, tracking features in the long image sequence points out the benefits of reinitializing lost features, which requires an estimation of the respective 3D world point. Considering the average case, GKLT3D can track features for nearly four times more frames than standard KLT and nearly three times more than GKLT in the test sequence. This also entails a larger standard deviation σL. The difference of the mean reconstruction errors is even larger than for the short test sequence. The mean error produced by GKLT3D is about 71% smaller compared to standard KLT and about 39% smaller compared to GKLT. In comparison with the results using the short test sequence, only GKLT3D can improve the reconstruction accuracy; standard KLT and GKLT produce larger mean errors. This seems contradictory, since the longer test sequence covers a larger baseline and features are tracked in more frames. Actually, standard KLT and GKLT suffer from tracking inaccuracies due to difficult input images that superimpose the effect of the larger baseline. By robustly estimating the current
3D position and removing spurious tracking steps, only GKLT3D allows a more accurate 3D reconstruction using the long test sequence.
4.3 Comparison within an Information-Theoretic Approach for Next Best View Planning
Compared to the standard structure-from-motion approach, 3D reconstruction within a controlled environment offers additional information and allows purposive actions to improve the reconstruction procedure and the result. NBV planning uses these additional possibilities to achieve defined reconstruction goals. The NBV planning method in [5] uses an extended Kalman filter to compute 3D reconstructions of tracked features and determines the next best view by applying an information-theoretic quality criterion and visibility constraints. We track 496 features and use the short test sequence as the initial sequence of the planning procedure. Table 4. Comparison of reconstruction errors after the n-th planned view NBVn for respective tracking method. GKLT3D allows more planned views and yields best accuracy.
            init               NBV1               NBV2               NBV3
            μE      σE         μE      σE         μE      σE         μE      σE
KLT         5.20    21.47      5.42    21.46      /       /          /       /
GKLT        2.62    3.14       2.70    3.51       /       /          /       /
GKLT3D      2.10    1.43       1.74    1.26       1.79    1.16       1.78    1.16
Table 4 lists the reconstruction errors after each iteration of the NBV planning procedure. The feature trails provided by KLT and GKLT tracking allow only one planned view after the initial sequence. Afterwards, the respective planning result repeats itself, since both trackers cannot keep the features through the longer sequence. Only GKLT3D can gather enough information in the elongated sequence to provide new information for the next planning step, which allows a new planned position. We stopped planning with GKLT3D after the third planned view. The mean reconstruction error of about 1.75mm using GKLT3D clearly outperforms the results reached with KLT and GKLT tracking.
5 Conclusion and Future Work
We presented an extension to GKLT feature tracking within a controlled environment. Following the idea of using the additional knowledge about camera parameters within the tracking process, we described concurrent robust estimation of the 3D position from 2D feature positions by the IRDLT algorithm. We used the robustly estimated 3D position to reinitialize lost features during the tracking process and to detect and remove spurious tracking steps not supporting the current 3D estimation. Further we performed an experimental evaluation using defined image sequences as well as within an information-theoretic approach for next-best-view planning.
The experimental evaluation outlined a clear performance gain using our extended GKLT tracking method – considering tracking duration as well as tracking accuracy. In comparison to the standard KLT and the former GKLT trackers, the mean reconstruction error in the experiments was reduced by up to 71% and 39%, respectively. The gain in the tracking duration increased with longer image sequences. We noted an increase of about 290% with a long test sequence. Future work should deal with the bottleneck of constant feature templates. The reinitialization of lost features yields no positive effect if the current view shows the feature through a completely different perspective projection than seen in the initial frame. A solution to this problem also should use the additional knowledge about camera parameters.
References 1. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of 7th International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 2. Baker, S., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework. International Journal of Computer Vision 56, 221–255 (2004) 3. Fusiello, A., Trucco, E., Tommasini, T., Roberto, V.: Improving feature tracking with robust statistics. Pattern Analysis and Applications 2, 312–320 (1999) 4. Zinsser, T., Graessl, C., Niemann, H.: High-speed feature point tracking. In: Proceedings of Conference on Vision, Modeling and Visualization (2005) 5. Wenhardt, S., Deutsch, B., Angelopoulou, E., Niemann, H.: Active Visual Object Reconstruction using D-, E-, and T-Optimal Next Best Views. In: Computer Vision and Pattern Recognition. CVPR 2007, June 2007, pp. 1–7 (2007) 6. Chen, S.Y., Li, Y.F.: Vision Sensor Planning for 3D Model Acquisition. IEEE Transactions on Systems, Man and Cybernetics – B 35(4), 1–12 (2005) 7. Heigl, B.: Plenoptic Scene Modelling from Uncalibrated Image Sequences. PhD thesis, Friedrich-Alexander-Universit¨ at Erlangen-N¨ urnberg (2003) 8. Trummer, M., Denzler, J., Munkelt, C.: KLT Tracking Using Intrinsic and Extrinsic Camera Parameters in Consideration of Uncertainty. In: Proceedings of 3rd International Conference on Computer Vision Theory and Applications (VISAPP), vol. 2, pp. 346–351 (2008) 9. Trummer, M., Munkelt, C., Denzler, J.: Extending GKLT Tracking – Feature Tracking for Controlled Environments with Integrated Uncertainty Estimation. In: Salberg, A.-B., Hardeberg, J.Y., Jenssen, R. (eds.) SCIA 2009. LNCS, vol. 5575, pp. 460–469. Springer, Heidelberg (2009) 10. Munkelt, C., Trummer, M., Wenhardt, S., Denzler, J.: Benchmarking 3D Reconstructions from Next Best View Planning. In: Proceedings of IAPR Conference on Machine Vision Applications (MVA), pp. 552–555 (2007) 11. Hartley, R., Zisserman, A.: Multiple View Geometry in computer vision, 2nd edn. Cambridge University Press, Cambridge (2003) 12. Maronna, R., Martin, R., Yohai, V.: Robust Statistics. Wiley Series in Probability and Statistics (2006) 13. Huber, P.: Robust Estimation of a Location Parameter. Annals of Mathematical Statistics 35(1), 73–101 (1964)
Non-parametric Single View Reconstruction of Curved Objects Using Convex Optimization M.R. Oswald, E. Töppe, K. Kolev, and D. Cremers Computer Science Department, University of Bonn, Germany
Abstract. We propose a convex optimization framework delivering intuitive and reasonable 3D meshes from a single photograph. For a given input image, the user can quickly obtain a segmentation of the object in question. Our algorithm then automatically generates an admissible closed surface of arbitrary topology without the requirement of tedious user input. Moreover we provide a tool by which the user is able to interactively modify the result afterwards through parameters and simple operations in a 2D image space. The algorithm targets a limited but relevant class of real world objects. The object silhouette and the additional user input enter a functional which can be optimized globally in a few seconds using recently developed convex relaxation techniques parallelized on state-of-the-art graphics hardware.
1 Introduction
One of the most impressive abilities of human vision is the extraction of threedimensional information from a single image. From the mathematical point of view, depth information is lost due to the projection. In contrast to multiview methods, this operation cannot be simply inverted. Hence, depth information can only be guessed by image features like object contours, edges and texture patterns. Especially for images of textured objects under complex lighting conditions, shape from shading methods usually fail to work and further assumptions or user interactions are required. In computer vision, this fundamental problem has recently attracted a large amount of attention.
Fig. 1. Input images and textured reconstruction results from the proposed method
This work was supported in part by Microsoft Research Cambridge through its PhD Scholarship Programme.
1.1 Existing Approaches to Single View Reconstruction
Many approaches such as that of Horry et al. [1] aim to reconstruct planar surfaces by evaluating user defined vanishing points and lines. This has been extended by Liebowitz [2] and Criminisi [3]. Recently, this process has been completely automated by Hoiem et al. [4], yielding appealing results on a limited number of input images. Sturm et al. [5] make use of user-specified constraints such as coplanarity, parallelism and perpendicularity in order to reconstruct piecewise planar surfaces. An early work for the reconstruction of curved objects is Terzopoulos et al. [6], in which symmetry seeking models are reconstructed from a user defined silhouette and symmetry axis using snakes. However, this approach is restricted to the class of tube-like shapes. Moreover, reconstructions are merely locally optimal. The work of Zhang et al. [7] addresses this problem and proposes a model which globally optimizes a smoothness criterion. However, it concentrates on estimating 2.5D image features rather than reconstructing real 3D representations. Moreover, it requires a huge amount of user interaction in order to obtain appealing reconstructions. In “Teddy”, Igarashi et al. [8] make use of a contour based distance function in order to inflate a volume. The method performs modifications of a triangle mesh in multiple steps and is rather heuristic. This leads to problems with the maintenance of mesh consistency and suboptimal silhouette fitting results. Moreover, the object’s topology is restricted to that of a sphere. Closely related to our work, Prasad et al. [9] have studied the reconstruction of smooth and curved 3D surfaces from single photographs. They also calculate a globally optimal 3D surface satisfying user specified constraints. The main drawback of this method is the vast amount of necessary user input, which is mainly due to the use of parametric surfaces. For a reconstruction, the user needs to select several contour edges and place them appropriately in the parameter space grid, which requires explicit consideration of the topology of the object in question. Moreover, parameter space boundaries need to be considered and connected by the user, requiring expert knowledge even for simple object topologies. For topologies of higher genus the required user placement of contour edges or creases in the parameter space may easily lead to over-oscillations of the surface and incorrect surface distortions. To our knowledge, all existing approaches to single view reconstruction are based on parametric representations. Consequently, solutions will be affected by the choice of parametrization and extensions to different topology are by no means straightforward.
1.2 Contributions
In this paper, we focus on the reconstruction of curved objects of arbitrary topology with a minimum of user input. We propose a convex variational method which generates a 3D object in the matter of a second using silhouette information only. Moreover, the proposed reconstruction framework provides the user with a simple but powerful post-editing toolbox which does not require expert
knowledge at all. Post-editing can be done interactively due to the short computation times obtained by massively parallelized implementation of the underlying nonlinear diffusion process. Compared to previous works, the proposed method allows to compute globally optimal reconstructions of arbitrary topology due to the use of implicit surfaces and respective convex relaxation techniques. In the following section, we will introduce a variational framework for single view reconstruction and show how it can be solved by convex relaxation techniques. In Sect. 3, we give a complete overview of the proposed reconstruction framework and explain how users can provide silhouette and additional information with minimal user interaction. The viability of our approach is tested on several examples in Sect. 4, followed by concluding remarks in Sect. 5.
2 Variational Framework for Single View Reconstruction
In the following, we introduce a variational framework for single view reconstruction. The proposed functional can subsequently be optimized using convex relaxation techniques recently developed for segmentation [10] and multiview reconstruction [11].
2.1 Variational Formulation
Let V ⊂ IR³ be a volume surrounding the input image I : Ω → IR³ with image plane Ω ⊂ V. We are looking for a closed surface Σ ⊂ V which inflates the object in the image I and is consistent with its silhouette S. For simplicity, an orthographic projection is assumed and defined by π : V → Ω. In order to handle arbitrary topologies, the surface Σ is represented implicitly by the indicator function u : V → {0, 1} denoting the exterior (u = 0) or interior (u = 1) of the surface. A smooth surface with the desired properties is obtained by minimizing the following energy functional:

E(u) = E_data(u) + ν E_smooth(u) ,    (1)

where ν ≥ 0 is a parameter controlling the smoothness of the surface. The smoothness term is imposed via the weighted total variation (TV) norm

E_smooth(u) = ∫_V g(x) |∇u(x)| d³x ,    (2)

where the diffusivity g : V → IR⁺ can be used to adaptively adjust smoothness properties of the surface in different locations. The range of g needs to be nonnegative to maintain the convexity of the model. The data term

E_data(u) = ∫_V u(x) φ_vol(x) d³x + ∫_V u(x) φ_sil(x) d³x    (3)

realizes two objectives: volume inflation and compliance with silhouette constraints.
2.2 Silhouette Consistency
The function φ_sil(x) merely imposes silhouette consistency. It assures that all points projecting outside the silhouette will be assigned to the background (u = 0) and that all points which are on the image plane and inside the object will be assigned to the object (u = 1):

φ_sil(x) =  −∞  if χ_S(π(x)) = 1 and x ∈ Ω,
            +∞  if χ_S(π(x)) = 0,
             0  otherwise ,    (4)

where the characteristic function χ_S : Ω → {0, 1} indicates the exterior or interior of the silhouette, respectively.
2.3 Volume Inflation
The volume inflation function φ_vol allows us to impose some guess of the shape of the object. The function can be adapted to achieve any desired object shape and may also be changed by user interaction later on. In this work, we make the simple assumption that the thickness of the observed object increases as we move inward from its silhouette. For any point p ∈ V let

dist(p, ∂S) = min_{s∈∂S} ||p − s||    (5)

denote its distance to the silhouette contour ∂S ⊂ Ω. Then we set

φ_vol(x) =  −1  if dist(x, Ω) ≤ h(π(x)),
            +1  otherwise ,    (6)

where the height map h : Ω → IR depends on the distance of the projected 3D point to the silhouette according to the function

h(p) = min( λ_cutoff , λ_offset + λ_factor · dist(p, ∂S)^k )    (7)

with four parameters k, λ_offset, λ_factor, λ_cutoff ∈ IR_{>0} affecting the shape of the reconstructed object. How the user can employ these parameters to modify the computed 3D shape will be discussed in Sect. 3. Note that this choice of φ_vol implies symmetry of the resulting model with respect to the image plane. Since the backside of the object is unobservable, it will be reconstructed properly for plane-symmetric objects.
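A small sketch of how φ_vol could be assembled from a binary silhouette follows (illustrative Python; the distance transform to the background is used as an approximation of dist(p, ∂S), and the parameter defaults are placeholders).

import numpy as np
from scipy.ndimage import distance_transform_edt

def height_map(silhouette, k=1.0, lam_offset=0.0, lam_factor=1.0,
               lam_cutoff=np.inf):
    # h(p) of eq. (7); dist(p, dS) is approximated for silhouette pixels by the
    # Euclidean distance transform to the nearest background pixel
    d = distance_transform_edt(silhouette)
    h = np.minimum(lam_cutoff, lam_offset + lam_factor * d ** k)
    return np.where(silhouette > 0, h, 0.0)

def phi_vol(silhouette, plane_distances, **height_kwargs):
    # phi_vol of eq. (6) on a voxel grid: -1 where the distance of the voxel
    # to the image plane is at most the local height, +1 otherwise;
    # plane_distances holds |dist(x, Omega)| for each voxel slice
    h = height_map(silhouette, **height_kwargs)
    vol = np.ones((len(plane_distances),) + silhouette.shape)
    for zi, dz in enumerate(np.abs(plane_distances)):
        vol[zi][(silhouette > 0) & (dz <= h)] = -1.0
    return vol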
Optimization via Convex Relaxation
To minimize energy (1) we follow the framework developed in [11]. To this end, we relax the binary assumption by allowing u to take on intermediate values,
i.e. u : V → [0, 1]. Subsequently, we can globally minimize the convex functional (1) by solving the corresponding Euler-Lagrange equation
0 = φ_vol + φ_sil − ν div( g ∇u/|∇u| ) ,    (8)

using a fixed-point iteration. A global optimum of the original binary labeling problem is then obtained by simple thresholding of the solution of the relaxed problem – see [11] for details. In [12] it was shown that such relaxation techniques have several advantages over graph cut methods. In this work, the two main advantages are the lack of metrication errors and the parallelizability. These two aspects allow us to compute smooth single view reconstructions with no grid bias within a few seconds using standard graphics hardware.
3 Interactive Single View Reconstruction
To make optimal use of the proposed reconstruction method, we explain how it is integrated into an interactive tool and which methods can be used to obtain good reconstructions with only a few mouse clicks. The typical workflow of our method is depicted in Fig. 2, and the individual stages are explained further in the following subsections.

3.1 Image Segmentation
The main prerequisite for a good result with the algorithm proposed in Sect. 2 is a reasonable silhouette. The number of holes in the segmentation of the target object determines the topology of the reconstructed surface. Notably, the proposed reconstruction method can also cope with disconnected regions of the object silhouette. The segmentation is obtained by utilizing an interactive graph cuts scheme similar to the ones described by [13] and [14]. The algorithm calculates two distinct regions based on respective color histograms which are defined by representative pen strokes given by the user (see Fig. 2). The output of the segmentation defines the silhouette indicator function χ_S.
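For concreteness, the following sketch shows one way such color-histogram region models could be built from the user's pen strokes; the histogram binning, the smoothing constant, and the function name are illustrative assumptions, and the resulting per-pixel costs would be handed to a graph-cut solver in the spirit of [13,14] rather than solving the segmentation themselves.

```python
import numpy as np

def scribble_unaries(image, fg_mask, bg_mask, bins=16, eps=1e-6):
    """Negative log-likelihood unary costs from scribble color histograms.

    image:    H x W x 3 uint8 RGB image
    fg_mask:  H x W bool array, True where the user drew foreground strokes
    bg_mask:  H x W bool array, True where the user drew background strokes
    Returns two H x W arrays: cost of labeling a pixel foreground / background.
    """
    # Quantize colors into a joint histogram index per pixel.
    q = (image.astype(np.int64) * bins) // 256            # H x W x 3, values in [0, bins)
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]

    def histogram(mask):
        h = np.bincount(idx[mask].ravel(), minlength=bins ** 3).astype(np.float64)
        return (h + eps) / (h.sum() + eps * bins ** 3)     # smoothed, normalized

    p_fg = histogram(fg_mask)[idx]    # per-pixel likelihood under the foreground model
    p_bg = histogram(bg_mask)[idx]
    return -np.log(p_fg), -np.log(p_bg)
```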
Fig. 2. Workflow of the proposed method: Input image with user provided seeds (foreground: blue, background: red), segmentation, reconstruction with default parameter settings, reconstruction with user-adapted parameter settings
Fig. 3. Effect of λoffset , λfactor , λcutoff (left) and various values of parameter k and resulting (scaled) height map plots for a circular silhouette
3.2 Interactive Editing
From the input image and silhouette a first reconstruction is generated, which, depending on the complexity and the class of the object, can already be satisfactory. However, for some object classes and due to the general over-smoothing of the resulting mesh, we propose several editing techniques on a 1D (parameter) and a 2D (image space) level. The goal is to have easy-to-use editing tools which cover important cases of object features. In this paper we present three different kinds of editing tools: parameter-based, contour-based and curve-based tools. The first two classes operate directly on the data term of (1), whereas the third one alters the diffusivity of the TV-norm (2). Data Term Parameters. By altering the parameters λ_offset, λ_factor, λ_cutoff and the exponent k of the height map function (7), users can intuitively change the data term (3) and thus the overall shape of the reconstruction. Note that the impact of these parameters is attenuated with increasing importance of the smoothness term. The effects of the offset, factor and cutoff parameters on the height map are shown in Fig. 3 and are quite intuitive to grasp. The exponent k of the distance function in (7) mainly influences the object's curvature in the proximity of the silhouette contour. This can be observed in Fig. 3, which shows an evolution from a cone to a cylinder just by decreasing k. Local Data Term Editing. Due to the use of a distance function for the volume inflation, depth values of the data term will always increase for larger distances to the silhouette contour. Thus, large depth values will never occur near the silhouette contour. However, this can become necessary for an important class of object shapes, for instance the bottom and top of the vase in Fig. 4. A simple remedy to this problem is to ignore user-specified contour parts during the calculation of the distance function. We therefore approximate the object contour by a polygon which is laid over the input image. The vertices of the polygon are points of high curvature and each edge represents the contour pixels between its endpoints. By clicking on an edge, the user indicates that the corresponding contour pixels should be ignored during the distance map calculation (see Fig. 4, top right).
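To make the interplay of the parameters in (7) and the data term (3) concrete, here is a small sketch that builds the height map and the combined data term on a voxel grid from a binary silhouette; the one-sided voxel grid in front of the image plane, the large finite constants standing in for ±∞, and the helper names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def data_term(silhouette, depth, k=1.0, lam_offset=0.0, lam_factor=1.0,
              lam_cutoff=np.inf, big=1e6):
    """Volume-inflation and silhouette terms of (3) on a (depth, H, W) voxel grid.

    silhouette: H x W bool array (the indicator chi_S)
    depth:      number of voxel slices in front of the image plane
    """
    # Distance of every pixel to the silhouette contour (Eq. 5), in pixels.
    dist = distance_transform_edt(silhouette)
    # Height map of Eq. (7).
    h = np.minimum(lam_cutoff, lam_offset + lam_factor * dist ** k)

    z = np.arange(depth).reshape(depth, 1, 1)             # distance to the image plane
    phi_vol = np.where(z <= h[None, :, :], -1.0, +1.0)    # Eq. (6)

    # Silhouette consistency (Eq. 4), with large finite values instead of +/- inf.
    phi_sil = np.zeros_like(phi_vol)
    phi_sil[:, ~silhouette] = big                  # outside the silhouette: force u = 0
    phi_sil[0, silhouette] = -big                  # on the image plane, inside: u = 1
    return phi_vol + phi_sil
```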
Fig. 4. Top row: height maps and corresponding reconstructions with and without marked sharp contour edges. Bottom row: input image with marked contour edges (blue) and line strokes (red) for local discontinuities which are shown right.
Local Discontinuities. Creases on the surface often contribute critically to the characteristic shape of an object. With the diffusivity function of the smoothness term (2) we are given a natural way of integrating discontinuities into the surface reconstruction. By setting the values of g to less than one for certain subsets of the domain, the smoothness constraint is relaxed for these regions. Accordingly, for values greater than one, smoothness is locally reinforced. To keep things simple, we let the user specify curves of discontinuities by drawing them directly into the input image space. In the reconstruction space, the corresponding preimages are uniquely defined hyperplanes (remembering that we make use of parallel projection). For points lying on or near these planes, the diffusivity is reduced, resulting in a surface crease at the end of the reconstruction process.
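One possible way to realize such curve-based diffusivity editing is sketched below: the user stroke is rasterized into an image mask, the diffusivity is lowered for the corresponding voxel columns, and it returns smoothly to one with increasing distance; the specific values of g_min and radius are illustrative choices, not those used in the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def crease_diffusivity(shape, stroke_mask, radius=1.5, g_min=0.05):
    """Spatially varying diffusivity g for the weighted TV term (2).

    shape:       (depth, H, W) of the reconstruction volume
    stroke_mask: H x W bool array, True on user-drawn discontinuity curves
    Returns a (depth, H, W) array with values in [g_min, 1].
    """
    # Distance (in pixels) of each image point to the nearest stroke.
    dist2d = distance_transform_edt(~stroke_mask)
    # Reduce g close to the stroke, smoothly going back to 1 with distance.
    g2d = g_min + (1.0 - g_min) * np.clip(dist2d / radius, 0.0, 1.0)
    # Orthographic projection: the preimage of a stroke is the same column
    # of voxels in every depth slice, so replicate g along the depth axis.
    return np.broadcast_to(g2d, shape).copy()
```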
3.3 Implementational Issues
In order to efficiently solve the Euler-Lagrange equation (8) and to allow fast interactive modeling, the choice of the solution method and its appropriate implementation are crucial for achieving short calculation times. Instead of minimizing (1) with a gradient descent scheme, we solve the approximated system of linear equations with successive over-relaxation (SOR) as
proposed in [11]. On the one hand, this increases the convergence speed drastically; on the other hand, the solution method can be parallelized to further increase computational speed. Therefore, we make use of the CUDA framework to implement SOR with a Red-Black scheme, which speeds up calculations by a factor of 6 compared to the sequential method. Moreover, the computational effort for the surface evolution during interactive modeling can be further reduced by initializing the calculations with the previous reconstruction result. For small parameter changes this initialization is usually close to the next optimal solution. In sum, this allows single view reconstruction close to real time.
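A compact sketch of one such red-black SOR sweep is given below, using a lagged-diffusivity linearization of the TV term in (8); the periodic boundary handling via np.roll, the edge-weight averaging, and the projection of u onto [0, 1] are simplifying assumptions and not the exact scheme of [11]. Thresholding the converged u at 0.5 then yields the binary labeling described in Sect. 2.4.

```python
import numpy as np

def sor_step(u, phi, g, nu=1.0, omega=1.9, eps=1e-3):
    """One red-black SOR sweep for 0 = phi - nu * div(g * grad(u)/|grad(u)|).

    u, phi, g are (D, H, W) arrays; u is the relaxed labeling in [0, 1].
    The TV term is linearized by lagging the weights w = g / |grad(u)|.
    """
    # Lagged per-voxel weights (regularized to avoid division by zero).
    gx, gy, gz = np.gradient(u)
    w = g / np.sqrt(gx**2 + gy**2 + gz**2 + eps**2)

    checker = np.indices(u.shape).sum(axis=0) % 2          # red/black parity
    for color in (0, 1):
        nb_sum = np.zeros_like(u)        # weighted sum of neighboring u values
        w_sum = np.zeros_like(u)         # sum of the edge weights
        for axis in range(3):
            for shift in (+1, -1):
                u_n = np.roll(u, shift, axis=axis)
                w_n = 0.5 * (w + np.roll(w, shift, axis=axis))   # edge weight
                nb_sum += w_n * u_n
                w_sum += w_n
        u_new = (nu * nb_sum - phi) / (nu * w_sum)
        u_new = np.clip(u_new, 0.0, 1.0)                   # keep u in [0, 1]
        mask = (checker == color)
        u[mask] = (1.0 - omega) * u[mask] + omega * u_new[mask]
    return u
```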
4 Experiments
In the following we apply our method to several input images. We show different aspects of the reconstruction process for typical classes of target objects. Furthermore, we mention runtimes and limitations of the approach. The experimental results are shown in Fig. 5. Default values for the data term parameters (7) are k = 1, λ_offset = 0, λ_factor = 1, λ_cutoff = ∞. Each row depicts several views of a single object reconstruction starting with the input image. The following main advantages are showcased in the examples. The fence (top row) is an example of an object with complex topology that the algorithm can handle. Obviously, reconstructions of the shown type are nearly impossible to achieve with the help of parametrized representations. The same example also demonstrates how little user interaction is necessary in some cases to obtain a good reconstruction result. In fact, the fence was generated automatically by the method right after the user segmentation stage. The rest of the examples demonstrate the power of the editing tools described in Sect. 3. The reconstructions were edited by adding creases and selecting sharp edges. It can be seen that elaborate modeling effects can be readily achieved with these operations. Especially for the cockatoo, a single curve suffices to add the characteristic indentation to the beak. No expert knowledge is necessary. For the socket of the Cristo statue, creases help to attain sharp edges, while keeping the rest of the statue smooth. It should be stressed that no other post-processing operations were used. The experiments in the lower three rows represent a more complex set of target objects. A closer look reveals that the algorithm clearly reaches its limit here. The structure of the opera building (third row) as well as the elaborate geometry of the bike and its drivers cannot be correctly reconstructed with the proposed method due to a lack of information and of more sophisticated tools. Yet the results are appealing and could be improved further with the given tools. To keep the runtime and memory demand within convenient limits, we work on 256² input images. These result in a very detailed mesh. On a GeForce GTX card an update step of the geometry takes about 2-15 seconds, depending on the applied operation.
Fig. 5. Input images (1st column) and corresponding reconstruction results (2nd-4th column): textured model, untextured geometry, textured model without image plane
5 Conclusion
We presented the first variational approach for single view reconstruction of curved objects with arbitrary topology. It allows us to compute a plausible 3D model for a limited but reasonable class of single images. By using an implicit surface representation we eliminate the dependency on a choice of surface parameterization and the subsequent difficulty with objects of varying topology. The proposed functional integrates silhouette information and additional user input. Globally optimal reconstructions are obtained via convex relaxation. The algorithm can be used interactively, since the parallel implementation of the underlying nonlinear diffusion process on standard graphics cards only requires a few seconds. Compared to other works, the amount of user input is small and intuitive, and post-editing is kept simple and does not require expert knowledge. Future work is focused on incorporating information from edges, patterns or shading to further improve the quality of reconstructions. Acknowledgements. We thank Mukta Prasad, Carsten Rother and Andrew Fitzgibbon for helpful discussions, suggestions and for providing their images.
References 1. Horry, Y., Anjyo, K.I., Arai, K.: Tour into the picture: using a spidery mesh interface to make animation from a single image. In: SIGGRAPH, pp. 225–232 (1997) 2. Liebowitz, D., Criminisi, A., Zisserman, A.: Creating architectural models from images. In: Proc. EuroGraphics, vol. 18, pp. 39–50 (1999) 3. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. Int. J. Comput. Vision 40(2), 123–148 (2000) 4. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Trans. Graph. 24(3), 577–584 (2005) 5. Sturm, P.F., Maybank, S.J.: A method for interactive 3d reconstruction of piecewise planar objects from single images. In: Proc. BMVC, pp. 265–274 (1999) 6. Terzopoulos, D., Witkin, A., Kass, M.: Symmetry-seeking models and 3d object reconstruction. IJCV 1, 211–221 (1987) 7. Zhang, L., Dugas-Phocion, G., Samson, J.-S., Seitz, S.M.: Single view modeling of free-form scenes. In: Proc. of CVPR, pp. 990–997 (2001) 8. Igarashi, T., Matsuoka, S., Tanaka, H.: Teddy: a sketching interface for 3d freeform design. In: SIGGRAPH 1999, pp. 409–416 (1999) 9. Prasad, M., Zisserman, A., Fitzgibbon, A.W.: Single view reconstruction of curved surfaces. In: CVPR, pp. 1345–1354 (2006) 10. Chan, T., Esedo¯ glu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. SIAM 66(5), 1632–1648 (2006) 11. Kolev, K., Klodt, M., Brox, T., Cremers, D.: Continuous global optimization in multiview 3d reconstruction. Int.J. of Comp.Vision 84(1), 80–96 (2009) 12. Klodt, M., Schoenemann, T., Kolev, K., Schikora, M., Cremers, D.: An experimental comparison of discrete and continuous shape optimization methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 332–345. Springer, Heidelberg (2008) 13. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary region segmentation of objects in n-d images. In: ICCV 2001, vol. 1, pp. 105–112 (2001) 14. Rother, C., Kolmogorov, V., Blake, A.: ”grabcut”: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (3), 309–314 (2004)
Discontinuity-Adaptive Shape from Focus Using a Non-convex Prior K. Ramnath and A.N. Rajagopalan Image Processing and Computer Vision Laboratory Department of Electrical Engineering Indian Institute of Technology Madras, Chennai, India
[email protected],
[email protected]
Abstract. Shape from focus (SFF) is a widely used technique for determining the 3D structure of textured microscopic objects. However, SFF output depends critically on the number of observations used and the focus measure operator adopted. In this paper, we propose a new SFF method that can provide rich structure information given a limited number of observations. We observe that depth is non-linearly related to the observations and pose the shape estimation as a minimization problem within a Maximum A Posteriori (MAP) – Markov Random Field (MRF) framework. We incorporate a discontinuity-adaptive MRF prior for the underlying structure. The resulting cost function is non-convex in nature, which we minimize using the graduated non-convexity algorithm. When tested on synthetic as well as real objects, the results obtained are quite impressive.
1 Introduction
Estimating the 3D structure of objects from 2D images is a well known inverse problem in computer vision. There are many cues one can use for this purpose, which include motion, focus, defocus, shading, texture and silhouettes [1]. Shape from focus [2] was proposed for the problem of determining the 3D structure of textured microscopic objects. SFF is attractive as it is a direct one-step method and hence very fast. But SFF requires a large number of textured observations to work well. Depth from Defocus (DFD) [3] [4] [5] based methods can operate on just two observations captured with different focus settings and estimate the structure using the relative blur between them. Focus and the closely related defocus-based methods to estimate structure are both single view methods. While in defocus-based methods the camera is held still, focus based methods involve axial motion. Both methods are attractive in industrial and medical applications where lateral movement is limited. These methods assume that there is no magnification across the captured images. Since optical microscopes employing telecentric optics have no parallax and are equipped with a translational stage for axial motion, SFF is a natural choice for microscopic applications. Because SFF uses a larger number of observations, it is known to be more accurate than DFD [6].
Fig. 1. (a) SFF setup. (b) Art Structure. (c) SFF estimate.
A typical experimental setup for SFF is shown in Fig. 1 (a). The object whose structure is to be determined is placed on the translational stage of the optical microscope. The stage is initially placed at a distance wd equal to the working distance of the microscope objective. This ensures that the portion of the object farthest from the objective is in focus initially. The stage is then translated away from the objective along the optical axis of the microscope in steps of Δd until the portion of the object nearest to the objective is in focus. SFF estimates the displacement d that is required for a pixel to come into focus which in turn reveals the structure of the object. Note that in Fig. 1 (a) wd is the working distance of the microscope objective and m is the frame number. SFF uses a focus measure operator to determine the frame in which a pixel comes into focus. Sum-modified Laplacian, proposed in [2], is typically used as the focus measure operator in SFF. The estimate is refined using Gaussian interpolation around the peak of the focus measure profile. In Fig. 1 (b), we give the depth map corresponding to the Art structure from the well known Middlebury stereo database [7]. The disparity values were converted to depth using the data provided therein. We use the Calf texture obtained from the USC texture database [8] for the original focused image. The SFF estimate for this structure, using 100 observations is shown in Fig. 1 (c). Note that fine details are lost and the depth map is noisy (SFF does not incorporate any smoothness constraints). The estimates can become better if the step size between the frames is reduced. However, a smaller step size will mean more observations need to be captured. For example, in [9], several hundreds of images are captured using a special camera to obtain a fine depth map. In this paper we model the process of image formation under the microscope as blurring of the original image by a space variant point spread function (PSF). Note that SFF does not use this model for estimating structure and rather relies on an operator to determine the degree of focus. A recent work [10] proposed the idea that assuming a model for the PSF which is differentiable with respect to structure allows the formulation of a gradient descent algorithm for estimating the structure. Since structure estimation is an ill-posed inverse problem, they use a convex regularizer in their formulation. We observe that the structure d is non-linearly related to the observations y obs , thereby making the structure space non-convex. Moreover, convex priors are liable to oversmooth, especially
near sharp edges. Non-convex priors have previously been effectively used in image restoration and super-resolution problems [11]. We propose the use of a discontinuity-adaptive Markov random field (DAMRF) prior to preserve fine undulations in the estimate of d. The resulting cost function is minimized using the graduated non-convexity (GNC) algorithm.
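As background for the classical SFF pipeline referenced above (and used later to initialize Algorithm 1), the following sketch computes a sum-modified-Laplacian focus measure over the stack and refines the per-pixel peak by parabolic interpolation; the window size, the parabolic (rather than Gaussian) refinement, and the absence of any smoothness constraint are simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sff_depth(stack, step):
    """Classical shape from focus on a stack of M frames of shape (M, H, W).

    step: axial translation between consecutive frames (e.g. in microns).
    Returns the per-pixel displacement estimate d.
    """
    M = stack.shape[0]
    focus = np.empty_like(stack, dtype=np.float64)
    for m in range(M):
        f = stack[m].astype(np.float64)
        # Modified Laplacian: absolute second derivatives in x and y.
        ml = (np.abs(np.roll(f, 1, 0) - 2 * f + np.roll(f, -1, 0)) +
              np.abs(np.roll(f, 1, 1) - 2 * f + np.roll(f, -1, 1)))
        focus[m] = uniform_filter(ml, size=5)       # sum over a small window

    m_star = np.argmax(focus, axis=0)               # frame of best focus per pixel
    # Parabolic interpolation around the peak of the focus measure profile.
    m0 = np.clip(m_star, 1, M - 2)
    f_minus = np.take_along_axis(focus, (m0 - 1)[None], axis=0)[0]
    f_0     = np.take_along_axis(focus, m0[None], axis=0)[0]
    f_plus  = np.take_along_axis(focus, (m0 + 1)[None], axis=0)[0]
    denom = f_minus - 2 * f_0 + f_plus
    delta = np.where(np.abs(denom) > 1e-12, 0.5 * (f_minus - f_plus) / denom, 0.0)
    return (m0 + delta) * step
```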
2 Proposed Approach
For ease of exposition, we will discuss the image formation process assuming a Gaussian model for the point spread function (PSF) [12], though the formulation itself applies to any differentiable PSF model. The formation of a space-variantly blurred image can be modeled as

y_p = H_p(d) f + n.    (1)
The matrix H_p represents the operation of structure-dependent blurring on the texture f. The vector f is the focused image, lexicographically ordered. We assume an additive white zero-mean Gaussian noise model for the noise term n. Since we are interested in estimating d, Equation 1 can be equivalently expressed as

y_p = F h_p + n.    (2)
The matrix F is sparse and contains only elements from f. The vector h_p is formed by stacking together the PSFs at every point in the image. Let the value at (s, t) of the PSF corresponding to the pixel (x, y) in the pth observation be denoted by h_p(x, y; s, t). PSFs have a finite extent, and for a Gaussian PSF (s, t) span the rectangle defined by (x − 3σ_p(x, y), y − 3σ_p(x, y)) to (x + 3σ_p(x, y), y + 3σ_p(x, y)) centered at (x, y). From the formulation above it might appear that we will need to estimate both the texture f and the structure. However, in SFF, given many observations, it is possible to reconstruct the focused image of the 3D specimen using the stack by simply picking the pixels from the frames where they come into focus, and we use this approximate estimate of the focused image as f. We formulate the problem of structure estimation as the minimization of the energy given by

E(d) = (1/2) Σ_{p=1}^{P} ‖e_p‖² + λ R(d).    (3)

We will refer to the first term in the energy function as the data term and the second term as the prior. Here e_p = y_p − y_p^obs, where y_p^obs represents the pth observation and y_p = F h_p is the current estimate of the pth observation. The term R(d) incorporates the regularization and λ is the regularization parameter. The gradient of the error term e_p with respect to the structure d can be calculated as follows:

∂e_p/∂d = (∂h_p/∂d)^T F^T e_p    (4)
Under the Gaussian model for the PSF, the PSF at the pixel (x, y) in the pth observation is given by

h_p(x, y; s, t) = (1 / Z_p(x, y)) exp( − d²(x, y; s, t) / (2σ_p²(x, y)) )    (5)

where d²(x, y; s, t) = (s − x)² + (t − y)² and Z_p(x, y) = Σ_s Σ_t exp( − d²(x, y; s, t) / (2σ_p²(x, y)) ).
σ_p(x, y) is the blur at the pixel (x, y) in the pth observation, and it is related to the structure d(x, y) by the thin lens equation [13],

σ_p(x, y) = ρRv ( 1/w_d − 1/(w_d − d(x, y) + m_p Δd) )    (6)

where w_d is the working distance and m_p Δd is the displacement for the pth frame. Differentiating equation 6 we get

∂σ_p/∂d (x, y) = − ρRv / (w_d − d(x, y) + m_p Δd)²    (7)
Differentiating equation 5 we get

∂h_p(x, y; s, t)/∂σ_p = ( h_p(x, y; s, t) / σ_p³(x, y) ) ( d²(x, y; s, t) − Σ_u Σ_v h_p(x, y; u, v) d²(x, y; u, v) )    (8)

Since ∂h_p/∂d = (∂σ_p/∂d)(∂h_p/∂σ_p), using equations 7 and 8 in equation 4 we get

∂e_p/∂d (x, y) = − ( ρRv / (w_d − d(x, y) + m_p Δd)² ) f(x, y) Σ_s Σ_t ( ∂h_p(x, y; s, t)/∂σ_p ) e_p(s, t)    (9)

We will now need the gradient of the regularizer with respect to the structure d. Define η(x, y; s, t) = d(x, y) − d(s, t). Let ℵ(x, y) denote the set containing the first-order neighbors of (x, y), i.e.,

ℵ(x, y) = {(x − 1, y), (x + 1, y), (x, y − 1), (x, y + 1)}.    (10)
We will first examine the Gaussian MRF (GMRF) prior, which is a convex prior and is given by R_gmrf(d(x, y)) = Σ_{(s,t)∈ℵ(x,y)} η²(x, y; s, t). The gradient of the GMRF prior R_gmrf is

∂R_gmrf/∂d (x, y) = Σ_{(s,t)∈ℵ(x,y)} η(x, y; s, t).    (11)
As we can see, this prior imposes smoothness proportional to the difference in labels at adjacent sites. This can result in the prior penalizing strong discontinuities very heavily, thereby resulting in solutions that are oversmooth. We propose to use a discontinuity-adaptive MRF (DAMRF) prior which overcomes
this difficulty by adaptively reducing the interactions between sites on opposite sides of a discontinuity. It has previously been successfully employed for image estimation [11]. In this paper we show how a DAMRF prior can be effectively used to capture fine details in the structure. Of the various DAMRF models proposed in the literature [14] we used the model R_damrf(d(x, y)) = Σ_{(s,t)∈ℵ(x,y)} ( γ − γ e^{−η²(x,y;s,t)/γ} ). This function is convex only in the band ( −√(γ/2), √(γ/2) ). The gradient of the DAMRF prior R_damrf is given by

∂R_damrf/∂d (x, y) = Σ_{(s,t)∈ℵ(x,y)} η(x, y; s, t) e^{−η²(x,y;s,t)/γ}    (12)
where the parameter γ decreases the interaction between adjacent sites as a discontinuity forms between them and prohibits interactions across strong discontinuities. This allows the DAMRF prior to adapt to discontinuities in the estimated structure. The gradient of the energy with respect to the structure, ∂E/∂d, is simply the sum of the gradients of the error and prior terms:

∂E/∂d (x, y) = Σ_{p=1}^{P} ∂e_p/∂d (x, y) + λ ∂R/∂d (x, y)    (13)
With a DAMRF prior, we design the GNC algorithm as shown in Algorithm 1.
Algorithm 1. Graduated non-convexity algorithm
  Initialize d^(0) using the estimate obtained from SFF
  γ^(0) = 2η²_max (to ensure that the prior is convex initially [14]), where η_max = d_max − d_min and d_max and d_min are the maximum and minimum values of d
  n = 0
  repeat
    d^(n) = d^(n−1) − μ ∂E(d^(n−1))/∂d  (the optimal step size μ is found by bisection search)
    n = n + 1
    if ‖d^(n) − d^(n−1)‖ < ε then
      γ^(n) = κ γ^(n−1)
    end if
  until (‖d^(n) − d^(n−1)‖ < ε) and (γ^(n) < γ_target)
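A minimal sketch of Algorithm 1 is given below; the DAMRF gradient follows (12), the data-term gradient is passed in as a callable implementing (9), and the fixed step size (instead of the bisection search) as well as the stopping thresholds are simplifying assumptions.

```python
import numpy as np

def damrf_gradient(d, gamma):
    """Gradient of the DAMRF prior (12) over the 4-neighborhood."""
    grad = np.zeros_like(d)
    for axis in (0, 1):
        for shift in (+1, -1):
            eta = d - np.roll(d, shift, axis=axis)       # label differences
            grad += eta * np.exp(-eta**2 / gamma)
    return grad

def gnc(d0, data_gradient, lam=0.1, kappa=0.95, mu=1e-3,
        gamma_target=1.0, tol=1e-4, max_iter=5000):
    """Graduated non-convexity minimization of E(d) = data + lam * R_damrf(d)."""
    d = d0.astype(np.float64).copy()
    eta_max = d.max() - d.min()
    gamma = 2.0 * eta_max**2           # prior is convex for this choice [14]
    for _ in range(max_iter):
        g = data_gradient(d) + lam * damrf_gradient(d, gamma)
        d_new = d - mu * g             # gradient step (bisection search in the paper)
        change = np.abs(d_new - d).max()
        d = d_new
        if change < tol:
            if gamma < gamma_target:
                break
            gamma *= kappa             # make the prior gradually less convex
    return d
```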
Note that it is the availability of closed form analytical expressions for the gradient that allows us to use the fast GNC method. Unlike simulated annealing(SA) [15], GNC does not have a theoretical proof for its convergence to a global minimum for non-convex functions, but has been shown to perform well on many non-convex functions [16]. The tradeoff is that GNC has considerably faster convergence than SA [16]. The parameter γ is set to a high value initially and is gradually reduced by the factor κ till it reaches a low target value γtarget .
3 Experiments
We first present results on the synthetic experiments. We compare our method with traditional SFF and GMRF prior based approaches (using both simulated annealing and gradient descent). In addition, we present results from two recent defocus based methods for structure estimation, namely, least squares [4] and diffusion [5]. The codes for these two techniques have been made available online by the authors, which facilitates straightforward comparison. In our synthetic experiments, we created two observations (near and far focused) with focus settings v1 and v2 respectively, using the true texture and structure. Unlike SFF, DFD does not involve moving the camera and the two observations are produced by changing only the focus settings. Note that this exercise is not possible for real experiments as all frames in the SFF stack were captured at a constant focus setting. These are used as input to both the DFD algorithms. Unlike SFF, these DFD algorithms return the absolute depth z which we convert to d using the relation d = wd − z to compare it with our results. We used four observations from the SFF stack for SA, gradient descent and GNC algorithms. We used the SFF estimate of structure as the initial estimate for all the methods. We show the structure estimates as grayscale images displayed so that the range [min(d), max(d)] maps to the range [0, 255]. We used κ = 0.95 in all our experiments. The regularization parameter λ was set to 0.1 for all our synthetic experiments and 0.01 for the real case. We have already seen the traditional SFF results on the Art structure in Fig. 1. The structure estimates produced by least squares method are shown in Fig. 2 (b). Diffusion produces better estimates than Least Squares as shown in Fig. 2 (c). We see that though the estimates from both these methods are significantly better
Fig. 2. (a) Ground truth. (b) Least Squares (90.90). (c) Diffusion (133.48). (d) SA (140.67). (e) GD-GMRF (103.39). (f) GNC-DAMRF (14.08).
Fig. 3. (a) Ground truth. (b) SFF (232.11). (c) Diffusion (106.56). (d) SA (136.95). (e) GD-GMRF (83.93). (f) GNC-DAMRF (18.21).
than SFF, they are still far from the ground truth, as can be discerned from the root mean square (rms) error given in the brackets in the figures. The structure estimates obtained from the SA method are shown in Fig. 2 (d). Even though SA produces a good estimate we observe the characteristic smoothing near the edges induced by the convex GMRF prior. This effect is most pronounced at the intersection of the second and third rings in the top right quadrant. It should be noted that reducing the regularization used will only result in the estimate becoming more noisy and not sharper and hence this smoothing effect should not be taken as a case of over-regularization. The estimates obtained using gradient descent minimization with a GMRF prior are shown in Fig. 2 (e). We observe that the smoothing effect is much more pronounced compared to SA in the same region. Thus we see that gradient descent gets trapped in a local minima and is unable to reach the minimum reached by SA. Fig. 2 (f) shows the result obtained using GNC with DAMRF prior. We notice that smoothing effect is considerably less pronounced even in the region between the two rings. We also note that the estimate is very close to the ground truth and that visually they are almost indistinguishable. The rms error is just 14mm for a range of depths from 1.4m to 2.1m. We present the second synthetic experiment on the structure we call the Moebius, also obtained from the Middlebury Stereo database [7], and shown in Fig. 3 (a). SFF estimate (Fig. 3 (b)) for this structure is quite poor. We only show the Diffusion estimate (Fig. 3 (c)) as it gave better estimates than Least squares. We see that diffusion estimate is not satisfactory. SA estimate (Fig. 3 (d)) exhibits characteristic blurring due to the GMRF prior. Estimate from gradient descent (Fig. 3 (e)) is inferior to that obtained using SA. GNC with DAMRF avoids the problems of smoothing at discontinuities and produces an estimate (Fig. 3 (f)) very close to the ground truth.
Fig. 4. (a) Sample observation. (b) Face texture. (c) SFF. (d) GD-GMRF. (e) GNC-DAMRF. (f) SA.
Fig. 5. (a) Sample observation. (b) Wheel texture. (c) SFF. (d) GD-GMRF. (e) GNC-DAMRF. (f) SA.
Finally, we show results on real data obtained under a microscope. We use a Nikon Eclipse LV150 industrial microscope to capture the images. The objective has a working distance of 8.8mm. The stage was translated in steps of 25 microns. Since we lack ground truth, we use the estimates from SA as a reference result for comparison. The first sample is called the Face and Fig. 4 (a) shows one of the observations for this sample. The estimated focused image for this sample is shown in Fig. 4 (b). SFF estimate of Face structure is quite poor and is
shown in Fig. 4 (c). This may be due to the fact that the sample does not have much texture. SA estimates on this sample are shown in Fig. 4 (f). Estimates from gradient descent (Fig. 4 (d)) for this sample are inferior as they do not contain the fine structures captured by SA. Estimates obtained from GNC are comparable to those obtained with SA and are shown in Fig. 4 (e). We see that GNC is able to avoid the local minima problem and captures the sharp edges even for real samples. The second real sample is called the Wheel and Fig. 5 (a) shows one of the observations for this sample. The focused image derived from the SFF stack is shown in Fig. 5 (b). SFF fails to capture the fine spokes in the wheel as shown in Fig. 5 (c). The SA estimate is shown in Fig. 5 (f). In the gradient descent estimate (Fig. 5 (d)), the edges are smoothed out. In contrast, the proposed approach (Fig. 5 (e)) retains the sharp edges and is close to the SA output.
4 Conclusions
We proposed a new method for structure estimation in SFF that uses the non-linear relationship between the observations and the structure in conjunction with a discontinuity-adaptive MRF prior to arrive at accurate shape estimates of 3D objects. The proposed approach (and the assumptions made therein) has been validated by several synthetic and real experiments. Acknowledgements. The second author is grateful to the Alexander von Humboldt Foundation for its support.
References 1. Favaro, P., Soatto, S.: 3-D Shape Estimation and Image Restoration: Exploiting Defocus and Motion Blur. Springer, Heidelberg (2006) 2. Nayar, S.: Shape from focus system. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1992, pp. 302–308 (1992) 3. Chaudhuri, S., Rajagopalan, A.: Depth from Defocus: A Real Aperture Imaging Approach. Springer, Heidelberg (1998) 4. Favaro, P., Soatto, S.: A geometric approach to shape from defocus. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 406–417 (2005) 5. Favaro, P., Soatto, S., Burger, M., Osher, S.: Shape from defocus via diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 30(3), 518–531 (2008) 6. Subbarao, M., Choi, T.: Accurate recovery of three-dimensional shape from image focus. IEEE Trans. Pattern Anal. Mach. Intell. 17(3), 266–274 (1995) 7. Scharstein, D., Pal, C.: Learning conditional random fields for stereo. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2007, pp. 1–8 (2007) 8. Brodatz, P.: Textures; a photographic album for artists and designers. Dover Publications, New York (1966) 9. Hasinoff, S.W., Kutulakos, K.N.: Confocal stereo. Int’l J. Computer Vision 81(1), 82–104 (2009)
10. Sorel, M.: Multichannel blind restoration of images with space-variant degradations. PhD thesis, Charles Univ., Prague, Czech Republic, Department of Software Engineering Faculty of Mathematics and Physics (March 2007) 11. Subrahmanyam, G., Rajagopalan, A., Aravind, R.: Importance sampling kalman filter for image estimation. IEEE Signal Process. Lett. 14(7), 453–456 (2007) 12. Pentland, A.P.: A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4), 523–531 (1987) 13. Born, M., Wolf, E.: Principles of Optics. Pergamon Press, Oxford (1993) 14. Li, S.Z.: Markov random field modeling in computer vision. Springer, London (1995) 15. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983) 16. Blake, A., Zisserman, A.: Visual reconstruction. MIT Press, Cambridge (1987)
Making Shape from Shading Work for Real-World Images
Oliver Vogel, Levi Valgaerts, Michael Breuß, and Joachim Weickert
Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Building E1.1, Saarland University, 66041 Saarbrücken, Germany
{vogel,valgaerts,breuss,weickert}@mia.uni-saarland.de
Abstract. Although shape from shading (SfS) has been studied for almost four decades, the performance of most methods applied to real-world images is still unsatisfactory: This is often caused by oversimplified reflectance and projection models as well as by ignoring light attenuation and nonconstant albedo behavior. We address this problem by proposing a novel approach that combines three powerful concepts: (i) By means of a Chan-Vese segmentation step, we partition the image into regions with homogeneous reflectance properties. (ii) This homogeneity is further improved by an adaptive thresholding that singles out unreliable details which cause fluctuating albedos. Using an inpainting method based on edge-enhancing anisotropic diffusion, structures are filled in such that the albedo no longer suffers from fluctuations. (iii) Finally a sophisticated SfS method is used that features a perspective projection model, considers physical light attenuation and models specular highlights. In our experiments we demonstrate that each of these ingredients improves the reconstruction quality significantly. Their combination within a single method gives favorable performance also for images that are taken under real-world conditions where simpler approaches fail.
1 Introduction
An ultimate goal in computer vision is the 3-D reconstruction of our real world based on 2-D imagery. Although tremendous progress has been achieved when reconstructing a 3-D surface from multiple images [1], problems are much more severe when only a single image is available and the illumination is known. In our paper we address this so-called shape-from-shading (SfS) problem by introducing a novel framework that is particularly tailored to the difficulties one has to face in real-world scenarios. In the SfS problem, one usually assumes that a three-dimensional surface is illuminated by a single light source whose direction is known. The goal is to reconstruct this 3-D surface from the brightness variations within a single 2-D image. It is evident that this is a very difficult task that requires a number of additional, simplifying model assumptions in order to become tractable. The investigation of SfS models was pioneered by Horn [2]. His orthographic camera model and his Lambertian surface assumption became characteristic for
numerous early SfS algorithms; see e.g. [3] for a survey. Another milestone in the development of SfS models are the approaches of Prados et al. [4], Tankus et al. [5], and Cristiani et al. [6]. They replaced the orthographic camera model by a pinhole camera model performing a perspective projection, and they assumed that the light source is located at the optical centre. Moreover, a light attenuation term is considered in [4]. These ideas have been further extended by Ahmed et al. [7] and by Vogel et al. [8]. In these works, the Lambertian reflectance model is replaced by the more realistic model of Oren and Nayar [9], which is particularly useful for skin surfaces, or by the Phong model from computer graphics [10], which models specular highlights. Many experts agree that Lambertian assumptions do not model realistic surfaces in an appropriate way [7,11,9]. Although this development shows a clear evolution of SfS models towards more realistic assumptions, most of these papers work on synthetic data. The few ones that use real-world data sets usually do not consider more realistic effects such as highlights or inhomogeneous reflectance properties as part of their models. In view of these difficulties, it is not surprising that in order to make SfS methods work in real-world applications, they had to be combined with external expertise provided e.g. by face databases and machine learning techniques [12] or by user-specified constraints [13]. Our Contribution. The goal of our paper is to show that by a more sophisticated approach, SfS works for a larger class of real-world images, even when no substantial a priori knowledge is available. To this end we combine three successful concepts: • In order to extract the object of interest for the SfS process we segment the image with two level set approaches: the region-based Chan-Vese segmentation model [14] and the edge-based geodesic active contour model [15,16]. • We detect fluctuations in the albedo by a local adaptive thresholding [17] and eliminate them by inpainting with edge-enhancing anisotropic diffusion [18]. • We use the non-Lambertian, perspective SfS model of Vogel et al. [8] that belongs to the most realistic SfS techniques and takes into account highlights. Related Work. In our experiments we demonstrate that it is exactly the combination of the three successful concepts segmentation, albedo handling and nonLambertian SfS that is crucial for the performance of our method. However, some related ideas with two combined concepts have been proposed in the literature. Concerning the combination of inpainting and SfS, Prados et al. [19] applied an algorithm of Tschumperl´e and Deriche [20] for inpainting the eyes and the eyebrows for facial Lambertian SfS. Jin et al. [21] have combined a segmentation step with 3-D reconstruction of Lambertian surfaces. Their method also exploits multiple views. Paper Organization. In Section 2, we present more details on the key concepts of our combined method. An evaluation of their individual usefulness is given in Section 3. The paper is concluded by a summary with outlook in Section 4.
2 Our Three-Stage Approach
Let us now have a more detailed look at the three key concepts that are combined within our SfS framework in order to exclude the background, to handle albedo variations and to deal with non-Lambertian surfaces.

2.1 Finding the Region of Interest – Segmentation
In a first step we separate the object of interest from the background. This is necessary since both have incompatible reflectance properties. For this task we use the active contour model of Chan and Vese [14]. This is a classic level-set-based method that exploits the grey-value difference between object and background. The Chan-Vese model segments the image domain Ω ⊂ R² into two regions by minimising the difference between the image intensity f(x) : Ω → R and its average value in each region. Additional constraints are imposed on the length of the region boundary C and on the area inside C. This comes down to minimising the energy

E(C, c₁, c₂) = μ length(C) + ν area(inside C) + ∫_{inside(C)} (f − c₁)² dx + ∫_{outside(C)} (f − c₂)² dx,    (1)
where c₁ and c₂ are the average values of f inside and outside C, and μ ≥ 0 and ν ≥ 0 are weighting parameters. These weights are important to tune the object detection: A large μ will give a coarse segmentation, while a small μ will detect fine details. As a region-based segmentation model, the Chan-Vese method is fast and robust with respect to initialisation and noise. In order to further improve the localization of the object contour, we use the Chan-Vese result as initialisation for the edge-based geodesic active contour model [15,16]. The governing evolution equation is given by

∂_t φ = |∇φ| div( g(|∇f_σ|) ∇φ/|∇φ| )   on Ω × [0, ∞),
φ(x, 0) = φ₀(x)   on Ω,    (2)

where φ(x, t) is a level-set function, φ₀ a suitable initialisation and ∇ = (∂_x, ∂_y) is the gradient operator. The edge stopping function g draws the contour towards nearby edges in the presmoothed image f_σ, which is obtained by convolving f with a Gaussian with standard deviation σ. The function g(s²) is decreasing in s. In our application we choose the Perona-Malik diffusivity g_PM(s²) = (1 + s²/λ²)⁻¹, where λ > 0 is some contrast parameter [22]. If the object is bounded by a pronounced edge, the edge-based active contours will generally result in a sharper segmentation.
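As a toy illustration of the Chan-Vese step, the following sketch evolves a level-set function for a grayscale image according to (1); the checkerboard initialization, the smoothed delta function, the time step, and the omission of the geodesic-active-contour refinement (2) are all simplifying assumptions.

```python
import numpy as np

def chan_vese(f, n_iter=200, mu=0.2, nu=0.0, dt=0.5, eps=1.0):
    """Very small Chan-Vese sketch on a grayscale image f in [0, 1]."""
    H, W = f.shape
    y, x = np.mgrid[:H, :W]
    phi = np.sin(np.pi * x / 10.0) * np.sin(np.pi * y / 10.0)   # checkerboard seed
    for _ in range(n_iter):
        inside = phi > 0
        c1 = f[inside].mean() if inside.any() else 0.0
        c2 = f[~inside].mean() if (~inside).any() else 0.0
        # Curvature of the level set (mean-curvature term for the length penalty).
        fy, fx = np.gradient(phi)
        norm = np.sqrt(fx**2 + fy**2) + 1e-8
        div_y, _ = np.gradient(fy / norm)
        _, div_x = np.gradient(fx / norm)
        curvature = div_x + div_y
        # Smoothed Dirac delta restricts the update to a band around the contour.
        delta = eps / (np.pi * (eps**2 + phi**2))
        phi += dt * delta * (mu * curvature - nu - (f - c1)**2 + (f - c2)**2)
    return phi > 0
```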
2.2 Ensuring a Homogeneous Albedo – Inpainting by Edge-Enhancing Diffusion
Generally, real-world objects do not have a constant albedo. To apply SfS we need to ensure that the albedo does not vary within the segmented contour.
In our approach we detect regions of differing albedo and fill in neighborhood information to obtain homogeneous reflectance properties. In order to identify regions with fluctuating albedo we use an adaptive thresholding algorithm that works on local windows [17]. Adaptive thresholding is robust with respect to varying illumination conditions within the scene and is widely used in document analysis. Note that by slightly enlarging the identified regions by morphological erosion we can improve the subsequent interpolation result, preventing artifacts at the boundaries. The next step is to interpolate the image in these regions. For this task we choose edge-enhancing anisotropic diffusion (EED) [23]. It was shown to perform better for image inpainting and scattered data interpolation than other PDE-based methods [18]. The main idea behind EED is to allow smoothing within homogeneous regions and along image edges, but to reduce smoothing across them. To this end it makes use of a diffusion tensor. In the region that we want to inpaint we solve the steady-state diffusion equation

0 = div( g(∇u_σ ∇u_σᵀ) ∇u ),    (3)

with the boundary conditions specified by the surrounding data. Here u_σ is a smoothed version of the evolving image u, obtained by convolving it with a Gaussian of standard deviation σ. The scalar-valued diffusivity g is applied to the eigenvalues of the structure tensor ∇u_σ ∇u_σᵀ, while leaving its eigenvectors unchanged. This way, the first eigenvector of the diffusion tensor is parallel to the edge detector ∇u_σ. The desired filter effect comes from the fact that the corresponding eigenvalue is given by g(|∇u_σ|²), such that smoothing is reduced at edges, where |∇u_σ| is large. The second eigenvector is orthogonal to ∇u_σ with corresponding eigenvalue 1. For the diffusivity g one typically chooses the Charbonnier diffusivity g_C(s²) = (1 + s²/λ²)^{−1/2}, with contrast parameter λ > 0. The interpolated image can be seen as an albedo-corrected version of the original image, which now satisfies the assumption of a surface with homogeneous reflectance properties.
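The two ingredients of this stage can be sketched as follows: a local-window adaptive threshold in the spirit of [17] that marks albedo fluctuations, and the construction of the EED diffusion tensor used in (3). The window size, the Sauvola-style constants (which assume an 8-bit intensity range), and the Charbonnier parameters are illustrative assumptions, and the actual diffusion solve with these tensors is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def albedo_mask(f, window=101, k=0.2):
    """Local-window adaptive threshold marking pixels that deviate towards dark.

    Returns True where the pixel lies below its local Sauvola-style threshold,
    i.e. where the albedo is assumed to fluctuate and inpainting is needed.
    """
    mean = uniform_filter(f.astype(np.float64), size=window)
    sq_mean = uniform_filter(f.astype(np.float64) ** 2, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean**2, 0.0))
    return f < mean * (1.0 + k * (std / 128.0 - 1.0))

def eed_tensor(u, sigma=0.3, lam=2.0):
    """Diffusion tensor of edge-enhancing diffusion for Eq. (3).

    The eigenvector across the edge gets the Charbonnier diffusivity,
    the eigenvector along the edge keeps eigenvalue 1.
    """
    u_s = gaussian_filter(u, sigma)
    gy, gx = np.gradient(u_s)
    mag2 = gx**2 + gy**2
    g = 1.0 / np.sqrt(1.0 + mag2 / lam**2)          # Charbonnier diffusivity
    norm = np.sqrt(mag2) + 1e-12
    v1 = np.stack([gx / norm, gy / norm], axis=-1)  # across the edge
    v2 = np.stack([-gy / norm, gx / norm], axis=-1) # along the edge
    # D = g * v1 v1^T + 1 * v2 v2^T, stored as an (..., 2, 2) field.
    D = (g[..., None, None] * v1[..., :, None] * v1[..., None, :]
         + v2[..., :, None] * v2[..., None, :])
    return D
```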
2.3 3-D Reconstruction – Shape from Shading
Finally, we need to reconstruct the modified image from Section 2.2 within the segmentation region obtained in Section 2.1. For this, we use the method of Vogel et al. [8] incorporating the Phong reflectance model since real-world objects feature non-Lambertian surfaces [11]. The model is formulated in terms of the Hamilton-Jacobi equation α I − ka Ia 2 W ks Is −2v 2Q2 f W − kd Id e−2v − e − 1 = 0, (4) Q Q W2 where x = (x, y) ∈ R2 is in the image domain, and u > 0 with v := ln(u) is the sought depth map. The other terms in (4) are given as follows. I := I(x) is the brightness normalised to the interval [0, 1], and f is the focal length denoting
the distance between the optical centre of the camera and the 2-D retinal plane. The terms Q and W are given as

Q := f / √(x² + y² + f²),    (5)
W := √( f² |∇v|² + (∇v · x)² + Q² ).    (6)
Note that in (4), the underlying brightness equation reads as

I = k_a I_a + Σ_{light sources} (1/r²) ( k_d I_d cos φ + k_s I_s (cos θ)^α ).    (7)
Here, φ is the angle between the surface normal at the point ũ := (x, u(x)) ∈ R³ and the light source direction as seen from ũ. The amount of specular light reflected towards the camera is proportional to (cos θ)^α, where θ is the angle between the ideal (mirror) reflection direction of the incoming light and the viewer direction at ũ. The parameter α models the roughness of the material: For α → ∞ one would obtain a model for a perfect mirror. I_a, I_d, and I_s are the intensities of the ambient, diffuse, and specular components of light, respectively. The constants k_a, k_d, and k_s with k_a + k_d + k_s ≤ 1 denote the ratio of ambient, diffuse, and specular reflection [10]. For solving the PDE (4), we use the algorithm proposed by Breuß et al.; for details see [24].
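To make the reflectance model (7) concrete, the sketch below evaluates the Phong brightness for a single surface point when the light source coincides with the optical centre, as in the setup considered here; the clamping of negative cosines and the argument conventions are illustrative assumptions.

```python
import numpy as np

def phong_brightness(normal, point, ka, kd, ks, Ia, Id, Is, alpha):
    """Brightness of Eq. (7) for a single light source at the optical centre.

    normal: unit surface normal at the 3-D point, shape (3,)
    point:  3-D surface point, shape (3,); camera and light sit at the origin.
    """
    r2 = np.dot(point, point)                      # squared light-surface distance
    to_light = -point / np.sqrt(r2)                # direction towards the light
    cos_phi = max(np.dot(normal, to_light), 0.0)
    # Mirror reflection of the incoming light direction about the normal.
    reflect = 2.0 * cos_phi * normal - to_light
    to_viewer = to_light                           # viewer coincides with the light
    cos_theta = max(np.dot(reflect, to_viewer), 0.0)
    return ka * Ia + (kd * Id * cos_phi + ks * Is * cos_theta**alpha) / r2
```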
3 Real-World Experiments
In this section, we evaluate our proposed framework on real-world images. Figure 1 (a) shows a picture of a cup taken with a digital camera in our office environment. The image has size 408 × 306 with quadratic pixels of 1.61 μm side length. The focal length of the camera is 70.2 mm. Figure 1 (b) shows a reconstruction of this surface using the SfS method of Prados and Faugeras [4]. Note that this is already an advanced SfS method, which uses a perspective projection model on Lambertian surfaces and considers the physical light attenuation term. The parameters used for the reconstruction were f = 5435 = 70.2 mm/1.61 μm and γ = 100000, where γ is the calibration parameter used in the model of Prados et al. [4,19]. Note that this parameter can be chosen arbitrarily, since it will only scale the reconstruction uniformly in all dimensions. We can clearly see that the reconstruction fails completely in the background, at the transition from foreground to background, and at textures and highlights on the cup. Now, we demonstrate step by step how our proposed framework helps to improve this reconstruction. In the next experiment, we perform a segmentation as proposed in Section 2.1. Using the parameters ν = 0, μ = 10 for Chan-Vese postprocessed by geodesic active contours with λ = 3.6, we obtain the segmented cup shown in Figure 2
Fig. 1. (a) Photograph of a cup. (b) Lambertian reconstruction.
Fig. 2. (a) Segmented version of the cup image. (b) Lambertian reconstruction of the segmented image.
(a). Now we reduce the reconstruction to only this area. The resulting surface using a Lambertian model for reconstruction is shown in Figure 2 (b). Clearly, this improves the reconstruction of the cup. It is still oddly shaped, but on its boundaries, the reconstruction is substantially better. In the next experiment, we adapt the albedo in the textured regions using the procedure described in Section 2.2. We perform an adaptive thresholding on the image within the cup area, taking a 100 × 100 window. This gives the inpainting region, which is the black template in Figure 3 (a). After a morphological erosion of this inpainting region in order to enlarge its size, we apply EED with the parameters λ = 2 and σ = 0.3 to inpaint the image there. The inpainted image is shown in Figure 3 (b). This image can be regarded as a constant-albedo version of the original image, within the segmented area. Note that this image still contains specular highlights. Now, we reconstruct the surface from the segmented and inpainted data. Figure 3 (c) shows the corresponding reconstruction. We still use the Lambertian model by Prados et al here. The shape of the cup obtained by this Lambertian model looks quite reasonable. However, the cup is estimated much too close to the camera, in particular at specular highlights. Note that the handle, which is pointing slightly towards the background in the original image, is pulled to the front.
Fig. 3. (a) Inpainting region obtained by adaptive thresholding. (b) Inpainted image. (c) Reconstruction of the inpainted image using a Lambertian model.
Fig. 4. (a) Reconstruction of the cup using the Phong model. (b) Rendered version of the final reconstruction.
As a final step, we switch to the more advanced SfS model of Vogel et al. [8], which assumes Phong reflectance properties. With the parameters Is = Id = 100000, kd = 0.6, ks = 0.4, α = 6, we obtain the reconstruction shown in Figure 4 (a). The parameters have been estimated manually, where only α and the ratio between kd and ks is really relevant. The magnitude of Is and Id will only scale the reconstruction. This yields a fairly realistic reconstruction of the cup. Its shape is recovered well, as is its size and the distance to the camera. The handle is now approximately at the correct position, and even at specular highlights the reconstruction is satisfactory. Compared to the results without any preprocessing
Fig. 5. (a) Photograph of a computer mouse on a table. (b) Photograph of a book.
Fig. 6. (a) Reconstruction of the computer mouse. (b) Reconstruction of the book.
in Figure 1 (b), the reconstruction quality is improved dramatically. Figure 4 (b) shows the recovered shape rendered with the texture from the input image. To show the applicability of our framework to other images, we applied it to two other real-world images shown in Figure 5. The impact of the different steps of our framework for these experiments is similar to those of the first experiment. This will be investigated in more detail in future work. The first image shows a computer mouse on a table. Mouse and table obviously have different materials, and the logo of the manufacturer on the mouse has a different colour than the rest of the mouse. The gap between the buttons makes the reconstruction additionally difficult, since we have shadows there, which contradict the model assumptions. Since for this example, foreground and background have similar brightness, we made use of the hue channel in the segmentation step. Figure 6
(a) shows the reconstruction of the mouse. The mouse is recovered very well, including the slots on the buttons, and nearly perfect even at the gaps between the buttons. Figure 5 (b) is a photograph of a book. The background is quite inhomogeneous and would lead to distortions of the shape if reconstructed unsegmented. The book has some texture on it in different colours and brightnesses. The reconstruction in Figure 6 (b), however, is quite convincing.
4 Conclusions and Outlook
The key message of our paper is the proof that shape from shading is possible under the difficult conditions of real-world images, even without the need to include knowledge-based techniques. This has been achieved by a sophisticated three-stage model that incorporates object segmentation, albedo inpainting and non-Lambertian shape from shading. Our experiments demonstrate that shape from shading has the potential of becoming a serious alternative in computer vision systems when other techniques are difficult to apply. In our future work we will focus on exploring this potential further.
References 1. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: Proc. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528. IEEE Computer Society Press, New York (2006) 2. Horn, B.K.P.: Obtaining shape from shading information. In: Winston, P.H. (ed.) The Psychology of Computer Vision, pp. 115–155. McGraw-Hill, New York (1975) 3. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8), 690–706 (1999) 4. Prados, E., Faugeras, O.: Shape from shading: A well-posed problem? In: Proc. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 870–877. IEEE Computer Society Press, San Diego (2005) 5. Tankus, A., Sochen, N., Yeshurun, Y.: Shape-from-shading under perspective projection. International Journal of Computer Vision 63(1), 21–43 (2005) 6. Cristiani, E., Falcone, M., Seghini, A.: Some remarks on perspective shape-fromshading models. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 276–287. Springer, Heidelberg (2007) 7. Ahmed, A., Farag, A.: A new formulation for shape from shading for nonLambertian surfaces. In: Proc. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 17–22. IEEE Computer Society Press, New York (2006) 8. Vogel, O., Breuß, M., Weickert, J.: Perspective shape from shading with nonLambertian reflectance. In: Rigoll, G. (ed.) DAGM 2008. LNCS, vol. 5096, pp. 517–526. Springer, Heidelberg (2008) 9. Oren, M., Nayar, S.: Generalization of the Lambertian model and implications for machine vision. International Journal of Computer Vision 14(3), 227–251 (1995)
10. Foley, J., van Dam, A., Feiner, S., Hughes, J.: Computer Graphics: Principles and Practice. Addison-Wesley, Reading (1996) 11. Harrison, V.G.W.: Definition and Measurement of Gloss. Printing & Allied Trades Research Association, PATRA (1945) 12. Smith, W.A.P., Hancock, E.R.: Facial shape-from-shading and recognition using principal geodesic analysis and robust statistics. International Journal of Computer Vision 76(1), 71–93 (2008) 13. Zhang, L., Dugas-Phocion, G., Samson, J.S., Seitz, S.M.: Single view modeling of free-form scenes. Journal of Visualization and Computer Animation 13(4), 225–235 (2002) 14. Chan, T., Vese, L.: Active contours without edges. IEEE Transactions on Image Processing 10(2), 266–277 (2001) 15. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. International Journal of Computer Vision 22, 61–79 (1997) 16. Kichenassamy, S., Kumar, A., Olver, P., Tannenbaum, A., Yezzi, A.: Conformal curvature flows: from phase transitions to active vision. Archive for Rational Mechanics and Analysis 134, 275–301 (1996) 17. Sauvola, J., Pietikainen, M.: Adaptive document image binarization. Pattern Recognition 33(2), 225–236 (2000) 18. Weickert, J., Welk, M.: Tensor field interpolation with PDEs. In: Weickert, J., Hagen, H. (eds.) Visualization and Processing of Tensor Fields, pp. 315–325. Springer, Berlin (2006) 19. Prados, E., Camilli, F., Faugeras, O.: A unifying and rigorous shape from shading method adapted to realistic data and applications. Journal of Mathematical Imaging and Vision 25(3), 307–328 (2006) 20. Tschumperl´e, D., Deriche, R.: Vector-valued image regularization with PDEs: A common framework for different applications. In: Proc. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 651–656. IEEE Computer Society Press, Madison (2003) 21. Jin, H., Cremers, D., Wang, D., Yezzi, A., Prados, E., Soatto, S.: 3-d reconstruction of shaded objects from multiple images under unknown illumination. International Journal of Computer Vision 76(3), 245–256 (2008) 22. Perona, P., Malik, J.: Scale space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990) 23. Weickert, J.: Theoretical foundations of anisotropic diffusion in image processing. Computing Supplement 11, 221–236 (1996) 24. Breuß, M., Vogel, O., Weickert, J.: Efficient numerical techniques for perspective shape from shading. In: Algoritmy, Podbanske, Slovakia, March 2009, pp. 11–20 (2009)
Deformation-Aware Log-Linear Models
Tobias Gass¹, Thomas Deselaers¹,², and Hermann Ney¹
¹ Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
[email protected]
² Now with the Computer Vision Laboratory, ETH Zurich, Switzerland
Abstract. In this paper, we present a novel deformation-aware discriminative model for handwritten digit recognition. Unlike previous approaches, our model directly considers image deformations and allows discriminative training of all parameters, including those accounting for non-linear transformations of the image. This is achieved by extending a log-linear framework to incorporate a latent deformation variable. The resulting model has an order of magnitude fewer parameters than competing approaches to handling image deformations. We tune and evaluate our approach on the USPS task and show its generalization capabilities by applying the tuned model to the MNIST task. We gain interesting insights and achieve highly competitive results on both tasks.
1 Introduction
One of the major problems in pattern recognition tasks is to model intra-class variability without washing away the inter-class differences. One typical application where many transformations have to be considered is the recognition of handwritten characters. In the past, many approaches towards modeling different writing styles have been proposed and investigated [1, 2, 3, 4, 5]. In this work, we propose a novel model that directly incorporates and trains deformation parameters in a log-linear classification framework. The conventional approaches can be split into two groups. Approaches of the first group directly incorporate certain invariances into their classification framework, e.g. they incorporate the tangent distance into support vector machines [2], use kernel jittering to obtain translated support vectors in a two-step training approach [1], or use transformation-invariant distance measures in nearest neighbor frameworks [4, 5]. The second group does not incorporate the deformation invariance into the model but uses a huge amount of synthetically deformed data during training of a convolutional neural network [3]. The first approach has the disadvantage that during testing a large number of potentially computationally expensive image comparisons has to be performed, whereas in the second approach the training procedure becomes potentially very expensive. None of the approaches presented above explicitly learns the parameters of the allowed deformations; instead, the deformation model is hand-coded by the system developers.
In contrast to these approaches to transformation-invariant classification, Memisevic and Hinton [6] proposed an approach to learn image transformations from corresponding image pairs using conditional restricted Boltzmann machines. In our approach, we aim at training a small (in the number of parameters) model that directly models deformations, automatically learns which deformations are allowed (and desired), is efficient to train and apply, and leads to good results. We build our approach around the image distortion model (IDM) [4], a zero-order, non-linear deformation model, which we shortly describe in the following section. The developed model can also be considered a grid-shaped hidden-conditional random field (HCRF) [7, 8] where the latent variables account for the deformations. In section 3, we present our model which incorporates the IDM into log-linear models. In section 4, we present an experimental evaluation on the USPS and on the MNIST dataset and compare to several published state-of-the-art results as well as to an SVM with an IDM-distance kernel.
2 Image Distortion Model
The IDM has been proposed independently in several works under different names. For example, it has been described as "local perturbations" [9] and as "shift similarity" [10]. Here, we follow the formulation of [4]. The IDM is a zero-order image deformation method that accounts for image transformations by aligning a test image pixel-wise to a prototype image without considering the alignments of neighboring pixels, which allows for efficient calculation. An image alignment maps each pixel $ij$ of an image $A$ of size $I \times J$ to a pixel $(xy)_{ij}$ in the prototype image $B$. We denote an image alignment by $(xy)_{11}^{IJ}: ij \mapsto (xy)_{ij}$. To restrict the possible alignments, commonly a maximal warp range $W$, i.e. the maximal displacement between $ij$ and $(xy)_{ij}$, is defined. In [4], the IDM has mainly been used to obtain distances between pairs of images for nearest neighbor classification. It was noted that the use of local features extracted from small neighborhoods of the pixels, such as sub-windows of Sobel features (smoothed directed derivatives), leads to strongly improved alignments and directly to better classification results.
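To make the zero-order alignment concrete, the following sketch computes an IDM distance between two images given as per-pixel descriptor arrays; the array layout, function name, and default warp range are illustrative choices, not part of [4].

```python
import numpy as np

def idm_distance(x, prototype, warp_range=2):
    # x, prototype: arrays of shape (I, J, D) holding a D-dimensional local
    # descriptor (e.g. Sobel responses) for every pixel.
    I, J, _ = x.shape
    total = 0.0
    for i in range(I):
        for j in range(J):
            x0, x1 = max(0, i - warp_range), min(I, i + warp_range + 1)
            y0, y1 = max(0, j - warp_range), min(J, j + warp_range + 1)
            # squared distances to all candidate prototype pixels in the warp window
            d = ((prototype[x0:x1, y0:y1] - x[i, j]) ** 2).sum(axis=-1)
            total += d.min()  # zero-order: the best match is chosen per pixel
    return total
```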
3 Integrating the IDM into Log-Linear Models
Log-linear models are well-understood discriminative classifiers for which efficient training methods exist. Commonly, the class posterior $p(c \mid X)$ for an observation $X$ is directly modeled as
$$p(c \mid X) = \frac{p(c, X)}{\sum_{c'=1}^{C} p(c', X)} = \frac{\exp(g_\theta(X, c))}{\sum_{c'=1}^{C} \exp(g_\theta(X, c'))}, \qquad (1)$$
where commonly $g_\theta(X, c) = \alpha_c + \lambda_c^{T} X$ is chosen.
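As a point of reference for the extensions that follow, a minimal sketch of evaluating the posterior in Eq. (1); the flattened feature vector and the parameter shapes are assumptions made for illustration.

```python
import numpy as np

def log_linear_posterior(X, alpha, lam):
    # X: flattened feature vector of dimension D; alpha: (C,); lam: (C, D)
    scores = alpha + lam @ X          # g_theta(X, c) for every class c
    scores -= scores.max()            # for numerical stability
    p = np.exp(scores)
    return p / p.sum()                # Eq. (1)
```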
To incorporate deformation invariance into log-linear models, we treat the image alignment as a latent variable which is marginalized out:
$$p(c, X) = \sum_{(xy)_{11}^{IJ}} p(c, (xy)_{11}^{IJ}, X). \qquad (2)$$
To account for image deformations in the discriminant function, we extend $g_\theta(X, c)$ to
$$g_\theta(X, (xy)_{11}^{IJ}, c) = \alpha_c + \sum_{ij} \left( \alpha_{cij(xy)_{ij}} + \lambda_{c(xy)_{ij}}^{T} X_{ij} \right), \qquad (3)$$
where $\theta = \{\alpha_c, \alpha_{cij(xy)_{ij}}, \lambda_{c(xy)_{ij}}\}$; $\alpha_c$ is a class bias, the $\alpha_{cij(xy)_{ij}}$ are class-, position-, and alignment-dependent deformation priors, and $\lambda_{c(xy)_{ij}}$ is a class-dependent weight vector. Note that each pixel $ij$ of image $X$ is represented by a $D$-dimensional vector to allow for additional features; thus, the $\lambda_{c(xy)_{ij}}$ are of the same dimensionality.
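A sketch of the extended discriminant function of Eq. (3) for one fixed alignment; the dense parameter layout (one deformation prior per class, pixel, and displacement, as in the unpooled model) is an illustrative assumption.

```python
def deformation_score(X, align, c, alpha_c, alpha_def, lam, W):
    # X: (I, J, D) pixel descriptors; align[i, j] = (x, y) is the prototype pixel
    # that pixel (i, j) is mapped to; alpha_def[c, i, j, dx+W, dy+W] holds the
    # deformation priors; lam[c, x, y] is the (D,) weight vector for class c
    # and target position (x, y).
    I, J, _ = X.shape
    score = alpha_c[c]
    for i in range(I):
        for j in range(J):
            x, y = align[i, j]
            score += alpha_def[c, i, j, x - i + W, y - j + W] + lam[c, x, y] @ X[i, j]
    return score
```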
3.1 Relationship to Gaussian Models
An interesting aspect of this model is that it can be rewritten as a discriminative Gaussian classifier analogously to [11]. We rewrite
$$p(c, (xy)_{11}^{IJ} \mid X) = \frac{p(c)\, p(X, (xy)_{11}^{IJ} \mid c)}{\sum_{c'} p(c')\, p(X, (xy)_{11}^{IJ} \mid c')} \qquad (4)$$
and decompose
$$p(X, (xy)_{11}^{IJ} \mid c) = p((xy)_{11}^{IJ} \mid c)\; p(X \mid c, (xy)_{11}^{IJ}), \qquad (5)$$
where $p((xy)_{11}^{IJ} \mid c)$ can be considered a deformation prior and $p(X \mid c, (xy)_{11}^{IJ})$ is an emission probability for a given class and alignment. Then, $p((xy)_{11}^{IJ} \mid c)$ can be rewritten as $p((xy)_{11}^{IJ} \mid c) = \prod_{ij} p((xy)_{ij} \mid ij, c)$ and
$$p(X \mid c, (xy)_{11}^{IJ}) = \prod_{ij} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2}\, \frac{(X_{ij} - \mu_{c(xy)_{ij}})^2}{\sigma^2} \right), \qquad (6)$$
assuming a globally pooled diagonal covariance matrix. The direct correspondence to the above model can be seen by setting
$$\alpha_c = \log p(c) - \frac{D}{2} \log 2\pi\sigma^2, \qquad \lambda_{c(xy)_{ij}} = \frac{1}{\sigma^2}\, \mu_{c(xy)_{ij}}, \qquad (7)$$
$$\alpha_{cij(xy)_{ij}} = \log p((xy)_{ij} \mid ij, c) - \frac{1}{2\sigma^2}\, \mu_{c(xy)_{ij}}^{T} \mu_{c(xy)_{ij}}. \qquad (8)$$
This equivalence also shows that the αcij(xy)ij model deformation penalties. Furthermore, the transformations in eq. (7)(8) allow to start from a generative, deformation-aware model such as the one discussed in [4] to initialize our model.
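A sketch of such an initialization, mapping the means, the pooled variance, and the priors of a deformation-aware Gaussian model to the log-linear parameters via Eqs. (7)-(8); the array shapes and the clipping at the image border are our own assumptions.

```python
import numpy as np

def init_from_gaussian(mu, sigma2, class_prior, align_prior, W):
    # mu[c, x, y]: (D,) Gaussian mean for class c at prototype pixel (x, y);
    # sigma2: globally pooled variance; class_prior[c] = p(c);
    # align_prior[c, i, j, dx, dy] = p((xy)_ij | ij, c) for displacement (dx-W, dy-W).
    C, I, J, D = mu.shape
    alpha_c = np.log(class_prior) - 0.5 * D * np.log(2 * np.pi * sigma2)     # Eq. (7)
    lam = mu / sigma2                                                        # Eq. (7)
    alpha_def = np.empty_like(align_prior)
    for c in range(C):
        for i in range(I):
            for j in range(J):
                for dx in range(2 * W + 1):
                    for dy in range(2 * W + 1):
                        x = min(max(i + dx - W, 0), I - 1)
                        y = min(max(j + dy - W, 0), J - 1)
                        alpha_def[c, i, j, dx, dy] = (
                            np.log(align_prior[c, i, j, dx, dy])
                            - mu[c, x, y] @ mu[c, x, y] / (2 * sigma2))      # Eq. (8)
    return alpha_c, alpha_def, lam
```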
3.2 Maximum Approximation
In order to avoid the evaluation of sums over latent variables, a common approach is to use the maximizing configuration of the latent variable, which allows us to rewrite Eq. (2) as
$$p(c \mid X) \approx \frac{1}{Z(X)} \max_{(xy)_{11}^{IJ}} p(c, (xy)_{11}^{IJ}, X) \qquad (9)$$
with unchanged $Z(X)$. In addition to applying the maximum approximation in the numerator, it is possible to also apply it in the denominator, $Z(X) \approx \sum_{c'} \max_{(xy)_{11}^{IJ}} p(c', (xy)_{11}^{IJ}, X)$. We performed experiments with the three different variants and found that the results differ only slightly. In particular, we found that the method with maximum approximation in numerator and denominator, despite being the fastest, has the tendency to perform best. Therefore, we perform the experiments in this paper using this method.
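Because the IDM is a zero-order model, the maximization over alignments in Eq. (9) decomposes over pixels, so the best alignment can be found independently for each pixel. A sketch of classification under this assumption, reusing the parameter layout of the sketch after Eq. (3):

```python
import numpy as np

def classify_max_approx(X, alpha_c, alpha_def, lam, W):
    # Returns argmax_c of Eq. (9); with the maximum approximation also in the
    # denominator, only the per-class numerator maxima matter for the decision.
    C = alpha_c.shape[0]
    I, J, _ = X.shape
    scores = np.array(alpha_c, dtype=float)
    for c in range(C):
        for i in range(I):
            for j in range(J):
                best = -np.inf
                for dx in range(-W, W + 1):
                    for dy in range(-W, W + 1):
                        x, y = i + dx, j + dy
                        if 0 <= x < I and 0 <= y < J:
                            s = alpha_def[c, i, j, dx + W, dy + W] + lam[c, x, y] @ X[i, j]
                            best = max(best, s)
                scores[c] += best   # pixel-wise maxima add up to the global maximum
    return int(scores.argmax())
```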
3.3 Training Method
The training of conventional log-linear models is a convex optimization problem and can be done efficiently. Here, we aim at maximizing the log-likelihood of the posteriors,
$$F(\theta) = \sum_{n} \log p_\theta(c_n \mid X_n), \qquad (10)$$
where θ are the parameters of the class posterior distribution (cf. Eq. (3)). For our proposed model, the training problem is no longer convex. However, given a fixed alignment, the training can be performed normally, and therefore, for the model with maximum approximation, an algorithm that is guaranteed to converge (to a local optimum) exists. This can be seen by considering a class/alignment pair as a pseudo-class, leading to a log-linear model with an enormous number of classes. For the two other variants (no maximum approximation/maximum approximation in numerator and denominator) this cannot be guaranteed; however, as we found in our experiments, the training converges well. An extension of the GIS algorithm that allows training log-linear models with hidden variables has been presented in [12]. However, the authors observed that although the algorithm is guaranteed to converge, convergence can be slow. Similarly to their approach, we also use an alternating optimization method:

Step 1: Train the model parameters θ while keeping the alignments $(xy)_{11}^{IJ}$ fixed.
Step 2: Determine new alignments $(xy)_{11}^{IJ}$ with fixed model parameters θ.

These two steps are then repeated until convergence is reached. To train the parameters of the model with maximum approximation in numerator and denominator, the same procedure can be used, but here, for each training observation, an alignment for each class has to be determined. We train our model using the RProp algorithm [13], which has the advantage that it is robust w.r.t. varying scales of the derivatives because it only takes into account the sign of the partial derivatives to determine the parameter updates.
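A schematic of the alternating optimization described above; the plain gradient step stands in for the RProp update of [13], and `best_alignment` and `log_likelihood_grad` are assumed helper functions, not part of the paper.

```python
def train_alternating(train_set, theta, num_outer=20, num_inner=50, lr=1e-3):
    # train_set: list of (X, c) pairs; theta: parameter array of the model.
    alignments = [best_alignment(X, c, theta) for X, c in train_set]  # from the init model
    for _ in range(num_outer):
        # Step 1: train the parameters theta while keeping the alignments fixed
        for _ in range(num_inner):
            grad = sum(log_likelihood_grad(X, c, a, theta)
                       for (X, c), a in zip(train_set, alignments))
            theta = theta + lr * grad        # RProp would use only the sign of grad
        # Step 2: determine new alignments with fixed model parameters theta
        alignments = [best_alignment(X, c, theta) for X, c in train_set]
    return theta
```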
Table 1. Deformation Priors. The different variants of sharing α-parameters. We show the dependency of the deformation parameters α_{cij(xy)_{ij}} in functional form, where each of these functions is a table which is indexed by the parameters given in column 2. In the last column, we give the number of parameters depending on the number of classes C, the size of the image IJ, and the size of the allowed warp-range W.

Pooling method                          α_{cij(xy)_{ij}}               number of parameters
full alphas (no pooling)                α(c, i, j, i−x, j−y)           C(IJ)(2W+1)²
class pooling                           α(i, j, i−x, j−y)              (IJ)(2W+1)²
deformation independent                 α(c, i, j, δ(x,i)·δ(y,j))      2C(IJ)
position independent                    α(c, i−x, j−y)                 C(2W+1)²
position and deformation independent    α(c, δ(x,i)·δ(y,j))            2C
3.4 Pooling of Deformation Priors
In our initial formulation, the α_{cij(xy)_{ij}} model the deformation priors separately for each class, pixel position, and corresponding alignment, leading to a large number of parameters partly sharing information. To reduce the number of parameters and allow for sharing deformation information, we propose several pooling strategies over classes, positions, and deformations, respectively. An overview of these is given in Table 1.
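One way to realize such sharing is to map every (class, position, displacement) combination to a shared parameter key; the scheme names follow Table 1, while the function itself is our own illustration.

```python
def alpha_index(scheme, c, i, j, dx, dy):
    # Return the key under which alpha_{cij(xy)_ij} is stored for each pooling
    # scheme of Table 1; `undisplaced` plays the role of delta(x,i)*delta(y,j).
    undisplaced = int(dx == 0 and dy == 0)
    if scheme == "full":                      # C * IJ * (2W+1)^2 parameters
        return (c, i, j, dx, dy)
    if scheme == "class_pooling":             # IJ * (2W+1)^2
        return (i, j, dx, dy)
    if scheme == "deformation_independent":   # 2 * C * IJ
        return (c, i, j, undisplaced)
    if scheme == "position_independent":      # C * (2W+1)^2
        return (c, dx, dy)
    if scheme == "pos_and_def_independent":   # 2 * C
        return (c, undisplaced)
    raise ValueError(scheme)
```

Storing the α values in a dictionary keyed by these tuples makes the five variants interchangeable during training.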
4 Experimental Evaluation
We evaluate our methods on two datasets, the rather small, but well-known USPS dataset [14], which we use to evaluate and tune all parameters of our method, and on the MNIST dataset [15], on which we only repeat those experiments which performed best on the USPS dataset. In figure 1, we give an example image for each of the classes for these two datasets. Both datasets consist of images from ten classes of handwritten digits, which are scaled between 0 and 1. The USPS dataset consists of 7291 training images and 2007 test images, and the MNIST dataset consists of 60 000 training images and 10 000 test images.
Fig. 1. Example images for the USPS (top) and the MNIST (bottom) tasks

Fig. 2. The effect of different warp ranges on the error rate on the USPS test data (error rate in %, shown for warp ranges 1-8)
Table 2. Features. The impact of different local features and local context on the classification error [%].

                      local context: no      local context: yes
Features              train     test         train     test
gray values            2.63     7.62          0.47     7.57
Sobel                   0.01     4.04          0.01     4.04
abs(Sobel)              0.78     4.88          0.18     4.78
Sobel + abs(Sobel)      0.01     3.84          0.14     3.64

Table 3. Error rates [%] obtained using the different initializations with and without alternating optimization.

Init.         initial    fixed align.        altern. opt.
                         train    test       train    test
Gaussian        6.52      0.69    4.63        0.69    4.63
log-linear      8.27      0.05    5.93        0.01    4.04
zero init          -      1.87    8.27        0.01    4.04
In the following, we first investigate the warp range and which features are best for finding the best alignments and for classification. Then we investigate the effect of the different deformation-sharing parameters. We also compare three different initializations and compare our results to the state-of-the-art and to an SVM with an IDM-distance kernel. Warp range. One crucial parameter in the IDM is the warp range W , which controls the maximal horizontal and vertical displacement for each pixel. We allow to map the pixel ij to every pixel (xy)ij , where i − W ≤ xi ≤ i + W and j − W ≤ yj ≤ j + W . In figure 2 the effect of different warp ranges on the error rate on the USPS dataset is shown. In these experiments we use simple Sobel features. Features. It was already observed by Keysers et al. [4] that local context is essential to determine good alignments and they found that sub-windows of Sobel features performed best. Here, we investigate the impact of different local descriptors and sub-windows on the classification performance. The results of these experiments are shown in table 2. We compare eight different setups: simple gray values, Sobel features, absolute values of Sobel features, and a combination of Sobel and absolute Sobel. Each feature setup is evaluated with and without 3×3 sub-windows. It can be observed that using Sobel features, scaled from -1 to 1, leads to a significant improvement over using just gray values and there is hardly a difference in the test error rate whether local context is used or not. Absolute Sobel values do not reach the performance of full Sobel features as they lose the direction of the edge information, although it can be observed that the model improves when combining the two. This is due to the fact that the feature combination contains both improved features for alignment as well as non-linear combinations of the original features which improve parameter estimation of the log-linear model. It can be observed that the use of sub-windows leads to a better performance when using the combined Sobel descriptors. Due to the minor improvements using the feature combination but nonetheless greatly increased training effort, we will use simple Sobel features for further investigations and re-combine the best approaches in section 4.1 for the MNIST dataset.
Deformation-Aware Log-Linear Models
207
Table 4. Deformation prior sharing. Error rates[%] on the training and test data of the USPS dataset using the different deformation prior sharing methods along with the number of deformation parameters and the number of parameters of the entire model. Pooling method full alphas (no pooling) class pooling deformation independent position independent position and deformation independent class pooling/pos. & deform. indep.
train ER test ER def. param total param 0.01 0.08 0.05 0.03 0.51 0.04
4.04 3.84 3.94 4.09 3.89 3.94
64000 6400 5120 250 20 2
69130 11530 10250 5380 5150 5132
Alpha pooling. Table 4 shows the results obtained using the different strategies for deformation parameter sharing described in section 3.4. It can be observed that, although the number of parameters is significantly reduced, the error rates on the test data are only slightly affected. This shows that it is not necessary to have position- and deformation-specific deformation priors but that most of the relevant deformation information can be stored in the λ-parameters. The biggest difference is again observed on the training data, which makes us believe that the models with fewer parameters have better generalization capabilities. Initialization and alternating optimization. As described in section 3.1, the presented model can be rewritten as a Gaussian model and can be initialized from a Gaussian model. Since we cannot guarantee convergence to the global optimum of the parameters, in this section, we consider three different ways to initialize the model: initialization from a non-deformation invariant log-linear model, initialization from a deformation-aware generative Gaussian model and initialization of all parameters with zeros. For these alternatives, we compare the results using different training schemes. In the scheme “fixed alignment ”, we initialize the model, determine an alignment of the training data to the init-model and keep this alignment fixed until convergence. In the scheme “alternating optimization”, we perform analogously to the previous experiments. That is, we initialize the model and alternate between re-aligning and parameter updates until convergence. The results of these experiments are given in table 3. Interestingly, the final result is nearly independent of the initialization, which indicates that the alternating optimization is able to find a good set of parameters independent of the starting point. Only for the model initialised from the deformation aware Gaussian model, the alternating optimization has no effect. We believe that this model is stuck in a strong local optimum. However, if alternating optimization is not used, the other two models are clearly worse, which again highlights the importance of the alternating optimization. The training time for the different initialisations is similar where generally the model initialised with a log-linear model needs fewer iterations than the other two.
208
T. Gass, T. Deselaers, and H. Ney
Table 5. Comparison of error rates[%], number of parameters and runtime of our deformation-aware log-linear model to state-of-the-art models
Model log-lin. model+IDM using Sobel + abs(SobelHV) + deform. param. sharing + local context
USPS
MNIST
# param. ER
# param. ER
run-time factor
69 130 74 250 10 340 92 190
4.04 3.84 3.59 3.69
211 690 227 370 31 390 282 270
1.63 1.32 1.36 1.50
50 100 100 900
log-linear model + abs(SobelHV) single Gaussians single Gaussians + IDM [4]
2 570 5 130 2 560 2 560
8.2 5.5 18.5 6.5
7 850 15 690 7 840 7 840
7.4 3.0 18.0 5.8
1 2 1 50
nearest neighbor [4] nearest neighbor + IDM [4]
1 866 496 5.6 1 866 496 2.4
47 040 000 3.1 47 040 000 0.6
729/6 000 36 455/300 000
SVM SVM + IDM [16]/[this work]
658 177 4.4 530 705 2.8
15 411 905 1.5 - 0.7
256/1 963 10 300/100 000
DBN [17] conv. network [3]
640 610 -
1 665 010 1.3 180 580 0.4
210/ 220 -/25
4.1
-
Transfer to MNIST and Comparison to the State-of-the-Art
In table 5, we show how the model, with parameters (warprange, deformation sharing method, feature setup) tuned on the USPS dataset, performs on the MNIST dataset and compare the results for both datasets with several state-ofthe-art results from the literature. Additionally to the error rates, we give the total number of parameters that are necessary in the models to classify a test observation and the run-times of the different methods estimated from the number of basic mathematical operations relative to the fastest method. Note that sharing the deformation parameters has no noticeable impact on computation time, while each additional feature layer increases the run-time. The results in the first block of table 5 are obtained using the deformationinvariant log-linear model. It can be seen that a combination of Sobel and absolute Sobel with position and deformation independent α-pooling improves the results. Additionally using local context does not lead to an improvement but rather to overfitting. All improvements using parameters optimized on the USPS dataset consistently transfer to improvements on the MNIST database, showing the good generalization capabilities of our model. The first comparison result we give is that of a simple log-linear model, which due to the lack of deformation invariance performs significantly worse for both datasets, but is (along with the single Gaussian model) the fastest model. Both models only require to compare a test-observation to one prototype per class. The generative single Gaussian model with IDM, already performs much better
Deformation-Aware Log-Linear Models
209
but an IDM comparison is about 50 times as expensive as a simple componentwise comparison (due to the use of Sobel features and a deformation window of 5×5 pixels). For comparison, we additionally present results using a log-linear model using absolute Sobel features. The nearest neighbor method needs as many comparison operations as there are training samples but the cost is independent of the number of classes, therefore the method is about 800 (resp. 6000) times slower than the simple log-linear model for the USPS dataset and for the MNIST dataset respectively. However, the nearest neighbor method with IDM obtains results among the best published results for both datasets. The number of operations in the SVM depends on the number of support vectors. On both datasets, the number of support vectors is typically about 30% of all trainings samples and thus the method requires about a third of the run-time of the nearest neighbor classifier. For SVMs to include the IDM, we use radial basis kernel with a symmetric variant of the IDM-distance defined as KIDM (X, V ) = exp − γ2 (didm (X, V ) + didm (V, X)) , since non-symmetric kernels cause problems in training SVMs. Although this kernel is not necessarily positive definite, it was observed by Haasdonk [16] that in practice the training converges to a good result. We note that the symmetric IDM-distance is known to perform worse than the asymmetric one in nearest-neighbor experiments (3.4% instead of 2.4%). Nonetheless, the support vector machines obtain excellent results on both datasets, where the results on the MNIST database have been reported by [16] and the results on the USPS database have been obtained using our own implementation. For further comparison we give two state-of-the-art results on the MNIST database using deep belief networks and convolutional neural networks. Both are based on neural networks where the deep belief network is proposed as a general learning technique [17] and no prior knowledge such as deformations is incorporated. The convolutional neural networks were designed with digit recognition in mind and are trained from a huge amount of automatically deformed training data [3]. The convolutional neural network obtains one of the best published results on the MNIST dataset despite its small size and efficient classification stage. However, the training phase for this network is computationally very expensive because the training data is automatically deformed and used for training several thousand times. As an overview, it can be seen that our method compares favorably well to other methods. In particular in comparison with the other fast methods, only the convolutional neural networks, which are difficult to create and optimize, outperform our method with a comparable computation time. Furthermore, the small number of parameters in our model is a good indicator for its generalization performance which is underlined by the successful transfer of the parameters from the USPS dataset to the MNIST dataset.
5
Conclusion
We presented a new model that directly incorporates image deformations into a log-linear framework which achieves highly competitive results on well-known
210
T. Gass, T. Deselaers, and H. Ney
handwritten character recognition tasks. It is possible to fine-tune the amount of deformation priors by sharing and it is shown that using fewer deformation prior parameters the model generalizes better. We also showed that the choice of the features is crucial to find good alignments and to obtain good results. In the future we plan to investigate whether it is possible to extend the deformation-aware log-linear model toward log-linear mixture models analogously to the experiments reported in [12]. Acknowledgement. This work was partially funded by the DFG (Deutsche Forschungsgemeinschaft) under contract NE-572/6 and partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation.
References 1. DeCoste, D., Sch¨ olkopf, B.: Training invariant support vector machines. Machine Learning 46, 161–190 (2002) 2. Haasdonk, B., Keysers, D.: Tangent distance kernels for support vector machines. In: ICPR, Quebec City, Canada, pp. 864–868 (2002) 3. Simard, P.: Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR, Edinburgh, Scotland, pp. 958–962 (2003) 4. Keysers, D., Deselaers, T., Gollan, C., Ney, H.: Deformation models for image recognition. PAMI 29, 1422–1435 (2007) 5. Keysers, D., Macherey, W., Ney, H., Dahmen, J.: Adaptation in statistical pattern recognition using tangent vectors. PAMI 26, 269–274 (2004) 6. Memisevic, R., Hinton, G.: Unsupervised learning of image transformations. In: CVPR, Minneapolis, MN, USA (2007) 7. Lafferty, J., McCallum, A., Pereira., F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: ICML (2001) 8. Quattoni, A., Wang, S., Morency, L.P., Collins, M., Darrell, T.: Hidden conditional random fields. PAMI 29, 1848–1852 (2007) 9. Uchida, S., Sakoe, H.: A survey of elastic matching techniques for handwritten character recognition. Trans. Information and Systems E88-D, 1781–1790 (2005) 10. Mori, S., Yamamoto, K., Yasuda, M.: Research on machine recognition of handprinted characters. PAMI 6, 386–405 (1984) 11. Keysers, D., Och, F.J., Ney, H.: Maximum entropy and gaussian models for image object recognition. In: Van Gool, L. (ed.) DAGM 2002. LNCS, vol. 2449, pp. 498– 506. Springer, Heidelberg (2002) 12. Heigold, G., Deselaers, T., Schl¨ uter, R., Ney, H.: GIS-like estimation of log-linear models with hidden variables. In: ICASSP, Las Vegas, NV, USA, pp. 4045–4048 (2008) 13. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In: ICNN, San Francisco, CA, USA (1993) 14. ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data/ 15. http://yann.lecun.com/exdb/mnist/ 16. Haasdonk, B.: Transformation Knowledge in Pattern Analysis with Kernel Methods. PhD thesis, Albert-Ludwigs-Universit¨ at Freiburg (2005) 17. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation 18, 1527–1554 (2006)
Multi-view Object Detection Based on Spatial Consistency in a Low Dimensional Space Gurman Gill and Martin Levine Center for Intelligent Machines McGill University, Montreal, Canada
Abstract. This paper describes a new approach for detecting objects based on measuring the spatial consistency between different parts of an object. These parts are pre-defined on a set of training images and then located in any arbitrary image. Each part is represented by a group of densely sampled SIFT features. Supervised Locally Linear Embedding is then used to describe the appearance of each part in a low dimensional space. The novelty of this approach is that linear embedding techniques are used to model each object part and the background in the same coordinate space. This permits the detection algorithm to explicitly label test features as belonging to an object part or background. A spatial consistency algorithm is then employed to find object parts that together provide evidence for the location of object(s) in the image. Experiments on the 3D and PASCAL VOC datasets yield results comparable and often superior to those found in the literature.
1
Introduction
This paper presents an algorithm for detecting multiple instances of generic object classes in an image. Recently, several approaches have built a vocabulary of distinctive patches that are obtained by applying a saliency detector [2][3][4][5] or by densely sampling [6][7][8] the underlying image. A feature vector characterizing each image can be computed based on this vocabulary. Such a “geometry-free” representation only implicitly captures any spatial information [3] and thus is at a disadvantage for detecting objects. This aspect was dealt with in [6] which proposed using a spatial pyramid to incorporate approximate geometric correspondence and in [3] which augmented the vocabulary with pairs of visual words co-occurring within a local spatial neighborhood. The approach in this paper incorporates spatial information by dividing the object into cells, as is done in [6]. Each cell represents an object part and the cell structure is used to define spatial relationships between object parts. Each object part is a collection of overlapping densely sampled features that are described using orientation-variant SIFT [3]. To model the distribution of these overlapping SIFT features, we do not create vocabularies or codebooks [6][3][9][5] or a Gaussian mixture model [2][1][4]. Instead, all of the training features are embedded in a lower dimensional space using supervised Locally Linear Embedding (LLE) J. Denzler, G. Notni, and H. S¨ uße (Eds.): DAGM 2009, LNCS 5748, pp. 211–220, 2009. c Springer-Verlag Berlin Heidelberg 2009
212
G. Gill and M. Levine
[10]. This produces distinct spatial clusters in the embedding space, each representing the appearance of the corresponding object part. The embedding space also contains a cluster representing the background class. This representation of both the object and background class in a low dimensional space has previously been used in [11]. However, [11] used global image features for modeling whereas this paper is the first to apply LLE to SIFT features, which are necessarily local. The detection scheme is based on hypothesizing object parts in a test image and then selecting those that are spatially consistent in order to estimate the location of the object. In our framework, each cell is represented by its centroid and spatial consistency is modeled according to the location of cell centroids with respect to each other. Furthermore, for each cell centroid, a rectangle that includes all of the other cell centroids is also learnt to generate a hypothesis for class instance location. Such a hypothesis is presented in [12] but the authors define it for each visual word. Instead of learning a single rectangle for a visual word, the Implicit Shape Model [13] learns a distribution over all possible rectangles associated with a visual word. In [7], object parts are represented as partial surface models (PSMs) which are dense rigid assemblies of image features linked according to the local geometric relationships between them. In contrast, our approach represents object parts by pre-defined cells and, instead of learning PSMs, learns the implicit distribution of features within each cell. Recent work in multi-view object detection [12] has used view labels of 2D training images to construct viewpoint dependent classifiers. Our approach also divides the view-sphere into a discrete number of segments and models the object in each one separately using annotated training images for that particular view. The algorithm in this paper permits each view model to either be represented in a different embedding space or alternatively, all view-models in a single embedding space. For the latter, just a single classifier is required to detect the object in any view. Such a view-invariant representation that does not build multiple viewspecific classifiers is presented in [4][14][8]. In [4], local scale invariant features are related to a viewpoint invariant reference frame called the Object Class Invariant (OCI). The topic model in [8] learns features that model object parts in different viewpoints without any supervision. In [14], object parts that are in correspondence across different views are represented in a canonical view. The canonical parts across all object instances and the homography between them form the 3D object category model. Lastly, 3D models have been used recently to associate features in various 2D views to the 3D object [5][9]. The major contribution of this paper is its application of LLE to model the appearance of all object parts as well as the background using overlapping SIFT features in the same coordinate space (section 2). The second contribution is the use of a spatial consistency method to localize object class instances in a test image, which we have shown to be superior to two conventional sliding-window approaches (section 3). We demonstrate that our approach improves results on the 3D database [14] and achieves comparable performance on two classes from PASCAL VOC 2006 [15] (section 4). Section 5 presents the conclusions and the ongoing work.
Multi-view Object Detection Based on Spatial Consistency
2
213
Object Model
The object model is created using training images from both object and background classes. The only requirement in modeling the appearance of the object class is that features be computed solely on the object in the image and not involve background regions. For this purpose, we have used the object’s contour mask, which is usually provided with most databases [15][14]. All object training images are normalized to a fixed size M × N before feature extraction. An object is represented using features obtained by densely sampling the image every 8 pixels. A 24 × 24 region around each feature is described using a orientation-variant SIFT descriptor. This choice of sampling interval and region size permits sufficient overlap between neighboring descriptors. The region also is large enough, in comparison to the fixed image size, to encompass local contextual information. We note that the usual SIFT descriptor [16] is invariant to in-plane orientation but the orientation-variant SIFT descriptor [3] is more distinctive and can be computed faster. Invariance to changes in the in-plane orientation by a few degrees is provided by a training set that contains object instances, which span this range. The background is modeled using dense local features sampled from images not containing the object of interest. Our model is based on a feature labeling scheme that employs supervised LLE [10] to cluster similar features across the training images. Features are automatically labeled according to their location on the object. The object is divided into K1 × K2 spatial regions, referred to as cells in the spatial pyramid framework [6]. Each cell is associated with a label, and all features that fall within a particular cell are assigned the unique cell label. Figure 1(a) shows the cell structure and labeled features on a car. The background is not divided into cells and all background features are assigned the same unique label. Each cell essentially represents an object part and the spatial layout of the cells encodes the relationship between various object parts. We next describe a simple way to encode this relationship by means of statistical measures that are collected and averaged across all training images. 1. The location of a cell (uniquely identified by label L) is represented by its centroid c(L) (see figure 1(a)). For two cells with centroids c(L1 ) and c(L2 ), their spatial relationship is characterized by the minimum minD(L1 , L2 ) and maximum distance maxD(L1 , L2 ) between them. 2. The number of features n(L) in a cell represents its ideal strength. This statistic can be then used as a measure of the probability of the existence of an object part in a novel image. The probability will be 1 if the number of features comprising the part is n(L) and will monotonically decrease if more or less features are found. 3. During the testing stage, it will be necessary to hypothesize the location of other cells given the location of a particular cell. For this purpose, a “region of interest” ROI(L) that encapsulates the centroids of all other cells relative to the cell with label L is calculated. Figure 1(a) shows the region-of-interest with respect to the bottom right cell (denoted by a rectangle of the same color as the cell’s centroid).
214
G. Gill and M. Levine
(a)
(b)
Fig. 1. Object Model: a) (top) 2×2 cell structure showing labeled features and (bottom) Centroids of all cells; Region-of-Interest and object center with respect to bottom right cell, b) Spatial clusters and background cluster in the embedding space
4. Finally, the relative distance R(L) of each cell c(L) to the center of the object C is obtained. This will be required by the cells in order to vote for a postulated object center C in a test image. Figure 1(a) also shows the center of the object C with respect to the bottom right cell. The final step in building the object model is to apply supervised LLE to all labeled features in all training images. LLE [17] is a technique that maps highdimensional data sampled from a smooth underlying manifold to a lower ddimensional space while preserving local neighborhood relationships. It does not exploit class information of the data. However, supervised LLE computes neighborhood relationships based on the actual data point labels [10]. This yields very distinct object and background clusters in a lower dimensional space (see figure 1(b)). Essentially, the neighborhood preserving property of LLE is used to faithfully capture the similarity among the instances of the object class while the low-dimensional embeddings remove the redundancy in the encoding of similarity in the high-dimensional space. Unlike LDA and PCA, LLE also accounts for nonlinear variations in the data. For multi-view detection, the view-sphere is divided into multiple view segments and spatial clusters are built for each object view. These can be represented either in a single embedding space or, alternatively, spatial clusters in each view can be represented in a distinct embedding space. In the former, only a single classifier is required for finding the object in any view.
3
Detection Algorithm
The object model described earlier does not explicitly incorporate scale. This is achieved during the detection stage by searching for an object using an image pyramid. Each level of the pyramid is uniformly sampled using a sampling interval of 8 pixels. As in the training phase, the 24 × 24 region around these features is described using an orientation-variant SIFT descriptor. Each feature is then projected to the lower dimensional d-space using the non-parametric method
Multi-view Object Detection Based on Spatial Consistency
(a) Group of neighboring features represent an object part
215
(b) Set G of groups 1-5 lie within ROI of group 1
Fig. 2. Labeled features in test images (Best viewed in color)
based on nearest-neighbor interpolation in [17] and is classified as the label of the cluster to whose mean it is closest (see nearest-mean classifier in [10]). We have tested the validity of this “hard assignment” for feature labeling by using Lowe’s ratio test [16]. For each feature, we calculated the ratio between the first and second nearest cluster mean. The feature was assigned the cluster label of the first nearest neighbor only if the ratio was below a certain threshold. Otherwise, it was simply labeled as background. We consistently obtained the best performance for the threshold value of 1. This corresponds to a feature being assigned the label of the nearest cluster as originally hypothesized by the nearest-mean classifier. Figure 2 shows the labeled features in a test image, where those classified as background are not shown. A group of neighboring features with the same label forms a hypothesis for the existence of an object part. The aim is to locate those groups that are spatially consistent and assign a score according to the degree of spatial consistency. The spatial consistency test is based on the spatial layout of the cells established during the creation of the object model in section 2. It proceeds as follows: Firstly, groups of labels {g(L)} are determined, where a group is comprised of all neighboring features with the same label L. The neighbors are defined by considering a 5 × 5 window around the feature, thereby permitting the algorithm to account for intermediate mislabeled features (see red cluster in figure 2(b)). The centroid of each such group represents the center of the object part; the total number of features nE in the group represents its probability wt of occurrence. In this paper, the latter is denoted by a Gaussian function: wt{g(L)} = N (nE, μ, σ), where μ = n(L) is the ideal number of elements for the label L (computed during training) and σ is chosen to be μ/2. We employ a greedy search to find spatially consistent groups g(L), which is based on hypothesizing a set of groups that could possibly be consistent with the object model. If consistency is found, the algorithm proceeds to examine another set of groups. The algorithm begins by initializing each group as unmarked. Using the statistics computed during construction of the object model, the following steps are repeated until all groups are marked : 1. Traversing the image row by row, select the next unmarked group g(L) and obtain the set G of unmarked groups lying inside the region of interest ROI(L) with respect to the centroid of group g(L).
216
G. Gill and M. Levine
2. Compute a consistency score between those groups (g(Li ), g(Lj ))∈G, such that the distance between their centroids is greater than minD(Li , Lj ) and less than maxD(Li , Lj ) consistency(g(Li ), g(Lj )) = wt{g(Li )} + wt{g(Lj )}
(1)
3. If all the groups in step 2 are inconsistent the implication is that group g(L) is noise. It is marked and the algorithm returns to step 1. Else the implication is that consistency is found between some groups in the set G and the algorithm moves to step 4. 4. The set G is assigned a score by adding the scores of all consistent groups in the set. score(G) = consistency(g(Li ), g(Lj )) (2) (g(Li ),g(Lj ))∈G
5. Each consistent group g(Li ) in this set votes for the location of the center of the object by adding R(Li ) to their centroid. All groups obtained in step 2 are marked and the algorithm returns to step 1. Figure 2(b) shows one iteration of the above algorithm: With respect to group 1, the algorithm locates groups 1-5 (set G) within its ROI (step 1) and finds groups 1-4 to be consistent (step 2). The sum of the consistency scores of these groups represents the detection confidence (step 4). Lastly, these groups vote for the center of the object (step 5). A window of size M × N around this center represents the location of the detected object. A greedy approach is preferred since an object part (denoted by a group of labeled features) can belong to only one object. Thus, it need not be compared with other object parts (other groups) once it has already been found to belong to a spatially consistent set. This is accomplished by marking the groups. Our implementation of greedy search used row by row traversal to select a new group of labeled features (see step 1). Indeed, the results would change for a different traversal strategy but experiments have shown that the difference in the results is quite minor (see figure 3(a)) and so any traversal method is equally effective. We note that the spatial consistency test analyzes only groups of cell labels in the test image (figure 2) and does not use a sliding window over the complete image to localize object instances. Two ways of using a sliding window for detection were tested. The first one is based on applying a window attached with cell labels (figure 1(a)) and computing the correlation with the underlying labels on a test image. The second is similar to the Bag-of-Words representation [3] that computes the histogram of cell labels in the test window and finds its Euclidean distance from the histogram of cell labels in the training window (figure 1(a)). The former imposes rigid spatial constraints on the location of different object parts while the latter imposes no spatial constraints at all. In contrast, our modeling approach imposes flexible spatial constraints by encoding the relative location of object parts within a distance range (see step 1 in section 2). Figure 3(b) compares the detection results using each of these methods. Our approach yields the best average precision (AP), which shows that flexible spatial structure is optimal for representing the variations in an object class.
Multi-view Object Detection Based on Spatial Consistency
(a) Row-wise traversal vs random traversal
217
(b) Flexible spatial constraints outperforms other constraints
Fig. 3. Comparison of certain aspects of the spatial consistency algorithm on the toaster class in the 3D dataset [14]. Other classes show similar results.
The detection algorithm in this section could also be used for image classification by assigning the test image a classification score equal to the maximum of the detection scores (obtained in step 4 above) of all the estimated locations of the object. The idea is that the likelihood of an image containing an object is equal to the maximum likelihood of detecting the object in the image.
4 4.1
Evaluation Parameter Settings
The choice of training image size M , N and number of cells K1 , K2 are dependent on the object. Another set of parameters is used in supervised LLE [10]. These are d - dimensionality of the embedding space, k - number of nearest neighbors and degree of supervision α that is always set to 1. We did not resort to specific information about the object class to obtain the value of any of these parameters. Using a set of possible values for each parameter, we selected the ones that yielded the best performance on a validation set. We observed that, for 24 × 24 feature regions to be discriminative, the height and width of the image should usually be set to on the order of 100 pixels. Typically (M or N ) ∈ {60, 80, 100, 120} (the other dimension is dependent on the aspect ratio of the 2D object image). The cell structure should be chosen so that descriptors among the cells are distinct while within a cell they are similar across all training images. If the object is divided into a large number of cells, the number of features per cell decreases and it becomes difficult to obtain a robust estimate of the spatial relationship between them. Typically (K1 , K2 ) ∈ [1, 2, 3] and the number of cells (K1 ∗ K2 ) did not exceed 4. The LLE parameters were found by varying (d, K) ∈ [15, 20]. They do not change much for different object classes suggesting that the manifold represented by local SIFT features is independent of the underlying object class and the classification of SIFT features can be done on a latent space of dimension 15-20 (much lower than 128, the size of the SIFT vector).
218
G. Gill and M. Levine Table 1. Comparison of AUC for different classes in the 3D dataset [14] Bicycle Car Cellphone Iron Mouse Shoe Stapler Toaster
[14]
82.9
73.7
77.5
79.4
83.5
68.0
75.4
73.6
Single Embedding Space
97.9
89.8
66.7
88.5
73.4
85.0
70.7
90.9
Multiple Embedding Spaces
97.4
95.8
73.9
90.2
77.0
86.3
74.8
95.3
Normally, the view-sphere was divided into 8 view-segments, each comprised of 45◦ range in azimuth and full range in elevation (discretized into three heights in [14]). Since local features in views 180◦ apart are very similar, these views were combined in a single view segment. Therefore, the total number of viewsegments was 4. It was observed experimentally that merging views improved performance in both the single and multiple embedding spaces. 4.2
Experiments and Results
The performance of the detector described in this paper was evaluated using average precision (AP) and area-under-ROC-curve (AUC) for detection and classification [15], respectively. We also used the criterion based on the area overlap [15] between the predicted bounding box and ground truth bounding box for considering a detection to be correct. Tests were carried out on 8 classes (Bicycle, Car, Cell Phone, Iron, Mouse, Shoe, Stapler and Toaster) from the 3D dataset [14] and Cars and Bicycles from PASCAL VOC 2006 [15]. A 2 × 2 cell structure was used for all classes, except for Mouse where 1 × 2 was used. The background class was modeled by extracting overlapping SIFT features from 125 background images randomly selected from different databases (Graz, UIUC and Caltech [2]). We next discuss the results for each of the object databases. 3D Dataset [14]: This dataset contains images sampled uniformly across the view-sphere and, therefore, is a good database for testing the efficacy of a method for detecting objects across views. We trained the object model using 144 images (36 images per view × 4 views) for each class and tested on 512 randomly selected images from all the classes (64 images per class × 8 classes). In contrast, [14] used 280 images per class for training and an average of 70 images from each of the classes for testing. Table 1 compares our results with [14]. The performance using multiple embedding spaces is superior by 1 − 7% when compared to using a single embedding space, except for Bicycle. Our method outperforms [14] on 5 of the 8 classes by a margin varying between 10 − 22%. Three classes, namely, Cell phone, Mouse and Stapler performed slightly poorer when compared to [14]. The primary reason for the low classification rates on these classes is due to the simplistic local edge structure of these objects, such as edges at right angles, which often can also be detected in other object images and at various scales. Figure 4(a) shows RPC curves corresponding to each class using multiple
Multi-view Object Detection Based on Spatial Consistency
(a)
219
(b)
Fig. 4. RPC curves and AP for classes in the a) 3D dataset [14], b) VOC 2006 [15] (Best viewed with magnification)
Fig. 5. Detection results on a few classes in the 3D dataset [14] Table 2. Comparison of AUC and AP results on PASCAL VOC 2006 cars [15]
AUC AP
MIT fergus [15] 76.3 16.0
MIT torralba [15] 74.5 21.7
Liebelt et al. [9] 36.3
Ours 89.2 29.6
embedding spaces. Note that we are the first to report detection results on this dataset. Figure 5 shows some detection results. PASCAL VOC 06 [15]: We used the models learnt for cars and bicycles in the multiple embedding experiment above and tested them on a test set in PASCAL VOC 06. Such a methodology (in which training and testing is done on datasets obtained from different sources) is referred to as “competition # 4” for the detection task in the PASCAL VOC 2006 challenge (competition # 2 for classification task). In this competition, only two participants (MIT torralba and MIT fergus) submitted results for the cars dataset and none for the bicycles dataset. Recently, [9] used a 3D synthetic database for training and evaluated on VOC 06 cars and motorbikes. Table 2 compares the AP and AUC for these methods for the cars class. Our results are superior to both the MIT torralba and MIT fergus and comparable to those reported in [9]. Figure 4(b) shows the recall-precision curves obtained by our method on the car and bicycle classes.
5
Conclusion
We have presented a new framework for detecting instances of various object classes. Our approach represents the object as a flexible constellation of
220
G. Gill and M. Levine
pre-defined object parts. It uses supervised LLE on overlapping SIFT features to represent object parts and the background in a low dimensional space. Objects are then detected based on finding spatially consistent object parts. Currently, we are investigating the use of relaxation labeling for consistent labeling of test features as well as representing other aspects of the model (spatial relationships, spatial consistency) within a probabilistic framework.
References 1. Sung, K.K., Poggio, T.: Example-Based Learning for View-Based Human Face Detection. IEEE Transactions PAMI 20(1), 39–51 (1998) 2. Fergus, R., Perona, P., Zisserman, A.: Object Class Recognition by Unsupervised Scale-Invariant Learning. In: CVPR, vol. 2, pp. 264–271 (2003) 3. Sivic, J., Russell, B., Efros, A.A., Zisserman, A., Freeman, B.: Discovering Objects and Their Location in Images. In: ICCV, October 2005, pp. 370–377 (2005) 4. Toews, M., Arbel, T.: Detecting and Localizing 3D Object Classes using Viewpoint Invariant Reference Frames. In: ICCV 3DRR Workshop, pp. 1–8 (October 2007) 5. Yan, P., Khan, S.M., Shah, M.: 3D Model based Object Class Detection in an Arbitrary View. In: ICCV (2007) 6. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2, 2169–2178 (2006) 7. Kushal, A., Schmid, C., Ponce, J.: Flexible Object Models for Category-Level 3D Object Recognition. In: CVPR, June 2007, pp. 1–8 (2007) 8. Fritz, M., schiele, B.: Decomposition, Discovery and Detection of Visual Categories Using Topic Models. In: CVPR (2008) 9. Liebelt, J., Schmid, C., Schertler, K.: Viewpoint-Independent Object Class Detection using 3D Feature Maps. In: CVPR (2008) 10. de Ridder, D., Duin, R.: Locally Linear Embedding for Classification. Technical Report PH-2002-01, Delft Univ. of Tech., Delft (2002) 11. Gill, G., Levine, M.: A Single Classifier for View-Invariant Multiple Object Class Recognition. In: BMVC, vol. 1, pp. 257–266 (2006) 12. Chum, O., Zisserman, A.: An Exemplar Model for Learning Object Classes. In: CVPR (2007) 13. Leibe, B., Schiele, B.: Scale-invariant object categorization using a scale-adaptive mean-shift search. In: Rasmussen, C.E., B¨ ulthoff, H.H., Sch¨ olkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 145–153. Springer, Heidelberg (2004) 14. Savarese, S., Fei-Fei, L.: 3D Generic Object Categorization, Localization and Pose Estimation. In: ICCV, pp. 1–8 (October 2007) 15. Everingham, M., Zisserman, A., Williams, C.K.I., Van Gool, L.: The PASCAL Visual Object Classes Challenge (VOC2006) Results (2006), http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf 16. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV 60, 91–110 (2004) 17. Saul, L.K., Roweis, S.T.: Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. J. Mach. Learn. Res. 4, 119–155 (2003)
Active Structured Learning for High-Speed Object Detection Christoph H. Lampert and Jan Peters Max Planck Institute for Biological Cybernetics T¨ ubingen, Germany
[email protected]
Abstract. High-speed smooth and accurate visual tracking of objects in arbitrary, unstructured environments is essential for robotics and human motion analysis. However, building a system that can adapt to arbitrary objects and a wide range of lighting conditions is a challenging problem, especially if hard real-time constraints apply like in robotics scenarios. In this work, we introduce a method for learning a discriminative object tracking system based on the recent structured regression framework for object localization. Using a kernel function that allows fast evaluation on the GPU, the resulting system can process video streams at speed of 100 frames per second or more. Consecutive frames in high speed video sequences are typically very redundant, and for training an object detection system, it is sufficient to have training labels from only a subset of all images. We propose an active learning method that select training examples in a data-driven way, thereby minimizing the required number of training labeling. Experiments on realistic data show that the active learning is superior to previously used methods for dataset subsampling for this task.
1
Introduction
Smooth high-speed tracking of arbitrary visual objects is essential in industrial automation, in many robot applications, e.g. visual servoing, high-speed ball games, and manipulation of dynamic objects in complex scenarios, as well as in a variety of other topics ranging from human motion analysis to automatic microscope operation. Due to the importance of the problem, many different solutions have been proposed both in academic research projects as well as for industrial applications. Despite great progress in computer vision research, most solutions used in industry still rely upon controlled environmental conditions that can only be achieved on factory floors. Commercially available tracking solutions typically use active solutions such as pulsed LED’s targtes or IR-reflecting markers. Basic research projects, on the other hand, have concentrated either on controlled setups with dark backgrounds, on complex marker patterns, or on systems that do not achieve pixel-exact tracking [1,2]. Overall, tracking objects in semicontrolled, human inhabited environments at high frame rates with off-the-shelf J. Denzler, G. Notni, and H. S¨ uße (Eds.): DAGM 2009, LNCS 5748, pp. 221–231, 2009. c Springer-Verlag Berlin Heidelberg 2009
222
C.H. Lampert and J. Peters
hardware is still an open research problem. However, such components are essential in order to bring robots into human inhabited environments. 1.1
Object Tracking in Image Sequences
Most computer vision tracking technique, in particular Kalman filters [3] and particle filters [4], consists of two main components: a detection step that estimates an object’s position in each individual frame and a motion model that performs temporal smoothing of the object trajectory, e.g. to suppress outliers. In probabilistic terms, the parts usually reflect a likelihood term and a prior. In this work, we concentrate on the detection step: given an image from a sequence, we want to identify the location of a freely moving object with high accuracy and high speed. For maximal robustness, we avoid search space reductions, such as a region of interest, to be able to recover from misdetections without delay. For use in an interactive robotics system, the detections from two independent cameras are integrated in a Markov chain model, and 3D object trajectories are recovered, but these steps are beyond the scope of this paper. 1.2
Object Localization in Single Frames
To formalize the problem of object detection, or localization, we first introduce some notation. We treat images and object positions as random variables, denoting the image (the observed quantity) by x, and the position of the object (the unknown quantity) by y. Object localization, i.e. the task of predicting the object position from the image data, can be expressed as a localization function

$$f : \mathcal{X} \to \mathcal{Y}, \qquad (1)$$
where $\mathcal{X}$ is the space of all images and $\mathcal{Y}$ is the space of possible object locations. For simplicity, we only treat the case where the output is parameterized by the object's center point in pixel coordinates, and where exactly one object location has to be predicted. Generalizations to predicting "no object", or multiple object locations, are of course possible. A huge number of ways to construct localization functions have been proposed in the computer vision literature, either static model-based techniques, such as matched filters [5] and motion templates [6], or systems that learn from training examples, e.g. local classifiers [7,8], voting procedures [9], mean-shift tracking [10], non-linear filters [11], and probabilistic random field models [12]. Most of these techniques are not applicable to our situation, because they are either not fast enough or do not achieve single-pixel prediction accuracy. In this work, we build on structured regression (SR) [13], a flexible technique that treats Equation (1) as a (multidimensional) regression problem. Structured regression was originally introduced for performing object category localization with bag-of-visual-words representations, thereby achieving strong invariance against within-class variation, but providing low spatial detection accuracy. In this work, we show how to adapt SR to the specifics of our problem, where accurate localization
is crucial, because otherwise the subsequent stereo reconstruction breaks down. At the same time, we are able to work with a simpler object representation, because we only target the detection of specific objects, not of semantic object classes. Variations in appearance therefore do not occur arbitrarily, but mainly due to varying illumination, a non-static background, and partial occlusions. In the following Section 2, we recapitulate the concepts behind structured regression and explain our specific design choices. In Section 3, we introduce an improved training method based on active learning. Section 4 contains the experimental evaluation, where we show how active learning improves over other methods for efficient training, and Section 5 contains a summary of the paper and directions for future work.
2 Object Localization by Structured Regression
Structured regression in its general form consists of using a structured support vector machine (S-SVM) [14] to learn a kernelized linear compatibility function

$$F(x, y) = \langle w, \Phi(x, y) \rangle_{\mathcal{H}}, \qquad (2)$$
where $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ is a joint feature function from $\mathcal{X} \times \mathcal{Y}$ into a Hilbert space $\mathcal{H}$ that is implicitly given by a joint kernel function $k : (\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$. From F one obtains a regression function by maximization over the output space

$$f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} F(x, y). \qquad (3)$$
For a fixed choice of kernel k, the function f is completely determined by the weight vector $w \in \mathcal{H}$ that is obtained by solving an optimization problem [15]:

$$\min_{w \in \mathcal{H},\;\xi_1, \dots, \xi_n \in \mathbb{R}^{+}} \quad \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i \qquad (4)$$

subject to weighted margin constraints¹ for $i = 1, \dots, n$:

$$\forall y \in \mathcal{Y} \setminus \{y_i\}: \quad \langle w, \Phi(x_i, y_i) \rangle_{\mathcal{H}} - \langle w, \Phi(x_i, y) \rangle_{\mathcal{H}} \;\ge\; 1 - \frac{\xi_i}{\Delta(y_i, y)}, \qquad (5)$$
where C > 0 is a regularization parameter and $(x_1, y_1), \dots, (x_n, y_n)$ are training examples, i.e. images $x_i$ with manually annotated correct object locations $y_i$. Remembering that $\langle w, \Phi(x, y) \rangle_{\mathcal{H}}$ is equal to the compatibility function F(x, y), one sees that the optimization (4) is a maximum-margin procedure: for each training image $x_i$, we would like to achieve a margin of 1 between the compatibility of the correct prediction $y_i$ and the compatibility of any other possible (and thereby suboptimal) prediction, i.e. $F(x_i, y_i) - F(x_i, y) \ge 1$ for $y \in \mathcal{Y} \setminus \{y_i\}$.
¹ In contrast to [13], we use the slack rescaling formulation of the S-SVM, which is generally considered more robust than the computationally easier margin rescaling [14].
The constraint set (5) expresses this fact, with two additions: each training image gets a slack variable $\xi_i$, because it might not be possible to fulfill all margin constraints simultaneously, and a weight function $\Delta(y_i, y)$ is introduced that reweights the slack variables to reflect the fact that in a regression setup some "wrong" predictions are less bad than others, and therefore not all slack variables should be penalized equally strongly. $\Delta$ is also called the loss function, because it is proportional to the loss one has to pay in the objective function (4) when not achieving a sufficient margin for one of the training examples.

2.1 S-SVM Training by Delayed Constraint Generation
The S-SVM training step is a convex optimization problem. Therefore, one can aim at finding the globally optimal solution vector without the risk of converging only to a local minimum. However, generic optimization packages do not handle the optimization well, because the number of constraints is extremely high: for each training instance, there are as many constraints as there are possible object locations in the image. In order to derive a specialized solution procedure, Joachims [15] observed that only very few of the constraints will be active at the optimal solution, which allows the use of a delayed constraint generation technique. One iterates between solving (4) for a subset of constraints, which is nearly the same quadratic program (QP) as solving an ordinary SVM, and a verification step that checks whether the resulting solution violates any element of the full constraint set (5). If it does not, one has found the globally optimal vector w. Otherwise, one adds one or several violated constraints to the constraint subset and restarts the iteration². Theoretical results guarantee only polynomial-time convergence [14] of this procedure, but practical experience shows that typically only a few iterations are required until the optimal solution is found, see e.g. [17]. In many applications, including structured regression, the time-critical part of the S-SVM training procedure is not the QP solution, but the check for violated constraints. This step requires answering the following argmax problem

$$i^{*}, y^{*} = \operatorname*{argmax}_{i \in \{1,\dots,n\},\; y \in \mathcal{Y} \setminus \{y_i\}} \; \Delta(y, y_i)\,\big(1 + \langle w, \Phi(x_i, y) \rangle - \langle w, \Phi(x_i, y_i) \rangle\big). \qquad (6)$$
Because w is kept fixed in this expression, $\langle w, \Phi(x_i, y_i) \rangle$ is constant, and the maximization is nearly the same as Equation (3), except for an additional weighting by the loss function. Being able to solve (6) quickly is a crucial prerequisite to building an efficient S-SVM training procedure.
² In a computer vision context, the iterated algorithm is similar to bootstrapping methods for training object detection systems. These iteratively improve a detection function by searching for false positive detections and adding them as negative training examples [16]. S-SVM training differs from this as it requires no non-maximum suppression, because the loss function allows arbitrary regions to be included instead of only false positives, and no early stopping, because the margin conditions prevent overfitting.
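The delayed constraint generation loop of Section 2.1 can be summarized in a few lines of code. The following Python sketch is only illustrative: `solve_qp`, `loss` (Δ), and `compatibility` (⟨w, Φ(x, y)⟩) are placeholder callables assumed to exist, and the tolerance handling is simplified compared with an actual SVMstruct implementation.

```python
def train_s_svm(images, labels, locations, loss, compatibility, solve_qp,
                epsilon=1e-3, max_iter=100):
    """Delayed constraint generation for S-SVM training (illustrative sketch)."""
    working_set = []          # selected (i, y) constraints
    w = None                  # current weight vector (None = zero vector)
    for _ in range(max_iter):
        w = solve_qp(working_set)              # solve the QP restricted to the working set
        # search for the most violated constraint, cf. Equation (6)
        best_violation, best_constraint = 0.0, None
        for i, (x_i, y_i) in enumerate(zip(images, labels)):
            for y in locations:
                if y == y_i:
                    continue
                violation = loss(y, y_i) * (1.0 + compatibility(w, x_i, y)
                                                - compatibility(w, x_i, y_i))
                if violation > best_violation:
                    best_violation, best_constraint = violation, (i, y)
        if best_violation <= epsilon:          # no violated constraint left: done
            break
        working_set.append(best_constraint)    # add it and re-solve
    return w
```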
2.2 Fast Object Localization
S-SVM based structured regression can be adapted to localization problems of very different nature by choosing a suitable joint kernel function k and loss function Δ. For our system, the main requirements are high spatial accuracy, because the triangulation of 3D positions would otherwise fail, and high speed at test time, because the robotic system needs to operate in 100 Hz real-time. Our choice of Δ and k reflects this: we use the robust quadratic loss

$$\Delta(y, y') = \min\Big( \tfrac{1}{\sigma^2}\,\|y - y'\|_{L_2}^2,\; 1 \Big), \qquad (7)$$

where y encodes the object center in pixel coordinates and σ is a tolerance parameter that we set to one third of the expected radius of the object. The locally quadratic part enforces high spatial accuracy, whereas the cutoff reflects that all predictions too far from the correct one are equally wrong, thereby making the measure robust to outliers. Because the kernel enters the compatibility function (2), which is evaluated repeatedly at test time, we cannot afford expensive feature extraction steps as in previous applications of S-SVMs to computer vision problems (e.g. [13,18,19]). Instead, we resort to an explicit kernel function

$$k\big( (x, y), (x', y') \big) = \sum_{(u,v) \in W} \phi\big(x, y + (u, v)\big)^{t}\, \phi\big(x', y' + (u, v)\big) \qquad (8)$$
based on a per-pixel feature map $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^k$, where W is a fixed shape region, e.g. a square, centered at relative coordinate (0, 0), such that y + (u, v) runs over all positions of W translated to the center point y. φ(x, y) can be vector valued, with the simplest choice $\phi(x, y) = (x^{R}_{y}, x^{G}_{y}, x^{B}_{y})$, where $x^{R}_{y}, x^{G}_{y}, x^{B}_{y}$ are the values of the red, green and blue channel of the pixel at position y in the image x. One can easily imagine more powerful representations, e.g. working in other color spaces, or using non-linear operations like gamma correction. The reason we choose the kernel k based on a per-pixel feature map is that it allows efficient inference, because it turns the operation of w in Equation (2) into a linear shift-invariant (LSI) filter [20] on x. We write k in Equation (8) as a linear kernel $k\big( (x, y), (x', y') \big) = \Phi(x, y)^{t}\, \Phi(x', y')$ with explicit feature map

$$\Phi(x, y) = \Big( \phi\big(x, y + (u_1, v_1)\big), \dots, \phi\big(x, y + (u_s, v_s)\big) \Big) \qquad (9)$$

for $W = \{(u_1, v_1), \dots, (u_s, v_s)\}$, such that $\mathcal{H} = \mathbb{R}^{K}$ for $K = sk$. Decomposing w into per-pixel contributions $w = (w_{(u_1,v_1)}, \dots, w_{(u_s,v_s)})$ in the same way as Φ, we can rewrite the compatibility function as

$$F(x, y) = \langle w, \Phi(x, y) \rangle = \sum_{c=1}^{k} \sum_{i=1}^{s} w^{c}_{(u_i,v_i)}\, \phi^{c}\big(x, y + (u_i, v_i)\big), \qquad (10)$$

where the index c denotes the vector components of φ. Writing $\hat{w}^{c}$ for the mirrored and padded pattern of $w^{c}$, i.e. $\hat{w}^{c}_{(x_i,y_i)} = w^{c}_{(-x_i,-y_i)}$ where defined, and $\hat{w}^{c}_{(x_i,y_i)} = 0$ elsewhere, we can write the inner sum as a 2D convolution

$$F(x, y) = \sum_{c=1}^{k} \big[\hat{w}^{c} * \phi^{c}(x)\big](y), \qquad (11)$$
where $\phi^{c}(x)$ denotes the c-th channel of the per-pixel feature representation of the whole image x. Now each summand in Equation (11) can be calculated efficiently even for large regions W using the convolution theorem [20]. Denoting the Fourier transform by $\mathcal{F}$, we obtain

$$F(x, y) = \mathcal{F}^{-1}\Big( \sum_{c=1}^{k} \mathcal{F}\hat{w}^{c} \odot \mathcal{F}\phi^{c}(x) \Big)[y], \qquad (12)$$
where ⊙ is the point-wise complex multiplication in Fourier space, and we were able to exchange the order of $\mathcal{F}^{-1}$ and the summation because both are linear operations. The result is a score map of the same size as x, in which we can identify the argmax by a single scan through the elements. The same trick allows us to speed up the training procedure, where we have to repeatedly solve Equation (6). After calculating the scalar product by the convolution theorem, multiplying with the loss function is just a point-wise operation, and we identify the argmax by scanning the array.
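As an illustration of Equations (10)–(12), the score map for one frame can be computed with a handful of FFT calls. The NumPy sketch below is a simplified re-implementation, not the authors' GPU code: it assumes `w_hat` already contains the mirrored and zero-padded per-channel filters at the size of the image, and it ignores the boundary effects of the circular convolution.

```python
import numpy as np

def score_map(image, w_hat):
    """Evaluate F(x, y) for all positions y via the convolution theorem.

    image : float array of shape (H, W, k), per-pixel features phi(x)
    w_hat : float array of shape (H, W, k), mirrored, zero-padded filters
    Returns an (H, W) array of compatibility scores.
    """
    # accumulate sum_c F(w_hat^c) * F(phi^c(x)) in Fourier space
    acc = np.zeros(image.shape[:2], dtype=np.complex128)
    for c in range(image.shape[2]):
        acc += np.fft.fft2(w_hat[..., c]) * np.fft.fft2(image[..., c])
    return np.real(np.fft.ifft2(acc))

def detect(image, w_hat):
    """Return the pixel coordinates maximizing the score map, cf. Equation (3)."""
    scores = score_map(image, w_hat)
    return np.unravel_index(np.argmax(scores), scores.shape)
```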
2.3 Implementation with GPU Support
We implement the described S-SVM training procedure using the Python interface of SVMstruct³. Since training takes only seconds or a few minutes, it is currently not a computational bottleneck for our object tracking system. Test-time speed, however, is the crucial quantity we need to optimize, because we have only milliseconds to evaluate the detection function (3) in a 100 Hz object detection system. We meet these requirements by calculating the Fourier transform on the GPU using the CUDA framework⁴. Using the FFT implementation provided by the CUDA SDK, the convolutions in (11) require less than 3 ms to compute on an NVIDIA GeForce GTX 280 graphics card.
3 Training with Active Learning
Structured regression provides us with a method to train an object detection system from a set of given training examples. However, because the training labels for high-accuracy object localization need to be very accurate, ideally to the pixel level, creating training examples is a tedious task, and we would like to get away with as few labeled examples as possible. To achieve this, we propose an active learning method, i.e. a setup in which the detector itself "decides" which images it would like to have labeled. While active learning is a well-established technique in the area of binary and multiclass classification [21], for the problem of structured prediction only perceptron-like classifiers have so far been studied in an active learning context [22]. This is despite the fact that, because of its inherent sparsity, the S-SVM is much better suited to this idea than the perceptron: due to the maximum-margin framework
³ http://svmlight.joachims.org/svm_struct.html
⁴ http://www.nvidia.com/object/cuda_home.html
the optimal S-SVM solution vector will not depend on all training samples, but only on a subset of support vectors, which in our case are pairs of the training images $x_i$ with correct or incorrect labels $y \in \mathcal{Y}$. The set of support vectors is typically much smaller than the number of training instances, and particularly so for very redundant data sources like high-framerate video streams. Therefore, it makes sense not to label all images of a sequence, but only some relevant ones. If we dropped only images that are not support vectors, we would still obtain exactly the optimal S-SVM solution. Unfortunately, the support vectors are a priori unknown, so in practice heuristic subsampling methods are used, e.g. labeling only every k-th frame, or labeling a random subset.

In this work, we instead propose the active learning setup illustrated in Algorithm 1 (Fig. 1). It can be seen as a generalization of the delayed constraint generation procedure [15]. Instead of only iteratively adding training regions for each image, we iteratively add the images themselves. For each working set of training examples, we train the S-SVM, and then sequentially classify all available training images, including the unlabeled ones, until an outlier criterion is raised. If no outliers are found, the procedure terminates. Otherwise, we ask the user to label the first outlier image, add it to the training set, and reiterate until convergence. Note that w is always a valid weight vector for object detection, so we could also interrupt the procedure at any time, e.g. after a fixed number of training examples.

  Require: image sequence $x_1, \dots, x_n$
    $S \leftarrow \emptyset$
    repeat
      $w \leftarrow$ S-SVM trained with $S$
      for $t = 1, \dots, n$ do
        $\tilde{y}_t \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \Phi(x_t, y) \rangle$
        if outlier($\tilde{y}_t$) $\wedge\; (x_t, \cdot) \notin S$ then
          ask for label $y_t$
          $S \leftarrow S \cup \{(x_t, y_t)\}$
          break from loop over $t$
        end if
      end for
    until no outlier was detected.

Fig. 1. Active S-SVM Training

The concept of an outlier serves as a proxy for a mistake, which would be the ideal criterion for whether to include an image into the training set. However, to decide whether a predicted label differs from the correct label, we would require all images to be labeled, which is exactly what we want to avoid. A predicate outlier, in contrast, we can define by looking only at the object detections in previous frames, using either a physical motion model, or a simpler criterion like the distance between subsequent detections. Since in our practical experiments all outlier criteria tested coincided almost perfectly with a true-mistake criterion, we settled for the simplest setup, declaring a prediction an outlier if its distance to the previous prediction is more than 4 object radii.

Fig. 2. Example frames from the image sequences with varying lighting conditions and players. The task is to detect the table tennis ball (enlarged in top right excerpts) that is of known size and color, but undergoes appearance changes due to non-homogeneous illumination conditions and occlusions.
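A compact sketch of this active selection loop, including the distance-based outlier test just described, might look as follows in Python. Here `train_s_svm`, `detect`, and `request_label` are placeholder callables (S-SVM training, frame-wise detection, and manual annotation); they are assumptions for illustration, not part of the authors' implementation.

```python
import numpy as np

def active_training(frames, train_s_svm, detect, request_label, object_radius):
    """Active S-SVM training: label a frame only when its detection is an outlier."""
    labeled = {}                      # frame index -> ground-truth object position
    w = None
    while True:
        w = train_s_svm(labeled, frames)
        outlier_found = False
        prev_pred = None
        for t, frame in enumerate(frames):
            pred = detect(frame, w)
            # outlier test: jump of more than 4 object radii w.r.t. the previous frame
            if (prev_pred is not None
                    and np.linalg.norm(np.subtract(pred, prev_pred)) > 4 * object_radius
                    and t not in labeled):
                labeled[t] = request_label(frame)   # ask the user for this frame only
                outlier_found = True
                break                               # retrain with the enlarged set
            prev_pred = pred
        if not outlier_found:
            break
    return w, labeled
```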
4 Experimental Evaluation
We show the performance of the proposed active learning setup on a realistic high-speed object detection task. With a static Prosilica GE640C Gigabit Ethernet camera, we captured sequences of people playing table tennis at a resolution of 640 × 480 with 100 frames per second. Figure 2 shows example images from four different test sequences. The task consists of robustly detecting the position of the ball in each frame, for which we use a 33 × 33 rectangular region W. From a pure object detection point of view, this task can be considered relatively easy, as the table tennis ball is a homogeneously textured spherical object of known color and size. However, there are numerous practical complications, because we have to work in a human-inhabited environment that we cannot fully control: different distractors may enter the image, e.g. due to people's different clothing. The image background can partly change due to people or objects entering and leaving the field of view. Additionally, a large window front causes strong variation in the lighting conditions that is non-homogeneous within the room and over time. Overall, classical non-adaptive methods for blob detection, in particular difference of Gaussians (DoG) filters, have proven unreliable under these conditions, as can also be seen from the subsequent experiments. Besides the DoG filter, we compare the proposed active learning setup with two frequently used baseline methods for creating reduced training sets: uniform subsampling (H) and random subsampling (R). For all methods, we measure the detection accuracy on four video sequences consisting of 452, 505, 405 and 268 frames. For the trained methods, all frames not used as training samples are used for testing.

Fig. 3. First 10 training examples from uniform (top), random (center), and active learning (bottom) selection. Active learning chooses more difficult, and thereby informative, training examples.

Figure 4 visualizes the detection performance of the three learning methods and the best performing difference of Gaussians filter for the four sequences of Figure 2. As one can see, all trained methods are able to learn a detector that is better than a predefined DoG filter, even when given only few training examples. We explain this by the fact that the appearance of the table tennis ball in the sequences we use is not rotationally symmetric due to asymmetric illumination conditions, and it is therefore not well modeled by a Gaussian model. The plots also show that the active method of example selection consistently requires fewer training examples to reach an acceptable level of accuracy than
Fig. 4. Detection accuracy (L2 distance between prediction and ground truth) against the number of training samples used. Each figure corresponds to one test sequence, and each data point depicts the mean and standard error over 10 runs with different start states.

Table 1. Fraction of outliers (defined by $\Delta(y_i, y_i^{\mathrm{pred}}) = 1$) for the different detection methods at 5 / 10 / 50 training examples

sequence | best DoG | random             | uniform            | active training
1        | 0.24     | 0.15 / 0.10 / 0.02 | 0.12 / 0.11 / 0.02 | 0.08 / 0.02 / 0.01
2        | 0.84     | 0.26 / 0.19 / 0.08 | 0.28 / 0.15 / 0.07 | 0.21 / 0.15 / 0.02
3        | 0.11     | 0.08 / 0.06 / 0.02 | 0.07 / 0.05 / 0.04 | 0.05 / 0.01 / 0.01
4        | 0.53     | 0.07 / 0.06 / 0.03 | 0.13 / 0.08 / 0.03 | 0.13 / 0.01 / 0.01
the other methods. Table 1 shows that one reason for this is that it produces far fewer strong outliers. The reason for this can be seen from Figure 3, which shows examples of the training sets resulting from the different selection strategies. Because the baseline methods select their training examples regardless of their difficulty, they require labels for samples that are "easy" and unlikely to become support vectors anyway. Active learning adds mainly difficult examples to the training set (e.g. the orange ball in front of varying amounts of skin color). Thus, the labels it requests are more likely to influence the decision function.
5 Summary and Future Work
We have presented a learning framework for efficient object detection. Using a structured regression setup, we showed how to construct the kernel function in a way that allows evaluation on the GPU, thereby achieving detection speeds of more than 100 frames per second, where previous S-SVM based methods require seconds or even minutes per test image [13,18,19]. We also extended the usual structured SVM training procedure to an active learning setup. By this, we were also able to strongly reduce the number of labeled training examples necessary.
A strong advantage of the proposed system is its flexibility. While we applied it in a relatively straightforward setting, working directly with the images' RGB components, other explicit per-pixel features are easily integrated to yield more powerful classifiers. This includes elementary operations like gamma correction or color space transforms, but also non-linear features like local binary patterns. Using temporal differences, one can incorporate background subtraction. An interesting step in this direction, which would also further increase the speed, would be the use of the cameras' raw Bayer pattern as input features. A further direction of study is the question whether we can develop other kernel functions besides convolution-based ones that allow fast GPU-based evaluation.

Acknowledgements. This work was funded in part by the EU project CLASS, IST 027978.
References

1. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. Systems, Man, and Cybernetics 34(3) (2004)
2. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38(4) (2006)
3. Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions of the ASME (1960)
4. Tanizaki, H.: Non-Gaussian state-space modeling of nonstationary time series. J. Amer. Statist. Assoc. 82 (1987)
5. Tsatsanis, M.K., Giannakis, G.: Object detection and classification using matched filtering and higher-order statistics. In: Multidimensional Signal Processing (1989)
6. Hager, G.D., Belhumeur, P.N.: Efficient region tracking with parametric models of geometry and illumination. IEEE Pattern Analysis and Machine Intelligence 20(10) (1998)
7. Viola, P.A., Jones, M.J.: Robust real-time face detection. In: ICCV (2001)
8. Grabner, H., Bischof, H.: On-line boosting and vision. In: CVPR (2006)
9. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. IJCV 77(1) (2008)
10. Bajramovic, F., Gräßl, C., Denzler, J.: Efficient combination of histograms for real-time tracking using mean-shift and trust-region optimization. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, pp. 254–261. Springer, Heidelberg (2005)
11. Reisert, M., Burkhardt, H.: Equivariant holomorphic filters for contour denoising and rapid object detection. IEEE Image Processing 17(2) (2008)
12. Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 1–15. Springer, Heidelberg (2006)
13. Blaschko, M.B., Lampert, C.H.: Learning to localize objects with structured output regression. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 2–15. Springer, Heidelberg (2008)
14. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR 6(2), 1453 (2006)
15. Joachims, T., Finley, T., Yu, C.-N.: Cutting-plane training of structural SVMs. Machine Learning (2009)
16. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. In: CVPR (1996)
17. Joachims, T.: Training linear SVMs in linear time. In: KDD (2006)
18. Szummer, M., Kohli, P., Hoiem, D.: Learning CRFs using graph cuts. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 582–595. Springer, Heidelberg (2008)
19. Li, Y., Huttenlocher, D.P.: Learning for stereo vision using the structured support vector machine. In: CVPR (2008)
20. Jähne, B.: Digital Image Processing. Springer, Heidelberg (2005)
21. Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models. Journal of Artificial Intelligence Research 4, 129–145 (1996)
22. Roth, D., Small, K.: Active learning with perceptron for structured output. In: ICML Workshop on Learning in Structured Output Spaces (2006)
Face Reconstruction from Skull Shapes and Physical Attributes

Pascal Paysan¹, Marcel Lüthi¹, Thomas Albrecht¹, Anita Lerch¹, Brian Amberg¹, Francesco Santini², and Thomas Vetter¹

¹ Computer Science Department, University of Basel, Switzerland
² Division of Radiological Physics, University of Basel Hospital, Switzerland
{marcel.luethi,thomas.albrecht,anita.lerch,brian.amberg,francesco.santini,thomas.vetter}@unibas.ch
Abstract. Reconstructing a person's face from its skeletal remains is a task that has fascinated artists and scientists alike for many decades. In this paper we treat facial reconstruction as a machine learning problem. We use separate statistical shape models to represent the skull and face morphology. We learn the relationship between the parameters of the models by fitting them to a set of MR images of the head and using ridge regression on the resulting model parameters. Since the facial shape is not uniquely defined by the skull shape, we allow the user to specify target attributes, such as age or weight. Our experiments show that the reconstruction results are generally close to the original face, and that by specifying the right attributes the perceptual and measured difference between the original and the predicted face is reduced.
1 Introduction

Face reconstruction from skeletal remains has been practiced for well over a hundred years and is now an important technique in forensic science. Apart from its practical application, facial reconstruction also makes a great machine learning task. Given a set of training images depicting both the face and skull, can we learn a mapping from these data sets which predicts the correct face surface for a given skull? In this paper we propose to model the normal facial surface and skull morphology by means of two separate statistical shape models. We use a shape fitting algorithm to fit the statistical models to Magnetic Resonance (MR) images of the human head. Face reconstruction becomes the problem of learning the relation between the skull and face model parameters. More generally, our method can be seen as an attempt to learn the relationship between two separately constructed but dependent shape models. This makes it possible to use the statistical information represented in one model when given an observation for the other model. In the field of facial reconstruction, two schools of thought have developed [1]: Practitioners of the first school think that all reconstruction methods are inexact and the true face can only be approximated by a facial type which characterizes many possible faces. The second school of thought is dominated by the belief that the facial morphology can be determined from the skull with such accuracy as to make the individual recognizable, by including subtle characteristic details of the skull morphology into the
analysis. Our method combines features of both schools of thought. The shape models represent both the general shape as well as the typical details of the individual's morphology. It is clear, however, that even when a perfect reconstruction of the facial shape can be achieved, the relationship between skulls and faces is not one-to-one. The face of a single person can change with age or weight while the skull remains the same. Our method therefore allows us to constrain the possible reconstructions by specifying such attributes. Our experiments show that correctly specified attributes lead to more accurate reconstructions. Moreover, different reconstruction results for the same individual can be computed, which has been hypothesized to make recognition easier [2].

Related Work. While traditional methods of modeling the reconstruction using clay are still in use, many methods for facial reconstruction based on 3D computer graphics have been developed. A recent review of current methods is given in [1]. Early approaches mimicked the manual approach and simply deform a template face to match the typical soft tissue thickness at discrete markers [3,4] or full soft-tissue maps [5]. These technologies also provide the key to recent methods based on statistical shape models [6,7,8]. Claes et al. [6] use a statistical face model and incorporate properties such as BMI, age and gender. In contrast to our method, the fitting of the face model is performed by simple interpolation of skin markers. Tu et al. [8] use warping techniques to align skulls from a training set to a new skull. After registration, a PCA model is built from the remaining differences in facial shape. This model captures the variation due to factors such as weight and age. Closest to our work is the work by Berar et al. [7]. They build a joint statistical model of face and skull shape. Face reconstruction is treated as a missing data problem, which has a straightforward solution. While a similar goal as ours can be achieved, their model can only be built from data sets which clearly show both the skull and the face surface. This is an important difference in practice, as this currently requires the use of CT images, which are much more difficult to obtain.
2 Background

At the core of our method are the statistical shape models. They efficiently capture the shape properties and guarantee that only statistically likely shapes are represented.

2.1 Statistical Shape Models

Statistical shape models are a widely used tool in computer vision and medical imaging. While the method is independent of the kind of shape model used, we use a Morphable Model, which is obtained by applying Principal Component Analysis (PCA) to data sets for which dense point-to-point correspondence has been established. From n data sets, represented by vectors $s_i \in \mathbb{R}^m$, the mean $\bar{s}$ and covariance matrix Σ are calculated. PCA consists of an eigenvalue decomposition $\Sigma = U D^2 U^T$, where U is the orthonormal matrix of the eigenvectors of Σ, and $D^2$ is a diagonal matrix with the corresponding eigenvalues. With the help of a coefficient vector α, each shape can be represented as a linear combination of the eigenvectors:

$$s = s(\alpha) = U D \alpha + \bar{s}. \qquad (1)$$
When constructing a PCA-based morphable model, it is assumed that the shape vectors s are distributed according to a multivariate normal distribution $\mathcal{N}(\bar{s}, \Sigma)$. Thanks to the representation in Equation 1, the density function takes the simple form:

$$p(s(\alpha)) \propto \exp(-\|\alpha\|^2). \qquad (2)$$
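For concreteness, the construction of such a PCA shape model and the generation of a shape from coefficients α can be sketched in a few lines of Python with NumPy. This is a toy illustration of Equation (1) on raw shape vectors, not the authors' Morphable Model pipeline, and it assumes the training shapes are already in dense correspondence.

```python
import numpy as np

def build_shape_model(shapes):
    """shapes: (n, m) array, one vectorized shape per row, in dense correspondence."""
    s_bar = shapes.mean(axis=0)
    centered = shapes - s_bar
    # SVD of the centered data gives the eigenvectors U; the singular values relate
    # to the eigenvalues of the covariance matrix Sigma.
    U, sing_vals, _ = np.linalg.svd(centered.T, full_matrices=False)
    D = sing_vals / np.sqrt(shapes.shape[0] - 1)   # per-component standard deviations
    return s_bar, U, D

def shape_from_coefficients(s_bar, U, D, alpha):
    """Generate a shape s(alpha) = U D alpha + s_bar, cf. Equation (1)."""
    return U @ (D * alpha) + s_bar
```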
2.2 Training Data

Three different data sets are used for reconstruction: face scans, skull scans and anchor examples, as illustrated in Figure 1a. The face model we use in our experiments is built from 840 structured-light 3D surface scans. For each scanned individual, a number of attributes such as age, weight, and gender were recorded in addition to the geometry and texture of the faces. As most of these attributes can be considered to be independent of the skull shape, we can use them to manipulate the predicted face. We can learn the relationship ϑ = f(α) between the model parameters α and the attributes ϑ by using a regression method (Figure 1a). For the actual face reconstruction, only the most significant $m_f = 50$ principal components are used. The skull model consists of $m_s = 20$ segmented CT scans. Its parameters are denoted by β. It is extremely difficult to obtain CT data sets of the full head of healthy persons, as the scanning process exposes the patients to harmful radiation. Our CT data set therefore includes many scans of dry skulls, which are more easily acquired in sufficient quality. In order to establish a connection between the face and the skull model, we have acquired a third data set of n = 23 MR images, where both the skull and the face are visible. They can be used as anchor points between the skull and the face model. We can fit both models to these "anchor examples", yielding pairs $(\alpha_i, \beta_i)$ of face model parameters $\alpha_i$ and skull model parameters $\beta_i$.¹

2.3 Statistical Model Fitting

Given an MR image of the head, the goal of model fitting is to find a parameter vector α such that the shape s(α) matches the corresponding face or skull contour in the MR image. Moreover, it should be a likely instance of the shape, i.e. we require $\|\alpha\|^2$ to be small (cf. Equation (2)). More formally, let $S \subset \mathbb{R}^3$ be the contour in the image and let $D_S[S']$ be a function measuring the distance between the contour S' and S. The optimal parameters are given as the solution to the optimization problem:

$$\min_{s, t, R, \alpha} \; D_S\big[ s R (\bar{s} + U \alpha) + t \big] + \lambda \|\alpha\|^2, \qquad (3)$$
where $s \in \mathbb{R}$ is a scaling factor, $t \in \mathbb{R}^3$ a translation, $R \in \mathbb{R}^{3\times3}$ a rotation matrix, and λ a weighting coefficient. For more details we refer the reader to [9].
¹ It is not possible to use the MR images directly to build the skull model, as skull segmentation from MR images requires a strong shape prior (which is, in our case, the (CT) skull model).
3 Method

As discussed above, the relationship between skulls and faces is not one-to-one. The face shapes offer much more flexibility than the skulls. In our case, this effect is amplified by the number of training examples for the face model being much larger than the number of skulls and anchor examples. We take advantage of this additional flexibility by reconstructing the face not only from the skull, but from the skull and a set of attributes. In this way, we can reconstruct faces of different weight or age which all fit the given skull equally well. The problem is formulated as a minimization problem for the coefficients α of the face model. The coefficients which best fit a set of skull coefficients β and attributes ϑ are sought as the minimum of a compound functional:

$$E(\alpha) = E_s(\alpha, \beta) + \lambda_1 E_a(\alpha, \vartheta) + \lambda_2 E_p(\alpha). \qquad (4)$$
$\lambda_1$ and $\lambda_2$ are weights to balance the influence of the three terms of the functional:
– The skull error $E_s(\alpha, \beta)$ describes how well the predicted face fits the given skull model coefficients β.
– The attribute error $E_a(\alpha, \vartheta)$ measures how well the predicted face coefficients α match the user-defined attributes ϑ.
– The prior $E_p(\alpha)$ quantifies the probability that the predicted α represents a valid face. It has a regularizing effect and reduces overfitting.
The goal is to find coefficients α that minimize all three terms simultaneously, as illustrated in Figure 1b. In the following subsections we will discuss these three terms in more depth.
Fig. 1. (a) Schema of the data used: face and skull models are described by parameters α and β, face attributes by ϑ. For the anchor examples, α and β are known. The mappings ϑ = f(α) and β = Mα are learned from the data. (b) Face model parameters: it is assumed that several faces α fit a given skull β as well as the attributes ϑ. We search for the $\hat{\alpha}$ with minimal norm conforming to both requirements.
3.1 Linear Skull Predictor

As the most important step of our method, we wish to establish a relationship between the previously independent skull and face model. This is achieved by learning
the relation from the face and skull surfaces given in the training examples. We can fit both models to these "anchor examples" to get n pairs of corresponding parameters $\{(\alpha_i, \beta_i) \mid i = 1, \dots, n\}$, cf. Section 2.3. For each individual i, $\alpha_i \in \mathbb{R}^{m_f}$ are the parameters of the face model, and $\beta_i \in \mathbb{R}^{m_s}$ those of the corresponding skull in the skull model. Using these pairs as training data, we wish to learn a mapping M from the face parameters to the skull parameters, i.e. $M\alpha = \beta$. While in principle this can be achieved with any machine learning approach, we learn a linear mapping. Preferring linear over more complicated mappings has two reasons. First, assuming that an observed face surface can be well represented as a linear combination of training examples, we would expect the underlying skull to be the same combination of the skulls of the training examples, which leads to a linear mapping. Secondly, due to the limited number of training examples, we wish to use a relatively simple model. We now expand the above argument that if a face is well represented as a combination of example faces then its skull should be well represented by the same combination of the corresponding example skulls. For the anchor examples, for which we have both face and skull data, we write the face model parameters $\alpha_i$ as a matrix $A := [\alpha_1, \dots, \alpha_n] \in \mathbb{R}^{m_f \times n}$ and the skull model parameters $\beta_i$ as $B := [\beta_1, \dots, \beta_n] \in \mathbb{R}^{m_s \times n}$. To predict skull parameters $\hat{\beta}$ from face parameters α of a newly observed face, we first find a linear combination $\hat{\alpha} = Ac$ of example face parameters best approximating α. This is done by projecting α into the space of the example faces:

$$c = (A^T A)^{-1} A^T \alpha = \operatorname*{arg\,min}_{c} \|Ac - \alpha\|^2. \qquad (5)$$
The coefficients c are then used to generate the corresponding skull parameters

$$\hat{\beta} = Bc = B (A^T A)^{-1} A^T \alpha =: M\alpha. \qquad (6)$$
As we have relatively few examples, it is necessary to introduce some regularisation in the projection. Therefore we change the above to:

$$\hat{\beta} = Bc = B (A^T A + \lambda I)^{-1} A^T \alpha =: M\alpha. \qquad (7)$$
The mapping matrix $M = B (A^T A + \lambda I)^{-1} A^T$ can equivalently be determined by ridge regression from face parameters to skull parameters:

$$M = \operatorname*{arg\,min}_{M} \|MA - B\|_F^2 + \lambda^2 \|M\|_F^2, \qquad (8)$$
where $\|\cdot\|_F$ is the Frobenius norm. For more details on ridge regression, see e.g. [10]. The mapping M is calculated only once from the training data and can then be used for all subsequent reconstructions. By exchanging A and B, we can exchange the roles of faces and skulls and make a prediction in the opposite direction. For our overall error function (Equation 4), however, we need to evaluate how well the estimated face coefficients α fit the given skull coefficients β in skull space, and therefore calculate the mapping M from face to skull coefficients. We define the error term $E_s(\alpha)$ in Equation (4) as:

$$E_s(\alpha) := \|M\alpha - \beta\|^2. \qquad (9)$$
It measures how well the face coefficients α, or rather their mapping $\hat{\beta} = M\alpha$, fit the input skull coefficients β. This is the Mahalanobis distance in skull space, which is commonly used as a measure of the similarity of two shapes.

3.2 Attribute Prediction from Face Coefficients

The attribute error term $E_a(\alpha, \vartheta)$ measures how well the set of face parameters α matches the chosen attributes ϑ. We relate these different values to each other by learning a function f mapping the face coefficients α to the corresponding attributes ϑ = f(α). Similar to the skull prediction, we use a training set with known matching parameter pairs $(\alpha_i, \vartheta_i)$ to train the function f. As the attributes are known for all 840 face examples used to build the face model, we have a much larger training set and can also use nonlinear functions to learn this relationship. Notably, we train a support vector regression with radial basis function (RBF) kernels. We use the LIBSVM implementation of ν-Support Vector Regression [11] to find the parameters $\alpha_j, \alpha_j^*, b \in \mathbb{R}$ of the RBF support vector regression function:

$$f(x) = \sum_{j=1}^{l} (-\alpha_j + \alpha_j^*)\, e^{-\gamma \|x_j - x\|^2} + b. \qquad (10)$$
Here, l is the number of face examples $x_j$. The kernel width γ and the upper bound for $\alpha_j$ and $\alpha_j^*$ are determined by grid search and ten-fold cross validation. For each recorded attribute, a regression function $f_i$ is learned. The attribute error function $E_a(\alpha, \vartheta)$ is then defined as:

$$E_a(\alpha, \vartheta) := \sum_{i \in I} \big( w_i (f_i(\alpha) - \vartheta_i) \big)^2, \qquad (11)$$
where I is an index set for the different attributes and the $w_i$ are normalization factors for the value ranges of the different attributes.

3.3 Minimization and Face Prediction

We are interested only in solutions α which represent a valid face. The last term $E_p(\alpha) := \|\alpha\|^2$ therefore penalizes unlikely faces (cf. Equation (2)). To find the minimum of the full functional (4) we use a conjugate gradient optimization method with

$$\nabla E(\alpha) = 2(M^T M \alpha - M^T \beta) + 2\lambda_1 \sum_{i} w_i^2 \big(f_i(\alpha) - \vartheta_i\big)\, f_i'(\alpha) + 2\lambda_2 \alpha, \qquad (12)$$

where $f_i'(x)$ is the derivative of the SVR function in Equation (10). Note that the term $E_a$ in (4) is non-linear, and hence it is important to choose a good initial solution. Such an initial solution can be obtained by direct prediction of the coefficients α from β, in the same manner as described in Section 3.1.
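The two learned ingredients of this section, the ridge-regression mapping M of Equation (7) and the quadratic part of the gradient in Equation (12), can be prototyped in a few lines of NumPy. This is a simplified sketch under the assumption that the anchor-example coefficients are available as the columns of A and B; the attribute term is omitted here (λ1 = 0), and plain gradient descent stands in for the conjugate gradient method used in the paper.

```python
import numpy as np

def learn_skull_mapping(A, B, lam):
    """Ridge-regression mapping M = B (A^T A + lam I)^(-1) A^T, cf. Equation (7)."""
    n = A.shape[1]
    return B @ np.linalg.solve(A.T @ A + lam * np.eye(n), A.T)

def reconstruct_face(M, beta, lam2, n_steps=200, lr=1e-2):
    """Minimize ||M alpha - beta||^2 + lam2 ||alpha||^2 by gradient descent
    (the attribute term E_a of Equation (4) is left out in this sketch)."""
    alpha = np.zeros(M.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * (M.T @ (M @ alpha - beta)) + 2.0 * lam2 * alpha
        alpha -= lr * grad
    return alpha
```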
4 Results

For our experiments we have used the data sets introduced in Section 2.2. The 840 face scans were brought into correspondence with a non-rigid iterative closest point
algorithm [12] and the 20 skull surfaces were brought into correspondence with a variational optical flow approach [13]. In the following we present the experimental results for the different parts of our algorithm individually. Finally, we show results where we manipulate the attributes for the obtained reconstruction.

4.1 Skull and Face Prediction without Attributes

First we evaluate the ability of the linear skull predictor introduced in Section 3.1 to reconstruct a skull from given face parameters. We conducted a leave-one-out experiment, comparing the prediction M trained on all but one of the anchor examples to the ground truth given by this left-out example. In this experiment a parameter selection is used to determine a good regularization parameter λ. The best and the worst results are displayed in Figure 2a. For the prediction error of the skulls we obtained a mean absolute error (MAE) of 1.24 mm and a standard deviation (STD) of 1.18 mm.
Fig. 2. Results of skulls predicted from faces and vice versa (columns: original, prediction, prediction error). (a) Skull prediction: best and worst example. (b) Face prediction (without attribute manipulation): best and worst example. In both cases, the best and worst results in terms of the Mahalanobis norm error were selected. The color-coded prediction error is the per-vertex L2-error orthogonal to the surface. For the face prediction, large errors occur at the cheeks where the soft tissue thickness depends strongly on the body weight and age.
Fig. 3. Support Vector Regression results obtained by 10-fold cross validation on the face database. Predicted (y-axis) sex (1, −1 for male and female), weight, height and age plotted against the true value (x-axis); panels: (a) sex, (b) weight, (c) height, (d) age.
Fig. 4. Results of the face prediction with attribute manipulation of the original faces (first column). The second column shows the reconstruction with the optimally estimated attributes. The renderings in the remaining columns are obtained by varying the attributes weight and age (−20 kg, +20 kg, +20 years, +40 years).
Further, we tested the face prediction presented in Section 3.3, but still without attribute manipulation, i.e. we set the weighting parameter for the attribute term λ1 to zero. The best and worst results are shown in Figure 2b. We observe that the largest reconstruction errors occur in places where the soft tissue thickness can vary, whereas the eye and mouth area are well reconstructed even in the bad examples. Errors in the forehead and neck are mostly due to the model's boundary conditions. While it is easy to recognize the best predicted face, the worst reconstruction is not close enough to the ground truth to be able to recognize the person's face anymore. We obtained 2.85 mm MAE and 2.42 mm STD.
Fig. 5. Horizontal cuts to visualize the prediction results (original face vs. prediction +20 kg; original face vs. prediction +40 years).
4.2 Attribute Prediction

Before performing the full face prediction with attribute manipulation, we tested the performance of the attribute prediction function introduced in Section 3.2. Figure 3 shows the true values plotted against the predicted values. A perfect prediction would produce values only on the diagonal. The values for weight, height and age are sufficiently close to the diagonal, while the values for sex show a good approximation of the binary attribute male/female with a continuous variable. We obtained the MAE [0.24, 3.64, 3.21, 3.31] with STD [0.25, 4.49, 4.12, 4.18] for sex, weight, height and age.

4.3 Face Prediction

To evaluate the face prediction results, we estimated for each example of our MRI data set the corresponding face shape (Figure 2b). To separate the training set from the test set, we again used a leave-one-out scheme. To obtain an optimal reconstruction, we estimated the attributes for the face coefficients using the trained regression function. For each of the examples we predicted different results with varying attributes. Examples are shown in Figures 4 and 5, where we show results for the most interesting attributes, weight and age.
5 Conclusion

While a considerable amount of research has been devoted to face reconstruction, it is still arguable whether any of the techniques produces reliable results. Indeed, in a study performed in 2001, Stephan et al. [14] conclude that among 4 standard techniques for facial reconstruction, only one method gave identification rates slightly above chance rate. Our results confirm that, even though the predictions are close in terms of the average error, the individual is difficult to recognize. By constraining the result to satisfy certain attributes, the reconstruction comes perceptually closer to the original face. While the experimental results show the feasibility of our method, we see the biggest advantage of our method in the formulation of the problem in terms of finding a relationship among separate shape model parameters. This formulation allows us to use prior knowledge about faces and skulls that can be acquired independently, using
the suitable acquisition method for each model. Furthermore, the learning approach allows the use of the wide variety of algorithms developed in the field to find statistical dependencies among the model coefficients. While, due to the limited number of training examples, we used a simple linear regression function, we believe that the results can be improved using more data and more sophisticated methods such as canonical correlation analysis, to single out parameters which strongly correlate between the examples for predicting the shapes. The other parameters could then be set depending on the specified attributes. Investigating this possibility will be the subject of future work.
References

1. Wilkinson, C.: Computerized forensic facial reconstruction. Forensic Science, Medicine, and Pathology 1(3), 173–177 (2005)
2. Starbuck, J.M., Ward, R.E.: The affect of tissue depth variation on craniofacial reconstructions. Forensic Science International 172(2-3), 130–136 (2007)
3. Muller, J., Mang, A., Buzug, T.: A template-deformation method for facial reproduction. In: Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, ISPA 2005, pp. 359–364 (2005)
4. Kähler, K., Haber, J., Seidel, H.P.: Reanimating the dead: reconstruction of expressive faces from skull data. ACM Transactions on Graphics (TOG) 22(3), 554–561 (2003)
5. Pei, Y., Zha, H., Yuan, Z.: The craniofacial reconstruction from the local structural diversity of skulls. In: Computer Graphics Forum, vol. 27. Blackwell Publishing Ltd., Malden (2008)
6. Claes, P., Vandermeulen, D., De Greef, S., Willems, G., Suetens, P.: Craniofacial reconstruction using a combined statistical model of face shape and soft tissue depths: Methodology and validation. Forensic Science International 159, 147–158 (2006)
7. Berar, M., Desvignes, M., Bailly, G., Payan, Y.: 3D statistical facial reconstruction. In: Image and Signal Processing and Analysis, ISPA 2005, pp. 365–370 (2005)
8. Tu, P., Hartley, R., Lorensen, W., Allyassin, M., Gupta, R., Heier, L.: Face reconstructions using flesh deformation modes. International Association for Craniofacial Identification (2000)
9. Lüthi, M., Lerch, A., Albrecht, T., Krol, Z., Vetter, T.: A hierarchical, multi-resolution approach for model-based skull-segmentation in MRI volumes. Technical report, University of Basel (2009)
10. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
12. Amberg, B., Romdhani, S., Vetter, T.: Optimal step nonrigid ICP algorithms for surface registration. In: IEEE CVPR, June 2007, pp. 1–8 (2007)
13. Dedner, A., Lüthi, M., Albrecht, T., Vetter, T.: Curvature guided level set registration using adaptive finite elements. In: Pattern Recognition, pp. 527–536 (2007)
14. Stephan, C.N., Henneberg, M.: Building faces from dry skulls: are they recognized above chance rates? Journal of Forensic Sciences 46(3), 432–440 (2001)
Sparse Bayesian Regression for Grouped Variables in Generalized Linear Models

Sudhir Raman and Volker Roth

Department of Computer Science, University of Basel, Bernoullistr. 16, CH-4056 Basel, Switzerland
{sudhir.raman,volker.roth}@unibas.ch
Abstract. A fully Bayesian framework for sparse regression in generalized linear models is introduced. Assuming that a natural group structure exists on the domain of predictor variables, sparsity conditions are applied to these variable groups in order to be able to explain the observations with simple and interpretable models. We introduce a general family of distributions which imposes a flexible amount of sparsity on variable groups. This model overcomes the problems associated with insufficient sparsity of traditional selection methods in high-dimensional spaces. The fully Bayesian inference mechanism allows us to quantify the uncertainty in the regression coefficient estimates. The general nature of the framework makes it applicable to a wide variety of generalized linear models with minimal modifications. An efficient MCMC algorithm is presented to sample from the posterior. Simulated experiments validate the strength of this new class of sparse regression models. When applied to the problem of splice site prediction on DNA sequence data, the method identifies key interaction terms of sequence positions which help in identifying "true" splice sites.
1 Introduction

The standard linear regression model explains real-valued observations $y = (y_1, \dots, y_n)^t$ as products of input vectors $x_i \in \mathbb{R}^p$ and regression coefficients $\beta = (\beta_1, \dots, \beta_p)^t$, with additional additive noise:

$$y_i = x_i^t \beta + \epsilon_i \;\Leftrightarrow\; y = X\beta + \epsilon, \qquad (1)$$
where X is the n × p "design" matrix containing the vectors of input variables as rows. It is usually assumed that the noise terms $\epsilon_i$ are uncorrelated and follow a normal distribution. In many practical applications of regression, we are not only interested in finding regression models (i.e. coefficients β) which are good for predicting the target variable y, but also in identifying important explanatory factors. These explanatory factors might correspond to individual input variables, and finding the explanatory factors would then become a classical variable selection problem. Traditionally, such sparsity in terms of variables has been imposed in the form of a Lasso (ℓ1-norm) constraint on β as introduced in [1]. More recently, a Bayesian version of the Lasso was introduced in [2,3]. Such a Bayesian interpretation allows for detailed analysis of variance estimates of the posterior distribution over the coefficients β. The availability of such
variance estimates solves a fundamental problem present in maximum-likelihood (ML) inference strategies for the Lasso: the form of the ℓ1-constraint function implies that the Hessian of β does not exist at the ML solution, which makes the estimation of standard errors difficult. One problem, however, remains: both experimental studies and theoretical analysis show that the Lasso has the tendency to include too many features [4]. In order to overcome this problem, a flexible class of priors on the β coefficients has been introduced in [5], which is capable of enforcing varying levels of sparsity. In several application settings, the domain of the input variables is endowed with a natural group structure, and interpretable explanatory factors are related to these groups, rather than to single variables. Common examples are k-th order polynomial expansions of the input variables where the groups consist of products over combinations of variables. Another popular example are categorical variables (i.e. "factors" in the usual statistical terminology and/or their interactions) that are represented as groups of dummy variables. In the remainder of this paper, we will assume that such a group structure exists. More formally, we assume that the data matrix X and the coefficients β are subdivided into G groups:

$$X = (X_1, \dots, X_G), \qquad \beta^t = \big(\beta_1^t, \dots, \beta_G^t\big). \qquad (2)$$

The size of the g-th subgroup will be denoted by $p_g$. The Group-Lasso as given in [6] is a natural generalization of the Lasso to handle such grouped input domains. It has been applied to a variety of applications, and its properties have been analyzed theoretically (see, for instance, [7,8]). In this paper, we present a fully Bayesian approach which generalizes the class of prior distributions introduced in [5] to incorporate a more general case of sparsity imposed on non-overlapping groups of features. Compared with the Group-Lasso, this new sparse Bayesian regression model overcomes the problems associated with insufficient sparsity. The Bayesian inference mechanism allows us to analyze the variance in the posterior estimates, which addresses the issues related to unclear significance or even non-uniqueness of ML solutions as discussed in [8]. The implementation is realized through MCMC sampling, which turns out to be highly efficient for many practical applications. Further, the model presented is very flexible in that it can be adapted to all generalized linear models with minimal changes. Some examples of such models are presented in Section 3. Sections 4 and 5 show results obtained from simulations and a real-world application to splice-site prediction on DNA sequences.
2 Sparse Bayesian-Grouped Regression Model

Generalized linear models (GLM). A GLM consists of three components, cf. [9]:

1. Stochastic component: y is the random or stochastic component which is distributed according to some distribution with mean μ. This component is sometimes also referred to as error structure or response distribution.
2. Systematic component: $\eta = x^t \beta$ is the systematic component producing a linear predictor. So the explanatory variables X affect the response variable y through a function of η. The two assumptions implicit in this component are the additive effects of the variables and linearity of effects.
3. The link function connects the stochastic component, which describes the response variable from a wider variety of forms (typically an exponential family distribution), to the systematic component through the mean function g(μ) = η.

Further, we extend the standard definition of the systematic component by adding a random effect to it. This enhancement allows the linear predictor $x^t \beta$ to have stochastic deviations, making the model more flexible with respect to finding the effect of the variables X on the response variable y. This is described as follows:

$$\eta = x^t \beta + \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2). \qquad (3)$$
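To make the role of the random effect concrete, the generative process implied by Equation (3), combined with a concrete likelihood choice, can be simulated in a few lines. The sketch below uses a Poisson response with a log link purely as an illustrative choice, not one prescribed by the paper.

```python
import numpy as np

def simulate_glm_with_random_effect(X, beta, sigma, rng=np.random.default_rng(0)):
    """Draw responses from the extended GLM of Equation (3):
    eta = X beta + eps, eps ~ N(0, sigma^2), then y ~ Poisson(exp(eta)) (log link)."""
    eta = X @ beta + rng.normal(0.0, sigma, size=X.shape[0])   # systematic part + random effect
    mu = np.exp(eta)                                           # inverse link g^{-1}(eta)
    return rng.poisson(mu)
```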
Model specification. From a probabilistic view, this model is written as

$$P(y, \eta) = P\big(y \mid g^{-1}(\eta)\big)\, \mathcal{N}(\eta \mid x^t \beta, \sigma^2), \qquad (4)$$
where P(y | g^{-1}(η)) is the likelihood expression, which can be replaced by various choices of distributions (Normal, Poisson, Binomial, etc.). Having specified this general framework, we now take a Bayesian approach by applying priors to β, suitable to the needs of the modeling problem. We focus on enforcing sparsity on grouped regression coefficients β through a generalized class of distributions in the following manner. We extend the work in [5], which defines a prior over individual regression coefficients, by generalizing it to groups of regression coefficients. Although the prior can be written analytically as a pdf (see Appendix 1), it is defined as a two-level hierarchical model, by introducing latent variables λ_g, in order to make posterior analysis feasible. The general class of priors over the regression coefficients is defined as follows:

∏_{g=1}^{G} p(β_g | σ) = ∏_{g=1}^{G} ∫ N(β_g | 0, σ² λ_g² I) p(λ_g²) dλ_g²,    (5)
where each β_g is a scale mixture of multivariate Gaussians. Based on the work in [5], we apply a class of gamma distributions for the prior of λ_g², defined as follows:

λ_g² ∼ Gamma(p̃_g α, (p̃_g ρ)/2),    (6)
where p̃_g = (p_g + 1)/2. Based on eq. (5) and eq. (6), it is possible to derive the marginal pdf of β_g (see Appendix 1 for the derivation). Another novel extension to the work in [5] is the full Bayesian treatment of the model. This is achieved by introducing a prior on σ², based on a standard conjugate joint prior (see [10]), described as a product of a Normal distribution of β given σ and an inverse-chi-square distribution of σ²,

p(β, σ²) = p(β | σ²) p(σ²) = N(β | μ, σ² Σ) · Inv-χ²(σ² | ν_0, s_0²),

and a joint prior on (ρ, α), based on [11],

p(α, ρ | t, q, r, s) ∝ t^{α−1} exp(−ρq) / (Γ(α)^r ρ^{−αs}).    (7)

The full model is described in Figure 1.
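To make the hierarchy of eqs. (5)–(7) concrete, a minimal sketch of drawing coefficients from the group prior is given below; the Gamma rate convention and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_group_coefficients(group_sizes, alpha, rho, sigma2, rng=None):
    """Draw one sample of beta from the hierarchical group prior of eqs. (5)-(6).

    Each group g gets a latent scale lambda_g^2 ~ Gamma(shape, rate) and then
    beta_g ~ N(0, sigma2 * lambda_g^2 * I).  The rate convention for the Gamma
    parameters is an assumption; the text only states the two arguments.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = []
    for p_g in group_sizes:
        p_tilde = (p_g + 1) / 2.0
        shape, rate = p_tilde * alpha, p_tilde * rho / 2.0
        lam2 = rng.gamma(shape, scale=1.0 / rate)        # latent group scale
        betas.append(rng.normal(0.0, np.sqrt(sigma2 * lam2), size=p_g))
    return np.concatenate(betas)

# smaller alpha concentrates more groups near zero, i.e. stronger sparsity
draw = sample_group_coefficients([3, 3, 9], alpha=0.4, rho=1.0, sigma2=0.1)
```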
[Figure 1: left panel shows the prior density over β for α = 1.00, 0.75, and 0.40; right panel shows the hierarchical model with nodes (ρ, α), λ_g², (ν_0, s_0²), σ², β, η, and y.]
Fig. 1. Left panel: effect of α on the distribution over β, pushing it toward greater sparsity for lower values. Right panel: the general hierarchical model.
Posterior Sampling. In practice, sampling from the posterior distribution will not be possible analytically, hence we propose to use a Gibbs sampling strategy for stochastic integration. Multiplying the priors with the likelihood and rearranging the relevant terms yields the full conditional posteriors, which are needed in the Gibbs sampler for carrying out the stochastic integrations. Also, expressing the β prior as a two-level hierarchical model enables all conditionals to have a standard form, which makes Gibbs sampling feasible. Concerning β and σ², the resulting conditionals have the standard form in Bayesian regression (see [10]), described as a product of a Normal distribution over β given σ and an inverse-chi-square distribution over σ². The conditional of λ_g² results in a Generalized Inverse Gaussian distribution. Sampling of ρ and α based on their individual posteriors conditioned on each other is avoided, since this results in a slow mixing of the Markov chain due to a high correlation between samples from the two conditionals. To overcome this issue, the conditional posterior of (ρ, α) is split up into the conditional of ρ given α, which results in a gamma distribution,

p(ρ | α, ·) ∝ Gamma(ρ | α(s + Σ_g p̃_g) + 1,  Σ_g p̃_g λ_g²/2 + q),    (8)
and the marginal of α is derived based on the work in [11]. This marginal results in a non-standard distribution, and sampling is done via a discretized version of this distribution. Finally, the conditional posterior of η would depend on the specification of the likelihood (as shown in eq. (4)) and would be the only conditional that needs to be derived depending on the chosen link function. Hence, all other components remain unchanged across different likelihood functions. Special Cases. For the case α = 1, this results in the Multivariate Laplace prior (which corresponds to a Bayesian version of the standard Group-Lasso). For the case of α = 1 and p_g = 1 ∀g, this results in the Bayesian Lasso as derived in [3].
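As an illustration of the two non-standard sampling steps described above, the sketch below shows one possible form of the ρ-conditional of eq. (8) and of the discretized sampling of α; the shape/rate convention and the grid approach are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sample_rho(alpha, lambda2, p_tilde, s, q, rng):
    """Draw rho from its Gamma conditional, eq. (8); shape/rate convention assumed."""
    shape = alpha * (s + np.sum(p_tilde)) + 1.0
    rate = np.sum(p_tilde * lambda2) / 2.0 + q
    return rng.gamma(shape, scale=1.0 / rate)

def sample_alpha_discretized(log_marginal, grid, rng):
    """Sample alpha from its non-standard marginal by evaluating it on a grid
    (the 'discretized version' mentioned in the text)."""
    logp = np.array([log_marginal(a) for a in grid])
    p = np.exp(logp - logp.max())
    return rng.choice(grid, p=p / p.sum())
```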
3 Application to Poisson and Binomial Models

Based on the above description, the sparse Bayesian version introduced in the last section is applicable to many different likelihoods. We begin with a simple Gaussian model to illustrate some of the basic aspects of modeling. We will then describe two frequently used models (Poisson and Binomial). Gaussian model with the identity link function. Consider the linear regression case:

η_i = x_i^t β,    (9)
where the link function is the identity function, i.e. μ_i = η_i, and μ_i is the mean of a Gaussian distribution. When the x_i's are categorical variables, a standard dummy coding procedure is applied to the variables (see standard textbooks like [12]). This results in representing each categorical variable as a group of dummy-coded variables. This transformation of the data leads to the inference of sparsity in grouped dummy variables. Apart from single variables, higher order interactions of variables are also added (such as pairwise products of variables or products of three variables). This extension adds further strength to the model, making it possible to find significant higher order interaction terms. If C_1, ..., C_d are d categorical random variables or factors, then the following representation shows the extension of the input matrix to higher order interactions:

X = [ X_{C_1}, ..., X_{C_d},  X_{C_1:C_2}, ..., X_{C_{d−1}:C_d},  ...,  X_{C_1:···:C_{Q+1}}, ..., X_{C_{d−Q}:···:C_d} ],    (10)

where the first block contains the main effects, the second block the 1st order interactions, and the last block the highest order interactions. On similar lines, we now describe two commonly used models: Poisson models for analyzing count data in contingency tables. Denote by y = (y_1, ..., y_n) the observed counts in a contingency table with n cells. The standard approach to modeling count data is Poisson regression, which involves a log-linear model with independent terms. For i = 1, ..., n:

y_i | μ_i ∼ Poisson(μ_i) = (μ_i^{y_i} e^{−μ_i}) / y_i!,    (11)
with the link function given by μ_i = e^{η_i}. The corresponding conditional posterior of η_i is difficult to sample from since it is not of a recognized form. However, since the above conditional posterior is log-concave, it is possible to use "black-box" sampling methods like adaptive rejection sampling. Alternatively, we advocate the use of a Laplace approximation similar to that in [13], which in practice gives almost indistinguishable results while speeding up the computations considerably. Binomial models for two-class classification. In the case of a two-class classification problem, the likelihood is defined in the form of a Binomial distribution with a single trial. There are two standard link functions which are used with the Binomial distribution: the logit and probit functions (for details see [9]). The probit link function has been used for the experiments described in Section 5. Sampling from the posterior conditional for η can again be resolved using the techniques mentioned in the Poisson case.
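Returning to the grouped design of eq. (10), the following sketch illustrates how a dummy-coded design matrix with higher-order interaction groups could be assembled; it uses plain one-hot coding instead of the polynomial contrast codes employed later, and all names are illustrative.

```python
import numpy as np
from itertools import combinations

def dummy_code(column, levels):
    """One-hot dummy coding of a single categorical column (illustrative;
    the experiments use polynomial contrast codes instead)."""
    return np.stack([(column == lv).astype(float) for lv in levels], axis=1)

def expand_interactions(columns, levels, max_interaction_order=2):
    """Build grouped design blocks as in eq. (10): main effects, then products
    of the dummy blocks for every interaction up to the given order."""
    blocks, names = [], []
    coded = [dummy_code(c, levels) for c in columns]
    # k = number of variables in a term: k=1 main effects, k=2 pairwise, ...
    for k in range(1, max_interaction_order + 2):
        for idx in combinations(range(len(columns)), k):
            block = coded[idx[0]]
            for j in idx[1:]:
                # all pairwise products of the two groups' dummy columns
                block = np.einsum('ni,nj->nij', block, coded[j]).reshape(len(block), -1)
            blocks.append(block)
            names.append(':'.join(f'C{i + 1}' for i in idx))
    return np.hstack(blocks), names

# e.g. two 3-level factors, expanded up to their pairwise interaction
X, groups = expand_interactions([np.array([0, 1, 2, 0]), np.array([2, 2, 1, 0])],
                                levels=[0, 1, 2], max_interaction_order=1)
```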
4 Simulated Example

In order to illustrate the performance of the described sparsity-inducing distributions, experiments were carried out on simulated data based on the Poisson model. The experiments show the need to enforce sparsity greater than what is imposed by the standard Group-Lasso. The data was simulated by assuming 6 categorical variables with 3 levels each (main effects) and all higher order interaction terms up to 2nd order (a total of 42 groups representing the interaction terms, and 729 combinations of levels). The orthogonal (729 × 233) design matrix X was generated with polynomial contrast codes. Then three factors were chosen for generating the counts, namely one main effect (X2), a first order interaction (X5:X6) and a second order interaction (X3:X4:X5). For these factors, positive values β = 3 were taken, with all other β values fixed to zero. The counts were then generated using eq. (11) and eq. (4) with σ² = 0.1. The experiment was performed for two cases: the Group-Lasso (by fixing α = 1) and the general model with a prior on α as described before. Hyperparameters were kept consistent for both models being compared. Figure 2 shows the results in both cases, clearly indicating the need for a model which can enforce varying levels of sparsity. Gibbs sampling was executed for 50000 iterations and the burn-in was taken to be 5000 iterations. The left panel of Figure 3 shows an example trace plot of the 2nd-order interaction 3:4:5, indicating that the Markov chain converges almost immediately, an observation which is corroborated by a run-length control diagnosis according to [14] indicating that the necessary burn-in period is probably 100 samples. The second experiment compares the overall performance of this model vs. a fixed level of sparsity (α = 1, which corresponds to the standard Group-Lasso). The Poisson model was again used to simulate data assuming sparsity in the regression coefficients with 6 categorical variables with 3 levels each (same as in the previous experiment). The inference was carried out for 450 trials, where a random dataset was generated in each trial and applied to both models. Both models were then scored on each of the
Fig. 2. Simulated data: 6 categorical variables with 3 levels each, and three “truly” nonzero interaction terms: 2, 5:6, 3:4:5. The size of the circles indicates the estimated significance of the main effects: 90% of the posterior samples for variable 2 have a positive sign. Correspondingly, the linewidth of the interactions (blue lines: 1st-order, reddish triangles: 2nd-order) indicates their significance. Left panel: Interactions identified from the posterior distributions for the case of Group-Lasso (α = 1). Right panel: Interactions identified from the posterior distributions for the general case(α < 1).
[Figure 3: left panel — trace of X3:X4:X5 over 50000 iterations; right panel — F-score of the general model vs. F-score of the Group-Lasso.]
Fig. 3. Left panel: Example traceplot and moving average for the 2nd-order interaction 3:4:5. Right panel: Shows comparison of the F-scores of Group-Lasso Vs the general model over 450 trials. Each point shows the scores of both models on a particular trial. For all points lying above the diagonal, the general model scored higher than the Group-Lasso and vice versa for below the diagonal.
450 trials based on how well they did with respect to the ground truth. The score was based on the standard F-score, which is described as the harmonic mean of the precision (TP/(TP + FP)) and the recall (TP/(TP + FN)). Figure 3 displays the final result, which shows the consistent advantage of the model over the Group-Lasso.
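The scoring of a recovered set of interaction terms against the ground truth can be written compactly; a small illustrative helper, not taken from the paper:

```python
def f_score(selected, true_support):
    """F-score of a recovered support set against the ground truth,
    as used to compare the two models over the 450 trials."""
    selected, true_support = set(selected), set(true_support)
    tp = len(selected & true_support)
    fp = len(selected - true_support)
    fn = len(true_support - selected)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# e.g. ground truth {"X2", "X5:X6", "X3:X4:X5"} vs. a recovered set
print(f_score({"X2", "X5:X6", "X1"}, {"X2", "X5:X6", "X3:X4:X5"}))
```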
5 Application to MEMset Donor Dataset

With respect to analyzing DNA sequences to find genes, it is very important to be able to recognize splice sites. Splice sites are regions in the DNA which separate coding (exons) and non-coding (introns) regions. In particular, the 5' end (starting point) of an intron is called the donor splice site and is analyzed in this paper. For this purpose, the MEMset Donor dataset (http://genes.mit.edu/burgelab/maxent/ssdata/) is used, which consists of 8415 true and 179438 false human donor sites. For the analysis done in this paper, the data was balanced (see [7]) so that both datasets have equal sizes (8415). Each instance of data consists of a sequence of DNA within a window of the splice site, which consists of the last 3 positions of the exon (-3,-2,-1) and the first 4 positions
Fig. 4. Visualization of the MEMset data. Left panel: Distribution of A, C, T, G in the 7 window positions for the dataset with TRUE splice sites. Right panel: Distribution of A, C, T, G in the 7 window positions for the dataset with FALSE splice sites.
Fig. 5. Results of the interaction patterns for true and false splice sites. The thickness of the lines indicates the significance of the interactions. Left panel: Interaction patterns of the TRUE splice sites. Right panel: Interaction patterns of the FALSE splice sites.
Fig. 6. Results of the interaction patterns for the classification between true and false splice sites. The thickness of the lines indicates the significance of the interactions.
(2, 3, 4, 5) of the intron (a string of length 7). Hence these strings of length 7 are made up of the 4 characters A, C, T, G; see [15] for details. Figure 4 shows the distribution of A, C, T, G in all positions in both the true and the false splice site datasets. Apart from the main effects, the data is extended further to include 1st order (pairwise) and 2nd order (triplet) interactions. Each interaction term is then coded with dummy variables using a polynomial contrast code, giving rise to a 16384 × 1156 design matrix. A Poisson model applied to contingency tables was used to analyze the interactions in both true and false splice sites individually. Figure 5 shows the difference in the interaction patterns of true and false splice sites. In particular, we observe a very strong 2nd order intra-region interaction between window positions (2:3:4) in true splice sites, which is completely missing in the case of false splice sites. Interestingly, we also observe (in the true case) a strong 1st order inter-region interaction between (-1:2), which are the last position of the exon and the first position of the intron, respectively, and which conforms to what one would hope to expect for such a sequence-related pattern. This particular observation further validates the assertion made in [8], which does not find the inter-region interaction as important, but shows that inter-region interactions may have a role in solutions with the same (or close) likelihood. A second experiment was performed with the same data in order to infer the significant interaction patterns in the context of classifying a given sequence as a true or false splice site. A Binomial model with the probit link function was used for this purpose. Figure 6 shows the significant interaction patterns which help in differentiating between true and false splice sites. Apart from observing some
patterns similar to the first experiment, we also observe a strong 2nd order inter-region interaction between (-1:3:4), which again emphasizes the importance of long range interactions in this classification task. This observation, in particular, shows the ability of the model to address the issue raised in [8], regarding the non-uniqueness or incompleteness of ML solutions to the Group-Lasso functional. The prediction performance on a test set (correlation with the true labels ρ = 0.66) was practically identical to the results reported in [8] and in the original paper [15], which has been viewed as among the best methods for short motif modeling.
6 Conclusion

This paper has described a novel framework to deal with sparsity in grouped variables applied to generalized linear models. This has been achieved via a full Bayesian treatment of a generalized class of distributions inducing varying levels of sparsity for grouped variables. The proposed solution extends the existing sparsity solutions on single variables to grouped variables and makes them fully Bayesian by introducing suitable hyperpriors. Due to its general nature, one of the key advantages of this framework is its applicability to a variety of standard linear models without much change to the overall inference process. Apart from the wide applicability, the Bayesian approach of the solution allows the analysis of the variance estimates of the regression coefficients, which overcomes the issue with existing Group-Lasso estimators concerning the completeness of solutions. Also, the simulation results, when compared to the standard Group-Lasso, have shown that the proposed model is more effective in dealing with sparse estimates. In terms of practical implementation, the proposed solution is coupled with an efficient MCMC algorithm (using Gibbs sampling) which allows the analysis of higher-order interactions as well. Its usage has been demonstrated using the MEMset donor dataset, which not only parallels the results obtained previously with this dataset but also helps in confirming some other significant interaction patterns from the given sequence data which were not confirmed earlier due to the unclear significance of coefficients found in ML solutions and due to the potential non-uniqueness of these solutions.
Acknowledgments The work was partly financed with a grant of the Swiss SystemsX.ch Initiative to the project “LiverX” of the Competence Center for Systems Physiology and Metabolic Diseases. The LiverX project was evaluated by the Swiss National Science Foundation.
References 1. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B 58(1), 267–288 (1996) 2. Figueiredo, M., Jain, A.: Bayesian learning of sparse classifiers. In: Proc. IEEE Comp. Soc. Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 35–41 (2001)
3. Park, T., Casella, G.: The Bayesian Lasso. Journal of the American Statistical Association 103, 681–686 (2008) 4. Meinshausen, N.: Relaxed Lasso. Computational Statistics & Data Analysis 52(1), 374–393 (2007) 5. Caron, F., Doucet, A.: Sparse Bayesian nonparametric regression. In: ICML 2008, pp. 88–95. ACM Press, New York (2008) 6. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 49–67 (2006) 7. Meier, L., van de Geer, S., Bühlmann, P.: The Group Lasso for logistic regression. J. Roy. Stat. Soc. B 70(1), 53–71 (2008) 8. Roth, V., Fischer, B.: The Group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: ICML 2008, pp. 848–855. ACM, New York (2008) 9. McCullagh, P., Nelder, J.A.: Generalized Linear Models. Chapman and Hall, Boca Raton (1983) 10. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis. Chapman and Hall, Boca Raton (1995) 11. Fink, D.: A compendium of conjugate priors. In: Progress report: Extension and enhancement of methods for setting data quality objectives. Technical report (1995) 12. Everitt, B.S.: The Analysis of Contingency Tables. Chapman and Hall, Boca Raton (1997) 13. Green, P.E., Park, T.: Bayesian methods for contingency tables using Gibbs sampling. Statistical Papers 45(1), 33–50 (2004) 14. Raftery, A.E., Lewis, S.M.: One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science 7, 493–497 (1992) 15. Yeo, G., Burge, C.B.: Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comp. Biology 11, 377–394 (2004) 16. Seshadri, V.: The Inverse Gaussian Distribution: A Case Study in Exponential Families. Clarendon Press, Oxford (1993)
A Appendix 1

Based on eq. (5) and eq. (6), we derive the marginal pdf of β_g by using the pdf of a generalized inverse Gaussian distribution (see [16]):

p(β_g | σ) = ∫_0^∞ N(β_g | 0, σ² λ_g² I) p(λ_g²) dλ_g²
           = ( (σ²)^{−p_g/2} / ((2π)^{p_g/2} Γ(p̃_g α)) ) (p̃_g ρ / 2)^{p̃_g α} ∫_0^∞ (λ_g²)^{p̃_g α − p_g/2 − 1} exp( −½ [ b_g/λ_g² + λ_g² p̃_g ρ ] ) dλ_g²
           = ( 2^{1 − p̃_g α − p_g/2} (σ²)^{−p_g/2} / (π^{p_g/2} Γ(p̃_g α)) ) b_g^{(p̃_g α − p_g/2)/2} (p̃_g ρ)^{(p̃_g α + p_g/2)/2} K_{p̃_g α − p_g/2}( √(p̃_g ρ b_g) ),    (12)

where b_g = ||β_g||² / σ² and K_ν(·) is the modified Bessel function of the second kind.
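The reconstructed closed form of eq. (12) can be evaluated numerically with the modified Bessel function; the following sketch follows that reconstruction and should be read as illustrative rather than as the authors' code.

```python
import numpy as np
from scipy.special import gammaln, kv

def log_marginal_beta_group(beta_g, sigma2, alpha, rho):
    """Log of the marginal prior density of one coefficient group, eq. (12),
    using the modified Bessel function of the second kind K_nu (scipy's kv)."""
    p_g = len(beta_g)
    p_t = (p_g + 1) / 2.0                       # \tilde{p}_g
    b_g = np.dot(beta_g, beta_g) / sigma2
    nu = p_t * alpha - p_g / 2.0
    log_const = ((1 - p_t * alpha - p_g / 2.0) * np.log(2.0)
                 - (p_g / 2.0) * np.log(np.pi * sigma2)
                 - gammaln(p_t * alpha))
    return (log_const
            + 0.5 * nu * np.log(b_g)
            + 0.5 * (p_t * alpha + p_g / 2.0) * np.log(p_t * rho)
            + np.log(kv(nu, np.sqrt(p_t * rho * b_g))))
```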
Learning with Few Examples by Transferring Feature Relevance

Erik Rodner and Joachim Denzler

Chair for Computer Vision, Friedrich Schiller University of Jena
{Erik.Rodner,Joachim.Denzler}@uni-jena.de
http://www.inf-cv.uni-jena.de
Abstract. The human ability to learn difficult object categories from just a few views is often explained by an extensive use of knowledge from related classes. In this work we study the use of feature relevance as prior information from similar binary classification tasks. An approach is presented which is capable of using this information to increase the recognition performance for learning with few examples on a new binary classification task. Feature relevance probabilities are estimated by a randomized decision forest of a related task and used as a prior distribution in the construction of a new forest. Experiments in an image categorization scenario show a significant performance gain in the case of few training examples.
1 Introduction
What is the minimum number of training examples needed to build robust classification systems? For a human, just a single view of one object instance is sufficient in most cases; a machine with current state-of-the-art methods often needs hundreds or thousands of samples. One possible reason for this gap could be the inability of current systems to determine relevant features of a large pool of generic features from few examples. Another reason is suggested by psychological studies [1] which argue that a key concept of the human ability to recognize from few examples is interclass or knowledge transfer. It states that prior knowledge from previously learned categories is the most important additional information source when learning object models from weak representations or few examples [2]. The goal of this paper is to improve the recognition performance of an object recognition system for the case of few training examples using the idea of both explanations. We argue that transferring the relevance of features from related tasks can be very helpful to increase the generalization performance. Feature relevance can be roughly defined as the usefulness of a feature value to predict the class of an object instance and is mostly defined in terms of mutual information [3]. To give an illustrative example of our transfer idea, consider the recognition of a new animal class. With the aid of prior knowledge from related animal classes,
Fig. 1. Illustration of the general principle of our approach: Feature relevance is estimated from a support task and used to regularize the feature selection in the training process (randomized decision forest) of a target task with few training examples. The probabilities and feature visualizations are directly obtained from our image categorization task.
such as the importance of typical body parts like hooves, legs and head, the separation from other categories becomes much easier. We concentrate on knowledge transfer between two binary classification tasks. A related or support task with a relatively large number of training examples and a target task with few training examples are given. We assume that support task and target task share a common set of relevant features. Therefore, probabilities of feature relevance for the support task are estimated in a preliminary step. This estimation can use a large number of training examples and thus yields more accurate results than an estimation using just a few training examples of the target task. The estimated distribution of feature relevance can then be utilized in the construction process of a randomized decision forest [4]. In contrast to other work [5] which uses a uniform feature distribution, the prior information increases the probability of a relevant feature to be selected for the target task. Fig. 1 provides an overview of this idea. The remainder of the paper is organized as follows. First of all, we will briefly review related work on object recognition using prior knowledge from related tasks. In Sect. 3 our method is described by first outlining the relationship to Bayesian model averaging, which shows that our method corresponds to the definition of a prior distribution on hypotheses. An estimation technique for feature relevance using a randomized decision forest follows in Sect. 4. Experiments in Sect. 5 show the benefits of our approach in an image categorization task. A summary of our findings and a discussion of future research directions conclude the paper.
2 Related Work
Previous work using the interclass transfer varies significantly in the type of information transferred from related object classes. An intuitive assumption is
that similar classes share common geometric intraclass transformations. The Congealing approach of Miller et al. [6] therefore tries to estimate those transformations and use them to increase the amount of training data of a target class. For example, a single training image of a letter in a text recognition setting can be transformed using typical rotations estimated from other letters. Another idea is to assume shared structures in feature space and estimate a metric or transformation from support classes [7,8]. This leads to methods similar to linear discriminant analysis. Alternatively, Fei-Fei et al. [9] develop a generative framework with maximum-a-posteriori estimation (MAP) of model parameters using a prior distribution estimated from support classes. The approach of Rodner and Denzler [5] utilizes MAP estimation in a similar sense and re-estimates leaf probabilities of decision trees. As opposed to the approach presented in this paper, which builds new decision trees from scratch, their approach is based on a fixed pre-built decision tree and a fixed set of features not weighted according to their relevance. Shared relevant features and class boundaries are exploited in the work of Torralba et al. [10]. They develop a boosting technique that jointly learns several binary classification tasks similar to the combined boosting idea of [11]. Lee et al. [12] transfer feature relevance as a prior distribution on a weight vector in a generalized linear model. Our work is similar to their underlying idea of transferring feature relevance. In contrast, prior knowledge in our work is defined using the probability of a feature being relevant instead of a prior distribution on a specific model parameter. We will show that our approach additionally allows the use of a state-of-the-art classifier in the form of randomized decision trees [4].
3 Transfer of Feature Relevance (TFR)
Given few training examples, a learner tends to overfit and a classification decision is often based on irrelevant or approximately irrelevant features [3]. The goal of our approach is to reduce this overfitting by incorporating a prior distribution β on relevant features. To describe this more precisely, let us first define some simple notations used in the remainder of this paper. Let T^S = {(x_i, y_i)}_{i=1}^{n} be a set of training examples of a given supporting binary classification task with y_i ∈ {0, 1} and object instances x_i ∈ I (such as an image in an image categorization scenario or an arbitrary multi-dimensional observation, I ⊆ R^m). Furthermore, let F be an application-specific set of features f : I → R that can be calculated on a given object instance. A feature f is said to be relevant for a specific task iff ∃(x, y) ∈ I × {0, 1} : p(f(x), y | x) ≠ p(f(x) | x) p(y | x), and it thus retains information about y given x. Our approach to transfer learning relies on the assumption that support task and target task share a set R ⊆ F of relevant features. We therefore transfer
the probability β_i of a feature f_i being relevant using the training examples T^S of the related task:

p(R̃ | F) = p(R̃ | F, T^S) = ∏_{f_i ∈ R̃} p(f_i ∈ R | F, T^S),    (1)

where each factor p(f_i ∈ R | F, T^S) is denoted by β_i. The last reformulation in (1) assumes that the relevance of features is independently distributed. While we delay the estimation of β to Sect. 4, the following section shows that β can be used as a prior distribution on the set of possible hypotheses or models for the target task. This also highlights that our prior knowledge can be easily integrated into the concept of randomized classifier ensembles and especially the randomized decision tree approach of [4].
3.1 Incorporation of TFR into Randomized Classifier Ensembles
We will now describe the randomized forest approach of Geurts et al. [4] in a theoretical framework related to Bayesian model averaging. This allows us to motivate the transfer of feature relevance as a Bayesian approach of defining a prior distribution on models or hypotheses. The final goal is to estimate the probability of the event Ω that a previously unseen object instance x belongs to class y = 1, conditioned on the set of few training examples of the target task T^T and the set of all possible features F. As a classification model we will use an ensemble of base models h (in our case non-randomized single decision trees) in the following sense:

p(Ω | x, T^T, F) = ∫_h p(Ω | x, h) p(h | T^T, F) dh.    (2)
The model h is often assumed to be deterministic for a given training and feature set, but there are multiple ways to sample from those sets and thus generate multiple models. One idea is the concept of bagging [13], which uses random subsets of the training data. As proposed in [13] and [4], another possibility is to use random subsets of all features. This approach can be regarded as Bayesian model averaging and reflects our uncertainty about the set R of relevant features:

p(h | T^T, F) = Σ_{R̃ ⊆ F} p(h | T^T, R̃) p(R̃ | F).    (3)
The distribution p(R̃ | F) describes the probability that R̃ is the set of relevant features. A base model h is deterministic given a training set and a set of relevant features: p(h | T^T, R̃) = δ(h − h(T^T, R̃)). Combining all equations yields the final classification model:

p(Ω | x, T^T, F) = Σ_{R̃ ⊆ F} p(Ω | x, h(T^T, R̃)) p(R̃ | F).    (4)
This sum cannot be computed efficiently for large feature spaces; therefore, we can approximate it by simple Monte Carlo estimation:

p(Ω | x, T^T, F) = (1/M) Σ_{i=1}^{M} p(Ω | x, h(T^T, R_i)).    (5)
Feature subsets R_i are sampled from p(R̃ | F). This distribution is often assumed to be uniform, and samples of only a fixed number of features |R̃| = m are used [4]. This assumes that we have a prior estimate of |R| or that the integral of (4) can nevertheless be approximated by a subspace of the power set of F. We now apply the idea of our transfer learning technique that was described at the beginning of Sect. 3. Instead of using a uniform distribution p(R̃ | F), one can use the probabilities β_i = p(f_i ∈ R | F, T^S) obtained from the related class. This prior information reduces the uncertainty of the learner about the optimal set of relevant features. In the following section we will briefly outline the special characteristics of randomized decision trees and the connection to the previous description of general randomized ensembles.
3.2 Randomized Decision Trees
As we use randomized decision trees [4], our base models are decision tree classifiers. Those classifiers are binary trees with two types of nodes. Each inner node represents a weak classifier, a feature f and a threshold, which defines a hyperplane in feature space and thus determines the traversal of a new example in the tree. The traversal ends in a leaf node n. Each of those nodes is associated with a posterior distribution p(Ω | n), which is an estimation of the probability of the object class given that this specific leaf is reached. Building a tree is done by iteratively splitting the training set with the most informative weak classifier. The selection of a weak classifier is done by choosing the weak classifier with the highest gain in information from a random fraction of features R and possible thresholds. Note that, in contrast to the theoretical explanation in Sect. 3.1, the selection of a random subset of features is performed in each node rather than a single time for the whole decision tree. This fact is also highlighted by the illustration in Fig. 1. Relevant features are sampled from the distribution β in each split node during the training process. Using just a few training examples leads to decision trees of small depth. Due to the model averaging technique described in the previous section and by using the idea of bagging [13], they still allow robust classifiers to be built.
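A minimal sketch of how the transferred relevance probabilities β could enter the per-node split selection described above; the random threshold sampling follows the spirit of [4], and the function names and candidate count are illustrative assumptions.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary label vector."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def choose_split(X, y, beta, n_candidates=10, rng=None):
    """Pick the (feature, threshold) pair with the highest information gain
    among candidate features drawn from the relevance distribution beta."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = rng.choice(len(beta), size=n_candidates, replace=False, p=beta)
    best, best_gain = None, -np.inf
    for f in candidates:
        thr = rng.uniform(X[:, f].min(), X[:, f].max())   # random threshold as in [4]
        left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best, best_gain = (f, thr), gain
    return best
```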
4 Estimating Feature Relevance Using Randomized Decision Forests
As pointed out by Rogers and Gunn [14], the use of ensembles of decision trees allows robust estimates of feature relevance that also incorporate dependences between features. Our technique is similar to their method, which uses a modified average mutual information between a feature and the class variable y in each inner node. The first step to estimate the underlying feature relevance of the supporting task is the training of a randomized decision forest with all training examples. Afterwards we count the number of times c_i a feature f_i is used in a split node. A
feature with a high occurrence c_i is likely to be relevant for this task. We did not directly use the mutual information associated with a split node because our goal is to estimate a well-defined distribution rather than a relevance ranking. To obtain the final vector β of feature relevance, we use maximum-a-posteriori estimation:

β^{MAP} = argmax_β p(T^S | β) p(β | α) = argmax_β (∏_i β_i^{c_i}) p(β | α),    (6)

with α being the hyperparameter of a Dirichlet prior p(β) and ∀i : α_i = α. Without this prior distribution, the optimal β is the normalized vector c of all counts. The prior distribution can be thought of as a smoothing term that prevents zero probability of relevance for some features. This is theoretically important if there is a feature f_i that is completely irrelevant for the supporting task, but has discriminative power in the target class. In Sect. 5.2 we evaluate the influence of the parameter α. Surprisingly, it turns out that in our experimental setting a flat prior distribution (α = 1) is sufficient.
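For α ≥ 1 the MAP estimate of eq. (6) has a simple closed form in terms of the split counts; a small illustrative helper (names are ours, not the paper's):

```python
import numpy as np
from collections import Counter

def relevance_from_forest(split_features, n_features, alpha=1.0):
    """MAP estimate of the relevance vector beta from the features used in the
    split nodes of the support-task forest (eq. (6)).  With a symmetric
    Dirichlet prior the mode is (c_i + alpha - 1) / sum_j (c_j + alpha - 1);
    for alpha = 1 this is simply the normalized count vector, and unused
    features keep zero relevance (the smoothing only acts for alpha > 1)."""
    counts = Counter(split_features)
    c = np.array([counts.get(i, 0) for i in range(n_features)], dtype=float)
    smoothed = c + (alpha - 1.0)
    return smoothed / smoothed.sum()

# e.g. features 3 and 7 dominate the splits of the support-task forest
beta = relevance_from_forest([3, 3, 7, 3, 7, 1], n_features=10, alpha=1.0)
```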
5 Experiments
We experimentally evaluated our approach to illustrate the benefits of transferring feature relevance from related tasks. In the following we empirically validate the following hypotheses:

1. Transferring feature relevance (TFR) from related tasks helps to improve recognition performance in the case of few examples.
2. The benefit is most prevalent if the supporting task is visually similar to the target task.
3. A smoothing of feature relevance is not necessary (α = 1).

We use image data from the Caltech 101 dataset [9] to show the applicability to image categorization tasks. Three classes were used to conduct binary classification tasks: Okapi, Gerenuk and Chair vs. the Caltech background class with 200 training images (cf. Fig. 2).
Fig. 2. Example images of all three classes from the Caltech 101 database [9] which are used for evaluation: Okapi, Gerenuk and Chair
To use the transfer of feature relevance, supporting task and target task should use a common feature representation. In our experiments we used a bag-of-features representation as described in Sect. 5.1. Therefore a bag-of-features codebook of the supporting class is used to calculate features of the target class. Measuring the performance of the binary classification tasks is done by the area under the ROC curve (AUC). Unless specified otherwise, we obtain a statistically meaningful estimate of the performance by calculating the average of 10 runs with a random subset of the training data. Due to the randomized behavior of decision trees, the classifier is trained and tested 50 times for each of those subsets and the performance values are also averaged. This results in 500 runs in total, which produce the final AUC value for a specific setting.
5.1 Feature Extraction
A standard approach to image categorization is the bag-of-features idea. A quantization of local features, which is often called a codebook, is computed at the time of training. For each image a histogram can then be calculated which counts, for each codebook entry, the number of matching local features. The method of [15], which utilizes a randomized decision forest as a clustering mechanism, is used to construct the codebook. This codebook generation procedure showed superior results compared to standard k-means in all experiments. It also makes it possible to create large codebooks in a few seconds on a standard PC. Local features are extracted for each image by dense sampling of feature points with a horizontal/vertical pixel spacing of 10 pixels. Descriptors are calculated using Opponent-SIFT [16].
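A simplified sketch of the bag-of-features image representation described above; it substitutes plain nearest-codeword assignment for the clustering-forest codebook of [15], so it is illustrative only.

```python
import numpy as np

def bag_of_features_histogram(descriptors, codebook):
    """Histogram over codebook entries for one image: each local descriptor
    votes for its nearest codeword (stand-in for the clustering forest)."""
    # squared Euclidean distances between descriptors and codewords
    d2 = ((descriptors ** 2).sum(1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + (codebook ** 2).sum(1)[None, :])
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                   # normalized image representation

# e.g. 300 densely sampled Opponent-SIFT descriptors (dimension assumed 384)
rng = np.random.default_rng(0)
hist = bag_of_features_histogram(rng.normal(size=(300, 384)),
                                 rng.normal(size=(512, 384)))
```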
5.2 Dirichlet Parameter
In a first experiment we evaluate the influence of the generic prior distribution in equation (6). This data-independent prior distribution serves as a smoothing term for the estimation of relevant features. We build a randomized decision forest using a fixed set of 30 examples of the Okapi class and 200 examples of the background class. From this randomized decision forest we estimate feature relevance as described in Sect. 4 with a varying Dirichlet parameter α. These estimates are afterwards used to classify a set of one training example from the Gerenuk class and the same background images used before. Average performance values and standard deviations of 50 runs are illustrated in Fig. 3(a). It can be seen that with an increasing value of α the performance drops and the optimal value remains at α = 1. For this reason we fix α to this value, which corresponds to maximum likelihood estimation of β. This highlights that the complete removal of features which are irrelevant for the supporting task (p(f_i ∈ R) = 0) is beneficial. This may not be the case in situations with a smaller feature set and features that are completely irrelevant for the supporting task but essential for a target task.
[Figure 3: (a) area under the ROC curve vs. the Dirichlet parameter α; (b) area under the ROC curve and the performance benefit of TFR vs. the number of trees, for TFR, codebook only, and their difference.]
Fig. 3. (a) Evaluation of the hyper-parameter α of the Dirichlet distribution, which is used to smooth the probabilities of feature relevance. (b) Influence of the number of decision trees in the forest.
5.3 Influence of the Ensemble Size
We analyzed the influence of the number of base learners in the ensemble. The same experiment as in Sect. 5.2 is performed with a varying size of the ensemble. The results are illustrated in Fig. 3(b). The performance obviously increases with an increasing number of base learners in the ensemble, which was also proven theoretically and empirically in [13]. Another interesting effect is that the performance benefit of transferring feature relevance is most prevalent when a small number of base learners is used. For all other experiments we utilized randomized decision forests with 200 trees.
5.4 TFR Improves Learning with Few Examples
We analyzed the performance of the binary classification task Okapi class vs. background class with different types of knowledge transfer: transfer of feature relevance using the support classes Gerenuk or Chair; using only the codebook of the supporting class; and no knowledge transfer from other tasks at all. Note that no knowledge transfer means that a codebook is generated only from training examples of the target task. Fig. 4(a) illustrates the resulting recognition rates. Additionally, Fig. 4(b) shows the same results for the class Gerenuk with prior information also learned from the Okapi task. At first it can be seen that transferring feature relevance from related tasks really improves the recognition performance compared to a method which uses no knowledge transfer at all. This performance benefit is most prevalent with a visually similar class such as the related animal class. Using prior knowledge from the chair class is sometimes also beneficial. It is most likely that this is due to the learning of natural generic prior knowledge, which has also been shown in other work to improve recognition performance [9].
[Figure 4: area under the ROC curve vs. the number of training examples (1, 2, 5), comparing transfer from the related animal class, codebook-only transfer, transfer from Chair, and no knowledge transfer.]
Fig. 4. Experiments with the target task 4(a) “Okapi” and 4(b) “Gerenuk” vs. background and several types of support tasks with a varying number of training examples of the target task
Transferring only the codebook from the supporting task also increases the performance. The difference between TFR and this method in Fig. 4(b) for one training example might seem to be minor at first glance and insignificant due to a standard deviation of about 1% in the previous experiment (Fig. 3(a)). But using a paired t-test and the corresponding average results of all 10 training and test runs, we are able to show significance with a level of p < 0.003.
6 Conclusion
We presented a classification approach that transfers knowledge from related classification tasks to improve the recognition performance on a task with few training examples. The key concept of our method is the transfer of feature relevance. We use probabilities of feature relevance, which are estimated using a randomized decision forest of a related task. Those probabilities form a distribution that is used to select a random subset of features during the building process of a randomized decision forest for the target class with few examples. The relationship of our method to Bayesian model averaging was outlined. It shows that the technique indirectly uses a prior distribution of hypotheses to regularize the training of a target classification task. Experiments on an image categorization task show a significant performance gain. This performance benefit is most striking if the supporting binary classification task is visually related to the task with few training examples.
7 Further Work
The presented method can be applied to arbitrary machine learning problems and is not restricted to generic image categorization. For this reason we plan to apply the classification technique to other application areas, such as object localization. An interesting open question is whether the general idea to transfer
feature relevance can be applied to other classifier techniques such as support vector machines. Additionally our method of feature relevance estimation should be compared to other methods of feature selection.
References 1. Jones, S.S., Smith, L.B.: The place of perception in children’s concepts. Cognitive Development 8, 113–139 (1993) 2. Fei-Fei, L.: Knowledge transfer in learning to recognize visual objects classes. In: Proceedings of the International Conference on Development and Learning, ICDL (2006) 3. Guyon, I., Gunn, S., Lotfi, A., Zadeh, M.N.: Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2006) 4. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Maching Learning 63(1), 3–42 (2006) 5. Rodner, E., Denzler, J.: Learning with few examples using a constrained gaussian prior on randomized trees. In: Proceedings of the Vision, Modelling, and Visualization Workshop, Konstanz, October 2008, pp. 159–168 (2008) 6. Miller, E.G., Matsakis, N.E., Viola, P.A.: Learning from one example through shared densities on transforms. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 464–471 (2000) 7. Fink, M.: Object classification from a single example utilizing class relevance pseudo-metrics. In: Advances in Neural Information Processing Systems, vol. 17, pp. 449–456. MIT Press, Cambridge (2004) 8. Quattoni, A., Collins, M., Darrell, T.: Learning visual representations using images with captions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 9. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 594–611 (2006) 10. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5), 854–869 (2007) 11. Levi, K., Fink, M., Weiss, Y.: Learning from a small number of training examples by exploiting object categories. In: CVPRW 2004: Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop, vol. 6, pp. 96–104 (2004) 12. Lee, S.I., Chatalbashev, V., Vickrey, D., Koller, D.: Learning a meta-level prior for feature relevance from multiple related tasks. In: ICML 2007: Proceedings of the 24th International Conference on Machine Learning, Corvalis, Oregon, pp. 489–496 (2007) 13. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001) 14. Rogers, J., Gunn, S.R.: Identifying feature relevance using a random forest. In: Subspace, Latent Structure and Feature Selection, Statistical and Optimization, Perspectives Workshop, pp. 173–184 (2005) 15. Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: Advances in Neural Information Processing Systems, pp. 985–992 (2006) 16. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluation of color descriptors for object and scene recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
Simultaneous Estimation of Pose and Motion at Highly Dynamic Turn Maneuvers

Alexander Barth¹, Jan Siegemund², Uwe Franke¹, and Wolfgang Förstner²

¹ Daimler AG, Group Research and Advanced Engineering, Sindelfingen, Germany
² University of Bonn, Department of Photogrammetry, Institute of Geodesy and Geoinformation, Bonn, Germany
Abstract. The (Extended) Kalman filter has been established as a standard method for object tracking. While a constraining motion model stabilizes the tracking results given noisy measurements, it limits the ability to follow an object in non-modeled maneuvers. In the context of a stereo-vision based vehicle tracking approach, we propose and compare three different strategies to automatically adapt the dynamics of the filter to the dynamics of the object. These strategies include an IMM-based multi-filter setup, an extension of the motion model considering higher order terms, as well as the adaptive parametrization of the filter variances using an independent maximum likelihood estimator. For evaluation, various recorded real world trajectories and simulated maneuvers, including skidding, are used. The experimental results show significant improvements in the simultaneous estimation of pose and motion.
1 Introduction
Detecting and tracking other traffic participants accurately is an essential task of future intelligent vehicles. Recent driver assistance and safety systems, such as the Adaptive Cruise Control (ACC) system, focus on tracking the leading vehicle in scenarios with relatively low dynamics in the lateral direction. Future collision avoidance systems, however, must be able to also track the oncoming traffic and to deal with a wide range of driving maneuvers, including turn maneuvers. The complex dynamics of a vehicle are usually approximated by simpler motion models, e.g. constant longitudinal velocity and constant yaw rate (angular velocity) models [1] [2], or constant longitudinal acceleration and constant yaw rate models [3] [4]. These models are special cases of the well-known bicycle model [5] and restrict lateral movements to circular path motion since vehicles cannot move sideways. Higher order derivatives, such as the yaw acceleration, are modelled as zero-mean Gaussian noise. During turn maneuvers, however, the yaw acceleration becomes a significant issue. Vehicles quickly develop a yaw rate when turning left or right at an intersection. A single (Kalman) filter, parametrized to yield smooth tracking results for mainly longitudinal movements, is often too slow to follow in such situations (see Fig. 1).
Fig. 1. (a) The filter cannot follow a turning vehicle if parametrized for mainly longitudinal movements. (b) The same filter parametrized for turn maneuvers allows for accurate tracking. (c) Bird's eye view on the problem at hand.
On the other hand, making the filter more reactive in general increases the sensitivity to noise and outliers in the measurements. Multi-filter approaches, such as the Interacting Multiple Models (IMM) method proposed by Bar-Shalom [6], tackle this problem by running several filters with different motion models in parallel. Kaempchen et al. [3] successfully applied the IMM framework to track leading vehicles in Stop&Go scenarios. Alternative methods try to estimate the higher order terms, modelled as system noise, by an additional detector outside the filter. Chan et al. [7] introduced an input estimation scheme that computes the maximum likelihood (constant) acceleration of a moving target over a sliding window of observations, using least squares estimation techniques. The estimated acceleration is used to update the constant velocity model using the Kalman filter control vector mechanism. The drawback of this approach is that the control vector has a direct influence on the state estimate. Thus, errors in the input data directly affect the state estimate. In [4], a stereo-vision based approach for tracking vehicles by means of a rigid 3D point cloud with a single Extended Kalman Filter (EKF) has been proposed. This approach yields promising results both for oncoming and leading vehicles at a variety of scenarios with mainly longitudinal movements and relatively slow turn rates. However, in general, approaches utilizing a single filter configuration suffer from the fact that the filter parametrization strongly depends on the expected dynamics. In this contribution, we will extend the approach proposed in [4] and present three different solutions overcoming the problem of manual situation-dependent parameter tuning.
2 Object Model
An object is represented as a rigid 3D point cloud attached to a local object coordinate system. The Z-axis of this system corresponds to the moving direction, i.e. the longitudinal axis for vehicles. The X- and Y-axis represent the lateral and
height axis, respectively. The origin of the object coordinate system is defined at the center rear axis on the ground. In the same way, the ego coordinate system is defined for the ego vehicle. The following state vector is estimated:

x = [ eX_0, eZ_0, ψ, | v, ξ, v̇, | oX_1, oY_1, oZ_1, ..., oX_M, oY_M, oZ_M ]^T,    (1)

where the first block Ω = (eX_0, eZ_0, ψ) is the pose, the second block Φ = (v, ξ, v̇) the motion, and the third block Θ = (oX_1, ..., oZ_M) the shape,
and where the object pose w.r.t. the ego vehicle is given by the position [eX_0, 0, eZ_0]^T, i.e. the object origin in the ego system, and the orientation ψ. The motion parameters include the absolute velocity v and acceleration v̇ in longitudinal direction as well as the yaw rate ξ. Finally, oP_m = [oX_m, oY_m, oZ_m]^T, 1 ≤ m ≤ M, denote the coordinates of a point instance P_m within the local object coordinate system. So the filter not only estimates the pose and motion state, it also refines the 3D point cloud representing the object's shape. This is a significant difference to the original approach in [4], where the point positions have been estimated outside the filter due to real-time constraints. Furthermore, in contrast to [4], the position of the center rear axis of a detected vehicle is assumed to be known sufficiently well at initialization for simplicity.

System Model: The following nonlinear equations define the system model (motion model) in time-discrete form:

Δx(t) = [ Δ eX_0, Δ eZ_0, Δψ, Δv, Δξ, Δv̇, Δ oP_1, ..., Δ oP_M ]^T
      = [ −∫_t^{t+ΔT} v(τ) sin(ψ(τ)) dτ,  ∫_t^{t+ΔT} v(τ) cos(ψ(τ)) dτ,  ξΔT,  v̇ΔT,  0,  0,  [0,0,0]^T, ..., [0,0,0]^T ]^T.    (2)

The scalar ΔT indicates the discrete time interval between the current and the last time step. The higher order terms are modelled as normally distributed noise, i.e. x(t + ΔT) = x(t) + Δx(t) + w(t) with w(t) ∼ N(0, Σ(t)). The system noise matrix Σ, with Σ(t) = diag(σ²_{X_0}(t), σ²_{Z_0}(t), σ²_ψ(t), ..., σ²_{Z_M}(t)), controls the reactiveness of the filter. In the original approach, a constant system matrix is used for all time steps. For simplicity, we assume a stationary camera throughout this article, which is equivalent to an ideally ego-motion compensated scene, for example using the method proposed by Badino [8]. In practice, the estimated errors of the ego-motion have to be integrated into the system noise matrix.
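A numerical sketch of the state prediction implied by eq. (2) for the pose and motion entries; the small-step integration of the circular-path terms and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def predict_pose_motion(state, dt, n_substeps=10):
    """Propagate [X0, Z0, psi, v, xi, v_dot] over dt according to eq. (2).
    The position integrals are evaluated numerically; shape points stay fixed."""
    x0, z0, psi, v, xi, v_dot = state
    h = dt / n_substeps
    for _ in range(n_substeps):
        x0 += -v * np.sin(psi) * h
        z0 += v * np.cos(psi) * h
        psi += xi * h
        v += v_dot * h
    return np.array([x0, z0, psi, v, xi, v_dot])

# e.g. a vehicle at 10 m/s turning with 0.3 rad/s, predicted 0.04 s ahead
print(predict_pose_motion(np.array([0.0, 20.0, 0.0, 10.0, 0.3, 0.0]), dt=0.04))
```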
Measurement Model: The measurement vector is of the form

z(t) = [u_1(t), v_1(t), d_1(t), ..., u_M(t), v_M(t), d_M(t)]^T,    (3)
where (um (t), vm (t)) represents a sub-pixel accurate image position of a given object point Pm projected onto the image plane at time step t, and dm (t) the
stereo disparity at this position. The image positions are tracked using a feature tracker, e.g. [9], to be able to reassign measurements to the same object point over time. The nonlinear measurement equations used for measurement prediction directly follow from the projection equations (see [4] for details).
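The projection equations themselves are given in [4]; the following sketch assumes a standard pinhole stereo model with focal length f, principal point (u0, v0) and baseline b, which is our assumption rather than the paper's exact formulation.

```python
import numpy as np

def predict_measurement(P_obj, pose, f, u0, v0, b):
    """Predict (u, v, d) for one object point given the current pose
    (X0, Z0, psi); a standard pinhole stereo rig is assumed here."""
    X0, Z0, psi = pose
    # object -> ego coordinates: rotation about the height axis plus translation
    R = np.array([[np.cos(psi), 0.0, np.sin(psi)],
                  [0.0,         1.0, 0.0],
                  [-np.sin(psi), 0.0, np.cos(psi)]])
    Xe, Ye, Ze = R @ P_obj + np.array([X0, 0.0, Z0])
    u = u0 + f * Xe / Ze                 # image column
    v = v0 + f * Ye / Ze                 # image row
    d = f * b / Ze                       # stereo disparity
    return np.array([u, v, d])
```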
3 Filter Approaches

3.1 Interacting Multiple Model (IMM)
A set of r filters builds the basis of the IMM framework. Each filter represents a certain mode, e.g. non-maneuvering or maneuvering. One IMM filter cycle consists of three main parts: First, the r a posteriori state estimates and covariance matrices of the previous discrete time step are mixed using a weighted sum (interaction step). Then, the filters are updated based on a common measurement vector (filtering step). Finally, a mode probability is computed for each filter (mode probability update step). The normalized residual, i.e. the deviation between prediction and measurements in terms of standard deviations, is used as an indicator of the mode likelihood. Furthermore, a priori mode transition probabilities must be provided. The weighting coefficients in the mixing step are derived from the mode likelihood and the mode transition probability. Details can be found in [6]. We employ a two-filter approach. Both filters use the same object, measurement, and motion model as proposed in Sec. 2. The different behaviour of the non-maneuvering filter (= mode 1) and maneuvering filter (= mode 2) is configured via the system noise matrices only, denoted as Σ_stat and Σ_mnv, respectively. It is possible to parametrize the system matrices in a way that the non-maneuvering filter corresponds to a constant acceleration, constant yaw rate motion model, while the maneuvering filter allows for larger changes in yaw rate and acceleration (see Sec. 4.1).
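A sketch of the mixing-weight and mode-probability bookkeeping of the IMM cycle described above, following the standard equations of [6]; the per-mode Kalman prediction and update steps are omitted, and all names are illustrative.

```python
import numpy as np

def imm_mixing_weights(mu, trans):
    """Mixing weights mu_{i|j} and predicted mode probabilities c_j from the
    previous mode probabilities mu and the transition matrix (standard IMM [6])."""
    c = trans.T @ mu                          # predicted mode probabilities
    mix = (trans * mu[:, None]) / c[None, :]  # mix[i, j] = P(was mode i | now mode j)
    return mix, c

def imm_mode_update(c, likelihoods):
    """Posterior mode probabilities from the per-filter measurement likelihoods."""
    mu = c * likelihoods
    return mu / mu.sum()

trans = np.array([[0.98, 0.02], [0.02, 0.98]])   # transition matrix chosen in Sec. 4.1
mix, c = imm_mixing_weights(np.array([0.9, 0.1]), trans)
mu_new = imm_mode_update(c, likelihoods=np.array([0.2, 1.4]))
```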
3.2 Oracle Measurements and Adaptive Noise (EKF-OR)
Given a set of k measurement vectors z̃ = {z(T), z(T − 1), ..., z(T − (k − 1))}, where T denotes the current discrete time step, we can estimate

y = [Θ, Ω(T), Ω(T − 1), ..., Ω(T − (k − 1))]^T,    (4)

i.e. the shape parameters and a set of object poses, via a maximum likelihood estimation ŷ = argmax_y p(z̃ | y), with ŷ denoting the estimated parameters. The idea of this approach is that, using the known camera geometry, we directly obtain eP_m(t), i.e. the coordinates of the observed object points in the ego coordinate system at a given time step. From these point coordinates and the corresponding coordinates in the object system, we can derive the six-parameter similarity transformation (no scale) between the ego and object system, which gives the object's pose (see [10]). Assuming the object point cloud to be rigid over time, we can simultaneously estimate the object point cloud (shape) and
266
A. Barth et al.
the object poses in a least squares sense. As we assume Gaussian noise, this is equivalent to a maximum likelihood estimation. In addition, the motion parameters can be easily derived from the estimated poses by numerical differentiation. We will refer to the estimation process outside the Kalman filter as oracle in the following. The oracle is purely data driven and not constrained by a motion model. Thus, the oracle approach is able to follow all movements describable in terms of a 3D rotation and translation.
The yaw rate estimate of the oracle, ξ_OR, gives strong evidence for maneuvers. At non-maneuvering periods the yaw rate is assumed to be low in typical driving scenarios and may be constant for a longer time interval. On the other hand, a large yaw rate is very likely to change within a short time interval, since vehicles usually do not drive in a circle for very long. This leads to the idea of steering the system noise level based on the magnitude of the yaw rate estimate of the oracle (using the oracle's yaw acceleration estimate instead of the yaw rate turned out to be too noisy as a reliable maneuver detector).
The yaw rate entry in the system noise matrix at a given time step t is set to Λ(ξ_OR(t)), where Λ is a sigmoidal function that smoothly interpolates between a minimum and a maximum noise level σ_min and σ_max, respectively, as can be seen in Fig. 2. Next, we pass the yaw rate estimate of the oracle to the Kalman filter as an additional measurement. The measurement vector z(t) is extended to z′(t) = [z(t), ξ_OR]^T. The measurement noise for the yaw rate estimate can be derived from the uncertainties of the estimated oracle poses.

Fig. 2. Adaptive noise function Λ(λ) as used in the experiments. The function converges to σ_max for λ > η and equals σ_min + ε for λ = 0. The values of σ_min, σ_max, and η are design parameters and ε a small constant.
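As a small illustration of such a noise control, the following Python function reproduces the qualitative behaviour shown in Fig. 2; the logistic form and the sharpness parameter are assumptions for this sketch, since the text does not specify the exact functional form of Λ.

import numpy as np

def adaptive_noise(lam, sigma_min=0.05, sigma_max=1.0, eta=1.5, sharpness=8.0):
    # Sigmoidal interpolation between sigma_min and sigma_max:
    # close to sigma_min for lam = 0 and close to sigma_max for lam > eta.
    s = 1.0 / (1.0 + np.exp(-sharpness * (lam - 0.5 * eta)))
    return sigma_min + (sigma_max - sigma_min) * s

# usage: the yaw rate entry of the system noise matrix at time t would be set to
# adaptive_noise(abs(xi_oracle))**2, since the parameters in Sec. 4.1 are standard deviations.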
3.3 Yaw Acceleration Model (EKF-YA)
In the third solution, the state vector x is augmented by an additional parameter for the yaw acceleration ξ̇. The resulting system model corresponds to a constant yaw acceleration model, i.e. ξ̈ ∼ N(0, σ_ξ̈). Accordingly, the linearized system matrix and the system noise matrix have to be adapted, while the measurement model remains unchanged.
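A minimal sketch of the yaw-related block of the augmented prediction step might look as follows; the first-order (Euler) discretization mirrors the increments in (2), but the exact form of the adapted system matrix is not given in the text, so this is an assumption.

import numpy as np

def predict_yaw_block(psi, xi, xi_dot, dT):
    # constant yaw acceleration model for (orientation, yaw rate, yaw acceleration)
    psi_new = psi + xi * dT
    xi_new = xi + xi_dot * dT
    xi_dot_new = xi_dot                 # driven only by system noise, xi_ddot ~ N(0, sigma)
    F = np.array([[1.0, dT, 0.0],       # corresponding block of the linearized system matrix
                  [0.0, 1.0, dT],
                  [0.0, 0.0, 1.0]])
    return np.array([psi_new, xi_new, xi_dot_new]), F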
4 Experimental Results

4.1 Filter Configuration
The different filter approaches have been parametrized with the following system noise throughout the experiments (given in standard deviations):
             σ_X0   σ_Z0   σ_ψ     σ_v     σ_ξ          σ_v̇      σ_ξ̇    σ_Xm    σ_Ym    σ_Zm
EKF          0.01   0.01   0.001   0.001   0.05         1.0      -      0.025   0.025   0.025
EKF+YA       0.01   0.01   0.001   0.001   0.0001       1.0      1.0    0.025   0.025   0.025
EKF+OR       0.01   0.01   0.001   0.001   Λ(|ξ_OR|)    1.0      -      0.025   0.025   0.025
IMM(Σ_stat)  0.01   0.01   0.001   0.001   0.0001       0.0001   -      0.025   0.025   0.025
IMM(Σ_mnv)   0.01   0.01   0.001   0.001   1.0          1.0      -      0.025   0.025   0.025
where Λ(·) represents the adaptive noise function defined in Sec. 3.2. The noise function has been parametrized with σ_min = 0.05, σ_max = 1.0, and η = 1.5. This means that for small ξ_OR the EKF-OR filter is configured like the standard EKF filter, and for |ξ_OR| > 1.5 the filter corresponds to the maneuvering mode of the IMM filter. A window size of k = 5 has been used for the oracle. The mode transition probability matrix Π_trans for the IMM filter has been chosen as

\Pi_{\mathrm{trans}} = \begin{bmatrix} 0.98 & 0.02 \\ 0.02 & 0.98 \end{bmatrix}   (5)

where the value at the ith row and jth column indicates the a priori probability that the filter switches from mode i to mode j (i, j ∈ {1, 2}), i.e. it is much more likely that the filter remains in the same mode. Diagonal entries close to 1 prevent frequent mode switching and result in more distinct mode probabilities.

4.2 Synthetic Ground Truth
In a simulation, a virtual vehicle, represented by a set of M = 60 rigid 3D points, is moved along a synthetic ground truth trajectory. The trajectory shows a turn maneuver that cannot be followed well by the single EKF approach if parametrized for typical longitudinal or slow turn movements. At each time step, the pose and motion state of the virtual object is known. Measurements are generated by projecting the object points at a given time step onto the image planes of a virtual stereo camera pair. The measurements are disturbed by adding Gaussian noise with standard deviation σ_u = 0.1 pixel to the horizontal image coordinate in both images, i.e. σ_d = \sqrt{σ_u² + σ_u²} and σ_v = 0. The vehicle starts approaching at 50 m distance with a constant velocity of 10 m/s and a small yaw acceleration of 0.05 rad/s². After 50 time steps, a sudden constant yaw acceleration of 2 rad/s² is generated for 10 time steps (0.4 s). Due to the random measurement noise, the simulation has been repeated 40 times. The mean estimated yaw rate for each filter is shown in Fig. 3(a), together with the 1-σ (standard deviation) error band. As can be seen, the single EKF approach (constant yaw rate assumption) cannot follow the fast yaw rate increase, while the proposed extended versions approximate the ground truth much better. Differences lie in the delay of the yaw rate increase, i.e. how fast a given filter reacts to the yaw acceleration, and in the overshooting. The EKF-YA filter quickly follows the yaw rate increase by estimating the yaw acceleration. Fig. 3(b) shows the response to the rectangular yaw acceleration input. The resulting overshooting of the estimated yaw rate is about twice
Fig. 3. (a) Error bands of yaw rate estimates based on mean and standard deviation over 40 Monte Carlo runs. All filter extensions approximate the ground truth much better than the original single EKF approach. (b) Estimated yaw acceleration of the EKF-YA filter compared to ground truth. (c) IMM mode probabilities.
as large as for the IMM approach, which is able to follow the yaw rate increase even faster by switching from non-maneuvering mode to maneuvering mode at frame 51 (see Fig. 3(c)). As the yaw acceleration maneuver ends at frame 60, the probability of the maneuvering mode starts to decrease. The oracle detects the yaw rate increase without delay and shows no overshooting at all. Since the oracle is model-free and unconstrained, the results show a larger standard deviation compared to the Kalman filter based methods. The resulting trajectories are quite unsteady. However, combining the oracle with a Kalman filter in the EKF-OR approach yields both a very accurate yaw rate estimate and smooth trajectories with almost no overshooting, due to the additional yaw rate measurements. The delay until the yaw rate starts to increase depends on the design of the adaptive noise control function (see Fig. 2).

4.3 Real World Ground Truth
The different filtering approaches have also been evaluated using a database of 57 real vehicle trajectories recorded from in-car sensors at different urban intersections, including 18 left turn maneuvers, 19 right turn maneuvers, and 20 straight crossing scenarios. Analogous to the synthetic trajectory, a virtual point cloud object has been moved along the real trajectories. The root mean squared error (RMSE) between estimated value and ground truth is taken as the evaluation criterion. Fig. 4 shows the distribution of the RMSE for the different filters in terms of the median and the 25th and 75th percentiles (boxes). The whiskers indicate the first and the (clipped) 99th percentile. As can
[Fig. 4 panels: (a) Yaw Rate - Total, (b) Yaw Rate - Turn, (c) Yaw Rate - Straight, with RMSE in rad/s; (d) Position - Total, (e) Position - Turn, (f) Position - Straight, with RMSE in m; one box per filter: EKF, EKF-YA, IMM, EKF+OR.]
Fig. 4. Distribution of RMSE between estimated state and ground truth for yaw rate (top row) and position (bottom row). The boxes indicate the 25th and 75th percentile as well as the median.
be seen, the median RMSE of the yaw rate over all runs is significantly reduced by all proposed extensions compared to the original EKF approach (Fig. 4(a)). The IMM and EKF+OR approaches yield almost identical results and perform slightly better than the EKF-YA. The improvements in the estimated yaw rate directly affect the error in position, which is also decreased for all extensions (Fig. 4(d)), especially if only the subset of trajectories including a turn maneuver is considered for evaluation (second column). On the other hand, even for the straight motion trajectories (third column), improvements are achieved. The parametrization with much lower system variances in the non-maneuvering mode of the IMM filter is beneficial in low dynamic scenes, as can be seen in Fig. 4(f). At straight motion the EKF-OR filter equals the configuration of the EKF filter, thus the results do not differ much. Modelling the yaw acceleration in the EKF-YA leads to larger deviations of the error in non-dynamic scenes. The filter has to compensate for small errors in the orientation estimate by adapting the yaw acceleration, which typically leads to oscillation effects that increase the RMSE. We did not find a significant improvement in the shape estimate on the test data, since the initial noisy object point cloud is refined by the filters before the maneuvers start.

4.4 Filter Behaviour at Extreme Maneuvers
So far we have considered maneuvers that typically occur at urban intersections. In extreme situations, such as skidding on black ice, the assumption that the vehicle’s orientation is aligned with the moving direction is violated. We simulate a skidding trajectory by introducing an external force in terms of a lateral velocity component of 6 m/s, pushing the oncoming vehicle toward the observer
[Fig. 5 panels: (a) estimated trajectories and final poses (lateral vs. longitudinal position) at t = 70 for EKF, EKF+YA, IMM, EKF+OR, and EKF+ORX, together with the ground truth poses at t = 50, 60, 70; (b) absolute orientation difference (rad) and system noise for position (σ_X, σ_Z) over the frame number.]
Fig. 5. (a) Estimation results of skidding trajectory. The EKF-ORX approach perfectly reconstructs the ground truth trajectory and final pose. (b) Absolute difference between object orientation and moving direction used for adaptation of the system noise.
while turning left with a constant yaw acceleration of 6 rad/s². As can be seen in Fig. 5(a), all filters discussed above fail to estimate the unexpected trajectory correctly since it does not agree with a circular path motion model. However, the model-free oracle is able to follow the sideslipping vehicle very accurately. Thus, we have adapted the idea of using the oracle not only to control the system noise for the yaw rate, but also for the position. The deviation between the actual moving direction and the object orientation, estimated by the oracle, is used as a skidding detector (see Fig. 5(b)). Larger deviations indicate a violation of the circular path motion model. In this case, the system variances for the position are increased using the same mechanism as introduced for the yaw rate. With large system variances for the position, the vehicle is able to shift in an arbitrary direction independent of its current orientation. The resulting final pose estimated by the so-called EKF+ORX approach almost perfectly fits the ground truth, demonstrating the potential of this approach.
5 Conclusion
All of the proposed extensions have improved the tracking results of vehicles at typical turn maneuvers without losing performance at straight maneuvers and without any manual parameter tuning. The error in the estimates of yaw rate and position could be significantly reduced compared to the original single EKF approach. We found that the IMM approach yields the best compromise between tracking accuracy and computational complexity. Since the two IMM filters can be run in parallel on multi-processor architectures, the additional load reduces
to the mode mixing and the probability update. The computation of the oracle is more costly than that of the Kalman filter, even if the sparseness of the involved matrices is exploited, and depends on the window size k. However, the skidding experiment has shown the potential of the oracle approach not only to improve the tracking at standard turn maneuvers, but also to detect extreme situations that are of special interest w.r.t. collision avoidance systems. The EKF-YA filter is an alternative if computation time is critical and only a single core processor is available. Its additional computation time compared to the original EKF approach is negligible, while the results show a significant improvement at turn maneuvers. The findings of this contribution are not restricted to image-based vehicle tracking approaches and could be directly applied to other sensors such as LIDAR. Next steps will include adapting the idea of the oracle to also control the noise level of the longitudinal acceleration, and reducing the oracle's complexity by restricting the rotation and translation from six to four degrees of freedom (planar motion assumption).
References
1. Koller, D., Daniilidis, K., Nagel, H.: Model-based object tracking in monocular image sequences of road traffic scenes. International Journal of Computer Vision 10(3), 257–281 (1993)
2. Dellaert, F., Thorpe, C.: Robust car tracking using Kalman filtering and Bayesian templates. In: Conference on Intelligent Transportation Systems (1997)
3. Kaempchen, N., Weiss, K., Schaefer, M., Dietmayer, K.: IMM object tracking for high dynamic driving maneuvers. In: Intelligent Vehicles Symposium, pp. 825–830. IEEE, Los Alamitos (2004)
4. Barth, A., Franke, U.: Where will the oncoming vehicle be the next second? In: Intelligent Vehicles Symposium. IEEE, Los Alamitos (2008)
5. Zomotor, A.: Fahrwerktechnik/Fahrverhalten, 1st edn. Vogel (1987)
6. Bar-Shalom, Y., Rong Li, X., Kirubarajan, T.: Estimation with Applications to Tracking and Navigation. John Wiley & Sons, Inc., Chichester (2001)
7. Chan, Y., Hu, A., Plant, J.: A Kalman filter based tracking scheme with input estimation. IEEE Trans. on AES 15(2), 237–244 (1979)
8. Badino, H.: A robust approach for ego-motion estimation using a mobile stereo platform. In: Jähne, B., Mester, R., Barth, E., Scharr, H. (eds.) IWCM 2004. LNCS, vol. 3417. Springer, Heidelberg (2004)
9. Tomasi, C., Kanade, T.: Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University (April 1991)
10. Horn, B.K.P., Hilden, H., Negahdaripour, S.: Closed-form solution of absolute orientation using orthonormal matrices. Journal of the Optical Society A 5(7), 1127–1135 (1988)
Making Archetypal Analysis Practical

Christian Bauckhage¹,² and Christian Thurau¹

¹ Fraunhofer IAIS, Sankt Augustin, Germany
² B-IT, University of Bonn, Bonn, Germany
{christian.bauckhage,christian.thurau}@iais.fraunhofer.de
Abstract. Archetypal analysis represents the members of a set of multivariate data as a convex combination of extremal points of the data. It allows for dimensionality reduction and clustering and is particularly useful whenever the data are superpositions of basic entities. However, since its computation costs grow quadratically with the number of data points, the original algorithm hardly applies to modern pattern recognition or data mining settings. In this paper, we introduce ways of notably accelerating archetypal analysis. Our experiments are the first successful application of the technique to large scale data analysis problems.
1 Introduction
Archetypal Analysis (AA) was introduced by Cutler and Breiman [1] as a new way of dimensionality reduction for multivariate data. The basic idea is to approximate each point in a data set as a convex combination of a set of archetypes. The archetypes themselves are restricted to being sparse mixtures of individual data points and are thus supposed to be easily interpretable by human experts. This contrasts with familiar techniques such as (kernel) PCA [2,3] where the resulting basis elements often lack physical meaning. And while NMF [4,5] yields characteristic parts, AA yields archetypal composites. In order to identify suitable archetypes, Cutler and Breiman minimize the squared error in representing each data point as a mixture of archetypes (see Fig. 1). They show that minima are attained if the archetypes are extreme data points lying on the convex hull of the data. Their minimization algorithm consists of an alternating least squares procedure where each iteration requires the solution of several constrained quadratic optimization problems. Representations based on convex combinations of archetypal elements offer interesting possibilities for pattern recognition. As the coefficient vectors of a convex combination reside in a simplex, AA lends itself to subsequent soft clustering, probabilistic ranking, or classification using latent class models. So far, however, AA has found application in physics and biology [6,7,8] but did not prevail as a commodity tool for pattern analysis or classification. We assume this to be a consequence of the complexity of the algorithm proposed in [1]. In this paper, we briefly review AA, discuss some of its characteristics, and point out why it scales quadratically with the size of a data set. The main contribution of this paper is presented in section 3: We propose a working set
[Fig. 1 panels: (a) 100 data points and their convex hull; (b) hull approximation with 4 archetypes; (c) hull approximation with 7 archetypes; (d) points contributing to RSS.]
Fig. 1. Archetypal analysis approximates the convex hull of a set of data. Increasing the number p of archetypes improves the approximation. Solutions found for different choices of p do not necessarily nest; for instance, none of the 4 archetypes in (b) reoccurs among the 7 archetypes in (c). While points inside of an approximated convex hull can be represented exactly as a convex combination of archetypes, points on the outside are represented by their nearest point on the archetype hull. Suitable archetypes result from iteratively minimizing the residuals of the points outside of the hull (d).
mechanism for AA that accelerates the procedure. In addition, we introduce a workable way of preselecting auspicious archetypal candidates and thus gain further speed up. In section 4, we apply our accelerated AA algorithm to large image collections. To the best of our knowledge, this is the first time that archetypal analysis is being applied to data sets consisting of tens of thousands of elements rather than of just several dozens. Finally, a summary concludes this paper.
2 Archetypal Analysis
Suppose that we are given a set of data X = {x_1, x_2, ..., x_n} where x_i ∈ R^m. Archetypal analysis (AA) deals with finding a set of archetypes {z_1, z_2, ..., z_p} with p ≪ n that are linear combinations of the data points

z_j = \sum_{i=1}^{n} x_i b_{ij}   (1)
where the coefficients b_{ij} ≥ 0 so that the archetypes resemble the data and \sum_i b_{ij} = 1 so that they are convex mixtures of the data. Then, for a given choice of archetypes, AA minimizes

\Bigl\| x_i - \sum_{j=1}^{p} z_j a_{ji} \Bigr\|^2   (2)

to determine coefficients a_{ji} that allow the data x_i to be well represented by the archetypes. Again, AA imposes the constraints a_{ji} ≥ 0 so that each data point is a meaningful combination of archetypal elements and \sum_j a_{ji} = 1 so that the data points are represented as mixtures of archetypes. Therefore, a suitable choice of archetypes {z_1, ..., z_p} minimizes the residual sum of squares

\mathrm{RSS}(p) = \sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{p} z_j a_{ji} \Bigr\|^2 = \sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{p} \sum_{k=1}^{n} x_k b_{kj} a_{ji} \Bigr\|^2 .   (3)
For our discussion in this paper, it is more convenient to write (3) as a matrix equation. To this end, we collect the data points x_i ∈ R^m in an m × n matrix X and the archetypes z_j ∈ R^m in an m × p matrix Z and cast (3) as

\mathrm{RSS}(p) = \|X - ZA\|^2 = \|X - XBA\|^2   (4)

where A ∈ R^{p×n} and B ∈ R^{n×p} are column stochastic matrices. Computing an archetypal representation therefore requires the constrained optimization of two sets of coefficients {a_{ji}} and {b_{ij}}. To accomplish this, Cutler and Breiman [1] present an alternating least squares algorithm which we discuss below; first, however, we summarize some of the characteristics of AA.

2.1 Properties of Archetypal Analysis
In [1], Cutler and Breiman prove that, for p > 1, the archetypes {z_1, ..., z_p} are located on the data convex hull. For p = 1, the unique minimizer of (3) is the sample mean and for p = 2, the vector v = z_1 − z_2 corresponds to the first principal axis of the data. If q ≤ n points define the convex hull of the data and p = q, the global minimizers of (3) are exactly those q points. Increasing the number of archetypes therefore improves the approximation of the data convex hull (see Fig. 1). Cutler and Breiman point out that archetypes do not nest and need not be orthogonal (see Fig. 1). Once suitable archetypes {z_1, ..., z_p} have been determined, every data point can either be exactly represented or approximated by a convex combination of the z_j (see Fig. 1). Since a_{ji} ≥ 0 and \sum_j a_{ji} = 1, each data point can be interpreted as a distribution over the archetypes. Therefore, AA readily allows for soft clustering or classification since the coefficients a_{ji} of a data point x_i can be interpreted as probabilities p(x_i | z_j) indicating membership to classes represented by the archetypes z_j (see Fig. 2). Since (3) generally does not have a closed form solution, one must resort to optimization. Cutler and Breiman point out that careful initialization improves the speed of convergence and lowers the risk of finding insignificant archetypes.
[Fig. 2 panels: (a) archetypal clusters; (b) simplex projection.]
Fig. 2. Example of using AA for clustering; each point xi is assigned to an archetype zk using k = argmaxj aji . The coefficient vectors ai of the data xi are stochastic vectors and therefore reside in a simplex whose vertices correspond to the archetypes zk .
2.2 The Archetype Algorithm
In order to solve (3) for optimal coefficients a_{ji} and b_{ij}, Cutler and Breiman propose an alternating least squares procedure. Given an initial guess of the archetypes {z_1, ..., z_p}, their method iterates the following steps:

1.) determine coefficients a_{ji} by solving n constrained problems as in (2); in matrix notation we have: min ‖Z a_i − x_i‖² s.t. a_{ji} ≥ 0 and \sum_j a_{ji} = 1. To point out the computational complexity of this step, we recast these n problems as

\min \; \tfrac{1}{2} a_i^T Q a_i - q^T a_i, \quad i = 1, \ldots, n \qquad \text{s.t.} \;\; I\,a_i \geq 0, \;\; \mathbf{1}^T a_i = 1   (5)

where Q = Z^T Z is a p × p matrix and q = Z^T x_i is a p-vector.

2.) given the updated a_{ji}, compute intermediate archetypes that account for the update, i.e. solve (2) for the z_j to obtain Z̃ = XA^T (AA^T)^{-1}.

3.) determine the coefficients b_{ij} as the minimizers of p constrained problems min ‖X b_j − z̃_j‖² s.t. b_{ij} ≥ 0 and \sum_i b_{ij} = 1, or equivalently

\min \; \tfrac{1}{2} b_j^T R b_j - r^T b_j, \quad j = 1, \ldots, p \qquad \text{s.t.} \;\; I\,b_j \geq 0, \;\; \mathbf{1}^T b_j = 1   (6)

where R = X^T X is an n × n matrix and r = X^T z̃_j is a corresponding n-vector.

4.) update the archetypes by setting Z = XB
5.) compute the new RSS; unless it falls below a threshold or only marginally improves on the old RSS, continue with 1.)

Apparently, computation and memory costs of this algorithm do not primarily depend on the dimension m of the data but are dominated by the optimization problems of the order of O(n²) in step 3.), where n denotes the size of the data set. Given the growing amount of data that characterizes modern data analysis problems, naïvely implementing AA therefore impedes its application in most practical settings. Next, we suggest improvements to alleviate this.
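For illustration, the n problems in (5) can be solved with an off-the-shelf QP solver; the sketch below uses the cvxopt library that is also mentioned in section 4, but the function and variable names are our own and the original implementation may differ.

import numpy as np
from cvxopt import matrix, solvers

def update_coefficients_A(X, Z):
    # solve min ||Z a_i - x_i||^2  s.t.  a_i >= 0, 1^T a_i = 1,  for i = 1..n
    m, n = X.shape
    p = Z.shape[1]
    Q = matrix(Z.T @ Z)                                      # p x p quadratic term, identical for all i
    G, h = matrix(-np.eye(p)), matrix(np.zeros(p))           # -a_i <= 0 (non-negativity)
    Aeq, beq = matrix(np.ones((1, p))), matrix(np.ones(1))   # 1^T a_i = 1 (convexity)
    solvers.options['show_progress'] = False
    A = np.zeros((p, n))
    for i in range(n):
        q = matrix(-(Z.T @ X[:, i]))                         # linear term of (5)
        A[:, i] = np.array(solvers.qp(Q, q, G, h, Aeq, beq)['x']).ravel()
    return A

Step 3.) has the same structure with R = X^T X and r = X^T z̃_j, which is exactly where the O(n²) cost arises.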
3 Making Archetypal Analysis Practical
Both our modifications of the original AA algorithm exploit the fact that suitable archetypes reside on the data convex hull. In a nutshell, the basic idea is that, since archetypes are sparse mixtures of data points, data points x_i inside of the convex hull do not contribute to these mixtures. Therefore, the corresponding coefficients b_{ij} need not be estimated but can be set to 0.

3.1 Focusing on a Working Set
Data contained within the convex hull of a set of archetypal estimates do not contribute to the residual that is being minimized by the archetype algorithm (compare again Fig. 1). In each iteration of the algorithm, the data set can therefore be decomposed into X = X⁺ ∪ X⁻ where X⁻ = {x_i ∈ X | x_i = Z a_i} and X⁺ = {x_i ∈ X | x_i ≠ Z a_i}. That is, X consists of a set of points that can be represented exactly as a convex combination of archetypes and a working set (cf. [9]) containing points that can only be approximated. Applying a suitable permutation, the matrix X in (4) then reads X = [X⁺ X⁻] where X⁺ and X⁻ are m × n′ and m × (n − n′) matrices, respectively, and n′ < n. Under this premise the residual in (4) becomes

\|X - ZA\|^2 = \bigl\| [X^{+}\; X^{-}] - Z\,[A^{+}\; A^{-}] \bigr\|^2 = \|X^{+} - ZA^{+}\|^2 + \underbrace{\|X^{-} - ZA^{-}\|^2}_{=0}   (7)

and after expanding Z = XB it further reduces to

\|X^{+} - ZA^{+}\|^2 = \Bigl\| X^{+} - [X^{+}\; X^{-}] \begin{bmatrix} B^{+} \\ B^{-} \end{bmatrix} A^{+} \Bigr\|^2 = \bigl\| X^{+} - (X^{+}B^{+} + \underbrace{X^{-}B^{-}}_{=0})\,A^{+} \bigr\|^2 = \|X^{+} - X^{+}B^{+}A^{+}\|^2 .   (8)
Here, the last step exploits that X− only contains data points inside of the convex hull of the currently estimated archetypes; as the archetypes themselves
are mixtures of data points on the convex hull of X, the data points in X⁻ do not contribute to Z, which, in turn, is tantamount to B⁻ = 0. Consequently, the effort required in the third step of the archetype algorithm reduces to O(n′²) < O(n²). Moreover, as the algorithm improves archetypal estimates, the number of points outside of their convex hull decreases. This also decreases the size of the optimization problems and automatically accelerates the algorithm in later iterations.
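A corresponding working-set selection simply keeps the points that are not reproduced by the current archetypes up to a numerical tolerance; the following Python sketch and the chosen tolerance are illustrative.

import numpy as np

def split_working_set(X, Z, A, tol=1e-8):
    # X: m x n data, Z: m x p archetypes, A: p x n coefficients
    residuals = np.linalg.norm(X - Z @ A, axis=0)
    outside = residuals > tol            # points outside the current archetype hull
    return X[:, outside], X[:, ~outside], outside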
3.2 Preselecting Archetypal Candidates
The overall gain in speed that can be achieved by the above approach depends on the initial choice of archetypes. If these were close to the data convex hull, already the first couple of iterations of the algorithm would have to consider small working sets only. In fact, if we were to know the points on the convex hull of X, we could restrict the optimization procedure to just these points. Unfortunately, the so-called Upper Bound Theorem (cf. e.g. [10]) states that the worst case combinatorial complexity of computing the convex hull of n points in m dimensions is Θ(n^⌊m/2⌋). Although more sophisticated methods for computing the convex hull of m-dimensional data exist, the problem seems ill-posed for typical pattern recognition tasks such as the analysis of image databases. Our second contribution in this paper consists in a workable solution that avoids this problem. Instead of trying to compute the data convex hull directly, we propose to consider a sub-sample X_H of points on the convex hull of X. Since the optimization procedure in archetypal analysis will usually converge to approximations of the optimal choice of archetypes, we effectively narrow the number of possible solutions. This step is crucial for making archetypal analysis practical for very large data sets, i.e. cases where n > 50000. To obtain X_H we exploit that the original data matrix X contains only finitely many points so that its convex hull forms a polytope in R^m. The main theorem of polytope theory states that every image of a polytope P under an affine map π : x → Mx + t is a polytope [11]. In particular, every vertex of an affine image of P corresponds to a vertex of P. This allows us to sample the convex hull of X as the union of points found on the convex hulls of different 2D projections of the data. We project the data onto the h(h−1)/2 2D subspaces spanned by pairwise combinations of the first h eigenvectors of the covariance matrix of X, where h is chosen such that the first h eigenvectors account for 95% of the data variation. In our experiments with computer vision benchmark data, we found that the number of points n′ obtained from several 2D projections is much smaller than the set size n. Note that this is not a general property of convex hulls. On the contrary, in extremely high dimensions all points of a normally distributed data set come to lie on the hull [12,13]. The fact that our experiments always revealed a computational complexity O(n′²) ≪ O(n²) (see Fig. 4 for an exemplary quantitative analysis on synthetic 3D data) indicates that, at least in the feature spaces we considered, large collections of natural images are not normally distributed but reside on lower dimensional manifolds of different structure.
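In code, this preselection can be sketched with standard eigendecomposition and 2D convex hull routines; the concrete numpy/scipy calls are our choice and not prescribed by the paper.

import numpy as np
from scipy.spatial import ConvexHull

def preselect_candidates(X, variance=0.95):
    # sample points on the convex hull of X (m x n) from pairwise 2D projections
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, evecs = np.linalg.eigh(np.cov(Xc))        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    h = int(np.searchsorted(np.cumsum(evals) / evals.sum(), variance)) + 1
    candidates = set()
    for a in range(h):
        for b in range(a + 1, h):
            proj = evecs[:, [a, b]].T @ Xc           # 2 x n projection
            candidates.update(ConvexHull(proj.T).vertices.tolist())
    idx = sorted(candidates)
    return X[:, idx], idx

The returned columns form X_H, on which the archetype algorithm summarized in Fig. 3 then operates.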
Input: data matrix X ∈ R^{m×n}
Output: matrix of archetypes Z ∈ R^{m×p} and coefficient matrices A and B
  preselect archetypal candidates X_H
  initialize matrices Z, A, and B, and compute RSS_{t=0}
  repeat
    optimize A = min_A ‖X_H − ZA‖² s.t. a_{ji} ≥ 0 and \sum_j a_{ji} = 1
    determine working set X⁺ and matrices X⁺, A⁺, and Z̃⁺, and set B⁻ = 0
    optimize B⁺ = min_{B⁺} ‖Z̃⁺ − X⁺B⁺‖² s.t. b_{ij} ≥ 0 and \sum_i b_{ij} = 1
    update the archetypes Z = X⁺B⁺
  until RSS_{t+1} < θ or |RSS_{t+1} − RSS_t| < ε

Fig. 3. Summary of the archetype algorithm combining both proposed accelerations
Fig. 4. Runtime behavior of archetypal analysis applied to a growing number n of 3D points. All three variants scale quadratically with n. The version confined to points sampled from the convex hull clearly outperforms the other two.
4 Application Examples
In order to verify the practical applicability of the proposed extensions to archetypal analysis, we analyzed two large data sets from the image analysis domain. The first data set consists of sequences of human silhouettes showing various activities [14]. The second data set consists of more than 50.000 images downloaded from flickr. For both data sets, we followed the proposal in [15] and re-scaled the RGB/gray-value images to a resolution of 32 × 32 pixels. Visualizations of the archetypes we found are shown in Figs. 5 through 8. The computation times we measured were reasonable; 50 iterations on the flickr data took less than an hour using a Python implementation applying the cvxopt optimization library by Dahl and Vandenberghe (http://abel.ee.ucla.edu/cvxopt/). Note that, to
Fig. 5. (a) 2D projection of the Weizmann set containing 5.000 body poses; points on the convex hull are shown as pictures. (b) 6 archetypal poses extracted from the data.
Fig. 6. 2D projections of 50.000 images retrieved from flickr ; points located on the convex hull are shown as pictures
our knowledge, these are the largest data sets processed with archetypal analysis so far. Following our suggested initialization and optimization steps, the approach scales to millions of images since it no longer depends on the overall set size but rather on the number of data points sampled from the convex hull. Interestingly, the archetypes found among the flickr images display a geometric similarity to the Gabor filters that are found among the principal components of natural images [16]. They show prominent vertical, horizontal, or diagonal line patterns, or they feature blob-like elements in the center of the image. This
Fig. 7. 16 archetypes determined from a data set of 50.000 flickr images
Fig. 8. 16 archetypes resulting from a different initialization of the algorithm. While these archetypes are not completely identical to the ones in Fig. 7, they show similar global geometric structures or brightness gradients.
suggests that the extremal points in this large collection of natural images are located close to the principal axes of the data. Since we do not observe this behavior for the pose images, this finding does not seem to be an artefact of AA but rather a phenomenon of the statistics of natural images. Moreover, for the flickr images, we find the archetypes to be rather distant from the vast majority of the data. We currently investigate whether this is a boon or a bane. On the one hand, AA is affected by outliers. On the other hand, the notion of an outlier is not that clearly defined for a large set of natural images. While they are extreme in that they show stark contrasts and dominant structures, none of the archetypes we found can be considered an abnormal picture. Also, representing a data set by means of sparse convex combinations over the members of the set is of course best accomplished, if the basis elements are extreme. Whether or not this improves clustering or content-based classification is examined in an ongoing study.
5 Summary
Archetypal analysis represents each point in a data set as a convex combination of a set of archetypes which themselves are sparse mixtures of individual data points. Unlike most familiar dimensionality reduction or clustering techniques, archetypal analysis therefore yields basis elements that are readily interpretable by human experts.
Optimal archetypes reside on the data convex hull and are determined through a constrained quadratic optimization process. In this paper, we suggested two modifications of the original algorithm in order to notably speed up its runtime. We exploit that archetypes are sparse convex combinations of extremal elements of the data and only apply the procedure to working sets of correspondingly reduced sizes. Consequently, archetypal analysis becomes applicable to a wide range of realistic data analysis problems for it can now cope with data sets whose sizes exceed several hundred elements. The results we presented in this paper are, to the best of our knowledge, the first instances of successful application of archetypal analysis to several tens of thousands of data points.
References
1. Cutler, A., Breiman, L.: Archetypal Analysis. Technometrics 36(4), 338–347 (1994)
2. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
3. Schölkopf, B., Smola, A.J., Müller, K.-R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10(5), 1299–1319 (1998)
4. Lee, D.D., Seung, S.: Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 401(6755), 788 (1999)
5. Finesso, L., Spreij, P.: Approximate Nonnegative Matrix Factorization via Alternating Minimization. In: Proc. 16th Int. Symp. on Mathematical Theory of Networks and Systems, Leuven (July 2004)
6. Stone, E., Cutler, A.: Archetypal Analysis of Spatio-temporal Dynamics. Physica D 90(3), 209–224 (1996)
7. Chan, B.H.P.: Archetypal Analysis of Galaxy Spectra. Monthly Notices of the Royal Astronomical Society 338(3), 790–795 (2003)
8. Huggins, P., Pachter, L., Sturmfels, B.: Toward the Human Genotope. Bulletin of Mathematical Biology 69(8), 2723–2735 (2007)
9. Joachims, T.: Making Large-Scale Support Vector Machine Learning Practical. In: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge (1999)
10. de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry. Springer, Heidelberg (2000)
11. Ziegler, G.M.: Lectures on Polytopes. Springer, Heidelberg (1995)
12. Donoho, D.L., Tanner, J.: Neighborliness of Randomly-Projected Simplices in High Dimensions. Proc. of the Nat. Academy of Sciences 102(27), 9452–9457 (2005)
13. Hall, P., Marron, J., Neeman, A.: Geometric representation of high dimension low sample size data. J. of the Royal Statistical Society B 67(3), 427–444 (2005)
14. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as Space-Time Shapes. In: Proc. ICCV (2005)
15. Torralba, A., Fergus, R., Freeman, W.T.: 80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 30(11), 1958–1970 (2008)
16. Heidemann, G.: The principal components of natural images revisited. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(5), 822–826 (2006)
Fast Multiscale Operator Development for Hexagonal Images

Bryan Gardiner¹, Sonya Coleman¹, and Bryan Scotney²

¹ University of Ulster, Magee, BT48 7JL, Northern Ireland
² University of Ulster, Coleraine, BT52 1SA, Northern Ireland
[email protected], {sa.coleman,bw.scotney}@ulster.ac.uk
Abstract. For many years the concept of using hexagonal pixels for image capture has been investigated, and several advantages of such an approach have been highlighted. Recently there has been a renewed interest in using hexagonal pixel based images for various image processing tasks. Therefore, we present a design procedure for scalable hexagonal gradient operators, developed within the finite element framework, for use on hexagonal pixel based images. We highlight the efficiency of our approach, based on computing just one small neighbourhood operator and generating larger scale operators via linear additions of the small operator. We also demonstrate that scaled salient feature maps can be generated from one low level feature map without the need for application of larger operators.
1 Introduction
In machine vision, feature detection is often used to extract salient information from images. Image content often represents curved structures that may not be well represented on a rectangular lattice, and the characteristics of which may not be well captured by operators based on principal horizontal and vertical directions. The properties of operators developed on a rectangular grid are often influenced by the underlying Cartesian structure, i.e., operators may be dominated by the preferred directions along the x- and y-axes, leading to the inheritance of anisotropic properties. This problem may be further contributed to by the common practice of building operators using cross-products (in the co-ordinate directions) of existing one-dimensional operators. Such anisotropy is reflected in the spectral properties of the operators, and improvements can be achieved by developing operators that consider “circularity” [3, 4, 13]. One approach is the use of compass operators to rotate feature detection masks to successfully detect diagonal edges, whilst another is to increase image resolution if possible [13] to improve the representation of curved structures (but at increased computational cost). To overcome these problems, the hexagonal sampling lattice can be introduced, from which both spatial and spectral advantages may be derived: namely, equidistance of all pixel neighbours and improved spatial isotropy of spectral response. Pixel spatial equidistance facilitates the implementation of
circular symmetric kernels, which is associated with an increase in accuracy when detecting edges, both straight and curved [1]; the improved accuracy of circular and near circular image processing operators has been demonstrated in [4, 15]. Additionally, better spatial sampling efficiency is achieved by the hexagonal structure compared with a rectangular grid of similar pixel separation, leading to improved computational performance. In a hexagonal grid with unit separation of pixel centres, approximately 13% fewer pixels are required to represent the same image resolution as required on a rectangular grid with unit horizontal and vertical separation of pixel centres [18]. Due to the number of advantages offered by the hexagonal grid, the use of hexagonal lattices for image structure and representation has recently received renewed attention. The use of hexagonal pixel-based images dates back to the 1970s as the hexagonal structure is considered to be preferable to the standard rectangular structure typically used for images in terms of the improved accuracy and efficiency that can be achieved for a number of image processing tasks [5, 13]. More recently this area has become prominent with new developments in areas such as blue noise halftoning [12], hexagonal filter banks [10], image reconstruction [11, 19] and robot exploration [14], with applications including biologically inspired fovea modelling with neural networks that correspond to the hexagonal biological structure of photoreceptors [9], and the development of silicon retinas for robot vision [16, 17]. Although genuine hexagonal-based sensor systems and image capture devices do not yet exist, image representation in a hexagonal structure can be achieved readily through rectangular to hexagonal image conversion [8, 18, 19]. In [6, 7] we presented an approach to feature extraction operators using Gaussian test functions. In this paper, we present a novel and efficient approach to the design of hexagonal image processing operators using linear basis and test functions within the finite element framework. Section 2 describes the approach of [20] used to obtain hexagonal images, Sections 3 and 4 present our efficient multi-scale operator design, and algorithmic performance is presented in Section 6. Section 7 provides a summary and details of further work.
2 Hexagon Images
To date, a hexagonal image can only be obtained by re-sampling a standard square pixel-based image. We have chosen to use the approach of [20] whereby hexagonal pixels are created through clusters of square sub-pixels. We have modified this technique slightly by representing each pixel by a pixel block, as in [13], in order to create a sub-pixel effect to enable the sub-pixel clustering; this modification limits the loss of image resolution. Each pixel of the original image is represented by a pixel block, Figure 1(a), of equal intensity in the new image [13]. This creates a resized image of the same resolution as the original image with the ability to display each pixel as a group of sub-pixels. The motivation for image resizing is to enable the display of sub-pixels, which is not otherwise possible. With this structure now in place, a cluster of sub-pixels in the new image,
Fig. 1. Resizing of image to enable display of image at sub-pixel level
closely representing the shape of a hexagon, can be created that represents a single hexagonal pixel in the resized image, Figure 1(b).
3 Hexagonal Operator Design
In order to develop scalable and efficient gradient operators for use on hexagonal image structures, we use the flexibility offered by the finite element framework. To achieve this, we initially represent the hexagonal image by an array of samples of a continuous function u(x, y) of image intensity on a domain Ω with nodes placed in the centre of each pixel. These nodes are the reference points for finite element computation throughout the domain Ω, where the vertices of each triangular finite element are the pixel centres. Figure 2 represents an image compiled of hexagonal pixels with nodes placed in the centre of each pixel, overlaid by the triangular finite element mesh. Given an image represented by an array of n × n samples of some continuous function u(x, y) of image intensity, the goal here is to formulate operators involving a weak form of the directional derivative [2]. The weak form requires the image function to be once differentiable in the sense of belonging to the Hilbert space H¹(Ω). That is, u = u(x, y) is such that the integral \int_\Omega (|\nabla u|^2 + u^2)\,d\Omega is finite, where Ω is the domain of the image, and ∇u is the vector (∂u/∂x, ∂u/∂y)^T.
Fig. 2. Hexagonal array of pixels and overlying mesh
The derivative in the direction of a unit vector b may be expressed as ∂u/∂b, where ∂u/∂b ≡ b · ∇u. Thus, by requiring that u ∈ H¹, the problem is to find the weak form of the directional derivative of the image on the image domain Ω, namely

E(u) = \int_\Omega b \cdot \nabla u \, v \, d\Omega   (1)

where v is a member of a function space H¹, and b = (cos θ, sin θ) is the unit direction vector. This enables us to design our hexagonal operator using either a Cartesian coordinate system or the three axes of symmetry of the hexagon. Our current operator design uses the Cartesian coordinate system as the three axes of symmetry introduce redundancy. However, the symmetric hexagonal coordinate system has advantages when applied to tasks such as rotation that involve a large degree of symmetry [13], and can be readily obtained from our Cartesian operators if required. To approximate equation (1), a function in the image space H¹ may be approximately represented by a function from a finite-dimensional subspace S^h ⊂ H¹. The subspace S^h has a finite basis {φ_1, ..., φ_N} and is of dimension N. So any function V(x, y) ∈ S^h may be uniquely represented by using a set of parameters {V_1, ..., V_N} in the form

V(x, y) = \sum_{j=1}^{N} V_j \phi_j(x, y)   (2)

This form may be used to approximately represent the image u by a function U ∈ S^h, where

U(x, y) = \sum_{j=1}^{N} U_j \phi_j(x, y)   (3)

and in which the parameters {U_1, ..., U_N} are mapped from the sampled image intensity values. Using this representation of the image we may generate an approximate representation of the weak form of the directional derivative of the image by the functional

E_i(U) = \int_\Omega b_i \cdot \nabla U \, \phi_i \, d\Omega   (4)

for each function φ_i in the basis of S^h. As the test functions φ_i used in the weak form are from the same space as those used in the approximate representation of the image, we may identify this formulation with the Galerkin method in finite element analysis. We then construct a set of basis functions φ_i(x, y), i = 1, ..., N, so that the N-dimensional subspace S^h of H¹ comprises functions which are piecewise polynomials. Such a basis for S^h may be formed by associating with each node i a basis function φ_i(x, y) which satisfies the properties

\phi_i(x_j, y_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}   (5)
where (xj , yj ) are the co-ordinates of the nodal point j in the finite element triangular mesh. Hence, φi (x, y) has a limited region of support Ωi consisting of those elements which have node i as a vertex. The approximate image representation is thus a simple polynomial on each element and has the sampled intensity value Uj at node j, j = 1, ..., N.
4 Operator Implementation
To illustrate the implementation of a first order hexagonal 3 × 3 operator we build a hexagonal operator as shown in Figure 3. The neighbourhood Ω_i covers a set of six elements e_m, where the piecewise linear basis function φ_i is associated with the central node i, which shares common support with the surrounding six basis functions φ_j. Hence E_i(U) needs to be computed over the six elements in the neighbourhood Ω_i.
Fig. 3. Hexagonal operator structure
Substituting the image representation in (3) into the functional E_i(U) in (4) yields

E_i(U) = b_{i1} \sum_{j=1}^{N} K_{ij} U_j + b_{i2} \sum_{j=1}^{N} L_{ij} U_j   (6)

where

K_{ij} = \sum_{m \,|\, e_m \subset \Omega_i} k_{ij}^{m} \quad \text{and} \quad L_{ij} = \sum_{m \,|\, e_m \subset \Omega_i} l_{ij}^{m}   (7)

and k_{ij}^m and l_{ij}^m are the element integrals

k_{ij}^{m} = \int \frac{\partial \phi_j}{\partial x} \, \phi_i \, dx\,dy \quad \text{and} \quad l_{ij}^{m} = \int \frac{\partial \phi_j}{\partial y} \, \phi_i \, dx\,dy .   (8)

For each of the six triangular elements in the neighbourhood a pair of triangular element operators must be generated, whose entries then map directly to the
corresponding locations within the 3 × 3 neighbourhood. For example, consider element e_1 shown in Figure 3. On this element the basis functions φ_j for j = i, j = i + n − ½ and j = i + n + ½ share common support with φ_i. Hence in this case the triangular element operators are

k_i^1 = \begin{bmatrix} k^1_{i,i} & \\ k^1_{i,i+n-\frac{1}{2}} & k^1_{i,i+n+\frac{1}{2}} \end{bmatrix} \quad \text{and} \quad l_i^1 = \begin{bmatrix} l^1_{i,i} & \\ l^1_{i,i+n-\frac{1}{2}} & l^1_{i,i+n+\frac{1}{2}} \end{bmatrix}   (9)

where the entries in (9) are computed using the element integrals in (8) with the basis functions

\phi_i = \frac{y}{k} + 1; \qquad \phi_{i+n-\frac{1}{2}} = -\frac{x}{h} - \frac{y}{2k}; \qquad \phi_{i+n+\frac{1}{2}} = \frac{x}{h} - \frac{y}{2k}   (10)

Unlike with standard finite element matrices, the triangular element operators described here have two possible structures: those which are demonstrated in equation (9), suitable for elements e_1, e_3 and e_5, and those which correspond to the structure of elements e_2, e_4 and e_6, for example

k_i^2 = \begin{bmatrix} k^2_{i,i-1} & k^2_{i,i} \\ & k^2_{i,i+n-\frac{1}{2}} \end{bmatrix} \quad \text{and} \quad l_i^2 = \begin{bmatrix} l^2_{i,i-1} & l^2_{i,i} \\ & l^2_{i,i+n-\frac{1}{2}} \end{bmatrix}   (11)

Similar triangular element operators are computed for each of the other elements in the neighbourhood and, on completion, two masks (12) may be created by element assembly to compute the horizontal and vertical gradients respectively. The values for a, b and c are 0.1666, 0.3333 and 0.25 respectively.

K = \begin{bmatrix} -a & & a \\ -b & 0 & b \\ -a & & a \end{bmatrix} \quad \text{and} \quad L = \begin{bmatrix} c & & c \\ 0 & 0 & 0 \\ -c & & -c \end{bmatrix}   (12)
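Written out in code, the assembled masks reduce gradient computation at a hexagonal pixel to a weighted sum over its six neighbours; the ordering of the neighbours, and hence which entries receive the positive and negative weights, depends on the chosen storage layout and is an assumption of this Python sketch.

import numpy as np

A_, B_, C_ = 0.1666, 0.3333, 0.25
# assumed neighbour order: (upper-left, upper-right, left, centre, right, lower-left, lower-right)
K_WEIGHTS = np.array([-A_, A_, -B_, 0.0, B_, -A_, A_])    # x-gradient mask K from (12)
L_WEIGHTS = np.array([C_, C_, 0.0, 0.0, 0.0, -C_, -C_])   # y-gradient mask L from (12)

def hex_gradient(neigh):
    # neigh: the 7 intensities of a hexagonal pixel and its neighbours in the order above
    v = np.asarray(neigh, dtype=float)
    return float(K_WEIGHTS @ v), float(L_WEIGHTS @ v)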
5 Fast Multiscale Operator Construction
On constructing the hexagonal equivalent of a 3 × 3 gradient operator, every other size of hexagonal operator (i.e., 5 × 5, 7 × 7, etc.) can be efficiently computed via the appropriate linear combinations of the 3 × 3 operator; this will be illustrated using the 5 × 5 operator as an example. Using the mesh in Figure 4(a) as a reference, in order to generate a 5 × 5 hexagonal operator, we place a 3 × 3 mask at the centre node of the mesh at level 0, node (0, 0), and a ½ × (3 × 3) mask at each of the other six internal nodes at level 1. These are then combined in the typical finite element assembly manner as illustrated in Figure 4(b) for the 5 × 5 x-directional mask.
Fig. 4. (a) Finite element mesh corresponding to 2 neighbourhood levels, i = 0, 1 (b) Combining of the 3 × 3 masks to obtain the 5 × 5 mask, showing only one of the six masks for the second level
We can generalise this procedure for any operator size greater than 3 (the initial operator). Let the hexagonal operator size (5, 7, etc.) be denoted by S_o; then the radius of the approximately circular hexagonal operator R_o can be determined as

R_o = \frac{S_o - 1}{2}   (13)

Consider the nodes (i, j) illustrated in Figure 4(a); here i indicates the level of the neighbourhood nodes, i.e., i = 0 at the centre node, i = 1 for each of the surrounding nodes at the next level etc., and j is each node within a given level i. For each operator size S (> 3), the x-directional operator, S_o^x, can be computed in an additive manner using the following formula:

S_o^x = K_{(0,0)} + \sum_{i=1}^{R_o - 1} \sum_{j=0}^{6i-1} \frac{R_o - i}{R_o} K_{(i,j)}   (14)

where K_{(i,j)} is the 3 × 3 hexagonal mask placed at each node (i, j) and the number of levels, i, to be included is R_o − 1. Similarly, the y-directional operator, S_o^y, can be computed as

S_o^y = L_{(0,0)} + \sum_{i=1}^{R_o - 1} \sum_{j=0}^{6i-1} \frac{R_o - i}{R_o} L_{(i,j)}   (15)

where L_{(i,j)} is the 3 × 3 hexagonal mask placed at each node (i, j).
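A direct transcription of (13)-(15) is sketched below, with the 3 × 3 mask and the assembled operators stored as Python dictionaries mapping axial hexagonal coordinates to weights; the axial addressing scheme, the ring enumeration, and the mapping of mask entries to neighbour offsets are assumptions made for this sketch.

# axial-coordinate offsets of the six neighbours of a hexagonal pixel
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def ring_nodes(i):
    # the 6*i nodes of neighbourhood level i around the origin (level 0 is the centre)
    if i == 0:
        return [(0, 0)]
    nodes, (q, r) = [], (-i, i)                   # start at i times direction (-1, 1)
    for dq, dr in DIRECTIONS:                     # walk the six sides of the ring
        for _ in range(i):
            nodes.append((q, r))
            q, r = q + dq, r + dr
    return nodes

def multiscale_operator(mask3, size):
    # mask3: dict {axial offset: weight} of the 3x3 operator; size: 5, 7, ...
    Ro = (size - 1) // 2                          # eq. (13)
    op = dict(mask3)                              # level-0 contribution K_(0,0)
    for i in range(1, Ro):                        # levels 1 .. Ro - 1, eq. (14)/(15)
        weight = (Ro - i) / Ro
        for q0, r0 in ring_nodes(i):
            for (q, r), w in mask3.items():
                op[(q + q0, r + r0)] = op.get((q + q0, r + r0), 0.0) + weight * w
    return op

# example: the x-gradient mask K from (12); the assignment of entries to offsets is assumed
a, b = 0.1666, 0.3333
K3 = {(0, 0): 0.0, (1, 0): b, (-1, 0): -b, (1, -1): a, (0, -1): -a, (0, 1): a, (-1, 1): -a}
K5 = multiscale_operator(K3, 5)                   # 5 x 5 hexagonal x-directional operator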
6 Algorithmic Performance
We illustrate the capability of the proposed operators by providing, in Figure 5, salient feature maps generated by applying the operators to the Lena image re-sampled onto a hexagonal pixel-based grid.
[Fig. 5 panels: (a) original image; (b) 3 × 3 edge map; (c) 5 × 5 edge map; (d) 7 × 7 edge map.]
Fig. 5. Feature maps obtained using re-sampled real images

Table 1. Run-times for 3 × 3 operators

Operator             Run-time (ms)
Proposed Hexagonal   6.14
Prewitt              7.34
Sobel                7.11
One of the advantages of hexagonal pixel-based images is that they contain 13.5% fewer pixels than a standard square pixel-based image of equivalent resolution. In addition, the hexagonal operators presented and designed on a Cartesian axis contain fewer operator values than the corresponding square operators, thus generating a significant overall reduction in computation. For example, for a given 256 × 256 image, removing boundary pixels, 63504 pixels will be processed. Using a 5×5 operator there will be 63504×25 multiplications totalling 1,587,600.
If the same image is re-sampled onto a hexagonal based image there will be 55566 pixels processed by an equivalent hexagonal gradient operator containing only 19 values. Therefore there will be only 1,055,754 multiplications, corresponding to 66.5% of the computation required to generate a similar feature map using an equivalent traditional square pixel-based image. We illustrate this further by providing run-times for the application of 3 × 3 operators to an image: the proposed 3 × 3 hexagonal operator to a hexagonal pixel-based image; and the 3 × 3 Prewitt and Sobel operators to a standard square pixel-based image. The run-times are presented in Table 1 and in each case the time is an average of 3 runs.
7 Summary
Recently, there has been a renewed interest in the use of hexagonal pixel-based images for image processing tasks as demonstrated in, for example, [8-13]; however, much less research has been undertaken on the development and application of feature extraction operators for direct use on such image structures. Some standard algorithms have been extended from rectangular to hexagonal arrays in simple cases [5, 13], but generalisation and scalability of operators require a more systematic approach. We have presented a design procedure for scalable hexagonal operators developed within the finite element framework, using the Galerkin formulation. We have demonstrated the efficient implementation of our approach through the way in which larger operators can be generated using additions of the 3 × 3 operator. We have illustrated the computational efficiency in Section 6, using an explicit example and run-times for equivalent operators. Building on this promising performance, further research will involve evaluation of the hexagonal operators with respect to edge orientation and displacement, and extending the operators to interest point detectors and descriptors.
References
1. Allen, J.D.: Filter Banks for Images on Hexagonal Grid. Signal Solutions (2003)
2. Becker, E.B., Carey, G.F., Oden, J.T.: Finite Elements: An Introduction. Prentice Hall, London (1981)
3. Coleman, S.A., Scotney, B.W., Herron, M.G.: A Systematic Design Procedure for Scalable Near-Circular Laplacian of Gaussian Operators. In: Proceedings of the International Conference on Pattern Recognition, Cambridge, pp. 700–703 (2004)
4. Davies, E.R.: Circularity - A New Design Principle Underlying the Design of Accurate Edge Orientation Filters. Image and Vision Computing 2(3), 134–142 (1984)
5. Davies, E.R.: Optimising Computation of Hexagonal Differential Gradient Edge Detector. Elect. Letters 27(17) (1991)
6. Gardiner, B., Coleman, S., Scotney, B.: A Design Procedure for Gradient Operators on Hexagonal Images. In: Irish Machine Vision & Image Processing Conference (IMVIP 2008), pp. 47–54 (2008)
7. Gardiner, B., Coleman, S., Scotney, B.: Multi-Scale Feature Extraction in a Sub-Pixel Virtual Hexagonal Environment. In: Irish Machine Vision & Image Processing Conference (IMVIP 2008), pp. 47–54 (2008)
8. He, X., Jia, W.: Hexagonal Structure for Intelligent Vision. In: Information and Communication Technologies, ICICT, pp. 52–64 (2005)
9. Huang, C.-H., Lin, C.-T.: Bio-Inspired Computer Fovea Model Based on Hexagonal-Type Cellular Neural Network. IEEE Trans. Circuits and Systems 54(1), 35–47 (2007)
10. Jiang, Q.: Orthogonal and Biorthogonal FIR Hexagonal Filter Banks with Sixfold Symmetry. IEEE Transactions on Signal Processing 56(12), 5861–5873 (2008)
11. Knaup, M., Steckmann, S., Bockenbach, O., Kachelrieb, M.: CT Image Reconstruction using Hexagonal Grids. In: Proceedings of IEEE Nuclear Science Symposium Conference Record, pp. 3074–3076 (2007)
12. Lau, D.L., Ulichney, R.: Blue-Noise Halftoning for Hexagonal Grids. IEEE Transactions on Image Processing 15(5), 1270–1284 (2006)
13. Middleton, L., Sivaswamy, J.: Hexagonal Image Processing: A Practical Approach. Springer, Heidelberg (2005)
14. Quijano, H.J., Garrido, L.: Improving Cooperative Robot Exploration Using a Hexagonal World Representation. In: Proceedings of the 4th Congress of Electronics, Robotics and Automotive Mechanics, pp. 450–455 (2007)
15. Scotney, B.W., Coleman, S.A.: Improving Angular Error via Systematically Designed Near-Circular Gaussian-based Feature Extraction Operators. Pattern Recognition 40(5), 1451–1465 (2007)
16. Shimonomura, K., et al.: Neuromorphic Binocular Vision System for Real-Time Disparity Estimation. In: IEEE Int. Conf. on Robotics and Automation, pp. 4867–4872 (2007)
17. Takami, R., et al.: An Image Pre-processing System Employing Neuromorphic 100 x 100 Pixel Silicon Retina. In: IEEE Int. Symp. Circuits & Systems, vol. 3, pp. 2771–2774 (2005)
18. Vitulli, R.: Aliasing Effects Mitigation by Optimized Sampling Grids and Impact on Image Acquisition Chains. In: Geoscience and Remote Sensing Symposium, pp. 979–981 (2002)
19. Wu, Q., He, X., Hintz, T.: Virtual Spiral Architecture. In: Int. Conf. on Parallel and Distributed Processing Techniques and Applications, pp. 339–405 (2004)
20. Wuthrich, C.A., Stucki, P.: An Algorithmic Comparison Between Square- and Hexagonal-based Grids. In: CVGIP: Graphical Models and Image Processing, vol. 53, pp. 324–339 (1999)
Optimal Parameter Estimation with Homogeneous Entities and Arbitrary Constraints

Jochen Meidow¹, Wolfgang Förstner², and Christian Beder¹

¹ Research Institute for Optronics and Pattern Recognition, Ettlingen, Germany
[email protected]
² Institute for Geodesy and Geoinformation, University of Bonn, Germany
[email protected]
Abstract. Well known estimation techniques in computational geometry usually deal only with single geometric entities as unknown parameters and do not account for constrained observations within the estimation. The estimation model proposed in this paper is much more general, as it can handle multiple homogeneous vectors as well as multiple constraints. Furthermore, it allows the consistent handling of arbitrary covariance matrices for the observed and the estimated entities. The major novelty is the proper handling of singular observation covariance matrices made possible by additional constraints within the estimation. These properties are of special interest for instance in the calculus of algebraic projective geometry, where singular covariance matrices arise naturally from the non-minimal parameterizations of the entities. The validity of the proposed adjustment model will be demonstrated by the estimation of a fundamental matrix from synthetic data and compared to heteroscedastic regression [1], which is considered as the state-of-the-art estimator for this task. As the latter is unable to simultaneously estimate multiple entities, we will also demonstrate the usefulness and the feasibility of our approach by the constrained estimation of three vanishing points from observed uncertain image line segments.
1 Introduction
The final step in uncertain geometric reasoning usually is the optimal estimation of unknown parameters from given uncertain observations taking geometric or algebraic constraints into account, which either result from the structure of the problem or have been found by some hypothesis generation process, e.g. [2,3,4]. The well known estimation techniques in geometric computation (e.g. [1,5,6,7]) usually deal only with single homogeneous entities, for instance points or transformations. These estimation techniques, such as algebraic minimization, total least squares, renormalization, or heteroscedastic regression cannot easily be generalized to the estimation of multiple homogeneous entities with multiple constraints, which is necessary in many vision tasks for instance when dealing
with composed geometric entities such as straight line segments or the joint estimation of vanishing points. In order to address this problem we provide a generic estimation model for the simultaneous estimation of more than one uncertain geometric entity. Based on possibly correlated observed geometric entities, the results are derived in consideration of constraints for the parameters and the observations. In particular the proposed procedure is able to handle uncertain homogeneous vectors together with possibly singular covariance matrices extending the hitherto known techniques w.r.t. the continuous use of homogeneous representations. To demonstrate the applicability of the proposed estimation scheme we show how the presented general estimation framework can be specialized for two exemplary vision tasks. First, we demonstrate the applicability of our approach for the very well-known and well-understood task of estimating the fundamental matrix from uncertain homogeneous point correspondences. We show, that competitive results are achieved by comparing our approach to heteroscedastic regression [1], which is considered as state-of-the-art estimator for this task. Second, we show how three orthogonal vanishing points can be estimated simultaneously from uncertain straight line segments in homogeneous representation for a real scene using the proposed framework. In contrast to other estimation techniques (e.g. [1,5,6,7]), our approach is directly applicable for this task and no re-parameterization is required due to the rigorous handling of the singular uncertainty structure of the homogeneous entities. Therefore, the presented framework can directly benefit from the compact representation of many vision problems using algebraic projective geometry, so that the task of formulating estimation schemes is greatly simplified.
2 General Adjustment Model with Constraints
We will start by presenting the general problem-specific modeling tasks before we give examples for two specific vision tasks in Sect. 3. The approach is based on the adjustment model proposed in [8]. The model consists of a functional part for the unknown parameters and the observations, a stochastic part for the observations, an objective function, and an iterative estimation procedure for non-linear problems.

2.1 Mathematical Model
Functional model. The functional model describes the mutual relations between the considered entities comprising the observations and the parameters to be estimated. We distinguish three types of constraints between the true observations $\tilde{l}$ and the $U$ unknown true parameters $\tilde{p}$:

1. the $G$ conditions $g(\tilde{l}, \tilde{p}) = 0$ between the observations and parameters reflecting their actual intended mutual relation (i.e. the model assumptions),
2. the $H$ restrictions $h(\tilde{p}) = 0$ on the parameters alone reflecting intrinsic constraints (e.g. $\|\tilde{p}\| = 1$ for homogeneous entities) and enabling singular parameter covariance matrices, and finally
3. the $C$ constraints $c(\tilde{l}) = 0$ on the observations alone reflecting intrinsic constraints (e.g. $\|\tilde{l}\| = 1$ for homogeneous entities) and enabling singular observation covariance matrices.

The error-free observations $\tilde{l}$ are related to the real observations $l$ by additive unknown corrections $\tilde{l} = l + \tilde{v}$. Since the true values remain unknown, they will be replaced by their estimates $\hat{p}$, $\hat{l}$ and $\hat{v}$ in the following.

Stochastic model. The stochastic model describes the uncertainty of the observations. An initial covariance matrix $\Sigma_{ll}^{(0)}$ of the observations is assumed to be known, which subsumes the stochastic properties of the observations; thus $l$ is assumed to be normally distributed, $l \sim N(\tilde{l}, \Sigma_{ll})$. With the possibly unknown variance factor $\sigma_0^2$, the matrix $\Sigma_{ll}^{(0)}$ is related to the true covariance matrix $\Sigma_{ll}$ by $\Sigma_{ll} = \sigma_0^2 \Sigma_{ll}^{(0)}$ (cf. [9]). Note that we explicitly allow $\Sigma_{ll}$ to be singular as long as its null space is properly handled by the constraint $c(l) = 0$. This is one of the major contributions of this work.

Having defined the problem-specific model (see Sect. 3 for examples of how this framework can be specialized to specific vision problems), we will now derive a corresponding estimation scheme for estimating the parameters and the adjusted observations in the next section. We will also show how the unknown variance factor $\sigma_0^2$ can be estimated from the estimated corrections $\hat{v}$, i.e. the negative residuals.
2.2 Objective Function and Estimation

Finding optimal estimates $\hat{p}$ and $\hat{l}$ for $p$ and $l$, respectively, can be done by minimizing the weighted squared residuals subject to the given constraints, i.e.
$$L(\hat{v}, \hat{p}, \lambda, \mu, \nu) = \tfrac{1}{2}\,\hat{v}^T \Sigma_{ll}^{+} \hat{v} + \lambda^T g(l + \hat{v}, \hat{p}) + \mu^T h(\hat{p}) + \nu^T c(l + \hat{v}) \qquad (1)$$
with the Lagrangian vectors $\lambda$, $\mu$ and $\nu$. In contrast to [2] we explicitly include the constraint $c(l + \hat{v})$ in order to properly deal with singular covariance matrices consistent with the pseudo-inverse $\Sigma_{ll}^{+}$ (see [8] for details).

For solving this non-linear problem in an iterative manner we need approximate values $\hat{p}^{(0)}$ and $\hat{l}^{(0)} = l + \hat{v}$ for the estimates of the unknown parameters $\hat{p} = \hat{p}^{(0)} + \widehat{\Delta p}$ and the estimated observations $\hat{l} = \hat{l}^{(0)} + \widehat{\Delta l}$. The corrections $\widehat{\Delta p}$ and $\widehat{\Delta l}$ are obtained iteratively by applying the following steps (see [8] for a detailed derivation of the estimation formulas):

1. The Jacobians are computed at the current approximate values
$$A = \frac{\partial g(l, p)}{\partial p}, \quad B^T = \frac{\partial g(l, p)}{\partial l}, \quad C^T = \frac{\partial c(l)}{\partial l}, \quad H^T = \frac{\partial h(p)}{\partial p} \qquad (2)$$
2. In each iteration $\tau$ compute the approximate values for the residuals of the constraints
$$g_\tau = g(l^{(\tau)}, p^{(\tau)}), \quad h_\tau = h(p^{(\tau)}), \quad c_\tau = c(l^{(\tau)}) \qquad (3)$$
3. Compute the auxiliary variable
$$a = B^T C (C^T C)^{-1} \bigl(C^T (l - l^{(\tau)}) + c_\tau\bigr) - B^T (l - l^{(\tau)}) - g_\tau \qquad (4)$$
4. Compute the covariance matrix $\Sigma_{gg} = B^T \Sigma_{ll} B$ of the contradictions $g_\tau$.
5. The unknown corrections to the parameters are now computed by solving the normal equation system
$$\begin{bmatrix} A^T \Sigma_{gg}^{-1} A & H \\ H^T & 0 \end{bmatrix} \begin{bmatrix} \widehat{\Delta p} \\ \mu \end{bmatrix} = \begin{bmatrix} A^T \Sigma_{gg}^{-1} a \\ -h_\tau \end{bmatrix} \qquad (5)$$
6. The Lagrangians and the residuals are finally computed as
$$\lambda = \Sigma_{gg}^{-1} (A \widehat{\Delta p} - a) \qquad (6)$$
$$\hat{v}^{(\tau)} = -\Sigma_{ll} B \lambda - C (C^T C)^{-1} \bigl(C^T (l - l^{(\tau)}) + c_\tau\bigr) \qquad (7)$$
The approximate values have to be iteratively improved for non-linear problems. In doing so, the covariance matrix $\Sigma_{ll}^{(\tau)}$ of the observations has to be adjusted within each iteration step to be consistent with the constraint $c(l^{(\tau)} + \hat{v})$, e.g. by spherical normalization, because of the change of the observations $l^{(\tau)}$ within the iteration process. Note that the estimation procedure is not problem-specific except for the computation of the Jacobians in the first step and can be applied in a black-box manner.
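As a rough illustration of how the above steps could be organized in code, the following Python sketch performs one iteration of the procedure for generic, user-supplied constraint functions and Jacobians. The function names, argument conventions, and shapes are illustrative assumptions of ours and not part of the original formulation.

```python
import numpy as np

def estimation_step(l_obs, l_tau, p_tau, Sigma_ll, g, h, c, A, B, C, H):
    """One iteration of the constrained adjustment (steps 1-6, Eqs. (2)-(7)).

    g, h, c are the constraint functions; A, B, C, H return the corresponding
    Jacobians evaluated at the current approximate values (illustrative interface).
    """
    # Step 1: Jacobians at the current approximate values
    A_, B_, C_, H_ = A(l_tau, p_tau), B(l_tau, p_tau), C(l_tau), H(p_tau)
    # Step 2: residuals of the constraints
    g_tau, h_tau, c_tau = g(l_tau, p_tau), h(p_tau), c(l_tau)
    # Step 3: auxiliary variable (Eq. 4)
    P = C_ @ np.linalg.inv(C_.T @ C_)                     # helper C (C^T C)^{-1}
    dl = l_obs - l_tau
    a = B_.T @ P @ (C_.T @ dl + c_tau) - B_.T @ dl - g_tau
    # Step 4: covariance of the contradictions
    Sigma_gg = B_.T @ Sigma_ll @ B_
    W = np.linalg.inv(Sigma_gg)
    # Step 5: bordered normal equation system (Eq. 5)
    U, nH = A_.shape[1], H_.shape[1]
    N = np.block([[A_.T @ W @ A_, H_], [H_.T, np.zeros((nH, nH))]])
    rhs = np.concatenate([A_.T @ W @ a, -h_tau])
    sol = np.linalg.solve(N, rhs)
    dp = sol[:U]
    # Step 6: Lagrangians and residuals (Eqs. 6, 7)
    lam = W @ (A_ @ dp - a)
    v = -Sigma_ll @ B_ @ lam - P @ (C_.T @ dl + c_tau)
    return dp, v
```

In use, one would update the approximate parameters and observations with the returned corrections, re-normalize them to satisfy the constraints, and iterate until convergence.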
2.3 Precision of the Estimates
One of the advantages of uncertainty modeling is the possibility of propagating errors. We will now show how the precision of the estimated parameters can be derived. With the estimated corrections $\hat{v}$ from (7) we obtain the fitted observations $\hat{l} = l + \hat{v}$. The estimation of the variance factor $\sigma_0^2$ is given by the maximum likelihood estimation $\hat{\sigma}_0^2 = \hat{v}^T \Sigma_{ll}^{+} \hat{v} / R$ with the redundancy $R = G + H - U$, cf. [9]. The pseudo-inverse can eventually be computed efficiently by exploiting the block diagonal matrix structures and the relation $C^T C = I$ (cf. the examples in Sect. 3). We finally obtain the estimated covariance matrix $\hat{\Sigma}_{\hat{p}\hat{p}} = \hat{\sigma}_0^2 \Sigma_{\hat{p}\hat{p}}$ of the estimated parameters, where $\Sigma_{\hat{p}\hat{p}}$ results from the inverted reduced normal equation matrix by variance propagation:
$$\begin{bmatrix} \Sigma_{\hat{p}\hat{p}} & \cdot \\ \cdot & \cdot \end{bmatrix} = \begin{bmatrix} A^T \Sigma_{gg}^{-1} A & H \\ H^T & O \end{bmatrix}^{-1} \qquad (8)$$
Observe that the model has the same structure as the classical Gauss-Markov model with constraints [9], which allows an easy modification towards a robust ML-type estimation to cope with outliers [10] by iteratively reweighting the individual conditions $g_i$.
3 Examples
Having derived a very generic modeling and estimation framework in the previous section, we will now show how the presented framework can be specialized for two exemplary vision problems.

3.1 Estimation of the Fundamental Matrix
As a first exemplary problem we consider the very well-known and well-understood problem of fundamental matrix estimation from uncertain image point correspondences. Since the 3 × 3 fundamental matrix $F$ is homogeneous and singular, two constraints have to be introduced for the 9 elements $f = \mathrm{vec}(F)$. With at least 7 corresponding point pairs captured by straight-line preserving cameras the fundamental matrix can be estimated from the coplanarity constraints
$$x_i'^T F x_i = (x_i^T \otimes x_i'^T)\, f = 0 \qquad (9)$$
which are bilinear in the corresponding homogeneous image coordinates $x_i$ and $x_i'$, and linear in the elements of the fundamental matrix. Suitable constraints for the observations and for the parameters are
$$x_i^T x_i - 1 = 0, \quad f^T f - 1 = 0, \quad \text{and} \quad \det(F) = 0, \qquad (10)$$
fixing the scale factors and enforcing the rank-two constraint. With the covariance matrix $\Sigma_{xx}$ of the Euclidean coordinates $x = [x, y]^T$ of an image point, the initial representation is $\mathbf{x} = [x^T, 1]^T$ and $\Sigma_{\mathbf{x}\mathbf{x}} = \mathrm{Diag}(\Sigma_{xx}, 0)$, assuming the factor of proportionality to be non-stochastic. Spherical normalization leads to the observations $\mathbf{x} := \mathbf{x}/|\mathbf{x}|$ with $\Sigma_{\mathbf{x}\mathbf{x}} := J \Sigma_{\mathbf{x}\mathbf{x}} J^T$ using the Jacobian
$$J = \frac{1}{|\mathbf{x}|}\left(I_3 - \frac{\mathbf{x}\mathbf{x}^T}{\mathbf{x}^T\mathbf{x}}\right) \qquad (11)$$
for each point. For $n$ image point correspondences the vector of observations is $l = [x_1^T, x_1'^T, \ldots, x_n^T, x_n'^T]^T$ and the corresponding covariance matrix is $\Sigma_{ll} = \mathrm{Diag}(\Sigma_{x_1 x_1}, \Sigma_{x_1' x_1'}, \ldots, \Sigma_{x_n x_n}, \Sigma_{x_n' x_n'})$. Observe that we introduce all constraints at once and, in contrast to classical approaches, may use the uncertain homogeneous entities directly, without the need for special treatment of the last coordinate. The Jacobian of the $n$ coplanarity constraints (9) is $A = [x_1 \otimes x_1', x_2 \otimes x_2', \ldots, x_n \otimes x_n']^T$, where $\otimes$ denotes the Kronecker product, and the singular value decomposition (SVD) of $A$ yields the approximate values for the parameters $f$. Furthermore, by considering the SVD of $F$ the rank-two property can be enforced [11]. The Jacobians of (9) and (10) are $B = \mathrm{Diag}([x_1'^T F,\; x_1^T F^T], [x_2'^T F,\; x_2^T F^T], \ldots)$, $A$, $C = 2\,\mathrm{Diag}(x_1, x_1', x_2, x_2', \ldots, x_n, x_n')$, and $H = [2f, f^*]^T$, where $f^* = \mathrm{vec}(F^*)$ denotes the elements of the adjoint $F^*$ of $F$. Since $C^T C = I$ holds, the pseudo-inverse of $\Sigma_{ll}$ can be computed efficiently by $\Sigma_{ll}^{+} = (\Sigma_{ll} + CC^T)^{-1} - CC^T$, exploiting the block diagonal matrix structures.
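To make the setup concrete, the following Python sketch illustrates the spherical normalization with covariance propagation of Eq. (11) and the construction of the coplanarity constraint matrix from point correspondences, with an initial estimate of f taken from the smallest right singular vector. The input format, the row-major flattening convention for f, and the function names are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

def spherical_normalize(x_euclid, Sigma_xx):
    """Homogenize a 2D point and propagate its covariance (cf. Eq. 11)."""
    x = np.append(x_euclid, 1.0)                     # initial representation [x, y, 1]
    Sigma = np.zeros((3, 3))
    Sigma[:2, :2] = Sigma_xx                         # Diag(Sigma_xx, 0)
    J = (np.eye(3) - np.outer(x, x) / (x @ x)) / np.linalg.norm(x)
    return x / np.linalg.norm(x), J @ Sigma @ J.T

def initial_f(points1, points2):
    """Stack the coplanarity constraints and take the SVD solution for f."""
    # each row encodes x2^T F x1 = kron(x2, x1) . F.flatten() = 0 (row-major f)
    A = np.array([np.kron(x2, x1) for x1, x2 in zip(points1, points2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                         # smallest singular value solution
    # enforce the rank-two property via the SVD of F
    U, s, Vt_F = np.linalg.svd(F)
    s[-1] = 0.0
    return (U @ np.diag(s) @ Vt_F).flatten()
```

The constrained adjustment would then refine this approximate f subject to the constraints in Eq. (10).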
[Figure 1, histogram panel: relative frequency vs. Mahalanobis distance, with the theoretical $\chi^2_7$ density overlaid. Comparison panel, reconstructed as a table:]

                          8-point   HEIV     our approach
singular values σ1/σ2     0.0323    0.0290   0.0288
1st epipole  x1           0.0775    0.0585   0.0591
1st epipole  y1           0.0808    0.0645   0.0631
2nd epipole  x2           0.0737    0.0633   0.0638
2nd epipole  y2           0.0798    0.0701   0.0703
Fig. 1. Right: Empirical distribution of the Mahalanobis distance and its theoretical $\chi^2_7$-distribution. Left: Comparison of the estimation results. Robust estimation of the standard deviations of the parameters by the median absolute deviation w.r.t. the true values.
To validate the adjustment model we performed the following stochastic simulation: In 500 simulation runs we generated 50 3D points, each with normally distributed coordinates. The two camera orientations have been randomly selected with camera centers on a sphere with radius 6 around the point cloud and viewing directions toward the center of the point cloud. This leads to observed image coordinates in the range of approximately [−1, 1]. After adding isotropic noise with σn = 0.02 to the Euclidean image coordinates, the parameters of the fundamental matrix have been estimated assuming i.i.d. observations. The Mahalanobis distance between the estimated and the true values of $f$ is computed for each of the simulation runs. Figure 1 shows the empirical distribution of the Mahalanobis distance, which is $\chi^2$ distributed with 7 degrees of freedom as expected. The hypothesized inequality of both distributions has been rejected by the Kolmogorov-Smirnov goodness-of-fit test at significance level α = 0.05. To assess and compare the results we choose the HEIV based estimation [1] as a representative of competing state-of-the-art estimators. For the HEIV estimation the implementation [12] has been used. Figure 1 shows the results for the estimation of the coordinates of the epipoles and the ratio of the estimated singular values for the eight-point algorithm, the HEIV, and our approach. For the estimated parameters, the stated values denote a robust estimation of the standard deviation by computing the median absolute deviation of the residuals w.r.t. the true values, multiplied by 1.4826. According to the achieved precisions, the results of the HEIV based estimation and our approach are the same up to numerical effects due to the number of iterations.
3.2 Constrained Vanishing Points Determination
As a second example we chose the task of joint vanishing point estimation using all available constraints. In contrast to the computation of the fundamental matrix demonstrated in the previous section, the HEIV algorithm cannot be generalized easily to solve this task due to the multiple constraints on the estimated entities. Since a vanishing point may lie at infinity, the use of the homogeneous representation is a reasonable choice.
Fig. 2. First image of the Oxford corridor sequence. Left: Extracted straight line segments, classified according to their vanishing directions and an outlier class. Right: Display detail with superimposed confidence regions, the approximate third vanishing point (◦), and its estimation (+).
Figure 2 shows on the left side the first image of the corridor sequence with extracted straight line segments, provided by the Visual Geometry Group, University of Oxford. By using a random sample consensus [13] the segments have been classified according to the 3 vanishing directions and an outlier class. For each of the three sets of straight line segments the corresponding $n_k$ straight lines $l_{ki}$ should intersect in the vanishing point $v_k$ (not to be confused with the residual vector $v$). Thus the constraints are
$$v_k^T l_{ki} = 0, \quad v_k^T v_k - 1 = 0, \quad l_{ki}^T l_{ki} - 1 = 0, \qquad k = 1 \ldots 3,\; i = 1 \ldots n_k \qquad (12)$$
because of the incidences and the spherical normalizations of the geometric entities. Furthermore, if the homogeneous calibration matrix $K$ for the straight-line preserving camera is known, we can introduce the two additional constraints
$$v_1^T \omega v_2 = 0 \quad \text{and} \quad v_2^T \omega v_3 = 0 \qquad (13)$$
which hold because of the orthogonality relations of the three vanishing directions $K^{-1} v_k$, and where $\omega = K^{-T} K^{-1}$ denotes the image of the absolute conic [11]. The straight line segments are given without any information about their uncertainty. Therefore, we initially determined the covariance matrices $\Sigma_{a_i a_i}$ and $\Sigma_{b_i b_i}$ of the coordinates of the segment end-points $a_i$ and $b_i$ with the help of the Förstner operator [14]. Then, assuming independent end-points, we determined each straight line $l_i$ by joining the corresponding end-points, accompanied by variance propagation [2]:
$$l_i = S(a_i)\, b_i, \qquad \Sigma_{l_i l_i} = [-S(b_i),\, S(a_i)]\ \mathrm{Diag}(\Sigma_{a_i a_i}, \Sigma_{b_i b_i})\ [-S(b_i),\, S(a_i)]^T \qquad (14)$$
where $S(\cdot)$ denotes the skew-symmetric matrix inducing the cross product. Thus, with all $n$ straight lines the vector of observations is $l = [l_1^T, l_2^T, \ldots, l_n^T]^T$ and its covariance matrix is $\Sigma_{ll} = \mathrm{Diag}(\Sigma_{l_1 l_1}, \Sigma_{l_2 l_2}, \ldots, \Sigma_{l_n l_n})$. Figure 2 shows on the right side a detail of the image with superimposed straight line segments, the 99% confidence ellipses of the end-points enlarged by factor 10, and the resulting error hyperbolas of the corresponding straight lines. Approximate values can easily be obtained by considering each vanishing point individually.
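As a small illustration of the line construction with variance propagation in Eq. (14), the following Python sketch joins two uncertain homogeneous end-points; the function names and the assumption of 3-vector homogeneous end-points are our own illustrative choices.

```python
import numpy as np

def skew(p):
    """Skew-symmetric matrix S(p) with S(p) @ q = np.cross(p, q)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def join_line(a, b, Sigma_aa, Sigma_bb):
    """Line through two uncertain homogeneous points with covariance (cf. Eq. 14)."""
    l = skew(a) @ b                                    # l_i = S(a_i) b_i, i.e. a x b
    J = np.hstack([-skew(b), skew(a)])                 # Jacobian [-S(b), S(a)]
    Sigma_pts = np.block([[Sigma_aa, np.zeros((3, 3))],
                          [np.zeros((3, 3)), Sigma_bb]])
    Sigma_ll = J @ Sigma_pts @ J.T
    # a spherical normalization of l and Sigma_ll, analogous to Eq. (11), would follow
    return l, Sigma_ll
```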
The Jacobian of the constraints w.r.t. the unknown parameters is simply $A_k = [l_{k1}, l_{k2}, \ldots, l_{kn_k}]^T$, and the singular value decomposition of $A_k$ yields the approximate values for the vanishing points $v_k$. For the joint parameter estimation with $n = n_1 + n_2 + n_3$ observed straight lines the block-diagonal Jacobians are
$$A = \mathrm{Diag}(A_1, A_2, A_3), \quad B = \mathrm{Diag}(I_{n_1} \otimes v_1,\; I_{n_2} \otimes v_2,\; I_{n_3} \otimes v_3), \quad C = 2\,\mathrm{Diag}(l_1, l_2, \ldots, l_n),$$
$$H = \begin{bmatrix} 2v_1 & 0 & 0 & \omega v_2 & 0 \\ 0 & 2v_2 & 0 & \omega v_1 & \omega v_3 \\ 0 & 0 & 2v_3 & 0 & \omega v_2 \end{bmatrix}^T,$$
where $\otimes$ denotes the Kronecker product. Again, since $C^T C = I$ holds, the pseudo-inverse of $\Sigma_{ll}$ can be computed efficiently by exploitation of the block diagonal matrix structures.

For the visual presentation of the estimation results we choose a gnomonic projection of the image and the three vanishing points, because this projection is suitable to represent the two vanishing points near infinity. For the projection we used a sphere with radius equal to the camera constant and the principal point of the camera as tangent point. Figure 3 shows the results with and without the orthogonality constraints (13). For the visualisation of the confidence regions we generated normally distributed samples according to the estimates, e.g., $v \sim N(\hat{v}, \Sigma_{\hat{v}\hat{v}})$, in homogeneous
Fig. 3. Image and vanishing points in gnomonic projection. left side: without orthogonal constraints, right side: with additional constraints to enforce the directions to be mutually orthogonal. The 99% confidence regions of the vanishing directions follow Bingham’s distribution and have been enlarged by a factor of 10.
representation and mapped them with the inverse calibration matrix $K^{-1}$ to the sphere. As expected, the three orthogonal clusters follow Bingham's distribution [15]. Observe that the inclusion of the orthogonality constraint into a joint estimation improves the accuracy as expected. Also note that for the third vanishing point the 99% confidence region overlaps the equator (i.e. the corresponding image point is near infinity and therefore may lie on the opposite side in the image), which is handled by the use of uncertain homogeneous entities in a straightforward manner.
4 Conclusions
We developed a scheme for simultaneously estimating sets of homogeneous entities from observed homogeneous entities with an arbitrary number of constraints and possibly rank-deficient covariance matrices, in order to integrate projective geometry and estimation theory. The consideration of uncertainty and correlations of the observed entities leads to statistically optimal results as in the case of the equivalent Euclidean representation. Thereby, possibly singular covariance matrices of homogeneous entities can be treated. Since the model uses the same equations as the unconstrained algebraic minimization, there is no need to change the representations during the geometric reasoning. The adjustment model is of special interest within the calculus of projective geometry, but the approach is not restricted to problems with normalization constraints for homogeneous entities. The proposed adjustment model has been statistically validated with synthetic data and confirmed with real data sets. The results for the estimation of fundamental matrices based on synthetic data are comparable to the ones achieved by the heteroscedastic errors-in-variables (HEIV) approach, which is considered a state-of-the-art estimator for such problems. The procedure can cope with considerably large noise in the point coordinates. The constrained estimation of vanishing points in a real image leads to statistically optimal results due to the stringent consideration of uncertainty and correlation of the observed straight line segments. The generality of the model makes it applicable to all problems containing homogeneous entities. With a small modification one can transfer it into a robust ML-estimation procedure by iteratively reweighting the conditions $g_i$. Of course, due to the use of the redundant representation and the additional constraints, computation times are larger than when using specially adapted representations because of the overparametrization. But such a generic module for estimating with homogeneous entities may be used for rapid prototyping and for not too large problems in cases where computing time is not critical.
References
1. Matei, B., Meer, P.: A General Method for Errors-in-Variables Problems in Computer Vision. In: Computer Vision and Pattern Recognition Conference, vol. II, pp. 18–25. IEEE, Los Alamitos (2000)
2. Heuel, S.: Uncertain Projective Geometry. LNCS, vol. 3008. Springer, Heidelberg (2004)
3. Utcke, S.: Grouping based on Projective Geometry Constraints and Uncertainty. In: Ahuja, N., Desai, U. (eds.) Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, January 4-7, 1998, pp. 739–746. Narosa Publishing House (1998)
4. Criminisi, A.: Accurate Visual Metrology from Single and Multiple Uncalibrated Images. Distinguished Dissertations. Springer, Heidelberg (2001)
5. Clarke, J.C.: Modelling Uncertainty: A Primer. Technical Report 2161/98, Department of Engineering Science, University of Oxford (1998)
6. Chojnacki, W., Brooks, M.J., van den Hengel, A.: Rationalising the Renormalisation Method of Kanatani. Journal of Mathematical Imaging and Vision 14, 21–38 (2001)
7. Kanatani, K.: Statistical Analysis of Geometric Computation. CVGIP: Image Understanding 59(3), 286–306 (1994)
8. Meidow, J., Beder, C., Förstner, W.: Reasoning with Uncertain Points, Straight Lines, and Straight Line Segments in 2D. ISPRS Journal of Photogrammetry and Remote Sensing 64(2), 125–139 (2009)
9. Koch, K.R.: Parameter Estimation and Hypothesis Testing in Linear Models, 2nd edn. Springer, Berlin (1999)
10. Huber, P.J.: Robust Statistics. J. Wiley, New York (1981)
11. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
12. Georgescu, B.: Software for HEIV based estimation, binary version. Center for Advanced Information Processing, Robust Image Understanding Laboratory, Rutgers University, http://www.caip.rutgers.edu/riul/research/code/heiv/ (2002)
13. Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the Association for Computing Machinery 24(6), 381–395 (1981)
14. Förstner, W., Gülch, E.: A Fast Operator for Detection and Precise Location of Distinct Points, Corners and Circular Features. In: Proceedings of the ISPRS Intercommission Conference on Fast Processing of Photogrammetric Data, Interlaken, June 1987, pp. 281–305 (1987)
15. Collins, R.T., Weiss, R.S.: Vanishing Point Calculation as a Statistical Inference on the Unit Sphere. In: International Conference on Computer Vision, December 1990, pp. 400–403 (1990)
Detecting Hubs in Music Audio Based on Network Analysis

Alexandros Nanopoulos

Institute of Computer Science, University of Hildesheim, Germany
Abstract. Spectral similarity measures are considered among the best-performing audio-based music similarity measures. However, they tend to produce hubs, i.e., songs measured closely to many other songs, to which they have no perceptual similarity. In this paper, we define a novel way to measure the hubness of songs. Based on network analysis methods, we propose a hubness score that is computed by analyzing the interaction of songs in the similarity space. We experimentally evaluate the effectiveness of the proposed approach.
1 Introduction
Similarity measures for music audio have attracted significant attention during the previous years [2,4,6,7]. Although music similarity is subjective and context dependent, such measures produce a single, intuitive number that is directly usable in applications such as playlist generation, classification, recommendation, or collection browsing and visualization. Spectral similarity measures have recently seen growing interest [1,2,7]. They describe aspects related to timbre and they model the "global sound" of a music signal. Today, spectral similarity measures are considered among the best-performing audio-based music similarity measures [2]. However, Aucouturier and Pachet [1] have reported the existence of a "glass ceiling" that cannot be surpassed without taking higher level cognitive processing into account. Spectral similarity measures have the undesirable property that some songs are measured closely to many other songs, to which they have no perceptual similarity [3]. These songs are called hubs and produce high false-positive rates. Hubs neither exist due to the spectral features, nor are they a property of a given modelling strategy [3]. Aucouturier and Pachet [3] propose methods to detect hubs according to the n-occurrence, i.e., the number of times they appear as nearest neighbors of other songs. As found in [3], n-occurrence values are distributed along a scale-free distribution. Therefore, hubness is a continuous variable and we should not consider hubs on a Boolean basis (i.e., that songs are either hubs or not).
We gratefully acknowledge the partial co-funding of this work through the European Commission FP7 project MyMedia (www.mymediaproject.org) under the grant agreement no. 215006.
In this paper, we propose a novel definition to measure the hubness of songs. We assign to each song a hubness score according to the interaction of the song with other songs in the similarity space. Interactions are computed with the Hypertext Induced Topics Search (HITS) [5], a prominent network analysis algorithm. In contrast to the n-occurrence score, network analysis can reveal songs which themselves may not appear frequently as nearest neighbors, but whose hubness stems from the type of interactions they have with other songs in the similarity space. We experimentally evaluate the effectiveness of the proposed approach to capture more 'problematic' songs compared to the n-occurrence score. The contributions of this paper are summarized as follows:

– We provide the insight that network analysis can be used to examine the interaction of songs in the similarity space in order to characterize their hubness.
– We propose a novel hubness score, which considers hubness as a continuous property that results from the analyzed interactions.
– We investigate three application scenarios: genre classification, collection browsing, and clustering. For these application scenarios we compare the use of the proposed hubness score and the use of the n-occurrence score in order to detect hubs. Our experimental results illustrate the advantages of the proposed method.

The rest of this paper is organized as follows. Section 2 describes the related work. In Section 3 we define the proposed hubness score, whereas Section 4 develops the evaluation framework. Section 5 contains the experimental results and Section 6 concludes this paper.
2 Related Work

2.1 Timbral Similarity Measures
Timbral similarity measures divide the audio signal into short, overlapping frames. From each frame, spectral representation features are extracted, such as Mel Frequency Cepstrum Coefficients (MFCCs). The overall distribution of features is summarized with a statistical model, such as clustering or a Gaussian Mixture Model (GMM). The distance between two music signals is computed by comparing their models with, e.g., the Earth Mover's Distance [6] or Monte Carlo sampling of the Kullback-Leibler distance [1]. Experiments in [1] indicated that the number of MFCCs is a more critical factor than the number of Gaussian components. For this reason, the latter can be decreased without hurting the precision much, in order to reduce the computational cost. Along these lines, the Single Gaussian Combined (G1C) method has been proposed [7], which combines spectral similarity with complementary information (fluctuation patterns). In the MIREX 2006 contest, G1C was the fastest overall and achieved the highest score [7]. Therefore, G1C is considered among
the state-of-the-art spectral similarity measures. Like all other spectral similarity measures, G1C has been reported [7] to produce hubs too. Pohle et al. [8] proposed Proximity Verification (PV), which is a transformation that handles the unbalanced distribution of songs produced by spectral similarity measures. For two songs s and t, $d_P(s, t)$ is equal to k if t is the k-th nearest neighbor of s. The modified distance measure is computed as $d_{PV}(s, t) = d_P(s, t) + d_P(t, s)$.
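As a small illustration of the PV transformation just described, the following Python sketch symmetrizes rank-based distances; the dense distance-matrix interface and the rank convention (the self-distance ranks 0) are illustrative choices of ours.

```python
import numpy as np

def proximity_verification(D):
    """Symmetrized rank distance: d_PV(s, t) = d_P(s, t) + d_P(t, s).

    D is an (m x m) matrix of pairwise spectral distances; d_P(s, t) is the
    position of t in the sorted neighbor list of s.
    """
    ranks = np.argsort(np.argsort(D, axis=1), axis=1)   # 0 = self, 1 = nearest, ...
    return ranks + ranks.T                              # d_P(s, t) + d_P(t, s)
```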
2.2 Hubs and Hubness Measures
The existence of hub songs was first identified experimentally by Aucouturier and Pachet [1]. They observed the existence of a small number of songs that occur frequently as false positives, i.e., which are irrelevantly close to many other songs. In another study [3], the same authors described the nature and causes of hubs. Moreover, they introduced two measures of hubness, the number of n-occurrences and the mean neighbor angle. For a song s, its n-occurrences is defined as the number of times s occurs in the first n nearest neighbors of all the other songs in the collection. For a song s and two of its neighbors, a and b, the neighbor angle is computed as the cosine of the angle formed by the segments [s, a] and [s, b]. The average neighbor angle results from the average cosine value among all pairs of neighbors. Since n-occurrences is more intuitive and more efficient to compute, it has been adopted in successive works [8].
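For reference, a minimal Python sketch of the n-occurrence measure could look as follows; the distance-matrix interface is an illustrative assumption.

```python
import numpy as np

def n_occurrences(D, n=10):
    """Number of times each song appears among the n nearest neighbors of the others."""
    m = D.shape[0]
    counts = np.zeros(m, dtype=int)
    for s in range(m):
        order = np.argsort(D[s])
        neighbors = [t for t in order if t != s][:n]   # n nearest neighbors of s
        counts[neighbors] += 1
    return counts
```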
3 Hubness Score Based on Network Analysis
Let $D = \{s_1, \ldots, s_m\}$ be a collection of songs and $d(s_i, s_j)$ a spectral distance measure. First, we find the n nearest neighbors of each song in the collection according to the measure d. The number n plays the same role as in the n-occurrences measure [3]. Let $L(s_i)$ denote the list of the nearest neighbors of $s_i$, sorted in increasing order of distance. Next, we form a graph G, where each node of G corresponds to a song of D. The adjacency matrix of G is A, where $A(i, j) = 1$ if $s_i \in L(s_j)$. In other words, in G there is an arrow from the node corresponding to $s_i$ to the node corresponding to $s_j$ if $s_i$ is among the n nearest neighbors of $s_j$. Notice that A is not symmetric. Figure 1a illustrates a collection with 10 song IDs and, for each song ID, the list L with the IDs of its 2 nearest neighbors. In Figure 1b the corresponding graph is plotted. For instance, there is an arrow from song 1 to song 5, because 1 appears in the list L of song 5. The graph formed with the aforementioned procedure defines a linking structure that reflects the structure of the similarity space of the songs. Based on this structure, network analysis can be performed to detect the hubness score of each song. For this task, we use HITS [5], a prominent network analysis method. HITS uses the notion of hubs and authorities to define a recursive relationship between nodes in a network: an authority is a node that many hubs link to¹, and a hub is
¹ An authority is a song plagued by many hubs, and may be considered a "lamb" in the analogy of [3].
Song ID : L (2 nearest neighbors)
1 : 9, 10
2 : 3, 4
3 : 2, 4
4 : 2, 8
5 : 6, 1
6 : 9, 2
7 : 10, 4
8 : 2, 4
9 : 1, 6
10 : 1, 9

(a)

[Figure 1b: the corresponding directed graph, with each node annotated by its hubness score HS.]

Fig. 1. Example of graph formation
a node that links to many authorities. HITS defines hubs and authorities recursively. Let H and X be two vectors, which contain the hub and authority values of all nodes, respectively. The hub values can be determined from the authority values as $H = AX$ (A is the adjacency matrix of G), whereas the authority values are computed from the hub values as $X = A^T H$. Finding the primary eigenvectors of $AA^T$ and $A^T A$, respectively, solves these linear equations [5]. Therefore, the solutions in the vector H contain the hubness scores (henceforth HS) of all songs, which are computed by taking the primary eigenvector of $AA^T$ and by normalizing it (between 0 and 1) with the $\|\cdot\|_2$ norm. For the graph of Figure 1b, we have annotated each node with its HS. Notice that although fewer links emanate from node 10 (2 links) than from node 1 (3 links), the hubness score HS(10) is equal to 0.22, which is higher than HS(1) = 0.1. The reason is that node 10 links to node 7, whereas node 1 links to node 5. Node 7 has a higher authority score (0.33 – not shown in the figure) than node 5 (0.05 – not shown in the figure). This fact strengthens the hubness score of node 10 and weakens the hubness score of node 1, due to the recursive property of HITS. Also, node 2 has the highest hubness score (0.66) among all nodes, although it has the same number of emanating links (4 links) as node 4. Conclusively, with the proposed hubness score, HS, we consider hubness as a continuous variable. HS stems from the analysis of the structure of the similarity space, not from simply measuring the number of emanating links as the n-occurrences measure does. The advantages of the proposed score will be illustrated experimentally in the following.
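The HS computation can be sketched with a simple power iteration; the neighbor-list input format, function name, and iteration count below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hubness_scores(neighbor_lists):
    """Hubness scores as the primary eigenvector of A A^T (HITS hub vector).

    neighbor_lists[j] holds the indices of the n nearest neighbors of song j,
    so A[i, j] = 1 iff song i is among the neighbors of song j.
    """
    m = len(neighbor_lists)
    A = np.zeros((m, m))
    for j, L in enumerate(neighbor_lists):
        for i in L:
            A[i, j] = 1.0
    h = np.ones(m)
    for _ in range(100):                  # power iteration on A A^T
        x = A.T @ h                       # authority update
        h = A @ x                         # hub update
        h /= np.linalg.norm(h)            # l2 normalization
    return h
```

Because the matrix is non-negative, the iteration converges to the dominant eigenvector of $AA^T$; applied to the toy neighbor lists of Fig. 1a it yields hub values of the kind annotated in Fig. 1b.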
4 Evaluation Framework
To evaluate the proposed hubness score (HS), we consider the following application scenarios.
4.1 Genre Classification
Nearest-neighbor genre classification has been extensively used to evaluate spectral music similarity measures. In these applications, the objective is to evaluate the similarity measure itself, by finding how many among the k nearest neighbor (k-NN) songs belong to the same genre as the query song. Since our objective is to evaluate HS and not a similarity measure, we perform genre classification as follows: (i) Find the k-NN songs of the query song, according to the provided similarity measure. (ii) Assign to each k-NN song a weight equal to the inverse of its HS. (iii) Sum separately for each genre the weights of those k-NN songs that belong to this genre. (iv) Assign the query song to the genre with the highest sum (majority voting). We measure the precision of genre classification as the fraction of query songs with correctly assigned genre, using the leave-one-out evaluation procedure. The weighting scheme reduces the effect of hubs, as hubs are less discriminating for k-NN genre classification. We also examine weighting by the inverse of n-occurrence values. Our hypothesis is that weighting by inverse HS results in better precision.
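A minimal sketch of this weighted voting, with an illustrative interface of our own choosing, could look as follows.

```python
import numpy as np
from collections import defaultdict

def classify_genre(query, k, distances, genres, hubness):
    """Weighted k-NN genre voting, weighting each neighbor by the inverse of its hubness.

    distances[q] is the vector of distances from song q to all songs, genres the genre
    labels, and hubness the per-song HS (or n-occurrence) values.
    """
    order = [t for t in np.argsort(distances[query]) if t != query][:k]
    votes = defaultdict(float)
    for t in order:
        votes[genres[t]] += 1.0 / max(hubness[t], 1e-12)   # inverse-hubness weight
    return max(votes, key=votes.get)                        # genre with the highest sum
```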
4.2 Collection Browsing
We developed a simple browsing system and performed a user study with 32 users (pre- and post-graduate students at Aristotle University) and the Magnatune collection (more details about this collection are presented in the following section). In each browsing session, a user initially chooses a query song. As a result the user receives a list with the 30 nearest neighbor songs, according to the given similarity measure. Next, the user can hear any of the returned songs. If the user wishes, he can pick one of the results as the next query song. This way, the session is continued and a new result list is formed. At the end of each session we count the total number of result songs that did not match the genres of the corresponding query songs. This number of mismatches is denoted as errors. We measure separately the percentage of errors due to the songs with the highest HS and n-occurrence values, to see how much hub songs (detected separately by the two methods) contribute to the total error. Our hypothesis is that songs with the highest HS will contribute more to the total error. The reason is that a session acts as a 'random walk' in the graph of similarities. The HITS algorithm designates, through their HS values, the songs that are more probable to occur within random walks [5]. Therefore, songs with high HS are expected to produce more errors. Such songs constitute better candidates for an application that wants to filter 'erroneous' k-NN songs. For instance, filtering can be done by providing users with an indication about which k-NN songs are hubs. In our user study, we used such a simple filtering mechanism, the results of which are described in Section 5.2.
4.3 Clustering
Clustering of songs according to their similarities is useful to organize a collection, for visualization, etc. A hub song can negatively affect clustering, as it
tends incorrectly to be close to many songs, thus resulting in incorrect merging of clusters. As our input is only the given similarities, we use hierarchical clustering algorithms, which do not require further information (moreover, they do not require songs to be vectors in a dimensional space). The number of clusters is set equal to the number of genres in the examined collections. We separately evaluate the clustering result after the exclusion of the top hubs according to HS and n-occurrence. The quality of the clustering result is measured with the Jaccard coefficient. Let a be the number of song pairs assigned to the same cluster and belonging to the same genre, b those assigned to the same cluster but belonging to different genres, and c those assigned to different clusters but belonging to the same genre. The Jaccard coefficient is the fraction a/(a + b + c) and takes values in the range 0–1. Our hypothesis is that the exclusion of hubs according to HS leads to a better clustering result.
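A small Python sketch of this pair-counting Jaccard coefficient (label-vector interface assumed for illustration):

```python
from itertools import combinations

def jaccard_coefficient(cluster_labels, genre_labels):
    """Pair-counting Jaccard coefficient a / (a + b + c) between a clustering and the genres."""
    a = b = c = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_genre = genre_labels[i] == genre_labels[j]
        if same_cluster and same_genre:
            a += 1
        elif same_cluster:
            b += 1
        elif same_genre:
            c += 1
    return a / (a + b + c) if (a + b + c) else 0.0
```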
5 Experimental Results
5.1 Methods and Data
In this section we present experimental results for the three application scenarios of our evaluation framework. With these results we evaluate the proposed hubness score (HS) compared to the existing n-occurrence score (n-occ). We use two music collections that are publicly available: (i) The Magnatune collection (denoted as MIREX'04), which consists of 729 songs from 6 genres. The files were downsampled to 22,050 Hz and one minute from the center of each song is used. The frame size is 512 samples (about 23 ms). 20 MFCC coefficients are extracted from each frame. (ii) The USPOP'02 collection (denoted as USPOP).² The MFCC coefficients are readily provided for this collection. For each frame we keep 20 MFCCs and half a minute from the center is used. As the spectral similarity measure we selected G1C, for the reasons discussed in Section 2.1. The computation of the distance matrices is done with the MA Toolbox.³ We additionally examined the transformation of G1C with the proximity verification technique (denoted as G1C-PV). For the computation of HS and n-occ scores, the default k value is set to 10.
5.2 Results
Initially, we examine the existence of hub songs in the two data sets for the G1C spectral similarity. Following the methodology of [3], Figures 2a and b plot the distribution of n-occ values in MIREX’04 and USPOP collections, respectively, for k = 10 (linear fit is plotted with solid line) and k = 50 (linear fit is plotted with dash-dotted line). Based on the approach of [3], the plots are in log-log scale and the distributions show themselves to be linear. This indicates the existence of hub songs that have high n-occ values. Similarly, Figures 2c and d plot the corresponding distributions of HS values. Analogous results are found, indicating the existence of songs (hubs) with high HS values. 2 3
http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html http://www.ofai.at/∼elias.pampalk/ma.
308
A. Nanopoulos Binned distribution of n−occ to songs (USPOP) 4
3
log10(Proportion of songs)
log10(Proportion of songs)
Binned distribution of n−occ to songs (MIREX’04)
2.5 2 1.5 1 0.5 0 0
0.5
1
1.5
3
2
1
0 0
2
0.5
log10(n−occ)
1.5
2
10
(a)
(b)
Binned distribution of HS to songs (MIREX’04)
Binned distribution of HS to songs (USPOP)
3.5
4
log10(Proportion of song)
log10(Proportion of songs)
1
log (n−occ)
3 2.5 2 1.5 1 0.5 0
−3
−2.5
−2
−1.5
log10(HS)
(c)
−1
−0.5
3
2
1
0 −2.5
−2
−1.5
−1
log10(HS)
(d)
Fig. 2. Distribution of n-occ (a, b) and HS (c, d) in log-log scale, for k = 10 (linear fit is plotted with solid line) and k = 50 (linear fit is plotted with dash-dotted line)
Next, we examine genre classification. We compare the two weighting schemes, according to HS and n-occ, and the plain k-NN classifier (no weighting). Figure 3a presents precision for varying k, for the G1C distance measure. For very small k values, all methods result to similar precision. The reason is that, when k is very small, hubs appear less frequently in the lists with the k-NN songs (hubs tend to be similar to many songs but are not among the most similar), thus weighting does not take effect. As k increases, weighting with HS and n-occ becomes more effective and gives better precision than k-NN classifier, whereas weighting with HS giving the best results. Figure 3b presents the analogous results for the G1C-PV case. For the USPOP collection, as it is very unbalanced (most pieces belong to the genre rock pop), precision measured with majority voting is not indicative, because it is overwhelmed by the rock pop genre. We instead measure the probability of correct class, which is the sum of weights of the k-NN songs that belong to the correct genre (that of the query song) divided by the total sum of weights of all k-NN songs. The results are presented in Figure 3c, showing the clear advantage of weighting with HS. Analogous results are derived in Figure 3d for G1C-PV.
Detecting Hubs in Music Audio Based on Network Analysis MIREX’04 (G1C)
309
MIREX’04 (G1C−PV)
0.85 HS n−occ kNN
precision
0.8
precision
0.85
0.75 0.7
HS n−occ kNN
0.8
0.75 0.65
1
5
0.7 1
10 15 20 25 30 35 40 45 50
5
k
10 15 20 25 30 35 40 45 50
k
(a)
(b)
USPOP (G1C)
USPOP (G1C−PV) 0.75
HS n−occ kNN
0.65
0.6
0.55 1 5
10 15 20 25 30 35 40 45 50 k
(c)
correct class prob.
correct class prob.
0.7 HS n−occ kNN
0.7
0.65
0.6
0.55 1
5
10 15 20 25 30 35 40 45 50
k
(d)
Fig. 3. Genre classification: precision for varying k
For collection browsing, as described in Section 4.2, we counted the percentage of error that is produced due to top hubs, i.e., songs with the highest hub scores. The top hubs are separately measured with HS and n-occ. The results for G1C are presented in Figure 4a. The horizontal axis gives the number of top hubs as percentage of the total number of songs in the collection. Both for HS and n-occ, the few first songs with the largest hub scores are responsible for a high percentage of total error. For example, the 5% songs with the highest HS and n-occ scores are responsible for more than 30% and 20%, respectively, of total error. As Figure 4b presents the results for G1C-PV. The difference in the error between hubs measured with HS and n-occ means that, as needed, HS identifies as hub songs those songs that cause more errors. Such songs are better candidates for filtering. In our user study we used a simple filtering mechanism, by indicating in each result list, those k-NN songs (if any) with the highest 30 HS values among all songs (about the top 4% HS values). We asked users to pay attention to these songs and report in the end, in a scale from 1 to 5, how much do they find that the indicated songs both lack perceptual similarity to the corresponding queries and appear with high frequency in the results. The bigger the score, the more the users agree that the indicated songs are hubs. The average score was 4.2 (standard deviation 0.3), which denotes
310
A. Nanopoulos MIREX’04 (G1C)
MIREX’04 (G1C−PV)
60 HS n−occ
40 30 20 10 0 0
HS n−occ
50
perc. error
perc. error
50
60
40 30 20 10
5
10
0 0
15
5
num of hubs (perc.)
10
15
num of hubs (perc.)
(a)
(b) MIREX’04 (G1C) HS n−occ rand
Jaccard coeff.
0.35
0.3
0.25
0.2 0
1
2
3
4
num of hubs (perc.)
(c) Fig. 4. (a,b) Percentage of errors due to top hubs. (c) Jaccard coefficient for clustering.
the agreement of users that the indicated songs, identified by HS, have the characteristics of hubs. Finally, we measured the Jaccard coefficient for the results of hierarchical clustering4 , when the songs with the highest hub scores, measured separately with HS and n-occ, are excluded. Figure 4c presents the results for varying number of excluded hubs (given as percentage of the total number of songs). For comparison purposes, we also present results for the case we exclude the same number of randomly selected songs. The exclusion of hubs by both HS and n-occ, produces better results than the random exclusion of songs, whereas HS presents the best results.
6
Conclusion
We have proposed a novel method to detect hub songs that result from spectral similarity measures, based on the analysis of the network that reflects the interaction of songs in the similarity space. We evaluated the effectiveness of our approach with three application scenarios. 4
We examined single-linkage, complete-linkage, and averaging, and for all methods we report results for the best cases.
References
1. Aucouturier, J.-J., Pachet, F.: Improving timbre similarity: How high is the sky? Journal on Negative Results in Speech and Audio Sciences 1(1) (2004)
2. Aucouturier, J.-J., Pachet, F., Sandler, M.: The way it Sounds: timbre models for analysis and retrieval of music signals. IEEE Transactions on Multimedia 7(6) (2005)
3. Aucouturier, J.-J., Pachet, F.: A scale-free distribution of false positives for a large class of audio similarity measures. Pattern Recognition 41(1) (2008)
4. Berenzweig, A., Logan, B., Ellis, D.P.W., Whitman, B.: A large-scale evaluation of acoustic and subjective music similarity measures. In: Proceedings of the Int. Conf. on Music Information Retrieval (ISMIR) (2003)
5. Kleinberg, J.M.: Authoritative Sources in a Hyperlinked Environment. Journal of the ACM 46, 604–632 (1999)
6. Logan, B., Salomon, A.: A music similarity function based on signal analysis. In: Proceedings of the IEEE Int. Conf. on Multimedia and Expo (ICME) (2001)
7. Pampalk, E.: Audio-Based Music Similarity and Retrieval: Combining a Spectral Similarity Model with Information Extracted from Fluctuation Patterns. In: MIREX 2006 – Audio Music Similarity and Retrieval, http://www.music-ir.org/mirex2006/
8. Pohle, T., Knees, P., Schedl, M., Widmer, G.: Automatically Adapting the Structure of Audio Similarity Spaces. In: Proceedings of the Workshop on Learning the Semantics of Audio Signals (LSAS) (2006)
A Gradient Descent Approximation for Graph Cuts

Alparslan Yildiz and Yusuf Sinan Akgul

Computer Vision Lab., Department of Computer Engineering,
Gebze Institute of Technology, Gebze, Kocaeli 41400, Turkey
{yildiz,akgul}@bilmuh.gyte.edu.tr
http://vision.gyte.edu.tr
Abstract. Graph cuts have become very popular in many areas of computer vision including segmentation, energy minimization, and 3D reconstruction. Their ability to find optimal results efficiently and the convenience of usage are some of the factors of this popularity. However, there are a few issues with graph cuts, such as the inherent sequential nature of popular algorithms and the memory bloat in large scale problems. In this paper, we introduce a novel method for the approximation of the graph cut optimization by posing the problem as a gradient descent formulation. The advantages of our method are the ability to work efficiently on large problems and the possibility of convenient implementation on parallel architectures such as inexpensive Graphics Processing Units (GPUs). We have implemented the proposed method on the Nvidia 8800GTS GPU. The classical segmentation experiments on static images and video data showed the effectiveness of our method.
1 Introduction
Graph cuts have been extensively used in computer vision in solving a wide range of problems such as stereo correspondence, multi view reconstruction, image segmentation, and image restoration. One of the reasons of their popularity is the availability of practical polynomial time algorithms such as [1] and [2]. As a result, the literature includes many successful graph cut based computer vision systems. Minimum cut on a graph can be formulated as finding the maximum flow that can be pushed from the source node to the sink node on the graph through the links between the nodes with known capacities. Ford and Fulkerson [9] showed that finding the minimum cut on a graph that divides the nodes into two distinct sets as source and sink is equivalent to finding the maximum flow from source node to the sink node in the same graph. In computer vision, the algorithm of Boykov and Kolmogorov [1] is the most popular one. It computes the maximum flow using a modified version of the Ford-Fulkerson maximum flow algorithm. The algorithm finds augmenting paths to push flow using two search trees, one rooted at the source node and the other rooted at the sink node. Once these two search trees meet, an augmenting path from source to sink is found and
the maximum possible flow is pushed through this path. As an improvement over the standard augmenting path algorithm, this algorithm searches for the next augmenting path using the same search trees, which significantly improves the total running times. Although this algorithm is successfully used for medium-sized image segmentation and small-sized 3D segmentation, it is not practical for high-resolution images, 3D data, and real-time dynamic segmentation applications due to its large memory requirements and slower running times. The literature includes methods to speed up the graph cut computation for larger graphs, such as [3], which proposes a coarse-to-fine scenario. A speed-up for real-time purposes is suggested by [4], which uses an initial flow to compute the solution. Another speed-up is proposed by [5], which uses the residual graph from previous frames for the graph cut computation of the next frames of a video sequence. Another direction for graph cut speed-up employs Graphics Processing Units (GPUs). Dixit et al. [6] implemented the push-relabel maximum flow algorithm on the GPU. Vineet and Narayanan [7] showed that recent GPUs, such as the Nvidia GTX280 [10], outperform the best sequential maximum flow algorithms by a factor of around 10 in running times. Recently, Bhusnurmath and Taylor [8] have formulated the graph cut problem as an unconstrained l1-norm minimization and solved it using Newton's method on the GPU. A very interesting property of their work is that, different from the other methods above, it is based on solving the graph cut problem using the minimum cut formulation rather than the maximum flow formulation. The minimum cut formulation directly solves for the labels of the nodes, and as these are known to converge to one of the labels, rounding errors are less of a problem. Another very useful property of the minimum cut formulation is that, as it uses the labels of the nodes directly, the initial estimate of the solution can be given without any processing on the graph, which is not possible with the maximum flow formulations. In spite of the novelty of [8], the method uses too much memory and it becomes impractical for some applications that have to deal with large graphs. In this paper, we present a gradient descent approximation of graph cuts. Our method is similar to the minimum cut formulation given by Bhusnurmath and Taylor [8], but we have several major advantages. Although the gradient descent method will require more iterations than Newton's method, it has some very important advantages over Newton's method. The most important advantage of the gradient descent method is its dramatically lower memory requirements. For applications where the data structures for Newton's method will not fit in memory, our method can still work successfully. Other advantages of our gradient descent method include the local computation of gradients, high parallelization, and the ease of implementation. Finally, using the red-black order and the SOR (Successive Over-Relaxation) method to speed up the process even further can be considered as additional benefits of our method. The remainder of the paper is organized as follows: we give an overview of graph cuts and the minimum cut formulation in Section 2. The details of applying the gradient descent method to the minimum cut formulation are explained
in Section 3. We provide experimental results and running times of our method in Section 4 and finally we give concluding remarks in Section 5.
2 Graph Cuts
In this section we give an overview of graph cuts and the minimum cut formulation as an unconstrained l1-norm minimization. A graph G(V, E) is a set of n nodes V and m weighted edges (capacities) E that connect these nodes. There are two special nodes in the binary graph cuts formulation, the source node s and the sink node t. The special nodes s and t have edges to all other nodes, some of which may have 0 weights. The edges between s and the other nodes, excluding the node t, are called source links, and the edges between t and the other nodes, excluding the node s, are called sink links. The source and sink links are also included in the set E. The maximum flow problem is to push as much flow as possible from the source node s to the sink node t. The weights of the edges between the nodes define the capacities that limit how much flow can be pushed through these edges. Thus, if there are no saturated edges, it means that there is still some flow that can be pushed from s to t. Hence, the maximum flow will saturate some of these edges. The max-flow/min-cut theorem [9] states that the minimum cut on a graph will include saturated edges and will cut the graph into two partitions such that some of the nodes will be in the source set and the other nodes will be in the sink set. The value of the minimum cut is the sum of the weights of the saturated edges in the cut and is equal to the value of the maximum flow. See Fig. 2 for a simple graph with saturated edges and a cut. Since maximum flow and minimum cut are equivalent, solving either of these formulations will also reveal the solution of the other. In computer vision, binary
Fig. 1. Portion of a typical graph used in computer vision. A cut is shown partitioning the nodes to source and sink sets. Saturated edges that are forming the cut are indicated as dotted.
labeling problems are usually defined in terms of a minimum cut that will assign appropriate labels to pixels (nodes). However, the problem is usually solved via the maximum flow formulation. The maximum flow algorithm of Boykov and Kolmogorov [1] is the most popular one in computer vision. Their algorithm implements a tuned version of the well-known Ford-Fulkerson maximum flow algorithm, which finds augmenting paths between s and t and pushes flow through these augmenting paths until there is no augmenting path left. Recently, Bhusnurmath and Taylor [8] have formulated the maximum flow problem as a linear programming problem and solved the minimum cut formulation as the dual of the maximum flow formulation. They give the minimum cut formulation as the minimization of

F(v) = \sum_{i}^{m} w_i \left| (A^T v - c)_i \right|,   (1)
where w_i is the i-th edge capacity and m is the number of edges including the source and the sink links. The n × m matrix A represents the graph structure, with n being the number of nodes. c is the vector of length m which indicates the source links with 1 and any other link with 0. The values of the nodes are represented by the vector v of length n. They proved that this unconstrained l1-norm minimization will lead the node values v to either 1 (source) or 0 (sink). The matrix A can get very large even for reasonable graphs, and applying Newton's method to this formulation requires some careful work. The next section introduces our novel solution to this problem by applying the gradient descent method to this formulation without building the matrix A, which lets the method work easily even for very large graphs.
3 Gradient Descent for Graph Cuts
Our gradient descent method for solving graph cuts is motivated by the increasing popularity of GPUs for general purpose computing in vision. As GPUs have a high number of parallel processors and many times more computational power than CPUs, it is logical to move some intensive computations to the GPU. Graph cuts are very popular for low-level vision and there are a number of GPU implementations for graph cuts, as mentioned before. Our method for computing graph cuts is the application of the well-known gradient descent minimization to the unconstrained l1-norm formulation of the minimum cut. The gradient descent algorithm to minimize the function F(x) of Eq. 1 is given in Algorithm 1. For the gradient descent method to work, F(x) needs to be a smooth function, i.e. it should be differentiable for every value of x. However, the minimum cut formulation contains the absolute value function, which is not differentiable at its root. Bhusnurmath and Taylor [8] solved this problem using the interior point method with logarithmic barrier potentials. At each iteration they solved a smoothed version of the function, and the solution is given as the initial solution for the next iteration, where the function is not smoothed as much as in the
Algorithm 1. Minimize F(x) using Gradient Descent
1: x = initial solution
2: repeat
3:   Calculate derivatives Δx = ∂F/∂x
4:   Choose a proper β parameter
5:   Update x: x = x − βΔx
6: until convergence
previous iteration. With each successive step of this operation, the minimized function approaches the original non-differentiable function, while the solution of the smoothed function approaches the desired solution. We employ a similar strategy. However, instead of smoothing the function F(x) of Eq. 1 directly, we expand Eq. 1 as

F(v) = \sum_{i}^{n} \Big( s_i |1 - v_i| + t_i |v_i| + \sum_{j \in N(i)} w_{ij} |v_i - v_j| \Big)   (2)

and approximate the absolute value function by smoothing it with a small variable μ using

|v_i| = \sqrt{v_i^2 + \mu}.   (3)

Using this approximation of the absolute value function, Eq. 2 is now differentiable and solvable using the gradient descent method. The advantage of this approach is that the expanded formulation requires less memory and provides more parallelization. The derivative calculation is local and highly parallel. The step factor β of Algorithm 1 is chosen at each iteration such that the update in the given direction is as large as possible while keeping the vector v feasible, i.e. v_i ∈ [0, 1]. Note that v does not have to be strictly in the interval [0, 1] because it will converge eventually. However, we observed in our experiments that this constraint dramatically speeds up the convergence. The time complexity of our algorithm is the number of iterations times O(n) for a grid graph, which is the time to compute the gradient direction and to update the variables. The memory complexity is linear in the number of nodes for a grid graph. As the gradient descent method employs local computation of derivatives and local updates of variables, the propagation of information from one end of the graph to the other end might take longer for some applications. This results in a higher number of iterations for some difficult examples. As the time required for a single iteration of the algorithm is less than for Newton's method, the total running time is expected not to be worse than that of the Newton's method of [8]. The advantages of gradient descent are the lower memory requirement and the local management of variables. Another advantage is the possibility of applying well-known parallel processing techniques such as the red-black ordering
and the SOR. The red-black ordering method is the computation of independent variables in each iteration. Consider a 2D grid graph, which is very common in vision problems. If we think of the graph as a checkerboard, the red and black squares are two sets of locally independent variables. Computing the red variables and updating them lets the black variables use the most recent values of their neighbors, and vice versa. This method dramatically decreases the number of iterations required for convergence. However, current Nvidia GPUs have a memory bottleneck unless the memory access pattern is well organized. Although the number of iterations is decreased, the total load stays the same. We observed that red-black ordering is currently not improving the running times. On the other hand, applying SOR with ω around 1.5 speeds up the convergence by around 20-30%.
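To make the update rule concrete, the following sketch is our own illustration (not the authors' GPU implementation): it applies Algorithm 1 to the smoothed energy of Eqs. 2 and 3 on a 4-connected grid, assuming hypothetical per-pixel arrays `s`, `t`, `w_right`, and `w_down` for the terminal and neighbor weights, and a fixed step size with clamping instead of the per-iteration β.

```python
import numpy as np

def smoothed_abs_grad(x, mu):
    # Derivative of the smoothed absolute value |x| ~ sqrt(x^2 + mu) of Eq. 3.
    return x / np.sqrt(x * x + mu)

def graph_cut_gradient_descent(s, t, w_right, w_down, mu=1e-3, beta=0.05, iters=500):
    """Gradient descent on the smoothed energy of Eq. 2 for a 4-connected grid.

    s, t    : source and sink link weights per pixel, shape (H, W)
    w_right : weights of edges to the right neighbor, shape (H, W-1)
    w_down  : weights of edges to the lower neighbor, shape (H-1, W)
    Returns the relaxed node labels v; thresholding at 0.5 gives the cut.
    """
    v = np.full(s.shape, 0.5)                                   # initial solution
    for _ in range(iters):
        # Gradient of the terminal terms s_i|1 - v_i| + t_i|v_i|.
        g = -s * smoothed_abs_grad(1.0 - v, mu) + t * smoothed_abs_grad(v, mu)
        # Gradient of the pairwise terms w_ij|v_i - v_j|; each edge contributes
        # with opposite signs to its two endpoints.
        dr = w_right * smoothed_abs_grad(v[:, :-1] - v[:, 1:], mu)
        g[:, :-1] += dr
        g[:, 1:] -= dr
        dd = w_down * smoothed_abs_grad(v[:-1, :] - v[1:, :], mu)
        g[:-1, :] += dd
        g[1:, :] -= dd
        # The paper picks beta per iteration to keep v feasible; here we take a
        # fixed step and clamp to [0, 1], which it notes speeds up convergence.
        v = np.clip(v - beta * g, 0.0, 1.0)
    return v
```

In practice the simultaneous (Jacobi-style) update shown here would be replaced by the red-black schedule and SOR discussed above.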
4 Experiments
We have tested our algorithm on the image segmentation task with various examples. The source and sink links are computed using mixtures of Gaussians built from user-selected sample regions. The neighboring links are computed using the images' spatial gradients. Figures 2 and 3 show a typical output of the
Fig. 2. Results for the fish image: (a) input image, (b) max-flow result, (c) our result
Fig. 3. Results for the puppy image: (a) input image, (b) max-flow result, (c) our result
Table 1. Running times (in ms) comparison for max-flow and our algorithm with different images

Image     Resolution   Max-flow [1]   Our method
flower1   512x512      111.14         112.5
flower2   512x512      128.64         194.63
flower3   512x400      72.92          8.04
puppy     512x372      82.44          94.39
fish      256x256      24.44          18.59
fish      512x512      128.56         117.6
fish      1024x1024    557.46         743.98
Table 2. Memory usage (in KBs) comparison for max-flow and our algorithm with different image sizes

Algorithm      256x256   512x512   1024x1024
Max-flow [1]   6140      24672     98724
Our method     1536      6144      24576
segmentation task. To visually compare the results with the popular maximum flow algorithm of Boykov and Kolmogorov [1], we give outputs from both algorithms. As our algorithm smooths the energy function at each iteration with decreasing strength, the borders of the segmentation appear smoother in our results, which is desirable in some applications. It can be seen in the flower1 and puppy images that our method produces some small erroneous regions, which correspond to graph regions that are difficult to solve. The flower2 image in Fig. 4 is an especially difficult example, and our method produces some erroneous regions. This is due to the approximation and smoothing of the energy. On the other hand, for very easy examples such as the flower3 image in Fig. 5, our method converges very fast (see Table 1) and gives very accurate results. As noted before, our algorithm requires less memory than other graph cut algorithms. In Table 2, we give the memory usage of the max-flow algorithm
Fig. 4. Results for the flower2 image: (a) input image, (b) max-flow result, (c) our result
Fig. 5. Results for the flower3 image: (a) input image, (b) max-flow result, (c) our result
Fig. 6. Results for the puppy image: (a) input image, (b) max-flow result, (c) our result
of Boykov and Kolmogorov [1] and our gradient descent algorithm. Both algorithms' memory complexity appears to be linear; however, our method uses less memory than [1]. Our gradient descent formulation can directly start from a given labeling without any processing, in contrast to the maximum flow algorithms. For a video sequence, where successive frames differ only slightly in their results, this property can be exploited very effectively. In Fig. 8 we compare the running times of our algorithm with the max-flow algorithm [1]. The running times are produced from the 320x240 video sequence given in Fig. 7.
Fig. 7. Frames from a video sequence. The output of each frame is given as the initial segmentation for the next frame.
Fig. 8. Running times (in ms) for the video sequence given in Fig. 7, plotting processing time (ms) over frames 0-32 for the max-flow algorithm [1] and our method
5 Conclusions
We presented a gradient descent approximation of graph cuts, which is based on the unconstrained l1-norm formulation of the minimum cut. The advantages of our method are the local computation of derivatives and the lower memory requirement. We also apply some well-known speed-up techniques such as the red-black ordering and SOR. We showed that our method can be easily initialized from a given approximation without any preprocessing on the graph, in contrast to dynamic maximum flow methods, which start from a residual graph. The disadvantage of our method is the high number of iterations required due to the nature of the gradient descent algorithm. We address this issue using speed-up techniques and by fully utilizing the parallel processors on the GPU. As future work, we plan to investigate different strategies, including a multi-scale scheme and heuristics, to speed up the method even further.
Acknowledgements This work was conducted at the Computer Vision Laboratory at Gebze Institute of Technology. It was partially supported by TUBITAK Career Project 105E097.
References
1. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26(9) (September 2004)
2. Goldberg, A.V., Tarjan, R.E.: A new approach to the maximum flow problem. JACM (1988)
3. Lombaert, H., Sun, Y., Grady, L., Xu, C.: A multilevel banded graph cuts method for fast image segmentation. In: ICCV 2005 (2005)
4. Juan, O., Boykov, Y.: Active graph cuts. In: CVPR 2006, pp. 1023–1029 (2006)
5. Kohli, P., Torr, P.H.S.: Dynamic graph cuts for efficient inference in Markov random fields. PAMI (2007)
6. Dixit, N., Keriven, R., Paragios, N.: GPU-Cuts: Combinatorial optimisation, graphic processing units and adaptive object extraction. Research Report 05-07, CERTIS (March 2005)
7. Vineet, V., Narayanan, P.J.: CUDA Cuts: Fast graph cuts on the GPU. Technical Report, International Institute of Information Technology, Hyderabad
8. Bhusnurmath, A., Taylor, C.J.: Solving the graph cut problem via l1-norm minimization. Technical Report, CIS (2007)
9. Ford, Jr., L.R., Fulkerson, D.R.: Maximal flow through a network. Canadian Journal of Mathematics (1956)
10. Nvidia GeForce Family GPUs, http://www.nvidia.com/object/geforce_family.html
A Stereo Depth Recovery Method Using Layered Representation of the Scene Tarkan Aydin and Yusuf Sinan Akgul GIT Vision Lab, Department of Computer Engineering, Gebze Institute of Technology Gebze, Kocaeli 41400 Turkey
[email protected],
[email protected] http://vision.gyte.edu.tr/
Abstract. Recent progress in stereo research implies that the performance of disparity estimation depends on the localization of discontinuities in the disparity space, which is generally predicated on discontinuities in the image intensities. However, these approaches have known limitations at highly textured and occluded regions. In this paper, we propose to employ a layered representation of the scene as an approximation of the scene structure. The layered representation of the scene is obtained by using a partially focused image set of the scene. Although self-occlusions are still present in real aperture imaging systems, our approach does not suffer from the occlusion problems as much as stereo and focus/defocus based methods. Our disparity estimation method is based on two synchronously optimized interdependent processes which are regularized with a nonlinear diffusion operator. The amount of diffusion between neighbors is adjusted adaptively according to the information in the layered scene representation and the temporal positions of the processes. The system is insensitive to initialization and very robust against local minima. In addition, it accurately handles depth discontinuities. The performance of the presented method has been verified through experiments on real and synthetic scenes.
1 Introduction
Recovering 3D information of a scene from 2D images is one of the most fundamental tasks in computer vision. A wide range of image cues have been used to accomplish this task (e.g. degree of focus/defocus, stereo correspondence, etc.). Among them, stereo methods try to estimate spatial shifts in images captured from different views by establishing visual correspondences between them. The main difficulty of the correspondence problem is the ambiguity due to image noise, repeated texture, and occlusions, which make the mathematical formulation of the problem ill-posed. Therefore, a regularization strategy should be employed by imposing prior assumptions about the scene geometry or by including additional information about the scene. A commonly adopted approach is to
formulate the problem in an energy minimization framework by explicitly introducing a smoothness term which allows the retrieval of piecewise smooth disparity maps. Although considerable progress has been achieved in the minimization of the energy functional, solutions with lower energy values do not necessarily result in higher performance values [1]. Therefore, researchers tend to include additional information from images of the scene related to its geometry. One of the most meaningful pieces of information for depth estimation methods is the possible locations of depth discontinuities. Many state-of-the-art stereo algorithms utilize image intensity variations to align depth discontinuities with the intensity discontinuities. Local methods use this information to shape support windows adaptively [2]. Recently, several global stereo algorithms have been proposed that match segments rather than pixels in the optimization process using graph cuts and belief propagation methods [3,4]. Similarly, diffusion based methods include edge information in their anisotropic operators to supervise the flow between neighbors [5]. Another serious challenge for stereo is the occlusion problem, which means that some scene points are not visible from all views. Because it is very hard to extract depth values of occluded regions using only their intensity values, the use of active illumination was proposed [6,7]. In order to overcome the limitations associated with intensity based discontinuity localization and to detect the surface structure of occluded regions without using any illumination setup, we propose to employ more reliable and practical information from other cues. This paper presents a system that represents the scene in a layered form in which layers are ordered according to their distance to the image plane. The layered form of the scene is extracted from an image set of the scene captured from the same view by focusing on virtual layers in the scene. The layered scene structure is employed by our stereo method to determine the possible depth discontinuity locations. Our approach has partial biological motivations because it is known that human vision uses both stereo and focus for depth estimation. Pentland [8] has reported that human perception of depth is strongly influenced by the gradient of focus as a useful source of depth information. In order to extract the layers, we establish correspondences between the images in the set and the all-focus image of the scene from the same view. Our approach is closely related to shape from focus (SFF) methods that recover the depth of a scene from intensity variations in the images by searching for the sharpest image sections. However, the occlusion problem inherent to real aperture lenses makes the SFF results ambiguous [9]. With the employment of the all-focus image of the scene, we formulate the layer extraction as a correspondence problem. As a result, our method does not suffer from the occlusion problem as much as SFF. Our previous work introduced synchronous optimization processes for stereo depth estimation [10]. The optimizations are expressed as two separate but dependent processes which are iteratively minimized to deform two initial surfaces towards each other using a gradient descent minimization. Although the gradient descent method does not guarantee optimality and depends highly on initial settings, due to the interaction between the optimization processes,
the overall result of our system is always better than the results achievable by a single optimization process. Reliable convergence is ensured by starting each process from different initial positions. In this paper, we also introduce a novel nonlinear diffusion operator which is specifically designed to utilize both the layered representation and the temporal information in the synchronous processes. It performs isotropic smoothing and anisotropic smoothing adaptively around inhomogeneous regions. Unlike previous diffusion techniques, the proposed operator adapts itself to the meta-states of both processes, allowing depth discontinuities to emerge during the optimization process. The proposed method does not need any calibration procedures other than the stereo rectification. In other words, it does not need registration between stereo disparity values and layer depth ordering because we use the layering information only for depth discontinuity detection. The alternative of focus setting-disparity registration would be too difficult to achieve in real life situations due to the complex calibration routines required [11]. The rest of this paper is organized as follows. Section 2 explains the layered form of the scene and the synchronous energy functional. Section 3 describes the system validation and experiments. Finally, we provide concluding remarks in Section 4.
2 Method
2.1 Layered Representation of the Scene
The visible surface of a scene from one view can be approximated by non-overlapping fronto-parallel layers. The availability of this information is very helpful for estimating the possible locations of depth discontinuities and aligning them with the layer boundaries. Many state-of-the-art stereo methods have taken advantage of this approach by simply applying color segmentation on the images [3,4]. However, an excessive number of layers may be produced for highly textured images. In addition, relative distances or the ordering of the layers cannot be established with color segmentation. In our system, we approximate the surface of the scene as a composition of ordered layers. Layers are extracted from an image set of the scene in which each image is captured from the same view by focusing on successive virtual planes in the scene. The images are taken with the widest lens aperture setting so that the scene points that lie in the focused layer can easily be detected due to the shallow depth of field. Layer assignment to the image points is accomplished by establishing correspondences between the images in the set and the all-focus image of the scene from the same view, which we already have for the stereo system. Note that the system setup and data acquisition method is similar to that of shape from focus (SFF) [12] methods, in which depth is reconstructed from multiple images taken with different focus settings. However, our layer assignment strategy is completely different from classical SFF; hence
Fig. 1. Image formation process of a scene with two fronto-parallel layers where the focus is set to (a) the background and (b) the foreground. The point q may appear focused when the focus is set to either the foreground or the background. Real images of a sample scene taken with the focus set to (c) the background and (d) the foreground.
it does not suffer from the occlusion problem which is present in finite aperture imaging systems [9]. In the presence of occlusions, image points of the occluding object receive a mixture of light from both the focused background and the blurred foreground when the focus is set to the occluded background. Consequently, the corresponding image points of the occluding object may appear focused, even though the object is out of focus. This makes depth estimation for these regions ambiguous [9]. This situation is illustrated in Fig. 1. As seen in Fig. 1(c), the occluding object causes an attenuation in the brightness profile of the occluded region [13]. In order to address the ambiguity problem, we establish correspondences between the focused images of the virtual planes in the scene and the all-focus image of the scene from the same view. Assuming that the attenuation in the brightness profile is constant in a small patch, we use the normalized cross-correlation as a similarity measure because it is insensitive to brightness differences:

C_i(x, y) = NCC_\Omega(I_p(x, y), I_i(x, y)),   (1)
where I_p is the all-focus image and I_i are the images in the set. The robustness of the matching score C depends on the size of the patch Ω. Although increasing it also increases the reliability of matching by reducing the effect of noise, it results in shifts of the estimated locations of discontinuities. In order to increase robustness while preserving the locations of discontinuities, we aggregate the initial matching scores with larger but adaptively weighted support windows [14]. The weights of the support windows are assigned by using the all-focus image of the scene. The computation of the weights is based on similarity and proximity metrics between the center pixel and its neighboring pixels that fall inside the support window. Similar and closer pixels get larger weights under the assumption that they probably lie on the same surface. The weights are computed using the all-focus image I according to the following formula:

\omega_{x_0 y_0}(x, y) = e^{-(\Delta d/\gamma_1 + \Delta I/\gamma_2)},   (2)
Fig. 2. Our layered representation of the scene (a), results from adaptive shape from focus [14] (b), and shape from focus method [12](c). Although our layered representation does not hold correct depth values, it approximates the structure of the scene more accurately than shape from focus methods.
where

\Delta d = \sqrt{(x - x_0)^2 + (y - y_0)^2},   (3)

and

\Delta I = I(x, y) - I(x_0, y_0).   (4)
Δd and ΔI are the Euclidean distances in the spatial domain and in color space, respectively. γ_1 and γ_2 are constant parameters that supervise the relative weights. Using the computed correlation values, the matching scores are aggregated:

C_i(x_0, y_0) = \sum_{(x,y) \in \Omega(x_0, y_0)} \omega_{x_0 y_0}(x, y)\, C_i(x, y), \quad i = 1..N,   (5)
where N is the number of virtual planes. Then each image point is assigned to the layer with the maximum correlation value:

I_l(x, y) = \arg\max_{1 \leq i \leq N} C_i(x, y).   (6)
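As a concrete reading of Eqs. 2–6, the sketch below is illustrative only: it assumes a grayscale all-focus image so that ΔI reduces to an absolute difference, precomputed per-layer NCC score maps C (Eq. 1), and a square support window Ω; all names are hypothetical.

```python
import numpy as np

def assign_layer(C, I, x0, y0, omega=7, gamma1=5.0, gamma2=10.0):
    """Aggregate matching scores with adaptive weights (Eqs. 2-5) and pick the
    layer with the maximum aggregated score (Eq. 6) for pixel (x0, y0).

    C : list of N per-layer NCC score maps (Eq. 1), each of shape (H, W)
    I : all-focus grayscale image of shape (H, W), used to compute the weights
    """
    H, W = I.shape
    r = omega // 2
    ys, xs = np.mgrid[max(y0 - r, 0):min(y0 + r + 1, H),
                      max(x0 - r, 0):min(x0 + r + 1, W)]
    dd = np.sqrt((xs - x0) ** 2 + (ys - y0) ** 2)      # spatial distance, Eq. 3
    dI = np.abs(I[ys, xs] - I[y0, x0])                 # intensity distance, Eq. 4
    w = np.exp(-(dd / gamma1 + dI / gamma2))           # adaptive weights, Eq. 2
    scores = [np.sum(w * Ci[ys, xs]) for Ci in C]      # aggregation, Eq. 5
    return int(np.argmax(scores))                      # layer index, Eq. 6
```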
Figure 2 shows the layers of a sample scene image and the depth images of the scene found by traditional SFF [12] and by [14].
2.2 Synchronous Energy Formulation for Stereo
Our stereo energy formulation is built upon synchronous optimization processes which employ a segmented image to estimate the depth discontinuities in the scene [10]. Starting from the classical stereo energy functional, the synchronous optimization processes were formulated by introducing the E_tsn term as in the following equations, which are intended to be optimized synchronously and to produce two different disparity maps D_1 and D_2:

E(D_1) = \alpha E_{data}(D_1) + \beta E_{smth}(D_1) + \lambda_t E_{tsn}(D_1, D_2)   (7)
E(D_2) = \alpha E_{data}(D_2) + \beta E_{smth}(D_2) + \lambda_t E_{tsn}(D_2, D_1)   (8)
As in the classical stereo energy formulation, the data term E_data is for satisfying the image similarity requirement:

E_{data} = \int \phi(D)\, dp.   (9)
The similarity measure is calculated using normalized cross correlation (NCC) values between the left and right image regions due to its robustness against any brightness differences between the left and right images. In order to reject outliers in the data and to improve the convergence of the method, the data space should be pre-smoothed while preserving the discontinuities so that they can be recovered accurately. Therefore, we pre-smooth the data space with the bilateral filter [15], whose kernels are derived from the left image of the stereo pair. The term E_smth enforces smoothness on the desired disparity map:

E_{smth} = \int c\, |\nabla D|^2\, dp,   (10)

where c is the diffusion coefficient, which inhibits the smoothing across the marked discontinuities. The tension energy E_tsn is for the synchronization of the two optimization processes and is the core idea of the synchronous optimization method:

E_{tsn}(D_1, D_2) = \int (D_1 - D_2)^2\, dp.   (11)

The main function of the tension term is to make the disparity values of the two surfaces D_1 and D_2 get close to each other by pushing the optimization process with the worse data term towards the other process. Minimization of the energy functionals defined in Equations 7 and 8 with the tension terms yields the following equations:

\frac{\partial D_1}{\partial t} = \gamma \left( \alpha \frac{\partial \phi_{D_1}}{\partial D_1} + \beta \nabla \cdot (c \nabla D_1) + \lambda_t (D_1 - D_2) \right)   (12)
\frac{\partial D_2}{\partial t} = \gamma \left( \alpha \frac{\partial \phi_{D_2}}{\partial D_2} + \beta \nabla \cdot (c \nabla D_2) + \lambda_t (D_2 - D_1) \right).   (13)
The introduced tension energy forces both processes to converge to the same solution. However, continually forcing both processes to pull each other may result in convergence to an irrelevant local minimum. In order to prevent the processes from forcing each other symmetrically, we set λ_t to a spatially and temporally varying coefficient. The coefficient for the first process is computed as

\lambda_{t1} = \begin{cases} 1 - e^{-(\Delta\phi(D_1)/\lambda)^2} & \Delta\phi(D_1) \geq 0 \\ 0 & \text{otherwise} \end{cases},   (14)

where Δφ(D_1) = φ(D_1) − φ(D_2) and λ is a constant. The same coefficient is computed for the second process analogously. Note that the tension is no longer symmetric and is heavily dependent on the local positions of the processes, hence it takes a different value for each process. The process with the lower data term has a zero coefficient and its optimization is not affected by the other process. On the other hand, if a process has a higher data energy than the other, it will be pulled by the tension term towards the other process.
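A direct per-pixel reading of Eq. 14 is given in the following sketch (our illustration; `phi_D1` and `phi_D2` stand for the data costs φ(D1) and φ(D2), and the negative sign in the exponent is assumed so that the coefficient stays in [0, 1)):

```python
import numpy as np

def tension_coefficient(phi_D1, phi_D2, lam):
    # Eq. 14: the process with the lower data cost gets zero tension; the other
    # is pulled towards it with a weight that grows with the cost difference.
    dphi = phi_D1 - phi_D2
    lt = 1.0 - np.exp(-(dphi / lam) ** 2)
    return np.where(dphi >= 0.0, lt, 0.0)
```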
2.3 Discontinuity Preserving Nonlinear Regularization
An anisotropic regularizer can be used to prevent diffusion between inhomogeneous regions, i.e. across the discontinuities, to keep surface discontinuities from being oversmoothed. Consequently, an anisotropic disparity regularization process requires prior information about the possible locations of discontinuities. One practical and reasonable prediction can be made by analyzing intensity variations in the stereo images. Assuming that depth discontinuities overlap with some intensity discontinuities in the image, Alvarez [5] adjusted the amount of diffusion among the neighboring elements according to the intensity difference between them. One negative consequence of this assumption is the oversmoothing of depth discontinuities that have small intensity variations at the corresponding positions in the image. The diffusion constant c is defined as a function of the gradient of the image I in anisotropic smoothing as

c(x, y, t) = g(\nabla I),   (15)

where g is called the edge stopping function. In order to increase the accuracy of depth recovery around discontinuities, the intensity based discontinuity estimation step should be replaced by a more reliable and robust estimation. We propose to employ our layered representation described in Section 2.1 to take advantage of its superior performance at discontinuity localization. Using the layered image I_l, we define the edge stopping function g as

g(\nabla I_l) = e^{-(|\nabla I_l|/\kappa_1)^2},   (16)

where κ_1 is a constant. If we minimize the resulting equations using the diffusion coefficient in Equation 16, the processes turn out to be sensitive to local minima, especially around noisy and small surface patches in the estimated depth map, because they cannot get sufficient flow from the neighbors. Note that this situation does not occur in isotropic regularization, in which the diffusion coefficient c is taken as a constant so that flow is allowed between all regions. As a result, it is very robust against local minima but it oversmooths the discontinuities. In order to take advantage of both isotropic and anisotropic regularization, we introduce a novel diffusion operator which adapts itself to the meta-states of the synchronous processes and performs isotropic or anisotropic regularization adaptively. The operator utilizes the temporal position information of the synchronous processes. The temporal distance between the processes is given as

\Delta d(x, y, t) = |D_1(x, y, t) - D_2(x, y, t)|.   (17)
Initially, one of the synchronous processes is started from the minimum disparity values and the other is started from the maximum disparity values. Therefore, Δd has the maximum possible value. At this time, the regularization should be isotropic to avoid getting stuck in local minima. During the minimization, the processes get close to each other and Δd goes to zero. In order to prevent smoothing of discontinuities, the regularization should behave anisotropically as
the processes approach the desired minimum. At the end (Δd = 0), the operator should exhibit purely anisotropic behavior. We define the diffusion function c(x, y, t) by including the distance between the synchronous processes as

c(x, y, t) = (1 - h(\Delta d)) + h(\Delta d) \cdot g(\nabla I_l),   (18)

where

h(\Delta d) = e^{-(\Delta d/\kappa_2)^2},   (19)
where κ_2 is a constant. Initially, Δd has high values and h evaluates to nearly zero. If h is zero, the diffusion coefficient functions as an isotropic diffusion coefficient. When the processes reach the same positions (Δd = 0), h becomes 1 and the diffusion coefficient evaluates to g(∇I_l), so it functions as an anisotropic diffusion coefficient.
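The adaptive coefficient of Eqs. 16–19 can be written compactly as in the following sketch (illustrative; all arguments are assumed to be per-pixel NumPy arrays of the same shape):

```python
import numpy as np

def diffusion_coefficient(D1, D2, grad_Il, kappa1, kappa2):
    g = np.exp(-(np.abs(grad_Il) / kappa1) ** 2)   # edge stopping function, Eq. 16
    delta_d = np.abs(D1 - D2)                      # distance between processes, Eq. 17
    h = np.exp(-(delta_d / kappa2) ** 2)           # Eq. 19
    # Eq. 18: isotropic (c = 1) while the processes are far apart, anisotropic
    # (c = g) once they have converged to the same positions.
    return (1.0 - h) + h * g
```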
3 Experiments
The proposed method has been tested on real scenes and on real scenes with synthetically refocused images which are generated with the iris filter [16]. In all experiments, the matching cost is pre-smoothed with an 11x11-pixel bilateral filter. The scenes with known depth maps are obtained from the Middlebury [17] image base, where ground truth information for stereo pairs is available for benchmarking. 16 refocused images are produced for the Venus and Cones data sets, which have sharp depth discontinuities. Error rates of the proposed algorithm are computed for non-occluded areas, near discontinuities, and for complete images. Table 1 compares the error rates of the disparity maps obtained by employing the proposed layered form of the scene and a segmented image. Figure 3 shows the layered representation and the resulting disparity maps of our algorithm. The results show that our method can robustly recover piecewise smooth surfaces and preserve discontinuities well. Experiments on real scenes are performed using 25mm lenses on c-mount cameras. By changing the focus setting, only 10 images are captured from only the left view of the stereo setup to reconstruct the layered form. Our results with left stereo images are shown in Fig. 4.

Table 1. Error rates on visible, all, and discontinuous regions for our method using the layered representation of the scenes and a segmented image as the source of discontinuity estimation

                            Venus                                   Cones
error threshold     1 pixel             0.5 pixels          1 pixel             0.5 pixels
                    vis   all   disc    vis   all   disc    vis   all   disc    vis   all   disc
our method          0.10  0.25  1.41    0.5   0.89  5.30    5.74  12.0  16.4    8.24  15.0  21.3
with Segment [10]   0.32  0.40  3.45    1.23  1.52  9.55    7.08  14.5  19.7    8.34  16.2  22.4
Fig. 3. Layered representation of the Venus (a) and Cones (c) images and their corresponding disparity images (b,d) found by our method
Fig. 4. Left stereo images of the stereo pairs (a,b), and resulting disparity maps from our method (c,d)
4 Conclusions
We proposed a novel system that uses two energy functionals which are optimized by two dependent optimization processes. Unlike intensity based regularization methods, we utilized the layered representation of the scene as a source of discontinuity estimation. Our layered representation requires focused images of virtual planes in the scene, which can be obtained by changing the focus setting of one of the stereo cameras. Consequently, the system can be easily implemented with a simple setup. We also introduced a novel nonlinear diffusion operator which effectively utilizes the layered representation of the scene and the temporal positions of the synchronous processes. The operator is capable of adapting itself to the meta-states of the synchronous processes and performs isotropic or anisotropic regularization accordingly. Although the proposed system does not include an explicit occlusion mechanism, by using the layered representation of the scene, the proposed anisotropic operator propagates disparity values inside homogeneous regions and fills in the values of occluded regions from their neighbors lying on the same surface. This may not be possible in the intensity or segment based diffusion methods, especially when occluded regions have high intensity variation. Despite the advantages of our method, it currently cannot handle the situation in which disparity discontinuities are located inside the assigned layers.
Acknowledgements This work was conducted at the Computer Vision Laboratory at Gebze Institute of Technology. It was supported by TUBITAK Career Project 105E097.
References
1. Tappen, M.F., Freeman, W.T.: Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In: International Conference on Computer Vision, pp. 900–907 (2003)
2. Yoon, K.-J., Kweon, I.S.: Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 650–656 (2006)
3. Zhang, Y., Kambhamettu, C.: Stereo matching with segmentation-based cooperation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. Part II. LNCS, vol. 2351, pp. 556–571. Springer, Heidelberg (2002)
4. Hong, L., Chen, G.: Segment-based stereo matching using graph cuts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. I: 74–I: 81 (2004)
5. Alvarez, L., Deriche, R., Sanchez, J., Weickert, J.: Dense disparity map estimation respecting image discontinuities: A PDE and scale-space based approach. Journal of Visual Communication and Image Representation 13(1/2), 3–21 (2002)
6. Raskar, R., Tan, K.H., Feris, R., Yu, J., Turk, M.: Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Trans. Graph. 23(3), 679–688 (2004)
7. Zickler, T.E., Belhumeur, P.N., Kriegman, D.J.: Helmholtz stereopsis: Exploiting reciprocity for surface reconstruction. Int. J. Comput. Vision 49(2-3), 215–227 (2002)
8. Pentland, A.P.: A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4), 523–531 (1987)
9. Schechner, Y.Y., Kiryati, N.: Depth from defocus vs. stereo: How different really are they? Int. J. Comput. Vision 39(2), 141–162 (2000)
10. Aydin, T., Akgul, Y.: Stereo depth estimation using synchronous optimization with segment based regularization. Technical report, Gebze Institute of Technology, Kocaeli, Turkey (2008)
11. Ahuja, N., Abbott, A.L.: Active stereo: Integrating disparity, vergence, focus, aperture and calibration for surface estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(10), 1007–1029 (1993)
12. Nayar, S.K., Nakagawa, Y.: Shape from focus. PAMI 16(8), 824–831 (1994)
13. Asada, N., Fujiwara, H., Matsuyama, T.: Seeing behind the scene: analysis of photometric properties of occluding edges by the reversed projection blurring model. In: IEEE International Conference on Computer Vision, p. 150 (1995)
14. Aydin, T., Akgul, Y.S.: A new adaptive focus measure for shape from focus. In: BMVC 2008 (2008)
15. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: ICCV, pp. 839–846 (1998)
16. Sakurai, R.: Irisfilter (2004), http://www.reiji.net/
17. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47(1-3), 7–42 (2002)
Reconstruction of Sewer Shaft Profiles from Fisheye-Lens Camera Images
Sandro Esquivel1, Reinhard Koch1, and Heino Rehse2
1 Christian-Albrechts-University, Kiel, Germany
2 IBAK Helmut Hunger GmbH & Co. KG, Kiel, Germany
Abstract. In this paper we propose a robust image- and sensor-based approach for the automatic 3d model acquisition of sewer shafts from survey videos captured by a downward-looking fisheye-lens camera while lowering it into the shaft. Our approach is based on Structure from Motion adjusted to the constrained motion and scene, and involves shape recognition in order to obtain the geometry of the scene appropriately. The approach has been implemented and applied successfully to the practical stage as part of a commercial software package.
1 Introduction
Automatic sewer inspection is an important application for computer vision and robotics. Remotely controlled inspection devices based on mobile robots equipped with different sensors are commonly used for this task since the observed structures are often not directly accessible for humans or access is difficult to achieve. As regular inspection of manholes and sewer shafts is required by law, this special application is interesting for commercial systems. While different approaches to this problem exist – including solutions based on structured light, multi-frequency sonar, infrared sensors, or recently time-of-flight cameras – we rely on an approach which is based mainly on video sequences captured by a fisheye-lens camera provided with a flash light which is lowered into the manhole. Additional data acquired by a rotation sensor which is attached to the camera is used to facilitate the task. For reconstruction, we are able to assume additional constraints since the given problem of sewer shaft inspection using a hanging camera differs slightly from general sewer inspection. Figure 1 shows a commercial system using our approach which has been built by our industry partner IBAK Helmut Hunger GmbH & Co. KG. The video sequences used are byproducts of interactive shaft inspection.

Fig. 1. IBAK PANORAMO SI
Previous Work: An early idea for automatically recovering shape and camera pose relative to the pipe axis from sewer survey videos was presented in [1]. Kannala et al. [2,3] considered an approach for automatic 3d model acquisition from video sequences captured by a calibrated fisheye-lens camera moving through a sewer pipe. They recover camera positions and scene structure by computing calibrated multi-view tensors for image sub-sequences and merging the results hierarchically, which results in a point cloud approximating the scene structure as an initial 3d model. This approach suffers from error accumulation and sensitivity to inaccurate camera calibration, resulting in bent and conical reconstructions of pipes which are known to be straight. Our problem formulation is slightly different since we aim to measure the shape of a shaft from a camera hanging down rather than from a camera traveling through the sewerage. Our Approach: The main idea of our approach is to incorporate a priori knowledge about the scene geometry and to simplify the resulting 3d model appropriately to stabilize the whole reconstruction process. Our approach computes shaft profiles at different depths by a Structure from Motion approach, classifies them as appropriate 2d shapes, and builds a 3d model by connecting shapes from subsequent cross-sections. The reconstruction is geometrically corrected using knowledge about the camera motion. Since it is designed for practical purposes, the focus of our work is on flexibility, robustness, and automation of the reconstruction process.
2 Background
2.1 Problem Specification and Setting
The setting of our work is illustrated in Fig. 2: A fisheye-lens camera designed for sewerage survey is lowered vertically into a sewer shaft which is specified to be vertical with arbitrary basic shape, but often rectangular or with an elliptical profile. Images are captured at fixed translation intervals which can be measured accurately from the feed of the conducting cable (in our case, the camera moves up to 35 cm/s, but a flash ensures sharp images every 5 cm). Additionally, an inertial sensor is mounted to the camera which measures the roll rotation around the viewing axis for each image to compensate this rotation later in the images. While it is assumed that the camera is looking approximately along the axis of the shaft, the exact position of the camera is unknown. The camera might also oscillate around the cable axis. The task is to classify and measure the cross-sectional shape of the shaft at different depths robustly and to obtain an approximate 3d model of the shaft by appropriately merging profiles from subsequent cross-sections.

Fig. 2. Setup for inspection
Fig. 3. Input images captured by a hanging fisheye-lens camera during lowering
Figure 3 shows typical input images captured by the fisheye-lens camera while lowering it into a sewer shaft through the manhole. Apparently, the task of visual reconstruction is not trivial: Illumination and visibility decrease rapidly towards the center of the image, the hanging camera rotates significantly around its view axis, there are reflections especially on fronto-parallel parts of the shaft surface and obscuring structures such as stairs and branching pipes, and vision is very poor in larger rooms where the camera is located off-center.
2.2 3D Reconstruction Using Structure from Motion
There are several Structure from Motion (SfM) approaches using spherical cameras, such as the ones described in [6], [2,3] and [7]. Chang and Hebert [6] describe a SfM approach for general scenes using cameras with a wide field of view, such as fisheye-lens cameras, and analyze its uncertainty. They show that SfM performs better in certain situations (e.g. sparse scene structure, motion along the viewing direction of the camera) which apply to the given setting. Kannala et al. [2,3] compute the sparse structure of sewer pipes from feature point correspondences for image triplets by estimating the trifocal tensor, and merge local reconstructions by hierarchical bundle adjustment. We use an approach by Bartczak et al. [7]. They describe SfM from a dense sequence of spherical images of a rigid scene with no a priori knowledge about the camera motion, which avoids heuristics by heavily using information about measurement uncertainties and error propagation. The reconstruction process is separated into a bootstrapping and a tracking stage based on 2d-2d correspondences between subsequent images. During bootstrapping, an initial sparse scene structure and camera position are computed from the essential matrix estimated from 2d-2d correspondences between the first image pair. During tracking, subsequent camera poses are computed from 2d-3d correspondences while new scene structure is estimated from 2d-2d correspondences and existing 3d structure is updated using further 2d measurements. Bundle adjustment is used after bootstrapping to ensure a good initialization. All mentioned approaches rely on multi-frame feature point correspondences (trails) obtained using the well-known KLT feature tracker [8]. Because the input video sequence is rather sparse in our setting, feature prediction between subsequent images is inevitable, and bundle adjustment as a final step is not useful since feature trails are on average very short (3–5 images).
Note that without knowledge about the distance between the first two camera positions, 3d structure can only be estimated up to an unknown scale. In our setting, we can overcome this ambiguity since the translation amount of the camera between subsequent image captures is known to be approximately 5 cm. In [3], Kannala et al. derive the course of the pipe from the camera path. Nevertheless there is no reasonable distinction between error accumulation on the camera motion and factual curvature of the pipe. This can result in bent reconstructions of essentially straight pipes. Since the feature point correspondences from which camera motion is estimated can only be obtained for very few subsequent images, drift of the reconstructed camera path is very likely as has been observed in our experiments. In contrast to Kannala’s approach, in our setting the camera path is known to be oscillating around a straight line – given by the vector of gravity – which will allow us to correct the reconstruction.
3 Our Approach
Our algorithm is composed of the following steps, which will be explained in detail in the following sections:

1. Reconstruction of 3d points on the shaft surface
   (a) Cylinder-mapping of the input camera images, removing roll rotation using the input of an additional rotation sensor (pre-processing step).
   (b) Structure from Motion as described in [7] using cylinder-mapped images with problem-specific feature prediction (tracking phase).
   (c) Correction of the reconstructed geometry and camera motion using a priori knowledge about motion and scene geometry (post-processing step).
2. Contour shape classification and shape fitting in cross-sections of the shaft, and construction of a simple 3d geometry by connecting contours from subsequent cross-sections and optional 3d shape fitting.

Prior to application, common calibration techniques for fisheye-lens cameras [5] are used in order to estimate the intrinsic parameters and radial distortion of the camera. While the camera used is almost distortion-free, note that we allow the focal length of the camera calibration to have an error of up to several percent for the sake of robustness.
3.1 Reconstruction of 3D Points
Reconstruction of 3d points relies on the following assumptions which are supposed to hold for the given problem setting:

– The viewing direction of the camera is mainly along the shaft axis with up to 5–10° pan/tilt rotation and almost no roll rotation.
– The average motion vector of the camera coincides with the shaft axis.
– Within shaft sections, the camera-local shaft profile changes only slightly between subsequent images due to small transversal motion and pan/tilt rotation of the camera, or continuous profile changes (e.g. conical sections).
– Abrupt changes of the entire camera-local shaft profile indicate geometry changes of the shaft (e.g. at the junction of shaft and base room).
Cylinder-mapping of camera images. Existing approaches for image-based sewer reconstruction detect and track feature points directly in the spherical camera images resp. apply local perspective undistortion first [2,3]. In our work we determined that “unwinding” the image according to spherical coordinates as seen in Fig.4 (left) – approximating an image of the unrolled shaft surface – facilitates the feature tracking process as long as the camera’s viewing direction is approximately parallel to the shaft axis. We account for a ring-shaped part of the camera’s field of view corresponding to the viewing range between zenith angles θmax ≥ θ ≥ θmin (here: θmax := 85◦ , θmin := 45◦ ) which holds the most usable visual information (compare Fig.3 (left) and Fig.4 (left)). Rotation φRS around the camera’s z-axis measured by an inertial sensor is used during mapping to compensate roll rotation of the camera. This is easy since roll rotation results simply in a vertical shift of the cylinder image. Tracking points in cylinder-mapped images. For feature detection and tracking in the cylinder-mapped images we use an implementation of the KLT feature tracker [8]. Since the displacement of image points in subsequent images is large, even for multi-resolution tracking, either feature prediction or region search is necessary. Tracking points in cylinder-mapped images is based on the fact that feature points move mainly along image rows, and disparity deviation is distinctive for points in different image columns but only small within the same row. Figure 4 illustrates the disparity between corresponding cylinder image points for an exemplary motion along the cylinder axis with off-center position, small transversal motion and 5◦ pan/tilt rotation. The displacement vectors of corresponding image points are basically horizontal, average disparity varies strongly with respect to vertical image position but only little with respect to horizontal image position. The latter curve depends on the cross-sectional shape of the cylinder and the excentricity of the camera. The former curve is affected by roll rotation between images which is minimal since it is compensated during mapping.
Fig. 4. Average column/row-wise disparities for corresponding cylinder image points
As a consequence, the following feature tracking methods are proposed:

Row Scan (Init): Without any knowledge about the shaft's diameter and profile shape, feature point correspondences are generated by scanning along the image row and computing a similarity measure between feature points detected in the previous and current images. The vertical tolerance is defined by the maximal valid distance to the shaft surface (here: 300 cm) and the maximal pan/tilt rotation (here: 10°). This method is used for initialization and for re-initialization when tracking fails during SfM (e.g. due to a noticeable shaft geometry change).

Row Track: After initialization, the average row-wise disparity of the last images is used to predict feature point positions within the current image.
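The Row Track prediction amounts to a simple shift along image rows, as in the following sketch (illustrative only; `prev_pts` is assumed to hold (column, row) feature positions in the cylinder image and `row_disparity` the mean horizontal displacement per row estimated from the last images):

```python
import numpy as np

def predict_row_track(prev_pts, row_disparity):
    # Shift each feature along its image row by the average row-wise disparity,
    # exploiting that cylinder-image features move mainly horizontally.
    pred = prev_pts.astype(float)
    rows = prev_pts[:, 1].astype(int)
    pred[:, 0] += row_disparity[rows]
    return pred
```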
3.2 Geometric Correction
As described above, in SfM the camera motion estimation tends to drift, which results in globally erroneous reconstructions as shown in Fig. 5 (left): The shaft will be bent and its diameter will narrow resp. widen over time. The latter effect occurs prominently as a systematic error when the focal length of the fisheye-lens camera has not been calibrated correctly before application. Common solutions to the problem of error accumulation, such as multi-frame SfM, depend on tracking feature points through a large number of frames and thus cannot be applied here. On the other hand, the reconstruction is nevertheless locally correct since inter-frame errors are small. In the following, a simple approach is described for correcting the reconstructed camera poses. Once this is done, structure is corrected locally with respect to the camera by which it was originally seen, resulting in a globally consistent reconstruction.

Fig. 5. 3d points before (left) and after geometric correction (right)

In our setting, the camera is hanging into the shaft by its own weight on a static cable. Image acquisition is triggered at certain equidistant amounts of feed (5 cm). Hence for camera pose correction we can assume (a) that the average camera path approximates the vector of gravity, and (b) that the distance between subsequent camera positions along the vector of gravity is fixed and known. Geometric correction (GC) is hence accomplished as follows:

1. First, approximate the mean "drifted" camera motion by fitting a polynomial curve p(t) := (p_x(t), p_y(t), t) to all camera positions C_0, ..., C_N.
2. Correct the camera positions C_0, ..., C_N by "unbending" the mean camera motion so that it is mapped to the world z-axis (i.e. the vector of gravity).
3. Correct the camera positions further by rescaling the camera motion locally so that the inter-camera distance is equalized to 5 cm each.
4. Correct the pan/tilt rotation of each camera such that the local "drifted" gravity vector ∇p(t) becomes parallel to the world z-axis. The estimation of camera roll rotation is assumed to be drift-free.
5. Finally, the positions of all 3d points X_j are updated with respect to the new pose of the first camera C_i they were visible in.

Note that geometric correction transfers all 3d points and camera poses into a common coordinate frame which is registered to the (ideal) shaft geometry – i.e. the z-axis is parallel to the shaft axis and the x/y-plane is parallel to cross-sections.
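A minimal sketch of steps 1–3 of the camera-path correction is given below (our illustration, not the original implementation; `cams` is assumed to be an (N+1) × 3 array of reconstructed camera centers ordered by capture, and the polynomial degree is an arbitrary choice):

```python
import numpy as np

def correct_camera_path(cams, step=5.0, degree=3):
    t = cams[:, 2]                           # parameterize the path by the z coordinate
    # 1. Fit the mean "drifted" path p(t) = (px(t), py(t), t).
    px = np.polyfit(t, cams[:, 0], degree)
    py = np.polyfit(t, cams[:, 1], degree)
    corrected = cams.copy()
    # 2. "Unbend": subtract the mean path so that it maps onto the world z-axis.
    corrected[:, 0] -= np.polyval(px, t)
    corrected[:, 1] -= np.polyval(py, t)
    # 3. Simplified local rescaling: consecutive cameras are `step` cm apart along z.
    corrected[:, 2] = step * np.arange(len(cams))
    return corrected
```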
3.3 Shape Classification and Estimation
The goal of the next stage is to estimate the average cross-sectional shape of the shaft at M evenly distributed depths h_0, ..., h_{M−1} (here: h_i := i · 5 cm each). First, the 3d points are partitioned into M slices S_0, ..., S_{M−1}, where each slice S_i consists of the 2d projections (x_j, y_j) of all 3d points (x_j, y_j, z_j) with z_j ∈ [h_i, h_{i+1}[ onto the x/y-plane. Common 2d shape fitting methods – such as [10] for ellipses, [11] for rectangles or [12] for closed spline curves – are used to obtain shape estimates for each slice S_i. To enable robustness against 3d points that do not lie on the shaft surface or result from incorrect triangulation, a RANSAC approach is used in combination with the shape fitting methods. The classification is done by fitting an instance of each shape class to all 2d points in slice S_i robustly, evaluating a quality score for each shape, which is based on the average geometric distance of all inliers and weighted such that shape class changes are penalized, and selecting the shape with the highest resulting score.
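The partitioning into cross-sectional slices can be sketched as follows (illustrative; `points` is an array of corrected 3d points and `slab` the slice height in cm):

```python
import numpy as np

def slice_points(points, slab=5.0):
    # Assign each 3d point to a depth slice and keep only its x/y projection.
    idx = np.floor((points[:, 2] - points[:, 2].min()) / slab).astype(int)
    slices = {}
    for i, p in zip(idx, points):
        slices.setdefault(int(i), []).append(p[:2])
    return {i: np.asarray(v) for i, v in slices.items()}
```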
3.4 3D Model Creation from Cross-Sections
In general, the structure of sewer shafts can be modelled as a sequence of straight segments of extrusion-like geometries (i.e. generalized cylinders) without branches. Hence the 3d surface can be constructed simply by connecting subsequent contours of the same shape class with sufficiently small differences in shape parameters and interpolating linearly between cross-sections. The reconstructed model can be further simplified by fitting special extrusion surfaces (e.g. cylinders for elliptical cross-sectional shape, cuboids for rectangular shape) to the 3d points within each segment (see e.g. [9]).
4 Experiments and Results
4.1 Evaluation with Real Labelled Data
To evaluate the performance of our approach with real data, our industry partner has provided us with a number of video sequences that have been captured from different sewer shafts (44 sequences from 36 different shafts). The observed shafts show a great variety of depth, diameter and shape. First, we identified the shape and diameter for 60 subsequences ("reference sections") manually using previous knowledge about the parts the shafts are made up from. We applied our algorithm to each video sequence and evaluated for each reference section whether the
correct shape class was identified and measured the average estimation error for each correctly classified cross-section by comparing it with the manual reference data. For elliptical cross-sections the average diameter error is regarded, for rectangular cross-sections the average lateral length error. The results are shown in Fig.6. For each reference section, the average diameter estimation error and the standard deviation of the errors are shown. In order to evaluate the performance of the proposed geometric correction, we applied our implementation once with and again without geometric correction. Note that the last 3 sequences have in fact pulvinate rectangular shape. Our approach failed for 3 reference sequences that are not shown in Fig.6 which have pentagonal shape. Apparently, the average relative error is ca. 1–2% which corresponds to an absolute error of ca. 2 cm in diameter resp. lateral length. Since the reference data is idealized and does not pay attention to possible local deformations of the shafts, the comparison has to be interpreted rather as a verification of our approach than as an exact evaluation of accuracy. Note also that using geometric correction the estimated cross-sectional diameters vary less than without geometric correction and the overall accuracy improves significantly. Geometric correction is also capable of compensating deviating scale errors resulting from inaccurate focal length calibration of the camera as shown in Fig.7 for one shaft.
Fig. 6. Results of our algorithm with and without geometric correction (GC) for 57 out of 60 shaft segments with fixed and approximately known diameter
Fig. 7. Estimation results for shaft sequence no. 8 with varying focal length f without GC (top) and with GC (bottom). Note the different scales of the graphs.
4.2 Practical Issues: Robustness and Runtime
Contour-based model generation with GC failed for only 3 complete sequences and 3 subsequences out of the test set of 44 sequences: 3 cases show a shaft with non-standard pentagonal geometry, which is not supported so far. The other reconstructions failed due to very poor visibility and strong reflections; both affected shafts consist mainly of the base room (5–6 images) following a very short shaft. Nevertheless, SfM and GC succeeded for all sequences, but resulted in very sparse point clouds for the latter 2 shafts. Without GC, model generation failed for 2 additional shafts where the GC approach succeeded. Test applications carried out by our industrial partner with more than 160 shafts yielded similar results: contour-based reconstruction failed for 4 shafts due to illumination/visibility problems, while point-based reconstruction always succeeded with plausible results. The total runtime of our implementation is essentially linear in the number of input images. Repeated tests with all provided test sequences yielded an average runtime of 0.45 ± 0.07 s per image on a PC with a 2.66 GHz CPU and 4 GB RAM. The runtime is expected to improve significantly with further optimization, e.g. by utilizing the GPU for tasks such as computing the cylinder mapping, which currently consumes ca. 30% of the total runtime. Nevertheless, the runtime is already acceptable for post-processing image sequences of up to 500 frames (i.e. 10 m of shaft) in clearly less than 5 min.
4.3 Resulting 3D Models
Using our algorithm we were able to build simplified 3d models of the surveyed shafts. Figure 8 shows the reconstruction results (i.e. the corrected 3d points resulting from the SfM, and the original and simplified wire frame mesh of the identified contours) for one of the shafts, which consists of a conical part below the manhole, a cylindrical main part, and a cubic base room. The example illustrates that our SfM approach is capable of reliably recovering even fine structures such as the stairs or the channel at the ground of the shaft (Fig. 8, left), while the contour classification is robust enough to regard such structures as outliers with respect to the basic shape (Fig. 8, right). Although we build only wire frame models from the resulting geometry, standard texture mapping techniques could be used to reconstruct a fully textured 3d model from the video sequences and the reconstructed geometry, which would allow virtual navigation through the surveyed shaft and measurements within it.
Fig. 8. Corrected 3d points with camera path and wire frame model
5 Conclusions
We have proposed a robust, practical approach for automatic shape measurement and 3d reconstruction of sewer shafts using a fisheye-lens camera equipped with an inertial sensor unit. Our approach overcomes problems reported by similar works on building 3d models for sewerage, such as bent or conical reconstructions and the restriction to elliptical pipes [2,3]. It can easily be extended on demand to support shaft shapes other than ellipses, rectangles, and free-form curves, e.g. ovoid or polygonal shapes. An implementation has been successfully transferred to practice in cooperation with our industrial partner IBAK as part of the software for the widely used PANORAMO® SI system (see Fig. 1). Practical test applications, among others by the Göttinger Entsorgungsbetriebe [13], have shown that our approach is robust and useful. Future Work: Up to now our approach is performed as a post-processing step. We are planning to merge all parts of our approach into an online process. By building an approximate 3d model during tracking, a more elaborate feature prediction can be performed by projecting feature points onto the estimated scene surface – approaching an on-the-fly analysis-by-synthesis technique.
References
1. Cooper, D., Pridmore, T.P., Taylor, N.: Towards the Recovery of Extrinsic Camera Parameters from Video Records of Sewer Surveys. Machine Vision and Applications 11, 53–63 (1998)
2. Kannala, J.: Measuring the Shape of Sewer Pipes from Video. Master thesis, Helsinki University of Technology, Helsinki (2004)
3. Kannala, J., Brandt, S.S., Heikkilä, J.: Measuring and Modelling Sewer Pipes from Video. Machine Vision and Applications 19(2), 73–83 (2008)
4. Zhang, Z.: Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. In: Proc. ICCV, pp. 666–673 (1999)
5. Scaramuzza, D., Martinelli, A., Siegwart, R.: A Flexible Technique for Accurate Omnidirectional Camera Calibration and Structure from Motion. In: Proc. ICVS, p. 45 (2006)
6. Chang, P., Hebert, M.: Omni-Directional Structure from Motion. In: Proc. IEEE Workshop on Omnidirectional Vision, p. 127 (2000)
7. Bartczak, B., Köser, K., Woelk, F., Koch, R.: Extraction of 3D Freeform Surfaces as Visual Landmarks for Real-Time Tracking. Journal of Real-Time Image Processing 2(2–3), 81–101 (2007)
8. Tomasi, C., Kanade, T.: Detection and Tracking of Point Features. Carnegie Mellon University Technical Report CMU-CS-91-132 (1991)
9. Pratt, V.: Direct Least Squares Fitting of Algebraic Surfaces. Computer Graphics 21(4), 145–152 (1987)
10. Fitzgibbon, A., Pilu, M., Fisher, R.: Direct Least Squares Fitting of Ellipses. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(5), 476–480 (1999)
11. Chauduri, D., Samal, A.: A Simple Method for Fitting of Boundary Rectangles to Closed Regions. Pattern Recognition 40(7), 1981–1989 (2007)
12. Dierckx, P.: Curve and Surface Fitting with Splines. Oxford University Press, Oxford (1993)
13. Burger, B., Fiedler, M., Gellrich, J., Reuter, H.-P.: Schachtinspektion in neuer Qualität. Bauwirtschaftliche Information UmweltBau Nr. 1 (2009)
A Superresolution Framework for High-Accuracy Multiview Reconstruction
Bastian Goldlücke and Daniel Cremers
Computer Science Department, University of Bonn, Germany
{bastian,dcremers}@cs.uni-bonn.de
Abstract. We present a variational approach to jointly estimate a displacement map and a superresolution texture for a 3D model from multiple calibrated views. The superresolution image formation model leads to an energy functional defined as an integral over the object surface. This functional can be minimized by alternately solving a deblurring PDE and a total variation minimization on the surface, leading to increasingly accurate estimates of photometry and geometry, respectively. The resulting equations can be discretized and solved in texture space with the help of a conformal atlas. The superresolution approach to texture reconstruction recovers fine details in the texture map that surpass the resolution of the individual input images.
1 Introduction
Modern image-based 3D reconstruction algorithms achieve high levels of geometric accuracy. However, due to intrinsic limitations such as voxel volume resolution and mesh size, or limits imposed by the application, the geometric resolution of the model is usually well below the pixel resolution of a rendering. This leads to a number of problems if one wants to estimate a texture for the model from the camera images. Mainly, since the geometry is never perfectly accurate, the image registration cannot be exactly correct, which leads to a blurry estimated texture, Fig. 1. Consequently, previous methods for texture generation usually employ some form of additional registration before estimating texel color [1,2,3]. In methods fitting a local lighting model on a per-texel basis, it generally holds that the fewer source cameras influence the result for a single texel, the sharper the resulting texture will be. However, if only the contributions of a few cameras are blended for a given texture patch, seams and discontinuities are likely to arise at visibility boundaries, so some form of stitching has to take place to smooth the result [4,5]. Furthermore, not using all available source images implies discarding a lot of potentially useful information. The superresolution framework presented in this paper is designed to alleviate these problems. We account for the interdependency of geometry and photometry by minimizing a single functional with respect to both a displacement map and a superresolved texture. The image formation model is based on current state-of-the-art superresolution frameworks [6,7,8], for which there is a well-developed theory [9]. Because every patch of the surface is captured from several cameras, by adopting this
Fig. 1. From left to right: (a),(b) Two out of forty input images for a multiview reconstruction. (c) Close-up of one of the low-resolution input images. (d) Rendered model with blurry texture initialized by weighted averaging of the input images. (e) High-quality texture optimized with the proposed superresolution approach.
model, we are able to recover the texture at a higher resolution and level of detail than provided by the input images. By design, the method scales very well with the number of input cameras: more cameras will always lead to a more accurate solution. While we introduced superresolution textures in [10], the presented framework is the first formulation for joint geometry optimization and superresolution texture estimation in multiview stereo. The resulting models are of excellent quality and can be rendered from arbitrary viewpoints. The displacement map helps with high-fidelity relighting.
2 Displacement Maps and Texture Superresolution
In this section, we introduce a superresolution image formation model, where the camera images depend on the unknown displacement map and texture. The model induces an energy functional which is minimized by the desired optimal maps. Let I1, . . . , In : Ωi → R be the input images captured by cameras with known projections π1, . . . , πn : R3 → R2. The cameras observe a known Lambertian surface Σ ⊂ R3, which is textured with the unknown texture map T : Σ → R. At each point of the surface, we allow a small displacement of the geometry in normal direction. This displacement is given by a displacement map D : Σ → R, the second unknown in the model.
2.1 Variational Formulation
The basic idea is to recover both the unknown texture map and the displacement map as the minimizer of a joint energy functional, consisting of a data term and a regularization term for both maps, E(T, D) := Edata(T, D) + Etv(T, D), with

E_{tv}(T, D) := \int_\Sigma \sigma_t \, \| \nabla_\Sigma T \|_\Sigma + \sigma_d \, \| \nabla_\Sigma D \|_\Sigma \; ds.   (1)
Here, σt , σd ≥ 0 are parameters controlling the desired smoothness of the texture and displacement map, respectively. Reasonable choices are σt , σd = 1. The differential operators on the surface and the norm on the tangent space are explained in detail in [11]. The total variation norm of the texture was chosen as the regularizer, because compared
Fig. 2. Texture space and computation grid. (a) The various mappings connecting texture space T, the surface Σ and the image planes Ωi. (b) Boundary texel neighbour connections on the computation grid are established by searching in normal direction.
to alternatives, it is better suited to preserve a crisp texture with sharp high-resolution features. The data term is based on the current state-of-the-art superresolution model [8], with the limitation that we currently do not take noise in the input images into account explicitly. The idea is that a real-world camera downsamples the input by integrating over the visible texels inside each sensor element. This integration process is modeled with a convolution kernel b, which can be derived from the properties of the camera. Possible choices are discussed in [9]; we use a Gaussian with a standard deviation of half the pixel size. The resulting data term is

E_{data}(T, D) := \sum_{i=1}^{n} \int_{\hat{S}_i} \left( b * (T \circ \beta_i^D) - I_i \right)^2 dx.   (2)
T ◦ βiD denotes the visible texture intensity of the high-resolution input in the image plane. The backprojection mappings βiD : Si → Σ assign the visible point on the surface to each point in the silhouettes Si := πi(Σ) ⊂ R2, Fig. 2(a). Note that this backprojection depends on D. Actual integration takes place over the smaller subset Ŝi ⊂ Si where all of the kernel covers only points within the silhouette.
2.2 Transformation of the Data Term to the Surface
In order to find a local minimum of the energy, it is necessary that integration for both data and regularization term takes place over the surface. A straightforward transformation of the integral yields

E_{data}(T, D) = \sum_{i=1}^{n} \int_\Sigma v_i^D \, J_i^D \, \left( E_i^2 \circ \pi_i^D \right) ds.   (3)
Here, the error images Ei are defined for abbreviation as the difference between the current rendering of the object and the original images,

E_i := b * (T \circ \beta_i^D) - I_i.   (4)
The binary functions v_i^D : Σ → {0, 1} indicate visibility of a surface point in an image,

v_i^D(s) := \begin{cases} 1 & \text{if } \pi_i^D(s) \in \hat{S}_i \text{ and } s = \beta_i^D \circ \pi_i^D(s), \\ 0 & \text{otherwise.} \end{cases}   (5)

Finally, J_i^D is the inverse surface area element with respect to the backprojection,

J_i^D(x, y) = \left\| \frac{\partial \beta_i^D}{\partial x} \times \frac{\partial \beta_i^D}{\partial y} \right\|^{-1}.   (6)

Ji accounts for foreshortening of the surface in the input views, and is small in regions where the backprojection varies strongly, which is usually the case at silhouette boundaries or discontinuities of the backprojection due to self-occlusions. As a consequence, we have the desirable property that in those regions where texture information from the image is unreliable, the input is assigned less weight. Note that we did not need any heuristic assumptions to arrive at this weighting scheme. It is rather a direct mathematical consequence of the variational formulation. While in general J_i^D and v_i^D depend on the displacement D, in the following we approximate both by Ji := J_i^0 and vi := v_i^0. This simplification is necessary to make the computation of a local minimum technically feasible.
2.3 Solving for the Superresolution Texture
The first step is to keep the displacement D constant and solve for the superresolution texture T. To this end, we minimize the functional

E(T) = \int_\Sigma \| \nabla_\Sigma T \|_\Sigma + \sum_{i=1}^{n} \frac{v_i}{\sigma_t} J_i \left( E_i^2 \circ \pi_i^D \right) ds   (7)

by solving the Euler-Lagrange equation

\mathrm{div}_\Sigma \left( \frac{\nabla_\Sigma T}{\| \nabla_\Sigma T \|_\Sigma} \right) + \sum_{i=1}^{n} \frac{v_i}{\sigma_t} J_i \left( \bar{b} * E_i \right) \circ \pi_i^D = 0,   (8)
which is a PDE on the surface Σ. The mirrored kernel b̄(x) := b(−x) stems from the directional derivative of the convolution operation [12]. After transformation to 2D texture space, the Euler-Lagrange equation can be solved via a gradient descent scheme resembling a deblurring process.
2.4 Solving for the Displacement Map
In the second optimization step, we keep the texture constant. Thus, the functional to be minimized for the displacement map is

E(D) = \int_\Sigma \| \nabla_\Sigma D \|_\Sigma + \sum_{i=1}^{n} \frac{v_i}{\sigma_d} J_i \left( E_i^2 \circ \pi_i^D \right) ds.   (9)
Fig. 3. Illustration of the charts and parametrization mappings. From left to right: (a) Chart domains dom(τj ) in R2 forming the texture space T. (b) Corresponding regions Cj on the surface. (c) Texture map T = T ◦ τj on texture space. (d) Texture T mapped on surface.
Because the data term is non-convex and no good initialization is readily available, we minimize it with a different approach. We make the simplifying assumption that for each point, D is constant in the sampling area of the kernel b, so the energy takes the form

E(D) = \int_\Sigma \| \nabla_\Sigma D \|_\Sigma + \rho_s \circ D \; ds   (10)
with a point-wise data term ρs. Based on [13], we introduce an auxiliary variable U and decouple the regularization from the point-wise optimization by defining a convex approximation to the energy,

E(U, D) = \int_\Sigma \| \nabla_\Sigma U \|_\Sigma + \frac{1}{2\theta} (U - D)^2 + \rho_s \circ D \; ds.   (11)
For θ → 0, the solution of this auxiliary problem approaches the solution to the original problem, as the coupling term forces U to be close to D. The idea is that for fixed U , we can perform a point-wise optimization in D, since no spatial derivatives of D appear in the functional. On the other hand, for fixed D, the resulting energy functional resembles the ROF model, which is convex and thus can also be optimized globally. Thus, by alternating two global optimization steps, one can arrive at a good minimizer for the original energy (9), which will however in the general case be only a local minimum.
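As an illustration only (not the implementation used by the authors), the following Python sketch carries out this alternation on a flat 2D grid, i.e. with conformal factor 1, using a simple explicit gradient-descent step for the ROF-like subproblem instead of the algorithm of [13]; the per-texel photoconsistency costs rho are assumed to be precomputed for a discrete set of candidate displacements.

import numpy as np

def rof_step(U, D, theta, tau=0.1, eps=1e-3):
    # One explicit descent step for minimizing |grad U| + (U - D)^2 / (2*theta).
    gx, gy = np.gradient(U)
    mag = np.sqrt(gx ** 2 + gy ** 2 + eps ** 2)
    div = np.gradient(gx / mag, axis=0) + np.gradient(gy / mag, axis=1)
    return U + tau * (div - (U - D) / theta)

def pointwise_search(U, rho, d_values, theta):
    # Exhaustive per-texel search (cf. Eq. (14), written with the coupling of Eq. (11)):
    # pick the displacement minimizing (U - D)^2/(2*theta) + rho(D) over the allowed set.
    costs = np.stack([((U - d) ** 2) / (2.0 * theta) + rho[i]
                      for i, d in enumerate(d_values)])
    return np.asarray(d_values)[np.argmin(costs, axis=0)]

def optimize_displacement(rho, d_values, shape, theta=0.05, outer=20, inner=50):
    # Alternate the two convex subproblems; 'rho' holds one hypothetical cost map per
    # candidate displacement value.
    D = np.zeros(shape)
    for _ in range(outer):
        U = D.copy()
        for _ in range(inner):
            U = rof_step(U, D, theta)
        D = pointwise_search(U, rho, d_values, theta)
    return D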
3 PDE-Based Energy Minimization on Texture Space
In order to obtain a high-resolution representation of the texture and displacement map, we require a global parametrization of the surface and define the texture and displacement map on a grid in 2D space. As the goal is to solve a PDE on the surface, it is desirable to have a conformal parametrization, because then one obtains a particularly simple representation of the differential operators [14,15]. Our method to compute a conformal atlas is a straightforward implementation of [14]. It is fully automatic and has the desirable property that chart boundaries tend to coincide with high-curvature edges on the surface.
Start with D = 0 and set T to the per-texel weighted average of the projected pixel colors in the input views, with weights given by the backprojection area element Ji. Then, alternate between solving the following two optimization problems until convergence.
Superresolution texture optimization: Keep D fixed and, via gradient descent, obtain a texture map T : T → R which satisfies the Euler-Lagrange equation

\frac{1}{\lambda} \, \mathrm{div} \left( \sqrt{\lambda} \, \frac{\nabla T}{\| \nabla T \|} \right) + \sum_{i=1}^{n} \frac{v_i}{\sigma_t} \left( J_i E_i \right) \circ \phi_i^D = 0   (12)

on the chart domains dom(τj), j = 1, . . . , k. Here, φ_i^D := π_i^D ◦ τ are the mappings from texture space into the image planes, taking into account the current displacement. λ assigns the conformal factor of the parametrization to each point in T, and v_i := v_i ◦ τ indicates visibility of a texel in image i. Ei and Ji are defined according to Eqns. (4) and (6), respectively.
Displacement map optimization: Keep T fixed and alternate between solving the following two optimization problems until convergence.
– For D fixed, find the solution U of the Euler-Lagrange equation of the ROF model,

\frac{1}{\lambda} \, \mathrm{div} \left( \sqrt{\lambda} \, \frac{\nabla U}{\| \nabla U \|} \right) + \frac{1}{\theta} (U - D) = 0   (13)

via gradient descent.
– For U fixed and every x ∈ T, find the global optimum D(x) of

(U(x) - D(x))^2 + \rho_{\tau(x)}(D(x))   (14)

using a complete search in the allowed displacement range.
Fig. 4. Algorithm for joint displacement map and superresolution texture
3.1 Conformal Maps and Differential Operators
Assume for the following that we have a collection of k charts (Cj, τj) with chart areas Cj ⊂ Σ and mappings τj : dom(τj) → Cj. The union T := ∪_{j=1}^{k} dom(τj) of the chart domains is called the texture space, and to simplify notation, the single mappings τj are combined to form a global mapping τ : T → Σ. Fig. 3 illustrates the concept. Since the parametrization is conformal, the Jacobian of each τj is everywhere a scalar λ, called the conformal factor, times a rotation matrix. Let λ : T → R be the mapping assigning the conformal factor to each point in texture space. Then the smoothness term of the Euler-Lagrange equation (8) can be expressed by pulling it back onto texture space [11]:

\mathrm{div}_\Sigma \left( \frac{\nabla_\Sigma T}{\| \nabla_\Sigma T \|_\Sigma} \right) = \frac{1}{\lambda} \, \mathrm{div} \left( \sqrt{\lambda} \, \frac{\nabla T}{\| \nabla T \|} \right),   (12)

where T := T ◦ τ is the texture map of the surface defined on the texture space T, see Fig. 3(c). An analogous expression holds for U := U ◦ τ in the gradient descent equation of the ROF model, Eq. (13). We also define the displacement map D := D ◦ τ on texture space.
3.2 Discretization and Computation Grid
For discretization, the texture space is subdivided into a grid of texels. The grid needs to admit a flexible topology, since only texels in the interior of chart domains are connected to their direct neighbours. On the boundary of charts, neighbourhood is established according to the correct relationships on the surface. To achieve this, we take the outer normals of boundary texels in texture space and transform them up onto the surface. Then, we search for the closest texel of the neighbouring chart in this direction, and assign that one as a neighbour, Fig. 2(b). For discretization of (12), we employ the scheme from [16], which offers improved rotation invariance. The diffusion tensor G is set to the (isotropic) regularized TV flow,

G = \frac{\sqrt{\lambda}}{\max(\epsilon, \| \nabla T \|)} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},   (13)

where ε > 0 is a small regularization parameter.
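For illustration, one explicit update step of this regularized TV flow could look as follows in Python; this is our own simplification, using plain central differences on a regular grid and ignoring both the chart-boundary topology and the rotation-optimized stencil of [16].

import numpy as np

def tv_diffusion_step(T, sqrt_lam, eps=1e-3, tau=0.2):
    # One explicit step of the regularized TV flow of Eq. (13): scalar diffusivity
    # g = sqrt(lambda) / max(eps, |grad T|) applied as an isotropic diffusion.
    gx, gy = np.gradient(T)
    g = sqrt_lam / np.maximum(eps, np.sqrt(gx ** 2 + gy ** 2))
    div = np.gradient(g * gx, axis=0) + np.gradient(g * gy, axis=1)
    return T + tau * div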
Fig. 5. Estimated displacement map for the Bunny dataset. From left to right: (a) Rendering with Gouraud shading. The underlying mesh has low geometric detail. (b) Normal map lighting showing improved geometric detail from the estimated displacement map. (c) Rendering with superresolution texture and normal map lighting.
Fig. 6. While the superresolution texture estimate (b) already improves over the commonly used weighted average (a), the jointly estimated displacement map leads to a much more detailed result (c)
Fig. 7. Results from three real-world multiview datasets. The 3D model is rendered with the texture from texel-wise initialization (a) and the texture resulting from the proposed joint displacement map and superresolution algorithm (b). The result has more visible details than an input image taken from the same viewpoint (c). The rows below the large images show some closeups. The reader is invited to zoom in on the electronic version to better appraise the differences.
3.3 Final Algorithm Implementation
Summarizing the results from the previous sections, in order to arrive at an optimal displacement map D and texture T, we need to solve the problem described in Fig. 4. For reasonable performance, an efficient parallelized implementation of the terms occurring in the equations is crucial. The most time-consuming part is to compute the backprojection mapping βiD, for which we raytrace the surface using CUDA, employing the algorithm in [17] to account for the displacement. Ji is obtained numerically from the backprojection via Eq. (6) using central differences. Having the backprojection available, we can easily perform the rendering T ◦ βiD of the surface into the ith view using the current displacement map and texture. For this, we just need to color each image pixel x with the texel color of the corresponding surface point βiD(x); a minimal sketch of this lookup is given below. Note that while the model is formulated for grayscale textures, it can readily be extended to color using a multidimensional total variation norm [18] or color image diffusion [19].
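The lookup itself is trivial once the backprojection is available; the sketch below assumes, as a hypothetical precomputation, that the ray-traced backprojection has been rasterized into per-pixel integer texel coordinates and a visibility mask.

import numpy as np

def render_view(texture, texel_uv, visible):
    # Render T o beta_i^D: color every visible image pixel with the texel of the
    # surface point hit by its viewing ray.  'texel_uv' holds per-pixel texture-space
    # indices from the ray-traced backprojection; invisible pixels are set to zero.
    out = np.zeros(texel_uv.shape[:2], dtype=texture.dtype)
    u, v = texel_uv[..., 0], texel_uv[..., 1]
    out[visible] = texture[u[visible], v[visible]]
    return out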
4 Experiments
We performed experiments on three different real-world datasets at an input image resolution of 768×584, see Fig. 7. An initial 3D reconstruction was obtained using an implementation of the algorithm in [20]. For the initial texture map, each texel was assigned the weighted average color over all cameras, with weights given by the backprojection area element. An optimized displacement map and texture were computed using the proposed algorithm, which takes about 5 hours until convergence, running on a 2.8 GHz Core 2 Duo processor with CUDA enhancements running on a GeForce GTX. The required main memory is around 6 GByte. All parameters were set according to the recommendations in the previous sections and remained the same for all data sets. Fig. 6 shows that the initial texture is very blurry due to small inaccuracies in the geometry, and can already be improved significantly just by applying the superresolution texture reconstruction. Only when including small-scale displacements from the estimated displacement map, however, can almost perfect sharpness be achieved. A rendering of the final textured model is of at least similar quality to an input image from the same viewpoint, and in many cases the level of detail is even exceeded, see Fig. 7. The displacement map can be leveraged to include additional effects into the rendering, such as relighting using the derived normal map, as exemplified in Fig. 5.
5 Conclusion
We proposed the first superresolution approach to multiview reconstruction. Based on a unifying and elegant mathematical formalism, an algorithm for jointly estimating a displacement map as well as a high-quality texture for an approximate 3D model was derived. Both unknowns appear as the solutions to PDEs on the input surface, which can be solved via total variation minimization techniques on planar 2D texture space with the help of a conformal atlas. Experiments on several real-world objects demonstrate that the resulting displacement map improves the accuracy of the geometric model. Moreover, the computed superresolved textures typically exhibit more visible details than individual input images.
References
1. Bernardini, F., Martin, I., Rushmeier, H.: High-quality texture reconstruction from multiple scans. IEEE Transactions on Visualization and Computer Graphics 7(4), 318–332 (2001)
2. Lensch, H., Heidrich, W., Seidel, H.P.: A silhouette-based algorithm for texture registration and stitching. Graphical Models 63(4), 245–262 (2001)
3. Theobalt, C., Ahmed, N., Lensch, H., Magnor, M., Seidel, H.P.: Seeing people in different light – joint shape, motion, and reflectance capture. IEEE Transactions on Visualization and Computer Graphics 13(4), 663–674 (2007)
4. Allène, C., Pons, J.P., Keriven, R.: Seamless image-based texture atlases using multi-band blending. In: 19th International Conference on Pattern Recognition (2008)
5. Lempitsky, V., Ivanov, D.: Seamless mosaicing of image-based texture maps. In: Proceedings of CVPR, vol. 1, pp. 1–6 (2007)
6. Fransens, R., Strecha, C., van Gool, L.: Optical flow based super-resolution: A probabilistic approach. Computer Vision and Image Understanding 106(1), 106–115 (2007)
7. Schoenemann, T., Cremers, D.: High resolution motion layer decomposition using dual-space graph cuts. In: Proceedings of CVPR, pp. 1–7 (2008)
8. Sroubek, F., Cristobal, G., Flusser, J.: A unified approach to superresolution and multichannel blind deconvolution. IEEE Transactions on Image Processing 16(9), 2322–2332 (2007)
9. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. PAMI 24(9), 1167–1183 (2002)
10. Goldluecke, B., Cremers, D.: Superresolution texture maps for multiview reconstruction. In: Proceedings of ICCV (accepted, 2009)
11. Lui, L.M., Wang, Y., Chan, T.F.: Solving PDEs on manifold using global conformal parameterization. In: Variational, Geometric, and Level Set Methods in Computer Vision: Third International Workshop (VLSM), pp. 309–319 (2005)
12. Welk, M., Theis, D., Brox, T., Weickert, J.: PDE-based deconvolution with forward-backward diffusivities and diffusion tensors. In: Kimmel, R., Sochen, N.A., Weickert, J. (eds.) Scale-Space 2005. LNCS, vol. 3459, pp. 585–597. Springer, Heidelberg (2005)
13. Chambolle, A.: An algorithm for total variation minimization and applications. Mathematical Imaging and Vision 20, 89–97 (2004)
14. Lévy, B., Petitjean, S., Ray, N., Maillot, J.: Least squares conformal maps for automatic texture atlas generation. ACM Transactions on Graphics (SIGGRAPH) 21(3), 362–371 (2002)
15. Wang, Y., Gu, X., Hayashi, K., Chan, T.F., Thompson, P., Yau, S.T.: Surface parameterization using Riemann surface structure. Proceedings of ICCV 2, 1061–1066 (2005)
16. Weickert, J., Scharr, H.: A scheme for coherence-enhancing diffusion filtering with optimized rotation invariance. Journal of Visual Communication and Image Representation 13(1–2), 103–118 (2002)
17. Donnelly, W.: Per-Pixel Displacement Mapping with Distance Functions. In: GPU Gems 2. Addison-Wesley Longman, Amsterdam (2005)
18. Blomgren, P., Chan, T.F.: Color TV: Total variation methods for restoration of vector-valued images. IEEE Trans. Image Processing 7, 304–309 (1998)
19. Weickert, J.: Coherence-enhancing diffusion of colour images. Image and Vision Computing 17(3–4), 201–212 (1999)
20. Kolev, K., Cremers, D.: Integration of multiview stereo and silhouettes via convex functionals on convex domains. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 752–765. Springer, Heidelberg (2008)
View Planning for 3D Reconstruction Using Time-of-Flight Camera Data
Christoph Munkelt (1), Michael Trummer (2), Peter Kühmstedt (1), Gunther Notni (1), and Joachim Denzler (2)
(1) Fraunhofer IOF, Albert-Einstein-Str. 7, 07745 Jena, Germany
{christoph.munkelt,peter.kuehmstedt,gunther.notni}@iof.fraunhofer.de
(2) Friedrich-Schiller University of Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
{michael.trummer,joachim.denzler}@uni-jena.de
Abstract. Solving the next best view (NBV) problem is an important task for automated 3D reconstruction. An NBV algorithm provides sensor positions from which maximal information gain about the measurement object can be expected during the next scan. With no or limited information available during the first views, automatic data-driven view planning performs suboptimally. In order to overcome these inefficiencies during the startup phase, we examined the use of time-of-flight (TOF) camera data to improve view planning. The additional low-resolution 3D information, gathered during sensor movement, allows even the first scans to be planned customized to previously unknown objects. Measurement examples using a robot-mounted fringe projection stereo 3D scanner with a TOF camera are presented.
Keywords: View Planning, Next Best View, Optical 3D Reconstruction, Time-of-Flight Camera.
1 Introduction
For complete reconstruction of unknown objects or dimensional inspection of complex and large components, planning the next view of an optical sensor has been a challenging task for algorithms and human experts alike. Multiple views from different sensor positions are usually needed to completely reconstruct more than simple objects. The efficiency of data-driven approaches during the first scans is poor due to absent or limited object information. Alternatively, offline planning could be used; however, usable geometry data has to be available, which is often not the case because of non-existing or overly complex CAD data. Our proposed method therefore uses an additional low-resolution TOF camera to gather 3D object data during sensor movements. This 3D video feed, in conjunction with the sensor's robot-based positioning system, provides a viable planning foundation for our high-resolution fringe projection stereo 3D scanner.
1.1 Motivation
The determination of the best next view based on information from previous scans is inherently a local optimization problem [1] in data-driven approaches.
With no or very limited information in the startup phase of the 3D scan, any planning system performs worse than experienced operators, who have an a-priori overview of the macro topography of the object. However, a mixed sensor composed of an optical high-resolution 3D sensor and a TOF camera can combine the strengths of both technologies to solve the problem above. The TOF camera provides fast, but low-resolution, low-precision 3D measurements. Yet the swiftness of measurement enables data acquisition even during periods unsuitable for traditional 3D scanners, especially during sensor movement. Our goal therefore is to use this "free" TOF 3D data gathered during sensor positioning periods to acquire a rough 3D scan of the object. Thereby the view planner has preliminary object information in parts of the measurement volume not yet scanned by the high-resolution scanner. This paper proposes such a hybrid sensor and shows how to incorporate TOF camera data into the view planning process. The underlying planning approach utilizes a volumetric representation of the measurement volume. Typical reconstruction quality requirements, like completeness, fringe contrast (modulation), measurement confidence, and surface sampling quality (average point distance), and acquisition constraints, like positioning and sensing constraints, are jointly evaluated for both sensor types.
1.2 Literature Review
With research on actively placing vision sensors for modeling objects being an active topic for more than 20 years [2], many different approaches have been studied. Volumetric representations have been explored early, together with modeling of occlusion and the boundary between seen and unseen volume [3]. By employing the sensor’s model within the volumetric model, strategies which reveal the largest quantity of previously unknown scene information can be formulated [4]. To our knowledge no previous attempt to use TOF 3D data to guide the NBV planning has been made. However, the combination of a high resolution camera with a TOF 3D camera for other purposes (e.g. robot navigation [5]) has been examined. Pito introduced the notion of positional space [6], which allowed, in combination with such a detailed scanner model, refraining from ray-casting from every possible sensor location in order to determine the NBV. For smoothly curved surfaces Chen et al. [7] proposed a planning method, which analyzed the target’s trend surface. Thus it is possible to predict unknown area of the object and derive the exploration direction. While multiple sensor placement constraints were considered, such as field-of-view, resolution, and viewing angle, the method faces significant problems in surface trend prediction for objects with object boundaries and surface edges. A similar approach was taken by Li et al. [8]. Their information entropy based planning slices the measurement object into a series of cross section curves. By analyzing the uncertainty of the reconstruction by closed B-spline curves, the viewpoint that contains the maximal information gain is chosen as the NBV. However, no examples for reconstruction of technical objects with surface edges and object boundaries were given.
The viewpoint planning approach presented by Wenhardt et al. [9] is based on the usage of an extended Kalman filter for reconstruction, which allows the application of statistical optimization criteria such as the entropy. Gathering the coordinates of 3D points within the state vector of the Kalman filter, the position of each point can be estimated together with the covariance matrix. The authors show that minimizing the conditional entropy of the state vector, depending on the observation, can be performed with respect to the covariance matrix and improves the accuracy of the reconstruction result. The hierarchical approach by Low et al. [1] greatly accelerates the view planning process by exploiting various spatial coherences. Real-world constraints (positioning, sensing and registration constraints) and quality requirements (completeness and surface sampling quality) are easily integrated. Their view metric primarily maximizes the expected newly scanned volume while satisfying the desired sampling density. The preferred quality requirement can be boosted by changing the associated weights.
1.3 Paper Organization
The remainder of the paper is organized as follows. Section 2 presents our volumetric planning framework and how we integrated the TOF camera for planning purposes. Section 3 shows results obtained with a first implementation using a robot-mounted fringe projection 3D scanner, while Section 4 summarizes our findings and gives an outlook on further intended developments.
2 Volumetric Planning Approach with TOF Camera Data
In this section we review our extended, quality-criteria-based volumetric planning approach. Furthermore we present our integration of TOF 3D camera data for planning purposes. Contrary to methods inferring from different types of surfaces in the partial model (e.g. [4,1]), our method propagates the calculated quality criteria along the sensor's viewing rays into the volume. We think that this method works better for small objects, which are scanned from a sensor positioned on a sphere around them. Reasons for that assumption include potentially available measurement volume information and, for highly detailed objects, potential difficulties in the computation of the relevant surfaces.
2.1 Evaluation – Assessing a Scan's Contribution
Any data-driven method faces the challenge of incomplete knowledge of the measurement volume. If object information is present – e.g. through a previous scan – corresponding planning information is usually assigned only to the object surface. We will show later how propagation of quality criteria into the volume can be achieved and how the planning process benefits from being able to draw conclusions from a larger volume augmented with quality criteria (one possible voxel layout for such a volume is sketched below).
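As an illustration of how such a planning volume might be organized (the actual data layout of the system is not specified here), one could keep per-voxel accumulators for each quality criterion:

import numpy as np

class PlanningVolume:
    # Hypothetical voxel grid holding, per voxel, the accumulators needed to fuse
    # the quality criteria discussed below (confidence, modulation, point distance).
    def __init__(self, dims, voxel_size, origin):
        self.voxel_size, self.origin = voxel_size, np.asarray(origin, float)
        self.con_sum = np.zeros(dims)           # sum of weighted confidences
        self.w_sum = np.zeros(dims)             # sum of visibility weights
        self.mod = np.zeros(dims)               # fused modulation (surface voxels)
        self.apd = np.full(dims, np.inf)        # fused average point distance
        self.state = np.zeros(dims, np.uint8)   # 0 = unknown, 1 = empty, 2 = surface

    def index(self, points):
        # Map 3d points to voxel indices of this grid.
        return np.floor((points - self.origin) / self.voxel_size).astype(int)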
Fig. 1. Volumetric modeling of the planning volume. On the left side, distance-dependent voxel classification for a single view is illustrated. The middle figure shows local surface normal approximation for confidence calculation, while on the right, confidence propagation using distance-dependent determination of the weight w_con^n(P) is typified.
In order to determine the next best view, one must be able to evaluate the result of a scan using a suitable arbitrary sensor configuration (henceforth called sensor pose). This requires an adequate sensor model and an analysis of the available information about the current measurement volume. We call this volumetric mapping of the measurement volume, which combines depth fusion information and planning information, the planning volume. The goal of such an evaluation of a particular pose is a measure of its information gain with respect to the relevant quality criteria. A typical measure of confidence for triangulation-based 3D scanners [6] is

c_n(P_S) = - n_{P_S}^T \, s_{n,P_S}   (1)

with |n_{P_S}| = |s_{n,P_S}| = 1. For a point P_S = (x, y, z)^T on the object's surface, c_n(P_S) describes the confidence in the n-th measurement, where n_{P_S} is the unit surface normal of a locally fitted plane around P_S and s_{n,P_S} is the unit vector of the viewing ray through P_S in view n (see Figure 1). Intuitively, best results are usually obtained when scanning a surface frontally. To apply this measure to all points in space, it must be computable for points in front of and behind the object's surface. To this end we convey an approach shown by Hernandez et al. [10] in the context of probabilistic sensor data fusion of range images from multiple views. Together with the sensor model, a visibility evidence based on the cumulative normal of the range measurement probability (dependent on the distance to the 'true', that is, fused, object surface) is proposed. For optical measurement systems it allows the volume in front of the object's surface (towards the sensor) to be classified as outside the object. Furthermore, it allows the propagation of surface measurement confidence, since the viewing
ray through an arbitrary point P of the volume will ultimately hit the object's surface in a point P_S. This allows us to set c_n(P) = c_n(P_S). Depending on the sensor model, very limited knowledge can be gained behind the object's surface. Therefore we apply the visibility evidence from the above-mentioned approach, which, depicted as the function w_con^n(P), we simplified for our planning purpose to assign a weight near 1 to points in front of, but a weight near 0 to points behind the surface (see Figure 1). Consequently we define the weighted confidence of an arbitrary volume point P measured in view n as

con_n(P) = w_{con}^{n}(P) \; c_n(P_S^n(P)),   (2)
where P_S^n(P) is the object's surface point P_S hit by the viewing ray through P. Further quality criteria examined were modulation, as a measure of (fringe) contrast, and average point distance, as a measure of the average point spacing at a particular object surface point. Modulation m_n(P) can e.g. be modeled as fringe visibility [11], where higher modulation causes lower measurement errors. This is a physically motivated, measurement-technology-induced criterion of the used fringe projection technique. Average point distance a_n(P), on the other hand, is independent of the used measurement technology, and is determined by the sensor's resolution in object space. Redundancy through multiple views of the same object surface area can decrease it further. Lower distances lead to a finer surface approximation. Both criteria typically have sensor-system-specific bounds beyond which further improvements do not result in better measurement results. Consider for instance modulation: while falling below a certain lower boundary vastly increases the measurement error, exceeding a certain upper boundary (but not saturating the sensor) does not reduce the measurement error significantly. Therefore both criteria are evaluated with respect to desired values m* and a*, respectively, and weighted according to adequately adapted logistic functions w_mod^{m*} and w_apd^{a*}, respectively, in such a way that reaching the desired bounds yields a weight of nearly 1 (a sketch of the confidence measure and such a weighting is given below).
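The following Python sketch illustrates the confidence measure of Eq. (1) and a possible logistic weighting; the plane fit over k nearest neighbours and the logistic parameters are our own placeholders, since the exact values used in the system are not specified here.

import numpy as np

def confidence(points, center_idx, view_dir, k=8):
    # Eq. (1): c = -n^T s, with n the unit normal of a plane fitted to the k nearest
    # neighbours of the surface point and s the unit viewing ray through it.
    P = points[center_idx]
    d = np.linalg.norm(points - P, axis=1)
    nbrs = points[np.argsort(d)[:k]]
    cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
    n = np.linalg.eigh(cov)[1][:, 0]          # normal = smallest-eigenvalue eigenvector
    s = view_dir / np.linalg.norm(view_dir)
    if np.dot(n, s) > 0:                       # orient the normal towards the sensor
        n = -n
    return float(-np.dot(n, s))

def logistic_weight(x, x_star, steepness=10.0):
    # Placeholder for the adapted logistic weightings w_mod / w_apd: approaches 1 once
    # the desired value x* is reached; for criteria where smaller values are better
    # (e.g. point distance), the argument would be inverted accordingly.
    return 1.0 / (1.0 + np.exp(-steepness * (x / x_star - 1.0)))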
2.2 Multi-view Fusion
So far we have described the evaluation of single scans. New scans that have been chosen by the planner need to be fused with the previous scans in order to update both the measurement and the planning volume. To this end we perform range data and quality criteria fusion in analogy to Curless et al. [12]. We define a voxel's fused confidence of the first n views as the weighted average of its individual confidences:

con_{1,n}(P) = \frac{1}{\sum_{i=1}^{n} w_{con}^{i}(P)} \sum_{i=1}^{n} con_i(P).   (3)
Thus, fusion of volume hidden from the sensor by the object surface in a particular view does not alter the fused confidence. The average confidence CON_n of the measurement volume V after n views can then be written as

CON_n = \frac{1}{vol_3(V)} \int_V con_{1,n}(P) \, dP   (4)
and the confidence gain gain_{con}^{n+1}, achieved through the (n+1)-th view, as

gain_{con}^{n+1} = CON_{n+1} - CON_n.   (5)
For the evaluation of the corresponding expected modulation and average point distance gains we currently consider only object surface voxels P_S. Assuming an appropriately fused modulation m_{1,n}(P) and average point distance a_{1,n}(P), we define the modulation gain as

gain_{mod}^{n+1} = MOD_{n+1} - MOD_n   (6)

with

MOD_n = \frac{1}{vol_2(S)} \int_S mod_{1,n}(P_S) \, dP_S, \qquad mod_{1,n}(P_S) = w_{mod}^{m^*}(m_{1,n}(P_S)),
where mod_{1,n}(P_S) is the weighted modulation of P_S throughout scans 1 . . . n. The average point distance gain is defined as

gain_{apd}^{n+1} = APD_{n+1} - APD_n   (7)
with APD_n and apd_{1,n}(P_S) defined analogously, where apd_{1,n}(P_S) is the weighted, locally estimated point density around P_S after n scans. A sketch of this incremental fusion and of the resulting gain is given below.
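As a minimal illustration (the accumulator arrays are hypothetical and stand for per-voxel quantities of the planning volume), the fusion of Eq. (3) and the gain of Eqs. (4)–(5) can be written as:

import numpy as np

def fuse_confidence(con_sum, weight_sum, c_new, w_new):
    # Running form of Eq. (3): the fused confidence is the weighted average of the
    # per-view confidences; voxels hidden in a view (w = 0) are left unchanged.
    con_sum += w_new * c_new
    weight_sum += w_new
    fused = np.where(weight_sum > 0, con_sum / np.maximum(weight_sum, 1e-12), 0.0)
    return fused, con_sum, weight_sum

def confidence_gain(fused_prev, fused_new):
    # Eqs. (4)-(5): mean confidence over the (discretized) volume and its gain.
    return float(fused_new.mean() - fused_prev.mean())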
2.3 View Planning
With the current knowledge about the measurement object represented in the planning volume and the ability to evaluate scans regarding their contribution to improving the current object model, we have the necessary prerequisites at our disposal to determine the next best view. For that purpose we utilize a generate-and-test approach and evaluate synthesized sensor positions. Eventually, a sensor pose with maximum gain is found and chosen as the NBV. The remainder of this section describes the process in detail. Combined Fusion and Planning. The current partial object model contains valuable ancillary information. With a suitable model of the used 3D sensor (intrinsic and extrinsic parameters) and the positioning system (hand-eye transformation), one can generate scan data prior to actually scanning from that sensor pose (further on denoted as estimation). Evaluation of such an estimation reveals many clues for reaching typical reconstruction quality requirements. By ranking the view scores obtained from an objective function combining each potential next pose's individual gains shown above (a possible combination is sketched below), the planning algorithm consecutively chooses sensor poses. The chosen next pose simultaneously decreases uncertainty throughout the volume (by confidence gains for scanning unknown volume) and improves the reconstruction quality of already scanned surfaces. This greedy-type approach implies that scanning from this pose reveals enough new information about the so far unseen volume.
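One possible combined objective for ranking candidate poses, with placeholder weights since the actual weighting of the system is not specified here, is simply a weighted sum of the individual gains of Eqs. (5)–(7):

def view_score(gain_con, gain_mod, gain_apd, w=(1.0, 0.5, 0.5)):
    # Hypothetical combined objective: higher score = more promising next pose.
    return w[0] * gain_con + w[1] * gain_mod + w[2] * gain_apd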
2.4 Proposed TOF Camera Based View Planning Extension
Bearing an earlier approach [13] in mind, we used 3D data from a TOF camera to initialize the planning volume. The resulting quality criteria are weighted according to the TOF camera's increased uncertainty. The fusion of the TOF camera's 3D data is otherwise identical to the high-resolution sensor data fusion. Since the capture and fusion of TOF 3D data can happen upon planning startup, during sensor movement, and during view planning computation, the necessary TOF scans should actually be regarded as "free" (in contrast to additional views). Nevertheless they allow the planner to take the macro topography of the object into account and determine better next views.
View Planning. Our data-driven planning process consists of the following steps:
1. TOF 3D data capture before initial / during sensor positioning
2. coarse sampling of the possible sensor parameter space over the partial object model
3. repeat iterative optimization while there is significant improvement:
(a) estimation of sensor poses and evaluation; ranking of view scores
(b) locally refine the sampling of the sensor's parameter space in the vicinity of the objective function's local maxima
4. termination test: continue data acquisition?
5. sensor pose change to the best view from step 3; continue from step 1.
During sensor movement, TOF camera 3D data capture is performed. In the second step, a number of arbitrary, yet preferably uniformly distributed, sensor configurations are chosen from the accessible sensor parameter space (consisting for instance of sensor position, orientation, focal length, or shutter speed). After estimation and evaluation of these poses, the initial ones are replaced by a refined sampling of possible sensor parameters around local maxima of the objective function combining the various gains resulting from the evaluation. This repeated refinement converges when no significant improvement over the last iteration can be achieved. The fourth step compares the current planning volume's evaluation against the effective termination criteria, like the number of allowed views, object completeness, or the significance of the contribution compared to the last planned view. If planning continues, the optimal sensor position from step 3 is chosen as the NBV. A compact sketch of this generate-and-test loop is given below.
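The sketch below outlines steps 2–3 only; the callables sample_poses, estimate, evaluate and refine are placeholders for the sensor model, positioning system and scoring described above, not part of the actual implementation.

def plan_next_view(planning_volume, sample_poses, estimate, evaluate, refine,
                   n_rounds=3, n_initial=64, n_local=16):
    # Generate-and-test: coarse sampling of the accessible pose space, followed by
    # local refinement of the sampling around the best-scoring candidate poses.
    candidates = sample_poses(n_initial)
    best = None
    for _ in range(n_rounds):
        scored = []
        for pose in candidates:
            est_scan = estimate(planning_volume, pose)   # synthesize the expected scan
            scored.append((evaluate(planning_volume, est_scan), pose))
        scored.sort(key=lambda t: t[0], reverse=True)
        if best is not None and scored[0][0] <= best[0] + 1e-6:
            break                                        # no significant improvement
        best = scored[0]
        candidates = refine(best[1], n_local)            # resample around the maximum
    return best                                          # (score, next best pose)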
3 Experimental Results
All tests were performed with a robot-mounted, stereo-camera-based 3D sensor (see Figure 2) using active illumination (fringe projection). The sensor head consists of two cameras, a TOF camera, and a digital projector. The reconstruction method employed is phase-correlation-based fringe projection [15] for the high-resolution 3D scanner and TOF 3D scanning using a PMD O3 camera [16]. The measurement object used (the NBV test object, see [14]) was optimally aligned and wall-mounted, with sensor positions restricted to positions on a half sphere
Fig. 2. (a) Measurement setup and utilized measurement system. (b) Sensor head combining stereo cameras, digital projector and TOF camera. (c) Photo of NBV test object (160 mm) (see [14]).
around it. However, the current realization of the sensor positioning system limits the sensor's reachable operating space to 80° ≤ θ ≤ 125° and −35° ≤ ϕ ≤ 35° (zenith and azimuth, respectively, in spherical coordinates). For performing the estimation step, our planner currently models only a single camera (pinhole camera model). This is a valid simplification, since correlation-based triangulation is not needed for rendering the estimated view (camera and object parameters are known). However, stereo visibility – how to optimally place a stereo head within the visibility region of a feature – is not modeled. Depending on the measurement object complexity, this leads to minor discrepancies between estimation and real scan. The TOF 3D sensor was placed between the two sensor cameras. Its field of view is substantially larger than that of the high-resolution cameras, allowing a better overview of the measurement volume (see Figure 3 on the next page). However, the low pixel count (64 × 48), combined with a specified repeatability of ±80 mm for the used measurement distance, leads to noisy preliminary object information. Even after using robot position information and capturing TOF 3D data without sensor motion, reliable automatic registration of its 3D data with the partial 3D data of the high-resolution sensor proved difficult. Therefore we temporarily used 3D data with characteristics comparable to the above TOF sensor, but with known external parameters. The TOF 3D sensor's considerably reduced resolution in object space (see Figure 3) is sufficient to represent the test object's macro topography. Smaller object details, however, like the tripod structure, can hardly be scanned. The automatically planned first view is a frontal view, capturing all three accessible faces of the test object at once. The next three views achieve an overall completeness of approx. 45% (human experts achieve 49% on average in the same situation). After 8 views, overall completeness reaches 49% (compared to 48% without TOF 3D usage and 53% on average for human experts). However, some smaller individual object details, like the tripod structure, are only partially scanned. Apparently, false volume updates during the TOF data fusion due to
Fig. 3. (a) Single TOF scan of the NBV test object. (b) Multiple TOF scans (green) fused and registered to the partial 3D model of the fringe projection scanner (blue). (c) Fringe projection 3D scanning result after 4 views with the planner using TOF 3D data.
erroneous depth measurements occur for a greater number of still unscanned volume elements. A second observation concerns the planning strategy. The employed greedy-type ranking of potential new sensor poses is not the optimal strategy when planning multiple views. Highest-ranked poses favor views which improve the majority of the partial model. If, as is the case using TOF 3D data, more information about the object structure is available beforehand, clustering of similar adjacent faces (see [1]) promises to improve detail completeness.
4 Conclusions and Future Work
We examined the application of TOF camera 3D data to improve view planning for 3D reconstruction. To our knowledge this has not been studied by others before. By modeling the TOF camera as a conventional 3D sensor, easy integration into volumetric view planning approaches is possible. If TOF 3D data is gathered during previously unused periods of time (e.g. before measurement start, during sensor movement, during prolonged planning calculation periods), the additional information can improve view planning without causing additional costs. First experiments were successfully performed and confirm the general usability of TOF 3D data for view planning purposes. The reduced spatial resolution and increased measurement uncertainty of TOF cameras do not limit their intended purpose of capturing the object's macro topography. Complete scans of delicate object details still need to be planned based on the partial 3D model yielded by higher-resolution sensors. Further advancements in TOF camera technology will also address this issue. Furthermore, it became clear that more advanced view planning strategies need to be applied to benefit from the additional information. We will examine both methods to plan several consecutive next best views and the application of existing NBV techniques to better control the TOF camera views themselves. As a technical prerequisite, the external TOF camera parameters must be determined more robustly.
References
1. Low, K.L., Lastra, A.: An Adaptive Hierarchical Next-Best-View Algorithm for 3D Reconstruction of Indoor Scenes. In: 14th Pacific Conference on Computer Graphics and Applications (October 2006)
2. Cowan, C.K., Kovesi, P.D.: Automatic sensor placement from vision task requirements. IEEE Trans. Pattern Anal. Machine Intell. 10, 407–416 (1988)
3. Massios, N.A., Fisher, R.B.: A Best Next View selection algorithm incorporating a quality criterion. In: 9th BMVC, September 1998, pp. 780–789 (1998)
4. Banta, J.E., Abidi, M.A.: Autonomous placement of a range sensor for acquisition of optimal 3-D models. In: 22nd IEEE International Conference on Industrial Electronics, Control, and Instrumentation, vol. 3, pp. 1583–1588 (1996)
5. Prusak, A., Melnychuk, O., Roth, H., Schiller, I., Koch, R.: Pose estimation and map building with a Time-Of-Flight camera for robot navigation. Intern. Journ. of Intelligent Systems Technologies and Applications 5(3/4), 355–364 (2008)
6. Pito, R.: A Solution to the Next Best View Problem for Automated Surface Acquisition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(10), 1016–1030 (1999)
7. Chen, S.Y., Li, Y.F.: Vision Sensor Planning for 3D Model Acquisition. IEEE Transactions on Systems, Man and Cybernetics – B 35(5), 894–904 (2005)
8. Li, Y.F., Liu, Z.G.: Information Entropy-Based Viewpoint Planning for 3-D Object Reconstruction. IEEE Transactions on Robotics 21(3), 324–337 (2005)
9. Wenhardt, S., Deutsch, B., Hornegger, J., Niemann, H., Denzler, J.: An Information Theoretic Approach for Next Best View Planning in 3-D Reconstruction. In: 18th International Conference on Pattern Recognition (ICPR 2006), vol. 1, pp. 103–106 (2006)
10. Hernandez, C., Vogiatzis, G., Cipolla, R.: Probabilistic visibility for multi-view stereo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–8 (2007)
11. Creath, K.: Temporal Phase Measurement Methods. In: Interferogram Analysis – Digital Fringe Pattern Measurement Techniques, pp. 94–140. Institute of Physics Publishing (1993)
12. Curless, B., Levoy, M.: A Volumetric Method for Building Complex Models from Range Images. In: Proceedings of SIGGRAPH 1996, pp. 303–312. ACM Press, New York (1996)
13. Munkelt, C., Kuehmstedt, P., Denzler, J.: Incorporation of a-priori information in planning the next best view. In: Kobbelt, L., Kuhlen, T., Aach, T., Westermann, R. (eds.) Vision, Modeling, and Visualization 2006, pp. 261–268 (2006)
14. Munkelt, C., Trummer, M., Denzler, J., Wenhardt, S.: Benchmarking 3D Reconstructions from Next Best View Planning. In: Proceedings of IAPR Conference on Machine Vision Applications, May 2007, pp. 552–555 (2007)
15. Kühmstedt, P., Munkelt, C., Heinze, M., Bräuer-Burchardt, C., Notni, G.: 3D shape measurement with phase correlation based fringe projection. In: Optical Measurement Systems for Industrial Inspection V, vol. 6616, p. 66160B. SPIE (2007)
16. PMDTechnologies GmbH: PMD [vision] O3 Datasheet (2007), http://www.pmdtec.com/e_inhalt/documents/datasheet_O3_v0100.pdf
Real Aperture Axial Stereo: Solving for Correspondences in Blur
Rajiv Ranjan Sahay and Ambasamudram N. Rajagopalan
Image Processing and Computer Vision Laboratory, Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai-600 036, India
[email protected], [email protected]
Abstract. When there is relative motion along the optical axis between a real-aperture camera and a 3D scene, the sequence of images captured will not only be space-variantly defocused but will also exhibit pixel motion due to motion parallax. Existing single viewpoint techniques such as shape-from-focus (SFF)/depth-from-defocus (DFD) and axial stereo operate in mutually exclusive domains. SFF and DFD assume no pixel motion and use the focus and defocus information, respectively, to recover structure. Axial stereo, on the other hand, assumes a pinhole camera and uses the disparity cue to infer depth. We show that in real-aperture axial stereo, both blur and pixel motion are tightly coupled to the underlying shape of the object. We propose an algorithm which fuses the twin cues of defocus and parallax for recovering 3D structure. The effectiveness of the proposed method is validated with many examples.
1 Introduction
Inferring 3D information from images is a fundamental problem in computer vision. In many applications, one is constrained by viewpoint, short focal length and small depth-of-field. In the domain of machine vision, researchers have explored axial stereo due to the advantages of single viewpoint, motion only along the optical axis, and common field of view [1] [2]. However, the camera is assumed to be pinhole and the depth information is recovered using only the disparity cue. In fact, blurring in the images would impair the ability to compute point correspondences accurately and will affect the performance of stereo. We observe that in real-world scenarios, images of 3D scenes are invariably affected by space-variant blurring and the assumption of a pinhole camera is unrealistic and restrictive. Hence, there exists a need to extend the axial stereo method to handle scaled as well as defocused images. Relative motion between the scene and the lens, or camera zooming, or changing the intrinsic parameters of a camera (e.g. focus settings) not only cause pixel motion [3] but also result in defocusing. A brief review of related works is in order. In [4], a technique is proposed using blur-invariant moments to recover affine motion parameters as well as defocus blur. In [5] an over-determined system of equations is solved to obtain the motion and blur estimates using a pair
of images. Deschenes et al. [6] simultaneously compute structure as well as estimate spatial shifts using two images with the equifocal assumption. However, it is well-known that window-based approaches can critically affect the estimation of structure information [7]. Zooming has been exploited to extract depth information from a monocular image sequence [8], [9]. But these works do not consider the blur cue and derive estimates of depth using only the pixel motion in the images. Fusion of stereo and defocus cues has been attempted [10] for the case of lateral stereo. Recently, in [11] it has been proposed to compute the 3D shape by controlling focus and aperture of a single camera, especially for scenes that are highly geometrically complex and have fine texture. However, this method uses sophisticated SLR cameras and captures hundreds of high-resolution images to derive the structure. A variational approach is adopted in [12] for estimating 3D shape and radiance. In [12] magnification effects have been avoided by using telecentric optics [13]. In another work [14], depth is estimated by a least squares approach. Modeling defocus blur in an image as a diffusion process, in [15] the shape of the 3D specimen has been recovered using the concept of relative blur and forward diffusion. An active illumination based approach has been proposed in [16] to compute the depth map of scenes using a coaxial projector and camera system. In [17] the depth map and the focused image are reconstructed from a single frame. Occlusions have been handled for estimating depth information in [18], [19]. We remark that in all the above works the observations are not affected by parallax and their formulations do not account for it. One of the popular passive ranging techniques which uses a single real-aperture (small depth-of-field) camera and a stack of space-variantly defocused frames to estimate structure is shape-from-focus (SFF) [20]. Here, the degree of focus in the images is used as the principal cue for estimating the shape of a 3D specimen. Assuming no parallax, a focus measure profile is computed for each pixel and the depth information is acquired by inferring the position of best focus for each point. Within the depth from defocus (DFD) framework, to avoid magnification, image-side telecentricity is proposed in [13]. But not only are telecentric lenses expensive, they also necessitate an involved procedure for positioning the aperture relative to a conventional lens. We observe that in both SFF and DFD the pixel motion in the frames is not exploited. On the contrary, it is avoided. Akin to image-side telecentricity, telecentric lenses can be used to obtain object-side telecentricity also. Since the front element of object-side telecentric lenses needs to be as large as the field-of-view, these lenses are typically very big and heavy apart from being quite expensive. There are unique situations such as in endoscopy where the working distances are small and the environment is highly constrained so that only axial motion of the camera is possible. Under these circumstances, one cannot use DFD due to physical limitations of accessibility to camera control. Moreover, the camera motion is significant and rules out the 'no parallax' assumption that is typically made in SFF. The goal of our work is not to project it as an option to DFD or SFF but rather to propose a working scheme for handling situations when
neither DFD nor SFF is directly applicable. In contrast to SFF or DFD, the proposed method is able to handle both parallax and defocus effects together. In this work, we generalize the axial stereo [1] technique to accommodate a real-aperture camera that induces space-variant blurring in the captured frames. We propose an integrated framework that exploits both the cues of defocus and pixel motion in the given sequence of images. We explicitly model the crucial structure-dependent aspect of the movement of pixels by using the perspective projection model for image formation and relate this to the underlying blurring phenomenon. To the best of our knowledge, such a unified framework that combines motion and defocus cues within the axial stereo framework has not been attempted previously. We show that the proposed method is able to handle even regions of low texture. Since the frames in the stack are all space-variantly blurred, reconstruction of the focused image of the 3D sample is advantageous but non-trivial due to pixel motion. In conjunction with the depth map, availability of a focused image enhances the understanding of the shape of the object and the ability to detect meaningful patterns in vision-based inspection applications.
2 Scaling and Defocus in Axial Stereo
As shown in Fig. 1 (a), a single real-aperture camera captures a sequence of space-variantly blurred images as a 3D specimen is moved along the optical axis in fixed finite steps of Δd on a translating stage. Usually, in axial stereo the camera is moved with respect to the object, but moving the object along the optical axis achieves the same effect. Owing to the finite depth-of-field of the camera and the 3D nature of the specimen, none of the observations is in complete focus. The distance between the lens plane and the focused plane is denoted as the working distance wd and it is given by 1/wd = 1/f − 1/v, where f is the focal length and v is the distance between the lens and the image plane. The object is initially placed such that the translating stage is on the focused plane. The 3D object is translated in steps of Δd downwards and a frame is captured at each step. The quantity d(k, l) is the amount by which the stage should be translated to bring the point (k, l) to the focused plane. When evaluated for all (k, l), d(k, l) yields the 3D structure. Assume that N frames {ym(i, j)}, m = 0, 1, . . . , N − 1, each of size M × M, are captured. These observations are derived from a single focused image {x(i, j)} of the 3D specimen. The scaled and defocused frames can be related to the focused image by the degradation model

ym = Hm(d) Wm(d) x + nm,    m = 0, . . . , N − 1    (1)

where ym is the lexicographically arranged vector of size M² × 1 derived from the mth defocused and scaled frame, Wm is the matrix describing the motion of the pixels in the mth frame, Hm is the blurring matrix for the mth frame, and nm is an M² × 1 Gaussian noise vector. The degree of space-variant defocus blur induced at each point in the image of a 3D scene is dependent upon the depth of the object from the lens plane. Also, the pixel motion is a function of the 3D
Fig. 1. (a) A multi-image real-aperture axial stereo. (b) Schematic showing structure-dependent pixel motion.
structure of the object. In fact, the twin cues of defocus and pixel motion are intertwined with the 3D structure and must be judiciously exploited. Only the correct d will induce the correct pixel motion and the correct blurring to yield the given observation ym. Our goal is to solve for d, given ym, m = 0, 1, 2, . . . , N − 1. Let us first try to understand the phenomenon of pixel motion (which we loosely refer to as magnification or scaling) in the stack as denoted by Wm(d) in (1). To explain the mechanism of this structure-dependent pixel migration, we initially consider a pinhole camera and describe how scaled images are formed. As shown in Fig. 1 (b), we examine a specific point on the specimen which is moved relative to the pinhole camera. A point on the 3D object with world coordinates P(X^P, Y^P, Z^P) is moved to Q(X^Q, Y^Q, Z^Q) along the Z-axis by a distance of mΔd and away from the pinhole denoted by O. The distances of the points P and Q from the pinhole are Z^P and Z^Q, respectively. The point P is imaged at p on the image plane and has coordinates (x, y). Let this image be the reference plane. When the 3D object is moved away from the pinhole by an amount mΔd, the point Q is imaged at q with coordinates (x', y') on the image plane. The corresponding image is the mth frame in the stack. Assuming that the size of the images is M × M, the basic perspective projection equations give x = vX^P/Z^P, x' = vX^Q/Z^Q and y = vY^P/Z^P, y' = vY^Q/Z^Q. The motion of the object relative to the pinhole is only along the Z-axis since the 3D specimen is translated away from or towards the camera along the optical axis. Hence, for the real-aperture axial stereo scenario, we have X^P = X^Q, Y^P = Y^Q and Z^Q = Z^P + mΔd = wd − d(x, y) + mΔd, where v is the distance between the pinhole and the image plane, and −M/2 ≤ x', y' ≤ M/2. Thus, it can be shown that

x' = x(wd − d(x, y)) / ((wd − d(x, y)) + mΔd),    y' = y(wd − d(x, y)) / ((wd − d(x, y)) + mΔd)    (2)
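To make the role of (2) concrete, the following small Python sketch evaluates the structure-dependent pixel motion for a single reference pixel. The function and variable names (warp_coordinates, wd, delta_d) and the numeric values in the example are our own illustrative choices, not taken from the paper; coordinates are measured from the image centre, as assumed above.

    def warp_coordinates(x, y, d_xy, wd, delta_d, m):
        # Map reference-frame coordinates (x, y) to their position (x', y') in the
        # m-th frame, given the structure value d(x, y) at that pixel, cf. Eq. (2).
        denom = (wd - d_xy) + m * delta_d
        x_prime = x * (wd - d_xy) / denom
        y_prime = y * (wd - d_xy) / denom
        return x_prime, y_prime

    # Example (invented values): a pixel 40 units right of the image centre,
    # d(x, y) = 5 mm, working distance 200 mm, axial step 1 mm, frame m = 3.
    print(warp_coordinates(40.0, 0.0, 5.0, 200.0, 1.0, 3))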
Note that the pixel motion is a function of d, the 3D structure of the scene. For a general 3D object, there will be structure-dependent pixel motion which
cannot be described by a homography. For the same reason, it is not possible to reconstruct the focused image of the 3D specimen using the stack by simply picking the pixels from the frames where they come in focus. Next, we explain the space-variant blurring in the stack of images as represented by Hm(d) in (1). Due to diffraction and lens aberrations, the point spread function (PSF) of the camera is best described by a circularly symmetric 2D Gaussian function [21] with standard deviation σ = ρ rb, where ρ is a camera constant, rb is the blur radius and σ is called the blur parameter. There exist several works that have validated this approximation [7], [15] and hence, we too are motivated to use this model. As shown in Fig. 1 (a), when the translating stage is moved downwards in steps of Δd, for the mth frame we can express the blur parameter for a 3D point whose image pixel coordinates are (k, l) as

σm(k, l) = ρRv ( 1/wd − 1/(wd + mΔd − d(k, l)) )    (3)

where R is the radius of aperture of the lens. Since the blur parameter is a function of depth, the blurring induced on the image plane is space-variant. The product ρRv can be found using an appropriate calibration procedure. Its value remains constant during the entire image capturing process, which is an advantage. With this relationship in place, we are now in a position to understand the forward image formation process given by (1). The focused reference image when blurred by the blur map σ0 yields the observation y0. The focused image x, when warped using (2) and blurred by σ1 (after remapping σ1 to the warped grid), yields y1, and so on.
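A matching sketch for the blur model of (3) is shown below. The names (blur_parameter, rho_R_v for the calibrated product ρRv) and all numeric values are our own assumptions; the absolute value simply keeps the blur parameter non-negative for points on either side of the focused plane.

    import numpy as np

    def blur_parameter(d_kl, wd, delta_d, m, rho_R_v):
        # Blur parameter sigma_m(k, l) for a point with structure value d(k, l), cf. Eq. (3).
        return rho_R_v * abs(1.0 / wd - 1.0 / (wd + m * delta_d - d_kl))

    def gaussian_psf(sigma, radius=7):
        # Circularly symmetric 2D Gaussian PSF used to model the defocus blur.
        ax = np.arange(-radius, radius + 1)
        xx, yy = np.meshgrid(ax, ax)
        psf = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
        return psf / psf.sum()

    sigma = blur_parameter(d_kl=5.0, wd=200.0, delta_d=1.0, m=3, rho_R_v=1.5e4)
    kernel = gaussian_psf(max(sigma, 1e-3))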
3 Structure and Focused Image Recovery
Given a stack of space-variantly defocused images wherein there is perceptible magnification from frame to frame due to the 3D nature of the specimen, how do we recover the depth profile and focused image of the object? Comparing (2) and (3) we notice that both blurring and pixel motion are a function of d. It is this tight coupling between motion and defocus cues (arising through d) that we exploit judiciously to estimate d. Simultaneous reconstruction of the depth profile d and the focused image x is an ill-posed inverse problem and hence, the solution has to be regularized using a priori constraints. Real-world objects usually have depth profiles which are locally smooth. The same argument holds for the focused image. Markov random fields (MRFs) have the capability to model spatial dependencies [22]. We model the structure of the 3D specimen using a Gauss-Markov random field (GMRF) with a first-order neighbourhood. The prior joint PDF for the depth map is given as P(d) = (1/Z) exp(−Σ_{c∈Cd} Vc^d(d)), where Z is the partition function, c is a clique, Cd is the set of all cliques and Vc^d(·) is the potential associated with clique c. For details on MRFs, see [22]. We model the focused image to be estimated by a separate GMRF whose PDF can be expressed as P(x) = (1/Z) exp(−Σ_{c∈Cx} Vc^x(x)).
We seek the maximum a posteriori (MAP) estimate of d and x. Let us consider a set of p frames chosen from the stack of N observations. Assuming the noise processes nm to be independent, the MAP estimates of d and x can be obtained by minimizing the posterior energy function

U^p(d, x) = Σ_{m∈O} ||ym − Hm(d)Wm(d)x||² / (2σ_η²) + λd Σ_{c∈Cd} Vc^d(d) + λx Σ_{c∈Cx} Vc^x(x)
where O = {u1, u2, . . . , up}, ui is the frame number and σ_η² is the variance of the Gaussian noise. Graph cuts are limited to minimization of submodular energy functions [23], [24]. In applications that involve blur, the cost functions in the MAP-MRF framework turn out to be non-submodular [25], [27]. For such energy functions, graph cuts have not been shown to exceed the performance of simulated annealing (SA) [26], [27]. The usefulness of the Quadratic Pseudo Boolean Optimization (QPBO) algorithm depends upon how many nodes in the graph are labeled [27] (Section 2.2, page 3). Also, in Section 3 of [27], it is asserted that roof duality works well in cases where the number of non-submodular terms is small. However, in more difficult cases the roof duality technique leaves many nodes unassigned. We refer to Table 1, on page 7 of [27]. The comparison results for image deconvolution (only 'space-invariant' blurring with the number of gray levels limited to 32) are presented in the last two rows (3 × 3 and 5 × 5 kernels) of this table. Note that the number of unassigned labels for the QPBO and the 'probing' QPBO (QPBOP) methods is a whopping 80% for the 5 × 5 sized kernel. Note that using SA, the energy at convergence is zero and all the nodes are labeled. Interestingly, SA is shown to outperform all the methods including QPBO and QPBOP (pointed out in [27] in the 'Image Deconvolution' sub-section on page 8). Hence, we use the SA algorithm to minimize U^p(d, x). Parameters λd and λx must be tuned to obtain a good estimate of both d and x.
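As a rough illustration of how the posterior energy above could be evaluated for candidate estimates of d and x, the sketch below uses a simple quadratic first-order smoothness term as a stand-in for the GMRF clique potentials. The function names (posterior_energy, apply_warp_and_blur, smoothness) are ours, and the actual minimization in the paper is carried out with simulated annealing over the chosen frame set O.

    import numpy as np

    def smoothness(a):
        # Quadratic first-order smoothness penalty (a simple GMRF-style potential).
        return np.sum(np.diff(a, axis=0) ** 2) + np.sum(np.diff(a, axis=1) ** 2)

    def posterior_energy(observations, d, x, sigma_eta, lambda_d, lambda_x,
                         apply_warp_and_blur):
        # observations: list of (m, y_m) pairs for the frames in O.
        # apply_warp_and_blur(x, d, m) must return H_m(d) W_m(d) x as an image.
        data_term = 0.0
        for m, y_m in observations:
            residual = y_m - apply_warp_and_blur(x, d, m)
            data_term += np.sum(residual ** 2) / (2.0 * sigma_eta ** 2)
        return data_term + lambda_d * smoothness(d) + lambda_x * smoothness(x)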
4 Experimental Results
Since ours is the first formulation of its kind for real-aperture axial stereo, we are unable to provide comparisons. Existing formulations either assume parallax (and no blur) [1] [2] or blur (and no parallax) [11] [12] [14] [15] [16]. We first present results for a synthetic case. We chose a ramp of height 3 cm as the ground truth 3D object, which is shown as a grayscale image in Fig. 2 (a). The 'calf texture' (Fig. 2 (b)) from the Brodatz class of textures [28] was mapped onto its surface. We simulated the motion of this specimen by finite steps of Δd = 1 mm and captured a sequence of frames. The structure and the focused image of the specimen were reconstructed using the proposed technique. The values of the MRF parameters were chosen as λd = 1 × 10^9 and λx = 0.005. The reconstructed depth profile represented as a grayscale image is shown in Fig. 2 (c). The depth map is smoothly varying (as expected) from the left edge to the right. The rms error was found to be only 0.0817 cm. The estimated focused image is shown in
Fig. 2. (a) Grayscale image of ground truth depth profile of 3D ramp specimen. (b) Ground truth focused image. (c) Grayscale image output of the depth map. (d) The estimated focused image.
Fig. 3. A portion from a clay model of a bunny. (a, b) Two of the observations chosen from the stack. (c) Restored image. (d) Grayscale image of the estimated depth map.
Fig. 2 (d). The image is sharp and shows all the details clearly. The rms error for the reconstructed image is only 13 gray levels. We next describe real experiments that we performed using an off-the-shelf Olympus C-5050ZOOM digital camera. The camera is operated in the super-macro mode where space-variant blurring is observed. In this mode, the signal-to-noise ratio (SNR) is low since the aperture size is small. We first used a small clay model of a rabbit and imaged an area around the eye of the specimen by moving the object along the Z-axis by Δd = 1 mm. Two of the frames from
Fig. 4. Wooden specimen of a face. (a, b, c) Frames chosen from the stack. (d) Restored image. (e) Recovered depth map.
the stack are shown in Fig. 3 (a) and (b) to depict the effect of pixel motion in the frames. The size of each image was 161 × 199 pixels. The estimated focused image of the 3D specimen using the proposed method is shown in Fig. 3 (c). Note that the eyelids and various details of the texture are reconstructed well. We remark that the proposed method performs denoising also, because of which the focused image looks clean and smooth as compared to the observations. The structure of the object is depicted in Fig. 3 (d) as a grayscale image. The MRF parameters used were λd = 1 × 10^8 and λx = 0.05. Depth variations depicting the eyelids and the pupil can be easily discerned in Fig. 3 (d). Next, we imaged a small portion of a wooden specimen which had a face carved on its surface. Moving the object away from the lens plane, in steps of Δd = 1 mm, we captured a stack of images using the same Olympus camera. Some of the frames chosen from the stack are shown in Figs. 4 (a) - (c). The focused image and the depth profile of the 3D specimen are shown in Figs. 4 (d) and (e), respectively. Observe that the texture on the forehead, the cheek region and sharp discontinuities at the eyebrows, the nose and the straight edge below
it are restored faithfully in Fig. 4 (d). The proposed method has successfully removed noise in the observations and has obtained a clean focused image. The values of the parameters λx and λd were identical to those used in the previous experiment. One can clearly observe details such as the eyebrows and the nose in the grayscale image corresponding to the depth profile in Fig. 4 (e). Note that we used only four frames from the stack for reconstructing the structure and the focused image, for both the synthetic case and the experiments using real-world specimens. These four observations were chosen such that there is sufficient relative blur among them.
5 Conclusions
We proposed a computational approach to extend the scope of axial stereo to handle defocus effects of a real-aperture camera. The method can accurately recover not only the depth information but also the focused image of the underlying 3D object by solving for correspondences in blur. Our approach judiciously incorporates the cues of defocus and pixel motion within a unified framework. This work can be used as a basis for extending structure-from-motion to real-aperture cameras. Literature abounds with papers on motion-based or motion-free super-resolution, albeit treated independently. Observe the interesting fact that within the proposed framework both the pixel motion and defocus cues are embedded and can be leveraged to reconstruct a high-resolution image as well as a depth profile. Acknowledgment. The second author is grateful to the Alexander von Humboldt Foundation for its support.
References 1. Alvertos, N., Brzakovic, D., Gonzalez, R.C.: Camera geometries for image matching in 3-D machine vision. IEEE Trans. Pattern Anal. Mach. Intell. 11(9), 897–915 (1989) 2. Jain, R., Bartlett, S.L., O’Brien, N.: Motion stereo using ego-motion complex logarithmic mapping. IEEE Trans. Pattern Anal. Mach. Intell. 9(3), 356–369 (1987) 3. Willson, R.G., Shafer, S.A.: What is the center of the image? Journal of the Optical Society of America A 11(11), 2946–2955 (1994) 4. Zhang, Y., Wen, C., Zhang, Y.: Estimation of motion parameters from blurred images. Pattern Recognition Letters 21(5), 425–433 (2000) 5. Myles, Z., Lobo, N.: Recovering affine motion and defocus blur simultaneously. IEEE Trans. Pattern Anal. Mach. Intell. 20(6), 652–658 (1998) 6. Deschenes, F., Ziou, D., Fuchs, P.: An unified approach for a simultaneous and cooperative estimation of defocus blur and spatial shifts. Image and Vision Computing 22(1), 35–57 (2004) 7. Chaudhuri, S., Rajagopalan, A.N.: Depth from defocus: A real aperture imaging approach. Springer, New York (1999) 8. Ma, J., Olsen, S.I.: Depth from zooming. Journal of the Optical Society of America A 7(10), 1883–1890 (1990)
9. Lavest, J.M., Rives, G., Dhome, M.: Three-dimensional reconstruction by zooming. IEEE Trans. Robotics and Automation 9(2), 196–207 (1993) 10. Rajagopalan, A.N., Chaudhuri, S., Mudenagudi, U.: Depth estimation and image restoration using defocused stereo pairs. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1521–1525 (2004) 11. Hasinoff, S.W., Kutulakos, K.N.: Confocal stereo. International Journal of Computer Vision 81(1), 82–104 (2009) 12. Jin, H., Favaro, P.: A variational approach to shape from defocus. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 18–30. Springer, Heidelberg (2002) 13. Watanabe, M., Nayar, S.K.: Telecentric optics for focus analysis. IEEE Trans. Pattern Anal. Mach. Intell. 19(12), 1360–1365 (1997) 14. Favaro, P., Soatto, S.: A geometric approach to shape from defocus. IEEE Trans. on Pattern Anal. Mach. Intell. 27(3), 406–416 (2005) 15. Favaro, P., Soatto, S., Burger, M., Osher, S.J.: Shape from defocus via diffusion. IEEE Trans. on Pattern Anal. Mach. Intell. 30(3), 518–531 (2008) 16. Zhang, L., Nayar, S.K.: Projection defocus analysis for scene capture and image display. In: Proc. ACM SIGGRAPH, pp. 907–915 (2006) 17. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera with a coded aperture. In: Proc. ACM SIGGRAPH (2007) 18. Favaro, P., Soatto, S.: Seeing beyond occlusions (and other marvels of a finite lens aperture). In: Proc. Computer Vision and Pattern Recognition, pp. 579–586 (2003) 19. Hasinoff, S.W., Kutulakos, K.N.: A layer-based restoration framework for variableaperture photography. In: Proc. IEEE Intl. Conf. Computer Vision (2007) 20. Nayar, S.K., Nakagawa, Y.: Shape from focus. IEEE Trans. Pattern Anal. Mach. Intell. 16(8), 824–831 (1994) 21. Pentland, A.P.: A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4), 523–531 (1987) 22. Li, S.Z.: Markov random field modeling in computer vision. Springer, Heidelberg (1995) 23. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. and Mach. Intell. 23(11), 1222–1239 (2001) 24. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 147–159 (2004) 25. Raj, A., Zabih, R.: A graph cut algorithm for generalized image deconvolution. In: Proc. IEEE Intl. Conf. on Computer Vision, pp. 1048–1054 (2005) 26. Kolmogorov, V., Rother, C.: Minimizing nonsubmodular functions with graph cutsA review. IEEE Trans. Pattern Anal. Mach. Intell. 29(7), 1274–1279 (2007) 27. Rother, C., Kolmogorov, V., Lempitsky, V., Szummer, M.: Optimizing binary MRFs via extended roof duality. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 28. Brodatz, P.: Textures: A photographic album for artists and designers. Dover, New York (1966)
Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling
Alexander Schick1 and Rainer Stiefelhagen1,2
1 Interactive Analysis and Diagnosis, Fraunhofer IITB Karlsruhe
[email protected]
2 Institut für Anthropomatik, Universität Karlsruhe (TH)
[email protected], [email protected]
Abstract. We present an approach to compute the visual hulls of multiple people in real-time in the presence of occlusions. We prove that the resulting visual hulls are correct and minimal under occlusions. Our proposed algorithm runs completely on the GPU with framerates up to 50 fps for multiple people using only one computer equipped with off-the-shelf hardware. We also compare runtimes for different graphic chips and show that our approach scales very well without additional effort. Comparison to other work shows that our algorithm is as fast as state-of-the-art technology. The resulting visual hulls can be the basis for a wide range of algorithms that require a robust voxel representation as input.
1 Introduction
We are interested in analyzing people that are interacting in multi-camera environments like smart rooms, smart control rooms, and smart housing. In these scenarios, the same scene is observed by multiple cameras. Due to different viewpoints, this leads to ambiguities. In one image, the actions of a person could be clearly visible, whereas in another image, the person might be occluded. Applications working with this data must find a way to reason with ambiguities, for example by fusing the inputs into one well-defined representation. Volumetric reconstruction algorithms, like voxel carving, provide a solution by building a coherent 3D representation. The visual hull concept is well-known in the context of volumetric reconstruction and many applications can benefit from it. However, it relies on silhouette information and is very sensitive to segmentation errors. Static occlusions, like tables and chairs, cause severe problems for foreground segmentation and therefore for visual hulls in general. Current solutions do not address this problem properly. We will present a systematic solution to compute the minimal visual hulls during occlusions and prove its correctness. To achieve real-time runtimes, many computer vision applications use the power of modern GPUs. By strictly following the highly parallel mathematical
problem formulation of voxel carving, we implemented our algorithm with integrated occlusion handling on the GPU. It runs with up to 50 fps for multiple people and will scale effortlessly with future graphic chips.
2 Related Work
Laurentini introduced the visual hull concept to reconstruct 3D structures using only silhouette images [1]. There are many ways to compute the visual hull, e.g. ray intersection [2]; we focus on voxel carving because it is very robust and can be implemented on the GPU due to its highly parallel nature. In voxel carving, the 3D space is sampled into a set of 3D points. Each point is projected into every silhouette image. If the projected voxel is outside the silhouette in one or more images, it gets removed or carved. This is repeated for every voxel until the set of remaining voxels forms the visual hull. Static scene occlusions, like tables, introduce significant difficulties for most visual hull approaches because they directly affect the silhouettes of objects that they occlude. This leads to severe defects in the visual hulls and is a huge problem for real-world applications. Guan et al. handle occlusions by using artificial silhouettes of the occluding objects [3], but the resulting visual hulls seem not to be minimal. In other work, Guan et al. infer the 3D location of static occluders in very difficult scenarios using a Bayesian sensor fusion approach [4], but their system needs approximately one minute per frame. Kim et al. allow one silhouette miss in one camera per voxel to avoid defects in the visual hull [5], which will not work when occlusions occur in more than one view. Ladikos et al. add the silhouettes of occluders that were either manually annotated or the result of projections of 3D models [6]. However, this potentially adds a lot of noise. Even though some partial solutions were proposed, most visual hull algorithms are still evaluated in scenarios without occlusions and with only one person. In Section 3.2, we will propose a systematic and general solution by introducing depth-based occlusion maps. In addition, we will prove that the resulting visual hulls are correct and minimal. For interactive smart environments, computation speed is another critical factor because the applications must run in real-time. Lookup tables reduce computation time by storing the voxel projections. Examples are provided by Luck et al. [7] and Kehl et al. [8]. But lookup tables have severe drawbacks: once computed they are fixed, one lookup table per image is required, and their size grows cubically in the number of voxels. Octrees offer a way to reduce the number of computations without relying on lookup tables. They start with very large voxels and only refine the resolution where the voxel projection results in a partial silhouette hit. Examples of octree-based visual hull computations are presented by Caillette and Howard [9] and Erol et al. [10]. Instead of minimizing computations, they can be carried out very efficiently using the power of modern GPUs, for example with NVIDIA Cuda [11]. Overviews of GPU programming using Cuda are given by Lindholm et al. [12] and Garland et al. [13]. Fung and Mann discuss the importance of GPUs in future computer vision and image processing research [15]. Kim et al. implement
their volumetric reconstruction on the GPU and show a significant speed-up [5]. Ladikos et al. compare four different voxel carving implementations on the GPU that are based on octree- and non-octree-versions with different levels of precomputation [6]. In Section 4, we will compare our results with theirs.
3 GPU-Based Voxel Carving
This section introduces our voxel carving algorithm that computes visual hulls even in the presence of static occlusions. The voxel carving runs completely on the GPU and is implemented using NVIDIA Cuda [11]. For the remainder of this chapter, we assume a static scene that is observed by n calibrated cameras. The silhouette images Sj for camera j are computed using difference-based foreground segmentation and morphological operators.
3.1 Problem Description
We followed a discrete formulation of voxel carving. The voxels are defined by a 3D grid. For the sake of simplicity, we assume that the grid is cubic with k ∈ N entries along each axis, resulting in k³ voxels. In addition, its axes match the axes of the world coordinate system. Furthermore, the voxels are assumed to be cubes with side length c ∈ R. Thus, the set of voxels is defined by:

V = {(x, y, z) ∈ R³ | ∃ k1, k2, k3 < k : x = k1 c ∧ y = k2 c ∧ z = k3 c}.    (1)
Let Sj ⊆ N0 × N0 be the set of image pixels that are part of the silhouette in image j and let pj : V → N0 ×N0 be the projection from voxel space to the image plane. The visual hull H ⊆ V is the set of voxels that project onto silhouette pixels for every image: H = {x ∈ V |∀j : pj (x) ∈ Sj }.
(2)
Voxels are volumetric structures. In our work, however, they project onto single pixels due to their small size. Therefore, we just project the center of each voxel into the image. This approach can be modified when the volumetric nature must be maintained, e.g. in case of a coarser resolution. Then, the corners of the voxels are projected into the image and silhouette lookups are done on integral images [14]. Voxel carving starts with a full set of voxels. Voxels that do not satisfy Equation 2 are carved, i.e. removed from H. In this context, Equation 2 can be reformulated:

H = V \ {x ∈ V | ∃j : pj(x) ∉ Sj}.    (3)

The key for efficient visual hull computation using the power of modern GPUs lies in Equation 3, as we will show later.
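The carving rule of Equations 2 and 3 can be written down compactly. The following NumPy sketch is purely illustrative — the camera model, the 3×4 projection-matrix convention and all names are our assumptions, not the authors' implementation — and checks, for every voxel centre, whether its projection hits the silhouette in all views.

    import numpy as np

    def carve(voxels, silhouettes, projections):
        # voxels: (V, 3) array of voxel centres; silhouettes: list of HxW boolean
        # masks S_j; projections: list of 3x4 camera matrices p_j.
        keep = np.ones(len(voxels), dtype=bool)
        hom = np.hstack([voxels, np.ones((len(voxels), 1))])  # homogeneous coordinates
        for P, sil in zip(projections, silhouettes):
            uvw = hom @ P.T
            u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
            v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
            h, w = sil.shape
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            hit = np.zeros(len(voxels), dtype=bool)
            hit[inside] = sil[v[inside], u[inside]]
            keep &= hit  # a single silhouette miss carves the voxel
        return voxels[keep]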
3.2 Static Scene Occluders
Voxel carving is based on silhouettes extracted by foreground-background segmentation. These segmentation techniques usually learn the background model of an empty scene and compute the foreground of new images by comparing the differences. Even though foreground segmentation is very fast and robust, it runs into problems when there are static objects in the scene. These objects are part of the background and therefore not segmented. This is also true for every object behind that occluder. This causes problems for visual hulls that rely on the silhouettes. Figure 3 shows that, without occlusion handling, large parts of the visual hull are not reconstructed. Adding the binary silhouettes of occluders to the silhouette image solves this problem in the sense that the occluded parts are reconstructed. But then the visual hull also contains a lot of noise, i.e. voxels that could be carved. We are interested in the best possible minimal visual hulls. Kim et al. do not carve voxels that fall into background pixels for only one camera [5], thus relaxing the carving procedure. But this would not work when occlusions happen in multiple camera images. Guan et al. add the occluder’s binary silhouette if it intersects with the foreground silhouettes [3]. However, voxels that project into these silhouettes are not always truly occluded. They are only occluded when the 3D position of the object is behind the occluding object in the camera view. But binary silhouettes cannot account for this and will lead to avoidable noise resulting in a non-minimal visual hull. We will now present our solution and prove its correctness. We assume that the size and position of static scene occluders is known. Let Hunoccl ⊆ V be the visual hull assuming no occluding objects would be present, Hoccl the (defective) visual hull in the presence of occlusions, and O ⊆ V the set of 3D points occupied by the occluders. The function dj : V → R computes the distance of voxel x to camera j. It directly follows that Hoccl = Hunoccl \ {x ∈ V |∃j∃y ∈ O : pj (x) = pj (y) ∧ dj (y) < dj (x)}.
(4)
Let Mj be the set of voxels that are occluded by static scene objects for camera j: Mj = {x ∈ V |∃y ∈ O : pj (x) = pj (y) ∧ dj (y) < dj (x)}.
(5)
The modified visual hull is H = {x ∈ V |∀j : (pj (x) ∈ Sj ∨ ∃y ∈ Mj : x = y)} \ O.
(6)
We will prove now that the visual hull is correct and minimal. Proof. Using silhouette images and 3D positions of occluding objects only, the modified visual hull H from Equation 6 is the smallest approximation of Hunoccl with Hunoccl ⊆ H. We show that Hunoccl ⊆ H. Let x ∈ Hunoccl . If x is not occluded, it contributes to every silhouette Sj and is therefore also in H. If x ∈ Hunoccl is occluded in
view j, it follows from Equation 5 that it is in Mj and, following Equation 6, also in H. Thus, Hunoccl ⊆ H. Now, let H be not minimal, y ∈ H, and H \ y minimal. From Equation 6 it follows that ∃j : pj(y) ∉ Sj. Assuming no errors in the silhouette segmentation, this can only happen due to occlusion, e.g. when an object is in front of y with respect to view j. But then y ∈ Mj and it is impossible to decide based on silhouettes alone if y ∈ Hunoccl or y ∉ Hunoccl. Therefore, we can choose y ∈ Hunoccl, leading to Hunoccl ⊈ H \ y, a contradiction.
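In code, the occlusion-aware rule of Equation 6 amounts to one extra comparison per view against a per-pixel occluder distance. The sketch below is again only illustrative (project_with_depth and the data layout are our assumptions); each occlusion map holds the distance of the closest occluder at each pixel, or infinity where no occluder projects.

    def keep_voxel(voxel, silhouettes, occlusion_maps, project_with_depth):
        # project_with_depth(voxel, j) -> (u, v, dist): pixel position and distance
        # of the voxel to camera j. silhouettes[j] is a boolean mask and
        # occlusion_maps[j][v, u] is the distance of the closest occluder (inf if none).
        for j, (sil, occ) in enumerate(zip(silhouettes, occlusion_maps)):
            u, v, dist = project_with_depth(voxel, j)
            if not (0 <= v < sil.shape[0] and 0 <= u < sil.shape[1]):
                return False  # projects outside the image: carve
            occluded = dist > occ[v, u]  # an occluder lies between voxel and camera
            if not (sil[v, u] or occluded):
                return False  # silhouette miss and not occluded: carve
        return True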
3.3 GPU Implementation Using Cuda
After we have introduced our voxel carving algorithm with occlusion handling, we will now present our GPU implementation. We assume that the basic concepts of GPU programming with NVIDIA Cuda are known [11,12,13]. In our GPU algorithm, we implement Equations 5 and 6 using additional occlusion maps. An occlusion map has the same dimensions as the camera image and stores for every pixel the distance of the closest occluding object at that pixel to the camera (or infinite in case of no occlusions). This allows us to decide if a voxel is truly occluded because the voxel's distance to the camera can be compared to the corresponding distance stored in the occlusion map. Using binary silhouette information, this would not be possible. Algorithm 1 shows the pseudocode of our kernel that is executed by every thread on the GPU. Each voxel can be identified by a unique id due to the fixed grid size (see Section 3.1). Each thread projects one voxel into every camera image. Only if the voxel projects into the silhouette or behind an occluding object for all camera views, its id is inserted into the visual hull result array. The design of Algorithm 1 allows using an arbitrary number of camera images. Unnecessary computations are avoided by terminating as soon as one projection fails. Occlusion map lookups do not cause additional expensive computations because the projection was already done for the silhouette image. To allow efficient image lookups, the silhouette images as well as the occlusion maps are stored in texture memory; the camera parameters used for the projections reside in constant memory space. In addition, the algorithm scales independently of the number of occluders because the occlusion map size is constant. However, we experienced one major bottleneck. The ids of the uncarved voxels are returned to the host using an array. To avoid that multiple parallel threads write at the same position into this array, we used atomic operations to increase the array index. This leads to a bottleneck if a large number of threads try to access the array at the same time. We found two additional extensions to be useful: First, to avoid the reconstruction of large areas that are occluded in every view, at least one true silhouette hit can be required. Second, to reduce the bottleneck introduced by the use of the atomicAdd operator, each thread can check more than one voxel and insert the uncarved voxels in one batch into the visual hull vector, thus reducing the total number of atomic operations.
Algorithm 1. Voxel Carving Kernel in NVIDIA Cuda Pseudocode

voxelCarvingKernel(visualHullArray):
    blockId, threadId ← blockIdx.x, threadIdx.x
    voxel ← getVoxel(blockId, threadId)
    FORALL camera views j:
        (x, y) ← projectVoxel(voxel, j)
        voxelToCamDist ← distanceToCam(voxel, j)
        IF silhouetteHit(x, y, j) OR voxelToCamDist > occlusionMask(x, y, j)
            continue
        ELSE
            terminate thread
    add(visualHullArray, voxel)
Details on the number of voxels, the exact kernel configuration, and runtime results are presented in Section 4. There, we will also show that the design of this algorithm allows effortless scaling with future graphic cards.
4 Results
After introducing our voxel carving algorithm in Section 3, we will now discuss results obtained with it. We evaluated our algorithm in two different scenarios. To provide a basis for comparison with other algorithms, results on the Middlebury dataset [16] are shown in Section 4.1. Results in a multi-camera environment are given in Section 4.2. We did all experiments on a standard workstation with a 3GHz Intel(R) Core(TM) 2 Duo CPU and 4GB of RAM. The graphic chip was an NVIDIA GTX280. To show that our algorithm scales very well with different graphic chips, we also ran experiments on the same workstation but with an NVIDIA NVS290 that is significantly weaker. All components are off-the-shelf hardware.
4.1 Evaluation on the Middlebury Dataset
The Middlebury dataset provides data to compare volumetric reconstruction algorithms and was introduced by Seitz et al. [16]. It consists of two objects – the "Temple of the Dioskouroi" and a stegosaurus – that are recorded in three different camera setups (16, 47, and 312 views for the temple; 16, 48, and 363 for the dino). Results of various algorithms can be found on their page [17]. We did not use occlusion maps in this evaluation because no occlusions occurred. We configured our kernel to use 512 threads per block; each thread checked 64 voxels. Figure 1 shows qualitative results. Fine details were preserved and the volumetric structure was well reconstructed. However, due to the nature of voxel carving, concavities are not reconstructed. Table 1 shows computation times for both objects with two different camera configurations each. We used two different graphic chips, the NVIDIA GTX280
Fig. 1. Results on the Middlebury dataset. The images show examples from the Middlebury dataset [17] and our voxel-based visual hulls. Even with a very coarse grid resolution (64³) many details were preserved.

Table 1. Runtimes on the Middlebury dataset with NVIDIA GTX280 and NVIDIA NVS290 graphic chips compared to results provided by Ladikos et al. [6]. Note that [6] used 4 PCs with dedicated NVIDIA 8800 GTX graphic chips while our algorithm runs on only one PC using an NVIDIA GTX280.

Data | Cams | Total Grid Voxels | Visual Hull Voxels | Runtime GTX280 [ms] | Runtime NVS290 [ms] | Ladikos et al. [6] GPU2 (4 PCs) [ms] | Ladikos et al. [6] GPU2 OT (4 PCs) [ms]
dinoSparseRing | 16 | 128³ | 82³ | 72 | 1056 | - | -
dinoSparseRing | 16 | 256³ | 165³ | 511 | - | - | -
dinoRing | 48 | 128³ | 79³ | 171 | 2933 | 417.79 | 99.89
dinoRing | 48 | 256³ | 159³ | 1214 | - | 3033.89 | 296.71
templeSparseRing | 16 | 128³ | 84³ | 66 | 935 | - | -
templeSparseRing | 16 | 256³ | 170³ | 486 | - | - | -
templeRing | 47 | 128³ | 83³ | 151 | 2495 | 372.95 | 170.91
templeRing | 47 | 256³ | 166³ | 1036 | - | 3022.10 | 516.80
with 30 multiprocessors and the NVIDIA NVS290 with only 2 multiprocessors. In Table 1, we also compare our direct voxel carving to the work of Ladikos et al. [6], which is one of the fastest GPU-based voxel carving approaches to the best of our knowledge. Direct comparison is difficult because they use four PCs with dedicated graphic cards simultaneously. However, our approach is faster than their direct algorithm (GPU2). Taking into account that they were using 4 PCs, our runtime results are comparable to their highly optimized octree-based voxel carving (GPU2 OT). Table 1 also shows runtimes for a significantly weaker graphic chip. The GTX280 constructs the visual hull approximately 15 times faster than the NVS290 without additional modifications; this speedup was expected due to hardware specifications. We conclude that this will be true for future graphic chips as well, allowing even faster and more detailed reconstructions without additional effort. This implies that researchers can rely on available and future hardware to produce reliable high-quality visual hulls.
Fig. 2. Camera images and resulting visual hulls for multiple people rendered from an unobserved viewpoint. The voxel size is 1.8 cm³ and the grid resolution 256³. Note that the resolution of the people in the camera images is relatively small compared to the whole room, which introduces additional difficulties.

Table 2. Runtimes in a multi-camera environment with varying number of people and multiple grid resolutions. All experiments were carried out on an NVIDIA GTX280 graphic chip.

Data | Cameras | Total Grid Voxels | Visual Hull Voxels | Runtime GTX280 [ms]
1 person | 4 | 128³ | 19³ | 17
1 person | 4 | 256³ | 37³ | 43
3 people | 4 | 128³ | 25³ | 17
3 people | 4 | 256³ | 50³ | 45

4.2 Experiments in a Multi-camera Environment
We also evaluated our voxel carving algorithm in a multi-camera environment. The room is approximately 6 × 9 m and four calibrated AXIS 210 network cameras [18] positioned in the corners observe the scene. They capture 640 × 480 px RGB images with frame rates up to 30 fps. We used 512 threads per block and one thread per voxel. In our evaluation, we used offline video data to avoid being limited by the cameras' capturing speed. Figure 2 shows the visual hulls of one and multiple people in our multi-camera environment from an unobserved viewpoint. The voxels are carved according to Section 3.3. The images show results obtained at a grid resolution of 256³ and the voxels' side length is 1.8 cm. The visual hulls are detailed enough to provide a solid basis for further processing, e.g. for articulated body tracking. Table 2 shows that the algorithm runs in real-time with 20 − 50 fps depending on the resolution. This leaves enough time for additional components. Comparing Table 1 to Table 2 shows that the runtimes differ significantly for the same grid resolution but are comparable within the same setting. The reason for this is the number of voxels remaining after carving. If a voxel gets carved, the threads are terminated and no other image is checked, thus less computation is necessary. In addition, every non-carved voxel must be returned to the CPU-based host program. If many voxels are returned, this leads to a significant bottleneck, as was discussed in Section 3.3. We will now present our results in case of severe occlusions. Figure 3 shows one person sitting at a table. The legs are only visible in two out of four cameras.
Fig. 3. Visual hull computation in the presence of severe occlusions. The legs are occluded in two of the four camera views. Without occlusion reasoning, the voxels under the table are carved. Our approach computes the best possible minimal visual hull even in the presence of occlusions. The resolution of the person is very small due to the size of the room. However, our approach even works under these conditions. To avoid large blocks of uncarved voxels under the table that are occluded in all views, we used the extension described in Section 3.3.
Without occlusion handling, the visual hull has huge defects. With our proposed occlusion maps, however, we can model the exact position of the table and can include this information in the voxel carving process. This allows us to reason about voxels under the table from the two occluded camera views, resulting in a visual hull that includes the occluded legs. Even though noise is still present, the visual hull is the best possible minimal visual hull obtainable.
5 Conclusion
We presented a voxel carving algorithm for computing visual hulls that is robust to severe occlusions in multiple camera views. We proved that the visual hulls obtained under occlusions are correct and that they are minimal, which is important to reduce noise as much as possible while maintaining valuable information about foreground objects. To the best of our knowledge, this was not yet addressed systematically for real-time applications. We mapped the voxel carving together with the occlusion handling to a parallel GPU implementation that is an extension to the general voxel carving approach. Our GPU implementation runs in real-time with up to 50 fps for multiple people in a multi-camera environment. In addition, we showed that our algorithm scales very well with different graphic chips. Therefore, we assume it will also scale with future GPU technologies, thus allowing even faster computations and higher resolutions without additional effort. Acknowledgements. This work was supported by the FhG Internal Programs under Grant No. 692026. We also thank the reviewers for their valuable comments that helped to improve this paper.
References 1. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 150–162 (1994)
2. Li, M., Magnor, M., Seidel, H.-P.: Hardware-accelerated visual hull reconstruction and rendering. Graphics Interface, 65–71 (2003) 3. Guan, L., Sinha, S., Franco, J.-S., Pollefeys, M.: Visual Hull Construction in the Presence of Partial Occlusion. In: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 413–420 (2006) 4. Guan, L., Franco, J.-S., Pollefeys, M.: 3D Occlusion Inference from Silhouette Cues. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 5. Kim, H., Sakamoto, R., Kitahara, I., Orman, N., Toriyama, T., Kogure, K.: Compensated Visual Hull for Defective Segmentation and Occlusion. In: 17th International Conference on Artificial Reality and Telexistence, pp. 210–217 (2007) 6. Ladikos, A., Benhimane, S., Navab, N.: Efficient visual hull computation for realtime 3D reconstruction using CUDA. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8 (2008) 7. Luck, J., Small, D., Little, C.Q.: Real-Time Tracking of Articulated Human Models Using a 3D Shape-from-Silhouette Method. In: Proceedings of the International Workshop on Robot Vision, pp. 19–26 (2001) 8. Kehl, R., Bray, M., Van Gool, L.: Full body tracking from multiple views using stochastic sampling. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 129–136 (2005) 9. Caillette, L., Howard, T.: Real-Time Markerless Human Body Tracking with MultiView 3-D Voxel Reconstruction. In: Proc. ISMAR, pp. 597–606 (2004) 10. Erol, A., Bebis, G., Boyle, R.D., Nicolescu, M.: Visual Hull Construction Using Adaptive Sampling. In: Proceedings of the Seventh IEEE Workshops on Application of Computer Vision, vol. 1, pp. 234–241 (2005) 11. NVIDIA Cuda (March 2009), http://www.nvidia.com/cuda 12. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 39–55 (2008) 13. Garland, M., Le Grand, S., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Zhang, Y., Volkov, V.: Parallel Computing Experiences with CUDA. IEEE Micro 28, 13–27 (2008) 14. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 511–518 (2001) 15. Fung, J., Mann, S.: Using graphics devices in reverse: GPU-based Image Processing and Computer Vision. In: IEEE International Conference on Multimedia and Expo, pp. 9–12 (2008) 16. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 519–528 (2006) 17. Middlebury data sets (March 2009), http://vision.middlebury.edu/mview/data/ 18. AXIS Communications (March 2009), http://www.axis.com/
Image-Based Lunar Surface Reconstruction
Stephan Wenger, Anita Sellent, Ole Schütt, and Marcus Magnor
Computer Graphics Lab, TU Braunschweig, Mühlenpfordtstraße 23, D-38106 Braunschweig, Germany
Abstract. For the creation of a realistic 3 meter-sized relief globe of the Moon, a detailed height map of the entire lunar surface is required. Available height measurements of the Moon’s surface are too coarse by a factor of 15 for this purpose. The only publicly available source of high-resolution information are photographic images from the Lunar Orbiter IV mission in 1967. We present a shape-from-shading approach to plausibly increase the resolution of existing low-resolution height data, based on a single high-resolution photographic mosaic image of the Moon. The presented reconstruction approach is designed to be robust with respect to frequent imperfections of the photographic imagery. Aside from the automatic reconstruction of a complete detailed lunar surface height map, we give a qualitative validation by the reconstruction of lunar surface details from close-up photographs of the Apollo 15 landing site.
1 Introduction
In July 1969, Neil Armstrong and Edwin Aldrin were the first men to land on the Moon during the Apollo 11 mission. Forty years and another five manned Moon landings later, much of the Moon's surface structure still remains unrevealed. While many international space missions have been carried out since then [1], the most detailed photographs covering much of the lunar surface are still the ones taken by the Lunar Orbiter space probes (1966–1967). The available topographic data surprisingly is a lot more sparse for the Moon than, e.g., for the planet Mars. Our research project was initiated by the constructors of a Moon museum who noticed that the available lunar surface height data was utterly insufficient for the creation of one of the planned exhibits: a 3 meter-sized globe of the Moon with a realistic surface relief. For convincing effect, the necessary resolution of the lunar model height map would have to be about 3 pixels per millimeter on the model, or about 30 000 pixels around the model's equator. For comparison, the resolution of the best existing height data from the Unified Lunar Control Network 2005 [3], Fig. 3, is on average a factor of 15 lower. While the Lunar Orbiter data seems to be the best available source for high-resolution photographs of the entire lunar surface, displaying most parts of the Moon with a resolution of 60 meters per pixel or better, this imagery is challenging to interpret by a computer. The conventional photographic emulsion film was developed aboard the spacecraft, then digitized in stripes and transmitted
Fig. 1. The Lunar Orbiter mosaic [2] was used for the reconstruction of the lunar surface. The mosaic is stitched together from patches with different quality and varying exposure; some parts are entirely missing. Still it is the most comprehensive source of high-resolution shading information of the lunar surface.
Fig. 2. The quality of Lunar Orbiter photographs suffers from stains of photographic developer fluid (visible in this picture detail), missing patches, limited dynamic range (i.e. over- and underexposure) and ex post filtering
to Earth where the stripes were put back together, all that using the technology of the 1960s. The many snapshots were combined into a mosaic of the entire Moon, Fig. 1 (available at http://webgis.wr.usgs.gov/pigwad/down/Lunar_Orbiter_mosaic.htm). The quality of the mosaic suffers from stains of photographic developer fluid, missing patches, limited dynamic range (saturation) and ex post high-pass filtering, Fig. 2. Additionally, there is a variation in the incident angle of the sunlight: in most images the sunlight is incident from the right, about 20 degrees above the horizon, but both angles change by an unknown amount towards the Moon's poles. Fortunately, the intended purpose of the reconstructed lunar surface height map does not require exact reconstruction, but rather necessitates a certain visual plausibility of the resulting model. In order to be able to distinguish comparatively flat surface features on the model of the Moon, the actual height
Fig. 3. This low-resolution height map from the ULCN2005 network [3] is the best publicly available height information. We use it to initialize our reconstruction method of the entire moon surface.
data would have to be exaggerated anyway, and the reconstruction algorithm needs to qualitatively reproduce the real Moon's surface. Making use of the existing low-resolution height data, Fig. 3, we present a method to automatically reconstruct high-resolution surface detail based on a shape-from-shading approach applied to high-resolution imagery from the Lunar Orbiter mission. The algorithm is designed to be robust to the deficiencies of the input images. In addition, it reconstructs the entire surface of the Moon in a global and consistent way without further user interaction.
2 Related Work
The largest control network for the Moon published today is the Unified Lunar Control Network 2005 (ULCN2005) [3]. It combines images from the Clementine mission and data from an earlier network which had been derived from Earth-based and Apollo photographs, as well as Mariner 10 and Galileo images of the Moon. This network provides a global lunar topographic model that is denser than that provided by Clementine laser altimetry (LIDAR) and of similar accuracy. It consists of 272,931 unevenly distributed measuring points, resulting in an average resolution of about 12 kilometers per pixel. For the time being, higher density topographic data is only available in limited areas of the Moon. The Japanese Kaguya mission is aiming to acquire height data at a resolution of about 2 km per pixel, but until now only 30 km per pixel data has been published [4]. To date, the most comprehensive, high-resolution coverage of the lunar surface is achieved by the monocular images acquired by Lunar Orbiter [2]. Using monocular photographs to determine 3D structure has a long tradition in remote
Since the first solutions introduced by Rindfleisch [5] and Horn [6], this approach has undergone many refinements [7,8]. However, none of these approaches addresses the image imperfections one encounters in Lunar Orbiter images. SFS is an underdetermined problem, as it has to assign two directional angles of inclination based on one measured gray-scale value. With image acquisition in machine vision growing cheaper and cheaper, today most algorithms for height or depth estimation rely on several images. The most common approach is stereopsis using stereo image pairs acquired from different viewpoints but under the same lighting conditions [9]. Multi-image shape-from-shading unites the advantages of stereo with SFS: it considers the reflection model and uses images acquired under the same lighting conditions from different viewing directions [10]. Image pairs that depict the lunar surface under comparable lighting conditions are rare – especially on the far side of the Moon – and were acquired only in low resolution during the Clementine mission; this information is already included in the ULCN2005 height data. Another approach to obtain height information is to consider several images acquired under different lighting conditions: Clementine images and ground-based telescopic CCD images were used to reconstruct 3D elevation information for certain Moon regions in Refs. [11], [12] and [13]. Still, these methods do not achieve the resolution required for our purpose. In their recent work, Glencross et al. [14] concentrate on perceptually plausible height map reconstruction. As input, they require a pair of images, one diffusely illuminated and one flash-illuminated. Although the Lunar Orbiter and Clementine images were acquired under different lighting conditions, the Lunar Orbiter mosaic is already high-pass filtered in order to account for slowly varying albedo, while the Clementine images only provide pure albedo measurements. Thus these images cannot be used as comparable input to constrain the solutions of the SFS problem. A large part of the SFS literature directly calculates height information instead of estimating surface normals [15,16]. In our algorithm, we divide normal estimation and height estimation into two steps, adapting the integration algorithm of Smith and Bors [17]. This allows us, on the one hand, to easily incorporate the low-resolution height field given by ULCN2005 and, on the other hand, to weight the estimated normal information with a credibility map.
3 Algorithm
Our algorithm is designed to deal automatically with the imperfections of the monocular high-resolution Lunar Orbiter images. In order to calculate a global height map of the lunar surface, our algorithm proceeds in two steps. In the first step, we calculate normals wherever information is available. In the second step, a given low-resolution height map, e.g. the ULCN2005, is iteratively refined until it closely fits the reconstructed normals. The resulting height map is used as a basis for another reconstruction step, iteratively increasing the resolution.
Fig. 4. SFS on Lambertian surfaces determines only the angle between normal n and incident light l. We pick the normal in direction of the image gradient ∇I that is closest to the normal of a flat surface.
Since the amount of data being handled during the reconstruction of the whole lunar surface easily exceeds the available memory of recent desktop computers, large height maps are cut into overlapping pieces that are reconstructed separately. The results can be blended without problems – using a suitable continuous weighting function, e.g. a linear ramp – because the long-range coherence of the resulting height map is ensured by the lower-resolution height map used as a basis. The same procedure is used to ensure wrap-around continuity at the left and right image borders.
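As an illustration of the blending step, the following minimal sketch combines two horizontally adjacent tiles with a linear ramp; the function and variable names are ours and the tiles are assumed to overlap by a fixed number of columns.

```python
import numpy as np

def blend_tiles(left_tile, right_tile, overlap):
    """Blend two horizontally adjacent height-map tiles that share
    `overlap` columns, using a linear ramp as weighting function."""
    # Linear ramp: weight of the left tile falls from 1 to 0 across the overlap.
    ramp = np.linspace(1.0, 0.0, overlap)[None, :]
    blended = ramp * left_tile[:, -overlap:] + (1.0 - ramp) * right_tile[:, :overlap]
    return np.concatenate(
        [left_tile[:, :-overlap], blended, right_tile[:, overlap:]], axis=1)
```

The same scheme can be applied at the left and right borders of the global map to obtain wrap-around continuity.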
3.1 Normal Estimation
In order to estimate the normal vector for each pixel based on its intensity, we first assume Lambertian reflectance of the lunar surface so that the angle α between the normal vector n and the incident light direction l can be computed from the observed intensity I via

I = l \cdot n = \cos\alpha,   (1)
where I ∈ [0, 1]. While the Moon’s surface is not perfectly approximated by a Lambertian reflector [18], the error introduced by this assumption vanishes in comparison to the error caused by the unknown deviation of the light source from the position at the right side and at an angle of 20 degrees over the horizon. Knowing α, the normal vector is only restricted to a circle around the light direction vector l. In order to entirely fix the normal vector, another constraint is needed. For typical lunar geometries, the height gradient (and thus the projection of the normal vector onto the horizontal plane) is likely to be approximately collinear with the intensity gradient of the image, Fig. 4. This assumption proves reasonable because important height map features like rims and ridges cause strong intensity gradients (as long as they are not parallel to the incident light direction), while variations in the direction of the ridges – which might cause an intensity gradient that violates the assumption – are usually on such large scales that the associated intensity gradient is small, cf. Fig. 6(a). If we use this
assumption to further constrain the normal vector, at most two possible normal vectors remain. We select the one that is closer to the normal vector of a flat surface. (For an incident light angle of only about 20 degrees above the horizon, the other possible normal vector would usually represent an almost vertical wall, which is highly unlikely.) Note that the input data is given in a cylindrical projection. Therefore, the x coordinate has to be scaled by the cosine of the latitude whenever image gradients are calculated, in order to maintain the correct length scale throughout the whole map.
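A per-pixel sketch of this normal selection, assuming Lambertian reflectance and the gradient-collinearity constraint, is given below. The latitude scaling and degenerate gradients are only handled crudely, and all names are our own, not the authors' implementation.

```python
import numpy as np

def estimate_normal(intensity, grad, light):
    """Estimate a unit surface normal for one pixel.

    intensity : observed intensity I in [0, 1] (Lambertian: I = l . n)
    grad      : image gradient (gx, gy); its direction is assumed collinear
                with the horizontal projection of the normal
    light     : unit vector l of the incident light direction
    """
    gx, gy = grad
    norm = np.hypot(gx, gy)
    if norm < 1e-9:                       # no gradient: assume a flat surface
        return np.array([0.0, 0.0, 1.0])
    gx, gy = gx / norm, gy / norm
    # Candidate normals n(t) = (sin t * gx, sin t * gy, cos t), t = tilt angle.
    # The Lambertian constraint I = l . n becomes  A sin t + B cos t = I.
    A = light[0] * gx + light[1] * gy
    B = light[2]
    R = np.hypot(A, B)
    ratio = np.clip(intensity / R, -1.0, 1.0)
    phi = np.arctan2(B, A)
    t1 = np.arcsin(ratio) - phi
    t2 = np.pi - np.arcsin(ratio) - phi
    t = t1 if abs(t1) <= abs(t2) else t2   # normal closest to a flat surface
    return np.array([np.sin(t) * gx, np.sin(t) * gy, np.cos(t)])
```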
3.2 Credibility Map
Because of the challenging input data, some precautions have to be taken in order to compensate for the shortcomings of the photographic images. Regions that are saturated or underexposed do not yield any gradient information; they are assigned a credibility of zero. Towards saturated or underexposed regions, the credibility of usable pixels decreases with a Gaussian function to ensure smooth and plausible transitions. Additionally, all image gradients are smoothed using a Gaussian filter. The following normal integration step then takes care of enforcing continuity between the heights of neighboring pixels.
3.3 Normal Integration
The reconstructed normal vectors have to be integrated in order to obtain the final height map. We adapt an iterative algorithm by Smith and Bors [17], which we extend so that it takes the credibility map into account. The algorithm iteratively modifies a low-resolution height map, changing the pixel heights in order to approximate the specified normal vectors. In each step i, the height \tilde{h}_i(x, y) dictated by the normal map is computed for each pixel (x, y) from the current heights h_i(x, y) of its neighbors and the x and y normal vector components n_1(x, y) and n_2(x, y) as

\tilde{h}_i(x, y) = \frac{1}{4} \sum_{(u,v) \in N} \left( h_i(x+u, y+v) + u\, n_1(x+u, y) + v\, n_2(x, y+v) \right),   (2)

where N = {(±1, 0), (0, ±1)}. We weight h and \tilde{h} by a function

\eta_i(x, y) = \frac{\eta_0\, c(x, y)}{1 + 2i/(w+h)},

which is proportional to the credibility c(x, y) of the corresponding pixel and decreases with the iteration i in order to enforce convergence. Here w and h are the image width and height, respectively, and η_0 was set to 0.2. The heights are then updated by

h_{i+1}(x, y) = (1 - \eta_i(x, y))\, h_i(x, y) + \eta_i(x, y)\, \tilde{h}_i(x, y).   (3)

In particular, the initial height map does not change at all when the credibility is zero, i.e. in regions without any available image data. The iteration stops as soon as the average difference between \tilde{h}_i and h_i falls below a specified threshold that we set to 0.005.
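A compact sketch of this credibility-weighted integration scheme of Eqs. (2) and (3) follows; the array layout and helper names are ours, and image borders are handled by wrap-around for brevity, which matches the intended wrap-around continuity only in the horizontal direction.

```python
import numpy as np

def integrate_normals(h0, n1, n2, cred, eta0=0.2, tol=0.005, max_iter=10000):
    """Iteratively refine the height map h0 (indexed [y, x]) so that it fits
    the normal components n1 (x direction) and n2 (y direction), weighted by
    the credibility map cred in [0, 1]."""
    h = h0.astype(float).copy()
    height, width = h.shape
    for i in range(max_iter):
        # Height suggested by the four neighbours and the normal field, Eq. (2).
        h_tilde = np.zeros_like(h)
        for (u, v) in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            h_nb = np.roll(h, shift=(-v, -u), axis=(0, 1))      # h(x+u, y+v)
            n1_nb = np.roll(n1, shift=(0, -u), axis=(0, 1))      # n1(x+u, y)
            n2_nb = np.roll(n2, shift=(-v, 0), axis=(0, 1))      # n2(x, y+v)
            h_tilde += h_nb + u * n1_nb + v * n2_nb
        h_tilde *= 0.25
        # Credibility-dependent step size that decays with the iteration, Eq. (3).
        eta = eta0 * cred / (1.0 + 2.0 * i / (width + height))
        h_new = (1.0 - eta) * h + eta * h_tilde
        if np.mean(np.abs(h_tilde - h)) < tol:
            return h_new
        h = h_new
    return h
```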
Fig. 5. The result of our reconstruction algorithm (top) for large-scale photographic input data (bottom), compared to the initial ULCN2005 height map (middle). Plausible surface detail has been added where shading information was available; the missing areas of the photograph were recognized as invalid and have therefore remained unmodified. Note how many surface features become recognizable in the reconstruction that were not present in the initial height map.
Fig. 6. Input close-up photograph (a) and resulting height field (b) of the Apollo 15 landing site near Hadley rille
4 Results
The goal of our algorithm is to produce a plausible high-resolution height map of the entire lunar surface in a fully automated way. Because of the lack of high-resolution ground truth data and the aim of perceptual plausibility rather than accuracy, we give two examples of typical results of our reconstruction algorithm that can be evaluated visually. Figure 5 shows our reconstruction of an approximately 5000 km by 1000 km patch from the far side of the Moon close to the equator. In regions where the image data is usable, the perceived resolution of the height map is increased, while regions for which no suitable data is available remain unaltered. Note how, e.g., small craters are added to formerly flat regions. In the second example we show that our reconstruction algorithm also works on much smaller scales. We reconstruct an approximately 64 km by 64 km region around the Apollo 15 landing site. For this region, images acquired during extravehicular activity are available that permit perceptual validation of the reconstructed heights. At this scale, no reasonable height data is available, so the height map was initialized as a plane and updated with the single photographic image shown in Figure 6(a). The reconstructed height map is displayed in Figure 6(b). The human observer easily recognizes the reconstructed surface features of the photographic image. This is even more apparent in the comparison of the rendered height map with an actual photograph of the site shown in Fig. 7. However, due to the small amount of input data to our algorithm, some limitations remain. Of course, there are surface features which cannot be determined metrically correctly based on one image acquired under fixed lighting conditions; e.g., the small rille in the left part of the image cannot be reconstructed where it runs parallel to the incident light direction.
Fig. 7. Apollo 15 surface panoramic photograph (top) and perspective rendering of the reconstructed height map from a similar viewpoint (bottom) allow for visual validation of the presence of important surface features
Also, the saturated regions and shadows close to high mountains cause overshooting effects in some places, but still the result looks plausible to the human observer and reproduces the important geographical features well enough to make the region easily recognizable.
5 Conclusion and Discussion
We have presented a shape-from-shading reconstruction method for lunar surface geometry that is based on known low-resolution height data and single high-resolution photographic images. While large-scale coherence of the height data is inherited from the low-resolution data, surface detail is plausibly added based on shading information. The algorithm is robust with respect to the many flaws present in high-resolution lunar surface imagery. It has successfully been used to reconstruct a detailed height map of the entire lunar surface based on ULCN2005 height data and imagery from the Lunar Orbiter mission. In spite of the quality deficits of the Lunar Orbiter images, the algorithm strongly increases the perceived resolution and richness of detail of the height map. The reconstruction algorithm is able to detect typical error sources and assigns a lower credibility value to the corresponding regions so that the known height data is left unchanged where no better information is available.
References
1. Kirk, R., Archinal, B.A., Gaddis, L.R., Rosiek, M.R.: Cartography for lunar exploration: 2008 status and mission plans. European Planetary Science Congress 3 (2008)
2. United States Geological Survey, http://webgis.wr.usgs.gov/pigwad/down/Lunar_Orbiter_mosaic.htm
3. Archinal, B.A., Rosiek, M.R., Kirk, R.L., Redding, B.L.: Completion of the Unified Lunar Control Network 2005 and topographic model. In: 37th Annual Lunar and Planetary Science Conference, vol. 37, pp. 2310–2311 (2006)
4. Araki, H., Tazawa, S., Noda, H., Ishihara, Y., Goossens, S., Sasaki, S., Kawano, N., Kamiya, I., Otake, H., Oberst, J., Shum, C.: Lunar global shape and polar topography derived from Kaguya-LALT laser altimetry. Science 323(5916), 897–900 (2009)
5. Rindfleisch, T.: Photometric method for lunar topography. Photogrammetric Engineering 32(2), 262–277 (1966)
6. Horn, B.K.P.: Shape from Shading: a Method for Obtaining the Shape of a Smooth Opaque Object from one View. PhD thesis, Department of Electrical Engineering, MIT (1970)
7. Horn, B.K.P.: Height and gradient from shading. Int. J. of Computer Vision 5(1), 37–75 (1990)
8. Zhang, R., Tsai, P.-S., Cryer, J.E., Shah, M.: Shape-from-shading: a survey. IEEE T-PAMI 21(8), 690–706 (1999)
9. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. of Computer Vision 47(1), 7–42 (2002)
10. Heipke, C., Piechullek, C., Ebner, H.: Simulation studies and practical tests using multi-image shape from shading. J. of Photogrammetry and Remote Sensing 56(2), 139–148 (2001)
11. Wöhler, C., Hafezi, K.: A general framework for three-dimensional surface reconstruction by self-consistent fusion of shading and shadow features. Pattern Recognition 38(7), 965–983 (2005)
12. Lena, R., Wöhler, C., Bregante, M.T., Fattinnanzi, C.: A combined morphometric and spectrophotometric study of the complex lunar volcanic region in the south of Petavius. J. of the RASC 100(1), 14 (2006)
13. Wöhler, C., Lena, R., Lazzarotti, P., Phillips, J., Wirths, M., Pujic, Z.: A combined spectrophotometric and morphometric study of the lunar mare dome fields near Cauchy, Arago, Hortensius, and Milichius. Icarus 183(2), 237–264 (2006)
14. Glencross, M., Ward, G.J., Jay, C., Liu, J., Melendez, F., Hubbold, R.: A perceptually validated model for surface depth hallucination. ACM Transactions on Graphics 27, 1–8 (2008)
15. Worthington, P.L., Hancock, E.R.: New constraints on data-closeness and needle map consistency for shape-from-shading. IEEE T-PAMI 21(12), 1250–1267 (1999)
16. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE T-PAMI 10(4), 439–451 (1988)
17. Smith, G.D.J., Bors, A.G.: Height estimation from vector fields of surface normals. In: 14th Int. Conf. on Digital Signal Processing, vol. 2, pp. 1031–1034 (2002)
18. Wildey, R.L.: The Moon's photometric function. Nature 200(4911), 1056–1058 (1963)
Use of Coloured Tracers in Gas Flow Experiments for a Lagrangian Flow Analysis with Increased Tracer Density

Christian Bendicks (1), Dominique Tarlet (2), Bernd Michaelis (1), Dominique Thévenin (2), and Bernd Wunderlich (2)

(1) Institut für Elektronik, Signalverarbeitung und Kommunikationstechnik (IESK), Otto-von-Guericke-Universität Magdeburg
(2) Institut für Strömungstechnik und Thermodynamik (ISUT), Otto-von-Guericke-Universität Magdeburg
Abstract. In this article a 3-d particle tracking velocimetry (PTV) system is presented which enables the investigation of relatively fast gaseous (air) flows and tiny turbulent structures in a small-scale wind tunnel. To satisfy the demand for a high spatial and temporal resolution, a sufficiently high tracer particle concentration has to be applied to the gas. Solving the correspondence problem among the different cameras then becomes extremely difficult due to ambiguities: each tracer has to be found in all pictures of the different views during many successive time steps. Here, the correspondence problem is eased by the use of coloured particles and the application of suitable classifiers for assigning the particles to colour classes.
1 Introduction
3-d particle tracking velocimetry is an established technique in the field of fluid mechanics to obtain three-dimensional velocity fields and even long Lagrangian trajectories, so that a wide variety of flow processes can be measured. The method is based on seeding the flow of interest with small, neutrally buoyant, light-scattering particles (also called tracers), which are captured by two or more synchronized cameras. A flow chart for a typical 3-d PTV algorithm is shown in Fig. 1. Assuming the cameras are properly calibrated, the algorithm consists of five major modules:
1. Preprocessing: correction of inhomogeneous illumination and noise reduction
2. Segmentation: deciding for every pixel whether it belongs to a particle or to the background
3. Locating particle centres: computing the particle centres from the results of the segmentation
4. Determination of 3-d coordinates: determining spatial particle locations for each time step from corresponding locations in the camera images and the calibration data
5. Linking trajectories: tracking the 3-d positions in time.
Fig. 1. One possible design of a 3-d PTV algorithm
Fig. 2. Decreasing the relative particle density with the help of colour classes
Current applications of 3-d PTV range from the conventional use in liquid flows to applications in gas flows. For gases, PTV is still a major challenge. On the one hand, as a consequence of the high temporal and spatial resolution requested by measurements in gas flows, it is necessary to apply a sufficiently high particle concentration. On the other hand, the trajectories have to be long enough for a Lagrangian flow analysis: integral time and length scales can only be determined if long correlation lengths have been recorded. This can only be realized if the probability of ambiguities is reduced when searching for corresponding tracers in space and time. To reduce the number of ambiguities one could use a low seeding density, which in turn reduces the spatial resolution of the system and the basis for a statistical analysis. So in most cases a tradeoff has to be found between the applied particle density and the desired length of the trajectories. One possibility to ease the correspondence problem is the use of coloured tracers. In this way a dense particle cloud can be separated into particle clouds of lower density when the particles are classified by their colour. One obtains decoupled systems for each colour class, and the spatial and temporal correspondence analysis becomes easier. The idea is illustrated in Fig. 2. In the following, the practical application of coloured particles is demonstrated in a flow experiment. The next section gives a description of the experimental setup and the tracer particles used. Then, the advantage of using colour classes in combination with the epipolar constraint to reduce the number of ambiguities is explained in detail. After that, the classification of coloured particles is described. The final section presents the results with a concluding discussion.
2 Experimental Setup
The experimental setup is shown in Fig. 3. For the present PTV measurements a suitable flow involving vortical structures has been generated within the focal depth of the cameras. It relies on a small Eiffel wind tunnel (non-recirculating, about 1 m long) seeded on the suction side with coloured tracers. EMS particles (EMS – Expanded Micro Spheres) with a diameter of 20 µm are suited as tracers in gas flows [11]. Three polyethylene winglets with a length of 17 mm are placed within the measurement region of size 25 mm (X) × 30 mm (Y) × 8 mm (Z). Based on simulations in FLUENT [7] (Fig. 3, right), these winglets are intended to create three simultaneous characteristic flow patterns involving recirculation (top), large streamline curvatures (middle), and sudden accelerations (bottom) [3]. PTV, and even more so PTV based on colour recognition, requires excellent illumination. For this purpose four tungsten light heads are employed. They are equipped with daylight filters to obtain a colour temperature between 5000 K and 5300 K, which corresponds to white light for the human eye. This colour temperature is essential for proper colour recognition. To capture the image sequences, three high-speed CMOS Bayer cameras (1280 × 1024 pixels, synchronized at 500 Hz) are aimed at the measurement section at a distance of 20 cm. The angle between the lines of sight of each camera pair is about 22°. Distortion-free 75 mm lenses are used; the photo scale is 1:2. To reconstruct spatial tracer positions and also to resolve ambiguities when searching for homologous points in the camera images with the help of epipolar geometry (see Section 3), the cameras have to be calibrated. Calibration means determining camera-specific parameters that define the location and orientation of the camera reference frame with respect to a known world frame, as well as parameters that characterize optical, geometric, and digital properties of the camera [1]. The mathematical formulation of the camera model is expressed by the collinearity equations, which describe the transformation of 3-d world coordinates to 2-d image coordinates [2]. To compute the unknown camera parameters, a set of well-known 3-d coordinates is needed that can be mapped to their corresponding positions in the camera images.
Fig. 3. Left: Cameras and lights with focus on the observation window, Right: Arrangement of polyethylene winglets to create characteristic flow patterns. Here, the velocity vector field is simulated by the FLUENT flow modeling software [7].
For this purpose a two-level calibration target with 25 ground control points is placed in the centre of the observation volume and is captured by each camera. For each ground control point two collinearity equations are set up (one for each image coordinate). This leads to an over-determined system of equations, which is solved by the least-squares method.
3 Reducing the Number of Ambiguities with Coloured Particles
A significant problem in conventional PTV is the occurrence of ambiguities during the spatial correspondence analysis because of the high number of particles. The correlation of particles at a given time is mainly based on geometrical conditions such as the epipolar geometry. The intersection between the image plane and the plane formed by the object point and the perspective centres of the cameras forms a line; a corresponding tracer can only be found along this line. This decreases the search area from 2-d (the whole image) to 1-d (a line in the image). Applying more than two cameras, the search space becomes further limited. Nevertheless, the ambiguities cannot be completely avoided. The same problem arises when the correspondence analysis is performed in time to deduce information concerning the Lagrangian description of the flow field. Trajectories should be as long as possible without any interruption [5]. A restriction of the search space in successive time steps can be derived from the restricted variations of velocity and acceleration and by considering the local correlation of the velocity vectors. Nevertheless, this does not fully suppress all ambiguities, leading to a reduced resolution of the flow features. In the following, the benefit of introducing colour classes is quantified for the spatial correspondence problem. The formula to calculate the total number of expected ambiguities N_a in a three-camera arrangement is given by [13]:

N_a = \frac{4 (n^2 - n)\, \varepsilon^2}{F \sin\alpha} \cdot \left( 1 + \frac{b_{12}}{b_{23}} + \frac{b_{12}}{b_{13}} \right)   (1)

with
n     total number of tracers in the image (tracer density or seeding density)
F     image size
α     intersecting angle between the epipolar lines in the third image
ε     tolerance of the epipolar band
b_xy  distance between camera x and camera y
The number of ambiguities becomes minimal when the cameras are arranged in an equilateral triangle so that b_12 = b_13 = b_23 and α = 60°. Under this condition, as in the experimental setup, the term in brackets is replaced by the factor 3. Note that the number of ambiguities correlates with the square of the number of tracers.
Fig. 4. Theoretical decrease of the number of ambiguities when using up to six colour classes
An introduction of colour classes for the tracers acts like a reduction of the tracer particle density. This reduces the number of ambiguities when the set of particles is separated by colour into individual subsets. Assuming the coloured particles are uniformly distributed in the camera images, the number of tracers n_c of a particular colour in each subset is about n/c, where c is the number of used colours. Since the correspondence problem is solved for each particular subset, the number of ambiguities N_ac decreases with the square of the number of colour classes (replacing n by n_c = n/c in Eq. (1)):

N_{ac} = \frac{4 (n_c^2 - n_c)\, \varepsilon^2}{F \sin\alpha} \cdot 3   (2)
Results of a numerical example are shown in Fig. 4: F = 1280 × 1024 pixels, α = 60° and ε = 1 pixel with three different values for n (1000, 2000 and 3000). Fig. 5 illustrates the whole process of solving the spatial correspondence problem for a single particle. The particle of interest is marked by a surrounding red square in Camera 1 – it was classified as red (see Section 4). Now, one corresponding partner should be found in Camera 2 and Camera 3. There are eight possible candidates near the epipolar line (red, from Camera 1) constructed in Camera 2, i.e. their distance to the line is smaller than the tolerance value ε. Because the observation volume is limited in depth (8 mm), only a small segment of the epipolar line is considered. This leads to a further restriction of the search space, shown as a red box. Only four candidates are located in this red box, and only two of them are classified as red. In Camera 3, an epipolar box is constructed from the chosen particle in Camera 1, and two additional boxes are constructed from the red candidates in Camera 2. Looking at the two intersection areas of the epipolar boxes, one can see that only one red particle can serve as the correspondence partner. If there is more than one candidate, the one with the smallest distance to one of the epipolar line intersections is chosen. The approach works well only if the particles can be clearly assigned to their real colour class.
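The numbers plotted in Fig. 4 can be reproduced directly from Eqs. (1) and (2); the short sketch below merely re-evaluates the example parameters given above (equilateral camera arrangement, so the bracket term equals 3).

```python
import math

def expected_ambiguities(n, eps=1.0, F=1280 * 1024, alpha_deg=60.0, colours=1):
    """Expected number of ambiguities for an equilateral three-camera set-up,
    with the tracers split into `colours` equally populated classes."""
    nc = n / colours
    return 4.0 * (nc ** 2 - nc) * eps ** 2 / (F * math.sin(math.radians(alpha_deg))) * 3.0

for n in (1000, 2000, 3000):
    print(n, [round(expected_ambiguities(n, colours=c), 1) for c in range(1, 7)])
```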
Fig. 5. Reducing the number of ambiguities
4 Colour Classification
The cameras used are single-chip cameras with a so-called Bayer filter array in front of the sensor. It consists of 2 × 2 pixel structure elements, each with one red, two green and one blue filter element. Bayer cameras produce mosaic-like grey-scale images. To convert them into colour images, there are several methods [8] to interpolate the colour value of a particular pixel from its grey-scale value and the grey-scale values of the pixels in the neighbourhood (for instance bilinear interpolation). This process of conversion is also known as demosaicking. Applying popular demosaicking methods to particle images for colour reconstruction of particle structures yields unsatisfactory results when the particle size is only a few pixels: the colour of a tracer with a size of 20 µm cannot be determined accurately, as shown in Fig. 6. Nevertheless, it is not necessary to know the true colour of a tracer, but only to identify an associated colour class; our purpose is only the differentiation of these classes by the classifier used at the end. As a consequence, the idea is not to use the RGB values any more, but to employ the grey values from the Bayer pattern directly, to avoid the loss of data caused by any kind of Bayer demosaicking. Nonetheless, demosaicking is still used to create RGB images, which in turn are converted into grey-scale images. After a segmentation [12], the particle centres are determined, for instance by weighted averaging or Gaussian fitting [5]. The centre coordinates are used to build the feature vector, based on the Bayer raw data, for the classifiers presented below. The first feature is the kind of Bayer pixel at which the centre is located, and its grey value is chosen as the second feature. The values of the eight neighbouring pixels are used to fill the rest of the feature vector (Fig. 7).
Fig. 6. Appearance of different coloured tracer particles (20 µm) in RGB images: (a) raw data, (b) blue particle, (c) green particle, (d) red particle
Fig. 7. Detecting particle centres and feature extraction. (a) shows the arrangement of a Bayer filter array. Bayer raw data (b) is converted to grey-scale image (c) to segment particle structures (red contour) and determine particle centres (blue mark). Then the features are extracted from Bayer raw data directly (d).
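For illustration, the feature construction described above (ten features: the Bayer pixel type at the particle centre, the centre's raw grey value, and the eight neighbouring raw values) can be sketched as follows; the encoding of the pixel type, the assumed RGGB layout and the border handling are our own choices rather than the authors' implementation.

```python
import numpy as np

# Bayer pattern type per (row mod 2, col mod 2), assuming an RGGB layout:
# 0 = red, 1/2 = green, 3 = blue.
BAYER_TYPE = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}

def extract_features(bayer_raw, centre):
    """Build the 10-dimensional feature vector for a particle centre
    (row, col) directly from the Bayer raw image, without demosaicking.
    The centre is assumed to lie at least one pixel away from the border."""
    r, c = int(round(centre[0])), int(round(centre[1]))
    patch = bayer_raw[r - 1:r + 2, c - 1:c + 2].astype(float)
    pixel_type = BAYER_TYPE[(r % 2, c % 2)]
    centre_val = patch[1, 1]
    neighbours = np.delete(patch.ravel(), 4)     # the eight surrounding values
    return np.concatenate(([pixel_type, centre_val], neighbours))
```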
For classification, the following classifiers have been investigated:
k-Nearest Neighbours: The k-NN classifier [16] generally achieves good classification results when the training data is representative and consistent. This technique is one of the simplest machine learning algorithms and requires only an accumulation of labeled template samples for training, which are then used during the decision. The distance between a test sample and the training samples can be computed in several ways. In this work, the Euclidean distance metric is applied and a simple majority vote is used with the parameter selection k = 3, which has been determined through cross-validation.
Multi-Layer Perceptron: The classification technique of multi-layer artificial neural networks is also applied in this work, whereby a net topology is favored that can be learned under supervision, as the matching of learning and target data is known. Thus, a feed-forward topology of a fully connected back-propagation network with a sigmoid transfer function is used and has proved to produce superior results. In particular, two hidden layers with eight hidden neurons each are used [9]; the input layer has ten neurons and the output layer three neurons. The Fast Artificial Neural Network Toolbox [14] has been used for the implementation.
Support Vector Machines: Generally, the SVM learner is based on an underlying two-class or binary classification in which the hyperplane margin between the classes is maximized [10]. The pairwise coupling extension is used to adapt the SVM to the multi-class problem [17]. In this work, the Radial Basis Function (RBF) Gaussian kernel is used, which has performed robustly with the given number of features and provided optimum results as compared to other kernels.
Table 1. Classification based on several classifiers. Both training set and test set consist of 30,000 samples.

(a) Support Vector Machine
Class   P(C1)   P(C2)   P(C3)
C1      89.52   10.37    0.11
C2       7.96   91.82    0.22
C3       0.10    4.15   95.75

(b) Multi Layer Perceptron
Class   P(C1)   P(C2)   P(C3)
C1      87.37   12.09    0.54
C2       5.49   92.79    1.72
C3       0.31    1.48   98.21

(c) k-Nearest Neighbours
Class   P(C1)   P(C2)   P(C3)
C1      83.12   16.59    0.29
C2       7.95   91.68    0.37
C3       0.18    0.38   99.44
For the optimization, the kernel width σ = 3 and the penalty parameter C = 5 are used. For more details the reader may refer to [10]. The libSVM implementation has been used for the software realization [4]. The classification accuracy for our test data can be analysed by the confusion matrices listed in Tab. 1, which contain information about the actual classes C_i and their prediction P(C_i), based on the particular classifier. For the classes C_1, C_2, and C_3 (e.g. the colour classes blue, green, and red) the recognition rates are high despite the bad imaging conditions, and the results are largely independent of the classifier. This confirms the feasibility of the concept of reducing the tracer density for the correspondence search. The improvement of the temporal correspondence can analogously be deduced from the improvement of the spatial correspondence.
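For illustration only, a comparable three-way comparison can be set up with off-the-shelf implementations. The paper uses libSVM and FANN; the sketch below substitutes scikit-learn equivalents under the stated parameters (k = 3, two hidden layers of eight neurons, RBF kernel with C = 5); the gamma conversion from the kernel width σ is our assumption, all other settings are library defaults.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def train_colour_classifiers(X_train, y_train):
    """Train the three colour classifiers discussed above on 10-dimensional
    feature vectors X_train with colour-class labels y_train."""
    classifiers = {
        "k-NN": KNeighborsClassifier(n_neighbors=3),                # k = 3, Euclidean metric
        "MLP": MLPClassifier(hidden_layer_sizes=(8, 8),              # two hidden layers of 8 neurons
                             activation="logistic", max_iter=2000),
        "SVM": SVC(kernel="rbf", C=5.0,
                   gamma=1.0 / (2.0 * 3.0 ** 2)),                    # assumes gamma = 1/(2 sigma^2), sigma = 3
    }
    for clf in classifiers.values():
        clf.fit(X_train, y_train)
    return classifiers
```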
5 Results and Discussion
The presently employed PTV method uses the algorithm of Crocker and Grier [6] to link the locations of the N particles present at a given time step. The principle of this algorithm relies on the squared displacement δ_i² between the point with ID i and its corresponding candidate at the next time step. The algorithm minimizes the sum of these squared displacements: \sum_{i=1}^{N} \delta_i^2 \to \min. Fig. 8 demonstrates the advantage of using colour classes in the process of linking trajectories. When colour is not considered, there are too many uncertainties when linking particles from one point in time to the next; here, the algorithm delivers only a few trajectories. Considering colour eases the temporal correspondence problem, because only particles belonging to the same colour class will be linked to a trajectory. Hence, the algorithm is able to create much longer trajectories. The obtained experimental results demonstrate that the complete 3-D PTV procedure works very well in the considered flow, involving organized structures, and is able to reveal small-scale recirculating flow at a millimeter scale. It also reveals simultaneously long, uninterrupted and curved trajectories over the top winglet, and accelerating trajectories between the two bottom winglets. This constitutes an ideal complement to other measuring techniques such as Particle Image Velocimetry. Furthermore, the viscous boundary layers, with a smooth gradient of velocity from close to zero at the walls to the free-stream values, indicate an excellent precision of the obtained velocity measurements.
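A minimal sketch of the frame-to-frame linking criterion (global minimisation of the summed squared displacements) is given below; the assignment is solved with the Hungarian algorithm from SciPy, which is only a stand-in for the original Crocker and Grier implementation. When colour classes are used, the function is simply called once per colour class.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_frames(pts_t0, pts_t1):
    """Link particle positions (arrays of shape (N, 3)) between two consecutive
    time steps by minimising the sum of squared displacements."""
    # Cost matrix of squared displacements delta_i^2 between all pairs.
    diff = pts_t0[:, None, :] - pts_t1[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))            # index pairs (i at t0, j at t1)
```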
Fig. 8. 3-d trajectories; the velocity in m/s is coded by colour (see colour bar). Result of trajectory linking without (left) and with (right) considering particle colour, where the SVM classifier was used for classification.
The next steps will focus on combining the presented approach with alternative methods to determine trajectories [15].
References
1. Albrecht, P., Michaelis, B.: Improvement of the Spatial Resolution of an Optical 3-D Measurement Procedure. IEEE Transactions on Instrumentation and Measurement 47(1), 158–162 (1998)
2. Albertz, J., Wiggenhagen, M.: Guide for Photogrammetry and Remote Sensing, 5th edn. Herbert-Wichmann Verlag (2009)
3. Bordas, R., Bendicks, C., Kuhn, R., Wunderlich, B., Thevenin, D., Michaelis, B.: Coloured tracer particles employed for 3d-PTV in gas flows. In: ISFV13 - 13th International Symposium on Flow Visualization, and FLUVISU12 - 12th French Congress on Visualization in Fluid Mechanics, Paper #93, Nice, July 1-4 (2008)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2009), http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Ouellette, N.T., Xu, H., Bodenschatz, E.: A quantitative study of three-dimensional Lagrangian particle tracking algorithms. Experiments in Fluids 40(2), 301–313 (2006)
6. Crocker, J.C., Grier, D.G.: Methods of digital video microscopy for colloidal studies. J. Coll. Interface Sci. 179, 298–310 (1996)
7. ANSYS: FLUENT Flow Modeling Software, http://www.fluent.com
8. Ramanath, R., Snyder, W.E., Bilbro, G.L., Sander, W.A.: Demosaicking methods for Bayer color arrays. Journal of Electronic Imaging 11(2), 306–315 (2002)
9. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood Cliffs (1998)
10. Herbrich, R.: Learning Kernel Classifiers: Theory and Algorithms, ISBN 0-262-08306-X (2003)
11. Kuhn, R.W., Bordas, R., Wunderlich, B., Michaelis, B., Thevenin, D.: Colour class identification of tracers using artificial neural networks. In: 10th International Conference on Engineering Applications of Neural Networks, Thessaloniki, Greece (2007); 13/2/1-13/2/8
12. Maas, H.-G.: Digitale Photogrammetrie in der dreidimensionalen Strömungsmesstechnik. Dissertation ETH Zürich Nr. 9665 (1992)
13. Maas, H.-G.: Complexity analysis for the determination of image correspondences in dense spatial target fields. In: International Archives of Photogrammetry and Remote Sensing, vol. XXIX, pp. 102–107 (1992)
14. Nissen, S., Nemerson, E.: Fast Artificial Neural Network, FANN (2009), http://leenissen.dk/fann/
15. Ruhnau, P., Guetter, C., Schnörr, C.: A Variational Approach for Particle Tracking Velocimetry. Measurement Science and Technology 16, 1449–1458 (2005)
16. Shakhnarovich, G., Darrell, T., Indyk, P.: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, ISBN 978-0-262-19547-8 (2006)
17. Wu, T.F., Lin, C.J.: Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
Reading from Scratch – A Vision-System for Reading Data on Micro-structured Surfaces

Ralf Dragon, Christian Becker, Bodo Rosenhahn, and Jörn Ostermann

Institut für Informationsverarbeitung, Leibniz Universität Hannover, Appelstraße 9a, 30167 Hannover, Germany
{dragon,becker,rosenhahn,ostermann}@tnt.uni-hannover.de
Abstract. Labeling and marking industrially manufactured objects is becoming increasingly important nowadays because of novel material properties and plagiarism. As part of the Collaborative Research Center 653, which investigates micro-structured metallic surfaces for inherent mechanical data storage, we investigate a stable and reliable optical readout of the written data. Since this requires a qualitative surface reconstruction, we use directed illumination to make the micro-structures visible. We then apply a spectral analysis to obtain an image partitioning and perform signal tracking supported by a customized Hidden Markov Model. In this paper, we derive the algorithms used and demonstrate reading data at a density of 1.6 kbit/cm² from a micro-structured groove which varies by only 3 µm in depth (thus a "scratch"). We demonstrate the system's robustness with experiments on real and artificially rendered surfaces.
1 Introduction
In this paper an optical shape reconstruction method for the readout of data mechanically written as a micro-structure on a surface is presented. The purpose is to store information about a mechanical component on the component itself. Since component and information then form a unit, the information cannot get lost, nor is it stored unnecessarily once the component has been replaced. Such micro-structured surfaces are created by a Piezo tool during the last step of a turning process [1].
Fig. 1. Mechanical component with micro-structured surface
This work was funded as part of the Sonderforschungsbereich 653 by the DFG.
Fig. 2. Left: Groove wound around the micro-structured mechanical component. Right: Measuring principle of directed illumination: Varying depth d is detected by shifts wr and ws of the reflexion and the shadow border respectively.
The tool cuts a groove, wound around the component, whose depth is modulated by the digital payload. As the cutting is performed with a small mechanical feed, the groove forms a micro-structure spreading over the whole surface of the mechanical component (Fig. 1). The mechanical storage principle used is the same as for the phonograph cylinder, the gramophone record or the capacitive electronic disc [2]. Unlike in optical data storage systems such as bar codes, DVDs or 3D storage [3], the domain of the applied signal is analog. To enable digital data storage, the run of the groove as well as the writing and reading are modeled as the channel of a communication system [4], consisting of coding, modulation and channel. Writing into a groove thus means freezing an analog signal run as an imprint which is to be recovered during the shape reconstruction. Since the micro-structures are designed for simple and fast writing during production and for insensitivity to mechanical stress, the data density is low compared to state-of-the-art magnetic and optical storage media, but sufficient for all labeling purposes. As a rough estimate of the dimensions, a cutting parameter set suitable for writing and reading is: mechanical feed d_f = 70 µm, groove depth = 6 µm, signal amplitude = 3 µm, data rate R = 1.1 bit/mm. When using a binary amplitude modulation (2-ASK), the data density is R × 1/d_f ≈ 1.6 kbit/cm².
1.1 Previous Works
Reading the information requires at least a qualitative surface analysis to reconstruct the signal written into the groove. The reconstruction should be possible in a cheap way so that a mechanical component could be inspected at a service station. Thus, only shape analysis methods using a customary stereo microscope with low magnification are considered here, which can later be adapted to hand-held units. In the past 30 years, many shape reconstruction methods have been developed. In the field of microscopy with a small field of view, shape reconstruction is eased as only a depth map needs to be created; thus, in the following we use the term depth instead of shape. Two main measuring principles are used: depth of field and perspective. It can further be distinguished between single-view and multi-view methods. Depth of field as a measuring principle utilizes the fact that the image varies in sharpness depending on the distance between camera and surface. Thus, the small
depth of field of microscopes means a high depth resolution. Depth from focus [5] is a multi-view method which builds a depth map by varying the object's distance and assigning all visible regions in focus to the same depth slice. As the distance variation is calibrated, points from different depth slices can be combined into a depth map. Depth from defocus [6] is a single-view method which makes use of knowledge about the point spread function of the optical system to deduce depth from unsharpness. Perspective as a measuring principle means either using perspective differences between different surface views or analyzing illumination effects. Both types use knowledge about the set-up geometry. Depth from stereo [7] builds a depth map of the surface using a stereo microscope observing the same scene from different perspectives. Using the disparity found pixel-wise along the epipolar line and the knowledge about the relation of both cameras, the depth map can be reconstructed. Depth from shading [8] makes use of the reflectivity function, which relates knowledge about the set-up to the direction of the surface normal. Depth from directed illumination is similar to that: the surface is illuminated in such a way that only regions with a specific surface normal appear bright. The shadow and reflexion borders are used to deduce depth information.
1.2 Surface Reconstruction at Scratch Scale
During our research, we investigated three promising methods for surface reconstruction: depth from stereo (DFS), depth from focus (DFF) and depth from directed illumination (DFI). To the best of our knowledge, similar applications of depth reconstruction are usually one order of magnitude larger in size. E.g., the optical reconstruction of LP records [9] using DFF seems similar; however, the depth of the groove there is about 20 times and the groove distance about 24 times bigger than here. Another example is the optical inspection of solder joints [10] using depth from shading, where the solder joint height is about 25 times greater than the structures which are to be recovered here. A third example: the surface structures analyzed in [11] using DFF are 10 times deeper than our groove. It turned out that DFS as well as DFF were able to reconstruct the coarse cylindrical shape of the mechanical component. However, both methods failed to reconstruct the groove, which is about 10 times smaller. For DFS, one major problem is the calibration of the focal length with a microscopic calibration pattern, which is very important for exact depth estimation; it is error-prone as nearly no perspective effects are noticeable. DFF is heavily perturbed by mechanical inaccuracies when changing the focus depth. Another problem for both approaches is the highly specular surface with many perturbations. The measuring principle used for DFI here is displayed in Fig. 2. A depth variation of the groove is deduced from a shift of the reflexion and the shadow border. As this effect is noticeable even in low-magnification views, e.g. the right image of Fig. 1, where DFS and DFF failed, we decided for this method. Under the assumption that the groove depth is proportional to the shift of the shadow border, the groove depth can be reconstructed qualitatively. The two key problems which have to be solved for this are the image partitioning, which is necessary to determine the position of the groove, and the tracking of the reflexion border for a robust depth reconstruction.
In this paper, we present the following contributions using DFI: In Section 2, the method to determine the position of the groove is explained. In Section 3, the method to qualitatively reconstruct the analog signal using a Hidden Markov Model with a regular topology is presented. Both methods are then evaluated in Section 4, and the readout of a real surface is demonstrated. In Section 5, a conclusion is given.
2 Image Partitioning Using Spectral Analysis
We start with the first key problem of the groove reconstruction: given one view of the surface with proper illumination as in Fig. 3, we want to extract the position of the groove containing the data. Given that the groove is wound around the mechanical component, several groove sections are visible, each containing a different part of the signal run. As the mechanical feed during the turning process is constant, all neighboring groove sections have distance λ. We assume they run approximately horizontally in the microscope view, forming periodic coarse structures in the vertical direction. The vertical position

y_n = n\lambda - \phi   (1)

of the n-th groove center can be determined by analyzing the image texture in the vertical direction to extract the groove distance λ and the phase φ. The 1D texture analysis is based on several vertical cuts through the image. In order to analyze just the coarse structure and thus remove the impact of the fine structure, which contains the signal, we average these cuts to form the one-dimensional signal f(y). The idea of the texture analysis is to model f(y) as a cosine function and to find the sinusoidal parameters using a maximum likelihood (ML) estimation of the power spectrum density (PSD). The PSD p(u) of f(y) is estimated using the average periodogram method [12, p. 72ff.]. This means averaging PSD estimates of r different realizations using a windowed discrete Fourier transform (DFT). Here, parts of f(y) of length s_w around varying positions ξ are analyzed. Each PSD estimate |g_ξ(u)| is computed from

g_\xi(u) = \sum_{k=0}^{s_w - 1} w(k)\, f\!\left(k + \xi - \frac{s_w - 1}{2}\right) W_{s_w}^{-uk}, \quad \text{with } W_{s_w} = e^{j \frac{2\pi}{s_w}}.   (2)

We choose a Blackman-Harris window [13] as windowing function w(k) of odd size s_w. Since PSD estimates are shift-invariant, p(u) and the ML estimate are:

p(u) = \frac{1}{r} \sum_{i=0}^{r-1} |g_{\xi_i}(u)|,   (3)

u_{ml} = \arg\max_u p(u),   (4)

\lambda = \frac{s_w}{u_{ml}}.   (5)
As the following phase computation requires a very exact estimation of λ, the ML estimation is followed by a maximum search with bounds uml ± 1. These
are the preceding and following DFT sampling positions, which have lower magnitudes. Now the wavelength of our sinusoidal parameter estimation is known, and the phase φ is to be determined. It is obviously not shift-invariant and thus may not be averaged as in (3); so the PSD estimation does not include a phase estimation. When using sinusoidal estimation to determine the phase, usually only a single realization is used, which in our case is influenced too much by noise. In the following step, we therefore combine the idea of averaging multiple realizations with the ML sinusoidal parameter estimation. As g_ξ(u) is found by taking different parts of f(y), the phase estimate

\phi_\xi = \frac{s_w}{2\pi} \angle g_\xi(u_{ml})   (6)

of different realizations can be combined and averaged later when compensating the shift of ξ using

\phi'_\xi = \left( \phi_\xi - \xi + \frac{s_w - 1}{2} \right) \bmod \lambda,   (7)

\phi = \frac{1}{r} \sum_{i=0}^{r-1} \phi'_{\xi_i}.   (8)

The interval [0 … λ] used in (7) can be detrimental for the averaging in the case of phase jumps. A more stable estimate is found when setting the interval borders symmetrically around the median of φ', setting the lower border to α = median φ' − λ/2 and recomputing using

\phi'_\xi = \left( \phi_\xi - \xi + \frac{s_w - 1}{2} + \alpha \right) \bmod \lambda \; - \alpha,   (9)

\phi = \frac{1}{r} \sum_{i=0}^{r-1} \phi'_{\xi_i}.   (10)

As distance and phase are known, the groove positions can be computed using (1). In reality, the groove centers do not run exactly horizontally, as the mechanical component wobbles while being turned (visible in Fig. 7). Thus, one estimate is not enough for a whole surface image. We start from an initial estimate (λ_0, φ_0) at position x_0 and then iteratively connect estimate k+1 with estimate k, using the interval borders φ_k ± λ_k/2 for a correct phase unwrapping of φ_{k+1}. In Fig. 3, an image partitioning using one estimate at the left and one at the right image border is displayed.
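A condensed sketch of this wavelength and phase estimation follows. The window (np.blackman instead of Blackman-Harris), the refinement around u_ml, and the phase-sign conventions are simplified, the re-centring step only realizes the stated intent of an interval of width λ around the median, and all names are ours.

```python
import numpy as np

def estimate_groove_distance_and_phase(image, sw=255, r=32):
    """Estimate the groove spacing lambda and the phase phi from the vertical
    image texture (averaged periodogram plus averaged phase estimates)."""
    f = image.mean(axis=1)                     # 1-d vertical profile f(y)
    window = np.blackman(sw)                   # stand-in for a Blackman-Harris window
    half = (sw - 1) // 2
    xis = np.linspace(half, len(f) - 1 - half, r).astype(int)
    spectra = [np.fft.fft(window * f[xi - half: xi - half + sw]) for xi in xis]
    p = np.mean([np.abs(g) for g in spectra], axis=0)       # averaged PSD, Eq. (3)
    u_ml = int(np.argmax(p[1:sw // 2])) + 1                  # ML frequency, Eq. (4)
    lam = sw / u_ml                                          # groove distance, Eq. (5)
    # Per-realization phases, shift-compensated into [0, lambda), Eqs. (6)-(8).
    phis = np.array([(sw / (2 * np.pi)) * np.angle(g[u_ml]) for g in spectra])
    phis = (phis - xis + (sw - 1) / 2.0) % lam
    # Re-centre the interval around the median to avoid wrap-around, Eqs. (9)-(10).
    alpha = np.median(phis) - lam / 2.0
    phis = (phis - alpha) % lam + alpha
    return lam, float(np.mean(phis))
```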
3 Signal Tracking Using a Hidden Markov Model
After the image partitioning, image parts of size s_x × s_y (where s_y = λ) containing exactly one groove section are created. The direction of the illumination has been chosen such that the lower wall of the groove appears bright (see Fig. 3). The task in solving the second key problem is to track the reflexion border from the left to the right side of the image part. We will now derive the Hidden Markov Model (HMM) used for the signal tracking. A heuristic approximation for the observation probability of the groove's edge in image column x_i being at position y_i is the negative discrete derivative of the image I in the vertical direction, shifted by a constant c to be non-negative:
Fig. 3. Partitioned image with groove middle (dotted green) and tracked signal (blue)
p_o(x_i, y_i \mid I) \propto -\frac{I(x_i, y_i + 1) - I(x_i, y_i - 1)}{2} + c.   (11)

The transition probability of the actor's movement is modeled as a normal distribution with maximum probability if the groove is continued constantly:

p_t(x_{i+1}, y_{i+1} \mid y_i) \propto \mathcal{N}(\sigma^2, y_i).   (12)

Both p_o and p_t have to be normalized over one column, as there is always exactly one state per column passed when traversing from the left to the right image side:

\sum_{j=1}^{s_y} p_o(x_i, y_i = j \mid I) = 1, \quad \forall i = 1 \ldots s_x,   (13)

\sum_{j=1}^{s_y} p_t(x_{i+1}, y_{i+1} = j \mid y_i) = 1, \quad \forall i = 1 \ldots s_x - 1,\; y_i = 1 \ldots s_y.   (14)
Given p_o and p_t, the HMM is established. The state S_{x,y} stands for the groove's edge being at position (x, y), so there are s_x × s_y states. The model's topology is adapted to the task: all states S_{x_i, y_i} in the same image column x_i are parallel; transitions are only possible to S_{x_i+1, y_{i+1}} (Fig. 4).
Fig. 4. The model’s topology. Each state Sx,y corresponds to the image pixel I(x, y)
Under these premises, the most probable groove edge Y = (y_1, y_2, \ldots, y_{s_x}) is found by maximizing the overall transition probability

P = p_o(x_{s_x}, y_{s_x}) \prod_{i=1}^{s_x - 1} p_t(x_{i+1}, y_{i+1} \mid y_i)\, p_o(x_i, y_i).   (15)
Fig. 5. Left: Original surface view with deepenings on the right half. Right: Povray rendering with deepenings in the top-left quarter.
Because of the simple and regular topology, Y is easy to compute, as the Viterbi algorithm [14] always has exactly s_y paths to follow at the same time. The only parameter that has to be known a priori is the state transition standard deviation σ, which depends on the cutting parameters and the surface quality. As its influence on the readout result is relatively small, it was empirically determined as 0.15 px.
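The column-wise Viterbi recursion over this topology can be sketched as follows. We work in the log domain with the gradient-based observation of Eq. (11) and the Gaussian transition of Eq. (12); per-column normalisation constants are dropped because they do not change the argmax, the relative scaling of observation and transition scores is a free choice here, and boundary rows are handled by wrap-around for brevity.

```python
import numpy as np

def track_groove_edge(part, sigma=0.15):
    """Track the groove edge through an image part (rows = y, columns = x)
    by Viterbi decoding over the column-wise HMM described above."""
    sy, sx = part.shape
    ys = np.arange(sy)
    # Log observation score per pixel: negative vertical derivative, Eq. (11).
    obs = -0.5 * (np.roll(part, -1, axis=0).astype(float) - np.roll(part, 1, axis=0))
    obs = np.log(obs - obs.min(axis=0, keepdims=True) + 1e-6)
    # Log transition score between neighbouring columns, Eq. (12).
    trans = -0.5 * ((ys[:, None] - ys[None, :]) / sigma) ** 2   # trans[y_next, y_prev]
    score = obs[:, 0].copy()
    back = np.zeros((sy, sx), dtype=int)
    for x in range(1, sx):
        total = trans + score[None, :]          # best predecessor for every y_next
        back[:, x] = np.argmax(total, axis=1)
        score = total[ys, back[:, x]] + obs[:, x]
    # Backtrace the most probable edge position per column.
    path = np.empty(sx, dtype=int)
    path[-1] = int(np.argmax(score))
    for x in range(sx - 1, 0, -1):
        path[x - 1] = back[path[x], x]
    return path
```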
4 Results

4.1 Perturbation Sensitivity of the Signal Tracking
In this section the robustness of the image partitioning and the signal tracking is estimated. Perturbations, which are due to non-ideal cutting and aging of the groove, are simulated on synthetic data. To create a ground truth surface, a depth map is established by simulating the cutting effect. It is used as a surface bump map during ray tracing with Povray [15] (right image of Fig. 5). The signal run found on the undistorted images is taken as the ground truth signal. To simulate aging, ellipses with random orientation are drawn in white and black over the resulting image. White ellipses model reflecting particles like dust, whereas black ones model light-absorbing particles like rust. The ellipse length is 150 µm at an aspect ratio of 20 and 1/3 transparency. Realistic cutting artifacts are simulated as depth map bumps of 10 µm diameter with normally distributed heights. Their standard deviation is called the roughness σ_surf. Such bumps usually originate from material inhomogeneities and non-ideal cutting. The influence of both perturbations on the SNR is shown in Fig. 6. To give an impression of the strongest perturbation degrees, the processed images are also displayed. It can be deduced that particles as well as roughness influence the accuracy of the signal reconstruction, but the image partitioning does not fail even in images where no coarse structure is recognizable. The influence of particles on the surface is nearly linear for an exponentially rising particle count. Surface roughness has a stronger impact on the SNR as it disturbs the signal tracking more. However, in both cases highly distorted surface images could be used to extract the signal with an SNR satisfactory for the further channel coding processing.
4.2 Writing and Reading Data
With the proposed methods from Sections 2 and 3, it is possible to track groove sections in one surface view. In the following, the steps necessary to reconstruct the whole groove run from several surface views are explained.
Fig. 6. Top: Signal to noise ratio for particle perturbations (left) and surface roughness (right). Bottom: Corresponding images for both perturbations with highest distortion tested. The views could be read out with SNR = 7.7 dB and SNR = 4.5 dB.
Fig. 7. The fusion of different groove segments at x = 0. The large sinusoidal run of the groove center originates from the mechanical component wobbling during the readout.
The set-up to assemble the groove is as follows: the cylindrical mechanical component is mounted onto a turn table such that its axis of symmetry coincides with the rotation axis. By this, the component can be turned without changing its distance to the microscope. We assume that a rotation by a small angle θ results in a horizontal offset Δ in the microscope view. The ratio r = θ/Δ is calibrated by estimating Δ using a minimization based on the normalized cross-correlation. Analogously, vertical movement can be calibrated. Now image slices from different surface views can be stitched together into the surface view of one rotation. After a whole surface view is created, the readout consisting of the partitioning and the signal tracking is performed. Next, the extracted groove sections are fused at the joint position where the right border of the whole surface view joins the left. At this position, groove section n meets section n + 1 (Fig. 7), such that the whole groove signal can be assembled. As the last groove processing step, the signal is unbiased by subtracting the groove center from the signal. It is then demodulated, reusing the sinusoidal parameter estimation derived in Section 2: the digital signal is sampled at positions x_n = nλ_s − φ_s with the estimates from Equations (5) and (10). To fulfill the Shannon sampling theorem [16], the signal is first low-pass filtered with wavelength λ_s/2.
Fig. 8. Left: The extracted signal (blue) with sample positions (red), threshold (cyan) and the ASCII string “DAGM” read out. Right: Set-up for the readout.
the signal is low-pass filtered with wavelength λ_s/2. A threshold is extracted automatically from a predefined initialization sequence. To demonstrate writing and reading, we chose the ASCII string "DAGM" and modulated it using 2-ASK with every second bit set to 0 to ease phase estimation during the demodulation. The signal was written to a component of 49 mm diameter. Surface views were automatically stitched together into the overall surface image of size 48500 px × 768 px. The run of the groove center was determined using 100 estimates around the surface. All groove sections were tracked and combined into the reconstructed groove signal, which is shown in Fig. 8. The processing time for such a diameter is approximately 5 min for surface recording and 10 min for partitioning and tracking of 3 groove sections along one rotation. It can be seen that all bits are demodulated correctly although the peak amplitude is only ≈ 3 px, which is hardly recognizable in the surface views. The strongest perturbation is a nearly-periodic signal component with wavelength 2/3 λ_s, which is due to resonance during the cutting process. Investigations on various surfaces with different data showed that the readout results have high repeatability and, in combination with a channel code, are very robust.
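To make the demodulation step concrete, the following sketch samples the unbiased, low-pass filtered groove signal at the positions x_n = nλ_s − φ_s, thresholds the samples and packs the bits into ASCII characters. It is only an illustration of the readout chain: the function name, the linear interpolation at the sample positions, and the direct 8-bit packing are our assumptions; the paper's pilot bits (every second bit zero) and the channel decoding are omitted.

```python
import numpy as np

def demodulate_2ask(signal, lambda_s, phi_s, threshold):
    # Sample positions x_n = n * lambda_s - phi_s inside the signal support.
    n = np.arange(1, int((len(signal) - 1 + phi_s) / lambda_s) + 1)
    x = n * lambda_s - phi_s
    x = x[(x >= 0) & (x <= len(signal) - 1)]
    samples = np.interp(x, np.arange(len(signal)), signal)   # linear interpolation
    bits = (samples > threshold).astype(int)
    # Pack groups of 8 bits into ASCII characters (most significant bit first).
    text = "".join(chr(int("".join(map(str, bits[i:i + 8])), 2))
                   for i in range(0, len(bits) - len(bits) % 8, 8))
    return bits, text
```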
5 Summary and Conclusion
An optical readout procedure for data mechanically stored as a groove varying by only 3 μm in depth has been presented. The groove, which is wound around a cylindrical mechanical part, is reconstructed using directed illumination. The two key problems which were solved are the image partitioning of the periodically occurring groove sections and the tracking of the groove border. The image partitioning into single groove sections is based on the vertical spectral analysis of the image texture. For precise and robust PSD estimates, the average periodogram method was combined with an ML estimation. Analogously, a phase estimation algorithm was derived which combines the estimates of multiple realizations. For this purpose, an adapted phase unwrapping is used. As a solution to the second key problem, a Hidden Markov Model for the tracking of the groove border was derived which consists of one node per pixel. Its observation probabilities model the probability of the groove border passing through the specific node, whereas its transition probabilities model the properties of the Piezo tool which has cut the groove. We demonstrated the robustness of the image partitioning and the signal
tracking by perturbing synthetic ground truth data rendered by Povray with realistic distortions. We further demonstrated reading out data from real surfaces. Our fully automatic system runs reliably as an on-line demo and is frequently presented at different events in our lab. We are currently researching the transfer of this readout method to arbitrarily curved surfaces created during a milling process. It is planned to modulate the depth of the milling head with the analog signal, which corresponds to the variation of the Piezo tool excitation used here.
References
1. Denkena, B., Ostermann, J., Becker, C., Spille, C.: Mechanical Information Storage by Use of an Excited Turning Tool. Annals of the German Academic Society for Production Engineering 1(1), 25–30 (2007)
2. Isailović, J.: 2.4: Capacitive Videodiscs. In: Videodisc and Optical Memory Systems. Prentice-Hall, Englewood Cliffs (1985)
3. Takita, A., Yamamoto, H., Hayasaki, Y., Nishida, N.: Three-Dimensional Optical Storage Inside Transparent Materials. Optics Letters 21(24), 2023–2030 (1996)
4. Schwartz, M.: 1: Introduction to Information Transmission. In: Information Transmission, Modulation, and Noise. McGraw-Hill, New York (1970)
5. Grossmann, P.: Depth from focus. Pattern Recognition Letters 5(1), 63–69 (1987)
6. Namboodiri, V.P., Chaudhuri, S.: On defocus, diffusion and depth estimation. Pattern Recognition Letters 28(3), 311–319 (2007)
7. Falkenhagen, L.: Depth Estimation from Stereoscopic Image Pairs Assuming Piecewise Continuous Surfaces. In: Proc. European Workshop on Combined Real and Synthetic Image Processing for Broadcast and Video Productions, Hamburg, Germany (November 1994)
8. Horn, B.K.P., Brooks, M.J. (eds.): Shape From Shading. MIT Press, Cambridge (1989)
9. Tian, B., Barron, J.L.: Reproduction of sound signal from gramophone records using 3D scene reconstruction. In: Irish Machine Vision and Image Processing Conference (2006)
10. Sanderson, A.C., Weiss, L.E., Nayar, S.K.: Structured Highlight Inspection of Specular Surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(1), 44–55 (1988)
11. Schaper, D.: Automated Quality Control for Micro-Technology Components Using a Depth From Focus Approach. In: Fifth IEEE Southwest Symposium on Image Analysis and Interpretation, Santa Fe, USA, April 2002, pp. 50–54 (2002)
12. Kay, S.M.: Modern Spectral Estimation. Prentice-Hall, Englewood Cliffs (1988)
13. Harris, F.J.: On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform. Proceedings of the IEEE 66(1), 51–83 (1978)
14. Viterbi, A.J.: Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967)
15. Povray: The persistence of vision raytracer, http://www.povray.org/
16. Shannon, C.E.: Communication in the Presence of Noise. Proceedings of the IRE 37(1), 10–21 (1949)
Diffusion MRI Tractography of Crossing Fibers by Cone-Beam ODF Regularization
H.H. Ehricke1, K.M. Otto1, V. Kumar2, and U. Klose2
1 Institute for Applied Computer Science, University of Applied Sciences, Stralsund, Germany
2 Section of Experimental NMR of the CNS, University Hospitals, Tübingen, Germany
Abstract. Since the advent of high angular resolution diffusion imaging (HARDI) techniques in diffusion MRI, great efforts have been made in order to reconstruct complex white-matter structures, such as crossing, branching and kissing fibers. However, even highly sophisticated fiber tracking schemes, such as probabilistic tracking, suffer from the data's poor signal-to-noise ratio (SNR). In this paper we present a novel regularization approach for q-ball fields, exploiting structural information within the data. We also propose a straightforward deterministic tracking algorithm, allowing delineation of even non-dominant pathways through crossing regions. Results from a phantom study with a biological phantom as well as a patient study, in which we reconstruct the pyramidal tract, emphasize the method's efficiency.
1 Introduction
In Magnetic Resonance Diffusion Imaging the diffusion tensor has been widely used as a model for the diffusion behavior of water molecules in a voxel. Streamline fiber tracking approaches on the basis of diffusion tensor imaging (DTI) have proven to yield good results for the reconstruction of dominant fiber structures within the brain white matter. However, for the delineation of more complex structures, such as kissing, crossing or branching fibers, clinically applicable techniques for image data acquisition and tracking are needed. Especially for the tracking of non-dominant fiber populations, such as many sections of the optic, acoustic and pyramidal tracts, novel acquisition and processing schemes have to be elaborated which are capable of considering multiple fiber orientations in a voxel. Various imaging methods have been proposed in order to acquire a voxel's diffusion profile with far more than 60 gradient directions. Q-ball imaging (QBI) and diffusion spectrum imaging (DSI) are two candidates in this category. Unfortunately, these high angular resolution diffusion imaging (HARDI) techniques have in common that they are highly susceptible to artifacts, e.g. resulting from eddy currents or motion, and especially to noise. In QBI a model-independent reconstruction of the HARDI signal, leading to a diffusion orientation distribution function (ODF), is performed [1].
An ODF for a voxel at a discrete position (x_i, y_i, z_i) in 3D space may be imagined as a projection of diffusion probabilities p_ij onto the surface of a unit sphere, constructed around that voxel. Each diffusion probability p_ij corresponds to a discrete reconstruction point r_j on the unit sphere surface which can be defined by inclination/declination angles (φ_j, υ_j). For each of the reconstructed surface points r_j, its ODF value p_ij describes the diffusion probability in the direction r_j, given by the unit vector from the sphere's center to the surface point. Thus we can describe an ODF as a discrete functional

Ψ_i(r_j) = p_ij    (1)
For symmetry reasons it is sufficient to distribute the datapoints r_j on a half-sphere: φ ∈ [0..2π], υ ∈ [0..π]. If we interpret the probability value p_ij as a distance parameter, describing the datapoint's distance from the voxel center, we can transform the sphere model into a more complexly shaped model, geometrically revealing the angular diffusion profile. The sharpness of this geometric representation is enhanced by normalization to the minimum-to-maximum interval, thus producing a min-max normalized ODF. However, the ODF's signal-to-noise ratio (SNR) is typically quite low, due to the restrictions of the imaging method. Therefore, an ODF is usually locally smoothed by applying a spherical convolution matrix. The kernel width defines the extent of a datapoint's neighborhood on the sphere in which samples contribute to smoothing. Especially in low-anisotropy regions of crossing and branching fibers this approach must be applied with great care, since it may suppress important details, e.g. signals stemming from non-dominant fibers. During the last five years a variety of authors have proposed novel HARDI-based approaches, pursuing the idea of tracing pathways through areas of low diffusion anisotropy and thus allowing tracking of non-dominant pathways through crossing regions. Most of these techniques focus on tracking algorithms. Both multiple-orientation deterministic methods and highly sophisticated, computationally expensive probabilistic techniques [2][3][4][5] have been designed. Most of them could improve the results for the delineation of specific brain regions with more complex white matter populations. Nevertheless, we are far from a general solution which would be clinically applicable.
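As a small illustration of the min-max normalization mentioned above, the following sketch rescales each ODF of a q-ball field to the interval [0, 1] along its direction axis. The array layout and function name are our assumptions, not part of the original method description.

```python
import numpy as np

def min_max_normalize(odf_field, eps=1e-12):
    # odf_field: array of shape (X, Y, Z, n_directions) holding the values p_ij per voxel.
    p_min = odf_field.min(axis=-1, keepdims=True)
    p_max = odf_field.max(axis=-1, keepdims=True)
    return (odf_field - p_min) / np.maximum(p_max - p_min, eps)

# Example with a random field of 181 reconstruction points per voxel
field = np.random.default_rng(0).random((8, 8, 8, 181))
sharp = min_max_normalize(field)
```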
2 Regularization of Diffusion MRI Datasets
In diffusion MRI the last decade has seen various approaches, most of them focused on DTI regularization. They fall into three categories, namely:
1. regularization of diffusion weighted images prior to tensor reconstruction [6][7],
2. incorporation of a regularization scheme into the tensor estimation process [8],
3. regularization of tensor fields [9][10].
Most of these DTI-based approaches may not easily be transferred to HARDI-based imaging. Only a few novel techniques have been proposed for ODF regularization. Savadjiev et al. [11] model fibers by chains of helical segments. They use the notion of co-helicity in order to compute confidence values of ODF directions of neighboring voxels. Three neighboring directions are co-helical if the direction vectors may be regarded as tangents of a helix segment, thus yielding a high confidence value for that direction. Summing up a direction's confidence values over the local neighborhood provides an estimate of local support. An iterative relaxation technique is used to maximize average local support. The drawbacks are the computational expense and the method's high complexity. For the latter reason, parameter handling and the interpretation of unwanted effects are not easy. Jonasson et al. [12] use ODF regularization prior to the segmentation of major white-matter tracts. They transfer the ODF dataset to a five-dimensional non-Euclidean position-orientation space (POS). Within this space they use an anisotropic diffusion filter with an increasing number of iterations in order to produce multiple scales of resolution for the data. However, the spatial resolution and ODF sharpness are reduced with increasing degree of regularization, and therefore it remains unclear whether the approach might successfully be used in the framework of white-matter tractography. Descoteaux et al. [13] use spherical harmonics for ODF reconstruction. They incorporate a Laplace-Beltrami operator into the reconstruction scheme in order to sharpen the ODF. The operator reduces the influence of higher-order terms due to noise. The approach does not exploit the data from the voxel's local neighborhood, which certainly will limit its regularization effect when applied to real data. However, the method might be combined with regularization schemes, prior to or after ODF reconstruction, such as our cone-beam regularization technique.
3 Cone-Beam ODF Regularization (CB-REG)
In signal processing noise-smoothing is a commonly used preprocessing step, necessary as a prerequisite for most signal analysis procedures. For QBI-based fiber tractography approaches smoothing of the ODF data is highly recommended, especially if more complex white matter structures are to be investigated. As explained above, ODF smoothing within a voxel [1][13] may lead to better tracking results, but by limiting the data source to the values, available within a voxel, the effect of the smoothing procedure is limited and the angular resolution is reduced. One way out of this dilemma is the incorporation of modeling assumptions into the smoothing procedure. E.g. Descoteaux et al. propose to transform the relatively unsharp and noisy diffusion-ODF into a sharper fiber-ODF by deconvolution with a diffusion-ODF kernel of a linear fiber model [14]. Another way, commonly used in image processing, is to operate on a local neighborhood, incorporating information from neighboring voxels into the process. With these local smoothing operators care has to be taken to avoid suppression of structural information, such as blurring of object edges. For this reason, more sophisticated smoothing approaches use structural information to steer the smoothing process. E.g. with anisotropic
diffusion filtering [15] the intensity gradient magnitude is used to scale the filter function. Thus, smoothing over different objects is inhibited. In the case of ODF de-noising our goal is to design a filter scheme which allows smoothing along fiber trajectories only. When we reduce noise, the ODF's shape becomes sharper and more aligned with the underlying fiber architecture. Thus, the ODF field is regularized. In order to avoid smoothing over anisotropic regions with different diffusion orientations, we have to use the structural information given by the ODF function. In our CB-REG approach we apply smoothing to each datapoint p_ij on the ODF sphere at position (x_i, y_i, z_i) in 3D space. The datapoint's local neighborhood is defined by its direction vector r_j and two parameters α and l, describing the opening angle and length of the cone-shaped neighborhood (Fig. 1). The cone is centered around the direction vector r_j (cone A). Since the ODF is a symmetric function, a second cone B is constructed in the opposite direction −r_j. Within each cone rays are sent along all direction vectors r_k encompassed by the cone shape: ∠(r_k, r_j) ≤ α/2. Each ray is then sampled by tri-linear interpolation of ODFs at each unit step along the ray until the bottom of the cone has been reached. From each of the ODFs sampled along a certain ray, only the datapoint whose direction coincides with the ray direction is relevant. All relevant datapoints p_mk which have been sampled inside the cone are used to compute a smoothed ODF value Ψ̃_i(r_j) by a weighted sum. We apply a two-dimensional Gaussian kernel G(d, δ) to scale each datapoint's weight according to its Euclidean distance d_mi = sqrt((x_m − x_i)² + (y_m − y_i)² + (z_m − z_i)²) from the cone center and the angle δ_kj = ∠(r_k, r_j) between its direction r_k and the direction r_j around which the cone is centered. Thus the filtered ODF value Ψ̃_i(r_j) is derived from the sampled ODF values p_mk = Ψ_m(r_k) by:

Ψ̃_i(r_j) = (1/w) · Σ_{m∈M, k∈K} G(d_mi, δ_kj) · Ψ_m(r_k)    (2)

where
M: set of sampled positions within cone A and cone B,
K: set of sampled ray directions within cone A and cone B,
w: sum of all weights of the sampled datapoints within cone A and cone B:

w = Σ_{m∈M, k∈K} G(d_mi, δ_kj)    (3)

and

G(d, δ) = exp(−d²/σ_d² − δ²/σ_δ²).    (4)
The parameters σ_d and σ_δ define the sharpness of the two-dimensional Gaussian kernel. For simplification, we define a weight ω for the farthest distance d = l and the widest angle δ = α/2 and compute σ_d and σ_δ such that the utmost sample reaches a weight of ω. In this manner we smooth only over a neighborhood of equal or similar diffusion directions, thus exploiting the structural information given by the ODF itself. In the next chapter we explain how we use the regularized q-ball field for tractography.
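The weighting of Eqs. (2)-(4) can be written down directly once the cone samples are available. The sketch below is only illustrative: the derivation of σ_d and σ_δ from ω assumes that each factor of the Gaussian alone reaches the weight ω at the cone boundary, which is one possible reading of the parameterization above; the sampling of the cone itself (ray casting and tri-linear interpolation) is omitted.

```python
import numpy as np

def kernel_sigmas(l, alpha, omega):
    # sigma_d, sigma_delta chosen so that exp(-l^2/sigma_d^2) = omega and
    # exp(-(alpha/2)^2/sigma_delta^2) = omega (assumption, see text).
    s = np.sqrt(-np.log(omega))
    return l / s, (alpha / 2.0) / s

def cb_reg_value(p, d, delta, sigma_d, sigma_delta):
    # p:     ODF values p_mk whose direction coincides with the ray direction
    # d:     Euclidean distances d_mi of the samples from the cone center
    # delta: angles delta_kj between ray direction and cone axis
    p, d, delta = map(np.asarray, (p, d, delta))
    g = np.exp(-(d / sigma_d) ** 2 - (delta / sigma_delta) ** 2)   # Eq. (4)
    return float(np.sum(g * p) / np.sum(g))                        # Eqs. (2) and (3)

sigma_d, sigma_delta = kernel_sigmas(l=2.0, alpha=np.radians(30.0), omega=0.5)
value = cb_reg_value([0.8, 0.6, 0.7], [0.5, 1.0, 1.8],
                     [0.05, 0.1, 0.2], sigma_d, sigma_delta)
```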
Fig. 1. Sketch of neighborhood sampling strategy used in cone-beam regularization, illustrating cone construction and sampling for a single ODF datapoint. Around the datapoint’s direction and opposite direction (bold arrows) two cones are constructed. Within each cone ODFs are sampled (small circles) along direction rays by tri-linear interpolation within the ODF field (big circles). Only datapoints with directions along sampling rays are accepted (small arrows).
4 Deterministic ODF-Based Fiber Tracking
White matter tractography algorithms fall into two categories, deterministic and probabilistic. Deterministic approaches, such as streamline tracking [16] and tensor deflection (TEND) [17], integrate deterministic pathways using each sample’s primary direction vector, typically derived from the diffusion tensor. Especially in more complex white matter regions estimated fiber directions contain a great amount of uncertainty, caused by noise, various artifacts and partial voluming. In these regions more sophisticated approaches are needed, e.g. dynamic fiber tracking which places secondary seeds in order to analyze connectivity of branching and crossing fibers [18]. With techniques from probability theory various authors have tried to improve tracking results. Most probabilistic tracking algorithms are based on an iterative Monte Carlo sampling scheme which is used to generate multiple trajectories from a seed point [2][19][3]. Perrin et al. are among the first to apply probabilistic fiber tracking to q-ball fields [20]. They propose a particle tracing approach where a particle entering a voxel with a certain speed and motion direction is deflected by a force, stemming
from the local q-ball. The orientation of the force is chosen randomly inside a cone, defined from the incident direction. The ODF datapoints within the cone are used to control the random process. The main drawback of their approach is that the reconstructed fibers diverge with increasing distance from the seed region. We have adapted their method to deterministic fiber tracking by substituting the random selection process. In our approach the trajectory direction we follow is derived from the incident direction and the sample's ODF. We define a cone, centered around the incident direction. From the datapoints on the sphere only those encompassed by the cone are taken into account. We select the tracking direction by determining the maximum ODF value within the cone. If the cone's maximum is less than 75 percent of the q-ball's overall maximum, we stop tracking, because the direction obviously does not represent anisotropic behavior. Otherwise, the direction vector of the maximum value is used for the next step of the integration process along the fiber pathway. We use this straightforward algorithm because of its efficiency and its ability to track fibers through crossing regions without being deflected by major pathways. Of course, the latter can only be achieved if the ODFs in the q-ball field are sharp enough and their SNR is sufficiently high. Therefore, our deterministic fiber tracking algorithm is a good choice for the evaluation of regularization approaches.
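A single step of this deterministic selection can be sketched as follows. This is an illustration under our own conventions: the discrete direction set, the handling of the antipodal symmetry and the cone half-angle are assumptions, while the 75% criterion is taken from the text.

```python
import numpy as np

def next_direction(odf, directions, incoming, cone_angle=np.radians(30.0),
                   stop_fraction=0.75):
    # directions: (n, 3) unit vectors of the ODF reconstruction points
    incoming = incoming / np.linalg.norm(incoming)
    cos_enclosed = np.abs(directions @ incoming)          # ODF is antipodally symmetric
    in_cone = cos_enclosed >= np.cos(cone_angle / 2.0)
    if not np.any(in_cone):
        return None
    idx = int(np.argmax(np.where(in_cone, odf, -np.inf)))
    if odf[idx] < stop_fraction * odf.max():
        return None                                        # not anisotropic enough: stop
    d = directions[idx]
    return d if d @ incoming >= 0 else -d                  # keep a consistent orientation
```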
5 Results
For the evaluation of our regularization approach, diffusion phantom data provided by McGill University, Montreal, was used. The phantom was constructed from excised rat spinal cord, embedded in agar in a configuration designed to have curved, straight and crossing tracts [21] (Fig. 2). The q-ball data was acquired on a 1.5 Tesla Sonata MR scanner (Siemens, Erlangen) with 90 diffusion weighting directions, 30 slices and an isotropic resolution of 2.8 mm. Fig. 2 shows the generalized fractional anisotropy (GFA) values from a slice through the phantom dataset (top left). The original ODFs are illustrated by the zoomed ODF shape display of the crossing region (top right). In the lower row, results from ODF regularization with different parameters are displayed. In the left picture a cone length l of 2 voxels was used, whereas the right picture was produced with l = 4 voxels. In both cases a cone opening angle α of 30° and an ω of 0.5 were used. The results illustrate that the regularization sharpens the ODFs and that with increasing cone length the effect becomes more obvious. The regularized ODFs within the fiber crossing area clearly show the expected bi-directional anisotropic behavior. We also applied the tracking algorithm described above to the regularized as well as the original phantom data. Fig. 3 shows the streamlines which were generated using two seed boxes. Tracking through the crossing region fails due to partial voluming (left picture). After regularization with α = 30°, l = 2 voxels and ω = 0.25 the results are much better. Note that smoothing with a relatively small voxel neighborhood is obviously sufficient to substantially enhance tracking results.
Fig. 2. Regularization impact on the shape of the ODFs in crossing area of the phantom
Fig. 3. Tracking result from phantom study with original ODFs (left) and after regularization (right)
Furthermore, we applied our method to data from a patient study, acquired on a 3 Tesla Trio scanner (Siemens, Erlangen) with an isotropic resolution of 2.0 mm, 126 gradient directions and 56 slices. Every 10 diffusion measurements were followed by a non-diffusion measurement, which was used to estimate the rotation matrices for a head motion correction procedure. In addition, an eddy current correction was performed.
Fig. 4. Tracking result from pyramidal tract with original ODFs (left) and after regularization (right)
We focused on the delineation of the pyramidal tract. Many studies have shown that, especially near the corpus callosum, tracking of pyramidal fibers is difficult because of crossing callosal projections. This finding was confirmed by our tracking experiments using non-regularized q-ball fields (Fig. 4). After regularization with our CB-REG approach, substantially more fibers could be tracked through the crossing region. Again we used a relatively small voxel neighborhood for ODF de-noising: α = 30°, l = 2, ω = 0.5.
6 Conclusions
We have presented a new method for the regularization of q-ball fields which does not depend on highly complex modeling assumptions. We use a cone-beam strategy with 3 parameters (cone opening angle α and length l, weight of the utmost sample ω) to sharpen the ODF's shape and reduce noise. Our experiments show that tracking fiber pathways through crossing regions benefits from the regularization of the q-ball field. Care has to be taken not to overdo the regularization effect, e.g., by defining an arbitrarily large neighborhood; artifacts might be induced, constructing wrong connections. From our studies we found that with a neighborhood of up to two voxels and an opening angle of 25 to 35 degrees, reliable results could be achieved. Currently we are extending our strategy by incorporating anisotropy data (GFA values) into the smoothing procedure. By this we plan to reduce the erroneous influence of neighboring isotropic voxels on ODFs representing regions at fascicle borders.
Acknowledgements
We would like to thank Jennifer Campbell of the McConnell Brain Imaging Center, Montreal Neurological Institute, McGill University, for providing the phantom data we used in our experiments.
References
1. Tuch, D.S.: Q-ball imaging. Magn. Reson. Med. 52(6), 1358–1372 (2004)
2. Parker, G.J.M., Haroon, H.A., Wheeler-Kingshott, C.A.M.: A framework for a streamline-based probabilistic index of connectivity (PICo) using a structural interpretation of MRI diffusion measurements. J. Magn. Reson. Imaging 18(2), 242–254 (2003)
3. Lazar, M., Alexander, A.L.: Bootstrap white matter tractography (BOOT-TRAC). Neuroimage 24(2), 524–532 (2005)
4. Behrens, T.E.J., Berg, H.J., Jbabdi, S., Rushworth, M.F.S., Woolrich, M.W.: Probabilistic diffusion tractography with multiple fibre orientations: What can we gain? Neuroimage 34(1), 144–155 (2007)
5. Zalesky, A.: DT-MRI fiber tracking: a shortest paths approach. IEEE Trans. Med. Imaging 27(10), 1458–1471 (2008)
6. Parker, G.J., Schnabel, J.A., Symms, M.R., Werring, D.J., Barker, G.J.: Nonlinear smoothing for reduction of systematic and random errors in diffusion tensor imaging. J. Magn. Reson. Imaging 11(6), 702–710 (2000)
7. Martin-Fernandez, M., Muñoz-Moreno, E., Cammoun, L., Thiran, J.P., Westin, C.F., Alberola-López, C.: Sequential anisotropic multichannel Wiener filtering with Rician bias correction applied to 3D regularization of DWI data. Med. Image Anal. 13(1), 19–35 (2009)
8. Wang, Z., Vemuri, B.C., Chen, Y., Mareci, T.H.: A constrained variational principle for direct estimation and smoothing of the diffusion tensor field from complex DWI. IEEE Trans. Med. Imaging 23(8), 930–939 (2004)
9. Poupon, C., Clark, C.A., Frouin, V., Régis, J., Bloch, I., Bihan, D.L., Mangin, J.: Regularization of diffusion-based direction maps for the tracking of brain white matter fascicles. Neuroimage 12(2), 184–195 (2000)
10. Coulon, O., Alexander, D.C., Arridge, S.: Diffusion tensor magnetic resonance image regularization. Med. Image Anal. 8(1), 47–67 (2004)
11. Savadjiev, P., Campbell, J.S.W., Pike, G.B., Siddiqi, K.: 3D curve inference for diffusion MRI regularization. In: Int. Conf. on Med. Image Comput. Assist. Interv., vol. 8(Pt 1), pp. 123–130 (2005)
12. Jonasson, L., Bresson, X., Thiran, J.P., Wedeen, V.J., Hagmann, P.: Representing diffusion MRI in 5-D simplifies regularization and segmentation of white matter tracts. IEEE Trans. Med. Imaging 26(11), 1547–1554 (2007)
13. Descoteaux, M., Angelino, E., Fitzgibbons, S., Deriche, R.: Regularized, fast, and robust analytical q-ball imaging. Magn. Reson. Med. 58(3), 497–510 (2007)
14. Descoteaux, M., Deriche, R., Knösche, T.R., Anwander, A.: Deterministic and probabilistic tractography based on complex fibre orientation distributions. IEEE Trans. Med. Imaging 28(2), 269–286 (2009)
15. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990)
16. Mori, S., Crain, B.J., Chacko, V.P., van Zijl, P.C.: Three-dimensional tracking of axonal projections in the brain by magnetic resonance imaging. Ann. Neurol. 45(2), 265–269 (1999)
17. Lazar, M., Weinstein, D.M., Tsuruda, J.S., Hasan, K.M., Arfanakis, K., Meyerand, M.E., Badie, B., Rowley, H.A., Haughton, V., Field, A., Alexander, A.L.: White matter tractography using diffusion tensor deflection. Hum. Brain Mapp. 18(4), 306–321 (2003)
18. Ehricke, H.H., Klose, U., Grodd, W.: Visualizing MR diffusion tensor fields by dynamic fiber tracking and uncertainty mapping. Computers & Graphics 30(2), 255–264 (2006)
19. Behrens, T.E.J., Woolrich, M.W., Jenkinson, M., Johansen-Berg, H., Nunes, R.G., Clare, S., Matthews, P.M., Brady, J.M., Smith, S.M.: Characterization and propagation of uncertainty in diffusion-weighted MR imaging. Magn. Reson. Med. 50(5), 1077–1088 (2003)
20. Perrin, M., Poupon, C., Cointepas, Y., Rieul, B., Golestani, N., Pallier, C., Rivière, D., Constantinesco, A., Bihan, D.L., Mangin, J.F.: Fiber tracking in q-ball fields using regularized particle trajectories. Inf. Process. Med. Imaging 19, 52–63 (2005)
21. Campbell, J.S.W., Siddiqi, K., Rymar, V.V., Sadikot, A.F., Pike, G.B.: Flow-based fiber tracking with diffusion tensor and q-ball data: validation and comparison to principal diffusion direction techniques. Neuroimage 27(4), 725–736 (2005)
Feature Extraction Algorithm for Banknote Textures Based on Incomplete Shift Invariant Wavelet Packet Transform
Stefan Glock1, Eugen Gillich1, Johannes Schaede2, and Volker Lohweg1
1 inIT - Institute Industrial IT, Ostwestfalen-Lippe University of Applied Sciences, Liebigstr. 87, D-32657 Lemgo, Germany
[email protected], www.init-owl.de
2 KBA-Giori S.A., 4, Rue de la Paix, CH-1003 Lausanne, Switzerland
Abstract. Segmentation and feature extraction algorithms based on the Wavelet Transform or the Wavelet Packet Transform are established in pattern recognition. Especially in the field of texture analysis they are known to be practical. In the past, one difficulty of texture analysis was the characterization of different printing processes. In this paper we present a new algorithmic concept for feature extraction from textures printed by different printing techniques, without the necessity of a previous teaching phase. The typical characteristics of differently printed textures are extracted by first-order statistical moments of wavelet coefficients. The algorithm uses the 2D incomplete shift invariant Wavelet Packet Transform, resulting in a fast execution time of O(N log2(N)). Since the incomplete shift invariant Wavelet Packet Transform was exclusively defined for 1D signals, it has been modified in this research. The application describes the detection of differently printed security textures.
1 Introduction
In recent years there have been several studies in texture analysis and classification using the Wavelet Transform (WT) and the Wavelet Packet Transform (WPT), respectively. The pyramid structured Wavelet Transform [1] and the shift invariant Wavelet Transform [2], [3] successively decompose the low-frequency scales. However, a large class of textures has its dominant frequencies in the middle frequency scales. To overcome this drawback, the Wavelet Packet Transform has been applied to extend the decomposition to these scales [4]. Security prints like banknotes are mainly produced by line offset, letterpress printing, foil printing and intaglio printing. Especially the latter technique plays a major role in banknote reliability [5]. The term "intaglio" is of Italian origin and means to engrave. The printing method of the same name uses a steel plate with engraved characters and structures. During the printing process the engraved structures are filled with ink and pressed under huge pressure (tens of tons per inch) directly on the paper [6]. A tactile relief and fine lines are formed, unique to the intaglio printing process and almost impossible to reproduce
via commercial printing methods [7]. Since the intaglio process is used to produce the currencies of the world, intaglio printing presses and the companies that own them are monitored by government agencies. In terms of signal processing, the fine structures of the intaglio technique can be considered as textures with certain ranges of spatial frequencies. Therefore, it should be possible to detect them with the WPT. For this purpose, a new feature extraction algorithm based on an incomplete WPT [8] is proposed. It decomposes only the best branch of the Wavelet Packet Tree according to a criterion which is based on first-order statistical moments of wavelet coefficients. As the Wavelet Packet Tree is pruned during its decomposition (see Fig. 4), the algorithm can be assigned to the top-down approaches [9]. This paper is organized as follows: First, the redundant shift invariant and the shift invariant WPT are briefly described in Section 2. A feature segmentation algorithm to decompose the Wavelet Packet Tree incompletely and its decomposition criterion are presented in Section 3. Section 4 and Section 5 provide experimental results and conclusions, respectively.
2 Shift Invariant Wavelet Packet Transform
The WPT is a generalization of the classical WT, which means that not only the approximation (low frequency parts), but also the details (high frequency parts) of a signal are decomposed [10]. This results in a tree-structured WPT, displayed in Fig. 1, and the above-mentioned richer resolution of the middle and high spatial frequency scales. Due to its tree characteristic, the frequency scales are called nodes or subimages. In each decomposition level all leaf nodes are decomposed into one approximation A_{i,j} and three detail nodes cV_{i,j}, cH_{i,j}, and cD_{i,j}. cV_{i,j} represents the vertical, cH_{i,j} the horizontal, and cD_{i,j} the diagonal details, where i is the decomposition level and j the node number. The majority of existing texture analysis methods based on the 2D-WPT make the explicit or implicit assumption that textured images are acquired from the same viewpoint [9]. In many practical applications it is all but impossible to ensure this. Therefore, shift invariant WPTs are highly desirable. In the traditional implementation of the 2D-WPT, signals are first convolved with wavelet filters and then downsampled. The length of the decomposed signal is
Fig. 1. Two dimensional tree-structured Wavelet Packet Transform with 3 tree levels
(1/4)^i times the original signal, where i is the decomposition level. The downsampling results in a shift-variant signal representation [1]. The approach described by Shensa [11] yields a shift invariant transform by omitting the downsampling in each level. The great burden of this method is the high computational effort because of the highly redundant signal representation. In consideration of these disadvantages, the one-dimensional shift invariant WPT (SIWPT) was proposed in [3]. It is based on the fact that an arbitrary signal translation of Δ samples is bounded by mod(Δ, 2), because of the downsampling in each decomposition level. Therefore, a shift invariant representation can be achieved by the decomposition of a non-shifted version, defined by Eqs. (1) and (2), and a one-pixel-shifted version, defined by Eqs. (3) and (4), of the approximation and detail node:

d_{i+1, 2j}[k] = Σ_n h[n] · d_{i,j}[n + 2k],    (1)
d_{i+1, 2j+1}[k] = Σ_n g[n] · d_{i,j}[n + 2k],    (2)
d_{i+1, 2j+2^{2i−1}}[k] = Σ_n h[n] · d_{i,j}[n + 2k + 1],    (3)
d_{i+1, 2j+1+2^{2i−1}}[k] = Σ_n g[n] · d_{i,j}[n + 2k + 1].    (4)
Both versions are downsampled and convolved with arbitrary wavelet filters h[n] and g[n], where h[n] is a low-pass and g[n] a high-pass wavelet filter [1], [12]. On the basis of an information content criterion, defined in Section 3, the version with the larger information content is decomposed further, whereas the other version is discarded. This discarding yields a nonredundant representation and a fast execution time. The implementation of the 1D-SIWPT as a filter bank is illustrated in Fig. 2. The above-mentioned method was exclusively defined for 1D signals. In this research the SIWPT has been modified for 2D signals such as images. The
Fig. 2. 1D-SIWPT implementation as a filter bank. In each tree level a non-shifted and a one-pixel-shifted version are decomposed and downsampled. Due to an information content criterion one version is decomposed further, whereas the other version is deleted.
2D-SIWPT first decomposes four differently shifted versions. Based on their information content, three versions are deleted, whereas the version with the highest information content is decomposed further. According to our experiments there is no difference in feature stability and quality between the two described shift invariant WPTs.
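For illustration, the following sketch performs one 1D-SIWPT analysis step in the spirit of Eqs. (1)-(4): both the non-shifted and the one-sample-shifted signal are filtered and downsampled, and the pair with the larger information content is kept. The variance-based score is only a placeholder for the criterion of Section 3, and the db2 filter pair as well as the convolution phase conventions are our assumptions; the 2D case proceeds analogously with the four shifts (0,0), (0,1), (1,0) and (1,1).

```python
import numpy as np

# db2 decomposition filters (4-tap Daubechies); g is the quadrature mirror of h
h = np.array([0.48296291, 0.83651630, 0.22414387, -0.12940952])
g = np.array([-0.12940952, -0.22414387, 0.83651630, -0.48296291])

def siwpt_level(d, score=np.var):
    """One shift invariant analysis step: returns (approximation, detail)."""
    def analyse(x):
        a = np.convolve(x, h, mode="full")[::2]    # low-pass filtering + downsampling
        det = np.convolve(x, g, mode="full")[::2]  # high-pass filtering + downsampling
        return a, det
    non_shifted = analyse(d)
    shifted = analyse(d[1:])                       # one-sample translation
    # keep the version with the larger information content (placeholder score)
    return max(non_shifted, shifted, key=lambda pair: score(pair[1]))

a1, d1 = siwpt_level(np.random.default_rng(0).standard_normal(256))
```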
3 Feature Extraction Algorithm for Shift Invariant Wavelet Packet Transform
The WPT enables a complete characterization of textures in all frequency scales. However, with increasing decomposition level the number of subimages grows exponentially. This increases the execution time considerably.
3.1 Information Content and Stopping Criterion
For texture analysis it is usually unnecessary to compute a complete Wavelet Packet Tree. Instead it is more important to focus on nodes which provide the best spatial frequency resolution and the largest information content, respectively. Therefore, the WPT is decomposed according to an information content criterion, resulting in an incomplete WPT. Most known methods like [4], [8], [9], [13], [14] and [16] use the entropy or the average energy of an image for this purpose. [17] applies the WT with first-order statistics to the classification of different kinds of banknotes. From a global point of view, instances of the same texture which are printed by different printing techniques are barely distinguishable. Entropy or energy based methods are designed to separate different textures and cannot discriminate them with satisfactory results. Differently printed instances of one and the same texture differ in their gray-scale transitions and discontinuities, respectively. In particular, the discontinuities of intaglio printed instances are more pronounced compared to those of commercial prints. This difference can be determined by the variance and excess (excess kurtosis) of wavelet coefficients [15]. Figure 3 shows the normalized histograms of wavelet coefficients of an intaglio and a commercially printed instance of the same texture after a one-level 2D-SIWPT. The highly discontinuous structure of intaglio printing yields a weighting towards middle and high wavelet coefficients, whereas the histogram of commercial printing is narrowly distributed and weighted on small coefficients. According to Fig. 3 and heuristic investigations on characteristics of printing techniques [5], differently printed instances of the same texture can be discriminated by decomposing the tree according to variance and excess until the subimage contrast is maximized. In consideration of production tolerances and the digitalization process, textures can be influenced by additive noise. Taking into account that noise is represented by small wavelet coefficients [18], the histograms of noisy textures are widely distributed. The aforementioned heuristically evaluated properties lead to a three-stage stopping criterion:
Fig. 3. Histogram of wavelet coefficients of an intaglio (left) and a commercial printed instance (right) of the same texture. It is obvious that both printing techniques are distinguishable by the shape of their histograms.
1. If the variance drops during decomposition, the subimages will be lower in contrast. Thus, the decomposition should be stopped.
2. If the variance grows at least to the same degree as the excess drops, the small wavelet coefficients of the previous level will become larger. Thus the subimages are less noisy and should be decomposed further. This criterion can be formulated as

   (σ_i² − σ_{i−1}²) / σ_{i−1}² ≥ (C_{i−1} − C_i) / C_{i−1}.    (5)

3. If both the variance and the excess grow during decomposition, the contrast of the subimages will be enhanced. Therefore, the tree should be decomposed further.

Furthermore, if the size of a subimage is smaller than the empirically determined value of 16x16 coefficients, variance and excess may vary widely from sample to sample. As a result, features could become unstable [4]. Hence, this size should be used as an overall stopping criterion. A compact sketch of this criterion is given below.
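The following sketch is our own numerical reading of the three-stage criterion: "excess" is interpreted here as the excess kurtosis of the detail coefficients, and Eq. (5) is applied as reconstructed above; function and variable names are assumptions.

```python
import numpy as np
from scipy.stats import kurtosis

def continue_decomposition(detail_prev, detail_curr):
    # detail_prev / detail_curr: concatenated detail coefficients of the
    # previous and the current decomposition level
    var_p, var_c = np.var(detail_prev), np.var(detail_curr)
    exc_p, exc_c = kurtosis(detail_prev), kurtosis(detail_curr)  # Fisher excess
    if var_c < var_p:                       # criterion 1: contrast drops -> stop
        return False
    if exc_c < exc_p:                       # excess drops -> evaluate Eq. (5)
        return (var_c - var_p) / var_p >= (exc_p - exc_c) / exc_p
    return True                             # criterion 3: both grow -> continue
```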
3.2 Best Branch Algorithm
In this sub-chapter a new algorithmic concept for the feature extraction of security textures is presented. It is predicated on the condition that only the branch which provides the best spatial frequency resolution is important for texture discrimination. The following investigations on tree properties lead to the Best Branch Algorithm (BBA), which is described at the end of this sub-chapter: The detail nodes of the Wavelet Packet Tree, as the name suggests, contain specific characteristics of a texture. Therefore, even if textures are akin, they can be discriminated by this information. The approximation nodes of the leftmost tree branch, the so-called approximation branch, contain low frequency information. Therefore, it is nearly impossible to distinguish different printing techniques with their information. However, the children of the approximation nodes, which represent the lower part of the middle frequency scales, can also contain significant texture characteristics. Therefore, they can contribute to texture discrimination as well. Since the full decomposition of the Wavelet Packet Tree results in a high computational effort [9], it is advantageous to concentrate on branches with the highest spatial frequency resolution, which means the best
representation of a texture. In this way only the best detail branch and the approximation branch of each level are decomposed. For the evaluation of the best detail branch, the node with the highest information content of the current level, the so-called best node, has to be investigated. Due to empirical evaluations, the excess of subimages at the same tree level is almost equal. Therefore, the best detail node is the node with the highest variance of a certain level. Moreover, once the children of the approximation branch have a poorer spatial frequency resolution than the children of the best detail branch, the texture can be characterized best by the middle and upper frequency scales; therefore, only the detail branch has to be decomposed further. The aforementioned heuristically evaluated conditions and the information content and stopping criterion (Section 3.1) lead to the Best Branch Algorithm, described below:

Algorithm 1. Best Branch Algorithm
Require: mod(M × M, 2) = 0
  finished ⇐ false; i ⇐ 1
  A_{i,0}, cV_{i,1}, cH_{i,2}, cD_{i,3} ⇐ 2D-SIWPT(A_{i−1,0})
  while (i ≤ log2(M × M / 16x16)) and (not finished) do
    cB_i ⇐ node with maximal variance among A_{i,0}, cV_{i,1}, ..., cD_{i,7}   {determine the best detail node cB_i}
    if cB_i ⊂ A_{i−1,0} then
      delete A_{i,4}, cV_{i,5}, cH_{i,6}, cD_{i,7}   {best node is part of the approximation branch}
      j ⇐ 0
    else
      delete A_{i,0}, cV_{i,1}, cH_{i,2}, cD_{i,3}   {best node is part of the detail branch}
      j ⇐ 4
    end if
    σ_i² ⇐ σ²_{cV_{i,j+1}} + σ²_{cH_{i,j+2}} + σ²_{cD_{i,j+3}}
    C_i ⇐ C_{cV_{i,j+1}} + C_{cH_{i,j+2}} + C_{cD_{i,j+3}}
    if σ²_{i−1} > σ_i² then
      finished ⇐ true   {best spatial frequency resolution has been reached}
    else if C_{i−1} > C_i then
      if not Eq. (5) then
        finished ⇐ true   {best spatial frequency resolution has been reached}
      end if
    else
      increment i
      if cB_{i−1} ⊂ A_{i−2,0} then
        A_{i,0}, cV_{i,1}, cH_{i,2}, cD_{i,3} ⇐ 2D-SIWPT(A_{i−1,0})
        A_{i,4}, cV_{i,5}, cH_{i,6}, cD_{i,7} ⇐ 2D-SIWPT(cB_{i−1})
      else
        A_{i,4}, cV_{i,5}, cH_{i,6}, cD_{i,7} ⇐ 2D-SIWPT(cB_{i−1})
      end if
    end if
  end while   {C_{i−1} and σ²_{i−1} represent the texture best possible}
Figure 4 shows an incomplete Wavelet Packet Tree which has been decomposed by the BBA. The highlighted nodes are the best nodes of their level and the
Fig. 4. Three-level incomplete Wavelet Packet Tree, decomposed by the Best Branch Algorithm. The highlighted nodes contain the highest information content of their level and have been decomposed further, until the texture has been represented as best possible. The dashed nodes have been deleted during decomposition.
dashed nodes have been deleted during decomposition. The detail branch of the 3rd level characterizes the texture almost optimally. Therefore, the extracted features are appropriate for subsequent classification.
4 Experimental Results
In the experiment, a set of 6 textures with 150 instances each has been investigated. One-third of the instances was produced with the intaglio printing process. Two-thirds were printed by offset printing, a commercial printing method used, among others, for newspaper production. This part can be further subdivided into high-quality and medium-quality prints. The medium-quality prints are affected by additive noise. The commercially printed instances are barely distinguishable by the human eye from intaglio printed instances of the same texture. The instances are translated and rotated by a few pixels owing to production tolerances. They have been scanned with a resolution of 1200 dpi and converted to gray-scale images. All instances have a size of 256x256 pixels. As illustrated in Fig. 5, the textures are different in contrast, latitude of gray-scale transitions and structure.
Fig. 5. Investigated textures. The set represents the most common textures of security prints. The textures are different in contrast, latitude of gray-scale transitions, and structure. They have been decomposed by the Best Branch Algorithm with 2D-SIWPT.
All 150 instances of each texture have been decomposed by 2D-SIWPT with db2-Wavelets [12] and the BBA. The investigation of all 900 instances has been
shown that the feature extraction is independent of production tolerances like translations and varying contrast. Moreover, the BBA outputs stable features for all investigated printing techniques regardless of texture properties such as contrast, latitude of gray-scale transitions, or structure. Therefore, even if the generality of this approach has not been proven in this contribution, the BBA seems to be adaptable to other security prints as well. Due to the aforementioned properties, the BBA provides class clusters which are narrowly distributed and linearly separable. This is exemplarily visualized in Figure 6 for the feature space of all 150 instances of texture 1. Owing to the excellent separation properties of the BBA, the requirements for the subsequent classifier are modest. Furthermore, the heuristically defined decomposition criterion and features result in error rates of 0% and detection rates of 100%. The execution time of the BBA based on the incomplete shift invariant WPT is described by O(N log2(N)), where N = M × M is the size of the texture [3].
Fig. 6. Feature space of all 150 instances of texture 1, decomposed by the BBA. The circles illustrate the medium-quality commercial, the diamonds the high-quality commercial and the squares the intaglio printed instances. The class clusters are narrowly distributed and linearly separable. Therefore, error rates of 0% can be reached by a subsequent classifier.
For the estimation of overall separation results, the extracted features have been normalized to a uniform range of values between 0 and 1. Figure 7 shows the inter- and intra-class distance between intaglio, medium-, and high-quality commercially printed instances of all investigated textures. The dashed oval frames mark the level at which the BBA stopped the decomposition. On the basis of Fig. 7, it can be seen that for all 900 investigated instances the BBA stops the decomposition at the level where the printing technique is best characterized.
[Fig. 7 panels: inter-class distance (top row) and intra-class distance (bottom row) over decomposition levels 1-3 for textures 1 and 2, 3 and 4, and 5 and 6; the curves compare intaglio versus high-quality and medium-quality offset prints, and the BBA stopping level is marked.]
Fig. 7. Inter- and intra-class distance of all intaglio and commercially printed instances of each texture. The dashed oval frames mark the level at which the BBA stopped the decomposition. It can be observed that the BBA stops at the level which achieves the best inter-class distance in 100% of the cases. In 58% of the cases the intra-class distance is minimized as well. Even though the intra-class distance is not optimal for all 6 textures, the classes are narrowly distributed in every case. Therefore, on average the BBA stops at the level where the classes are best separated and least spread.
5 Conclusions
In this paper a new algorithmic feature extraction concept based on the incomplete shift invariant Wavelet Packet Transform for the discrimination of differently printed banknote textures has been presented. It has been shown that the use of first-order statistical moments of wavelet coefficients is adequate. Since the algorithm was capable of characterizing the used printing techniques for all instances of the investigated textures, the application to other security purposes seems to be possible. However, the generality of this approach has to be proven in further studies. Considering the fast execution time of O(N log2(N)), an implementation on a Field Programmable Gate Array (FPGA) seems to be possible. The proposed algorithm can be applied to textures with various contrasts, gray-scale transition widths, and structures. Furthermore, it is independent of additive noise and translations. The feature extraction results in narrowly distributed and separate class clusters. Therefore, the requirements for the subsequent classifier are exceedingly modest. For this reason, classification rates of up to 100% can be expected for subsequent approaches and will be the basis of further investigations.
Acknowledgement
This work is supported by KBA-Giori S.A., Lausanne, Switzerland.
References
1. Mallat, S.G.: A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Analysis and Machine Intelligence 11, 674–693 (1989)
2. Unser, M.: Texture classification and segmentation using wavelet frames. IEEE Trans. Image Processing 4, 1549–1560 (1995)
3. Fritzsche, M.: Anwendung von Verfahren der Mustererkennung zur Detektion von Landminen mit Georadaren. PhD Thesis, Karlsruhe University (2001)
4. Chang, T., Kuo, C.-C.J.: Texture analysis and classification with tree-structured wavelet transform. IEEE Trans. Image Processing 4, 429–441 (1993)
5. Dyck, W., Türke, T., Schaede, J., Lohweg, V.: A New Concept on Quality Inspection and Machine Conditioning for Security Prints. In: Optical Document Security - The 2008 Conference on Optical Security and Counterfeit Deterrence, San Francisco, CA, USA (2008); Reconnaissance International Publishers and Consultants, published on CD-ROM (2008)
6. Van Renesse, R.L.: Optical Document Security, 3rd edn. Artech House, Boston/London (2005)
7. Schaede, J., Lohweg, V.: The Mechanisms of Human Recognition as a Guideline For Security Feature Development. In: IS&T/SPIE 18th Annual Symposium on Electronic Imaging - Optical Security and Counterfeit Deterrence Techniques VI, vol. 6075, pp. 1–12 (2006)
8. Jiang, X.-Y., Zhao, R.-C.: Texture segmentation based on incomplete wavelet packet frame. In: IEEE Proc. Machine Learning and Cybernetics, vol. 5, pp. 3172–3177 (2003)
9. Coifman, R.R., Wickerhauser, M.V.: Entropy-based algorithms for best basis selection. IEEE Trans. Information Theory 38, 713–718 (1992)
10. Zhang, J., Tan, T.: Brief review of invariant texture analysis methods. Pattern Recognition 35, 735–747 (2002)
11. Shensa, M.J.: The Discrete Wavelet Transform: Wedding the À Trous and Mallat Algorithms. IEEE Trans. on Signal Processing 40, 2464–2482 (1992)
12. Daubechies, I.: Ten Lectures On Wavelets. Society for Industrial and Applied Mathematics (1992)
13. Saito, N.: Local Feature Extraction and its Applications using a Library of Bases. PhD Thesis, Yale University (1994)
14. Wang, Q., Li, H., Liu, J.: Subset Selection using Rough Set in Wavelet Packet Based Texture Classification. In: IEEE Proc. Wavelet Analysis and Pattern Recognition, vol. 2, pp. 662–666 (2008)
15. Webb, R.A.: Statistical Pattern Recognition, 2nd edn. Wiley, Chichester (2002)
16. Wang, X., Jin, H., Zhao, R.: Wavelet Transform and Fuzzy Kohonen Clustering Network. In: Proc. of the 3rd Congress on Intelligent Control and Automation, pp. 2684–2687 (2000)
17. Choi, E., Lee, J., Yoon, J.: Feature Extraction for Bank Note Classification Using Wavelet Transform. In: IEEE Proceedings of the 18th International Conference on Pattern Recognition, vol. 2, pp. 934–937 (2006)
18. Fowler, J.E.: The Redundant Discrete Wavelet Transform and Additive Noise. IEEE Signal Processing Letters 9, 629–632 (2005)
Video Super Resolution Using Duality Based TV-L1 Optical Flow
Dennis Mitzel1,2, Thomas Pock3, Thomas Schoenemann1, and Daniel Cremers1
1 Department of Computer Science, University of Bonn, Germany
2 UMIC Research Centre, RWTH Aachen, Germany
3 Institute for Computer Graphics and Vision, TU Graz, Austria
Abstract. In this paper, we propose a variational framework for computing a superresolved image of a scene from an arbitrary input video. To this end, we employ a recently proposed quadratic relaxation scheme for high accuracy optic flow estimation. Subsequently we estimate a high resolution image using a variational approach that models the image formation process and imposes a total variation regularity of the estimated intensity map. Minimization of this variational approach by gradient descent gives rise to a deblurring process with a nonlinear diffusion. In contrast to many alternative approaches, the proposed algorithm does not make assumptions regarding the motion of objects. We demonstrate good experimental performance on a variety of real-world examples. In particular we show that the computed super resolution images are indeed sharper than the individual input images.
1 Introduction
Increasing the resolution of images. In many applications of Computer Vision it is important to determine a scene model of high spatial resolution, as this may help, for example, to identify a car licence plate in surveillance images or to more accurately localize a tumor in medical images. Fig. 1 shows a super resolution result computed on a real-world surveillance video using the algorithm proposed in this paper. Clearly, the licence plate is better visible in the computed superresolved image than in the original input image. The resolution of an acquired image depends on the acquisition device. Increasing the resolution of the acquisition device sensor is one way to increase the resolution of acquired images. Unfortunately, this option is not always desirable as it leads to substantially increased cost of the device sensor. Moreover, the noise increases when reducing the pixel size. Alternatively, one can exploit the fact that even with a lower-resolution video camera running at 30 frames per second, one observes projections of the same image structure around 30 times a second. The algorithmic estimation of a high resolution image from a set of low resolution input images is referred to as Super Resolution.
Fig. 1. In contrast to the upscaled input image [21] (left), the super resolution image computed with the proposed algorithm (right) makes it possible to better identify the licence plate of the car observed in this surveillance video
Fig. 2. The inversion of this image formation process is referred to as Superresolution
General model of super resolution. The challenge in super resolution is to invert the image formation process, which is typically modeled by a series of linear transformations that are performed on the high resolution image, presented in Fig. 2. Given N low resolution images {I_L^k}_{k=1}^N of size L1 × L2, find a high resolution image I_H of size H1 × H2 with H1 > L1 and H2 > L2 which minimizes the cost function:

E(I_H) = Σ_{k=1}^{N} || P_k(I_H) − I_L^(k) ||    (1)
where Pk (IH ) is the projection of IH onto coordinate system and sampling grid of image ILk . · - can be any norm, but usually it is L1 or L2 -norm. Pk is usually modeled by four linear transformations, that subject the high resolution image IH to motion, camera blur, down sampling operations and finally add additive noise to the resulted low resolution image. Fig. 2 illustrates this projection. This projection connecting the k th low resolution image to the high resolution image can be formulated using matrix-vector notation. [10]: ILk = Dk Bk Wk IH + ek
(2)
where D_k is a down sampling matrix, B_k a blurring matrix, W_k a warping matrix and e_k a noise vector. We use the matrix-vector notation only for the analysis; the implementation is realized by standard operations such as convolution, warping and sampling [10].

Related Work. Super resolution is a well-known problem that has been treated extensively in the literature. Tsai and Huang [9] were the first to address the
problem of recovering a super resolution image from a set of low resolution images. They proposed a frequency domain approach that works for band limited and noise-free images. Kim et al. [8] extended this work to noisy and blurred images. Approaches in the frequency domain are computationally cheap, but they are sensitive to model deviations and can only be used for sequences with pure global translational motion [7]. Ur and Gross proposed a method based on the multi-channel sampling theorem in the spatial domain [6]. They perform a non-uniform interpolation of an ensemble of spatially shifted low resolution pictures, followed by deblurring. The method is restricted to global 2D translation in the input images. A different approach was suggested by Irani and Peleg [5]. Their approach is based on the iterative back projection method frequently used in computer aided tomography. This method imposes no limits regarding motion and handles nonuniform blur functions, but assumes motion and blurring to be known precisely. Elad and Feuer proposed a unified methodology that combines the three main estimation tools of single-image restoration theory, the maximum likelihood (ML) estimator, the maximum a posteriori (MAP) estimator and the set theoretic approach using POCS [4]. The proposed method is general but assumes explicit knowledge of the blur and smooth motion constraints. In our approach we neither restrict ourselves to a specific motion model nor assume the motion to be known. The recently published super resolution algorithms [17] and [16] describe an approach with no explicit motion estimation that is based on the Nonlocal-Means denoising algorithm. The method is of limited practical use since it requires very high computational power.

Contribution of this work. In this paper we present a robust variational approach to super resolution using the L1 error norm for the data and regularization terms. Rather than restricting ourselves to a specific motion model, we employ a recently proposed high accuracy optic flow method which is based on quadratic relaxation [13]. We assume the blur to be space invariant and constant for all measured images, which is justified since the same camera was used to record the video sequence. This paper is organized as follows. In Section 2 we briefly review the optic flow estimation scheme introduced in [13]. In Section 3 we present super resolution approaches using L2 and L1 error norms for the data and regularization terms. Subsequently, we present experimental results achieved with the respective approaches. We conclude with a summary and outlook.
2 Optical Flow
The major difficulty in applying the above super resolution approach, Eq. (2), is that the warping function W_k is generally not known. Rather than trying to simultaneously estimate warping and a super resolved image (which is computationally difficult and prone to local minima), we first separately estimate the warping using the optic flow algorithm recently introduced in [13] and in a second step use the estimated motion to solve the inverse problem (2).
In this section we briefly describe the optical flow algorithm as posed in [13]. An extension of this approach [2] was recently shown to provide excellent flow field estimates on the well-known Middlebury benchmark.

Formal Definition. Given two consecutive frames I_1, I_2 : Ω ⊂ IR² → IR of an image sequence, find a displacement vector field u : Ω → IR² that maps all points of the first frame onto their new location in the second frame and minimizes the following error criterion:

E(u(x)) = \int_\Omega \lambda\,|I_1(x) - I_2(x + u(x))| + \big(|\nabla u_1(x)| + |\nabla u_2(x)|\big)\, dx    (3)
where the first term (data term) is known as the optical flow constraint. It assumes that the grey values of pixels do not change due to the motion, I_1(x) = I_2(x + u(x)). The second term (regularization term) penalizes high variations in u to obtain smooth displacement fields. λ weights between the two assumptions. First we use a first-order Taylor approximation of I_2, i.e. I_2(x + u) ≈ I_2(x + u_0) + ⟨u − u_0, ∇I_2⟩, where u_0 is a fixed given displacement field. Since we linearized I_2, we use multi-level coarse-to-fine warping techniques in order to allow large displacements between the images and to avoid getting trapped in local minima. Inserting the linearized I_2 into the functional (3) results in:

E(u) = \int_\Omega \lambda\,\big|I_2(x + u_0) + \langle u - u_0, \nabla I_2\rangle - I_1(x)\big| + \sum_{d=1}^{2} |\nabla u_d(x)|\, dx    (4)
In the next step we abbreviate I_2(x + u_0) + ⟨u − u_0, ∇I_2⟩ − I_1(x) as ρ(u). We additionally introduce an auxiliary variable v that is a close approximation of u in order to convexify the functional, and propose to minimize the following convex approximation of the functional (4):

E(u, v) = \int_\Omega \sum_{d=1}^{2} \Big( |\nabla u_d| + \frac{1}{2\theta}(u_d - v_d)^2 \Big) + \lambda|\rho(v)|\, dx    (5)
where θ is a small constant, such that v_d is a close approximation of u_d. This convex functional can be minimized alternately by holding u or v fixed. For fixed v_1 and d = 1 we solve

\min_{u_1} \int_\Omega \frac{1}{2\theta}(u_1 - v_1)^2 + |\nabla u_1|\, dx    (6)

This is the denoising model that was presented by Rudin, Osher and Fatemi in [11]. An efficient solution for this functional was proposed in [1], which uses a dual formulation of (6) to derive an efficient and globally convergent scheme, as stated in Theorem 1.

Theorem 1. [1,13] The solution of Eq. (6) is given by u_1 = v_1 + θ div p, where p fulfils ∇(v_1 + θ div p) = p |∇(v_1 + θ div p)| and can be computed with the semi-implicit gradient descent algorithm proposed by Chambolle [1]:

p^{n+1} = \frac{p^n + \frac{\tau}{\theta}\,\nabla(v + \theta\,\mathrm{div}\, p^n)}{1 + \frac{\tau}{\theta}\,\big|\nabla(v + \theta\,\mathrm{div}\, p^n)\big|}    (7)

where τ > 0 is the time step, p^0 = 0 and τ ≤ 1/4.
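To make the scheme of Theorem 1 concrete, the fixed-point iteration (7) can be written down directly in code. The following NumPy sketch (function names, boundary handling and the fixed iteration count are our own choices, not taken from [1,13]) solves the ROF subproblem (6) for one displacement component:

```python
import numpy as np

def grad(f):
    """Forward differences with Neumann boundary (gradient set to 0 at the last row/column)."""
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, :-1] = f[:, 1:] - f[:, :-1]
    gy[:-1, :] = f[1:, :] - f[:-1, :]
    return gx, gy

def div(px, py):
    """Divergence as the (negative) adjoint of grad, using backward differences."""
    dx = np.zeros_like(px)
    dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]
    dx[:, 1:] = px[:, 1:] - px[:, :-1]
    dy[0, :] = py[0, :]
    dy[1:, :] = py[1:, :] - py[:-1, :]
    return dx + dy

def rof_dual_denoise(v, theta, tau=0.249, iterations=100):
    """Solve min_u 1/(2*theta)*(u - v)^2 + |grad u| via the dual update of Eq. (7)."""
    px = np.zeros_like(v)
    py = np.zeros_like(v)
    for _ in range(iterations):
        gx, gy = grad(v + theta * div(px, py))
        denom = 1.0 + (tau / theta) * np.sqrt(gx**2 + gy**2)
        px = (px + (tau / theta) * gx) / denom
        py = (py + (tau / theta) * gy) / denom
    return v + theta * div(px, py)
```

In the alternating minimization this routine is applied to each of the two components v_1 and v_2 inside the coarse-to-fine loop.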
Fig. 3. Performance evaluation on the test data from [3]: (a) Dimetrodon, (b) Grove2, (c) Hydrangea, with average angular errors (d) AAE 6.8°, (e) AAE 6.2°, (f) AAE 5.5°. The first row shows an image from each input sequence. The second row shows the results obtained by our implementation of the above algorithm with the parameters set as λ = 80.0, θ = 0.4 and τ = 0.249.
The minimization for fixed v_2 and d = 2 is done analogously. For u fixed, the functional (5) reduces to

E(v) = \int_\Omega \sum_{d=1}^{2} \frac{1}{2\theta}(u_d - v_d)^2 + \lambda|\rho(v)|\, dx    (8)
Theorem 2. [1,13] The solution of the optimization problem (8) is given by the following thresholding scheme:

v = u + \begin{cases} \lambda\theta\,\nabla I_2 & \text{if } \rho(u) < -\lambda\theta\,|\nabla I_2|^2 \\ -\lambda\theta\,\nabla I_2 & \text{if } \rho(u) > \lambda\theta\,|\nabla I_2|^2 \\ -\rho(u)\,\nabla I_2/|\nabla I_2|^2 & \text{if } |\rho(u)| \le \lambda\theta\,|\nabla I_2|^2 \end{cases}    (9)

Theorem 2 can easily be shown by analyzing the cases ρ(v) < 0, ρ(v) > 0 and ρ(v) = 0 – see [1,13] for details.

Implementation. We implemented the complete algorithm on the GPU using the CUDA framework and reached a high performance compared to an implementation on the CPU. The precise initialization of the parameters u, v and p_d and further details can be found in [13]. Results obtained with this approach are presented in Fig. 3.
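The case distinction of Theorem 2 is a purely pointwise operation, which is one reason why the algorithm maps well onto the GPU. As an illustration only (array names and the small stabilising constant are our own assumptions), the thresholding step (9) might look as follows on the CPU:

```python
import numpy as np

def threshold_step(u1, u2, rho, I2x, I2y, lam, theta):
    """Pointwise solution of (8): v = u + shift, with the three cases of Eq. (9)."""
    grad_sq = I2x**2 + I2y**2 + 1e-12          # |grad I2|^2, small epsilon avoids division by zero
    shift_x = np.where(rho < -lam * theta * grad_sq, lam * theta * I2x,
              np.where(rho >  lam * theta * grad_sq, -lam * theta * I2x,
                       -rho * I2x / grad_sq))
    shift_y = np.where(rho < -lam * theta * grad_sq, lam * theta * I2y,
              np.where(rho >  lam * theta * grad_sq, -lam * theta * I2y,
                       -rho * I2y / grad_sq))
    return u1 + shift_x, u2 + shift_y
```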
3 Super Resolution
In this section we present variational formulations for motion-based superresolution using first an L2 norm and subsequently a more robust L1 approach.
Variational Superresolution. In the first step we extend the data term, which imposes similarity of the desired high-resolution image I_H (after warping W, blurring B and downsampling D) with the N observed images {I_L^k}_{k=1}^N as in Eq. (2), by a regularization term which imposes spatial smoothness of the estimated image I_H. To this end we start by penalizing the L2 norm of its gradient [20]:

E(I_H) = \frac{1}{2}\sum_{k=1}^{N} \big\|D_k B_k W_k I_H - I_L^{(k)}\big\|^2 + \frac{\lambda}{2}\int_\Omega |\nabla I_H|^2\, dx    (10)
The regularization term is necessary since without it the inverse problem is typically ill-posed, i.e., it does not possess a unique solution that depends continuously on the measurements. The parameter λ allows us to weight the relative importance of the regularizer. To find the solution I_H, we minimize the energy function (10) by solving the corresponding Euler-Lagrange equation:

\frac{dE}{dI_H} = \sum_{k=1}^{N} W_k^T B_k^T D_k^T \big(D_k B_k W_k I_H - I_L^{(k)}\big) - \lambda\,\Delta I_H = 0    (11)
The linear operators D_k^T, W_k^T and B_k^T denote the inverse operations associated with the down-sampling, warping and blurring in the image formation process. Specifically, D_k^T is implemented as a simple up-sampling without interpolation. B_k^T can be implemented by using the conjugate of the kernel: if h(i, j) is the blur kernel, then the conjugate kernel h̃ satisfies h̃(i, j) = h(−i, −j) for all i, j. In our approach we model blurring through a convolution with an isotropic Gaussian kernel. Since the Gaussian kernel h is symmetric, the adjoint kernel h̃ coincides with h. In addition, we assume that blurring and downsampling are identical for all observed images, such that we can drop the index k in the operators B and D. The operator W_k^T is implemented by forward warping. We solve the Euler-Lagrange equation (11) by a steepest descent (SD) realized by an explicit Euler scheme:

I_H^{n+1} = I_H^{n} + \tau \Big( \sum_{k=1}^{N} W_k^T B^T D^T \big(I_L^{(k)} - D B W_k I_H^{n}\big) + \lambda\,\Delta I_H^{n} \Big)    (12)
where τ is the time step. The two terms in the evolution of the high resolution image I_H induce a driving force that aims to match I_H (after warping, blurring and down-sampling) to all observations while imposing a linear diffusion of the intensities weighted by λ. The whole algorithm, including the accurate motion estimation (which is a very important aspect of super resolution), is summarized below.

Algorithm 1. Goal: Given a sequence of N low resolution images {I_L^k}_{k=1}^N, estimate the interframe motion and infer a high resolution image I_H of the depicted scene.
Fig. 4. Motion estimation between consecutive frames

Fig. 5. Motion computation between the reference image and the individual frames

1: Choose an image from the sequence as the reference image.
2: Estimate for each pair of consecutive frames the motion from one frame to the next (see Fig. 4) using the algorithm presented in Section 2.
3: Using the motion fields u_i^f and v_i^f, compute the motion fields u_i^r, v_i^r relative to the reference image (Fig. 5); the indices f (motion between consecutive frames) and r (motion between reference frame and individual frame) indicate the difference between the motion maps.
4: Interpolate the motion fields u_i^r and v_i^r to the size of the image I_H.
5: Initialize I_H by setting all pixel values to 0.
6: for t = 1 to T do
7:   sum := 0;
8:   for k = 1 to N do
9:     b := W_k I_H^t (backward warping);
10:    c := h(x, y) ∗ b (convolution with the Gaussian kernel);
11:    c := D c (down sampling to the size of I_L);
12:    d := (I_L^k − c);
13:    b := D^T d (up sampling without interpolation);
14:    c := h(x, y) ∗ b;
15:    d := W_k^T c (forward warping);
16:    sum := sum + d;
17:  end for
18:  I_H^{t+1} := I_H^t + τ (sum + λ ΔI_H^t);
19: end for
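For readers who prefer code, the following NumPy/SciPy sketch mirrors the inner loop of Algorithm 1 under the simplifying assumptions stated above (one Gaussian blur and one integer down-sampling factor for all frames). All names are our own; in particular, the adjoint of the backward warping is only approximated here by nearest-neighbour splatting, whereas the actual implementation runs on the GPU:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace, map_coordinates

def backward_warp(img, u, v):
    """W_k: bilinear backward warping, samples img at (y + v, x + u)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    return map_coordinates(img, [yy + v, xx + u], order=1, mode='nearest')

def forward_warp(d, u, v):
    """W_k^T: adjoint of (nearest-neighbour) backward warping -- scatter values."""
    h, w = d.shape
    out = np.zeros_like(d)
    yy, xx = np.mgrid[0:h, 0:w]
    ys = np.clip(np.rint(yy + v).astype(int), 0, h - 1)
    xs = np.clip(np.rint(xx + u).astype(int), 0, w - 1)
    np.add.at(out, (ys, xs), d)
    return out

def super_resolve_l2(frames, flows, scale, sigma, lam, tau, iters):
    """Explicit Euler scheme of Eq. (12); flows are (u, v) fields on the HR grid,
    each describing the motion between the reference image and frame k."""
    h, w = frames[0].shape
    I_H = np.zeros((h * scale, w * scale))
    for _ in range(iters):
        total = np.zeros_like(I_H)
        for I_L, (u, v) in zip(frames, flows):
            b = backward_warp(I_H, u, v)               # W_k I_H
            c = gaussian_filter(b, sigma)               # B b
            residual = I_L - c[::scale, ::scale]        # I_L^(k) - D B W_k I_H
            up = np.zeros_like(I_H)
            up[::scale, ::scale] = residual             # D^T: zero up-sampling
            total += forward_warp(gaussian_filter(up, sigma), u, v)  # W_k^T B^T D^T
        I_H += tau * (total + lam * laplace(I_H))       # linear diffusion regularizer
    return I_H
```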
The complete algorithm was implemented on the GPU. Fig. 6 shows the results produced by Algorithm 1. As can be seen, there is a clear spatial resolution improvement compared to the upscaling of a single frame. All characters on the licence plate are clearly identifiable. Nevertheless, the resulting high resolution image is somewhat blurred, because the L2 regularizer does not allow for discontinuities in the high resolution image and does not handle outliers that may arise from incorrectly estimated optical flow.
Fig. 6. A comparison between the L2 result and one image from the sequence [21] upscaled by a factor of 2 shows an obvious spatial resolution improvement. The high resolution image is the result of Algorithm 1 using a sequence of 10 input images. The parameters were set as λ = 0.4, time step τ = 0.01 and iteration number T = 150.
Fig. 7. A comparison between the L2 norm and the L1 norm shows that the L1 norm better preserves sharp edges in the super resolved image
Robust Superresolution Using L1 Data and Regularity. In order to account for outliers and to allow discontinuities in the reconstructed image, we replace the data and regularity terms in (10) with respective L1 expressions, giving rise to the energy [20]:

E(I_H) = \sum_{k=1}^{N} \big\|D B W_k I_H - I_L^{(k)}\big\| + \lambda \int_\Omega |\nabla I_H|\, dx    (13)
This gives rise to a gradient descent of the form:

\frac{\partial I_H}{\partial t} = \sum_{k=1}^{N} W_k^T B^T D^T\, \frac{I_L^{(k)} - D B W_k I_H}{\big|I_L^{(k)} - D B W_k I_H\big|} + \lambda\,\mathrm{div}\!\left(\frac{\nabla I_H}{|\nabla I_H|}\right),    (14)
which is also implemented using an explicit Euler scheme – see equation (12). The robust regularizer gives rise to a nonlinear discontinuity preserving diffusion. For the numerical implementation we use a regularized differentiable approximation of the L1 norm given by |s|_ε = \sqrt{|s|^2 + \varepsilon^2}. The experimental results in Figs. 7–9 demonstrate clearly that the L1 formulation for motion-based super-resolution substantially improves the quality of the reconstructed image.
Fig. 8. While the L2 -norm allows a restoration of the image which is visibly better than the input images, the L1 -norm preserves sharp discontinuities even better. As input we used the book sequence from [15].
Fig. 9. Closeups show that the L1 -norm better preserves sharp edges in the restoration of the high resolution image. As input we used the car sequence from [14].
Compared to the L2 norm, we can see sharper edges. Numbers and letters that were not distinguishable in the input sequences are now clearly recognizable. The quality of the reconstructed super-resolution image depends on the accuracy of the estimated motion. In future research, we plan to investigate the joint optimization of intensity and motion field.
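As a complement to the L2 scheme above, one explicit Euler step of Eq. (14) with the regularized norm |s|_ε can be sketched as follows; the backproject callable stands for the W_k^T B^T D^T chain of the previous sketch, and ε as well as all names are our own choices:

```python
import numpy as np

def tv_l1_step(I_H, residuals, backproject, lam, tau, eps=1e-3):
    """One explicit Euler step of Eq. (14), using |s|_eps = sqrt(s^2 + eps^2).

    residuals   -- list of low-resolution residuals r_k = I_L^(k) - D B W_k I_H
    backproject -- callable applying W_k^T B^T D^T to a low-resolution image
    """
    data = np.zeros_like(I_H)
    for k, r in enumerate(residuals):
        data += backproject(r / np.sqrt(r**2 + eps**2), k)   # normalize, then back-project
    gy, gx = np.gradient(I_H)                                 # TV term: div(grad/|grad|_eps)
    mag = np.sqrt(gx**2 + gy**2 + eps**2)
    div = np.gradient(gx / mag, axis=1) + np.gradient(gy / mag, axis=0)
    return I_H + tau * (data + lam * div)
```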
4 Conclusion
In this paper, we proposed a variational approach to super resolution which can handle arbitrary motion fields. In contrast to alternative super resolution approaches, the motion field was not assumed to be known. Instead we make use of a recently proposed dual decoupling scheme for high accuracy optic flow estimation. By minimizing a functional which depends on the input images and the estimated flow field, we propose to invert the image formation process in order to compute a high resolution image of the filmed scene. We compared different variational approaches using L2 and L1 error norms for the data and regularization terms. This comparison shows that the L1 norm is more robust to errors in motion and blur estimation and results in sharper super resolution images. Future work will focus on simultaneously estimating the motion field and the super resolved image.
References
1. Chambolle, A.: An Algorithm for Total Variation Minimization and Applications. J. Math. Imaging Vis. 20, 89–97 (2004)
2. Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow computation. In: Proceedings of the Dagstuhl Visual Motion Analysis Workshop (2008)
3. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M., Szeliski, R.: A Database and Evaluation Methodology for Optical Flow, http://vision.middlebury.edu/flow/data/
4. Elad, M., Feuer, A.: Restoration of a single super-resolution image from several blurred, noisy and down-sampled measured images. IEEE Trans. Image Processing 6, 1646–1658 (1997)
5. Irani, M., Peleg, S.: Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, pp. 231–239 (1991)
6. Ur, H., Gross, D.: Improved resolution from subpixel shifted pictures. CVGIP: Graphical Models and Image Processing 54, 181–186 (1992)
7. Farsiu, S., Robinson, M., Elad, M., Milanfar, P.: Fast and robust multiframe super resolution. IEEE Transactions on Image Processing 13, 1327–1344 (2004)
8. Kim, S., Bose, N., Valenzuela, H.: Recursive reconstruction of high resolution image from noisy undersampled multiframes. IEEE Transactions on Acoustics, Speech and Signal Processing 38, 1013–1027 (1990)
9. Huang, T., Tsai, R.: Multi-frame image restoration and registration. Advances in Computer Vision and Image Processing 1, 317–339 (1984)
10. Elad, M., Feuer, A.: Super-resolution reconstruction of image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence 21, 817–834 (1999)
11. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
12. Pock, T.: Fast Total Variation for Computer Vision. PhD thesis, Graz University of Technology, Austria (2008)
13. Zach, C., Pock, T., Bischof, H.: A Duality Based Approach for Realtime TV-L1 Optical Flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
14. Milanfar, P.: MDSP Super-Resolution and Demosaicing Datasets. University of California, Santa Cruz, http://www.ee.ucsc.edu/~milanfar/software/sr-datasets.html
15. Wang, C.: Vision and Autonomous Systems Center's Image Database. Carnegie Mellon University, http://vasc.ri.cmu.edu/idb/html/motion/index.html
16. Protter, M., Elad, M., Takeda, H., Milanfar, P.: Generalizing the Non-Local-Means to Super-Resolution Reconstruction. IEEE Transactions on Image Processing 18, 36–51 (2009)
17. Ebrahimi, M., Vrscay, R.: Multi-Frame Super-Resolution with No Explicit Motion Estimation. In: IPCV, pp. 455–459 (2008)
18. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. SIAM, Philadelphia (1995)
19. Zomet, A., Peleg, S.: Superresolution from multiple images having arbitrary mutual motion. In: Super-Resolution Imaging, pp. 195–209. Kluwer, Dordrecht (2001)
20. Marquina, A., Osher, S.J.: Image Super-Resolution by TV-Regularization and Bregman Iteration. Journal of Scientific Computing 37(3) (2008)
21. Daimler Research Stuttgart
HMM-Based Defect Localization in Wire Ropes – A New Approach to Unusual Subsequence Recognition

Esther-Sabrina Platzer¹, Josef Nägele², Karl-Heinz Wehking², and Joachim Denzler¹

¹ Chair for Computer Vision, Friedrich Schiller University of Jena
{Esther.Platzer,Joachim.Denzler}@uni-jena.de, http://www.inf-cv.uni-jena.de
² Institute of Mechanical Handling and Logistics, University Stuttgart
{Naegele,Karl-Heinz.Wehking}@ift.uni-stuttgart.de, http://www.uni-stuttgart.de/ift
Abstract. Automatic visual inspection has become an important application of pattern recognition, as it supports humans in this demanding and often dangerous work. Nevertheless, the frequent lack of abnormal or defective samples prohibits a supervised learning of defect models. For this reason, techniques known as one-class classification and novelty or unusual event detection have arisen in the past years. This paper presents a new strategy to employ Hidden Markov models for defect localization in wire ropes. It is shown that the Viterbi scores can be used as an indicator for unusual subsequences. This avoids a partition of the signal into sufficiently small signal windows at the cost of the temporal context. Our results outperform recent time-invariant one-class classification approaches and represent a great advance towards an automatic visual inspection of wire ropes.
1 Introduction
Visual inspection of material surfaces has become an important application of pattern recognition [1,2,3,4]. Especially in scenarios in which a manual inspection implies a high risk for human life, an automatic inspection is highly appreciated. Furthermore, in the case of long-time inspections a human suffers from fatigue and a reduced level of concentration. The inspection of wire ropes of ropeways, elevators or bridges is an example of such a dangerous and at the same time demanding inspection task. As the ropes cannot be unmounted, a manual inspection bears a high risk for human life. The inspection speed is often quite high (0.5 meters/s) and defects are small and nearly invisible. In Fig. 1(a) two common classes of surface defects in wire ropes are shown: a missing wire and a wire fraction. In order to afford an automatic visual inspection of wire ropes, the prototype system displayed in Fig. 1(b) was developed [5]. Line cameras project the rope to a fixed number of rope views, visible in Fig. 1(c). As becomes clear from Fig. 1(a), surface defects in wire ropes are not obvious. Therefore, a machine-based recognition is a challenging problem.
Fig. 1. (a) Two common kinds of rope defects: a missing wire (top) and a wire fraction (bottom). (b) The prototype system, leading to four different rope views (c).
Furthermore, a frequently reported problem in automatic visual inspection is the lack of faulty, abnormal examples. It is hard to obtain defective examples from a real ropeway, especially due to the strict rules for a regular visual examination [6]. In order to cope with the missing abnormal training material, one-class classification approaches, also known as novelty detection or unusual event detection, have arisen in the past years [7,1,3,4,8]. A brief overview of visual inspection and one-class classification is given in the following section. In Sect. 3 anomaly detection with Hidden Markov models is introduced, followed by a derivation of our new strategy for the detection of anomalous subsequences. The ratio between the maximum Viterbi scores of two consecutive time steps serves as anomaly indicator, and the theoretical meaning of this ratio is derived. In Sect. 5 the method and the payoff are summarized and discussed, followed by an outlook on future work.
2 Related Work
One-class classification (OCC) [7] is a concept often used in automatic visual inspection applications. The key idea is to learn only a representation of a target class – mostly the class of intact examples. Afterwards, anomalies and defects are recognized by outlier detection. Xie [4] provides a comprehensive summary of recent advances in visual inspection, also covering one-class classification strategies for anomaly and defect detection. In [9] and [10] general novelty and outlier detection methods are subsumed. Application examples in the field of automatic visual inspection are given by Mäenpää et al. [1], who use self-organizing maps for real-time surface inspection. Xie and Mirmehdi [11,3] employ a Gaussian mixture model (GMM) to detect abnormal variations from the random texture of ceramic tiles. In [12] an OCC approach using a GMM for automatic defect detection in wire ropes is presented. Potential defects in ropes are identified by an outlier detection scheme. The work is extended in [8] by a comparison of different features. The approach allows a defect detection in wire ropes, but its localization ability is strongly restricted.
Fig. 2. Result of the automatic defect detection in wire ropes presented in [8]. The white window borders the rope region which was classified as a potential defect.
Fig. 2 shows an exemplary detection result obtained with our implementation of the approach of Platzer et al. [8]. Obviously, just a small part of the defect was recovered. Our working hypothesis states that defect detection and localization can be improved by the usage of temporal context during the classification. Hidden Markov models (HMM) are a well-known technique to incorporate temporal context from time series into classification problems. OCC in the context of sequential instead of static data is often called unusual event detection. For example, Zhang et al. [13] present a semi-supervised HMM approach for unusual event detection. An HMM explaining usual events is learned from a huge amount of normal training data. Unusual events are recognized by a reduced likelihood of the data given the model, and the model is adapted to these unusual events by a Bayesian approach. Brewer et al. [14] employ coupled Hidden Markov models (CHMM) to identify suspects in digital forensics. An application to surface inspection based on CHMMs is presented by Pernkopf [2]. A defect localization in texture using HMMs was presented by Hadizadeh and Shokouhi [15]. They utilize the HMM as a texture unit descriptor and predict the pixel values of the texture. Defect detection is performed based on the prediction error. In most HMM-based anomaly detection approaches a decision is made based on the sequence likelihood given the learned model. By windowing the signal it is possible to get a better localization, but at the expense of less temporal context. So these are obviously opposing intentions. For our purpose, a preferably wide temporal context covering at least one rope period is important. At the same time, an exact localization of defects within the sequence would be a great improvement. Therefore, in the following section HMM-based anomaly detection is explained in more detail and a new approach for the recognition of unusual subsequences is introduced.
3 HMM-Based Anomaly Detection
A Hidden Markov model is a probabilistic graphical model for a two-step random process. In the context of wire rope surface analysis the rope views are treated as observation sequences, whereas the hidden states are linked to the position in the rope. Emission distributions are modeled by a GMM based on histograms of oriented gradients (HOG) [16], which serve as features. This feature choice is motivated by the regular rope structure, which is ruled by gradients oriented perpendicular to the twist direction. To improve the discrimination ability of the features, the entropy of each HOG cell is also computed and used within the features. As it is no problem to
obtain a lot of intact rope data, model learning is performed in the usual way by the well-known Baum-Welch algorithm [17]. Due to the periodic structure of wire ropes, a cyclic model with a fixed bandwidth is used, preventing an error-prone segmentation of the training data into periodic segments. By defining a threshold on the probability of the observation sequence given the model, a decision on whether the sequence belongs to the model can be made. Again the opposing goals should be noted: by separating the signal into sufficiently small test sequences it is possible to achieve a good localization, but at the cost of the temporal context used to compare the test sequence with the model. For this reason, the following section introduces a new way for an HMM-based recognition of unusual subsequences.
3.1 Unusual Subsequence Detection
To decode the optimal state sequence given a learned HMM λ and an observation sequence O_{1:T} of length T, the Viterbi algorithm is used [17]. Based on the Viterbi score δ_t(i) = \max_{S_1,\ldots,S_{t-1}} P(S_1, \ldots, S_t = s_i, o_1, \ldots, o_t \mid λ) at time t for state S_t = s_i, the likelihood of the optimal path (marked by *) can be computed recursively by

P^* = \max_{1 \le i \le N} \delta_T(i).    (1)

δ_t(i) is defined as

\delta_1(i) = \pi_i\, b_i(o_1), \quad \forall\, 1 \le i \le N    (2)
\delta_t(j) = \max_{1 \le i \le N}\,[\delta_{t-1}(i)\, a_{ij}]\; b_j(o_t), \quad \forall\, 1 \le j \le N,\; 2 \le t \le T,    (3)

where a_{ij} is the state transition probability from state S_t = s_i to state S_{t+1} = s_j, b_j(o_t) represents the emission probability of state S_t = s_j to emit the observation o_t at time t, π_i is the initial probability of state S_1 = s_i and N denotes the number of states used in the topology of the model. To decode the optimal state sequence, the argument which maximized (3) must be stored in the forward step:

\psi_1(i) = 0    (4)
\psi_t(j) = \arg\max_{1 \le i \le N}\,[\delta_{t-1}(i)\, a_{ij}], \quad \forall\, 1 \le j \le N,\; 2 \le t \le T.    (5)

The optimal path is then defined by S_t^* = \psi_{t+1}(S_{t+1}^*) for t = T−1, T−2, \ldots, 1, leading to the optimal state sequence. Accordingly, the maximum Viterbi score \max_{1 \le j \le N} \delta_t(j) at time t gives the likelihood of the optimal path of the partial observation sequence O_{1:t}. In case of defective rope regions, we assume that subsequences of the data cannot be explained well by the model. This should be reflected in the maximum Viterbi score of the corresponding time steps, as the likelihood on the optimal path should decrease significantly. Hence, the ratio of two consecutive maximum Viterbi scores of neighboring time steps can be used as an anomaly indicator.
Fig. 3. Graphical illustration of the meaning of the ratio r. Bold arrows indicate the state transition a_{i'j'} which was chosen to maximize δ_{t+1}(j'). The dashed arrow represents the transition from k' to j' skipped in this case. Black shaded circles represent states with maximum Viterbi score. The ellipse marks the states i' and k' for which r is computed.
This ratio can be written as

R = \frac{\max_{1\le j\le N} \delta_{t+1}(j)}{\max_{1\le k\le N} \delta_t(k)}    (6)
  = \frac{\max_{1\le j\le N}\, \max_{1\le i\le N}\, [\delta_t(i)\, a_{ij}]\; b_j(o_{t+1})}{\max_{1\le k\le N} \delta_t(k)}    (7)

By substituting j' = \arg\max_{1\le j\le N} \delta_{t+1}(j), i' = \arg\max_{1\le i\le N} \delta_t(i)\, a_{ij'} and k' = \arg\max_{1\le k\le N} \delta_t(k), (7) can be rewritten as

R = \frac{\delta_t(i')\, a_{i'j'}\, b_{j'}(o_{t+1})}{\delta_t(k')} = \underbrace{\frac{\delta_t(i')}{\delta_t(k')}}_{r\,\le\,1}\; \underbrace{a_{i'j'}\, b_{j'}(o_{t+1})}_{z}    (8)
From (8) it becomes clear that the ratio of two consecutive maximum Viterbi scores can be seen as a supplementary weighting of z given the information from the next time step. r is the ratio of the Viterbi score δ_t(i') of state i', which was chosen to maximize δ_{t+1}(j'), and the maximum Viterbi score δ_t(k') of state k' at time t. This ratio can be taken to represent the structural uncertainty present in the model with respect to the input data and the optimal state at time t. The lower r becomes, the less certainty is present with regard to the choice of the optimal S_t. Fig. 3 illustrates the meaning of r. δ_{t+1} is not maximized by a transition from state S_t = s_{k'}, which offers the maximum δ_t(k'). Instead, δ_{t+1}(j') becomes maximal for a transition from S_t = s_{i'}, although the Viterbi score δ_t(i') was not maximal. This means that i' was not optimal with respect to the observation sequence O_{1:t}, but becomes optimal if O_{1:t+1} is considered. As the regular characteristic of the rope implies a certain fixed structure, the structural uncertainty is supposed to grow if the underlying data cannot be explained well by the learned model. This happens if anomalies are present in the data. A threshold on the scalar obtained by (8) for every time step t is used to evaluate the presented approach by means of ROC curves. We will call this threshold the anomaly indicator in the remainder of the paper. Localization results for two common rope defects are shown in Fig. 4 and clarify the potential of our theory.
Fig. 4. Defect localization results on a sequence containing a broken wire (a) and a sequence with a missing wire (b). The light gray bar (green in the colored version) gives the ground truth labeling of the human expert, while the dark gray bar (pink in the colored version) above the thin white timeline shows the obtained detection result.
Although the detection results do not perfectly match the ground truth labeling, regions with an anomalous visual appearance were recognized nearly to their full extent. The improvement becomes clear if one compares the localization result for the missing wire in Fig. 4(b) with the detection result for the same defect in the same data sequence in Fig. 2. A quantitative evaluation of our approach follows in the next section.
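Since the experiments in Section 4 work with log-likelihoods, the ratio (6) turns into a difference of consecutive maximum log-Viterbi scores. The following sketch (array layout and names are our own; the emission terms log b_j(o_t) would come from the learned GMMs) computes this indicator along a sequence:

```python
import numpy as np

def viterbi_anomaly_indicator(log_pi, log_A, log_B):
    """Log-domain Viterbi recursion; returns the difference of consecutive
    maximum Viterbi scores, i.e. log R from Eq. (6), per time step.

    log_pi -- (N,) initial log probabilities
    log_A  -- (N, N) log transition matrix, log_A[i, j] = log a_ij
    log_B  -- (T, N) log emission probabilities log b_j(o_t)
    """
    T, N = log_B.shape
    log_delta = log_pi + log_B[0]                          # log delta_1(i)
    indicator = np.zeros(T)
    for t in range(1, T):
        trans = log_delta[:, None] + log_A                 # log(delta_{t-1}(i) * a_ij)
        new_log_delta = trans.max(axis=0) + log_B[t]       # Eq. (3) in the log domain
        indicator[t] = new_log_delta.max() - log_delta.max()   # log of the ratio R
        log_delta = new_log_delta
    return indicator
```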
4 Experiments
All experiments were performed on authentic rope data acquired from real ropeways. HMM and GMM implementations of the Torch3 machine learning library [18] were employed. The number of states in the HMM was chosen to be 10, with eight components in the GMM. The HOG features were computed for blocks of 20 camera lines and a cell size of 20 × 20 pixels with m = 4 orientation bins. As the entropy for each cell was used as an additional feature, the feature dimension for a rope with a diameter of 150 pixels results in 7 · (4 + 1) = 35. Model learning was done for each of the four rope views. A region which is known to be defect-free was chosen for this task. For numerical stability we used the log-likelihood instead of the likelihood, which turns the ratio in (6) into a difference. By varying the threshold on the differences of consecutive logarithmized maximum Viterbi scores, we analyzed our results with the help of ROC curves. A camera-line-based ratio between human-labeled defects recovered as anomaly and the overall sum of defective camera lines was computed and is referred to as true positive rate (TPR). The false positive rate (FPR) gives a measure of the false alarm frequency. It should be noted that the resolution of the defect location depends on the block size used for the feature computation. Rope analysis was performed individually for every view given the associated model. The resulting ROC curves were averaged over all views.
Fig. 5. Influence of the size of the training set (in camera lines) on the system performance. The performance measure is the area under the ROC curve (AUC).
Interference between the views was not considered yet. Ground truth data for all experiments was given by a careful defect labeling by a human expert. Model learning can be done within a few minutes, as just a few rope meters are used. The time for the analysis depends on the length of the rope. In our experiments the speed for anomaly detection was approximately 6 meters per minute, which corresponds to a processing speed of up to 1000 camera lines per second (10 cm/s). A parallel computation for all camera views needs approximately 3.5 hours for a rope of 13,000,000 camera lines (1300 meters).
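Our reading of the HOG-plus-entropy features described above can be illustrated as follows; the cell layout along the rope width, the gradient-magnitude weighting and the normalisation are assumptions of ours and not prescribed by [16] or the text. With a cell size of 20 and a rope width of 150 pixels this yields the 7 · (4 + 1) = 35 feature dimensions mentioned above:

```python
import numpy as np

def hog_entropy_features(block, cell_size=20, n_bins=4):
    """Per-cell gradient-orientation histogram plus histogram entropy.

    block -- 2D array of one rope view, e.g. 20 camera lines x rope width.
    Returns a 1D feature vector with (n_bins + 1) values per cell.
    """
    gy, gx = np.gradient(block.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)              # unsigned orientations in [0, pi)
    feats = []
    for x0 in range(0, block.shape[1] - cell_size + 1, cell_size):
        cell_mag = mag[:, x0:x0 + cell_size]
        cell_ang = ang[:, x0:x0 + cell_size]
        hist, _ = np.histogram(cell_ang, bins=n_bins, range=(0, np.pi),
                               weights=cell_mag)
        hist = hist / (hist.sum() + 1e-12)               # normalize to a distribution
        entropy = -np.sum(hist * np.log(hist + 1e-12))   # additional entropy feature
        feats.extend(hist.tolist() + [entropy])
    return np.asarray(feats)
```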
4.1 Importance of the Amount of Training Data
The first experiment was designed to reveal the influence of the size of the training set. In Fig. 5 the area under the ROC curve (AUC) is given for differently sized training sets. The ROC curves were averaged over the four rope views for a test run on 13,000,000 camera lines (1300 m of rope). Models were learned on training sets ranging from 10,000 camera lines to 200,000 camera lines (120 m of rope). As expected, the size of the training set has an influence on the performance, because the HMM needs an adequate data basis for the estimation of the model parameters. It becomes clear that at least 50,000 camera lines are required to obtain a robust model. In the following experiments, a training set containing 100,000 camera lines of training data was used for model learning.
4.2 Recovered Defect Area
This experiment evaluates the amount of recovered defect area. The averaged result over 10 test runs with individually learned models is shown in Fig. 6(a). The AUC value for this curve is 0.96. The circle marks the first anomaly indicator value for which no defects were missed, while the square outlines the best recognition rates obtained for a tolerance range of one missed defect. The results of four individual camera views of a selected test run are visible in Fig. 6(b).
Fig. 6. Averaged ROC curve (AUC = 0.96) over all cameras and test runs (a) and ROC curves with corresponding AUC values for the individual camera views (b) of a single test run. The circle in (a) marks the recognition rates (TPR = 0.97, FPR = 0.36) obtained if all defects are detected, while the square gives the recognition rates (TPR = 0.88, FPR = 0.04) for a tolerance range of one missed defect.

Table 1. Averaged number of missed defects related to the anomaly indicator value.

threshold        10      20      30      40      50      60      70     80     90
missed defects   5.925   4.750   3.675   3.100   2.275   1.100   0.000  0.000  0.000
It becomes clear that cameras two and three perform weaker than cameras one and four. This is due to missing wires which are very inconspicuous in these views and therefore hard to detect with an appearance-based approach. Table 1 summarizes the averaged number of defects which were not discovered in the rope, depending on the value of the anomaly indicator.
4.3 Comparison to Time Invariant OCC
To outline the improvement concerning automatic defect detection and localization, we compare our results to the approach of Platzer et al. [8], where a time-invariant OCC method was used. As the subject of interest in [8] was just defect detection, a classification postprocessing is performed there. This postprocessing step marks the whole defect area as recognized as soon as one signal window in the defect range is classified as an anomaly. Hence, always 100% of the defect area is recovered unless a defect is totally missed. To compare both approaches regarding their localization ability, we skip this postprocessing in our evaluation. Fig. 7 compares the results obtained with the OCC approach and the presented HMM strategy. For both approaches, the TPR gives the percentage of recovered defect area while the FPR gives the false alarm rate. It becomes clear that the HMM approach leads to a remarkable improvement in defect localization compared to the OCC approach. This is a great benefit and an important aspect for the practical applicability of the system.
Fig. 7. Comparison of defect localization results obtained with the OCC approach introduced in [8] and the HMM anomaly detection strategy presented in this paper
5 Conclusions
A new HMM-based approach for anomaly detection in wire ropes was presented. Contrary to most HMM-based anomaly detection approaches, which are based on the likelihood of the whole observation sequence, a localization strategy for unusual subsequences was proposed. In contrast to the usual approaches, our method needs no steering of the localization ability by choosing a sufficiently small signal window for the analysis. The detection of anomalous subsequences is based on the ratio of two consecutive maximum Viterbi scores. As this ratio represents a supplementary weighting of the previously chosen state transition on the optimal path, it can serve as an indicator for defects in wire ropes. Our experiments prove the working hypothesis of the paper: a context-based classification using HMMs together with our new strategy for unusual subsequence recognition can lead to an improved and more robust defect detection in wire ropes. In experiments on real-life rope data from a ropeway it was possible to recover 90% of the overall defect area with the presented approach. At the same time, the false alarm rate stays clearly below 10%. A comparison with the work of Platzer et al. [8] in the field of visual rope inspection emphasizes the considerable improvement gained by the presented approach. An interesting open question is how the dependency relations between the different rope views can be taken into account to improve the method. Furthermore, the automatic adaptation of the anomaly indicator value to the data will be a point under investigation.
References
1. Mäenpää, T., Turtinen, M., Pietikäinen, M.: Real-time surface inspection by texture. Real-Time Imaging 9(5), 289–296 (2003)
2. Pernkopf, F.: 3D Surface Analysis using Coupled HMMs. Machine Vision and Applications 16(5), 298–305 (2005)
3. Xie, X., Mirmehdi, M.: TEXEMS: Texture Exemplars for Defect Detection on Random Textured Surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(8), 1454–1464 (2007)
4. Xie, X.: A Review of Recent Advances in Surface Defect Detection using Texture Analysis Techniques. Electronic Letters on Computer Vision and Image Analysis 7(3), 1–22 (2008)
5. Moll, D.: Innovative procedure for visual rope inspection. Lift Report 29(3), 10–14 (2003)
6. EN 12927-7: Safety requirements for cableway installations designed to carry persons. Ropes. Inspection, repair and maintenance. European Norm: EN 12927-7:2004 (2004)
7. Tax, D.M.J.: One-class classification – Concept-learning in the absence of counter-examples. PhD thesis, Technische Universiteit Delft (2001)
8. Platzer, E.S., Denzler, J., Süße, H., Nägele, J., Wehking, K.H.: Robustness of Different Features for One-class Classification and Anomaly Detection in Wire Ropes. In: Proceedings of the 4th International Conference on Computer Vision Theory and Applications (VISAPP), vol. 1, pp. 171–178 (2009)
9. Markou, M., Singh, S.: Novelty detection: a review – part 1: statistical approaches. Signal Processing 83(12), 2481–2497 (2003)
10. Hodge, V., Austin, J.: A Survey of Outlier Detection Methodologies. Artificial Intelligence Review 22(2), 85–126 (2004)
11. Xie, X., Mirmehdi, M.: Localising Surface Defects in Random Colour Textures using Multiscale Texem Analysis in Image Eigenchannels, pp. 1124–1127 (2005)
12. Platzer, E.S., Denzler, J., Süße, H., Nägele, J., Wehking, K.H.: Challenging Anomaly Detection in Wire Ropes Using Linear Prediction Combined with One-class Classification. In: Proceedings of the 13th International Fall Workshop Vision, Modeling and Visualization, pp. 343–352 (2008)
13. Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I.: Semi-supervised adapted HMMs for unusual event detection. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 611–618 (2005)
14. Brewer, N., Nianjun, L., Vel, O.D., Caelli, T.: Using Coupled Hidden Markov Models to Model Suspect Interactions in Digital Forensic Analysis. In: International Workshop on Integrating AI and Data Mining, pp. 58–64 (2006)
15. Hadizadeh, H., Shokouhi, B.: Random Texture Defect Detection Using 1-D Hidden Markov Models Based on Local Binary Patterns. IEICE Transactions on Information and Systems E91-D(7), 1937–1945 (2008)
16. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886–893 (2005)
17. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
18. Collobert, R., Bengio, S., Mariéthoz, J.: Torch: a modular machine learning software library. Technical report, IDIAP (2002)
Beating the Quality of JPEG 2000 with Anisotropic Diffusion

Christian Schmaltz, Joachim Weickert, and Andrés Bruhn

Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science,
Campus E1.1, Saarland University, 66041 Saarbrücken, Germany
{schmaltz,weickert,bruhn}@mia.uni-saarland.de
Abstract. Although widely used standards such as JPEG and JPEG 2000 exist in the literature, lossy image compression is still a subject of ongoing research. Galić et al. (2008) have shown that compression based on edge-enhancing anisotropic diffusion can outperform JPEG for medium to high compression ratios when the interpolation points are chosen as vertices of an adaptive triangulation. In this paper we demonstrate that it is even possible to beat the quality of the much more advanced JPEG 2000 standard when one uses subdivisions on rectangles and a number of additional optimisations. They include improved entropy coding, brightness rescaling, diffusivity optimisation, and interpolation swapping. Experiments on classical test images are presented that illustrate the potential of our approach.
1 Introduction
Image compression is becoming more and more important due to the increasing amount and resolution of images. Lossless image compression algorithms can only achieve mediocre compression rates compared to lossy compression algorithms, though. Popular lossy image compression algorithms are JPEG [1], which uses a discrete cosine transform, and its successor JPEG 2000 [2], which is based on wavelets. In the last decade the interpolation qualities of nonlinear partial differential equations (PDEs) have become evident by an axiomatic analysis [3] and by applying them to inpainting problems [4,5]. Extending inpainting to image compression drives these ideas to the extreme: only a small subset of pixels is stored, and the remaining image regions are reconstructed by PDE-based interpolation. This idea has been introduced by Galić et al. in 2005 [6] and extended later on in [7]. These authors used edge-enhancing anisotropic diffusion (EED) [8] because of its favourable performance as an interpolation operator. By encoding pixel locations in a B-tree that results from an adaptive triangulation [9] and introducing a number of amendments, they ended up with a codec that outperforms JPEG quality for medium to high compression ratios. However, they could not reach the substantially higher quality of JPEG 2000. The goal of our paper is to address this problem. We show that it is possible to beat the quality of JPEG 2000 with edge-enhancing anisotropic diffusion,
provided that a number of carefully optimised concepts are combined that have not been considered in [7]: First of all, we replace the adaptive triangulation by a subdivision into rectangles. Afterwards we use an improved entropy encoding of the interpolation data, a rescaling of the brightness map to the interval [0,255], an optimisation of the contrast parameter within the diffusion processes of encoding and decoding, and a swapping of the role of interpolation points and interpolation domain in the decoding step. The resulting novel codec that uses EED within a rectangular subdivision is called R-EED. Our paper is organised as follows: In Section 2 we review the main ideas behind diffusion-based image interpolation. Our R-EED codec is introduced in detail in Section 3, and its performance is evaluated in Section 4. The paper is concluded with a summary in Section 5. Related Work. While there are numerous papers that apply nonlinear PDEs and related variational techniques as pre- or postprocessing tools for image and video coding, their embedding within the encoding or decoding step has hardly been studied so far. Notable exceptions include work by Chan and Zhou [10] where total variation regularisation is incorporated into wavelet shrinkage, research on specific surface interpolation applications such as digital elevation maps [11], and some recent embedding of inpainting ideas into standard codecs such as JPEG [12].
2 Diffusion-Based Image Compression
As explained in the introduction, the basic idea behind the image compression approach used in this article is to save the brightness values only at a subset K ⊂ Ω of the whole image domain Ω ⊂ R². These values will be denoted by the function G : K → R⁺₀. In order to reconstruct the image, we introduce an artificial time parameter t. The reconstructed version R of the original image I is given by the steady state R = \lim_{t\to\infty} u(x, t) of the evolution equation

\partial_t u = Lu    (1)

with Dirichlet boundary conditions given by G and some elliptic differential operator L. That is, we set the brightness on K to the given values, initialise the remainder of the image arbitrarily, e.g. by setting it to zero, and diffuse the unknown parts of the image until convergence. As differential operator, we use edge-enhancing diffusion (EED) [8] because it has been shown in [7] that it performs favourably in this context. EED is given by

Lu = \mathrm{div}\big(g(\nabla u_\sigma \nabla u_\sigma^T)\, \nabla u\big),    (2)

where u_σ := K_σ ∗ u is the image smoothed with a Gaussian K_σ with standard deviation σ, and g is a diffusivity function. The diffusion tensor g(∇u_σ ∇u_σ^T) is a symmetric 2 × 2 matrix with eigenvectors parallel and orthogonal to ∇u_σ, and corresponding eigenvalues g(|∇u_σ|²) and 1. Here we use the Charbonnier diffusivity [13]
g(s^2) := \frac{1}{1 + \frac{s^2}{\lambda^2}},    (3)
where λ is a contrast parameter. Note that EED is designed in such a way that it smoothes along edges, but not across them. Thus, this diffusion process can produce sharp edges. To compare the performance of different compression algorithms, one considers the compression ratio, i.e. the ratio between the file size before and after compression, and some error measure that quantifies the reconstruction error between the initial image I and the reconstruction R. We choose the mean square error (MSE)

\mathrm{MSE}(I, R) := \frac{1}{|\Omega|} \sum_{x \in \Omega} \big(R(x) - I(x)\big)^2,    (4)
since it shows the quality differences between the methods very well. Moreover, there is a monotone mapping from the MSE to the popular peak signal to noise ratio (PSNR).
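To make the reconstruction idea tangible, the following deliberately simplified sketch of Eq. (1) fills in the unknown pixels by nonlinear diffusion with the Charbonnier diffusivity (3), while re-imposing the stored values as Dirichlet data after every step. This isotropic scheme is only a stand-in for the anisotropic EED operator (2), which additionally steers the smoothing along edges; parameter names and the explicit time stepping are our own choices:

```python
import numpy as np

def diffusion_inpaint(known_values, mask, lam=1.0, tau=0.2, iterations=2000):
    """Reconstruct an image from sparse data: pixels on the mask keep their stored
    values, the rest is filled in by nonlinear diffusion (Charbonnier diffusivity).
    Note: isotropic simplification of EED, for illustration only."""
    u = np.where(mask, known_values, 0.0).astype(float)
    for _ in range(iterations):
        gy, gx = np.gradient(u)
        g = 1.0 / (1.0 + (gx**2 + gy**2) / lam**2)       # Charbonnier diffusivity (3)
        flux_x = g * gx
        flux_y = g * gy
        div = np.gradient(flux_x, axis=1) + np.gradient(flux_y, axis=0)
        u = u + tau * div                                 # explicit diffusion step
        u[mask] = known_values[mask]                      # re-impose Dirichlet data on K
    return u
```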
3 Our Novel Codec
In order to make anisotropic diffusion competitive for image coding, great care has to be taken to select an appropriate set of interpolation points and to encode these data in a very compact way. Let us now discuss this in more detail.
3.1 Rectangular Subdivision
The proposed compression algorithm starts by saving the four boundary lines of the image. However, note that whenever we state that we “save” a line of the image, this is done by saving only three points on the line: the two endpoints and the midpoint of the line. Thus, the four boundary lines are saved as eight pixels, since pixels lying on several lines must only be saved once. Next, we check the quality of the image reconstruction when only boundary data are known. Although the complete boundary is not known, this allows us to save subimages independently from each other. Then we compute the reconstruction error, i.e. the MSE between the image and the reconstruction. If it is larger than the splitting threshold given by a·l^d, where a and l are parameters and d is the recursion depth, the image is split into two subimages by saving a line between the two subimages. These subimages are then saved recursively. The line saved is always the line in the middle of the larger image dimension, as shown in the left image in Figure 1. Thus, in order to decrease the space required to store the positions of the saved pixels, we do not store points at arbitrary positions, but only save the adaptive grid structure indicated in the left image in Figure 1. This is done by storing the maximal and minimal recursion depth reached in the compression algorithm, as well as one additional bit for each subimage between these two depths.
Fig. 1. Left: Illustration of the adaptive grid used for the proposed recursive compression routine at depths 0, 1 and 2. The white area is the area being reconstructed in the corresponding step. Right: Example point mask used for compressing the image “trui” with the proposed compression algorithm and a compression ratio close to 60:1.

Table 1. The effect of entropy coding for the image “walter”, shown in Figure 3, using different splitting thresholds in the subdivision process. Shown are the sizes of the compressed pixel data in bytes without entropy coding (None), with Huffman coding (HC), with Huffman coding using canonical codes (HCc), with arithmetic coding with a static (ACs) or adaptive (ACa) model, Lempel-Ziv-Welch coding (LZW), range coding (RC), gzip (version 1.3.5), bzip2 (version 1.0.3), and PAQ. The best result for each ratio is highlighted.

None   HC     HCc    ACs    ACa    LZW    RC     gzip   bzip2   PAQ
5200   3219   3311   3202   3125   3288   3549   2918   2878    2366
2602   1345   1517   1390   1291   1504   1758   1350   1337    1136
1270    694    866    716    646    789   1114    683    720     613
These bits indicate whether the corresponding subimage has been split in the compression step.
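The resulting encoder structure can be sketched as follows; it reuses the diffusion_inpaint stand-in from Section 2 and encodes our reading of the splitting rule (threshold a·l^d, lines stored via their endpoints and midpoint). It is meant purely as an illustration, not as the actual implementation:

```python
import numpy as np

def mark_line(mask, p0, p1):
    """Store a line by marking only its two endpoints and its midpoint."""
    for (y, x) in (p0, p1, ((p0[0] + p1[0]) // 2, (p0[1] + p1[1]) // 2)):
        mask[y, x] = True

def encode_region(image, mask, y0, x0, y1, x1, depth, a, l, max_depth):
    """Recursive rectangular subdivision: keep splitting while the reconstruction
    error of the region exceeds the depth-dependent threshold a * l**depth."""
    for p0, p1 in (((y0, x0), (y0, x1)), ((y1, x0), (y1, x1)),
                   ((y0, x0), (y1, x0)), ((y0, x1), (y1, x1))):
        mark_line(mask, p0, p1)                        # save the four boundary lines
    region = (slice(y0, y1 + 1), slice(x0, x1 + 1))
    recon = diffusion_inpaint(image[region] * mask[region], mask[region])
    err = np.mean((recon - image[region]) ** 2)        # MSE of the reconstruction
    if err <= a * l**depth or depth >= max_depth:
        return
    if (x1 - x0) >= (y1 - y0):                         # split the larger dimension
        xm = (x0 + x1) // 2
        mark_line(mask, (y0, xm), (y1, xm))            # save the splitting line
        encode_region(image, mask, y0, x0, y1, xm, depth + 1, a, l, max_depth)
        encode_region(image, mask, y0, xm, y1, x1, depth + 1, a, l, max_depth)
    else:
        ym = (y0 + y1) // 2
        mark_line(mask, (ym, x0), (ym, x1))
        encode_region(image, mask, y0, x0, ym, x1, depth + 1, a, l, max_depth)
        encode_region(image, mask, ym, x0, y1, x1, depth + 1, a, l, max_depth)
```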
3.2 Improved Entropy Coding
To further decrease the file size, the saved pixel values are quantised and coded using a general purpose entropy coder. For the proposed codec, we tested several compression algorithms, ranging from Huffman coding [14] and arithmetic coding [15] (with a static or adaptive model) over Lempel-Ziv-Welch coding [16] to standard tools like gzip (version 1.3.5) and bzip2 (version 1.0.3). Most of the time, PAQ [17] yielded the best results. In our implementation, we used a slightly modified version of paq8o8z-feb28. If very few pixels have to be compressed, a simple arithmetic coding with an adaptive model works best, though. Except for gzip and bzip2, which are standard tools, and PAQ, the source code of which is available at [18], we used the implementations from [19] in this paper. The performance of the different entropy coders is compared in Table 1 and in Section 4.
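Since the best backend varies with the amount of data, the encoder can simply try the available coders and keep the smallest output. A minimal stand-in that uses coders from the Python standard library (zlib, bz2 and lzma instead of the coders listed in Table 1) could look like this:

```python
import bz2
import lzma
import zlib

def best_entropy_coder(payload: bytes):
    """Try several general-purpose coders and keep the smallest result,
    together with an id that tells the decoder which one was used."""
    candidates = {0: payload,                    # no entropy coding
                  1: zlib.compress(payload, 9),
                  2: bz2.compress(payload, 9),
                  3: lzma.compress(payload)}
    coder_id, data = min(candidates.items(), key=lambda kv: len(kv[1]))
    return coder_id, data
```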
3.3 Brightness Rescaling
Some images do not use the whole range of possible grey values. For example, the pixels in the image “trui”, which is shown in Figure 2, have a brightness between 56 and 241. Thus, it can make sense to map the brightness of the image such that the complete range is used. This can improve the reconstruction because quantisation has less effect in this way. Note that quantisation does not only occur when quantising the grey values in the encoding step, but also when mapping the real numbers obtained by diffusion to integer values in the decoding. To illustrate the improvement of this step, we compressed the image “walter” (see Figure 3) once with the method explained in the last section and once with the same method using brightness adjustment. With brightness adjustment, the MSE for a compression rate of approximately 45:1 dropped from 50.64 to 46.33.
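Assuming a simple linear mapping (the text does not specify the exact form, so this is only one possible realisation), the rescaling and its inverse might look like this:

```python
import numpy as np

def rescale_brightness(img):
    """Map the used grey-value range linearly onto [0, 255]; the original
    minimum and maximum must be stored so the decoder can invert the mapping."""
    lo, hi = int(img.min()), int(img.max())
    rescaled = np.round((img.astype(float) - lo) * 255.0 / max(hi - lo, 1))
    return rescaled.astype(np.uint8), (lo, hi)

def undo_rescale(rescaled, lo, hi):
    """Inverse mapping applied after decoding."""
    return np.round(rescaled.astype(float) * (hi - lo) / 255.0 + lo).astype(np.uint8)
```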
3.4 Diffusivity Optimisation
In the explanation of EED in the last section, we did not state how to choose λ for the diffusivity in Equation (3). While the same λ was used for all images in [7], we found that the best parameter depends on the image and the compression ratio. Thus, we save the λ parameter that should be used in the reconstruction. We assume this parameter is between 0 and 1, quantise it to 256 different values and use a single byte to store it. Furthermore, we noticed that different λ parameters should be used in the compression and decompression steps. This is advantageous for two reasons: Firstly, the subimages reconstructed in the compression step, which are necessary to generate the grid structure, are not equal to the corresponding subimages in the reconstructed image, since the influence of surrounding parts of the image is neglected. Secondly, the compression algorithm raises or lowers the saved brightness of each saved point if that improves the reconstruction error, similar to the approach proposed in [7]. During these point optimisations, the optimal λ for the reconstruction may change. Thus, searching for an optimal λ and performing the point optimisations are interleaved. That is, after the optimal λ is found, each saved point is optimised once, and these two steps are repeated until convergence. When using an optimised λ in the compression step, the MSE of our test image “walter” improves from 46.33 to 38.91. After using the optimised parameter for the decompression, we get an error of 38.38. Using one point optimisation step, the error drops to 24.67, and finally to 21.38 after multiple optimisations.
3.5 Interpolation Swapping
Due to quantisation, most points stored in the compressed file are actually slightly inaccurate. This effect can be even stronger after the point optimisations explained in the last section. To ease this problem, we follow an idea by Bae [20] and perform an additional step after the decompression step explained so far: Once the image is reconstructed, we swap the roles of known and unknown points. That is, the points
reconstructed by diffusion are assumed to be known, and the previously known points on the interpolation mask are reconstructed with EED. With this interpolation swapping step, the reconstruction error of the image “walter” drops from 21.38 to 20.13. We abbreviate our EED-based image compression method with rectangular subdivision, improved entropy encoding, brightness rescaling, and diffusivity optimisation as R-EED.
File Format
The image format used by the proposed algorithm is given by:
– image size (between 8 (small image, equal width and height) and 18 bits)
– entropy coder used (4 bits, see Section 3.2)
– brightness mapping information (between 1 and 17 bits, see Section 3.3)
– contrast parameter for decompression (1 byte, see Section 3.4)
– flag if interpolation swapping should be used (1 bit, see Section 3.5)
– minimal and maximal recursion depth (2 bytes, see Section 3.1)
– splitting information (variable size, see Section 3.1)
– compressed pixel data (variable size)
In our implementation, there are currently four additional bits used for flags, which had been used to test extensions that are no longer in use. Thus, these bits can be used for checksums.
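A compact way to serialise such a header is to write the fields bit by bit. The following sketch shows the idea; the field widths are chosen for illustration only and do not reproduce the exact sizes of the format listed above (which, e.g., uses between 8 and 18 bits for the image size).

```python
class BitWriter:
    """Collects individual bit fields into a byte string (MSB first)."""

    def __init__(self):
        self.bits = []

    def write(self, value: int, width: int) -> None:
        self.bits.extend((value >> (width - 1 - i)) & 1 for i in range(width))

    def to_bytes(self) -> bytes:
        padded = self.bits + [0] * (-len(self.bits) % 8)
        return bytes(
            sum(bit << (7 - i) for i, bit in enumerate(padded[k:k + 8]))
            for k in range(0, len(padded), 8)
        )

def pack_header(width, height, coder_id, lam_byte, swap_flag,
                min_depth, max_depth):
    # Illustrative header layout (widths are assumptions, not the real format).
    w = BitWriter()
    w.write(width, 16)        # image size
    w.write(height, 16)
    w.write(coder_id, 4)      # entropy coder used
    w.write(lam_byte, 8)      # quantised contrast parameter lambda
    w.write(swap_flag, 1)     # interpolation swapping on/off
    w.write(min_depth, 8)     # minimal recursion depth
    w.write(max_depth, 8)     # maximal recursion depth
    return w.to_bytes()
```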
4 Experiments
In this section, we show compression results of the proposed algorithm for different compression rates and compare the results against JPEG, JPEG 2000, and the approach of Galić et al. [7]. For all experiments, we set σ to 0.8. The first image for which results are presented is the image “trui”, a standard test image often used in image processing. In order to compare our algorithm against the one proposed in [7], we scaled the image to 257 × 257, since that resolution was used there. Figure 2 shows the results of the different compression methods. The images for JPEG and JPEG 2000 have been created with “convert” (version ImageMagick 6.2.4 02/10/07). Note that convert uses optimised entropy coding parameters when saving JPEG files. Since this was not the case for the JPEG images shown in [7], those images are worse than the JPEG images shown here. Nevertheless, JPEG is clearly worse than the other approaches. In order to demonstrate the performance difference between the proposed approach and JPEG 2000, we compute their MSE for comparable compression ratios. We observe that the reconstruction error of JPEG 2000 is substantially larger than that of R-EED: for compression ratios around 43 : 1, the JPEG 2000 error is 48 % larger, and for ratios around 57 : 1 it is even 66 % larger.
[Figure 2: MSE-versus-ratio plot (JPEG, JPEG 2000, R-EED) and image panels; panel data: JPEG 42.2:1 (MSE 71.16) and 54.5:1 (MSE 111.01); Galić et al. 44.3:1 (MSE 50.89) and 57.5:1 (MSE 75.81); JPEG 2000 43.4:1 (MSE 45.99) and 57.2:1 (MSE 70.79); R-EED 44.1:1 (MSE 31.00) and 57.6:1 (MSE 42.77).]
Fig. 2. Comparison of the image quality when saving the image “trui”, scaled to 257 × 257, with different compression algorithms. Top row: input image, and plots showing the MSEs for different compression ratios. Middle row: images obtained with JPEG, the method by Galić et al. [7], JPEG 2000, and the proposed method (R-EED) with a compression rate close to 43 : 1. Bottom row: results with a compression rate close to 57 : 1. The images showing the compression result of [7] are courtesy of Irena Galić.

Table 2. Compression results for JPEG, JPEG 2000, and for the proposed algorithm (R-EED) for the images “trui”, “walter”, and a subimage of “peppers”. The best results are highlighted.

              trui                walter              peppers
           Ratio      MSE      Ratio      MSE      Ratio      MSE
JPEG       42.17:1   71.16     45.15:1   39.67     42.03:1   70.47
JPEG 2000  43.44:1   45.99     45.40:1   27.55     42.57:1   48.97
R-EED      44.11:1   31.00     45.40:1   20.13     42.96:1   42.61
[Figure 3: MSE versus compression ratio (20:1 to 80:1) for the images “Walter” and “Peppers”, comparing JPEG, JPEG 2000, R-EED-HC, and R-EED.]
Fig. 3. Compression results with the images “walter” (with resolution 256 × 256) and a 257 × 257 subimage of the image “peppers”. Shown are, from left to right, the initial image, and results obtained with our approach for compression rates of approximately 44 : 1, 66 : 1, and 89 : 1. The graphs show the performance of the proposed algorithm with the best entropy coder (R-EED) or with Huffman coding (R-EED-HC), and of JPEG and JPEG 2000 for various compression rates of these images.
Compared to the approach by Galić et al. [7], R-EED also yields clearly superior results. The images created by Irena Galić, which are shown in Figure 2, have an MSE of 50.89 (compression ratio 44 : 1) and 75.81 (compression ratio 58 : 1), which is 64 % and 77 % higher than that obtained with R-EED. Furthermore, let us consider two more images: the first image of an image sequence of Walter Cronkite, available at [21], and a subimage of the image “peppers”. For both images, the proposed compression algorithm beats JPEG 2000 for high compression rates, and achieves a similar performance for medium compression rates, as demonstrated in Figure 3 and Table 2. The graphs in Figure 3 also show results of the proposed algorithm when only Huffman coding is used. As can be seen, even with this simple entropy coder, we still achieve a better quality than JPEG 2000 for high compression ratios.
Fig. 4. Experiments with different compression ratios for the image “trui”, scaled to 257×257. From left to right: Compression ratio of 86.43 : 1, 135.76 : 1, and 186.77 : 1.
Note that the graphs in the figures show the complete compression range for JPEG and JPEG 2000, i.e. with quality settings from 100 to 1. However, only a small subinterval of the results obtainable with the proposed algorithm is shown. Results in which the image “trui” was compressed with compression ratios up to 186.77 : 1 are shown in Figure 4. Higher compression ratios are also possible.
5 Conclusions
We have presented an image compression method that performs edge-enhancing anisotropic diffusion inpainting on an adaptive rectangular grid. By using an improved entropy coding step, brightness rescaling, an optimised diffusion parameter in the compression as well as in the decompression step, and interpolation swapping, the proposed algorithm can yield results that clearly surpass those of related previous work [7] as well as of JPEG and even the sophisticated JPEG 2000 standard. Our ongoing work includes research on parallelisation strategies for multicore architectures, optimal handling of highly textured regions, as well as extensions to colour images and videos. Acknowledgement. We thank Irena Gali´c for fruitful discussions and for providing two images in Figure 2.
References
1. Pennebaker, W.B., Mitchell, J.L.: JPEG: Still Image Data Compression Standard. Springer, New York (1992)
2. Taubman, D., Marcellin, M.: JPEG 2000: Image Compression Fundamentals, Practice and Standards. Kluwer Academic Publishers, Dordrecht (2002)
3. Caselles, V., Morel, J.M., Sbert, C.: An axiomatic approach to image interpolation. IEEE Transactions on Image Processing 7(3), 376–386 (1998)
4. Masnou, S., Morel, J.M.: Level lines based disocclusion. In: Proc. 1998 IEEE International Conference on Image Processing, Chicago, IL, October 1998, vol. 3, pp. 259–263 (1998)
5. Bertalmío, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proc. SIGGRAPH 2000, New Orleans, LA, July 2000, pp. 417–424 (2000)
6. Galić, I., Weickert, J., Welk, M., Bruhn, A., Belyaev, A., Seidel, H.P.: Towards PDE-based image compression. In: Paragios, N., Faugeras, O., Chan, T., Schnörr, C. (eds.) VLSM 2005. LNCS, vol. 3752, pp. 37–48. Springer, Heidelberg (2005)
7. Galić, I., Weickert, J., Welk, M., Bruhn, A., Belyaev, A., Seidel, H.P.: Image compression with anisotropic diffusion. Journal of Mathematical Imaging and Vision 31(2–3), 255–269 (2008)
8. Weickert, J.: Theoretical foundations of anisotropic diffusion in image processing. Computing Supplement 11, 221–236 (1996)
9. Distasi, R., Nappi, M., Vitulano, S.: Image compression by B-tree triangular coding. IEEE Transactions on Communications 45(9), 1095–1100 (1997)
10. Chan, T.F., Zhou, H.M.: Total variation improved wavelet thresholding in image compression. In: Proc. Seventh International Conference on Image Processing, Vancouver, Canada, September 2000, vol. II, pp. 391–394 (2000)
11. Solé, A., Caselles, V., Sapiro, G., Arandiga, F.: Morse description and geometric encoding of digital elevation maps. IEEE Transactions on Image Processing 13(9), 1245–1262 (2004)
12. Liu, D., Sun, X., Wu, F., Li, S., Zhang, Y.Q.: Image compression with edge-based inpainting. IEEE Transactions on Circuits, Systems and Video Technology 17(10), 1273–1286 (2007)
13. Charbonnier, P., Blanc-Féraud, L., Aubert, G., Barlaud, M.: Deterministic edge-preserving regularization in computed imaging. IEEE Transactions on Image Processing 6(2), 298–311 (1997)
14. Huffman, D.A.: A method for the construction of minimum redundancy codes. Proceedings of the IRE 40, 1098–1101 (1952)
15. Rissanen, J.J.: Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development 20(3), 198–203 (1976)
16. Welch, T.A.: A technique for high-performance data compression. Computer 17(6), 8–19 (1984)
17. Mahoney, M.: Adaptive weighing of context models for lossless data compression. Technical Report CS-2005-16, Florida Institute of Technology, Melbourne, Florida (2005)
18. Mahoney, M.: Data compression programs, http://www.cs.fit.edu/~mmahoney/compression/ (Last visited March 01, 2009)
19. Dipperstein, M.: Michael Dipperstein's page o' stuff, homepage, http://michael.dipperstein.com/index.html (Last visited January 22, 2009)
20. Bae, E., Weickert, J.: Partial differential equations for interpolation and compression of surfaces. In: Proc. Seventh International Conference on Mathematical Methods for Curves and Surfaces. LNCS. Springer, Berlin (2008) (to appear)
21. Signal and Image Processing Institute of the University of Southern California: The USC-SIPI image database, http://sipi.usc.edu/database/index.html (Last visited March 01, 2009)
Decoding Color Structured Light Patterns with a Region Adjacency Graph
Christoph Schmalz
Siemens AG, CT PS 9, Otto-Hahn-Ring 6, 81739 Munich, Germany
Abstract. We present a new technique for decoding color stripe or color checkerboard patterns as often used for single-shot 3d range data acquisition with structured light. The key idea is to segment the camera image into superpixels with a watershed transform. We then describe a new algorithm that constructs a region adjacency graph and uses it to solve the correspondence problem. This is an improvement over existing scanline-based evaluation methods, as the spatial coherence assumption can be relaxed. It makes it possible to measure non-smooth objects that have so far posed problems for single-shot acquisition. The algorithm works in near real time even in uncontrolled environments. Experimental results are given.
1 Introduction
Structured light is a popular method to acquire 3d information about the world. The basic setup consists of one projector and one camera. These rather low hardware requirements make structured light a good choice for building a versatile low-cost 3d scanning system. But structured light is only a very general category covering many different methods. [1] presents an overview of the various pattern designs. There are two basic classes, temporal patterns and spatial patterns, as well as mixed forms [2]. Examples of the first class are the well-known Gray coding or phase shifting. Since several images have to be acquired, the objects in the scene must not move (sometimes called temporal coherence). In contrast, purely spatial patterns can be decoded after acquiring only a single image. The necessary information is embedded in the neighborhood of the building blocks of the pattern, the so-called primitives. These neighborhood relations must therefore stay intact for the decoding to be successful. This imposes a requirement of spatial coherence on the scene, i.e. the objects must be smooth on the scale of the neighborhood size. Hence single-shot methods, which use purely spatial patterns, have problems with small object structures. Another difference is the lateral resolution: temporal patterns yield a depth value for every single pixel, whereas spatial patterns are typically limited to 1/N of the full camera resolution, where N is the pattern primitive size. The strong points of the single-shot approach are that it allows the measurement of moving objects and also makes it possible to use very simple hardware, as only one pattern has to be projected, which can be done with a slide projector. This work shows how to improve its weak area
and decode the pattern even in places where the spatial coherence assumption does not hold. The algorithm is also robust enough to work on images taken under daylight conditions. First the watershed transform and its application to structured light patterns are introduced. In the next part we describe how the region adjacency graph is built. Then a graph-based algorithm for decoding the pattern is presented. In the last part experimental results are shown.
2 Single-Shot Structured Light
Single-shot pattern design is a fundamental tradeoff between resolution and robustness. High resolution demands small pattern primitives, but smaller primitives are harder to detect reliably. Many different types of single-shot structured light patterns have been described in the literature; most of them are based on pseudorandom sequences or arrays [3],[4]. They have the property that a given window of size n or n*m occurs at most once, thus observing such a window allows one to deduce its position in the pattern. Another design decision is the alphabet size of the code. Ideally one wants to use a large alphabet for a long code with a small window size, but the smaller the distance between code letters, the less robust the code. A method to determine the optimal tradeoff is presented in [5]. One extreme is the rainbow range finder of [6], the other is the use of a black-and-white pattern as in [7]. A well-known single-shot 3d scanning system using a color stripe pattern built from pseudorandom De Bruijn sequences [8] is described in [9]. The decoding works per image scanline and is based on Dynamic Programming. The runtime is given as 1 minute on a 900 MHz Pentium PC. Unfortunately this negates the speed advantage of the single-shot approach. We think, however, that it could be improved considerably by using a watershed segmentation pre-stage, as many scanlines in fact contain identical stripe sequences, only the edge positions vary. Another system with a similar pattern, also using Dynamic Programming for decoding, was presented in [10], but without information about runtime. A true realtime 3d system is the one developed by [11].
2.1 Pattern Design
In our system we use both checkerboard and stripe patterns. To ensure high robustness, the alphabet consists of just the six colors green, red, blue, cyan, magenta, and yellow. These are six of the eight corners of the RGB color cube. We do not use black and white, as they are easily confused with shadow or highlight areas, respectively. We require that the green channel changes at every edge, as it is sampled with the highest resolution in the standard Bayer pattern used in color CCD cameras. The length of the resulting code depends on the window size and Hamming distance used. Shorter codes constrain the working volume of the system, but a small window size makes decoding easier and more reliable, so again a tradeoff has to be made. Checkerboards contain twice as many edges and thus promise twice as much depth data as stripe patterns. In practice the advantage is only about 30 percent: in places where two edges cross, neither can be recovered accurately enough to allow computation of 3D data, and the color contrast is generally lower.
3 Watershed Segmentation
The issue with the scanline-based pattern-decoding algorithms is that by decomposing a 2D problem into a series of 1D problems a lot of information is ignored. Consider the situation shown in figure 1. Although the order of the stripes is obvious it cannot be determined in a 1D scan. However by reducing the complexity of the image, it becomes possible to directly solve the original 2D problem of decoding the pattern. To achieve this reduction we need to segment the image and replace pixels by ’superpixels’. An additional advantage is that because superpixel properties like color are defined as statistics over an ensemble of ’simple’ pixels, the effects of defocus blur and noise are reduced.
Fig. 1. Scanline-based decoding failure
Segmentation is one of the fundamental problems in computer vision. Many algorithms have been proposed; they can broadly be classified as feature based, contour based, or region based. A survey with a focus on color images can be found in [12]. An important distinction also lies between supervised and unsupervised algorithms. The watershed transform does not require user interaction and is parameter free. It is, however, a low-level method that produces severe oversegmentation. Usually an area judged to be homogeneous by a human will be broken up into many small individual regions because of noise. For the purposes of this research this is not a problem: we want to represent the image with superpixels that are internally uniform. The watershed transform was popularized in image processing by [13]. There are a number of different definitions; a good overview is presented in [14]. The basic idea is that pixel values are interpreted as height. In immersion-type algorithms the resulting landscape is successively flooded by a rising ground water level. The algorithm keeps track of where water first seeps in. Alternatively there are rainfalling simulations, where the algorithm finds the places where water collects. In this work a modified immersion-type algorithm is used that does not generate watershed lines but only basins. The input image for a watershed segmentation (figure 2) must be scalar so that a ‘height’ can be defined. We use the modulus of the gradient of the image to be segmented. The basins of the watershed transform will be referred to as regions in the following.
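A typical off-the-shelf realisation of this step might look as follows (using scikit-image as a stand-in; the paper's own implementation is a modified immersion algorithm, so this only illustrates the input/output behaviour).

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import color, filters, segmentation

def watershed_superpixels(rgb_image: np.ndarray) -> np.ndarray:
    """Segment a camera image into watershed basins ("superpixels").

    The watershed is run on the modulus of the intensity gradient; local
    gradient minima serve as seeds, so no parameters have to be chosen.
    """
    grey = color.rgb2gray(rgb_image)
    gradient = filters.sobel(grey)             # gradient modulus (height map)
    seeds, _ = ndi.label(gradient == ndi.minimum_filter(gradient, size=3))
    labels = segmentation.watershed(gradient, markers=seeds)
    return labels                              # integer region label per pixel
```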
Fig. 2. Example plot of gradient modulus of a stripe pattern as watershed transform input
Fig. 3. Watershed transform: (a) original image (colors enhanced), (b) watershed basins (false colors), (c) basins with order-filtered colors
4 Region Adjacency Graph
We can now build the region adjacency graph of the superpixels. It is not a regular 4- or 8-connected grid graph like the original pixels, but nevertheless one can generalize image processing to such topologies [15]. A typical one-megapixel camera image of a scene illuminated by the pattern described above can be represented by a graph with about fifty thousand vertices. By tracking the color changes between the vertices we can use it to decode the pattern.
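Constructing the graph from the label image amounts to collecting all pairs of labels that occur next to each other; a plain NumPy sketch is shown below (the actual system additionally stores the order-filtered region colors at the vertices, see Section 4.1).

```python
import numpy as np

def region_adjacency_edges(labels: np.ndarray) -> set:
    """Return the set of undirected edges between 4-adjacent regions."""
    pairs = np.concatenate([
        np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1),  # horizontal
        np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1),  # vertical
    ])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]   # drop pairs inside one region
    pairs = np.sort(pairs, axis=1)              # make edges undirected
    return {tuple(edge) for edge in np.unique(pairs, axis=0)}
```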
4.1 Vertices
The region adjacency graph has one vertex for every region. Each region has a color, which is determined by a robust nonlinear order filter with marginal ordering [16] over all the original image pixels the region covers (figure 3). This color should additionally be corrected for the color crosstalk that occurs in the camera, see for example [10]. The regions have another property, namely the position of the pattern primitive they originate from, called their pattern position. This position, however, is unknown so far. This is the correspondence problem, and the purpose of the algorithm described in the following is to solve it.
4.2 Edges
The edges of the graph describe how the color changes between two adjacent regions. The color change C is a three-element vector. The scalar edge weight is defined to be its L∞ norm.
\[ C = [c_r \; c_g \; c_b]^T \in \mathbb{R}^3 \tag{1} \]
These vectors have to be assigned to categories, i.e. channel rising, constant, or falling. The categories are denoted by symbols, e.g. the symbol for red rising, green falling, blue constant is S = [+1 −1 0]^T. In text form this will be represented as R+G-. We implemented two methods for classification. The first is thresholding, the other is a classifier based on a Gaussian Mixture Model.
Thresholding. First C is normalized so that the maximum absolute channel value is 1.
\[ \hat{C} = \frac{C}{\|C\|_\infty} \tag{2} \]
The errors associated with assigning the symbol s_i to c_i and the symbol S to Ĉ are defined as
\[ e_t(c_i, s_i) = \begin{cases} \dfrac{1+c_i}{1-t} & s_i = -1 \\[4pt] \dfrac{|c_i|}{t} & s_i = 0 \\[4pt] \dfrac{1-c_i}{1-t} & s_i = +1 \end{cases} \tag{3} \]
\[ E(C, S) = \frac{1}{3} \sum_i e_t(c_i, s_i)^2 \tag{4} \]
where t is the threshold value. Since we use an alphabet with two intensity levels per channel, only one threshold is needed. The obvious choice is 1/3 for an even partitioning of the interval [−1, +1]. To find the best-fitting edge symbol with the lowest possible error we set
\[ s_i(c_i, t) = \begin{cases} -1 & c_i \le -t \\ 0 & -t < c_i < t \\ +1 & c_i \ge t \end{cases} \tag{5} \]
(a small code sketch of this classification is given at the end of this section). Each edge is also classified according to its direction, e.g. forward or backward. An example section of the resulting graph with edge symbols is shown in figure 4.
Gaussian Mixture Model. Choosing threshold values is always difficult. It is much more elegant to let the classification algorithm learn what a given edge type looks like. The parameters of the model are determined with the EM algorithm [17]. The optimal number of components is found using the MDL criterion [18]. To get the training data we use a bootstrapping scheme: the pattern decoding algorithm is run with the threshold classifier first, and the edges that could be identified are presumed to be good and used as training data. In all following runs we can then use the GMM classifier. This is especially useful for colored objects, as the fixed threshold does not work well when the color channels have different dynamic ranges.
Fig. 4. Region adjacency graph of a checkerboard pattern with edge symbols. The edge symbols are valid in one direction only and have to be inverted for the opposite direction.
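As referenced above, a minimal sketch of the threshold-based classification of Eqs. (2)–(5) could look as follows; it is an illustration under the stated equations, not the paper's implementation.

```python
import numpy as np

def classify_edge(color_change: np.ndarray, t: float = 1.0 / 3.0):
    """Assign a symbol in {-1, 0, +1}^3 to a color change vector and return
    the associated error, following Eqs. (2)-(5)."""
    c = np.asarray(color_change, dtype=float)
    c_hat = c / max(np.abs(c).max(), 1e-12)          # Eq. (2): L_inf normalisation
    symbol = np.zeros(3, dtype=int)                  # Eq. (5): per-channel symbol
    symbol[c_hat >= t] = 1
    symbol[c_hat <= -t] = -1

    errors = np.empty(3)                             # Eq. (3): per-channel error
    for i, (ci, si) in enumerate(zip(c_hat, symbol)):
        if si == -1:
            errors[i] = (1.0 + ci) / (1.0 - t)
        elif si == 0:
            errors[i] = abs(ci) / t
        else:
            errors[i] = (1.0 - ci) / (1.0 - t)
    return symbol, float((errors ** 2).sum() / 3.0)  # Eq. (4): combined error

# Example: a red-rising, green-falling, blue-constant edge ("R+G-").
print(classify_edge(np.array([0.9, -0.7, 0.05])))
```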
5 Pattern Decoding
Given the region adjacency graph and the knowledge of the coded pattern we can now decode it. The window uniqueness property allows one to determine where in the pattern a given window occurs. There are analytic methods like [19], but since enough memory is available, most implementations simply use precomputed lookup tables. In the following, the term identification is often used for decoding, as the observed regions in the camera image have to be identified with their respective pattern primitives. The windows used for identification are represented as sequences of edges. We regard the projected pattern as a graph as well and calculate symbols and directions for all its edges. The decoding algorithm is local, i.e. starting at a certain region the neighboring vertices in the graph are recursively visited in a best-first search. We also experimented with an MRF-based graph-cut optimization algorithm [20] to find the globally optimal labeling of the regions, but the results were not satisfactory because of difficulties in modeling long-range interactions and because of runtime issues. The correspondence problem is solved by finding sequences of edges in the region adjacency graph that are unique in the pattern. Before beginning we sort all edges by their match error. The first edge (with the lowest match error) is selected and its possible positions in the pattern are determined. They in turn
determine the next edges that can be used to extend the sequence. If the end regions of the sequence have such edges we add them to the sequence and repeat until only one possible position remains; if there are no legal edges that can be added, we start again with a new edge. Once an edge sequence has been uniquely identified, we can set the pattern position of the ‘starting region’ and check all its neighbors. If the edge symbol and direction between the two regions are consistent with the pattern definition we add the neighbors to the ‘open’ heap. When all neighbors have been visited we continue the identification process with the best neighbor on the heap, i.e. the one with the lowest edge error. Note that because of the normalization in the edge symbol calculation every edge gets a symbol, even if it connects two regions of equal color and the change is just noise. In a stripe pattern there are many such ‘null edges’. When an edge with a low weight (compared to the previous edge) is found, we calculate a new symbol for a virtual edge between the previous region and the neighbor. If it is identical to the previous edge symbol, the neighbor has the same color and the same pattern position. This scheme is independent of local contrast. The pattern used has a certain Hamming distance, typically 2 or 3. Therefore misclassifying an edge leads to an invalid codeword. A region typically has several neighbors, so it can still be identified by taking another way to reach it. When objects are partially occluded it is still possible that a region has two conflicting pattern positions. In that case the one corresponding to the lower edge symbol error is chosen.
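A minimal version of the precomputed lookup idea (the seed step of the identification) could look as follows; `pattern_edge_symbols` is a hypothetical linear list of the edge symbols of the projected pattern, whereas real code would operate on the region adjacency graph and the full 2D code.

```python
from collections import defaultdict

def build_window_lookup(pattern_edge_symbols, window=3):
    """Map every window of consecutive edge symbols to its positions in the
    projected pattern. With a suitable code, each window occurs exactly once."""
    lookup = defaultdict(list)
    for pos in range(len(pattern_edge_symbols) - window + 1):
        key = tuple(pattern_edge_symbols[pos:pos + window])
        lookup[key].append(pos)
    return lookup

def identify_sequence(observed_symbols, lookup, window=3):
    """Return the unique pattern position of an observed symbol sequence,
    or None if the window is ambiguous or unknown."""
    key = tuple(observed_symbols[:window])
    positions = lookup.get(key, [])
    return positions[0] if len(positions) == 1 else None
```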
6 Results
The advantages of the new method are, on the one hand, the robust colors assigned to the regions by order filtering and, on the other hand, the relaxed spatial coherence requirement. It is enough if one edge sequence can be identified (per object); the remaining regions can then be identified by examining single edges. The edges used can have any direction. The position of all identified edges is calculated with subpixel precision by interpolating the gradient. Depth values are then determined via ray-plane intersection. Precision depends on several factors like camera resolution and triangulation angle, but also on the quality of the projector used. Under optimal conditions we found a standard deviation from the plane of 0.12 mm. The setup used consists of a DMD-based projector with a resolution of 1024x768 and a camera with a resolution of 1388x1038. The baseline is 366 mm, with a triangulation angle of 19.2° and a working distance of roughly 1000 mm. The standard deviation was calculated over 2700 samples, i.e. on a small patch of the image, disregarding calibration errors. To test the performance of our algorithm we have prepared a number of synthetic test scenes with ground truth. The virtual objects are located 1000 mm from the camera. The images were corrupted with different levels of additive white Gaussian noise (AWGN). Two of the test scenes are shown in figure 5.
Fig. 5. Test scene images at medium noise level: (a) part of the ‘grid’ scene, (b) part of the ‘sun’ scene

Table 1. Results for the worst case test scenes

Scene  AWGN variance  Outliers  Inliers  Mean Distance / mm  Error Mean / mm  Error Sigma / mm
grid   10^-4            268     53066          957                0.09             1.44
grid   10^-3            371     51535          957                0.08             1.60
grid   10^-2            206     22861          957                0.04             2.21
sun    10^-4            284     54761          990               -0.02             1.66
sun    10^-3            188     51124          990               -0.03             1.87
sun    10^-2           1827     28602          990               -0.02             2.72

Fig. 6. Measurement results for the fan object: (a) fan with pattern, (b) color coded depth
The ‘grid’ scene was chosen for its difficult geometry, the ‘sun’ scene because it is heavily textured. All scenes are available on the internet [21]. Results are shown in table 1. No smoothing was used on the depth data; outliers are defined as deviating from the ground truth by more than 10 mm. For the simulated geometry and camera resolution, an edge location error of 1 pixel results in a depth error of about 4 mm, so the given standard deviations correspond to about half a pixel.
Figure 6 shows an especially difficult real-life object together with the resulting depthmap. The fan consists of many small lamellas. Not all but most areas could be recovered.
7 Conclusion and Future Work
The contribution of this work is improved single-shot 3d acquisition of non-smooth objects, which has traditionally been difficult. It has been shown how the evaluation of color structured light patterns can benefit from a watershed pre-segmentation and how the resulting region adjacency graph can be used to decode the pattern and identify pattern primitives even under noisy conditions and in places where the spatial coherence assumption does not hold. Currently an initial version of the algorithm runs at about 1 Hz on a 2 GHz Core2Duo machine. The major part of the runtime is due to the watershed transform. We are looking into implementing a highly parallel rainfalling version of the watershed algorithm [22] on a graphics processing unit (GPU) to achieve true real-time speed.
References
1. Salvi, J., Pagès, J., Batlle, J.: Pattern codification strategies in structured light systems. Pattern Recognition 37, 827–849 (2004)
2. Davis, J., Nehab, D., Ramamoorthi, R., Rusinkiewicz, S.: Spacetime stereo: a unifying framework for depth from triangulation 27(2), 296–302 (February 2005)
3. Paterson, K.G.: Perfect maps 40(3), 743–753 (May 1994)
4. Mitchell, C.J.: Aperiodic and semi-periodic perfect maps 41(1), 88–95 (January 1995)
5. Horn, E., Kiryati, N.: Toward optimal structured light patterns. In: Proc. International Conference on Recent Advances in 3-D Digital Imaging and Modeling, May 12–15, pp. 28–35 (1997)
6. Tajima, J., Iwakawa, M.: 3-d data acquisition by rainbow range finder. In: Proc. of the 10th International Conference on Pattern Recognition, vol. 1, pp. 309–313 (1990)
7. Maruyama, M., Abe, S.: Range sensing by projecting multiple slits with random cuts 15(6), 647–651 (June 1993)
8. Annexstein, F.: Generating de bruijn sequences: An efficient implementation. IEEE Transactions on Computers 46(2), 198–200 (1997)
9. Zhang, L., Curless, B., Seitz, S.M.: Rapid shape acquisition using color structured light and multi-pass dynamic programming. In: Proc. First International Symposium on 3D Data Processing Visualization and Transmission, June 19–21, pp. 24–36 (2002)
10. Pages, J., Salvi, J., Collewet, C., Forest, J.: Optimised de bruijn patterns for one-shot shape acquisition. Image and Vision Computing 23, 707–720 (2005)
11. Forster, F.: A high-resolution and high accuracy real-time 3d sensor based on structured light. In: International Symposium on 3D Data Processing Visualization and Transmission, pp. 208–215. IEEE Computer Society, Los Alamitos (2006)
12. Lucchese, L., Mitra, S.: Color image segmentation: A state-of-the-art survey. In: Proc. of the Indian National Science Academy (INSA-A), March 2001, vol. 67, pp. 207–221 (2001)
13. Vincent, L., Soille, P.: Watersheds in digital spaces: An efficient algorithm based on immersion simulations 13(6), 583–598 (June 1991)
14. Roerdink, J.B.T.M., Meijster, A.: The watershed transform: definitions, algorithms and parallelization strategies. Fundam. Inf. 41(1-2), 187–228 (2000)
15. Grady, L.: Space-Variant Computer Vision: A Graph-Theoretic Approach. PhD thesis, Boston University, Boston, MA (2004)
16. Pitas, I., Tsakalides, P.: Multivariate ordering in color image filtering 1(3), 247–259, 295–6 (September 1991)
17. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1), 1–38 (1977)
18. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling 44(6), 2743–2760 (October 1998)
19. Mitchell, C.J., Etzion, T., Paterson, K.G.: A method for constructing decodable de bruijn sequences 42(5), 1472–1478 (September 1996)
20. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts 23(11), 1222–1239 (November 2001)
21. http://www.structuredlightsurvey.de (2009)
22. Osma-Ruiz, V., Godino-Llorente, J., Saenz-Lechon, N., Gomez-Vilda, P.: An improved watershed algorithm based on efficient computation of shortest paths. Pattern Recognition 40(3), 1078–1090 (2007)
Residual Images Remove Illumination Artifacts!
Tobi Vaudrey and Reinhard Klette
The .enpeda.. Project, The University of Auckland, New Zealand
Abstract. In past studies, illumination effects have been proven to cause the most common problems in correspondence algorithms. In this paper, we conduct a study identifying that the residual images (i.e., differences between images and their smoothed versions) contain the important information in an image. We go on to show that this approach removes illumination artifacts between corresponding pairs of images (i.e., optical flow and stereo) using a mixture of synthetic and real-life images.
1 Introduction
This paper applies the structure-texture image decomposition [2,11] as a basic approach for evaluating pre-processing options for image sequences, as recorded in vision-based driver assistance systems (DAS). When evaluating stereo and motion correspondence algorithms on real-world sequences as provided on [5], we realised that illumination artifacts define a major issue [14], causing serious reductions in accuracy for stereo and motion data. There are basically two different approaches for dealing with this problem: either we try to map both images into a uniform illumination model, or we map both into images which carry the illumination-independent information. After some experiments with various unifying mappings we realized that the first approach is basically impossible (or, at least, a very big challenge), considering that impacts of shadow are often just local (e.g., “dancing lights” caused by sunshine through trees, altering illumination on the camera sensors). Thus we moved on to the second approach, and this paper actually shows that this is a very promising direction of research. For this second approach, we picked up the concept of residuals [7], which is the difference between an image and a smoothed version of itself, and generalized it by applying not only the mean operator for smoothing, but also various smoothing operators known from past and very recent studies in computer vision. (This also includes a small modification of an operator proposed in [8].) Let f be an image with an additive decomposition f(x) = s(x) + r(x), for all pixel positions x = (x, y) in a 2D grid Ω, where s = S(f) denotes the smooth component (of an image) and r = R(f) = f − S(f) the residual. The residuum is not a (standard) image because it may also contain negative values. See Figure 1 for an example of such a decomposition. We use the straightforward iteration scheme s^(0) = f, s^(n+1) = S(s^(n)), and r^(n+1) = f − s^(n+1), for n ≥ 0. Co-occurrence matrix [6] based information measures are used to characterize information in s^(n) and r^(n), for n ≥ 1.
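The iteration scheme can be written in a few lines; in the following sketch a 3 × 3 mean filter stands in for the generic smoothing operator S, but any of the operators of Section 2 could be substituted.

```python
import numpy as np
from scipy import ndimage

def residual_iteration(f: np.ndarray, n: int = 1):
    """Compute s^(n) and r^(n) = f - s^(n) for a given smoothing operator S.

    Here S is a 3x3 mean filter; the iteration itself is operator-agnostic.
    """
    def S(img):
        return ndimage.uniform_filter(img, size=3)

    s = f.astype(np.float64)
    for _ in range(n):
        s = S(s)                 # s^(k+1) = S(s^(k))
    r = f - s                    # r^(n) = f - s^(n), may contain negative values
    return s, r
```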
This paper conducts a study identifying that the residuals r^(n) contain the “important” (for correspondence algorithms) information in an image. We go on to show that they remove illumination artifacts between corresponding image pairs (for optical flow and stereo matching), using a mixture of synthetic and real-life images. In this paper we first introduce the chosen smoothing operators in Section 2. This is followed by an overview of the data set we use. We then show that the chosen co-occurrence metrics demonstrate expected behaviour on the smoothed image s (which should be a good approximation of a low-pass filter). We go on to show that the residual image r does contain the high frequencies required for correspondence matching (Section 3), and the co-occurrence metrics show that the texture information is not lost. Section 4 proposes a methodology to test whether the illumination artifacts are, in fact, reduced using residuals, and then provides results using the proposed methodology. A conclusion and acknowledgments finalise this paper.
2 Smoothing Operators
Let f be any frame of a given image sequence, defined on a rectangular open set Ω and sampled at regular grid points within Ω. Technically, we assume that f is a two-dimensional (2D) function in L²(Ω) (i.e., informally speaking, square integrable on Ω), which defines a surface patch above Ω, whose contents (i.e., area) equals ∫_Ω |∇f|. This integral of the gradient ∇f of f is also called the total variation (TV) of f [11]. [11] assumed an additive decomposition f = s + r into a smooth component s and a residual component r, where s is assumed to be in L¹(Ω) with bounded TV (in brief: s ∈ BV), and r is in L²(Ω). This allows one to consider the minimization of the following functional:
\[ \inf_{(s,r)\in BV\times L^2,\; f=s+r} \; \int_\Omega |\nabla s| \; + \; \lambda \|r\|_{L^2}^2 \tag{1} \]
The TV-L2 approach in [11] approximates this minimum numerically for identifying the “desired clean image” s and the “additive noise” r. Further studies (see [2]) identified s as the “structure” and r as the “texture”. See Figure 1. The concept may be generalized as follows: any smoothing operator S generates a smoothed image s = S(f) and a residuum r = f − S(f). For example, TV-L2 generates the smoothed image s = S_TV(f) by solving Eq. (1).
Fig. 1. Example decomposition of RubberWhale image (top) into its smooth (left) and residual (right) components (example using TV-L2 )
The residuum r will be normalized later in this paper into a 2D function with values in [−1, 1]. (We may also use a residual operator R with r = R(f) = f − S(f); but, obviously, S and f already define both the low-frequency term s and the high-frequency term r.) The concept of residual images was already introduced in [7] by using a 3 × 3 mean for implementing S. We will include this simple smoothing operator S_mean in our discussions in this paper. Figure 1 in [7] characterizes the histogram of a residuum r = f − S_mean(f) as being a Laplacian distribution of values. S_median is another simple smoothing operator, defined by the m × m local median operator. Furthermore, the study [1] on comparing edge-preserving smoothing filters points to the (double-window) trimmed mean operator as introduced in [8]; we use the basic principles of this for the trimmed mean filter S_TM. This smoothing operator uses an m × m window, but calculates the mean only for pixels with values in [a − σ_f, a + σ_f], where a is the central pixel value and σ_f is the standard deviation of f. Finally, we also include the bilateral [13] and the trilateral filter [4], defining smoothing operators S_BL and S_TL. In the bilateral case, offset vectors a and position-dependent real weights d_1(a) define a local convolution, and the weights d_1(a) are further scaled by a second weight function d_2, defined on the differences f(x + a) − f(x):
\[ s(x) = \frac{1}{k(x)} \int_\Omega f(x+a)\, d_1(a)\, d_2[f(x+a) - f(x)]\, da, \qquad k(x) = \int_\Omega d_1(a)\, d_2[f(x+a) - f(x)]\, da \tag{2} \]
Function k(x) is used for normalization. In this paper, the weights d_1 and d_2 are defined by Gaussian functions with standard deviations σ_1 and σ_2, respectively. The smoothed function s equals S_BL(f). The bilateral filter requires a specification of the parameters σ_1, σ_2, and the size of the used filter kernel in f. The trilateral case only requires the specification of one parameter; it combines two bilateral filters. At first, a bilateral filter is applied to the derivatives of f (i.e., the gradients):
\[ g_f(x) = \frac{1}{k_\nabla(x)} \int_\Omega \nabla f(x+a)\, d_1(a)\, d_2(\|\nabla f(x+a) - \nabla f(x)\|)\, da, \qquad k_\nabla(x) = \int_\Omega d_1(a)\, d_2(\|\nabla f(x+a) - \nabla f(x)\|)\, da \tag{3} \]
Simple forward differences ∇f(x, y) ≈ (f(x+1, y) − f(x, y), f(x, y+1) − f(x, y)) are used for the digital image. For the subsequent second bilateral filter, [4] suggested the use of the smoothed gradient g_f(x) [instead of ∇f(x)] for estimating an approximating plane p_f(x, a) = f(x) + g_f(x) · a. Let f_Δ(x, a) = f(x + a) − p_f(x, a). Furthermore, a neighbourhood function
\[ n(x, a) = \begin{cases} 1 & \text{if } \|g_f(x+a) - g_f(x)\| < A \\ 0 & \text{otherwise} \end{cases} \tag{4} \]
is used for the second weighting. Parameter A specifies the adaptive region and is discussed further below. Finally,
\[ s(x) = f(x) + \frac{1}{k_\Delta(x)} \int_\Omega f_\Delta(x, a)\, d_1(a)\, d_2(f_\Delta(x, a))\, n(x, a)\, da, \qquad k_\Delta(x) = \int_\Omega d_1(a)\, d_2(f_\Delta(x, a))\, n(x, a)\, da \tag{5} \]
The smoothed function s equals S_TL(f). Again, d_1 and d_2 are assumed to be Gaussian functions, with standard deviations σ_1 and σ_2, respectively. The method requires the specification of parameter σ_1 only, which is first used as the radius of circular neighbourhoods at x in f; let g_f(x) be the mean gradient of f in such a neighbourhood. Let
\[ \sigma_2 = 0.15 \cdot \big\| \max_{x\in\Omega} g_f(x) - \min_{x\in\Omega} g_f(x) \big\| \tag{6} \]
(the value 0.15 was recommended in [4]). Finally, we also use A = σ_2. All filters have been implemented in OpenCV; where possible the native function was used (see acknowledgements at the end of the paper). For the TV-L2, we use an implementation (with identical parameters) as in [15]. All other filters used are virtually parameterless (except for a window size), and we use a window size of m = 3 (σ_1 = 3 for the trilateral filter). The only other parameter to set is the bilateral filter colour standard deviation σ_1 = 0.1 · I_range, where I_range is the range of the intensity values. For this paper we use both the optical flow data set and the stereo data set of [9], and discuss our findings for the “good quality” low-noise images. They are either synthetically generated, or use good lighting and cameras with good optics. They also use the same lighting conditions and camera exposures. Specifically, this set includes the 2001 stereo set (provided by [12]): Barn1, Barn2, Bull, Map, Poster, Sawtooth, Tsukuba, and Venus. It also includes the optical flow set, to show how both types of correspondence algorithms have the same issues. The optical flow images (provided by [3]) were used when ground truth was available, specifically: Dimetrodon, Grove2, Grove3, Hydrangea, RubberWhale, Urban2, Urban3, and Venus. The total dataset is 8 stereo and 8 optical flow pairs.
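For the mean, median, and bilateral filters the corresponding OpenCV calls are straightforward; the snippet below shows typical invocations for an 8-bit grey-scale image (the TV-L2, trimmed mean, and trilateral filters require separate implementations and are omitted here).

```python
import cv2
import numpy as np

def smooth(f: np.ndarray, method: str) -> np.ndarray:
    """Apply one smoothing operator S to an 8-bit grey-scale image f."""
    if method == "mean":
        return cv2.blur(f, (3, 3))                        # 3x3 mean
    if method == "median":
        return cv2.medianBlur(f, 3)                       # 3x3 median
    if method == "bilateral":
        sigma_colour = 0.1 * (int(f.max()) - int(f.min()))
        return cv2.bilateralFilter(f, 3, sigma_colour, 3)
    raise ValueError(method)

def residual(f: np.ndarray, method: str) -> np.ndarray:
    """Residual image r = f - S(f)."""
    return f.astype(np.float64) - smooth(f, method).astype(np.float64)
```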
3 Residual Images Contain the Important Information
This section demonstrates that the important information for correspondence algorithms is contained in the residual image r. The co-occurrence matrix has been defined for analysing different metrics about the texture of an image [6]:
\[ C(i, j) = \sum_{x\in\Omega} \; \sum_{a\in N\setminus\{(0,0)\}} \begin{cases} 1 & \text{if } h(x) = i \text{ and } h(x+a) = j \\ 0 & \text{otherwise} \end{cases} \tag{7} \]
where N + x is the neighbourhood of pixel location x, a ≠ (0, 0) is one of the offsets in N, and 0 ≤ i, j ≤ I_max (maximum intensity). h represents any 2D image (e.g., f, r, or s). All images are scaled min ↔ max to utilize the full 0 ↔ I_max scale. In our experiments we chose N to be the 4-neighbourhood, and we have I_max = 255. The
loss in information is identified by the (common) textureness metrics for homogeneity, T_homo(h) = Σ_i Σ_j C(i, j) / (1 + |i − j|), and entropy, T_ent(h) = − Σ_i Σ_j C(i, j) ln C(i, j), where an increase in homogeneity represents the image having more homogeneous areas, and a decrease in entropy shows that there is less information contained in the image. The following graphs and explanations are based on iteratively applying a smoothing filter to the specified data set. To get a better representation, we scaled each result by the original image's metric, i.e., T(s)/|T(f)|, and then averaged the results for all data (at the specific iteration). The effect of this can be seen in Figure 2, which shows (as expected, of course) that the more iterations are performed on an image, the more homogeneous it becomes and the less information (entropy) it contains. Both metrics show that there is a rapid loss of high frequencies initially, and this effect reduces after some time. Some filters come to a steady state (e.g., median), some come to a small steady increase (e.g., TV-L2, bilateral, mean, and trimmed mean), and others behave poorly (e.g., trilateral). The main point to note is that all the selected smoothing filters reduce information rapidly, as expected, which demonstrates that the chosen co-occurrence metrics highlight the information we are trying to keep. The residual of an image is an approximation of the high frequencies of the image. Therefore, the information contained in a residual image should be less affected (of course, with any filtering process the information is changed). The co-occurrence metrics were computed on the residual images (after a number of smoothing iterations); the results are shown in Figure 3. Each result is scaled, i.e., T(r)/|T(f)|, and the graphs show the average over the data set (at the specific iteration). In the homogeneity graph of this figure, it can be seen that the residual images are in fact less homogeneous than the original image (except for the median, which has a slight information loss, and the trilateral filter, which increases over time). This could be accounted for by introducing small amounts of (random) noise over the entire image. Note that the mean filter approaches the original graph; this is expected, as eventually the mean filter will approximate a uniform scale change by the mean of the entire image. Furthermore, the TV-L2 and median filters seem to be more stable than the rest (i.e., not having much range), but the others stabilize very quickly (except the trilateral, which increases).
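The two texture metrics can be computed directly from Eq. (7); a small sketch for 8-bit images and the 4-neighbourhood is given below (an illustration only, not the evaluation code used for the experiments).

```python
import numpy as np

def cooccurrence_metrics(h: np.ndarray, levels: int = 256):
    """Homogeneity and entropy of the co-occurrence matrix of Eq. (7),
    accumulated over the four 4-neighbourhood offsets."""
    h = h.astype(int)
    C = np.zeros((levels, levels), dtype=np.int64)
    for dy, dx in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
        shifted = np.roll(np.roll(h, dy, axis=0), dx, axis=1)
        valid = np.ones_like(h, dtype=bool)      # discard wrapped-around rows/cols
        if dy:
            valid[(0 if dy > 0 else -1), :] = False
        if dx:
            valid[:, (0 if dx > 0 else -1)] = False
        np.add.at(C, (h[valid], shifted[valid]), 1)
    i, j = np.indices(C.shape)
    homogeneity = (C / (1.0 + np.abs(i - j))).sum()
    counts = C[C > 0].astype(np.float64)
    entropy = -(counts * np.log(counts)).sum()
    return homogeneity, entropy
```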
Fig. 2. Average homogeneity (left) and entropy (right) of a smoothed image s, averaged over the data set
Fig. 3. Average homogeneity (left) and entropy (right) for the residual images r, over the data set
In the entropy graph, the trilateral and median filters stand out. The median filter has much lower information than the rest, and the trilateral filter increases. The other algorithms (except mean filter) are within similar magnitudes of the original image (if not better), showing that the information is not lost, or only slightly reduced.
4 Removing Illumination Artifacts with Residual Images
Correspondence algorithms usually rely on the intensity consistency assumption, i.e., that the appearance of an object (according to illumination) does not change between the corresponding images. A previous study has suggested (by experimental data) that illumination artifacts pose the biggest problem for correspondence algorithms [10]. However, intensity consistency does not hold true when using real-world images; this is due to, for example, shadows, reflections, differing exposures, and sensor noise. We show that the errors from residual images are lower than the errors obtained using the original images. The process for showing this is highlighted in Figure 4: we warp one image to the perspective of the other (using ground truth) and compare the differences. The forward warping function W is defined by the following:
\[ W\big[h_1(x, t_1, c_1),\, u(x, h_1, h_2)\big] = w\big(x + u(x, h_1, h_2)\big) \tag{8} \]
where h(x, t, c) is the value of an image (e.g., f, r, or s) at x ∈ Ω, at time t (image sequences) from camera c (multiple cameras), w is the warped image, and u is the 2D ground truth warping (remapping) vector from h_1 = h(x, t_1, c_1) to the perspective of h_2 = h(x, t_2, c_2). Subscripts 1 and 2 represent time frames or cameras. The simplest example is the stereo case, where t_1 = t_2 = t, c_1 is the left camera, c_2 is the right camera, and u is the ground truth disparity map from left to right (all vertical translations would be zero). Another common example is optical flow, where c_2 = c_1 = c, t_1 = t, t_2 = t + 1, and u is the ground truth flow field from t to t + 1. In practice, this is done using a lookup table with interpolation (e.g., bilinear or cubic).
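One possible lookup-table realisation of such a warp is sketched below using OpenCV's remapping with bilinear interpolation. The sketch assumes, for simplicity, that the ground-truth field u is given on the grid of the comparison image so that a backward lookup suffices; this assumption and the function names are illustrative, not the paper's implementation.

```python
import cv2
import numpy as np

def warp_with_ground_truth(h1: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Warp image h1 using a dense ground-truth field u of shape (H, W, 2).

    Assumes u is given on the grid of the comparison image: the warped value
    at position x is looked up at x + u(x) in h1 (bilinear interpolation).
    """
    height, width = h1.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(width, dtype=np.float32),
                                 np.arange(height, dtype=np.float32))
    map_x = grid_x + u[..., 0].astype(np.float32)
    map_y = grid_y + u[..., 1].astype(np.float32)
    return cv2.remap(h1, map_x, map_y, cv2.INTER_LINEAR)

def error_image(h2: np.ndarray, warped_h1: np.ndarray) -> np.ndarray:
    """Absolute difference e(x) = |warped h1(x) - h2(x)|."""
    return np.abs(warped_h1.astype(np.float64) - h2.astype(np.float64))
```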
Fig. 4. Outline of the methodology used to compare images
For the purposes of this paper, f is discrete in the functional inputs (x, t, and c), but continuous in the value of f itself. For a typical grey-scale image (I_max = 2^n − 1), n is usually 8 or 16. However, we find it easier to represent image data continuously by −1 ≤ f(x) ≤ 1, with f(x) ∈ Q, which takes away the ambiguity of the bits per pixel. Therefore, s will also satisfy −1 ≤ s(x) ≤ 1 with s(x) ∈ Q. The residual image r satisfies −2 ≤ r(x) ≤ 2 with r(x) ∈ Q, but in practice the upper and lower magnitudes are much less than 1. For better comparison, we scaled the residual images by (max_{x∈Ω} |r(x)|)^{-1} to bring them into the scale −1 ≤ r(x) ≤ 1. An error image e = E(h_1, h_2) is the absolute difference between two images h_1 and h_2, e(x) = |h_1(x) − h_2(x)|. For this paper, the error image is between h_2 and the warped W(h_1); see Figure 5 for an example. To assess the quality of an image, there needs to be an error metric. A common metric is the Root Mean Squared (RMS) Error, defined by
\[ RMS(e) = \sqrt{ \frac{1}{N} \sum_{x\in\Omega} |e(x)|^2 } \tag{9} \]
where N is the number of pixels in the (discrete) non-occluded image domain Ω (when occlusion maps are available). The standard RMS error gives an approximate average error for the entire signal. The problem with this metric is that it gives an even weighting to all pixels, no matter the proximity to other errors. In practice, if errors are happening in the same proximity, this is much worse than if the errors are randomly placed over an image. Most algorithms can handle (by denoising or such approaches) small amounts of error, but if the error is all in the same area, this is seen as signal. We have defined a more appropriate error to take the spatial properties of the error into account. This Spatial Root Mean Squared Error (Spatial-RMS) is defined by
\[ RMS_S(e) = \sqrt{ \frac{1}{N} \sum_{x\in\Omega} \big( G(e(x)) \big)^2 } \tag{10} \]
where G is a function that propagates the errors in a local neighbourhood N. For our experiments, we chose a Gaussian convolution to propagate the error, using a standard deviation of σ = 1.
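The two metrics of Eqs. (9) and (10) can be computed as follows; the sketch uses a Gaussian filter with σ = 1, as in the experiments, and an optional mask for non-occluded pixels.

```python
import numpy as np
from scipy import ndimage

def rms(e, mask=None):
    """Root mean squared error, Eq. (9); mask selects non-occluded pixels."""
    vals = e if mask is None else e[mask]
    return float(np.sqrt(np.mean(vals ** 2)))

def spatial_rms(e, sigma=1.0, mask=None):
    """Spatial RMS, Eq. (10): errors are first propagated to their local
    neighbourhood by a Gaussian convolution, then squared and averaged."""
    propagated = ndimage.gaussian_filter(e, sigma=sigma)
    vals = propagated if mask is None else propagated[mask]
    return float(np.sqrt(np.mean(vals ** 2)))
```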
Fig. 5. Error images using RubberWhale (see Figure 1), created using the methodology in Figure 4. Left: error between warped and original image. Right: error between W(r_TV(·, t, ·)) and r_TV(·, t + 1, ·).
A qualitative example of error images e can be seen in Figure 5. The image is from [3] and has ground truth available (warping from t to t + 1). In this figure, the image from time t is warped using the ground truth to establish an error map. This highlights that even in relatively good lighting conditions, the differences in intensity between the two images still cause a high amount of error (left image). The error image using the TV-L2 residual (right) may appear to have more error but, in fact, it shows that the error is more evenly spread. Sections to notice are areas of shadow (e.g., around the wheel, in the arch, and next to the curtain) and also object boundaries (look at the difference in errors at any object boundary). Furthermore, the magnitudes of the maximum errors are smaller; the maximum of the left image is 1.33 and that of the TV-L2 residual image is 1.12. A quantitative evaluation over the entire data set has been performed. Again we evaluate the effect of repeated iterations of the smoothing filters to obtain a residual image. The graphs in Figures 6 and 7 show the average RMS and Spatial-RMS for the optical flow dataset and the stereo dataset separately. This is to show that although stereo and optical flow algorithms appear to be quite different (and have differing communities following each), they both suffer from the same correspondence issue and use intensity consistency as their input data. At first glance it is obvious that all graphs are similar. There is only a subtle difference in the magnitude of each. The main point to notice is that all residual images obtain better RMS and Spatial-RMS than the original images after around 20 iterations. Another interesting point to note is that the Spatial-RMS shows similar information to the RMS graph. This may be because
Fig. 6. RMS for each iteration. The left and right graphs represent the average over the stereo and flow data, respectively.
Fig. 7. Spatial RMS for each iteration. The left and right graphs represent the average over the stereo and flow data, respectively.
the propagation method was not good, or that the even distribution of error (when using residual images) seems to offset the large clusters of errors in the original error images. From these graphs alone we cannot decide which technique is the best, but if we use the graphs from Section 3, we obtain more information about the filters. From the graphs in Figures 6 and 7 it appears that median filtering is the best; however, if we look at Figure 3, the information in the median filter residual is being lost! So this improvement is probably due to the loss of information, rather than to better matching. So, if we only consider the filters that do not lose information (i.e., TV-L2, bilateral, and trimmed mean), we can see how they rank. TV-L2 shows very good results, on average outperforming both the bilateral and trimmed mean filters. However, after a number of iterations, the difference is not that large. From a computational point of view, fewer iterations are desirable, so this may make the TV-L2 filter much better. The other filters to consider are the mean and trilateral filters. These two filters retain information at low iterations (< 10 for trilateral and < 3 for mean). Both these filters provide good results for the RMS metrics when used at low iterations.
5 Conclusions and Future Research
For the relatively “good quality” images of the chosen data set, we showed that using a residual image reduces the effect of illumination differences. Furthermore, the errors are spread more evenly over the image, reducing the effect of outliers. From these studies, we conclude that a simple mean filter may produce sufficient (and possibly the best) residual images. The TV-L2 filter is also a good candidate, as it retains information in the residual image but still improves results. The median filter and the trilateral filter appear to be good when looking at the RMS, but there is information loss associated with this. The trimmed mean and bilateral filters work well, but not as well as the other filters, so they are perhaps better suited to other applications. Furthermore, the results from this test need to be compared using a correspondence algorithm. A small study has been conducted using TV-L1 optical flow (see [15,16]), but more investigation needs to be done, including the application to a stereo algorithm.
Acknowledgements. The authors would like to thank Andreas Wedel (Daimler AG, Germany) for his implementation of TV-L2 smoothing. Also, Prasun Choudhury (Adobe Systems, Inc., USA) and Jack Tumblin (EECS, Northwestern University, USA) for their implementation of the trilateral filter.
References
1. Abramson, S.B., Schowengerdt, R.A.: Evaluation of edge-preserving smoothing filters for digital image mapping. ISPRS J. Photogrammetry Remote Sensing 48, 2–17 (1993)
2. Aujol, J.F., Gilboa, G., Chan, T., Osher, S.: Structure-texture image decomposition modeling, algorithms, and parameter selection. Int. J. Computer Vision 67, 111–136 (2006)
3. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. In: Proc. IEEE Int. Conf. Computer Vision, pp. 1–8 (2007)
4. Choudhury, P., Tumblin, J.: The trilateral filter for high contrast images and meshes. In: Proc. Eurographics Symp. Rendering, pp. 1–11 (2003)
5. .enpeda.. Image Sequence Analysis Test Site (EISATS): http://www.mi.auckland.ac.nz/EISATS
6. Haralick, R.M., Bosley, R.: Texture features for image classification. In: Proc. ERTS Symposium, NASA SP-351, pp. 1219–1228 (1973)
7. Kuan, D.T., Sawchuk, A.A., Strand, T.C., Chavel, P.: Adaptive noise smoothing filter for images with signal-dependent noise. IEEE Trans. Pattern Analysis Machine Intelligence 7, 165–177 (1985)
8. Mao, Z., Strickland, R.N.: Image sequence processing for target estimation in forward-looking infrared imagery. Opt. Eng. 27, 541–549 (1988)
9. Middlebury data set: optical flow data, http://vision.middlebury.edu/flow/data/ and stereo data, http://vision.middlebury.edu/stereo/data/
10. Morales, S., Woo, Y.W., Klette, R., Vaudrey, T.: A study on stereo and motion data accuracy for a moving platform. Technical report, MI-tech-32, University of Auckland (2009), http://www.mi.auckland.ac.nz/
11. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
12. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Computer Vision 47, 7–42 (2002)
13. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Proc. IEEE Int. Conf. Computer Vision, pp. 839–846 (1998)
14. Vaudrey, T., Rabe, C., Klette, R., Milburn, J.: Differences between stereo and motion behaviour on synthetic and real-world stereo sequences. In: IEEE Conf. Proc. IVCNZ (2008), doi:10.1109/IVCNZ.2008.4762133
15. Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow. In: Post Proc. Dagstuhl Motion Workshop (to appear, 2009)
16. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
Superresolution and Denoising of 3D Fluid Flow Estimates
Andrey Vlasenko and Christoph Schnörr
Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing (HCI), University of Heidelberg, Germany
{vlasenko,schnoerr}@math.uni-heidelberg.de
Abstract. Three-dimensional high-speed image measurements of fluid flows have become state-of-the-art velocimetry techniques in experimental fluid mechanics and related areas of industry. In this paper, we consider data produced by an established technique, sparse Particle Tracking Velocimetry (PTV), in terms of a small set of velocity estimates irregularly distributed over a 3D volume, and study a variational approach to superresolution in a physically consistent way. The output consists of a high-resolution vector field on the voxel grid where, additionally, the typical quantization noise of the input data can be removed. Numerical experiments validate our approach with 3D data from turbulent flows.
1 Introduction
Image velocimetry is a challenging field of experimental hydrodynamics and highly relevant for a range of industrial applications [1]. Two measurement techniques known as PIV (Particle Image Velocimetry) and PTV (Particle Tracking Velocimetry) are commonly used for velocity measurements [2,3,4]. Whereas PIV computes a regular vector field by correlating two particle density images with each other, PTV computes correspondences between individual particles, which necessitates a much lower particle density and results in an irregular velocity field with low spatial resolution [3],[5,6,7]. Both techniques have their pros and cons in view of the resolution of the velocity estimate, its accuracy, computational costs of the estimation procedure, and costs of the imaging technique. In this paper, we focus on PTV applied to 3D particles in turbulent flows. The common output format of this measurement technique consists of a sparse set of vectors (velocity estimates) irregularly distributed over the 3D volume – see Figure 1, left. The task studied in this paper is to compute from these input data a high-resolution vector field on a regular voxel grid – see Figure 1, right. To this end, we generalize an existing 2D variational technique for denoising 2D velocity estimates in a physically consistent way [8] to the 3D case and apply it in a coarse-to-fine multi-resolution estimation scheme. Numerical experiments with CFD simulations of 3D turbulent flows validate our approach. Furthermore, we show that achieving the same result from a pure signal processing point of view – e.g. by highly accurate interpolation [9,10] and resampling – is not possible, because the resulting vector field does not provide a physically consistent fluid flow estimate.
Fig. 1. Vertical cross-sections of a four-cell convection motion velocity field. Sparse input (left), reconstructed vector field (right).
The paper is organized as follows. The extension of the variational denoising procedure to 3D is presented in section 2. The extension of the method to sparse vector fields is presented in section 3. We validate our approach with numerical experiments in section 4. We conclude and indicate further work in section 5.
2 Constrained Fluid Flow Denoising in 3D
2.1 Notation, Definitions
Vectors and vector functions are denoted with bold font: v = (v1, v2, v3) denotes a velocity field, and curl(v) = ω = (∂v3/∂y − ∂v2/∂z, ∂v1/∂z − ∂v3/∂x, ∂v2/∂x − ∂v1/∂y) the corresponding vorticity field. The expressions ⟨u, v⟩, ‖v‖ denote the Euclidean inner product and norm, and ⟨u, v⟩Ω, ‖v‖Ω the inner product and norm of [L2(Ω)]^d, d = 3, respectively, where Ω ⊂ R^d is the given volume section. Fluids are supposed to be incompressible, that is

div(v) = 0 .    (1)
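For illustration only, the following sketch evaluates curl(v) and div(v) for a velocity field sampled on a regular voxel grid using central finite differences; the array layout and grid spacing are assumptions made here, not part of the paper.

```python
import numpy as np

def curl(v, h=1.0):
    """curl(v) of a velocity field v with shape (3, Nx, Ny, Nz) sampled on a
    regular grid with spacing h, via central finite differences."""
    dv = [[np.gradient(v[c], h, axis=a) for a in range(3)] for c in range(3)]
    return np.stack([dv[2][1] - dv[1][2],   # dv3/dy - dv2/dz
                     dv[0][2] - dv[2][0],   # dv1/dz - dv3/dx
                     dv[1][0] - dv[0][1]])  # dv2/dx - dv1/dy

def divergence(v, h=1.0):
    """div(v), which vanishes for incompressible flow, cf. eq. (1)."""
    return sum(np.gradient(v[i], h, axis=i) for i in range(3))
```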
They satisfy the Vorticity Transport Equation (VTE), which is the form of the Navier-Stokes equations expressed in terms of curl(v) [11]. The VTE for the three-dimensional case reads

∂ω/∂t + (v · ∇)ω + (ω · ∇)v = νΔω .    (2)

For simplicity, we abbreviate the left hand side of the VTE,

e(v) = ∂ω/∂t + (v · ∇)ω + (ω · ∇)v ,    (3)

such that the VTE becomes

e(v) = νΔω .    (4)
We consider the class of quasi-stationary fluid flows in which temporal derivatives are sufficiently small and can be safely ignored. Accordingly, we omit the term ∂ω/∂t from (2), (3), (4).
2.2 Solenoidal Projection
In view of the incompressibility constraint (1), the first step of our fluid flow denoising approach recovers the solenoidal component of the input data. It is based on the orthogonal decomposition of the space V = [L2(Ω)]³ = Vg ⊕ Vsol into gradients and solenoidal (divergence-free) vector fields [12], d = ∇φ + ∇ × ψ. The orthogonal projection P : V → Vsol onto the space Vsol = {v ∈ V | ∇ · v = 0, v · n = 0 on ∂Ω} is accomplished by solving the boundary value problem

Δφ = ∇ · d ,   φ = 0 on ∂Ω ,    (5)

and removing the divergence: v = d − ∇φ ∈ Vsol.
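A minimal numerical sketch of this projection step is given below. It assumes a field sampled on a regular grid and, purely for simplicity, periodic boundary conditions, so the Poisson problem is solved in the Fourier domain rather than with the boundary conditions of (5); all names are illustrative, not the authors' implementation.

```python
import numpy as np

def solenoidal_projection(d, h=1.0):
    """Remove the divergence of a 3D vector field d (shape (3, Nx, Ny, Nz)):
    solve Delta(phi) = div(d) and return v = d - grad(phi), cf. eq. (5).
    Periodic boundaries are assumed here instead of the conditions of (5)."""
    div = sum(np.gradient(d[i], h, axis=i) for i in range(3))
    freqs = [2.0 * np.pi * np.fft.fftfreq(n, d=h) for n in d.shape[1:]]
    kx, ky, kz = np.meshgrid(*freqs, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                       # avoid division by zero (mean mode)
    phi_hat = -np.fft.fftn(div) / k2        # -|k|^2 phi_hat = div_hat
    phi_hat[0, 0, 0] = 0.0
    phi = np.real(np.fft.ifftn(phi_hat))
    grad_phi = np.stack(np.gradient(phi, h))
    return d - grad_phi
```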
2.3 Lowpass Filtering
The next step takes the solenoidal vector field v from the previous step as input data and applies a Gaussian lowpass filter

vg = gσ ∗ v ,   gσ(x) = (2πσ)^(−3/2) exp(−‖x‖²/(2σ²)) .    (6)

The cutoff frequency in terms of σ is chosen high enough so as not to perturb the turbulent kinetic energy of the flow concentrated in lower frequency bands [13]. High-frequency noise, on the other hand, is effectively filtered out.
2.4 Vorticity Rectification
The next stage of our approach rectifies the physical properties of the flow vg computed in the previous step. This is accomplished by computing the vorticity field

ωg = ∇ × vg ,    (7)

which together with vg is used, in turn, as input data of the variational problem

min_ω ‖ω − ωg‖²_Ω + ν‖∇ × ω‖²_Ω + 2⟨e(vg), ω⟩_Ω .    (8)

Here ν is the coefficient of fluid viscosity. The corresponding Euler-Lagrange equation reads:

ωg − ω = νΔω − e(vg) .    (9)
The left hand side of (9) gives the difference between ω and the data computed in the previous step, while the right hand side is the vorticity transport equation (VTE). Thus, equation (9) gives a solution which is both close to ωg and approximately satisfies the VTE. Equation (9) has to be complemented with boundary conditions. If the exact boundary conditions are unknown, we suggest by default that the Laplacian at ∂Ω should be reduced to linear diffusion along the boundary.
2.5 Velocity Restoration
The final step converts the restored vorticity field ω back to a velocity field u. At this stage the incompressibility condition must be imposed as well. This leads to the minimization problem
min_u ‖u − vg‖²_Ω + β‖∇ × u − ω‖²_Ω ,    (10a)
subject to ∇ · u = 0 ,    (10b)

with vg and ω computed at the previous steps (6) and (8), respectively. In terms of u and Lagrange multiplier functions p, q, we obtain the variational system. The optimality system corresponding to (10) reads (cf. [14,15])

⟨u, ψ⟩_Ω + β⟨∇ × u, ∇ × ψ⟩_Ω − ⟨p, ∇ · ψ⟩_Ω = ⟨β∇ × ω + vg, ψ⟩_Ω ,  ∀ψ ,  p|∂Ω = 0 ,    (11a)
⟨q, ∇ · u⟩_Ω = 0 ,  ∀q ,  q|∂Ω = 0 ,    (11b)

where β is a user parameter. This system is discretized using a mixed finite element method, as specified in [14,15]. A consequence of this discretization is that the resulting discretized saddle-point problem

[ A  B^T ; B  0 ] [ u ; p ] = [ b ; 0 ]    (12)

is numerically stable [15]. Specifically, we can eliminate u,

u = A^(−1)(b − B^T p) ,    (13)

and solve the resulting system

B A^(−1) B^T p = B A^(−1) b .    (14)
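The elimination in (13)–(14) can be realized, for instance, with a sparse factorization of the A block and an iterative solve of the Schur complement system. The following sketch assumes assembled sparse matrices A and B; it is an illustration under these assumptions, not the authors' implementation.

```python
from scipy.sparse.linalg import splu, cg, LinearOperator

def solve_saddle_point(A, B, b):
    """Solve the saddle-point system (12) via the Schur complement, eqs. (13)-(14):
    first solve B A^-1 B^T p = B A^-1 b for the multiplier p, then recover
    u = A^-1 (b - B^T p). A: (n, n) sparse SPD block, B: (m, n) sparse, b: (n,)."""
    A_lu = splu(A.tocsc())                                  # factor A once, reuse
    S = LinearOperator((B.shape[0], B.shape[0]),
                       matvec=lambda p: B @ A_lu.solve(B.T @ p))
    p, info = cg(S, B @ A_lu.solve(b))                      # Schur complement solve
    u = A_lu.solve(b - B.T @ p)
    return u, p
```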
An important property of the algorithm presented here is that the final output vector field u preserves all physical properties of incompressible fluids: it indeed has zero divergence and conforms to the momentum balance equations.
3 Superresolution Approach
Suppose a sparse vector field d is given (cf. Figure 1). Because most entries are zero, we work with a coarse-to-fine iterative application of the restoration algorithm of the previous section (a grid-transfer sketch follows this list):
– First, we construct a sequence of dyadic voxel grids and corresponding representations of the data d_{2^L h}, d_{2^{L−1} h}, . . . , d_h, where d_h corresponds to the data d discretized at the finest grid with mesh size h. As specified below, the fine-to-coarse transfer d_{2^k h} → d_{2^{k+1} h} involves a local averaging operation, decreasing the portion of nonzero entries. The coarsest level L is chosen such that this portion is less than half of all grid positions.
– Coarse-to-fine transfer d_{2^k h} → d_{2^{k−1} h} by bilinear interpolation defines a prolongation operator P (cf. [16]) and, in turn, a restriction operator R for fine-to-coarse transfer through
⟨u_h, P v_{2h}⟩_{Ω,h} = ⟨R u_h, v_{2h}⟩_{Ω,2h} ,  ∀u_h , ∀v_{2h} ,
with ⟨·, ·⟩_{Ω,h}, ⟨·, ·⟩_{Ω,2h} denoting the inner products at the corresponding levels.
– At each level, the restoration algorithm of the previous section is iteratively applied until the difference of the energies (squared norms) of subsequent vector fields falls below a threshold (user parameter). The result is transferred to the next finer grid as input data for the next iteration.
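The sketch below illustrates the two grid-transfer operations for one scalar component of the data: fine-to-coarse restriction by local averaging over occupied voxels, and coarse-to-fine prolongation by (tri)linear interpolation. The 2×2×2 block structure and the handling of empty cells are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def restrict(d, mask):
    """Fine-to-coarse transfer of one scalar component by averaging over
    2x2x2 blocks, using only voxels that carry a measurement (mask == True)."""
    shp = tuple(s // 2 for s in d.shape)
    d_c = np.zeros(shp)
    m_c = np.zeros(shp, dtype=bool)
    for i in range(shp[0]):
        for j in range(shp[1]):
            for k in range(shp[2]):
                blk = (slice(2*i, 2*i + 2), slice(2*j, 2*j + 2), slice(2*k, 2*k + 2))
                if mask[blk].any():
                    d_c[i, j, k] = d[blk][mask[blk]].mean()
                    m_c[i, j, k] = True
    return d_c, m_c

def prolong(u_c, fine_shape):
    """Coarse-to-fine transfer by (tri)linear interpolation."""
    factors = [f / c for f, c in zip(fine_shape, u_c.shape)]
    return zoom(u_c, factors, order=1)
```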
4 Numerical Experiments
4.1 Data Sets and Set-Up
Data sets. The first data set is a four-cell deep-water convection vector field. 10% irregularly distributed samples were used as input data – Fig. 2, left. The original ground truth fields depicted in Figures 3, right, and 4, right, of size 128³ were computed with the software package [17]. As a second data set we took the result of a CFD simulation of the turbulent flow behind a cylinder – Fig. 2, right – as described in [18]. In this case, 20% irregularly distributed samples were taken as input data.
Quantization errors. Particle displacements measured in PIV or PTV experiments are often discrete and equal to integer numbers of pixels [19], because such estimation procedures can be implemented efficiently. We simulate such real measurements by corrupting both data sets with this type of noise.
Comparison with other work. We compared the results of our method with vector fields computed by highly accurate interpolation [9,10] and resampling.
Fig. 2. Vertical cross-sections of the input velocity fields. A four-cell convection motion, sparsity: 10% (left); a turbulent flow around a cylinder, sparsity: 20% (right).
Fig. 3. Instantaneous snapshot of the vorticity of a four-cell vertical convection in three dimensions: reconstructed result using superresolution (left), ground truth (right)
Error measurements. In terms of g, d and u denoting ground truth flow, corrupted sparse input data and reconstructed vector field, respectively, we specify the quantitative error measurements

SDR := ‖d − g‖Ω / ‖u − g‖Ω ,   LDR := ‖u‖Ω / ‖g‖Ω .

Thus, large SDR and LDR ≈ 1 signal good restorations.
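Both error measures are straightforward to evaluate on gridded fields; a minimal sketch, assuming arrays of identical shape, is:

```python
import numpy as np

def sdr_ldr(g, d, u):
    """SDR = ||d - g||_Omega / ||u - g||_Omega (larger is better) and
    LDR = ||u||_Omega / ||g||_Omega (close to 1 is better), for arrays of
    identical shape, e.g. (3, Nx, Ny, Nz)."""
    sdr = np.linalg.norm((d - g).ravel()) / np.linalg.norm((u - g).ravel())
    ldr = np.linalg.norm(u.ravel()) / np.linalg.norm(g.ravel())
    return sdr, ldr
```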
Fig. 4. Vertical cross-sections of the velocity fields through the centers of convective cells. These vector fields correspond to the vorticities shown in Figure 3: reconstructed result (left), ground truth (right). Error measurements: SDR = 3.03 , LDR = 0.80.
Fig. 5. Instantaneous vorticity snapshots based on cubic spline interpolation (left), ground truth (center), reconstruction with our superresolution approach (right)
4.2 Results and Comparison
Figure 3 shows the vorticity fields corresponding to both ground truth and restoration of the four-cell vertical convection. The restored vorticity field was computed from the vector field resulting from our superresolution approach. Taking into account that only 10% of the data were available as input, the quality of the restoration is good, enabling inspection of the physical properties of the flow. The error measurements are SDR = 3.03, LDR = 0.80. Figure 4 depicts corresponding cross-sections of the reconstructed and ground truth vector field. In order to compare our approach to a state-of-the-art method of image processing, cubic spline interpolation and resampling [9,10], we reconstructed the second data set with both methods.
Fig. 6. Three vertical cross-sections of the velocity fields through the center of the flow around a cylinder: cubic spline interpolation (left), ground truth (center), reconstruction with our superresolution approach (right). Nonphysical structure is created on the left, which is not the case on the right. See also Fig. 7.
Fig. 7. Close-up views of the steady velocity vector fields in front of the cylinder (Fig. 6), rotated by 90 degrees for convenience. (Top:) Cubic spline reconstruction, (middle:) ground truth, (bottom:) superresolution approach reconstruction. Nonphysical “waves” dominate the restoration shown in the top panel and in Fig. 6, left panel.
The interpolation result looks quite similar to ground truth – Figs. 5 and 6, left panel – but closer inspection and comparison to the ground truth flow (Figs. 5 and 6, center) reveals that spline interpolation creates nonphysical flow structure: the resulting flow contains artificial “waves” that appear in each part of the vector field, even in regions where the flow is supposed to be almost constant (Fig. 7, upper section). Their propagation along the current does not depend on distance, which contradicts turbulence theory [13]. Some tiny parts of the flow are reconstructed incorrectly. As a result, the corresponding vorticity field looks poor (Fig. 5). The corresponding error measurements LDR = 0.65 and SDR = 1.91 reflect this. Our superresolution approach conforms to hydrodynamic principles and does not produce nonphysical structures (Fig. 7, bottom). The resulting vector field looks like a smoothed version of the ground truth vector field. However, the main
flow structures are clearly recovered. Small-scale parts of the flow are also restored with satisfactory quality, such that the vorticities of the reconstructed and ground truth flow are similar (Fig. 5). The corresponding error measurements are LDR = 0.94 , SDR = 8.0.
5 Conclusion and Further Work
We presented a 3D black-box variational approach to the restoration of velocity fluid flow data obtained from sparse measurements. Based on hydrodynamical principles, the method restores physically consistent flow structures together with a higher resolution. The performance of the method is fairly independent of the flow type. It copes with various types of corruptions and noise and degrades gracefully with decreasing signal-to-noise ratios. Our approach might be used in connection with tomographical methods for 3D-PIV [20] in order to speed up the estimation of fully time-resolved 3D fluid flow vector fields.
Acknowledgements. This work has been supported by the German Science Foundation, priority program 1147, grant SCHN 457/6-3. In particular, the authors thank Florian Becker for many valuable comments that helped to improve our paper.
References 1. Adrian, R.J.: Twenty years of particle velocimetry. Experiments in Fluids 39(2), 159–169 (2005) 2. Tropea, C., Yarin, A.L., Foss, J.F. (eds.): Springer Handbook of Experimental Fluid Mechanics. Springer, Heidelberg (2007) 3. Ohmi, K., Li, H.-Y.: Particle-tracking velocimetry with new algorithms. Meas. Sci. Technol. (11), 603–616 (2000) 4. Raffel, M., Willert, C.E., Wereley, S.T., Kompenhans, J.: Particle Image Velocimery – A Practical Guide. Springer, Heidelberg (2007) 5. Cenedese, A., Querzoli, G.: Lagrangian statistics and transilient matrix measurements by PTV in a convective boundary layer. Meas. Sci. Technol. 8, 1553–1561 (1997) 6. Stanislas, M., Westerweel, J.: Particle Image Velocimetry: Resent Improvements. Springer, Heidelberg (2003) 7. Willneff, J., Gruen, A.: A New Spatio-Temporal Matching Algorithm for 3DParticle Tracking Velocimetry. In: The 9th of International Symposium on Transport Phenomena and Dynamics of Rotating Machinery, pp. 10–14 (2002) 8. Vlasenko, A., Schn¨ orr, C.: Physically Consistent Variational Denoising of Image Fluid Flow Estimates. In: Rigoll, G. (ed.) DAGM 2008. LNCS, vol. 5096, pp. 406– 415. Springer, Heidelberg (2008) 9. Unser, M., Aldroubi, A., Eden, M.: B-Spline Signal Processing: Part I–Theory. IEEE Trans. Signal Proc. 41(2), 821–832 (1993)
10. Unser, M., Aldroubi, A., Eden, M.: B-Spline Signal Processing: Part II–Efficient Design and Applications. IEEE Trans. Signal Proc. 41(2), 834–848 (1993) 11. Landau, L.D., Lifschitz, E.M.: Fluid Mechanics (Course of Theoretical Physics). Butterworth-Heinemann, Butterworths (2000) 12. Girault, V., Raviart, P.A.: Finite Element Methods for Navier-Stokes Equations. Springer, Heidelberg (1986) 13. Kolmogorov, A.N.: C.R. Acad. Sci. USSR (30), 301 (1941) 14. Ruhnau, P., Schn¨ orr, C.: Optical stokes flow estimation: An imaging-based control approach. Exp. Fluids 42, 61–78 (2007) 15. Brezzi, F., Fortin, M.: Mixed and Hybrid Finite Element Methods. Springer, Heidelberg (1991) 16. William, L., van Henson, B.E., McCormick, F.S.: A Multigrid Tutorial, 2nd edn. SIAM, Philadelphia (2000) 17. Marshall, J., Adcroft, A., Hill, C., Perelman, L., Heisey, C.: A finite-volume, incompressible Navier-Stokes model for studies of the ocean on parallel computers. J. Geophys. Res. 102, 5733–5752 (1997) 18. Frederich, O., Wassen, E., Thiele, F.: Prediction of the flow around a short wallmounted cylinder using LES and DES. J. Numer. Analysis, Industrial and Appl. Mathematics 3(3-4), 231–247 (2008) 19. Westerweel, J., Dabiri, D., Gharib, M.: The effect of a discrete window offset on the accuracy of cross-correlation analysis of digital PIV recordings. Exp. Fluids 23, 20–28 (1997) 20. Elsinga, G., Scarano, F., Wieneke, B., van Oudheusden, B.: Tomographic particle image velocimetry. Exp. Fluids 41(6), 933–947 (2006)
Spatial Statistics for Tumor Cell Counting and Classification
Oliver Wirjadi1, Yoo-Jin Kim2, and Thomas Breuel3
1 Fraunhofer ITWM, 67663 Kaiserslautern
[email protected]
2 Institut für Pathologie, Universität des Saarlandes, 66421 Homburg
3 Fachbereich Informatik, Technische Universität Kaiserslautern, 67663 Kaiserslautern
Abstract. To count and classify cells in histological sections is a standard task in histology. One example is the grading of meningiomas, benign tumors of the meninges, which requires assessing the fraction of proliferating cells in an image. As this process is very time consuming when performed manually, automation is required. To address such problems, we propose a novel application of Markov point process methods in computer vision, leading to algorithms for computing the locations of circular objects in images. In contrast to previous algorithms using such spatial statistics methods in image analysis, the present one is fully trainable. This is achieved by combining point process methods with statistical classifiers. Using simulated data, the method proposed in this paper will be shown to be more accurate and more robust to noise than standard image processing methods. On the publicly available SIMCEP benchmark for cell image analysis algorithms, the cell counts obtained with the present method are significantly more accurate than results published elsewhere, especially when cells form dense clusters. Furthermore, the proposed system performs as well as a state-of-the-art algorithm for the computer-aided histological grading of meningiomas when combined with a simple k-nearest neighbor classifier for identifying proliferating cells.
1 Introduction
Computerized image analysis has emerged as a powerful tool for objective and reproducible quantification of histological features, which are required e.g. for computer-aided grading of histological sections [1]. Such medical problems have frequently been solved using fine-tuned image segmentation methods such as thresholding or the watershed algorithm, e.g. [2,3]. In contrast, trainable systems, once designed, have the potential to be trained to perform well on a variety of data. The method introduced in this paper is such a trainable system, designed for computing the locations of multiple approximately circular objects in images. We will show that this new trainable method outperforms other algorithms in the localization and counting of circular objects, and that it has promising applications in medical image analysis: It can be used to count cells and, when combined with classification methods, it can also be used in computer-aided histological grading of meningioma tumor cells.
Fig. 1. Realizations of Poisson and softcore interaction processes, which are used in this paper as models for objects in images. (a) Poisson process; (b) softcore process and its potential function h. The softcore potential has parameters σ, the size of each circular object, and κ, the amount of admissible object overlap. As κ → 0, the potential tends towards that of a hardcore point process (no overlap). Fig. 1(b) was generated by MCMC sampling [6] with κ = 0.1 and σ = 0.1.
For training, our method requires images with known object locations, but not a full segmentation. This multiple object localization method is a novel combination of spatial statistics and machine learning. Statistical modeling of the spatial arrangement of objects has actively been used for object tracking in video, but these have mostly been ad-hoc models, e.g. non-integrable potential functions [4]. Yet, the use of such highly customized models hinders the use of existing methods from the statistics literature, e.g., parameter estimation for spatial random processes that will be used in this paper. A rigorous and well-established method for modeling spatial random processes is the Markov random field (MRF) model, which has proven to be especially well suited for image denoising [5]. As MRFs, just as hidden Markov and conditional random field models, are defined on fixed graphs, their use for modeling freely moving objects is limited. Therefore, we base our new method on spatial Markov point processes, which model random point patterns. A special case is the pairwise interaction point process [6], in which the joint probability density function includes contributions from each point and from interactions between any two points in the process. Descombes and Zerubia argued that point processes have an advantage over MRFs because they can model relations between objects, rather than lattice sites [7]. Due to this property, various authors applied point processes to image analysis, e.g. [8,9,10]. All of these previous methods relied on customized likelihood terms to model image characteristics such as noise. In contrast, we demonstrate how to combine interaction point processes with statistical classifiers, resulting in the first fully trainable application of point process models in computer vision.
2 The Proposed Method
To construct a trainable model for cell locations in images, we describe cell locations by a random point process X with realizations X = {x1 , x2 , . . . , xn },
xi ∈ [0, 1]². These are the locations of n objects in an image f. The number of objects n, which is not known a-priori, is the realization of a Poisson random variable N. Given an image f, Baddeley and van Lieshout proposed to use the following posterior of X and n for object detection [9]:

p(X, n|f) ∝ p(f|X, n) p(X, n) .    (1)

Here, p(f|X, n) is a likelihood term that needs to be specified, and p(X, n) is a prior describing spatial relationships between cells, chosen from the family of point process densities, cf. Sec. 2.1. We reformulate that original model to

p(X, n|f) ≈ p(X|n) p(n) ∏_{i=1}^{n} p(f(xi)|xi) .    (2)
The crucial step here is the reduction to the pointwise likelihood p(f(xi)|xi), which deliberately ignores all image background (locations that are not contained in X). This has two main advantages. First, it leads to efficient algorithms as it is not necessary to integrate over the whole image f, and second, it enables the use of trainable classifiers to compute the likelihood term p(f(xi)|xi). To achieve the latter, the likelihood in (2) needs to be recast. Let y : [0, 1]² → [0, 1] be a function that assumes a value of 1 only at object locations; then

X = {x1, x2, . . . , xn} = {x | y(x) = 1} .    (3)

Since y contains all information required to infer X from an image f, it follows that y is a sufficient statistic for X and

p(X, n|f) ≈ p(X|n) p(n) ∏_{i=1}^{n} p(y(xi) = 1|xi) .    (4)
In this revised form, the parametric likelihood term p(f(x)|x) has been replaced by p(y(x) = 1|x). Various classifiers can approximate such terms nonparametrically. Note that using classifiers for this purpose avoids the need to explicitly specify y. Instead, classifiers “learn” y from training data. Cf. Sec. 2.2 for details on the specific classifier that will be used in the present paper. Given an image f, the maximum a-posteriori (MAP) estimator is

(X*, n*) = argmax_{X,n} p(X, n|f) .    (5)

Two possible algorithms for finding (X*, n*) will be described in Sec. 2.3.
2.1 Point Process Prior
The centers of objects (cells) were assumed to be realizations from a point process X in the plane. Next, we introduce one form of spatial point process which is suitable for these coordinates. Formally, a point process is a mapping from probability space to the space of all countable point configurations in S ⊂ [0, 1]2
[6]. X will be assumed to be locally finite, i.e., the number of points in any bounded subset of S is finite, and simple, i.e., points will occur only once. The simplest point process is the Poisson process with intensity parameter β, which is the random distribution of non-interacting points. Poisson processes are not suitable models for cells, as they lead to overlap, cf. Fig. 1(a). Pairwise interaction processes, on the other hand, can model point patterns with regularities [6]. Informally, these lift the independence assumption of Poisson processes, and allow points to interact. These interactions could be attractions between points, which lead to clustering, or inhibition between points, preventing them from getting too close. A pairwise interaction process takes into account interactions between at most two points, and has a conditional density of the form

p(X|n) = Z ∏_i β ∏_{i≠j} h(xi, xj) .    (6)

Here, Z is the normalizing constant, β remains the intensity parameter of a Poisson process, and h(·) is an interaction potential. Specification of different pairwise interaction models always reduces to choosing a form of h, see e.g. [6] for sufficient conditions on h such that Z will be finite. One such pairwise interaction model is the softcore point process with interaction potential

h(xi, xj; κ, σ) = exp( −(σ / ‖xi − xj‖)^(2/κ) ) .    (7)

This inhibitory point process is a more suitable model for locations X than the Poisson process (Fig. 1(b)). From the application point of view, this potential function has a convenient parameterization: κ can be used to model the degree of uncertainty in the object size or the amount of admissible overlap, and σ determines how close points may get, cf. Fig. 1(b). Given some training locations X^train, both parameters can be inferred using pseudo-likelihood estimation [11].
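For illustration, the (unnormalized) log-density of this model can be evaluated as follows; summing over unordered pairs i < j is an assumption on how the product over i ≠ j is meant, and Z is omitted since it cancels in the sampling and greedy procedures used later.

```python
import numpy as np

def softcore_log_density(X, beta, kappa, sigma):
    """Unnormalized log-density of the softcore pairwise interaction process,
    eqs. (6)-(7); the normalizing constant Z is omitted. X: (n, 2) array."""
    n = len(X)
    log_p = n * np.log(beta)
    for i in range(n):
        for j in range(i + 1, n):                       # each pair counted once
            dist = np.linalg.norm(X[i] - X[j])
            log_p += -(sigma / dist) ** (2.0 / kappa)   # log of eq. (7)
    return log_p
```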
2.2 Pointwise Likelihood
To model the pointwise likelihood term p(y(x) = 1|x) in (4), we use convolutional neural networks [12], which can directly be applied to images. As training data, square image patches are extracted at object locations X^train in a training image f^train as examples for y(x) = 1, and patches randomly sampled from the image background as training examples for y(x) = 0. The network consists of two convolutional layers, each followed by one subsampling layer, one fully connected layer and two output units. Indexing these two output units by 0 and 1, the value of output unit 1 at position x can directly be used as an estimate of p(y(x) = 1|x) when using the sum-of-squares error function and 0-1 coding.
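A sketch of such a patch classifier is given below; the exact layer sizes are not specified in the text and are chosen here only for illustration (LeNet-style dimensions for 28 × 28 patches), with sigmoid outputs matching the sum-of-squares error and 0-1 coding.

```python
import torch.nn as nn

class PatchNet(nn.Module):
    """Hypothetical patch classifier with the stated structure: two convolutional
    layers, each followed by a subsampling layer, one fully connected layer and
    two output units; output unit 1 approximates p(y(x) = 1 | x)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 4 * 4, 2), nn.Sigmoid(),
        )

    def forward(self, patch):              # patch: (batch, 1, 28, 28)
        return self.classifier(self.features(patch))
```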
2.3 Two Algorithms for Fitting the Model to Images
Computing object locations X in an image f under a trained model p(X, n|f) requires solving the optimization problem in (5). Two alternative algorithms:
RJMCMC. The first algorithm is a reversible jump Markov chain Monte Carlo (RJMCMC) sampler [13] with annealing to guide the sampling sequence into optimal solutions [14]. It starts from a set X0 of n0 uniformly distributed points in a given image f. In step t, a new point set is generated by moving a point in, deleting a point from, or adding a new point to Xt. Using the Metropolis-Hastings ratio to decide whether or not this modified set will be accepted as Xt+1 guarantees convergence of the chain to the desired distribution p(X, n|f) [13]. As all terms in this posterior density can be determined from training data as described above, it can be evaluated for each tuple of points Xt and point count nt in every step t. For pairwise interaction point process densities as in (6), not the full posterior, but only a conditional density, the Papangelou intensity [6], needs to be evaluated, which lowers the computational complexity.
Greedy search. The second algorithm is a greedy search for object locations that is reminiscent of Besag's iterated conditional modes (ICM) for image restoration [5]. Starting from an empty set, each pixel site in an image f is scanned in every iteration t. Among all pixels x, the pixel x* with the highest conditional density p(x|Xt, f) is added to the set, i.e., Xt+1 = Xt ∪ {x*}. This conditional density is again the Papangelou intensity [6]. By terminating the algorithm when p(x*|Xt, f) has decreased in five consecutive time steps, this algorithm automatically chooses a number of object locations according to the trained model.
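A simplified sketch of the greedy fitting procedure is shown below. It assumes a precomputed classifier output map and a callable for the log Papangelou intensity of the prior; the scan order, candidate set and termination bookkeeping are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def greedy_localization(likelihood_map, log_papangelou, max_points=500, patience=5):
    """Greedy fitting: repeatedly add the pixel with the highest conditional
    score (classifier likelihood plus log prior interaction term) and stop once
    the best score has decreased in `patience` consecutive iterations.
    likelihood_map[r, c] ~ p(y(x) = 1 | x); log_papangelou(x, X) is the log
    conditional intensity of the point process prior."""
    X, scores, decreases = [], [], 0
    candidates = [tuple(x) for x in np.argwhere(likelihood_map > 0)]
    for _ in range(max_points):
        best, best_score = None, -np.inf
        for x in candidates:
            if x in X:
                continue
            s = np.log(likelihood_map[x]) + log_papangelou(x, X)
            if s > best_score:
                best, best_score = x, s
        if best is None:
            break
        X.append(best)
        scores.append(best_score)
        if len(scores) > 1 and scores[-1] < scores[-2]:
            decreases += 1
            if decreases >= patience:
                break
        else:
            decreases = 0
    return X
```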
3 Experimental Evaluation
To compare the two alternative fitting algorithms introduced above against each other and also to compare to other methods, this section shows results on simulated data (Fig. 2(a)). Such an experiment requires a suitable error metric.
Fig. 2. Localization experiment and localization error types. (a) 320×320 pixel images (shown partially) containing n non-overlapping discs with radius 9 pixels, generated using random sequential adsorption (RSA); n is a Poisson random number (mean β = 100), and Gaussian noise with variance σ² is added (left: σ² = 200, right: σ² = 350). (b) Given a distance threshold ε ≥ 0, detections are either true positives, type I errors (false positives) or type II errors (false negatives); the true positive rate is relative to the number of objects, the false positive rate to the number of detections. Points on the receiver operating characteristic (ROC) curve are obtained by varying p(n) in (4).
Fig. 3. ROCs of the proposed method against three control methods (nonmaximum suppression of MLP outputs, watershed and isodata binarization) on the simulated data shown in Fig. 2(a) for ε = 7: (a) noise variance σ² = 200, (b) noise variance σ² = 250, (c) noise variance σ² = 325. The proposed method consistently outperforms all three alternative methods and is more robust to noise. Among the two fitting algorithms (greedy, RJMCMC), greedy search turns out to be more accurate in this case.
Among estimated locations X, correct and erroneous detections can be identified by setting a threshold ε ≥ 0 on the distance between an estimated and a known true location. The overall performance can be summarized in terms of the true and false positive rates, as illustrated in Fig. 2(b). The same concept is used when comparing the performance of different classifiers in terms of their receiver operating characteristic (ROC) [15, Ch. 9]. To use this concept here, an analogous parameter governing the trade-off between true and false positives is required. By modifying the prior p(n) in (4) from small to large numbers of expected objects, the size of the resulting set X will increase, and a behavior similar to that of thresholding the class-posterior probability of a classifier is expected. However, the functional dependence between p(n)'s mean and the resulting number of points |X| is not necessarily monotonic. One convolutional neural network was trained at each of three noise levels (σ² = 200, 250, 325) using error backpropagation on the raw gray values in 28 × 28 pixel patches. The parameters of the prior p(X|n) were determined from the known object locations using pseudo-likelihood estimation [16]. A Poisson prior p(n) with mean parameter 100 was used and the softcore parameter κ was fixed to 0.4. Next, RJMCMC, starting from 100 uniformly distributed points, and the greedy algorithm, starting from an empty set, were applied to 50 test images at each noise level, on which the ROC was evaluated. The annealing schedule for the RJMCMC sampler was limited to require roughly the same amount of CPU time as greedy search, around 30 seconds on a 2.4 GHz CPU. As control methods, nonmaximum suppression on MLP outputs and two segmentation methods (watershed and isodata, using ImageJ 1.41c) were used. The deterministic greedy method clearly outperforms the RJMCMC sampler at all noise levels (Fig. 3). This may be due to the limited CPU time allowed for the RJMCMC sampler, see above, and we do not necessarily expect this observation to generalize. From the three control methods, watershed-based separation of particles works best, with almost equivalent accuracy to our statistical
approach at low noise levels, cf. Fig. 3(a). The merits of using spatial statistics to select locations from the MLP’s output become evident when comparing the greedy algorithm’s results to the nonmaximum suppression method (“MLP / greedy” and “MLP / nonmax” in Fig. 3). The performance difference between these can be attributed directly to the novel combination of Markov point processes and classifiers introduced above.
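The true and false positive rates underlying these ROC curves can be computed by matching detections to ground-truth locations within the distance threshold ε; the greedy nearest-neighbour matching below is one plausible choice, not necessarily the rule used by the authors.

```python
import numpy as np
from scipy.spatial import cKDTree

def tpr_fpr(detections, ground_truth, eps):
    """Match detections to ground-truth locations within distance eps (greedy
    nearest-neighbour matching). Returns the true positive rate (relative to
    the number of objects) and the false positive rate (relative to the
    number of detections), cf. Fig. 2(b)."""
    if len(detections) == 0:
        return 0.0, 0.0
    tree = cKDTree(np.asarray(ground_truth))
    matched, tp = set(), 0
    for det in np.asarray(detections):
        dist, idx = tree.query(det)
        if dist <= eps and idx not in matched:
            matched.add(idx)
            tp += 1
    return tp / len(ground_truth), (len(detections) - tp) / len(detections)
```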
4 Evaluation on SIMCEP Benchmark
We applied our method to Set 1 (“Clustering with increasing probability”) of the SIMCEP benchmark [19]. It contains 20 simulated fluorescence microscope images [20] at each of 5 levels of cell clustering, cf. Fig. 4. Each of these images, 100 in total, contains 300 cells. We counted these cells using the method introduced above and two software tools used in cytology, CellC [17] and ImageJ [18]. The results of the CellC software shown in Fig. 4(b) were reported in [19]. Results shown here for ImageJ were created using Version 1.41c of that software by background subtraction, smoothing, watershed segmentation and finally by applying ImageJ's “Analyze Particles” function. For the proposed method, all images were converted to grayscale, and a convolutional neural network was trained on the first image (1RGB) at each clustering level. Point process parameters were obtained using the cell locations in that same image. The prior p(n) was assumed to be uniform for this experiment. Therefore, the count results of our method are not influenced by any prior knowledge. Count results reported are means and standard deviations on 19 test images at every cell clustering level (2RGB to 20RGB).
Fig. 4. Results on the SIMCEP benchmark [19]. The clustering probability is a parameter of the SIMCEP simulation tool, where higher probabilities lead to denser cell agglomerations [20]. (a) Locations in the SIMCEP benchmark data computed using the method proposed in the present paper: SIMCEP clustering probability 0.15 (10RGB, left) and 0.45 (10RGB, right). (b) Our method's cell count is more accurate than both CellC [17] and ImageJ [18]; true cell count is 300. CellC results in Fig. 4(b) copied from [19].
While the accuracy of cell counts obtained by all three tested methods decreases with increasing cell clustering, the method proposed in the present paper is significantly more robust to this SIMCEP parameter than both CellC and ImageJ (Fig. 4(b)).
5 Proliferation Fraction in Meningioma Cells
This section compares the performance of the proposed method in meningioma tumor grading to a state-of-the-art algorithm [21]. Meningiomas are common brain tumors, deriving from the leptomeninges of the brain. They are usually benign, but histological grading is essential for accurate risk stratification and to predict the risk of recurrence after surgery. According to the World Health Organization (WHO), quantification of the proliferative fraction, e.g. by the Ki-67 labeling index (LI), is crucial in the prediction of recurrence risk in meningiomas [1]. Six Ki-67 labeled 640 × 480 pixel images were available, containing 370 cells on average (Fig. 5(a)). For all images, an experienced pathologist provided cell locations and the corresponding cell types. The images contain meningioma cells that have been labeled by Ki-67 (brown appearance), unlabeled meningioma cells, and different types of non-tumor cells, cf. Fig. 5(c). To compute the LI of these images, a three-step procedure is applied. First, the localization method from above is used to locate the cells, where the model was trained on all but one image at a time, disregarding color. The trained MLP is applied to the remaining test image and the greedy algorithm detects the object locations, cf. Fig. 5(b) for localization performance. Second, a high-dimensional feature descriptor (also including color) as described in [22] is extracted at each detected location. Third, each detected point is assigned to one of seven cell types using k-nearest neighbor (k-NN) classification. For k-NN classification in a given image, the feature vectors at known cell locations in all remaining 5 images are used for reference.
NOS
1
0
4
0
73
0
normal
0
0
5
0
39
0
1
11
0
1
0
1373
10
28
mean deconvolution
0
0
1
0
9
0
1
0
0
69
0
39
1
5
erys
0
0
0
0
3
0
0
endothelial
0
0
1
0
19
1
0
erys
lab. tumor
lime
nonlab. tumor
normal
NOS
k−NN result
0.6 0.4
lime
lab. tumor
0.0
meningioma Ki67 01 − 06 mean
0.0
0.2
0.4
0.6
False positives
0.8
1.0
endothelial
0.2
True positives
nonlab. tumor
True cell type
(a) Detected locations in a (b) Localization perfor- (c) Classification perforpart of one image (grayscale mance compared to the mance of a k-NN classifier for visualization only). deconvolution-method [21]. (k=5) on 9D-features. Fig. 5. Results of applying the proposed method to the task of quantification of meningioma tumor cell proliferation in six histological resections treated with Ki-67 antibody. The localization performance of the proposed, trainable method is comparable to the highly specialized, state-of-the-art deconvolution-method [21] and achieves an overall cell type classification accuracy of 85.1%.
As object classification is not the core topic of this paper, k-NN was chosen here mainly for its simplicity and generally good classification performance [15]. In practice, k-NN does perform reliably here. Especially the two important tumor cell classes (labeled and non-labeled) are classified with high accuracy (Fig. 5(c)). The LI results from relating the number of tumor cells classified as labeled to the number classified as non-labeled in a given image. Compared to the color-deconvolution based method proposed in [21], the mean LI of the approach proposed in the present paper is less accurate, as shown in the following table.

#                    1      2     3     4     5     6    Mean error
Ground truth       12.65   6.09  2.81  3.54  6.43  4.56
Deconvolution [21] 19.29   6.46  3.23  2.97  7.79  5.54  -1.53 ± 1.06
This paper          4.20   4.97  2.64  0.95  4.83  4.64   2.31 ± 1.29

However, as the mean error in this six-fold cross validation experiment is dominated by the error made in a single image (#1), we conclude that the performance of our trainable method is similar to that of the specialized method in [21].
6 Discussion
In this contribution, a novel combination of machine learning and point process densities was proposed and applied to the task of locating multiple circular objects in images. In contrast to earlier image analysis systems based on point process theory, the use of statistical classifiers such as MLPs eliminates the need to derive new likelihood functions for every application, as was demonstrated by applying the proposed model to simulated and real-world cell image data. This is an improvement over earlier applications of point processes in vision. The resulting novel statistical multiple object localization method was shown to be more robust to noise than three control methods on simulated data, more precise for cell counting than two software tools used in cytology on a public benchmark dataset, and as accurate as a specialized state-of-the-art system for segmentation of meningioma tumor cells. This approach is a promising tool for quantitative cytology, e.g. in the determination of Ki-67 labeled proliferating cell fraction and other applications in quantitative immunohistochemistry.
References 1. Louis, D., Ohgaki, H., Wiestler, O., Cavenee, W. (eds.): WHO classification of tumors of the central nervous system, 4th edn. IARC Press, Lyon (2007) 2. Bengtsson, E., W¨ ahlby, C., Lindblad, J.: Robust cell image segmentation methods. Pattern Recognition and Image Analysis 14(2), 157–167 (2004) 3. Malpica, N., de Solorzano, C., Vaquero, J., Santos, A., Vallcorba, I., GarciaSagredo, J., del Pozo, F.: Applying watershed algorithms to the segmentation of clustered nuclei. Cytometry 28(4), 289–297 (1997) 4. Yu, T., Wu, Y.: Collaborative tracking of multiple targets. In: Proc. Int. Conf. Computer Vision and Pattern Recognition, June 2004, vol. 1, pp. 834–841 (2004)
5. Besag, J.: On the statistical analysis of dirty pictures. J. Royal Statistical Society B 48(3), 259–302 (1986) 6. Møller, J., Waagepetersen, R.P.: Statistical Inference and Simulation for Spatial Point Processes. In: Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton (2004) 7. Descombes, X., Zerubia, J.: Marked point process in image analysis. IEEE Signal Processing Magazine 19(5), 77–84 (2002) 8. Al-Awadhi, F., Jennison, C., Hurn, M.: Statistical image analysis for a confocal microscopy two-dimensional section of cartilage growth. J. Royal Statistical Society: Series C (Applied Statistics) 53(1), 31–49 (2004) 9. Baddeley, A., van Lieshout, M.N.M.: Object recognition using Markov spatial processes. In: Proc. Int. Conf. Pattern Recognition, August 1992, vol. 2, pp. 136–139 (1992) 10. Ortner, M., Descombes, X., Zerubia, J.: A marked point process of rectangles and segments for automatic analysis of digital elevation models. IEEE Trans. Pattern Analysis and Machine Intelligence 30(1), 105–119 (2008) 11. Diggle, P., Fiksel, T., Grabarnik, G., Ogata, Y., Stoyan, D., Tanemura, M.: On parameter estimation for pairwise interaction processes. Int. Statistical Review 62(1), 99–117 (1994) 12. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998) 13. Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732 (1995) 14. van Lieshout, M.N.M.: Stochastic annealing for nearest-neighbor point processes with application to object recognition. Advances in Applied Probability 26(2), 281–300 (1994) 15. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2001) 16. Baddeley, A., Turner, R.: Spatstat: An R package for analyzing spatial point patterns. J. Statistical Software 12(6), 1–42 (2005) 17. Selinummi, J., Sepp¨ al¨ a, J., Yli-Harja, O., Puhakka, J.: Software for quantification of labeled bacteria from digital microscope images by automated image analysis. Biotechniques 39(6), 859–863 (2005) 18. Rasband, W.: ImageJ. U. S. National Institutes of Health, Bethesda, Maryland, USA (1997-2007) 19. Ruusuvuori, P., Lehmussola, A., Selinummi, J., Rajala, T., Huttunen, H., YliHarja, O.: Benchmark set of synthetic images for validating cell image analysis algorithms. In: Proc. 16th European Signal Processing conference (2008), http://www.cs.tut.fi/sgn/csb/simcep/benchmark/ 20. Lehmussola, A., Ruusuvuori, P., Selinummi, J., Huttunen, H., Yli-Harja, O.: Computational framework for simulating fluorescence microscope images with cell populations. IEEE Trans. Medical Imaging 26(7), 1010–1016 (2007) 21. Kim, Y., Romeike, B., Uszkoreit, J., Feiden, W.: Automated nuclear segmentation in the determination of the Ki-67 labeling index in meningiomas. Clinical Neuropathology 25(2), 67–73 (2006) 22. Wirjadi, O., Breuel, T., Feiden, W., Kim, Y.: Automated feature selection for the classification of meningioma cell nuclei. In: Bildverarbeitung f¨ ur die Medizin. Informatik aktuell, pp. 76–80. Springer, Heidelberg (2006)
Quantitative Assessment of Image Segmentation Quality by Random Walk Relaxation Times
Björn Andres1, Ullrich Köthe1, Andreea Bonea1, Boaz Nadler2, and Fred A. Hamprecht1
1 University of Heidelberg, Germany
2 Weizmann Institute of Science, Israel
Abstract. The purpose of image segmentation is to partition the pixel grid of an image into connected components termed segments such that (i) each segment is homogenous and (ii) for any pair of adjacent segments, their union is not homogenous. (If it were homogenous the segments should be merged). We propose a rigorous definition of segment homogeneity which is scale-free and adaptive to the geometry of segments. We motivate this definition using random walk theory and show how segment homogeneity facilitates the quantification of violations of the conditions (i) and (ii) which are referred to as under-segmentation and over-segmentation, respectively. We describe the theoretical foundations of our approach and present a proof of concept on a few natural images.
1 Introduction and Related Work
Image segmentation is an important step in many applications, sometimes even the ultimate goal of the analysis. It remains a challenging problem which requires the search for an optimal partition of the pixel grid of an image. Even under simple model assumptions, this problem is NP-hard [13]. The task of image segmentation has thus been addressed by various constructive algorithms, e.g., watershed segmentation [11] as well as by spectral methods such as normalized cuts [13]. All segmentation algorithms have design parameters which need to be tuned for each specific application. Hence, a quantitative validation of segmentations is all the more important. Given a segmentation, there are two possible types of errors: (i) under-segmentation – a segment contains parts which belong to different regions and should be split; (ii) over-segmentation, two adjacent segments in fact belong to the same region and should be merged. Most image segmentations suffer from at least one if not both types of errors. It is thus desirable to develop a quantitative tool to assess the quality of each individual segment, and to provide a diagnostic measure of whether a given segment should be split, or whether two adjacent segments should be joined. In an interactive segmentation framework, this measure has the potential to focus user attention on segmentation errors whereas in an automated procedure,
The research of Boaz Nadler was supported by the Israel Science Foundation (grant 432/06).
incorrect segments can be subjected to further processing. In this paper, we develop such a measure. A key feature of this measure is scale-invariance. It can therefore be applied to both small and large segments. As a proof of concept, we present results on segmentations of Berkeley database images [8]. Both watershed segmentation [11] and a graph-based segmentation algorithm [2] are used to illustrate that the method works with any segmentation. The only prerequisite is a boundary indicator function b giving the probability b(r1, r2) that adjacent pixels r1, r2 are separated by a boundary. A simple way to obtain such a function is to interpolate the output of an edge detector to inter-pixel positions and then normalize the result. Given an image annotation, i.e. a function a with a(r, c) the probability that the pixel r is associated with class c (e.g. sky, ground, car, building), b(r1, r2) can be taken as a distance between the distributions a(r1, ·) and a(r2, ·). The measure of segment homogeneity we propose is based on the spectral analysis of random walks on weighted graphs. Random walks on graphs are the foundation of diffusion distances and diffusion maps [14,1] which have been applied for clustering, dimensionality reduction and signal denoising. The close relation between random walks and normalized cuts [13] is noted in [9,10]. Random walks have recently been used for interactive image segmentation [4,5,6]. Similar to [4,5,6], we consider weighted graphs in which vertices represent image pixels and edges connect neighboring pixels. The novelty we propose is to relate the relaxation time of a random walk on a weighted graph to the relaxation time on the topologically identical graph with all weights set to an equal value. This approach facilitates scale-invariance and adaptivity to the geometry of segments.
2 Segment Homogeneity
In this section, the spectral analysis of random walks on graphs is outlined first, followed by the description of the particular graphs we consider for segmentation analysis. A rigorous definition of segment homogeneity is given at the end. Consider an undirected weighted graph G = (V, E, s) with the edge weight function s : E → (0, 1]. Interpret this function as a measure of similarity of the two vertices which are incident to the edge. Let V be finite and identify each vertex with a positive integer, V = {1, . . . , n}. Let S ∈ IR^(n×n) be the similarity matrix of the graph G with

Sjk = s({j, k}) if {j, k} ∈ E, and Sjk = δjk otherwise.    (1)

In addition, define the degree of any vertex j ∈ V as the sum of weights of all edges which are incident to j, i.e. deg : V → IR such that

∀j ∈ V :  deg(j) = Σ_{k∈V} Sjk .    (2)
Let D ∈ IR^(n×n) be the diagonal matrix with ∀j ∈ V : Djj = deg(j) and let

M = D^(−1) S .    (3)

M is obtained from S by an L1-normalization of each row and is thus a row-stochastic matrix. M is adjoint to the symmetric matrix Ms = D^(−1/2) S D^(−1/2). Thus, M and Ms share the same eigenvalues. Moreover, since Ms is symmetric, the matrix has n real eigenvalues λ1, . . . , λn whose corresponding eigenvectors v1, . . . , vn form an orthonormal basis of IR^n. The left and right eigenvectors of M, denoted φj and ψj, are related to those of Ms according to

φj = vj D^(1/2) ,   ψj = vj D^(−1/2) .    (4)

As the eigenvectors vj are orthonormal under the standard inner product in IR^n, it follows that φj and ψj are bi-orthonormal, i.e. ⟨φj, ψk⟩ = δjk. One normalized eigenvector of M is (1, . . . , 1)^T/√n and the corresponding eigenvalue is 1. In addition, no eigenvalue of a stochastic matrix can be larger than one in magnitude (this is a consequence of the Gelfand spectral radius theorem, cf. [12]). Thus, the eigenvalues λ1, . . . , λn can be ordered such that

1 = |λ1| ≥ . . . ≥ |λn| ≥ 0 .    (5)
A random walk on the weighted graph G is a discrete stochastic process whose states are the vertices of G. Let P(Xt = j) denote the probability that the process attains the state j ∈ V at the discrete point t ∈ IN0 in time. Assume that the process is Markovian, i.e. the probability of moving from a vertex j ∈ V to a vertex k ∈ V does not depend on the history of the process,

∀t ∈ IN0 :  P(Xt+1 | Xt, . . . , X0) = P(Xt+1 | Xt) .    (6)

In addition, let the transition probabilities be given by the matrix M,

∀t ∈ IN0, ∀j, k ∈ V :  P(Xt+1 = k | Xt = j) = Mjk .    (7)
j=1
(8)
(9)
j=2
If, for any sufficiently large number of steps, any vertex in the graph is reachable from any other vertex via a sequence of edges, the process is called irreducible. If, for any large enough number of steps, the probability of returning to the start vertex is non-zero, the process is called aperiodic. If a process is both aperiodic
Quantitative Assessment of Image Segmentation Quality
505
and irreducible, there exists a t0 ∈ IN such that for all t ∈ IN>t0 , M t is a positive matrix. It then follows from Perron’s theorem [7] that |λ2 | < 1 and thus M t → φ1 ψ1T as t → ∞. The rate of convergence depends on |λ2 |. We refer to τ :=
1 1 − |λ2 |
(10)
as the characteristic relaxation time of the random walk. Since M is a square, sparse matrix, λ2 can be efficiently computed using the Lanczos algorithm [3]. Consider the pixel grid Γ and the 8-neighborhood system of pixels which we denote by ∼. For any subset of pixels U ⊆ Γ , let pairs(U ) be the set of all (unordered) pairs of pixels of U which are 8-neighbors, pairs(U ) := {{u, u} ⊆ U |u ∼ u }. We consider segmentation algorithms which output a number m ∈ IN of segments and a function seg : Γ → {1, . . . , m} which maps any pixel to the index of a segment. Our goal is to provide a quantitative measure of segment homogeneity. To this end, we consider two different random walks on each candidate segment. First, we study the geometric properties of the segment, and second the possible presence of significant inner boundaries in it. For any segment j ∈ {1, . . . , m}, we construct the weighted graph Gj = (Vj , Ej , sj ) where Vj = seg−1 (j) consists of all pixels of the segment, Ej = pairs(Vj ) is the set of all pairs of pixels in the segment which are 8-neighbors, and sj is a similarity measure. An example is depicted in Fig. 1. First, we chose sj such that all edge weights are equal, which yields a graph that depends exclusively on the size and shape of the segment and is independent of the boundary indicator function. We term this graph the geometric graph of the segment and denote the relaxation time of the associated random walk by τg (geometric relaxation time). Second, we define sj depending on the boundary indicator. Since we consider similarity graphs (as opposed to the dissimilarity measured by the boundary indicator), we invert the indicator by the function f (x) = 1 − (1 − s0 )x. The design parameter s0 ∈ [0, 1] sets the minimal permitted similarity of neighboring pixels. A positive value s0 > 0 ensures irreducibility of the random walk associated with the weighted graph. Such a simple linear transform is sufficient if the boundary indicator function is appropriately scaled. Otherwise, the exponential function f (x) = exp(−αx) can be used with α adapted to the scale of b. We denote the relaxation time of the random walk associated with the weighted graph by τw and define segment homogeneity as the ratio τw H := . (11) τg Under-segmentation is quantified by computing H for each segment. If a segment is split by a boundary, the random walk on the weighted graph stays trapped inside one sub-region, or “well”, of the segment for a long time and only occasionally escapes across a boundary into another sub-region. Thus, it takes longer for the random walk to converge to the stationary distribution on the weighted graph than it does on the graph with equal weights. Hence, H 1 is indicative of under-segmentation. Conversely, H = 1 holds for homogenous segments, regardless of their size and shape. H thus constitutes a scale-free measure of segment homogeneity.
Fig. 1. Similarity graph of a simple segment consisting of seven pixels. Each pixel corresponds to one vertex (enumerated circles). Edges (gray lines) connect any two pixels which are 8-neighbors. A similarity Sjk is associated with every edge.
In over-segmentation analysis, all pairs (V1 , V2 ) of adjacent segments are considered and for each of these pairs, H is computed on the merged segment V1 ∪V2 . A homogenous pair is indicative of over-segmentation. V1 and V2 should thus be merged if H ≈ 1.
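Building on the previous sketch, a hypothetical merge test for a pair of adjacent segments could look as follows; the threshold of 1.1 is purely illustrative and not a value taken from the paper.

```python
# Usage sketch for the over-segmentation test: H is evaluated on the union of
# two adjacent segments using segment_homogeneity() from the listing above.
def should_merge(mask_a, mask_b, boundary, s0=0.1, threshold=1.1):
    merged = mask_a | mask_b                                     # merged segment V1 ∪ V2
    return segment_homogeneity(merged, boundary, s0) <= threshold  # H ≈ 1: homogenous pair
```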
3 Experiments
To illustrate the principle, the two segments depicted in the rows of Fig. 2 exemplify the under-segmentation analysis. In the first column, the segment itself is depicted in black, indicating that the boundary indicator is zero throughout the segment. Consequently, τg = τw and hence, H = 1. In the second and third column, the segment is split by a boundary depicted in gray. In these cases, H > 1 quantifies the degree of under-segmentation. In the fourth column, there is a pronounced boundary which does however not split the segment. H is very close to one in this case because τw differs only slightly from τg . This property of H is essential for segmentation analysis where split segments must not be confused with segments that include non-splitting boundaries. In order to quantify the under-segmentation of natural images of the Berkeley database [8], the boundary indicator function was taken to be the normalized squared gradient magnitude of the three channels of the L*a*b* color space. Figures 4 and 5 show segmentations obtained from this boundary indicator by means of a seeded region growing variant of the watershed algorithm [11]. Seeds were set at those points of a regular grid (with a grid point distance of 15 pixels) where the normalized boundary indicator did not exceed 0.4. Undersegmentation thus occurs especially if no seed is placed inside an object in the image. In Fig. 4, a coarse, graph-based segmentation (cf. [2]) is shown in addition, stressing that the measure of segment homogeneity is scale-free and can be used with any segmentation algorithm. Along with the segmentations, H is depicted for each segment in shades of gray. Under-segmentation occurs in those segments for which H is much larger than 1. The distribution of relaxation times for the watershed segmentation shown in Fig. 4 is depicted in Fig. 3a. Homogenous segments are concentrated along the diagonal which means that the
relaxation times are similar (H ≈ 1). Segments which are to be split lie significantly above the diagonal (H ≫ 1). It can happen that τw < τg if the weighted graph distributes the random walk faster than equal weights do. Clearly, this is not the case if a segment is split, so H < 1 can be taken as a strong indication of not having under-segmentation. Over-segmentation as depicted in Fig. 6 is a well-known property of the watershed algorithm where regions are grown from all local minima of the boundary indicator function. Along with the segmentation, a boundary map is depicted. The shade of a boundary separating two segments V1 and V2 is proportional to the segment homogeneity H of the merged pair V1 ∪ V2. White (removed) boundaries indicate over-segmentation. The corresponding distribution of relaxation times is depicted in Fig. 3b.
Fig. 2. Illustration of two segments (rows) and four different boundary indicator functions on these segments (columns). The gray edges inside of the segments indicate a boundary probability of 0.6. Segments which are split by boundaries exhibit a segment homogeneity H > 1 whereas H ≈ 1 for segments which are not split. (Measured values, top row: H = 1, 1.15, 1.57, 1.03; bottom row: H = 1, 1.13, 1.71, 1.09.)
Fig. 3. a) Relaxation times τg and τw of all segments of Fig. 4. Points close to the diagonal correspond to homogenous segments. Points significantly above the diagonal indicate under-segmentation. b) Relaxation times of all merged pairs of adjacent segments of Fig. 6. Points on the diagonal indicate over-segmentation.
Fig. 4. Quantification of under-segmentation. One fine watershed-segmentation and one coarser graph-based segmentation [2] are shown. For each segment, its homogeneity H is depicted in shades of gray. Under-segmentation occurs in those darker segments for which H is significantly larger than 1.
Fig. 5. Quantification of under-segmentation (continued)
Fig. 6. Quantification of over-segmentation. Each boundary in the lower plot corresponds to a pair (V1 , V2 ) of segments which are adjacent via that boundary. The shade of a boundary is proportional to the segment homogeneity H of the merged pair V1 ∪V2 . Black corresponds to high values, white corresponds to H ≤ 1. White (removed) boundaries indicate over-segmentation.
4 Conclusion
In this paper, we developed a quantitative measure of segment homogeneity, based on random walks. Random walks on two different segment graphs were considered: a weighted graph with weights obtained from a boundary indicator function and a graph where all weights are set to an equal value. Segment homogeneity was defined as the ratio of the relaxation times of these two random walks. This definition is scale-free and adaptive to the geometry of segments. It facilitates the quantitative assessment of segmentation quality, in particular the quantification of under- and over-segmentation. We are currently applying the concept to 3D segmentations and exploring its potential as a criterion in an unsupervised split-and-merge segmentation algorithm.
References 1. Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker, S.W.: Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS 102(21), 7426–7431 (2005) 2. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vision 59(2), 167–181 (2004) 3. Golub, G.H., van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996) 4. Grady, L.: Random walks for image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 28, 1768–1783 (2006) 5. Grady, L., Sinop, A.K.: A seeded image segmentation framework unifying graph cuts and random walker which yields a new algorithm. In: ICCV. IEEE, Los Alamitos (2007) 6. Grady, L., Sinop, A.K.: Fast approximate random walker segmentation using eigenvector precomputation. In: CVPR, IEEE, Los Alamitos (2008) 7. MacCluer, C.R.: The many proofs and applications of Perron’s theorem. SIAM Rev. 42(3), 487–498 (2000) 8. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV, vol. 2, pp. 416–423 (2001) 9. Meila, M., Shi, J.: Learning segmentation by random walks. In: NIPS, pp. 873–879. MIT Press, Cambridge (2000) 10. Nadler, B., Lafon, S., Kevrekidis, I., Coifman, R.: Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In: NIPS, pp. 955–962. MIT Press, Cambridge (2005) 11. Roerdink, J., Meijster, A.: The watershed transform: definitions, algorithms and parallelization strategies. Fundamenta Informaticae 41(1-2), 187–228 (2000) 12. Rudin, W.: Functional Analysis. McGraw-Hill, New York (1991) 13. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000) 14. Singer, A., Shkolnisky, Y., Nadler, B.: Diffusion interpretation of non-local neighborhood filters for signal denoising. SIAM J. on Imaging Sciences 2, 118–139 (2009)
Applying Recursive EM to Scene Segmentation Alexander Bachmann Department for Measurement and Control University of Karlsruhe (TH), 76 131 Karlsruhe, Germany
[email protected]
Abstract. In this paper a novel approach for the interdependent task of multiple object tracking and scene segmentation is presented. The method partitions a stereo image sequence of a dynamic 3-dimensional (3D) scene into its most prominent moving groups with similar 3D motion. The unknown set of motion parameters is recursively estimated using an iterated extended Kalman filter (IEKF) which will be derived from the expectation-maximization (EM) algorithm. The EM formulation is used to incorporate a probabilistic data association measure into the tracking process. In a subsequent segregation step, each image point is assigned to the object hypothesis with maximum a posteriori (MAP) probability. Within the association process, which is implemented as labeling problem, a Markov Random Field (MRF) is used to express our expectations on spatial continuity of objects.
1 Introduction
This contribution addresses the reliable detection of close-by traffic participants in the vicinity of an intelligent vehicle with an application to autonomous navigation in the road traffic domain. Given an image sequence of a calibrated stereo rig, our approach is able to simultaneously estimate the state of multiple moving objects in the field of view and determine location and extent of each entity in 3D space. In this context the data association, i.e. the unknown relationship between object hypotheses and noisy observations, is crucial. The task is especially difficult for cluttered environments with a variable number of objects of unknown spatial extent and has been a field of active research for many years in various communities (see e.g. [1,2,3]). Generally, the different approaches can be classified as either hard or soft data assignment methods. Hard denotes the assignment of an observation to one (and only one) hypothesis, whereas soft means the assignment of an observation to an object hypothesis proportional to some weight. We argue that hard and soft data assignment are not mutually exclusive but complementary, and a combination of both principles allows a better interpretation of the data. By integrating hard and soft data assignment into one probabilistic framework, we obtain an approach that performs data association not solely on the basis of an observed quantity but also on a more abstract, object-specific level. We pose the problem of multiple object tracking and scene segmentation as an incomplete data problem with the observations being the incomplete data, whereas the object-associated observations constitute the complete data. Object states are estimated using an EM-based approach that consists of a multiple object tracking filter with probabilistic data association. A priori object information is incorporated in the association process through an MRF model which accounts for spatial constraints on the label variable. In the M-step, the state and error covariance of each object is robustly estimated using an IEKF.
From the recursively revisited object parameters we derive a global cost function that incorporates the image-based motion cue while considering the fundamental property of spatial consistency and segregate the image accordingly. The remainder of the paper is organized as follows: Section 2 introduces the notation and formulates the problem in a probabilistic manner. In Section 3 the segmentation filter framework is presented and its performance is illustrated in Section 4 on real and synthetic image data.
2 Problem Formulation
Within our motion segmentation scheme, we express the entire scene dynamics by a set of rigidly moving objects, each measured relative to a camera-fixed coordinate system. For each object, the instantaneous rigid body motion of this coordinate system is specified by the translational and rotational motion of its origin, t = (tx, ty, tz)^T and Ω = (ωx, ωy, ωz)^T respectively. Formally this can be expressed by the 3d motion field ω(X) ∈ ℝ³ for each scene point X = (X, Y, Z) in the scene. We indirectly have access to this measure through the projection u(x) ∈ ℝ² of ω(X) onto the image plane. The difficult task in motion segmentation is to determine an adequate set of motions that describes the image flow field u(x) best and segregate the image accordingly (x states the collection of all image points). The key problem here is to robustly estimate 3d motion and structure of the environment from multiple 2d views G_t. To reduce the complexity of the problem, we consider the scene structure Z_t to be given within the scope of this work.
Object state. In our framework, the 3d motion for each object entity is expressed by the continuous-valued variable θ_t. We use a factorial representation of our state vector with θ_t = {θ_t^1, . . . , θ_t^j, . . . , θ_t^J}, representing the quantitative state information for each object θ_t^j = (ωx, ωy, ωz, tx, ty, tz)^T independently. We assume the number of objects J within the current, discrete time instant t to be fixed. θ_t is estimated for each object j independently based on a distinct set of feature points that are tracked over time. More details will be given in Section 3.2.
Observations. Information from the environment is acquired through a sequence of observations Y_{0:t} = (Y_0, . . . , Y_t), with Y_t = {y_{t,1}, . . . , y_{t,i}, . . . , y_{t,N}} being the set of observations at time t. N states the number of image points. Within the segmentation process, observations are directly related to the image gray values g_{t,i} ∈ G_t. In the feature-based motion estimation process observations are extracted from a set of tracked feature points (see Section 3.2 for details). Observations are connected to the state through the observation model y_{t,i} = h(θ_t^j) + r_t, with observation noise r_t and observation equation h(θ_t^j). An adequate description, which formally connects the observed 2d displacement u_{t,i} with the 3d object motion, is given by the Longuet-Higgins equations [4]

u_{t,i} = C_Ω Ω_t + Z_{t,i}^{-1} C_t t_t,  with
C_Ω = [[ x_i y_i / f,  −(f² + x_i²)/f,  y_i ], [ (f² + y_i²)/f,  −x_i y_i / f,  −x_i ]],   C_t = [[ −f, 0, x_i ], [ 0, −f, y_i ]],    (1)

where the right-hand side can be written compactly as C θ_t^j. f denotes the focal length of the single cameras in the stereo setup.
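A minimal sketch of Eq. (1), assuming the standard Longuet-Higgins/Prazdny sign conventions; the function name and argument layout are our own and not part of the paper.

```python
# Sketch: predicted image displacement of a pixel (x, y) given depth Z,
# focal length f and the object state theta = (omega_x, omega_y, omega_z, t_x, t_y, t_z).
import numpy as np

def predicted_flow(x, y, Z, f, theta):
    omega = np.asarray(theta[:3], dtype=float)
    t = np.asarray(theta[3:], dtype=float)
    C_omega = np.array([[x * y / f, -(f**2 + x**2) / f,  y],
                        [(f**2 + y**2) / f, -x * y / f, -x]])
    C_t = np.array([[-f, 0.0, x],
                    [0.0, -f, y]])
    return C_omega @ omega + (1.0 / Z) * (C_t @ t)   # u_{t,i} of Eq. (1)
```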
Association process. To capture the unknown relationship between an observation and the object that caused it, an association process l_t is introduced, with its components defined as

l_{t,i}^j = { 1 if y_{t,i} originated from object j;  0 else }.    (2)

We define the association process as a binary label field l_t = (l_{t,1}, . . . , l_{t,N}) with a label vector l_{t,i} = e_j for each observation i. e_j states a unity vector of length J. This formalizes our assumption that an observation y_{t,i} originates from exactly one object j. For notational convenience we restrict our description to only one object instance within θ_t in the sequel and skip the hypothesis index j. If necessary, we will resort to it in the text. To account for observations with a low confidence, i.e. all hypotheses seem to be equally unlikely for that observation, we further introduce an ambiguity label (AMB) which will be expressed by j = 0. In our estimation scheme the association process is interpreted as missing data regarding the observations. We define the complete data through time as 𝒳_{0:t} = {l_{0:t}, Z_{0:t}, Y_{0:t}}.
Optimization problem. The aim is now to find the correct relationships between the set of observations and object hypotheses and the optimal state parameters. This is equal to the best estimate of the joint probability of our set of continuous state parameters and discrete label variables, {θ̂_{0:t}, l̂_{0:t}} = arg max[P(θ_{0:t}, 𝒳_{0:t})]. There exists a number of approaches that handle this problem as a sequential data estimation problem and solve it in batch style (see e.g. [5]). Though appealing from a theoretical point of view, such implementations are in general computationally very complex. In order to increase computational efficiency and reduce memory requirements we process the data in a time-recursive manner, estimating the joint filtering distribution at time instant t and assuming that the state evolves according to a Markov chain [3]. We propose an iterative scheme that optimizes one variable at a time while keeping the others fixed at the values of an earlier iteration step. The proposed method starts with maximizing the posterior density θ̂_t = arg max[P(θ_t | 𝒳_{0:t})], given the complete data 𝒳_{0:t}. In Section 3 we will show how this missing data problem can be expressed in terms of the EM framework [6]. We perform soft data assignment between observations and object hypotheses (Section 3.1) and use these probabilities to update the state estimate θ̂_t of each object hypothesis (Section 3.2). The presented approach is time-recursive in that the state estimates from the previous time instant are used as prior. Within one EM iteration, the image is segregated into a set of disjoint, non-overlapping regions, resulting in a hard scene segmentation l̂_t which will be explained in Section 3.3.
3 Segmentation Filter
Concerning the estimation of θ_t we use the EM approach, which consists of iteratively computing the expected complete data in the E-step and afterwards estimating the state based on the complete data in the M-step. In contrast to the standard EM algorithm, in our approach a penalized maximum likelihood (ML) estimate is obtained (see e.g. [7]), leading to a MAP estimate of θ_t according to the Bayesian recursive update rule P(θ_t | 𝒳_{0:t}) = (P(𝒳_t | 𝒳_{0:t−1}))^{-1} P(𝒳_t | θ_t, 𝒳_{0:t−1}) P(θ_t | 𝒳_{0:t−1}). We further extend this formulation by a segmentation step, the S-step, where each observation is
assigned to the process that explains it the best (in the MAP sense). Each iteration of the proposed algorithm consists of the following triple step on the (k + 1)-th iteration:

E-step:   Q(θ_t | θ̂_t^k) = E[ log P(𝒳_t | θ_t, 𝒳_{0:t−1}) | Y_t, θ̂_t^k ],    (3)
M-step:   θ̂_t^{k+1} = arg max_{θ_t} { Q(θ_t | θ̂_t^k) + log P(θ_t | 𝒳_{0:t−1}) },    (4)
S-step:   l̂_t = arg max_{l_t} { Q(θ_t | θ̂_t^K) }.    (5)
In the E-step, the conditional expectations of l_t are computed based on the current observations and state estimate, which is equivalent to computing the probabilities of an observation belonging to each of the object hypotheses. These probabilities are then used in the state update of the M-step to weight the observations. In the S-step, image segmentation is performed based on the final state estimate at iteration step K. The segmentation is used to derive object-specific data, such as the gray value distribution within the object boundaries, which can then be exploited in the following iteration step.
3.1 E-step
By only considering data from the present and previous time step, the likelihood term in (3) becomes

P(𝒳_t | θ_t, 𝒳_{t−1}) = P(Y_t | l_t, θ_t, Z_t, 𝒳_{t−1}) P(l_t, Z_t | θ_t, 𝒳_{t−1}) ≈ P(Y_t | l_t, θ_t, Z_t) P(l_t | θ_t, Z_t) P(Z_t | θ_t).    (6)

If observations are furthermore assumed to be independent of each other, the logarithmic likelihood in (6) can be written log P(Y_t | l_t, θ_t, Z_t) = Σ_{i=1}^N log P(y_{t,i} | l_{t,i}, θ_t, Z_{t,i}). To be able to compute Q(θ_t | θ̂_t^k), it is convenient to first separate l_t and θ_t from each other. Exploiting the binary nature of l_{t,i}, the above equation then becomes log P(y_{t,i} | l_{t,i}, θ_t, Z_{t,i}) = l_{t,i}^T D(y_{t,i} | θ_t, Z_{t,i}), with

D(y_{t,i} | θ_t, Z_{t,i}) = [ log P(y_{t,i} | e_1, θ_t, Z_{t,i}), . . . , log P(y_{t,i} | e_J, θ_t, Z_{t,i}) ]^T.    (7)
A model for the single elements in the likelihood term D(y_{t,i} | θ_t, Z_{t,i}) will be introduced in Section 3.3. For a sparse set of salient feature points this measure is evaluated indirectly in the M-step (see Section 3.2). If label associations are assumed to be independent of each other, the individual terms of the label prior on the right of (6) become

log P(l_{t,i} | θ_t, Z_t) = l_{t,i}^T V_1(l_t),  with  V_1(l_t) = [ log P(e_1 | Z_{t,i}, θ_t), . . . , log P(e_J | Z_{t,i}, θ_t) ]^T.    (8)
As this independence assumption is contradictory to our natural notion that physical objects extend in space, we incorporate spatial dependencies of the association variable into the model. This implies that it is necessary to impose certain constraints or expectations on the labeling process. Prior expectations on spatial smoothness of l_t, i.e. the probability P(l_t), can be elegantly expressed using MRFs. An MRF is defined by the property P(l_{t,i} | l_{t,1}, . . . , l_{t,i−1}, l_{t,i+1}, . . . , l_{t,N}) = P(l_{t,i} | l_{t,s}, ∀s ∈ N_i), with N_i being the local
neighbourhood set of image point x_i. Exploiting the equivalence of MRF and Gibbs distributions (see e.g. [8]), an MRF may be written as P(l_t) = Z^{-1} exp(−V(l_t)), with Z being a normalization constant. V(·) is a Gibbs energy functional to be defined in the sequel. Assuming that we have a second-order MRF to represent the desired constraints on our labeling process, the Gibbs energy functional can be written

V(l_t) = Σ_{i=1}^N l_{t,i}^T V_1(l_t) + Σ_{i,n∈c} l_{t,i}^T V_2(l_t) l_{t,n},    (9)
with c stating a second-order clique. V_2(l_t) is a matrix of dimension J × J whose element {j, u} equals log P(l_{t,i} = e_j, l_{t,n} = e_u) = λ (e_j^T e_u). λ is a regularization constant weighting the influence of neighbouring sites on the prior term. This model can be interpreted as the well-known Potts model. Using the Gibbs formulation of our MRF on l_t, (3) becomes

Q(θ_t | θ̂_t^k) = Σ_{i=1}^N E[ l_{t,i} | y_{t,i}, θ̂_t^k, Z_{t,i} ]^T D(y_{t,i} | θ_t, Z_{t,i}) − E[ V(l_t) + log Z | Y_t, θ̂_t^k ].    (10)
If the normalization term is neglected, we then get

Q(θ_t | θ̂_t^k) = Σ_{i=1}^N π_{t,i}^T ( D(y_{t,i} | θ_t, Z_{t,i}) − V_1(l_t) ) − Σ_{i,n∈c} π_{t,i}^T V_2(l_t) π_{t,n},    (11)
with π_{t,i} = E[ l_{t,i} | y_{t,i}, θ̂_t^k, Z_{t,i} ] being the posterior probability of the label variable at position i. The j-th element, i.e. the probability of the label variable at position i being assigned to object j, is

π_{t,i}^j = P(y_{t,i} | l_{t,i} = e_j, θ̂_t^k, Z_{t,i}) P(l_{t,i} = e_j | θ̂_t^k, Z_{t,i}) / Σ_{s=1}^J P(y_{t,i} | l_{t,i} = e_s, θ̂_t^k, Z_{t,i}) P(l_{t,i} = e_s | θ̂_t^k, Z_{t,i}).    (12)
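A possible reading of this E-step in code (a sketch, not the author's implementation): the per-hypothesis log-likelihoods, the neighbourhood lists and the regularization constant λ are assumed to be given, and the pseudo-likelihood prior is formed from the responsibilities of the previous iteration.

```python
# Sketch of Eq. (12) with a Potts-type pseudo-likelihood label prior.
# log_lik[i, j] = log P(y_i | l_i = e_j, theta_j, Z_i); pi_prev holds the
# responsibilities of the previous iteration; neighbors[i] lists the indices
# of the second-order clique partners of observation i.
import numpy as np

def e_step(log_lik, pi_prev, neighbors, lam):
    N, J = log_lik.shape
    pi = np.empty_like(log_lik)
    for i in range(N):
        # pseudo-likelihood prior: neighbours voting for each label
        log_prior = lam * sum(pi_prev[n] for n in neighbors[i]) if neighbors[i] else np.zeros(J)
        log_post = log_lik[i] + log_prior
        log_post -= log_post.max()            # numerical stabilization
        p = np.exp(log_post)
        pi[i] = p / p.sum()                   # normalization as in Eq. (12)
    return pi
```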
The single elements π_{t,i}^j, which state the expectation that l_{t,i} = e_j, consist of a likelihood and a prior term which is expressed in our MRF formulation (see (9)), i.e. P(l_{t,i} = e_j) = P(l_{t,i} = e_j | l_{t,n}, n ∈ N_i). By applying the pseudo-likelihood approximation, the above expression is evaluated with l_{t,n} → π_{t,n} being the probability estimate of l_{t,n} obtained at the previous iteration step.
3.2 M-step
In the M-step, (the first term of) (4) can be maximized with respect to the state vector θ_t either directly, by setting the derivative of (11) to zero and solving for θ_t, or indirectly based on a set of feature points. For reasons of better data handling and hypotheses management we use a feature-based approach in our current framework which will be presented in the sequel. For each object hypothesis independently, a state estimate is obtained based on a set of M_t ⊂ N salient feature points X_{t,i}, i = (1, . . . , M_t), of a rigidly moving object in 3d space. Following [9], in this work we have applied the idea of the 'reduced-order observer' in order to reduce the dimension of X_{t,i} to one state for
each tracked point, encoding its depth ρ_{t,i}, i.e. X_{t,i} = (x_{t,i}, ρ_{t,i}). It is assumed that the corresponding image points of x_{t,i} can be determined exactly for all scene points in all views within the feature tracking scheme. Depth points propagate over time according to ρ_{t,i} = (0, 0, 1)[R(Ω_t) X_{t−1,i} + t_t]. With this, the coordinates of a scene point at time t are X_{t,i} = Π^{-1}(x_{t,i}, ρ_{t,i}), where Π^{-1}(·) states the inverse projection function. Given the set of tracked feature points, observations consist of the corresponding image points x_{t,l,i} in the left camera (obtained through the projection function Π_l(X_{t,i})) and in the following right camera image, x_{t+1,i}, i.e. y_{t,i} = (x_{t,l,i}, x_{t+1,i}). Concerning our observation model, the image position m_{t,i} in the following right image can be predicted from the current frame using (1), which states the instantaneous velocity field model. Given the current depth estimate ρ_{t,i} it is also trivial to derive the corresponding image coordinates s_{t,i} in the current left image. Observation noise r_t is assumed to be zero-mean, white Gaussian with covariance matrix R_t = E[r_t r_t^T]. The observation residual then is v_{t,i}^T R_{t,i}^{-1} v_{t,i}, with v_{t,i} = (|x_{t+1,i} − m_{t,i}| + |x_{t,l,i} − s_{t,i}|) expressing the projection error for each point x_{t,i} into the following right and current left camera image. We weight this residual with the posterior probability π_{t,i} of the label variable at the respective position i. By additionally evaluating the second term of (4), we obtain a MAP state estimate considering state information from time (t − 1), i.e. we integrate the state evolution through time into our estimation scheme. This is formulated by the Chapman-Kolmogorov equation P(θ_t | 𝒳_{t−1}) = ∫ P(θ_t | θ_{t−1}) P(θ_{t−1} | 𝒳_{t−1}) dθ_{t−1}, which expresses the predicted state distribution from time instant (t − 1) to t based on the a priori state distribution P(θ_{t−1} | 𝒳_{t−1}) and an appropriate system equation that accounts for the system dynamics. The model accounts for uncertainties and model errors through white, zero-mean Gaussian process noise q_t with error covariance matrix Q_t = E[q_t q_t^T]. We assume the prior distribution to be embodied in the probability statement P(θ_{t−1} | 𝒳_{t−1}) = N(θ̂_{t−1}, P_{t−1}), with θ̂_{t−1} and P_{t−1} being the mean estimate and covariance of a Gaussian. The best choice for θ_t then is P(θ_t | 𝒳_{t−1}) = N(θ_{t|t−1}, P_{t|t−1}), with θ_{t|t−1} = f(θ̂_{t−1}) stating the predicted state based on the previous estimate θ̂_{t−1}. The same holds for the predicted error covariance P_{t|t−1}. Our estimation problem is now fully defined and we are left with the task of finding the MAP estimate to this formulation. This is equivalent to minimizing its negative log, which under the Gaussian assumption reduces to the merit functions
J_1(θ_t) ∝ Σ_{i=1}^N π_{t,i}^j ( v_{t,i}^T R_{t,i}^{-1} v_{t,i} )  and  J_2(θ_t) ∝ (θ_t − θ_{t|t−1})^T P_{t|t−1}^{-1} (θ_t − θ_{t|t−1}),    (13)
regarding the likelihood and prior term (neglecting constant values). The combined estimate based on current and previous data can then be expressed as the sum of the two merit functions J(θ_t) = J_1(θ_t) + J_2(θ_t). To find a solution that minimizes this term, we now algebraically rearrange the above equation and consider the prior estimate as a pseudo-observation [10]. The resulting expression can then be formulated as an iterative optimization approach for a non-linear least-squares problem, with
θ_t^{k+1} = θ_t^k − (∇² J(θ_t^k))^{-1} ∇ J(θ_t^k),    P_{t|t}^{k+1} = ( (H_t^k)^T R_t^{-1} H_t^k + P_{t|t−1}^{-1} )^{-1},    (14)
being state estimate and approximate covariance update respectively. k states the iteration index and has already been introduced. Htk is the Jacobian of h(θtk ). For small
residual problems, (14) can be solved with the Gauss-Newton method. This is equivalent to the formulation of the update step within the IEKF [1],

θ_{t|t}^{k+1} = θ_{t|t−1} + K_t ( v_{t|t}^k − H_t^k (θ_{t|t−1} − θ_{t|t}^k) ),  with
v_{t|t}^k = Σ_{i=1}^N π_{t,i} ( y_{t,i} − h(θ_{t|t}^k) ) = Σ_{i=1}^N π_{t,i} v_{t,i},    P_{t|t} = (I − K_t H_t^k) P_{t|t−1},    K_t = P_{t|t−1} (H_t^k)^T ( H_t^k P_{t|t−1} (H_t^k)^T + R_t )^{-1},    (15)
being the combined innovation, error covariance update and Kalman gain respectively. The initial state is assumed to be Gaussian with initial estimate θ0|0 and error-covariance
P_{0|0}. After the state prediction θ_{t|t−1}, in the innovation step the iterative update of θ_{t|t}^k is started with θ_{t|t}^1 = θ_{t|t−1}, iterating until either the parameter estimates converge or some maximum number of iterations is reached. It can be seen that at initialization of each iteration cycle the second term of (15) disappears and the expression reduces to the standard EKF. Nonlinearities in the system model are handled by iterative relinearization of the model equations within the update step. Within the tracking scheme it is essential to verify that all feature points x_{t,i} can be described by the same system model. Otherwise the filter may not converge. Therefore the filter is robustified in the innovation step by minimizing the influence of outliers, i.e. observations with a large predicted residual error v_{t,i}, on the estimation process. Outlier detection is performed based on a significance test of the error distribution of v_t. Given the error covariance of the state P_{t|t} and assuming uncorrelated observations, the error covariance of the residual v_{t,i} is written Σ_{t,i} = H_t P_{t|t} (H_t)^T + R_t. The Mahalanobis distance then is δ_i² = v_{t,i}^T Σ_{t,i}^{-1} v_{t,i} (which follows a χ²-distribution). Outliers are detected using a standard χ²-test at a significance level of α = 1%. At each iteration step, the χ²-test is made for Γ subsets, randomly sampled from the entire set of observations. Each subset consists of S ⊆ M_t observations. For each observation subset, θ_{t|t} and P_{t|t} are determined according to (15) and the Mahalanobis distance δ_i is computed. After completion of the subsampling step, the optimal subset s is identified based on the least median of squares (LMedS) criterion, i.e. s = arg min_{s∈Γ}[δ_s^{LMedS}]. With the state estimate resulting from s, a feature point is detected as an outlier if δ_{i∈M_t} > δ^{max}. For α = 1% and 3 observations per point the threshold value was set to δ^{max} = 11.35. Besides yielding a robust state estimate, the output of the proposed method is also used to initialize new object hypotheses. This is done by analyzing the outliers for patterns of similar motion, as distinct moving objects that are not contained in the tracking process yet produce coherent groups in the outlier vector. In our framework we detect these groupings and initialize a new object hypothesis if the number of spatially clustered image points exceeds a certain threshold.
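The following condensed sketch illustrates one robustified update in the spirit of (13)-(15). It is an approximation of the described filter, not a faithful reimplementation: it uses a plain Gauss-Newton step with responsibility-weighted residuals and a simplified χ²-style gate on R_i alone, whereas the paper gates on Σ_{t,i} = H_t P_{t|t} H_t^T + R_t and additionally subsamples Γ subsets.

```python
# Sketch: responsibility-weighted Gauss-Newton/IEKF-style update with a
# Mahalanobis gate.  h(theta, i) and H(theta, i) are assumed callables that
# return the predicted observation and its Jacobian for feature i; R[i] is the
# per-feature observation covariance; pi[i] the E-step responsibility.
import numpy as np

def robust_update(theta_pred, P_pred, y, pi, h, H, R, delta_max=11.35, iters=5):
    theta = theta_pred.copy()
    active = np.ones(len(y), dtype=bool)               # features surviving the gate
    for _ in range(iters):
        info = np.linalg.inv(P_pred)                   # prior information (pseudo-observation)
        vec = info @ (theta_pred - theta)
        for i in np.nonzero(active)[0]:
            Hi = H(theta, i)
            ri = y[i] - h(theta, i)
            Ri_inv = np.linalg.inv(R[i])
            if ri @ Ri_inv @ ri > delta_max:           # simplified chi-square style gate
                active[i] = False
                continue
            info += pi[i] * Hi.T @ Ri_inv @ Hi         # weighted likelihood term of Eq. (13)
            vec += pi[i] * Hi.T @ Ri_inv @ ri
        P = np.linalg.inv(info)
        theta = theta + P @ vec                        # Gauss-Newton step on J1 + J2
    return theta, P
```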
3.3 S-Step
In the segmentation step, the (fixed) state estimate from the M-step is used to assign the correct object hypothesis, i.e. the label with maximum probability, to each image point. Therefore we propose the cost function
C = 1 − (1/N) Σ_{i=1}^N Σ_{j=1}^J δ_i^j π_{t,i}^j,  with  δ_i^j = { 1 for π_{t,i}^j = max_s { π_{t,i}^s };  0 else }.    (16)
δ_i^j = δ(l_{t,i} = e_j | y_{t,i}, θ̂_t^K, Z_{t,i}) denotes our decision function and expresses the costs for assigning a certain hypothesis to each image point. In this work we have solved the decision problem by assigning to each point the hypothesis that produces the highest probability. C quantifies the overall costs of a segmentation, with each erroneously assigned image point producing the same costs. Our Bayesian decision rule assigns the hypothesis with MAP probability to each image point and therefore minimizes C, i.e. minimizes the number of segmentation errors. Based on our test statistic π_{t,i}, we can formulate our segmentation problem as
l̂_t = arg min_{l_t} { Σ_{i=1}^N ( D(y_{t,i} | θ̂_t^K, Z_{t,i}) + V_1(l_t) ) + Σ_{i,n∈c} l_{t,i}^T V_2(l_t) l_{t,n} }.    (17)
The single elements D(y_{t,i} | θ̂_t^K, Z_{t,i}) are assumed to follow a Gaussian distribution. The similarity between expected and observed gray values within a block of size B centered around image point x_{t,i} is expressed by

log P(y_{t,i} | e_j, θ̂_t^K, Z_{t,i}) = Σ_{s,u∈B} (g_t(x_{i,s,u}) − g_{t+1}(m_{i,s,u}))² / (2σ²),    (18)
with mi stating the corresponding point in the next right image according to (1). σ is derived from the segmentation result of the previous time instant and expresses the gray value variations within the region that has been assigned to the respective object hypothK esis. Given parameter estimates θˆ t and σ for all object hypotheses, above equation is fully specified and the MAP-MRF labeling solution is completely defined. An optimal labeling is found using a discrete energy minimization technique based on the graph-cut framework [11]. The optimal labeling ˆlt is then used within a gating process in the update step of our tracking filter, restricting the number of valid observations for a given object hypothesis to the observations that have been assigned to the respective label. If a track has no support in the current segregation step, i.e. no image point is assigned to the respective label, the track is deleted from the list of tracked object hypotheses.
4 Experiments
The performance of our approach has been evaluated on real and synthetic image data. For each track, the temporal and spatial image displacement observations within the M-step are extracted from the respective set of tracked feature points by a correlation-based block matching technique. A dense depth map is computed in a global manner using a stereo matching algorithm based on the belief propagation framework. The filter output, representing the time-smoothed object parameters of object j, is then fed back into the image segmentation process. In a subsequent step, points that do not conform to the labeling are deleted from the list of tracked feature points and replaced by new points, sampled from the labeled region in the image. The system state
Fig. 1. Propagation of (a) normalized residuals and estimation error in the innovation step of the IEKF (δ0LMedS , v0 and ε0 state the error measures at iteration cycle s = 1) and (b) the mean segmentation error per pixel when initialization of new object hypothesis is suppressed (red) and when new object hypotheses are added to the filter bank (blue); (c) segmentation result for (b) with the original image, the segmentation result when hypothesis initialization is suppressed and the segmentation result for 2 independently moving objects; (d) Propagation of θˆ t (for illustration reasons, here only the translational motion component is shown) for soft and hard data assignment. ‘GT’ states ground truth.
for each track is initialized with zero velocity and depth ρ1:M0 , which is easily computed from the precomputed disparity map. The system noise is set empirically to qt = (0.5, 0.25, 0.5, 1, 1, 1, 0.5, . . ., 0.5)T . The number of possible hypotheses J is defined by the momentary number of distinct 6-DoF motion profiles in the scene. Regarding the relative importance of data and smoothness term the regularization factor has been adapted empirically to values between λ = 0.05 − 0.5. Within the update step of the IEKF, we obtained satisfactory results for a total number of Γ = 10 subsets with each subset containing S = 12 observations. Figure 1-(a) shows the propagation of normalized residuals and state estimation error as a function of repeated linearization in the innovation step of the IEKF. The residuals have been normalized relative to the initial error measure at iteration step s = 1 respectively. Figure 1-(b) depicts the mean segmentation error per pixel, according to (18), over time when initialization of new object hypothesis is suppressed (red) and when new object hypotheses are added to the filter bank for the standard case (blue). The colored, vertical bars indicate the time instant when an object hypotheses has been added to the filter bank. In Figure 1-(c) the corresponding segmentation results for -(b) are shown. ‘ND’ is assigned to points with no data available, ‘EGO’ denotes the ego motion label and ‘AMB’ indicates the ambiguity label. For a synthetic image sequence with one moving object ‘OBJ1’, in Figure 1-(d) it can be seen that state estimates based on soft data assignment outperform the hard assignment estimates in both robustness and accuracy.
5 Conclusion
An approach has been presented for multiple object tracking and scene segmentation. The data association problem has been solved using the EM framework. Within the association process, which has been implemented as a labeling problem, an MRF has been used to account for spatial relationships. From the resulting merit function of the EM algorithm an IEKF has been derived for time-recursive tracking of multiple objects. Based on the continuously updated state estimates the image is segregated accordingly. In ongoing work we integrate relational classification based on Markov logic into our segmentation scheme. We assume that the interaction of segmentation/tracking with the results from the classification step can be exploited to drive low-level object detection schemes tending towards more human-like scene perception. First experiments indicate remarkable improvements compared to low-level segmentation alone.
Acknowledgement The author gratefully acknowledges the contribution of the German collaborative research center ‘SFB/TR 28 – Cognitive Automobiles’ granted by Deutsche Forschungsgemeinschaft.
References 1. Bar-Shalom, Y.: Tracking and data association. Academic Press Professional, Inc., San Diego (1987) 2. Blackman, S., Popoli, R.: Design and Analysis of Modern Tracking Systems. Artech House, Boston (1999) 3. Molnar, K., Modestino, J.: Application of the EM algorithm for the multitarget/multisensor tracking problem. IEEE Trans. Signal Processing 46(1), 115–129 (1998) 4. Longuet-Higgins, H., Prazdny, K.: The interpretation of a moving retinal image. Proceedings of Royal Society of London 208, 385–397 (1980) 5. Logothetis, A., Krishnamurthy, V.: Expectation maximization algorithms for map estimation of jump markov linear systems. IEEE Trans. Signal Processing 47(8), 2139–2156 (1999) 6. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal Royal Statistical Society B 39(1), 1–38 (1977) 7. Green, P.: On use of the EM algorithm for penalized likelihood estimation. Journal Royal Statistical Society 52(3), 443–452 (1990) 8. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. Journal Royal Statistical Society (2), 192–236 (1974) 9. Chiuso, A., Soatto, S.: Motion and structure form 2d motion causally integrated over time: Anlaysis. IEEE Trans. Robotics and Automation (2000) 10. Sibley, G., Sukhatme, G., Matthies, L.: The iterated sigma point kalman filter with applications to long range stereo. In: Proceedings of Robotics: Science and Systems, Philadelphia, USA (August 2006) 11. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI 23(11), 1222–1239 (2001)
Adaptive Foreground/Background Segmentation Using Multiview Silhouette Fusion Tobias Feldmann, Lars Dießelberg, and Annika Wörner Group on Human Motion Analysis, Institute for Anthropomatics, Faculty of Informatics, Universität Karlsruhe (TH)
[email protected]
Abstract. We present a novel approach for adaptive foreground/background segmentation in non-static environments using multiview silhouette fusion. Our focus is on coping with moving objects in the background and influences of lighting conditions. It is shown, that by integrating 3d scene information, background motion can be compensated to achieve a better segmentation and a less error prone 3d reconstruction of the foreground. The proposed algorithm is based on a closed loop idea of segmentation and 3d reconstruction in form of a low level vision feedback system. The functionality of our approach is evaluated on two different data sets in this paper and the benefits of our algorithm are finally shown based on a quantitative error analysis.
1 Introduction
Current image based human motion capture frameworks focus most frequently on the segmentation and tracking of moving body parts in static laboratory environments with controlled lighting conditions [1,2,3]. Also, many frameworks exist which focus explicitly on the detection of people in front of cluttered backgrounds (bgs) for surveillance purposes, e.g. [4]. In both cases learned foreground (fg) and/or bg models are used for segmentation. These can be based on color, range or previous knowledge about shape. Color based models are typically probabilistic color distributions [5], affine transformations of clusters in color spaces [6] or code books [7]. A combined range and color based model has been proposed in [8]. Another approach is to learn shape models of the fg [9]. The models can be learned a priori [9] or continuously over time [5,6,8]. A 3d reconstruction can be created based on this segmentation. This can be done with monocular [9], stereo [8], multi view [1] or multi view stereo setups [10]. When using a multi view approach, a major aspect is how to fuse the information of different views for 3d reconstruction which can be done by either shape from stereo, shading, texture, or silhouettes (cf. [11]) with either point correspondences [10], probabilistic space carving [12,13], energy minimization [11] or volumetric graph cuts [10].
This work was supported by a grant from the Ministry of Science, Research and the Arts of Baden-Württemberg.
Fig. 1. High bar, vault, still rings, pommel horse and floor exercises. Note the observers and scoring judges who generate a non-static permanently changing bg. The images were taken at the World Artistic Gymnastic Championships 2007, Germany.
In [3], an approach is presented which automatically extracts fg from a multiple view setup utilizing the assumption that bg regions present some color coherence in each image. We are presenting another approach, on the contrary, where this assumption does not need to apply. Our approach aims for dense 3d volume reconstructions from camera images in the context of image based analysis of gymnasts doing exercises without using markers to avoid hindering them in their performances. The recordings should take place in natural environments which are usually cluttered by the audience, scoring judges and advertisements, (cf. fig. 1) and may be additionally altered by lighting influences. Thus, we have to cope with a scenario of permanently changing bgs and hence propose an algorithm which is able to distinguish between dynamically changing fg and bg by permanently adapting the fg and bg models to the environment which is possible by using a static multi camera setup and taking the scene geometry into account.
2 Contribution
We are proposing a new algorithm for adaptive fg/bg segmentation in scenarios where distracting motion exists in the bgs of many cameras. We are presenting a solution using a calibrated multi camera setup to compensate bg motion by integrating 3d scene information, in order to achieve a better, less error prone 3d reconstruction of the fg. The fg is by definition anything which is within the volume seen by all cameras and bg is everything outside the volume. The algorithm is based on a closed loop idea of segmentation and 3d reconstruction in the form of a low level vision feedback system. The main idea is to learn the fg and the bg in terms of GMMs which are used to infer a 3d reconstruction of the scene by probabilistic fusion. The probabilistic approach is chosen to avoid holes due to silhouette misclassifications. The resulting 3d reconstruction is projected into the camera images to mask areas of fg and bg, then the GMMs are updated using the masks from the 3d projection and the cycle restarts. The proposed algorithm basically consists of two phases: a first initialization phase and a second iteration phase. In the initialization phase, a sequence of images or at least one image per camera is utilized to learn the bg GMMs on a pixel based level in YUV color
Fig. 2. Single iteration step of fg/bg segmentation with simultaneous learning via feedback of 3d reconstruction. Boxes represent data, arrows represent dependencies.
space. Additionally, a 3d occupancy grid is set up in which for each voxel a precalculation of the projections on all camera images is performed. The following iteration phase consists of six consecutive iteration steps as depicted in fig. 2. In the first step, the fg and bg GMMs of the initialization or the previous cycle are utilized and the current camera images are loaded and converted to YUV color space. This data is then utilized as input for the second step. In the second step, a probabilistic fusion [13] for 3d reconstruction is used. This allows us to integrate the probabilities of the fg of all cameras in the 3d reconstruction process. The probabilities are derived from the input camera images and the fg and bg GMMs. The probabilistic integration helps to cope with errors in silhouettes. If the silhouette extraction fails in some cameras due to ambiguity for some pixels, the probability of a voxel occupation only decreases. If silhouettes of other cameras confirm the occupation of the same voxel with high probability due to unambiguous fg detection, the failure of the first cameras can be compensated. In step three, the probabilities of occupancy are projected into camera image masks. We then iterate over all voxels and project all probabilities into the masks. For each pixel, the voxel with the maximum probability on the line of sight is stored, which results in a probabilistic fg mask. In the fourth step, the thresholded inverted fg mask is used to learn the bg. The inverted fg mask is a more reliable indicator of where to learn than the bg GMM itself. The bg GMM is hence updated at locations where the inverted fg probabilities fall below a certain threshold. The bg is learned conservatively, because it is not critical to possibly leave some bg pixels unlearned. If the bg has really changed at these locations, it can also be learned in a later step, when due to movements of the fg an easier discrimination is possible. In step five, a probabilistic fg segmentation is done using the current camera images, the current fg GMM and the updated bg GMM. Finally, in the sixth step, the fg GMM is learned from the current images, the probabilistic projections and the probabilistic segmentations. In contrast to the bg masking, the mask of the fg has to be very precise in order to avoid false positives and to make sure that elements of the bg are not learned into the fg GMM. This would lead to an iterative bloating of the assumed fg over time. Thus, a logical AND operation is used on the segmentation with its very sharp edges and the probabilistic projection, which removes possible noise in
areas outside the volume. This assures that only real fg is learned into the fg model.
3 Learning
The learning section is split into three parts: describing the bg learning in sec. 3.1, the incorporation of shadows and illumination changes in sec. 3.2 and the construction of our fg GMM in sec. 3.3.
3.1 Learning the Background
In the bg learning approach the discrimination of fg and bg is done on a pixel location based level where all pixels are treated independently. C is defined to be the random variable of the color value of a pixel and F ∈ {0, 1} to decide whether a pixel is fg (F = 1) or bg (F = 0). The probability distributions p(c|F = 1) and p(c|F = 0) are used to model the color distributions of the fg and the bg. The conditional probability P(F = 0|c) that an observation of a color value c belongs to the bg can be inferred using Bayes' rule. Following [5], each bg pixel is modeled by a GMM with up to K components (K between 3 and 5). We are using the YUV color space and, thus, define c_t = (c_t^1, c_t^2, c_t^3) = (Y_t, U_t, V_t) to be the color value of a pixel in an image at time t of the image sequence. The different color channels are treated as stochastically independent for simplification of the calculations. The bg likelihood, thus, is:

p(c_t | F_t = 0) = Σ_{k=1}^K ω_t^k Π_{d=1}^3 η(c_t^d, μ_t^{k,d}, Σ_t^{k,d}),
where η(c, μ, Σ) is the density function of the Gaussian distribution, ω_t^k is the weight of the k-th Gaussian component and μ_t^{k,d} and Σ_t^{k,d} are mean and variance of the d-th color channel with d ∈ {1, 2, 3}. At time t, a GMM with a set of coefficients (ω_t^k, μ_t^{k,d}, Σ_t^{k,d}) has been built for a specific pixel from all seen bg colors. If at t + 1 a new color value c_{t+1} has been classified as bg, this color has to be integrated into the GMM. At first, we try to find an already matching Gaussian distribution where the distribution's mean differs by less than 2.5 times the standard deviation from the current color value. If no matching color distribution could be found, a new component will be initialized or, in case the maximal number of components K has already been reached, the component with the least evidence will be replaced, whereas evidence is defined as the component's weight divided by its standard deviation. In that case, the new distribution is initialized with c_{t+1} as mean, an initial variance Σ_init and an initial weight ω_init. In the opposite case, i.e. if a matching distribution is found, the distribution has to be updated with the current color value. If more than one matching distribution is found, the component k which returns the highest probability integrating c_{t+1} will be updated. The updating process is deduced from [14], to get a better bg initialization compared to [5]. A sliding window of the past of size L (between 20 and 50) is used. This is just a guiding value, since new bg is usually
526
T. Feldmann, L. Dießelberg, and A. W¨ orner
integrated within one frame. For the updating process, the following variables are defined:

m_{t+1}^k = { 1, if component k has been updated at step t + 1;  0, else },    s_t^k = Σ_{j=1}^t m_j^k,
ω_t^k = (1/t) s_t^k,    μ_t^{k,d} = (1/s_t^k) Σ_{j=1}^t m_j^k c_j^d,    Σ_t^{k,d} = (1/s_t^k) Σ_{j=1}^t m_j^k (μ_j^{k,d} − c_j^d)².
If t < L, the updating step is done in the following way, analogous to [14]:

ω_{t+1}^k = ω_t^k + (1/(t+1)) (m_{t+1}^k − ω_t^k),
μ_{t+1}^{k,d} = μ_t^{k,d} + (m_{t+1}^k / s_{t+1}^k) (c_{t+1}^d − μ_t^{k,d}),
Σ_{t+1}^{k,d} = Σ_t^{k,d} + (m_{t+1}^k / s_{t+1}^k) ((μ_{t+1}^{k,d} − c_{t+1}^d)² − Σ_t^{k,d}).
The results depend only on m_{t+1}^k and the new color values c_{t+1}. In case of t ≥ L, the updating step is done in a way deduced from [14] but modified as proposed in the implementation of [15]:

ω_{t+1}^k = ω_t^k + (1/L) (m_{t+1}^k − ω_t^k),
μ_{t+1}^{k,d} = μ_t^{k,d} + (m_{t+1}^k / (L ω_{t+1}^k)) (c_{t+1}^d − μ_t^{k,d}),
Σ_{t+1}^{k,d} = Σ_t^{k,d} + (m_{t+1}^k / (L ω_{t+1}^k)) ((μ_{t+1}^{k,d} − c_{t+1}^d)² − Σ_t^{k,d}).
The results of the latter case are approximations, so that the calculations again only depend on m_{t+1}^k and c_{t+1}, to reduce the memory needs of the algorithm.
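A per-pixel sketch of the t ≥ L update above (our reading of the equations, not the authors' code):

```python
# Sketch: one sliding-window update step for component k of one pixel's
# background GMM, with independent color channels (Y, U, V).
import numpy as np

def update_component(weight, mean, var, color, matched, L=50):
    """weight: scalar w_t^k; mean, var, color: length-3 arrays;
    matched: 1 if this component was matched by the new color, else 0."""
    m = float(matched)
    new_weight = weight + (m - weight) / L
    if m > 0:
        step = m / (L * new_weight)
        new_mean = mean + step * (color - mean)
        new_var = var + step * ((new_mean - color) ** 2 - var)
    else:
        new_mean, new_var = mean, var          # unmatched components keep their statistics
    return new_weight, new_mean, new_var
```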
3.2 Dealing with Illumination Influences
Shadows and highlights are a problem in natural environments. In our scenario with possible sunshine and other lighting influences, we have to cope with local short-term illumination changes. Thus, we introduce a model for shadow and highlight detection and – in contrast to [16] – incorporate it directly into the bg model by extending the previously defined bg likelihood to:

p(c_t | F_t = 0) = (1/2) Σ_{k=1}^K ω_t^k Π_{d=1}^3 η(c_t^d, μ_t^{k,d}, Σ_t^{k,d}) + (1/2) p(c_t | S_t = 1).
S_t ∈ {0, 1} denotes the presence of shadows or highlights. The amounts of shadows and highlights and the bg are assumed to be equally distributed. The shadow/highlight model is defined like the bg model, whereas the current weightings ω_t^k of the bg components are reused:

p(c_t | S_t = 1) = Σ_{k=1}^K ω_t^k p(c_t | S_t^k = 1).
Assuming that a color (Y_B, U_B, V_B) at a pixel represents bg and a color value c_t, taken at time t, differs from the bg in luminance (i.e. Y) but not in chroma, the variable λ is defined to describe the luminance ratio with λ = Y_t / Y_B = c_t^1 / μ_t^{k,1}. Additional shadow and highlight thresholds are defined with τ_S < 1 and τ_H > 1. A color c_t is classified as shadow or highlight if τ_S ≤ λ ≤ τ_H. The probability density function of a component k should return 0 if λ is not within the thresholds. Otherwise, the probability should be calculated from a 2d probability density constructed from the chrominance channels of the bg. In the YUV color space, U and V of the color c_t cannot be compared directly with the bg's
mean. Instead, a rescaling of the bg's mean using the luminance ratio λ is necessary first: μ̃_t^{k,d} = λ_t^k (μ_t^{k,d} − Δ) + Δ with d ∈ {2, 3}, where Δ = 128 for 8-bit color channels. An additional scale factor 1/((τ_H − τ_S) c_t^1) is needed so that the density integrates to 1. The density of a single component k is hence defined as:

p(c_t | S_t^k = 1) = { 1/((τ_H − τ_S) c_t^1) Π_{d=2,3} η(c_t^d, μ̃_t^{k,d}, Σ_t^{k,d})   if τ_S ≤ λ_t^k ≤ τ_H;   0   else }.
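A sketch of this shadow/highlight component density, assuming 8-bit channels (Δ = 128); the threshold values τ_S = 0.6 and τ_H = 1.2 are illustrative defaults of this sketch, not values reported in the paper.

```python
# Sketch: shadow/highlight density of a single background component.
import numpy as np

def shadow_density(c, mu, var, tau_s=0.6, tau_h=1.2, delta=128.0):
    c = np.asarray(c, dtype=float)                  # observed (Y, U, V)
    mu = np.asarray(mu, dtype=float)                # component mean (Y, U, V)
    var = np.asarray(var, dtype=float)              # per-channel variance
    lam = c[0] / mu[0]                              # luminance ratio lambda = Y_t / Y_B
    if not (tau_s <= lam <= tau_h):
        return 0.0
    mu_uv = lam * (mu[1:] - delta) + delta          # rescaled chroma means
    dens = np.prod(np.exp(-0.5 * (c[1:] - mu_uv) ** 2 / var[1:]) /
                   np.sqrt(2.0 * np.pi * var[1:]))
    return dens / ((tau_h - tau_s) * c[0])          # scale factor 1 / ((tau_H - tau_S) c_t^1)
```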
3.3 Learning the Foreground
During fg learning only the fg pixels' colors are of interest. The pixel locations are, in contrast to sec. 3.1, not considered. Assuming that the current fg pixels of a camera image are known, a per-frame fg GMM can be built. The k-means algorithm is used to partition the fg colors. The parameters of the fg GMM can be derived from the resulting clusters. In contrast to the bg, the color distribution of the fg may change very fast due to new fg objects or changes due to rotations of fg objects with differently colored surfaces. Both cases can be handled by introducing a probability P_NF (new fg) of observing previously unknown fg. Because of the discontinuous behavior of the fg, this color distribution is assumed to be uniform: U(c). This leads to the following probability density of the fg:

p(c | F = 1) = P_NF U(c) + (1 − P_NF) Σ_k ω^k η(c, μ^k, Σ^k).
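A sketch of this foreground likelihood; the uniform component is taken over the discrete color space, and P_NF as well as the per-cluster parameters (here passed in from a k-means partition of the current fg colors) are assumptions of the sketch rather than values from the paper.

```python
# Sketch: foreground color likelihood p(c | F = 1) with a uniform
# "new foreground" component mixed with Gaussians from k-means clusters.
import numpy as np

def fg_likelihood(c, weights, means, variances, p_nf=0.1, n_colors=256 ** 3):
    """weights: list of cluster weights; means, variances: lists of
    length-3 arrays (one per cluster, independent channels)."""
    c = np.asarray(c, dtype=float)
    gauss = 0.0
    for w, mu, var in zip(weights, means, variances):
        gauss += w * np.prod(np.exp(-0.5 * (c - mu) ** 2 / var) /
                             np.sqrt(2.0 * np.pi * var))
    return p_nf * (1.0 / n_colors) + (1.0 - p_nf) * gauss
```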
4 3d Reconstruction by Probabilistic Fusion
Following the ideas of probabilistic fusion [13], a scene is represented by a 3d occupation grid where each voxel of the grid contains a probability of occupation inferred from the information of fg and bg GMMs and given camera images from multiple cameras. A voxel in the 3d grid is defined as V ∈ {0, 1} describing the occupation state. For 3d reconstruction, V has to be projected into N cameras. In fact, the voxel projection – dependent on the voxel sizes – possibly affects multiple pixels. This can be incorporated by using the combined pixel information of the convex polygon of the projected voxel corners or by approximation of a rectangle, which we did in our implementation. We used the geometric mean of the pixels' likelihoods. For simplification, it is assumed in the following that one voxel projects to exactly one pixel. In the n-th camera, the voxel projection at a certain pixel results in the random variable C_n of color values. Based on C_n, a decision is made to classify the pixel as fg or bg which is used to determine the voxel occupation. The resulting causal chain reads V → F_n → C_n, where all F_n and C_n are assumed to be stochastically independent. Certain a priori probabilities are now defined which will be combined to a conditional probability in a later step. First, the probability – whether or not a voxel is occupied – is defined as equally distributed: P(V) = 1/2. Analogous to [13], three probabilities of possible errors during reconstruction are defined. The
first error source is the probability P_DF of a detection failure, i.e. a voxel V should be occupied, but the silhouette in camera n is erroneously classified incorrectly (F_n = 0 instead of 1). This leads to the probability function of an occupied voxel V: P(F_n = 0 | V = 1) = P_DF. The second error source is the probability P_FA of a false alarm which is the opposite of the first error, i.e. a voxel V should not be occupied, but the silhouette in camera n is again erroneously classified incorrectly (F_n = 1 instead of 0 this time). The third error source is the probability P_O of an obstruction, i.e. voxel V should not be occupied, but another occupied voxel is on the same line of sight and, thus, the corresponding pixel is classified as silhouette (F_n = 1 instead of 0). These errors result in the following conditional probability of fg of an unoccupied voxel V: P(F_n = 1 | V = 0) = P_O (1 − P_DF) + (1 − P_O) P_FA. Together, the combined probability distribution of all variables is given by:

p(V, F_1, . . . , F_N, C_1, . . . , C_N) = P(V) Π_{n=1}^N P(F_n | V) Π_{n=1}^N p(C_n | F_n).
This allows us to infer the probability of a voxel's occupation under observation of the colors c_1, . . . , c_N using Bayes' rule and the marginalization over the unknown variables F_n:

P(V = 1 | c_1, . . . , c_N) = [ Π_{n=1}^N Σ_{f∈{0,1}} P(F_n = f | V = 1) p(c_n | F_n = f) ] / [ Σ_{v∈{0,1}} Π_{n=1}^N Σ_{f∈{0,1}} P(F_n = f | V = v) p(c_n | F_n = f) ],
where P (V) cancels out due to equal distribution and p(cn |Fn ) is the current fg or rather bg model of the nth camera.
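A sketch of this per-voxel fusion; the error probabilities P_DF, P_FA and P_O below are illustrative placeholders, not values from the paper.

```python
# Sketch: P(V = 1 | c_1, ..., c_N) from the per-camera foreground and
# background likelihoods of the colors observed at the voxel's projections.
import numpy as np

def voxel_occupancy(p_fg, p_bg, p_df=0.1, p_fa=0.05, p_o=0.2):
    p_fg = np.asarray(p_fg, dtype=float)            # p(c_n | F_n = 1), one entry per camera
    p_bg = np.asarray(p_bg, dtype=float)            # p(c_n | F_n = 0)
    # P(F_n = f | V = v) for the two voxel states
    p_f1_v1, p_f0_v1 = 1.0 - p_df, p_df
    p_f1_v0 = p_o * (1.0 - p_df) + (1.0 - p_o) * p_fa
    p_f0_v0 = 1.0 - p_f1_v0
    occ = np.prod(p_f1_v1 * p_fg + p_f0_v1 * p_bg)      # numerator (V = 1)
    empty = np.prod(p_f1_v0 * p_fg + p_f0_v0 * p_bg)    # V = 0 term of the denominator
    return occ / (occ + empty)
```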
5 Evaluation
For evaluation, we used the HumanEva [17] data set II, S2 with 4 cameras and our own dancer sequence recorded with a calibrated camera system consisting of 8 Prosilica GE680 cameras which have been set up in a circular configuration. Our video streams were recorded externally triggered using NorPix StreamPix 4. Both sequences contain scenes of a person moving in the fg (i.e. in the 3d volume). In our own scene, additional noise is generated by persons and objects moving in the bg and by windows in the recording room with trees in the front, which results in shadows and illumination changes. We compared 3 approaches: 1. proposed approach, 2. static a priori learned bg GMM with adaptive fg GMM, 3. static a priori learned bg GMM. In general, after only one additional frame our proposed algorithm had already learned better discrimination models to distinguish between fg and bg. The results became a lot better over time compared to the algorithms with static bg GMMs with respect to the compensation of perturbations and the closing of holes in silhouettes (cf. fig. 6). On our sequence with changing bgs, the proposed algorithm performed clearly much better compared to the approach using a static bg GMM and an adaptive fg GMM (cf. fig. 3). The proposed algorithm was able to remove most of the
Fig. 3. In the dancer sequence, a person dances in the 3d volume while others walk around the volume and move objects to generate perturbations. Further noise results from windows on the left. Row 1: Images from a randomly chosen camera. Row 2: Results with a priori learned bg and adaptive fg. Row 3: Our proposed algorithm. Please note the interferences between fg and bg in frames 800 and 1000 and compare with fig. 4.
Fig. 4. 3d reconstructions from critical frames 800 (a) and 1000 (b) (cf. fig. 3). Interferences are compensated by redundancy. 3d reconstruction influenced by noise (c) with static bg GMM and adaptive fg GMM and (d) with our proposed approach. Note the reconstruction of the feet. Our proposed approach is obviously achieving better results.
Fig. 5. Images of frame 419 of the evaluated HumanEva data set: (a) Image of camera 3, (b) hand labeled image, (c) algorithm with static bg GMM, (d) our proposed algorithm
[Fig. 6 plots: left panel "HumanEva Sequence", right panel "Dancer Sequence"; y-axis: total error percentage, x-axis: frame no.; curves: Adaptive+F, Static+F, Static]
Fig. 6. Evaluation of three algorithms: (1) Adaptive+F = proposed algorithm, (2) Static+F = static bg GMM, adaptive fg GMM, (3) Static = static bg GMM. Left: total pixel error rates of the HumanEva dataset (all cameras). Right: Total error rates of dancer sequence with moving bgs (randomly chosen camera). It is obvious that our combined adaptive algorithm is outperforming the ones with static bgs in all cases.
motions of the persons outside the volume by learning them into the bg GMM within typically 1 frame. Nevertheless, there were some interferences between fg and bg, e.g. in frames 800 and 1000, which result from step 4 of our algorithm (cf. p. 524) in which, by definition, bg is integrated very cautiously into the bg GMM in areas close to fg. But due to the redundancy of camera views, those artifacts do not perturb the 3d reconstruction, which can be seen in fig. 4. It is obvious that our approach leads to significantly better reconstructions in dynamic scenarios. We manually labeled images of the HumanEva data set (cf. fig. 5) and of our dancer sequence in order to measure the quantitative errors of the proposed algorithm. The errors of misclassified fg and bg have been measured, summed, and the total error percentage has been calculated. The results are shown in fig. 6. It is obvious that on the HumanEva data set our proposed adaptive algorithm leads to much better results over time than the algorithms with the a priori learned bgs. The results on our dancer sequence with moving objects in the bg and lighting influences from the window front show that our proposed fully adaptive algorithm is easily outperforming the static ones in such scenarios.
6 Discussion
We presented a novel approach for fg/bg segmentation in non-static environments with permanently changing bgs and lighting influences, using a combination of probabilistic fg and bg learning techniques and probabilistic 3d reconstruction. The algorithm works very reliably in static as well as dynamic scenarios containing extensive perturbations due to moving persons and objects. We showed in a quantitative analysis on manually labeled images that our approach outperforms conventional approaches with a priori learned bgs and adaptive fg models in both static and dynamic scenarios. The resulting 3d reconstructions were presented, and it is obvious that silhouette artifacts which may arise from
moving bg at the silhouettes' borders have nearly no influence on the shape of the 3d reconstruction due to the redundant information of multiple cameras. Additionally, our approach leads to much better 3d reconstructions in case of perturbations due to changing bgs and, thus, should be preferred over static bg GMMs in the described or similar scenarios.
References 1. Cheung, G.K.M., Kanade, T., Bouguet, J.-Y., Holler, M.: A Real Time System for Robust 3D Voxel Reconstruction of Human Motions. CVPR 2, 2714 (2000) 2. Rosenhahn, B., Kersting, U.G., Smith, A.W., Gurney, J.K., Brox, T., Klette, R.: A system for marker-less human motion estimation. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, pp. 230–237. Springer, Heidelberg (2005) 3. Lee, W., Woo, W., Boyer, E.: Identifying Foreground from Multiple Images. In: ACCV, pp. 580–589 (2007) 4. Landabaso, J.L., Pard` as, M.: Foreground regions extraction and characterization towards real-time object tracking. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 241–249. Springer, Heidelberg (2006) 5. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for RealTime Tracking. In: CVPR, vol. 2, pp. 2246–2252 (1999) 6. Sigal, L., Sclaroff, S., Athitsos, V.: Skin Color-Based Video Segmentation under Time-Varying Illumination. PAMI 26, 862–877 (2004) 7. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.: Background modeling and subtraction by codebook construction. In: ICIP, vol. 5, pp. 3061–3064 (2004) 8. Gordon, G., Darrell, T., Harville, M., Woodfill, J.: Background Estimation and Removal Based on Range and Color. In: CVPR, vol. 2, p. 2459 (1999) 9. Grauman, K., Shakhnarovich, G., Darrell, T.: A Bayesian approach to image-based visual hull reconstruction. In: CVPR, vol. 1, pp. 187–194 (2003) 10. Vogiatzis, G., Torr, P.H.S., Cipolla, R.: Multi-view Stereo via Volumetric Graphcuts. In: CVPR, vol. 2, pp. 391–398 (2005) 11. Kolev, K., Brox, T., Cremers, D.: Robust variational segmentation of 3D objects from multiple views. In: Franke, K., M¨ uller, K.-R., Nickolay, B., Sch¨ afer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 688–697. Springer, Heidelberg (2006) 12. Broadhurst, A., Drummond, T.W., Cipolla, R.: A Probabilistic Framework for Space Carving. In: ICCV, pp. 388–393 (2001) 13. Franco, J.-S., Boyer, E.: Fusion of Multi-View Silhouette Cues Using A Space Occupancy Grid. In: INRIA, vol. 5551 (2005) 14. Kaewtrakulpong, P., Bowden, R.: An Improved Adaptive Background Mixture Model for Realtime Tracking with Shadow Detection. In: 2nd European Workshop on Advanced Video Based Surveillance Systems. Kluwer Academic Publishers, Dordrecht (2001) 15. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly, Sebastopol (2008) 16. Martel-Brisson, N., Zaccarin, A.: Moving Cast Shadow Detection from a Gaussian Mixture Shadow Model. In: CVPR, vol. 2, pp. 643–648 (2005) 17. Sigal, L., Black, M.J.: HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion. Brown University, Technical Report (2006)
Evaluation of Structure Recognition Using Labelled Facade Images
Nora Ripperda and Claus Brenner
Institute of Cartography and Geoinformatics, Leibniz Universität Hannover, Germany
{Nora.Ripperda,Claus.Brenner}@ikg.uni-hannover.de
Abstract. In 3d city modelling the demand for details is constantly growing. Because of that, automatic facade reconstruction methods are needed. Structure information plays an important role in facade reconstruction since facade elements like windows are arranged in structural patterns. In this paper we present a facade grammar which models these structures. This grammar is used in an rjMCMC-based facade reconstruction method. We evaluate the reconstruction method using the eTRIMS [1] image database.
1 Introduction
3d building models are used in a large number of applications. These are in the fields of architecture, city planning and tourism as well as navigation. In all these fields of application the demand for details is growing. Today's automatic reconstruction methods don't extract facade details from the data, but some of them use texture to visualize the facade. To reconstruct large numbers of buildings with facade details, automatic facade reconstruction methods are needed. Pu et al. [2] developed a bottom-up method for facade reconstruction from terrestrial laser scanning data. It is based on a segmentation into planes. The planes are labelled as wall, roof, door and extrusion by feature constraints. These are formulated like 'a roof is higher than a wall' or 'a wall is a vertical plane'. Windows are extracted as holes in the plane in a subsequent step. Another facade reconstruction method from range and image data is presented by Becker et al. [3]. They assume that a coarse model of the building and a registered point cloud are given. The facade of the building model is subdivided into cells according to the point density. The density also gives information about whether a cell is window or wall. In a further step the reconstructed windows can be refined by image information. Both methods use the fact that the point density on windows is much smaller than on the facade wall. A more knowledge-based approach for facade reconstruction from images is given by Reznik et al. [4]. They build implicit shape models from a set of training data and use them to find window centres in the facade. The window model is then determined by an MCMC method. Dick et al. [5] present a method for building reconstruction from images. They reconstruct the building model from small modules like in a 'lego' kit,
where the procedure is guided by a stochastic process. Van Gool et al. [6] give an overview of building reconstruction and emphasise the use of structure information especially for facade reconstruction. In our approach we want to use knowledge about the structure to support the reconstruction process. Facades show regular structures like grid patterns, symmetries or repetitions which make it easier to reconstruct the facade elements. For structure detection, Liu et al. [7] present a method which finds repetitions in images in one and two dimensions by analysing frieze and wallpaper groups. Pauly et al. [8] find regular structures in point clouds. They search for pairwise similarity transformations between single patches and determine the parameters by ICP. In all resulting transformations they find a grid structure. With this method it is not possible to find superimposed symmetries. To detect these, Bokeloh et al. [9] develop a method which is based on line features. These features are extracted from the data and a neighbourhood graph is built. With this graph symmetries are detected on the lines and the results are verified with the point cloud. While these methods detect general structures, it is our goal to obtain a solution specifically targeted at facade structures, using a formal grammar.
2 Facade Grammar
If we look at building facades, we see regularities in the arrangement of windows. These are grid structures, symmetries and repetitions. To describe these structures we use a formal grammar, which is a 4-tuple G = (T, N, S, P). The terminal symbols T and the non-terminal symbols N build the alphabet of the grammar. S is the start symbol, a non-terminal symbol from which all derivations start, and P is a set of production rules. Grammars can be divided into classes of the Chomsky hierarchy depending on the form of the production rules. Grammars of type 2 in this hierarchy are called context-free grammars. Their production rules have the form N → (T N)+. This means that a non-terminal symbol on the left hand side can be replaced with a number of terminal and non-terminal symbols. All words that can be derived from S with rules from P build the language L(G) of the grammar G. In a facade structure, windows are often arranged in a grid and the windows in a row or column have the same size. Looking at the whole facade, symmetries or repetitions occur. And if there are parts which differ in structure, they can often be subdivided into ground floor and upper parts. To be able to model these cases our facade grammar G_F contains the following symbols. The start symbol is the symbol Facade; it is a rectangular facade with no information but its width and height. Further structure-free symbols are PartFacade, SplitPartFacade, FacadeRow, FacadeColumn and FacadeElement. They represent parts of the facade which may be larger parts, a row, a column or a small part which contains only one facade element. A facade element can be one of the symbols Window, Door, ShopWindow or Doorway. These are terminal symbols of the grammar. Besides these symbols there are further symbols containing structural information. These are different kinds of grids and SymmetricFacadeSide and
RepeatedFacade, which model the above mentioned facade structures. The grids contain facade elements like windows or doors in each cell. Different grid symbols are distinguished by two properties. The first is the distance in x direction, which can be equidistant or with varying distances. The second property determines whether they contain identical or different kinds of windows. This results in four kinds of grids, which additionally exist with and without a door. One symbol is for example RegularIdenticalWindowDoorGrid. In the derivation process the model of the facade is developed further in each step. Each rule splits the part of the facade corresponding to the left hand side symbol into a variable number of facade parts corresponding to the right hand side symbols. The grammar contains two different kinds of splits. The first is a split into multiple symbols. It is caused by differences in the facade structure, and each part is modelled individually in the next steps. The other kind of split is a split into similar regions modelled by one symbol, e.g. RepeatedFacade. If a facade is symmetric or contains repetitions, the repeated pattern needs to be stored and modelled further only once. Additional information like the number of repetitions completes the model. The grammar currently contains 45 rules. We don't want to list all rules here, but give an idea of the main rule groups; a small illustrative sketch follows below. The first group divides the facade into two parts. These are FacadeRow or FacadeColumn and PartFacade, which can also be split by a FacadeColumn and be a SplitPartFacade. If the structure in a FacadeRow is very irregular it can be replaced by several FacadeElements, which can contain distinct terminal symbols, which are Windows, Doors, ShopWindows and Doorways. The structure symbols RepeatedFacade and SymmetricFacadeSide can replace the symbols Facade and PartFacade, and optionally there is a FacadeColumn in the centre of a symmetric facade. The grid rules are very important in the facade grammar. They replace different symbols like Facade, SymmetricFacadeSide, SplitPartFacade etc. with one of the grid symbols.
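As an illustration only (the symbol names and rules below are simplified stand-ins, not the authors' full 45-rule grammar), a context-free facade grammar can be represented as a start symbol plus a rule table:

```python
# Hypothetical, simplified subset of a facade grammar G = (T, N, S, P).
TERMINALS = {"Window", "Door", "ShopWindow", "Doorway"}

START = "Facade"

# Each rule maps a non-terminal to one possible right-hand side
# (a sequence of terminal / non-terminal symbols).
RULES = {
    "Facade": [
        ["FacadeRow", "PartFacade"],        # split off the ground floor
        ["SymmetricFacadeSide"],            # symmetric facade
        ["RegularIdenticalWindowGrid"],     # plain window grid
    ],
    "PartFacade": [
        ["RepeatedFacade"],
        ["FacadeColumn", "SplitPartFacade"],
    ],
    "FacadeRow": [
        ["FacadeElement", "FacadeElement", "FacadeElement"],
    ],
    "FacadeElement": [["Window"], ["Door"], ["ShopWindow"], ["Doorway"]],
}

def is_context_free(rules):
    # Type-2 rules: a single non-terminal on the left, a non-empty symbol string on the right.
    return all(lhs not in TERMINALS and len(rhs) > 0
               for lhs, options in rules.items() for rhs in options)

print(is_context_free(RULES))  # True
```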
3 Reconstruction Process
The reconstruction process should guide the derivation of a facade model automatically to the model which fits the data best. This means we search for the model with the highest probability p(θ|D) given the data. The data we usually use are range data D_S and a rectified image of the facade D_I. For the structure reconstruction in eTRIMS image data, the input is limited to image data. The vector θ encodes the derivation tree of the current state and additional attributes. The probability p(θ|D) is unknown, and to sample from an unknown distribution we use the Markov Chain Monte Carlo (MCMC) method. A Markov Chain is used to simulate a random walk in the space of facade models. The transition probabilities from one model to another are given in a transition kernel J. The probability to propose a move from θ_{t−1} to θ_t is given by J(θ_t | θ_{t−1}). When a new model is proposed, the acceptance probability decides whether the change is accepted. This acceptance probability is chosen in a way that the process converges to the target distribution p(θ|D). The basic MCMC is designed
for parameter vectors with constant size. Since in our case the dimension of θ changes during the process, we use reversible jump MCMC (rjMCMC) [10], which allows changes in the dimension of θ (so-called jumps). The probability of a change in dimension is integrated into the transition kernel. For the rjMCMC process with target distribution p(θ|D) we have to define a transition kernel J(θ_t | θ_{t−1}) and the acceptance probability α. For more details on the transition kernel see [11].

α = \min\left(1, \frac{p(θ_t | D) \cdot J(θ_{t−1} | θ_t)}{p(θ_{t−1} | D) \cdot J(θ_t | θ_{t−1})}\right)

The transition kernel J(θ_t | θ_{t−1}) is given, but the acceptance probability also depends on the unknown distribution p(θ|D). According to Bayes' theorem, p(θ|D) is proportional to the product of likelihood and prior of the facade, p(D|θ) · p(θ).
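A generic Metropolis–Hastings acceptance step of this form (not the authors' rjMCMC implementation, which additionally handles dimension changes and the facade-specific kernels) looks roughly as follows:

```python
import math
import random

def mh_step(theta, log_target, propose):
    """One Metropolis-Hastings step.

    log_target(theta) : log of the (unnormalized) target p(theta | D)
    propose(theta)    : returns (theta_new, log_q_forward, log_q_backward)
    """
    theta_new, log_q_fwd, log_q_bwd = propose(theta)
    # alpha = min(1, p(new) q(old|new) / (p(old) q(new|old))), computed in log space
    log_alpha = (log_target(theta_new) + log_q_bwd) - (log_target(theta) + log_q_fwd)
    if math.log(random.random()) < min(0.0, log_alpha):
        return theta_new   # accept the proposed model
    return theta           # reject and keep the current model
```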
3.1 MDL
To determine the likelihood p(D|θ) we use the minimum description length (MDL) concept introduced by Rissanen [12]. This is often used for hypothesis selection; it selects the hypothesis with the smallest sum of the description of the hypothesis and the description of the data with respect to the hypothesis. For our reconstruction process the MDL has two benefits. First, it gives us a method to compute a score for how well the data fit the model, and second, this method takes the complexity of the model into account. The second point is important because, if the model complexity were ignored, simple facades could be modelled by superfluously complex structures. A facade with a simple regular grid should not be modelled as a symmetric or repeated facade, which would otherwise be a possible model even for a simple facade. According to MDL, the best model θ to explain the data D is the model with the smallest length L(θ) + L(D|θ), where L(θ) is the length of the description of the model and L(D|θ) the length of the description of the data D using the model θ. So we have to determine the code length of the model and of the data given a model. Information theory says that a code exists with length L(θ) = − log P(θ) resp. L(D|θ) = − log P(D|θ). Because it is impossible to determine P(D|θ) directly, since huge training data sets would be required, we use another way to determine the code length. This is done analogous to the line description by Fua and Hanson in [13]. We have an image of the size A = n × k and, in distinction from [13], the data contain no grey values but N_C different clusters. In [13] the score function is given by S = F − G, where F = L(d | no model) − L(d | model) is the encoding effectiveness, which is the number of bits saved by representing the data using the model, and G = − log P(model) is the number of bits used to describe the model.

Code length of the data given a model. If we encode the data without a model, we need L(D | no model) = A · log N_C bits to describe the data. A model
θ given by a derivation tree of the grammar describes the data by specifying which pixels are wall, window or door. To encode the data with such a model we use the following reasoning.
– The pixels split into n_0 outliers and n_1 regular pixels.
– The cost to specify whether a pixel is an outlier or not is given by the entropy E(n_1, n_0) = −(n_1 \log \frac{n_1}{A} + n_0 \log \frac{n_0}{A}).
– To model the outliers we need n_0 log N_C bits.
– For the points fitting the model we need no further information.
A Gaussian noise model like in [13] is not used because it is definite whether a pixel belongs to a cluster or not. This gives us a code effectiveness F = A · log N_C − (n_0 log N_C + E(n_1, n_0)) = n_1 log N_C − E(n_1, n_0). We use an additional scale coefficient s to regulate the scale of the input image. This ensures that the complexity of the data doesn't depend on the size of the input image. If we zoom into an image the code should not get more complex.

Code length of the model. To determine the code length of a model θ we use the prior knowledge that we also need for the rule proposal in the reconstruction process. To determine the code length we need the probability of the model: G = L(θ) = − log P(θ). The model θ is given by a derivation tree containing attributes of the symbols, and the probability of the model is the probability of all used symbols P(θ) = P(S_0, ..., S_n), where the S_i are the symbols in the derivation tree. This is the product of conditional probabilities where all symbols depend on their parents. Fig. 1 shows a sample model θ. The probability of this model without considering the parameters is P(θ) = P(S_0, ..., S_31) = P(S_0) P(S_10 S_11 | S_0) P(S_20 | S_10) P(S_21 S_22 | S_11) P(S_30 | S_21) P(S_31 | S_22). The conditional probabilities in the probability of the model correspond to the grammar rules. The right hand side symbols of a rule depend on the left hand side symbol. So we can use the rule probabilities to determine the code length of the model. Approximations for these probabilities are determined by analysing facade photos and counting different arrangements of facade elements. In addition to the rule probabilities we need the distribution of facade parameters. This was examined in [14] and integrated into the reconstruction process. For each parameter q_ij we sum up − log(P(q_ij)).
Fig. 1. Sample model to illustrate the probability of a derivation tree
The probability of a model θ is explored from the root node of θ over all children recursively: P(θ) = P(root(θ)) · c({root(θ)}) with

c(S) = \prod_{S_i \in S} \Big( P(children(S_i) | S_i) \cdot c(children(S_i)) \cdot \prod_{q_{ij}} P(q_{ij}) \Big),

where q_{ij} is the jth parameter of the children of S_i. This leads to a code length of the model G = L(θ) = − log P(root(θ)) − log c({root(θ)}) and − log c(S) is

− \log c(S) = \sum_{S_i \in S} \Big( − \log P(children(S_i) | S_i) − \log c(children(S_i)) − \sum_{q_{ij}} \log P(q_{ij}) \Big)
Combining model and data. To obtain the complete score function Fua and Hanson use a shape coefficient γ to balance code length of the model and code length of the data description using this model. γ and the scale parameter s must be determined for each application. The resulting score function for the facade reconstruction is

S = \frac{1}{s^2} \big( n_1 \log N_C − E(n_1, n_0) \big) + γ \big( \log P(root(θ)) + \log c(\{root(θ)\}) \big).
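As a rough illustration of the data term and the combined score (a sketch under assumed inputs, not the authors' code; the rule- and parameter-probability lookups are placeholders):

```python
import numpy as np

def data_score(labels_pred, labels_true, n_classes, s=1.0):
    """Encoding effectiveness F = n1*log(N_C) - E(n1, n0), scaled by 1/s^2."""
    n0 = int(np.sum(labels_pred != labels_true))    # outliers (pixels the model gets wrong)
    n1 = labels_true.size - n0                      # regular pixels explained by the model
    A = labels_true.size
    entropy = 0.0
    for n in (n0, n1):
        if n > 0:
            entropy -= n * np.log2(n / A)
    F = n1 * np.log2(n_classes) - entropy
    return F / s**2

def model_score(rule_log_probs, param_log_probs, gamma=1.0):
    """gamma * log P(model): sum of log rule and log parameter probabilities."""
    return gamma * (sum(rule_log_probs) + sum(param_log_probs))

# total score: S = data_score(...) + model_score(...)
```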
4 Reconstruction of eTRIMS Facade Images
To test the generality of the structure detection we evaluate our reconstruction method with facade images from the eTRIMS image database [1]. It provides facade images, an object segmentation and a class segmentation with 4 or 8 classes. We use the 8 class segmentation in which the classes building, car, door, pavement, road, sky, vegetation and window are labelled. The eTRIMS database contains presently 60 annotated facade images and it will be extended. Because our approach is basically developed for range and image data we have to do some pre-processing to use the annotated images. The images are rectified and cut to size manually. Afterwards the scale is estimated by the number of floors. After the pre-processing we run our method with all facades in the database. Fig. 2 shows some images with the determined facade structure. In some cases the structure detection didn’t work as well as for the facades shown in fig. 2. This might be the case because the structure given by the facade is not contained in the facade grammar. Fig. 3 shows two of these facades. In the left example the structure in the ground floor is not labelled correctly in the lower part. There are shop windows and garages which belong to no class in the eTRIMS data. So they are labelled as wall. But the grammar contains no structure with only a door in the ground floor. So the reconstruction process
Fig. 2. Results of the structure reconstruction
Fig. 3. Sample facades and class segmentation with erroneous reconstruction
invents a window row according to the structure in the upper parts. The example on the right hand side shows a facade with a window grid in which one window is missing. This is presently not modelled in the grammar, so the reconstruction process invents an additional window where no window exists. Even though the structure is not recovered correctly in some cases, the benefit of using structure information can be seen especially with occlusions. Fig. 2 shows a facade (bottom right) in which windows are occluded by trees. The knowledge about facade structures enables the reconstruction process to model these windows anyhow. Because we have no measure for the reconstruction result, we divide the results manually into five groups. 33.9% of the facades are correctly reconstructed (A), 25.0% of the reconstructions have errors in structure but well placed elements (B), 14.3% of the facades are correctly reconstructed in parts (C), 5.4% of the facades have correctly reconstructed structure but errors in element placement (D), and 21.4% are wrongly reconstructed (E). To evaluate the results in more detail we examine the reconstructions pixel-based. First we illustrate the reconstruction results on one example facade. The rectified class segmentation image is given in fig. 4a and the reconstruction in 4b. The image contains 108810 pixels.
                       Wall   Window  Door
 correctness (correct)  0.855  0.834   0.606
 correctness (neutral)  0.093  0.026   0.096
 completeness           0.932  0.861   0.873
Fig. 4. Example facade (a), reconstructed image (b) and table of correctness and completeness of the modelled classes (c)
Table 1 (left) lists the number of pixels for each class combination. Each row represents one of the eight classes in the eTRIMS data or the class 'not labelled', and there is a column for each modelled class, which are wall, window and door. The other classes are not modelled in the grammar. Because of this we define correct, neutral and false matches, which are listed in table 1 (right). Correct matches assign a class to the same class. Occlusions in the image cause neutral matches; e.g. vegetation can be modelled by all three classes in the same way. Car and unlabelled pixels can be found at many places in the image. The classes sky, road and pavement occur at the top and bottom image borders because of an imperfect rectification. We constrain the matches for these classes as seen in table 1 (right). Table (c) in figure 4 shows the correctness and completeness for the three classes in percent. Counting the neutral as positive assignments, 94.8% of wall pixels were wall pixels in the eTRIMS image, and 86.0% of window pixels and 70.2% of door pixels were correctly labelled. Starting from the real classes, 93.2% of the wall pixels got the right label; for the other classes, 86.1% of the window pixels and 87.3% of the door pixels are labelled correctly. Figure 5 shows the correctness and completeness per facade grouped by the reconstruction result. The neutral assignments are included because otherwise

Table 1. Reconstruction result for the facade in fig. 4 (left) and definition of correct, neutral and false matches of classes (right)
Left (pixel counts):
              Wall   Window  Door
 Wall         66524  3882    990
 Window       3727   23121   0
 Door         291    0       2008
 Vegetation   5197   717     0
 Car          0      0       0
 Sky          53     0       0
 Pavement     1689   0       317
 Road         0      0       0
 unlabelled   294    0       0

Right (match definitions, c = correct, n = neutral, f = false):
              Wall  Window  Door
 Wall         c     f       f
 Window       f     c       f
 Door         f     f       c
 Vegetation   n     n       n
 Car          n     n       n
 Sky          n     f       f
 Pavement     n     f       n
 Road         n     f       n
 unlabelled   n     n       n
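A per-pixel evaluation of this kind (correct / neutral / false matches, then correctness and completeness per modelled class) can be sketched as follows; the match table is the reconstructed one above, and this is not the authors' evaluation script:

```python
import numpy as np

MODELLED = ["Wall", "Window", "Door"]

def evaluate(pred, truth, match):
    """pred: per-pixel labels from {Wall, Window, Door};
    truth: per-pixel eTRIMS labels; match[(truth_class, pred_class)] in {'c', 'n', 'f'}."""
    stats = {}
    for cls in MODELLED:
        sel = (pred == cls)
        kinds = np.array([match[(t, cls)] for t in truth[sel]])
        correct = np.mean(kinds == "c") if sel.any() else 0.0
        neutral = np.mean(kinds == "n") if sel.any() else 0.0
        # completeness: fraction of true pixels of this class that received the right label
        completeness = np.mean(pred[truth == cls] == cls) if (truth == cls).any() else 0.0
        stats[cls] = {"correctness": correct, "neutral": neutral,
                      "completeness": completeness}
    return stats
```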
[Fig. 5 bar chart: correctness, completeness and neutral assignments (0–1) per reconstruction group A–E]
Fig. 5. Correctness and completeness of reconstructions grouped by quality of the reconstruction result. A: correctly reconstructed, B: errors in structure but elements well placed, C: well reconstructed in parts, D: structure correctly reconstructed but errors in element position, E: false reconstruction.
the correctness seems very low in images with many vegetation pixels. For the correctly reconstructed facades the correctness lies between 94.1% and 84.6% and the completeness between 93.9% and 81.8%. In group B we also have good results, with a correctness between 92.7% and 79.1% and a completeness between 91.1% and 78.3%. For group C the values depend on the size of the correctly modelled part, and therefore the correctness varies between 87.6% and 71.8% and the completeness between 86.6% and 65.6%. Reconstructions of group D show a correctness between 87.7% and 82.9% and a completeness between 86.6% and 75.6%. And for group E the values vary between 87.8% and 57.3% for correctness and 85.8% and 55.0% for completeness, depending on the number of vegetation pixels. Images with large vegetation areas are difficult to reconstruct.
5 Conclusion
We developed a facade grammar to model facade structures and present an rjMCMC-based reconstruction process to model facades automatically. In this paper we describe an MDL-based scoring function which considers how well the data fit the model as well as the model complexity. We analyse the generality of the approach, and especially of the facade grammar, by reconstructing a large number of facades. The test shows that many facades can be modelled, but some special structures are not considered in the grammar. This gives us important hints for possible extensions by adding grammar symbols and rules.

Acknowledgements. This work was done within the scope of the junior research group 'Automatic methods for the fusion, reduction and consistent combination of complex, heterogeneous geoinformation', funded by the VolkswagenStiftung, Germany.
References 1. Korˇc, F., F¨ orstner, W.: eTRIMS Image Database for interpreting images of manmade scenes. Technical Report TR-IGG-P-2009-01 (April 2009) 2. Pu, S.: Automatic building modeling from terrestrial laser scanning. In: Advances in 3D Geoinformation Systems. Lecture Notes in Geoinformation and Cartography, pp. 147–160. Springer, Heidelberg (2008) 3. Becker, S., Haala, N.: Integrated lidar and image processing for the modelling of building facades. Photogrammetrie Fernerkundung Geoinformation 2, 65–81 (2008) 4. Reznik, S., Mayer, H.: Implicit shape models, self diagnosis, and model selection for 3d facade interpretation. Photogrammetrie Fernerkundung Geoinformation 3, 187–196 (2008) 5. Dick, A.R., Torr, P.H.S., Cipolla, R., Ribarsky, W.: Modelling and interpretation of architecture from several images. International Journal of Computer Vision 60(2), 111–134 (2004) 6. van Gool, L., Zeng, G., van den Borre, F., M¨ uller, P.: Towards mass-produced building models. In: Photogrammetric Image Analysis 2007 (2007) 7. Liu, Y., Collins, R.T., Tsin, Y.: A computational model for periodic pattern perception based on frieze and wallpaper groups. IEEE Transactions on Pattern Analysis and Machine Intellegence 26(3), 354–371 (2004) 8. Pauly, M., Mitra, N.J., Wallner, J., Pottmann, H., Guibas, L.: Discovering structural regularity in 3D geometry. ACM Transactions on Graphics 27(3), 1–11 (2008) 9. Bokeloh, M., Berner, A., Wand, M., Seidel, H.-P., Schilling, A.: Symmetry detection using line features. In: Computer Graphics Forum (Proceedings of Eurographics) (2009) 10. Green, P.J.: Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika 82(4), 711–732 (1995) 11. Ripperda, N.: Grammar based facade reconstruction using rjMCMC. Photogrammetrie Fernerkundung Geoinformation 2, 83–92 (2008) 12. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978) 13. Fua, P., Hanson, A.: Objective functions for feature discrimination. In: Proceedings of the Eleventh International Joint Conference on Artificial Intellegence, pp. 1596– 1602 (1989) 14. Ripperda, N.: Determination of facade attributes for facade reconstruction. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 37, pp. 285–290 (2008)
Using Lateral Coupled Snakes for Modeling the Contours of Worms
Qing Wang (1,2), Olaf Ronneberger (1,2), Ekkehard Schulze (3), Ralf Baumeister (2,3), and Hans Burkhardt (1,2)
(1) Chair of Pattern Recognition and Image Processing, Department of Computer Science, University of Freiburg
(2) Centre for Biological Signalling Studies (bioss), University of Freiburg
(3) Faculties of Biology and Medicine, FRIAS LIFENET, ZBSA, University of Freiburg
[email protected]
Abstract. A model called lateral coupled snakes is proposed to describe the contours of moving C. elegans worms on 2D images with high accuracy. The model comprises two curves with point correspondence between them. The line linking a corresponding pair is approximately perpendicular to the curves at the two points, which is ensured by shear restoring forces. Experimental proofs reveal that the model is a promising tool for locating and segmenting worms or objects with similar shapes.
1 Introduction
Active contour models (snakes), introduced by Kass et al. in 1987 [1], are dynamic curves that evolve to minimize the energy functional, which is composed of internal and external parts. The internal energy defines the physical property of the curve and serves to impose regularity constraints on the snake. The external energy is calculated from the image and gives rise to forces that push the snake toward the desired image features such as lines and edges. A snake must be initialized near the real boundary to ensure convergence. To overcome this limitation, Cohen added a pressure force to the model so that it acts like a balloon [2]. Xu and Prince proposed a new type of external force: the gradient vector flow (GVF), which is computed as a diffusion of the gradient of the edge map derived from the image [3]. GVF snakes are insensitive to initialization and can be attracted to boundary concavities. Neither balloon forces nor GVF can be derived from a potential. The models must now be formulated directly in terms of forces instead of energies. Snakes are an example of deformable models, the essence of which is to take account of prior information about the objects and the image contents simultaneously. With more (correct) prior knowledge being incorporated, the results can become more reliable and robust. In this paper, we will show a model developed for describing and finding the exact contours of Caenorhabditis elegans. The model will be introduced mainly in the force-balance framework. The roundworm C. elegans is one of the prime model organisms in medical biology, and is used e.g. for the functional analysis of human disease genes. The
small size and plethora of manipulations possible with this animal allow the setup of large-scale genetic and pharmacological analysis with C. elegans. Automated readouts, e.g., the measurement of behavioral differences of worms being treated are essential for high-throughput experiments. Appropriate description of the shape and position of a worm is crucial towards successful behavior analysis. In the state-of-the-art system for C. elegans movement analysis [4], the position of a worm is represented by the skeleton of its binary image, obtained through thresholding and morphological operations. This approach is simple but has low accuracy and may fail when a worm forms a loop while moving (see Fig. 4). On 2D microscopic images, the contour of a worm consists of two sides that are almost parallel to each other but meet at the ends. Based on the physical property of the worm body, we have developed a model for describing its contour using its internal coordinates. The model is made up of two curves and each point on one curve corresponds to a point on the other. The line linking a corresponding pair is approximately normal to the two curves at the two points, which is ensured by shear restoring forces. The two curves support and constrain each other through the coupling. We call this model Lateral Coupled Snakes (LCS). As will be shown in the experiments, the LCS model yields highly accurate description of worm contours. A model called ribbon snake [5] has similar applications as the LCS model. It is made up of only one snake with each point associated with a width. It has been applied to road extraction from aerial imagery [6]. Despite the similarity, the two models differ in many ways. Unlike ribbon snakes, with LCS, the external forces act directly on the model points and the regularity of the two sides is directly controlled. The two models therefore often have different behaviors, especially near the ends. There is a mechanism for the LCS model to grow naturally along the object boundary. When extended to 3D, the two models will have different extensions. In addition, we have studied how to employ the LCS model to find the contours of worms from bad starting points. The technique can be used for segmenting other elongated objects with two almost parallel sides. This work, as far as we know, has not been done with the ribbon snake model. In the very first paper on snakes [1], coupling was already proposed for a pair of stereo snakes for surface construction. The additional energy is defined as a function of the distance of snake contours on the left and right images. In [7], it is proposed to impose epipolar geometry constraints to a pair of stereo snakes to achieve consistency and robustness in stereo tracking. The dual active contour technique [8] was developed to overcome the problems of sensitivity to initialization and parameters. With this technique, one contour expands from inside the target features, the other contracts from the outside. The two contours are interlinked to provide an ability to reject “weak” local energy minima. Hohenhuser and Hommel [9] put these models as well as the original snakes into a general framework and called it coupled active contour. They then used two snakes which are interlinked beyond different image domains (intensity and dense range images, more specifically) for segmentation. Our model also falls largely into this framework. What makes it distinct is the property of the coupling.
2 The Model
The LCS model is made up of two curves x_0(s) and x_1(s), s ∈ [0, 1], with

x_k(s) = (x_k(s), y_k(s)), \quad k = 0, 1.  (1)

Optionally one can include the constraints

x_0(0) = x_1(0) \quad and \quad x_0(1) = x_1(1)  (2)
to make their ends meet. The line linking the coupled pair x_0(s) and x_1(s) should be approximately perpendicular to the curves at the two points. See Fig. 1a for illustration. Each curve is like a traditional snake and has internal energy

dE_k(s) = \frac{1}{2} \big( α(s) \|x_k'(s)\|^2 + β(s) \|x_k''(s)\|^2 \big) ds,  (3)

where α and β are weight parameters that control the elasticity and rigidity of the curve. The internal energy (3) gives rise to forces

f_k^{elasticity}(s) = \big( α(s) x_k'(s) \big)' \quad and \quad f_k^{rigidity}(s) = −\big( β(s) x_k''(s) \big)''.  (4)
To ensure that the linking line is approximately perpendicular to the curves, we introduce shear restoring forces to resist the shear deformation tan ϑ (see Fig. 1b). For simplicity, first assume the two sides are perfectly parallel at x_0(s) and x_1(s). Let the unit normal vector be n and the unit vector pointing from x_0(s) toward x_1(s) be n_c; the shear restoring forces can be written as

f_0^{shear}(s) = μ(s) \frac{n_c − (n_c · n) n}{n_c · n} \quad and \quad f_1^{shear}(s) = −f_0^{shear}(s),  (5)
where μ is the shear modulus. When the two sides are not perfectly parallel, the mean normal direction should be used in place of n. Above the parameters α, β and μ are written as functions of s for flexibility. For common applications, it is usually enough to assign constant values for the whole model. The same also applies to other weight parameters introduced later.
Fig. 1. (a) LCS model with meeting ends. (b) Shear restoring forces.
One can also apply balloon forces to the sides or introduce new types of forces that are defined on coupled pairs. Some examples of the latter are:
1. Forces that keep the width varying smoothly. Define the width at s as w(s) = ‖x_1(s) − x_0(s)‖. The forces are proportional to w'(s) and act along the coupling direction.
2. Repulsion and attraction between a coupled pair. They can be used to control the allowed maximal and minimal widths.
Forces on the ends need to be treated specially, which we defer until section 2.2, after the discrete formulation of the model is given. To deal with the self-occlusion problem as shown in Fig. 3, forces between different parts of the model are necessary to prevent the two tips of the model from occupying the same position. Due to limitation of space, we will not elaborate on them.

2.1 Discrete Formulation
When formulated discretely, curves are given by a set of ordered points. Suppose there are l + 1 points for each snake. The LCS model is represented by

\{ x_k(i) \mid i = 0, 1, \dots, l \text{ and } k = 0, 1 \}.  (6)

The formulas of the continuous model imply natural parameterization of the curve, which means that the points should be evenly spaced along the curve. For the LCS model, we define

d_k(i) = \| x_k(i) − x_k(i−1) \|, \quad i = 1, \dots, l \text{ and } k = 0, 1,  (7)

and require that (d_0(i) + d_1(i))/2 = h be constant with regard to i after reparameterization. We call the constant h the spacing. As the d_k(i) are not constant, we need to be careful with the elasticity and rigidity forces. To ensure that the parameters α and β can be chosen largely invariant to the spacing, we define

a_k(i) = \frac{x_k(i) − x_k(i−1)}{d_k(i)}, \quad i = 1, 2, \dots, l  (8)

b_k(i) = \frac{a_k(i+1) − a_k(i)}{(d_k(i+1) + d_k(i))/2}, \quad i = 1, 2, \dots, l−1.  (9)
The elasticity force on point x_k(i) (i = 1, 2, ..., l−1) is formulated as

f_k^{elasticity}(i) = \frac{1}{h} \big( α(i+1) a_k(i+1) − α(i) a_k(i) \big)  (10)
and the rigidity force on point x_k(i) (i = 2, 3, ..., l−2) is given by

f_k^{rigidity}(i) = \frac{1}{h^2} \big( 2 β(i) b_k(i) − β(i−1) b_k(i−1) − β(i+1) b_k(i+1) \big).  (11)
To calculate the shear restoring forces, we first find the tangential direction of the curves at x_k(i):

T_k(i) = N\big( a_k(i) + a_k(i+1) \big)  (12)

where N is the normalization operator defined as N v ≡ v/‖v‖ for any nonzero vector v. As T_0(i) usually differs somewhat from T_1(i), we use their mean direction as the effective tangential direction

t(i) = N\big( T_0(i) + T_1(i) \big).  (13)

The effective normal direction n(i) can be obtained by turning t(i) by π/2. The unit vector pointing from x_0(i) toward x_1(i) is

n_c(i) = N\big( x_1(i) − x_0(i) \big).  (14)

Given n(i) and n_c(i) we are able to calculate the shear restoring forces using (5).
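A compact NumPy sketch of these discrete internal forces (illustrative only; boundary handling and the end forces of section 2.2 are omitted, and the parameter values are placeholders):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def internal_forces(x, alpha=0.06, beta=4.0, h=1.0):
    """x: (l+1, 2) array of points of one snake side.
    Returns elasticity and rigidity forces for the interior points."""
    diff = np.diff(x, axis=0)                       # x_k(i) - x_k(i-1)
    d = np.linalg.norm(diff, axis=1)                # d_k(i), Eq. (7)
    a = diff / d[:, None]                           # a_k(i), Eq. (8)
    b = np.diff(a, axis=0) / ((d[1:] + d[:-1]) / 2.0)[:, None]   # b_k(i), Eq. (9)

    f_elastic = alpha * np.diff(a, axis=0) / h                   # Eq. (10), constant alpha
    f_rigid = -beta * (b[2:] - 2 * b[1:-1] + b[:-2]) / h**2      # Eq. (11), constant beta
    return f_elastic, f_rigid

def shear_forces(x0, x1, mu=1.6):
    """Shear restoring forces acting on side 0, Eq. (5)."""
    t = normalize(normalize(np.gradient(x0, axis=0)) + normalize(np.gradient(x1, axis=0)))
    n = np.stack([-t[:, 1], t[:, 0]], axis=1)       # rotate tangent by pi/2, Eq. (13)
    nc = normalize(x1 - x0)                         # Eq. (14)
    dot = np.sum(nc * n, axis=1, keepdims=True)
    return mu * (nc - dot * n) / dot
```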
Forces on the Ends and Their Neighbors
The forces on the ends and the points next to them need to be treated separately. In the following we only give the forces on xk (0) and xk (1). Forces on xk (l − 1) and xk (l) can be derived accordingly. The elasticity forces given in (4) make a snake contract continuously, which is usually undesired but sometimes necessary (see section 2.4). Depending on the situation, the elasticity forces on the ends can be set to 0 or be kept as fkelasticity (0) =
1 α(1)ak (1) . h
(15)
As no rigidity energy is associated with the ends, the rigidity forces on xk (1) and xk (0) are 1 2β(1)bk (1) − β(2)bk (2) h2 1 fkrigidity (0) = − 2 β(1)bk (1) . h fkrigidity (1) =
(16) (17)
When the ends of the two sides meet, i.e. when x0 (0) = x1 (0), the elasticity and rigidity forces applied to an end will be from both sides. For example, f rigidity (0) = f0rigidity (0) + f1rigidity (0) .
(18)
Usually the model near the ends takes a convex form like in Fig. 2. The sum of the rigidity forces on an end thus often has a longitudinal component pointing forward, which enables the model to grow along the object boundary. When this is not desired, the longitudinal component should be set to 0. The lateral component shall always be kept so that the end can be swung into the correct position. To make the model grow along the boundary of the object, one can also assign a dragging force along the longitudinal direction on the end: f drag (0) = −γ(0) N a0 (1) + a1 (1) (19) where γ is the weight parameter for the dragging forces. It should be small enough compared to the image force parameter (will be given in section 2.3) so that the model can be attracted to the desired positions while growing.
Using Lateral Coupled Snakes for Modeling the Contours of Worms
2.3
547
Evolution and Multiscale Approach
Image forces are calculated from the image. For the experiments in this paper, we use the GVF derived from the gradient magnitude of the image as the image force field. Let the GVF field be g(x), the external force acting upon a model point is then fkimage (i) = κ(i)g(xk (i)) (20) where i = 1, 2, · · · , l − 1 and κ is the weight parameter for the external forces. The net force on a model point is the vector sum of all the internal and external forces. Let it be fk (i). To enable the model to reach a force balance state, its movement is defined by x˙ k (i, t) = fk (i, t)
(21)
where k = 1, 2 and i = 0, 1, · · · , l. This system of equations can be used directly to update the model. To free ourselves from the problem of adjusting timesteps at development phase, however, we use the ordinary differential equation (ode) solver from the gnu scientific library (gsl) [10] for updating. It is computationally efficient to take a multiscale approach for the model evolution, with coarser scales for faster convergence followed by finer scales for better localization. The scale of the external force field is controlled by the size of smoothing kernels for calculating edge maps and numbers of iterations for computing GVF. The scale of the model is represented by its spacing h. 2.4
LCS Model for Object Segmentation
The LCS model is more powerful than what it was designed for. Since the two coupled curves support, constrain and provide information to each other, the model is rather robust. It can be used for searching and segmenting worms or objects of similar shapes with initializations far away from the real positions. If it is possible to tell whether an end is clearly outside of the object (the LCS model should make the determination easier), one can follow three steps for segmentation (see Fig. 5 for an example):
Fig. 2. The model near an end often takes a convex form, enabling the sum of the rigidity forces on the end to point outward largely along the longitudinal direction
548
Q. Wang et al.
1. Alignment. The model aligns to the the object boundary. A coarse scale can be used for fast convergence. The elasticity force should be allowed for an end when it lies far outside of the object. Weak balloon forces are used to overcome the attraction of edges from inside the object. Balloon forces also usually ensure that the ends are in a convex form. The rigidity force on an end can push the end growth and should be allowed when the end is not far outside of the object. 2. Length Adaption. When the model only covers a part of the object as a result of the first step, dragging forces are exerted on one or both ends for the model to grow until it covers the whole object. A finer scale can be used for this step. 3. Relaxation and exact localization. Dragging forces and balloon forces should be reduced to a minimum. Fine scale is used for this step.
3
Experiments
Data and preprocessing: In the recording setup, worms move freely on an agar plate under a microscope. Images with the size of 1392 × 1040 pixels are taken by a camera at 3 frames per second. The resulting pixel size is around 10 μm. The images undergo brightness stabilization and shading correction. For initialization of the LCS model, binary masks of worms are obtained by thresholding.

Description of worm shapes: The model is initialized as close to the desired position as possible. If there is no loop present (which can be determined by a simple analysis of the binary mask), the boundary of the binary mask is carefully analyzed and corresponding pairs are searched based on the perpendicular requirement of LCS. These pairs, combined with interpolation and estimation, initialize the model. In the presence of loops, the model is initialized with the result of the preceding frame. In this case, the model usually does not align completely with the object and it actually takes the steps described in section 2.4 for the model to localize. As shown in Fig. 3, the model finds the correct boundaries and describes worm shapes with high accuracy. The position of a worm is often represented by its longitudinal center line. In [4], the skeleton of the binary mask is taken as the center line. With LCS, the center line is just given by

\{ (x_0(i) + x_1(i))/2 \mid i = 0, 1, \dots, l \}.  (22)

We compare the center lines obtained by the two methods in Fig. 4. LCS yields smoother and more accurate center lines and can handle the situation when a worm forms a loop.

Location and segmentation with rough initialization: LCS models can also be used for segmentation with an initialization that is quite different from the true boundary. In Fig. 5, the model is initialized as a straight one, overlapping only a little of the object. When a part of one side aligns with the true boundary, the corresponding part on the other is dragged or pushed to the correct position.
Fig. 3. LCS model with close initializations. Scale bar in (1a) corresponds to 20 pixels, or about 200 μm. The first column shows images of a worm in temporal order cropped from a video clip. The second column shows the initialization positions of the LCS model. The last column gives the final contours. In the first row, there is no loop, the model is initialized by analyzing the binary mask. In the other rows, the model is initialized with the result of the preceding frame and we take a multiscale approach for the model to evolve. The main weight parameters are set as α = 0.06, β = 4.0, μ = 1.6 and κ = 1.0.
Fig. 4. Center lines of worms found by skeletonizing the binary masks (1b and 2b) and by deriving from the LCS model (1c and 2c), respectively
Fig. 5. Evolution of LCS model with a straight initialization. Scale bar in (a) corresponds to 20 pixels, or about 200 μm. (a) Initialization. (b)-(f) Intermediate steps of alignment. (g) The length having been adapted. (h) Relaxation and exact localization. The inner image in (h) shows detailed contour with no linking lines being displayed. With this example, the second step as described in section 2.4 is actually not necessary. The main weight parameters are set as α = 0.06, β = 4.0, μ = 1.6 and κ = 1.0.
4 Conclusion and Outlook
The proposed lateral coupled snake model can describe shapes and positions of worms in their internal coordinates with subpixel accuracy. It also proves to be a promising tool for segmenting elongated objects with two sides that are largely parallel. The evolution of the model in this paper makes use of the gsl ode solver. The alignment process shown in Fig. 5 takes less than 10 seconds on a mainstream PC. We will test whether a direct update with timestep adaptation schemes tailored for the model can reduce the computation time. A future direction of our work is to extend the model to 3D so that it can be applied to pipe-like objects. The ideas of coupling snakes through shear restoring forces and using forces on the ends flexibly shall also be useful there.
Acknowledgment This study was supported by the Excellence Initiative of the German Federal and State Governments (EXC294, bioss, FRIAS) and FRISYS (funded by BMBF FORSYS).
References 1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988) 2. Cohen, L.D.: On active contour models and balloons. Computer Vision, Graphics, and Image Processing. Image Understanding 53(2), 211–218 (1991) 3. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. IEEE Transactions on Image Processing 7(3), 359–369 (1998) 4. Geng, W., Cosman, P., Berry, C., Feng, Z., Schafer, W.: Automatic tracking, feature extraction and classification of c. elegans phenotypes. IEEE Transactions on Biomedical Engineering 51(10), 1811–1820 (2004) 5. Fua, P.: Model-Based Optimization: An Approach to Fast, Accurate, and Consistent Site Modeling from Imagery. In: Firschein, O., Strat, T.M. (eds.) RADIUS: Image Understanding for Intelligence Imagery, pp. 903–908. Morgan Kaufmann, San Francisco (1997) 6. Mayer, H., Laptev, I., Baumgartner, A.: Multi-scale and snakes for automatic road extraction. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 720–733. Springer, Heidelberg (1998) 7. Cham, T., Cipolla, R.: Stereo coupled active contours. In: Proc. CVPR, pp. 1094– 1099 (1997) 8. Gunn, S.R., Nixon, M.S.: A robust snake implementation; a dual active contour. IEEE Trans. Pattern Anal. Mach. Intell. 19(1), 63–68 (1997) 9. Hohnhuser, B., Hommel, G., Berlin, T.U., Iv, F.: 3D pose estimation using coupled snakes. J. WSCG 12, 1213–6972 (2003) 10. gsl: http://www.gnu.org/software/gsl/
Globally Optimal Finsler Active Contours
Christopher Zach, Liang Shan, and Marc Niethammer
University of North Carolina at Chapel Hill

Abstract. We present a continuous and convex formulation for Finsler active contours using seed regions or utilizing a regional bias term. The utilization of general Finsler metrics instead of Riemannian metrics allows the segmentation boundary to favor appropriate locations (e.g. with strong image discontinuities) and suitable directions (e.g. aligned with dark to bright image gradients). Strong edges are not required everywhere along the desired segmentation boundary due to incorporation of a regional bias. The resulting optimization procedure is simple and efficient, and leads to binary segmentation results regardless of the underlying continuous formulation. We demonstrate the proposed method in several examples.
1 Introduction
Image segmentation is one of the fundamental tasks in low level vision. In order to obtain general, efficient and globally optimal methods we focus on approaches using only local image information and disregard methods e.g. incorporating global foreground and background statistics leading to non-convex minimization tasks. We can identify several local influences determining the segmentation boundary between the foreground object and the background:
1. A regional bias, which favors either the foreground or background at particular image locations. The regional bias can be arbitrarily computed, and is understood as the log-likelihood ratio between the object and background probabilities.
2. A regularization force preferring smooth segmentation boundaries. We focus on regularizing the length (or area) of the segmentation boundary as induced by the underlying metric. This metric can be purely Euclidean or Riemannian with weights induced e.g. by strong image discontinuities.
3. A strong orientation force based on the total flux through the segmentation boundary favoring particular local orientations. By the divergence theorem (or by identifying the adjoint operator in the discrete setting) the flux term essentially modifies the regional bias locally by appropriately raising and decreasing the likelihood ratios. We denote the flux term as a strong force, since it is (as regional bias) always active in the energy functional regardless of the obtained segmentation boundary.
4. A weaker orientation force based on asymmetric Finsler metrics also favoring particular orientations but without modifying the regional bias. This term is only in effect at the segmentation boundary and does not contribute to the overall energy otherwise.
Fig. 1. Input image (a), Riemannian metric (GAC) segmentation result (b), flux-based segmentations with increasing weights for the flux term: γ = 1/100 (c), γ = 1/10 (d), and γ = 1 (e); and segmentation result using Finsler metrics (f)
The synthetic example shown in Figure 1 illustrates the differences between the strong (flux-based) and the weaker (Finsler metric) orientation forces. Consider the input image shown in Figure 1(a): assume that the center region of the image is known to be inside the object and the image borders belong to the background (regional bias), and that the segmentation boundary between the objects interior and the background needs to coincide with a bright-to-dark transition in the image. Thus, the middle circle is the desired foreground boundary. A weighted Riemannian metric (which is always centrally symmetric) based on the strength of image edges generally favors the smallest segmentation result (Figure 1(b)). Adding the flux energy with weight γ either gives the Riemannian result, if γ is very small, the intended result for the right choice of γ, or leads to unintended segmentation results also showing spurious foreground pixels for large γ (due to the strong influence of the flux term on the regional bias, Figure 1(c–e)). Using a Finsler metric with weights as described in Section 3 yields the desired segmentation result (Figure 1(f)). Spurious foreground regions are never generated by Finsler metrics in locations without an appropriate regional bias.
2 Background
Segmentation boundaries generally coincide with strong edges in the source image, and a suitable weighting of the boundary term based on image gradient magnitudes leads to geodesic active contours (GAC) [1] and surfaces [2], optimizing an energy functional of the form

    E(S) = ∫_S w(S) dS,    (1)

where S represents the contour/surface with appropriate parametrization. In order to avoid the trivial solution with vanishing S, suitable endpoint or seed region constraints are required if a globally optimal solution is sought. A common choice for the weight function w is

    w = 1 / (1 + α‖∇I^σ‖^p) + ε,    (2)

where I^σ is a denoised (smoothed) version of the input image I, p is a shape parameter (usually p = 1 or p = 2), and α, ε > 0. If seed regions definitely
belonging to the foreground and the background are known, then minimizing Eq. 1 corresponds to separating the seed regions with minimal boundary costs. Globally optimal minimizers for this segmentation task can be found using combinatorial methods [3] and a continuous formulation [4]. The energy in Eq. 1 only attracts the segmentation boundary to favored locations, e.g. with large image gradients, but does not lead to preferred local boundary orientations. Such a preference for particular directions can be achieved by using a flux-based term [5,6], or by utilizing a position and direction dependent weighting function. Utilizing Finsler metrics for tractography is proposed in [7,8], where the isotropic weighting function w(·) in Eq. 1 is replaced by w(S, N(S)) (with N(S) denoting the normal direction to the curve S):

    E(S) = ∫_S w(S, N(S)) dS.    (3)
Since the desired result is a curve in higher dimensions, a dynamic programming approach is employed to determine the minimizer of Eq. 3 (subject to endpoint constraints). The weight function w is not required to be a convex function, but the solution procedure implicitly convexifies w. Our proposed method (Section 3) can be understood as a globally optimal approach for Finsler active contour segmentation with (optional) region-based terms. Kolmogorov and Boykov [9] present a globally optimal Finsler active contour approach based on a graph cut construction. Finsler metrics are discretized and approximated by a symmetric (Riemannian) part and an anti-symmetric, flux-based term. The latter term poses a problem (possibly leading to spurious foreground objects) if region-based likelihoods are added to the energy (recall Figure 1(e)). The two-phase Chan-Vese energy (also known as active contours without edges) [10] combines regional foreground and background likelihoods with boundary regularization,

    E(A, ρ_F, ρ_B) = Per(∂A) + ∫_A ρ_F dx + ∫_{Ω\A} ρ_B dx    (4)
                   = Per(∂A) + ∫_Ω ρ(x) 1_A dx + const,    (5)
where A is a subset of Ω, Per(∂A) is the length of the boundary of A, and ρ_F and ρ_B are the negative foreground and background log-likelihoods given at every x ∈ Ω. ρ := ρ_F − ρ_B is the log-likelihood ratio. Remarkably, this model does not strongly rely on distinctive image edges to attract the segmentation boundary. The particular choice of ρ_F = (f − c_1)^2 and ρ_B = (f − c_2)^2 for a given source image f and unknown values c_1, c_2 yields the classic Chan-Vese energy,

    E(A, c_1, c_2) = Per(∂A) + λ ∫_A (c_1 − f(x))^2 dx + λ ∫_{Ω\A} (c_2 − f(x))^2 dx.    (6)
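For concreteness, the regional bias ρ of Eq. 5 under the Chan-Vese choice ρ_F = (f − c_1)^2, ρ_B = (f − c_2)^2 can be assembled as in the following minimal NumPy sketch; the image f, the constants c_1, c_2, and the weight lam are placeholders for illustration, not values used in the experiments below.

    import numpy as np

    def chan_vese_bias(f, c1, c2, lam=1.0):
        """Regional bias rho = lam * (rho_F - rho_B) for the two-phase model
        (Eqs. 5/6): negative where a pixel favors the foreground."""
        rho_f = (f - c1) ** 2            # cost of assigning the pixel to the foreground
        rho_b = (f - c2) ** 2            # cost of assigning the pixel to the background
        return lam * (rho_f - rho_b)     # rho := rho_F - rho_B, up to the constant in Eq. 5

Without any boundary regularization, the labeling that minimizes the regional term alone is simply u = 1 wherever this bias is negative.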
In the following we fix the log-likelihood ratios ρ in advance and optimize only over the set A. Local minimizers of Eq. 5 can be determined e.g. by the level set approach [10]. Chan et al. [11] propose to determine the optimal set A indirectly through u = 1_A:

    E(u) = ∫_Ω ‖∇u‖ + ρ(x) u dx.    (7)
In [11] it is shown that the constraint u ∈ {0, 1} can be replaced by its LP relaxation, u ∈ [0, 1], resulting in a convex minimization problem (for fixed ρ). A globally optimal binary solution can be obtained by thresholding any minimizer of Eq. 7 subject to u ∈ [0, 1]. Bresson et al. [12] extend this result to the case of weighted total variation in order to favor segmentation boundaries at image discontinuities where present. An alternating minimization scheme is proposed, which is based on the relaxation of Eq. 7,

    E(u, v) = ∫_Ω ‖∇u‖ + (u − v)^2/(2θ) + ρ u dx    (8)

subject to v ∈ [0, 1]. This energy is optimized by alternating steps: update u using Chambolle's dual approach for the ROF energy [13], and point-wise minimization for v. Our solution procedure does not rely on such convex relaxations.
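Before turning to the Finsler formulation, the edge weighting of Eq. 2, which is reused by both the Riemannian and the Finsler variants below, is illustrated by the following sketch. The smoothing scale sigma and the constants eps and alpha are illustrative assumptions; only the functional form follows Eq. 2.

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel

    def edge_weight(image, alpha=20.0, eps=1e-3, sigma=1.5, p=1):
        """w = 1 / (1 + alpha * ||grad I_sigma||^p) + eps  (Eq. 2).
        Small values of w mark strong edges, where boundary length is cheap."""
        smoothed = gaussian_filter(image.astype(np.float64), sigma)   # denoised image I_sigma
        gx = sobel(smoothed, axis=1)                                  # horizontal derivative
        gy = sobel(smoothed, axis=0)                                  # vertical derivative
        grad_mag = np.hypot(gx, gy)                                   # gradient magnitude
        return 1.0 / (1.0 + alpha * grad_mag ** p) + eps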
3 Convex Formulation of Finsler Active Contours
In this section we replace the scalar weights in Eq. 7 by position and direction dependent weighting functions. The goal is a formulation incorporating region terms (i.e. forces favoring either foreground or background at particular positions) and boundary terms (forces attracting the segmentation boundary to certain positions with particular orientations). More formally, let (φ_x)_{x∈Ω} be a family of positively 1-homogeneous functions and ρ : Ω → R a data cost function. We search for the minimizer of

    E(u) = ∫_Ω φ_x(∇u) + ρ u dx    subject to u ∈ [0, 1].    (9)
Common choices for φ_x are ‖·‖ (total variation) and w(x)‖·‖ (weighted TV). But φ can be more complex, e.g. an anisotropic version of total variation [14]. Since every φ_x is positively 1-homogeneous, we can write φ_x(ξ) = ‖ξ‖ φ_x(ξ/‖ξ‖). Figure 2(a) and (b) illustrate potential shapes W_φ induced by φ. These shapes are also denoted as Wulff shapes [14]. If Ω is bounded, then the set of functions {u : Ω → [0, 1]} is also bounded, and a global minimum is attained for the convex and continuous functional E(u). Prominent choices for ρ(·) are ρ(x) = λ((c_1 − f(x))^2 − (c_2 − f(x))^2) for the Chan-Vese energy Eq. 6, and ρ(x) = λ if f(x) = 0 and ρ(x) = −λ if f(x) = 1, corresponding to the TV-L1 energy with a binary input image, i.e.

    E(u; f) = ∫_Ω ‖∇u‖ + λ|u − f| dx    subject to u ∈ [0, 1],    (10)
with f : Ω → {0, 1} (see [15]). Allowing ρ to be an extended function ρ : Ω → R ∪ {−∞, +∞} also enables the incorporation of strict constraints u(S) = 1 and
Fig. 2. General Wulff shape (a), the utilized shape for segmentation (b), and its alignment with image gradients (c)
u(T) = 0 for source and sink regions S, T ⊆ Ω. If ρ is zero in Ω \ (S ∪ T), we arrive at a convex formulation of Finsler active contours (Eq. 3). Without the regularization term, an optimal solution is simply given by u = 1_{x : ρ(x) < 0}. The essentially binary nature of solutions of Eq. 9 was already shown for the unweighted total variation [11] and weighted TV [12] (by rewriting the total variation in terms of the level sets of u). We give an alternative proof based on strong duality in convex analysis that directly extends to general families of convex, positively 1-homogeneous functions φ_x:

Theorem 1. Let φ be a positively 1-homogeneous function, and ρ : Ω → R. Then any global minimizer of Eq. 9 can be converted into a purely binary global minimizer by thresholding with an arbitrary value θ ∈ (0, 1).

Proof: Assume u* : Ω → [0, 1] is a minimizer of Eq. 9. The corresponding thresholded binary function û is given by

    û(x) = 1 if u*(x) ≥ θ, and û(x) = 0 otherwise.

First note that ∇û ≠ 0 only at the θ-level set, where it has the same direction as ∇u*. Thus, we can write ∇û = c∇u* (point-wise) for c ≥ 0. The dual energy of Eq. 9 is given by (we omit the straightforward calculation due to lack of space)

    E*(p) = ∫_Ω min(0, div p + ρ) dx,    (11)

which is maximized with respect to a vector field p subject to −p ∈ W_{φ_x}. W_{φ_x} is the convex Wulff shape induced by φ_x. By inserting the respective constraints on u and p using the δ function, the primal and dual energies Eq. 9 and Eq. 11 can be stated as

    E(u) = ∫_Ω φ(∇u) + ρ u + δ_{[0,1]}(u) dx    (12)
    E*(p) = ∫_Ω min(0, div p + ρ) − δ_{W_φ}(−p) dx,    (13)
where we also drop the explicit dependence on x for φ. We employ the KKT conditions to show the optimality of û [16]. Let p* be the corresponding dual solution for u*. The KKT conditions for our particular minimization task are given by

    ∇u* ∈ ∂ ∫ δ_{W_φ}(−p*) dx    and    −div p* ∈ ∂ ∫ ρ u* + δ_{[0,1]}(u*) dx.    (14)

The terms under the integral are independent, hence the KKT conditions can be applied point-wise. Therefore, (u*, p*) are optimal for E(u) (Eq. 9) and the corresponding dual energy if and only if

    ∇u* ∈ ∂ δ_{W_φ}(−p*)    and    −div p* ∈ ∂ (ρ u* + δ_{[0,1]}(u*)).    (15)

First, we show ∇û ∈ ∂ δ_{W_φ}(−p*). The definition of the subgradient reads as δ_{W_φ}(−p*) + (∇û)^T (p − p*) ≤ δ_{W_φ}(−p). Since p* is feasible and the inequality is trivially true for every −p ∉ W_φ, we can assume that −p* and −p are in W_φ, i.e. δ_{W_φ}(−p*) = 0 and δ_{W_φ}(−p) = 0. But

    (∇û)^T (p − p*) = c (∇u*)^T (p − p*) ≤ 0,    (16)

since c ≥ 0 by construction. Hence, ∇û is also a subgradient of δ_{W_φ}(−p*). Next, we prove −div p* ∈ ∂ (ρ û + δ_{[0,1]}(û)). If u* is already either 0 or 1, then û = u* and there is nothing to show. If u* is in the open interval (0, 1), then ∂δ_{[0,1]}(u*) is 0, since δ_{[0,1]}(·) is constant in [0, 1]. Further, the mapping u ↦ ρu is smooth, and the gradient is the only subgradient, i.e. −div p* ∈ {∂(ρu*)} = {ρ}. In order to prove that −div p* is a subgradient of u ↦ ρû + δ_{[0,1]}(û) we have to show that

    ρ û + δ_{[0,1]}(û) − (div p*)(u − û) ≤ ρ u + δ_{[0,1]}(u)    (17)

for every u. But

    ρ û + δ_{[0,1]}(û) − (div p*)(u − û) = ρ û − (div p*)(u − û)    [δ_{[0,1]}(û) = 0]
                                        = ρ û + ρ (u − û)           [−div p* = ρ]
                                        = ρ u ≤ ρ u + δ_{[0,1]}(u).    (18)

Hence, −div p* is a subgradient of ρ û + δ_{[0,1]}(û), thus (û, p*) also satisfies the KKT conditions and û is therefore a global minimizer.

In finite settings where Ω is represented by a discrete grid, simple thresholding also modifies the level lines, i.e. we have only ∇û ≈ c∇u*. Thus, pure thresholding in the discrete setting yields (slightly) inferior energies for û. By utilizing φ(∇u) = max_{p ∈ W_φ}(−p^T ∇u), where W_φ is the convex Wulff shape induced by φ, we rewrite the energy Eq. 9 in a primal-dual setting (omitting the explicit dependence on x):
    E(u) = ∫_Ω max_{p ∈ W_φ} (−p^T ∇u) + ρ u dx    subject to u ∈ [0, 1],    (19)

and the respective gradient descent (in u) and ascent (for p) equations are

    ∂u/∂t = −div p − ρ    s.t. u ∈ [0, 1]
    ∂p/∂t = −∇u           s.t. −p ∈ W_φ    (20)
for the artificial time parameter t. Enforcing the constraints on u and p is simply done by clamping u(x) to [0, 1] and reprojecting p(x) onto the feasible set W_{φ_x}. Standard stability arguments establish the maximal stable timestep τ < 1/√K, where K is the dimension of Ω (i.e. 2 for images). These equations have a similar structure to the continuous maximal flow equations [4], but differ substantially from the solution procedure proposed in [12] for direction independent (isotropic) functions φ_x based on Eq. 8. There is a wide range of possibilities for how to design φ_x (or the respective Wulff shape W_{φ_x}). The Wulff shape depicted in Figure 2(b), composed of a half-circle (with radius 1) and a circular segment (with height w), naturally combines gradient direction with gradient magnitude. The orientation of the shape is aligned with ∇I (Figure 2(c)), and the respective height w(x) is given by Eq. 2 with p = 1. Homogeneous regions (∇I(x) = 0, i.e. w(x) = 1) lead to direction independent perimeter regularization, and strong edges (w(x) ≪ 1) result in low cost if the boundary is locally aligned with the image discontinuity. The situation depicted in Figure 2(c) corresponds to u = 1 representing the foreground, with the shaded region indicating darker image values. This particular shape also allows very simple reprojection operations for p after the gradient ascent update in Eq. 20.
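The update scheme of Eq. 20 is simple to implement. The sketch below uses the isotropic special case in which the Wulff shape W_{φ_x} is the disc of radius w(x) (i.e. weighted TV), so that the reprojection of p reduces to a norm clamp; the asymmetric half-circle-plus-segment shape of Figure 2(b) would replace only that projection step. Grid spacing, step size, and iteration count are illustrative assumptions, not the settings used in the experiments.

    import numpy as np

    def grad(u):
        """Forward-difference gradient with Neumann boundary handling."""
        gx = np.zeros_like(u)
        gy = np.zeros_like(u)
        gx[:, :-1] = u[:, 1:] - u[:, :-1]
        gy[:-1, :] = u[1:, :] - u[:-1, :]
        return gx, gy

    def div(px, py):
        """Discrete divergence, the negative adjoint of grad."""
        dx = np.zeros_like(px)
        dy = np.zeros_like(py)
        dx[:, 0] = px[:, 0]
        dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]
        dx[:, -1] = -px[:, -2]
        dy[0, :] = py[0, :]
        dy[1:-1, :] = py[1:-1, :] - py[:-2, :]
        dy[-1, :] = -py[-2, :]
        return dx + dy

    def segment(rho, w, tau=0.2, iters=2000, theta=0.5):
        """Gradient descent in u / ascent in p for Eq. 20 (isotropic Wulff shape)."""
        u = np.zeros_like(rho)
        px = np.zeros_like(rho)
        py = np.zeros_like(rho)
        for _ in range(iters):
            u = np.clip(u + tau * (-div(px, py) - rho), 0.0, 1.0)   # primal step, clamp to [0,1]
            gx, gy = grad(u)
            px -= tau * gx                                          # dual ascent step
            py -= tau * gy
            scale = np.maximum(1.0, np.hypot(px, py) / w)           # reproject onto ||p|| <= w(x)
            px /= scale
            py /= scale
        return (u >= theta).astype(np.uint8)                        # threshold (cf. Theorem 1)

Combined with a regional bias rho and an edge weight w as in the earlier sketches, segment(rho, w) returns a binary mask obtained by thresholding the relaxed solution, as justified by Theorem 1.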
4 Results

4.1 Histology Segmentation
The prototypical example for Finsler active contours is the segmentation of thick-walled anatomical structures like blood vessels in histology slides, see Figure 3. Given foreground seeds inside the lumen of the artery, geodesic active contours (i.e. with isotropic weighting of the contour length) generally return the inner wall of the artery as the segmentation result (Figure 3(a)). Incorporating knowledge of the expected intensity gradient (going from dark to bright) using the proposed asymmetric Finsler metric as depicted in Figure 2(c) aligns the segmentation boundary with the exterior wall of the vessel, see Figure 3(b).
4.2 Bone Segmentation
We apply the proposed method to a bone segmentation task given MR images of the knee joint. Cortical bone appears black in both T1 and T2 weighted
Fig. 3. Result of vessel segmentation in histology slides using geodesic active contours (a) and Finsler active contours (b). [Best viewed in color. Image data courtesy of Prof. David King, http://www.siumed.edu/∼dking2/crr/.]
MR images, whereas muscles and tissues appear bright. Hence, a proper bone segmentation boundary runs through a dark-to-bright intensity transition. Consequently, the correct segmentation boundary is often not solely induced by the strongest edge in T1 and T2 weighted images. In order to obtain the regional bias ρ, we compute the likelihoods p(I_T1, I_T2 | bone) and p(I_T1, I_T2 | background) for the Bayesian classifier based on a non-parametric estimation of the joint histogram of T1 and T2 intensities for foreground (bone) and background (everything else). Prior to the computation of the histograms, we mask out the image background, followed by MR bias field correction using the MNI's N3 algorithm [17]. The data samples for kernel density estimation to derive the respective probabilities are obtained by a user-guided segmentation of one test case. Figure 4 shows the segmentation results using the isotropic Riemannian metric (GAC) approach (b) and the proposed Finsler metric method (c). Both intensity gradients in the T1 and T2 images are used to obtain the weighting function
Fig. 4. Regional bias—dark regions indicate likely bone structure (a). T2 image and overlaid bone segmentation results using Riemannian metrics (b) and Finsler metrics (c). [Best viewed in color. Image data courtesy of Duke Image Analysis Laboratory (http://dial.mc.duke.edu). ]
Fig. 5. (a) input image; (b) segmentation result with c1 = 0 and c2 = 1; (c) primal and dual energies (with respect to the iteration number) of the proposed method (solid lines) and the relaxation approach [12] (dashed lines)
(Eq. 2) with α = 20. The only difference in the settings between the Riemannian and the Finsler approach is the use of the direction dependent weighting φ_x as induced by the shape depicted in Figure 2(c). Note the "eroded" bone segmentation result returned by standard geodesic active contours in Figure 4(b).
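The regional bias used here is the log-likelihood ratio of the two joint intensity densities. A minimal sketch of this step is given below; the paper estimates the densities from a non-parametric joint histogram, whereas the sketch substitutes SciPy's Gaussian kernel density estimator and a hypothetical bone_mask from the user-guided training case. For brevity it also fits and evaluates the densities on the same image, whereas in the paper the densities come from a separately labeled case, so this illustrates the construction rather than the exact estimator.

    import numpy as np
    from scipy.stats import gaussian_kde

    def regional_bias(t1, t2, bone_mask):
        """rho = rho_F - rho_B with rho_F = -log p(I_T1, I_T2 | bone) and
        rho_B = -log p(I_T1, I_T2 | background)."""
        samples = np.vstack([t1.ravel(), t2.ravel()])          # 2 x N joint intensity samples
        fg = samples[:, bone_mask.ravel() > 0]                 # intensities inside labeled bone
        bg = samples[:, bone_mask.ravel() == 0]                # everything else
        p_fg = gaussian_kde(fg)(samples)                       # p(I_T1, I_T2 | bone)
        p_bg = gaussian_kde(bg)(samples)                       # p(I_T1, I_T2 | background)
        rho = -np.log(p_fg + 1e-12) + np.log(p_bg + 1e-12)     # negative where bone is favored
        return rho.reshape(t1.shape)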
4.3 Run-Time Performance for the Chan-Vese Model
The proposed procedure (Eq. 20) is in practice also more efficient than the convex relaxation approach of Eq. 8. We compared the run-time performance of the primal-dual scheme Eq. 20 with the performance of the relaxation approach Eq. 8 proposed in [12] for the standard Chan-Vese model (Eq. 6). Figure 5 displays the 256 × 256 input image, the segmentation result using black and white for c_1 and c_2, respectively, and the corresponding primal and dual energies. The data weight λ is set to 4. The run-time for one iteration is very similar in both methods, hence we display the evolution of primal and dual energies with respect to the iteration number. With GPU acceleration, real-time performance can be obtained even for the iterated approach that successively updates the means c_1 and c_2.
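The iterated variant just mentioned alternates the convex segmentation with a re-estimation of the two region means. The following sketch shows one way to organize this loop; the segment routine, the initialization of the means, and the number of outer iterations are assumptions carried over from the earlier primal-dual sketch, not the authors' implementation.

    import numpy as np

    def iterated_chan_vese(f, segment, lam=4.0, outer_iters=5):
        """Alternate between solving the convex problem for fixed means
        and updating c1, c2 from the current binary labeling (Eq. 6)."""
        c1, c2 = f[f < f.mean()].mean(), f[f >= f.mean()].mean()    # crude initialization
        w = np.ones_like(f, dtype=np.float64)                       # Chan-Vese uses unweighted TV
        u = np.zeros(f.shape, dtype=np.uint8)
        for _ in range(outer_iters):
            rho = lam * ((f - c1) ** 2 - (f - c2) ** 2)             # regional bias for fixed means
            u = segment(rho, w)                                     # binary labeling (Eq. 20 scheme)
            if np.any(u == 1):
                c1 = f[u == 1].mean()                               # mean inside the foreground
            if np.any(u == 0):
                c2 = f[u == 0].mean()                               # mean in the background
        return u, c1, c2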
5 Conclusion
We developed a continuous and convex formulation for binary segmentation tasks incorporating a regional term and a position and orientation dependent prior for the segmentation boundary represented by Finsler metrics. Finsler active contours provide an alternative approach to incorporating image-based priors on the location and orientation of the segmentation boundary. The continuous relaxation yields an efficient solution method highly suitable for data-parallel implementations. Nevertheless, globally optimal binary segmentation results are obtained in the continuous framework. Future work will address extending the class of energies that can be optimized in the convex and continuous framework. For instance, the continuous formulation for length ratio minimization given in [18] can be easily extended to Finsler metrics.
References

1. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. IJCV 22(1), 61–79 (1997)
2. Caselles, V., Kimmel, R., Sapiro, G., Sbert, C.: Minimal surfaces based object segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 19, 394–398 (1997)
3. Boykov, Y., Kolmogorov, V.: Computing geodesics and minimal surfaces via graph cuts. In: Proc. ICCV, pp. 26–33 (2003)
4. Appleton, B., Talbot, H.: Globally minimal surfaces by continuous maximal flows. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 106–118 (2006)
5. Vasilevskiy, A., Siddiqi, K.: Flux maximizing geometric flows. IEEE Trans. Pattern Anal. Mach. Intell. 24(12), 1565–1578 (2002)
6. Kimmel, R., Bruckstein, A.: Regularized Laplacian zero crossings as optimal edge integrators. IJCV 53(3), 225–243 (2003)
7. Pichon, E., Westin, C.-F., Tannenbaum, A.R.: A Hamilton-Jacobi-Bellman approach to high angular resolution diffusion tractography. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3749, pp. 180–187. Springer, Heidelberg (2005)
8. Melonakos, J., Pichon, E., Angenent, S., Tannenbaum, A.: Finsler active contours. IEEE Trans. Pattern Anal. Mach. Intell. 30(3), 412–423 (2008)
9. Kolmogorov, V., Boykov, Y.: What metrics can be approximated by geo-cuts, or global optimization of length/area and flux. In: Proc. ICCV, pp. 564–571 (2005)
10. Chan, T.F., Vese, L.: Active contours without edges. IEEE Trans. Image Processing 10(2), 266–277 (2001)
11. Chan, T.F., Esedoglu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics 66(5), 1632–1648 (2006)
12. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J., Osher, S.: Fast global minimization of the active contour/snake model. J. Math. Imag. Vision (2007)
13. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imag. Vision 20(1–2), 89–97 (2004)
14. Osher, S., Esedoglu, S.: Decomposition of images by the anisotropic Rudin-Osher-Fatemi model. Comm. Pure Appl. Math. 57, 1609–1626 (2004)
15. Chan, T.F., Esedoglu, S.: Aspects of total variation regularized L1 function approximation. SIAM Journal on Applied Mathematics 65(5), 1817–1837 (2004)
16. Borwein, J., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, Heidelberg (2000)
17. Sled, J.G., Zijdenbos, A.P., Evans, A.C.: A non-parametric method for automatic correction of intensity non-uniformity in MRI data. IEEE Transactions on Medical Imaging 17, 87–97 (1998)
18. Kolev, K., Cremers, D.: Continuous ratio optimization via convex relaxation with applications to multiview 3D reconstruction. In: Proc. CVPR (2009)
Author Index
Akgul, Yusuf Sinan 312, 322 Albrecht, Thomas 232 Ali, Karim 151 Amberg, Brian 232 Andres, Björn 502 Aydin, Tarkan 322 Bachmann, Alexander 512 Badino, Hernán 51 Bähnisch, Christian 111 Baratoff, Gregory 91 Barth, Alexander 262 Bauckhage, Christian 272 Baumeister, Ralf 542 Becker, Christian 402 Beder, Christian 292 Bendicks, Christian 392 Berger, Marie-Odile 1 Bonea, Andreea 502 Breidt, Martin 41 Brenner, Claus 61, 532 Breuel, Thomas 492 Breuß, Michael 191 Brox, Thomas 21 Bruhn, Andrés 452 Bülthoff, Heinrich 41 Burkhardt, Hans 131, 141, 542 Coleman, Sonya Cremers, Daniel Curio, Cristóbal
282 31, 171, 342, 432 41
Denzler, Joachim 91, 161, 252, 352, 442 Deselaers, Thomas 201 Dießelberg, Lars 522 Dragon, Ralf 402
Gardiner, Bryan 282 Gass, Tobias 201 Gavrila, Dariu M. 71, 81, 101 Gill, Gurman 211 Gillich, Eugen 422 Glock, Stefan 422 Goldlücke, Bastian 342 Hamprecht, Fred A. 502 Hofmann, Michael 71 Hörnlein, Thomas 121 İmre, Evren Jähne, Bernd
1 121
Keller, Christoph Gustav Kim, Yoo-Jin 492 Klette, Reinhard 472 Klose, Uwe 412 Koch, Reinhard 332 Kolev, Kalin 171 Köthe, Ullrich 111, 502 Kühmstedt, Peter 352 Kumar, Vinoid 412
81
Lampert, Christoph H. 221 Lerch, Anita 232 Levine, Martin 211 Lindner, Albrecht 151 Llorca, David Fernández 81 Lohweg, Volker 422 Lüthi, Marcel 232
Ehricke, Hans-Heino 412 Enzweiler, Markus 101 Esquivel, Sandro 332
Magnor, Marcus 382 Meidow, Jochen 292 Michaelis, Bernd 392 Mitzel, Dennis 432 Munder, Stefan 91 Munkelt, Christoph 161, 352
Feldmann, Tobias 522 Förstner, Wolfgang 262, 292 Franke, Uwe 51, 262 Fua, Pascal 151
Nadler, Boaz 502 Nägele, Josef 442 Nanopoulos, Alexandros Neumann, Heiko 11
302
Ney, Hermann 201 Niethammer, Marc 552 Notni, Gunther 352 Ostermann, Jörn 402 Oswald, Martin R. 171 Otto, Kay M. 412 Paysan, Pascal 232 Peters, Jan 221 Pfeiffer, David 51 Platzer, Esther-Sabrina Pock, Thomas 432
442
Rajagopalan, Ambasamudram N. 181, 362 Raman, Sudhir 242 Ramnath, Krishnamurthy 181 Rapus, Martin 91 Raudies, Florian 11 Rehse, Heino 332 Reisert, Marco 131, 141 Ripperda, Nora 532 Rodner, Erik 252 Rohrbach, Marcus 101 Ronneberger, Olaf 141, 542 Rosenhahn, Bodo 21, 402 Roth, Volker 242 Sahay, Rajiv Ranjan 362 Santini, Francesco 232 Schaede, Johannes 422 Schick, Alexander 372 Schmaltz, Christian 21, 452 Schmalz, Christoph 462 Schmidt, Frank R. 31 Schnörr, Christoph 482
Schoenemann, Thomas 432 Schölkopf, Bernhard 41 Schulze, Ekkehard 542 Schütt, Ole 382 Scotney, Bryan 282 Sellent, Anita 382 Shan, Liang 552 Siegemund, Jan 262 Skibbe, Henrik 141 Stelldinger, Peer 111 Stiefelhagen, Rainer 372 Strecha, Christoph 151 Tarlet, Dominique 392 Thévenin, Dominique 392 Thurau, Christian 272 Töppe, Eno 171 Trummer, Michael 161, 352 Valgaerts, Levi 191 Vaudrey, Tobi 472 Vetter, Thomas 232 Vlasenko, Andrey 482 Vogel, Oliver 191 Walder, Christian 41 Wang, Qing 542 Wehking, Karl-Heinz 442 Weickert, Joachim 21, 191, 452 Wenger, Stephan 382 Wirjadi, Oliver 492 Wörner, Annika 522 Wunderlich, Bernd 392 Yildiz, Alparslan Zach, Christopher
312 552